Many modern HPC applications make poor use of the limited I/O bandwidth available to them. Understanding the I/O subsystem is a critical first step toward better utilizing an HPC system. While expert insight is indispensable, I/O experts are in short supply. We seek to automate this effort by developing and interpreting models of I/O throughput. Such interpretations may be useful both to application developers, who can use them to improve their codes, and to facility operators, who can use them to identify larger problems in an HPC system.
Applying machine learning (ML) to HPC system analysis has shown promise. Applying ML methods directly to I/O throughput prediction, however, often yields brittle models that extrapolate poorly. In this work, we set out to understand why common methods underperform in this problem domain and how to build models that generalize better to unseen data. We show that commonly used cross-validation yields training and test sets that are too similar, which prevents overfitting from being detected. We propose a method for generating test sets that encourages separation between training and test data. Next, we explore the limits of I/O throughput prediction and show that I/O contention noise can be estimated by observing repeated runs of an application. Finally, we show that our new test sets allow us to better discriminate between ML model architectures in terms of how well they generalize.
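As a rough illustration of the two ideas above, the following sketch holds out whole applications so that no application appears in both the training and test sets, and estimates a noise floor from the spread of repeated runs. It is a minimal sketch under stated assumptions, not the method proposed in this work: the job record fields (`app`, `config`, `throughput`), the choice to group by application, and the function names `group_split` and `noise_floor` are all hypothetical.

```python
import math
import random
from collections import defaultdict
from statistics import mean, pstdev


def group_split(jobs, test_fraction=0.25, seed=0):
    """Hold out whole applications (assumed grouping key) for testing."""
    apps = sorted({job["app"] for job in jobs})
    rng = random.Random(seed)
    rng.shuffle(apps)
    n_test = max(1, round(len(apps) * test_fraction))
    test_apps = set(apps[:n_test])
    train = [j for j in jobs if j["app"] not in test_apps]
    test = [j for j in jobs if j["app"] in test_apps]
    return train, test


def noise_floor(jobs):
    """Mean within-group std of log10 throughput over repeated runs.

    Assumes runs sharing (app, config) differ only by system noise
    such as I/O contention. Returns None if no group has repeats.
    """
    groups = defaultdict(list)
    for job in jobs:
        key = (job["app"], job["config"])
        groups[key].append(math.log10(job["throughput"]))
    spreads = [pstdev(v) for v in groups.values() if len(v) >= 2]
    return mean(spreads) if spreads else None


# Hypothetical job records, for illustration only.
jobs = [
    {"app": "a", "config": 1, "throughput": 100.0},
    {"app": "a", "config": 1, "throughput": 1000.0},
    {"app": "b", "config": 1, "throughput": 50.0},
    {"app": "c", "config": 2, "throughput": 75.0},
    {"app": "d", "config": 3, "throughput": 20.0},
]
train, test = group_split(jobs)
floor = noise_floor(jobs)
```

Under these assumptions, a model whose test error on the held-out applications approaches `floor` has little room left to improve, since the remaining error is comparable to the run-to-run variation of identical jobs.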