Understanding and alleviating I/O bottlenecks in HPC system workloads is difficult due to the complex, multi-layered nature of HPC I/O subsystems. Even with full visibility into the jobs executed on the system, the lack of tooling makes I/O problems hard to debug. In this work, we introduce Gauge, an interactive, data-driven, web-based visualization tool for HPC I/O performance analysis.
Gauge helps users interactively visualize and analyze large sets of HPC application execution logs. It performs a number of functions meant to significantly reduce the cognitive load of navigating these sets, some of which span many years of HPC logs. For instance, as the first step in many processing chains, it arranges unordered sets of collected HPC logs into a hierarchy of clusters for later analysis.
This clustering step allows application developers to quickly navigate logs, see how their jobs compare to those of their peer groups in terms of I/O utilization, and learn how to improve their future runs. Similarly, facility operators can use Gauge to `get a pulse' on the workloads running on their HPC systems, find clusters of underperforming applications, and diagnose the reason for poor I/O throughput. In this work, we describe how Gauge arrives at the HPC job clustering, how it presents data about the jobs, and how it can be used to further narrow down and understand the behavior of sets of jobs. We also provide a case study on using Gauge from the perspective of a facility operator.