Northeastern University, Khoury College of Computer Sciences, United States of America
Checkpoint/restart (C/R) is a critical component of fault-tolerant computing and gives computing centers the scheduling flexibility to support diverse workloads with different priorities. Because existing C/R tools are often research-oriented, there is a gap to close before they can be used reliably with production workloads, especially on cutting-edge HPC systems. In this talk, we present our strategy to enable C/R capabilities on NERSC production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. We share our journey to prepare a production-ready MPI-Agnostic Network-Agnostic (MANA) Distributed Multi-Threaded CheckPointing (DMTCP) tool for NERSC. We also present the variable-time job scripts that automate the submission of preempted jobs, the queue policies and configurations we have adopted to incentivize C/R usage, our user training effort to increase NERSC users' uptake of C/R, and our effort to build an active C/R community. Finally, we showcase some applications enabled by C/R.