Northeastern University, Khoury College of Computer Sciences, United States of America
The share of the top 500 supercomputers with Nvidia GPUs is now over 25% and continues to grow. Although fault tolerance is a critical issue for supercomputing, no efficient, scalable solution currently exists for CUDA applications on Nvidia GPUs. CRAC is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines low runtime overhead (less than 1%); fast checkpoint-restart; support for scalable CUDA streams (for efficient use of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating, within a single process's memory, the application code (checkpointed) from its external GPU communication via non-reentrant CUDA libraries (not checkpointed). Keeping both in one process eliminates the high inter-process communication (IPC) overhead of earlier approaches.