Resilience is a critical issue for large-scale platforms. This tutorial surveys fault-tolerance techniques for high-performance and big-data applications, balancing theory and practice.
Outline: overview of failure types and typical probability distributions; general-purpose techniques: checkpoint and rollback recovery protocols, replication, prediction, and silent error detection; application-specific techniques: user-level in-memory checkpointing, data replication (map-reduce), and fixed-point convergence for iterative applications (back-propagation); practical deployment of fault-tolerance techniques with User Level Failure Mitigation (ULFM), a proposed extension to the MPI standard.
Examples: Monte-Carlo methods; SPMD stencil; map-reduce; back-propagation in neural networks.
In a hands-on session, a step-by-step approach shows how to protect these routines against failures. The tutorial is open to all SC20 attendees who are interested in the current status and future promise of fault-tolerant approaches for scientific and big-data applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models.