Clemson University, Clemson, United States of America
High-performance computing (HPC) applications are central to advancement in many fields of science and engineering, and that advancement depends on the assumed reliability of the HPC system. However, as system sizes grow and hardware components are run at near-threshold voltages, transient upset events become more likely. Many works have explored the detection of silent data corruption, but recovery is often left to checkpoint-restart or application-specific techniques. This poster explores the use of spatial similarity to recover from silent data corruption. We evaluate eight reconstruction methods and find that Linear Regression yields the best results, with over 90% of its corrections having less than 1% relative error.
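To make the idea of recovery through spatial similarity concrete, the following is a minimal sketch of a linear-regression reconstruction on a one-dimensional array. It is an illustration under stated assumptions, not the poster's implementation: the function name `reconstruct_linear`, the `window` parameter, the choice of a one-dimensional neighborhood, and the synthetic sine field are all hypothetical. The corrupted element is assumed to have already been located by a separate detector.

```python
import numpy as np


def reconstruct_linear(data, idx, window=4):
    """Estimate data[idx] from a least-squares line fit to nearby elements.

    Hypothetical helper: the neighborhood shape and size are assumptions,
    not details taken from the poster.
    """
    values = np.asarray(data, dtype=float)
    lo = max(0, idx - window)
    hi = min(len(values), idx + window + 1)
    # Positions of the neighbors, excluding the corrupted element itself.
    positions = np.array([i for i in range(lo, hi) if i != idx])
    # Fit value = slope * position + intercept over the neighbors.
    slope, intercept = np.polyfit(positions.astype(float), values[positions], deg=1)
    # Evaluate the fitted line at the corrupted position.
    return slope * idx + intercept


# Synthetic, spatially smooth field (an assumption for illustration only).
field = np.sin(np.linspace(0.0, 1.0, 101))
true_value = field[50]

# Simulate a silent data corruption by overwriting one element.
field[50] = 1.0e30

estimate = reconstruct_linear(field, 50)
rel_error = abs(estimate - true_value) / abs(true_value)
print(f"relative error of the reconstruction: {rel_error:.2e}")
```

The corrupted value is excluded from the fit so that it cannot bias the estimate. On a smooth field like this one, the neighbors alone carry enough information to bring the reconstruction within 1% relative error. Fields with sharp local variation would likely need a different neighborhood or method.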