Recent trends in high-performance computing point toward increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. For long-running applications, the ability to recover efficiently from frequent failures is therefore essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. Results from a computational chemistry application running at scale show that our techniques provide a high degree of fault tolerance with low overhead (2%-4%) on 2048 processors.
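
As a rough illustration of the idea of shadow data structures with redundant remote accesses, the following toy sketch simulates a block-distributed array in which every write is sent to two different owners, so that a read can fall back to the surviving copy after a failure. It is not the mechanism evaluated in the paper: the names `Node`, `ShadowedArray`, `put`, `get`, and `fail` are hypothetical, the "next rank" shadow placement is an assumption made for illustration, and real remote memory accesses are replaced by in-process dictionary updates.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Simulated memory of one process (hypothetical stand-in for a remote rank)."""
    alive: bool = True
    blocks: dict = field(default_factory=dict)


class ShadowedArray:
    """Toy distributed array in which each block has a primary and a shadow owner."""

    def __init__(self, num_nodes):
        if num_nodes < 2:
            raise ValueError("need at least two nodes to hold a shadow copy")
        self.nodes = [Node() for _ in range(num_nodes)]

    def owners(self, block_id):
        # Assumed placement policy: the shadow lives on the next rank.
        primary = block_id % len(self.nodes)
        shadow = (primary + 1) % len(self.nodes)
        return primary, shadow

    def put(self, block_id, data):
        # Redundant write: the same payload is sent to both owners.
        for rank in self.owners(block_id):
            node = self.nodes[rank]
            if node.alive:
                node.blocks[block_id] = list(data)

    def get(self, block_id):
        # Read from the primary if it is alive, otherwise from the shadow.
        for rank in self.owners(block_id):
            node = self.nodes[rank]
            if node.alive and block_id in node.blocks:
                return list(node.blocks[block_id])
        raise RuntimeError(f"block {block_id} lost on both owners")

    def fail(self, rank):
        # Simulate a node failure: its memory contents are gone.
        self.nodes[rank].alive = False
        self.nodes[rank].blocks.clear()


arr = ShadowedArray(num_nodes=4)
arr.put(0, [1.0, 2.0, 3.0])
arr.fail(0)                  # the primary owner of block 0 fails
recovered = arr.get(0)       # served from the shadow copy on rank 1
```

In this sketch recovery needs no rollback to a checkpoint on disk, because the data is already resident in a second node's memory; the price is one extra write per update, and a block is still lost if both of its owners fail.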