Skip to Main Content
Extreme scale systems available before the end of this decade are expected to have 100 million to 1 billion CPU cores. The probability that a failure occurs during an application execution is expected to be much higher than today's systems. Counteracting this higher failure rate may require a combination of disk-based checkpointing, diskless checkpointing, and algorithmic fault tolerance. Diskless checkpointing is an efficient technique to tolerate a small number of process failures in large parallel and distributed systems. In the literature, a simultaneous failure of no more than N processes is often tolerated by using a one-level Reed-Solomon checkpointing scheme for N simultaneous process failures, whose overhead often increases quickly as N increases. We introduce an N-level diskless checkpointing scheme that reduces the overhead for tolerating a simultaneous failure of up to N processes. Each level is a diskless checkpointing scheme for a simultaneous failure of i processes, where i = 1, 2,..., N. Simulation results indicate the proposed N-level diskless checkpointing scheme achieves lower fault tolerance overhead than the one-level Reed-Solomon checkpointing scheme for N simultaneous processor failures.