Today, high performance computing (HPC) systems are larger than ever, and some consist of thousands or even tens of thousands of processors. This larger scale raises the challenge of how to deal with process failures. The most important programming tool for HPC is MPI (Message Passing Interface). Several existing systems, such as MPICH-V, StarFish, and MPI/FT, provide fault tolerance within the MPI context. Most of them write their checkpoints to disk. In this paper, erasure codes that are usually used in RAID systems are applied to in-memory fault tolerance. On the Fault Tolerant MPI (FT-MPI) platform, RAID4, RAID5, RDP, and X-code are implemented to perform checkpointing in memory. The experimental results show that RDP is feasible for in-memory double-fault tolerance.
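The idea of protecting in-memory checkpoints with a RAID-style erasure code can be illustrated with a minimal sketch. It is not the paper's implementation and uses no MPI or FT-MPI calls: the processes are simulated as equal-sized byte buffers, and the helper names `xor_bytes`, `make_parity`, and `recover` are hypothetical. It shows only the single-parity (RAID4-like) case, in which one lost buffer is rebuilt from the survivors and the parity. Double-fault codes such as RDP and X-code need a second, independent parity and are not shown.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length buffers."""
    if len(a) != len(b):
        raise ValueError("buffers must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(checkpoints: list[bytes]) -> bytes:
    """RAID4-style parity: the XOR of all checkpoint buffers."""
    return reduce(xor_bytes, checkpoints)


def recover(survivors: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing buffer from the survivors and the parity."""
    return reduce(xor_bytes, survivors, parity)


# Hypothetical in-memory state of four processes (equal-sized buffers).
checkpoints = [bytes([rank] * 8) for rank in range(1, 5)]

# Assumption: the parity is held in the memory of a dedicated extra process.
parity = make_parity(checkpoints)

# Simulate the failure of the process at index 2 and rebuild its state.
failed = 2
survivors = [c for i, c in enumerate(checkpoints) if i != failed]
restored = recover(survivors, parity)
```

Because XOR is its own inverse, combining the parity with every surviving buffer cancels their contributions and leaves exactly the lost buffer. In a RAID5-like layout the parity blocks would be rotated across the processes instead of being held by one of them.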