As the number of processors in modern high-performance computing (HPC) systems grows, two problems become pressing: scalability and fault tolerance. In our previous work, we proposed an MPI operation-level checkpoint/rollback system. The system has two main benefits. First, it makes it possible to employ in-memory (diskless) checkpoint/rollback techniques, which have demonstrated much better performance than their on-disk counterparts. Second, it enables a concurrent two-level recover-and-continue MPI system, which has been proven to be highly efficient. To the best of our knowledge, this is the first concurrent two-level checkpoint/recovery system in use. With the arrival of the multi-core era, multi-threading techniques can be used to improve the performance of the in-memory checkpointing algorithm. In this paper, we present two versions of the MPI operation-level checkpoint/rollback system, one single-threaded and one multi-threaded. We also provide an in-depth performance comparison of the two approaches to illustrate the benefits of multi-threading on multi-core platforms. As our work progresses, a picture of the hierarchy of future-generation fault-tolerant HPC systems gradually emerges.
Date of Conference: 25-27 Sept. 2008