High-performance computing plays an important role in scientific and engineering research. As high-performance systems continue to grow in size, the mean time between failures keeps shrinking, so fault tolerance becomes a critical property for parallel applications running on these systems. MPI (Message Passing Interface) is currently the most widely used paradigm for writing parallel applications. However, in traditional implementations, when a failure occurs, the whole distributed application is shut down and restarted. Many solutions have been proposed to avoid this, the most widely used being rollback recovery. Rollback recovery is based on the concept of a checkpoint: a description of the state of one or more components of the system at a given point in its execution.
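
As a rough illustration of the checkpoint and rollback idea, the following single-process Python sketch saves a component's state periodically and restores the last saved state after a simulated failure. It is not an MPI implementation and is not taken from the work summarized here; the names `Worker`, `take_checkpoint`, `rollback`, and `run_with_recovery`, as well as the checkpoint interval and failure point, are assumptions chosen for the example.

```python
import pickle


class Worker:
    """Hypothetical component whose state is a step counter and a running total."""

    def __init__(self):
        self.step = 0
        self.total = 0

    def advance(self):
        # One unit of computation: move to the next step and accumulate it.
        self.step += 1
        self.total += self.step


def take_checkpoint(worker):
    # Serialize the component's state at this point in its execution.
    return pickle.dumps(worker.__dict__)


def rollback(worker, checkpoint):
    # Restore the state saved in the checkpoint, discarding later progress.
    worker.__dict__.update(pickle.loads(checkpoint))


def run_with_recovery(total_steps=10, checkpoint_every=3, fail_at=8):
    """Run to completion, injecting one simulated failure (an assumption)."""
    worker = Worker()
    checkpoint = take_checkpoint(worker)  # initial checkpoint of the empty state
    failed_once = False
    while worker.step < total_steps:
        try:
            if worker.step + 1 == fail_at and not failed_once:
                failed_once = True
                raise RuntimeError("simulated failure")
            worker.advance()
            if worker.step % checkpoint_every == 0:
                checkpoint = take_checkpoint(worker)
        except RuntimeError:
            # Resume from the last checkpoint instead of restarting from step 0.
            rollback(worker, checkpoint)
    return worker.step, worker.total
```

In this sketch the failure costs only the work done since the last checkpoint, rather than the whole run. In a real distributed setting, checkpoints of different processes must also be mutually consistent, which this toy example does not address.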