Although fault tolerance for MPI applications has been studied extensively, process migration has not gained widespread use, because of the complexity of making the new location of a migrated process known to every other process in the MPI application. In this paper, we present a novel and effective process migration method for MPI applications. We implement a prototype called LAM/Migration, based on LAM/MPI + BLCR, that provides transparent process migration for MPI applications; the migration mechanism is built into LAM/MPI. With our method, all processes in an MPI application, including mpirun and the MPI processes, can be migrated to any different set of spare nodes in the cluster, as specified by the user, in case of node failure. Performance evaluation results show that the checkpoint overhead is similar to that of plain LAM/MPI + BLCR, and that the migration method is feasible and promising for overcoming node failures in large-scale parallel computing. Using LAM/Migration, high availability and reliability of parallel computation can be achieved.