Design, implementation and performance of fault-tolerant message passing interface (MPI)

4 Author(s)
Selvakumar, A.D. ; Real-Time Syst. Group, Centre for Dev. of Adv. Comput., Bangalore, India ; Sobha, P.M. ; Ravindra, G.C. ; Pitchiah, R.

Fault-tolerant MPI (FTMPI) adds fault tolerance to MPICH, an open-source, GPL-licensed implementation of the MPI standard from Argonne National Laboratory's Mathematics and Computer Science Division. FTMPI is a transparent fault-tolerant environment based on a synchronous checkpointing and restart mechanism. It relies on a non-multithreaded, single-process checkpointing library to synchronously checkpoint each application process. A global, replicated system controller and per-node node controllers monitor and control the checkpointing and recovery activities of all MPI applications within the cluster. This work details the architecture that provides fault tolerance for MPI-based applications running on clusters, and reports the performance of the NAS Parallel Benchmarks and of the parallelized medium-range weather forecasting models P-T80 and P-T126. The architecture also addresses the following issues: replicating the system controller to avoid a single point of failure, ensuring consistency of checkpoint files through a distributed two-phase commit protocol, and providing a robust fault detection hierarchy.
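
The abstract does not describe how the two-phase commit over checkpoint files is carried out, so the following is only a minimal single-machine sketch of the general idea, not FTMPI's implementation. The names `NodeController`, `checkpoint_round`, and `fail_on_prepare`, and the file naming scheme, are hypothetical. It assumes each node first writes a tentative checkpoint and votes, and that the coordinator replaces the previous checkpoints only if every node voted yes.

```python
import os
import tempfile


class NodeController:
    """Hypothetical per-node participant that owns one checkpoint file."""

    def __init__(self, directory, rank, fail_on_prepare=False):
        self.final_path = os.path.join(directory, f"ckpt_rank{rank}.bin")
        self.temp_path = self.final_path + ".tmp"
        self.fail_on_prepare = fail_on_prepare  # simulates a fault in phase 1

    def prepare(self, state: bytes) -> bool:
        """Phase 1: write a tentative checkpoint and vote yes or no."""
        if self.fail_on_prepare:
            return False
        with open(self.temp_path, "wb") as f:
            f.write(state)
            f.flush()
            os.fsync(f.fileno())  # make sure the data reached the disk
        return True

    def commit(self):
        """Phase 2 (commit): atomically replace the previous checkpoint."""
        os.replace(self.temp_path, self.final_path)

    def abort(self):
        """Phase 2 (abort): drop the tentative file, keep the old checkpoint."""
        if os.path.exists(self.temp_path):
            os.remove(self.temp_path)


def checkpoint_round(nodes, states):
    """Coordinator: commit everywhere only if every node voted yes."""
    votes = [node.prepare(state) for node, state in zip(nodes, states)]
    if all(votes):
        for node in nodes:
            node.commit()
        return True
    for node in nodes:
        node.abort()
    return False
```

In this sketch a failed round leaves every node with its previous checkpoint, so the set of checkpoint files on disk always belongs to a single round. A real deployment would also have to handle coordinator failure and lost messages, which this example leaves out.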

Published in:

High Performance Computing and Grid in Asia Pacific Region, 2004. Proceedings. Seventh International Conference on

Date of Conference:

20-22 July 2004