The Design and Implementation of Checkpoint/Restart Process Fault Tolerance for Open MPI
Hursey, J.; Squyres, J.M.; Mattox, T.I.; Lumsdaine, A.
Open Syst. Lab., Indiana Univ., Bloomington, IN

This paper appears in: Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International
Publication Date: 26-30 March 2007
On page(s): 1-8
Location: Long Beach, CA
ISBN: 1-4244-0910-1
INSPEC Accession Number: 9516493
Digital Object Identifier: 10.1109/IPDPS.2007.370605
Current Version Published: 2007-06-11

Abstract
To fully exploit ever-larger computing platforms, modern HPC applications and system software must be able to tolerate inevitable faults. Historically, MPI implementations that incorporated fault tolerance capabilities have been limited by a lack of modularity, scalability, and usability. This paper presents the design and implementation of an infrastructure to support checkpoint/restart fault tolerance in the Open MPI project. We identify the general capabilities required for distributed checkpoint/restart and realize these capabilities as extensible frameworks within Open MPI's modular component architecture. Our design features an abstract interface for providing and accessing fault tolerance services without sacrificing performance, robustness, or flexibility. Although our implementation includes support for some initial checkpoint/restart mechanisms, the framework is meant to be extensible and to encourage experimentation with alternative techniques within a production-quality MPI implementation.
