Extending a cluster SSI OS for transparently checkpointing message-passing parallel applications

2 Author(s)
M. Fertre ; IRISA, Rennes, France ; C. Morin

Clusters are now widely used to execute scientific applications. These applications are often message-passing parallel applications with long execution times. As the number of nodes in a cluster grows, faults become more frequent, so an application's execution time may exceed the cluster's mean time between failures (MTBF). To avoid restarting an application from the beginning, it is desirable that cluster systems provide fault-tolerance mechanisms such as checkpoint/restart. One efficient way to implement this mechanism is to build it directly into the application or the communication library. Another approach is to implement it in an operating system dedicated to clusters. This is more complex, but it makes it possible to checkpoint and restart any message-passing application, whatever communication library it uses. This paper presents basic mechanisms for system-initiated checkpointing of message-passing parallel applications running on a cluster. It also presents performance results obtained from a prototype implemented in KERRIGHED, a Single System Image (SSI) cluster operating system based on LINUX.

Published in:

8th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN'05)

Date of Conference:

7-9 Dec. 2005