By Topic

Fault-tolerant parallel applications with dynamic parallel schedules

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Gerlach, S. ; Sch. of Comput. & Commun. Sci., Ecole Polytech. Fed. de Lausanne, Switzerland ; Hersch, R.D.

Commodity computer clusters are often composed of hundreds of computing nodes. These generally off-the-shelf systems are not designed for high reliability. Node failures therefore drive the MTBF of such clusters to unacceptable levels. The software frameworks used for running parallel applications need to be fault-tolerant in order to ensure continued execution despite node failures. We propose an extension to the flow graph based Dynamic Parallel Schedules (DPS) development framework that allows non-trivial parallel applications to pursue their execution despite node failures. The proposed fault-tolerance mechanism relies on a set of backup threads located in the volatile storage of alternate nodes. These backup threads are kept up to date by duplication of the transmitted data objects and periodical checkpointing of thread states. In case of a failure, the current state of the threads that were on the failed node is reconstructed on the backup threads by re-executing operations. The corresponding valid re-execution order is automatically deduced from the data flow graph of the DPS application. Multiple simultaneous failures can be tolerated, provided that for each thread either the active thread or its corresponding backup thread survives. For threads that do not store a local state, an optimized mechanism eliminates the need for duplicate data object transmissions. The overhead induced by the fault tolerance mechanism consists mainly of duplicate data object transmissions that can, for compute bound applications, be carried out in parallel with ongoing computations. The increase in execution time due to fault tolerance therefore remains relatively low. It depends on the communication to computation ratio and on the parallel programs efficiency.

Published in:

Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International

Date of Conference:

4-8 April 2005