Adaptive independent checkpointing for reducing rollback propagation

Author(s):

J. Xu (Dept. of Comput. Sci., Brown Univ., Providence, RI, USA); R. H. D. Netzer

Independent checkpointing is a simple technique for providing fault tolerance in distributed systems. However, it can suffer from the domino effect, in which the rollback of one process propagates to others. In this paper, we present an adaptive checkpointing algorithm that practically eliminates rollback propagation for independent checkpointing. Our algorithm is based on proofs of the necessary and sufficient conditions for a checkpoint to belong to some consistent global checkpoint, previously an open question. We characterize these conditions with a generalization of Lamport's happened-before relation called a zigzag path. Our algorithm tracks zigzag paths on-line and checkpoints when certain paths are detected. Experiments on an iPSC/860 hypercube show that our algorithm reduces the average rollback required to recover from any fault to less than one checkpoint interval per process, and checkpoints only 4% more often than traditional periodic checkpointing algorithms. We thus eliminate rollback propagation without the runtime overhead of coordinated checkpoints or other schemes that attempt to reduce rollback propagation.
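
The abstract does not spell out the zigzag-path definition, so the following is only a minimal offline sketch based on the commonly cited form of it: a chain of messages in which each message is sent by the receiver of the previous one in the same or a later checkpoint interval, possibly before the previous message was received. A checkpoint that lies on a zigzag path back to itself cannot belong to any consistent global checkpoint. The `Message` record, the interval numbering (interval k of a process starts at its checkpoint k, with checkpoint 0 being the initial state), and the function name `on_z_cycle` are assumptions made for illustration. This is not the paper's on-line algorithm.

```python
from typing import NamedTuple, Sequence


class Message(NamedTuple):
    """Hypothetical trace record: who sent/received and in which interval."""
    sender: int
    send_interval: int   # sent after the sender's checkpoint with this index
    receiver: int
    recv_interval: int   # received after the receiver's checkpoint with this index


def on_z_cycle(messages: Sequence[Message], proc: int, ckpt: int) -> bool:
    """Return True if checkpoint `ckpt` of `proc` has a zigzag path to itself."""
    # A zigzag path must start with a message sent by `proc` after the checkpoint.
    frontier = [m for m in messages
                if m.sender == proc and m.send_interval >= ckpt]
    seen = set(frontier)
    while frontier:
        m = frontier.pop()
        # The path closes if its last message is received by `proc`
        # before the checkpoint was taken.
        if m.receiver == proc and m.recv_interval < ckpt:
            return True
        # Extend with any message sent by m's receiver in the same or a later
        # interval; it may have been sent before m arrived (the "zigzag").
        for nxt in messages:
            if (nxt not in seen
                    and nxt.sender == m.receiver
                    and nxt.send_interval >= m.recv_interval):
                seen.add(nxt)
                frontier.append(nxt)
    return False


# P0 sends m2, P1 receives it and then takes checkpoint 1, P1 sends m1,
# and P0 receives m1 without having checkpointed in between.
useless = [Message(1, 1, 0, 0), Message(0, 0, 1, 0)]
print(on_z_cycle(useless, proc=1, ckpt=1))   # True

# Same trace, but P0 takes a checkpoint after sending m2 and before receiving m1.
repaired = [Message(1, 1, 0, 1), Message(0, 0, 1, 0)]
print(on_z_cycle(repaired, proc=1, ckpt=1))  # False
```

The second trace hints at why checkpointing when such paths are detected can help: one extra checkpoint on P0 breaks the cycle, so P1's checkpoint becomes usable for recovery.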

Published in:

Parallel and Distributed Processing, 1993. Proceedings of the Fifth IEEE Symposium on

Date of Conference:

1-4 Dec 1993