By Topic

Coordinated checkpointing-rollback error recovery for distributed shared memory multicomputers

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$33 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
G. Janakiraman ; Dept. of Comput. Sci., California Univ., Los Angeles, CA, USA ; Y. Tamir

Most recovery schemes that have been proposed for Distributed Shared Memory (DSM) systems require unnecessarily high checkpointing frequency and checkpoint traffic, which are sensitive to the frequency of interprocess communication in the applications. For message-passing systems, low overhead error recovery based on coordinated checkpointing allows the frequency of checkpointing to be determined only by the reliability requirements of the application. Efficient adaptation of this approach to DSM multicomputers is complicated by the absence of explicit messages in DSM systems, the presence of a shared and partially replicated address space, and the presence of a distributed coherency directory. We present solutions to these issues, and propose an error recovery scheme based on coordinated checkpointing and rollback for DSM multicomputers. Our performance evaluation based on trace-driven simulations indicates that this scheme incurs less checkpoint traffic than recovery schemes previously proposed for DSM systems

Published in:

Reliable Distributed Systems, 1994. Proceedings., 13th Symposium on

Date of Conference:

25-27 Oct 1994