By Topic

Raptor: integrating checkpoints and thread migration for cluster management

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Shafi, H. ; Austin Res. Lab., IBM Res., Austin, TX, USA ; Speight, E. ; Bennett, John K.

Software distributed shared-memory (SDSM) provides the abstraction necessary to run shared-memory applications on cost-effective parallel platforms such as clusters of workstations. However, problems such as cluster component reliability and cluster management, which are not directly related to performance, need to be addressed before SDSM solutions can be widely adopted. This paper presents Raptor, an SDSM cluster management system based on checkpoint/recovery and thread migration. Raptor checkpoints decouple the runtime system and application data from application threads, allowing efficient load balancing, resource allocation, and rollback recovery. There are two important features of the system. First, it reduces checkpoint overhead by only saving application-specific data that cannot be recreated at recovery time. Second, by integrating thread migration capability both at running and recovery, it allows the addition or removal of computing resources from a running application, while adding little or no additional burden on the SDSM application programmer.

Published in:

Reliable Distributed Systems, 2003. Proceedings. 22nd International Symposium on

Date of Conference:

6-18 Oct. 2003