By Topic

Reliable Software Distributed Shared Memory Using Page Migration

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Jinpil Lee ; Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan ; Sato, M.

Reliability has recently become an important issue in PC cluster technology. This research proposes a software distributed shared memory system, named SCASH-FT, as an execution platform for high performance and highly reliable parallel system for commodity PC clusters. To achieve fault tolerance, each node has redundant page data that allows recovery from node failure using SCASH-FT. All page data is checkpointed and duplicated to another node when a user explicitly calls the checkpoint function. When failure occurs, SCASH-FT invokes the rollback function by restarting an execution from the last checkpoint data. SCASH-FT takes charge of processes such as detecting failure and restarting execution. So, all you have to do is just adding checkpoint function calls in the source code to determine the timing of each checkpoint. Evaluation results show that the checkpoint cost and the rollback penalty depend on the data access pattern and the checkpoint frequency. Thus, users can control their application performance by adjusting checkpoint frequency.

Published in:

Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on

Date of Conference:

8-11 Dec. 2009