Reliability has recently become an important issue in PC cluster technology. This research proposes a software distributed shared memory system, named SCASH-FT, as an execution platform for high performance and highly reliable parallel system for commodity PC clusters. To achieve fault tolerance, each node has redundant page data that allows recovery from node failure using SCASH-FT. All page data is checkpointed and duplicated to another node when a user explicitly calls the checkpoint function. When failure occurs, SCASH-FT invokes the rollback function by restarting an execution from the last checkpoint data. SCASH-FT takes charge of processes such as detecting failure and restarting execution. So, all you have to do is just adding checkpoint function calls in the source code to determine the timing of each checkpoint. Evaluation results show that the checkpoint cost and the rollback penalty depend on the data access pattern and the checkpoint frequency. Thus, users can control their application performance by adjusting checkpoint frequency.
Published in:
Parallel and Distributed Systems (ICPADS), 2009 15th International Conference on
Date of Conference: 8-11 Dec. 2009