Peer-to-peer fault tolerance framework for a grid computing system

3 Author(s)
Tangmankhong, T. ; Dept. of Comput. Eng., King Mongkut's Univ. of Technol. Thonburi, Thonburi, Thailand ; Siripongwutikorn, P. ; Achalakul, T.

A grid computing system provides high-performance computing power, large storage space, or high communication bandwidth to suit user requirements. The major concern in a grid computing system is reliability, since the failure of a single node causes all applications running on that node to fail. We propose a fault-tolerance framework to improve the reliability of a grid system. The proposed framework is novel in that it uses a peer-to-peer replication model instead of the traditional client-server replication model, which reduces the replication time overhead and provides a better degree of resiliency. Essentially, the checkpoint data file is split into chunks that are distributed among a number of backup peers in parallel, such that each chunk is replicated at two backup nodes. Moreover, the backup survives, with its data redundancy maintained, if any one of the backup nodes in the group fails. Detailed algorithms for the modules of the complete framework are provided, including group forming, fault detection, replication, and fault recovery. A comparative performance evaluation of the replication time of the proposed peer-to-peer model and the client-server model has been conducted by simulation over a wide range of chunk sizes and checkpoint data sizes. Our results show that, for a large enough chunk size, the replication time of the peer-to-peer replication model is reduced by half compared with that of the client-server model.
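The chunking scheme described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual placement algorithm, so the ring-style placement used here (chunk i goes to peers i and i+1, modulo the group size) is an assumption chosen only because it satisfies the stated property that every chunk is held by two backup nodes. All function names, the peer labels, and the in-memory dictionaries standing in for network transfers are hypothetical, and the parallel transfer itself is not modeled.

```python
from __future__ import annotations


def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    """Split checkpoint data into fixed-size chunks (the last may be shorter)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def place_chunks(num_chunks: int, peers: list[str]) -> dict[int, tuple[str, str]]:
    """Assign each chunk to two distinct backup peers.

    Assumed ring placement: chunk i is stored on peer i and peer i+1
    (modulo the group size). This is an illustration, not the paper's method.
    """
    n = len(peers)
    if n < 2 or len(set(peers)) != n:
        raise ValueError("need at least two distinct backup peers")
    return {i: (peers[i % n], peers[(i + 1) % n]) for i in range(num_chunks)}


def replicate(data: bytes, chunk_size: int, peers: list[str]) -> dict[str, dict[int, bytes]]:
    """Return a per-peer store mapping chunk index -> chunk bytes."""
    chunks = split_into_chunks(data, chunk_size)
    placement = place_chunks(len(chunks), peers)
    stores: dict[str, dict[int, bytes]] = {p: {} for p in peers}
    for idx, holders in placement.items():
        for peer in holders:
            stores[peer][idx] = chunks[idx]
    return stores


def recover(stores: dict[str, dict[int, bytes]], num_chunks: int,
            failed: str | None = None) -> bytes:
    """Rebuild the checkpoint from surviving peers, skipping one failed peer."""
    merged: dict[int, bytes] = {}
    for peer, held in stores.items():
        if peer != failed:
            merged.update(held)
    missing = [i for i in range(num_chunks) if i not in merged]
    if missing:
        raise RuntimeError(f"chunks lost: {missing}")
    return b"".join(merged[i] for i in range(num_chunks))


if __name__ == "__main__":
    checkpoint = bytes(range(256)) * 10          # 2560 bytes of sample data
    group = ["p0", "p1", "p2", "p3"]             # hypothetical backup peers
    stores = replicate(checkpoint, chunk_size=100, peers=group)
    total = len(split_into_chunks(checkpoint, 100))
    for down in group:                           # any single peer may fail
        assert recover(stores, total, failed=down) == checkpoint
    print("checkpoint recoverable after any single backup-peer failure")
```

Because each chunk lives on two different peers, losing any one peer leaves at least one copy of every chunk, which is the single-failure survival property the abstract claims. Restoring full two-copy redundancy after a failure would need an additional re-replication step that this sketch does not attempt.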

Published in:

Computer Science and Software Engineering (JCSSE), 2012 International Joint Conference on

Date of Conference:

May 30 2012-June 1 2012