Skip to Main Content
A grid computing system provides high performance computing power, large storage space, or high communication bandwidth, to suit user requirements. The major concern in a grid computing system is the reliability, as a single node failure fails all running applications on the node. We proposed a fault-tolerance framework to improve the reliablity of a grid system. The proposed framework is novel in the sense that it uses the peer-to-peer replication model instead of a traditional client-server replication model, which reduces the replication time overhead and provides better degree of resiliency. Essentially, the checkpoint data file is split into chunks and distributed among a number of backup peers in parallel such that each chunk is replicated at two backup nodes. Moreover, the survival of the backup with the backup data redundancy in case of any one of the backup nodes in the group fails is also maintained. Detailed algorithms of modules of the complete framework are provided including group-forming, fault detection, replication, and fault recovery. Comparative performance evaluation of the replication time between the proposed peer-to-peer model and the client-server model has been conducted by using simulation over a wide range of chunk sizes and checkpoint data size. Our results show that, for a large enough chunk size, the replication time of the peer-to-peer replication model is reduced by half compared to that of the client-server model.