In this paper we present Chkpt2Chkpt, a desktop grid system that aims to reduce the turnaround times of applications by replicating checkpoints. We target desktop computing projects whose applications are composed of long-running independent tasks, executed on hundreds or thousands of computers spread over the Internet. While these applications typically checkpoint locally to deal with failures, we propose replicating those checkpoints in remote locations to make them available to other worker nodes. The main idea is to organize the worker nodes of a desktop grid into a peer-to-peer (P2P) Distributed Hash Table. Worker nodes can use this P2P network to track, share, and manage checkpoint files and to reclaim the space they occupy. We validated our system through simulation, and we show that remotely storing replicas of checkpoints can considerably reduce the turnaround times of tasks compared to the traditional approach in which nodes manage their own checkpoints locally. These results lead us to conclude that P2P techniques appear to be quite helpful in wide-scale desktop grid environments.
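To make the central idea concrete, the following is a minimal sketch of how worker nodes arranged on a hash ring could keep track of the newest checkpoint of each task. It is an illustration under stated assumptions, not the Chkpt2Chkpt protocol itself: the class `CheckpointRing`, its methods, the node and task names, and the choice of SHA-1 with successor-based placement are all hypothetical, and the "network" is a single in-memory object.

```python
import hashlib
from bisect import bisect_right


def ring_position(key: str) -> int:
    """Map a string to a point on the hash ring (SHA-1, 160 bits)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class CheckpointRing:
    """Toy in-memory stand-in for a DHT over worker nodes (hypothetical)."""

    def __init__(self, node_ids):
        # Place every worker node on the ring, ordered by hashed position.
        self._ring = sorted((ring_position(n), n) for n in node_ids)
        self._positions = [p for p, _ in self._ring]
        # Per-node index: node id -> {task id: (checkpoint number, holder)}.
        self._index = {}

    def responsible_node(self, task_id: str) -> str:
        """Return the first node clockwise from the task's ring position."""
        idx = bisect_right(self._positions, ring_position(task_id))
        return self._ring[idx % len(self._ring)][1]

    def publish(self, task_id: str, checkpoint_no: int, holder: str) -> None:
        """Record that `holder` stores checkpoint `checkpoint_no` of a task."""
        index = self._index.setdefault(self.responsible_node(task_id), {})
        current = index.get(task_id)
        # Keep only the newest checkpoint; older announcements are ignored.
        if current is None or checkpoint_no > current[0]:
            index[task_id] = (checkpoint_no, holder)

    def lookup(self, task_id: str):
        """Return (checkpoint number, holder) for a task, or None if unknown."""
        return self._index.get(self.responsible_node(task_id), {}).get(task_id)


ring = CheckpointRing([f"worker-{i}" for i in range(8)])
ring.publish("task-42", 3, "worker-5")
ring.publish("task-42", 2, "worker-1")  # older checkpoint, ignored
print(ring.lookup("task-42"))  # (3, 'worker-5')
```

In this sketch, a worker that takes over a task would call `lookup` to find which peer holds the most recent checkpoint and resume from there instead of restarting the task. A real deployment would also have to handle node churn, replication of the index itself, and reclaiming the space of obsolete checkpoints, none of which is modeled here.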