By Topic

Minimizing the Network Overhead of Checkpointing in Cycle-harvesting Cluster Environments

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Nurmi, D. ; Dept. of Comput. Sci., California Univ., Santa Barbara, CA ; Brevik, J. ; Wolski, R.

Cycle-harvesting systems such as Condor have been developed to make desktop machines in a local area (which are often similar to clusters in hardware configuration) available as a compute platform. To provide a dual-use capability, opportunistic jobs harvesting cycles from the desktop must be checkpointed before the desktop resources are reclaimed by their owners and the job is evacuated. In this paper, we investigate a new system for computing efficient checkpoint schedules in cycle-harvesting environments. Our system records the historical availability from each resource and fits a statistical model to the observations. Because checkpointing must often traverse the network (i.e. the desktop hosts do not provide sufficient persistent storage for checkpoints), we combine this model with predictions of network performance to the storage site to compute a checkpoint schedule. When an application is initiated on a particular resource, the system uses the computed distribution to parameterize a Markov state-transition model for the application's execution, evaluates the expected time and network overhead as a function of the checkpoint interval, and numerically optimizes with respect to time. We report on the performance of and implementation of this system using the Condor cycle-harvesting environment at the University of Wisconsin. We also evaluate the efficiencies we achieve for a variety of network overheads using trace-based simulation. Finally, we validate our simulations against the observed performance with Condor. Our results indicate that while the choice of model distribution has a relatively small but positive effect on time efficiency, it has a substantial impact on network utilization

Published in:

Cluster Computing, 2005. IEEE International

Date of Conference:

Sept. 2005