Many complex scientific and mathematical applications take a long time to complete. Parallelization is a popular way to address this, and distributing an application across several machines is one of the key aspects of grid computing. This paper focuses on a checkpoint/restart mechanism used to overcome the problem of job suspension at a failed node in a computational grid. The ability to checkpoint a running application and restart it later provides several benefits: fault recovery by rolling an application back to a previous checkpoint, advanced resource sharing, better application response time by restarting applications from checkpoints instead of from scratch, improved system utilization, more efficient high-performance computing, and improved service availability.
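As an illustration of the general idea only, and not the mechanism proposed in this paper, checkpoint/restart can be sketched for a single process as follows. The function names (`save_checkpoint`, `load_checkpoint`, `run_job`), the checkpoint file path, the use of `pickle`, and the `fail_at` parameter that simulates a node failure are all assumptions made for this sketch; a real grid deployment would also need to capture process state and store checkpoints where another node can reach them.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temp file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on the same filesystem, so a crash during the
    # write never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run_job(path, total_steps, checkpoint_every=10, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at (hypothetical) simulates a node failure by raising an
    exception just before that step would run.
    """
    # Resume from the last checkpoint if one exists, else start from scratch.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    for step in range(state["step"], total_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += step
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If the job fails partway through, calling `run_job` again with the same checkpoint path resumes from the most recent saved step rather than from step 0, so only the work done since that checkpoint is repeated.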