Checkpointing with rollback recovery is an effective technique for tolerating faults, provided the application can recover from a previous checkpoint and resume a failure-free computation. However, the technique falls short if the checkpoint files themselves are contaminated by errors. This paper presents two mechanisms for determining whether a committed checkpoint is error-free. Both can serve simultaneously for error detection and failure recovery, and both are based on checkpoint duplication: one uses spatial redundancy, the other temporal redundancy. We discuss the main problems and trade-offs involved in implementing these techniques, and then present a performance study that shows the pros and cons of each. As far as we know, this paper presents the first implementation of these mechanisms in a standard parallel computing system.
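To make the idea of checkpoint duplication concrete, the following is a minimal sketch, not the implementation described in the paper. It assumes that a checkpoint is a JSON-serializable dictionary and that two copies of each checkpoint are available, produced either by two replicas (spatial redundancy) or by executing the same interval twice (temporal redundancy). The names `checkpoint_digest`, `checkpoints_agree`, and `select_recovery_point` are hypothetical and chosen for illustration only.

```python
import hashlib
import json


def checkpoint_digest(state):
    """Hash a checkpoint's contents (assumed JSON-serializable for this sketch)."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def checkpoints_agree(copy_a, copy_b):
    """Treat a checkpoint as error-free only if its two duplicates match."""
    return checkpoint_digest(copy_a) == checkpoint_digest(copy_b)


def select_recovery_point(history):
    """Return the index of the newest checkpoint pair whose copies agree.

    `history` is a list of (copy_a, copy_b) pairs, oldest first. The copies
    may come from two replicas (spatial redundancy) or from running the same
    interval twice (temporal redundancy). Returns None if no pair agrees.
    """
    for index in range(len(history) - 1, -1, -1):
        copy_a, copy_b = history[index]
        if checkpoints_agree(copy_a, copy_b):
            return index
    return None
```

In this sketch the same comparison serves both purposes named above: a mismatch between the two copies signals an error, and the newest matching pair marks a checkpoint from which recovery can safely restart.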