Backward error recovery is one of the most used schemes to ensure fault-tolerance in distributed systems. It consists, upon the occurrence of a failure, in restoring a distributed computation in an error-free global state from which it can be resumed to produce a correct behaviour. Checkpointing is one of the techniques to pursue the backward error recovery. As we consider large-scale distributed systems, on one side a coordinated approach to take checkpoints is not practicable, on the other side for an uncoordinated approach the probability to have a domino effect during a recovery could be no longer negligible. In this paper, we present a framework that allows first to define formally the domino effect and second to state and prove a theorem to determine if an arbitrary set of check points is consistent. This theorem is very general as it considers a semantic including missing and orphan messages. This plays a key role in designing uncoordinated checkpointing algorithms that require to take as less additional checkpoints as possible in order to ensure domino-free recovery
Published in:
Distributed Computing Systems, 1995., Proceedings of the Fifth IEEE Computer Society Workshop on Future Trends of
Date of Conference: 28-30 Aug 1995