In distributed systems, there are many opportunities for failure. Any component of any compute node can fail, including, but not limited to, the processor, disk, memory, or network interface. Such a failure can cause the processes running on the affected node to crash or produce incorrect results. The common method of ensuring that these processes continue to make progress is to take checkpoints; this becomes more complicated when the processes communicate with one another. This paper presents a distributed, non-blocking, coordinated checkpointing algorithm that ensures the production of globally consistent checkpoint images. These consistent checkpoint images can be used to migrate application processes to different compute nodes when a failure occurs.
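
To illustrate why communication between processes complicates checkpointing, the following minimal Python sketch checks the standard consistency condition for a set of per-process checkpoints: no message may be recorded as received unless its send is also recorded. It is not the algorithm presented in this paper. The names `Message`, `orphan_messages`, and `is_consistent`, and the use of local event counters as checkpoint positions, are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sender: str
    receiver: str
    send_time: int  # sender's local event counter at the send event
    recv_time: int  # receiver's local event counter at the receive event


def orphan_messages(checkpoints, messages):
    """Return messages whose receive is inside the receiver's checkpoint
    while the send is outside the sender's checkpoint.

    `checkpoints` maps a process name to the local event counter at which
    its checkpoint was taken; events with a counter <= that value are
    considered part of the saved state (an illustrative convention).
    """
    return [
        m for m in messages
        if m.recv_time <= checkpoints[m.receiver]
        and m.send_time > checkpoints[m.sender]
    ]


def is_consistent(checkpoints, messages):
    """A set of checkpoints is consistent if it contains no orphan message."""
    return not orphan_messages(checkpoints, messages)


# P0 sends one message at its local event 3; P1 receives it at its local event 2.
messages = [Message("P0", "P1", send_time=3, recv_time=2)]

# Independent checkpoints: P1 has saved the receive, P0 has not saved the send.
uncoordinated = {"P0": 2, "P1": 4}

# Checkpoints aligned so that the send is also saved.
coordinated = {"P0": 3, "P1": 4}

print(is_consistent(uncoordinated, messages))
print(is_consistent(coordinated, messages))
```

Restarting from the `uncoordinated` checkpoints would leave P1 holding a message that P0, after rollback, has not yet sent. Coordinated checkpointing exists to rule out such combinations.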