Correlated failures have recently gained attention in research on failures in large-scale systems. Recent studies have pointed out the negative effect of ignoring such failures when designing a fault-tolerant scheme for large-scale systems. In this paper, we explore the behavior of temporally correlated failures arising from cyclic dependencies among task nodes, using an abstract model. With this model, we find that fast failure propagation and slow recovery are the two dominant factors that make recovering from such failures much more difficult. To stop failure propagation efficiently and shorten the total recovery time, we propose a recovery protocol against temporally correlated failures called GCCTS (group-based coordinated checkpointing and task suspending).
Published in: 14th IEEE International Conference on Parallel and Distributed Systems (ICPADS '08), 2008
Date of Conference: 8-10 Dec. 2008