Tolerating Temporal Correlated Failures from Cyclic Dependency in High Performance Computing Systems

2 Author(s)
Xin Chen ; Dept. of Electr. & Comput. Eng., Tennessee Technol. Univ., Cookeville, TN, USA ; Xubin He

Correlated failures have recently gained attention in research on failures in large-scale systems. Recent studies have pointed out the negative effect of ignoring such failures when designing a fault-tolerant scheme for large-scale systems. In this paper, we use an abstract model to explore the behavior of temporal correlated failures arising from cyclic dependency among task nodes. Using this model, we find that fast failure propagation and slow recovery are the two dominant factors that make recovering from such failures much more difficult. To stop failure propagation efficiently and shorten the total recovery time, we propose a recovery protocol called GCCTS (group-based coordinated checkpointing and task suspending) against temporal correlated failures.

Published in:

Parallel and Distributed Systems, 2008. ICPADS '08. 14th IEEE International Conference on

Date of Conference:

8-10 Dec. 2008