A Redundant Communication Approach to Scalable Fault Tolerance in PGAS Programming Models

4 Author(s)
Ali, N. (Pacific Northwest Nat. Lab., Richland, WA, USA); Krishnamoorthy, S.; Govind, N.; Palmer, B.

Recent trends in high-performance computing point toward increasingly large machines with millions of processing, storage, and networking elements. Unfortunately, the reliability of these machines is inversely proportional to their size, resulting in a system-wide mean time between failures (MTBF) ranging from a few days to a few hours. For long-running applications, the ability to recover efficiently from frequent failures is therefore essential. Traditional forms of fault tolerance, such as checkpoint/restart, suffer from performance issues related to limited I/O and memory bandwidth. In this paper, we present a fault-tolerance mechanism that reduces the cost of failure recovery by maintaining shadow data structures and performing redundant remote memory accesses. Results from a computational chemistry application running at scale show that our techniques provide applications with a high degree of fault tolerance at low (2%-4%) overhead on 2048 processors.
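
The idea of shadow data structures updated through redundant remote accesses can be illustrated with a small single-process sketch. This is not the authors' implementation: the class `ReplicatedArray`, its methods, and the choice to place each shadow block on the neighboring rank are hypothetical, and real PGAS one-sided communication is replaced here by plain list updates.

```python
class ReplicatedArray:
    """Toy global array: each rank's block is stored twice (hypothetical design)."""

    def __init__(self, num_ranks, block_size):
        if num_ranks < 2:
            raise ValueError("need at least two ranks to hold a shadow copy")
        self.num_ranks = num_ranks
        # primary[r] is the block owned by rank r.
        self.primary = [[0.0] * block_size for _ in range(num_ranks)]
        # shadow[r] is a copy of rank r's block, assumed to live in the
        # memory of rank (r + 1) % num_ranks.
        self.shadow = [[0.0] * block_size for _ in range(num_ranks)]
        self.alive = [True] * num_ranks

    def shadow_host(self, owner):
        """Rank that holds the shadow copy of `owner`'s block."""
        return (owner + 1) % self.num_ranks

    def put(self, owner, offset, value):
        """Redundant write: update every surviving copy of the element."""
        wrote = False
        if self.alive[owner]:
            self.primary[owner][offset] = value
            wrote = True
        if self.alive[self.shadow_host(owner)]:
            self.shadow[owner][offset] = value
            wrote = True
        if not wrote:
            raise RuntimeError("both copies of the block are lost")

    def get(self, owner, offset):
        """Read from the primary copy, falling back to the shadow copy."""
        if self.alive[owner]:
            return self.primary[owner][offset]
        if self.alive[self.shadow_host(owner)]:
            return self.shadow[owner][offset]
        raise RuntimeError("both copies of the block are lost")

    def fail(self, rank):
        """Simulate a rank failure: everything stored on that rank disappears."""
        self.alive[rank] = False
        self.primary[rank] = None
        # The failed rank also hosted the shadow of its left neighbor's block.
        self.shadow[(rank - 1) % self.num_ranks] = None
```

In this sketch, a value written with `put` before a call to `fail` on the owning rank is still readable through `get`, because the shadow copy on the neighboring rank survives. Data is lost only if the owner and its shadow host both fail. The price is that every write is issued twice, which is the kind of overhead the abstract's 2%-4% figure refers to for the authors' own system.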

Published in: Parallel, Distributed and Network-Based Processing (PDP), 2011 19th Euromicro International Conference on

Date of Conference: 9-11 Feb. 2011