By Topic

System-level fault-tolerance in large-scale parallel machines with buffered coscheduling

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Petrini, F. ; Performance & Archit. Lab., Los Alamos Nat. Lab., NM, USA ; Davis, K. ; Sancho, J.C.

Summary form only given. As the number of processors for multiteraflop systems grows to tens of thousands, with proposed petaflops systems likely to contain hundreds of thousands of processors, the assumption of fully reliable hardware has been abandoned. Although the mean time between failures for the individual components can be very high, the large total component count will inevitably lead to frequent failures. It is therefore of paramount importance to develop new software solutions to deal with the unavoidable reality of hardware faults. We will first describe the nature of the failures of current large-scale machines, and extrapolate these results to future machines. Based on this preliminary analysis we will present a new technology that we are currently developing, buffered coscheduling, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency - requiring no changes to user applications. Preliminary results show that this is attainable with current hardware.

Published in:

Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International

Date of Conference:

26-30 April 2004