On integrating error detection into a fault diagnosis algorithm for massively parallel computers

3 Author(s)
Altmann, J. (Dept. of Comput. Sci., Erlangen-Nurnberg Univ., Germany); Bartha, T.; Pataricza, A.

Scalable fault diagnosis is necessary for constructing fault tolerance mechanisms in large, massively parallel multiprocessor systems. The diagnosis algorithm must operate efficiently even if the system consists of several thousand processors. We introduce an event-driven, distributed system-level diagnosis algorithm. It uses a small number of messages and is based on a general diagnosis model that places no limit on the number of simultaneously existing faults (an important requirement for massively parallel computers). The algorithm integrates both error detection techniques such as ⟨I'm alive⟩ messages and built-in hardware mechanisms. The structure of the implemented algorithm is presented and the essential program modules are described. The paper also discusses the use of test results generated by error detection mechanisms for fault localization. Measurement results illustrate the effect of the diagnosis algorithm, in particular of the error detection mechanism based on ⟨I'm alive⟩ messages, on application performance.
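
For readers unfamiliar with ⟨I'm alive⟩ messages, the general idea of timeout-based error detection can be sketched as follows. This is a minimal illustration only and is not the algorithm from the paper: the class name `HeartbeatMonitor`, the `timeout` parameter, and the use of explicit timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: records when each neighbor last sent <I'm alive>."""

    timeout: float  # assumed: seconds of silence before a neighbor is suspected
    last_seen: dict[int, float] = field(default_factory=dict)

    def record_alive(self, node_id: int, now: float) -> None:
        # Called whenever an <I'm alive> message from node_id arrives.
        self.last_seen[node_id] = now

    def suspects(self, now: float) -> list[int]:
        # A neighbor is suspected once its silence exceeds the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=2.0)
monitor.record_alive(1, now=0.0)
monitor.record_alive(2, now=0.0)
monitor.record_alive(1, now=1.5)   # node 1 keeps reporting, node 2 goes silent
print(monitor.suspects(now=3.0))
```

In a sketch like this, a suspected neighbor would only be a test result handed to the diagnosis step; locating the actual fault is a separate problem, which is the distinction the abstract draws between error detection and fault localization.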

Published in:

Computer Performance and Dependability Symposium, 1995. Proceedings., International

Date of Conference:

24-26 Apr 1995