Algorithm-based fault tolerance for floating-point operations in massively parallel systems

2 Author(s)
J. Rexford (Dept. of EECS, Michigan Univ., Ann Arbor, MI, USA); N. K. Jha

Considers the applicability of algorithm-based fault tolerance (ABFT) to massively parallel scientific computation. Existing ABFT schemes can provide effective fault tolerance at low cost for computation on matrices of moderate size; however, the methods do not scale well to floating-point operations on large systems. The authors propose the use of a partitioned linear encoding scheme to provide scalability. Matrix algorithms employing this scheme are presented and compared to current ABFT schemes with respect to numerical stability and hardware/time overhead. The partitioned scheme is shown to provide scalable linear codes with improved numerical properties at the cost of only a small increase in hardware and time overhead. The partitioned approach prevents overflow in encoding and can preserve the reflectivity of codes, while guarding against roundoff error in encoding. The sharper bound on numerical encoding error allows the method to provide more complete fault coverage.
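
The general idea of a partitioned checksum encoding can be illustrated with a minimal sketch. It is not the authors' scheme, whose details the abstract does not give. It assumes the classic row-checksum form of ABFT for matrix multiplication and simply restricts each checksum to a small group of rows, so that every checksum sums only a bounded number of elements. The function names (`matmul`, `encode_rows`, `check_rows`), the group size `block`, and the tolerance `tol` are all hypothetical choices made for illustration.

```python
def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def encode_rows(a, block):
    """Append one checksum row after every `block` rows of `a` (assumed layout)."""
    out = []
    for start in range(0, len(a), block):
        group = a[start:start + block]
        out.extend(row[:] for row in group)
        # Checksum row: column-wise sum over this group only.
        out.append([sum(col) for col in zip(*group)])
    return out


def check_rows(c, block, tol=1e-9):
    """Return indices of row groups whose checksum row disagrees with its data rows."""
    bad = []
    step = block + 1  # `block` data rows plus one checksum row per group
    for g, start in enumerate(range(0, len(c), step)):
        group = c[start:start + step]
        data, checksum = group[:-1], group[-1]
        for j, expected in enumerate(checksum):
            actual = sum(row[j] for row in data)
            # Relative tolerance absorbs floating-point roundoff (assumed value).
            scale = max(1.0, abs(expected), abs(actual))
            if abs(actual - expected) > tol * scale:
                bad.append(g)
                break
    return bad


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
b = [[0.5, 1.0], [1.5, 2.0]]

# Multiplying the encoded matrix by b yields the encoded form of a*b,
# because each checksum row is a linear combination of its group's rows.
c = matmul(encode_rows(a, 2), b)
print(check_rows(c, 2))   # no group flagged in the fault-free product

c[3][1] += 1.0            # simulate a fault in the second row group
print(check_rows(c, 2))   # the second group (index 1) is flagged
```

Because each checksum covers at most `block` rows, the sums being compared stay short regardless of the matrix size, which is the intuition behind using a tighter tolerance than a single whole-matrix checksum would allow. The trade-off in this sketch is one extra row per group rather than one per matrix.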

Published in:

Proceedings of the 1992 IEEE International Symposium on Circuits and Systems (ISCAS '92), Volume 2

Date of Conference:

10-13 May 1992