By Topic

Algorithm-based fault location and recovery for matrix computations on multiprocessor systems

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Roy-Chowdhury, A. ; IBM Thomas J. Watson Res. Center, Yorktown Heights, NY, USA ; Banerjee, P.

Algorithm-based fault-tolerance (ABFT) is an inexpensive method of incorporating fault-tolerance into existing applications. Applications are modified to operate on encoded data and produce encoded results which may then be checked for correctness. An attractive feature of the scheme is that it requires little or no modification to the underlying hardware or system software. Previous algorithm-based methods for developing reliable versions of numerical programs for general-purpose multicomputers have mostly concerned themselves with error detection. A truly fault-tolerant algorithm, however, needs to locate errors and recover from them once they are located. In a parallel processing environment, this corresponds to locating the faulty processors and recovering the data corrupted by the faulty processors. In this paper, we first present a general scheme for performing fault-location and recovery under the ABFT framework. Our fault model assumes that a faulty processor can corrupt all the data it possesses. The fault-location scheme is an application of system-level diagnosis theory to the ABFT framework, while the fault-recovery scheme uses ideas from coding theory to maintain redundant data and uses this to recover corrupted data in the event of processor failures. Results are presented on implementations of three numerical algorithms on a 16-processor Intel iPSC/2 hypercube multicomputer, which demonstrate acceptably low overheads for the single and double fault location and recovery cases

Published in:

Computers, IEEE Transactions on  (Volume:45 ,  Issue: 11 )