Algorithm-based fault tolerance (ABFT) is a method for improving the reliability of parallel architectures used for computation-intensive tasks. A two-stage approach to the synthesis of ABFT systems is proposed. In the first stage, a system-level code is chosen to encode the data used in the algorithm. In the second stage, the optimal architecture for implementing the scheme is chosen using dependence graphs, a graph-theoretic representation of algorithms. The authors demonstrate that not every architecture is ideal for implementing a particular ABFT scheme, and they propose new measures that characterize the fault tolerance capability of a system so that the proposed synthesis method can be better exploited. Dependence graphs can also be used to synthesize ABFT schemes for non-linear problems; a fault-tolerant median filter is provided as an example to illustrate their utility for such problems.
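To make the first stage concrete, the following is a minimal sketch of a system-level code applied to the data of an algorithm. It uses the classic checksum encoding for matrix multiplication, which is a standard ABFT illustration and not necessarily the code chosen in this paper. The function names (`column_checksum`, `row_checksum`, `matmul`, `find_inconsistencies`) are hypothetical, and the sketch assumes integer data so that checksums can be compared exactly.

```python
# Illustrative sketch only: checksum-encoded matrix multiplication.
# All names are hypothetical; integer inputs are assumed so that
# equality tests on checksums are exact (floats would need a tolerance).

def column_checksum(a):
    """Return a with one extra row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Return b with one extra column holding the sum of each row."""
    return [row + [sum(row)] for row in b]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def find_inconsistencies(c):
    """Return (bad_rows, bad_cols) for a full-checksum matrix c.

    The last row and last column of c hold checksums. A single faulty
    data element shows up as exactly one bad row and one bad column,
    whose intersection locates it.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    return bad_rows, bad_cols

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# Encode the inputs, then run the unmodified multiplication on them.
c = matmul(column_checksum(a), row_checksum(b))
print(find_inconsistencies(c))  # ([], [])

c[1][0] += 7                    # simulate a fault in one processor's output
print(find_inconsistencies(c))  # ([1], [0])
```

The product of a column-checksum matrix and a row-checksum matrix carries checksums in both its last row and its last column, so the encoding survives the computation. How many processors share the elements covered by one checksum depends on the architecture, which is the kind of question the second stage of the synthesis addresses.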
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 4, Issue: 8)
Date of Publication: Aug 1993