High-performance distributed computing across wide-area networks has become an active research topic. Achieving seamless large-scale distributed computing raises a number of difficult problems. This paper examines one of the most critical of them, fault tolerance. We examine fault tolerance options for a common class of high-performance parallel applications, single-program-multiple-data (SPMD). We develop performance models for two fault tolerance methods, checkpoint-recovery (CR) and wide-area replication (WR). These models enable quantitative comparisons of the two methods as applied to SPMD applications.
Published in:
High Performance Distributed Computing, 1999. Proceedings. The Eighth International Symposium on
Date of Conference: 1999