1 Introduction
Many systems are required to operate without failure for a given period of time. Such systems must be fault tolerant: they continue to function even when one or more of their components have failed. Redundancy is often used to achieve high reliability. Over the last decade, the landscape of High Performance Computing (HPC) has changed drastically owing to a number of significant developments in processor speed and in the speed of communication networks. With these developments, it has become possible to build an economical distributed system consisting of powerful workstations interconnected through a high-speed communication link to replace an expensive special-purpose supercomputer. Clusters of Workstations (COWs), or network-based multi-computers, and Massively Parallel Systems (MPPs) are the most prominent distributed-memory systems used to replace supercomputers.