Skip to Main Content
This paper introduces a new method based on parallel failure recovery, for the fault tolerance issue of parallel programs. In case a process fails, other surviving processes will compute the task of the failed one in parallel, so that the overhead for fault tolerance is leveled down. The paper presents the design and implementation of the parallel FFT using the new approach, and works on finding an optimum number of processes that participate in parallel failure recovery. Finally, an experiment is done to show the better performance of the parallel failure recovery over that of checkpointing, and to show the effectiveness of our solution for the best number of processes participating parallel failure recovery.