In a distributed computing environment, particularly a grid, fault tolerance is one of the core functionalities the system should provide. MPICH-GF is a resilient system designed to withstand external and internal failures, especially for message-passing applications in the grid environment. However, it does not protect against the loss of one valuable resource: files. Typically, users open files and write data to them asynchronously, and checkpointing is initiated without regard to the state of the process context. The checkpointing system should therefore automatically recognize the running process and transparently protect its open files. We have implemented a recoverable file system, named ReFS, and incorporated it into our fault-tolerant system MPICH-GF. ReFS is a versioning-like file system that provides middleware libraries with a system call interface for protecting specific files at a given time. This prevents applications from continuing their jobs with corrupted data, and thus producing incorrect results, after a failure. We have focused not only on the reliability of the system but also on reducing the unavoidable overhead. This paper describes the design and implementation of ReFS and argues for the correctness of its behavior. We have developed ReFS on Linux, based on Ext2.
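The idea of protecting specific files at checkpoint time and restoring them after a failure can be illustrated with a small user-space sketch. This is not the ReFS interface: ReFS works at the file system level through system calls, whereas the `VersionStore` class, its `protect` and `rollback` methods, and the file names below are all hypothetical and use plain whole-file copies purely for illustration.

```python
import os
import shutil
import tempfile


class VersionStore:
    """Toy user-space stand-in for a versioning file system (hypothetical, not ReFS)."""

    def __init__(self, store_dir):
        self.store_dir = store_dir
        os.makedirs(store_dir, exist_ok=True)
        # checkpoint id -> {original path: path of the saved copy}
        self.versions = {}

    def protect(self, checkpoint_id, paths):
        """Save a copy of each file as it exists at checkpoint time."""
        saved = {}
        for index, path in enumerate(paths):
            copy_path = os.path.join(self.store_dir, f"ckpt{checkpoint_id}_{index}")
            shutil.copy2(path, copy_path)
            saved[os.path.abspath(path)] = copy_path
        self.versions[checkpoint_id] = saved

    def rollback(self, checkpoint_id):
        """Restore every protected file to its content at the given checkpoint."""
        for path, copy_path in self.versions[checkpoint_id].items():
            shutil.copy2(copy_path, path)


def demo():
    workdir = tempfile.mkdtemp()
    data_file = os.path.join(workdir, "output.dat")
    store = VersionStore(os.path.join(workdir, ".versions"))

    # The application finishes step 1, then a checkpoint protects the file.
    with open(data_file, "w") as f:
        f.write("step 1\n")
    store.protect(checkpoint_id=1, paths=[data_file])

    # The application writes part of step 2, then fails before the next checkpoint.
    with open(data_file, "a") as f:
        f.write("partial step 2")

    # Recovery rolls the file back so it matches the restored process state.
    store.rollback(checkpoint_id=1)

    with open(data_file) as f:
        restored = f.read()
    shutil.rmtree(workdir)
    return restored
```

Calling `demo()` returns the file content as of the checkpoint, with the partial write discarded. Whole-file copying is the simplest way to show the behavior and makes no attempt at the overhead reduction the abstract names as a design goal.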