Building a computation grid service system from the idle computers in an organization, in order to provide computation services for mobile agents, can reduce spending on high-performance computing and make full use of idle resources. However, a fault-tolerance mechanism is needed to guarantee that computation tasks keep running when nodes or networks in the system fail. This paper studies the three main parts of the system's fault-tolerance mechanism and proposes an adaptive fault-detection mechanism; a non-close, non-blocking, low-overhead checkpointing mechanism; and a partial rollback mechanism based on communication domains. These mechanisms reduce the overhead of fault tolerance. Experiments show their advantages.
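The idea behind partial rollback can be illustrated with a small sketch. The abstract does not define "communication domain", so the code below assumes it means the set of nodes linked, directly or transitively, to the failed node by messages exchanged since the last checkpoint. The function name `rollback_set` and the message-log format are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
from collections import defaultdict, deque

def rollback_set(failed, messages):
    """Return the nodes that roll back when `failed` fails.

    Assumption (not from the paper): `messages` is an iterable of
    (sender, receiver) pairs exchanged since the last checkpoint, and the
    communication domain is the set of nodes reachable from `failed`
    through those exchanges.
    """
    # Build an undirected graph of who exchanged messages with whom.
    peers = defaultdict(set)
    for sender, receiver in messages:
        peers[sender].add(receiver)
        peers[receiver].add(sender)

    # Breadth-first search from the failed node.
    domain = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for other in peers[node]:
            if other not in domain:
                domain.add(other)
                queue.append(other)
    return domain

messages = [("A", "B"), ("B", "C"), ("D", "E")]
print(sorted(rollback_set("A", messages)))  # prints ['A', 'B', 'C']
```

In this toy log, nodes D and E never communicated with A, B, or C, so a failure of A leaves them untouched. Avoiding a rollback of every node is where a partial scheme would save overhead compared with a global one.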