Fault detection is a core function required by most fault tolerance strategies, but it often depends on reliable communication between the computing nodes that exchange monitoring information. We present techniques that improve the robustness of fault detectors for distributed platforms when network connectivity is affected by packet loss and delays. Such network conditions can be found in computing grids that connect geographically distant resources. We report results from experimental tests conducted in a simulated environment; they show a significant improvement over traditional approaches.