A large-scale distributed system may experience software or hardware failures that lead to undesirable downtime. While hardware node failures are common in large distributed systems, software reliability can also be a significant factor. System reliability can be improved by integrating hardware-based and software-based reliability techniques. We present a combined fault-tolerant approach that improves the reliability of a large monitoring system through failure detection, isolation, and recovery. The proposed approach was applied to a real-time distributed monitoring system, and preliminary experiments showed a substantial improvement in reliability. The experiments also showed that our approach is scalable enough to meet the needs of large-scale monitoring systems.