In networked and distributed systems, network monitoring has become an essential facility for controlling and managing performance and quality of service. As networks scale up rapidly, distributed monitoring schemes based on a hierarchy of monitoring managers have been proposed and deployed. However, when a monitoring manager fails, the network elements it manages are no longer polled continuously and correctly until the manager is repaired. To address this problem, this paper proposes an efficient fault-tolerance scheme for monitoring managers that exploits their hierarchical structure. The scheme keeps failure detection overhead low because each monitoring manager periodically sends a manager advertisement message only to its immediate super manager. Even if several managers crash concurrently, the scheme allows their immediate super managers to take over their roles, which minimizes the number of live managers affected by the failures. Moreover, once failed managers have recovered, the scheme allows them to resume their pre-failure roles immediately, restoring the monitoring performance degraded by the failures.
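As an informal illustration of the mechanism described above, the following Python sketch models a manager hierarchy in which each manager advertises only to its immediate super manager, a super manager takes over the elements of a silent child, and a recovered child resumes its role on its next advertisement. It is not the paper's implementation: the `Manager` class, the `TIMEOUT` value, the method names, and the assumption that a super manager already knows which elements each child polls are all hypothetical choices made for the example.

```python
TIMEOUT = 3.0  # assumed detection timeout in seconds; not taken from the paper


class Manager:
    """Hypothetical monitoring manager in a hierarchy (illustrative only)."""

    def __init__(self, name, elements=(), parent=None):
        self.name = name
        self.own = set(elements)   # elements this manager polls by design
        self.parent = parent       # immediate super manager, if any
        self.children = {}         # child name -> Manager
        self.last_seen = {}        # child name -> time of last advertisement
        self.adopted = {}          # child name -> elements taken over
        self.alive = True
        if parent is not None:
            parent.children[name] = self
            parent.last_seen[name] = 0.0

    def advertise(self, now):
        # A live manager sends its advertisement only to its immediate super manager.
        if self.alive and self.parent is not None:
            self.parent.on_advertisement(self.name, now)

    def on_advertisement(self, child_name, now):
        self.last_seen[child_name] = now
        # A recovered child resumes its pre-failure role, so release its elements.
        self.adopted.pop(child_name, None)

    def check_children(self, now):
        # Take over the elements of any child that has been silent for too long.
        for name, child in self.children.items():
            if name not in self.adopted and now - self.last_seen[name] > TIMEOUT:
                # Assumption: the super manager knows the child's element list.
                self.adopted[name] = set(child.own)

    def polled_elements(self):
        # Elements currently polled: own elements plus those taken over.
        result = set(self.own)
        for elements in self.adopted.values():
            result |= elements
        return result
```

In this sketch a failure touches only the failed manager's immediate super manager, which mirrors the abstract's claim about limiting the number of live managers affected. Handling the sub-managers of a failed manager is left out for brevity.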