Skip to Main Content
We investigate the use of gossip protocols for continuous monitoring of network-wide aggregates under crash failures. Aggregates are computed from local management variables using functions such as SUM, MAX, or AVERAGE. For this type of aggregation, crash failures offer a particular challenge due to the problem of mass loss, namely, how to correctly account for contributions from nodes that have failed. In this paper we give a partial solution. We present G-GAP, a gossip protocol for continuous monitoring of aggregates, which is robust against failures that are discontiguous in the sense that neighboring nodes do not fail within a short period of each other. We give formal proofs of correctness and convergence, and we evaluate the protocol through simulation using real traces. The simulation results suggest that the design goals for this protocol have been met. For instance, the tradeoff between estimation accuracy and protocol overhead can be controlled, and a high estimation accuracy (below some 5% error in our measurements) is achieved by the protocol, even for large networks and frequent node failures. Further, we perform a comparative assessment of GGAP against a tree-based aggregation protocol using simulation. Surprisingly, we find that the tree-based aggregation protocol consistently outperforms the gossip protocol for comparative overhead, both in terms of accuracy and robustness.