The components of a fault-tolerant distributed system must be capable to accurately determine which components of the system are faulty and which are fault-free. In this paper, we present a new distributed algorithm for event diagnosis in fully-connected networks. An event is defined as a faulty node becoming fault-free, or vice versa. Previous hierarchical algorithms considered a static fault situation, in which an event can only occur after a previous event has been fully diagnosed. The new algorithm is capable of achieving the diagnosis of dynamic events as long as the nodes stay in a given state for a period of time long enough for all testers to detect that state. Each node running the algorithm keeps a timestamp for the state of each other node in the system. This timestamp is implemented as a counter, which is incremented every time a node changes its state. In this way, each tester may obtain information about a given node in the system from more than one tested node without causing any inconsistencies, i.e. without taking an older state for a newer one. Nodes run a hierarchical testing strategy, which is a hypercube when all nodes are fault-free. When a fault-free node is tested, the tester gets diagnostic information about N/2 nodes for a system of N nodes. In spite of the overhead of keeping and transferring timestamps, the new algorithm significantly reduces the average latency when compared to other similar approaches, presenting a new option for practical diagnosis implementation
Published in:
Parallel and Distributed Systems, 2000. Proceedings. Seventh International Conference on
Date of Conference: 2000