We present a fault management framework for a particular type of failure propagation that we refer to as "poison message failure propagation": some or all of the network elements have a software or protocol "bug" that is activated on receipt of a certain network control or management message (the poison message). Once activated, the bug causes the node to fail with some probability. If the network control or management is such that this message is persistently passed among the network nodes, and if the node failure probability is sufficiently high, large-scale instability can result. To mitigate this problem, we propose a combination of passive diagnosis and active diagnosis. Passive diagnosis includes protocol analysis of messages received and sent by failed nodes, correlation of messages among multiple failed nodes, and analysis of the pattern of failure propagation. This is combined with active diagnosis, in which filters are dynamically configured to block suspect protocols or message types. OPNET simulations show the effectiveness of passive diagnosis. Message filtering is formulated as a sequential decision problem, and a heuristic policy is proposed for this problem.
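To make the idea of correlating messages among multiple failed nodes concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a hypothetical input format in which each failed node contributes the list of message types it received shortly before failing, and it scores each type by the fraction of failed nodes that saw it. The function name `rank_suspect_types` and the message type labels are illustrative assumptions.

```python
from collections import Counter


def rank_suspect_types(failed_node_logs):
    """Rank message types by the fraction of failed nodes that received them.

    failed_node_logs: list of iterables, one per failed node, holding the
    message types that node received shortly before it failed
    (hypothetical input format, assumed for illustration).
    """
    if not failed_node_logs:
        return []
    counts = Counter()
    for log in failed_node_logs:
        # Count each type once per node so chatty protocols are not overweighted.
        counts.update(set(log))
    total = len(failed_node_logs)
    scores = [(msg_type, count / total) for msg_type, count in counts.items()]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Illustrative logs; the message type labels are made up for this example.
logs = [
    ["OSPF_LSA", "BGP_UPDATE", "OSPF_LSA"],
    ["OSPF_LSA", "LDP_LABEL"],
    ["OSPF_LSA", "BGP_UPDATE"],
]
ranking = rank_suspect_types(logs)
```

A type that appears in the logs of every failed node ranks first and would be a natural first candidate for a filter in the active diagnosis step. A fuller treatment would also compare against logs from nodes that did not fail.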