We present a statistical probing approach to distributed fault detection in networked systems, based on autonomous configuration of algorithm parameters. Statistical modelling is used for detection and localisation of network faults. A detected fault is isolated to a node or link by collaborative fault localisation. From local measurements obtained through probing between nodes, probe response delay and packet drop are modelled via parameter estimation for each link. The estimated model parameters are used for autonomous configuration of algorithm parameters related to probe intervals and detection mechanisms. Expected fault-detection performance is specified as a cost rather than as specific parameter values, which significantly reduces configuration effort in a distributed system. Compared with methods that do not take observed network conditions into account, our algorithm detects faults with increased certainty based on local measurements. We investigate the algorithm's performance for varying user parameters and failure conditions. The simulation results indicate that more than 95% of the generated faults can be detected with few false alarms. At least 80% of the link faults and 65% of the node faults are correctly localised. The performance can be improved by parameter adjustments and by using alternative paths for communication of algorithm control messages.
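As an illustration of how per-link estimates could drive detection settings, the following is a minimal sketch and not the algorithm evaluated in the paper. It assumes a simple mean and standard deviation summary of probe delays, independent packet drops, and a user-supplied false-alarm target; the names `estimate_link_model`, `configure_detection`, `max_false_alarm` and `k` are hypothetical.

```python
import math
from statistics import mean, stdev


def estimate_link_model(delays, probes_sent):
    """Estimate per-link parameters from local probe measurements.

    delays: response delays in seconds of the probes that were answered.
    probes_sent: total number of probes sent on the link.
    Assumption: every unanswered probe counts as a dropped packet.
    """
    if probes_sent <= 0 or len(delays) < 2 or len(delays) > probes_sent:
        raise ValueError("need at least two answered probes and a valid probe count")
    return {
        "mean_delay": mean(delays),
        "std_delay": stdev(delays),
        "drop_rate": 1.0 - len(delays) / probes_sent,
    }


def configure_detection(model, max_false_alarm=1e-3, k=3.0):
    """Derive a probe timeout and a consecutive-loss threshold for one link.

    Assumption: the timeout is the mean delay plus k standard deviations,
    and a fault is reported after `losses` consecutive unanswered probes,
    where `losses` is the smallest n with drop_rate ** n <= max_false_alarm
    (drops are treated as independent).
    """
    p = model["drop_rate"]
    if p >= 1.0:
        raise ValueError("no probe was answered; the link model is undefined")
    timeout = model["mean_delay"] + k * model["std_delay"]
    if p <= 0.0:
        losses = 1  # no drops observed: a single lost probe is already unusual
    else:
        losses = max(1, math.ceil(math.log(max_false_alarm) / math.log(p)))
    return timeout, losses


if __name__ == "__main__":
    link = estimate_link_model([0.010, 0.012, 0.011, 0.013], probes_sent=5)
    timeout, losses = configure_detection(link)
    print(f"timeout={timeout:.4f}s, report fault after {losses} lost probes")
```

In this sketch the user states only an acceptable false-alarm level, and the timeout and loss threshold follow from the conditions observed on each link, which mirrors the idea of specifying expected performance rather than individual parameter values.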