Fault localization is the core element of fault management. A symptom-fault map is commonly used to describe symptom-fault causality in fault reasoning. For Internet service networks, a well-designed monitoring system can effectively correlate observable symptoms (i.e., alarms) with critical network faults (e.g., link failures). However, lost and spurious symptoms can significantly degrade the performance and accuracy of a passive fault localization system. For overlay networks, limited access to the underlying network, together with overlay scalability and dynamics, makes it impractical to build a static overlay symptom-fault map. In this paper, we first propose a novel active integrated fault reasoning (AIR) framework that incrementally incorporates active investigation actions into the passive fault reasoning process, based on an extended symptom-fault-action (SFA) model. Second, we propose an overlay network profile (ONP) to facilitate the dynamic creation of an overlay symptom-fault-action (O-SFA) model, so that the AIR framework can be applied seamlessly to overlay networks (O-AIR). We then elaborate the corresponding fault reasoning and action selection algorithms. Extensive simulations and Internet experiments show that AIR and O-AIR can significantly improve both the accuracy and the performance of fault reasoning for Internet and overlay service networks, especially when the ratio of lost and spurious symptoms is high.
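To make the idea of combining passive symptoms with active investigation concrete, the following is a minimal sketch, not the AIR algorithm itself. It assumes a symptom-fault map stored as a dictionary from each fault to the symptoms it causes, and a hypothetical `probe` callback standing in for an investigation action that checks whether a symptom is really present. All names (`localize`, `probe`, the fault and symptom labels) are illustrative assumptions.

```python
from typing import Callable, Dict, Set


def localize(
    sf_map: Dict[str, Set[str]],
    observed: Set[str],
    probe: Callable[[str], bool],
) -> Set[str]:
    """Return faults whose full symptom set is observed or actively verified.

    Assumption: a fault is accepted only if every symptom it causes is
    either passively observed or confirmed by an active probe.
    """
    confirmed = set()
    for fault in sorted(sf_map):
        symptoms = sf_map[fault]
        # Passive step: only consider faults that explain an observed symptom.
        if not symptoms & observed:
            continue
        # Active step: verify symptoms that were expected but not observed.
        missing = symptoms - observed
        if all(probe(s) for s in sorted(missing)):
            confirmed.add(fault)
    return confirmed


# Hypothetical symptom-fault map.
sf_map = {
    "link_A_down": {"s1", "s2"},
    "link_B_down": {"s2", "s3"},
    "router_C_down": {"s3", "s4"},
}

# Ground truth: link A is down, so s1 and s2 really occur, but s1 was lost.
actual_symptoms = {"s1", "s2"}
observed = {"s2"}

result = localize(sf_map, observed, lambda s: s in actual_symptoms)
print(result)
```

In this example the passive observation `s2` alone cannot separate `link_A_down` from `link_B_down`. Probing the lost symptom `s1` confirms the former, and probing `s3` rejects the latter.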