Service fault management in distributed computer systems and networks is a difficult task that requires highly efficient inference from massive amounts of data. In this paper, we propose a solution to this problem. First, the challenges of service fault management in distributed systems are analyzed, and a multilayer model is recommended. Then, a dependency matrix representing the causal relationship between faults and probes is defined, and the fault management framework is built. Next, a service fault management scheme using active probing is proposed. The scheme is composed of two phases: fault detection and fault localization. In the first phase, we propose a probe selection algorithm that selects a minimal set of probes while retaining a high probability of fault detection. In the second phase, we propose a fault localization probe selection algorithm that selects probes to obtain more system information based on the symptoms observed in the first phase. Finally, a worked instance demonstrates the validity and efficiency of our scheme.
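To make the two phases concrete, the following is a minimal sketch of probe selection over a dependency matrix. It is not the algorithm of the paper, which the abstract does not specify. It substitutes a standard greedy set-cover heuristic for detection and a candidate-halving heuristic for localization, and it assumes a deterministic matrix and at most one fault at a time. The matrix `D`, the probe names `p1` to `p4`, the fault names `f1` to `f4`, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Set

# Hypothetical dependency matrix: one row per probe, one column per fault.
# D[p][i] == 1 means probe p fails when fault FAULTS[i] is present.
FAULTS: List[str] = ["f1", "f2", "f3", "f4"]
D: Dict[str, List[int]] = {
    "p1": [1, 1, 0, 0],
    "p2": [0, 1, 1, 0],
    "p3": [0, 0, 1, 1],
    "p4": [1, 0, 0, 1],
}


def covered(probe: str) -> Set[str]:
    """Return the set of faults that would make this probe fail."""
    return {f for f, bit in zip(FAULTS, D[probe]) if bit}


def select_detection_probes() -> List[str]:
    """Phase 1 (assumed heuristic): greedy set cover over the faults.

    Repeatedly pick the probe that covers the most still-uncovered faults,
    so that every fault is detected by at least one selected probe.
    """
    uncovered = set(FAULTS)
    chosen: List[str] = []
    while uncovered:
        best = max(sorted(D), key=lambda p: len(covered(p) & uncovered))
        gain = covered(best) & uncovered
        if not gain:  # remaining faults are not observable by any probe
            break
        chosen.append(best)
        uncovered -= gain
    return chosen


def localize(results: Dict[str, bool]) -> Set[str]:
    """Return faults consistent with probe results (True means the probe failed).

    Assumes a single fault: a failed probe keeps only the faults it covers,
    a successful probe rules out every fault it covers.
    """
    candidates = set(FAULTS)
    for probe, failed in results.items():
        if failed:
            candidates &= covered(probe)
        else:
            candidates -= covered(probe)
    return candidates


def next_localization_probe(candidates: Set[str], used: Set[str]) -> Optional[str]:
    """Phase 2 (assumed heuristic): pick the unused probe that splits the
    candidate set most evenly, so either outcome narrows the search."""
    remaining = [p for p in sorted(D) if p not in used]
    if not remaining:
        return None
    half = len(candidates) / 2
    return min(remaining, key=lambda p: abs(len(covered(p) & candidates) - half))


detection_set = select_detection_probes()
suspects = localize({"p1": True, "p3": False})
extra_probe = next_localization_probe(suspects, set(detection_set))
```

With this toy matrix, two probes suffice to cover all four faults. If `p1` fails and `p3` succeeds, two suspects remain, and one further probe separates them. A probabilistic dependency matrix or a multiple-fault setting would need a different scoring rule, such as expected information gain.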