Skip to Main Content
Internet backbone networks are under constant flux in order to keep up with demand and offer new features. The pace of change in technology often outstrips the pace of introduction of associated fault monitoring capabilities that are built into today's IP protocols and routers. Moreover, some of these new technologies cross networking layers, raising the potential for unanticipated interactions and service disruptions, which the individual layers' built-in monitoring capabilities may not detect. In these instances, operators typically employ higher layer monitoring techniques such as end-to-end liveness probing to detect lower or cross-layer failures, but lack tools to precisely determine where a detected failure may have occurred. In this paper, we evaluate the effectiveness of using risk modeling to translate high-level failure notifications into lower layer root causes in two specific scenarios in a tier-1 ISP. We show that a simple greedy heuristic works with accuracy exceeding 80 percent for many failure scenarios in simulation, while delivering extremely high precision (greater than 80 percent). We report our operational experience using risk modeling to isolate optical component and MPLS control plane failures in an ISP backbone.