Skip to Main Content
In large-scale cloud computing systems, even a simple user request may go through numerous of services that are deployed on different physical machines. As a result, it is a great challenge to online localize the prime causes of performance degradation in such systems. Existing end-to-end request tracing approaches are not suitable for online anomaly detection because their time complexity is exponential in the size of the trace logs. In this paper, we propose an approach, namely Magnifier, to rapidly diagnose the source of performance degradation in large-scale non-stop cloud systems. In Magnifier, the execution path graph of a user request is modeled by a hierarchical structure including component layer, module layer and function layer, and anomalies are detected from higher layer to lower layer separately. In each layer every node is assigned a newly created identifier in addition to the global identifier of the request, which significantly decreases the size of parsing trace logs and accelerates the anomaly detection process. We conduct extensive experiments over a real-world enterprise system (the Alibaba cloud computing platform) providing services for the public. The results show that Magnifier can locate the prime causes of performance degradation more accurately and efficiently.