Existing performance management tools for large-scale distributed web services detect anomalies in performance metric behavior by thresholding the metrics, an approach that often leads to high false alarm rates, is hard to interpret, and misses multimodal performance behavior. We provide an information-theoretic approach to detecting anomalies in metric behavior that takes into account the temporal and spatial relationships among the metrics. We model the metrics using a parametric mixture distribution such that each component of the mixture represents a homogeneous segment of temporally contiguous metric behavior. We discover the number, parameters, and (temporal) locations of the segments (i.e., mixture components) by minimizing the relative entropy between the mixture model and the unknown underlying distribution of the metrics. We then cluster the discovered segments based on the statistical distances between them to detect both anomalous performance behavior and the modes of typical metric behavior.
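
The final clustering step can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not state: the segments are taken as already discovered (the relative-entropy segmentation itself is not shown), each segment is summarized by a univariate Gaussian, the statistical distance is the symmetric Kullback-Leibler divergence, and segments are grouped by a greedy single-link rule with a hypothetical `threshold`. All function names and the sample data are illustrative only.

```python
import math
from statistics import fmean, pvariance


def fit_gaussian(samples):
    """Summarize a segment as (mean, variance), with a small variance floor."""
    mu = fmean(samples)
    var = max(pvariance(samples, mu), 1e-9)
    return mu, var


def kl_gauss(p, q):
    """KL(p || q) for univariate Gaussians given as (mean, variance)."""
    (mu_p, var_p), (mu_q, var_q) = p, q
    return 0.5 * (math.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)


def sym_kl(p, q):
    """Symmetric KL divergence, used here as the assumed segment distance."""
    return kl_gauss(p, q) + kl_gauss(q, p)


def cluster_segments(segments, threshold):
    """Greedy single-link grouping: segments closer than threshold share a label."""
    params = [fit_gaussian(s) for s in segments]
    labels = list(range(len(params)))
    for i in range(len(params)):
        for j in range(i):
            if sym_kl(params[i], params[j]) < threshold:
                old, new = labels[i], labels[j]
                labels = [new if lab == old else lab for lab in labels]
    return labels


def anomalous_segments(labels):
    """Indices of segments that ended up alone in their cluster."""
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    return [i for i, lab in enumerate(labels) if counts[lab] == 1]


# Hypothetical latency-like segments: two recurring modes and one outlier.
segments = [
    [10.0, 10.5, 9.5, 10.2, 9.8],
    [50.0, 51.0, 49.0, 50.5, 49.5],
    [10.1, 9.9, 10.4, 9.6, 10.0],
    [50.2, 49.8, 50.9, 49.1, 50.0],
    [200.0, 201.0, 199.0, 200.5, 199.5],
]
labels = cluster_segments(segments, threshold=5.0)
outliers = anomalous_segments(labels)
```

In this sketch, clusters containing several segments play the role of modes of typical behavior, while a segment that joins no cluster is reported as anomalous. The threshold value is arbitrary and would need to be chosen for real data.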