Skip to Main Content
In large-scale compute cloud systems, component failures become norms instead of exceptions. Failure occurrence as well as its impact on system performance and operation costs are becoming an increasingly important concern to system designers and administrators. When a system fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to effectively detect anomalies from the voluminous amount of noisy, high-dimensional data. The traditional manual approach is time-consuming, error-prone, and not scalable. In this paper, we present an autonomic mechanism for anomaly detection in compute cloud systems. A set of techniques is presented to automatically analyze collected data: data transformation to construct a uniform data format for data analysis, feature extraction to reduce data size, and unsupervised learning to detect the nodes acting differently from others. We evaluate our prototype implementation on an institute-wide compute cloud environment. The results show that our mechanism can effectively detect faulty nodes with high accuracy and low computation overhead.