Skip to Main Content
Cloud computing has become increasingly popular by obviating the need for users to own and maintain complex computing infrastructure. However, due to their inherent complexity and large scale, production cloud computing systems are prone to various runtime problems caused by hardware and software failures. Autonomic failure detection is a crucial technique for understanding emergent, cloudwide phenomena and self-managing cloud resources for system-level dependability assurance. To detect failures, we need to monitor the cloud execution and collect runtime performance data. These data are usually unlabeled, and thus a prior failure history is not always available in production clouds, especially for newly managed or deployed systems. In this paper, we present an Adaptive Anomaly Detection (AAD) framework for cloud dependability assurance. It employs data description using hypersphere for adaptive failure detection. Based on the cloud performance data, AAD detects possible failures, which are verified by the cloud operators. They are confirmed as either true failures with failure types or normal states. The algorithm adapts itself by recursively learning from these newly verified detection results to refine future detections. Meanwhile, it exploits the observed but undetected failure records reported by the cloud operators to identify new types of failures. We have implemented a prototype of the algorithm and conducted experiments in an on-campus cloud computing environment. Our experimental results show that AAD can achieve more efficient and accurate failure detection than other existing scheme.
Date of Conference: 8-11 Oct. 2012