Networked computer systems continue to grow in scale and in the complexity of their components and interactions. In these environments, component failures are the norm rather than the exception. A failure makes one or more computers unavailable, which degrades resource utilization and system throughput. When a computer fails to function properly, health-related data are valuable for troubleshooting. However, it is challenging to identify anomalies effectively in large volumes of noisy, high-dimensional data. In this paper, we present auto-AID, an autonomic mechanism for anomaly identification in networked computer systems. It is composed of a set of data mining techniques that facilitate automatic analysis of system health data. The identification results help system administrators manage systems and schedule the available resources. We implement a prototype of auto-AID and evaluate it on a production institution-wide compute grid. The results show that auto-AID can effectively identify anomalies with little human intervention.