Skip to Main Content
Enterprise software systems are large and complex with limited support for automated root-cause analysis. Avoiding system downtime and loss of revenue dictates a fast and efficient root-cause analysis process. Operator practice and academic research have shown that about 80% of failures in such systems have recurrent causes; therefore, significant efficiency gains can be achieved by automating their identification. In this paper, we present a novel approach to modelling features of log files. This model offers a compact representation of log data that can be efficiently extracted from large amounts of monitoring data. We also use decision-tree classifiers to learn and classify symptoms of recurrent faults. This representation enables automated fault matching and, in addition, enables human investigators to understand manifestations of failure easily. Our model does not require any access to application source code, a specification of log messages, or deep application knowledge. We evaluate our proposal using fault-injection experiments against other proposals in the field. First, we show that the features needed for symptom definition can be extracted more efficiently than does related work. Second, we show that these features enable an accurate classification of recurrent faults using only standard machine learning techniques. This enables us to identify accurately up to 78% of the faults in our evaluation data set.