Intrusion detection datasets play a major role in evaluating machine learning techniques for Intrusion Detection Systems. These datasets are generally very large and contain many noncontributing features and redundant records. Both drawbacks lead to inaccurate intrusion detection and increased computational cost when machine learning techniques are evaluated. Several data cleaning techniques have been proposed to eliminate redundant records and noncontributing features. These techniques reduce the size of the datasets significantly and bring the characteristics of the data closer to those of intrusions in a real network. This paper identifies anomaly problems in both the normal and the attack data, and proposes an ellipsoid-based technique to detect these anomalies and clean the intrusion detection datasets further. The publicly available KDD'99 and NSL-KDD datasets are used to demonstrate the performance of the technique. The technique also reveals an interesting property of the NSL-KDD dataset, namely monotonically decreasing behavior.
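To give a rough sense of what an ellipsoid-based anomaly check can look like, the sketch below fits an ellipsoid to a set of records using the sample mean and covariance, and flags records that fall outside it according to their squared Mahalanobis distance. This is an illustrative assumption, not the method proposed in the paper, whose details are not given here: the function name `ellipsoid_outliers`, the synthetic two-feature data, and the threshold of 13.8 (roughly the 99.9% chi-square quantile for two dimensions) are all hypothetical choices.

```python
import numpy as np

def ellipsoid_outliers(X, threshold):
    """Flag rows of X outside the ellipsoid (x - mu)^T S^-1 (x - mu) <= threshold.

    mu and S are the sample mean and covariance of X. This is a generic
    Mahalanobis-distance check, assumed here purely for illustration.
    """
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # The pseudo-inverse tolerates singular covariance matrices, which
    # redundant (linearly dependent) features can produce.
    cov_inv = np.linalg.pinv(cov)
    diff = X - mu
    # Squared Mahalanobis distance of every row to the ellipsoid center.
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return d2 > threshold, d2

# Synthetic stand-in for two numeric features of one traffic class
# (not real KDD'99 or NSL-KDD records), plus one injected anomaly.
rng = np.random.default_rng(0)
normal = rng.normal(size=(200, 2))
data = np.vstack([normal, [[25.0, 25.0]]])

mask, d2 = ellipsoid_outliers(data, threshold=13.8)
cleaned = data[~mask]  # records kept after removing flagged anomalies
```

Fitting one ellipsoid per class (normal traffic and each attack type) would be one natural way to apply such a check to labeled data, though whether the paper does so is not stated in this abstract.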