Two challenges often encountered in data mining are the presence of excessive features in a data set and unequal numbers of examples in the two classes of a binary classification problem. In this paper, we propose a novel approach to feature selection for imbalanced data in the context of software quality engineering. The technique repeatedly applies data sampling followed by feature ranking, and then aggregates the results generated across the repetitions. This repetitive feature selection method is compared with two other approaches: one applies a filter-based feature ranking technique alone to the original data, while the other applies data sampling and feature ranking together only once. The empirical validation is carried out on two groups of software data sets. The results demonstrate that the proposed repetitive feature selection method performs, on average, significantly better than the other two approaches, especially when the data set is highly imbalanced.
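The sample-rank-aggregate loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the sampling technique, the ranking metric, or the aggregation rule, so the sketch assumes random undersampling of the majority class, a simple difference-of-class-means filter score, and mean-rank aggregation. All function names and parameters (`undersample`, `rank_features`, `repetitive_feature_selection`, `repetitions`, `top_k`) are hypothetical.

```python
import random
from statistics import mean


def undersample(X, y, rng):
    """Assumed sampler: keep every minority row and draw an equal number
    of majority rows at random (labels are 0 or 1)."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [X[i] for i in keep], [y[i] for i in keep]


def rank_features(X, y):
    """Assumed filter metric: score each feature by the absolute difference
    between its class means. Returns one rank per feature (1 = best)."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        ones = [row[j] for row, label in zip(X, y) if label == 1]
        zeros = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(ones) - mean(zeros)))
    order = sorted(range(n_features), key=lambda j: -scores[j])
    ranks = [0] * n_features
    for position, j in enumerate(order, start=1):
        ranks[j] = position
    return ranks


def repetitive_feature_selection(X, y, repetitions=10, top_k=2, seed=0):
    """Repeat sampling and ranking, then aggregate by mean rank (assumed
    aggregation rule). Returns the indices of the top_k features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    rank_totals = [0] * n_features
    for _ in range(repetitions):
        X_sampled, y_sampled = undersample(X, y, rng)
        for j, rank in enumerate(rank_features(X_sampled, y_sampled)):
            rank_totals[j] += rank
    mean_ranks = [total / repetitions for total in rank_totals]
    return sorted(range(n_features), key=lambda j: mean_ranks[j])[:top_k]
```

Calling `repetitive_feature_selection(X, y)` with `X` as a list of numeric feature rows and `y` as a list of 0/1 labels returns the indices of the features with the best average rank. The sketch expects both classes to be present in `y`. The two baselines mentioned in the abstract correspond to calling `rank_features` once on the original data, and to calling `undersample` followed by `rank_features` a single time.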
Date of Conference: 4-6 Aug. 2010