Two of the most challenging problems in data mining are learning from imbalanced datasets and from datasets with a large number of attributes. In this study we compare three approaches for handling class imbalance and high dimensionality simultaneously. The first approach performs sampling followed by feature selection, with the training data built from the selected features and the original (unsampled) instances. The second approach is similar, except that the training data is built from the sampled instances (and the selected features). In the third approach, feature selection takes place before sampling, and the training data is based on the sampled instances. To compare these three approaches, we use seven groups of datasets covering different application domains, employ nine feature rankers from three different families, and inject artificial class noise to better simulate real-world datasets. The results differ from those of an earlier work and show that the first and third approaches perform, on average, better than the second.
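The three pipelines differ only in the order of sampling and feature selection and in which instances reach the classifier. The sketch below illustrates that ordering; the abstract does not name a specific sampler or ranker, so the random undersampler and the mean-difference feature scorer here are hypothetical stand-ins, not the methods evaluated in the study.

```python
import random

def undersample(X, y, seed=0):
    """Illustrative random undersampling: keep every minority-class
    instance plus an equal-size random subset of the majority class."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep = minority + rng.sample(majority, len(minority))
    return [X[i] for i in keep], [y[i] for i in keep]

def rank_features(X, y, k):
    """Toy feature ranker (assumption, not one of the paper's nine):
    score each feature by the absolute difference of its
    class-conditional means and keep the top-k feature indices."""
    mean = lambda vals: sum(vals) / len(vals)
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    return sorted(range(len(X[0])), key=lambda j: -scores[j])[:k]

def project(X, feats):
    """Restrict every instance to the selected feature indices."""
    return [[row[j] for j in feats] for row in X]

def approach_1(X, y, k):
    # Sample, select features on the sampled data,
    # but train on the ORIGINAL (unsampled) instances.
    Xs, ys = undersample(X, y)
    feats = rank_features(Xs, ys, k)
    return project(X, feats), y

def approach_2(X, y, k):
    # Sample, select features on the sampled data,
    # and train on the SAMPLED instances.
    Xs, ys = undersample(X, y)
    feats = rank_features(Xs, ys, k)
    return project(Xs, feats), ys

def approach_3(X, y, k):
    # Select features on the original data first,
    # then sample; train on the sampled instances.
    feats = rank_features(X, y, k)
    Xs, ys = undersample(X, y)
    return project(Xs, feats), ys
```

Note that approaches 1 and 2 share the same selected feature subset (chosen from sampled data) but differ in training instances, while approach 3 chooses features from the full, imbalanced data before sampling.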