Comparison of approaches to alleviate problems with high-dimensional and class-imbalanced data

Authors:

Shanab, A.A. (Florida Atlantic Univ., Boca Raton, FL, USA) ; Khoshgoftaar, T.M. ; Wald, R. ; Van Hulse, J.

Two of the most challenging problems in data mining are working with imbalanced datasets and with datasets that have a large number of attributes. In this study we compare three approaches for handling class imbalance and high dimensionality simultaneously. The first approach applies sampling followed by feature selection, and then builds the training data from the selected features and the original (unsampled) data. The second approach is similar, except that it builds the training data from the sampled data (with the selected features). In the third approach, feature selection takes place before sampling, and the training data is based on the sampled data. To compare these three approaches, we use seven groups of datasets covering different application domains, employ nine feature rankers from three different families, and generate artificial class noise to better simulate real-world datasets. The results differ from those of an earlier work and show that the first and third approaches perform, on average, better than the second approach.
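
The three orderings described in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not name the sampling technique or the nine rankers, so random undersampling and a simple difference-of-class-means ranker are used here as stand-ins, and all function names (`undersample`, `rank_features`, `approach_1`, and so on) are hypothetical.

```python
import random

def undersample(rows, labels, seed=0):
    """Assumed sampler: random undersampling to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n = min(len(ix) for ix in by_class.values())
    keep = sorted(i for ix in by_class.values() for i in rng.sample(ix, n))
    return [rows[i] for i in keep], [labels[i] for i in keep]

def rank_features(rows, labels, k):
    """Assumed ranker: score = |mean(class 1) - mean(class 0)|; return top-k indices."""
    def mean(vals):
        return sum(vals) / len(vals)
    scores = []
    for j in range(len(rows[0])):
        pos = [r[j] for r, y in zip(rows, labels) if y == 1]
        neg = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append(abs(mean(pos) - mean(neg)))
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
    return sorted(top)

def project(rows, feats):
    """Keep only the selected feature columns."""
    return [[r[j] for j in feats] for r in rows]

def approach_1(rows, labels, k, seed=0):
    # Sample, select features on the sampled data,
    # then train on the ORIGINAL (unsampled) rows with those features.
    s_rows, s_labels = undersample(rows, labels, seed)
    feats = rank_features(s_rows, s_labels, k)
    return project(rows, feats), labels, feats

def approach_2(rows, labels, k, seed=0):
    # Sample, select features on the sampled data,
    # then train on the SAMPLED rows with those features.
    s_rows, s_labels = undersample(rows, labels, seed)
    feats = rank_features(s_rows, s_labels, k)
    return project(s_rows, feats), s_labels, feats

def approach_3(rows, labels, k, seed=0):
    # Select features on the original data first,
    # then sample the reduced data and train on the sampled rows.
    feats = rank_features(rows, labels, k)
    s_rows, s_labels = undersample(project(rows, feats), labels, seed)
    return s_rows, s_labels, feats

# Toy imbalanced data (5 positives, 20 negatives); only feature 0 is informative.
_rng = random.Random(1)
labels = [1] * 5 + [0] * 20
rows = [[3 * y + _rng.random(), _rng.random(), _rng.random()] for y in labels]

X1, y1, f1 = approach_1(rows, labels, k=1)
X2, y2, f2 = approach_2(rows, labels, k=1)
X3, y3, f3 = approach_3(rows, labels, k=1)
print(len(X1), len(X2), len(X3))  # training-set sizes for the three approaches
```

The only differences between the three functions are which data the ranker sees (sampled or original) and which rows end up in the training set (original or sampled), which is the distinction the study evaluates.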

Published in:

Information Reuse and Integration (IRI), 2011 IEEE International Conference on

Date of Conference:

3-5 Aug. 2011