Three important data characteristics that can substantially impact a data mining project are class imbalance, poor data quality and the size of the training dataset. Data sampling is a commonly used method for improving learner performance when data is imbalanced. However, little effort has been made to investigate the performance of data sampling techniques when data is both noisy and imbalanced. In this work, we present a comprehensive empirical investigation of how data sampling techniques react to changes in four training dataset characteristics: dataset size, class distribution, noise level and noise distribution. We report the performance of four common data sampling techniques in combination with 11 learning algorithms. The results, based on an extensive suite of experiments in which over 15 million models were trained and evaluated, show that data sampling can be very effective at dealing with the combined problems of noise and imbalance. In addition, we identify the dataset characteristics that have the greatest impact on each of the data sampling techniques.
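As a rough illustration of the setting described above, the following minimal sketch builds a small imbalanced binary dataset, flips a fraction of its labels to simulate class noise, and then balances it by random undersampling. Random undersampling is used here only as an assumed example of a data sampling technique; the abstract does not name the four techniques studied. The function names, the 5% minority rate, the 10% noise rate and the synthetic one-feature data are all hypothetical choices made for this sketch and are not taken from the experiments.

```python
import random
from collections import Counter


def inject_class_noise(labels, noise_rate, rng):
    """Flip each binary label (0/1) independently with probability noise_rate."""
    return [1 - y if rng.random() < noise_rate else y for y in labels]


def random_undersample(rows, labels, rng):
    """Randomly keep the same number of examples from every class.

    The number kept per class equals the size of the smallest class,
    so majority-class examples are discarded at random.
    """
    counts = Counter(labels)
    minority_size = min(counts.values())
    kept = []
    for cls in counts:
        indices = [i for i, y in enumerate(labels) if y == cls]
        kept.extend(rng.sample(indices, minority_size))
    kept.sort()  # preserve the original ordering of the retained examples
    return [rows[i] for i in kept], [labels[i] for i in kept]


# Hypothetical data: 1,000 examples, 5% in the positive (minority) class.
rng = random.Random(0)
labels = [1] * 50 + [0] * 950
rows = [[rng.gauss(2.0 * y, 1.0)] for y in labels]  # one synthetic feature

# Simulate class noise (assumed 10% rate), then balance the noisy data.
noisy_labels = inject_class_noise(labels, 0.10, rng)
balanced_rows, balanced_labels = random_undersample(rows, noisy_labels, rng)

print("before sampling:", Counter(noisy_labels))
print("after sampling: ", Counter(balanced_labels))
```

Note that the sampler only sees the noisy labels, so mislabeled majority examples can end up in the retained minority class. This is one way in which noise level and class distribution can interact with a sampling technique.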