Erroneous attribute values can significantly impact learning from otherwise valuable data. This impact can be exacerbated when the training data are also class imbalanced. We investigate and compare the overall learning impact of sampling such data, using four distinct performance metrics suitable for models built from binary class-imbalanced data. Seven relatively noise-free, class-imbalanced software engineering measurement datasets were used. A novel noise injection procedure was applied to these datasets: we injected domain-realistic noise into the independent and dependent (class) attributes of randomly selected instances to simulate lower-quality measurement data. Seven well-known data sampling techniques were used with the benchmark decision-tree learner C4.5. No other related studies were found that have comprehensively investigated learning by sampling low-quality, binary class-imbalanced data in which both independent and dependent attributes are corrupted. Two sampling techniques (random undersampling and Wilson's editing) were identified as having better and more robust learning performance. In contrast, all metrics concurred in identifying cluster-based oversampling as the worst-performing sampling technique.
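As an illustration of one of the sampling techniques named above, random undersampling can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure used in the study: the function name `random_undersample`, the `ratio` parameter, and the class labels `"fp"` and `"nfp"` are hypothetical and chosen only for the example.

```python
import random
from collections import Counter


def random_undersample(rows, labels, ratio=1.0, seed=0):
    """Randomly discard majority-class instances.

    After sampling, the majority class holds about `ratio` times as many
    instances as the minority class. All names here are illustrative
    assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")

    # most_common orders the classes from most to least frequent.
    (majority, n_major), (minority, n_minor) = counts.most_common(2)

    # Never try to keep more majority instances than actually exist.
    keep_major = min(n_major, int(round(ratio * n_minor)))

    major_idx = [i for i, y in enumerate(labels) if y == majority]
    kept = set(rng.sample(major_idx, keep_major))

    # Keep every minority instance and only the sampled majority ones,
    # preserving the original instance order.
    selected = [i for i, y in enumerate(labels) if y == minority or i in kept]
    return [rows[i] for i in selected], [labels[i] for i in selected]


if __name__ == "__main__":
    # Toy data: 90 majority ("nfp") and 10 minority ("fp") instances.
    toy_rows = [[i] for i in range(100)]
    toy_labels = ["nfp"] * 90 + ["fp"] * 10
    _, new_labels = random_undersample(toy_rows, toy_labels)
    print(Counter(new_labels))
```

With `ratio=1.0` the two classes end up the same size. Only majority instances are discarded, so every minority instance, noisy or not, is retained.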
Date of Conference: 11-13 Dec. 2008