A Comprehensive Investigation of the Role of Imbalanced Learning for Software Defect Prediction


Abstract:

Context: Software defect prediction (SDP) is an important challenge in the field of software engineering, hence much research work has been conducted, most notably through the use of machine learning algorithms. However, class imbalance, typified by few defective components and many non-defective ones, is a common occurrence that causes difficulties for these methods. Imbalanced learning aims to deal with this problem and has recently been deployed by some researchers, unfortunately with inconsistent results. Objective: We conduct a comprehensive experiment to explore (a) the basic characteristics of this problem; (b) the effect of imbalanced learning and its interactions with (i) data imbalance, (ii) type of classifier, (iii) input metrics and (iv) imbalanced learning method. Method: We systematically evaluate 27 data sets, 7 classifiers, 7 types of input metrics and 17 imbalanced learning methods (including doing nothing) using an experimental design that enables exploration of interactions between these factors and individual imbalanced learning algorithms. This yields 27 × 7 × 7 × 17 = 22491 results. The Matthews correlation coefficient (MCC) is used as an unbiased performance measure (unlike the more widely used F1 and AUC measures). Results: (a) we found that a large majority (87 percent) of 106 public domain data sets exhibit a moderate or low level of imbalance (imbalance ratio < 10; median = 3.94); (b) anything other than low levels of imbalance clearly harms the performance of traditional learning for SDP; (c) imbalanced learning is more effective on the data sets with moderate or higher imbalance, however negative results are always possible; (d) the type of classifier has the most impact on the improvement in classification performance, followed by the imbalanced learning method itself, whereas the type of input metrics is not influential; (e) only 52 percent of the combinations of imbalanced learner and classifier have a significant positive effect.
Conclusion: This paper offers two practical ...
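
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names are hypothetical, and the imbalance ratio is assumed here to be the number of non-defective components divided by the number of defective ones, which the abstract does not state explicitly. The MCC formula is the standard one computed from confusion-matrix counts.

```python
import math


def imbalance_ratio(n_defective, n_non_defective):
    """Assumed definition: majority (non-defective) count over minority (defective) count."""
    if n_defective <= 0:
        raise ValueError("need at least one defective component")
    return n_non_defective / n_defective


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention,
    e.g. when a classifier predicts only one class).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def design_size(n_datasets=27, n_classifiers=7, n_metric_types=7, n_methods=17):
    """Number of results in a full factorial design over the four factors."""
    return n_datasets * n_classifiers * n_metric_types * n_methods


if __name__ == "__main__":
    # Hypothetical data set: 10 defective and 90 non-defective components.
    print("imbalance ratio:", imbalance_ratio(10, 90))
    # A classifier that labels everything non-defective is 90% accurate here,
    # yet its MCC is 0 because it finds no defects at all.
    print("MCC, majority-only prediction:", mcc(tp=0, fp=0, tn=90, fn=10))
    print("factorial design size:", design_size())
```

In this made-up example the majority-only prediction scores an MCC of 0 despite high accuracy, which is the kind of behaviour that motivates choosing a measure that accounts for all four confusion-matrix cells.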
Published in: IEEE Transactions on Software Engineering ( Volume: 45, Issue: 12, 01 December 2019)
Page(s): 1253 - 1269
Date of Publication: 15 May 2018
