Skip to Main Content
Using classification methods to predict software defect proneness with static code attributes has attracted a great deal of attention. The class-imbalance characteristic of software defect data makes the prediction much difficult; thus, a number of methods have been employed to address this problem. However, these conventional methods, such as sampling, cost-sensitive learning, Bagging, and Boosting, could suffer from the loss of important information, unexpected mistakes, and overfitting because they alter the original data distribution. This paper presents a novel method that first converts the imbalanced binary-class data into balanced multiclass data and then builds a defect predictor on the multiclass data with a specific coding scheme. A thorough experiment with four different types of classification algorithms, three data coding schemes, and six conventional imbalance data-handling methods was conducted over the 14 NASA datasets. The experimental results show that the proposed method with a one-against-one coding scheme is averagely superior to the conventional methods.
Note: In the print edition, this paper appears with the incorrect publication title in the running head. The correct publication title is IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS-PART C: APPLICATIONS AND REVIEWS.