Conditional Domain Adversarial Adaptation for Heterogeneous Defect Prediction

Heterogeneous defect prediction (HDP) has become an active research field in software engineering. It predicts the bug-suspicious modules of a target project using prediction models built on a source project with a heterogeneous metric set. Several HDP models with promising performance have been proposed. Most existing HDP models adopt unsupervised transfer learning to map the source and target projects into the same feature space, which considers only the metric space and ignores the label information from the source project and the small labeled part of the target project. Meanwhile, the predictive ability of these HDP models in the effort-aware context has not been compared. Therefore, we set out to investigate the effectiveness of label information for HDP and to propose an HDP model that improves predictive performance in both the classification and effort-aware contexts. To exploit this label information, we propose a novel conditional domain adversarial adaptation (CDAA) approach, motivated by generative adversarial networks (GANs), to tackle the heterogeneous problem in software defect prediction (SDP). The architecture of CDAA contains three networks: a generator, a discriminator, and a classifier. The generator learns to map the source instance space to the target instance space; the discriminator learns to identify the fake instances produced by the generator; and the classifier learns to correctly predict the labels of instances. In CDAA, the losses of both the classifier and the discriminator are back-propagated to the generator. To ensure a fair comparison between state-of-the-art methods and CDAA, we use AUC, MCC, and $P_{opt}$ as measures on 28 open-source projects. Experimental results demonstrate that CDAA can exploit label information to effectively map the source project to the target project and improve predictive performance.
Experimental results also demonstrate that our CDAA method is not affected by the number of metrics shared between the source and target projects.


I. INTRODUCTION
Software defect prediction (SDP) aims to detect as many defective modules as possible in a software project by learning models trained on sufficient historical labeled instances [1]-[5], and it has attracted widespread attention from both industrial and academic communities [6], [7]. Prior studies have shown that defect prediction models built on sufficient historical labeled instances can achieve good performance. Unfortunately, it is difficult to obtain a sufficient number of labeled training instances in practical software development, especially for a project in its initial stage [8]-[10].
Cross-project defect prediction (CPDP) is one way to tackle the lack of labeled instances in a project: it builds a learning model on the labeled instances of other projects [2], [10], [11]. Zimmermann et al. [8] first pointed out that, due to differences in data distribution, CPDP remained a tricky problem, based on an investigation of 622 cross-project combinations. Since then, many CPDP approaches have been proposed to address these distribution differences, such as VCB-SVM [12], TCBoost [13], DTB [14], MNB [15], and HYDRA [9]. However, these methods assume that the training and test projects have the same metric set. In practice, since projects are built in different development environments and languages, different metric sets are extracted from them (e.g., the NASA repository has 37 metrics, while the MORPH repository has only 20). Thus, existing CPDP models are difficult to apply in this scenario, especially for a new software company.
In this scenario, the biggest obstacle is the difference between the metric sets of the source and target projects, which is called the heterogeneous problem. To tackle it, many heterogeneous defect prediction (HDP) models have been proposed [16], [17], [18]. They mainly employ different techniques to transfer the source and target metric spaces into the same feature space. For example, Jing et al. [17] proposed a transfer canonical correlation analysis (CCA+) approach to handle the heterogeneous problem. Nam et al. [18] proposed the HDP-KS approach, which combines feature selection and feature matching. Meanwhile, Li et al. [28] presented a cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) method by introducing cost sensitivity and kernels into CCA+. However, the heterogeneous problem remains a challenge in software defect prediction.
Therefore, our study also tackles the heterogeneous problem in SDP. To make our work easier to follow, we list descriptions of the specific terms in Table 1.

A. MOTIVATION
As mentioned above, due to different development languages and environments, different projects may have different metric spaces (including different numbers of metrics). Most existing HDP models employ only the knowledge of metrics to transfer the source and target projects into the same feature space.
Firstly, when these HDP methods learn transferable representations between source and test instances, they do not consider the label information of the source instances. The label information encodes the relation between metrics and the defect (or non-defect) label. After the source instances are transferred into the common space, the relation between the transferred instance space and the original label information may change. Thus, the performance of machine learning classifiers trained on these transferred instances may be hindered.
Secondly, in some cases the target project has a small number of labeled instances (called target training data), which encode the relation between metrics and the defect (or non-defect) label. Meanwhile, Turhan et al. [19] pointed out that prediction models trained on datasets that include a small number of target training instances outperform models trained only on source datasets. However, the effectiveness of target training data for HDP has not been investigated.
Finally, providing a ranked list of test modules would be more useful for testers. Most models are evaluated only in the defect-classification context; their effectiveness in the effort-aware context has not been investigated.
Therefore, our study sets out to investigate and employ the label information from the source data and the target training data to achieve better transferable representations between source and test instances, in both the defect-classification and effort-aware contexts.

B. CONTRIBUTION
Motivated by these observations, the main contributions of our study are summarized as follows:
• To take full advantage of the label information from source and target training instances, we introduce GANs into SDP and propose a conditional domain adversarial adaptation (CDAA) approach. CDAA consists of three networks that can learn the label information and produce a classifier while transferring the source instance space into the target instance space.
• We conduct extensive experiments on 28 public projects from five repositories, covering NASA [19], [20], AEEEM [7], SOFTLAB [19], ReLink [21], and MORPH [40], to evaluate our CDAA method in the defect-classification and effort-aware contexts. The experimental results demonstrate that CDAA can improve HDP prediction performance by using label information.
The rest of our study is organized as follows. We review related work on SDP learning models in Section II; we describe our proposed CDAA method in Section III; we describe our experimental setup in Section IV; experimental results and analysis are presented in Section V; in Section VI, we describe the threats to the validity of our approach; finally, Section VII gives the conclusions and future directions.

II. RELATED WORK
We briefly review the CPDP models using common metric sets, HDP models, and domain adversarial networks in this section.

A. CPDP USING COMMON METRIC SETS
In real software development, CPDP learning models become imperative when a project does not have abundant historical labeled instances in its early stage. Many researchers have focused on designing effective learning models for CPDP [13], [22], [23].
Zimmermann et al. [24] conducted a broad-scale study of CPDP learning models to determine which metrics (domain or process) were more beneficial for cross-project prediction. Through 622 combination experiments, they found that testers should first quantify and evaluate the code, process, and domain metrics before training a CPDP learning model.
After that, some researchers proposed to solve the distribution-difference problem by searching for similar instances. For example, Turhan et al. [19] first proposed to select similar instances from other projects as training data and built the learning model with K-nearest neighbors (KNN). Then, Liu et al. [25] proposed another search-based learning model using multiple projects; they applied a genetic algorithm to search, with baseline, validation, and validation-and-voting classifiers. Canfora et al. [26] also proposed a multi-objective logistic regression learning model based on a genetic algorithm. Xia et al. [9] proposed a Hybrid model reconstruction Approach (HYDRA) to build CPDP learning models, generating a large number of classifiers to predict defective modules through genetic-learning and ensemble-learning phases. Recently, Hosseini et al. [27] proposed a novel search-based method to select instances: they first applied an NN-filter to extract a validation dataset from the training instances, then used a genetic algorithm to select instances as the training set, and finally applied Naive Bayes as the classifier to predict the test project.
Moreover, some researchers have proposed mixed-project defect prediction models that improve performance by using a small number of labeled instances from the test project. For example, Turhan et al. [15] first investigated the effectiveness of mixed-project prediction and compared it with an NB-classifier-based WPDP learning model; their results demonstrated that mixed-project models performed as well as WPDP models in the early stages of a project. Chen et al. [14] proposed a double transfer boosting (DTB) method that integrates two layers of data transfer: they first applied data gravitation to reshape the overall distribution of cross-company (CC) instances, and then eliminated negative-effect instances with an NB-classifier-based transfer boosting method. Ryu et al. [12] presented a value-cognitive boosting method based on support vector machines (VCB-SVM) to build CPDP learning models, which simultaneously samples and modifies instances to improve performance. In addition, Ryu et al. [13] presented a transfer cost-sensitive boosting (TCSBoost) method that builds a learning model using a few labeled target instances.
However, existing CPDP learning models assume that the training and test projects share the same metric space. In practice, for various reasons, the number of common metrics is often very small, which hinders the application of these CPDP learning models.

B. HDP MODELS
In recent years, some researchers have focused on the heterogeneous metrics problem and proposed HDP models. For example, Jing et al. [17] proposed a canonical correlation analysis (CCA+)-based HDP learning model. They first defined a unified metric representation (UMR) for the metric spaces of the training and test projects, and then used canonical correlation analysis (CCA) to project the UMR instances into the same metric space.
After that, Cheng et al. [16] proposed a cost-sensitive correlation transfer support vector machine (CCT-SVM) method to build a HDP learning model. Based on CCT-SVM method, Li et al. [28] presented a novel cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) method to build a HDP learning model, which simultaneously considered linearly inseparable and class imbalance problems.
However, these existing HDP models mainly use the feature spaces of the source and test projects to solve the heterogeneous problem. When transferring the feature space, the label information from the source project and from the small labeled part of the test project is not considered. Thus, our paper leverages this label information to build a conditional domain adversarial adaptation model.

C. DOMAIN ADVERSARIAL NETWORKS
Machine learning techniques assume the same distributional characteristics between the training and test datasets. If this condition is not met, classifier performance degrades. To bridge this gap, transfer learning transfers feature knowledge from the training to the test dataset by weakening these assumptions. Deep transfer learning studies transferable knowledge with deep neural networks and has received much attention in recent years.
Among these deep transfer learning techniques, adversarial-based deep transfer learning finds transferable representations by introducing the adversarial technology of generative adversarial nets (GANs) [29]. Because of its good predictive performance and strong practicality, researchers have proposed various adversarial-based deep transfer learning networks. For example, Chen et al. [30] introduced adversarial technology into transfer learning and added a domain-adaptive regularization term to the loss function. Ganin and Lempitsky [31] extended standard layers with a new gradient reversal layer to build adversarial feed-forward networks. Tzeng et al. [32] combined the GAN loss with a discriminator and proposed a new domain adaptation approach.
In contrast to the adversarial-based deep transfer learning approaches mentioned above, we build a conditional domain adversarial adaptation model to solve the heterogeneous problem in SDP.

III. OUR PROPOSED APPROACH
First, to make our approach easier to understand, we define the notation used in our work. Then we present our proposed conditional domain adversarial adaptation (CDAA) approach, including its architecture and training process.

A. NOTATION
When a new software company develops a new project, it may not have enough labeled instances to build a WPDP or CPDP learning model. One option is to employ defect datasets from open-source projects. Different projects have different metric sets, which is called the heterogeneous problem. In this scenario, the dataset of the new project is the target dataset, while the datasets of other projects are source datasets.
Suppose that $D_s$ is the labeled instance space from the source dataset, $D_{st}$ is the target training instance space, and $D_t$ is the unlabeled instance space from the target dataset. $D_s$ contains a data set $X_s = \{x_1^s, x_2^s, \cdots, x_N^s\}$ and a label set $Y_s = \{y_1^s, y_2^s, \cdots, y_N^s\}$, where $x_i^s$ denotes the $i$-th module in $X_s$, $y_i^s$ is its associated label, and $N$ is the number of modules in $X_s$. Similarly, $D_{st}$ and $D_t$ contain data sets $X_{st}$ and $X_t$, where $x_i^t$ denotes the $i$-th module in $X_t$ and $M$ is the number of modules in $X_t$. Note that the metric sets of $X_s$ and of $X_{st}$ and $X_t$ are different: $d_s \neq d_t$, where $d_s$ is the number of metrics in $X_s$ and $d_t$ is the number of metrics in $X_{st}$ and $X_t$.
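As a concrete illustration of this notation, the sketch below builds toy source and target datasets with mismatched metric counts. All sizes and values here are hypothetical, chosen only to make the shapes visible:

```python
import random

random.seed(0)

d_s, d_t = 37, 20   # e.g., a NASA-style vs. a MORPH-style metric count
N, M = 6, 4         # toy numbers of source / target modules

# Source project D_s: labeled instances X_s (N x d_s) with labels Y_s.
X_s = [[random.random() for _ in range(d_s)] for _ in range(N)]
Y_s = [random.randint(0, 1) for _ in range(N)]

# Target project: a small labeled training part X_st and an unlabeled
# part X_t, both with d_t metrics (d_s != d_t is the heterogeneous problem).
X_st = [[random.random() for _ in range(d_t)] for _ in range(M)]
X_t = [[random.random() for _ in range(d_t)] for _ in range(M)]

assert len(X_s[0]) != len(X_t[0])   # the metric sets differ in size
```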

B. CONDITIONAL DOMAIN ADVERSARIAL ADAPTATION
In software engineering, some researchers have shown that a few labeled instances from the target project can improve the predictive performance of CPDP learning models [8], [19]. Motivated by this, we introduce generative adversarial networks (GANs) into SDP and build conditional domain adversarial adaptation (CDAA) networks. CDAA can simultaneously learn the marginal distribution of the source project while transferring the feature space of the source project to the feature space of the target project.

1) THE ARCHITECTURE OF CDAA
To use the label information while learning the feature-space transformation, and to obtain a classifier at the same time, our CDAA networks include three networks: a generator (G), a discriminator (D), and a classifier (C). The generator learns a transferable representation from source instances to target instances and can also be viewed as a feature extractor. The discriminator learns to distinguish the source from the target project. The classifier learns to minimize the misclassification loss on the source instances $X_s$ and the target training instances $X_{st}$. Finally, the source instance space is mapped to the target instance space, and the label information is incorporated through back-propagation from the classifier. The architecture is shown in Fig. 1.
In our CDAA networks, the generator, discriminator, and classifier each have three layers: an input layer, a hidden layer, and an output layer. The number of neurons in the first layer of the generator is $d_s$, the number of metrics in the source project. The number of neurons in the first layers of the discriminator and classifier is $d_t$, the number of metrics in the target project. Meanwhile, we add Batch Normalization (BN) between the fully connected layer and the activation function in each layer of these three networks, which keeps the input of each layer in the same distribution during training.
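The layer dimensions described above can be summarized in a small helper. This is a bookkeeping sketch only: the 256-unit hidden layer and single-neuron outputs are taken from the parameter settings in Section IV-D, and the function name is ours:

```python
def layer_sizes(d_s, d_t, hidden=256):
    """Return (input, hidden, output) neuron counts for G, D, and C."""
    generator = (d_s, hidden, d_t)     # maps source metrics into the target metric space
    discriminator = (d_t, hidden, 1)   # scores real-vs-generated instances
    classifier = (d_t, hidden, 1)      # scores defect-proneness
    return generator, discriminator, classifier

# Example: a NASA-style source (37 metrics) and MORPH-style target (20 metrics).
G, D, C = layer_sizes(d_s=37, d_t=20)
assert G == (37, 256, 20) and D == (20, 256, 1) and C == (20, 256, 1)
```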

2) TRAINING OUR NETWORKS
The aim of our CDAA is to obtain a strong classifier while learning a feature representation from the source project to the target project. Thus, during training we seek: the parameters of the generator that maximize the discriminator's loss on generated instances while minimizing the classifier's label loss; the parameters of the discriminator that minimize its misclassification loss on generated and target instances; and the parameters of the classifier that minimize its label loss on generated and target training instances. To meet these criteria, the loss function of the discriminator is defined as:

$$L_D = -\frac{1}{M}\sum_{i=1}^{M}\log D(x_i^t) - \frac{1}{N}\sum_{i=1}^{N}\log\left(1 - D(G(x_i^s))\right)$$

where $G(x^s)$ is produced by the generator. The discriminator minimizes its loss on misclassified generated and target instances. Meanwhile, to take the label information of the source project into account, the classifier's loss on source instances is added to the generator's objective. The loss function of the generator is defined as:

$$L_G = -\frac{1}{N}\sum_{i=1}^{N}\log D(G(x_i^s)) + \frac{1}{N}\sum_{i=1}^{N}\ell\left(C(G(x_i^s)),\, y_i^s\right)$$

Finally, the classifier minimizes the label loss on both generated instances and target training instances. Thus, the loss function of the classifier is defined as:

$$L_C = \frac{1}{N}\sum_{i=1}^{N}\ell\left(C(G(x_i^s)),\, y_i^s\right) + \frac{1}{M'}\sum_{j=1}^{M'}\ell\left(C(x_j^{st}),\, y_j^{st}\right)$$

where $\ell(\cdot,\cdot)$ denotes the cross-entropy loss and $M'$ is the number of target training instances. To clarify the training process, the pseudo code for training CDAA is listed in Algorithm 1.
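As a minimal numeric sketch of these criteria, the following computes the three losses from hypothetical network outputs using standard binary cross-entropy. The specific output values are made up for illustration; only the structure of the objectives follows the description above:

```python
import math

def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Hypothetical outputs: D(x_t) on real target instances, D(G(x_s)) on
# generated instances, classifier outputs C(G(x_s)), and source labels.
D_target = [0.8, 0.7]      # discriminator should push these toward 1 (real)
D_generated = [0.3, 0.4]   # discriminator should push these toward 0 (fake)
C_generated = [0.9, 0.2]   # classifier outputs on generated source instances
y_source = [1, 0]          # source labels

# Discriminator loss: misclassification on real target and generated instances.
loss_D = (sum(bce(p, 1) for p in D_target) / len(D_target)
          + sum(bce(p, 0) for p in D_generated) / len(D_generated))

# Generator loss: fool the discriminator AND keep the classifier accurate on
# source labels -- the second term is how label information reaches G.
loss_G = (sum(bce(p, 1) for p in D_generated) / len(D_generated)
          + sum(bce(p, y) for p, y in zip(C_generated, y_source)) / len(y_source))

# Classifier loss: label loss on generated source instances (the analogous
# target-training term is omitted here for brevity).
loss_C = sum(bce(p, y) for p, y in zip(C_generated, y_source)) / len(y_source)

assert loss_D > 0 and loss_G > loss_C > 0
```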

IV. EXPERIMENTAL SETUP

A. RESEARCH QUESTIONS
To effectively evaluate our proposed CDAA networks, we conduct experiments around the following three research questions:
RQ1: Does the label information from the source and target projects improve the performance of the CDAA method?
RQ2: Does the number of common metrics between the source and target projects affect the performance of the CDAA method?
RQ3: Could our proposed CDAA method perform better than state-of-the-art SDP methods?

B. DATASETS
In our work, we use 28 projects from five repositories, covering NASA, AEEEM, SOFTLAB, ReLink, and MORPH, to conduct experiments. Table 2 lists the details of these projects. The first to sixth columns list the repository name, project, number of metrics, number of instances, number of defective instances, and percentage of defective instances, respectively.
Each project in NASA contains various static code metrics and the corresponding labels. In our study, we use the CM1, MW1, PC1, PC3, and PC4 projects, which are the cleaned versions provided by [20].
Jureczko and Madeyski [22] collected the MORPH repository from the online PROMISE data repository. Each project in MORPH has 20 metrics, including McCabe's cyclomatic metrics and object-oriented metrics.
Table 3 lists the number of common metrics among the different repositories. As shown, NASA, AEEEM, SOFTLAB, ReLink, and MORPH share few common metrics, which makes them suitable for studying HDP.

C. EVALUATION MEASURES
For a fair evaluation of the learning models, we apply AUC, the Matthews correlation coefficient (MCC), and $P_{opt}$ as performance measures. AUC and MCC evaluate predictive performance in the classification context, while $P_{opt}$ evaluates performance in the effort-aware context.
AUC is a threshold-independent measure: the area under the receiver operating characteristic (ROC) curve.
MCC is a measure that considers all four confusion-matrix categories. Chicco and Jurman [41] showed that MCC provides a more informative and truthful score than F1-score and accuracy. MCC is calculated as:

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

$P_{opt}$ is an effort-aware indicator based on the area under the cost-effectiveness (CE) curve. A CE curve of a model is shown in Figure 2; its x-axis is the percentage of LOC reviewed and its y-axis is the percentage of defects discovered. The optimal curve represents the scenario in which code is reviewed in descending order of defect density, where defect density is the ratio of the number of defects to the lines of code in a module. The worst curve represents reviewing code in ascending order of defect density, and the prediction curve represents an actual model $m$. The $P_{opt}$ of a model $m$ is calculated as [42]:

$$P_{opt}(m) = 1 - \frac{Area(optimal) - Area(m)}{Area(optimal) - Area(worst)}$$

For all three measures, larger values indicate a better learning model.
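Both measures can be computed directly from their definitions. The sketch below evaluates MCC from the four confusion-matrix cells and $P_{opt}$ from piecewise-linear CE curves; the curve points are hypothetical, chosen only to illustrate the calculation:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from the four confusion-matrix cells."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def area(points):
    """Trapezoidal area under a piecewise-linear CE curve of (x, y) pairs."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Perfect prediction gives MCC = 1; chance-level prediction gives MCC ~ 0.
assert mcc(tp=10, tn=10, fp=0, fn=0) == 1.0
assert abs(mcc(tp=5, tn=5, fp=5, fn=5)) < 1e-9

# Hypothetical CE curves: x = fraction of LOC reviewed, y = fraction of
# defects discovered.
optimal = [(0, 0), (0.2, 1.0), (1, 1)]   # defects found earliest
worst = [(0, 0), (0.8, 0.0), (1, 1)]     # defects found last
model = [(0, 0), (0.5, 0.7), (1, 1)]     # an actual prediction model m

popt = 1 - (area(optimal) - area(model)) / (area(optimal) - area(worst))
assert 0 <= popt <= 1
```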
In addition, we compute the Wilcoxon signed-rank test [43], [44] and Cohen's d effect size [27], [45] between the CDAA method and the other compared methods to statistically compare the experimental results.
The Wilcoxon signed-rank test is a paired difference test that checks whether the results of two methods are statistically different. In our experiments, we set its confidence level to 99%. Cohen's d is an effect size measure that quantifies the difference between our method and a compared method, defined as:

$$d = \frac{\bar{x}_1 - \bar{x}_2}{s_{pooled}}, \quad s_{pooled} = \sqrt{\frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}}$$

Depending on the value of Cohen's d, the effect size is divided into four levels, as shown in Table 4.
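A minimal computation of Cohen's d with the pooled standard deviation follows; the AUC sample values are hypothetical, used only to show a "large" effect:

```python
import math

def cohens_d(a, b):
    """Cohen's d between two result samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # unbiased sample variances
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    s_pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / s_pooled

cdaa_auc = [0.78, 0.80, 0.79, 0.81]    # hypothetical AUC values for CDAA
other_auc = [0.70, 0.72, 0.71, 0.69]   # hypothetical AUC values for a baseline
assert cohens_d(cdaa_auc, other_auc) > 0.8   # "large" by the usual thresholds
```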

D. PARAMETER SETTINGS
In our CDAA networks, the hidden layers of the generator, discriminator, and classifier all have 256 neurons. The number of neurons in the output layer of the generator equals the number of metrics in the target project, while the output layers of the discriminator and classifier each have one neuron. In addition, following the method suggested in [49], we tune the parameters of the CDAA networks. Table 5 displays the hyper-parameter values used in our CDAA networks.

V. EXPERIMENTAL RESULTS AND ANALYSIS

A. RQ1: DOES THE LABEL INFORMATION FROM SOURCE AND TARGET PROJECTS IMPROVE THE PERFORMANCE OF CDAA METHOD?

Approach: To investigate the effect of target label information, we compare different proportions of target training data, ranging over {0%, 10%, 15%, 20%}. Meanwhile, to investigate the effect of label information from the source project, we compare with a generative adversarial network (GAN) method that does not consider the label information of the source data: the GAN transfers the source instances into the target instance space and then uses the transferred instances to train an RF classifier [46], [47]. We use the RF classifier because several studies show that random forest learners often produce the best-performing models across the studied SDP scenarios [36], [48].
Results: Figures 3, 4, and 5 respectively show the AUC, MCC, and $P_{opt}$ values for different proportions of target training data and for label information from source instances across all 28 projects from the five repositories. From these figures, we observe that the CDAA method with some target training data (e.g., 10%, 15%, or 20%) outperforms both the CDAA method without target training instances and the GAN method. We also observe that the CDAA method with 10%, 15%, and 20% of target training instances performs similarly. Table 6 lists the statistical results and effect sizes between the CDAA method with 10% target training data and the CDAA variants with other label information. From this table, we observe that the CDAA method with 10% target training data shows a statistically significant difference and a large effect size against the CDAA method without target training data and the GAN method across all performance measures, but no statistically significant difference and a small effect size against the CDAA method with 15% and 20% target training data.

Analysis: Compared with the CDAA method without target training data, the CDAA method with 10% target training data improves performance substantially; that is, target label information is useful for improving HDP performance. Compared with the GAN method, the CDAA method with 10% target training data also improves performance substantially; that is, label information from the source project helps transfer the source to the target space and improves HDP performance. Compared with the CDAA method with 15% and 20% target training data, the CDAA method with 10% target training data performs similarly; that is, increasing the number of target training instances has little impact on CDAA's performance, since CDAA can effectively capture the target label information from a small amount of target training data.

B. RQ2: DOES THE NUMBER OF COMMON METRICS BETWEEN SOURCE AND TARGET PROJECT AFFECT THE PERFORMANCE OF CDAA METHOD?
Approach: To investigate whether the number of common metrics affects the performance of our CDAA method, we conduct experiments with source datasets that share different numbers of common metrics with the target dataset. As shown in Table 3, the common-metric counts fall into three situations: one, four/eight, and twenty-eight. One represents very few common metrics, four/eight represents a small number of common metrics, and twenty-eight represents many common metrics.
Results: Figures 6, 7, and 8 respectively show the AUC, MCC, and $P_{opt}$ values for different numbers of common metrics across all 28 projects from the five repositories. Note that for AEEEM, MORPH, and ReLink, there are only one and four common metrics. From these figures, we observe that, except for the $P_{opt}$ value on the ReLink dataset, the CDAA methods with these three common-metric settings have similar predictive performance across the five repositories. We also observe that the CDAA method with one common metric shows no statistically significant difference and a small (or negligible) effect size compared with the CDAA method with four/eight and twenty-eight common metrics.
Analysis: From the above results, we confirm that even if only a few common metrics exist in both the source and target instance spaces, our proposed CDAA method can still effectively transfer the source instance space into the target instance space. This shows that the CDAA method does not depend on the number of common metrics and could be widely used in practical defect prediction.

C. RQ3: COULD OUR PROPOSED CDAA METHOD PERFORM BETTER THAN STATE-OF-THE-ART SDP METHODS?

Approach: To investigate the predictive performance of our CDAA method, we compare CDAA with conventional within-project defect prediction methods, including random forests (RF) [34], ManualDown (MD) [35], and spectral clustering (SC) [37]; typical CPDP methods, including VCB-SVM [12], TCBoost [13], DTB [14], and MNB [15]; and HDP methods, including CCA+ [17] and CTKCCA [28]. Note that we use the CDAA method with 10% target training instances. Our experimental environment is Python 3.6 and scikit-learn 0.19.2. Meanwhile, we run each target project 20 times to avoid randomness.
Results: Figures 9, 10, and 11 respectively show the AUC, MCC, and $P_{opt}$ values of the compared methods across all 28 projects from the five repositories. We observe that the median AUC, MCC, and $P_{opt}$ values obtained by the CDAA method outperform the compared CPDP methods VCB-SVM [12], TCBoost [13], DTB [14], and MNB [15] across all five repositories, and that they also outperform the compared HDP methods CCA+ [17] and CTKCCA [28].
In addition, we compute the Wilcoxon signed-rank test [43], [44] and Cohen's d effect size [27], [45] between our CDAA method and the other studied methods, and we count the number of datasets at each level of effect size. Table 8 lists the results of the Wilcoxon signed-rank test. Figures 12, 13, and 14 respectively show the number of datasets at each effect size level between CDAA and the other studied methods across the studied measures. From these figures, we observe that, except for the WPDP method with the RF classifier, the CDAA method achieves significant improvements on most projects compared with the studied methods. Taking the comparison of CDAA and MNB as an example, 24 of 28 (13 or 13 of 28) projects show a large or medium positive effect size in terms of AUC (MCC or $P_{opt}$). Compared with the WPDP method with the RF classifier, although RF outperforms our CDAA method, there is no statistical difference in terms of AUC and MCC.

Analysis: Based on the above results, we conclude that the CDAA method achieves better performance than the compared CPDP and HDP methods. VCB-SVM, DTB, TCBoost, and MNB are mixed-project CPDP learning models: they employ only the metrics shared by the source and target projects to train their prediction models, even though the remaining metrics would carry additional information for model building. That is, they can only use data in the same metric space, whereas our CDAA method can use data from different metric spaces. The MD method, suggested for comparison by Zhou et al. [35], considers a larger module more defect-prone; although MD can achieve better recall, it also misclassifies more non-defective instances. SC is an unsupervised clustering classifier, and its predictive power usually underperforms that of supervised classifiers.
Compared with WPDP with the RF classifier: RF can obtain good predictive performance when sufficient historical data with the same distribution is available. However, by taking advantage of label information and the self-learning ability of GANs, our CDAA can effectively transfer the source instance space into the target instance space.

VI. THREATS TO VALIDITY

A. THREATS TO CONSTRUCT VALIDITY
We construct CDAA with target training data obtained by randomly sampling 10% of the instances from the target project. The experimental results may therefore depend on the selected target training instances. Thus, we run each experimental object 20 times to mitigate this potential randomness.
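The repeated-sampling setup can be sketched as follows; the function name and sizes are ours, assuming simple random sampling without replacement:

```python
import random

def sample_target_splits(n_target, fraction=0.1, runs=20, seed=0):
    """Draw `runs` independent random training splits (a `fraction` of the
    target project's modules each), mirroring the repeated-sampling setup
    used to mitigate randomness."""
    rng = random.Random(seed)
    indices = list(range(n_target))
    k = max(1, int(n_target * fraction))
    return [rng.sample(indices, k) for _ in range(runs)]

# Example: a hypothetical target project with 200 modules -> 20 splits of 20.
splits = sample_target_splits(n_target=200)
assert len(splits) == 20 and all(len(s) == 20 for s in splits)
```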
In addition, the 28 projects studied in our experiments are gathered from open-source datasets in which defective modules are identified by the SZZ algorithm [39]. The SZZ algorithm may miss some defective modules, which is a potential threat to construct validity. In the future, this threat could be reduced by employing more complete defect data.

B. THREATS TO INTERNAL VALIDITY
Our study employs three comprehensive measures, AUC, MCC, and $P_{opt}$, to report the prediction results, which is a potential threat to internal validity. In the future, other measures such as G-mean, balance, and ER will be evaluated.
In addition, we implemented the compared approaches carefully according to their published papers; even so, the experimental results may have some bias relative to the original implementations.

C. THREATS TO EXTERNAL VALIDITY
Our experimental objects come from five publicly available repositories, which are used in many software engineering studies. However, the results may not generalize to private projects. Conducting experiments on private projects would reduce this threat.

VII. CONCLUSION
Heterogeneous defect prediction (HDP) is an important research topic, since it enables the construction of prediction models for new projects that do not share a metric set with the source project. To improve the performance of HDP, and motivated by the strong cross-domain learning ability of GANs, we propose a conditional domain adversarial adaptation (CDAA) approach to tackle the heterogeneous problem in SDP. CDAA has three networks, a generator, a discriminator, and a classifier, which take full advantage of the information contained in the source project and in the small number of labeled instances in the target project.
We conduct extensive experiments on 28 projects from five repositories to investigate the performance in both the classification and effort-aware contexts. Further, the non-parametric Wilcoxon signed-rank test [43], [44] and Cohen's d test are applied. Experimental results indicate that CDAA can improve the performance of HDP.
In the future, we will apply our method to more software projects, including commercial projects. In addition, since the class imbalance problem influences the prediction results of learning models, we would like to address class imbalance for HDP.

ACKNOWLEDGMENT
This paper would not have been possible without the generous support of the joint Ph.D. program of the double first-rate construction disciplines of CUMT.