Adversarial Learning for Cross-Project Semi-Supervised Defect Prediction

Cross-project defect prediction (CPDP) aims to build a prediction model on existing source projects and predict the labels of a target project. The data distribution difference between projects makes CPDP very challenging. Besides, most existing CPDP methods require sufficient labeled data. However, acquiring a large amount of labeled data for a new project is difficult, while obtaining unlabeled data is relatively easy. A desirable approach is to build a prediction model on both unlabeled and labeled data; CPDP in this scenario is called cross-project semi-supervised defect prediction (CSDP). Recently, generative adversarial networks have achieved impressive results with their strong ability to learn data distributions and discriminative representations. To effectively learn discriminative features from data of different projects, we propose a Discriminative Adversarial Feature Learning (DAFL) approach for CSDP. DAFL consists of a feature transformer and a project discriminator, which compete with each other. The feature transformer tries to generate feature representations that capture the discriminant information and preserve the intrinsic structure inferred from both labeled and unlabeled data. The project discriminator tries to distinguish source and target instances based on the generated representations. Experiments on 16 projects show that DAFL performs significantly better than the baselines.


I. INTRODUCTION
Software defect prediction (SDP) [1]-[8] is an important software quality assurance activity that predicts the defect-proneness of modules based on the project development history. Many prior SDP studies predict the faults of new instances within the same project, which is called within-project defect prediction (WPDP) [9]-[13]. In recent studies, machine learning models have been successfully applied to identify defective instances, e.g., canonical correlation analysis [14], [15], neural networks [16], [17], and deep learning (DL) [18]-[22]. Prior studies have shown that a useful machine learning model needs to be trained with sufficient and complete data. However, a new project usually has limited historical data, which makes it hard to train a well-performing prediction model. When we do not have a sufficient amount of historical data, cross-project defect prediction (CPDP) [23] is a satisfactory solution, which refers to building a prediction model trained on data from source projects and then predicting the labels of a target project.
However, achieving satisfactory CPDP performance is often challenging. Zimmermann et al. [24] evaluated CPDP performance on 12 projects (622 cross-project pairs), and only 21 pairs performed well. The data from different projects usually exhibit a significant distribution gap, which violates a basic requirement of most machine learning techniques, i.e., the similar-distribution assumption [25], [26]. Many studies aim to decrease the feature distribution difference between projects in CPDP [25], [26]. This research has focused on feature (metric) selection, and subspace learning is a popular approach: by learning corresponding transforms, the distribution gap can be reduced and the transformed features can achieve maximum correlation. Deep neural network (DNN) [27]-[30] models have provided scalable nonlinear transformations for feature representations in image classification and have received extensive development. Recently, generative adversarial networks (GANs) [31] have been proposed, which build an adversarial modeling framework by simultaneously training two models. To learn a good feature representation of the input data, GANs perform a minimax game in which the two models compete with each other. GANs have an excellent ability for data distribution generation and feature representation learning. Thus GANs can be used to model the data distributions of different projects, so that we can learn a common discriminative representation of inter-class instances and boost the correlation learning of intra-class instances across projects.
Most existing CPDP approaches build prediction models using only labeled data. In software development practice, the availability of labeled data is limited for reasons such as high cost, lack of budget, and time limitations [32]-[34]. It is difficult for a new project to obtain sufficient labeled data, so we cannot use regular supervised techniques for CPDP in such cases. Fortunately, it is relatively easy to obtain unlabeled data, and with the help of sufficient unlabeled data from external projects, CPDP can perform better.
In order to address the challenges of the distribution difference between projects and the limited amount of labeled data, we propose a new approach, termed Discriminative Adversarial Feature Learning (DAFL), for CSDP. Fig. 1 illustrates the general framework of DAFL, which consists of three major parts: feature mapping, feature transformer, and project discriminator. We adopt a four-layer feed-forward neural network (FNN) [35] as the feature mapping in DAFL to nonlinearly map the respective features of different projects into a common subspace. The feature transformer generates inter-class separated and intra-class correlated representations (i.e., instances from the same class are close and instances from different classes are separated) based on the labeled data and a large amount of unlabeled data from different projects in the common subspace. The feature transformer performs this representation task through feature discrimination and feature structure preservation, where the latter is accomplished by a triplet constraint. Maximum correlation of the transformed features between source and target projects is vital for CPDP, and it is also significant whether the features are statistically indistinguishable. To achieve this, we apply a project discriminator to identify whether a transformed instance comes from the source or the target project. The project discriminator tries to distinguish the project label (i.e., the label that tells apart instances from the source and target projects) and in this way guides the learning of the feature transformer. The feature transformer and project discriminator are trained under the adversarial learning framework. Therefore, we can ensure that the transformed instances of different projects are both discriminative across classes within a project and intrinsic-structure invariant across projects.
The main contributions of this paper are as follows:
1) We introduce the adversarial learning framework into cross-project defect prediction to better address the data distribution difference between projects, which can effectively maximize the feature correlation over different projects. The discriminator judges whether an instance's project label is source or target. The transformer learns the intra-class correlation and inter-class discrimination both within and across projects. The discriminator and transformer compete with each other for better cross-project defect prediction.
2) DAFL is a semi-supervised approach that performs triplet sampling from labeled and unlabeled data. The triplet constraints are enforced on the feature transformer to make full use of the intrinsic geometrical structure information of the unlabeled data and the class label information of the labeled data.
3) We evaluate the proposed approach on the NASA, AEEEM, and PROMISE datasets. The experimental results show that our approach achieves competitive performance compared with state-of-the-art CPDP and semi-supervised defect prediction methods.
The rest of the paper is organized as follows: Section II outlines related work on CPDP and semi-supervised defect prediction. Section III describes our proposed method DAFL. Experimental settings and results are presented in Sections IV and V. We outline threats to validity in Section VI, and Section VII concludes the paper.

II. RELATED WORK
A. CROSS-PROJECT DEFECT PREDICTION (CPDP)
To address the shortage of historical data for a target project, researchers turned their attention to external projects and proposed CPDP. Zimmermann et al. [24] made the first attempt to build CPDP models. They ran 622 cross-project predictions and only 21 had satisfactory results, a success rate of only 3.4%; thus they concluded that CPDP was still a difficult challenge. He et al. [36] concluded that the selection of training data is a vital step in constructing a CPDP model. They selected suitable training data for the target project and supported the conclusion that CPDP prediction results are related to the distributional characteristics of the datasets. Turhan et al. [37] proposed a filter for companies that lack sufficient historical data to build defect prediction models. They analyzed the applicability of building prediction models from cross-company data and selected the ten nearest-neighbor instances for each unlabeled test instance using the k-nearest-neighbor algorithm. Their experiments demonstrate that this method is effective for building predictors from small samples. Peters et al. [38] proposed LACE2, which reduces the amount of data shared by using multi-party data sharing; an obfuscation algorithm is also applied to hide project details and realize privacy preservation for CPDP. Xia et al. [39] proposed a hybrid model reconstruction approach (HYDRA), which iteratively learns new classifiers and compositions of classifiers to collectively better capture generalizable properties; through this iterative learning of various classifiers, HYDRA alleviates the distribution difference between projects. Poon et al. [40] used a credibility-theory-based Naïve Bayes classifier, which provides a credibility factor establishing a novel reweighing mechanism so that the source project data can adapt to the target project data while retaining their own data pattern. Herbold et al.
[41] reproduced 24 methods and evaluated their performance on CPDP. They determined that CamargoCruz09 [42] performed best. Jing et al. [43] proposed a method named SSTCA + ISDA for both WPDP and CPDP. They introduced subclass discriminant analysis into CPDP and WPDP to solve the class imbalance problem, while the semi-supervised transfer component analysis method is employed to make the distributions of source and target data similar. Zhou et al. [8] investigated existing CPDP models published between 2002 and 2017. They concluded that the prediction performance of simple module-size models is superior to most of the existing CPDP models in the literature.
Most existing CPDP methods assume that the available source project data are well labeled. In fact, labeling data requires expert judgment, and labeling a large amount of data is time-consuming.

B. SEMI-SUPERVISED DEFECT PREDICTION (SSDP)
To address the shortage of labeled data, a feasible solution is to make use of the information in both labeled and unlabeled software instances, and SSDP methods have been presented [33], [44]-[46]. Seliya and Khoshgoftaar [47] proposed an expectation-maximization-based semi-supervised method, which estimates the model parameters and the classification probabilities of unlabeled data. Lu et al. [48], [49] proposed a method called fitting-the-confident-fits with multidimensional scaling (FTcF-MDS): FTcF first labels the unlabeled data, and then a supervised learner is trained on all the labeled data; multidimensional scaling is used to reduce the feature dimension before FTcF is initiated. Catal [50] assessed four semi-supervised classification methods, including low-density separation (LDS). Experimental results showed that the LDS-based prediction approach is useful for SDP when the fault data are limited. Ma et al. [51] proposed the random under-sampling tri-training (RusTri) method, which employs a sampling strategy to resample the original training set and tri-training to update the training set. Thung et al. [52] proposed an SSDP method that labels a small data subset and then uses labeled and unlabeled data in the learning process. He et al. [53] proposed a method named extRF, which extends the random forest algorithm with the self-training paradigm and also employs change information to improve prediction performance. Zhang et al. [33] proposed a nonnegative sparse graph based label propagation (NSGLP) method for defect prediction and classification. NSGLP solves the class-imbalance problem of the training data with a Laplacian score sampling strategy and sparse representation, then learns a relationship graph using a nonnegative sparse algorithm, and finally employs a label propagation algorithm on this graph to predict the defect-proneness of unlabeled instances. Yu et al.
[46] proposed a semi-supervised clustering-based data filtering (MsTrA+) method to filter out irrelevant cross-company data.
Recently, Wu et al. [32] studied the cross-project semi-supervised defect prediction (CSDP) problem for the first time and proposed the cost-sensitive kernelized semi-supervised dictionary learning (CKSDL) method, introducing semi-supervised dictionary learning into the SDP field. CKSDL uses kernel mapping to enhance the separability of the data and a cost-sensitive technique to ease the misclassification problem. Different from existing CSDP methods, DAFL utilizes adversarial learning, which can learn discriminative features from unlabeled and labeled data and has a stronger feature learning ability, to improve the prediction ability of the defect prediction model.

C. ADVERSARIAL LEARNING
Goodfellow et al. [31] proposed adversarial learning in GANs. The original framework consists of two major components with opposite training goals: a generative model G and a discriminative model D. To learn the distribution p_g over data x, they define a noise variable with prior p_z(z). G generates instances that D cannot distinguish from real instances, while D tries to correctly distinguish real instances from fake ones. D and G play the following minimax game:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))],

where x denotes the true data and z denotes the noise. The generator G defines a probability distribution p_g as the distribution of the instances G(z) obtained when z ∼ p_z. Goodfellow et al. [31] proved that this game has a global optimum at p_g = p_data; therefore, the value function V(D, G) converges to a good estimator of p_data. Since adversarial learning was first introduced, it has gradually become a hot topic and been used in many fields, including image generation [54], [55] and cross-modal retrieval [56]-[58]. Wang et al. [58] successfully applied adversarial learning to cross-modal retrieval across different modalities. They first used neural networks for feature transformation and then sought an effective common subspace that regularizes the data distributions through adversarial learning, consisting of two adversarial processes: a feature transformer and a modality classifier. Their experiments show that the method both narrows the distribution gap between modalities and maximizes the correlations of similar instances. Inspired by this work, we introduce adversarial learning into CPDP: we first use adjustable neural networks for feature dimensionality reduction, and then use the adversarial learning framework to narrow the data distribution gap between projects.
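As a small numerical illustration of this minimax game (not from the original paper), for a fixed generator the optimal discriminator is D*(x) = p_data(x) / (p_data(x) + p_g(x)), and at the global optimum p_g = p_data the value V(D*, G) equals −log 4. A sketch over a discrete support:

```python
import numpy as np

def optimal_discriminator(p_data, p_g):
    # D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed generator
    return p_data / (p_data + p_g)

def value(p_data, p_g, D):
    # V(D, G) = E_{x~p_data}[log D(x)] + E_{x~p_g}[log(1 - D(x))]
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1.0 - D))

# when the generator matches the data distribution, D* = 1/2 everywhere
p = np.array([0.2, 0.3, 0.5])
D_star = optimal_discriminator(p, p)
v = value(p, p, D_star)  # -> -log 4
```

This mirrors the equilibrium argument above: an indistinguishable generator forces the discriminator to guess, pinning the value function at −log 4.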
Furthermore, inspired by triplet ranking loss [59], which is effective for feature learning, we make an improvement on general adversarial learning.

III. OUR APPROACH
A. PROBLEM FORMULATION
Assume that S = {s_1, s_2, ..., s_m} ∈ R^{d×m} and T = {t_1, t_2, ..., t_n, t_{n+1}, ..., t_{n+n_u}} ∈ R^{d×(n+n_u)} denote the source project data and target project data, respectively, where m and n + n_u denote the numbers of software instances in S and T, n_u denotes the number of unlabeled instances from the target project, and d denotes the number of metrics. We define T_u = {t_{n+1}, ..., t_{n+n_u}} as the unlabeled dataset from the target project. The source data are all unlabeled. We also define the label set of the labeled data from the target project as Y_t = {y_{t,1}, y_{t,2}, ..., y_{t,n}} ∈ R^{c×n}, where c is the number of classes.
Since the values of different software metrics usually have widely different ranges, we normalize the data using the commonly used min-max normalization [60], which transforms all values into the interval [0, 1]. Given a metric value a_i of an instance, the normalized value ā_i is computed as

ā_i = (a_i − min(a)) / (max(a) − min(a)),

where max(a) and min(a) are the maximum and minimum values of metric a.
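A minimal sketch of this normalization step (the function name and the guard for constant metrics are ours, not the paper's):

```python
import numpy as np

def min_max_normalize(X):
    """Scale each metric (column) of X into [0, 1] via (a_i - min(a)) / (max(a) - min(a))."""
    mn = X.min(axis=0)
    mx = X.max(axis=0)
    rng = np.where(mx > mn, mx - mn, 1.0)  # avoid division by zero for constant metrics
    return (X - mn) / rng

# two metrics with very different ranges end up on the same [0, 1] scale
X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
Xn = min_max_normalize(X)  # each column now spans [0, 1]
```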

2) PSEUDO LABEL FOR SOURCE PROJECT
In order to alleviate the influence of the class-imbalance problem and learn cost-sensitive discriminative information from unlabeled and labeled data, we use a simplified cost-sensitive local collaborative representation (CLCR) [61] to infer the pseudo label information for the unlabeled source data. We simplify the neighbor selection process of CLCR and follow its remaining steps. Collaborative representation has been introduced into SDP and proved effective. Inspired by its discriminant ability, CLCR also involves a cost-sensitive factor in the representation coefficients to infer the label information. For a given i-th unlabeled instance s_i from S, we obtain the representation coefficients by solving

α̂ = argmin_α ||s_i − T_l α||_2^2 + µ ||Wα||_2^2,

where α = [α_1, α_2, ..., α_n]^T is the collaborative representation coefficient vector over the labeled target dictionary T_l = [t_1, ..., t_n], ^T denotes vector or matrix transposition in our paper, and µ is a regularization coefficient. W = diag(cost_1, ..., cost_n)^{-1} is the cost matrix, whose elements cost_j are set from num(defective) and num(defect-free), the numbers of defective and defect-free instances. The coefficient vector has the closed-form solution

α̂ = (T_l^T T_l + µ W^T W)^{-1} T_l^T s_i.

Considering that ||s_i − T_j α̂_j||_2 is the class-specific representation residual and ||α̂_j||_2 also carries class information, both bring discrimination information for classification. The regularized residual of each class j is calculated as

r_j(s_i) = ||s_i − T_j α̂_j||_2 / ||α̂_j||_2,

where T_j and α̂_j collect the labeled instances of class j and the corresponding coefficients. The instance s_i is then assigned to the class ĵ = argmin_j r_j(s_i). After CLCR, the pseudo label set of the source project is Ŷ_s = {ŷ_s,1, ŷ_s,2, ..., ŷ_s,m} ∈ R^{c×m}. We also adopt standard random oversampling [62] to further alleviate the influence of the class-imbalance problem.
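The closed form below follows the standard collaborative representation classifier; the exact cost weighting of CLCR [61] may differ, so treat this as an illustrative sketch (the function name, the toy data, and the uniform costs are our assumptions):

```python
import numpy as np

def pseudo_label(s, T_l, labels, cost, mu=0.05):
    """Assign a pseudo label to source instance s using the labeled target
    dictionary T_l (d x n) via cost-weighted collaborative representation."""
    W = np.diag(1.0 / np.asarray(cost, dtype=float))  # diag(cost_1..cost_n)^-1
    A = T_l.T @ T_l + mu * (W.T @ W)                  # regularized Gram matrix
    alpha = np.linalg.solve(A, T_l.T @ s)             # closed-form coefficients
    best_cls, best_r = None, np.inf
    for c in np.unique(labels):
        idx = np.asarray(labels) == c
        # regularized class-specific residual: ||s - T_c alpha_c|| / ||alpha_c||
        r = np.linalg.norm(s - T_l[:, idx] @ alpha[idx]) / (np.linalg.norm(alpha[idx]) + 1e-12)
        if r < best_r:
            best_r, best_cls = r, c
    return best_cls

# toy dictionary: three defect-free instances near e1, three defective near e2
T_l = np.array([[1.0, 0.9, 1.1, 0.0, 0.1, 0.0],
                [0.0, 0.1, 0.0, 1.0, 0.9, 1.1],
                [0.0, 0.0, 0.1, 0.0, 0.0, 0.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
label = pseudo_label(np.array([1.0, 0.05, 0.0]), T_l, labels, cost=np.ones(6))
```

An instance close to the defect-free cluster gets a small class-0 residual and is pseudo-labeled accordingly.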

C. ADVERSARIAL LEARNING
The framework of DAFL (Discriminative Adversarial Feature Learning) is shown in Fig. 1. After feature mapping, the datasets S and T have been transformed into a common subspace F. Although the data of the source and target projects differ significantly, the abundant parameters of the fully-connected layers ensure an excellent representation ability. DAFL then plays a minimax game between the feature transformer and the project discriminator, and these two processes steer the feature representation learning. Specifically, we decompose the learning process into three parts: the project discriminator minimizes the distribution gap between the representations of different projects; feature discrimination ensures that the learned feature representations of different classes are discriminative; and feature structure preservation simultaneously maximizes the distances between inter-class instances and minimizes the distances among intra-class instances in different projects.

1) FEATURE MAPPING
CPDP aims to extract common knowledge from the source project and transfer it to the target project. As different projects have different data distributions, a prediction model trained directly on data from external (source) projects does not generalize well to the target project [24]. We therefore seek a common subspace that minimizes the distance between source and target projects while preserving the original data properties. The source project data S and target data T are mapped to F_S = f_S(S; θ_S) and F_T = f_T(T; θ_T), where f_S(·; θ_S) and f_T(·; θ_T) are mapping functions, F_S ∈ R^{d_f×m} and F_T ∈ R^{d_f×(n+n_u)} are the transformed instances in the subspace F, and θ_S and θ_T are adjustable parameters. Inspired by subspace learning methods, f_S and f_T are realized as feed-forward neural networks: an FNN has enough representational capacity to handle the large, complex distribution differences between projects. We adopt a four-layer FNN as the feature mapping in DAFL to nonlinearly transform the features of different projects into the common subspace.
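The paper gives no code for the mapping networks; below is a forward-pass sketch of two separate FNN mappers that share only the output dimension d_f (the hidden sizes, ReLU activation, and d_f = 16 are our assumptions; the actual sizes are listed in Table 3):

```python
import numpy as np

def init_fnn(sizes, seed=0):
    """Create (weight, bias) pairs for a fully connected network with the given layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, k)) * np.sqrt(2.0 / m), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]

def fnn_forward(params, X):
    """Map rows of X into the common subspace; ReLU on hidden layers, linear output."""
    h = X
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)
    return h

# e.g., 61 AEEEM metrics vs. 20 PROMISE metrics, both mapped to d_f dimensions
d_s, d_t, d_f = 61, 20, 16
f_S = init_fnn([d_s, 64, 32, d_f])
f_T = init_fnn([d_t, 64, 32, d_f], seed=1)
F_S = fnn_forward(f_S, np.zeros((5, d_s)))  # 5 source instances -> (5, d_f)
F_T = fnn_forward(f_T, np.zeros((7, d_t)))  # 7 target instances -> (7, d_f)
```

The key design point is that projects with different metric sets get their own mapper, so the adversarial game is played only in the shared d_f-dimensional subspace.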

2) PROJECT DISCRIMINATOR
We first define a discriminator D with parameters θ_D. It takes a mapped source or target instance as input and outputs a score indicating whether the input comes from the source or the target project. The mapped instances from the source project are assigned the label 01, and those from the target project are labeled 10. The goal of the discriminator is to maximize its classification accuracy on an unknown instance; we refer to this classification loss as the adversarial loss L_D. For the implementation, we use a three-layer FNN with parameters θ_D. The adversarial loss L_D is defined as

L_D = − (1/n_b) Σ_{i=1}^{n_b} p_i · log D(x_i; θ_D),

i.e., the cross-entropy loss of the project discriminator over all data, where each instance x_i is assigned a project class label p_i as a one-hot vector (binary encoding), n_b is the number of instances within each minibatch, and D(·; θ_D) denotes the generated probability that an instance comes from the source or target project.
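A sketch of the discriminator's cross-entropy loss together with a gradient reversal layer, which Algorithm 1 uses to train the transformer against the discriminator (the interface and class names are ours; the paper's discriminator is a three-layer FNN):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def adversarial_loss(logits, p):
    """L_D: mean cross-entropy between one-hot project labels p and D's predictions."""
    probs = softmax(logits)
    return -np.mean(np.sum(p * np.log(probs + 1e-12), axis=1))

class GradReverse:
    """Identity in the forward pass; flips (and scales) the gradient in backward,
    so the feature transformer ascends what the discriminator descends."""
    def __init__(self, lam=1.0):
        self.lam = lam
    def forward(self, x):
        return x
    def backward(self, grad):
        return -self.lam * grad

# a completely undecided discriminator (uniform logits) pays log 2 per instance
p = np.array([[0.0, 1.0], [1.0, 0.0]])  # one-hot project labels (01 source, 10 target)
loss = adversarial_loss(np.zeros((2, 2)), p)  # -> log 2
```

A loss pinned at log 2 is exactly the equilibrium the transformer aims for: the discriminator can no longer tell the projects apart.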

3) FEATURE TRANSFORMER
To preserve the intra-class correlation and inter-class discrimination in the new representation subspace, we design a feature transformer with two parts. Feature Discrimination: To preserve the inter-class discrimination of data in different projects, we define a defect label classifier to predict the class of the mapped instances in the common subspace. We add a two-layer FNN with a softmax activation on top of the mapping network. The classifier takes the mapped instances as training data and generates a class probability p̂ as output. The label prediction loss is

L_fd = − (1/n) Σ_{i=1}^{n} y_i · log p̂(t_i; θ_Y) − (1/m) Σ_{j=1}^{m} ŷ_j · log p̂(s_j; θ_Y),

where L_fd denotes the label prediction loss over the labeled target instances and the pseudo-labeled source instances, θ_Y denotes the parameters of the defect label classifier, y_i is the ground-truth label of a labeled target instance, and ŷ_j is the pseudo label of a source project instance. Feature Structure Preservation: To ensure that the distances between within-class instances are smaller than the distances to instances of other classes in the same project, we design a triplet constraint. It also serves as a useful regularizer for preserving the data structure of unlabeled data, based on the consistency assumption that nearby instances are likely to have the same label [63]. Based on the above considerations, we enforce triplet constraints onto the feature generation process via a triplet loss function.
Given a source project instance s_i, we build couples of the form (s_i, t_j^+) and (s_i, t_k^−) across projects, where t_j^+ and t_k^− denote the positive and negative instances and s_i is selected as an anchor. The target instance t_j^+ with the closest distance among the target instances is assigned as the positive match, and t_k^− with the farthest distance is assigned as the negative mismatch. Similarly, given a target instance t_i, we have (t_i, s_j^+) and (t_i, s_k^−). We also define constraints among the target instances as (t_i, t_j^+) and (t_i, t_k^−) by utilizing the label information of the labeled target data: t_j^+ is the instance farthest from t_i within the same class, and t_k^− is the instance closest to t_i from a different class. Finally, we build the sets of triplets (s_i, t_j^+, t_k^−) for each source instance, and analogously for each target instance.
The l_2 norm is used to compute the distance between mapped instances after feature mapping:

d(x_i, x_j) = ||f(x_i) − f(x_j)||_2.

Thus, we compute the losses between and within projects with hinge-style triplet terms:

L_{s,t} = Σ_i max(0, γ + d(s_i, t_j^+) − d(s_i, t_k^−)),
L_{t,s} = Σ_i max(0, γ + d(t_i, s_j^+) − d(t_i, s_k^−)),
L_{t,t} = Σ_i max(0, γ + d(t_i, t_j^+) − d(t_i, t_k^−)),

where the hyper-parameter γ is a distance threshold (margin) and λ is a balance factor used when combining the terms.
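A minimal sketch of one hinge triplet term (the helper names and toy points are ours; γ is the margin from the text, and the per-term weighting of the full loss is not reproduced here):

```python
import numpy as np

def dist(fx, fy):
    # l2 distance between two mapped instances
    return np.linalg.norm(fx - fy)

def triplet_loss(anchor, pos, neg, gamma=0.1):
    """Hinge triplet term: penalize when the positive is not at least
    gamma closer to the anchor than the negative."""
    return max(0.0, gamma + dist(anchor, pos) - dist(anchor, neg))

a = np.array([0.0, 0.0])
loss_ok = triplet_loss(a, np.array([0.05, 0.0]), np.array([1.0, 0.0]))   # margin satisfied -> 0
loss_bad = triplet_loss(a, np.array([0.9, 0.0]), np.array([0.95, 0.0]))  # margin violated -> positive
```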
Then the overall feature structure preservation loss across different projects is

L_str = L_{s,t} + L_{t,s} + λ L_{t,t}.

Based on the above equations, the loss function of feature generation is formulated as the discrimination loss

L_G = L_fd + β L_str,

which uses the inter-class discrimination information and intra-class correlation information between different projects.
where the hyper-parameter β controls the contributions of L_fd and L_str.

4) OPTIMIZATION
The optimization goals of the two components are adversarial, and they are trained as a min-max game through the following alternating processes:

θ̂_D = argmin_{θ_D} L_D(θ_D),
(θ̂_S, θ̂_T, θ̂_Y) = argmin_{θ_S, θ_T, θ_Y} ( L_G(θ_S, θ_T, θ_Y) − L_D(θ_D) ),

i.e., the project discriminator minimizes its cross-entropy loss, while the feature transformer minimizes the generation loss and simultaneously maximizes the discriminator loss (implemented via a gradient reversal layer).
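The alternating update scheme can be sketched on toy differentiable losses (gradients are passed in as functions here; in DAFL they would come from backpropagation through the gradient reversal layer, and the k inner discriminator steps follow Algorithm 1):

```python
def train_minmax(grad_LD_dD, grad_LG_dG, grad_LD_dG,
                 theta_D, theta_G, lr=0.05, k=5, steps=200):
    """Alternate: k discriminator steps minimizing L_D, then one transformer
    step minimizing L_G - L_D (the reversed discriminator gradient)."""
    for _ in range(steps):
        for _ in range(k):
            theta_D -= lr * grad_LD_dD(theta_D, theta_G)
        theta_G -= lr * (grad_LG_dG(theta_G) - grad_LD_dG(theta_D, theta_G))
    return theta_D, theta_G

# toy check: L_G = theta_G^2, L_D = (theta_D - 1)^2 (independent of theta_G),
# so the scheme should drive theta_D toward 1 and theta_G toward 0
theta_D, theta_G = train_minmax(
    grad_LD_dD=lambda d, g: 2.0 * (d - 1.0),
    grad_LG_dG=lambda g: 2.0 * g,
    grad_LD_dG=lambda d, g: 0.0,
    theta_D=0.0, theta_G=3.0)
```

On the real coupled losses the dynamics are richer (the two players interact through the shared representation), but the control flow is the same.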

IV. EXPERIMENT
A. RESEARCH QUESTIONS
We investigate two research questions in this paper:
RQ1: Is the prediction performance improved when we use DAFL?
RQ2: Are the important components of DAFL effective?

Algorithm 1 The Algorithm of DAFL
Input: Source project dataset S = {s_1, s_2, ..., s_m} ∈ R^{d×m} and the corresponding pseudo label matrix Ŷ_s = {ŷ_s,1, ŷ_s,2, ..., ŷ_s,m} ∈ R^{c×m}; labeled target project dataset T = {t_1, t_2, ..., t_n} ∈ R^{d×n} and the corresponding label matrix Y_t = {y_{t,1}, y_{t,2}, ..., y_{t,n}} ∈ R^{c×n}; unlabeled target project dataset T_u; hyper-parameters λ, γ, β, µ
Output: The predicted labels Ŷ_t of T_u
repeat
    for k steps do
        ...
    end for
    Update the parameters of the discriminator by ascending its stochastic gradients through the Gradient Reversal Layer
until convergence
return learned representations in the common subspace: f_S(S; θ_S) and f_T(T; θ_T)
The prediction labels Ŷ_t ← NN classifier
For RQ2, we investigate whether the important components of DAFL are effective. First, to explore the effect of adversarial learning in DAFL, we display the values of the discrimination loss and the adversarial loss from epoch 1 to 10000. Then we compare DAFL with its two variants (DAFL with L_fd only and DAFL with L_str only). Finally, we compare DAFL against the baseline DAFL without pseudo labels.

B. DATASETS
We conduct experiments on three widely used software defect prediction datasets: AEEEM [64], NASA [65], and PROMISE [66], [67]. Table 1 lists the details of the datasets. A brief introduction to each dataset follows.
The AEEEM dataset consists of the Equinox Framework (EQ), Eclipse JDT Core (JDT), Apache Lucene (LC), Mylyn (ML), and Eclipse PDE UI (PDE) projects. It has 61 metrics, including Chidamber & Kemerer (CK) metrics, object-oriented metrics, churn of source code metrics, and entropy of source code metrics.
The NASA dataset was collected by the NASA metrics data program, and each of its projects represents a NASA subsystem. We select five projects from the NASA dataset; these five projects have 37 common metrics, which include code size, complexity, etc. [68].
The PROMISE dataset was prepared by Jureczko et al. [66]. In the PROMISE dataset, not all projects have the same set of metrics; hence, we select six open-source projects that have 20 common metrics. The metrics contain McCabe's cyclomatic metrics (e.g., Average Cyclomatic Complexity), CK metrics (e.g., Coupling between Object Classes), and other object-oriented metrics (e.g., Depth of Inheritance Tree).

C. EVALUATION MEASURE
In this paper, we use two measures to evaluate the performance of DAFL: F-measure and G-measure. Both are well-known measures [8], [15], [38], [46], [69], [70] for evaluating prediction performance, and we follow previous studies in using them as indicators. The values of both measures range from 0 to 1; the higher the value, the better the performance.
F-measure is the harmonic mean of recall and precision and balances out the recall-precision tradeoff. Recall (Pd) is defined as TP / (TP + FN), where TP and FN are the numbers of True Positives and False Negatives. Precision (Pre) is defined as TP / (TP + FP), where FP is the number of False Positives. F-measure is defined as ((1 + α^2) × Pd × Pre) / (Pd + α^2 × Pre), where α decides the relative importance of precision over recall; we follow common practice and set α = 1 [15]. G-measure takes both recall and specificity into consideration and is their harmonic mean. Specificity is defined as TN / (TN + FP), where TN is the number of True Negatives. G-measure is defined as (2 × Pd × Specificity) / (Pd + Specificity). TP, FN, FP, and TN are defined in Table 2.
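The two measures can be computed directly from the confusion-matrix counts (the helper names and the example counts are ours):

```python
def f_measure(tp, fp, fn, alpha=1.0):
    """F = (1 + alpha^2) * Pd * Pre / (Pd + alpha^2 * Pre), with Pd = recall, Pre = precision."""
    pd = tp / (tp + fn)
    pre = tp / (tp + fp)
    return (1 + alpha**2) * pd * pre / (pd + alpha**2 * pre)

def g_measure(tp, fn, tn, fp):
    """G = 2 * Pd * Specificity / (Pd + Specificity), with Specificity = TN / (TN + FP)."""
    pd = tp / (tp + fn)
    spec = tn / (tn + fp)
    return 2 * pd * spec / (pd + spec)

# e.g. TP=6, FN=4, FP=2, TN=8: Pd = 0.6, Pre = 0.75, Specificity = 0.8
f = f_measure(tp=6, fp=2, fn=4)        # -> 2/3
g = g_measure(tp=6, fn=4, tn=8, fp=2)  # -> 0.96 / 1.4
```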

D. BASELINES FOR COMPARISON
We compare our approach DAFL with ten state-of-the-art methods, including six SSDP methods (FTcF-MDS [48], LDS [50], RusTri [51], NSGLP [33], MsTrA+ [46], CKSDL [32]) and four CPDP methods (ManualDown [8], CamargoCruz09 [42], NN-filter [37], and SSTCA + ISDA [43]). Specifically, CKSDL is the first method applied to the CSDP scenario. MsTrA+ is a cross-company semi-supervised method, which we consider a CSDP method. ManualDown is an unsupervised method suggested in [8] as a baseline when developing new CPDP models; we consider it an unsupervised CPDP baseline. CamargoCruz09 is a classical method that is consistently ranked among the statistically best results in [41] across all metrics and datasets in CPDP. NN-filter and SSTCA + ISDA are effective CPDP methods. To make a fair evaluation, we re-evaluate the method implementations provided by the authors.

E. EXPERIMENT SETTINGS
We organize the cross-project semi-supervised setting within the same dataset and conduct CSDP. For example, for the AEEEM dataset, when EQ is regarded as the target project, each of the other four projects in AEEEM is regarded as the source project in turn, forming four cross-project pairs. We regard all instances of the source data as unlabeled. In the target project, we randomly select a certain percentage (e.g., 20%) of instances as labeled data. To account for the randomness of instance selection, we repeat the experiments 5 times and report the mean F-measure and G-measure results for each target project.

F. IMPLEMENTATION DETAILS
We employ four-layer feed-forward neural networks to nonlinearly map the source project and target project into a common subspace. We conduct the experiments with different parameters of the four-layer network on the AEEEM, NASA, and PROMISE datasets. For the project discriminator, we use three fully connected layers with a softmax activation. The number of nodes in each layer of the network architecture is summarized in Table 3.
In the experiments, we set the batch size to 64 on all datasets. The model parameters µ, γ, λ, and β are set to 0.05, 0.1, 0.01, and 0.1, respectively. The values of µ, γ, and λ mainly follow [61] and [58]. The parameter β is important because it controls the contributions of feature discrimination and feature structure preservation. We search values in [0.01, 0.1, 1, 10, 100] and choose the value that yields the best F-measure and G-measure at the same time.

A. ANALYSIS RESULTS OF RQ1
In this section, we compare DAFL with the ten baseline methods. Tables 4 and 5 report the F-measure and G-measure values of all methods on the three datasets. The mean F-measure and G-measure values for each target project are reported, along with the mean values over the target projects of each dataset. The best values for each target project are highlighted in bold. From these tables, DAFL obtains significant prediction performance improvements in most cases. On average, DAFL obtains an F-measure of 0.7034 and a G-measure of 0.6944, achieving the best average values of both. DAFL significantly outperforms the traditional SSDP methods (FTcF-MDS, LDS, RusTri, NSGLP, MsTrA+, and CKSDL) and the state-of-the-art CPDP methods (CamargoCruz09, ManualDown, NN-filter, and SSTCA + ISDA), which shows the advantage of adversarial learning for CPDP.
Compared with SSDP methods: in most cases, the F-measure and G-measure values of DAFL are higher than those of the related methods, and the improvement over the competing methods is significant. For example, DAFL improves the mean F-measure by 77.17%, 82.98%, 59.57%, 70.19%, 68.89% and 44.49% over FTcF-MDS, LDS, NSGLP, RusTri, MsTrA+ and CKSDL, respectively. DAFL uses a large number of unlabeled instances from other projects, which contain useful information, and it can learn this information effectively. Moreover, our approach can model the distributions of data from different projects and effectively reduce the gap between source and target project data; adversarial techniques have been theoretically and empirically proven to minimize this distribution difference. Thus, DAFL performs better than the other competing methods. Specifically, compared with the two CSDP methods (MsTrA+ and CKSDL), DAFL improves the prediction performance and shows a better ability to obtain effective transformed features. On the one hand, we transform the data from different projects into a common subspace by feature mapping, which ensures that the data from different projects are correlated. On the other hand, the strong feature learning ability of DAFL substantially improves the prediction performance. Compared with MsTrA+ and CKSDL, our model has stronger feature learning and discrimination abilities.
Compared with CPDP methods: ManualDown, CamargoCruz09, NN-filter and SSTCA + ISDA are four baselines that have shown good performance in recent CPDP surveys [8], [41]. In most cases, DAFL performs better than these four baselines. DAFL fully extracts the useful information from both the large number of unlabeled instances and the labeled instances in the common subspace. CamargoCruz09 can only perform data transformation, ManualDown can only utilize the module size feature, NN-filter aims to collect the required local data easily and quickly but fails to construct a better model, and SSTCA + ISDA cannot use the unlabeled data effectively, which leads to their undesirable performance compared with DAFL.

1) STATISTICAL SIGNIFICANCE TEST AND EFFECT SIZE TEST
Statistical Significance Test: To statistically analyze the detailed results corresponding to Tables 4 and 5 (5 random runs), we perform the non-parametric Friedman test [71] with the Nemenyi post-hoc test (at a confidence level of 95%) to compare the significance of the differences in model performance. This evaluation has been used for performance comparison in many studies [15], [41], [72]. For each measure, we compare multiple models over the datasets.
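For illustration (with randomly generated scores standing in for the real per-pair results), the Friedman statistic and the average ranks underlying the Nemenyi comparison can be computed as:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical F-measure table: rows are 16 cross-project pairs,
# columns are 3 competing methods (stand-ins for the real results).
rng = np.random.default_rng(42)
scores = rng.uniform(0.4, 0.8, size=(16, 3))

# Friedman test over the methods' per-pair scores.
stat, p = friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))

# Average rank per method (rank 1 = best F-measure on a pair), the
# quantity compared against the Nemenyi critical difference.
avg_ranks = np.mean([rankdata(-row) for row in scores], axis=0)
```

Libraries such as scikit-posthocs offer a ready-made Nemenyi test over such a score matrix, if available.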
We report the visual representation of DAFL against the competing methods in Fig. 2(a)-(b), where a lower rank indicates better performance. In terms of F-measure in Fig. 2, DAFL outperforms the other competing methods with statistical significance and always ranks first. In brief, the average rank of DAFL is better than those of the other baselines with statistical significance in both F-measure and G-measure.
Effect Size Test: We use Cliff's delta (δ) effect size test [73] to measure the amount of difference between DAFL and the baselines. The values of δ lie in the interval [−1, 1]. As shown in Table 6, the values are divided into four levels, where a higher value denotes a larger effect. The effect levels of Cliff's delta between DAFL and the baselines are shown in Table 7.
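Cliff's delta itself is straightforward to compute; a small sketch in pure Python (no library assumed):

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs, in [-1, 1]."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))
```

Applying this to, e.g., DAFL's per-run G-measures against a baseline's, and mapping |δ| to the thresholds of Table 6, yields the effect levels reported in Table 7.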
From Table 7, we find that the prediction performance of DAFL improves considerably over the baselines on most projects. Taking G-measure as an example, DAFL improves over the baselines on 14 of the 16 (87.5%) projects. DAFL reaches a similar conclusion on F-measure. These results show that DAFL achieves better prediction performance than the baselines.

B. ANALYSIS RESULTS OF RQ2
1) EFFECT OF ADVERSARIAL LEARNING
In the objective function of DAFL, we employ adversarial learning by jointly optimizing the adversarial loss and the discrimination loss. To explore the effectiveness of adversarial learning in our approach, we display the values of the adversarial loss and discrimination loss from epoch 1 to 10000 (one epoch = one forward pass and one backward pass over all the training examples) in Fig. 3. In Fig. 3, the adversarial loss remains almost stable after about 10 epochs, while the discrimination loss first decreases and then converges smoothly after about 100 epochs. The results in Fig. 3 conform to the expectation of our approach in CSDP: while the discrimination loss is still fluctuating, the defect prediction performance keeps increasing. The project discriminator, which directs the subspace discriminative feature learning, is incorporated into the training of the feature transformer. If the adversarial loss soars, the project discriminator fails to direct the feature transformer; and if the adversarial loss is optimized to zero, CPDP becomes impossible, because the complete success of the project discriminator means the failure of the feature transformer, making discriminative feature learning ineffective.
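The alternating optimization behind these two losses can be sketched as follows. This is a minimal sketch with synthetic data and hypothetical dimensions, not the full DAFL objective (which also includes the discrimination and structure-preservation terms):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(20, 16), nn.ReLU(), nn.Linear(16, 8))  # feature transformer
D = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))     # project discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()

src = torch.randn(64, 20)        # synthetic source-project instances
tgt = torch.randn(64, 20) + 0.5  # synthetic target-project instances (shifted)
proj = torch.cat([torch.zeros(64, dtype=torch.long),
                  torch.ones(64, dtype=torch.long)])  # 0 = source, 1 = target

for epoch in range(100):
    # Step 1: train D to tell source from target in the subspace.
    feats = torch.cat([G(src), G(tgt)]).detach()
    loss_d = ce(D(feats), proj)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Step 2: train G to fool D, shrinking the distribution gap.
    loss_g = ce(D(torch.cat([G(src), G(tgt)])), 1 - proj)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

When G succeeds, D's predictions approach chance, which is exactly the regime in which the adversarial loss plateaus rather than collapsing to zero.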

2) EFFECT OF FEATURE TRANSFORMER (COMBINING PROJECT DISCRIMINATION AND FEATURE STRUCTURE PRESERVATION)
The feature transformer of DAFL involves two processes: label prediction and feature structure preservation. To evaluate the effectiveness of this combination, we compare DAFL with two variants of our approach: DAFL with L_fd only (DAFL-L_fd) and DAFL with L_str only (DAFL-L_str). Table 8 shows the F-measure and G-measure values of DAFL and these variants on the three datasets. The results show that both feature discrimination and feature structure preservation contribute to defect prediction, and optimizing L_fd and L_str simultaneously performs better than optimizing only one of them.

3) EFFECT OF THE PSEUDO LABEL
To assess the effectiveness of the pseudo labels, we report the F-measure and G-measure results obtained without pseudo labels. In this setting, the feature discrimination process only discriminates the classes of the labeled target instances, without the pseudo labels of the source instances. To make full use of the discriminative information in the unlabeled data, we apply the CLCR technique to estimate the labels of the unlabeled data from the existing label information during data preprocessing, and then use both the pseudo labels and the true labels in the feature discrimination process. As Table 9 shows, the feature discrimination process performs better with both pseudo labels and existing labels.

C. ANALYSIS RESULTS OF RQ3
1) EFFECT OF THE RATIO OF LABELED INSTANCES
In this section, we investigate the effect of different ratios of labeled instances in the target project on DAFL. The percentage of labeled instances in the target project ranges from 10% to 30%. For a specific percentage of labeled instances, we report the mean results of each method over the different target projects. We report the average G-measure results on the AEEEM, NASA, and PROMISE datasets in Fig. 4(a)-(c). In Fig. 4, as the ratio of labeled instances increases, the prediction performance of all the methods improves, and DAFL consistently outperforms most of the baseline methods.

2) HOW DO WE SET THE NUMBER OF EPOCHS?
Deciding the number of epochs is an important step in training the model. In general, the higher the number of epochs, the higher the time cost. We choose the Tomcat project as the target project and camel as the source project for this experiment, on a machine with 32GB RAM, an Intel Core i7-8700K processor and an Nvidia GeForce GTX 1080Ti GPU. We use the discrimination loss L_G and the adversarial loss L_D to evaluate this parameter, with the number of epochs ranging from 1 to 10000. In Fig. 3, when the number of epochs reaches 100, the values of the discrimination loss and adversarial loss are stable, and the time cost is small. In this study, the number of epochs is set to 100, and the corresponding time cost is only 60 seconds.

3) SETTING OF THE PARAMETER β
As mentioned in Section IV-F, we vary β from 0.01 to 100. In this section, we discuss the influence of different values of β on the AEEEM dataset. Fig. 5 depicts the F-measure and G-measure values of DAFL with different β values on the AEEEM dataset. From Fig. 5, when β = 0.1, DAFL achieves the highest F-measure and G-measure values, although the differences across values are small.

VI. THREATS TO VALIDITY
A. THREATS TO EXTERNAL VALIDITY
Although 16 projects from three datasets and various metrics (e.g., complexity and code size) are used in our experiments, we cannot be sure whether DAFL can be applied to other systems and metrics.

B. THREATS TO INTERNAL VALIDITY
We implement the baselines by carefully following the original papers. Except for CKSDL, the compared related works do not provide the source code of their methods. Although we tried our best to guarantee the accuracy of the implementations, our implementations may not be completely consistent with the original papers.

C. THREATS TO CONSTRUCT VALIDITY
We mainly used F-measure and G-measure, which have been widely used to evaluate the effectiveness of defect prediction models, to evaluate the prediction performance. Although these two measures are used by most of the compared methods, some biases may still exist.

VII. CONCLUSION
In this work, we proposed DAFL, a novel approach that learns both discriminative and correlative representations in a common subspace for CSDP. DAFL takes advantage of adversarial learning by playing a minimax game between two adversarial components: a feature transformer generates intra-class correlated and inter-class discriminated representations, and a project discriminator tries to discriminate the project attribute of a given representation, so as to minimize the distribution difference between source project and target project data. In the feature transformer, we designed triplet constraints to preserve the feature structure, which ensures that the learned feature representations are intra-class correlated and inter-class discriminated in the common subspace. We conducted experiments on three widely used defect datasets, and the result analyses demonstrate the effectiveness and efficiency of DAFL.
For future work, we will adjust the DAFL framework to better handle more complex datasets, including both commercial closed-source and open-source projects.
XIAO-YUAN JING received the Ph.D. degree in pattern recognition and intelligent system from the Nanjing University of Science and Technology, in 1998.
He was a Professor with the Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology, in 2005. He is currently a Professor with the School of Automation, Nanjing University of Posts and Telecommunications, China, the School of Computer, Wuhan University, and the School of Computer, Guangdong University of Petrochemical Technology. He has published over 100 scientific articles in international journals and conferences, including TIP, TIFS, TCB, TCSVT, TMM, TR, TSMC-B, CVPR, AAAI, IJCAI, and ICME. His research interests include pattern recognition, image processing, computer vision, and machine learning.