Few-Shot Learning Based Balanced Distribution Adaptation for Heterogeneous Defect Prediction

Heterogeneous defect prediction (HDP) aims to predict the defect tendency of modules in one project using heterogeneous data collected from other projects. It must account for two characteristics of defect prediction data: (1) datasets can have different metrics and distributions, and (2) data can be highly imbalanced. In this paper, we propose a few-shot learning based balanced distribution adaptation (FSLBDA) approach for heterogeneous defect prediction, which takes both characteristics into consideration. Class imbalance in the defect datasets can be addressed with undersampling, but this shrinks the training datasets. Specifically, we first remove redundant metrics from the datasets with extreme gradient boosting. Then, we reduce the data difference between the source domain and the target domain with balanced distribution adaptation, which considers both the marginal and the conditional distribution differences and adaptively assigns different weights to them. Finally, we use adaptive boosting to mitigate the effect of the smaller training dataset, which can improve the accuracy of the defect prediction model. We conduct experiments on 17 projects from 4 datasets using 3 indicators (i.e., AUC, G-mean, F-measure). Compared with three classic approaches, the experimental results show that FSLBDA can effectively improve the prediction performance.


I. INTRODUCTION
With the availability of massive storage capabilities, high-speed Internet, and the advent of Internet of Things devices, modern software systems are growing in both size and complexity [1]. Software Defect Prediction (SDP) aims to find defects in the early stages of software development. It focuses on identifying defect tendencies in software modules and helps developers allocate limited resources to modules with a high probability of containing defects. On the one hand, SDP alleviates the problems of limited developer effort and short development cycles; on the other hand, it can effectively improve software quality.
Cross-project defect prediction (CPDP) utilizes the existing historical data of other projects to construct a prediction model, so it does not require sufficient historical data from the project to be predicted. However, the source project is required to have the same metrics as the target project. Because the programming languages and application domains of different projects often differ, the corresponding metrics differ as well [2]. Heterogeneous Defect Prediction (HDP) can reduce this difference, and its prediction effect is independent of whether the two projects have common metrics. Li et al. proposed a cost-sensitive transfer kernel canonical correlation analysis (CTKCCA) approach for HDP, which made the data distributions of source and target projects much more similar in the nonlinear feature space [3]. Li et al. not only made better use of two projects but also alleviated the class imbalance problem by setting different misclassification costs for different samples [4]. Li et al. proposed a multi-source selection based manifold discriminant alignment (MSMDA) approach, in which a sparse representation based double obfuscation algorithm is designed and applied to HDP [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Shiqiang Wang.
Researchers have used Domain Adaptation (DA) to reduce the data difference, so that the source project (source domain) is no longer required to have the same metrics and distribution as the target project (target domain). There are three kinds of DA. Distribution adaptation focuses on the data distributions of the source domain and target domain. Feature selection focuses on the metrics shared by the source domain and target domain. Subspace learning is concerned with subspaces shared by the source domain and the target domain. Feature selection and subspace learning are often used in heterogeneous software defect prediction. With feature selection, researchers select shared metrics from the source domain and target domain and construct a unified model [6], [2]. Yu et al. grouped the original metrics with spectral clustering according to the correlation of metrics, employed the ReliefF algorithm to compute the relevance of each metric to the number of faults, and selected the most relevant metrics from each resulting cluster [7]. Subspace learning transforms the source domain and the target domain into the same subspace. He et al. used clustering and data selection to select the subset highly relevant to the target domain and learn the similar distribution between the two domains [8]. According to the distribution curves of metrics, Yu et al. matched the heterogeneous metrics and aligned the source domain with the target domain [9]. Wen et al. adopted feature selection combined with transfer component analysis (TCA+) for spatial transformation and obtained accurate prediction results [10]. Chen et al. proposed a heterogeneous-data-oriented multiview transfer learning method for software defect prediction, in which neural network models automatically learn labels from metrics of different dimensions and granularities [11]. Chen et al. proposed a collective training mechanism for defect prediction (CTDP), which made the distributions of source and target projects similar to each other by transfer learning [12]. Liu et al. proposed a spatial-neighborhood manifold learning (SNML) framework for data analysis, which used the spatial-neighborhood information to construct the adjacency graph [13]. However, feature selection is limited by two conditions: whether the source domain and the target domain share common metrics that affect the classification results, and whether the important metrics of the two domains are strongly correlated. Subspace learning can reduce data drift during data mapping, but the source domain and target domain still differ in marginal and conditional distributions, which affects the decision result.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Distribution adaptation mainly considers the marginal distribution and the conditional distribution. Some methods [14], [15] consider only one of these aspects. Long et al. proposed the joint distribution adaptation method (JDA) to match the marginal and conditional distributions between different domains [16]. Others extended JDA by adding structural consistency [17], domain-invariant clustering [18], and target selection [19]. These methods simply add the two distribution discrepancies together, which implicitly assumes that the two distributions are equally important. This assumption does not hold in general, so when the two distributions differ greatly in importance these methods may not generalize well. For example, when there is a big difference between the source domain and the target domain data, marginal distribution adaptation is more important; conditional distribution adaptation is more important when the datasets of the source domain and target domain are more similar. Wang et al. proposed a balanced distribution adaptation method (BDA) [20], which dynamically measures the different effects of the marginal and conditional distributions rather than simply giving them the same weight.
In particular, the datasets are class imbalanced. To obtain a model with a better classification effect, undersampling is carried out on the non-defective samples. However, the resulting quantity of training data is small, which easily leads to under-fitting of the prediction model. Ensemble learning can accomplish few-shot learning and obtain a complex learning model by constructing and combining multiple base classifiers. It does not require additional parameters and avoids costly off-line training sessions [22]. Li et al. developed an ensemble multiple kernel correlation alignment (EMKCA) predictor, which combined the advantages of multiple kernel learning and domain adaptation techniques [23]. Li et al. proposed a Two-Stage Ensemble Learning (TSEL) approach to HDP, which learned multiple different EMKCA predictors and combined them with an average ensemble [24]. Boosting, a family of ensemble learning methods, can use a small amount of data to iteratively generate multiple learners with weak generalization performance and combine them into a strong ensemble classification model [21]. This paper proposes a heterogeneous defect prediction method, denoted as FSLBDA, which combines ensemble learning with domain adaptation. It addresses the case in which two projects have few or no common metrics. Feature selection in the source domain is realized by extreme gradient boosting (XGBoost). Domain adaptation reduces the data distribution difference between the source domain and target domain. Undersampling is used to address class imbalance, but the small number of defective samples results in a small balanced training set. Adaptive boosting (AdaBoost) iteratively updates sample weights and reduces the bias of the classification boundary.
The main contributions of this article are as follows:
(1) AdaBoost is used to realize few-shot learning and prevent under-fitting of the prediction model trained on the small dataset (a balanced dataset obtained by undersampling).
(2) BDA dynamically measures the importance of the marginal and conditional distributions and realizes adaptive distribution adaptation.
The remainder of the paper is organized as follows. Section II presents the architecture of the proposed approach. Section III presents the theory of feature selection and classification. Section IV describes the main principle of balanced distribution adaptation. Section V analyzes and discusses the experimental results. Finally, conclusions are drawn in Section VI.

II. PROPOSED APPROACH
As shown in Figure 1, the proposed approach framework of this paper is mainly divided into three parts. Firstly, XGBoost selects the metrics in the source domain. Secondly, BDA minimizes data distribution differences between source and target domains. Thirdly, AdaBoost classifies the samples of the target domain.
The first step is to select metrics in the source domain. Defective samples and non-defective samples are selected at a ratio of 1:1 to construct a balanced dataset. XGBoost is adopted, and the gradient boosting algorithm is used to continuously reduce the loss of the previously generated decision trees. It updates the weight $G_i$ of the samples and generates a new decision tree. The leaf weights $\omega_j$ associated with each metric across all trees are weighted and averaged to determine the importance of the metric. The complexity of the tree model is added to the objective function to avoid over-fitting. The first and second derivatives from the second-order Taylor expansion of the loss function are used to accelerate the optimization.
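The 1:1 balancing step described above can be sketched as follows. This is a minimal illustration, assuming that defective modules are the minority class and are labeled 1; the function name `undersample_balanced`, the random seed, and the toy data are hypothetical and are not taken from the paper.

```python
import random

def undersample_balanced(samples, labels, seed=0):
    """Randomly drop non-defective (label 0) samples until the classes are 1:1."""
    rng = random.Random(seed)
    defective = [i for i, y in enumerate(labels) if y == 1]
    clean = [i for i, y in enumerate(labels) if y == 0]
    # Keep every defective sample and an equally sized random subset of clean ones.
    kept_clean = rng.sample(clean, k=min(len(clean), len(defective)))
    kept = sorted(defective + kept_clean)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Hypothetical toy data: 3 defective modules out of 10, two metrics each.
X = [[float(i), float(i % 3)] for i in range(10)]
y = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
X_bal, y_bal = undersample_balanced(X, y)
print(len(y_bal), sum(y_bal))
```

The balanced set is twice the size of the defective class, which is why the training data become small and a few-shot-friendly classifier is needed afterwards.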
The second step is to dynamically measure the marginal and conditional distribution differences between the source domain and target domain so as to reduce the data difference adaptively. We use the A-distance to estimate the balance factor $\mu$. The A-distance is defined through the error (hinge loss) of a linear classifier built to distinguish the two domains. Maximum Mean Discrepancy (MMD) is used to calculate the difference between the two probability distributions, which yields the matrices $M_0$ and $M_c$.
The optimal transformation matrix $A$ is obtained by Lagrange multiplier optimization. Applying the transformation matrix yields source domain and target domain data with similar distributions.
The third step is to use AdaBoost to predict the defect tendency of the target domain modules. Each training sample is initially assigned the same weight. If a sample has been accurately classified, its weight is reduced when constructing the next training set. Conversely, if a sample is not accurately classified, its weight is increased. The sample set with updated weights $\omega_i$ is then used to train the next classifier, and the training process proceeds iteratively. After the training of each weak classifier is completed, the weight $\alpha_i$ of a weak classifier with a low classification error rate is increased, while the weight of a weak classifier with a high classification error rate is decreased. A weighted average of the weak classifiers determines the predicted results for the samples.

III. FEATURE SELECTION AND CLASSIFICATION
XGBoost and AdaBoost are both boosting methods from ensemble learning. This paper adopts them to achieve feature selection in the source domain and defect tendency prediction in the target domain, respectively. Boosting generates a strong learner with high accuracy by iteratively adding weak learners. A weak learner is one whose classification effect is only slightly better than random guessing. In practice, it is easier to obtain a weak learner than a strong learner. Each classifier generated after the first one learns from the samples that were not correctly classified in the previous round, which can effectively reduce the bias of the model. As shown in Figure 2, boosting repeatedly runs a weak learning algorithm on differently weighted training data, and the weak learners generated in each round are then combined into a composite strong learner.
XGBoost quantifies the importance of each metric from the way Classification and Regression Trees (CART) select their split points. To minimize the cost of the split tree, the metric with the highest gain is selected for splitting until the maximum depth is reached. The gradient boosting algorithm is used to continuously reduce the loss of the previously generated decision trees, which minimizes the objective function and improves the reliability of the final decision. The objective function includes the complexity of the tree model to avoid over-fitting. The loss function is approximated by a second-order Taylor expansion, and the first and second derivatives are used to accelerate the optimization.
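The objective function to be minimized at the $k$-th iteration is as follows:

$$\mathcal{L}^{(k)} = \sum_{i} l\left(y_i, \hat{y}_i^{(k-1)} + p_k(x_i)\right) + \Omega(p_k), \qquad \Omega(p_k) = \gamma T + \frac{1}{2}\lambda \lVert \omega \rVert^2$$

Here $l$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the target $y_i$. $\Omega$ is the penalty term for model complexity, which helps smooth the final learned weights and avoid over-fitting. $\hat{y}_i^{(k)}$ is the prediction for the $i$-th sample at the $k$-th iteration. $q(x_i)$ is the structure function of each tree, which maps an example to the corresponding leaf index. The objective function greedily adds the tree $p_k$ that most improves the model. Each $p_k$ corresponds to an independent tree structure $q$ and leaf weights $\omega$. $T$ is the number of leaf nodes, and $\omega$ is the vector of leaf weights. $\gamma$ is the penalty for each additional leaf node, which controls node splitting, and $\lambda$ is the L2 regularization coefficient.
A second-order approximation optimizes the model quickly:

$$\mathcal{L}^{(k)} \simeq \sum_{i} \left[ l\left(y_i, \hat{y}_i^{(k-1)}\right) + g_i\, p_k(x_i) + \frac{1}{2} h_i\, p_k^2(x_i) \right] + \Omega(p_k)$$

where $g_i = \partial_{\hat{y}_i^{(k-1)}} l\left(y_i, \hat{y}_i^{(k-1)}\right)$ and $h_i = \partial^2_{\hat{y}_i^{(k-1)}} l\left(y_i, \hat{y}_i^{(k-1)}\right)$ are the first- and second-order gradient statistics of the loss function. $I_j$ is the sample set of leaf $j$. The objective function after removing the constant term can be expressed as follows:

$$\tilde{\mathcal{L}}^{(k)} = \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) \omega_j + \frac{1}{2} \left(\sum_{i \in I_j} h_i + \lambda\right) \omega_j^2 \right] + \gamma T$$

Minimizing this expression with respect to $\omega_j$ gives the weight of each leaf in each tree,

$$\omega_j = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},$$

which is finally used to calculate the metric importance.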
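To make the role of the gradient statistics concrete, the following sketch computes the optimal leaf weight and the gain of a candidate split in the standard XGBoost formulation, where the leaf weight is $-\sum g_i / (\sum h_i + \lambda)$. The logistic loss, the function names, and the four-sample toy data are assumptions chosen for illustration; the paper itself does not specify them.

```python
import math

def grad_hess_logloss(y_true, raw_score):
    """First and second derivatives of the logistic loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)

def leaf_weight(g, h, lam):
    """Optimal leaf weight: -sum(g) / (sum(h) + lambda)."""
    return -sum(g) / (sum(h) + lam)

def leaf_score(g, h, lam):
    """Loss reduction contributed by one leaf (up to the factor 1/2)."""
    return sum(g) ** 2 / (sum(h) + lam)

def split_gain(g, h, left_idx, right_idx, lam, gamma):
    """Gain of splitting one leaf into two; gamma penalizes the extra leaf."""
    gl, hl = [g[i] for i in left_idx], [h[i] for i in left_idx]
    gr, hr = [g[i] for i in right_idx], [h[i] for i in right_idx]
    return 0.5 * (leaf_score(gl, hl, lam) + leaf_score(gr, hr, lam)
                  - leaf_score(g, h, lam)) - gamma

# Hypothetical toy case: two defective and two clean samples, raw scores all 0.
labels = [1, 1, 0, 0]
stats = [grad_hess_logloss(y, 0.0) for y in labels]
g = [s[0] for s in stats]
h = [s[1] for s in stats]

w_left = leaf_weight(g[:2], h[:2], lam=1.0)
gain = split_gain(g, h, [0, 1], [2, 3], lam=1.0, gamma=0.0)
print(w_left, gain)
```

A metric that is repeatedly chosen for high-gain splits accumulates a high importance score, which is the quantity used to rank and filter metrics.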
AdaBoost changes the weights of the training data, i.e., the probability distribution over samples. Its idea is to focus on the samples that are wrongly classified: it reduces the weights of samples that were correctly classified in the last round and increases the weights of those that were wrongly classified. AdaBoost uses weighted majority voting, which increases the weight of weak classifiers with a small classification error rate and reduces the weight of weak classifiers with a large classification error rate. The sample weights are used by the weak classifier to find the decision point with the smallest classification error, and the weight of the weak classifier is then calculated from this minimum error. A classifier with a larger weight has a greater say in the final decision.
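During the training of the $j$-th weak classifier, the objective to be optimized is:

$$(\beta_j, f_j) = \arg\min_{\beta, f} \sum_{i} \exp\left(-y_i \left(F_{j-1}(x_i) + \beta f(x_i)\right)\right)$$

$\beta$ is the weight coefficient and $f$ represents the prediction function of the weak classifier. $F_{j-1}(x_i)$ represents the strong classifier constituted by the previous iterations. The exponential loss function is used instead of the mean square error loss function because the latter is not effective for classification applications. The first factor of the exponential represents the loss of the existing strong classifier on a single training sample. The second factor is the loss of the current weak classifier on that training sample. The objective function can be simplified as:

$$(\beta_j, f_j) = \arg\min_{\beta, f} \sum_{i} \omega_i^{j} \exp\left(-y_i \beta f(x_i)\right), \qquad \omega_i^{j} = \exp\left(-y_i F_{j-1}(x_i)\right)$$

$\omega_i^{j}$ is the sample weight, which depends only on the strong classifier obtained in the previous iteration and not on the current weak classifier. This optimization problem can be solved in two steps. First, $\beta$ is regarded as a constant. Since the values of $y_i$ and $f(x_i)$ can only be +1 or −1, the objective is minimized when they agree on as much sample weight as possible. Therefore, the optimal solution is:

$$f_j = \arg\min_{f} \sum_{i} \omega_i^{j}\, I\left(y_i \neq f(x_i)\right)$$

where $I$ is the indicator function, whose value is 0 or 1 according to the condition in brackets. The optimal solution is the classifier that minimizes the weighted error rate of the samples. Second, with $f_j$ fixed, the optimization objective can be expressed as a function of $\beta$:

$$\sum_{i} \omega_i^{j} \exp\left(-y_i \beta f_j(x_i)\right) = \left(\sum_{i} \omega_i^{j}\right) \left[\left(e^{\beta} - e^{-\beta}\right) \mathrm{err}_j + e^{-\beta}\right]$$

The derivative of the function at the extreme point is 0, so the optimal solution of $\beta$ is obtained:

$$\beta_j = \frac{1}{2} \ln \frac{1 - \mathrm{err}_j}{\mathrm{err}_j}$$

$\mathrm{err}_j$ is the weighted error rate of the weak classifier on the training set:

$$\mathrm{err}_j = \frac{\sum_{i} \omega_i^{j}\, I\left(y_i \neq f_j(x_i)\right)}{\sum_{i} \omega_i^{j}}$$

The updating formula of the sample weights between iterations is written as follows:

$$\omega_i^{j+1} = \omega_i^{j} \exp\left(-\beta_j\, y_i f_j(x_i)\right)$$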
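The two-step procedure above (pick the weak classifier with the smallest weighted error, then set its weight from that error and re-weight the samples) can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the weak learner is a one-feature decision stump, labels are +1/−1, the classifier weight is the usual exponential-loss solution $\frac{1}{2}\ln\frac{1-\mathrm{err}}{\mathrm{err}}$, and the six-point dataset is a hypothetical example that no single stump can separate.

```python
import math

def train_stump(X, y, w):
    """Return (error, feature, threshold, polarity) with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                pred = [pol if row[f] <= thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] <= thr else -pol

def adaboost(X, y, rounds=3):
    n = len(X)
    w = [1.0 / n] * n                      # every sample starts with equal weight
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)   # guard against log(0)
        beta = 0.5 * math.log((1 - err) / err)       # weight of this weak classifier
        ensemble.append((beta, stump))
        # Misclassified samples gain weight, correctly classified ones lose weight.
        w = [wi * math.exp(-beta * yi * stump_predict(stump, row))
             for wi, yi, row in zip(w, y, X)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all weak classifiers."""
    score = sum(beta * stump_predict(s, row) for beta, s in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical 1-D data: the middle interval is non-defective (-1).
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, row) for row in X])
```

Each individual stump misclassifies at least two of the six points, yet the weighted vote of three stumps fits the training set, which illustrates why boosting suits the small balanced dataset produced by undersampling.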

IV. BALANCED DISTRIBUTION ADAPTATION
In software defect prediction problems, the labeled source domain and the unlabeled target domain often differ in both marginal and conditional distributions. Figure 3 demonstrates the importance of matching both marginal and conditional distributions for domain adaptation. If the two distributions are treated equally, their different importance cannot be fully exploited. When two domains are very dissimilar (Figure 3(a) → (b)), the marginal distribution is more important to align. When the marginal distributions are close (Figure 3(a) → (c)), the conditional distribution should be given more weight. BDA can adaptively adjust the importance of the two distributions to achieve better performance. Given a labeled source domain $D_s = \{x_i^s, y_i^s\}_{i=1}^{n}$ and an unlabeled target domain $D_t = \{x_i^t\}_{i=1}^{m}$, with marginal distributions $P_s \neq P_t$ and conditional distributions $Q_s \neq Q_t$, BDA aims to learn the labels $y_t$ of the target domain $D_t$ using the source domain $D_s$. Domain adaptation methods often minimize the marginal and conditional distribution discrepancies between domains. Specifically, this refers to minimizing the distance:

$$d(D_s, D_t) \approx d(P_s, P_t) + d(Q_s, Q_t)$$

BDA exploits a balance factor $\mu$ to leverage the different importance of the distributions:

$$d(D_s, D_t) \approx (1 - \mu)\, d(P_s, P_t) + \mu\, d(Q_s, Q_t)$$

where $\mu \in [0, 1]$. When $\mu \to 0$, the datasets are more dissimilar, so the marginal distribution is more dominant; when $\mu \to 1$, the datasets are similar, so the conditional distribution adaptation is more important. Therefore, the balance factor $\mu$ can adaptively utilize the importance of each distribution and lead to good results. Since the target domain $D_t$ has no labels, the conditional distribution $Q_t$ cannot be directly obtained. Instead, we use the class conditional distribution to approximate $Q_t$. To calculate the class conditional distribution, $D_t$ is predicted using a base classifier trained on the source domain $D_s$, and the resulting soft labels are iteratively refined.
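To compute the discrepancies in the marginal distribution and the conditional distribution, we use MMD to estimate the discrepancy between the two distributions. $d(D_s, D_t)$ can be represented as:

$$d(D_s, D_t) \approx (1 - \mu) \left\lVert \frac{1}{n} \sum_{i=1}^{n} x_i^s - \frac{1}{m} \sum_{j=1}^{m} x_j^t \right\rVert_{\mathcal{H}}^2 + \mu \sum_{c=1}^{C} \left\lVert \frac{1}{n_c} \sum_{x_i^s \in D_s^{(c)}} x_i^s - \frac{1}{m_c} \sum_{x_j^t \in D_t^{(c)}} x_j^t \right\rVert_{\mathcal{H}}^2$$

where $\mathcal{H}$ is the reproducing kernel Hilbert space, $C$ is the number of classes, $D_s^{(c)}$ and $D_t^{(c)}$ are the samples of class $c$ in the source and target domains, and $n_c$ and $m_c$ are their sizes. By further using matrix operation rules, the above formula is formalized as follows:

$$\min_{A} \; \mathrm{tr}\left(A^T X \left((1 - \mu) M_0 + \mu \sum_{c=1}^{C} M_c\right) X^T A\right) + \lambda \lVert A \rVert_F^2 \quad \text{s.t.} \quad A^T X H X^T A = I, \;\; 0 \leq \mu \leq 1$$

The former term adapts the marginal and conditional distributions with the balance factor $\mu$, and the latter term is regularization. $\lambda$ denotes the regularization parameter for the squared Frobenius norm $\lVert \cdot \rVert_F^2$. There are two constraints. The first ensures that the transformed data $A^T X$ preserve the internal properties of the original data. The second is the value range of the balance factor. $X$ is the input matrix composed of $X_s$ and $X_t$, $A$ is the transformation matrix, $I \in \mathbb{R}^{(n+m) \times (n+m)}$ is the identity matrix, and $H = I - \frac{1}{n+m} \mathbf{1}$ is the centering matrix, where $\mathbf{1}$ is the all-ones matrix. $M_0$ and $M_c$ are MMD matrices. With the Lagrange multipliers $\Phi = (\phi_1, \phi_2, \ldots, \phi_d)$, the Lagrange function is as follows:

$$L = \mathrm{tr}\left(A^T X \left((1 - \mu) M_0 + \mu \sum_{c=1}^{C} M_c\right) X^T A\right) + \lambda \lVert A \rVert_F^2 + \mathrm{tr}\left(\left(I - A^T X H X^T A\right) \Phi\right)$$

Setting the derivative $\partial L / \partial A = 0$, the optimization can be derived as a generalized eigendecomposition problem:

$$\left(X \left((1 - \mu) M_0 + \mu \sum_{c=1}^{C} M_c\right) X^T + \lambda I\right) A = X H X^T A \Phi$$

The optimal transformation matrix $A$ is formed by the eigenvectors corresponding to the $d$ smallest eigenvalues. Applying the transformation matrix yields source domain and target domain data with the least discrepancy.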
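The structure of the MMD matrices can be illustrated on a tiny one-dimensional example. The sketch below assumes a linear kernel and treats the toy values as data that have already been projected, so each quadratic form reduces to a squared difference of means. The helper names (`mmd_matrix`, `quad_form`, `balanced_mmd`), the pseudo-labels, and the numbers are hypothetical, and every class is assumed to appear in both domains.

```python
def mmd_matrix(src_selected, tgt_selected):
    """MMD coefficient matrix M = e e^T for the selected source/target samples.

    e_i is 1/n_sel for a selected source sample, -1/m_sel for a selected
    target sample, and 0 otherwise. Selecting everything gives M_0;
    selecting one class gives M_c.
    """
    ns, nt = sum(src_selected), sum(tgt_selected)
    e = ([1.0 / ns if s else 0.0 for s in src_selected]
         + [-1.0 / nt if t else 0.0 for t in tgt_selected])
    return [[ei * ej for ej in e] for ei in e]

def quad_form(z, M):
    """z^T M z for a 1-D projected feature vector z."""
    return sum(z[i] * M[i][j] * z[j] for i in range(len(z)) for j in range(len(z)))

def balanced_mmd(z_src, y_src, z_tgt, y_tgt_pseudo, mu, classes=(0, 1)):
    """(1 - mu) * marginal MMD + mu * sum of class-conditional MMDs."""
    z = z_src + z_tgt
    m0 = mmd_matrix([True] * len(z_src), [True] * len(z_tgt))
    marginal = quad_form(z, m0)
    conditional = 0.0
    for c in classes:
        mc = mmd_matrix([y == c for y in y_src], [y == c for y in y_tgt_pseudo])
        conditional += quad_form(z, mc)
    return (1.0 - mu) * marginal + mu * conditional

# Hypothetical projected data; target labels are pseudo-labels from a base classifier.
z_src, y_src = [0.0, 1.0, 2.0, 3.0], [0, 0, 1, 1]
z_tgt, y_tgt = [1.0, 2.0, 4.0, 5.0], [0, 0, 1, 1]
d_marginal_only = balanced_mmd(z_src, y_src, z_tgt, y_tgt, mu=0.0)
d_conditional_only = balanced_mmd(z_src, y_src, z_tgt, y_tgt, mu=1.0)
d_balanced = balanced_mmd(z_src, y_src, z_tgt, y_tgt, mu=0.5)
print(d_marginal_only, d_conditional_only, d_balanced)
```

In the full method the same matrices enter the trace objective and the transformation is found by a generalized eigendecomposition; this sketch only evaluates the discrepancy term for fixed projected data.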
According to the difference in marginal distribution, the A-distance between the source domain and the target domain is calculated, denoted as $A_M$. For conditional distribution differences, we first cluster the target domain into $C$ classes, and then calculate the A-distance between the data of the same class in the two domains. The average of the A-distances over all classes is denoted as $A_C$. Then, $\mu$ can be estimated as $\mu \approx A_C / (A_C + A_M)$.
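The estimate of the balance factor can be sketched as below. It assumes the commonly used proxy form of the A-distance, $2(1 - 2\epsilon)$, where $\epsilon$ is the error of a linear classifier trained to tell the two domains apart. The classifier errors are passed in as plain numbers, and the function names, the clipping to $[0, 1]$, and the fallback value for a zero denominator are illustrative assumptions rather than details given in the paper.

```python
def a_distance(domain_clf_error):
    """Proxy A-distance from the error of a classifier separating two domains."""
    return 2.0 * (1.0 - 2.0 * domain_clf_error)

def estimate_mu(marginal_error, per_class_errors):
    """mu ~ A_C / (A_C + A_M), with A_C averaged over classes."""
    a_m = a_distance(marginal_error)
    a_c = sum(a_distance(e) for e in per_class_errors) / len(per_class_errors)
    total = a_c + a_m
    if total <= 0:
        return 0.5  # assumption: fall back to equal weighting
    return min(1.0, max(0.0, a_c / total))

# Hypothetical errors: the domains are easy to separate overall (error 0.10),
# but harder to separate within each class (errors 0.30 and 0.40).
mu = estimate_mu(0.10, [0.30, 0.40])
print(round(mu, 4))
```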

V. EXPERIMENTAL RESULTS AND ANALYSIS

A. DATASETS DESCRIPTION
Two projects of the ReLink dataset (Safe and Zxing) are used as the source domain data in the following experiments [25]. The ReLink dataset was collected by Wu et al. using the Understand tool (https://scitools.com) and has 26 metrics that measure code complexity. A total of 15 projects were selected from the AEEEM, NASA, and SOFTLAB datasets as the target domains. The AEEEM dataset was collected by D'Ambros et al. Each AEEEM dataset consists of 61 metrics, including object-oriented (OO) metrics, previous-defect metrics, entropy metrics of change and code, and churn-of-source-code metrics [26]. ReLink and AEEEM have no common metrics. The static code measures of the NASA dataset include lines of code, software complexity, and software readability, all of which are closely related to software quality. Shepperd et al. found conflicts and inconsistencies in the NASA dataset and cleaned it up; the cleaned-up version is used in the experiments [27]. The SOFTLAB dataset comes from a Turkish software company. ReLink and NASA have three common metrics (lines of code, blank lines, and comment lines), while the other dataset pairs have no common metrics.
Table 1 summarizes the 17 datasets used in this paper. The imbalance ratio varies from 1.51 (only slightly imbalanced) to 12.44 (highly imbalanced). The datasets also differ in the number of instances: the smallest dataset has 36 samples, while the largest contains 1862 samples.

B. EXPERIMENTAL RESULTS
The experiments were implemented in Python, a multi-paradigm programming language with rich data science packages, using scikit-learn under Linux. The hardware is an Intel(R) Core(TM) i7-9750H CPU and an NVIDIA GeForce RTX 2060 graphics card.
There are four possible outcomes for any sample in the target domain after it passes through the defect prediction model: when a sample containing defects is predicted to be defective, it is denoted as TP (true positive); when a sample without defects is predicted to be defective, it is denoted as FP (false positive); when a sample containing defects is predicted to be non-defective, it is denoted as FN (false negative); when a sample without defects is predicted to be non-defective, it is denoted as TN (true negative). Based on these outcomes, Precision, Recall, TNR, G-mean, and F1-measure can be defined.
Precision is the percentage of samples predicted to be defective that actually contain defects:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall is the percentage of defective samples that are correctly predicted to be defective:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

TNR is the percentage of non-defective samples that are correctly predicted to be non-defective:

$$\mathrm{TNR} = \frac{TN}{TN + FP}$$

G-mean can be used to evaluate model performance on imbalanced data:

$$\mathrm{G\text{-}mean} = \sqrt{\mathrm{Recall} \times \mathrm{TNR}}$$

F1-measure comprehensively considers Precision and Recall:

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
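As a small worked example of these indicators, the sketch below computes them from the four confusion-matrix counts using the standard definitions (Precision = TP/(TP+FP), Recall = TP/(TP+FN), TNR = TN/(TN+FP), G-mean as the geometric mean of Recall and TNR, F1 as the harmonic mean of Precision and Recall). The function name and the counts are hypothetical, and returning 0 for an empty denominator is an assumption made only to keep the example safe to run.

```python
import math

def defect_metrics(tp, fp, fn, tn):
    """Precision, Recall, TNR, G-mean and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    g_mean = math.sqrt(recall * tnr)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "tnr": tnr,
            "g_mean": g_mean, "f1": f1}

# Hypothetical target project: 10 defective and 20 non-defective modules.
scores = defect_metrics(tp=6, fp=2, fn=4, tn=18)
print({name: round(value, 3) for name, value in scores.items()})
```

G-mean drops sharply when either class is predicted poorly, which is why it is preferred over plain accuracy for imbalanced defect data.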
In general, when Precision is high, Recall is often low, and when Recall is high, Precision is often low. Therefore, it is not enough to use only Precision or Recall as the evaluation index. F1-measure is the harmonic mean of Precision and Recall. In addition, the area under the receiver operating characteristic curve (AUC) is not influenced by the threshold value or the class imbalance. In view of this, this paper uses AUC, G-mean, and F1-measure to evaluate the performance of different approaches. Three classical approaches to heterogeneous software defect prediction, TCA+ [28], CCA+ [29], and KCCA+ [9], are used as baselines to evaluate the performance of FSLBDA.
Figure 4 and Figure 5 show the metric ranking scores of Safe and Zxing. The scores are obtained by XGBoost, taking the complexity of the tree into account. We remove irrelevant metrics based on these scores. The higher the score of a metric, the more meaningful it is for classification. The importance of the same metric differs between the two projects. The specific meaning of each metric can be looked up on the Understand website. Additionally, the source domain training sets for balanced distribution adaptation were obtained by undersampling, which yielded 44 samples for the Safe project and 236 samples for the Zxing project. The new class-balanced projects are more favorable for predicting the defect tendency of the 15 target projects.
It can be seen from the statistical results in Table 2 that, when Safe is the source domain data, AUC of FSLBDA is [...] projects are superior compared to classical approaches with an average of 0.782. F1-measure of FSLBDA is between 0.712 and 0.857, with a mean of 0.779.
It can be seen from Figures 6, 7, and 8 that the AUC, G-mean, and F1-measure generated by FSLBDA for each target dataset are mostly higher than those of the other three approaches, indicating that FSLBDA predicts defective and non-defective classes more correctly than the others. The results show that the prediction effect is improved regardless of whether the source project and the target project have common metrics.
When Zxing is the source domain data, the predicted results are shown in Table 3. It can be seen from the statistical results that the AUC of FSLBDA is between 0.806 and 0.912, with an average value of 0.860. The mean AUC values of the other three classic approaches are 0.647, 0.723, and 0.786, respectively, so FSLBDA improves on them by 32.92%, 18.95%, and 9.41%. In terms of G-mean, FSLBDA is superior to the other classical approaches, with a mean of 0.798. The other classic approaches have averages of 0.573, 0.655, and 0.726, respectively, so the G-mean of FSLBDA is higher by 0.225, 0.143, and 0.072, respectively. The F1-measure of FSLBDA is between 0.722 and 0.836, with a mean of 0.774. Except for the PC1 and AR6 projects, FSLBDA is better than the other three classical approaches. As can be seen from Figures 9, 10, and 11, the classification effect of FSLBDA is improved compared with the other three approaches. The results show that FSLBDA is applicable not only to the case with common metrics but also to predicting the defect tendency without common metrics.
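The improvement figures quoted for the Zxing source project can be checked with a few lines of arithmetic. The sketch assumes that the AUC improvements are relative (difference divided by the baseline mean) and that the G-mean improvements are absolute differences, which is what the reported numbers are consistent with; the variable names are illustrative.

```python
def relative_improvement(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100.0

# Mean values reported in the text for Zxing as the source project.
fslbda_auc, baseline_aucs = 0.860, [0.647, 0.723, 0.786]
fslbda_gmean, baseline_gmeans = 0.798, [0.573, 0.655, 0.726]

auc_gain_percent = [round(relative_improvement(fslbda_auc, b), 2) for b in baseline_aucs]
gmean_gain = [round(fslbda_gmean - b, 3) for b in baseline_gmeans]
print(auc_gain_percent, gmean_gain)
```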
For a comprehensive assessment of the overall performance of the proposed approach, Figures 12, 13, and 14 show the mean AUC, G-mean, and F1-measure obtained with Safe and Zxing as source data. The AUC, G-mean, and F1-measure of FSLBDA are all better than those of the other three approaches by a large margin.
Experimental results show that the prediction performance of FSLBDA proposed in this paper is better than other approaches. FSLBDA can better reduce the data difference between the source domain and the target domain to improve the prediction performance, especially for the prediction of imbalanced datasets.
A non-parametric test does not assume that the population distribution is normal; it infers the population distribution directly from samples. The Kruskal-Wallis test is carried out at significance level α = 0.05, and TCA+, CCA+, KCCA+, and FSLBDA are compared in pairs. The null hypothesis for each row in Table 4 is that the distributions of Method 1 and Method 2 are the same. To reveal which of these groups differ from each other, we conduct a post hoc test with the Holm-Bonferroni correction. We use SPSS software to obtain the adjusted p-value, which is compared directly with 0.05; the difference is considered statistically significant if the adjusted p-value is less than 0.05. Table 4 shows that there is a significant difference between FSLBDA and each of TCA+, CCA+, and KCCA+.
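For readers unfamiliar with the correction, the following sketch shows how Holm-Bonferroni adjusted p-values are obtained from raw pairwise p-values. It is a generic illustration of the step-down procedure, not a reproduction of the SPSS output used in the paper; the function name and the three raw p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never falls below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
adjusted = holm_adjust(raw)
significant = [p < 0.05 for p in adjusted]
print(adjusted, significant)
```

Comparing the adjusted values directly with 0.05 is equivalent to comparing the ordered raw p-values with the step-down thresholds.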

VI. CONCLUSIONS & FUTURE WORK
In this paper, we introduce BDA, which uses a balance factor to dynamically weight the marginal and conditional distribution differences and thereby narrow the gap between heterogeneous datasets.
Since the defect datasets are class imbalanced and contain redundant metrics, we use ensemble learning twice to address these problems. XGBoost is used to rank the importance of metrics, with a complexity term added to the objective function to avoid over-fitting. We obtain a small balanced dataset by undersampling the non-defective samples, and use AdaBoost to predict the target modules, thus avoiding under-fitting of the classification model.
The experimental results show that the proposed FSLBDA approach is feasible and yields promising results. HDP is very promising because it potentially permits all heterogeneous datasets of software projects to be used for defect prediction on new projects or projects that lack defect data. Furthermore, it may not be limited to defect prediction: the technique may be applicable to other predictive approaches for software engineering problems. In future work, we will explore the feasibility of building various prediction models using heterogeneous datasets.