Heterogeneous Defect Prediction Based on Federated Reinforcement Learning via Gradient Clustering

Heterogeneous defect prediction (HDP) refers to using heterogeneous data collected from other projects to build a defect prediction model for predicting software defects in a given project. Traditional methods usually involve the metrics of both the source and target projects. However, owing to legal and regulatory restrictions, such raw data are not easy to obtain, which creates data islands. As a new machine learning paradigm, federated learning (FL) has great advantages for training on heterogeneous data and across data islands. To address the data islands and data heterogeneity of HDP, we propose a novel Federated Reinforcement Learning via Gradient Clustering (FRLGC) method in this paper. Firstly, the parameters of the global model are transferred to each dueling deep Q-network (dueling DQN) model, and each client uses private data to train its dueling DQN model, which incorporates experience replay to increase data efficiency on limited datasets. Secondly, Gaussian differential privacy is used to encrypt the model parameters to ensure the privacy and security of the model. Finally, we cluster the clients according to their locally encrypted model parameters and aggregate with weighted averages, locally and then globally, to create a new global model. Experiments on nine projects in three public databases (Relink, NASA and AEEEM) show that FRLGC is superior to the relevant HDP solutions.


I. INTRODUCTION
As the functionality and requirements of software products continue to grow, so do their size and complexity, making the development of high-quality software an expensive task. Software defect prediction (SDP) helps software engineers focus on defect-prone modules and minimize the occurrence of software defects, thereby reducing software development cost.
Most SDP methods use historical data collected from a project as training data to build prediction models and then predict defects of new software modules in the same project through those models. These SDP methods are called within-project defect prediction (WPDP) [1], [2], [3]. However, it is difficult for WPDP to predict defects in new projects. Cross-project defect prediction (CPDP) can address the lack of historical defect datasets by establishing a learning model on the defect datasets of other projects [4], [5], [6].
Unfortunately, the prediction effect of most CPDP methods is not satisfactory [7], because most previous CPDP methods were based on the assumption that the same feature sets are shared across projects. This assumption is not realistic, and it is challenging to collect the same feature sets when projects are developed in different programming languages. When the metrics of the source and target projects are totally different, the problem becomes heterogeneous and conventional CPDP approaches cannot be applied directly. In recent years, many heterogeneous defect prediction (HDP) models have been proposed. Gong et al. proposed conditional domain adversarial adaptation (CDAA) for HDP, which takes advantage of label information to effectively map the source project to the target project and improves predictive performance [8]. To improve the generalization ability of defect prediction models, Shen et al. constructed a semi-supervised software defect prediction method based on sampling and integration, targeting class imbalance in software defect data and incompletely labeled datasets [9]. Wu et al. set up an intermediate domain between the target domain and the source domain to bridge the distribution gap and narrow the difference between them [10]. Jin proposed to implement domain adaptation (DA) using kernel twin support vector machines (KTSVMs), which can match the distributions of training data from different projects [11]. Because of the redundancy and nonlinearity of the source and target data, Li et al. proposed a landmark selection-based kernelized discriminant subspace alignment (LSKDSA) approach, which reduces the discrepancy of the data distributions between the target and source projects [12]. Tong et al.
proposed to combine kernel spectral embedding, transfer learning and ensemble learning to find the latent common features of source and target datasets [13]. Most HDP models realize defect prediction through domain adaptation or by searching for a latent common feature space between source and target datasets. However, all of the above approaches involve the metrics of both the source and target projects, which risks exposing the privacy of the original data and may violate privacy regulations. In fact, most defect datasets are scattered across different organizations and are not easy to obtain due to the restrictions of many laws and regulations, which creates data islands and will remain a challenge in the future.
Google first proposed the concept of federated learning (FL) in 2016, a powerful framework that can solve the data island problem while protecting users' privacy. FL is a promising collaboration paradigm in which all clients train their models on local datasets and jointly build a global model by transferring model parameters to the server. The whole process does not expose private data, and clients retain complete autonomy over their local data. Wang et al. proposed a Federated Transfer Learning via Knowledge Distillation (FTLKD) approach, which improves the performance of heterogeneous defect prediction models compared to currently popular methods [14]. To protect data privacy, Bai and Fan proposed a homomorphic encryption-based privacy enhancement mechanism that is effective against membership inference attacks [15]. Poor network connections and limited computing resources can make it very slow or infeasible to train a deep neural network (DNN) in the FL pattern; Xu et al. proposed weight quantization, structured pruning and selective updating to accelerate FL training [16]. Su et al. used FL to protect energy data, enabling energy data owners (EDOs) to cooperatively train a shared AI model without revealing local energy data [17]. Ye et al. proposed edge federated learning (EdgeFed), which can train a deep learning model from decentralized data on modern mobile devices [18]. However, building a high-quality global model is challenging in FL when the feature space is small and each client's training data is limited.
Adam et al. considered experience replay (ER) a promising approach in reinforcement learning (RL). In experience replay, data obtained during the learning process are stored and repeatedly presented to the underlying RL algorithm, which increases data efficiency on limited datasets [19]. In federated reinforcement learning (FRL), each client builds its private model using an RL algorithm within the FL framework. The goal of FRL is to improve training efficiency or policy quality through privacy-preserving information exchange. FRL can thus solve the data island problem and improve data efficiency simultaneously. Zhuo et al. proposed a novel deep reinforcement learning framework that builds a DQN for each client to obtain high-quality policies when the training data is limited and the feature space of states is small [21]. Lee and Choi proposed a novel FRL approach in which the agents' optimal policies converge faster [22]. Hu et al. proposed a general FRL framework that uses reward shaping as the information shared between clients to improve the training speed and policy quality of each client [23]. When many edge computing devices jointly build a global model, their heterogeneous private datasets can degrade model quality during training. Ek et al. proposed a novel aggregation algorithm, called FedDist (Federated Distance), which can identify differences between specific neurons and modify the model structure accordingly [24]. Pang et al. proposed an intelligent central server that can identify heterogeneity and guide most clients to achieve better performance [25]. To address diverging local data distributions, Sattler et al. group the client population into clusters with jointly trainable data distributions by using the geometric properties of the FL loss surface [26]. Huang et al.
proposed a federated learning framework (Fed-DSR) and a similarity aggregation algorithm to improve the quality of the model [27].
In this paper, we propose a federated reinforcement learning framework via gradient clustering, called FRLGC. The basic idea behind FRLGC is to address the HDP problem by utilizing a deep reinforcement learning model within a federated learning framework and using similarity knowledge of client gradients to improve model quality. The contributions of this paper can be summarized as follows:
1. We propose federated reinforcement learning for HDP. It allows multiple clients to train their private models and build a global model without exposing their private training data, solving the data island problem. With limited datasets, experience replay is used to improve data efficiency.
2. We separate the original output into a value branch and an advantage branch and then combine the two branches of fully connected layers into one output, which leads to better policy evaluation in the presence of many similar-valued predictions by making the last module of the network implement the forward mapping.
3. We use similarity knowledge of clients to guide FL aggregation. Clients with similar gradients are aggregated locally by weighted averaging, followed by global aggregation, ensuring better coverage while reducing divergence.
4. We validate FRLGC and evaluate its performance through experiments, which show that it significantly improves prediction performance.
The remainder of this paper is organized as follows. Section II introduces the overall framework of FRLGC and details its steps. The experimental setup, including the experimental data, evaluation measures and analysis of the experimental results, is given in Section III. Section IV concludes the paper.

II. THE PROPOSED METHOD
Fig. 1 shows the overall framework of the FRLGC method proposed in this paper. The approach is carried out within the federated learning framework. To eliminate data redundancy and align the data of all clients, we first use principal component analysis (PCA) to reduce the data dimension during preprocessing. Each client then trains on its local data with dueling DQN. After training, the parameters of each local model are encrypted through Gaussian differential privacy. In the model aggregation stage, only some clients are selected to participate in the federated aggregation process, owing to poor network connections and limited computing resources. The selected clients are clustered through K-means; each cluster first performs local aggregation, and global aggregation on the central server then forms a global model, which is finally broadcast to each client. Each client thus passes through four stages: data preprocessing, local training, data encryption and model aggregation. Updating stops when each client's private model converges or the maximum number of communication rounds is reached.

A. DATA PREPROCESSING
The number of data sample features owned by each client is different, and the data features may have redundancy. Therefore, in the data preprocessing stage, we use PCA to reduce the dimension of the data.
PCA uses the variance of the projected data to represent the information content of the original data. The purpose of PCA is therefore to find a projection matrix $A \in \mathbb{R}^{d \times k}$ that maximizes the projected variance:

$\max_{A} \operatorname{tr}\left(A^{\top} X H X^{\top} A\right) \quad \text{s.t.} \quad A^{\top} A = I_k,$

where $X \in \mathbb{R}^{d \times a}$, $a$ is the total amount of test and training data, $d$ is the dimension of each sample, $H = I - \frac{1}{a} Q$ is the centering matrix, $Q$ is the all-ones matrix of size $a \times a$, and $I$ is the $a \times a$ identity matrix.

$\frac{1}{a} X H X^{\top}$ is a representation of the covariance matrix. Solving $X H X^{\top} A = A \Lambda$, where $\Lambda = \operatorname{diag}(\varphi_1, \ldots, \varphi_k) \in \mathbb{R}^{k \times k}$ is the matrix with the $k$ largest eigenvalues $\varphi_1, \ldots, \varphi_k$ on the diagonal and zeros elsewhere, yields the projection, and the optimal low-dimensional feature representation is

$Z = A^{\top} X \in \mathbb{R}^{k \times a}.$
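The preprocessing step above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code; the random matrix stands in for real defect-metric data.

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce d-dimensional samples (columns of X, shape d x a) to k dimensions.

    Follows the formulation in the text: H centers the data, and the top-k
    eigenvectors of the scaled scatter matrix X H X^T / a form the projection A.
    """
    d, a = X.shape
    H = np.eye(a) - np.ones((a, a)) / a       # centering matrix
    S = X @ H @ X.T / a                       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)      # eigenvalues in ascending order
    A = eigvecs[:, ::-1][:, :k]               # keep the top-k eigenvectors
    return A.T @ X                            # k x a low-dimensional representation

# Example: 10-dimensional metrics reduced to 3 principal components
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 50))
Z = pca_reduce(X, 3)
```

Each successive row of `Z` carries no more variance than the previous one, as expected from the eigenvalue ordering.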

B. DUELING DQN
In FRL, building a high-quality policy is challenging when the feature space of states is small and each client's training data is limited. FRL aims to improve training efficiency or policy quality through privacy-preserving information interaction.
Mnih et al. [20] proposed the Deep Q-Network (DQN), which can train AI agents to play Atari games better than human players. DQN trains a Q-network by estimating the value function of the actions for a given state. Although DQN greatly improves the accuracy and efficiency of the Q-network, it suffers from an overestimation problem that causes inaccurate results. Wang et al. [29] improved DQN by presenting dueling DQN, which trains the Q-network faster than DQN.
In the local training stage of FRL, each client trains on its local data with the dueling DQN reinforcement learning technique to construct a private model. Dueling DQN repeatedly presents the stored data to its policy network during training, which improves the utilization efficiency of limited datasets. At the output layer, dueling DQN combines the value branch and the advantage branch into one output; by making the last module of the network implement the forward mapping, this avoids estimating redundant, low-valued predictions and leads to better policy evaluation in the presence of many similar-valued predictions.
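The experience replay mechanism described above can be illustrated with a minimal buffer. This is a generic sketch; the capacity and batch size are arbitrary assumptions, not values from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay: store transitions, sample random mini-batches."""

    def __init__(self, capacity=10000):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling breaks temporal correlation between transitions
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

Reusing stored transitions this way is what lets each client extract more learning signal from its limited local dataset.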
Suppose there are N available devices for a federated learning job. During each round, each device uses dueling DQN to build a private model.
Dueling DQN usually consists of a dynamic environment and an agent that interacts with the environment. It is a process in which the agent interacts with the environment and keeps learning from the environment to maximize its expectation of the rewards.
In HDP domain, we define states, actions and rewards as follows.
States: Let the state at time t be $S_t = \{s_t^1, \ldots, s_t^N\}$, where $s_t^i$ are the data characteristics of the N agents, respectively.
Actions: There are two actions for each agent: {select, neglect}. Selecting a state indicates that it is a defective sample; neglecting indicates that the state is flawless. Select and neglect correspond to 1 and 0, respectively. The state is the data characteristic.
Rewards: Let the instant reward be $r_t$, determined by the dataset defect rate $d_i \in [0, 1]$. The goal of each agent is to maximize the expectation of the cumulative discounted reward during the training process:

$G_t = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k}\right],$

where $\gamma \in [0, 1]$ is a factor discounting future rewards.
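The discounted-return objective above can be computed with a short recursion; γ = 0.9 here is an assumed value for illustration.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reward: sum_k gamma^k * r_{t+k}.

    Iterating backwards implements G_t = r_t + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. rewards [1, 1, 1] with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71
total = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```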
For an agent behaving according to a stochastic policy π, the Q function of the state-action pair (s, a) and the state-value $V^{\pi}(s)$ are defined as follows:

$Q^{\pi}(s,a) = \mathbb{E}\left[G_t \mid s_t = s,\, a_t = a,\, \pi\right], \qquad V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)}\left[Q^{\pi}(s,a)\right].$

We define the advantage function, which relates the value and Q functions:

$A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s). \tag{6}$

From (6) and the state-value $V^{\pi}(s)$, it follows that the two streams can be combined as

$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + A(s,a;\theta,\alpha). \tag{8}$

Equation (8) is unidentifiable in the sense that given Q we cannot recover V and A uniquely. This lack of identifiability is mirrored by poor practical performance when the equation is used directly.
To address this issue of identifiability, we force the advantage function estimator to have zero advantage at the chosen action; that is, the last module of the network implements the forward mapping

$Q(s,a;\theta,\alpha,\beta) = V(s;\theta,\beta) + \left(A(s,a;\theta,\alpha) - \max_{a'} A(s,a';\theta,\alpha)\right).$

Fig. 2 shows the structure of the policy network. The lower layers of the dueling DQN are two convolution layers, one average pooling layer and two fully connected layers. Instead of a single sequence of fully connected layers at the output, we use two sequences, which leads to better policy evaluation in the presence of many similar-valued predictions. The two streams provide separate estimates of the value and advantage functions and are finally combined into one output: a set of Q-values. Fig. 3 shows the overall framework of dueling DQN. The data, data labels and reward function constitute a dynamic environment. The policy network uses two neural networks with the same structure (a primary network and a target network) for prediction. When the data features are input into the policy network, the primary network outputs the action $a_t = \arg\max_a Q(s,a)$. The history consists of the action, the reward and the data features; when the history is input to the policy network, the primary network selects an action and the target network generates a Q-value for that action. During training, dueling DQN employs backpropagation: we compute the error between the estimated value and the target value. The target value, recorded during the training process, depends on s, a and r, while the estimated value comes from the primary network. $Q(s,a)$ is the output of the Q function, an estimate of the final reward.
Because $Q(s,a)$ is an estimate of the final return, we estimate it from the current reward $r_t$ plus the discounted maximum future value $\gamma \max_{a'} Q(s', a'; \theta^{-})$. The loss function is therefore written as

$L(\theta) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta)\right)^{2}\right].$

The dueling DQN algorithm separates the target network (parameters $\theta^{-}$) from the primary network (parameters θ) in the loss function to remove the correlation between Q and the target value.
The weight of the target network is fixed, and only the weight of the primary network will be updated regularly.
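The identifiable forward mapping that combines the two streams can be sketched numerically. This is a simplified stand-in for the network's final module, using the max-subtraction form described above rather than a full neural network.

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the value scalar V(s) and advantage vector A(s, .) into Q-values.

    Subtracting the maximum advantage forces zero advantage at the greedy
    action, so the value and advantage streams are uniquely identifiable.
    """
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.max())

# The greedy action's Q-value equals V(s); others sit below it.
q = dueling_q(2.0, [0.5, 1.5])   # -> [1.0, 2.0]
```

With this mapping, ties and near-ties in the advantage stream no longer destabilize the value estimate, which is why the dueling head evaluates policies better when many actions have similar values.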

C. GAUSSIAN DIFFERENTIAL PRIVACY
FRL achieves privacy preservation by keeping the local datasets on user devices and sharing only the local updates with the server. However, this has been shown to be insufficient for maintaining data privacy, as the shared parameters can reveal insights into the datasets used for training. Consequently, plain FRL is only suitable for honest participating clients; to extend it to secure and privacy-preserving settings, extra measures must be taken.
In the data encryption stage, we use the Gaussian differential privacy method [28]. The idea of Gaussian differential privacy is that when an adversary tries to query individual information from the database, the results are perturbed so that the adversary cannot distinguish individual-level sensitive information from the query results.
Gaussian differential privacy has unique advantages. It involves no cumbersome encryption and decryption process, which saves computing resources and offers higher data-processing efficiency than traditional cryptographic methods. In an FRL system, Gaussian differential privacy obfuscates the model parameters by adding Gaussian noise to them, preventing curious clients or the central server from inferring information about the training data or the model updates from the parameters repeatedly received during joint training. In addition, Gaussian differential privacy provides a clear bound on privacy loss via the mean and variance of the added noise.
For two datasets D and D′ differing in a single data sample, a function on an arbitrary domain $f : D \to \mathbb{R}^{M}$ and a randomized mechanism $\mathcal{M} : \mathbb{R}^{M} \to O$, $\mathcal{M} \circ f$ achieves Gaussian differential privacy if, for any output subset $S \subseteq O$,

$\Pr\left[\mathcal{M}(f(D)) \in S\right] \le e^{\varepsilon} \Pr\left[\mathcal{M}(f(D')) \in S\right] + \delta,$

where a smaller ε represents a stronger level of privacy protection and $\delta \in [0, 1]$ represents the probability of breaking the Gaussian differential privacy guarantee. Gaussian random noise is added to obfuscate $f(\cdot)$:

$\mathcal{M}(f(D)) = f(D) + \mathcal{N}\left(0, \sigma^{2} I_M\right),$

where $I_M$ is the identity matrix and $\mathcal{N}(0, \sigma^{2} I_M)$ is multivariate Gaussian noise with zero mean and variance $\sigma^{2}$. The required noise scale is

$\sigma \ge \frac{\sqrt{2 \ln(1.25/\delta)}\, \Delta f}{\varepsilon},$

where $\Delta f = \max_{D, D'} \| f(D) - f(D') \|_2$ is the global sensitivity of f and $\|\cdot\|_2$ denotes the 2-norm.
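A minimal sketch of this Gaussian mechanism follows. The sensitivity value is an assumption that the deployer must bound in practice (e.g. by clipping parameter updates); the ε and δ values below are illustrative only.

```python
import numpy as np

def gaussian_mechanism(params, sensitivity, eps, delta, rng=None):
    """Perturb a parameter vector with Gaussian noise calibrated for (eps, delta)-DP.

    Uses the classic calibration sigma >= sqrt(2 ln(1.25/delta)) * sensitivity / eps.
    """
    rng = rng or np.random.default_rng()
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) * sensitivity / eps
    return params + rng.normal(0.0, sigma, size=params.shape)

# Noising a (toy) flattened weight vector before sending it to the server
w = np.zeros(4)
w_noisy = gaussian_mechanism(w, sensitivity=1.0, eps=1.0, delta=1e-5)
```

A smaller ε (stronger privacy) directly inflates σ, making the tension between privacy level and model utility explicit.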

D. MODEL AGGREGATION
The models locally trained on heterogeneous data can differ significantly from one another, and aggregating such divergent models can slow convergence and substantially reduce model performance. Effective aggregation is therefore essential to create a common global model for all clients. By analyzing inter-client relations, we can determine the extent to which one client is representative and should contribute to the aggregation. We use similarity knowledge of clients to guide model aggregation, which is divided into two steps: local aggregation and global aggregation. Clients with similar gradients are clustered together, and local aggregation takes a weighted average of the models within each cluster.
Global aggregation then ensures better coverage while reducing differences. Fig. 4 shows the structure of model aggregation. First, the randomly selected clients are clustered by K-means according to their local gradients, without sharing private data. Second, the selected clients update their model weights. Then, within each cluster, local aggregation forms a specialized model by weighted averaging: for a given cluster c at round t,

$\omega_{t+1}^{c} = \sum_{k \in c} \frac{n_k}{n}\, \omega_{t}^{k},$

where $n_k$ is the size of selected client k and n is the total number of clients. Finally, the specialized models are combined into a new global model: at round t + 1,

$\omega_{t+1} = \frac{1}{|\phi|} \sum_{c \in \phi} \omega_{t+1}^{c},$

where ϕ is the federated cluster space. In a heterogeneous environment, increasing the coverage of clients contributes to the universality of the global model, while gradient similarity helps identify and reduce the potentially harmful impact of divergent clients on the global aggregation.
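The two-level aggregation can be sketched as follows. Cluster labels are passed in directly here; in FRLGC they would come from K-means over the clients' gradients. The per-cluster weight normalization shown is one plausible reading of the weighted average, not the paper's exact scheme.

```python
import numpy as np

def aggregate(client_weights, client_sizes, labels):
    """Weighted average inside each gradient cluster, then a plain average
    of the resulting cluster models to form the global model."""
    client_weights = np.asarray(client_weights, dtype=float)
    client_sizes = np.asarray(client_sizes, dtype=float)
    labels = np.asarray(labels)
    cluster_models = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        w = client_sizes[idx] / client_sizes[idx].sum()   # local n_k / n weights
        cluster_models.append((w[:, None] * client_weights[idx]).sum(axis=0))
    return np.mean(cluster_models, axis=0)                # global aggregation

# Clients 0 and 1 form one cluster, client 2 another:
# cluster 0 -> [2, 2], cluster 1 -> [10, 10], global -> [6, 6]
global_model = aggregate([[1.0, 1.0], [3.0, 3.0], [10.0, 10.0]],
                         client_sizes=[1, 1, 2], labels=[0, 0, 1])
```

Averaging at the cluster level first prevents one large, divergent group of clients from dominating the global model.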
We present FRLGC as Algorithm 1. More specifically, each epoch includes five procedures: (1) Line 2 broadcasts the global model to all agents; (2) Lines 3-11 cluster clients with similar gradients for local aggregation; (3) Lines 12-13 perform the global aggregation to ensure better coverage while reducing differences; (4) Lines 14-17 are the acting procedure with ε-greedy exploration, where each client's agent chooses whether to act randomly or follow the policy network; (5) Lines 18-22 are the experience replay procedure, which increases data efficiency in the case of limited datasets.

III. EXPERIMENT RESULTS AND ANALYSIS

A. EXPERIMENT DATASET DESCRIPTION
In this part, we evaluate the performance of the FRLGC algorithm based on three imbalanced datasets: NASA [30], AEEEM [31] and Relink [32], which are in the field of software defect prediction. Table 1 lists the details of the databases used in the experiments.
[Algorithm 1: FRLGC]

In addition, we note that when developers perform corrective maintenance, they usually consider the severity of defects and the importance of modules. However, the three datasets contain no information about defect severity or module importance. Moreover, current SDP research offers no clear account of how defect severity and module importance should be incorporated into the evaluation process. Therefore, in this paper, we do not consider the severity of defects or the importance of modules.
To evaluate prediction performance, the area under the receiver operating characteristic curve (AUC) and the G-mean are used in this article. AUC is unaffected by class imbalance and independent of the cutoff probability (prediction threshold) used to decide whether an instance is classified as positive or negative. AUC ranges from 0 to 1, and a higher AUC represents better prediction performance. The G-mean can also evaluate model performance on imbalanced datasets, as it considers both minority and majority classes; the closer the classification accuracies of the minority and majority classes, the better the G-mean. The experiments were programmed in PyCharm and run on an NVIDIA GTX 1070Ti. In the three groups of experiments, two projects belonging to NASA, AEEEM and Relink, respectively, are selected in turn as the dataset of each client.
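Both evaluation measures can be computed directly; the sketch below is illustrative. The rank-based AUC is threshold-free, matching the property noted above.

```python
import numpy as np

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels."""
    y_true = np.asarray(y_true); y_pred = np.asarray(y_pred)
    tpr = np.mean(y_pred[y_true == 1] == 1)   # recall on the defective class
    tnr = np.mean(y_pred[y_true == 0] == 0)   # recall on the clean class
    return np.sqrt(tpr * tnr)

def auc(y_true, scores):
    """AUC via the rank (Mann-Whitney) statistic: the probability that a
    random defective instance scores above a random clean one."""
    y_true = np.asarray(y_true); scores = np.asarray(scores)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))
```

The G-mean is maximized when the two per-class accuracies are equal, which is exactly the balance property the text appeals to.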

B. ANALYSIS OF CLUSTERING OF CLIENTS
We perform a cluster analysis using the datasets of all clients to explore how the relationships among clients affect final prediction performance.
The evaluation criterion for a good clustering is both high intra-cluster similarity and a large inter-cluster distance.
For the intra-cluster similarity, the dataset z is clustered by the K-means algorithm, and the centroid of each cluster $C_i$ is denoted $c_i$. The intra-cluster similarity of z over the k clusters is written as

$\mathrm{Intra}(z) = \sum_{i=1}^{k} \sum_{x \in C_i} \left\| x - c_i \right\|^{2}.$

The inter-cluster distance represents the distance between the centroids $c_i$ and $c_j$ of two clusters, so the inter-cluster distance of k clusters is expressed as the sum of all pairwise centroid distances:

$\mathrm{Inter}(z) = \sum_{i=1}^{k} \sum_{j=i+1}^{k} \left\| c_i - c_j \right\|.$

Fig. 5 presents a two-dimensional mapping of the clustering, in which the different clusters of all clients are displayed in different colors. AEEEM and Relink show higher intra-cluster similarity (density) than the other datasets, which possibly explains their better performance, while none of the datasets exhibits a large inter-cluster distance (separability). This analysis shows that we can exploit the similarities among clients during learning, captured by comparing their gradients, to realize the aggregation process based on gradient similarity.
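The two cluster-quality measures can be sketched as below; this is a standard formalization consistent with the description above, and the exact definitions used in the paper are an assumption.

```python
import numpy as np

def intra_cluster_similarity(points, labels, centroids):
    """Sum of squared distances from each point to its cluster centroid
    (lower means denser, more similar clusters)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return sum(np.sum((points[labels == i] - c) ** 2)
               for i, c in enumerate(centroids))

def inter_cluster_distance(centroids):
    """Sum of pairwise distances between cluster centroids
    (higher means better-separated clusters)."""
    centroids = np.asarray(centroids, dtype=float)
    k = len(centroids)
    return sum(np.linalg.norm(centroids[i] - centroids[j])
               for i in range(k) for j in range(i + 1, k))
```

On a toy two-cluster layout, two tight groups give a small intra-cluster sum while the centroid gap fixes the inter-cluster distance, mirroring the density/separability reading of Fig. 5.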

C. ANALYSIS OF NUMBER OF CLIENTS
To test the impact of the number of randomly selected participants on the FRL system, AEEEM and Relink are selected as the datasets for analysis. Table 2 summarizes the relationship among average AUC, G-mean, runtime and the number of clients, with 2, 3, 4, 5 and 6 clients participating in each communication round of FRL. With 4 clients, the average AUC and G-mean reach 0.6571 and 0.5519 respectively, with a runtime of 370.193 s. With 5 and 6 clients, the average AUC and G-mean are comparable to the 4-client case, but the runtime increases by 16.752 s and 129.407 s respectively. Considering AUC, G-mean and runtime together, we choose 4 clients for the experiments. Fig. 6, Fig. 7 and Fig. 8 show the relationship between AUC, G-mean and communication rounds when AEEEM and Relink, NASA and AEEEM, and NASA and Relink, respectively, serve as the datasets of each client. Communication rounds refers to the number of times the server and each client communicate with each other. Most clients show stable convergence as rounds increase, with AUC and G-mean improving to varying degrees during the continued communication between the clients and the central server.

D. ANALYSIS OF PREDICTION RESULTS OF CLIENTS IN COMMUNICATION ROUNDS
In Fig. 6, AUC and G-mean increased by 0.25 and 0.08 on average compared with their values before participating in communication. For 0 < round < 5 (and up to 10 for some clients) the growth rate is fast, and beyond round 20 most clients' curves stabilize. In Fig. 7, AUC and G-mean increased by 0.16 and 0.04 on average compared with their pre-communication values; for 0 < round < 5 the growth rate of most clients is fast, and beyond round 25 most clients' curves stabilize. In Fig. 8, the results are less stable than in the other experiments, possibly owing to the poorer clustering performance. Overall, AUC and G-mean increased by 0.27 and 0.21 on average compared with their pre-communication values.

E. EFFECTIVENESS OF OUR METHOD
To evaluate the effectiveness of our method, we compared it with three classical non-federated learning methods, CCA+ [33], KCAA+ [34] and KSETE [35], and two federated learning methods, FedAvg [37] and FTLKD [14]. The classical non-federated methods use LR as the classifier, with six datasets as source data and each dataset tested in turn as target data. The federated learning methods assign the six datasets to six clients for testing. Taking AUC and G-mean as the measurement indicators for each project, the best result is shown in bold.
In Table 5, AUC and G-mean range over 0.553 ∼ 0.6447 and 0.5014 ∼ 0.5365 respectively, with average values of 0.602 and 0.5216. Compared with CCA+, KCAA+, KSETE, FedAvg and FTLKD, the average values increase by (10.56%, 2.7%), (17.64%, 4.89%), (6.11%, 1.46%), (10.65%, 2.83%) and (3.47%, 0.41%) respectively. FRLGC obtains better prediction results on EQ, while its AUC and G-mean on PC are lower; it outperforms the comparison methods on EQ, JDT and LC. Table 3, Table 4 and Table 5 show that each client performs best when AEEEM and Relink are used as datasets. The reason for this is that AEEEM and Relink have higher intra-cluster similarity (density) than the other datasets, which suits the use of client-similarity knowledge to guide FL aggregation.
In general, FRLGC outperforms CCA+, KCAA+, KSETE, FedAvg and FTLKD, indicating better prediction performance.

IV. CONCLUSION
In this paper, we propose FRLGC for heterogeneous defect prediction, which solves the problems of data islands and data heterogeneity and increases data efficiency in the case of limited datasets. The clients jointly construct a global model by communicating without exposing private data, which improves the performance of their private models. Firstly, the parameters of the global model are transferred to each dueling DQN model, and each client uses private data to train its dueling DQN model, which incorporates experience replay to increase data efficiency on limited datasets. Secondly, Gaussian differential privacy is used to encrypt the model parameters to ensure the privacy and security of the model. Finally, we cluster the selected clients according to their locally encrypted model parameters and aggregate by weighted averaging, locally and then globally, to create a new global model and mitigate data heterogeneity. The experimental results show that FRLGC converges well and offers better prediction performance than existing HDP solutions.
In the future, we will extend the experiments to more defect datasets to test the generalization ability of our method. Since FRLGC needs multiple rounds of communication for each client's private model to converge, communication cost deserves closer attention, and we plan to design a new algorithm within FRLGC to reduce it.