Learning Stretch-Shrink Latent Representations With Autoencoder and K-Means for Software Defect Prediction

Detecting defective source code to localize and fix bugs is important to reduce software development effort. Although deep learning models have made a breakthrough in this field, many issues remain unresolved, such as the shortage of labeled data and the small size of defective elements. Given two similar programs that differ from each other by only an operator or statement, one may be clean while the other may be defective. To address these issues, this study proposes a new deep learning model to facilitate the learning of distinguishing features. The model comprises three main components: 1) a convolutional neural network-based classifier, 2) an autoencoder, and 3) a k-means cluster. In our model, the autoencoder assists the classifier in generating program latent representations, and the k-means cluster provides penalty functions to increase the distinguishability among latent representations. We evaluated the effectiveness of the model in terms of performance metrics and latent representation quality. The experimental results on four defect prediction datasets show that the proposed model outperforms the baselines thanks to its generation of sophisticated features.


I. INTRODUCTION
Software defect prediction is an active research topic owing to its great significance to the software industry. According to annual reports, software failures have affected many users and led to serious consequences such as financial losses [1], cyber risks [2], and even disasters in autonomous systems [3]. Software companies continuously issue patches to fix bugs discovered through customer feedback or self-investigation. The later a bug is detected, the greater the effort required to fix it. Thus, predictive models for defective code are effective tools to enhance software quality and safety.
Traditional methods train predictive models based on software metrics, which are statistical measurements of source code such as the complexity, the execution time, the load time, and the instruction path length. Various learning algorithms have been exploited to process these features and build classifiers, including Bayesian networks, Support Vector Machines (SVM), and decision trees [4], [5]. To address the critical issues related to data, several studies focused on feature distillation methods, i.e., autoencoders [6], feature selection [7], and Principal Component Analysis (PCA) [8]. The effects of unbalanced data were investigated [9], and various rebalancing strategies were employed to facilitate model training [10], [11]. Even so, the prediction performance remains modest. An unresolved problem is that the existing software metrics reflect insufficient information about the syntax and semantics of code [12]. (The associate editor coordinating the review of this manuscript and approving it for publication was Long Xu.)
By learning program features automatically, deep learning has made breakthroughs in software defect prediction. To build high-performance predictors, many networks have been designed on representations that contain rich information about code syntax and semantics, including language tokens [12], token abstract syntax trees (AST) [13], and instruction sequences [14]. These networks encode a whole program into a vector representation. Meanwhile, defects exist only in a tiny part of the program, such as an operator, a statement, or a snippet. Our observations on source files have found that similar programs may belong to different groups, and different code structures may both be defective. Consider the two similar codes in sub-figures 1a and 1b for finding the greatest common divisor (GCD) of two integers: the defect in the second code is caused by the use of a wrong operator (the division instead of the modulo). In this scenario, networks that treat code components with the same role may generate feature sets that are not distinguishable.
To pay attention to smaller elements when learning distinguishing features, we propose a new deep learning model comprising three components with different tasks. The first is a convolutional classifier for defect prediction. An autoencoder, which encodes the input into latent representations and reconstructs the input from them, supports the classifier in learning important features. The classifier and the autoencoder tend to keep the meaningful information of the whole program. To increase the distinguishability among similar programs with different defective characteristics, a k-means cluster is built on the feature space and provides two penalty functions. The two functions aim to stretch the distances between samples within clusters and to shrink the centroids to reduce the variance. Figure 1 shows two coding styles, recursive (1a, 1b) and non-recursive (1c, 1d), to compute the GCD of two integers. Since each coding style may be clean or defective, enlarging the distance between samples is necessary for discrimination.
To sum up, the main contributions of the paper are as follows: • Observing and carefully analyzing the defective characteristics and the difficulty of learning distinguishing features, even with some deep networks.
• Proposing a new model that not only learns sophisticated features of programs but also enlarges the ability to discriminate defective characteristics within each coding style.
• Validating the effectiveness of the proposed method from different perspectives, such as performance measures, the ability to deal with unbalanced data, and the quality of the generated features.
The rest of the paper is organized as follows. Section II surveys the studies relevant to software defect prediction. Section III describes the proposed model and the training procedure. Section IV presents the baselines and the experimental settings. The results and evaluation are discussed in Section V. We conclude the findings in Section VI.

II. RELATED WORK
A software defect is an imperfection in code that causes the software product to work incorrectly with respect to its specifications. Finding software defects by manual testing or code review is time-consuming and laborious. Therefore, automatic solutions for predicting potentially buggy code help to improve software quality and reduce development cost. Although various powerful methods have been employed, software defect prediction is still a challenging task. The main issues include unbalanced data, the lack of annotated samples, and the difficulty of extracting defective features. As can be seen in Figure 1, defects are small components in the code, and whether a component is defective or clean depends on specific requirements. Regarding defect prediction, the survey in [15] revealed a disjunction between research evidence and practitioners' perceptions.
Initially, handcrafted features, namely software metrics, were selected to distinguish source code. A software metric is a statistical characteristic of code such as the file size, the complexity, and the execution time. Various classical machine learning and deep learning algorithms have been adopted to train prediction models on the original or distilled metrics. For learning on the original metrics, [4] used a Bayesian network to measure the importance of metrics with respect to defectiveness, and [16] proposed an ensemble system of seven common classifiers. Other studies refine the metrics using deep learning [6], feature selection/weighting [17], and principal component analysis (PCA). Yang et al. applied a deep belief network (DBN) to select expressive features [18]. In [6], the metrics were mapped into latent representations by an autoencoder; the results demonstrate the effectiveness of the latent representations in comparison with the original metrics.
Nowadays, the application of deep learning to automatic feature extraction has achieved state-of-the-art performance in software defect prediction. Many deep neural networks have been developed to generate sophisticated features from program representations. Mou et al. [19] designed a tree-based convolutional network on abstract syntax trees (AST) obtained by parsing source files. Other studies generated sequences of AST tokens and trained prediction models using DBNs [12], Long Short-Term Memory (LSTM) [13], Convolutional Neural Networks (CNNs) [20], and bidirectional LSTM with an attention mechanism [21]. Because assembly instructions are close to machine code, learning on the instruction sequences produced by program compilation has achieved high performance in defect prediction. In [14], convolutional neural networks were applied directly on assembly instruction sequences. To enrich the input information, Phan et al. constructed control flow graphs (CFG) and then developed a graph-based neural network to generate defective features [22]. The experimental results on four datasets showed that assembly instruction-based methods significantly outperformed AST-based methods and other baselines.
The unbalanced data problem is frequently encountered in defect prediction tasks because the number of clean samples generally dominates that of defective samples. To prevent classifiers from biasing towards the majority classes, different learning strategies have been proposed, such as sampling, cost-sensitive learning, and ensemble learning. Sampling strategies aim to rebalance samples among classes prior to model construction. The main approaches include 1) undersampling, which eliminates majority class instances, 2) oversampling [23], which randomly replicates minority class instances, 3) SMOTE [24], which generates new minority class instances from nearest neighbors, and 4) hybrid methods that combine the other techniques. Cost-sensitive learning treats the misclassification of different classes with different costs [25]. Ensemble learning combines the predictions from multiple algorithms to make the decision [10].
The lack of annotated samples is one of the main challenges for defect prediction. To alleviate this issue, many strategies have been employed, including transfer learning, self-supervised learning, and masked language modeling. In [26], Phan et al. collected data from different sources to train a convolutional model and used it as a program feature extractor. In [27], Feng et al. created a pretrained model, namely CodeBERT, for programming and natural languages.
Previous studies, whether handcrafted or automatic, treated program components equally when learning features. In contrast, our method directs the learner to smaller components to extract distinguishing features. During feature generation, similar samples are assigned to a cluster regardless of being defective or clean. Then, we apply penalty functions that enlarge the distances between samples inside a cluster to distinguish them, and shrink the cluster centroids to reduce the variance.

III. THE PROPOSED MODEL
Figure 2 shows the proposed model architecture. The model is composed of four components: 1) a convolutional neural network, 2) an autoencoder, 3) a k-means cluster, and 4) a classifier. The convolutional network undertakes feature extraction from the input. The autoencoder and the k-means cluster assist the convolutional network in generating features that retain the important information and are more distinguishable. The main task, predicting the defective probability, is performed by the classifier.

A. THE CONVOLUTIONAL NEURAL NETWORK
The model aims to predict software defects at the file level. An original input is an assembly instruction sequence obtained by compiling a source file or decompiling an executable file. The sequence is then converted into numerical representations before being fed into the model. In the model, the convolutional neural network applies a set of convolutional and pooling layers to learn features from the inputs.
The embedding layer provides numerical input representations for the networks. Given an instruction sequence (a_i)_{i=1}^{L}, the embedding matrix X is formed by mapping each item into a d-dimensional vector:

X = e_{a_1} ⊕ e_{a_2} ⊕ … ⊕ e_{a_L}    (1)
where X ∈ R^{L×d}, e_{a_i} ∈ R^d is the vector of instruction a_i, and ⊕ is the concatenation operator.
Convolutional layers use a set of filters to extract local features. In convolutional layers, each neuron is connected only with a sub-structure of the input. Filters consist of learnable parameters that define the connection weights and regions, and each filter aims to extract a specific type of feature. During the forward pass, the filter slides along the input sequence and produces a feature map as follows:

c_i = f(W · e_{a_i:i+h−1} + b)    (2)

where i = 1, …, L, f is a non-linear activation function, h is the filter size, e_{a_i:i+h−1} = e_{a_i} ⊕ e_{a_{i+1}} ⊕ … ⊕ e_{a_{i+h−1}}, W ∈ R^{h×d} is the weight matrix, and b is the bias.
FIGURE 2. The model architecture includes three components: an autoencoder, a k-means, and a classifier. Conv, Pool, and Unpool stand for the convolutional, pooling, and unpooling layers.
Intuitively, a feature map contains neurons that are arranged along the input sequence, connect to a local region, and share the same weights and bias. Through this weight-sharing mechanism, a filter can emphasize features appearing in sub-structures at any position in the input data. Since instruction sequences are very long and defects occur only in small parts of programs, convolutional neural networks are employed instead of other architectures such as long short-term memory (LSTM) [28] and transformers [29]. A filter of size h explores features in h-element regions. Stacking multiple convolutional layers allows the network to learn high-level abstract features and expand the feature extraction regions. A pooling layer is commonly stacked on each convolutional layer to perform dimension reduction. For high-dimensional data, downsampling helps to reduce the number of model parameters, and hence to speed up computation and control overfitting. In the experimental datasets, a program may have up to 3,000 instructions, resulting in a 3,000 × d embedding matrix. In addition, the convolution just transforms the input data without downsampling. Therefore, applying pooling layers is essential.
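As an illustration, the sliding-filter computation in Equation 2 can be sketched in a few lines of numpy. The ReLU activation, the absence of padding, and all the toy sizes below are our own assumptions for the sketch, not the paper's actual settings.

```python
import numpy as np

def conv1d_feature_map(X, W, b):
    """Slide a filter of size h over an L x d embedding matrix.

    X: (L, d) embedding matrix, W: (h, d) filter weights, b: scalar bias.
    Returns a feature map of length L - h + 1 (no padding), using ReLU as f.
    """
    L, d = X.shape
    h = W.shape[0]
    out = np.empty(L - h + 1)
    for i in range(L - h + 1):
        region = X[i:i + h]  # e_{a_i} ... e_{a_{i+h-1}}
        out[i] = np.maximum(W.ravel() @ region.ravel() + b, 0.0)
    return out

# Toy example: 5 instructions embedded in 3 dimensions, one size-2 filter.
X = np.arange(15, dtype=float).reshape(5, 3)
W = np.ones((2, 3))
fmap = conv1d_feature_map(X, W, b=0.0)  # shape (4,)
```

Stacking several such layers (each filter with its own W and b) yields the multi-filter feature maps described above.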
In pooling layers, each feature map is resized independently. Normally, a pooling layer splits a feature map into non-overlapping regions and then applies the pooling operation to each region. Two commonly used operations are max and average. For example, with a max pooling of filter size 2 and stride 2, the feature map C = {c_i} is separated into two-value regions, and the greater value of each pair is selected. As a result, the pooled feature map has a size of L/2. During training, the backward pass simply routes the gradient to the position that held the highest value in the forward pass.
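The max-pooling example above, together with the index bookkeeping that the backward pass relies on, can be sketched as follows (a minimal illustration, not the paper's implementation):

```python
import numpy as np

def max_pool_1d(c, size=2, stride=2):
    """Split a feature map into non-overlapping regions and keep the max.

    Also returns the argmax positions, which the backward pass uses to
    route the gradient to the winning value of each region.
    """
    n = (len(c) - size) // stride + 1
    pooled = np.empty(n)
    argmax = np.empty(n, dtype=int)
    for j in range(n):
        region = c[j * stride: j * stride + size]
        k = int(np.argmax(region))
        pooled[j] = region[k]
        argmax[j] = j * stride + k
    return pooled, argmax

c = np.array([0.3, 0.9, 0.1, 0.4])
pooled, idx = max_pool_1d(c)  # pooled halves the map: [0.9, 0.4]
```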

B. THE AUTOENCODER
An autoencoder consists of two segments, the encoder and the decoder (parts 1 and 3), which have symmetric structures.
The encoder maps an input sequence X by the function φ to a latent representation F. Inversely, the decoder tries to recreate the original sequence from F by the function ψ. The process can be formulated as follows:

F = φ(X),  X̂ = ψ(F)    (3)
In the model, both the encoder and the decoder are built up from convolutional layers. A slight difference between the two architectures is that the pooling is used for dimension reduction in the encoder, while the decoder applies upsampling to expand feature maps.
The objective of training the autoencoder is to minimize the reconstruction error. The loss function for an instance x_i is as follows:

L_reconst(x_i) = ‖x_i − x̂_i‖²    (4)

where x̂_i is the reconstruction, which has the same shape as x_i.
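The encode-reconstruct-penalize cycle can be sketched with a toy linear encoder and decoder. The paper's φ and ψ are convolutional; the linear maps, tanh activation, and sizes here are placeholder assumptions that only illustrate the symmetric structure and the squared-error loss of Equation 4:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear phi/psi pair (the paper uses convolutional layers instead).
W_enc = rng.normal(size=(8, 3))   # phi: R^8 -> R^3 latent space
W_dec = rng.normal(size=(3, 8))   # psi: R^3 -> R^8 reconstruction

def reconstruct(x):
    f = np.tanh(x @ W_enc)        # latent representation F = phi(X)
    return f @ W_dec              # X_hat = psi(F)

def reconstruction_loss(x, x_hat):
    """Squared-error reconstruction loss, as in Eq. (4)."""
    return float(np.sum((x - x_hat) ** 2))

x = rng.normal(size=8)
loss = reconstruction_loss(x, reconstruct(x))  # non-negative scalar
```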

C. THE K-MEANS
K-means is an unsupervised algorithm that partitions data samples into disjoint clusters. Samples within a cluster are similar to each other and dissimilar to those in other clusters. The similarity is evaluated based on a specific distance, such as the Euclidean, Manhattan, or Minkowski distance. Algorithm 1 shows the steps of k-means. The inputs are the data samples and the number of clusters. The algorithm estimates cluster representation points, called centroids, and assigns samples to clusters. First, the centroids are randomly initialized. Next, the centroids and sample labels are recalculated repeatedly until a stopping criterion is met. Possible criteria include: 1) the centroids do not change, 2) samples remain in the same clusters, or 3) a maximum number of iterations is reached.
Algorithm 1: K-Means Clustering
Input: Data samples x^{(i)}, i = 1..N; number of clusters k
Output: Centroids µ_1, µ_2, …, µ_k; sample labels c^{(i)} ∈ {1, 2, …, k}
1 Randomly initialize µ_1, µ_2, …, µ_k;
2 while not stopping_criteria do
3   assign each sample to its nearest centroid;
4   recompute each centroid as the mean of its assigned samples;

In the model, the k-means clusters programs according to their contents. As mentioned in Section I (Figure 1), different implementations can be used to perform a task, and defects can occur in any program regardless of the implementation. To learn distinguishing features for defects, we use k-means in conjunction with two constraints: 1) shrinking the centroids, and 2) stretching the samples inside each cluster (Figure 3). The motivation behind shrinking is to reduce the variance; stretching emphasizes features that distinguish similar programs. The two constraints are implemented by the following loss functions:

L_inter = Σ_{i=1}^{k} Σ_{j=i+1}^{k} d(µ_i, µ_j)    (5)

where d(µ_i, µ_j) is the distance between the two centroids µ_i and µ_j.

L_intra = − Σ_{i=1}^{N} d(x_i, µ_{c^{(i)}})    (6)

where c^{(i)} is the cluster of the sample x_i.
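A minimal sketch of the two penalty terms follows. The exact formulas for Equations 5 and 6 are our plausible reading of the shrink/stretch description (pairwise centroid distances, and negated sample-to-centroid distances), not the paper's verbatim definitions:

```python
import numpy as np

def kmeans_penalties(F, centroids, assign):
    """Stretch-shrink penalties on latent features F (N x d).

    L_inter: sum of pairwise centroid distances -- minimizing it shrinks
    the centroids together, reducing the variance.
    L_intra: negated sum of sample-to-centroid distances -- minimizing it
    stretches samples apart inside each cluster.
    """
    k = len(centroids)
    l_inter = sum(np.linalg.norm(centroids[i] - centroids[j])
                  for i in range(k) for j in range(i + 1, k))
    l_intra = -sum(np.linalg.norm(F[n] - centroids[assign[n]])
                   for n in range(len(F)))
    return l_inter, l_intra

# Two tight clusters on a line; assignments follow the nearest centroid.
F = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]])
centroids = np.array([[0.5, 0.0], [4.5, 0.0]])
assign = np.array([0, 0, 1, 1])
l_inter, l_intra = kmeans_penalties(F, centroids, assign)
```

Both terms enter the total loss with their own λ weights, so neither dominates the classification objective.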

D. THE CLASSIFIER
For predicting faulty code, the classifier is built on the features extracted by the convolutional network. This part includes a global pooling layer and several fully connected layers.
Global pooling. The output of the last pooling layer in part (1) is still high dimensional. To select the most salient features, a global pooling is applied to pool each feature map into a single value.
Fully connected layers contain neurons that are fully connected to all neurons in the previous layer. The activation of the last layer is the softmax, which converts the score values into a probability distribution:

p_c = exp(z_c) / Σ_{j=1}^{K} exp(z_j)    (7)

where K is the number of target labels, c = 1, …, K, and z = (z_1, …, z_K) is the score vector of the last layer. We use the categorical cross-entropy to evaluate the classification loss. For an instance i, the loss is computed as follows:

L_class(i) = − Σ_{c=1}^{K} y_{i,c} log(p_{i,c})    (8)

where y_{i,c} is the indicator that has the value 1 if and only if c is the correct target for instance i, and p_{i,c} is the predicted probability for class c as in Equation 7.
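Equations 7 and 8 together can be sketched in a few lines (the max-shift inside the softmax is a standard numerical-stability trick, not something the paper specifies):

```python
import numpy as np

def softmax(z):
    """Eq. (7): convert score values z into a probability distribution."""
    e = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return e / e.sum()

def cross_entropy(p, target):
    """Eq. (8): categorical cross-entropy for one instance.

    target is the index of the correct class, i.e. the one-hot y_{i,c}.
    """
    return float(-np.log(p[target]))

z = np.array([2.0, 1.0, 0.1])
p = softmax(z)                    # probabilities summing to 1
loss = cross_entropy(p, target=0)
```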

E. TRAINING THE MODEL
The loss function is a linear combination of the loss components in Equations 4, 5, 6, and 8:

L = L_class + λ_reconst · L_reconst + λ_inter · L_inter + λ_intra · L_intra    (9)

where the λ factors adjust the contribution of each component.
The training procedure. All the components are jointly trained using mini-batch gradient descent. Algorithm 2 presents the training procedure, which finds the best model parameters by minimizing the loss function on a dataset.
First, the parameters are randomly initialized, and the data samples are divided into batches of the same size. Next, training is performed repeatedly over the whole dataset. At each epoch, the parameters are updated on every batch. Given a batch of samples X_b, we pass it into the model to compute the outputs of all components, including the input reconstructions X̂_b, the latent representations L_b, the features F_b, the k-means centroids M_b, the sample clusters C_b, and the classification predictions Ŷ_b. Finally, we compute the total loss and estimate the gradient to update the parameters.
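The combination in Equation 9 reduces to a one-liner; the default λ values below are placeholders (the excerpt does not fix them), so they should be treated as tunable hyper-parameters:

```python
def total_loss(l_class, l_reconst, l_inter, l_intra,
               lam_reconst=0.1, lam_inter=0.1, lam_intra=0.1):
    """Eq. (9): linear combination of the four loss components.

    The lambda weights are assumed values for illustration only.
    """
    return (l_class
            + lam_reconst * l_reconst
            + lam_inter * l_inter
            + lam_intra * l_intra)
```

In a framework such as PyTorch or TensorFlow, this scalar would be the quantity on which automatic differentiation is invoked once per mini-batch.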

IV. EXPERIMENTAL SETTINGS
A. THE DATASETS
The proposed model is compared to state-of-the-art methods for software defect prediction on four datasets, namely FLOW016, MNMX, SUBINC, and SUMTRIAN [14]. The comparisons are performed according to different criteria, including performance measures and the quality of the generated features. Each dataset consists of C/C++ implementations for solving a problem on CodeChef,¹ a site for learning programming languages. The four problems are described as follows:

Algorithm 2: Mini-Batch Gradient Descent
Input: Data samples x^{(i)}, i = 1..N; learning rate η; batch size bs
Output: Model parameters θ = {W, B}
1 Randomly initialize θ;
2 Partition {x^{(i)}} into bs-sample batches X_batches = {X_b};
3 while not stopping_criteria do
4   foreach batch X_b do compute the component outputs and the loss L (Eq. 9), and update θ ← θ − η∇_θ L;

• SUMTRIAN (Sums in a Triangle): finding the largest of the sums of numbers on a path from the top towards the base of an n-row lower triangular matrix. On each path, the next number is located directly below, or below and one place to the right.
• FLOW016: finding the greatest common divisor (GCD) and the least common multiple (LCM) of two integers.
• MNMX: finding the minimum total cost to reduce an array to a single element by repeatedly removing the larger of two adjacent integers. The cost of each operation is the smaller of the two numbers.
• SUBINC: given an array A of N elements, counting the number of non-decreasing subarrays of A.
A non-defective program must produce the correct output and satisfy the problem's constraints, such as the running time and memory usage. Clean programs are labelled 0, while the others are tagged according to the defect type: 1 - time limit exceeded, 2 - incorrect output, 3 - interrupted execution, and 4 - failed compilation.
All programs are transformed into assembly code by the g++ compiler. Assembly code is selected as the model input because it is close to machine code. Table 1 presents the statistics of the four experimental datasets. Remarkably, the data samples are greatly unbalanced among classes, with the ratios of several classes being just 1-3%. To train the models, each dataset is split in the ratio 3:1:1 into training, validation, and test sets, which are used to adjust the trainable parameters, select the best checkpoint, and evaluate the model performance, respectively.
¹https://www.codechef.com/problems/<problem-name>
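A 3:1:1 split can be produced by shuffling indices and cutting at 60%/80%; the fixed seed and the plain (non-stratified) shuffle below are our own choices for the sketch, and with classes as unbalanced as these a stratified split may be preferable:

```python
import numpy as np

def split_3_1_1(n_samples, seed=0):
    """Shuffle sample indices and split them 3:1:1 (train/val/test)."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    n_train = (3 * n_samples) // 5
    n_val = n_samples // 5
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train, val, test = split_3_1_1(1000)  # 600 / 200 / 200 samples
```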

B. BASELINES AND SETTINGS
To evaluate the performance, the proposed model is compared with state-of-the-art approaches for sequence learning including convolutional neural networks (CNN) [30], long short-term memory (LSTM) with/without attention mechanism [31], transformers [32], and graph-based networks [22]. The hyper-parameter settings for the networks are as follows.
• LSTM, and LSTM with attention (AttLSTM): the hidden state size is 128.
• Transformer: the number of heads is 2, and the hidden state size is 128.
• CNN: two convolutional layers, each followed by a pooling layer. The numbers of filters are 50 and 100, and the filter size is set to 2.
• DGCNN: configured as in [22], consisting of two graph convolutional layers with 100 and 600 filters, a fully connected layer with 600 neurons, and a softmax.

C. EVALUATION MEASURES
The approaches are evaluated with popular performance measures: the classification accuracy, F1, and AUC (the area under the receiver operating characteristic curve, ROC). The accuracy and F1 are computed from the confusion matrix. Considering a binary classification problem wherein samples are tagged either +1 (positive) or −1 (negative), the confusion matrix is constructed as in Table 2. TP and FN are the numbers of positive samples predicted as positive and negative, respectively. Similarly, TN and FP are the numbers of negative samples predicted as negative and positive.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

F1 = 2 · Precision · Recall / (Precision + Recall)

where Precision = TP / (TP + FP) and Recall = TP / (TP + FN).
AUC, which shows the degree to which classes are distinguished, is a good measure in the case of unbalanced data. With unbalanced data, learning algorithms normally bias towards the majority classes and ignore the minorities. In many practical problems, detecting the minorities, e.g., defects, is more important than detecting the majorities. The higher the AUC is, the better the distinguishing ability of the model. AUC is the area under the ROC curve, which plots the true positive rate TPR = TP/P against the false positive rate FPR = FP/N at different thresholds.
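The accuracy and F1 formulas follow directly from the four Table 2 entries; a short sketch (with made-up counts for a defect-like imbalanced setting):

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy and F1 from the confusion-matrix entries of Table 2."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1

# Illustrative counts: 10 defective (positive) vs 90 clean samples.
acc, f1 = binary_metrics(tp=8, fn=2, fp=4, tn=86)
```

Note how the high accuracy here is driven mostly by the majority class, which is exactly why AUC is reported alongside it.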
For multi-class problems, the metrics can be estimated in several ways: 1) micro, calculating the metrics globally by counting the prediction outcomes; 2) macro, calculating the metrics for each class and taking their unweighted mean; and 3) weighted, calculating the metrics for each class and taking the average weighted by the number of samples.
Table 3 compares the approaches in terms of accuracy and F1. AE_Kmeans outperforms the other baselines on the four experimental datasets. It is worth noting that, although it has the same convolutional architecture, AE_Kmeans surpasses CNN thanks to the k-means support. In comparison with DGCNN, a recent state-of-the-art approach on these datasets, AE_Kmeans improves accuracy remarkably on FLOW016, SUBINC, and SUMTRIAN. Because the programs are very long (as shown in Table 1), sequentially compressing them to generate features is inefficient. As evidence, LSTM obtains a low performance, although it is a powerful model for various sequence processing tasks. The attention mechanism in AttLSTM and Transformer, which mitigates long-term dependency issues, boosts performance significantly over LSTM. Since defects exist only in small parts of the programs, extracting local features is beneficial. For this reason, both sequence-based and graph-based convolutional neural networks achieve higher performance than the sequential compression approaches. In the proposed network, the autoencoder guarantees that the generated features contain the important information of the programs, while the k-means groups similar programs and focuses on small differences to distinguish them. Thus, these components facilitate the feature generation of the CNN. The above analyses indicate the importance of the k-means and the autoencoder in the model. Table 4 compares AE_Kmeans and DGCNN on the minority classes by AUC. With unbalanced data, learning models normally make inaccurate predictions on the minority classes, even though detecting them is preferable.
Taking defect prediction into account, although defective programs are the minority, recognizing them is more important than recognizing clean ones. Thus, AUC, which indicates the ability to distinguish among classes, is an important measure for evaluating models. From Tables 3 and 4, the proposed model optimizes accuracy without decreasing the distinguishing ability despite the highly unbalanced datasets. With the rich information of its input representations, DGCNN outperforms the other network architectures, including CNN, LSTM, and Transformer. However, AE_Kmeans achieves not only higher accuracy and F1 but also higher AUC than DGCNN in most cases. As mentioned in Section I, defects may exist in any program regardless of the coding style, and distinguishing similar codes with different defect types is difficult. The high performance of AE_Kmeans indicates that dividing programs according to code similarity and stretching samples within clusters are beneficial to defect prediction. Figure 4 depicts the ROC curves generated by the two best models on SUBINC. For the majority classes, such as 0 and 1, both models achieve high performance with AUC higher than 80%; when the decision threshold increases, the curves run along the vertical and upper horizontal axes. However, DGCNN obtains a low performance on the minority classes, especially class 2 with a sample ratio of 3.18%: its ROC curve approaches the diagonal line, and its AUC of around 53% is just equivalent to a random classifier. In contrast, the ROC curve produced by AE_Kmeans is much higher than the diagonal line, and the AUC reaches 61%.

V. RESULTS AND DISCUSSION
To verify the effects of the k-means and the autoencoder on feature learning, the program features generated by the approaches are visualized for analysis. Figure 5 shows the visualizations on FLOW016 and MNMX for AE_Kmeans and DGCNN. For visualization, the features are linearly transformed into a two-dimensional space using principal component analysis (PCA). For DGCNN, the programs are scattered over the feature space (sub-figures 5b and 5d). With the support of the k-means, the features generated by the proposed model are separated into small clusters corresponding to coding styles. The stretch penalty function, which enlarges the distances between samples in each cluster, allows the model to detect defects in similar programs.
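The PCA projection used for Figure 5 amounts to centering the feature matrix and keeping its top two principal directions; a minimal SVD-based sketch (the random 16-dimensional features are stand-ins for the model's latent representations):

```python
import numpy as np

def pca_2d(F):
    """Project an N x d feature matrix onto its top two principal components."""
    Fc = F - F.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Fc, full_matrices=False)
    return Fc @ Vt[:2].T

F = np.random.default_rng(0).normal(size=(50, 16))
coords = pca_2d(F)  # (50, 2) points ready for a 2-D scatter plot
```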

VI. CONCLUSION
This paper proposed a new deep learning model consisting of a CNN, an autoencoder, and a k-means for software defect prediction. One of the main challenges in this problem is that similar programs may belong to different defect classes, and common learning algorithms may tag similar programs with the same class. To address this issue, the proposed model uses the autoencoder to help the CNN retain important information, and the k-means with two penalty functions to group similar programs and distinguish them within each group. The experiments on four datasets have shown the efficiency of our method for software defect prediction.