An in-Depth Analysis of the Software Features’ Impact on the Performance of Deep Learning-Based Software Defect Predictors

Software Defects Prediction represents an essential activity during software development that contributes to continuously improving software quality and software maintenance and evolution by detecting defect-prone modules in new versions of a software system. In this paper, we conduct an in-depth analysis of the impact of software features on the performance of deep learning-based software defect predictors. We further extend a large-scale feature set proposed in the literature for detecting defect-proneness by adding conceptual software features that capture the semantics of the source code, including comments. The conceptual features are automatically engineered using Doc2Vec, an artificial neural network based prediction model. A broad evaluation performed on the Calcite software system highlights a statistically significant improvement obtained by applying deep learning-based classifiers for detecting software defects when using conceptual features extracted from the source code for characterizing the software entities.


I. INTRODUCTION
Software Defects Prediction (SDP) consists in identifying defective software modules in new versions of a software system [1] and is considered an essential activity during software development. SDP is of great importance in software engineering, as it contributes to continuously improving software quality. Developing high quality software systems is expensive and, in this context, SDP is used for increasing the cost effectiveness of quality assurance and testing [2]. By detecting fault-prone modules in new versions of a software system, SDP helps to allocate the effort so as to test those modules more thoroughly [1].
SDP assists in measuring project evolution, supports process management [3], predicts software reliability [4], and guides testing and code review [1]. All these activities significantly reduce the costs involved in developing and maintaining software products [5]. Moreover, particularly in the case of safety-critical systems, SDP helps in detecting software anomalies with possible negative effects on human lives.
The associate editor coordinating the review of this manuscript and approving it for publication was Hui Liu.
As the complexity of software systems increases, the number of software defects generated during software development will also significantly increase. This growing complexity of software projects requires increasing attention to their analysis and testing. Numerous studies from the SDP literature are based on mining historical and code information during the software development process and then building a prediction model (statistical, machine learning-based or other) to predict software defects [6].
Despite its importance and extensive applicability, SDP remains a difficult problem, especially in large-scale complex systems, and a very active research area [7]. The conditions for a software module to have defects are hard to identify and, therefore, the defect prediction problem is computationally difficult. From a supervised learning viewpoint, predicting defects is a difficult task as the training data used for building the defect predictors is highly imbalanced. The faulty modules in a software system are greatly outnumbered by the error-free modules. Therefore, conventional learning algorithms are often biased towards the non-defective class. Another important issue in SDP is related to the features used for characterizing software entities (an entity may be a component, class, or module, depending on the targeted level of granularity). As the classical approach in machine learning (ML) is generally to use manually engineered features, traditional software metrics are usually used in SDP as features characterizing the given software entities. Literature reviews in SDP revealed that about 87% [8] of the case studies used procedural or object-oriented software metrics. The two prevalent research directions in the SDP literature are: proposing software features relevant to the discrimination between defective and non-defective software entities and building or recommending high-performing defect prediction models.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
When it comes to large amounts of data, deep learning models are some of the best at making accurate predictions, regardless of the origin of that data. As long as there is correlation between the input information and the output, the models will discover it. In order to use deep learning, the input software features are written in tabular form, a data form that has been extensively researched and for which many models are available [9].
In the present work, we follow both above-mentioned directions. Our study originated from three research questions:
RQ1 Could the performance of predicting software defects be enhanced by enlarging the software features proposed for SDP with conceptual features extracted from the source code? Which is the most appropriate feature set to distinguish between defective and non-defective software entities and to what extent is the performance improvement significant from a statistical perspective?
RQ2 Could the relevance of the conceptual-based software features be empirically sustained by both unsupervised and supervised analyses conducted on a large scale software system?
RQ3 Does deep learning-based defect prediction bring a statistically significant improvement when compared to traditional supervised classifiers?
With these research questions in mind, we have performed an in-depth analysis of the software features' impact on the performance of software defect predictors. We have extended the large collection of SDP features proposed by Herbold et al. [10] with Doc2Vec and LSI-based conceptual software features that capture the semantics of the source code (including comments). An extensive study conducted on different versions of the Calcite data set highlights, through both unsupervised and supervised learning-based analyses, that the conceptual features bring a statistically significant improvement on the performance of SDP. As a second line of research, we have extensively examined the effect of the feature set identified as being the most relevant on the performance of various defect predictors. To the best of our knowledge, a study similar to ours has not been proposed in the literature so far.
The remainder of the paper is organized as follows. Section II states and formalises the SDP problem, highlighting its importance and practical relevance. An extensive review of existing machine learning-based approaches for predicting software defects, as well as of the data sets and features used in the SDP literature, is presented in Section III. Section IV presents the experimental data and methodology used in our work. Section V details the first stage of our research, which consists in a machine learning-based analysis of the software feature sets' relevance. The performances obtained when applying various defect prediction models are comparatively analysed in Section VI. The threats to validity are discussed in Section VII, while Section VIII highlights the conclusions of our paper and draws directions to further extend our study.

II. PROBLEM STATEMENT AND RELEVANCE
Software defects are logic or implementation errors that cause the system to operate in unintended ways or to produce incorrect results. SDP consists in identifying the software components that contain defects.
Let us consider a software system $Syst$ described by a set of software entities (modules, classes, methods or functions, depending on the chosen granularity), $Syst = \{e_1, e_2, \ldots, e_n\}$. The software entities are represented as numerical vectors and are characterized by a set of software features (usually software metrics) $SF = \{sf_1, sf_2, \ldots, sf_\ell\}$. Thus, each element from the vector associated with a software entity represents the value of a software feature (or metric) computed for that entity. A software entity $e_i \in Syst$ is represented as an $\ell$-dimensional vector, $e_i = (e_{i1}, e_{i2}, \ldots, e_{i\ell})$, where $e_{ij}$ expresses the value of the software metric $sf_j$ computed for the software entity $e_i$.
From the perspective of supervised machine learning, SDP can be formulated as a binary classification problem. There are two possible target classes for the software defect predictor (or classifier): the positive class of the defective software entities (labeled as "+" or "1"), and the negative class of the defect-free software entities (labeled as "-" or "0"). A training data set including both positive and negative samples will be used for building the software defect predictor that will be further used for classifying unseen instances (software entities) in order to predict their defect-proneness.
The target function to be learned in an SDP task is the mapping $t : Syst \to \{\text{"+"}, \text{"-"}\}$, which has to assign to each software entity $e$ a class $t(e) \in \{\text{"+"}, \text{"-"}\}$, denoting if the entity is defective or not. Thus, from a supervised classification perspective, the SDP task may be formalised as searching a hypothesis $h \approx t$ (i.e., an approximation of the target function to be learned) that best fits the training data.
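To make the formalisation above concrete, the following minimal sketch builds a hypothesis h from a handful of labeled entity vectors and applies it to an unseen entity. The nearest-centroid rule, the two metrics and all numeric values are illustrative assumptions introduced here; they are not the predictors or the data used in this study.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def nearest_centroid_hypothesis(
    training: List[Tuple[Vector, int]]
) -> Callable[[Vector], int]:
    """Build h: entity vector -> {1 (defective), 0 (defect-free)}.

    Assumption for illustration only: h assigns the class whose
    centroid (mean feature vector) is closest to the entity.
    """
    def centroid(label: int) -> List[float]:
        rows = [x for x, y in training if y == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    c_pos, c_neg = centroid(1), centroid(0)

    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((u - v) ** 2 for u, v in zip(a, b))

    def h(x: Vector) -> int:
        return 1 if sq_dist(x, c_pos) < sq_dist(x, c_neg) else 0

    return h


# Hypothetical entities described by two made-up metrics
# (think of them as a size metric and a complexity metric).
training = [
    ((120.0, 14.0), 1),
    ((150.0, 18.0), 1),
    ((30.0, 2.0), 0),
    ((45.0, 3.0), 0),
    ((25.0, 1.0), 0),
]

h = nearest_centroid_hypothesis(training)

# How well h fits the training data (agreement with the target t).
train_agreement = sum(h(x) == y for x, y in training) / len(training)

# Predicted defect-proneness of an unseen entity.
unseen_prediction = h((140.0, 16.0))
```

Any classifier can play the role of h; the nearest-centroid rule was chosen only because it fits in a few lines.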
SDP has a broad applicability. Clark and Zubrow [3] have analysed the importance of predicting software defects. One important motivation for performing defect prediction is that it helps software managers to measure how software projects evolve. In addition, it supports process management by assessing the software product's quality [3], thus being essential for effective software quality assurance. As shown in [5], SDP significantly reduces the cost of the processes that aim at ensuring the quality of software.
Software quality assurance involves numerous processes, including testing and code review, also called code inspection. SDP makes testing more efficient by allowing testers to focus on the components identified as defective [1]. By increasing the effectiveness of testing, SDP contributes to improving the quality of the next versions of a software project. Identifying software defects is also useful for guiding code review by indicating the locations in the source code that are very likely to be defective and thus require particular attention.
SDP is also useful for predicting software reliability, which is imperative in software development, particularly for large scale and complex software projects [4].

III. BACKGROUND
The current section starts by describing, in Section III-A, the publicly available data sets used as case studies for SDP. Section III-B reviews existing supervised machine learning-based solutions for SDP. The section ends with a description of the features used for SDP, both manually and automatically engineered ones.
The prediction of defects in software systems is a highly active research area. For instance, Hall et al. [7] have identified, in a systematic review of SDP, 208 studies on defect prediction, all published between 2000 and 2010, and numerous other studies have been published since then.
There is great interest in developing new high-performance software defect predictors, as well as in defining new relevant software features on the basis of which to distinguish between defective and non-defective software modules. Therefore, the research efforts in the field of SDP take one of the following two directions: proposing new accurate classifiers or designing new relevant features [11].

A. DATA SETS FOR SDP
The vast majority of existing studies [11]-[22] have considered, as experimental data, some of the SDP data sets available in the Promise Software Engineering Repository [23], which is currently known as SeaCraft (Software Engineering Artifacts Can Really Assist Future Tasks) [24]. They contain static OO metrics (such as the CK metrics proposed by Chidamber and Kemerer [25]) or traditional metrics associated with the quality of the procedural source code (such as the ones proposed by McCabe [26]).
As suggested in the software engineering literature, the publicly available and thus reused SDP data sets are subject to two problems: the noisy labels [10] and the fact that the software features are insufficient or insufficiently relevant [41]. The noisy labels negatively affect the SDP models and make the results of experimental evaluations unreliable [10], while the lack of significant features considerably limits the SDP performance [41].
More studies revealed multiple issues with SZZ, caused by identifying insignificant changes [44], disregarding the field mentioning the affected version from issue reports [45], using a six-month time frame for attributing defects to releases [38] or relying on the supposedly correct labeling of issues in the issue tracking system.
In a very recent such study, Herbold et al. [10] performed an empirical assessment, on 398 releases of 38 Apache projects, focused on the defect labeling performed by the SZZ algorithm. The study concluded that SZZ misses approximately one fifth of the bug fixing commits, while only about half of the commits identified as bug fixing commits were truly bug fixes.
The authors have also assessed SZZ-RA [46], which is the state-of-the-art variant of SZZ. The experimental results show that SZZ-RA does not miss bug fixing commits, but the problem that only about half of the commits detected as bug fixes are indeed bug fixes persists.
In order to mitigate these problems, Herbold et al. [10] have slightly extended SZZ-RA by adding a filter to ignore documentation and test files and by linking commits and issues based on the Jira issue pattern.

B. SUPERVISED LEARNING-BASED SDP APPROACHES
Predictive machine learning models have been extensively applied in the SDP literature with the goal of predicting software defects.
Malhotra [47] has compared statistical and ML methods as solutions for SDP. In particular, Logistic Regression has been compared with six ML approaches comprising Decision Trees, Artificial Neural Networks, Support Vector Machines, Cascade Correlation Networks, Group Method of Data Handling (GMDH) Polynomial Networks, and Gene Expression Programming. These learning models have been evaluated on two Ar data sets and the best performance has been obtained using Decision Trees.
Panichella et al. [48] proposed a combined approach called COmbined DEfect Predictor (CODEP), which combines the classifications provided by different ML techniques to improve the detection of defective entities. CODEP has been evaluated on ten open source software systems in the context of cross-project SDP. The authors concluded that the accuracy of the predictions has been improved by combining different classifiers.
Xuan et al. [15] investigated the performance of within-project defect prediction based on 10 defect data sets from the Promise repository using six state-of-the-art ML approaches. Ten-fold cross-validation has been performed based on each data set and several evaluation measures were reported.
In order to better cope with noise and imprecise information, Marian et al. [16] have investigated a fuzzy Decision Tree method for SDP. The experimental results obtained on JEdit and Ant demonstrated the superior performance of the fuzzy approach when compared to a non-fuzzy approach.
A solution for SDP using a Bayesian approach has been proposed by Okutan and Yildiz [20]. The authors have applied the K2 algorithm [49] on nine publicly available data sets. Two new software metrics have been added to the software metrics from the Promise repository: number of developers (NOD) and lack of coding quality (LOCQ). The efficiency of different software metric pairs has been comparatively analyzed.
Highly appealing in the SDP literature are the cross-project defect predictors. They allow predicting defects in a target software system based on historical data from other systems. Therefore, they are more general and allow predicting defects in projects with limited historical data.
Cross-project SDP has been approached in several studies, including the ones of Yu and Mishra [50], Jaechang and Sunghunin [51] and Canfora et al. [14].

C. FEATURES USED FOR SDP
The SDP literature comprises various approaches proposed to engineer features relevant for SDP (usually software metrics that are considered to be appropriate for discriminating between defects and non-defects) as well as methods to automatically learn features using machine learning techniques, particularly deep learning models.
Regarding the insufficiency of relevant features for enabling the discrimination between defective and defect-free software entities, a relatively recent but active research direction in the SDP literature aims at defining new software features that are relevant for SDP.
Along this direction, in the last two decades, a noteworthy number of research studies have focused on the relevance of coupling and cohesion for predicting software defects [52]. While, until relatively recently, the studies focused exclusively on the coupling and cohesion metrics from the traditional suites (such as the Chidamber and Kemerer [25] metrics suite), the latest studies are concerned with updating, extending and complementing them by proposing new relevant coupling and cohesion measures [53], [54].
A systematic mapping study on object-oriented (OO) coupling and cohesion metrics has been performed by Tiwari and Rathore [52]. The authors selected 137 research papers.
Of these, 17% introduced new coupling metrics, 8% introduced new cohesion metrics, while 24% introduced both coupling metrics and cohesion metrics. The rest of the studies (51%) focused only on assessing the existing metrics suites. The prevalent criterion by which the coupling and cohesion metrics have been evaluated is their relevance for predicting software defects.
In a subsequent study, Rathore and Kumar [55] have conducted a survey on the existing approaches for SDP, with emphasis on the considered software metrics, quality of data, prediction models and performance indicators. Their review uncovered that the largest share of the studies (39%) used OO metrics. The explanation Rathore and Kumar have formulated for the high use of OO metrics for SDP is the inability of traditional software metrics to capture OO features that underlie the modern software development practices, including coupling and cohesion. The authors concluded that more studies concerned with the proposal and the assessment of new metrics suites are necessary.
An approach for automatically learning semantic features from token vectors extracted from Abstract Syntax Trees (ASTs) has been proposed by Wang et al. [11]. The authors have used Deep Belief Networks (DBNs) to automatically learn features from token vectors extracted from the programs' ASTs. The features have then been used for both within-project and cross-project SDP. Ten open source projects from the Promise repository have been considered. The semantic features have been comparatively evaluated against 20 traditional features (software metrics in the Promise repository), as well as the term frequencies of the AST nodes (i.e., the ones used to train the DBNs). The evaluation results have confirmed that the semantic features are able to lead to superior predictive performance and thus are more relevant to SDP.
Features that have been automatically learned through a process similar to the one proposed by Wang et al. [11] are combined with traditional features in a subsequent study performed by Li et al. [17]. To generate semantic and syntactic features, Li et al. have replaced DBN with CNN, given that the Deep Learning community claims that CNN is better than DBN, since the former can capture local patterns better than the latter. The automatically learned features have been fed into a Logistic Regression classifier, which has been evaluated on 7 open source software projects from Promise. The empirical results confirmed that the CNN based prediction model outperforms the classifiers based on traditional features, while combining the automatically learned features with traditional features raises performance even more.
Another study proposing the use of AST-based features for SDP is the one performed by Dam et al. [18]. After highlighting that traditional software metrics are not so effective, while code tokens carry semantic information, the authors have proposed a tree-structured network of Long Short-Term Memory (LSTM) units as an SDP prediction model fed with AST embeddings. The features generated by LSTMs have been fed into traditional classifiers (Logistic Regression and Random Forest). As evaluation case studies, the authors considered the same ten open source Java projects from the Promise repository as in the study performed by Wang et al. [11], but they have also considered a data set from open source projects contributed by Samsung and developed in the C programming language. In the empirical evaluation, Random Forest performed better on the Samsung data set, while Logistic Regression performed better on the Promise data sets.
Huo et al. [19] have proposed Convolutional Neural Network for Comments Augmented Programs (CAP-CNN) as a model for SDP. Their approach is based on using pretrained Word2vec to encode code and comments into numeric vectors and then feeding the so-obtained vectors into two separate CNNs. Eight Promise data sets have been employed in the empirical evaluation, with resampling used to balance them. The evaluation results highlighted that CAP-CNN outperformed, for most experiments, CNN, standard classifiers such as Logistic Regression or Naive Bayes, and the Deep Belief Network [11].
In a previous study [56], we have also proposed a semantic features based hybrid SDP model combining Artificial Neural Networks with Gradual Relational Association Rules (GRARs). After encoding the source code and comments into fixed-length numeric vectors, GRARs mining has been employed to uncover interesting GRARs that are able to discriminate between defective and defect-free software components. Based on the differentiating GRARs, a Multilayer Perceptron is trained in order to learn the classification function. The empirical evaluation has been performed on 3 software projects from the Promise repository. The experimental results revealed that considering semantic features instead of traditional metrics predominantly leads to superior SDP performance.
In 2020, Wang et al. [57] extended their prior publication [11], by doubling the within-project SDP with cross-project SDP, proposing new techniques to process incomplete code, updating the performance assessment scenarios and performing new experiments on open-source commercial projects. The experimental results reconfirmed that the proposed DBN-based semantic features outperform traditional SDP features.
Very recently, Šikić et al. [58] have proposed DP-GCNN, an SDP model based on a Convolutional Graph Neural Network (GCNN), which is fed with AST data. The neural network architecture employed is specifically tailored for graph data. As experimental data, the authors have considered 7 SDP data sets from the Promise repository. The experimental results revealed that DP-GCNN's performance is superior to those of the traditional SDP models and comparable with those of the state-of-the-art AST-based SDP models, including [57].
There are also traditional metrics based Deep Learning approaches in the SDP literature. Two recent studies [59], [60] have proposed Siamese Deep Neural Networks for SDP.
Unlike the previously reviewed papers, the study was performed on NASA data sets instead of Promise data sets.

IV. METHODOLOGY
This section introduces the methodology underlying our study on how the software features used in SDP impact the SDP performance. The Calcite data set used as case study and the SDP software features proposed in the literature for this data set are described in Section IV-A.
The pipeline proposed for our study consists of the following stages:
1) Adding conceptual features. The additional set of conceptual software features proposed for capturing the semantics of the source code in order to enlarge the original feature set for the Calcite data set (Section IV-A) is introduced and detailed in Section IV-B.
2) Features relevance analysis. An in-depth analysis on various subsets (described in Section IV-C) of the original features set extended with the conceptual features proposed at the previous stage is then conducted in Section V. An extensive study performed on sixteen versions of Calcite has the goal of determining, through a supervised learning-based analysis reinforced by an unsupervised one, the set of features that brings a statistically significant improvement on the performance of predicting software defects on Calcite data.
3) Predictive models performance analysis. The last stage (Section VI) consists in a study on the performance of various defect predictors on the Calcite data set using the most relevant feature set previously identified.

A. CASE STUDY
As a case study, we selected Apache Calcite, an open-source dynamic data management framework [61]. We are making our data sets publicly available [62]. We started from the data provided by Herbold et al. [10], namely the values for 4189 software features for each software instance from Calcite and the defect labels produced by their extended version of SZZ-RA [46].
• Metrics based on the warnings produced by the PMD static analysis tool [64].
• The number of different types of changes [65] and refactorings [66] from the last six months, collected using changeSHARK and refSHARK [63], respectively.
• Code churn metrics proposed by Moser et al. [67], Hassan [68] and D'Ambros et al. [31].
Additionally, all the 13 schemes proposed by Zhang et al. [69] for aggregating class, interface, enum, method, attribute and annotation metrics have been applied to expand the feature space.
We are focusing in our study on five feature subsets: the entire set of software metrics and four other feature subsets with the largest dimensionality. The feature sets considered in our case study are summarized in Table 1. Each row from the table indicates a feature (sub)set, its dimensionality and a brief description of the contained features.
Descriptive statistics for the available versions of Calcite are presented in Table 2. For all Calcite versions, the total number of software instances, number of defective software instances and defective rate are given. Table 2 reveals that both the defective rate and the number of faulty entities have a general decreasing tendency during the evolution of the Calcite software system. The latest release of the software (version 1.15.0) has the lowest defective rate and the smallest number of software defects. This tendency is expected since, as the system evolved, it was improved and defects were corrected.
To better understand the complexity of the software defect prediction task during the evolution of the Calcite software, we computed for each data set (corresponding to a software version) three difficulty measures. Following the definition given by Zhang et al. [70], the difficulty of a given class c ("+" or "-" in our case) is computed as the proportion of software entities labeled as c for which the nearest neighbor (computed using the Euclidean distance) belongs to the opposite class (i.e., "-" or "+", respectively). The overall difficulty of a data set is expressed as the weighted average of the difficulties computed for the defective (positive) and non-defective (negative) classes. Intuitively, the difficulty of a certain class indicates how hard it is to distinguish the instances belonging to that class, considering a given vectorial representation for the software entities. Table 3 presents the values for the previously described difficulty measures for each version of the Calcite system, considering the entire feature set (labeled as All in Table 1). The second column from the table denotes the difficulty for the defective class (the positive one), while the third column depicts the difficulty values for the non-defective class (the negative one). Figure 1 plots the variation of the defective rates and difficulties for each version of the Calcite data set. One can observe from the figure that there is a strong correlation between the defective rate and the difficulty values during the evolution of the software. The same relationship may be observed from Table 4 that presents the Pearson correlation coefficients [71] between the defective rates and the difficulty values for all versions of the Calcite software.
However, even if there is a strong linear relationship between the defective rate and the difficulty for the positive class, the correlation is inverse (negative) and thus it indicates that the defective rate and the difficulty for the defective class tend to move in opposite directions. This is not unexpected since, intuitively, the smaller the number of defects is, the harder it is to differentiate them from the entire set of entities.
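As a reminder of how a negative Pearson coefficient arises, the sketch below computes the coefficient for two short series. The values are hypothetical and chosen only so that one series falls while the other rises; they are not the Calcite figures reported in Tables 2 to 4.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of the same length (>= 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-version values (NOT taken from the paper's tables):
# the defective rate decreases while the positive-class difficulty grows.
defective_rate = [0.20, 0.15, 0.10, 0.05]
positive_difficulty = [0.60, 0.70, 0.80, 0.90]

r = pearson(defective_rate, positive_difficulty)
```

For these made-up series the relationship is exactly linear and decreasing, so r is -1 up to floating-point rounding.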
Due to the severe imbalance between the two classes (the defective entities are greatly outnumbered by the non-defective ones), the main difficulty is that of predicting the positive class. Therefore, we consider, as the real difficulty, the one on the positive class. A difficulty of 1.0 means that every instance of the respective class has as its nearest neighbour an entry from the other class. That level of dissimilarity between positive entries makes it extremely difficult for a classification model to correctly identify that class. It can be observed that some data sets have difficulties that come close to 1.0, while for all of them, the positive entries are mostly surrounded by negative ones (difficulty > 0.5).
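The difficulty measures described above can be sketched as follows. This is a minimal illustration of the nearest-neighbour definition, not the code used in the study; in particular, weighting the overall difficulty by class size is an assumption made here, since the weights are not spelled out in the text, and the toy vectors are hypothetical.

```python
from math import dist
from typing import List, Sequence


def class_difficulty(
    vectors: List[Sequence[float]], labels: List[int], target: int
) -> float:
    """Proportion of entities labeled `target` whose nearest neighbour
    (Euclidean distance, the entity itself excluded) has the other label."""
    members = [i for i, y in enumerate(labels) if y == target]
    if not members:
        raise ValueError("no entity carries the target label")
    misses = 0
    for i in members:
        nearest = min(
            (j for j in range(len(vectors)) if j != i),
            key=lambda j: dist(vectors[i], vectors[j]),
        )
        if labels[nearest] != target:
            misses += 1
    return misses / len(members)


def overall_difficulty(
    vectors: List[Sequence[float]], labels: List[int]
) -> float:
    """Average of the class difficulties, weighted by class size
    (the weighting scheme is an assumption of this sketch)."""
    n = len(labels)
    return sum(
        labels.count(c) / n * class_difficulty(vectors, labels, c)
        for c in set(labels)
    )


# Hypothetical toy data: four defect-free entities (0) in a tight group,
# two defective entities (1) far away, one defective entity near the group.
vectors = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5), (5, 6), (3, 0)]
labels = [0, 0, 0, 0, 1, 1, 1]

positive_difficulty = class_difficulty(vectors, labels, 1)
negative_difficulty = class_difficulty(vectors, labels, 0)
dataset_difficulty = overall_difficulty(vectors, labels)
```

In the toy data only the defective entity at (3, 0) has a defect-free nearest neighbour, so the positive-class difficulty is 1/3 and the negative-class difficulty is 0.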

B. PROPOSED CONCEPTUAL-BASED FEATURES
As shown in Section III, numerous ML techniques applied for predicting software defects are based on using classic software metrics as input features. Using these data sets, SDP models can be built without considering the source code of the analyzed software.
Rathore and Kumar [55] concluded their extensive review on existing SDP approaches by emphasizing the need to propose and validate new features that can be relevant for discriminating between defective and non-defective software entities.
Moreover, various studies from the literature [17], [18] reveal that the traditional software metrics are unable to capture the semantics of the source code. Besides the structural relationships existing in a software system and expressed by most of the software metrics, it would be relevant to consider the textual information contained in the source code as well. In this regard, it is agreed that conceptual software features extracted from the source code (identifiers, comments, etc) are able to capture semantic characteristics that structural metrics are not entirely able to express. Extracting conceptual (semantic) information from comments and identifiers within the source code has been also investigated in the software engineering literature for expressing conceptual coupling between software components [72], [73].
Since 2016, SDP many researches focused on using DL models and semantic features extracted from the source code. Recent research papers (Yang et al. [74] and Wang et al. [11]) introduced Deep Belief Neural Networks (DBN) for performing defect prediction based on code analysis. Wang et al. [11] argued that besides the classical software metrics, the semantics of code should also be considered for SDP. The authors proposed DBN to automatically learn semantic features from input vectors of tokens extracted from the AST of the source code. Dam et al. [18] first used Long Short Term Memory (LSTM) networks to learn semantic features from the AST which were used to train a Logistic Regression (LR) and a Random Forest (RF) model. Traditional metrics were combined with features learnt from AST using a Convolutional Neural Network (CNN) by Li et al. [17]. Šikic et al. [58] used a graph convolutional neural network (GCNN) for processing the information of the nodes and edges from the AST of the source code for classifying the module as being defective or non-defective.
Doc2Vec [75] and LSI [76] models may also be used for learning conceptual features from the source code in an unsupervised manner. Both models are used for representing texts (in our case, source code) as fixed-length numerical vectors.
Doc2Vec, or Paragraph Vector, is a multilayer perceptron (MLP) based model proposed by Le and Mikolov [75]. It allows expressing variable-length textual information as a fixed-length dense numeric vector, called paragraph vector, thus being an alternative to common models such as bag-of-words and bag-of-n-grams.
A first advantage of Doc2Vec over the traditional models is that it considers the semantic distance between words [75]. Therefore, private will be closer to protected than to boolean. An additional advantage over bag-of-words is that it also takes into consideration the word order, at least in a small context. Although bag-of-n-grams, with a large n, also takes into account the word order in short contexts, it suffers from high dimensionality and data sparsity.
Doc2Vec extends Word2Vec, which learns distributed vector representations of words. Doc2Vec learns distributed representations for variable-length pieces of text, called paragraphs, ranging from sentences to entire documents.
The experimental results of previous studies we have conducted [54], [56] revealed that combining Doc2Vec and LSI is appropriate and increases the predictive performance.
Using Doc2Vec and LSI, the entities from a software system are represented as conceptual vectors. The conceptual vectors are vectors of numerical values corresponding to a set $S = \{s_1, s_2, \ldots, s_\ell\}$ of conceptual (or semantic) features unsupervisedly learned from the source code. Thus, a software entity $e_i$ is represented as an $\ell$-dimensional vector in Doc2Vec and LSI spaces:
(1) $e_i^{Doc2Vec} = (e_{i1}^{Doc2Vec}, \cdots, e_{i\ell}^{Doc2Vec})$, where $e_{ij}^{Doc2Vec}$ ($\forall\, 1 \le j \le \ell$) denotes the value of the j-th semantic feature computed for the entity $e_i$ by using Doc2Vec.
(2) $e_i^{LSI} = (e_{i1}^{LSI}, \cdots, e_{i\ell}^{LSI})$, where $e_{ij}^{LSI}$ ($\forall\, 1 \le j \le \ell$) denotes the value of the j-th semantic feature computed for the entity $e_i$ by using LSI.
In our study, for extracting the conceptual vectors corresponding to the software entities, the unsupervised learning models Doc2Vec and LSI are used. Both Doc2Vec [75] and LSI [76], also known as Latent Semantic Analysis (LSA), are models aimed to represent texts of variable lengths as fixed-length numeric vectors capturing semantic characteristics, but Doc2Vec is a prediction-based model trained using backpropagation together with the stochastic gradient descent, while LSI is a statistical, count-based model.
We opted for $\ell = 30$ as the length of the conceptual vectors extracted using Doc2Vec and LSI. For building the training corpora, we filtered the source code (including comments) corresponding to each class so as to keep only the tokens presumably carrying semantic meaning. Thus, operators, special symbols, English stop words and Java keywords have been eliminated. For both Doc2Vec and LSI, we have used the implementation offered by Gensim [77].
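The filtering step described above can be illustrated by the following minimal sketch. It is not the preprocessing code used in the study: the function name `semantic_tokens`, the lower-casing of tokens and the two (deliberately incomplete) word lists are assumptions introduced only for illustration.

```python
import re

# Illustrative, incomplete word lists (assumptions); the exact lists used in the study are not given.
JAVA_KEYWORDS = {
    "abstract", "boolean", "class", "else", "final", "for", "if", "import",
    "int", "new", "private", "protected", "public", "return", "static",
    "void", "while",
}
ENGLISH_STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Matches identifiers and words only, so operators, numeric literals
# and special symbols are never returned as tokens.
WORD = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def semantic_tokens(source):
    """Return the lower-cased tokens of a Java source text that may carry meaning."""
    tokens = (token.lower() for token in WORD.findall(source))
    return [t for t in tokens
            if t not in JAVA_KEYWORDS and t not in ENGLISH_STOP_WORDS]

sample = "public int rowCount = 0; // the number of rows"
print(semantic_tokens(sample))
```

The resulting token lists (one per class) would then form the corpus on which the Doc2Vec and LSI models are trained.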

C. FEATURE SETS USED
In this section we describe the feature sets that will be further used in Section V in our study performed on the Calcite data set. The proposed study aims to determine, through a supervised learning-based analysis reinforced by an unsupervised one, the set of features that brings a statistically significant improvement of the SDP performance on the Calcite data.
Twelve feature sets will be further experimented:
1.-5. The first five feature sets (labeled as All, SM, PMD, D'Ambros, AST) are the features (sub)sets described in Table 1.
6.-8. The next three feature sets, labeled as AST+SM, AST+PMD and AST+D'Ambros, are obtained by fusing the AST-based features and the SM, PMD and D'Ambros features sets, respectively. We decided to use the AST-based features in all these combinations since the literature reveals various approaches [17], [18], [58] in which deep learning models are used to learn relevant features starting from the ASTs of the source code.
9.-10. The next two feature sets, denoted by Doc2Vec and LSI, are the conceptual features from Doc2Vec and LSI spaces, as described in Section IV-B.
11. The feature set labeled as Doc2Vec+LSI is represented by fusing the Doc2Vec and LSI features.
12. The last feature set, denoted by All+Doc2Vec+LSI, is obtained by fusing the feature set All with the conceptual features within the Doc2Vec+LSI feature set.
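As a minimal sketch of how fused feature sets such as Doc2Vec+LSI could be built, the snippet below concatenates the feature vectors of one entity. Interpreting "fusing" as plain concatenation is an assumption, and the helper `fuse` and the placeholder values are hypothetical.

```python
def fuse(*feature_vectors):
    """Fuse the feature vectors of one entity by concatenating them in order."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused

# Hypothetical entity: 30 Doc2Vec values and 30 LSI values (placeholders, not real data).
doc2vec_features = [0.1] * 30
lsi_features = [0.2] * 30

doc2vec_lsi = fuse(doc2vec_features, lsi_features)
print(len(doc2vec_lsi))
```

Under this assumption, the Doc2Vec+LSI representation of an entity has twice the length of each conceptual vector.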

V. FEATURE SETS RELEVANCE ANALYSIS
As directions for further research in SDP, Herbold et al. [10] have recommended that analyses be performed in order to uncover the most relevant subsets of the extensive metrics set they proposed. Following this idea and the methodology introduced in Section IV, with the goal of answering RQ2, we examine the feature sets proposed in Section IV-C in order to decide, through supervised and unsupervised learning-based analyses, their relevance in the context of SDP applied on the Calcite data set.
In Section V-A a supervised learning-based analysis will be conducted to decide the best feature set (from those described in Section IV-C), namely the set of features that provides a statistically significant performance improvement for a deep learning defect predictor applied on all the versions of the Calcite software. Afterwards, the results of the supervised learning-based analysis are strengthened in Section V-B by an unsupervised learning-based study.

A. SUPERVISED ANALYSIS
For determining which is the most relevant feature set for characterizing the software entities from the Calcite system (i.e., the set of features able to discriminate best between defective and non-defective entities) we decided to use a highly performant deep learning classifier and to evaluate its performance (in terms of multiple performance evaluation metrics) on all versions of Calcite, described by using all 12 feature sets (described in Section IV-C).
The deep learning classifier we decided to use, denoted by DL-FASTAI, is implemented in the FastAI machine learning library [78]. It is composed of an Artificial Neural Network combined with embeddings of the input layer. The architecture consists of 1 input, 1 output and 2-4 hidden layers, depending on the number of features. Compared to other deep learning models, especially Convolutional Neural Networks (CNNs) used in computer vision, this ANN model is very small and fast, with training times under 2 minutes on our data set and inference time under 1 second per instance at runtime, making it suitable for real-time use. The model is trained using the FastAI 'fit one cycle' method, which uses a learning rate that varies according to a specific pattern: first it increases, then it decreases and the process is repeated for each epoch.
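To illustrate the "first increases, then decreases" pattern of the learning rate, the following sketch computes a one-cycle style schedule with cosine interpolation. It is not the FastAI implementation; the function names and the parameter values (`lr_max`, `div`, `div_final`, `pct_start`) are assumptions chosen for illustration.

```python
import math

def cosine_anneal(start, end, position):
    """Move smoothly from start (position 0) to end (position 1) along a cosine."""
    return end + (start - end) / 2 * (math.cos(math.pi * position) + 1)

def one_cycle_lr(progress, lr_max=1e-3, div=25.0, div_final=1e5, pct_start=0.25):
    """Learning rate at a given progress (between 0 and 1) through one cycle."""
    if progress < pct_start:
        # Warm-up phase: the rate grows from lr_max / div to lr_max.
        return cosine_anneal(lr_max / div, lr_max, progress / pct_start)
    # Annealing phase: the rate decays from lr_max to lr_max / div_final.
    return cosine_anneal(lr_max, lr_max / div_final,
                         (progress - pct_start) / (1 - pct_start))

for progress in (0.0, 0.1, 0.25, 0.5, 1.0):
    print(progress, one_cycle_lr(progress))
```

A small rate at the beginning and at the end of the cycle, with a larger rate in between, is the behaviour the training method relies on.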
In order to evaluate the performance of the DL-FASTAI model, we employed the following evaluation methodology. The data was split into 70% train, 10% validation and 20% test sets. In order to obtain consistent results, cross-validation over 10 experiments with different splits was performed.
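The evaluation methodology above can be sketched as follows. The helper `split_indices`, the seeds and the data set size are hypothetical, and the sketch uses a plain random shuffle; whether the original splits were stratified by class is not stated, so no stratification is assumed here.

```python
import random

def split_indices(n_instances, seed, train=0.7, valid=0.1):
    """Shuffle the instance indices and cut them into train/validation/test parts."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = round(n_instances * train)
    n_valid = round(n_instances * valid)
    return (indices[:n_train],
            indices[n_train:n_train + n_valid],
            indices[n_train + n_valid:])

# Ten experiments, each with a different split (the seeds are arbitrary).
splits = [split_indices(1000, seed) for seed in range(10)]
train_idx, valid_idx, test_idx = splits[0]
print(len(train_idx), len(valid_idx), len(test_idx))
```

The performance metrics would then be computed on the test part of each split and averaged over the ten experiments.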
During the cross-validation process, the confusion matrix for the binary classification task has been computed for each testing subset. Based on the values from the confusion matrix

7) Area under the Precision-Recall curve (AUPRC). Somewhat similarly to the ROC curve, the Precision-Recall curve represents a two-dimensional plot of (sensitivity, precision) points computed for different values of the threshold applied for deciding the output class. For the classifiers for which the output is the class label (obtained without thresholding the output value), the point (sensitivity, precision) is linked to the points at (0,1) and (1,0), and the area under the resulting trapezoid is computed as $\mathrm{AUPRC} = \frac{\mathrm{Prec} + \mathrm{Sens}}{2}$. AUPRC is considered a good measure for imbalanced classification and it also has higher values for better classifiers.
8) Matthews Correlation Coefficient (MCC) [80] is also considered to be a good evaluation metric for imbalanced data sets and is computed as $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$.

11) Overall F-score (F1), computed as the average between F-score + and F-score − .
12) Weighted F-score (F1 w ), computed as the weighted average between F-score + and F-score − , where the weights are the defective and non-defective rates, respectively.
All the previously mentioned evaluation measures range from 0 to 1, except MCC, which ranges from -1 to 1. For better classifiers, larger values are expected.
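As an illustration, several of the measures above can be computed from a confusion matrix as in the sketch below. The function names are hypothetical, AUPRC is computed with the single-point formula used for label-only classifiers, and the per-class F-scores are assumed to be the harmonic mean of the precision and recall of the respective class.

```python
import math

def safe_div(numerator, denominator):
    """Return numerator / denominator, or 0.0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0

def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return safe_div(2 * precision * recall, precision + recall)

def imbalance_metrics(tp, fp, tn, fn):
    n = tp + fp + tn + fn
    prec = safe_div(tp, tp + fp)   # precision of the "+" class
    sens = safe_div(tp, tp + fn)   # sensitivity (recall of the "+" class)
    npv = safe_div(tn, tn + fn)    # precision of the "-" class
    spec = safe_div(tn, tn + fp)   # recall of the "-" class
    f_pos = f_score(prec, sens)
    f_neg = f_score(npv, spec)
    d = safe_div(tp + fn, n)       # defective rate
    mcc = safe_div(tp * tn - fp * fn,
                   math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {
        "AUPRC": (prec + sens) / 2,
        "MCC": mcc,
        "F-score+": f_pos,
        "F-score-": f_neg,
        "F1": (f_pos + f_neg) / 2,
        "F1w": d * f_pos + (1 - d) * f_neg,
    }

# Hypothetical confusion matrix (not taken from the experiments).
m = imbalance_metrics(tp=10, fp=5, tn=80, fn=5)
print(m)
```

On imbalanced data the weighted F-score is dominated by the majority class, which is why it is reported alongside the measures focused on the positive class.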
For all 16 versions of the Calcite system and all 12 feature sets selected for analysis (as presented in Section IV-C), the 12 evaluation metrics previously described have been computed. For a given version v (v ∈  Table 5 presents, for each Calcite version v, the winning feature set(s), win(v), and the number of performance metrics, n(v, win(v)), whose values are the highest for the winning feature set(s). VOLUME 10, 2022  Table 6 presents the feature sets fs for which a non-zero value has been obtained for the WIN (fs) measure. The feature sets are listed in the decreasing order of the WIN values.
From Table 6 we observe that the feature set with the maximum number of wins is Doc2Vec+LSI, the feature set obtained by fusing the proposed Doc2Vec and LSI semantic features. We remark that the feature set containing only the original features (All) [10] was not the winning feature set for any of the Calcite versions. Still, the joint feature set All+Doc2Vec+LSI was the second best feature set (the winning feature set for 3 Calcite versions). This suggests that the conceptual features extracted from the source code through Doc2Vec and LSI are the best for distinguishing between the defective and non-defective software entities.
Tables 7 and 8 present the performance metrics values obtained by evaluating the DL-FASTAI classifier on all Calcite versions characterized by the Doc2Vec+LSI, All+Doc2Vec+LSI and All feature sets. 95% confidence intervals (CIs) are reported for the results. For each of the Calcite versions, the feature set that provides the best performance metrics (the maximum number of best performance values) is highlighted.
From Tables 7 and 8 one observes that the Doc2Vec+LSI feature set is the best for 67% of the Calcite versions (10 out of 15), when compared to the All+Doc2Vec+LSI and All feature sets. For verifying the statistical significance of the differences observed between the evaluation metrics values obtained for the Doc2Vec+LSI features and the All+Doc2Vec+LSI/All features, a one-tailed paired Wilcoxon signed-rank test [81], [82] has been applied. The sample of values representing the performance metrics values obtained by the DL-FASTAI classifier for the Calcite versions and the Doc2Vec+LSI feature set was tested against the samples of values obtained for the All+Doc2Vec+LSI and All features, respectively. The obtained p-values of 0.0037779 (for Doc2Vec+LSI vs. All+Doc2Vec+LSI) and 0.000309 (for Doc2Vec+LSI vs. All) confirm a statistically significant improvement achieved by the Doc2Vec+LSI feature set, at a significance level of α = 0.01.
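For readers unfamiliar with the test, the sketch below computes an exact one-tailed p-value of the paired Wilcoxon signed-rank test for small samples. It is an illustrative implementation, not the tool used in the study; the function name is hypothetical, zero differences are discarded and tied absolute differences receive average ranks.

```python
def wilcoxon_one_tailed(x, y):
    """Exact p-value of the paired signed-rank test for H1: x tends to exceed y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are discarded
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    doubled_rank = [0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        for k in range(start, end + 1):
            # Average rank of positions start+1..end+1, times two (always an integer).
            doubled_rank[order[k]] = (start + 1) + (end + 1)
        start = end + 1
    w_plus = sum(r for r, d in zip(doubled_rank, diffs) if d > 0)
    # Under H0 each difference is positive or negative with probability 1/2:
    # count how many of the 2^n sign assignments produce each positive-rank sum.
    counts = {0: 1}
    for r in doubled_rank:
        updated = dict(counts)
        for total, ways in counts.items():
            updated[total + r] = updated.get(total + r, 0) + ways
        counts = updated
    favourable = sum(ways for total, ways in counts.items() if total >= w_plus)
    return favourable / 2 ** len(diffs)

# Hypothetical paired samples (not the values from Tables 7 and 8).
p_value = wilcoxon_one_tailed([2, 3, 4, 5, 6], [1, 1, 1, 1, 1])
print(p_value)
```

With only five pairs, even a sample in which every difference is positive cannot reach the 0.01 level, which illustrates why the number of paired observations matters for this test.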
The superiority of the Doc2Vec+LSI feature set with respect to All+Doc2Vec+LSI and All features is strongly correlated with the overall difficulty values, as shown in Figure 2. The figure plots the overall difficulty values computed for all Calcite versions and the feature sets Doc2Vec+LSI, All+Doc2Vec+LSI and All. A statistically significant difference, at a significance level of α = 0.01, was observed between the difficulties obtained for the Doc2Vec+LSI features and the difficulties for the other feature sets (All+Doc2Vec+LSI and All) as provided by a one-tailed paired Wilcoxon signed-rank test: p-values of 0.000876 (between Doc2Vec+LSI and All) and 0.0024120 (between Doc2Vec+LSI and All+Doc2Vec+LSI).
Using the values from Tables 7 and 8 for the Doc2Vec+LSI feature set, we computed the Pearson correlation coefficients between the sample of defective rates for all versions of the system and the values obtained for sensitivity (Sens) and AUC. A strong correlation (0.66) has been observed between the sensitivity values and the defective rates and a moderate correlation (0.42) between the AUC values and the defective rates. Figure 3 depicts the variation of sensitivity, AUC and defective rates for the Calcite versions. A higher strength of the association between sensitivity values obtained by the DL-FASTAI classifier and the defective rates for the Calcite versions is expected. As shown in Figure 1, if the defective rate increases there is a general decrease of the difficulty for the ''+'' class and thus it is easier to recognize the defects, i.e., it is very likely that the DL-FASTAI classifier will obtain a higher true positive rate (sensitivity).
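The Pearson correlation coefficient used above can be computed as in the following sketch. The helper `pearson` is hypothetical and the sample values are invented for illustration; they are not the values from Tables 7 and 8.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values for four versions of a system.
defective_rates = [0.05, 0.10, 0.15, 0.20]
sensitivities = [0.30, 0.45, 0.40, 0.60]
r = pearson(defective_rates, sensitivities)
print(r)
```

A coefficient close to 1 indicates that sensitivity tends to grow with the defective rate, while a value close to 0 indicates no linear association.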

B. UNSUPERVISED ANALYSIS
In order to strengthen the supervised learning-based analysis performed in Section V-A and to better highlight that the Doc2Vec+LSI feature set is superior to the feature set proposed in the literature [10] (denoted by All in our study) in terms of differentiating between defective and non-defective software components, we applied t-distributed Stochastic Neighbor Embedding (t-SNE) [83].
t-SNE is an unsupervised non-linear technique used for dimensionality reduction and feature extraction, as well as for visualizing and exploring high-dimensional data. It primarily focuses on retaining the local structure of the data, but also considers preserving its global structure. It works by finding similarities between the data points, showing similar points to be close to each other on the visual representations. The algorithm was implemented using the scikit-learn library [84].
The plots from Figures 4, 5 and 6 reveal what the positive difficulty already predicted: the defective instances are very different from each other and are usually similar to some non-defective ones. This behaviour is accentuated as the data set becomes more imbalanced, resulting in an almost uniform distribution of the positive instances in Figure 6. Furthermore, for the All feature set, small heterogeneous clusters are formed, making it even harder for a classifier to distinguish between positive and negative instances, due to their very high similarity. This unsupervised analysis supports the supervised one: F-score + is higher for version 1.0.0 than for version 1.15.0 and the metrics in general are better for the Doc2Vec+LSI feature set than for the All one.

VI. PREDICTIVE MODELS PERFORMANCE ANALYSIS
Following the methodology introduced in Section IV, this section presents the last stage of our study. More specifically, we are going to comparatively analyse the performance of various defect predictors (DL-FASTAI, XGBoost, classical ML-based defect predictors and thus answering our RQ3; (2) to test the hypothesis that DL-FASTAI brings a statistically significant improvement of the SDP task with respect to the other classifiers; and (3) to highlight the improvement achieved through DL-FASTAI over two baseline classifiers: the random guessing and the Zero rule baseline.
Apart from the DL-FASTAI model, whose architecture was described in the previous sections, other classifiers have been employed. The ANN, SVM and XGBoost classifiers were selected as the classical ML techniques used as a basis for our comparison, as they are well known both in the classical ML literature [85], [86] and in the SDP literature [8], [87] for their very good predictive performance. One of them is XGBoost, a decision-tree based machine learning algorithm that uses optimised gradient boosting to improve performance [88]. This gradient boosting reduces overfitting by employing regularization and handling of the missing values, as well as parallel processing and tree-pruning. The XGBoost model was also trained using the FastAI library [78]. Furthermore, we apply and denote by ANN the scikit-learn implementation of an Artificial Neural Network model. From the same library, we also use a Support Vector Machine (SVM) classifier, a model that constructs a high dimensional hyper-plane in order to find a separation boundary between the two classes [89].
To determine the random guessing baseline, let us denote by d the defective rate (proportion of positive instances) and by n the total number of instances in the defect data set (e.g., a given version of the Calcite software). The confusion matrix for the random guessing classifier is the following:
• $TP = n \cdot d^2$, i.e., the number of true positives (defects) for a random guessing classifier is the number of defects ($n \cdot d$) multiplied by the probability of an instance being defective (d).
• $TN = n \cdot (1-d)^2$, i.e., the number of true negatives (non-defects) for a random guessing classifier is the number of non-defective entities $n \cdot (1-d)$ multiplied by the probability of an instance being non-defective ($1-d$).
• $FP = n \cdot d \cdot (1-d)$, i.e., FP is the number of non-defective entities $n \cdot (1-d)$ minus the number of true negatives (TN).
• $FN = n \cdot d \cdot (1-d)$, i.e., FN is the number of defects $n \cdot d$ minus the number of true positives (TP).
Based on the previous values, the performance metrics for the random guessing classifier applied on a defect data set with a defective rate of d and the total number of instances n are given in Table 9.
TABLE 9. Performance metrics for the random guessing classifier on a defect data set with a defective rate of d and the total number of instances n.
The second baseline method we are considering is the Zero rule (ZeroR) classifier. The ZeroR classifier uses the simplest rule of predicting the majority class (i.e., the non-defective class). Considering the same notations previously introduced (d the defective rate and n the total number of instances in the defect data set), the confusion matrix for the ZeroR classifier is the following:
• $TP = 0$, since the classifier predicts only the negative class.
• $FP = 0$, since the classifier predicts only the negative class.
• $TN = n \cdot (1-d)$, i.e., all the non-defective entities are correctly classified by ZeroR.
• $FN = n \cdot d$, i.e., the number of defects that are misclassified by ZeroR.
The performance metrics for the ZeroR classifier applied on a defect data set with a defective rate of d and the total number of instances n are shown in Table 10.
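The two baseline confusion matrices can be reproduced with the short sketch below. The function names and the example values of n and d are hypothetical; all four entries of each matrix follow from the reasoning given above for the defective rate d and the number of instances n.

```python
def random_guess_confusion(n, d):
    """Expected confusion matrix when each instance is labelled defective with probability d."""
    return {
        "TP": n * d * d,
        "FP": n * (1 - d) * d,
        "TN": n * (1 - d) * (1 - d),
        "FN": n * d * (1 - d),
    }

def zero_rule_confusion(n, d):
    """Confusion matrix when every instance is labelled non-defective."""
    return {"TP": 0.0, "FP": 0.0, "TN": n * (1 - d), "FN": n * d}

# Hypothetical data set: 1000 instances with a defective rate of 10%.
rg = random_guess_confusion(n=1000, d=0.1)
zr = zero_rule_confusion(n=1000, d=0.1)
print(rg)
print(zr)
```

For random guessing the sensitivity TP / (TP + FN) equals d, whereas ZeroR never detects a defect, which is why both serve only as lower bounds for a useful defect predictor.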
The performances of the supervised classifiers previously mentioned (DL-FASTAI, XGBoost, SVM, ANN), using the Doc2Vec+LSI feature set and the evaluation metrics described in Section V-A, have been computed for each of the Calcite versions. Additionally, the evaluation metrics have been determined for the baseline random guessing and ZeroR classifiers as well. We decided to present the results obtained for only 4 Calcite versions: 1.0.0, 1.5.0, 1.8.0 and 1.15.0. Versions 1.0.0, 1.8.0 and 1.15.0 were selected based on the overall difficulty criterion (minimum/median/maximum value), while for version 1.5.0 the best AUC has been obtained. The obtained results are given in Table 11. The classifiers have been evaluated using 10-fold cross-validation, the performance metrics being averaged over the 10 runs and 95% CIs being computed for the mean values. Table 11 also includes the performance of the random guessing and ZeroR classifiers used as baselines.
From Table 11 one observes that the best performing model is the DL-FASTAI one. This performance results from the combination of two key factors: a performant ANN-based architecture and a state-of-the-art training method provided by the fastai library. The other models on our list contain at most one of these factors. Furthermore, it is also worth noting the large improvement in performance of our model over the baseline classifiers, whose performances expressed through metrics regarding the positive class (Precision, F-score + ) betray the difficulty of classification on very imbalanced data sets. Table 12 presents, for each classifier c ∈ {DL-FASTAI, XGBoost, SVM, ANN} (excepting the baselines), the number of Calcite versions for which c provided the best performance considering: (1) all performance metrics (the second column from the table); (2) the sensitivity (Sens) metric (the third column from the table); and (3) the AUC metric (the last column). We note that for the first evaluation (considering all performance metrics) the best performing classifier c was considered the one that provided the maximum number of performance metrics with the highest value. The second (2) and the third (3) evaluations (considering only the Sens and AUC measures) have been considered since, as revealed by the SDP literature, a performant defect classifier is the one that maximizes Sens and AUC [79].
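The rule used for the first evaluation (the best classifier is the one with the maximum number of metrics having the highest value) can be sketched as follows. The function name, the handling of ties (every classifier reaching the top value is credited) and the scores are assumptions for illustration; the scores are not taken from Table 11.

```python
def best_by_metric_count(results):
    """results maps a classifier name to its {metric: value} scores on one version."""
    metrics = next(iter(results.values())).keys()
    wins = {name: 0 for name in results}
    for metric in metrics:
        top = max(scores[metric] for scores in results.values())
        for name, scores in results.items():
            if scores[metric] == top:  # ties credit every classifier reaching the top value
                wins[name] += 1
    most = max(wins.values())
    return [name for name, count in wins.items() if count == most], wins

# Hypothetical scores for one version of a system.
scores = {
    "DL-FASTAI": {"Sens": 0.60, "AUC": 0.80, "MCC": 0.30},
    "XGBoost": {"Sens": 0.40, "AUC": 0.75, "MCC": 0.35},
    "SVM": {"Sens": 0.35, "AUC": 0.70, "MCC": 0.20},
}
winners, wins = best_by_metric_count(scores)
print(winners, wins)
```

Counting the versions on which each classifier appears among the winners then gives one column of a summary such as Table 12.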
The results from Table 12 reveal that the deep learning model DL-FASTAI is the best performing classifier when considering the sensitivity and AUC evaluation metrics. Considering all performance metrics, DL-FASTAI was slightly outperformed by the ANN classifier. Figure 7 presents the ROC curves for Calcite version 1.5.0, for which the highest AUC value has been obtained.
The AUC values averaged over all 16 Calcite versions obtained for the evaluated classifiers are given in Table 13. The improvement brought by DL-FASTAI compared to the other classifiers is highlighted in Table 14. In terms of the average AUC values, the best performing classifier is DL-FASTAI, which is followed by ANN. In order to answer RQ3, the statistical significance (at a significance level of α = 0.01) of the improvement obtained by the DL-FASTAI classifier (in terms of AUC values) has been tested against the AUC values provided by the XGBoost, SVM and ANN classifiers using a one-tailed paired Wilcoxon signed-rank test. The obtained p-values (0.000241 for DL-FASTAI vs. XGBoost, 0.000241 for DL-FASTAI vs. SVM, and 0.000512 for DL-FASTAI vs. ANN) reveal a statistically significant improvement achieved by DL-FASTAI, at a significance level of α = 0.01.

VII. THREATS TO VALIDITY
Regarding construct validity [90], the performance of the ML models has been analyzed using specific metrics that both stem from the literature and characterise the task at hand. However, the models do not have the same performance ranking for all metrics, and therefore their performance is relative to the task required of them.
When comparing the performance of different models, an internal validity pitfall could be focusing only on the architecture and ignoring the different methods used to train those models. Given the fact that the FastAI library contains state-of-the-art training methods, this could lead to erroneous conclusions regarding the convolutional neural network architectures. Furthermore, there is also a comparison of the training methods, by keeping the architecture constant (artificial neural network). In future research, other factors could also be considered when assessing the model performance.
Regarding external validity, which is concerned with the possibility of generalizing the obtained findings, we have chosen a public data set that is relevant for the task of software defect prediction and made public the extracted features. The problem of class imbalance, present to various degrees in our feature sets, is common in this field. As further research, our models could be applied to other data sets, to achieve generalization.
In order to increase reliability, we employed cross-validation with 10 repeats of the same experiment, so that the statistics would show the most likely result, as well as the confidence interval. The libraries we employed are public and described in the literature, as are the architectures and training methods. Further analysis could present the value of each parameter used.

VIII. CONCLUSION
In this paper, we conducted an extensive analysis of the impact that different software features have on the performance of software defect predictors. We started from a large set of software features proposed by Herbold et al. [10] for SDP and we enlarged it with conceptual software features that are able to capture the semantics of the source code. The conceptual features have been automatically learned using Doc2Vec.
Doc2Vec and LSI models are used only in the feature engineering step, in order to extract conceptual software features that capture the semantics of the source code. These features have been used for enlarging the feature set proposed by Herbold et al. [10] in the SDP literature. The enlarged attribute set was then fed into the deep learning model DL-FASTAI (as described in Section V-A) that extracts, from the raw input attributes characterizing the software entities, a set of features relevant for discriminating between defects and non-defects. The experiments performed in Section VI highlight a statistically significant superior predictive performance of DL-FASTAI with respect to other machine learning models (XGBoost, SVM, ANN). The obtained results empirically validate our hypothesis that the features learned through a deep learning model are better correlated with defect proneness than the raw input attributes and software metrics fed into a classical machine learning model. A detailed investigation on sixteen different versions of a large scale software system, the Calcite framework, has been performed using both unsupervised and supervised learning-based analyses. The experimental results highlighted a statistically significant improvement in the performance of SDP when using the conceptual features and the deep learning-based predictor we proposed.
The research questions stated in Section I have been answered. First, it has been shown that the performance of software defect prediction can be enhanced by enlarging the classical software features proposed for SDP with conceptual features extracted from the source code. Secondly, the relevance of the conceptual software features for SDP has been highlighted through unsupervised and supervised analyses conducted on the Calcite framework. As a third conclusion of our study, a statistically significant improvement has been obtained using a deep learning-based defect predictor instead of traditional supervised classifiers.
For reinforcing the conclusions of the present study we will further investigate other open-source software systems, such as the Apache Commons libraries (Collections, Compress, Configuration, etc.) [91]. We also aim to further extend the feature set for SDP. In this regard we intend to use static analysis tools for source code quality, in order to extract information about software defects, code smells or other source code vulnerabilities. On the other hand, we envision deriving conceptual coupling and cohesion software metrics starting from the proposed conceptual features and investigating their ability to increase the SDP performance even more.