Early Software Defects Density Prediction: Training the International Software Benchmarking Cross Projects Data Using Supervised Learning

Recent reviews of the literature indicate the need for empirical studies on cross-project defect prediction (CPDP) that would allow aggregation of the evidence and improve predictive performance. Most empirical studies predict defects at granularity levels of method, class, file, and module/package during the coding phase, and thereby avoid external failure costs. The main goal of this study is to perform an empirical study on early defect prediction at the beginning of a project at the product level of granularity for using it as input in planning quality activities of the project. Hence, both internal and external failure costs could be avoided as much as possible through proper planning of quality. We first made a systematic mapping study (SMS) on secondary studies (literature reviews) on defect prediction to identify the most used datasets, the project attributes and metrics utilized as estimators, and the supervised learning methods employed for training the data. Then, we made an empirical study on defect density prediction using cross-project data. We collected 760 project data from the International Software Benchmarking (ISBSG) dataset version 11, which reported both defects and functional size attributes. We trained the prediction models using: i) the complete set of project attributes, ii) the individual attributes, and iii) multiple subsets of attributes. We employed classification and regression approaches of machine learning. The machine learning models are trained using original values of the dataset, and z-score and logged transformations of original values to explore the effects of data normalization on prediction. Most machine learning models trained on the z-score transformation of the dataset performed best for classifying defects. The Multilayer-Perceptron (Neural Network) model trained on the z-score transformation of complete dataset predicted defects with the highest F1-score of 0.89 using binary classification. 
The logged transformation and feature selection methods improved the results for multivariable regression. The multivariable regression predicted defects with the best Root Mean Squared Error (RMSE) and R2 (r-squared) values of 0.4 and 0.9, respectively, with a subset of 11 features using logged transformation. The results of classification and regression approaches indicate that defects can be predicted with reasonable accuracy at the software product level using cross-project data.


I. INTRODUCTION
The costs associated with project quality consist of the cost of quality (money spent during the project to avoid failures) and the cost of poor quality (money spent during and after the project for failures) [1]. Software companies need to predict the time and resources for quality activities such as reviews, inspections, and testing at the beginning of a project as well as throughout the development life cycle, as data on defects are collected in the software requirements specification, design, coding, and testing phases [2]. Most of the literature on defect estimation has focused on predicting defects during the testing phase to allocate the time and resources required for removing them before delivery to customers [3], [4], thereby avoiding external failure costs (i.e., the cost of failures found by the customer). Identifying and fixing post-deployment defects typically requires more effort than fixing defects during the software development and testing phases. Therefore, the term defect prediction is usually understood as predicting software defects in an intermediate or final software product, or its releases, by the project team during testing activities [5]. Indeed, a recent SLR and meta-analysis on defect prediction [4] found that most defect prediction studies predict defects at the method, class, file, and module/package granularity levels. They utilize the attributes of source code artifacts that are produced during the software coding phase [6], [7].
On the other hand, internal failure costs (i.e., the cost of rework for the failures found by the team during testing) are also very high [8]. Therefore, early software defect prediction for the final product is crucial for planning quality assurance and control activities to avoid both internal and external costs of failure [9].
The main goal of the study is to perform an empirical study on predicting software defects at the beginning of a project so that the prediction can be used as an input for planning the quality activities of the project. Hence, both internal and external failure costs could be avoided as much as possible.
To this end, we first conducted a Systematic Mapping Study (SMS) of literature reviews on defect prediction to identify i) the machine learning methods applied, ii) the best estimator attributes, and iii) the project datasets utilized. We designed our empirical study according to the findings of the SMS. In our SMS, we did not find any study that predicted defects for a software product at the beginning of a project using a publicly available dataset. This finding is in line with what Hosseini et al. [7] mentioned in their literature review and meta-analysis. Hence, most previous work is focused on defect prediction during the testing phase, which is too late to avoid the costs of poor quality for a project.
Software defect prediction typically requires fitting a mathematical model on a training dataset that learns an estimator (e.g., software project and/or product attributes), and later using the model on unseen or new data [2]. In this study, we used binary classification and regression-based machine learning models for model fitting.
Most of the primary studies we identified in the SMS used the machine learning approaches of binary classification and regression. The binary classification approaches predict the code as either defective or non-defective. The regression approaches predict the number of defects on a continuous scale. Hosseini et al. [7] found that 29 out of 30 primary studies on defect prediction predicted binary classes of defects and only one study performed regression. Regression provides more information because predicting the number of defects helps to plan the testing and development resources that will be required to find and fix the defects.
Software companies have been reported to have difficulty in collecting and organizing useful defect-related data [4], [10]. Therefore, they tend to collect cross-project data (CPD) from open-source projects or publicly available datasets [3], [11]. Radjenović et al. [11] found that 58% of the defect prediction studies used private datasets, 21% used partially public datasets, and 22% used public datasets. In private datasets, the defect data and source code are not publicly available. The partially public datasets share the source code of the project and defect data without the metrics' values (i.e., project attributes). The public datasets share the metrics values and the defect data for all the modules [11]. The most common estimators among the publicly available datasets include object-oriented and static code metrics. The object-oriented metrics contain information such as the number of classes, weighted methods per class, and depth of the inheritance tree. The static code metrics contain information such as code churn, lines of code, number of modules/files, cyclomatic complexity, and function calls.
When the training dataset includes data from the same project, the process is referred to as within-project defect prediction (WPDP), and when data from other projects is used, it is referred to as cross-project defect prediction (CPDP) [10]. A recent systematic literature review (SLR) on CPDP [4] found that most of the primary studies on CPDP have been published since 2010 and that interest has increased since 2015. For example, Malhotra [12] trained models using the NASA MDP dataset and predicted defects for several telecommunication projects in Turkcell. Turhan et al. [5] trained models using the PROMISE dataset [13] and predicted defects in SOFTLAB projects. Both studies predicted software defects at the module level using static code metrics such as cyclomatic complexity, number of lines of code, decision count, and the total number of operands and operators.
In this study, we used a publicly available software projects dataset, namely, the International Software Benchmarking Standards Group (ISBSG) dataset version 11. According to our SMS results, this dataset was not previously utilized for CPDP studies. We used one subset of the projects and their attributes to train the prediction model, and the model predicted the software defect density (DD) of the products in the remaining subset of projects.
CPDP presents the challenge of using a dataset of multiple projects in such a way that the effect of the heterogeneous data on prediction is reduced [4]. The data can be heterogeneous in terms of sources (e.g., databases and spreadsheets), formats (e.g., CSV, JSON, and XML), and types (e.g., text and numerical), and its quality may vary from source to source. Data coming from different sources needs to be integrated, and its features need to be pre-processed to make them consistent in terms of their formats and data types.
In this study, we collected both numerical and categorical data from the ISBSG dataset. The ISBSG provides data in the MS Excel format. The data formats are consistent in the dataset and the data types were already defined by the group. The quality of the data was also assessed and provided. Hosseini et al. [7] pointed out the lack of normal distribution in all the CPDP datasets, which we also observed in the ISBSG dataset, and we therefore normalized the original data. Dimensionality reduction is also a significant challenge in CPDP; addressing it helps train generalized machine learning models and avoid overfitting. In this study, we addressed this problem. We also followed the relevant guidelines of Hosseini et al. [7] and Hall et al. [14] in our study (e.g., data quality, performance measures, and statistical tests) and we provided our reflections in the light of both studies.
Hence, the contributions of the study are:
• A systematic mapping study (SMS) on secondary studies (literature reviews) on CPDP.
• The first empirical study on software product defect prediction for planning purposes using a publicly available dataset that contains cross-project data.
• Collection of both numerical and categorical data from the ISBSG dataset, with the original data transformed into z-scores and log values to address the lack of normal distribution.
• For the dimensionality reduction challenge, the study trained the models on i) the complete set of attributes, ii) the individual attributes, and iii) multiple subsets of attributes identified by feature selection methods.
• Comparisons of the study results to the findings of the secondary studies.
This article is organized as follows: Section II presents the problem definition and data preparation phases of the empirical study, and Section III presents the experimentation with machine learning models and the analysis of the results. Section IV presents the conclusions of the study and reflections on its findings with respect to previous studies. Appendix A presents the details of our SMS on the literature reviews on defect prediction, Appendix B provides links to the source code of the experiments and details of the obtained results, and Appendix C presents the names and descriptions of the attributes/measures used in the ISBSG dataset.

II. EMPIRICAL STUDY
The empirical study included three phases: problem definition, data preparation, and experimentation, as shown in Figure 1. We discuss these phases in the following subsections.

A. PROBLEM DEFINITION
In our SMS on secondary studies on defect prediction (Appendix A), we identified the most used datasets as NASA MDP, ECLIPSE, PROMISE, Mozilla, Apache, Jureczko, and Softlab. According to our findings, the ISBSG dataset has not been previously used for defect prediction. The ISBSG dataset contains data from 5,052 projects. We used a subset of this dataset that reported data on defects. The subset contains 760 projects with 23 attributes (Table 12). As for the defects, the ISBSG defines the Defects Density (DD) attribute for software products. The defect density measure is in general calculated as shown in (1).

$$\text{Defect density} = \frac{\text{number of defects}}{\text{software size}} \tag{1}$$

The primary studies covered in the SMS count the number of defects; however, measuring DD for CPDP offers more advantages. The DD provides a count of defects normalized by size, which enables a fair comparison of projects with varying sizes. The ISBSG dataset measures the size in terms of the functionality provided by the software (i.e., functional size units) instead of counting source lines of code. Functional size units are a more useful measure of size when software projects are implemented in different programming languages.
The ISBSG defines DD as ''the Number of defects reported per 1,000 Functional Size Units of delivered software in the first month of use of the software''. The availability of values/labels for the DD attribute (target class) makes its prediction suitable for supervised learning [15], [16]. DD values (the dependent variable or target class) are available for 760 projects in the ISBSG dataset.
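The ISBSG definition quoted above can be illustrated with a minimal sketch. The function name `defect_density` and the example numbers are hypothetical and are not taken from the dataset; the scaling to 1,000 functional size units follows the quoted definition.

```python
def defect_density(defects: int, functional_size: float) -> float:
    """Defects reported per 1,000 functional size units (ISBSG-style DD).

    `defects` is the number of defects reported in the first month of use;
    `functional_size` is the delivered size in functional size units.
    """
    if functional_size <= 0:
        raise ValueError("functional size must be positive")
    return 1000.0 * defects / functional_size


# Hypothetical project: 5 defects on a product of 250 functional size units.
example_dd = defect_density(5, 250)
```

Because the count is divided by size, two projects of very different sizes can be compared on the same scale.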
Two threats to the construct validity of the empirical study are that the ISBSG dataset contains a mixture of different functional size measurement methods and that defect counting methods differ from organization to organization. The ISBSG clearly defines what a defect is and states that it has worked with the data-providing companies to report defects consistently, using well-defined defect categories.
As for the units of functional size reported, we analyzed the dataset. Among the functional size measurement methods in the dataset, the Common Software Measurement International Consortium (COSMIC) method was developed to measure both business applications and real-time software. The rest of the methods have a similar scale for reporting the functional size of business software applications. The COSMIC scale for reporting the sizes of business applications is also similar to those of previous methods. Cuadrado-Gallego et al. [17] reported an approximate conversion factor of 1:1, within a range between 0.9 and 1.1, based on a larger number of data points than in past studies. As most of the projects measured by the COSMIC method are business applications (57/69 projects), we decided to keep all 760 projects in our dataset.
The DD is predicted by using the regression and classification approaches of machine learning. Regression uses a variable (x) or a set of variables (x_1, x_2, ..., x_n) to learn an output variable y as a mapping function such that f(x) = y [15], [16]. When the function approximates a numerical value of y based on a single variable, it is a simple linear regression; otherwise, it is a multivariable (multiple) regression. The x_i values are either numerical or ordinal. Examples of regression include estimating the number of defects, function points, and effort (man-hours) in an upcoming project. Similarly, classification predicts the value of y in terms of a variable (x) or a set of variables (x_1, x_2, ..., x_n). The x_i values are either numerical, ordinal, or feature vectors. The value of y is predicted either as a binary class (e.g., non-faulty, faulty) or as one of multiple classes (e.g., non-faulty, very faulty, extremely faulty) [15], [16].

B. DATA PREPARATION
The data preparation consists of three sub-tasks i.e., dataset quality description, dataset description, and dataset construction.

1) DATASET QUALITY DESCRIPTION
The ISBSG dataset mostly includes numerical, categorical, and Boolean types of values. The coding schemes are consistent in the dataset and a meta-data file is available along with the dataset. The dataset used in this study has no typographical errors or coding inconsistencies. The ISBSG dataset provides quality ratings for the project data. The ISBSG organization claims that their quality ratings are trusted by software development organizations because they provide structured guidelines and questionnaires for data collection, and they rigorously check and recheck the quality of data before adding it to their repository.
Table 1 presents a description of the quality ratings and the number of projects with each quality rating in the ISBSG dataset. The largest part of the data (92%) is classified as 'A' and 'B'. According to [7] and [14], NASA, PROMISE, and ECLIPSE are the top 3 most used datasets for WPDP, and according to [4], NASA, Jureczko, and Softlab are the most used datasets for CPDP. We have observed that data quality information is missing in these datasets and, in general, among publicly available datasets. The quality rating is very important because a prediction is less credible when the quality of the underlying data is unknown.
Software development organizations planning to use the ISBSG dataset along with their own dataset for CPDP can develop their data quality guidelines in line with the ISBSG. This includes defining data quality criteria such as those given in Table 1. In addition, they should consistently define software attributes such as those given in Appendix C. This should be followed by identifying the sources of data collection (e.g., source code, software requirement specification (SRS) document, test cases, etc.) from software development processes, software products, and software development resources. The data quality information can be improved through the data profiling process, i.e., defining meta-data such as statistics about the data and dependencies among the features. In addition, it involves defining use cases of data, profiling non-relational data, and exploring tools for automated quality assurance. A comprehensive data profiling mechanism can be defined by following the guidelines of Abedjan et al. [18]. The quality of data is maintained through educating data collection teams, organizing training and workshops, conducting data audits, and treating data quality improvement as a continuous process that is backed by project managers [19].

2) DATASET PRE-PROCESSING
This section presents descriptive statistics of the dataset and its relevant pre-processing. Tables 2 and 3 present the types of attributes/metrics and descriptive statistics of the dataset. The definition of each attribute is presented in Appendix C. Fenton and Pfleeger defined three types of metrics (i.e., process, product, and resource), which are discussed in most software measurement studies [20], [21].
A software development organization with a defined measurement process has all three types of metrics data collected from past projects. However, the SMS studies (Appendix A) show that most defect prediction datasets mainly used product metrics (e.g., size, object-oriented, and complexity), followed by process metrics (e.g., testing effort, deployment time, etc.). Resource metrics are occasionally found in publicly available datasets. This study explores machine learning approaches using all three types of metrics. We have used 23 metrics/features including process (6), product (11), and resource (6) metrics in this study.
Table 2 presents descriptive statistics of the attributes/metrics measured on a numerical scale. The count attribute represents the number of values available without imputation. Values are missing because not every software development organization collects all types of data, owing to heterogeneous project domains, the available budget, the allocated time, and the objectives of the measurement process. The analysis of the standard deviation (std), min (minimum value), max (maximum value), and quartiles (25%, 50%, and 75%) indicates skewness of the data, which is addressed with the logarithmic transformation, given in (2), of the attribute/metric values.
The heterogeneity of values in the dependent variable (DD) and independent variables (23 features in Table 1) is dealt with using the z-score, given in (3) as z = (x − µ) / σ, where x is the value of the attribute/metric, µ is the mean, and σ is the standard deviation. Table 3 presents descriptive statistics of the attributes/metrics containing categorical/discrete values.
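The two transformations described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact procedure of the study: `log_transform` uses log(1 + x) so that zero values stay defined, and `z_scores` uses the population standard deviation; the source does not state either choice, and both function names are hypothetical.

```python
import math
import statistics


def log_transform(values):
    """Logarithmic transformation to reduce skewness.

    Assumption: log(1 + x) is used so that zero-valued metrics remain defined.
    """
    return [math.log1p(v) for v in values]


def z_scores(values):
    """Standardize values as z = (x - mean) / standard deviation.

    Assumption: the population standard deviation is used.
    """
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined for a constant attribute")
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute values (not taken from the ISBSG dataset).
efforts = [2, 4, 4, 4, 5, 5, 7, 9]
standardized = z_scores(efforts)
logged = log_transform(efforts)
```

After standardization each attribute has zero mean and unit variance, which puts attributes measured on very different scales on a comparable footing.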
The codes of discrete values in the available data were found to be consistent. The missing values are replaced with the help of the KNN-imputer algorithm (proposed in [22]) by using the scikit-learn API. It uses the Euclidean distance between data points to determine nearest-neighbor values for missing data. The use of logarithmic transformation and data normalization using a z-score helps to mitigate large disparities among the attribute/metric values. The data normalized using z-score [...] from a set of continuous/categorical input variables of training data (x) to a continuous output variable y. The DD attribute originally contains numerical/continuous values. The classification models are trained to predict binary classes, i.e., defective and non-defective. The conversion of continuous variables into discrete ones (nominal intervals) is critical in machine learning as it increases learning efficiency, prediction accuracy, and the comprehensibility of prediction results [15].
It is also critical because the majority of machine learning algorithms perform better when predicting the target class in terms of discrete values for supervised learning [15]. A typical discretization method starts with sorting the numerical values in ascending/descending order. The discretization bins are created for ranges of values (i.e., intervals) by assigning a nominal label (i.e., Yes for defective, No for non-defective) to each interval. Binary classification is the most widely used approach for defect prediction in supervised learning, especially for CPDP [7], [12], [15], [23].
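The discretization of DD into two nominal labels can be sketched as follows. The cut-off is an assumption: the text does not state the interval boundaries, so the hypothetical `threshold` parameter defaults to 0, meaning that any product with a non-zero defect density is labeled defective.

```python
def binarize_dd(dd_values, threshold=0.0):
    """Map continuous defect density values to nominal labels.

    Assumption: values above `threshold` are labeled "Yes" (defective) and
    all other values "No" (non-defective); the threshold is hypothetical.
    """
    return ["Yes" if dd > threshold else "No" for dd in dd_values]


# Hypothetical DD values for four products.
labels = binarize_dd([0.0, 3.5, 0.0, 12.0])
```

A different threshold changes the balance between the two classes, which in turn affects the evaluation measures used for classification.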

3) FEATURES SELECTION
Cross-project datasets contain data from heterogeneous projects and are therefore likely to contain irrelevant and redundant features, which might add to the curse of dimensionality (i.e., a lack of prediction performance due to irrelevant features) [7], [15], [24], [25], [26]. Feature/metric subset selection techniques are used for dimensionality reduction [7], [15]. These techniques are specifically identified and applied for the regression and classification approaches in the following sections.
Feature Selection for Classification: There is no fixed set of metrics in the software measurement domain that is used to predict a specific attribute [20], [21], [27]. Therefore, this study predicts DD classes by using the dataset in three different ways, as illustrated in Table 4. Each attribute is used to train the machine learning model to evaluate its ability to predict DD. The complete dataset is also used to predict DD. In addition, five feature selection techniques are used to select subsets of the data. The description of the feature selection techniques and the attributes/features selected with each technique are presented in the following section.
Feature selection can be performed in three ways, i.e., wrapper, filter, and embedded methods [28]. The wrapper method creates subsets of a dataset using a prediction model. These subsets are randomly created, and they range from an individual feature to many features. An exhaustive search is time-consuming for a large number of features. In our case, with 23 features, it was feasible to use this method. The wrapper method iteratively creates a learning scheme by using cross-validation to select the best subset of attributes. Each subset is split between train and test sets. Each subset trains a prediction model, and the best subset is selected based on the lowest number of incorrect predictions (error rate) on the test set. The filter methods use a proxy measure instead of an error rate. Filter methods are less computationally intensive than wrapper methods. The proxy measures include feature ranking based on different methods such as chi-square and information gain. The chi-squared statistic of an attribute with respect to the target class is evaluated to select an attribute. It is calculated as in (4):
$$\chi^2_C = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{4}$$
where C is the degrees of freedom, O is the observed value(s), and E is the expected value(s). The information gain of an attribute is measured with respect to the target class to select an attribute. It is calculated in (5). Embedded methods are a catch-all group of techniques that involve feature selection as a part of building a prediction model, such as using a recursive feature elimination algorithm with principal component analysis (PCA) and gain ratio to iteratively build a predictive model by removing features with low weights [23].
The gain ratio of an attribute is measured with respect to the target class to select an attribute. It is calculated as the information gain of the attribute divided by its split information. PCA transforms the data by mapping the attributes onto the principal component space, eliminating the eigenvectors that carry noise, and then transforming back to the original space. The PCA is used in combination with the Ranker search. After the data transformation, the k-th principal component of a data vector x(i) is given in the transformed coordinates by t_k(i) = x(i) · w(k), where w(k) is the k-th eigenvector of x^T x. The subsets of features are presented in Table 14 of Appendix C.
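The three filter-style proxy measures named above (chi-squared statistic, information gain, and gain ratio) can be sketched for a single discrete attribute. This is an illustrative sketch using the standard textbook definitions, which are assumed here rather than quoted from the source; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature, target):
    """Entropy of the target minus its entropy conditioned on the feature."""
    n = len(target)
    conditional = 0.0
    for value in set(feature):
        subset = [t for f, t in zip(feature, target) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(target) - conditional


def gain_ratio(feature, target):
    """Information gain divided by the split information of the feature."""
    split_info = entropy(feature)
    return information_gain(feature, target) / split_info if split_info > 0 else 0.0


def chi_squared(feature, target):
    """Sum of (O - E)^2 / E over the feature-by-class contingency table."""
    n = len(target)
    feature_counts, target_counts = Counter(feature), Counter(target)
    observed = Counter(zip(feature, target))
    stat = 0.0
    for f, fc in feature_counts.items():
        for t, tc in target_counts.items():
            expected = fc * tc / n  # expected count under independence
            stat += (observed[(f, t)] - expected) ** 2 / expected
    return stat


# Toy data: the attribute perfectly separates the two classes.
attribute = ["a", "a", "b", "b"]
defective = ["Yes", "Yes", "No", "No"]
scores = (chi_squared(attribute, defective),
          information_gain(attribute, defective),
          gain_ratio(attribute, defective))
```

Attributes are then ranked by the chosen score, and the top-ranked ones form the subset used for training.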
Feature Selection for Regression: The regression approaches approximate a mapping function (f) from a set of continuous/categorical input variables of training data (x) to a continuous output variable y. The dataset is a combination of attributes that contain discrete values or real numbers. The non-numeric discrete values are converted into numeric values before training the regression algorithms. This study uses the dataset for regression in three different ways, as illustrated in Table 5. Each attribute is used to train the machine learning model to evaluate its ability to predict DD.
The complete dataset is also used to predict DD. In addition, four subsets of features are created with the help of feature selection techniques, i.e., multicollinearity, variance inflation factor (VIF), P-value, and Bayesian information criterion (BIC). Statistics-API and Scikit-learn are used to perform the aforementioned methods. A subset of features is also created by adding the values of the effort attributes/measures in the software development life cycle (i.e., planning, specifying, designing, building, testing, and implementation efforts) into a 'total effort' attribute/measure. In addition, this shortened dataset is further reduced using multicollinearity analysis. In total, this makes 6 subsets of features.
Multicollinearity in the dataset is detected when independent variables are linearly correlated with each other. The drawback of predicting with a dataset that has multicollinearity is the inability to know which features contribute toward predicting the dependent variable (DD in this study). The heat map in Figure 2 is used to identify multicollinearity in the dataset using Spearman correlation, and features having a correlation higher than 0.7 (moderately strong relationship) are dropped to create a subset [29]. Estimating the p-value of each feature helps to identify the statistically significant features (p < 0.05) for predicting the dependent variable. It is calculated by evaluating the results of ordinary least squares (OLS) regression using two-tailed t-tests with the statsmodels API. The features having a p-value less than 0.05 are selected to create a subset [30].
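The correlation-based filter described above can be sketched with the standard library only. The sketch assumes a greedy rule that keeps the first feature of each highly correlated pair, which the text does not specify; the feature names and values are hypothetical, and constant columns (zero variance) are not handled.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def drop_correlated(features, threshold=0.7):
    """Keep a feature only if |rho| <= threshold against all kept features."""
    kept = []
    for name, column in features.items():
        if all(abs(spearman(column, features[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns; "size" grows monotonically with "effort".
features = {
    "effort": [1, 2, 3, 4, 5],
    "size": [2, 4, 5, 9, 20],
    "team": [3, 1, 4, 1, 5],
}
selected = drop_correlated(features)
```

Which member of a correlated pair is dropped depends on the iteration order, so in practice the order should reflect which attribute is easier to collect or interpret.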
The VIF is used to select a subset of features (independent variables) by measuring the impact of their multicollinearity on the precision of predicting the dependent variable in regression models [31]. Multicollinearity is observed when two or more variables in a regression model have a high correlation. The features having a VIF score less than 10 are selected to create a subset [31]. A VIF score of 10 means that the variance of a regression coefficient is 10 times larger than it would be if the predictor were uncorrelated with the other predictors. A high VIF score causes inflated standard errors, which make it uncertain to identify the unique contribution of each predicting variable in the regression model.
This deteriorates the interpretability of the regression model because it reduces the reliability and stability of the coefficients of the independent variables. A VIF score of 10 is most commonly used as the threshold for feature selection: a VIF of less than 5 shows a low correlation between predicting attributes, a VIF score between 5 and 9 is considered moderate, and a VIF score greater than 9 is considered a high correlation [31].
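The VIF-based selection can be sketched without external libraries by using the identity that the VIF of feature j equals the j-th diagonal entry of the inverse of the feature correlation matrix, which is equivalent to 1 / (1 − R_j²). This is an illustrative sketch and not the implementation used in the study; the function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: features are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif_scores(features):
    """VIF per feature: diagonal of the inverse correlation matrix."""
    names = list(features)
    corr = [[pearson(features[a], features[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}


# Hypothetical features with a pairwise correlation of 0.8.
features = {"x": [1, 2, 3, 4, 5], "y": [2, 1, 4, 3, 5]}
vifs = vif_scores(features)
selected = [name for name, v in vifs.items() if v < 10]  # threshold from the text
```

With only two features, both share the same VIF of 1 / (1 − r²), which makes the example easy to check by hand.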
The BIC is a statistical method that is used to select among competing regression models. The regression model having the smallest value of BIC is considered the most suitable for prediction [32]. The BIC score is calculated as BIC = −2 × LL + log(N) × k, where LL is the log-likelihood of the model (i.e., how well the model fits the training data), N is the number of training examples, and k is the number of independent variables in the model. The BIC method aims to select a model that best fits the training data (higher LL) and has a smaller number of independent variables (k). It reduces the complexity and improves the interpretability of the model. The lowest BIC score indicates the best tradeoff between how well the model fits the data (LL) and its complexity in terms of the number of variables (k). The BIC scores calculated for the models based on the original, shortened, multicollinearity, P-value, and VIF-based selections of features are 4048, 4009, 4023, 4006, and 4017, respectively. This hints that if both complexity and performance are considered, then the subset of three features selected based on P-value might be the best at predicting DD. However, if the complexity is ignored, then the subset of eleven features selected based on multicollinearity performed slightly better than the p-value-based subset of features. The subsets of features are presented in Table 14 of Appendix C.
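The BIC formula above and the model comparison it supports can be sketched as follows. The function name `bic` is hypothetical; the dictionary reuses the five BIC scores reported in the text, and the natural logarithm is assumed for log(N).

```python
import math


def bic(log_likelihood, n_samples, n_features):
    """BIC = -2 * LL + log(N) * k, using the natural logarithm (assumed)."""
    return -2.0 * log_likelihood + math.log(n_samples) * n_features


# BIC scores reported in the text for each feature subset.
reported = {
    "original": 4048,
    "shortened": 4009,
    "multicollinearity": 4023,
    "p-value": 4006,
    "VIF": 4017,
}

# The preferred model is the one with the lowest BIC score.
best_subset = min(reported, key=reported.get)
```

For a fixed log-likelihood, each additional independent variable raises the score by log(N), which is how the criterion penalizes complexity.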

III. EXPERIMENTATION
This phase covers the performance assessment criteria and the results and analysis of the predictions.

A. PERFORMANCE ASSESSMENT CRITERIA
This task is performed in two steps. The first is to decide how the measurement dataset is used to train and test the machine learning models; the second is to decide how the performance of the machine learning models will be assessed.

1) TRAINING-TESTING METHOD
The selection of an appropriate training-testing method depends on several factors, such as dataset size, the reliability of testing/evaluation results, and the required computational resources. The dataset used in this study is cross-project data of 760 heterogeneous projects. Datasets in CPDP studies are comparatively small (SMS, Appendix A); therefore, training does not require special computational resources. The heterogeneity of the projects requires avoiding the methodological mistake of learning the parameters of a prediction function from one chunk of data and applying it to a single other chunk. Such a split can cause high variability in the performance of training models depending on how the data is divided, and it reserves a chunk of data for testing that could otherwise have been used for training. This reduces the generalization of a training model: the model tends to perform accurately on the training data but cannot accurately predict yet unseen data, which is called overfitting. This problem can be mitigated through cross-validation. To implement cross-validation, the predictive models are trained in three steps by dividing the dataset into three subsets, i.e., training data, validation data, and test data. The training data is used to construct the predictive model, and the validation data is used to fine-tune it. The test set is then used to estimate the performance of the predictive model [15]. N-fold cross-validation is an approach that better estimates the performance of ML algorithms. Using this approach, the data is split into N disjoint sets (folds) of equal size. In each iteration, N-1 folds are used for training the ML model and the remaining fold is used for testing, so the training and testing sets differ in each of the N iterations [15]. We used 10-fold cross-validation for our classification and regression experiments. This approach utilizes the whole dataset for both training and testing, and it reduces the risk of overfitting to a particular split.
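As an illustration of the 10-fold procedure described above, the following is a minimal sketch using scikit-learn. The synthetic data (760 rows, 23 features, roughly 64 percent negatives) only mimics the shape of the dataset; the use of stratified folds, shuffling, a fixed seed, and a scaling pipeline are assumptions made for this example and are not stated in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project data (assumed shape, not the ISBSG data).
X, y = make_classification(n_samples=760, n_features=23, n_informative=8,
                           weights=[0.64, 0.36], random_state=0)

# Scaling sits inside the pipeline so that each fold is scaled with
# statistics computed from its own training portion only.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Ten disjoint folds; every sample is used for testing exactly once.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
f1_per_fold = cross_val_score(model, X, y, cv=cv, scoring="f1")

print("F1 per fold:", np.round(f1_per_fold, 2))
print(f"mean F1 = {f1_per_fold.mean():.2f}, std = {f1_per_fold.std():.2f}")
```

Reporting the spread across folds alongside the mean gives a rough sense of how much the estimate depends on the split.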

2) PERFORMANCE EVALUATION
Classification Measures: According to the studies in the SMS, precision, recall, and F1-score are the most commonly used measures for performance evaluation [6], [7], [12], [14], [33], [34], [35], [36]. The possible outcomes of a prediction made by a machine learning algorithm are evaluated using the numbers of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Precision and recall are calculated from TP, FP, and FN, and the F1-score is calculated from precision and recall, as given in (7)-(9).
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{7}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{8}$$
$$\mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$
According to Hosseini et al. [7], the F1-score and area under the curve (AUC) are the most used evaluation measures among CPDP studies. The ISBSG dataset, like other cross-project industrial datasets, has an imbalanced class distribution.
The choice of evaluation measures for a training model is critical for understanding its performance. Precision measures the proportion of correctly classified defects among all predictions that classify the product as defective, as shown in equation (7). High precision means that the products flagged by the training model are actually defective, so defect-finding resources will be correctly deployed. Recall measures the proportion of correctly classified defects (true positives) among all actual defects, i.e., the true positives together with the defects that were missed (false negatives), as shown in equation (8). It helps to maintain the quality of software products because high recall means that there is less chance of missing a real defect. The F1-score is the harmonic mean of precision and recall, as shown in equation (9). An imbalanced dataset suffers from a disparity between the numbers of positive and negative instances, which causes training models to be biased towards the majority class. The F1-score is suitable for evaluating the performance of training models on datasets with imbalanced class distributions [8]. The F1-score is lower if the training model fails to detect a defect or incorrectly classifies a non-defective product as defective.
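To make the relationship between the confusion-matrix counts and the three measures concrete, here is a small sketch in plain Python. The counts in the example are hypothetical and chosen only for easy arithmetic; they are not results from the study.

```python
def classification_scores(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: 40 defective products found, 10 false alarms, 20 missed.
precision, recall, f1 = classification_scores(tp=40, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.800 recall=0.667 f1=0.727
```

Note that true negatives do not enter any of the three measures, which is one reason they are often preferred over plain accuracy when the negative class dominates.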
Regression Measures: According to the SMS, most studies use Mean Squared Error (MSE) [36], Root Mean Squared Error (RMSE) [6], [7], [14], [35], [36], and r-squared (R²) [7] to evaluate regression results. MSE is the average squared difference between the predicted values and the original values in the dataset. It measures how close a fitted (regression) line is to the actual data points; a smaller MSE indicates a better fit. RMSE is the square root of MSE. R², also called the coefficient of determination, measures how well the predicted values approximate the actual data points. Ideally, an R² value of 1 and an RMSE value of 0 indicate that the predicted values are equal to the actual data points.
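MSE and RMSE estimated over n samples are defined in equations (10) and (11), respectively.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{10}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2} \tag{11}$$
where $\hat{y}_i$ is the predicted value of the i-th sample, and $y_i$ is the corresponding true value. R² estimated over n samples is defined in (12).
$$R^2 = 1 - \frac{SS_{Residual}}{SS_{Total}} \tag{12}$$
$SS_{Residual}$ is the sum of squares of residuals defined in (13).
$$SS_{Residual} = \sum_{i=1}^{n}\left(y_i - \hat{\alpha} - \hat{\beta} x_i\right)^2 \tag{13}$$
where $\hat{\alpha}$ is the estimated value of the constant term α and $\hat{\beta}$ is the estimated value of the coefficient β; $x_i$ is the i-th value of an independent variable. $SS_{Total}$ is the total sum of squares defined in (14).
$$SS_{Total} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 \tag{14}$$
where $\bar{y}$ is the mean of the true values.
In summary, the RMSE provides insight into the accuracy of the predictions, and R² explains how well the training model accounts for the variance in the target variable. Therefore, a combination of both is used to select the best training models.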
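The regression measures described above can be computed directly from the residuals. The sketch below uses NumPy and follows the standard definitions of MSE, RMSE, and R²; the four data points are made up for the example.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Return (mse, rmse, r2) for paired true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_residual = np.sum(residuals ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_residual / ss_total  # assumes y_true is not constant
    return mse, rmse, r2


# Made-up values: three perfect predictions and one that is off by 2.
mse, rmse, r2 = regression_scores([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 2.0])
print(f"MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.2f}")
# MSE=1.00 RMSE=1.00 R2=0.20
```

The single large residual barely changes how many predictions are exact, yet it pulls R² down to 0.2, which illustrates why the two measures are read together.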

B. MACHINE LEARNING ALGORITHM SELECTION
The supervised machine learning algorithms are selected based on the most commonly used and successful machine learning approaches on CPDP datasets that are identified in the SMS in Appendix A.

1) PARAMETER SETTING
The machine learning models are implemented using scikit-learn, a widely used machine learning API. The classical machine learning algorithms, such as linear regression, multivariable regression, logistic regression, decision tree, support vector machine, and Naive Bayes, are trained with default settings. The K-NN model is trained after finding the optimum number of neighbors (k). Hyper-parameter tuning is performed for the ensemble decision tree models, i.e., the random forest and extra-trees classifiers. The settings include the number of decision trees (100 to 1200), the split criterion ('gini', 'entropy'), the maximum number of tree levels (5, 23), the minimum number of samples required to split a node (2, 5, 10, 15, and 100), and the minimum number of samples required at a leaf node (1, 2, 5, and 10). The multilayer perceptron (ANN) model is also trained with multiple combinations of parameters, and the combination with the best performance is retained. These settings include the activation function ('identity', 'logistic', 'tanh', 'relu'), the solver ('lbfgs', 'sgd', 'adam'), the learning rate (0.1 to 0.5), the number of neurons in the hidden layer (up to 100), the number of iterations (up to 200), and the regularizer (L2 penalty, alpha=0.0001).
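One possible way to express the random forest search space above in scikit-learn is sketched below. The study does not state which search procedure was used, so the use of `RandomizedSearchCV`, the step size for the number of trees, and the reading of the tree-depth setting as the range 5 through 23 are all assumptions of this example. The helper name `tune_random_forest` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Search space mirroring the reported ranges (step sizes are assumptions).
RF_PARAM_SPACE = {
    "n_estimators": list(range(100, 1201, 100)),
    "criterion": ["gini", "entropy"],
    "max_depth": list(range(5, 24)),  # assumed to mean 5 through 23
    "min_samples_split": [2, 5, 10, 15, 100],
    "min_samples_leaf": [1, 2, 5, 10],
}


def tune_random_forest(X, y, param_space=RF_PARAM_SPACE, n_iter=20, cv=10, seed=0):
    """Randomly sample n_iter settings and keep the one with the best mean F1."""
    search = RandomizedSearchCV(
        RandomForestClassifier(random_state=seed),
        param_distributions=param_space,
        n_iter=n_iter,
        cv=cv,
        scoring="f1",
        random_state=seed,
    )
    search.fit(X, y)
    return search.best_params_, search.best_score_


# Smoke run on synthetic data; the reduced space is chosen only for speed.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)
demo_space = {**RF_PARAM_SPACE, "n_estimators": [100, 200]}
best_params, best_score = tune_random_forest(X_demo, y_demo, demo_space, n_iter=3, cv=3)
print(best_params, round(best_score, 2))
```

The same pattern would apply to the extra-trees and multilayer perceptron models by swapping the estimator and the parameter space.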

C. RESULTS
The following sections present results and analysis of classification and regression tasks.
The complexity of a training model and its predictive performance are both considered in deciding the most suitable machine learning model for CPDP. Simpler models, which use a smaller number of features, carry more bias because they do not process all the information in the complete dataset. On the other hand, their lower variance makes them more robust to noisy data. However, simpler models are prone to underfitting the training data. Complex models, in contrast, may overfit, i.e., fit the training data too well and perform poorly on unseen data. These models tend to have higher variance and may miss the underlying patterns among features in the training data. To tackle underfitting and overfitting, we used N-fold cross-validation so that every chunk of the dataset is used for both training and testing. In addition, we used data transformations to normalize the effect of data heterogeneity. Furthermore, more than one evaluation measure is used to mitigate bias in evaluating the performance of the training models. After taking these steps, we also experimented with variations from simple to complex models, using a feature selection process to select the best set of features to train the machine learning model and predict DD.

1) CLASSIFICATION
This study used the dataset in multiple ways to predict DD classes using 8 machine learning algorithms and to identify the best set of metrics. The variations of the dataset include:
Original dataset:
• The original dataset containing all 23 metrics/features combined. {1 dataset}
• The original dataset transformed into logged values of its features. {1 dataset}
• The original dataset transformed into z-score values of its features. {1 dataset}
Subsets:
• Six feature selection methods are used to create subsets of metrics/features from the original dataset. {6 datasets}
• Each subset is transformed into logged values of its features. {6 datasets}
• Each subset is transformed into z-score values of its features. {6 datasets}
Individual features:
• Each metric/feature is separately used to build machine learning models. {23 metrics used as 23 datasets}
• Values of each feature are transformed into logged values. {23 datasets}
• Values of each feature are transformed into z-score values. {23 datasets}
Due to space constraints, we have made all results available in Google Drive (Appendix B). The box plot in Figure 3 depicts the overall performance of the machine learning algorithms on the subsets and complete datasets. The datasets are also transformed into logged values and z-score values to normalize them. Multilayer Perceptron performed best with an F1-score of 0.87, followed by the KNN classifier with an F1-score of 0.80. The Random Forest and Extra Trees classifiers reached a highest F1-score of 0.67.
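The logged and z-score variants of a feature listed above can be produced with a few lines of NumPy. This is only a sketch: the `offset` added before taking the logarithm is an assumption introduced here to keep zero values finite, and the study does not say how such values were handled. The sample values are invented.

```python
import numpy as np


def z_score(values):
    """Standardize each column to zero mean and unit (population) variance."""
    values = np.asarray(values, dtype=float)
    std = values.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # constant columns map to all zeros
    return (values - values.mean(axis=0)) / std


def log10_transform(values, offset=1.0):
    """Base-10 log of (values + offset); the offset is an assumed safeguard."""
    values = np.asarray(values, dtype=float)
    return np.log10(values + offset)


functional_size = [100.0, 200.0, 300.0]  # invented example values
print(np.round(z_score(functional_size), 3))   # [-1.225  0.     1.225]
print(log10_transform([9.0, 99.0, 999.0]))     # [1. 2. 3.]
```

When the transformations are combined with cross-validation, the mean and standard deviation would normally be estimated on the training folds only and then applied to the held-out fold.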
Figure 4 depicts the performance of the top two models, and it is evident that the z-score transformation of the dataset helped to train the model with the highest F1-score. The logged transformation of 12 features extracted using chi-square feature selection helped to train the model with the second highest F1-score of 0.80. This shows that data transformations and feature selection are useful on the dataset because all the higher F1-scores are achieved using transformations rather than original values, as shown in Figure 3 and Figure 4. The box plot in Figure 5 presents the overall performance of machine learning models on individual features.
The KNN classifier model outperformed the other models with the highest F1-score of 0.77. All other machine learning models mostly predicted DD with an F1-score of around 0.50, which is considered equal to random guessing [37]. The KNN classifier model performed consistently over individual features, the complete dataset, and its subsets.
Figure 6 presents the best results of DD prediction by training the KNN classifier model using individual features.
The features with an F1-score of 0.7 or above are presented. The 'functional size' feature predicted DD with an F1-score of 0.77, followed by 'normalized work effort level 1', 'normalized level 1 PDR (ufp)', and 'normalized work effort' with F1-scores of 0.75, 0.75, and 0.74, respectively.
Receiver Operating Characteristics (ROC): In this study, the ISBSG dataset has 64 percent of projects classified as non-defective, which means that the dataset is skewed [38], [39]. The Receiver Operating Characteristic (ROC) graph is used to measure an algorithm's performance because it is insensitive to class skewness and unequal classification error costs. Performance metrics such as precision, recall, and F1-score are based on the confusion matrix and are sensitive to class skewness [15], [38], [39]. These performance metrics depend on the distribution of instances in a dataset, for example, faulty (positive) and non-faulty (negative) instances in the binary classification of defects.
The Multilayer Perceptron and KNN classifier models predicted DD with their highest F1-scores of 0.89 and 0.80, respectively. Therefore, we used ROC curves to visualize their performance in terms of the true positive rate and false positive rate in Figure 7.
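As a small illustration of how an ROC curve and its AUC are obtained from predicted scores, the sketch below uses `roc_curve` and `roc_auc_score` from scikit-learn. The four labels and scores are a toy example, not outputs of the models in the study.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy example: 1 marks a defective project, scores are predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold on the score yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc_value = roc_auc_score(y_true, y_score)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
print(f"AUC = {auc_value:.2f}")  # AUC = 0.75
```

The AUC of 0.75 here equals the fraction of (defective, non-defective) pairs in which the defective project receives the higher score, which is three out of four pairs.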

2) REGRESSION
Linear regression and multivariable regression are used to predict DD using individual features and combined features, respectively. Overall, the features related to and productivity of software predicted DD best in terms of r-squared and RMSE measures.
Multivariable Regression: This section presents the results and analysis of DD prediction using complete datasets and subsets. Figure 10 represents the RMSE and r-squared scores of multivariable regressions on the subsets discussed above. Each subset and dataset is transformed in two ways, i.e., either the complete data is transformed or only the DD (target) feature is transformed. The best prediction performance is achieved with the dataset that is reduced based on a statistical filter of multicollinearity and transformed using logged values, as shown in Figure 10. Overall, logged transformations on the datasets enabled the best scores of RMSE in the range (0.4, 0.98) and r-squared values in the range (0.9, 15).
Figure 11 presents the best results, i.e., RMSE closer to 0 and r-squared closer to 1. The best results (RMSE = 0.4 and r-squared = 0.9) are achieved with logged feature values of the complete dataset and of a multicollinearity-based reduced dataset of 11 features. The logged subsets outperformed the subsets of original values and z-score values, which shows the benefits of feature reduction and data transformations. The prediction of bugs in the planning phase can help project managers devise mitigation strategies; however, less data is available for prediction at the start of a project. In that case, requiring only three attributes, related to the available project time, the architecture of the project, and the number of resources available, can make prediction convenient.
The VIF, BIC, and multicollinearity-based feature selection methods analyze correlation coefficients among independent variables in different ways. The simplest way of finding multicollinearity is to compute the pair-wise correlations among the independent variables as well as their correlations with the target variable. Variables that have a high correlation with other variables and a lower correlation with the target variable are dropped.
The VIF score provides a more structured process for finding multicollinearity than the manual approach because it is based on the r-squared value obtained by regressing each independent variable on the remaining independent variables. The BIC score is used to find the best subset of features for predicting the target variable; its criterion balances goodness of fit against the number of parameters in the model. The p-value is calculated by analyzing the coefficients of the regression model through the Wald test. This test divides each estimated regression coefficient by its standard error and compares the resulting statistic with a standard (z) distribution. The p-value of each independent variable indicates its level of significance for predicting the dependent variable. It is the probability of observing a test statistic at least as extreme as the one calculated through the Wald test if the true coefficient were zero.
In terms of predicting DD, the subsets of features selected with the p-value performed better than those selected based on correlation. The fundamental difference between these approaches is that the multicollinearity-based approach analyzes pair-wise correlation coefficients among the independent variables, as well as with the target variable, to drop or keep a variable in a subset, whereas the p-value approach uses an analysis of variance between each independent variable and the dependent variable. Evaluating training models with both RMSE and R² supports a comprehensive analysis because the residuals and the structure of the training models are both assessed.
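For illustration, a VIF score can be computed with NumPy alone, assuming the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all other features. The function name `vif_scores` and the synthetic features are hypothetical; the study does not describe its implementation.

```python
import numpy as np


def vif_scores(X):
    """Variance inflation factor of each column of X (columns must not be constant)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    scores = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        scores[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return scores


# Synthetic features: c is almost a copy of a, b is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
vif = vif_scores(np.column_stack([a, b, c]))
print(np.round(vif, 1))  # a and c get large scores, b stays near 1
```

A common rule of thumb treats scores above 5 or 10 as a sign of problematic multicollinearity, although the threshold used in the study is not stated.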

IV. CONCLUSION
This paper presented an empirical study on early defect prediction for planning quality activities at the beginning of a project to minimize the costs of poor quality (i.e., both internal and external failure costs). The results indicate that software companies may use regression- and classification-based training models to predict post-deployment defects of a software product with reasonable accuracy by using the dataset.
The main contribution of this study is twofold: i) a systematic mapping study (SMS) on secondary studies (literature reviews) on CPDP (cross-project defect prediction) and ii) an empirical study on early defect prediction using a publicly available benchmarking dataset. In Table 6, we provide our reflections on the findings of this study by comparing them to the findings of a recent SLR and meta-analysis on CPDP [7] and the SLR of Hall et al. [14].
In the future, we plan to extend the study by exploring deep learning and transformer-based training models to predict defects. Software development organizations can use the findings of this study and predict their defects by using the source code of the training models in this study, which is given in Appendix B. In addition, the ISBSG cross-project dataset is extensible through the addition of their in-house project data.

APPENDIX A SMS ON SECONDARY STUDIES
Here, we present a summary of our SMS on secondary studies (i.e., literature reviews (LRs), systematic literature reviews (SLRs), and SMSs) on defect prediction. These studies reviewed a total of 509 defect prediction studies published between 1990 and 2019. The research questions of the SMS are listed in Table 7. We searched using the string 'ISBSG AND (Defect OR Fault OR Error OR Failure) AND (Prediction OR estimation)' and within the SLRs, with the objective of identifying the use of the ISBSG dataset for defect prediction. We did not find a study that used the ISBSG dataset for defect prediction at the project level.

B. STUDY INCLUSION CRITERIA
We have selected LRs, SLRs, and SMSs on defect prediction that are published in peer-reviewed journals. A study is selected if it contains at least an analysis of the types of machine learning algorithms, datasets, or metrics used in the defect prediction studies.

C. SEARCHING ACTIVITIES
Figure 12 represents the overall process used for the selection of secondary studies. First, we selected studies based on title, abstract, and conclusion for the complete review and evaluated them according to the inclusion criteria. The complete review resulted in the selection of 9 studies. The snowball tracking process identified 3 more studies.

1) SEARCH RESULTS
Table 8 represents the research databases, search results, and the number of SLR, SMS, and LR studies selected after applying the search string. Table 9 represents the title, coverage time, and number of studies included.

D. RESULTS OF THE SMS
Table 10 presents the most used machine learning methods. Few studies explicitly mentioned the percentage of machine learning methods used. For the other studies, we identified the most used algorithms based on the titles of primary studies and the overall discussion in the SLRs, LRs, and SMSs. The methods listed against the Study IDs in Table 10 are ordered according to their frequency of use in the research studies or frequency of appearance in the SLRs, LRs, and SMSs. Table 11 presents the most used CPDP datasets that are publicly available and the attributes/measures collected about process, product, and resource entities. NASA MDP, ECLIPSE, PROMISE, Mozilla, Apache, Jureczko, and Softlab are found to be the most used. It is observed that resource-related measures are the least used ones.

FIGURE 2. The heatmap of multicollinearity among all features.
SS_total in (14) is the total variation of the target variable around its mean; the larger the residual sum of squares is relative to it, the worse the regression model fits the training data. A combination of evaluation measures provides an important insight into the performance of the training model. The RMSE is measured in the same units as the target variable, which supports direct interpretation of results, whereas the MSE is expressed in squared units. Both summarize the spread of the differences between actual and predicted values, i.e., the residuals. The MSE squares the residuals, which makes the results difficult to interpret; taking the square root of the MSE (i.e., the RMSE) brings the measure back to the original scale. RMSE is therefore preferred over MSE when comparing the performance of multiple training models. Large residuals move the RMSE away from its ideal value of 0. R² provides the goodness of fit of a training model: it indicates how much of the variance in the target variable is explained by the model. A higher value of R² (ideally close to 1) indicates that the training model has learned patterns and relationships among the independent and dependent variables during the training phase. It also indicates how well the model might perform on unseen data.

FIGURE 3. Performance of machine learning models using datasets and subsets.

FIGURE 4. Highest performances of multilayer perceptron and KNN classifier models using datasets and subsets with various feature selection techniques.

FIGURE 5. Performance of various machine learning models using individual features.

FIGURE 6. The F1-score of various KNN-based models using individual features.

Figure 8 represents the performance of linear regression using original, logged, and z-score values of individual features and DD features. Only the r-squared results of those features that are closer to 1 (i.e., in the range of 0.4 to 0.9) are presented. Logged feature values predicted DD

FIGURE 7. The comparison between ROC curves of multilayer perceptron and KNN classifier.

FIGURE 8. The R-squared (R²) values of defect density prediction using individual features.

Figure 9 represents the performance of linear regression in terms of RMSE using original, logged, and z-score values of individual features and DD features. Only the RMSE results of those features that are closer to zero (i.e., in the range of 0 to 0.6) are presented. It is evident that the logged feature values predicted DD better than the z-score values of features. Logged values of 'Developement_Type', 'Client_Server',

FIGURE 9. The RMSE values of defect density prediction using individual features.

FIGURE 10. Multivariable regression using multiple subsets of a complete dataset.

FIGURE 11. Higher results using multiple subsets of the complete dataset.
According to the findings, Bayesian Learners, Regression, Random Forests, Decision Trees and Support Vector Machines, Rule-based methods, and Artificial Neural Networks are the most used methods.

TABLE 2. The attributes measured on a numerical scale in the ISBSG dataset.

TABLE 3. The attributes measured on a categorical scale in the ISBSG dataset.

ranges between −3 and +3. The Log10-transformed values of the metrics range between 0.1 and 5.1, with an average of 2.1. The available dataset is converted from an MS Excel sheet to CSV files. The dataset is used for training regression and classification models. The regression models predict the DD, i.e., the number of defects per function point. The regression approaches approximate a mapping function (f)

TABLE 7. Research questions of the SMS.

FIGURE 12. The selection process of secondary studies.

TABLE 8. Selection of studies.

The research questions (RQ) of this study are shown in Table 7. The key findings of the SMS are compared with the results and findings of this study in Section III.

(Literature review OR systematic review OR systematic review OR SLR OR Systematic mapping OR systematic mapping study) AND (Defect OR Fault OR Error OR Failure) AND (Prediction OR estimation).

TABLE 9. The list of secondary studies included.

TABLE 10. Machine learning methods used in the list of secondary studies included.

TABLE 11. Most used defect prediction datasets and metrics for defect prediction.

TABLE 12. Description of dataset attributes/metrics.

TABLE 13. Feature selection methods applied on the dataset for classification models.

TABLE 14. Feature selection methods applied on the dataset for regression models.