Adaptive Discretization Using Golden Section to Aid Outlier Detection for Software Development Effort Estimation

Software engineering researchers have pursued several directions to improve software effort estimates, including dataset quality improvement. In this research, we specifically investigate the effectiveness of outlier removal in improving the estimation performance of five machine learning (ML) methods (Support Vector Regression, Random Forest, Ridge Regression, K-Nearest Neighbor, and Gradient Boosting Machines) for software development effort estimation (SDEE). We propose a novel discretization method based on the Golden Section (dubbed Golden Section based Adaptive Discretization, GSAD) to identify the optimal number of outliers in an SDEE dataset. The results signify the importance of removing an optimal number of outliers to improve estimates. Moreover, the results obtained with GSAD have been compared with IQR and Cooks' distance based outlier identification methods over four datasets: ISBSG Release 2021, UCP, NASA93 and China. The empirical results confirm that the performance of ML based SDEE methods generally improves when GSAD is employed, and that the proposed GSAD method is competitive with the other prevalent outlier identification methods.


I. INTRODUCTION
The need for highly accurate effort estimates has become paramount for the software development industry in order to better support decision making. Both overestimation and underestimation of the required effort are highly undesirable, and either can cause failure of a software project and wastage of resources [1], [2]. To achieve accurate effort estimates for a proposed software project, an SDEE model must be built using datasets of historical projects. The quality of the dataset affects the estimation accuracy and the reliability of empirical research; thus it is suggested that improving dataset quality can be an important factor in improving the accuracy of SDEE models and can give project managers more clarity to take better decisions [1], [3], [4], [5], [6], [7], [8].
The associate editor coordinating the review of this manuscript and approving it for publication was Giuseppe Destefanis.
Outlier handling is one of the important pre-processing tasks, not only to improve the quality of data but also for the reliability of the model generated using that data, since SDEE datasets suffer greatly from outliers [8], [9], [10], [11], [12]. The presence of outliers may indicate errors in reporting the measurements (at times a project stakeholder, more specifically a development team member, works on more than one project at the same time, which makes it difficult to keep track of the correct measurements), a lack of project continuity, and/or other unstable environmental factors during project development. Therefore, the identification of outliers is essential before building an estimation model in order to ensure the reliability of the model.
Outlier identification is relatively less researched in the field of SDEE. No research has considered the extent of outlierness of data points for the identification of the optimal number of outliers and the extent to which they influence a model's performance. In this research work, we introduce a novel method called Golden Section based Adaptive Discretization (GSAD) to identify outlier data points. GSAD discretizes the dataset using the Golden Section to partition each feature domain and subsequently finds the data points which are very distant from the other data points. It then accumulates the distant data points of each feature to arrive at the true outlier data points. The idea behind using a discretization method for outlier identification is that discretization leads to a finite set of non-overlapping partitions. Discretization not only aims to improve efficiency wherever it is applied as a pre-processing step; its very basis is to represent the information in a more compact form. Thus, a discretization method can provide a set of data points which do not comply with the compact representation of a particular dataset.
In the major conventional discretization methods that follow the equal-width or equal-frequency concept, the occurrences of a repeated value may fall into different intervals. This may not give the best results in every case, specifically where the continuous attributes are not uniformly distributed. In contrast, the proposed GSAD method places all repeated entries of a value in the same interval, which is beneficial for identifying distant values in a dataset.
SDEE has been one of the most widely researched domains in the field of software engineering. Several studies have carried out extensive reviews of the promising work done by researchers in the past [13], [14], [15]. As per these review studies, some researchers assume expert based SDEE methods to be better than model based SDEE, some favour model based estimation, while some found no difference between them [16]. As per a comparative study done in [17], nine studies found analogy-based SDEE models outperforming regression based SDEE, while four studies found regression based SDEE models to be the best performing. An enormous amount of work has been done in the domain of SDEE to assist the continually growing software development industry. However, studies often affirm their findings by relying on limited experimental settings, which gives rise to the crucial ''conclusion instability'' problem [18], [19].
The reliability of a comparative analysis of models' performance broadly depends on the following factors:
1) Dataset characteristics
2) Design of experiments; more specifically validation method and data splits [18], [20], [21]
3) Specific aspect of accuracy; more specifically different evaluation criteria [22], [23], [24]
The proposed outlier identification method has been evaluated on five machine learning methods using four widely acknowledged benchmark SDEE datasets: NASA93, China, UCP and ISBSG release 2021. We have analysed and evaluated model performances with systematically designed experimental settings so as to alleviate the conclusion instability concern. The novel contributions of this paper are twofold:
1) A novel method named GSAD is introduced to identify outliers in an SDEE dataset based on the Golden Section method.
2) In the literature, the concept of outlierness has never been studied for the purpose of outlier identification in the field of software development effort estimation. In order to detect the optimal number of outliers, we have conceptualized the identification of the outlierness of each data point and the effect of removing data points on the basis of their level of outlierness.
With the consideration of all the aforementioned challenges and the objectives of this study, this paper more specifically investigates the following research questions:
RQ1. How can a golden section based discretization method be used to detect the optimal number of outliers in SDEE data, which is one of the factors associated with data quality in SDEE?
RQ2. Can our golden section based outlier identification method help in improving the performance of an ML based SDEE method that does not consider the presence of outliers?
RQ3. How does the performance of an ML based SDEE method vary using our proposed outlier identification method in comparison to other prevalent outlier identification methods?
The prime focus of this paper is to provide detailed insight on how to use golden section method to deal with the previously mentioned challenges and to further validate golden section-based approach with reference to outlier identification to improve effort estimation. To the best of our knowledge, Golden Section method has never been explored to discretize dataset for the purpose of outlier identification in the field of SDEE.
The organization of this paper is as follows: Section II discusses the most relevant selected studies that form the basis of this research work. Thereafter, Section III discusses the background techniques. Section IV describes the experimental design, including the dataset description, the proposed outlier identification method, and the validation and accuracy measures used in this study. This is followed by Section V, which elaborates the results and discussion. Further, Section VI discusses the threats to validity. Finally, Section VII outlines the general conclusion and gives directions for future research.

II. RELATED WORK
This section reviews studies done in the field of SDEE with a focus on outlier identification. Keung et al. [25] have applied Mantel's correlation to assess the dataset quality and identify outliers. The major limitation of their proposed method is that it can only be used for analogy based estimation.
Chan et al. [26] have proposed a least median squares based statistical methodology to identify and eliminate outliers. Their proposed method has two major concerns: they have used only MMRE as the evaluation criterion, which is considered highly biased [27], [28], [29], [30], and their proposed approach is dependent on the dataset distribution. Considering the concerns of the approach proposed in [26], Seo et al. [3] have proposed two approaches to identify outliers: statistics based least trimmed squares and data mining based K-means. Their evaluation criteria were also only MRE based measures, which are considered sensitive to high MRE values [27], [31].
In another study, Seo et al. [1] have explored least trimmed squares, Boxplot, K-means clustering, Mantel leverage, and Cook's distance to study the influence of outliers. They observed that initially the experimental results found no noticeable improvements with outlier removal but statistical analysis depicted significant improvement with the removal of outliers on some datasets. Further, they suggested that outlier removal has the potential to improve the likelihood of better estimates.
Kocaguneli et al. [5] have proposed QUICK method to filter out the outliers present in the dataset. Their approach relied on Euclidean distance to prune out unpopular rows as outliers. They further advocated that essential information can be represented with reduced content in SDEE datasets. In a recent study, Silhavy et al. [32] have explored the process of outlier identification using relative percentage error, median absolute deviation (MAD), and inter-quartile range (IQR). They further identified that MAD based approach has best performance when being used with stepwise model. In a study [33], Nassif et al. have used IQR to identify outliers before building fuzzy regression models. This study also recommended to remove outliers for the purpose of improving prediction performance of estimation models.
All the previous studies identify outliers using either a single attribute such as 'effort', or consider all the points that are identified as outliers using each individual attribute of the dataset. We propose GSAD to determine the outlierness of each data point based on groups of outlier positive attributes, since a data point may be a potential outlier for one or more attributes. If we consider every data point that is a potential outlier for any individual attribute as a true outlier, a large portion of the dataset may be classified as final outliers. On the other hand, it is not sufficient to use a single specific attribute such as 'effort' to identify the final set of outliers, because there may be other important attributes such as 'size' or some productivity influencing attributes [13] that also contain outlier values. Unlike the previous outlier identification methods, with GSAD we need not manually decide which attribute(s) should be utilized for final outlier set identification; rather, the idea is to use GSAD to automatically find those attribute groups which actually contain outlier values.
To the best of our knowledge, there is only one previous research work [34] in the field of SDEE which has considered the extent of outlierness and its influence on performance. Please note that the focus of our research paper is to improve the performance of estimation models by identifying and removing outliers that are already present in the dataset. In contrast, the study [34] is not focused on outlier identification; rather, its authors experimentally added outliers to the dataset to study the effect of adding different degrees of outliers to an existing dataset. Therefore, the work in [34] cannot be used for comparative analysis purposes in our work.

III. BACKGROUND TECHNIQUES

A. DEFINITION
In order to eliminate ambiguities in the comprehension of various concepts in the remainder of this article, we define the following key terms:
• Frequency Count: The total number of values which actually belong to a particular interval [36].
• Frequency Threshold: In partitioning algorithms a frequency threshold is required to assess the intervals in order to re-partition the crowded interval, which is the necessary measure for the convergence of the algorithm [37].

B. OUTLIER IDENTIFICATION METHODS
In this work, we have proposed a new outlier identification method using Golden Section method. In order to enhance the integrity of findings, we have also evaluated the competency of our method with other prevalent outlier identification methods: Box-plot or inter-quartile based and Cooks' Distance measure [38] based methods.

1) GOLDEN SECTION METHOD
The Golden Section (GS) is, surprisingly, one of the most frequently appearing patterns in the universe [39]. Its presence can be noticed in nature as well as in man-made systems, such as the arrangement of flower petals, the solar system, famous ancient architecture, paintings etc. It is not just perceived as a premise to measure beauty but also facilitates the optimal solution to space distribution. Owing to its wide applicability, it has been used in several fields of research including flight management [40], power point tracking [41], computer science [42], signal processing [43], and image processing [44]. The term GS has its inception in the classical problem of dividing a line segment in a specific manner [41], [43]. As specified in Fig. 1, the problem states that the line segment defined by the search space [lb, ub] of length L is divided into two sub-segments (major and minor) of lengths L1 and L2 respectively by a point p on the line such that the ratio between L1 and L2 is equal to the ratio between L and L1, i.e.
$$\frac{L_1}{L_2} = \frac{L}{L_1} = \phi, \qquad L = L_1 + L_2 \tag{1}$$
where ϕ is the golden ratio representing the quotient of L1 (i.e. major sub-segment) to L2 (i.e. minor sub-segment). The resulting quadratic equation for Eq. 1 in terms of ϕ is:
$$\phi^2 - \phi - 1 = 0 \tag{2}$$
The solution to the problem of dividing the line segment is achieved by preserving the golden ratio, which is the positive root of Eq. 2:
$$\phi = \frac{1 + \sqrt{5}}{2} \approx 1.618 \tag{3}$$
The ratio of L2 (i.e. minor sub-segment) to L1 (i.e. major sub-segment) is called the GS (φ), which is defined as:
$$\varphi = \frac{L_2}{L_1} = \frac{1}{\phi} = \frac{\sqrt{5} - 1}{2} \approx 0.618 \tag{4}$$
Using the GS technique for line search optimization, GS is used to obtain two cutpoints C1 and C2 from a line segment defined by the interval [min, max] as:
$$C_1 = \max - \varphi\,(\max - \min) \tag{5}$$
$$C_2 = \min + \varphi\,(\max - \min) \tag{6}$$
In this research paper, the search space initially spans the entire set of values (i.e. from the minimum to the maximum value) of an attribute, and this search space is later reduced dynamically. The sub-spaces are identified by the GS method along with the application of our procedure. The complete process of the proposed GS based outlier identification and removal method is explained in Section IV-C.
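As an illustration of the cutpoint computation, the following minimal Python sketch assumes the standard golden section line-search form, C1 = max − φ(max − min) and C2 = min + φ(max − min). The names `PHI`, `GS` and `gs_cutpoints` are hypothetical and are not taken from the authors' implementation.

```python
import math

PHI = (1 + math.sqrt(5)) / 2   # golden ratio, positive root of x^2 - x - 1 = 0
GS = 1 / PHI                   # golden section, ratio of minor to major segment


def gs_cutpoints(lo, hi):
    """Return the two golden-section cutpoints (c1 < c2) of [lo, hi].

    Assumed standard line-search form; the function name is illustrative.
    """
    width = hi - lo
    c1 = hi - GS * width
    c2 = lo + GS * width
    return c1, c2


# Cutpoints for an attribute whose values span [0, 100].
c1, c2 = gs_cutpoints(0.0, 100.0)
print(round(c1, 2), round(c2, 2))
```

The two cutpoints split the interval into three sub-regions, which is the form of split the partitioning step relies on.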

2) INTER-QUARTILE RANGE (IQR)
The IQR based method has been one of the most widely adopted statistical outlier identification methods in the field of SDEE [32]. The process of outlier identification relies on the inter-quartile range between the lower quartile (Q1) and the upper quartile (Q3), measured as IQR = Q3 − Q1. Data points which are lower than the lower boundary Q1 − 1.5 * IQR or greater than the upper boundary Q3 + 1.5 * IQR are considered outliers. For comparison purposes, in the present study we have followed the same approach as [1]: if a data point has values outside the lower and upper boundaries for one or more variables, then that data point is identified as an outlier.
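The rule above can be sketched in a few lines of Python. This is only an illustration: the quartile interpolation scheme (`method="inclusive"` in `statistics.quantiles`) is an assumption, since the text does not state which quartile definition was used, and the function names and toy data are hypothetical.

```python
import statistics


def iqr_bounds(values):
    """Lower and upper Tukey fences for one attribute."""
    # Assumption: inclusive quartile interpolation; other definitions exist.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def iqr_outlier_rows(rows):
    """Indices of rows with at least one value outside its attribute's fences."""
    bounds = [iqr_bounds(column) for column in zip(*rows)]
    flagged = []
    for i, row in enumerate(rows):
        if any(v < lo or v > hi for v, (lo, hi) in zip(row, bounds)):
            flagged.append(i)
    return flagged


# Toy projects described by two attributes; the last one has an extreme first value.
projects = [(10, 100), (12, 110), (11, 105), (13, 120), (12, 115), (90, 118)]
flagged_rows = iqr_outlier_rows(projects)
print(flagged_rows)
```

A row is flagged as soon as any single attribute falls outside its fences, which mirrors the "one or more variables" rule followed from [1].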

3) COOKS' DISTANCE
Cooks' distance measures the change in the residual values of all data points between the full regression model and the regression model refitted after omitting the i-th data point [38]. To identify outliers using Cooks' distance, we have followed the same approach as previous studies [1], [45], [46].
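For intuition, the sketch below computes Cooks' distance for a simple linear regression with one predictor, using the closed form D_i = (e_i^2 / (p s^2)) · h_ii / (1 − h_ii)^2 with p = 2 fitted parameters. Restricting the example to a single predictor is a simplifying assumption, the function name and toy data are hypothetical, and no cut-off is applied because the text defers the threshold to [1], [45], [46].

```python
def cooks_distances(x, y):
    """Cooks' distance of each point for the fit y = a + b * x (least squares)."""
    n = len(x)
    p = 2  # intercept and slope
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)  # residual variance estimate
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_mean) ** 2 / sxx  # diagonal of the hat matrix
        distances.append((e * e / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Toy data: effort grows roughly linearly with size, except for the last project.
size = [1, 2, 3, 4, 5, 6]
effort = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
d = cooks_distances(size, effort)
most_influential = d.index(max(d))
print(most_influential)
```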

C. EFFORT ESTIMATION METHODS
Machine learning (ML) based prediction methods have been widely researched in the field of SDEE to improve predictions [14], [47]. In this study, we have compared the performance of our GSAD based outlier identification and removal method with other prevalent outlier identification and removal methods using five widely researched ML based methods that are also being explored by the data science competition community [48].

1) SUPPORT VECTOR REGRESSION (SVR)
SVR has been widely studied by researchers in the area of SDEE [24], [49], [50]. It is a machine learning technique which works by mapping non-linearly separable patterns in the data into a higher dimensional feature space, with the aim of minimizing the loss function while maximizing the support vector bounds.

2) RANDOM FOREST REGRESSION (RF)
RF is a state-of-the-art ensemble learning approach adapted for both classification and regression purposes [51]. In ensemble learning, the final results are obtained by combining the individual results of multiple estimation models of similar or distinct types. For regression purposes, it functions by constructing multiple decision trees using the training data and provides the final solution as the mean prediction of all individual decision trees.

3) RIDGE REGRESSION (RIDGE)
Ridge is an alternative regression approach which attempts to improve on ordinary least squares (OLS) regression. OLS suffers in the presence of highly correlated attributes. The objective in a regression problem is to capture the variation occurring in the response variable(s) with the proportional variations in the explanatory variables, but under high collinearity among the explanatory variables these variations do not reflect clear, true patterns. Ridge alleviates this concern by adding a penalty factor (λ) to all the diagonal elements of the matrix X^T X before finding the inverse of the matrix. The mathematical representation of Ridge parameter estimation is of the following form:
$$\hat{\beta}_{ridge} = (X^{T}X + \lambda I)^{-1} X^{T} y \tag{7}$$

4) K-NEAREST NEIGHBOR (KNN)
KNN has been chosen in this study since it is one of the simplest estimation methods and is perceived as similar to human expert judgement [52], [53]. KNN is a case based reasoning (CBR) or analogy based estimation (ABE) approach which assumes that a new project (characterized by a set of n features) will most likely require similar effort to its similar past projects from the case base [2], [52]. A distance function is used to identify the similarity between projects on the basis of the values of the n features. Thereafter, the average effort value of the k most similar projects is taken as the effort for the target project.

5) GRADIENT BOOSTING MACHINES (GBM)
GBM [54] is an ensemble based approach which relies on gradient descent [55] to optimize error reduction and build a model with minimal error. GBM is a popular variant of boosted trees which is preferred because of its strength in facilitating better predictions even without much data preprocessing.
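The analogy based (KNN) estimate described above can be illustrated with a short, library-free sketch. The Euclidean distance is an assumption (the text only says "a distance function"), and the name `knn_effort` and the toy case base are hypothetical.

```python
import math


def knn_effort(case_base, target_features, k=3):
    """Mean effort of the k past projects closest to the target project.

    case_base is a list of (features, effort) pairs; Euclidean distance is
    assumed here purely for illustration.
    """
    ranked = sorted(case_base, key=lambda case: math.dist(case[0], target_features))
    nearest = ranked[:k]
    return sum(effort for _, effort in nearest) / len(nearest)


# Toy case base: (size-like feature, team-like feature) -> effort.
history = [
    ((10.0, 2.0), 100.0),
    ((12.0, 2.0), 120.0),
    ((11.0, 3.0), 110.0),
    ((50.0, 8.0), 900.0),
]
estimate = knn_effort(history, (11.0, 2.0), k=3)
print(estimate)
```

Because the estimate is an average over neighbours, one distant project with extreme effort only matters when it enters the k nearest cases, which is one reason outlier removal can affect KNN differently from regression models.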
In this study, all the models using SVR, RF, Ridge, KNN, and GBM have been implemented in Python using the scikit-learn library [56].

IV. EXPERIMENTAL DESIGN

A. DATASET
For the empirical evaluation of the SDEE models proposed in this study, we have used four well-known benchmark datasets, namely ISBSG release 2021 [57], NASA93 [58], China [59], and UCP [60]. These datasets have been widely used by researchers in the SDEE domain [5], [24], [61], [62], [63]. It is noteworthy that these datasets are quite diverse in terms of the number of instances, the type and number of features, technical specifications, sources of data collection, application domains etc. These datasets represent different model-based counting approaches: the China and ISBSG 2021 (selected subset) datasets are FP based, NASA93 is LOC based, while the UCP dataset follows the UCP methodology.
ISBSG release 2021 contains 10,531 cross-company projects including data from different countries, organization types, and development types. We have selected the dataset instances following the guidelines from ISBSG [64] and the procedure mentioned in [63]. This has resulted in 1179 new development type IFPUG version 4+ projects of quality A and B (data and function point quality). The feature selection was performed using the same protocol as suggested by Dejaeger et al. [65]. We have discarded the dataset instances which had missing values. In addition, some features have been discarded because they are irrelevant or unavailable at the time of initial estimation. These include project sequencing related features, features which are not expected to be available at the time of initial estimation (e.g. development duration, time), features which are highly correlated, and features which have only one value (i.e. constant). The descriptive statistics of these datasets are listed in Table 1.

B. DATASET PRE-PROCESSING
Most SDEE datasets are skewed in nature, so studies have suggested the use of the natural logarithm in order to bring the distribution of the data closer to a normal distribution [28], [66]. Boehm has also suggested using the natural logarithm of attribute values for regression purposes, since prima facie effort varies exponentially with an increase in software size [67]. We have observed that several SDEE datasets have multiple '0' entries for different attributes. In such scenarios, the ordinary log transformation is not a solution because it cannot transform '0' values. Even if we leave a '0' value as it is and take log(x) of only the non-zero values, the logarithm of '1' is '0', so attribute values of '0' and '1' would ultimately have the same effect. In this study, we have transformed each attribute of the dataset using the log1p transformation function (Eq. 8) before the generation of regression models in order to reduce the skewness of the dataset.
$$\mathrm{log1p}(x) = \ln(1 + x) \tag{8}$$
Here, x belongs to the feature vector $(x_{i1}, x_{i2}, \ldots, x_{in})$ and $i \in (1, 2, \ldots, m)$ for a dataset of size n × m. The log1p transformation alleviates the aforementioned concerns associated with the logarithmic transformation by adding one to the value before taking its logarithm.
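A small sketch of the transformation, using Python's standard `math.log1p` and its inverse `math.expm1`. The helper name and the toy values are hypothetical; the point is only that 0 and 1 remain distinguishable after the transform and that values can be mapped back to the original scale.

```python
import math


def log1p_transform(rows):
    """Apply log1p(x) = ln(1 + x) to every attribute value of every project."""
    return [[math.log1p(v) for v in row] for row in rows]


# Toy dataset containing both 0 and 1 entries in the first two attributes.
raw = [[0.0, 1.0, 250.0], [3.0, 0.0, 1200.0]]
transformed = log1p_transform(raw)

# Values on the log1p scale can be mapped back with expm1.
recovered = math.expm1(transformed[1][2])
print(transformed[0][:2], round(recovered, 6))
```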

C. OUTLIER IDENTIFICATION AND ELIMINATION
This section answers RQ1, which aims to determine how a GS based discretization method can be used to identify outlier points in the dataset. GSAD is an adaptive discretization method that has been proposed in this research work to identify outliers, with the aim of assessing the impact of outliers in SDEE. The method forms the intervals automatically on the basis of the dataset values, and hence we do not need to fix the interval length or the number of intervals beforehand. GSAD works as follows (see Fig. 2):
1) Discretization of each feature vector separately.
3) Find most significant data points and discard outliers from dataset.
The following sections provide a detailed description of the proposed method and its steps.

1) GSAD BASED OUTLIER IDENTIFICATION (OI@GSAD)
As stated earlier, the objective of this research is to identify the optimal set of outliers in an SDEE dataset by performing discretization to improve the quality of the dataset. We propose a GS based discretization method to filter out the outliers and find the most influential projects in the context of an SDEE dataset. Consider an SDEE dataset D of size n × m which consists of a set of m − 1 independent features (that account for the effort required for a project) and one dependent feature, the effort value. It can be visualized as an n × m matrix in which each row is a project and each column is a feature. An outlier identification method identifies a project as an outlier if the feature values of that project are very distant from the feature values of the other projects in the dataset [68]. It is required to identify and discard the set of outlier projects and use only the significant projects for the purpose of effort estimation of a new project. Note that we will be using partition, region, and interval interchangeably in further sections. OI@GSAD consists of three major steps: the identification of potential outliers, the generation of the outlier commonality matrix, and finally distinguishing false and true outliers. The entire process of outlier identification is as follows:

2) DATASET PARTITIONING AND IDENTIFICATION OF POTENTIAL OUTLIERS
For a dataset D which consists of m features, we perform the potential outlier identification process on each feature separately. To begin the process, we first partition the entire region of feature values into three sub-regions as per the GS method with the help of two cutpoints found using Eq. 5 and Eq. 6. The algorithmic representation of the method is given in Algorithm 1.
GSAD based partitioning is an iterative partitioning approach where, at each iteration t, the feature vector $A_j$ of size n (where $X_i \in \mathbb{R}^n$) is partitioned and assessed for re-partitioning on the basis of its frequency count value $f_{A_j}(t)$, where $A_j$ is the j-th feature vector of the dataset. We consider the partitioning problem $p(X)$ s.t. $X \in r_0 \subset \mathbb{R}^n$, where $r_0$ is a sub-partition defined by the boundaries $lFreq \le f_{r_0}(t) \le uFreq$, and lFreq and uFreq are the lower and upper frequency thresholds for the dataset D. A sub-partition whose $f_{r_0}(t)$ value is greater than the upper frequency threshold is re-partitioned in the next splitting iteration. Each partition selected for re-partitioning is divided into three sub-regions with the help of two cutpoints. The maximum number of new partitions/intervals at a splitting iteration t is equal to t + 2.
The partitions that do not cross the upper frequency threshold need not be re-partitioned and are thus fixed as final sub-partitions (i.e. partition.fixed = TRUE). The partitions are also made in such a way that if a value has multiple occurrences, all such occurrences belong to the same partition. In this particular scenario, the requirement that the interval frequency be less than the upper frequency threshold is relaxed. This is shown through the statement $Int_i$.fixed = TRUE of Algorithm 1, where the interval $Int_i$ is fixed to avoid its further partitioning.
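The partitioning described above can be sketched as follows. This is a simplified reading of the text and not a reproduction of Algorithm 1: it assumes that each crowded partition is re-split over its own minimum and maximum, that interval membership uses the rules v ≤ C1, C1 < v ≤ C2 and v > C2, and that a partition holding a single repeated value is fixed regardless of its size. A stack is used instead of explicit splitting iterations, and all names are hypothetical.

```python
import math

GS = (math.sqrt(5) - 1) / 2  # golden section, about 0.618


def gs_partition(values, u_freq):
    """Split one feature's values into golden-section partitions.

    Returns the fixed partitions as sorted lists, ordered by their minimum.
    """
    if not values:
        return []
    pending = [sorted(values)]
    fixed = []
    while pending:
        part = pending.pop()
        lo, hi = part[0], part[-1]
        # Fix the partition if it is not crowded, or if it holds one repeated
        # value (the upper frequency bound is relaxed in that case).
        if len(part) <= u_freq or lo == hi:
            fixed.append(part)
            continue
        width = hi - lo
        c1, c2 = hi - GS * width, lo + GS * width
        subs = [
            [v for v in part if v <= c1],
            [v for v in part if c1 < v <= c2],
            [v for v in part if v > c2],
        ]
        subs = [s for s in subs if s]
        if len(subs) == 1:  # numerical guard: the split made no progress
            fixed.append(part)
            continue
        pending.extend(subs)
    return sorted(fixed, key=lambda p: p[0])


# Toy feature: a value repeated five times, a dense run, and one distant value.
values = [1] * 5 + [2, 3, 4, 5, 6, 7, 8, 9, 10] + [100]
u_freq = 4
parts = gs_partition(values, u_freq)
print(parts)
```

In this toy run the repeated value ends up in one partition of its own even though it exceeds `u_freq`, and the distant value 100 is isolated in a sparse partition, which is the kind of partition the later steps examine for potential outliers.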
Partitioning Process Convergence Measure: The frequency thresholds lFreq and uFreq are the factors which decide when to partition the current interval further and when to stop partitioning. These frequency thresholds have been identified empirically. If n is the number of samples in the dataset, then the thresholds are defined as lFreq = 10% of n − 40% of (10% of n) and uFreq = 10% of n + 40% of (10% of n), i.e. $lFreq = 0.1n - 0.4(0.1n) = 0.06n$ and $uFreq = 0.1n + 0.4(0.1n) = 0.14n$. For example, if dataset D has 100 samples, then at any iteration t the boundaries for an interval $r_0$ will be $(10 - 4) \le f_{r_0}(t) \le (10 + 4)$, i.e. $6 \le f_{r_0}(t) \le 14$.
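The threshold arithmetic can be checked with a few lines of Python; the function name is hypothetical, and the example reproduces the 100-sample case from the text.

```python
def frequency_thresholds(n):
    """Return (lFreq, uFreq) for a dataset with n samples."""
    base = n / 10         # 10% of n
    margin = 0.4 * base   # 40% of (10% of n)
    return base - margin, base + margin


l_freq, u_freq = frequency_thresholds(100)
print(l_freq, u_freq)
```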
The partitioning process automatically stops after a total of q splitting iterations, when the convergence criterion is met. Thereafter, the process of diagnosing potential outliers begins. For this purpose, we consider the partitions obtained after q splitting iterations and identify those partitions which fall below the potential outlier frequency threshold $OF_{Th}$. We have set $OF_{Th}$ to 5% of n, which signifies that those partitions for which $f_{r_0}(t)$ is less than 5% of the total values in the attribute contain the potential outliers. All the elements of such partitions are then added to the set of potential outliers ($PO_j$) for the j-th attribute. We repeat the entire process to identify the set of potential outliers for all the remaining attributes in the dataset. The statistics of the obtained intervals and the resulting potential outliers are displayed in Table 2.
In this research work, we have empirically set $OF_{Th}$ to 5% of n for the studied datasets. This threshold depends on the size of the dataset. Note that finding an optimal value for a particular threshold has always been a difficult issue in the domain of outlier identification. Tukey's method (i.e. IQR) fixes 1.5 * IQR when defining the lower and upper boundaries to ensure an optimal number of outliers. Tukey chose the value '1.5' because 2 was too big and 1 was too small [69]. Similarly, in our method, we have observed that taking an $OF_{Th}$ value below 5% identifies very few potential outliers per attribute, which ultimately results in no or too few true outliers with a high degree of outlierness. In a nutshell, the lower the value of $OF_{Th}$, the lower the number of outliers. Therefore, we recommend an $OF_{Th}$ value of at least 5% of n for the selected set of datasets.
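The potential outlier rule for one attribute can be sketched as below. Partitions are represented as lists of row indices so that the flagged data points can later be compared across attributes; this representation, the function name, and the toy partition sizes are assumptions made for illustration.

```python
def potential_outliers(partitions, n, of_th_ratio=0.05):
    """Row indices in partitions whose frequency count is below OF_Th.

    partitions: list of lists of row indices for one attribute.
    n: total number of data points; OF_Th is taken as of_th_ratio * n.
    """
    threshold = of_th_ratio * n
    flagged = set()
    for part in partitions:
        if len(part) < threshold:
            flagged.update(part)
    return flagged


# Toy attribute with 40 data points split into four partitions.
partitions = [list(range(0, 15)), list(range(15, 31)), list(range(31, 39)), [39]]
po_j = potential_outliers(partitions, n=40)
print(sorted(po_j))
```

Lowering `of_th_ratio` can only shrink the flagged set, which matches the observation that smaller threshold values yield fewer outliers.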

3) GENERATION OF OUTLIER COMMONALITY MATRIX $O_{cn}[i, j]$
Initially we have as many potential outlier sets as the number of features in the dataset. Then out of these temporary sets of potential outliers, we obtain a final set of outliers by taking intersection over all these temporary outlier sets.
The $O_{cn}[i, j]$ matrix shows the popularity of a data point as a potential outlier among all the features of the considered dataset. The dimension of this matrix is n × m, where n is the number of instances and m is the number of attributes in the dataset. As an example, we have shown the matrix entries for data points $o_1$, $o_2$, $o_{95}$ and $o_{499}$ of the China dataset in Figure 3. The matrix entry $O_{cn}[i, j] = 1$ signifies that the i-th data point has been chosen as a potential outlier from the j-th attribute; otherwise $O_{cn}[i, j]$ is 0. So, the total number of 1s in a row of this matrix corresponds to the TOP value of that data point, $TOP(o_i) = \sum_{j=1}^{m} O_{cn}[i, j]$. Further, it can be seen in Figure 4 that, for the China dataset, a total of 161 data points have TOP = 1, 50 data points have TOP = 2, 20 data points have TOP = 3 and so on.
The rationale behind final outlier set identification is to achieve better performance (minimum error) while removing the minimum number of data points as outliers. To analyse this, GSAD categorizes all the data points into different sets as per their $TOP(o_i)$ values. The set TOP ≥ k includes all those data points that have been found as PO for at least k attributes. For instance, with the help of Figure 4, it can be easily observed that the set TOP ≥ 1 includes all those data points which have been found as PO with TOP = 1 up to TOP = 6, i.e. 161 + 50 + 20 + 13 + 7 + 2 = 253. This value can be confirmed from the first row of Table 3.
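The commonality matrix and the TOP counts can be sketched as follows. The helper names and the toy potential outlier sets are hypothetical; the China counts per TOP value are the ones quoted in the text for Figure 4.

```python
def commonality_matrix(n_points, potential_sets):
    """Binary n x m matrix: entry [i][j] is 1 if point i is a PO for attribute j."""
    return [[1 if i in po else 0 for po in potential_sets] for i in range(n_points)]


def top_values(matrix):
    """TOP(o_i): number of attributes that flag data point i."""
    return [sum(row) for row in matrix]


def count_top_at_least(histogram, k):
    """Number of data points with TOP >= k, given {TOP value: count}."""
    return sum(count for top, count in histogram.items() if top >= k)


# Toy example: 6 data points, 3 attributes with their potential outlier sets.
o_cn = commonality_matrix(6, [{0, 2}, {2, 5}, {2}])
tops = top_values(o_cn)

# Counts per TOP value quoted in the text for the China dataset.
china_hist = {1: 161, 2: 50, 3: 20, 4: 13, 5: 7, 6: 2}
print(tops, count_top_at_least(china_hist, 1), count_top_at_least(china_hist, 2))
```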
The analysis to obtain the optimal number of outliers starts with TOP ≥ 1 and stops at TOP ≥ m, where m is the total number of attributes in a dataset. For simplicity, we have shown the analysis results for three categories (up to TOP ≥ 3) in Table 3. It can be seen from the table that TOP ≥ 2 yields the minimum error % for two datasets (UCP and ISBSG 2021). For China and NASA93, TOP ≥ 1 results in the minimum error, but it is also noteworthy that TOP ≥ 1 removes a significantly large portion of the data as outliers, which again encourages us to use TOP ≥ 2 for final outlier set identification. Similar patterns of results appeared even beyond TOP ≥ 3. Thus, for the present study, we have considered as final true outliers all those data points (POs) for which TOP ≥ 2. The remaining data points, which only a single feature flagged as potential outliers, form the false outlier set $O_{False}$ = PO − (TOP ≥ 2) and are filtered out of the outlier set. Following this process, we have filtered 161 (i.e. 253-92) data points from China, 33 data points from UCP, 14 (i.e. 19-5) data points from NASA93 and 347 (i.e. 881-534) data points from the ISBSG 2021 dataset as false outliers. Even though we have used the Mean Square Error (MSE) of the SVR based estimation model to direct the search for the final outlier set, the entire set of estimation outcomes scored well across an exhaustive range of ML based SDEE models along with multiple performance measures.
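Given the TOP value of every data point, the split into true and false outliers is a set difference. The sketch below uses hypothetical names and toy TOP values, with the TOP ≥ 2 level chosen in the text as the default.

```python
def split_outliers(tops, min_top=2):
    """Return (true_outliers, false_outliers) as sets of row indices.

    Potential outliers are points with TOP >= 1; true outliers are those with
    TOP >= min_top; the rest of the potential outliers are false outliers.
    """
    potential = {i for i, t in enumerate(tops) if t >= 1}
    true_outliers = {i for i, t in enumerate(tops) if t >= min_top}
    return true_outliers, potential - true_outliers


tops = [1, 0, 3, 0, 2, 1]
true_out, false_out = split_outliers(tops)

# Only the true outliers are removed before model building.
kept_rows = [i for i in range(len(tops)) if i not in true_out]
print(sorted(true_out), sorted(false_out), kept_rows)
```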
We have also compared the performance of GSAD with two previous methods of outlier identification, IQR [1] and Cooks' distance [1], [45], [46]. Table 4 shows the statistics of outliers after running GSAD (with TOP ≥ 2) and the other two methods on each dataset. Although the datasets are of varying sizes, it is evident that GSAD identifies a varying and optimal number of outliers for the different datasets.

D. EFFORT ESTIMATION MODEL GENERATION
All the models have been generated using 5 learning algorithms: SVR, RF, Ridge, KNN, and GBM. We have considered them as the baseline to assess whether GSAD based outlier identification and removal can improve the performance. For completeness, we have compared the results with prevalent outlier identification and removal approaches (IQR and Cooks). In order to assess the estimation performance, each machine learning method has been applied to generate four different versions of estimation models:
1) with all data points, i.e. without using any outlier identification and removal method;
2) with only the non-outlier data points identified using the proposed method GSAD;
3) with only the non-outlier data points identified using the existing outlier identification method IQR [1];
4) with only the non-outlier data points identified using the existing outlier identification method Cooks, based on studies [1], [45], [46].
Table 5 lists all the parameter values that have been investigated for each SDEE method in this study. The best hyper-parameters for each learning algorithm have been selected using the grid-search technique with 10 × 3 repeated cross-validation.

E. EFFORT ESTIMATION MODEL EVALUATION
In this study, we have validated the performance of all SDEE models using Repeated Cross-Validation (RCV) and the most recommended Leave-One-Out Cross Validation (LOOCV) [12]. RCV can be beneficial over plain cross-fold validation because repeating the validation process reduces the high variance between multiple predictions [18]. T. Menzies et al. [18] have suggested using more than one validation method, for example LOOCV and RCV, to study the variability of results while reporting the final conclusions. To study the variability in results, we have undertaken a sensitivity analysis [70]. Through this, we intend to determine the sensitivity of the findings of this study by repeating all the experiments with different experimental settings. To achieve this, we have repeated all the experiments by switching between LOOCV and 10 × 3 RCV, which involves 10 repeats of 3-fold cross-validation.
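The two validation schemes can be sketched as plain index generators. This is an illustration only, written with the Python standard library; the function names, the seed and the 12-project sample size are assumptions, and in practice a library such as scikit-learn (`RepeatedKFold`, `LeaveOneOut`) would normally be used.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=3, n_repeats=10, seed=0):
    """Yield (train, test) index lists for n_repeats shuffled k-fold rounds."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every repeat
        for fold in range(n_splits):
            # Every n_splits-th shuffled index belongs to this fold.
            test = sorted(order[fold::n_splits])
            held_out = set(test)
            train = [i for i in range(n_samples) if i not in held_out]
            yield train, test


def leave_one_out_indices(n_samples):
    """Yield (train, test) index lists with exactly one held-out project."""
    for i in range(n_samples):
        yield [j for j in range(n_samples) if j != i], [i]


# 10 x 3 RCV produces 30 train/test splits; LOOCV produces one per project.
rcv_splits = list(repeated_kfold_indices(n_samples=12))
loocv_splits = list(leave_one_out_indices(n_samples=12))
print(len(rcv_splits), len(loocv_splits))
```

With 12 projects, 10 × 3 RCV evaluates each model on 30 test folds of 4 projects, whereas LOOCV evaluates it on 12 single-project folds, which is why the two schemes can expose different variability in the results.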

F. PERFORMANCE EVALUATION
A variety of performance measures have been used by researchers to showcase the performance of SDEE models. These performance measures do not measure or represent the same facet of performance. The reliability of performance measurement largely depends on the performance evaluation measure [20], [71]. Some of the measures have been criticized for their bias, but no single measure has been unanimously accepted for comparing all types of SDEE models [13]. Therefore, for proper empirical analysis, we have used Mean Absolute Error (MAE) (9), Mean Square Error (MSE) (11), Median Absolute Error (MdAE) (10), and Standardized Accuracy (SA) (12).
The use of MAE has been recommended by several studies [24], [29], [72] in the past for measuring the average absolute difference between actual and predicted effort. MSE has also been recommended for the SDEE field in study [73]. MSE measures the average squared loss between the actual and the predicted effort. MAE focuses on central tendency and is therefore considered unbiased towards both under- and overestimation. On the other hand, MSE penalizes large error values more, as the square becomes comparatively much larger when the error (the difference between actual and predicted effort) is large. In other words, with the use of MSE, the trained model pays more attention to improving the predictions of those software projects for which the error is high. MdAE measures the median value of the absolute errors of all projects under observation. It has been recommended as more robust in the presence of large outliers [74]. SA gives an idea of the performance of a model in comparison to random guessing. We have also used the effect size (∆) measure to verify whether there is any improvement in comparison to random guessing or whether the predictions are generated by chance.
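Assuming the standard forms of these measures, and using the symbols defined next, equations (9) to (12) and the effect size can be written as follows. The sign convention of ∆ is an assumption; its magnitude is what the Cohen scale interprets.

```latex
\begin{align}
\mathrm{MAE}  &= \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \tag{9}\\
\mathrm{MdAE} &= \operatorname{median}_{1 \le i \le n} \lvert y_i - \hat{y}_i \rvert \tag{10}\\
\mathrm{MSE}  &= \frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \tag{11}\\
\mathrm{SA}   &= 1 - \frac{\mathrm{MAE}_{M_i}}{\mathrm{MAE}_{M_0}} \tag{12}\\
\Delta        &= \frac{\mathrm{MAE}_{M_i} - \mathrm{MAE}_{M_0}}{SP_0}
\end{align}
```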
where y_i / ŷ_i represents the actual/estimated effort value, 1 ≤ i ≤ n, and n is the number of projects in the test set. MAE_{M_i} is the MAE of the SDEE model for which the performance is being measured and MAE_{M_0} is the MAE of a large number of randomly guessed effort values (generally 1000) from the dataset. SP_0 is the standard deviation of the randomly guessed effort values for the entire test sample. Note that smaller values of the MAE, MdAE and MSE error measures signify better performance, while SA is an accuracy measure for which the larger the value, the better the performance. The values of ∆ are interpreted with the help of the scale proposed by Cohen [75], according to which the values can be categorized into small (around 0.2), medium (around 0.5) and large (around 0.8). A ∆ value falling in these categories signifies a real effect in the model's performance in comparison to random guessing.
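The comparison against random guessing can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each project's effort is guessed as the actual effort of another randomly drawn project, SP_0 is taken as the standard deviation of the MAE over the random runs, and the effort values are made up for the example.

```python
import random
import statistics


def mae(actual, predicted):
    """Mean absolute error between actual and predicted efforts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def random_guess_maes(actual, runs=1000, seed=0):
    """MAE of each random-guessing run (assumed baseline strategy)."""
    rng = random.Random(seed)
    n = len(actual)
    maes = []
    for _ in range(runs):
        guesses = []
        for i in range(n):
            j = rng.randrange(n - 1)
            if j >= i:          # never guess a project with its own effort
                j += 1
            guesses.append(actual[j])
        maes.append(mae(actual, guesses))
    return maes


def sa_and_delta(actual, predicted, runs=1000, seed=0):
    """Return (SA, delta) of a model relative to random guessing."""
    baseline = random_guess_maes(actual, runs, seed)
    mae_m0 = statistics.mean(baseline)   # MAE of random guessing
    sp0 = statistics.stdev(baseline)     # assumed definition of SP_0
    mae_mi = mae(actual, predicted)      # MAE of the model under test
    return 1 - mae_mi / mae_m0, (mae_mi - mae_m0) / sp0


# Hypothetical effort values, for illustration only.
actual = [120, 340, 560, 900, 1500, 2300]
predicted = [150, 300, 600, 850, 1400, 2500]
sa, delta = sa_and_delta(actual, predicted)
print(round(sa, 3), round(delta, 3))
```

Under this sign convention a model that beats random guessing has SA between 0 and 1 and a negative ∆, so the magnitude of ∆ is what would be compared against the Cohen thresholds.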
To the best of our knowledge in the field of SDEE, it is improper to consider that any specific performance measure can always be preferred over others, rather each one of them may be useful in measuring different aspects. In this manuscript, we are not comparing the performances of error and accuracy measures to choose the best measure among all. We have used a stack of unbiased performance measures to evaluate the performances of all SDEE models over different aspects of error and accuracy.

G. STATISTICAL TEST
In order to evaluate the performance of the proposed methods in this research work, Mann-Whitney U tests have been employed. The Mann-Whitney U test is a non-parametric test, adopted because the error distributions of the studied techniques are not normally distributed, as identified using the Shapiro-Wilk test [76]. Through these tests, we aim to identify whether the error distributions (absolute error, square error or MAE, MSE) of two techniques T_i and T_j are significantly different or not. More specifically, we have tested the null hypothesis (H_0), which states that techniques T_i and T_j are statistically equivalent. The alternative hypothesis (H_A) states that techniques T_i and T_j are significantly different. Here, T_i represents the technique employing our GSAD while T_j signifies one of the techniques employing the baseline (no outlier removal), IQR, or Cooks. All statistical tests have been performed at a significance level (α) of 0.05.
To summarize the results of statistical comparisons and further assess the competitiveness of our GSAD based models, we have used win-tie-loss statistics (Algorithm 2) as used in previous studies [5], [24], [77], [78].

Algorithm 2: Pseudo Code for Win-Tie-Loss Calculation Between Techniques T_i, T_j w.r.t. Error Distributions Err_i and Err_j
Initialize win_i ← 0, tie_i ← 0, loss_i ← 0, win_j ← 0, tie_j ← 0, loss_j ← 0;
if Function MannWhitneyU(Err_i, Err_j) says they are same then
    tie_i = tie_i + 1; tie_j = tie_j + 1;
else
    if Err_i < Err_j then
        win_i = win_i + 1; loss_j = loss_j + 1;
    else
        win_j = win_j + 1; loss_i = loss_i + 1;
    end
end
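Algorithm 2 can be sketched in Python as follows. This is an illustrative version under stated assumptions: the Mann-Whitney U p-value is computed with the normal approximation (tie-corrected, no continuity correction) instead of an exact test, and the comparison Err_i < Err_j is interpreted as a comparison of median errors. In practice a library routine such as `scipy.stats.mannwhitneyu` would normally replace the helper `mann_whitney_p`, which is a hypothetical name.

```python
import math
import statistics


def mann_whitney_p(x, y):
    """Two-sided Mann-Whitney U p-value via the normal approximation."""
    pooled = sorted((v, g) for g, sample in enumerate((x, y)) for v in sample)
    n1, n2 = len(x), len(y)
    n = n1 + n2
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1                      # [i, j) is a block of tied values
        midrank = (i + 1 + j) / 2       # average of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return 1.0                      # all values identical
    z = (u - n1 * n2 / 2) / math.sqrt(var_u)
    return math.erfc(abs(z) / math.sqrt(2))


def win_tie_loss(err_i, err_j, alpha=0.05):
    """Win-tie-loss tally for techniques T_i and T_j, following Algorithm 2."""
    tally = dict.fromkeys(
        ["win_i", "tie_i", "loss_i", "win_j", "tie_j", "loss_j"], 0)
    if mann_whitney_p(err_i, err_j) >= alpha:       # statistically the same
        tally["tie_i"] += 1
        tally["tie_j"] += 1
    elif statistics.median(err_i) < statistics.median(err_j):
        tally["win_i"] += 1                         # T_i has the lower error
        tally["loss_j"] += 1
    else:
        tally["win_j"] += 1
        tally["loss_i"] += 1
    return tally


# Hypothetical absolute-error samples, for illustration only.
low_errors = [float(v) for v in range(1, 21)]
high_errors = [float(v) for v in range(101, 121)]
print(win_tie_loss(low_errors, high_errors))
print(win_tie_loss(low_errors, low_errors))
```

Calling `win_tie_loss` once per combination of dataset, learning algorithm, error measure and validation scheme, and summing the tallies, gives the kind of aggregated counts summarized in a win-tie-loss table.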

V. EXPERIMENTAL RESULTS AND DISCUSSION
This section elaborates the significance of the proposed GSAD method by comparing the performance of models with and without the consideration of outliers, as well as with other schemes of outlier identification. For readability, the SDEE model that does not incorporate any outlier removal scheme (that is, as per the previous study [79]) is referred to as SDEE_BASE, the model that uses only non-outlier data points after applying GSAD is referred to as SDEE_GSAD, whereas SDEE_IQR and SDEE_Cooks denote the models that use the non-outlier data points obtained after applying the IQR and Cooks schemes respectively. Note also that an SDEE model is one of the ML methods specified in Section IV-D.
In order to empirically assess whether our GSAD based models are actually predicting and not guessing, the SA and ∆ values are shown in Table 6. It can be observed that the SDEE_GSAD models achieve mostly positive values for SA (with very few exceptions for ISBSG 2021), ranging from 0.12 (12%) to 0.67 (67%) for RCV and from 0.53 (53%) to 0.99 (99%) for LOOCV. The ∆ values obtained for all the SDEE_GSAD models show a medium to large effect size improvement over random guessing. Therefore, it can be concluded that SDEE_GSAD models are not yielding their predictions by chance. It can be further confirmed from the table that SDEE_GSAD models obtain better values than the competing SDEE_BASE models in most cases.
Table 6: Performance comparison of all models in terms of Standardized Accuracy (SA) and Effect Size (∆) for each dataset using 10 × 3 RCV and LOOCV. The models with superscript † represent the models that are as per previous paper [79].
Further, we have assessed the performance of SDEE_GSAD and SDEE_BASE using MAE, MSE and MdAE. The results can be seen in Table 7. The p-values obtained after statistical comparison of SDEE_GSAD and SDEE_BASE are shown in Table 9. The MAE and MSE distributions obtained by SDEE_GSAD models are found to be significantly different from those of SDEE_BASE in 15 cases for the RCV validation method, while for LOOCV the distributions are found to be significantly different in 5 cases with both MAE and MSE. The win-tie-loss statistics of the Mann-Whitney U tests for statistically comparing the distributions of the MAE and MSE measures have been summarized in Table 12. We can observe that SDEE_GSAD achieves the best win-tie-loss outcomes (14+14+4+4 = 36 wins and 5+5+15+15 = 40 ties, out of a total of 80 cases) against SDEE_BASE across both error measures over all datasets for both RCV and LOOCV validation schemes. This shows that SDEE_GSAD models have been outperformed by SDEE_BASE models in only 5% (4 out of 80) of the cases.
Therefore, it can be concluded from the results that when we remove the identified outliers using GSAD approach, the performance of ML based SDEE models is generally improving.
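The win-tie-loss arithmetic against SDEE_BASE can be checked with a short sketch. The per-setting counts are the ones quoted in the text; the function name `summarize` is hypothetical.

```python
def summarize(wins, ties, total):
    """Return (wins, ties, losses, loss percentage) for one comparison."""
    total_wins = sum(wins)
    total_ties = sum(ties)
    losses = total - total_wins - total_ties   # remaining cases are losses
    return total_wins, total_ties, losses, 100 * losses / total


# SDEE_GSAD versus SDEE_BASE: counts quoted in the text for the four
# combinations of error measure and validation scheme, 80 cases in total.
gsad_vs_base = summarize(wins=[14, 14, 4, 4], ties=[5, 5, 15, 15], total=80)
print(gsad_vs_base)
```

The same function can be applied to the counts of any other pairwise comparison to derive its loss share.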

B. RQ3. HOW THE PERFORMANCE OF ML BASED SDEE METHOD VARIES USING OUR PROPOSED OUTLIER IDENTIFICATION METHOD IN COMPARISON TO OTHER PREVALENT OUTLIER IDENTIFICATION METHODS?
We have empirically assessed the competitiveness of our GSAD based method against other prevalent outlier identification methods (IQR, Cooks) in improving the performance of ML based SDEE methods. The comparative results of all measures can be seen in Tables 6 and 7. From Table 6, it is evident that GSAD based models mostly achieve positive values for SA (with only 2 exceptions, both with one dataset), while IQR and Cooks based methods yield negative SA values with all 4 datasets. Thus, IQR and Cooks based models perform worse than random guessing in some cases with all datasets. Table 7 summarizes the error measures' results in terms of MAE, MSE and MdAE for all datasets with the RCV and LOOCV methods. We have further assessed the statistical significance of the results using Mann-Whitney U tests between the competing models. The p-values obtained after statistical comparison of SDEE_GSAD with SDEE_IQR and SDEE_Cooks models over the MAE and MSE distributions are shown in Table 10 and Table 11.
For completeness, win-tie-loss statistics between the competing methods have been summarized in Table 12. We can observe that SDEE_GSAD models achieve 28 (12+8+7+1) wins and 46 (8+10+12+16) ties against SDEE_Cooks models, which means that SDEE_GSAD models have been outperformed by SDEE_Cooks models in only 7.5% (6 out of 80) of the cases. In comparison to SDEE_IQR models, the proposed SDEE_GSAD models achieve 19 (9+7+3) wins and 24 (1+3+9+11) ties, and have been outperformed by SDEE_IQR models in 46.25% (37 out of 80) of the cases. Here, it is noteworthy that the IQR based outlier identification method has discarded the highest percentage of data as outliers (refer to Table 4), leaving only a small portion of data as the non-outlier set for model training and testing. As a result, there is a high possibility that IQR might classify a new project as an outlier, and a model trained using such a low volume of data might not reflect the ground truth.
Therefore, it is apparent that GSAD based outlier identification can compete with and even outperform the other outlier identification approaches for ML based SDEE methods.
Table 7: Performance comparison of all models in terms of MAE, MSE and MdAE for each dataset using 10 × 3 RCV and LOOCV. The proposed models have been highlighted in bold fonts. The models with superscript † represent the models that are as per previous paper [79].
Table 8: Performance comparison with ATLM [80] in terms of MAE, MSE and MdAE for each dataset using 10 × 3 RCV and LOOCV. The models with superscript * indicate that our SDEE_GSAD model performs better than the ATLM model w.r.t. the given performance measure.
Performance comparison with previous works is a difficult task for researchers in the field of SDEE due to the lack of complete implementation details and discrepancies in the usage of validation schemes and performance measures. However, we have compared the performance of our proposed models with the ML models obtained as per the previous study [79] and with models based on the IQR [1] and Cooks distance [1], [45], [46] outlier identification and removal techniques. In addition, we have compared the results with the well-known state-of-the-art baseline SDEE method, the Automatically Transformed Linear Model (ATLM) [80]. ATLM is a publicly available regression based model. It chooses between the log, sqrt, and none types of transformation according to the skewness of the dataset attributes. Most real SDEE datasets are extremely heterogeneous, with 0 being the minimum value and a very large maximum value. With 0 values, it is not possible to choose between the log, sqrt, and none options for dataset transformation. Therefore, we needed to use the log1p transformation for the implementation of the ATLM models as well. The performance of our proposed models in comparison to ATLM can be seen in Table 8. The results have been examined using three performance measures over two validation schemes for a fair comparison. We observed some instabilities in the performance of ATLM. It resulted in extremely high error values for some validation folds of the studied datasets, especially ISBSG, so the ATLM results reported in Table 8 are those obtained after removing the results of those outlying folds. We can see that our proposed models perform much better than ATLM, especially for the China, UCP and ISBSG datasets. It can further be concluded that our proposed GSAD based ML models have the potential to compete with the previously proposed ATLM method.
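The reason for preferring log1p when attributes contain zeros can be illustrated with a short sketch. The effort values are hypothetical, and the back-transformation with expm1 is shown as an assumption about how predictions would be mapped to the original scale, not as a detail taken from the ATLM implementation.

```python
import math

# Hypothetical attribute values, including the zero that breaks a plain log.
efforts = [0.0, 12.0, 450.0, 98000.0]

# log1p(x) = log(1 + x) is defined at x = 0 and compresses large values.
transformed = [math.log1p(v) for v in efforts]

# expm1(z) = exp(z) - 1 is the exact inverse of log1p.
restored = [math.expm1(z) for z in transformed]

# A plain logarithm fails on the zero entry.
try:
    math.log(efforts[0])
    plain_log_fails = False
except ValueError:
    plain_log_fails = True

print(transformed)
print(restored)
print(plain_log_fails)
```

Because log1p maps 0 to 0 and remains close to log for large values, it keeps the variance-stabilizing effect of the log transformation without requiring zero-valued projects to be dropped.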
Computational Code Availability: The computational code has been made freely available online at: Code Archive_GSAD.

VI. THREATS TO VALIDITY
There are multiple factors which may introduce bias into an empirical study. This section discusses the threats to internal, construct, conclusion and external validity pertaining to our study, and the measures that have been taken to address them.

A. INTERNAL VALIDITY
The major threat to internal validity is concerned with the selection bias of the data studied. We have used 4 datasets that follow different model-based methodologies. The China and ISBSG 2021 datasets are FP based, NASA93 is LOC based, while the remaining dataset belongs to the UCP methodology. These datasets also vary in terms of dimensions, application domain, size and complexity. Thus, for this study, the internal threat is not a cause of major concern.

B. CONSTRUCT VALIDITY
With regard to construct validity, a possible threat is verification bias [18]. To avoid this while performing the empirical analysis, we have used a wide variety of unbiased error and accuracy measures. Each of these measures covers a distinct aspect of performance measurement.

C. CONCLUSION VALIDITY
Conclusion validity is related to the degree of variability of the results under different experimental settings. To address this threat, we have carried out a sensitivity analysis [70] in this study, in which we have repeated all the analyses using different cross-validation schemes (RCV and LOOCV). In addition, we have performed the experiments on each individual dataset with all 5 ML based SDEE methods using common data splits. In this way, we have attempted to rule out any possibility of performance improvement due to randomness.

D. EXTERNAL VALIDITY
The major potential threat concerning external validity is related to the generalization of findings. We have performed all the experiments over a wide variety of datasets, including with-in and cross-company data. Therefore, we believe that the results of this study would be helpful in generalizing the findings to homogeneous as well as heterogeneous datasets of different sizes and domains.

VII. CONCLUSION AND FUTURE WORK
The accuracy of estimates has a significant place in the software industry for delivering quality software, since inaccurate estimates compromise the quality of software at later stages of its development. In this research work, we have proposed a novel approach of adaptive discretization using GS to effectively address outlier detection. We have examined the potential of removing outliers using GSAD in improving the performance of 5 SDEE methods based on SVR, RF, Ridge, KNN, and GBM over 4 benchmark real industrial SDEE datasets comprising with-in and cross-company data, using 2 different cross-validation methods (namely 10 × 3 RCV and LOOCV). The empirical analysis shows that the performance of the models is less likely to be sensitive to randomness resulting from different validation and data splitting approaches. In addition, the statistical tests show that performance has been significantly improved in several cases under different validation settings. This signifies that the importance of outlier removal cannot be overlooked for SDEE datasets. For completeness, we have also compared the performance of our GSAD based SDEE methods with IQR and Cooks' distance based SDEE methods. Overall, the proposed GSAD approach is efficient and competitive in enabling a simple yet effective outlier identification and removal procedure to improve the performance of the investigated SDEE methods.
The proposed discretization based outlier identification method GSAD works well irrespective of the linearity of the data and does not require class labels prior to discretization. It also does not depend on any assumption about the distribution of the data.
In this study, we have performed an extensive empirical analysis of our proposed GSAD based outlier identification and removal along with prominent SDEE methods, yet there are some options which remain to be explored. One future direction is to consider the correction of identified outliers instead of removing them altogether, specifically in small datasets with a high percentage of outliers. We plan to integrate the search for the optimum value of the outlier frequency threshold for other datasets of different dimensions, especially when the dataset has a large number of categorical attributes. We also intend to further investigate the impact of incorporating multiple nominal variables, such as language type, organization type and application type, along with numeric variables on estimation performance.
In this paper, we have investigated the applicability of GSAD for SDEE; this novel outlier identification method can also be adapted to other fields of research to assess its suitability for identifying the optimal number of outliers.