A Novel Approach for Code Smell Detection: An Empirical Study

Code smell detection helps improve the understandability and maintainability of software while reducing the chances of system failure. In this study, six machine learning algorithms are applied to predict code smells. For this purpose, four code smell datasets (God-class, Data-class, Feature-envy, and Long-method), generated from 74 open-source systems, are considered. To evaluate the performance of the machine learning algorithms on these datasets, the 10-fold cross-validation technique is applied, which repeatedly partitions the original dataset into a training set to train the model and a test set to evaluate it. Two feature selection techniques, Chi-squared and Wrapper-based, are applied to improve the prediction accuracy of the six machine learning methods by choosing the top metrics in each dataset, and the results obtained with the two techniques are compared. To further improve the accuracy of these algorithms, a grid search-based parameter optimization technique is applied. In this study, 100% accuracy was obtained for the Long-method dataset using the Logistic Regression algorithm with all features, while the worst performance, 95.20%, was obtained by the Naive Bayes algorithm for the Long-method dataset using the chi-square feature selection technique.


I. INTRODUCTION
Nowadays, the complexity of software is increasing continuously due to complex requirements, the growing number and size of modules, and code smells in the developed software. Complex requirements are difficult to analyze and understand, which makes development difficult. Complex software is also difficult to understand, so its maintainability is low. Complex requirements are not in the developers' hands, but code smells can be detected and refactored to make the software simpler, more understandable, and easier to develop and maintain [1]. In the software development process, both functional and nonfunctional qualities are essential for designers to guarantee software quality [2]. Developers generally emphasize only functional requirements, whereas nonfunctional requirements such as comprehensibility, verifiability, evolution, maintainability and reusability are neglected [3]. The lack of nonfunctional quality leads to a decline in the quality of the software, which increases its complexity and maintenance effort. Fowler et al. [4] explained the refactoring technique, by which poorly implemented code can be converted into a good implementation. They proposed definitions of 22 code smells.
Various studies have examined the effects of code smells on software and revealed their undesirable effect on software quality [19]-[23]. These studies also examined how removing code smells reduces the likelihood of software system failures and faults. They showed that code smells have negative effects on the software development process and recommended refactoring the software to remove them.
The influence of code smells on software has been analyzed by different researchers, including Olbrich et al. [24]-[25], Khomh et al. [26], and Deligiannis et al. [27], by inspecting the frequency and size of changes in the software system. They found that classes infected by code smells have a higher rate of change and need more maintenance work. Li et al. [29] studied the relationship between bad smells and class error probability in object-oriented software systems. Their results showed that software components infected by code smells have a higher chance of class errors than other components. Castillo et al. [28] studied the harmful effect of the God Class smell on power consumption and found that eliminating God Class smells reduces the cyclomatic complexity of the source code. Thus, code smell detection indirectly helps reduce the problems of costly maintenance and the chances of software system failure.
Code smell detection poses many challenges for software developers. The variety of code smell types makes the process difficult, and the lack of a formal definition of code smells is another challenge.
In this study, a framework is built for code smell prediction using machine learning algorithms and software metrics. In code smell prediction, metrics play an important role in measuring the functional and nonfunctional qualities of software and in understanding the features of the source code. Metrics provide static information about the software, such as the number of methods, classes, and parameters, along with measures of coupling and cohesion among the objects in the system. This paper proposes code smell prediction and its analysis using six machine learning algorithms. For this purpose, four code smell datasets (God-Class, Data-Class, Feature-Envy, and Long-Method) from Fontana et al. [18], generated from 74 open-source systems, are used. The first two datasets are class-level and the remaining two are method-level. Six machine learning algorithms (Naive Bayes, KNN, Multilayer Perceptron, Decision Tree, Logistic Regression and Random Forest) are applied to the datasets, and the highest accuracy (100%) is achieved by the Logistic Regression algorithm.
Researchers have developed various experimental studies, tools, and techniques to detect code smells, and they have reported differing results. Various reviews and comparative studies [30][31][32] indicate that there are several reasons for this, including the difficulty of finding formal definitions for code smells. Additionally, the available tools and techniques do not identify all code smells, since each focuses on only some of them.
The main contribution of this paper is twofold. First, six machine learning methods for detecting code smells are proposed. Second, the performance measures (Accuracy, Precision, Recall, and F1 Score) achieved by these methods, and their evaluation with the grid search cross-validation technique, are reported.
The following research questions are answered and discussed in this research study.
RQ1. How effective are machine learning techniques at addressing the code smell detection problem, and which software metrics play an important role in predicting code smells?
Motivation. To study and examine the impact of machine learning methods in building a code smell prediction framework that can predict code smells in object-oriented software using software metrics.
RQ2. Does a feature selection method affect the performance of the prediction?
Motivation. To study the impact of feature selection techniques on the model's accuracy and to identify which software metrics play an important role in the code smell prediction process.
RQ3. Do the methods used for separating the dataset into a training set and a test set affect accuracy?
Motivation. To study the impact on accuracy of 10-fold cross validation, which is used to set the training set and test set in each dataset.
RQ4. Does the parameter optimization technique affect the accuracy of the prediction?
Motivation. To study the impact of tuning the machine learning techniques' parameters on accuracy.
The outline of the paper is as follows. Section 2 explains related work and briefly describes code smell detection using machine learning techniques. Section 3 reports the dataset information, approaches and research framework used in this work. Section 4 presents the experimental results. Threats to validity are discussed in Section 5, and Section 6 provides the conclusion of our work.

II. RELATED WORK
Many researchers have studied the application of machine learning techniques to code smell detection. This section presents the different tools and techniques, including machine learning techniques, that have been applied to code smell datasets to detect code smells. Each tool and technique produces different results. Kessentini et al. [17] classified code smell detection techniques into seven categories: manual approaches, symptom-based approaches, metrics-based approaches [13]-[15], probabilistic approaches [12], visualization-based approaches [8], search-based approaches [9]-[11], and cooperative-based approaches [7].
Among the manual approaches, Travassos et al. [5] proposed inspection methods, and Ciupke [6] presented process identification and manufacturing process methods to improve software quality. The specification algorithm [16] is used to detect code smells in the symptom-based approach. Marinescu [13] proposed a technique called "detection strategy" for devising metrics-based rules that capture deviations from good design principles and heuristics. Moha et al. [14] and Tsantalis et al. [15] also introduced metrics-based approaches. Rao et al. [12] proposed a probabilistic approach to detect bad smells in object-oriented design.
Hill et al. [8] proposed a novel smell indicator known as Stench Blossom, an interactive ambient visualization designed first to give programmers a fast, high-level overview of the smells in their code and then, if they wish, to help them understand the sources of those code smells. Abdelmoez et al. [7] recommended a risk-based approach to detect code smells, and search-based approaches for code smell detection are presented in [9]-[11]. Fontana et al. [18] compared and experimented with machine learning algorithms for code smell detection. They experimented with 16 machine learning algorithms on four code smell datasets drawn from 74 Java systems, using manually validated instances as the training data. Additionally, boosting techniques were applied to the four code smell datasets. Fontana et al. [33] also proposed code smell severity classification using machine learning algorithms. This technique can help software designers prioritize classes or methods. Multinomial classification and regression methods are used to classify the code smell severity.
Pritam et al. [42] presented an assessment of code smells for predicting class change proneness using machine learning techniques. They validated the effect of code smells on the change proneness of a class in a software system. They applied six machine learning algorithms to predict change proneness from code smells on a set of 8200 Java classes spanning 14 software systems. Kacem et al. [43] suggested an advanced method to detect code smells using deep learning techniques. They applied a hybrid detection approach based on a deep auto-encoder and an artificial neural network algorithm to four code smell datasets extracted from 74 open-source systems.
Pushpalatha et al. [34] recommended an ensemble methods approach to predict the severity of closed-source bug reports. They applied two preprocessing techniques for data reduction, Information gain and Chi-square, and observed that Information gain gives slightly better accuracy than Chi-square. They found that bagging gives better accuracy than the other ensemble algorithms. They obtained the accuracies of various ensemble approaches after reducing dimensionality using Information gain. The accuracy for PitsA varies between 58
Mhawish et al. [35] proposed a machine learning approach to detect code smells in software and identified the metrics that play critical roles in the detection process. They applied two genetic algorithm-based feature selection techniques and a parameter optimization technique based on grid search. They obtained the best accuracy in predicting the Data Class, God Class, and Long Method smells, 98.05%, 97.56%, and 94.31% respectively, using the GA_CFS method, and for Long Method they scored the best accuracy of 98.38% using the GA-Naïve Bayes feature selection method.
Mhawish et al. [36] analyzed code smell prediction using machine learning techniques and software metrics. They also applied genetic algorithm-based feature selection methods to enhance the accuracy of these machine learning algorithms by selecting the best features in each dataset. Moreover, they applied parameter optimization based on the grid search algorithm, which enhanced the accuracy of all these algorithms. They noted that the Random Forest model achieves the best accuracy, 99.71% and 99.70%, in predicting the Data Class in the ORI_D and REFD_D datasets respectively.
Guggulothu et al. [37][38] considered code smell detection using a multi-label classification approach. They used a multi-label classification method to detect whether given code elements are affected by multiple smells. They applied an unsupervised classification technique to obtain good accuracy. Guggulothu et al. [37] obtained the highest accuracy of 99.10% using the B-J48 pruned algorithm for the Feature-envy dataset, and the highest accuracy of 95.90% using the Random Forest algorithm for the Long-method dataset.
Gupta et al. [39] recommended the prediction of eight types of code smells using feature extraction from source code. They applied a data sampling technique to handle the class imbalance problem and a feature selection technique to find the most relevant feature sets. They applied a deep learning technique and improved the AUC accuracy from 88.47% to 96.84%.
Kaur et al. [40] applied an ensemble learning technique and a correlation feature selection technique to three open-source Java datasets for code smell detection. They applied Bagging and Random Forest classifiers and analyzed each approach with four performance measures: accuracy (P1), G-mean 1 (P2), G-mean 2 (P3), and F-measure (P4).
Draz et al. [41] suggested a search-based technique to improve code smell prediction using the Whale optimization algorithm as a classifier. They performed their experiment on five open-source software projects and detected nine types of code smells. They obtained an average of 94.24% precision and 93.4% recall.
Azeem et al. [63] described a Systematic Literature Review (SLR) on the use of machine learning techniques for code smell detection. They targeted four aspects of previous research on code smell detection techniques: (a) which code smells have been detected, (b) what machine learning setups have been adopted, (c) what kinds of evaluation strategies have been exploited, and (d) what performance is claimed for the proposed ML techniques. They found that decision trees and support vector machines are the most common machine learning techniques used for code smell detection, whereas JRip and Random Forest are the most effective classification techniques.
Pecorelli et al. [64] conducted a study comparing the performance of heuristic-based and machine-learning-based techniques for metric-based code smell detection. They considered five types of code smells (God Class, Spaghetti Code, Class Data Should Be Private, Complex Class, and Long Method) and compared ML techniques with DECOR, a state-of-the-art heuristic-based approach. They found that DECOR generally obtained better performance than the ML baseline.

III. RESEARCH FRAMEWORK
The steps followed to build the code smell prediction framework are shown in Figure 1. In the first step, the code smell datasets are taken from Fontana et al. [18]. Then, a preprocessing (normalization) step is applied to the datasets to bring the different value ranges of the features onto a common scale and to help obtain the best parameters for the algorithms. Next, the machine learning algorithms are trained on the datasets and their performance is computed. Finally, the 10-fold cross-validation technique is applied to evaluate the performance of each experiment during the training process, and the grid search algorithm is applied to enhance the accuracy.

A. REFERENCE DATASETS
In this study, four code smell datasets (God-class, Data-class, Feature-envy, and Long-method) from Fontana et al. [18] are used to build the code smell detection framework. The data preparation methodology of Fontana et al. [18] is briefly explained in the following subsections. These datasets are available at http://essere.disco.unimib.it/reverse/MLCSD.html.

B. DATASET SELECTION AND REPRESENTATION
The Qualitas Corpus of software systems was compiled by Tempero et al. [44] and analyzed by Fontana et al. [18]. Of the 111 systems in the corpus, 74 are considered; the remaining 37 systems could not be used to detect code smells because they did not compile correctly. For the 74 available software systems, Fontana et al. [18] computed 61 software metrics for the class-level code smells (Data Class and God Class) and 82 software metrics for the method-level code smells (Feature Envy and Long Method).
Fontana et al. [18] used several detection tools and techniques, called advisors, to detect code smells: iPlasma (God Class, Brain Class), Anti-pattern Scanner [45], PMD [46], Fluid Tool [47], and Marinescu detection rules [48]. Table 1 shows the automatic detection tools; for Long Method, these are iPlasma (Brain Method), PMD, and the Marinescu detection rule [48]. They filtered and relabeled the results manually with the help of three Master's degree students. Each dataset contains 140 smelly and 280 non-smelly instances.
As shown in Table 2, 61 software metrics are calculated at class level for the Data Class and God Class code smells, and 82 software metrics are calculated at method level for the Long Method and Feature Envy code smells. Detailed descriptions of these features are available in Fontana et al. [18]. The code smells on which this study is based are defined as follows.
God Class describes a large class that has many lines of code, functions, or fields. It is considered one of the most complicated code smells because of the many operations and functions it concentrates. It causes issues associated with size, coupling, and complexity [49].
Data Class refers to a class that is used to store data used by other classes. A Data Class contains only fields and accessor methods (getters/setters) without any behavior methods or complex functionality. It therefore creates problems related to data abstraction and encapsulation [49].
Feature Envy is a method-level smell in which a method accesses data or uses operations that belong to other classes more than those of its own class, i.e., it uses more foreign data than local data. It therefore creates problems associated with the strength of coupling [49].
Long Method is a method-level smell that refers to an oversized method, with many lines of code and many functionalities implemented within it. It increases the functional complexity of the method and makes it difficult to understand, and it therefore creates problems related to understanding the operations in methods [49].

C. FEATURE SCALING
Datasets often contain features with different value ranges, so machine learning or classification algorithms cannot always be applied to them directly. Feature scaling is therefore necessary to bring the different ranges onto a common scale. Several machine learning algorithms converge faster on a feature-scaled dataset, and the effect is greater when the model is sensitive to feature magnitude. For example, before applying the support vector machine algorithm, scaling is essential to prevent features with large numeric ranges from dominating those with small numeric ranges, and to avoid the numerical problems that very large values can cause [50].
In this paper, the Min-Max normalization technique is applied to convert the feature values of the datasets to the range 0 to 1. This process is used in the data preprocessing stage, in which the data are prepared for later use by a machine learning technique such as a support vector machine or a neural network [51]. Equation 1 shifts and rescales the values of a feature (X) so that they end up ranging between 0 and 1.
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
In Equation 1, X is an original value, X_min and X_max are the minimum and maximum values of that feature, and X' is the normalized value, which lies between 0 and 1. Min-Max normalization is applied to all the datasets, and the rescaled data in the range 0 to 1 are taken as input to all machine learning techniques.
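The Min-Max rescaling described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the sample metric values, and the handling of constant features are assumptions made for the example.

```python
def min_max_normalize(column):
    """Rescale one feature column to the range [0, 1]."""
    lowest, highest = min(column), max(column)
    if highest == lowest:
        # A constant feature has no spread; mapping it to 0.0 is an
        # assumption, the paper does not discuss this case.
        return [0.0 for _ in column]
    return [(value - lowest) / (highest - lowest) for value in column]


def normalize_dataset(rows):
    """Apply Min-Max normalization to every feature (column) of a dataset."""
    columns = [min_max_normalize(list(column)) for column in zip(*rows)]
    return [list(row) for row in zip(*columns)]


# Hypothetical metric values: lines of code and number of methods.
rows = [[120, 4], [560, 12], [2300, 40]]
scaled = normalize_dataset(rows)
print(scaled)
```

Each column is scaled independently, so a metric with a large range (such as lines of code) no longer dominates one with a small range.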

D. FEATURE SELECTION TECHNIQUE
Feature selection is a method designed to find the most influential features in a dataset. It can improve performance and increase awareness of the software metrics that play a significant role in distinguishing between similar roles in design patterns [55]. In this study, two feature selection techniques are applied to find the most important features in each dataset: Chi-square and Wrapper-based feature selection.
The chi-square based feature selection technique is used for categorical features in a dataset. The chi-square statistic is calculated between each metric and the class label, and the desired number of metrics with the best chi-square scores is selected. The best metrics are selected in each code smell dataset using this technique.
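As a rough illustration of chi-square scoring, the sketch below ranks features by a count-based chi-square statistic between each feature and the class label. It assumes non-negative feature values (for example, after Min-Max normalization); the paper does not state the exact formulation it uses, so the function names and the scoring details here are assumptions.

```python
def chi2_scores(X, y):
    """Chi-square score of each non-negative feature against the class label."""
    n_samples, n_features = len(X), len(X[0])
    classes = sorted(set(y))
    class_prob = {c: sum(1 for t in y if t == c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        feature_total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            # Observed: feature mass that falls into class c.
            observed = sum(row[j] for row, t in zip(X, y) if t == c)
            # Expected: the same mass if the feature were independent of the class.
            expected = class_prob[c] * feature_total
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


def select_top_k(scores, k):
    """Indices of the k features with the highest scores."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[1.0, 0.5], [1.0, 0.5], [0.0, 0.5], [0.0, 0.5]]
y = [1, 1, 0, 0]
scores = chi2_scores(X, y)
print(scores, select_top_k(scores, 1))
```

A feature whose values are spread across classes in proportion to the class sizes scores 0; the more its mass concentrates in one class, the higher the score.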
The Wrapper-based feature selection technique is used to select the subset of features or variables in the dataset that is most relevant to the predicted target value [60]. Fig. 2 shows the working of the wrapper-based feature selection method.
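One common way to realize a wrapper is greedy forward selection, in which candidate subsets are scored by the learning algorithm itself. The sketch below assumes this search strategy, which the paper does not specify; the `evaluate` callback and the toy scoring function are hypothetical stand-ins for training a classifier and measuring its cross-validated accuracy.

```python
def forward_selection(n_features, evaluate, max_features=None):
    """Greedily add the feature that most improves evaluate(subset)."""
    limit = n_features if max_features is None else max_features
    selected, best_score = [], float("-inf")
    remaining = set(range(n_features))
    while remaining and len(selected) < limit:
        candidate, candidate_score = None, best_score
        for j in sorted(remaining):
            score = evaluate(selected + [j])
            if score > candidate_score:
                candidate, candidate_score = j, score
        if candidate is None:
            break  # no remaining feature improves the score
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


def toy_evaluate(subset):
    """Hypothetical score: features 1 and 3 are useful, every feature has a cost."""
    return len(set(subset) & {1, 3}) - 0.1 * len(subset)


selected, score = forward_selection(4, toy_evaluate)
print(selected, score)
```

In a real experiment, `evaluate` would wrap the classifier under study, which is what distinguishes a wrapper from a filter method such as chi-square.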

E. PARAMETER OPTIMIZATION
Every machine learning approach has a set of parameters that affect the learning procedure and the performance of the algorithm. Each parameter differs in type and domain, and the best parameter values for each algorithm depend on the training dataset. To determine suitable parameter values, different combinations of values should be tested for each algorithm so that the prediction model can correctly predict the test dataset [36].
The grid search algorithm is applied in this work to find the optimum parameter values for each algorithm. Grid search is an optimization algorithm used to obtain the best hyperparameter values from a provided set of parameter values [52]. It is used to find the hyperparameters of a model that result in the most accurate predictions. It is based on an exhaustive search for the combination of parameters that yields the best performance value in the prediction model [53]. 10-fold cross-validation is used to prevent the model from overfitting the training data, and to measure the performance of the algorithm with each possible combination of parameters.
In parameter optimization based on grid search, a set of values is defined for each parameter. For nominal parameters, the nominal values are assigned as shown in Table 3. For numeric parameters, the value range and the number of steps are assigned. The values to be tested lie between the lower and upper bounds of the range and are determined by the number of steps specified for each parameter, as shown in Table 4.
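The exhaustive search over nominal values and stepped numeric ranges can be sketched as follows. This is an illustration under assumptions: the parameter names, the bounds, and the `toy_score` function are hypothetical, and "number of steps" is taken here to mean the number of evenly spaced values including both bounds.

```python
from itertools import product


def numeric_steps(lower, upper, steps):
    """Evenly spaced values from lower to upper, both bounds included."""
    if steps < 2:
        return [float(lower)]
    width = (upper - lower) / (steps - 1)
    return [lower + i * width for i in range(steps)]


def grid_search(param_grid, evaluate):
    """Try every parameter combination and keep the best-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)  # e.g. mean 10-fold cross-validation accuracy
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for cross-validated accuracy."""
    bonus = 0.5 if params["criterion"] == "entropy" else 0.0
    return bonus - abs(params["max_depth"] - 6)


grid = {
    "criterion": ["gini", "entropy"],        # nominal parameter
    "max_depth": numeric_steps(2, 10, 5),    # numeric range with 5 steps
}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations is the product of the number of values per parameter, so the grid above requires 2 x 5 = 10 evaluations.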

F. VALIDATION METHODOLOGY
In this study, validation techniques are applied to evaluate the performance of each experiment. During the training process, the 10-fold cross-validation technique is applied: the dataset is split into 10 parts and the model is trained 10 times. In each repetition, one part of the dataset is used as the test set and the remaining parts are used for training. Finally, the trained models are tested on an unseen test dataset (a 20% split taken from the dataset before training). Fig. 3 shows the general k-fold cross-validation evaluation technique.
• Accuracy: accuracy is an important performance measure in machine learning. It represents the percentage of correctly classified instances in the positive and negative classes. Accuracy is related to precision and recall: ideally, precision should remain high as recall improves. A suitable method should therefore have a high rate of true positives with a low rate of false positives and false negatives [34]. Equation 5 shows the formula for accuracy, where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{5}$$
The accuracy is 0.0 for the worst performance and 1.0 for the best performance.
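The fold construction and the performance measures used in this study (Accuracy, Precision, Recall, and F1 Score) can be sketched as below. This is a minimal sketch with assumed function names; it splits indices in order, whereas in practice the data would normally be shuffled first.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Each dataset in this study has 420 instances (140 smelly, 280 non-smelly).
folds = list(k_fold_indices(420, k=10))
metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(len(folds), metrics)
```

With 420 instances and 10 folds, every repetition trains on 378 instances and tests on the remaining 42.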

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS
The six machine learning techniques and the results obtained from the experiments are presented in the following subsections. The experimental results obtained with each machine learning technique are shown in the form of a table and a bar chart.

A. NAIVE BAYES
Naive Bayes is a simple machine learning algorithm that uses Bayes' rule and makes the strong assumption that, given the class, the features are conditionally independent. Although this independence assumption is often violated in practice, Naive Bayes often provides competitive classification accuracy. Together with its computational efficiency and many other desirable features, this has led to the wide application of Naive Bayes in practice [56].
The experimental results obtained by applying the Naive Bayes algorithm to the four code smell datasets are shown in Table 5. The comparative performance across all the code smell datasets is shown in Fig. 4 in the form of a bar chart. In this experiment, the Naive Bayes algorithm obtained its highest accuracy (96.80%) for the God Class dataset.
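To make the independence assumption concrete, the following sketch implements a small categorical Naive Bayes classifier with additive (Laplace) smoothing. The variant, the `alpha` smoothing parameter as used here, and the toy data are assumptions for illustration; the paper does not state which Naive Bayes variant was used.

```python
import math
from collections import Counter


def train_naive_bayes(X, y, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    n_features = len(X[0])
    class_counts = Counter(y)
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    feature_values = [set() for _ in range(n_features)]
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            value_counts[label][j][value] += 1
            feature_values[j].add(value)
    return {
        "alpha": alpha,  # must be > 0 so unseen values keep a non-zero probability
        "n": len(y),
        "class_counts": class_counts,
        "value_counts": value_counts,
        "feature_values": feature_values,
    }


def predict_naive_bayes(model, row):
    """Return the class with the highest posterior (computed in log space)."""
    best_label, best_log_prob = None, float("-inf")
    for label, count in model["class_counts"].items():
        log_prob = math.log(count / model["n"])  # class prior
        for j, value in enumerate(row):
            # Features are treated as independent given the class,
            # so their smoothed likelihoods simply multiply.
            numerator = model["value_counts"][label][j][value] + model["alpha"]
            denominator = count + model["alpha"] * len(model["feature_values"][j])
            log_prob += math.log(numerator / denominator)
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


# Hypothetical discretized metrics: (size, number of accessor methods).
X = [("high", "many"), ("high", "few"), ("low", "few"), ("low", "few")]
y = ["smelly", "smelly", "clean", "clean"]
model = train_naive_bayes(X, y, alpha=1.0)
print(predict_naive_bayes(model, ("high", "many")))
```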

B. K-NEAREST NEIGHBOUR (KNN)
The K-Nearest Neighbor (KNN) algorithm is a supervised machine learning procedure that can be applied to both classification and regression problems, although in industry it is mostly used for classification. The KNN algorithm uses 'feature similarity' to predict the value of a new data point: the new data point is assigned a value based on how closely it matches the points in the training set [57]. The experimental results obtained by applying the KNN algorithm to the four code smell datasets are shown in Table 6. The comparative performance across all the code smell datasets is shown in Fig. 5 in the form of a bar chart. In this experiment, the KNN algorithm obtained its highest prediction accuracy (97.89%) for the Long-Method dataset.
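The 'feature similarity' idea can be sketched as a majority vote among the k closest training points. Euclidean distance, k = 3, and the toy data are assumptions made for this example, not settings reported in the paper.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    neighbours = sorted(
        zip(train_X, train_y),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]


# Hypothetical normalized metric vectors.
train_X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
train_y = ["clean", "clean", "smelly", "smelly"]
print(knn_predict(train_X, train_y, [0.85, 0.85], k=3))
```

Because the vote depends on distances, KNN is one of the algorithms for which the feature scaling step matters most.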

C. MULTILAYER PERCEPTRON (MLP)
The multilayer perceptron (MLP) is a flexible machine learning technique that can adapt to complex nonlinear tasks. The MLP is the most popular type of neural network, consisting of a feed-forward network of processing neurons that are grouped into layers and connected by weighted links [58]. The experimental results obtained by applying the multilayer perceptron algorithm to the four code smell datasets are shown in Table 7. The comparative performance across all the code smell datasets is shown in Fig. 6 in the form of a bar chart. In this experiment, the MLP algorithm obtained its highest prediction accuracy (97.62%) for the Data Class dataset.
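The feed-forward structure, with neurons grouped into layers and connected by weighted links, can be illustrated with a forward pass. The sigmoid activation and the weight values below are hypothetical; training (for example with the learning rate and momentum discussed later) is outside the scope of this sketch.

```python
import math


def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Output of one layer: each neuron applies sigmoid(w . x + b)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the inputs through every (weights, biases) layer in order."""
    activations = inputs
    for weights, biases in layers:
        activations = layer_forward(activations, weights, biases)
    return activations


# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                  # output layer
]
output = mlp_forward([0.3, 0.7], layers)
print(output)
```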

D. DECISION TREE (DT)
A decision tree is a tree in which each internal (non-leaf) node is associated with a decision and each leaf node is usually associated with an outcome or class label. Each internal node tests one or more attribute values, leading to two or more links or branches. Each link is in turn associated with a possible decision value. These links are mutually exclusive and exhaustive [59]. The experimental results obtained by applying the decision tree algorithm to the four code smell datasets are shown in Table 8. The comparative performance across all the code smell datasets is shown in Fig. 7 in the form of a bar chart. In this experiment, the Decision Tree algorithm obtained its highest prediction accuracy (98.59%) for the Long-Method dataset.
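How an internal node chooses its test can be illustrated with the entropy and Gini impurity measures and a search for the threshold with the highest information gain on one attribute. This is a sketch with assumed function names and toy data; it covers a single binary split, not a full tree learner.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(labels, left, right):
    """Reduction in entropy obtained by splitting labels into left and right."""
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted


def best_split(values, labels):
    """Threshold t (test: value <= t) with the highest information gain."""
    best_threshold, best_gain = None, 0.0
    for threshold in sorted(set(values))[:-1]:
        left = [l for v, l in zip(values, labels) if v <= threshold]
        right = [l for v, l in zip(values, labels) if v > threshold]
        gain = information_gain(labels, left, right)
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical lines-of-code values for four methods.
values = [10, 20, 200, 300]
labels = ["clean", "clean", "smelly", "smelly"]
threshold, gain = best_split(values, labels)
print(threshold, gain)
```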

E. LOGISTIC REGRESSION
Logistic regression is a statistical analysis technique that is applied to predict data values based on past observations of a dataset. It is an important tool in machine learning that predicts a dependent variable by analyzing its relationship with one or more independent variables. In this study, it is used to predict from the software metrics whether a code smell is present or not. The experimental results obtained by applying logistic regression to the four code smell datasets are shown in Table 9. The comparative performance across all the code smell datasets is shown in Fig. 8 in the form of a bar chart. In this experiment, the Logistic Regression algorithm obtained its highest accuracy of 99.52% and F1 score of 100.00% for the Long-Method dataset.
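A minimal sketch of how logistic regression relates the independent variables to a binary outcome is given below, trained with plain gradient descent. The L2 penalty, the reading of C as an inverse regularization strength, the learning rate, and the toy data are all assumptions for illustration and are not taken from the paper.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, row):
    """Probability that row belongs to the positive (smelly) class."""
    return sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)


def train_logistic_regression(X, y, C=1.0, learning_rate=0.5, epochs=2000):
    """Gradient descent on the mean log-loss plus an L2 penalty scaled by 1/C."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        errors = [predict_proba(weights, bias, row) - t for row, t in zip(X, y)]
        gradients = [
            sum(e * row[j] for e, row in zip(errors, X)) / n + weights[j] / (C * n)
            for j in range(d)
        ]
        weights = [w - learning_rate * g for w, g in zip(weights, gradients)]
        bias -= learning_rate * sum(errors) / n
    return weights, bias


# Hypothetical normalized metric: 0 = clean, 1 = smelly.
X = [[0.0], [0.1], [0.9], [1.0]]
y = [0, 0, 1, 1]
weights, bias = train_logistic_regression(X, y, C=10.0)
predictions = [int(predict_proba(weights, bias, row) >= 0.5) for row in X]
print(predictions)
```

A smaller C strengthens the penalty and shrinks the weights, which is the trade-off that the parameter tuning experiments explore.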

F. RANDOM FOREST
Random forest is a machine learning technique used to solve regression and classification problems. It uses ensemble learning, a technique that combines many classifiers to provide solutions to complex problems. The random forest algorithm builds its result from the predictions of its decision trees by averaging the outputs of the individual trees. It can reduce overfitting to the dataset and improve accuracy. The experimental results obtained by applying the Random Forest algorithm to the four code smell datasets are shown in Table 10. The comparative performance across all the code smell datasets is shown in Fig. 9 in the form of a bar chart. In this experiment, the Random Forest algorithm obtained its highest prediction accuracy of 99.52% and F1 score of 100% for the Long-Method dataset.
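The ensemble idea can be sketched with two ingredients: bootstrap sampling of the training data for each tree, and combining the trees' outputs. For classification, the combination is assumed here to be a majority vote (the classification counterpart of averaging); the three hand-written "trees" are hypothetical one-rule stand-ins for trained decision trees.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) training examples with replacement for one tree."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [labels[i] for i in indices]


def forest_predict(trees, row):
    """Combine the trees' class predictions by majority vote."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical one-rule trees over (lines of code, number of parameters).
trees = [
    lambda r: "smelly" if r[0] > 100 else "clean",
    lambda r: "smelly" if r[1] > 10 else "clean",
    lambda r: "smelly" if r[0] > 500 else "clean",
]

rows = [[300, 20], [50, 5], [700, 2]]
labels = ["smelly", "clean", "smelly"]
sample_rows, sample_labels = bootstrap_sample(rows, labels, random.Random(0))
print(forest_predict(trees, [300, 20]))
```

Because each tree sees a different bootstrap sample, individual trees make different errors, and the vote tends to cancel them out.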

G. PERFORMANCE COMPARISON
In the earlier subsections, the performance measurements of the six machine learning techniques are shown in separate tables. In this section, the performance of all six techniques (Naive Bayes, KNN, MLP, Decision Tree, Logistic Regression and Random Forest) is compared. The comparison of their performance on the four code smell datasets is shown in Table 11, and the corresponding comparison charts are shown in Figure 10. The comparison shows that the Random Forest algorithm (Data class: 98.94%, God class: 97.88%, Feature-envy: 97.58%, and Long-method: 99.52%) achieves better accuracy than the other five algorithms, while the overall worst performance is obtained by Naive Bayes.

H. IMPACT OF FEATURE SELECTION
This experiment studies the influence of feature selection methods on model accuracy and identifies the software metrics that play a significant role in predicting code smells. To answer RQ2, Chi-square and Wrapper-based feature selection techniques are applied. In Table 12, the percentage accuracy and F1-score of all algorithms before and after applying the feature selection techniques are compared. The results indicate that the Random Forest and Logistic Regression algorithms perform better when all features are used. On the other hand, feature selection significantly improves the accuracy of some models, such as KNN, Naive Bayes and Multilayer Perceptron. Table 13 shows the set of features selected by the Chi-square feature selection technique for the code smell datasets. The best results are obtained by selecting 10 metrics for the God class dataset and 12 metrics each for the Data class, Feature-envy and Long-method datasets. Likewise, Table 14 shows the metrics selected from each dataset by the Wrapper-based feature selection technique. Using this method, 12 metrics are selected for Data class, 9 for God class, 11 for Feature-envy and 7 for Long-method.

I. IMPACT OF 10-CROSS VALIDATION AND GRID SEARCH
To answer RQ3, the accuracies obtained with 10-fold cross validation and with the Grid search algorithm are compared. Table 15 reports the accuracy of both. In this experiment, the Grid search algorithm gives better results than cross validation.
To answer RQ4, the impact of tuning the parameters of the machine learning algorithms on performance is examined. The Decision Tree model achieved its highest accuracy of 98.22% when the number of trees is 12, the maximum depth is 10, and the criterion is "Gain ratio", "Gini index" or "Information gain", "Accuracy" and "Entropy". Tables 16 and 17 show the different combinations of parameters for the Decision Tree algorithm. The Multilayer Perceptron model achieved its best accuracy of 97.62% when the learning rate is constant and the momentum rate is set to 0.9. In the same setting, when the momentum rate is set to 0, the accuracy decreases from 97% to 93%. The Naive Bayes model achieved its best accuracy of 96.80% when the parameter Alpha is set to 10. In the same setting, when Alpha is set to 0, the accuracy decreases from 96% to 92%. The Random Forest model obtained its best accuracy of 99.74% when the number of trees is 35 and the maximal depth is 20. Table 18 shows that when the number of trees grows while the maximal depth remains constant, the accuracy decreases. The Logistic Regression model achieved its highest accuracy of 99.52% when the parameter 'C' is 0.1 and the penalty is l1. As Table 19 shows, in the same setting, when 'C' is set between 1 and 100 and the penalty is l2 or l1, the accuracy decreases from 98% to 97%.
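The tuning procedure can be sketched as an exhaustive search over a parameter grid in which each combination is scored by k-fold cross validation. The sketch below is an illustration under stated assumptions, not the code of this study: it tunes a small hand-written KNN classifier on invented, well-separated data, and the names knn_predict, cv_accuracy and grid_search are hypothetical.

```python
from collections import Counter
from itertools import product


def knn_predict(train_X, train_y, x, k, p):
    """Majority label among the k nearest training points (Minkowski distance of order p)."""
    dists = sorted(
        (sum(abs(a - b) ** p for a, b in zip(row, x)) ** (1 / p), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, n_folds, k, p):
    """Accuracy pooled over n_folds interleaved folds for one parameter setting."""
    correct = 0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        train_X = [X[i] for i in range(len(X)) if i not in test_idx]
        train_y = [y[i] for i in range(len(X)) if i not in test_idx]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, X[i], k, p) == y[i]
    return correct / len(X)


def grid_search(X, y, grid, n_folds=10):
    """Try every combination in the grid and keep the first best-scoring one."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_accuracy(X, y, n_folds, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy data (invented): two well-separated clusters of 10 points each.
X = [(i % 5, i % 3) for i in range(10)] + [(10 + i % 5, 10 + i % 3) for i in range(10)]
y = [0] * 10 + [1] * 10
grid = {"k": [1, 3, 5], "p": [1, 2]}  # neighbours and distance order
best_params, best_score = grid_search(X, y, grid)
print(best_params, best_score)
```

Because each candidate setting is itself scored with cross validation, grid search in this form builds on cross validation rather than replacing it; the grid only determines which parameter settings are evaluated.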
In this experiment, it is observed that the parameter optimization technique has a positive effect on the accuracy of the machine learning algorithms.
RQ1: To answer our first research question, six machine learning algorithms are applied to four code smell datasets. The results show that machine learning algorithms are well able to predict code smells. The software metrics that play an important role in predicting code smells are identified and shown in Table 13 and Table 14.
RQ2: Chi-square and Wrapper-based feature selection techniques are applied. The results indicate that the Random Forest and Logistic Regression algorithms perform better when all features are used, while the accuracy of the KNN, Naive Bayes and Multilayer Perceptron models is significantly improved by feature selection.

A. COMPARISON OF OUR RESULTS WITH OTHER RELATED WORKS
RQ3: To answer our third question, the accuracies of 10-fold cross validation and the Grid search algorithm are compared. Table 15 reports the accuracy of both. It is found that the Grid search algorithm gives better results than cross validation.
RQ4: To answer the fourth question, the impact of tuning the parameters of the machine learning algorithms on performance is examined. It is observed that parameter optimization has a positive effect on the accuracy of all machine learning algorithms used in this work.

C. THREATS TO VALIDITY
This section discusses possible threats that might have affected our experiment and how we tried to mitigate them.

1) THREATS TO INTERNAL VALIDITY
The main internal threat in our study is the dataset, which is taken directly from Fontana et al. [18]. They built the dataset by using code smell advisors to select candidates from a large repository of 74 heterogeneous software systems (the Qualitas Corpus) and manually validated 420 examples for each code smell. Different metrics are considered when generating the dataset, and not all of them necessarily affect the performance of the implemented models. To manage this threat, two feature selection techniques are used to find the more impactful metrics, and the results obtained with both techniques are compared. As for the prediction models, they are implemented in Python, which is now a widely accepted programming language with a large set of libraries in most domains.

2) THREATS TO EXTERNAL VALIDITY
In our experiment, the threats to external validity are as follows. The first threat is that the dataset covers only two types of code smells, namely class-level and method-level smells. The second threat is that the applications from which the dataset is generated are all written in Java. Thus, our approach might not be appropriate for C/C++ source code.

3) THREATS TO CONCLUSION VALIDITY
This threat concerns the evaluation of the prediction models' performance. 10-fold cross-validation is used to evaluate the predictive models with multiple evaluation metrics, including accuracy, F1-score, precision, and recall. Although these evaluation metrics may not be sufficient on their own, the same metrics as in Fontana et al. [18] are used so that our results can be compared with existing techniques. To manage this threat, the confidence value of each prediction is calculated and, using feature selection, the metrics with a higher impact on prediction are identified, which helps the prediction model make correct decisions.
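The four evaluation metrics named above follow directly from the confusion-matrix counts of a test fold. The following sketch shows the computation for a binary smell label. The function name and the toy predictions are assumptions for illustration, and returning 0.0 when a denominator is empty is a convention chosen here, not one stated in this study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score for a binary prediction task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    tn = len(pairs) - tp - fp - fn                               # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy fold (invented): 1 = smelly, 0 = not smelly.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters when smelly instances are a minority, since a model that rarely predicts the positive class can still reach high accuracy.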

VI. CONCLUSION AND FUTURE WORK
In this paper, a novel approach is proposed to predict code smells in software and to identify the metrics that play a significant role in the detection process using machine learning techniques. Four code smell datasets (God-class, Data-class, Feature-envy and Long-method) generated from 74 open-source systems (Fontana et al. [18]) are used. Six different machine learning algorithms (Naive Bayes, KNN, Multilayer Perceptron, Decision Tree, Random Forest and Logistic Regression) are applied to these datasets. The Chi-square and Wrapper-based feature selection techniques are applied to identify the best metrics for improving accuracy. The Grid search algorithm is applied for parameter optimization, which significantly improves the accuracy of all algorithms. The main contribution of this paper is two-fold. In the first step, machine learning algorithms are used to detect code smells. In the second step, the performance measures (accuracy, precision, recall, and F1-score) are calculated. The performance was improved by applying the Chi-square and Wrapper-based feature selection techniques along with the Grid search algorithm and 10-fold cross-validation.
For the Data-class dataset, the Random Forest algorithm achieved the highest accuracy of 99.74% when all features were considered, while the worst performance, 83.10%, was obtained by Naive Bayes, also with all features. For the God-class dataset, the Random Forest algorithm achieved the highest accuracy of 98.21% with Chi-square feature selection, while the worst performance, 93.81%, was obtained by KNN, also with Chi-square feature selection. For the Feature-envy dataset, the Decision Tree algorithm achieved the highest accuracy of 98.60% when all features were considered, while the worst performance, 84.85%, was obtained by KNN with Wrapper-based feature selection. For the Long-method dataset, Logistic Regression achieved the highest accuracy of 100% when all features were considered, while the worst performance, 95.20%, was obtained by Naive Bayes with Chi-square feature selection.