Smell-Aware Bug Classification

Code smells indicate inadequacies in design and implementation choices. Previous studies have demonstrated that code smells harm software maintainability, including effects on components' bug proneness and code quality. This study investigates the importance of code smell metrics in prediction models for detecting bug-prone code modules. To improve the bug prediction model, we use smell-based code metrics. To train our models, we employed 14 open-source projects from the PROMISE repository. Every project is written in Java, and each project file contains source code metrics as well as code smell metrics. We classify code components as buggy or non-buggy using five classifiers: Naïve Bayes, Random Forest (RF), Support Vector Machine (SVM), Logistic Regression, and k-Nearest Neighbor. We evaluated the five classifiers within the version, within the project, and across projects using the F1 score, accuracy, precision, recall, the area under the receiver operating characteristic curve, and the area under the precision-recall curve. RF and SVM gave better results within the version as well as within the project.


I. INTRODUCTION
Software systems play an essential role in daily life and are used to meet almost all everyday needs. In this digital world, which is powered by software, people rely on software systems for economics, transport, medical care, communication, knowledge, combat, power plants, and even entertainment. Because people depend so heavily on software, its accurate functioning is vital, and it should be as bug free as possible. A bug is a flaw or malfunction in software code that causes incorrect or unwanted results [1]. In the software development process, software defects are called bugs. Unanticipated behaviors identified by quality control engineers during application testing are recorded as software bugs. Bugs strongly affect software quality [2]. Information related to bugs is kept in a bug report [2], [3]. When bug reports are generated, they are returned to the software development team, which fixes the identified bugs. This assignment and transfer procedure is termed bug triaging [4], [5]. Bug fixing is exceptionally slow and time-consuming, so bugs need to be detected automatically. Automatic bug detection is a binary classification problem in which code belongs to either the buggy or the non-buggy class. It is technically conceivable to create an application without bugs; however, this is not the case in practice [6]. Varshneya [6] argued that making an application without bugs is impossible because complete code coverage is not a criterion for bug detection unless the software is a life-critical application. Even though it is impossible to make an application entirely free of bugs, software engineers strive to release applications with the fewest possible bugs. Hence, software testing must be an obligatory part of the software development life cycle (SDLC).
The associate editor coordinating the review of this manuscript and approving it for publication was Alberto Cano.
Predicting software bugs is essential in software development because identifying buggy modules before release increases overall software quality and user satisfaction and further improves software performance [7]. Furthermore, predicting bugs early improves software adaptation to various environments and optimizes resource utilization. Several techniques have been proposed to deal with the software bug prediction problem. Machine learning (ML) techniques for predicting software bugs are well documented [1], [2], [3], [4], [5], [6], [7]. They are widely used to predict buggy components based on historical data, essential metrics, and other software computing techniques. This paper uses five supervised machine learning classifiers to evaluate the ability of ML in software bug prediction.
The primary goal of this investigation is to create a bug prediction model that uses code smells as candidate metrics [1]. The proposed bug prediction model is smell aware, i.e., it categorizes code into two classes, buggy and non-buggy, based on source code metrics and code smell metrics. In addition, smell-aware prediction can shorten the debugging period by localizing and refactoring the smelly files that trigger a failure [8]. To make the model smell aware, we added an intensity index to the dataset. We trained the bug prediction models with the Logistic Regression (LR), Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and k-Nearest Neighbor (k-NN) algorithms and compared the performance of the resulting classifiers. The comparison was based on measures such as accuracy, precision, recall, F1 score, and ROC curves.
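The comparison measures named above can be computed from the counts of a confusion matrix. The following is a minimal sketch using only the Python standard library; the label vectors are made-up illustrative data, not results from this study, and the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN, treating label 1 as buggy and 0 as non-buggy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def classification_report(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 score as a dictionary."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Illustrative labels only (1 = buggy, 0 = non-buggy).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
report = classification_report(y_true, y_pred)
print(report)
```

Guarding each division against a zero denominator keeps the sketch usable on releases in which a classifier predicts no buggy modules at all.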

A. PROBLEM STATEMENT
A code smell often indicates a more severe problem in the software system. It arises when the software engineer does not follow design principles such as encapsulation, modularity, abstraction, hierarchy (top-down and bottom-up strategy), modifiability, cohesion, and coupling. Code smells lower code quality and make the code difficult to understand and sustain [8]. They expose design flaws and make a software system more challenging to understand, maintain, and improve [9]. For this reason, we want to detect and classify bugs based on both smell-based and source code-based metrics, which is an active research area in software engineering. Some studies on bug classification have been published [10], [11], [12], [13]; however, the majority of them are limited to source code metrics only [14], [15], [16], which lowers the predictive ability of those bug classification models. This study develops a model based on both source code metrics and code smell metrics. To make it smell aware, we added the intensity index, which estimates the severity of a code smell, aids bug classification, and reflects the complexity of the code. In a smell-aware bug prediction model, the intensity index plays an important role in determining the severity of the design issues affecting a code module. Most of the published literature [10], [13], [16] did not use the intensity index as a feature for bug classification; our essential contribution to the dataset is the addition of an intensity index for smell-aware bug classification.
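As an illustration of what adding an intensity index to a metrics dataset can look like, the sketch below appends one extra feature to each row of source code metrics. The class names, metric columns, and values are hypothetical, and assigning an intensity of 0 to classes without a detected smell is an assumption of this sketch rather than a detail stated in this section.

```python
# Hypothetical per-class rows of source code metrics (names and values are
# illustrative only); "bug" is the known label for the class.
source_metrics = [
    {"class": "Parser", "loc": 420, "wmc": 31, "cbo": 12, "bug": 1},
    {"class": "Token", "loc": 80, "wmc": 6, "cbo": 2, "bug": 0},
]

# Hypothetical smell intensity per class; classes with no detected smell
# are simply absent from the mapping.
smell_intensity = {"Parser": 7.5}


def add_intensity(rows, intensity_by_class, default=0.0):
    """Return copies of the rows with an extra 'intensity' feature.

    Assumption: classes without a detected smell receive `default`.
    """
    return [dict(row, intensity=intensity_by_class.get(row["class"], default))
            for row in rows]


def to_features(rows, feature_names):
    """Split rows into a feature matrix X and a label vector y."""
    X = [[float(row[name]) for name in feature_names] for row in rows]
    y = [row["bug"] for row in rows]
    return X, y


smell_aware_rows = add_intensity(source_metrics, smell_intensity)
X, y = to_features(smell_aware_rows, ["loc", "wmc", "cbo", "intensity"])
print(X, y)
```

Keeping the intensity as one more numeric column means the same classifiers can be trained with and without it, which is what allows the contribution of the index to be measured.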
In the published literature, the majority of studies use LR [1], [17], NB [13], [16], k-NN [18], and decision trees (DTrees) [16] as bug classification models; however, the effectiveness of SVM has not been explored. Furthermore, the preceding models were trained on datasets based only on source code metrics.

B. AUTOMATED BUG PREDICTION
The first bug prediction model that takes smells into account was designed by Taba et al. [19]. They specifically established three measures, which they termed antipattern metrics. These metrics are defined in the context of smells and can be used, in addition to structural metrics, to assess the average number of antipatterns, their complexity, and their recurrence length [20]. A bug prediction model can take advantage of these antipattern measures to become smell aware. Furthermore, the antipattern metrics were tested together with structural metrics, demonstrating that bug prediction models can improve by up to 12.5% when design faults are considered.
We hypothesized that the severity of a design issue affecting a source code segment matters in a bug prediction model. To test this conjecture, we employed the intensity index defined by Fontana et al. [21]. To create a smell-aware bug prediction model, we consider the design flaws that affect a code module and their severity. More precisely, we assessed the predictive power of the severity index by combining it with a bug prediction model based on structural quality measures [22] and comparing its accuracy with that of the standard model on fourteen large open-source Java projects. We also analyzed the benefit of adding the severity index to these models relative to other structural metrics, including the ones used to calculate the intensity. According to the findings, using the intensity index to predict bugs improves the classification results. The results revealed that using the severity index as a predictor of buggy modules improves the correctness of a bug prediction model based on architectural quality criteria (AQC). Furthermore, the data show that the severity index is more significant than every other quality metric in predicting the bug-proneness of smelly classes. The findings show that using the severity index as an indicator of buggy modules improves the effectiveness of structure-based baseline bug prediction models. They also emphasize the significance of the intensity of code smells in prediction approaches based on process metrics.

C. OBJECTIVE OF THE STUDY
• The first objective of this research is to leverage code smell metrics as features for developing a bug prediction model, specifically a smell-aware bug prediction model. The model in this study is therefore developed using various sources of information, specifically product and process metrics.
14062 VOLUME 12, 2024 Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
• The second objective is to describe the actual contribution of the intensity index to bug classification. The intensity index estimates the severity of a code smell, which aids bug classification and reflects the complexity of the code. In a bug classification model, the intensity index plays an important role in determining the severity of the design issues affecting a module. The experiments conducted in this study aim to determine how important the intensity index is in prediction models for detecting bug-prone code modules [20].
• The third objective is to develop a smell-aware bug prediction model by evaluating the selected classifiers on the dataset extended with smell-based metrics such as the intensity index, and to identify the best among them. The smell-aware bug classification model was therefore built using the NB, RF, SVM, k-NN, and LR classifiers as candidates. To this end, this study examines different evaluation metrics for the five models within the version, within the project, and across projects.
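The three evaluation settings named in the objectives can be sketched as three ways of splitting the same records. The definitions used here are assumptions, since this section does not spell them out: within the version splits one release into train and test parts, within the project trains on one release and tests on another release of the same project, and across projects trains on all other projects. The project names, versions, and values are hypothetical.

```python
import random

# Hypothetical records: (project, version, feature vector, label).
records = [
    ("projA", "1.0", [120.0, 3.0], 1),
    ("projA", "1.0", [40.0, 0.0], 0),
    ("projA", "1.1", [200.0, 5.5], 1),
    ("projA", "1.1", [35.0, 0.0], 0),
    ("projB", "2.0", [90.0, 2.0], 0),
    ("projB", "2.0", [310.0, 8.0], 1),
]


def within_version(rows, project, version, test_fraction=0.5, seed=0):
    """Randomly split a single release into train and test parts."""
    subset = [r for r in rows if r[0] == project and r[1] == version]
    random.Random(seed).shuffle(subset)
    cut = int(len(subset) * (1 - test_fraction))
    return subset[:cut], subset[cut:]


def within_project(rows, project, train_version, test_version):
    """Train on one release and test on another release of the same project."""
    train = [r for r in rows if r[0] == project and r[1] == train_version]
    test = [r for r in rows if r[0] == project and r[1] == test_version]
    return train, test


def cross_project(rows, target_project):
    """Train on every other project and test on the target project."""
    train = [r for r in rows if r[0] != target_project]
    test = [r for r in rows if r[0] == target_project]
    return train, test


wv_train, wv_test = within_version(records, "projA", "1.0")
wp_train, wp_test = within_project(records, "projA", "1.0", "1.1")
cp_train, cp_test = cross_project(records, "projB")
print(len(wv_train), len(wp_test), len(cp_train))
```

Fixing the seed of the within-version split makes the comparison between classifiers repeatable, because every classifier then sees exactly the same train and test rows.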

II. BACKGROUND AND LITERATURE REVIEW

A. CODE SMELL
The term 'Code Smell' was originally coined by Fowler in his refactoring book. Code smells are a metaphor for patterns commonly linked with poor design and bad programming practices [23]. A code smell often indicates a more severe problem in the system. It arises when the software engineer does not follow design principles such as encapsulation, modularity, abstraction, hierarchy (top-down and bottom-up strategy), modifiability, cohesion, and coupling. Even developers who know the design principles often violate them because of inexperience, deadline pressure, and heavy competition in the marketplace. In the real world, software systems evolve daily to meet new requirements or to correct discovered bugs. The pressure to meet tight deadlines makes it difficult for developers to manage the complexity of such modifications effectively. Indeed, development operations are frequently carried out in an undisciplined manner, which erodes the system's initial design by introducing technical debt [24]. Software aging is a common term for this phenomenon [25], which some researchers have measured in terms of entropy. Fowler et al. [26] described ''bad code smells'' (shortened to ''code smells'' or simply ''smells'') as signs of the existence of bad design or implementation choices applied in the development of a software application. A smell means that the developers or designers did not design the software appropriately. There are different kinds of code smells, such as long methods, complex classes, message chains, and many more, which are explained in the subsequent sections. These are just a few instances of code smells that might harm a software system. In addition to proposing approaches for automatically detecting code smells in source code [27], the research community has worked to understand code smells and their adverse effects on the non-functional characteristics of source code. This work examines when and why code smells take shape, how they develop and persist in software programs, and to what extent they matter to software developers. Several studies [19], [20], [21] have also reported that code smells can have adverse effects on software maintainability and understandability. Khomh et al. [28] and Palomba et al. [29] recently found that classes with design flaws are more likely to contain bugs in the future. Moreover, this research revealed the hidden impact of code smells on bug prediction. The academic community has only scratched the surface of these observations.

B. MOTIVATION AND NOVELTY
In real life, software applications change frequently to adapt to new requirements or as a result of bug fixes. The pressure to meet tight deadlines makes it difficult for programmers to deal effectively with the complexity of such modifications. Indeed, development processes are often carried out in an undisciplined manner, which erodes the system's initial design by introducing technical debt [24]. Some researchers [15], [20], [21] have termed this issue code smell. Empirical evidence shows that code smells hamper code understandability [30], increase change proneness [28] and error proneness [31], and make code less maintainable [32]. Code smells also affect routine software development tasks such as code inspection, refactoring, and maintenance [33].
With code smells, the system may still work, but they can slow down the entire system and produce future bugs because of bad design. As bugs accumulate, the system produces errors and unwanted results. Code smells are signs of poor design that result in less maintainable code. The more signs of bad code smells the source code contains, the more likely it is that parts of the code were written without following the intended design pattern. Therefore, we want to develop a smell-aware bug prediction model. Most authors performed bug classification without smell-based metrics: some classified bugs by priority, some by severity, and others used different methods and approaches. In this study, we develop smell-aware bug classification using different supervised ML classifiers.

C. LITERATURE REVIEW
Bug prediction is one of the most active current research fields in software engineering, and the academic community has created a variety of prediction methods. The major approaches to software bug prediction are based on classification. Early work assumed that software complexity leads to defects. To capture the complexity of software, Akiyama [34] suggested a basic model based on lines of code (LOC). However, LOC proved too simplistic as a bug prediction metric. In 1976, McCabe proposed the cyclomatic complexity (CC) metric for bug prediction [35]. At that time, the Halstead and CC metrics [36] were prominent measurements; unfortunately, they suffer from severe flaws. Khoshgoftaar and Munson et al. [37] suggested a more accurate classification model. In the 2000s, prediction models based on process metrics were introduced as the use of version control systems expanded. The bug prediction models created in the 2000s had several disadvantages. One shortcoming is that they cannot predict defects at the moment a source code file is modified. To solve this issue, the Just-in-Time (JIT) model was proposed. Another disadvantage was the difficulty of predicting bugs for new projects or projects with limited historical data. As a solution to this constraint, cross-project defect prediction methods were developed. Studies of this approach show that when cross-company data increases the likelihood of detecting defects, the percentage of false positives also increases.
Pan et al. [14] proposed 13 program-slicing metrics for bug classification in the C programming language; these metrics use program slice information to measure program size, coupling, complexity, and cohesion. In contrast to standard code metrics, which focus on the statements or structure of the code, program-slicing metrics measure program behavior. Program slicing techniques [38], [39] investigate the behavior of source code by looking at the data-flow and control dependencies between statements. Program-slicing metrics include sliceCount, verticesCount, edgesCount, sliceVerticesSum, globalInput, lackOfCohesion, directFanIn, and edgesToVerticesRatio. At the file level, program-slicing metrics achieve an overall accuracy of 82.6 percent for the Apache HTTP project and 92 percent for the Latex2rtf project. One significant drawback of program-slicing metrics is that they can only be used to generate preprogram-slicing data for large projects. Regarding bug classification, the data imply that program-slicing metrics are at least as effective as UC metrics.
Bug fixing is a time-consuming process. To make it easier, bugs must be grouped into categories [10]. Binary categorization is one of the most fundamental software bug classifications, in which software code is labeled either buggy or clean. To identify software code as buggy or non-buggy, the proposed approaches use machine learning algorithms, discriminative words, and a fuzzy similarity metric with a user-defined threshold value. The researchers applied various techniques with different parameters to a Kaggle dataset. SLC classifiers outperformed the other classifiers in all aspects.
At the early stage of app development, a software bug prediction model improves essential qualities such as reliability, software quality, and efficiency, and reduces development cost [16]. In most software systems, which have become increasingly vast and sophisticated, bugs are a crucial barrier to system consistency and efficiency. In [16], four supervised machine learning algorithms, including LR, NB, and Decision Tree, are used to construct a model that predicts the occurrence of software bugs from historical data. The software metrics considered include dimension metrics, complexity metrics, object-oriented metrics, and Android-oriented metrics. Dimension metrics provide quantitative measures of software size, such as code size and modularity [16]. The metrics analyzed in this category are the Number of Byte-code Instructions (NBI), Number of Classes (NOC), Number of Methods (NOM), and Instructions per Method (IPM). The data were gathered from projects available on GitHub. The accuracy of the individual classifier models is low, and the models were not tested across multiple samples, so random sampling is likely to yield lower accuracy. Of the four algorithms, random forest provided the best results.
According to Khomh et al. [40], classes involving smells are changed more often than other classes. According to Olbrich et al. [41], smelly code components need more attention and exhibit different change behavior. Smells can be detected routinely with automated tools, which also make it possible to identify and analyze smells in massive code bases. As a result, a variety of closed-source and open-source smell detection tools have been developed. Even though various tools are available today, each captures only a subset of smells, and the smells detected by different tools overlap only slightly. No tool is designed to identify all smells [42], and no single tool can detect all of the smells we investigated. It is therefore impossible to tell which detection method is optimal for real-world systems. Hall et al. [42] studied the effect of five smells on the number of faults in three systems; the only conclusion that held consistently across all three systems was that Switch Statements had no effect on faults in any of them.
Khomh et al. [28] also found that classes with design flaws (''antipatterns'') are much more likely to contain bugs in the future. Although this research demonstrated the value of code smells in bug detection, the findings had yet to be incorporated into bug prediction models. Palomba et al. [22] evaluated the contribution of a measure of code smell severity by adding it to an existing bug prediction model and comparing the results of the new model with those of the baseline model. In that paper, ML classification methods assigned code to two classes: buggy and non-buggy. The classifiers they investigated included Multilayer Perceptron, ADTree, Naïve Bayes (NB), LR, Decision Table Majority, and Simple Logistic. They employed the intensity index defined by Fontana et al. [21]. The index is calculated by JCodeOdor, a code smell detector that applies detection techniques to metrics. JCodeOdor uses five descriptive threshold levels: VERY-LOW, LOW, MEAN, HIGH, and VERY-HIGH. The authors added the intensity index to the structural metrics of the source code and also examined its contribution to bug prediction techniques based on process metrics. In bug prediction models built on product metrics, process metrics, or a combination of both, the intensity index helps distinguish bug-prone code components affected by code smells.
Reference [23] defines technical debt as a situation in which software engineers accept sacrificing one dimension of a software product (namely, quality) to maximize another (e.g., delivering a set of new features before a deadline). Even if this sacrifice brings immediate rewards, the debt must eventually be paid off. Too much technical debt slows down development and makes code more difficult to maintain. Code smells are one type of technical debt. Ubayawardana and Karunaratn [1] used several source code metrics and code smell-based metrics to construct a bug prediction model. They trained the model on different versions of 13 open-source Java projects using NB, LR, and RF as candidate techniques. They demonstrated that, when paired with source code metrics, code smell metrics can significantly improve the accuracy of a bug prediction model. The RF-based model outperformed the other algorithms in terms of precision and accuracy within a version, within a project, and across projects. They employed two kinds of metrics for bug prediction: code metrics and process metrics. Process metrics are gathered from version control systems (VCSs) such as GitHub and issue-tracking systems such as Bugzilla, whereas code metrics are obtained from the source code. To improve traditional bug prediction methods, they incorporated smell-based measures.

III. PROPOSED METHODOLOGY
The research community has presented many bug prediction [1], [2], [3], [4] and classification [5], [6] models based on various indicators to identify the more error-prone modules in software applications. A few of them [7] achieve better accuracy and evaluation metrics than others. However, only a few authors [8] performed bug classification, and their models are not smell aware. This study uses different ML algorithms to perform smell-aware bug classification. Furthermore, we analyze the results of the algorithms against each other and compare their accuracy using different source code and smell-based metrics.
We propose five ML models, LR, RF, SVM, NB, and k-NN, to detect and classify bugs in a smell-aware manner. Our objective for these models is to achieve high accuracy. We carried out multiple stages to address the challenges of machine learning and to reach the highest possible accuracy for smell-aware bug classification. We also aim to investigate the reasons behind any lower accuracy of the ML models and to compare the results and performance of LR, SVM, RF, k-NN, and NB. Consequently, we analyze which machine learning approach is best for smell-aware bug detection and classification.
For our study, we used the dataset from Jureczko et al. [43], which is available from the PROMISE repository [44]. This dataset comprises 44 releases from 14 projects, each described by 20 code metrics. The occurrence of bugs in each release is also available. The dataset includes systems of various sizes and scopes, which strengthens the validity of our investigation [45]. Furthermore, we took into account the findings of Mende et al. [46], who found that models trained on limited datasets can yield unreliable performance estimates.
In this study, we use source code metrics to develop the first bug prediction model. To train this initial model, we incorporate the source code metrics discussed in the literature review in Section II. The primary objective of this study is to associate the code smell metrics proposed in [19] and [22] with these source code metrics. By associating these metrics, we aim to enhance the predictive power of our improved smell-aware bug prediction model. For bug prediction, we employ five classification models: RF, LR, SVM, NB, and k-NN. Through an extensive evaluation, we demonstrate the effectiveness of the metrics proposed in [19] and [22] in enhancing the predictive power of the developed model.
The proposed methodology for our models consists of several stages, as depicted in Figure 1. The process begins with the input of the dataset, which is then processed by the proposed machine learning techniques. These techniques analyze the dataset and classify the code as either buggy or non-buggy. Once the classification is complete, the next stage evaluates the performance of the proposed models, including a comparison of their results with those of existing models. This comparison allows us to assess the effectiveness and efficiency of the proposed models in bug detection and classification.

A. INPUT
The first step in our methodology is the input stage, where we gather data from different open-source projects containing bugs. These projects are obtained from the PROMISE bug repository and serve as the training data for our model. The details of the software projects dataset used in this study can be found in Section 3.3, and Table 1 presents the specific information about each project.

B. PROPOSED MACHINE LEARNING TECHNIQUES
In this study, we developed a smell-aware bug prediction model using various machine learning approaches. Our chosen learning style is supervised learning, so we focus on algorithms that support this type of learning. Supervised prediction tasks fall into two types: classification and regression. Classification maps input variables to discrete output values, while regression maps input variables to continuous output values. In our bug prediction model, the output is binary: a source code segment is categorized as either buggy or non-buggy. Consequently, we only explore methods that support binary classification, as this paper focuses specifically on the binary classification of bugs. For our investigation, we selected five classifiers commonly used in bug prediction research: LR, RF, SVM, k-NN, and NB. These classifiers are used to develop the smell-aware bug prediction model and evaluate its performance.
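To make the distinction between the two output types concrete, the sketch below derives a binary classification target from per-module bug counts. The rule that a module is buggy when it has at least one recorded bug is an assumption of this sketch, and the counts are illustrative values only.

```python
def to_binary_label(bug_count):
    """Map a bug count to a class label: 1 = buggy, 0 = non-buggy.

    Assumption: a module with at least one recorded bug counts as buggy.
    """
    return 1 if bug_count > 0 else 0


# Illustrative bug counts for five modules.
bug_counts = [0, 2, 0, 1, 5]

# A regression model would predict the counts themselves (continuous target);
# a binary classifier predicts the derived labels (discrete target).
labels = [to_binary_label(count) for count in bug_counts]
buggy_ratio = sum(labels) / len(labels)
print(labels, buggy_ratio)
```

The share of buggy modules is worth checking before training, because a strongly imbalanced label distribution makes accuracy alone a misleading measure.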

1) LOGISTIC REGRESSION
Logistic Regression (LR) is widely used in statistics and data mining, particularly for binary classification tasks, where statisticians and academic researchers employ it to analyze and classify binary and proportional response data. It models the relationship between a set of input variables and a binary outcome, has proven effective in a wide range of applications, and is often a go-to method for binary classification. References [47] and [48] provide further insight into the use of logistic regression in statistical analysis and binary classification. LR has several key advantages, including the ability to produce probabilities and to be extended to multi-class classification problems [49], [50]. Another advantage is that most LR model analysis methods are based on the same principles as linear regression [51]. LR is a supervised ML classification technique that works on a categorical dependent variable with two discrete values (0 or 1). The sigmoid function converts the model's real-valued prediction into a probability value between 0 and 1, as given in equation 1:

$$P(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

In the logistic sigmoid function, P(x) is a probability prediction with a value between 0 and 1, x is the input of the function (the algorithm's prediction value), and e is Euler's number, which has a value of about 2.71828.
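The logistic sigmoid described above can be sketched in a few lines of Python. The branch on the sign of x is an implementation choice of this sketch that avoids overflow for large negative inputs, and the 0.5 decision threshold is an assumption, not a value taken from the study.

```python
import math


def sigmoid(x):
    """Logistic sigmoid P(x) = 1 / (1 + e^(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) is at most 1
    return z / (1.0 + z)


def predict_label(score, threshold=0.5):
    """Turn a real-valued score into a class: 1 = buggy, 0 = non-buggy."""
    return 1 if sigmoid(score) >= threshold else 0


print(sigmoid(0.0), predict_label(2.0), predict_label(-2.0))
```

Both branches compute the same function; the second form is algebraically identical and only changes which exponential is evaluated.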
To predict bugs, a logistic regression (LR) machine learning model is utilized. Initially, the LR model is trained using data from fourteen open-source projects. Subsequently, the model is evaluated with test data to determine its behavior and achieve the highest possible accuracy. The LR model classifies the presence or absence of bugs in an application, assigning a category of 1 for true (buggy) and 0 for false (non-buggy). The pseudocode in Figure 2 describes the Logistic Regression procedure used to train and test the bug prediction model.
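As a minimal sketch of how the sigmoid output becomes a class label, the snippet below evaluates the logistic function on a weighted sum of metrics and thresholds the result at 0.5. The weights, bias, feature values, and the helper name `predict_label` are hypothetical illustrations, not values or code from this study.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid: maps any real value to a probability in [0, 1]."""
    # Two branches keep math.exp from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_label(weights, bias, features, threshold=0.5):
    """Return (probability, label), where label 1 = buggy and 0 = non-buggy."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    probability = sigmoid(z)
    return probability, 1 if probability >= threshold else 0


# Hypothetical weights for two metrics (illustrative values only).
prob, label = predict_label(weights=[0.8, -0.5], bias=-1.0, features=[3.0, 1.0])
print(round(prob, 3), label)
```

In a trained model the weights and bias would be estimated from the training data; only the final thresholding step is shown here.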

2) THE RANDOM FOREST ALGORITHM
Random Forest is an ensemble learning technique composed of n independent decision trees [49]. A single traditional classifier can yield low classification accuracy and is prone to overfitting, so many researchers have studied how to combine classifiers to improve accuracy. Random Forest is a combination algorithm built from a series of tree classifiers, where every tree casts a unit vote for the most common class, and the votes are then merged by majority to obtain the final result [52]. Random Forest has several attractive characteristics: it is reported to resist overfitting, to achieve good classification accuracy, and to tolerate outliers and noise [52]. Random Forest is widely used for classification and regression; our purpose is to use it for the binary classification of bugs. For classification, the RF output is the mode of the classes predicted by the individual trees.
As a result, Random Forest can be used in a variety of situations. When constructing a prediction, the RF algorithm collects the classifications reached in the terminal leaf nodes of multiple trees and takes the majority vote. Decision trees are essentially tree-like structures; the top node is called the root of the tree, and the data are split recursively at a series of decision nodes, starting from the root, until a terminal (leaf) node is reached [53]. The decision tree algorithm divides the dataset into smaller subsets using a top-down, "greedy" methodology. Entropy is calculated to determine which attribute to split on at each node. A tree-based learning method has the benefit of permitting the training of models on large datasets and on both quantitative and qualitative input variables. Furthermore, tree-based models may be resistant to redundant variables or variables with significant correlations that could cause overfitting in other learning algorithms [53]. Bagging is the process of randomly selecting samples with replacement; each such sample is used to train a new tree. Averaging the findings from the n trees reduces the variance and creates a smoother decision boundary [49]. For example, when using Random Forest for smell-aware bug classification, every tree gives an estimate of the likelihood that a code component belongs to a particular class (buggy or clean code).
The likelihood is then averaged over the n trees, and the class with the highest averaged likelihood becomes the estimated class label (Figure 3); in this way the RF algorithm identifies the buggy instances of the code. To further decrease the variance of the decision boundary, the trees should be as uncorrelated as possible. Figure 4 gives the pseudocode for RF formation, and Figure 5 gives the pseudocode for RF prediction.
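The bagging and averaging steps described above can be sketched as follows. This is an illustrative fragment under simplifying assumptions: the per-tree probabilities are hypothetical numbers rather than outputs of trained trees, and the function names `bootstrap_sample` and `forest_predict` are introduced here only for illustration.

```python
import random


def bootstrap_sample(rows, seed=0):
    """Bagging: draw len(rows) samples from rows with replacement."""
    rng = random.Random(seed)
    return rng.choices(rows, k=len(rows))


def forest_predict(tree_probabilities, threshold=0.5):
    """Average the per-tree estimates of P(buggy) and pick the class label.

    Returns (mean probability, label), where label 1 = buggy, 0 = clean.
    """
    mean_prob = sum(tree_probabilities) / len(tree_probabilities)
    return mean_prob, 1 if mean_prob >= threshold else 0


# Hypothetical P(buggy) estimates from five trees for one code component.
mean_prob, label = forest_predict([0.9, 0.6, 0.2, 0.7, 0.8])
print(round(mean_prob, 2), label)

# Each tree would be trained on its own bootstrap sample of the rows.
sample = bootstrap_sample(list(range(10)), seed=1)
```

One tree (0.2) disagrees with the others, but the averaged likelihood still favors the buggy class, which is how averaging dampens the variance of individual trees.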

3) SUPPORT VECTOR MACHINE
The Support Vector Machine (SVM) is a comparatively recent supervised machine learning technique [54]. Reference [55] presents an excellent overview of SVMs, and [56] is a more recent book on SVM. Vapnik and Cortes [57] developed this highly popular and powerful classification method. SVMs are maximum margin classifiers: they find the best separating hyperplane between two classes (see Fig. 5). Our problem is likewise a binary classification problem, and the PROMISE bug repository dataset is used for bug detection. Here, we cover only the basic principle of the method in the context of classification using supervised learning. To understand the nature of the SVM classifier, one needs to comprehend four main ideas: the separating hyperplane, the maximum-margin hyperplane, the soft margin, and the kernel function [58]. SVM analysis is divided into three phases: (i) feature selection, (ii) classifier training and testing, and (iii) performance evaluation. These phases are found in most machine learning approaches and are not exclusive to SVM. SVM works well for both linear and nonlinear datasets and performs well when the dataset has a large number of attributes. On large-scale datasets, however, it performs less well because the training time is longer. In a nutshell, SVM works on the fundamental principle of the "margin": a hyperplane separates the two data labels that lie on either side of it, and the aim is to maximize the margin, that is, the gap between the separating hyperplane and the nearest instances on both sides of it [59].
In Figure 6, the hyperplane (solid line) segregates dots from triangles, and the dotted lines running parallel to the solid line show how far the decision hyperplane can be moved without causing misclassification.
A separating hyperplane can be written as
$$W \cdot X + b = 0$$
where W = {w_1, w_2, ..., w_n} is the weight vector, n is the number of features, and b is a scalar (also referred to as a bias). The maximum margin hyperplane (MMH) can in turn be written as the decision boundary [60], [61]:
$$d(X^T) = \sum_{i=1}^{l} y_i \alpha_i X_i X^T + b_0 \tag{2}$$
where (a) y_i is the class label of support vector X_i, (b) X^T is a test tuple, (c) b_0 and α_i are numeric parameters, and (d) l is the number of support vectors [62]. The basic steps of the SVM classifier used in this study are specified in Figure 7.
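To make the roles of y_i, α_i, b_0, and the l support vectors concrete, the following sketch evaluates a linear-kernel decision function as the sum over support vectors of y_i · α_i · (X_i · X), plus b_0, and classifies a test tuple by the sign of the result. The two support vectors and the parameter values are a hypothetical toy problem chosen so the arithmetic can be checked by hand; they are not taken from this study, and the linear kernel is an assumption.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def svm_decision(support_vectors, labels, alphas, b0, x):
    """d(x) = sum_i y_i * alpha_i * (X_i . x) + b0 over the support vectors."""
    total = sum(
        y * alpha * dot(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return total + b0


def svm_classify(support_vectors, labels, alphas, b0, x):
    """Return +1 or -1 according to the side of the hyperplane x falls on."""
    return 1 if svm_decision(support_vectors, labels, alphas, b0, x) >= 0 else -1


# Toy problem (hypothetical): one support vector per class.
svs = [(1.0, 1.0), (3.0, 3.0)]
ys = [-1, 1]
alphas = [0.25, 0.25]
b0 = -2.0

print(svm_decision(svs, ys, alphas, b0, (3.0, 3.0)))  # on the positive margin
print(svm_classify(svs, ys, alphas, b0, (0.0, 0.0)))
```

With these values the support vectors sit exactly on the margins (decision values of +1 and -1), and any point between them with a decision value of 0 lies on the separating hyperplane.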

4) NAÏVE BAYES
The Naïve Bayes (NB) classification algorithm is based on Bayes' theorem and is favored when dealing with high-dimensional inputs. We use the R function's implementation of the Naïve Bayes classification algorithm. For each feature vector X = (x_1, x_2, x_3, ..., x_n), the Naïve Bayes classifier computes the likelihood of each class and then chooses the class with the highest likelihood value [63]. Naïve Bayes has been applied effectively to defect prediction in several research efforts, and in this study it is applied to software bug prediction as well. The NB approach in machine learning is particularly efficient. The NB model treats bug prediction as a binary classification problem: it trains on the historical data of software modules to construct a predictor, and the predictor is then used to determine whether a new module contains bugs or not. Equation 3 is Bayes' theorem, which is derived from conditional probability.
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \tag{3}$$
Here P(A) is the probability of A, P(B) is the probability of B, P(A|B) is the probability of A when B is given, and P(B|A) is the probability of B when A is given.
The PROMISE bug repository dataset is used in this study for binary classification (buggy and non-buggy data). The dataset has multiple independent features (the source code and code smell metrics), X = {x_1, x_2, x_3, ..., x_n}, where x_1 is feature one, x_2 is feature two, and so on, and one dependent feature Y = {0, 1}, where 1 means true (buggy) and 0 means false (non-buggy) code. Bayes' theorem therefore has to be adapted to this binary classification problem: we give all input features X = {x_1, x_2, x_3, ..., x_n}, predict the dependent feature y, and categorize the code as buggy or not. Equation 3 can be written as
$$P(y \mid X) = \frac{P(X \mid y)\,P(y)}{P(X)} \tag{4}$$
Since X = {x_1, x_2, x_3, ..., x_n} and the features are assumed to be conditionally independent given y, equation 5 follows from equation 4.
$$P(y \mid x_1, \ldots, x_n) = \frac{P(x_1 \mid y)\,P(x_2 \mid y) \cdots P(x_n \mid y)\,P(y)}{P(x_1)\,P(x_2) \cdots P(x_n)} \tag{5}$$
The denominator P(x_1) P(x_2) ... P(x_n) can be considered constant because it is the same for every record. So P(y | x_1, ..., x_n) is directly proportional to P(y) ∏_{i=1}^{n} P(x_i | y). To find the output for particular values of X = {x_1, x_2, x_3, ..., x_n}, we take the argmax of P(y) ∏_{i=1}^{n} P(x_i | y). This gives equation 6 for the NB classifier.
$$y = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y) \tag{6}$$
Argmax selects the class with the maximum likelihood. Suppose the likelihood for True is 0.7 and for False is 0.3; the classifier then selects 0.7, so the output for this particular feature vector X = {x_1, x_2, x_3, ..., x_n} is True (buggy).
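The argmax rule can be sketched in a few lines. The snippet multiplies the class prior P(y) by the per-feature likelihoods P(x_i | y), normalizes the two scores so they sum to 1, and returns the class with the larger value. The priors and likelihoods below are hypothetical numbers for illustration; in practice they would be estimated from the training data.

```python
from math import prod


def naive_bayes_predict(priors, likelihoods):
    """Pick the class maximizing P(y) * prod_i P(x_i | y).

    priors:      {class_label: P(y)}
    likelihoods: {class_label: [P(x_1 | y), ..., P(x_n | y)]}
    Returns (predicted label, normalized posteriors).
    """
    scores = {y: priors[y] * prod(likelihoods[y]) for y in priors}
    total = sum(scores.values())
    posteriors = {y: score / total for y, score in scores.items()}
    best = max(posteriors, key=posteriors.get)
    return best, posteriors


# Hypothetical values: 1 = buggy, 0 = non-buggy, two observed features.
priors = {1: 0.35, 0: 0.65}
likelihoods = {1: [0.6, 0.5], 0: [0.2, 0.3]}

label, posteriors = naive_bayes_predict(priors, likelihoods)
print(label, {y: round(p, 3) for y, p in posteriors.items()})
```

Normalizing is optional for the decision itself, since dividing both scores by the same constant does not change which one is larger; it only makes the two values readable as probabilities.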

5) k-NEAREST NEIGHBOUR (k-NN)
The idea behind Nearest Neighbor classification is simple: instances are classified according to the class of their closest neighbors. Because it is usually beneficial to consider more than one neighbor, the technique is more commonly known as k-Nearest Neighbor (k-NN) classification, in which the k nearest neighbors determine the class [64]. The algorithm needs the training samples at prediction time, so they must be kept in memory; for this reason it is also called a lazy learning approach. The main advantages of the k-NN classifier are its low training cost and its ease of interpretation, but the testing phase takes longer. The value of k is significant in the k-NN algorithm and is used to fine-tune it: when the value of k decreases, the model becomes less stable; conversely, when the value of k grows, the model becomes more stable [65]. As the number of samples rises, the k-NN algorithm becomes slower. To determine the distance between instances, the k-NN method employs the Euclidean distance formula.
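A minimal sketch of k-NN with Euclidean distance is shown below. The training points and labels are hypothetical two-dimensional examples, and the helper name `knn_predict` is introduced here only for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every stored training sample.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training data: label 0 = non-buggy, 1 = buggy.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8)]
labels = [0, 0, 0, 1, 1]

print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (8, 9), k=3))
```

Because every prediction scans all stored samples, the cost of a query grows with the size of the training set, which is the slowdown noted above.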

C. DATASET DESCRIPTION AND PRE-PROCESSING
Pre-processing of data is a crucial step in the data mining process, because raw training datasets are typically inadequate and often contain imperfections such as faults, outliers, missing data, and noise. Tools are necessary for detecting and correcting these issues.
To address these concerns, it is essential to evaluate the quality and accuracy of the data before conducting experiments. Pre-processing operations encompass data clean-up, data integration, data transformation, and data reduction [66]. These operations aim to improve the overall quality and reliability of the data, ensuring that subsequent analysis and modeling steps are based on accurate and consistent data.
Proper data preparation is needed to improve the accuracy of the trained model. Each file has an identical set of properties and is in Comma Separated Values (CSV) format. The attributes in a PDFSBP (PROMISE dataset for software bugs prediction) file are as follows. The source code metrics provided in a data file are project name, version, file name, bug count per file, isBuggyFile, WMC, DIT, NOC, CBO, RFC, DAM, MOA, MFA, LCOM, Ca, Ce, NPM, LOC, CAM, and ACC. The code smell-based metrics provided in a data file are ANA, ACM, ARL, and antipattern cumulative pairwise diversity (ACPD).
In addition to the metric information, each dataset has some metadata. Some of the attributes in each file have no effect on the bug prediction model.
For this study, we used data from publicly accessible data repositories. The datasets have therefore been stripped of project, file, and version names. A file is marked as "isBuggyFile" in the dataset if at least one problem has been reported against it in a certain version. Table 1 shows the datasets and versions of the projects that we collected for the bug classification.
In a subsequent version, the same file could be a non-buggy file. The number of problems reported against a file is also recorded.
The "bug count per file" attribute was removed because predicting the number of bugs associated with a file is outside the scope of this study. The classifiers require the data to be in numerical format, so we focused solely on numerical attributes when developing the model. Some attributes are strongly correlated with one another; when two attributes are highly correlated, one of them can be dropped.

D. DATA SPECIFICATION
The purpose of this investigation is to evaluate the model within a version, within a project, and across projects. Each dataset was divided into two parts: the model was trained using 70% of each dataset and then tested with the remaining 30%. The dataset division is illustrated in Figure 8.
For instance, the Apache Ant 1.7 version contains 745 instances of real data, from which we first removed 5 instances at random. The accuracy of the prediction was evaluated with the real dataset: 70 percent of the instances (518 instances) were used to train the model and the remaining 30 percent to test it. The Apache Ivy version 2 dataset is used for validation purposes in cross-project prediction. This is a new project with little historical information. Across all versions there were almost 16257 instances.
For training, we used 70 percent of the data (11380 instances), and 30 percent of the data (4877 instances) was used for testing the model.
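The split sizes quoted above can be reproduced with a short sketch. It shuffles the row indices with a fixed seed and cuts them at the rounded 70 percent mark; the function name `train_test_split_rows`, the seed, and the use of plain integer row identifiers are assumptions made for illustration, not the procedure used in this study.

```python
import random


def train_test_split_rows(rows, train_pct=70, seed=0):
    """Shuffle rows reproducibly and split them into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_pct / 100)
    return shuffled[:n_train], shuffled[n_train:]


# Apache Ant 1.7 after removing 5 of its 745 instances.
ant_train, ant_test = train_test_split_rows(range(740))
print(len(ant_train), len(ant_test))

# All versions of all projects combined.
all_train, all_test = train_test_split_rows(range(16257))
print(len(all_train), len(all_test))
```

A plain random split does not preserve the ratio of buggy to non-buggy samples in each part; a stratified split would be needed to guarantee that.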
The model's accuracy is also affected by the number of buggy samples in the dataset. The is-buggy attribute is a Boolean that specifies whether, in a certain version, a file is buggy (stated as "1") or not (stated as "0"). Ideally, the datasets should contain an equal number of true and false samples, but it is quite difficult to create a training dataset with a balanced number of buggy and non-buggy samples. Across all versions of all projects, 34.56 percent of the instances were reported as buggy.

E. DIFFERENT TYPES OF METRICS USED FOR BUG PREDICTION
The most significant source code and code smell metrics evaluated in the study are summarized in Table 2.

F. PERFORMANCE EVALUATION METRICS
The performance is assessed with several measures: the confusion matrix, precision, recall, F-measure, ROC curve, PR curve, and accuracy. This study considered two crucial factors: performance and effectiveness.

1) CONFUSION MATRIX
A typical machine learning approach for measuring the quality of an algorithm is to cross-classify the predicted and actual decision classes in a confusion matrix, also known as an error matrix [67]. The confusion matrix is the natural basis for calculating the accuracy and other evaluation metrics of RF, NB, LR, SVM, and most other classifiers. For binary classification, a confusion matrix is a table that holds the number of correct and incorrect predictions produced by a classification model, that is, all of a classifier's predicted and actual values. A classification model provides four different prediction outcomes.
•True positive (TP): Buggy instances predicted as buggy by the ML model.
•False positive (FP): Non-buggy instances predicted as buggy by the ML model.
•True negative (TN): Non-buggy instances predicted as non-buggy by the ML model.
•False negative (FN): Buggy instances predicted as non-buggy by the ML model.
Based on the above prediction outcomes, various evaluation metrics have been presented in the literature.
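The four outcomes can be counted directly from the actual and predicted labels. The following sketch assumes labels encoded as 1 (buggy) and 0 (non-buggy); the label lists are hypothetical, and the helper name `confusion_counts` is introduced here for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN, and FN for a binary classification task."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Hypothetical labels for six code components (1 = buggy, 0 = non-buggy).
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]

print(confusion_counts(actual, predicted))
```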

2) ACCURACY
Accuracy is a measure for evaluating the classification model. It is the number of correctly predicted values divided by the total number of predicted values. In binary classification, accuracy is calculated with the following equation [68].
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where "TP" is short for True Positives, "TN" is short for True Negatives, "FP" is short for False Positives, and "FN" is short for False Negatives.

3) THE PRECISION
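Precision is the fraction of relevant instances among the retrieved instances, that is, the share of instances predicted as positive that are actually positive. The following equation can be used to calculate it.
$$\text{Precision} = \frac{TP}{TP + FP}$$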

4) THE RECALL
Recall is the ratio of correctly predicted positive instances to all instances that are actually positive. The following equation can be used to calculate it.
$$\text{Recall} = \frac{TP}{TP + FN}$$

5) THE F1-SCORE
The F1-Score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. When the data distribution is imbalanced, the F1-Score is more informative than accuracy. The F1-Score can be calculated with the equation below.
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

6) THE ROC CURVE
The ROC curve shows how well a classification model works across all classification thresholds. On imbalanced datasets, the AUC and F-measure are typically used to assess classifiers. The area under the ROC curve, which lies in the interval [0, 1], summarizes the relationship between the true positive rate (TPR) and the false positive rate (FPR). The curve displays two values:
• True Positive Rate
• False Positive Rate
Recall and TPR are the same quantity. TPR is defined as follows:
$$TPR = \frac{TP}{TP + FN}$$
The False Positive Rate is defined as follows:
$$FPR = \frac{FP}{FP + TN}$$
An ROC curve plots TPR against FPR at several classification thresholds. As the classification threshold is lowered, more instances are classified as positive, leading to an increase in both False Positives and True Positives. The points of an ROC curve, and hence the AUC, can be computed with an efficient, sorting-based technique.
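The sorting-based computation can be sketched as follows: instances are ranked by decreasing score, the threshold is lowered one score at a time, and each step adds an (FPR, TPR) point; the area is then obtained with the trapezoidal rule. The labels and scores are hypothetical, the function names are introduced for illustration, and the sketch assumes that both classes are present in the data.

```python
def roc_points(actual, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold stepwise.

    actual: labels, 1 = positive (buggy), 0 = negative (non-buggy).
    scores: classifier scores, higher means more likely positive.
    Assumes at least one positive and one negative label.
    """
    positives = sum(actual)
    negatives = len(actual) - positives
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a curve given as ordered (x, y) points (trapezoidal rule)."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels and scores for six code components.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
print(points[-1], round(auc(points), 3))
```

In this example one buggy component is scored below one non-buggy component, so 8 of the 9 positive-negative pairs are ranked correctly and the area is 8/9.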

7) AREA UNDER ROC CURVE (AUC)
The AUC of the ROC curve is one of the most important metrics used to measure classifier performance. The ROC curve is a graphical tool for assessing the performance of binary classifiers, and it allows FPR and TPR to be combined into a single metric. TPR and FPR are calculated at separate thresholds and then plotted in a graph, with FPR values on the abscissa and TPR values on the ordinate. The resulting curve is termed the ROC curve, and the metric we take into consideration is the Area Under the Curve (AUC). The greater the AUC, the better the model.

8) PRECISION-RECALL CURVE (PR)
The PR curve displays the trade-off between recall and precision for various thresholds; precision is plotted against recall. High precision corresponds to a low false-positive rate and high recall to a low false-negative rate, so a large area under the curve indicates both high recall and high precision. Like the ROC curve, the PR curve is a graphical tool for comparing the performance of binary classifiers, and it is sometimes more suitable than the ROC curve. The ROC curve gives an idea of how the classifier behaves overall, and it weighs the positive and negative classes equally. On an imbalanced dataset, the PR curve is more informative than the ROC curve when assessing binary classification, because FPR depends on the number of true negatives, whereas the PR curve does not consider true negatives at all. The visual appearance of the curves is a significant difference between ROC space and PR space: viewing PR curves can reveal differences between algorithms that are not visible in ROC space.

IV. EXPERIMENTAL RESULTS AND DISCUSSIONS A. SYSTEM SPECIFICATION
The simulations were run on an HP desktop with a 4th-generation Intel Core i5 processor at 2.6 GHz, 8 GB RAM, a 250 GB SSD, a 500 GB HDD, and the 64-bit Windows 10 operating system. Python 3.9 is used for the smell-aware bug classification simulation, and Jupyter Notebook is used for code execution. The libraries used are Scikit-Learn for machine learning and Seaborn and Matplotlib for visualization.

B. DATASET DIVISION
The dataset is divided into several train and test ratios, as shown below.

The results of the proposed models are discussed briefly in this section, together with the evaluation outcomes of the bug classification model against various evaluation metrics. The experiment for smell-aware bug classification was performed using the PROMISE bug prediction dataset, and the classification results obtained for the various classifiers are shown in Table 4.

D. PROPOSED MODEL CONFIGURATION AND RESULT
The PROMISE bug prediction dataset is used for training, testing, and evaluation of the model. The process of training, testing, and evaluating the model on different datasets is shown in Figure 1. The following subsections discuss the entire experimentation.

E. EXPERIMENT WITHIN A VERSION
In the within-version experiments, the dataset of a specific version is divided according to several train and test ratios, and the model is trained accordingly.

1) FIRST EXPERIMENT WITHIN A VERSION FOR (90-10) % RATIO
We experimented on Apache Ant version 1.7, a PROMISE repository project dataset for bug classification, for the buggy and non-buggy classes. In the first step, the dataset is divided into two parts: 90% for training the model and 10% for testing it. Among all trained models, the confusion matrix of the best model is illustrated. Figure 9 shows the confusion matrix of the buggy and non-buggy classes for the SVM classifier, that is, the number of correct and incorrect predictions it made. The predicted class is shown in the columns, whereas the actual class is shown in the rows.
The confusion matrix can be used to derive performance indicators such as accuracy, precision, recall, and F1-score. TPR (True Positive Rate) and FPR (False Positive Rate) are used to evaluate the efficiency of the proposed model. Numbers on the matrix diagonal are correct predictions, while values off the diagonal are incorrect predictions. In brief, the RF and SVM models have the same accuracy, classifying the code into buggy and non-buggy classes with 100% accuracy. The ROC and PR curves describe the performance of a classification model at the 90-10% ratio for bug classification; they are two graphical tools used for comparison in binary classification. Figure 10 is a combined ROC curve for all the evaluated models. AUC indicates how well a model can distinguish between classes. The PR curve shows the trade-off between precision and recall. Figure 11 presents the PR curve, with recall on the x-axis and precision on the y-axis. High recall corresponds to a low FN rate, while high precision corresponds to a low FP rate; the curves indicate that the SVM and RF models are very precise and gave the best classification of code as buggy or non-buggy. The performance metrics of the SVM, RF, NB, LR, and k-NN classifiers for the Apache Ant version 1.7 dataset are shown in Table 3. As presented in Table 4, the SVM and RF algorithms achieved an accuracy of 100%. The LR classifier gives 99% accuracy, and the accuracy of the k-NN classifier is 89%. The RF and SVM models showed an equal F1_score of 1.0, whereas the LR model showed an F1_score of 0.99 and the k-NN classifier had the lowest value among all classifiers, 0.55.
The ROC curve is constructed from two parameters: (1) the True Positive Rate and (2) the False Positive Rate. As shown in Figure 10, every classifier has its own ROC curve. Each ROC space is defined by TPR (also called sensitivity) on the y-axis and FPR (equal to 1 − specificity) on the x-axis. The optimal prediction model would provide a point at coordinate (0,1) in the upper left corner of the ROC space, corresponding to 100% sensitivity (no false negatives) and 100% specificity (no false positives). In Figure 10, the RF and SVM classifiers have 100% sensitivity and 100% specificity, and they have the largest AUC of ROC. The RF and SVM algorithms thus performed perfect classification and outperform the other models. The k-NN model has the worst ROC curve.
In Figure 11 the PR curve is plotted; it is another good evaluation tool for imbalanced data. The area under the PR curve is optimal for the SVM and RF models.

3) SECOND EXPERIMENT WITHIN A VERSION
The bug prediction experimentation was conducted on the Apache Ant version 1.7 dataset for both buggy and non-buggy classes.In the first step, the dataset was divided into two parts: 70% for training the model and 30% for testing the model.The comparison was performed using five different ML classifiers: SVM, RF, LR, NB, and k-NN classifiers.Among these classifiers, the RF and SVM classifiers yielded the best results, while the k-NN classifier produced the worst output.
Figure 12 shows the confusion matrix of the LR classifier for the buggy and non-buggy classes. The predicted class is shown in the columns, while the actual class is shown in the rows. The matrix can be used to derive performance indicators such as accuracy, precision, recall, F1-score, TPR, and FPR to evaluate the efficiency of the proposed model. Numbers on the matrix diagonal are correct predictions, while values off the diagonal are incorrect predictions. Of the 174 instances of class 0 (non-buggy), 173 are predicted correctly and 1 is wrongly predicted as positive. Correspondingly, all 50 instances of class 1 (buggy) are predicted correctly, with none wrongly predicted as negative.
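As a worked check of how the evaluation metrics follow from these counts, the sketch below computes accuracy, precision, recall, and F1-score from TP = 50, FP = 1, TN = 173, and FN = 0, the values read from the confusion matrix described above. The function name `classification_metrics` is introduced here for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts for the LR classifier at the 70-30% ratio (Figure 12).
metrics = classification_metrics(tp=50, fp=1, tn=173, fn=0)
print({name: round(value, 2) for name, value in metrics.items()})
```

Rounded to two decimals, these are a precision of 0.98 and a recall of 1.00, in line with the values reported for the LR classifier.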

a: PERFORMANCE ASSESSMENT OF SMELL-AWARE BUG CLASSIFICATION ON RF, SVM, LR, NB, AND K-NN CLASSIFIERS ROC AND PRC FOR BUGGY AND NON-BUGGY
The performance of a classification model at the 70-30% ratio for bug classification is described by the ROC and PR curves, two evaluation tools used for the performance assessment of binary classifiers. Figure 13 is a combined ROC curve for all the evaluated models; it shows how well each model can differentiate between the classes. Each ROC space is defined by TPR on the y-axis and FPR on the x-axis. In Figure 13, the RF and SVM classifiers have 100% sensitivity and specificity and the largest AUC, so they classify better than the other models. The AUC-ROC value of the LR classifier is also outstanding, at 0.997. The AUC of the ROC curve for the k-NN model is comparatively poor.
In Figure 14 the PR curve is plotted; the AUC-PRC of the SVM and RF models is 1, which denotes high precision and high recall. AUC-PRC is another good evaluation metric for imbalanced data. High recall corresponds to a low FN rate and high precision to a low FP rate; the curves indicate that the SVM and RF models are very precise and gave the best classification of code as buggy or non-buggy. The AUC-PR value is 0.98 for the LR classifier and 0.86 for the k-NN classifier. Table 4 gives the other performance assessment parameters, namely accuracy, precision, recall, and F1_score, assessed for the buggy and non-buggy classes. As Table 4 shows, all trained models gave very good results except the k-NN classifier; a likely reason is the integration of the code smell metrics with the source code metrics in the dataset. As shown in Table 5, the SVM and RF algorithms give very precise outcomes, with both precision and recall equal to 1.00. It is also notable that the LR classifier has a precision of 0.98 and a recall of 1.00. By contrast, the k-NN classifier has the lowest values among all algorithms, with a recall of 0.54 and an F1_score of 0.7.

4) THIRD EXPERIMENT WITHIN A VERSION
This experiment classifies the buggy and non-buggy classes using a 50-50% train-test ratio.
Figure 15 shows the confusion matrix of the NB model for the buggy and non-buggy classes. The predicted class is shown in the columns, whereas the actual class is shown in the rows. Numbers on the matrix diagonal are correct predictions, while values off the diagonal are incorrect predictions. Of the 300 instances of class 0 (non-buggy), 294 are predicted correctly and 6 are wrongly predicted as positive. Correspondingly, all 73 instances of class 1 (buggy) are predicted correctly, with none predicted incorrectly.

a: PERFORMANCE ASSESSMENT OF SMELL-AWARE BUG CLASSIFICATION ON RF, SVM, LR, NB, AND K-NN CLASSIFIERS ROC AND PRC FOR BUGGY AND NON-BUGGY
The performance of a classification model at the 50-50% ratio for the buggy and non-buggy classes is described by the ROC curve and the PR curve. Figure 16 is a combined ROC curve for all the evaluated models, in which each ROC space is defined by TPR on the y-axis and FPR on the x-axis. The recall of the SVM and RF algorithms is better than that of the LR, NB, and k-NN algorithms at every cut-off. The RF and SVM classifiers have 100% sensitivity and specificity and the largest AUC. The AUC-ROC value of the LR classifier is also outstanding, at 0.98. The AUC-ROC is 0.97 for the NB model and 0.86 for k-NN. The PR curve is plotted in Figure 17; the AUC-PR value of the SVM and RF models is 1, which denotes high precision and high recall and signifies that these models are very precise. The AUC-PR value is 0.98 for the LR classifier and 0.86 for the k-NN classifier.
Table 5 shows the complete performance report of the k-NN, NB, LR, SVM, and RF models. In this experiment, SVM and RF give equal values of the performance assessment metrics. The overall best accuracy is 100%, obtained by the SVM and RF classifiers at the 50-50% ratio, as shown in Table 6. The RF and SVM models have an equal F1_score of 1.0, whereas the LR model has an F1_score of 0.99 and the k-NN classifier has a low F1_score of 0.63. The k-NN classifier gives a better F1_score with the (50-50) % division of the dataset than with the (70-30) % division into train and test samples.

The outcome of the Apache Ant project is given in Table 6.
As findings demonstrated in Table 7 conclude, generally all models have shown a good performance of the evaluation and RF, SVM, and LR algorithms provide the most accurate result the project and give an equal output of the performance evaluation metrics.The main reason can be the dataset is balanced by SMOTE Technique and also the number of samples is more than within a version, and the dataset has a good number of buggy instances.
The F2-measure, which is the chosen metric to assess the model is excessively high for all the classifiers, we have seen improvements in the performance of some models (see Table    performance of a classification model within the project for buggy vs non-buggy classes is depicted by the ROC curve in Figure 18.The PR curve is drawn in Figure 19, in which the SVM, RF, and LR model's AUC-PR curve value is 1, which denotes high precision and high recall.This signifies that SVM, RF, and LR models are very precise.
The AUC-PR value is 0.99 for the NB classifier. The k-NN classifier's AUC-PR value is 0.96, which illustrates that k-NN gives better results within the project than within a version.
As shown in the results, performance varies across datasets; e.g., the k-NN classifier has a range of 73% to 97% over the 14 datasets. This can occur due to sample overlapping, noise interference, and blindness of neighbor selection during balancing; the size of the dataset also has a large impact on the training of the model. For instance, the Apache Forrest dataset has a total of 61 samples, which is insufficient to train the model accurately. Notably, the performance evaluation metrics for some datasets are extremely high, for example for the Apache Synapse and Apache Camel datasets; the main reason may be that these datasets contain enough samples to train the model well.
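The F1 and F2 measures used in this section are both instances of the general F-beta score, which weights recall beta times as heavily as precision. The following is a minimal sketch of that standard formula; the confusion-matrix counts in the example are hypothetical and are not results from the study.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts.

    beta = 1 gives the F1 score; beta = 2 gives the F2-measure,
    which emphasises recall (sensitivity) over precision.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 80 buggy classes found, 20 false alarms, 10 missed.
f1 = f_beta(tp=80, fp=20, fn=10, beta=1.0)
f2 = f_beta(tp=80, fp=20, fn=10, beta=2.0)
```

With these counts recall exceeds precision, so the F2-measure comes out higher than the F1 score; a model that misses many buggy classes would be penalised more strongly under F2.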

V. ACROSS THE PROJECTS
All versions of all 14 projects were used to train the model. Table 9 presents the evaluation results of the designed models.
To validate the trained model, the second version of Apache Ivy is used. The training dataset contained 16257 samples, of which 34.56% (5620 instances) were buggy. Because only about 34% of the samples in the cross-project training set are buggy, the accuracy of cross-project prediction is low compared to within-version and within-project prediction.
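To put the class mix of the cross-project training set in context, the short sketch below recomputes the buggy share from the reported counts and derives the accuracy that a trivial classifier would obtain by always predicting the majority class. The helper `class_balance` is a hypothetical name, and the baseline assumes evaluation data with the same class mix, which the study does not state.

```python
def class_balance(total, buggy):
    """Return (buggy share, non-buggy share) of a labelled dataset."""
    if total <= 0 or not 0 <= buggy <= total:
        raise ValueError("counts must satisfy total > 0 and 0 <= buggy <= total")
    buggy_share = buggy / total
    return buggy_share, 1.0 - buggy_share


# Counts reported for the cross-project training set.
TOTAL_SAMPLES = 16257
BUGGY_SAMPLES = 5620

buggy_share, non_buggy_share = class_balance(TOTAL_SAMPLES, BUGGY_SAMPLES)

# A classifier that always answers "non-buggy" would reach this accuracy on
# data with the same class mix, so it serves as a floor for comparison.
majority_baseline_accuracy = max(buggy_share, non_buggy_share)

print(f"buggy share: {buggy_share:.2%}")
print(f"majority-class baseline accuracy: {majority_baseline_accuracy:.2%}")
```

Any cross-project accuracy is most informative when read against this floor rather than against zero.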
In this experiment, the NB algorithm gave poor results compared to the k-NN algorithm, whereas within the version and within the project NB gave better results than k-NN. For a better understanding and illustration of the models, the ROC and PR curves follow.
Figure 20 is a combined ROC curve for all the trained models across the projects. Each ROC space is defined by TPR on the y-axis and FPR on the x-axis. In Figure 20, the area under the RF classifier's ROC curve is 1 and its curve lies above those of the other classifiers. Therefore, across the projects, the recall and cut-off of the RF algorithm are always better than those of the LR, SVM, NB, and k-NN algorithms. The RF classifier has 100% sensitivity and % specificity, and it has a very large AUC. The LR classifier's AUC-ROC value is also outstanding, i.e., 0.988.
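As an illustration of how a ROC curve and its AUC are obtained from classifier scores, the following sketch sweeps the decision threshold from high to low, records one (FPR, TPR) point per distinct score, and integrates with the trapezoidal rule. The labels and scores are hypothetical toy values, not outputs of the models in this study.

```python
def roc_points(y_true, scores):
    """Return the (FPR, TPR) points of the ROC curve.

    y_true holds 1 for buggy and 0 for non-buggy; scores are the
    predicted probabilities (or any ranking score) of the buggy class.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample sharing this score so ties give a single point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and scores for six classes.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
roc_auc = auc(roc_points(y_true, scores))
```

An AUC of 1 arises only when every buggy class is scored above every non-buggy class, which is the situation reported for the RF classifier.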
In Figure 21, the PR curve is plotted, showing that the RF model's AUC-PR value is 1, which denotes high precision and high recall. This implies that the RF model is very accurate. The AUC-PR value is 0.988 for the LR classifier and 0.845 for the k-NN classifier.
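One common way to summarise a PR curve as a single number is average precision, i.e., the mean of the precision values measured at each rank where a buggy class is retrieved. The sketch below uses this estimator as a stand-in for AUC-PR; whether the study computed AUC-PR this way is an assumption, and the labels and scores are hypothetical. The sketch also assumes that all scores are distinct.

```python
def average_precision(y_true, scores):
    """Average precision: step-wise area under the precision-recall curve.

    Assumes distinct scores; y_true holds 1 for buggy and 0 for non-buggy.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("at least one positive sample is required")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            precision_sum += tp / rank  # precision at this recall step
    return precision_sum / total_pos


# Hypothetical labels and scores for six classes.
pr_labels = [1, 1, 0, 1, 0, 0]
pr_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
ap = average_precision(pr_labels, pr_scores)
```

A value of 1 requires that every buggy class is ranked ahead of every non-buggy class, so precision stays at 1 for all recall levels.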

A. COMPARISON OF THE PROPOSED MODEL WITH EXISTING MODELS:
This part presents a performance comparison of the proposed model with existing models, for which we selected five ML classifiers: NB, RF, SVM, k-NN, and LR. This study explores various evaluation metrics of the five models within the version, within the project, and across the projects. The results of this comparison in terms of accuracy, precision, recall, F1 score, AUC-ROC, and AUC-PR are listed in Table 9.
Table 9 shows that the proposed classifiers exceed the existing classifiers on all six evaluation metrics.

VI. LIMITATIONS OF THE STUDY
In the field of software engineering, various tasks can be formulated as learning problems and solved using machine learning algorithms. However, in the software engineering domain the source of training data is source code, and the majority of the datasets are based on Java code. Therefore, in this study, only software systems coded in the Java programming language are evaluated for code smell prediction. Moreover, the scope of our smell extraction was restricted to open-source Java projects exclusively from the Apache repository; this limited selection could affect the generalizability of our results.

VII. CONCLUSION AND FUTURE WORK
Software defects are called bugs in a software development process: unanticipated behaviors identified by quality control engineers during application testing and recorded as software bugs. Bugs have a strong effect on software quality. The process of bug fixing is exceptionally slow and time-consuming. Therefore, it is crucial to detect bugs automatically.
The primary goal of this investigation is to create a smell-aware bug prediction model by using code smell as a candidate metric. To be smell-aware, we added an intensity index to the dataset. The results showed that using the intensity index as a predictor improves the accuracy of the bug prediction model. Furthermore, the data show that the severity index is more significant than any other quality metric in predicting the bug-proneness of smelly classes. The findings suggest that using the severity index as a reliable indicator of buggy modules improves the effectiveness of structurally based baseline models for bug prediction. They also emphasize the significance of the intensity of code smells in process metrics-based prediction approaches.
We provided empirical evidence in this study that code smell-based metrics are quite useful in bug prediction. Using several source code metrics and code smell-based metrics proposed in the literature, we constructed a bug prediction model. To create the model, we employed the k-NN, NB, RF, SVM, and LR algorithms. Multiple versions of fourteen different open-source projects were used to train the bug prediction model. We examined how our bug prediction model behaved within the version, within the project, and across the projects.
To emphasize, the following are the main conclusions from our research:
• Using only source code metrics to anticipate project issues is insufficient.
• When code smell-based metrics are combined with source code metrics, accuracy and F1 score can be improved.
• Compared to the other algorithms, the RF and SVM algorithms demonstrated the best results in terms of accuracy.
• The presence of a large amount of numerical/categorical data, as well as training with a growing number of samples, might be the primary factors behind Random Forest's superior performance.
• The main reason behind the good results of the SVM algorithm might be that it works effectively on fairly large and complex linear and non-linear datasets.
• Our features are not independent of each other, which is why Naïve Bayes did not perform well in the study.
• Code smell-based metrics can be used to accurately forecast bugs across projects. When there were fewer buggy cases in the system, we were able to get more accurate findings.
In future work, we would like to evaluate the performance of the model on other programming languages as well. Furthermore, we will undertake additional research on the attributes of the intensity index in the context of multi-class bug classification, based on samples collected from different languages within a version, within a project, and across projects.
APPENDIX A

FIGURE 3. ADTrees ensemble technique for bug classification in RF.
30 percent of instances (222 instances) were left for testing of the model. Validation was performed on unseen data from Eclipse and ArgoUML. For the within-project setting, we looked at 5 distinct versions of Apache Ant, i.e., 1.3, 1.4, 1.5, 1.6, and 1.7. All versions of the project are used for training the model. There were 1692 instances across all versions of the Apache Ant project. The dataset contained 1184 training records (70 percent) and 507 testing records (30 percent).

2) PERFORMANCE ASSESSMENT OF SMELL-AWARE BUG CLASSIFICATION ON RF, SVM, LR, NB, AND K-NN CLASSIFIERS CONCERNING ROC CURVE FOR BUGGY AND NON-BUGGY CLASSES

FIGURE 13. ROC curve within a version of the dataset on a 70-30 % ratio.

FIGURE 16. ROC curve within a version of the dataset on a 50-50 % ratio.

F. WITHIN THE PROJECT
This part covers bug prediction in the within-project context and specifically does not give evidence on how the model performs within a single version of the project.
7) both sensitivity and F2-measures values have improved suggesting a better distinction between the two classes and decreases of the false positive's predictions.The

FIGURE 20. ROC curve across the project.

TABLE 2. Source code and smell code-based metrics.

TABLE 3. Apache Ant 1.7 within version model.

TABLE 5. Within version performance evaluation metrics.

TABLE 6. Apache Ant within-project model measures.

TABLE 7. Apache Ant across project model metrics evaluation report.

TABLE 8. The comprehensive result of the project.

TABLE 9. Comparison of our proposed model with other classifiers in the literature.