Introduction
Software testing and maintenance are among the most critical phases of software development, and bug reports play a vital role in these activities [1], [2]. A bug report is generated by the software quality assurance team while testing software modules. It contains detailed information about a specific component or problem that needs to be fixed [3]–[5]. The information in a bug report covers many kinds of issues, such as feature requests, functionality enhancement requests, code errors, logical errors, and compatibility issues. The report consists of several fields, including priority, summary, description of the affected component, and open/close status [6], [7]. However, the major problem encountered during the analysis of bug reports is that the information is written in natural language, which makes it difficult to process and extract automatically. It requires a tedious effort from the development team to understand and address the reported issues [8]–[12]. Many studies address issues related to bug reports [13]. These include bug categorization [14]–[19], bug prioritization [20]–[23], bug localization [24], bug assignment [25], bug classification [10], [26], [27], bug severity prediction [28], and bug report summarization [29]–[31].
Bug categorization and bug prioritization remain the most important pieces of information required from any bug report. Most studies use supervised machine learning algorithms to automate the information extraction process [32]. With these algorithms, a classification model is trained on manually labeled bug reports and is then used to automatically categorize and prioritize new bugs with pre-defined labels. Supervised learning techniques need a large labeled dataset, which is not easily available; in most of the available datasets, category and priority information is missing. Furthermore, most of the available research focuses on one problem at a time, i.e., either automating bug categorization [11], [33] or bug prioritization [34], [35]. Consequently, very limited work has been done on categorizing and prioritizing bug reports simultaneously [36]. Therefore, there is a need for a framework that automates both bug categorization and bug prioritization at the same time.
This research is motivated by the above-mentioned requirements. Its key objective is to develop a framework that automatically categorizes and prioritizes each issue in a bug report. We propose and implement an automated framework for categorizing and prioritizing bug reports called CaPBug. CaPBug uses NLP and machine learning algorithms to categorize and prioritize bug reports based on their textual and categorical features. A baseline corpus is built using the XML files of the Mozilla and Eclipse repositories. Different NLP techniques are applied to the bug reports’ textual descriptions to create a feature vector set. Afterwards, the Term Frequency–Inverse Document Frequency (TF-IDF) feature extraction method is used to extract relevant and important words from the feature vector set. Due to the imbalanced nature of the priority classes, the Synthetic Minority Over-sampling Technique (SMOTE) is used to oversample the records. Finally, four machine learning algorithms, i.e., Naive Bayes (NB), Random Forest (RF), Decision Tree (DT) and Logistic Regression (LR), are used to train models that predict the category and priority of bug reports.
Below are the major contributions of this research.
We created a baseline corpus with six labeled bug categories and five priorities using two online bug repositories, Eclipse and Mozilla, that are available on Bugzilla. A labeled dataset with predefined categories for bug reports from 2016 to 2019 is not publicly available.
We proposed and implemented a framework named CaPBug for categorization and prioritization of bug reports using NLP and supervised machine learning. The novel contribution of this research is that it addresses the need for both automated bug categorization and prioritization.
We applied SMOTE to address the class imbalance problem and to improve the accuracy of the model that prioritizes bug reports. Limited work has been done using SMOTE to adjust the number of bug reports at each priority level so that the model can accurately predict the priority of bug reports.
We performed extensive experiments with a recent dataset comprising reports from 2016 to 2019. They include a comparison of textual and categorical features for categorizing and prioritizing bug reports using four machine learning algorithms.
Table 1 summarizes our experiments and the corresponding section number in which results are presented.
We anticipate that our work will be useful for the community in automating the categorization and prioritization of bug reports, which will be beneficial in the maintenance and debugging of large software projects.
The remainder of this paper is organized as follows. Section II presents a literature review of bug categorization as well as bug prioritization. Section III introduces the proposed methodology of the CaPBug framework. Next, Section IV discusses the results obtained after training and testing the CaPBug framework. Finally, Section V concludes the research.
Literature Review
Researchers have addressed various aspects of automated software bug management, classification and prioritization. These include automation of bug assignment, duplicate or similar bug detection, bug fixing time prediction, bug localization, bug categorization, and bug severity and priority prediction. Z. Weiqin et al. [36] conducted a survey of 327 participants to gain insight into bug management techniques and confirmed that these techniques play an important role in improving the automatic management of bug reports.
Y. Tan et al. [37] proposed a novel approach for predicting severity. They linked posts on Stack Overflow to the contents of Mozilla, Eclipse, and GCC bug reports. Three classification algorithms, K-Nearest Neighbor (KNN), Naive Bayes and Long Short-Term Memory (LSTM), were used to predict the severity of bugs. The experiments showed increases of 23.03%, 21.86%, and 20.59% in the average F-measure for Mozilla, Eclipse, and GCC, respectively, with the proposed method.
R. Chen et al. [38] implemented an improved SMOTE technique called Rectangle SMOTE (RSMOTE) to avoid poor severity prediction performance. Due to the class imbalance problem in bug report datasets, RSMOTE was used to balance the class sizes. Furthermore, repeated sampling was used to avoid indeterminate results caused by over-sampling of records and to obtain multiple balanced datasets. An ensemble approach, named Fusion of Multi-RSMOTE with Fuzzy Integral (FMR-FI), was then used to integrate the classifiers trained on these multiple balanced datasets. Four evaluation metrics were used to evaluate the performance of the FMR-FI algorithm, namely accuracy, precision, recall and f1-score. The results show that the FMR-FI algorithm with RSMOTE worked well to improve classifier performance for severity prediction.
Y. Xiao et al. [39] proposed an enhanced Convolutional Neural Network (CNN) based model called DeepLoc for automated bug localization. Built on a CNN, DeepLoc represents the features of bug reports and source files using word embedding techniques. The experiments were performed on 18,500 bug reports from 2001 to 2014 extracted from five projects: AspectJ, Eclipse UI, JDT, SWT and Tomcat. DeepLoc’s performance was compared with four bug localization approaches: BugLocator, LR+WE, HyLoc and DeepLocator. The results of the experiments show that with DeepLoc, Mean Average Precision (MAP) improved from 10.87% to 13.4% for bug localization compared to a traditional CNN.
To improve automatic bug assignment, R. Shakripur et al. [25] suggested a time-based approach named ABA-TF-IDF using the time TF-IDF weighting technique. The data was collected from the software repository of a Version Control System (VCS), where changes to the source code are managed and other project facts are documented. Four machine learning algorithms, i.e., Support Vector Machine (SVM), Naive Bayes, Vector Space Model (VSM) and Smooth Unigram Model (SUM), were used to train the model. The results show that the proposed approach performed well, with a Mean Reciprocal Rank (MRR) of up to 11.8% and 8.94%.
The focus of this research is to automate the process of bug categorization as well as bug prioritization. Therefore, the literature review is presented in two parts. The first part gives a comprehensive overview of studies related to bug categorization, and the second part explores the work that has been conducted on bug priority. The previous studies in each group are discussed below.
A. Bug Categorization
Bug categorization is the process of automatically labeling bug reports with their relevant category. N. Limsettho et al. [14] proposed a model to automatically categorize bug reports using clustering and Hierarchical Dirichlet Process (HDP) techniques with NLP chunking. The clustering algorithms X-means and Expectation Maximization (EM) were implemented using Weka 3.6. Two experiments were conducted on the online bug reports of the Lucene, Jackrabbit (JCR) and HttpClient projects and evaluated using cluster purity/accuracy and f1-score. The clustering results were compared with two classification methods, J48 and Logistic Regression. The results demonstrated that X-means performed well, with high cluster purity/accuracy and f1-score. The comparison also suggests that logistic regression may perform better with a supervised learning approach.
Labeled Latent Dirichlet Allocation (LLDA) based topic modeling was implemented by M. F. Zibran [15] for classifying bug reports. These reports were collected from the online projects Eclipse, GNOME and Python. The dataset comprises 1,138 bug reports, from which 428 reports were selected. The results show that precision, recall, and f1-score improved considerably when LLDA is trained on a larger corpus.
N. Limsettho et al. [16] extended their work [14] and proposed an automated framework that requires no labeled data, using topic modeling and clustering techniques to categorize bug reports. In addition, NLP chunking was used to automatically label each cluster from its top words. To solve the term-labeling problems of previous studies, a weighted-reduction algorithm was chosen to provide a variety of words. Five experiments were conducted on a dataset comprising three online projects: Lucene, Jackrabbit (JCR) and HttpClient. The results showed that the topic model performed well with a higher average f1-score, and that the performance of the proposed framework without labeled datasets is better than that of models trained on labeled projects. Phrase-level labeling via NLP chunking provided high-quality labels related to the bugs.
C. Zhou et al. [17] proposed a new approach called Bug Named-Entity Recognition (BNER). Three features of bug report entities, i.e., description phrases, solid distribution, and Parts of Speech (POS), were summarized, and a categorization method was created to categorize bugs into a predefined set of 16 categories based on these features. A baseline corpus was built with all related information, along with a semi-supervised BNER system. An embedding technique was used to extract features from the bug repository. The two online software bug repositories of Mozilla and Eclipse were used to train and evaluate the proposed approach. The results showed that designing a baseline corpus in the initial phases is very useful, and their approach increased the accuracy by 70% to 80%. Also, BNER can be effective for cross-project bug entity recognition.
B. Bug Prioritization
The process of bug prioritization involves automatically identifying highly influential bugs so that critical bugs are addressed immediately. An automated approach to prioritize bug reports named Drone was proposed by Y. Tian et al. [20]. For handling imbalanced bug report data, a new classification engine called GREY was built by merging linear regression with a thresholding approach. Different dimensions, i.e., author, product, related-report, severity, textual and temporal, were reviewed to predict the priority of bug reports. The dataset was collected from the Eclipse project, with over 100,000 bug reports divided into three sets: REP training data (for identifying similar reports), Drone training data and Drone testing data. The proposed approach was compared with the baseline solutions of previous studies, and the results showed that it improved the f1-score by up to 209%.
Another study for prioritizing bug reports was proposed by P. A. Choudhary and D. S. Singh [40]. The research focused on five priority levels with six features, i.e., temporal, textual, author, related-report, severity and product, to predict the priority of bug reports using Artificial Neural Network and Naive Bayes classifiers. Five versions of the Eclipse project, i.e., 2.0, 2.1, 3.0, 3.1, and 3.2, with three products, i.e., JDT, PDE and Platform, were collected from Bugzilla and used to train and test the model. To evaluate the model, the Receiver Operating Characteristic (ROC) curve and f1-score were used. The results showed that the model predicted priority level P3 with 82.7% precision and 80.9% recall in Eclipse 2.0. Furthermore, the model performed more efficiently using Naive Bayes, with ROC values ranging from 89% to 98% for different priority levels.
To automate bug prioritization, Y. Wang et al. [22] introduced feature selection methods for classification models on two popular projects: WordPress and Trac. Covering the two main feature selection approaches, wrapper and filter, seven feature selection techniques were considered: Correlation, CfsSubset, OneR, InfoGain, SymmetricalUncert, GainRatio and ReliefF. Two classification algorithms, Naive Bayes and SVM, were used, as in previous studies, to train the set of feature vectors. Results were evaluated using precision, recall, and accuracy, and show that GainRatio, InfoGain and Correlation performed better for bug prioritization.
Q. Umer et al. [23] proposed a new emotion-based approach for predicting the priority of bug reports. The dataset consists of bug reports from four online projects: JDT, Eclipse, CDT and PDE. The effectiveness of different classification algorithms, Naive Bayes, SVM, Linear Regression, and Multinomial Naive Bayes, was investigated. To prioritize bug reports, five class labels were used. To identify and analyze emotion words in the bug reports, the feature vector set was compared with an emotion-word corpus available online. For performance evaluation, recall, precision, and f1-score were used. Experimental results showed that the proposed approach improved the f1-score by more than 6%.
An automated approach for predicting bug priority and severity using machine learning classification algorithms was investigated by H. Manh et al. [41]. The performance of different classifiers, SVM, Naive Bayes, Artificial Neural Network (ANN), K-Nearest Neighbors and DT, was compared. Random Forest and Decision Tree classifiers were selected to conduct experiments on datasets from open-source bug tracking systems: Bugzilla, Launchpad, Mantis and Debian. The proposed model used four priority classes, i.e., urgent, high, normal and low, while severity was classified into critical, normal and minor. The performance of both classifiers was evaluated using time consumption, Mean Squared Error (MSE) and Median Absolute Error (MAE), and the results showed that Random Forest outperforms DT with an average accuracy of 0.75.
Existing research mostly focuses on either automating bug categorization [11], [33] or bug prioritization [34], [35]. Limited work has been done on categorizing and prioritizing bug reports simultaneously [36]; therefore, we present the CaPBug framework, which automates both bug categorization and bug prioritization. A summary of existing studies, including datasets, methodology, and results, is shown in TABLE 2, whereas a comparison of the existing studies with the proposed framework is shown in TABLE 3.
Methodology
We now explain the methodology of the CaPBug framework. It includes six phases: 1. Data collection, 2. Pre-processing, 3. Feature extraction, 4. Class imbalance handling, 5. Classification, and 6. Performance evaluation.
In the first phase, data is collected from the two online software bug repositories of Mozilla1 and Eclipse2 on Bugzilla. In the next phase, NLP pre-processing techniques are applied to the bug reports’ content. This phase converts the bug reports’ textual feature into topic vector sets, which helps the machine learning algorithms train the model and predict the categories and priorities of bug reports correctly. In the third phase, the topic vector set produced in the second phase is evaluated and important words are extracted based on their textual structure using the TF-IDF approach. Afterward, the class imbalance problem is resolved for the priority levels. In the fifth phase, textual and categorical features are used to train machine learning algorithms for future inference, automating bug prioritization and categorization. Finally, the performance of the different algorithms is analyzed to measure the accuracy of the proposed framework. Fig. 1 shows the overall framework of this research.
We now describe each phase of the CaPBug framework in detail.
A. Data Collection Phase
The dataset used in this research is collected from the two online software bug repositories of Mozilla and Eclipse, hosted in the Bugzilla issue tracking system. Eclipse and Mozilla are authentic, open-source datasets available for reporting software bugs and contain only real defects. They are available through Bugzilla, which holds a large number of recent bug reports covering the entire bug life cycle and stores all the actions and information used to resolve bugs [42], [43]. Although a large number of bug reports are available in the Bugzilla system, the category of bugs is not mentioned in the bug reports of recent years. We randomly selected bug reports and labeled them manually after a thorough investigation, based on the six categories. Around 2000 bug reports from both repositories within the period 2016 to 2019 were selected for this research. We used keywords in the selection process to identify bug reports in each category; for example, for GUI-type reports, we used keywords such as font, color, alignment, view, and layout. During this process, we tried to ensure that the number of bug reports in each category of our dataset is almost equal.
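The keyword-based pre-selection described above can be sketched as follows; apart from the GUI keywords named in the text, the keyword lists and the function name are illustrative assumptions, not the exact lists used to build the corpus.

```python
# Illustrative keyword lists per category; only the GUI keywords come from the text above.
CATEGORY_KEYWORDS = {
    "GUI": ["font", "color", "alignment", "view", "layout"],
    "Network or Security": ["connection", "server", "protocol", "permission"],  # assumed examples
    "Performance": ["slow", "hang", "memory", "cpu"],                           # assumed examples
}

def candidate_categories(summary):
    """Return the categories whose keywords appear in a bug report summary."""
    text = summary.lower()
    return [category for category, keywords in CATEGORY_KEYWORDS.items()
            if any(keyword in text for keyword in keywords)]

print(candidate_categories("Incorrect font color in the editor layout"))  # ['GUI']
```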
Both textual and categorical features are used for predicting the category and priority of bug reports. The summary attribute of the dataset is used as the textual feature, on which NLP techniques are applied. We chose the summary as the textual feature because it provides detailed information about the problem. The categorical features include the product, component, assignee, status, classification, priority and category attributes of the dataset. We chose these features because of their impact on categorizing and prioritizing bug reports. The description of each feature is given in TABLE 4.
The dataset comprises six categories: Program Anomaly, GUI, Network or Security, Configuration, Performance, and Test-Code. To label the dataset, a category was assigned to each record manually by a developer after carefully reading the summary. TABLE 5 summarizes the number of records in each category of the dataset.
These categories of bug reports are explained below with examples.
1) Program Anomaly
This category refers to issues that occur due to problems in the source code. Examples of such problems include exceptions, logical errors, return value problems, and syntax errors [44]. An example in TABLE 6 shows the summary of a bug report in which the code for the next line is automatically assigned to an if-else condition due to incorrect indentation.
2) GUI
This category refers to issues related to the design and event handling of user interfaces. It covers potential bugs related to widget and text colors, layouts, CSS styles, widget appearance, and visibility [45], [46]. An example of a GUI-related problem from a bug report is given in TABLE 6. The report highlights a text readability issue caused by side-scrolling in the text editor.
3) Network or Security
This category refers to bugs related to network problems or security issues. Bugs in the network category include connection or server problems such as improper usage of communication protocols and unexpected server shutdowns [47], [48]. A network-related issue, raised because sending large files in a single request caused the xmlHttpRequest to hang, is exemplified in TABLE 6.
Bugs related to security involve vulnerabilities, deletion of unused permissions, reloading of certain parameters [49], etc. The example shown in TABLE 6 is the summary of a bug report related to a security issue in which permission is denied when the user attempts to access the windowUtils property.
4) Configuration
This category covers bugs caused by problems in the integration of configuration files. Problems in this category arise from wrong file or directory paths in XML, updates to external libraries, fixes to external libraries, manifest artifacts, plug-in failures [50], etc. An example in TABLE 6 shows a bug reported when updating the application, after which the shared configuration area is missing.
5) Performance
This category refers to problems concerned with memory and resource usage, including infinite loops that cause the application to hang, energy leaks, and excessive memory usage [51]. An example in TABLE 6 shows a performance-related bug in which, during debugging, the Eclipse project becomes very slow and consumes 100% of the CPU.
6) Test Code
This category covers problems that emerge in test code. From the dataset, it is observed that test-code bugs occur due to (1) intermittent tests, (2) updating, repairing and running test cases, and (3) test failures when searching for de-localized bugs [52]. A sample report summary in TABLE 6 shows a bug reported due to the failure of an intermittent JUnit test in API tools.
Software developers spend a lot of time resolving bug reports filed by their quality assurance team. Sometimes, the number of reported bugs exceeds the resources available for fixing them. As a result, critical bugs are not resolved at all or are handled very slowly. Both severity and priority can be used to mark the level of urgency with which a bug has to be resolved. Severity is defined as the level of impact that a defect has on the product. As severity is typically reported by the user or customer after the system has been deployed [21], we work on predicting priority only. Priority is assigned by the development and quality assurance team during product development [20], [21].
We use five priority levels in this research, i.e., P1 (Very High), P2 (High), P3 (Medium), P4 (Low) and P5 (Very Low). These levels are assigned by the developers in Bugzilla bug reports. Bug reports assigned P1 should be fixed with high priority [53], [54]. Priority level P3 is the one most frequently used by the development and testing teams, as common software bugs fall into this category; therefore, the datasets available for Mozilla and Eclipse have more records at priority level P3. Furthermore, if the development team is unsure about how to prioritize a defect, or if it is a minor bug, they set it to P3. Such bugs can be fixed after all critical and high priority bugs are fixed. To ensure that the ratio of records at each priority level remains similar to the actual dataset, we also selected more P3 records, as shown in TABLE 7.
B. Pre-Processing Phase
In this phase, the textual feature of the bug report, i.e., the summary, is converted into vectors of topics using the Python Natural Language Toolkit (NLTK). Three NLTK processing steps, Tokenization, Stop Words Removal and Lemmatization [55], are applied in this phase.
Tokenization: To easily understand the bug reports’ content, the text is transformed into a series of tokens (or words) without unnecessary punctuation and special symbols [56]–[58].
Stop Words Removal: In this step, frequently used words of a natural language that carry no useful meaning (such as articles and prepositions in English) are removed, because these words do not contribute meaningfully to the feature vector set and do not provide useful information [59], [60].
Lemmatization: The final step of the pre-processing phase is lemmatization. A bug report may contain a word in numerous forms; e.g., ‘performing’, ‘performed’ and ‘performs’ share the same base meaning. This step converts the various forms of a word into a meaningful base form [61]. This research uses the WordNet Lemmatizer package, as it is the most frequently used lemmatizer.
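A minimal sketch of this pre-processing pipeline with NLTK follows; the exact tokenizer settings and stop-word list used in this research are not specified, so these choices are assumptions.

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time download of the required NLTK resources.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(summary):
    """Tokenize a bug report summary, drop stop words and punctuation, and lemmatize."""
    tokens = word_tokenize(summary.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [lemmatizer.lemmatize(t) for t in tokens]

print(preprocess("The views are performing slowly while debugging the project."))
```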
C. Feature Extraction Phase
After the pre-processing phase, important words are extracted from the feature vector set by examining each dimension of the features. To perform this process, the Term Frequency–Inverse Document Frequency (TF-IDF) feature extraction technique is used. TF-IDF is an information retrieval technique and a numerical statistical measure for finding words that are relevant and important to a document in a corpus [62], [63]. The weights are calculated as follows.\begin{equation*} TF\text{-}IDF = TF \ast IDF\end{equation*}
\begin{equation*} TF = \frac {\textit {Number of times the word occurs in the text}}{\textit {Total number of words in the text}}\end{equation*}
\begin{equation*} IDF = \log \frac {\textit {Total number of documents}}{\textit {Number of documents with word t in it}}\end{equation*}
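As an illustrative sketch, an equivalent weighting can be computed with scikit-learn’s TfidfVectorizer (its internal IDF is smoothed and therefore differs slightly from the plain formula above); the example summaries are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-processed bug report summaries.
summaries = [
    "editor text unreadable side scrolling",
    "xmlhttprequest hang large file request",
    "debugging slow cpu usage project",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(summaries)   # rows: reports, columns: words
print(vectorizer.get_feature_names_out())            # vocabulary built from the corpus
print(tfidf_matrix.toarray().round(2))               # TF-IDF weight of each word per report
```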
D. Class Imbalance
The Mozilla and Eclipse datasets are highly imbalanced with respect to priority levels: there are abundant records for priority level P3 and very few records for the other levels. As shown in TABLE 7, priority levels P1, P4 and P5 have very few records. A number of common issues normally occur in software, and the development team labels these issues as P3. The dataset also has many P2-level bug reports, although not as many as P3. However, bug reports of the other priority levels, i.e., P1, P4 and P5, are rare, and as a result a class imbalance problem occurs in the dataset.
This class imbalance problem makes it difficult to train the model to accurately predict the priority of bug reports [64]. To overcome this issue, the Synthetic Minority Oversampling Technique (SMOTE) is used in this research. SMOTE adjusts the class distribution by generating synthetic samples for the minority classes [65].
The distribution of each priority level before and after applying SMOTE is shown in TABLE 7.
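A minimal sketch of applying SMOTE with the imbalanced-learn library is shown below; the feature matrix and class counts are synthetic stand-ins, not the actual distribution of TABLE 7.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic stand-in for the feature matrix and the five priority labels (illustrative only).
X, y = make_classification(n_samples=2000, n_classes=5, n_informative=10,
                           weights=[0.05, 0.25, 0.55, 0.05, 0.10], random_state=42)
print("before SMOTE:", Counter(y))

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)   # synthesize minority-class samples
print("after SMOTE: ", Counter(y_resampled))
```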
E. Classification Phase
This phase consists of two steps: training and testing. We divided the datasets of both the Mozilla and Eclipse projects into 80% for training and 20% for testing. Four machine learning classification algorithms, i.e., Naive Bayes (NB), Random Forest (RF), Decision Tree (DT) and Logistic Regression (LR), were applied to train the models.
1) Naive Bayes Classifier
A Naive Bayes classifier is a classification algorithm based on Bayes’ Theorem. It assumes independence between features, i.e., the presence of one feature does not affect another. It is a simple and fast algorithm that scales well to large datasets [66]. Bayes’ Theorem gives the probability of a label given the observed features and can be written as:\begin{equation*} P(A|B) = \frac {P(B|A)P(A)}{P(B)}\end{equation*}
Here, the probability of a target class A is found given the predictors or features B. Naive Bayes assumes that the features or predictors are independent of one another [67], [68].
2) Decision Tree Classifier
Decision Tree is a classification algorithm with a tree-like structure in which the data is repeatedly split according to certain parameters. A tree consists of (i) nodes that test the value of a feature, (ii) edges or branches, connected to the next node or leaf, that carry the outcome of a test, and (iii) leaf nodes that predict the target label. This research uses the ID3 algorithm, which relies on Entropy and Information Gain (IG) to build the decision tree [69].
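For reference, the two ID3 measures named above are commonly defined as follows (a standard textbook formulation, not reproduced from [69]):\begin{equation*} Entropy(S) = -\sum _{i=1}^{c} p_{i}\log _{2} p_{i}\end{equation*}
\begin{equation*} IG(S,A) = Entropy(S) - \sum _{v \in Values(A)} \frac {|S_{v}|}{|S|}\, Entropy(S_{v})\end{equation*}
where $p_{i}$ is the proportion of samples in $S$ belonging to class $i$ and $S_{v}$ is the subset of $S$ for which attribute $A$ takes value $v$; ID3 selects, at each node, the attribute with the highest information gain.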
3) Random Forest Classifier
Random Forest is an ensemble classification algorithm that contains a set of decision trees. Each decision tree is built from a sample of the data and produces its own classification, and the class with the most votes becomes the final prediction [70]. Combining many different trees makes this an ensemble classifier that gives strong results and reduces overfitting [71].
4) Logistic Regression
Logistic regression is a statistical, classification-based machine learning algorithm that assigns observations to a discrete set of classes. It is built upon the concept of probability and is applied when the target class has categorical values. The algorithm uses the sigmoid (logistic) function to map predicted values to probabilities between 0 and 1, using the following equation [72].\begin{equation*} f(x) = \frac {1}{1+e^{-x}}\end{equation*}
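The classification phase can be sketched with scikit-learn as follows, using the 80/20 split described above; the feature matrix is a random stand-in for the TF-IDF vectors, and the hyper-parameters shown are library defaults rather than the exact settings used in this research.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.random((2000, 50))            # stand-in for the TF-IDF matrix (non-negative values)
y = rng.integers(0, 6, size=2000)     # stand-in for the six category labels

# 80% training / 20% testing split, as used in the CaPBug framework.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy"),   # entropy/IG-based splits
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, "test accuracy:", round(clf.score(X_test, y_test), 3))
```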
F. Evaluation Metrics
Classifier performance is evaluated using a confusion matrix, from which different metrics are derived in terms of True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) counts [73]. Accuracy, precision, recall, and f1-score are used to evaluate the performance of the CaPBug framework. Accuracy shows how correctly the algorithm classifies the target class; it is the proportion of correctly predicted values out of the total number of input samples [38], [74]. Below is the formula for calculating the accuracy of the model.\begin{equation*} Accuracy = \frac {TP+TN}{TP+TN+FN+FP}\end{equation*}
Precision is the proportion of positive predictions that are actually positive, indicating how accurate the model’s positive predictions are [75]. It is particularly informative for minority classes. The following formula shows how precision is calculated.\begin{equation*} Precision = \frac {TP}{TP+FP}\end{equation*}
Recall is the ratio of correctly retrieved instances among all relevant instances [76]. It is calculated by accounting for the number of False Negatives (FN) in the confusion matrix and is sometimes referred to as the True Positive Rate (TPR) or sensitivity. The formula used to compute recall is:\begin{equation*} Recall = \frac {TP}{TP+FN}\end{equation*}
F1-score is the weighted harmonic mean of precision and recall [77]. When the dataset has a large number of actual negatives or an uneven class distribution, it captures the balance between precision and recall [78]. The following formula shows how the f1-score is calculated.\begin{equation*} f1 = 2 \ast \frac {Precision \ast Recall}{Precision + Recall}\end{equation*}
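As a sketch, these metrics can be computed directly from a classifier’s predictions, e.g. with scikit-learn; the labels below are purely illustrative.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Illustrative true and predicted priority labels.
y_true = ["P3", "P3", "P2", "P1", "P3", "P4", "P2", "P5"]
y_pred = ["P3", "P2", "P2", "P1", "P3", "P4", "P3", "P5"]

print(confusion_matrix(y_true, y_pred))            # per-class breakdown of correct and incorrect predictions
print("accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))       # precision, recall and f1-score per class
```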
Results and Discussion
We randomly selected more than 2000 reports from two projects, Eclipse and Mozilla, on Bugzilla. The baseline corpus was built by manually labeling the category of these reports for use in training and testing the algorithms. We predicted the categories and priorities of bug reports using four classification algorithms, i.e., Naive Bayes, Decision Tree, Random Forest and Logistic Regression. Both textual and categorical features were used for training and testing the CaPBug framework. Finally, the results were evaluated using the evaluation metrics and by conducting a comparative analysis between textual and categorical features.
A. Predicting Category and Priority From Textual Feature
The textual feature of the dataset, i.e., the summary, was used to predict the category and priority of bug reports. NLP techniques and TF-IDF were applied to the textual feature to create a feature vector set and extract important features. TABLE 8 shows the results obtained using the four classification algorithms.
For category prediction, TABLE 8 shows that the Random Forest classifier achieved the highest accuracy of 88.78%, with the highest precision, recall, and f1-score of approximately 90.00%, 87.16% and 86.66%, respectively. The Naive Bayes classifier showed the lowest accuracy of 67.05%, with the lowest precision of 66.00%, recall of 65.50%, and f1-score of 65.16%. The Decision Tree and Logistic Regression classifiers performed moderately, with accuracies between 83% and 86%; both recorded precision from 84% to 89%, recall from 81% to 83%, and f1-score from 83% to 85% for category prediction. However, in predicting the priority of bug reports, none of the algorithms performed satisfactorily with the textual feature. The Decision Tree and Random Forest classifiers achieved the highest accuracy among the four algorithms but with low precision, recall and f1-score; their accuracy of 68.22% is not good enough to train the model. The other two algorithms, Naive Bayes and Logistic Regression, both performed poorly, with accuracies below 60%.
Hence, it is concluded that category prediction performed well with the textual feature. However, no algorithm achieved good accuracy for predicting priority from the textual feature because of the class imbalance in the dataset: records of priority classes P1, P4 and P5 are rare, and most bug reports are assigned the immediate and normal priority levels P2 and P3. As a result, the algorithms did not achieve good performance measures.
1) Category Wise Results From Textual Feature
TABLE 9 presents the accuracy of the classification algorithms for each category of bug reports predicted from the textual feature. Each category performs differently across the classification algorithms. Using the Naive Bayes classifier, only the GUI category achieved good accuracy, 85.71%, while Program Anomaly reached 71.71%; the other four categories did not achieve good results, with accuracies between 50% and 69%. The Decision Tree, Random Forest and Logistic Regression classifiers worked well for the Program Anomaly, GUI, Performance and Test-Code categories, whose accuracies fall within 84% to 98%. The other two categories, Network or Security and Configuration, achieved better results with the Decision Tree and Random Forest classifiers, with accuracies from 73% to 80%, but only 66% to 68% with the Logistic Regression classifier, which is not good enough. Overall, the GUI category was predicted most accurately by all algorithms and achieved the highest accuracy of 97.61% using the Logistic Regression algorithm.
2) Priority Wise Results From Textual Feature
TABLE 10 presents the accuracy of the algorithms at each priority level predicted from the textual feature. Using the Naive Bayes, Decision Tree and Random Forest classifiers, the P4 priority level reached accuracies of 84% to 91%, whereas it achieved the lowest accuracy of 6.06% with Logistic Regression. Priority level P3 obtained good results with Random Forest and Logistic Regression, with accuracies of 84.09% and 90.90% respectively, and 74.43% with the Decision Tree classifier. The other three priority levels, P1, P2 and P5, were not accurately predicted by the classifiers and obtained the lowest accuracies: P1 achieved 23% to 53%, P2 49% to 63%, and P5 18% to 69%.
Hence, we conclude from the above results that only the P3 and P4 priority levels reached a moderate level of accuracy with the machine learning algorithms, while the other priority levels did not achieve good results using the textual feature of the dataset.
B. Predicting Category and Priority From Categorical Features
The categorical features have also been used in this research for the automatic prediction of categories and priorities of bug reports. These features include product, component, assignee, status, classification, priority and category. TABLE 11 shows the results obtained after applying four classification algorithms with the above-mentioned features.
When predicting the category of bug reports, we observed that no classifier worked well with categorical features. The performance measures of Naive Bayes and Logistic Regression range from 28% to 42%, which is inadequate to train the model correctly. The remaining two classifiers, Decision Tree and Random Forest, also did not produce good results, achieving accuracies of 53.74% and 54.43% respectively, with their other performance measures between 51% and 53%. However, when predicting priority with categorical features, all the classifiers worked better than with the textual feature. Random Forest achieved the highest accuracy among all classifiers, increasing from 68.22% to 77.33%, and its recall and f1-score also improved slightly compared to priority prediction with the textual feature: recall increased from 65.60% to 71.40% and f1-score from 69.00% to 72.20%. Decision Tree obtained an accuracy of approximately 73.36%, but with low performance measures, i.e., 67.80% precision, 67.80% recall and a 56.66% f1-score. The performance of Naive Bayes and Logistic Regression remained low, with accuracies of approximately 61% to 66% and precision, recall and f1-score of 55% to 63%.
It is evident from the results in TABLE 11 that category prediction does not work as well with categorical features as it does with the textual feature. Furthermore, priority prediction is also affected by the class imbalance problem: records of the P3 class are abundant while those of the other classes are disproportionately few, creating a class imbalance.
1) Category Wise Results From Categorical Features
TABLE 12 presents the accuracy of the classification algorithms for each category of bug reports using categorical features. None of the algorithms succeeded in achieving good results with categorical features. Only Random Forest achieved a somewhat better accuracy of 71.71% in the Program Anomaly category, which is still not good enough, and the other categories did not achieve good results with Random Forest either. The Test-Code and Configuration categories achieved the lowest accuracies, 4.83% and 9.24%, using the Logistic Regression classifier. The Network or Security, Configuration, Performance and Test-Code categories obtained very low results using the Naive Bayes and Logistic Regression classifiers. Only the GUI and Program Anomaly categories achieved comparatively better results with all the algorithms.
A satisfactory level of results was not achieved using categorical features to predict the category of bug reports. We therefore conclude that the textual feature should be used to predict the category of bug reports, since it provides detailed information about the bug and helps train the model with the right category.
2) Priority Wise Results From Categorical Features
The performance of each priority level is given in TABLE 13. No classifier succeeded in accurately predicting the priority of bug reports using categorical features. At most, the P2 and P3 priority levels obtained better results with all the algorithms; both achieved their highest accuracies of 82.96% and 81.81% respectively using the Random Forest classifier, and their lowest of 73.63% and 67.25% with Logistic Regression. The P1 priority level also obtained better results with the Decision Tree classifier, achieving an accuracy of 81.63%. However, the remaining two priority levels, P4 and P5, were not predicted accurately by the classifiers, except that P5 achieved better results with Random Forest; the accuracies of these two levels lie between 42% and 53% for P4 and between 58% and 76% for P5.
Considering the low performance of priority prediction above, we carried out an in-depth investigation of the results at each priority level using both textual and categorical features. After a detailed evaluation, we concluded that, due to the class imbalance problem, no classifier performed well in priority prediction of bug reports. For this reason, SMOTE was used in this research to address the class imbalance problem. In our initial experiments, the relative proportions of P4 and P5 were kept the same, i.e., P4 had the smallest percentage both before and after SMOTE; however, we could not attain a satisfactory level of accuracy. Only when the records of P4 were increased beyond those of P5, as shown in TABLE 7, did we reach the desired level of accuracy.
C. Predicting Priority From Textual Feature With SMOTE
After applying SMOTE, all the classifiers performed well in predicting the priority of bug reports using the textual feature. TABLE 14 shows that the framework achieved the highest accuracy of 90.43% using the Random Forest classifier, with precision and recall of 91.60% and an f1-score of 91.50%. Decision Tree acquired an accuracy of 88.94%, close to the Random Forest classifier, with 89.80% precision, 90.40% recall and a 90.20% f1-score. Moreover, the other two classifiers, Naive Bayes and Logistic Regression, achieved accuracies of 83.29% and 84.33% respectively, and their other performance measures were also good.
When we compared the results of priority prediction using SMOTE with the results before class balancing, we observed that the performance measures improved with SMOTE and the textual feature, and all the classifiers succeeded in achieving good results. Hence, it is concluded that the framework performs well with the textual feature after applying SMOTE for priority prediction.
1) Priority Wise Results From Textual Feature With SMOTE
TABLE 15 presents the accuracy of each priority level with the textual feature after applying SMOTE. Priority level P1 obtained good results with all the classifiers, achieving accuracies between 84.34% and 93.43%. Priority levels P4 and P5 were also predicted well by all the classifiers: P4 achieved the highest accuracy of 98.56% among all priority levels, and P5 achieved accuracies between 86.60% and 97.32%. Priority level P2 obtained better results with all the classifiers except Logistic Regression, with a maximum accuracy of 84.45% using Naive Bayes and a minimum of 78.23% using Logistic Regression. However, priority level P3 did not obtain good results with Naive Bayes, achieving an accuracy of 70.61%; it obtained its maximum accuracy of 85.84% using the Random Forest classifier.
We conclude that, after applying SMOTE, almost every priority level achieved good results with the textual feature. The textual feature gives detailed information about the bug reports; based on it, the model can be trained to accurately predict the priority of bug reports once the class imbalance problem is handled.
D. Predicting Priority From Categorical Features With SMOTE
When predicting the priority of bug reports after applying SMOTE using categorical features, the results shown in TABLE 16 were obtained.
After handling the class imbalance problem, the framework performed well with the Decision Tree and Random Forest classifiers, whose accuracies are 87.32% and 88.47% respectively. The other performance measures of these classifiers are 87.20% precision, 88.40% recall and an 87.60% f1-score for Decision Tree, and 88.20% precision, 89.00% recall and an 88.40% f1-score for Random Forest. The other two classifiers, Naive Bayes and Logistic Regression, did not work well with categorical features after applying SMOTE. Naive Bayes achieved an accuracy of 43.31% with 41.20% precision, 42.40% recall and a 42.20% f1-score. Logistic Regression acquired the lowest accuracy of 41.24%, along with the lowest performance measures, i.e., 40.80% precision, 38.80% recall and a 38.49% f1-score.
Comparing these results with those before class balancing shows an improvement only in the Decision Tree and Random Forest classifiers after applying SMOTE with categorical features, whereas the performance measures of Naive Bayes and Logistic Regression decreased; they did not perform well after handling the class imbalance problem. Therefore, we conclude that after applying SMOTE, the framework performs well with categorical features only when using the Decision Tree and Random Forest classifiers.
1) Priority Wise Results From Categorical Features With SMOTE
After applying SMOTE with categorical features, the accuracy of each priority level is shown in TABLE 17.
It is observed that the Decision Tree and Random Forest classifiers achieved good results at all priority levels except P3, which obtained 75.22% accuracy using the Decision Tree. The Decision Tree classifier achieved the highest accuracy of 94.96% at priority level P4 among all the priority levels. The Naive Bayes and Logistic Regression classifiers could not attain a satisfactory level of accuracy in predicting priority levels, even after applying SMOTE. The Naive Bayes classifier acquired its lowest accuracy of 25.25% at P1 and a maximum of 70.46% at P2. The model worked poorly with Logistic Regression, which acquired the lowest accuracy of 17.26% at P4 among all priority levels, with its maximum accuracy of 65.28% achieved at P2. Therefore, we conclude that the Decision Tree and Random Forest classifiers obtained better results at each priority level after applying SMOTE using categorical features.
Conclusion
Bug reports play a crucial role in software development and maintenance activities. They allow software developers, quality assurance teams, and customers to identify and report related issues. These reports contain extensive details, so manual extraction of this information is infeasible given the time it requires. Therefore, an automated mechanism is needed for their categorization and prioritization. The focus of this research is to automate the process of categorizing and prioritizing bug reports. We propose CaPBug, a machine learning based framework that recommends a category and priority level for each issue based on the information available in the bug report. The approach is based on supervised machine learning and uses textual and categorical features of the dataset. We conducted an experimental study on 2,138 Eclipse and Mozilla bug reports from the Bugzilla dataset. We manually labeled the dataset with bug categories and applied NLP techniques and machine learning classifiers to predict the category and priority of bug reports.
The CaPBug framework utilizes both textual and categorical features to predict the category and priority of bugs. The summary attribute was taken as the textual feature for training the model; since it includes detailed information about the bug report, it was the more significant feature in predicting category and priority. Moreover, as the dataset was highly imbalanced with respect to the priority classes, we applied SMOTE to correctly train the model for each priority level. Categorical features worked well with SMOTE in predicting priority, but only with a few classifiers. Nonetheless, it is evident from the results that the framework gives better results with the textual feature for predicting both the category and the priority of bug reports.
The research concluded that the framework achieved the highest accuracy of 88.78% for category prediction and 90.43% for priority prediction by using the textual feature with Random Forest classifier.
We intend to enhance the CaPBug framework by adding more records to the training dataset so that it can be used to predict both the priority and category of new bug reports. Additionally, deep learning techniques can be applied in the future to further improve the performance and results.