Analysis of Tree-Family Machine Learning Techniques for Risk Prediction in Software Requirements

Risk prediction is among the most sensitive and critical activities in the Software Development Life Cycle (SDLC), and it can determine whether a project succeeds or fails. To increase the success probability of a software project, risks should be predicted at the early stages. This study proposes a novel model based on a requirement-risk dataset to predict software requirement risks using Tree-Family Machine-Learning (TF-ML) approaches. Moreover, the proposed model is compared with state-of-the-art models to determine the best-suited methodology for the nature of the dataset. These strategies are assessed and evaluated using a variety of metrics. The findings of this study may serve as a baseline for future studies and research, allowing the results of any proposed approach, model, or framework to be benchmarked and easily checked.

Requirements engineering primarily involves eliciting, documenting, and maintaining stakeholders' requirements [3]. Regularly meeting and ensuring stakeholders' essential needs is one of the primary reasons for producing a high-quality software system [4], [5].

There is always a chance of inexact procedures during the Software Development Life Cycle (SDLC), which may defeat the software organization or the software development effort. These uncertain procedures are known as software risks. Risks arise from various risk factors rooted in an assortment of activities across the SDLC. If these risks are not identified appropriately, they may become liable for the failure of the project [6]. These elements should be isolated and mitigated through risk estimation in the SDLC's early phases to limit software cost and schedule overruns. Because requirement collection is the first part of the SDLC, forecasting risks at this stage may boost software productivity and quality while decreasing the likelihood of catastrophes in the project [4], [6].

Keeping this issue of risk prediction at the early stage of software requirements in mind, numerous researchers have assessed and created models applying various categorization algorithms. However, any broad-spectrum preparation to kick-start the use of these techniques is hard to come by; overall, despite significant variances in the experiments, no single methodology was shown to confer a clear advantage. Frequent solutions for predicting software risk at different phases of the SDLC are available. In contrast, infrequent methods are available in the literature to predict risks in the software requirements phase [6], [11].
A risk prediction model encompassing data mining classification methods is proposed to predict risks in a project's Software Requirement Specifications (SRS). The TF approach is one of the strongest techniques for organizing the most significant variables and the interactions between two or more variables. TFs can develop new features with greater predictive potential for target variables, need less data purification than other modelling methodologies, and are not biased to a considerable degree by outliers or missing data [17], [18], [19]. If a project fails to fulfil the user's needs, budget, or timeline, the product's quality suffers and the project is more likely to fail [14]. So, to limit effort and the likelihood of failure, a product must be built within its budget and schedule constraints. The late discovery of risk has a more significant effect on project failure, so it is necessary to forecast risk early in the SDLC, at the software requirements stage.
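As an illustration of the properties claimed for tree-family learners, a decision tree ranks the most influential input variables directly via impurity-based feature importances and requires little data purification. The sketch below uses synthetic data and hypothetical attribute names (not the actual dataset columns); the study itself uses Weka implementations rather than scikit-learn.

```python
# Minimal sketch: a decision tree surfaces the most significant variables
# through its feature importances. Attribute names are hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for a requirement-risk table: three numeric attributes.
X = rng.normal(size=(n, 3))
# The risk label depends mostly on the first attribute (by construction).
y = (X[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
for name, imp in zip(["complexity", "stability", "clarity"],
                     tree.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

Because the label is driven by the first attribute, the tree assigns it the dominant importance, mirroring how TF approaches organize significant variables.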

The data obtained from previous projects can be used for this purpose by applying machine learning techniques. This research aims to analyse TF-ML approaches for risk prediction in software requirements using a dataset from the Zenodo repository. The dataset used contains 13 characteristics. Several data splits are applied for training and testing purposes to determine the better data-splitting mechanism. In case 1, the data is divided into 90% for training and 10% for testing; in case 2, the training share is decreased and the testing share increased by 10%, so 80% is used for training and the remaining 20% for testing; and so on, down to 10% for training and 90% for testing. In the last scenario, 10-fold cross-validation is used, which many studies advocate as a benchmark [15], [16].
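The splitting scheme described above can be sketched as follows. The tree learner and the synthetic 13-attribute data are placeholders (the study itself uses Weka and the Zenodo requirement-risk dataset); the loop reproduces the nine hold-out cases from 90%/10% down to 10%/90%, followed by the 10-fold cross-validation benchmark.

```python
# Sketch of the nine hold-out splitting cases plus 10-fold CV.
# Data and learner are synthetic stand-ins, not the study's actual setup.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 13))       # 13 characteristics, as in the dataset
y = (X[:, 0] > 0).astype(int)        # synthetic binary risk label (assumption)

clf = DecisionTreeClassifier(random_state=1)
for test_size in [i / 10 for i in range(1, 10)]:   # cases 1..9
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=1, stratify=y)
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)
    print(f"train {100 - test_size * 100:.0f}% / "
          f"test {test_size * 100:.0f}%: acc={acc:.2f}")

# Final case: 10-fold cross-validation, the benchmark setting.
cv_acc = cross_val_score(clf, X, y, cv=10).mean()
print(f"10-fold CV: acc={cv_acc:.2f}")
```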

The employed techniques are evaluated using the standard evaluation measures presented in the subsequent sections.

Various assessment metrics evaluate the ten TF-ML approaches, including J48, HT, CDT, RF, RT, LMT, CS-Forest, and REP-T. A 10-fold cross-validation procedure is employed for training and testing, where the dataset is partitioned into ten folds. The evaluation measures include MAE and RMSE [20], [21], RAE and RRSE [17], [19], precision [22], [23], recall [22], [24], F-measure [25], [26], MCC [25], [27], and accuracy [26], [28], [29], where P_ij is the rate of prediction by the classifier. These measures have also been explored in related work [31], [22], [32].

The proportion of CCI and ICI attained by each approach for each test-case module is represented by the remaining columns. The best test case that we consider is 10-fold cross-validation, the most widely used standard, as shown in Table 3.

Tables 9, 10, 11, and 12 show the outcome analysis of average precision, recall, F-measure, and MCC. CDT, F-PA, and J48 exceed the other approaches in each table, achieving better results. A ''?'' sign appears in Tables 9, 11, and 12. This is a Weka auto-generated symbol caused by a ''0'' value in the confusion matrix: when a metric's denominator becomes ''0,'' the division is undefined, so Weka reports ''?'' instead of a value.

The outcomes show that the best cases for training and testing on the aforementioned datasets are the first four hold-out cases, from 90% training and 10% testing down to 60% training and 40% testing, together with the last case, 10-fold cross-validation. If the goal is to reduce the error rate, our study shows that CDT outperforms the other applied strategies on all of the selected (best test case) modules in Figures 4 (MAE and RMSE) and 5 (RAE% and RRSE%). Similarly, in the cases of recall, precision, F-measure, MCC, and accuracy, as shown in Figures 6 and 7, CDT outperforms the other approaches.

This section discusses the threats that might jeopardize the validity of this study.
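The confusion-matrix measures above, and the origin of the ''?'' symbol, can be sketched for the binary case as follows. The `metrics` helper and its argument names are illustrative, not Weka's API; `None` plays the role of Weka's ''?'' whenever a denominator is zero.

```python
# Sketch of precision, recall, F-measure, MCC, and accuracy from a binary
# confusion matrix, with None standing in for Weka's "?" (zero denominator).
import math

def metrics(tp, fp, fn, tn):
    def safe(num, den):
        return num / den if den else None   # undefined -> "?" in Weka output
    precision = safe(tp, tp + fp)
    recall = safe(tp, tp + fn)
    f1 = (safe(2 * precision * recall, precision + recall)
          if precision is not None and recall is not None else None)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe(tp * tn - fp * fn, mcc_den)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f1, mcc, accuracy

print(metrics(40, 10, 5, 45))   # ordinary case: all five measures defined
print(metrics(0, 0, 5, 95))     # no positive predictions: precision is "?"
```

In the second call the classifier makes no positive predictions, so tp + fp = 0 and precision (and hence F-measure and MCC) are undefined, which is exactly the situation that produces the ''?'' entries in Tables 9, 11, and 12.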
The findings when calculating the error rates may be thrown off. Similarly, with varied datasets, the comprehensive approaches may not be able to provide improved predictions. Following that, a thorough examination was carried out on a dataset taken from the Zenodo repository to determine the performance of the approaches used.

He received the Ph.D. degree in Pakistan. He is currently working as an Assistant Professor of computer science with the City University of Science and Information Technology, Peshawar, Pakistan. His Ph.D. research is user/group modeling on smart TV for enhancing personalization services in general and recommendations in specific. He has published several papers in international journals and conferences. His research interests include software engineering, recommender systems, user modeling, group modeling, smart TV, ubiquitous computing, web mining, search engines, augmented reality, and mobile-based systems for people with special needs.

INAYAT KHAN received the Ph.D. degree in computer science from the Department of Computer Science, University of Peshawar, Pakistan. His current research is based on the design and development of context-aware adaptive user interfaces for minimizing drivers' distractions. His research interests include lifelogging, healthcare, deep learning, ubiquitous computing, accessibility, and mobile-based assistive systems for people with special needs. He has published several papers in international journals and conferences in these areas.