Queueing-Based Simulation for Software Reliability Analysis

As modern software system is growing in size and complexity, the customer expectations for software quality have become higher. In the past, many software reliability growth models (SRGMs) were proposed and they are helped to evaluate the quality of developed software. It is worth noting that some of SRGMs can be used to model the fault detection process (FDP) and the fault correction process (FCP) through an infinite server queueing (ISQ) system or a finite server queueing (FSQ) system. However, it can also be found that most ISQ and FSQ models were developed on a first come first served basis. In this paper, we propose to use the queueing-based simulations to describe the behavior of FCP and assess the software reliability instead of using model-based approaches. Our proposed queueing-based simulation techniques and simulation procedures will be able to thoroughly investigate the FCP and easily provide system performance information estimated based on the staffing level, the average response time, and the average waiting time. Numerical examples based on three real failure data are given and discussed. Our experiments show that the proposed simulation procedures obtain a good prediction capability for software reliability. We expect that the proposed methods can provide effective information for software developing management and help decision makers in resource allocation and cost control.


I. INTRODUCTION
In modern society, software is already an indispensable part of our lives. To assess software quality, ISO/IEC 9126 is a useful reference [1]. The quality model specifies six characteristics including Functionality, Reliability, Usability, Efficiency, Maintainability, and Portability. Among these characteristics, reliability is generally regarded as a key factor in software quality evaluation. Presently, many software reliability growth models (SRGMs) have been published and are used to evaluate the quality of developed software. Reference [2], [3], [4], [5], [6], [7], [8], [9], [10]. In general, SRGMs are formulated in terms of random processes [3]. SRGMs can typically be used to help managers truly understand the current project status. But Musa [5] once argued The associate editor coordinating the review of this manuscript and approving it for publication was Claudia Raibulet . that many research studies of the software reliability modeling that follow are either theoretical or based on small-scale projects or limited failure data, so they must be generally viewed as heuristic and not yet ready for real-world use.
In the past, some of SRGMs used the infinite server queueing (ISQ) or finite server queueing (FSQ) systems to model the fault detection process (FDP) and the fault correction process (FCP) [11], [12], [13], [14], [15]. It can also be found that most ISQ and FSQ models were developed on a First-Come-First-Served (FCFS) basis. But practically, high priority faults should be fixed quickly to minimize their impact on software development and testing. But it also has to be noted that the fault correction time should not be ignored. Mockus et al. [16] reported that the high-priority faults should be corrected faster than the low-priority faults. Moreover, Zhang et al. [17] analyzed close source software and collected VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ its failure data. Their statistics demonstrated that bugs with the higher priority are fixed faster, which means that the bugfixing time is shorter. Priority scheduling [18], [19] is an operating system process scheduling algorithm and can be generally classified as preemptive or non-preemptive. Based on this concept, Lin and Huang once [9] proposed a preemptive priority queueing (PPQ) model that considers both a finite number of debuggers and different priority levels. In actuality, in addition to above model-based approaches, the discrete-event simulation offers an alternative to analytical models as it can represent the impact of different strategies that may be used during testing [4]. For example, rate-based simulation approach was proposed to relax certain unreasonable assumptions that are common in model-based approaches [4].
Based on our past studies [9], in this paper we further propose queueing-based simulation techniques and simulation procedures to analyze the behavior and reliability of the PPQ model. The proposed simulator can describe the process of fault correction in more detail and optimize various system parameters such as staffing level, average response time, and average waiting time. Some experiments based on real open source software (OSS) and closed source software (CSS) failure data are presented and discussed. We have also presented an assessment tool called R-ViSim, which is currently under development, to automate the simulation process. Practically, developers and project managers can use the proposed simulation procedure to estimate workloads that are similar to the actual situation and allocate appropriate human resources.
The remainder of this paper is organized as follows. In Section II, we give a brief review of model-based, queueing-based, and simulation-based software reliability analysis. In Section III, we will propose two queueingbased simulation procedures described in C-like programming language. The components in the queueing-based simulation procedures will be explained and discussed in detail. In Section IV, we apply three real data sets collected from OSS and CSS to discuss and assess the processes of fault detection and removal, and analyze the experimental results. We also address the threats to validity and answer some research questions in Section IV. Finally, conclusions and future research recommendations are offered in Section V.

II. RELATED WORKS A. MODEL-BASED SOFTWARE RELIABILITY ANALYSIS
In the past, a lot of SRGMs were developed to estimate and predict the reliability of developed software systems [3], [4], [5], [6]. Some models assume that there is a finite and fixed number of faults in system while others assume that an infinite number [20]. Some models need the exact time in between each failure found in testing [21], while others only require the number of failures found during any given time interval [22] (i.e., a day or a week). Some SRGMs have been are commonly used in the field of software reliability modeling. We will provide a brief review in the following.

1) THE GOEL AND OKUMOTO (G-O) MODEL
Goel and Okumoto [23] once proposed an exponential-type model, which describes a software detection phenomenon. The Mean Value Function (MVF) of the GO model is given by where a is the expected number of faults that will be eventually detected, and b is the faults detection rate per fault at an arbitrary testing time.

2) THE DUANE MODEL
This mode (also known as the Weibull process model) assumed that the MVF satisfies [22] where k and b are constant parameters which can be estimated using collected failure data.

3) THE YAMADA DELAYED S-SHAPED (DSS) MODEL
Yamada et al. [24] proposed a delayed S-shaped SRGM to describe the fault detection process. The observed growth curve of the cumulative number of detected faults is S-shaped. The MVF of the DSS model is given by It is a two-parameter S-shaped curve with parameter a denoting the number of faults to be detected and b corresponding to the failure detection rate (and the fault isolation rate).

4) THE INFLECTED S-SHAPED (ISS) MODEL
This model was proposed by Ohba [25] and it described a software failure detection phenomenon with a mutual dependence of detected faults. The MVF of the ISS model is where a is again the total number of faults to be detected, while b and ψ are the failure detection rate and the inflection factor, respectively.

5) THE LOGISTIC AND GOMPERTZ GROWTH CURVE MODELS
The Logistic and Gompertz growth curve models were also commonly used to estimate the fault content of developed software [26]. The expected cumulative number of detected faults at time t for the logistic growth curve model [27] can be described by The MVF of the Gompertz growth curve model [27] is given by where b and k are constant parameters which can be estimated by fitting the failure data. Notice that the parameter a in the above-mentioned models may typically be interpreted as the expected number of initial faults in the software.

6) TWO-ERROR-TYPES MODEL
Yamada et al. [28] proposed a modified exponential SRGM which assumes that the software system contains two types of faults, type I (which are easy to detect) and type II (which are difficult to detect). It is assumed that the faults detected early in testing are different from those detected later on. This model has a mean value function as follows: where p i for (i = 1, 2), denoted the proportion of type i errors in the software system (p 1 +p 2 = 1), a is the expected number of faults and b i is the error detection rate of type i errors per error (per unit time).
Debugging is not an easy thing. Most research has assumed that when faults are detected during testing, they will be removed immediately and successfully in debugging. In reality, the time required to remove the faults cannot be ignored [11], [29], [30], [31]. Kapur and Grag proposed a SRGM model for describing error removal [30]. In this SRGM, it is assumed that for a failure the detection of the faults causing the failure also results in the detection of a proportion of the remaining faults, without those faults causing any failure. Let m r (t) be the number of faults detected by time t. We then obtain where a is the total fault content of the software, b is the failure rate, and d is the error-detection rate of additional detected errors. It is noted that Wu and Huang [10] once demonstrated how to derive mathematical expressions from the computational methods of deep learning models and how to determine the correlation between them and the mathematical formula of SRGMs. They used the back-propagation algorithm to obtain the SRGM parameters. However, the foregoing discussion showed that most studies assume that software reliability growth behavior follows a non-homogeneous poisson process (NHPP), based on collected software failure data. It is worth noting that Hou [33] reported that before applying the failure data to the SRGMs, whether the dataset is of the NHPP type and whether it is used correctly by the NHPP-based SRGM should be explained. On the other hand, Kanoun and Lapire [4] also observed that the use of NHPP-based SRGMs during the early stages of development and validation is much less convincing.
Additionally, Liu and Kang [34] once deduced an imperfect debugging software belief reliability growth model using the uncertain differential equation under the framework of uncertainty theory, and investigated the properties of essential software belief reliability metrics. Garg et al. [35] proposed a entropy-combinative distance-based assessment (CODAS-E) method and presented to select and rank SRGMs based on multiple performance indexes. Nafreen and Fiondella [36] proposed a framework for SRGM possessing a bathtub-shaped fault detection rate and derived stable and efficient expectation conditional maximization algorithms to fit the models. Sabnis and Joshi [37] ever proposed an architecture to enhance, optimize and validate software reliability using machine learning techniques. Yang et al. [38] proposed a quantitative reliability evaluation process and method of network system based on discrete event simulation combining software and hardware.

B. QUEUEING-BASED SOFTWARE RELIABILITY MODELING
It is worth noting that most of the traditional SRGMs are usually based on the same assumptions. Yet some of the assumptions are unreasonable, particularly regarding the development of OSS, such as with perfect debugging and immediate debugging. For example, in practice, the assumption about perfect debugging may not be reasonable. This assumption states that when a failure is detected, developers correctly remove the corresponding fault without introducing new faults.
In reality, developers generally have the experience that as they remove a fault, they at times create other new faults.
The assumption which hypothesizes perfect debugging is not legitimate, because the debugging process is a complex activity which includes locating and removing the relevant faults [39]. Therefore, the more complex the fault that developers want to remove is, the more complex the debugging process to be performed will be. For example, Raymond [40] once reported that beta testing plays a key role in the testing of OSS, hence the test team usually is not the same as the development team in OSS. Thus, once a fault is detected, developers of OSS usually need more time to communicate with testers for removing the fault. The above situation makes debugging time lag occur more easily in OSS.
In the past, some researchers once proposed to use Infinite Server Queueing (ISQ) and/or Finite Server Queueing (FSQ) approaches to model the FDP and the FCP [9], [11], [14], [15]. Basically, both ISQ and FSQ approaches are based on queuing theory to make projections on software reliability. In general, a queueing system can be described as customers arriving for service, waiting for service if it is not immediate, and leaving the system after being served [41]. We observed that some characteristics of this theory are similar to those of software engineering: • Arrive rate of customers: The concept is similar to the fault detection rate of the software system. If the probability distribution of the arriving process is timeindependent (the arrival rate does not change with time), VOLUME 10, 2022 it is defined as stationary. If the probability distribution is not time-independent, it is non-stationary.
• Service time: Similar to debugging time, which is the average time taken for debugging by a debugger in the queueing system. Similar to arrivals, it can be stationary or non-stationary.
• Number of service channels: The service channels may be viewed as the debuggers who can analyze and process the detected faults concurrently.
• Queue discipline: The rule by which faults are selected for debugging when a waiting queue has formed. The rule can be First-Come-First-Served (FCFS), Last-Come-First-Served (LCFS), random, or a priority scheme. The standard shorthand for describing the queueing system is Kendall. It consists of a series of symbols and slashes such as A/B/X /Y /Z , where A represents the arrival time distribution, B represents the probability distribution of the service time, X represents the number of total servers in system, Y is the limits of the system capacity, and Z is the rule for queue discipline. A and B in shorthand can be an exponential distribution (M ), a deterministic distribution (D), a type-k Erlang distribution (E k ), a phase type distribution (PH ), or a general distribution (G). The service rule Z can be FCFS, LCFS, random selection for service (RSS), priority (PR), or general discipline (GD). Y and Z are usually omitted if there are no constraints on the queue length and the queue discipline is FCFS [41].
In the past, Lin et al. [9] once proposed a preemptive priority queueing (PPQ) model that considers both a finite number of debuggers and different priority levels. Inoue and Yamada [15] also proposed an ISQ model considering the time distribution of the fault-isolation process based on DSS models. This model can easily describe and analyze fault detection during the actual testing phase and expresses several NHPP-type models as special cases. Gokhale and Mullen [42] have presented the multi-priority queueing models to estimate the mean resolution time resolution of faults with different severity levels. They found that the priority model mixed non-preemptive priority and preemptive priority principle is suitable to describe the defect data. Kapur et al. [43], [44] and Dohi et al. [45] also discussed how to usethe ISQ model to describe software development behavior.
Furthermore, Zhang [46] incorporated detection and correction efforts into the fault detection and correction processes, respectively. Huang et al. [47] proposed an extended ISQ model with multiple change-points to estimate the software reliability. Zhang et al. [48] showed how to apply ISQ models with testing effort functions (TEFs) to model the FDP and FCP. Their proposed model with an Exponentiated Weibull TEF and a Logistic TEF can consider the influence of resources on the software debugging phase, and enhance the assessment of reliability.
However, Mockus et al. [16] found that in Apache and Mozilla, bugs with higher priority are fixed faster than bugs with lower priority. Zhang et al. [17] analyzed an IT management product for enterprise customer and collected its failure data. They calculated the time needed to fix the bugs and found that bugs with higher priority are fixed faster (as indicted by mean values). In the real world, the debugging team to fix the faults based on their severity and priority.

C. SIMULATION-BASED METHODS
Over the last few decades, simulation has proven to be an effective approach for analyzing software systems. It can also be used to address a variety of issues, such as management of software development, improving software processes, or training software project management. Simulation can provide an alternative method for investigating software reliability because it can model a wider range of reliability phenomena than mathematical analysis [4]. Kellner's study [49] could be the first state-based simulation applied to software engineering processes. Later on, Raffo and Kellner [50] discussed some of the important empirical issues that arise in software process simulation modeling, and proposed the software process simulation model and compared the simulated results to the real-world data. Raffo et al. [51] once addressed some relevant aspects of the multifaceted relationship between empirical studies and the building, deployment and usage of software process models.
Gokhale et al. [52] developed simulation procedures which may be used to assess the impact of individual components on the reliability of an application in the different repair strategies during testing. It can provide project owners with a range of operational configurations to meet the desired reliability. Rus et al. [53] described a process simulator to assist software project planning in the software development program. Using this discrete event model to construct a software reliability prediction model for a realistic project. Lin and Huang [54] once proposed simulation procedures based on queueing theory to describe and explain possible debugging behavior in the software development process. The proposed methods can help project leaders to effectively assess the appropriate staffing level for the debugging team. Juan et al. [55] developed a Java-based simulation software, J-SAEDES, which can estimate the reliability and availability of time-dependent computer systems and networks, and identify those components which play a critical role in system reliability and availability.
Additionally, Gokhale et al. [56] proposed a rate-based simulation framework that incorporates explicit debugging activities into SRGMs conducted according to different debugging policies. Their simulation framework obtains realistic software reliability estimates and determines the optimal software release time. Antoniol et al. [57] incorporated the queue theory and stochastic simulation to explore a real system and to evaluate its cost, risk, and staffing levels. Fan et al. [58] constructed a defect removal process simulation based on finite independent queues with different capacities and loadings. The simulation takes into account the limitations on developers and the differing abilities of developers. It can also provide useful and important information, such as the duration of defect removal, the use of each developer in the defect removal process, and the rate of removed defects at a planned time.
Chang et al. [59] applied the express-queueing theoretic approach to model the fault correction process. All detected faults will be classified and dispatched into either an express queue or regular queue to reduce the mean resolution time. Their experiments show that the performance and efficiency of software debugging processes can be improved. Lin and Li [60] developed a new rate-based queueing simulation framework for open source software (OSS) reliability assessment. It can support the optimal release time decision and assist the OSS and CSS project managers to estimate the number of devoted core contributors. Shu et al. [61] proposed a simulation-based approach to easily analyze the reliability of fault tolerant web services and the execution details of different fault tolerant strategy. Thus, the developers can determine an appropriate strategy for different constraints. Nakahara et al. [62] also proposed a simulation model of software quality assuranc to quantitatively demonstrate the positive effect of adding quality assurance effort especially in early phases of software development. Note that their proposed model can represent the relationship among the number of bugs in each phase, the amount of quality assurance effort, the expected number of detectable bugs and the amount of bug fixing effort.

III. SIMULATION-BASED FRAMEWORK FOR SOFTWARE RELIABILITY ESTIMATION
From above-mentioned studies, we can see that there is no single model or method that can be universally applicable to all the situations [4]. Here we will show an alternative approach to modeling the software reliability. In addition to our proposed queuing-based model [9], we find that simulation procedures can also be used to investigate the software fault correction process and to estimate system performance measures such as the average debugging time and waiting time. Specifically speaking, the queueing-based simulation approaches can relax certain unreasonable assumptions which are common in model-based approaches. In order to mimic the actual situation of FCP, we incorporate the concept of preemptive priority queueing into the rate-based simulation procedure. In this section, the details of the debugging behavior are discussed and analyzed using the fault priority level.
The assumptions of the queueing-based model with preemptive priority queueing (PPQ) policy are as follows [9], [63], [64], [65], and [66]: 1) The software is subject to failures at random times caused by the manifestation of remaining faults in the system.
2) The mean number of detected faults in the time interval (t, t + t) is proportional to the mean number of remaining faults in the system. 3) The detected faults will be put into the waiting queue according to a Poisson process with detection rate (λ). Additionally, the detected faults are fixed by a finite number of debuggers (c); The fault correction time for each assigned debugger is exponentially distributed with correction rate (µ). 4) There are two different types of faults: high-priority faults and low-priority faults. High-priority faults have a preemptive ability for service over low-priority faults.
If the two faults have the same priority, they will be served according to their order in the waiting queue. 5) If all debuggers are busy with high-priority faults, a newly detected high-priority fault follows the FCFS rule and waits at the tail of the same priority faults in the waiting queue. 6) The waiting time and correction time of fault(s) are mutually independent. Fault detection activity continues while faults are removed, and fault correction does not affect the detection process. 7) Fault correction time is non-negligible. When a fault is fixed, it will not introduce a new fault.
Based on the above assumptions, we will be able to obtain a queueing model that describes the fault removal process, M /M /c/∞/PR. Figure 1 shows the rate transition diagram of proposed queueing-based model with PPQ policy. To implement the procedures, a C-like programming language will be used, but there are many possibilities. Figure 2 shows the flowchart of our proposed queueing-based simulations. Note that each iteration in the procedure includes three steps. Note that the staff allocation, the first step, assigns a fault to a free debugger. The second step is used to determine whether a fault is detected according to the failure data set. The end of an iteration is the step of fault correction, which checks whether the faults are completely corrected. The three steps will be repeated until the time for the simulation procedure runs out of the total execution time.

A. PROCEDURE #1: THE SIMULATION PROCEDURE FOR THE NON-PREEMPTIVE PRIORITY SYSTEM
The flowchart of the Procedure #1 is shown in Figure 3. Additionally, the algorithm shown in Figure 4 presents the non-preemptive debugging simulation procedure. Under nonpreemptive scheduling, once a process goes into operation, it will not be interrupted until completion. For example, developers often practically deal with higher-priority faults more rapidly than lower-priority faults. The Procedure #1 accepts three parameters as inputs: the total execution time, defined as stop_time, the length of time for each iteration, dt, and staffing_level, the total number of debuggers in the system. Further, we split the execution time into a great number of iterations. Thus, the probability of multiple events occurring within each iteration is negligible [4], [7], [52].  Note that current_time is a variable representing the current cumulative simulation execution time. After the completion of each iteration, the time spent on iteration (dt) will be added to the current execution time until it reaches the total execution time (stop_time). As the current execution time increases, our system updates the state of faults (processing or waiting) to record the time cost in each component. We denote the number of busy debuggers at present by working_server. The Set data structure correcting_set is used to collect faults being processed. Furthermore, the high_priority_waiting_queue and low_priority_waiting_queue are the Queue data structure used to store the waiting high priority faults and low priority faults, respectively. Based on the assumptions described in this section, Procedure #1 was developed and implemented in three phases: staff allocation, fault detection, and fault correction.

1) PART 1 STAFF ALLOCATION PHASE
In this phase, we firstly check whether there are available debuggers in the system by comparing the number of busy debuggers (working_server) with all debuggers (staffing_level). If unoccupied debuggers actually exist, we check whether there are faults in the waiting queues according to their priority orders (high priority faults first). For example, if there is a waiting high priority fault, the fault will be assign to a debugger and its status will change from waiting to processing. If the number of waiting high priority faults is zero, then we check whether there are low priority faults in the queue. In short, the high priority waiting queue takes precedence over the low priority waiting queue. These actions are described in lines 9-18 of Figure 4, which will repeat until there are no waiting faults or there are no available debuggers.

2) PART 2 FAULT DETECTION PHASE
Note that the detection time of faults is based on the real failure dataset we collected from OSS and CSS projects. Each fault will enter the simulation system based on timestamp ordering (time detected). Therefore, in each iteration, the function occur() compares the timestamps with the current execution time (current_time) to check whether any fault is detected. When the occur() function returns true, the priority of the detected fault will be determined and handled independently. For example, if a low priority fault is detected, it will be added to the low priority waiting queue, and assigned to a debugger if there are available debuggers. These activities are shown in lines 20-37 of Figure 4.

3) PART 3 FAULT CORRECTION PHASE
At the beginning of this step, each fault being processed will be checked by the leave() function which is used to determine  whether the fault will be corrected. The probability that a debugger successfully corrects the fault in the time interval (t s , t s + t), given that it has already been in progress for time t s is [54]: where µ represents the fault correction rate, and T s represents the time spent on fault correction. Thus, we can generate a random number x which is the probability of correcting a fault in a unit time and compared it with µ × t after invoking the leave() function. When x is greater than µ × t, the devoted debugger will successfully correct the fault in this iteration. Otherwise, a fault which cannot be corrected will be rechecked in the next iteration. These actions are given in lines 39-44 of Figure 4.

B. PROCEDURE #2: THE SIMULATION PROCEDURE FOR THE PREEMPTIVE PRIORITY SYSTEM
Here we will further discuss and present the simulation procedure of the preemptive priority system in which high priority faults can preempt low priority faults. The flowchart of the simulation procedure is depicted in Figure 5 and the algorithm of this procedure is shown in Figure 6. Note that Procedure #2, constructed on the basis of Procedure #1, also accepts three parameters as inputs: VOLUME 10, 2022 current_time, dt, and staffing_level. Owing to the preemptive property of the Procedure #2, the Queue data structure, pre-emption_waiting_queue, is used to place the preempted faults (low priority faults). Based on the assumptions described in this section, Procedure #2 was developed and implemented in three phases: staff allocation, fault detection, and fault correction.

1) PART 1 STAFF ALLOCATION PHASE
First, we check the available debuggers by comparing work-ing_server with staffing_level. The activity is the same as the staff allocation of the Procedure #1. Based on our assumptions in Section III, preempted faults are handled prior to waiting low priority faults and faults with high priority take priority over preempted faults. Therefore, the available debuggers will be allocated according to the order of priority: waiting high priority faults, preempted faults, and waiting low priority faults. The details of actions for staff allocation are described in lines 10-23 of Figure 6, which is repeated until there are no waiting faults or no available debuggers.

2) PART 2 FAULT DETECTION PHASE
When a fault is detected, we then check what kind of priority it has. For instance, if a high priority fault enters the system and no debuggers are available in this iteration, we check whether there is a low priority fault being processed in the correcting_set. If there are low priority faults which are being processed, the preemption() function will replace the first entering low priority fault in the correcting_set with the high priority one, and the preempted fault will be sent to the preemption_waiting_queue. Otherwise, the newly detected high priority fault will be placed in the high_priority_waiting_queue. Details of the preemption procedure are shown in Figure 7 and the fault detection part is described in lines 25-44 of Figure 6.

3) PART 3 FAULT CORRECTION PHASE
In each iteration, the correcting_set will be checked, and the faults that are successfully corrected by debuggers will be removed from correcting_set and the number of busy debuggers will decrease by one. When the currently processed faults are corrected, the occupied debuggers will be available at the same time. These actions are described in lines 46-51 in Figure 6. If a fault cannot be corrected successfully in the current iteration, it will be checked in the next iteration.

IV. NUMERICAL EXAMPLES A. SELECTED DATA SETS AND MODEL'S COMPARISON CRITERIA
In this paper, the failure data used for the evaluations in this paper are composed of three sets of data and come from three sources [9], [67], [68]. The first data set (DS1) was collected from the public bug-tracking system, Bugzilla [67]. The system contains the shared components used by Firefox and other Mozilla software. The second data set (DS2) was also obtained from Bugzilla [68]. The third data set (DS3) was collected from the Coretronic Corp. It is noted that in software reliability engineering, the overall failure data sets typically fall into two types: time domain data and interval domain data [2], [3]. Practically, the time domain data provides better accuracy in the parameter estimates, but it also inevitably involves more data collection efforts and computations than the interval domain approach [2], [3], [4]. In our experiments, these three data sets were collected in interval domain format. Table 1 displays a sample of the actual failure data. Additionally, Table 2 shows the data source and system characteristics for each set of failure data.
On the other hand, Figure 8 depicts the cumulative number of detected and corrected high/low priority faults versus the time of the three data sets. The difference between detected and removed faults are the open-remaining (detected but in the waiting queue) faults. They clearly show that the fault removal time is not negligible because the total number of removed faults obviously lags behind the total number of detected faults.
In this paper, except for our proposed simulation-based method, five selected SRGMs are selected for performance comparisons. They are the GO model [4], the DSS model [4], the ISS model [4], the Gompertz model [8], and the PPQ model [9]. The method of maximum likelihood estimation is used to estimate the parameters of all models [2], [3], [4], [8].
The following criteria are used to evaluate ll selected models.
1) The Mean Square Error (MSE) is defined by [4], [8], and [9]: where m i is the cumulative number of detected faults in a given time interval (0, t i ], m(t i ) is the mean value function, i.e., the expected number of software failures by time t i , n is the size of the selected data set, and θ is the degree of freedom. The lesser the MSE, the better the model performance.
2) The Theil's U Statistic (TS) includes two parts, denoted U 1 and U 2. They are defined as follows [69], [70]: and A low value of U 1 and U 2 means that the method provided a more accurate prediction.

3) The Coefficient of Determination R 2 is defined as [4]
: A higher R 2 value indicates the model is a good fit.

B. DS1
Here we will apply the PPQ simulation procedure described in Section III to DS1. The simulation procedures were implemented with Python and the experiments executed on an AMD FX-6350 machine with six 3.9 GHz cores and 8GB of memory, running in a Windows 10 environment. Note that there are two steps in our proposed simulation procedure algorithm. The first step is to calculate the parameters from the data set. The second step is to conduct the simulation and average the experimental results. Using DS1, we simulate the proposed PPQ model for a period of 45 weeks and set each time unit iteration as 0.001 week. Based on our past study in [9], first, Table 3 shows a summary of the DS1 parameters used in the simulation procedure for DS1. Note that c, λ h , λ l , and µ are the parameters that stand for the number of debuggers, the high-priority fault detection rate, the low-priority fault detection rate, and the fault correction rate, respectively [9]. From DS1, we obtain the average detection rates of high and low priority faults, 1.244 (= 56/45) and 0.555 (= 25/45) faults per week, respectively. To ensure   system stability (ρ = (λ/cµ) < 1), we assume that the service rate for each debugger µ is 0.201 faults per week, and the number of debuggers c is 10. Table 3 gives a summary of the DS1 parameters used in the simulation procedure for  DS1. We repeat the simulation procedure for 1,000 times and calculate the average of the experimental results. Table 4 summarizes the performance comparisons of selected models for corrected faults of DS1. The proposed simulation-based method is the best for high priority faults of DS1 in terms of MSE, R 2 , TS-U1, and TS-U2. As for the low priority faults of DS1, the proposed simulation-based method almost performs better than the traditional SRGMs. The PPQ model only shows a significantly better performance than the simulation-based method based on the values of MSE, R 2 , and TS-U1. In general, we can see that the proposed simulation-based method almost gives the lowest values for TS-U1 and TS-U2 compared to traditional SRGMs for DS1. Obviously, we can see from Table 4 that both the proposed simulation-based method and PPQ model provide a better fit for DS1 and predict future fault correction processes well. Since real life debugging teams deal with high priority faults first, it can be seen that the preemptive mechanism better simulates the real world situation. Figure 9 illustrates the cumulative corrected faults of the actual failure data DS1 versus the simulated failure data, which are generated by the procedures described in Section III-B. We can thus investigate the profiles of the actual failure data and the simulation results. We can see from Figure 9 that there is an obvious difference between the actual data and the simulation results in high priority faults between the 2nd and 12th weeks. The simulation results exhibit similar results for low priority faults between the 2nd and 13th weeks. This probably because we only use part of the actual data, ignoring the previous faults being processed and the open-remaining faults. Overall, the approximation of the curves illustrating the cumulative number of corrected faults are close to each other in Figure 9. Finally, the comparison results of performance measurement between model-based and simulation-based methods are highlighted in Table 5. The simulation results and the performance estimation of the PPQ model are very close in average queue length, average waiting time, and the average response time, and the relative errors are within −0.14%. The results show that our proposed simulation procedure is workable to predict future behavior well.

C. DS2
Similarly, we repeat the simulation experiment steps based on the failure data of DS2. Thus, the average detected rate of high priority faults and low priority faults are 2.739 (=126/46) and 5.021 (=231/46), respectively. To ensure system stability (ρ = (λ/cµ) < 1), we assume that the service rate for each debugger µ is 0.24 fault per week, and the number of debuggers c is 40. In Table 6, we show a summary of the parameters used in the simulation procedure for DS2. Table 7 summarizes the performance comparisons of selected models for corrected faults of DS2. The proposed simulation-based method is the best for high-and low-priority faults of DS2 in terms of MSE, R 2 , and TS-U1. As for the low priority faults of DS2, the proposed simulation-based method almost performs better than the traditional SRGMs. The PPQ model only shows a significantly better performance than the simulation-based method based on the values of MSE, R 2 , and TS-U1. Although the ISS model has smaller values of TS-U2 compared to the proposed simulation-based method, but the proposed simulation-based method still shows a significantly better performance than other SRGMs and the PPQ   model based on the value of TS-U2 for high priority faults of DS2. Again, we can see from Table 7 that the proposed simulation-based method and PPQ model provide abetter fit for DS2 and are generally more accurate than the traditional SRGMs.
In Figure 10, we plot the actual and simulated failure data for DS2. We can see from Figure 10 that the simulation results for the high priority faults and low priority faults have a small difference with the actual data and the simulation data in the period from the 4th week to the 10th week. This is probably because we ignore the previous faults being processed and the open-remaining faults. Overall, the approximation of the curves depicting the cumulative number of corrected faults are close to each other in Figure 10. Table 8 shows the comparison results between model-based and simulation-based methods for DS2. It can be found that the simulation results and measurements from the PPQ model are very close to each other in average queue length, the average waiting time, and the average response time, and the relative errors are within 0.08%. Note that for high priority faults the average queue length and average waiting time are zero. This is because the simulation results are too small to be measured. Thus, the relative error cannot be obtained. These experimental results indicate that our proposed simulation procedure is workable to predict faults in the correction process well for DS2.

D. DS3
Similarly, we repeat the experiment steps based on the failure data of DS3. Then, the average fault detection rate of high and low priority faults is 1.071(=30/28) and 2.607 (=73/28) fault per week, respectively. To ensure system stability (ρ = (λ/cµ) < 1), we assume that the fault correction rate for each debugger µ is 0.55 faults per week and that the number of debuggers c is 7. In Table 9, we give a summary of the DS3 parameters used in the simulation procedure. Table 10 summarizes the performance comparisons of selected models for corrected faults of DS3. The proposed simulation-based method is the best for low priority faults of DS3 in terms of MSE, R 2 , and TS-U1. As for the high priority faults of DS3, the proposed simulation-based method almost performs better than the traditional SRGMs in terms of MSE and TS-U2. In general, both the proposed simulation-based method and PPQ model provide a better fit for DS3 and are generally more accurate than the traditional SRGMs.
The cumulative number of actual failure data in the DS3 dataset versus the simulated failure data is illustrated in Figure 11. The simulation result corresponds well to the actual failure data. This may be because there are very few VOLUME 10, 2022  open-remaining faults in the past. In Table 11, we have highlighted the comparison results between model-based and simulation-based methods for DS3. From Table 11, we see that the simulation results and measurements from the PPQ model are very close to each other in average queue length, the average waiting time, and average response time, and the relative errors are within 0.11%. Note that the high priority fault has an average queue length and average waiting time of zero. This is because the simulation results are too small to measure. Thus, the relative error cannot be obtained. These experimental results show that the proposed simulation procedure predicts future fault correction processes well using the DS3 dataset.

E. RELIABILITY VISUALIZATION SIMULATOR
In order to visualize the simulation results, we have developed an assessment tool called the Reliability Visualization Simulator (R-ViSim) to provide quantitative estimation based on the PPQ model. R-ViSim's major functions are listed below.
• PPQ modeling. R-Visim can calculate the performance measures of the PPQ model [9] such as average queue length, average waiting time, and average response time and the parameter settings. • PPQ simulation. Based on real failure data, R-ViSim can simulate the fault correction process using the preemptive priority queueing simulation procedure.
• The results display. R-Visim can graphically display the cumulative corrected faults of the actual failure data versus the simulated failure data, which are generated by the PPQ simulation. The measurement derived from the PPQ modeling is filled in the blank of simulation results.
The high level architecture of developed R-ViSim is shown in Figure 12, which depicts a combination of several submodules. R-ViSim was implemented in the Python language and using the Python 2D plotting library (Matplolib). As we can see from Figure 12 that R-ViSim first reads a failure data file (including the detection time and priority of each fault) for rate-based simulation, and R-ViSim allows the user to set the simulation parameters. Once the parameters have been entered, the user can start the simulation by clicking the Simulate button on the menu bar. The proposed model and simulation results are then reported and displayed. Figure 13 shows a typical window dump from R-ViSim. The high and low fault detection rates, fault correction rate, and the number of debuggers are explained and illustrated  in Section IV-B -IV-D. The length of iteration time can be set to a constant. The right hand side shows the workspace in which the simulation parameters are displayed, while the left hand side shows the plots of these simulation results. The menu items on the top left hand side include the File option which allows for file operations, Reset parameters option which allows the user to reset all parameters, Simulate option which allows the users to start the simulation, Setting option provides options for setting all simulation parameters, and Help option would provide on-line assistance in case the user needs it.

F. EXPERIMENTAL VALIDITY
The degree of the validity of the experimental design is as important as the experimental results [71], [72]. Adequate validity indicates the experimental results or other findings of the study have a high ability to solve similar problems. In the following, we will briefly discuss the internal, external, and construct validity of the experiment, respectively.
Internal validity refers to whether the research design can correctly explain the research results or show the causal relationship between independent and dependent factors. First, the collection of failure data is one threat to our internal validity. In general, there are not so much failure data containing the information about the fault detection, the fault correction, the fault priority level, etc. In this paper, DS1 and DS2 were obtained from Bugzilla. DS3 was collected from a projector firmware project developed by Coretronic Corp. in Taiwan. Note that Bugzilla has an open-fault record and its dataset is widely used in software engineering research. Thus we would be able to use Bugzilla's dataset for doing research. Basically, the three failure data represent a wide variety of applications. Note that the potential inaccuracy of collected failure data is also a threat to internal validity. In actuality, the fault reports from the system could be duplicate, invalid, or unreasonable since OSS projects were usually included thousands of volunteer participants. In this paper, we have carefully inspected the collected data and eliminated the overlaps and invalid bugs from our data sets.
Additionally, the SRGMs we selected for performance comparison are another issue for the internal validity of these experiments. We use some traditional SRGMs introduced in Section II for comparison with our proposed method. These SRGMs have gained wide acceptance in field of software reliability modeling. We also implemented all selected models very carefully.
External validity focuses on the generalizability of the outcome of our work. In other words, whether the experimental results of the research can be applied to other situations. An important threat to the external validity of our experiments is the choice of data sets. In general, we would not be able to generalize our findings and experimental results to every published data set. However, in this paper we totally use three real failure data to evaluate the performance of our proposed method compared with some SRGMs. DS1-DS3 were collected from different OSS and CSS projects, which had different architectures and the different developer compositions. It is also worth noting that Bugzilla has for many years seen widespread use by many users and researchers. It would be more convincing if we perform the experiments based on real failure data collected from the commonly used OSS projects. This diversity allows us to have greater confidence in our experimental results. In this case, we would be able to reduce the threats to external validity.
Construct validity is concerned with whether the measurement can meaningfully reflect or access the underlying construct which is intended to be measured. Thus the construct validity of this study could be affected by the number of comparison criteria in the experiment. In order to check the performance of our proposed method, make a fairly comprehensive comparison with all selected models, and avoid bias, we use three criteria in our paper; they are: the Theil's VOLUME 10, 2022 U Statistic (TS), the Coefficient of Determination (R 2 ), and the Mean Square Error (MSE). It is worth noting that the TS cannot distinguish between under-or over-prediction, but the magnitude of error can be examined from the computed values of Ul and U2 [69], [70]. MSE is one of the criteria we adopted since it is the primary measure used for comparing prediction methods for a long time. In actuality, MSE is a very common tool for assessing the quality of global model. It is also noticed that the MSE was used to judge the Retrodictive ability [3], [4], [8]. On the other hand, the R 2 is a measure of the linear association between the two variables x and y, and the value is bound in the range of 0 to 1. Basically the three criteria may be independent of each other. But it has to be noted that how many of comparison criteria are satisfied and the examination of statistical properties of measures could be another topic of future research [73].

G. RESEARCH QUESTIONS
It is recommended to evaluate a case study by answering some research questions [74]. In this section, we will briefly present the experimental results that answer four research questions.

RQ1: Is there any data-format limitation for the proposed simulation method?
In the field of software reliability engineering, software failure data collections can be classified into two types: interval-domain format and time-domain format [2], [3], [4], [8]. The interval-domain data is characterized by counting the number of failures occurring during a fixed period while the time-domain data records the individual occurrence time of failures. In this paper, DS1-DS3 were collected in interval domain format. In the industry, failure data are usually treated as confidential information. Only few data sets with this feature are released for public access in the literatures. However, most of them belong to interval-domain format. To our knowledge, few time-domain data sets provide the failure correction data in the time-domain format. Considering SRGMs, some may be suitable for one specific format only, while others apply to the other format. If we want to use the failure data in time-domain format for estimation, our proposed simulation procedures will give users/developers more flexibility in meeting the time-domain format.
Q2: What if a more complicated queuing mechanism is used?
In this paper, based on some assumptions in Section III, we develop simulation procedures that describe the fault removal process, M /M /c/∞/PR. Basically, our proposed simulation procedures contain three steps. The first step is to assign a fault to a free debugger. The second step is used to determine whether a fault is detected according to the failure data set. The end of an iteration is the step of fault correction, which checks whether the faults are completely corrected. Besides the basic queueing characteristics, it is possible for the queueing-based simulation method to involve multistaging or feedback. It can be found that the debugging process depicted in this paper takes into account the fault detection and the fault correction. In fact, a debugging process can typically be subdivided into three steps: fault detection, fault isolation (identifying the location of the root cause of the problem), and fault correction [75]. If the debugger fails to correct an isolated fault for too long, the fault may be sent back for re-isolation (i.e. re-identifying the root cause). Through the use of a multistage queueing system with feedback, the fault isolation and re-isolation in debugging processes can be specified, which will approach reality more closely. It is true that a more complicated queueing system can handle more realistic situations, but further discussion of this subject is beyond the scope of this paper. We plan to study this part and publish our findings in the near future.
RQ3: What is the difficulty of the parameter setting in the simulation method?
In this paper, we have developed a R-ViSim tool to provide quantitative estimation and allow to visualize simulation results. After entering the parameters, user can start the simulation and see the results in a short time. Different parameter values in our proposed simulation method can be executed and compared with other models and method soon. Thus we will be able to overcome the difficulty of the parameter setting problem of this work and our results can be more easily and closely replicated.
RQ4: What are the benefits of using simulation method for software reliability estimation?
In addition to model-based approaches in Section II.A, the discrete-event simulation generally offers an alternative to analytical models as it can represent the impact of different strategies that may be used during testing [4]. Simulation approaches can relax certain unreasonable assumptions which are common in model-based approaches. It can be seen that our proposed simulator can describe the process of fault correction in detail and optimize various system parameters. Developers and project managers can use the proposed simulation procedure to estimate workloads that are similar to the actual situation and allocate appropriate human resources. Here we will discuss and illustrate how to apply both PPQ simulation procedure and/or PPQ model [9], to project management. Let's assume that the work efficiency of each debugger is equal for convenience of analysis. Due to the limits of paper size, here we will primarily choose DS1 to illustrate how the proposed method can help project managers accurately estimate the efficiency of fault corrections. We can follow similar procedures to make data analysis and discussion for DS2 and DS3.
From Section IV.B, we obtain the average detection rates of high and low-priority faults, i.e., λ h = 1.244 and λ l = 0.555 faults per week, respectively. To ensure system stability (ρ = (λ/cµ) < 1), we assume that the service rate for each debugger µ is 0.201 faults per week, and the number of debuggers c is 10. In this case, we obtain ρ = (λ h + λ l )/cµ = 0.895 as the percentage of faults worked on by each debugger during the testing phase. This analysis shows that we use almost all resources (about 90%), thus introducing risk into the project schedule. Obviously, if any incidents were to happen, such as a debugger's absence or a change in the requirements, a delay in the project schedule may occur. One way to save personnel resources is to delay the release time. week. The ratio ρ < 1, showing that all faults can be corrected by the time the software is released. Compared to not extending the schedule, the personnel resource savings are (100/2.01) × (2.01-1.61) = 19.9% That is to say, if 10 debuggers are participated, we can assign at most 2 debuggers to another project or make a better deal with a holiday that might occur within the development schedule.

V. CONCLUSION
Because the reliability of software is critical, software quality evaluation is important. Thus, among those factors typically used as measurements of software quality, software reliability is regarded as key. A large number of SRGMs have been proposed and discussed over the past three decades. But most of published SRGMs assumed that detected faults will be immediately fixed and/or removed during software testing and debugging. However, most of developers must spend time analyzing the root cause of faults in the real world.
Presently, queueing models have been shown to be useful in many applications. Some studies have shown that queueing theory and/or queueing model can be used to describe various engineering activities for software development, such as testing, debugging, maintenance, etc. In this paper, we propose to use the queueing-based simulation to describe the behavior of FCP and assess the software reliability instead of using model-based approaches. The proposed simulation-based method is therefore inspired by the process scheduling of operating systems and considers the priority levels of the faults. Each fault is assigned a priority level by the testers and faults with higher priority are corrected sooner than faults with lower priority. We thoroughly investigate the fault removal process considering the fault correction time and the finite debugging resources. Three real data sets collected from OSS and CSS are used to test the performance of our proposed method. Experimental results show that the simulation-based method gives a better fit to the observed data than traditional SRGMs and predicts future behavior well. A tool called R-ViSim is developed to automate the simulation task.
On the other hand, Figure 8 shows the difference between detected and removed faults are the open-remaining faults. Practically, the number of open-remaining faults is very important information for managers to allocate adequate debuggers during the software development. When tracking the trend of the number of open-remaining faults, the interval-domain data are generally more ideal than timedomain data. It should be noticed that the format conversion from time-domain to interval-domain can easily be done without any assumptions and precision loss [3]. Therefore, even though the failure data are collected in the time-domain format, it is better to transform them into the interval-domain format before estimating the staffing needs.
Our future works can be divided into three aspects. First, we plan to add more practical functions to our simulation procedure. These will include determining the optimal version-updating time of software systems and taking imperfect debugging into consideration, in order for project leaders to manage the software development process more efficiently. Second, during our research, we observed that debugging teams not only consider the priority of a detected fault, but also its severity. Critical severity faults cause greater damage to systems than low severity faults. Therefore, in future work, severity will be considered in the fault correction process to estimate and evaluate software reliability Lastly, we also plan to study another queuing mechanism. In this paper, our proposed queueing-based simulation procedure describes the single-queue scheduling mechanism. In the future, we plan to use the express queue service in the simulation procedure if the detected faults are not classified according to their level of severity due to some reasons. The detected faults of the software will be delivered to the express queue or allocated to the express queue debuggers if its required service time is below a certain value. Other faults will be delivered to the regular queue or allocated to the regular queue debuggers. In this case, the waiting time can be reduced.
JHIH-SIN LIN received the B.S. degree in computer science from the National Central University, Taoyuan, Taiwan, in 2015, and the M.S. degree in computer science from the National Tsing Hua University, Hsinchu, Taiwan, in 2017. He is currently an Engineer at Silicon Motion, Inc., Zhubei, Hsinchu. His current research interests include software reliability estimation and quality measurement.
CHIN-YU HUANG (Member, IEEE) received the M.S. and Ph.D. degrees in electrical engineering from the National Taiwan University, Taipei, in 1994 and 2000, respectively. He is currently a Full Professor with the Department of Computer Science and the Institute of Information Systems and Applications, National Tsing Hua University (NTHU), Hsinchu, Taiwan. He was with the Bank of Taiwan, from 1994 to 1999. He was also a Senior Software Engineer at Taiwan Semiconductor Manufacturing Company, from 1999 to 2000. Before joining NTHU, he was a Division Chief at Central Bank of China, Taipei, in 2003. His research interests include software reliability engineering, software testing, software metrics, software testability, fault tree analysis, and system safety assessment. He received the Ta-You Wu Memorial Award of National Science Council, Taiwan, in 2008. He has been on the Editorial Board of Scientific Programming, since 2017 and theJournal of Information Science and Engineering, since 2016. He is currently serving as an Associate Editor for the IEEE TRANSACTIONS ON RELIABILITY. VOLUME 10, 2022