Analyzing Software Rejuvenation Techniques in a Virtualized System: Service Provider and User Views

Virtualization technology has promoted the fast development and deployment of cloud computing, and is now becoming an enabler of the Internet of Everything. The virtual machine monitor (VMM), which plays a critical role in a virtualized system, is software and hence suffers from software aging after long continuous running as well as from crashes due to elusive faults. Software rejuvenation techniques can be adopted to reduce the impact of software aging. Although analytical model-based approaches for evaluating software rejuvenation techniques exist, none has analyzed both application service (AS) availability and job completion time in a virtualized system with live virtual machine (VM) migration. This paper aims to quantitatively analyze software rejuvenation techniques from the service provider and user views in a virtualized system deploying VMM reboot and live VM migration techniques for rejuvenation, under the condition that the aging time, failure time, VMM fixing time and live VM migration time all follow general distributions. We construct an analytical model by using a semi-Markov process (SMP) and derive formulas for calculating AS availability and job completion time. Through analytical experiments, we obtain the optimal migration trigger intervals for achieving the approximate maximum AS availability and the approximate minimum job completion time, so that service providers can make decisions that maximize the benefits of service providers and users by adjusting parameter values.


I. INTRODUCTION
Server virtualization (SV) and operating system (OS) virtualization technologies have been widely used in various fields. SV technology allows virtual machines (VMs) with different OSs (namely, different OS kernels) to run on a single physical machine (PM). Different from SV technology, OS virtualization technology enables a single OS kernel to support multiple isolated user-space instances [1]. PMs in cloud datacenters (CDCs) usually use SV/OS virtualization technology to be capable of constantly running online services and tolerating varying user workloads [2] with critical demands of high availability [3]. SV/OS virtualization technology is also explored for achieving dynamic network resource management in Network Function Virtualization, which is essential for deploying 5G networks [4]. In addition, with the fast development of Internet of Things (IoT) technology, many more delay-sensitive IoT applications, such as complex event processing and streaming video, need to be processed in Edge Computing (EC) [5], [6]. Live VM [7] or live container [8] migration can help achieve efficient allocation of resources like memory and network bandwidth in EC [9].
(The associate editor coordinating the review of this manuscript and approving it for publication was Lei Wu.)
Without loss of generality, this paper focuses on the SV technology for delivering services. That is, each service runs in its own independent VM [7], and live VM migration can then be used to migrate services between PMs, helping users receive high Quality of Service (a combination of service attributes used for evaluating services from the perspective of service providers [10]) when the PMs hosting them cannot work [11]. Live VM migration and service migration are used interchangeably in the rest of this paper.
The virtual machine monitor (VMM), which allows multiple VMs to share the same physical machine safely [12], plays a critical role in an SV-based system. However, it is software and is therefore subject to software aging [13] (a phenomenon of software performance degradation after long continuous running [14]) and to crashes due to certain elusive faults [15]. Software aging can degrade application service (AS) availability (defined as the percentage of normal service provision time for the system, in other words, the percentage of system available time) and thereby degrade user Quality of Experience (QoE). Here, QoE denotes the user's expected experience effect [16]. In this paper, AS is assumed to be a constantly running service, through which each user can execute his/her job. Amazon Web Services (AWS) Greengrass, as an instance of an SV-based system hosting ASs, has suffered from failures that affected AS availability [17], and promises to provide at least 99.9% of the normal monthly uptime percentage for each AWS region [18]. The degradation of AS availability can increase the overall job completion time (namely, the total time required to complete a job, including the downtime). However, maximizing AS availability does not always achieve the minimum job completion time, since the AS state affects the job processing rate. For example, the job processing rate is reduced when the running system is in software aging. It is therefore necessary to analyze both AS availability from the service provider view and job completion time from the user view, so that service providers can decide when to trigger live VM migration to maximize the benefits of service providers and users, respectively.
Software rejuvenation techniques (a proactive solution for preventing faults due to software aging) can reduce the impact of software aging [19]. They have already been offered in production clouds, such as Azure [20] and VMware ESXi [21]. Analytical modeling is an effective approach for evaluating software rejuvenation techniques in terms of AS availability and job completion time. There have been various analytical model-based analyses of job performance and/or availability degradation caused by software aging in CDCs. They often assumed that all the time intervals follow exponential distributions [22]-[27]. Moreover, while many studies focused on analyzing either AS availability or job completion time [28]-[30], only a few papers analyzed both AS availability and job completion time together in a system [31], [32]. In particular, none of the existing analytical models has analyzed both AS availability from the service provider view and job completion time from the user view in an SV-based system using live VM migration.
In this paper, we consider an SV-based system composed of hosts for executing a job and supporting live VM migration among the hosts. The VMM, running in every host, is software that is subject to software aging after long continuous running [13]. The system deploys VMM reboot and live VM migration techniques based on time-based rejuvenation after software aging detection. Aging time, failure time, VMM fixing time and live VM migration time follow general distributions. For such an SV-based system, we quantitatively investigate the impact of the software rejuvenation techniques on AS availability and job completion time. To analyze the state transitions of the SV-based system deploying VMM reboot and live VM migration techniques, we construct a semi-Markov process (SMP), which is used to evaluate the measures of interest. To the best of our knowledge, this is the first quantitative analysis of AS availability and job completion time in the aforementioned SV-based system. The main contributions are summarized as follows:
• We propose an SMP model for capturing the behaviors of a complicated SV-based system deploying VMM reboot and live VM migration techniques for rejuvenation. Our model captures the detailed aging process and considers the job processing rate during VMM aging, so that job completion time can be calculated more accurately.
• We derive formulas for calculating AS availability and job completion time in order to analyze software rejuvenation techniques quantitatively from the views of service providers and users.
• We conduct analytical experiments for analyzing AS availability and job completion time over a variety of system parameters. Analytical experiments are also carried out to determine migration trigger intervals for achieving the approximate optimal AS availability and job completion time.
The rest of the paper is organized as follows. Section II discusses related work. Section III describes the system considered in this paper and presents an SMP model for analyzing AS availability and job completion time. Section IV presents the results of analytical experiments. Finally, Section V draws the conclusion and discusses future work.

II. RELATED WORK
Recent years have witnessed significant efforts on analytical model-based evaluation of AS availability and/or job completion time; see [13], [22]-[31] and references therein.
Changa et al. [22] proposed a continuous time Markov chain (CTMC) model for analyzing VM survivability in a system where VM failover and live VM migration techniques were applied to improve service survivability. They also quantitatively compared the capability of rejuvenation techniques in an SV-based system in [13]. Okamura and Dohi [23] proposed a phase-expanded software rejuvenation model in order to investigate the interval reliability and solved it by reducing the model to a CTMC model. Rahme and Xu [24] presented an extended Dynamic Fault Tree model to calculate the system reliability and used the CTMC technique to verify the capability of their approach. Machida and Miyoshi [25] modeled the system with condition-based rejuvenation as an M/M/1 queue for the rejuvenation decision. Nguyen et al. [26] presented a Stochastic Reward Net model of a system under the live VM migration technique. They evaluated only AS availability, downtime and downtime cost. Torquato et al. [27] analyzed live VM migration based on warm-standby and cold-standby redundancy schemes. They constructed availability models by using Stochastic Petri Nets to evaluate the impact of live VM migration on system availability and power consumption. Note that the studies [13] and [22]-[27] assumed that all time intervals in the models followed exponential distributions. Our work relaxes this assumption by allowing aging time, failure time, VMM fixing time and live VM migration time to follow general distributions, in order to devise a more general model for correctly capturing system behaviors.
VOLUME 8, 2020
There were modeling-based studies [28]-[31] in which some time intervals followed general distributions. Machida et al. [28] captured the aging and rejuvenation behaviors of a server virtualized system by using SMP models. Ning et al. [29] evaluated AS availability and overall loss probability by using a Markov regenerative process. Based on an SMP, Loganathan et al. [30] studied the availability of a manufacturing system. These studies [28]-[30] investigated either AS availability or job completion time. While a few papers presented analytical models for evaluating both AS availability and job completion time in a virtualized system [31], [32], the models did not use the live VM migration technique for rejuvenation. Machida et al. [31] presented an SMP to analyze a software execution environment suffering from software aging from the aspects of both service availability and job completion time. There are two major differences between [31] and our work:
• The system modeled in [31] is different from our system. When aging is detected in the host with a running job, the authors in [31] proposed to reduce performance degradation by adding more computing resources to this host. If such a solution cannot prevent aging, a certain rejuvenation technique is employed. In contrast, our system uses live VM migration in order to prevent service performance degradation when aging is detected. In addition, the authors in [31] assumed that system crashes are caused only by software aging, whereas we consider other crash factors, such as certain elusive faults [15]. Namely, we consider the scenario where a job crash can occur at any time.
• The model proposed in [31] ignored the system state where aging occurs but this event is not detected. That is, the model proposed in [31] did not capture the job performance variation and thereby did not calculate the overall job completion time effectively. We use FIGURE 1 to illustrate this point. We assume that there is no aging in a host running a job in $[t_1, t_2]$. VMM aging occurs from time instant $t_2$ and live VM migration is triggered at time $t_3$. Note that aging causes the variation of job performance in unit time (denoted as job unit performance). The model in [31] ignored the system state in $[t_2, t_3]$; namely, it assumed that job unit performance in $[t_2, t_3]$ is the same as that in $[t_1, t_2]$, which does not hold. Our model developed in this paper captures the variation in job unit performance. We also consider the decrease in job processing rate due to VMM aging in calculating the overall job completion time to make the results more accurate.
Recently, event transition based methodology has been developed to evaluate the performance of time-dependent systems. Levitin et al. [33] studied both full and partial rejuvenations in a real-time software system by extending event transition based methodology. They focused on evaluating job completion probability. They further [34] considered an operational software system that performs real-time tasks and has multiple performance degradation levels. Then, they explored an event transition-based numerical method to investigate the optimal state-based rejuvenation policy by minimizing the total expected mission cost in this system. Unlike them, we quantitatively analyze software rejuvenation techniques in terms of AS availability and job completion time.
Besides analytical modeling and event transition based approaches, researchers explored measurement-based approaches. Bovenzi et al. [35] evaluated Kexec and Phase-based reboot techniques in terms of downtime overhead reduction, performance penalty and rejuvenation coverage. Huang et al. [36] used an adaptive sampling technique for signal reconstruction to detect trends in reconstructed signals and evaluated whether the reconstructed signals can be used to track the gradual change of system performance related to software aging. Our paper differs from these works in two major ways. One is that we study both job completion time and AS availability. The other is that we apply VMM reboot and live VM migration techniques to achieve high AS availability and small job completion time. Note that their experimental results can be complementary to our work for better evaluation of AS availability and job completion time.

III. SYSTEM DESCRIPTION AND MODELS
This section first presents the SV-based system architecture considered in this paper, shown in FIGURE 2. Then the SMP model is explained. Finally, the formulas for calculating AS availability and job completion time are derived.

A. SYSTEM DESCRIPTION
The SV-based system mainly consists of one host with powerful computing capability, many hosts with weak computing capability, and a Management Host. Jobs can be executed in the hosts. The host with powerful computing capability is regarded as Primary Host, which includes a VMM hosting an active VM for executing AS. AS is assumed to be a constantly running service, through which each user can execute his/her job. One of the hosts with weak computing capability can be used as Backup Host, which includes a VMM and is used to support live VM migration. The monitoring tool deployed in Management Host is responsible for monitoring the behaviors of the VMM in each host. These hosts are connected through the network. We consider the execution process of a job in the system.
At the beginning, the job runs in Primary Host. If VMM aging is detected during the job execution, Management Host immediately examines the VMM state of the remaining hosts in the system and selects a host with weak computing capability that does not suffer from software aging or crash as the Backup Host. We assume that there is always an available Backup Host, and leave the relaxation of this assumption to future work. The selection and examination time of Management Host is negligible. Then live VM migration is triggered and Backup Host takes charge of the job. As soon as the job leaves the current host, the VMM of this host is rebooted in order to eliminate possible aging errors. Requests and sessions with established open network connections are not lost during live VM migration (namely, during all of the phases of live VM migration) [37]. The live VM migration technique ensures that the job can continue its execution from the preempted point. Namely, the job execution follows a preemptive-resume (PRS) discipline [38]. In contrast, if the VMM reboot technique is used, the job is restarted. Namely, the job execution follows a preemptive-repeat (PRT) discipline [31].
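The difference between the two disciplines can be made concrete with a minimal sketch. The function `completion_time`, its arguments and the numbers below are hypothetical and only illustrate the bookkeeping: `uptimes` lists the processing hours that elapse before each successive interruption, and `downtime` stands for the pause caused by one interruption (a migration under PRS, a reboot under PRT), assumed constant here.

```python
def completion_time(work, uptimes, downtime, discipline):
    """Wall-clock hours needed to finish `work` hours of processing.

    uptimes    : processing hours accumulated before each interruption
    downtime   : hours lost per interruption (assumed constant)
    discipline : "PRS" (resume from the preempted point) or
                 "PRT" (restart the job from scratch)
    """
    if discipline not in ("PRS", "PRT"):
        raise ValueError("discipline must be 'PRS' or 'PRT'")
    elapsed = 0.0   # wall-clock time spent so far
    done = 0.0      # work that is still valid after the last interruption
    for up in uptimes:
        if done + up >= work:
            break               # the job finishes before this interruption
        elapsed += up + downtime
        if discipline == "PRS":
            done += up          # progress is kept
        else:
            done = 0.0          # progress is lost
    return elapsed + (work - done)


# Hypothetical job: 10 hours of work, interruptions after 4 and then 5 hours.
prs = completion_time(10.0, [4.0, 5.0], 1.0, "PRS")
prt = completion_time(10.0, [4.0, 5.0], 1.0, "PRT")
print(prs, prt)
```

Under PRS only the two pauses are added to the work requirement, whereas under PRT the 9 hours processed before the interruptions are also lost.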
It is reasonable to assume that the VMM reboot time is far less than the software aging time. Since we assume Primary Host has powerful computing capacity, the job may be completed quickly if it is migrated back to Primary Host when this host is ready. Thus, as soon as Primary Host is ready, the job is moved back to Primary Host, as shown in FIGURE 3. The above description suggests that a system state can be described by a 2-tuple index $(i, j)$. Here, $i$ and $j$ denote the states of Primary Host and Backup Host, respectively. There are five host states: Free, Running, Failed, Migration and Aging, denoted by 0, 1, 2, 3 and 4, respectively. The meaning of each host state is given as follows:
• Host State 0 (Free). The job is not running in this host.
• Host State 1 (Running). The host is robust and the job is running in it. Both the VMM reboot technique and VMM fixing can bring the host back to this state.
• Host State 2 (Failed). The host at this state is unavailable, which is caused by VMM crash due to certain elusive faults [15].
• Host State 3 (Migration). At this state, the job is ready to move from one host to another via live VM migration.
• Host State 4 (Aging). The host at this state can work but its performance is degraded due to VMM aging.
There are 5 × 5 = 25 system states in total, of which 17 are meaningless and can be ignored. Take system states (1,1) and (4,2) as examples. A job considered in this paper cannot run in Primary Host and Backup Host simultaneously; therefore, system state (1,1) is meaningless. When a job is running in one host, the state of the other host is always 0 (Free), and the change of system state depends on the state of the host with a running job; therefore, system state (4,2) is meaningless. TABLE 1 defines the eight meaningful system states of the system.
FIGURE 4 illustrates the SMP model for capturing behaviors of the SV-based system. Note that the holding time of the system from state (4,0) to state (2,0) has the same general distribution $F_{f2}(t)$ as that of the system from state (3,0) to state (2,0). Whether the transition is from state (4,0) to state (2,0) or from state (3,0) to state (2,0), it indicates the holding time until the Primary Host VMM suffers from a crash after aging. In this model, VMM fixing and VMM reboot bring the SV-based system to a state without error. In addition, the timer to be used for the next VMM fixing or VMM reboot is restarted after completing VMM fixing or VMM reboot. From this point of view, we define the stochastic process $\{Z_s(t) = Z(Y_n, T_n) \mid Y_n \in Y, T_n \in T\}$. The sequence of system states $Y = \{Y_0, Y_1, Y_2, Y_3, Y_4, \ldots, Y_n\}$ ($n \ge 0$) (including the occurrence of VMM aging, live VM migration, VMM failure, VMM fixing and VMM reboot) corresponds to the Markov renewal moments $T = \{T_0, T_1, T_2, T_3, T_4, \ldots, T_n\}$ ($n \ge 0$).
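The count of 25 system states, of which 8 are meaningful and 17 are not, can be checked by enumeration. The sketch below encodes one reading of the two rules stated above (the job occupies exactly one host, and the other host is then Free); the predicate `is_meaningful` is an illustrative assumption and is not taken from TABLE 1.

```python
from itertools import product

# Host states as numbered in the text.
HOST_STATES = {0: "Free", 1: "Running", 2: "Failed", 3: "Migration", 4: "Aging"}


def is_meaningful(primary, backup):
    """Assumed rule: exactly one host holds the job (is not Free)."""
    return (primary == 0) != (backup == 0)


all_states = list(product(HOST_STATES, repeat=2))   # all (i, j) pairs
meaningful = [s for s in all_states if is_meaningful(*s)]
meaningless = [s for s in all_states if not is_meaningful(*s)]

print(len(all_states), len(meaningful), len(meaningless))  # 25 8 17
print(meaningful)
```

Under this rule both examples from the text, (1,1) and (4,2), fall into the meaningless set.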

C. FORMULAS FOR CALCULATING AS AVAILABILITY
This section describes the process of calculating AS availability. We use $S_0$-$S_5$ to represent the system states (see TABLE 1). The details are as follows.
First of all, we construct the kernel matrix $K(t)$, which can be represented as in Equation (1).
The non-null element $k_{01}(t)$ is defined in Equation (2):
$$k_{01}(t) = \Pr\{\text{aging of Primary Host occurs within time } t\} \tag{2}$$
The remaining elements of $K(t)$ have similar definitions, given in the Appendix in the supplemental section of the paper. By solving its one-step transition probability matrix (TPM) $P = [p_{S_i S_j}]$, we can characterize the sequence of system states. The one-step TPM of the embedded DTMC of the SMP is $P = \lim_{t \to \infty} K(t)$, which gives matrix $P$ in Equation (3), where the equations for calculating the non-null elements of the matrix are given in the Appendix in the supplemental section of the paper. In order to obtain the steady-state probability vector $V = [v_{S_0}, v_{S_1}, \ldots, v_{S_5}]$ of the embedded DTMC, we solve the linear system of equations:
$$V = VP, \qquad \sum_{i=0}^{5} v_{S_i} = 1 \tag{4}$$
Then, we can get the equation of $v_{S_0}$ as follows:
$$v_{S_0} = \frac{-1}{p_{S_0 S_1} p_{S_1 S_2} p_{S_2 S_3} p_{S_3 S_0} - p_{S_0 S_1} p_{S_1 S_2} p_{S_2 S_3} - p_{S_0 S_1} p_{S_1 S_2} - p_{S_0 S_1} - 2} \tag{5}$$
The equations for calculating $v_{S_0}$, $v_{S_1}$, $v_{S_2}$, $v_{S_3}$, $v_{S_4}$ and $v_{S_5}$ are given in the Appendix in the supplemental section of the paper. Once $V$ is calculated, we need to obtain the mean sojourn time $h_{S_i}$ in system state $S_i$, which is to be used in Equation (7). The formula can be written as follows:
$$h_{S_i} = \int_0^{\infty} \left(1 - H_{S_i}(t)\right) dt \tag{6}$$
where $H_{S_i}(t)$ is the sojourn time distribution in system state $S_i$. The equations for calculating $h_{S_0}$, $h_{S_1}$, $h_{S_2}$, $h_{S_3}$, $h_{S_4}$ and $h_{S_5}$ are shown in the Appendix in the supplemental section of the paper. Then the steady-state probability $\pi_{S_i}$ of system state $S_i$ is calculated by using Equation (7) according to [39]:
$$\pi_{S_i} = \frac{v_{S_i} h_{S_i}}{\sum_{j=0}^{5} v_{S_j} h_{S_j}} \tag{7}$$
where $v_{S_i}$ and $h_{S_i}$ can be obtained by Equations (4) and (6). In this model, the steady-state availability of the system, $A_1$, is computed as the sum of the steady-state probabilities of system state $S_0$ ($\pi_{S_0}$), system state $S_1$ ($\pi_{S_1}$) and system state $S_3$ ($\pi_{S_3}$):
$$A_1 = \pi_{S_0} + \pi_{S_1} + \pi_{S_3} \tag{8}$$
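The computation chain described in this section (embedded DTMC, mean sojourn times, steady-state probabilities, availability) can be sketched numerically as follows. This is only an illustration under stated assumptions: the transition matrix `P` and the sojourn times `h` contain made-up numbers, the function names are hypothetical, and the only facts taken from the text are that the stationary vector of the embedded chain is weighted by mean sojourn times and that $S_0$, $S_1$ and $S_3$ are the available states.

```python
def stationary_vector(P):
    """Solve v = vP with sum(v) = 1 by Gauss-Jordan elimination."""
    n = len(P)
    # Row i encodes sum_j v_j * P[j][i] - v_i = 0.
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    # One balance equation is redundant; replace it by the normalization.
    A[-1] = [1.0] * n
    b[-1] = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        scale = A[col][col]
        A[col] = [a / scale for a in A[col]]
        b[col] /= scale
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[col])]
                b[r] -= f * b[col]
    return b


def smp_availability(P, h, up_states):
    """Return (v, pi, availability) for a semi-Markov process."""
    v = stationary_vector(P)
    weights = [vi * hi for vi, hi in zip(v, h)]
    total = sum(weights)
    pi = [w / total for w in weights]           # time-stationary probabilities
    return v, pi, sum(pi[i] for i in up_states)


# Hypothetical 6-state embedded DTMC (every row sums to 1) and
# hypothetical mean sojourn times in hours.
P = [
    [0.0, 0.9, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.0, 0.95, 0.0, 0.05],
    [0.6, 0.0, 0.0, 0.0, 0.3, 0.1],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
h = [50.0, 300.0, 0.02, 40.0, 0.05, 0.5]

v, pi, availability = smp_availability(P, h, up_states=(0, 1, 3))
print(availability)
```

Solving the small linear system directly avoids any dependence on the convergence of an iterative method; for the six-state model the cost is negligible.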

D. FORMULAS FOR CALCULATING JOB COMPLETION TIME
This section analyzes job completion time $C(x)$, defined as the amount of time needed to complete a job. The work requirement for this job is $x$ work units. We assume that a work unit is processed in an hour in the execution environment. If the job encounters a failure at time instant $h$ ($h > 0$), it will be restarted. The details of calculating job completion time are given in the following. According to FIGURE 4, we assume that the job starts its execution from system state $S_0$. If $h$ is not less than $x$, job completion time $C(x)$ is equal to $x$. If $h$ is less than $x$, job completion time $C(x)$ becomes the sum of $h$, the VMM fixing time and the completion time $C(x)$ of the restarted job. The details are as follows:
1) The migration trigger interval $a_1$ is larger than $x$.
2) When migration trigger interval $a_1$ is less than $x$, there are two cases as follows:
2.1) $h$ is less than migration trigger interval $a_1$. We define that the Primary Host VMM suffers from software aging at time $a$. There are two situations in which the job fails. One is that Primary Host failure occurs before $a$ and the other is that it occurs after $a$.
2.2) $h$ is larger than migration trigger interval $a_1$.
The mean job completion time is represented as follows [31]:
$$E[C(x)] = -\left.\frac{\partial \tilde{C}(s, x)}{\partial s}\right|_{s=0} \tag{9}$$
where the Laplace-Stieltjes transform (LST) of job completion time, $\tilde{C}(s, x)$, is given by Equation (10). Solving Equation (9), we can get the mean job completion time. In addition, if the effect of job processing rate $r_1$ at Aging state of Primary Host and job processing rate $r_2$ at Running state of Backup Host on job completion time is considered, $\tilde{C}(s, x)$ in Equation (10) can be written as Equation (11). The $\tilde{C}(s, x)$ in Equation (11) gives the overall job completion time, considering both Running state and Aging state together by setting different job processing rates at these two states.
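The restart recursion at the start of this section ($C(x) = x$ if $h \ge x$, otherwise $h$ plus the VMM fixing time plus a fresh $C(x)$) can be explored with a small simulation. The sketch below is deliberately simpler than the model in this paper: it assumes exponentially distributed failure and fixing times and no migration, so that the mean has the well-known closed form $(e^{\lambda x} - 1)(1/\lambda + 1/\alpha)$. The rate `lam = 0.005` is a hypothetical value, and all function names are illustrative.

```python
import math
import random


def mean_completion_closed_form(x, lam, mean_fix):
    """Mean of C(x) under restart-on-failure with exponential failures."""
    return (math.exp(lam * x) - 1.0) * (1.0 / lam + mean_fix)


def simulate_completion(x, lam, mean_fix, rng):
    """Draw one sample of C(x) by replaying the restart recursion."""
    total = 0.0
    while True:
        h = rng.expovariate(lam)            # time to the next failure
        if h >= x:
            return total + x                # the job finishes uninterrupted
        # Failure before completion: lose h, fix the VMM, restart the job.
        total += h + rng.expovariate(1.0 / mean_fix)


def mean_completion_simulated(x, lam, mean_fix, runs=20000, seed=1):
    rng = random.Random(seed)
    return sum(simulate_completion(x, lam, mean_fix, rng) for _ in range(runs)) / runs


x = 360.0          # work requirement in hours, as in Section IV
lam = 0.005        # hypothetical failure rate per hour
mean_fix = 0.5     # mean VMM fixing time in hours

closed = mean_completion_closed_form(x, lam, mean_fix)
simulated = mean_completion_simulated(x, lam, mean_fix)
print(closed, simulated)
```

Even in this reduced setting the mean completion time is several times the work requirement, which illustrates why avoiding restarts through migration can pay off.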

IV. ANALYTICAL EXPERIMENTS
In this section, we apply our proposed equations to investigate AS availability (using Equation (8)) and the mean job completion time (using Equation (11)) over various system parameters. Section IV-A introduces the experiment configuration. Section IV-B and Section IV-C describe the mean job completion time and AS availability under varying parameters.

A. EXPERIMENT CONFIGURATION
Failure time is assumed to have an Increasing Failure Rate distribution because the failure rate caused by software aging tends to increase with time [40]. The Hypo-exponential distribution is a typical Increasing Failure Rate distribution [31]. The time of Primary Host (Backup Host) from Running state to Failed state follows the Hypo-exponential distribution, with distribution function $F_{f1}(t) = \mathrm{HYPO}(\lambda_1, \lambda_2)$ ($F_{f3}(t) = \mathrm{HYPO}(\lambda_1, \lambda_4)$). The time of Primary Host (Backup Host) from Aging state to Failed state is also assumed to follow the Hypo-exponential distribution, with distribution function $F_{f2}(t) = \mathrm{HYPO}(\lambda_1, \lambda_3)$ ($F_{f4}(t) = \mathrm{HYPO}(\lambda_5, \lambda_6)$). In addition, random variables $T_{R1}$, $T_{R2}$, $T_{Q1}$, $T_{Q2}$ and $T_M$ are assumed to follow the exponential distribution with parameters $\alpha$, $\gamma$, $\kappa$, $\mu$ and $\sigma$, respectively. The Hypo-exponential and exponential distributions are used only as examples; other distributions can be used for analytical experiments. Some parameters used for solving AS availability and the mean job completion time are set according to [31]. The remaining parameters are set in order to demonstrate the effectiveness of the model proposed in this paper. The default settings of parameters are given in TABLE 2, where '-' in the 'Distribution' column denotes that variables do not follow any distribution, while '-' in the 'Default Values' column indicates no default settings of parameters. Analytical experiments are conducted on MAPLE [41].
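The Increasing Failure Rate property of the two-stage Hypo-exponential distribution can be checked numerically. The sketch below uses the standard closed forms for $\mathrm{HYPO}(\lambda_1, \lambda_2)$ with $\lambda_1 \ne \lambda_2$; the value `l1 = 0.01` is a hypothetical choice, while `l2 = 0.00495` is one of the values of $\lambda_2$ used in Section IV-C.

```python
import math


def hypo_cdf(t, l1, l2):
    """CDF of the sum of two independent exponentials with rates l1 != l2."""
    return 1.0 - (l2 * math.exp(-l1 * t) - l1 * math.exp(-l2 * t)) / (l2 - l1)


def hypo_pdf(t, l1, l2):
    """Density of the same distribution."""
    return l1 * l2 / (l2 - l1) * (math.exp(-l1 * t) - math.exp(-l2 * t))


def hypo_hazard(t, l1, l2):
    """Failure rate: density divided by the survival function."""
    return hypo_pdf(t, l1, l2) / (1.0 - hypo_cdf(t, l1, l2))


l1, l2 = 0.01, 0.00495                      # l1 is hypothetical
times = [1.0, 10.0, 100.0, 500.0, 1000.0]   # hours
hazards = [hypo_hazard(t, l1, l2) for t in times]
for t, hz in zip(times, hazards):
    print(f"t = {t:7.1f} h   failure rate = {hz:.6f} per hour")
```

The failure rate starts at zero and rises toward the smaller of the two stage rates, which is the behavior the text attributes to aging-related failures.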

B. EFFECT OF MIGRATION TRIGGER INTERVAL ON JOB COMPLETION TIME
This section describes the relationship between the mean job completion time and the migration trigger interval under varying job processing rate and mean VMM fixing time $1/\alpha$. We assume $a < a_1 < a_2 < x$ in the formula for calculating job completion time. $x$, $a$ and $a_2$ are set to 360 hours [31], 50 hours and 354 hours, respectively. Thus, the migration trigger interval varies from 100 hours to 350 hours.

1) JOB PROCESSING RATE r 1 AT AGING STATE OF PRIMARY HOST
First, we investigate the job completion time by varying migration trigger interval $a_1$ and job processing rate $r_1$ at Aging state of Primary Host. Job processing rate $r_1$ at Aging state of Primary Host is set to 0.6, 0.7 and 0.8, respectively. The remaining parameters are fixed. FIGURE 5 shows the experimental results. We can observe:
• When job processing rate $r_1$ ... completion time and the corresponding optimal migration trigger intervals at $r_1$ = 0.7 and 0.8, respectively.
• With the increasing job processing rate $r_1$ at Aging state of Primary Host, the mean job completion time decreases gradually. The reason is that more work units are completed per unit time when job processing rate $r_1$ at Aging state of Primary Host increases.
• After the mean job completion time reaches its minimum value, it increases gradually with the increasing migration trigger interval $a_1$. When migration trigger interval $a_1$ is small, the frequency of migration increases, which results in the increase in the mean job completion time. When migration trigger interval $a_1$ is large, the probability of system failure increases, which results in the increase in the mean job completion time. Consequently, the mean job completion time first decreases and then increases with the increasing migration trigger interval $a_1$.
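The second observation (a higher $r_1$ shortens the job) follows from simple rate arithmetic, illustrated by the failure-free sketch below. It is not the model of this paper: failures, migration downtime and the return to Primary Host are all ignored, the function name is hypothetical, and `a1 = 200` and `r2 = 0.9` are arbitrary illustrative choices. Only `x = 360` and `a = 50` come from the experiment configuration.

```python
def failure_free_completion_time(x, a, a1, r1, r2):
    """Hours to finish x work units with piecewise-constant processing rate.

    Rate 1 on healthy Primary Host until aging starts at time a,
    rate r1 on aged Primary Host until migration at time a1 (a <= a1),
    rate r2 on Backup Host afterwards.
    """
    if x <= a:
        return x                            # done before aging starts
    done = a                                # work finished at rate 1
    aged_capacity = r1 * (a1 - a)           # work doable while aged
    if x - done <= aged_capacity:
        return a + (x - done) / r1          # done before migration
    done += aged_capacity
    return a1 + (x - done) / r2             # remainder runs on Backup Host


x, a, a1, r2 = 360.0, 50.0, 200.0, 0.9
times_by_r1 = {r1: failure_free_completion_time(x, a, a1, r1, r2)
               for r1 in (0.6, 0.7, 0.8)}
for r1, t in times_by_r1.items():
    print(f"r1 = {r1}: {t:.2f} hours")
```

The gap between these failure-free times and the much larger means reported in this section is due to failures and restarts, which this sketch leaves out.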

2) JOB PROCESSING RATE r 2 AT RUNNING STATE OF BACKUP HOST
Next, we investigate the job completion time by varying migration trigger interval $a_1$ and job processing rate $r_2$ at Running state of Backup Host. Job processing rate $r_2$ at Running state of Backup Host is set to 0.7, 0.8 and 0.9, respectively. The remaining parameters are fixed. FIGURE 6 shows the experimental results. We can observe:
• When job processing rate $r_2$ at Running state of Backup Host is 0.9, the mean job completion time is approximately minimized to 1370.4277 hours at $a_1$ = 160 hours. It is denoted by (160, 1370.4277) in FIGURE 6. Similarly, (184, 1449.4312) and (218, 1538.1866) denote the approximate minimum job completion time and the corresponding optimal migration trigger intervals at $r_2$ = 0.8 and 0.7, respectively.
• With the increasing job processing rate $r_2$ at Running state of Backup Host, the mean job completion time decreases gradually. The reason is that more work units are completed per unit time when job processing rate $r_2$ at Running state of Backup Host increases.
• After the mean job completion time reaches its minimum value, it increases gradually with the increasing migration trigger interval $a_1$. When migration trigger interval $a_1$ is small, the frequency of migration increases, which results in the increase in the mean job completion time. The reason is the same as in Section IV-B (1).

3) MEAN VMM FIXING TIME 1/α
Next, we investigate the job completion time by varying migration trigger interval $a_1$ and mean VMM fixing time $1/\alpha$. Mean VMM fixing time $1/\alpha$ is set to 0.5 hours, 0.6 hours and 0.7 hours, respectively, while the remaining parameters are fixed. FIGURE 7 and IV-C.1 show experimental results. We can observe:
• With the increasing mean VMM fixing time $1/\alpha$, the mean job completion time increases gradually. The reason is that a longer mean VMM fixing time adds to the time spent recovering from each failure, and hence to the job completion time.
• After the mean job completion time reaches its minimum value, it increases gradually with the increasing migration trigger interval $a_1$. The reason is the same as in Section IV-B (1).

C. EFFECT OF MIGRATION TRIGGER INTERVAL ON AS AVAILABILITY
This section describes the relationship between AS availability and the migration trigger interval under different failure rate parameter $\lambda_2$ and mean VMM fixing time $1/\alpha$. Moreover, the relationship between the approximate maximum AS availability and the corresponding optimal migration trigger interval under different live VM migration rate $\sigma$ is investigated. AS availability is closely related to the sojourn time in each system state. As modeled in Section III, migration occurs after system state $S_1$. By Equation (A.27) in the Appendix in the supplemental section of the paper, the approximate maximum sojourn time in system state $S_1$ is computed to be 336 hours. The migration trigger interval is varied from 350 hours to 950 hours.

1) FAILURE RATE PARAMETER λ 2
First, we investigate AS availability by varying migration trigger interval $a_1$ and failure rate parameter $\lambda_2$. The failure rate parameter $\lambda_2$ is set to 0.00495, 0.00595 and 0.00695, respectively, while the remaining parameters are fixed. FIGURE 8 illustrates the experimental results. We can observe:
• When failure rate parameter $\lambda_2$ is 0.00695, the AS availability is approximately maximized to 0.9997425 at $a_1$ = 508 hours. The maximum point is denoted by (508, 0.9997425) in FIGURE 8. Similarly, (523, 0.9997527) and (544, 0.9997652) denote the approximate maximum AS availabilities and the corresponding optimal migration trigger intervals at $\lambda_2$ = 0.00595 and 0.00495, respectively.
• With the increasing failure rate parameter $\lambda_2$, AS availability decreases gradually. The reason is that the holding time of the system staying at available states decreases when failure rate parameter $\lambda_2$ increases.
• AS availability decreases with the increasing migration trigger interval $a_1$ after it reaches its maximum value. When migration trigger interval $a_1$ is small, the holding time of the system staying at available states increases, which leads to the increase in AS availability. When migration trigger interval $a_1$ is large, the probability of system failure increases, which leads to the decline in AS availability. Consequently, AS availability increases up to the maximum value and then decreases with the increasing migration trigger interval $a_1$.
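To put the reported availabilities into operational terms, they can be converted into expected downtime per year. The conversion below is plain arithmetic on the three maxima quoted above; the helper name and the 365-day year are assumptions made for illustration.

```python
HOURS_PER_YEAR = 365 * 24   # assumes a non-leap year


def annual_downtime_hours(availability):
    """Expected unavailable hours per year at a given steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


# (lambda_2, approximate maximum AS availability) as reported for FIGURE 8.
reported = [(0.00695, 0.9997425), (0.00595, 0.9997527), (0.00495, 0.9997652)]
downtimes = [annual_downtime_hours(avail) for _, avail in reported]
for (lam2, _), hours in zip(reported, downtimes):
    print(f"lambda_2 = {lam2}: about {hours:.2f} hours of downtime per year")
```

The differences between the three settings amount to roughly a tenth of an hour of downtime per year each, which gives a sense of scale for the differences between the curves.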

2) MEAN VMM FIXING TIME 1/α
Next, we investigate AS availability by varying migration trigger interval $a_1$ and mean VMM fixing time $1/\alpha$. Mean VMM fixing time $1/\alpha$ is set to 0.5 hours, 0.6 hours and 0.7 hours, respectively, while the remaining parameters are fixed. FIGURE 9 illustrates the experimental results. We can observe:
• ... and (554, 0.9997496) denote the approximate maximum AS availabilities and the corresponding optimal migration trigger intervals at $1/\alpha$ = 0.6 hours and 0.7 hours, respectively.
• With the increasing mean VMM fixing time $1/\alpha$, AS availability decreases gradually. The reason is that the holding time of the system staying at unavailable states increases with the increasing mean VMM fixing time.
• After AS availability reaches its maximum value, it decreases with the increasing migration trigger interval $a_1$. The reason is that the probability of Primary Host failure before the service migration to Backup Host increases when migration trigger interval $a_1$ becomes large.

3) LIVE VM MIGRATION RATE σ
Finally, we investigate the relationship between the maximum AS availability and the corresponding optimal migration trigger interval under different live VM migration rate $\sigma$. Live VM migration rate $\sigma$ is set to 48, 72, 96, 120 and 144, while the remaining parameters are fixed. FIGURE 10 shows the experimental results. We observe that the maximum AS availability increases and the corresponding optimal migration trigger interval decreases with the increasing $\sigma$. The reason is that $\sigma$ determines the time the system stays at unavailable states. When live VM migration rate $\sigma$ is large, the holding time of the system staying at unavailable states decreases, which leads to the increasing maximum AS availability.
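The trade-off behind an optimal trigger interval can be reproduced with a much simpler stand-in for the SMP: a classical age-replacement model, in which the system is either proactively migrated after $a$ hours (costing a mean migration time $1/\sigma$) or fails earlier (costing a mean fixing time). This is not the model of this paper, so its optima are not comparable with FIGURE 10. The rates `l1` and `l2` are hypothetical, $\sigma$ is assumed to be expressed per hour, and all function names are illustrative.

```python
import math


def hypo_cdf(t, l1, l2):
    """CDF of a two-stage Hypo-exponential failure time (l1 != l2)."""
    return 1.0 - (l2 * math.exp(-l1 * t) - l1 * math.exp(-l2 * t)) / (l2 - l1)


def mean_uptime(a, l1, l2):
    """Integral of the survival function over [0, a]: expected up time per cycle."""
    return (l2 * (1.0 - math.exp(-l1 * a)) / l1
            - l1 * (1.0 - math.exp(-l2 * a)) / l2) / (l2 - l1)


def availability(a, l1, l2, mean_fix, sigma):
    """Renewal-reward availability when migration is triggered at age a."""
    fail_prob = hypo_cdf(a, l1, l2)
    up = mean_uptime(a, l1, l2)
    down = fail_prob * mean_fix + (1.0 - fail_prob) / sigma
    return up / (up + down)


def best_interval(l1, l2, mean_fix, sigma, grid):
    """Grid search for the trigger interval with the highest availability."""
    return max(grid, key=lambda a: availability(a, l1, l2, mean_fix, sigma))


l1, l2 = 0.005, 0.01        # hypothetical stage rates per hour
mean_fix = 0.5              # mean VMM fixing time in hours
grid = range(1, 501)        # candidate trigger intervals in hours

results = {}
for sigma in (48, 96, 144):
    a_opt = best_interval(l1, l2, mean_fix, sigma, grid)
    results[sigma] = (a_opt, availability(a_opt, l1, l2, mean_fix, sigma))
    print(f"sigma = {sigma}: best interval {a_opt} h, "
          f"availability {results[sigma][1]:.6f}")
```

In this simplified setting the same qualitative pattern appears: a faster migration raises the best achievable availability and moves the best trigger interval earlier, because frequent proactive migrations become cheaper relative to failures.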

V. CONCLUSION AND FUTURE WORK
In this paper, we apply the SMP to quantitatively study the AS availability and job completion time in an SV-based system deploying VMM reboot and live VM migration techniques. We derive the equations for calculating AS availability and job completion time under various migration trigger intervals. Finally, we determine the optimal migration trigger intervals for achieving the approximate maximum AS availability and the approximate minimum job completion time through analytical experiments to help service providers make decisions for maximizing the benefits of service providers and users.
Note that this paper considers VMM reboot and live VM migration techniques. Future work includes investigating scenarios where more rejuvenation techniques are adopted for improving AS availability and job completion time. In addition, we want to calculate more evaluation metrics, such as cost and the mean time to failure, in order to evaluate the effectiveness of the software rejuvenation techniques. We will also investigate whether extended deterministic and stochastic Petri nets can be applied to model the system considered in this paper.

APPENDIX
This section provides formulas for calculating AS availability (Section III-C of the main paper). The definition of parameters involved in the equations is shown in TABLE 2 of the main paper. The detailed solution process is as follows: First, we obtain the equations for calculating the elements of the kernel matrix $K(t)$, given in Equations (A.1)-(A.10). With the common denominator
$$D = p_{S_0 S_1} p_{S_1 S_2} p_{S_2 S_3} p_{S_3 S_0} - p_{S_0 S_1} p_{S_1 S_2} p_{S_2 S_3} - p_{S_0 S_1} p_{S_1 S_2} - p_{S_0 S_1} - 2,$$
the steady-state probabilities of the embedded DTMC are:
$$v_{S_0} = -\frac{1}{D} \tag{A.21}$$
$$v_{S_1} = -\frac{p_{S_0 S_1}}{D} \tag{A.22}$$
$$v_{S_2} = -\frac{p_{S_0 S_1} p_{S_1 S_2}}{D} \tag{A.23}$$
$$v_{S_3} = -\frac{p_{S_2 S_3} p_{S_0 S_1} p_{S_1 S_2}}{D} \tag{A.24}$$
$$v_{S_4} = -\frac{p_{S_3 S_4} p_{S_2 S_3} p_{S_0 S_1} p_{S_1 S_2}}{D} \tag{A.25}$$
$$v_{S_5} = -\frac{p_{S_0 S_1} p_{S_1 S_2} p_{S_2 S_3} p_{S_3 S_0} + p_{S_0 S_1} p_{S_1 S_2} p_{S_2 S_3} p_{S_3 S_4} - 1}{D} \tag{A.26}$$
Moreover, we get the mean sojourn time $h_{S_i}$ in each system state. Finally, we solve AS availability by Equation (7).