Online Workload Burst Detection for Efficient Predictive Autoscaling of Applications

Autoscaling methods are employed to ensure the scalability of cloud-hosted applications. Public-facing applications are prone to sudden workload bursts, and existing autoscaling methods do not handle bursty workloads gracefully. It is challenging to detect a burst online in dynamic incoming workload traffic, and identifying appropriate resources to address the burst without overprovisioning is even harder. In this paper, we address this challenge by investigating appropriate methods for online burst detection and then propose a novel predictive autoscaling method that uses burst detection to satisfy specific response time requirements. We compared the proposed method with multiple state-of-the-art baseline autoscaling methods under multiple realistic and synthetic bursty workloads for a benchmark application. Our experimental results show a 60.8% average decrease in response time violations compared to the baseline method.


I. INTRODUCTION
Cloud computing is attractive for hosting and managing applications because of its scalability, performance, and cost-effectiveness. One of the core features of cloud computing is on-demand resource provisioning, which enables users to automatically scale application resources to satisfy specific service level objectives (SLOs). An application with high response times loses users and reduces business opportunities for its owners. Therefore, meeting the response time SLO is critical for cloud-hosted applications [1]-[3].
Typically, autoscaling methods are used to maintain the performance of cloud-hosted applications by reducing latency and response time SLO violations [4]. Most existing autoscaling methods follow either a reactive or a predictive approach. The reactive approach manages application resources based on specific events and a set of rules. For example, a typical reactive autoscaling method increases the resources of a cloud-hosted application whenever an application-specific metric, such as the average response time, or the utilization of allocated hardware, such as CPU, memory, and I/O, reaches a specific threshold [5], [6]. These rule-based methods do not eliminate SLO violations because they initiate scaling decisions only after particular events have occurred. In contrast, predictive autoscaling methods proactively scale application resources by anticipating performance issues that may arise in the future. Most existing predictive autoscaling methods [7], [8] are based on analytical modeling [9], machine learning [10], and reinforcement learning [11] techniques.
The associate editor coordinating the review of this manuscript and approving it for publication was Francesco Piccialli.
Public-facing applications observe dynamic and bursty workloads. A sudden burst in the workload drastically affects application performance: if the autoscaling method does not detect the burst and handle it appropriately, the performance of the application sharply deteriorates. Existing autoscaling methods do not explicitly treat bursts in workloads to minimize response time SLO violations. There have been some efforts to identify burstiness in workloads [12]-[15]. However, these burst detection methods are offline techniques that require a large number of historical workload observations as input to identify a burst. It is challenging to identify bursts online in incoming dynamic workloads using a small number of historical observations and then use the burst detection efficiently in the autoscaling method to minimize response time SLO violations.
To highlight the challenges of identifying bursts in dynamic workloads, Figure 1 shows two real workloads and one synthetic workload. SW1 and SW2 are real workload traces from the University of Calgary web server and the ClarkNet web server, respectively, whereas SW3 is a synthetic workload with bursts. The highlighted bursts in each workload are challenging to detect online and automatically, and they deteriorate application performance if appropriate resources are not allocated in time. A typical autoscaling method allocates and deallocates resources frequently during bursts and fails to identify an appropriate allocation in time to minimize response time SLO violations [16]. In this paper, we propose an efficient autoscaling method that detects bursts online in the incoming workload and uses this information in resource allocation decisions to minimize response time SLO violations. In our proposed method, first, we predict the future workload using the last k historical workload observations. Second, the predicted workload is used to identify the number of resource instances needed to satisfy the response time SLO threshold. Third, we detect burstiness in the last k historical workload observations using a burst detection method. Finally, the burst detection decision and the predicted number of resource instances are used to scale out or scale in the application resources dynamically. We investigate three techniques for online burst detection in dynamic workloads: Sample Entropy (SE), Normalized Entropy (NE), and Fast Fourier Transformation (FFT). In a trace-driven simulation with four synthetic and five real workloads, our proposed method performs considerably better than existing state-of-the-art baseline autoscaling methods. We also validate the proposed method on a real containerized testbed infrastructure.
The main contributions of this work are:
i. An autoscaling method that minimizes response time SLO violations.
ii. Online burst detection on the incoming workload, integrated into the autoscaling method for better performance.
iii. A simulation-based evaluation of the proposed system using four synthetic and five real workloads.
iv. A comparison of the proposed autoscaling method with state-of-the-art reactive and predictive autoscaling methods as baselines.
v. Validation of the proposed autoscaling method on a real testbed infrastructure.
The rest of the paper is organized as follows. Related work is discussed in Section II. We explain the proposed system in Section III. The experimental setup and design are discussed in Section IV. Experimental results are presented in Section V. Finally, conclusions and future work are discussed in Section VI.

II. RELATED WORK
There have been many efforts to design autoscaling methods for cloud-hosted applications. For example, Liu and Wee [29] present a reactive autoscaling method that dynamically scales application resources whenever the CPU or bandwidth utilization of the resources saturates. Krieger et al. [30] proposed a reactive autoscaling method for bioinformatics and biomedical cloud-hosted applications that horizontally scales the application resources to maximize application performance. Chieu et al. [31] proposed a reactive autoscaling method that uses the number of active sessions to determine the number of resource instances. However, this method does not maximize the performance of complex applications under dynamically changing workloads. Liu et al. [32] also present a reactive autoscaling method that uses fuzzy logic to find the resource cluster size needed to maintain the application response time. Abdullah et al. [33] use a simple reactive autoscaling method to maintain the performance of microservices; it dynamically adds resources to the microservices whenever the response time saturates.
Recently, several researchers have designed and evaluated predictive autoscaling methods. For example, Radhika et al. [34] present a predictive autoscaling method that uses Auto-Regressive Integrated Moving Average (ARIMA) and Recurrent Neural Network-Long Short Term Memory (RNN-LSTM) models to predict the future workload from historical CPU and memory utilization and then scales the application resources. Raghunath and Annappa [35] develop a predictive autoscaling method that uses a fuzzy-based system to predict the future resources needed to maintain application performance. A recent work by Abdullah et al. [36] proposed a predictive autoscaling method that uses machine learning to identify the number of resources required to satisfy the response time requirements for a forecasted workload and adjusts the resources accordingly.
Some researchers address resource scaling for multi-player online game applications hosted on the cloud. For example, Khorsand et al. [17] proposed a self-learning fuzzy approach for the proactive provisioning of resources for multi-player online game applications in a cloud environment. The authors applied Maximum Likelihood Estimation and Local Linear Regression for parameter prediction and a fuzzy decision-maker to determine appropriate autoscaling decisions. Khorsand et al. [18] proposed an approach to provision resources for multi-tier applications hosted on cloud infrastructure using the autonomic computing MAPE-K control loop. Their approach uses SVR and the Fuzzy Analytical Hierarchy Process in the analysis and planning phases of the MAPE-K loop, with SVR predicting the workload. Ghobaei-Arani et al. [19] use a similar approach for resource provisioning of multi-player online games in a cloud environment. Their approach forecasts the workload in the analysis phase using an ANFIS (Adaptive Neuro-Fuzzy Inference System) predictor and uses a Fuzzy Decision Tree in the planning phase to estimate the number of resources to allocate for the predicted workload, minimizing the SLA violation rate. Rafieyan et al. [20] proposed an adaptive approach for multi-criteria task scheduling problems that schedules conflicting tasks with minimum QoS violations. Khorsand and Ramezanpour [37] also designed a method for multi-criteria task scheduling problems with energy efficiency. Safari and Khorsand [38] proposed a Dynamic Voltage and Frequency Scaling method that reduces the energy consumption of time-constrained workflows by distributing workflow tasks to appropriate VMs according to their deadlines. The goal of this method is to schedule user-submitted workflows within given time constraints with minimum SLO violations.
VOLUME 8, 2020
Workload forecasting is an active research area, and many recent works address this problem. For example, Attia et al. [39] use a Differential Evolution (DE) algorithm named MSaDE and an Artificial Neural Network (ANN) to forecast the workload of cloud-hosted applications. Iqbal et al. [40] use an unsupervised learning approach to find future workload patterns for web applications. The authors partition the URI space by response time and document size and compute the distribution of historical access logs across these partitions. The partitions are then used to predict workload patterns using probabilistic techniques. There have also been efforts to use forecasted workloads in autoscaling. For example, Roy et al. [41] proposed an autoscaling method that uses the forecasted workload and application resource utilization for resource provisioning decisions. Baig et al. [28] proposed a window size estimation method that maximizes the prediction accuracy of data center resource utilization using deep neural networks; any regression-based estimation model can use the estimated window size to predict resource utilization with minimum error. Similarly, Chen and Wang [27] proposed a method to improve the accuracy and prediction time of resource utilization forecasts, combining three components, Ensemble Empirical Mode Decomposition, Run Tests, and ARIMA, to improve the prediction results.
A few efforts have been made to measure burstiness in application workloads. For example, Balaji et al. [21] combine the Hurst Exponent and Sample Entropy methods to detect burst patterns in offline workload traces. Zhang et al. [42] use a two-state (ON/OFF) Markov chain model to detect bursts in a given workload. Zhang et al. [43] also present a system that identifies burstiness in web applications through search query analysis, applying a probabilistic model to the search queries and URLs of a web application. Tamime et al. [44] proposed a model to measure burstiness in health-related Wikipedia articles; the authors use an ARIMA model and then classify the burstiness as high, low, or moderate. Benmakrelouf et al. [45] proposed a method for detecting abnormal variations in virtualized systems that combines two probabilistic techniques, the Z-score and the Kullback-Leibler divergence. Their solution maps resource-level metrics to service-level metrics and detects abnormal changes dynamically. Other researchers quantify burstiness by treating the workload as a signal. For example, Minh et al. [12] proposed a system that quantifies the burstiness of a given workload using normalized entropy. However, their approach treats the complete offline workload as a single signal before calculating the burstiness. Ali-Eldin et al. [13] also quantify burstiness using the sample entropy method. Shen et al. [14] proposed a system that indicates burstiness in a given workload using a signal processing technique (FFT) to compute the burst density. These methods calculate burstiness on given offline workload traces. Table 1 compares existing work with our proposed method.
All existing burst detection methods work offline on given workload traces, treating each trace as a single signal. Detecting burstiness online and then using the burst detection decision in resource autoscaling to improve the performance of cloud-hosted applications remains challenging. In this work, we tackle this challenge and propose an efficient predictive autoscaling method that improves application performance by using burst detection decisions as input. We use sample entropy (SE), normalized entropy (NE), and the Fast Fourier Transformation (FFT) as burst detection techniques and identify the best one for our proposed autoscaling method.

III. PROPOSED SYSTEM
The overall proposed system is illustrated in Figure 2, where the flow and components of the system are labeled and numbered. The system works in the following steps:
• First, at every time interval, the last k workload observations $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$ are used to predict the workload $\alpha_{t+1}$ for the next interval, as explained in Section III-A.
• Second, the system uses the same k workload observations $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$ to detect burstiness with our proposed burst detection method, explained in Section III-B.
• Third, the system uses a resource prediction model, explained in Section III-C, to predict the number of resource instances n required to satisfy the response time SLO $\tau_{slo}$ for the predicted workload $\alpha_{t+1}$.
• Finally, the proposed autoscaling method uses the burst detection decision and the predicted number of resource instances n to adjust the resources allocated to the application dynamically, as explained in Section III-D.
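All four steps consume the same sliding window of the last k observations, and each observation aggregates the requests received during one monitoring interval ξ (Section III-D). The following Python sketch shows that bookkeeping, assuming request arrival timestamps in seconds are available; the class `WorkloadWindow` and its methods are hypothetical names introduced for illustration, not part of the original system.

```python
from collections import deque


class WorkloadWindow:
    """Hypothetical helper that keeps the last k per-interval request counts."""

    def __init__(self, k=10, interval_s=60):
        self.k = k
        self.interval_s = interval_s          # monitoring interval xi
        # a deque with maxlen drops the oldest observation automatically
        self.observations = deque(maxlen=k)

    def add_interval(self, arrival_times, interval_start):
        """Count requests arriving in [interval_start, interval_start + xi)."""
        end = interval_start + self.interval_s
        count = sum(1 for ts in arrival_times if interval_start <= ts < end)
        self.observations.append(count)
        return count

    def is_full(self):
        """True once k observations are available for forecasting."""
        return len(self.observations) == self.k

    def as_list(self):
        """Oldest first: [alpha_{t-k+1}, ..., alpha_t]."""
        return list(self.observations)
```

Using a bounded deque keeps the window update O(1) per interval, and the forecasting, burst detection, and resource prediction steps can all read the same `as_list()` snapshot.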

A. WORKLOAD FORECASTING MODEL
Most existing workload forecasting techniques use statistical methods, including ARIMA, ARMA, and Moving Average [46], [47], to estimate future workloads. However, statistical methods are not efficient at forecasting bursty and dynamic workloads. Some recent works [48], [49] use advanced machine learning methods, including neural networks, support vector machines, and multi-layer perceptrons, to forecast workload. However, these methods are compute-intensive and require a large amount of training data to achieve good estimation accuracy. In our proposed workload forecasting method, we use only the last k observations to capture local trends in the incoming workload. Therefore, we use ElasticNet (EN), a regularized regression method that combines absolute ($\ell_1$) and squared ($\ell_2$) penalties. EN has been reported to perform better than other regression techniques [50]. Figure 3 shows the normalized Mean Absolute Error (MAE) of different regression algorithms for the World Cup workload with different window sizes. EN yields the minimum MAE with a window size of 10. Therefore, we use EN for workload forecasting.
The proposed workload forecasting model predicts the workload $\alpha_{t+1}$ for the next interval $t+1$ using the last k actual workload observations $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$ available for a specific application at the current time interval t. We model the workload in this window as

$$\alpha_i = b_0 + b_1 i, \qquad i \in \{t-k+1, \ldots, t\} \tag{1}$$

where $b_0$ and $b_1$ are the regression parameters, which are estimated using the following objective function:

$$\min_{b_0, b_1} \; \frac{1}{2k} \sum_{i=t-k+1}^{t} \left(\alpha_i - b_0 - b_1 i\right)^2 + \lambda \rho \lVert \mathbf{b} \rVert_1 + \frac{\lambda (1-\rho)}{2} \lVert \mathbf{b} \rVert_2^2 \tag{2}$$

where $\mathbf{b}$ denotes the regression coefficients, $\lambda$ is a hyper-parameter that decides the relative importance of the reconstruction error and the sparseness of the coefficients, $\lVert\cdot\rVert_1$ and $\lVert\cdot\rVert_2^2$ are the $\ell_1$ and squared $\ell_2$ norms, respectively, and $\rho$ is the mixing ratio ($\ell_1$ ratio).
Once the regression parameters $b_0$ and $b_1$ are learned, we estimate the workload for the next time interval as

$$\alpha_{t+1} = b_0 + b_1 (t+1) \tag{3}$$
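The forecasting step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the regressor is the time index (indexed locally as 0, ..., k-1 within the window, which only shifts $b_0$), penalizes only the slope $b_1$ while leaving the intercept unpenalized (the convention of scikit-learn's `ElasticNet`, whose `alpha` and `l1_ratio` play the roles of λ and ρ), and uses the closed-form coordinate-descent update that exists for a single feature. The function names are hypothetical.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator produced by the l1 part of the penalty."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_trend_forecast(window, lam=0.0, rho=0.5):
    """Fit alpha_i = b0 + b1 * i over the window (i = 0..k-1) and forecast i = k.

    Objective (intercept unpenalized, scikit-learn ElasticNet scaling):
        1/(2k) * sum_i (alpha_i - b0 - b1*i)^2
        + lam*rho*|b1| + lam*(1-rho)/2 * b1^2
    Returns (forecast, b0, b1).
    """
    k = len(window)
    if k < 2:
        raise ValueError("need at least two observations")
    xs = list(range(k))
    x_mean = sum(xs) / k
    y_mean = sum(window) / k
    # With one feature, coordinate descent converges in a single update,
    # so b1 has a closed form on the centered data.
    c = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, window)) / k
    a = sum((x - x_mean) ** 2 for x in xs) / k
    b1 = soft_threshold(c, lam * rho) / (a + lam * (1.0 - rho))
    b0 = y_mean - b1 * x_mean
    # Forecast for the next interval; a workload cannot be negative.
    return max(0.0, b0 + b1 * k), b0, b1
```

For example, a steadily rising window such as [10, 20, ..., 100] with λ = 0 extrapolates to 110, while λ > 0 shrinks the slope and yields a more conservative forecast.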

B. BURST DETECTION METHOD
Cloud-hosted applications serve dynamic and bursty workloads whose bursts are difficult to detect online. There have been some efforts to detect burstiness in workloads using offline techniques [12]-[15]. However, online burst detection can help an autoscaling method minimize response time SLO violations. In this paper, we investigate different techniques, including Sample Entropy, Normalized Entropy, and Fast Fourier Transformation, for detecting bursts in the dynamic workloads of cloud-hosted applications.
Our proposed burst detection technique uses the last k historical workload observations to detect burstiness online in local workload patterns. Equation 4 defines the burst detection function that identifies burstiness at the current time interval t:

$$\hat{x} = \phi\left(\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}\right) \tag{4}$$

where $\phi$ is the burst detection function, $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$ are the last k workload observations used to identify the burstiness, and $\hat{x}$ is the burst detection decision.
A value of $\hat{x} = 0$ indicates that the current workload observations are not bursty, and $\hat{x} = 1$ indicates that they are bursty. We investigate the use of Sample Entropy (SE), Normalized Entropy (NE), and Fast Fourier Transformation (FFT) techniques to detect bursts online, as explained in the following subsections.

1) SAMPLE ENTROPY (SE)
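Sample Entropy is a measure of information that can be used to detect bursts in a signal; it also quantifies the uncertainty and randomness of time-series data. In our proposed method, the last k historical workload observations $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$ are used to detect burstiness. To compute the sample entropy, we first form the sequence vectors $\{\omega_1, \omega_2, \ldots, \omega_{k-m+1}\}$ of length m from the k workload observations, where each sequence vector is $\omega_i = (\alpha_{t-k+i}, \ldots, \alpha_{t-k+i+m-1})$. Once we have the sequence vectors, we use Equation 5 to compute the sample entropy:

$$SE(m, r) = -\ln \frac{\psi^{m+1}(r)}{\psi^{m}(r)} \tag{5}$$

where $\psi^m(r)$ is computed using the following equation:

$$\psi^{m}(r) = \frac{1}{k-m} \sum_{i=1}^{k-m} \psi_i^{m}(r) \tag{6}$$

where

$$\psi_i^{m}(r) = \frac{1}{k-m-1} \left| \{\, j : 1 \le j \le k-m,\ j \ne i,\ d[\omega_i, \omega_j] \le r \,\} \right| \tag{7}$$

and $d[\omega_i, \omega_j]$ is computed using the Chebyshev distance formula [51]:

$$d[\omega_i, \omega_j] = \max_{0 \le l \le m-1} \left| \alpha_{t-k+i+l} - \alpha_{t-k+j+l} \right| \tag{8}$$

$\psi^{m+1}(r)$ is defined analogously using sequence vectors of length m+1. Sample Entropy depends on two parameters, m and r: m is the length of the sequence vectors in the given workload, and r is the deviation tolerance (similarity criterion). A larger value of m and a smaller value of r give sharper peaks in the given workload. In our evaluations, we refer to the online burst detection technique using Sample Entropy as B_SE.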
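A compact Python sketch of sample entropy over one window follows. It counts matching template pairs directly and returns the negative log of the ratio of matching pairs of length m+1 to matching pairs of length m, since the per-template normalization factors cancel in that ratio. Two choices are assumptions of this sketch rather than details given in the text: the tolerance r is expressed as a fraction `r_factor` of the window's standard deviation (a common convention), and the function returns infinity when no matches exist. The text does not state the threshold on SE that produces the decision x̂, so the sketch stops at the entropy value.

```python
import math


def sample_entropy(window, m=2, r_factor=0.2):
    """Sample entropy of a short workload window.

    m        -- length of the sequence vectors (templates)
    r_factor -- tolerance r as a fraction of the window's population
                standard deviation (assumption of this sketch)
    Returns math.inf when no matching pairs exist (SE undefined).
    """
    n = len(window)
    if n < m + 2:
        raise ValueError("window too short for the chosen m")
    mu = sum(window) / n
    r = r_factor * math.sqrt(sum((v - mu) ** 2 for v in window) / n)

    def matches(length):
        # Use the same n - m starting positions for lengths m and m + 1
        # so that the two pair counts are directly comparable.
        templates = [window[i:i + length] for i in range(n - m)]
        count = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                # Chebyshev distance: largest element-wise difference
                d = max(abs(a - b) for a, b in zip(templates[i], templates[j]))
                if d <= r:
                    count += 1
        return count

    b = matches(m)       # similar pairs of length m
    a = matches(m + 1)   # pairs that remain similar at length m + 1
    if a == 0 or b == 0:
        return math.inf
    return -math.log(a / b)
```

A perfectly regular window (constant or strictly periodic) yields 0, whereas a window in which no two templates are within tolerance yields infinity; bursty windows fall between these extremes.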

2) NORMALIZED ENTROPY (NE)
Normalized Entropy is another technique commonly used to detect noise, bursts, and randomness in a given time-series signal. In our proposed autoscaling method, we use normalized entropy to detect burstiness in the last k historical workload observations $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$. Equation 9 computes the Normalized Entropy of the last k workload observations:

$$NE = -\frac{1}{\ln k} \sum_{i=t-k+1}^{t} p_i \ln p_i \tag{9}$$

where $p_i$ is the probability of each workload observation relative to all workload observations in the window, computed using the following equation:

$$p_i = \frac{\alpha_i}{\sum_{j=t-k+1}^{t} \alpha_j} \tag{10}$$

A normalized entropy value close to 0 indicates that the given workload is bursty. We compare the normalized entropy of the current workload observation window with the mean normalized entropy of all previously evaluated windows. In our evaluations, we refer to the online burst detection technique using Normalized Entropy as B_NE.
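The NE computation and the online comparison rule can be sketched in Python as follows. Flagging a window as bursty when its NE falls below the running mean of earlier windows is our reading of the comparison described above, based on the statement that values near 0 indicate burstiness; treating the first window (which has no history) as non-bursty is an assumption. `NormalizedEntropyDetector` is a hypothetical name.

```python
import math


def normalized_entropy(window):
    """Entropy of the share distribution p_i = alpha_i / sum(alpha), divided
    by ln k so the result lies in [0, 1]: 1 means perfectly even load,
    values near 0 mean the load is concentrated in a few intervals."""
    k = len(window)
    total = sum(window)
    if k < 2 or total <= 0:
        raise ValueError("need at least two observations with a positive total")
    h = 0.0
    for a in window:
        p = a / total
        if p > 0:            # 0 * ln 0 is taken as 0
            h -= p * math.log(p)
    return h / math.log(k)


class NormalizedEntropyDetector:
    """Online B_NE sketch: returns 1 (bursty) when the current window's NE is
    below the mean NE of all previously evaluated windows, else 0."""

    def __init__(self):
        self.history = []

    def detect(self, window):
        ne = normalized_entropy(window)
        bursty = bool(self.history) and ne < sum(self.history) / len(self.history)
        self.history.append(ne)
        return 1 if bursty else 0
```

Because the comparison uses a running mean rather than a fixed cutoff, the detector adapts to the typical evenness of each workload without a tuned threshold.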

3) FAST FOURIER TRANSFORMATION (FFT)
Fast Fourier Transformation (FFT) is an algorithm that computes the Discrete Fourier Transformation of a signal, converting it from the time domain to the frequency domain; the inverse transformation converts it back. The technique decomposes a signal into its frequency components so that the behavior of the signal can be analyzed. Some researchers, for example, Shen et al. [14], use FFT to compute the burstiness of offline workloads by treating the workload as a signal.
We use FFT in our proposed autoscaling method to detect workload burstiness in the last k historical workload observations $\{\alpha_t, \alpha_{t-1}, \alpha_{t-2}, \ldots, \alpha_{t-k+1}\}$. First, we compute the Fourier components of the workload signal, which represent the amplitudes of different frequencies. Second, we consider the top 80% of the frequencies as high frequencies and apply the inverse FFT to these high-frequency components; the positive values of the result give the burst density metric. This metric is used to distinguish bursty from non-bursty workloads. We set the high-frequency percentage to 80% because a smaller percentage cannot clearly differentiate between bursty and non-bursty workloads. In our evaluations, we refer to the online burst detection technique using FFT as B_FFT.
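A self-contained Python sketch of the B_FFT burst density follows; it uses a naive O(k²) discrete Fourier transform instead of a library FFT, which is adequate for k = 10. How the 80% is mapped onto frequency bins (here: dropping the lowest 20% of the distinct non-negative frequencies, always including the DC component) and the use of the sum of positive values as the burst density are our interpretations of the description above, and the threshold that turns the density into x̂ is not specified in the text.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (O(k^2), fine for short windows)."""
    k = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * f * n / k) for n in range(k))
            for f in range(k)]


def idft(spectrum):
    """Inverse of dft()."""
    k = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * n / k) for f in range(k)) / k
            for n in range(k)]


def burst_density(window, high_fraction=0.8):
    """Keep the highest `high_fraction` of frequencies, transform back,
    and sum the positive part of the high-frequency signal."""
    k = len(window)
    spectrum = dft(window)
    n_freqs = k // 2 + 1                      # distinct non-negative frequencies
    n_low = max(1, round((1 - high_fraction) * n_freqs))
    for f in range(k):
        # bins f and k - f carry the same frequency for a real-valued signal
        if min(f, k - f) < n_low:
            spectrum[f] = 0
    high_freq_signal = [v.real for v in idft(spectrum)]
    return sum(v for v in high_freq_signal if v > 0)
```

With k = 10 and an 80% high-frequency share, only the mean is removed, so a window with a single spike yields a density equal to the spike's height above the window mean, whereas a flat window yields (numerically) zero.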

C. RESOURCE PREDICTION MODEL
In our proposed system, the resource prediction model predicts the resource instance count n needed to satisfy the response time SLO threshold $\tau_{slo}$ for the forecasted workload $\alpha_{t+1}$. The model takes as input the output of the workload forecasting model, i.e., the estimated future workload $\alpha_{t+1}$, and the user-defined response time SLO threshold $\tau_{slo}$. Equation 11 represents the resource prediction model:

$$n = \delta(\alpha_{t+1}, \tau_{slo}) \tag{11}$$

where $\delta$ is the resource prediction model that predicts the number of resource instances needed to satisfy $\tau_{slo}$ for the forecasted workload $\alpha_{t+1}$. We evaluated different machine learning methods for the resource prediction model, including Linear Regression (LR), Polynomial Regression (PR), Elastic Net (EN) Regression, Ridge Regression, Lasso Regression, XGBoost (XGB) Regression, Random Decision Forests (RDF) Regression, and Decision Tree Regression (DTR). To train the resource prediction model, we used initial performance traces collected from the trace-driven simulation. We conducted a small experiment with reactive autoscaling and a linearly increasing workload to collect the training dataset; the reactive autoscaling method adds a resource instance whenever the response time saturates. The dataset records the response time, the number of workload requests, and the required number of resource instances. We discard all performance traces that show response time saturation. We split the dataset into 80% for training and 20% for testing and select the regression method with the minimum mean squared error (MSE). Figure 4 shows the MSE of different machine learning methods for predicting the resources required to satisfy $\tau_{slo}$. DTR yields the minimum error compared to the other methods.
Therefore, we use DTR in our proposed resource prediction model. DTR is a supervised machine learning technique for regression problems that builds a rule-based decision tree structure. Compared to other machine learning techniques, it can be trained with a smaller dataset and without data normalization. Moreover, DTR can perform better than other tree-based machine learning algorithms, for example, RDF, when the dataset has few features.
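The resource prediction model δ can be illustrated with a minimal CART-style regression tree in pure Python, standing in for a library implementation such as scikit-learn's `DecisionTreeRegressor`. The feature pair (forecasted workload, τ_slo), the greedy split criterion based on the sum of squared errors, the depth limit, and rounding the leaf value up to a whole instance count are assumptions of this sketch; the helper names are hypothetical.

```python
import math
from statistics import mean


def _sse(ys):
    """Sum of squared errors around the mean (the split criterion)."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def build_tree(X, y, max_depth=6, min_samples=2):
    """Greedy regression tree: each split minimizes the total SSE of its children.
    Internal nodes are (feature, threshold, left, right); leaves are means."""
    if max_depth == 0 or len(y) < min_samples or len(set(y)) == 1:
        return mean(y)
    best = None
    for f in range(len(X[0])):
        # every distinct value except the largest is a candidate threshold
        for thr in sorted({row[f] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[f] <= thr]
            right = [i for i, row in enumerate(X) if row[f] > thr]
            cost = _sse([y[i] for i in left]) + _sse([y[i] for i in right])
            if best is None or cost < best[0]:
                best = (cost, f, thr, left, right)
    if best is None:          # all rows identical: nothing to split on
        return mean(y)
    _, f, thr, left, right = best
    return (f, thr,
            build_tree([X[i] for i in left], [y[i] for i in left], max_depth - 1, min_samples),
            build_tree([X[i] for i in right], [y[i] for i in right], max_depth - 1, min_samples))


def predict_instances(tree, workload, tau_slo):
    """delta(alpha_{t+1}, tau_slo): walk the tree and round the leaf value up."""
    x = (workload, tau_slo)
    node = tree
    while isinstance(node, tuple):
        f, thr, left, right = node
        node = left if x[f] <= thr else right
    return max(1, math.ceil(node))
```

Trained on hypothetical (workload, τ_slo) → instance-count pairs, the tree reproduces the piecewise-constant relation between load and instances, which is the kind of structure the text credits DTR with capturing from small, unnormalized datasets.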

D. PROPOSED AUTOSCALING METHOD
Algorithm 1 shows the proposed autoscaling method. The algorithm takes as input the monitoring time interval ($\xi$), the response time SLO threshold ($\tau_{slo}$), the window size (k), the trained resource prediction model ($\delta$), the burst detection method ($\phi$), and the set of resources allocated to the application (R).
We use $\xi$ as the monitoring interval over which incoming user requests are aggregated into one workload observation. For example, $\xi = 60$ aggregates the workload observations over 60-second intervals.

Algorithm 1 Proposed Autoscaling Method
Input: Application monitoring interval ($\xi$), response time SLO threshold ($\tau_{slo}$), workload window size (k), resource prediction model ($\delta$), burst detection method ($\phi$), set of resources allocated to the application (R)
Output: Updated set of resources allocated to the application (R)
burstMode ← false
while true do
    Wait for $\xi$ seconds
    $\alpha_{t+1}$ ← forecast the workload using …
In each iteration, the autoscaling method waits for $\xi$ seconds and then uses the forecasting method to estimate the future workload $\alpha_{t+1}$ from the last k workload observations $\{\alpha_t, \alpha_{t-1}, \ldots, \alpha_{t-k+1}\}$. After workload forecasting, the autoscaling method uses the resource prediction model $\delta$ to predict the number of resource instances required to satisfy the response time SLO threshold $\tau_{slo}$ for the forecasted workload $\alpha_{t+1}$. Then, the autoscaling method uses the burst detection method $\phi$ to detect burstiness. Once the burst decision $\hat{x}$ is made, the system determines the number of resource instances $n_{t+1}$ required to satisfy $\tau_{slo}$. Finally, the predicted number of resource instances $n_{t+1}$ is used to scale out or scale in the application resources R dynamically.
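The text states that the burst decision and the predicted instance count jointly determine the scaling action, but the recovered part of Algorithm 1 does not show how. The following Python sketch therefore uses a hypothetical policy, labeled as such in the code: while a burst is detected it only allows scale-out, and otherwise it follows the prediction of δ. The instance limits (1 to 12, borrowed from the testbed capacity in Section IV-E) and the function names are also assumptions.

```python
def scaling_decision(current, predicted, bursty, min_instances=1, max_instances=12):
    """Return the target instance count n_{t+1} for the next interval.

    HYPOTHETICAL policy: during a detected burst, never scale in (keep
    capacity, scale out if delta asks for more); otherwise follow delta.
    """
    target = predicted
    if bursty:
        target = max(current, predicted)   # hold capacity during a burst
    # clamp to the available capacity
    return max(min_instances, min(max_instances, target))


def scale_action(current, target):
    """Translate a target count into a scale-out / scale-in / no-op action."""
    if target > current:
        return "scale-out", target - current
    if target < current:
        return "scale-in", current - target
    return "no-op", 0
```

Suppressing scale-in during bursts is one simple way to avoid the allocate/deallocate oscillation described in the introduction; other policies (for example, adding headroom during bursts) would fit the same interface.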
In the experimental evaluations, we used a response time SLO threshold $\tau_{slo}$ = 200 ms, a monitoring interval $\xi$ = 60 seconds, and a workload window size k = 10. A large value of $\xi$ slows down the reaction of the autoscaling method, because it waits longer before making a decision, whereas a smaller value of $\xi$ enables it to react quickly and make resource management decisions sooner. Moreover, a large value of $\xi$ ignores small bursts in the workload. Therefore, we use $\xi$ = 60 seconds to monitor the application before the autoscaling method is triggered. The value of $\tau_{slo}$ is also important: a lower $\tau_{slo}$ is more challenging for the autoscaling method to satisfy, whereas a larger $\tau_{slo}$ is easier to ensure for web applications. We set $\tau_{slo}$ = 200 ms, which is a reasonable response time SLO for a typical web application. The value of k also affects workload forecasting and burst detection. A smaller k may not be sufficient for the forecasting model and the burst detection method, whereas a large k ignores local and small bursts. We use k = 10, which is reasonable for training the forecasting model and the burst detection method.

IV. EXPERIMENTAL SETUP AND DESIGN
We evaluate the proposed autoscaling method through a trace-driven simulation for a benchmark application and validate the simulation results through experiments on a real testbed infrastructure. We compare the effectiveness of the proposed solution with multiple existing baseline autoscaling methods under different synthetic and real workloads. The following subsections explain the benchmark application, the workloads used in the evaluation, the baseline autoscaling methods, the trace-driven simulation environment, and the validation infrastructure and experiments.

A. BENCHMARK APPLICATION
Many applications now embed intelligence using Machine Learning (ML) models, which need historical data for training before they can provide intelligence in the application. Therefore, to emulate a machine learning workload, we used a regression-based supervised machine learning algorithm (Support Vector Regression, SVR) trained on IoT data of size 200K. The dataset consists of time-series temperature data captured by IoT devices. The benchmark application is a web application that processes each request by first training the SVR model and then returning the predicted temperature for the next interval as HTML output. This benchmark application emulates a typical CPU-bound workload.

B. WORKLOADS
In our experimental evaluations, we used four synthetic and five real workloads. Figure 5 shows synthetic workloads 1 to 4 (W1, W2, W3, and W4), which reflect different burstiness behaviors. Figure 6 shows the real workloads used in our experimental evaluation: the FIFA World Cup traces [52], Wikipedia web traces [53], University of Calgary web server traces, NASA Kennedy Space Center web server traces, and ClarkNet web server traces. All of these real workload traces are publicly available.¹

C. BASELINE AUTOSCALING
We compare the proposed autoscaling method with two state-of-the-art baseline autoscaling methods, namely React and Predict. React [54] is a reactive autoscaling method that uses a predefined set of rules to scale out and scale in the application resources automatically. React scales out the application resources whenever the application response time exceeds the user-defined threshold, and it scales in whenever the response time over the last three time intervals falls below half of the user-defined threshold. Predict [36] is a recent predictive autoscaling method that uses a resource provisioning model to identify the number of resources required to satisfy the response time requirements for a forecasted workload and adjusts the resources accordingly.
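The React rules described above translate directly into a small state machine. In this Python sketch, the one-instance step size and the instance bounds are assumptions, and the class name is hypothetical.

```python
from collections import deque


class ReactAutoscaler:
    """Rule-based sketch of the React baseline: scale out when the response
    time exceeds the threshold; scale in when each of the last three
    interval response times is below half of the threshold."""

    def __init__(self, threshold_ms=200.0, min_instances=1, max_instances=12):
        self.threshold = threshold_ms
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.recent = deque(maxlen=3)   # response times of the last three intervals

    def step(self, response_time_ms, instances):
        """Return the instance count after observing one interval."""
        self.recent.append(response_time_ms)
        if response_time_ms > self.threshold:
            return min(self.max_instances, instances + 1)   # scale out by one (assumed step)
        if len(self.recent) == 3 and all(rt < self.threshold / 2 for rt in self.recent):
            return max(self.min_instances, instances - 1)   # scale in by one (assumed step)
        return instances
```

Because React only responds after the threshold is crossed, every burst produces at least one interval of SLO violations before capacity is added, which is the gap the predictive, burst-aware method targets.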

D. TRACE-DRIVEN SIMULATION
We evaluate the proposed autoscaling method using the trace-driven simulation presented in our recent work [36]. For the trace-driven simulation, we model the response time of the benchmark application. We then use this model to determine the application's behavior under different workloads while the autoscaling algorithms manage the allocated resources dynamically. We evaluated the three burst detection techniques explained in Section III-B with our proposed autoscaling method. We evaluated the proposed solution under nine different workloads and compared it with two state-of-the-art autoscaling methods.

E. VALIDATION INFRASTRUCTURE AND EXPERIMENT
To validate our proposed autoscaling method, we performed experiments on a Docker Swarm testbed cluster [55]. The testbed cluster consists of four Core i7 physical machines, each with an 8-core CPU, 16 GB of physical memory, and a 2 TB disk. We configured three physical machines as Docker Swarm worker nodes and one physical machine as the Docker Swarm manager node. The testbed cluster can run a maximum of 12 container instances simultaneously, each with one CPU core and 2 GB of physical memory.
To validate the proposed solution on the real testbed, we used the ClarkNet workload to study the effectiveness of the proposed autoscaling method. We compared the results with the React and Predict baseline autoscaling methods. For the validation experiment, we generated the workload for the benchmark application using httperf [56].

F. EVALUATION METRICS
To evaluate the proposed method and compare it with the baseline methods, we compute the total processed requests, response time SLO violations, and number of scale operations for each experiment.
i. Total processed requests is the number of requests served by the application for a given workload. An autoscaling method that yields more processed requests is considered better, as it serves the maximum possible number of requests and rejects fewer requests than the other autoscaling methods.
ii. Response time SLO violations are the number of requests whose response time exceeds the expected response time threshold. An autoscaling method that yields fewer SLO violations is considered better.
iii. The number of scale operations is the number of scaling decisions the autoscaling method performs to dynamically add or remove resources. An autoscaling method that yields fewer scale operations is considered better, as it incurs less scaling overhead and less oscillation in resource allocation.
Table 2 summarizes the experimental results of React, Predict, and the proposed autoscaling method using the B_SE, B_NE, and B_FFT burst detection techniques under different synthetic and realistic workloads. For each experiment, we computed the total number of processed requests, the total number of response time SLO violations, and the total number of scale operations for all workloads and autoscaling methods. The autoscaling method that yields the maximum number of processed requests, the minimum number of SLO violations, and the minimum number of scale operations is considered the best. The experimental evaluations show that the proposed autoscaling method using burst detection techniques outperforms the baseline autoscaling methods. Overall, the proposed autoscaling method using the B_SE technique performs best, minimizing SLO violations and scale operations while maximizing the number of processed requests.
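The three evaluation metrics defined above can be computed from a per-request log and the per-interval instance counts. In this minimal sketch, the `RequestRecord` type is hypothetical, and we assume that only served requests can violate the SLO and that any change in the instance count between consecutive intervals counts as one scale operation.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RequestRecord:
    """Hypothetical per-request log entry."""
    served: bool             # False if the request was rejected or dropped
    response_time_ms: float  # only meaningful when served is True


def total_processed(records: Sequence[RequestRecord]) -> int:
    """Number of requests served by the application."""
    return sum(1 for r in records if r.served)


def processed_percent(records: Sequence[RequestRecord]) -> float:
    """Processed requests as a percentage of all incoming requests."""
    return 100.0 * total_processed(records) / len(records)


def slo_violations(records: Sequence[RequestRecord], threshold_ms: float) -> int:
    """Served requests whose response time exceeds the SLO threshold."""
    return sum(1 for r in records if r.served and r.response_time_ms > threshold_ms)


def scale_operations(instances_per_interval: Sequence[int]) -> int:
    """Count intervals in which the instance count changed (one decision each)."""
    pairs = zip(instances_per_interval, instances_per_interval[1:])
    return sum(1 for prev, cur in pairs if cur != prev)
```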

V. EXPERIMENTAL RESULTS
To explain the effectiveness of the proposed autoscaling method, we show box plots of the application response time for all experiments. Figures 7 and 8 show the box plots of response time for each autoscaling method under the synthetic and real workloads, respectively. The figures show that the application response time is lower with burst detection techniques than with the autoscaling methods that lack burst detection. Moreover, the Sample Entropy-based burst detection technique (B_SE) performs better than the other two burst detection techniques at minimizing response time violations. Figure 9 shows the application response time and the number of allocated resource instances in each time interval for the different autoscaling methods under the W4 synthetic workload, with 200 ms as the response time SLO threshold. The figure shows that response time SLO violations are higher with the baseline methods (React and Predict) than with autoscaling using burst detection techniques, and that B_SE outperforms the other autoscaling methods in minimizing response time SLO violations.
Percentage of the total number of processed requests, percentage of total SLO violations, and the total number of scale operations for all workloads using the baselines and the proposed autoscaling method with the B_SE, B_NE, and B_FFT burst detection techniques.
ClarkNet and NASA workloads respectively. Moreover, B_SE yields 0.84×, 0.97×, 0.6×, 0.98×, and 0.98× fewer response time SLO violations than the reactive autoscaling method for the Calgary, World Cup, Wikipedia, ClarkNet, and NASA workloads, respectively. The number of scale operations is also reduced by 0.9×, 0.8×, 0.7×, 0.9×, and 0.9× for B_SE compared with the reactive autoscaling method under the Calgary, World Cup, Wikipedia, ClarkNet, and NASA workloads, respectively.
To study the cost of the different autoscaling methods, we used a cost of $0.0014 per minute for each instance, similar to an AWS c5.large instance. Figure 12 shows the cost of the autoscaling methods relative to the React baseline method for the synthetic workloads. B_SE incurs 58%, 186%, 67%, and 76% more cost than the reactive autoscaling method for the W1, W2, W3, and W4 workloads, respectively. B_NE incurs 6%, 30%, 11%, and 1% more cost than the baseline for the W1, W2, W3, and W4 workloads, respectively. B_FFT incurs 13% and 4% more cost than the reactive method for the W2 and W3 workloads, whereas for W1 and W4 it incurs 1% and 9% less cost than the reactive baseline method. Figure 13 shows the cost of the autoscaling methods relative to the React baseline method for the real workloads. B_SE incurs 58%, 31%, 74%, 85%, and 147% more cost than the reactive autoscaling method for the World Cup, Wikipedia, Calgary, ClarkNet, and NASA workloads, respectively. B_NE incurs 58%, 12%, 72%, 67%, and 145% more cost than the baseline for the same workloads, respectively, and B_FFT incurs 58%, 5%, 22%, 19%, and 43% more cost than the baseline for the World Cup, Wikipedia, Calgary, ClarkNet, and NASA workloads, respectively.
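The cost comparison reduces to instance-minutes multiplied by the per-minute price, followed by a percentage difference against React. A minimal sketch, assuming a fixed simulation interval length (`interval_minutes`, hypothetical here) and the $0.0014 per instance-minute price stated above:

```python
from typing import Sequence

# USD per instance-minute, as used in the cost study (similar to AWS c5.large).
PRICE_PER_INSTANCE_MINUTE = 0.0014


def run_cost(instances_per_interval: Sequence[int],
             interval_minutes: float = 1.0,
             price: float = PRICE_PER_INSTANCE_MINUTE) -> float:
    """Total cost: instance-minutes consumed over the run times the per-minute price."""
    instance_minutes = sum(instances_per_interval) * interval_minutes
    return instance_minutes * price


def relative_cost_percent(method_cost: float, react_cost: float) -> float:
    """Percentage cost difference relative to React (positive means more expensive)."""
    return (method_cost - react_cost) / react_cost * 100.0
```

For instance, a method that keeps three instances in every interval where React keeps two costs 50% more under this model.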
To summarize the effectiveness of the proposed solution, we computed the average percentage change relative to the React baseline method over the results of all workloads, using B_SE as the burst detection technique in the proposed autoscaling method. Figure 14 shows the average percentage relative to the React baseline method for the Predict and proposed autoscaling methods. The proposed solution shows a 9.65% average increase in the total number of processed requests compared with the React baseline method. Moreover, across all workloads, it yields on average 60.8% fewer response time SLO violations and 61.0% fewer scale operations than the React baseline method.
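One way to read the average percentage relative to React is as the mean of per-workload percentage changes. The sketch below assumes that definition (the text does not state whether per-workload percentages or totals are compared); negative values indicate a reduction, so fewer SLO violations appear as a negative percentage.

```python
from statistics import mean
from typing import Mapping


def relative_change_percent(value: float, baseline: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


def average_relative_change(method: Mapping[str, float],
                            react: Mapping[str, float]) -> float:
    """Mean per-workload percentage change of a metric relative to React."""
    return mean(relative_change_percent(method[w], react[w]) for w in react)
```

With this definition, halving the SLO violations on every workload gives an average of -50%, regardless of how large each workload is.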

A. VALIDATION ON CONTAINERIZED TESTBED
To validate our trace-driven simulation results, we performed experiments on a containerized testbed cluster, as explained in Section IV-E. We compare the proposed autoscaling method using the B_SE burst detection technique with both baseline autoscaling methods under the ClarkNet real workload traces. Table 3 shows the percentage of total processed requests, the percentage of response time SLO violations, and the scale operations count during the validation experiments using React, Predict, and the proposed autoscaling method for the benchmark application. The proposed autoscaling method yields more processed requests, minimizes the response time SLO violations, and uses fewer scale operations to satisfy the response time requirements than the baseline methods. Figure 15 compares the total number of processed requests, response time SLO violations, and scale operations in the validation experiments for the proposed and baseline autoscaling methods. The proposed autoscaling method outperforms the baseline methods by minimizing the SLO violations, maximizing the number of processed requests, and minimizing the scale operations. With the proposed autoscaling method, the number of processed requests increases by 0.10× compared with reactive autoscaling. The proposed autoscaling method yields 0.95× fewer response time SLO violations and 0.30× fewer scale operations than the React baseline autoscaling method.

1) HYPOTHESIS TESTING
To assess the statistical significance of the obtained results, we designed two sets of hypotheses. The first set of hypotheses concerns the SLO violations of the proposed and baseline algorithms.
The second set of hypotheses concerns the feature Processed Requests. The level of significance is set at α = 0.05. We reject the null hypothesis (H0a or H0b) for a specific feature when the p-value of the paired t-test is less than 0.05. The null hypothesis H0a of the first set is rejected for the average SLO violations count, as its p-value is less than 0.05 (Table 4). This means that the average SLO violations count of the baseline methods is greater than that of the proposed method. We also reject the null hypothesis H0b of the second set for the feature Processed Requests against the React baseline, as its p-value is less than 0.05 (Table 4); that is, the average number of processed requests of React is less than that of the proposed algorithm. On the other hand, we fail to reject H0b against the Predict baseline, as its p-value is greater than 0.05, meaning that there is no significant difference between Predict and the proposed method for the feature Processed Requests. Overall, the hypothesis tests support the superiority of our method over both baselines for SLO violations, and over React for processed requests.
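For readers who want to reproduce this kind of analysis, the following is a stdlib-only sketch of a one-sided paired t-test. The one-sided alternative (baseline mean greater than proposed mean) is an assumption inferred from how the results are interpreted above, and the p-value is obtained by numerically integrating the Student's t density with Simpson's rule rather than by calling a statistics library.

```python
import math
import statistics
from typing import Sequence, Tuple


def _t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t: float, df: int, steps: int = 10_000) -> float:
    """P(T > t) for Student's t, via Simpson's rule on [0, |t|] plus symmetry."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    area = _t_pdf(0.0, df) + _t_pdf(a, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * _t_pdf(i * h, df)
    area *= h / 3  # area now approximates P(0 < T < |t|)
    return 0.5 - area if t >= 0 else 0.5 + area


def paired_t_test_greater(baseline: Sequence[float],
                          proposed: Sequence[float]) -> Tuple[float, float]:
    """One-sided paired t-test; H1: mean(baseline - proposed) > 0."""
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [b - p for b, p in zip(baseline, proposed)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t_stat, t_upper_tail(t_stat, n - 1)
```

For SLO violations, the baseline counts per workload go first, e.g. `paired_t_test_greater(react_violations, proposed_violations)`; for Processed Requests the arguments swap roles, because the alternative is that the proposed method processes more requests. The variable names here are hypothetical.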

B. DISCUSSION
Typical autoscaling methods dynamically provision application resources without explicitly considering burstiness in the incoming workload. Bursty workloads degrade application performance significantly, even when traditional autoscaling methods are in place. The proposed autoscaling method detects bursts in the workload automatically and then uses the burst detection decision to manage the application resources and minimize response time SLO violations. Our solution identifies frequently changing bursty traffic, then disables scale-in decisions and performs only predictive scale-out operations during the burst. This reduces oscillation in scale operations and helps minimize response time SLO violations. Compared with existing autoscaling methods, the proposed method enhances application performance by maximizing the number of processed requests and minimizing response time SLO violations under real and synthetic bursty workloads.
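The scale-in suppression described above can be sketched as a thin wrapper around a predictive provisioning step. This is a hypothetical illustration: the proposed method derives the required instances from a response-time model [36], whereas here a simple capacity estimate (`forecast_rps / per_instance_capacity_rps`) stands in for that model, and the `bursty` flag would come from a burst detector such as B_SE.

```python
import math


def burst_aware_decision(current_instances: int,
                         forecast_rps: float,
                         per_instance_capacity_rps: float,
                         bursty: bool,
                         min_instances: int = 1,
                         max_instances: int = 12) -> int:
    """Return the next instance count; scale-in is suppressed while a burst is detected."""
    # Hypothetical stand-in for the predictive provisioning model.
    required = math.ceil(forecast_rps / per_instance_capacity_rps)
    required = min(max(required, min_instances), max_instances)
    if bursty:
        # During a burst: allow predictive scale-out only, never scale in.
        return max(current_instances, required)
    # Outside bursts: follow the predictive estimate in both directions.
    return required
```

Keeping the allocation at its current level while the detector reports a burst is what prevents the scale-in/scale-out oscillation that a purely predictive method would show on a rapidly fluctuating workload.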
In the validation experiment on the containerized testbed, the proposed autoscaling method increased the number of processed requests by 0.10×, reduced response time SLO violations by 0.95×, and used 0.30× fewer scale operations compared with the React baseline. The proposed autoscaling method can therefore help handle bursty workloads more gracefully than existing autoscaling methods. Moreover, across all workloads, the proposed method yields an average 60.8% decrease in the number of SLO violations compared with the React baseline method by efficiently handling burstiness.

VI. CONCLUSION AND FUTURE WORK
Nowadays, cloud-hosted applications face dynamic and bursty workloads, and autoscaling methods are used to maintain their performance. However, bursty workloads degrade application performance because existing autoscaling methods do not explicitly handle bursts. In this paper, we proposed an efficient predictive autoscaling method that is capable of detecting bursts in dynamic workloads. The detected bursts are incorporated into the autoscaling method to satisfy specific response time SLO requirements. Our extensive evaluation using simulations, followed by validation experiments on a real testbed, shows the effectiveness of the proposed method, which outperforms existing state-of-the-art autoscaling methods. The proposed solution will help ensure application performance in the presence of bursty workloads.
Currently, we are extending our work by investigating the use of multi-objective optimization techniques to ensure response time SLO requirements with minimal cost. In the future, we also plan to improve the proposed method by predicting the bursts to eliminate the chances of response time SLO violations.