Gaussian Process Regression Ensemble Model for Network Traffic Prediction

Network traffic prediction is essential for network optimization and resource management. However, designing an efficient predictive model that considers different traffic characteristics, including periodicity, nonlinearity, and nonstationarity, is challenging. Recently, ensemble learning has been attracting much attention from researchers in the machine learning community. Although ensemble learning has shown exceptional performance in modelling intricate problems, it may not be able to handle the varying patterns and chaotic behaviour that are typical properties of traffic data (and many other time-series problems). For this reason, ensemble methods show limited prediction accuracy in network traffic modelling. We address this issue by proposing an ensemble of learners for time-series prediction that considers the accuracy of individual learners as well as the diversity among their outcomes. Each learner contributes to the optimization process by finding the optimal accuracy-diversity balance in a segment of the feature space. This divide-and-conquer approach avoids complicated objective functions with many local optima while fitting the ensemble model on large datasets. Experimental results on real traffic traces show our proposed method outperforms other state-of-the-art predictors by 12% on average in prediction accuracy across different datasets.


I. INTRODUCTION
Network traffic prediction is an efficient tool for improving proactive resource scheduling and traffic engineering. It has been utilized to solve various problems such as resource management, mobile data offloading, data-center traffic management, etc. Traffic prediction is challenging because the behaviour of network traffic is affected by many factors including users' behaviour, network protocols, topology, and management policies. Many predictive models have been proposed based on different algorithms including neural networks [1], kernel-based methods [2], time-series models [3], etc. They rely mainly on a single learner to train and forecast the traffic data. Although those models are efficient for specific types of network traffic, they lack the flexibility and generalization to capture the complex and varying behaviour exhibited in traffic time-series. This work investigates a new approach based on ensemble learning to address this shortcoming.
Ensemble learning [4] is a recent direction of research in machine learning in which a group of learners is trained and combined to increase the accuracy and generalization of the prediction. The base learners in an ensemble model can be the same type of machine learning algorithm or different types of algorithms. The learners are trained on the same training set or on different subsets of the training set. The outcome of the ensemble model for an input sample is calculated by combining the prediction results of the individual learners on this sample. Combining techniques include voting (for classification), weighted sum (for regression), etc. Many ensemble models have been proposed for classification [5], regression [6], online learning [7], and learning concept drift [8]. Nevertheless, there are a limited number of ensemble models for time-series prediction [9]. In this work, we propose a new ensemble model for time-series prediction and apply it to the complex time-series of network traffic.
An ensemble of learners outperforms any of its members if the individual learners are both diverse and accurate [4]. The accuracy of a learner is measured based on the difference between the learner's prediction and the actual target value. Accuracy is a necessary condition to prevent poor learners from obtaining the majority of votes. Diversity measures the discrepancy among the outputs of the learners. It is required to ensure the learners make uncorrelated errors and achieve better generalization performance [4]. In a diverse ensemble, the learners rectify each other's prediction errors. Maintaining accuracy and diversity is the cornerstone of ensemble learning. Existing ensemble models employ various techniques to increase these two factors. For example, accuracy has been increased by exploiting effective features, finding an appropriate lag (in time-series prediction) [10], and finding optimal parameters for base learners [9]. Diversity has been increased using various methods such as bagging, boosting, and employing different base learners [4].
Accuracy and diversity are two conflicting objectives [11]. The increase in diversity among learners can lead to a decrease in their accuracy. Finding the optimal balance between accuracy and diversity improves the ensemble prediction performance. Despite this relationship, many ensemble algorithms employ separate techniques to enhance both objectives independently. They do not consider the relationship between accuracy and diversity as two conflicting objectives. Therefore, they cannot guarantee the optimal balance between accuracy and diversity to minimize the ensemble prediction error and maximize its generalization. To address this shortcoming, the trade-off between accuracy and diversity must be reflected in the training phase of the ensemble model. This idea forms the basis of our approach to ensemble modelling.
In this work, we propose an ensemble model for traffic time-series prediction which optimizes both the accuracy of learners and diversity between them. During the training phase of the proposed model, the accuracy of learners and diversity between their outputs are adjusted as two parameters to find the global optimum point which minimizes the ensemble prediction error. The control variables that optimize the balance between accuracy and diversity are (i) the number of base learners, (ii) their parameters, and (iii) their training sets. The training phase determines the number of required learners in the ensemble, values of their parameters, and the samples that must be assigned to the training set of each learner.
We designed a divide-and-conquer approach to distribute the process of finding the optimal accuracy-diversity balance among the base learners. It reduces the computational requirements of the training phase and improves the training performance for large datasets. In this approach, each base learner provides a trade-off between accuracy and diversity in a small region of the feature space. The feature space is first segmented into multiple subspaces, each of which is assigned to a base learner. Then, each base learner provides the balance between accuracy and diversity in a local area of the feature space. In other words, each base learner maintains the balance between two terms: (i) accuracy on its corresponding segment, and (ii) diversity on the adjacent segments.
We employ Gaussian Process Regression (GPR) as the base learner in our ensemble model. GPR is a kernel-based Bayesian method and a powerful tool for regression and function approximation [12]. In prior work, we showed GPR can handle different traffic characteristics including long/short range dependencies (LRD/SRD), self-similarity, periodicity [13], and multiscale behaviour [14]. The standard GPR is not appropriate to be used in the proposed ensemble model because the Gaussian likelihood in GPR considers only the prediction accuracy. We changed the training phase of GPR to make it compatible with the requirements of our ensemble model by introducing a new likelihood function which considers both accuracy and diversity.
The proposed model has two advantages over the existing time-series ensemble models. First, it optimizes the accuracy-diversity trade-off to improve the prediction performance. Second, it is appropriate for large datasets. The main contributions of the proposed model are as follows.
• Our model considers both accuracy and diversity as two conflicting objectives and finds the optimal balance between them to minimize the ensemble prediction error. This is the main focus of this work.
• We propose a divide-and-conquer approach to optimizing the balance between accuracy and diversity during the training phase. In this approach, each learner considers a small portion of the training samples, in one region of the feature space, during the accuracy-diversity optimization. Compared to approaches in which each learner enhances the accuracy-diversity balance considering the whole training set, our approach leads to a simpler objective function for optimizing the parameters of the base learners. It resolves the issues related to the complexity of the problem and allows the model to achieve the accuracy-diversity balance for large datasets in a reasonable time.
• We propose a new GPR ensemble likelihood function which improves the standard GPR to satisfy the requirements of our ensemble model.
• We compare our ensemble model with well-known time-series prediction algorithms such as Long Short-Term Memory (LSTM), autoregressive integrated moving average (ARIMA), fractional autoregressive integrated moving average (FARIMA), Support Vector Regression (SVR), Least absolute shrinkage and selection operator (LASSO), Gradient Tree Boosting (GTB), Random Forest (RF), and Extremely Randomized Trees (ERT) using real traffic datasets.

The remainder of the paper is organized as follows. Section II reviews prior research related to this work. The three main components used in the proposed model (i.e., the concept of ensemble learning, GPR, and the clustering algorithm) are explained in Section III. The phases of the proposed GPR ensemble model are detailed in Section IV. Section V presents our experimental results. Finally, the conclusion is drawn, and we outline the future work.

II. RELATED WORK
Network traffic prediction has been utilized to solve various problems. For example, it has been used in proactive resource management [15], mobile data offloading [16], data-center traffic management [17], optimizing inter-datacenter traffic flows [18], and forecasting the demands of big-data applications [19]. State-of-the-art traffic predictors are based on different machine learning algorithms including neural networks [1], the Wavelet transform [18], kernel-based methods [2], time-series analysis [3], and LASSO [20]. ARIMA is a class of statistical models for analyzing and forecasting time-series data that has been used for SRD traffic modelling and prediction [21]. FARIMA is a generalization of the ARIMA model in which non-integer values of the differencing parameter are allowed, so that it can capture LRD traffic [22]. LSTM is the result of recent advances in deep learning algorithms. It is a type of Recurrent Neural Network (RNN) [23] and is well known for its performance in time-series prediction. To the best of our knowledge, all existing traffic models and predictors are based on an individual learner.
In [13], we proposed a GPR-based traffic predictor by designing a covariance function to handle different traffic characteristics such as LRD/SRD, self-similarity, and periodicity. It achieved remarkable results compared to other time-series forecasting methods. However, the standard GPR has two main limitations. First, the computational requirements of GPR scale cubically with the number of training samples [24], which is the consequence of the inversion of the covariance matrix. Second, using a stationary covariance function, GPR lacks the flexibility to model nonstationary data [25]. Both issues are fixed in this work. The problem of cubic time complexity is fixed because each GPR expert is trained over a limited subset of the samples. Thus, the inversion of a large covariance matrix in the standard GPR is substituted with the inversion of multiple small matrices. Also, using multiple GPR experts can easily handle the nonstationarity [25]. These improvements are the result of using a group of GPR learners in an ensemble model instead of an individual GPR.
In this work, we propose a divide-and-conquer approach to improve GPR's time complexity and optimize the accuracy-diversity balance. A good example of the divide-and-conquer approach for reducing GPR time complexity is presented in [26]. The authors studied efficient global optimization (EGO), which employs GPR (also known as Kriging) as a surrogate model to approximate the objective function. Since GPR's time complexity is a real bottleneck in EGO, they proposed a GPR approximation called Cluster Kriging that splits a huge dataset into several small clusters and improves the GPR time complexity. Another example is a gradient boosting algorithm for approximating Gaussian process regression (VAGR) proposed in [27]. VAGR sequentially creates random training subsets and approximates the full Gaussian process regression model using the residuals computed from variance-adjusted predictions. VAGR has a time complexity of O(nm³) for a training dataset of size n and a chosen batch size m.
Recently, ensemble learning has drawn the attention of researchers thanks to its promising ability to resolve complex problems. Existing ensemble models employ different base learners including neural networks [28], support vector machines (SVM) [29], nearest neighbour classifiers [5], and also hybrids of various classifiers [30]. There are well-known ensemble models based on decision trees including Gradient Tree Boosting (GTB), Random Forest (RF), and Extremely Randomized Trees (ERT). GTB is an ensemble of decision trees based on the boosting approach [31]. The RF algorithm is an ensemble of decision trees and achieves diversity using random splits of the dataset [32]. ERT is an extension of RF in which the randomness of the subsets has been increased [33]. Ensemble models combine a set of learners to solve different problems such as classification [5], regression [6], online learning [7], and learning concept drift [8]. A small portion of studies on ensemble learning focus on time-series prediction. In [34], a large number of RNNs have been generated by training on different sets of examples. An ensemble of support vector regression (SVR) models has been designed for prognostics of time-series data in [6].
Accuracy and diversity are two essential criteria that determine the performance of ensemble learning. Unfortunately, the majority of existing time-series ensemble models consider either accuracy or diversity only [9]. Few models consider both criteria. Recently, a layered ensemble architecture (LEA) has been proposed in [9] to address this problem. In LEA, accuracy is obtained by finding an appropriate time lag, and diversity is mainly achieved using different training sets for the learners. Unfortunately, LEA does not model a trade-off between accuracy and diversity as two conflicting objectives. Instead, it uses separate techniques to increase both factors independently. This approach cannot guarantee the optimal trade-off between the accuracy and diversity of the individual learners.
In [11], accuracy and diversity have been formulated to create the multiobjective deep belief networks ensemble (MODBNE). Each learner in MODBNE refines the errors of the other learners at the cost of reducing its own accuracy. In MODBNE, the objective is a function of the output of all the learners for all the training samples. MODBNE is appropriate for small training sets because its complexity increases significantly with the number of samples and the number of learners. The objective function in MODBNE measures the accuracy and diversity considering all the training samples and base learners. For large datasets, it leads to a complicated function with many local optima and prevents the ensemble from finding the optimal balance. In our time-series ensemble model, the accuracy-diversity trade-off is calculated locally for each learner, which avoids these complexity issues. Each learner considers the samples in a small region of the feature space during the accuracy-diversity optimization. Since the complexity of the objective function (for training each learner) is not affected by the size of the training set, our approach is appropriate for large datasets.

III. BACKGROUND
This section describes the main building blocks of the model including ensemble learning, Gaussian process regression (GPR), and Dirichlet process (DP) clustering.
A. ENSEMBLE LEARNING
An ensemble learning system consists of a set of individual learners, where each learner provides an estimate of the target variable. The system predicts the target variable by combining the results of all the individual learners. There are three main steps in building an ensemble learning system: (i) ensemble generation, (ii) training each model, and (iii) combining the results [35]. The learners can be the same or different types of machine learning algorithms. The success of the ensemble learning system depends on the accuracy of the learners and the diversity among their results [8].
Accuracy is measured based on different metrics such as mean squared error (MSE), normalized mean squared error (NMSE) [13], etc. In an ensemble model with M learners f_m, the diversity of learner f_m over N data points x_i can be measured as:

$$\mathrm{div}_m = \frac{1}{N}\sum_{i=1}^{N}\big(f_m(x_i) - \bar{f}_{\neg m}(x_i)\big)^2, \qquad (1)$$

where f̄_¬m(x_i) is the average of the outcomes of the other experts in the ensemble:

$$\bar{f}_{\neg m}(x_i) = \frac{1}{M-1}\sum_{\substack{j=1\\ j \neq m}}^{M} f_j(x_i). \qquad (2)$$

Equation (1) is the basis for negatively correlated ensemble learning [35]. It measures the difference between the prediction result of learner f_m and the average of the outcomes of the other experts in the ensemble.
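To make the measure concrete, the following minimal Python sketch computes Equations (1) and (2) for a small ensemble. The function and variable names are ours, and the leave-one-out-mean form follows the reconstruction above.

```python
import numpy as np

def diversity(preds):
    """Per-learner diversity of Eq. (1): the mean squared distance between
    learner m and the average of the other M-1 learners (Eq. (2)).

    preds: array of shape (M, N) -- predictions of M learners on N points.
    """
    M, N = preds.shape
    total = preds.sum(axis=0)                 # sum over learners, shape (N,)
    fbar_others = (total - preds) / (M - 1)   # leave-one-out mean, shape (M, N)
    return ((preds - fbar_others) ** 2).mean(axis=1)

# Example: three learners on five points; learner 3 deviates most
preds = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                  [1.1, 2.1, 2.9, 4.2, 5.1],
                  [0.5, 2.5, 3.5, 3.5, 4.5]])
print(diversity(preds))
```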

B. GAUSSIAN PROCESS REGRESSION (GPR)
We used GPR as the base learner in our ensemble model. By definition, a Gaussian process (GP) is a set of random variables, any finite subset of which has a joint Gaussian distribution. Gaussian Process Regression (GPR) [12] is a supervised machine learning algorithm that provides the mapping function between the input X = {x_i} and the (continuous) output Y = {y_i}. Consider n pairs of input and noisy output observations, D = {(x_i, y_i) | i = 0, 1, 2, ..., n − 1}, and the unknown mapping function f(x_i):

$$y_i = f(x_i) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2), \quad f(x) \sim \mathcal{GP}\big(m(X),\, k(x_i, x_j; \theta)\big), \qquad (3)$$

where m(X) denotes the mean function and k(x_i, x_j; θ) is an arbitrary covariance function with hyperparameters θ. We employ the Rational Quadratic (RQ) covariance function defined as [13]:

$$k_{RQ}(x_i, x_j) = s^2\left(1 + \frac{\lVert x_i - x_j\rVert^2}{2\,\alpha_c\, l^2}\right)^{-\alpha_c}, \qquad (4)$$

with the set of hyperparameters θ = {α_c, l, s}. In [13], we showed that the RQ covariance function can handle traffic characteristics such as LRD/SRD and periodicity. Periodicity is a traffic pattern that is consistently observable at different time scales of traffic data. Thus, using periodic covariance functions to model such behaviour is a well-known approach. For example, in [13], we proposed a semi-periodic self-similar (SPSS) covariance function for GPR, capturing LRD/SRD and patterns of periodicity in traffic data. In this work, however, the periodic pattern may not be perceptible by individual learners because the data samples are clustered into smaller subsets. The samples in a cluster are expected to have the same pattern, but they do not necessarily exhibit periodic behaviour. Therefore, we used the RQ covariance function for modelling.

The goal is to predict y* for a previously unobserved sample x* which does not belong to D. The conditional distribution of f(x*) given the observations is Gaussian:

$$p\big(f(x_*) \mid X, Y, x_*\big) = \mathcal{N}\big(\bar{f}(x_*),\, \mathbb{V}[f(x_*)]\big), \qquad (5)$$

$$\bar{f}(x_*) = k_*^{\top}\,(K + \sigma^2 I)^{-1}\, Y, \qquad (6)$$

$$\mathbb{V}[f(x_*)] = k(x_*, x_*) - k_*^{\top}\,(K + \sigma^2 I)^{-1}\, k_*, \qquad (7)$$

where K is the covariance matrix over the training inputs and k_* = [k(x_*, x_0), ..., k(x_*, x_{n−1})]^⊤. Since the values of the hyperparameters have a significant impact on the prediction accuracy, they have to be selected carefully. We used Bayesian inference to estimate the values of the hyperparameters. This approach is based on the maximization of the Gaussian likelihood function [12]:

$$p(Y \mid X, \theta) = \mathcal{N}\big(0,\, K + \sigma^2 I\big). \qquad (8)$$

This likelihood function is simply obtained by considering Y ∼ N(0, K + σ²I). Equivalently, we can maximize the log-likelihood function [13]:

$$\log p(Y \mid X, \theta) = -\tfrac{1}{2}\, Y^{\top} (K + \sigma^2 I)^{-1} Y - \tfrac{1}{2} \log \lvert K + \sigma^2 I \rvert - \tfrac{n}{2}\log 2\pi. \qquad (9)$$

This log-likelihood function considers only the prediction accuracy. We designed a new likelihood function for GPR to satisfy the requirements of our ensemble model regarding the accuracy-diversity optimization.
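For intuition, the following is a minimal numpy sketch of zero-mean GPR with the RQ covariance of Equation (4) and the predictive equations (6) and (7). The toy data, hyperparameter values, and function names are ours, not from the paper.

```python
import numpy as np

def rq_kernel(A, B, s=1.0, l=1.0, alpha_c=1.0):
    """Rational Quadratic covariance, Eq. (4): s^2 (1 + r^2/(2 a l^2))^(-a)."""
    r2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return s**2 * (1.0 + r2 / (2.0 * alpha_c * l**2)) ** (-alpha_c)

def gpr_fit_predict(X, Y, Xs, noise=0.1, **kp):
    """Zero-mean GPR predictive mean and variance, Eqs. (6)-(7)."""
    K = rq_kernel(X, X, **kp) + noise**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # stable factor of K + s^2 I
    a = np.linalg.solve(L.T, np.linalg.solve(L, Y))  # (K + s^2 I)^{-1} Y
    Ks = rq_kernel(Xs, X, **kp)
    mean = Ks @ a                                    # Eq. (6)
    v = np.linalg.solve(L, Ks.T)
    var = rq_kernel(Xs, Xs, **kp).diagonal() - (v**2).sum(0)  # Eq. (7)
    return mean, var

# Toy usage: fit a noisy sine and predict at five test points
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (50, 1))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
Xs = np.linspace(0, 10, 5)[:, None]
m, v = gpr_fit_predict(X, Y, Xs, l=1.5)
print(np.round(m, 2), np.round(v, 3))
```

The Cholesky factorization avoids forming (K + σ²I)⁻¹ explicitly, which is the numerically stable way to evaluate Equations (6), (7), and (9).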

C. DIRICHLET PROCESS CLUSTERING
A clustering algorithm is required to partition the feature space into multiple subspaces. Any clustering algorithm that satisfies both of the following requirements can be used. First, the number of subspaces is not known as prior knowledge and depends on the training samples; the clustering algorithm has to determine the number of subspaces automatically. Second, the probability that a new data sample belongs to an existing cluster is required in our ensemble model (in both training and prediction), and the clustering algorithm must be able to estimate such a probability. Dirichlet process (DP) clustering is a natural choice that provides both requirements. DP has been used in nonparametric Bayesian models [25]. It is a distribution over distributions, i.e., each draw from a Dirichlet process is itself a distribution, while the marginal distributions are Dirichlet distributions. A DP(α, G₀) is described by a base distribution G₀ and a positive scalar α, usually called the innovation parameter. Consider a sample distribution G drawn randomly from a DP, and random variables c_i sampled from G:

$$G \sim DP(\alpha, G_0), \qquad (10)$$

$$c_i \sim G. \qquad (11)$$

Random variable c_i indicates the cluster of sample x_i. Assume the indicators c_i take M unique values. It can be shown that the conditional probability of a single indicator c_i given the other indicators, when integrating out G and letting M tend to infinity, exhibits a clustering effect [36]:

$$p(c_i = m \mid c_{\neg i}, \alpha) = \frac{o_{\neg i, m}}{n - 1 + \alpha}, \qquad (12)$$

$$p(c_i \neq c_j \;\; \forall j \neq i \mid c_{\neg i}, \alpha) = \frac{\alpha}{n - 1 + \alpha}, \qquad (13)$$

where c_¬i is the set of all indicators excluding the i-th indicator c_i, and o_¬i,m is the occupation number of cluster m ignoring the value of c_i. In the standard DP, the occupation number is calculated as:

$$o_{\neg i, m} = \sum_{j \neq i} \delta(c_j, m), \qquad (14)$$

where δ is the Kronecker delta. According to Equations (12) and (13), the indicator c_i belongs to a new cluster with probability α/(n − 1 + α), and to an existing cluster m with probability proportional to its occupation number.
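The clustering effect of Equations (12) to (14) can be demonstrated in a few lines of Python (an illustrative sketch; the function name and toy indicators are ours):

```python
import numpy as np

def dp_assignment_probs(c, i, alpha):
    """Conditional assignment probabilities for indicator c_i, Eqs. (12)-(13).

    c: current cluster indicators (c[i] itself is ignored);
    alpha: the innovation parameter of the DP.
    Returns probabilities of joining each existing cluster (proportional to
    its occupation number, Eq. (14)) and of opening a new cluster.
    """
    n = len(c)
    others = np.delete(c, i)                  # c_{-i}
    clusters, counts = np.unique(others, return_counts=True)
    p_existing = counts / (n - 1 + alpha)     # Eq. (12)
    p_new = alpha / (n - 1 + alpha)           # Eq. (13)
    return dict(zip(clusters.tolist(), p_existing)), p_new

# Sample 5 was alone in cluster 2, so its options are clusters 0, 1, or a new one.
c = np.array([0, 0, 1, 1, 1, 2])
print(dp_assignment_probs(c, i=5, alpha=1.0))
# ({0: 0.333..., 1: 0.5}, 0.1666...) -- larger clusters attract new members.
```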

IV. GPR ENSEMBLE MODEL FOR TRAFFIC PREDICTION
This section presents the proposed GPR ensemble model. The algorithm is explained in three phases: preprocessing, training, and prediction. The proposed model requires three steps to be used in real applications. First, the input data (i.e., the traffic time-series) must be processed and formatted to create a dataset. Second, the model is trained using the dataset. Finally, the trained model is employed for prediction. For example, the proposed model may be used for proactive resource allocation in data centers. For such an application, we can employ an instance of the trained model in the network controller, where (i) it can access the traffic data on the links, and (ii) its outcome is accessible to the network controller for resource allocation. In the preprocessing phase, traffic samples are prepared and structured in a dataset to be used in the training and prediction phases. This phase normalizes the time-series and eliminates their trends. In the training phase, the GPR learners are generated and trained considering the learners' accuracy and their diversity. In the prediction phase, the outputs of the base learners for a new sample are combined to create the ensemble prediction, which is a weighted sum of the individual outputs. The input of this model is the time-series of traffic bit-rates (denoted t_i in this section), and the output is the predicted traffic bit-rate (denoted f_ens(x_i)). The type and source of the traffic bit-rates depend on the application. For example, they can be the traffic bit-rates on the links (for predicting the link utilization) or the traffic load between each pair of points in the network (for predicting the traffic matrix). In our experiments, both traffic types are considered (i.e., the CAIDA and Waikato datasets consist of link utilization, while the Abilene datasets consist of traffic matrices).

A. PREPROCESSING
In the preprocessing phase, two transformations are applied to the traffic time-series, and then the dataset is created using the transformed samples. The input data of this step includes the time-series values, denoted t_i. The output of this step is a dataset (called D) consisting of a feature vector (denoted x_i) and a target value (denoted y_i) for each data point. The first transformation is known as the difference operator, which eliminates the nonstationarity and trend in the time-series [37]:

$$\hat{q}_i = (1 - B)\, t_i = t_i - t_{i-1}, \qquad (15)$$

where t_i is the original traffic sample and B is the backshift operator. The second transformation is the normalization of the input using a sigmoid function:

$$q_i = \frac{2}{1 + \exp(-a_n\, \hat{q}_i / l_n)} - 1, \qquad (16)$$

where l_n is the link capacity and a_n is a coefficient value that alters the shape of the normalization function. The function (16) maps the input value into the range of [−1, 1]. The transformed traffic data, q_i, is used to create the training dataset D. The feature vector x_i and label y_i are:

$$x_i = (q_{i-1},\, q_{i-2},\, \ldots,\, q_{i-d}), \quad y_i = q_i. \qquad (17)$$

It means that for data point i, the target value is q_i, and the feature vector is the d values observed before q_i (i.e., q_{i−1}, q_{i−2}, ..., q_{i−d}), where d is the size of the feature vector.
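A compact sketch of this pipeline is given below (Python). The sigmoid in Equation (16) follows our reconstruction above, and the value of the shape coefficient a_n is an assumption for illustration.

```python
import numpy as np

def preprocess(t, link_capacity, a_n=5.0, d=10):
    """Build dataset D from a traffic time-series, per Section IV-A.

    Applies the first-difference operator (Eq. (15)), a sigmoid normalization
    into [-1, 1] (Eq. (16)), then a sliding window of size d to form the
    feature vectors x_i and targets y_i (Eq. (17)).
    """
    dq = np.diff(t)                                            # (1 - B) t_i
    q = 2.0 / (1.0 + np.exp(-a_n * dq / link_capacity)) - 1.0  # Eq. (16)
    X = np.lib.stride_tricks.sliding_window_view(q[:-1], d)    # q_{i-d}..q_{i-1}
    y = q[d:]                                                  # target q_i
    return X, y

# Toy bit-rate series (values are illustrative, not from the paper's datasets)
t = np.cumsum(np.random.default_rng(1).uniform(100, 200, 1000))
X, y = preprocess(t, link_capacity=10_000, d=10)
print(X.shape, y.shape)  # (989, 10) (989,)
```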

B. TRAINING
The base learners have to be trained in such a manner that the whole ensemble achieves the optimal balance between the accuracy of the base learners and the diversity of their outputs to minimize the prediction error. The ensemble training process must find the optimal balance by tuning two factors at the same time: accuracy and diversity. In the proposed model, this optimization process is done using a divide-and-conquer approach. Each base learner finds the optimal balance between accuracy and diversity in a small region of the feature space. Hence, the feature space is segmented into smaller subspaces, and each subspace is used as the training set of a base learner. Then, each base learner is trained to provide a balance between the following targets: (i) accuracy on its corresponding segment (or training set), and (ii) diversity on the adjacent segments.
Dataset D is employed to train the ensemble model. The samples in D are divided (randomly) into two groups: the training dataset and the boosting dataset. The training dataset includes 80% of the samples in D; the remaining samples form the boosting dataset.
The training phase is done in four steps, as shown in Fig. 1. In Step 1, the training dataset is segmented into M segments using the DP clustering algorithm, where the value of M is determined automatically in DP. In Step 2, M base learners are generated and initialized, one per segment. In Step 3, the learners are trained to provide the accuracy-diversity balance. In Step 4, the boosting dataset is used to generate additional learners for the regions with high prediction error.

1) STEP 1
The DP clustering algorithm (explained in Section III-C) is used to partition the training dataset. The DP algorithm generates M clusters D_m (m = 1, ..., M), where each cluster represents a subspace in the feature space. The number of generated clusters is determined automatically during the clustering and can be tuned by changing the innovation parameter α. The goal is to place similar samples (i.e., samples with similar feature vectors) in the same segment. The probability that a new sample belongs to a cluster depends on the similarity of the new sample to the members of the cluster. It is calculated based on the occupation number. Since the occupation number in Equation (14) is not data-dependent [36], we replace it with the following kernel-based function:

$$o_{\neg i, m} = \sum_{j \neq i,\; c_j = m} k(x_i, x_j; \theta). \qquad (19)$$

The function k(x_i, x_j; θ) is an arbitrary covariance function. We used the dot-product covariance function:

$$k(x_i, x_j) = x_i^{\top} x_j. \qquad (21)$$

Algorithm 1 implements this clustering. The probability of assigning x_i to each of the existing clusters is calculated in Line 7 using Function clusteringProbability according to Equation (12). Function newClusterProbability computes the probability of creating a new cluster and assigning x_i to the new cluster based on Equation (13). If P_max > P_new (Line 11), x_i is assigned to the existing cluster with the highest probability (Line 12, D_m ← x_i); otherwise, a new cluster is created for x_i. The output of Algorithm 1 is M clusters of samples, where M is determined automatically.
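A sketch of the data-dependent scores behind Algorithm 1 (Python; the exact normalization inside clusteringProbability and newClusterProbability may differ from the paper's implementation):

```python
import numpy as np

def kernel_occupation(x_i, cluster_members):
    """Kernel-based occupation number of Eq. (19): the summed dot-product
    covariance (Eq. (21)) between x_i and the current members of cluster m.
    Unlike the plain count of Eq. (14), similar samples contribute more."""
    return float(np.sum(cluster_members @ x_i))

def assignment_scores(x_i, clusters, alpha, n):
    """Scores for assigning x_i to an existing cluster or a new one,
    mirroring Eqs. (12)-(13) with the occupation number of Eq. (19)."""
    p_exist = {m: kernel_occupation(x_i, mem) / (n - 1 + alpha)
               for m, mem in clusters.items()}
    p_new = alpha / (n - 1 + alpha)
    return p_exist, p_new

clusters = {0: np.array([[1.0, 0.0], [0.9, 0.1]]), 1: np.array([[0.0, 1.0]])}
x_i = np.array([1.0, 0.05])
print(assignment_scores(x_i, clusters, alpha=1.0, n=4))
# x_i is nearly parallel to the members of cluster 0, so cluster 0 dominates.
```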

2) STEP 2
Base learners f_m (m = 1, ..., M) are generated in this step. Cluster D_m is assigned as the training set of learner f_m. All the learners utilize the RQ covariance function defined in Equation (4), while the hyperparameters of f_m are denoted θ_m. The hyperparameters of the learners are initialized using the standard GPR training process, in which the values of the hyperparameters θ are determined by maximizing the Gaussian likelihood function p(Y|X, θ) defined in (8). After this step, the learners achieve the best fit on their training sets; however, they cannot guarantee the accuracy-diversity balance.

3) STEP 3
In this step, the ensemble is trained to provide the balance between accuracy and diversity. Each base learner optimizes the accuracy-diversity balance in a small region of the feature space. For each base learner, the goal is to compromise between the accuracy on its corresponding subspace and the diversity of predictions on its neighbouring subspaces. This goal has two sides. On the one hand, it enhances the learner's accuracy, which is independent of the samples in other segments and the results of other learners. On the other hand, it improves the diversity, which depends on the outputs of the other learners. To achieve this goal, a new likelihood function is introduced which considers the accuracy (on the learner's corresponding subspace) and the diversity (on the learner's adjacent subspaces). The values of θ_m are selected considering the accuracy of f_m on D_m and the diversity of the learners on the subspaces adjacent to D_m. The values of the hyperparameters have been initialized in Step 2. In this step, they are evolved to maximize the proposed GPR ensemble likelihood function, which considers accuracy and diversity as follows:

$$L_m = P_m + V_m, \qquad (22)$$

where P_m = log p(Y_{D_m} | X_{D_m}, θ_m) is the Gaussian log-likelihood defined in Equation (9), and V_m measures the diversity:

$$V_m = \frac{1}{\lvert D_{\neg m}\rvert} \sum_{x_i \in D_{\neg m}} v_m(x_i), \qquad (23)$$

$$v_m(x_i) = p_m(x_i)\,\big(I_m(x_i) - E_m(x_i)\big), \qquad (24)$$

in which D_¬m is the set of samples x_i in the clusters adjacent to D_m (excluding D_m) for which p_m(x_i) ≥ ε, and |D_¬m| is the number of samples in D_¬m. I_m(x_i) is defined based on Equation (1) and measures the distance between the estimation of expert m for x_i and f̄_¬m(x_i):

$$I_m(x_i) = \big(f_{m|\theta_m}(x_i) - \bar{f}_{\neg m}(x_i)\big)^2, \qquad (25)$$

where f̄_¬m(x_i) (the average of the predicted values of the other learners on x_i) is defined based on Equation (2):

$$\bar{f}_{\neg m}(x_i) = \frac{1}{M-1}\sum_{j \neq m} f_j(x_i). \qquad (26)$$

Also, E_m(x_i) is the estimation error of expert m on x_i:

$$E_m(x_i) = \big(f_{m|\theta_m}(x_i) - y_i\big)^2. \qquad (27)$$

p_m(x_i) is the probability that x_i belongs to D_m, based on Equations (12) and (19). In other words, p_m(x_i) measures the similarity between x_i and D_m and is defined as:

$$p_m(x_i) = \frac{\sum_{x_j \in D_m} k(x_i, x_j; \theta)}{\sum_{m'=1}^{M} \sum_{x_j \in D_{m'}} k(x_i, x_j; \theta)}. \qquad (28)$$

Function p_m(x_i) can be viewed as a similarity score between x_i and the samples in D_m. When p_m(x_i) is less than a threshold ε (which means x_i is not similar to cluster m), x_i is excluded from the calculation of V_m. This allows learner m to focus on its local neighbourhood instead of the whole training dataset. The value of the threshold ε is determined based on the minimum value of p_m(x) for x in cluster m.
The likelihood function L_m is the sum of two terms. The first term (P_m) is calculated using the samples of subspace m (i.e., D_m) and reflects the accuracy of f_m on D_m. The second term (V_m) is computed using the samples of the surrounding subspaces, which is illustrated in Fig. 3. As shown, v_m has a minimum value at f_{m|θ}(x_i) = f̄_¬m(x_i), and it has a maximum value at f_{m|θ}(x_i) = y_i. Probability p_m(x_i) limits the calculation of V_m to the subspaces that are adjacent to D_m. When the distance between x_i and D_m is high (i.e., x_i is not similar to the samples in D_m, which means x_i is not in the surrounding subspaces), f_{m|θ_m}(x_i) does not affect V_m. Since p_m(x_i) takes small values (close to zero) for samples that are far from D_m, it limits the number of samples that affect the likelihood function of f_m. Since f_m optimizes the accuracy-diversity trade-off in its local neighbourhood, the complexity of the objective function V_m (and L_m) is decreased significantly (compared to an objective function which considers the whole dataset). Also, the complexity of the training process for f_m does not depend on the size of the dataset D. Fig. 4 shows this property of our ensemble training. During the training of f_m, P_m is calculated using the samples in the subspace D_m, and V_m is calculated based on the samples in the adjacent subspaces D_¬m. The samples that are not in the adjacent subspaces, and the predictions of the experts for those samples, do not affect L_m.

Fig. 5 gives an intuitive illustration of the accuracy-diversity balance. The samples of one cluster (D_m) are shown together with two candidate hyperparameter sets, θ̂ and θ̃, that result in the same level of accuracy for f_m, so both θ̂ and θ̃ have equal chances to be selected in the standard GPR training. However, θ̂ is more likely to be selected when we employ our GPR ensemble likelihood function. Consider the prediction results of f_m and its neighbouring experts f_{m1}, f_{m2}, and f_{m3} for x_i. It can be shown that f_{m|θ̂}(x_i) leads to higher diversity compared to f_{m|θ̃}(x_i) according to Equation (1). So, θ̂ is preferred because it provides a higher diversity on x_i. In this example, we considered only the outcome of three learners on sample x_i. In a real scenario, we need to consider all the training samples and base learners in the neighbouring segments.
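The diversity term can be summarized in a few lines (a Python sketch under the reconstruction above, where v_m = p_m · (I_m − E_m); all names are ours):

```python
import numpy as np

def diversity_term(f_m, f_others, y, p_m, eps=0.05):
    """Diversity term V_m of Eqs. (23)-(27), under the reconstruction above.

    f_m:      predictions of expert m on candidate samples, shape (N,)
    f_others: predictions of the other M-1 experts, shape (M-1, N)
    y:        target values, shape (N,)
    p_m:      similarity scores of Eq. (28), shape (N,)
    """
    mask = p_m >= eps                  # keep only the adjacent subspaces
    fbar = f_others.mean(axis=0)       # average of the other experts, Eq. (26)
    I_m = (f_m - fbar) ** 2            # distance to the other experts, Eq. (25)
    E_m = (f_m - y) ** 2               # estimation error of expert m, Eq. (27)
    v = p_m * (I_m - E_m)              # per-sample term, Eq. (24)
    return v[mask].mean() if mask.any() else 0.0

# L_m = P_m + V_m (Eq. (22)) is then maximized over theta_m, where P_m is the
# standard GPR log-likelihood of Eq. (9) evaluated on D_m.
rng = np.random.default_rng(3)
f_m, f_others = rng.normal(size=100), rng.normal(size=(4, 100))
y, p_m = rng.normal(size=100), rng.uniform(0, 0.2, size=100)
print(diversity_term(f_m, f_others, y, p_m))
```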
The accuracy-diversity balance is improved after one round of optimizing the experts' hyperparameters within Step 3. It is possible to enhance the balance further by multiple iterations of Step 3.

4) STEP 4
In this step, the samples in the boosting dataset are used to enhance the ensemble performance. First, the generated (and trained) learners are employed to predict the samples in the boosting dataset. The prediction algorithm is explained in Section IV-C. The goal is to select the samples in the boosting dataset whose corresponding prediction errors are significant. In other words, the samples that cannot be predicted accurately by the existing learners are selected to form a subset called D'. The prediction error is measured using NMSE [13]:

$$\mathrm{NMSE} = \frac{1}{\sigma^2\,\lvert D' \rvert}\sum_{(x_i, y_i)\in D'}\big(y_i - f_{ens}(x_i)\big)^2, \qquad (30)$$

where (x_i, y_i) is a sample in D', |D'| is the number of samples in D', σ² is the variance of {y_i}, and f_ens(x_i) is the voted prediction for x_i (defined in Section IV-C). Smaller values of NMSE are preferred, and NMSE = 0 corresponds to a perfect predictor with no error. A sample (x_i, y_i) in the boosting dataset is added to D' if its corresponding prediction error e_i = (y_i − f_ens(x_i))²/σ² is greater than the threshold e_thr (i.e., e_i ≥ e_thr). The value of e_thr is set to the 50th percentile of the NMSE of the boosting dataset. Therefore, 50% of the boosting samples (those with significant prediction error) will be selected.
The prediction error of the generated experts is high for the samples in D'. This subset is used to generate and train M' new learners, which are added to the ensemble to improve the overall prediction accuracy. The new learners are generated as explained in Step 1, Step 2, and Step 3. First, D' is clustered using the DP algorithm to create M' clusters (where the value of M' is determined in DP). Then, M' learners are generated and initialized using the new clusters. Finally, the learners are trained to optimize the accuracy-diversity balance by maximizing the likelihood function.
Algorithm 2 implements Steps 2, 3, and 4. In Step 2, the hyperparameters of the GPR experts are initialized using Function fitStandardGPR (Line 2). It takes the samples in one subspace and optimizes the standard Gaussian likelihood in Equation (8). In Step 3, the accuracy-diversity balance is optimized using Function fitAccuracyDiversity (Line 4). In Step 4, the ensemble model is used to predict the samples in the boosting dataset and create D' (Lines 5 to 11). Then, M' clusters are created by applying DP clustering to D' (Line 12). Each cluster is employed to initialize and train a new GPR expert (Lines 13 to 16). Note the difference between the inputs of fitAccuracyDiversity in Lines 4 and 16. In Line 4, the input consists of the training dataset only. In Line 16, both the training dataset and D' are required for the accuracy-diversity balance because the new experts must consider the results of the M previously generated experts on the whole dataset.
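The sample-selection logic of Step 4 (Lines 5 to 11 of Algorithm 2) amounts to a percentile filter on variance-normalized squared errors. A sketch, with a placeholder predictor standing in for the trained ensemble:

```python
import numpy as np

def select_hard_samples(X_boost, y_boost, ensemble_predict):
    """Keep the boosting samples the current ensemble predicts poorly.

    Per-sample squared errors are normalized by the target variance (the
    per-sample analogue of NMSE, Eq. (30)); e_thr is the 50th percentile,
    so about half of the boosting samples are kept to form D'."""
    err = (y_boost - ensemble_predict(X_boost)) ** 2 / np.var(y_boost)
    e_thr = np.percentile(err, 50)
    keep = err >= e_thr
    return X_boost[keep], y_boost[keep]

# Toy usage with a trivial placeholder predictor
rng = np.random.default_rng(2)
Xb, yb = rng.normal(size=(200, 10)), rng.normal(size=200)
Xh, yh = select_hard_samples(Xb, yb, lambda X: np.zeros(len(X)))
print(len(yh))  # ~100 samples with above-median error form D'
```

D' is then clustered with DP, and the M' new experts are initialized and trained on those clusters exactly as in Steps 1 to 3.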
C. PREDICTION
The final voted prediction for a new sample x_i is the weighted sum of the predictions of the GPR components:

$$f_{ens}(x_i) = \sum_{m=1}^{M} p_m(x_i)\, f_m(x_i),$$

where M is the total number of generated experts, and p_m(x_i) is defined in Equation (28). The contribution of expert m to the final voted predicted value, f_ens(x_i), is proportional to p_m(x_i) (i.e., the similarity of x_i to the samples in cluster m). The term p_m(x_i) gives strong weights to the experts that have been trained on samples similar to x_i. It reduces the effect of learners that have been trained on clusters which are far from x_i in the feature space. The ensemble prediction is shown in Fig. 6. As shown, the outcome of each learner is multiplied by the probability that the sample belongs to the training set of the learner, which is calculated using the DP clustering algorithm.
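The combination rule itself is a few lines (a sketch; experts and weights are placeholders for the trained GPR components and the normalized similarity scores of Equation (28)):

```python
import numpy as np

def ensemble_predict(x, experts, weights):
    """Voted prediction: f_ens(x) = sum_m p_m(x) * f_m(x).

    experts: list of callables f_m(x) -> float (the trained GPR means);
    weights: callable returning p_m(x) for all m (Eq. (28)), summing to one.
    """
    p = np.asarray(weights(x))
    f = np.array([f_m(x) for f_m in experts])
    return float(p @ f)

# Toy usage with two fake "experts" and fixed weights
experts = [lambda x: 1.0, lambda x: 3.0]
weights = lambda x: [0.75, 0.25]
print(ensemble_predict(None, experts, weights))  # 1.5
```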

D. COMPUTATIONAL COMPLEXITY
In standard GPR, the inverse of the covariance matrix K is required to optimize the likelihood function in Equation (8). The time complexity of matrix inversion is O(N³), where N is the number of training samples. Thus, the standard GPR has a cubic computational requirement. In our algorithm, two parts must be considered in the time complexity analysis: (i) DP clustering (in Step 1 of training), and (ii) hyperparameter optimization (in Steps 2 and 3). In DP clustering, a covariance matrix of size N×N is required to calculate the occupation numbers in Equation (19). Each element of the covariance matrix is the result of Equation (21). The time complexity of calculating such a matrix in DP clustering is in the order of O(N²).
In Steps 2, 3, and 4, the hyperparameters of M learners must be optimized. For expert m, the inverse of the covariance matrix of D_m is needed. The size of D_m is not predetermined and depends on the DP clustering, the training set, and the length of the feature vector. Assuming |D_m| ∼ N/M (on average), the time complexity of the hyperparameter optimization in our algorithm is O(N³/M³) per expert. For small values of M, the time complexity of the algorithm is close to that of the standard GPR. This can be avoided by choosing appropriate values for α: as the number of clusters increases, the average size of the clusters decreases. In Section V, the number and size of clusters for different values of α are investigated. It shows that, for a wide range of values of α, the time complexity of the GPR ensemble model is significantly smaller than that of the standard GPR.
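As a back-of-the-envelope check (our numbers, not the paper's): with M experts and clusters of roughly equal size N/M, each inversion costs on the order of (N/M)³, so the total across all experts is N³/M², a factor of M² cheaper than a single N×N inversion:

```python
# Rough cost ratio: M small inversions vs. one big one (constants ignored).
N, M = 10_000, 20
per_expert = (N / M) ** 3      # one (N/M) x (N/M) inversion
total = M * per_expert         # all experts together: N^3 / M^2
print(total / N**3)            # 1 / M^2 = 0.0025
```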

E. GRADIENT DESCENT OPTIMIZATION
The proposed likelihood function is optimized using the gradient descent method, which requires the derivative of the likelihood function. In this section, we provide the derivative of the likelihood function in Equation (22) with respect to the hyperparameters θ_m:

$$\frac{\partial L_m}{\partial \theta_m} = \frac{\partial P_m}{\partial \theta_m} + \frac{\partial V_m}{\partial \theta_m}.$$

The first part (i.e., ∂P_m/∂θ_m) is the derivative of the standard GPR log-likelihood function and has been investigated in existing GPR models (e.g., [13]):

$$\frac{\partial P_m}{\partial \theta_m} = \frac{1}{2} Y^{\top} \tilde{K}^{-1} \frac{\partial K}{\partial \theta_m} \tilde{K}^{-1} Y - \frac{1}{2}\,\mathrm{Tr}\!\left(\tilde{K}^{-1} \frac{\partial K}{\partial \theta_m}\right), \quad \tilde{K} = K + \sigma^2 I,$$

where Tr is the trace of the matrix and ∂K/∂θ_m is the covariance matrix derivative:

$$\left[\frac{\partial K}{\partial \theta_m}\right]_{ij} = \frac{\partial k(x_i, x_j; \theta_m)}{\partial \theta_m}.$$

The second part of the derivative (i.e., ∂V_m/∂θ_m) depends on f_{m|θ_m}, I_m, and E_m. Function f_{m|θ_m}(x*) is the prediction of learner m for input x*, and it is defined in Equation (6):

$$f_{m|\theta_m}(x_*) = k_*^{\top}\,(K + \sigma^2 I)^{-1}\, Y.$$

Its gradient is calculated as:

$$\frac{\partial f_{m|\theta_m}(x_*)}{\partial \theta_m} = \frac{\partial k_*^{\top}}{\partial \theta_m}\,(K + \sigma^2 I)^{-1} Y - k_*^{\top}\,(K + \sigma^2 I)^{-1}\,\frac{\partial K}{\partial \theta_m}\,(K + \sigma^2 I)^{-1} Y.$$

Accordingly, the derivatives of I_m(x_i) and E_m(x_i) are:

$$\frac{\partial I_m(x_i)}{\partial \theta_m} = 2\,\big(f_{m|\theta_m}(x_i) - \bar{f}_{\neg m}(x_i)\big)\,\frac{\partial f_{m|\theta_m}(x_i)}{\partial \theta_m},$$

$$\frac{\partial E_m(x_i)}{\partial \theta_m} = 2\,\big(f_{m|\theta_m}(x_i) - y_i\big)\,\frac{\partial f_{m|\theta_m}(x_i)}{\partial \theta_m}.$$

Finally, we can calculate the derivative ∂V_m/∂θ_m:

$$\frac{\partial V_m}{\partial \theta_m} = \frac{1}{\lvert D_{\neg m}\rvert}\sum_{x_i \in D_{\neg m}} p_m(x_i)\left(\frac{\partial I_m(x_i)}{\partial \theta_m} - \frac{\partial E_m(x_i)}{\partial \theta_m}\right).$$

In calculating the derivative, the terms that do not depend on θ_m are considered constant (e.g., p_m(x_i), D_¬m, f̄_¬m(x_i), and Y).
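Analytic gradients like these are easy to get wrong in implementation; a standard sanity check is to compare them against finite differences before running gradient descent. A small self-contained sketch (our helper names, not the paper's training code):

```python
import numpy as np

def numeric_grad(loglik, theta, h=1e-5):
    """Central finite-difference gradient, useful for verifying the analytic
    derivatives of L_m = P_m + V_m above on small test cases."""
    g = np.zeros_like(theta)
    for k in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[k] += h
        tm[k] -= h
        g[k] = (loglik(tp) - loglik(tm)) / (2 * h)
    return g

def gradient_ascent(loglik, theta0, lr=0.05, iters=200):
    """Maximize a likelihood by plain gradient ascent, as done for Eq. (22)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta += lr * numeric_grad(loglik, theta)
    return theta

# Toy check: maximize a concave quadratic with optimum at (1, 2)
f = lambda th: -((th[0] - 1.0) ** 2 + (th[1] - 2.0) ** 2)
print(np.round(gradient_ascent(f, [0.0, 0.0]), 3))  # ~[1., 2.]
```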

V. EXPERIMENTAL RESULTS

A. SETUP
We carried out experiments on six different datasets which have been created using traffic samples from three well-known sources of traffic data. The list of datasets and data sources is shown in Table 2. The first two datasets (CAIDA-01 and CAIDA-02) have been created from the traffic data monitored on 10GigE links from 2008 to 2015, provided by the Center for Applied Internet Data Analysis (CAIDA) [38]. Abilene Internet2 Network traffic data [39] (from 2007-01-01 to 2007-10-14) has been used to build the datasets Abilene-01 and Abilene-02. Abilene Internet2 Network is a high-performance backbone network for research. Datasets Waikato-01 and Waikato-02 have been created from the Waikato traffic traces listed in Table 2. These datasets are created based on the traffic bit-rate time-series (in Mbps). The Abilene datasets consist of traffic bit-rates between pairs of points (i.e., traffic matrix), but the Waikato and CAIDA datasets include traffic loads on the links in the network (i.e., link utilization). The process of preparing the dataset is the same for all these datasets, as explained in Section IV-A. The main difference between these datasets is the time-scale of the traffic samples, as illustrated in Table 2. The prediction task for these datasets is to estimate target values (i.e., y_i) according to the feature vector (i.e., x_i).

We compared our model with different algorithms including traditional time-series algorithms (i.e., ARIMA [21], FARIMA [22]), supervised regression methods (i.e., standard GPR [13], SVR [6], LASSO [20]), ensemble learning methods (i.e., GTB [31], RF [32], ERT [33]), and a deep learning time-series predictor (i.e., LSTM [23]). These models have been discussed in Section II. We performed separate experiments on the datasets in Table 2. In the experiments, each dataset is divided into two non-overlapping subsets. The first subset is used for model selection (i.e., selecting the optimal size of the training set, the optimal number of features d, and the optimal values for the parameters) for each algorithm using a cross-validation process. The second subset is divided into 100 portions, and each portion is employed to create a pair of non-overlapping train and test sets (random train-test split). For each pair, the models are fitted on the train set (using the training process explained in Section IV-B) and then evaluated on the test set. The reported results are the average of the 100 prediction error measurements obtained from this process. The prediction error is evaluated using NMSE in Equation (30).

B. RESULTS
In this section, we present the results of our experiments carried out on the datasets. In Fig. 7, an example of the outcome of the GPR ensemble model for predicting the traffic samples in Abilene-02 is presented. Each point is the outcome of one GPR expert. Also, the final predicted values (the weighted sum of the individual predictions) are illustrated. The individual experts have different errors (diversity), so the error of each expert is corrected by the other experts. Thus, the GPR ensemble model was able to predict most of the samples presented in Fig. 7 accurately. For a few samples, such as t = 581, the individual errors are not diverse around the actual value; therefore, the final prediction error is high compared to other samples. This shows that the diversity of the individual predictions plays an important role in reducing the final prediction error of the ensemble model.
The average prediction errors of the different algorithms are presented in Table 3. As shown, the GPR ensemble model outperforms the other models in the experiments on CAIDA-02, Abilene-01, Abilene-02, and Waikato-02. In two cases (CAIDA-01 and Waikato-01), ARIMA and LSTM have the smallest prediction error, respectively. Nevertheless, the GPR ensemble error is close to the winning algorithms in these two cases. The details of the results are illustrated in Figs. 8, 9, and 10. The prediction errors of the different models for datasets CAIDA-01 and CAIDA-02 are shown in Fig. 8. On average, ARIMA has the lowest prediction error for CAIDA-01 compared to the other algorithms. The GPR ensemble model has the second smallest prediction error, and its variance is less than that of ARIMA. For dataset CAIDA-02, the GPR ensemble model has the best prediction results (in terms of average prediction error); the performance of FARIMA is close to the GPR ensemble model. In both cases, the GPR ensemble model has the minimum variance in prediction, which shows it is more stable compared to the other models. The average prediction error of the algorithms for the CAIDA datasets is higher than for the other datasets (see Table 3). This is because of the traffic behaviour at small time-scales. The time-scales of the samples in CAIDA-01 and CAIDA-02 are 30 seconds and 1 minute, respectively. It has been shown that traffic exhibits short-range dependency (SRD) at small time-scales [41]. The traffic fluctuations are notable, and temporal changes in traffic patterns are more frequent at small time-scales. Therefore, the average of the errors on these datasets is higher for all the algorithms.
The prediction results for datasets Abilene-01 and Abilene-02 are presented in Fig. 9. The time-scales of the traffic samples are 5 and 10 minutes for these datasets, respectively. In both cases, the GPR ensemble model outperforms the other models, with an average NMSE of 0.19 for Abilene-01 and 0.20 for Abilene-02. Also, the variance of its prediction results is lower than that of the other models. The second smallest prediction error belongs to LSTM, with average NMSE of 0.26 and 0.24 for these datasets. Generally, the average prediction error of all algorithms is smaller for the Abilene datasets compared to the results for the CAIDA datasets. The results of the predictions for the Waikato datasets are shown in Fig. 10. In the case of Waikato-01, the average NMSE for LSTM is 0.16 (which is less than the other competitors), and the average prediction errors of the GPR ensemble model and RF are equal to 0.18. For Waikato-02, the GPR ensemble model has the best prediction result, with an average NMSE of 0.14.
The results of our experiments show the GPR ensemble model has a higher prediction accuracy compared to other ensemble models for many traffic datasets. Also, our model exhibits a small variance in its prediction results, which shows it is able to handle traffic characteristics and varying patterns. Unlike other models, which have different results at various traffic time-scales, the GPR ensemble model has a satisfactory and stable performance at all traffic time-scales. The high prediction accuracy of the GPR ensemble model is achieved by optimizing the balance between the accuracy of the individual experts and the diversity of their prediction results during the training phase.

C. LIKELIHOOD OPTIMIZATION
We monitored and analyzed the convergence of the likelihood function during the hyperparameter optimization. An example learner is selected, and its likelihood values (for P_m and V_m) are tracked. This learner was assigned to a cluster with 166 samples during the training process, and it has 237 samples in its neighbourhood (i.e., samples for which p_m(x) is greater than the threshold).
The changes in P_m and V_m, the two terms of the likelihood function, are shown in Fig. 12. As shown, the value of V_m decreased during the first iterations (before iteration 41), while there was a significant improvement in P_m. In the middle of the optimization, there is a period in which both terms increase together (between iterations 41 and 51). After iteration 51, P_m reaches its optimum point and stays stable, while there is a notable gain in V_m. We noted that the increase in V_m (in the final iterations) causes a minor decrease in P_m, but the total likelihood keeps improving. In general, we observed similar behaviour in the other learners: in the first optimization iterations, the hyperparameters are changed to improve P_m (i.e., to improve accuracy on the assigned cluster), and in the final iterations, the hyperparameters are tuned to gain V_m (i.e., to improve diversity on the adjacent samples).

D. EXECUTION TIME
We compare the execution time of the different algorithms. As mentioned in Section IV-D, the computational complexity of the proposed algorithm can be affected by the number and size of the clusters. In the worst case, the time complexity of the GPR ensemble model is comparable to the standard GPR model if a cluster's size is close to N (the size of the training set). On the other hand, the size and number of clusters depend on the innovation parameter α in DP clustering. Therefore, we need to investigate the size and number of clusters for different values of α. Fig. 11 shows the average size and number of clusters created by applying the DP algorithm on a dataset of size N = 10000. The DP clustering algorithm has been executed 100 times for each dataset in Table 2, while 10000 samples were selected randomly from the dataset in each run. As shown, the number of clusters is small when the value of α is close to zero, and it increases as α increases. At the same time, the average cluster size decreases as α increases. Even for small values of α, the cluster size is much smaller than the whole training set. For example, the average cluster size is around 5% of the training set (i.e., ∼ 500 samples) for α = 0.5. We also measured the training times of the models for different training set sizes. LASSO has the shortest training time compared to other models, as it is a simple linear regression model (with high prediction error). The effect of the training set size on the execution time of the GPR ensemble model is significantly smaller than for the standard GPR model (and also LSTM and GTB). Since each expert in the GPR ensemble model is trained on a segment of the feature space, the size of the training set has a small effect on the training time of the model. This is the result of the local optimization of the accuracy-diversity balance in our ensemble training.

VI. CONCLUSION
While network traffic prediction is a powerful means to improve network performance, it is a very challenging task. The traffic prediction model must handle different traffic patterns at various time-scales. In this article, we presented an algorithm for traffic prediction based on an ensemble of GPR experts and proposed an optimization framework to optimize the balance between the accuracy and diversity of these experts. The proposed framework reduces the complexity through a divide-and-conquer approach, where the feature space is segmented, each segment is used to train a base learner, and each learner optimizes the accuracy-diversity balance in its segment of the feature space. The proposed GPR ensemble model has been applied to real traffic traces. The experimental results show our GPR ensemble model outperforms other models in traffic prediction. In future work, we will investigate the performance of our model with respect to various traffic characteristics such as burstiness and LRD/SRD. We intend to integrate the model into existing network controllers to improve resource and traffic management in networks. We will also compare our model with new deep learning algorithms and graph-based neural networks to show its efficiency.