Network Traffic Modeling and Prediction Using Graph Gaussian Processes

Traffic modeling and prediction is a vital task for designing efficient resource allocation strategies in telecommunication networks. It is challenging because network traffic data exhibit complex nonlinear spatiotemporal interactions. Moreover, the data can have missing values when traffic statistics collection is unavailable at certain nodes. In this paper, we introduce a graph Gaussian Process (GP) model for this challenging problem. The GP is a Bayesian non-parametric model that is highly flexible in capturing complex patterns in the data. Additionally, it provides uncertainty information, which can be exploited for robust resource allocation. The developed graph GP model is almost free of hyper-parameter tuning, can accurately capture short-term and long-term temporal patterns, and can infer missing values by learning spatiotemporal interactions among the nodes in the network. Subsequently, we approximate the intractable posterior distribution using the Variational Bayes (VB) algorithm, which can be efficiently implemented. Finally, we evaluate the accuracy of the proposed model for predicting data traffic using two real-world network datasets. Our simulation results show that the proposed model achieves better prediction accuracy than state-of-the-art approaches.


I. INTRODUCTION
Accurate traffic modeling and prediction is essential for efficient proactive resource allocation and traffic engineering in telecommunication networks. It has been widely used to perform different management tasks such as network maintenance, network optimization, routing policy design, load balancing, protocol design, anomaly detection, and Virtualized Network Functions (VNF) deployment decisions [1], [2], [3]. Network traffic patterns can be affected by many factors, such as user behavior, network topology, and routing strategy, and can have complex nonlinear spatiotemporal interactions. Moreover, even though currently available technologies, such as software-defined networking (SDN), allow for the centralized collection of network statistics, older equipment or failures often make it impossible to have a complete view across all network nodes. Therefore, due to missing values, it is challenging to effectively explore and utilize the data for accurate prediction.
In the literature, various statistical time-series models and analysis methods have been developed for traffic prediction. The most commonly used is the autoregressive integrated moving average (ARIMA) model [4], [5]. A limitation of ARIMA is that it can only capture short-term temporal interactions. However, data analysis shows that data traffic can also exhibit long-term interactions [6]. The Seasonal ARIMA (SARIMA) model, an extension of ARIMA, can improve the prediction accuracy by capturing long-term interactions, and it has been adopted for traffic prediction in [7] and [8]. However, ARIMA and SARIMA are linear models with limited expressive power and therefore cannot capture nonlinear patterns.
More recently, artificial neural networks have attracted major attention in both academia and industry. These models are nonlinear and can potentially capture any nonlinearity in the data. The authors in [9] applied a general multilayer perceptron (MLP) network to predict the base station traffic under different wireless network setups. Recurrent neural networks (RNN), which are designed for sequence data, are used in [5], [10], [11], [12], and [13]. In [14], multiple RNN-based learning models along with a multi-task learning framework are proposed to explore spatiotemporal correlations among base stations in cellular networks. Similar works combined RNNs with convolutional neural networks (CNN) to capture the spatiotemporal structure [15]. However, the aforementioned works assume that the data lie within a regular Euclidean space and do not explicitly exploit the graph structure of telecommunication networks; therefore, the proposed models may not perform satisfactorily.
Artificial graph neural networks [16], [17], which are particularly designed for modeling and predicting time-varying graph-based data, are adopted in [18] and [19] for network traffic prediction. However, these artificial neural networks have two important limitations. First, they cannot efficiently handle missing values due to their special architectures. Second, they cannot provide uncertainty information in the prediction because they are deterministic models.
The GP, a Bayesian non-parametric model that can efficiently capture uncertainty in the prediction, has more recently been used by researchers for network traffic prediction [20], [21], [22], [23]. The authors in [20] studied a mixture of Gaussian processes using a Dirichlet process to improve the scalability of inference and to model data non-stationarity. In [21], a GP model is used to capture a quasi-periodic pattern. In [22], the authors developed an ensemble learning algorithm where each learner is modeled by a GP and the predictions of the GPs are combined to improve the prediction accuracy. In [23], the alternating direction method of multipliers (ADMM) algorithm is used for parallel hyper-parameter optimization to scale up GP inference. Nevertheless, the developed GP models are specifically designed for univariate time series and cannot capture complex traffic pattern interactions among different nodes in the network.
In this paper, we aim to introduce a graph-based GP model for traffic prediction. In particular, our contributions are as follows.
• We develop a graph-based GP model which can capture the spatiotemporal interactions in the data. In particular, the GP exploits the structure of the telecommunication network, which can be leveraged to infer the missing values.
• Moreover, a structured kernel function is proposed to capture short-term and long-term temporal dependencies to provide accurate prediction.
• Using VB, we develop an inference algorithm to approximate the posterior distribution of the GP model. The developed inference algorithm is almost free of hyper-parameter tuning, and the hyper-parameters are estimated using historical data.
• Finally, through simulations, we show that the introduced graph-based GP model outperforms state-of-the-art models on two real-world datasets.
This paper is organized as follows. The problem statement is described in Section II. In Section III, we provide an overview of GPs. In Section IV, we introduce a GP model for network traffic modeling. In Section V, we develop a VB algorithm for inference and prediction. Finally, Section VI shows the simulation results and Section VII concludes the paper.

II. PROBLEM DEFINITION
We consider a communication network which consists of a set of nodes, e.g., routers, connected to each other by communication links. Due to the graph-structured topology, we define the traffic prediction problem on a graph, as depicted in Fig. 1. Mathematically, the network can be represented by an undirected graph G_t := (y_t, V, E), where V is the set of nodes and E is the set of edges (i.e., communication links). We denote the presence of an edge between node i and node j as {i, j} ∈ E. Moreover, y_t ∈ R^M is a vector that contains the data traffic at the nodes at time t, e.g., in Mb per time unit, where M is the total number of nodes. Set V is divided into two disjoint sets V_o and V_ō, where V_o contains the nodes whose traffic can be measured and V_ō contains the nodes whose traffic cannot be measured. The numbers of observed and missing nodes are denoted by M_o and M_ō, respectively, such that M_o + M_ō = M. The objective of the traffic prediction problem is to predict the traffic over the next H time steps given T historical observations of data traffic. In this paper, we advocate a probabilistic approach. This requires computing the predictive distribution p(y_{t+1}, ..., y_{t+H} | y_{o,t}, ..., y_{o,t-T+1}), where y_{o,t} ∈ R^{M_o} is the observation vector of the M_o nodes at time t; each element records the traffic observation of a specific node in set V_o. The traffic values at the nodes are not independent but related by pairwise relationships. In other words, it is expected that neighboring nodes have similar traffic patterns over time. Moreover, in practice, some of the nodes may drop data traffic, e.g., due to packet loss, or generate it, e.g., at edge nodes. Therefore, the flow balance equations cannot be used to predict the traffic at the missing nodes. In particular, the relationship among the nodes in the graph is not deterministic but stochastic. Therefore, it is essential to effectively capture the network structure for improved prediction accuracy.
In the next section, we review the basics of GPs, since the GP is the main tool used to construct our graph-based probabilistic model for traffic prediction.

III. OVERVIEW OF GAUSSIAN PROCESSES
GPs are powerful non-parametric Bayesian tools suitable for modeling real-world problems. A GP is a collection of random variables, any finite number of which have a joint Gaussian distribution. Using a GP, we can define a distribution over non-parametric functions f(x) ~ GP(m(x), K(x, x')), where x is an arbitrary input variable with Q dimensions, and the mean function m(x) and the kernel function K(x, x') are respectively defined as m(x) := E[f(x)] and K(x, x') := E[(f(x) − m(x))(f(x') − m(x'))]. This means that a collection of N function value samples f := [f(x_1), ..., f(x_N)]^T has a joint Gaussian distribution f ~ N(m, K), where m := [m(x_1), ..., m(x_N)]^T and the covariance matrix K has entries [K]_{i,j} := K(x_i, x_j). The kernel function specifies the main characteristics of the function that we wish to model, and the basic assumption is that inputs x which are close are likely to have correlated function values. Constructing a good kernel function for a learning task depends on intuition and experience. More details about GPs can be found in [24].
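As a concrete illustration of the joint-Gaussian property above, the following minimal NumPy sketch builds a covariance matrix from an RBF kernel (an illustrative parameterization, not necessarily the one used later in this paper) and draws one function sample from the resulting joint Gaussian:

```python
import numpy as np

def rbf_kernel(x1, x2, alpha=1.0, beta=1.0):
    """RBF kernel: beta * exp(-(x1 - x2)^2 / alpha) for 1-D inputs."""
    d = x1[:, None] - x2[None, :]
    return beta * np.exp(-d**2 / alpha)

x = np.linspace(0.0, 5.0, 50)         # N = 50 input locations
m = np.zeros_like(x)                  # zero mean function
K = rbf_kernel(x, x)                  # covariance matrix [K]_ij = K(x_i, x_j)

rng = np.random.default_rng(0)
# one draw from the joint Gaussian N(m, K); jitter stabilizes the Cholesky factor
f = m + np.linalg.cholesky(K + 1e-6 * np.eye(len(x))) @ rng.standard_normal(len(x))
```

Nearby inputs receive nearly identical function values, reflecting the kernel's correlation assumption.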

IV. GAUSSIAN PROCESS FOR NETWORK TRAFFIC DATA
In this section, we introduce a Bayesian model based on GPs for data traffic. The developed model includes both the missing and the observed nodes. In Section V, we explain in more detail how the model is used for inference and prediction in the presence of missing nodes. Our model assumes that the traffic is generated based on a factor analysis model with a Gaussian likelihood as y_t = W f_t + b + n_t, where W := [w_1^T, ..., w_M^T]^T ∈ R^{M×D} is the factor loading matrix whose row w_m captures node-m-specific patterns, and f_t := [f_{1t}, ..., f_{Dt}]^T ∈ R^D is a temporal latent variable vector which captures common temporal patterns among the nodes. Vector b ∈ R^M is the bias term, which captures the overall traffic volume at each node. Moreover, n_t := [n_1(t), ..., n_M(t)]^T is additive white noise with n_m(t) ~ N(0, σ), ∀m = 1, ..., M. The expressive power of the model is determined by the number of latent factors D. In particular, the model can capture more diverse and more complex patterns in the data as D increases. However, we should note that increasing D also increases the risk of overfitting. Therefore, choosing the right value for D is important for achieving accurate prediction.
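The factor analysis generative model can be sketched as follows; the dimensions M, D, T and all parameter values are illustrative stand-ins, not values taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
M, D, T = 12, 5, 100                   # nodes, latent factors, time steps (illustrative)

W = rng.standard_normal((M, D))        # factor loadings: row w_m is node m's pattern
b = rng.standard_normal(M)             # per-node bias (overall traffic volume)
F = rng.standard_normal((D, T))        # latent temporal factors f_t stacked columnwise
sigma = 0.1                            # noise variance

# y_t = W f_t + b + n_t,  n_t ~ N(0, sigma * I)
Y = W @ F + b[:, None] + np.sqrt(sigma) * rng.standard_normal((M, T))
```

In the actual model the columns of F follow structured GP priors and W follows a graph prior, as described in the next subsections; here both are random for illustration.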
In order to effectively capture the graph structure of the network and the temporal patterns in the traffic, it is important to model the variables W, f_t, and b with flexible priors. In the following, we focus on this problem.
A. PRIOR FOR f_t
In order to capture the time evolution of the data traffic, we assume that the latent variables f_d(t) follow an additive model with four components given by f_d(t) = f_1d(t) + f_2d(t) + f_3d(t) + f_4d(t) (6). The additive model in (6) captures four main temporal characteristics of the data. Function f_1d(t) models a short-term trend and is assumed to be a GP with a radial basis function (RBF) kernel, k_1d(t, t') = β_1d exp(−(t − t')² / α_1d) (7). Parameters α_1d and β_1d model the important behavior of function f_1d(t). In particular, α_1d captures the horizontal variation and β_1d captures the vertical variation of function f_1d(t) in the time domain. Function f_2d(t) models a daily quasi-periodic pattern and is modeled by a GP with kernel k_2d(t, t') = β_2d exp(−(t − t')² / α_2d) exp(−2 sin²(π(t − t')/λ_1) / γ_1d) (8). Kernel function k_2d(t, t') is the product of an RBF and a purely periodic RBF kernel, which captures the conjunction of both kernels [25], [26]. The overall kernel is quasi-periodic. As in (7), parameters α_2d, β_2d, and γ_1d capture the horizontal and the vertical variations of function f_2d(t). Moreover, parameter λ_1 should be set to the number of observations per day. In a similar way, function f_3d(t) models a weekly quasi-periodic pattern and is modeled by a GP with kernel k_3d(t, t') = β_3d exp(−(t − t')² / α_3d) exp(−2 sin²(π(t − t')/λ_2) / γ_2d) (9), where λ_2 should be set to the number of observations per week. Similarly, parameters α_3d, β_3d, and γ_2d capture the horizontal and the vertical variations of function f_3d(t).
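A quasi-periodic kernel of the product form described above can be sketched as follows; the parameter values (and the exact parameterization) are illustrative assumptions, with the period set to 24 observations per day for an hourly grid:

```python
import numpy as np

def quasi_periodic(t1, t2, alpha, beta, gamma, lam):
    """Product of an RBF and a purely periodic RBF kernel with period lam."""
    d = t1[:, None] - t2[None, :]
    rbf = np.exp(-d**2 / alpha)
    periodic = np.exp(-2.0 * np.sin(np.pi * d / lam) ** 2 / gamma)
    return beta * rbf * periodic

t = np.arange(0.0, 72.0)                      # hourly grid spanning three days
K_daily = quasi_periodic(t, t, alpha=1e4, beta=1.0, gamma=1.0, lam=24.0)
```

Points exactly one period (24 hours) apart stay strongly correlated, while points half a period apart are only weakly correlated; the RBF factor lets this correlation slowly drift over long horizons, which is what makes the kernel quasi-periodic rather than strictly periodic.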
Furthermore, function f_4d(t) models the unstructured patterns and is given by a GP with kernel k_4d(t, t') = β_4d δ(t − t'), where δ(·) is the Dirac delta function. We note that, using the additive property of GPs, function f_d(t) can be rewritten as the single GP model f_d(t) ~ GP(0, k_1d(t, t') + k_2d(t, t') + k_3d(t, t') + k_4d(t, t')).
B. PRIOR FOR W
We now define a prior distribution for W that explicitly exploits the graph topology of the network. To do so, we use a Gaussian random field (GRF) model over each column of W, or equivalently w_{:,d} ~ N(0, Q_d^{-1}), where Q_d = θ_d1 L + θ_d2 I_M and L is the graph Laplacian matrix. The graph Laplacian matrix is defined as L := diag(A1) − A, where diag(A1) is the diagonal degree matrix and A is the adjacency matrix corresponding to the graph; [A]_{ij} is 1 if node i is connected to node j and 0 otherwise. A GRF is a sparse approximation of a GP defined over a discrete input set. The connection between GRFs and GPs has been studied in [27]. The assumption underlying the GRF is that nodes that are connected to each other should have similar data traffic patterns. The GRF encourages the values w_{i,d} and w_{j,d} to be similar if nodes i and j are neighbors and therefore acts as a regularizer which can improve the prediction accuracy.
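The graph Laplacian and the resulting GRF precision matrix can be sketched as follows, on a hypothetical 4-node path graph with illustrative values for θ_d1 and θ_d2:

```python
import numpy as np

def graph_laplacian(A):
    """L = diag(A1) - A, with diag(A1) the diagonal degree matrix."""
    return np.diag(A.sum(axis=1)) - A

# a 4-node path graph: 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = graph_laplacian(A)

theta1, theta2 = 1.0, 0.1
Q = theta1 * L + theta2 * np.eye(4)   # GRF precision; theta2 makes it positive definite

# under w ~ N(0, Q^{-1}), neighbouring entries of w are the most strongly correlated
cov = np.linalg.inv(Q)
```

The covariance between adjacent nodes exceeds that between distant nodes, which is exactly the smoothing-over-the-graph effect described above.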

C. PRIOR FOR b
Similarly, for the bias term b, we use a GRF prior given by b ~ N(Za, Q_{D+1}^{-1}), where Q_{D+1} = θ_{D+1,1} L + θ_{D+1,2} I_M. Matrix Z contains node-specific features, such as the degree of each node. In particular, by using this prior we encourage neighboring nodes to have similar biases if they have similar features.
Since the coefficient a in (15) is unknown, we put a non-informative prior over its values. Finally, the complete probabilistic graph-based GP model is depicted in Fig. 2. A summary of the model's parameters is also shown in Tab. 1.

D. THE COVARIANCE STRUCTURE
To better understand the underlying data patterns that the model in (5) can capture, we compute the overall covariance structure as Cov(y_t, y_{t'}) = Σ_{d=1}^{D} k_d(t, t') w_{:,d} w_{:,d}^T + σ δ(t − t') I_M (17), where for simplicity of the analysis we ignore the bias term. We can express (17) in matrix form as Cov(vec([y_1, ..., y_T])) = Σ_{d=1}^{D} (w_{:,d} w_{:,d}^T) ⊗ K_d + σ I_{MT} (18), where [K_d]_{t,t'} := k_d(t, t'). It can be seen that the covariance matrix is a sum of D covariance matrices, each of which has a Kronecker structure. In the special case D = 1, the covariance matrix reduces to a single Kronecker product. Designing covariance matrices with Kronecker structure is a common approach to modeling multivariate-output GPs for graph-structured data, e.g., in [28]. The limitation of the simple Kronecker structure is that it cannot capture complex patterns in the data. In contrast, the sum-of-Kronecker structure provides much more expressive power for complex and nonlinear patterns.
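The sum-of-Kronecker structure can be illustrated numerically; the dimensions and the temporal covariance matrices below are arbitrary stand-ins rather than kernels fitted to data:

```python
import numpy as np

rng = np.random.default_rng(2)
M, D, T = 4, 3, 6                     # nodes, latent dimensions, time steps

W = rng.standard_normal((M, D))       # factor loadings
Ks = []                               # one temporal covariance K_d per dimension
for d in range(D):
    R = rng.standard_normal((T, T))
    Ks.append(R @ R.T)                # any PSD matrix serves as a stand-in

# covariance of vec([y_1 ... y_T]): sum_d (w_d w_d^T) kron K_d
Sigma = sum(np.kron(np.outer(W[:, d], W[:, d]), Ks[d]) for d in range(D))
# with D = 1 this collapses to a single Kronecker product
```

Each Kronecker term couples one spatial pattern (w_d w_d^T) with one temporal pattern (K_d); summing D of them mixes several such pattern pairs, which is the extra expressiveness the text refers to.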

V. MODEL INFERENCE AND PREDICTION
In this section, we focus on model inference and prediction in the light of the data traffic measurements at the observed nodes in set V o . We note that the likelihood function in (5) depends on the parameters of both missing and observed nodes which are highly correlated through the GRF prior distributions in (13) and (15). However, the traffic measurements are unavailable for the missing nodes and therefore the inference is not trivial. To tackle the issue, our approach is to integrate out the parameters of missing nodes from the model to obtain a marginal model which includes only the observed nodes. Using the marginal model, we can perform the inference. In Section V-A, we focus on this problem. After the inference, we define posterior predictive distributions to make predictions about both the observed and the missing nodes by exploiting their interaction structures encoded in the GRF priors. This problem is explained in Section V-B.
where the joint distributions p(W_o, W_ō) and p(b_o, b_ō) are given by the GRFs in (12) and (15), respectively. By re-arranging W and b, the GRF models can be rewritten in block form. Using the Schur complement lemma [29], we can compute the marginal distributions of W_o and b_o as Gaussians with precision matrices Q_{o,d} = Q_{11,d} − Q_{12,d} Q_{22,d}^{-1} Q_{21,d}, ∀d = 1, ..., D + 1. Given the marginal priors in (21), the inference problem focuses on the resulting Gaussian marginal model, where y_{o,t} contains the data traffic measurements of the observed nodes.
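The Schur-complement marginalization can be checked numerically; the precision matrix below is a random positive-definite stand-in for Q_d, with hypothetical block sizes:

```python
import numpy as np

rng = np.random.default_rng(3)
M, Mo = 6, 4                              # total nodes, observed nodes

# a random positive-definite precision matrix standing in for Q_d
B = rng.standard_normal((M, M))
Q = B @ B.T + M * np.eye(M)

Q11 = Q[:Mo, :Mo]                         # observed block
Q12 = Q[:Mo, Mo:]
Q21 = Q[Mo:, :Mo]
Q22 = Q[Mo:, Mo:]                         # missing block

# Schur complement: precision of the marginal over the observed nodes
Q_o = Q11 - Q12 @ np.linalg.solve(Q22, Q21)

# cross-check: inverting Q_o must match the observed block of the covariance
cov_o = np.linalg.inv(Q)[:Mo, :Mo]
```

Working in precision form avoids ever inverting the full M×M matrix during inference; only the missing block Q22 needs to be solved against.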
In particular, the inference goal is to compute the posterior distribution of the random variables h := {F, W_o, b_o, a}, where F := [f_1, ..., f_T] ∈ R^{D×T}, and to find the best values for the hyperparameters of the GPs, ψ_GP, of the GRFs, ψ_GRF := {θ_d1, θ_d2}, and the noise variance σ, given the traffic observations over T samples in time, i.e., Y := {y_{o,t}}_{t=1}^{T}. The posterior of h has the form in (23), where ψ := {ψ_GP, ψ_GRF, σ} and Z(ψ) is the marginal likelihood, which depends only on the hyperparameters. The common approach for finding the best values for the hyperparameters is to maximize the log marginal likelihood, as in (24). However, neither of the problems in (23) and (24) is trivial to solve, because computing the marginal likelihood involves multiple high-dimensional integrals which are difficult to obtain in closed form. To tackle the issue, we use the VB learning algorithm [30]. VB is an iterative approach where each iteration consists of two main steps. In one step, we approximate the posterior distribution of h given the hyperparameters. In the next step, we optimize the hyperparameters given the approximate posterior distribution. In particular, at iteration i, we solve the following subproblems. Subproblem 1) We approximate the posterior distribution by q^(i)(h) given ψ^(i−1), as in (25). The approximate distribution q^(i)(h) is determined so as to have minimum dissimilarity with the true posterior. Using the KL divergence to measure this dissimilarity, q^(i)(h) can be found by solving the optimization problem in (26), where KL(q(·) || p(·)) := E_{q(·)}[log(q(·)/p(·))]. Subproblem 2) We optimize the hyperparameters by maximizing a lower bound of the log marginal likelihood. It can be shown that the negative of the KL divergence in (26) is a lower bound for the log marginal likelihood [31].
Therefore, the KL divergence in (26) can be used to optimize the hyperparameters, as in (27). We now explain the VB learning algorithm in more detail. Hereafter, we ignore the superscript i for notational simplicity.
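The two-step VB iteration can be illustrated on a deliberately simple toy model (a scalar Gaussian mean with an unknown noise-variance hyperparameter, not the paper's model): Subproblem 1 updates the variational posterior in closed form, and Subproblem 2 re-estimates the hyperparameter against that posterior.

```python
import numpy as np

rng = np.random.default_rng(4)
y = 3.0 + 0.5 * rng.standard_normal(200)      # data from N(3, 0.25)

tau2 = 100.0        # prior variance of the mean: mu ~ N(0, tau2)
sigma2 = 1.0        # noise-variance hyperparameter, initialized roughly
N = len(y)

for _ in range(20):
    # Subproblem 1: q(mu) = N(m, v), given the current hyperparameter
    v = 1.0 / (N / sigma2 + 1.0 / tau2)
    m = v * y.sum() / sigma2
    # Subproblem 2: maximize the expected log-likelihood over sigma2,
    # taking the expectation under q(mu) (hence the extra + v term)
    sigma2 = np.mean((y - m) ** 2) + v
```

The alternation converges quickly here because both steps have closed forms; in the paper's model the hyperparameter step instead requires gradient-based optimization, as described in Section V-A2.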

1) POSTERIOR APPROXIMATION
To simplify the optimization problem in (26), we assume that the probability density function q(h) factorizes with respect to each variable in h, as in (28). Note that the second equality is obtained without any further assumption. In particular, when we assume W_o and F are independent, they automatically factorize over the latent dimensions d due to the factorized form of their priors. Using the Karush-Kuhn-Tucker (KKT) conditions, it can be shown that the optimal form of the j-th factor based on the minimization of (26) is given by [30] log q(h_j) = E_{∼q(h_j)}[log p(Y, h)] + const., where the notation E_{∼q(h_j)}[·] means taking the expectation with respect to all the variables except h_j. Each optimal variational distribution can be obtained as follows.
• Compute q(f_{:,d}): The optimal un-normalized log variational density log q(f_{:,d}) can be written in terms of the expectation C_{w_d} := E_{∼q(f_{:,d})}[·].
• The un-normalized log variational density of the next factor can be written similarly; it can be seen that (34) is in the form of a normal density.
• Compute q(a): The un-normalized log variational density can be written similarly, and it is also in the form of a normal density.

2) HYPERPARAMETER OPTIMIZATION
In this step, the goal is to optimize the hyper-parameters given the posterior distribution. Using the optimization problem in (27), the objective functions to be minimized for the GP parameters, the GRF parameters, and σ can be written as in (39)–(41). The optimal value of objective (41) can be obtained by setting its gradient to zero. Due to the nonlinear dependency of the GP and GRF objective functions on the hyperparameters, there are no closed-form solutions for (39) and (40). However, the functions are differentiable with respect to the hyper-parameters, and gradient-based methods (e.g., Newton's method) can be used to minimize them.
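Gradient-based hyperparameter optimization of a GP objective can be sketched as follows, for a toy one-dimensional GP with an RBF kernel; the data, kernel parameterization, and use of L-BFGS-B are illustrative assumptions, not the paper's exact objective functions:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
t = np.linspace(0.0, 10.0, 40)
y = np.sin(t) + 0.1 * rng.standard_normal(40)   # noisy toy signal

def neg_log_marginal(log_params):
    """Negative log marginal likelihood of a zero-mean GP with an RBF kernel."""
    alpha, beta, sigma = np.exp(log_params)     # log-parametrization keeps them positive
    d = t[:, None] - t[None, :]
    K = beta * np.exp(-d**2 / alpha) + sigma * np.eye(len(t))
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (logdet + y @ np.linalg.solve(K, y))

# no closed form, but the objective is differentiable: use a quasi-Newton method
res = minimize(neg_log_marginal, x0=np.log([1.0, 1.0, 1.0]), method="L-BFGS-B")
alpha_hat, beta_hat, sigma_hat = np.exp(res.x)
```

Optimizing in log-space is a common trick for positivity-constrained kernel hyperparameters and keeps the quasi-Newton updates unconstrained.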
Overall, the inference algorithm for posterior approximation and hyperparameter optimization takes the form as in Alg. 1.

B. PREDICTION
For the H-step-ahead prediction, we need to compute two types of posterior predictive distributions (PPDs): one for the observed nodes and one for the missing nodes.

1) OBSERVED NODES
The PPD for the observed nodes is given by (43), where Y*_o := [y_{o,T+1}, ..., y_{o,T+H}] and F* := [f_{T+1}, ..., f_{T+H}]. The conditional density p(F*|F) can be computed by noting that the joint p(F, F*) is normal, as in (44). Using the conditioning property of the normal distribution [29], we obtain (45). We can integrate out the latent variables f_{:,d} and b_o to simplify (43), as in (46). It is difficult to obtain a closed-form expression for (46). However, the mean and the variance can be computed analytically.

2) MISSING NODES
The PPD for the missing nodes is given by (50), where Y*_ō := [y_{ō,T+1}, ..., y_{ō,T+H}] is the future traffic of the missing nodes and p(F*) is the marginal predictive distribution of the temporal latent variables defined in (47). To compute (50), we need to obtain the distributions of the missing nodes' latent factors W_ō and biases b_ō conditioned on the inferred parameters, which are respectively represented by p(W_ō|W_o) and p(b_ō|b_o, a, Z_ō). The latent variables W_ō, b_ō, and F* are subsequently used as the parameters of the distribution p(Y*_ō|W_ō, F*, b_ō) that generates the data traffic of the missing nodes. Note that this distribution is similar to (28). In order to simplify the PPD in (50), we compute the marginal predictive distribution of the missing nodes' latent variables by integrating out all the inferred latent variables of the observed nodes. In the following, we derive these marginal distributions.
Using the GRF model in (20), the conditional density p(W_ō|W_o) can be computed as in (52). The marginal predictive distribution of W_ō can then be computed by integrating out W_o, as in (53). Similarly, we can compute the marginal predictive distribution of b_ō, as in (54). Overall, the workflow of the posterior and prediction computations is summarized in Fig. 3. In particular, first, the recorded data traffic at the observed nodes is loaded. Next, in the preprocessing step, we scale down the data. This step is essential because the raw data can have large values, and rescaling helps the VB algorithm converge. Subsequently, using Alg. 1, we compute the posterior distribution. Finally, using (43) and (50), the traffic flows at the missing and the observed nodes are predicted. The per-iteration computational complexity of approximating the posterior distribution is O(DT^3 + DM^3), which is mainly due to matrix inversion operations. The computational complexity of prediction is the same.
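Conditioning the GRF on the observed nodes, as in p(W_ō|W_o), reduces to a standard Gaussian conditional in precision form; the sketch below uses a random stand-in precision matrix and cross-checks against the covariance-form conditional:

```python
import numpy as np

rng = np.random.default_rng(6)
M, Mo = 6, 4

B = rng.standard_normal((M, M))
Q = B @ B.T + M * np.eye(M)                 # stand-in for the GRF precision
w_o = rng.standard_normal(Mo)               # inferred weights at the observed nodes

Q21 = Q[Mo:, :Mo]
Q22 = Q[Mo:, Mo:]

# for a zero-mean GRF: p(w_miss | w_obs) = N(-Q22^{-1} Q21 w_obs, Q22^{-1})
cond_mean = -np.linalg.solve(Q22, Q21 @ w_o)
cond_cov = np.linalg.inv(Q22)

# cross-check via the covariance-form conditional S21 S11^{-1} w_obs
S = np.linalg.inv(Q)
ref_mean = S[Mo:, :Mo] @ np.linalg.solve(S[:Mo, :Mo], w_o)
```

In precision form the conditional only requires the (usually small) missing-node block Q22, which keeps prediction for the missing nodes cheap.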

VI. SIMULATION RESULTS
In this section, we evaluate the prediction accuracy of the proposed graph-based GP (GGP) model using the following two real-world network traffic datasets.
• Abilene dataset [32]: Abilene was a high-performance backbone network created by the Internet2 community in the late 1990s. The network consists of 12 routers in the United States which are connected by bidirectional links, as shown in Fig. 4. The traffic data on each link were recorded every 5 minutes (in Mb) from 2004/03/01 to 2004/09/10. In the original dataset, the traffic was measured on each link, 30 links in total. We aggregate the data traffic of the incoming links at each node in order to obtain node-level traffic measurements. We further aggregate the dataset on an hourly basis.
• Geant dataset [32]: Geant is a pan-European data network. The network consists of 22 routers which are connected by bidirectional links, as shown in Fig. 4. The data were taken in 15-minute steps starting on 04/05/2005 and ending on 31/08/2005. In the original dataset, the traffic was measured on each link, 72 links in total. We aggregate the data traffic of the incoming links at each node in order to obtain node-level traffic measurements. We also aggregate this dataset on an hourly basis. To help the learning procedure converge, the data have been scaled down by a factor of 10,000.
• LSTM [34]: A fully connected network with two recurrent layers. In each recurrent layer, we select the number of hidden units from the set {50, 100, 150, 200} that yields the best performance.
• GCRN [16]: It consists of a stack of graph CNN and LSTM layers. The structure of the LSTM is selected similarly to the previous model.
• Graph Multi-Attention Network (GMAN) [35]: It is designed using attention mechanisms with gated fusion to model the spatio-temporal correlations. This model has three hyperparameters: the number of spatio-temporal attention blocks, the number of attention heads, and the dimensionality of each attention head. We fix the number of spatio-temporal attention blocks to 3 and the number of attention heads to 8, as in [35]. We select the dimensionality of the attention heads from the set {4, 8, 12, 16} that yields the best performance. Moreover, we train the neural network models, i.e., LSTM, GCRN, and GMAN, using the Adam optimizer [36] with an initial learning rate of 0.001.
Table 2 shows the MSE of the different models for different prediction horizons H. Note that the MSE measures the discrepancy between the ground truth and the predicted traffic flow values. Therefore, a prediction model with a smaller MSE has better performance, since its predictions are closer to the ground truth. In this experiment, we set M_ō = 0 and D = 5. We trained the models for T = {100, 200, 300, 400, 500} observations, and the predictions are computed for the subsequent H time steps for each T. The MSE is computed over the predictions for all values of T. The table shows that the developed GGP has a lower MSE than the other benchmarks. Additionally, it can be observed that, in general, the MSE increases with H for all the models. This is expected, since predicting the far-distant future is more challenging. Moreover, in Fig. 6 and Fig. 7, we show the ground truth and the predicted traffic flow trajectories versus the prediction horizon H, for T = 500, for nodes ''ATLAng'' and ''it1.it'' in the Abilene and Geant datasets, respectively. We can see that the predicted traffic flow of the GGP model is much closer to the ground truth than that of the other models.
Further, we investigate the sensitivity of the models to noise. For this, we add artificial noise to the training datasets and train the models. In particular, we add folded Gaussian noise with zero mean and variance σ̃. Fig. 8 and Fig. 9 illustrate the MSE versus σ̃ for the Geant and Abilene datasets, respectively. The results are obtained by averaging over 25 Monte Carlo simulations. It can be observed that the GGP model is much more robust to the noise than the other models. In addition, we see that, in general, the MSE first decreases and then increases as the noise variance increases. The reason is that injecting a small amount of noise into the training dataset can reduce overfitting and improve the prediction accuracy. Note that applying noise to the training dataset is widely used in the deep learning literature as a data augmentation technique to reduce overfitting [37]. However, increasing the noise variance eventually increases the MSE, because the added noise dominates the underlying patterns in the data that are useful for accurate prediction.
Next, we show the prediction error in the presence of missing nodes. We set T = 300, D = 5, and H = 15. For this experiment, it is not straightforward to implement the GCRN because it requires the data traffic of all the nodes. Therefore, we only use LSTM and ARIMA as benchmarks. For these models, we use an iterative mean imputation method to predict the data traffic at the missing nodes. In particular, we fix the predictions at the observed nodes and initialize the missing nodes with zero values. Next, each missing node computes the mean of the data traffic of its one-hop neighbors and sends the computed mean to its neighbors. This procedure is repeated until convergence. Fig. 10 and Fig. 11 depict the MSE versus the number of missing nodes, M_ō, for the Abilene and Geant networks, respectively. The missing nodes are selected randomly and, as in the previous scenario, the results are obtained by averaging over 25 Monte Carlo simulations. We can observe that the developed GGP model outperforms the other benchmarks on both datasets. Moreover, it can be seen that the MSE increases for all the models as the number of missing nodes increases. This is expected because, as M_ō increases, the number of observations decreases and it becomes much more difficult to extract useful patterns among the nodes in the network.
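The iterative neighbor-mean imputation baseline described above can be sketched as follows, on a hypothetical 4-node path graph with one missing node; the function name and iteration budget are illustrative:

```python
import numpy as np

def neighbor_mean_impute(y, A, missing, n_iter=50):
    """Fix the observed entries of y; repeatedly replace each missing entry
    with the mean of its one-hop neighbours until the values settle."""
    y = y.copy()
    y[missing] = 0.0                          # initialize missing nodes with zeros
    for _ in range(n_iter):
        for i in missing:
            nbrs = np.flatnonzero(A[i])
            if len(nbrs) > 0:
                y[i] = y[nbrs].mean()
    return y

# 4-node path graph 0 - 1 - 2 - 3 with node 1 missing
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.array([2.0, np.nan, 4.0, 6.0])
y_hat = neighbor_mean_impute(y, A, missing=[1])
```

With a single missing node the procedure converges in one pass, returning the mean of its neighbours; with several adjacent missing nodes the updates propagate information across the graph over the iterations.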
Finally, we show the MSE versus the number of latent dimensions D for T = 100 and T = 300 in Fig. 12 and Fig. 13. In this experiment, the number of missing nodes is 30% of the total number of nodes. From the figures, we can see that, in general, the MSE first decreases and then slightly increases as D increases. The decreasing trend occurs because, for small D, the model is not yet expressive enough to explain the data traffic well, which results in underfitting. The increasing trend occurs because the model becomes too flexible and fits the noise rather than the useful patterns, which results in overfitting. Moreover, it can be seen that as the number of observations T increases, a larger D is required to achieve accurate prediction. For example, in Fig. 12, we can observe that 8 dimensions are needed for T = 300 while 4 dimensions suffice for T = 100.

VII. CONCLUSION
In this paper, we introduced a graph-based Gaussian process model for network traffic analysis and prediction when the network graph is not fully observable. The model is Bayesian non-parametric and highly flexible in capturing complex nonlinear patterns in the data. We defined a structured kernel function which can model long-term and short-term temporal trends for accurate prediction. Additionally, the model can learn spatial interactions among the nodes in the network, which can be used to predict the missing values. Next, we developed a variational inference algorithm to approximate the intractable posterior distribution, which can be efficiently implemented. Finally, using two real-world datasets, we presented simulation results demonstrating that the proposed model achieves improved prediction accuracy compared with state-of-the-art approaches.