Trend-Aware Proactive Caching via Tensor Train Decomposition: A Bayesian Viewpoint

Content caching at base stations is an effective solution to cope with the unprecedented data traffic growth by prefetching contents near to end-users. To proactively servicing users, it is of high importance to extract predictive information from data requests. In this paper, we propose an accurate content request prediction algorithm for improving the performance of edge caching systems. In particular, we develop a Bayesian dynamical model through which a complex nonlinear latent temporal trend structure in the content requests can be accurately tracked and predicted. The dynamical model also leverages tensor train decomposition to capture content-location interactions to further enhance the accuracy of predictions. To infer the model’s parameters, we derive an approximation of the posterior distribution based on variational Bayes (VB) method with an embedded Kalman smoother algorithm. Based on the predictions of the proposed model, we design a cost-efficient proactive cooperative caching policy which adaptively utilizes network resources and optimizes the content delivery. The advantage of the proposed caching scheme is demonstrated via numerical results using two real-world datasets, which show that the developed Bayesian dynamical model substantially outperforms reference methods that ignore the temporal trends and content-location interactions.


I. INTRODUCTION
M OBILE traffic is explosively increasing, under the constant pressure of data hungry applications such as on-demand video streaming [2]. The proactive caching, as an effective approach for mitigating this issue, aims at exploiting predictable user demand behavior in smoothing out communication network traffic by prefetching the most popular contents close to end-users during off-peak hours [3]. Leveraging these predictive abilities, network resources can be pre-allocated more efficiently by servicing predictable peak-hour requests.
To design an accurate content prediction algorithm, it is crucial to understand the underlying structures in the requests. In particular, content requests may exhibit three important patterns: 1) Non-linear temporal trends: In reality, there exist different kinds of temporal request trends among different types of contents. For example, it has been observed that a video can go through multiple phases of increase and decrease in requests during its life-cycle [4].
2) The interactions among contents: The requests for some contents may exhibit a similar pattern, for instance due to sharing the same category of features.
3) The local content popularities and their interactions across locations (i.e., spatial correlations): It has been observed that the requests for the majority of contents are location-dependent [5]. For example, users in residential, academic, and other type of areas are interested in specific categories of contents. Yet, the requests in various locations are not independent and they may exhibit similar patterns. For example, content popularity can spread through word-of-mouth across social connections. Most of prior work on content request prediction focused mainly on static scenario and ignored the time evolution pattern in the requests, e.g., [6]- [8]. Content requests, however, exhibit time-dependency which can have considerable effect on the caching performance. The auto-regressive movingaverage (ARMA) in time-series is the most widely-used model for prediction [9]- [11]. The ARMA model specifically is designed for continuous-valued data and it may not be suitable for count-valued content requests. In other words, due to the Gaussian assumption, it is likely to predict negative values for the requests which are not meaningful. In [12], a prediction algorithm was designed where the requests are restricted to be on the probability simplex space. Though the algorithm therein guarantees the positiveness of the requests in prediction, it has an important shortcoming. Particularly, in many caching policies, e.g., as in [13], it is critical to predict the actual request rate (e.g., to measure data traffic) over time and location in a network. Therefore, the normalized requests may not be usable for such polices. Additionally, both the ARMA model and the proposed model in [12] are linear and also ignore the latent structures in the content requests, as mentioned in points 1-3, and hence, they can be quite restrictive.
Deep neural network (DNN) modeling approach has been used in [14]- [16] for the prediction task. DNN models are non-linear and potentially can capture complex structures in the requests. An importing drawback of most DNN models (and also ARMA model) is that they require rich historical data for prediction. However, content providers continuously release new contents for which there is rare or no historical data. Additionally, different from conventional content delivery networks, the number of requests is very small at the network edge [17] which may limit their applications in edge caching systems due to overfitting problem.
The authors of [18]- [22] proposed to apply reinforcement learning (RL) to learn the underlying dynamics of the request for cache optimization. Nevertheless, as it was shown in [23], RL may not be yet competitive in comparison to simple heuristics, e.g., random caching. The reason is that RL typically requires millions of learning samples which leads to slow reaction times in dynamic environments [24]. In caching systems this issue is significant since servers can face quickly changing unexpected data traffic patterns.
Bayesian approaches are widely popular for tackling data scarcity problem and being robust against overfitting [25]. In our previous work [26], we proposed a Bayesian model based on matrix factorization approach to capture the interactions among the contents in a time varying scenario. More recently in [27], we introduced a dynamical model using a CANDECOM/PARAFAC (CP) tensor decomposition to capture content-location interactions.
Parallelly, there has been a great interest in exploring optimal proactive cooperative caching policy and performance analysis for an efficient use of network resources. The vast majority of work, however, focused on static scenarios or designed policies with fixed resource dimensioning [13], [27], [28], and [29]. A more efficient and realistic scheme is to design a policy that adaptively adjusts network resources according to the dynamic of content popularity. This can be achieved thanks to the recent technological developments on cloud service providers where physical resources are mapped to virtual resources [30]. As a result, the network operator can dimension the consumed resources as required.
In this paper, we consider edge caching systems with dynamic resource dimensioning and introduce a flexible dynamical model to capture the nonlinear trend information of the requests. Our contributions are as follows: • We model the observed requests by a tensor with three modes, i.e., content, location and time. By deploying tensor train factorization approach, we decompose the requests into three latent factors (content, location, time) through which complex interactions can be captured. The train factorization method numerically is more stable and provides a more accurate representation of the data than the classical CP factorization method [31], [32]. Subsequently, by assuming that the time factors follow a structured Gaussian process, complex temporal trend structures in the requests can be modeled. • Due to insufficient request observations in the network edge [17], we adhere to the Bayesian principle, which is very efficient for dealing with the problem of overfitting, for model learning. Moreover, a simple-toimplement approximation method based on the Kalman smoother and VB is derived to infer the parameters of Bayesian dynamical model. The Bayesian trend-aware train decomposition model developed in this work has a more expressiveness power than the non-trend CP decomposition model in [27]. Therefore, the VB algorithm designed here is more sophisticated than the one in [27]. • To provide an end-to-end caching system design, a joint routing and content placement policy is designed for cooperative caching to minimize network cost by adaptively utilizing the network resources, i.e., cache memory and link bandwidth, in time-varying scenario. The caching policy is coupled over time and is a function of future requests therefore it is infeasible to solve for online setting. To tackle the issue, we predict a sequence of future requests by leveraging the dynamical model to optimize a multistage caching policy defined over a prediction horizon. • The caching policy is a mixed-integer non-linear programming problem which is NP-hard and difficult to solve. To overcome the challenge of the formulated policy, we develop an approximation algorithm based on difference of convex (DC) programming which can be solved in polynomial time.
• We show via numerical results that the developed Bayesian dynamical model is substantially outperforms reference methods which ignore the temporal trends and content-location interactions using two real-world datasets. The rest of the paper is organized as follows. Section II describes the Bayesian dynamical model for proactive caching. In Section III, we explain the VB approximation algorithm. In Section IV, we describe an adaptive cooperative caching policy. Section V provides numerical results. Finally, VI concludes the paper.
Notation: Scalars are denoted by lower-case letters, e.g., a, vectors are denoted by bold-face lowercase letters, e.g., a, and matrices are denoted by bold-face uppercase letters, e.g., A. The superscript (.) T denotes matrix transpose. The expectation is denoted by E [.]. N (μ, ) denotes Gaussian distribution with mean μ and covariance matrix , and the symbol ∼ is used to mean "distributed as". G(α, β) denotes gamma distribution with shape α and rate β, and Pois(r) denotes Poisson distribution with rate r.

II. BAYESIAN TREND MODEL FOR PROACTIVE CACHING
Consider a cellular network composed of N base stations (BSs) which are distributed over a geographical area and cooperatively serve theirs users. Each BS n ∈ B = {1, . . . , N} has a cache with limited storage capacity and can store a subset of contents from library M = {1, . . . , M} with M contents. Over time, the users in a cell submit requests from library M towards the BS's cache. All BSs are connected to the content server through back-haul links via mobile network operator core and we assume all the processing tasks are performed in this core. The mobile network operator aims to provide the users a proactive caching service for the peak hours. To this end, it collects requests from the BSs, makes prediction and subsequently caches the contents based on caching policy. Hence, to obtain an efficient caching policy, an algorithm for tracking and predicting the requests is essential. In this section, we introduce a Bayesian trend model by which tracking and prediction can be performed accurately.
Lets consider a three-mode tensor D ∈ Z M×N×T consisting of request observations during T time slots, where element D(m, n, t) denoting the total number of requests for content m by the users at cell n at time slot t. We assume that the t-th frontal slice of D, denoted more compactly by D t , can be decomposed under Poisson likelihood as: where vectors A(:, m) ∈ R K 1 ×1 + and B(:, n) ∈ R K 2 ×1 + contain the latent-factor representations of content m and cell n respectively; and Z t ∈ R K 1 ×K 2 + models the interactions of the latent factors at time t. Moreover, k 1 = 1, . . . , K 1 is the column index of matrices A and Z t ; k 2 = 1, . . . , K 2 is the row index of matrices B and Z t ; and r mnt is defined as the mean of the requests D t (m, n), i.e., the mean of the requests for content m at cell n during time slot t. Generally, it is assumed K 1 < M and K 2 < N such that the latent factors offer a more compact description of the requests. The decomposition approach deployed in (1) is called tensor train factorization [31] and [32]. This is a modification of probabilistic CP tensor factorization in [33] but with three differences. First, the Poisson substitutes the Gaussian which makes the model to be suitable for the requests that are non-negative integers. Second, the latent factors are positive which improves the interpretability of the model. Third, the time factor, Z t , is a full matrix which enhances the expressiveness power of the model. We also note that the CP tensor factorization method proposed in [27] for Poisson data is a special case of the formulation in (1), where the time factor matrix is diagonal. A graphical view of decomposition model (1) is shown in Fig. 1.
Using the property of the Poisson model, (1) can be equivalently reformulated as [34]: whereD mnt is an auxiliary latent variable such that D t (m, n) = The reformulation in (2) facilitates the inference, as will be shown in the following.
Regarding the time factor Z t , which reveals the time evolution in the requests, it is rational to assume that it varies smoothly over time. Specifically, we consider the following hierarchical Gaussian process model: . .
and the time evolution in Z t is modeled as Z t = exp(X 0t ), where x ,t = vec (X ,t ) and vec(.) is the vectorization operator. Moreover, e ,t is a random variable represents unpredictable change in latent variable x ,t between time t and time t − 1, and it is assumed that e ,t ∼ N (0, −1 ). Additionally, matrix ∈ R K 1 K 2 ×K 1 K 2 captures the dependency between x ,t and x ,t−1 .
The Gaussian model in (3) consists of L + 1 latent layers and constructs a non-linear trend model [35]. Specifically, the zeroth latent layer, modeled by x 0t , captures temporal locality in Z t and equivalently in the content requests. The first latent layer, modeled by x 1t , captures a linear trend in the evolution of x 0t . The second layer, modeled by x 2t , captures a linear trend in x 1t which results a quadratic trend in x 0t . Similarly, the Lth layer, modeled by x Lt , captures a linear trend in x L−1,t which results a quadratic trend in x L−2,t , a cubic trend in x L−3,t and so on. Therefore, by increasing the number of latent layers a non-linear trend model can be constructed. Due to this particular presumed underlying structure on the way latent factors Z t are generated, the model can uncover a complex time evolution pattern in the requests. Now, we explain the form of priors we choose for the parameters of dynamical model. The motivation in the following priors is to satisfy conjugacy property for efficient inference [36]. Similar to [34], we use the following hierarchical gamma priors for the elements of B as: The gamma priors in (4) promote the location factors to have sparse representations. In particular, the gamma distribution puts non-zero mass on zero value and has an exponential shape for β ≤ 1; therefore it is sparsity promoting. The sparsity level increases as β decreases. Additionally, defining hyper-priors on location-specific gamma rate parameters captures the overall activity level of users across locations. This hierarchical structure captures the heterogeneity across locations, some can have a larger number of active users than others. Similarly, we use the following hierarchical priors for the elements of A as: Likewise, the hierarchical gamma priors in (6) and (7) promote the content latent factors to have sparse representations and also capture the heterogeneity across contents, some can be more popular than others. The priors for and are jointly modeled by a Gaussian-gamma density as: where, for simplicity, we assume that they are diagonal such that λ = diag( ) and π = diag( ). We additionally assume that the prior of initial states in (3) to be Gaussian x 0 ∼ N (0, Q −1 ,init ). A graphical representation of the Bayesian model is shown in Fig. 2.
For ease of notation, we collect all unknowns into a single parameter denoted by h = Our goal is to develop a method to compute the full posterior of variables in h given the observed content requests over T time slots, stated as:

III. MODEL LEARNING VIA VARIATIONAL BAYES
An exact Bayesian inference in (9) needs to integrate over all the variables encapsulated in h, which is analytically intractable. To find a scalable Bayesian inference, we approximate p(h|D 1:T ) with factorized density q(h) = j q(h j ), found by maximizing a variational lower bound to log marginal likelihood, log p(D 1:T ), [36]. By using the coordinate decent method, it is shown that the optimized j-th variational factor has the following update [36]: The notation E ∼q(h j ) [.] denotes an expectation with respect to all variables except h j . Since any term that does not depend on h j can be constant under the expectation and can be subsumed into the constant term, we can rewrite (10) as: where p(h j |.) is the un-normalized (or normalized) conditional posterior of h j . We employ the following factorization for q(h): We now present the forms of the optimized variational factors in the following. We drop subscript ∼ q(h j ) in (10) and (11) for notational brevity, and all the expectations need to be computed with respect to the variables inside the expectation argument under the variational density unless we explicitly mention it.

1) MULTINOMIAL UPDATE FOR q(D mnt )
By using the variable augmentation method in (2) and the property of Poisson and multinomial [37], the conditional posterior ofd mnt = vec(D mnt ) is a multinomial density given by: where By substituting (13) into (11), we obtain: wherȇ We can see that by normalizingp mnt , expression (14) is in a form of multinomial density. Therefore, we assume q(d mnt ) is a multinomial density given by: withp mnt =p mnt kpmnt (k) . (17) Note that the second term inside the expectation in (15), which cannot be computed analytically, is automatically canceled out due to the normalization in (17).

2) GAMMA UPDATES FOR q(A)
The gamma and Poisson distributions in (6) and (2) are conjugate. It is straightforward to show that the conditional posterior of A(k 1 , m) is a gamma density given as: By substituting (18) into (11), the vartaional density q (A(k 1 , m)) is again a gamma density. Table 1 provides the exact expression.
Gamma updates for q(â), q(B), q(b): The updates for these densities can be derived in similar way as for q(A) and are summarized in Table 1.
Gaussian-gamma updates for q( , ): It can be shown that the optimized variational distribution (41) has the following form (the proof is given in the Appendix): where: Gaussian Updates for q(x 0(0:T) ): The functional dependency of (11) on x 0t has the following form 1 : where The Gaussian process x 0t is used as a prior for the natural (or equivalently rate) parameter of the Poisson distribution in which they are not conjugate. Because there does not exist close-form expression for (21), similar to [26], we approximate it by a Gaussian density using the Laplace method as: where m 0t = x * 0t and x * 0t is the point of maxima of (21).
is the Hessian matrix of (21) at x * 0t . Gaussian updates for q(x 0:T ) for > 0: Unlike in (21), there is no source of non-conjugacy for x 0:T and since they construct a singly connected network with Gaussian density in each node, the update follows a Gaussian density. The required quantities during the inference and the prediction are marginal q(x t ) and pairwise marginal q(x t , x t+1 ) which are again Gaussian densities. To this end, we deploy a Kalman smoother algorithm. The modification is that, except x t , all variables are replaced with their expectations under the vartaional distributions. Please refer to [38] for more details.
The marginal density q(x t ) can be computed in two recursions: forward and backward. By combing the two recursions, we obtain: where q(x t ) and q(x t ) are the forward and backward density messages, respectively. The forward message is computed as: Assuming (25) is a Gaussian quadratic function in x ,t−1 . Thus, (25) has a closed-form solution which is a Gaussian density (after normalization) given as: 1. Note that we will not pay special attention to the boundaries of x t at t = 0 and t = T and details can be worked out easily.
From (26) and (24), we obtain where The backward density message is computed by: ,t+1 ), the integrand is a Gaussian quadric function in x ,t+1 . Thus, expression (28) has a closed-form solution and is given by: where To run the backward recursion, an initial condition is required for x T . This is usually set to be q(x T ) = 1 which can be obtained by a zero mean Gaussian density with high variance. Now, according to (23), by combining the forward and backward density messages, we obtain: where Similarly, the pairwise marginal can be computed as: By using the Schur complement, we can show that the covariance between x ,t and x ,t+1 is given by: We summarise our inference algorithm to learn the model parameters in Algorithm 1. Specifically, we initialize the parameters of the variational distributions and iteratively updating them, according to their derivations as in the above, until convergence.

A. PREDICTION
After computing the posterior distribution, we will predict the request rates at H-step ahead from time slot T, for all H > 0, which can be obtained by computing the mean of the posterior predictive distribution, p(r mn,T+H |D 1:T ), as: where This requires the computation of the expectation of exp(X 0,T+H ) under the marginal posterior predictive q(.)p({x ,T+h } h, ). Because this expectation does not have an analytical expression, we instead use the Jensen's inequality and approximate (32) by: The operation in (33) requires to compute the expectation of x 0,T+H under q(.)p({x ,T+h } h, ), which is given by the following recursive formula: with initial conditions: We use the H-step ahead predictions in (33) to design a multistage caching policy presented in Section IV.  A,â, B Updatep mnt using (17), ∀m = 1, . . . M, n = 1, . . . , N, t = 1, . . . , T; 4 Update m t , Q t using (30) and (21) 1)). iv) The time complexity of updating the parameters of , , ∀ = 0, . . . , L, is O(K 1 K 2 (L + 1)). Overall, the time complexity of the VB algorithm is O(TK 1 K 2 T nnz +MK 1 +NK 2 + TK 1 K 2 (L + 1)). Note that the update of each parameter depends only on the other parameters in its Markov blanket, where the Markov blanket consists of the variable's parents, co-parents, and children in the graphical representation of the Bayesian model, and is independent of other parameters. For instance, the updates of a content factors are independent of all other contents (and similarly for a cell factors). Thus, we can easily parallelize the posterior inference computations to scale up the performance of the VB algorithm.
Remark: Note that, in practice, the number of contents is not fixed and new contents are released over time. In such cases, our proposed dynamical model is still capable of providing accurate popularity prediction with some minor modifications. A simple procedure is to add a dimension in the content space in the tensor when a new content is released. The initial values for the latent factors of this new content are sampled from the prior distribution and will be updated after observing its requests. Nevertheless, an issue with this procedure is that the computational complexity of the learning algorithm increases as more new contents are released over time. To tackle this issue, we can first obtain a coarse popularity prediction by using a much simpler algorithm (e.g., ARMA) to select the most M temporal popular contents, where M is chosen by the available computation resource at the mobile network operator. Subsequently, these M contents are fed into the dynamical model for higher prediction accuracy. In this way our dynamical model would have a fixed computational complexity in terms of the number of contents. We believe that the time varying nature of the number of contents is crucial to practical applications, which is our next target in our future work for a more detailed algorithm design.

IV. COOPERATIVE CACHING PROBLEM FORMULATION
In this section, we formulate an adaptive cost-aware proactive cooperative caching policy. Following the system model explained in Section II, we assume that the BSs can cooperate via dedicated inter-BS links, e.g., the X2 link. The connectivity between BS n and BS n is denoted by δ nn ∈ {0, 1}, where δ nn = 1 if BS n is connected to Bs n , otherwise δ nn = 0. The set of BSs connected to BS n is denoted by B(n) = {n : δ nn = 1}. When receiving a request for a content, BS n directly serves its users if the content is available in its cache, otherwise it retrieves the content from the neighboring BSs that have the content. If the content is available neither in its cache nor in the other BSs' caches, the BS downloads the content from the content server which holds all the contents via the back-haul link. In this scenario, the content delivery phase is jointly designed with the content placement and routing decisions. In particular, we design a cost efficient caching policy which adaptively utilizes the network resources, i.e., the cache memory and the link bandwidth, over time and is formulated as 2 : min y nn mt ,S nt ,f nn t t∈T ∀n ∈ B(n ), n ∈ B, t ∈ T (35b) m∈M s m y nnmt ≤ S nt , ∀n ∈ B, t ∈ T (35c) y nn mt ≤ y n n mt , ∀n ∈ B, n ∈ B(n),  In problem (35), the objective function is a sum of four costs: i) C 1t measures the download cost from the content server; ii) C 2t is the cost for content routing among the BSs where f nn t is the amount of bandwidth allocated between BS n and BS n at time t; iii) C 3t is the caches storage cost where S nt is the amount of memory assigned to the cache at BS n at time time t; and iv) C 4t is the cache refreshment cost. s m is the size of content m, c nn is the cost for transferring data between BS n and BS n , c n is the cost for transferring data from the content server to BS n and c n is the cache storage cost. Moreover, y nn mt is decision variable defined as: From the definition of y nn mt , it follows that y nnmt is the cache placement decision variable whose value equals to one if content m is stored at cache n, and zero otherwise. In problem (35), constraint (35b) ensures the link capacity between BS n and BS n is less than assigned f nn t . Constraint (35c) guarantees the cache memory at BS n is less than assigned S nt . Constraint (35d) ensures that BS n can send content m to BS n if it is stored in its local BS. Constraint (35e) ensures that BS n can only store content m in its cache or fetch it from only one BS. Finally, constraints (35f), (35g) and (35h) represent the space restriction of the decision variables (i.e., y nn mt , f nn t and S nt ), where S max and f max are respectively the maximum cache memory and the maximum link capacity. Note that since, in practice, the caches are large we assume S nt is a continuous real number and can take any real number.
The optimization problem (35) is coupled over time slots thorough C 4t . Specifically, a decision at present time depends on the content requests in the future [27]. For instance, it may be optimal to remove a content from a cache for the current time slot if we ignore the future requests. This decision may not be optimal if we knew the content requests during the next time slots. Specifically, the request for this content may increase in the next time slots and therefore keeping this content can save the cost for fetching the content from the server for the next time slot. This means that in order to obtain a global solution, problem (35) needs to be solved jointly over all time slots. This is not feasible in online settings because the future requests are unavailable. A simple and common approach is to ignore the future requests and solve a one shot optimization [27]. Ignoring the future requests, however, degrades the caching performance. In the following section, we propose an approach to achieve improved performance.

A. H-STEP AHEAD PREDICTIVE CACHE OPTIMIZATION
To improve the quality of the solution, we exploit the latent trends in the requests extracted by the Bayesian dynamical model and propose a H-step ahead predictive optimization problem. In particular, given the cache state y nnmt at time slot t , we solve min y nn mt ,S nt ,f nn t t +H t=t +1 where T = {t + 1, . . . , t + H}. The caching policy requires the true popularity of contents over the next H time slots, r m,n,t +h , h = 1, . . . , H, which are approximated by the predictions computed in (33), i.e., r m,n,t +h ≈r m,n,t +h , h = 1, . . . , H. We note that as a result of multiplication of y nm,t +h and y nm,t +h+1 in cost function C 4,t +h , h = 1, . . . , H, problem (37) is nonlinear mixed-integer programming which is not trivial to solve. Therefore, we equivalently transform the non-linear problem in (37) into a linear problem by linearizing the non-linear terms in C 4t using the following Proposition [39]: Proposition 1: Using the transformation χ nmh = y nm,t +h y nm,t +h+1 the following constraints linearize the problem (37). (38) where 0 ≤ χ nmh ≤ 1.
Proof: The proof can be derived in a similar way as in [39,Proposition 2].
Under the above linearization procedure, we can now formulate a linear mixed-integer programming formulation of problem (37) as follows: An optimal solution of problem (39) can be obtained by using global optimization methods, e.g., branch and bound [40].
Since the optimization problem in (39) is mixed-integer programming, there is no guarantee to be solved in polynomial time. Considering the stringent requirement and large-scale of the caching problem, it may not be efficient to implement the exact methods for optimal solution. In the following section we, develop a polynomial time algorithm to approximate the solution of (39).

B. A POLYNOMIAL TIME CACHE OPTIMIZATION
A common approach to approximate a solution of (39) is to relax the problem by replacing binary constraints with continuous interval [0, 1]. This convex relaxation approach often suffers from the fact that the computed solutions can be far from binary and that subsequent heuristic method is required to obtain binary solutions which it may substantially degrade the quality of the solutions. An approach to enhance the solutions is to incorporate a regularization function as a measure of relaxation tightness. In this work, we consider a penalty function g(x) = x(1 − x) [40] and relax the optimization problem at time slot t as: where ζ nn mt is sufficiently large penalty parameter. We can see that for binary solutions the penalty function term vanishes and problem (40) becomes equivalent to problem (39). Note that the minimization problem in (40) is not convex because it is sum of a linear function and a concave function, due to the term −x 2 in penalty function g(x), and therefore it can not be solved efficiently. Since the objective function is sum of a linear function and a concave function, we resort to difference of convex (DC) programming approach [41] to solve the problem. The DC procedure is a heuristic algorithm for finding a local optimum by convexifying a convex-concave function which is commonly performed by linearizing the concave term. The DC programming for solving the caching policy in (40) is depicted in Algorithm . Specifically, at iteration ith of the algorithm, we solve problem (40) by linearizing the penalty function term around y i nn mt . In lines 5-9, we increase the penalty parameters by multiplication with a factor ϕ only if the improvement of the solutions towards binary values measured by g(y i+1 nn mt ) is not decreased by a factor δ over the previous iteration. This form of update guarantees the convergence of the algorithm and mitigates numerical issues caused when the penalty parameters grow too large [40]. To check the convergence of the algorithm, a reasonable stopping criterion is when the improvement in the convexified objective function is less than small threshold .
The DC programming procedure does not guarantee to yield binary solutions and rounding the continuous values  may violate some of the constraints. To provide a feasible binary solution, we sort the contributions of the continuous solutions to the objective function in increasing order. Then, in a greedy manner, we successively remove a content from the caches or delete a communication link which has the smallest contribution in the objective function until to satisfy all the violated constraints.

V. NUMERICAL RESULTS
In this section, we numerically evaluate the performance of the proposed dynamical hierarchical Poisson-Gaussian trend (DHPGT) model in Section II and the proposed proactive caching policy in Section IV. To provide exponentially shaped gamma priors for the sparsity of the latent factors, we set α = β = γ = 0.5 unless otherwise stated. The rest of the model hyper-parameters are provided in Table 2. All simulations are run with MATLAB version 2019a on 64bit operating system with processor Intel Core i7-6820HQ CPU @2.70GHz and 8 GB RAM.

A. SYNTHETIC DATA
In this part, we show the performance of DHPGT model using syntactically generated requests. We first evaluate the prediction accuracy of the developed VB approximation. We use the mean absolute error (MAE) as a performance metric: To generate the synthetic data, we set M = 25, N = 6, K 1 = 8, 0 = 4 1 = 16 2 = 0.25 × 10 −4 I, 0 = 0.9I, 1 = 0.95I and 3 = 0.98I. Additionally, the values of the contents and cells latent factors are randomly drawn from their priors. The results are taken average over 25 different data realizations. In this scenario, we assume that the true latent dimensions, K 1 , is unknown and we fit the model with dimensions different from the true one. Fig. 3 shows the MAE versus different values of K 1 for different number time slots T. We can see that, for a fixed T, as K 1 increases the MAE decreases. This is expected since by increasing K 1 the flexibility of the model increases and the data requests be explained more accurately. However, it can be seen that after a specific value of K 1 the MAE increases slightly which is due to over-fitting problem. In addition, we observe that the minimum value of the MAE is attained at higher K 1 values as the number of time slots T increases. In particular, for T = 75 and T = 100 the minimum value of the MAE occurs at K 1 = 8 which is the true latent dimensions. Moreover, we can observe that for small values of K 1 (e.g., at 2 and 4) the MAE increases as T increases. This happens because of under-fitting problem i.e., the model cannot explain the data well with a very small number of parameters. Fig. 4 illustrates the convergence behavior of variational lower bound to log marginal likelihood for different values of K 1 for T = 25. We can see that the variational lower bound increases as the algorithm progresses. Specifically, the developed coordinate ascent VB guarantees to increase at each iteration since each update has a unique solution. It can be seen that as K 1 increases, the VB algorithm takes more time to converge. This is expected since more parameters are required to be estimated as K 1 increases. Moreover, we observe that the variational lower bound increases as K 1 increases from 3 to 5 which indicates that with K 1 = 3 the model underfits and more factors are required to describe the data well. On the other hand, we see that the variational lower bound decreases as K 1 increases from 5 to 10. This is because the log marginal likelihood penalizes models with too many parameters, i.e., overfitted models, and assigns them less likelihood [42]. This also confirms our results in Fig. 3.
Furthermore, in Fig. 5, we explore the latent structures inferred by the model. Fig. 5-a shows the trajectories of actual requests, the true request rates and the inferred means of the request rates for a content in a cell over time. We can see that the VB approximation tracks the true request rate accurately. It is also observed that the requests for this content has an increasing trend followed by a decreasing trend over time. Fig. 5-(b) shows the inferred mean of the time latent factor. For the ease of visualization, we only show the latent time factor which has the largest content and  cell factor weights denoted by k * . We can observe that this time factor can explain the temporal pattern of the content requests very well. Fig. 5-(c) illustrates the inferred m 0t (k * ) which it can be seen that it has the same temporal pattern as in Fig. 5-(b). Fig. 5-(d) illustrates m 1t (k * ) which we can observe that it captures linear temporal trends in m 0t (k * ). Particularly, when m 0t (k * ) increases (or it has a positive slop), m 1t (k * ) is positive, and when m 0t (k * ) decreases (or it has a negative slop), m 1t (k * ) is negative. Furthermore, Fig. 5-(e) shows the interfered mean of m 2t (k * ) which, with the same argument, captures linear temporal trends in m 1t (k * ).

B. REAL-WORLD DATA
In this section, we show results on prediction accuracy using two real-world MovieLens-20M [43] and Netflix [44] datasets. We choose ratings on a monthly basis where we select the most popular M = 100 movies and the most active 1000 users. We only use the ratings during the last 200 and 70 months for the MovieLens and the Netflix datasets respectively because the majority of the ratings are provided during theses time frames. Moreover, since our focus is on count data requests, we discretize non-integer ratings to the their nearest integer values. Additionally, since the datasets do not provide information about the locations of the users, we randomly distribute the users across N = 10 cells, each has 100 users. Note that the datasets consist of ratings for each users for movies. Therefore, we have aggregated the users' ratings to obtain BS-level requests in order to adapt the data to our model.
We compare the proposed prediction model scheme against the following two baseline models: where I and J specify the order of the model, ϕ i and α j are the parameters, and n m,t is white noise error term. We set I = J = 7 which is also used in [10]. For its implementation, we used the econometrics MATLAB toolbox. Note that due to the Gaussian noise assumption, ARMA model may predict negative values for the requests. To provide meaningful predictions, we round the negative values to a small positive value e.g., 10 −5 .
Since the true underlying requests are unknown for the real-world datasets, we define MAE = 1 MNT m,n,t |D t (m, n) −r mnt | which measure the dispensary between the instantaneous requests and the predicted request rates. Table 3 shows the prediction error of different models on the datasets for various prediction time horizon H. We also give the average burstiness of each dataset which we

define as
From the table, it can be seen that DHPGT model is more accurate than DIHPG, DPG and ARMA models. Moreover, for DHPGT model, increasing the number latent layers, L, improves the prediction accuracy. This indicates that, by increasing L, DHPGT model can capture a deeper hidden trend structures in the requests and therefore a higher accuracy is attained. Additionally, we observe that DIHPG model has the worst prediction accuracy among DHPGT and DPG models. This is because it ignores important hidden structures in the requests i.e., the temporal trends and the spatial correlations. Furthermore, it can be observed that as the prediction time horizon H increases, MAE increases for all the models which is due to the difficulty to predict a far away future time slot. We can also see that the MoveiLens dataset has a smaller error prediction with respect to the Netflix dataset. This can be explained by examining the burstiness of the datasets. Specially, since the Netflix dataset is more burst in comparison with the MoveiLens dataset, it is also more difficult to predict. We also show the prediction error of DHPGT model in terms of latent dimensions, K 1 . Fig. 6 illustrates the prediction error versus K 1 for different datasets. In this scenario, we use T = 69 and T = 199 time slots for respectively the Netflix and the Movielens datasets for training and the last time slot for prediction. It can be observed that as K 1 increase the error decreases for both datasets. This is expected since by increasing K 1 the flexibility of the model increases and therefore it can explain the data more accurately. Moreover, for each dataset, it can be seen that after a specific value of K 1 (K 1 = 24 for the Netflix dataset and K 1 = 34 for the MovieLens dataset) the prediction error increases which is due to the overfitting problem. Furthermore, in Fig. 7, we show the prediction error versus the sparsity level of the latent factors A and B which is specified by γ for different datasets for K 1 = 5. From  Fig. 7, we can observe that the model performs fairly robust for wide-range of γ values (0.01 − 1) on both datasets.

C. CACHING PERFORMANCE
In this section, we study the impact of the prediction accuracy on the caching performance by examining on the Netflix dataset. Since the focus is to compare the performance of prediction models, without loss of generality, we assume all the contents have equal sizes of one unit. We set c n = 50, c n,n = 5 and c n = 2. Furthermore, the results are obtained by computing the global optimum solution of the caching policy optimization in (39). In our implementation, we use the MOSEK package. Fig. 8 illustrates the aggregate network cost versus f max for S max = 25 and for different time horizon prediction H. We also show an oracle approach which knows the requests in advance. As expected, the oracle gives a lower bound to the other prediction approaches. Moreover, it can be seen that DHPGT model outperforms the other models, i.e., DIHPG, DPG and ARMA models, and its performance improves as the number of latent layers L increase. This also confirms the results in Table 3. Additionally, we can see that the network cost decreases as H increases for the oracle approach. This indicates that solving the one shot caching   policy is sub-optimal even if the true content requests is known in advance. We also observe that solving the caching policy for more than one time slot, H = 2, can not improve the caching performance for non-trend models, e.g., DPG. On the other hand, our proposed trend model, DHPGT, reduces the network cost for H = 2. This indicates that our trend model provides more robust and accoutre mean predictions for more than one time slot ahead in comparison to the non-trend-models. Furthermore, Fig. 8 shows that the network cost decreases as f max increases. This is because the BSs can be more cooperative for content sharing which bypasses the need to download the contents via the costly back-haul links.

D. CACHING POLICY APPROXIMATION
Finally, we show numerical results on the performance of different approximation methods for the mixed-integer programming problem in (39). The content sizes are randomly generated from interval [0, 1]. The parameters of the DC programming problem are set as δ = 0.25, ϕ = 2, ϕ max = 10 5 , = 10 −6 and ζ 0 nn m = 2, ∀m ∈ M, n, n ∈ N . The results are obtained by choosing 10 randomly selected time slots from the Netflix dataset. The parameters of the caching policy is set as in Fig. 8. Fig. 9 illustrates the performance of different approximation methods for different values of f max . We can see that the DC programming achieves a better performance in comparison to the linear relaxation programming and the performance gap increases as f max increases.
We also depict a fixed caching policy which does not optimize the cache sizes and the link capacities. Fig. 9 shows that the adaptive caching policy attains a better performance with respect to the fixed one. This is because the adaptive caching policy can adjust the network resources according the content requests which can save the network cost more efficiently.
To investigate the computational complexity of the optimization problems, we show the running time of different optimization methods in terms of the number of contents and cells in Table 4. The Table shows that the running time for the developed DC programming is much smaller than MSOEK solver and the relaxation method has the smallest running time. Moreover, the running time increases as the number of contents/cells increases for all the methods.

VI. CONCLUSION
In this paper, we introduced a flexible Bayesian dynamical model to capture time-varying content requests in practical edge caching systems. We first developed a probabilistic tensor factorization model which can capture interactions among contents over location and suitable for count-valued requests. Then, a Gaussian process model was used to capture complex non-linear temporal trends. The inference performed based on vartaional and Kalman smoother algorithms, and a simple-to-implement posterior approximation was derived. We subsequently designed a dynamic caching policy which can adapt the network recourses according to dynamics of the content requests. To overcome the non-convexity of the formulated problem, we developed an approximation algorithm based on difference of convex programming which can be solved in polynomial time. In the simulation results, the Bayesian dynamical model accuracy was examined on two real-world deatasets and we showed that it outperforms reference prediction methods.