Online Content Popularity Prediction and Learning in Wireless Edge Caching

Caching popular contents in advance is an important technique for achieving low latency and reducing backhaul costs in future wireless communications. Considering a network with base stations distributed as a Poisson point process, optimal content placement caching probabilities are obtained to maximize the average success probability (ASP) for a known content popularity (CP) profile, which in practice is time-varying and unknown in advance. In this paper, we first propose two online prediction (OP) methods for forecasting the CP, viz., the popularity prediction model (PPM) and the Grassmannian prediction model (GPM), where the unconstrained coefficients for linear prediction are obtained by solving constrained non-negative least squares problems. To reduce the high computational complexity per online round, two online learning (OL) approaches, viz., weighted-follow-the-leader and weighted-follow-the-regularized-leader, are proposed, inspired by the OP models. In OP, the ASP difference (i.e., the gap between the ASP achieved by prediction and that achieved with known content popularity) is bounded, while in OL, sub-linear MSE regret and linear ASP regret bounds are obtained. Simulations with the MovieLens dataset verify that the OP methods are better for MSE and ASP difference minimization, while the OL approaches perform well for the minimization of the MSE and ASP regrets.


I. INTRODUCTION
With the continuous development of various intelligent devices such as smart vehicles, smart home appliances, and mobile devices, and of innovative application services of various sizes such as news updates, high-quality video feeds, and software updates, wireless mobile communications have been experiencing an unprecedented traffic surge with a great deal of redundant and repeated information, which limits the capacity of the fronthaul and backhaul links [1]. To lower the redundant traffic, caching has emerged as an effective solution for reducing the peak data rates by pre-fetching the most popular contents in the local cache storage of the base stations (BSs). In recent years, caching at the BS has become increasingly feasible due to the reduced cost and size of memory [2]. In cache-enabled macro-cell networks, heterogeneous networks, D2D networks, etc. [2], given a content library and the respective content popularity (CP) profile, content placement and delivery have been investigated in order to optimize various performance measures such as backhaul latency [3], server load [4], and cache miss rate [5], [6]. For an unknown CP profile, a reinforcement learning approach is presented in [7] for learning the content placement matrix. In [3], femtocaching is modeled as a disjoint set cover problem. However, in practice, the CP profile is time-varying and not known in advance; therefore, it needs to be estimated from past observations of the content requests. Deep-learning-based prediction is employed with large training data in [8], [9]. In [10], an auto-regressive (AR) prediction cache is used to predict the number of requests in the time series. A linear prediction approach is investigated for video segments in [11]. Transfer learning methods are used in [12] by leveraging content correlation and information transfer between time periods.
To learn the CP independently across contents, online policies are presented for cache-awareness in [13], low-complexity video caching in [1], [14], user preference learning in [15], etc. These works are designed for a particular system with a fixed number of BSs and users, i.e., they lack the statistical performance of the network as a whole with respect to content delivery in the physical layer.
In parallel, in the literature [5], [6], [16], geographical caching in a Poisson point process (PPP) network is employed in a multi-cell system to maximize the cache hit rate with respect to the content placement probabilities (CPPs), which represent the availability of the contents at the BSs. Similarly, in [17], the area success probability and the area spectral efficiency are maximized with respect to the CPPs. In these works, the PPP has been a useful tool to assess the performance of a given network. Therefore, it is important to understand how the caching performance varies with respect to time [18]. The above existing works with PPP assume the CP profile to be known or unchanged over time. In practical scenarios, the CP changes dynamically in both the time and space dimensions owing to the randomness of user requests, and it needs to be predicted for efficient caching placements. Therefore, in addition to the PPP analysis, we investigate CP prediction models under dynamic scenarios and their effect on caching, which have not been investigated in this context to the best of the authors' knowledge.

A. Motivation and Contributions
In this paper, for a PPP network where both the BSs and the users are distributed as homogeneous PPPs and the content requests are characterized using a global CP profile, we compute the average success probability (ASP) caching measure as a function of the CPs and CPPs. The ASP is the probability of successful transmission of the content in the physical layer. From a caching perspective, it is a measure of content placement as well as content delivery. Further, to optimize the ASP for a given CP profile, an algorithm is proposed which reduces the ASP maximization with respect to the CPs and CPPs to the prediction of the CPs only. To optimize the ASP for future time slots, online prediction (OP) and online learning (OL) methods are investigated for predicting the CP profile of the next time slot. The prediction performance is measured by the mean squared error (MSE), while caching is evaluated via the ASP. Therefore, the challenge is to find the prediction approach that maximizes the ASP. It is shown at the end of section III that the joint optimization of the ASP and MSE is non-convex and leads to a fixed MSE reduction. Therefore, separate MSE-based prediction approaches are pursued. Towards that, for the OP methods, a linear popularity prediction model (PPM) and a non-linear Grassmannian prediction model (GPM) are proposed, and the respective prediction MSEs and ASP differences are analyzed. The motivation behind using linear prediction is that the parameters controlling the popularity change (such as location, time, etc.) can be modeled using linear predictors [19] and are already present in the past observations. However, a constrained non-negative least squares (CNNLS) optimization must be solved per online round. Therefore, to reduce the computational complexity, OL methods are investigated, which are inspired by PPM and GPM.
In OL methods, weighted follow-the-leader (FTL) and weighted follow-the-regularized-leader (FoReL) are presented, and the corresponding MSE and ASP regret bounds are analyzed. The difference between OP and OL is that OP yields a linear sum of recent past observations for prediction, whereas OL provides a convex sum of all the past ones. In simulations, considering the MovieLens dataset [20], the MSE, ASP, and the respective regrets are verified for both OP and OL methods. The results show that the OP methods are suitable when the MSE and ASP difference are minimized, while for regret minimization, the OL approaches provide better results. The contributions of this paper are summarized as follows:
• For a network with PPP-distributed BSs and users, we find the optimum CPPs that maximize the ASP when the popularity distribution is known. It is shown that there are three kinds of contents, viz., most popular, mid-popular, and least popular. To maximize the ASP, the most popular contents are placed in each cache and the least popular ones should be omitted, while the mid-popular contents need strategic placement proportional to the square root of the content popularity (SCP). We provide the method to find the indices of the contents of these three kinds.
• For a given CP profile, the ASP maximization with respect to the CP and CPPs is reduced to the prediction of the CPs only. Therefore, we start with an intuitive PPM approach. However, the novel use of unconstrained coefficients, which enables the prediction of any positive/negative change in the CP, leads to a CNNLS with an additional sum constraint. This CNNLS is solved by modifying the existing fast-NNLS algorithm, which does not handle constraints other than non-negativity. Further, to improve the ASP, whose optimum value is proportional to the SCP, GPM is proposed to predict the SCP.
• Since the OP methods require solving an optimization problem per online round, to reduce the computational complexity, OL methods (weighted-FTL and weighted-FoReL) are presented in order to minimize the MSE and ASP regrets, respectively. These methods are inspired by PPM and GPM, respectively.
• In OP, bounds on the ASP difference are derived, which is minimized when the CPs (of the mid-popular contents) are close to uniform. The analysis of the regret bounds shows that the OL methods achieve sub-linear MSE regret and linear ASP regret.
• These analyses for both OP and OL methods are verified via simulations on the MovieLens dataset. The ASP and ASP regret are better for GPM and weighted-FoReL, respectively, whereas for the MSE and MSE regret, PPM and weighted-FTL, respectively, provide better performance.

B. Organization
The organization of the paper is as follows: section II describes the system model. In section III, the ASP is maximized. For time-varying popularities, section IV explains the online prediction methods, while section V presents the online learning approaches. Simulation results are provided in section VI. Section VII concludes the paper.

C. Notations
Scalars, vectors, matrices, and sets are represented by lower-case (a), lower-case boldface (a), upper-case boldface (A), and calligraphic (A) letters, respectively. Transpose and Hermitian transpose of matrices are denoted by (·)^T and (·)^†, respectively. The notations ‖·‖_2 (or ‖·‖) and ‖·‖_F denote the l_2 norm and the Frobenius norm, respectively. D(A_i) denotes a block diagonal matrix with the matrices A_i as its block diagonal components.

II. SYSTEM MODEL
We consider the edge caching scenario in a cellular network, where a large number of BSs with limited cache size are spatially distributed according to a two-dimensional (2D) homogeneous PPP. For example, a dense area with small cells and moderate user mobility, a park with a relatively dense crowd and high user mobility, or a stadium with an ultra-dense crowd and relatively very low user mobility are instances of typical scenarios that can benefit from edge caching [15]. Owing to the variations in the density and the mobility of users in different scenarios, the popularity of the contents varies with time.
Let Φ_BS denote the positions of the base stations, which are distributed as a PPP with density λ_bs > 0, as shown in Figure 1. These BSs serve the users, which are also assumed to be distributed as a PPP. By the Slivnyak-Mecke theorem, owing to the stationarity and homogeneity of the PPP, we consider a typical user at the origin for evaluating the performance. Without loss of generality, it is assumed that each content has the same size and each BS has the same storage capacity, storing up to L contents from the content library F := {1, 2, . . . , f, . . . , N}. The content library may change over time and is accessed via the backhaul link. Considering user mobility, the BSs regularly monitor the users during discrete time periods t = 1, 2, . . . , T, where T is the length of the finite time horizon. During a discrete time period, the positions of the users remain unchanged. Let n_{t,f} denote the number of requests for the f-th file during the t-th time period; n_t = Σ_{f=1}^{N} n_{t,f} be the total number of requests; and p_{t,f} = n_{t,f}/n_t represent the popularity of the f-th file. Due to the large number of requests in a time slot, p_{t,f} is assumed to be the true content popularity, unknown in advance, and is obtained by exchanging information among the BSs at the end of time slot t. For convenience, we use p_t = [p_{t,1}, . . . , p_{t,N}]^T to denote the content popularity profile vector, such that p_t^T 1 = 1 and p_t ≥ 0. The cache memory at the BS for the t-th time period is denoted by L_t, which is a subset of F such that |L_t| ≤ L. For simplicity, we adopt the probabilistic (random) placement method, where each content f at time t is stored in the BS with probability q_{t,f} = Pr[f ∈ L_t], ∀f ∈ F [5].
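As a concrete illustration of how a BS could form the profile p_t at the end of slot t, the sketch below normalizes per-file request counts into a popularity vector; the function name, the fallback for an empty slot, and the example counts are illustrative assumptions, not part of the paper's model.

```python
def popularity_profile(request_counts):
    """Normalize per-file request counts n_{t,f} into a popularity
    vector p_t with p_t >= 0 and sum(p_t) == 1."""
    total = sum(request_counts)
    if total == 0:
        # No requests observed in this slot: fall back to uniform.
        n = len(request_counts)
        return [1.0 / n] * n
    return [count / total for count in request_counts]

# Example: five files with hypothetical request counts in one slot.
p_t = popularity_profile([40, 25, 20, 10, 5])
```

The resulting vector satisfies p_t^T 1 = 1 and p_t ≥ 0 by construction.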
The probability that, in the t-th time period, a typical user finds the desired content in the cache depends on the distribution of the contents in the random set L_t through the one-set coverage probabilities q_{t,f}, collected in vector form as q_t = [q_{t,1}, . . . , q_{t,N}]^T. These probabilities satisfy the cache constraint Σ_{f=1}^{N} q_{t,f} ≤ L, ∀t. We consider the association of a BS to a user based on both the channel state information (CSI) and the cached files in the BSs. Specifically, when a user requests the f-th file, it associates with the BS that has the required file and the strongest received power. The chance that the required file at time t is available in a BS's cache is given by q_{t,f}. If a file is not available in any of the caches at the BSs, it is considered a failure event and the required file must be fetched via the backhaul link. Let Φ_BS(f) denote the thinned PPP of the BSs whose caches have the f-th file. The associated k-th BS transmits the f-th file to the typical user with power P over a Rayleigh fading channel, denoted by h_k. At the typical user, the received signal is given as

y[t] = √P h_k r_k^{−α/2} x_{f,k}[t] + Σ_{i∈Φ_BS(f)\{k}} √P h_i r_i^{−α/2} x_i[t] + Σ_{j∈Φ_BS\Φ_BS(f)} √P h_j r_j^{−α/2} x_j[t] + w[t],

where x_{f,k}[t] is the transmitted symbol of the f-th file from the k-th BS, r_i is the distance of the i-th BS from the typical user, w[t] ∼ CN(0, σ^2) is the additive Gaussian noise, and α is the path loss exponent. The first term in the above equation corresponds to the desired signal, the second term pertains to the interfering transmissions from the other BSs (Φ_BS(f) \ {k}) having the f-th file and transmitting it to other users, and the last term is for the interfering signals from the BSs that do not have the f-th file in their caches. For the above received signal model, the downlink signal-to-interference-plus-noise ratio (SINR) at the typical user can be given as

SINR_f = P |h_k|^2 r_k^{−α} / (I_f + σ^2),

where I_f = Σ_{i∈Φ_BS\{k}} P |h_i|^2 r_i^{−α} represents the received interference power. Due to the concurrent transmissions in the PPP network, where the interference terms are dominant, it is essential to ensure the successful reception of the f-th file.
Therefore, from the user's perspective, to maintain the quality of service and measure the caching performance, we consider the average success probability, defined as the probability that the achievable rate of a typical user exceeds the rate requirement R_0. The ASP can be written as

P_s(p_t, q_t) = Σ_{f=1}^{N} p_{t,f} Pr[W log_2(1 + SINR_f) > R_0],

where W is the transmission bandwidth.
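The ASP above can also be estimated empirically. The following Monte Carlo sketch drops BSs as a PPP in a disc around the typical user, thins them by the caching probabilities, associates the user with the strongest BS that caches the requested file, and declares success when the SIR exceeds a threshold theta (interference-limited case, σ² = 0). All parameter values, the minimum-distance guard, and the direct SIR threshold are simplifying assumptions for illustration, not the paper's exact setup.

```python
import math
import random

def poisson_sample(rng, mean):
    """Knuth's method for Poisson sampling; adequate for small means."""
    limit, k, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def simulate_asp(p, q, bs_density=1e-4, alpha=3.5, theta=1.0,
                 radius=200.0, trials=3000, seed=1):
    """Monte Carlo sketch of the interference-limited ASP: the typical
    user at the origin requests file f ~ p; each PPP BS independently
    caches f with probability q[f]; the user associates with the
    strongest caching BS; success when SIR > theta."""
    rng = random.Random(seed)
    cum, s = [], 0.0
    for pf in p:                      # cumulative distribution of p
        s += pf
        cum.append(s)
    mean_bs = bs_density * math.pi * radius ** 2
    successes = 0
    for _ in range(trials):
        u = rng.random() * cum[-1]
        f = next(i for i, c in enumerate(cum) if u <= c)
        signal, interference = 0.0, 0.0
        for _ in range(poisson_sample(rng, mean_bs)):
            # Uniform point in the disc; 1 m minimum-distance guard.
            r = max(1.0, radius * math.sqrt(rng.random()))
            # Rayleigh fading => exponential power, times path loss.
            power = rng.expovariate(1.0) * r ** (-alpha)
            if rng.random() < q[f] and power > signal:
                interference += signal  # previous best now interferes
                signal = power
            else:
                interference += power
        if signal > 0.0 and signal > theta * interference:
            successes += 1
    return successes / trials
```

Setting q to all ones or all zeros recovers the two extremes: caching everywhere gives a strictly positive ASP, while caching nowhere gives zero, since no BS can serve the request.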
In the above formulation, the popularity profile (p_t) denotes the global popularity across all the BSs, and the caching probabilities (q_t) represent the probabilistic status of the caches at the BSs in the network. In practice, the cache placement at time t + 1, which depends on the caching probabilities (q_{t+1}), is decided by the present content popularity (p_t). However, the success probability at time t + 1 depends on the popularity in time period t + 1. Therefore, it is essential to accurately predict the future content popularity in order to maximize the success probability.
As the BSs monitor the mobility of the users and track the popularities, we assume that the future popularity at time t + 1 is related to the past ones by the relation

p_{t+1} = ψ_t(p_1, . . . , p_t),

where ψ_t is an unknown function. The corresponding optimum cache placement probabilities can be computed as some function of the present and past content popularities as

q_{t+1} = φ_t(p_{t+1}; p_1, . . . , p_t).

Our objective in this paper is to maximize the ASP at time t + 1 with respect to the content popularity profile and the caching probabilities subject to (6), (7), where at time t + 1, the ideal ASP can be determined using the popularity profile p_{t+1} and the respective caching probabilities q_{t+1}, if both are known perfectly in advance, which is not possible in practice. Let p̂_{t+1} = ψ̂_t(p_1, . . . , p_t) and q̂_{t+1} = φ̂_t(p̂_{t+1}; p_1, . . . , p_t) be the estimated future popularity and the corresponding placement probabilities. Therefore, we choose to maximize the estimated ASP P_s(p̂_{t+1}, q̂_{t+1}). At time t + 1, where the true content popularity is p_{t+1}, the achievable ASP can be given as P_s(p_{t+1}, q̂_{t+1}(p̂_{t+1})). To measure the discrepancy in the prediction and the optimization, we define the MSE, the observed ASP difference, and the expected ASP difference, respectively, as

MSE_t = E[‖p_{t+1} − p̂_{t+1}‖^2],
Δ̄_{t+1} = P_s(p_{t+1}, q_{t+1}) − P_s(p_{t+1}, q̂_{t+1}),
Δ_{t+1} = P_s(p_{t+1}, q_{t+1}) − P_s(p̂_{t+1}, q̂_{t+1}).

In the above, the observed ASP difference is the ASP difference measured with respect to the true popularity, while the expected one is related to the predicted ASP. The utility of the former is to analyze the caching theoretically, while the latter is useful for the prediction as well. Since the placement probabilities are functions of the content popularities, in the following we first seek the optimum caching probabilities given the content profile. Subsequently, we focus on the prediction, employing two classes of methods, viz., OP and OL.

III. AVERAGE SUCCESS PROBABILITY (ASP) MAXIMIZATION
In a given time slot, the placement probabilities (q_t) depend on the content popularities (p_t). Therefore, for a given CP profile, we find the optimal placement probabilities that maximize the ASP. Towards this, the following result presents the ASP expression in terms of p_t and q_t. For clarity of notation, we drop the subscript t in this section.
Theorem 1. The average success probability of a typical user requesting the f-th file with popularity p_f and caching probability q_f is given as
Proof: The proof is given in Appendix-A.
Since heterogeneous networks are usually interference limited, it is reasonable to neglect the noise, i.e., σ^2 = 0. For this case, the corollary below simplifies the ASP.
Corollary 2. For the interference-limited case, i.e., σ^2 = 0 or at high SNR, the ASP is simplified as

Now, given the content popularity profile (p), the next step is to compute the placement probabilities that maximize the ASP expression above. The ASP maximization problem can be expressed as the maximization of the ASP subject to q^T 1 ≤ L, which is a simplified version of the problem in (8) and (9). This problem is convex; however, to be solvable in the CVX tool [22], it can be cast as a semi-definite program (SDP), simplified using Schur's lemma [22]. Analytically, the expression of the solution is presented in the following theorem.

Theorem 3. The solution of the maximization problem in (14) is given as

The corresponding ASP is obtained as

where a_P denotes the sub-vector of a with the entries indexed by the set P. The above expression shows that when the popularity of the f-th content is high (p_f → 1), i.e., for the most popular contents, its caching probability is one, i.e., the f-th content should be stored in each BS's cache. On the other hand, when the contents are least popular (p_f → 0), it is reasonable not to cache them in any of the cache storages. For the mid-popular contents with 0 < p_f < 1, the expression of q*_f depends on the index sets (Z and R), which can be obtained from the KKT conditions in the proof of Theorem 3. The procedure to obtain these sets is presented in Algorithm 1. In the first part of the algorithm, the set R is obtained by individually checking the f-th content popularity for q*_f = 1 in decreasing order of popularity, i.e., by checking v_f > 0. In the latter part, similarly, checking the remaining content popularities in ascending order for q*_f = 0, i.e., for w_f < 0, yields the set Z. Now, with the caching probabilities obtained for a PPP network, a random caching strategy is utilized to place the contents in the individual caches [5]. Thus, with the caching probabilities known as functions of the content popularities, the ideal, estimated, and achievable ASPs can be given as P_s(p_{t+1}) = P_s(p_{t+1}, q_{t+1}(p_{t+1})), P̂_s(p̂_{t+1}) = P_s(p̂_{t+1}, q̂_{t+1}(p̂_{t+1})), and P_s(p_{t+1}, q̂_{t+1}(p̂_{t+1})), respectively. The ASP maximization problem (P0) simplifies to P_s(p_{t+1}), as p_{t+1} is known in (P0). Similarly, (P1) can be recast as the maximization of the future ASP based on the data up to the present content popularities. Therefore, it is important to accurately predict the future CP in order to cache contents in advance. Based on the past content popularities, we employ two classes of methods to obtain accurate predictions and further optimize the ASP.
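The three-tier placement structure of Theorem 3 (cache the most popular contents everywhere, omit the least popular, and place the mid-popular ones in proportion to the SCP) can be sketched as a water-filling-style bisection on a common multiplier of √p_f. The exact KKT bookkeeping of Algorithm 1 is omitted, so this is an illustrative approximation only, assuming L ≤ N.

```python
import math

def caching_probabilities(p, L, iters=60):
    """Sketch of the Theorem-3 structure: clip a common multiple of
    sqrt(p_f) into [0, 1] and bisect the multiplier nu so that the cache
    constraint sum_f q_f = L is met. The exact solution additionally
    tracks the index sets R (q=1) and Z (q=0) via Algorithm 1."""
    roots = [math.sqrt(pf) for pf in p]

    def total(nu):
        # Total cache usage for a given multiplier, with clipping at 1.
        return sum(min(1.0, nu * r) for r in roots)

    lo, hi = 0.0, 1.0
    while total(hi) < L and hi < 1e12:   # grow the bracket if needed
        hi *= 2.0
    for _ in range(iters):               # bisect on the multiplier
        mid = 0.5 * (lo + hi)
        if total(mid) < L:
            lo = mid
        else:
            hi = mid
    nu = 0.5 * (lo + hi)
    return [min(1.0, nu * r) for r in roots]
```

By construction, more popular files receive larger caching probabilities, the budget Σ_f q_f = L is met, and sufficiently popular files saturate at q_f = 1.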
For each of the classes, we propose two models, based on PPM and GPM. The motivation to use these is as follows. To maximize the ASP, the prediction should be as close as possible to the ground truth, i.e., the mean squared error of the popularities should be minimized, which leads to PPM. Further, from Theorem 3, it can be observed that placement proportional to the square root of the content popularities maximizes the ASP. This observation leads to GPM, which is presented in detail in the next section.
Remark (Reliability assumption for analysis): Let P_{t+1} and P̂_{t+1} denote the sets of indices for the CPPs obtained by maximizing the ASP for the ideal CP p_{t+1} and the estimated CP p̂_{t+1}, respectively. Under the reliability assumption for analysis, we assume that the ASP maximization under the estimated CP leads to the same set of indices as with the ideal ASP. This is a reasonable assumption, since with sufficient past CP observations and the respective CPPs, the index sets can be precisely estimated, i.e., whether or not to cache the f-th file at time t + 1 on at least one of the BSs. Therefore, we set P_{t+1} = P̂_{t+1} and R_{t+1} = R̂_{t+1}.
Remark (Feasibility of the joint ASP and MSE optimization): The combined optimization problem for maximizing the ASP in edge caching can be cast in its most favorable form as

where the objective function is not convex and cannot be reduced to a linear matrix inequality (LMI). Moreover, since p_i ≤ 1, the MSE constraint reduces to the correlation constraint Σ_{k=1}^{d} c_k p_{i−k}^T p_i ≥ 1 − ε/2, ∀i, which is linear and can be solved trivially if sufficient CP observations (τ > d) are provided. However, the solution c_k, which satisfies the linear constraint with equality, cannot improve the correlation beyond 1 − ε/2, or the MSE below the constraint ε. Therefore, for these reasons, separate MSE-based prediction approaches are proposed to maximize the ASP.

IV. ONLINE PREDICTION MODELS
In this section, two models are presented. First, a linear model is fitted on the past CP observations, and the problem of obtaining the regression coefficients is modeled as constrained non-negative least squares (CNNLS) with an additional sum constraint. Then, for the GPM, the regression problem is formulated as a regularized CNNLS. With these models, the observed and the expected ASP differences are analyzed.

A. Popularity Prediction Model (PPM)
In this model, we approximate the present content popularity vector at time t by a linear sum of the content popularities of the past, up to time t − 1, as

p_t ≈ Σ_{k=1}^{d} c_{t,k} p_{t−k},    (21)

where c_{t,k} ∈ R for all k = 1, . . . , d are the prediction coefficients such that p_t^T 1 = Σ_{k=1}^{d} c_{t,k} = 1, and d is the order of the prediction. Note that allowing c_{t,k} ∈ R is essential for proper and accurate prediction. If c_{t,k} is restricted to non-negative values (c_{t,k} ≥ 0), equation (21) becomes a convex sum, which means that variations of the f-th content popularity at time t beyond [min_i p_{f,i}, max_i p_{f,i}] cannot be predicted, ∀f ∈ F. Therefore, with c_{t,k} ∈ R and the τ known popularity observations p_{f,(t−τ+1)}, . . . , p_{f,t}, ∀f, the future content popularity estimate can be obtained as

p̂_{t+1} = Σ_{k=1}^{d} c_{t,k} p_{t+1−k},

where the coefficients are given by the least squares problem subject to the non-negativity constraint on the future estimate as

The above problem is convex and can be numerically solved using any convex solver such as CVX [22]. However, in practice, a more efficient solution can be obtained by casting the above problem as a constrained NNLS, whose solution is given in Appendix-C. NNLS without the additional constraint has been solved using the active-set method via the fast NNLS (FNNLS) algorithm in [23], [24]. Therefore, with the constraint, we modify the FNNLS algorithm using the Karush-Kuhn-Tucker (KKT) conditions. The complexity of this method is related to the least squares solution and the number of observations (τ), i.e., O(d^3 Nτ).
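A minimal sketch of the sum-constrained least squares at the heart of PPM is shown below: the sum-to-one constraint is handled by a bordered (KKT) linear system, while the non-negativity of the prediction, which the paper enforces through a modified FNNLS, is omitted. The function names and the flattening of the per-file regressions into one stacked system are illustrative assumptions.

```python
def solve(M, v):
    """Plain Gaussian elimination with partial pivoting."""
    n = len(v)
    A = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            fac = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= fac * A[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][k] * x[k]
                              for k in range(r + 1, n))) / A[r][r]
    return x

def ppm_coefficients(history, d):
    """Fit p_i ~ sum_k c_k p_{i-k} over a list of popularity vectors,
    enforcing 1^T c = 1 via a Lagrange multiplier (bordered normal
    equations). The active-set step that also enforces non-negativity
    of the resulting prediction (modified FNNLS) is omitted here."""
    rows, targets = [], []
    for i in range(d, len(history)):
        for f in range(len(history[0])):
            rows.append([history[i - k][f] for k in range(1, d + 1)])
            targets.append(history[i][f])
    # Normal equations bordered with the sum constraint (KKT system).
    G = [[sum(r[a] * r[b] for r in rows) for b in range(d)] + [1.0]
         for a in range(d)]
    G.append([1.0] * d + [0.0])
    rhs = [sum(r[a] * t for r, t in zip(rows, targets))
           for a in range(d)] + [1.0]
    return solve(G, rhs)[:d]
```

For instance, on a history that alternates between two profiles, the fit recovers c = (0, 1), i.e., the period-2 recurrence, and the coefficients sum to one as required.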
The above model aims to minimize the MSE of the content popularity based on previous popularity data. However, it does not contribute actively to the ASP maximization. The observed difference between the ideal ASP and the achievable ASP is expressed as

whose proof is given in Appendix-D. In the best case, this difference is minimized when the distribution over the subset of the library (P_{t+1}) is uniform, where p̂_{f,t+1}/(1^T p̂_{t+1,P_{t+1}}) represents the distribution function over f ∈ P_{t+1}. This results in P_s(p_{t+1}) − P_s(p_{t+1}, q̂_{t+1}) ≤ 0.
Let p̂_t = p_t + e_t, where e_t is a random error vector and p_t is defined by (21). Thus, we use the first-order approximation √p̂_{f,t} ≈ √p_{f,t} + e_{f,t}/(2√p_{f,t}). Similarly, the expected difference between the ideal ASP and the estimated ASP can be defined from (19) as

From the observed and the expected differences above, it can be seen that the ASP difference for PPM is composed of a first-order prediction error term and a squared-difference term. The first error term, which is a random vector with zero mean, can be minimized by using sufficient observations (τ). The second term corresponds to the difference between the squared sums of the square-root popularities, which cannot be reduced to zero with the current model, as PPM is not tailored to minimize the prediction error between square roots. Therefore, in the following, we investigate a prediction model which considers the square root of the popularities.

B. Grassmannian Prediction Model (GPM)
From the optimization problem in (18), it can be observed that caching probabilities proportional to the positive square root of the content popularities maximize the ASP. The positive square root of the content popularity vector (√p) represents a line in the Grassmannian manifold G_{N,1} [25], [26]. Similar to PPM, here we model the current SCP vector as a linear sum of the previous d SCP vectors as

√p_t ≈ Σ_{k=1}^{d} z_k √p_{t−k},    (28)

where z_k ∈ R, ∀k, are the coefficients, which are used to predict the future estimate of the SCP as

√p̂_{t+1} = Σ_{k=1}^{d} z_k √p_{t+1−k}.

These coefficients can be obtained from the least squares minimization subject to the regularization constraint ‖√p̂_t‖ ≤ 1 as

{z_k, ∀k} = arg min

which is a convex optimization problem and can be solved using the CVX tool. However, for an efficient solution, the above problem can be formulated as a constrained NNLS, as in the previous subsection, and solved similarly to Appendix-C. To avoid redundancy, we omit the details.
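Given fitted coefficients, the GPM prediction step can be sketched as follows: combine the last d SCP vectors, clip negative entries, then square and renormalize so that a valid popularity profile is returned. The clipping and renormalization here are simplifying assumptions standing in for the regularized CNNLS solution, and the function name is illustrative.

```python
import math

def gpm_predict(history, coeffs):
    """GPM-style prediction sketch: linearly combine the last d SCP
    (square-root popularity) vectors with the given coefficients z_k,
    then square and renormalize to return a popularity profile."""
    d = len(coeffs)
    n = len(history[-1])
    # Linear combination of the last d SCP vectors (k = 0 is time t).
    scp = [sum(coeffs[k] * math.sqrt(history[-1 - k][f])
               for k in range(d)) for f in range(n)]
    scp = [max(0.0, s) for s in scp]        # clip negative entries
    total = sum(s * s for s in scp)         # renormalize the squares
    return [s * s / total for s in scp]
```

With a single past profile and a unit coefficient, the prediction reproduces that profile exactly, since squaring undoes the square root and the profile already sums to one.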
In the above, the SCP prediction is intended to maximize the ASP. It can be seen from the observed ASP difference in (25) that, as the estimation error of the SCP vectors is improved, the ASP difference decreases, i.e., the achievable ASP of GPM is better than that of PPM. Towards this, let √p̂_t = √p_t + ē_t, where ē_t is a random error vector, and approximate p̂_{f,t} ≈ p_{f,t} + e_{f,t} with e_{f,t} = 2√p_{f,t} ē_{f,t}. For the GPM described by (28), the expected difference using (26) can similarly be written as

which consists of two error terms. Comparing the above equation with (26) for PPM, it can be observed that, in the first term, the error e_{f,t} is larger than that of PPM, while for the second term, ē_{f,t} is lower than ẽ_{f,t}. Together, this shows that GPM improves the ASP over PPM.
Remark: In the OP models, each round requires solving an independent optimization problem given the τ previous observations, i.e., there is no learning. Therefore, in each round, the resultant MSE is approximately similar, i.e., the regret measure is not essential in this case.

V. ONLINE LEARNING MODELS
In the above OP methods, a least squares optimization needs to be solved per online round, which can be computationally intensive for a large content library. Therefore, to reduce the cost further, we present OL methods using the weighted follow-the-leader (FTL) and weighted follow-the-regularized-leader (FoReL) approaches.

A. Weighted FTL
In the FTL approach, the CP estimate for time t + 1 is obtained by minimizing the weighted sum of l_2 losses up to time t as

p̂_{t+1} = arg min_p Σ_{i=1}^{t} w_i ‖p − p_i‖^2,

where w_i ≥ 0 for i = 1, . . . , t are the weights such that Σ_{i=1}^{t} w_i = 1, ensuring that the sum of the predicted CPs is one. If all the past CPs are equally important in the learning, the trivial value w_i = 1/t, 1 ≤ i ≤ t, can be selected for the t-th online round. However, in general, if the recent CPs should dominate the prediction, one can choose w_i = κ_t a^{t−i}, 1 ≤ i ≤ t, where κ_t is set to satisfy the sum constraint, κ_t = (1 − a)/(1 − a^t). This yields the weight at the t-th round to be w_t = (1 − a)/(1 − a^t). As a → 1, the trivial selection is recovered. The value of 0 < a ≤ 1 can be set according to the preference for the recent or the past observations. The solution of the above optimization leads to the prediction

p̂_{t+1} = Σ_{i=1}^{t} w_i p_i,

which is a weighted sum of the previously observed CPs. This prediction at time t consists of a balance between the observed CP and the predicted CP at time t, i.e., it forms a convex sum. This is in contrast to OP, where a linear sum is considered. It means that FTL can only predict what has been observed in the past. The step-wise procedure for the CP prediction is listed in Algorithm 2. In the t-th round, the prediction is obtained and the CP is observed.
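The closed-form weighted-FTL prediction, i.e., the geometrically weighted average of all past CP profiles with w_i = κ_t a^{t−i}, can be sketched as follows (the function name is illustrative):

```python
def weighted_ftl_predict(history, a=0.9):
    """Weighted-FTL sketch: the minimizer of the weighted sum of
    squared losses is the weighted average of all past popularity
    profiles, with geometric weights w_i proportional to a^(t-i),
    normalized so that they sum to one."""
    t = len(history)
    n = len(history[0])
    weights = [a ** (t - i) for i in range(1, t + 1)]
    kappa = 1.0 / sum(weights)      # equals (1-a)/(1-a^t) for a < 1
    pred = [0.0] * n
    for w, p in zip(weights, history):
        for f in range(n):
            pred[f] += kappa * w * p[f]
    return pred
```

With a = 1 the weights are uniform and the prediction reduces to the plain mean of the past profiles; smaller a emphasizes recent observations. Because each past profile sums to one and the weights form a convex combination, the prediction always sums to one.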

Proof: Equation (34) can be rewritten as

p̂_{t+1} = (1 − w_t) p̂_t + w_t p_t.

Therefore, the regret with respect to the CP (p) can be bounded as

where (a) follows from [27, Lem. 2.1]; (b) comes from (35); (c) is obtained by ignoring −w_t^2, as w_t is small; and, considering ‖p‖ = ‖p_t‖ ≤ 1, the last inequality arises from the triangle inequality, ‖p − p_t‖ ≤ 2. We therefore obtain the upper bound on the total regret as

Similarly, for a given CP p, the regret in terms of the ASP can be defined as

which is O(T), as can be seen as follows. The expected ASP difference ∆_t can be approximated by the observed ASP difference in (25), which is independent of t.

B. Weighted FoReL
Analogous to GPM in the OP methods, where the difference between SCPs is minimized, in the FoReL approach we consider the chordal distance as the loss measure of the prediction error, since the SCP lies on the Grassmannian manifold. The chordal distance is defined through the principal angle between two unit-norm vectors p̄_1 and p̄_2, i.e., d_c^2(p̄_1, p̄_2) = sin^2 θ_12 = 1 − |p̄_1^† p̄_2|^2. Minimizing the chordal distance is equivalent to maximizing the correlation, i.e., the cross product cos θ_12 = p̄_1^† p̄_2 > 0. Therefore, the prediction problem with respect to the SCP norm constraint (the regularization measure in FoReL terminology) can be expressed as

where we choose w_i = κ_t a^{t−i} with a ∈ (0, 1], similar to FTL, and it leads to the simplification

p̄̂_{t+1} = w̄_{t,t} p̄_t + w̄_{t,t−1} p̄̂_t,

where, using the triangle inequality, ‖Σ_i w_i p̄_i‖ ≤ 1, yielding w̄_{t,t}/w̄_{t,t−1} ≥ 1 − w̄_{t,t}. To see that the above is a convex sum, the weight w̄_{t,t} should be less than 1. Thus, w̄_{t,t} is written as the inverse of the norm of the converging weighted sum. In the worst case, this norm takes its lowest value when p̄_i is uniform, i.e., for p̄_{f,t} = 1/√N, ∀f. Thus, w̄_{t,t} ≤ 1, where equality is obtained for t = 1. The corresponding online learning procedure is presented in Algorithm 3. In this procedure, two intermediate variables (k_t and p̄_{t+1}) are introduced to simplify the computations. The rest of the process is the same as in the weighted FTL algorithm. The respective regret is analyzed in the following result.
Proof: The CP estimate can be simplified as

It is used to obtain the regret with respect to the CP (p) as

where in (a) we approximate the regret of the CP by the regret of the SCP; in (b), the result from [27, Lem. 2.1] is used; in (c), −w̄_{t,t}^2 is ignored, as w̄_{t,t} ≤ 1; and in (d), the triangle inequality has been used with ‖p‖ = ‖p_t‖ = 1. The total regret up to time T can then be obtained by summing over t, where the equality of the upper bound on w̄_{t,t} holds for p̄_{f,t} = 1/√N, ∀f. (The norm minimization that yields this worst case has more unknowns than knowns; relaxing the norm constraint, since the final vector can be normalized, reduces the problem to an unconstrained one, which after normalization gives the uniform distribution p̄_{f,t} = 1/√N.) This gives O(log T) regret.
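A simplified sketch of the weighted-FoReL update follows: accumulate geometrically weighted SCP vectors, normalize the accumulator to unit l_2 norm (the role of the regularizer), and square entrywise to recover a popularity profile. This collapses the recursive bookkeeping of Algorithm 3 into a batch form, which is an assumption made for illustration.

```python
import math

def weighted_forel_predict(history, a=0.9):
    """Weighted-FoReL sketch: combine geometrically weighted SCP
    (square-root popularity) vectors, normalize the sum to unit l2
    norm, and square entrywise, tracking the popularity on the
    Grassmannian rather than directly in the simplex."""
    t = len(history)
    n = len(history[0])
    acc = [0.0] * n
    for i, p in enumerate(history, start=1):
        w = a ** (t - i)                  # geometric weighting
        for f in range(n):
            acc[f] += w * math.sqrt(p[f])
    norm = math.sqrt(sum(x * x for x in acc))
    return [(x / norm) ** 2 for x in acc]
```

Since the accumulator is normalized before squaring, the output always sums to one; with a single observed profile, the prediction reproduces it exactly.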
Remark (OP vs. OL): The OP methods work via linear prediction over the given $\tau$ observations, while the OL methods improve the estimate with experience. In the OP methods, the predictor coefficients are unconstrained; the same cannot be assumed for the OL methods due to the objective of weighted MSE minimization. Consequently, OP yields a linear sum of the recent past CPs, while OL provides a convex sum of all the past CPs. These distinctions suggest that OP can track wide changes in content popularity, while OL can track changes only within the convex hull of the past observations.
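The contrast drawn in the remark can be sketched with synthetic data. The snippet below is an illustrative toy (the random profiles, window length, and weight schedule are assumptions, not the paper's setup): the OP-style predictor fits unconstrained coefficients over the last $\tau$ profiles by least squares, while the OL-style predictor forms a convex (non-negative, normalized) combination of all past profiles with geometric weights $a^{t-i}$.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, tau, a = 30, 5, 4, 0.9

# synthetic popularity profiles (each row sums to 1), purely illustrative
P = rng.random((T, N))
P /= P.sum(axis=1, keepdims=True)

# OP-style: unconstrained linear combination of the last tau profiles,
# scalar coefficients fit by least squares on one-step-ahead history
X, Y = [], []
for t in range(tau, T - 1):
    X.append(P[t - tau:t].T)   # N x tau block of past profiles
    Y.append(P[t])             # the profile they should predict
A = np.vstack(X)
b = np.concatenate(Y)
c, *_ = np.linalg.lstsq(A, b, rcond=None)    # coefficients may be negative
op_pred = P[T - 1 - tau:T - 1].T @ c         # linear, possibly non-convex sum

# OL-style: convex combination of ALL past profiles with geometric weights
w = a ** np.arange(T - 2, -1, -1)   # most recent profile gets weight a^0
w /= w.sum()                        # normalization makes the sum convex
ol_pred = w @ P[:T - 1]
```

Note that `ol_pred` is guaranteed to remain a valid distribution (non-negative, summing to 1), whereas `op_pred` is not, which mirrors the remark: OP can extrapolate outside the convex hull of past CPs, OL cannot.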

VI. SIMULATION RESULTS
To evaluate the performance of the proposed prediction methods with the optimized edge caching policy, we use the MovieLens dataset [20]. From this dataset, we choose the user ratings of $N = 100$ movies with IDs 1-100. Using the provided timestamps, the whole duration is divided into time slots to simulate the content request process. The number of user ratings of a movie is taken as the number of requests for that movie, and the popularity profile for each time slot is obtained by normalizing these counts across the movies. For the OP methods, $d = 4$ and $\tau = 10$ are selected, while for OL, $a = 1$ is chosen. The PPP parameters for computing the ASP are as follows: noise power $\sigma^2 = 0$, BS density $\lambda_{BS} = 200$, bandwidth $W = 24$ kHz, path loss exponent $\alpha = 3.5$, rate threshold $R_0 = 1$, and BS cache size $L = N/2 = 50$. The performance of the OP and OL methods is compared with the request prediction method (OP-AR) [10] and mean guessing [15], [21]. In OP-AR, the logarithm of the number of requests is modeled by an auto-regressive (AR) process, and least squares is used to find the prediction coefficients. In mean guessing, the mean of the previous $\tau$ observations is used as the prediction.
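The two baselines can be sketched as follows. This is a minimal illustration on synthetic request counts (the data, lag order, and variable names are assumptions, not the MovieLens pipeline): mean guessing averages the last $\tau$ observations, while OP-AR fits an AR model to the logarithm of the requests by least squares and back-transforms the prediction.

```python
import numpy as np

rng = np.random.default_rng(1)
tau, d = 10, 4
requests = rng.integers(1, 100, size=50).astype(float)  # synthetic counts

# Mean guessing: predict the next value as the mean of the last tau samples
mean_guess = requests[-tau:].mean()

# OP-AR baseline (cf. [10]): AR model on the *logarithm* of the requests,
# with coefficients obtained by least squares over lagged windows
x = np.log(requests)
A = np.column_stack([x[i:len(x) - d + i] for i in range(d)])  # lag matrix
b = x[d:]
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
ar_pred = np.exp(x[-d:] @ coef)   # back-transform the log-domain prediction
```

The exponential back-transform is what makes OP-AR scale-sensitive; this is consistent with the later observation that its error grows when predictions are not normalized by the library size.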
The above methods predict the CPs with errors on the order of $10^{-2}$. For comparison, Figure 2 depicts the averaged prediction MSE and the averaged expected ASP difference (dASP) for all methods. It can be observed that, for both MSE and dASP, OP-GPM and OP-PPM yield better results than OP-AR, mean guessing, and the OL methods. This is because each online round of OP-PPM and OP-GPM explicitly minimizes the MSE of the CP estimate to obtain the prediction, while the other methods do not. Among the OL methods, OL-FTL provides better MSE while OL-FoReL has better dASP, since OL-FTL considers the CP in its formulation whereas OL-FoReL is SCP based. The curves of mean guessing lie between those of the OP and OL methods, because mean guessing has similarities to both: it is equivalent to a trivial OP or a sub-optimal OL-FTL. OP-AR can be seen to approximate OP-GPM for both MSE and dASP. Note that OP predicts from the past $\tau$ observations, while OL utilizes the entire past. Therefore, OP performs well on the instantaneous MSE, while OL improves the cumulative MSE, i.e., the regret, as shown in Figure 3. Figures 3(a) and 3(b) show the MSE and dASP regret, respectively. Unlike Figure 2, OL-FTL yields better MSE regret than the OP methods and mean guessing, while OL-FoReL provides better dASP regret, by construction. The MSE regret can be seen to be $O(\log T)$, while the dASP regret is $O(T)$, as derived in the previous sections. Mean guessing again falls between the OP and OL methods for both regrets, being a sub-optimal FTL as well as a trivial OP, and OP-AR approximates OP-GPM as in Figure 2. Figure 4 shows the variation of the expected dASP with the cache size constraint; here the MSE remains constant since it is independent of the caching scheme. It can be observed that as the cache size increases, the dASP decreases, i.e., the achievable ASP increases.
The trend across methods is similar to Figure 2, i.e., OP-GPM yields the minimum dASP, and so on. For larger cache sizes, the dASP gap between the methods closes, i.e., the dASP converges to zero as $L \to N$. The expected dASP in (25) is inversely proportional to $L$ (through $\eta$), i.e., dASP $\propto L^{-1} \implies \log(\text{dASP}) \propto -\log L$, which is the negative proportionality visible in Fig. 4. In the final experiment, the content library size $N$ increases while the cache size is kept proportional to the library size, i.e., $L = N/2$. It can be observed that both the MSE and dASP decrease with $N$ for all methods except OP-AR, which increases because its prediction is done without normalization by $N$. The trend of the performance curves is similar to Figure 2, with OP-GPM performing best in both the MSE and ASP measures.
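The stated inverse proportionality dASP $\propto L^{-1}$ implies a slope of $-1$ on a log-log plot. A quick numerical check, with an arbitrary illustrative constant (the cache sizes and constant below are not from the paper):

```python
import numpy as np

L = np.array([10.0, 20.0, 40.0, 80.0])   # cache sizes
c = 3.7                                   # arbitrary proportionality constant
dasp = c / L                              # dASP proportional to 1/L

# fit log(dASP) against log(L); the slope should be exactly -1
slope = np.polyfit(np.log(L), np.log(dasp), 1)[0]
```

This is the straight line of slope $-1$ that Fig. 4 shows on logarithmic axes.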

VII. CONCLUSION
In this paper, online prediction (PPM and GPM) and online learning (weighted-FTL and weighted-FoReL) methods have been investigated. First, for a given popularity profile, the caching probabilities have been optimized to maximize the ASP of the PPP-based network. In PPM, a linear model is used to predict the popularities by minimizing the MSE, and the resulting ASP difference has been analyzed. In GPM, the future SCP is predicted on the Grassmannian manifold.

A. ASP Derivation
The CCDF of the sum rate is obtained by first conditioning on the serving BS. Next, the expectation over the distances $r_j$, $j \in \Phi_{bs}(l)\setminus\{i\}$, of the interfering BSs is taken [28]. The resultant expression from (55) can then be written in closed form, where $C = \pi\lambda_{bs}$.

B. Proof of ASP maximization
Proof: The ASP optimization problem is first recast in standard form, and the corresponding Karush-Kuhn-Tucker (KKT) conditions are obtained, where $\lambda \ge 0$, $v_f \ge 0$, $w_f \le 0\ \forall f$ are the dual variables and $g(q) = \frac{BC}{[B + q(A + C - B)]^2}$. Simplifying the stationarity conditions yields the partition $\mathcal{F} = \mathcal{Z} \cup \mathcal{P} \cup \mathcal{R}$, with $\mathcal{Z} = \{f \mid v_f = 0, w_f < 0, q_f = 0\}$, $\mathcal{P} = \{f \mid v_f = w_f = 0, 0 < q_f < 1\}$, and $\mathcal{R} = \{f \mid v_f > 0, w_f = 0, q_f = 1\}$ being the sets of indices of $q_f$ with zero, fractional, and unit values, respectively; for each set, the stationarity condition specializes accordingly. Finally, the objective function simplifies to (77).
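The KKT solution above partitions the contents into the sets $\mathcal{Z}$, $\mathcal{P}$, $\mathcal{R}$ by thresholding against the dual variable $\lambda$. A generic numerical sketch of this structure is the projection onto $\{0 \le q_f \le 1,\ \sum_f q_f = L\}$ by bisection on $\lambda$; the per-content scores below are arbitrary stand-ins, not the paper's $g(q)$:

```python
import numpy as np

def capped_simplex_project(s, L, iters=100):
    """Solve min ||q - s||^2 s.t. 0 <= q <= 1, sum(q) = L by bisecting on
    the dual variable lam, with q_f = clip(s_f - lam, 0, 1)."""
    lo, hi = s.min() - 1.0, s.max()   # sum(q) is monotone decreasing in lam
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        q = np.clip(s - lam, 0.0, 1.0)
        if q.sum() > L:
            lo = lam    # cache budget exceeded -> raise the dual variable
        else:
            hi = lam
    return np.clip(s - 0.5 * (lo + hi), 0.0, 1.0)

s = np.array([1.5, 0.2, 0.6, 0.4, 0.05])   # stand-in per-content scores
q = capped_simplex_project(s, L=2.0)
# q now splits indices into Z (q_f = 0), P (0 < q_f < 1), and R (q_f = 1)
```

The clipping at 0 and 1 is exactly where the multipliers $w_f$ and $v_f$ become active, reproducing the three index sets of the proof.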

Algorithm 4 Constrained NNLS algorithm
Input: initialize $\mathcal{P} = \emptyset$, $\mathcal{R} = \{1, \ldots, n\}$, $x = 0$, $v = H^T(y - Hx)$, and tolerance $\epsilon$
Output: $x^* = \arg\min_{x \ge 0} \|y - Hx\|_2^2$ such that $\mathbf{1}^T x = 1$
1: while $\mathcal{R} \ne \emptyset$ and $\max_i v_i > \epsilon$ do
2:   set $j = \arg\max_i v_i$; add $j$ to $\mathcal{P}$ and remove it from $\mathcal{R}$
3:   set $s_{\mathcal{R}} = 0$ and $s_{\mathcal{P}} = x_{\mathcal{P}}(v)$
4:   if $\min s_{\mathcal{P}} \le 0$ then run the inner feasibility loop of the standard active-set NNLS method [23], [24], moving indices with non-positive entries from $\mathcal{P}$ back to $\mathcal{R}$
5:   set $x = s$ and $v = H^T(y - Hx) - \lambda_{\mathcal{P}}(v)\mathbf{1}$
6: end while

Solving the KKT equations gives the primal and dual updates. However, for faster updates in an online round, especially for a large content library, the following modified equations, inspired by the FNNLS scheme [23], [24], are used, where $A_{\mathcal{P}}$ denotes the sub-matrix of $A$ with row-column indices defined by $\mathcal{P}$. These equations depend on the dual variable $v$, which is acquired by the active-set method presented in the modified NNLS procedure in Algorithm 4. After initializing $v$, the method computes the set of positive entries and updates $v$ iteratively; the number of "while" iterations equals the number of nonzero entries in the solution. Algorithm 4 differs from the FNNLS algorithm [23], [24] in the presence of the dual variable $\lambda$, which handles the additional sum constraint beyond non-negativity.
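The same constrained problem ($x \ge 0$, $\mathbf{1}^T x = 1$) can also be solved by projected gradient descent with a Euclidean projection onto the probability simplex. This is an illustrative alternative sketch, not the FNNLS-based active-set procedure of Algorithm 4, and the test instance below is an assumed toy problem:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > css - 1.0)[0][-1]   # largest feasible support
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def constrained_nnls(H, y, iters=500):
    """min ||y - Hx||^2 s.t. x >= 0, 1^T x = 1, via projected gradient."""
    x = np.full(H.shape[1], 1.0 / H.shape[1])       # start at the centroid
    eta = 1.0 / (np.linalg.norm(H, 2) ** 2)         # step from the Lipschitz constant
    for _ in range(iters):
        grad = H.T @ (H @ x - y)
        x = project_simplex(x - eta * grad)
    return x

H = np.eye(3)
y = np.array([0.5, 0.3, 0.2])    # already on the simplex
x = constrained_nnls(H, y)       # recovers y
```

The active-set route of Algorithm 4 is preferred in the paper because its cost scales with the (typically small) support of the solution, whereas the projected-gradient sketch iterates over the full vector every round.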