Optimal Service Caching and Pricing in Edge Computing: A Bayesian Gaussian Process Bandit Approach

Motivated by the emergence of function-as-a-service (FaaS) as a programming abstraction for edge computing, we consider the problem of caching and pricing applications for edge computation offloading in a dynamic environment where Wireless Devices (WDs) can be active or inactive at any point in time. We model the problem as a single-leader multiple-follower Stackelberg game, in which the service operator is the leader and decides what applications to cache and how much to charge for their use, while the WDs are the followers and decide whether or not to offload their computations. We show that the WDs' interaction can be modeled as a player-specific congestion game and establish the existence and computability of equilibria. We then show that under perfect and complete information the equilibrium price of the service operator can be computed in polynomial time for any cache placement. For the incomplete information case, we propose a Bayesian Gaussian Process Bandit algorithm for learning an optimal price for a cache placement and provide a bound on its asymptotic regret. We then propose a Gaussian process approximation-based greedy heuristic for computing the cache placement. We use extensive simulations to evaluate the proposed learning scheme, and show that it outperforms state-of-the-art algorithms by up to 50% at little computational overhead.


INTRODUCTION
Battery-powered Wireless Devices (WDs) are increasingly used for computationally intensive applications such as augmented reality, natural language processing, and face, gesture and object recognition [1], [2]. Nonetheless, executing these kinds of applications on WDs results in high energy consumption and can adversely affect battery lifetime and the user experience.
Edge computing is a promising approach for offloading computationally intensive tasks from WDs to nearby compute resources in the infrastructure via wireless networks. Through computation offloading, WDs can reduce their energy consumption while meeting application latency requirements. Nonetheless, if many WDs offload simultaneously, application performance may suffer due to congestion on the limited wireless and computational resources in the edge infrastructure. This realization has spurred great interest in edge resource management, pricing and admission control [3], [4], [5], [6], [7].
Nonetheless, the management of storage and its interaction with wireless and computing resource management have received much less attention in the literature [8], [9]. Storage is an essential prerequisite for the availability of executable code and data at the edge server, and hence for computation offloading. Code availability and pricing for computation offloading become particularly important in the case of emerging Function-as-a-Service (FaaS) offerings (also called serverless computing), where tasks are executed on demand by loading container images from storage into memory, and charging is based on execution time. Yet, optimizing code availability and pricing is challenging, as the price and the loaded container images affect the decisions of the WDs, and the service operator may not have access to information about the WDs and their workloads a priori, e.g., in emerging mobile network architectures [10], [11], [12].
In this work, we study this important problem. We explore the interaction between a profit-maximizing service operator that performs storage management and pricing, and cost-minimizing autonomous WDs that can offload their computation, subject to application availability and latency constraints. We provide an analysis of the strategic interaction between the service operator and the WDs under complete and perfect information. We then consider the incomplete information case, where the service operator has to learn what applications to cache and what price to charge through repeated interaction with the devices, whose offloading decisions are orchestrated by a network operator. Our main contributions are as follows.
- We propose a Stackelberg game to model the interaction between the service operator and the WDs.
- We show that the interaction of the WDs can be modeled by a player-specific congestion game, and we prove the existence of pure strategy Nash equilibria.
- We propose a polynomial-time algorithm for computing the optimal price to be charged by the service operator.
- We propose a Bayesian Gaussian Process (GP) Bandit based approach to approximate Subgame Perfect Equilibria (SPE) of the game under incomplete information.
- We use extensive simulations to show that the resulting solution outperforms state-of-the-art Multi-Armed Bandit (MAB) based algorithms at a small computational overhead.

The rest of the paper is organized as follows. We present the system model and problem formulation in Section 2. We show the existence of a Nash Equilibrium (NE) and propose an algorithm for computing the optimal price under complete information in Section 3. We present the proposed Bayesian Revenue Maximization (BRM) algorithm for learning a SPE under incomplete information and provide a regret analysis in Section 4. We show numerical results in Section 5, discuss related work in Section 6, and conclude the paper in Section 7.

SYSTEM MODEL AND PROBLEM FORMULATION
We consider a multi-access edge computing system that consists of an edge server with storage capacity S managed by a service operator, and a set N = {1, 2, ..., N} of WDs that can offload their computational tasks for execution at the edge server via a wireless link. Time is slotted, and we consider that each WD i ∈ N is active with probability q_i > 0 in time slot t, independent of other WDs and of its activity in previous time slots [13], [14]. We consider that a single time slot is long enough for performing each user's task both in the case of local computing and in the case of computation offloading. This assumption is reasonable for real-time applications if the worst-case task completion time is less than the time slot length. We define the random variable B_i(t) to model whether or not WD i is active, i.e., P(B_i(t) = 1) = q_i, and define the set N_a(t) = {i ∈ N | B_i(t) = 1} of active WDs in time slot t. An inactive WD has no task to execute in time slot t, while if active, WD i wants to execute a task of type f_i ∈ J, where J is the set of applications (i.e., the set of task types). The applications are the software images required for the execution of the tasks; tasks of different WDs may need the same application image. The computational task of WD i is characterized by the size D_i of the input data (e.g., in bytes), by the expected number L_j of cycles per byte required to perform the task (e.g., in Gcycles/byte) for j = f_i, and by the completion time requirement τ_i.
At the beginning of time slot t, the service operator can decide to cache a subset X(t) ⊆ J of applications, subject to its storage capacity constraint Σ_{j∈X(t)} s_j ≤ S, (1) where s_j is the size of the software image for application j.
Caching application j in time slot t involves a usage cost c_j ∈ R+ for the service operator, e.g., corresponding to the cost of licensing the application from its owner.
If WD i is active in time slot t (B_i(t) = 1) and the application it intends to use is cached by the service operator (f_i ∈ X(t)), then WD i can decide to offload the computation to the edge server. We denote by a_i(t) the offloading decision of WD i; a_i(t) = 1 corresponds to offloading, and a_i(t) = 0 to local computing in time slot t. A WD that offloads in time slot t is charged unit price p(t) ≥ 0 by the service operator. We denote by P = [0, p̄] the domain of prices and assume it is compact. The active WDs use an orchestrator provided by the network operator for coordinating whether or not to offload. We consider that the price p(t) is application independent, aligned with pricing in current FaaS offerings. The service operator has to choose and announce the price p(t), together with the caching decision X(t), before the set N_a(t) of active WDs becomes known.
Next, we present our model of local computing and computation offloading, followed by the problem formulation. Fig. 1 illustrates the considered system, and Table 1 summarizes the most commonly used notation.

Local Computing
If WD i chooses to perform the task locally, the task is executed using local computational resources. We denote by f^l_i the local processing capability (frequency) of WD i, and express the local processing time as t^l_i = D_i L_{f_i} / f^l_i. (2) We consider that f^l_i can be chosen such that local computing completes the task exactly at its deadline, i.e., t^l_i = τ_i. This assumption is reasonable, as dynamic frequency scaling is widely used for reducing the energy consumption of battery-powered WDs while meeting performance needs [15].

Computation Offloading
If WD i decides to offload, it has to transmit D_i amount of data over the wireless channel to the edge server via an Access Point (AP), and processing is then performed at the edge server. We denote by m(t) = Σ_{i=1}^{N} a_i(t) (3) the number of WDs that offload in time slot t, and for simplicity we consider that the available frequency spectrum and the edge processing capacity are shared equally among offloaders. More complex models of resource sharing could be used in practice.
For data transmission, we make the common assumption of a Gaussian channel [8], [16], and we express the data rate achievable by WD i using the Shannon formula [17] as R_i(p_i, m(t)) = (W / m(t)) log2(1 + h_i p_i / σ²_i), (4) where W is the channel bandwidth, h_i is the channel coefficient from WD i to the AP, p_i is the transmit power of WD i, σ²_i is the noise power at the AP, and the bandwidth is shared equally among the m(t) WDs that decide to offload. The transmission power is bounded by the maximum transmission power p̄_i, i.e., p_i ≤ p̄_i. Given the data rate, we can express the upload time as t^u_i(p_i, m) = D_i / R_i(p_i, m). (5) We denote by f^c the computing capability of the edge server, and we consider that it is shared equally among the offloaded tasks; consequently, we can model the processing time at the edge server as t^c_i(m) = m D_i L_{f_i} / f^c. (6)
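The offloading delay model above, with bandwidth and edge CPU shared equally among the m offloaders, can be sketched as follows. The function and parameter names are ours; the formulas follow the rate, upload time and processing time expressions in the text.

```python
import math

def upload_and_compute_time(D_i, L_j, m, W, h_i, p_i, sigma2_i, f_c):
    """Total offloading delay for WD i when m WDs offload (illustrative sketch).

    D_i: input size [bits], L_j: cycles per bit, m: number of offloaders,
    W: total bandwidth [Hz], h_i: channel gain, p_i: transmit power [W],
    sigma2_i: noise power [W], f_c: edge CPU capability [cycles/s].
    """
    # Shannon rate with the bandwidth shared equally among the m offloaders
    rate = (W / m) * math.log2(1.0 + h_i * p_i / sigma2_i)
    t_upload = D_i / rate                 # time to transmit the input data
    t_compute = D_i * L_j / (f_c / m)     # edge CPU shared among m tasks
    return t_upload + t_compute
```

Note that both terms grow with m, which is the congestion effect that drives the game-theoretic analysis in the following sections.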

WD Cost Model
We model the cost of WD i as a combination of its energy consumption and the price charged by the service operator for computation offloading. In the case of local computing, the cost is due to the energy consumed by the local processor to execute the task, i.e., C_i(0, ·) = β_i γ^l_i (f^l_i)² D_i L_{f_i}, (7) where γ^l_i is the power use coefficient and β_i is the unit energy cost of WD i.
In the case of offloading, the cost is the sum of the energy cost of transmitting the input data and the execution cost paid to the service operator. We consider that the execution cost is proportional to the task complexity L_{f_i} and the input size D_i, which is reasonable for today's FaaS offerings following the pay-as-you-go model. The cost of WD i in the case of offloading is thus C_i(1, p_i, a_{−i}) = β_i p_i t^u_i(p_i, m) + p L_{f_i} D_i, (8) where p is the unit price charged by the service operator. The cost of WD i in each time slot is thus C_i(a_i, p_i, a_{−i}) = a_i C_i(1, p_i, a_{−i}) + (1 − a_i) C_i(0, ·), where a_{−i}(t) denotes the offloading decisions of all WDs i′ ∈ N \ {i}.
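A minimal sketch of the two cost terms follows. The local energy model assumes energy per cycle grows quadratically with the CPU frequency, a common DVFS assumption on our part; the paper's exact local energy expression may differ. The offloading cost combines transmission energy with the pay-as-you-go execution charge proportional to L_{f_i} D_i, as in the text.

```python
def local_cost(beta_i, gamma_l, f_l, L_j, D_i):
    """Local computing cost: energy priced at beta_i, using the common
    DVFS model energy = gamma_l * f_l^2 * (cycles). Assumption, not
    necessarily the paper's exact energy model."""
    return beta_i * gamma_l * (f_l ** 2) * L_j * D_i

def offload_cost(beta_i, p_tx, t_upload, price, L_j, D_i):
    """Offloading cost: transmission energy (power * upload time, priced
    at beta_i) plus the FaaS execution charge price * L_j * D_i."""
    return beta_i * p_tx * t_upload + price * L_j * D_i
```

A WD prefers offloading exactly when `offload_cost` is below `local_cost`, which is what defines the threshold prices used later in the analysis.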

Problem Formulation
We consider that the WDs and the service operator are rational, strategic entities. The objective of WD i is to minimize its cost in each time slot t, subject to its completion time requirement, the constraint on the maximum transmission power, and the caching decision X of the service operator.
The service operator can choose a policy k for computing X(t) and p(t) based on past actions X(t′) and p(t′), and observations, including the set {i | a_i(t′) = 1} of offloading WDs and the obtained rewards R(X(t′), p(t′)), t′ ∈ {0, ..., t − 1}. Let us denote by K the set of policies of the service operator. For a policy k ∈ K we define the expected average regret of the service operator up to time T as the loss of reward compared to a static decision X*, p* with maximum expected reward, i.e., ρ^k(T) = (1/T) Σ_{t=1}^{T} (r(X*, p*) − E[R(X(t), p(t))]). The objective of the service operator is to find a policy k* that asymptotically minimizes the expected average regret, subject to the storage capacity constraint (1).
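The average-regret objective above can be sketched directly; the function name is ours, and the optimal static reward r(X*, p*) is assumed to be known for the purpose of illustration.

```python
def average_regret(rewards, optimal_reward):
    """Expected average regret up to T = len(rewards): mean gap between
    the reward of the best static decision (X*, p*) and the policy's
    realized per-slot rewards (illustrative sketch)."""
    T = len(rewards)
    return sum(optimal_reward - r for r in rewards) / T
```

A policy is asymptotically optimal when this quantity vanishes as T grows.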
The resulting problem is a stochastic sequential game, in which the service operator and the WDs play a multi-follower Stackelberg game in every time slot. In the Stackelberg game the service operator is the leader and the WDs are the followers. We refer to the problem as the Dynamic Time Constrained Computation Offloading (DTCCO) game, and we are interested in learning a policy k* that solves (14) under incomplete information, i.e., through interaction with the WDs. Importantly, we assume that the service operator does not know the set of active WDs and their parameters; instead, WDs report their parameters to a network orchestrator node owned by the network operator, and the orchestrator node coordinates the offloading decisions of the WDs. This assumption is reasonable when the network orchestrator and the service operator are different entities [10], [11], [12], [18]: the network operator can have access to traffic information and WDs' parameters [12], while the service operator owns or rents computation resources that it makes available to the WDs, but needs to decide on caching and pricing before the WDs decide whether to use its service.

EXISTENCE OF EQUILIBRIA IN THE STAGE GAME
We first focus on the stage game played in time slot t among the active WDs, and we characterize their interaction for a given caching decision X(t) and price p(t) chosen by the service operator. We then consider the problem of pricing and caching faced by the service operator under complete information, i.e., when the service operator knows the set of active WDs and their parameters. This is equivalent to assuming that the network operator and the service operator are the same entity; hence, in this section we use the term operator to refer to both entities. As we focus on a single stage, throughout the section we omit the time index to simplify notation.

Equilibrium Existence Among WDs
Let us consider a caching decision X and price p, and investigate the interaction of the active WDs, which is in effect a strategic game.We thus investigate whether the strategic game played by the WDs admits a pure strategy NE.To assess whether NE exist, we start with characterizing the optimal offloading decision for WDs.
where p̄_i is the maximum transmission power of WD i.

Proof. Observe that if t^c_i(m) > t^l_i then WD i cannot complete the task on time, thus the optimal offloading decision is a*_i = 0. Otherwise, WD i should choose a transmit power that minimizes its cost while ensuring timely completion. It is easy to see that the upload time t^u_i(p_i, m) is a strictly monotonically decreasing function of p_i, and C_i(1, p_i, a_{−i}) is a strictly monotonically increasing function of p_i. Thus, WD i minimizes its cost by choosing the transmit power p*_i that yields (16). We can substitute t^u_i(p*_i, m) = t^l_i − t^c_i(m), (2) and (6) into (16), and obtain (15), which proves the result. ∎
The optimal offloading decision of a WD given the other WDs' decisions is called the best reply, and is used in characterizing the NE, defined as follows.

Definition 1 (Nash Equilibrium). A NE is a collection of offloading decisions (a*_1, ..., a*_N) such that no WD i can decrease its cost by unilaterally deviating from a*_i.
Observe that the game played between the WDs is a player-specific network congestion game with the topology shown in Fig. 2. Unfortunately, in player-specific congestion games the existence of a pure strategy NE is not guaranteed. Next, we use a topological equivalence argument to show that in the considered game a NE always exists.

Theorem 1. The stage game possesses a pure strategy Nash equilibrium among the WDs.
Proof. Note that the stage game is a player-specific network congestion game with the topology shown in Fig. 2 (left). The nodes S, A, and D stand for Source, Access Point, and Destination, respectively. In the network topology the path (S, A, D) corresponds to computation offloading, while the direct path (S, D) corresponds to local computing, with edge costs C_i(1, p*_i, a_{−i}) and C_i(0, p*_i, a_{−i}), respectively. To show the existence of equilibria, in what follows we show that G can be transformed into a network G′ with parallel edges such that the games played on the two networks are best-response equivalent. We do so by replacing the edge (A, D) and its two end vertices A and D in G by a single vertex, and by redefining the costs of the incident edges accordingly. We thus obtain the parallel network topology G′ shown in Fig. 2 (right). Observe that the difference between the cost functions of WD i in G and in G′ depends only on the strategy of the operator. This implies that G and G′ are best-response equivalent, and thus they have identical sets of pure strategy Nash equilibria. Since G′ is a singleton player-specific congestion game, it possesses a pure NE [19], and so does G. This concludes the proof. ∎
Given that equilibria do exist, the next question is whether a NE can be computed easily. For example, one could allow one WD at a time to improve its strategy, i.e., increase the payoff it receives, leading to an improvement path. If every improvement path is finite then the game is said to have the Finite Improvement Property (FIP), and a NE can be computed easily. Unfortunately, the FIP is not guaranteed in player-specific congestion games. Next, we show that the considered game does have the FIP.
Lemma 2. The stage game possesses the finite improvement property, i.e., if WDs update their offloading strategies one at a time, they reach a NE in a finite number of steps.
Proof. Each WD has two strategies; thus the result follows from Theorem 1 in [20]. ∎
Hence, an equilibrium can be computed by letting the WDs update their offloading strategies one at a time. We can thus conclude that for any caching decision X and price p set by the operator, there is a NE for the WDs in the stage game, and a NE can be computed efficiently. For a practical implementation, each WD can calculate its threshold price (15) for all 1 ≤ m ≤ N, and hence the active WDs can find a NE by sharing only their threshold prices with a network orchestrator entity.
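The practical implementation sketched above, in which WDs update their strategies one at a time based only on threshold prices, can be illustrated as follows. The data layout (a matrix with thresholds[i][m-1] = p_{i,m}, decreasing in the congestion level m) is our assumption; by the finite improvement property (Lemma 2) the sequential best-reply loop terminates at a NE.

```python
def best_response_dynamics(thresholds, price):
    """Sequential best replies for the stage game (sketch).
    thresholds[i][m-1] is WD i's threshold price p_{i,m}: the highest
    price at which i prefers offloading when m WDs (including i) offload.
    WDs update one at a time until no WD wants to deviate."""
    n = len(thresholds)
    a = [0] * n                                # start from all-local
    changed = True
    while changed:
        changed = False
        for i in range(n):
            m_if_offload = sum(a) - a[i] + 1   # offloaders if i offloads
            best = 1 if price <= thresholds[i][m_if_offload - 1] else 0
            if best != a[i]:
                a[i] = best
                changed = True
    return a
```

At a high price only WDs with large thresholds offload; lowering the price draws in more offloaders until congestion makes further deviation unattractive.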

Optimal Pricing Under Complete Information
Next, we consider the problem of the operator in the stage game, and we propose a polynomial-time algorithm for computing the optimal equilibrium price for a given caching decision X. Throughout the subsection, we consider the Strong Stackelberg Equilibrium (SSE), i.e., if there are multiple subgame perfect equilibria then one with maximum utility for the operator is chosen. Throughout this subsection, we denote by U(a, X, p) = R(X, p | N_a) the instantaneous reward.
Let us denote by p_{i,m} the maximum price at which WD i would choose to offload for a particular number of offloaders m ≤ N, and let us call p_{i,m} the threshold price of WD i for m. In addition, we define the notation that we will use in this section. Let us define the set N_o(p, m) = {i ∈ N : p_{i,m} ≥ p} of potential offloaders at price p if there were m offloaders, and the set N_X = {i ∈ N : f_i ∈ X} of WDs whose applications are cached by the operator. We then define the set P_t = {p_{i,m} : i ∈ N, 1 ≤ m ≤ N} of threshold prices. We define corresponding sets for the set X of cached applications: P_t^X = {p_{i,m} : i ∈ N_X, 1 ≤ m ≤ |N_X|} is the set of threshold prices, and N_o^X(p, m) = {i ∈ N_X : p_{i,m} ≥ p} is the set of WDs that would want to offload at price p if a total of m WDs offload, for cached application set X. Under the complete information assumption the threshold prices p_{i,m}, i ∈ N, can be calculated using Lemma 1. For an application placement X and price p we denote by a*(X, p) the set of Nash equilibria among the WDs that yield maximum utility to the operator.
We continue with an important result that we will use for proposing a polynomial-time algorithm that computes the utility-maximizing price.

Lemma 3. Consider an application placement X and threshold prices p′, p″ ∈ P_t^X such that there is no threshold price in the interval (p′, p″), i.e., (p′, p″) ∩ P_t^X = ∅. Let p_1, p_2 ∈ (p′, p″], p_1 < p_2. Then a*(X, p_1) = a*(X, p_2). Furthermore, for any a ∈ a*(X, p_1) the utility of the operator is a monotonically increasing linear function on (p′, p″], i.e., U(a, X, p_1) < U(a, X, p_2).
Proof. We start by proving the first statement. Let a ∈ a*(X, p_1) be an equilibrium under price p_1. Since there is no threshold price in (p′, p″), no WD's best reply changes between p_1 and any p_2 ∈ (p_1, p″], and hence a ∈ a*(X, p_2). To prove the second statement, let us consider an equilibrium a ∈ a*(X, p″). By the previous statement we know that a ∈ a*(X, p) for p ∈ (p′, p″]. We can rewrite (11) for the equilibrium strategy profile a under homogeneous pricing and obtain U(a, X, p) = p Σ_{i: a_i = 1} L_{f_i} D_i − Σ_{j∈X} c_j, (18) which is monotonically increasing in p for any given X on (p′, p″]. For all p ∈ (p′, p″] the WDs do not change their decisions; therefore, we can treat C = Σ_{i: a_i = 1} L_{f_i} D_i as a constant and substitute it into (18). Since ∂U(a, X, p)/∂p = C, the utility of the operator is a linear function on (p′, p″]. This concludes the proof. ∎
The above result allows us to characterize the reward as a function of the price p set by the operator.

Proposition 1. For any application placement X, U(a, X, p) is a piecewise linear, left-continuous function of p.

Proof. Consider prices p′, p″ ∈ P_t^X for some X ⊆ J such that (p′, p″) ∩ P_t^X = ∅. Then, by Lemma 3, U(a, X, p) is an increasing affine function on the interval (p′, p″]. Since the set P_t^X has a finite number of elements, U(a, X, p) is a collection of left-continuous monotonically increasing linear functions, and it is thus piecewise linear. ∎

Next, we characterize equilibria to allow finding an optimal price efficiently.
Lemma 4. Let a′, a″ ∈ a*(X, p) be NE for application placement X and price p. Then Σ_{i∈N_X} a′_i = Σ_{i∈N_X} a″_i ≡ m, i.e., the number of offloaders is the same in both NE.
Proof. We prove the statement by contradiction. Let m′ = Σ_{i∈N_X} a′_i and m″ = Σ_{i∈N_X} a″_i, and without loss of generality assume that m″ < m′. Then, for strategy profile a′, there have to be at least m′ WDs with p_{i,m′} ≥ p. Similarly, for NE strategy profile a″, there have to be at least m″ WDs with p_{i,m″} ≥ p. Observe that p_{i,m′} < p_{i,m″}, since by assumption m″ < m′. However, if a′ is a NE then we know that there are at least m′ WDs for which p_{i,m″} ≥ p. Thus in strategy profile a″ there are at least m′ − m″ WDs that would prefer offloading at price p, and hence a″ cannot be a NE, which contradicts the initial assumption. Thus, m′ = m″ must hold, which concludes the proof. ∎

We now use Lemma 4 for designing an algorithm for computing a NE. First, we show that for a given price p and application placement X, a NE with maximum payoff for the operator can be computed in polynomial time. To see this, observe that for a given price p, the operator's income from a WD that offloads is U(a_i, {f_i}, p) = a_i 1_X(f_i) L_{f_i} D_i p, and is independent of which other WDs are offloading.
Lemma 5. Consider a price p and application placement X. Let m′ be the equilibrium number of offloaders, and let N_y ⊆ N_o^X(p, m′) be such that |N_y| = m′ and Σ_{i ∈ N_y \ N_o^X(p, m′+1)} L_{f_i} is maximal. Then the strategy profile a in which a_i = 1 for i ∈ N_y is a NE with maximum payoff for the operator, and can be found in polynomial time.
To see that the solution can be obtained in polynomial time, observe that m′ can be found based on N_o^X(p, m), and N_y can be found by sorting the WDs in decreasing order of L_{f_i} D_i, both in polynomial time (see Algorithm 1). ∎
Lemma 5 allows us to compute a NE among the WDs efficiently, one that is in accordance with the SSE assumption, i.e., it maximizes the revenue of the service operator. Given a NE for a given price, we are now ready to compute the price that maximizes the operator's revenue for a given application placement.
Theorem 2. Consider an application placement X. Then the price p* computed by Algorithm 2 maximizes the operator's revenue, i.e., U(a*, X, p*) ≥ U(a, X, p) for all a* ∈ a*(X, p*) and for all p and a ∈ a*(X, p).

Proof. Consider a price p ∉ P_t^X that allows a set of equilibria a*(X, p). Let p′, p″ ∈ P_t^X be consecutive threshold prices (i.e., (p′, p″) ∩ P_t^X = ∅) such that p′ < p < p″. By Lemma 3 the prices p and p″ allow the same set of equilibria, a ∈ a*(X, p″) = a*(X, p), and the utilities satisfy U(a, X, p″) > U(a, X, p). Thus, to find the profit-maximizing price and the corresponding strategy profile, it is sufficient to compute U(a, X, p′) for all p′ ∈ P_t^X and then choose p* = argmax_{p′∈P_t^X} U(a, X, p′). Hence, Algorithm 2 computes a price that maximizes the utility of the operator. ∎

Algorithm 1. Computing a NE for WDs

Theorem 2 implies that the optimal price p* is a threshold price for the given caching decision X, i.e., p* ∈ P_t^X, and the operator can find the optimal price by calculating and comparing the utility at the threshold prices. Thus, an optimal price can be computed in polynomial time for any given caching decision X under complete information.
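Theorem 2 suggests a simple search: since the optimal price is a threshold price, it suffices to evaluate the operator's revenue at each candidate in the finite threshold set and keep the best. The sketch below combines this with a Lemma-4/Lemma-5 style equilibrium computation; it is a simplified illustration (no caching costs, offloader selection by largest L_{f_i} D_i), not the paper's Algorithms 1-2.

```python
def optimal_price(thresholds, values):
    """Enumerate candidate threshold prices and keep the revenue-maximizing
    one (sketch of the Theorem-2 search).

    thresholds[i][m-1] = p_{i,m}, WD i's threshold price when m WDs offload;
    values[i] = L_{f_i} * D_i, the operator's per-unit-price income from i.
    """
    n = len(thresholds)
    candidates = {thresholds[i][m] for i in range(n) for m in range(n)}
    best_price, best_revenue = None, 0.0
    for p in candidates:
        # equilibrium offloader count: largest m supported at price p
        m_eq = 0
        for m in range(1, n + 1):
            if sum(1 for i in range(n) if thresholds[i][m - 1] >= p) >= m:
                m_eq = m
        if m_eq == 0:
            continue
        # revenue-maximizing NE: pick the m_eq willing WDs of largest value
        willing = sorted((values[i] for i in range(n)
                          if thresholds[i][m_eq - 1] >= p), reverse=True)
        revenue = p * sum(willing[:m_eq])
        if revenue > best_revenue:
            best_price, best_revenue = p, revenue
    return best_price, best_revenue
```

With |P_t^X| = O(N²) candidates and a linear scan per candidate, the search is polynomial, matching the complexity claim in the text.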

BAYESIAN OPTIMIZATION FOR REGRET MINIMIZATION
We have so far shown how to choose an optimal price for a caching decision X under complete information and characterized the reward as a function of the price.In what follows we consider the incomplete information case, i.e., the network operator and the service operator are different entities.
We characterize the expected reward of the service operator under incomplete information and we propose an online policy for maximizing it, including the computation of a near-optimal caching decision X* together with a corresponding optimal price, which together approximate a solution to (14).

Characterization of the per Stage Expected Reward
We start by characterizing the expected reward as a function of the price p(t) chosen by the service operator for a single time slot t. For brevity, we omit the time index t in this subsection.

Proposition 2. r(X, p) is a piecewise linear, left-continuous function of p.

Proof. Recall that r(X, p) is by definition the expectation of the reward R(X, p | N_a) = U(a, X, p), where the expectation is taken over the set N_a of active WDs. Thus, by Proposition 1 it is a weighted sum of piecewise linear, left-continuous functions, and is therefore itself piecewise linear and left-continuous. ∎
Thus, for any X ⊆ J there is a price p*_X ∈ argmax_p r(X, p). We now continue with the analysis of the maximal expected reward, starting with the definition of two properties of reward functions.

Definition 2. The set function r : 2^J → R is monotone if for any X ⊂ J and j ∈ J \ X we have r(X ∪ {j}) ≥ r(X).
Monotonicity is a common assumption, e.g., in knapsack problems with independent item values. As we show next, in the considered problem the service operator's expected reward need not be monotone.
Proposition 3. The expected reward r need not be monotone.

Proof. We prove the result through the following example. ∎

Example 1. The WDs' parameters are as shown in Table 2.
Together with monotonicity, submodularity is often used for obtaining approximation ratio bounds for NP-hard optimization problems. As we show, in the considered problem the expected reward need not be submodular, as it is not even weakly submodular.

Proposition 4. Let X, X′ ⊆ J, X ∩ X′ = ∅. Then, for any ν ∈ (0, 1], the expected reward r need not be ν-weakly submodular.
Proof. We prove the statement by giving a counterexample to ν-weak submodularity. ∎

Example 2. The WDs' parameters are as shown in Table 3.
We have thus shown that the expected reward r is neither monotone nor weakly submodular in general.Hence, existing results on monotone submodular function maximization do not hold for our problem.
Moreover, our analysis of the expected reward highlights two key challenges in learning an optimal policy k* based on past observations, challenges not found in the literature on Combinatorial Multi-Armed Bandit (CMAB) optimization [21], [22]. First, the expected reward of caching an application depends on what other applications are cached, i.e., in bandit terminology, the expected reward of a bandit arm is not independent of the set of arms chosen. Second, the rewards of the cached applications X depend on the chosen price p, i.e., there is an additional continuous decision variable that needs to be optimized. As a consequence, existing approaches for solving CMAB problems, which choose a set of arms (called a super arm) using a computation oracle provided with the empirical distribution of the rewards of individual arms, cannot be applied directly to our problem. Instead, the choice of which set of applications to cache has to be combined with learning the corresponding optimal price.

Combinatorial Bayesian Revenue Maximization
Motivated by the above observations, we propose the BRM algorithm for approximating an optimal policy.BRM combines the exploration of the expected reward of individual applications with the maximization of the reward of a set of applications that are expected to provide the highest reward, computed based on current best estimates.Importantly, the optimization of the price is specific to the set of cached applications, so as to address the issue of potential non-monotonicity.The pseudocode of the algorithm is shown in Algorithm 3.
The key tenet of BRM is that it simultaneously learns to approximate the maximum expected reward of individual applications and sets of applications.With a small, decreasing probability, at time t it caches a single application chosen at random, while otherwise, it selects a set of applications to cache based on their estimated maximum expected rewards, computed using the posterior mean reward of the applications obtained using a GP approximation.For the chosen set X of applications, it then chooses a price p based on samples of the instantaneous rewards collected in the past, for that set of applications, using a GP approximation of the expected reward function (Line 11).
For a given application placement X, the function that we want to maximize is one dimensional, i.e., r(X, p) : P → R, and we propose to approximate it by a GP using Bayesian Optimization (BO). Let us denote by D_t = {(X(l), p(l), R(X(l), p(l)))}_{l=1}^{t} the set of reward samples collected up to time t, and by D_t^X = {(X, p(l), R(X, p(l))) | l = 1, ..., t, X(l) = X} the reward samples collected for a set X of applications up to time t. Let n_t^X = |D_t^X| be the number of reward samples collected for set X, let p^X(l) be the price used when the target function of set X was sampled the l-th time, 1 ≤ l ≤ n_t^X, and denote by P_t^X = [p^X(1), ..., p^X(n_t^X)] the vector of prices used when the target function of set X is sampled.
For a set of applications $\mathcal{X}$ at time $t$, the GP approximation of the expected reward function $r(\mathcal{X}, p(t))$ as a function of the price models the expected reward as a collection of random variables $\{\bar{r}(\mathcal{X}, p)\}_{p \in \mathcal{P}}$, such that the finite collection of random variables $\{\bar{r}(\mathcal{X}, p^{\mathcal{X}}(l))\}_{l \le n_t^{\mathcal{X}}}$ are jointly Gaussian with mean
$$\mathbb{E}[\bar{r}(\mathcal{X}, p^{\mathcal{X}}(l))] = \mu^{\mathcal{X}}(p^{\mathcal{X}}(l)), \quad (23)$$
and covariance
$$\mathrm{cov}\big(\bar{r}(\mathcal{X}, p^{\mathcal{X}}(l)), \bar{r}(\mathcal{X}, p^{\mathcal{X}}(l'))\big) = \mathbb{E}\big[\big(\bar{r}(\mathcal{X}, p^{\mathcal{X}}(l)) - \mu^{\mathcal{X}}(p^{\mathcal{X}}(l))\big)\big(\bar{r}(\mathcal{X}, p^{\mathcal{X}}(l')) - \mu^{\mathcal{X}}(p^{\mathcal{X}}(l'))\big)\big] = k^{\mathcal{X}}(p^{\mathcal{X}}(l), p^{\mathcal{X}}(l')) \le 1,$$
for all $l, l' \le n_t^{\mathcal{X}}$, where $k^{\mathcal{X}}$ is called the kernel function. An example of a commonly used kernel function is the squared exponential kernel $k(p, p') = \exp\left(-(p - p')^2 / (2\theta^2)\right)$, where $\theta$ is called the length scale parameter. Let us denote by $\mathbf{y}_t^{\mathcal{X}} = [R(\mathcal{X}, p^{\mathcal{X}}(1)), \ldots, R(\mathcal{X}, p^{\mathcal{X}}(n_t^{\mathcal{X}}))]^T$ the vector of revenue samples collected until time $t$. Then, the posterior distribution of the GP approximation of the expected reward with zero mean prior (i.e., $GP(0, k^{\mathcal{X}}(\cdot, \cdot))$) will have mean $\mu_t^{\mathcal{X}}(p)$, covariance $k_t^{\mathcal{X}}(p, p')$ and variance $(\sigma_t^{\mathcal{X}}(p))^2$ that can be computed as [23]
$$\mu_t^{\mathcal{X}}(p) = \mathbf{k}_t^{\mathcal{X}}(p)^T (\mathbf{K}_t^{\mathcal{X}} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}_t^{\mathcal{X}},$$
$$k_t^{\mathcal{X}}(p, p') = k^{\mathcal{X}}(p, p') - \mathbf{k}_t^{\mathcal{X}}(p)^T (\mathbf{K}_t^{\mathcal{X}} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_t^{\mathcal{X}}(p'),$$
$$(\sigma_t^{\mathcal{X}}(p))^2 = k_t^{\mathcal{X}}(p, p),$$
where $\mathbf{k}_t^{\mathcal{X}}(p) = [k^{\mathcal{X}}(p, p^{\mathcal{X}}(1)), \ldots, k^{\mathcal{X}}(p, p^{\mathcal{X}}(n_t^{\mathcal{X}}))]^T$, $\mathbf{K}_t^{\mathcal{X}}$ is the positive semi-definite kernel matrix, $\mathbf{I}$ is the $n_t^{\mathcal{X}} \times n_t^{\mathcal{X}}$ identity matrix, and $\sigma^2$ is the prior of the noise variance. Given the posterior distribution $GP(\mu_t^{\mathcal{X}}, k_t^{\mathcal{X}})$, the algorithm chooses the next price $p^{\mathcal{X}}(n_t^{\mathcal{X}} + 1)$ to be explored so as to maximize the upper confidence bound of the expected reward, which is computed based on $\mu_t^{\mathcal{X}}$ and $\sigma_t^{\mathcal{X}}$ (Line 11). Intuitively, maximization of the upper confidence bound aims at finding a tradeoff between maximizing the instantaneous reward based on past samples and exploring prices for which the estimated reward has high variance. BRM is inspired by the recently proposed Enlarged Confidence Gaussian Process Upper Confidence Bound (EC-GP-UCB) algorithm [24], but compared to EC-GP-UCB it uses a novel acquisition function that results in a deterministic regret bound. In addition, it extends the GP approximation to the selection of cached objects, performed using the GP-Non-negative Greedy (NNG) algorithm, which also makes use of the estimated posterior means.
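To make the update concrete, the GP posterior computation and the UCB-based price selection described above can be sketched as follows. This is a minimal sketch assuming a squared exponential kernel with unit signal variance and a discrete price grid; the function names, the length scale, and the confidence parameter `beta` are illustrative choices, not values from the paper:

```python
import numpy as np

def se_kernel(p, q, theta=0.2):
    """Squared exponential kernel with length scale theta (illustrative value)."""
    p, q = np.atleast_1d(p), np.atleast_1d(q)
    return np.exp(-(p[:, None] - q[None, :]) ** 2 / (2.0 * theta ** 2))

def gp_posterior(prices, revenues, grid, sigma2=0.01, theta=0.2):
    """Posterior mean and standard deviation of the zero-mean GP at every
    candidate price in `grid`, given the prices and revenue samples observed
    so far for one application placement X."""
    K = se_kernel(prices, prices, theta) + sigma2 * np.eye(len(prices))
    K_inv = np.linalg.inv(K)
    k_star = se_kernel(grid, prices, theta)                # |grid| x n_t^X
    mu = k_star @ K_inv @ revenues
    var = 1.0 - np.sum((k_star @ K_inv) * k_star, axis=1)  # k(p, p) = 1 for SE
    return mu, np.sqrt(np.clip(var, 0.0, None))

def next_price(prices, revenues, grid, beta=2.0):
    """UCB acquisition: explore the price maximizing mu + sqrt(beta) * sigma."""
    mu, sd = gp_posterior(np.asarray(prices, float), np.asarray(revenues, float), grid)
    return grid[np.argmax(mu + np.sqrt(beta) * sd)]
```

At each time slot the operator would append the observed revenue for the chosen price to the sample set and call `next_price` again; `beta` controls the width of the confidence bound and hence how aggressively prices with uncertain reward estimates are explored.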

Regret Analysis
In this section we provide a bound on the regret achieved by the proposed algorithm. We consider a particular set $\mathcal{X}$ of applications.

Proof. The proof can be found in the Appendix, available in the online supplemental material. □

Combining Proposition 5 and Lemma 7 we can bound the error of the posterior mean with respect to the hypothesis function.

Corollary 1. For all $p \in \mathcal{P}$, $|\mu_t^{\mathcal{X}}(p) - r_h(\mathcal{X}, p)| \le c_t \sigma_t^{\mathcal{X}}(p)$, where $c_t = \max\{\sqrt{\cdots}\}$.

Proof. The proof follows from Proposition 5 and Lemma 7 by substitution. This proves the statement. □
Based on the above results, we can provide a deterministic asymptotic regret bound for our algorithm with respect to the hypothesis function.
Theorem 3. Consider the hypothesis class $\mathcal{F}_k(\mathcal{P}, V)$ of functions on the domain $\mathcal{P} \subset \mathbb{R}$ for some $V > 0$. For any $r$ defined on $\mathcal{P}$ and $\Delta \ge 0$ such that $\min \cdots$, where $\gamma_T$ is the kernel-specific sublinear maximum information gain.
Proof. The proof can be found in the Appendix, available in the online supplemental material. □

Combining the above with (32), we can bound the regret of the proposed algorithm with respect to the target function $r$.
The bound given in Theorem 4 shows that the choice of the kernel function plays a prominent role in minimizing the regret bound. Observe that $\Delta$, $V$, and $\gamma_T$ depend on the kernel chosen by the service operator. We know by Proposition 2 that the average revenue is a piecewise linear function, hence it is rough (the opposite of smooth). Thus, choosing a smooth (rough) kernel function will result in a high (low) pointwise error $\Delta$, and a low (high) $V$, since the norm $\|r_h\|_k \le V$ is a measure of the roughness of the target function, and a low (high) $\gamma_T$, since the information gain obtained from smooth (rough) functions is low (high) because of the high (low) correlation between nearby observations. The service operator can influence these parameters through the choice of the kernel, and thus it can influence the worst-case accuracy of the algorithm.
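The intuition about kernel smoothness can be checked numerically: under a smooth kernel, nearby prices are strongly correlated, so a new observation carries little new information, while a rough kernel correlates nearby points only weakly. The following is an illustrative sketch (the two kernels and the length scale of 0.2 are our choices for demonstration, not parameters from the paper):

```python
import numpy as np

def se(d, theta=0.2):
    """Smooth squared exponential kernel: sample paths are infinitely differentiable."""
    return np.exp(-d ** 2 / (2.0 * theta ** 2))

def matern12(d, theta=0.2):
    """Rough Matern-1/2 (exponential) kernel: sample paths are continuous
    but nowhere differentiable."""
    return np.exp(-np.abs(d) / theta)

# Correlation between two prices 0.05 apart: the smooth kernel correlates
# them far more strongly than the rough one, so each observation of a smooth
# GP yields less new information (lower gamma_T), while fitting a piecewise
# linear target with a smooth kernel incurs a larger pointwise error Delta.
rho_smooth = se(0.05)       # ~0.97
rho_rough = matern12(0.05)  # ~0.78
```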

NUMERICAL RESULTS
We used extensive simulations to evaluate the performance of the proposed algorithm in terms of service operator utility, exploration versus exploitation, and the effect of the WDs' probability of being active on the average reward.
For the evaluation we consider a system with up to $N = 100$ WDs, up to $|\mathcal{J}| = 60$ applications, and storage capacity up to $S = 8$. The computational complexity $L_j$ is drawn from a uniform distribution on $[100, 1100]$ cycles/B, and the cost $c_j$ of application $j$ is drawn from a uniform distribution on $[0.01, 0.1]$ \$. The computational capability of the edge server is $f^c = 12$ GHz. The task types of the WDs are chosen uniformly at random from $\mathcal{J}$.
For each WD, the maximum transmission power $\bar{p}$ is drawn from a uniform distribution on $[150, 350]$ mW, $f_i^l$ is drawn from a uniform distribution on $[0.1, 0.8]$ GHz, and $D_i$ is drawn from a uniform distribution on $[1, 50]$ MB. The channel noise variance $\sigma_i^2$ and the channel gain $h_i$ are uniformly distributed on $[0.1, 0.3]$ and $[0.8, 1]$, respectively. We set $\gamma_i = 10^{-18}$ and $\beta_i = 1$ for all $i \in \mathcal{N}$. The probability $q_i$ that WD $i$ is active is drawn from a uniform distribution on $[0, 1]$. Lastly, the channel bandwidth $W$ is chosen uniformly at random on $[200, 300]$ MHz for each simulation. These choices of parameters are similar to those used in previous work [30], [31]. The results shown are averages of at least 150 simulations, together with 95% confidence intervals.
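The parameter distributions above translate directly into a simulation setup. A sketch of how one scenario could be drawn (the variable names and the seed are ours; `N` and `n_apps` are set to the paper's upper limits):

```python
import numpy as np

rng = np.random.default_rng(42)
N, n_apps = 100, 60              # WDs and applications (upper limits in the paper)

params = {
    "L_j":     rng.uniform(100, 1100, n_apps),   # computational complexity, cycles/B
    "c_j":     rng.uniform(0.01, 0.1, n_apps),   # caching cost per application, $
    "p_max":   rng.uniform(0.150, 0.350, N),     # max transmission power, W (150-350 mW)
    "f_local": rng.uniform(0.1, 0.8, N),         # local CPU frequency, GHz
    "D_i":     rng.uniform(1, 50, N),            # input data size, MB
    "sigma2":  rng.uniform(0.1, 0.3, N),         # channel noise variance
    "h_i":     rng.uniform(0.8, 1.0, N),         # channel gain
    "q_i":     rng.uniform(0.0, 1.0, N),         # activation probability
}
W = rng.uniform(200, 300)                        # channel bandwidth, MHz
task_type = rng.integers(0, n_apps, N)           # task types, uniform over J
active = rng.random(N) < params["q_i"]           # one realization of the active WDs
```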
We use three baselines for comparison. The first baseline knows the realizations of $B_i(t)$ and the parameters of the active WDs in every time slot $t \le T$. It uses Algorithm 3 given in [32] to compute the cached set $\mathcal{X}^*(t)$ at every time slot $t$ and the corresponding optimal price. We refer to this as the Oracle. The second baseline is Static Expected Reward Maximization (SERM), which knows the parameters of the WDs, and uses this knowledge for computing the expected reward $r(\{j\}, p^*)$ for all $j \in \mathcal{J}$. SERM then uses Algorithm 3 in [32] with $r(\{j\}, p^*)$ as input (instead of the instantaneous reward $R(\{j\}, p^*)$) to estimate the optimal service caching. This baseline is expected to serve as an upper bound for the performance of BRM, since it has access to more information about the system. The third baseline is the Combinatorial Upper Confidence Bound (CUCB) algorithm proposed in [21] for combinatorial multi-armed bandit problems. CUCB knows the WDs' parameters, and at the end of every time slot it can calculate the price $p_{\mathcal{X}}^*$ that would have been optimal given the active WDs by using Algorithm 2, together with the corresponding reward of each application. It maintains the average of the computed optimal prices and rewards for each application $j$, which it uses for choosing the set of applications to be cached using the CUCB algorithm, together with the average of the prices of the chosen applications.
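For reference, the per-slot bookkeeping of the CUCB baseline can be sketched as follows. This is a sketch assuming the standard CUCB confidence radius of [21]; the helper names are ours, and tie-breaking and the price averaging are simplified:

```python
import numpy as np

def cucb_select(avg_reward, counts, t, S):
    """Choose the S applications with the highest UCB indices, using the
    standard CUCB confidence radius sqrt(3 ln t / (2 n_j))."""
    counts = np.maximum(counts, 1)  # avoid division by zero for unplayed arms
    ucb = avg_reward + np.sqrt(3.0 * np.log(t) / (2.0 * counts))
    return np.argsort(ucb)[-S:]

def cucb_update(avg, count, sample):
    """Running-average update for one application's observed optimal price
    (or reward), computed at the end of a time slot."""
    return (avg * count + sample) / (count + 1), count + 1
```

BRM instead maintains a full GP posterior over prices for each set of applications, which is why it needs to explore far fewer distinct sets than CUCB (cf. Fig. 6).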

Approximation of Average Reward
Fig. 3 shows the average reward as a function of the price (\$/Gcycle) for $|\mathcal{J}| = 3$ and $S \le 3$. Solid lines represent the actual expected rewards and the dashed lines show the estimates obtained using the BRM algorithm. BRM approximates the rewards well around their maxima, and as such it manages to find prices that are close to optimal. It is interesting to note that $r(\{2\}, p_{\{2\}}^*)$ is slightly higher than $r(\{1,2,3\}, p_{\{1,2,3\}}^*)$, i.e., caching more applications can be detrimental to the average reward.

Service Operator's Profit
Fig. 4 shows the average reward of the service operator as a function of the number of WDs for three scenarios. The figure shows results for SERM only for $N \le 20$, as the time required for calculating the expected reward in (12) increases exponentially with the number of WDs. Nonetheless, for $N \le 20$ we can observe that BRM performs close to SERM, and as such BRM approximates the expected reward of the individual applications sufficiently well. In addition, BRM outperforms CUCB in all scenarios, especially as the number of WDs increases. This is because the interaction between WDs becomes more intricate as the number of WDs increases, and hence taking the mean of the optimal prices from all realizations of $\mathcal{N}^a(t)$ for a given application placement in each time slot fails to perform well. It is also interesting to note that as the number of WDs increases, the gap between the Oracle and the rest of the curves increases, since the entropy of the set of active WDs increases (cf. Section 5.4).
Comparing the results for the three scenarios, we observe that scenario $(|\mathcal{J}| = 16, S = 8)$ has the highest average reward, which is because this scenario has the highest ratio $S/|\mathcal{J}|$ of cache slots to applications, so the largest share of WDs find their application cached, which allows more reward per application. Scenarios $(|\mathcal{J}| = 32, S = 8)$ and $(|\mathcal{J}| = 8, S = 2)$ have the same ratio, yet the former allows a higher reward because there are more applications that can offer better options for the service operator to choose from. Fig. 5 shows the average reward as a function of the number of applications ($|\mathcal{J}|$) for $N = 20$ and up to $S = 8$. The figure shows that the algorithms are fairly insensitive to the increase in the number of applications. The figure also shows that BRM can explore well and can approximate the optimal application placement despite the large number of applications, and it consistently outperforms CUCB.

Exploration versus Exploitation
Fig. 6 shows the number of distinct sets of applications explored by the algorithms as a function of the number of applications.The figure shows that the proposed BRM scales well as it does not explore as many sets as CUCB does, and yet achieves superior reward, i.e., it performs better in exploitation.The difference is most significant for higher values of storage capacity S.

Effect of the Activation Probability
Fig. 7 shows the average reward as a function of the activation probability of the WDs. For simplicity, we used the same activation probability for all WDs, i.e., $q_i = q$. The figure shows that the gap between the Oracle and the rest of the algorithms is highest for $q = 0.5$, when the randomness of the activations is highest. On the contrary, for $q_i = 1$, Oracle and SERM achieve the same reward because $\mathcal{N}^a(t) = \mathcal{N}$. BRM performs very close to SERM for all values of the activation probability, despite not having access to the WDs' parameters. Overall, we can conclude that the proposed BRM algorithm achieves high utility at low computational complexity.

RELATED WORK
A number of recent works deal with energy-efficient computation offloading for a single mobile user [33], [34], [35], [36], [37]. [33] proposes a system that enables energy-aware offloading to the infrastructure, with an algorithm that maximizes energy savings at minimal computational burden. [34] proposed CPU frequency scaling and transmission power adaptation to optimize the energy consumption of computing a task. [35] investigated cloud computing in terms of bandwidth use and energy consumption, and provided results obtained from an experimental platform (Amazon EC2); the results show that cloud offloading is sustainable in terms of energy consumption. [36] presents a dynamic offloading algorithm that achieves energy savings under time constraints. In [37], experimental results are used to show that battery power savings can be achieved using computation offloading. Inspired by these works, which show the potential energy savings of offloading, we consider a system-level optimization problem with an emphasis on the interaction between the WDs and the service operator, and provide a game theoretic analysis combined with online learning.
Going beyond offloading by a single device, a number of recent works proposed optimization approaches to minimize the cost of task execution for multiple mobile devices [38], [39], [40], [41]. The authors in [38] model the cost of the users as a combination of the energy consumption and the completion time, formulate the problem as a Markov decision process, and provide a near-optimal offloading policy. The authors in [39] study task partitioning to maximize throughput in processing streaming data. A two-tiered edge/cloud model with user mobility in a location-time workflow framework was considered in [40], and a heuristic was proposed to minimize the sum cost of mobile users. The authors in [41] consider the joint allocation of wireless and cloud resources and proposed an iterative algorithm to minimize the users' energy consumption. Unlike these works, which focus on the WDs' costs only, our model and problem formulation account for the financial incentives of the service operator as well, and provide a joint treatment of the problems faced by the WDs and by the operator.
Another line of works provides a game theoretic and optimization treatment of the computation offloading problem [42], [43], [44], [45], [46], [47]. [42] allows WDs to choose what share of their task to offload so as to minimize their energy consumption while meeting their delay constraint, while the cloud allocates resources accordingly.
[43] considers a model in which tasks arrive simultaneously at the cloud through a single wireless link and proposes a non-cooperative game among users that minimize their own energy use; the users are subject to execution deadlines and have user-specific channel bit rates. [45] considers a hierarchical MEC network, where mobile users can make offloading decisions, decide the uplink transmission power, perform cloud selection, and route the tasks; a distributed offloading approach is developed based on game theory, in which UEs collaborate with each other to minimize the network cost in terms of energy consumption and latency. [47] models the load-balancing problem as a stochastic congestion game in which each user aims to minimize its task execution time; the experiments show that the proposed algorithm can improve the load balancing of the cloud system and enhance the quality of service. Different from these works, our model considers service caching and pricing together with the optimization problem faced by the WDs, resulting in a Stackelberg game formulation.
Most related to ours are recent works that consider application caching and offloading [8], [9]. [8] formulates a Bayesian Stackelberg game, where the leader is the operator and the followers are the WDs. The operator aims to maximize its total revenue by choosing a price and the applications to cache, while the WDs aim to minimize their cost in terms of the charged price and the delay. [9] considers the joint optimization of computation, caching, and communication to an edge cloud and uses simulations to show that the proposed method achieves shorter completion times than other schemes. Our work is different from both of these works in terms of the modelling assumptions and the problem formulation. In [8], the authors do not consider slotted time, a dynamic population, or resource management. In [9], the authors do not consider slotted time and resource management, but they do consider dynamic task requests. Unlike our work, they do not analyze the interaction between the WDs and the operator, and instead formulate a cost minimization problem to be solved by the operator.
Contrary to the above works, we model the interactions between the WDs as a player-specific congestion game, and we model the interaction between the WDs and the operator as a Stackelberg game. We then analyze the existence of equilibria, and we propose an algorithm for calculating the optimal price for a given application placement under perfect and complete information. In addition, we consider the incomplete information case with a dynamic population of WDs, and we propose a novel Bayesian Gaussian Process Bandit optimization approach for joint pricing and caching.
Related in terms of methodology are recent works that propose to use BO in edge computing. In [48] the authors use BO for finding a trade-off between performance and energy consumption in virtual Base Stations (vBS), based on a GP model combined with contextual bandit optimization. The authors in [49] propose BO for learning the relationship between the cost and the run-time of serverless functions and the function instance configuration; they aim at minimizing the cost of using a serverless system from a single WD's perspective by choosing the memory allocated to the function. Different from [49], where the authors use expected improvement as the acquisition function, which may fail to find a good balance between exploration and exploitation for very rough target functions, we propose an acquisition function based on an upper confidence bound, so as to provide robustness despite a discontinuous target function. Different from these works, our proposed solution BRM employs BO with a GP by introducing a new acquisition function, and combines this with a novel heuristic that approximates the optimal service caching.

CONCLUSION
In this work we have provided a game theoretic analysis of pricing, application caching and computation offloading for edge computing.For the case of complete and perfect information, we showed that an equilibrium of offloading decisions and an optimal price for a particular caching decision can be computed in polynomial time, but the efficient computation of a strong Stackelberg equilibrium is infeasible due to the intricate interactions between caching decisions for different applications.We then analyzed the incomplete information case with a dynamic population of users, and proposed a novel Bayesian Gaussian Process Bandit optimization approach for joint pricing and caching.Our numerical results show that the proposed algorithm is computationally efficient, and it outperforms state-of-the-art combinatorial multi-armed bandit algorithms.Future directions of research include considering pricing for the use of wireless resources, heterogeneous pricing for computing resources, and WDs whose activity may be correlated over time or may be nonstationary.

Fig. 1. System with $N = 4$ WDs and $|\mathcal{J}| = 5$ apps. For each time slot the figure shows the active WDs, some of which offload (indicated by an arrow to the access point). The set of offloaders depends on the set $\mathcal{X}(t)$ of cached apps and the price $p(t)$; the decision is coordinated by a network orchestrator.

Fig. 2. Topology of the network congestion games used in the proof of Theorem 1.

Fig. 3. Average reward versus price for caching various applications: GP approximation versus actual.

Fig. 6. Number of explored sets of apps versus the number of apps ($|\mathcal{J}|$).

TABLE 1. Table of ...

At the same time there is at least one NE in which $\sum_{i \in \mathcal{N}_{\mathcal{X}}} a_i = m'$, thus by Lemma 4 we know that $\sum_{i \in \mathcal{N}_{\mathcal{X}}} a_i = m'$ for any $a \in a^*(\mathcal{X}, p)$. Clearly, in a NE with $m'$ offloaders, $a_i = 1$ for the WDs ... To compute the payoff for the operator, recall that the income $U(a_i, \{f_i\}, p)$ from WD $i$ offloading is independent of what the other WDs offload. Hence the set $\mathcal{N}^y$ of offloaders that maximizes the income of the operator is such that $\mathcal{N}_{\mathcal{X}}^o(p, m'+1) \subseteq \mathcal{N}^y$, and it contains the WDs with the highest ...

Algorithm 2. Calculating Optimal Price for Given $\mathcal{X}$. Data: $\mathcal{X}$, $P_t^{\mathcal{X}}$; Result: $p^*$, $U^*$. /* Calculate the operator's revenue for each $p$ */

TABLE 2. WDs' Parameters for Example 1.