Combining Lyapunov Optimization and Deep Reinforcement Learning for D2D Assisted Heterogeneous Collaborative Edge Caching

The problem of shared node selection and cache placement in wireless networks is challenging due to the difficulty of finding low-complexity optimal solutions. This paper proposes a new approach combining Lyapunov optimization and reinforcement learning (LoRL) to address content sharing in heterogeneous mobile edge computing (MEC) networks with base station (BS) and device-to-device (D2D) communication. Devices in this network can choose to establish D2D links with neighboring devices for content sharing or send requests directly to the base station. Content access and energy consumption of shared nodes are modeled as a queuing system. The goal is to assign content sharing nodes to stabilize all queues while maximizing D2D sharing gain and minimizing latency, even when the network state distribution and user sharing costs are unknown. The proposed approach enables edge devices to independently select associated nodes and make caching decisions, thereby minimizing time-averaged network costs and stabilizing the queuing system. Experimental results show that the proposed algorithm converges to the optimal policy and outperforms other policies in terms of total queue backlog trade-off and network cost.


Ziyi Teng, Student Member, IEEE, Juan Fang, Member, IEEE, and Yaqi Liu, Student Member, IEEE

I. INTRODUCTION
THE INCREASING number of smart devices joining wireless networks has led to a surge in wireless multimedia traffic [1]. However, a significant portion of this traffic consists of repeated requests for popular content such as news articles and TV shows. To address this issue, mobile edge computing (MEC) technology has emerged as a promising solution, allowing content retrieval from edge storage nodes like base stations or clouds. However, this approach often results in redundant data transmission within a short timeframe and is constrained by the cache capacity of edge devices.
To mitigate the limitations of storage capacity and duplication in edge devices, collaborative caching has been recognized as an effective solution [2]. The progression of integrated circuits has led to the integration of storage and computing abilities into edge devices. Consequently, content sharing via device-to-device (D2D) communication has become feasible. Furthermore, the integration of D2D communication with MEC caching can yield additional benefits such as improved spatial frequency reuse and boosted cellular network throughput. These benefits can lead to reductions in transmission and backhaul loads while augmenting the service probability of tasks [3], [4].
Extensive research has focused on optimizing user performance in MEC networks through collaborative caching and D2D caching. Content cache placement strategies have been specifically studied [5], [6], [7], [8], [9], [10], [11]. However, there are still unresolved issues that require attention. Firstly, existing studies mainly concentrate on improving cache performance through effective placement strategies. While this approach proves beneficial, there is a limit to the amount of content a device can transfer within a given time frame. When simultaneous content requests exceed the transmission capacity, delays occur and retrieval latency increases. To tackle this challenge, researchers propose using a virtual queuing system to represent access requests [10], [11]. This allows queue management to be optimized to mitigate transmission delays. Secondly, in collaborative networks, forwarding content sharing node requests incurs energy consumption costs. Virtual queues effectively represent node energy consumption dynamics. Therefore, ensuring stability in the consumption queue becomes crucial for overall system stability.
The selection of shared content delivery nodes and cached content replacement in a D2D-assisted MEC network are key issues. When a user's local cache cannot fulfill a request, it is necessary to determine which cache node (e.g., a neighboring user node, the local base station, or a neighboring base station) should handle the request and how to cache the content [3]. User requests can be routed to other users or accessible edge nodes, and dynamic queues represent the content access and energy consumption of all edge nodes. The objective is to select shared nodes for unsatisfied requests while ensuring the stability of the request and energy consumption queues. However, there are three main considerations when deciding on shared nodes and caching. Firstly, the data rate for content sharing via D2D links depends on user distance and channel conditions. User preferences can be challenging to understand, but machine learning methods can be employed to learn and predict preferences for better decision-making. Secondly, content sharing incurs transmission costs, necessitating consideration of both request delay and the energy consumption of network cache nodes to ensure network stability. Lastly, the cache replacement policy should adapt to the evolving wireless network environment, facilitating optimal replacement decisions based on changing network conditions.
Note that when confronted with a large number of users, simultaneously selecting shared nodes and making caching decisions for each user in the system poses a significant challenge. However, this challenge can be addressed by employing a stochastic Lyapunov optimization method that does not rely on prior knowledge of the network state distribution. Nevertheless, in our case, the uncertainty surrounding users' mobile locations, network channel states, and user preferences further complicates the problem. This increased complexity arises from the need to stabilize all the queues while also considering cache decision problems. Therefore, we introduce context-aware preference learning strategies and propose dynamic shared node selection and cache replacement methods that combine Lyapunov optimization and reinforcement learning (LoRL). Thus, the main contributions of this paper are as follows:
• Our approach integrates Lyapunov optimization theory and reinforcement learning to develop a novel method for shared node selection and cache replacement. By considering random fading channels and data arrival, our method enables intelligent decision-making for user delivery node selection and cache replacement. The primary objective is to minimize user request latency and energy consumption while ensuring the stability of the user request and energy consumption queues.

II. RELATED WORK
Existing research on edge cache optimization can be divided into: i) accuracy improvement of popularity prediction models; ii) joint optimization of edge caching and wireless resources; iii) collaborative caching.

A. Popularity Prediction
Due to the repetitive nature of content requests in the network, edge caching should cache content with high popularity. Cache placement policies based on popularity prediction have demonstrated caching effectiveness, and reactive caching [12], [13] and proactive caching [14], [15], which analyse past historical request information to obtain request patterns, have been extensively investigated. Hassine et al. [16] used Auto-Regressive and Moving Average (ARMA) models for centralised content popularity prediction. To overcome the sparse nature of user requests, Chen et al. [17] proposed a popularity prediction scheme based on weighted clustering and also described an explicit relationship between cache performance and popularity prediction accuracy. In addition, considering the private nature of user data, federated learning approaches have been used for edge caching policy optimization [3], [18]. Due to the overhead of learning-based approaches, some online popularity prediction approaches that do not require a training phase have also been proposed [12], [19]. However, while improving the accuracy of the popularity prediction model can improve caching performance to a certain extent, the channel conditions and network state in mobile edge networks also affect users' quality of service (QoS). Therefore, the relationship between cache performance and prediction accuracy is implicit, and the impact of popularity prediction errors on cache performance is difficult to estimate. Qian et al. [21] compared the most popular (MP) algorithm, which assumes a priori popularity knowledge of the user request model, with their proposed algorithm, and the MP algorithm's poor performance verifies this point.

B. Joint Caching and Resource Optimization
The decentralization of cache capacity in MEC networks leads to strong coupling between cache strategies and wireless communication resource management, so the caching problem is studied from the perspective of limited cache and wireless resources. Existing approaches to caching policies are classified as optimization-based, reinforcement-learning- and deep-learning-based, and game-theory-based. Optimization-based caching strategies are usually designed to maximize certain performance metrics within the constraints of network resources. For the complex joint optimization of wireless resources and caching, simple heuristic algorithms often require a long time and can only obtain suboptimal solutions. Therefore, existing research has proposed a Lyapunov optimization method for an online joint utility maximization and stability control framework. This method decouples multi-stage stochastic optimization problems into successive deterministic sub-problems for each stage, while providing theoretical guarantees for the long-term stability of the system [13], [20]. Strategies based on reinforcement learning and deep learning use observable user data or environmental states, such as user contextual information, channel gain, or cache state, for online caching decisions and resource allocation. Wireless channels can transmit only a finite amount of data per unit time, so proactive caching strategies have been investigated in order to maximise bandwidth utilisation [21], [22], [23], [24]. However, when the user request or environment state space is large, centralised reinforcement learning caching strategies are complex and difficult to handle, hence distributed reinforcement learning approaches have been proposed [25]. Game-theoretic caching strategies have been used for caching and computing resource allocation in MEC network environments, where service providers or users compete among themselves for limited computing and bandwidth resources to serve their own interests [26], [27].

C. Collaborative Caching
In MEC networks, collaborative caching is an effective approach to reduce network service load, improve service latency, and enhance spectrum usage efficiency by expanding cache capacity. Existing research divides collaborative caching into two types based on cache location: Coordinated Multi-Point (CoMP) and D2D caching. CoMP involves obtaining requested content from adjacent devices, base stations, or other caching devices. The optimization goal in this context is to jointly optimize content caching and delivery decisions, considering network constraints and aiming to minimize service latency or content retrieval costs [28], [29]. To address the joint optimization problem of user collaboration nodes and cache placement, a decoupling approach can be employed for dual-scale joint optimization [20], [30].
Decentralised cooperative sharing methods address the challenges of diverse network states in wireless networks. They aim to solve the node selection problem of centralised cooperative transmission effectively [18]. These methods decentralise decision-making, allowing nodes to independently select cooperative partners and make transmission decisions based on local information. In networks with caching capabilities on the user side, collaborative sharing can be performed between devices by establishing D2D communication [31]. However, the uncertainty of user requests and movement patterns makes it challenging to establish D2D connections. Therefore, learning-based methods predict users' movement trajectories as well as request patterns to enable dynamic delivery of content [8]. Further, user data is private and users may be unwilling to share it, while an effective local caching policy requires knowledge of user preference information. Therefore, to maximise the benefits for users, D2D content sharing approaches with social awareness and incentives have been proposed in existing studies [6], [27]. Moreover, in addition to horizontal collaboration between network cache nodes at the same tier, vertical inter-tier collaboration between cache nodes is also an important way to meet service demand by expanding cache capacity. Accordingly, some D2D-assisted heterogeneous collaboration approaches have been proposed to maximise spectrum efficiency and reduce request latency [3].

D. Our Contribution
Based on the aforementioned categorization, similar to the works in [3], [8], and [12], this paper investigates heterogeneous collaborative caching strategies supporting D2D assistance. Similar to the contributions in [14] and [15], we predict user preference popularity by analyzing historical request information. Similar to the contributions in [23], [24], [25], [26], [27], and [28], we utilize reinforcement learning for dynamic cache decision optimization. However, what sets our work apart from these contributions is that we employ predicted user preferences for D2D shared node selection. Furthermore, we combine Lyapunov optimization with reinforcement learning for user-associated node selection, cache decision-making, and maintaining the stability of the request latency queue and the cost consumption of the requesting network nodes.

III. SYSTEM MODEL

A. Network Model
Consider a wireless D2D-assisted heterogeneous collaborative network architecture, as shown in Fig. 1. The network architecture includes three types of cache nodes, namely user equipment (UE), base stations (BS), and cloud servers. The cloud server connects to all BS via backhaul links to provide services to users, and the BS serve users via cellular links. In addition to the traditional BS-to-user use of the BS's wireless spectrum for content delivery, the considered network architecture allows D2D links between users for content sharing. Given the abundant storage and computing resources of cloud servers, we assume that the cloud server has access to all the content that users may request within the storage area, denoted as F = {1, 2, 3, ..., F}, where F represents the total number of contents [2], [3], [5], and the content sizes are given by the 1 × F vector s = (s_1, ..., s_F). Each UE and BS has a restricted cache capacity to store content with high popularity. The cache capacity of user device u is M_u, where u ∈ N = {1, 2, ..., N} and N is the set of user tags. To simplify the model, we consider the existence of one BS in the network, serving the UE. In particular, the BS has a limited cache capacity, denoted as M_B.
Without loss of generality, the capacities of the network's cache nodes satisfy M_u ≤ M_B ≤ F. All cache nodes in the network architecture, except for cloud nodes, are represented as H = {N ∪ B}. In the network under consideration, operations are organized on a time-slot basis, represented as T = {1, 2, 3, ..., t}. The time axis is divided into equal intervals, referred to as slots, with a small duration denoted as Δt, and t ∈ T. Within a time slot t, all network parameters (e.g., user location, channel quality, content popularity) remain constant. This time-slot-based organization enables the analysis and optimization of network performance within well-defined and consistent time intervals. Let us define the position of user u during time slot t as l_{u,t} = {x_{u,t}, y_{u,t}}, where x_{u,t} and y_{u,t} represent the coordinates of user u in the given time slot. It is essential to emphasize that within any time slot t, each user can make at most one request, and this request must be fulfilled before the commencement of the subsequent time slot.

B. D2D Sharing Mode
In the considered heterogeneous collaborative network, cache node content sharing has a restricted physical range, so it is assumed that the service range of the base station is R_B and the range of D2D is bounded by the radius R_u with R_u ≤ R_B, i.e., πR_u² ≤ πR_B², and cache nodes beyond the service range cannot establish connections. In addition, user requests are numerous and diverse. Therefore, in order to improve the effectiveness of D2D links, we model the D2D connections from the physical domain and from user similarity, respectively, in the following.
Physical domain: Due to physical limitations such as signal attenuation, only users within the coverage radius can communicate via D2D [32]. Therefore, similar to [10], a graph G_p = {N, Y_p} is introduced, where N represents the set of vertices of all users and Y_p = {(u, v) | e^p_{u,v} = 1, ∀u, v ∈ N} represents the edge set. When e^p_{u,v} = 1, the distance between user equipment u and user equipment v satisfies L^t_{u,v} < R_u, i.e., v is within u's communication range; otherwise, e^p_{u,v} = 0. User similarity: Humans tend to follow a herd mentality in content acquisition, and each person has a different request preference P_u. Therefore, cosine similarity is introduced to represent the relationship between users' preferences. The cosine similarity between user u and user v is S_{u,v} = (P_u · P_v) / (‖P_u‖ ‖P_v‖). A high preference similarity indicates that the content stored in the two users' caches is more similar, and therefore the content requested by one user is more likely to be stored by the other.
User sharing probability: Based on the physical graph and the user similarity obtained above, we obtain the connection probability R_{u,v} = e^p_{u,v} · S_{u,v} between users. In the next sections, user preferences are predicted using a stacked AutoEncoder (SAE)-based algorithm, and then content delivery and latency models for D2D-assisted heterogeneous networks are investigated in Section III-D.
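To make the connection-probability construction concrete, the gating of preference similarity by physical reachability can be sketched as below; the function names and the hard distance threshold are our own illustrative assumptions:

```python
import numpy as np

def cosine_similarity(p_u: np.ndarray, p_v: np.ndarray) -> float:
    """S_{u,v}: cosine similarity between two users' preference vectors."""
    denom = np.linalg.norm(p_u) * np.linalg.norm(p_v)
    return float(p_u @ p_v / denom) if denom > 0 else 0.0

def sharing_probability(dist_uv: float, radius_u: float,
                        p_u: np.ndarray, p_v: np.ndarray) -> float:
    """R_{u,v} = e^p_{u,v} * S_{u,v}: the physical-domain edge e^p_{u,v}
    (1 iff the D2D distance is within R_u) gates the preference similarity."""
    e_p = 1.0 if dist_uv < radius_u else 0.0
    return e_p * cosine_similarity(p_u, p_v)

p = np.array([0.5, 0.3, 0.2])
print(sharing_probability(10.0, 30.0, p, p))  # identical preferences in range -> ~1
print(sharing_probability(50.0, 30.0, p, p))  # out of range -> 0.0
```

The preference vectors P_u here would come from the SAE-based learner described in the next subsection.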

C. User Preference Learning
In order to obtain the similarity between users, we assume that, within a time frame, users send preference information to surrounding cache nodes. User preferences, i.e., the request popularity of a user, are uncertain, and therefore obtaining numerical results for their popularity probability is very complex. Additionally, we assume that users' preferences for requested content follow an independent and identically distributed process. Unsupervised learning is a feasible approach to this problem. Therefore, in order to obtain the preference probability P_u of user u and further obtain the sharing probability R_{u,v} between users, we propose a contextual-information-based preference learning framework built on an unsupervised hybrid-filtering neural network model, as shown in Fig. 2.
The core of the proposed model is to minimise the discrepancy between the input and output, thus training a single-hidden-layer neural network to reconstruct the input data from the latent representations. The model consists of two main components: (i) an encoder that receives the input data and (ii) a decoder that outputs the results. The difference between the input and output is measured by the reconstruction loss L(w, b) = (1/m) Σ_i ‖D^(i) − D̂^(i)‖², where {D^(1), D^(2), ...} is the input dataset and each element D^(i) ∈ R^d has dimension d, mainly comprising the user's contextual information, such as name, gender, age, movie, and movie rating. The encoder represents the input data implicitly through the activation function h(·), mapping the input as z^(i) = h(wD^(i) + b), where {z^(1), z^(2), ...} is the output of the encoder and the parameters w and b are the weight matrix and bias vector, respectively.
The training of user preferences means that the model is continuously updated to minimise the reconstruction error over the input dataset. By training the SAE neural network, the hidden feature encodings z of the training data are obtained, and these features are used to calculate the user similarity. Since the content needed in the future, i.e., the popularity of the user's requests, may depend on the user's context, the hidden feature encoding z is combined with the user's contextual information to obtain the user's preferences, i.e., the user's own content request popularity.
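A minimal NumPy sketch of the reconstruction training loop described above: a single-hidden-layer autoencoder that learns a latent encoding z by minimising the input-output discrepancy. The toy data, layer sizes, learning rate, and iteration count are illustrative assumptions; the paper's SAE stacks several such layers and feeds real encoded contextual features:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, d, k = 64, 8, 3                  # samples, context dimension d, latent size k
D = rng.random((m, d))              # toy stand-in for encoded user context

W1, b1 = rng.normal(0, 0.1, (d, k)), np.zeros(k)   # encoder parameters (w, b)
W2, b2 = rng.normal(0, 0.1, (k, d)), np.zeros(d)   # decoder parameters
lr = 0.5

def forward(X):
    Z = sigmoid(X @ W1 + b1)        # latent encoding z = h(wD + b)
    return Z, Z @ W2 + b2           # linear reconstruction of the input

for _ in range(1000):               # gradient descent on the reconstruction error
    Z, D_hat = forward(D)
    err = D_hat - D
    dZ = err @ W2.T * Z * (1.0 - Z) # backprop through the sigmoid
    W2 -= lr * (Z.T @ err) / m; b2 -= lr * err.mean(0)
    W1 -= lr * (D.T @ dZ) / m;  b1 -= lr * dZ.mean(0)

Z, D_hat = forward(D)
mse = float(np.mean((D_hat - D) ** 2))
print(mse)                          # reconstruction error after training
```

The rows of Z are the hidden feature encodings that would then be used, together with the contextual information, to compute user similarity.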

D. Content Transmission and Delay Model
In a heterogeneous collaborative network, a user's cache list in time slot t can be represented as x_u = {x_{u,f} ∈ {0, 1}, u ∈ N, f ∈ F}. x_{u,f} = 1 means that content f is stored in the cache; otherwise, x_{u,f} = 0 means that the user does not store content f. When the user's requested content f cannot be satisfied locally, the user selects a node for content sharing from the network. We define b_{u,t} = {b^n_{u,t} ∈ {0, 1} | u ∈ N, n ∈ H\u} to denote the set of associated candidate nodes for user u. b^n_{u,t} = 1 indicates that user u selects cache node n to process its requests. Specifically, b^0_{u,t} = 1 indicates that the user obtains the requested content directly from the base station. Therefore, in D2D-assisted heterogeneous networks, there are several methods for obtaining content.
1) The user's request is saved in the local cache list, i.e., the request is satisfied locally. The corresponding user request delay is d^L_{u,t} = 0.
2) If a requested item is not in the local cache list, a D2D link can be established to obtain it from a nearby user n. In line with many existing studies [3], [5], [6], we adopt orthogonal models to allocate non-overlapping radio resources for D2D transmission. This scheme divides the bandwidth of each node into equal sub-bands and assigns one to each node, ensuring no interference among them. Thus, the request delay is d^D_{u,t} = s_f / r^n_{u,t}, where s_f denotes the size of the requested file and r^n_{u,t} denotes the data transfer rate between user u and user n, given by r^n_{u,t} = B^D log₂(1 + g^D_t h^D_t / σ²), with B^D, g^D_t, and h^D_t denoting the inter-user channel bandwidth, transmission power consumption, and channel gain, respectively, and σ² the noise power.
3) User u can also send a request to the local base station for content retrieval. If the base station can fulfill the request, the delay for file f is d^B_{u,t} = s_f / r^B_{u,t}. The transmission rate is r^B_{u,t} = B^B log₂(1 + g^B_t h^B_t / (L^B_{u,t} σ²)), where L^B_{u,t} represents the path loss between user u and the base station at time slot t, and B^B, g^B_t, and h^B_t are the bandwidth, transmission power consumption, and channel gain of the user-to-base-station link.
4) Eventually, when content cannot be obtained through direct sharing or local caching, the base station forwards the request to the cloud server. The latency of fetching the content from the cloud server is the sum of the base station delay d^B_{u,t} and the transmission time from the cloud server, i.e., d^C_{u,t} = d^B_{u,t} + s_f / r^c, where s_f is the size of the requested content and r^c is the constant transmission rate between the base station and the cloud server.
Therefore, the request latency of the above fetch methods satisfies d_{u,t} ∈ {d^L_{u,t}, d^D_{u,t}, d^B_{u,t}, d^C_{u,t}}. When the user makes the selection decision b^n_{u,t}, if the selected content sharing cache node cannot meet the request, it obtains the request from the same- or upper-layer cache node by the above collaborative method.
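The four retrieval paths can be sketched as follows. The Shannon-style rate expression and the noise power value are standard textbook assumptions standing in for the paper's exact formulas:

```python
import math

def shannon_rate(bandwidth_hz, tx_power_w, channel_gain, noise_w=1e-9):
    """Achievable rate r = B * log2(1 + g*h / noise), an illustrative SNR model."""
    return bandwidth_hz * math.log2(1.0 + tx_power_w * channel_gain / noise_w)

def request_delay(s_f_bits, cached_locally, r_d2d=None, r_bs=None, r_cloud=None):
    """Delay of the four retrieval paths of Sec. III-D (sketch)."""
    if cached_locally:
        return 0.0                                   # 1) local cache hit: d^L = 0
    if r_d2d is not None:
        return s_f_bits / r_d2d                      # 2) D2D neighbour: s_f / r^n
    if r_cloud is None:
        return s_f_bits / r_bs                       # 3) BS cache hit: s_f / r^B
    return s_f_bits / r_bs + s_f_bits / r_cloud      # 4) BS forwards to the cloud

r = shannon_rate(1e6, 0.1, 1e-6)     # 1 MHz band, 100 mW, -60 dB gain
print(request_delay(8e6, False, r_d2d=r))            # roughly 1.2 s for a 1 MB file
```

Because the D2D sub-bands are orthogonal in the adopted model, each rate can be evaluated independently, without cross-link interference terms.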

A. Queuing Model
When user u initiates a request that cannot be fulfilled locally, a content sharing node b^n_{u,t} is selected. Based on the chosen delivery node, the content access queue associated with node n is denoted as Q_n(t), and the dynamic backlog of all queues in the network is captured by the vector Q(t) = {Q_n(t) | n ∈ H}. The arrival rate of Q_n(t) represents the total size of content shared by node n as selected by users, i.e., a_n(t) = Σ_{u∈N} b^n_{u,t} s_f. The service rate μ_n(t) of the queue Q_n(t) is expressed as the average rate at which the current request is satisfied. Based on the above description, the dynamic request queue {Q_n(t)}_{n=1}^{H} at any time slot t evolves as Q_n(t+1) = max{Q_n(t) − μ_n(t), 0} + a_n(t). At the initial stage, the request queue maintained by node n is set to Q_n(0) = 0. In addition, since the system state is random, the system-dependent queuing vector process {Q_n(t)}_{t∈T} is also random.
When users select nodes, there is a cost associated with content sharing, related to the delivery rate of the content from the user to the selected node. We assume that the transmission cost is proportional to the data rate, a general performance metric that can be converted into other metrics such as battery life, transmission delay, and interference. Following similar calculations in prior studies [10], we assume that the transmission cost is charged per unit of data rate. Therefore, at any given time slot t, the transmission cost from the selected node n to user u can be expressed as C_n(b^n_{u,t}) = p^trans_{u,n} r^n_{u,t}, where p^trans_{u,n} denotes the transmission power consumption from user u to node n. To handle the network cost of the delivery nodes, virtual cost queues Y(t) = {Y_n(t) | n ∈ H} are introduced. At the initial stage, the cost queue maintained by node n is set to Y_n(0) = 0. Based on the above description, at any time slot t, the dynamic cost queue {Y_n(t)}_{n=1}^{H} evolves as Y_n(t+1) = max{Y_n(t) + ε(C_n(b^n_{u,t}) − v), 0}, where ε is a positive scaling factor. The dynamic queue Y_n(t) can thus be seen as having the random energy consumption C_n(b^n_{u,t}) as arrival and a fixed service rate v.
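The two queue recursions above can be sketched directly; the max{·, 0} (Lindley) update form follows the description, with ε and the numeric rates as illustrative inputs:

```python
def update_request_queue(q_n: float, arrival: float, service: float) -> float:
    """Q_n(t+1) = max{Q_n(t) - mu_n(t), 0} + a_n(t)."""
    return max(q_n - service, 0.0) + arrival

def update_cost_queue(y_n: float, cost: float, v: float, eps: float = 1.0) -> float:
    """Y_n(t+1) = max{Y_n(t) + eps*(C_n - v), 0}: virtual cost queue with
    positive scaling factor eps and fixed service rate v."""
    return max(y_n + eps * (cost - v), 0.0)

q = 0.0
for _ in range(3):                       # arrivals outpace service by 2 per slot
    q = update_request_queue(q, arrival=5.0, service=3.0)
print(q)                                 # backlog after 3 slots: 9.0
```

A backlog that grows without bound like this signals instability, which the Lyapunov controller of the next section is designed to prevent.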

B. Problem Formulation
In this paper, the overall objective is to make dynamic node selection and caching decisions that maximise D2D sharing gains while minimising user request latency. The specific problem under consideration can be formulated as follows.
1) D2D sharing gain: If the user's request cannot be satisfied locally (i.e., x_{u,f} = 0), then selecting the content sharing node b^n_{u,t} with n > 0 establishes a D2D link with adjacent user n to meet the request, yielding a D2D sharing gain G_D(b^n_{u,t}, x_{u,f}).
2) Content fetch gain: To account for the fact that a user equipment (UE) can only share content with a chosen node during a time slot, we introduce an average queue delay that is directly proportional to the serving UE. Consequently, the network's benefit from user acquisition requests can be expressed as G_F = e^{−τ d_{u,t}}, where τ is an introduced parameter. This negative exponential function highlights the inverse relationship between user request delay and gain: when the user request delay is low, a larger gain is obtained. Using this formulation, we can analyze the impact of user request delay on the network's benefit.
Based on the above description, the gain of the obtained content is represented as G(b^n_{u,t}, x_{u,f}) = λ₁ G_D + λ₂ G_F, where λ₁ and λ₂ are two introduced parameters satisfying λ₁ + λ₂ = 1 and 0 ≤ λ₁, λ₂ ≤ 1. These parameters represent the weights assigned to the D2D gain and the delay gain, respectively. Therefore, the goal of this article, problem P1, is to maximize the time-averaged expected gain subject to the stability of the request and cost queues. For ease of understanding, the notation used in this article is summarized in Table I.

A. Stochastic Lyapunov Optimization
Directly solving the aforementioned problem P1 becomes a challenging task without prior knowledge of the system's status, queue backlog, and network cost distribution. In addition, the choice of content sharing delivery nodes creates an imbalance between the request queue and delivery cost of caching nodes in the network. In order to tackle these challenges, we employ stochastic Lyapunov optimization methods to regulate the selection of content sharing nodes. The primary objective of this approach is to minimize the average network latency while ensuring the stability of both the request and cost queues within the network, all without relying on any prior knowledge. We start by defining the functions used in our analysis.
Definition 1 (Quadratic Lyapunov Function): In order to jointly control the request and network cost queues at any time slot t, the total queue is defined as Z(t) = {Q_n(t), Y_n(t)}, comprising {Q_n(t)}_{n=1}^{H} and {Y_n(t)}_{n=1}^{H}. The quadratic Lyapunov function L(Z(t)) of the random queuing process is equal to half the sum of the squares of the backlogs of all current queues, i.e., L(Z(t)) = (1/2) Σ_{n∈H} [Q_n(t)² + Y_n(t)²]. The Lyapunov function is a scalar measure of the total queue backlog in the network, with a smaller L(Z(t)) indicating a lower queue occupancy.
Definition 2 (Conditional Expectation Lyapunov Drift): At any time slot t, the conditional expectation Lyapunov drift is ΔH_t = E{L(Z(t+1)) − L(Z(t)) | Z(t)}, which describes the variation of the quadratic Lyapunov function, i.e., the degree of fluctuation of the function. A smaller ΔH_t indicates a more stable queue in the network system. Therefore, we choose to minimise ΔH_t at each time slot t to stabilise the whole network. However, to stabilize the request and energy consumption queues while also minimizing average latency, the expected gain E{G(b^n_{u,t}, x_{u,f})} must be added to ΔH_t [10], transforming the minimisation of ΔH_t into the minimisation of the drift-plus-cost function ΔH_t − V E{G(b^n_{u,t}, x_{u,f})} (13), where the weight V ≥ 0 balances the impact on network cost and network stability. In the following, we obtain an upper bound on the drift-plus-cost in (13) within an arbitrary time slot t. Firstly, we bound the one-slot change of each queue; secondly, by combining equations (4) and (6), equation (12) can be rewritten accordingly. Therefore, at any slot t, the drift-plus-cost function in (13) is upper-bounded as in (17), where B is a constant independent of V and the Q_n(t) and Y_n(t) in the total queue Z(t) = {Q_n(t), Y_n(t)} are independent of each other; accordingly we define B = B₁ + B₂, where B₁ and B₂ are obtained separately from (18) and (19). The terms of the second inequality in (18) and (19) relate to the content sharing node b^n_{u,t} selected by user u. Instead of minimizing the drift-plus-cost function in (13), we minimize its upper-bound function. Therefore, in order to minimize the right-hand side of inequality (17), the current request queue Q(t) and energy cost queue Y(t) in time slot t must be considered, which can be achieved by selecting the appropriate shared delivery node b^n_{u,t}. We then obtain the following optimization problem P2: max Ω(b^n_{u,t}, x_{u,f} | Z(t)), where Ω is the resulting objective function.
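A toy sketch of the resulting per-slot rule: evaluate the quadratic Lyapunov function and greedily pick, for a given user, the delivery node maximising a drift-plus-penalty score. The score V·gain − Q_n·arrival − Y_n·cost is an illustrative stand-in for the exact bound objective Ω of P2, whose coefficients come from (17)-(19):

```python
import numpy as np

def lyapunov(Q: np.ndarray, Y: np.ndarray) -> float:
    """L(Z(t)) = (1/2) * sum_n [Q_n(t)^2 + Y_n(t)^2]."""
    return 0.5 * (np.sum(Q ** 2) + np.sum(Y ** 2))

def greedy_node(Q, Y, arrivals, costs, gains, V: float) -> int:
    """Pick the node maximising an illustrative drift-plus-penalty score:
    V*gain - Q_n*arrival - Y_n*cost (heavily backlogged nodes are avoided)."""
    return int(np.argmax(V * gains - Q * arrivals - Y * costs))

Q = np.array([10.0, 2.0])          # node 0 already carries a large request backlog
Y = np.array([1.0, 5.0])
gains = np.array([3.0, 3.0])
arrivals = np.array([4.0, 4.0])
costs = np.array([1.0, 1.0])
print(lyapunov(Q, Y))              # 65.0
print(greedy_node(Q, Y, arrivals, costs, gains, V=1.0))  # chooses node 1
```

A larger V weights gain over queue stability, mirroring the V-controlled backlog/cost trade-off in (13).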

B. Deep Reinforcement Learning for Shared Delivery Node Selection and Cache Replacement
Notice that problem P2 is linear and therefore decomposable. In particular, we can decompose it into N subproblems, one for each user u ∈ N, which the users can solve separately in parallel. Solving P2 is therefore synonymous with finding the optimal content delivery node and caching policy, which we denote problem P3. It is worth noting that users need to make dynamic node selection and caching decisions under constantly changing channel conditions. We transform the joint optimization problem P3 of content delivery node selection and cache replacement into a Markov decision process (MDP), as follows. State: The state of user u at time t can be expressed as s_u^t = {Y_u(t), Q_u(t), R_u, h_u(t)}, where Y_u(t) and Q_u(t) denote the energy consumption queue and request queue of user u, respectively. In addition, R_u is the probability of user u establishing a D2D connection with a neighboring user, and h_u(t) denotes the channel gain from the requesting user to the other delivery nodes.
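Because the objective of P2 is a sum of per-user terms, exhaustive joint search and independent per-user maximization return the same solution; the toy illustration below, with hypothetical gain tables, shows why the decomposition into parallel subproblems is lossless.

```python
from itertools import product

def solve_jointly(gains):
    # gains[u][a]: objective contribution when user u takes action a.
    # Brute-force search over the joint action space (exponential in users).
    users = list(gains)
    best, best_val = None, float("-inf")
    for combo in product(*(range(len(gains[u])) for u in users)):
        val = sum(gains[u][a] for u, a in zip(users, combo))
        if val > best_val:
            best, best_val = dict(zip(users, combo)), val
    return best

def solve_per_user(gains):
    # The objective is separable, so each user independently maximizes its
    # own term; this is the decomposition that lets users solve in parallel.
    return {u: max(range(len(g)), key=g.__getitem__) for u, g in gains.items()}
```

Both functions agree on any separable instance, while the per-user solver is linear rather than exponential in the number of users.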
Action: After receiving the state s_u^t, the agent selects the content-sharing transmission node and decides whether to replace a file. The action can therefore be expressed as a_u^t = {b^n_{u,t}, I_u^t}, where b^n_{u,t} is the indicator identifying the selected node (e.g., one of the UE's neighboring nodes or the base station), and I_u^t indicates whether a file in the cache list needs to be replaced.
Reward: Our goal is to maximize the gain from D2D content sharing while reducing user request latency and maintaining the stability of the request and energy queues in the system. Therefore, we set the reward as the objective of problem P3, i.e., r_u = Ω_u(b^n_{u,t}, x_{u,f} | Z(t)). In general, obtaining the optimal action, including the content-node selection decision (b^n_{u,t})* and the cache replacement I_u^t, involves exploring up to 2^(N+M_u) possible decisions, which incurs significant computational complexity even when N is very small. As a result, a reinforcement learning approach is employed to enable online shared-node selection and cache decisions. Specifically, we use the deep deterministic policy gradient (DDPG) algorithm to dynamically make node selection and cache decisions, thus solving the formulated MDP.
Fig. 3 illustrates the DDPG architecture for addressing the formulated MDP. The DDPG algorithm employs two independent deep neural networks (DNNs) following the actor-critic paradigm: an online network and a target network. In each selection period t, the state s_u(t) is fed to the actor of the online network. To ensure comprehensive exploration of the environment while balancing exploration and exploitation, a Gaussian noise vector is added to the output of the action policy function. After executing action a_u(t), the resulting reward r_u(t) and the subsequent state s_u(t+1) are observed. The observed transition (s_u(t), a_u(t), r_u(t), s_u(t+1)) is saved to the experience pool to facilitate subsequent network learning.
The target network, which acts as a delayed replica of the online network, progressively tracks the acquired knowledge and updates its parameters through soft updates. Throughout training, the agent randomly samples from the experience replay pool. To keep the Q-value output of the critic network close to the actual value, the Q-value estimated by the critic is employed within the target network, and the mean-square-error loss function is used to guide training. The model is trained for a fixed number of episodes, and in each episode the target network performs a soft update (Algorithm 1: Combining Lyapunov Optimization and Reinforcement Learning (LoRL) for Solving P3).
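The two DDPG ingredients described above, the experience replay pool and the soft (Polyak) update of the target network, can be sketched as follows; capacities, shapes, and tau are illustrative values, not the paper's settings.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Experience pool storing (s, a, r, s_next) transitions."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)  # oldest entries are discarded

    def push(self, s, a, r, s_next):
        self.buf.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random minibatch, as used for off-policy DDPG training.
        return random.sample(self.buf, batch_size)

def soft_update(target_params, online_params, tau=0.005):
    # Polyak averaging: theta_target <- tau*theta + (1 - tau)*theta_target,
    # making the target network a slowly tracking, delayed replica.
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]
```

A small tau keeps target Q-values slowly varying, which stabilizes the critic's regression targets during training.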

C. Complexity Analysis
The computational complexity analysis of the proposed LoRL scheme is as follows. The execution of the LoRL algorithm consists of two parts: joint user association and cache decision, and policy update. Of these, joint decision generation is performed in every time frame, while policy updates are less frequent. Therefore, we focus on the complexity of policy decision generation in each time frame.
Careful observation reveals that within each time slot, the algorithm's complexity comprises the computation of the similarity for each user in (12), the connection probability between users, and the update of the DDPG model parameters. Specifically, the complexity of computing the similarity and user connection probability is O(2N), where N is the number of users.
Additionally, the time complexity of the DDPG algorithm, which mainly consists of initialization, memory replay, and four deep neural networks, is as follows. The state is initialized at the beginning of each training episode, with time complexity O(K). Both the actor and the critic are designed as fully connected networks; assume the actor has N_A fully connected layers and the critic has N_C fully connected layers. The per-step cost of each network is the sum of the products of consecutive layer widths, so the time complexity of DDPG is O(K + \sum_{i=1}^{N_A} n_{i-1} n_i + \sum_{j=1}^{N_C} m_{j-1} m_j), where n_i and m_j denote the widths of the actor's and critic's layers, respectively. Based on the above, the time complexity of the proposed LoRL algorithm per time frame is O(2N + K + \sum_{i=1}^{N_A} n_{i-1} n_i + \sum_{j=1}^{N_C} m_{j-1} m_j).
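The dominant per-pass cost of a fully connected network is the sum of products of consecutive layer widths; the small helper below makes this counting explicit (the layer sizes in the usage are illustrative, not the paper's configuration).

```python
def fc_complexity(layer_sizes):
    # Multiply-accumulate count of one forward pass through a fully
    # connected network: sum over products of consecutive layer widths.
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def ddpg_step_complexity(actor_layers, critic_layers):
    # One DDPG step touches both the actor and the critic, and their
    # target copies have identical shapes, hence the factor of 2.
    return 2 * (fc_complexity(actor_layers) + fc_complexity(critic_layers))
```

For example, an actor with widths [4, 8, 2] costs 4*8 + 8*2 = 48 multiply-accumulates per forward pass.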

A. Parameter Setting
In our simulation study, we utilized the MovieLens 1M dataset [34] to model the request behavior in the network. This dataset contains user ratings for a total of 3952 movies; each record includes a user ID, a movie ID, a rating, and a timestamp. Since user ratings are typically provided after viewing, we treated these ratings as request records for our simulation. To calculate user preferences, we divided the dataset into two parts: the period from January 1, 2000 to April 13, 2002 was used as a historical training set to obtain user preferences, and the remaining data served as the test set to evaluate the performance of our algorithm. The content database F consists of the 3952 movies contained in the dataset. To reflect the degree of queue backlog in the system network, we set the total number of requests to 10,000 for each number of users. We set the default cache sizes of user nodes and base stations to 40M and 100M, respectively; these cache sizes determine the amount of content that can be stored locally at each node.
In our simulation, we modeled the user's movement trajectory using the Random Waypoint model [35], a widely applied and proven approach to simulating user mobility that is also common in caching-related studies [36], [37]. This model generates a random coordinate location for each user in each time slot within a specified region. For D2D connections, we assumed that users can establish links within a physical range limit of 100 units; users can connect to the base station within a range of 500 units. The path loss in the network was modeled as 36.8 + 36.7 log(d), where d represents the distance between user cache nodes, and the small-scale fading was modeled as unit-variance Rayleigh fading. Other network parameters include a channel bandwidth of 20 MHz and a background noise level of -95 dBm. In addition, each user runs a DRL agent with a three-layer neural network. All agents use the Adam optimizer with adaptive learning rates, starting from a learning rate of 10^-2. For the specific DDPG simulation parameters, please refer to Table II.
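The mobility and link model above can be sketched as follows; we assume the path-loss logarithm is base 10 (as is conventional for dB models) and treat the region size as a free parameter rather than a value given in the paper.

```python
import math
import random

def random_waypoint_step(region=(1000.0, 1000.0), rng=random):
    # Draw a fresh random coordinate inside the simulation region for the
    # current time slot (a simplified Random Waypoint step).
    return (rng.uniform(0.0, region[0]), rng.uniform(0.0, region[1]))

def path_loss_db(d):
    # Path-loss model used in the simulation: 36.8 + 36.7*log(d),
    # with log taken as base 10 (assumed here).
    return 36.8 + 36.7 * math.log10(d)

def can_connect(pos_a, pos_b, limit=100.0):
    # D2D links are allowed only within the 100-unit physical range;
    # pass limit=500.0 for the base-station range.
    return math.dist(pos_a, pos_b) <= limit
```

For instance, at d = 10 units the model gives 36.8 + 36.7 = 73.5 dB of path loss.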

B. Content Request Preference Analysis
In our experiment, we simplified the dataset by extracting the user IDs with the highest number of requests and then selected a specific number of users for simulation. User preference in this context refers to users' focus on requesting certain types of files. To illustrate how user preference changes with the number of users, we consider user counts of [20, 40, 60, 80, 100, 120, 140]. Fig. 4 shows the distribution of requested content IDs for the different numbers of users. From the figure, we observe that content popularity exhibits certain preferences: the requested content IDs cluster within specific ranges, e.g., [500-1000], [1250-1500], [2000-2500], and [3000-3250]. This indicates that certain content types or categories are more popular among users, and their preferences can be observed from the requested content IDs.

C. Baseline Schemes and Performance Metrics
To evaluate the performance of the proposed LoRL algorithm under different parameters, we consider the following baseline schemes: 1) LRU: randomly select content-sharing nodes within the communicable range; the earliest-stored content is replaced with new content. 2) MLPLRU [38]. 3) LoRL with no preference: a variant of LoRL that ignores user preferences and solely focuses on the connection probability between users. 4) DAC: a delay-aware D2D caching (DAC) [39] algorithm that targets request latency. 5) GHM: the greedy heuristic method (GHM) [10], which searches within a limited number of user/file pairs to maximize the target D2D gain and delay value. To evaluate these solutions, we use the following performance metrics: (i) hit rate (satisfied by local cache, D2D sharing, or BS); (ii) average delay; (iii) queue backlog; (iv) network cost; (v) D2D offloading rate.
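For reference, the replacement rule of the LRU baseline can be sketched as below; capacity is counted in files rather than bytes for simplicity, and the random node selection part of the baseline is omitted.

```python
from collections import OrderedDict

class LRUCache:
    """Minimal LRU replacement: when a newly requested file does not fit,
    the least recently used content is evicted."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()  # insertion/access order tracks recency

    def request(self, content_id):
        hit = content_id in self.store
        if hit:
            self.store.move_to_end(content_id)   # refresh recency on a hit
        else:
            if len(self.store) >= self.capacity:
                self.store.popitem(last=False)    # evict the oldest entry
            self.store[content_id] = True         # cache the new content
        return hit
```

Strictly, the baseline as described replaces the earliest-stored content on a miss; the sketch refreshes recency on hits, the usual LRU behavior.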

D. The Impact of Weight V
Figures 5-7 verify the analysis related to the Lyapunov optimization bound established in (17). Figures 5 and 6 show the dynamics of the time-averaged network cost and the sum of queue backlogs for different values of the weight V in LoRL over the period t = 1, ..., 100. It is observed that LoRL converges to stable delivery-cost and queue-backlog levels around time slot t = 30, and that weight V = 10 yields lower network cost and queue backlog. Figure 7 illustrates the target value, i.e., the time average of the sum of transmission cost and queue backlog, as a function of the weight V. The graph demonstrates the trade-off between queue backlog and network cost: a higher queue backlog prompts the selection of delivery nodes with lower transmission delays to drain the queues, thereby increasing delivery costs.
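These trends match the generic performance guarantees of drift-plus-penalty control; as a hedged reminder, the bounds below are the standard generic form with the constant B from (17), not results re-derived in this paper:

```latex
% Generic [O(1/V), O(V)] trade-off of drift-plus-penalty control:
% the time-averaged cost is within B/V of the optimum G^{*},
\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}
  \mathbb{E}\{G(t)\} \;\le\; G^{*} + \frac{B}{V},
% while the time-averaged total backlog grows linearly in V:
\limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}
  \mathbb{E}\Big\{\textstyle\sum_{n=1}^{H}\big(Q_n(t)+Y_n(t)\big)\Big\} \;=\; O(V).
```

This is why larger V lowers the time-averaged cost in Fig. 5 at the price of the larger backlogs seen in Fig. 6.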

E. The Impact of User Numbers
Results in Figures 8-12 describe the performance of the cache strategy in terms of cache hit ratio, time-averaged latency, time-averaged delivery cost, and time-averaged queue backlog under different numbers of users. Figure 8 shows that the cache hit rate rises with the number of users. As depicted in Figures 9-11, with the expansion of the network user scale, the queue backlog, transmission cost, and average request delay decrease accordingly. This can be attributed to the increased opportunities for D2D communication resulting from a larger number of users, which enhances the effectiveness of collaborative caching. It is worth noting, from Figure 12, that as the number of users increases the D2D offloading rate shows a slow decreasing trend. This is mainly because our proposed mechanism combines preference-aware node selection with cache decisions: the cache decision is biased toward local requests, increasing the probability of local cache hits and thereby reducing the proportion of D2D offloading. Furthermore, although the proportion of D2D offloading decreases, the proposed mechanism still outperforms the other strategies.

F. The Impact of Cache Size
In order to compare the performance of LoRL with other caching strategies, Figures 13-17 present the hit rate, request latency, time-averaged network cost, sum of queue backlogs, and D2D offloading ratio as functions of cache size. It is evident that the cache hit ratio improves with larger cache sizes, while the queue backlogs, transmission delays, and network costs decrease as the user cache size increases. As can be seen from these figures, LRU and MLPLRU are not very effective in minimizing content delivery delay, transmission cost, and queue backlog: they do not consider the popularity of file content, only update the cache according to local requests, and randomly select shared nodes from within the communicable range. The GHM and DAC algorithms are both greedy node-selection algorithms that can improve the content-sharing rate by optimizing D2D node selection, but they do not consider the local cache decision, so their local cache hit rates are low. We also observed that DAC and GHM are inefficient, degrading performance at the expense of queue backlog and network transmission cost. The main reason is that when the number of users is fixed, the cache nodes with which a user can establish D2D links are limited, so GHM's advantage over DAC is not particularly obvious. In particular, LoRL has a clear performance advantage over LoRL with no preference, better selecting shared nodes with high D2D gains and giving the network lower queue backlogs and transmission costs.

VII. CONCLUSION
In this paper, we introduce a novel approach for selecting shared nodes and making cache decisions in D2D-assisted heterogeneous collaborative edge computing networks. Our approach involves formulating a joint optimization problem and utilizing a Lyapunov optimization algorithm to decouple the problem. To enable intelligent user association and caching decisions, we propose a content caching algorithm based on the deep deterministic policy gradient (DDPG) [33]. The algorithm aims to minimize user request latency while ensuring the stability of the request and energy consumption queues in the system. To evaluate the effectiveness of our proposed algorithm, we conduct an extensive study and compare it with five baseline schemes. The results demonstrate that our algorithm surpasses the baseline schemes in terms of average content download latency and system queue stability. In future work, we intend to delve deeper into the fine-grained characteristics of users and requests, enabling more informed user association and caching decisions and ultimately enhancing the overall performance of the system.

Fig. 2. A framework for learning user preferences based on contextual information.

Fig. 7. Time averages of the content transfer cost and the sum of queue backlogs in LoRL as functions of weight V.

Fig. 8. The cache hit ratio varies with the number of users.

Fig. 9. Time-average transmission delay varies with the number of users.

Fig. 10. Time-averaged sum of queue backlogs varies with the number of users.

Fig. 11. Time-averaged network cost varies with the number of users.

Fig. 12.The Device-to-Device (D2D) offloading rate varies with the number of users.

Fig. 13. The cache hit ratio varies with cache size.