Joint Content Update and Transmission Resource Allocation for Energy-Efficient Edge Caching of High Definition Map

Caching the high definition map (HDM) on the edge network can significantly alleviate the energy consumption of the roadside sensors that frequently conduct the operations of traffic content updating and transmission, and such operations also have an important impact on the freshness of the content received at each vehicle. This paper aims to minimize the energy consumption of the roadside sensors while satisfying the vehicles' requirements on HDM content freshness by jointly scheduling the edge content updating and the downlink transmission resource allocation of the Road Side Unit (RSU). To this end, we propose a deep reinforcement learning based algorithm, namely the Prioritized Double Deep R-Learning Network (PRD-DRN). Under the PRD-DRN algorithm, the content update and transmission resource allocation are modeled as a Markov Decision Process (MDP). We take full advantage of deep R-learning and prioritized experience sampling to obtain the optimal decision, which minimizes the long-term average cost related to content freshness and energy consumption. Extensive simulations are conducted to verify the effectiveness of the proposed PRD-DRN algorithm, and to illustrate its advantage in improving content freshness and reducing energy consumption compared with baseline policies.


I. INTRODUCTION
The High Definition Map (HDM) is an essential tool that helps autonomous vehicles make path planning and related driving decisions [1], [2], [3]. Generally, the HDM can be roughly divided into two layers, namely the static layer and the dynamic layer [9]. The static layer contains the road topology information stored in the remote cloud platform or pre-cached onboard, while the dynamic layer contains the real-time traffic condition of the specific road section, which requires frequent information updates from the roadside sensors. To reduce the file response latency and the sensor energy consumption of remote data transmission, a promising solution is to cache the dynamic layer files of the HDM at different network edges [4], [5], [6], [7], [8]. However, frequent file updates still bring high energy consumption to the roadside sensors due to traffic condition perception and data transmission. Actually, a large proportion of updates is unnecessary if the HDM files arriving at the destined vehicles can already meet the vehicular freshness requirement. Therefore, a dedicated study is needed to explore how to relieve the energy consumption of the roadside sensors while keeping a relatively high file freshness.
Note that most of the above mathematical optimization theories rely on extra information about the models and the environment, which is difficult to obtain in a practical scenario. Learning-based methods can overcome this disadvantage: they consider the more realistic case where the environment information is unknown, and obtain the optimal decision through interaction with the environment. Most of the existing learning-based methods transform the proposed problem into a Markov decision process (MDP), and utilize model-based reinforcement learning (RL) (e.g., the value iteration algorithm) or model-free RL methods (e.g., deep Q-learning) to obtain the optimal policy [41], [42]. Traditional Q-learning is an effective method to solve MDP problems [43], which utilizes a two-dimensional Q-table to evaluate the system performance of actions taken in each state. However, when applied to a large-scale RL problem, it faces the curse of dimensionality due to the large state or action space and becomes ineffective. Therefore, the combination of Q-learning and the deep neural network (DNN), called DQN, has been proposed to approximate the Q-function of state-action pairs and execute automatic learning over a large system state [44]. The goal of the natural DQN and its variants is to maximize the long-term discounted reward by utilizing a DNN as an approximation function to learn the policy and state value. Their execution usually contains three processes, namely agent exploration and exploitation, offline training, and online decision [45]. During the exploration and exploitation process, the agent interacts with the environment and performs the optimal actions according to the greedy policy. It caches the experience in the replay buffer once it performs the relevant action. When the agent has obtained enough training experiences from the environment, it starts to train the network model with the cached experiences. Once the network performance has met the required criteria, the agent can run the trained DQN model online to make optimal decisions based on the given MDP. The learning-based methods on AoI optimization mainly focus on minimizing the AoI and the related file update cost (such as energy consumption, transmission latency, etc.) by finding optimal status update policies [28], [29], [30], [33], [34] or transmission-related optimizations [35], [36].
However, the status update policy and the relevant transmission optimization are considered separately in the existing works. In general, reasonable transmission resource allocation can reduce the number of instant updates when the real-time AoI of a request cannot meet the user's requirement. This means that more transmission resources can be allocated to a user whose requested file's AoI is approaching its AoI requirement threshold. Therefore, jointly scheduling the HDM content update and the transmission resource allocation in the dynamic edge caching system is a promising way to reduce the number of file updates while keeping a relatively high file freshness. In this paper, we investigate how to satisfy the vehicular AoI requirements while maintaining a relatively low energy consumption of the battery-powered roadside sensors in the edge HDM caching scenario, by jointly scheduling the content update and the downlink transmission resource allocation. We propose the PRD-DRN algorithm, which combines the strengths of prioritized double deep Q-learning [47], [61] and R-learning [46]. In the proposed algorithm, the agent interacts with the environment and executes the optimal scheduling action to maximize the long-term average system reward without tuning a discount factor.
The main contributions of this paper can be summarized as follows.
• We first model the joint scheduling problem of the content update and the downlink transmission resource allocation in the HDM edge-caching scenario as an MDP, which captures the real-time AoI of the edge-cached content and the AoI difference of the vehicles' requested files in nonuniform decision epochs. During each decision epoch, the system cost is derived as the sum of each vehicle's AoI-related cost, that is, the AoI difference, and the sensor energy consumption brought by the content updating.
• We further propose the PRD-DRN algorithm to adaptively solve the scheduling problem when the vehicular request patterns and the environment dynamics are unknown. The PRD-DRN algorithm inherits the properties of R-learning, which obtains the maximal long-term average reward without the discount factor tuning required by traditional DQN-based algorithms.
• Extensive simulations are conducted to verify the PRD-DRN algorithm, and to illustrate its improvement in content freshness and energy consumption compared with baseline policies such as heuristics and traditional DQN-based policies.

The rest of the paper is organized as follows. Section II summarizes the related works. Section III introduces our network model and formulates the problem. In Section IV, we transform the scheduling problem into an MDP and propose the PRD-DRN algorithm to solve it. Extensive simulation results are provided in Section V. Finally, Section VI concludes this paper.
Later, researchers found that the pursuit of AoI minimization in environmental monitoring systems inevitably increases the energy consumption of the sensors [34]. Relevant works focus on improving the AoI performance with lower energy consumption by utilizing different optimization theories [19], [20], [21], [22], [23], [25], [26], [27]. In [19], the authors optimize the update rate to avoid unnecessary updates and reduce the energy consumption of the sensors in a monitoring system. The authors of [20] explore the optimal online status update policies in the finite battery scenario and the infinite battery scenario, respectively. A renewal structure is also proposed in the finite battery scenario to give the order of sensor charging and status updates. In [21], the authors prove that erasure status feedback is beneficial for online timely updating when the available energy of the sensors is limited. The authors of [22] solve the AoI-energy optimization problem from a communication perspective, where optimal transmission policies for two-hop networks are investigated. In [23], the authors investigate optimal status update policies under different battery recharge models. In [25], the authors investigate the age-energy tradeoff of IoT monitoring systems and adopt a Truncated Automatic Repeat reQuest (TARQ) scheme. The authors of [26] focus on the average AoI and energy cost of Low Density Parity Check (LDPC) coded status updates over Additive White Gaussian Noise (AWGN) and Rayleigh fading channels; by utilizing renewal process theory, expressions for the average AoI and energy cost are derived. In [27], the authors investigate how to realize a tradeoff between AoI and energy consumption over an error-prone channel by taking sleep and retransmission modes into consideration.
Recently, learning-based methods [28], [29], [30], [31], [32], [33], [34], [35], [36] have also been applied to optimize AoI-related caching problems. These works usually transform the proposed problem into a Markov decision process (MDP), and utilize model-based reinforcement learning (RL) (e.g., the value iteration algorithm) or model-free RL methods (e.g., deep Q-learning) to obtain the optimal policy. The authors of [31] and [32] investigate achievable optimal information sampling and updating strategies that minimize the AoI in environment monitoring systems. The authors in [28], [29], [30], [33], [34] study status update control problems under different scenarios where the energy-related information of the sensors is unknown, proposing various file update policies to optimize the AoI performance with low energy cost. The authors of [35] propose an optimal transmission mode selection scheme to realize a tradeoff between AoI and energy consumption. In [36], the authors aim to minimize the AoI by controlling the network's actions under an unknown network topology and delay distribution. In [40], the authors investigate the age-energy tradeoff in fading channels with packet-based transmissions, and solve the problem using Bellman optimality equations.

TABLE II PARAMETERS
We summarize the characteristics of the aforementioned works in Table I. The parameters used in this paper are provided in Table II.

A. Network Model
As illustrated in Fig. 1, we consider an HDM dynamic layer edge caching scenario consisting of a single Road Side Unit (RSU), F traffic information acquisition roadside sensors, and several vehicles in the RSU's coverage range. The RSU is equipped with H_b transmission resource blocks for downlink data transmission. Each roadside sensor is responsible for refreshing the relevant HDM file, of identical size l, cached on the RSU. The vehicle and HDM dynamic file sets are denoted by N = {1, 2, ..., N} and F = {1, 2, ..., F}, respectively. To deal with the real-time network changes brought by the vehicular requests, we consider a time-slotted system, where each time step t is divided into equal-sized time slots τ based on the practical demand. At each time step, the RSU may receive vehicular HDM file requests, and it then decides whether to pull the up-to-date states of HDM files from the relevant sensors based on the file AoI demands of the vehicles. If the demands can be satisfied, the RSU responds to the vehicular HDM file requests with its locally cached files. Otherwise, it provides the updated HDM files from these sensors to the vehicle. Notably, the proposed model with a single edge node is also suitable for the multiple-edge-node scenario with non-overlapping coverage, which is widely used in previous works [9], [15], [17].
In any time step t, the query detail of the request of vehicle n can be represented as a query profile. The RSU can obtain the query profiles of all the vehicles in time step t, but it has no prior knowledge of the vehicular request arrival rates or the popularity of each cached HDM file.
Once the RSU receives the query profiles D(t), it makes the HDM dynamic file update decision based on the requested HDM files and the relevant AoI demands. We use U(t) = {u_1(t), u_2(t), ..., u_F(t)} to represent the HDM file update decision in time step t, where u_f(t) ∈ {0, 1}, f ∈ F. Here, u_f(t) = 1 indicates that the RSU decides to refresh file f and pull its up-to-date state from the relevant sensor in time step t; otherwise, u_f(t) = 0. The RSU selects file f to refresh in time step t based on the comparison between its real-time AoI on the RSU and the AoI demand in the query profile D(t). Notice that, even when there is no query for file f in the query profiles D(t), file f may still be updated to reduce the transmission delay caused by a temporary request, provided there are available uplink transmission resources. The RSU then responds to the vehicular requests with its cached HDM files.
In our network model, we consider that the RSU is assigned limited transmission resource blocks, which can be optimally allocated to the downlink transmission from the RSU to each vehicle (V2I) based on the service requirements of the vehicle's request. The maximum transmission rate μ_n(t) of vehicle n in time step t is given by

μ_n(t) = H_n(t) · B · Γ_n(t),

where B is the available bandwidth of each transmission resource block, H_n(t) is the number of consecutive resource blocks allocated to vehicle n, and Γ_n(t) is the spectrum efficiency of vehicle n associated with the RSU. Here, the unit of μ_n(t) is KB per second. Thus, the file transmission latency from the RSU to vehicle n in time step t is l/μ_n(t). We consider a realistic time-varying channel between each vehicle and the RSU, modeled as a finite-state Markov channel (FSMC) [52] without loss of generality. The spectrum efficiency is divided into Z levels. Let Z = {γ_0, γ_1, ..., γ_{Z−1}} denote the state space of the spectrum efficiency. In each time step, Γ_n(t) can change from one state in the set Z to another with a certain transition probability.
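The rate and latency model above can be sketched in a few lines. This is a minimal illustration assuming the standard form μ_n(t) = H_n(t)·B·Γ_n(t); the function names, units, and the example numbers are ours, not the paper's.

```python
# Sketch of the downlink rate and per-file latency model. Assumes
# mu_n(t) = H_n(t) * B * Gamma_n(t); names and units are illustrative.

def downlink_rate(num_blocks, bandwidth_per_block, spectrum_efficiency):
    """Max rate over H_n(t) consecutive blocks of bandwidth B each."""
    return num_blocks * bandwidth_per_block * spectrum_efficiency

def file_latency(file_size, rate):
    """Transmission latency l / mu_n(t) for one HDM dynamic-layer file."""
    return file_size / rate

# Example: 5 blocks of bandwidth 1000 (in kHz) each, spectrum efficiency 2,
# and a 100 KB file (hypothetical values).
rate = downlink_rate(5, 1000.0, 2)
latency = file_latency(100.0, rate)
```

A larger allocation H_n(t) directly shrinks the latency term l/μ_n(t), which is what later allows the scheduler to trade downlink resources against instant file updates.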
Meanwhile, the total number of resource blocks allocated to the downlink transmissions of the N vehicles is no more than H_b, i.e.,

Σ_{n=1}^{N} H_n(t) ≤ H_b,

where H_n(t) represents the number of resource blocks used by vehicle n in time step t.
As for the uplink transmission process of a file update, we consider the update time of each HDM file to be unchanged across time steps due to the identical file size and transmission time. The update times form the set T_r = {T_r^1, T_r^2, ..., T_r^F}, where T_r^f is the time consumed for file f to be updated and T_r^f < T(t), f ∈ {1, 2, ..., F}; T(t) denotes the duration of time step t. To avoid uplink channel congestion, the number of updated files in each time step is no more than a constant Y, i.e.,

Σ_{f=1}^{F} u_f(t) ≤ Y,

where Y < F. Particularly, an update operation for file f consumes E_f units of energy for traffic sensing and information uploading.
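The update constraint and per-update energy cost can be checked with a small helper. This is a sketch under our own naming; the 0/1 flags play the role of u_f(t), the energy list plays the role of E_f, and the cap plays the role of Y.

```python
# Minimal feasibility/cost check for an update decision U(t): at most Y
# files may be refreshed per step, and refreshing file f costs E_f units
# of sensor energy. Names and example numbers are illustrative.

def update_cost_and_feasibility(update_decision, energy_per_file, max_updates):
    """update_decision: list of 0/1 flags u_f(t); returns (feasible, energy)."""
    num_updates = sum(update_decision)
    energy = sum(u * e for u, e in zip(update_decision, energy_per_file))
    return num_updates <= max_updates, energy

# Two of four files refreshed, cap Y = 3 (hypothetical values).
feasible, energy = update_cost_and_feasibility(
    [1, 0, 1, 0], [2.0, 3.0, 1.5, 4.0], 3)
```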

B. AoI Analysis
The real-time AoI value of the cached HDM files is of great importance for the RSU to conduct a file update decision. Since the AoI values of the same HDM file may differ between the RSU and the vehicle, we analyze them separately. We define a metric α_max, which represents the maximum system AoI a file cached on the RSU can reach. For any HDM file f on the RSU, its real-time AoI value α_f^0(t) grows with the elapsed time since the file's last update and is capped at α_max. For a vehicle n, the transmission latency of the file response process also increases the staleness of the information; additionally, an instant file update brings extra response latency to the request. The AoI value of the file f requested by vehicle n therefore adds these latencies on top of the cached AoI. To ensure that the AoI of the requested HDM file meets the demand of each vehicle, we set an AoI limitation α_max^V ≤ α_max for all the requests in time step t. Fig. 2 illustrates the AoI variation of an HDM file cached on the RSU. To meet the communication needs of other vehicles, the RSU can allocate redundant downlink transmission resources to a vehicle once the requested file's AoI would otherwise exceed the threshold α_max^V. This avoids instant file updates and reduces the update energy consumption. The file update decision and the downlink transmission resource allocation can thus be jointly optimized to reduce the update energy consumption at the cost of downlink transmission resources, while ensuring a relatively low AoI experienced by the vehicles.
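The cached-file AoI dynamics described above can be sketched as a small state machine. This is a plausible reading of the recursion (grow with elapsed time, cap at α_max, reset to the update latency T_r^f on refresh), not the paper's exact equation; all names are ours.

```python
# Sketch of the cached-file AoI dynamics: the AoI of file f on the RSU
# grows with elapsed time, is capped at alpha_max, and resets to the file's
# update latency when it is refreshed. An illustrative model, not the
# paper's exact recursion.

class CachedFileAoI:
    def __init__(self, alpha_max, update_latency):
        self.alpha_max = alpha_max
        self.update_latency = update_latency
        self.aoi = alpha_max  # treat the file as stale until first refresh

    def step(self, elapsed, updated):
        if updated:
            # freshly pulled state: age equals the time the update took
            self.aoi = self.update_latency
        else:
            self.aoi = min(self.aoi + elapsed, self.alpha_max)
        return self.aoi

f = CachedFileAoI(alpha_max=30, update_latency=0.5)
f.step(elapsed=1, updated=True)         # refreshed: AoI drops to 0.5
aoi = f.step(elapsed=1, updated=False)  # one step later: 1.5
```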

C. Problem Formulation
Our objective is to minimize the average AoI experienced by the vehicles and also to reduce the extra sensor energy consumption caused by file update.We design a joint scheduling mechanism for the HDM file update and downlink transmission resource allocation.
To better characterize the satisfaction with the AoI of the requested HDM file, we define a new metric, namely the AoI difference cost, as the gap between the real-time AoI value of the file and α_max^V. Equations (8) and (9) express the AoI difference of each vehicular request on the RSU and on-board, respectively.
As for a vehicle n, we use the average AoI difference cost Δ_n(t) of all the HDM files it requested as the representative of its AoI satisfaction within time step t. According to the above analysis, the AoI-related cost during each time step can be expressed as the weighted sum of each vehicle's average AoI difference cost, i.e.,

Δ(t) = Σ_{n=1}^{N} β_n Δ_n(t),

where Σ_{n=1}^{N} β_n = 1, and the value of each β_n depends on the automatic driving level of the vehicle: a vehicle with a higher automatic driving level possesses a higher β_n. In this paper, we consider the case where each request contains only one HDM file, to simplify the analysis.
Meanwhile, the total energy consumption of the roadside sensors in each time step can be expressed as

E(t) = Σ_{f=1}^{F} u_f(t) E_f.

Here, we denote P_AoI(t) as the AoI-relevant penalty brought by the average AoI difference cost, and P_E(t) as the energy-relevant penalty brought by the total energy consumption in each time step, respectively.
Therefore, the overall system cost in each time step t can be expressed as

C(t) = ω_AoI P_AoI(t) + ω_E P_E(t),    (13)

where ω_AoI and ω_E nondimensionalize the two penalties and realize a tradeoff between the AoI-relevant penalty and the energy-relevant penalty in each time step. Since the AoI requirement of the requested HDM file is more important than the energy consumption of the roadside sensors, we consider ω_AoI to be larger than ω_E. Based on the cost function (13), as the time horizon T_max goes to infinity, the average cost of requesting HDM files can be defined as

C_ave = lim_{T_max → ∞} (1/T_max) Σ_{t=1}^{T_max} C(t).    (14)

Our objective can then be formulated as minimizing C_ave subject to constraints (4), (5), and (8). This is a nonlinear and nonconvex optimization problem, which is generally difficult to solve. In the following section, we propose a DRL-based algorithm to solve it.
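The weighted per-step cost can be sketched directly from (13). This is a minimal illustration assuming the AoI penalty is the weighted sum of per-vehicle AoI difference costs and the energy penalty is the raw energy total; the weights and numbers are hypothetical.

```python
# Per-step system cost C(t) = w_AoI * P_AoI(t) + w_E * P_E(t), with the AoI
# penalty taken as the beta-weighted sum of per-vehicle AoI difference
# costs. Weights, penalty forms, and values are illustrative.

def aoi_penalty(per_vehicle_aoi_diff, beta):
    """Weighted sum of each vehicle's average AoI difference cost."""
    return sum(b * d for b, d in zip(beta, per_vehicle_aoi_diff))

def system_cost(per_vehicle_aoi_diff, beta, total_energy, w_aoi=0.6, w_e=0.4):
    assert w_aoi > w_e  # freshness is weighted above sensor energy
    return w_aoi * aoi_penalty(per_vehicle_aoi_diff, beta) + w_e * total_energy

# Three vehicles, beta reflecting automatic driving levels (sums to 1).
cost = system_cost([2.0, 0.0, 4.0], beta=[0.5, 0.3, 0.2], total_energy=5.0)
```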

IV. DEEP REINFORCEMENT LEARNING-BASED ALGORITHM
This section first formulates the HDM content update and downlink resource allocation process on the RSU as an MDP. Then, the PRD-DRN algorithm is proposed to minimize the long-term average cost of requesting HDM files by jointly optimizing the HDM content update and the downlink transmission resource allocation.
• Action Space A: the system action set in time step t, which represents the HDM file update decision and the downlink transmission resource allocation of the RSU.
• State Transition Probability P: represents the distribution of the transition probability P(s' | s, a) from the system state s to a new system state s' (s, s' ∈ S) when an action a ∈ A is chosen, which is largely affected by the real environment conditions, such as the HDM file request rate and the request popularity of each cached HDM file.
• Reward Function R: S × A → R maps a state-action pair to a value R(s(t), a(t)). Our objective is to minimize the long-term average cost C_ave given in equation (14) under the constraints, so we define the per-step reward as the negative cost, R(s(t), a(t)) = −C(t).
Here, we define the policy π as a mapping that gives the action a ∈ A the RSU will execute in a given system state s ∈ S. Then, the objective function in (15) can be rewritten as

π* = argmax_π lim_{T_max → ∞} (1/T_max) Σ_{t=1}^{T_max} E[R(s(t), π(s(t)))].

B. PRD-DRN Algorithm
With the aforementioned MDP model, we can well characterize the effects of diverse HDM AoI values on vehicles under different file update actions, based on the vehicular autonomous driving requirements. Here, we need to design an adaptive and efficient HDM dynamic layer update strategy, which proactively makes file update decisions in each state, so as to earn a higher reward in terms of long-term system performance.
In our scenario, the rewards obtained by the agent in different time steps are considered to have the same importance. Thus, this paper aims to maximize the long-term average reward rather than the long-term discounted reward. We modify the state value function V_π(s) and the state-action value function Q_π(s, a) by combining the idea used in R-learning [46] as follows:

V_π(s) = E[ Σ_{t=0}^{∞} (R(s(t), a(t)) − R_π) | s(0) = s ],
Q_π(s, a) = E[ Σ_{t=0}^{∞} (R(s(t), a(t)) − R_π) | s(0) = s, a(0) = a ],

where R_π is the long-term average reward of taking policy π, which can be written as

R_π = lim_{T_max → ∞} (1/T_max) Σ_{t=1}^{T_max} E[R(s(t), a(t))].

The optimal policy π* can be obtained by utilizing the Bellman optimality equation:

Q*(s, a) = R(s, a) − R* + Σ_{s'} P(s' | s, a) max_{a'} Q*(s', a').

The architecture of our DRL-based HDM dynamic layer update mechanism is presented in Fig. 3, where θ and θ* are the DNN parameters of the main network and the target network, respectively. The agent interacts with the environment and observes the real-time system state. Based on the current state s(t), the agent selects an action using the ε-greedy strategy: the greedy action argmax_a Q(s, a; θ) is selected with probability ε, and a random action a ∈ A is selected with probability 1 − ε, where ε ∈ [0, 1]. Notice that the agent not only uses previous experience to maximize current rewards, but also keeps exploring to improve Q_π(s, a) and R_π. After the agent performs an action a(t), the corresponding reward R(s(t), a(t)) is obtained from the environment, and the system state s(t) transfers to s(t + 1). Thus, a new experience tuple E(t) = (s(t), a(t), R(s(t), a(t)), s(t + 1)) is generated and cached in the experience replay buffer M. The former steps then loop to accumulate enough experience in the replay buffer for future training. Notice that the oldest experience tuple is discarded when the experience buffer M is full.
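The average-reward idea PRD-DRN borrows from R-learning can be shown in tabular form: the TD target subtracts a running average reward R̄ instead of discounting future values. The toy MDP, step sizes, and variable names below are ours, purely for illustration.

```python
# Tabular sketch of the R-learning update underlying PRD-DRN: the TD error
# uses (r - R_bar) in place of a discount factor, so Q estimates values
# relative to the average reward rate. Step sizes are illustrative.

def r_learning_update(Q, R_bar, s, a, r, s_next, alpha=0.1, eta=0.01):
    td_error = r - R_bar + max(Q[s_next].values()) - Q[s][a]
    Q[s][a] += alpha * td_error
    # the average-reward estimate drifts toward the observed reward rate
    R_bar += eta * td_error
    return Q, R_bar

# Two states, two actions, all values initially zero (hypothetical MDP).
Q = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 0.0}}
R_bar = 0.0
Q, R_bar = r_learning_update(Q, R_bar, s=0, a=1, r=1.0, s_next=1)
```

Because no discount factor appears, there is nothing to tune when all time steps matter equally, which is exactly the property the paper exploits.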
As for the training procedure, we first utilize a prioritized experience sampling scheme [47] to acquire a mini-batch of cached experience tuples W = {E_1, E_2, ..., E_{W_m}} based on the pre-defined batch size W_m, where E_j = (s_j, a_j, R(s_j, a_j), s'_j), j ∈ {1, 2, ..., W_m}. Unlike random sampling, prioritized experience sampling tends to replay experiences with high priority more frequently, measured by the magnitude of their temporal-difference (TD) error δ_j, defined as

δ_j = R(s_j, a_j) − R̄ + Q(s'_j, argmax_a Q(s'_j, a; θ); θ*) − Q(s_j, a_j; θ),

where R̄ is the average reward of the cached experience in the replay buffer. Meanwhile, the sampling priority of each cached experience tuple E_j is determined as p_j = |δ_j|. The sampling is executed using a SumTree method [47], where an experience tuple with a higher sampling priority has a higher probability of being selected.
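The SumTree sampling can be sketched compactly: leaves hold the priorities |δ_j|, internal nodes hold subtree sums, and a draw descends the tree by comparing against the left-subtree mass. This is a minimal sketch; a production buffer would also apply the importance-sampling corrections of [47], which we omit.

```python
# Minimal SumTree for prioritized sampling. Leaves store priorities,
# internal nodes store subtree sums; sampling walks from the root down.
# Capacity and priorities below are illustrative.
import random

class SumTree:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # tree[1] is the root sum

    def update(self, idx, priority):
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:  # propagate the new sum up to the root
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self):
        mass = random.uniform(0.0, self.tree[1])
        pos = 1
        while pos < self.capacity:  # descend until a leaf is reached
            left = 2 * pos
            if mass <= self.tree[left]:
                pos = left
            else:
                mass -= self.tree[left]
                pos = left + 1
        return pos - self.capacity  # leaf index = experience index

tree = SumTree(4)
for i, p in enumerate([0.1, 0.4, 0.2, 0.3]):
    tree.update(i, p)
idx = tree.sample()  # higher-priority experiences are drawn more often
```

Both `update` and `sample` walk one root-to-leaf path, giving the O(log |M|) cost quoted in the complexity analysis below.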
After obtaining the batch of sampled experiences and their TD errors, the average reward R̄ is updated as

R̄ ← R̄ + ξ_R · (1/W_m) Σ_{j=1}^{W_m} δ_j,

where ξ_R is the learning rate of the average reward. The Q value of the target network of the PRD-DRN algorithm can then be expressed as

Q_target(s_j, a_j) = R(s_j, a_j) − R̄ + Q(s'_j, argmax_a Q(s'_j, a; θ); θ*).

Meanwhile, the main network is trained by minimizing the loss function L(θ), which can be expressed as

L(θ) = E[(Q_target(s_j, a_j) − Q(s_j, a_j; θ))²].

In this paper, we use the stochastic gradient descent (SGD) method to update the DNN parameter θ iteratively as

θ ← θ − ξ ∇_θ L(θ),    (24)

where ξ is the learning rate. The parameter θ of the main network is updated every step, while the parameter θ* of the target network is updated every i steps, i.e., θ*_t = θ_{t−i}. The pseudo code in Algorithm 1 shows the details of the proposed PRD-DRN algorithm. The replay buffer and the parameters of the main network and the target network are initialized at the beginning of the algorithm. During each episode, the environment is reset first. Then, the agent starts to explore the environment for T_pre loops. For a given state s(t), the agent selects an action a(t) using the ε-greedy method. With the ε-greedy method, the agent tends to take a random action from
the action set at the beginning of the iterations, since it does not know much about the environment. After executing multiple iterations, the agent becomes more aware of the environment and selects the action with the maximum Q-value with a higher probability. The immediate reward R(s(t), a(t)) and the following state s(t + 1) are obtained based on the selected action a(t). Then, the agent forms the corresponding experience tuple E(t) = (s(t), a(t), R(s(t), a(t)), s(t + 1)) and caches it in the experience replay buffer M for subsequent training. The training of the network model then starts, and a mini-batch is sampled from the experience replay buffer M using the prioritized experience replay method in [47]. The loss function is minimized by the SGD procedure to update the network parameters until convergence.
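The core of one training step, the double-style target with the average reward subtracted in place of a discount, can be sketched with plain dicts standing in for the main and target networks. All names and the toy numbers are illustrative, not the paper's implementation.

```python
# Sketch of one PRD-DRN target and loss: the action is selected by the
# main network but evaluated by the target network (double DQN style),
# and R_bar replaces discounting. Q-functions are dicts for illustration.

def prd_drn_target(r, R_bar, q_main_next, q_target_next):
    # action chosen by the main network, value read from the target network
    a_star = max(q_main_next, key=q_main_next.get)
    return r - R_bar + q_target_next[a_star]

def squared_td_loss(q_target, q_current):
    return (q_target - q_current) ** 2

q_main_next = {0: 0.2, 1: 0.5}    # main network Q(s', .) (toy values)
q_target_next = {0: 0.1, 1: 0.4}  # target network Q(s', .)
target = prd_drn_target(r=1.0, R_bar=0.2,
                        q_main_next=q_main_next, q_target_next=q_target_next)
loss = squared_td_loss(target, q_current=0.3)
```

In the full algorithm this squared error is averaged over the prioritized mini-batch and minimized by SGD, while θ* lags θ by i steps.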
The exploration rate ε is set to 0 initially and increased by a fixed increment during each training episode.

C. Algorithm Complexity Analysis
The time complexity of an artificial neural network based algorithm can be deduced from its number of network neurons [48]. Assuming that a fully-connected network has x_I input neurons, x_O output neurons, and H hidden layers with x_h neurons in layer h (h ∈ {1, 2, ..., H}), the time complexity can be expressed as O(x_I x_1 + Σ_{h=1}^{H−1} x_h x_{h+1} + x_H x_O) [49]. Meanwhile, the time complexity of the SumTree method is O(log |M|) [47]. Thus, for the proposed PRD-DRN algorithm, the time complexity of each episode combines the forward and backward passes of the network with the SumTree operations, where |S| and |A| are the dimensions of the state and action spaces, respectively, and can be deduced from the network model proposed in this paper. We know that the training process of a DRL model is extremely time-consuming [50]. Similar to [51], we can perform the training procedure of our proposed DRL-based algorithm offline for many episodes under different channel states. The trained model needs to be updated only when there is a significant change in the environment characteristics. Specifically, we can update the trained model when computing resources are idle (e.g., at midnight).
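The per-forward-pass cost bound above can be sanity-checked numerically. The input width follows Section V's architecture ((N + 1)F + N cells, two hidden layers of 256); the output width here is a hypothetical value, since the paper does not state it in this excerpt.

```python
# Multiply count of one fully-connected forward pass:
# x_I*x_1 + sum_h x_h*x_{h+1} + x_H*x_O. Output width is illustrative.

def forward_pass_mults(layer_widths):
    return sum(a * b for a, b in zip(layer_widths, layer_widths[1:]))

N, F = 30, 10                       # vehicles, HDM files (Section V values)
widths = [(N + 1) * F + N, 256, 256, 64]  # 64-wide output is an assumption
mults = forward_pass_mults(widths)
```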

V. SIMULATION RESULTS AND DISCUSSIONS
In this section, we evaluate the performance of our proposed PRD-DRN algorithm. We first describe our simulation settings, which consist of the parameters of the network model and the hyper-parameters of the PRD-DRN; the configurations of the baseline algorithms are also presented. Then, we show the performance comparison of the PRD-DRN with the benchmarks in different environments and give the relevant analysis. The whole experiment is implemented in the TensorFlow framework and runs on a PC with an Intel Core i7-6700 CPU @ 2.6 GHz and 16 GB of memory.

A. Simulation Settings
Simulation Scenario: We build a simulation scenario with one RSU equipped with an MEC server, N connected vehicles, and 10 traffic information acquisition sensors. The value of N ranges from 10 to 40 with an interval of 10. The wireless channels between the vehicles and the RSU follow the finite-state Markov channel (FSMC) model. The state of the channel is considered bad when the spectrum efficiency is 1 and good when it is 2. The transition probability of staying in the same state is set to 0.7, and the transition probability from one state to the other is set to 0.3 [52]. The AoI limitation α_max^V is set to 20 slots. Meanwhile, the allocated bandwidth B of each resource block is 1 MHz, while the number of resource blocks H_b is set to 50.
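The two-state FSMC channel above can be simulated in a few lines, using the stated stay probability of 0.7; the function names and the seeded trace are ours.

```python
# Sketch of the two-state FSMC: spectrum efficiency stays in its current
# state with probability 0.7 and switches with probability 0.3 each step,
# matching the simulation settings above.
import random

def fsmc_step(gamma, stay_prob=0.7, states=(1, 2)):
    if random.random() < stay_prob:
        return gamma
    # switch to the other state of the two-state chain
    return states[0] if gamma == states[1] else states[1]

random.seed(0)
trace = [1]
for _ in range(1000):
    trace.append(fsmc_step(trace[-1]))
# empirical stay fraction should hover around the 0.7 stay probability
stay_fraction = sum(a == b for a, b in zip(trace, trace[1:])) / 1000
```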
In each time step, the arriving vehicular requests for the edge-cached HDM files follow a Zipf distribution with distribution parameter 1.5 [53]. This kind of distribution is representative of practical vehicular networks and has been widely adopted in many relevant references [54], [55], [56], [57], [58]. The proposed algorithm needs to be retrained if the distribution changes. For each roadside sensor, the relevant file update latency is randomly selected from the set {0.5τ, 0.6τ, 0.7τ, 0.8τ, 0.9τ}, where τ is the length of the unit time slot; the value of τ is set to 1. Once the file update latency of each roadside sensor has been determined, it remains unchanged during the whole simulation. Based on this, we set the extra request latency of a specific file to be the same as its update latency. In this paper, we consider the edge nodes (e.g., road side units and base stations) to be equipped with stable power supply facilities, so they are little affected by energy consumption; thus, we do not consider the energy consumption of the edge nodes. Meanwhile, the transmission power of each roadside sensor is set to 10ρ mW, where ρ belongs to [0.5, 1]. For a specific sensor, the energy consumption of traffic status sensing per file update is set to be the same as that of data uploading [59].
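The Zipf(1.5) request popularity can be sketched as follows: the file of rank k is requested with probability proportional to 1/k^1.5 over the 10 cached files. The sampling helper and seed are ours.

```python
# Sketch of Zipf(1.5) request popularity over F cached HDM files: rank-k
# popularity is proportional to 1/k^1.5, per the simulation settings.
import random

def zipf_probs(num_files, exponent=1.5):
    weights = [1.0 / (k ** exponent) for k in range(1, num_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def sample_request(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = zipf_probs(10)
rng = random.Random(0)
requests = [sample_request(probs, rng) for _ in range(1000)]
# with exponent 1.5, the top-ranked file dominates the request stream
top_share = requests.count(0) / 1000
```

This skew is what makes edge caching pay off: keeping the few most popular dynamic-layer files fresh serves the bulk of the request stream.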
Training model architecture: In our proposed PRD-DRN model, the main network and the target network are two identical fully-connected ANNs. Each ANN consists of four layers, i.e., an input layer, two hidden layers, and an output layer. The input layer consists of (N + 1)F + N cells holding the pre-processed system state. Each hidden layer has 256 cells, while the output layer gives the HDM file update action. We utilize ReLU as the activation function and Adam [60] as the optimizer. To make the model easier to train, the input state is normalized by the maximum allowable AoI α_max, i.e., X̂ = X/α_max, where X is the input value. The learning rates of the PRD-DRN parameters θ and θ* are set to 4 × 10^−4; the learning rate of the average reward is also set to 4 × 10^−4. The target network update interval i is set to 2000 steps. The average return is calculated by the agent interacting with the environment for 10^4 steps. The memory buffer size is set to 2 × 10^5, and the mini-batch size W_m is set to 32, 64, or 128. The exploration rate ε increases linearly from 0 to 1 and then remains fixed. We compare our proposed PRD-DRN algorithm with the following baselines.
Random Policy: During each time step, the RSU randomly selects an update action for the current state. This policy does not take the transmission resource allocation into account.

Greedy Policy: During each time step, the RSU tries to maximize the immediate reward with its update action. This policy does not take the transmission resource allocation into account either.
DQN-based Policy: The DQN-based policy is based on the traditional double DQN (DDQN) algorithm [61]. Its network architecture is similar to that of the PRD-DRN, consisting of a main network and a target network. The objective of the DDQN is to maximize the cumulative discounted reward instead of the average one. Notice that the state value function V^π(s) and the state-action value function Q^π(s, a) in the DDQN model are defined as in (29) and (30), where γ is the discount factor, set to 0.95 in the subsequent simulations. Meanwhile, the DDQN does not adopt the prioritized experience sampling method, and its loss function is given by Equation (31).
Except for the differences described above, all network training configurations are the same as those of the PRD-DRN.
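The double-DQN target and mean-squared TD loss used by this baseline can be sketched numerically as follows (a NumPy illustration; array names are ours):

```python
import numpy as np

def ddqn_target(reward, next_q_main, next_q_target, gamma=0.95):
    """Double-DQN target: the main network selects argmax_a over the
    next state's actions, while the target network evaluates it."""
    a_star = np.argmax(next_q_main, axis=1)
    return reward + gamma * next_q_target[np.arange(len(reward)), a_star]

def ddqn_loss(q_taken, target):
    # Mean squared TD error between the taken-action Q value and target.
    return np.mean((target - q_taken) ** 2)

# Toy batch of one transition: main net prefers action 1 in the next
# state, so the target net's value for action 1 (0.7) is bootstrapped.
target = ddqn_target(np.array([1.0]),
                     np.array([[1.0, 2.0]]),
                     np.array([[0.5, 0.7]]))
```

Decoupling action selection (main network) from action evaluation (target network) is what distinguishes DDQN from vanilla DQN and mitigates Q-value overestimation.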

1) Convergence Performance:
To ensure the reliability of our proposed PRD-DRN algorithm, we first verify its convergence performance.
Fig. 4 shows the reward of PRD-DRN under different mini-batch sizes (32, 64, 128). To control variables, we set ω_AoI to 0.6 and N to 30. The loss function of the DDQN baseline referred to in Equation (31) is

L(θ) = E[(R(s_j, a_j) + γ Q(s_{j+1}, argmax_a Q(s_{j+1}, a; θ); θ*) − Q(s_j, a_j; θ))²].   (31)

We can see from Fig. 4 that the value of the mini-batch size has a significant effect on the reward of PRD-DRN, and a higher mini-batch size helps PRD-DRN converge faster to a certain extent. This can be explained as follows. First, the mini-batch size determines the number of experience samples used for training per round. A smaller mini-batch size increases the randomness of the experience samples, which may impede the convergence speed of the model. Meanwhile, the PRD-DRN utilizes the prioritized sampling method, which can reduce the influence of the mini-batch size on the convergence speed to a certain extent when the mini-batch size becomes larger. However, a large mini-batch size can result in a single-direction gradient descent during training, which may lead to a locally optimal solution. We set the mini-batch size to 64 in the subsequent simulations.
Fig. 5 shows the convergence comparison of the PRD-DRN and the baseline policies when N = 30. Here, we also consider that the discount factor γ of the DQN-based policy can affect its convergence and average reward [61]. Generally, a smaller γ means that the agent pays more attention to the immediate interest, and the training difficulty may be lower. On the other hand, a larger γ means that the agent pays more attention to the long-term interest, which may make the algorithm unstable. It can be seen from Fig. 5 that a smaller γ (0.89, 0.92, 0.95) of the DQN-based policy indeed ensures a faster convergence speed, while the average reward becomes higher with a larger γ. However, an extremely large γ (0.98) makes the DQN-based model difficult to converge during training. Meanwhile, the greedy policy and the random policy obtain relatively low rewards. By comparison, our PRD-DRN obtains a higher reward while ensuring a faster convergence speed. This can be explained as follows. The prioritized experience sampling reduces the amount of experience the agent requires to learn, since it always selects the more valuable experience samples. Although the sampling process consumes some computation resources (as analyzed in Section III-C), the increased computational overhead is acceptable relative to the performance gain. Moreover, the PRD-DRN can achieve better performance without tuning the discount factor, which also reduces the training cost to a certain extent. Based on the above analysis, we set the discount factor of the DQN-based policy to 0.95 in the subsequent simulations.
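The prioritized experience sampling discussed above can be illustrated with a minimal proportional-priority buffer (a sketch without the sum-tree used in practice; class and parameter names are ours):

```python
import numpy as np

class PrioritizedBuffer:
    """Minimal proportional prioritized replay sketch: transitions with
    larger TD error are replayed more often, so the agent learns from
    the more valuable samples first."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.prios = [], []

    def add(self, transition, td_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.prios.pop(0)  # drop the oldest entry
        self.data.append(transition)
        # Priority proportional to |TD error|^alpha, with eps avoiding
        # zero probability for fully learned transitions.
        self.prios.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size, rng=None):
        rng = rng or np.random.default_rng()
        p = np.asarray(self.prios)
        p = p / p.sum()
        idx = rng.choice(len(self.data), size=batch_size, p=p)
        return [self.data[i] for i in idx], idx

buf = PrioritizedBuffer(capacity=100)
buf.add(("s0", "a0"), td_error=0.0)
buf.add(("s1", "a1"), td_error=10.0)   # high-error transition
buf.add(("s2", "a2"), td_error=0.0)
batch, idx = buf.sample(200, rng=np.random.default_rng(0))
```

In this toy run, the transition with the large TD error dominates the sampled batch, which mirrors why PRD-DRN needs fewer interactions to converge than uniform-replay DDQN.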
2) Efficiency Analysis: To verify the efficiency of our proposed method, we make performance comparison with the mentioned baseline policies.
Fig. 6 shows the average AoI cost when vehicles receive their requested HDM files under different numbers of vehicles. It can be observed from Fig. 6 that the RSU with the PRD-DRN policy maintains a relatively low AoI cost compared with the baseline policies. Meanwhile, it is interesting to find that although the number of state-action pairs increases exponentially with the number of vehicles N, the performance of PRD-DRN remains stable with respect to N. This is due to the fact that the PRD-DRN maximizes the long-term average reward, and thus it can execute optimal update actions in response to the vehicular requests. Even when no vehicular request arrives in a specific time step, the RSU may execute appropriate file update actions based on the historical request record and the real-time AoI of the cached files. We can also find that the average AoI cost of the proposed PRD-DRN algorithm stays below the pre-defined AoI limitation (α^V_max = 20). Thus, the PRD-DRN can ensure the stability of the AoI cost performance and realize a reasonable utilization of the network resources.
Fig. 7 shows the average file updating energy consumption of the system at each time step under different numbers of vehicles. It can be seen from Fig. 7 that the PRD-DRN keeps a relatively low energy consumption compared with the baseline policies as the number of vehicles increases. This is due to the following reason. To achieve a high average reward, the agent appropriately balances the AoI cost and the energy consumption. Meanwhile, the available transmission resource is allocated to the vehicle whose requested file is close to the AoI threshold, which avoids unnecessary updates and thus reduces the file update energy consumption.
Fig. 8 shows the average file update time of the system at each time step under different numbers of vehicles. We can see from Fig. 8 that the greedy policy performs worse than the PRD-DRN and the DQN-based policy. This is because the greedy policy only considers the instant system performance and is prone to falling into the dilemma of local optima. Moreover, the greedy policy ignores the benefit brought by the optimal transmission resource allocation in reducing the number of instant file updates. The agents of the PRD-DRN and the DQN-based policy can jointly schedule the file update and the downlink transmission resource allocation through long-term interactions with the environment. The available downlink transmission resources are reasonably allocated to different vehicles to ensure that the AoI of the requested files meets the vehicles' requirements with the minimum number of file updates.
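The AoI-threshold-aware allocation behavior described above can be illustrated with a simple heuristic (a hypothetical helper for intuition only, not the learned policy; vehicle names are illustrative):

```python
def allocate_resource(aoi_of_requested, alpha_max=20):
    """Give the downlink resource to the vehicle whose requested file's
    AoI is closest to (but still below) the threshold alpha_max, so that
    near-stale files are served before an update becomes necessary."""
    candidates = {v: aoi for v, aoi in aoi_of_requested.items()
                  if aoi <= alpha_max}
    if not candidates:
        return None  # every request already exceeds the AoI limit
    return max(candidates, key=candidates.get)

# v2's requested file (AoI 18) is nearest the threshold of 20,
# while v3's file (AoI 25) has already violated it.
chosen = allocate_resource({"v1": 5, "v2": 18, "v3": 25})
```

Serving the near-threshold request first delivers the file while it is still fresh enough, which avoids triggering an otherwise unnecessary sensor update.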

VI. CONCLUSION
This paper studied the file update and downlink transmission resource allocation in the vehicular HDM edge caching system. For this purpose, we first formulated the file update and resource allocation as a nonlinear and nonconvex optimization problem. To solve this challenging problem, we then proposed the PRD-DRN algorithm, combining the perception capability of R-learning and the scheduling capability of reinforcement learning. Under the proposed PRD-DRN algorithm, the content update and transmission resource allocation procedures on the RSU were modeled as an MDP. Leveraging the advantages of deep R-learning and prioritized experience sampling, we obtained the optimal decision to minimize the long-term average cost related to the AoI and energy consumption. Extensive simulation results show that our PRD-DRN algorithm can achieve a high long-term reward without tuning the discount factor. Remarkably, in comparison with the baseline policies, our algorithm achieves a lower average AoI and energy consumption with a relatively low file update time during a fixed period. An interesting direction for future work is to explore the joint content update and transmission resource allocation in a multi-edge-node scenario with overlapping service regions.
where d_n^f(t) ∈ {0, 1} and f ∈ {1, 2, ..., F}. d_n^f(t) = 1 represents that the HDM file f has been requested by the vehicle n in time step t, and d_n^f(t) = 0 otherwise. Meanwhile, we use an indicator d_n(t) to represent whether vehicle n raises a request in time step t. Then,

Fig. 2. AoI variation for an HDM file f on the RSU.
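The AoI variation sketched in Fig. 2 can be expressed as a one-slot recursion (a sketch under the stated setup: on an update completion the AoI resets to the sensor's update latency, otherwise it grows by the slot length τ; function names are ours):

```python
def step_aoi(aoi, updated, update_latency, tau=1.0):
    """One-slot AoI evolution for a cached HDM file: reset to the
    update latency when an update completes, otherwise age by tau."""
    return update_latency if updated else aoi + tau

# Trace four slots with an update completing in the third slot.
aoi = 0.5
trace = []
for updated in [False, False, True, False]:
    aoi = step_aoi(aoi, updated, update_latency=0.5)
    trace.append(aoi)
```

This sawtooth pattern (linear growth punctuated by resets to the update latency) is exactly the shape depicted in Fig. 2.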

TABLE I. SUMMARY OF AOI MINIMIZATION IN ENVIRONMENTAL MONITORING SYSTEMS