Machine Learning Enables Radio Resource Allocation in the Downlink of Ultra-Low Latency Vehicular Networks

Autonomous driving and intelligent transportation demand ultra-low latency and highly reliable communication in future vehicular networks. Proactive wireless communication can minimize latency through open-loop communication, which discards traditional feedback control mechanisms. However, appropriate radio resource allocation in such proactive mobile networks has not been fully studied, owing to the lack of channel state information (CSI) and the need to mitigate multiple access interference (MAI) across multiple virtual cells. This paper aims to ensure the reliability of downlink communication through a novel radio resource allocation scheme in proactive vehicular networks with ultra-low latency. We regard the data transmission success rate as the reliability indicator and propose a joint radio resource allocation model based on a “generalized closed-loop”, in which the anchor node (AN) uses the radio resource utilization information (RRUI) reported by the vehicle in the immediately preceding uplink as a guide for resource allocation. Subsequently, we study solutions to the radio resource allocation model on the vehicle side and the network side, respectively. On the vehicle side, vehicles use local or global data transmission experience to select the radio resources with the best quality as the RRUI. On the network side, according to the latest RRUI of the vehicle and the resource occupancy information, deep reinforcement learning is employed to make appropriate radio resource allocation decisions. Simulations demonstrate the effectiveness of the intelligent joint radio resource allocation scheme under the cooperation between vehicles and the AN. When the resource load rate reaches 40%, the joint radio resource allocation scheme achieves a data transmission success rate of more than 98%.

and downlink transmission, offloading, and online learning issues to reduce latency [17]-[19]. (3) Some consider advanced machine learning and artificial intelligence technologies to empower networks, concentrating on resource allocation [20], scheduling algorithms [21], link adaptation [22], and so on. The above studies are all based on traditional closed-loop networks. However, as vehicles, roadside units (RSUs), and large numbers of smart devices access the wireless network, the control signaling overhead becomes enormous. The high-speed movement of vehicles causes frequent handovers between micro base stations, generates signaling storms, and greatly reduces spectral efficiency.
The proactive network based on open-loop communication offers a novel way to realize ultra-reliable and low-latency vehicular networks [23]. Because it uses open-loop communication, there is no complicated feedback mechanism or control signaling between the vehicle and the network, which greatly reduces the transmission latency [24]. As depicted in Fig. 1, the access network of the proactive vehicular network consists of multiple access points (APs), including RSUs and 5G eNBs, and anchor nodes (ANs), which are responsible for managing the APs. Once a vehicle is connected to the wireless network, it actively selects a couple of APs and radio resources for its services to form a virtual cell [25]. To guarantee the reliability of proactive multi-vehicle communication, at the physical layer each vehicle communicates with multiple APs at the same time to form a multi-path wireless network and achieve macroscopic spatial diversity. Simultaneously, multi-user detection [26] and a specially designed open-loop error correction code [27] can further improve the reliability of the proactive vehicular network. At the network layer, ANs can effectively predict the next location of the vehicle through anticipatory mobility management (AMM) [28], thereby selecting high-quality APs for multipath transmission to improve the success rate of data transmission. However, since the current network channel quality cannot be known, the proactive vehicular network can only transmit data on randomly selected radio resources. This increases the probability of resource conflicts, leading to a sharp increase in MAI and affecting the correct reception of data. The proactive vehicular network therefore urgently needs an intelligent radio resource allocation scheme to ensure the reliability of data transmission.
The authors of [29] and [30] use machine learning and stochastic optimization, respectively, to solve the uplink resource allocation problem in proactive networks. Unfortunately, these methods are not suitable for downlink resource allocation. The initiative of vehicles in the proactive vehicular network leads to different uplink and downlink resource management schemes. During downlink transmission, the AN needs to use fog computing and AMM to centrally manage and allocate APs and radio resources. In contrast, uplink data transmission is much simpler: each vehicle independently and actively selects radio resources without waiting for centralized resource management, allocation, and access control at the AN. Currently, there is no effective resource management scheme for the downlink. To innovate downlink radio resource allocation, we must resolve the following challenges for the proactive vehicular network: • Due to the lack of CSI in the proactive vehicular network without a feedback mechanism, making reasonable radio resource allocation decisions in an unknown channel state environment is a difficult problem.
• Under the ultra-low latency constraint of the vehicular network, the AN needs to make radio resource allocation decisions locally based on fog computing and requires an efficient radio resource allocation algorithm to make decisions quickly. Existing research on resource management in traditional closed-loop low-latency networks cannot solve the proactive network downlink resource allocation problem. Slice resource reservation based on deep reinforcement learning is proposed to realize automated prediction and resource allocation in [31]. A V2V link selection algorithm based on greedy cells is designed in [32], which minimizes the total delivery delay by selecting specific V2V links and assigning appropriate channels. By considering communication factors and changes in vehicular platoon structure, a dynamic manager selection scheme based on joint resource allocation and coding rate optimization algorithms is proposed in [33]. An adaptive fuzzy logic strategy is developed in [34] to formulate rules for services to improve system resource utilization. However, the resource reservation method in [31] cannot handle traffic explosions in the resource management of the proactive vehicular network. The studies in [32] and [33] target a single application scenario and cannot be extended to the proactive network. The study in [34] requires a feedback mechanism to provide information and cannot support the proactive network. Unfortunately, none of the existing resource management methods are suitable for downlink communication in the proactive vehicular network. We need to design a new and effective downlink radio resource allocation scheme to optimize reliability under extremely low latency.
This paper focuses on the downlink radio resource allocation problem in the proactive vehicular network and aims to design an intelligent and effective downlink radio resource allocation scheme that breaks the dilemma of blind radio resource selection. The main contributions of this paper are summarized as follows: • We propose a vehicle-AN cooperative radio resource allocation model based on a ''generalized closed-loop'' in the proactive vehicular network to optimize the success rate of data transmission. Lacking complete CSI, network management, and centralized coordination of transmissions, traditional radio resource allocation is no longer feasible for proactive communications. In the ''generalized closed-loop'' downlink data transmission process, the additional radio resource utilization information (RRUI) from the vehicle's immediately preceding uplink transmission serves as a guide for the AN's radio resource allocation decisions.
• We make bidirectional optimization of the downlink radio resource allocation model from the vehicle side and the network side. On the vehicle side, we propose two RRUI generation strategies based on local experience (LE-RRUI) and global experience with AN assistance (AA-RRUI), so as to provide the best quality radio resource set to guide the radio resource allocation of downlink transmission. On the network side, a deep reinforcement learning based radio resource allocation algorithm (DRL-RRA) is proposed, which can quickly make reasonable radio resource allocation decisions with the assistance of the RRUI through offline training.
• Through the simulation of different joint schemes of the vehicle-side RRUI generation algorithm and the network-side radio resource allocation algorithm, we obtain the optimal joint radio resource allocation scheme. Simulations prove that under the resource load rate of 40%, AA-RRUI combined DRL-RRA can obtain a data transmission success rate of more than 98%. This paper is organized as follows: Section II establishes a downlink radio resource allocation model based on ''generalized closed-loop'' and regards the long-term data transmission success rate as the optimization goal. Section III proposes two RRUI generation strategies on the vehicle side. Section IV proposes DRL-RRA on the network side and offers two benchmark solutions for comparison. Section V provides numerical results to validate the analysis and demonstrate the performance of the proposed radio resource allocation algorithm under the cooperation between vehicles and AN. Finally, conclusions are drawn in Section VI.

A. THE ARCHITECTURE OF THE PROACTIVE VEHICULAR NETWORK
We consider a proactive vehicular network as shown in Fig. 1. The vehicular network includes V2I and V2N communication modes [35]. The network comprises the radio access network and the core network; the radio access network consists of the APs and the ANs. Multiple ANs, with sufficient computing power and storage, manage the access network and are directly connected to the core network. Each AN manages the APs within a certain range, and the ANs' respective management areas do not overlap. ''Proactive'' in the proactive vehicular network describes the vehicle. When a vehicle accesses the wireless network, it can actively associate with the nearest APs and select radio resources to form a virtual cell and directly perform uplink communication without interacting with the AN in advance, so that the AN perceives the existence of the vehicle. Subsequently, when a downlink data task arrives from the core network or the infrastructure, the AN allocates APs and radio resources for downlink transmission with the assistance of AMM. In the network, the distributions of vehicles and APs obey homogeneous Poisson point processes (PPPs) with densities λ_U and λ_B, respectively. The sets B = {1, 2, ..., B} and U = {1, 2, ..., U} collect the APs and vehicles in the region. To enhance readability, we summarize the abbreviations and notations in Table 1 and Table 2, respectively.
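As a small illustration, the AP and vehicle deployments described above can be drawn from homogeneous PPPs by sampling a Poisson-distributed point count and placing the points uniformly in the region; the densities and region size below are illustrative, not the paper's settings.

```python
import numpy as np

def sample_ppp(density, width, height, rng):
    """Sample a homogeneous Poisson point process on a width x height region."""
    n = rng.poisson(density * width * height)  # point count ~ Poisson(density * area)
    xs = rng.uniform(0.0, width, n)
    ys = rng.uniform(0.0, height, n)
    return np.column_stack([xs, ys])

rng = np.random.default_rng(0)
aps = sample_ppp(density=1e-4, width=1000.0, height=1000.0, rng=rng)       # lambda_B
vehicles = sample_ppp(density=5e-4, width=1000.0, height=1000.0, rng=rng)  # lambda_U
```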

B. RESOURCES IN THE SYSTEM
The resources in the proactive vehicular network are divided into radio resources and network resources. Orthogonal frequency division multiple access (OFDMA) is adopted for physical layer transmission, where the radio resources are divided into multiple radio units (RUs), each containing a fixed number of subcarriers and symbols in the slot. A fixed number of adjacent RUs is mapped into the link layer as a radio block (RB), which is defined as the basic scheduling element of the radio resources with a data transmission capacity of l. We denote the set of RBs by J and describe the quality of an RB by its data transmission success rate over a period of time. The partition of radio resources is illustrated in Fig. 2. This paper maps the physical layer radio resources into the link layer, and the AN makes allocation decisions in the link layer. The network resources refer to the APs managed by the AN.
The communication quality of vehicle u is ensured by forming a virtual cell centered on itself, which we denote by C_K^{t,u}. The virtual cell includes the sets of network resources and radio resources that provide multipath communication for vehicle u, C_K^{t,u} = (V_K^{t,u}, Ω^{t−τ,u}). The network resource V_K^{t,u} corresponds to the set of the K APs closest to vehicle u (see Fig. 1). Ω^{t−τ,u} is the RRUI, the set of RBs with the highest communication quality obtained by vehicle u through experience, which was reported to the AN at the last uplink transmission time t − τ.

C. RADIO RESOURCE ALLOCATION BASED ON ''GENERALIZED CLOSED-LOOP''
Since the proactive vehicular network reduces communication latency by discarding the feedback mechanism, the AN cannot directly or indirectly obtain the current wireless channel state. By studying the timing of proactive vehicular network communication more closely, we identify a ''generalized closed-loop'' that operates in a macroscopic and delayed manner (see Fig. 3). Real-time digital map updates and the exchange of control information in autonomous driving induce dense and frequent uplink and downlink communications between the vehicle and the AN [36], which suggests the possibility of using the ''generalized closed-loop'' framework for radio resource allocation.
As shown in Fig. 3, when there is uplink data to be sent at slot t − τ, the vehicle u selects N_v RBs as the RRUI Ω^{t−τ,u} according to a certain strategy (discussed in Section III) and sends it to the AN together with the data. The vehicle u then monitors the RRUI to receive downlink data. At slot t, the AN intelligently allocates appropriate RBs within Ω^{t−τ,u} as the decision Φ^{t,u} for downlink transmission based on the latest network state. Note that the AN only selects radio resources within Ω^{t−τ,u} because: 1) The proactive vehicular network has no feedback, so the vehicle cannot be informed of the transmission channel in time.
2) The complexity and latency of the resource allocation algorithm can be reduced by narrowing the radio resource selection space.
Since the network state changes rapidly, it is necessary to ensure the timeliness of the RRUI. If the vehicle reports the RRUI to the AN at time t_a, the information remains unchanged for a period of min{t_a′ − t_a, θ}, where t_a′ is the next uplink transmission time, at which the vehicle updates the RRUI. When the vehicle has not updated the RRUI for a long time, it is forced to update once the time limit θ is exceeded. The AN can thus realize cooperative intelligent downlink radio resource management through the interaction in the ''generalized closed-loop'' composed of uplink and downlink.

D. DOWNLINK TRANSMIT MODEL
The number of data tasks is random in every slot. At slot t, the set of data tasks is Λ^t = {Λ^{t,u} | u ∈ U^t}, where U^t is the collection of receiving vehicles for the data tasks at slot t. When a downlink data task Λ^{t,u} arrives at slot t, the AN allocates the K APs nearest to vehicle u to form V_K^{t,u} and makes the radio resource decision Φ^{t,u} for transmission. The transmission model can be viewed as a MISO process: the APs in V_K^{t,u} synchronously send the same packet to vehicle u on each RB in Φ^{t,u}.
The complex channel coefficient of the transmission link from AP b to vehicle u at slot t can be expressed as:

h_b^{t,u} = g_b^{t,u} √((D_b^{t,u})^{−α}),    (1)

where g_b^{t,u} is the small-scale fading experienced between AP b and vehicle u and obeys the Rayleigh distribution. (D_b^{t,u})^{−α} is the path loss between AP b and vehicle u, where D_b^{t,u} is the distance and α is the path loss exponent, ranging from 2 to 5 [37].
Assume that there are N_s RUs in every slot, which can be mapped into N_b RBs, where each RB contains n_m RUs, and RB j is mapped onto the set of RUs Ψ_j. We express the RU occupation information as a matrix S^t ∈ {0, 1}^{B×N_s}: s_{b,m}^t = 1 means that RU m is occupied by AP b to transmit data at slot t, and s_{b,m}^t = 0 indicates that the corresponding RU is not occupied by AP b.
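A minimal sketch of this bookkeeping, with illustrative sizes (B, N_s, n_m are not the paper's values) and an assumed contiguous RB-to-RU mapping Ψ_j:

```python
import numpy as np

B, N_s, n_m = 4, 24, 4        # APs, RUs per slot, RUs per RB (illustrative values)
N_b = N_s // n_m              # number of RBs mapped from the RUs

# Psi[j] lists the RUs that RB j maps onto (contiguous mapping assumed here)
Psi = {j: list(range(j * n_m, (j + 1) * n_m)) for j in range(N_b)}

# S[b, m] = 1 iff AP b occupies RU m in the current slot
S = np.zeros((B, N_s), dtype=int)

def occupy_rb(S, b, j):
    """Mark the RUs of RB j as occupied by AP b."""
    S[b, Psi[j]] = 1

occupy_rb(S, b=0, j=2)
```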
Assume that every AP can occupy at most C_s RUs in a slot for data transmission and that the transmit power of every RU is p. The signal-to-interference ratio (SIR) γ_{j,k}^{t,u} of RB j over the k-th link in V_K^{t,u}, mapped into the physical layer, is given by:

γ_{j,k}^{t,u} = p |h_k^t|² / ( Σ_{b∉V_K^{t,u}} Σ_{m∈Ψ_j} s_{b,m}^t p |h_b^{t,u}|² ),    (2)

where h_k^t is the downlink channel coefficient between the k-th selected AP and vehicle u, and h_b^{t,u} is the downlink channel coefficient between interfering AP b and vehicle u. The noise is comparatively negligible in the presence of strong inter-virtual-cell interference.
In this MISO-equivalent process, we assume that vehicle u adopts the selection combining strategy [38] to obtain the maximum SIR:

γ_j^{t,u} = max_{k ∈ V_K^{t,u}} γ_{j,k}^{t,u}.    (3)

Therefore, the downlink data task can be successfully received by vehicle u under the following condition:

min_{j ∈ Φ^{t,u}} γ_j^{t,u} ≥ γ_th,    (4)

where γ_th is the SIR threshold above which data can be received correctly. Equation (4) means that every RB occupied by the data task must be received correctly.
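The selection-combining reception rule can be sketched as follows; `p`, the channel coefficients, and the aggregate interference power are illustrative inputs rather than quantities computed from the paper's full model:

```python
import numpy as np

def max_sir_over_links(p, h_sel, interference):
    """Selection combining: keep the best-SIR link among the K serving APs.

    h_sel: channel coefficients of the K selected APs for one RB,
    interference: total received interference power on that RB (assumed given).
    """
    sirs = p * np.abs(h_sel) ** 2 / interference
    return sirs.max()

def task_received(p, h_sel_per_rb, interference_per_rb, gamma_th):
    """A task succeeds only if every occupied RB clears the SIR threshold."""
    return all(
        max_sir_over_links(p, h, i) >= gamma_th
        for h, i in zip(h_sel_per_rb, interference_per_rb)
    )
```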

E. PROBLEM STATEMENT
Since the data traffic in the vehicular network is not stable, the instantaneous data transmission success rate is not an appropriate indicator of the performance of the entire network. We define the long-term data transmission success rate of the entire system as the downlink reliability indicator, denoted by ρ:

ρ = lim_{T→∞} ( Σ_{t=1}^{T} Σ_{u∈U^t} δ(t, u) ) / ( Σ_{t=1}^{T} O(t) ),    (5)

where Σ_{u∈U^t} δ(t, u) is the number of successful transmissions and O(t) is the number of data tasks arriving from the core network at the AN at slot t. Therefore, we formulate the optimization problem of the downlink data transmission success rate as:

max_Φ ρ
s.t. c1: Σ_{m=1}^{N_s} s_{b,m}^t ≤ C_s, ∀b ∈ B,
     c2: min_{j ∈ Φ^{t,u}} γ_j^{t,u} ≥ γ_th, n^{t,u} · l ≥ R^{t,u}, ∀u ∈ U^t,    (6)

where Φ = {Φ^{t,u} | u ∈ U^t, t = 1, 2, ..., T} is the set of radio resource allocation decisions for all slots from 1 to T. Condition c1 indicates that the number of RUs each AP occupies at every slot cannot exceed C_s. Condition c2 defines the requirements for successful data transmission in downlink communication, where n^{t,u} and R^{t,u} are the number of RBs occupied for transmission and the rate requirement of Λ^{t,u}, respectively.
When we fix the time variable, for each slot, (6) can be transformed into the famous traveling salesman problem, a typical NP-hard problem; therefore, (6) is also NP-hard. Since Φ is closely related to the RRUI, in order to solve (6), we need to study and optimize both the generation of the RRUI on the vehicle side and the radio resource allocation of the AN on the network side.

Algorithm 1 Local Experience Based RRUI Generation Algorithm (LE-RRUI)
Input: the set of RBs J; the communication quality J_l^{t−τ,u} of the RBs based on local experience statistics.
1: for i = 1 : N_v do
2:   x = random(0, 1).
3:   if x < ε_v then
4:     Vehicle randomly selects an RB from J \ Ω^{t−τ,u} as RB i.
5:   else
6:     Vehicle selects the highest communication quality RB in J_l^{t−τ,u} \ Ω^{t−τ,u} as RB i.
7:   end if
8:   Add the chosen RB i to Ω^{t−τ,u}.
9: end for

III. RRUI GENERATION STRATEGY FOR VEHICLES
In the vehicle-assisted downlink radio resource allocation model, since the AN's radio resource decision Φ^{t,u} is generated within the RRUI, the generation strategy of the RRUI is significant for achieving the optimal resource allocation decision and improving the data transmission success rate of the overall system. Here we elaborate on the specific generation of Ω^{t−τ,u}.

A. LOCAL EXPERIENCE BASED RRUI GENERATION STRATEGY
When vehicle u initially accesses the proactive network, it performs active network association by sensing nearby APs and randomly selects radio resources for uplink transmission. Then, during the communication process, vehicle u locally counts the number of successful transmissions on the system RBs within a period of time to measure the communication quality of the RBs, and selects N_v RBs as the RRUI Ω^{t−τ,u} to send to the AN. We use J_l^{t−τ,u} to represent the communication quality of the system RBs based on the local experience statistics of vehicle u at slot t − τ. To balance exploration and exploitation, we assume that vehicle u adopts an ε-greedy policy when selecting RBs for the RRUI. Algorithm 1 is the local experience based RRUI generation algorithm.
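A possible reading of the ε-greedy LE-RRUI selection, as a sketch (the quality table and ε_v value are illustrative, and ties are broken by `max`):

```python
import random

def le_rrui(J, quality, N_v, eps, rng):
    """Local-experience RRUI: epsilon-greedy pick of N_v distinct RBs.

    quality[j] is the vehicle's local success count for RB j.
    """
    chosen = []
    for _ in range(N_v):
        remaining = [j for j in J if j not in chosen]
        if rng.random() < eps:
            rb = rng.choice(remaining)                      # explore
        else:
            rb = max(remaining, key=lambda j: quality[j])   # exploit best local quality
        chosen.append(rb)
    return chosen

rng = random.Random(1)
quality = {0: 5, 1: 9, 2: 7, 3: 1}
rrui = le_rrui(J=[0, 1, 2, 3], quality=quality, N_v=2, eps=0.1, rng=rng)
```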

B. AN-ASSISTED RRUI GENERATION STRATEGY
There are certain limitations when the vehicle only utilizes local data transmission information to generate the RRUI. Due to the rapid changes of the network, a single vehicle's local-experience quality statistics of the system RBs update slowly; their timeliness is not high enough, and they cannot fully represent the current radio resource quality. When we shift the focus to the vehicle side of the ''generalized closed-loop'' communication at slot t − τ, the AN can guide the vehicle to generate the RRUI with the global information on RB quality through the downlink transmission at slot t − τ − τ′. Fig. 4 shows the AN-assisted RRUI generation process. We improve the RRUI generation strategy based on Algorithm 1. Assume that when selecting RBs to generate the RRUI, the vehicle selects the best quality RB from the global RB quality information J_a^{t−τ−τ′,u} provided by the AN with probability ε_a, and selects an RB based on local experience with probability 1 − ε_a − ε_v. It should be noted that 1 − ε_a − ε_v > 0: it would be unreasonable to generate the RRUI using only the J_a^{t−τ−τ′,u} sent by the AN. Because of the delay in J_a^{t−τ−τ′,u}, even the global information from the network cannot fully represent the current radio resource quality. On the other hand, the local experience of the vehicle is best suited to the current location and environment and has reference value. For detailed algorithm steps, please refer to Algorithm 2.
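The mixed selection rule might be sketched as below; `eps_a` stands for the probability of following the AN's global ranking (the symbol is our reconstruction of a garbled one), and both quality tables are illustrative:

```python
import random

def aa_rrui(J, local_q, global_q, N_v, eps_v, eps_a, rng):
    """AN-assisted RRUI: mix exploration, AN-provided global quality, and local experience.

    eps_v + eps_a < 1 must hold so that local experience always keeps some weight.
    """
    assert eps_v + eps_a < 1.0
    chosen = []
    for _ in range(N_v):
        remaining = [j for j in J if j not in chosen]
        x = rng.random()
        if x < eps_v:
            rb = rng.choice(remaining)                       # explore
        elif x < eps_v + eps_a:
            rb = max(remaining, key=lambda j: global_q[j])   # follow AN's global view
        else:
            rb = max(remaining, key=lambda j: local_q[j])    # local experience
        chosen.append(rb)
    return chosen

rng = random.Random(1)
local_q = {0: 9, 1: 1, 2: 8, 3: 2}
global_q = {0: 1, 1: 9, 2: 2, 3: 3}
rrui = aa_rrui([0, 1, 2, 3], local_q, global_q, N_v=2, eps_v=0.05, eps_a=0.3, rng=rng)
```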
The vehicle selects the RBs with good quality based on the local and global RB communication quality information and sends them to the AN. After that, the vehicle receives downlink data by monitoring the RBs in the RRUI and updates its local RB quality statistics using the RB occupation information observed in the downlink transmission.

IV. DOWNLINK RADIO RESOURCE ALLOCATION SOLUTIONS FOR AN
After a downlink data task arrives, the AN needs to make an appropriate radio resource allocation decision according to the RRUI provided by the target vehicle, so as to maximize the optimization objective in (6). To the best of our knowledge, there is no proper and mature solution for proactive network downlink radio resource allocation to use as a reference. In order to verify the effectiveness, pros, and cons of DRL-RRA, this paper designs two benchmark solutions worthy of reference.

Algorithm 2 AN-Assisted RRUI Generation Algorithm (AA-RRUI)
Input: the set of RBs J; the local quality statistics J_l^{t−τ,u}; the global quality information J_a^{t−τ−τ′,u} provided by the AN.
1: for i = 1 : N_v do
2:   x = random(0, 1).
3:   if x < ε_v then
4:     Vehicle randomly selects an RB from J \ Ω^{t−τ,u} as RB i.
5:   else if x < ε_v + ε_a then
6:     Vehicle selects the highest communication quality RB in J_a^{t−τ−τ′,u} \ Ω^{t−τ,u} as RB i.
7:   else
8:     Vehicle selects the highest communication quality RB in J_l^{t−τ,u} \ Ω^{t−τ,u} as RB i.
9:   end if
10:  Add the chosen RB i to Ω^{t−τ,u}.
11: end for

A. TWO BENCHMARK SOLUTIONS
Given the system scenario in (6), there exist two immediate realizations. One is the random selection scheme, i.e., selection with equal probability within the range of optional radio resources without considering any other factors. This scheme is intuitive and self-explanatory. If a radio resource allocation scheme cannot even match the performance of random selection, the scheme is invalid or even harms the system.
Another realization is a heuristic algorithm that obtains an approximate solution to the NP-hard problem. Since it is difficult to solve the highly dynamic stochastic optimization problem over a variable horizon, we divide time into intervals and transform the dynamic problem into per-slot deterministic optimizations. The optimization goal changes from ρ to the data transmission success rate in each slot, represented by:

ρ_{t_0} = Σ_{u∈U^{t_0}} δ(t_0, u) / O(t_0),    (7)

where Φ^{t_0} is the set of radio resource allocation decisions for the data tasks at slot t_0, and t_0 ∈ {1, 2, ..., T}. A data task can select multiple RBs, and each RB can only be selected by one data task in a slot. This many-to-one radio resource matching problem in a fixed slot fits the network flow model [39]. The network flow topology G = (V, E, M, H) for radio resource allocation is shown in Fig. 5, where V is the set of vertices of the network graph, including the virtual source and sink points SO and TA, the O(t) downlink data task points, and the N_b RB points. E is the set of directed edges in the network graph; M[i, n] is the capacity of the edge from i to n, and H[i, n] is the cost per unit flow. In the network flow topology, the path selection from the data tasks to the system RBs is the process of radio resource allocation. The attributes of the edges between the data tasks and the system RBs need special explanation: M[i, n] = 1 (u ∈ U^t) means that each task can be matched with an RB at most once, and H[i, n] = μ_j^{t,u} (u ∈ U^t, j = 1, 2, ..., N_b) indicates whether the transmission data occupying RB j can be successfully received, where μ_j^{t,u} = 1 if γ_j^{t,u} ≥ γ_th and μ_j^{t,u} = 0 otherwise. In order to maximize the number of successful data tasks per slot, we set H[i, n] = 0 for the edges starting at SO and those ending at TA. Finally, the optimization is treated as a maximum cost maximum flow (MCMF) problem, solved by the shortest path faster algorithm (SPFA).
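As a sketch of this benchmark, the per-slot matching can be solved with a small SPFA-based min-cost max-flow routine, negating the edge costs so that maximizing successful receptions becomes a min-cost problem; the toy instance below (two tasks, three RBs, unit demands) is illustrative, not the paper's topology:

```python
from collections import deque

class MCMF:
    """Successive-shortest-path min-cost max-flow; shortest paths found with SPFA."""
    def __init__(self, n):
        self.n = n
        self.g = [[] for _ in range(n)]   # adjacency lists of edge indices
        self.e = []                       # edge store: [to, capacity, cost]

    def add_edge(self, u, v, cap, cost):
        self.g[u].append(len(self.e)); self.e.append([v, cap, cost])
        self.g[v].append(len(self.e)); self.e.append([u, 0, -cost])  # residual edge

    def flow(self, s, t):
        total_flow = total_cost = 0
        while True:
            dist = [float("inf")] * self.n
            prev = [-1] * self.n
            inq = [False] * self.n
            dist[s] = 0
            q = deque([s]); inq[s] = True
            while q:                              # SPFA relaxation loop
                u = q.popleft(); inq[u] = False
                for ei in self.g[u]:
                    v, cap, cost = self.e[ei]
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = ei
                        if not inq[v]:
                            q.append(v); inq[v] = True
            if dist[t] == float("inf"):
                return total_flow, total_cost
            f, v = float("inf"), t                # bottleneck along the found path
            while v != s:
                ei = prev[v]; f = min(f, self.e[ei][1]); v = self.e[ei ^ 1][0]
            v = t
            while v != s:                         # push flow, update residuals
                ei = prev[v]
                self.e[ei][1] -= f; self.e[ei ^ 1][1] += f
                v = self.e[ei ^ 1][0]
            total_flow += f; total_cost += f * dist[t]

# Toy slot: SO=0, tasks {1,2}, RBs {3,4,5}, TA=6; cost -1 on a task->RB edge
# means that RB would be received correctly by that task (mu = 1).
net = MCMF(7)
net.add_edge(0, 1, 1, 0); net.add_edge(0, 2, 1, 0)
net.add_edge(1, 3, 1, -1); net.add_edge(1, 4, 1, -1)
net.add_edge(2, 4, 1, -1); net.add_edge(2, 5, 1, 0)   # RB 5 would fail for task 2
net.add_edge(3, 6, 1, 0); net.add_edge(4, 6, 1, 0); net.add_edge(5, 6, 1, 0)
served, cost = net.flow(0, 6)   # -cost counts the successful receptions
```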

B. DOWNLINK RADIO RESOURCE ALLOCATION WITH DEEP REINFORCEMENT LEARNING
The SPFA-based radio resource allocation algorithm above has limitations in solving the NP-hard problem: it can only allocate radio resources within a single slot and obtains only an approximation of the optimal radio resource allocation. Reinforcement learning, one of the paradigms of machine learning, can utilize existing data in the proactive vehicular network to maximize specific goals while interacting with the environment. This fits our dynamic optimization system well and can solve the radio resource allocation decision-making problem of highly dynamic systems.

1) THE BASIC MODEL OF REINFORCEMENT LEARNING
Consider the AN in the proactive vehicular network as an agent. For the downlink data task Λ^{t,u}, given the RU occupation matrix S^t at slot t and the RRUI Ω^{t−τ,u} from slot t − τ, the AN makes an RB allocation decision and obtains the corresponding reward through the next uplink transmission. The whole process is a semi-Markov process, with the state, action, and reward defined as follows.

a: STATE
We take the RU occupation matrix S^t in the network as the environment state. One dimension of the matrix represents the set of all APs controlled by the AN, and the other represents the set of all RUs in the network. Each element of the matrix is a 0-1 variable.

b: ACTION
The action is defined as A^t = (a_1^t, a_2^t, ..., a_{N_b}^t), where a_j^t = 1 means RB j is utilized for data transmission and a_j^t = 0 otherwise. There are 2^{N_b} possible actions at every step.
As N_b increases, the action space grows exponentially. Since Φ^{t,u} can only be selected within Ω^{t−τ,u}, a large number of actions are unreasonable and would degrade learning performance. Therefore, we stipulate that at most N_v RBs can be selected for transmission in each downlink transmission. In this way, the size of the action space is reduced to Σ_{i=1}^{N_v} C(N_b, i). It should be particularly noted that when a utilized RB is selected and mapped into physical layer RUs for transmission, some of those RUs may already be occupied. In that case, the occupied RUs are discarded; this operation is called dropout in this algorithm.
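Restricting actions to subsets of the RRUI with at most N_v RBs can be enumerated directly; the RB indices below are illustrative:

```python
from itertools import combinations

def build_action_space(rrui, N_v):
    """Enumerate legal actions: non-empty subsets of the RRUI with at most N_v RBs.

    Restricting actions to the RRUI prunes the full 2**N_b action space.
    """
    actions = []
    for k in range(1, N_v + 1):
        actions.extend(combinations(rrui, k))
    return actions

actions = build_action_space(rrui=[3, 7, 9], N_v=2)  # C(3,1) + C(3,2) subsets
```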

c: REWARD
We use whether a downlink task is successfully transmitted as the reward, which can be obtained from the subsequent uplink communication of the vehicle. When there is downlink data to be sent, according to the system state and the selected action, the reward is designated as:

R^t(s, a) = 1, if min_{j∈Φ^{t,u}} γ_j^{t,u} ≥ γ_th and Σ_{m=1}^{N_s} s_{b,m}^t ≤ C_s;
R^t(s, a) = 0, if the SIR requirement is not met;
R^t(s, a) = −1, if the RU occupation requirement is not met,    (9)

where s = S^t and a = A^t. At slot t, if the downlink transmission SIR requirement (min_{j∈Φ^{t,u}} γ_j^{t,u} ≥ γ_th) and the AP's RU occupation requirement (Σ_{m=1}^{N_s} s_{b,m}^t ≤ C_s) are both met, the data transmission is successful and the reward is 1. If the SIR requirement cannot be met, the data transmission fails and the reward is 0. If the RU occupation requirement cannot be met, the action selection is unreasonable and a penalty of −1 is applied.
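The piecewise reward can be sketched as a small function; the inputs (per-RB SIRs and per-AP RU counts) are assumed to be computed elsewhere:

```python
def reward(sirs, gamma_th, rus_used_per_ap, C_s):
    """Piecewise reward (sketch):

    +1  all occupied RBs clear the SIR threshold and no AP exceeds C_s RUs,
     0  the SIR requirement fails (transmission failure),
    -1  an AP would occupy more than C_s RUs (illegal action, penalized).
    """
    if any(n > C_s for n in rus_used_per_ap):
        return -1
    if min(sirs) >= gamma_th:
        return 1
    return 0
```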
The problem that the reinforcement learning model needs to solve is to find the optimal policy π * : π * = arg max π ρ. (10)

2) DEEP Q-NETWORK STRATEGY
The Q-learning method has been widely exploited for solving reinforcement learning problems but becomes infeasible when there are numerous state-action pairs: traversing all the samples stored in a Q-table at each step is challenging. To overcome this drawback of Q-learning, the Deep Q-Network (DQN) algorithm is used in this paper. At step t, the AN takes action A^t = π(S^t) through policy π under the current state S^t. The state-action value function Q^π(S^t, A^t) is the expected return and can be expressed as:

Q^π(S^t, A^t) = E[ Σ_{i=0}^{∞} ϕ^i R^{t+i} | S^t, A^t ],    (11)

where ϕ ∈ [0, 1] is the discount factor balancing the current reward and the long-term reward. In the Q-learning algorithm, the update of Q(s, a) is:

Q(s, a) ← Q(s, a) + β [ R + ϕ max_{a′} Q(s′, a′) − Q(s, a) ],    (12)

where β ∈ (0, 1] is the learning rate. Q(s, a) is recorded in the Q-table, and the AN selects the action with the largest Q(s, a) value.
In the DQN strategy, neural networks are used to estimate Q(s, a) in order to deal with large state and action spaces. We define the evaluated Q-network Q(s, a; ω) and the target Q-network Q(s, a; ω′); the weights ω of the evaluated network are updated according to the target value y_t:

y_t = R_t + ϕ max_{a′} Q(S_{t+1}, a′; ω′).    (13)

The target weights ω′ are updated periodically by copying ω, which removes correlations in the observation sequence.
In order to further improve the stability of agent learning, DQN introduces an experience replay mechanism. The agent stores the experience e_t = (S^t, A^t, R^t, done, S^{t+1}) of each step in the experience replay memory buffer, randomly selects a set of experience samples from it, and then trains the network weights through gradient descent to minimize the loss between the network output and the target value y_t in (13).
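The target computation and replay buffer can be sketched as follows; `q_target_net` is a stand-in callable rather than the paper's trained network, and ϕ = 0.5 is illustrative:

```python
import random
import numpy as np

def dqn_targets(batch, q_target_net, phi):
    """y = r for terminal transitions, else r + phi * max_a' Q_target(s', a')."""
    ys = []
    for (s, a, r, done, s_next) in batch:
        if done:
            ys.append(r)
        else:
            ys.append(r + phi * np.max(q_target_net(s_next)))
    return np.array(ys)

class ReplayBuffer:
    """Fixed-size experience replay: store (s, a, r, done, s') and sample uniformly."""
    def __init__(self, capacity, rng):
        self.buf, self.capacity, self.rng = [], capacity, rng
    def push(self, exp):
        if len(self.buf) >= self.capacity:
            self.buf.pop(0)          # evict the oldest experience
        self.buf.append(exp)
    def sample(self, k):
        return self.rng.sample(self.buf, k)

q_target = lambda s: np.array([1.0, 3.0])            # stand-in target network output
batch = [(0, 0, 1.0, True, 1), (0, 1, 1.0, False, 1)]
ys = dqn_targets(batch, q_target, phi=0.5)
```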
In the following, we will introduce the training and testing operations of the proposed DRL-based radio resource allocation algorithm on the network side.

3) TRAINING AND TEST
The DRL-based radio resource allocation scheme on the network side has two stages: training and testing. The training stage trains the Q-network by simulating the generation of the downlink data flow and the interaction between the vehicle and the AN. In the testing phase, the AN first loads the trained Q-network parameters ω and ω′ and initializes the experience replay memory buffer, and then interacts with the environment. Actions taken by the AN are chosen based on the output of the Q-network with the loaded parameters, and states are generated depending on its local observations [40]. Fig. 6 and Algorithm 3 provide the framework and procedure of the DRL-based radio resource allocation algorithm, respectively.
It should be noted that the main parameters of our proposed DRL-based radio resource allocation model, such as the state and action dimensions, are only related to the resources (the numbers of APs and RBs) managed by the AN in the system; they are independent of other environmental variables such as the number of vehicles and the arrival of downlink data tasks. In a real environment, the number of APs managed by each AN and the number of radio resources are fixed, which gives this scheme good adaptability and allows it to be quickly deployed on other ANs in the edge network.

4) LATENCY ANALYSIS OF PROACTIVE NETWORK RADIO RESOURCE ALLOCATION SCHEME BASED ON DRL
How to implement an effective radio resource allocation scheme in the low-latency proactive vehicular network is our concern. In this paper, we optimize the latency in the following ways: • Fog computing is used for distributed radio resource allocation at the AN, which reduces the problem scale and the solution latency compared with centralized radio resource management in the core network.
• To address the convergence problem of DQN, we prune the action space by deleting illegal actions when setting the actions, which removes redundancy and speeds up the convergence of the algorithm.
• Considering that in a real deployment the computational complexity is related to the structure of the Q-network, redundant hidden layers of the neural network would increase the computational latency. Therefore, we use two fully connected (FC) layers as the hidden layers of the neural network, which reduces the computational latency while ensuring fitting accuracy. The specific settings of the neural network are shown in Fig. 7.

Algorithm 3 DRL-Based Radio Resource Allocation Algorithm (DRL-RRA)
1: Initialization: Initialize the evaluated Q-network and the target Q-network with parameters ω and ω′.
2: for episode = 1 : M do
3: Initialize the proactive vehicular network environment.
4: for t = 1 : T do
5: AN receives the set of downlink data tasks of slot t.
6: AN gets the downlink data task of vehicle u.
7: AN senses the current environment state S_i.
8: AN makes action A_i according to the task and S_i based on the ε-greedy policy.
9: AN obtains reward R_i and next state S_{i+1}.
10: AN stores (S_i, A_i, R_i, done, S_{i+1}) in the experience replay memory.
11: Sample a random minibatch of experiences (S_k, A_k, R_k, done, S_{k+1}) from the experience replay memory.
12: if the episode terminates at step k + 1 then
13: y_k = R_k,
14: else
15: y_k = R_k + γ max_{A′} Q(S_{k+1}, A′; ω′),
16: end if
17: Train the evaluated Q-network to minimize L(ω).
18: Every P steps, update the target Q-network.
19: end for
20: end for
• The parameters used by the DRL-based radio resource allocation scheme are trained offline and can be directly put into online use in the real environment.
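The masked ε-greedy action selection and the one-step target of Algorithm 3 can be sketched as below. This is a hedged illustration in NumPy rather than the paper's TensorFlow 1.13 implementation; the legality mask encodes the deletion of illegal actions described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_epsilon_greedy(q_values, legal_mask, epsilon):
    """Pick an action, restricting both exploration and exploitation to
    legal actions (illegal actions are pruned from the action space)."""
    legal_actions = np.flatnonzero(legal_mask)
    if rng.random() < epsilon:
        return int(rng.choice(legal_actions))      # explore among legal actions
    q = np.where(legal_mask, q_values, -np.inf)    # mask out illegal actions
    return int(np.argmax(q))

def dqn_target(reward, done, next_q_values, gamma=0.99):
    """One-step TD target y_k used in the loss L(omega):
    y_k = R_k if the episode terminates, else
    y_k = R_k + gamma * max_{A'} Q(S_{k+1}, A'; omega')."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_values))
```

Masking with -inf before the argmax guarantees that greedy exploitation can never pick a pruned action, which is what shrinks the effective action space and speeds up convergence.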

V. SIMULATIONS
A. SYSTEM SETTINGS
To obtain the numerical results, we use a CPU-based server with a 3.70 GHz Intel Core i9-10900K processor and 64 GB RAM; the software environment is Python 3.7.6 with TensorFlow 1.13.0 and Hmmlearn 0.2.7. The arrival process of the data flow in the proactive vehicular network conforms to certain spatio-temporal characteristics and can be described by a structured time series model based on the hidden Markov model [41]. Assume that the number of data tasks arriving in the network is divided into three states: trough, mid-term, and peak, corresponding to Y_1, Y_2, and Y_3 data tasks, respectively. Then the number of data tasks in each slot obeys λ_t = (π_d, F, Z), where π_d is the initial state probability matrix, F is the hidden state transition probability matrix, and Z is the observation state transition probability matrix. We use Hmmlearn 0.2.7 to simulate the arrival of the data flow.
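The HMM-driven arrival process can be sketched as follows. This is a self-contained NumPy sketch rather than the paper's Hmmlearn setup; the matrices π_d and F and the per-state task counts are illustrative placeholders, and for simplicity each hidden state emits its task count deterministically (the paper additionally uses an observation matrix Z):

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameters (not the paper's values): three hidden load
# states -- trough, mid-term, peak -- emitting Y_1, Y_2, Y_3 data tasks.
pi_d = np.array([0.6, 0.3, 0.1])          # initial state probabilities
F = np.array([[0.8, 0.15, 0.05],          # hidden state transition matrix
              [0.2, 0.6,  0.2 ],
              [0.1, 0.3,  0.6 ]])
Y = np.array([2, 5, 9])                   # tasks emitted per state (placeholder)

def sample_arrivals(n_slots):
    """Sample the number of downlink data tasks per slot from the HMM."""
    state = rng.choice(3, p=pi_d)          # draw the initial hidden state
    arrivals = []
    for _ in range(n_slots):
        arrivals.append(int(Y[state]))
        state = rng.choice(3, p=F[state])  # follow the transition matrix
    return arrivals
```

Because the transition matrix F is diagonally dominant, sampled traces show the bursty, temporally correlated load that motivates the trough/mid-term/peak states.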
The basic parameters used in our system simulation are shown in Table 3. In the urban environment, there are 14 APs and 64 vehicles, and the channel bandwidth is 10 MHz [35]. Meanwhile, in every slot each vehicle randomly selects one of eight directions and moves a certain distance.

B. RESULTS AND ANALYSIS
On the vehicle side, we propose two RRUI generation strategies, based on local experience and AN assistance, respectively. On the network side, after receiving the RRUI, the AN can make resource allocation decisions through three algorithms: DRL-RRA, the SPFA-based radio resource allocation algorithm (SPFA-RRA), and the random radio resource allocation algorithm (R-RRA).

We first analyze the effectiveness of the two RRUI generation algorithms. To verify the effectiveness of LE-RRUI and AA-RRUI, we compare them with a random RRUI generation strategy. Fig. 8 shows the performance of the six radio resource allocation algorithms under the condition of N_v = 4, an AN-assistance probability of 0.3, and a resource load rate of 40%, where the resource load rate of each slot is the ratio of the number of occupied resources in S_t to the total number of resources, and is used to measure the density of the downlink data flow in the network. When the AN uses R-RRA, the data transmission success rates under LE-RRUI and AA-RRUI are 1.6 and 2 times that of the random RRUI generation algorithm, respectively, and the success rate of AA-RRUI is 10.7% higher than that of LE-RRUI. With the RRUI generation algorithm fixed, the superiority of DRL-RRA on the network side becomes apparent: when the RRUI is randomly generated, the data transmission success rate of DRL-RRA is 1.4 and 2.4 times that of SPFA-RRA and R-RRA, respectively. AA-RRUI combined with DRL-RRA achieves a data transmission success rate of 99.3%. Fig. 9 shows that DRL-RRA converges well.

Next, we discuss the parameters of the vehicle-side RRUI generation algorithm. Fig. 10 describes the effect of N_v (the number of RBs in the RRUI) on the performance of the downlink resource allocation algorithms when the AN-assistance probability is 0.3 and the resource load rate is 40%. The optimal value of N_v is in the range of 2 to 6.
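The resource load rate used throughout these experiments can be computed as below. This sketch assumes S_t is a binary AP × RB occupancy grid, which is an assumption about the state layout rather than something the text specifies:

```python
import numpy as np

def resource_load_rate(state):
    """Resource load rate of a slot: the number of occupied entries of
    the occupancy grid S_t divided by the total number of resources.
    Assumes a nonzero entry means the resource is occupied."""
    state = np.asarray(state)
    return np.count_nonzero(state) / state.size
```

For example, a 2 × 5 grid with four occupied resources has a load rate of 0.4, matching the 40% operating point used in Figs. 8, 10, and 11.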
If N_v continues to increase, the data transmission success rates of the six radio resource allocation algorithms all drop significantly. When N_v approaches the total number of RBs in the system, it becomes meaningless for the vehicle to send RRUI to guide the AN's resource allocation. The DRL-RRA and SPFA-RRA algorithms on the network side are sensitive to the size of N_v. In particular, the size of N_v in the RRUI directly determines the size of the action space in DRL-RRA: if N_v is too large, the action space grows and degrades the learning performance and convergence of DQN. Fig. 11 shows the effect of the AN-assistance probability (the probability of selecting RBs from the global RB quality information sent by the AN) in AA-RRUI on the performance of the algorithms when N_v = 4 and the resource load rate is 40%.
An AN-assistance probability of 0 means that the vehicle adopts LE-RRUI. AA-RRUI outperforms LE-RRUI because the number of data transmissions of a single vehicle in a fixed period is much lower than that of the entire network, so estimating the RBs' quality from local experience alone has a certain deviation. However, since local experience better reflects the vehicle's current location and radio environment, the local data transmission experience cannot be discarded entirely. Owing to the learning ability of DRL-RRA, its data transmission success rate increases only slightly as the AN-assistance probability grows, reaching 99.3% at a probability of 0.4. The performance fluctuations of SPFA-RRA and R-RRA caused by this probability are 10% and 14%, respectively. DRL-RRA, SPFA-RRA, and R-RRA all achieve higher data transmission success rates when the probability is in the range of 0.2 to 0.6.

Since a downlink data task must be sent in the slot after it reaches the AN, an increase of the downlink data traffic leads to an increase of the resource load rate. Fig. 12 shows the data transmission success rates of the six radio resource allocation algorithms under different resource load rates. R-RRA is the most sensitive to the resource load: since its radio resource selection is completely random, once the number of occupied resources in the network increases, the interference during downlink transmission increases. When the resource load rate in the vehicular network does not exceed 40%, both DRL-RRA and SPFA-RRA maintain high data transmission success rates, and the joint AA-RRUI and DRL-RRA radio resource allocation algorithm achieves a data transmission success rate of more than 90%. Unfortunately, as the average resource load rate further increases, the performance of both algorithms degrades significantly. When the resource load rate reaches 80%, the performance of DRL-RRA and SPFA-RRA drops to around 80% and 62%, respectively.
When the downlink data flow in the proactive vehicular network is dense, appropriately increasing the network resources (the number of APs) in each vehicle's virtual cell is an intuitive and effective way to improve the reliability of data transmission: as long as the SIR of one link is higher than the SIR threshold when a vehicle receives data, the transmission is successful. Fig. 13 shows the relationship between the number of APs K in a virtual cell and the data transmission success rate when the resource load rate reaches 60%. Assuming that the AA-RRUI algorithm is used on the vehicle side, increasing the number of APs in the virtual cell improves the data transmission success rate of the DRL-RRA algorithm. As K increases from 3 to 5, the success rate of data transmission increases significantly, and when K reaches 6, the data transmission success rate of DRL-RRA reaches 98%. This indicates that K is a critical design factor under high traffic load.
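The macro-diversity success criterion stated above (a transmission succeeds if at least one of the K AP links exceeds the SIR threshold) can be written directly; the SIR values and threshold below are placeholders in linear scale:

```python
def transmission_success(link_sirs, sir_threshold):
    """A downlink transmission succeeds if at least one of the K AP
    links in the vehicle's virtual cell exceeds the SIR threshold."""
    return any(sir > sir_threshold for sir in link_sirs)
```

Because the per-link failure events are combined with a logical OR, adding APs to the virtual cell can only raise the success probability, which is why increasing K helps under heavy load at the cost of a higher resource load rate.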

VI. CONCLUSION
This paper proposed an intelligent radio resource allocation scheme for an ultra-low latency vehicular network. Aiming to break the status quo of blind radio resource selection at the AN caused by open-loop communication without CSI feedback, we constructed a downlink radio resource allocation framework based on the ''generalized closed-loop'', guided by the RRUI from the vehicle's immediate past uplink transmission. Subsequently, we took the long-term success rate of data transmission as the reliability indicator and established a downlink radio resource allocation model based on cooperation between vehicles and the AN to optimize the success rate of data transmission. On the vehicle side, we proposed two RRUI generation strategies, LE-RRUI and AA-RRUI, which select RBs of high communication quality from local and global experience, respectively, as the RRUI. On the network side, we proposed an intelligent radio resource allocation algorithm based on deep reinforcement learning (DRL-RRA). Simulations verified the effectiveness of the joint radio resource allocation algorithm under the cooperation between the vehicle side and the network side.
We noted that the performance of the joint vehicle-and-AN radio resource allocation scheme was limited when the resource occupancy rate exceeded 60%. Increasing the number of APs in a virtual cell could improve the success rate of data transmission, but it would also further increase the resource load rate. This suggests a further research direction: in this paper, we studied radio resource allocation with a fixed number of network resources (APs) in the virtual cell; if each downlink data task could instead be flexibly allocated along the two dimensions of network resources and radio resources, the success rate of data transmission could be further improved. The difficulty is that the resource allocation model becomes more complex, and realizing such fine-grained resource allocation under extremely low latency places high demands on algorithm performance.