Learn to Compress CSI and Allocate Resources in Vehicular Networks

Resource allocation has a direct and profound impact on the performance of vehicle-to-everything (V2X) networks. In this paper, we develop a hybrid architecture consisting of centralized decision making and distributed resource sharing (the C-Decision scheme) to maximize the long-term sum rate of all vehicles. To reduce the network signaling overhead, each vehicle uses a deep neural network to compress its observed information that is thereafter fed back to the centralized decision making unit. The centralized decision unit employs a deep Q-network to allocate resources and then sends the decision results to all vehicles. We further adopt a quantization layer for each vehicle that learns to quantize the continuous feedback. In addition, we devise a mechanism to balance the transmission of vehicle-to-vehicle (V2V) links and vehicle-to-infrastructure (V2I) links. To further facilitate distributed spectrum sharing, we also propose a distributed decision making and spectrum sharing architecture (the D-Decision scheme) for each V2V link. Through extensive simulation results, we demonstrate that the proposed C-Decision and D-Decision schemes can both achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise.

experience on wheels safer and more convenient [1]. V2X-enabled coordination among vehicles, pedestrians, and other entities on the road can alleviate traffic congestion and improve road safety, in addition to providing ubiquitous infotainment services [2]-[4]. Recently, the 3rd generation partnership project (3GPP) has begun to support V2X services in long-term evolution (LTE) [5] and, further, in the fifth generation (5G) mobile communication networks [6]. Cross-industry alliances, such as the 5G automotive association (5GAA), have also been founded to push the development, testing, and deployment of V2X technologies.
Due to high mobility of vehicles and complicated time-varying communication environments, it is very challenging to guarantee the diverse quality-of-service (QoS) requirements in vehicular networks, such as extremely large capacity, ultra reliability, and low latency [7]. To address such issues, efficient resource allocation for spectrum sharing becomes necessary in the V2X scenario.
Existing works on spectrum sharing in vehicular networks can be mainly categorized into two classes: centralized schemes [8]-[11] and distributed approaches [12], [13]. For the centralized schemes, decisions are usually made centrally at a given node, such as the head of a cluster or the base station (BS) in a given coverage area. Novel graph-based resource allocation schemes have been proposed in [8] and [9] to maximize the vehicle-to-infrastructure (V2I) capacity, exploiting the slow fading statistics of channel state information (CSI). In [10], an interference hyper-graph based resource allocation scheme has been developed in the non-orthogonal multiple access (NOMA)-integrated V2X scenario with the distance, channel gain, and interference known in each vehicle-to-vehicle (V2V) and V2I group. In [11], a segmentation medium access control (MAC) protocol has been proposed in large-scale V2X networks, where the location information of vehicles is updated. In these schemes, the decision making node needs to acquire accurate CSI, interference information of all the V2V links, and each V2V link's transmit power to make spectrum sharing decisions. However, reporting all such information from each V2V link to the decision making node poses a heavy burden on the feedback links and may even be infeasible in practice.
As for distributed schemes [12], [13], each V2V link makes its own decision with partial or little knowledge of other V2V links. In [12], a distributed shuffling based Hopcroft-Karp algorithm has been devised to handle the subchannel allocation in V2V communications with one-bit CSI broadcasting. In [13], the spatio-temporal traffic pattern has been exploited for distributed load-aware resource allocation for V2V communications with slowly varying channel information. In these methods, V2V links may exchange partial or no channel information with their neighbors before making a decision. However, each V2V link can only observe partial information of its surrounding environment since it is geographically apart from other V2V links in the V2X scenario. This may leave some channels overly congested while others are underutilized, leading to substantial performance degradation.
Notably, the above works usually rely on some level of channel information, such as channel gains, interference, and locations. Such information is usually hard to obtain perfectly in practical wireless communication systems, and it is even more challenging to obtain in the V2X scenario. Fortunately, machine learning enables wireless communication systems to learn their surroundings and feed critical information back to the BS for resource allocation. In particular, reinforcement learning (RL) can make decisions that maximize the long-term return in sequential decision problems, and it has achieved great success in various applications, such as AlphaGo [14]. Inspired by its remarkable performance, the wireless community is increasingly interested in leveraging machine learning for the physical layer and resource allocation design [15]-[23].
In particular, machine learning for future vehicular networks has been discussed in [24] and [25].
In [26], each V2V link is treated as an agent to ensure the latency constraint is satisfied while minimizing interference to V2I link transmission. In [27], a multi-agent RL-based spectrum sharing scheme has been proposed to promote the payload delivery rate of V2V links while improving the sum capacity of V2I links. A dynamic RL scheduling algorithm has been developed to solve the network traffic and computation offloading problems in vehicular networks [28].
In order to fully exploit the advantages of both centralized and distributed schemes while alleviating the requirement on CSI for spectrum sharing in vehicular networks, we propose an RL-based resource allocation scheme with learned feedback. In particular, we devise a distributed CSI compression and centralized decision making architecture to maximize the sum rate of all V2V links in the long run. In this architecture, each V2V link first observes the state of its surrounding channels and adopts a deep neural network (DNN) to learn what to feed back to the decision making unit, such as the BS, instead of sending all observed information directly. To maximize the long-term sum rate of all links, the BS then adopts deep reinforcement learning to allocate spectrum for all V2V links. To further reduce feedback overhead, we adopt a quantization layer in each vehicle's DNN and learn how to quantize the continuous feedback. Besides, to further facilitate distributed spectrum sharing, we devise a distributed spectrum sharing architecture that lets each V2V link make its own decision locally. The contributions of this paper are summarized as follows.
• We leverage the power of DNN and RL to devise a centralized decision making and distributed implementation architecture for vehicular spectrum sharing that maximizes the long-term sum rate of all vehicles. We use a weighted sum rate reward to balance V2I and V2V performance dynamically.
• We exploit the DNN at each vehicle to compress local observations, which is further augmented by a quantized layer, to reduce network signaling overhead while achieving desirable performance.
• We also develop a distributed decision making architecture that allows spectrum sharing decisions to be made at each vehicle locally and binary feedback is designed for signaling overhead reduction.
• Based on extensive computer simulations, we demonstrate both of the proposed architectures can achieve near-optimal performance and are robust to feedback interval variations, input noise, and feedback noise. In addition, the optimal number of continuous feedback and feedback bits for each V2V link are presented that strike a balance between signaling overhead and performance loss.
The rest of this paper is organized as follows. The system model is presented in Section II. Then, the BS aided spectrum sharing architecture, including distributed CSI compression and feedback, centralized resource allocation, and quantized feedback, is introduced in Section III. The distributed decision making and spectrum sharing architecture is discussed in Section IV. Simulation results are presented in Section V. Finally, conclusions are drawn in Section VI.

II. SYSTEM MODEL
We consider a vehicular communication network with N cellular users (CUs) and K pairs of coexisting device-to-device (D2D) users, where all devices are equipped with a single antenna.
Let $\mathcal{K} = \{1, 2, ..., K\}$ and $\mathcal{N} = \{1, 2, ..., N\}$ denote the sets of all D2D pairs and CUs, respectively. Each pair of D2D users exchanges important short messages, such as safety-related information, via a V2V link, while each CU uses a V2I link to support bandwidth-intensive applications, such as social networking and video streaming. In order to ensure the QoS of the CUs, we assume all V2I links are assigned orthogonal radio resources. Without loss of generality, we assume that each CU occupies one channel for its uplink transmission. To improve the spectrum utilization efficiency, all V2V links share the spectrum resource with V2I links. Therefore, $\mathcal{N}$ is also referred to as the channel set.
Denote the channel power gain from the $n$-th CU to the BS on the $n$-th channel, i.e., the $n$-th V2I link, by $g_n[n]$. Let $h_{k,B}[n]$ represent the cross channel power gain from the transmitter of the $k$-th V2V link to the BS on the $n$-th channel. The received signal-to-interference-plus-noise-ratio (SINR) of the $n$-th V2I link can be expressed as
$$\gamma_n^c[n] = \frac{P_n^c g_n[n]}{\sigma^2 + \sum_{k=1}^{K} \rho_k[n] P_k^d h_{k,B}[n]}, \tag{1}$$
where $P_n^c$ and $P_k^d$ refer to the transmit powers of the $n$-th V2I link and the $k$-th D2D pair, respectively, $\sigma^2$ represents the noise power, and $\rho_k[n] \in \{0, 1\}$ is the channel allocation indicator with $\rho_k[n] = 1$ if the $k$-th D2D user pair chooses the $n$-th channel and $\rho_k[n] = 0$ otherwise.
We assume each D2D pair only occupies one channel, i.e., $\sum_{n=1}^{N} \rho_k[n] \le 1$. Then, the capacity of the $n$-th V2I link on the $n$-th channel can be written as
$$C_n^c[n] = B \log_2 \left(1 + \gamma_n^c[n]\right), \tag{2}$$
where $B$ denotes the channel bandwidth.
Similarly, $h_k[n]$ denotes the channel power gain of the $k$-th V2V link on the $n$-th channel. Meanwhile, $h_{l,k}[n]$ denotes the cross channel power gain from the transmitter of the $l$-th D2D pair to the receiver of the $k$-th D2D pair on the $n$-th channel. Denote the cross channel power gain from the $n$-th CU to the receiver of the $k$-th D2D pair on the $n$-th channel by $g_{n,k}[n]$.
Then, the SINR of the $k$-th V2V link over the $n$-th channel can be written as
$$\gamma_k^d[n] = \frac{P_k^d h_k[n]}{\sigma^2 + I_k[n]}, \tag{3}$$
where the interference power for the $k$-th V2V link, $I_k[n]$, is
$$I_k[n] = \sum_{l \neq k}^{K} \rho_l[n] P_l^d h_{l,k}[n] + P_n^c g_{n,k}[n]. \tag{4}$$
In (4), the terms $\sum_{l \neq k}^{K} \rho_l[n] P_l^d h_{l,k}[n]$ and $P_n^c g_{n,k}[n]$ refer to the interference from the other V2V links and from the V2I link on the $n$-th channel, respectively. Hence, the capacity of the $k$-th V2V link on the $n$-th channel can be written as
$$C_k^d[n] = B \log_2 \left(1 + \gamma_k^d[n]\right). \tag{5}$$
In V2X networks, a naive distributed approach would allow each V2V link to select a channel independently such that its own data rate is maximized. However, local rate maximization often leads to suboptimal global performance due to the interference among different V2V links. On the other hand, the BS in the V2X scenario has enough computational and storage resources to achieve efficient resource allocation. With the help of machine learning, we propose a centralized decision making scheme based on compressed information learned by each individual V2V link in a distributed manner.
In order to achieve this goal, each V2V link first learns to compress local observations, including the channel gain, the observed interference from other V2V links and V2I link, transmit power, etc., and then feeds the compressed information back to the BS. According to feedback information from all V2V links, the BS will make optimal decisions for all V2V links using RL. Then, the BS broadcasts the decision result to all V2V links.
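The rate expressions of this section can be evaluated numerically for any candidate allocation. The following is a minimal NumPy sketch, not the authors' simulator: the function name `v2x_capacities`, the argument names, and the array layouts are assumptions made for illustration, and the V2V SINR is computed on every channel as if the link were transmitting there.

```python
import numpy as np

def v2x_capacities(rho, p_c, p_d, g, h_B, h, h_cross, g_cross, sigma2, bandwidth):
    """Per-channel V2I and V2V capacities for one allocation (illustrative layout).

    rho     : (K, N) 0/1 indicators, rho[k, n] = 1 if V2V pair k uses channel n
    p_c     : (N,)   V2I transmit powers
    p_d     : (K,)   V2V transmit powers
    g       : (N,)   V2I gains g_n[n]
    h_B     : (K, N) gains from V2V transmitter k to the BS on channel n
    h       : (K, N) direct V2V gains h_k[n]
    h_cross : (K, K, N) h_cross[l, k, n] = gain from transmitter l to receiver k
    g_cross : (N, K) g_cross[n, k] = gain from CU n to V2V receiver k on channel n
    """
    rho = np.asarray(rho, dtype=float)
    num_links = rho.shape[0]
    tx = rho * p_d[:, None]                       # power V2V pair l radiates on channel n

    # V2I side: interference at the BS from all V2V pairs sharing channel n
    sinr_c = p_c * g / (sigma2 + np.sum(tx * h_B, axis=0))
    cap_c = bandwidth * np.log2(1.0 + sinr_c)

    # V2V side: other V2V pairs (l != k) plus the V2I link on the same channel
    off_diag = 1.0 - np.eye(num_links)            # excludes the l == k term
    interference = np.einsum("ln,lk,lkn->kn", tx, off_diag, h_cross)
    interference += (p_c[:, None] * g_cross).T
    sinr_d = p_d[:, None] * h / (sigma2 + interference)
    cap_d = bandwidth * np.log2(1.0 + sinr_d)
    return cap_c, cap_d
```

With all gains, powers, the noise power, and the bandwidth set to one and two V2V pairs sharing channel 0, the V2I link on channel 0 sees two interferers and its SINR drops to 1/3, while the empty channel 1 keeps an SINR of 1.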

III. BS DECISION BASED SPECTRUM SHARING ARCHITECTURE
As shown in Fig. 1, we adopt the deep RL approach for resource allocation. In this section, we first design the DNN architecture of each V2V link and the deep Q-network (DQN) for centralized control at the BS, respectively. Then, we propose the centralized decision making and distributed spectrum sharing architecture, termed C-Decision scheme. Finally, we introduce the binary feedback design for information compression.

A. V2V DNN Design
Here, we discuss the DNN at each V2V link that compresses the local observation for feedback. As shown in Fig. 1, each V2V link first obtains its local observation, $o_k$, which includes the channel gains $h_k$, the received interference powers $I_k$, the cross channel gains to the BS $h_{k,B}$, and the transmit power $P_k^d$. Here, the channel information $h_k$ can be accurately estimated by the receiver of the $k$-th V2V link and we assume it is also available at the transmitter through delay-free feedback [29]. Similarly, the received interference power over all channels, $I_k$, can be measured at the $k$-th V2V receiver. Each V2V transmitter knows its transmit power $P_k^d$. Besides, the vector $h_{k,B}$ can be estimated at the BS and then broadcast to all V2V links in its coverage, which incurs a small signaling overhead [27].
Then, the local observation, $o_k$, is compressed using the DNN at each V2V link. The compressed information, $b_k$, which is the output of the DNN, is fed back to the DQN at the BS. To limit the feedback overhead, each V2V link only reports the compressed information vector, $b_k$, instead of $o_k$ to the BS. Here, $b_k = \{b_{k,j}\}$ is also known as the feedback vector of the $k$-th V2V link, and the term $b_{k,j}, \forall j \in \{1, 2, ..., N_k\}$, refers to the $j$-th feedback element of the $k$-th V2V link, where $N_k$ denotes the number of feedback values learned by the $k$-th V2V link. All V2V links aim at maximizing their global sum rate in the long run while minimizing the feedback information $b_k$. Therefore, the parameters of the DNNs at all V2V links and those of the DQN will be jointly determined to maximize the sum rate of the whole V2X network.
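To make the compression step concrete, the sketch below maps a local observation to a short feedback vector with a small fully connected network that uses ReLU hidden layers and a linear output layer. It is only an illustration under stated assumptions: the helper names, the layer widths, the He-style initialization, and the observation layout (three per-channel vectors plus the transmit power) are hypothetical, and the weights here are random rather than trained.

```python
import numpy as np

def build_observation(h_k, i_k, h_kb, p_k):
    """Stack per-channel gains, interference, BS cross gains, and power (assumed layout)."""
    return np.concatenate([h_k, i_k, h_kb, [p_k]])

def init_mlp(sizes, rng):
    """Random weights and zero biases for consecutive layer sizes (illustrative)."""
    return [(rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    """ReLU on hidden layers, linear output layer."""
    for weights, bias in params[:-1]:
        x = np.maximum(0.0, x @ weights + bias)
    weights, bias = params[-1]
    return x @ weights + bias

# Example: N = 4 channels gives a 13-dimensional observation, compressed to N_k = 3 values.
rng = np.random.default_rng(0)
compression_dnn = init_mlp([13, 16, 3], rng)
o_k = build_observation(np.ones(4), np.ones(4), np.ones(4), 0.2)
b_k = mlp_forward(compression_dnn, o_k)
```

In this example the feedback shrinks from 13 real values to 3, which is the kind of reduction the learned feedback is meant to provide.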

B. Deep Q-Network at the BS
To make proper resource sharing decisions, we introduce the deep RL architecture at the BS, as shown in Fig. 1. In order to maximize the long-term sum rate of all links, we resort to the RL technique by treating the BS as the agent. In RL, an agent interacts with its surroundings, called the environment, by taking actions, and then observes a corresponding numerical reward from the environment. The agent's goal is to find optimal actions so that the expected sum of rewards is maximized. Mathematically, RL can be modelled as a Markov decision process (MDP). At each discrete time slot $t$, the agent observes the current state $S_t$ of the environment from the state space $\mathcal{S}$, chooses an action $A_t$ from the action space $\mathcal{A}$, and one time step later obtains a reward $R_{t+1}$. Then, the environment evolves to the next state $S_{t+1}$ with the transition probability $p(s', r | s, a)$.
The BS treats all the learned feedback as the current state $s$ of the agent's environment, which can be expressed as
$$s = \{b_1, b_2, ..., b_K\}. \tag{6}$$
Then, the action of the BS is to determine the values of the channel indicators, $\rho_k[n]$, for each V2V link. Thus, we define the action $a$ of the BS as
$$a = \{\rho_1, \rho_2, ..., \rho_K\}, \tag{7}$$
where $\rho_k = \{\rho_k[n]\}, \forall n \in \mathcal{N}$, refers to the channel allocation vector for the $k$-th V2V link.
Finally, we design the reward for the BS, which is crucial to the performance of RL.
To maximize the long-term sum rate of V2V links while ensuring the QoS of V2I links in the V2X scenario, we need a mechanism that considers the transmissions of V2V links and V2I links simultaneously. The V2V links usually carry safety-critical messages, such as vehicle speed and emergency vehicle warnings, while the V2I links often support entertainment services [27]. Thus, we should guarantee the transmission of V2V links as the primary target while making sure that the impact of V2V transmission on the V2I links is tolerable and can be adjusted to specific applications. To this end, we model the reward of the BS as
$$R = \lambda_c \sum_{n=1}^{N} C_n^c[n] + \lambda_d \sum_{k=1}^{K} C_k^d, \tag{8}$$
where $C_k^d = \sum_{n=1}^{N} \rho_k[n] C_k^d[n]$ refers to the capacity of the $k$-th V2V link on all the channels. Besides, $\lambda_c$ and $\lambda_d$ are nonnegative weights that balance the performance of V2I links and V2V links.
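The weighted sum-rate reward can be sketched in a few lines. The function below is an illustrative assumption rather than the authors' code: it takes per-channel capacities that were computed elsewhere, keeps only the V2V capacity on the channel each pair actually occupies, and uses the weights $\lambda_c = 0.1$ and $\lambda_d = 1$ from the simulation settings as defaults.

```python
import numpy as np

def weighted_sum_rate_reward(cap_c, cap_d, rho, lambda_c=0.1, lambda_d=1.0):
    """Reward = lambda_c * (V2I sum capacity) + lambda_d * (V2V sum capacity).

    cap_c : (N,)   V2I capacity on each channel
    cap_d : (K, N) V2V capacity of pair k if it transmitted on channel n
    rho   : (K, N) 0/1 allocation indicators
    """
    cap_c = np.asarray(cap_c, dtype=float)
    cap_d = np.asarray(cap_d, dtype=float)
    rho = np.asarray(rho, dtype=float)
    v2v_per_link = np.sum(rho * cap_d, axis=1)   # capacity on the occupied channel only
    return lambda_c * np.sum(cap_c) + lambda_d * np.sum(v2v_per_link)
```

Raising `lambda_c` makes the BS more protective of the V2I links, at the cost of V2V throughput; setting it to zero ignores the V2I links entirely.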
The solution of the RL problem is related to the concept of a policy $\pi(a, s)$, which defines the probabilities of choosing each action in $\mathcal{A}$ when observing a state in $\mathcal{S}$. The goal of learning is to find an optimal policy $\pi^*$ that maximizes the expected return $G_t$ from any initial state $s_0$. The return is defined as $G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, which is the cumulative discounted return with a discount factor $\gamma$.
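As a small worked example of the discounted return, the sketch below evaluates $G_t$ for a finite reward sequence by backward accumulation. Truncating the infinite sum to the rewards that are available is an assumption made here for illustration, and the function name is hypothetical.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], computed from the last reward backwards."""
    g = 0.0
    for reward in reversed(rewards):
        g = reward + gamma * g
    return g
```

For rewards (1, 1, 1) and $\gamma = 0.5$, the return is $1 + 0.5 + 0.25 = 1.75$.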
To solve this problem, we resort to Q-learning [30], a well-known and effective approach to the RL problem that is model-free, i.e., $p(s', r | s, a)$ is not required a priori. Q-learning is based on the action-value function for a given policy $\pi$,
$$q_\pi(s, a) = \mathbb{E}_\pi \left[ G_t \mid S_t = s, A_t = a \right], \tag{9}$$
which is the expected return when the agent starts from the state $s$, takes action $a$, and thereafter follows the policy $\pi$. The optimal action-value function $q_*(s, a)$ under the optimal policy $\pi^*$ satisfies the well-known Bellman optimality equations [31], which can be approached through an iterative update method:
$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right], \tag{10}$$
where $\alpha$ is the step-size parameter. Besides, the choice of action $A_t$ in state $S_t$ follows an exploratory policy, such as the $\epsilon$-greedy policy, which can be expressed as
$$A_t = \begin{cases} \arg\max_{a} Q(S_t, a), & \text{with probability } 1 - \epsilon, \\ \text{a random action}, & \text{with probability } \epsilon. \end{cases} \tag{11}$$
Here, $\epsilon$ is also known as the exploration rate in the RL literature. Furthermore, it has been shown in [31] that with a variant of the stochastic approximation conditions on $\alpha$ and the assumption that all the state-action pairs continue to be updated, $Q$ converges with probability 1 to the optimal action-value function $q_*$.
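The Q-learning update and the exploratory action choice described above can be sketched for the tabular case as follows. This is a generic illustration with hypothetical function names, using integer state and action indices; it is not specific to the V2X problem.

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))

def q_learning_update(q_table, state, action, reward, next_state, alpha, gamma):
    """One tabular Q-learning step; modifies q_table in place and returns it."""
    td_target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (td_target - q_table[state, action])
    return q_table
```

With $\epsilon = 0$ the policy is purely greedy, which is the setting one would use when evaluating a trained agent.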
However, in many practical problems, the state and action spaces can be extremely large, which prevents storing all action values in tabular form. As a result, it is common to adopt function approximation to estimate the action-value function. Moreover, by doing so, we can generalize action values from a limited set of seen state-action pairs to a much larger space.
In [32], a DNN parameterized by $\theta$ is employed to represent the action-value function, and is thus called a DQN. The DQN adopts the $\epsilon$-greedy policy to explore the state space and stores the transition tuple $(S_t, A_t, R_{t+1}, S_{t+1})$ in a replay memory (also known as the replay buffer) at each time step.
The replay memory accumulates the agent's experiences over many episodes of the MDP. At each time step, a mini-batch of experiences $\mathcal{D}$ is uniformly sampled from the replay memory, which is called experience replay, to update the network parameters $\theta$ with variants of the stochastic gradient descent method so as to minimize the squared errors
$$\sum_{\mathcal{D}} \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a; \theta^-) - Q(S_t, A_t; \theta) \right]^2, \tag{12}$$
where $\theta^-$ is the parameter set of a target Q-network, which is duplicated from the training Q-network parameter set $\theta$ and kept fixed for a number of updates to further improve the stability of the DQN. Besides, experience replay improves sample efficiency by repeatedly sampling experiences from the replay memory and also breaks the correlation between successive updates, which further stabilizes the learning process.
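The two stabilizing ingredients just described, experience replay and a separate target network, can be sketched as below. This is a minimal illustration under assumptions: the class and function names are hypothetical, and the Q-networks are passed in as plain callables that map a batch of states to a matrix of action values, so the sketch does not depend on any deep learning library.

```python
import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Fixed-capacity memory of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        batch = random.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=float), np.array(next_states))

    def __len__(self):
        return len(self.buffer)

def dqn_squared_error(q_fn, q_target_fn, batch, gamma):
    """Sum of squared TD errors, with targets computed by the frozen target network."""
    states, actions, rewards, next_states = batch
    targets = rewards + gamma * np.max(q_target_fn(next_states), axis=1)
    predictions = q_fn(states)[np.arange(len(actions)), actions]
    return float(np.sum((targets - predictions) ** 2))
```

In a full training loop, the parameters behind `q_target_fn` would be overwritten with those behind `q_fn` only every few updates.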

C. Centralized Control and Distributed Transmission Architecture
In this part, the architecture for the C-Decision scheme is shown in Fig. 1. Details of the training framework for the C-Decision scheme are provided in Algorithm 1.
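We define $O_t = \{o_k^t\}, \forall k \in \mathcal{K}$, as the observations of all V2V links at the time step $t \in \{1, 2, ..., T\}$, where $o_k^t$ refers to the observation of the $k$-th V2V link at the time step $t$. Then, we can express the estimate of the return, also known as the approximate target value [32], as
$$R_{t+1} + \gamma \max_{a} Q(O_{t+1}, a; \theta^-), \tag{13}$$
where $R_{t+1}$ and $Q(O_{t+1}, a; \theta^-)$ represent the reward of all links and the Q function of the target DQN with parameters $\theta^-$ under the next observation $O_{t+1}$ and the action $a$, respectively.
Then, the updating process for the BS DQN can be written as [32], [33]
$$\theta \leftarrow \theta + \beta \sum_{\mathcal{D}} \left[ R_{t+1} + \gamma \max_{a} Q(O_{t+1}, a; \theta^-) - Q(O_t, a_t; \theta) \right] \nabla_\theta Q(O_t, a_t; \theta), \tag{14}$$
where $\beta$ is the step size in one gradient iteration.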
As for the testing phase, at each time step $t$, each V2V link adopts its observation $o_k^t$ as the input of the trained DNN to obtain its learned feedback $b_k^t$, and then sends it to the BS. After that, the BS takes $\{b_k^t\}$ as the input of its trained DQN to generate the decision result $a_t$, and broadcasts $a_t$ to all V2V links. Finally, each V2V link chooses the specific channel indicated by $a_t$ to transmit.
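A DQN with a discrete output layer needs the joint decision $a_t$ to be indexed by a single integer. One possible encoding, sketched below, treats the channel choices of the $K$ links as the digits of a base-$N$ number. This encoding is an assumption made for illustration: it presumes every V2V link occupies exactly one channel, and the function names are hypothetical.

```python
def decode_joint_action(index, num_links, num_channels):
    """Map an integer in [0, num_channels**num_links) to one channel per V2V link."""
    channels = []
    for _ in range(num_links):
        index, channel = divmod(index, num_channels)
        channels.append(channel)
    return channels

def encode_joint_action(channels, num_channels):
    """Inverse of decode_joint_action."""
    index = 0
    for channel in reversed(channels):
        index = index * num_channels + channel
    return index
```

With $K$ links and $N$ channels this gives $N^K$ joint actions, so the output layer of a centralized DQN grows quickly with the number of links.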

D. Spectrum Sharing with Binary Feedback
In order to further reduce feedback overhead, we propose a framework to quantize the V2V links' real-valued feedback into several binary digits. In other words, we try to constrain $b_{k,j} \in \{-1, 1\}$.

Algorithm 1 Training algorithm for the C-Decision scheme
Input: the DNN model for each V2V, the DQN model for the BS, the V2X environment
...
9: BS chooses $a_t$ according to $s_t$ using some policy $\pi$ derived from $Q$, e.g., the $\epsilon$-greedy strategy as in (11)
10: BS broadcasts the action $a_t$ to every V2V
11: Each V2V takes action based on $a_t$, and gets the reward $R_{t+1}$ and the next observation $o_k^{t+1}$
12: Save the data $\{O_t, a_t, R_{t+1}, O_{t+1}\}$ into the replay buffer $\mathcal{B}$
13: Sample a mini-batch of data $\mathcal{D}$ from $\mathcal{B}$ uniformly
14: Use the data in $\mathcal{D}$ to train all V2Vs' DNNs and BS's DQN together as in (14)
...

The binary quantization process consists of two steps [34]. The first step generates continuous values in the interval $[-1, 1]$, as many as the desired number of binary feedback values. The second step takes the outputs of the first step as its input and maps each real-valued output to a discrete value in the set $\{-1, 1\}$.
For the first step, we adopt a fully-connected layer with tanh activations, defined as $\tanh(x) = \frac{2}{1 + e^{-2x}} - 1$, and we term this layer the pre-binary layer. Here, the input of the pre-binary layer is connected to the outputs of each V2V link's DNN. Then, in order to binarize the continuous output of the first step, we adopt the traditional sign function in the second step. To be specific, we take the sign of the input value as the output of this layer:
$$\mathrm{sign}(x) = \begin{cases} 1, & x \ge 0, \\ -1, & x < 0. \end{cases} \tag{15}$$
However, the gradient of this function is zero almost everywhere and undefined at the origin, which hinders the back propagation procedure for DNN training. As a remedy, we adopt the identity function in the backward pass, which is known as the straight-through estimator [35]. Combining these two steps, we can express the full binary feedback function as
$$B(x) = \mathrm{sign}\left(\tanh\left(W_0 x + b_0\right)\right), \tag{16}$$
where $W_0$ and $b_0$ denote the linear weights and bias of the pre-binary layer, respectively, which transform the activations $x$ from the previous layer in the neural network. Here, we term this layer the binary layer.
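The forward and backward passes of the binary layer can be sketched with explicit gradients as follows. This is an illustrative NumPy version with hypothetical function names, written for a single input vector; mapping a pre-binary value of exactly zero to $+1$ is an assumption, and the backward pass treats the sign function as the identity, as the straight-through estimator prescribes.

```python
import numpy as np

def binary_layer_forward(x, w0, b0):
    """Pre-binary tanh layer followed by the sign function; outputs lie in {-1, +1}."""
    pre = np.tanh(x @ w0 + b0)               # continuous values in [-1, 1]
    bits = np.where(pre >= 0.0, 1.0, -1.0)   # assumption: zero maps to +1
    return bits, pre

def binary_layer_backward(grad_bits, x, pre, w0):
    """Straight-through backward pass: the sign step passes gradients unchanged."""
    grad_pre = grad_bits                     # identity in place of the sign derivative
    grad_z = grad_pre * (1.0 - pre ** 2)     # derivative of tanh at the pre-activation
    grad_w0 = np.outer(x, grad_z)
    grad_b0 = grad_z
    grad_x = w0 @ grad_z                     # gradient passed to the previous layer
    return grad_w0, grad_b0, grad_x
```

Without the identity substitution, `grad_pre` would be zero for almost every input and none of the layers feeding the binary layer would receive a learning signal.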

IV. DISTRIBUTED DECISION MAKING AND SPECTRUM SHARING ARCHITECTURE
In order to further facilitate distributed spectrum sharing and reduce the computational complexity, we propose the distributed decision making and spectrum sharing architecture (named the D-Decision scheme), shown in Fig. 2, to let each V2V link make its own spectrum sharing decision. In this section, we first devise the neural network architectures for each V2V link to compress CSI and to make decisions, respectively, and then design the neural network for the BS to aggregate feedback from all V2V links. Then, we propose the hybrid information aggregation and distributed control architecture. Finally, we propose the D-Decision scheme with binary aggregated information.

B. Hybrid Information Aggregation and Distributed Control Architecture
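Each V2V link first observes its local environment to obtain $o_k$, and then adopts its Compression DNN to learn the feedback $b_k$, which is sent to the BS. The BS takes the feedback of all V2V links as the input of its Aggregation DNN to generate the AGI, which is broadcast to all V2V links, and each V2V link then feeds $o_k$ and the AGI into its Decision DQN to make its own decision. Here, we define $a_t = \{a_k^t\}, \forall k \in \mathcal{K}$, as the actions of all V2V links at the time step $t \in \{1, 2, ..., T\}$, where $a_k^t = \rho_k$ refers to the action of the $k$-th V2V link. Besides, in the training process, we take the observations of all V2V links, $O_t$, as the input and train all DNNs and DQNs in an end-to-end manner. The training process can be implemented in a fully distributed manner.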
As for the testing phase, at each time step $t$, each V2V link adopts its observation $o_k^t$ as the input of its Compression DNN to learn the feedback $b_k^t$, and sends it to the BS. Then, the BS utilizes $\{b_k^t\}$ as the input of its Aggregation DNN to generate the AGI $\phi_t$, and broadcasts $\phi_t$ to all V2V links. Finally, each V2V link takes $o_k^t$ and $\phi_t$ as the input of its Decision DQN to make its decision, and then transmits on the chosen channel.
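The three-stage information flow of the testing phase (compress, aggregate, decide) can be sketched as below. This is only a structural illustration under assumptions: all networks are small randomly initialized fully connected networks rather than trained models, the helper names and layer widths are hypothetical, and each Decision DQN is assumed to output one Q value per channel.

```python
import numpy as np

def init_net(sizes, rng):
    """Random weights for a fully connected network (illustrative only)."""
    return [(rng.normal(0.0, 0.1, size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(net, x):
    """ReLU hidden layers, linear output layer."""
    for weights, bias in net[:-1]:
        x = np.maximum(0.0, x @ weights + bias)
    weights, bias = net[-1]
    return x @ weights + bias

def d_decision_step(observations, compression_nets, aggregation_net, decision_nets):
    """One testing step: local compression, BS aggregation, local greedy decision."""
    # Each V2V link compresses its own observation into a short feedback vector.
    feedback = [forward(net, obs) for net, obs in zip(compression_nets, observations)]
    # The BS fuses all feedback vectors into the broadcast AGI.
    agi = forward(aggregation_net, np.concatenate(feedback))
    # Each V2V link decides locally from its own observation and the AGI.
    channels = []
    for net, obs in zip(decision_nets, observations):
        q_values = forward(net, np.concatenate([obs, agi]))
        channels.append(int(np.argmax(q_values)))
    return channels, agi
```

Note that each Decision DQN here has only $N$ outputs, one per channel, instead of an output for every joint allocation of all links.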

C. Distributed Spectrum Sharing with Binary Information
Similar to Section III-D, we can also quantize the continuous feedback and the AGI in the D-Decision scheme into binary data to further reduce signaling overhead. Then, both the Compression DNN of each V2V link and the Aggregation DNN at the BS need to include the binary function in (16).

Training steps for the D-Decision scheme:
...
5: Initialize the policy $\pi_k$ for each V2V randomly
6: for time-step $t = 1, ..., T$ do
7: Each V2V adopts the current observation $o_k^t$ as the input of its Compression DNN to learn the feedback $b_k^t$, and sends it to BS
8: BS takes $\{b_k^t\}$ as the input of its Aggregation DNN, and generates the AGI $\phi_t$
9: BS broadcasts the AGI $\phi_t$ to every V2V
10: Each V2V takes $s_k^t = \{o_k^t, \phi_t\}$ as the input of its Decision DQN $Q_k$
11: Each V2V chooses $a_k^t$ according to $s_k^t$ using some policy $\pi_k$ derived from $Q_k$, e.g., the $\epsilon$-greedy strategy as in (11)
12: Each V2V takes action $a_k^t$, and gets the reward $R_{t+1}$ and the next observation $o_k^{t+1}$
13: Save the data $\{O_t, a_t, R_{t+1}, O_{t+1}\}$ into the buffer $\mathcal{B}$
14: Sample a mini-batch of data $\mathcal{D}$ from $\mathcal{B}$ uniformly
15: Use the data in $\mathcal{D}$ to train all V2Vs' Compression DNNs and Decision DQNs and BS's Aggregation DNN together as in (14)
16: Each V2V updates its observation $o_k^t \leftarrow o_k^{t+1}$
17: Each V2V updates its target network: $\theta_k^- \leftarrow \theta_k$ every $N_u$ steps
18: end for
19: end for

V. SIMULATION RESULTS
In this section, we conduct extensive simulations to verify the performance of the proposed schemes. In particular, we provide the simulation settings in Part A, and evaluate the training performance of the C-Decision scheme in Part B. Then, we assess the testing performance under real-valued feedback and binary feedback in Parts C and D, respectively. Besides, we demonstrate the impacts of the V2I and V2V link weights on the performance in Part E and the robustness of the proposed scheme in Part F. Finally, we show the training and testing performance of the D-Decision scheme in Part G.

A. Simulation Settings
The simulation scenario follows the urban case in Annex A of [5]. The simulation area size is 1,299 m × 750 m, where the BS is located in the center of this area. For better understanding, we provide related parameters and their corresponding settings in Table I. In addition, we list the corresponding channel models for both V2V and V2I links in Table II. Table IV.
We use the rectified linear unit (ReLU) activation function for both the DNNs and the DQNs, defined as $f(x) = \max(0, x)$. Here, the activation function of the output layers in the DNNs and DQNs is set as a linear function. Besides, the RMSProp optimizer [37] is adopted to update the network parameters with a learning rate of 0.001. The loss function is set as the Huber loss [38].
We choose the weights $\lambda_c = 0.1$ and $\lambda_d = 1$ for V2I and V2V links, respectively.

B. Training Performance Evaluation
Here, we evaluate the training process every 5 training episodes under 10 different random seeds with the exploration rate $\epsilon = 0$, and plot the average return per episode in Fig. 3 (b). The average return per episode first increases quickly with increasing $L_{\mathrm{train}}$, and gradually converges despite some small fluctuations due to the time-varying V2X scenario, which shows the stability of the training process. Thus, Fig. 3 (a) and (b) demonstrate the desired convergence of the proposed training algorithm. Therefore, we set $L_{\mathrm{train}} = 2{,}000$ for the C-Decision scheme afterwards.

We also display the performance of two benchmark schemes: the optimal scheme and the random action scheme. In the optimal scheme, we perform a time-consuming brute-force search to find the optimal spectrum allocation in each testing step. In the random action scheme, each V2V link chooses its channel randomly. For better comparison, we depict the normalized return of these three schemes in Fig. 4 (a), where we use the return of the optimal scheme to normalize the return of the other two schemes in each testing episode. Besides, the average returns of our proposed scheme and the random action scheme are also depicted. In Fig. 4 (a), the performance of the C-Decision scheme approaches 100% in most episodes and its average performance is about 97% of that of the optimal scheme, while the average performance of random selection is about 55% of the optimal performance. Thus, we conclude that the proposed C-Decision scheme can achieve near-optimal spectrum sharing.

Fig. 4 (b) shows the impacts of different mini-batch sizes $|\mathcal{D}|$ and different numbers of real-valued feedback values $N_k$ on the performance of the C-Decision scheme, with the average return percentage (ARP) as the metric. Here, the ARP is defined as follows: the return under the C-Decision scheme is first averaged over 2,000 testing episodes and then normalized by the average return of the optimal scheme. In Fig. 4 (b), $N_k = 0$ refers to the situation where no V2V link feeds anything back to the BS and, therefore, each V2V link just randomly selects a channel to transmit, which is the random action scheme. From Fig. 4 (b), the ARP under the C-Decision scheme increases rapidly with $N_k$, and reaches its maximum of nearly 98% at $N_k = 3$. Thereafter, the ARP stays virtually constant with increasing $N_k$. In other words, each V2V link only needs to send 3 real-valued feedback values to the BS to achieve near-optimal performance. Besides, different mini-batch sizes achieve very similar performance. In particular, the mini-batch size $|\mathcal{D}| = 512$ achieves the best performance, which is a good trade-off between the computational overhead of training and the performance gained.
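The optimal benchmark and the ARP metric can be sketched as follows. This is a hedged illustration rather than the evaluation code behind the figures: the function names are hypothetical, `reward_fn` stands for any function that scores a tuple of per-link channel choices, and every V2V link is assumed to occupy exactly one channel.

```python
import itertools

import numpy as np

def brute_force_best(reward_fn, num_links, num_channels):
    """Exhaustively search all num_channels**num_links allocations for the best reward."""
    candidates = itertools.product(range(num_channels), repeat=num_links)
    best = max(candidates, key=reward_fn)
    return best, reward_fn(best)

def average_return_percentage(scheme_returns, optimal_returns):
    """Mean return of a scheme, normalized by the mean return of the optimal scheme."""
    return 100.0 * float(np.mean(scheme_returns)) / float(np.mean(optimal_returns))
```

The exhaustive search visits $N^K$ allocations per testing step, which is why it serves only as an offline benchmark.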

E. Impacts of V2I and V2V Weights
In this part, we evaluate the impacts of the V2I weight $\lambda_c$ and the V2V weight $\lambda_d$ on the system performance. For better understanding, we fix $\lambda_d = 1$ and vary the value of $\lambda_c$. Fig. 6 shows the empirical cumulative distribution function (CDF) of the V2I and V2V sum rates. In Fig. 6, "Real FB" and "Binary FB" refer to the proposed C-Decision scheme with real-valued feedback and that with binary feedback, respectively, and "Optimal" represents the optimal scheme. In particular, two empirical CDFs of V2I sum rate under both real-valued feedback and binary feedback in Fig. 6 ...

F. Robustness Evaluation
... with respect to each observation (such as the channel gain value) for V2V links. In Fig. 8 (a), the ARP under both feedback schemes decreases very slowly at first, then drops quickly, and finally remains nearly unchanged under very large input noise, which shows the robustness of the proposed scheme. In addition, the proposed scheme still attains nearly 60% of the optimal performance under both real-valued feedback and binary feedback even with very large input noise, which is still better than the random action scheme shown in Fig. 4 (a). Based on this observation, we remark that the proposed scheme can learn the intrinsic structure of the resource allocation in the V2X scenario.
Besides, Fig. 8 ...

G. Performance Evaluation for the D-Decision Scheme
... average return per episode under the D-Decision scheme in Fig. 9 (b) first increases quickly with $L_{\mathrm{train}}$, then increases slowly, and finally converges despite some fluctuations, which shows the stability of the training process. Besides, we observe that $L_{\mathrm{train}} = 10{,}000$ under the D-Decision scheme is much larger than $L_{\mathrm{train}} = 2{,}000$ under the C-Decision scheme, which indicates that the D-Decision scheme converges more slowly than the C-Decision scheme. To train the whole neural network well, we set $L_{\mathrm{train}} = 10{,}000$ under the D-Decision scheme. Besides, the exploration rate $\epsilon$ is linearly annealed from 1 to 0.01 over the first 8,000 episodes and then kept constant.
Then, the testing performance of the D-Decision scheme with an increasing number of AGI values is shown in Fig. 10. In particular, Fig. 10 (a) illustrates the ARP performance with an increasing number of real-valued AGI values, $N_g^r$. Here, we set the number of real-valued feedback values that each V2V link transmits to the BS to 3, as indicated by Fig. 4 (b). The ARP first increases with $N_g^r$, and then remains nearly unchanged as $N_g^r$ increases further. In particular, the ARP nearly achieves its maximal value of 96% when $N_g^r = 16$. In other words, the BS only needs 16 real-valued AGI values to represent the real-valued feedback of all V2V links and achieve 96% of the optimal performance. Furthermore, even when $N_g^r = 2$, the ARP can still reach 90%, which is suitable for the bandwidth-constrained broadcast channel of the BS. Compared with the C-Decision scheme, the D-Decision scheme only incurs a 2% ARP degradation. However, it achieves fully distributed decision making and spectrum sharing, which is very appealing in the V2X scenario. In addition, the computational complexity of decision making under the D-Decision scheme is greatly reduced compared with that under the C-Decision scheme, which can further facilitate fully distributed spectrum sharing in the V2X scenario.
Besides, the testing performance of the D-Decision scheme with binary AGI is evaluated in Fig. 10 (b). Here, we choose the number of feedback bits as 36 for each V2V link and the number of real-valued AGI values $N_g^r = 16$. In Fig. 10 (b), the ARP first increases with the number of AGI bits $N_g^b$, and then remains nearly unchanged as $N_g^b$ increases further. In particular, the ARP reaches 90% when $N_g^b = 80$. Meanwhile, the ARP is very close to 90% even when $N_g^b = 36$. Similarly, compared with the C-Decision scheme with binary feedback, the D-Decision scheme with binary feedback only incurs a 4% ARP degradation, while it can be implemented in a fully distributed manner.
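The exploration schedule used when training the D-Decision scheme, a linear decay of $\epsilon$ from 1 to 0.01 over the first 8,000 episodes followed by a constant value, can be written as a short function. The function and parameter names below are hypothetical, and counting episodes from zero is an assumption.

```python
def exploration_rate(episode, eps_start=1.0, eps_end=0.01, anneal_episodes=8000):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it constant."""
    if episode >= anneal_episodes:
        return eps_end
    fraction = episode / anneal_episodes
    return eps_start + fraction * (eps_end - eps_start)
```

Halfway through the annealing window, at episode 4,000, this gives $\epsilon = 0.505$.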

VI. CONCLUSION
In this paper, we have proposed a novel C-Decision architecture that allows distributed V2V links to share spectrum efficiently with the aid of the BS in the V2X scenario, and have also devised an approach to binarize the continuous feedback. To further facilitate distributed decision making, we have developed a D-Decision scheme that lets each V2V link make its own decision locally, and have also designed the binarization procedure for this scheme. Simulation results have demonstrated that a small number of real-valued feedback values is enough to achieve near-optimal performance. Meanwhile, the D-Decision scheme can also achieve near-optimal performance while enabling fully distributed decision making, which is more appealing for V2X networks. Besides, the quantization of the feedback or the AGI incurs only a small performance loss with an acceptable number of bits under both schemes. The proposed scheme is also robust to variations in the feedback interval, input noise, and feedback noise.
In the future, we will investigate the joint power control and spectrum sharing problem in this scenario.