Distributed Deep Deterministic Policy Gradient for Power Allocation Control in D2D-Based V2V Communications

Device-to-device (D2D) communication is an emerging technology in the evolution of 5G network-enabled vehicle-to-vehicle (V2V) communications. It is a core technique for the next generation of many platforms and applications, e.g. real-time high-quality video streaming, virtual reality gaming, and smart city operation. However, the rapid proliferation of user devices and sensors calls for more efficient resource allocation algorithms that enhance network performance while still guaranteeing the quality-of-service. Deep reinforcement learning is emerging as a powerful tool to give each node in the network a real-time self-organising ability. In this paper, we present two novel approaches based on the deep deterministic policy gradient algorithm, namely “distributed deep deterministic policy gradient” and “sharing deep deterministic policy gradient”, for the multi-agent power allocation problem in D2D-based V2V communications. Numerical results show that our proposed models outperform other deep reinforcement learning approaches in terms of the network’s energy efficiency and flexibility.


I. INTRODUCTION
Vehicle-to-vehicle (V2V) communication, which utilises intelligent vehicles in order to improve traffic safety and reduce energy consumption, has recently emerged as a promising technology. Several studies on V2V communications aim to make each vehicle more intelligent while ensuring safety [1], [2]. V2V technology facilitates efficient supervision of possible pitfalls on the roadways by allowing vehicles to cooperate with the existing transport management systems. Moreover, intelligent transport systems can exploit data from V2V communications to enhance traffic management and enable vehicles to communicate with road infrastructure in order to build more reliable self-driving cars.
In device-to-device (D2D) communications, end-users can interact with each other without having to connect directly to base stations (BS) or core networks. This enables the development of various platforms and applications. For example, D2D communication is a core technique in smart cities [3], high-quality video streaming [4], and disaster relief networks [5]. D2D communication can also support V2V communications, as it has tremendous advantages such as spectral efficiency, energy efficiency, and fairness [6]-[9]. Firstly, V2V communications under the D2D-enabled architecture are supported through localised D2D communication and so inherit the benefits of D2D-based networks. Techniques used in D2D communication substantially reduce latency and power consumption; hence, they are suitable for V2V communications with tight delay requirements. Secondly, the time constraints on V2V links are as strict as those on D2D pairs, because low latency is essential for critical safety services. In addition, the demand for high reliability in V2V communication is similar to that in D2D communication. The V2V link reliability is guaranteed by ensuring that the SINR is not lower than a small threshold. We identify and incorporate the reliability QoS requirements for V2V links into the objective formulation. Therefore, D2D communication represents an emerging solution to enable safe, efficient, and reliable V2V communications. However, the resource allocation problem is one of the challenges in enabling D2D-based V2V communications, due to the rapid channel variations caused by V2V user mobility.

The associate editor coordinating the review of this manuscript and approving it for publication was Lei Shu.
Resource allocation problems in D2D communication have received enormous attention from the research community [10]-[15]. In [10], the authors considered three scenarios, namely the perfect channel state information, partial channel state information, and imperfect channel between the users and the transmitters, and presented a resource allocation algorithm to achieve optimal performance in terms of secrecy throughput and energy efficiency. In [11], the authors introduced an optimisation scheme based on the combination of coral reefs optimisation and quantum evolution to obtain optimal results for the joint resource management and power allocation problem in cooperative D2D heterogeneous networks. The authors in [12] proposed an optimisation algorithm based on a logarithmic inequality to solve the joint energy-harvesting time and power allocation problem in D2D communications assisted by unmanned aerial vehicles. Meanwhile, in [13], in order to maximise the total average achievable rate from D2D transmitters to D2D receivers, the authors proposed an optimal solution to allocate the spectrum and power in cooperative D2D communications with multiple D2D pairs. In [14], a resource allocation approach was presented to improve the energy efficiency of D2D communication. In particular, the power allocation problem was solved by using the Lambert W function, and channel allocation was solved by the Gale-Shapley matching algorithm. However, all the above approaches share a common drawback: they require the data of all D2D pairs to be collected and processed in a centralised manner at the BS, which causes delays in real-time scenarios. Furthermore, many previous algorithms typically only work in a small, static environment in which all the data is analysed at one point. This is not realistic, because environments are dynamic and centralised processing creates a bottleneck, congestion, and blockage at the BS or central processing unit.
Some recent works have applied D2D communication techniques to support V2V communications [6]-[9]. A cluster-based resource block sharing algorithm in [7] and a separate resource block algorithm in [9] were proposed to deal with the radio resource allocation problem in D2D-based vehicle-to-everything communications. Meanwhile, the authors in [8] proposed a grouping algorithm, channel selection, and power control strategies to maximise the performance of a network consisting of multiple D2D-based V2V links sharing the same channel. However, the major issue of D2D communication is that each D2D pair in the network typically has limited resources and power for transmitting information, whilst the demand for efficient resource allocation, such as spectrum and power allocation, is rising rapidly. Furthermore, each pair in a D2D network cannot frequently transfer, or store in its memory, the information of its resource allocation scheme due to limitations in transmission power and memory storage. Besides, if we use the BS as a central processing unit to find a resource allocation scheme for each pair, the delay incurred will make the system model unsuitable for real-time applications. Recently, efficient optimisation algorithms have been deployed to enhance both energy efficiency and processing time [12], [16], [17].
In [18], a reinforcement learning (RL) algorithm was used to obtain the optimal policy for the power control problem in energy harvesting two-hop communication. The authors assumed that each energy harvesting node only knows the harvested energy and channel coefficients. Thus, the problem can be transformed into two point-to-point problems and, to maximise the amount of data at the receiver, an RL algorithm called SARSA is employed at each energy harvesting node to reach the optimal policy at a transmitter. Nevertheless, the RL-based algorithm has some disadvantages, such as instability and inefficiency when the number of nodes in the network is large.
Recently, deep learning (DL), a subfield of machine learning, has become a powerful optimisation tool for solving resource management problems in modern wireless networks [19], [20]. An approach based on deep recurrent neural networks was presented in [19] to obtain the optimal policy for resource allocation in a non-orthogonal multiple access-based heterogeneous internet-of-things network. In [20], the authors proposed a deep learning-based resource management scheme to balance the energy and spectrum efficiency in cognitive radio networks. By utilising neural networks, the convergence speed was significantly improved, with lower computational complexity and learning cost, while satisfying the network performance requirements. DL has also been applied to solve physical layer issues in wireless networks [21]-[25]. The authors in [21] proposed a convolutional neural network-based method to automatically recognise eight popular modulation models used in advanced cognitive radio networks. The proposed network was trained by using the two datasets of in-phase and quadrature to extract features and efficiently classify modulated signals. Meanwhile, the authors in [22] introduced a fully-connected neural network-based framework for maximising the network throughput under a constraint on the total transmit power. The data was generated without labels and fed into the neural network for offline unsupervised training. DL-based algorithms were also proposed to enable a mmWave massive multiple-input multiple-output framework for hybrid precoding schemes [23] and to detect the channel characteristics automatically [24].
Deep reinforcement learning, a combination of RL and deep neural networks, has been widely used in wireless communication thanks to its powerful features, impressive performance, and adequate processing time. The authors in [26] formulated a non-cooperative power allocation game in D2D communications and proposed three approaches based on the deep Q-learning, double deep Q-learning, and dueling deep Q-learning algorithms for multi-agent learning to find the optimal power level for each D2D pair in order to maximise the network performance. The authors in [27] used the deep Q-learning algorithm to find the optimal sub-band and transmission power level for each V2V user in V2V communications while satisfying the requirement of low latency. However, these algorithms only work on a discrete action space; hence, human intervention is required to design the power levels of each pair. With a finite action set, the performance of these algorithms cannot reach the optimal result, and the reward can become worse if the power levels are not divided accurately.
Against this background, in this paper, we propose two novel models termed distributed deep deterministic policy gradient (DDDPG) and sharing deep deterministic policy gradient (SDDPG), both based on the deep deterministic policy gradient (DDPG) algorithm [28]. Our proposed approaches work on a continuous action space for the multi-agent power allocation problem in D2D-based V2V communications. Therefore, we can improve the algorithm convergence quality and sample efficiency significantly, especially when the number of V2V pairs in the network increases. From the numerical results, we show that our models outperform the approach based on the original DDPG algorithm in terms of energy efficiency (EE) performance, computational complexity, and network flexibility. Our main contributions are as follows:
• We provide two novel approaches based on the DDPG algorithm to solve the multi-agent learning and non-cooperative power allocation problem in D2D-based V2V communications. Experimental results are promising compared with other existing deep reinforcement learning approaches.
• By modifying the input of the neural network, all the agents in the multi-agent deep reinforcement learning algorithm can share one actor network and one critic network to reach higher performance and faster convergence while significantly reducing the computational complexity and memory storage.
• Finally, after training the policy neural network, the non-cooperative power allocation problem in D2D-based V2V communications can be solved in milliseconds. This makes it a promising technique for real-time scenarios.

The remainder of the paper is organised as follows. In Section II, we describe the system model and formulation of the multi-agent power allocation problem in D2D-based V2V communications. Section III describes the value functions and policy gradient concepts, and proposes a method based on the distributed deep deterministic policy gradient algorithm. In Section IV, we improve the model by using the embedding layer to solve the non-cooperative resource allocation problem in D2D-based V2V communications efficiently. In Section V, the simulation results are presented to demonstrate the efficiency of our proposed schemes. Finally, we conclude this paper and propose some future works in Section VI.

II. SYSTEM MODEL AND PROBLEM FORMULATION
In this section, we define the system model and formulation of the power allocation problem in D2D-based V2V communications. As depicted in Fig. 1, N V2V pairs are distributed randomly within the coverage of one BS. Each V2V pair consists of a single-antenna V2V transmitter (V2V-Tx) and a single-antenna V2V receiver (V2V-Rx). We define $\beta_0$, $f_i$, and $\alpha_h$ as the channel power gain at the reference distance, an exponentially distributed random variable with unit mean, and the path loss exponent for V2V links, respectively. The locations of the ith V2V-Tx and jth V2V-Rx, with $i, j \in \{1, \dots, N\}$, are $(x^i_{Tx}, y^i_{Tx})$ and $(x^j_{Rx}, y^j_{Rx})$. Hence, the channel power gain $h_{ij}$ between the ith V2V-Tx and jth V2V-Rx is written as

$$h_{ij} = \beta_0 f_i d_{ij}^{-\alpha_h}, \tag{1}$$

where $d_{ij} = \sqrt{(x^i_{Tx} - x^j_{Rx})^2 + (y^i_{Tx} - y^j_{Rx})^2}$ is the Euclidean distance between the ith V2V-Tx and jth V2V-Rx.
The received signal-to-interference-plus-noise ratio (SINR) at the ith V2V user is defined as

$$\gamma_i = \frac{p_i h_{ii}}{\sum_{j \neq i} p_j h_{ji} + \sigma}, \tag{2}$$

where $p_i \in (p_i^{\min}, p_i^{\max})$ and $\sigma$ are the transmission power of the ith V2V pair and the AWGN power, respectively.
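The channel and SINR model described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the gain takes the product form of reference gain, fading, and distance raised to the negative path loss exponent, and that the noise term enters the SINR denominator directly as a power. The function names (`channel_gain`, `build_gains`, `sinr`) are hypothetical.

```python
import math
import random


def channel_gain(tx, rx, beta0, alpha_h, fading):
    """Assumed form: beta0 * fading * distance^(-alpha_h)."""
    distance = math.hypot(tx[0] - rx[0], tx[1] - rx[1])  # Euclidean distance
    return beta0 * fading * distance ** (-alpha_h)


def build_gains(tx_pos, rx_pos, beta0, alpha_h, rng):
    """gains[i][j] is the gain from transmitter i to receiver j."""
    gains = []
    for i in range(len(tx_pos)):
        f_i = rng.expovariate(1.0)  # unit-mean exponential fading for Tx i
        gains.append([channel_gain(tx_pos[i], rx, beta0, alpha_h, f_i)
                      for rx in rx_pos])
    return gains


def sinr(i, powers, gains, noise_power):
    """SINR at receiver i: own signal over interference plus noise."""
    signal = powers[i] * gains[i][i]
    interference = sum(powers[j] * gains[j][i]
                       for j in range(len(powers)) if j != i)
    return signal / (interference + noise_power)
```

Note that the sketch does not guard against a transmitter and receiver sharing the same coordinates, which would give a zero distance.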
In the power allocation problem in D2D-based V2V communications with N V2V pairs, our objective is to find an optimal policy that maximises the EE performance of the network. The information throughput at the ith V2V pair is defined as follows:

$$C_i = W \log_2 (1 + \gamma_i),$$

where $W$ is the bandwidth. The total performance of the network is a joint function of all V2V pairs. We define the quality of service (QoS) constraints as

$$\gamma_i \geq \gamma^*, \quad \forall i \in \{1, \dots, N\},$$

where $\gamma^*$ is the SINR requirement. In this work, we focus on maximising the total EE performance of the network while satisfying the energy constraints and the QoS constraints for each V2V pair. Therefore, the EE optimisation problem is to maximise the total EE of the network over the transmission powers $p_i$, subject to the power constraints $p_i \in (p_i^{\min}, p_i^{\max})$ and the QoS constraints.

In D2D-based V2V communications, each of the N V2V pairs only has access to its own information about the power allocation strategy and the current environment state. This makes the power allocation problem in D2D-based V2V communications a multi-agent, non-cooperative game. Thus, we formulate the multi-agent power allocation game in D2D-based V2V communications and propose two deep reinforcement learning approaches based on the DDPG algorithm to enable each V2V user to have an optimal power allocation scheme.
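As a small worked illustration of the quantities in this paragraph, the sketch below computes a link's throughput, checks its QoS constraint, and evaluates a per-link energy efficiency. Two points are assumptions made here for illustration and are not taken from the text: the throughput uses the Shannon form with a base-2 logarithm, and the per-link EE is taken as throughput divided by transmit power. All function names are hypothetical.

```python
import math


def throughput(bandwidth_hz, sinr_value):
    """Shannon-type rate, assumed here in bits per second (log base 2)."""
    return bandwidth_hz * math.log2(1.0 + sinr_value)


def meets_qos(sinr_value, sinr_requirement):
    """QoS constraint: the SINR must not fall below the requirement."""
    return sinr_value >= sinr_requirement


def energy_efficiency(bandwidth_hz, sinr_value, power_w):
    """ASSUMPTION: per-link EE = throughput / transmit power."""
    return throughput(bandwidth_hz, sinr_value) / power_w
```

Under these assumptions, a link with unit bandwidth, an SINR of 3, and a transmit power of 0.5 has a throughput of 2 and an EE of 4.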
In RL, an agent interacts with the environment to find the optimal policy through trial-and-error learning. We can formulate this task as a Markov decision process (MDP) [29]. In particular, we define a 4-tuple $\langle S, A, R, P \rangle$, where $S$ and $A$ are the agent state space and action space, respectively. The reward $r = R(s, a, s')$ is obtained at state $s \in S$, action $a \in A$, and next state $s' \in S$. The agent has a transition function $P^a_{ss'}$, which is the probability of reaching the next state $s'$ when taking action $a \in A$ at state $s \in S$.
Regarding the multi-agent power allocation problem in D2D-based V2V communications, we define each V2V transmitter as an agent, so the system consists of N agents. The ith V2V-Tx is the ith agent, represented as $\langle S_i, A_i, R_i, P_i \rangle$, where $S_i$ is the environment state space, $A_i$ is the action space, $R_i$ is the reward function, and $P_i$ is the state transition probability function. Generally, at each time $t$, an agent corresponding to a V2V user observes a state $s^t$ from the state space $S$ and accordingly takes an action $a^t$ of selecting a power level from the action space $A$, based on the policy $\pi$. By taking the action $a^t$, the agent receives a reward $r^t$ and the environment transitions to a new state $s^{t+1}$.
In the next step, we define the action spaces, state spaces, and reward function of the multi-agent power allocation problem in D2D-based V2V communications as follows:

State spaces: At each time $t$, the state of the ith V2V transmitter, observed by the V2V link to characterise the environment, is defined through $I_i \in (0, 1)$, the level of interference.

Action spaces: Agent $i$ at time $t$ takes an action $a^t_i$, which represents the agent's selected power level, according to the current state $s^t_i \in S_i$ under the policy $\pi_i$. The action space of the ith V2V-Tx is the set of admissible power levels $p_i \in (p_i^{\min}, p_i^{\max})$.

Reward function: Our objective is to maximise the total performance of the network by interacting with the environment while satisfying the QoS constraints. Thus, we design a reward function $R_i$ that gives the immediate return the ith V2V user receives for executing action $a_i$ in state $s_i$.

III. MULTI-AGENT POWER ALLOCATION PROBLEM IN D2D-BASED V2V COMMUNICATIONS: DISTRIBUTED DEEP DETERMINISTIC POLICY GRADIENT APPROACH
In RL, there are two main approaches and a hybrid model for solving the games: value function-based methods, policy search-based methods, and an actor-critic approach that employs both value functions and policy search [30].
In this section, we explain the value function and policy search concepts, which can learn on continuous domains. We further propose a solution based on the DDPG algorithm to solve the energy-efficient power allocation problem in D2D-based V2V communications.

A. VALUE FUNCTION
The value function, often denoted as $V^\pi(s)$, estimates the expected reward for an agent starting in state $s$ and following the policy $\pi$ subsequently. It represents how good it is for an agent to be in a given state:

$$V^\pi(s) = \mathbb{E}(R \mid s, \pi),$$

where $\mathbb{E}(\cdot)$ stands for the expectation operation and $R$ denotes the rewards gained from the initial state $s$ while following the policy $\pi$. Among all possible value functions $V^\pi(s)$, there is an optimal value $V^*(s)$ corresponding to an optimal policy $\pi^*$; the optimal value function $V^*(s)$ can be defined as

$$V^*(s) = \max_{\pi} V^\pi(s), \quad \forall s \in S. \tag{13}$$

The optimal policy $\pi^*$ is the policy that can be retrieved from the optimal value function $V^*(s)$ by choosing the action $a$ from the given state $s$ to maximise the expected reward. We can rewrite (13) by using the Bellman equation [31]:

$$V^*(s) = \max_{a \in A} \Big[ r(s, a) + \zeta \sum_{s' \in S} p^a_{ss'} V^*(s') \Big],$$

where $r(s, a)$ is the expected reward obtained when taking action $a$ from the state $s$, $p^a_{ss'}$ is the probability of the next state $s'$ if the agent at state $s$ takes action $a$, and $\zeta \in [0, 1]$ is the discounting factor.
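The action-value function $Q^\pi(s, a)$ is the total reward, which represents how good it is for an agent to pick an action $a$ in state $s$ when following the policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}(R \mid s, a, \pi).$$

The optimal action-value function $Q^*(s, a)$ can be written as

$$Q^*(s, a) = r(s, a) + \zeta \sum_{s' \in S} p^a_{ss'} V^*(s').$$

Thus, we have

$$V^*(s) = \max_{a \in A} Q^*(s, a).$$

Q-learning [32], an off-policy algorithm, regularly uses the greedy policy $\pi = \arg\max_{a \in A} Q(s, a)$ to choose the action. The agent can achieve the optimal results by adjusting the Q value according to the update rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r(s, a) + \zeta \max_{a' \in A} Q(s', a') - Q(s, a) \Big],$$

where $\alpha \in [0, 1]$ is the learning rate.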
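The Q-learning update rule can be illustrated with a small tabular sketch. It is a generic textbook version, not code from this work: the table is a dictionary keyed by (state, action), and the names `q_learning_update` and `greedy_action` are hypothetical. The parameters `alpha` and `zeta` play the roles of the learning rate and discounting factor.

```python
from collections import defaultdict


def greedy_action(q, state, actions):
    """Greedy policy: pick the action with the largest Q value in this state."""
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions, alpha, zeta):
    """One off-policy update towards reward + zeta * max_a' Q(next_state, a')."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + zeta * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]


# Usage: unseen (state, action) pairs default to a Q value of 0.0.
q_table = defaultdict(float)
power_levels = ["low", "high"]
q_learning_update(q_table, "s0", "low", 1.0, "s1", power_levels, alpha=0.5, zeta=0.9)
```

Because the table enumerates every state-action pair, this form only applies to discrete action sets, which is the limitation that motivates the continuous-action methods discussed next.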

B. POLICY SEARCH
The policy gradient, which is one of the policy search techniques, is a gradient-based optimisation algorithm. It aims to model and optimise the policy to directly search for an optimal behaviour strategy $\pi^*$ for the agent. The policy gradient method is popular because of its efficient sampling ability when the number of policy parameters is large. Let $\pi$ and $\theta_\pi$ denote the policy and the vector of policy parameters, respectively, and let $J$ be the performance of the corresponding policy. The value of the reward function depends on this policy, and various algorithms can then be applied to optimise the parameter $\theta_\pi$ to achieve the optimal performance.
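The average reward function on MDPs can be written as

$$J(\theta_\pi) = \sum_{s \in S} d(s) \sum_{a \in A} \pi_\theta(a \mid s) Q^\pi(s, a),$$

where $d(s)$ is the stationary distribution of the Markov chain for policy $\pi_\theta$. Using gradient ascent, we can adjust the parameter $\theta_\pi$ in the direction suggested by $\nabla_\theta J(\theta_\pi)$ to find the optimal $\theta^*_\pi$ that produces the highest reward. The policy gradient can be computed as in [33]:

$$\nabla_\theta J(\theta_\pi) = \sum_{s \in S} d(s) \sum_{a \in A} \nabla_\theta \pi_\theta(a \mid s) Q^\pi(s, a) = \mathbb{E}_{\pi_\theta} \big( \nabla_\theta \ln \pi_\theta(a \mid s) \, Q^\pi(s, a) \big).$$

The REINFORCE algorithm is a Monte-Carlo policy gradient learning algorithm that relies on a return estimated by Monte-Carlo simulations, where episode samples are used to update the policy parameter $\theta_\pi$. The objective of the REINFORCE algorithm is to maximise the expected rewards under policy $\pi_\theta$:

$$\theta^*_\pi = \arg\max_{\theta_\pi} J(\theta). \tag{22}$$

Thus, the gradient is presented as

$$\nabla_\theta J(\theta_\pi) = \mathbb{E}_{\pi_\theta} \big( R \, \nabla_\theta \ln \pi_\theta(a \mid s) \big).$$

Then, the parameters are updated along the positive gradient direction:

$$\theta_\pi \leftarrow \theta_\pi + \alpha \nabla_\theta J(\theta_\pi).$$

A drawback of the REINFORCE algorithm is its slow convergence, due to the high variance of the policy gradients.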
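To make the Monte-Carlo policy gradient idea concrete, the following sketch runs REINFORCE on a toy problem. The setting is a simplification chosen for illustration and is not the V2V environment: each episode is a single step with a fixed reward per action, and the policy is a softmax over one preference parameter per action. The function names and the toy rewards are hypothetical.

```python
import math
import random


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, episodes, alpha, seed):
    """REINFORCE with one-step episodes and a softmax policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # policy parameters, one per action
    for _ in range(episodes):
        probs = softmax(theta)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        episode_return = rewards[action]  # sampled return of this episode
        for a in range(len(theta)):
            # d/d theta_a of ln pi(action) for a softmax policy
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            theta[a] += alpha * episode_return * grad_log  # gradient ascent
    return softmax(theta)


final_policy = reinforce_bandit([0.0, 1.0], episodes=500, alpha=0.1, seed=0)
```

After training, the policy places most of its probability on the action with the higher reward. With stochastic rewards the per-episode gradient estimates would fluctuate much more, which is the high-variance drawback noted above.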

C. DISTRIBUTED DEEP DETERMINISTIC POLICY GRADIENT
By utilising the advantages of both policy search-based methods and value function-based methods, a hybrid model called the actor-critic algorithm has emerged as an effective approach [30]. In policy gradient-based methods, the policy function $\pi(a \mid s)$ is always modelled as a probability distribution over the action space $A$ in the current state, and thus it is stochastic. More recently, the deterministic policy gradient (DPG) was introduced as an actor-critic algorithm in which the policy gradient theorem is extended from stochastic policies to deterministic policies. Inspired by the success of deep Q-learning [26], which uses neural network function approximation to learn value functions for very large state and action spaces online, the combination of DPG and deep learning, called deep deterministic policy gradient, enables learning in continuous spaces.
An existing drawback of most optimisation algorithms is that the samples are assumed to be independently and identically distributed, which does not hold when the samples are generated sequentially by an agent exploring an environment. This leads to the destabilisation and divergence of RL algorithms if we use a non-linear function approximator. To overcome that challenge, we use two major techniques as follows:
• Experience replay buffer: agent $i$ has a replay buffer $D_i$ to store the samples and take mini-batches for training. Transitions are sampled from the environment following the exploration policy, and the tuple $(s^t_i, a^t_i, r^t_i, s^{t+1}_i)$ is stored in $D_i$. When the replay buffer $D_i$ is big enough, a mini-batch $K_i$ of transitions is sampled randomly from the buffer $D_i$ to train the actor and critic networks. Because the replay buffer has a finite size, the oldest samples are removed to make space for new samples, so the buffer is always up to date.
• Target network: At each step of training, the Q value shifts. Thus, if we use a constantly shifting set of values to estimate the target value, the value estimates can easily get out of control, which makes the network unstable. To address this issue, we use copies of the critic and actor networks, $Q'_i(s_i, a_i; \theta^{q'}_i)$ and $\mu'_i(s_i; \theta^{\mu'}_i)$, respectively, to calculate the target values. The parameters $\theta^{q'}_i$ and $\theta^{\mu'}_i$ of these target networks are then updated using soft target updates with $\tau \ll 1$:

$$\theta^{q'}_i \leftarrow \tau \theta^{q}_i + (1 - \tau) \theta^{q'}_i, \qquad \theta^{\mu'}_i \leftarrow \tau \theta^{\mu}_i + (1 - \tau) \theta^{\mu'}_i.$$

By using the target networks, the target values are constrained to change slowly, which brings the learning of the action-value function closer to supervised learning. However, both targets $\mu'_i$ and $Q'_i$ are required to provide a stable target in order to train the critic consistently without divergence. This may slow training, since the target networks delay the propagation of value estimates.

A notable challenge of learning in continuous action spaces is exploration [28]. In order to explore better, we add a small white noise $N_i(0, 1)$ to our actor policy to construct a Gaussian exploration policy $\tilde{\mu}_i$ [28]:

$$\tilde{\mu}_i(s^t_i) = \mu_i(s^t_i; \theta^{\mu}_i) + \epsilon N_i(0, 1),$$

where $\epsilon$ is a small positive constant. The details of our proposed algorithm, distributed deep deterministic policy gradient based on the DDPG algorithm, for the multi-agent power allocation problem in D2D-based V2V communications are described in Algorithm 1.

VOLUME 7, 2019
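The replay buffer, soft target update, and Gaussian exploration described above can be sketched as follows. This is an illustrative stand-in rather than the authors' code: network parameters are represented as plain lists of floats, the names `ReplayBuffer`, `soft_update`, and `explore` are hypothetical, and clipping the noisy action to the admissible power range is an assumption added here so that the action stays feasible.

```python
import random
from collections import deque


class ReplayBuffer:
    """Finite buffer: once full, the oldest transition is dropped first."""

    def __init__(self, capacity, seed=None):
        self._data = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self._data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Uniformly sample a mini-batch without replacement."""
        return self._rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


def soft_update(target_params, online_params, tau):
    """theta_target <- tau * theta + (1 - tau) * theta_target, element-wise."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


def explore(actor_output, epsilon, p_min, p_max, rng):
    """Add scaled white Gaussian noise, then clip (ASSUMPTION) to [p_min, p_max]."""
    noisy = actor_output + epsilon * rng.gauss(0.0, 1.0)
    return min(max(noisy, p_min), p_max)
```

Random sampling from the buffer breaks the temporal correlation between consecutive transitions, and a small `tau` keeps the target parameters close to their previous values after each update.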

IV. SHARING DEEP DETERMINISTIC POLICY GRADIENT FOR MULTI-AGENT POWER ALLOCATION PROBLEM IN D2D-BASED V2V COMMUNICATIONS
In this section, we present a simple improvement of the DDPG algorithm using the parameter sharing technique for multi-agent learning problems. In this algorithm, we can reach more effective policies for all the V2V pairs in the network by sharing the parameters of a single policy, because all agents are homogeneous. Each agent can thereby be trained with the experiences of all agents simultaneously [34].
With the DDDPG algorithm in Algorithm 1, each agent has its own actor network and critic network. This makes the system shift significantly when the number of V2V pairs increases. In addition, the computational complexity, memory storage, and processing time become unmanageable. To overcome that problem, and inspired by the impressive results in [34], we propose a novel model based on DDPG, called the SDDPG algorithm, in which a large number of agents can use shared networks. By adding an embedding layer to build a new input layer of the neural networks, we can use one actor network and one critic network for multiple agents in deep reinforcement learning. Consequently, the overall computational processing in our model is significantly reduced while the performance is preserved. The speed of convergence is also better than that of standard approaches.
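The simplest way to represent an input layer with a node for every pair is “one-hot” encoding, that is, a vector of zeros with a one at a single position. However, when the number of V2V pairs in the network increases, the “one-hot” encoding vector becomes more sparse, with relatively few non-zero values. Thus, “one-hot” encoding has some issues: more data is needed to train the model effectively, and more parameters mean that more computation is required to train and use the model. As a result, the model becomes more difficult to learn effectively, and it is easy to exceed the capabilities of the hardware.

Algorithm 1: Distributed deep deterministic policy gradient (DDDPG)
Initialisation:
for all V2V i, i ∈ N do
    Randomly initialise critic $Q_i(s_i, a_i; \theta^{q}_i)$ and actor $\mu_i(s_i; \theta^{\mu}_i)$
    Randomly initialise targets $Q'_i$ and $\mu'_i$ with parameters $\theta^{q'}_i \leftarrow \theta^{q}_i$, $\theta^{\mu'}_i \leftarrow \theta^{\mu}_i$
    Initialise replay buffer $D_i$
end for
for all V2V i, i ∈ N do
    for episode = 1, . . ., M do
        Initialise the action exploration to a Gaussian $N_i$
        Receive initial observation state $s^1_i$
        for iteration = 1, . . ., T do
            Obtain the action $a^t_i$ at state $s^t_i$ according to the current policy and action exploration noise
            Measure the achieved SINR at the receiver according to (2)
            Update the reward $r^t_i$ according to (11)
            Observe the new state $s^{t+1}_i$
            Store the transition $(s^t_i, a^t_i, r^t_i, s^{t+1}_i)$ in the replay buffer $D_i$
            Sample randomly a mini-batch of K transitions $(s^k_i, a^k_i, r^k_i, s^{k+1}_i)$ from buffer $D_i$
            Update critic by minimising the loss:
                $L_i = \frac{1}{K} \sum_k \big( y^k_i - Q_i(s^k_i, a^k_i; \theta^{q}_i) \big)^2$
            where
                $y^k_i = r^k_i + \zeta Q'_i\big(s^{k+1}_i, \mu'_i(s^{k+1}_i; \theta^{\mu'}_i); \theta^{q'}_i\big)$
            Update the actor policy using the sampled policy gradient:
                $\nabla_{\theta^{\mu}_i} J \approx \frac{1}{K} \sum_k \nabla_a Q_i(s^k_i, a; \theta^{q}_i) \big|_{a = \mu_i(s^k_i)} \nabla_{\theta^{\mu}_i} \mu_i(s^k_i; \theta^{\mu}_i)$
            Update the target networks:
                $\theta^{q'}_i \leftarrow \tau \theta^{q}_i + (1 - \tau) \theta^{q'}_i$
                $\theta^{\mu'}_i \leftarrow \tau \theta^{\mu}_i + (1 - \tau) \theta^{\mu'}_i$
            Update the state $s^t_i = s^{t+1}_i$
        end for
    end for
end for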
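The critic update step of Algorithm 1 can be illustrated with a short sketch of the target and loss computation. It follows the standard DDPG recipe from [28] and is not taken from the authors' code: the networks are passed in as ordinary Python callables, and the names `td_targets` and `critic_loss` are hypothetical.

```python
def td_targets(batch, target_actor, target_critic, zeta):
    """Target y = r + zeta * Q'(s', mu'(s')) for each (s, a, r, s') transition."""
    return [r + zeta * target_critic(s_next, target_actor(s_next))
            for (_s, _a, r, s_next) in batch]


def critic_loss(batch, critic, targets):
    """Mean squared error between the targets and the critic's estimates."""
    squared = [(y - critic(s, a)) ** 2
               for (s, a, _r, _s_next), y in zip(batch, targets)]
    return sum(squared) / len(squared)


# Toy stand-ins (hypothetical) for the target actor, target critic, and critic.
toy_target_actor = lambda s: 0.5 * s
toy_target_critic = lambda s, a: s + a
toy_critic = lambda s, a: 0.0

toy_batch = [(0.0, 0.0, 1.0, 2.0)]  # one transition (s, a, r, s')
toy_targets = td_targets(toy_batch, toy_target_actor, toy_target_critic, zeta=0.9)
toy_loss = critic_loss(toy_batch, toy_critic, toy_targets)
```

In a real implementation the loss would be minimised with respect to the critic's parameters by a gradient-based optimiser, while the target networks are held fixed during that step.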
Embedding is emerging as a promising technique to deal with the above problems: it translates a large sparse vector into a lower-dimensional space while preserving semantic relationships [35]. To apply the embedding layer efficiently in our problem, we divide the input of the ith V2V pair into two parts, $ID_i$ and the QoS constraint $I_i$. Depending on the number of V2V pairs in the network, the output dimension of the embedding layer can be chosen flexibly to reduce the memory storage and processing time while preserving the performance of the network. The $ID_i$ of the V2V pair is fed into the embedding layer and a fully-connected layer before being concatenated with the interference level of the ith V2V pair, $I_i$. We take the concatenated layer as the input of the neural networks in the DDPG algorithm. The actor and critic networks of our proposed model are described in Fig. 2.
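The difference between the one-hot input and the embedded input can be sketched as follows. This is a simplified illustration under stated assumptions: the embedding table holds random values rather than learned ones, the fully-connected layer that follows the embedding in the described model is omitted, and the names `one_hot`, `Embedding`, and `shared_network_input` are hypothetical.

```python
import random


def one_hot(index, size):
    """Sparse identity encoding: a single 1.0 at position `index`."""
    return [1.0 if k == index else 0.0 for k in range(size)]


class Embedding:
    """Lookup table mapping an agent ID to a dense vector of length `dim`."""

    def __init__(self, num_ids, dim, seed=0):
        rng = random.Random(seed)
        # Random initial values; in training these would be learned.
        self.table = [[rng.gauss(0.0, 0.1) for _ in range(dim)]
                      for _ in range(num_ids)]

    def __call__(self, index):
        return self.table[index]


def shared_network_input(embedding, agent_id, interference_level):
    """Concatenate the embedded agent ID with the agent's state feature."""
    return list(embedding(agent_id)) + [interference_level]


embedding = Embedding(num_ids=30, dim=5)
network_input = shared_network_input(embedding, agent_id=3, interference_level=1.0)
```

With 30 agents and an output dimension of 5, the shared networks receive 6 input values per agent instead of the 31 that a one-hot identity plus the state feature would need. The lookup is equivalent to multiplying the one-hot vector by the embedding table, so no identity information is lost.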
The details of the SDDPG algorithm-based approach for multi-agent power allocation problem in D2D-based V2V communications are described in Algorithm 2.

V. SIMULATION RESULTS
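In this section, we present simulation results obtained on a PC with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz to demonstrate the effectiveness of our proposed methods in solving the … algorithms. We design the actor and critic networks with one input layer, one output layer, and one hidden layer of 100 units, and use the Adam optimisation algorithm [37] for training. The parameters of the neural networks are initialised with small random values drawn from a zero-mean Gaussian distribution. The other simulation parameters are given in Table 1.

Algorithm 2: Sharing deep deterministic policy gradient (SDDPG)
Initialise the critic network $Q(s, a; \theta^{q})$ and actor network $\mu(s; \theta^{\mu})$ with random parameters $\theta^{q}$ and $\theta^{\mu}$
Initialise the target networks $Q'$ and $\mu'$ with parameters $\theta^{q'} \leftarrow \theta^{q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$
Initialise replay buffer $D$
for all V2V i, i ∈ N do
    for episode = 1, . . .
            . . . ) from buffer $D$
            Update critic by minimising the loss:
                $L = \frac{1}{K} \sum_k \big( y^k - Q(s^k, a^k; \theta^{q}) \big)^2$
            where
                $y^k = r^k + \zeta Q'\big(s^{k+1}, \mu'(s^{k+1}; \theta^{\mu'}); \theta^{q'}\big)$
            Update the actor policy using the sampled policy gradient:
                $\nabla_{\theta^{\mu}} J \approx \frac{1}{K} \sum_k \nabla_a Q(s^k, a; \theta^{q}) \big|_{a = \mu(s^k)} \nabla_{\theta^{\mu}} \mu(s^k; \theta^{\mu})$
            Update the target networks:
                $\theta^{q'} \leftarrow \tau \theta^{q} + (1 - \tau) \theta^{q'}$
                $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau) \theta^{\mu'}$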
Fig. 3 illustrates the EE performance of the network using the DDDPG algorithm for different values of the mini-batch size $K$ and of the learning rates of the actor and critic networks, $\alpha_A$ and $\alpha_C$, respectively. From Fig. 3 (a), we can see that with a small batch size, our proposed algorithms may need a long time to reach the optimal policy. On the other hand, if the batch size is too large, the learning process may become trapped in a local optimum and fail to reach the best performance, although the calculated gradient is more accurate than with a small batch size; hence, it may lead to slower convergence. Meanwhile, the parameters of the neural networks are updated according to the value of the learning rate. The learning rate determines the speed of convergence and the stability of our proposed algorithms. In Fig. 3 (b), small values of the learning rate result in slower convergence. On the contrary, if we choose a high learning rate, the algorithms can diverge from the optimal solution. Our proposed algorithms achieve the best performance with the learning rates $\alpha_A = 0.0001$ and $\alpha_C = 0.0001$. Based on the results shown in Fig. 3, we choose the batch size $K = 32$ and the initial learning rate $\alpha = 0.0001$ for the actor and critic networks.
Fig. 4 compares the performance of our two proposed approaches based on the DDDPG and SDDPG algorithms, with the output dimension of the embedding layer set to 5, $|Dims| = 5$, against the standard DDPG algorithm for the multi-agent power allocation problem in D2D-based V2V communications. The EE performance of the network when using the DDDPG and SDDPG algorithms is almost identical and better than that of the standard DDPG in multi-agent learning. In terms of convergence, the SDDPG algorithm converges faster than both the DDDPG algorithm and the standard algorithm. The reason is that when we use shared networks for N agents, these networks are trained many times, and subsequent agents can use the previously trained networks to reach an optimal policy faster than with the DDDPG algorithm-based approach. These results suggest that combining multi-agent learning with the DDPG algorithm significantly helps to find an optimal policy for the non-cooperative energy-efficient power allocation problem in D2D-based V2V communications.

The DDDPG and SDDPG algorithm-based approaches outperform the approach based on the classical DDPG algorithm for different numbers of V2V pairs. The difference between the simulation results of the models based on the DDDPG and SDDPG algorithms is small even when the number of V2V pairs increases. With N = 30, the average performances of the DDDPG and SDDPG algorithm-based approaches are almost identical. The performance of the scheme based on the SDDPG algorithm is better than that of the DDDPG algorithm in some cases. However, the DDDPG algorithm uses N neural networks for the actor function and N neural networks for the critic function. Meanwhile, in the SDDPG algorithm, all the agents share one actor network and one critic network. Therefore, the computational processing and memory storage used by the DDDPG algorithm-based approach are many times higher than those of the SDDPG algorithm when the number of V2V pairs increases.
Next, in Fig. 6, we compare the EE performance of the network using the SDDPG algorithm-based approach for different output dimensions of the embedding layer. With N = 30 V2V pairs in the network, we achieve the best performance when setting the output dimension of the embedding layer to Dims = 5. Higher output dimensions in the embedding layer do not guarantee better performance. However, the variance of models with higher output dimensions of the embedding layer is lower.
Moreover, in Fig. 7, we present the performance of the network for different values of the SINR requirement γ* in models using the DDDPG, SDDPG, and classical DDPG algorithms. As can be seen from Fig. 7, when the SINR requirement γ* is too high, the EE degrades due to the decrease in the number of V2V links that satisfy the QoS requirements. The SDDPG algorithm-based approach performs better than the DDDPG algorithm-based approach when γ* is high. In addition, our proposed algorithms, SDDPG and DDDPG, are superior to the classical DDPG algorithm for the multi-agent power control problem in D2D-based V2V communications.
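The trend with respect to γ* can be illustrated with a toy calculation. The sketch assumes a simplified EE metric in which a link contributes its rate log2(1 + SINR) only if it meets the SINR requirement, and the total rate is divided by the total transmit power. This metric, the function name, and the SINR and power values are assumptions made for illustration. They are not the EE definition or the simulation settings of this paper.

```python
import math


def energy_efficiency(sinrs, powers, sinr_req):
    """Toy EE: rate of links meeting the SINR requirement over total power."""
    rate = sum(math.log2(1.0 + s) for s in sinrs if s >= sinr_req)
    total_power = sum(powers)
    return rate / total_power if total_power > 0 else 0.0


# Hypothetical per-link SINRs (linear scale) and transmit powers.
sinrs = [1.0, 3.0, 7.0, 15.0]
powers = [0.25, 0.25, 0.25, 0.25]

for sinr_req in (0.5, 2.0, 5.0, 10.0, 20.0):
    ee = energy_efficiency(sinrs, powers, sinr_req)
    print(f"SINR requirement = {sinr_req}: EE = {ee}")
```

With fixed SINRs, raising the requirement can only remove links from the sum, so the toy EE never increases as the requirement grows.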
Finally, we evaluate the processing time of our proposed models at test time, after the neural networks have been trained, in comparison with other approaches. Table 2 presents the average processing time in different scenarios. By using our proposed models, each V2V user can choose the power level that maximises the EE performance of the network within milliseconds while satisfying the QoS requirements. In particular, with N = 30 V2V pairs, we need only 21.1 ms and 28.07 ms to solve the power allocation problem in D2D-based V2V communications by using the DDDPG and SDDPG algorithms, respectively. In contrast, the method based on the logarithmic inequality algorithm in [12] needs 145.6 ms to solve a similar problem with the same environment parameters. Therefore, the results suggest that our proposed models are promising techniques for real-time scenarios.
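For reference, the relative speed-up implied by these figures can be computed directly. The snippet uses only the three processing times quoted above for N = 30. The dictionary and function names are illustrative.

```python
# Average processing times for N = 30, in milliseconds, as quoted in the text.
TIMES_MS = {
    "DDDPG": 21.1,
    "SDDPG": 28.07,
    "logarithmic inequality [12]": 145.6,
}

BASELINE_MS = TIMES_MS["logarithmic inequality [12]"]


def speedup(name):
    """How many times faster `name` is than the baseline in [12]."""
    return BASELINE_MS / TIMES_MS[name]


for name in ("DDDPG", "SDDPG"):
    print(f"{name}: {speedup(name):.1f}x faster than the method in [12]")
```

This gives roughly a 6.9-fold and a 5.2-fold reduction in processing time for the DDDPG and SDDPG algorithms, respectively.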

VI. CONCLUSION
In this paper, we proposed two models, based on the DDDPG and SDDPG algorithms, to solve the multi-agent energy-efficient power allocation problem in D2D-based V2V communications. By utilising the advantages of neural networks and the embedding layer, our proposed models can overcome the limitations of existing approaches. In the simulations, the proposed models outperformed the baseline algorithms in terms of the EE performance of the network, computational complexity, and memory storage. The computational complexity and memory storage of the solution can be significantly reduced by using the SDDPG algorithm when the number of V2V pairs increases. In the future, we will investigate more efficient multi-agent learning approaches and more advanced deep learning models in order to improve the learning convergence, reduce the training variance, and reduce the algorithm's computational complexity.

Algorithm 1
Distributed Deep Deterministic Policy Gradient Algorithm for Multi-Agent Power Allocation Problem in D2D-Based V2V Communications

FIGURE 2. Proposed model with sharing actor and critic network for multi-agent deep reinforcement learning problem.

Algorithm 2
Sharing Deep Deterministic Policy Gradient for Multi-Agent Power Allocation Problem in D2D-Based V2V Communications
Initialisation:

FIGURE 3. The EE performance of the network by using the DDDPG algorithm in multi-agent power allocation problem in D2D-based V2V communications with different values of batch size K and learning rate α, the number of V2V pairs, N = 30.

FIGURE 4. The EE performance of the network by using the DDDPG, SDDPG and DDPG algorithm in multi-agent power allocation problem in D2D-based V2V communications with the number of V2V pairs, N = 30.

FIGURE 5. Performance results of the DDDPG, SDDPG and DDPG algorithm-based approaches with different number of V2V pairs in the network.

FIGURE 6. The EE performance of the network by using the SDDPG algorithm with different output dimensions of embedding layer in multi-agent power allocation problem in D2D-based V2V communications with the number of V2V pairs, N = 30.

FIGURE 7. The EE performance of the network by using the DDDPG, SDDPG and DDPG algorithm in multi-agent power allocation problem in D2D-based V2V communications while considering different values of SINR requirement, γ*.
. , M do
  Initialise the embedding layer
  Initialise a random process N_i for action exploration
  Receive initial observation state s_i^1 by concatenating the output of the embedding layers and I_i
  for iteration = 1, ... , T do
    Obtain the action a_i^t at state s_i^t according to the current policy and exploration noise
    Measure the achieved SINR at the receiver accord-

TABLE 2. The running time of our proposed models in comparison with other approaches.