Optimising Energy Efficiency in UAV-Assisted Networks using Deep Reinforcement Learning

In this letter, we study the energy efficiency (EE) optimisation of unmanned aerial vehicles (UAVs) providing wireless coverage to static and mobile ground users. Recent multi-agent reinforcement learning approaches optimise the system's EE using a 2D trajectory design, neglecting interference from nearby UAV cells. We aim to maximise the system's EE by jointly optimising each UAV's 3D trajectory, number of connected users, and the energy consumed, while accounting for interference. Thus, we propose a cooperative Multi-Agent Decentralised Double Deep Q-Network (MAD-DDQN) approach. Our approach outperforms existing baselines in terms of EE by as much as 55-80%.


I. INTRODUCTION
The deployment of unmanned aerial vehicles (UAVs) to provide wireless coverage to ground users has received significant research attention [1]-[7]. UAVs can play a vital role in supporting Internet of Things (IoT) networks by providing connectivity to a large number of devices, static or mobile [1]. More importantly, UAVs have numerous real-world applications, ranging from assisted communication in disaster-affected areas to surveillance and search and rescue operations [8], [9]. Specifically, UAVs can be deployed when the existing terrestrial infrastructure is congested or down. Nevertheless, to provide ubiquitous services to dynamic ground users, UAVs require robust strategies to optimise their flight trajectory while providing coverage. As energy-constrained UAVs operate in the sky, they may face interference from nearby UAV cells or other access points sharing the same frequency band, which degrades the system's energy efficiency (EE) [7].
There has been significant research effort on optimising EE in multi-UAV networks [1]-[5]. The authors in [2] proposed an iterative algorithm to minimise the energy consumption of UAVs serving as aerial base stations to static ground users. In [4], a game-theoretic approach was proposed to maximise the system's EE while maximising the ground area covered by the UAVs irrespective of the presence of ground users. However, these works rely on a central ground controller for the UAVs' decision making, which makes them impractical to deploy in emergencies because of the significant amount of information exchanged between the UAVs and the controller. Moreover, it may be difficult to track user locations in such a scenario.
Machine learning is increasingly being used to address complex multi-UAV deployment problems. In particular, multi-agent reinforcement learning (MARL) approaches have been deployed in several works to optimise the system's EE. A distributed Q-learning approach [1] focused on optimising the energy utilisation of UAVs without considering the system's EE. To address this challenge, a deep reinforcement learning (DRL) approach [7] could be adopted. In our prior work [10], a DRL-based approach was proposed to optimise the EE of fixed-wing UAVs, which move in circular orbits and typically cannot hover like rotary-wing UAVs. Moreover, the focus was on UAVs providing coverage to static ground users. The distributed DRL work in [3] was an improvement on the centralised approach in [5], where all UAVs are controlled by a single autonomous agent. The authors in [3], [5] proposed a deep deterministic policy gradient (DDPG) approach to improve the system's EE as UAVs hover at fixed altitudes while providing coverage to static ground users in an interference-free network environment. Although the approaches in [3] and [5] promise performance gains in terms of coverage score, they focus on the 2D trajectory optimisation of UAVs serving static ground users.
Motivated by the research gaps above, we focus on maximising the system's EE by optimising the 3D trajectory of each UAV over a series of time steps, while taking into account the impact of interference from nearby UAV cells and the coverage of both static and mobile ground users. We propose a cooperative Multi-Agent Decentralised Double Deep Q-Network (MAD-DDQN) approach, where each agent's reward reflects the coverage performance in its neighbourhood. The MAD-DDQN approach maximises the system's EE without hampering performance gains in the network.
All authors are with the CONNECT centre, School of Computer Science and Statistics, Trinity College Dublin, Ireland. E-mail: {omoniwab, galkinb, duspari}@tcd.ie. This work was supported, in part, by the Science Foundation Ireland (SFI).

II. SYSTEM MODEL
We consider a set of static and mobile ground users $\xi$ located in a given area, as shown in Figure 1. Each user $i \in \xi$ at time $t$ is located at the coordinate $(x_i^t, y_i^t)$. We assume service unavailability from the existing terrestrial infrastructure due to disasters or increased network load. As such, a set $N$ of quadrotor UAVs is deployed within the area to provide wireless coverage to the ground users. A serving UAV $j \in N$ at time $t$ is located at the coordinate $(x_j^t, y_j^t, h_j^t)$. Without loss of generality, we assume a guaranteed line-of-sight (LOS) channel condition [11], due to the aerial positions of the UAVs.
The signal-to-interference-plus-noise ratio (SINR) is a measure of signal quality. It is defined as the ratio of the power of the signal of interest to the sum of the interference power from all other interfering signals and the noise power. Each user $i \in \xi$ at time $t$ can be connected to a single UAV $j \in N$, namely the one that provides the strongest downlink SINR. Thus, the SINR at time $t$ is expressed as [1],
$$\gamma_{i,j}^t = \frac{\beta P (d_{i,j}^t)^{-\alpha}}{\sum_{z \in \chi_{int}} \beta P (d_{i,z}^t)^{-\alpha} + \sigma^2}, \tag{1}$$
where $\beta$ and $\alpha$ are the attenuation factor and path loss exponent that characterise the wireless channel, respectively, $\sigma^2$ is the power of the additive white Gaussian noise at the receiver, $d_{i,j}^t$ is the distance between user $i$ and UAV $j$ at time $t$, $\chi_{int} \subseteq N$ is the set of interfering UAVs, $z$ is the index of an interfering UAV in the set $\chi_{int}$, and $P$ is the transmit power of the UAVs. We model the mobility of mobile users using the Gauss-Markov Mobility (GMM) model [12], which allows users to dynamically change their positions. UAVs must optimise their flight trajectory to provide ubiquitous connectivity to users.
Given a channel bandwidth $B_w$, the receiving data rate of a ground user can be expressed using Shannon's equation [7],
$$R_{i,j}^t = B_w \log_2\left(1 + \gamma_{i,j}^t\right). \tag{2}$$
In our interference-limited system, coverage is affected by the SINR. Hence, we compute the connectivity score of a UAV $j \in N$ at time $t$ as [3],
$$C_j^t = \sum_{i \in \xi} w_j^t(i), \tag{3}$$
where $w_j^t(i) \in \{0, 1\}$ indicates whether user $i$ is connected to UAV $j$ at time $t$: $w_j^t(i) = 1$ if $\gamma_{i,j}^t \geq \gamma_{th}$ and $w_j^t(i) = 0$ otherwise, where $\gamma_{th}$ is the predefined SINR threshold. Likewise, $R_{i,j}^t = 0$ if user $i$ is not connected to UAV $j$.
During flight operations, a UAV $j \in N$ at time $t$ expends energy $e_j^t$. A UAV's total energy $e_T$ is expressed as the sum of the propulsion energy $e_P$ and the communication energy $e_C$, $e_T = e_P + e_C$. Since $e_C$ is in practice much smaller than $e_P$, i.e., $e_C \ll e_P$ [1], we ignore $e_C$. A closed-form analytical propulsion power consumption model for a rotary-wing UAV at time $t$ is given as [13],
$$P(t) = \kappa_0 \left(1 + \frac{3V^2}{U_{tip}^2}\right) + \kappa_i \left(\sqrt{1 + \frac{V^4}{4 v_0^4}} - \frac{V^2}{2 v_0^2}\right)^{1/2} + \frac{1}{2} \nu \rho s A V^3, \tag{4}$$
where $\kappa_0$ and $\kappa_i$ are the UAV's flight constants (e.g., rotor radius or weight), $U_{tip}$ is the rotor blade's tip speed, $v_0$ is the mean hovering velocity, $\nu$ is the drag ratio, $s$ is the rotor solidity, $A$ is the rotor disc area, $V$ is the UAV's speed at time $t$ and $\rho$ is the air density. In particular, we take into account the basic operations of the UAV, such as hovering and acceleration. Therefore, we can derive the average propulsion power over all time steps as $\frac{1}{T} \sum_{t=1}^{T} P(t)$, and the total energy consumed by a UAV $j$ in time step $t$ is given as [1],
$$e_j^t = \delta_t P(t), \tag{5}$$
where $\delta_t$ is the duration of each time step. The EE at time $t$ can be expressed as the ratio of the total data throughput to the total energy consumed by all UAVs,
$$\eta_t = \frac{\sum_{j \in N} \sum_{i \in \xi} R_{i,j}^t}{\sum_{j \in N} e_j^t}. \tag{6}$$
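The quantities defined in this section (SINR, Shannon rate, connectivity score, propulsion power and EE) can be evaluated for one time step with a few lines of code. The following is a minimal sketch under stated assumptions, not the authors' simulator: every non-serving UAV is treated as an interferer, users sit at ground level, $\beta = 1$, and the rotor constants in `ROTOR` are illustrative values chosen for the example rather than parameters taken from this letter. The propulsion formula is the commonly used rotary-wing expression and is assumed to be the model cited as [13]. All function and parameter names are hypothetical.

```python
import math

def dbm_to_watt(dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** ((dbm - 30) / 10)

def rx_power(user_xy, uav_xyz, P, beta, alpha):
    """Received power beta * P * d^(-alpha) at a ground user (height 0)."""
    dx = uav_xyz[0] - user_xy[0]
    dy = uav_xyz[1] - user_xy[1]
    d = math.sqrt(dx * dx + dy * dy + uav_xyz[2] ** 2)
    return beta * P * d ** (-alpha)

def sinr(user_xy, j, uavs, P, beta, alpha, noise):
    """SINR of a user served by UAV j.
    Assumption: every other UAV in `uavs` is an interferer."""
    signal = rx_power(user_xy, uavs[j], P, beta, alpha)
    interference = sum(rx_power(user_xy, u, P, beta, alpha)
                       for z, u in enumerate(uavs) if z != j)
    return signal / (interference + noise)

def propulsion_power(V, k0, ki, U_tip, v0, nu, rho, s, A):
    """Rotary-wing propulsion power in watts at speed V (m/s)."""
    blade = k0 * (1 + 3 * V ** 2 / U_tip ** 2)
    induced = ki * math.sqrt(
        math.sqrt(1 + V ** 4 / (4 * v0 ** 4)) - V ** 2 / (2 * v0 ** 2))
    parasite = 0.5 * nu * rho * s * A * V ** 3
    return blade + induced + parasite

def energy_efficiency(users, uavs, speeds, dt, gamma_th, bw,
                      P, beta, alpha, noise, rotor):
    """Return (EE in bit/J, per-UAV connectivity scores) for one time step."""
    scores = [0] * len(uavs)
    throughput = 0.0
    for user in users:
        gammas = [sinr(user, j, uavs, P, beta, alpha, noise)
                  for j in range(len(uavs))]
        j = max(range(len(uavs)), key=gammas.__getitem__)  # strongest SINR
        if gammas[j] >= gamma_th:          # user counts as connected
            scores[j] += 1
            throughput += bw * math.log2(1 + gammas[j])
    energy = sum(dt * propulsion_power(v, **rotor) for v in speeds)
    return throughput / energy, scores

# Illustrative rotor constants (assumed, not from this letter).
ROTOR = dict(k0=79.86, ki=88.63, U_tip=120.0, v0=4.03,
             nu=0.6, rho=1.225, s=0.05, A=0.503)

uavs = [(250.0, 250.0, 100.0), (750.0, 750.0, 100.0)]
users = [(240.0, 260.0), (760.0, 740.0), (500.0, 500.0)]
ee, scores = energy_efficiency(
    users, uavs, speeds=[0.0, 10.0], dt=1.0,
    gamma_th=10 ** (5 / 10), bw=1e6, P=dbm_to_watt(20),
    beta=1.0, alpha=2.0, noise=dbm_to_watt(-130), rotor=ROTOR)
print(scores, ee)
```

In this toy layout the user midway between the two UAVs sees an SINR close to 1 (0 dB), below the 5 dB threshold, so it is counted as an outage. This illustrates why interference from neighbouring cells cannot be ignored when computing the connectivity score.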

III. MULTI-AGENT REINFORCEMENT LEARNING APPROACH FOR ENERGY EFFICIENCY OPTIMISATION
In this section, we formulate the problem and propose our MAD-DDQN algorithm to improve the trajectory of each UAV in a manner that maximises the total system's EE.

A. Problem Formulation
Our objective is to maximise the total system's EE by jointly optimising each UAV's 3D trajectory, number of connected users, and the energy consumed by the UAVs serving ground users under a strict energy budget. Maximising the number of connected users $C_j^t$ will maximise the total amount of data $\sum_{i \in \xi} R_{i,j}^t$ that UAV $j$ delivers in time step $t$ which, for a given amount of consumed energy $e_j^t$, will also maximise the EE $\eta_t$. Therefore, the optimisation problem (7a) is to maximise $\eta_t$ subject to constraints that include the energy budget
$$e_j^t \leq e_{max}, \quad \forall j, t, \tag{7c}$$
and the position bounds $x_{min} \leq x_j^t \leq x_{max}$, $y_{min} \leq y_j^t \leq y_{max}$ and $h_{min} \leq h_j^t \leq h_{max}$, where $e_{max}$ is the maximum UAV energy level, and $x_{min}, y_{min}, h_{min}$ and $x_{max}, y_{max}, h_{max}$ are the minimum and maximum 3D coordinates of $x$, $y$ and $h$, respectively. As multiple wireless transmitters sharing the same frequency band are in close proximity to one another, the possibility of interference is significantly increased. Problem (7a) is known to be NP-complete [6]. It is also non-convex, and thus has multiple local optima. For this reason, solving (7a) with conventional optimisation approaches is challenging [1], [6]. Specifically, problem (7a) becomes more complex as more UAVs are deployed in a shared wireless environment, so it is challenging to find the optimal cooperative strategies that improve the system's EE while completing the coverage tasks under dynamic settings. This is often because UAVs may become selfish and pursue the goal of improving their individual EE while minimising the communication outage and energy consumption, rather than the collective goal of maximising the system's EE. In such cases, cooperative MARL approaches may be suitable when the individual and collective interests of UAVs conflict.
Deep RL has been shown to perform well in decision-making tasks in such a dynamic environment [14]. Hence, we adopt a cooperative deep MARL approach to solve the system's EE optimisation problem.

B. Cooperative Multi-Agent Decentralised Double Deep Q-Network (MAD-DDQN)
We propose a cooperative MAD-DDQN approach, where each agent's reward reflects the coverage performance in its neighbourhood. Here, each UAV is controlled by a Double Deep Q-Network (DDQN) agent that aims to maximise the system's EE by jointly optimising its 3D trajectory, number of connected users, and the energy consumed. We assume the agents interact with each other in a shared and dynamic environment, which may lead to learning instabilities due to conflicting policies from other agents. From Algorithm 1, Agent $j$ follows an $\epsilon$-greedy policy by executing an action $a$, transiting from state $s$ to a new state $s'$ and receiving a reward reflecting the coverage performance in its neighbourhood in (8), after which the DDQN procedure described on lines 17-25 optimises the agent's decisions. We explicitly define the states, actions, and reward as follows:
• State space: We consider the three-dimensional (3D) position of each UAV [6], the connectivity score and the UAV's instantaneous energy level at time $t$, expressed as a tuple, $\langle x^t : \{0, 1, ..., x_{max}\},\ y^t : \{0, 1, ..., y_{max}\},\ h^t : \{h_{min}, ..., h_{max}\},\ C^t,\ e^t \rangle$.
• Action space: At each time-step $t \in T$, each UAV takes an action by changing its direction along the 3D coordinates. Unlike our closest related work and evaluation baseline [3], we discretise the agent's actions following the design from [1] and [6], as follows: $(+x_s, 0, 0)$, $(-x_s, 0, 0)$, $(0, +y_s, 0)$, $(0, -y_s, 0)$, $(0, 0, +z_s)$, $(0, 0, -z_s)$ and $(0, 0, 0)$. Our rationale for discretising the action space was to ensure quick adaptability and convergence of the agents.
• Reward: The agent's goal is to learn a policy that implicitly maximises the system's EE by jointly minimising the ground users' outage and the total UAV energy consumption. Hence, we introduce a shared cooperative factor to shape the reward (8) of each agent $j$ in each time-step $t \in T$. The reward is formulated in terms of $C_j^t$ and $C_j^{t-1}$, the connectivity scores in the present and previous time-steps, respectively, and $\omega = \frac{e_j^{t-1} - e_j^t}{e_j^t + e_j^{t-1}}$, where $e_j^t$ and $e_j^{t-1}$ are the instantaneous energy consumed by agent $j$ in the present and previous time-steps, respectively. To enhance cooperation, we assign each agent a '+1' incentive from its neighbourhood only when the overall connectivity score, i.e., the total number of users connected by UAVs in its locality, in the present time-step, $C_o^t$, exceeds that in the previous time-step, $C_o^{t-1}$; otherwise the agent receives a '−1' incentive.
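The state, action and reward definitions above map naturally onto a few small functions. The sketch below is illustrative only: the seven displacements, the energy term $\omega$ and the '+1'/'−1' neighbourhood incentive follow the description in the text, but the way the terms are summed in `reward`, the sign-based local connectivity term, and the clipping of positions to the flight bounds are assumptions made for this example, since the exact form of (8) is not reproduced here. All names are hypothetical.

```python
def action_set(xs, ys, zs):
    """The seven discrete displacements: +/- along x, y, z, plus hover."""
    return [(+xs, 0, 0), (-xs, 0, 0),
            (0, +ys, 0), (0, -ys, 0),
            (0, 0, +zs), (0, 0, -zs),
            (0, 0, 0)]

def move(pos, action, lo, hi):
    """Apply a displacement, then clip each axis to [lo, hi].
    Assumption: clipping is how the coordinate bounds are enforced."""
    return tuple(min(max(p + a, l), h)
                 for p, a, l, h in zip(pos, action, lo, hi))

def energy_term(e_now, e_prev):
    """omega = (e_prev - e_now) / (e_now + e_prev); positive when less
    energy is consumed than in the previous time-step."""
    return (e_prev - e_now) / (e_now + e_prev)

def neighbourhood_incentive(c_o_now, c_o_prev):
    """+1 if the overall connectivity in the locality improved, else -1."""
    return 1 if c_o_now > c_o_prev else -1

def reward(c_now, c_prev, e_now, e_prev, c_o_now, c_o_prev):
    """ASSUMED additive combination of the three ingredients; the
    letter's exact reward expression may differ."""
    local = (c_now > c_prev) - (c_now < c_prev)  # sign of local change
    return (local
            + energy_term(e_now, e_prev)
            + neighbourhood_incentive(c_o_now, c_o_prev))

# Example: one step east from the area boundary, then a reward evaluation.
actions = action_set(10, 10, 10)
new_pos = move((995, 500, 60), actions[0], lo=(0, 0, 20), hi=(1000, 1000, 120))
r = reward(c_now=5, c_prev=4, e_now=150.0, e_prev=150.0, c_o_now=9, c_o_prev=8)
print(new_pos, r)
```

A design note on the energy term: because $\omega$ is normalised by the sum of the two energies, it always lies in $(-1, 1)$ for positive energies, which keeps it on the same scale as the '+1'/'−1' incentive.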

C. DDQN Implementation
The neural network (NN) architecture of Agent $j$'s DDQN, shown in Figure 2, comprises a 5-dimensional state space input vector, densely connected to 2 layers with 128 and 64 nodes, each using a rectified linear unit (ReLU) activation function, leading to an output layer with 7 dimensions. Our decentralised approach assumes agents to be independent learners. Following the analysis presented in [15], the computational complexity of the NN architecture used in the MAD-DDQN is approximately $O(D_s K W)$ with an average response time of 5.6 ms, while that of our closest related work and evaluation baseline [3] (MADDPG) is approximately $O(D_s K W) + O((D_a + D_s) K W)$ with an average response time of 7.4 ms, where $D_s$ is the dimension of the state space, $D_a$ is the dimension of the action space, $K$ is the number of layers, and $W$ is the number of nodes in each hidden layer.
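To make the 5-128-64-7 architecture concrete, the following dependency-free sketch builds a network of that shape and runs one forward pass from a state vector to seven Q-values. It is not the authors' implementation: the weight initialisation, the linear output layer and the example state scaling are assumptions, and a practical agent would use a deep learning library rather than nested lists.

```python
import random

LAYER_SIZES = [5, 128, 64, 7]  # state input, two hidden layers, 7 actions

def init_network(sizes, seed=0):
    """Create (weights, biases) per layer with He-style random weights.
    The initialisation scheme is an assumption for illustration."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5
        W = [[rng.gauss(0.0, scale) for _ in range(n_in)]
             for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers

def q_values(layers, state):
    """Forward pass: ReLU on hidden layers, linear output (assumed)."""
    x = list(state)
    for k, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def parameter_count(sizes):
    """Number of trainable weights and biases in the dense network."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(sizes[:-1], sizes[1:]))

net = init_network(LAYER_SIZES)
# Hypothetical normalised state: (x, y, h, connectivity score, energy level).
q = q_values(net, [0.5, 0.5, 0.1, 0.2, 1.0])
greedy_action = max(range(len(q)), key=q.__getitem__)
print(parameter_count(LAYER_SIZES), greedy_action)
```

With these layer sizes the network has 9,479 trainable parameters (768 + 8,256 + 455), which gives a sense of how lightweight each agent's model is.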
In the training phase, given the state information as input, Agent $j$ trains the main network to make better decisions by yielding Q-values corresponding to each possible action as output. The maximum Q-value obtained determines the action the agent executes. At each time step, Agent $j$ observes its present state $s$ and updates its trajectory by selecting an action $a$ in accordance with its policy. Following its action in time step $t$, Agent $j$ observes a reward $r$, which is defined in (8), and transits to a new state $s'$. The information $(s, a, r, s')$ is stored in the replay memory as shown in Figure 2. Agent $j$ then samples a random mini-batch from the replay memory and uses the mini-batch to obtain $y_j$. The optimisation is performed with $L(\theta)$, and $\theta$ is updated accordingly. Every 100th time step, the target Q-network updates its parameters $\theta^-$ with the parameters $\theta$ of the main network. For the training, the memory size was set to 10,000 and the mini-batch size to 1024. The optimisation is performed using a variant of stochastic gradient descent called RMSprop to minimise the loss, following the methodology described in [16, Chapter 4]. The learning rate and discount factor were set to 0.0001 and 0.95, respectively. We train the Q-networks by running multiple episodes, and at each training step the $\epsilon$-greedy policy is used to balance exploration and exploitation [16]. In the $\epsilon$-greedy policy, the action is randomly selected with probability $\epsilon$, whereas the action with the largest action value is selected with probability $1 - \epsilon$. The initial value of $\epsilon$ was set to 1 and linearly decreased to 0.01.
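Three pieces of the training procedure described above are easy to state in code: the linearly decayed $\epsilon$-greedy action choice, the target value computed from the main and target networks, and the periodic target-network synchronisation. The sketch below uses the standard Double DQN target, in which the main network selects the next action and the target network evaluates it; this standard form is assumed to be what is meant by obtaining $y_j$. The decay horizon `decay_steps` is a hypothetical parameter, since the text gives only the start and end values of $\epsilon$.

```python
import random

def epsilon(step, decay_steps, eps_start=1.0, eps_end=0.01):
    """Linear decay from eps_start to eps_end over decay_steps steps.
    decay_steps is an assumed horizon, not a value from the text."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def select_action(q, eps, rng=random):
    """Epsilon-greedy: random action with probability eps, else argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q))
    return max(range(len(q)), key=q.__getitem__)

def ddqn_target(r, q_main_next, q_target_next, done, gamma=0.95):
    """Standard Double DQN target:
    y = r + gamma * Q_target(s', argmax_a Q_main(s', a))."""
    if done:
        return r
    a_star = max(range(len(q_main_next)), key=q_main_next.__getitem__)
    return r + gamma * q_target_next[a_star]

def should_sync_target(step, period=100):
    """True on every 100th step, when theta^- is overwritten by theta."""
    return step > 0 and step % period == 0

# Example: the main network prefers action 1, so the target network's
# value for action 1 (0.3) is used, not its own maximum (0.8).
y = ddqn_target(r=1.0, q_main_next=[0.2, 0.9, 0.1],
                q_target_next=[0.5, 0.3, 0.8], done=False)
print(y, epsilon(500, 1000), select_action([0.1, 0.7, 0.2], eps=0.0))
```

Decoupling action selection from action evaluation in this way is what distinguishes Double DQN from plain DQN, and it reduces the overestimation that arises when the same network both picks and scores the maximising action.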

IV. EVALUATION AND RESULTS
In this section, we verify the effectiveness of the proposed MAD-DDQN approach against the following baselines: (i) the random policy; and (ii) the MADDPG [3] approach that considers a 2D trajectory optimisation while neglecting interference from nearby UAV cells. Simulation parameters are presented in Table I. We simulate a varying number of UAVs, ranging from 2 to 12, serving both static and mobile ground users in a 1000 × 1000 m² area, as shown in Figure 2. We perform 2000 runs of Monte-Carlo (MC) trials over trained episodes.
Table I. Simulation parameters.
Parameter                                Value
… [6]                                    20 dBm
Noise power/SINR threshold [2]           -130 dBm/5 dB
Bandwidth [6]                            1 MHz
Pathloss exponent [2], [6]               2
UAV step distance (∀ x_s, y_s, z_s)      [0-20] m
In Figure 3, we compare the MAD-DDQN approach with the baselines to evaluate the impact of the number of deployed UAVs on the EE, ground users' outage and total energy consumption. Because the baseline MADDPG approach takes significantly longer to converge (learn suitable behaviours), to achieve a fair comparison Figure 3 compares the performance after training the MAD-DDQN approach for 250 episodes and the MADDPG approach for 2000 episodes. Since we focus on comparing the EE values rather than showing their absolute values, we normalise the EE values with respect to the mean values of the proposed MAD-DDQN approach. From Figure 3a, we observe that the MAD-DDQN approach consistently outperforms the random policy and MADDPG approaches across the entire range of UAV deployments by approximately 80% and 55%, respectively. Interestingly, the MADDPG approach performs marginally better than the MAD-DDQN approach in minimising the outages experienced by ground users, by about 2%, as shown in Figure 3b. However, this slight performance gain comes at a huge computational training cost, which is 8 times higher than that of the MAD-DDQN approach. Intuitively, the MAD-DDQN approach hides redundant information about the environment through discretisation of the agent's action space, which makes the MAD-DDQN approach require less experience to successfully learn a policy than the MADDPG approach. On the other hand, the random policy performed worst among the approaches in reducing connection outages, emphasising the relevance of strategic decision making in MARL problems. Figure 3c clearly shows that the proposed approach significantly reduces the total energy consumed by all UAVs compared to the baselines. Although the MADDPG approach performs slightly better at reducing outages than our approach, our MAD-DDQN approach is significantly more energy efficient, implying that the MADDPG approach trades energy consumption for improved coverage of ground users.
In Figure 4, we plot the EE versus the learning episodes while varying the number of agents to demonstrate the convergence behaviour of the MAD-DDQN approach. We observe a steady decrease in the converged values of the EE as the number of UAVs increases, because the system becomes more unstable with more UAVs, thereby decreasing the system throughput as interference increases. Overall, the cooperative MAD-DDQN approach shows convergence in the system's EE irrespective of the number of UAVs deployed in the network.

V. CONCLUSION
In this letter, we propose a MAD-DDQN approach to optimise the EE of a fleet of UAVs serving static and mobile ground users in an interference-limited environment. The MAD-DDQN approach guarantees quick adaptability and convergence, thereby allowing agents to learn policies that maximise the total system's EE by jointly optimising each UAV's 3D trajectory, number of connected users, and the energy consumed by the UAVs serving ground users under a strict energy budget. Extensive simulation results have demonstrated that the MAD-DDQN approach significantly outperforms the random policy and a state-of-the-art decentralised MARL solution in terms of EE without degrading coverage performance in the network.
This work was supported, in part, by the Science Foundation Ireland (SFI) Grants No. 16/SP/3804 (Enable) and 13/RC/2077 P2 (CONNECT Phase 2), as well as a research grant from SFI and the National Natural Science Foundation of China (NSFC) under the SFI-NSFC Partnership Programme Grant Number 17/NSFC/5224.

Figure 1. System model for UAVs serving static and mobile ground users.

Figure 2. Multi-agent decentralised double deep Q-network framework where each UAV j, equipped with a DDQN agent, interacts with its environment. The environment shows a simulation snapshot of UAVs providing wireless coverage to 200 static (blue) and 200 mobile (red) ground users, with flight trajectories. The left side shows the broadcast range of UAV j in a multi-UAV scenario, where UAVs broadcast their telemetry information to their nearest neighbours.

$\omega = \frac{e_j^{t-1} - e_j^t}{e_j^t + e_j^{t-1}}$
Figure 3. Impact of the number of deployed UAVs on the UAVs' EE, ground users' outage and total energy consumption under dynamic network conditions with 400 ground users deployed in a 1 km² area, with results from 2000 runs of MC trials. (a) Energy efficiency η vs. number of UAVs. (b) Ground users' outage vs. number of UAVs. (c) Total energy consumed vs. number of UAVs.

Figure 4. Energy efficiency η vs. learning episodes showing the convergence of MAD-DDQN while varying the number of agents.
3: s ← initial state, maxStep ← maximum number of steps in the episode
4: while goal not reached and agent alive and maxStep not reached do