Trajectory Design and Resource Allocation for Multi-UAV Networks: Deep Reinforcement Learning Approaches

The future mobile communication system is expected to provide ubiquitous connectivity and unprecedented services over billions of devices. The unmanned aerial vehicle (UAV), prominent for its flexibility and low cost, emerges as a significant network entity for realizing such ambitious targets. In this work, novel machine learning-based trajectory design and resource allocation schemes are presented for a multi-UAV communication system. In the considered system, the UAVs act as aerial Base Stations (BSs) and provide ubiquitous coverage. In particular, with the objective of maximizing the system utility over all served users, a joint user association, power allocation and trajectory design problem is formulated. To overcome the high dimensionality of the state space, we first propose a machine learning-based strategic resource allocation algorithm, which combines reinforcement learning and deep learning, to design the optimal policy of all the UAVs. Then, we also present a multi-agent deep reinforcement learning scheme for distributed implementation that requires no a priori knowledge of the dynamic nature of networks. Extensive simulation studies are conducted to evaluate the advantages of the proposed schemes.


I. INTRODUCTION

A. Background and Motivation
The increasing demand for high-quality wireless services urges the future wireless communication system to provide ubiquitous connectivity and coverage over all kinds of mobile devices. The diversity of network applications also poses strict requirements on network capacity, service latency and energy consumption for trillions of mobile devices. To realize the vision of essentially unlimited access to wireless data anywhere and anytime for anything, the recently emerging unmanned aerial vehicle (UAV)-based flying platforms are able to break the limitations of traditional network infrastructure [1], which urges us to rethink the development of next generation communication systems. The UAV, also known as a drone, has attracted much attention due to its prominent flexibility and its easy, low-cost deployment [2]. Because of its high flying altitude, a UAV-based platform can establish effective Line-of-Sight (LoS) links with the ground users (GUs), thus reducing the energy consumption needed for reliable connectivity [3]. Therefore, a UAV-based flying mobile communication system provides a cost- and energy-efficient solution for the GUs where terrestrial cellular infrastructure is limited.
Developing a UAV-enabled wireless communication system has attracted a large amount of research interest. To date, the majority of works have been dedicated to single-UAV two-dimensional (2-D) or three-dimensional (3-D) deployment/placement optimization problems, under the assumption that the UAV can serve as an aerial quasi-static base station (BS) or relay. Although adding a single UAV to the cellular network has shown its potential for performance enhancement, a single UAV in general has limited communication, caching and computing capability, which is not preferred for mission-critical services and a large number of GUs. Correspondingly, the deployment of a swarm of UAVs is motivated. In a multi-UAV communication system, multiple UAVs may cooperatively serve the GUs in a large area. Moreover, different GUs can be served simultaneously with lower latency and higher throughput, which addresses some of the throughput- and latency-related problems of a single-UAV system.
On the other hand, current works on multi-UAV networks usually focus on trajectory design and resource allocation in a static manner, considering that the UAVs can act as BSs. In order to provide long-term effective connectivity and reliable coverage, a UAV-based network with high mobility needs to be carefully designed: different UAVs should autonomously work as a team and their interactions should be explored. Therefore, establishing an efficient, smart and autonomous multi-UAV network emerges as a research topic of profound importance, yet it remains under-investigated. Addressing such a topic is typically challenging. First, due to the high cost and limited communication capability of UAVs, the mobility/routes of different UAVs should be designed and coordinated with high accuracy to cover a large area over a long run. Moreover, fairness is also critical for the UAV network, as the UAVs should move around to ensure communication coverage. In addition, energy consumption issues should be seriously considered, as the UAVs typically have a limited energy supply and should be recharged from time to time. Last but not least, the UAVs are usually deployed to execute mission-critical services where network access is limited, so a certain degree of autonomy or self-organization is highly preferred.
To address the aforementioned problems and develop a smart and autonomous multi-UAV communication system, we propose to leverage the deep reinforcement learning (DRL) framework, which has recently demonstrated potential for improving the performance of wireless networks. Because RL enables UAVs to choose their policies for optimizing the objectives without a priori knowledge of the environment, it is suitable for addressing trajectory control and resource allocation in multi-UAV wireless networks. Specifically, we consider that all the UAVs share the same spectrum to serve the GUs. Focusing on the downlink of the network, i.e., transmissions from the UAVs to the GUs, the objective of this work is to maximize the system utility among all the GUs by jointly optimizing the power allocation, user association and UAV trajectories over a given finite period. Addressing the formulated joint optimization is challenging because the transmit power allocation, user association and UAV trajectory design are coupled. Correspondingly, DRL is able to provide a promising solution to the formulated problem because it can cope with the high dimensionality of the state-action space and handle the time-varying environment [4]. DRL applies Deep Neural Networks (DNNs) to the decision-making process, which can offer significant performance improvements in many learning problems with limited or even zero knowledge. Moreover, developing decentralized approaches is becoming more necessary than ever due to the complexity of multi-UAV wireless networks. Though they can be very challenging to design, decentralized approaches scale well, as they typically incur little to no communication and computational overhead while still performing relatively well. Thus, we also consider the decentralized nature of the multi-UAV system and propose to utilize multi-agent DRL to design a distributed algorithm [5], which paves the way towards an autonomous UAV communication system.

B. Related Works
Research on UAV-based wireless communication systems has mainly concentrated on UAV placement and resource optimization [3]-[19], under the assumption that the UAV can serve as an aerial BS or an aerial relay to support the GUs. For the trajectory design, the altitude of the UAV can be optimized with or without the horizontal location, based on different considerations and QoS requirements. In [3], the authors aim to maximize the communication coverage by optimizing the altitude in a single-UAV wireless network. The authors of [6] utilize a stochastic geometry-based approach to analyze a two-tier wireless network consisting of BSs and an aerial BS. General probabilistic LoS and NLoS propagation models are assumed, and the coverage probability and spectral efficiency are derived with consideration of the height of the aerial BS. In [7], the authors jointly optimize the altitude of the UAVs, the duration of the transmission phases and the antenna configuration to maximize the coverage, under the assumption of a UAV and a ground BS with distributed access points and multiple antennas.
In contrast, several papers work on the two-dimensional (2-D) trajectory design (e.g., the horizontal positions) of the UAV while fixing its altitude. To address the long-term control of a group of UAVs, the authors of [8] utilize deep reinforcement learning to minimize the energy consumption of the overall network while maintaining reliable connectivity. In [10], the authors consider a UAV that flies to a given location for a certain mission and needs reliable communication with the BSs in each time slot. The aim is to minimize the completion time of the UAV via 2-D trajectory optimization, subject to the connectivity constraint of the BS-UAV link. The authors of [11] investigate the cooperation of a group of UAVs and propose mode selection between UAV-to-infrastructure and UAV-to-UAV modes for data delivery. Resource allocation and speed optimization are then proposed to maximize the uplink data rate. In [12], the authors investigate UAV-based secure communication. A two-UAV system is considered, where one UAV is for data transmission and the other jams the eavesdroppers on the ground. The minimum worst-case secrecy data rate of the GUs is optimized by designing the UAVs' trajectories and the user scheduling.
As for the 3-D trajectory design, in [13], both periodic and temporal operation modes are considered for the UAV system. In each case, the aim is to minimize the duration of the UAV flight or the mission completion time. In [14], the authors propose to maximize the minimum throughput of all the GUs in order to achieve fair performance. Route design, power allocation and user scheduling schemes are presented. The authors of [15] consider a UAV providing services for a group of GUs in a dynamic channel scenario, and propose a transmit power allocation and 3-D trajectory design optimization scheme to maximize the minimum throughput of the group within a given time duration. In [16], a drone-based small cell placement problem is explored to maximize the overall system utility. In [17] and [18], by considering the joint optimization of the mobility and location of the UAVs, transmit power allocation and user association schemes are presented to improve the reliability of the uplink. The authors of [19] investigate the trajectory design and resource allocation problem for maximizing the throughput of a solar-powered UAV system over a given time period.
In general, (deep) multi-agent reinforcement learning has been explored to address control-related problems [20]-[23]. There are increasing efforts to investigate the potential of multi-agent reinforcement learning (MARL) for resource allocation in wireless communication systems. The authors of [24] utilize MARL to address the power allocation problem in D2D communications, while a MARL-based approach is applied to address computation offloading and interference coordination in [27]. The authors explore MARL for improving the secrecy performance of wireless networks in [28]. In addition, the spectrum access problems in different types of wireless networks are addressed via MARL in [29] and [30]. Recently, MARL-based schemes have gradually been applied to UAV networks [31], [32]. The authors of [31] utilize MARL to present a distributed trajectory design for a multi-UAV network. In [32], a MARL-based scheme is also applied for trajectory design in a UAV-assisted edge computing system.
As one can observe, there is a lack of works utilizing learning-based schemes for the joint optimization of trajectory design, power allocation and user association to effectively and efficiently operate a multi-UAV network. Moreover, studies towards an autonomous multi-UAV communication system are sparse, although such a system is of profound importance for fully utilizing UAVs in the development of wireless communication systems.

C. Contribution
In this work, our main target is to utilize collaborative machine learning, i.e., a DRL-based scheme and a multi-agent DRL-based scheme, to tackle the problem of power allocation, user association and trajectory design for a multi-UAV communication system. Bearing in mind the aforementioned works, the main contributions of this paper are summarized in the following.
A multi-UAV communication system is considered to serve multiple GUs. A central base controller is assumed to carry out the learning process. With the objective of maximizing the system utility, the problem of trajectory design, user association and power allocation is investigated. To address the problems related to the high dimensionality of the state space, we first propose a machine learning-based strategic resource allocation algorithm, which combines reinforcement learning and deep learning, to explore the optimal policy of all the UAVs. The proposed centralized DRL process can be carried out at the central base, and the UAVs are controlled via signaling exchange with the base. Because the UAV-based network is expected to solve mission-critical problems in reality, an autonomous communication system is preferred. Thus, we further consider a more complex scenario and propose to decentralize the considered multi-UAV system. In this setting, no UAV can observe the underlying Markov state. Instead, each UAV only obtains a private observation correlated with that state. The UAVs are able to utilize a dedicated limited-bandwidth channel to communicate with each other; they are fully cooperative and share the goal of maximizing the system utility. However, due to the partial observability and the limitation of the communication channel, the UAVs have to find a communication protocol that is able to coordinate their behavior and policies.
Consequently, we propose to utilize centralized learning with decentralized execution. A deep multi-agent reinforcement learning scheme is proposed, where the UAVs are considered as the agents. In the proposed scheme, learning is performed via a centralized algorithm, while during execution, the UAVs communicate through the dedicated limited-bandwidth channel and learn the communication protocol.

D. Organization
The remainder of this paper is organized as follows. In Section II, the system model is depicted. Section III presents the problem formulation. We propose the centralized DRL-based resource allocation and trajectory design algorithm in Section IV, and the multi-agent DRL-based solution in Section V. In Section VI, we conduct the performance evaluation through simulation studies. Section VII concludes this work.

II. SYSTEM MODEL

A. System Model
The system model is depicted in Fig. 1. There are $M > 1$ UAVs sharing the same frequency spectrum and serving a group of $U > 1$ GUs. The UAV swarm and the GU set are denoted as $\mathcal{M}$ and $\mathcal{U}$, respectively, so that $|\mathcal{M}| = M$ and $|\mathcal{U}| = U$. All the UAVs provide services to the users in consecutive time slots. We denote the time slot as $t$, with $t \in \{1, 2, \ldots, T\}$, and the overall period is denoted as $T$. In this work, we consider a 3-D Cartesian coordinate system, where the fixed location of each GU $u$ is denoted by its horizontal coordinates $\mathbf{w}_u = [x_u, y_u]^T \in \mathbb{R}^{2\times 1}$, $u \in \mathcal{U}$. All UAVs are assumed to fly at a fixed altitude $H$ above the ground, and the coordinate of UAV $m$ at time $t$ is denoted by $\mathbf{c}_m(t) = [x_m(t), y_m(t)]^T \in \mathbb{R}^{2\times 1}$. We consider that there is a base controller carrying out the proposed learning process, which can be a satellite or a BS. In addition, the UAVs are able to communicate within the swarm.
We consider that all the UAVs fly back to the base at the end of the period, so the trajectories should satisfy the constraint
$\mathbf{c}_m(T) = \mathbf{c}_m(1), \quad \forall m \in \mathcal{M}. \quad (1)$
In addition, the trajectories of the UAVs are subject to certain speed and distance constraints, namely
$\|\mathbf{c}_m(t+1) - \mathbf{c}_m(t)\| \le V_{\max}, \quad (2)$
$\|\mathbf{c}_m(t) - \mathbf{c}_j(t)\| \ge S_{\min}, \quad \forall j \ne m, \quad (3)$
where $V_{\max}$ is the maximum speed of the UAV (per unit-duration time slot) and $S_{\min}$ is the minimum inter-UAV distance to avoid excessive interference and collisions. Accordingly, the distance between UAV $m$ and user $u$ in time slot $t$ is given as
$d_{m,u}(t) = \sqrt{H^2 + \|\mathbf{c}_m(t) - \mathbf{w}_u\|^2}. \quad (4)$
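To make the constraints concrete, the following minimal NumPy sketch checks (1)-(3) for a candidate set of trajectories. It is an illustrative fragment rather than part of the system model: the function name, the array layout and the unit-duration time slots are assumptions.

```python
import numpy as np

def trajectory_feasible(c, v_max, s_min):
    """Check constraints (1)-(3) for UAV trajectories.

    c     : array of shape (M, T, 2), horizontal UAV positions per slot.
    v_max : maximum per-slot displacement (speed constraint, unit slot length).
    s_min : minimum inter-UAV separation.
    """
    M, _, _ = c.shape
    # (1): each UAV returns to its starting point (the base) at slot T.
    if not np.allclose(c[:, -1, :], c[:, 0, :]):
        return False
    # (2): displacement between consecutive slots bounded by the maximum speed.
    if np.any(np.linalg.norm(c[:, 1:] - c[:, :-1], axis=-1) > v_max):
        return False
    # (3): pairwise inter-UAV distance at every slot at least s_min.
    for m in range(M):
        for j in range(m + 1, M):
            if np.any(np.linalg.norm(c[m] - c[j], axis=-1) < s_min):
                return False
    return True
```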

B. Path Loss Model
As a flexible flying platform, the UAV is able to establish LoS links with the GUs. However, because changes in the practical environment (rural, suburban, urban, etc.) are usually unpredictable, the randomness associated with LoS and Non-LoS (NLoS) propagation at a given time should be taken into account when designing the UAV system. Accordingly, it is practical to consider that a GU connects to a UAV via a LoS link with a certain probability, which we refer to as the LoS probability. The LoS probability depends on the environment and the positions of the UAV and the GU. One commonly used expression is
$P^{\mathrm{LoS}}_{m,u}(t) = \frac{1}{1 + \psi_1 \exp\left(-\psi_2\left[\theta_{m,u}(t) - \psi_1\right]\right)},$
where $\psi_1$ and $\psi_2$ are constants whose values depend on the carrier frequency and the environment, and $\theta_{m,u}(t)$ is the elevation angle, given by $\theta_{m,u}(t) = \frac{180}{\pi} \arcsin\left(H / d_{m,u}(t)\right)$. The LoS and NLoS path losses between UAV $m$ and user $u$ are given as
$L^{\mathrm{LoS}}_{m,u}(t) = \eta_1 \left(\frac{4\pi f_c d_{m,u}(t)}{c}\right)^{\alpha}, \qquad L^{\mathrm{NLoS}}_{m,u}(t) = \eta_2 \left(\frac{4\pi f_c d_{m,u}(t)}{c}\right)^{\alpha},$
where $\eta_1$ and $\eta_2$ are the excessive path loss coefficients of the LoS and NLoS links, respectively, $f_c$ is the carrier frequency, $\alpha$ is the path loss exponent, and $c$ is the speed of light. Given the locations of the UAVs and GUs, it is difficult to determine whether the LoS or NLoS path loss model should be used in the considered UAV system. Thus, we consider an average over both the LoS and NLoS links, i.e.,
$L_{m,u}(t) = P^{\mathrm{LoS}}_{m,u}(t)\, \eta_1 \left(\frac{4\pi f_c d_{m,u}(t)}{c}\right)^{\alpha} + \left(1 - P^{\mathrm{LoS}}_{m,u}(t)\right) \eta_2 \left(\frac{4\pi f_c d_{m,u}(t)}{c}\right)^{\alpha}.$
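A compact sketch of the averaged path loss computation is given below; it assumes the common sigmoid parameterization of the LoS probability, and the constants `psi1`, `psi2`, `eta1`, `eta2` and `alpha` are environment-dependent placeholders rather than values taken from this paper.

```python
import numpy as np

def avg_path_loss(d, H, fc, psi1, psi2, eta1, eta2, alpha, c=3e8):
    """Average over the LoS and NLoS path losses for distance d and altitude H."""
    theta = (180.0 / np.pi) * np.arcsin(H / d)                   # elevation angle (degrees)
    p_los = 1.0 / (1.0 + psi1 * np.exp(-psi2 * (theta - psi1)))  # LoS probability
    fspl = (4.0 * np.pi * fc * d / c) ** alpha                   # distance-dependent term
    return p_los * eta1 * fspl + (1.0 - p_los) * eta2 * fspl
```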

C. Transmission Model
To express the user association between UAVs and GUs, a binary variable $b_{m,u}(t)$ is defined as the user association indicator: $b_{m,u}(t) = 1$ if UAV $m$ serves GU $u$ in time slot $t$, and $b_{m,u}(t) = 0$ otherwise. In this work, we assume that one GU can receive from at most one UAV in a given time slot, i.e.,
$\sum_{m=1}^{M} b_{m,u}(t) \le 1$. In addition, the transmit power of UAV $m$ for GU $u$ is denoted as $p_{m,u}(t)$, and the channel gain between UAV $m$ and user $u$ is denoted as $h_{m,u}(t)$. Then the data rate of GU $u$ is expressed as
$R_u(t) = \sum_{m=1}^{M} b_{m,u}(t) \log_2\left(1 + \gamma_{m,u}(t)\right). \quad (10)$
In (10), because multiple UAVs can cause interference to GU $u$, $\gamma_{m,u}(t)$ is modeled as the Signal to Interference and Noise Ratio (SINR) of the link between $m$ and $u$, which is
$\gamma_{m,u}(t) = \frac{p_{m,u}(t)\, h_{m,u}(t)\, L^{-1}_{m,u}(t)}{\sum_{j=1, j\ne m}^{M} p_{j,u}(t)\, h_{j,u}(t)\, L^{-1}_{j,u}(t) + \sigma^2}, \quad (11)$
where $\sigma^2$ is the noise variance. Note that the trajectories of the UAVs, the transmit power and the channel state are essentially continuous. After partitioning and quantizing their values into different levels within their ranges, the values of these variables in each time slot $t$ can be understood as discrete counterparts.
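The per-slot SINR in (11) and the resulting rates in (10) can be sketched as follows, assuming unit bandwidth (i.e., spectral efficiency); the matrix layout is an illustrative choice.

```python
import numpy as np

def user_rates(p, h, L, b, sigma2):
    """Per-GU rates from (10)-(11); p, h, L, b are (M, U) arrays for one slot.

    p: transmit powers, h: channel gains, L: average path losses,
    b: binary UAV-GU association matrix, sigma2: noise variance.
    """
    g = p * h / L                            # received power from each UAV at each GU
    total = g.sum(axis=0, keepdims=True)     # total received power per GU
    sinr = g / (total - g + sigma2)          # interference = total minus own signal
    return (b * np.log2(1.0 + sinr)).sum(axis=0)   # rate of each GU (at most one UAV)
```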

III. PROBLEM FORMULATION

A. Utility Function
As multiple UAVs share the same frequency resources, the transmit power of one UAV may bring additional interference to the users served by other UAVs, as shown in (11). Moreover, the association factor and the trajectory also affect the data rates of the users, as can be observed from (4) and (10). Therefore, in the following, we jointly optimize these three parameters. We define $\mathcal{B} = \{b_{m,u}(t), \forall m, u, t\}$ as the association policy between UAVs and GUs, $\mathcal{C} = \{\mathbf{c}_m(t), \forall m, t\}$ as the trajectories of the UAVs, which essentially determine the path loss, and $\mathcal{P} = \{p_{m,u}(t), \forall m, u, t\}$ as the transmit power allocation. Based on this analysis, we can define the utility function $\Phi_{sys}(\mathcal{B}, \mathcal{C}, \mathcal{P})$ of the overall multi-UAV system.

B. Problem Formulation
In order to maximize the system utility, in this work, we jointly optimize the transmit power allocation $\mathcal{P}$, the trajectory design $\mathcal{C}$, and the user association $\mathcal{B}$. With the above analysis, the formulated problem P1 can be expressed as
$\mathrm{P1}: \max_{\mathcal{B},\, \mathcal{C},\, \mathcal{P}} \; \Phi_{sys}(\mathcal{B}, \mathcal{C}, \mathcal{P}), \quad \text{s.t. C1-C5},$
where C1 and C2 are the user association constraints, which ensure that one GU can only be served by one UAV in a time slot. The maximum transmit power constraint is given in C3, which means that the transmit power of each UAV should be smaller than its maximum power. C4 and C5 ensure the minimum data rate requirement of each GU.
P1 is a non-convex combinatorial integer programming problem, and it is NP-hard. In general, a brute-force scheme can be employed to find the optimal solution at a high computational cost, which is infeasible for a large-scale system. In addition, the optimization problem requires complete information about the future in order to achieve the optimal solution for the next time slot, which means that the absence of prior information may degrade the achievable performance. Therefore, we utilize RL-based algorithms to achieve a near-optimal solution without the aforementioned prior knowledge.

IV. CENTRALIZED DEEP REINFORCEMENT LEARNING-BASED SOLUTION
In this section, we utilize a DRL-based algorithm to address the formulated problem. We first introduce the basics of DRL, including the specific state, action and reward definitions. Then, single-agent DRL is utilized, where the base controller acts as the agent and controls the behaviors of the UAVs; we refer to this scheme as centralized DRL (CDRL).

A. RL Framework Formulation
The RL problem comprises a single agent or multiple agents and an environment. The agent takes actions based on a chosen policy to interact with the environment. Briefly, there are three elements in the RL framework: action $a$, state $s$ and reward $r$. In our considered system, the agent can be the central base controller or the UAV itself, and the environment consists of all the GUs. The agent chooses an action $a_t$ from the action space at time slot $t$, which decides the trajectory and the resource allocation. After applying an action, the agent receives a reward or punishment from the environment. The scheme aims at maximizing the cumulative reward received through these interactions.

B. State, Action and Reward
We define the state space, the action space and the reward of the DRL-based framework at time slot $t$ of the considered system as follows. For the considered DRL framework, the decisions are carried out at the central base controller.
1) State: For the centralized scheme, the central base should know all information about the UAVs, e.g., the association state, the transmit power and the trajectory state. We define the state at time slot $t$ to consist of the data rate $R_t$ and the battery level $E_t$; the battery level determines the available transmit power, while $R_t$ captures both the channel state (essentially the locations of the UAVs) and the UAV-GU association. Then the state at time slot $t$ is $s_t = \{R_t, E_t\}$.
2) Action: In the considered system, the action consists of multiple parts, i.e., the user association strategy $\mathcal{B}$, the power allocation factors $\mathcal{P}$ and the trajectory design $\mathcal{C}$. The action space $\mathcal{A}$ is the combination of all possible values of these factors.
3) Reward: After executing the chosen action, the agent obtains a reward for the current state in each time slot. As shown in (17), to enforce the agent to take proper actions, a careful definition of the reward is compulsory. In general, the reward should be related to the objective function. According to the formulated problem P1, the objective is to maximize the overall system utility while satisfying the QoS of each GU. In order to transform the objective function into a reward, we consider the following points.
The main objective of P1 is to maximize the overall system utility. As the target of the RL is reward maximization, the defined reward needs to be positively related to the objective function.
To meet the QoS requirements of the GUs, the loss of GU throughput relative to the required QoS should decrease the reward. Accordingly, the immediate reward is defined as
$r_t = \varphi_a \Phi_{sys}(t) - \varphi_b \sum_{u \in \mathcal{U}} \left[R^{\min}_u - R_u(t)\right]^{+}, \quad (17)$
where $\varphi_a$ and $\varphi_b$ are the weights of the two parts, $R^{\min}_u$ denotes the minimum data rate requirement of GU $u$, and $[x]^{+} = \max(x, 0)$.
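A direct sketch of the reward in (17) is given below: the system utility weighted against the QoS shortfall. The function name and signature are illustrative.

```python
import numpy as np

def immediate_reward(utility, rates, r_min, phi_a, phi_b):
    """Reward (17): weighted utility minus the GUs' throughput loss below QoS."""
    shortfall = np.maximum(r_min - rates, 0.0).sum()   # [R_min - R_u(t)]^+ summed over GUs
    return phi_a * utility - phi_b * shortfall
```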

C. Q-Learning Method
Q-learning is one of the classical RL schemes, and it records a Q-value for each state-action pair. In the considered system, the base controller first observes the state $s_t \in \mathcal{S}$ and selects an action $a_t \in \mathcal{A}$ at each time slot $t$ according to a stochastic policy $\pi$. Then the base controller transmits control signals to the UAVs, obtains the reward $r(s_t, a_t)$ and transitions to the next state $s_{t+1}$. Q-learning maintains a value function $Q(s_t, a_t)$, which is the expected cumulative future discounted reward when action $a_t$ is chosen at state $s_t$. Thus, each state-action pair has a value $Q(s_t, a_t)$ for time slot $t$. For each time slot, the base controller calculates $Q(s_t, a_t)$, which is considered as a long-term reward, and stores it in a Q-table. $Q(s_t, a_t)$ is expressed as
$Q(s_t, a_t) = \mathbb{E}\left[\tilde{r}_t \mid s_t, a_t\right], \quad \text{where } \tilde{r}_t = \sum_{\tau=t}^{T} \gamma^{\tau - t}\, r(s_\tau, a_\tau).$
We define $\gamma$ as the discount parameter, with $0 \le \gamma \le 1$. Note that if $\gamma$ tends to 0, the base controller mainly takes the immediate reward into consideration, and if $\gamma$ tends to 1, the base controller focuses on the future. In each step, the value of $Q(s_t, a_t)$ is updated iteratively. When the optimal policy $\pi^*(s_t) = \arg\max_{a_t} Q(s_t, a_t)$ that maps states to actions is reached, the optimal action-value function $Q^*(s_t, a_t)$ is achieved. It obeys the Bellman optimality equation:
$Q^*(s_t, a_t) = \mathbb{E}\left[r(s_t, a_t) + \gamma \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1}) \mid s_t, a_t\right], \quad (18)$
where $a_{t+1}$ and $s_{t+1}$ are the action and state of the next time slot, respectively. With a defined learning rate $\lambda$, which can be time-varying, the procedure of the Q-learning scheme is shown in Algorithm 1.

Algorithm 1: Q-learning method.
1: Initialize $Q(s, a)$
2: for each episode do
3:   Initialize the state $s$ of each UAV randomly
4:   for each time slot $t$ do
5:     Choose an action $a_t$ from all actions available in state $s_t$
6:     Execute the chosen $a_t$, observe the reward $r(s_t, a_t)$ and the next state $s_{t+1}$
7:     Update $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \lambda \left[r(s_t, a_t) + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\right]$
8:     Let $s_t \leftarrow s_{t+1}$
9:   end for
10: end for
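The update in step 7 of Algorithm 1 corresponds to the following tabular sketch, assuming the state and action spaces have already been quantized as described in Section II; `env_step` is a placeholder for one interaction with the environment.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 1000, 16      # sizes after quantizing states and actions
Q = np.zeros((n_states, n_actions))
lam, gamma, eps = 0.1, 0.9, 0.1     # learning rate, discount factor, exploration rate

def q_learning_step(s, env_step):
    """One epsilon-greedy Q-learning iteration; env_step(s, a) -> (r, s_next)."""
    a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
    r, s_next = env_step(s, a)
    # Temporal-difference update toward the Bellman target (18).
    Q[s, a] += lam * (r + gamma * Q[s_next].max() - Q[s, a])
    return s_next
```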

D. Proposed Centralized DRL Solution
Although simply applying Q-learning may yield a solution to P1, it is not ideal. When using Q-learning, we need to obtain and store the corresponding Q-value for each state-action pair in a Q-table, as presented above. However, in the considered UAV system, due to its mobility, there is a very high possibility that thousands of states exist. If all the values are stored, the Q-table matrix becomes very large, and it can then be difficult to obtain enough samples to traverse each state, which results in the failure of the algorithm. Therefore, instead of calculating the Q-value for each pair, a DNN is used here to estimate $Q(s, a)$, which is the main idea of the Deep Q-Network (DQN).
The DQN uses a neural network (NN) $Q(s, a; \theta)$ to represent the Q-function, where $\theta$ denotes the weights of the NN. By updating $\theta$ at each iteration, the Q-network is trained to approximate the real Q-values. When applied to Q-learning, an NN improves flexibility at the cost of stability [4]. In this context, the DQN has proven to be a more robust learning scheme, with three major improvements compared with Q-learning.
The first is that the DNN has multiple layers. The hierarchical layers of convolution filters in the DNN can be used to exploit local spatial correlations, by which high-level features of the input data are extracted. The second is experience replay, which stores the experience tuple $e(t) = (s_t, a_t, r_t, s_{t+1})$ at time slot $t$ into a replay memory $\mathcal{O}$. The replay randomly samples mini-batches $\hat{\mathcal{O}}$ from the memory to train the DNN. Such a process enables the DQN to learn from diverse past experiences rather than only from the current one. In addition, while one network is used for estimating the Q-values, the target Q-values that compute the loss of each action in the training process are generated by a second network. Such a procedure makes the DQN stable.
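A minimal replay memory along the lines of $\mathcal{O}$ could be sketched as below; the class name and the capacity handling are assumptions.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size memory O storing experience tuples e(t) = (s, a, r, s_next)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are dropped first

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between experiences.
        return random.sample(self.buffer, batch_size)
```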
As presented, the DQN uses an NN with parameters $\theta$ to represent $Q(s_t, a_t)$ in each iteration. $\theta$ and the policy $\pi$ are updated according to the mini-batch $\hat{\mathcal{O}}$, which is taken from the experience memory $\mathcal{O}$ to train the DQN in an online manner. The DQN is optimized by minimizing
$L(\theta) = \mathbb{E}\left[\left(y_t - Q(s_t, a_t; \theta)\right)^2\right], \quad (19)$
where $y_t$ is the target Q-value, given as
$y_t = r_t + \gamma \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}; \theta^{-}). \quad (20)$
In (20), $\theta^{-}$ denotes the target network parameters, which are frozen for some iterations while the online network $Q(s, a; \theta)$ is updated by gradient descent. Specifically, the base controller chooses $a_t$ at time slot $t$ according to (18), obtains the reward $r_t$ and goes to the next state $s_{t+1}$. Accordingly, the base controller maintains an experience replay memory $\mathcal{O}$ to store the vectors $(s_t, a_t, r_t, s_{t+1})$. We utilize the $\epsilon$-greedy policy to balance exploration and exploitation; that is, we balance reward maximization based on known information with choosing new actions to acquire unknown information. Algorithm 2 presents the process, and the flow is shown in Fig. 2.
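One DQN training step, i.e., a gradient step on (19) with targets from (20), might look as follows in PyTorch; `online` and `target` stand for the two Q-networks, and the batch layout is an assumption.

```python
import torch
import torch.nn as nn

def dqn_update(online, target, optimizer, batch, gamma):
    """One gradient step on the loss (19), with targets computed as in (20)."""
    s, a, r, s_next = batch                               # tensors from a replay mini-batch
    q = online(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s_t, a_t; theta)
    with torch.no_grad():                                 # theta^- is frozen here
        y = r + gamma * target(s_next).max(dim=1).values  # target Q-value (20)
    loss = nn.functional.mse_loss(q, y)                   # mean squared TD error (19)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```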

V. MULTI-AGENT DEEP REINFORCEMENT LEARNING-BASED SOLUTION
The proposed CDRL-based scheme assumes that the UAV base actually performs the learning process and coordinates the actions of the entire UAV swarm. However, on the way towards a smart UAV system, it is expected that the UAVs can be autonomous to a certain level. Thus, in the following, we focus on a setting with centralized learning but distributed execution, towards establishing an autonomous UAV wireless communication system. Before we introduce the proposed scheme, some preliminaries are presented.

A. Preliminary
1) Independent DQN: The single-agent DQN can be extended to multi-agent cooperative settings. In such a setting, the global state $s_t$ can be observed by all the agents. Each agent then chooses an individual action $a^m_t$ and obtains a group reward $r_t$ that is shared among all the agents. A framework combining independent Q-learning with DQN has been proposed, in which each agent $m$ learns its own Q-function $Q^m(s, a^m; \theta^m)$ independently and simultaneously. In [33], it is shown that there may be convergence problems in independent Q-learning, since individual learning may result in a non-stationary environment for the other agents. Nevertheless, it has been successfully applied to practical problems [33].
2) Deep Recurrent Q-Networks (DRQN): Both DQN and independent DQN assume full observability, i.e., the global state $s_t$ is the input. In practice, however, dynamic environments are usually partially observable, i.e., the global state $s_t$ cannot be observed. Instead, each agent only obtains an observation $o_t$ that is correlated with the global state. In [34], DRQN is proposed to address the single-agent, partially observable case. In that work, instead of approximating $Q(s, a)$ with a feed-forward network, $Q(o, a)$ is approximated with a recurrent NN that maintains an internal state and aggregates the individual observations over time. This is done by adding a hidden state $h_{t-1}$ as an input, which results in $Q(o_t, h_{t-1}, a_t; \theta)$.
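A DRQN along the lines of [34] can be sketched as a recurrent module that carries the hidden state $h_{t-1}$ forward; the layer sizes below are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class DRQN(nn.Module):
    """Recurrent Q-network Q(o_t, h_{t-1}, a_t; theta) for partial observability."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)    # maintains the internal state h
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h_prev):
        x = torch.relu(self.encoder(obs))
        h = self.gru(x, h_prev)                  # aggregate observations over time
        return self.head(h), h                   # Q-values and the new hidden state
```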

B. Assumption
In this case, we investigate the formulated problem with the different UAVs as multiple agents, and partial observability is considered. The objective of maximizing the same discounted group reward $r(t)$ is shared among all the UAVs. Although the global state $s_t$ is not observable to the UAVs, each UAV $m$ has its own observation $o^m_t$. In each time slot, each UAV selects an environment action $a^m \in \mathcal{A}$ that has an impact on the environment, and a communication action $\rho^m \in \Sigma$ that is observed by the other UAVs but does not directly affect the environment or the reward. Such settings are of interest because partial observability is the practical case in multi-UAV systems. We concentrate on the case of centralized learning and decentralized execution. That is to say, communication between the UAVs and the base controller is not limited during centralized learning, while during execution the UAVs can communicate only via a dedicated limited-bandwidth signaling channel. Then, during decentralized execution, each UAV uses its own copy of the learned network, evolving its own hidden state, selecting its own actions and communicating with the others only through the communication channel.
Towards a self-organized and autonomous system in a dynamic environment, the UAVs must develop and agree on a communication protocol, as the environment can change quickly and a preconfigured communication protocol may not work effectively.
Intuitively, the dimension of the space of communication protocols is extremely high, since protocols are mappings from histories of observations and actions to sequences of communication signals over the number of UAVs. Therefore, discovering an effective protocol is challenging. In addition, because the UAVs need to coordinate the transmission and decoding of communication messages, exploring this space becomes even more difficult. For example, if a UAV transmits something useful to another UAV, it obtains a positive reward only when the receiving UAV successfully decodes the message and takes action accordingly. If the receiving UAV cannot decode the message correctly, the sending UAV is discouraged from transmitting again. Therefore, positive rewards can be achieved only if both transmitting and decoding are successful, which is difficult to achieve via a random search.

C. Proposed Decentralized Solution
In the following, we propose the reinforced inter-UAV learning scheme, which combines independent Q-learning with DRQN to select the environment and communication actions. Each UAV's Q-network is denoted as $Q^m(o^m_t, \rho^m_{t-1}, h^m_t, a^m)$, which is conditioned on that UAV's individual hidden state and observation. To avoid $|\Sigma||\mathcal{A}|$ outputs, we divide the Q-network into $Q^m_a$ for the environment action and $Q^m_\rho$ for the communication action, respectively. By utilizing the $\epsilon$-greedy policy, the action selector separately picks $a^m(t)$ and $\rho^m(t)$ from $Q_a$ and $Q_\rho$, respectively. Correspondingly, only $|\Sigma| + |\mathcal{A}|$ outputs are required for the network, and the action selection requires maximizing over $\mathcal{A}$ and $\Sigma$, but not over $\Sigma \times \mathcal{A}$. We use a modified DQN to train $Q^m_a$ and $Q^m_\rho$.
The following two essential modifications are made to the DQN to guarantee the performance. First, as multiple UAVs' simultaneous learning can mislead the experience and render it obsolete, the experience replay is disabled to avoid non-stationarity. Second, to take partial observability into consideration, the actions $a$ and $\rho$ of each UAV are fed in as inputs of the next time slot. In Fig. 3, the information flows between the UAVs and the network are presented, together with how the action selector processes the Q-values to find proper actions. As shown, in order to choose the environment action $a^m$ and the communication action $\rho^m$, all Q-values are passed to the action selector. For the selected actions, the gradients (red arrows in the figure) are calculated using DQN and flow only through a single UAV's Q-network. Although the considered setting allows centralized learning, since each UAV is treated independently, the overall process is not a centralized learning procedure. In addition, all the UAVs are treated equally during the proposed decentralized execution process.
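A minimal sketch of such a per-UAV network is given below, with separate output heads for $Q^m_a$ and $Q^m_\rho$ so that only $|\mathcal{A}| + |\Sigma|$ outputs are produced; encoding the last action, the last own message and the peer messages as fixed-size input vectors is a simplifying assumption.

```python
import torch
import torch.nn as nn

class UAVAgentNet(nn.Module):
    """Recurrent net with separate heads Q_a and Q_rho, giving |A| + |Sigma|
    outputs instead of |A| x |Sigma| joint outputs."""
    def __init__(self, obs_dim, n_env_actions, n_msgs, hidden=64):
        super().__init__()
        # Inputs: observation, last environment action, last own message and
        # the messages received from the other UAVs (one-hot / stacked vectors).
        in_dim = obs_dim + n_env_actions + 2 * n_msgs
        self.encoder = nn.Linear(in_dim, hidden)
        self.gru = nn.GRUCell(hidden, hidden)    # evolves the UAV's hidden state
        self.q_env = nn.Linear(hidden, n_env_actions)   # head for Q_a
        self.q_msg = nn.Linear(hidden, n_msgs)          # head for Q_rho

    def forward(self, obs, last_a, last_msg, peer_msgs, h_prev):
        x = torch.cat([obs, last_a, last_msg, peer_msgs], dim=-1)
        h = self.gru(torch.relu(self.encoder(x)), h_prev)
        return self.q_env(h), self.q_msg(h), h
```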
The proposed scheme can be extended to improve the centralized learning by parameter sharing among the UAVs. With this extension, only one network needs to be learned, which is then used by all UAVs. However, because each UAV still has a different observation, the UAVs can still behave differently and thus evolve into different hidden states. Moreover, each UAV obtains its own index as an input, which allows the UAVs to specialize. The DQN is able to ease the learning of a common policy while permitting such specialization. Sharing the parameters among all the UAVs also significantly decreases the number of parameters that need to be learned, which can further speed up learning. By sharing the parameters, the UAVs learn two Q-functions, $Q_a(o^m_t, \rho^{m'}_{t-1}, h^m_{t-1}, a^m_{t-1}, \rho^{m'}_t, m, a^m_t)$ and $Q_\rho(o^m_t, \rho^{m'}_{t-1}, h^m_{t-1}, a^m_{t-1}, \rho^{m'}_t, m, \rho^m_t)$, for $a$ and $\rho$, respectively, where $a^m_{t-1}$ and $\rho^m_{t-1}$ are the last action inputs and $\rho^{m'}_t$ are the messages from the other UAVs. During the execution process, each UAV uses its own copy of the learned network, chooses its own actions, evolves its own hidden state and communicates with the others via the signaling channel.

VI. SIMULATION RESULTS AND DISCUSSIONS
In this section, simulations are conducted to verify the advantages of the proposed single-agent (CDRL) and multi-agent DRL-based (MADRL) resource allocation schemes for multi-UAV networks. The setup of the whole network is mainly based on the parameters in [16], [25]. Some of the key notations for the communication settings can be found in Table I. The initial locations of the UAVs are randomized, and the maximum transmit power of each UAV is the same. Based on this setting, the system utility, the 3-D trajectory design and the UAV-GU association are analyzed.
The 3-D and 2-D snapshots of the UAVs' locations and their associated GUs resulting from the proposed scheme are presented in Figs. 4 and 5. In both figures, 50 GUs are uniformly located and 9 UAVs are deployed to provide services. In Fig. 5, the 2-D locations of the UAVs are marked with numbers. In this case, all GUs are able to connect with the UAVs and receive data from their associated UAVs under the proposed scheme. The 3-D locations/trajectories of the UAVs and the UAV-GU association results are obtained based on the locations of the GUs and their minimum data rate requirements.
In Fig. 6, the optimized trajectories of the UAVs are illustrated. In Fig. 6(a), we plot the trajectories of four UAVs using the proposed MADRL scheme, while in Fig. 6(b), the trajectory of one UAV is obtained using the proposed CDRL scheme. It is observed that in the case of four UAVs, most of the users can be served by the UAVs. However, due to the limited battery capacity, some users still cannot be served. It can also be found that the four UAVs cooperate with each other through the proposed multi-agent learning scheme, and the users are associated with individual UAVs accordingly. As for the case of a single UAV, due to the limited battery capacity, the UAV has to return after serving some of the users. Thus, only some of the users can be associated with the UAV.
In Fig. 7, we present the total utility versus the number of episodes for different numbers of UAVs when considering MADRL. As shown in the figure, the presented scheme converges quickly in all cases. Besides, increasing the number of UAVs leads to an increase in system utility. In Fig. 8, we present the total utility versus the number of episodes for different numbers of UAVs when considering CDRL. We obtain performance similar to that presented in Fig. 7. Nevertheless, for CDRL, when the number of UAVs becomes larger, it takes a bit longer to converge. This may be due to the fact that CDRL needs to collect the relevant information in a centralized manner, which costs more time.
In Figs. 9 and 10, we compare the throughput and utility performance of the traditional Q-learning scheme, the proposed CDRL and the proposed MADRL. As observed from Fig. 9, as the number of UAVs increases, the total throughput of all three schemes becomes larger. This is mainly because an increased number of UAVs results in better service coverage and can provide better data services to the GUs. A similar trend can be observed from Fig. 10 for the utility performance. In addition, both of the proposed schemes outperform the traditional Q-learning scheme, and the centralized scheme obtains the best performance. This is mainly because, when the central controller can obtain all the relevant information, such as the CSI and the positions of the UAVs, it can make more accurate decisions via the deep learning scheme. Nevertheless, MADRL performs close to CDRL, which demonstrates its effectiveness.
We have also compared the proposed CDRL with two commonly used baseline methods, "Benchmark" and "TRRA". The "Benchmark" is a random UAV deployment scheme in which the whole area is equally divided into a number of parts according to the number of UAVs; each UAV then has its own designated area, flies randomly within that area and serves the GUs. The "TRRA" refers to the traditional RRA scheme, where the power allocation follows the water-filling scheme and the association ignores the minimum data requirement. From Fig. 11, it is found that the system utilities of all three schemes increase with the number of UAVs. This is because a larger number of UAVs can ensure that more GUs are served with the required data rate. Moreover, when the number of UAVs is sufficiently large, fewer GUs remain unserved and the increase in system utility slows down. It can also be observed that the proposed scheme obtains the best performance among all three, which shows the importance of adopting DRL and of the developed power allocation and UAV association schemes.

VII. CONCLUSION
In this work, to establish a smart and autonomous multi-UAV wireless communication system, novel DRL-based trajectory design and resource allocation schemes are presented. In the considered system, the UAVs act as aerial Base Stations and provide ubiquitous coverage. Specifically, aiming at maximizing the defined system utility over all served GUs, a joint design of trajectory, user association and power allocation is presented. To address the formulated problem, we first propose a machine learning-based algorithm, which combines reinforcement learning and deep learning, to learn the optimal policy of all the UAVs. Then, we also present a multi-agent deep reinforcement learning scheme for decentralized implementation without a priori knowledge of the network dynamics. Extensive simulation studies are conducted to demonstrate the advantages of the proposed schemes. Future work will improve the multi-UAV system performance via energy efficiency and delay optimization within the proposed framework.

Algorithm 2: DQN-based online method.
Output: The optimal resource allocation policy, i.e., the user association strategy $\mathcal{B}$, the trajectory design $\mathcal{C}$ and the power allocation $\mathcal{P}$
1: Initialize the replay memory $\mathcal{O}$
2: Initialize the parameters $\theta$ of the DNN with random weights
3: for each episode do
4:   Initialize the considered wireless UAV network
5:   Receive the initial observation of the state $s_1$
6:   for each time slot $t$ do
7:     Randomly select an action $a_t$ with probability $\epsilon$; otherwise, select $a_t = \arg\max_a Q(s_t, a; \theta)$
8:     Execute the chosen $a_t$, observe the reward and $s_{t+1}$
9:     Store $(s_t, a_t, r_t, s_{t+1})$ in the replay memory $\mathcal{O}$
10:    Sample a random batch of $Z$ vectors $(s_i, a_i, r_i, s_{i+1})$ from $\mathcal{O}$
11:    Obtain the target Q-value $y_i$ from the target DQN according to (20)
12:    Update $\theta$ by gradient descent on the loss (19)
13:  end for
14: end for

Fig. 4. Locations of UAVs and GUs in a 3-D snapshot.

TABLE I. Key simulation parameters.

Fig. 9. The impact of the number of UAVs on system throughput.

Fig. 10. The impact of the number of UAVs on system utility.

Fig. 11. The impact of the number of UAVs on system utility.