Multi-Agent Reinforcement Learning Based Resource Allocation for UAV Networks

Unmanned aerial vehicles (UAVs) are capable of serving as aerial base stations (BSs) for providing both cost-effective and on-demand wireless communications. This article investigates dynamic resource allocation in multi-UAV enabled communication networks with the goal of maximizing long-term rewards. More particularly, each UAV communicates with a ground user by automatically selecting its communicating user, power level and subchannel without any information exchange among UAVs. To model the uncertainty of environments, we formulate the long-term resource allocation problem as a stochastic game for maximizing the expected rewards, where each UAV becomes a learning agent and each resource allocation solution corresponds to an action taken by the UAVs. Afterwards, we develop a multi-agent reinforcement learning (MARL) framework in which each agent discovers its best strategy according to its local observations using learning. More specifically, we propose an agent-independent method, for which all agents conduct a decision algorithm independently but share a common structure based on Q-learning. Finally, simulation results reveal that: 1) appropriate parameters for exploitation and exploration are capable of enhancing the performance of the proposed MARL based resource allocation algorithm; 2) the proposed MARL algorithm provides acceptable performance compared to the case with complete information exchanges among UAVs. By doing so, it strikes a good tradeoff between performance gains and information exchange overheads.


I. INTRODUCTION
Aerial communication networks, which encourage innovative ways of deploying wireless infrastructure, have recently attracted increasing interest for providing high network capacity and enhancing coverage [1]. Unmanned aerial vehicles (UAVs), also known as remotely piloted aircraft systems (RPAS) or drones, are small pilotless aircraft that are rapidly deployable for complementing terrestrial communications based on the 3rd Generation Partnership Project (3GPP) LTE-A (Long Term Evolution-Advanced) [2]. In contrast to the channel characteristics of terrestrial communications, the channels of UAV-to-ground communications are more likely to be line-of-sight (LoS) links [3], which is beneficial for wireless communications.
In particular, UAV-based aerial platforms for providing wireless services have attracted extensive research and industry efforts on the issues of deployment, navigation and control [4]. Nevertheless, resource allocation, such as the selection of transmit power, serving users and subchannels, is a key communication problem and is also essential to further enhance the energy efficiency and coverage of UAV-enabled communication networks.

A. Prior Works
Compared to terrestrial BSs, UAVs are generally faster to deploy and more flexible to configure. The deployment of UAVs in terms of altitude and distance between UAVs was investigated for UAV-enabled small cells in [5]. In [6], a three-dimensional (3D) deployment algorithm based on circle packing was developed for maximizing the downlink coverage performance. Additionally, a 3D deployment algorithm of a single UAV was developed for maximizing the number of covered users in [7]. Moreover, by fixing the altitudes, a successive UAV placement approach was proposed in [8] to minimize the number of UAVs required while guaranteeing that each ground user is covered by at least one UAV.
Besides the deployment optimization of UAVs, trajectory designs of UAVs for optimizing the communication performance have attracted tremendous attention, such as in [9]-[11]. In [9], the authors considered one UAV as a mobile relay and investigated the throughput maximization problem by optimizing the power allocation and the UAV's trajectory. A trajectory design approach based on successive convex approximation (SCA) techniques was then proposed in [9]. By transforming the continuous trajectory into a set of discrete waypoints, the authors in [10] investigated the UAV's trajectory design for minimizing the mission completion time in a UAV-enabled multicasting system. Additionally, multiple-UAV enabled wireless communication
As discussed above, machine learning is a promising and powerful tool for providing autonomous and effective solutions in an intelligent manner to enhance UAV-enabled communication networks. However, most research contributions focus on the deployment and trajectory designs of UAVs in communication networks, such as [15]-[17]. Though resource allocation schemes such as transmit power and subchannels were considered for UAV-enabled communication networks in [11] and [12], the prior studies focused on time-independent scenarios. That is, the optimization design is independent for each time slot. Moreover, for time-dependent scenarios, [17] and [18] investigated the potential of machine learning based resource allocation algorithms.
However, most of the proposed machine learning algorithms focused on single-UAV scenarios, or on multi-UAV scenarios under the assumption that complete network information is available to each UAV. In practice, it is non-trivial to obtain perfect knowledge of dynamic environments due to the high movement speed of UAVs [19], [20], which imposes formidable challenges on the design of reliable UAV-enabled wireless communications. Besides, most existing research contributions focus on centralized approaches, which makes modeling and computation increasingly challenging as the network size grows. Multi-agent reinforcement learning (MARL) is capable of providing a distributed perspective on intelligent resource management for UAV-enabled communication networks, especially when the UAVs only have individual local information.
The main benefits of MARL are: 1) agents consider their individual application-specific nature and environment; 2) local exchanges between agents can be modeled and investigated; 3) difficulties in modeling and computation can be handled in a distributed manner. The applications of MARL for cognitive radio networks were studied in [21] and [22]. Specifically, in [21], the authors focused on the feasibility of MARL based channel selection algorithms for a specific scenario with two secondary users. A real-time aggregated interference scheme based on MARL was investigated in [22] for wireless regional area networks (WRANs). Moreover, in [23], the authors proposed a MARL based channel and power level selection algorithm for device-to-device (D2D) pairs in heterogeneous cellular networks. The potential of machine learning based user clustering for mmWave-NOMA networks was presented in [24]. Therefore, invoking MARL in UAV-enabled communication networks provides a promising solution for intelligent resource management. To the best of our knowledge, owing to their high mobility and adaptive altitude, multi-UAV networks are not yet well investigated, especially regarding resource allocation from the perspective of MARL. However, it is challenging for MARL based multi-UAV networks to specify a suitable objective and strike an exploration-exploitation tradeoff.
Motivated by the features of MARL and UAVs, this article aims to develop a MARL framework for multi-UAV networks. More specifically, we consider a multi-UAV enabled downlink wireless network, in which multiple UAVs try to communicate with ground users simultaneously. Each UAV flies according to a predefined trajectory. It is assumed that all UAVs communicate with ground users without the assistance of a central controller. Hence, each UAV can only observe its local information. Based on the proposed framework, our major contributions are summarized as follows:
1) We investigate the optimization problem of maximizing long-term rewards of multi-UAV downlink networks by jointly designing user, power level and subchannel selection strategies. Specifically, we formulate a quality of service (QoS) constrained energy efficiency function as the reward function for providing reliable communication. Because of the time-dependent nature and environment uncertainties, the formulated optimization problem is non-trivial. To solve this challenging problem, we propose a learning based dynamic resource allocation algorithm.
2) We propose a novel framework based on stochastic game theory [25] to model the dynamic resource allocation problem of multi-UAV networks, in which each UAV becomes a learning agent and each resource allocation solution corresponds to an action taken by the UAVs. Particularly, in the formulated stochastic game, the actions for each UAV satisfy the properties of a Markov chain [26], that is, the reward of a UAV depends only on the current state and action. Furthermore, this framework can also be applied to model the resource allocation problem for a wide range of dynamic multi-UAV systems.
3) We develop a MARL based resource allocation algorithm for solving the formulated stochastic game of multi-UAV networks. Specifically, each UAV, as an independent learning agent, runs a standard Q-learning algorithm while ignoring the other UAVs, and hence the information exchanges between UAVs and the computational burden on each UAV are substantially reduced. Additionally, we provide a convergence proof of the proposed MARL based resource allocation algorithm.
4) Simulation results are provided to derive parameters for exploitation and exploration in the ε-greedy method over different network setups. Moreover, simulation results also demonstrate that the proposed MARL based resource allocation framework for multi-UAV networks strikes a good tradeoff between performance gains and information exchange overheads.

C. Organization
The rest of this article is organized as follows. In Section II, the system model for downlink multi-UAV networks is presented. The problem of resource allocation is formulated and a stochastic game framework for the considered multi-UAV network is presented in Section III. In Section IV, a Q-learning based MARL algorithm for resource allocation is designed. Simulation results are presented in Section V, which is followed by the conclusions in Section VI.

II. SYSTEM MODEL
Consider a multi-UAV downlink communication network as illustrated in Fig. 1 operating on a discrete-time axis, which consists of M single-antenna UAVs and L single-antenna users, overlap with each other. Moreover, it is assumed that UAVs fly autonomously without human intervention based on pre-programmed flight plans as in [27]. That is, the trajectories of UAVs are predefined by the pre-programmed flight plans. As shown in Fig. 1, three UAVs fly over the considered region along their respective predefined trajectories. This article focuses on the dynamic design of resource allocation for multi-UAV networks in terms of user, power level and subchannel selections. Additionally, we assume that all UAVs communicate without the assistance of a central controller and have no global knowledge of the wireless communication environment. In other words, the channel state information (CSI) between a UAV and its users is known only locally. This assumption is reasonable in practice due to the mobility of UAVs, and is similar to that adopted in [19], [20].

A. UAV-to-Ground Channel Model
In contrast to the propagation of terrestrial communications, the air-to-ground (A2G) channel is highly dependent on the altitude, elevation angle and the type of the propagation environment [2]-[4]. In this article, we investigate the dynamic resource allocation problem for multi-UAV networks under two types of UAV-to-ground channel models:
1) Probabilistic Model: As discussed in [2], [3], UAV-to-ground communication links can be modeled by a probabilistic path loss model, in which the LoS and non-LoS (NLoS) links are considered separately with different probabilities of occurrence. According to [3], at time slot $t$, the probability of having a LoS connection between UAV $m$ and a ground user $l$ is given by
$$P_{\mathrm{LoS}}(t) = \frac{1}{1 + a \exp\left(-b\left(\sin^{-1}\left(\frac{H}{d_{m,l}(t)}\right) - a\right)\right)}, \tag{1}$$
where $a$ and $b$ are constants that depend on the environment, $d_{m,l}(t)$ denotes the distance between UAV $m$ and user $l$, and $H$ denotes the altitude of UAV $m$. Furthermore, the probability of having an NLoS link is $P_{\mathrm{NLoS}}(t) = 1 - P_{\mathrm{LoS}}(t)$.
Accordingly, in time slot $t$, the LoS and NLoS path losses from UAV $m$ to the ground user $l$ can be expressed as
$$PL^{\mathrm{LoS}}_{m,l}(t) = L^{\mathrm{FS}}_{m,l}(t) + \eta_{\mathrm{LoS}}, \qquad PL^{\mathrm{NLoS}}_{m,l}(t) = L^{\mathrm{FS}}_{m,l}(t) + \eta_{\mathrm{NLoS}}, \tag{2}$$
where $L^{\mathrm{FS}}_{m,l}(t)$ denotes the free space path loss with $L^{\mathrm{FS}}_{m,l}(t) = 20\log(d_{m,l}(t)) + 20\log(f) + 20\log\left(\frac{4\pi}{c}\right)$, $f$ is the carrier frequency and $c$ is the speed of light. Furthermore, $\eta_{\mathrm{LoS}}$ and $\eta_{\mathrm{NLoS}}$ are the mean additional losses for LoS and NLoS, respectively. Therefore, at time slot $t$, the average path loss between UAV $m$ and user $l$ can be expressed as
$$L_{m,l}(t) = P_{\mathrm{LoS}}(t)\, PL^{\mathrm{LoS}}_{m,l}(t) + P_{\mathrm{NLoS}}(t)\, PL^{\mathrm{NLoS}}_{m,l}(t). \tag{3}$$
2) LoS Model: As discussed in [9], the LoS model provides a good approximation for practical UAV-to-ground communications. In the LoS model, the path loss between a UAV and a ground user depends on the locations of the UAV and the ground user as well as the type of propagation. Specifically, under the LoS model, the channel gains between the UAVs and the users follow the free space path loss model, which is determined by the distance between the UAV and the user. Therefore, at time slot $t$, the LoS channel power gain from the $m$-th UAV to the $l$-th ground user can be expressed as
$$g_{m,l}(t) = \beta_0 d_{m,l}^{-\alpha}(t) = \frac{\beta_0}{\left(\|v_l - u_m(t)\|^2 + H^2\right)^{\alpha/2}}, \tag{4}$$
where $u_m(t) = (x_m(t), y_m(t))$ denotes the location of UAV $m$ in the horizontal dimension at time slot $t$. Correspondingly, $v_l = (x_l, y_l)$ denotes the location of user $l$. Furthermore, $\beta_0$ denotes the channel power gain at the reference distance of $d_0 = 1$ m, and $\alpha \ge 2$ is the path loss exponent.
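The two channel models above can be illustrated with a short numerical sketch. It assumes the commonly used form of the probabilistic model in which the elevation angle is expressed in degrees, and it takes the default values of `a`, `b`, `eta_los`, `eta_nlos`, `beta0` and `alpha` from the simulation section; the function names are hypothetical and are not part of the original work.

```python
import math

C_LIGHT = 3.0e8  # speed of light in m/s


def p_los(h, d, a=9.61, b=0.16):
    """LoS probability for altitude h and UAV-user distance d (both in m).

    Assumption: the elevation angle asin(h/d) is expressed in degrees.
    """
    theta_deg = math.degrees(math.asin(h / d))
    return 1.0 / (1.0 + a * math.exp(-b * (theta_deg - a)))


def free_space_loss_db(d, f):
    """Free space path loss in dB for distance d (m) and carrier frequency f (Hz)."""
    return (20 * math.log10(d) + 20 * math.log10(f)
            + 20 * math.log10(4 * math.pi / C_LIGHT))


def avg_pathloss_db(h, d, f, eta_los=1.0, eta_nlos=20.0):
    """Average path loss: LoS and NLoS losses weighted by their probabilities."""
    p = p_los(h, d)
    fs = free_space_loss_db(d, f)
    return p * (fs + eta_los) + (1.0 - p) * (fs + eta_nlos)


def los_gain(uav_xy, user_xy, h, beta0=1e-6, alpha=2.0):
    """LoS channel power gain (linear scale); beta0 = 1e-6 corresponds to -60 dB."""
    dist_sq = ((uav_xy[0] - user_xy[0]) ** 2
               + (uav_xy[1] - user_xy[1]) ** 2 + h ** 2)
    return beta0 / dist_sq ** (alpha / 2)
```

For instance, `avg_pathloss_db(100.0, 200.0, 2e9)` lies between the pure LoS and pure NLoS losses at that distance, because it is a convex combination of the two.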

B. Signal Model
In the UAV-to-ground transmission, the interference to each UAV-to-ground user pair is created by other UAVs operating on the same subchannel. Let $c_m^k(t)$ denote the subchannel indicator, where $c_m^k(t) = 1$ if subchannel $k$ is occupied by UAV $m$ at time slot $t$, and $c_m^k(t) = 0$ otherwise. It satisfies
$$\sum_{k \in \mathcal{K}} c_m^k(t) \le 1. \tag{5}$$
That is, each UAV can only occupy a single subchannel in each time slot. Let $a_m^l(t)$ be the user indicator, where $a_m^l(t) = 1$ if user $l$ is served by UAV $m$ in time slot $t$, and $a_m^l(t) = 0$ otherwise. Therefore, the observed signal-to-interference-plus-noise ratio (SINR) for the communication between UAV $m$ and user $l$ over subchannel $k$ at time slot $t$ is given by
$$\gamma_{m,l}^k(t) = \frac{G_{m,l}^k(t)\, a_m^l(t)\, c_m^k(t)\, P_m(t)}{I_{m,l}^k(t) + \sigma^2}, \tag{6}$$
where $G_{m,l}^k(t)$ denotes the channel gain between UAV $m$ and user $l$ over subchannel $k$ at time slot $t$, $P_m(t)$ denotes the transmit power selected by UAV $m$ at time slot $t$, and $\sigma^2$ is the noise power. $I_{m,l}^k(t)$ is the interference to UAV $m$ with $I_{m,l}^k(t) = \sum_{j \in \mathcal{M}, j \ne m} G_{j,l}^k(t)\, c_j^k(t)\, P_j(t)$. Therefore, at any time slot $t$, the SINR for UAV $m$ can be expressed as
$$\gamma_m(t) = \sum_{l \in \mathcal{L}} \sum_{k \in \mathcal{K}} \gamma_{m,l}^k(t). \tag{7}$$
In this article, discrete transmit power control is adopted at UAVs [28]. The transmit power values used by each UAV to communicate with its connected user can be expressed as a vector $\mathbf{P} = \{P_1, \cdots, P_J\}$. For each UAV $m$, we define a binary variable $p_m^j(t)$, where $p_m^j(t) = 1$ if UAV $m$ selects to transmit at power level $P_j$ at time slot $t$, and $p_m^j(t) = 0$ otherwise. Since only one power level can be selected at each time slot $t$ by UAV $m$, we have
$$\sum_{j \in \mathcal{J}} p_m^j(t) \le 1.$$
As a result, we can define a finite set of possible power level selection decisions made by UAV $m$, as follows:
$$\mathcal{P}_m = \Big\{ p_m(t) = [p_m^1(t), \cdots, p_m^J(t)] \;\Big|\; \sum_{j \in \mathcal{J}} p_m^j(t) \le 1 \Big\}.$$
Similarly, we also define finite sets of all possible subchannel selections and user selections by UAV $m$, respectively, which are given as follows:
$$\mathcal{C}_m = \Big\{ c_m(t) = [c_m^1(t), \cdots, c_m^K(t)] \;\Big|\; \sum_{k \in \mathcal{K}} c_m^k(t) \le 1 \Big\}, \qquad \mathcal{A}_m = \Big\{ a_m(t) = [a_m^1(t), \cdots, a_m^L(t)] \;\Big|\; \sum_{l \in \mathcal{L}} a_m^l(t) \le 1 \Big\}.$$
To proceed further, we assume that the considered multi-UAV network operates on a discrete-time basis where the time axis is partitioned into equal non-overlapping time intervals (slots). Furthermore, the communication parameters are assumed to remain constant during each time slot. Let $t$ denote an integer-valued time slot index. Particularly, each UAV holds the CSI of all ground users and its decisions for a fixed time interval of $T_s \ge 1$ slots, which is called the decision period. We consider the following scheduling strategy for the transmissions of UAVs: any UAV is assigned a time slot $t$ to start its transmission, and must finish its transmission and select a new strategy or reselect the old strategy by the end of its decision period, i.e., at slot $t + T_s$.
We also assume that the UAVs do not know the exact duration of their stay in the network. This feature motivates us to design an online learning algorithm for optimizing the long-term energy-efficiency performance of multi-UAV networks.
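As a rough illustration of the signal model, the following sketch evaluates the SINR of one UAV-user pair on one subchannel, counting as interferers only those UAVs that occupy the same subchannel. The data layout (`gains[j][l][k]`, one subchannel index per UAV in `subch`) and the function name are assumptions made for this example only.

```python
def sinr(m, l, k, gains, subch, power, noise):
    """SINR of UAV m serving user l on subchannel k.

    gains[j][l][k]: channel gain from UAV j to user l on subchannel k (assumed layout)
    subch[j]:       index of the subchannel occupied by UAV j
    power[j]:       transmit power of UAV j in W
    noise:          noise power in W
    """
    if subch[m] != k:
        return 0.0  # UAV m does not transmit on subchannel k
    signal = gains[m][l][k] * power[m]
    # Only UAVs occupying the same subchannel contribute to the interference.
    interference = sum(
        gains[j][l][k] * power[j]
        for j in range(len(power))
        if j != m and subch[j] == k
    )
    return signal / (interference + noise)
```

With two UAVs sharing a subchannel, the second UAV's received power appears in the denominator; moving it to another subchannel removes that term and raises the SINR.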

III. STOCHASTIC GAME FRAMEWORK FOR MULTI-UAV NETWORKS
In this section, we first describe the optimization problem investigated in this article. Then, to model the uncertainty of stochastic environments, we formulate the problem of joint user, power level and subchannel selection by UAVs as a stochastic game.

A. Problem Formulation
Note from (6) that, to achieve the maximal throughput, each UAV would transmit at its maximal power level, which in turn increases the interference to other UAVs. Hence, to provide reliable communications, the main goal of the dynamic design for joint user, power level and subchannel selection is to ensure that the SINRs provided by the UAVs are no less than the predefined thresholds. Specifically, the mathematical form can be expressed as
$$\gamma_m(t) \ge \bar{\gamma}, \quad \forall m \in \mathcal{M}, \tag{12}$$
where $\bar{\gamma}$ denotes the targeted QoS threshold of users served by UAVs. At time slot $t$, if the constraint (12) is satisfied, then the UAV obtains a reward $R_m(t)$, defined as the difference between the throughput and the cost of power consumption achieved by the selected user, subchannel and power level. Otherwise, it receives a zero reward. Therefore, we can express the reward function $R_m(t)$ of UAV $m$ at time slot $t$ as follows:
$$R_m(t) = \begin{cases} \dfrac{W}{K}\log_2\left(1 + \gamma_m(t)\right) - \omega_m P_m(t), & \text{if } \gamma_m(t) \ge \bar{\gamma}, \\ 0, & \text{otherwise,} \end{cases} \tag{13}$$
for all $m \in \mathcal{M}$, and the corresponding immediate reward is denoted as $R_m(t)$. In (13), $W/K$ is the subchannel bandwidth and $\omega_m$ is the cost per unit level of power. Note that at any time slot $t$, the instantaneous reward of UAV $m$ in (13) relies on: 1) the observed information: the individual user, subchannel and power level decisions of UAV $m$, i.e., $a_m(t)$, $c_m(t)$ and $p_m(t)$, as well as the current channel gain $G_{m,l}^k(t)$; 2) the unobserved information: the subchannels and power levels selected by other UAVs and their channel gains. It should be pointed out that we omit the fixed power consumption of UAVs, such as the power consumed by controller units and data processing [29].
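A QoS-gated reward of the kind described above can be sketched in a few lines. The throughput term is taken here as bandwidth times log2(1 + SINR), which is an assumption of this example, and the function and parameter names are hypothetical.

```python
import math


def reward(sinr, power_w, bandwidth_hz, omega, sinr_threshold):
    """Throughput minus power cost if the SINR target is met, otherwise zero.

    Assumption: throughput = bandwidth * log2(1 + SINR), with SINR in linear scale.
    """
    if sinr < sinr_threshold:
        return 0.0  # QoS constraint violated: zero reward
    return bandwidth_hz * math.log2(1.0 + sinr) - omega * power_w
```

The gating makes the reward discontinuous at the threshold, so a UAV that lowers its power to save cost loses the whole reward once its SINR drops below the target.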
Next, we consider maximizing the long-term reward $v_m(t)$ by selecting the served user, subchannel and transmit power level at each time slot. Particularly, we adopt a future discounted reward [30] as the performance measure for each UAV. Specifically, at a certain time slot of the process, the discounted reward is the payoff in the present time slot plus the sum of future rewards discounted by a constant factor. Therefore, the considered long-term reward of UAV $m$ is given by
$$v_m(t) = \sum_{\tau=0}^{+\infty} \delta^{\tau} R_m(t + \tau + 1), \tag{14}$$
where $\delta$ denotes the discount factor with $0 \le \delta < 1$. The value of $\delta$ reflects the effect of future rewards on the optimal decisions: if $\delta$ is close to 0, the decision emphasizes the near-term gain; by contrast, if $\delta$ is close to 1, more weight is given to future rewards and we say that the decisions are farsighted.
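The effect of the discount factor can be checked numerically on a finite reward sequence. The helper below is a minimal sketch (the name `discounted_return` is hypothetical) that truncates the infinite sum to the rewards provided.

```python
def discounted_return(rewards, delta):
    """Sum of rewards[i] * delta**i, accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + delta * total
    return total
```

For example, `discounted_return([1, 1, 1], 0.5)` gives 1 + 0.5 + 0.25 = 1.75, whereas a discount factor of 0 keeps only the first reward.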
Next, we introduce the set of all possible user, subchannel and power level decisions made by UAV $m$, $m \in \mathcal{M}$, which can be denoted as $\Theta_m = \mathcal{A}_m \otimes \mathcal{C}_m \otimes \mathcal{P}_m$ with $\otimes$ denoting the Cartesian product. Consequently, the objective of each UAV $m$ is to make a selection $\theta_m^*(t) = (a_m^*(t), c_m^*(t), p_m^*(t)) \in \Theta_m$ that maximizes its long-term reward in (14). Hence, the optimization problem for UAV $m$, $m \in \mathcal{M}$, can be formulated as
$$\theta_m^*(t) = \arg\max_{\theta_m \in \Theta_m} v_m(t). \tag{15}$$
Note that the optimization design for the considered multi-UAV network consists of $M$ subproblems, which correspond to the $M$ different UAVs. Moreover, each UAV has no information about the other UAVs, such as their rewards, and hence problem (15) cannot be solved exactly. To solve the optimization problem (15) in stochastic environments, we formulate the problem of joint user, subchannel and power level selection by UAVs as a stochastic non-cooperative game in the following subsection.
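Since the action space of a UAV is a Cartesian product of its user, subchannel and power level choices, it can be enumerated directly. The sketch below represents each decision by an index rather than by a binary indicator vector, which is a simplification assumed for illustration.

```python
from itertools import product


def action_space(num_users, num_subchannels, num_power_levels):
    """All (user, subchannel, power level) index triples available to one UAV."""
    return list(product(range(num_users),
                        range(num_subchannels),
                        range(num_power_levels)))
```

The size of the space grows multiplicatively: `len(action_space(2, 3, 4))` is 24, which is also the number of columns a tabular Q-function needs per state.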

B. Stochastic Game Formulation
In this subsection, we model the formulated problem (15) by adopting a stochastic game (also called Markov game) framework [25], since it is the generalization of Markov decision processes to the multi-agent case.
In the considered network, $M$ UAVs communicate with users while having no information about the operating environment. It is assumed that all UAVs are selfish and rational. Hence, at any time slot $t$, all UAVs select their actions non-cooperatively to maximize the long-term rewards in (15). Note that the action for each UAV $m$ is selected from its action space $\Theta_m$. The action conducted by UAV $m$ at time slot $t$ is a triple $\theta_m(t) = (a_m(t), c_m(t), p_m(t)) \in \Theta_m$, where $a_m(t)$, $c_m(t)$ and $p_m(t)$ represent the selected user, subchannel and power level, respectively, for UAV $m$ at time slot $t$. For each UAV $m$, denote by $\theta_{-m}(t)$ the actions conducted by the other $M - 1$ UAVs at time slot $t$. As a result, the instantaneous SINR of UAV $m$ at time slot $t$ can be rewritten as
$$\gamma_m(t) = \sum_{l \in \mathcal{L}} \sum_{k \in \mathcal{K}} \frac{G_{m,l}^k(t)\, a_m^l(t)\, c_m^k(t)\, P_m(t)}{I_{m,l}^k(t)\left(\theta_{-m}(t)\right) + \sigma^2}, \tag{16}$$
where the numerator is determined by the action $\theta_m(t)$ of UAV $m$, and $I_{m,l}^k(t)(\cdot)$ is given in (6). Furthermore, $\mathbf{G}_{m,l}(t)$ denotes the matrix of instantaneous channel responses between UAV $m$ and user $l$ at time slot $t$, which can be expressed as
$$\mathbf{G}_{m,l}(t) = \begin{bmatrix} G_{1,l}^1(t) & \cdots & G_{1,l}^K(t) \\ \vdots & \ddots & \vdots \\ G_{M,l}^1(t) & \cdots & G_{M,l}^K(t) \end{bmatrix}, \tag{17}$$
with $\mathbf{G}_{m,l}(t) \in \mathbb{R}^{M \times K}$, for all $l \in \mathcal{L}$ and $m \in \mathcal{M}$.
At any time slot $t$, each UAV $m$ can measure its current SINR level $\gamma_m(t)$. Hence, the state $s_m(t)$ for each UAV $m$, $m \in \mathcal{M}$, is fully observed, and can be defined as
$$s_m(t) = \begin{cases} 1, & \text{if } \gamma_m(t) \ge \bar{\gamma}, \\ 0, & \text{otherwise.} \end{cases} \tag{18}$$
Let $s = (s_1, \cdots, s_M)$ be the state vector for all UAVs. In this article, UAV $m$ does not know the states of the other UAVs, as UAVs cannot cooperate with each other.
We assume that the actions for each UAV satisfy the properties of a Markov chain, that is, the reward of a UAV depends only on the current state and action. As discussed in [26], a Markov chain is used to describe the dynamics of the states of a stochastic game where each player has a single action in each state. Specifically, the formal definition of Markov chains is given as follows.
Definition 1. A finite state Markov chain is a discrete stochastic process, which can be described as follows: Let $\mathcal{S} = \{s_1, \cdots, s_q\}$ be a finite set of states and $\mathbf{F}$ a $q \times q$ transition matrix with each entry $0 \le F_{i,j} \le 1$ and $\sum_{j=1}^{q} F_{i,j} = 1$ for any $1 \le i \le q$. The process starts in one of the states and moves to another state successively. Assume that the chain is currently in state $s_i$. The probability of moving to the next state $s_j$ is
$$\Pr\{s(t+1) = s_j \mid s(t) = s_i\} = F_{i,j}, \tag{19}$$
which depends only on the present state and not on the previous states; this is also called the Markov property.
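Definition 1 translates directly into a check on the transition matrix and a one-step sampler. The following is a minimal sketch using only the Python standard library; the function names are hypothetical.

```python
import random


def is_stochastic(F, tol=1e-9):
    """True if every entry lies in [0, 1] and every row sums to one."""
    return all(
        all(0.0 <= p <= 1.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in F
    )


def step(F, i, rng):
    """Sample the next state index given current state i (Markov property:
    only row i of the transition matrix is needed)."""
    return rng.choices(range(len(F)), weights=F[i])[0]
```

For example, with `F = [[0.9, 0.1], [0.5, 0.5]]` the chain tends to remain in state 0, and repeated calls to `step` generate a sample path without ever consulting earlier states.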
Therefore, the reward function of UAV $m$, $m \in \mathcal{M}$, can be expressed as
$$r_m^t = R_m(\theta_m^t, \theta_{-m}^t, s_m^t) = s_m^t \left( C_m^t[\theta_m^t, \theta_{-m}^t, \mathbf{G}_m^t] - \omega_m P_m[\theta_m^t] \right). \tag{20}$$
Here we put the time slot index $t$ in the superscript for notational compactness, a convention adopted in the remainder of this article. In (20), the instantaneous transmit power $P_m[\theta_m^t]$ is a function of the action $\theta_m^t$, and the instantaneous rate of UAV $m$ is given by
$$C_m^t[\theta_m^t, \theta_{-m}^t, \mathbf{G}_m^t] = \frac{W}{K} \log_2\left(1 + \gamma_m[\theta_m^t, \theta_{-m}^t, \mathbf{G}_m^t]\right). \tag{21}$$
Notice from (20) that, at any time slot $t$, the reward $r_m^t$ received by UAV $m$ depends on the current state $s_m^t$, which is fully observed, and on the partially observed actions $(\theta_m^t, \theta_{-m}^t)$. At the next time slot $t + 1$, UAV $m$ moves to a new random state $s_m^{t+1}$ whose probability depends only on the previous state $s_m^t$ and the selected actions $(\theta_m^t, \theta_{-m}^t)$. This procedure repeats for an indefinite number of slots. Specifically, at any time slot $t$, UAV $m$ can observe its state $s_m^t$ and the corresponding action $\theta_m^t$, but it does not know the actions of other players, $\theta_{-m}^t$, or the precise values of $\mathbf{G}_m^t$. The state transition probabilities are also unknown to each player UAV $m$. Therefore, the considered UAV system can be formulated as a stochastic game [31].
Definition 2. A stochastic game can be defined as a tuple $\Phi = (\mathcal{S}, \mathcal{M}, \Theta, F, R)$ where:
• $\mathcal{S}$ denotes the state set with $\mathcal{S} = \mathcal{S}_1 \times \cdots \times \mathcal{S}_M$, $\mathcal{S}_m = \{0, 1\}$, for all $m \in \mathcal{M}$;
• $\mathcal{M}$ is the set of players;
• $\Theta$ denotes the joint action set and $\Theta_m$ is the action set of player UAV $m$;
• $F$ is the state transition probability function, which depends on the actions of all players. Specifically, $F(s_m^t, \theta, s_m^{t+1}) = \Pr\{s_m^{t+1} \mid s_m^t, \theta\}$ denotes the probability of transitioning to the next state $s_m^{t+1}$ from the state $s_m^t$ by executing the joint action $\theta$;
• $R$ is the set of reward functions of the players.
With a strategy $\pi_i$ assigned to each UAV $i$ and $\pi = (\pi_1, \cdots, \pi_M)$ denoting the joint strategy, the optimization objective in (14) can be reformulated as
$$V_m(s_m, \pi) = E\left\{ \sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+1} \,\Big|\, s_m^t = s_m, \pi \right\}, \tag{22}$$
where $r_m^{t+\tau+1}$ represents the immediate reward received by UAV $m$ at time $t + \tau + 1$ and $E\{\cdot\}$ denotes the expectation operation. In the formulated stochastic game, each player (UAV) has an individual expected reward that depends on the joint strategy and not only on its own strategy. Hence, one cannot simply expect the players to maximize their expected rewards, as it may not be possible for all players to achieve this goal at the same time. Next, we describe a solution for the stochastic game by Nash equilibrium [32].
Definition 3. A Nash equilibrium is a collection of strategies, one for each player, such that each individual strategy is a best response to the others. That is, a joint strategy $\pi^* = (\pi_1^*, \cdots, \pi_M^*)$ is a Nash equilibrium if, for every UAV $m \in \mathcal{M}$ and every state $s_m \in \mathcal{S}_m$,
$$V_m(s_m, \pi_m^*, \pi_{-m}^*) \ge V_m(s_m, \pi_m, \pi_{-m}^*), \quad \forall \pi_m. \tag{23}$$
It means that in a Nash equilibrium, each UAV's action is the best response to the other UAVs' choices. Thus, in a Nash equilibrium solution, no UAV can benefit by changing its strategy as long as all the other UAVs keep their strategies constant. Note that the presence of imperfect information in the formulated non-cooperative stochastic game provides opportunities for the players to learn their optimal strategies through repeated interactions with the stochastic environment. Hence, each player UAV $m$ is regarded as a learning agent whose task is to find a Nash equilibrium strategy for any state $s_m$. In the next section, we propose a multi-agent reinforcement learning framework for maximizing the sum expected reward in (22) with partial observations.
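The best-response condition of Definition 3 can be illustrated on a much simpler object than the stochastic game above: a single-state, two-player game with pure strategies. The sketch below is only such a toy case, with a hypothetical payoff layout `payoffs[player][a0][a1]`.

```python
def is_pure_nash(payoffs, profile):
    """Check whether profile = (a0, a1) is a pure-strategy Nash equilibrium.

    payoffs[p][a0][a1] is the payoff of player p when player 0 plays a0
    and player 1 plays a1 (toy single-state game, assumed layout).
    """
    a0, a1 = profile
    # Best payoff player 0 could get by deviating while player 1 keeps a1.
    best0 = max(payoffs[0][x][a1] for x in range(len(payoffs[0])))
    # Best payoff player 1 could get by deviating while player 0 keeps a0.
    best1 = max(payoffs[1][a0][y] for y in range(len(payoffs[1][a0])))
    return payoffs[0][a0][a1] >= best0 and payoffs[1][a0][a1] >= best1
```

In a prisoner's-dilemma-like payoff table, mutual defection passes this check while mutual cooperation does not, since each player gains by deviating unilaterally from cooperation.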

IV. PROPOSED MULTI-AGENT REINFORCEMENT-LEARNING ALGORITHM
In this section, we first describe the proposed MARL framework for multi-UAV networks. Then a Q-learning based resource allocation algorithm is proposed for maximizing the expected long-term reward of the considered multi-UAV network.
Fig. 2: Illustration of MARL framework for multi-UAV networks.

A. MARL Framework for Multi-UAV Networks
Fig. 2 describes the key components of MARL studied in this article. Specifically, for each UAV $m$, the left-hand side of the box is the locally observed information at time slot $t$, namely the state $s_m^t$ and the reward $r_m^t$; the right-hand side of the box is the action of UAV $m$ at time slot $t$. The decision problem faced by a player in a stochastic game when all other players choose a fixed strategy profile is equivalent to a Markov decision process (MDP) [26]. An agent-independent method is proposed, for which all agents conduct a decision algorithm independently but share a common structure based on Q-learning. In this article, Q-learning is used to solve MDPs, for which a learning agent operates in an unknown stochastic environment and does not know the reward and transition functions [33].
Next, we describe the Q-learning algorithm for solving the MDP for one UAV. Without loss of generality, UAV $m$ is considered for simplicity. Two fundamental concepts of algorithms for solving the above MDP are the state value function and the action value function (Q-function) [34]. Specifically, the former is the expected reward in (22) obtained from some state when the agent follows some policy. Similarly, the Q-function for UAV $m$ is the expected reward starting from the state $s_m$, taking the action $\theta_m$ and following policy $\pi$, which can be expressed as
$$Q_m(s_m, \theta_m, \pi) = E\left\{ \sum_{\tau=0}^{+\infty} \delta^{\tau} r_m^{t+\tau+1} \,\Big|\, s_m^t = s_m, \theta_m^t = \theta_m \right\}, \tag{24}$$
where the corresponding values of (24) are called action values (Q-values).
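Proposition 1. A recursive relationship for the state value function can be derived from the established return. Specifically, for any strategy $\pi$ and any state $s_m$, the following condition holds between two consecutive states $s_m^t = s_m$ and $s_m^{t+1} = s_m'$, with $s_m, s_m' \in \mathcal{S}_m$:
$$V_m(s_m, \pi) = \sum_{\theta \in \Theta} \prod_{j \in \mathcal{M}} \pi_j(s_j, \theta_j) \sum_{s_m' \in \mathcal{S}_m} F(s_m, \theta, s_m') \left[ R_m(s_m, \theta, s_m') + \delta V_m(s_m', \pi) \right], \tag{25}$$
where $\pi_j(s_j, \theta_j)$ is the probability of choosing action $\theta_j$ in state $s_j$ for UAV $j$, and $R_m(s_m, \theta, s_m')$ denotes the reward of UAV $m$ for the transition from $s_m$ to $s_m'$ under the joint action $\theta$.
Proof. See Appendix A.
Note that the state value function $V_m(s_m, \pi)$ is the expected return when starting in state $s_m$ and following a strategy $\pi$ thereafter. Based on Proposition 1, we can also rewrite the Q-function in (24) in a recursive form, which is given by
$$Q_m(s_m, \theta_m, \pi) = \sum_{\theta_{-m}} \prod_{j \in \mathcal{M}, j \ne m} \pi_j(s_j, \theta_j) \sum_{s_m' \in \mathcal{S}_m} F(s_m, \theta, s_m') \left[ R_m(s_m, \theta, s_m') + \delta V_m(s_m', \pi) \right]. \tag{26}$$
Note from (26) that the Q-values depend on the actions of all the UAVs. It should be pointed out that (25) and (26) are the basic equations of the Q-learning based reinforcement learning algorithm for solving the MDP of each UAV. From (25) and (26), we can also derive the following relationship between state values and Q-values:
$$V_m(s_m, \pi) = \sum_{\theta_m \in \Theta_m} \pi_m(s_m, \theta_m)\, Q_m(s_m, \theta_m, \pi). \tag{27}$$
As discussed above, the goal of solving an MDP is to find an optimal strategy that obtains a maximal reward. An optimal strategy for UAV $m$ at state $s_m$ can be defined, from the perspective of the state value function, as
$$V_m^*(s_m) = \max_{\pi_m} V_m(s_m, \pi), \quad s_m \in \mathcal{S}_m. \tag{28}$$
For the optimal Q-values, we also have
$$Q_m^*(s_m, \theta_m) = \max_{\pi_m} Q_m(s_m, \theta_m, \pi), \quad s_m \in \mathcal{S}_m,\ \theta_m \in \Theta_m. \tag{29}$$
Substituting (27) into (28), the optimal state value equation in (28) can be reformulated as
$$V_m^*(s_m) = \max_{\theta_m \in \Theta_m} Q_m^*(s_m, \theta_m), \tag{30}$$
where the fact that $\sum_{\theta_m} \pi(s_m, \theta_m) Q_m^*(s_m, \theta_m) \le \max_{\theta_m} Q_m^*(s_m, \theta_m)$ was applied to obtain (30). Note that in (30), the optimal state value equation is a maximization over the action space instead of the strategy space.
Next, by combining (30) with (25) and (26), one can obtain the Bellman optimality equations for state values and for Q-values, respectively:
$$V_m^*(s_m) = \max_{\theta_m \in \Theta_m} \sum_{\theta_{-m}} \prod_{j \in \mathcal{M}, j \ne m} \pi_j(s_j, \theta_j) \sum_{s_m' \in \mathcal{S}_m} F(s_m, \theta, s_m') \left[ R_m(s_m, \theta, s_m') + \delta V_m^*(s_m') \right], \tag{31}$$
and
$$Q_m^*(s_m, \theta_m) = \sum_{\theta_{-m}} \prod_{j \in \mathcal{M}, j \ne m} \pi_j(s_j, \theta_j) \sum_{s_m' \in \mathcal{S}_m} F(s_m, \theta, s_m') \left[ R_m(s_m, \theta, s_m') + \delta \max_{\theta_m' \in \Theta_m} Q_m^*(s_m', \theta_m') \right]. \tag{32}$$
Note that (32) indicates that the optimal strategy always chooses an action that maximizes the Q-function for the current state. In the multi-agent case, the Q-function of each agent depends on the joint action and is conditioned on the joint policy, which makes it complex to find an optimal joint strategy [34]. To overcome these challenges, we consider the UAVs to be independent learners (ILs), that is, the UAVs do not observe the rewards and actions of the other UAVs and interact with the environment as if no other UAVs exist.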
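When the reward and transition functions of a single-agent MDP are known, the Bellman optimality equation for Q-values can be solved by fixed-point iteration. The sketch below does this for a small tabular MDP; it assumes that the other agents' strategies have already been folded into the expected reward `R[s][a]` and the transition probabilities `F[s][a][s2]`, and all names are hypothetical.

```python
def q_value_iteration(R, F, delta, iters=500):
    """Iterate Q(s,a) <- R[s][a] + delta * sum_s2 F[s][a][s2] * max_a2 Q(s2,a2).

    R[s][a]:     expected immediate reward (assumed known here)
    F[s][a][s2]: transition probability from s to s2 under action a
    delta:       discount factor, 0 <= delta < 1
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [max(row) for row in Q]  # optimal state values implied by Q
        Q = [
            [R[s][a] + delta * sum(F[s][a][s2] * V[s2] for s2 in range(n_s))
             for a in range(n_a)]
            for s in range(n_s)
        ]
    return Q
```

Q-learning targets the same fixed point, but replaces the known `R` and `F` with sampled rewards and observed transitions, which is what makes it applicable when the environment is unknown.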

B. Q-Learning based Resource Allocation for Multi-UAV Networks
In this subsection, an ILs [35] based MARL algorithm is proposed to solve the resource allocation among UAVs. Specifically, each UAV runs a standard Q-learning algorithm to learn its optimal Q-values and simultaneously determines an optimal strategy for the MDP. The selection of an action in each iteration depends on the Q-values of two states: $s_m$ and its successor. Hence, Q-values provide insights on the future quality of the actions in the successor state. The update rule for Q-learning [33] is given by
$$Q_m^{t+1}(s_m, \theta_m) = Q_m^t(s_m, \theta_m) + \alpha^t \left\{ r_m^t + \delta \max_{\theta_m' \in \Theta_m} Q_m^t(s_m', \theta_m') - Q_m^t(s_m, \theta_m) \right\}, \tag{33}$$
with $s_m^t = s_m$ and $\theta_m^t = \theta_m$, where $s_m'$ and $\theta_m'$ correspond to $s_m^{t+1}$ and $\theta_m^{t+1}$, respectively. Note that an optimal action-value function can be obtained recursively from the corresponding action-values. Specifically, each agent learns the optimal action-values based on the updating rule in (33), where $\alpha^t$ denotes the learning rate and $Q_m^t$ is the action-value of UAV $m$ at time slot $t$.
Another important component of Q-learning is the action selection mechanism, which is used to select the actions that the agent performs during the learning process. Its purpose is to strike a balance between exploration and exploitation, such that the agent can reinforce the actions it already knows to be good but also explore new actions [33]. In this article, we consider ε-greedy exploration. In ε-greedy selection, the agent selects a random action with probability $\epsilon$ and selects the best action, which corresponds to the highest Q-value at the moment, with probability $1 - \epsilon$. As such, the probability of selecting action $\theta_m$ at state $s_m$ is given by
$$\pi_m(s_m, \theta_m) = \begin{cases} 1 - \epsilon, & \text{if } Q_m \text{ of } \theta_m \text{ is the highest}, \\ \epsilon, & \text{otherwise}, \end{cases} \tag{34}$$
where $\epsilon \in (0, 1)$. To ensure the convergence of Q-learning, the learning rate $\alpha^t$ is set as in [36], which is given by
$$\alpha^t = \frac{1}{(t + c_\alpha)^{\rho_\alpha}}, \tag{35}$$
where $c_\alpha > 0$ and $\rho_\alpha \in (1/2, 1]$.
Note that each UAV runs the Q-learning procedure independently in the proposed ILs based MARL algorithm. Hence, for each UAV $m$, $m \in \mathcal{M}$, the Q-learning procedure is summarized in Algorithm 1. In Algorithm 1, the initial Q-values are set to zero; therefore, it is also called zero-initialized Q-learning [37]. Since UAVs have no prior information on the initial state, a UAV initially takes a strategy with equal probabilities, i.e., $\pi_m(s_m, \theta_m) = 1/|\Theta_m|$.
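The two ingredients discussed above, a decaying learning rate and ε-greedy action selection, can be sketched as follows. The polynomial decay form and its default constants (taken from the simulation section) are assumptions of this example, as are the function names; a uniformly random action is drawn during exploration.

```python
import random


def learning_rate(t, c_alpha=0.5, rho_alpha=0.8):
    """Polynomially decaying step size 1 / (t + c_alpha)**rho_alpha (assumed form)."""
    return 1.0 / (t + c_alpha) ** rho_alpha


def epsilon_greedy(q_row, epsilon, rng):
    """Return an action index for one state given its row of Q-values.

    With probability epsilon a uniformly random action is explored;
    otherwise the action with the highest Q-value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)
```

A larger `epsilon` speeds up the discovery of good actions at the cost of more frequent suboptimal choices, which is the tradeoff the simulations explore.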
Algorithm 1 Q-learning based MARL algorithm for UAVs
  for all UAV $m$, $m \in \mathcal{M}$ do
    Initialize the action-value $Q^t_m(s_m, \theta_m) = 0$;
    Initialize the state $s_m = s^t_m = 0$;
  end for
  Main Loop:
  for all UAV $m$, $m \in \mathcal{M}$ do
    Update the learning rate $\alpha^t$ according to (35).
    Select an action $\theta_m$ according to the strategy $\pi_m(s_m)$.
    Measure the achieved SINR at the receiver according to (16);
    Set $s^t_m = 0$.
    Update the instantaneous reward $r^t_m$ according to (20).
    Update the action-value $Q^{t+1}_m(s_m, \theta_m)$ according to (33).
  end for
  Update $t = t + 1$ and the state $s_m = s^t_m$.
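To make the independent-learner structure of Algorithm 1 concrete, the following Python sketch runs several agents that each keep a private Q-table and never observe the others' choices. Everything about the environment is a hypothetical stand-in: the collision-style reward, the binary state, the decaying learning rate and all parameter values are assumptions for illustration and do not reproduce the SINR model in (16), the reward in (20) or the schedule in (35).

```python
import random


def run_independent_learners(num_agents=2, num_actions=3, num_slots=2000,
                             epsilon=0.2, delta=0.9, seed=0):
    """Each agent keeps its own Q-table and never sees the others' choices."""
    rng = random.Random(seed)
    # Zero-initialized Q-values: q[m][state][action], with states {0, 1}.
    q = [[[0.0] * num_actions for _ in range(2)] for _ in range(num_agents)]
    states = [0] * num_agents
    total_reward = 0.0
    for t in range(1, num_slots + 1):
        alpha = 1.0 / t ** 0.8  # hypothetical decaying learning rate
        # Every agent selects its action from local information only.
        chosen = []
        for m in range(num_agents):
            if rng.random() < epsilon:
                chosen.append(rng.randrange(num_actions))
            else:
                row = q[m][states[m]]
                chosen.append(row.index(max(row)))
        # Toy environment: an action pays off only if no other agent chose it.
        for m in range(num_agents):
            reward = 1.0 if chosen.count(chosen[m]) == 1 else 0.0
            next_state = 1 if reward > 0 else 0
            target = reward + delta * max(q[m][next_state])
            old = q[m][states[m]][chosen[m]]
            q[m][states[m]][chosen[m]] = old + alpha * (target - old)
            states[m] = next_state
            total_reward += reward
    return q, total_reward / num_slots


q_tables, avg_reward = run_independent_learners()
print(len(q_tables), avg_reward)
```

The only coupling between agents is through the environment's reward, which mirrors the absence of information exchange among UAVs in the proposed scheme.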

C. Analysis of the proposed MARL algorithm
In this subsection, we investigate the convergence of the proposed MARL based resource allocation algorithm. Notice that the proposed MARL algorithm can be treated as an independent multi-agent Q-learning algorithm, in which each UAV, as a learning agent, makes its decisions based on the Q-learning algorithm. Therefore, the convergence is stated in the following proposition.
Proposition 2. In the proposed MARL algorithm of Algorithm 1, the Q-learning procedure of each UAV always converges to the Q-values of its individual optimal strategy.
The proof of Proposition 2 depends on the following observations. Due to the non-cooperative property of the UAVs, the convergence of the proposed MARL algorithm depends on the convergence of the Q-learning algorithm [35]. Therefore, we focus on the proof of convergence for the Q-learning algorithm in Algorithm 1.
Theorem 1. The Q-learning algorithm in Algorithm 1 with the update rule in (33) converges with probability one (w.p.1) to the optimal $Q^*_m(s_m, \theta_m)$ value if 1) the state and action spaces are finite; 2)

V. SIMULATION RESULTS
In this section, we verify the effectiveness of the proposed MARL based resource allocation algorithm for multi-UAV networks by simulations. We consider multi-UAV networks deployed in a disk area with radius $r_d = 500$ m. The ground users are randomly and uniformly distributed inside the disk. All UAVs are assumed to fly at a fixed altitude $H = 100$ m. In the simulations, the noise power is assumed to be $\sigma^2 = -80$ dBm, the subchannel bandwidth is $W/K = 75$ kHz and $T_s = 0.1$ s. For the probabilistic model, the channel parameters in the simulations follow [7], where $a = 9.61$ and $b = 0.16$. Moreover, the carrier frequency is $f = 2$ GHz, $\eta_{LoS} = 1$ and $\eta_{NLoS} = 20$. For the LoS channel model, the channel power gain at the reference distance $d_0 = 1$ m is set as $\beta_0 = -60$ dB and the path loss coefficient is set as $\alpha = 2$ [11]. In the simulations, the maximal number of power levels is $J = 3$ and the maximal power of each UAV is $P_m = P = 23$ dBm, where the maximal power is equally divided into $J$ discrete power values. The cost per unit level of power is $\omega_m = \omega = 100$ and the minimum SINR for the users is set as $\gamma_0 = 3$ dB. Moreover, $c_\alpha = 0.5$, $\rho_\alpha = 0.8$ and $\delta = 1$.
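As a rough illustration of how the simulation parameters above translate into numbers, the sketch below converts the power budget into discrete levels and evaluates two channel quantities. It rests on several assumptions that are not stated in this section: the power is split equally on a linear (watt) scale, the LoS probability takes the commonly used sigmoid form in the elevation angle (in degrees) with parameters $a$ and $b$, and the LoS gain is $\beta_0 d^{-\alpha}$. The function names are hypothetical.

```python
import math


def dbm_to_watt(p_dbm):
    """Convert a power level from dBm to watts."""
    return 10 ** (p_dbm / 10) / 1000


def power_levels(p_max_dbm=23.0, num_levels=3):
    """Split the maximal power into equally spaced linear levels (assumed)."""
    p_max = dbm_to_watt(p_max_dbm)
    return [p_max * j / num_levels for j in range(1, num_levels + 1)]


def los_probability(horizontal_dist, altitude=100.0, a=9.61, b=0.16):
    """LoS probability as a sigmoid of the elevation angle in degrees (assumed form)."""
    elevation = math.degrees(math.atan2(altitude, horizontal_dist))
    return 1.0 / (1.0 + a * math.exp(-b * (elevation - a)))


def los_channel_gain(horizontal_dist, altitude=100.0, beta0_db=-60.0, alpha=2.0):
    """LoS channel power gain beta0 * d^(-alpha) for the UAV-to-user distance d."""
    dist = math.hypot(horizontal_dist, altitude)
    return 10 ** (beta0_db / 10) * dist ** (-alpha)


print(power_levels())                 # three power levels in watts
print(los_probability(100.0))         # user 100 m from the UAV's ground projection
print(los_channel_gain(100.0))        # corresponding LoS power gain
```

Under these assumptions, a user directly below the UAV sees an almost certain LoS link, and the LoS probability falls as the horizontal distance grows.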
In Fig. 3, we consider a random realization of a multi-UAV network in the horizontal plane, where $L = 100$ users are uniformly distributed in a disk with radius $r = 500$ m and two UAVs are initially located at the edge of the disk with the angle $\phi = \pi/4$. For illustrative purposes, Fig. 4 shows the average reward and the average reward per time slot of the UAVs under the setup of Fig. 3, where the speed of the UAVs is set to 40 m/s. Fig. 4(a) shows the average rewards for different $\epsilon$. As can be observed from Fig. 4(a), the average reward increases with the algorithm iterations, because the long-term reward is improved by the proposed MARL algorithm. However, the curves of the average reward become flat when $t$ exceeds 250 time slots. In fact, the UAVs fly outside the disk when $t > 250$; as a result, the average reward no longer increases. Correspondingly, Fig. 4(b) illustrates the average instantaneous reward per time slot $r^t = \sum_{m \in \mathcal{M}} r^t_m$. As can be observed from Fig. 4(b), the average reward per time slot decreases with the algorithm iterations. This is because the learning rate $\alpha^t$ in the adopted Q-learning procedure is a function of $t$ in (35) and decreases as the time slots increase, which means that the update rate of the Q-values becomes slow with increasing $t$. Moreover, Fig. 4 also investigates the average reward with different $\epsilon \in \{0, 0.2, 0.5, 0.9\}$. If $\epsilon = 0$, each UAV always chooses a greedy action, which is also called the exploit strategy. As $\epsilon$ goes to 1, each UAV chooses a random action with higher probability. Notice from Fig. 4 that $\epsilon = 0.5$ is a good choice in the considered setup.
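The 250-slot threshold mentioned above is consistent with simple kinematics. The short check below assumes that each UAV crosses the disk along a full diameter at constant speed; the function name is hypothetical.

```python
def slots_inside_disk(radius_m, speed_mps, slot_s):
    """Number of time slots needed to fly one diameter at constant speed."""
    return 2 * radius_m / (speed_mps * slot_s)


# Radius 500 m, speed 40 m/s, slot duration T_s = 0.1 s.
print(slots_inside_disk(500, 40, 0.1))  # 250.0
```

Each slot covers 4 m of flight, so the 1000 m diameter is traversed in 250 slots, after which the UAV has left the disk.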
In Fig. 5 and Fig. 6, we investigate the average reward under different system configurations with $L = 200$. Specifically, the UAVs are randomly distributed on the cell edge. In the iteration procedure, each UAV flies over the cell along a straight line through the cell center, that is, the center of the disk. As can be observed from Fig. 5 and Fig. 6, the curves of the average reward have trends similar to those of Fig. 4 under different $\epsilon$. Besides, the considered multi-UAV network attains the highest average reward when $\epsilon = 0.5$ under different network configurations.
In Fig. 7, we investigate the average reward of the proposed MARL algorithm by comparing it to the matching theory based resource allocation algorithm (Match). In Fig. 7, we consider the same setup as in Fig. 4 but with $J = 1$ for simplicity of algorithm implementation, which indicates that the UAV's action only contains the user selection for each time slot. Furthermore, we assume that complete information exchanges among UAVs are performed in the matching theory based user selection algorithm, that is, each UAV knows the other UAVs' actions before making its own decision. For comparison, in the matching theory based user selection procedure, we adopt the Gale-Shapley (GS) algorithm [38] at each time slot. Moreover, we also consider the performance of the random user selection algorithm (Rand) as a baseline scheme in Fig. 7. As can be observed from Fig. 7, the achieved average reward of the matching based user selection algorithm outperforms that of the proposed MARL algorithm. This is because there are no information exchanges in the proposed MARL algorithm. In this case, each UAV cannot observe the other UAVs' information, such as rewards and decisions, and thus it makes its decision independently. Moreover, it can be observed from Fig. 7 that the average reward of the random user selection algorithm is lower than that of the proposed MARL algorithm. This is because, owing to the randomness of its user selections, it cannot exploit the observed information effectively. As a result, the proposed MARL algorithm achieves a tradeoff between reducing the information exchange overhead and improving the system performance.
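For readers unfamiliar with the Gale-Shapley procedure used by the Match baseline, a generic deferred-acceptance sketch in Python is given below. It is not the article's implementation: how the UAVs and users rank each other is not specified in this section, so the preference lists in the example are hypothetical, and the sketch assumes every proposer appears in each receiver's list.

```python
def gale_shapley(proposer_prefs, receiver_prefs):
    """Return a stable matching {proposer: receiver} by deferred acceptance."""
    # rank[r][p] is the position of proposer p in receiver r's preference list.
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in receiver_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}
    engaged = {}  # receiver -> proposer currently held
    free = list(proposer_prefs)
    while free:
        p = free.pop()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p has exhausted its list and stays unmatched
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = engaged.get(r)
        if current is None:
            engaged[r] = p
        elif rank[r][p] < rank[r][current]:
            engaged[r] = p          # r trades up; the old partner is free again
            free.append(current)
        else:
            free.append(p)          # r rejects p; p will try its next choice
    return {p: r for r, p in engaged.items()}


# Hypothetical preferences: UAVs propose to users.
uav_prefs = {"uav1": ["u1", "u2"], "uav2": ["u1", "u2"]}
user_prefs = {"u1": ["uav2", "uav1"], "u2": ["uav1", "uav2"]}
print(gale_shapley(uav_prefs, user_prefs))
```

In this example both UAVs prefer user `u1`, but `u1` prefers `uav2`, so `uav1` ends up matched with `u2`. Building such preference lists requires each side to know the others' information, which is the exchange overhead the proposed MARL scheme avoids.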
In Fig. 8, we investigate the average reward as a function of the algorithm iterations and the UAV's speed, where a UAV, starting from a random initial location on the disk edge, flies over the disk along a straight line through the disk center at different speeds. The setup in Fig. 8 is the same as that in Fig. 6 but with $M = 1$ and $K = 1$ for illustrative purposes. As can be observed, for a fixed speed, the average reward increases monotonically with the algorithm iterations. Besides, for a fixed time slot, the average reward at larger speeds increases faster than that at smaller speeds when $t$ is smaller than 150. This is due to the randomness of the locations of the users and the UAV: at the starting point, the UAV may not find an appropriate user satisfying its QoS requirement, so a faster UAV reaches suitable users earlier. Fig. 8 also shows that the achieved average reward decreases as the speed increases at the end of the algorithm iterations. This is because a UAV flying at a high speed takes less time to fly out of the disk. As a result, a UAV with a higher speed has less serving time than one with a lower speed.

VI. CONCLUSIONS
In this article, we investigated the real-time design of resource allocation for multi-UAV downlink networks to maximize the long-term rewards. Motivated by the uncertainty of environments, we proposed a stochastic game formulation for the dynamic resource allocation problem of the considered multi-UAV networks, in which the goal of each UAV was to find a resource allocation strategy for maximizing its expected reward. To overcome the overhead

Here, we show the state values for one UAV $m$ over time in (25). For one UAV $m$ with state $s_m \in \mathcal{S}_m$ at time step $t$, its state value function can be expressed in two parts, where the first part and the second part represent the expected value and the state value function, respectively, at time $t+1$ over the state space and the action space. Next we show the relationship between the first part and the reward function $R(s_m, \theta, s'_m)$ with $s^t_m = s_m$, $\theta^t_m = \theta$ and $s^{t+1}_m = s'_m$, where the definition of $R_m(s_m, \theta, s'_m)$ has been used to obtain the final step. Similarly, the second part can be transformed.

The proof of Theorem 1 follows from the idea in [36], [39]. Here we give a more general procedure for Algorithm 1. Note that the Q-learning algorithm is a stochastic form of value iteration [36], which can be observed from (26) and (32). That is, performing a step of value iteration requires knowing the expected reward and the transition probabilities. Therefore, to prove the convergence of the Q-learning algorithm, stochastic approximation theory is applied.
We first introduce a result of stochastic approximation given in [36].
Based on the results given in Lemma 1, we now prove Theorem 1 as follows.
Note that the Q-learning update equation in (33) can be rearranged as

$$Q^{t+1}_m(s_m, \theta_m) = (1 - \alpha^t) Q^t_m(s_m, \theta_m) + \alpha^t \Big[ r^t_m + \delta \max_{\theta'_m \in \Theta_m} Q^t_m(s'_m, \theta'_m) \Big].$$

Therefore, the Q-learning algorithm can be seen as the random process of Lemma 1. Next we prove that $\Psi^t(s_m, \theta_m)$ has the properties of 3) and 4) in Lemma 1. We start by showing that $\Psi^t(s_m, \theta_m)$ is a contraction mapping with respect to some maximum norm.
Definition 4. For a set $\mathcal{X}$, a mapping $H: \mathcal{X} \to \mathcal{X}$ is a contraction mapping, or contraction, if there exists a constant $\delta$, with $\delta \in (0, 1)$, such that $\| Hx_1 - Hx_2 \| \le \delta \| x_1 - x_2 \|$ for any $x_1, x_2 \in \mathcal{X}$.
Proposition 3. There exists a contraction mapping $H$ for the function $q$ with the form of the optimal Q-function in (B.8). That is,

$$\| Hq_1(s_m, \theta_m) - Hq_2(s_m, \theta_m) \|_\infty \le \delta \| q_1(s_m, \theta_m) - q_2(s_m, \theta_m) \|_\infty. \tag{B.7}$$

Proof. From (32), the optimal Q-function for Algorithm 1 can be expressed as in (B.8). Note that (B.12) corresponds to condition 3) of Lemma 1 in the form of the infinity norm.
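The contraction property in (B.7) can also be checked numerically on a small example. The sketch below builds a random finite MDP, applies an operator of the form $Hq(s, \theta) = \sum_{s'} F(s, \theta, s') [R(s, \theta, s') + \delta \max_{\theta'} q(s', \theta')]$ to two arbitrary Q-tables, and compares the infinity norms. The MDP, its size and the value $\delta = 0.9$ are assumptions chosen for illustration, and the check is a sanity test rather than a proof.

```python
import random


def bellman_operator(q, transition, reward, delta):
    """Apply (Hq)(s, a) = sum_s' F(s, a, s') * (R(s, a, s') + delta * max_a' q(s', a'))."""
    n_states, n_actions = len(q), len(q[0])
    return [[sum(transition[s][a][s2] * (reward[s][a][s2] + delta * max(q[s2]))
                 for s2 in range(n_states))
             for a in range(n_actions)]
            for s in range(n_states)]


def sup_norm_diff(q1, q2):
    """Infinity norm of q1 - q2 over all state-action pairs."""
    return max(abs(x - y) for r1, r2 in zip(q1, q2) for x, y in zip(r1, r2))


def random_mdp(n_states, n_actions, rng):
    """Random transition probabilities F and rewards R (illustrative only)."""
    transition, reward = [], []
    for _ in range(n_states):
        t_row, r_row = [], []
        for _ in range(n_actions):
            w = [rng.random() + 1e-9 for _ in range(n_states)]
            total = sum(w)
            t_row.append([x / total for x in w])  # each row sums to one
            r_row.append([rng.uniform(-1, 1) for _ in range(n_states)])
        transition.append(t_row)
        reward.append(r_row)
    return transition, reward


rng = random.Random(1)
F, R = random_mdp(3, 2, rng)
q1 = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(3)]
q2 = [[rng.uniform(-5, 5) for _ in range(2)] for _ in range(3)]
delta = 0.9
lhs = sup_norm_diff(bellman_operator(q1, F, R, delta),
                    bellman_operator(q2, F, R, delta))
rhs = delta * sup_norm_diff(q1, q2)
print(lhs <= rhs + 1e-12)  # True
```

The reward terms cancel in the difference $Hq_1 - Hq_2$, so only the discounted maxima remain, which is why the gap shrinks by at least the factor $\delta$.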
Finally, we verify that condition 4) of Lemma 1 is satisfied.
$R$ is a real-valued reward function for player $m$. In a stochastic game, a mixed strategy $\pi_m: \mathcal{S}_m \to \Theta_m$, denoting the mapping from the state set to the action set, is a collection of probability distributions over the available actions. Specifically, for UAV $m$ in state $s_m$, its mixed strategy is $\pi_m(s_m) = \{\pi_m(s_m, \theta_m) \mid \theta_m \in \Theta_m\}$, where each element $\pi_m(s_m, \theta_m)$ of $\pi_m(s_m)$ is the probability of UAV $m$ selecting action $\theta_m$ in state $s_m$. A joint strategy $\pi = \{\pi_1(s_1), \cdots, \pi_M(s_M)\}$ is a vector of strategies for the $M$ players, with one strategy for each player. Let $\pi_{-m} = \{\pi_1, \cdots, \pi_{m-1}, \pi_{m+1}, \cdots, \pi_M\}$ denote the same strategy profile but without the strategy $\pi_m$ of player UAV $m$. Based on the above discussions, the optimization goal of each player UAV $m$ in the formulated stochastic game is to maximize its expected reward over time. Therefore, for player UAV $m$ under a joint strategy $\pi = (\pi_1, \cdots, \pi_M)$

Since the Markov property is used to model the dynamics of the environment, the rewards of the UAVs are based only on the current state and action. The MDP for agent (UAV) $m$ consists of: 1) a discrete set of environment states $\mathcal{S}_m$; 2) a discrete set of possible actions $\Theta_m$; 3) the one-slot dynamics of the environment given by the state transition probabilities $F_{s^t_m \to s^{t+1}_m} = F(s^t_m, \theta, s^{t+1}_m)$ for all $\theta_m \in \Theta_m$ and $s^t_m, s^{t+1}_m \in \mathcal{S}_m$; 4) a reward function $R_m$ denoting the expected value of the next reward for UAV $m$. For instance, given the current state $s_m$, action $\theta_m$ and the next state $s'_m$: $R_m(s_m, \theta_m, s'_m) = E\{r^{t+1}_m \mid s^t_m = s_m, \theta^t_m = \theta_m, s^{t+1}_m = s'_m\}$, where $r^{t+1}_m$ denotes the immediate reward of the environment to UAV $m$ at time $t + 1$. Notice that the UAVs cannot interact with each other; hence, each UAV has only imperfect information about the stochastic environment in which it operates.

Fig. 5 illustrates the average reward with the LoS channel model given in (4) over different $\epsilon$.
Average rewards per time slot.

Fig. 8: Average rewards with different time slots and speeds.

$$Q^*_m(s_m, \theta_m) = \sum_{s'_m} F(s_m, \theta_m, s'_m) \times \Big[ R(s_m, \theta_m, s'_m) + \delta \max_{\theta'_m} Q^*_m(s'_m, \theta'_m) \Big]. \tag{B.8}$$

Hence, we have

$$Hq(s_m, \theta_m) = \sum_{s'_m} F(s_m, \theta_m, s'_m) \times \Big[ R(s_m, \theta_m, s'_m) + \delta \max_{\theta'_m} q(s'_m, \theta'_m) \Big]. \tag{B.9}$$

To obtain (B.7), we make the following calculations in (B.10). Note that the definition of $q$ is used in (a), while (b) and (c) follow from properties of absolute value inequalities. Moreover, (d) comes from the definition of the infinity norm and (e) is based on the maximum calculation.