Energy Efficient 3-D UAV Control for Persistent Communication Service and Fairness: A Deep Reinforcement Learning Approach

Recently, unmanned aerial vehicles (UAVs) serving as flying wireless communication platforms have attracted much attention. Benefiting from their mobility, UAV aerial base stations can be deployed quickly and flexibly, and can effectively establish Line-of-Sight (LoS) communication links. However, UAV communication systems face several challenges. The first is the energy constraint: UAV battery lifetime is on the order of a fraction of an hour. The second is that the coverage area of a UAV aerial base station is limited and commercial UAVs are usually expensive, so covering a large target region at all times with sufficient UAVs is quite challenging. To address these challenges, in this paper we propose energy efficient and fair 3-D UAV scheduling with energy replenishment, where UAVs move around to serve users and recharge in time to replenish energy. Inspired by the success of deep reinforcement learning, we propose a UAV Control policy based on Deep Deterministic Policy Gradient (UC-DDPG) to address the combined problem of 3-D mobility of multiple UAVs and energy replenishment scheduling, which ensures energy efficient and fair coverage of each user in a large region and maintains persistent service. Simulation results reveal that UC-DDPG converges well and outperforms other scheduling algorithms in terms of data volume, energy efficiency and fairness.

The authors in [6]-[8] optimized the placement and trajectory of a single UAV to maximize coverage and throughput with minimum transmission power. The works in [9]-[12] further studied the optimal 3-D locations of multiple UAVs to maximize downlink coverage performance with minimum transmission power. In [13], the authors determined the optimal 3-D trajectories of multiple UAVs with minimum energy consumption to adapt to the time-varying nature of user density. Energy-aware control protocols [14] minimize unnecessary maneuvers of mobile devices, and energy-aware network layer protocols that reduce battery consumption or conserve power are extensively surveyed in [5]. However, the above works do not consider the energy replenishment of UAVs, which is inevitable in practice. Up to now, works on energy replenishment through UAV scheduling have been quite limited. In [15], the authors presented deployment strategies for multiple UAVs to maximize the stationary coverage of a target region and guarantee service continuity through energy replenishment at ground charging stations. However, that work only addressed stationary coverage where the UAVs remain at fixed positions.
As we know, UAVs can work as BSs to provide wireless communication for ground users by carrying current wireless technologies such as LTE or Wi-Fi. On the other hand, each commercial UAV usually costs several thousand dollars. Thus, covering a large target region at all times with sufficient UAVs is quite challenging due to the limited communication range and relatively high cost, which is the second challenge. Therefore, having UAVs move around to ensure that each user is covered is of paramount importance. In [16], the authors proposed a deep reinforcement learning (DRL) framework to control the trajectories of a group of UAVs to achieve fair coverage while maintaining their connectivity with minimum energy consumption. However, that work assumes all UAVs fly at the same altitude. Besides, it does not consider on-board circuit power, communication power, 3-D deployment of aerial BSs or energy replenishment.
Different from the aforementioned existing works, which assume either 2-D or stationary UAV coverage, and inspired by the success of DRL, we propose a UAV Control policy based on the Deep Deterministic Policy Gradient (DDPG) algorithm [17] (UC-DDPG) to address the combined problem of 3-D mobility of multiple UAVs and energy replenishment scheduling, which ensures energy efficient and fair coverage of each ground user in a large target region while maintaining persistent service.
The main contributions of this paper are listed as follows:
• Detailed UAV communication system models are built, including the channel model, data rate model and energy model.
• In contrast with other papers, this paper takes UAV battery lifetime and energy replenishment into account, so as to maintain persistent service.
• In order to improve energy efficiency and guarantee service fairness, we develop a 3-D UAV deployment scheduling algorithm based on the DDPG algorithm, which takes the residual energy of each UAV, circuit power, communication power, mobility power and hover power into account.
The rest of the paper is organized as follows. Section II presents the related works. Section III presents the system models and problem definition. The preliminaries of DDPG are introduced in Section IV. The proposed UC-DDPG is described in detail in Section V. In Section VI, the convergence and performance of the proposed algorithm are verified by numerical results. Finally, Section VII concludes the paper.

II. RELATED WORK
A. UAV 3-D DEPLOYMENT AND COVERAGE
Recently, many studies have optimized the deployment of UAV BSs with respect to coverage range, the number of active UAVs and transmit power. In [6], the authors optimized the UAV position to satisfy the rate requirements of users in an entire high-rise building with minimum total transmit power. In [18] and [19], the hovering altitude of the UAV is determined to maximize the radio coverage on the ground. An optimal placement of multiple UAVs is further investigated in [20] to maximize the number of covered users in the target region. Similarly, the studies in [9] and [21] investigate the optimal 3-D placement of UAVs to maximize coverage while minimizing the transmit power of the UAVs. In [22] and [23], the authors minimize the number of UAVs that must be deployed to cover all ground terminals in the target region. The work in [10] proposed a framework to achieve energy-efficient uplink data collection from ground IoT devices by jointly optimizing the 3-D placement, device-UAV association and uplink power control in a single time slot. It then optimizes the UAVs' trajectories by allowing them to dynamically update their locations depending on the time-varying device activation process, minimizing the total energy the UAVs consume while updating their locations. In [13], the optimal placement of multiple UAVs, i.e. altitude and coverage radius, is derived for a single time slot when the transmit power of each UAV equals its on-board circuit power. The optimal placement updating problem over multiple time slots is then addressed to achieve near-minimal energy consumption in polynomial time.

B. UAV ENERGY MANAGEMENT
The problem of prolonging UAV working time has been extensively studied in the existing literature, including energy-based protocols [7], [8], [14], [24], [25] that reduce the UAV's energy consumption, and replenishment strategies [15], [26]-[29] that leverage charging infrastructure on the ground. The authors in [7] studied the optimization of the throughput of a relay-based UAV system by jointly controlling the UAV trajectory as well as the source/relay transmit power. Later, the authors extended the work in [7] to optimize the energy efficiency of the relay-based UAV system by optimizing the UAV's trajectory [8].
In [14], the authors illustrate that an energy-efficiency-based protocol minimizes unnecessary maneuvers, which can be implemented by carefully controlling the movement of the UAVs and optimizing the communication strategies with minimum energy expenditure. In [24], the authors propose passive scanning for the mobiles and periodic beaconing for UAVs acting as access points, where cooperative game theory is used to provide effective coverage for mobile users. The authors in [25] design an energy efficient traveling path algorithm considering the peculiar features of UAVs, such as available energy, weight and maximum speed.
The authors in [26] and [27] study the continuous coverage problem for mobile targets, where [26] properly allocates the charging slots for replenishing energy while [27] replaces a UAV that runs out of energy with a new one during the coverage process. Shakhatreh et al. [28] extend the model in [26] to the scenario with multiple UAVs. Considering the on-board circuit power and mobility power, UAV control for scheduling flying or recharging has been investigated to guarantee persistent coverage of a target area by exploiting fixed terrestrial charging infrastructure on the ground [15], [29]. The works in [30], [31] describe the design of reliable charging stations for UAVs. In [30], the ground charging station is designed to achieve a reliable recharging process, while a guidance system enabling the UAV to land on a charging station is described in [31].
The extraction of energy from environmental forces has also been applied in UAV systems. In [32], [33] and [34], the authors plan a path for the UAV to extend its flight duration by exploiting wind energy, where [34] considers the uncertainty of the wind field and its variation over time. The authors in [35], [36] study wireless power transfer techniques that use radio frequency signals to charge the UAVs. Similarly, laser energy harvesting systems are expected to efficiently prolong the UAV's flight duration, where a laser transmitter sends laser beams to charge a fixed-wing UAV in flight [37]-[39]. In [40], the authors present a rotational energy harvester using a brushless Direct-Current (DC) generator to harvest ambient energy from the propellers of the UAV in order to prolong its flight duration.

C. DEEP REINFORCEMENT LEARNING IN WIRELESS NETWORK
DRL has recently attracted much attention in the wireless communication field and is used to solve various problems [49]. In [50], an artificial intelligence framework (AIF) for smart wireless network management was proposed. DCRQN [51], a novel Wi-Fi handoff management scheme based on Deep Q-Network (DQN) [52], effectively improves the data rate during the handoff process. In [53], the authors presented DeepNap, which uses a DQN to learn effective BS sleeping policies and reduces the energy consumption of Wi-Fi networks. In [54], a novel channel allocation algorithm based on DRL was presented, which improves spectrum efficiency and decreases the co-channel interference for multi-beam satellite systems. The authors in [55] proposed to use DRL to obtain the optimal interference alignment (IA) user selection policy in cache-enabled opportunistic IA networks. In [56], a DRL approach was proposed to maximize the channel utility for multi-user wireless networks with less computation and limited observations. In UAV communication networks, a DRL framework for multi-user access control is proposed in [57], which effectively improves system throughput. Recently, a Q-learning [59] based framework [58] was proposed for quality of experience (QoE) driven deployment and movement of UAV-BSs, and shows good performance and low complexity.

III. SYSTEM MODELS AND PROBLEM DEFINITION
In this section, we first introduce the system models of the paper, including the scenario, channel model, data rate model and energy model. Then, the problem definition of energy efficient 3-D UAV control for persistent communication service and fairness is presented.

A. SCENARIO
We consider a rectangular geographical area of size a × b m², as shown in Figure 1, within which a set K = {1, 2, . . . , K} of K ground users are distributed. In this system, a set A = {1, 2, . . . , N} of N rotary wing UAVs is deployed to provide communication coverage to the ground users in the target area. Because the number of UAVs is limited, the users cannot be completely covered by hovering UAVs; as a result, the UAVs should move around to provide service for all users. The total service time is T. The locations of user k ∈ K and UAV i ∈ A are given by (x_k, y_k) and (x_i, y_i, h_i), respectively, where h_i ∈ [h_min, h_max], with h_min and h_max the minimum and maximum allowed heights of a UAV. A charging station S_E is located at the center of the plane at altitude h_min, and it can recharge a UAV's battery at a speed of C_{S_E} Watt. We assume all UAVs start with fully charged batteries, and the battery capacity is E_max. Let θ be the angle of the sensing cone. With height h_i, the radius of the coverage area can be given by r_i = h_i tan(θ/2) [15], [18]. For simplicity, T is divided into consecutive time slots {t_0, t_1, . . . , t_end} of equal length t_slot, and there is a control center which collects information from the UAVs and commands them. In t_j, UAV i can fly, hover, serve users or replenish energy, as determined by the commands of the control center. The users within the coverage of UAV i can be served by it simultaneously. The users are assigned to different channels, each of bandwidth B, so there is no interference between them. If a user is covered by multiple UAVs, it connects to the first UAV that provides communication service. The residual energy of UAV i at the beginning of time slot t_j is denoted by E_{i,t_j}. While a UAV replenishes energy at charging station S_E, it does not serve users.
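The cone coverage geometry above can be sketched as follows; this is a minimal illustration assuming the common cone model r = h · tan(θ/2), with function and parameter names of our own choosing:

```python
import math

def coverage_radius(h, theta_deg):
    """Radius (m) of the coverage disk of a UAV hovering at height h (m),
    given a sensing-cone angle theta (degrees): r = h * tan(theta / 2)."""
    return h * math.tan(math.radians(theta_deg) / 2.0)

def is_covered(user_xy, uav_xyh, theta_deg):
    """True if the ground user lies inside the UAV's coverage disk."""
    (xk, yk), (xi, yi, hi) = user_xy, uav_xyh
    return math.hypot(xk - xi, yk - yi) <= coverage_radius(hi, theta_deg)
```

For example, with a 90-degree cone a UAV at 50 m covers a disk of radius 50 m.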

B. CHANNEL MODEL
1) AIR-TO-GROUND PATH LOSS MODEL
According to [18], the air-to-ground (A2G) path loss model can be characterized into LoS and non-Line-of-Sight (NLoS) links, given respectively by
PL_{LoS}^{ik} = 20 log_10(4π f_c d_ik / c) + η_LoS,
PL_{NLoS}^{ik} = 20 log_10(4π f_c d_ik / c) + η_NLoS,
where f_c is the carrier frequency; d_ik = sqrt((x_i − x_k)² + (y_i − y_k)² + h_i²) is the distance between UAV i and user k; c is the speed of light; and η_LoS and η_NLoS are the mean values of the excess path loss on top of free space for LoS and NLoS links, determined by the environment (suburban, urban, dense urban, highrise urban or others).
For A2G communications, each transmitter-receiver pair will typically have a LoS link with a given probability, which depends on the environment, the locations of the user and the UAV, and the elevation angle [18]. The LoS probability is given by [18], [20]
P_{LoS}^{ik}(ϑ_ik) = 1 / (1 + ψ exp(−ζ(ϑ_ik − ψ))),
where ψ and ζ are constants which depend on the environment and ϑ_ik = (180/π) arcsin(h_i / d_ik) is the elevation angle. Note that the NLoS probability is P_{NLoS}^{ik}(ϑ_ik) = 1 − P_{LoS}^{ik}(ϑ_ik). Therefore, the average path loss of the A2G channel can be expressed as
PL^{ik} = P_{LoS}^{ik}(ϑ_ik) PL_{LoS}^{ik} + P_{NLoS}^{ik}(ϑ_ik) PL_{NLoS}^{ik}.
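The A2G model above can be sketched numerically as follows; this is a minimal sketch assuming the standard sigmoid LoS-probability form of [18], with helper and parameter names of our own choosing:

```python
import math

C = 3.0e8  # speed of light (m/s)

def avg_a2g_path_loss_db(d, h, fc, eta_los, eta_nlos, psi, zeta):
    """Average A2G path loss (dB) for UAV-user distance d (m) and UAV
    height h (m): free-space loss plus the environment-dependent excess
    losses, weighted by the elevation-dependent LoS probability."""
    fspl = 20.0 * math.log10(4.0 * math.pi * fc * d / C)  # free-space term
    theta = math.degrees(math.asin(h / d))                # elevation angle (deg)
    p_los = 1.0 / (1.0 + psi * math.exp(-zeta * (theta - psi)))
    return p_los * (fspl + eta_los) + (1.0 - p_los) * (fspl + eta_nlos)
```

With the urban parameters used later in the paper (ψ = 12.08, ζ = 0.11, η_LoS = 1.6 dB, η_NLoS = 23 dB), a user directly under the UAV sees a nearly pure LoS loss, while a distant user's loss tends toward the NLoS value.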

2) SMALL-SCALE FADING CHANNEL MODEL
We consider small-scale channel fading following a Rician distribution [42]. A Rician distribution is an adequate choice due to the possible combination of a LoS component and multiple scatterers experienced at the receiver. The complex channel gain between a UAV and its user is denoted by g_ik. The probability density function (PDF) of |g_ik| can be expressed as [41]
f_{|g_ik|}(x) = (2(K+1)x / Ω) exp(−K − (K+1)x²/Ω) I_0(2x sqrt(K(K+1)/Ω)),
where I_0(·) is the zero-order modified Bessel function of the first kind; K is the Rician factor, defined as the ratio of the power in the LoS component to the power in the NLoS multiple scatterers; and Ω is the average fading power, where Ω = 1.
Based on [43], the Rician factor can be modeled as a non-decreasing function of ϑ_ik, expressed as
K(ϑ_ik) = κ_0 exp((2ϑ_ik/π) ln(κ_{π/2}/κ_0)),
where κ_0 = K(0) and κ_{π/2} = K(π/2). The expectation of K can then be estimated according to [44].
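The Rician fading described above can be sketched as follows; the exponential K(ϑ) interpolation and the function names are our reconstruction, and the sampler simply adds a fixed LoS phasor to complex Gaussian scatter:

```python
import math, random

def rician_k(theta, k0, k_half_pi):
    """Rician factor as an exponential function of the elevation angle
    theta (radians), matched to K(0) = k0 and K(pi/2) = k_half_pi."""
    a = (2.0 / math.pi) * math.log(k_half_pi / k0)
    return k0 * math.exp(a * theta)

def sample_rician_gain(k, omega=1.0, rng=random):
    """Draw one small-scale amplitude |g| with Rician factor k and average
    fading power omega: LoS amplitude plus complex Gaussian scatter, so that
    E[|g|^2] = omega."""
    s = math.sqrt(k * omega / (k + 1.0))          # LoS amplitude
    sigma = math.sqrt(omega / (2.0 * (k + 1.0)))  # per-dimension scatter std
    re = s + rng.gauss(0.0, sigma)
    im = rng.gauss(0.0, sigma)
    return math.hypot(re, im)
```

The sampler's normalization keeps the average fading power at Ω regardless of K, matching the Ω = 1 convention above.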
C. DATA RATE MODEL
Assume the transmit power allocated to the interested user k is P_t. The instantaneous signal-to-noise ratio (SNR) between UAV i and user k can be modeled as
γ_{ik}^{SNR} = P_t |g_ik|² / (PL^{ik} σ²),
where σ² is the noise power and PL^{ik} is taken in linear scale. The PDF of γ_{ik}^{SNR} is obtained by introducing a change of variables in the expression for the PDF f_{|g_ik|}(x), where γ̄ = P_t Ω / (PL^{ik} σ²) is the average SNR [41]. Based on the Shannon-Hartley theorem, the average data rate between UAV-BS i and user k is R_ik = B E[log_2(1 + γ_{ik}^{SNR})], which can be approximated by the second-order expansion [45]
R_ik ≈ B (log_2(1 + γ̄) − (E[(γ_{ik}^{SNR})²] − γ̄²) / (2 ln 2 (1 + γ̄)²)).
To evaluate this approximation, the second moment of γ_{ik}^{SNR} is required, which for Rician fading with factor K is E[(γ_{ik}^{SNR})²] = γ̄² (K² + 4K + 2)/(K + 1)² [45].

D. ENERGY MODEL
The total energy consumption of the UAV network includes communication energy and propulsion energy. The communication energy accounts for radiation, signal processing and other circuitry, while the propulsion energy keeps the UAV aloft and supports its mobility; it differs according to the UAV's flying state. In this subsection, the energy models, including communication energy, hover energy and mobility energy, are presented.
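The second-order rate approximation of the data rate model can be sketched as follows; this is a minimal sketch assuming Ω = 1 and the Rician second moment E[γ²] = γ̄²(K² + 4K + 2)/(K + 1)², with function and parameter names of our own choosing:

```python
import math

def avg_rate_bps(bandwidth, p_t, path_loss_db, noise_power, k):
    """Second-order approximation of the average Shannon rate (bit/s) over
    Rician fading with factor k, given transmit power p_t (W), average
    path loss (dB) and noise power (W)."""
    gamma_bar = p_t / (10.0 ** (path_loss_db / 10.0) * noise_power)  # mean SNR
    # Second moment of the SNR under Rician fading (Omega = 1):
    second = gamma_bar ** 2 * (k * k + 4.0 * k + 2.0) / ((k + 1.0) ** 2)
    correction = (second - gamma_bar ** 2) / (
        2.0 * math.log(2) * (1.0 + gamma_bar) ** 2)
    return bandwidth * (math.log2(1.0 + gamma_bar) - correction)
```

By Jensen's inequality the result stays below B·log2(1 + γ̄), and a larger K (stronger LoS, less fading variance) shrinks the correction term.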

1) COMMUNICATION ENERGY
Assume the on-board circuit power is P_cu. Since the UAVs fly over the target area to serve users, the communication time of UAV i depends on the control policy. Let t_com denote the duration that UAV i communicates with users and n_{i,t_j} denote the number of users served by UAV i in t_j. Then, at t_j, the communication energy of UAV i can be given by
E_i^C = (n_{i,t_j} P_t + P_cu) t_com. (13)

2) HOVER ENERGY
According to [47], [48], the hover energy consumption of a UAV can be derived from the power consumption of a multirotor helicopter, which is approximately linearly proportional to the weight of its battery and payload. The hover power in Watt is given by [48]
P_hover = G^{3/2} / sqrt(2Mρπβ²), (14)
where M is the number of rotors of the helicopter; G = (W + m)g is the thrust in Newton, given the frame weight W in kg, the battery and payload weight m in kg, and gravity g in N/kg; ρ is the fluid density of air in kg/m³; and β is the rotor disk radius in m. Therefore, the energy consumed by UAV i hovering at t_j can be computed as
E_i^H = P_hover t_hover, (15)
where t_hover denotes the duration that UAV i hovers in t_j.
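The communication and hover energy terms can be sketched as follows; this is an illustrative sketch assuming the actuator-disk interpretation in which the thrust G is split evenly over the M rotors, with names of our own choosing:

```python
import math

def hover_power(num_rotors, frame_kg, payload_kg, rotor_radius,
                rho=1.225, g=9.8):
    """Hover power (W) of a multirotor: total thrust G = (W + m)g split over
    M rotors, each drawing T^(3/2) / sqrt(2 * rho * pi * beta^2)."""
    thrust = (frame_kg + payload_kg) * g
    per_rotor = thrust / num_rotors
    disk = math.pi * rotor_radius ** 2
    return num_rotors * per_rotor ** 1.5 / math.sqrt(2.0 * rho * disk)

def communication_energy(p_t, p_circuit, users_served, t_com):
    """Communication energy (J): per-user transmit power for users_served
    users plus the on-board circuit power, over t_com seconds."""
    return (users_served * p_t + p_circuit) * t_com
```

For example, a 1.5 kg quadrotor with 0.2 m rotors hovers at roughly 50 W under this model, and heavier payloads raise the hover power superlinearly.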

3) MOBILITY ENERGY
Let P_h, P_a and P_d denote the mobility power in the horizontal direction, the ascending power and the descending power, respectively. Similarly, v_h, v_a and v_d represent the velocity in the horizontal direction, the ascending velocity and the descending velocity. Assume UAV i updates its location in the considered area at t_j. Then, the mobility energy of UAV i at t_j can be given by [13]
E_i^M = P_h d(i,t_j)/v_h + [P_a I(Δh(i,t_j)) + P_d (1 − I(Δh(i,t_j)))] |Δh(i,t_j)|/v_a, (16)
where d(i,t_j) and Δh(i,t_j) are the horizontal moving distance and the variation of the height of UAV i at t_j, respectively. The effective horizontal and vertical (ascending or descending) velocities are v_h = υ sin ϕ and v_a = v_d = υ cos ϕ with ϕ = arctan(d(i,t_j)/|Δh(i,t_j)|), where υ denotes the velocity of the UAV. I(Δh(i,t_j)) is the indicator function, which can be expressed as
I(Δh(i,t_j)) = 1 if Δh(i,t_j) > 0, and 0 otherwise. (17)
The power consumption in the horizontal direction can be given by [8], [10]
P_h = P_p + P_I, (18)
where P_p is the parasitic power for overcoming the parasitic drag due to the aircraft's skin friction, form drag, etc., which can be given by [10], [46] in terms of the drag coefficient C_{D_0}, the rotor chord c_b, the reference area S (the frontal area of the UAV) and the angular velocity w.
P_I is the induced power for overcoming the lift-induced drag due to the wings redirecting air to generate the lift that compensates the aircraft's weight, and can be given according to [8], [10]. Similarly, P_a and P_d can be given by [46], respectively.
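The mobility-energy bookkeeping of equation (16) can be sketched as follows; this is a minimal sketch treating P_h, P_a and P_d as given constants and using the effective-velocity decomposition described above (function and parameter names are ours):

```python
import math

def mobility_energy(p_h, p_a, p_d, speed, dist_xy, dh):
    """Mobility energy (J) for one location update: horizontal power over the
    horizontal leg plus ascending/descending power over the vertical leg,
    with effective speeds v_h = v*sin(phi) and v_vert = v*cos(phi)."""
    if dist_xy == 0.0 and dh == 0.0:
        return 0.0
    phi = math.atan2(dist_xy, abs(dh))   # angle measured from the vertical
    v_h, v_vert = speed * math.sin(phi), speed * math.cos(phi)
    energy = p_h * dist_xy / v_h if dist_xy > 0 else 0.0
    if dh > 0:
        energy += p_a * dh / v_vert      # ascending leg
    elif dh < 0:
        energy += p_d * (-dh) / v_vert   # descending leg
    return energy
```

Both legs share the same travel time (total displacement divided by υ), which is what the sin/cos decomposition encodes.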

E. PROBLEM DEFINITION
We aim to design a control algorithm which commands how each UAV acts in each time slot. The targets of the control algorithm include: 1) providing persistent service over T; 2) maximizing the communication data volume; 3) minimizing the UAVs' energy consumption; 4) guaranteeing fairness among the users. For the first target, because the battery lifetime of a UAV is much less than T, a replenishment policy should be designed to guarantee persistent service. To achieve the second objective, intuitively, appropriate communication locations should be found where the channel environment is good and as many users as possible are covered. For the third target, the flight paths of the UAVs should be designed carefully to avoid needless energy consumption, e.g., a UAV moving to a place without any user. Lastly, for the sake of fairness, the UAVs should serve all users as evenly as possible, rather than only a subset of them. In summary, this is a quite sophisticated task and traditional optimization algorithms are unsuitable. Recently, DRL has received extensive attention in the field of wireless communication [49]. DRL can learn the best policy by interacting with the environment in real time, requiring only minimal prior knowledge, which makes it well suited to designing the UAV control algorithm. DDPG, a state-of-the-art DRL algorithm, shows good performance on complex tasks [17]. In the following sections, we present the control algorithm based on DDPG in detail.

IV. PRELIMINARIES ON DDPG
This section gives a brief description of reinforcement learning (RL) and DDPG; for a comprehensive presentation, please refer to [59] and [17]. Figure 2 shows the basic form of RL. RL is learning how to map states to actions so as to maximize a numerical reward. The learner, i.e. the agent, is not told which actions to take, but instead must discover which actions yield the most reward by trying them. The agent observes the state s of the environment and executes an action a according to the policy. Then, the agent receives a reward r and observes a new state s′. The above process is repeated until the end of the agent-environment interaction, and a complete interaction process is referred to as an episode. This information, (s, a, r, s′), is used to improve the agent's policy, and episodes are repeated until the policy converges to the optimal policy. Q-learning and SARSA [59] are the most common algorithms in RL. However, RL is unsuitable and inapplicable to complex tasks which have continuous, high dimensional state spaces or action spaces. DRL leverages deep neural networks (DNNs) in the learning process, thereby improving the learning speed and the performance of RL algorithms. DDPG [17], a model-free off-policy actor-critic DRL algorithm, can learn policies in continuous, high dimensional state and action spaces. As shown in Figure 3, the DDPG algorithm maintains a parameterized actor network µ(s | θ^µ) which specifies the current policy by deterministically mapping states to a specific action. The parameterized critic network Q(s, a | θ^Q) is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution J with respect to the actor parameters. Specially, experience replay and target networks, inspired by DQN [52], are introduced in DDPG to guarantee convergence.
In experience replay, a replay buffer of finite size is used to store the samples (s, a, r, s′). When the replay buffer is full, the oldest samples are discarded. The actor and critic are updated by sampling a minibatch randomly from the replay buffer. Experience replay breaks the correlations between samples and therefore reduces the variance of learning. For the target networks, copies of the actor and critic networks, µ′(s | θ^{µ′}) and Q′(s, a | θ^{Q′}), are created. They are used to calculate the target values, and their weights are updated by having them slowly track the original networks: θ′ ← τθ + (1 − τ)θ′ with τ ≪ 1, which greatly improves the stability of learning.
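The soft target update θ′ ← τθ + (1 − τ)θ′ can be sketched as follows; this is a minimal framework-agnostic illustration operating on flat lists of weights:

```python
def soft_update(target_params, online_params, tau):
    """Polyak-average the online network weights into the target network:
    theta' <- tau * theta + (1 - tau) * theta', with tau << 1 (e.g. 0.001),
    so the targets slowly track the trained networks."""
    return [tau * w + (1.0 - tau) * wt
            for w, wt in zip(online_params, target_params)]
```

Repeated application makes the target weights converge geometrically toward the online weights, which is what stabilizes the bootstrapped targets.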

V. UAV CONTROL BASED ON DDPG
In this section, we design a UAV control policy based on DDPG. In this problem, the agent is the control center, and UC-DDPG is implemented in the control center. The three basic elements (state, action and reward) of RL are designed as follows.

A. STATE
In time slot t_j, the state s_j is defined as
s_j = {(x_i, y_i, h_i), E_{i,t_j}, data_{k,t_j} | i ∈ A, k ∈ K},
where (x_i, y_i, h_i) denotes the location of UAV i at the beginning of t_j, E_{i,t_j} denotes the residual energy of UAV i at the beginning of t_j, and data_{k,t_j} denotes the accumulative data volume received by user k before t_j. As shown above, s_j is a vector of size 4N + K consisting of three parts: the locations of all UAVs, the residual energy of each UAV, and the accumulative received data volume of each user. Specially, all elements of s_j are normalized to accelerate learning. In detail, x_i, y_i, h_i and E_{i,t_j} are divided by their corresponding maxima, i.e. a, b, h_max and E_max, and data_{k,t_j} is divided by Σ_k data_{k,t_j}, the total data volume received by all users.
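The state construction above can be sketched as follows; a minimal sketch with names of our own choosing (the guard for the zero-data corner case is our assumption, as the paper does not specify it):

```python
def build_state(uav_xyh, uav_energy, user_data, a, b, h_max, e_max):
    """Assemble the normalized state vector of size 4N + K: UAV positions
    scaled by the area and height bounds, residual energies scaled by E_max,
    and per-user data volumes scaled by the total received volume."""
    s = []
    for (x, y, h), e in zip(uav_xyh, uav_energy):
        s += [x / a, y / b, h / h_max, e / e_max]
    total = sum(user_data) or 1.0  # avoid division by zero before any service
    s += [d / total for d in user_data]
    return s
```

All entries then lie in [0, 1], and the data part sums to one, which keeps the network inputs on comparable scales.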

B. ACTION
The action a_j is
a_j = {(ϕ_i, φ_i, d_i) | i ∈ A},
where ϕ_i ∈ [0, π] and φ_i ∈ [0, 2π) are the polar angle and the azimuthal angle of UAV i's flight direction, respectively, and d_i ∈ [0, d_max] is the flight distance, with d_max the largest allowed flight distance. At the start of t_j, each UAV flies according to action a_j with a fixed velocity v. If a UAV flies off the border, it stays at the border. After the flight, in the remaining time of t_j, if the UAV is not at the charging station, it hovers and provides communication service for covered users; otherwise, the UAV charges until the end of t_j.
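The action semantics can be sketched as follows; a minimal sketch assuming the usual spherical-coordinate convention (polar angle from the vertical axis) and clipping to model the "stay at the border" rule:

```python
import math

def apply_action(pos, angles_dist, bounds):
    """Move one UAV by (polar angle, azimuthal angle, distance) and clip the
    result to the area and height bounds."""
    (x, y, h), (pol, azi, d) = pos, angles_dist
    a, b, h_min, h_max = bounds
    x += d * math.sin(pol) * math.cos(azi)
    y += d * math.sin(pol) * math.sin(azi)
    h += d * math.cos(pol)
    return (min(max(x, 0.0), a), min(max(y, 0.0), b),
            min(max(h, h_min), h_max))
```

A polar angle of π/2 keeps the flight horizontal; angles near 0 or π spend the distance budget on climbing or descending.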

C. REWARD
The reward r_j is calculated at the end of the time slot t_j and is designed as
r_j = 7 · JFI · data_{t_j}/E_{t_j} + Σ_i I_i · 10(3/5 − Ē_{i,t_j}), if every UAV has residual energy larger than 0 at the end of t_j;
r_j = −20, otherwise. (25)
If the residual energy of every UAV at the end of t_j is larger than 0, the first line is used to calculate the reward. Therein, E_{t_j} denotes the energy consumed by all UAVs in t_j, and data_{t_j} represents the data volume received by all users in t_j. I_i is the indicator function, which equals 1 if UAV i charges in t_j and 0 otherwise. Ē_{i,t_j} denotes the normalized residual energy of UAV i at the beginning of t_j and is equal to E_{i,t_j}/E_max. JFI is Jain's Fairness Index, which is used to estimate the fairness and is defined by
JFI = (Σ_k data_{k,t_j})² / (K Σ_k data_{k,t_j}²),
where JFI ∈ [1/K, 1]. The fairer the service is, the larger JFI is. The first part of the first line, i.e. 7 · JFI · data_{t_j}/E_{t_j}, can be interpreted as fairness times energy efficiency: the larger the energy efficiency and the fairness are, the larger this part is. The second part, i.e. Σ_i I_i · 10(3/5 − Ē_{i,t_j}), is used to stimulate the agent to learn a replenishment policy. It is expected that a UAV replenishes energy when its energy is low rather than high, which is why this part is developed. If Ē_{i,t_j} is less than 60%, the value is positive; otherwise, it is negative.
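The reward shaping above can be sketched as follows; the weights 7 and 10, the 60% threshold and the −20 penalty follow the values quoted in the text, and the zero-data convention in the fairness index is our assumption:

```python
def jain_fairness(data):
    """Jain's Fairness Index over per-user data volumes: ranges from 1/K
    (one user gets everything) up to 1 (perfectly even service)."""
    total = sum(data)
    if total == 0.0:
        return 1.0 / len(data)  # convention for the no-service corner case
    return total ** 2 / (len(data) * sum(d * d for d in data))

def reward(data, e_slot, norm_res_energy, charging, penalty=-20.0,
           w_eff=7.0, w_chg=10.0, thresh=0.6):
    """Slot reward: fairness times energy efficiency, plus a charging term
    that is positive only when a charging UAV is below thresh energy; the
    flat penalty applies if any UAV has run out of energy."""
    if min(norm_res_energy) <= 0.0:
        return penalty
    shaping = sum(w_chg * (thresh - e)
                  for e, c in zip(norm_res_energy, charging) if c)
    return w_eff * jain_fairness(data) * (sum(data) / e_slot) + shaping
```

Charging at high residual energy is thus penalized, which steers the agent toward recharging only when needed.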
Otherwise, i.e. when some UAV runs out of energy in t_j, the second line is used: the agent receives the ''penalization'', which is −20 in our implementation.

D. UC-DDPG
The pseudo-code of UC-DDPG is presented in Algorithm 1. In the first place, we randomly initialize the actor network µ and the critic network Q, and build the target networks Q′ and µ′. A replay buffer RB with a fixed size is also built.

Algorithm 1 UC-DDPG
1: Randomly initialize actor network µ(s | θ^µ) and critic network Q(s, a | θ^Q); initialize target networks µ′ and Q′ with θ^{µ′} ← θ^µ and θ^{Q′} ← θ^Q; build replay buffer RB
2: for each episode do
3: Randomly initialize the positions of all UAVs
4: Initialize the UAVs with fully charged batteries
5: Initialize the random noise N
6: for t_j = t_0, . . . , t_end do
7: Select action a_j = µ(s_j | θ^µ) + N
8: Distribute the actions to the UAVs and execute them
9: Calculate the energy consumption and the received data volume
10: Calculate the reward r_j
11: Observe the new state s_{j+1}
12: Store the tuple (s_j, a_j, r_j, s_{j+1}) in RB
13: Sample a random minibatch of L tuples from RB
14: Set y_l = r_l + γ Q′(s_{l+1}, µ′(s_{l+1} | θ^{µ′}) | θ^{Q′})
15: Calculate loss = (1/L) Σ_l (y_l − Q(s_l, a_l | θ^Q))²
16: Update the critic network by minimizing loss
17: Update the actor policy using the sampled policy gradient
18: Update the target networks: θ′ ← τθ + (1 − τ)θ′
19: if there is a UAV without power then
20: break
21: end if
22: end for
23: end for
At the start of each episode, the positions of all UAVs are randomly initialized and all UAVs have fully charged batteries. A random noise N, used to balance exploration and exploitation, is initialized. In the early stage, the policy is far from optimal and various actions need to be explored; as the algorithm iterates, the policy gradually converges, so exploration should decrease and exploitation increase. In our implementation, we use Gaussian noise with mean 0 and variance var. The value of var is 5 at the beginning and is multiplied by 0.995 after each time slot. An episode terminates when some UAV runs out of energy or when the length of the episode exceeds T.
At the beginning of time slot t_j, the agent determines the actions a_j according to the actor network µ, the current state s_j and the Gaussian noise N, where a_j = µ(s_j) + N. The actions are then distributed to the UAVs, which execute them. The corresponding flight energy consumption is calculated according to equation (16). During the flight, a UAV does not serve users. After the flight, if the UAV is not at the charging station, it hovers and provides communication service for users in the remaining time of t_j; otherwise, it charges until the end of t_j. At the end of t_j, the communication energy consumption, hover energy consumption, charging energy and data volume are calculated according to the system model in Section III. Then, the reward r_j is calculated according to equation (25) and the new state s_{j+1} is observed. The tuple (s_j, a_j, r_j, s_{j+1}) is stored in the replay buffer RB. Next, a minibatch of size L is randomly sampled from RB. loss is calculated from the target networks µ′ and Q′, the critic network Q and the samples in the minibatch, as shown in lines 15-16. The critic network parameters are updated by minimizing loss (line 16) and the actor network parameters are updated by the policy gradient (line 17). Finally, the target networks µ′ and Q′ are updated by slowly tracking the original networks, as shown in line 18.
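The noisy action selection with decaying exploration can be sketched as follows; a minimal sketch assuming var is used as the standard deviation of the Gaussian noise and the 0.995 per-slot decay described above:

```python
import random

def noisy_action(actor, state, var, rng=random):
    """Perturb the actor's deterministic output with zero-mean Gaussian
    exploration noise of scale var, then decay var by 0.995 for the next
    slot (var starts at 5 in the paper's implementation)."""
    action = [a + rng.gauss(0.0, var) for a in actor(state)]
    return action, var * 0.995
```

After ~120 slots (one full episode of T) the scale has decayed to about 5 · 0.995^120 ≈ 2.7, gradually shifting the agent from exploration to exploitation.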

VI. SIMULATION AND PERFORMANCE EVALUATION
In this section, we present simulations to evaluate the performance of UC-DDPG. The simulation runs are performed with TensorFlow 1.12 [60]. We consider users randomly located within a square area of size 400m × 400m. The minimum and maximum allowed heights of a UAV are 20m and 50m, respectively. The number of UAVs is 2, and we test the performance of UC-DDPG under three different user numbers (10, 15 and 20). The charging station is a cuboid of size 20m × 20m × 15m. The UAVs communicate in an urban environment with ψ = 12.08, ζ = 0.11, η_LoS = 1.6 dB and η_NLoS = 23 dB at a 2GHz carrier frequency [20]. The size of the replay buffer RB is 30000 and the minibatch size L is 64. The total service time T is 2 hours and the time slot t_slot is 60 seconds; therefore, there are 120 time slots in T. The main simulation parameters are listed in Table 1. Both the actor network and the critic network are feed-forward fully connected neural networks, and their parameters are listed in Table 2 and Table 3.
In particular, Random flight and Hilbert curve flight are used as benchmarks.
Random flight. Random flight is very straightforward. In time slot t_j, for each UAV i, the flight direction polar angle ϕ_i, the azimuthal angle φ_i and the flight distance d_i are randomly selected in [0, π], [0, 2π) and [0, d_max], respectively. Similarly, if a UAV flies off the border, it stays at the border.
Hilbert curve flight. Hilbert curve flight is a traversal algorithm. The altitude of all UAVs is the same and fixed, 35m in our simulation. The UAVs fly along the 3rd-order Hilbert curve, as shown in Figure 4. The two UAVs start from points A and B, respectively. In each time slot, the UAVs fly a unit distance, which is 50m in our implementation. If a UAV reaches the endpoint, it doubles back. Specially, if the residual energy of a UAV is less than 20% of E_max at the start of a time slot, the UAV flies to the charging station and replenishes energy until its battery is full. Then the UAV returns to its original position and continues to serve users sequentially.
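The benchmark's traversal order can be generated with the classic index-to-coordinate Hilbert mapping; this is a standard construction (not from the paper) that yields the waypoint grid an order-3 curve visits:

```python
def hilbert_d2xy(order, d):
    """Map step index d along a Hilbert curve of the given order to (x, y)
    grid coordinates on a 2^order x 2^order grid; consecutive indices map
    to adjacent cells, so the sequence is a valid flight traversal."""
    x = y = 0
    t = d
    s = 1
    while s < 2 ** order:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y
```

Scaling each grid cell by the 50 m unit distance reproduces the waypoints of the order-3 curve over the 400m × 400m area.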

A. TRAINING
For each scenario with a different number of users, we train for 40000 episodes. We show the training process for the example of 10 users; the training processes in all scenarios are similar. The total reward of each episode and the length of each episode are shown in Figure 5 and Figure 6.
It can be observed that the Total Reward Per Episode (TRPE) is small and almost unchanged during the first 10000 episodes. The TRPE then fluctuates between the 10000th and 17000th episodes. Over the remaining episodes, it gradually increases and stabilizes. Figure 5 and Figure 6 correspond to each other. In the first 10000 episodes, the service time of each episode is short, which results in the small TRPE. Between the 10000th and 17000th episodes, the agent learns how to charge. Finally, the agent knows how to charge and can provide persistent service for 7200 seconds, which is the length of T. Interestingly, the agent increases the TRPE by first learning how to charge. As shown in Figure 6, the agent has learned to charge by the 17000th episode, but the TRPE at that point is still not high. Afterwards, the agent increases the TRPE by learning how to serve users fairly and energy efficiently.
After training, the flight trajectories of all UAVs, generated by the actor network µ, are shown in Figure 7. The agent appears to have learned a cyclic trajectory: all UAVs circle around and repeat a pattern of flying, serving and charging. We can also see that the UAVs do not fly to places without users, which avoids wasting energy. However, the agent did not learn to make the two UAVs cooperate effectively, for example by serving different users separately to further reduce flight energy consumption. Effective cooperation is left for future work.

B. ENERGY EFFICIENCY
Data volume and energy efficiency in the different scenarios are depicted in Figure 8 and Figure 9, where energy efficiency equals data volume divided by energy consumption. We use the actor network trained for 40000 episodes to determine actions. UC-DDPG has the best performance: compared with Hilbert curve flight, it achieves roughly twice the data volume and energy efficiency. As expected, Random flight performs worst. In Random flight, the UAV flies randomly and never replenishes energy, so it soon runs out of power. As a result, Random flight has the smallest data volume and the lowest energy efficiency. The data volume and energy efficiency of Random flight increase with the number of users, because the UAVs have a higher probability of covering users when there are more of them. Hilbert curve flight performs better than Random flight but worse than UC-DDPG. In Hilbert curve flight, the UAVs traverse all places, including places without any user, so they waste energy. Moreover, the positions from which Hilbert curve flight serves users may be energy inefficient due to poor channel conditions, which further degrades energy efficiency. In contrast, as shown in Figure 7, UC-DDPG learns to avoid places without users and to find better service positions, which increases the data volume and improves the energy efficiency.
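The energy-efficiency metric used in Figure 9 reduces to a simple ratio of accumulated quantities; a toy illustration (our own helper, not the paper's code) over per-slot records:

```python
def energy_efficiency(slot_data_bits, slot_energy_joules):
    """Energy efficiency as defined in the text: total data volume served
    divided by total energy consumed over the service time (bits/joule).
    Inputs are per-time-slot records of data and energy."""
    total_bits = sum(slot_data_bits)
    total_joules = sum(slot_energy_joules)
    return total_bits / total_joules
```

For example, serving 1, 2 and 3 Mbit in three slots at 100 J per slot gives 6e6 / 300 = 20000 bits per joule.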
C. FAIRNESS
Figure 10 shows the fairness in the different scenarios. Both Hilbert curve flight and UC-DDPG achieve good fairness, while Random flight is noticeably less fair. As expected, fairness shows no clear dependence on the number of users. Because the UAVs fly randomly in Random flight, its fairness is usually very low. Hilbert curve flight traverses all places and therefore always produces good fairness. In UC-DDPG, the JFI term in reward r_j and the normalized data volume in state s_j guide the agent to serve users fairly; after training, the agent has learned a fair service policy.
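The JFI referred to here is Jain's fairness index over the per-user served data volumes; a minimal sketch of the standard formula (function name is ours):

```python
def jain_fairness_index(volumes):
    """Jain's fairness index (JFI) over per-user served data volumes:
    (sum x)^2 / (n * sum x^2). It ranges from 1/n (one user gets
    everything) to 1 (perfectly equal service)."""
    n = len(volumes)
    numerator = sum(volumes) ** 2
    denominator = n * sum(x * x for x in volumes)
    return numerator / denominator if denominator > 0 else 0.0
```

Embedding this index in the reward penalizes trajectories that concentrate service on a few conveniently located users, which is why UC-DDPG approaches the fairness of an exhaustive traversal without its energy cost.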

VII. CONCLUSION
In this paper, we jointly studied the energy efficient, fair 3-D deployment and energy replenishment policy of multiple UAVs. First, we built detailed channel, data rate and energy models. Then, inspired by the success of DRL, we proposed a UAV control policy based on DDPG, a deep actor-critic algorithm. The state, action and reward of the RL formulation are carefully designed with energy efficiency, fairness and persistence in mind. Extensive training ensures the performance of UC-DDPG. Simulation results show that UC-DDPG converges well and outperforms the benchmark scheduling algorithms (Random flight and Hilbert curve flight) in terms of data volume, energy efficiency and fairness. In future work, we plan to use multi-agent DRL to improve the cooperation between UAVs.