Energy Minimization in UAV-Aided Networks: Actor-Critic Learning for Constrained Scheduling Optimization

In unmanned aerial vehicle (UAV) applications, the UAV's limited energy supply and storage have motivated the development of intelligent energy-conserving scheduling solutions. In this paper, we investigate energy minimization for UAV-aided communication networks by jointly optimizing data-transmission scheduling and UAV hovering time. The formulated problem is combinatorial and non-convex with bilinear constraints. To tackle the problem, we first provide an optimal relax-and-approximate solution and develop a near-optimal algorithm. Both proposed solutions serve as offline performance benchmarks but might not be suitable for online operation. To this end, we develop a solution from a deep reinforcement learning (DRL) perspective. Conventional RL/DRL, e.g., deep Q-learning, however, is limited in dealing with two main issues in constrained combinatorial optimization, i.e., an exponentially increasing action space and infeasible actions. The novelty of the solution development lies in handling these two issues. To address the former, we propose an actor-critic-based deep stochastic online scheduling (AC-DSOS) algorithm and develop a set of approaches to confine the action space. For the latter, we design a tailored reward function to guarantee solution feasibility. Numerical results show that, with computation time of the same order of magnitude, AC-DSOS provides feasible solutions and saves 29.94% energy compared with a conventional deep actor-critic method. Compared to the developed near-optimal algorithm, AC-DSOS consumes around 10% more energy but reduces the computational time from minute-level to millisecond-level.


I. INTRODUCTION
Unmanned aerial vehicles (UAVs) have attracted much attention for high-speed data transmission in dynamic, distributed, or plug-and-play scenarios, e.g., disaster rescue, live concerts, or sports events [1]. However, UAVs' limited endurance, energy supply, and storage become critical issues for their applications, which motivates the study of energy efficiency in UAV-aided communication networks. The UAV's energy consumption comes from two aspects: propulsion energy for flying and hovering, and communication energy for data transmission. The flying energy mainly depends on the UAV's velocity and trajectory [1]. The hovering energy is in general proportional to the hovering time. Compared to the propulsion energy, the communication energy is not negligible, e.g., considerable communication energy can be consumed in scenarios with high traffic requests from a large number of users. Thus, joint energy optimization for both parts is necessary and has attracted considerable attention in the literature [2]-[7]. The authors in [2], [3] maximized the energy efficiency, defined as the ratio between transmitted data and propulsion energy. In [4], the authors introduced a complete UAV energy model and proposed a user-timeslot scheduling method to minimize the sum of the propulsion energy and communication energy. Based on the energy model in [4], the authors of [5] formulated an energy minimization problem with latency constraints via trajectory design. The works in [2]-[5] adopted a time division multiple access (TDMA) mode, where the UAV serves one user per timeslot. Besides TDMA, space division multiple access (SDMA) enables simultaneous data transmission to multiple users, such that the hovering time and hovering energy can be reduced. In [6], the authors designed an SDMA-based beamforming scheme to minimize the total transmit power for multi-antenna UAVs. In [7], an energy efficiency maximization problem was investigated in an SDMA-based multi-antenna UAV network via optimizing the flying velocity and power allocation. However, serving multiple users simultaneously may lead to strong inter-user interference and may require more communication energy to fulfill users' demands.

The work has been supported by the ERC project AGNOSTIC (742648), by the FNR CORE projects ROSETTA (C17/IS/11632107), ProCAST (C17/IS/11691338) and 5G-Sky (C19/IS/13713801), and by the FNR bilateral project LARGOS (12173206).
S. Sun is with the Institute for Infocomm Research, Agency for Science, Technology, and Research, Singapore 138632 (e-mail: sunsm@i2r.astar.edu.sg).
Part of this paper has been presented at IEEE EuCNC, June 2020 [18].
Deterministic optimization algorithms, e.g., [2]-[7], might not be suitable for fast decision making in a dynamic wireless environment. To address this issue, deep learning-based solutions have been investigated in the literature. The authors in [8] applied a deep neural network (DNN) for UAV-enabled hybrid networks to efficiently predict the resource allocation scheme. In [9], a deep learning-based auction algorithm was proposed to determine a dynamic battery charging schedule for UAV-aided systems. Supervised learning, such as DNN, requires large amounts of training data, and generating these data offline is a non-trivial task [10]. Another category of studies is deep reinforcement learning (DRL), which has the following advantages. Firstly, DRL provides timely solutions that adapt to environment variations. Secondly, DRL integrates DNNs to make decisions and improve solution quality. Thirdly, DNN requires an offline data generation and training phase, whereas DRL requires less prior knowledge and is able to train online by exploring unknown environments and exploiting the received feedback. In [11], the authors applied a deep Q network (DQN) to design an energy-efficient flying trajectory scheme for UAV-aided networks. In general, DQN is used to deal with a relatively small and discrete action space, where the action space refers to the set of all possible decisions [12]. The authors in [13] designed a different deep Q-learning architecture for a high-dimensional action space, but it needs to evaluate all of the actions before making a decision, which is time-consuming.
Deep actor-critic is an emerging DRL method with fast convergence and the capability to deal with a large action space [14]. In [15], an actor-critic-based DRL (AC-DRL) algorithm was proposed to reduce the UAV's energy consumption and enhance the UAV's coverage of ground users via optimizing the UAV's flying direction and distance. In [16], the authors employed deep actor-critic to design a learning algorithm for UAV-aided systems, considering energy efficiency and users' fairness. Note that the AC-DRL in [15], [16] was developed for unconstrained problems. However, most of the problems in UAV systems are constrained and involve discrete variables. The conventional AC-DRL algorithms have limitations in tackling constrained combinatorial optimization problems, which may result in slow convergence and in infeasible or degraded solutions. The authors in [17] developed an AC-DRL algorithm for a combinatorial optimization problem in a UAV-aided system, but when the size of the action space grows exponentially, the convergence of the algorithm deteriorates. In [18], the authors applied a conventional AC-DRL approach to address an energy minimization problem in UAV networks, where the performance is limited by the lack of a feasibility guarantee and by the rapidly increasing action space.
In this study, we minimize the UAV's communication and propulsion energy in a downlink UAV-aided communication system. The novelty of the solution development lies in two aspects. Firstly, compared to offline optimization approaches, we provide online learning and timely energy-saving solutions based on DRL. Secondly, unlike the conventional DRL or AC-DRL methods, the proposed solution is designed to address the challenging issues in constrained combinatorial optimization. The major contributions are summarized as follows:
• We formulate an energy minimization problem for an SDMA-enabled UAV communication system, where user-timeslot allocation and the UAV's hovering time assignment are the coupled optimization tasks. The formulated problem is combinatorial and non-convex with bilinear constraints.
• We provide a relax-and-approximate method to approach the optimum. That is, the bilinear terms are addressed by McCormick envelope relaxation, then the remaining integer linear programming problem is solved by the branch-and-bound (B&B) algorithm.
• We characterize the interplay among communication energy, hovering time, and hovering energy. Based on the derived analytical results, we develop a golden section search-based heuristic (GSS-HEU) algorithm for benchmarking general instances with lower complexity than the optimal solution.
• Being aware of the issues in optimal/sub-optimal and conventional DRL approaches, we propose an actor-critic-based deep stochastic online scheduling (AC-DSOS) algorithm, where the original problem is transformed to a Markov decision process (MDP). Unlike conventional AC-DRL solutions, in AC-DSOS, we design a set of approaches, e.g., stochastic policy quantization, action space reduction, and feasibility-guaranteed reward function design, to specifically address the constrained combinatorial problem.
• Simulations demonstrate that the proposed AC-DSOS enables a feasible, fast-converging, and dynamically-adaptive solution. The designed approaches are effective in reducing the action space and guaranteeing feasibility. AC-DSOS achieves 29.94% and 52.51% energy reduction compared with a conventional AC-DRL method and a heuristic user scheduling method, respectively, with almost the same computation time.

The rest of the paper is organized as follows. Section II provides the system model and Section III formulates the considered optimization problem. In Section IV, we analyze the relationship between the energy consumption and hovering time, and propose a heuristic algorithm. In Section V, we reformulate the problem as an MDP and develop an AC-DSOS algorithm. Numerical results are presented and analyzed in Section VI. Finally, we draw the conclusions in Section VII.
The code for generating the results is available online at: https://github.com/ArthuretYuan.

II. SYSTEM MODEL

A. System Model
We consider a downlink UAV-aided communication system. A UAV serves as an aerial base station (BS) to deliver data to ground users, e.g., in scenarios where terrestrial BSs are unavailable or overloaded by high traffic demand from numerous users. We assume that the UAV is equipped with $L$ antennas and each ground user has a single antenna [7]. The UAV is fully loaded with data and energy at a dock station before the task starts. The service area is divided into $N$ clusters considering the UAV's limited coverage area. This setup can be used in many practical scenarios such as emergency rescue and temporary communication [19], [20]. We denote $\mathcal{N} = \{1, ..., n, ..., N\}$ as the set of clusters and $\mathcal{N}^+ = \mathcal{N} \cup \{N+1\}$ as the extended set, where the $(N+1)$-th cluster denotes the dock station. The UAV flies through all the clusters successively according to a pre-optimized trajectory, and transmits data to the users while hovering at a given point, e.g., above the cluster's center. Let $K_n$ and $\mathcal{K}_n$ denote the number and set of the users in the $n$-th cluster. The demand of user $k \in \mathcal{K}_n$ is denoted by $q_{k,n}$ (in bits). When all the demands in a cluster are satisfied, the UAV leaves the current cluster and visits the next one. After serving all the clusters, the UAV flies back to the dock station. The process from leaving to returning to the dock station is defined as a round or a task. Fig. 1 illustrates an example of the considered system.

The data stored in the UAV typically has a certain life span [21]. Thus, we consider the transmitted data to be delay-sensitive, and all data delivery must be completed within $T_{\max}$ (in frames), where the time domain is divided into frames in set $\mathcal{T} = \{1, ..., t, ..., T_{\max}\}$. One frame consists of $I$ timeslots, and the duration of a timeslot is $\Phi$. With SDMA, the UAV can simultaneously transmit data to more than one user in each timeslot. The frame-timeslot structure is shown in Fig. 2, where the shaded blocks indicate that the users are scheduled. We define the scheduled users at a timeslot as a user group. The set of possible groups in cluster $n$ is denoted by $\mathcal{G}_n = \{1, ..., g, ..., G_n\}$. The maximum number of candidate groups in cluster $n$ is $G_n = 2^{K_n} - 1$ [22], which increases exponentially with $K_n$. The number and set of the users of group $g$ in cluster $n$ are denoted by $K_{g,n}$ and $\mathcal{K}_{g,n}$, respectively.

We consider a quasi-static Rician fading channel which comprises both a deterministic line-of-sight (LoS) component and a random multipath component [23]. The channel states are static within a transmission frame and vary from one frame to another. The channel vector from the UAV antennas to ground user $k \in \mathcal{K}_n$ is denoted as $h_{k,n} \in \mathbb{C}^{1 \times L}$, which can be expressed by $\alpha_{k,n} 10^{-\xi_{k,n}/10}$, where $\alpha_{k,n} \in \mathbb{C}^{1 \times L}$ is the multipath Rician fading vector and $\xi_{k,n}$ is the free-space propagation loss between the UAV and ground user $k \in \mathcal{K}_n$. We collect all the channel vectors of the users in $\mathcal{K}_{g,n}$ to form a matrix $H_{g,n} \in \mathbb{C}^{K_{g,n} \times L}$. Within a user group, we apply a linear minimum mean square error (MMSE) precoding scheme due to its high efficiency and low computational complexity in mitigating intra-group interference. The precoding vector for user $k \in \mathcal{K}_{g,n}$ is calculated by:
$$w_{k,g,n} = \sqrt{p_{k,g,n}}\,\tilde{h}_{k,g,n}, \quad (1)$$
where $p_{k,g,n}$ is the transmit power for user $k$ in group $g$, $\tilde{h}_{k,g,n}$ is the $k$-th column of $H_{g,n}^H (\sigma^2 I + H_{g,n} H_{g,n}^H)^{-1}$, and $\sigma^2$ is the noise power. Note that the transmit power $p_{k,g,n}$ is a fixed parameter in this work, following practical UAV applications, e.g., a constant transmit power can be selected from 0.1 W to 10 W [24]. The signal-to-interference-plus-noise ratio (SINR) for user $k \in \mathcal{K}_{g,n}$ is given by:
$$\Gamma_{k,g,n} = \frac{\beta^{(kk)}_{g,n}\, p_{k,g,n}}{\sum_{j \in \mathcal{K}_{g,n} \setminus \{k\}} \beta^{(kj)}_{g,n}\, p_{j,g,n} + \sigma^2}, \quad (2)$$
where $\beta^{(kk)}_{g,n} = |h_{k,n} \tilde{h}_{k,g,n}|^2$ and $\beta^{(kj)}_{g,n} = |h_{k,n} \tilde{h}_{j,g,n}|^2$ are the effective channel gains. Since the channel states vary over frames, we use $\Gamma_{k,g,n,t}$, $\beta^{(kk)}_{g,n,t}$ and $\beta^{(kj)}_{g,n,t}$ to track the SINR and channel coefficients on the $t$-th frame.
In this work, the time-varying channel is further modeled as a finite-state Markov channel (FSMC). Under the FSMC, we quantize each coefficient $\beta^{(kk)}_{g,n,t}$ and $\beta^{(kj)}_{g,n,t}$ into multiple Markov states and obtain a transition probability such that the variations of $\beta^{(kk)}_{g,n,t}$ and $\beta^{(kj)}_{g,n,t}$ follow a Markov process between frames [25]. If group $g \in \mathcal{G}_n$ is scheduled at timeslot $i$ on frame $t$, the amount of data transmitted to user $k \in \mathcal{K}_{g,n}$ and the consumed communication energy of group $g \in \mathcal{G}_n$ can be expressed by:
$$d_{k,g,n,t} = \Phi B \log_2(1 + \Gamma_{k,g,n,t}), \quad k \in \mathcal{K}_{g,n},\ g \in \mathcal{G}_n,\ t \in \mathcal{T}, \quad (3)$$
and
$$e_{g,n,t} = \Phi \sum_{k \in \mathcal{K}_{g,n}} \beta^{(kk)}_{g,n,t}\, p_{k,g,n}, \quad g \in \mathcal{G}_n,\ t \in \mathcal{T}, \quad (4)$$
where $B$ is the system bandwidth. Note that within a frame, we assume a user's channel condition is identical across all the timeslots, thus index $i$ is omitted in $d_{k,g,n,t}$ and $e_{g,n,t}$.
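As a minimal illustrative sketch (not the authors' implementation), the per-timeslot quantities of a scheduled group can be computed from the effective channel gains. The sketch assumes the usual SINR form (desired power over interference plus noise) and a Shannon-rate expression for the delivered data, and it takes the energy expression as stated in the text. The function names and the numerical values are hypothetical.

```python
import math


def sinr(beta, power, noise_power):
    """SINR of every user in one group.

    beta[k][j]: effective gain from user j's precoding vector to user k.
    power[j]:   fixed transmit power of user j.
    """
    users = range(len(power))
    result = []
    for k in users:
        signal = beta[k][k] * power[k]
        interference = sum(beta[k][j] * power[j] for j in users if j != k)
        result.append(signal / (interference + noise_power))
    return result


def data_per_timeslot(beta, power, noise_power, slot_s, bandwidth_hz):
    """Bits delivered to each user in one timeslot (Shannon rate, assumed)."""
    return [slot_s * bandwidth_hz * math.log2(1.0 + g)
            for g in sinr(beta, power, noise_power)]


def group_energy(beta, power, slot_s):
    """Communication energy of the group per timeslot, as written in the text."""
    return slot_s * sum(beta[k][k] * power[k] for k in range(len(power)))


# Hypothetical two-user group; all numbers are placeholders.
beta = [[1.0, 0.1],
        [0.2, 2.0]]
power = [1.0, 1.0]
gammas = sinr(beta, power, noise_power=0.1)
bits = data_per_timeslot(beta, power, 0.1, slot_s=0.1, bandwidth_hz=1e6)
energy = group_energy(beta, power, slot_s=0.1)
```

Since the channel is assumed constant within a frame, these values would be computed once per group and frame and reused for all timeslots of that frame.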

B. UAV's Energy Model
We employ the UAV energy model proposed in [4]. The flying power is formulated as a function $f(U)$ of the flying velocity $U$:
$$f(U) = P_0\left(1 + \frac{3U^2}{U_{tip}^2}\right) + P_1\left(\sqrt{1 + \frac{U^4}{4U_{ind}^4}} - \frac{U^2}{2U_{ind}^2}\right)^{1/2} + \frac{1}{2}\rho_1 \rho_2 U^3, \quad (5)$$
where
• $P_0$: the blade profile power in hovering status;
• $P_1$: the induced power in hovering status;
• $U_{tip}$: the tip speed of the rotor blade;
• $U_{ind}$: the mean rotor induced velocity;
• $\rho_1$: the parameter related to the fuselage drag ratio, rotor solidity, and the rotor disc area;
• $\rho_2$: the air density.
When the UAV approaches the hovering point of each cluster, it flies around the point with a certain velocity $U = U_{hov}$, which is more energy-efficient than $U = 0$ [5]. Thus, the hovering power $P_H$ is $f(U = U_{hov})$. The flying energy with constant velocity $U$ and traveling distance $S$ is expressed as:
$$E_F = \frac{S}{U} f(U). \quad (6)$$
Hovering energy and communication energy need to be jointly optimized since they are coupled by hovering time, whereas the optimization of flying energy is independent. By applying graph-based numerical methods [26], the minimum flying energy $E_F^*$ along with the optimal flying speed $U_F^*$ can be obtained by:
$$U_F^* = \arg\min_{U > 0} \frac{f(U)}{U}, \qquad E_F^* = \frac{S}{U_F^*} f(U_F^*). \quad (7)$$
The main notations are summarized in Table I.

TABLE I. Main notations
Notation | Description
$K_n$, $\mathcal{K}_n$ | number and set of users in cluster $n$
$G_n$, $\mathcal{G}_n$ | number and set of groups in cluster $n$
$K_{g,n}$, $\mathcal{K}_{g,n}$ | number and set of users in group $g$ of cluster $n$
$q_{k,n}$ | demands of user $k$ in cluster $n$
$T_{\max}$, $\mathcal{T}$ | maximum number and set of frames in each round
$I$, $\mathcal{I}$ | number and set of timeslots in each frame
$\Phi$ | duration of each timeslot (in seconds)
$\Gamma_{k,g,n,t}$ | SINR of user $k \in \mathcal{K}_{g,n}$ on frame $t$
$\beta^{(kj)}_{g,n,t}$ | channel coefficient from user $j$'s precoding vector to user $k$ ($k, j \in \mathcal{K}_{g,n}$) on frame $t$
$d_{k,g,n,t}$ | transmitted data of user $k \in \mathcal{K}_{g,n}$ per timeslot on frame $t$
$e_{g,n,t}$ | communication energy of group $g \in \mathcal{G}_n$ per timeslot on frame $t$
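The energy model can be explored numerically with the short sketch below. It assumes the widely used rotary-wing power expression in terms of the six parameters listed above, and it replaces the graph-based method of [26] with a plain grid search over the speed. The parameter values are illustrative placeholders, not values taken from this paper, and the function names are hypothetical.

```python
import math

# Illustrative placeholder parameters (assumptions, not from the paper).
PARAMS = {
    "p0": 79.86,     # blade profile power in hovering status (W)
    "p1": 88.63,     # induced power in hovering status (W)
    "u_tip": 120.0,  # tip speed of the rotor blade (m/s)
    "u_ind": 4.03,   # mean rotor induced velocity (m/s)
    "rho1": 0.0151,  # drag ratio x rotor solidity x rotor disc area
    "rho2": 1.225,   # air density (kg/m^3)
}


def flying_power(u, p0, p1, u_tip, u_ind, rho1, rho2):
    """Propulsion power f(U) at speed u (assumed rotary-wing model)."""
    blade = p0 * (1.0 + 3.0 * u ** 2 / u_tip ** 2)
    induced = p1 * math.sqrt(
        math.sqrt(1.0 + u ** 4 / (4.0 * u_ind ** 4)) - u ** 2 / (2.0 * u_ind ** 2)
    )
    parasite = 0.5 * rho1 * rho2 * u ** 3
    return blade + induced + parasite


def flying_energy(u, distance, params):
    """Energy to cover `distance` at constant speed u: power x time."""
    return flying_power(u, **params) * distance / u


# Grid search for the speed that minimizes the energy per unit distance.
speeds = [0.5 * i for i in range(1, 81)]  # 0.5 ... 40 m/s
best_speed = min(speeds, key=lambda u: flying_energy(u, 1.0, PARAMS))
hover_power_at_rest = flying_power(0.0, **PARAMS)  # equals p0 + p1
```

At zero speed the model reduces to the sum of the two hovering power terms, which gives a quick sanity check on the implementation.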

III. PROBLEM FORMULATION
We denote binary variables $\lambda_{i,g,n,t} \in \{0, 1\}$ as the scheduling indicators, where $\lambda_{i,g,n,t} = 1$ indicates that user group $g \in \mathcal{G}_n$ is assigned to timeslot $i$ on frame $t$, and $\lambda_{i,g,n,t} = 0$ otherwise. Another set of binary variables $\nu_{n,t} \in \{0, 1\}$ indicates whether the UAV is hovering above cluster $n$ on frame $t$ ($\nu_{n,t} = 1$) or not ($\nu_{n,t} = 0$). The UAV energy consumption consists of flying energy $E_F$, hovering energy $E_H$, and communication energy $E_C$. Since the minimal flying energy $E_F^*$ can be independently obtained by Eq. (7) without loss of optimality, the objective focuses on the joint optimization of $E_C$ and $E_H$, which are expressed by:

Note that the UAV is battery-limited in practice. We focus on the instances in which the minimum consumed energy in (10a) is within the UAV's battery capacity; otherwise the task is infeasible. The optimization problem is formulated as:

Constraints (10b) guarantee that all the users' requests are satisfied within $T_{\max}$. Constraints (10c) define that the UAV visits the clusters in a successive and forward manner. For example, if the UAV is hovering above cluster $n$ on frame $t$, in the next frame $t + 1$, the UAV either stays at the current cluster $n$ or moves to the next cluster $n + 1$.
The option of flying back to previously visited clusters, e.g., $n - 1$, is thus excluded. Note that the UAV takes off from the first cluster, i.e., $\nu_{1,1} = 1$. Constraints (10d) represent that all the timeslots on frame $t$ are assigned to a user group when $\nu_{n,t} = 1$; otherwise, no users are scheduled in any timeslot. Constraints (10e) and (10f) indicate that no more than one group can be scheduled at a timeslot and only one cluster can be served within a frame. Constraints (10g) and (10h) confine variables $\lambda_{i,g,n,t}$ and $\nu_{n,t}$ to binary values. Note that P1 is a combinatorial optimization problem with a non-convex bilinear objective and constraints. The optimum can be approached by a well-established relax-and-approximate method. That is, the non-convex bilinear terms are relaxed and bounded by McCormick envelopes [27], where each variable ($\lambda_{i,g,n,t}$ and $\nu_{n,t}$) is bounded by an upper and a lower bound. The relaxed problem becomes an integer linear programming (ILP) problem which can be optimally solved by B&B. Overall, the optimum of P1 can be approached by progressively tightening the bounds, e.g., by increasing the number of breakpoints in the envelopes, but this results in exponentially increasing complexity which is unaffordable in practice [28]. Thus, we adopt the above relax-and-approximate method to provide an optimal solution for benchmarking small-medium cases. For general cases, we propose a suboptimal algorithm in the next section.
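To make the relaxation step concrete, the sketch below evaluates the standard McCormick envelope of a single bilinear term $w = xy$ with $x \in [x^L, x^U]$ and $y \in [y^L, y^U]$. It is a generic illustration of the technique, not the exact relaxation used for P1; the function name is hypothetical.

```python
def mccormick_bounds(x, y, x_lo, x_hi, y_lo, y_hi):
    """Lower and upper bounds that the McCormick envelope puts on w = x * y."""
    lower = max(x_lo * y + x * y_lo - x_lo * y_lo,
                x_hi * y + x * y_hi - x_hi * y_hi)
    upper = min(x_hi * y + x * y_lo - x_hi * y_lo,
                x_lo * y + x * y_hi - x_lo * y_hi)
    return lower, upper


# At binary points with bounds [0, 1], the envelope is tight: w must equal x * y.
corners = {(x, y): mccormick_bounds(x, y, 0, 1, 0, 1)
           for x in (0, 1) for y in (0, 1)}

# At a fractional point the envelope only brackets the true product.
relaxed = mccormick_bounds(0.5, 0.5, 0, 1, 0, 1)
```

For binary variables the four linear inequalities reproduce the product exactly at integer points, which is why the relaxed problem can be handed to an ILP solver with B&B.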

IV. HEURISTIC APPROACH
We decompose the joint optimization into two sub-problems, i.e., user-timeslot allocation and hovering time allocation, corresponding to the optimization of $\lambda_{i,g,n,t}$ and $\nu_{n,t}$, respectively. We then solve one sub-problem while the other is fixed.

A. User-Timeslot Scheduling
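The bilinear terms are resolved once $\nu_{n,t}$ is fixed. The number of frames at each cluster is determined by:
$$t_n = \sum_{t \in \mathcal{T}} \nu_{n,t}, \quad (11)$$
and $\Phi I t_n$ is the hovering duration. The user-timeslot scheduling can be carried out independently in each cluster, and the resulting problem for the $n$-th cluster is formulated in P2(n) with a given $t_n$. We denote $E_{H,n}$ and $E_{C,n}$ as the hovering and communication energy for the $n$-th cluster:
$$E_{H,n} = \Phi I t_n P_H, \quad (12)$$
$$E_{C,n} = \sum_{t=\tau_n+1}^{\tau_n+t_n} \sum_{g=1}^{G_n} \sum_{i=1}^{I} \lambda_{i,g,n,t}\, e_{g,n,t}, \quad (13)$$
where $\tau_n$ refers to the number of elapsed frames before the UAV arrives at cluster $n$, which can be calculated by:
$$\tau_n = \sum_{n'=1}^{n-1} t_{n'}. \quad (14)$$
The sub-problem P2(n) is formulated as:
$$\min_{\lambda_{i,g,n,t}} \ E_{C,n} \quad (15a)$$
$$\text{s.t.} \ \sum_{t=\tau_n+1}^{\tau_n+t_n} \sum_{g=1}^{G_n} \sum_{i=1}^{I} \lambda_{i,g,n,t}\, d_{k,g,n,t} \ge q_{k,n}, \ \forall k \in \mathcal{K}_n, \quad (15b)$$
$$\sum_{g=1}^{G_n} \sum_{i=1}^{I} \lambda_{i,g,n,t} = I, \ \forall t \in \{\tau_n + 1, ..., \tau_n + t_n\}, \quad (15c)$$
$$\sum_{g=1}^{G_n} \lambda_{i,g,n,t} \le 1, \ \forall i \in \mathcal{I}, t \in \mathcal{T}, \quad (15d)$$
$$\lambda_{i,g,n,t} \in \{0, 1\}, \ \forall i \in \mathcal{I}, g \in \mathcal{G}_n, t \in \mathcal{T}. \quad (15e)$$
P2(n) is a multi-choice multi-dimensional knapsack problem (MMKP), which can be solved by a guided local search (GLS)-based heuristic algorithm with high-quality sub-optimal solutions and pseudo-polynomial-time complexity [29].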
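For intuition about the per-cluster scheduling sub-problem, the following toy sketch enumerates every assignment of one group per timeslot and keeps the cheapest one that meets all demands. It is an exhaustive search that is only practical for tiny instances and is not the GLS-based heuristic of [29]. It also assumes, for simplicity, that channel states (and hence per-slot data and energy) are the same in every frame. All names and numbers are hypothetical.

```python
from itertools import product


def min_comm_energy(num_slots, group_energy, group_data, demand):
    """Cheapest assignment of one group per slot that satisfies all demands.

    group_energy[g]: energy of group g per timeslot.
    group_data[g]:   dict user -> bits delivered per timeslot by group g.
    demand:          dict user -> requested bits.
    Returns the minimum energy, or None if no assignment is feasible.
    """
    best = None
    for choice in product(range(len(group_energy)), repeat=num_slots):
        delivered = {user: 0.0 for user in demand}
        for g in choice:
            for user, bits in group_data[g].items():
                delivered[user] += bits
        if all(delivered[user] >= demand[user] for user in demand):
            energy = sum(group_energy[g] for g in choice)
            if best is None or energy < best:
                best = energy
    return best


# Hypothetical instance: groups {a}, {b}, and {a, b}.
group_energy = [1.0, 1.0, 3.0]
group_data = [{"a": 4.0}, {"b": 4.0}, {"a": 3.0, "b": 3.0}]
demand = {"a": 6.0, "b": 6.0}

energy_by_slots = {slots: min_comm_energy(slots, group_energy, group_data, demand)
                   for slots in (1, 2, 3, 4)}
```

In this toy instance, a longer stay lets the scheduler replace the costly joint group with cheaper single-user groups, which illustrates how the communication energy is coupled with the hovering time.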

B. Hovering Time Allocation
To optimize the hovering time efficiently, we first investigate the connection between the objective energy and $t_n$. From Eq. (12) and Eq. (13), $E_{H,n}$ increases linearly with $t_n$ while $E_{C,n}$ is determined by both $t_n$ and $\lambda_{i,g,n,t}$. Next, we show the relationship between the optimum $E_{C,n}$ and $t_n$. For cluster $n$, we denote $E^*_{C,n}(t_n)$ as the communication energy with the optimal scheduling decision $\lambda^*_{i,g,n,t}$ at a given hovering time $t_n$.

Lemma 1. $E^*_{C,n}(t_n)$ does not increase as $t_n$ increases.

Proof. We denote the optimal user scheduling for P2(n) with $t_n = \tilde{t}$ as $\lambda^*_{i,g,n,t}$. If $t_n$ increases from $\tilde{t}$ to $\tilde{t} + \Delta t$, $\lambda^*_{i,g,n,t}$ is still feasible for P2(n) with $t_n = \tilde{t} + \Delta t$, although it is not necessarily optimal there. Hence, the optimal scheduling for $t_n = \tilde{t} + \Delta t$ results in communication energy that is no higher, i.e., $E^*_{C,n}(\tilde{t} + \Delta t) \le E^*_{C,n}(\tilde{t})$. Thus the conclusion.
From Lemma 1, we can observe that $E^*_{C,n}(t_n)$ is a nonincreasing function of $t_n$, i.e., $\partial E^*_{C,n}(t_n) / \partial t_n \le 0$, whereas $\partial E_{H,n}(t_n) / \partial t_n = \Phi I P_H$ based on Eq. (12). Thus, the extreme point of $E^*_{C,n}(t_n) + E_{H,n}(t_n)$ can be obtained at $t_n = t^\dagger$ when $\partial E^*_{C,n}(t_n) / \partial t_n |_{t_n = t^\dagger} = -\Phi I P_H$. Since the existence and the number of extreme points are undetermined, there are three possible cases for $E^*_{C,n}(t_n) + E_{H,n}(t_n)$, i.e., unimodal, multimodal, and monotonic, as illustrated in Fig. 3. In case 1, the curve is a unimodal function with only one extreme point. In case 2, the curve is multimodal, and in case 3, it is monotonic. Observing the possible cases, we employ an efficient golden section search (GSS) to find the extreme points [30]. In GSS, we limit the hovering time to $t_n \le \bar{t}_n$ to ensure that the total service duration does not exceed $T_{\max}$, where $\bar{t}_n$ is the maximal time limit for cluster $n$. Intuitively, the clusters with more demands need more transmission frames. We therefore assume that $\bar{t}_n$ is proportional to the users' demands in cluster $n$.

C. Algorithm Summary
We summarize the proposed GSS-based heuristic (GSS-HEU) algorithm in Alg. 1. We denote $\mathcal{B}_{n,t}$ as the set of channel states of cluster $n$ on frame $t$, which is expressed as: $\mathcal{B}_{n,t} = \{\beta^{(kj)}_{1,n,t}, ..., \beta^{(kj)}_{G_n,n,t} \mid \forall k, j \in \mathcal{K}_{g,n}\}$.
In GSS-HEU, the initial search range of GSS $[x_1, y_1]$ is set as $[0, \bar{t}_n]$, which is partitioned into three sections by two points $u_1$ and $v_1$ with the golden ratio 0.618 in lines 2-4, where $\lceil \cdot \rceil$ is an operation to round a value up to an integer. When a hovering time is searched in GSS, e.g., $t_n = u_m$ or $t_n = v_m$, the corresponding user-timeslot allocation is obtained by solving P2(n) in line 6. In lines 9-13, we compare the objective energy and update the search range. The search process terminates at $|y_m - x_m| \le 1$. The selected hovering time $t^*_n$ is $v_m$ and the corresponding scheduling scheme $\lambda^*_{i,g,n,t}$ is $\lambda_{i,g,n,t}|_{t_n = v_m}$. The complexity of GSS-HEU is $O(\sum_{n=1}^{N} G_n^2 \times \max\{K_n, I t_n\} + \log(2 t_n))$, which is much lower than that of the optimal method. However, both the optimal and GSS-HEU approaches may have limitations in fast decision-making. The computational time for both algorithms grows exponentially with the number of users since $G_n = 2^{K_n} - 1$ [10]. In addition, both algorithms need complete estimated channel states for all the frames of the task, i.e., from $t = 1$ to $T_{\max}$, which may be difficult to obtain through channel estimation. Therefore, we reconsider P1 from the perspective of DRL to enable the UAV to make decisions intelligently, while the developed optimal and sub-optimal algorithms are used to benchmark the performance of learning-based solutions.
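The search over the hovering time can be sketched as an integer golden section search. The version below is a simplified variant under stated assumptions: it stops when the range contains at most three integers and then checks them directly, which differs from the termination rule and final selection in Alg. 1. The callable `cost` stands in for solving P2(n) and adding the hovering energy; `toy_energy` is a hypothetical curve with a nonincreasing communication part and a linear hovering part.

```python
import math

GOLDEN = (math.sqrt(5) - 1) / 2  # golden ratio, about 0.618


def integer_gss(cost, lo, hi):
    """Golden-section-style search for a minimizer of cost on integers in [lo, hi]."""
    cache = {}

    def f(t):
        # Each evaluation stands for one solve of the scheduling sub-problem.
        if t not in cache:
            cache[t] = cost(t)
        return cache[t]

    x, y = lo, hi
    while y - x > 2:
        step = math.ceil(GOLDEN * (y - x))  # round up to an integer
        u, v = y - step, x + step           # interior points, x < u < v < y
        if f(u) < f(v):
            y = v                           # keep the range [x, v]
        else:
            x = u                           # keep the range [u, y]
    return min(range(x, y + 1), key=f)


def toy_energy(t):
    """Hypothetical total energy: decreasing term plus linear hovering term."""
    return 100.0 / t + 4.0 * t


best_t = integer_gss(toy_energy, 1, 20)
```

The search is exact only for unimodal curves; in the multimodal case it returns a local extreme point, which is consistent with the heuristic nature of GSS-HEU.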

A. Overview of Actor-Critic-Based DRL (AC-DRL)
Algorithm 1 GSS-HEU
Outputs: Heuristic solution: $\lambda^*_{1,1,1,1}, ..., \lambda^*_{I,G_n,N,T_{\max}}$, $t^*_1, ..., t^*_N$
1: for $n = 1$; $n \le N$; $n{+}{+}$ do
2:   $x_1 = 0$; $y_1 = \bar{t}_n$;
3-4:   Compute the two points $u_1$ and $v_1$ that partition $[x_1, y_1]$ with the golden ratio 0.618;
5:   for $m = 1$; $|y_m - x_m| > 1$; $m{+}{+}$ do
6:     Solve P2(n)$|_{t_n = u_m}$ and P2(n)$|_{t_n = v_m}$;
7:     Obtain the corresponding user scheduling schemes $\lambda_{i,g,n,t}|_{t_n = u_m}$ and $\lambda_{i,g,n,t}|_{t_n = v_m}$;
8:     Obtain the objective energy $(E_{C,n} + E_{H,n})|_{t_n = u_m}$ and $(E_{C,n} + E_{H,n})|_{t_n = v_m}$;
9:     if $(E_{C,n} + E_{H,n})|_{t_n = u_m} < (E_{C,n} + E_{H,n})|_{t_n = v_m}$ then
10-13:     Update the search range according to the comparison;
14:   end for
15:   $t^*_n = v_m$; $\lambda^*_{i,g,n,t} = \lambda_{i,g,n,t}|_{t_n = v_m}$.
16: end for

In DRL, an agent learns to make decisions by exploring the unknown environment and exploiting the received feedback. At each learning step $t$ (in this paper, a learning step is equivalent to a transmission frame), the agent observes the current state $s_t$ and takes an action $a_t$ based on a policy. Then, a reward $r_t$ is fed back to the agent. The policy is updated step by step according to the feedback. Actor-critic is an emerging reinforcement learning method that separates the agent into two parts, an actor and a critic. The actor is responsible for taking actions following a stochastic policy $\pi(a_t|s_t)$, where $\pi(\cdot|\cdot)$ refers to a conditional probability density function. The critic is used to evaluate the decisions via a Q-value, which is given by:
$$Q^\pi(s_t, a_t) = \mathbb{E}_{a_t \sim \pi(a_t|s_t)}[R_t \mid s_t, a_t],$$
where $\mathbb{E}_{a_t \sim \pi(a_t|s_t)}[\cdot|\cdot]$ is a conditional expectation under the policy $\pi(a_t|s_t)$, and $R_t$ is the cumulative discounted reward with a discount factor $\gamma$, which can be expressed as:
$$R_t = \sum_{t' \ge t} \gamma^{t'-t} r_{t'}.$$
However, obtaining the explicit expressions of $\pi(a_t|s_t)$ and $Q^\pi(s_t, a_t)$ is difficult. DRL uses DNNs as parameterized approximators to provide estimations of $\pi(a_t|s_t)$ and $Q^\pi(s_t, a_t)$. We denote $\theta_t$ and $\omega_t$ as the parameter vectors for the actor and critic, and $\pi(a_t|s_t; \theta_t)$ and $Q_\theta(s_t, a_t; \omega_t)$ as the corresponding parameterized functions. The goal of the agent is to minimize the loss function of the actor $-J(\theta_t)$:
$$J(\theta_t) = \mathbb{E}_{a_t \sim \pi(a_t|s_t; \theta_t)}[Q_\theta(s_t, a_t; \omega_t)].$$
Based on the fundamental results of the policy gradient theorem [12], the gradient of $J(\theta_t)$ can be calculated by:
$$\nabla_\theta J(\theta_t) = \mathbb{E}[\nabla_\theta \log \pi(a_t|s_t; \theta_t)\, Q_\theta(s_t, a_t; \omega_t)]. \quad (25)$$
The update rule of $\theta_t$ can be derived based on gradient descent on $-J(\theta_t)$:
$$\theta_{t+1} = \theta_t + \alpha_a \nabla_\theta J(\theta_t),$$
where $\alpha_a$ is the learning rate of the actor. For the critic, the parameter vector $\omega_t$ is updated based on temporal-difference (TD) learning [12].
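In TD learning, the loss function of the critic $C_Q(\omega_t)$ is defined as the expectation of the square of the TD error $\delta_Q(\omega_t)$, i.e., $\mathbb{E}[(\delta_Q(\omega_t))^2]$. The TD error $\delta_Q(\omega_t)$ refers to the difference between the TD target and the estimated Q-value, which is given by:
$$\delta_Q(\omega_t) = r_t + \gamma Q_\theta(s_{t+1}, a_{t+1}; \omega_t) - Q_\theta(s_t, a_t; \omega_t),$$
where $r_t + \gamma Q_\theta(s_{t+1}, a_{t+1}; \omega_t)$ is the TD target. The objective of the critic is to minimize the loss function $C_Q(\omega_t)$, and the update rule of $\omega_t$ can be derived by gradient descent:
$$\omega_{t+1} = \omega_t - \alpha_c \nabla_\omega C_Q(\omega_t),$$
where $\alpha_c$ is the learning rate for the critic. However, approximating $Q^\pi(s_t, a_t)$ brings about a large variance for the gradient $\nabla_\theta J(\theta_t)$, resulting in poor convergence [31]. To solve the problem, a V-value is introduced:
$$V^\pi(s_t) = \mathbb{E}_{a_t \sim \pi(a_t|s_t)}[Q^\pi(s_t, a_t)].$$
Approximating $V^\pi(s_t)$ can reduce the variance. With the parameterized V-value $V_\theta(s_t; \omega_t)$, the TD error and the loss function of the critic are expressed as:
$$\delta_V(\omega_t) = r_t + \gamma V_\theta(s_{t+1}; \omega_t) - V_\theta(s_t; \omega_t), \quad (30)$$
and
$$C_V(\omega_t) = \mathbb{E}[(\delta_V(\omega_t))^2].$$
In addition, $\delta_V(\omega_t)$ provides an unbiased estimation of the advantage $Q^\pi(s_t, a_t) - V^\pi(s_t)$, which can replace the Q-value in the policy gradient without changing its expectation [31]. Thus, we can rewrite $\nabla_\theta J(\theta_t)$ in Eq. (25) as:
$$\nabla_\theta J(\theta_t) = \mathbb{E}[\nabla_\theta \log \pi(a_t|s_t; \theta_t)\, \delta_V(\omega_t)].$$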
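The TD-based updates can be illustrated with a deliberately small sketch in which the critic is a lookup table of V-values and the actor is a scalar Gaussian mean, instead of the DNNs used in the paper. The critic step is a semi-gradient step on the squared TD error with the TD target held fixed (constant factors absorbed into the learning rate); the actor step uses the score of a Gaussian policy with respect to its mean. All names and numbers are hypothetical.

```python
def td_error(reward, v_now, v_next, gamma):
    """TD error: TD target minus the current V-value estimate."""
    return reward + gamma * v_next - v_now


def critic_step(values, state, next_state, reward, gamma, lr_critic):
    """Move V(state) toward the TD target; returns the TD error used."""
    delta = td_error(reward, values[state], values[next_state], gamma)
    values[state] += lr_critic * delta
    return delta


def actor_step(mean, variance, action, delta, lr_actor):
    """One policy-gradient step on the mean of a Gaussian policy."""
    grad_log_pi = (action - mean) / variance  # d/d(mean) of log N(action; mean, variance)
    return mean + lr_actor * delta * grad_log_pi


# Toy transition: a negative reward models consumed energy.
values = {"s0": 0.0, "s1": 1.0}
delta = critic_step(values, "s0", "s1", reward=-2.0, gamma=0.9, lr_critic=0.5)
new_mean = actor_step(mean=0.0, variance=1.0, action=1.0, delta=delta, lr_actor=0.1)
```

A negative TD error pushes the policy mean away from the sampled action, while a positive one pulls the mean toward it.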

B. Problem Reformulation
To apply AC-DRL, we reformulate P 1 to an MDP problem, in which the UAV acts as an agent. We define the states, actions, and rewards as follows.
1) States: The system state $s_t$ consists of the channel states for all the clusters on the current frame, i.e., $\mathcal{B}_{1,t}, ..., \mathcal{B}_{N,t}$, the undelivered demands, and the currently served cluster on frame $t$. The undelivered demand $b_{n,t}$ is the residual data to be delivered for cluster $n$ on frame $t$:
$$b_{n,t+1} = b_{n,t} - d^\pi_{n,t}, \ \forall n \in \mathcal{N}, t \in \mathcal{T}, \quad (33)$$
where $d^\pi_{n,t}$ is the delivered data for cluster $n$ in frame $t$ under the policy $\pi(a_t|s_t)$. We denote $o_t \in \mathcal{N}^+$ as an indicator of which cluster the UAV is serving in frame $t$. When the users' requests in the current cluster are completed, the UAV moves to the next cluster in the next frame; otherwise, it stays at the current cluster. For example, we assume that the UAV is hovering above cluster $n$ on frame $t$, i.e., $o_t = n$. For the next frame, $o_{t+1}$ is obtained by:
$$o_{t+1} = \begin{cases} n + 1, & \text{if } b_{n,t+1} = 0, \\ n, & \text{otherwise.} \end{cases}$$
When the UAV's service duration exceeds $T_{\max}$, the UAV flies back to the dock station. By assembling the above three parts, the state $s_t$ is defined as:
$$s_t = \{\mathcal{B}_{1,t}, ..., \mathcal{B}_{N,t}, b_{1,t}, ..., b_{N,t}, o_t\}.$$
Note that the elements of $\mathcal{B}_{n,t}$ are modeled as FSMC. In addition, based on Eq. (33) and Eq. (35), the next values of $b_{n,t}$ and $o_t$ only depend on the current state and the current policy. Therefore, the transition of the state $s_t$ conforms to an MDP [12].
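The state bookkeeping described above can be sketched as a small transition function. This is an illustrative reading of the text rather than the authors' code: the residual demand is clamped at zero (an assumption), the dock station is represented by the index $N + 1$, and the channel-state part of the state is omitted. The function name and arguments are hypothetical.

```python
def step_state(residual, cluster, delivered, frame, t_max):
    """Advance the residual demands and the served-cluster indicator by one frame.

    residual:  list of undelivered bits per cluster (index 0 is cluster 1).
    cluster:   1-based index of the cluster served on this frame.
    delivered: bits delivered to the served cluster on this frame.
    """
    dock = len(residual) + 1
    updated = list(residual)
    if cluster != dock:
        # Residual demand update, clamped at zero (assumption).
        updated[cluster - 1] = max(0.0, updated[cluster - 1] - delivered)
        if updated[cluster - 1] == 0.0:
            cluster += 1      # demands met: move on to the next cluster
    if frame >= t_max:
        cluster = dock        # time budget exhausted: return to the dock
    return updated, cluster


stay = step_state([5.0, 3.0], 1, 2.0, frame=1, t_max=10)
move = step_state([2.0, 3.0], 1, 2.0, frame=2, t_max=10)
done = step_state([0.0, 1.0], 2, 1.5, frame=3, t_max=10)
timeout = step_state([5.0, 3.0], 1, 1.0, frame=10, t_max=10)
```

Because the next residual demand and cluster index depend only on the current values and the data delivered under the current action, this part of the state evolves in a Markovian way.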
2) Actions: The action of the UAV is the user-timeslot assignment on frame $t$, which is given by:
$$a_t = [a_{1,t}, ..., a_{I,t}],$$
where $a_{i,t} = g$ means the $g$-th group is selected at the $i$-th timeslot on the $t$-th frame. Note that the action space $\mathcal{G}_n$ can be huge since it increases exponentially with the number of users.
3) Rewards: The reward functions are commonly related to the objective of the problem. Conventionally, the reward function of P1 can be designed as in Eq. (38) and Eq. (39), following [33] and [34], where $e^\pi_t$ is the energy consumed on frame $t$ under the policy $\pi(a_t|s_t)$. Since both of these reward functions monotonically decrease with $e^\pi_t$, the UAV updates the policy towards reducing energy consumption.

C. The AC-DSOS algorithm
Conventional AC-DRL algorithms may not be able to deal with constrained discrete problems. Firstly, the combinatorial component of P1 limits the conventional AC-DRL in addressing huge discrete action spaces [32]. Secondly, the increased action space reduces the exploration efficiency in the learning process and degrades the overall energy-saving performance. Thirdly, the conventional AC-DRL algorithms cannot guarantee the solution's feasibility in general. This means that a high-reward action can fail to satisfy the constraints in P1. To overcome the above difficulties and limitations, we propose an AC-DSOS algorithm that is tailored for constrained problems with discrete action representation. The basic actor-critic framework is employed in order to take advantage of the stochastic policy and TD learning, where the stochastic policy can be quantized to tackle the issue of huge discrete spaces and TD learning can improve the learning efficiency. We illustrate the actor-critic framework of AC-DSOS in Fig. 4, where two DNNs work as the actor and critic, respectively. The stochastic policy $\pi(a_t|s_t)$ is usually modeled as a Gaussian distribution with a mean $\mu(s_t)$ and a variance $\chi(s_t)$ [35]. Given the current state $s_t$, the actor does not predict $\pi(a_t|s_t; \theta_t)$ directly but obtains approximations of the mean $\mu(s_t; \theta_t)$ and the variance $\chi(s_t; \theta_t)$. An action $a_t$ can be selected based on $\pi(a_t|s_t; \theta_t)$. Then, the agent receives a reward $r_t$ after taking the action and collects the next state $s_{t+1}$. For the critic, two V-values, $V_\theta(s_t; \omega_t)$ and $V_\theta(s_{t+1}; \omega_t)$, are estimated by the DNN with the inputs $s_t$ and $s_{t+1}$, respectively. The TD error $\delta_V(\omega_t)$ can be calculated by Eq. (30). A tuple $\{s_t, s_{t+1}, \delta_V(\omega_t), r_t\}$ is stored in a memory at each step $t$. By applying a memory replay mechanism, the data in the memory can be used for training the DNNs.
In each training step, the actor and critic are updated by gradient descent over a batch of training data. The whole training process consists of multiple episodes, each including $T_{\max}$ steps. Based on the above framework, the AC-DSOS algorithm is summarized in Alg. 2. The novelties of the proposed AC-DSOS compared to the conventional AC-DRL are summarized as follows.
1) Action Mapping to Tackle the Issue of Huge Discrete Action Space: The conventional actor-critic is used for continuous action spaces. We denote $\hat{a}_t = [\hat{a}_{1,t}, ..., \hat{a}_{I,t}]$ as the original action selected by the stochastic policy, where each element $\hat{a}_{i,t}$ is fractional. However, as the decision variables are integers in P1, the action space is discrete. To deal with this issue, we adopt an action mapping method in AC-DSOS (line 9 in Alg. 2). Firstly, we confine $\hat{a}_{i,t}$ to a fixed range $[-\kappa, \kappa]$ to avoid its value being too large or too small, since the domain of the Gaussian distribution is $(-\infty, \infty)$. Then, a uniform quantization method is used to map $\hat{a}_{i,t}$ to the discrete action space $\{1, ..., G_n\}$ by Eq. (40), where $2\kappa / G_n$ is the quantization interval. With the mapping operation, we can support a larger $G_n$ by reducing the interval.
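The clipping and uniform quantization steps can be sketched as follows. The exact rounding rule of Eq. (40) is not reproduced here; the sketch assumes a ceiling-based mapping with the quantization interval $2\kappa / G_n$ and clamps the result into $\{1, ..., G_n\}$. The function name and the sampled values are hypothetical.

```python
import math
import random


def map_action(raw, kappa, num_groups):
    """Map a continuous action element to a group index in {1, ..., num_groups}."""
    clipped = max(-kappa, min(kappa, raw))       # confine to [-kappa, kappa]
    interval = 2.0 * kappa / num_groups          # quantization interval
    index = math.ceil((clipped + kappa) / interval)
    return min(num_groups, max(1, index))        # clamp the boundary case raw = -kappa


# Deterministic examples with kappa = 1 and four candidate groups.
mapped = [map_action(raw, 1.0, 4) for raw in (-10.0, -0.6, 0.1, 10.0)]

# Sampling one element from a Gaussian policy and mapping it (illustrative values).
sampled = map_action(random.gauss(0.0, 0.5), 1.0, 4)
```

A larger number of groups simply shrinks the interval, so the same continuous policy output can address a larger discrete action set.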

Algorithm 2 AC-DSOS Algorithm
Inputs: The current state s t .
Outputs: The current action a t .
1: Initialize θ 1 and ω 1 .
2: for each learning episode do
3:   Observe the initial state s 1 .
4:   for t = 1 : T max do
5:     Remove the groups containing the demand-satisfied users.
6:     Predict mean µ(s t ; θ t ) and variance χ(s t ; θ t ) by the DNN of the actor.
9:     Map the elements â i,t to a i,t by Eq. (40).
10:    Take the after-mapped action a t .
16:    Obtain θ t+1 and ω t+1 by gradient descent.
17:    s t = s t+1 ; θ t = θ t+1 ; ω t = ω t+1 .
18:   end for
19: end for

2) Action Space Restriction to Improve Solution Quality: Although AC-DSOS can tackle the issue of discrete action space by the above mapping operation, exploring a huge space remains difficult. To improve the exploration efficiency and the quality of the solution, we design a method to restrict the action space in the learning process (line 5 in Alg. 2).
At the beginning of each frame, we first observe which users' demands have been satisfied. Then, we remove the corresponding candidate groups, i.e., the groups containing the successfully served users. Therefore, the size of the action space is not fixed over T max but gradually decreases. The action space restriction helps the agent avoid redundant searches over demand-satisfied users. Besides, searching in a smaller action space speeds up convergence, thereby improving search efficiency and solution quality.
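The restriction step can be illustrated with a short Python sketch. The data layout is an assumption made for the example: groups are tuples of user indices and remaining demands are stored in a dictionary. The function name restrict_groups is hypothetical.

```python
def restrict_groups(candidate_groups, remaining_demand):
    """Keep only the groups in which every user still has unmet demand."""
    satisfied = {user for user, demand in remaining_demand.items() if demand <= 0}
    return [group for group in candidate_groups if not (set(group) & satisfied)]


# Hypothetical frame: user 1 has been fully served, users 2 and 3 have not.
groups = [(1,), (2,), (1, 2), (2, 3)]
remaining = {1: 0.0, 2: 1.5, 3: 2.0}
active_groups = restrict_groups(groups, remaining)
```

In this example, every group containing user 1 is dropped, so the agent selects among two candidates instead of four in the current frame.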
3) Re-designed Reward Function to Deal with Feasibility Issues: Without a carefully designed mechanism, the actions taken in conventional AC-DRL may easily violate constraints and thus fail to guarantee solution feasibility. In P 1 , the major difficulty comes from constraint (10b), whereas (10c)-(10h) can be satisfied by properly defined actions. Under the commonly-used reward designs, e.g., Eq. (38) or Eq. (39), constraint (10b) may not be satisfied, since the decision-making criterion is to minimize the objective energy without considering the constraints. To solve this problem, we re-design the reward function by incorporating constraint (10b), which is given by Eq. (41).
The rationale is that the proposed reward function is the ratio between the delivered data and the consumed energy on frame t, adjusted by a control parameter. When the control parameter is small, the reward drives the UAV to deliver more data to meet users' demands. However, transmitting more data results in more energy consumption. To control the energy growth, we can increase the control parameter so that the agent reduces the energy consumption to avoid reward losses. Thus, by tuning the control parameter appropriately, the decisions made by AC-DSOS can achieve good energy-saving performance while satisfying users' demands.
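To make the role of the control parameter concrete, the following Python sketch uses one plausible ratio-type reward. The exact form of Eq. (41) is not reproduced in this section, so the exponent-based weighting below is purely an assumption, as are the names ratio_reward and energy_weight. It also assumes that energy is expressed in units for which values exceed 1.

```python
def ratio_reward(delivered_data, consumed_energy, energy_weight):
    """Hypothetical ratio reward: delivered data over weighted energy.

    A larger energy_weight penalizes energy more strongly
    (for consumed_energy > 1), which mirrors the qualitative behavior
    described in the text. This is not the paper's Eq. (41).
    """
    if consumed_energy <= 0:
        return 0.0
    return delivered_data / (consumed_energy ** energy_weight)


# Same hypothetical frame evaluated under two settings of the weight.
low_weight_reward = ratio_reward(8.0, 4.0, energy_weight=1.0)
high_weight_reward = ratio_reward(8.0, 4.0, energy_weight=1.5)
```

Under this assumed form, the same action earns less reward as the weight grows, so an agent maximizing the reward is pushed toward lower energy use.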

VI. NUMERICAL RESULTS
In this section, we present numerical results to evaluate the performance of the proposed AC-DSOS algorithm and compare it with other schemes:
• Previous AC-DRL scheme: deep deterministic policy gradient (DDPG) [36];
• High-complexity near-optimal scheme: the proposed GSS-HEU in Alg. 1;
• Low-complexity sub-optimal scheme: semi-orthogonal user scheduling-based heuristic algorithm (SUS-HEU) [37];
• Optimal scheme: relax-and-approximate approach.
DDPG provides performance benchmarks from a typical actor-critic perspective, where a deterministic policy is applied without action space restriction. The structure of the DNNs, parameter settings, and reward function Eq. (41) for AC-DSOS and DDPG are the same in order to enable a feasible solution from DDPG. The proposed sub-optimal GSS-HEU and optimal algorithms, and the sub-optimal SUS-HEU in [37], benchmark AC-DSOS from an optimization aspect, where SUS-HEU adopts a simple user-grouping strategy with lower complexity than GSS-HEU.
In the simulation, we first evaluate the performance in terms of energy consumption and computational time. After that, we verify the effectiveness of the newly developed reward function in guaranteeing solution feasibility by comparing it with several well-known reward functions. Furthermore, we evaluate the convergence performance of AC-DSOS with different learning rates.

A. Parameter Settings
The UAV is equipped with L = 10 antennas serving N = 3 clusters. The ground users are randomly scattered in the service area. Each cluster contains up to K = 9 users. The users' demands q k,n are randomly selected from {1, 2, 3, 4, 5} (Mbit). We assume the bandwidth B = 10 MHz, noise power σ 2 = 0.1 mW, hovering power P H = 10 W, and transmit power p k,g,n = 3 W, referring to [4]. Based on FSMC, we quantize β (kk) g,n,t and β (kj) g,n,t into 9 levels, {0, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4}. The setting of the transfer probability matrix is similar to that in [38]. Two fully-connected DNNs are employed as the actor and the critic. The adopted parameters for implementing AC-DSOS are summarized in Table II.
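For reproducibility, the settings listed above can be collected in a small configuration object, sketched below in Python. The class and field names are hypothetical. The sketch also assumes that every cluster holds exactly K users when drawing demands, whereas the text states "up to K".

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SimulationConfig:
    num_antennas: int = 10            # L
    num_clusters: int = 3             # N
    max_users_per_cluster: int = 9    # K
    bandwidth_hz: float = 10e6        # B = 10 MHz
    noise_power_w: float = 0.1e-3     # sigma^2 = 0.1 mW
    hovering_power_w: float = 10.0    # P_H = 10 W
    transmit_power_w: float = 3.0     # p_{k,g,n} = 3 W
    demand_choices_mbit: tuple = (1, 2, 3, 4, 5)
    # The 9 quantized channel levels used with the FSMC model.
    channel_levels: tuple = (0.0, 0.3, 0.6, 0.9, 1.2, 1.5, 1.8, 2.1, 2.4)


def draw_demands(config, rng):
    """Draw one demand (in Mbit) per user, assuming full clusters."""
    return [
        [rng.choice(config.demand_choices_mbit)
         for _ in range(config.max_users_per_cluster)]
        for _ in range(config.num_clusters)
    ]


config = SimulationConfig()
demands = draw_demands(config, random.Random(0))
```

The DNN hyperparameters of Table II are not included here because they are not listed in the text.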

B. Results and Analysis
Firstly, the comparison with four benchmarking algorithms in Fig. 5 shows that the proposed AC-DSOS achieves a good trade-off between energy minimization and computational time. Note that for K > 7, the optimal energy results are absent due to the high complexity and the correspondingly long computational time. From Fig. 5, AC-DSOS saves around 29.94% energy compared to DDPG on average. Overall, AC-DSOS provides a sub-optimal solution, with a 19.17% gap to the optimum. GSS-HEU achieves near optimality and consumes 9.8% less energy than AC-DSOS on average, but at the cost of much higher complexity and computational time, e.g., see Fig. 6. SUS-HEU consumes the highest energy since it schedules users based on channel conditions without considering energy consumption. It is also shown that the total objective energy increases roughly linearly for all the algorithms. The gaps between the optimal algorithm and the other algorithms become larger as K increases. When K grows from 5 to 7, the gap to the optimum increases from 47.7% to 65.1% for SUS-HEU, and from 31% to 44.5% for DDPG. In AC-DSOS, since the delivery-completed users are removed during the learning process, the size of the action space continuously decreases. This improves the searching efficiency and quality, and slows the growth of the gap as K increases, from 11.1% (K = 5) to 16.7% (K = 7).
Fig. 6 compares the computational time with respect to K. The computational time is measured from the moment the inputs are given to an algorithm until the optimized results are returned. For GSS-HEU and the optimal approach, the computational time grows exponentially with K, whereas that of the proposed AC-DSOS, along with DDPG and SUS-HEU, stays at a low magnitude and is insensitive to the increase of K. AC-DSOS saves 99.23% and 92.86% computational time compared to the optimal algorithm and GSS-HEU, respectively, when K = 7. This is because DRL can provide online decisions based on the current environment state instead of solving the optimization problem directly.
The computational time of AC-DSOS is slightly lower than that of DDPG and SUS-HEU. Moreover, recalling Fig. 5, AC-DSOS saves 29.94% and 52.51% energy compared with DDPG and SUS-HEU, respectively.
Fig. 7 demonstrates the total energy consumption with respect to T max , and Fig. 8 illustrates the communication energy and hovering energy separately. From Fig. 7, AC-DSOS outperforms DDPG by saving 21.37% total energy on average. The average gap between GSS-HEU and the optimal solution is 8.91% smaller than that of AC-DSOS, but, from Fig. 6, GSS-HEU consumes nearly 126 times higher computational time than AC-DSOS at T max = 160. The energy-saving performance of SUS-HEU is worse than that of the other algorithms, and its gap to the optimum reaches 59.44%. Fig. 7 also shows that, as T max increases, the objective energy first decreases rapidly and then grows steadily. This can be explained via Fig. 8. The objective energy consists of the communication energy and the hovering energy. From Fig. 8, the communication energy drops rapidly when T max < 140 and becomes stable when T max > 180, whereas the hovering energy increases linearly with T max for all the algorithms.
Fig. 9 verifies the capability of the proposed reward function in dealing with feasibility issues, where a feasible solution is obtained only if the ratio of delivered demand to total demand on the y-axis reaches 100%. From Fig. 9, the reward functions in Eq. (38) and Eq. (39) fail to guarantee the feasibility of the solution. For the re-designed reward, we evaluate the performance by setting the control parameter to 1, 1.2, and 1.5. A small control parameter means that transmitting more data brings more reward gain than saving energy. When the control parameter is reduced to 1.2 or below, the feasibility issue is resolved. Fig. 10 shows the objective energy with different values of the control parameter. It can be seen that a smaller value leads to more energy consumption. Thus, an appropriate value is 1.2, which enables the after-learned solution to satisfy the demands while consuming less energy.
Fig. 11 demonstrates the convergence of AC-DSOS with different actor learning rates α a . The x-axis is the learning episode and the y-axis is the reward value. When α a increases from 0.001 to 0.003, the reward value at convergence grows by 2.8%. If α a increases to 0.005, the curve fluctuates and the converged value is 7.2% lower than in the case of α a = 0.001. Here we take the actor as an example; the learning rate of the critic, α c , shows the same tendency. In conclusion, the convergence is sensitive to the learning rates of the actor and critic, which need to be properly selected, e.g., 0.003 for the actor.
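The percentages reported in this section are relative quantities. Assuming the usual definitions (the text does not state them explicitly), they can be computed as in the following Python sketch. The function names and the numerical inputs are hypothetical and are not taken from the figures.

```python
def relative_saving(baseline, proposed):
    """Percentage saved by the proposed scheme relative to a baseline."""
    return 100.0 * (baseline - proposed) / baseline


def gap_to_optimum(achieved, optimum):
    """Percentage by which an achieved value exceeds the optimum."""
    return 100.0 * (achieved - optimum) / optimum


# Hypothetical energy values in joules, for illustration only.
saving = relative_saving(baseline=200.0, proposed=150.0)
gap = gap_to_optimum(achieved=120.0, optimum=100.0)
```

Note that the two measures use different denominators, so a saving relative to a baseline and a gap to the optimum are not directly interchangeable.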

VII. CONCLUSION
In this paper, we have investigated an energy minimization problem for UAV-aided communication systems from the perspective of AC-DRL. The formulated problem is combinatorial and non-convex. We provided an optimal relax-and-approximate method and proposed a GSS-based heuristic algorithm to solve the problem and serve as benchmarks. To make the solutions adaptive to online operation, we proposed an AC-DSOS algorithm. Different from previous AC-DRL methods, the proposed AC-DSOS is able to deal with the huge discrete action space and guarantee solution feasibility. Numerical results have shown that AC-DSOS provides a good trade-off between energy efficiency and computational efficiency. Furthermore, the re-designed reward function is effective in dealing with the feasibility issue. An extension of the current work is to jointly optimize the energy consumption of uplink communications, downlink communications, and UAV propulsion, e.g., when UAV downlink data-delivery and uplink data-collection tasks co-exist among clusters.