Optimal Frequency Reuse and Power Control in Multi-UAV Wireless Networks: Hierarchical Multi-Agent Reinforcement Learning Perspective

To overcome the problems caused by the limited battery lifetime in multi-unmanned aerial vehicle (UAV) wireless networks, we propose a hierarchical multi-agent reinforcement learning (RL) framework to maximize the energy efficiency (EE) of UAVs by finding the optimal frequency reuse factor and transmit power. The proposed algorithm consists of distributed inner-loop RL for the transmit power control of the UAV terminals (UTs) and centralized outer-loop RL for finding the optimal frequency reuse factor. Specifically, the proposed algorithm adjusts these two factors jointly to effectively mitigate intercell interference and reduce unnecessary transmit power consumption in multi-UAV wireless networks. For this reason, the proposed algorithm outperforms conventional algorithms, such as a random action algorithm with a fixed frequency reuse factor and a hierarchical multi-agent Q-learning algorithm with binary transmit power control. Furthermore, even in an environment where UTs move continuously according to a mixed mobility model, we show that the proposed algorithm achieves the best reward compared to conventional algorithms.


I. INTRODUCTION
The utilization of unmanned aerial vehicles (UAVs) is one of the promising technologies for future sixth-generation (6G) wireless networks because a key objective of 6G is to provide three-dimensional (3D) wireless connectivity [1]. There are many challenges to achieving this goal, e.g., 3D channel modeling, multilayered network architecture design, seamless 3D handover, and network lifetime maximization [1], [2]. In particular, the limited battery lifetime of UAVs shortens the time UAVs can operate [3], [4]. Considering this battery problem, many studies have been conducted to improve the energy efficiency (EE) of UAVs [5]-[7]. Specifically, in [5], the authors proposed a multi-agent reinforcement learning (RL)-based UAV deployment and power control algorithm for maximizing EE in multi-UAV wireless networks. Additionally, the authors of [6] proposed an online random access protocol that adjusts the packet transmission opportunities based on the residual energy of drones in S-ALOHA-based swarming drone networks.
In addition, finding the optimal frequency reuse is a crucial enabling technology for simultaneously maximizing network-wide resource utilization efficiency and EE in practical wireless communication networks. Several frequency reuse techniques have been introduced to mitigate intracell and intercell interference when sharing frequency resources in wireless networks [8]-[10]. The authors of [8] compared several fractional frequency reuse (FFR) schemes. In [9], strict FFR was used to partition a cell area into spatial regions with different frequency reuse factors, and soft frequency reuse (SFR) was used to divide a cell area into two regions: an inner region where the entire frequency resource was available and an outer region where only a small fraction of the resources was available. SFR can be more bandwidth-efficient than strict FFR but results in more intercell interference for both cell-interior and cell-edge users. That is, strict FFR has the advantage of reducing interference between cell-interior users and cell-edge users, as they do not share any frequencies.
To accommodate flying BSs, the authors of [10] proposed a flexible SFR (F-SFR) technique to assign a frequency resource plan that considered the dynamic network topology and maximized the inter-BS distance among cells with the same resource plan by assigning different SFR levels in each cell. Since this technique was aimed at supporting aerial BSs and ground users, it is not easy to apply directly to a wireless network consisting of ground BSs and aerial terminals.
Several hierarchical reinforcement learning approaches have been proposed to improve the performance of multi-UAV wireless networks [11]-[13]. In [11], to resolve the problem of the limited data collection coverage of backscatter sensor nodes, a hierarchical deep reinforcement learning (DRL) framework was proposed to extend the data collection coverage and minimize the total flight time of rechargeable UAVs performing data collection missions. The authors of [12] proposed a hierarchical deep Q-network (h-DQN) model for dynamic spectrum access. The proposed h-DQN shows faster convergence, higher performance, and higher channel utilization than Q-learning for dynamic sensing (QADS) [14] or deep reinforcement learning for dynamic access (DRLDA) [15]. Additionally, in [13], a hierarchical scheduling architecture with top-layer scheduling for satellite selection and foundation-layer precise scheduling for urgent tasks was introduced to solve the real-time earth observation satellite (EOS) scheduling problem. Here, Q-learning with an adaptive action selection strategy was proposed to solve the Markov decision process model more efficiently, and it is expected to enable real-time task scheduling of agile satellites. However, in complicated three-dimensional network environments where UAVs are continuously moving, it is still difficult to utilize conventional algorithms in practice due to their huge computational complexity.
In this paper, we assume multicell network environments in which multiple UAV terminals (UTs) perform their own missions, and each UT transmits its information to a ground control system (GCS). The goal of this paper is to find the optimal frequency reuse factor and transmit power for maximizing EE in multi-UAV wireless networks by using a hierarchical multi-agent reinforcement learning algorithm. The main contributions of this paper are as follows:
• To maximize network-wide EE while reducing the computational complexity of multi-agent reinforcement learning, we adopt a hierarchical approach. Specifically, the proposed algorithm consists of distributed multi-agent inner-loop RL for finding the optimal transmit power of the UTs and centralized outer-loop RL for determining the optimal frequency reuse factor. The simultaneous adjustment of these factors is very challenging and complicated.
• To reduce the complexity of the inner-loop RL, we propose a distributed multi-agent approach. In the inner-loop RL, each agent considers only its own state but shares its reward with the others. That is, after gathering the separate rewards from the agents, the central head redistributes this shared reward at each time step so that the overall reward of the hierarchical reinforcement learning can be maximized.
• Even in a hexagonal prism-shaped three-dimensional environment where UTs are continuously moving, the proposed algorithm converges well to the optimal solution obtained by the exhaustive search algorithm. This demonstrates the practicality and scalability of the proposed algorithm.

FIGURE 1: System model of the proposed hierarchical multi-agent Q-learning framework for optimal frequency reuse and transmit power control in multi-UAV wireless networks
The rest of this paper is organized as follows. In Section II, we describe the air-to-ground (A2G) channel model and the UAV mobility model. In Section III, we propose a hierarchical multi-agent Q-learning-based optimal frequency reuse and power control algorithm. In Section IV, we demonstrate that the proposed algorithm outperforms the conventional algorithms with respect to network-wide EE. Finally, conclusions are drawn in Section V. The notations and symbols used in this paper are summarized in Table 1.

II. SYSTEM MODEL
Fig. 1 represents the system model of the proposed hierarchical multi-agent Q-learning framework for optimal frequency reuse and transmit power control in uplink multi-UAV wireless networks. We consider a hexagonal prism-shaped three-dimensional cell architecture with N_g GCSs and N_u UTs, and the number of cells per cluster is determined by the frequency reuse factor (µ).

A. AIR-TO-GROUND (A2G) CHANNEL MODEL
In the A2G channel model, the line-of-sight (LoS) signal between UTs and GCSs may occasionally be blocked by ground buildings and obstacles, so LoS and non-LoS (NLoS) propagation should be considered separately. Thus, we herein utilize the elevation-angle-dependent probabilistic LoS model as the A2G channel model [16]. The LoS and NLoS path losses are given in Equations (1) and (2), where v_l is the speed of light, f_c is the carrier frequency, and d_ij denotes the distance between GCS i and UT j. In Equations (1) and (2), the free-space path loss term is common and can be calculated as 20 log(4π f_c d_ij / v_l). Additionally, ζ_LoS and ζ_NLoS denote the excessive path losses caused by artificial obstacles (e.g., skyscrapers) in the LoS and NLoS paths, respectively. The excessive path loss varies depending on the urban environmental deployment models proposed by the International Telecommunication Union Radiocommunication Sector (ITU-R) [17]. Even though the A2G channel model used in this paper does not consider small-scale fading directly, it reflects the A2G channel characteristics of various environmental deployments (suburban, urban, dense urban, and high-rise urban) through these statistical parameters.
In this A2G channel model, the LoS probability between GCS i and UT j can be calculated as in Equation (6), where a and b are additional statistical parameters representing the A2G channel characteristics of the four urban environmental deployments. θ_ij is the elevation angle between GCS i and UT j and can be calculated as θ_ij = arcsin(h_j / d_ij), where h_j denotes the altitude of UT j. From Equation (6), the NLoS probability between GCS i and UT j can be obtained as P^NLoS_ij = 1 − P^LoS_ij. Using Equations (1)-(6), the average path loss of the A2G link between GCS i and UT j considering the LoS and NLoS probabilities can be described as in Equation (7). From Equation (7), the received power of GCS i from UT j (P^RX_ij) can be represented as in Equation (8), where P^TX_ij indicates the transmit power of UT j associated with GCS i.
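As an illustration of the model above, the following sketch computes the LoS/NLoS probabilities from the elevation angle. It assumes the widely used sigmoid form P_LoS = 1/(1 + a·exp(−b(θ − a))) for the elevation-angle-dependent LoS model; the function names and the (a, b) values used in the usage note are illustrative, not the paper's parameters.

```python
import math

def los_probability(h_j: float, d_ij: float, a: float, b: float) -> float:
    """Sigmoid LoS probability as a function of the elevation angle.

    theta_ij = arcsin(h_j / d_ij) in degrees; a and b are environment-dependent
    statistical parameters (e.g., suburban vs. dense urban).
    """
    theta_deg = math.degrees(math.asin(h_j / d_ij))
    return 1.0 / (1.0 + a * math.exp(-b * (theta_deg - a)))

def nlos_probability(h_j: float, d_ij: float, a: float, b: float) -> float:
    """NLoS probability: P_NLoS = 1 - P_LoS."""
    return 1.0 - los_probability(h_j, d_ij, a, b)
```

As expected from the model, a higher elevation angle (e.g., a UT almost directly above the GCS) yields a LoS probability close to one, while a shallow angle in a built-up environment yields a probability close to zero.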
As mentioned above, instead of instantaneous small-scale fading, the excessive path losses depending on the LoS and NLoS paths, ζ_LoS and ζ_NLoS, are included in the path loss model. If the small-scale fading effect is included, the empirical mean of the signal-to-interference-plus-noise ratio (SINR) between GCS i and UT j is given by Equation (3), with the additional terms representing the path losses caused by small-scale fading of the LoS and NLoS paths, respectively. Here, Jensen's inequality is used in both inequalities (4) and (5). For a simple performance evaluation, the upper bound in (5) is adopted as the performance measure instead of the empirical mean of the SINR. Furthermore, we verify that this upper bound is sufficiently tight to be used in the proposed algorithm. Accordingly, the SINR between GCS i and UT j can be represented as in Equation (10), where I^t_ij(k) is an indicator function that determines whether the k-th frequency resource of GCS i is assigned to UT j.
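The probability-weighted average path loss of Equation (7) and the received power of Equation (8) can be sketched as follows. The averaging is done in the dB domain for simplicity, which is one common convention and an assumption here; all helper names and numeric values in the usage note are illustrative.

```python
import math

C = 3e8  # speed of light v_l (m/s)

def fspl_db(f_c: float, d_ij: float) -> float:
    """Free-space path loss in dB: 20*log10(4*pi*f_c*d_ij / v_l)."""
    return 20.0 * math.log10(4.0 * math.pi * f_c * d_ij / C)

def avg_path_loss_db(f_c, d_ij, p_los, zeta_los_db, zeta_nlos_db):
    """LoS/NLoS-probability-weighted average path loss (dB), in the spirit
    of Eq. (7): the free-space term is common, the excess losses differ."""
    base = fspl_db(f_c, d_ij)
    return p_los * (base + zeta_los_db) + (1.0 - p_los) * (base + zeta_nlos_db)

def rx_power_dbm(p_tx_dbm, f_c, d_ij, p_los, zeta_los_db, zeta_nlos_db):
    """Received power (dBm) = transmit power (dBm) minus average path loss."""
    return p_tx_dbm - avg_path_loss_db(f_c, d_ij, p_los, zeta_los_db, zeta_nlos_db)
```

For example, at f_c = 2 GHz and d_ij = 1 km the free-space term alone is roughly 98.5 dB, and the average path loss always lies between the pure-LoS and pure-NLoS values.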

B. UAV MOBILITY MODEL
UTs' movement is modeled as a mixed mobility (MM) model, which is a combination of the random waypoint and random walk models [18]. The detailed operation of the mixed mobility model is described as follows.

1) When the current waypoint is reached, UT j selects the next waypoint.
2) UT j ascends or descends to the next waypoint.
4) According to the flight maintenance probability p^s_j, UT j holds or alters its position during T^d_j.
5) The process is repeated until the learning is complete.
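Since the full step details above are environment-specific, the following is only a rough sketch of one MM-style position update: the UT moves toward its waypoint, draws a new random 3D waypoint on arrival, and holds its position with probability p_s. The function and its arguments are hypothetical illustrations, not the paper's exact model.

```python
import random

def mm_step(pos, waypoint, speed, h_min, h_max, r_u, p_s):
    """One mixed-mobility update for a UT.

    pos, waypoint : (x, y, z) tuples; z is the altitude
    speed         : distance covered per update interval
    h_min, h_max  : altitude bounds for new waypoints
    r_u           : horizontal extent for new waypoints (illustrative square)
    p_s           : flight maintenance (hold) probability
    """
    if random.random() < p_s:
        return pos, waypoint  # hold the current position this interval
    dx = [w - p for w, p in zip(waypoint, pos)]
    dist = sum(d * d for d in dx) ** 0.5
    if dist <= speed:
        # Waypoint reached: snap to it and draw the next random 3D waypoint.
        pos = waypoint
        waypoint = (random.uniform(-r_u, r_u), random.uniform(-r_u, r_u),
                    random.uniform(h_min, h_max))
    else:
        # Otherwise move `speed` units along the straight line to the waypoint.
        pos = tuple(p + speed * d / dist for p, d in zip(pos, dx))
    return pos, waypoint
```

With p_s = 1 the UT never moves; with p_s = 0 it advances `speed` units toward the waypoint every call, which mirrors the hold-or-alter behavior of step 4.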

III. HIERARCHICAL MULTI-AGENT Q-LEARNING FOR OPTIMAL FREQUENCY REUSE AND POWER CONTROL
To maximize network-wide EE in multi-UAV wireless networks while reducing the computational complexity of RL, we propose a hierarchical multi-agent Q-learning framework. As shown in Fig. 1, the proposed RL framework consists of distributed multi-agent inner-loop RL for finding the UTs' optimal transmit power and centralized outer-loop RL for determining the optimal frequency reuse factor. A change in the frequency reuse factor requires a large control overhead across the entire network, so it needs to be adjusted only intermittently through the outer-loop RL. In contrast, the UTs' transmit power can be controlled at every time step through the inner-loop RL to maximize network-wide EE. The detailed operations of the inner-loop RL and the outer-loop RL are as follows.

A. INNER-LOOP RL
The goal of the inner-loop RL is to find the optimal transmit power of the UTs for maximizing EE. Fig. 2 shows the detailed operation of the distributed multi-agent inner-loop RL. In the inner-loop RL, each GCS performs the role of an agent. According to the frequency reuse factor (µ), the number of frequency resources available to each agent is determined. Additionally, each agent should manage one Q-table per divisor of µ_n, where µ_n denotes the maximum value of µ; µ itself is determined by the outer-loop RL and is one of the divisors of µ_n. The agents share their rewards with the central head, and the integrated reward is redistributed to the agents at each time step. In the proposed inner-loop RL algorithm, because each agent considers only its own action set and state set, the computational complexity of the proposed algorithm is significantly reduced compared to a centralized Q-learning algorithm.

Algorithm 1: Detailed procedure of the proposed hierarchical multi-agent Q-learning-based optimal frequency reuse and power control in multi-UAV wireless networks
1: Place GCSs according to the inter-GCS spacing distance D_g.
2: Determine the altitude of UTs randomly between h_min and h_max, and place UTs within R_u of the horizontal position of each GCS.
3: Initialize the Q-tables of the inner-loop and outer-loop agents, and the frequency reuse factor µ.
4: Partition GCSs into ceil(N_g / µ) clusters.
5: for every episode do
6:   for every iteration do
7:     Calculate the SINR between GCS i and UT j (Γ_ij) for all i and j.
8:     Allocate frequency band k of GCS i, which provides the greatest SINR, to UT j, for all j.
9:     if mod(t, T_outer) == 0 then
10:      Choose the outer-loop action A^OUT_t by the decayed ε-greedy policy.
11:      The central header adjusts the frequency reuse factor µ according to A^OUT_t and re-partitions the GCSs.
12:      Re-calculate Γ_ij(t) and η^µ_t(i).
13:      Move on to the next state S^OUT_{t+1}, calculate R_{t+1}, and update the Q-value of the outer-loop RL.
...
17:      Re-calculate Γ_ij(t) and η^µ_t(i).
18:      Move on to the next state S^{IN,(i,µ)}_{t+1}, calculate R_{t+1}, and update the Q-value of the inner-loop RL.

Detailed descriptions of the state, action, and reward of the inner-loop RL are as follows.
• Inner-loop RL state: When the frequency reuse factor is µ at time step t, the inner-loop state of GCS i (S^{IN,(i,µ)}_t) is defined as in Equation (14). Here, s^{IN,(i,µ)}_t(k) denotes the amount of transmit power on the k-th frequency resource when the frequency reuse factor is µ at time step t, and |S^{IN,(i,µ)}_t| = N_c. In Equation (14), "None" means that the k-th frequency resource is not assigned to any UT. Additionally, P^TX_min and P^TX_max indicate the minimum and maximum transmit power of the UTs, respectively.
• Inner-loop RL action: When the frequency reuse factor is µ at time step t, GCS i adjusts the transmit power of its associated UTs as in Equations (15) and (16). Here, ∆^TX_p, −∆^TX_p, and "0" represent "transmit power up", "transmit power down", and "maintain the current transmit power", respectively.
• Inner-loop RL reward: The objective of the proposed hierarchical multi-agent Q-learning is to maximize the EE of the multi-UAV wireless networks. Accordingly, we define the EE of GCS i when the frequency reuse factor is µ at time step t (η^µ_t(i)) as in Equation (17). Here, B_µ is the bandwidth of each frequency resource when the frequency reuse factor is µ, and B_µ = B_tot / N_f, where B_tot is the total bandwidth of each cluster and N_f is the total number of frequency resources. Additionally, P^CRT_j is the fixed circuit power consumption of UT j. Using Equation (17), the reward of the proposed hierarchical multi-agent Q-learning algorithm can be expressed as in Equation (18).
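The per-GCS EE of Equation (17) and the shared reward redistributed by the central head can be sketched as follows. Two assumptions are made here: the throughput is a Shannon sum-rate over the assigned resources, and the network-wide reward is a plain sum of the agents' EE values; all helper names are illustrative.

```python
import math

def energy_efficiency(b_mu, sinrs, p_tx, p_crt):
    """EE of one GCS: sum-rate over its assigned resources divided by the
    total (transmit + circuit) power of its associated UTs.

    b_mu  : bandwidth of each frequency resource (Hz)
    sinrs : linear SINR on each assigned resource
    p_tx  : transmit power of each associated UT (W)
    p_crt : fixed circuit power consumption per UT (W)
    """
    throughput = sum(b_mu * math.log2(1.0 + g) for g in sinrs)
    power = sum(p + p_crt for p in p_tx)
    return throughput / power

def shared_reward(ee_per_gcs):
    """Reward the central head redistributes to every agent: here simply
    the sum of all agents' EE values (an assumed aggregation rule)."""
    return sum(ee_per_gcs)
```

Because every agent receives the same shared reward, a power choice that raises one cell's rate but floods a neighbor with interference is penalized through the neighbors' reduced EE terms.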

B. OUTER-LOOP RL
To find the optimal frequency reuse factor µ, we consider a central header that manages all GCSs as the agent of the outer-loop RL. The total number of frequency resources (N_f) that can be used in the entire network is fixed. Therefore, according to the variation in µ, the number of resources available in each GCS (N_c) differs, with N_c = N_f / µ. As µ increases, the number of resources available in each GCS decreases, but the intercell interference also decreases. Therefore, finding the optimal µ is crucial for maximizing network-wide EE. Detailed descriptions of the state, action, and reward of the outer-loop RL are as follows.
• Outer-loop RL state: At time step t, the outer-loop state (S^OUT_t) is defined as in Equation (19). In practical wireless networks, the number of divisors of µ_n is not large, so the Q-table of the outer-loop RL remains small.
• Outer-loop RL action: At time step t, the central header adjusts the frequency reuse factor as in Equation (20), where ∆_µ, −∆_µ, and "0" denote "increase µ", "decrease µ", and "maintain the current µ", respectively.
• Outer-loop RL reward: The reward of the outer-loop RL is the same as that of the inner-loop RL.
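The outer-loop search space described above can be sketched as follows: the feasible reuse factors are the divisors of µ_n, and each choice fixes the per-GCS resource count N_c = N_f / µ (helper names are illustrative).

```python
def feasible_reuse_factors(mu_n: int):
    """All divisors of mu_n: the set of values the outer loop can assign to mu."""
    return [d for d in range(1, mu_n + 1) if mu_n % d == 0]

def resources_per_gcs(n_f: int, mu: int) -> int:
    """N_c = N_f / mu: frequency resources available in each GCS."""
    return n_f // mu
```

For µ_n = 6 and N_f = 6, the outer loop chooses among µ ∈ {1, 2, 3, 6}, trading per-GCS bandwidth (N_c = 6, 3, 2, 1) against intercell interference.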

C. POLICY
We adopt the decayed ε-greedy model to adaptively control the ratio between exploitation and exploration [19]. The policy utilized in this paper is as follows.
a_t = a random action in A, with probability ε; arg max_{a∈A} Q(s_t, a), with probability 1 − ε. (21)
Here, ε is decayed over the episodes as a function of ε_init, E, and ψ×|A|. ε_init and E represent the initial value of ε and the current episode index, respectively. Additionally, |A| represents the cardinality of A, and ψ is an exploration parameter that adjusts the attenuation rate of ε. It is noteworthy that the importance of exploration depends on the number of actions. Under this policy, the agent acts randomly with probability ε and chooses the action maximizing the Q-value with probability 1 − ε. By varying ε, we can adaptively adjust the ratio between exploration and exploitation to find the optimal solution quickly and accurately.
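A minimal sketch of the decayed ε-greedy rule follows. The exact decay schedule in Equation (21) involves ε_init, E, and ψ×|A|; the exponential form used here is an assumption chosen only to illustrate the idea that larger action sets keep exploring longer.

```python
import math
import random

def decayed_epsilon(eps_init: float, episode: int, psi: float, n_actions: int) -> float:
    """Illustrative decay schedule: epsilon shrinks with the episode index E,
    attenuated by psi * |A| so that larger action sets decay more slowly."""
    return eps_init * math.exp(-episode / (psi * n_actions))

def select_action(q_row, eps: float) -> int:
    """Decayed epsilon-greedy: explore with probability eps, else pick the
    action with the highest Q-value in this state's row."""
    if random.random() < eps:
        return random.randrange(len(q_row))
    return max(range(len(q_row)), key=lambda a: q_row[a])
```

With eps = 0 the rule is purely greedy, and the decay guarantees that early episodes explore widely while later episodes mostly exploit the learned Q-values.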

D. Q-TABLE UPDATE
The Q-table of the proposed hierarchical multi-agent Q-learning is updated as follows:
Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (R_{t+1} + β max_a Q(s_{t+1}, a)), (22)
where α and β are the learning rate and discount factor, respectively. With α, we can control the speed of the Q-value update, and the ratio between the current reward and the future expected reward is adjusted by β.
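The tabular update above can be sketched as follows, assuming the standard Q-learning form with learning rate α and discount factor β (the data structure used for the Q-table is an illustrative choice):

```python
def q_update(q, s, a, reward, s_next, alpha=0.1, beta=0.9):
    """One Q-learning step: Q(s,a) <- (1-alpha)*Q(s,a)
    + alpha*(R + beta * max_a' Q(s',a')).

    q is a dict mapping each state to a list of Q-values, one per action.
    """
    best_next = max(q[s_next]) if q[s_next] else 0.0
    q[s][a] = (1 - alpha) * q[s][a] + alpha * (reward + beta * best_next)
    return q[s][a]
```

For example, starting from Q(s, a) = 0 with α = 0.1, β = 0.9, a reward of 1 and a best next-state value of 2 yields 0.1 × (1 + 0.9 × 2) = 0.28.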
In the proposed hierarchical multi-agent Q-learning algorithm, the GCSs are first placed with a constant spacing D_g, and the UTs are randomly distributed in the three-dimensional network area around each GCS. Because each inner-loop agent needs to learn one Q-table per divisor of µ_n, relatively more learning opportunities should be given to the inner-loop agents than to the outer-loop agent. Therefore, the outer-loop RL is performed only intermittently, every T_outer. Additionally, in each episode, the 3D positions of the UTs are updated according to the MM model. The detailed procedure of the proposed hierarchical multi-agent Q-learning-based optimal frequency reuse and power control algorithm is summarized in Algorithm 1.
Finally, the proposed hierarchical multi-agent Q-learning-based optimal frequency reuse and power control problem for maximizing network-wide EE can be formulated as in Equation (23), subject to
C1: P^TX_min ≤ P^TX_ij ≤ P^TX_max, ∀i ∈ N_g, j ∈ N_u. (24)
Here, C1 is the transmit power constraint for the UTs, and C2 describes the constraint that only one channel of the serving GCS can be allocated to a UT. Additionally, C3 defines the constraint that the serving GCS is not changed until the end of the learning, and C4 means that GCS i has N_c frequency resources.

TABLE 2: Simulation parameters
Max. and min. transmit power (P^TX_min, P^TX_max): 0.5, 2.0 (W)
Fixed circuit power consumption (P^CRT): 20 (W)
Step size of power increment (∆^TX_p): 0.75 (W)
Outer-loop RL period (T_outer): 20 (episodes)
Learning rate (α): 0.1
Discount factor (β): 0.9
Exploration parameter (ψ): 130

IV. RESULTS AND DISCUSSION
We show the performance results according to variations in the inter-GCS distance (D_g) and the cell radius (R_u). We conducted simulations considering the following (D_g, R_u) combinations: (100, 50), (100, 80), (200, 100), (500, 100), (200, 150), and (500, 300) (m). Initially, the users of each cell were randomly distributed within the cell radius R_u. However, not all GCSs served the same number of UTs because each UT was associated with the GCS that provided the highest received signal power. In addition, when D_g < 2R_u, the number of UTs in a cell boundary or overlapping area increases, and the intercell interference becomes severe. Conversely, when D_g ≥ 2R_u, GCSs receive relatively low intercell interference. The other simulation parameters are summarized in Table 2. Furthermore, a random action (RA) algorithm and a hierarchical RL-based binary action (HRL-BA) algorithm were considered as benchmarks to compare the performance of the proposed algorithm in terms of network-wide EE. A detailed description of these benchmark algorithms is as follows.
• Random Action (RA) Algorithm with Fixed µ: Each UT randomly chooses its transmit power, assuming that µ is fixed. Because the optimal solution cannot be obtained in complicated network environments, the convergence of the proposed hierarchical multi-agent Q-learning algorithm toward the optimal solution can be roughly demonstrated through the random action algorithm. That is, because of the extremely high computational complexity required for obtaining the optimal solution based on an exhaustive search, we compared the performance of the proposed algorithm with the random action algorithm for each µ.
• Hierarchical RL-based Binary Action (HRL-BA) Algorithm: Similar to the proposed algorithm, HRL-BA exploits a hierarchical multi-agent Q-learning framework. However, HRL-BA uses binary actions when agents adjust the transmit power of the UTs. HRL-BA has a relatively low computational complexity compared to the proposed algorithm.

Fig. 3 shows the accumulated average reward versus episode for the exhaustive search, random action, and proposed algorithms when N_g = 4, N_u = 8, N_f = 2, T_outer = 20, D_g = 100, R_u = 50, and µ_n = 2. To obtain these results, we performed 1,000 episodes, where each episode had 5,000 iterations. As shown in this figure, the proposed hierarchical multi-agent RL algorithm converges well to the optimal solution obtained by the exhaustive search algorithm.
Figs. 4a-4f show the accumulated average reward versus episode for the proposed, HRL-BA, and random action algorithms under N_g = 12, N_u = 72, and N_f = 6 according to the combinations of (D_g, R_u). Figs. 4a, 4b, 4c, 4d, 4e, and 4f show the results under (D_g, R_u) = (100, 50), (100, 80), (200, 100), (500, 100), (200, 150), and (500, 300) (m), respectively. To obtain these results, we performed 1,000 episodes, where each episode had 50,000 iterations. Each random action algorithm had a fixed frequency reuse factor. Thus, this algorithm cannot obtain any performance improvement from changes in the frequency reuse factor; it can obtain performance improvement only from power control under the fixed frequency reuse factor. If the transmit power of a UT becomes larger, the received signal strength increases, but the intercell interference could also increase, which makes obtaining the optimal transmit power both important and difficult.
By comparing the pairs (Fig. 4a, Fig. 4b) and (Fig. 4c, Fig. 4e), when R_u increases for the same D_g, the EE decreases due to the increase in the amount of intercell interference. In contrast, when D_g increases for the same R_u, the EE can increase because the intercell interference decreases. Additionally, since each GCS can use more frequency resources when µ is low, the average EE value over the relatively large number of users can improve. Because obtaining the optimal solution by exhaustive search is infeasible owing to its computational complexity, we instead observe through the error bars in Figs. 4a-4f that the proposed algorithm roughly converges toward the optimal solution in all simulation scenarios. That is, by adjusting the UTs' transmit power in the inner-loop RL and finding the optimal frequency reuse factor in the outer-loop RL, the proposed hierarchical multi-agent Q-learning increases the chances of finding the optimal solution.
Moreover, the performance behavior of the proposed algorithm according to the variation in T_outer is shown in Fig. 5. When T_outer = 2, very frequent changes of µ are needed, which would place a significant burden on network operators. In contrast, when T_outer becomes significantly larger, it is difficult to find the optimal number of frequency resources in each cell, and thus large reward oscillations occur as the episodes progress; that is, the GCSs must then approach the optimal network-wide EE by performing transmit power control only. Therefore, it is very important to set an appropriate T_outer value in consideration of the characteristics of the network environment.

Table 3 summarizes the maximum EE and throughput for the proposed and random action algorithms under N_g = 12, N_u = 72, and N_f = 6 according to the variation in (D_g, R_u). As shown in this table, EE and throughput are significantly affected by variations in D_g and R_u. We find that the largest D_g combined with the smallest R_u, e.g., (500, 100), gives the greatest EE. As mentioned before, when D_g ≥ 2R_u, GCSs receive relatively low intercell interference, leading to increases in EE and throughput. Conversely, when R_u increases for the same D_g, the intercell interference becomes severe because the number of UTs in a cell boundary or overlapping area increases. Consequently, EE and throughput worsen.

V. CONCLUSION
In this paper, we proposed a hierarchical multi-agent Q-learning-based optimal frequency reuse and power control algorithm to maximize network-wide EE in uplink multi-UAV wireless networks. First, to mitigate the intercell interference problem, we focused on obtaining the optimal frequency reuse factor with centralized outer-loop RL. Additionally, the UTs' transmit power was optimally adjusted by using distributed inner-loop RL. Because the simultaneous adjustment of these factors is very challenging and complicated in practice, it is almost impossible to propose an optimal algorithm that works in real time. Nevertheless, in this paper, we obtained the best EE results with the hierarchical multi-agent Q-learning algorithm compared to the random action algorithms using a fixed frequency reuse factor. To evaluate the performance in various network environments, we considered many (D_g, R_u) combinations. Even when the number of UTs in a cell boundary or overlapping area increased, we showed that the proposed algorithm outperformed the conventional algorithms and converged. As future work, we will investigate the joint optimization of the power consumption of the transceiver and the propulsion system for maximizing network-wide EE in multi-UAV wireless networks.