Optimal Resource Allocation Considering Non-Uniform Spatial Traffic Distribution in Ultra-Dense Networks: A Multi-Agent Reinforcement Learning Approach

Recently, the demand for small cell base stations (SBSs) has grown rapidly to accommodate the explosive increase in mobile data traffic. In ultra-dense small cell networks (UDSCNs), because the spatial and temporal traffic distributions are highly non-uniform, efficient management of the energy consumption of SBSs is crucial. Therefore, we propose a multi-agent distributed Q-learning algorithm that maximizes energy efficiency (EE) while minimizing the number of outage users. Through intensive simulations, we demonstrate that the proposed algorithm outperforms conventional algorithms in terms of EE and the number of outage users. Although the proposed reinforcement learning algorithm has significantly lower computational complexity than the centralized approach, we show that it converges to the optimal solution.


I. INTRODUCTION
The massive amount of data traffic generated by many different types of mobile services has led to a rapid increase in the number of base stations (BSs) deployed within the same network region [1]-[3], which is gradually accelerating network densification [4]-[6]. In addition, cellular networks have tended to use higher frequencies (e.g., frequencies in the terahertz range), which decreases the cell radius because of the larger attenuation of transmit power and requires more BSs to be deployed in the same network area [4], [5]. This network densification has resulted in the proliferation of ultra-dense small cell networks (UDSCNs). In UDSCNs, because the average inter-site distance between small cell BSs (SBSs) and users has decreased considerably, the link quality can be improved. However, this may cause severe interference between neighboring SBSs and vastly increase the energy consumption of the entire network [7]. In this regard, it is worth noting that 80% of the energy in mobile networks is consumed by radio access networks (RANs), and that BSs account for approximately 58% of the total power consumption in current cellular networks [8], [9]. Therefore, in UDSCNs, maximizing the energy efficiency (EE) of SBSs is one of the most critical research challenges facing next-generation communication networks.
The associate editor coordinating the review of this manuscript and approving it for publication was Xujie Li.
Recently, many researchers have studied how to minimize the network energy consumption of UDSCNs. In [10], the impact of the idle-mode operation of BSs, transmit power control, user density, and user distribution on network energy efficiency was considered to find the potential gains and limitations of ultra-dense networks (UDNs). The authors of [11] proposed a joint optimization framework for an energy-efficient switching on/off strategy and user association policy for UDNs with partial conventional BSs. In addition, an energy-aware user association and power allocation algorithm was proposed in [12] for ultra-dense networks with energy-harvesting BSs based on millimeter waves (mmWaves). Moreover, various Q-learning based techniques have been proposed to utilize radio resources effectively in UDN environments. In [13], a Q-learning based solution for a small-scale cooperative coded caching system was proposed to maximize the long-term expected cumulative traffic load served by SBSs without accessing macro cell BSs (MBSs). The authors of [14] proposed a Q-learning based dynamic load adjustment algorithm to reduce energy consumption and adjust the traffic load. The algorithm was shown to reduce energy consumption compared to the existing on/off algorithm and other conventional algorithms. Furthermore, a Q-learning based downlink transmit power control algorithm was proposed in [15]. A transfer learning method called hotbooting was applied to accelerate learning and reduce the energy consumption based on the estimated user density, without any information about the network and channel model of the other small cells.
VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The traffic generated by actual UDSCNs is geographically disproportionate. According to [16], half of the network sites carry only 15% of the total traffic, whereas 5% of the sites carry 20% of the traffic. Therefore, the network operator should efficiently manage and control network energy consumption by considering the dynamics of the spatial network traffic. Many studies have been conducted to cope with these spatial and temporal traffic dynamics. In [17], the authors presented a load balancing scheme based on deep reinforcement learning (DRL) to handle global and local traffic variations in irregular dense small cell networks. The authors of [18] proposed unmanned aerial vehicle (UAV)-assisted cell-edge mobile user offloading in non-uniform heterogeneous cellular networks. Here, cell-edge mobile users are periodically scheduled between coordinated ground base stations and flying UAVs. In addition, the authors of [19] solved the user association problem using resource and handover management based on the deep deterministic policy gradient (DDPG) method for mmWave networks. They showed that intelligent load-balancing handover could effectively associate users in high-load situations. In [20], the authors proposed cluster-based resource allocation and user association via efficient co-channel interference management in mmWave dense femtocell networks. This study converted the binary optimization problem into a continuous problem using deductive penalty functions and solved it by computing the difference of two convex functions. Furthermore, the authors of [21] proposed a load-aware cell selection scheme for multi-connectivity in intra-frequency 5G ultra-dense networks to efficiently utilize available idle resources and reduce the probability of radio link failure.
Previous methods based on optimization and deep reinforcement learning frameworks are computationally complex and require intensive iteration. In contrast, because tabular Q-learning does not rely on deep neural networks for function approximation, it avoids the computational overhead of neural network training incurred by those approaches. This motivated us to propose an SBS control algorithm based on multi-agent distributed Q-learning to maximize the EE while simultaneously minimizing the number of outage users in UDSCNs. The proposed algorithm, which considers the spatial traffic dynamics, efficiently controls the transmit power of SBSs. The main contributions of this study are as follows.
• Two types of network dynamics are considered for proposing a reinforcement learning algorithm that maximizes EE in UDSCNs: uniform/non-uniform spatial traffic distributions and random user mobility.
• Regardless of uneven spatial traffic distribution and unpredictable user movements, we demonstrate that the proposed multi-agent Q-learning algorithm can converge to the optimal solution obtained by exhaustive search.
• Even in ultra-dense network environments, the proposed algorithm outperforms conventional algorithms in terms of EE and the number of outage users. Achieving these two objectives simultaneously may not be feasible in a conventional optimization framework.
• The proposed algorithm can significantly reduce the computational complexity by allowing the agent to consider only its own state.
The remainder of this paper is organized as follows: In Section II, the system model of the proposed algorithm is presented. The proposed multi-agent Q-learning algorithm for maximizing EE while minimizing the number of outage users is described in Section III. Section IV verifies the effectiveness of the proposed algorithm through intensive simulations with respect to the EE and the number of outage users. Finally, conclusions are presented in Section V.

II. SYSTEM MODEL
Herein, we describe the system model for the proposed algorithm and the assumptions used in this study. Consider a downlink communication for UDSCNs configured with several MBSs (M), SBSs (N), and users (U), as shown in Fig. 1. The MBSs are considered as interferers in this network and the SBSs adjust their transmit power to maximize the system performance.

A. SINR CALCULATION
The channel quality between users and SBSs is measured by the reference signal received power (RSRP), which is commonly used as a channel quality metric between users and BSs in cellular networks. The RSRP between user $i$ and SBS $j$, $P_r(i,j)$, is expressed as $P_r(i,j) = P_t(j)/d(i,j)^{\rho}$, where $P_t(j)$ is the transmit power of SBS $j$, $d(i,j)$ is the distance between user $i$ and SBS $j$, and $\rho$ is the path loss exponent in UDSCNs. Using $P_r(i,j)$, the signal-to-interference-plus-noise ratio (SINR) of user $i$ for SBS $j$ can be calculated as
$$\gamma(i,j) = \frac{P_r(i,j)}{\sum_{n \in N \setminus \{j\}} P_r(i,n) + \sum_{m \in M} P_r(i,m) + \sigma_i^2}, \tag{1}$$
where $P_r(i,m)$ is the power received by user $i$ from MBS $m$ and $\sigma_i^2$ is the thermal noise power. As mentioned before, all SBSs other than the serving SBS, as well as all MBSs, are considered as interferers. When $\gamma(i,j) < \gamma_{th}$, $\forall j \in N$, user $i$ is considered an outage user. Here, $\gamma_{th}$ denotes the SINR outage threshold.
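As an illustration of the RSRP and SINR definitions above, the following is a minimal NumPy sketch. The function names (`rsrp`, `sinr`, `outage_mask`) and the array layout are hypothetical, and the sketch assumes that MBS interference follows the same distance-based path-loss model as the SBS signals, which the text does not state explicitly.

```python
import numpy as np

def rsrp(p_t, dist, rho):
    """RSRP P_r(i, j) = P_t(j) / d(i, j)**rho for every user/BS pair."""
    # p_t has shape (B,), dist has shape (U, B); the result has shape (U, B).
    return p_t[None, :] / dist ** rho

def sinr(p_sbs, d_sbs, p_mbs, d_mbs, rho, noise):
    """SINR of every user toward every SBS.

    All other SBSs and all MBSs are treated as interferers. Applying the
    same path-loss model to the MBSs is an assumption of this sketch.
    """
    pr_sbs = rsrp(p_sbs, d_sbs, rho)                    # shape (U, N)
    pr_mbs = rsrp(p_mbs, d_mbs, rho)                    # shape (U, M)
    total = (pr_sbs.sum(axis=1, keepdims=True)
             + pr_mbs.sum(axis=1, keepdims=True))       # shape (U, 1)
    interference = total - pr_sbs                       # drop the wanted signal
    return pr_sbs / (interference + noise)

def outage_mask(gamma, gamma_th):
    """True for users whose SINR is below gamma_th for every SBS."""
    return (gamma < gamma_th).all(axis=1)
```

Computing the interference as the total received power minus the wanted signal avoids an explicit loop over interfering SBSs.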

B. EE CALCULATION CONSIDERING SBS POWER CONSUMPTION
From equation (1), the achievable data rate of user $i$ for SBS $j$, $\zeta(i,j)$, can be obtained as
$$\zeta(i,j) = \frac{W_j}{|U(j)|}\log_2\bigl(1 + \gamma(i,j)\bigr). \tag{2}$$
Here, $W_j$ is the system bandwidth of SBS $j$ and $|U(j)|$ is the number of users associated with SBS $j$. Using equation (2), the EE of SBS $j$, $\xi(j)$, can be calculated as
$$\xi(j) = \frac{\sum_{i \in U(j)} \zeta(i,j)}{P_{tot}(j)}, \tag{3}$$
where $P_{tot}(j)$ is the total power consumption of SBS $j$, which can be represented as
$$P_{tot}(j) = P_c(j) + P_p(j) + \frac{1}{\delta}P_t(j). \tag{4}$$
Here, $\delta$ is the power amplifier efficiency, and $P_c(j)$, $P_p(j)$, and $P_t(j)$ are the fixed circuit power consumption, radio frequency (RF) power amplifier power consumption, and transmit power consumption, respectively. In particular, $P_p(j)$ describes the power consumption according to the variation in the cell load. Accordingly, $P_p(j)$ can be obtained as
$$P_p(j) = \left\lceil \frac{|U(j)|}{\kappa} \right\rceil P_{pa}, \tag{5}$$
where $\kappa$ is the maximum number of users supportable per RF power amplifier and $P_{pa}$ is the power consumed by each RF power amplifier.

Algorithm 1
    ...
        Move the position of user $i$ according to the random walk model with $v_i(\tau)$ and $\theta_i(\tau)$.
 6:   end for
 7:   for $t = 1 : t_{max}$ do
 8:     for $j = 1 : |N|$ do
 9:       Choose the action of agent $j$, $a_j(t)$, according to the decayed $\epsilon$-greedy policy with $\epsilon(\tau)$.
10:       Calculate the SINR of user $i$ for its serving SBS, $\gamma(i,j)$, and the achievable data rate of user $i$ for SBS $j$, $\zeta(i,j)$.
    ...
14:       if $\gamma(i,n) < \gamma_{th}$, $\forall n \in N$ then
15:       Calculate $\xi(j)$ by using $\zeta(i,j)$.
    ...
22:     end for
23:     Calculate $R(s(t+1), a(t))$ and update the Q-values for all agents using equation (10).
24:   end for
25: end for
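The rate, power, and EE definitions of this subsection can be sketched in Python as follows. All function and parameter names are hypothetical. Two modeling details are assumptions of the sketch rather than statements of the paper: the transmit power is divided by the amplifier efficiency $\delta$, and the number of active RF power amplifiers is the ceiling of $|U(j)|/\kappa$.

```python
import math
import numpy as np

def data_rate(bandwidth, n_assoc, gamma):
    """Achievable rate when the bandwidth is shared equally by n_assoc users."""
    return bandwidth / n_assoc * np.log2(1.0 + gamma)

def pa_power(n_assoc, kappa, p_pa):
    """Load-dependent RF power: one amplifier per kappa users (assumed ceiling)."""
    return math.ceil(n_assoc / kappa) * p_pa

def total_power(p_c, p_t, n_assoc, kappa, p_pa, delta):
    """Total SBS power; dividing p_t by the efficiency delta is an assumption."""
    return p_c + pa_power(n_assoc, kappa, p_pa) + p_t / delta

def energy_efficiency(bandwidth, gammas, p_c, p_t, kappa, p_pa, delta):
    """EE of one SBS (bit/s per watt) from the SINRs of its associated users."""
    gammas = np.asarray(gammas, dtype=float)
    n_assoc = gammas.size
    if n_assoc == 0:
        return 0.0          # no associated user, hence no delivered rate
    sum_rate = data_rate(bandwidth, n_assoc, gammas).sum()
    return sum_rate / total_power(p_c, p_t, n_assoc, kappa, p_pa, delta)
```

Returning zero EE for an SBS without associated users is a convention chosen here for simplicity.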

C. USER MOBILITY MODEL
We apply a random walk model to emulate users' unpredictable movements [22]. At each episode, the position of each user is altered according to the random walk model. The speed of user $i$ at episode $\tau$, $v_i(\tau)$, is randomly determined within $[0, v_{max}]$. In addition, the moving direction of user $i$, $\theta_i(\tau)$, is randomly chosen within $[0, 2\pi]$. Consequently, user $i$ moves with the velocity vector $\mathbf{v}_i(\tau)$ at episode $\tau$ as follows:
$$\mathbf{v}_i(\tau) = \bigl(v_i(\tau)\cos\theta_i(\tau),\; v_i(\tau)\sin\theta_i(\tau)\bigr). \tag{6}$$
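A minimal sketch of one random walk step per episode is given below. The function name `random_walk_step` is hypothetical, the speed and direction are drawn uniformly as described above, and one episode is treated as one unit of time, which is an assumption of the sketch.

```python
import numpy as np

def random_walk_step(pos, v_max, rng):
    """Move every user once: speed ~ U[0, v_max], direction ~ U[0, 2*pi)."""
    n_users = pos.shape[0]
    speed = rng.uniform(0.0, v_max, size=n_users)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_users)
    velocity = np.stack([speed * np.cos(theta), speed * np.sin(theta)], axis=1)
    # One episode is treated as one unit of time (an assumption).
    return pos + velocity
```

For example, `random_walk_step(pos, 0.1, np.random.default_rng(0))` displaces each row of a `(U, 2)` position array by at most 0.1 in Euclidean distance.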

D. SPATIAL TRAFFIC DISTRIBUTION IN UDSCN
An important characteristic of actual UDSCNs is the geographically disproportionate network traffic. Accordingly, we assume that a non-uniform spatial traffic distribution is generated according to the constraints described in [16]. This spatial traffic distribution is based on real-world measurements. According to [16], half of the network cells carry only 15% of the total network traffic, whereas 5% of the cells carry 20% of the traffic. Unfortunately, traffic growth is expected to be largest in cells that already have high loads. For instance, Figs. 2a and 2b show the network deployment results considering uniform and non-uniform spatial traffic distributions for 3 MBSs, 6 SBSs, and 36 users. The vertical axis denotes the relative traffic density in each cell.

Specifically, when $|U(i)| = |U|/|N|$, we assumed the traffic density of SBS $i$ to be 0.5. This value was also used as the reference for determining the traffic density of the other SBSs.
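To check whether a given per-cell traffic profile matches the measurement-based constraints quoted from [16], the two traffic shares can be computed as in the sketch below. The function name `traffic_shares` is hypothetical, and rounding the "top 5%" up to at least one cell is an assumption needed for small deployments.

```python
import numpy as np

def traffic_shares(cell_traffic):
    """Return (share of the least-loaded half, share of the busiest 5%)."""
    load = np.sort(np.asarray(cell_traffic, dtype=float))   # ascending order
    n = load.size
    total = load.sum()
    n_low = n // 2                       # least-loaded half of the cells
    n_top = max(1, (n + 19) // 20)       # ceil(n / 20), at least one cell
    return load[:n_low].sum() / total, load[n - n_top:].sum() / total
```

A profile consistent with [16] would return values close to 0.15 and 0.20, whereas a perfectly uniform profile over 20 cells returns 0.5 and 0.05.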

III. PROPOSED MULTI-AGENT Q-LEARNING ALGORITHM FOR MAXIMIZING EE IN UDSCN
We herein propose an EE maximization algorithm based on multi-agent Q-learning for UDSCNs with small cell clusters, as shown in Fig. 1. In our multi-agent distributed reinforcement framework, agents, states, actions, a reward, a Q-function, and a policy are defined as follows.

A. AGENT
Consider that each SBS is an agent of the proposed multi-agent reinforcement learning framework in UDSCNs. In a centralized reinforcement framework, a single agent can manage all state information of the SBSs, but it generates a large amount of overhead as a result of the computational complexity. Thus, we consider a multi-agent distributed Q-learning framework.

B. STATE
In this study, the agent does not share its state information with other agents. Thus, each agent considers only its own transmit power. The state of agent $j$ can be defined as
$$s_j(t) \in S = \{P^{min}_{t,N} : \Delta P_t : P^{max}_{t,N}\}, \tag{7}$$
where $P^{min}_{t,N}$ and $P^{max}_{t,N}$ are the minimum and maximum transmit power of the SBS, respectively. In addition, $\Delta P_t$ is the step size of the transmit power. Here, $\{a:b:c\}$ represents a set of values from $a$ to $c$ with a step size of $b$.

C. ACTION
To maximize the EE of UDSCNs, the agent can choose one of three actions: "transmit power up" ($+\Delta P_t$), "transmit power down" ($-\Delta P_t$), and "keep current transmit power" ($0$), as follows:
$$a_j(t) \in A = \{+\Delta P_t,\; -\Delta P_t,\; 0\}. \tag{8}$$
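The state grid and the three actions can be represented with integer indices, as in the following sketch. The names `power_levels`, `ACTIONS`, and `apply_action` are hypothetical, and clipping at the minimum and maximum power is an assumption, since the text does not say what happens when an agent at a boundary chooses to move past it.

```python
import numpy as np

def power_levels(p_min, p_max, step):
    """State set {p_min : step : p_max}, with both end points included."""
    n_levels = int(round((p_max - p_min) / step)) + 1
    return p_min + step * np.arange(n_levels)

# Actions in units of the power step: up, down, keep.
ACTIONS = (+1, -1, 0)

def apply_action(state_idx, action_idx, n_states):
    """Next state index; moves past either end are clipped (an assumption)."""
    return int(np.clip(state_idx + ACTIONS[action_idx], 0, n_states - 1))
```

Working with indices rather than power values keeps each agent's Q-table a plain array of shape (number of power levels, 3).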

D. REWARD
Assume that the agents share their reward information with one another to maximize the EE of the entire network. In addition, because minimizing the number of outage users is essential when maximizing the EE, we design an outage-aware reward in the proposed reinforcement framework. Accordingly, the reward of agent $j$, $R^c_j$, is defined in equation (9) in terms of $|U|$ and $U_{out}$, the total number of users and the number of outage users in the entire network, respectively.

E. Q-FUNCTION UPDATE
A Q-function is a state-action value function: it assigns a value to each action available in a given state of the agent, and this value represents the expected reward when that action is performed. In other words, it describes the benefit of an agent performing a particular action in a state under a specific policy. In this study, the Q-function $Q_j(s_j(t), a_j(t))$ is updated as
$$Q_j(s_j(t), a_j(t)) \leftarrow (1-\alpha)\,Q_j(s_j(t), a_j(t)) + \alpha\Bigl[R^c_j + \eta \max_{a \in A} Q_j(s_j(t+1), a)\Bigr]. \tag{10}$$
Here, $\alpha$ is the learning rate and $\eta$ is the discount factor.
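For reference, the standard tabular Q-learning update with learning rate $\alpha$ and discount factor $\eta$, together with $\epsilon$-greedy action selection for a given $\epsilon$, can be sketched as follows. The function names are hypothetical, each agent is assumed to hold its own Q-table indexed by (state, action), and the reward value is passed in because its exact form is not reproduced here.

```python
import numpy as np

def select_action(q_row, epsilon, rng):
    """Epsilon-greedy choice: random with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(q_row.size))
    return int(np.argmax(q_row))

def q_update(q, s, a, reward, s_next, alpha, eta):
    """In-place tabular update of one agent's Q-table of shape (|S|, |A|)."""
    target = reward + eta * q[s_next].max()      # bootstrapped return
    q[s, a] = (1.0 - alpha) * q[s, a] + alpha * target
    return q
```

Because every agent updates only its own table, the per-step cost of the update is independent of the number of SBSs.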

F. POLICY
We adopt the decayed $\epsilon$-greedy policy for extensive exploration in early episodes [23], [24]. According to $\epsilon(\tau)$, each agent chooses a random action with a probability of $\epsilon(\tau)$, and the optimal action with a probability of $1 - \epsilon(\tau)$. As the number of episodes increases, the value of $\epsilon(\tau)$ decreases; therefore, in the latter part of the learning, the agent exploits more than it explores. $\epsilon(\tau)$ is defined in equation (11) in terms of $\epsilon_{init}$, the initial epsilon value; $\chi$, the decay parameter; and $|A|$, the size of the action set of each agent. The detailed operational procedure of the proposed multi-agent distributed Q-learning algorithm is described in Algorithm 1.

• Reward-Optimal: The reward-optimal solution is obtained using the exhaustive search algorithm. This algorithm enumerates and checks all possible states of the agents. In the case of complicated network environments, it is difficult to obtain a reward-optimal solution owing to its high computational complexity.

• No Transmit Power Control (No TPC):
None of the agents control their transmit power. That is, each agent always sends its signal using maximum transmit power.

• Adaptive Transmit Power Control (A-TPC):
The transmit power of each agent is calculated by the number of associated users. In this study, user association was determined by measuring the SINR in the initial network deployment.
• Random Action: This algorithm randomly chooses an action in each episode. Because the exhaustive-search-based optimal solution cannot be obtained in simulation scenarios with extremely high computational complexity, we use the random action algorithm as a rough indication that the proposed algorithm converges to the optimal solution.

• EE Maximization based on Centralized Q-Learning (C-QL):
This algorithm is based on Q-learning and considers the overall state information of the agents. However, because all possible cases explorable by the agent should be considered, the size of the Q-table increases exponentially according to the number of agents and the sizes of the state and action sets. Because of its complexity, we cannot apply this algorithm to complicated network environments.

• EE Maximization based on Distributed Q-learning (D-QL):
The basic operation of this algorithm is almost identical to that of the proposed algorithm. The only difference is that reward sharing is not considered in this distributed Q-learning algorithm. That is, each agent only considers its own reward before choosing the action to perform. Hence, the reward $R^d_j$ is defined in equation (12) in terms of $|U(j)|$ and $U_{out}(j)$, the number of users associated with SBS $j$ and the number of outage users of SBS $j$, respectively.

The computational complexities of the reward-optimal algorithm, centralized Q-learning algorithm, distributed Q-learning algorithm, and the proposed algorithm are summarized in Table 2. Because the reward-optimal and centralized Q-learning algorithms are designed to consider all cases that could occur in networks, their computational complexity is significantly larger than that of the other algorithms. The proposed multi-agent distributed Q-learning algorithm can greatly reduce the computational complexity by allowing the agent to consider only its own state.

... each SBS according to the initial association results, A-TPC delivers performance results superior to those of the No TPC algorithm. In the case of the centralized Q-learning algorithm, because this algorithm needs to consider the states and reward information of all the SBSs, its convergence speed is relatively slow compared to that of the proposed algorithm. In addition, each agent in the distributed Q-learning algorithm tries to maximize its reward without considering the status and rewards of other agents. As a result, the transmit power of each agent gradually increases to reach the maximum power, and the result finally converges to that of the No TPC algorithm.
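To make the Q-table growth noted for centralized Q-learning concrete, the sketch below counts the Q-values that must be stored in each case. It is an illustration rather than a reproduction of Table 2: the function name is hypothetical, and it assumes that the centralized agent uses the joint state and joint action of all SBSs, whereas each distributed agent keeps a table over its own state and action only.

```python
def q_table_entries(n_agents, n_states, n_actions):
    """Return (centralized, distributed) numbers of stored Q-values."""
    # Joint state and joint action spaces grow exponentially with n_agents.
    centralized = (n_states ** n_agents) * (n_actions ** n_agents)
    # One small table of size n_states * n_actions per agent.
    distributed = n_agents * n_states * n_actions
    return centralized, distributed
```

With 6 agents, 10 power levels, and 3 actions, this gives 729,000,000 entries for the joint table against 180 entries in total for the per-agent tables.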
Figs. 4a-4c demonstrate that the proposed algorithm can converge to the optimal solution even for a network deployment with a non-uniform spatial traffic distribution, as shown in Fig. 2b. Similar to the results for the uniform traffic distribution, our proposed algorithm outperforms the conventional algorithms with respect to the accumulated average reward regardless of the increase in user mobility. The overall performance behavior is clearly similar to the case of uniform spatial traffic distribution, but the accumulated reward value is relatively smaller than that of the uniform distribution owing to the regionally biased traffic. Moreover, as learning progressed, the length of the error bars gradually became shorter, which shows that the learning progressed well.
To demonstrate the operational flexibility of the proposed multi-agent distributed Q-learning algorithm, we considered ultra-dense network environments. Figs. 5a and 6a show the network deployment results considering uniform and non-uniform spatial traffic distributions when $|M| = 5$, $|N| = 100$, and $|U| = 600$. Here, users were randomly distributed within 150 m of the SBS and moved according to the random walk mobility model with $v_{max}$ = 0.01 m/s. Figs. 5b and 5c show the accumulated average EE and the average number of outage users as the episodes progress for each algorithm under the network deployment with the uniform spatial traffic distribution in Fig. 5a. In addition, Figs. 6b and 6c show the accumulated average EE and the average number of outage users as the episodes progress for each algorithm under the network deployment with the non-uniform spatial traffic distribution in Fig. 6a. Even in ultra-dense network environments, the proposed algorithm outperforms the conventional algorithms in terms of the accumulated EE and the number of outage users. Because of the extremely high computational complexity of the reward-optimal and centralized Q-learning algorithms, these algorithms did not produce results in this simulation scenario. However, the error bars of the proposed algorithm and the random action algorithm provide a rough indication that the proposed algorithm converges to the optimal solution. Furthermore, in the case of the non-uniform spatial traffic distribution, distributed Q-learning yields several outage users because each agent only considers its own state and reward. In contrast, the average number of outage users of the proposed algorithm converges to zero. Table 3 compares the fairness of EE between SBSs for each algorithm once learning is complete. We used Jain's fairness index to obtain the fairness results [25].
Jain's fairness index can be represented as
$$J = \frac{\bigl(\sum_{j \in N} \xi(j)\bigr)^2}{|N| \sum_{j \in N} \xi(j)^2}. \tag{13}$$
With the No TPC, A-TPC, and distributed QL methods, because all SBSs transmit at almost maximum power, the difference in the energy efficiency between SBSs is smaller than that in the other algorithms. As a result, these algorithms produce superior SBS fairness results compared to the reward-optimal, random action, centralized QL, and proposed algorithms. Moreover, as expected, the SBS fairness results for the uniform spatial traffic distribution are larger than those for the non-uniform spatial traffic distribution. As the learning progresses, the SBSs with higher traffic density use relatively larger transmit power than those with lower traffic density. Consequently, the EE results of the SBSs gradually diverge, and accordingly, the SBS fairness in the non-uniform traffic distribution becomes smaller than that in the uniform traffic distribution.
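Jain's fairness index over the per-SBS EE values can be computed in a few lines. The sketch below uses the standard definition of the index; the function name `jain_index` is hypothetical.

```python
import numpy as np

def jain_index(values):
    """Jain's fairness index: (sum x)^2 / (n * sum x^2), between 1/n and 1."""
    x = np.asarray(values, dtype=float)
    return x.sum() ** 2 / (x.size * np.square(x).sum())
```

The index equals 1 when all SBSs have the same EE and falls to 1/|N| when a single SBS accounts for all of it, which is why methods that drive every SBS to nearly the same transmit power score high on fairness.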
To show the performance behavior against delayed learning information, we obtained Figs. 7a and 7b under a network deployment with the non-uniform spatial traffic distribution when $|M| = 3$, $|N| = 6$, and $|U| = 36$. In Fig. 7a, we assumed $v_{max}$ = 0.1 m/s while training the Q-values of each agent. In the test environments, however, the users moved with larger $v_{max}$ values to reflect the effect of the delayed learning information. Similarly, in Fig. 7b, we assumed $v_{max}$ = 0.2 m/s while training the Q-values of each agent, and the tests were again performed with larger $v_{max}$ values. It can be seen that the greater the difference in $v_{max}$ between the learning environment and the test environment, the larger the performance degradation. In addition, in the case of Fig. 7a, since $v_{max}$ is smaller than that in Fig. 7b, the agents have a chance to learn for more diverse positions. As a result, the performance degradation might be smaller. Fig. 7c shows how $\Delta P_t$ affects the system performance. To obtain this figure, we set $P^{max}_t$ to 5 W, and the reward-optimal solution is obtained using the exhaustive search algorithm under $\Delta P_t$ = 0.5 W. When $\Delta P_t$ = 0.01 W, because the number of selectable transmit power levels of each agent becomes too large, it takes a long time to reach the optimal solution. In contrast, when $\Delta P_t$ = 2.5 W, the agent rarely finds the optimal solution because the number of selectable transmit power levels is too small. Therefore, network operators should determine $\Delta P_t$ very carefully considering the network environment and spatial traffic distribution.

V. CONCLUSION
In this paper, we proposed an SBS power control algorithm based on multi-agent distributed Q-learning to maximize the network EE while reducing the number of outage users in UDSCNs. To consider practical network environments, we utilized uniform and non-uniform spatial traffic distributions based on real-world measurements. Even for the non-uniform distribution, we showed that the proposed algorithm converges well to the optimal solution obtained by the exhaustive search algorithm. In addition, to demonstrate the performance in ultra-dense network environments, we considered 100 SBSs and 600 users in a network with five small cell clusters. In this network environment, we demonstrated that the proposed algorithm outperforms conventional algorithms such as the random action, No TPC, A-TPC, and distributed Q-learning algorithms regardless of the increase in user mobility. Furthermore, by allowing the agent to consider only its own state, the computational complexity of the proposed algorithm can be significantly reduced compared to that of the centralized Q-learning algorithm.