Optimal Learning Paradigm and Clustering for Effective Radio Resource Management in 5G HetNets

Ultra-dense heterogeneous networks (UDHN) based on small cells are a requisite part of future cellular networks, as they are proposed as one of the enabling technologies to handle coverage and capacity problems. However, co-tier and cross-tier interferences in the UDHN, arising from its K-tiered architecture, severely degrade the quality of service. Machine learning based radio resource management, either through independent learning or cooperative learning, is a proven efficient scheme for interference mitigation and quality of service provision in the UDHN in both distributive and cooperative manners. However, the optimal learning paradigm selection, i.e., independent or cooperative learning, and the optimal cooperative cluster size in cooperative learning for efficient radio resource management in the UDHN are still open research problems. In this article, a Q-learning based radio resource management scheme is proposed and evaluated for both distributive and cooperative schemes using independent and cooperative learning. The proposed Q-learning solution follows the $\epsilon-$greedy policy for optimal convergence. The simulation results for the UDHN in an urban setup show that, in comparison to the independent learning paradigm, cooperative learning has no significant impact on macrocell user capacity. However, there is a significant improvement in small cell user capacity and in the sum capacity of the cooperating small cells in the cluster. Significant increases of 48.57% and 37.9% are observed in the small cell user capacity and the sum capacity of the cooperating small cells, respectively, using cooperative learning as compared to independent learning, which establishes cooperative learning as the optimal learning strategy in the UDHN. The improvement in small cell user capacity comes at the cost of increased computational time, which is directly proportional to the number of cooperating small cells.
To solve the issue of computational time in cooperative learning, an optimal clustering algorithm is proposed. The proposed optimal clustering reduces the computational time of cooperative Q-learning by a factor of four.


I. INTRODUCTION
Evolution of wireless communication technologies in the last two decades has resulted in an explosive increase in cellular network users and quality of service (QoS) requirements, such as higher data rate, throughput, coverage, and capacity, while reducing latency to a negligible value (nearly zero). The evolution of cellular networks from 1G to 5G results in
improved QoS and Quality of Experience (QoE) key performance indicators (KPIs). Due to the massive increase in cellular network users, the concept of small cell based k-tiered UDHN was proposed for improved coverage and capacity [1], [2], [3], [4], [5]. Although the k-tiered UDHN successfully met the requirements of improved coverage and capacity, it raised related issues, such as effective radio resource management (RRM) for efficient interference mitigation, as the UDHN deployment results in co-tier and cross-tier interferences which severely degrade the QoS for both macrocell users and small cell users. For effective utilization of the k-tiered UDHN in 5G, co-tier and cross-tier interferences have to be mitigated through efficient RRM [5], [6], [7].
RRM is an essential aspect of wireless communication systems, especially in k-tiered UDHN which consists of multiple types of cells with different frequencies, technologies, and coverage areas. RRM includes but is not limited to load balancing, carrier aggregation, interference mitigation, and self-organizing networks (SON) implementation. Due to the large number of applications of RRM in UDHN, RRM is a widely researched topic in the context of 5G UDHN.
RRM for effective interference mitigation in the k-tiered UDHN is not an easy task due to the dynamic nature of the UDHN. Many solutions, most of which were non-adaptive, were proposed in the literature, but these non-adaptive solutions cannot handle the dynamic nature of the UDHN, where the density of small cells, and therefore the interference conditions, continuously change [6], [8]. In comparison to the non-adaptive RRM algorithms, some machine learning based RRM algorithms have recently been proposed in the literature which perform significantly better than the non-adaptive algorithms. Reinforcement learning, a subdomain of machine learning, is utilized to devise adaptive RRM through Q-learning in the UDHN, where the algorithm optimizes the allocation of network resources, such as bandwidth, power, and spectrum, to different nodes in the network by continuously learning and interacting with the environment [9].
Reinforcement learning based RRM in the UDHN through Q-learning has shown remarkable performance in recently proposed solutions in the literature [5], [6], [7]. Q-learning can be applied distributively through independent learning or cooperatively through cooperative learning. However, the literature is silent about the optimal learning scheme in real-time UDHN for 5G cellular networks. In this article, we explore the optimal learning strategy for efficient RRM in terms of various KPIs and the provision of QoS to both macrocell and small cell users simultaneously.
A. RELATED WORK
Small cells are low-powered wireless access points that are used to provide coverage and capacity in densely populated areas. UDHN are networks that consist of a combination of small cells, macro cells, and other network elements to provide seamless and efficient coverage and capacity [5], [6], [7], [10]. Small cells in the UDHN work by complementing macro cells, offloading traffic from them, and creating a user balance among the tiers of the network. They are deployed in areas where there is high demand for data, such as shopping centers, airports, and sports stadiums. This helps to reduce congestion and improve the overall QoS and capacity for users through efficient user association [2], [4], [10], [11]. The deployment of small cells in HetNets can also help service providers to meet the increasing demand for high-speed data services and support the growth of the Internet of Things (IoT). The combination of small cells and macro cells in a UDHN allows service providers to create a flexible and scalable network that can adapt to changing network conditions and user demand [2]. Recently, many solutions have also been proposed based on the software-defined networking (SDN) architecture for efficient deployment of the UDHN in the millimeter wave (mmW) spectrum [12], [13], [14].
Although the implementation of small cells has various advantages, the initial cost, overall system reliability, and interferences due to the k-tiered architecture are unresolved problems [3]. Interference is one of the main challenges in the small cell UDHN. In the co-channel deployment mode, small cells operate in the same frequency bands as macro cells, and their proximity to each other can result in interference between small cells and between small cells and macro cells in the same network, which constitute co-tier interference (I_co) and cross-tier interference (I_cr), respectively [15], [16]. In addition, a relatively small coverage area leads to multiple small cells being deployed close to each other, resulting in a UDHN and further exacerbating the interference problem [15], [16], [17]. Therefore, this article focuses on interference mitigation in the UDHN through optimal resource allocation.
Recently, researchers have presented a number of strategies to improve the reliability, throughput, QoS, QoE, coverage, and capacity of the small cell UDHN by mitigating I_co and I_cr through intelligent and adaptive schemes, as non-adaptive solutions for RRM are not considered useful due to the dynamic nature of the k-tiered UDHN [15], [16], [17], [18]. Therefore, the concept of self-organizing networks (SON), outlined in LTE 3GPP TS 36.300 [19], is utilized in RRM techniques for adaptive solutions [20]. SON integration in the UDHN has also been proven profitable and cost-effective for network operators [21], [22]. However, the integration of SON in the UDHN requires some source of cognition or intelligence, which can be provided through reinforcement learning (RL).
RL is a type of machine learning algorithm that is used to optimize decision-making in dynamic environments. RL is applied in communication systems using Q-learning (QL) for optimization of network resource allocation, such as spectrum and power allocation, and traffic routing. QL algorithms learn from network conditions, such as traffic patterns and interference levels, and determine the best actions to take in real-time to improve network performance. This results in more efficient use of network resources, improved network coverage and capacity, and reduced latency [17], [18]. Overall, QL has the potential to be a key enabler for the successful deployment of 5G networks, providing the ability to optimize network operations and improve network performance in dynamic and complex environments [23].
QL can be implemented in many different ways. While the basic QL algorithm follows a similar structure, the differences between QL schemes can be substantial. One QL scheme differs from another in terms of value function representation, exploration-exploitation tradeoff, reward function, learning rate, and discount factor [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34]. The novelty of different QL schemes lies in the formulation of the optimization problem, its constraints in terms of network key performance indicators, and the design of an efficient reward function to solve the optimization problem. The solutions proposed for adaptive power allocation to small cells define the optimization problem in different ways and with different constraints, and therefore require a novel reward function to solve it. Each QL scheme may have a different reward function design to address the underlying optimization problem. Similarly, the discount factor and learning rate determine the convergence of the proposed QL scheme [23].
QL has been widely applied to the UDHN either through the independent learning (iL) mode in a distributed manner or the cooperative learning (cL) mode in a cooperative manner. Both learning paradigms have pros and cons. Independent Q-learning is generally simpler to implement and more scalable, as it does not require communication between the cells. On the other hand, cooperative Q-learning can lead to more efficient decision-making and better network performance, as the cells can learn from each other's experiences. The reward function, an integral and critical part of QL, is impacted by the learning paradigm. In iL, each learning agent acts according to individual reward optimization, whereas in cL, cooperating agents learn to form a joint RF [35]. QL solutions for RRM have been proposed in both iL and cL. Although some solutions have been proposed in both iL and cL, an optimal learning scheme for real-time implementation has not been proposed [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34].
The literature review reveals multiple limitations of state-of-the-art QL based adaptive power control schemes for the UDHN; for example, the value function representations and proposed reward functions in these QL schemes are either biased toward Mu or Su, as discussed in [15] and [16]. Some of the proposed schemes could not establish their superiority over the state-of-the-art solutions. Many solutions proposed in cL performed better than iL, but the performance was bottlenecked by the communication overhead and computational time [36], [37]. Therefore, there is still a need to investigate an ideal or optimal learning paradigm that can be deployed in real-time implementations to improve the performance of the UDHN.
In this article, we propose a QL scheme to address the above-mentioned limitations of the state-of-the-art QL based adaptive power allocation schemes and also devise an optimal learning paradigm based on the performance of the proposed QL scheme in the independent and cooperative learning paradigms.

B. CONTRIBUTIONS
In this article, we have investigated the performance of QL-based RRM for the small cell UDHN to provide QoS to both macrocell and small cell users through simultaneous mitigation of co-tier and cross-tier interferences. The proposed solution is evaluated in both the independent and cooperative learning paradigms to find an optimal learning paradigm for real-time deployment scenarios. The following are the major contributions of the paper:
• A QL-based adaptive power allocation scheme is proposed to handle the co-tier and cross-tier interferences simultaneously in the small cell UDHN, which ensures QoS for both macrocell and small cell users.
• The proposed QL scheme models the small cell UDHN as a single or multi-agent Markov Decision Process (MDP) where small cells play the role of QL agents in the network and implement QL for RRM.
• The defined optimization problem which maximizes the capacity of macrocell users, small cell users, and the sum capacity of small cell users is constrained over the minimum required QoS thresholds to guarantee QoS for all users and is solved through the proposed reward function for the QL algorithm.
• The optimal learning paradigm is proposed by evaluating the proposed QL based adaptive power allocation scheme in both learning paradigms, i.e., independent and cooperative learning.
• Simulation results in various combinations of co-tier and cross-tier interferences based on the standard 3GPP simulation setup prove the optimality of the cooperative learning paradigm in terms of standard KPIs at the cost of increased computational time.
• For efficient deployment of small cell UDHN, an optimal clustering algorithm is also proposed and evaluated. The simulation results show that optimal learning, i.e. cooperative learning, is the efficient QL implementation scheme when deployed with optimal clustering technique which significantly reduces its computational time.
The paper is organized as follows: the system model for the small cell UDHN is presented in section II for a comparison of QL implementation in iL and cL for RRM and QoS provision. In section III, the optimization problem of the underlying study is presented. The QL algorithm in the iL and cL paradigms to solve the optimization problem is presented in section IV, whereas the simulation parameters and setup for evaluation of the proposed solution are discussed in section V. The results of Monte-Carlo simulations to compare the performance of iL and cL in 3GPP interference setups are presented in section VI, whereas the conclusion of the paper is presented in section X.

II. SYSTEM MODEL
A system model composed of the k-tiered sC UDHN is presented in Fig. 1, where small cells (sC) are deployed in the co-channel mode under the overlaid macrocell (mC). Each k-th user of the n-th sC from N_sc, Su_{n,k}, where k ∈ K and K = {1, 2, 3, . . . , K}, is also deployed randomly indoors in the sC. The sC operate in the co-channel deployment mode. BS_m and BS_s equally divide their transmission power among their users [39]. It is assumed that the QoS parameters of the sC are provided by the network operator in the SON procedures, such as self-configuration.
The presence of Su_{n,k} in the mC due to the k-tiered sC UDHN results in I_cr to Mu_i, which affects its SINR. In the downlink, the SINR at any Mu_i, $\varsigma_i^m$, in the presence of I_cr can be calculated as

$$\varsigma_i^m = \frac{p_i^m h_{m,i}^m}{\sum_{n=1}^{N_{sc}} p_n^s h_{m,i}^n + \sigma^2} \tag{1}$$

where $p_i^m$ and $h_{m,i}^m$ are the power transmitted by BS_m to Mu_i operating in the mC and the corresponding channel gain, and $p_n^s$ and $h_{m,i}^n$ are the power transmitted by the n-th BS_s and its channel gain to Mu_i, which results in I_cr. In addition to I_cr, AWGN also impacts $\varsigma_i^m$ and is represented by the variance $\sigma^2$ in (1).
Unlike (1), the SINR at the k-th Su of the n-th sC, Su_{n,k}, $\varsigma_{n,k}^s$, in the downlink operating on the subband f ∈ F_sub, is impacted by I_cr from BS_m, I_co from the neighboring BS_s, and thermal noise. The $\varsigma_{n,k}^s$ is obtained as

$$\varsigma_{n,k}^s = \frac{p_{n,k}^s h_{n,k}^n}{p^m h_{n,k}^m + \sum_{j \neq n} p_j^s h_{n,k}^j + \sigma^2} \tag{2}$$

where $p^m$, $p_j^s$, and $p_{n,k}^s$ are the powers transmitted by BS_m, the BS_s of the j-th sC, and the BS_s of the n-th sC to Su_{n,k}, respectively. Similarly, $h_{n,k}^m$, $h_{n,k}^n$, and $h_{n,k}^j$ are the channel gains from BS_m and the BS_s of the n-th and j-th sC to Su_{n,k}, respectively. From $\varsigma_i^m$ and $\varsigma_{n,k}^s$ in (1) and (2), respectively, the normalized capacities at Mu_i and Su_{n,k} are given below:

$$mC_i = \log_2\left(1 + \varsigma_i^m\right) \tag{3}$$

$$sC_{n,k} = \log_2\left(1 + \varsigma_{n,k}^s\right) \tag{4}$$

where mC_i and sC_{n,k} are the capacities of Mu_i and Su_{n,k}, respectively.
The accumulated value of the capacities of all Su_{n,k} in the system, sC_sum, is represented as follows:

$$sC_{sum} = \sum_{n=1}^{N_{sc}} \sum_{k=1}^{K} sC_{n,k} \tag{5}$$
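As a numerical sketch of the SINR and normalized-capacity relations above, the following Python snippet computes a downlink SINR and its capacity; the transmit powers, channel gains, and noise variance are toy values for illustration, not taken from the paper:

```python
import numpy as np

def downlink_sinr(p_serving, h_serving, p_interf, h_interf, noise_var):
    """SINR at a user: serving received power over interference plus noise."""
    signal = p_serving * h_serving
    interference = np.sum(np.asarray(p_interf) * np.asarray(h_interf))
    return signal / (interference + noise_var)

def normalized_capacity(sinr):
    """Normalized (per unit bandwidth) Shannon capacity, log2(1 + SINR)."""
    return np.log2(1.0 + sinr)

# Toy numbers: one macrocell user interfered by two small cells
sinr_m = downlink_sinr(10.0, 0.05, [0.1, 0.2], [0.01, 0.02], 1e-3)
cap_m = normalized_capacity(sinr_m)
```

The same two helpers apply to (2) and (4) by swapping in the small cell user's serving power and adding the macrocell term to the interference list.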

III. PROBLEM FORMULATION
The problem of RRM in 5G UDHN addressed in this research has been one of the major research problems for many years in the domain of ultra-dense HetNets as a 5G enabling technology.
Recently, many solutions have been proposed for QoS provision to all users in the k-tiered sC UDHN architecture through optimal RRM and interference mitigation [15], [16], [27], [28], [29], [31], [34]. However, the fundamental difference among the recently proposed RRM techniques lies in the optimization problem, in terms of both the objective function and its constraints. The optimization problem (OP) defined in the underlying research strives to maximize mC_i, sC_{n,k}, and sC_sum while keeping mC_i and sC_{n,k} above the QoS capacity thresholds ξ_m and ξ_c through interference mitigation. The adaptive sC transmission-power-based intelligent interference mitigation scheme handles I_co and I_cr simultaneously and thus guarantees QoS to Mu_i and Su_{n,k} by improving SINR. The OP for adaptive power allocation is defined as follows, assuming that the BS_s of the n-th sC, operating over a subband f ∈ F_sub, can select a transmit power p^s from the available set of powers P = {p_1, p_2, . . . , p_max}, where p_1 and p_max define the range of discrete values of transmit powers which any BS_s may select:

$$\max_{p_n^s} \; \left( mC_i + sC_{n,k} + sC_{sum} \right) \tag{6a}$$
$$\text{s.t.} \quad p_n^s \in P, \tag{6b}$$
$$mC_i \geq \xi_m, \quad \forall i, \tag{6c}$$
$$sC_{n,k} \geq \xi_c, \quad \forall n, k. \tag{6d}$$

The objective function (6a) maximizes mC_i, sC_{n,k}, and sC_sum, whereas the constraints of the OP, (6b)-(6d), bound p_n^s, mC_i, and sC_{n,k}. The OP constraints (6c) and (6d) ensure QoS provision to Su and Mu simultaneously in the sC UDHN. The OP in (6a)-(6d) can be solved through a learning-based adaptive solution by relating p_n^s to mC_i and sC_{n,k} while constraining over the QoS capacity thresholds. The learning framework to solve (6a)-(6d) is discussed in the following sections.

IV. OPTIMAL RESOURCE ALLOCATION IN s C UDHN USING QL
QL is an iterative algorithm to apply RL in a system where the environment is dynamic or unknown. QL agents interact with the environment and strive for the maximum reward through a learned optimal policy, π*. However, learning π* is a computationally intensive process that requires improving the policy π in each iteration. π* can be found whether or not prior information on the environment is available. The sC UDHN can be modeled as an MDP to implement QL-based RRM for interference mitigation and QoS provision. The detailed modeling of the sC UDHN as an MDP is provided in [15], [16], [32], [33], and [34].

A. PROPOSED QL ALGORITHM
Based on the rationale of QL and the sC UDHN as an MDP, we propose a QL algorithm (Algorithm 1) for optimal RRM in the sC UDHN. The proposed QL algorithm is based on the definition of the sC UDHN as an MDP, where each BS_s acts as a QL agent and adaptively selects an action, a_n^t, i.e., a transmission power, based on its learning. The actions of the agents, a_n ∈ A, are a discrete set of transmission powers, P, of BS_s, as defined in section III. The step size between elements of P is calculated through the following equation [32], [33], [34]:

$$\Delta p = \frac{p_{max} - p_1}{|P| - 1} \tag{7}$$

where |P| is the number of available power levels.
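As an illustration, the discrete power set P and its uniform step size can be generated as follows; the power bounds and number of levels are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative bounds and resolution (not from the paper): the BS_s may
# transmit between p_1 = 0.01 W and p_max = 0.1 W over 10 discrete levels.
p_1, p_max, levels = 0.01, 0.1, 10

step = (p_max - p_1) / (levels - 1)   # uniform step between elements of P
P = np.linspace(p_1, p_max, levels)   # discrete action set of transmit powers
```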
In the iterative process of learning and improvement, the QL agent updates a Q-table (QT) based on the actions, a_n^t, and states, x_n^t, of the QL agent [15]. At time t = 0, the QT is initialized with no entries and a random state, x_n^t. During the iterative process, an action a_n^t is selected based on the exploration-exploitation policy (EEP) [15].
After the selection of a_n^t at time t, the reinforcement R^{t+1} is applied according to (9), resulting in the new state, x_n^{t+1}, of the agent, and the QT is updated [16]:

$$Q^{t+1}\left(x_n^t, a_n^t\right) = (1 - \alpha)\, Q^t\left(x_n^t, a_n^t\right) + \alpha \left[ R_f^t + \gamma \max_{a \in A} Q^t\left(x_n^{t+1}, a\right) \right] \tag{9}$$
where R_f^t is the reward function (RF) of the QL algorithm. For the proposed QL algorithm, R_f^t is defined as a function of mC_i, sC_{n,k}, ξ_m, and ξ_c at any time t and is given below.
The part (a) of R_f^t in (10) encourages the system to maximize the capacities of Mu_i and Su_{n,k}, whereas the part (b) guarantees that the minimum QoS requirements are met. A value of z > 1 in (10) gives a slight preference to mC_i over sC_{n,k}, as Mu_i is the primary user in the network. z and d_th are user-defined parameters which are selected in line with the literature [15], [16], [32]. The systematic design and development of an effective R_f^t is presented in [15]. The EEP in (8) involves the cL and iL learning paradigms. In iL, each QL agent in the system learns and acts independently, regardless of the learning of neighboring agents and the impact of its actions on the other agents or the environment. On the other hand, cL considers the learning of the other agents and the impact of their related actions while learning, and therefore shares the learning information with neighboring agents. In contrast to the learning process in iL, learning agents in cL cooperate with neighboring agents by sharing rows of their updated QT. The details of iL and cL are discussed in the following subsections.
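The update and action-selection steps described above can be sketched as a minimal tabular agent. The reward shaping below (a z-weighted sum of capacities when both QoS thresholds are met, and a fixed penalty otherwise) is an illustrative assumption in the spirit of (10), not the paper's exact reward function:

```python
import random

class IndependentQLAgent:
    """Minimal tabular Q-learning agent for one small cell (iL sketch).

    The state encoding, reward shaping, and hyperparameter values are
    illustrative assumptions, not the paper's exact design."""

    def __init__(self, actions, alpha=0.5, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)    # discrete transmit-power levels P
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.Q = {}                     # Q-table: (state, action) -> value

    def select_action(self, state):
        # epsilon-greedy exploration-exploitation policy (EEP)
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q.get((state, a), 0.0))

    def reward(self, cap_macro, cap_small, xi_m, xi_c, z=1.5):
        # part (a): weighted capacity maximization, macro user preferred (z > 1)
        if cap_macro >= xi_m and cap_small >= xi_c:
            return z * cap_macro + cap_small
        # part (b): penalty when either QoS threshold is violated
        return -1.0

    def update(self, state, action, r, next_state):
        # Q-learning update: blend old estimate with reward plus
        # discounted best next-state value
        best_next = max(self.Q.get((next_state, a), 0.0) for a in self.actions)
        old = self.Q.get((state, action), 0.0)
        self.Q[(state, action)] = old + self.alpha * (r + self.gamma * best_next - old)
```

With epsilon set to zero the agent is purely greedy, which is how a converged policy would be exploited.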

B. INDEPENDENT LEARNING VS COOPERATIVE LEARNING
In the sC UDHN as an MDP, the BS_s are the learning agents of QL, which repeatedly interact with the surrounding environment to learn π* by improving π in each iteration. According to the EEP, the learning of the agents can be in either the iL or the cL mode. Both learning modes are explained in the following subsections.

1) INDEPENDENT LEARNING
In iL, the learning of each agent in the UDHN is independent of the other agents in the environment. While learning in iL, each agent treats everything around it, including any other agents present, as part of the environment and acts in a self-interested manner. No agent cooperates with other agents by accepting or sharing any information; therefore, no prior information is available to the agents in this learning paradigm. In iQL, no agent shares any QL information with neighboring agents.

2) COOPERATIVE LEARNING
In the iL paradigm, each agent of the sC UDHN learns π* for RRM individually and without any prior information; therefore, learning π* takes more time and resources. Furthermore, in iL, all agents learn an optimal policy, π*, individually in a somewhat greedy manner, regardless of the negative impacts on the neighboring agents. Contrary to iL, in cL the sC cooperate with the neighboring agents to exchange QL-related information.

[Algorithm 1: QL algorithm for optimal RRM in the sC UDHN for 5G CN in the iL and cL paradigms. Each QL agent defines its size and coverage area and shares its updated QT with all cooperating agents, i.e., all QL agents within a radius R_C of the QL agent nearest to the mC; otherwise, the updated QT is not shared.]

[FIGURE 2. Apartment strips to create the simulation setup based on a single macrocell user and a single small cell user per macrocell and small cell, respectively [38].]

Unlike iL, prior information about the agents and the environment is available to new agents entering the system in cL. Therefore, learning π* is more robust and effective in cQL as compared to iQL. The cooperation of the agents also helps the neighboring agents to account for the environment while learning π*, in such a way that their impact on the performance of other agents is minimized. Therefore, cQL can further reduce I_co in the sC UDHN and the convergence time for new agents entering the system. Based on the above rationale, cL is a step ahead of iL, where useful information is shared with the cooperating sC for optimal RRM. In cL, the sC cooperate by sharing one of the following three types of information: i) episodic information, ii) instantaneous information, and iii) the individually learned π* [40].
Based on the above cooperation techniques, in the proposed algorithm each QL agent shares its QT with the cooperating agents to cooperatively learn π* and adaptively allocate p_n^s to handle I_co and I_cr and improve the capacity of the sC while meeting the minimum required QoS parameters, as proposed in [27], [28], and [32].
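The QT-sharing step in cL can be sketched as follows, assuming a dictionary-based Q-table and a simple blending rule for merging rows received from a cooperating neighbor; the blending weight and the selection of which rows to share are illustrative assumptions:

```python
def share_q_rows(sender_Q, receiver_Q, states_to_share, weight=0.5):
    """Merge shared Q-table rows from a cooperating neighbor (cL sketch).

    sender_Q / receiver_Q map (state, action) pairs to Q-values.
    Only rows for states in states_to_share are transferred; neighbor
    values are blended into the receiver's own estimates."""
    for (state, action), q_val in sender_Q.items():
        if state in states_to_share:
            old = receiver_Q.get((state, action), 0.0)
            receiver_Q[(state, action)] = (1 - weight) * old + weight * q_val
    return receiver_Q
```

Restricting the transfer to a subset of states keeps the exchanged payload small, which matters because the communication overhead grows with the number of cooperating small cells.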

C. COOPERATING CLUSTERING
The limiting factor in the performance of cQL is the number of cooperating sC, n, in the cooperating cluster. As n increases, the size of the QT also increases, which results in increased computational time and overhead. To handle the issue of large cluster size, an optimal clustering algorithm is proposed, presented in Fig. 10. The optimal clustering algorithm combines the neighboring sC into small overlapping clusters. Each sC decides its cooperating agents based on the distance threshold, d_nt, of the neighboring sC discovered earlier using automatic neighbor discovery (AND) in the self-configuration phase. Although there is no limit on the number n in a cluster, the size of the cluster remains limited due to d_nt. According to the proposed algorithm, an sC can be part of multiple clusters due to the random deployment of sC in the sC UDHN. All the clusters execute cQL in parallel, which results in fast convergence of cQL in less computational time and with negligible computational overhead.
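The distance-threshold clustering described above can be sketched as follows; cell positions and the threshold d_nt are hypothetical inputs, and neighbor discovery is abstracted as a simple pairwise distance check:

```python
import math

def form_clusters(positions, d_nt):
    """Overlapping cooperative clusters via a distance threshold (sketch).

    positions: {cell_id: (x, y)}. Each small cell clusters with every
    neighbor (e.g., found via automatic neighbor discovery) within d_nt.
    A cell may appear in several clusters, as in the proposed scheme."""
    clusters = {}
    for cid, (x, y) in positions.items():
        members = {cid}
        for nid, (nx, ny) in positions.items():
            if nid != cid and math.hypot(x - nx, y - ny) <= d_nt:
                members.add(nid)
        clusters[cid] = members
    return clusters
```

Because each cluster is anchored on one cell and bounded by d_nt, cluster sizes stay small even as the overall small cell density grows, and the per-cluster cQL instances can run in parallel.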

D. CONVERGENCE OF Q-LEARNING ALGORITHM
In the Q-learning algorithm, an optimal policy (π*) in an MDP is learned through an iterative process. The algorithm works by updating estimates of the Q-values of the state-action pairs based on the received rewards and the Q-values of the next state-action pairs in each iteration. One of the key properties of Q-learning is its ability to converge to the optimal Q*-values, given certain conditions. Specifically, Q-learning is guaranteed to converge to the optimal Q*-values based on the following factors:
• An appropriate selection of the learning rate parameter (α), which determines the degree to which new information overrides old information. A high value of α may lead to no convergence, whereas a low value may result in slow convergence.
• An appropriate selection of exploration rate in EEP (8).
The exploration rate (ϵ) in EEP determines the degree to which the algorithm explores versus exploits the current best estimate of the Q-values. If the exploration rate is set too high, the algorithm may not converge, while if it is set too low, the algorithm may converge to a suboptimal policy.
• The state and action spaces must be finite.
• The MDP must satisfy the ''Markov property,'' which means that the future state and reward depend only on the current state and action, and not on any previous states or actions.
Based on the above factors, the convergence of QL can be achieved through the $\epsilon-$greedy policy in the EEP. π* is an ϵ-optimal policy for ϵ > 0 and δ ∈ (0, 1] if the following condition is met [32], [41], [42]:

$$\Pr\left( \left| Q^* - Q^{\pi} \right| \leq \epsilon \right) \geq 1 - \delta$$

where Q* represents the learned optimal Q*-value after the QL iterations and Q^π is the Q-value based on the current state-action pair. The proposed QL algorithm follows the $\epsilon-$greedy policy for optimal convergence, as in [15], [16], and [32].
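A minimal sketch of the ϵ-optimality check: treating the Q-tables as dictionaries, it tests whether the largest gap between Q* and Q^π has fallen below ϵ. The tabular representation and the deterministic (rather than probabilistic) check are simplifying assumptions for illustration:

```python
def max_q_gap(Q_star, Q_pi):
    """Largest absolute gap between two Q-tables over all known entries."""
    keys = set(Q_star) | set(Q_pi)
    return max(abs(Q_star.get(k, 0.0) - Q_pi.get(k, 0.0)) for k in keys)

def is_eps_optimal(Q_star, Q_pi, eps):
    """Declare epsilon-optimal convergence when the gap falls below eps."""
    return max_q_gap(Q_star, Q_pi) <= eps
```

In practice such a check can serve as a stopping criterion for the iterative QL loop: once the gap stays below ϵ across iterations, the current policy is taken as converged.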

V. SIMULATION SETUP AND PARAMETERS
The proposed QL algorithm is evaluated through Monte-Carlo simulations in a standard 3GPP setup [38] in MATLAB 2020a on a Core i7 machine with 16 GB of memory. The UDHN simulation setups, Fig. 3, are created by varying Setup 2b (sparse), Fig. 2a, based on the urban dual-strip model [38], [43]. The UDHN simulation setups, Fig. 3, are developed to cater to different combinations of I_co and I_cr based on the density of sC, the number of Mu and Su, and the location of apartment strips in the mC. The simulation parameters of the mC and sC are according to 3GPP TR 36.872 [38].

VI. RESULTS
The performance of the QL algorithm is evaluated in the multiple interference scenarios presented in Fig. 3 by increasing the density of sC in the system. The sC are added one by one, where each sC performs QL either through iL or cL. According to the algorithm, all sC learn independently in parallel but share the learned information in the form of a QT if operating in the cL mode.
The simulation results of QL in iL and cL are measured through various KPIs to find an optimal learning policy for QL-based RRM in the sC UDHN for 5G CN. The performance of the QL algorithm is compared in terms of the capacities of macrocell and small cell users, the sum capacity of small cells, the computational time, and the sum power transmitted by the small cells in the network.

A. CAPACITY OF MACROCELL USERS
The capacity of macrocell users is computed using QL in both learning paradigms, iL and cL, in all simulation setups of Fig. 3 and is presented in Fig. 4a. From Fig. 4a, it can be observed that the QL algorithm performed closely in both learning paradigms and provided the required capacity to the macrocell users for QoS. Hence, QoS requirements can be met through either of the learning schemes. The performance comparison is also summarized in Table 3. iL performed slightly better than cL by providing 2.97% higher capacity to macrocell users in the high I_co and I_cr setup, Fig. 3a, whereas in the other two setups, Fig. 3b and 3c, cL performed better than iL with 5.6% and 4.60% increases in macrocell user capacity, respectively. Although the performance difference is negligible, cL provided higher macrocell user capacity. The similar performance of both learning paradigms, iL and cL, for the capacity of macrocell users is due to the almost negligible impact of the cooperation of sC in cL on I_cr mitigation.

B. MINIMUM CAPACITY OF SMALL CELL USERS
Both iQL and cQL provided small cell user capacity above the minimum required capacity threshold to ensure QoS for small cell users in simulation setups 1-3, Fig. 3a-3c. However, cooperation among the neighboring small cells in the clusters has shown a significant impact on the small cell capacity. Fig. 4b presents the performance comparison for the minimum capacity of small cells using cL- and iL-based QL in simulation setups 1-3, Fig. 3a-3c. In all three simulation setups, the cQL algorithm performed significantly better than the iQL algorithm in the same setup, as shown in Fig. 4b. Using cQL, a minimum improvement of 23.61% is observed in setup 3, whereas the maximum improvement, 48.57%, is in setup 2. The performance comparison of iQL and cQL is summarized in Table 4, which establishes that cQL provides a higher minimum capacity of small cells in the UDHN as the density of the small cells increases. The improvement in the minimum capacity of small cell users using the cL paradigm is due to a significant reduction in I_co by efficient cooperation among the sC.

C. SUM CAPACITY OF SMALL CELL USERS
Although the sum capacity of small cell users is not a QoS parameter in the UDHN, a higher value of sum capacity indicates that the utilized RRM technique can provide a higher minimum capacity to all small cells in the UDHN. As the sum capacity is the sum of the capacities of all small cell users in the system, an improvement in the minimum capacity of an individual small cell user is reflected in the sum capacity. As cL has improved the minimum capacity of small cell users in all interference setups in Fig. 3a-3c, the sum capacity provided by cL is also higher than that of iL. The comparison of sum capacity using iL and cL is presented in Fig. 5a. The minimum improvement in sum capacity is for setup 3, which is 6.287%, and the maximum improvement is in the highest interference setup 1, which is 37.90%. The improvement in sum capacity using cL for all interference setups is summarized in Table 5. From Table 5, it can be inferred that cL provided higher sum capacity as compared to iL and can therefore be opted as the optimal learning policy.

D. COMPUTATIONAL TIME
The significant improvements in the minimum and sum capacities of small cells, and the competitive performance for macrocell user capacity, using the cL-based QL algorithm come at the cost of increased computational time and overhead compared to iL. In cL, all cooperating agents transmit and receive the entries of the Q-table (QT), which results in communication overhead that is directly proportional to the density of small cells in the system. The increased communication overhead in the cL paradigm also increases the computational time compared to the iL paradigm. However, the increase in computational time is not significant, as it is compensated by the decreased learning time due to the prior information available to small cells in the form of QT rows shared by the neighboring small cells. The computational time of the proposed cL-based QL algorithm is significantly less than that of other CL algorithms in the literature, but iL has slightly less computational time than cL, as is evident in Fig. 5b. A similar trend in computational time is observed in all three simulation setups 1-3, Fig. 3a-3c, using cL and iL. The increase in computational time using cQL as compared to iQL is summarized in Table 6.
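The QT-sharing step of cooperative learning can be sketched as follows. The exact merge rule used by the paper is not specified in this section, so a weighted average of neighbors' tables is assumed; `share_q_rows` and `shared_entries` are hypothetical helper names introduced only for this sketch.

```python
import numpy as np

def share_q_rows(q_tables, cluster, weight=0.5):
    """Hypothetical cooperative step: each agent in `cluster` blends
    its own Q-table with the element-wise mean of its neighbors'
    tables (an assumed merge rule, not the paper's exact one)."""
    merged = {}
    for agent in cluster:
        neighbours = [q_tables[n] for n in cluster if n != agent]
        mean_nbr = np.mean(neighbours, axis=0)
        merged[agent] = weight * q_tables[agent] + (1 - weight) * mean_nbr
    return merged

def shared_entries(n_agents, n_states, n_actions):
    """QT entries broadcast per cooperation round, one table per
    agent: linear in the number of cooperating cells, matching the
    text's claim that overhead is proportional to cell density."""
    return n_agents * n_states * n_actions
```

Because every additional cooperating cell contributes one more table per round, the communication overhead grows linearly with the cluster size, which is the cost the text attributes to cL.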

E. SUM POWER OF SMALL CELLS
Small cells transmitting at higher power levels cause strong I_co and I_cr and also reduce the energy efficiency (EE) of the system. iL reduced the sum power of the small cells operating in the cluster while providing capacities above the minimum required thresholds to maintain QoS for macrocell and small cell users. However, cooperation among the small cells through cL further reduced the sum transmit power in all three simulation setups 1-3, Fig. 3a-3c. The decrease in sum transmit power is accompanied by improvements in the capacities of the macrocell and small cells. A lower sum transmit power in conjunction with improved macrocell and small cell capacities indicates that the proposed solution can mitigate I_co and I_cr simultaneously through adaptive power allocation to the BSs in UDHN. The decrease in sum transmit power using cL is summarized in Table 7, which shows a minimum decrease of 17.20% in setup 2 and a maximum decrease of 29.71% in setup 1.

F. CONVERGENCE ANALYSIS
For the convergence of the proposed algorithm in the iL and cL paradigms, the ϵ-greedy policy is utilized based on (8) and (11). To learn the optimal policy (π*) through convergence of the Q-value to the optimal Q*-value, the QL parameters affecting convergence, i.e., the learning rate (α), exploration probability (ϵ), a finite state-action space, and an MDP satisfying the ''Markov property'', are selected in line with the literature [32], [33], [34]. The simulation parameters are summarized in Table 2. The QL algorithm is considered converged if the regret magnitude in (11) is less than 0.001 for 1000 consecutive iterations, where the maximum number of QL iterations is 75 × 10^3. The convergence of the regret to the defined threshold is presented in the Appendix for both iL and cL in the different simulation setups. It can be observed in Fig. 12a-12c that the regret magnitude converges to the required threshold and then remains constant after a certain number of QL iterations. The regret minimization to the threshold levels is also shown in the magnified views in Fig. 12a-12c. The QL iterations required for convergence of iL and cL are presented in Fig. 7 for all three simulation setups, and a statistical comparison of the number of iterations is presented in Table 8. Both iL and cL converge in fewer iterations than the maximum limit of QL iterations, and the overall pattern of the number of iterations remains similar. The slight difference between the numbers of iterations for iL and cL is due to the learning paradigm, where cL converges slightly earlier than iL.
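The ϵ-greedy update and the regret-based stopping rule can be sketched as below. The toy MDP (deterministic state cycle, fixed rewards) and the use of the Q-update magnitude as the "regret" are assumptions for illustration only; the paper's actual state, action, and reward definitions in (8) and (11) are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_learning(n_states=8, n_actions=4, alpha=0.1, gamma=0.9,
               epsilon=0.1, regret_tol=1e-3, patience=1000,
               max_iters=75_000):
    """Epsilon-greedy Q-learning with a regret-style stopping rule:
    the run is declared converged once the Q-update magnitude stays
    below `regret_tol` for `patience` consecutive iterations, with a
    hard cap of `max_iters` (75e3, as in the paper's setup)."""
    Q = np.zeros((n_states, n_actions))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))  # toy rewards
    s, stable = 0, 0
    for it in range(max_iters):
        # epsilon-greedy action selection
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))      # explore
        else:
            a = int(np.argmax(Q[s]))              # exploit
        s_next = (s + 1) % n_states               # toy deterministic MDP
        td = R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a]
        Q[s, a] += alpha * td
        # convergence check on the magnitude of the update
        stable = stable + 1 if abs(alpha * td) < regret_tol else 0
        if stable >= patience:
            return Q, it + 1
        s = s_next
    return Q, max_iters
```

The stopping rule mirrors the criterion described in the text: a threshold of 0.001 held for 1000 consecutive iterations, with the iteration budget capped at 75 × 10^3.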

VII. QoS PROVISION USING IL AND CL
In UDHN, QoS provision to macrocell and small cell users simultaneously through an effective interference mitigation scheme is a difficult task and one of the fundamental objectives of this research. In this section, we analyze the performance of QL using iL and cL for the simultaneous provision of QoS to macrocell and small cell users. The performance of iL and cL is also compared with other recently proposed solutions in the literature, Q-DPA [32], PA-DRL [33], FAQ [34], and Greedy, in the iL and cL paradigms.

A. QoS FOR MACROCELL USERS
QoS provision for macrocell users by iQL, cQL, and other recently proposed solutions in the literature [32], [33], [34] is presented in Fig. 8 for all three simulation setups, Fig. 3a-3c. It can be observed from Fig. 8a-8c that the proposed iQL and cQL algorithms provide QoS to macrocell users in all simulation setups in the presence of a cluster of sixteen small cells, whereas the other recently proposed solutions in the literature [32], [33], [34] fail to provide QoS in the high interference setup, i.e., simulation setup 1. Therefore, the proposed QL-based solution can effectively mitigate I_cr and provide QoS to macrocell users in a highly dense urban UDHN. However, there is no significant difference in the provision of QoS to macrocell users between the iL and cL paradigms, which is also evident from Fig. 4a.

B. QoS FOR SMALL CELL USERS
In UDHN, small cell users are victims of high I_cr and I_co. The proposed QL-based solution in the iL paradigm not only provided QoS to all small cell users in the cluster of sixteen small cells in all simulation setups but also outperformed the recently proposed solutions [32], [33], [34] by providing QoS to a higher number of small cells in the different interference setups. Similarly, cQL not only provided QoS to all small cell users but also improved the minimum and sum capacities of small cell users, whereas the other recently proposed solutions in the literature [32], [33], [34] could not provide QoS to all small cell users in the cL paradigm either.

VIII. OPTIMAL LEARNING PARADIGM FOR QL BASED RRM IN UDHN
From the performance comparison of the various KPIs in Sections VI and VII, it can be observed that cL has a negligible effect on the capacity of macrocell users as compared to iL in the case of UDHN. However, a significantly higher minimum capacity of small cell users and sum capacity of the cooperating small cells in the cluster can be observed using cL as compared to iL. This significant improvement comes at the cost of increased computational time T_c, which is directly related to the number of small cells in the cluster. In this research, we simulated a cluster of 16 small cells, built by adding small cells to the cluster one by one, which is 37.5% more small cells than recommended in 3GPP TR 36.872. The simulation results show that the proposed cL-based QL algorithm not only outperformed other recently proposed solutions in the literature but also proved its significance over the iL paradigm. Therefore, cL is the optimal learning strategy for QL-based RRM in UDHN.

IX. OPTIMAL CLUSTERING IN c L BASED QL
Although the simulation results and their comparison have established the superiority of cL over iL, the issues of increased computational time and overhead can be handled through optimal cooperative clustering of small cells in UDHN to further improve the performance of cL-based QL. An optimal clustering algorithm is proposed in Fig. 10 to handle the computational time and overhead. The proposed technique finds an optimal size of the cooperative cluster that successfully handles the issue of increased computational time in cooperative learning. The simulation results of optimal clustering in terms of computational time, overhead, and the optimal cluster size are discussed in the subsequent sections.

A. IMPACT OF CLUSTERING ON COMPUTATIONAL TIME
The simulation results for the proposed cL-based QL algorithm with the optimal clustering algorithm, evaluating the impact of reduced cluster size on computational time, are presented in Fig. 11b. Simulation setup 1, Fig. 3a, is utilized for this evaluation as it combines the highest I_co and I_cr. The proposed optimal clustering algorithm divided the cluster of 16 small cells into multiple overlapping smaller clusters of 2, 3, and 4 small cells; the resulting clustering in simulation setup 1, Fig. 3a, is presented in Fig. 11a. The largest cluster size in terms of small cells is therefore 4, although there is no limit on the number of smaller overlapping clusters. All clusters execute cQL in parallel, so the computational time of the largest cluster determines the maximum computational time. It is pertinent to mention that all small cells always operate independently and in parallel, whether the learning scheme is independent or cooperative. In the Monte-Carlo simulations, the maximum cluster size remained limited to four small cells, and the computational time for this cluster is 24 s, as shown in Fig. 11b. The computational time for the 16 small cells in the system is thus reduced to approximately that of 4 small cells, making the cQL algorithm with optimal clustering four times faster than a plain cL application in UDHN. The comparison of cL-based QL with and without optimal clustering is presented in Fig. 11b. The results for optimal clustering in simulation setups 2-3, Fig. 3b-3c, can be obtained similarly.
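The wall-clock argument above can be illustrated as follows. The assumption that per-cluster computational time grows linearly with cluster size is made purely for this sketch (a `time_per_cell` of 6 s is a hypothetical value chosen so the numbers line up with the reported 24 s and fourfold speedup).

```python
def parallel_wall_time(cluster_sizes, time_per_cell):
    """With all clusters executing cooperative QL in parallel, the
    wall-clock time is set by the largest cluster. Per-cluster time
    is assumed proportional to cluster size (illustrative only)."""
    return max(cluster_sizes) * time_per_cell

# one 16-cell cluster vs. overlapping clusters of at most 4 cells
t_single = parallel_wall_time([16], time_per_cell=6.0)
t_clustered = parallel_wall_time([4, 4, 3, 3, 2], time_per_cell=6.0)
speedup = t_single / t_clustered  # fourfold, as reported in the text
```

Because the clusters run concurrently, adding more overlapping clusters does not increase the wall-clock time; only the size of the largest cluster matters.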

B. IMPACT OF CLUSTERING ON COMPUTATIONAL OVERHEAD
In cQL, the computational overhead is due to the large size of the QT, which is the combination of the states x_n^t and actions a_c^t. The size of the QT grows with the number of small cells in the system because they share it with the cooperating agents. When all the small cells operate in a single cluster, or a cluster of unlimited size, the QT is large and the computational overhead of executing the QL algorithm is therefore high. An optimal cluster size reduces the number of small cells per cluster and thus the size of the QT, which in turn reduces both the computational overhead and the computational time. The optimal clustering algorithm, Fig. 10, divided the 16 small cells into small clusters whose largest size is four small cells; optimal clustering therefore reduced the computational overhead by four times as compared to using cL in a single cluster of 16 small cells in the simulation setup of Fig. 11a. However, the reduction in communication overhead may vary with the deployment scenario, the total number of small cells in the system, and the dynamic behavior of the UDHN.
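The fourfold overhead reduction can be checked with a back-of-the-envelope model. The linear growth of the shared QT with cluster size is an assumption consistent with the reported 16-to-4 reduction; `rows_per_cell` is a placeholder for the per-agent state-action bookkeeping, not a value from the paper.

```python
def shared_qt_size(cluster_size, rows_per_cell):
    """Size of the Q-table maintained and shared within a cluster,
    assumed to grow linearly with the number of cooperating cells
    (consistent with the fourfold reduction reported for 16 -> 4)."""
    return cluster_size * rows_per_cell

full = shared_qt_size(16, 32)    # single 16-cell cluster
capped = shared_qt_size(4, 32)   # largest optimal cluster (4 cells)
reduction = full / capped        # fourfold smaller QT per cluster
```

Under this model the overhead ratio depends only on the ratio of cluster sizes, which is why the same fourfold figure applies regardless of the per-cell table size.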

X. CONCLUSION
In this research article, the Q-learning algorithm in the independent and cooperative learning paradigms is explored for RRM to handle the co-tier and cross-tier interferences simultaneously in UDHN and to provide quality of service to both macrocell and small cell users. In the cooperative learning paradigm, learning agents share independently learned information with the neighboring agents in the cluster and utilize their mutual experience to learn the optimal policy, meeting the convergence conditions to improve the system KPIs jointly. The Q-learning algorithm successfully mitigated the co-tier and cross-tier interferences simultaneously in both the independent and cooperative learning paradigms and provided QoS to all users in the K tiers in a cluster of 16 small cells across various setups of the standard 3GPP interference scenarios, where other recently proposed solutions in the literature and greedy power allocation fail to meet the QoS requirements for both macrocell and small cell users at the same time. Cooperative learning provides a higher small cell user capacity and sum capacity of small cell users as compared to independent learning, at the cost of increased computational time. Although the impact of cooperative learning on the capacity of macrocell users is small as compared to independent learning, significant improvement can be observed in small cell capacity in the case of UDHN. Significant improvements of 48.57% and 37.9% in the small cell user capacity and sum capacity, respectively, establish cL as the optimal learning scheme for QL-based RRM in UDHN for QoS provision. The performance improvement of cL results in increased computational time and overhead, which are directly proportional to the size of the cluster. To handle this issue, an optimal clustering algorithm is proposed and evaluated.
The optimal clustering, in combination with the superior learning scheme cL, improved robustness and reduced the computational complexity by a factor of 4 for a cluster of 16 small cells (37.5% more small cells than recommended in 3GPP TR 36.872). However, the improvement in robustness and the decrease in computational overhead scale with the number of small cells in the UDHN.

APPENDIX
The convergence of the regret to the required threshold using the proposed QL algorithm based on the ϵ-greedy policy in the iL and cL paradigms, for the simulation scenarios of Fig. 3, is presented in Fig. 12 for the convenience of the reader.