BS-UE Association and Power Allocation in Heterogeneous Massive MIMO Systems

We consider a downlink Heterogeneous Massive Multiple-Input Multiple-Output (HM-MIMO) system with the macro-base station (BS) having hundreds of antennas and each of the micro-BSs having tens of antennas. Two lower bounds and one approximation of the achievable per-user equipment (UE) rate in closed forms under this system are derived and compared in terms of network utility. Using the derived approximation, three BS-UE association approaches are proposed. Firstly, a simple, sub-optimal, and heuristic multiple BS-UE association approach is designed, which achieves around 25% utility performance improvement compared with the conventional maximum received signal strength (Max-RSS) approach. Secondly, based on the results given by this heuristic approach, a learning approach for BS-UE association using convolutional neural network (CNN) is introduced. After being fully trained, the CNN can take any new BS-UE configuration as input and provide a sub-optimal BS-UE association for that configuration directly. It has only a small performance degradation compared with the proposed heuristic approach. Thirdly, realizing that the BS-UE connection probability in the proposed CNN architecture can be considered as a power allocation ratio, a combined power allocation and association approach is proposed. Its performance achieves as high as 60% utility improvement compared with the Max-RSS association and is also comparable to that achieved by the max-min power allocation approach which requires more than $10000\times $ running time. It is remarkable that by using the gradients of the derived achievable per-UE rate approximation with respect to the power control coefficients, accurate target data is in fact not required for training in this approach.


I. INTRODUCTION
Massive multiple-input multiple-output (MIMO) technology [1] has been regarded as one of the ''big three'' of fifth-generation (5G) wireless technologies due to its high spectral efficiency, good energy efficiency, and simple signal processing characteristics [2], [3]. In this paper, we will address issues related to base station (BS) and user equipment (UE) association as well as power allocation in Heterogeneous Massive MIMO (HM-MIMO) systems. Different from the conventional cellular systems, each BS in an HM-MIMO system has many antennas, and the number of antennas of a macro-cell BS and that of a micro-cell BS can be significantly different. In such a system, a BS usually transmits to many UEs in the same time-frequency The associate editor coordinating the review of this manuscript and approving it for publication was Jinming Wen . resource block (RB), and a UE is usually associated with multiple BSs with different numbers of antennas at various locations. In this case, BS-UE association in HM-MIMO systems becomes very challenging since each UE receives not only desired signals but also interferences from many antennas of several BSs at different locations [4], [5].
In conventional technologies, BS-UE association is usually implemented based on maximum received signal strength (Max-RSS) [6]. Specifically, a UE will associate with a BS which provides the Max-RSS. Although such a BS-UE association method is very easy to implement, some disadvantages are inherent with this method. Firstly, it will result in load imbalance, especially in Heterogeneous Networks (HetNets), where maco-BSs and micro-BSs have different transmission power. Secondly, it is a sub-optimal association method for maximizing both spectral and energy efficiencies. Lastly, it is not applicable to multiple BS-UE VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ association scenarios where multiple BSs jointly serve a UE (e.g., HM-MIMO, coordinated multipoint (CoMP)). In order to improve the simple Max-RSS association method, many new BS-UE association methods have been proposed for different purposes under various scenarios [4], [7]- [12], and a comprehensive survey on BS-UE association in 5G networks can be found in [13]. BS-UE association in Massive MIMO systems has also been explored. For example, a combined power allocation and user association problem was investigated in single-cell [14] and multi-cell [15] Massive MIMO downlink (DL) systems to improve energy efficiency. [16] considered Massive MIMO enabled HetNets. Using the Lagrangian dual analysis, an energy-efficient and fair user association was achieved by solving a network logarithmic utility maximization problem. There was also BS-UE association work for spectral efficiency maximization in Massive MIMO networks. Optimal single BS-UE association (i.e., each UE is associated with a single BS) for spectral efficiency maximization in Massive MIMO HetNets was analyzed in [17]. A simple, decentralized approach was also proposed, which was shown to be close to the globally optimal solution [17]. Different from [17], [5] considered a joint resource allocation and multiple BS-UE association problem. By limiting the BS cluster choices at different RB sets and specifying the number of UEs which can be served by every BS [5], a convex optimization problem for maximizing network utility is formulated to derive activity fractions of every served UE with respect to different associated BS clusters. It is noted that in [5], some parameters need to be fine-tuned for different network deployments such as the number of served UEs by every BS at different BS clusters. Moreover, since the BS-UE association scheme for different RB sets is different and the obtained activity fractions are non-integers, this approach cannot be employed to derive BS-UE association without the aid of scheduling. In this paper, we focus on low complexity physical layer BS-UE association schemes and adopt the setting that all served UEs share all allocated RBs [3] so that the complexity and overhead associated with scheduling are nearly zero.
There are also works using tools from stochastic geometry to analyze the BS-UE association problem. For example, in [18], the stochastic geometry based tools were applied to derive RSS thresholds for every tier of a multi-tier DL HetNet. The stochastic geometry based tools were also applied in [19] to derive spectral and energy efficiencies of a K-tier HetNet, where the macro cells employed Massive MIMO. [20] analyzed BS cooperation with noncoherent joint-transmission, and the practical irregular BS deployment was incorporated. By utilizing the stochastic geometry based tools, the signal-to-interference-plus-noise ratio (SINR) distribution with cooperation was precisely characterized. However, finding the optimal association is a combinatorial problem whose complexity scales exponentially with the number of BSs and UEs. Low complexity and versatile BS-UE association approaches are highly desirable due to the mobility of UEs and the variations of practical systems. Although much research work has been done, such approaches for multiple BS-UE association in Massive MIMO systems are not available.
In this paper, we derive two lower bounds and one approximation of achievable per-UE rate in closed forms under HM-MIMO systems. These closed forms incorporate linear minimum mean square error (LMMSE) channel estimation and conjugate beamforming (CB). More importantly, the derived two lower bounds and one approximation include only large-scale fading coefficients and are very easy to compute. Based on these closed forms, three low complexity and versatile BS-UE association approaches are proposed. At first, we introduce a simple and heuristic BS-UE association approach for network utility maximization. This approach is not only designed based on large-scale time interval but also has very low computational complexity and impressive network utility performance.
Secondly, based on the proposed heuristic approach, we develop a supervised learning approach for BS-UE association using convolutional neural network (CNN). This learning approach can directly output required BS-UE association and uses easily obtained information as input such as per-antenna power constraint, number of antennas per BS, and large-scale fading coefficients. The corresponding performance is very close to that obtained by the heuristic approach. Note that, as far as the authors know, NN approach for BS-UE association in HM-MIMO systems is currently not available even though the applications of NN have been broadly reported in various aspects of wireless communications such as channel estimation [21], detection [22], positioning [23], exploration of noise correlation [24], dynamic spectrum access [25], and so on [26], [27].
Thirdly, regarding the connection probability between a BS-UE pair as the transmit power portion allocated to the UE by the BS, we propose a new combined power allocation and BS-UE association approach based on the CNN architecture introduced earlier for BS-UE association. It is noted that in BS-UE association, equal power allocation is adopted and the status between a BS and a UE is either connected or disconnected. While in combined power allocation and BS-UE association, the incorporated power control coefficients are real numbers under certain power constraints. In other words, for a BS, the power portions allocated to its served UEs can be different and are determined by the power control coefficients. Using the gradients of the derived achievable per-UE rate approximation, the proposed approach does not require accurate target data for training and can further improve the network utility performance. By applying a designed threshold operation for BS-UE association of this approach, significant network overhead can be reduced while much better performance can still be achieved compared with the two previously proposed BS-UE association approaches. For comparison, a power allocation and association approach based on max-min power control using the approximation of the achievable per-UE rate is also introduced.
Note that the proposed approaches are very versatile and computationally efficient. They can be extended easily for designs with respect to other cost functions besides the network utility adopted in this paper. Moreover, numerical simulations show that the proposed CNN is very suitable for performing combined power-allocation and BS-UE association in highly mobile HM-MIMO systems when the parallel computing power of graphics processing unit (GPU) is available. If parallel processing is not available, the heuristic BS-UE association approach is then a good alternative.
The organization of this paper is as follows. Section II gives the formulations of HM-MIMO systems with BS-UE association, including descriptions of LMMSE channel estimation and CB. Two lower bounds and one approximation of the achievable per-UE rate in closed forms are described in Section III in detail. Section IV introduces the heuristic BS-UE association approaches as well as the corresponding numerical results. Two CNN based approaches (a BS-UE association approach, and a combined power allocation and BS-UE association approach) are proposed in Section V. As an alternative approach and performance reference, a method using max-min optimization and a threshold operation for combined power allocation and BS-UE association is introduced in Section VI. Section VII gives the numerical results of CNN based approaches and Section VIII concludes this paper.
Notations: We use boldface upper-and lower-case letters to denote matrices and column vectors, respectively. E h {·} means expectation with respect to variable h. Var(·) and Cov(·, ·) means variance and covariance operations, respectively. (·) T , (·) * , and (·) H denote transpose, conjugate, and transposed conjugate operations, respectively. z ∼ CN (0, σ 2 ) denotes a circularly symmetric complex Gaussian random variable with zero mean and variance σ 2 . 1 L×K represents an L × K matrix and each of its component is equal to 1, and |U| denotes the cardinality of set U.

II. SYSTEM FORMULATION
We consider a DL two-tier HM-MIMO system. This system has L BSs and K UEs. One macro-BS is located at the center of the serving area and L − 1 micro-BSs are distributed either randomly or in clusters. It is assumed that a UE can be served by multiple BSs simultaneously. In terms of BS-UE association, there is no difference to a UE regarding which types of BSs it connects to. The detailed association scheme for each UE (i.e., which BSs should be associated with this UE) is decided by a central processing unit (CPU) which is connected to all BSs using dedicated wired or wireless links. Based on a decided association scheme, the CPU transmits to each BS the information of which UEs it should serve. Upon receiving the association information from the CPU, all BSs will transmit downlink data to their associated UEs.
The antenna number of the l-th BS is denoted as M l and every UE has a single antenna. We consider a single sub-carrier and ignore sub-carrier index in our formulation. Then the channel frequency response between the l-th BS and the k-th UE at a certain sub-carrier is expressed as: where k = 1, . . . , K and l = 1, . . . , L. In (1), {β l,k } are the large-scale fading coefficients that account for path losses and shadow fading. {h l,k = [h 1,k , h 2,k , . . . , h M l ,k ] T } ∈ C M l ×1 are small-scale fading coefficient vectors whose components are i.i.d. ∼ CN (0, 1).

A. UPLINK CHANNEL ESTIMATION
Time division duplexing (TDD) scheme is adopted in our system, and the uplink (UL) and DL channels are therefore reciprocal. A standard LMMSE channel estimation method adopted in Massive MIMO systems [3], [28] is used for estimating UL channel state information (CSI) which is then used for DL precoding. Due to the channel estimation error, it is expected that the DL UE data rate under channel estimation will decrease to a certain extent compared with that under perfect CSI at the BS.
We denote the pilot sequence transmitted by the k-th UE as √ τ p a k ∈ C τ p ×1 where τ p is the length of pilot sequences and a k 2 = 1, ∀k. Here we assume τ p ≥ K , so the pilot sequences transmitted by all UEs are orthogonal. Then the received pilot matrix at the l-th BS is given as: where ρ p is the normalized signal-to-noise ratio (SNR) of each pilot symbol. Here, ρ p = P p /P N , where P p is the transmit power for each pilot symbol and the noise power P N = Bandwidth × k B × T 0 × noise factor (Watt) where k B = 1.381 × 10 −23 (Joule per Kelvin) is the Boltzmann constant and T 0 = 290 (Kelvin) is the noise temperature. W l in (2) is a normalized additive Gaussian noise matrix at l-th BS whose elements are i.i.d. ∼ CN (0, 1). Projecting Y l onto a k , we obtain: where w l,k is a C M l ×1 vector whose components are also i.i.d.
∼ CN (0, 1). Based on the i.i.d. assumption of the small-scale fading coefficients, it is sufficient to consider channel estimation for a specific pair between the m l -th BS antenna of the l-th BS and the k-th UE: where y m l ,k is the m l -th component of y l,k and m l = 1, . . . , M l . An LMMSE method is then applied to (4) for channel estimation, and it can be expressed as: g m l ,k = c l,k y m l ,k , c l,k = arg min Then c l,k is given as: VOLUME 8, 2020 It can be easily shown that E h {ĝ H l,k ε l,k } = 0, where ε l,k = g l,k −ĝ l,k is the channel estimation error vector. Here, we define γ l,k as: Then E h {|ε m l ,k | 2 } has the simple form below B. PRECODING SCHEME Since UL and DL channels are reciprocal in our system, the transmit symbols at the l-th BS using CB are given as: where U l is the set of UEs served by the l-th BS, q k is the data symbol transmitted for the k-th UE, which is zero mean and has unit variance (i.e., E{|q k | 2 } = 1), and ρ l is the normalized SNR of each DL data symbol at the l-th BS (i.e., ρ l = P l /P N and P l is the per antenna transmit power at the l-th BS). η l,k denotes the power control coefficient of the transmit symbol for the k-th UE at the l-th BS. If we apply equal power allocation with full power transmission, η l,k is equal to 1 |U l |γ l,k . Based on the channel hardening and favorite propagation properties of Massive MIMO, simple CB precoding can achieve comparable performance as high complexity precoding schemes such as zero-forcing (ZF) [3] in cellular Massive MIMO systems. Thus we adopt CB precoding in this work.

C. RECEIVED SIGNAL
After demodulation, the received signal at the k -th UE, r |k , has the form below: where B k is the set of BSs serving the k -th UE, ∀k , and w k ∼ CN (0, 1), ∀k .

III. ACHIEVABLE PER-UE RATE
According to the Shannon capacity theory [29], the DL achievable per-UE rate with perfect CSI at the UE side is given as: where D |k denotes the desired signal component at the k -th UE and I k|k denotes the interference component caused by the k-th UE at the k -th UE. Their expressions are given below:

A. CONVENTIONAL LOWER BOUND
Using the techniques in [3] and assuming every UE only has statistical channel information instead of instantaneous channel realizations, the conventional lower bound (CLB) of achievable per-UE rate for the k -th UE is given as: where BU is a beamforming uncertainty term and is defined as: It is noted that (14) is the achievable per-UE rate expression when statistical channel information is available at the UE side while (11) is the achievable per-UE rate expression when perfect channel information is available at the UE side. Thus, (14) is naturally a lower bound of (11). This lower bound has been shown quite tight in a cellular scenario [3]. However, large gap is also observed in cell-free Massive MIMO systems [30]. We define ξ l,k √ η l,k γ l,k and a simplified closed form of CLB |k with equal power allocation (i.e., η l,k = 1 |U l |γ l,k , ∀l, ∀k) is given below: where B a is the set of active BSs (i.e., every BS in B a has at least one UE associated with it). It is noted that under equal power allocation, ξ l,k = γ l,k |U l | , and the proof of (16) is provided in Appendix A.

B. FIRST ORDER APPROXIMATION
In this section, we show an approximation of (11). Applying Jensen's inequality and the first order Taylor expansion to (11), the approximated achievable per-UE rate can be expressed as: We call this approximation as Jensen-Taylor Approximation (JT-A). It is noted that (17) is an approximation of (11). A simple closed form of (17) with equal power allocation is given in (18), as shown at the bottom of the next page, which can be simply obtained by substituting (42) and (43) into (17). In (18), we have defined α l,k η l,k β l,k γ l,k , and under equal power allocation, α l,k becomes β l,k |U l | . It will be shown later that under our system setting, this approximation is almost indistinguishable with (11).

C. JENSEN-TAYLOR LOWER BOUND
Another lower bound of (11) is the Jensen-Taylor lower bound (JT-LB) which is introduced in [30] under a distributed Massive MIMO system with perfect CSI at both access point and UE sides. This lower bound is derived by applying Jensen's inequality and the second-order Taylor expansion to (11), and has the expression below: where The closed-form expressions of cov |k and var |k under current system scenarios are given in (20) and (21), as shown at the bottom of the page, respectively. Their proofs are analogous to those provided in [30] and thus are not provided here.

IV. HEURISTIC BS-UE ASSOCIATION ALGORITHM A. ALGORITHM INTRODUCTION
Heuristic BS-UE association algorithms are proposed in this section. The algorithms are based on the signalto-interference ratio (SIR) defined as the ratio between the power from a BS to a UE and the sum power from all other BSs to this UE. An expression for SIR l,k 1 is given below: 1 When considering channel estimation and CB precoding, SIR l,k can also be defined as (22) is converted from SINR A |k defined in (18) by assuming that the l-th BS only serves the k-th UE and the k-th UE is also only served by the l-th BS, and ignoring the noise effect. Our simulations show that similar system performances are obtained when SIR l,k is defined either by (23) or (22). Thus the more general definition of SIR given in (23) is chosen in our work. This SIR takes into account transmit power at each BS and the large-scale fading coefficients, and is very simple to compute. The following logarithmic utility is adopted for evaluating the performances of various BS-UE association schemes. In addition to characterizing the throughput performance of all served UEs, the network utility in (24) promotes ''proportional fairness'' [4], [5], [17]. This is because the ''log'' function ensures diminishing incremental returns for individual UE rates as they get higher. Thus, it encourages association schemes to give more network resources to UEs with lower rates. It is noted that R |k in (24) can also be replaced by CLB |k , A |k , or J |k . The expressions for R |k , CLB |k , A |k , and J |k are given in (11), (16), (18), and (19), respectively. Based on (24), the BS-UE association based network utility maximization problem is given as: Since it is difficult to obtain the optimal solution of an integer optimization problem, this section introduces heuristic SIR based BS-UE association algorithms to solve (25). The proposed algorithms can be implemented efficiently and will be shown to outperform the Max-RSS and the algorithm presented in [5]. Before introducing the heuristic multiple BS-UE association (i.e., each UE is associated with one or more BSs), we introduce the following heuristic single BS-UE association (denoted as Algorithm 1) where each UE is associated with only one BS. Here, the number of BSs to be tested per UE is denoted asL where 1 ≤L ≤ L. Based on Algorithm 1, the heuristic multiple BS-UE association algorithm is given in Algorithm 2. For a fully connected network, the number of BS-UE connections is LK , but the number of BS-UE connections corresponding to the best-designed merit is usually much smaller than LK .
var |k = l∈B k  4: BS associations of the UEs are determined according to the order in S SIR to reduce the computational cost. In this manner, the BS associations for the first (j − 1) UEs in S SIR will be determined before considering the BS association of the j-th UE in S SIR in Step 5. 5: Let the j-th UE be associated with each of theL possible BSs and compute the utility for the first j UEs in S SIR for each of theL options. The BS providing the largest utility will be chosen to connect with the j-th UE. 6: Repeat Step 5 until the BS association of the last UE in S SIR is done. Then, store the resulting utility for the K UEs. 7: Compute the utility with Max-RSS association and compare it with the stored utility of Algorithm 1. The association with a better merit will be adopted.
Thus, we pre-determine an upper limit of the number of total BS-UE connections, N c , to reduce the computational complexity significantly. A selection criterion of N c will be discussed later in numerical results. It is noted that for the single association scenario where each UE is served by one BS, the number of antennas of a BS limits the number of UEs it can serve in practice. This constraint is released when one UE can be served by multiple BSs simultaneously. In the multiple association scenario, the number of antennas at a BS does not need to be larger than the number of UEs associated with the BS in an HM-MIMO system. However, the total number of antennas for all BSs needs to be greater than the total number of UEs to ensure that the degree of freedom of the system is not less than the number of UEs. It is also noted that the obtained BS-UE association from Algorithm 2 can be further improved by a perturbation approach as described below. Given a BS-UE association, the BS-UE pairing of the UE with the smallest achievable per-UE rate will be examined. By connecting the nearest unpaired BS to this UE, the corresponding network utility will be recalculated. If the utility increases, the perturbation process is successful and will continue. Otherwise, it will stop. From our simulation results, the iteration number for this perturbation process is less than 10 with more than 90% probability and is less than 20 with almost 100% probability. Let the i-th option, 0 ≤ i ≤ N c − K , be consisted of two groups of BS-UE pairs: 1) the first i new BS-UE pairs derived from Step 2; and 2) the K BS-UE pairs derived from Algorithm 1 in Step 1. Note that the first option (i.e., i = 0) is the single BS-UE association derived from Algorithm 1. 4: Compute the designed merits of the N c − K + 1 BS-UE association options derived in Step 3. The BS-UE association option with the best-designed merit is the one to be adopted.
As mentioned above, R |k in (24) can be replaced by CLB |k , A |k , or J |k . Denote the average time needed to compute R |k , CLB |k , A |k , or J |k , whichever is chosen, as T u . Then For the approach in [5], if CVX [31] is used to solve the convex optimization problem formulated in [5], it has size O L max L=1 NLN A K whereL denotes the cluster size, NL denotes the number of size-L clusters, N A denotes the number of operations [5]. Before solving the convex optimization problem, the UE data rate for every UE associated with every BS cluster needs to be computed and the overall complexity is { L max L=1 NLK }T u . It is noted that the convex optimization problem in [5] needs to be iteratively solved while Algorithm 2 provides N c −K BS-UE association options which can be computed in parallel. When parallel computing is available, our multiple BS-UE association approach, especially whenL = 1 and perturbation is not applied, is definitely much faster than the approach in [5].

B. NUMERICAL RESULTS OF HEURISTIC ALGORITHM
We consider a network consisting of five BSs and K UEs in a circular area with a one-kilometer diameter. One macro-BS with 100 antennas is located at the center and four micro-BSs each with 25 antennas are located in the area as shown in Fig. 1. We consider two kinds of UE distributions. The first kind is K single-antenna UEs are randomly distributed (with a uniform distribution) in the serving area, and we call this distribution as ''Random.'' The second kind is K single antenna UEs are distributed in tiers. Divide the K UEs evenly into five groups. Thus, K /5 UEs are located randomly in the vicinity of each of the four micro-BSs (with radius 150 m), and the remaining K /5 UEs are randomly distributed in the whole serving area. We call this kind of distribution as ''Tiers.'' One typical example of ''Tiers'' is shown in Fig. 1 where the diamond denotes the macro-BS location, circles denote the micro-BS locations, and plus-signs denote UE locations. The large-scale fading coefficients defined in (1) are generated using the 3GPP channel models [32]. Specifically, macro-BS uses UMa NLoS model, and micro-BSs use UMi NLoS model. The detailed simulation parameter settings are provided in Table 1. Using Algorithm 2 withL = L under one network realization for both ''Random'' and ''Tiers'' scenarios, Fig. 2 shows utility values of achievable per-UE rate with perfect CSI, achievable per-UE rate with LMMSE channel estimation, CLB, JT-A, and JT-LB with respect to N c −K (i.e., the number of BS-UE pairs in Group II adopted in the BS-UE association). It is observed from Fig. 2 that the utility values with LMMSE channel estimation are lower than those with perfect CSI in both ''Random'' and ''Tiers'' settings. Note that substantially more UEs are located far away from all BSs in the ''Random'' setting than in the ''Tier'' setting. Due to their disadvantageous positions, the CSI estimation accuracy for these far-away UEs is largely compromised by the noise effect. Thus, the performance degradation with channel estimation in the ''Random'' setting is larger than that in the ''Tiers'' setting. It can also be observed that CLB, JT-A, and JT-LB have the same variation trend as the achievable per-UE rate under LMMSE channel estimation with respect to the number of BS-UE pairs selected. From this trend, N c can be chosen such that N c −K is greater than the number associated with the peak utility in Fig. 2 with some margins.
Remarkably, compared with JT-LB and CLB, the utility of JT-A in both ''Random'' and ''Tiers'' scenarios has the smallest gap with respect to the utility computed using (24) which includes both small-and large-scale fading in Monte-Carlo simulation. Based on these observations, it is reasonable to omit the small-scale fading and replace R |k in (24) by A |k which includes only large-scale fading coefficients. In the following formulation and simulation results, the utilities are all based on {A |k } to reduce the computational complexity. Fig. 3 shows the cumulative distribution function (CDF) curves of network utility obtained by ''Algorithm 2 + Perturbation,'' Algorithm 2, Algorithm in [5], Algorithm 1, and Max-RSS association. Here, we have considered two options for Algorithm 2:L = L in Algorithm 2I, andL = 1 in Algorithm 2II. It is observed that the performances of Algorithm 2I and Algorithm 2II are very close to each other for the ''Tiers'' scenario and are exactly on top of each other for the ''Random'' scenario. Similar observations are made for ''Algorithm 2I + Perturbation'' and ''Algorithm 2II + Perturbation'' under both ''Random'' and ''Tiers'' scenarios. The 50%-likely network utility achieved by Algorithm 1 withL = L is 3.5% and 6.0% higher than that achieved by Max-RSS in ''Random'' and ''Tiers'' scenarios, respectively. Remarkably, significant improvement in terms of utility can be obtained with multiple BS-UE association. The 50%-likely network utility achieved by ''Algorithm 2I + Perturbation'' is 26.4% and 23.4% higher than that achieved by Max-RSS in ''Random'' and ''Tiers'' scenarios, respectively.
For a fair comparison between our multiple BS-UE association approach and the approach in [5], three BS cluster sizes (1, 2, and 3) are considered in implementing ''Algm [5],'' and the number of served UEs by every BS at different BS cluster sizes is fine-tuned for ''Algm [5]'' to maximize the network utility. It is observed from Fig. 3 that ''Algorithm 2 + Perturbation'' and Algorithm 2 have significant performance improvement compared with ''Algm [5]'' in both ''random'' and ''tiers'' scenarios. This is primarily due to the fact that the predetermined number of served UEs for every BS, the predetermined number of clusters, and the predetermined BS cluster sizes of ''Algm [5]'' may not be optimal. However, there is no algorithm published yet for deriving the optimal number of served UEs as well as the optimal cluster number and sizes.
Based on the BS-UE associations obtained by ''Algorithm 2I + Perturbation,'' we calculate the probability of the number of BS associations per UE and show the results in Fig. 4. It can be observed that most of the UEs are associated with only one or two BSs for both ''Random'' and ''Tiers'' scenarios. According to Fig. 4, the average number of total BS-UE connections for ''Random'' and ''Tiers'' scenarios is 132.2 and 116.1, respectively, out of total possible 400 connections. This is consistent with the result in Fig. 2, i.e., better network utility performance occurs when the connection number is relatively small. This is a favorable property since a small connection number implies low network overhead and short delay.   It is observed that the 50%-likely network utility value decreases as the number of served UEs increases from 60 to 100 because the UE data rates generally decrease as the number of served UEs increases. It is also observed that the performance improvement by ''Algorithm 2I + Perturbation'' and Algorithm 1 over Max-RSS increases with the number of served UEs.
The total number of BS-UE association options for a given network realization is 2 L − 1 K , and the optimal utility performance can only be obtained by calculating the utility values for all possible BS-UE association options and finding the largest one. It is computational prohibitive when K is large. To show the performance gap of our proposed algorithms with the optimal network utility performance, we consider a simple network consisting of 3 BSs and 6 UEs in a circular area also with a one-kilometer diameter. One macro-BS with 100 antennas is located at the center of this area, two micro-BSs each with 25 antennas are located separately. Table 2 shows the average network utility values obtained by optimal association, Algorithm 2I, and Max-RSS under this network setting for both ''Random'' and ''Tiers'' scenarios. Note that for the ''Tiers'' scenario, two UEs are located in the vicinity of every micro-BS and the remaining two UEs are distributed randomly over the whole serving area. It can be seen that the average utility values achieved by Algorithm 2I are very close to those obtained by the optimal association. It should be noted that the computational complexity of Algorithm 2I is only 141T u if we set N c = 18 while the computational complexity of the optimal association is around 7 × 10 5 T u . This information shows that our proposed algorithm not only has satisfactory performance but also maintains very low computational complexity.

V. NEURAL NETWORK BASED ASSOCIATION AND POWER ALLOCATION A. NEURAL NETWORK BASED ASSOCIATION
For any network with given {L, K , {P l }, {M l }}, the BS-UE association can be viewed as a function f(·) that maps the inputX ∈ R 2L×K to the output Z ∈ R L×K such that the merit (e,g., network utility) is maximized. Here, whereβ l = P l M l β l , β l = [β l,1 , β l,2 , . . . , β l,K ] T ∈ R K ×1 andγ l = P l M l γ l , γ l = [γ l,1 , γ l,2 , . . . , γ l,K ] T ∈ R K ×1 . And Z = [z l,k ] where z l,k can be either 1 or 0. If z l,k = 1, the l-th BS and the k-th UE will be connected and if z l,k = 0, the l-th BS and the k-th UE will not be connected. Note that the number of UEs, K , can be set as any reasonable value and is set as 80 in our work for example.
We consider an N -layer CNN 2 to represent f(·), whose input and output areX and Z, respectively. Note that the input layer is not counted in the N layers. The structure of the CNN is as follows. Denote S n as the total number of channels of the n-th layer (n = 1, 2, . . . , N ), and X s n n as the output of the s n -th channel in the n-th layer with s n = 1, 2, . . . , S n . With the same notation format, denote the input layer as the n = 0 layer. Its output is then denoted as X S 0 0 where S 0 = 1. Note that X S 0 0 is equal to the given system inputX. The first N 1 (1 ≤ N 1 < N ) layers of the CNN are convolutional layers, and each layer consists of convolution and activation operations. Denote s n−1 ,s n n ∈ R a 1 ×a 2 as the convolutional Kernel for the n-th layer between the s n−1 -th and the s n -th channels, where a 1 × a 2 is the size for the Kernel matrix. Denote b s n n as the bias factor for the s n -th channel of the n-th layer. Based on these notations, the output of the s n -th channel at the n-th layer after activation is given as:  (27) is a 2D convolution whose stride is (1, 1). We also applied the commonly used activation function, rectified linear unit (ReLU), in our CNN [33].
After N 1 convolutional layers, an average pooling operation is applied to {X N 1 , respectively. Then the average pooling operation is given as: where R 1 and R 2 are the pooling coefficients. It should be noticed that the output Z is a 2D matrix. Instead of using a flatten operation and fully connected layers after convolutional layers in classic CNNs [34], we used N −N 1 transposed convolutional layers after average pooling in our CNN. Sincẽ X s N 1 N 1 has the same dimension as Z, stride (1, 1) is also applied in transposed convolutional layers so the operation is the same as in (27) except that 1 2L×K is replaced by 1 L×K . It is noted that the number of channels S N 1 is gradually reduced to 1 in the transposed convolutional layers. We call the CNN explained in this section as ''NN 1''.
It is noted that ''NN 1'' is a supervised learning approach. Ideally, the target should be the optimal association given a network realization. However, as aforementioned, the optimal association can only be obtained by an exhaustive search which is time and memory prohibitive when K is large. Due to this limitation, we use the associations given by ''Algorithm 2I + Perturbation'' as the target for training and validation. Since the target association is a 2D matrix consisting of only numbers 1 and 0, we do not apply ReLU after the last layer (n = N ) of ''NN 1'' and directly use BCEWithLogitsLoss function as the objective function, which has the form below: Here we denoteZ ∈ R L×K as the target association matrix andz l,k as the (l, k)-th component ofZ. ν(·) is the Sigmoid function which is defined as can be interpreted as a connection probability between the l-th BS and the k-th UE. It can be expected that after sufficient training, v(x S N l,k ) will approach 1 ifz l,k = 1 and approach 0 ifz l,k = 0. We applied the Adam algorithm [35] for optimization training and the final BS-UE association is obtained by applying a threshold operation on the inference result by ''NN 1.'' The threshold operation is given below: Given the properties of the Sigmoid function, we believe it is reasonable to set the threshold value as 0.5.

B. NN BASED POWER ALLOCATION AND ASSOCIATION
In this section, a CNN based power allocation and association approach is introduced. As aforementioned, ν(x S N l,k ) can be interpreted as a connection probability between the l-th BS and the k-th UE. It also occurs to us that this probability can be regarded as the power portion allocated to the k-th UE by the l-th BS. Based on such an inspiration, we introduce a combined power allocation and association approach using CNN and a threshold operation for HM-MIMO systems.
We first consider a network utility maximization problem for a fully connected HM-MIMO system via power allocation. Under this setting, the power control coefficients, {η l,k }, ∀l, k, need to be reintroduced into (18), which gives (31), as shown at the bottom of the page. We consider a per-antenna power constraint, and according to (9), the symbol transmitted by the m l -th antenna can be expressed as: Then the per-antenna power constraint is given by E{|s m l | 2 } ≤ ρ l . According to (7) and E{|q k | 2 } = 1, we can get a simplified form, K k=1 η l,k γ l,k ≤ 1, which does not depend on the antenna index m l . Based on (24), (31), and the power constraints, the network utility maximization problem can be formulated as Due to the non-convexity of the objective function in (33), it is challenging to find the optimal solution in an efficient way.
Here we propose a CNN based approach, denoted as ''NN 2,'' to tackle this problem, and the detailed architecture is shown in Fig. 6. It is noted that the CNN used in ''NN 2'' has a similar structure as that of ''NN 1'' except for some hyper-parameter settings. We denoteẐ ∈ R L×K as the output of the Sigmoid function andẑ l,k = ν(x S N l,k ) as the (l, k)-th component ofẐ where 0 ≤ẑ l,k ≤ 1. From (31), by up-scaling all the power control coefficients by the same factor,Ā |k , k = 1, . . . , K will increase, which will result in a larger network utility value. Based on this observation, the scaling operation in Fig. 6 is expressed as: Note that in Fig. 6, η NN = [η NN 1,1 , . . . , η NN 1,K , η NN 2,1 , . . . , η NN l,k , . . . , η NN L,K ] where the superscript NN denotes neural network and the output of the Sigmoid functionẐ = [ẑ l,k ] is the input of the scaling operator.
The power control coefficients calculated by (34) are used to compute network utility which is further used to compute the MSE loss for optimization as shown in Fig. 6. We denote U NN i as the i-th computed network utility andŪ i as the i-th target utility, then the MSE loss function is defined as whereM is the mini-batch size and C is a predefined constant. It is noted that we are considering a power controlled 184054 VOLUME 8, 2020 HM-MIMO system, whose corresponding network utility should be significantly higher than that obtained by ''Algorithm 2I + Perturbation'' where only BS-UE association is considered. We can use the network utility given by ''Algorithm 2I + Perturbation'' as a reference of the target utility and choose a sufficiently large value of C (e.g., the same magnitude asŪ i ) for training. Alternatively, we can directly setŪ i , ∀i as a constant which is significantly larger than the utility value obtained by ''Algorithm 2I + Perturbation'' and set C as 0 for training. As in ''NN 1,'' the Adam algorithm is also used for training the CNN in this section. In Fig. 6, the solid and dashed arrows show the forward and backward propagation directions, respectively. In the backward propagation direction, each operation (e.g., computation of utility, scaling operation, sigmoid function, etc.) defines a gradient which is needed for updating the coefficients of the CNN. As the designed merit (i.e., utility) can be directly computed using the closed-form approximation, JT-A, these gradients can also be directly computed using JT-A. Relying on the simple form of the derived JT-A, which averages out the complex-valued small-scale fading coefficients, ''NN 2'' in Fig. 6 can then be trained efficiently even without having accurate target utility values.
It is noted that ''NN 2'' is for a fully connected HM-MIMO system (i.e., KL connections are established). Under this condition, the overhead and delay might be large. In order to reduce the network overhead and delay, we perform a threshold operation to the inference output,Ẑ, of ''NN 2'' to reduce the number of BS-UE connections. Given a threshold value σ t , the threshold operation forẑ l,k , ∀l, ∀k is given asẑ which is different from the threshold operation in (30) for ''NN 1.'' It is noted that the value of σ t should be properly selected so that the connection number is smaller than or equal to a predefined upper limit (i.e., N c in Algorithm 2I).
Since the threshold operation in (36) has very low computational complexity, a bisection search can be applied to find the proper value of σ t . After the threshold operation, (34) will be applied toẑ l,k , ∀l, ∀k to get the power control coefficients under the combined power allocation and BS-UE association scenario. It will be shown later that the value of N c can be set as the same as for Algorithm 2I, while the utility value only has slight degradation compared with that by a fully connected network.

VI. POWER ALLOCATION AND ASSOCIATION VIA MAX-MIN OPTIMIZATION
It is remarkable that a uniform service to every UE would achieve a great network utility performance and such a service can be achieved using max-min power control as shown in [28]. In this section, to serve as an alternative approach and reference, we introduce the power allocation and BS-UE association approach via max-min power control and the threshold operation. The max-min power control problem for a fully connected HM-MIMO system is formulated as: However, due to the non-convexity of SINRĀ |k in (31), (37) can not be solved using convex optimization. By observing the difference of SINR expressions between (18) and (16), we find that for (16), if we remove the BU term in the denominator and add it to the numerator, (16) is converted to (18). Based on this observation, we define a new SINR expression which is given as: where Md is short for ''Medium.'' The reason for this name is that SINR Md |k is obtained by removing the BU term in the numerator of SINRĀ |k , thus we have this relationship: Here SINR CLB |k can be obtained by reintroducing the power control coefficients {η l,k } back into (16) since {η l,k } may be unequal now. We call the per-UE rate computed using (38) as JT-Md. Note that SINR Md |k is very close to SINRĀ |k since the power of BU is much smaller than the desired signal power under Massive MIMO scenarios [3]. Based on these observations, we convert (37) into the optimization problem below: Since the objective function of (39) is quasi-concave, (39) is a quasi-concave problem. It can be further formulated into VOLUME 8, 2020 a sequence of convex feasibility problems, and each one is a second-order cone programming (SOCP) problem which can be solved using CVX. The detailed algorithm used to solve (39) is provided in Algorithm 3.

Algorithm 3
Max-Min Algorithm to Solve (39) 1: Initialization: choose predefined values of t min and t max , where t min and t max define the range of the objective function in (39). Choose a tolerance > 0 and let ζ l,k = √ η l,k for l = 1, . . . , L, k = 1, . . . , K . 2: Set t = (t max + t min )/2. Solve the following convex feasibility program: where ζ = [1, ζ 1,1 , ζ 2,1 . . . , ζ L,1 , . . . , ζ 1,K , ζ 2,K . . . , ζ L,K ], It is noted that Algorithm 3 basically solves problem (37). It will be shown later that when we substitute the power control coefficients obtained from Algorithm 3 into (31), an almost uniform service can be achieved. The BS-UE association procedure after max-min power control is similar to the one given in Section V-B. The threshold operation (36) and the scaling operation (34) are applied, and the power control coefficients under max-min power control and BS-UE association are then obtained.

VII. NUMERICAL RESULTS OF NEURAL NETWORK BASED APPROACHES
The same network and parameter settings are used in this section as in Section IV-B and the detailed hyper-parameter settings of ''NN 1'' and ''NN 2'' are given in Table 3.    (31) to get ''Max-min A.'' Note that for one network realization, max-min power control can provide uniform service to all K UEs, while for different network realizations, the uniform data rates achieved may be different. Since 500 network realizations are included in Fig. 7, almost uniform services are shown in both ''Max-min A'' and ''Max-min Md.'' More importantly, a negligible performance gap is observed in Fig. 7 between ''NN 2 Md'' and ''NN 2 A'' as well as between ''Max-min Md'' and ''Max-min A.'' A similar phenomenon is also observed in the ''Random'' scenario. These results suggest that the power control coefficients obtained using SINR Md |k are applicable to SINRĀ |k . We conclude that the problem conversion from (37) to (39) is proper and the two problems are almost equivalent to each other in HM-MIMO systems based on our settings.
Note that a fully connected network may not be practical due to hardware limitations as well as some system performance issues (e.g., large network resource consumption, long handshake delay, difficulty to deal with synchronization and mobility, etc). Thus, BS-UE pairs with small power allocations are eliminated so that the total number of BS-UE connections does not exceed a predetermined number N c . Under such a condition, Fig. 8   As shown in Fig. 8, firstly, both ''Algm 2I + Pertur'' and ''NN 1'' have significant performance improvement compared with Max-RSS, which suggests the advantage of multiple BS-UE association compared with the single association. Secondly, the performance of ''NN 1'' is not as good as, but quite close to, that of ''Algm 2I + Pertur'' in both ''Random'' and ''Tiers'' scenarios. The performance gap is within 4% in terms of 50%-likely network utility. This shows that the ''NN 1'' architecture can successfully capture the relationship between the input and output information given by ''Algm 2I + Pertur.'' Thirdly, adding power allocation into BS-UE association can greatly improve system performance. Compared with Max-RSS, the 50%-likely network utility improvement of ''NN 2 A'' and ''Max-min A'' is 56.5% and 60.5% in ''Random'' scenarios, and 61.2% and 65.1% in ''Tiers'' scenarios, respectively. Note that both ''NN 2 A'' and ''Max-min A'' represent the results of a fully connected network. Fourthly, as mentioned before, we applied different thresholds {N c } and the scaling operation to ''NN 2 A'' and ''Max-min A'' to reduce the network overhead and delay. As predicted, decreases of N c degrade the system performance. Lastly, ''Max-min A'' and ''NN 2 A'' have similar performances, and ''Max-min A Thre'' and ''NN 2 A Thre'' also have similar performances. The CDFs of the connection (BS-UE pair) numbers for the algorithms in Fig. 8 are provided in Fig. 9. Firstly, since Max-RSS has the same connection number 80 for every network realization, and the fully connected cases ''NN 2 A'' and ''Max-min A'' also have the same connection number 400 for every network realization, their CDF curves are omitted here. Secondly, as mentioned before, the value of N c for ''Algm 2I + Pertur'' can be chosen such that N c − K is greater than the number associated with the peak utility in Fig. 2 with some margins. In more than 99% network realizations in our simulations, the connection number corresponding to the peak utility is smaller than 170 and 150 in ''Random'' and ''Tiers'' scenarios, respectively. Thus, 170 and 150 are set as the value of N c for ''Algm 2I + Pertur'' in ''Random'' and ''Tiers'' scenarios, respectively. Thirdly, different values of N c are applied for ''NN 2 A Thre'' and ''Max-min A Thre'' as shown in Fig. 9. It is noted that the values of N c used in both ''Random'' and ''Tiers'' scenarios are much smaller than KL (i.e., 400 in our case). Moreover, no severe performance degradation is observed in Fig. 8. In fact, when N c = 170 in the ''Random'' scenario and N c = 150 in the ''Tiers'' scenario, less than 5% performance degradation in terms of 50%likely network utility is observed. This observation suggests that significant network overhead and delay can be reduced without a severe compromise of network performance using our proposed algorithms.
Due to the log function in the definition of network utility in (24), a small UE rate will severely degrade the utility performance, while a large UE rate will not lead to much utility improvement. Thus, based on this property, we believe the performance given by max-min power control would be a good performance benchmark due to its property of providing a uniform service. This point is also validated from Fig. 8 where ''Max-min A'' has the best utility performance. However, solving a sequence of SOCP involves high computational complexity, and it supports minimal parallelism and introduces computational overhead [36]. On the contrary, ''NN 2 A'' and ''NN 2 A Thre'' mainly involve simple multiplication and addition operations. With the help of parallel computation by GPU, the running efficiency of ''NN 2 A'' and ''NN 2 A Thre'' can be further improved. Since max-min based algorithms need to be solved in an iterative way, it is not suitable to make a direct computational complexity comparison between max-min and NN based approaches. Instead in Table 4, we provide a comparison of their running time for one network realization under current system parameter settings. It is observed that the running time of max-min based algorithms is in the order of tens of seconds for even one network realization. This makes them almost impossible for practical applications and can only be regarded as an analytical tool. On the contrary, NN based algorithms have much shorter running time and can be used for real-time implementation. Furthermore, the performance achieved by the NN based approach has a small gap to the performance obtained by the max-min power control approach. These two factors make the NN based approach very promising for practical applications in power allocation and BS-UE association.
It is also observed from Fig. 9 that the number of connections of ''NN 2 A Thre'' can be in the same order as those of ''Algm 2I + Pertur'' and ''NN 1.'' However, the overhead required by ''NN 2 A Thre'' is still relatively high since the power control coefficient requires more bits to encode than the association state 1 or 0, which requires only one bit. In this sense, ''Algm 2I + Pertur'' and ''NN 1'' are more suitable for scenarios requiring low overhead.

VIII. CONCLUSION
We consider problems of BS-UE association as well as power allocation for DL HM-MIMO systems in this paper. Incorporating CB and LMMSE channel estimation, two lower bounds and one approximation (denoted as JT-A) of the achievable per-UE rate with perfect CSI at the UE side are derived in closed forms. Based on the tightness comparison, JT-A not only has excellent approximation accuracy to the achievable per-UE rate under HM-MIMO systems but also needs much less computational efforts. Thus, it is reasonable to omit the small-scale fading and replace the achievable per-UE rate by JT-A.
By taking advantage of good accuracy and easy computation of JT-A, three BS-UE association algorithms are proposed. Firstly, a simple, sub-optimal, and heuristic BS-UE association algorithm (denoted as Algorithm 2) is proposed. Compared with the exhaustive search method, huge computational cost can be reduced. In fact, by applying parallel computing, the running time of Algorithm 2 is only in the order of computing a closed-form JT-A. More importantly, the utility performance achieved by this approach is significantly higher than that obtained by the Max-RSS method and the approach in [5].
Secondly, based on the association results given by Algorithm 2, a supervised learning approach (denoted as ''NN 1'') for BS-UE association using CNN is introduced. Numerical simulations show that the BS-UE associations inferred by ''NN 1'' achieve satisfactory utility performance. This suggests that the proposed CNN architecture successfully captures the complicated relationship between the input 2D matrix containing CSI of all possible BS-UE pairs and the output 2D matrix containing the connection status of all BS-UE pairs. This feature shows the great potential of the ''NN 1'' architecture for solving the BS-UE association problem in various Massive MIMO systems, which leads to our third proposed algorithm.
Thirdly, using a similar CNN architecture as used in ''NN 1'' and regarding the connection probability between a BS-UE pair as the transmit power portion allocated to the UE by the BS, we introduce a combined power allocation and association approach (denoted as ''NN 2''). Relying on the simple form of the derived JT-A, which averages out the complex-valued small-scale fading coefficients, ''NN 2'' can be trained efficiently without knowing the accurate target utility. Furthermore, by incorporating power allocation, the utility performance achieved by this approach is comparable to the benchmark utility performance obtained by max-min power control, which is around 60% higher than that by the Max-RSS association. Remarkably, the running time of the max-min power control approach is more than 10000× of that by ''NN 2.' ' The proposed approaches are demonstrated to be very suitable for practical DL HM-MIMO mobile applications.
Owing to the parallel computing power of GPU, ''NN 2'' can provide power-allocation and BS-UE association in real-time. On the other hand, if parallel processing is not available, Algorithm 2 can also provide BS-UE association in real-time. Numerical results show obvious advantages of applying multiple BS-UE association and power allocation over single BS-UE association in HM-MIMO systems.
As artificial intelligence applications grow rapidly in wireless communications, we believe our work is timely and promising. It is remarkable that CNN based approaches can be flexibly designed to solve both the BS-UE association problem and the combined power allocation and BS-UE association problem. Moreover, since CNN is very versatile for solving complicated optimization problems, the procedure leading to the practical implementation of ''NN 2'' can be adopted for achieving various designs with different merit functions. In short, one can at first derive a simple heuristic approach to obtain target data for initial CNN training. After obtaining the prototype CNN architecture capturing the input-output relationship, one can conduct further training of the CNN structure using an arbitrarily chosen but much larger target merit to improve the CNN performance. Note that, with sophisticated designs on the NN architecture, a NN based approach can be implemented to not only take advantage of the main features of input information but also capture the key characteristics of the considered problems (e.g., the derived analytical forms of JT-A in this paper). Future work can be explored in this direction. Based on (12), and the property that estimated channel coefficients and the estimation error are uncorrelated by LMMSE channel estimation, E h {D |k } is calculated as: Thus, using (41), we obtain: where the third and fourth equal signs are because g m l ,k and g m l ,k are independent if k = k due to the application of orthogonal pilots, andĝ m l ,k andĝ m j ,k are also independent due to (5)  √ ρ l ρ j η l,k η j,k E h {g T l,k ĝ * l,k ĝ T j,k g * j,k } = l∈B k j∈B k ,j =l √ ρ l ρ j η l,k η j,k E h {||ĝ l,k || 2 ||ĝ j,k || 2 } + l∈B k ρ l η l,k E h {tr(g * l,k g T l,k ĝ * l,k ĝ T l,k )} = l∈B k j∈B k ,j =l √ ρ l ρ j η l,k η j,k M l M j γ l,k γ j,k + l∈B k ρ l η l,k tr(E h {(ĝ * l,k ĝ T l,k + ε * l,k ε T l,k )ĝ * l,k ĝ T l,k }).