3D Angular-Based Hybrid Precoding and User Grouping for Uniform Rectangular Arrays in Massive MU-MIMO Systems

This paper proposes a new user grouping algorithm and three-dimensional (3D) angular-based hybrid precoding (AB-HP) scheme for massive multi-user multiple-input multiple-output (MU-MIMO) systems using uniform rectangular arrays (URA). At ﬁrst, the users clustered in multiple spots are efﬁciently grouped according to the proposed user grouping algorithm, which only utilizes the user angle-of-departure (AoD) information and does not require prior knowledge of the number of user groups. By employing the AoD support of the user groups, the RF-beamformer of AB-HP is designed to reduce the inter-group interference, the channel state information (CSI) overhead, and the number of RF chains. Then, the digital baseband precoder of AB-HP is constructed via regularized zero-forcing (RZF) using the effective channel seen from baseband to simultaneously serve the users clustered in multiple groups, by considering three approaches: joint-group-processing (JGP), per-group-processing (PGP) and common-group-processing (CGP). For each approach, the signal-to-interference-plus-noise ratio (SINR) expressions as well as their tight deterministic approximations are derived. To further reduce the number of RF chains, we also propose a new transfer block design, which reduces the number of RF chains down to the number of independent data streams without penalizing the sum-rate performance. Illustrative results reveal that the proposed AB-HP schemes with the relaxed CSI estimation overhead and reduced hardware cost/complexity can closely approach to the sum-rate performance of the single-stage fully-digital precoding (FDP). Furthermore, AB-HP has considerably higher energy efﬁciency performance compared to FDP due to the reduced number of RF chains. We show through simulation that the proposed AB-HP can offer signiﬁcantly better performance than existing HP techniques. The computational complexity of AB-HP is also analyzed.


I. INTRODUCTION
Massive multiple-input multiple-output (MIMO) is one of the key technologies for the next generation wireless communication systems [3]- [6]. Massive MIMO systems adopting an excessively large number of antennas at the base station (BS) do not only provide a considerable improvement in the spectrum efficiency via the spatial multiplexing, but also dramatically enhance the energy efficiency via The associate editor coordinating the review of this manuscript and approving it for publication was Abhishek Kandwal . the capability of focusing signal energy through smaller regions (i.e., large beamforming gain) [5]. For the downlink transmission, the precoding at the BS is a crucial signal processing procedure to ensure reliable transmission quality for massive multi-user MIMO (MU-MIMO) systems. Early studies for precoding design focus on the single-stage precoders, where the single-stage fully-digital precoding (FDP) is directly designed according to the channel matrix. Hence, FDP requires full instantaneous channel state information (CSI). Furthermore, the number of radio frequency (RF) chains should be as many as the number of antennas. Among VOLUME 8, 2020 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ the linear FDP techniques, the regularized zero-forcing (RZF) precoder has been widely considered due to its near-optimal performance compared to the optimal non-linear dirty paper coding [7]- [10]. Despite the aforementioned benefits of large antenna arrays, it also brings two interesting challenges: (i) the number of RF chains increases the hardware cost/complexity as well as the power consumption and (ii) the requirement for the considerable amount of CSI overhead causes longer training sequence and increased signalling duration [11]. In order to overcome these problems, hybrid precoding (HP) has been proposed as a promising solution [11]- [13], where the precoder is partitioned into two stages: an analog RF beamformer followed by a low-dimensional digital baseband precoder. The RF-stage and baseband-stage are connected to each other via RF chains, where the selected number of RF chains is between the number of available antennas and the independent data streams or equivalently, the number of single-antenna users to be simultaneously served. In terms of input parameters for the RF beamformer design, there are two main strategies: (i) as a function of the fast time-varying instantaneous CSI [14]- [22], resulting in large CSI overhead, or (ii) using the slowly time-varying CSI (e.g., channel covariance matrix, angle-of-departure (AoD)) for the RF-stage [23]- [29], which can reduce both the hardware cost/complexity and the CSI overhead because only the effective instantaneous CSI is utilized for the baseband precoder.

A. RELATED WORKS
Among the HP schemes requiring the full CSI, a lowcomplexity HP technique reducing the number of RF chains while keeping the similar sum-rate performance of FDP is introduced in [14], where the RF beamformer following the constant-modulus constraint is designed via the phase of instantaneous CSI. In [15], the authors provide an algorithmic precoding solution based on the concept of orthogonal matching pursuit, which takes an optimal unconstrained FDP obtained by the full CSI and approximates it via a constrained HP. In order to find the minimum number of RF chains for HP keeping the same performance of FDP, [16] shows that when the number of RF chains is twice the total number of independent data streams, HP can realize any FDP regardless of the antenna array size. Then, [17] investigates HP design for the multi-carrier massive MIMO systems employing orthogonal frequency division multiple access transmission and proposes further reduction in hardware cost/complexity by decreasing the number of RF chains to the rank of FDP while achieving the similar sum-rate performance. In addition to decreasing the number of RF chains for massive MIMO systems, the phase-shifter reduction is also analyzed in [18], where the authors demonstrate that the sub-connected structure (each RF chain is connected to a subset of antennas) can reduce the number of phase-shifters by half and preserve the same spectral efficiency of the fully-connected structure (each RF chain is connected to all the antennas). User grouping, power allocation and HP design are investigated for non-orthogonal multiple access in millimeterwave (mmWave) massive MIMO systems in [19], where the full CSI is not only required for designing HP but also clustering the users into different groups. Additionally, the number of user groups is also assumed to be known. In [20], an angular-based approach to develop the RF beamformer is independently adopted to establish orthogonal beam selection based HP (OBS-HP) and non-orthogonal angle space based HP (NOAS-HP) techniques, where the BS equipped with uniform linear array (ULA) utilizes RZF precoding at the baseband-stage. The users are not considered to be clustered in multiple spots in [20], hence, the orthogonalization in the angular domain among the user groups is not taken into account in that study. Moreover, the proposed significant beam selection algorithm in [20] selects the strongest beam of each user for the RF beamformer design, but it requires the complete knowledge of CSI. However, this algorithm is not capable of using all available degrees of freedom provided by the channel, especially for a rich scattering environment having wider angle spread. As stated in [20], the accuracy of the RF beamformer in OBS-HP and NOAS-HP heavily depends on the angle spread and it is shown that OBS-HP and NOAS-HP suffer from the sum-rate performance degradation when the angle spread is just increased from 1 • to 3 • . A HP scheme is investigated for the spatial modulation technique in the massive MU-MIMO systems in [21], where the RF beamformer is constructed via the phase of instantaneous CSI as in [14]. In [22], joint user scheduling and HP design is studied for the mmWave massive MIMO systems, where the BS equipped with uniform rectangular array (URA) builds the RF beamformer via applying the singular value decomposition to the instantaneous channel matrix.
As mentioned earlier, the above studies cannot reduce the large CSI overhead because they all utilize the full CSI to construct HP. According to the aforementioned second strategy using slowly time-varying CSI for the RF-beamformer design, joint spatial division multiplexing (JSDM) technique is proposed in [23]. The authors consider the correlation-based channel model as explained in [30]. Assuming that users within the same groups have an identical covariance matrix and the groups have non-overlapping AoD distributions, the RF beamformer is constructed according to either eigen-beamforming based HP (EBF-HP) or block-diagonalization based HP (BD-HP) techniques by using the slowly time-varying channel covariance matrix. The primary goals of the RF-stage are to reduce the CSI overhead required for the HP structure and suppress the effect of the inter-group interference. Afterwards, by using zero-forcing (ZF) and RZF precoding, two approaches having different CSI requirements are given for the baseband precoder design as joint-group processing (JGP) and per-group processing (PGP). Later on, in [24], they also propose two algorithms for user grouping (using K -means clustering and chordal distance fixed quantization) based on the channel covariance matrix, where the number of user groups are assumed to be known in both approaches. Similarly, by using a priori knowledge of the number of user groups, [25] clusters the users by minimizing the generalized Fubini-Study distance between eigenspaces to improve the sum-rate performance of JSDM. In [26], it is shown that the sum-rate performance of JSDM can be enhanced, when the baseband precoder is designed according to the weighted minimum-mean-squared error optimization method. In [27], the JSDM technique is combined orthogonal frequency division multiplexing, which provides a flexible way to transmission in order to separately decode data with a small loss in spectral efficiency. Afterwards, [28] proposes a HP scheme for uniform circular arrays (UCA), where the RF beamformer using the phase mode transformation technique does not require any channel knowledge and it reduces the CSI overhead as well as the number of RF chains by transforming a large-size UCA into a reduced-size virtual ULA. Then, [29] analyzes a non-linear HP design for multi-cell massive MIMO systems, where an approximate block-diagonalization (BD) technique is utilized at the RF-stage and a minimum mean square error vector perturbation technique is employed at the BB-stage. It is shown that applying a non-linear HP technique can improve the error performance compared to the linear HP.
Regarding the phased array antenna architecture design, there are mainly three important considerations as (i) the array geometry, (ii) the number of antenna elements and (iii) the element spacing [31]. In terms of the array geometry, ULA, UCA and URA are the most common configurations. However, unlike ULA and URA, the steering vector of UCA does not have Vandermonde structure, which means that the angular-based beamforming techniques cannot be directly applied [28]. On the other hand, URA is more widely considered than ULA because URA packs antennas on a two-dimensional (2D) grid instead of the single-dimensional (1D) line [32]. Furthermore, ULA can be only utilized to estimate the azimuth angle, while URA is also capable of resolving the elevation angle by means of its 2D structure [33]. For three-dimensional (3D) beamforming, both azimuth and elevation angles should be taken into account. This aspect also makes URA a better solution for 3D beamforming [33]- [35]. However, due to its simple structure, ULAs are often investigated in the literature (e.g., in [14], [16]- [21], [23]- [26], [29]).

B. CONTRIBUTIONS AND ORGANIZATION
In this paper, a new user grouping algorithm and 3D angular-based HP (AB-HP) technique is proposed for massive MU-MIMO systems, where the BS is equipped with URA and users are clustered in non-overlapping geographical regions. Similar to JSDM, the proposed AB-HP aims to minimize both the CSI overhead and the number of RF chains. The main contributions are summarized below: • User grouping: Different from [19], [24], [25], [36], the proposed user grouping algorithm does not require the number of groups as an input parameter, which is a priori unknown at the BS in practical application. Furthermore, the previous studies utilize either the complete knowledge of instantaneous CSI or the channel covariance matrix, which may be unavailable, especially for massive MU-MIMO systems having large antenna arrays. In this work, we propose an entirely different user grouping algorithm, which employs the AoD information of the users with respect to the BS. According to the proposed algorithm, the target number of user groups is first estimated through the eigenvalue decomposition of the Laplacian matrix, whose entries are similarity measures among the users AoD information. Accordingly, the users are partitioned into groups sharing the similar characteristics based on their AoD information. We compare the proposed user grouping algorithm and the direct applications of the K -means algorithm employing various similarity measures as chordal distance [24], weighted likelihood [36] and subspace projection [36]. It is demonstrated that the proposed method can provide well-separation among the groups and satisfactory assignment of the users to their corresponding groups, which cannot be achieved via the direct application K -means algorithm with an incorrect priori knowledge.

• 3D Angular-Based Hybrid Precoding (AB-HP):
Based on the AoD support of the user groups, the RF beamformer is designed to eliminate the inter-group interference while reducing the CSI overhead as well as the number of RF chains. Considering the practical constraints, AB-HP only utilizes the phase-shifters at the RF-stage without variable-gain amplifiers. On the other hand, JSDM also requires variable-gain amplifiers to build the RF beamformer according to EBF-HP and BD-HP techniques [23]. Therefore, unlike JSDM, AB-HP can simply realize the RF-stage via phase-shifters and provide the constant-modulus constraints. According to the RZF technique, three approaches requiring different reduced CSI overheads are considered for the baseband precoder, which are JGP, PGP and common-group-processing (CGP). For each approach, the signal-to-interference-plus-noise ratio (SINR) expressions are derived to evaluate the system sum-rate. Numerical results demonstrate that the proposed AB-HP schemes with the relaxed CSI estimation overhead and reduced hardware cost/complexity can tightly approach the sum-rate performance of the single-stage FDP. Considering the imperfect channel knowledge, it is shown that AB-HP is more robust to channel estimation errors compared to FDP. Also, the computational complexity of FDP can be significantly decreased via AB-HP. Furthermore, AB-HP can provide a considerable performance improvement in comparison to existing two-stage HP techniques such as EBF-HP, BD-HP, OBS-HP and NOAS-HP.
• Deterministic SINR Approximations: By applying the deterministic equivalent principle for large antenna arrays [9], the deterministic SINR approximations having almost sure convergence are derived for JGP, PGP and CGP approaches. During the derivation, we utilize the tools from the large-dimensional random matrix theory [37]. As shown in the numerical results, without running heavy Monte-Carlo simulations, the derived approximations can be efficiently utilized to attain insights into the system performance.

• Transfer Block Design for RF Chain Reduction:
In addition to the two-stage AB-HP having the RF beamformer and baseband precoder, a new three-stage AB-HP including a transfer block matrix is also proposed for the further minimization in terms of the hardware cost/complexity. The transfer block placed at the RF-stage is realized via phase-shifters. Hence, it is located between the RF beamformer and baseband precoder. According to the transfer block design, three-stage AB-HP can decrease the number of RF chains exactly to the total number of independent data streams without sacrificing the performance. The rest of this paper is organized as follows. In Section II, we introduce the system model. Section III expresses the user grouping algorithm. In Section IV, we propose three approaches for AB-HP and derive the instantaneous SINR expressions as well as their deterministic approximations. Section V explains the transfer block design for the RF chain reduction. The illustrative results are provided in Section VI. Finally, the paper is concluded in Section VII.
Notation: Bold upper/lower case letters denote matrices/vectors. (·) * , (·) T , (·) H , (·) † and · 2 represent the complex conjugate, the transpose, the conjugate transpose, the Moore-Penrose inverse and the 2-norm of a vector or matrix, respectively. I K , E {·}, tr (·) and (·) stand for K × K identity matrix, the expectation operator, the trace operator and the argument of a complex number, respectively. X ⊗ Y denotes the Kronecker product of two matrices X and Y. We use x ∼ CN (0, σ ) when x is a complex Gaussian random variable with zero-mean and variance σ .

II. SYSTEM MODEL
We consider a BS equipped with a URA having M = M x ×M y antennas to serve K single-antenna users. The configuration for URA is shown in Figure 1, where M x and M y are respectively the number of antennas placed along x-axis and y-axis as in [35], [38], d is the distance between two antenna elements normalized by the wavelength, θ and ψ are the elevation and azimuth angles, respectively. The main purpose for selecting URA is to squeeze the antennas in 2D grid to satisfy the area requirements in practical applications. For example, when the carrier frequency is 3 GHz, a 20 × 20 URA at half-wavelength antenna spacing only requires 1 meter in horizontal and vertical directions, whereas a ULA with the same number of antennas needs 20 meters of space in one direction. Based on the geometry-based 3D channel model [30] and URA [31], the multipath channel coefficient between the (m, n) th BS antenna and k th user is given by: where P is the number of paths, g k,p is the complex coefficient reflecting the path loss and the random phase of the p th path with the distribution of CN 0, 1 P so that h m,n k ∼ CN (0, 1). γ x,k,p = sin θ k,p cos ψ k,p and γ y,k,p = sin θ k,p sin ψ k,p carry the information of both elevation and azimuth angles of the corresponding path, where θ k,p ∈ θ k − δ θ k , θ k + δ θ k with the mean elevation angle θ k and elevation angle spread δ θ k , and ψ k, with the mean azimuth angle ψ k and azimuth angle spread δ ψ k . By using (1), the received signal at the k th user can be written as: where s = s 1,1 , · · ·, s 1,M y , · · ·, s M x ,1 , · · ·, s M x ,M y T ∈ C M denotes the transmitted signal vector satisfying a maximum transmit power constraint of P T , i.e., E s 2 2 ≤ P T , w k is the complex Gaussian noise distributed as CN 0, σ 2 , g k = g k,1 , · · · , g k,P T ∈ C P is the path gain vector and k ∈ C P×M is the phase response matrix according to the URA geometry and P paths defined as: where φ x (γ ) = 1, e −j2πdγ , · · · , e −j2πd(M x −1)γ T ∈ C M x is the phase response vector along x − axis and φ y (γ ) = 1, e −j2πdγ , · · · , e −j2πd(M y −1)γ T ∈ C M y is the phase response vector along y − axis. Then, the instantaneous CSI for the k th user can be defined as h T k = g T k k ∈ C M . 84692 VOLUME 8, 2020 As in [23], we assume that the phase response matrix k based on AoD information changes slower than the channel coherence time. Hence, the CSI can be divided into two parts: (i) fast time-varying instantaneous CSI as h k and (ii) slowly time-varying phase response matrix as k . Furthermore, although the users usually have different instantaneous CSIs due to the local scattering and path loss, a group of users sharing the similar AoD support results in the similar phase response matrices.

III. USER GROUPING
The first step toward the AB-HP design is to partition the users into groups sharing similar characteristics. According to the channel model given in (1), all users experience different channel coefficients. However, it is reasonable to assume that the users collocated in the same area have almost the same AoDs with respect to the BS. Our approach is based on the AoD information, 1 where the users belonging to different groups can be separated in orthogonal subspaces.
Given the AoD information associated to every user and a measure of similarity ρ i,j ≥ 0 between any pair of i th and j th users for i, j = 1, · · · , K and i = j, our goal is to divide the users into several groups such that users in the same group have similar AoDs and users in different groups are dissimilar to each other. In practical situations, the number of groups is a priori unknown. Therefore, we propose to jointly find the number of groups and the distribution of users among these groups. We first represent the massive MU-MIMO network as an undirected graph G. Each vertex in this graph represents a user and the connection between two vertices i and j (called edge) is weighted by ρ i,j to model the similarity between two users. It is worthwhile to mention that ρ i,j is equal to ρ j,i for our undirected graph. The problem of user grouping is now reformulated using the similarity graph. Hence, our objective is to find a partition of the graph such that the edges between two users in the different groups have low weights (low similarity) by means of dissimilar AoD support and the edges within the same group have high weights (high similarity) by means of the similar AoD support.
Since the constructed graph should represent the relationships between users, the similarity measure has to model local neighborhoods. By using the Gaussian kernel function, the similarity between i th and j th users can be expressed as: where γ x,i , γ x,j , γ y,i , γ y,j ∈ R P are the angle vectors constructed via the AoD information of i th and j th users, and is the parameter controlling the width of the neighborhoods between γ x,i , γ x,j and γ y,i , γ y,j pairs. However, the similarity measure as defined in (4) is not suitable for the above 1 AoD changes over time at a much slower rate than the actual channel coefficients, Also, it can be tracked and estimated by using standard AoDs estimation algorithms [35], [38], [39]. described problem. To understand the reason, we consider the following simple example. Suppose that the angles vectors for users 1 and user 2 are γ x,1 = γ y,1 = [0.21, 0.41, 0.53] and γ x,2 = γ y,2 = [0.34, 0.38, 0.39], respectively. Despite of the quite similar AoD supports, the similarity measure returns a very small weight and the users will be placed in different groups. While examining the angle vectors of these users, we can see that although the difference between the second elements of the angle vectors is very small, its effect vanishes due to the high differences between the first and third element pairs. Thus, this similarity cannot be reflected via ρ i,j defined in (4). To overcome this problem, we first define the similarity function between i th and j th users as follows: where P i and P j denote the number of paths for the i th and j th users, respectively, for the case of different number of paths each user in general. The effect of close angle pairs can be now represented with above defined similarity measure in (5). The information about the similarity can be compactly represented by the weighted adjacency matrix J ∈ R K ×K , whose (i, j) th element is ρ i,j and the diagonal elements are 0, and the diagonal matrix The main tool to find the number of user groups as well as the user grouping is the Laplacian matrix L ∈ R K ×K defined as: The eigenvalues and eigenvectors of the Laplacian matrix describe many properties of the graph, in particular they allow us to partition the graph through eigenvalue decomposition. One can show that the Laplacian matrix L is a real symmetric and positive semi-definite matrix. Therefore, the eigenvalue decomposition of the Laplacian matrix can be expressed as L = Q Q T , where Q ∈ R K ×K is the matrix containing the eigenvectors on its columns and ∈ R K ×K is the diagonal matrix with the eigenvalues on the diagonal elements. Given the eigenvalue decomposition of L, the user grouping is performed as follows. The multiplicity of eigenvalue 0 is equal to the number of user groups G and the eigenvector of the corresponding eigenvalue 0 are the indicator vectors 2 1 G 1 , · · · , 1 G g , · · · , 1 G G for g = 1, · · · , G [40]. To explain these properties, we consider an eigenvector q g = q g,1 , · · · , q g,K T ∈ R K with the corre- 2 The indicator vector 1 G g of group g has value 1 at position k if the k th user belongs to group g, and it has value 0 at position k if the k th user does not belong to group g. VOLUME 8, 2020 Algorithm 1 Proposed User Grouping Algorithm Input: θ k,p , ψ k,p for k = 1 : K and p = 1 : P.
1: Generate angle pairs γ x,k,p = sin θ k,p cos ψ k,p and γ y,k,p = sin θ k,p sin ψ k,p . 2: Calculate the similarity measures ρ i,j via (5). 3: Form the Laplacian matrix L via (6). 4: Find G, the multiplicity of the eigenvalue 0 of L. 5: Find q g G g=1 , the eigenvectors with the eigenvalue 0. 6: Form Q G = [q 1 , · · · , q G ] via (8). 7: Apply K -means algorithm to the rows of Q G to cluster K users in G groups.
sponding eigenvalue 0, hence, we have: where (a) is obtained by using τ i = P j=1,j =i ρ i,j . As the similarity measures ρ i,j are non-negative, the above term can be zero, if and only if all the positive terms ρ i,j q g,i − q g,j 2 are exactly 0, ∀i, j. Hence, if ρ i,j is strictly larger than 0 (i.e., the two users i and j belong to the same group and have a strong similarity factor), we have necessary q g,i = q g,j , where q g,i is equal to a constant different from 0 for the users i belonging to the same group g. Moreover, when two users i and j belong to different groups, the similarity measure ρ i,j is almost zero due to the disjoint AoD supports. To satisfy the orthogonality among the eigenvectors with the eigenvalue 0, the remaining entries of the eigenvector q g should be equal to 0 for the corresponding users belonging to different groups rather than the group g. As a result, the corresponding eigenvector for the users belonging to the same group g becomes q g = 1 G g . A detailed proof can be found in [40].
Afterwards, once the number of groups G is obtained by finding the multiplicity of the eigenvalue 0 for the Laplacian matrix L, we can compute G eigenvectors q 1 , · · · , q G ∈ R K of the Laplacian matrix L with the eigenvalue 0 and build: Finally, K -means algorithm is applied to the rows of Q G in order to cluster K users in G groups. Here, we omit the explanation for the K -means algorithm and the detailed description for K -means algorithm can be found in [41]. The main complexity of the proposed grouping algorithm is the eigenvalue decomposition of the real symmetric Laplacian matrix with the size of K × K . In Algorithm 1, we summarize the procedures for the proposed user grouping algorithm. Our main objective is to collect users having similar AoD supports. One may think of applying K -means directly on the AoD information associated with each user. This approach is adopted in [36] for an ULA using the user covariance matrices with the prior knowledge of the number of user groups G. However, when we try to apply the K -means algorithm directly on the AoD information, the first obstacle is the number of groups G, which is unknown a priori. On the other hand, the Laplacian matrix given in (6) provides a systematic way to find the number of user groups by applying eigenvalue decomposition and also to cluster the users by manipulating its eigenvectors as explained in Algorithm 1.

IV. 3D ANGULAR-BASED HYBRID PRECODING
According to Figure 2, the transmitted signal is expressed as s = FBd, where F ∈ C M ×b is the RF beamformer designed via the slowly time-varying AoD information, B ∈ C b×K is the baseband precoder designed via the effective CSI seen by the baseband and d ∈ C K is the downlink data vector encoded by i.i.d. Gaussian codebooks (i.e., i.i.d. entries of d follow the distribution of CN (0, 1), so we have E dd H = I K ). The ultimate objective for partitioning the linear precoder into two sub-blocks is to reduce the channel estimation overhead and to decrease the hardware cost/complexity. Throughout this section, we first develop the RF beamformer and then present three approaches for the baseband precoder under different CSI overhead requirement. Furthermore, deterministic equivalents of the SINR expressions are obtained for all three baseband precoder approaches.

A. RF BEAMFORMER DESIGN
Suppose that K users are clustered in G groups based on the similarity of AoD information as shown in Figure 2, where the number of users in group g is denoted as K g such that K = G g=1 K g . Since the users inside the same group have similar AoDs, the instantaneous CSI of the k th user in group g is h T g k = g T g k g ∈ C M with the index g k = k + g−1 p=1 K p and g ∈ C P×M is the phase response matrix of group g. By using (2), the received signal at group g can be given by: where H g = G g g ∈ C K g ×M is the channel matrix, G g = g g 1 , · · · , g g Kg T ∈ C K g ×P is the path gain matrix and w g = w g 1 , · · · , w g Kg T ∈ C K g is the noise vector for group g. As mentioned earlier, the RF beamformer F depends on the AoD information, i.e., on the 84694 VOLUME 8, 2020 phase response matrix g . According to the user groups, we design G different sub-blocks for the RF stage as Then, the received signal is written as: where H = HF ∈ C K ×b is the reduced-size effective channel matrix seen from baseband. By defining , the effective channel matrix can be expanded as: where the diagonal block-matrix H g = H g F g = G g g F g is the effective channel matrix for group g and the off-diagonal block-matrix H p F g = G p p F g is the effective interference channel matrix between groups g and p, ∀p = g. To eliminate the inter-group interference, the RF beamformer matrices have to be chosen as: The above approximate zero condition implies that the RF beamformer for group g should be orthogonal to the phase response matrix of the group p. This can be verified as long as the columns of F g belong to the intersection of the null spaces of p , i.e., Span F g ⊂ ∩p =g Null p . Moreover, in order to maximize the beamforming gain, the columns of F g should belong to the subspace spanned by g , i.e., Span F g ⊂ Span g . Thus, the intersection of Span g and Null p , ∀p = g, should not be empty to obtain the RF beamformer matrix satisfying the above conditions. As explained later, when the number of antennas M increases as in a massive MIMO scenario, Span g is asymptotically included in Null p . Hence, a sufficient condition is to design the columns of F g in the basis of Span g . The AoD support of the group g is expressed as the union of AoD supports for all user in the corresponding group as: where ψ g = ψ min g , ψ max g = min g k ,p ψ g k ,p , max g k ,p ψ g k ,p is the azimuth angle support for group g and θ g = θ min g , θ max g = min g k .p θ g k ,p , max g k,p θ g k ,p is the elevation angle support for group g. To achieve Span F g ⊂ Span g , the columns of the RF beamformer for the group g can be constructed as: where e γ x , γ y = e x (γ x ) ⊗ e y γ y ∈ C M is the steering vector for γ x , γ y angle combination. Here, 1, e j2πdγ y , · · ·, e j2πd(M y −1)γ y T ∈ C M y are the steering vector along x-axis and y-axis, respectively. By defining the quantized angle-pairs as λ , M x and j = 1, · · ·, M y , one can verify the orthogonality between the steering vectors as: Therefore, provides as many orthogonal vectors as the number of antennas M = M x M y . The quantized angle-pairs provide M orthogonal steering vectors, which is the minimum number of angle-pairs to span the complete AoD space (i.e., the complete elevation and azimuth angle domain). Hence, it minimizes the RF chains utilization to cover the complete AoD support of a given user group. On the other hand, other angle-pairs can be also utilized for the RF beamformer design. However, they require a larger number of RF chains and increase the hardware cost/complexity. Furthermore, by choosing the quantized angle-pairs for the RF beamformer design, the inter-group interference can be further suppressed and well-managed compared to the non-orthogonal beamforming techniques (e.g., NOAS-HP [20] as discussed in Section VI-F). For any λ x i , λ y j ∈ AoD g and γ x , γ y / ∈ AoD g , the inner product: tends to be 0 for large antenna arrays. In order to prove the convergence of the above expression for the large antenna arrays, κ M x , M y is defined as follows: For the half-wavelength antenna separation (i.e., d = 0.5), using the finite geometric series expansion for λ x i = γ x and λ y j = γ y , an upper-bound for κ M x , M y can be obtained as:  where (a) is found via 1 − e jα = sin α represent the step-size for the quantized angle-pairs along x-axis and y-axis, respectively), each sine function located in the denominator converges to a constant value higher than 0. Therefore, as antenna array size increases (i.e. M goes to infinity), the upper-bound for κ M x , M y given in (18) converges to 0, when λ x i , λ y j ∈ AoD g and γ x , γ y / ∈ AoD g . Figure 3 illustrates the behavior of κ M x , M y for two antenna array configurations having M = M x × M y = 100 antennas. Firstly, in Figure 3(a), κ (100, 1) for 100 × 1 ULA is plotted versus λ x i − γ x , where the tightness of upper-bound expression obtained in (18) is also represented. As seen from the results, as long as the difference among the angle-pairs increases, κ (100, 1) approaches to 0, which provides the earlier defined approximate zero condition given in (12). Moreover, when the angle-pairs are quite closed to each other (i.e., |λ x i − γ x | ≈ 0), the beamforming gain can be maximized as indicated in (14). Secondly, κ (10, 10) for 10 × 10 URA is demonstrated versus λ x i − γ x and λ y j − γ y in Figure 3(b), where the maximum beamforming gain can be achieved by minimizing the distance between angle-pairs along both x-axis and y-axis. Otherwise, the beamforming gain approaches to 0 (i.e., it becomes nearly orthogonal).
Based on the phase response matrix given in (3), the q th row of g ∈ C P×M can be represented by e H γ x,g,q , γ y,g,q = φ T x γ x,g,q ⊗ φ T y γ y,g,q . Therefore, considering no intersection between the AoD supports of the user groups, p e(λ x i , λ y j ) approaches 0 for λ x i , λ y j ∈ AoD g as expressed in (18) and illustrated in Figure 3. Hence, when the RF beamformer for group g is constructed via the quantized angle-pairs covering AoD g , it both satisfies (i) Span F g ⊂ ∩p =g Null p to suppress the inter-group interference as indicated by the approximate zero condition given in (12) and (ii) Span F g ⊂ Span g to maximize the beamforming gain. Then, the quantized angle-pairs covering AoD g are

Algorithm 2 RF Beamformer Design
Input: M x , M y , and θ k,p , ψ k,p for k = 1 : K and p = 1 : P.
1: Generate angle pairs Form AoD g given in (13). 4: Find λ x i , λ y j pairs covering AoD g via (19). 5: Obtain the RF beamformer for group g as F g via (20). 6: end for 7: Generate the RF beamformer as found as: where denotes the boundaries of quantized angle λ y j along y-axis. According to the quantized angle-pairs, the number of λ x i , λ y j pairs satisfying (19) is denoted by b g . Actually, b g also represents the number of orthogonal-beams and the maximum number of data streams that can be supported for group g. Finally, by using the quantized angle-pairs, the RF beamformer for group g is obtained as: The RF beamformer design for the proposed AB-HP technique is summarized in Algorithm 2. Hence, the RF beamformer can be realized via phase-shifters by means of the constant-modulus property of the steering vectors. Also, the obtained RF beamformer matrix is unitary, i.e., F H F = I b . Based on Algorithm 2, an illustrative example for the RF beamformer design is discussed in Section VI-A, where Figure 5 explains how to obtain b g for a given AoD support.
It is worthwhile to mention that when the total number of users is higher than the number of available RF chains, the user scheduling policies in [42]- [44] can be applied per each group to serve a subset of users. In this paper, we suppose here that all users can be simultaneously served by the BS.

B. BASEBAND PRECODER DESIGN
After obtaining the RF beamformer, we apply three approaches: joint-group-processing (JGP), per-groupprocessing (PGP) and common-group-processing (CGP) to design the baseband precoding stage. Note that both JGP and PGP are also employed for the EBF-HP and BD-HP techniques in [23], where the RF-stage requiring variable-gain amplifiers as well as phase-shifters is constructed via the channel covariance matrix.

1) JOINT-GROUP-PROCESSING (JGP)
When the estimation of the entire effective channel matrix given in (11) is affordable, the baseband precoder can be jointly designed for all groups according to the JGP approach. Hence, the RZF baseband precoder is given by: where W = H H H + K αI b −1 , α is the regularization parameter and ε is the normalization scalar to satisfy the transmit power constraint defined as: where (a) follows the linearity of trace operator and E dd H = I K , (b) is the direct consequence of the unitary property of the RF beamformer. Using (2), the received signal at the k th user in group g is expanded as: After some mathematical manipulations, the instantaneous SINR at the k th user in the group g for the JGP approach is found as: where H [g k ] = h 1 , · · · , h g k −1 , , h g k +1 · · · , h K T represents the interference channel matrix. By applying the deterministic equivalent principle for large antenna arrays [9], the SINR in (24) converges almost surely to a deterministic quantity SINR J g k , which can be derived as: where m J g for g = 1, · · · , G is the unique solution of: where D J g = F H H g g F. Also, c J g and c J g,p are the elements of the following defined column vector and matrix: where Z, V and v are defined as: Proof: The detailed proof of (25) is presented Appendix A.
While the derived deterministic equivalent given in (25) is the limit of the instantaneous SINR expression in (24) as M goes to infinity. We verify in Section VI-C that the limit is accurate for a practical number of antennas.

2) PER-GROUP-PROCESSING (PGP)
To reduce CSI overhead, the PGP approach builds G different baseband precoder blocks requiring the effective channel for each individual group (i.e., H g = H g F g ). Particularly, only the diagonal block-matrices of the reduced-size effective matrix in (11) are required for the development of the baseband stage. Hence, the baseband precoder can be denoted by a block diagonal matrix as B = diag (B 1 , B 2 , · · ·, B G ). Then, the RZF baseband precoder for group g is given by: where W g = H H g H g + K g αI b g −1 and ε g is the normalization scalar for the group g regarding the transmit power constraint. By using (22) and considering equal power allocation, the normalization scalar is defined as: By again using (2), the received signal at the k th user in the group g is given by: Inter -Group Interference as the summation of the desired information bearing signal, the intra-group interference among the users in the group g, the inter-group interference due to the users located in all other groups and the noise. The inter-group interference VOLUME 8, 2020 results from the approximate zero condition given in (12) for a large array size. Then, the instantaneous SINR for the PGP approach is found as: where H g, [k] According to the deterministic equivalent principle for large antenna arrays [9], similar to the JGP approach, the deterministic equivalent of the instantaneous SINR given in (32) is denoted as SINR P g k . The deterministic equivalent of the instantaneous SINR at the k th user in the group g for the PGP approach is obtained as: where m P g for g = 1, · · · , G is the unique solution of: where D P g = F H g H g g F g . Moreover, c P g and c P g,p are respectively defined as: where The detailed proof of (33) is given in Appendix B.

3) COMMON-GROUP-PROCESSING (CGP)
Recall that the AoD support for each group given in (13) has two dimensions on both x-axis and y-axis. For the PGP approach, in the case of intersection between AoD g and AoD p on either x-axis or y-axis (not on both axes), baseband precoder for groups g and p are separately designed as in (29). We also propose a new design strategy for the baseband stage as the CGP approach, when AoD supports for group g and p have common angles either on x-axis or y-axis. A systematic way to determine the user groups having the common AoD Algorithm 3 Groups Having Common AoD Support Input: AoD g given in (13) for g = 1, · · · , G. 1: Generate γ x , γ y ∈ AoD g angle pairs. 2: if (γ x , ) ∈ AoD p or , γ y ∈ AoD p for any γ x , γ y ∈ AoD g and ∈ [−1, 1], then 3: Consider any group g and p as a common group.

4:
Design a baseband precoder for them via (36). 5: else 6: Consider group g as a single group. 7: Design a baseband precoder for group g via (29). 8: end if 9: Assign the total number of common/single groups to G . support is expressed in Algorithm 3. Then, the common baseband precoder for group g and p is given by: is the total number of users served by the common baseband precoder, b gp = b g + b p is the total number of quantized angle-pairs inside the concatenated RF beamformer, ε gp is the normalization scalar given by: Let G denote the number of baseband precoder blocks with 1 ≤ G ≤ G, the baseband precoder for the CGP approach is obtained as B = diag(B 1 , B 2 , · · ·, B G ). Also, CGP becomes identical to JGP for G = 1 and to PGP for G = G. Therefore, the channel estimation overhead for CGP is between that for JGP and PGP approaches. It is also important to point out that PGP and CGP approaches become identical for ULA structure due to its single-dimensional AoD support on either x-axis or y-axis. By using (31) and (32), after some manipulations, the instantaneous SINR is obtained as in (38) , H T p ] T . As in JGP and PGP, the deterministic equivalent of (38) is obtained as in (39), as shown at the bottom of the next page, where c P t and c P t,g are given in (35) for PGP, m C g is the unique solution of: where D C gp,g = F H gp H g g F gp . Also, c C g , c C p and c C g,p are respectively defined as: where Z gp , V gp and v gp are defined as: Proof: The detailed proof of (39) is given in Appendix C.

V. REDUCING THE NUMBER OF RF CHAINS
The proposed two-stage AB-HP scheme in Section IV-A requires b = G g=1 b g RF chains in order to exploit all degrees of freedom provided by the channel. Moreover, the total number of transmitted independent data streams should be less than or equal to the number of orthogonal-beams (i.e., K ≤ b). Hence, in comparison to the single-stage FDP, the number of RF chains can be reduced from M to b via the proposed two-stage AB-HP by placing b g RF chains for each user group. Indeed, b g is dictated by the AoD support of the group g given in (13) to design F g given in (20).
To further minimize the hardware cost/complexity in the massive MIMO systems, the number of RF chains for the group g, N RF,g ,should be as low as the number of independent data streams in that group. However, when N RF,g is smaller than b g , the RF beamformer F g can only contain N RF,g angles and cannot span the complete range of the phase response matrix g . As a result, some of the paths cannot be used for transmission and the sum-rate is subsequently lower than using all paths via b g RF chains. In order to fully exploit all paths while reducing the number of RF chains, we propose a new transfer block architecture illustrated in Figure 4, which only requires N RF,g = K g RF chains for each user group without affecting the sum-rate performance. Hence, the proposed transfer block architecture is capable to reduce the number of RF chains exactly to N RF = G g=1 N RF,g = G g=1 K g = K without sacrificing the performance.

A. PROBLEM FORMULATION
Given N RF RF chains, the baseband precoder outputs is related to all b angles contained in RF beamformer as illustrated in Figure 2. In other words, the transmitted signal from each RF chain should be mapped to cover all b angles, even when N RF < b. Accordingly, we propose to include a transfer block denoted by T ∈ C b×N RF between the RF beamformer F ∈ C M ×b and the reduced-size baseband precoder B rs ∈ C N RF ×K as represented in Figure 4, where the transfer block is placed at the RF-stage of AB-HP. Hence, the transfer block matrix should be realized via phase-shifters similar to the RF beamformer. According to the transfer block architecture, the transmitted signal from the BS can be rewritten as s = FTB rs d, which satisfies the maximum transmit power constraint of P T , i.e., E s 2 2 = tr B H rs T H TB rs ≤ P T . For a given baseband precoder B and b RF chains, it is desired to find the transfer block T and the reduced-size where T RF represents the set of matrices with the size of b × N RF satisfying the aforementioned unit-modulus property for the transfer block constructed by the phase-shifters. However, (43) is a non-convex optimization problem since the elements of T have unit-modulus.

B. TRANSFER BLOCK DESIGN
Solving (43) for T and B rs is a difficult problem because of the non-convexity of the constraint T ∈ T RF . We propose to represent the transfer block as a summation of two inner matrices as follows: where T A ∈ C b×K and T B ∈ C b×K are inner transfer block matrices requiring only N RF = K RF chains. Therefore, each baseband precoder output is passed through two phase-shifters and their summation is connected to the corresponding RF beamformer input as shown in Figure 4. One can show that when two complex numbers having the unit-modulus are summed, then its modulus varies between 0 and 2. Therefore, by defining T = (T A + T B ), the unit modulus constraint for T expressed in (43) can be converted to a modulus constraint within 0 and 2. According to (44), the optimization problem given in (43) can be reorganized as: Using Lemma 1 and applying the least-squares solution [45], the optimal solution for the inner transfer block and reduced-size baseband precoder matrices can be found as follows: where µ = 1 2 max i,q |B (i, q)| is the half of the highest modulus element at the baseband precoder B. Afterwards, one can prove that (46) minimizes the expression given in (45) by In a nutshell, to further reduce the hardware cost/complexity of the proposed AB-HP shown in Figure 2, we propose a new three-stage architecture by including the additional transfer block capable of reducing the number of RF chains N RF exactly to the total number of data streams K as shown in Figure 4. Furthermore, the three-stage architecture provides the same sum-rate performance as in two-stage, which is validated later in Section VI-D. The functions of each stage can be simply explained as follows: (i) the RF beamformer designed by slowly time-varying AoD information given in (13) reduces the CSI overhead as well as the hardware cost/complexity via reducing the number of RF chains from M to b, (ii) the transfer block based on the effective channel matrix given in (11) minimizes the RF chain utilization by decreasing its number from b to K , (iii) the reduced-size baseband precoder adjusts the transmit power and mitigates the interference among the users. Furthermore, for future work (beyond the scope of this paper), the last stage can be also utilized for power allocation among the users to enhance the sum-rate capacity of the system [9].
Finally, Table 1 summarizes the hardware cost/complexity and the CSI overhead for the proposed AB-HP in comparison to FDP. According to the three-stage architecture, the number of RF chains is reduced exactly to K , which is equal to M in the case of FDP. Regarding the required CSI size, each JGP, CGP and PGP approach expressed in Section IV-B has different CSI overhead according to the effective channel matrix size utilized for the baseband precoder.

VI. ILLUSTRATIVE RESULTS AND DISCUSSIONS
Monte-Carlo simulation results for the sum-rate performance of the proposed AB-HP in a single-cell environment are presented and compared to the approximations achieved by the deterministic expressions. Using the instantaneous SINR expressions given in (24), (32) and (38), the ergodic sum-rate capacity curves for JGP, PGP and CGP approaches are generated by: (47) where υ ∈ {J , P, C}. Furthermore, the approximated sum-rate capacity via the deterministic expressions given in (25), (33) and (39) is obtained by: [bps/Hz]. (48) According to the 3D urban micro-cellular scenario in [46],   assumed to be uniformly distributed in the specified range. Based on the geometry, the mean elevation angle and elevation angle spread can be respectively calculated as θ g = 73 • and δ θ g = 12.5 • by using [46, eq. (7.4-1)]. As in [47], the regularization parameter is chosen as α = σ 2 P T . We define the signal-to-noise ratio (SNR) as SNR = P T K σ 2 .

A. RF BEAMFORMER DESIGN
In Figure 5, the RF beamformer design expressed in Algorithm 2 is visualized for 20 × 20 URA. According to the URA configuration, there are 20 quantized angles along both x-axis and y-axis. Totally, there are 20 × 20 = 400 quantized angle-pairs plotted with red circles, which is equal to the number of antennas. Firstly, based on the AoD information given in Table 2 (i.e., ψ g , θ g , δ ψ g , δ θ g ), the AoD support for each user group is generated by using (13). Then, the quantized angle-pairs covering the corresponding AoD support are determined according to (19), which are represented by blue crosses. Furthermore, the total number of orthogonal-beams for each group (i.e., b g for group g given in (20)) can be easily found by counting the blue-crosses. For example, there are b 1 = 10 quantized angle-pairs in order to cover the AoD support of group 1. Similarly, group 2 requires b 2 = 8 quantized angle-pairs for the RF beamformer design in order to exploit all the degrees of freedom provided by the channel matrix. As a result, the total number of quantized angle-pairs to serve all user groups can be found as On the other hand, the group 1 has an overlapping AoD support with the group 6 along x-axis and the group 3 along y-axis. Similarly, the AoD support of the group 4 overlaps with the group 3 along x-axis and the group 6 along y-axis. According to Algorithm 3, these four groups are considered as a common group in the CGP approach. However, neither the group 2 nor 5 has an overlapping AoD support with other groups. Therefore, the CGP approach designs G = 3 baseband precoder blocks as the first two separate blocks for the group 2 and the group 5 by considering each of them as a single group and the last block for the remaining four groups by considering them as a common group.

B. USER GROUPING
In the following, we compare the performance of the proposed user grouping algorithm against the direct applications of K -means algorithm with chordal distance [24], weighted likelihood [36] and subspace projection [36] similarity measures, which are iterative methods and require a priori knowledge of the number of user groups. Figure 6 demonstrates the superiority of the proposed user grouping algorithm by correctly finding the number of user groups G. It is worthwhile to remark that user grouping algorithms in [24], [36] employ the channel covariance matrix of each user, whereas the proposed user grouping algorithm is based on the AoD information. For all figures, the user locations are plotted in 2D scale, the cross symbols in the middle represent the BS, the straightlines demonstrate the user group boundaries on the azimuth angle domain and the users within the same group have the same mark and color. We assume that there are K g = 6 users per each group and K = 36 users in total. In Figure 6(a), our proposed user grouping approach is applied to first find the number of groups and then cluster the users based on the eigenvectors of the Laplacian matrix defined in (6) having the eigenvalue 0. From Figure 6(a), the proposed algorithm finds the correct number of user groups G = 6. Then, applying K -means algorithm to Q G given in (8) provides well clustered users. On the other hand, the direct application of K -means algorithm as in [24], [36] with an incorrect priori knowledge of the number of user groups results in unsatisfactory clustering. When the input number of groups to K -means is G = 5 (Figure 6(b)), which is lower than its actual value, the well-separated user groups having the disjoint AoD supports (e.g., yellow circles in chordal distance, green diamonds in weighted likelihood and blue squares in VOLUME 8, 2020 FIGURE 6. User grouping for the user deployment as in Table 2 with K g = 6 and K = 36: (a) the proposed approach expressed in Algorithm 1 and the direct application of K -means algorithm with chordal distance [24], weighted likelihood [36] and subspace projection [36] similarity measures with an incorrect priori knowledge of (b) G = 5 and (c) G = 7.
subspace projection) are accidentally clustered in the same group. Considering the phase response matrix given in (3), the well-separated users symmetric about either x-axis (e.g., blue squares in subspace projection) or y-axis (e.g., green diamonds in weighted likelihood) have similar covariance matrices based on the symmetry property of cosine function about x-axis (i.e., cos(α) = cos(−α)) and sine function about y-axis (i.e., sin(α) = sin(π − α)). In Figure 6(c), a priori assumption for the number of groups G = 7 exceeds its correct value G = 6. The users having similar AoD supports (e.g., blue squares and red stars in chordal distance, purple triangles and turquoise hexagrams in weighted likelihood and subspace projection) can be mistakenly clustered in different groups, which may cause additional inter-group interference between those user groups. Figure 7 plots the sum-rate performance of the proposed AB-HP with PGP technique for 20 × 20 URA considering all user grouping techniques applied in Figure 6. The first and the most crucial observation is that the proposed user grouping algorithm can provide a superior performance by means of finding the correct number of user groups and well clustering them. On the other hand, when the number of user groups is presumed as G = 7 for K -means, shown in Figure 6(c), all K -means techniques provide almost the same sum-rate performance. Furthermore, the inter-group interference increases due to the similar AoD support of user groups and the sum-rate curves quickly saturate as the SNR increases. Besides, when the number of user groups is considered as G = 5 shown in Figure 6(b), subspace projection   [24], (c) weighted likelihood [36] and (d) subspace projection [36] with a priori knowledge of G = 5.
can provide better sum-rate performance compared to chordal distance and weighted likelihood similarity measures. The reason for this behaviour is that the AoD support boundaries for each user group are non-overlapping in the case of subspace projection.
The case of randomly deployed users is illustrated in Figure 8 with K = 60 users. The mean azimuth angle of each user ψ k varies between 0 • and 360 • based on their location. In Figure 8(a), the number of user groups are estimated as G = 5 via the proposed user grouping algorithm, then the users are clustered. As seen from the results, the azimuth angle spread of each user group varies due to the randomness. For instance, the user group illustrated via blue squares has a wider azimuth angle spread, however, it is narrower for the user group illustrated via red stars. On the other hand, the proposed approach successfully clusters the user groups by creating well-separation on the azimuth angle domain, which is important to mitigate the inter-group interference. However, by using a priori knowledge of user groups number as G = 5, the direct application of K -means algorithm with chordal distance [24] in Figure 8  in Figure 8(c) and subspace projection [36] in Figure 8(d) cannot provide fine separation. To illustrate, in the K -means algorithm, the well-separated red star users, which are symmetric about x-axis, are clustered in the same group due to their similar covariance matrices as mentioned above. Also, the AoD support for red star users covers the AoD support of the green diamond and blue square users. Figure 9 shows the sum-rate curves obtained from the derived deterministic SINR expressions and from Monte Carlo simulations, where there are K g = 3 users in each group. Moreover, it also compares the sum-rate capacity of the proposed AB-HP employing all three different baseband precoder design approaches (i.e., JGP, CGP and PGP) with the single-stage FDP.

C. VALIDATION OF THE DETERMINISTIC ANALYSES
In Figure 9(a), the sum-rate curves for 20 × 20 URA are plotted versus SNR. According to Table 1 and the chosen quantized angle-pairs for 20 × 20 URA shown in Figure 5, the required CSI sizes are as 56 × 18 for JGP, (40 × 12) + (16 × 3) for CGP and 56 × 3 for PGP. Noting that the FDP needs the entire CSI with the size of 400 × 18, the CSI overhead is decreased by 86% via JGP, 92.67% via CGP and 97.67% via PGP. It is shown that all three proposed AB-HP techniques can provide a comparable performance to FDP with only 1.2 dB performance gap in low to medium SNR regimes. However, in the high SNR regimes, the proposed AB-HP with CGP and PGP reaches a performance floor due to the inter-group interference level created by the approximate zero condition in (12). On the other hand, the CGP approach is capable of mitigating the inter-group interference better than PGP by jointly designing the baseband precoder for the groups, which have the common AoDs along either x-axis or y-axis as explained in Figure 5. Hence, CGP can provide higher capacity than PGP as expected. For instance, the performance floors observed for CGP and JGP are approximately 240 bps/Hz and 180 bps/Hz, respectively. Another important observation is that the performance gap between the proposed AB-HP and FDP remains con- stant for the JGP approach as the SNR increases, so it can avoid the performance floor problem by means of jointly designing the baseband precoder. Furthermore, the numerical results illustrate that the approximated sum-rate is a tight and accurate approximation of the ergodic sum-rate expression, especially in the low and medium SNR regimes. Hence, it validates the correctness of the derived deterministic equivalent SINR expressions for JGP, PGP and CGP approaches given in (25), (33) and (39), respectively.
In Figure 9(b), the sum-rate curves are produced versus various square array sizes having M x = M y for SNR = ±10 dB. It is observed that the sum-rate gap between the proposed AB-HP and FDP vanishes as the array size increases. From Figure 9(b), the sum-rate curves for CGP are always between those for JGP and PGP for all array sizes. Moreover, we can also see that the obtained deterministic equivalent SINR expression for JGP can tightly approximate the actual SINR even for small number of antennas. Regarding the performance floor occurred in the high SNR regimes for CGP and PGP, the approximations become more accurate as the number of antennas increases. For example, when we employ 15 × 15 URA using AB-HP with CGP at SNR = 10 dB, the approximated sum-rate value as 105 bps/Hz is 10.26% below its actual value as 117 bps/Hz. But, the difference decreases to only 0.37% for 50 × 50 URA configuration. Figure 10 illustrates the sum-rate comparison between the proposed AB-HP and FDP for the utilization of 10 × 10 URA and 1 × 100 ULA configurations, both having M = 100 antennas to serve K g = 2 users per group. We mention to remind that PGP is equivalent to CGP in the case of ULA due to its 1D structure, so the sum-rate curves are only generated for JGP and CGP in Figure 10(b). The ''circle'' signs represent the two-stage AB-HP requiring N RF = b RF chains demonstrated in Figure 2, whereas the ''cross'' signs represent the three-stage AB-HP requiring N RF = K = 12 RF chains demonstrated in Figure 4. We numerically show that the hardware cost/complexity of AB-HP can be minimized without sacrificing the sum-rate by employing VOLUME 8, 2020 a transfer matrix, which reduces the number RF chains to N RF = 12. Compared to FDP with N RF = M = 100, the proposed AB-HP with JGP suffers 3 dB in SNR in the case of URA (Figure 10(a)), but offers a comparable the performance in the case of ULA (Figure 10(b)). On the other hand, although the number of RF chains can be decreased from b = 22 to K = 12 for URA and b = 78 to K = 12 for ULA, the required CSI size given in Table 1 is still function of b. Therefore, considering the antenna array configuration, it brings an interesting trade-off between the CSI overhead and sum-rate performance. Figure 11 shows the sum-rate performance versus SNR for a larger array having M = 400 antennas, i.e., 20 × 20 URA and 1 × 400 ULA, for K g = 4. In comparison to Figure 10(a), although doubling the number of users in each group from K g = 2 to K g = 4 increases the interference, the performance difference between FDP and AB-HP with JGP in Figure 11(a) decreases to 1.4 dB in SNR due to larger array size. The main factor for this outcome is that the RF beamformer expressed in (20) has a better angle resolution, when the number of antennas increases. Hence, the RF beamformer enables the higher beamforming gain via focusing the signal energy on the region, where the user groups are located. Furthermore, the performance gap becomes indistinguishable for ULA illustrated in Figure 11(b).

E. IMPERFECT CSI
In the previous studies, we considered perfect CSI (P-CSI) for designing of AB-HP and FDP. In practical applications, CSI is contaminated by estimation error and the precoder matrices are constructed via the imperfect CSI (I-CSI). In presence of channel estimation error, similar to [23], [48], we assume that the BS has a perfect estimate of AoD information because it varies slower than the fast time-varying instantaneous CSI. Therefore, the RF-stage of AB-HP is not affected by I-CSI. On the other hand, the single-stage FDP requires the full I-CSI. In this study, two different models are considered regarding the channel estimation error.
In the first model, we assume that the variance of channel estimation error is fixed for all SNR values as in [48]- [50]. According to the effective P-CSI seen from the baseband given in (11), the effective I-CSI can be given by: where E 1 ∼ CN (0, I b ) and the parameter β characterizes the quality of CSI estimation. Hence, β = 0 represents the P-CSI case and 0 < β ≤ 1 represents the I-CSI case at the BS. In Figure 12, the effect of channel estimation error is investigated for both the proposed AB-HP and FDP, where the BS is equipped with 20 × 20 URA to serve K g = 3 users per group. The sum-rate curves are plotted versus SNR in Figure 12(a), where β = 0 and β = {0.01, 0.05} represent P-CSI and I-CSI, respectively. In addition to AB-HP with CGP and PGP, when the BS utilizes I-CSI for both FDP and AB-HP with JGP, these precoding techniques also reach a performance floor in the high SNR regimes due to the fixed channel estimation error defined in (49). However, in the low and medium SNR regimes, the sum-rate performance under I-CSI with both β = 0.01 and β = 0.05 remains approximately identical as in P-CSI. In Figure 12(b), the sumrate performance is plotted versus the quality of CSI estimation (i.e., 0.01 ≤ β ≤ 0.2) for SNR = 0 dB, where the straight lines demonstrate the case of P-CSI. It is shown that AB-HP with PGP is more robust to channel estimation errors compared to other precoding techniques due to the smaller CSI overhead as shown in Table 1. On the other hand, the performance of FDP requiring full CSI deteriorates in the case of high estimation error. To illustrate, in comparison to P-CSI, when β = 0.2, the sum-rate of AB-HP with PGP only decreases from 122 bps/Hz to 120 bps/Hz, whereas the sum-rate of FDP declines from 133 bps/Hz to 111 bps/Hz. In the second model, we assume that the power of the estimation error is inversely proportional to SNR and the number of pilot symbols used during the training phase as in [51]- [53]. According to the effective P-CSI seen from the baseband given in (11), the effective I-CSI can be given by: 84704 VOLUME 8, 2020 where E 2 ∼ CN 0, σ 2 ζ P T I b . Here, ζ depends on the number of pilot symbols used during the training phase and the applied channel estimation method [51]- [53]. So, ζ approaches to infinity for the case of P-CSI. Based on the second model given in (50), the sum-rate comparison of P-CSI and I-CSI is presented for 20 × 20 URA and K g = 3 in Figure 13. Due to the smaller required CSI size, AB-HP with PGP can outperform FDP in the low and medium SNR regimes. For example, when ζ = 1 and SNR ≤ 5.8 dB, the sum-rate of AB-HP with PGP is higher than its of FDP as shown in Figure 13(a). Moreover, in contrast to the first model given in (49), as long as SNR increases, no performance floor behavior is observed for neither AB-HP with JGP nor FDP because the power of the estimation error is inversely proportional to SNR in the second model given in (50). In Figure 13(b), the sum-rate curves are produced versus the quality of CSI estimation (i.e., 1 ≤ ζ ≤ 20) for SNR = 0 dB, where the straight lines display the case of P-CSI. Similar to the first model represented in Figure 12(b), AB-HP with PGP is again more robust to the channel estimation errors.

F. BENCHMARK
In addition to the single-stage FDP, the proposed AB-HP technique is also compared with the exiting HP solutions as eigen-beamforming based HP (EBF-HP) [23], block-diagonalization based HP (BD-HP) [23], orthogonal beam selection based HP (OBS-HP) [20] and non-orthogonal angle space based HP (NOAS-HP) [20]. It is important to remark that although EBF-HP and BD-HP use the channel covariance matrix as the slowly time-varying CSI for the RF beamformer design, OBS-HP and NOAS-HP construct the RF beamformer via instantaneous CSI and require complete channel knowledge. By defining R g ∈ C M ×M as the covariance matrix of H g = G g g ∈ C K g ×M , the eigenvectors of R g with dominant eigenvalues are utilized to design the RF beamformer for group g in EBF-HP and BD-HP. Considering that the channel covariance matrix may change over time under the same AoD support, then RF beamformer is not required to be updated in the proposed AB-HP scheme, however, EBF-HP and BD-HP need to reconstruct the RF beamformer for every channel covariance matrix. Moreover, due to the non-constant modulus characteristic of the eigenvectors, EBF cannot satisfy the constant modulus requirement and the RF beamformer cannot be realized via phase-shifters. Even though non-constant modulus elements are utilized at the RF beamformer of EBF-HP and BD-HP and the full CSI knowledge is required for OBS-HP and NOAS-HP, they nevertheless serve as benchmark for the sum-rate performance. In terms of the baseband precoder design, we consider that either JGP or PGP is employed for all benchmark techniques.
In Figure 14, the sum-rate curves are plotted versus SNR, where all HP schemes employ N RF = K = 18 RF chains while N RF = 400 for FDP. The numerical results imply that although NOAS-HP provides the best sum-rate performance among all benchmark techniques in JGP, the proposed AB-HP with JGP outperforms NOAS-HP with JGP by  6.5 dB. The primary factor for this performance enhancement is the proposed transfer block design capable of exploiting all degrees of freedom provided by the channel, while NOAS-HP selects only a set of significant beams as many as number of RF chains. Furthermore, in the high SNR regimes, when the PGP approach is utilized at the baseband precoder design, the performance floor for OBS-HP at 167 bps/Hz, BD-HP at 161 bps/Hz, NOAS-HP at 148 bps/Hz and EBF-HP at 142 bps/Hz can be increased to 180 bps/Hz via AB-HP. In other words, the proposed AB-HP with PGP is capable of providing 7.2%, 10.6%, 17.8% and 21.1% sum-rate enhancement with respect to OBS-HP, BD-HP, NOAS-HP and EBF-HP, respectively. There are two main reasons for the beneficial aspect of the proposed AB-HP with PGP. Firstly, the RF beamformer better deal with the effect of the inter-group interferences by providing the approximate zero condition given in (12) satisfied by (16), which cannot be provided by EBF-HP and NOAS-HP. Secondly, the direct application of the AoD support boundary for the RF beamformer given in (20) provides higher beamforming gain instead of only selecting the significant beams within the AoD support in OBS-HP and employing the eigenvectors of the covariance matrix as an indirect solution in BD-HP.  In Figure 15, the sum-rate curves are demonstrated versus the azimuth angle spread δ ψ g for SNR = 10 dB. As seen from the curves, the sum-rate of the proposed AB-HP with JGP and CGP increases for the wider azimuth angle spread similar to their single-stage counterpart FDP. However, the sumrate of all benchmark techniques with JGP remains approximately constant all across the azimuth angle spread values. Additionally, we observe that the proposed CGP approach can provide an intermediate solution between JGP and PGP as expected. For instance, when δ ψ g = 5 • and δ ψ g = 6 • , the number of baseband precoder block for CGP is found as G = 6, which means that CGP is identical with PGP for the corresponding values of δ ψ g . Then, for δ ψ g = 7 • and δ ψ g = 8 • , CGP with G = 4 can provide slightly better sum-rate in comparison to PGP. Similarly, when the azimuth angle spread is increased to δ ψ g = 9 • , we have G = 3 for CGP as illustrated in Figure 5. However, the sum-rate for PGP starts to decrease due the enhanced inter-group interference via the common AoD support along either x-axis or y-axis among the user groups. Afterwards, when 10 • ≤ δ ψ g ≤ 21 • , the performance of CGP with G = 2 closely reaches to JGP. Ultimately, when 22 • ≤ δ ψ g ≤ 25 • , CGP with G = 1 now becomes identical with JGP.

G. ENERGY EFFICIENCY
Finally, we represent the energy efficiency of the proposed AB-HP technique in comparison to FDP. According to the power consumption model in [54]- [56], the energy efficiency η is defined as: where P total = P T + N RF P RF + N PS P PS represents the total power consumption, P T is the total transmission power, P RF and P PS are the power consumption by each RF chain and phase-shifter, respectively, N RF and N PS are the number of RF chains and phase-shifters, respectively. As in [54], [55], we assume that P T = 1 W, P RF = 250 mW and P PS = 1 mW. N RF = b RF chains and N PS = b × M phase-shifters are required for two-stage AB-HP without a transfer matrix as illustrated in Figure 2. Similarly, three-stage AB-HP with a transfer matrix needs N RF = K RF chains and Figure 4. On the other hand, for the single-stage FDP technique, we need to place N RF = M RF chains in total without any phase-shifters. Figure 16 plots the energy efficiency curves versus total number of users K for 20 × 20 URA and SNR = 10 dB. Considering G = 6 groups in Table 2, the number of users in each group varies between K g = 1 and K g = 6 in Figure 16. The proposed three-stage AB-HP technique with a transfer matrix, which reduces the number of RF chain to N RF = K , provides higher energy efficiency compared to the proposed two-stage AB-HP technique with N RF = b = 56 RF chains (please see Figure 5). As expected, the gap between their energy efficiency curves becomes smaller as K increases since it converges to b. Moreover, we observe that AB-HP remarkably improves the energy efficiency in comparison to the single-stage FDP technique.

H. COMPLEXITY ANALYSIS
The computational complexity of the proposed AB-HP techniques with and without a transfer matrix are compared to the single-stage FDP in Table 3 (21). In overall, the proposed AB-HP with JGP without a transfer matrix has a computational complexity of O GM + b 3 + K b 2 . Furthermore, when we include a transfer block matrix as shown in Figure 4, according to (46), the inner transfer block matrices (i.e., T A and T B ) and the reduced-size baseband  precoder (i.e., B rs ) increase the computational complexity by O K 2 b + K 3 . Hence, the computational complexity of the proposed AB-HP with JGP employing a transfer matrix is O GM + b 3 + K b 2 + K 2 b + K 3 . Following similar process, the complexity of AB-HP with CGP and PGP are obtained as in Table 3. Figure 17 compares the complexity of FDP and AB-HP. As shown in Figure 17(a), the complexity of FDP can be significantly decreased by the proposed AB-HP technique. For example, when we employed a transfer matrix for AB-HP to serve K = 36 users, the complexity of JGP, CGP and PGP can be respectively reduced to 0.6%, 0.2% and 0.02% of the complexity of FDP. Furthermore, the efficiency of AB-HP improves as the antenna array size increases as demonstrated in Figure 17(b).

VII. CONCLUSIONS
A new user grouping algorithm and 3D angular-based hybrid precoding (AB-HP) technique have been proposed for massive MU-MIMO systems equipped with URA. The proposed user grouping algorithm can accurately find the target number of groups, then cluster users in the corresponding groups. Considering the practical implementations of the massive MU-MIMO systems with the reduced CSI overhead and hardware cost/complexity, the RF beamformer has been individually designed for each user group to eliminate the inter-group interference. The RF beamformer design only requires the slowly time-varying AoD information. Three different approaches as JGP, PGP and CGP have been examined for the baseband precoder design. By applying the deterministic equivalent principle onto the instantaneous SINR expressions, their tight deterministic approximations have been obtained for all three JGP, PGP and CGP approaches. The accuracy of the derived approximations is verified by Monte-Carlo simulation results. In addition, a transfer block design has been suggested to further reduce the number of RF chains exactly to the number of users without affecting the sum-rate performance. In comparison to FDP, the proposed AB-HP technique (i) requires low hardware and computational complexity, (ii) significantly enhances the energy efficiency performance and (iii) provides a comparable sum-rate performance, where the performance gap between them decreases as the size of URA increases. Furthermore, it has been shown that AB-HP is more robust to the channel estimation errors with respect to FDP. AB-HP provides a superior performance enhancement compared to its two-stage HP counterparts as EBF-HP [23], BD-HP [23], OBS-HP [20] and NOAS-HP [20].

APPENDIX A DETERMINISTIC EQUIVALENT OF THE SINR FOR JGP
The instantaneous SINR for JGP obtained in (24) consists of two terms: (i) the desired signal power at the k th user in group g as S J g k = ε 2 h T g k FWF H h * 2 , hence, we have: A. DESIRED SIGNAL POWER According to Lemma 2, the desired user power is written as: where The term in the numerator of (48) converges to: where m J g is given in (26), (a) can be obtained by using Lemma 3 and (b) is the direct consequence of [9,Theorem 1] for the large random matrices (i.e., M → ∞). The deterministic equivalent of the normalization scalar ε given in (22) with c C g,g and c C g,p defined in (41).

C. INTER-GROUP INTERFERENCE POWER
As in the PGP approach, by using both (70) and (72), the deterministic equivalent of the inter-group interference power for CGP can be found as: with c P t and c P t,g defined in (35), which are obtained for the PGP approach. Then, by substituting (77), (78) and (79) into (73), the deterministic equivalent of the SINR expression for the CGP approach can be found as given in (39).
By using µ = 1 2 max n |a n |, we have the following inequality 0 ≤ |a n | 2µ = |a n | max n |a n | ≤ 1. Therefore, by substituting ϑ n,1 +ϑ n,2 2 = a n , ϑ n,1 −ϑ n,2 2 = |a n | 2µ and cos cos −1 |a n | 2µ = |a n | 2µ into (81), the proof of (80) is completed as follows: µ e jϑ n,1 +e jϑ n,2 = 2µ |a n | 2µ e j a n = |a n | e j a n = a n . (82) Lemma 2 (Matrix Inversion Lemma) [37]: Let X ∈ C N ×N and X + ζ a H a ∈ C N ×N be invertible matrices with a ∈ C N and ζ ∈ C. Then, we have; Lemma 3 [37]: Let X ∈ C N ×N be a random matrix generated by the probability space ( , F, P) such that w ∈ X ⊂ with P (X ) = 1 and X (w) < ∞. Let a ∈ C N be a random vector with i.i.d. entries having have zero mean, variance 1 N , eight order moment of O 1 N 4 and independent of X. Then, a H Xa almost surely converges to tr (X) as N → ∞, thus, we have: Lemma 4 (Trace of a Matrix Product) [45]: Let X = [x 1 , · · · , x N ] ∈ C N ×A and Y = [y 1 , · · · , y N ] ∈ C N ×A be two complex matrices having the same size. Then, we have: Hence, the trace of the above matrix product can be expanded as follows: x H n y n .