Low-Complexity Large MIMO Detection via Layered Belief Propagation in Beam Domain

In large multi-user multi-input multi-output systems, the computational cost and circuit scale of base stations (BSs) are effectively reduced using two-stage signal processing consisting of a slow varying outer beamformer (OBF) based on long-term channel statistics and group-specific multi-user detection for instantaneous channel variations. However, the dimensionality reduction of the group-specific beam-domain channel based on the OBF causes significant performance degradation in the subsequent spatial-filtering detection. To compensate for this drawback, this paper introduces a novel layered belief propagation (BP) detector with a concatenated structure of beam- and antenna-domain BP layers for post-stage OBF processing. The proposed detector is designed for improving the convergence of iterative detection by suppressing intra- and inter-group interference in stages. The layered structure provides the advantages of both beam and antenna domains while maintaining low signal-processing complexity. Numerical results show the validity of our proposed method in terms of the bit error rate performance in both the uncoded and coded cases and the computational complexity.


I. INTRODUCTION
L ARGE multi-user multi-input multi-output (MU- MIMO) systems, which are equipped with a large number of antennas on both the transmitter and receiver sides, have been regarded as one of the most promising technologies in the physical layer of wireless communication systems [1]- [5]. Increasing the dimensions of MIMO channels can increase data rates with the aid of spatial multiplexing gain and improve detection reliability because of spatial diversity gain, in addition to supporting a massive amount of simultaneous wireless communications. However, for large MU-MIMO systems, the precoder design in the downlink and MU detection (MUD) in the uplink are difficult and computationally expensive [2], [3]. Even if linear spatial filters, such as minimum mean square error (MMSE) filters, are used to reduce the computational cost, matrix operations based on the high-dimensional antenna-domain channel matrix are unavoidable, increasing the circuit scale of the transmitter-receiver on the base station (BS) side.
To address this issue, several approaches, including fully digital two-stage beamforming in the downlink precoder design, have been proposed [6]- [11]. A notable method using this approach is joint spatial division and multiplexing (JSDM) [6], where user equipment (UE) is grouped based on the similarity of transmit correlation matrices to design a slowly varying outer beamformer (OBF). In practice, because cellular UE devices tend to be distributed according to the terrain around the BS, they can be efficiently grouped by matching angular spread on the BS side. The OBF is constructed by selecting a set of beams oriented toward each UE group based on long-term channel statistics, and it effectively reduces the dimensions of equivalent channels in the beam (angular) domain [6], [7]. An inner beamformer (IBF) is applied for spatial multiplexing on the reduced beam-domain channel to manage both intra-and inter-group interference, significantly reducing the computational cost and transmitter circuit scale [9]- [11]. The OBF varies over long time scales compared to the IBF and, thus, requires less frequent updates.
The low-complexity strategy involving the two-stage processing with the OBF can also be considered in uplink scenarios. As shown in Fig. 1, the wireless channels between the BS and the UE device exhibit a small angular spread from the perspective of the BS, as a result of local scatterers around the UE device and the high placement of the BS antennas [7], [12]- [15]. Increasing the number of BS antennas This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ leads to high spatial resolution in the beam (angular) domain. Because the received signal from each UE device is observed only in a subset of beams, large MIMO channels exhibit pseudo-sparsity (including values close to zero) in the beam domain owing to the attenuated sidelobes of the OBF. Recent studies have exploited the sparsity to propose low-complexity MUD schemes [16]- [19]. In [18], a dimensionality reduction is used to compress the channel state information (CSI) in the beam domain, with the goal of reducing memory usage. A linear detector using the compressed CSI was empirically shown to achieve improved performance, lower complexity, and reduced memory usage, in comparison with the typical antenna-domain counterparts. Beam selection algorithms and their hardware implementations were also presented in [17], [18]. As a more flexible strategy, a UE-specific OBF design was proposed in [19], where a contiguous block of beams can be selected for each UE device based on the mean square error (MSE) criterion. The resulting local linear MMSE filter achieved a detection capability comparable to that of the full antenna-domain MMSE filter, at a particularly low computational cost. However, these methods work effectively only when the number of UE devices M is sufficiently small compared to the number of receive antennas N .
As M increases, the number of beams required to maintain the performance increases. An excessive dimensionality reduction aimed at computational reduction will result in information loss, and severe performance degradation is unavoidable owing to inter-group interference (overlapping angular spread), particularly when M is of the same order as N . Additionally, M N must be satisfied for applying the conventional massive MIMO simplification based on the strong channel-hardening effect [2] and for the spatial filter to work well [2], [6].
To achieve reliable MUD in high-spatial-loading MU-MIMO scenarios while benefiting from dimensionality reduction, in the present study, we introduce a message passing (MP) algorithm based on belief propagation (BP) [20]- [32] in the post-stage of the OBF. The BP-based signal detector takes advantage of the law of large numbers to reduce the computational cost while maintaining the detection capability. The detection schemes can be roughly classified into matched filter (MF)-based BP and MMSE-based BP. The most common MF-based BP detector is approximate message passing (AMP) [24]- [26], which is systematically derived from a rigorous approximation of Gaussian BP (GaBP) [21]- [23] in the large-system limit, where the input and output dimensions, M and N , respectively, are infinity for a given compression rate ρ = N/M . These algorithms separate the multiplexed signals using MF and consist of only scalar-wise operations; therefore, they can achieve a computational complexity of O(M N ) for each iteration. Furthermore, a rigorous proof of convergence to the Bayes-optimal solution in the large-system limit was presented in [25]. This proof requires the antenna-domain channel matrix entries to be independent and identically distributed (i.i.d.) random variables with zero mean. This assumption is not reasonable in practical MUD scenarios, owing to the physical limitations of the receiver and the spatial fading correlation between BS antenna elements. To address a broader class of practical applications that involve a general form of structured matrices, e.g., unitary-invariant measurement matrices, MMSE-based BP detectors based on the expectation propagation (EP) framework [33] were proposed in [28]- [30] and proven in [31], [32] to converge toward the Bayes-optimal solution in the large-system limit. The EP-based detector utilizes an Onsager correction term to decouple the self-feedback of messages across iterations, as is the case with the AMP; without the term, the functionality of the detector would be consistent with the iterative detector based on probabilistic data association (PDA) [34]. A rigorous comparative analysis of EP and PDA was presented in [32]. The potential connection between AMP and EP was investigated in [35]- [39]. However, these MMSE-based BP detectors require inverse matrix operations, which have a prohibitively high complexity, for detecting each transmit vector. Additionally, even if the inverse matrix operations are allowed on the reduced beam-domain channel, insufficient large-system conditions caused by the dimensionality reduction severely degrade the detection capability.
To resolve the lack of an appropriate MUD scheme as the post-stage of OBF for high-spatial-loading MU-MIMO systems, in this paper, we propose a novel design for a concatenated structure of beam-and antenna-domain BP detectors. First, strongly correlated intra-group interference is iteratively suppressed via local MMSE filters in the beam-domain, and reliable prior information is generated to be passed to the subsequent antenna-domain BP detector. The subsequent MF-based BP is performed in the full antenna-domain to compensate for the information loss due to dimensionality reduction and to promote convergence during iterative detection while leveraging the prior information. Each detector can be regarded as a layer with a different role, and such a multi-layer structure enables the efficient suppression of intra-and inter-group interference in stages. Furthermore, we clarify that the domain conversion from the beam-domain to the antenna-domain between the layers allows us to benefit from the advantages of beam-and antenna-domain processing simultaneously, leading to a dramatic reduction in computational cost while preserving the performance of the EP detector in the full antenna domain. The contributions of this paper are summarized as follows 1 : 1) A novel low-complexity MUD scheme employing concatenated beam-and antenna-domain layered BP based on a given OBF is proposed for high-spatial-loading MU-MIMO systems. 2) By introducing beam-antenna domain conversion in the concatenated structure, the convergence of iterative detection is improved.
3) The proposed method is demonstrated to be feasible with a remarkably low computational cost while maintaining 1 The present paper extends our previous conference papers [40], [41] by presenting a detailed process flow of a layered BP algorithm with domain conversion, extending the layered BP to higher-order modulation schemes, and then providing a detailed simulation analysis of computational complexity and performance in uncoded and coded cases. the bit error rate (BER) performance of the (considerably complex) EP detector in the full antenna-domain channels. We emphasize that, to our best knowledge, a low-complexity MUD based on the OBF for high-spatial-loading MU-MIMO scenarios, which is the key contribution of this paper, has not yet been proposed.
The remainder of this paper is organized as follows. As a preliminary step, Section II presents a system model. Section III presents the appropriate OBF design using the knowledge of channel covariance matrices to reduce the computational cost of the subsequent detector and the circuit scale on the BS side. The equivalent signal model in the beam domain is also defined. Section IV proposes the novel layered BP detector with a concatenated structure for use in general quadrature amplitude modulation (QAM) systems. The beam-antenna domain conversion method is explained in detail. Section V validates the proposed method through simulations in both uncoded and coded cases, and provides a complexity analysis. Section VI concludes the paper.

A. Mathematical Notations
Throughout this paper, vectors and matrices are denoted by lower-and upper-case bold-face letters, respectively. Furthermore, (·) * , (·) T , and (·) H denote the conjugate, transpose, and conjugate transpose operators, respectively. CN (a, b) indicates a complex Gaussian process with mean a and variance b. P a|b [a|b] and p a|b (a|b) respectively represent the conditional probability mass function (PMF) and probability density function (PDF) of a realization a of a random variable a, given the occurrence of a realization b of a random variable b. E a {·} is the expected value of a. E a|b=b {·} denotes the conditional expectation of a, given the occurrence of b of b.
[·] and [·] are the real and imaginary parts of a complex value, respectively. I a represents an a × a square identity matrix. diag [a] denotes a diagonal matrix with nonzero elements equal to a. det[·] is the determinant of a matrix. proj a→b (·) denotes a projection function from an a-dimensional vector to a b-dimensional vector.

B. Antenna-Domain Signal Model
Consider an uplink MU The UE devices tend to be distributed according to the terrain and large structures (e.g., buildings and roads) and therefore are not uniformly distributed around the BS, allowing UE grouping based on the angle of arrival (AoA) of the RX signal in the beam (angular) domain [6], [7]. Here, the UE devices can be grouped into G groups by matching the angular spread, with the set of all groups G = {1, . . . , g, . . . , G}. As M and G increase, the angular spreads between adjacent groups inevitably overlap, leading to severe inter-group interference. Let U g be the set of all UE devices belonging to group g ∈ G, where |U g | = U g is the number of elements in U g . Thus, The m-th UE conveys a modulated symbol x m , which represents one of Q constellation points X = {χ 1 , . . . , χ q , . . . , χ Q }. The coefficient w x , which normalizes the average TX power density to E s , is given by 3E s /(2(Q − 1)) for Q-QAM [2]. Let and y = [y 1 , . . . , y n , . . . , y N ] T ∈ C N ×1 denote the TX and RX symbol vectors, respectively. The antenna-domain RX vector y is given by where H ∈ C N ×M is a MIMO channel matrix in the spatial domain. The channel H is assumed to be static during each coherence interval of length T . The entries of the complex additive white Gaussian noise (AWGN) vector z = [z 1 , . . . , z n , . . . , z N ] T ∈ C N ×1 are i.i.d. complex Gaussian variables with zero mean and N 0 variance, where N 0 is the noise power spectral density and the covariance matrix of z is given by E z zz H = N 0 I N . The n-th entry of y is expressed as where h n = [h n,1 , . . . , h n,m , . . . , h n,M ] denotes the n-th row vector of H. Without loss of generality, we use the geometric one-ring model [12]. Assuming a diffuse 2D field of isotropic scatterers around UE, the element in the i-th row and j-th column of the RX spatial correlation matrix for the m-th UE, Θ m ∈ C N ×N , is given by Here, δ m = (δ max m + δ min m )/2 is the location of the m-th UE in the azimuthal direction, and waves arrive from the m-th UE with an angular spread Δδ m = δ max m − δ min m . The antenna element spacing is fixed to half the wavelength, and μ m represents the path loss. The m-th column vector of H is given by h m = Θ 1/2 m m , where m ∈ C N ×1 represents small-scale fading and has i.i.d. complex Gaussian entries with zero mean and unit-variance [13].

III. OUTER BEAMFORMER DESIGN USING LONG-TERM
CHANNEL STATISTICS To avoid the high-dimensional matrix operations required for large-scale MUD, we adopt a two-stage receiver design consisting of the OBF and BP-based MUD. The OBF determines the size of the spatial filter used in the subsequent MUD, and therefore, plays a significant role in determining the overall computational cost. Let B g denote the number of beams oriented toward each UE group g ∈ G, where g∈G B g ≤ N . Here, the OBF for g, B g , contains all beams corresponding to the UE devices in g. Assuming that UE devices are grouped based on their long-term channel statistics (e.g., channel covariance matrices), we can design group-specific OBF matrices and use them over multiple channel realizations (over multiple coherence times).

A. Beam Selection
There are two well-known heuristic methods to find OBFs: eigen beam selection and discrete Fourier transform (DFT) beam selection [6], [7], [9]- [11]. In eigen beam selection, the eigenvectors of the channel covariance matrix are used to form the OBF matrix via eigenvalue decomposition (EVD) [6], [7]. By choosing B g eigenvectors corresponding to the B g largest eigenvalues, we obtain the OBF matrix B g . In contrast, DFT beam selection uses the column vectors of [11]. Consequently, the problem reduces to finding a subset of column vectors from the unitary DFT matrix. In [11], DFT beam selection was numerically shown to be better than the eigen scheme. Without loss of generality, we use the DFT scheme to form the OBF matrix in this study.
Here, the channel covariance matrix between all the UE devices in group g and the receiver on the BS side is given by To maximize the energy collected by the DFT beams, we select B g DFT column vectors for each group g ∈ G based on Algorithm 1. The maximization metric d H n R g d n is the average received energy from the UE devices in group g that can be received by the n-th DFT beam d n . Upon finding B g ⊆ N, the resulting OBF matrix that covers the strongest signal paths of each group is given by

B. UE Selection
Before shifting our focus to MUD, we represent a signal model in the beam domain with the OBF matrix. For each UE group g ∈ G, the beam-domain RX vector r g ∈ C Bg ×1 can be represented as where Ξ g = B H g H ∈ C Bg ×M and v g = B g z ∈ C Bg ×1 are the group-specific beam-domain channel and noise, respectively. Considering the computational cost for generating spatial filters, however, the number of row dimensions M remains undesirably high.
For further dimensionality reduction of the group-specific beam-domain channel matrix, we ignore the negligibly small inter-group interference from UE devices that are geographically remote with respect to group g owing to the attenuated sidelobes of the OBF. Similarly to DFT beam selection, in this case, we select S g (≤ M ) UE devices for each group g ∈ G based on Algorithm 2. The maximization metric i∈Bg d H i Θ m d i is the average received energy from the m-th UE device that can be received by the OBF B g toward group g. The resulting S g includes the UE devices in group g and the UE devices that strongly interfere with respect to them.

IV. LAYERED BELIEF PROPAGATION
As described above, the OBF-based beam-domain channel can significantly reduce the dimensions of MIMO channels. However, when using only linear structures of spatial filters, insufficient spatial degrees of freedom to suppress the inter-group interference causes a high error floor in the BER performance. Above all, the information loss due to dimensionality reduction cannot be compensated on the reduced beam-domain channels after the OBF. To address these issues, we introduce a concatenated structure of beamand antenna-domain BP detectors. As is well known, a BP detector is one of the nonlinear iterative receivers that provide better performance than linear spatial filters. The most vital point of our proposal is that the MMSE-based BP is performed in the reduced beam-domain channel; and then, using the outputs as prior information, the MF-based BP is performed in the full antenna-domain channel without dimensionality reduction. The MMSE-based BP mainly suppresses the correlated inter-group interference and generates reliable prior information for the subsequent MF-based BP, while taking advantage of the computational reduction resulting from the dimensionality reduction. The subsequent MF-based BP compensates the information loss due to dimensionality reduction and promotes the convergence of iterative detection by leveraging knowledge from the RX vector, CSI, and prior information passed from the MMSE-based BP. The name of the proposed method, layered BP, signifies its multi-stage detection process in which each BP detector has a distinct role. This structure achieves an excellent trade-off between detection capability and computational cost.
In this section, we describe a signal detection mechanism via layered BP as the post-stage processing of the OBF. After presenting the detailed algorithms of the MMSE-based BP layer, MF-based BP layer, and log-likelihood ratio (LLR) calculation for channel-coded systems, this section explains why the extra processing for domain conversion performs the MF-based BP in antenna-domain channels instead of beam-domain channels.

A. Concatenated Beam-and Antenna-Domain Layered BP
A work flowchart of the proposed detection mechanism via layered BP is illustrated in Fig. 2(a). When both the beam-and antenna-domain CSI are available at the receiver, the domain conversion is not necessary. Fig. 2(b) schematically illustrates the belief propagation (consensus) process, which detects signals via the MMSE-based BP layer and MF-based BP layer. For each stage, we use superscripts (·) a and (·) b , respectively. The number of iterations in the MMSE-based BP is K a and the number of iterations in the MF-based BP is K b . The corresponding iteration indices are denoted by k a ∈ {1, . . . , K a } and k b ∈ {1, . . . , K b }, respectively. The MMSE-based BP first suppresses correlated intra-group interference among the UE devices in the subset S g by using beam-domain local MMSE filters w g,m . The subsequent MF-based BP promotes the convergence of iterative detection by suppressing the residual interference in full antenna-domain channels H over all UE devices.
1) MMSE-Based BP Layer: At the first iteration (k a = 1), no soft interference cancellation (soft IC) is conducted, because prior beliefs are unavailable. In the detection of an arbitrary TX symbol x m , ∃m ∈ S g while ignoring inter-group interference from m ∈ S g , the RX vector r g of (5) can be approximated as where ξ g,m denotes the m-th column vector of Ξ g . In the large-system limit [2], the PDF of (6) approaches a multivariate complex Gaussian distribution because of the central limit theorem (CLT). This behavior is called vector Gaussian approximation (VGA). Accordingly,r g,m can be approximated by an equivalent signal model as where ψ g,m is the effective Gaussian noise. The covariance matrix Ψ g,m is derived from the expectation of the random variables v g and {x i , ∀i ∈ S g \ {m}} as with According to ξ g,m and Ψ g,m under VGA, the PDF ofr g,m based on Mahalanobis' distance is expressed as By applying the matrix-inversion lemma to Ψ −1 g,m in (10) to avoid matrix-inversion computations for every TX symbol detection, we obtain group-specific local MMSE filters as where we use Substituting (11) and (12) into (10), the following proportional relationship is obtained as where α g,m (χ q ) is likelihood information for x m = χ q givenr g,m .
In the same manner, the likelihood information {α g,m , ∀m ∈ S g } is computed as beliefs for each g ∈ G.
To facilitate belief combining among the UE groups, α g,m is projected to the M -dimensional vector α g as where each element in α g is given bỳ Assuming that the effective noise terms in α 1 , . . . , α G between the UE groups are not correlated to each other under VGA, the PDF of joint belief l a m is given by where the occurrence of x is assumed to be equiprobable Assuming that the effective noise terms in l a 1 , . . . , l a M are not correlated to each other under VGA in (7), the entries of (18) are readily given by the following symbol-wise expectation aš When the accuracy of VGA in (7) is not sufficient, the soft symbol obtained using (19) deviates from the exact conditional expectation of (18).
At the second and later iterations (k a = 1,x = 0), a soft IC is conducted using the soft symbol vectorx as follows: By canceling the interference from all the UE devices except for the m-th UE, the harmful impacts of inter-group interference can be suppressed.
Here, the group-specific local MMSE filter w g,m in (11) and η g,m in (12) are supposed to be updated according to the soft IC process. However, as long as there is information loss due to dimensionality reduction, the spatial filter will always contain model errors from the exact signal model; therefore, the convergence results of iterative detection will be poor even if w g,m and η g,m are updated at the expense of computational cost. Importantly, updating the spatial filter based on prior beliefs containing model errors induces error propagation, resulting in catastrophic performance degradation. To prevent such ill-convergence behavior and, at the same time, avoid repetitive matrix inverses, we choose to use the same filters, w g,m and η g,m , in all iterations. Of course, the fixed w g,m and η g,m are not matched to the residual interference at each iteration step; therefore, the BP-based iterative detection does not converge to the exact solution. However, the local MMSE filter can work as a whitening filter for suppressing the correlated intra-group interference, and the detection reliability is robustly enhanced via the soft IC process. Consequently, the generated reliable prior information enables the subsequent MF-based BP to converge to better solutions even in correlated MUD.
After updating the likelihood information α g,m in (14) withr g,m in (20), we compute (16), (17), and (19) again in each iteration. When the number of iterations reaches the predetermined value K a , the soft symbol vectorx is input to the subsequent MF-based BP layer as the prior information.
2) MF-Based BP Layer: As the MF-based BP for suppressing residual interference due to dimensionality reduction and fixed spatial filtering, a GaBP-like approach [21], [22], where low computational cost is paramount, is adopted in the antenna domain.
First, soft IC for the n-th RX symbol y n is conducted using a soft symbol vectorx n = [x n,1 , . . . ,x n,m , . . . ,x n,M ] T . At the first iteration (k b = 1), the soft symbol vector is given by the output of the MMSE-based layer asx n =x, ∀n ∈ N . In the detection of an arbitrary TX symbol x m , ∃m ∈ M, the cancellation process is expressed as We utilize a simple MF output as belief [42], which is given by Assuming that the effective noise terms in s 1,m , . . . , s N,m are not correlated to each other owing to the strong channel-hardening effect provided by ideal massive MIMO channels [2], the maximum ratio combining (MRC) of s n,m yields the following posterior belief: Nevertheless, if o m were utilized as the prior belief, it would be subject to the ill-convergence behavior of iterative detection due to the correlation between y n and s n,m included in o m , i.e., self-noise regression due to loopy propagation [21]. Therefore, the prior belief is typically provided by an extrinsic belief l b n,m , which is given by Here, a prior belief vector l b n = l b n,1 , . . . , l b n,m , . . . , l b n,M T is constructed by using (24). Finally, the soft symbol vectorx n can be obtained from the conditional expectation of x, given the prior belief vector l b n , by simply replacing l a with l b n in (18). Thus, the PDF p l b n |x l b n |x is essential to compute the conditional expectationx n = [x n,1 , . . . ,x n,m , . . . ,x n,M ] T . Assuming the strong channel-hardening effects, the symbol-wise conditional expectation is obtained by where the likelihood information is given by For the sake of readability, we have moved the detailed derivations of equations (25) and (26) to Appendix A.
However, the assumption of a strong channel-hardening effect is not reasonable in practical MUD scenarios, owing to the physical limitation of the receiver and the spatial fading correlation. To address this issue, an adaptively scaled belief (ASB) method was proposed in [22], which is equivalent to using the following symbol-wise expectation: instead of (25), where the effective gain for x m is given by According to [22], the predetermined parameter σ at the k-th iteration in (27) is given by the following monotonically increasing function: where σ linearly increases from a 0 to a 1 in the iteration for avoiding rapid convergence to local minima. We utilize (27) for generating soft symbol replicas in the MF-based BP layer. When the number of iterations reaches the predetermined value K b , x m is hard-detected aŝ 3) Algorithmic Structure: Algorithm 3 presents the pseudo-code of the concatenated beam-and antenna-domain layered BP detector. In the presence of spatial fading correlation among RX antennas, s 1,m , . . . , s N,m are correlated; therefore, the belief combining in (24) cannot successfully suppress the interference and enhance the effective noise of l b n,m . To suppress these negative effects, the node selection (NS) method [43], [44] is introduced in the MF-based BP layer, in which the indices in N are divided into several subsets so that the corresponding beliefs in the same subset are as uncorrelated as possible based on the RX antenna pattern. The beliefs are sequentially updated for each set. In Algorithm 3, the RX antenna indices N are divided into V subsets {V i , i = 1, . . . , V }, where |V i | = N/V . Under the ULA pattern, the indices should be evenly divided as V i = {V · (j − 1) + i | j = 1, . . . , N/V } to maximize the distance between two arbitrary RX antennas in each set. With the NS method, the number of iterations is K b · V , but the number of updated beliefs in each iteration is |V i | = N/V . Therefore, the NS method does not change the overall computational cost.

B. LLR Calculation for Channel-Coded Systems
In channel-coded systems, at the final iteration of the MF-based BP (k b = K b ), bit LLRs for input to the subsequent channel decoder must be calculated, instead of being obtained via hard decision. Under scalar Gaussian approximation (SGA) of the interference-plus-noise term inỹ n,m in (21), the conditional PDF ofỹ n,m can be expressed as where the likelihood information is given by [22] γ n,m ( with e n,m = χq∈X |χ q | 2 exp σ ωn,m · β n,m (χ q ) χ q ∈X exp σ ωn,m · β n,m (χ q ) .

Input: y, H, r, Ξ, {B
LLR corresponding to the n s -th bit c ns is given by where X |c ns = c is a set consisting of all candidate symbols such that c ns is c ∈ {0, 1}. For the numerical stability, the summations as in (35) are evaluated using the Jacobian logarithm [1].

C. Beam-Antenna Domain Conversion
Finally, in this subsection, we describe the necessity of domain conversion to perform the MF-based BP in the antenna domain. To properly suppress the residual interference due to the extrinsic MRC process of (24), the effective noise included in s 1,m , . . . , s N,m should be as uncorrelated as possible. As a result of averaging the impact of TX symbols, their correlation strength deeply depends on the instantaneous correlation among channel coefficients in the column dimension; this can be visualized using the intensity of off-diagonal elements in a Gram matrix: G H = HH H . That is, when performing the MF-based BP in the beam domain without domain conversion, instead of G H , we need to consider the following Gram matrix: The noise enhancement level induced in (24) becomes more severe as the intensity of the off-diagonal elements in the Gram matrix increases, resulting in the ill-convergence behavior of iterative detection due to error propagation. Fig. 3 shows the intensity of the off-diagonal elements in G Ξ and G H normalized by M , respectively, where the diagonal elements are removed. As shown in Fig. 3(a), owing to the DFT beamformer D H , the intensity of the off-diagonal elements in G Ξ is partially enhanced. In particular, the adjacent beams in the azimuthal direction increase the instantaneous correlation of the beam-domain channel coefficients, leading to the extreme off-diagonal elements. Consequently, owing to the correlation among s 1,m , . . . , s N,m , the effective noise of l b n,m is enhanced in (24), and the soft symbols tend to contain persistent fatal errors (i.e., incorrect hard decision symbols) via error propagation. Conversely, in Fig. 3(b), the intensity of the off-diagonal elements in G H is uniformly distributed and suppressed to values smaller than those in G Ξ in the beam domain. In this case, the effective noise is not substantially enhanced, and the approximation errors in (24) can be mitigated using the soft IC in the iterative process. Thus, with the MF-based BP in the antenna domain, the convergence property of the layered BP can be improved. Note that the ill-convergence behavior in the beam-domain MF-based BP does not occur in MMSE-based BP [32] because the impact of the DFT beamformer is completely canceled out at the MMSE filtering step owing to the fact that B H g B = I Bg . In practice, the beam-domain channel Ξ is estimated directly in the beam domain; thus, an extra step is needed in order to obtain the antenna-domain channel via FFT as H = DΞ when the antenna-domain channel is not available. The incremental computational cost for this operation is only O (M N log 2 N ) per channel realization.

A. BER Performance
Simulations were conducted to validate the performance of the proposed layered BP detector for large-scale MUD based on the OBF. The average RX powers from all TX antennas are assumed identical based on open-loop TX power control, i.e., μ m = 1, ∀m ∈ M in (3). Further, a sector antenna with a 120 • opening is considered. The UE devices are naturally partitioned into G segments with U g = M/G, ∀g ∈ G UE devices randomly dropped in each segment. The inter-group interference becomes severer as M and G increase. The angular spread for each UE device is 15 • . The beam-domain channel estimation as well as the time and frequency synchronization between TX and RX are assumed to be perfect. Fig. 4 shows BER performance with the system parameters (G, B g , S g ) = (8, 8, 3 · U g ), ∀g ∈ G. To properly suppress the correlated interference from the adjacent groups, S g is set to 3 · U g . The number of UE devices is set to M = 32, 40, 48, 56, and 64, whereas that of RX antennas N is fixed at 64; this represents a high-spatial-loading MU-MIMO configuration. The modulation scheme is Gray-coded QPSK in 4(a)-(d) and Gray-coded 16QAM in 4(e) and (f). The channel code is not utilized. The performance of the following methods is compared 2 : 2 Except under ideal channel conditions for BP [25], [32], it is difficult to theoretically determine the number of iterations required to converge to the fixed point by leveraging analytical tools such as the state evolution [24], [26]. In practical finite-size MIMO channels, the number of iterations for iterative MUD is determined to be fixed under the trade-off between detection capability and computational cost [2], [3]. In this study, we first set a fixed sufficient number of iterations for BP detectors and then determine the number of iterations for each iterative detector based on subsequent qualitative convergence analysis.
• MMSE: Baseline linear MMSE receiver in the antenna domain. • EP: State-of-the-art MMSE-based BP detector [29], [32], where the number of iterations is K = 8. The MMSE filter based on the full antenna-domain channel is applied in each iteration. To improve the detection capability in practical MUD scenarios, some large-system approximations are removed from [32]. If the drawback due to dimensionality reduction is successfully suppressed, the layered BP can approach the performance of EP detector. • GaBP: MF-based BP detector [21], where the number of iterations is K = 16. The AMP [24] and generalized AMP (GAMP) [26] are derived from GaBP using the large-system approximation; therefore, GaBP performs better than AMP and GAMP in finite-size MIMO channels. The performance results that can be achieved without whitening filters are presented. , which shows the ideal performance (lower bound) obtained when the perfect prior information is given. A brief explanation of MFB for MUD is presented in Appendix B. Note that the NS method is used in "GaBP" and the MF-based BP layer of "layered BP," where the number of sets V is set to 4. Several simulations were conducted to find the sub-optimal a 0 and a 1 for minimizing the required E s /N 0 at BER = 10 −4 in ρ = 2.0. Consequently, we set (a 0 , a 1 ) to (2, 0) for "layered BP (b)" and (1, 1) for "layered BP (b→a)." The results in Fig. 4 demonstrate that "MMSE w/ OBF" performs significantly worse than "MMSE" owing to dimensionality reduction based on the OBF. Even for M = 40 in Fig. 4(a), "MMSE w/ OBF" cannot achieve BER = 10 −4 . "GaBP" cannot sufficiently suppress the negative effect of severe interference due to spatial fading correlation, resulting in a high error floor in the BER performances. In contrast, "layered BP (b)" can significantly improve the detection capability compared to "GaBP" with low computational cost, owing to the local MMSE filtering in the beam domain. However, there remains the non-negligible performance gap from "EP," and it becomes larger as M and Q increase. The most attractive feature is that the proposed "layered BP (b→a)" can compensate for the performance gap and approach "EP" owing to domain conversion, even in the Gray-coded 16QAM cases in Fig. 4(e) and (f). The total gain w.r.t "MMSE" at BER = 10 −4 is approximately 6.0 dB, 9.5 dB, and more than 10 dB in the Gray-coded QPSK cases in Fig. 4(a)-(c), and approximately 4.0 dB and 5.0 dB in the Gray-coded 16 QAM cases in 4(e) and (f). In Figs. 4(c), (d), and (f), the error floor occurs even in "layered BP (b→a)," owing to the high-spatial-loading; however, such a low error floor can be suppressed using the channel decoder, as described later. In Figs. 4(a) and (e), where the spatial-loading is relatively low, the EP-based detector and the proposed method can approach "MFB." Let us shift our focus to other MIMO configurations. Fig. 5 shows the E s /N 0 required to guarantee a BER of less than 10 −4 according to M for N = 64. As M increases, the required E s /N 0 increases more rapidly in "MMSE w/ OBF" than in "MMSE" owing to the severe inter-group interference caused by dimensionality reduction in the former. The required E s /N 0 in "MMSE w/ OBF" approaches that in "MFB" only when M is far less than N and the UE devices are spatially well separated. Furthermore, "GaBP" cannot support a high number of UE devices under correlated MIMO channel conditions because of the lack of a whitening mechanism. Owing to the high error floor, it can achieve a BER of = 10 −4 only when M ≤ 16 in Gray-coded QPSK and M ≤ 8 in Gray-coded 16QAM. In contrast, "layered BP (b)" can drastically increase the spatial multiplexing gain in the region of low E s /N 0 ; however, in Gray-coded 16QAM cases, the required E s /N 0 rapidly increases as M increases. By contrast, owing to the domain conversion, "layered BP (b→a)" can be utilized up to approximately M = 56 in Fig. 5(a) and M = 40 in Fig. 5(b). Furthermore, in the region of M ≤ 32, the proposed method can approach "MFB" with almost the same performance as "EP." Another point to note is the BER performances in the coded MIMO systems, which are shown in Fig. 6   case in Fig. 4. The modulation scheme is Gray-coded QPSK in Figs. 6(a) and (b), and Gray-coded 16QAM in Figs. 6(c) and (d). The low-density parity-check (LDPC) code with a rate of 2/3 and length of 1944 coded bits that is used in the IEEE 802.11n standard is applied as the channel code. The error correction by the channel decoder is conducted only once after MUD. As shown in Fig. 6, the performance of "MMSE" is significantly degraded by severe inter-UE interference, while the iterative gain in "EP" is large. Additionally, the performance of "Layered BP (b)" is also severely degraded as M and Q increase. In higher-order modulation signal detection, the MF-based BP detector is more susceptible to the negative effect of the off-diagonal elements in the Gram matrix of channels. As described in Section IV-A, the DFT beamformer enhances the off-diagonal elements, and the convergence property of the beam-domain MF-based BP is severely degraded. Additionally, the consistency condition [1] of the output LLRs does not hold and the channel decoder is not capable of fully performing the error correction. In contrast, the most attractive feature is that "layered BP (b→a)" can approach "EP" in terms of performance owing to the domain conversion. From the viewpoint of generating reliable LLRs that satisfy the consistency condition, the domain conversion becomes more vital in coded MIMO systems. Specifically, the performance degradation from "EP" is only approximately 0.5 dB at BER = 10 −4 in Figs. 6(a)-(c). Even in   3 Finally, we focus on the convergence behavior of iterative detection. Because the MF-based BP layer is designed based on the GaBP rule, it can likely converge to the Bayes-optimal solution under the ideal massive MIMO channel conditions. However, a rigorous convergence analysis under less-than-ideal conditions remains an open issue. Therefore, the convergence behavior of the proposed method is described in a qualitative way using some numerical results. Fig. 7 shows the BER performances of the BP-based detectors at each iteration step. The E s /N 0 is fixed at −1.0 dB in Gray-coded QPSK cases of Fig. 7(a) and (c), and 4.0 dB in Gray-coded 16QAM cases of Fig. 7(b) and (d). The MIMO configurations are (N, M ) = (64, 48) and (64, 32), respectively, and the other settings are the same as in Fig. 4. In Figs. 7(a) and (b), "GaBP" converges to poor local solutions owing to the spatial fading correlation, while "EP" can converge to much better solutions within a small number of iterations (i.e., about 3 or 4), owing to the iterative MMSE filtering. It is worth noting here that "layered BP (b→a)" does not converge to inferior local solutions, but to solutions comparable to "EP." In Figs. 7(c) and (d), the BER performances of "layered BP (b→a)" at each iteration step with different (K a , K b ) settings are presented. "(K a , K b ) = (16, 0)" indicates the performance of only beam-domain MMSE-based BP. Although the convergence speed is very fast owing to the fixed local MMSE filters, the local solution to which the algorithm can converge is poor (far from the optimal solution) because of the information loss caused by the dimensionality reduction. Even if the local MMSE filters are updated at each iteration step, the BER performance deteriorates via iterative processing due to error propagation. By contrast, the convergence solution of "(K a , K b ) = (8,8)" is greatly improved compared to "(K a , K b ) = (0, 16)", which confirms the effectiveness of the multi-layer structure consisting of whitening processing based on beam-domain local MMSE filters and full-antenna domain MF-based BP. Furthermore, when K a is varied while K b is fixed at 8, it can be seen that the algorithm can converge to almost the same solutions even with lower K a settings. Even for the high-spatial-loading MIMO configurations as shown in Fig. 7, setting K a to at least 3 is sufficient to converge to achievable solutions.

B. Complexity Analysis
The computational cost of each detector was evaluated in terms of the number of real multiplication operations required to detect MIMO signals and calculate LLRs passed to the channel decoder. To evaluate the approximate number of   Fig. 7, the number of iterations K is set to 3 for "EP" and is set to 6 for "GaBP." The number of iterations for "layered BP" is set to (K a , K b ) = (3,8). The modulation scheme is Gray-coded 16QAM, i.e., Q = 16.
real multiplication operations, we adopt the following basic assumptions: [45] • A multiplication of a complex matrix A 1 ∈ C d1×d2 with a complex matrix A 2 ∈ C d2×d3 requires d 1 d 2 d 3 multiplications (of complex numbers). • An inversion of a complex matrix A 3 ∈ C d1×d1 requires approximately 2d 3 1 − d 2 1 multiplications and d 2 1 divisions (of complex numbers). • A division of complex numbers requires 1 complex multiplication (division and multiplication are assumed to have the same complexity). • A complex multiplication requires 4 real multiplications. With these assumptions, Table I summarizes the number of real multiplication operations of different detectors; they are plotted in Fig. 8 under some system conditions. Note that we can ignore the computational cost for constructing the OBF because it can be computed offline.
First, we focus on the operations required for each channel realization (per coherence time T ). Among all the methods, "GaBP" achieves the lowest complexity in Fig. 8(a) owing to its scalar-wise detection without any matrix operations. On the other hand, "MMSE" and "EP" require MMSE filter construction including the computation of matrix-inversion and matrix-multiplications according to the full antenna-domain channel size N × M , leading to O(M N 2 + N 3 ) complexity. Conversely, the OBF can reduce the equivalent beam-domain channel size to B g × S g . Thus, the complexity for the construction of the local MMSE filters {w g , ∀g ∈ G} in (11) can be reduced to O( G g=1 S g B 2 g + B 3 g ). Even when we employ the beam-antenna domain conversion in "layered BP (b→a)," its O(M N log 2 (N )) complexity is not dominant for determining the computational cost as shown in Fig. 8(a). Additionally, the reduction of the spatial-filter size allows us to significantly reduce the circuit scale and power consumption at the receiver, which is suitable for practical implementation.
Let us shift our focus to the operations required for the detection of each TX vector x. In "MMSE" and "MMSE w/ OBF," the only process required is linear filtering, and its computational cost is extremely low as shown in 8(b). Owing to the matrix operations according to the N × M antenna-domain channel matrix in each iteration, the computational cost of "EP" is prohibitively high, even if the number of iterations is relatively small. The typical "GaBP" requires O(QM N ) complexity in each iteration for computing the likelihood information for each constellation point. However, the low-computational implementation was proposed in [22], where the soft replica generation process is approximately functionalized so that the likelihood information need not be calculated for each constellation point, and the resulting complexity of "GaBP" can be reduced to O(M N + QM ) without any performance degradation. A similar methodology is applicable for the MF-based BP layer in "layered BP." More specifically, the lines 16-17 and 21-22 in Algorithm 3 can be replaced with the approximate soft replica generation function proposed in [22]. Consequently, the complexity of "layered BP" mainly involves the soft IC in (20) and (21) and the linear filtering process in each iteration, leading to O(M N + QM ). As shown in Fig. 8(b), the computational cost of "layered BP" is suppressed below 10 6 real multiplication operations even in the (64, 64) MIMO configuration.
As can be seen from Figs. 8(a) and (b), the dominant factor for determining the computational cost depends on the coherence time T . In the case of very small T , the proposed method can achieve lower computational cost compared to "MMSE," but in other cases, "MMSE" achieves much lower computational cost. However, considering that the "MMSE" fails to detect signals reliably as M increases and that "layered BP (b→a)" can achieve performance asymptotically approaching "EP," we can confirm that the proposed method realizes an excellent trade-off between detection capability and computational cost. Furthermore, by heuristically optimizing the number of iterations (K a , K b ) as shown in Fig. 7, the computational cost of "layered BP" can be further reduced according to spatial-loading, modulation level, channel model, and so on. From these results, we conclude that there are numerous scenarios in which the proposed method will be effective in terms of detection capability and computational cost.
VI. CONCLUSION In this paper, we proposed a novel concatenated beamand antenna-domain layered BP detector for realizing low-complexity, high-accuracy, large-scale MU-MIMO detection based on the OBF. The dimensionality reduction of beam-domain channels using the OBF, which is designed based on long-term channel statistics, can significantly reduce the computational cost for MUD and the circuit scale at the receiver. However, as the number of UE devices increases and the beam-domain channel size decreases, a high-precision MUD becomes more difficult to realize, owing to information loss and severe inter-group interference caused by dimensionality reduction. To address this problem, a concatenated structure of beam-domain MMSE-based BP and antenna-domain MF-based BP layers is introduced, which efficiently suppresses the severe interference in stages based on the pre-designed OBF. Simulation results explicitly demonstrate that the proposed method can dramatically reduce the computational cost to within a few percentage points of that of the EP-based detector while maintaining the BER performance. The most impressive feature of layered BP is the fact that the low-complexity BP detector is realized for correlated large-MUD scenarios while benefiting the dimensionality reduction.
As future research, we are considering applying our proposed layered BP detector to more flexible UE-specific OBF [19] for further computational reduction and improved convenience in beam-domain channel estimation. In such scenario, to manage severe inter-UE interference, the beliefcombining operation in the MMSE-based BP layer must be improved by taking the impact of partially overlapping local MMSE filters into account. Note that (46) is equal to (26). From (44), the conditional expectationx n is given by From (47) and Bayes' rule, the symbol-wise expectation is given byx From (45) and (48), (25) can be achieved.

APPENDIX B DERIVATION OF THE MATCHED FILTER BOUND (MFB) FOR MUD
The MFB provides the ideal lower bound of BER for the BP-based detectors and also for the maximum likelihood (ML) detector. The MFB is achieved when the inter-UE interference is perfectly canceled; thus, the BP-based detectors achieve the MFB when the soft replica vector used for soft IC is equal to the transmitted vector x at the final iteration. From (21), usingx n = x, the output of perfect soft IC is given bỹ y n,m = y n − i∈M\{m} h n,i x i = h n,m x m + z n .
By computing (22), (23), and (30) withỹ n,m of (49), the hard-decision symbol of MFB can be obtained. Similarly, by computing (32) and (35) withỹ n,m of (49) and ψ n,m = N 0 , ∀m, n, the bit LLRs of MFB can be computed. For ease of explanation, we have used the formula of the MF-based BP layer; however, the same results can be obtained for other BP-based detectors.