Bilinear Gaussian Belief Propagation for Massive MIMO Detection With Non-Orthogonal Pilots

We propose a novel joint channel and data estimation (JCDE) algorithm via bilinear Gaussian belief propagation (BiGaBP) for massive multi-user MIMO (MU-MIMO) systems with non-orthogonal pilot sequences. The contribution aims to reduce significantly the communication overhead required for channel acquisition by enabling the use of short non-orthogonal pilots, while maintaining multi-user detection (MUD) capability. Bilinear generalized approximate message passing (BiGAMP), which is systematically derived by extending approximate message passing (AMP) to the bilinear inference problem (BIP), provides computationally efficient approximate implementations of large-scale JCDE via sum-product algorithm (SPA); however, as the pilot length decreases, the estimation accuracy is severely degraded. To tackle this issue, the proposed BiGaBP algorithm generalizes BiGAMP by relaxing its dependence on the large-system limit approximation and leveraging the belief propagation (BP) concept. In addition, a novel belief scaling method complying with the data detection accuracy for each iteration step is designed to avoid the divergence behavior of iterative estimation in the early iterations due to the use of non-orthogonal pilots, especially in insufficient large-system conditions. Simulation results show that the proposed method outperforms the state-of-the-art schemes and approaches the performance of idealized (genie-aided) scheme in terms of mean square error (MSE) and bit error rate (BER) performances.

Bilinear Gaussian Belief Propagation for Massive MIMO Detection With Non-Orthogonal Pilots Kenta Ito , Graduate Student Member, IEEE, Takumi Takahashi , Member, IEEE, Shinsuke Ibi , Member, IEEE, and Seiichi Sampei, Life Fellow, IEEE Abstract-We propose a novel joint channel and data estimation (JCDE) algorithm via bilinear Gaussian belief propagation (BiGaBP) for massive multi-user MIMO (MU-MIMO) systems with non-orthogonal pilot sequences.The contribution aims to reduce significantly the communication overhead required for channel acquisition by enabling the use of short non-orthogonal pilots, while maintaining multi-user detection (MUD) capability.Bilinear generalized approximate message passing (BiGAMP), which is systematically derived by extending approximate message passing (AMP) to the bilinear inference problem (BIP), provides computationally efficient approximate implementations of large-scale JCDE via sum-product algorithm (SPA); however, as the pilot length decreases, the estimation accuracy is severely degraded.To tackle this issue, the proposed BiGaBP algorithm generalizes BiGAMP by relaxing its dependence on the large-system limit approximation and leveraging the belief propagation (BP) concept.In addition, a novel belief scaling method complying with the data detection accuracy for each iteration step is designed to avoid the divergence behavior of iterative estimation in the early iterations due to the use of non-orthogonal pilots, especially in insufficient large-system conditions.Simulation results show that the proposed method outperforms the state-of-the-art schemes and approaches the performance of idealized (genie-aided) scheme in terms of mean square error (MSE) and bit error rate (BER) performances.

I. INTRODUCTION
M ASSIVE multiple-input multiple-output (MIMO) sys- tems have been considered as one of the key technologies for the fifth generation (5G) advanced and sixth generation (6G) networks, which promise significant performance improvements in many aspects, such as spectral efficiency, detection reliability, and energy efficiency [1], [2], [3].In particular, massive MU-MIMO technology, where a base station (BS) is equipped with massive number of antenna arrays, can simultaneously serve a massive amount of wireless links, and brings enormous connectivity in the uplink access [4], [5].Under such a scenario, low-complexity and large-scale MUD consisting of high-dimensional channel estimation and subsequent data detection play a key role in the operation of massive MU-MIMO [6], [7].
Conventional MUD is composed of the following two steps, channel estimation based on a pilot (training) sequence and subsequent data detection based on estimated channel state information (CSI).To obtain highly accurate CSI knowledge with this scheme, orthogonal pilot sequences must be used under the assumption that the pilot length K p is, at least, larger than the maximum number of uplink users M communicating simultaneously [8].This implies that the communication overhead required for CSI acquisition increases rapidly with the number of simultaneous connections, which means that it might be infeasible to obtain accurate CSI in the case of fast-fading environments due to its short channel coherence time [9], [10].In addition, the facile use of short non-orthogonal pilots (K p < M ) for channel estimation results in rank-deficient conditions in the estimated channel, leading to severe performance deterioration in subsequent data detection.
One promising solution to shorten the pilot length without sacrificing the estimation accuracy is JCDE [11], [12], [13], [14].By exploiting the statistical quasi-orthogonality of data sequences, JCDE takes advantage of estimated data symbols as equivalent soft pilot symbols, providing significant improvements in system performance.The classical JCDE scheme assumes the exchange of log-likelihood ratios (LLRs) between the signal detector and channel decoder based on the turbo-principle [11], [12], where the reliability of tentative data symbols is enhanced at every iteration through an error correction process and then used as additional pilot symbols.However, iterative decoding increases power consumption and causes severe processing delays, which have been obstacles to the practical use of these typical JCDE algorithms.To address this issue, the JCDE scheme based on Bayesian message passing (MP) without requiring channel decoding at every iteration has been investigated [15], [16].
The most common algorithm of this approach is BiGAMP [15], which is derived by approximating the SPA designed for the BIP according to the generalized approximate message passing (GAMP) framework [17], [18] in the largesystem limit. 1 However, the original GAMP has been proposed in the context of linear inference problems (LIPs), which is appropriately designed on the premise of a bipartite factor graph (FG) consisting of factor nodes (FNs) and variable nodes (VNs) corresponding to observations and unknown parameters, respectively.This makes it difficult to straightforwardly extend GAMP to the BIP which is represented by a tripartite FG consisting of FNs and two VNs.Therefore, the BiGAMP algorithm in [15], which is derived by extending the GAMP algorithm to the BIP forcibly, cannot properly decouple the self-feedback between the two VNs across iterations using the Onsager correction term [16].Another problem is that in aligning the resultant message update rules to correspond with the GAMP rules, the approximation accuracy assumed in the derivation of each message is not consistent throughout the algorithm.Because of the above inconveniences, the JCDE algorithm via BiGAMP in [15] can achieve high estimation accuracy only when a sufficient system size can be assumed and prior information, e.g., sufficiently long pilots, is available [19].
On the one hand, while accepting the drawback of BiGAMP described above, some methodologies have been proposed to achieve robust signal recovery by incorporating the knowledge of problem structure into the MP rules.In [20], [21], and [22], for instance, the sparse structure of massive MIMO channel in the beam domain brought about by a limited range of angle of arrival (AoA) is exploited as additional prior information to mitigate the under-determined condition of MUD.
On the other hand, in an attempt to solve the drawback of BiGAMP described above, the authors in [16] focus on the FG structure of BIP and propose to modify the MP rule of BiGAMP itself.Specifically, by interpreting a tripartite graph corresponding to BIP as consisting of two bipartite graphs, and by designing the MP rule while considering both the self-feedback that occurs in each bipartite graph and the self-feedback that occurs in information exchange between the graphs, this enables us to derive appropriate Onsager correction terms.The resulting MP rule is a reasonable extension of the GAMP framework to BIP, eliminating the mathematical inconsistencies found in the conventional BiGAMP algorithm.Through numerical simulations, it has been shown in [16] that the modified algorithm achieves more robust performance against changes in system size than the conventional alternatives.However, poor convergence behavior due to mismatch with the asymptotic conditions in the large-system limit is still unavoidable.The mitigation of this behavior requires the introduction of successive MP mechanism that results in an increase in the processing delay [16].
Inspired by these works, we proposed the JCDE algorithm via BiGaBP in [23] 2 to achieve more robust signal recovery in more realistic system sizes, with the same order of complexity as BiGAMP.The core idea of the proposed method is to relax approximations in the algorithm derivation process.Gaussian belief propagation (GaBP) [24], [25] relies only on scalar Gaussian approximation (SGA) based on mild central limit theorem (CLT), in contrast to the much harder asymptotic conditions required by the GAMP algorithm.Indeed, the GAMP algorithm in the context of LIPs is proven to be systematically derived from a rigorous approximation of the GaBP algorithm in the large-system limit condition [26].The proposed BiGaBP algorithm, which can be systematically derived by extending the GaBP framework to BIPs, also relies only on SGA, which provides more stable iterative convergence behavior in insufficient system sizes even when using non-orthogonal pilots [23].The BiGaBP framework was recently employed in the receiver design of cell-free massive MIMO (CF-mMIMO) systems adopting low-resolution analog-to-digital converters (ADCs) [27], the joint activity and channel estimation (JACE) scheme of extra-large MIMO (XL-MIMO) systems [28], and the receiver design of grant-free (GF) access [29], and in all of them the BiGaBP-based approaches were shown to outperform earlier state-of-the-art schemes.
However, there are no literature that mention in detail the differences and relationships among the algorithmic structures of BiGAMP [15], modified BiGAMP [16], and BiGaBP.The performance comparison via computer simulations presented in the literature [23] was done only in a limited problem setting in a specific system configuration; hence, its systematic position and effective scope have not yet been fully reported.Therefore, to validate the efficacy of the BiGaBP framework to design the JCDE algorithm, this article extends our conference paper [23] by presenting a detailed process flow of the JCDE algorithm via BiGaBP with a novel belief scaling method, clarifying the relationship between BiGAMP, and then providing a detailed simulation analysis of estimation performance.
The contributions of the article are summarized as follows3 : • A novel JCDE algorithm via BiGaBP is presented.The derivation relies only on the SGA in conformity with mild CLT, whose underlying assumptions are much softer than the large-system limit assumption on which the BiGAMP algorithm heavily relies, is presented.In addition, to suppress unstable iterative convergence behavior due to using non-orthogonal pilots, we also propose a new belief scaling method that extends adaptively scaled belief (ASB), which was proposed in [25], [26], and [33] so as to improve detection capability of the MP algorithm in LIPs, to BIPs.Specifically, the proposed scaling method combines the beliefs of the pilot and data parts propagated on the FG with different weights depending on the number of iterations, thereby suppressing iterative divergence behavior in the first stage of iteration, which is a problem when using non-orthogonal pilots.Note that this extension is only valid for BiGaBP which can suppress the self-feedback via generating extrinsic values based on the BP regime rather than Onsager term.• To clarify that BiGaBP is truly a relaxed approximation version of BiGAMP, we prove that the BiGAMP algorithm in [16] can be derived by approximating the BiGaBP algorithm in asymptotic conditions of the largesystem limit.Furthermore, we compare the BiGAMP algorithms proposed in [15] and [16], respectively, to identify differences in algorithm structure resulting from the different interpretations of FGs described above.• To confirm the efficacy of the JCDE algorithm via BiGaBP in massive MU-MIMO systems, we compare the performance of conventional and proposed methods on various system parameters.Our numerical results are presented in terms of normalized mean square error (NMSE) of estimated quantities and BER.In addition to evaluating performance as a function of conventional signal-to-noise ratio (SNR), we also evaluate the estimation performance for different pilot length and data length settings.The simulation results are shown to outperform the current state-of-the-art for any system parameter and approach the performance of an idealized scheme in which channel coefficients are perfectly known.Finally, this section concludes by discussing deep learning (DL)-based JCDE schemes that have been attracting attention in recent years and then clarifying the position of the proposed method relative to these methods.Starting with the proposal of the direct demodulation technique without prior explicit channel estimation was proposed for orthogonal frequencydivision multiplexing (OFDM) systems in [34], the JCDE methods using deep neural networks (DNNs)-based channel equalization, have been actively investigated [35], [36], [37].More recently, the DL-based JCDE algorithms incorporate model-driven algorithm design, such as the deep unfolding (DU) technique [38], have been proposed for various systems in [39], [40], and [41].These methods have been reported to achieve more efficient learning by leveraging domain knowledge and can achieve better performance at lower learning cost than the traditional black-box counterparts, especially in massive MIMO scenarios.As a template for such DU-aided methods, this article contributes to provide a novel framework to design a more robust JCDE algorithm.
Notation: Throughout this paper, vectors, and matrices are denoted by lower-and upper-case bold-face letters, respectively.The conjugate, transpose, and conjugate transpose operators are denoted by • * , • T , and • H , respectively.P a|b [a|b] and p a|b (a|b) respectively represent the conditional probability mass function (PMF) and the probability density function (PDF) of a realization a of random variable a given the occurrence of a realization b of random variable b.E a {•} is the expected value of random variable a. E a|b=b {•} denotes the conditional expectation of random variable a given the occurrence of a realization b of random variable b.C a×b denotes a complex field of size a × b.CN ( ẋ; a, b) indicates that ẋ obeys a complex-valued Gaussian process with mean a and variance b.

II. PRELIMINARIES A. Signal Model
Consider a single-cell massive MU-MIMO system, where a BS has N receive (RX) antennas and M (≤ N ) user equipment (UE) devices are equipped with a single transmit (TX) antenna.At the k-th discrete time instance, the m-th UE transmits a TX symbol x mk , which represents one among The average energy density of constellations in X is denoted by E s .Denoting the TX vector at the k-th discrete time instance by x k ≜ [x 1k , . . ., x mk , . . ., x M k ] T ∈ X M ×1 , the RX symbol received at the n-th RX antenna under the assumption of frequency-flat and slow-fading is given by where H ∈ C N ×M denotes an N × M MIMO channel matrix, where the (n, m) element, h nm , obeys CN (h nm ; 0, ϕ), with ϕ = 1/N , owing to slow TX power control.The complex additive white Gaussian noise (AWGN) vector is denoted by w k ≜ [w 1k , . . ., w nk , . . ., w N k ] T ∈ C N ×1 obey CN (w nk ; 0, N 0 ), where N 0 is the noise spectral density, and thus the covariance matrix of w k is given by Then, the spatial-temporal matrices can be expressed as Assuming that the channel matrix H is constant during K successive transmissions, concatenating K successive RX vectors yields the following compact spatial-temporal RX signal representation as In the TX symbol matrix X, each UE device forms a frame with a length of K symbols, which includes K p pilot symbols with the index Consequently, the spatial-temporal matrices can be sub-divided into pilot and data parts as follows: The goal of the JCDE algorithm is to detect the intended TX symbol matrix X d and accurately estimate the channel matrix H, out of the spatial-temporal RX matrix Y and pilot matrix X p .

B. Channel Estimation Using Spatial Filtering
In this subsection, we briefly review basic channel estimation techniques based on spatial filtering using X p to obtain the initial estimates input to the JCDE algorithm.
1) Least square (LS): When using orthogonal pilots (K p ≥ M ), the channel estimator based on the LS criterion is given by where is Moore-Penrose pseudoinverse of X p .The linear filtering error is uniformly superimposed on the estimates; hence, the MSE of each element in H LS can be expressed as 2) Minimum norm solution (MNS): When using short non-orthogonal pilots (K p < M ), we can no longer compute X † p due to the underdetermined condition.Instead, one can use the MNS to find a unique solution, which is given by the Lagrange multiplier method as where Similarly, the MSE of each element is given by where X H p is a resolution matrix.From ( 6), severe estimation error is inevitable in channel estimation based on the MNS owing to the off-diagonal elements in G p , which is caused by the non-orthogonality of X p .

III. BILINEAR GAUSSIAN BELIEF PROPAGATION
This section describes a MUD process based on a BiGaBP mechanism to design JCDE receivers for massive MU-MIMO systems.Fig. 1a shows the tripartite FG consisting of FNs and two VNs, which correspond to the channel coefficients and TX symbols, respectively.The edges between nodes indicate dependencies between corresponding variables, and the information is propagated and exchanged across these edges to perform bilinear inference.In addition, Fig. 1b shows the entities of the propagation information and the direction in which they propagate between each node.As can be seen from these figures, the estimation procedure of the proposed method is performed by exchanging beliefs (i.e., likelihood information reflecting estimation reliability) and soft replicas (i.e., tentative estimates) on the tripartite FG.
To elaborate, in the FNs, soft interference cancellation (Soft IC) is performed on each RX symbol using the soft replicas of H and X estimated in the previous iteration.Then, the beliefs are computed based on the Soft IC outputs, which are propagated to the VNs.In the VNs, the beliefs from the FNs are combined following the typical BP regime [24] to generate extrinsic likelihood information.Then, the soft replicas of H and X are computed on the basis of the conditional expectation, given the beliefs, which are used in the FNs at the next iteration step.We emphasize that all belief distributions propagated from FNs to VNs are approximated by a scalar Gaussian distribution based on CLT, so that only information on the mean and variance of the estimated belief distribution is actually required [25].That is, please note that the entities of the beliefs are their mean and variance values, as shown in Fig. 1b.
For latter convenience, let us define the soft replicas of x mk and h nm as {x n,mk , ∀n} and ĥk,nm , ∀k , respectively, such that their MSEs can be respectively expressed as where xn,mk ≜ x mk − xn,mk and hk,nm ≜ h nm − ĥk,nm denote the estimation errors, respectively.Herein, the JCDE algorithm for estimating x mk and h nm is focused upon.Please note that in practice the following process is performed on all the RX indices in parallel to estimate all channel coefficients and TX symbols simultaneously.

A. FNs: Soft IC and Belief Generation
Let us start with the Soft IC for the (n, k) symbol in Y , y nk , with the aid of the soft replicas xn,mk , ĥk,nm , ∀m generated in the previous iteration step.At the first iteration (t = 1), the soft replicas are appropriately initialized.In the detection of an arbitrary TX symbol x mk , the cancellation Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
process is expressed as Under large-system conditions, the residual interference-plusnoise component in ( 8) can be approximated as a complex Gaussian variable in conformity with CLT; this approximation is referred to as SGA [1].Accordingly, under an SGA of the effective noise ν x m,nk , the conditional PDF of ỹm,nk , given x mk , can be expressed as with where |x mk | 2 = E s when the modulation scheme is phaseshift keying (PSK).Even when using quadrature amplitude modulation (QAM) signaling, we may approximately use the true variance, i.e., In a similar manner, the estimation of an arbitrary channel coefficient h nm can be obtained.When using QAM, under SGA conditions in (8), the conditional PDF of ỹm,nk , given h nm , can be expressed as with where the instantaneous channel gain |h nm | 2 is not available; hence, we here approximately use the true variance, i.e.,

B. VNs: Belief Combining and Replica Generation
Assuming a high-precision SGA of the effective noise components in {ỹ m,nk , ∀n}, the beliefs corresponding to x mk are combined over all the RX indices except for the n-th RX index, which results in the extrinsic belief p rn,mk |x mk (r n,mk |x mk ) for x mk .This is expressed as where The extrinsic combining operation in (13) enables significantly suppress the correlation between y nk and rn,mk by removing the belief propagated from the n-th FN, ỹm,nk , from the extrinsic belief and decoupling the self-noise regression due to w nk .Consequently, the trapping of the iterative process in a poor local solution can be avoided.In a similar manner, the extrinsic belief p qk,nm |hnm (q k,nm |h nm ) for h nm is expressed as where Assuming that the effective noise components in {r n,mk , ∀m, k} are not correlated to each other under SGA conditions, using Bayes' rule, the soft replica of x mk and its MSE can be in general obtained from the symbol-wise conditional expectation, given rn,mk , as xn,mk where the denominators in the summation is introduced for normalization purpose.
In Gray-coded quadrature PSK (QPSK) signaling, i.e., X = {±c x ± jc x }, c x = E s /2, with p x mk (χ q ) = 1/Q, ∀χ q ∈ X , ( 17) can be readily obtained by the following denoiser as [42] xn,mk In a similar manner, the soft replica of h nm can be obtained from the coefficient-wise conditional expectation, given qk,nm , as ĥk,nm When h nm obeys CN (h nm ; 0, ϕ), using the Gaussian-PDF multiplication rule [15], yields where C and C ′ are Pearl's normalization constants.From (20), the soft replica of ĥnm and its MSE are expressed as Thus, here the entire processing in the JCDE algorithm via GaBP is completed.

C. Design of ASB for JCDE via BiGaBP
Although the large-system approximation assumption is relaxed in BiGaBP schemes compared to the conventional BiGAMP, the operating principle of BiGaBP still depends on an accurate SGA of residual interference components in (8).When this approximation accuracy is not sufficient due to, e.g., physical limitations of the receiver, mismatches between the SGA and the stochastic behavior of actual effective noise may result in belief outlier [25].This makes it hard to generate accurate soft replicas using the denoiser functions of ( 18) and ( 21), causing performance deterioration due to error propagation [43], especially during earlier iterations of the JCDE algorithms and/or in systems with short non-orthogonal pilots.
As a simple and highly effective solution to mitigate such potential issues, belief damping [19], [44], which prevents the algorithm from converging to local minima and belief scaling [25], [26], [45], which controls convergence speed, have been proposed.In this subsection, we propose an extension of ASB [25], [26], a belief scaling method designed for the Bayesian linear inference via low-complexity MP algorithms, to BIPs to further improve the convergence property of the JCDE algorithm via BiGaBP.
1) ASB for Data Detection: The main cause of significant performance deterioration in data detection via the MP algorithms under insufficient large-system conditions is the error propagation of erroneous hard-decision symbols as the soft replicas due to the input of the aforementioned beliefs outliers to the denoiser function of (18a).This is due to the fact that the variance ψ r n,mk in (18a) gives the shape of the optimal denoiser function only when the algorithm is operating with ideal behavior and thus cannot properly handle belief outliers.To address this issue, the ASB introduces a parameter instead of ψ r n,mk to adjust the denoiser function according to the iteration number, while taking into consideration the instantaneous CSI.In QPSK signaling, when using ASB, the denoiser functions of (18) can be replaced with [25] xn,mk where γ is the scaling parameter.According to [25], the scaling parameter is designed to be a monotonic increasing function of the number of iterations, i.e., γ(t) = τ 0 + τ 1 • t T , where (t) indicates the variable at the t-th iteration step, T is the maximum number of iterations, and (τ 0 , τ 1 ) are the predetermined parameters.
Fig. 2 shows the dynamics of f ASB (r n,mk , γ) as a function of rn,mk with different scaling parameters, which indicates that the scaling parameter changes the slope of the denoiser function, instead of ψ r n,mk in (18a), enabling adjustment of the iterative convergence speed of the algorithm.In the early Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.iterations, γ is set to lower values to prevent convergence to local minima due to the erroneous hard-decision symbols, and in the later iterations, γ is set to higher values to facilitate the convergence of the algorithm.
2) ASB for Channel Estimation: Next, we consider how ASB for channel estimation should be designed.From (22a), it is found that the scaling parameter in ASB is multiplied by the log-likelihood, i.e., rn,mk , with respect to the detected symbol.In addition, belief combining for channel estimation is performed for discrete-time indices k ∈ K as shown in (15); however, it is not reasonable to multiply the beliefs propagated from the discrete-time indices corresponding to the pilot symbols and to those corresponding to the data symbols, by the same scaling parameter.This is because they should have different statistical properties throughout the iterative process, as described later.
Based on the above, a straight extension of ASB for data detection to channel estimation would replace the extrinsic belief in (15) with with where α and β are scaling parameters corresponding to the pilot and data parts, respectively.From ( 24), α and β are responsible for adjusting the ratio of the beliefs propagated from the pilot and data parts, and (23) coincides with (15) if and only if α = β = 1.It is also worth mentioning that when α = β, qk,nm in (24a) coincides with qk,nm in (16), and only ψ q k,nm is multiplied by the scaling parameter, i.e., ψq k,nm = αψ q k,nm , which is equivalent to the operation to change the slope in the denoiser function of f h (q k,nm , ψq k,nm ) as well as the ASB mechanism for data detection adjusting the equivalent variance of extrinsic beliefs, as can be inferred from (22).Since f h (q k,nm , ψq k,nm ) is a simple linear function of qk,nm , it is not suitable for adjusting the iterative convergence speed by the scaling parameters taking advantage of the non-linearity of denoiser function.Instead, we can utilize the scaling parameters, α and β, as weight parameters to stabilize the convergence behavior of iterative estimation in the early iteration steps.
From (24), the beliefs propagated from the VNs corresponding to the pilot symbols contain only the estimated channel coefficient uncertainty, while the beliefs propagated from the VNs corresponding to the data symbols contain both the estimated channel coefficient and estimated data symbol uncertainty.In other words, the reliability of the former beliefs is much higher than that of the latter beliefs, especially in the early iteration steps when the reliability of detected symbols, i.e., soft replicas of {x mk , ∀(m, k)}, is low.
In light of the above, it may be possible to suppress the divergence behavior of iterative estimation by setting α and β to large and small values, respectively, in the early iteration steps.As the iterative process increases the reliability of the soft replicas, gradually decreasing the value of α and gradually increasing the value of β can facilitate the iterative convergence of the algorithm; thus, it is reasonable for the dynamics of α and β to be monotonically decreasing and monotonically increasing functions, respectively, with respect to the number of iterations.In addition, the dynamics of α and β should satisfy the following two requirements: a) α = β = 1 at the final iteration step so as to converge to the fixed point of the original BiGaBP algorithm, and b) K p + K d = αK p + βK d so as to change the weights between the pilot and data parts without varying the scale of the extrinsic belief.
Based on the above, if a simple linear function is chosen as the monotonic function, the dynamics of scaling parameters is given by a function of the number of iterations as where 0 ≤ τ 2 ≤ 1 is the predetermined parameter to determine the initial value of β, i.e., β(0).The operation of ASB for channel estimation can be interpreted as a soft scheduling strategy for belief update, where the reliability of the beliefs is sequentially enhanced from the nodes near the VNs corresponding to the pilot symbols to the whole of the FG via iterative processes, leading to stabilization of the convergence behavior of BiGaBP.

D. Algorithmic Description
The pseudo-code of the JCDE algorithm via BiGaBP with belief damping and scaling is given in Alg. 1.Besides the RX matrix Y and the pilot matrix X p , the algorithm requires the initial estimates of the channel coefficients Ĥ and their MSEs ψh , which are given by the spatial filtering based on MNS or LS criteria according to the pilot length, as described in Section II-B, outputting the hard-decision estimates of the TX matrix and the estimates of the channel coefficients.The belief damping [15], [19], [44] is also introduced in lines 11, 12, 15, and 16 to prevent the convergence to local minima by averaging the current beliefs based on the past information, where η ∈ [0, 1] is a damping factor.Unlike [15], the introduction is changed to apply the damping process only once for each belief, to be consistent with the MP rule.
In addition, as the most vital algorithmic structural difference between BiGaBP and conventional BiGAMP, lines 9, 10, 13, and 14 corresponding to the self-feedback removal mechanism via extrinsic belief generation are highlighted with gray boxes in Alg. 1.

IV. DERIVATION OF BIGAMP FROM BIGABP
In this section, we prove that the BiGaBP algorithm presented in Section III can be rigorously approximated under the large-system limit assumption, i.e., N, M, K → ∞ for a given compression ratio ρ 1 ≜ N/M, ρ 2 ≜ K/M , to derive the BiGAMP algorithm proposed in [16].We then compare these algorithms to the original BiGAMP algorithm proposed in [15] and elaborate on the differences in the algorithmic structure.It is worth noting here that in both references [15] and [16], the BiGAMP algorithm is derived by approximating the linear inference via SPA in the large-system limit; however, the derivation process from the BiGaBP algorithm, i.e., the bilinear inference algorithm designed based on the GaBP framework [24], has not yet been reported explicitly, to the best of our knowledge.
Since the process of data detection and channel estimation is the same except for the soft replica generation, the approximation process is described mainly using the data detection part of the BiGaBP algorithm.To make the process across iterations easier to understand, (t) is used in this section.The derivation process of BiGAMP from BiGaBP consists of three processes: a) approximation of the second-order moments, b) approximation of the first-order moments, and c) closing the loop of the algorithm.

A. Approximation of Second-Order Moments
Let us consider the process at the t-th iteration step.First, the variance ξ x m,nk (t) in ( 10) can be rewritten as (26), shown at the bottom of the page, where the third term in (26) converges almost surely to zero as N → ∞ in the asymptotic conditions.Similar to (26), the variance ψ r n,mk (t) in ( 14) can be rewritten as From the above, the variance, i.e. the second-order moment, can be approximated with the accuracy of O(1), removing the dependence of the index n or m, respectively.

B. Approximation of First-Order Moments
Next, using ( 26) and ( 27), the first-order moments is approximated with the accuracy of O(1).The Soft IC output ỹm,nk (t) in ( 8) can be rewritten as with the soft replica of interference components, Substituting ( 28) into ( 14), the expected value rn,mk (t) can be rewritten as rn,mk (t) = ψ r mk (t) with where the second term in (30) is the component that is removed in the extrinsic belief combining in the BiGaBP algorithm.
Let us see how this self-feedback component is handled in the BiGAMP algorithm.Consider a Taylor series expansion of Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

Algorithm 1 JCDE Algorithm via BiGaBP With Belief Damping and Scaling
Input: Y , X p , Ĥ, ψh , T Output: the denoiser function in (18a) about the point rmk (t), the soft replica used in the next iteration xn,mk (t+1) can be expressed as xn,mk (t + 1) . (32) When using the Bayes-optimal denoiser function as f x (•), i.e., the denoiser function in (17a), ( 32) can be rewritten as where xmk (t + 1) ≜ f x (r mk (t), ψ r mk (t)) and ψ x mk (t + 1) ≜ V x (r mk (t), ψ r mk (t)).The second term in (33) denotes the self-feedback component propagating to the next iteration step through the denoiser function.Note that (32) can be rewritten in a simpler form as in (33) if and only if the denoiser function is designed according to the Bayes optimal criterion so that the identity between the expectation and variance holds [16]; otherwise, the Wirtinger derivatives in equation ( 32) must be calculated, as in [46].In a similar manner, ĥk,nm can be rewritten as ĥk,nm (t + 1) = ĥnm (t + 1) − ψ h nm (t + 1) x * n,mk (t)ỹ nk (t) where h nm (t + 1) ≜ f h (q nm (t), ψ q nm (t)) and ψ h nm (t + 1) ≜ V h (q nm (t), ψ q nm (t)).
Substituting ( 35) and ( 36) into ( 26), ξ x nk (t) can be rewritten as Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
Similarly, ψ r mk (t) in ( 27) can be rewritten as Substituting ( 35) and ( 36) into (29), pnk (t) can be rewritten as (39), shown at the bottom of the next page, where the third term in (39) is the Onsager correction term, which is found to be responsible for predicting and canceling the self-feedback component in the large-system limit assumption, from the derivation process so far.
Similarly, substituting ( 35) and ( 36) into (31), rmk (t) can be rewritten as where the fourth term in (40) is the Onsager correction term for the VNs.The above discussion is equally applicable for the channel estimation part.Under the large-system limit assumption, the derivation of the JCDE algorithm via BiGAMP proposed in [16] is completed by asymptotically setting the remainder terms to zero.The pseudo-code is given in Alg. 2, where belief damping is introduced in lines 14-15 and 19-20 as in [16].

D. Differences in Algorithmic Structure From BiGAMP [15]
Finally, we clarify the differences between the JCDE algorithm via the original BiGAMP proposed in [15] and JCDE algorithm via BiGAMP [16] derived above, in terms of algorithmic structure.Due to space limitations, the derivation of the original BiGAMP algorithm [15] is omitted and offered only in a summarized form in the pseudo-code of Alg. 3. Focusing on the points that differ from Alg. 2, first, line 8 in Alg. 3 is the Onsager correction term, which is obtained by approximating (39) as |x mk (t)| 2 ≈ xmk (t)x * mk (t − 1) and | ĥnm (t)| 2 ≈ ĥnm (t) ĥ * nm (t − 1).Next, in lines 14 and 19, the output of the ramp function is used to calculate the equivalent gains of the soft replicas in the VNs, which is derived as a result of approximating the expectation and variance of the beliefs with different precision.Therefore, if the large-system limit assumptions are not tightly satisfied, these mathematical inconsistencies cause stochastic misalignments between the belief expectation and variance, leading to unstable convergence behavior of iterative estimation.Finally, Algorithm 2 JCDE Algorithm via BiGAMP [16] Input: Y , X p , Ĥ, ψh , T Output: x * mi (t)ŝ ni (t) 18: ∀(m, n) : qnm (t) = q nm (t)−ψ q nm (t) ĥnm (t−1) the biggest difference in algorithmic structure is the absence of the Onsager correction term in the VNs in Alg. 3 to make the algorithm structure in the VNs closer to the linear inference algorithm via GAMP [18].However, as described above, the Onsager correction term is intended to cancel the self-feedback, and the original BiGAMP algorithm without this mechanism cannot completely decouple the self-feedback of beliefs across iterations between the VNs.This is the main reason why the BiGAMP algorithm proposed in [16] Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.

V. PERFORMANCE ASSESSMENT
Computer simulations were conducted to demonstrate the performances of the proposed JCDE algorithm via BiGaBP for uplink MUD in massive MU-MIMO systems.In all subsequent simulations, the average RX power from each TX antenna was assumed to be identical on the basis of slow TX power control as mentioned in Section II-A, and the time and frequency synchronization was assumed to be perfect.The pilot sequences, i.e., rows of X p , are given by Zadoff-Chu sequences [47], [48].In cases when K p ≥ M , orthogonal pilot sequences are used, while in cases when K p < M , the codebook of pilot sequences is given by the rows of a matrix constructed from an M × K p sub-matrix of an M × M orthogonal pilot matrix consisting of Zadoff-Chu sequences.Gray-coded QPSK-modulated signals were used for X d , and the channel code is not used.As for algorithmic parameters, the damping factor η was set to 0.5 in all the JCDE algorithms, the predetermined parameters in ASB 4 were set to (τ 0 , τ 1 ) = (1,4) in γ(t) and τ 2 = 0.4 in β(t), respectively, and the maximum number of iterations was set constant to T = 32, although other criteria based on convergence could also be employed.A stop criterion was introduced for each JCDE algorithm to avoid divergence of estimates caused by unstable convergence behavior, and the predetermined parameter for the stop criterion was set to τ BiGaBP = τ BiGAMP = 5 in all the JCDE algorithms.In addition, we define the compression ratio as the ratio between the pilot length K p and the number of UE devices M ; thus, κ ≜ K p /M .

A. BER Performance
Our first set of results is given in Fig. 3, where the performances in terms of BER as a function of the SNR, of the massive MU-MIMO systems with short non-orthogonal pilots in uncorrelated Rayleigh fading channels, i.e., h n,m ∼ CN (h n,m ; 0, 1/N ), ∀(n, m), are compared: • MMSE: Baseline two-stage receiver consisting of spatial filtering-based channel estimation presented in Section II-B and linear minimum mean square error (MMSE)-based data detection.• GaBP: Two-stage receiver consisting of spatial filtering-based channel estimation presented in Section II-B and GaBP-based data detection [24], [25].• BiGAMP (Alg.3):JCDE receiver based on BiGAMP [15] presented in Alg. 3. 4 Several simulations were conducted to find the sub-optimal parameters for minimizing the BER at SNR = 10 dB in the setting of Fig. 3 shown on the next page.Learning optimization of these scaling parameters as trainable parameters using DU techniques [38] remains as future work.For further details, we refer the readers to [49].
Onsager correction term Fig. 3. BER performances of MU-MIMO systems: ratio of pilot length κ fixed at 0.625.
• BiGAMP (Alg.2):JCDE receiver based on modified BiGAMP [16] presented in Alg. 2. • BiGaBP (Alg.1):Proposed JCDE receiver based on BiGaBP presented in Alg. 1. • GaBP w/ perfect CSI: Genie-aided scheme in which the perfect CSI is known at the receiver.The results in Fig. 3 show the BER performance in massive MU-MIMO systems with ρ 1 ≜ N/M = 2, κ ≜ K p /M = 0.625, and K d = 256.The two-stage MUD schemes, "MMSE" and "GaBP," fail to detect MIMO signals reliably (BER > 10 −1 ) due to fatal errors in channel estimation using MNS under severe underdetermined conditions of κ = 0.625.On the one hand, "BiGAMP (Alg.3)" can improve estimation accuracy significantly by taking advantage of the quasi-orthogonality of data structure.However, its performance deviates significantly from "GaBP w/ perfect CSI" because the mismatches between the finite MIMO system and large-system limit assumptions break the mathematical consistency in the MP rule.On the other hand, "BiGAMP (Alg.2)" can further improve the performance owing to the MP rule that enables suppression of the harmful effect of the self-feedback with higher precision.However, there remains a non-negligible degradation from the Genie-aided performance, and the high-level error floor is inevitable due to poor initial estimation accuracy caused by the non-orthogonal pilots, especially in Fig. 3a, where the system size is relatively small.In contrast, our proposed "BiGaBP (Alg.1)" significantly outperforms both "BiGAMP (Alg.2)" and "BiGAMP (Alg.3)" at the operating SNR with the same computational complexity and approaches the Genie-aided performance without suffering from error floors, owing to the ASB mechanism.Remarkably, the degradation at BER = 10 −4 is less than 1.0 dB for both configurations.
As shown in Section IV, BiGAMP is derived as a rigorous approximation of BiGaBP in the large-system limit; therefore, the performance difference between these two methods decreases as the system size increases.In fact, as can be seen from the comparison in Figs.3a and 3b, the gain of the proposed method compared to "BiGAMP (Alg.2)" becomes smaller, especially in the low SNR region.However, this apparently small difference is due to burst errors in symbol blocks, which have a significant impact on transmission performance in the actual systems.In addition, as will be discussed later in Section V-D, when spatial correlation exists among the fading coefficients, the improvement by our proposal is significant, even for massive configurations.
• MMSE limit: Genie-aided scheme in which the perfect knowledge of X d is provided as prior information at each iteration of the JCDE algorithm [27], [50].Provides an absolute lower bound in terms of the NMSE performance.Fig. 4 shows the NMSE performance, where the system parameters are the same as in Fig. 3.As expected from the results in Fig. 3, serious estimation errors occur in "MSE," indicating that increasing the SNR does not improve the NMSE performance significantly due to prohibitive inter-pilot interference with κ = 0.625.Although "BiGAMP (Alg.3)" can improve the performance by relaxing the underdetermined conditions by the JCDE mechanism, its operation becomes unstable when the system size is small, and its operation at low SNR induces divergence behavior of iterative estimation, making it difficult to achieve highly accurate channel estimation.Note that the concave shape of the MSE curve in the low SNR region in Fig. 4b is due to the stopping criterion for iterative divergence behavior.Next, "BiGAMP (Alg.2)" can alleviate this inconvenience, but still deviates from the lower bound.In contrast, "BiGaBP (Alg.1)" operates robustly even in the low SNR region and approaches the Genie-aided performance, "MMSE-limit," in the high SNR region for both configurations.

C. Robustness to Changes in Symbol Length
Let us shift our focus to the robustness of the proposed method to changes in frame configurations.Fig. 5 shows the BER performance as a function of κ (≜ K p /M ), where the Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.other system parameters are the same as in Fig. 3.The SNR is fixed at 10 dB and 9 dB in Figs.5a and 5b, respectively.For both configurations, it is clearly confirmed that the proposed method can work at lower κ than the BiGAMP-based counterparts.More specifically, "BiGaBP (Alg.1)" approaches the Genie-aided performance, "GaBP w/ perfect CSI," up to about κ = 0.7 in Fig. 5a and κ = 0.6 in Fig. 5b, indicating that the improvement from the conventional methods becomes significant, especially when the system size is small.Fig. 6 shows the BER performance as a function of K d .Intuitively, it would seem that the longer K d , the better the orthogonality of the pilot-plus-data sequence and the better the estimation performance.In fact, when using orthogonal pilots, the channel estimation performance improves due to the improved diversity gain over the discrete-time dimension as K d increases.However, since we use non-orthogonal pilots in Fig. 6, the initial channel estimation accuracy is poor; hence, the convergence behavior of iterative estimation becomes unstable due to error propagation if K d is too large.As a result, it is not necessarily advantageous to increase K d for the BiGAMP-based methods that average beliefs in the large system limit, and the performance degrades when K d is set very long, e.g., 512."BiGaBP (Alg.1)" can alleviate the above problem by the MP rule that does not rely on belief averaging using the large-system limit approximation and the ASB mechanism for stabilizing convergence.Consequently, as K d increases, the performance improves and asymptotically approaches the Genie-aided reference.

D. Robustness to Correlated Massive MIMO Channels
In practice, employing a large number of antennas at a BS leads to spatial correlation among fading coefficients; hence, it is vital to confirm that our method works well even in correlated channels.In this article, we use two channel models: the geometric one-ring model [27], [51], [52] and the finite path model [53], [54], [55], which are commonly used to represent spatial correlation among fading coefficients in massive MIMO scenarios.
1) Geometric One-Ring Model: In practice, the wireless channels between the BS and the UE exhibit a small angular spread from the perspective of the BS, as a result of local scatterers around the UE and the high placement of the BS antennas [33], [51], [52], [56].Assuming diffuse 2 × D field of isotropic scatters around UEs, the (i, j) element of the RX spatial correlation matrix for the m-th UE, Θ m ≜ which denotes the correlation coefficient between the i-th and j-th RX antenna elements.Here, waves arrive from the m-th UE with an angular spread ∆θ m ≜ θ max m − θ min m , and  the antenna element spacing is fixed to half the wavelength.The m-th column vector of H is computed by h m = Θ 1/2 m ν m , ν m ∼ CN (ν m ; 0, ϕI N ).Fig. 7a shows the BER performance of massive MU-MIMO systems in correlated channels following the one-ring model, where (N, M ) = (64, 16) and the other system parameters were the same as in Fig. 3.A sector antenna of 120 degrees opening was considered.The angular spread for each UE was set to 30 degrees.The UEs were naturally partitioned into 8 segments with M/8 UEs randomly dropped in each segment.
Although the ratio of the number of spatially multiplexed streams to the number of receive antennas is ρ 1 ≜ N/M = 4, a configuration that is relatively easy to obtain diversity gain, the BiGAMP-based JCDE algorithms are not able to provide highly accurate estimation due to the high level of error floor.This is because the correlation among the fading coefficients greatly degrades the accuracy of the large-system limit approximation, making it difficult to decouple the self-feedback across iterations due to the Onsager correction term.In contrast, the BiGaBP-based JCDE algorithm, which relies only on SGA based on mild CLT, is relatively robust against correlation among fading coefficients, and significant performance improvement can be achieved by adjusting the convergence speed with the assistance of the proposed ASB.Specifically, our method can achieve BER = 10 −4 , and the degradation from the lower-bound reference is suppressed to within 1.0 dB at BER = 10 −4 .
2) Finite Path Model: Since the geometric one-ring model assumes infinite scatterers around every UE as found in (41), the Gaussianity is relatively high; however, such a rich scattering environment cannot always occur in practice.For example, in millimeter-wave (mmWave) wireless communications, where diffraction and scattering rarely occur, the number of paths arriving at the receiver is limited [57], [58].To represent such wireless channels, the finite path model is often utilized.Assuming that L propagation paths arrive at the BS from each UE, the m-th column vector of H can be expressed as [53], [54], and [55] where g l,m ∼ CN (g l,m ; 0, ϕ) is the channel gain along the l-th path of the m-th UE.This is obtained from the steering vector s (Ω l,m ) ≜ [1, exp [jπΩ l,m ] , . . ., exp [jπ (N − 1) Ω l,m ]] T .
where Ω l,m ≜ cos θ l,m with θ l,m denoting the azimuth angle of the l-th propagation path of the m-th UE.The antenna element space is fixed to half the wavelength.Fig. 7b shows the BER performance of massive MU-MIMO systems in correlated channels following the finite path model.A sector antenna with a 120 degrees opening was considered, and the UE devices were randomly dropped in the above angular region around the BS.The number of paths was set to L = 8.
As can be inferred from the results of Fig. 7a, the BiGAMP-based JCDE algorithms fail to provide highly accurate estimation and suffer from a high level of error floor with BER > 10 −2 .The lower the Gaussianity of the massive Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.MIMO channel, the greater the discrepancy between the Onsager correction term and the actual self-feedback value, which causes error propagation and significantly degrades the iterative convergence property.In contrast, the proposed method can significantly reduce the error floor level, achieving BER = 10 −4 .The deviation from the lower-bound reference is also suppressed within 2.0 dB at BER = 10 −3 , confirming that the self-feedback cancellation by exchanging extrinsic values and adjusting the iterative convergence speed by ASB are effective for highly accurate estimation in such deterministic wireless channels.

E. Complexity Analysis
First, the computational complexity of each JCDE algorithm was evaluated in terms of the number of real multiplication operations required to detect data symbols and estimate channel coefficients.To evaluate the approximate number of real multiplication operations, we adopt the following basic assumptions presented in [59].
Fig. 8a shows the number of real multiplication operations as a function of the number of UE devices M , and the compression ratio is fixed to κ ≜ K p /M = 0.625.The dominant factors determining the computational complexity in the conventional BiGAMP-based methods are the process to compute pnk (t), qnm (t), and rmk (t) in Alg. 2 (and Alg.3), whose complexity is of order O(M N K) per iteration, which is similar to that of the proposed BiGaBP-based method.This fact is confirmed by the results in Fig. 8a, which shows that the proposed method can significantly improve estimation accuracy without largely increasing the amount of computation.More specifically, at M = 32, "BiGaBP (Alg.1)" can operate at about twice the computational cost of "BiGAMP (Alg.3)" and about 1.5 times that of "BiGAMP (Alg.2)," respectively.
For a more practical evaluation, Fig. 8b shows the average execution time 5 for each JCDE algorithm to detect data symbols and estimate channel coefficients.When the programs were actually executed and their execution times were compared, as shown in Fig. 8b, the results show that the relative relationship between the proposed and conventional methods is a similar trend to that shown in Fig. 8a in terms of the number of multiplication operations.
The above results show that BiGAMP and BiGaBP have a similar complexity order, but in terms of more practical processing cost, BiGAMP, which can reduce the number of beliefs propagated by the large-system limit approximation, can achieve lower computational cost.However, considering that BiGAMP are extremely vulnerable to insufficient system size and channel correlations, and that BiGaBP can significantly improve performance and achieve performance close to the lower-bound reference in many cases, the proposed method can be seen to achieve an excellent trade-off between estimation capability and computational cost.

VI. CONCLUSION
In this paper, we proposed a novel JCDE scheme via BiGaBP for uplink massive MU-MIMO systems with short non-orthogonal pilots.The proposed BiGaBP framework operates based on the SGA under CLT, which is a milder assumption than the large-system limit condition the BiGAMP algorithm relies on.In addition, the ASB mechanism is extended to BIPs, realizing stable convergent behavior of JCDE when using non-orthogonal pilots.It is also shown that BiGAMP is derived by approximating BiGaBP under the large-system limit assumption, and the relationship between the original BiGAMP framework and the proposed scheme is clarified.The numerical results show that our proposed method outperforms the state-of-the-art scheme and approaches the performance of the idealized scheme for a variety of system parameters.

B
i̸ =b a i means the operation of adding the elements ofa i corresponding to i ∈ {1, • • • , b − 1, b + 1, • • • , B}.I a represents an a × a square identity matrix.tr [•] denotes the trace of the matrix and |•| F denotes the Frobenius norm.[A] ab is the (a, b) element in the matrix A. a f (a) is an indefinite integral of f (a).

Fig. 1 .
Fig. 1.Schematic of the belief propagation process in the proposed BiGaBP algorithm.

Fig. 5 .
Fig. 5. BER performances of MU-MIMO systems as a function of the pilot length.

Fig. 6 .
Fig. 6.BER performances of MU-MIMO systems as a function of the data length.