Low-Complexity Detectors for Uplink Massive MIMO Systems Leveraging Truncated Polynomial Expansion

In this work, we propose low-complexity detectors for massive multiple-input multiple-output (MIMO) systems. Particularly, we leverage variants of truncated polynomial expansion (TPE) in order to reduce the computational complexity of the signal detection in the uplink direction. Linear detectors such as zero-forcing (ZF) and minimum mean square error (MMSE) involve expensive matrix-matrix multiplication and matrix inversion operations. TPE-based detectors are appropriate candidates for approximating these linear detectors. However, tuning the normalization factor of TPE-based detectors may require calculating the minimum and the maximum eigenvalues of the channel Gram matrix. These calculations become computationally expensive for some massive MIMO systems, especially for systems with a large ratio of single-antenna user terminals to the number of antennas at the base station, i.e., loading factor. We propose to tune the normalization factor using appropriate approximations for the extreme eigenvalues. The proposed TPE-based detectors exhibit a bit error performance similar to that of the TPE-based detector with the optimal normalization factor. Moreover, our proposed detectors achieve the error performance of ZF and MMSE for different loading factors of spatially correlated and uncorrelated massive MIMO channels. The computational complexity of the proposed detector is proportional to the number of base station antennas and the number of users.


I. INTRODUCTION
Multiple-input multiple-output (MIMO) is an essential technology in modern wireless systems, as it offers significant capacity and link reliability improvements [1]. Despite these benefits, conventional MIMO technology does not meet the ever-increasing demand for connectivity driven by the unprecedented introduction and adoption of new applications such as vehicle-to-everything, augmented and virtual reality, and connected autonomous systems. Massive MIMO technology has been investigated and considered by academic, standardization, and industrial bodies to meet the requirements of these applications, such as coverage, capacity, and user throughput. Massive MIMO technology tightly packs a large number of antennas at network terminals, e.g., base stations (BSs), which simultaneously serve several low-end single-antenna user terminals (UTs) [2], [3], [4], [5]. It has been shown in [3] that the matched filter (MF) detector can achieve similar performance to the zero-forcing (ZF) or minimum mean square error (MMSE) detector in the uplink of massive MIMO systems. However, it has been shown in [6], [7] that the number of BS antennas per user at which the MF detector achieves the performance of the ZF or MMSE detector is impractically large [8]. Therefore, for practical massive MIMO configurations, the ZF or MMSE detector must be used to exploit the full potential of massive MIMO systems. One main benefit of massive MIMO is that linear detectors, such as ZF and MMSE, achieve performance similar to that of sophisticated optimal detectors.
The associate editor coordinating the review of this manuscript and approving it for publication was Olutayo O. Oyerinde.
With extensively large numbers of antennas at the base stations, the implementation of these linear detectors becomes computationally expensive. The exact solutions of ZF and MMSE involve a matrix-matrix multiplication to compute the channel Gram matrix and a matrix inversion. These two operations require a computational complexity of $O(NK^2)$ and $O(K^3)$, respectively, where $N$ is the number of receive antennas at the BS, and $K$ is the number of UTs. In [9], an algorithm based on the alternating direction method of multipliers (ADMM) has been proposed, which also involves a matrix-matrix multiplication and a matrix inversion. Such operations become computationally expensive for massive MIMO where $N$ and $K$ are large. Several techniques have been proposed in the literature to reduce the computational complexity of these two operations. With the help of a truncated Neumann series expansion, approximate algorithms avoid direct matrix inversion [10], [11], [12], [13]. Replacing the Neumann series approximation with other iterative algorithms such as the Gauss-Seidel (GS) [14], successive over-relaxation (SOR) [15], Jacobi [16], and Richardson [17] methods can result in further improvements. These methods reduce the complexity of the matrix inversion from $O(K^3)$ to $O(K^2)$. However, they still have $O(NK^2)$ computational complexity due to the Gram matrix calculation [18].
Another class of approximate methods uses the truncated polynomial expansion (TPE), in which the matrix inversion is approximated by the first $J$ terms of the Taylor series expansion [8], [19], [20], [21], [22]. Such methods can approximate the ZF and the MMSE solutions by a weighted summation of a series of matrix-vector multiplications implemented iteratively. Assuming that appropriate coefficients of the summation are given, the computational complexity of the matrix-vector multiplications is proportional to the number of BS antennas and the number of users, i.e., $O(JKN)$. However, the convergence speed of the TPE for approximating the ZF and the MMSE solutions highly depends on a normalization factor in the polynomial coefficients. The optimal normalization factor, in the sense of convergence speed, requires the calculation of the largest and the smallest eigenvalues of the channel Gram matrix [23]. This implies calculating the Gram matrix and its eigenvalues, increasing the overall computational complexity to $O(NK^2 + K^3)$.

A. MAIN CONTRIBUTION
In this work, we propose TPE-based detectors where the normalization factor is tuned for massive MIMO systems with various loading factors $\beta = K/N$. The contributions of this work are summarized as follows:
• For massive MIMO systems with small $\beta$'s, we develop a TPE-based detector where we approximate the eigenvalues of the Gram matrix of the massive MIMO channel using the asymptotic properties of complex Wishart matrices. This aims at efficiently tuning the normalization factor of the proposed TPE detector so as to enhance its convergence performance. Such a normalization factor is calculated using only the dimensions of the system.
• For massive MIMO systems with large β's, we develop another TPE-based detector where we approximate the extreme eigenvalues of the Gram matrix of the massive MIMO channel by devising an efficient algorithm based on the power method. We utilize the derived approximate extreme eigenvalues in tuning the normalization factor of the proposed detector.
• We provide a comprehensive computational complexity analysis for the proposed detectors. We also compare the computational complexity of the proposed detectors with the relevant prior art. The proposed detectors are shown to have a significant computational complexity reduction compared to prior works, such that the overall computational complexity is $O(JNK)$.
• We provide extensive numerical simulation results for the proposed detectors over both spatially correlated and uncorrelated channels. The proposed detectors achieve the error performance of the linear detectors, ZF and MMSE. Further, we provide conclusive convergence simulation results where the convergence rate of the proposed detectors is shown to be comparable to that of the TPE-based detectors with the optimal normalization factor.

B. COMPARISON WITH PRIOR ART
In the literature, there exist detection schemes with a computational complexity of $O(mNK)$, where $m$ is the number of iterations of the schemes. Approximate message passing (AMP)-based detectors [25], [26], [27], [28] offer a computational complexity of $O(mNK)$ by deploying matrix-vector multiplications rather than the matrix-matrix multiplication. However, AMP-based detectors require the knowledge of noise variance, and consequently, inappropriate noise variance adjustments can lead to performance degradation. Moreover, they suffer from severe performance degradation for massive MIMO systems with large loading factors. Also, they may not converge even with many iterations for spatially correlated MIMO channels. In contrast to AMP-based detectors, our proposed TPE-based detector ensures convergence for both small and large loading factors of massive MIMO systems for both scenarios of spatially uncorrelated and correlated massive MIMO channels. Optimized coordinate descent (OCD)-based detectors [29] offer a fast converging performance with a computational complexity of $O(mNK)$, where $m$ is the number of iterations. These detectors perform a series of coordinate-wise updates in order to solve the main optimization problem. Despite the fast converging performance of OCD-based detectors, their main challenge resides in the data dependency of successive updates. In these detectors, the update of each coordinate corresponding to a user depends on the previous coordinates' updates. The work in [30] also suffers from the dependency of successive updates. Such a dependency prevents fully-parallel implementation [29]. Also, it increases each user's processing delay to a number proportional to the number of users and iterations, i.e., $mK$. Pipeline interleaving addresses this challenge by simultaneously processing multiple coordinates. However, this approach results in a significant hardware overhead [29].
VOLUME 10, 2022
On the other hand, although our proposed TPE-based detector may need a larger number of iterations to achieve the error performance of ZF or MMSE, the detection of all users is simultaneously calculated in each iteration. Consequently, the processing delay is independent of the number of users and only scales with the number of iterations. This property enables fully-parallel implementation of our proposed TPE-based detector with architecture pipelining.
The conjugate gradient (CG) is an efficient iterative algorithm for solving the problem of signal detection for the uplink of massive MIMO systems. One key advantage of the CG-based detector is that it converges in $K$ iterations. It can even be terminated with fewer iterations while being sufficiently close to the exact solution [31]. However, one disadvantage of the CG-based detector is that it does not provide the post-processing signal-to-interference-plus-noise ratio (SINR) information required for the calculations of log-likelihood ratio (LLR) values for the soft-output version of the detector. In [31], a CG-based soft-output detection scheme is proposed for massive MIMO systems. Some works in the literature, for example [32], consider the Gram matrix (or its regularized version) as an input to the CG-based detection algorithm. Such an implementation requires a matrix-matrix multiplication to calculate the Gram matrix. However, one can easily verify that the corresponding matrix-matrix multiplication can be replaced with matrix-vector multiplications. Therefore, as Table 1 shows, the overall computational complexity of the CG-based detector is $O(mKN)$.
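The replacement of the Gram-matrix product by matrix-vector products can be illustrated with a minimal NumPy sketch. This is not the implementation from [31] or [32]; the function name `cg_detect`, the regularization value `mu`, and the test dimensions are assumptions chosen for illustration only.

```python
import numpy as np

def cg_detect(H, y, mu, iters):
    """Solve (H^H H + mu*I) x = H^H y by conjugate gradient without forming H^H H."""
    b = H.conj().T @ y                      # matched-filter output, O(NK)
    x = np.zeros_like(b)
    r = b.copy()                            # residual for the initial guess x = 0
    p = r.copy()                            # search direction
    rs = np.vdot(r, r).real
    tol = 1e-24 * rs                        # stop once the residual is negligible
    for _ in range(iters):
        Ap = H.conj().T @ (H @ p) + mu * p  # two matrix-vector products, O(NK)
        step = rs / np.vdot(p, Ap).real
        x = x + step * p
        r = r - step * Ap
        rs_new = np.vdot(r, r).real
        if rs_new <= tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

rng = np.random.default_rng(1)
N, K, mu = 64, 8, 0.1                       # illustrative sizes, not from the paper
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)
y = rng.standard_normal(N) + 1j * rng.standard_normal(N)

x_cg = cg_detect(H, y, mu, iters=K)
# Reference solution that does form the Gram matrix, for comparison only.
x_ref = np.linalg.solve(H.conj().T @ H + mu * np.eye(K), H.conj().T @ y)
```

Each iteration touches `H` only through `H @ p` and `H.conj().T @ (...)`, so the per-iteration cost scales with $NK$, in line with the $O(mKN)$ figure quoted above.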

C. ORGANIZATION AND NOTATION
The rest of the paper is organized as follows: we introduce the system model in Section II by briefly explaining the details of TPE-based detectors. Our efficient TPE-based detectors are detailed in Section III. We discuss the computational complexity analysis of our proposed methods in Section IV. This section also compares our proposed detectors with the state-of-the-art low-complexity detectors for massive MIMO systems. Section V contains simulation results for various massive MIMO systems. Section VI concludes this paper.
Matrices and vectors are denoted by bold-face capital and small letters, respectively. For a given matrix $\mathbf{A}$: $\mathbf{A}^H$, $\mathbf{A}^T$, and $\mathbf{A}^{-1}$ denote the Hermitian transpose, transpose, and inverse of $\mathbf{A}$, respectively. The columns of $\mathbf{A}$ are denoted by $\{\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_K\}$. Moreover, $a_{i,j}$ indicates the $(i, j)$-th element of $\mathbf{A}$. The $K \times K$ identity matrix is denoted as $\mathbf{I}_K$.

II. SYSTEM MODEL
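We consider an uplink multiuser massive MIMO system with $N$ receive antennas at the BS and $K$ single-antenna UTs. The received signal is written as
$$\mathbf{y} = \sum_{k=1}^{K} \sqrt{p_k}\,\mathbf{h}_k x_k + \mathbf{n}, \tag{1}$$
where $\mathbf{x} = [x_1, x_2, \ldots, x_K]^T$ is the transmitted signal vector, whose entries are from a given constellation $\mathcal{X}$ with an average energy of $E_x$. The entries of the channel matrix $\mathbf{H} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_K] \in \mathbb{C}^{N \times K}$ are assumed to be independent and identically distributed (i.i.d.) $\sim \mathcal{CN}(0, 1/N)$. Moreover, $\mathbf{n} \in \mathbb{C}^{N \times 1}$ is the noise vector at the BS with i.i.d. entries $\sim \mathcal{CN}(0, N_0)$. The above model can be rewritten as
$$\mathbf{y} = \mathbf{H}\mathbf{P}^{1/2}\mathbf{x} + \mathbf{n}, \tag{2}$$
such that $E\{|x_k|^2\} = E_x$, and $\mathbf{P}$ is a diagonal matrix with non-zero entries of $(p_1, p_2, \ldots, p_K)$, the transmit powers of the users. Among the solutions for multiuser uplink massive MIMO detection are the ZF detector
$$\hat{\mathbf{x}}_{\mathrm{ZF}} = \mathbf{P}^{-1/2}\left(\mathbf{H}^H\mathbf{H}\right)^{-1}\mathbf{H}^H\mathbf{y} \tag{3}$$
and the MMSE detector
$$\hat{\mathbf{x}}_{\mathrm{MMSE}} = \left(\mathbf{P}^{1/2}\mathbf{H}^H\mathbf{H}\mathbf{P}^{1/2} + \mu\mathbf{I}_K\right)^{-1}\mathbf{P}^{1/2}\mathbf{H}^H\mathbf{y}, \tag{4}$$
where $\mu = N_0/E_x$ is the regularization factor of the MMSE detector.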
For the sake of simplicity, we assume that all users have the same transmit power, i.e., $\mathbf{P} = \mathbf{I}_K$. Such an assumption will not affect the overall mechanism of the proposed methods in Section III. The effect on the computational complexity of the proposed methods will be discussed in Section IV. With this assumption, the received signal is simplified to
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n}. \tag{5}$$
Note that even if the users transmit with different powers, the power control mechanism will compensate for the large-scale fading. Consequently, the signals arrive at the BS with equal powers, which justifies the model presented in (5). For this model, we have
$$\hat{\mathbf{x}}_{\mathrm{ZF}} = \left(\mathbf{H}^H\mathbf{H}\right)^{-1}\mathbf{H}^H\mathbf{y}, \tag{6}$$
$$\hat{\mathbf{x}}_{\mathrm{MMSE}} = \left(\mathbf{H}^H\mathbf{H} + \mu\mathbf{I}_K\right)^{-1}\mathbf{H}^H\mathbf{y}. \tag{7}$$
Both detectors require a matrix inversion with $O(K^3)$ operations and a matrix-matrix multiplication with $O(NK^2)$ operations.
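As a point of reference for the approximations that follow, the exact equal-power ZF and MMSE detectors can be sketched in NumPy as below. The helper names (`make_channel`, `zf_detect`, `mmse_detect`), the QPSK constellation, and the dimensions are illustrative assumptions, not part of the original work.

```python
import numpy as np

def make_channel(N, K, rng):
    """Draw an N x K channel matrix with i.i.d. CN(0, 1/N) entries."""
    scale = np.sqrt(1.0 / (2.0 * N))
    return scale * (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K)))

def zf_detect(H, y):
    """Exact ZF: solve (H^H H) x = H^H y."""
    G = H.conj().T @ H                      # Gram matrix, O(N K^2)
    return np.linalg.solve(G, H.conj().T @ y)

def mmse_detect(H, y, mu):
    """Exact MMSE: solve (H^H H + mu*I) x = H^H y, with mu = N0 / E_x."""
    K = H.shape[1]
    G = H.conj().T @ H
    return np.linalg.solve(G + mu * np.eye(K), H.conj().T @ y)

rng = np.random.default_rng(0)
N, K = 64, 8                                # illustrative sizes
H = make_channel(N, K, rng)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)  # unit energy, E_x = 1
x = rng.choice(qpsk, size=K)
y = H @ x                                   # noiseless received signal for a clean check

x_zf = zf_detect(H, y)                      # recovers x exactly without noise
x_mmse = mmse_detect(H, y, mu=0.01)         # slightly shrunk towards zero by the regularization
```

Both functions form the Gram matrix explicitly, which is exactly the $O(NK^2)$ step that the TPE-based detectors are designed to avoid.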
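An alternative approach to the matrix inverse calculation is the TPE since the inverse of the Gram matrix can be expressed as a matrix polynomial [21], where the low-order terms are the most dominant ones.
Lemma 1 ([21]): For any positive definite Hermitian matrix $\mathbf{X}$,
$$\mathbf{X}^{-1} = \alpha\left(\mathbf{I} - (\mathbf{I} - \alpha\mathbf{X})\right)^{-1} = \alpha\sum_{l=0}^{\infty}(\mathbf{I} - \alpha\mathbf{X})^{l},$$
where the second equality holds when $0 < \alpha < \frac{2}{\lambda_{\max}(\mathbf{X})}$ such that $\lambda_{\max}(\mathbf{X})$ is the largest eigenvalue of $\mathbf{X}$. The parameter $\alpha$ is referred to as the normalization factor.
By using this lemma and $\mathbf{X} = \mathbf{H}^H\mathbf{H}$, the ZF can be approximated as
$$\hat{\mathbf{x}}_{\mathrm{ZF}} \approx \alpha\sum_{l=0}^{J-1}\left(\mathbf{I}_K - \alpha\mathbf{H}^H\mathbf{H}\right)^{l}\mathbf{H}^H\mathbf{y}.$$
Similarly, using $\mathbf{X} = \mathbf{H}^H\mathbf{H} + \mu\mathbf{I}_K$, the MMSE detector can be expanded as
$$\hat{\mathbf{x}}_{\mathrm{MMSE}} \approx \alpha\sum_{l=0}^{J-1}\left(\mathbf{I}_K - \alpha\left(\mathbf{H}^H\mathbf{H} + \mu\mathbf{I}_K\right)\right)^{l}\mathbf{H}^H\mathbf{y}.$$
We denote $\mathbf{W}_{\mathrm{TPE}}$ as the TPE detector such that
$$\mathbf{W}_{\mathrm{TPE}} = \sum_{l=0}^{J-1} w_l\left(\mathbf{H}^H\mathbf{H}\right)^{l}\mathbf{H}^H,$$
where, by the binomial expansion of the sums above,
$$w_l = \alpha\sum_{n=l}^{J-1}\binom{n}{l}(-\alpha)^{l}$$
for ZF and
$$w_l = \alpha\sum_{n=l}^{J-1}\binom{n}{l}(1 - \alpha\mu)^{n-l}(-\alpha)^{l}$$
for MMSE, and $J$ is the TPE order. Note that when $N \gg K$, the matrix-matrix multiplication is even more expensive than the matrix inversion. Hence, the matrix-matrix multiplication is avoided by implementing iterative computations of the $J$ terms where a series of matrix-vector multiplications are performed to detect the transmitted symbols vector. Hence, one can write
$$\hat{\mathbf{x}} = \sum_{l=0}^{J-1} w_l\,\tilde{\mathbf{x}}_l, \qquad \tilde{\mathbf{x}}_0 = \mathbf{H}^H\mathbf{y}, \qquad \tilde{\mathbf{x}}_l = \mathbf{H}^H\left(\mathbf{H}\tilde{\mathbf{x}}_{l-1}\right),\ l = 1, \ldots, J-1. \tag{17}$$
From (13) and (17), it can be shown that the computational complexity of $\hat{\mathbf{x}}$ is $O(JKN)$. Moreover, the calculations of the $w_l$'s need some additional operations. For a given $\alpha$, the computations of all coefficients require $J^2 + 25J - 12$ and $J^2 + 37J - 16$ FLOPs, respectively for the ZF and MMSE versions of the TPE detector.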
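The iterative evaluation of the TPE detector can be sketched as follows for the ZF case. The coefficient formula used here, $w_l = \alpha\sum_{n=l}^{J-1}\binom{n}{l}(-\alpha)^l$, is the one obtained by expanding $\alpha\sum_{l<J}(\mathbf{I}-\alpha\mathbf{G})^l$ with the binomial theorem; the function names and the choice $\alpha = 1/\lambda_{\max}(\mathbf{G})$ are assumptions made only so that the example has an admissible normalization factor.

```python
import numpy as np
from math import comb

def tpe_coefficients_zf(alpha, J):
    """w_l = alpha * sum_{n=l}^{J-1} C(n, l) * (-alpha)^l for l = 0..J-1."""
    return [alpha * (-alpha) ** l * sum(comb(n, l) for n in range(l, J))
            for l in range(J)]

def tpe_detect(H, y, w):
    """x_hat = sum_l w_l (H^H H)^l H^H y, using matrix-vector products only."""
    term = H.conj().T @ y                   # first TPE term, H^H y
    x_hat = w[0] * term
    for w_l in w[1:]:
        term = H.conj().T @ (H @ term)      # next term; the Gram matrix is never formed
        x_hat = x_hat + w_l * term
    return x_hat

rng = np.random.default_rng(2)
N, K = 128, 16                              # illustrative sizes
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
x = rng.choice(qpsk, size=K)
y = H @ x                                   # noiseless, to isolate the truncation error

G = H.conj().T @ H                          # formed here only to pick alpha and a reference
alpha = 1.0 / np.linalg.eigvalsh(G)[-1]     # any 0 < alpha < 2 / lambda_max is admissible
x_zf = np.linalg.solve(G, H.conj().T @ y)

# Distance to the exact ZF solution for increasing TPE order J.
errors = {J: np.linalg.norm(tpe_detect(H, y, tpe_coefficients_zf(alpha, J)) - x_zf)
          for J in (4, 8, 16)}
```

The dictionary `errors` shrinks as `J` grows, which reflects the geometric convergence guaranteed by Lemma 1; how fast it shrinks depends on the choice of $\alpha$.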

III. PROPOSED LOW-COMPLEXITY TPE-BASED DETECTORS
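The normalization factor $\alpha$ has a crucial impact on the convergence of the TPE detector. The convergence requirement in Lemma 1 results in the following condition
$$0 < \alpha < \frac{2}{\lambda_{\max}(\mathbf{G})}$$
for ZF and
$$0 < \alpha < \frac{2}{\lambda_{\max}(\mathbf{G}) + \mu}$$
for the MMSE, where $\lambda_{\max}(\mathbf{G})$ denotes the maximum eigenvalue of $\mathbf{G} = \mathbf{H}^H\mathbf{H}$. This normalization factor is used to shift the eigenvalues of $\mathbf{G}$ into the convergence region, as they may lie outside of it. A coarse choice for the normalization factor can be $\alpha = \frac{2}{\mathrm{Trace}(\mathbf{G})}$, as one can write
$$\lambda_{\max}(\mathbf{G}) < \sum_{n=1}^{K}\lambda_n(\mathbf{G}) = \mathrm{Trace}(\mathbf{G}),$$
where $\lambda_n(\mathbf{G})$ is the $n$-th eigenvalue of $\mathbf{G}$. However, the convergence of the TPE with this value of $\alpha$ is slow. This approximation needs $8NK - 2$ operations for the calculation of the summation of the diagonal entries of $\mathbf{G}$. It is shown in [34] that the fastest convergence happens when the two extreme cases, $\alpha\lambda_{\min}(\mathbf{X})$ and $\alpha\lambda_{\max}(\mathbf{X})$, are equally distant from unity. Hence, one can write
$$\alpha_{\mathrm{opt}} = \frac{2}{\lambda_{\min}(\mathbf{G}) + \lambda_{\max}(\mathbf{G})}$$
for ZF and
$$\alpha_{\mathrm{opt}} = \frac{2}{\lambda_{\min}(\mathbf{G}) + \lambda_{\max}(\mathbf{G}) + 2\mu}$$
for the MMSE. However, the calculation of $\mathbf{G}$ itself requires $O(NK^2)$ operations. In addition, the calculations of $\lambda_{\min}(\mathbf{G})$ and $\lambda_{\max}(\mathbf{G})$ require $O(K^3)$ operations. An approximate method is proposed in [23] by offering intervals for the eigenvalues of $\mathbf{G}$. This method reduces the complexity of the eigenvalue calculations to $O(K^2)$. However, it still needs the computation of the entries of $\mathbf{G}$.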
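The effect of the normalization factor on the convergence speed can be made concrete by comparing the worst-case contraction factor $\max_n |1 - \alpha\lambda_n(\mathbf{G})|$ for the trace-based and the optimal choices in the ZF case. The sketch below is illustrative only; the helper `contraction_factor` and the dimensions are assumptions, and the eigenvalues are computed exactly just to serve as a reference.

```python
import numpy as np

def contraction_factor(alpha, lam_min, lam_max):
    """Spectral radius of I - alpha*G: worst-case error reduction per TPE term."""
    return max(abs(1.0 - alpha * lam_min), abs(1.0 - alpha * lam_max))

rng = np.random.default_rng(3)
N, K = 128, 16                              # illustrative sizes
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)

eigs = np.linalg.eigvalsh(H.conj().T @ H)   # ascending order; reference only
lam_min, lam_max = eigs[0], eigs[-1]

# Trace(G) equals the squared Frobenius norm of H, so no Gram matrix is needed.
alpha_trace = 2.0 / np.linalg.norm(H, "fro") ** 2
alpha_opt = 2.0 / (lam_min + lam_max)

rho_trace = contraction_factor(alpha_trace, lam_min, lam_max)
rho_opt = contraction_factor(alpha_opt, lam_min, lam_max)
print(f"trace-based: {rho_trace:.3f}, optimal: {rho_opt:.3f}")
```

A contraction factor close to one, as obtained with the trace-based choice, means many TPE terms are needed, whereas the optimal choice attains $(\lambda_{\max}-\lambda_{\min})/(\lambda_{\max}+\lambda_{\min})$.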

A. PROPOSED NORMALIZATION FACTOR FOR SMALL LOADING FACTORS
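In an effort to reduce the computational complexity pertaining to the calculation of the optimal normalization factor, we here propose to set the normalization factor based on approximations for the extreme eigenvalues of $\mathbf{G}$. Since $\mathbf{G}$ is a complex central Wishart matrix, when $N$ and $K$ grow we have [35]
$$\lambda_{\max}(\mathbf{G}) \approx \left(1 + \sqrt{\beta}\right)^2, \qquad \lambda_{\min}(\mathbf{G}) \approx \left(1 - \sqrt{\beta}\right)^2, \tag{24}$$
where $\beta = K/N$ is the loading factor of the massive MIMO system. The approximations in (24) are accurate when $K$ and $N$ are large and the loading factor is small. Suppose we are using the ZF version of the TPE detector. If we select
$$\alpha = \frac{1}{1 + \beta}, \tag{26}$$
this normalization factor satisfies the convergence condition in Lemma 1, as one can write
$$\alpha = \frac{2}{\left(1 + \sqrt{\beta}\right)^2 + \left(1 - \sqrt{\beta}\right)^2} = \frac{1}{1 + \beta} < \frac{2}{\left(1 + \sqrt{\beta}\right)^2} \approx \frac{2}{\lambda_{\max}(\mathbf{G})}.$$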
The computational complexity of calculating the proposed normalization factor in (26) is O(1) as it only needs one addition and one division. Moreover, as will be shown in Section V, by using this normalization factor, the convergence of the TPE is similar to that of the TPE with the optimal normalization factor in (22).
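A quick numerical check of the constant normalization factor is sketched below, assuming it takes the form $\alpha = 1/(1+\beta)$, which is what results from substituting the asymptotic edge values $(1 \pm \sqrt{\beta})^2$ into $\alpha = 2/(\lambda_{\min} + \lambda_{\max})$. The function name `alpha_constant` and the dimensions are illustrative assumptions.

```python
import numpy as np

def alpha_constant(N, K):
    """Normalization factor from system dimensions only: one addition, one division."""
    beta = K / N                            # loading factor
    return 1.0 / (1.0 + beta)

rng = np.random.default_rng(4)
N, K = 256, 16                              # small loading factor, beta = 0.0625
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)

eigs = np.linalg.eigvalsh(H.conj().T @ H)   # exact eigenvalues, reference only
alpha_opt = 2.0 / (eigs[0] + eigs[-1])
alpha_c = alpha_constant(N, K)

# Asymptotic edges of the eigenvalue spectrum for i.i.d. CN(0, 1/N) entries.
beta = K / N
edge_min, edge_max = (1 - np.sqrt(beta)) ** 2, (1 + np.sqrt(beta)) ** 2
print(f"alpha_constant = {alpha_c:.4f}, alpha_opt = {alpha_opt:.4f}")
```

For a single channel realization the two values typically differ by a few percent, which is consistent with the similar convergence behaviour reported for small loading factors.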

B. MASSIVE MIMO SYSTEMS WITH LARGE LOADING FACTORS
The approximations for the extreme eigenvalues of $\mathbf{G}$ in (24) are accurate when $K$ and $N$ are large and $\beta$ is small. For large $\beta$'s, the eigenvalues may differ from the approximations, which will affect the error performance of the system. As shown in Section V, for a large $\beta$, e.g., $\beta = 16/64 = 0.25$, the bit error rate (BER) performance of the proposed TPE-based detector diverges from that of the TPE-based detector with $\alpha_{\mathrm{opt}}$. To address this shortcoming, we propose approximating the extreme eigenvalues of $\mathbf{G}$ using a low-complexity algorithm. Particularly, we exploit the power iteration method to approximate these eigenvalues, which are ultimately utilized in the normalization factor for massive MIMO systems with large loading factors.
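The power method [33] is an iterative algorithm for approximating the largest eigenvalue of a matrix with linearly independent eigenvectors and a dominant eigenvalue. The algorithm starts with a non-zero initialization vector $\mathbf{x}_0$ that is iteratively multiplied by matrix $\mathbf{G}$, i.e.,
$$\mathbf{x}_m = \mathbf{G}\mathbf{x}_{m-1} = \mathbf{G}^m\mathbf{x}_0. \tag{29}$$
For large $m$, a good approximation of the dominant eigenvector of $\mathbf{G}$ is obtained. The corresponding eigenvalue is obtained by the Rayleigh quotient
$$\lambda_{\max}(\mathbf{G}) \approx \frac{\mathbf{x}_m^H\mathbf{G}\mathbf{x}_m}{\mathbf{x}_m^H\mathbf{x}_m}. \tag{30}$$
Here, we propose initiating the algorithm with $\mathbf{x}_0 = \mathbf{H}^H\mathbf{y}$ and limiting the number of iterations of the power method in (29) to $m = J - 2$ in order to use the already calculated TPE terms in (17) for the calculation of the largest eigenvalue. By comparing (17) and (30), one can write
$$\lambda_{\max}(\mathbf{G}) \approx \frac{\tilde{\mathbf{x}}_{J-2}^H\,\tilde{\mathbf{x}}_{J-1}}{\tilde{\mathbf{x}}_{J-2}^H\,\tilde{\mathbf{x}}_{J-2}},$$
where $\tilde{\mathbf{x}}_l = \mathbf{G}^l\mathbf{H}^H\mathbf{y}$ denotes the $l$-th TPE term. With these choices of initialization and number of iterations, the power method only needs $16K + 2$ additional FLOPs for the calculation of $\lambda_{\max}(\mathbf{G})$. We note that since we utilize the TPE terms for the calculation of $\lambda_{\max}(\mathbf{G})$, it can be implemented in parallel to the TPE term calculations without imposing further processing delay on the system.
In the following, we discuss how efficiently we can obtain an approximation for $\lambda_{\min}(\mathbf{G})$. As $\mathbf{G}$ is a positive definite matrix, the matrix $\bar{\mathbf{G}} = \mathbf{G} - \lambda_{\max}(\mathbf{G})\mathbf{I}_K$ is a negative semidefinite matrix with the dominant eigenvalue $\lambda_{\min}(\mathbf{G}) - \lambda_{\max}(\mathbf{G})$. As a result, by inputting matrix $\bar{\mathbf{G}}$ to the power method, an approximation of $\lambda_{\min}(\mathbf{G})$ can be calculated as follows
$$\lambda_{\min}(\mathbf{G}) \approx \frac{\mathbf{v}_m^H\bar{\mathbf{G}}\mathbf{v}_m}{\mathbf{v}_m^H\mathbf{v}_m} + \lambda_{\max}(\mathbf{G}),$$
where $\mathbf{v}_m$ is the $m$-th iterate of the power method applied to $\bar{\mathbf{G}}$. For this approach, the approximation of $\lambda_{\min}(\mathbf{G})$ requires prior knowledge of $\lambda_{\max}(\mathbf{G})$. The straightforward solution is to calculate $\lambda_{\min}(\mathbf{G})$ after obtaining $\lambda_{\max}(\mathbf{G})$ in (30) using two separate instances of the power method. However, this approach increases the processing delay by the number of iterations of the power method for calculating $\lambda_{\min}(\mathbf{G})$. As a result, we propose to calculate both $\lambda_{\min}(\mathbf{G})$ and $\lambda_{\max}(\mathbf{G})$ simultaneously, where a coarse approximation of $\lambda_{\max}(\mathbf{G})$, which is obtained at the early steps of the power method, is used for the $\lambda_{\min}(\mathbf{G})$ approximation.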
Algorithm 1 contains the detailed steps of the proposed approach.
Remark 1: For massive MIMO systems with large loading factors, using the proposed method, a relatively accurate approximation of λ max (G) can be attained. However, the approximated λ min (G) might differ from the actual smallest eigenvalue for some channel realizations. This mismatch can be reduced by increasing the number of iterations or selecting an appropriate initialization at the cost of increased computational complexity. However, for our application, as shown in Section V, such a mismatch does not affect the system's performance.
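The eigenvalue estimates described in this subsection can be sketched as follows. Note that this is the straightforward sequential variant (first $\lambda_{\max}$ from the last two TPE terms, then $\lambda_{\min}$ from a shifted power iteration), not Algorithm 1 itself, whose simultaneous scheme is not reproduced here. All function names, the noise level, and the dimensions are assumptions for illustration.

```python
import numpy as np

def tpe_terms(H, y, J):
    """TPE terms (H^H H)^l H^H y for l = 0..J-1, via matrix-vector products."""
    terms = [H.conj().T @ y]
    for _ in range(J - 1):
        terms.append(H.conj().T @ (H @ terms[-1]))
    return terms

def lambda_max_from_terms(terms):
    """Rayleigh quotient x^H (G x) / (x^H x) with x = terms[J-2] and G x = terms[J-1]."""
    x, Gx = terms[-2], terms[-1]
    return (np.vdot(x, Gx) / np.vdot(x, x)).real

def lambda_min_shifted(H, v0, lam_max, iters):
    """Power iterations on G - lam_max*I, followed by a Rayleigh quotient w.r.t. G."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(iters):
        v = H.conj().T @ (H @ v) - lam_max * v
        v = v / np.linalg.norm(v)           # rescaling only prevents under/overflow
    return np.vdot(v, H.conj().T @ (H @ v)).real

rng = np.random.default_rng(5)
N, K, J = 64, 16, 10                        # large loading factor, beta = 0.25
H = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2 * N)
qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
x = rng.choice(qpsk, size=K)
y = H @ x + 0.05 * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

terms = tpe_terms(H, y, J)
lam_max_est = lambda_max_from_terms(terms)  # reuses terms the TPE needs anyway
lam_min_est = lambda_min_shifted(H, terms[0], lam_max_est, iters=J - 2)
alpha_power = 2.0 / (lam_min_est + lam_max_est)
```

Because a Rayleigh quotient of a Hermitian matrix always lies between its extreme eigenvalues, `lam_max_est` never overshoots $\lambda_{\max}(\mathbf{G})$ and `lam_min_est` never undershoots $\lambda_{\min}(\mathbf{G})$; as Remark 1 notes, the estimate of the smallest eigenvalue is usually the less accurate of the two.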

C. SPATIALLY CORRELATED MASSIVE MIMO CHANNELS
In realistic wireless communication environments, the error performance of the uplink of massive MIMO systems is affected by the spatial correlation between the antennas at the BS. We consider the spatially correlated channel model in [36] and [37] such that
$$\mathbf{H}_{\mathrm{sc}} = \mathbf{R}^{1/2}\mathbf{H}, \tag{33}$$
where $\mathbf{R} \in \mathbb{R}^{N \times N}$ is the correlation matrix of the BS antennas, defined in [36] and [37] in terms of the correlation coefficient $\rho$. In Section V, we evaluate the error performance of our proposed TPE-based detector for spatially correlated MIMO channels. We replace the channel matrix $\mathbf{H}$ in (5) with $\mathbf{H}_{\mathrm{sc}}$ in (33). We also normalize $\mathbf{H}_{\mathrm{sc}}$ in (33) by the norm of $\mathbf{R}^{1/2}$ in order to have a consistent SNR adjustment with the uncorrelated MIMO channel scenario. We also note that for such channels, we use the TPE-based detectors with the normalization factor obtained from Algorithm 1, as the extreme eigenvalue approximations in (24) are not valid for such channels.

IV. COMPUTATIONAL COMPLEXITY ANALYSIS
In this section, we analyze the computational complexity of the proposed TPE-based detectors. We assume six and two FLOPs per complex multiplication/division and complex addition/subtraction, respectively. We note that the multiplication of size $K \times N$ and $N \times M$ matrices requires $(8N - 2)KM$ FLOPs [33]. For a given massive MIMO system, one can show that:
• The calculation of $\alpha$ requires 8 FLOPs.
• For a given $\alpha$, the calculations of all $w_l$'s require $J^2 + 25J - 12$ and $J^2 + 37J - 16$ FLOPs for the ZF and the MMSE versions of the TPE, respectively.
• By using the already calculated terms in (17), the calculation of $\lambda_{\max}(\mathbf{G})$ needs $16K + 2$ FLOPs.
• The calculation of $\lambda_{\min}(\mathbf{G})$ needs $16K + 4$ FLOPs.
As a result, compared to the constant normalization factor, the computational complexity of Algorithm 1 increases by $O(J^2 K)$. Hence, for both cases of small and large loading factors of massive MIMO, the overall computational complexity of our proposed TPE-based detector is $O(JKN)$.
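The FLOP conventions above can be turned into a small calculator, sketched here in Python. Only the matrix-product rule $(8N-2)KM$ and the ZF coefficient count $J^2 + 25J - 12$ are taken from the text; the way the TPE terms are tallied in `tpe_term_flops` is an illustrative assumption and deliberately omits the weighted summation and the matrix inversion of the exact detectors, so the printed numbers are not those of Table 2.

```python
def matmul_flops(K, N, M):
    """FLOPs for a (K x N) by (N x M) complex product under the stated convention."""
    return (8 * N - 2) * K * M

def gram_flops(N, K):
    """Cost of forming G = H^H H, i.e. a (K x N) by (N x K) product."""
    return matmul_flops(K, N, K)

def tpe_term_flops(N, K, J):
    """Cost of the J TPE terms computed with matrix-vector products only."""
    first = matmul_flops(K, N, 1)                              # H^H y
    per_term = matmul_flops(N, K, 1) + matmul_flops(K, N, 1)   # H v, then H^H (H v)
    return first + (J - 1) * per_term

def zf_coefficient_flops(J):
    """FLOPs for all ZF coefficients w_l, as stated in the text."""
    return J * J + 25 * J - 12

N, K = 128, 16                              # illustrative dimensions
for J in (3, 5, 10):
    tpe_total = tpe_term_flops(N, K, J) + zf_coefficient_flops(J)
    print(f"J={J}: TPE terms + coefficients = {tpe_total}, Gram matrix alone = {gram_flops(N, K)}")
```

The comparison against the Gram-matrix cost alone shows why the TPE order $J$ matters: the cost of the TPE terms grows linearly in $J$, so the savings over forming $\mathbf{G}$ shrink as more terms are required.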
In Section II, we assumed that all users have the same transmit power. Now, we discuss how unequal user transmit powers affect the computational complexity of the proposed methods. It can be shown that the TPE of the ZF solution in (3) requires only $K$ extra multiplications, and consequently $O(K)$ extra FLOPs, compared to the TPE of the equal transmit power case in (6). For the MMSE solution, by writing the TPE of (4), in each iteration, the multiplication of matrix $\mathbf{P}$ with the estimated vector requires $K$ multiplications. Therefore, it requires $O(JK)$ extra FLOPs compared to the equal transmit power case in (7). Hence, for the unequal transmit power scenario, the overall computational complexity of the proposed method is also $O(JKN)$. We should also note that, as the results in Section V show, the ZF and the MMSE detectors have the same error performance for the considered massive MIMO configurations. Therefore, a small saving in computational complexity can be achieved by the ZF implementation for both scenarios of equal and unequal users' transmit power.
In Table 1, we compare the complexity of different detection schemes. For $m = J$, our proposed TPE-based detector has the same computational complexity as AMP-based and OCD-based detectors. In Section V, for different massive MIMO systems, we compare the error performance of our proposed TPE-based detector with AMP-based and OCD-based detectors.
The AMP-based detectors fail to converge for large loading factors of massive MIMO systems or spatially correlated massive MIMO channels. Moreover, an inappropriate adjustment of the noise variance can degrade the error performance of the AMP-based detectors. In contrast, our proposed TPE-based detector ensures convergence to the ZF or the MMSE detectors. Also, the ZF version of the proposed TPE-based detector does not require knowledge of noise variance.
The OCD-based detector can converge fast for different scenarios of massive MIMO systems. However, the main drawback of the OCD-based scheme lies in its processing delay, as the update of each coordinate (corresponding to a user) depends on the updates of previous coordinates. This dependency increases the processing delay to a number proportional to both the number of users and iterations, i.e., $mK$, and prevents fully-parallel implementation of the scheme. In contrast, although our proposed TPE-based detector needs a larger number of iterations to approach the error performance of the ZF or the MMSE, it can be implemented in a fully-parallel manner. In each iteration of our proposed detectors, the estimates of all users are calculated simultaneously, resulting in a processing delay proportional only to the number of iterations (or TPE terms), i.e., $m$.
Table 1 also includes the overall computational complexity of the CG-based detector, which is $O(mKN)$. It is worth mentioning that some works in the literature, for example [32], have reported a higher computational complexity for the CG-based detector. In those works, it is assumed that the Gram matrix (or its regularized version) is given as an input to the CG-based detector, which requires a matrix-matrix multiplication. However, one can easily verify that the corresponding matrix-matrix multiplication can be replaced with matrix-vector multiplications. Therefore, the overall computational complexity of the CG-based detector is $O(mKN)$.

V. SIMULATION RESULTS
In this section, we consider several massive MIMO systems in order to investigate the error performance of the proposed TPE-based detectors. For all simulations, we refer to the proposed constant normalization factor in (26) and the one obtained using Algorithm 1 as $\alpha_{\mathrm{constant}}$ and $\alpha_{\mathrm{power}}$, respectively. Fig. 1 shows the average mean square error (MSE) between the exact inverse of $\mathbf{G}$ and its approximation using TPE for 10,000 channel realizations. We consider different normalization factors and TPE orders in order to investigate the convergence of the TPE. Our proposed normalization factor $\alpha_{\mathrm{constant}}$ exhibits an MSE similar to the TPE with $\alpha_{\mathrm{opt}}$. Moreover, $\alpha = \frac{2}{\mathrm{Trace}(\mathbf{G})}$ has a very slow convergence. Also, $\alpha = \frac{1}{\lambda_{\max}(\mathbf{G})}$ shows a better convergence speed, but it is still far from the optimal convergence speed. The simple choice of $\alpha = 1$ diverges at large TPE orders when the loading factor is large. It is worth mentioning that the results in Fig. 1 are the average MSE for 10,000 channel realizations, while the worst-case MSE is also important, especially for the BER of wireless communication systems, where one coarse approximation can result in a poor error performance, particularly at high SNRs.
As Fig. 2 shows, the BER performances of the TPE-based detector with $\alpha_{\mathrm{constant}}$ are similar to the case when $\alpha_{\mathrm{opt}}$ is used for 128 × 16, 256 × 16, and 512 × 16 MIMO systems. Moreover, $J = 5$, $J = 4$, and $J = 3$ are respectively sufficient for these systems to approach the BER performances of the ZF and the MMSE detectors with exact inversion. However, for a 64 × 16 massive MIMO system, although the TPE-based detector with $\alpha_{\mathrm{constant}}$ performs similarly to the TPE-based detector with $\alpha_{\mathrm{opt}}$ at low and moderate SNRs, there is a performance gap at high SNRs. This gap is due to the inaccuracy of the extreme eigenvalue approximations in (24) for systems with large loading factors.
In Fig. 3, we use the TPE-based detector with $\alpha_{\mathrm{power}}$ in order to resolve the convergence issue with $\alpha_{\mathrm{constant}}$ for the 64 × 16 MIMO system, which has a large loading factor. As Fig. 3a shows, the TPE-based detector converges with $\alpha_{\mathrm{power}}$, and it requires $J = 10$ to approach the BER of the ZF or the MMSE for 16-QAM modulation. In Fig. 3b, we evaluate our proposed TPE-based detector for the higher-order 64-QAM modulation. For this system, a similar convergence behaviour is observed, and $\alpha_{\mathrm{power}}$ with $J = 12$ is required to approach the BER of the ZF or the MMSE detectors.
We should note that for this MIMO system with 16-QAM modulation, for SNRs higher than approximately 10 dB, to approach the BER of ZF/MMSE, the proposed method requires a TPE order $J$ that results in a higher computational complexity than the direct implementation of ZF/MMSE. However, this is not the case for all MIMO configurations with $\beta = 0.25$. For example, doubling the number of users and BS antennas, we have a 128 × 32 MIMO system with the same $\beta = 0.25$. For this MIMO system with 16-QAM modulation, according to our simulations, at high SNRs, $J = 10$ is required to achieve the BER of ZF/MMSE. Table 2 contains the computational complexity of ZF/MMSE detectors and the proposed methods in terms of FLOPs for different BER requirements. For a 128 × 32 MIMO system with 16-QAM modulation, the computational complexity reduction of the proposed method compared to the ZF/MMSE detector is about 11.67% at the SNR of 20 dB (BER ≈ $10^{-6}$). The reduction improves to 30.48% and 49.16% for BERs of $10^{-4}$ and $10^{-2}$, respectively, where $J = 8$ and $J = 6$ are required to achieve these BERs.
Despite such a limitation for some MIMO configurations, our proposed methods have advantages over the ZF/MMSE detector. Besides the fully-parallel implementation capability, our proposed methods offer a flexible framework where the computational complexity can be adjusted based on BER requirements, with considerable computational complexity savings at low to medium SNRs. For example, for a 64 × 16 MIMO system with 16-QAM, the proposed methods have less computational complexity than the direct implementation of the ZF/MMSE detector for SNRs less than 10 dB, where $J = 6$ or less is needed to approach the BER of the ZF/MMSE detector.
In Fig. 4, we compare our proposed TPE-based detector with AMP-based and OCD-based detectors. We consider a 64 × 16 massive MIMO system with a large loading factor of $\beta = 0.25$ and a 256 × 16 massive MIMO system with a small loading factor of $\beta = 0.0625$. The proposed TPE-based detector converges for these two systems. However, although the AMP-based detector converges with $m = 3$ for the massive MIMO system with small $\beta$ in Fig. 4b, it fails to converge for the massive MIMO system with large $\beta$ in Fig. 4a: it cannot approach the BER of MMSE, and after a certain point increasing the number of iterations does not improve the BER performance.
According to Fig. 4, the OCD-based detector converges to the BER of the MMSE with $m = 6$ and $m = 3$ iterations for 64 × 16 and 256 × 16 massive MIMO systems, respectively. In contrast, the proposed TPE-based detector requires $m = J = 10$ and $m = 4$ iterations for these two systems. However, the processing delays associated with the OCD-based detector for these two systems with the mentioned numbers of iterations are respectively proportional to $mK = 96$ and $mK = 48$, whereas the processing delay of the proposed TPE-based detector scales only with the number of iterations, i.e., 10 and 4.
In Fig. 5, we consider a 128 × 16 massive MIMO system with two different correlation coefficients, $\rho = 0.2$ and $\rho = 0.3$. For these two systems, the AMP-based detector suffers from a severe performance degradation such that it cannot approach the BER of the MMSE detector even with a large number of iterations. The proposed TPE-based detector approaches the BER of the MMSE detector with $J = 12$ and $J = 16$, respectively. Although the OCD-based detector can achieve these BER performances with approximately half of the number of iterations required for the TPE-based detector, the processing delays of the OCD-based detector for these two systems are proportional to 96 and 128, while those of the proposed TPE-based detectors are proportional to 12 and 16, respectively.
We should note that for spatially correlated massive MIMO channels, the superiority of our proposed methods in terms of computational complexity holds up to a specific BER. Furthermore, the BER range is extended when the spatial correlation reduces. For example, in Fig. 5b, only for BERs less than ≈ $10^{-2}$ the proposed method has a smaller complexity than the direct implementation of ZF/MMSE. For the spatial correlation of $\rho = 0.2$ in Fig. 5a, the BER range is extended to BER ≈ 0.04. According to our simulation results, the BER range is extended to ≈ $10^{-3}$ and ≈ $3 \times 10^{-4}$ for $\rho = 0.1$ and $\rho = 0.5$, respectively.
In Fig. 6, we compare our proposed TPE-based detector with the CG-based detector for a 128 × 16 massive MIMO system with 16-QAM modulation. As Fig. 6a shows, for this system with the uncorrelated channel matrix, the proposed TPE-based detector with $\alpha_{\mathrm{constant}}$ requires $m = J = 5$ while the CG-based detector requires $m = 4$ in order to approach the ZF or MMSE solution. For the system with the spatial correlation of $\rho = 0.2$ between BS antennas in Fig. 6b, the CG-based detector converges with $m = 6$ while our proposed TPE-based detector needs $m = J = 12$ in order to approach the ZF or MMSE solution. Similar to our proposed TPE-based detectors, in each iteration, the CG-based detector updates the detected signal vector for all users simultaneously. However, the CG-based detector does not provide the post-processing SINR information for the calculation of LLR values for soft-output detection.

VI. CONCLUSION
We proposed efficient TPE-based detectors for uplink multiuser massive MIMO systems. The efficiency of the proposed detectors stems from achieving performance comparable to that of linear detectors while requiring significantly lower computational complexity. For massive MIMO systems with small loading factors, we exploited the asymptotic properties of complex Wishart matrices and proposed a constant normalization factor for the TPE-based detector. Also, we proposed an efficient algorithm for systems with large loading factors by utilizing the power method to approximate the extreme eigenvalues of the channel Gram matrix for tuning the corresponding normalization factor. The proposed detectors have a linear computational complexity in the dimensions of the system and the order of the TPE. Moreover, our proposed detectors ensure convergence to the ZF or the MMSE detectors with a fully-parallel implementation capability and, consequently, a small processing delay. One future direction for this work is to extend the proposed schemes to the case of coordinated multi-point transmission, where two or more base stations cooperate in serving multiple users. Such a system model requires processing the proposed schemes in a distributed manner.