Hardware Implementation and Performance Analysis of Improved Sphere Decoder in spatially correlated Massive MIMO channels

Detection for high-dimensional multiple-input multiple-output (MIMO) and Massive MIMO (MMIMO) systems is an active ﬁeld of research in wireless communications. While most works consider spatially uncorrelated channels, practical MMIMO channels are correlated. This paper investigates the impact of correlation on Sphere Decoder (SD), not only for Single-User (SU) but also for Multi-User (MU) scenarios. The complexity of SD is mainly determined by the Initial Radius (IR) method and the number of visited nodes during detection. This paper proposes both an efﬁcient IR and a new metric constraint in the tree searching algorithm, that signiﬁcantly decrease the number of visited nodes and render SD feasible for large-scale systems. In addition, a hardware implementation featured with a one-node-per-cycle architecture, minimizes the latency of the detection process. Trade-offs between bit error rate (BER) performance and computational complexity are presented, either modifying the backtracking mechanism or limiting the number of radius updates. Simulation results prove that the proposed optimizations are effective for both correlated and uncorrelated channels, regardless the level of noise. The decoding gain of SD compared to the low-complexity Linear Detectors (LD) is higher in the presence of correlation than in the uncorrelated case. However, as expected, spatial correlation adversely affects the performance and the complexity of SD. Simulation results reported here also conﬁrm that correlation at the side equipped with more antennas is less detrimental. Hardware aspects are examined for both a Virtex-7 FPGA device and a 28-nm ASIC technology.


I. INTRODUCTION
The increasing demand for higher data rates and for more connected devices has led to MMIMO technology. The use of a high number of antennas is no longer just a theoretical concept but a principal way of increasing the capacity of cellular communication systems [1]. The term "MMIMO" has no precise definition and is utilized variously in research, but mainly is used to impose a MU system where the receiving antennas (N R ) outnumber the transmit antennas (N T ). In this work both SU and MU systems are examined, therefore a clear explanation is required. As MMIMO a MU system is considered, while the SU scenario is referred to as highdimensional MIMO. Both MIMO schemes are quoted as MMIMO channels.
A Maximum Likelihood (ML) detector yields the optimal solution in the MIMO detection problem but its complexity rises exponentially with the order of the modulation scheme and with the number of transmit and receive antennas. On the contrary, an analysis by Hassibi and Vikalo has revealed that the expected complexity of SD is polynomial under certain circumstances [2]. In addition, Jaldén and Ottersten report that SD can exceed polynomial time algorithms in several use cases [3]. The reduced complexity is justified because the aim of SD is to detect the transmitted signal vector by searching only among candidate vectors which lie inside a hypersphere of radius R 0 around the received signal, in contrast to the ML detector where all vectors are examined [3], [4]. Confining the search within the hypersphere only while providing ML performance [4]- [6], makes IR selection a critical process for the complexity of SD. A large initial value of sphere radius can lead to an exhaustive search provoking exponential complexity; in contrast a very small radius may contain no lattice point inside the hypersphere and the search will have to be restarted with a new estimation of IR nullifying also the arithmetic operations that had preceded up to that point. However in large-scale systems an effective IR is not always enough, as optimizations on node pruning are also required.
Several authors have studied IR selection and a wide range of variations of the SD algorithm have been proposed, targeting to decreasing the number of nodes visited during the search. In systems, where N T is limited and the modulation order is low, a method for IR selection is not required, since the tree searching involves a small number of nodes; however in large-scale systems the absence of an efficient IR estimate leads to an exhaustive search, not feasible for hardware implementations. Defining a constant or infinity as IR, or relying on the noise distribution [7]- [9] are practical techniques only for small-scale systems. For both MMIMO and high-dimensional MIMO schemes, IR estimation with the help of a pre-processing block or an other lowcomplexity detector [7], [8] seems to be the most effective method diminishing the number of lattice points inside the hypersphere, despite the need for additional hardware. Recent works [10], [11] exploit deep learning in order to compute the IR in systems, where N T = N R . The results are encouraging in terms of searching time but the additional complexity of a Neural Network (NN) is not negligible especially for largescale systems. In [10] the NN estimates a number of IR values, based on the noise statistics and on the structure of the channel matrix H, implementing the increasing radius search algorithm [2], [12], while in [11] the NN predicts the minimum path metrics of sub-trees in order to optimize the search order and the IR estimate. Apart from IR selection, searching strategies are also crucial. In [13] a different radius constraint for each level of the tree is proposed, while in [14], [15] both algorithmic and architectural optimizations about sphere decoding algorithm are described.
This contribution is an extension of an earlier conference paper [16], where an implementation of Improved Sphere Decoder (ISD) is presented. The following points differentiate ISD from the conventional SD, and renders its implementation feasible for large-scale systems: 1) an efficient IR method; 2) an additional metric constraint at tree searching; and 3) a one-node-per-cycle hardware architecture.
The proposed IR technique exploits that the SD algorithm requires an estimate by a linear detector; it reuses the estimation, combined with an approximation of a 2-norm distance of a matrix to obtain an IR selection which leads to low complexity. Simulation results from [16] show that the proposed IR combined with the extra metric constraint reduce significantly the number of nodes and consequently the computational complexity and the latency of the detection process. SD is known for a non-deterministic behavior, therefore a decrease in the required clock cycles is crucial. Apart from the reduction of visited nodes, a one-node-per-cycle architecture targets this problem by minimizing the latency. Assuming an uncorrelated channel, ISD achieves 0.5-1.25 dB decoding gain compared to LD and its implementation is attainable, both at FPGA and ASIC technology, for E b /N 0 ≥ 4 dB [16].
In this extended work, the effect of spatial correlation on ISD is investigated in matters of BER performance, visited nodes, computational complexity and detection throughput rate. Both SU and MU scenarios are examined. Simulation results are promising, proving that optimizations made in ISD are effective also for correlated channels. Correlation at receiver, given that N R N T , is less harmful as expected [17], [18]. The decoding gain against LD depends on the level of correlation, but is always higher compared to the uncorrelated case. Nevertheless, in the presence of spatial correlation, an improved Signal-to-Noise Ratio (SNR) is required to render ISD feasible for a hardware implementation. In addition trade-offs between complexity and performance are presented due to a new constraint, which defines the maximum number of possible radius updates through tree searching. Modifying the backtracking mechanism of SD algorithm leads also to similar trade-offs. Besides, the problem with the high number of required bits for data representation presented in [16] is solved in this paper with suitable scaling of all tree distances. ISD demands a word length equal to LD in the search part, but IR computation needs a few more fractional bits. This decrease in required word length reduces the area requirements for a one-node-per-cycle component, in comparison with [16]. As a result, compared to [16], higher detection throughput rates are achieved despite the increase of the critical path caused by some changes in one-node-per-cycle architecture.
LD have gained the interest of researchers in MMIMO and high-dimensional MIMO systems, where N R N T . LD feature low complexity and their performance is quite satisfactory when the receiving antennas outnumber the transmitting antennas; but when N R = N T , their performance lags behind [19]. Unlike this work, the majority of the papers about SD assume N R = N T and the reason is that a comparison with LD is more favourable in these systems due to their poor BER performance. The most common used LD in MMIMO channels are Zero-Forcing (ZF) and Minimum Mean-Square Error (MMSE). As for the Matched-Filter (MF), which is the simplest detector in terms of complexity, is efficient only for BPSK modulation when N R = N T [20], [21]. Under these circumstances, MF might be able to replace ZF estimate both in the SD algorithm and in the proposed IR method. This simplification would further reduce the computational complexity of ISD. However for large-scale systems or for more dense modulation schemes, which relate to this contribution, MF is not applied [20]- [22]. Furthermore results from [19] show that ZF or MMSE can achieve the performance of MF with a significantly reduced number of antennas. In this paper, results about BER performance of ISD are compared with ZF and MMSE detectors due to the inefficiency of MF.
A typical assumption in the literature is that the considered channels are spatially uncorrelated. However, in real-world propagation scenarios correlation exists and ignoring it may lead to inaccurate conclusions. Real testbeds prove the presence of correlation in MMIMO channels [23]- [25]. Antenna spacing and no rich scattering conditions are some of the main reasons why MMIMO channels suffer from correlated fading [19], [26]- [28]. In [24], the effect of antenna spacing is investigated. In large-scale systems, where N R N T , the correlation normally exists at a lower level provided a rich enough angular spread [19], [23], [24]. Results of [23] demonstrate that the correlation, which always depends on the respective scattering environment, decreases as the quotient N R N T increases. The problem of correlation is more severe at SU than MU schemes [26]. This is reasonable since in MU systems a user is not necessarily close to an other and the possibility of having independent propagation paths is larger. Furthermore, as mentioned before, correlation on the side with fewer antennas is more detrimental [17], [18]. In the cases examined in this paper, this side refers to the transmitter since N R N T . BER degradation and generally the impact of spatially correlated channels on LD is analyzed in [29]. Regarding SD, the effect of correlation is investigated for small-scale MIMO systems in [5], [6] and for spatial multiplexing MIMO of higher dimensions in [10], [30]. In these contributions, it is assumed that N R = N T and all conclude that as correlation increases the complexity of SD rises and its performance deteriorates. Results in [5] show that one-side correlation imposes the same BER degradation irrespective of whether it incurs at the transmitter or at the receiver. This is explained by the same number of antennas at both sides. The extent of the effect, in terms of the BER performance and the number of visited nodes, depends on the level of correlation. A common method to estimate it, is to derive the distribution of the channel eigenvalues; i.e., the eigenvalues of matrix H H H. The statistical behavior of channel eigenvalues depends on the correlation [27] and determines the capacity of the system [31]. In a strongly correlated channel a few eigenvalues dominate [24], [28], [32], in contrast with weakly correlated scenarios where the eigenvalues are almost identical. Another technique, based also on the channel eigenvalues, is the computation of the condition number. Condition number is defined in [23] and is often used to estimate the level of correlation.
In addition, a feature of uncorrelated channels is the diagonal dominance [33], which does not apply in the presence of spatial correlation. Such methods are recommended to evaluate the magnitude of correlation, without always being accurate.
The remainder of this paper is structured as follows: Section II reviews the system model and the detection problem of SD. In addition, channel models for both SU and MU scenarios are derived. Section III describes the SD algorithm and presents the proposed optimizations for a more efficient node pruning during tree searching. Section IV evaluates IR methods effective for large-scale systems and proposes an approach adapted for implementation on hardware. Section V details the one-node-per-cycle architecture. Section VI shows the effect of spatial correlation and presents simulation results about BER performance, visited nodes, computational complexity and detection throughput rate. Hardware aspects for both FPGA and ASIC implementation are revealed. Finally, Section VII discusses conclusions.

II. SYSTEM AND CHANNEL MODELS
This section describes the structure of the simulation model. Channel models for both SU and MU scenarios under correlation are also presented. Several spatially correlated channel models are employed in the literature, each adopting specific assumptions. This paper focuses on the analysis of the effects of correlated channels on sphere decoding methods, and not on identifying channels that best model the real propagation conditions.

A. System Model
A high-dimensional MIMO or a MMIMO system consists of N T transmit and N R receive antennas (N R N T ). In this work uplink communication is investigated, which means that the Base Station (BS) receives data from the N T single-antenna users in the case of MU scenario or from a transmitter equipped with N T antennas in the case of pointto-point spatial multiplexing MIMO. When correlation is not considered, a fully spatial multiplexing MIMO and a MU scheme are equivalent in terms of simulation. Results of [16] refer to both system models, since they assume an uncorrelated channel.
The transmitted data are mapped to constellation symbols by using Quadrature Amplitude Modulation (QAM) of M points. Let C denote the set of all possible lattice points. A flat-fading channel, instead of a frequency-selective-fading channel, is assumed and is transformed into N flat-fading sub-channels using Orthogonal Frequency Division Multiplexing (OFDM) transmission. Let N denote the number of subcarriers. The N R × N T spatially correlated channel matrix H illustrates the complex transfer function from transmitter j to receiver i. In the case of an uncorrelated channel, all the h ij are considered as Rayleigh fading channel coefficients. Additive white Gaussian noise (AWGN) is assumed. The received signal r is defined as r = Hs+n, where s denotes the transmitted N T OFDM symbols, r denotes the received N R OFDM symbols and n is the independent and identically distributed complex AWGN with zero mean and variance N 0 . Provided that H is known at the receiver, the detection problem of SD [4] iŝ B. Channel Models 1) Single-User (SU) correlated channel For point-to-point high-dimensional MIMO systems, the Kronecker model is employed. Kronecker model is mathematically manageable and allows transmitter side and receiver side correlation to be considered separately [5], [32]. The channel matrix is where H u is the N R × N T Rayleigh fading channel matrix used for the uncorrelated scenario, R R is the N R × N R correlation matrix of the receiver and R T is the N T × N T transmit side correlation matrix. Both spatial correlation matrices are positive semi-definite generated by the exponential correlation model [34] as follows where ρ ρ , ρ τ are correlation coefficients with magnitude ≤ 1 and r * Rji , r * Tji are the conjugates of r Rij and r Tij respectively. The exponential model is convenient since the level of correlation is determined by modifying one parameter, but it is not especially physically accurate. In [31], it is reported that the exponential correlation model is reasonable in the case of a Uniform Linear Array (ULA) in a richly scattering environment. The Kronecker model is assumed in [5], where the impact of spatial correlation on SD in small-scale pointto-point MIMO is examined. In [6], where the complexity of SD is investigated in the presence of correlation for a 4 × 4 MIMO system, the Kronecker model combined with the exponential model for the computation of R R , R T is considered. Contributions in [31], [35] also use this model. Real measurements in [36], [37] prove the efficiency of the Kronecker model in MIMO systems. This paper focuses on the impact of correlation on SD and not on the channel model.
2) Multi-User (MU) correlated channel For MMIMO schemes, the Gaussian local scattering model [38,Ch. 2.6] is illustrated. Recent works [26], [39] also assume this channel model. The local scattering model considers a BS equipped with a ULA of N R antennas featured with half-wavelength spacing, which is elevated without scatterers in its near-field and is located at the center of a cell sector (cs). In this paper, both 60°and 120°cell sectors are examined. On the contrary, scattering is localized around the N T single-antenna users, whose nominal angle φ u is crucial for the generation of the N R × N R spatial correlation matrix R u . Apart from random φ u , the assumed standard deviation σ φ ≥ 0 is also important for modeling matrix R u . As σ φ decreases, the correlation increases [38]. For the simulation σ φ = 10°is selected, which represents strong spatially correlated channels and is considered as a plausible value in urban environment [40]. The structure of the NLOS propagation model is depicted in [38] and in [39, Fig. 3].
The spatial correlation matrix R u , individually for each user, is derived from [38,Eq. (2.23)]. For a given user, the channel vector h u is computed as where V u E 1 2 u V H u is the eigenvalue decomposition of R u and e is sampled from ∼ CN (0, I N R ). Let γ denote the rank of R u . The γ ×γ E u matrix is a diagonal matrix containing the γ positive non-zero eigenvalues of R u and the N R ×γ V u matrix consists of the corresponding eigenvectors, such that V H u V u = I γ . Given that R u is symmetric, the eigenvalue decomposition and the singular value decomposition are equivalent.

III. SPHERE DECODER (SD) A. Sphere Decoder Algorithm
Sphere constraint of (1) can be rewritten, according to [4], asŝ where U is obtained by the Cholesky decomposition of matrix H H H andŝ is the Least Squares estimate of s, SD performs a tree search through N T levels, where every node has M branches. A limited number of possible candidate vectors is examined by SD depending on IR. The detection of the ML solution vector requires a metric constraint equal to R 2 0 . When a solution path is found, the sphere constraint updates its value with the new path metric. Let D i denote the accumulated Euclidean metric down to antenna level i, where d i is the partial Euclidean metric from antenna level i, D i+1 is the accumulated Euclidean metric down to antenna level i + 1, s i is the constellation symbol corresponding to the branch of level i and z i is derived from

B. Improved Sphere Decoder (ISD)
In [41], depth-first tree traversal with IR selection combined with pruning techniques is presented as the most favorable strategy for SD implementations. ISD meets these specifications and as mentioned is characterized by an efficient IR, an additional constraint at each node, and a one-cycle-per-node architecture. The conventional SD checks at each node only if the accumulated distance D i is smaller than the square of the sphere radius. On the contrary, ISD examines initially if the partial distance d i exceeds the 1 64 of IR. This value is constant during tree searching, in contrast with the sphere constraint in (7), and is hardware-friendly since arises easily with shift operations. If this restriction is not met, ISD does not examine the sphere constraint of D i and all the paths originating from that node are discarded. A similar concept, in which each node has its constraint, is mentioned in [4]. Additional restrictions are also reported in [42]- [44], where K-best SD implementations are examined. The reason which leads to the addition of new constraints is the inefficient pruning of the conventional SD algorithm especially in upper tree levels, where D i is a sum of a few partial distances. Effective node pruning in high levels has significantly important savings, particularly in high-order modulation schemes.
Concerning ISD implementation, scaling at all distance metrics is employed in order not to provoke increase in word length. The proposed IR and the one-node-per-cycle architecture are analyzed in next sections. ISD is very effective at node pruning having as lower bound of N T × M visited nodes per tree searching. The lower bound on the number of visited nodes achieved by ISD equals to the total number of nodes required for K-best SD algorithm when K = 1.
The main advantage of the K-best approach is the constant throughput and complexity [41], [43]. However, to prevent BER performance losses a large value of K is required [43]. Selecting a high value of K leads to a large number of nodes though. In [42], in order to achieve a detection performance close to ML, K = 10 is assumed for a 4 × 4 MIMO with 64-QAM modulation. Regarding the impact of the new metric on BER performance, ISD achieves ML performance assuming uncorrelated channel. However there is a low possibility of an extremely small loss in the presence of correlation. Simulation results are discussed in Section VI.

C. ISD with limited radius updates (LRU-ISD)
A variation of ISD, which leads to trade-offs between computational complexity and BER performance, is proposed here, namely LRU-ISD. Having a limit of L radius updates terminates the algorithm prematurely, unless L is larger than the total number of radius updates that take place at ISD algorithm, decreasing the number of visited nodes. When L exceeds the number of required radius updates, LRU-ISD and ISD are equivalent. It is reasonable that as the value of L falls, the visited nodes are reduced but its BER performance deteriorates. Especially for the case of L = 1, the solution path is the first path that emerges, satisfying the sphere constraint of (7) and the partial distance metric constraint of ISD, without the need for backtracking. The lower bound for this specific case regarding the visited nodes is N T ×M 2 , while for L > 1 is N T ×M as ISD. An early termination scheme is also presented in [11]. Simulations results of LRU-ISD are analyzed in Section VI.

D. ISD with backtracking constraint (BC-ISD)
Another variation of ISD for additional node pruning, but at the cost of losing BER performance, is introduced here and modifies the backtracking mechanism. BC-ISD, in contrast with ISD, does not apply backtracking for the N T 2 low levels of the tree when a path solution is found. Selecting the number of non-backtracking levels is a trade-off, since the more levels are omitted the more nodes are reduced. On the other hand, removing nodes arbitrarily from the search increases the likelihood that the optimal path will not be found, resulting in a drop in BER performance. The visited nodes of BC-ISD are always less than ISD and always more than LRU-ISD with L = 1. This is explained due to the fact that ISD backtracks without neglecting tree levels, while LRU-ISD with L = 1 terminates as soon as it finds the first solution path. Due to non-deterministic behavior of sphere decoding methods, a combination of LRU-ISD and BC-ISD for further reduction of visited nodes is meaningless. If the ML solution path is close to the area of the first solution, then LRU-ISD seems to be more efficient. However, if numerous radius updates are required through detection until the ML solution is found, BC-ISD may be a more ideal approach. Further discussion about BC-ISD in Section VI.

IV. INITIAL RADIUS (IR) A. Proposed Method
IR computation is considered as a complex process in both high-dimensional MIMO and MMIMO schemes, therefore an efficient and low-complexity IR method has to be defined. As mentioned, the simplified techniques employed in MIMO and the option of not calculating IR are not functional for largescale systems. The usage of a detector or a pre-processing block is preferred [7], [8] and utilized in this paper. The proposed IR method requires the estimate of ZF detector, which is derived asx MMSE detector has a BER performance similar to ZF detector, when N R N T , but the proposed method exploits that (6) and (9) are equivalent. The transmitted symbol estimate of ZF detector is required for the SD algorithm, hence reusing it for the IR calculation does not increase the overall complexity. The proposed IR is defined as follows For reduced complexity, MF could replace ZF both at SD algorithm and IR computation but only for BPSK modulation and when N R = N T , i.e., in conditions that do not apply in this work. Selection of ML detector would be optimal, but due to high complexity it is avoided. For the IR defined in (10), the computation of a 2-norm of a matrix is required.
Then according to (10), the IR can be written as The 2-norm of A is the largest singular value of A, which equals to the square root of the maximum eigenvalue of the positive-semidefinite matrix A H A [45]. Consequently (12) is rewritten as It should have a nonzero component in the direction of the dominant eigenvalue; where B is a N R × N R matrix and equals to A H A. In this contribution, two alternatives for approximating the 2-norm distance of a matrix are evaluated. The first approach attempts to estimate the largest eigenvalue of B, while the second one computes instead of the 2-norm, the 1-norm or the infinity norm of A or B matrix.

1) Based on the largest eigenvalue (M1)
For the estimation of the largest eigenvalue λ max a known iterative method, namely power iteration [45], [46], is utilized in a modified and simplified way. The conventional method is described in Alg. 1. Power iteration by default appraises the greatest in absolute value eigenvalue of a matrix and is often utilized for matrices with large dimensions, since no matrix decomposition is required. In case B is diagonalizable, the method converges with ratio λ2 λ1 , where λ 1 = λ max and λ 2 is the second largest in absolute value eigenvalue [45]. Therefore its convergence rate depends on how much the two largest, in absolute value, eigenvalues differ. If they are close, the power iteration converges slowly and the number of demanded repetitions is large. Let Z denote the required number of iterations the power method needs to converge. In MMIMO technology, which feature with high-dimensional matrices, slow convergence of this method is a common problem.
Given that B is diagonalizable, line 2 of the Alg. 1 can be transformed, according to [45], as As a result, in contrast with the conventional algorithm an iterative process is not required regarding the estimation of the dominant eigenvector b k , which corresponds to the largest eigenvalue. In Alg. 1, the accuracy of estimated λ max depends on the number of executed iterations k. If k = Z, the largest eigenvalue is calculated in full precision. However a large value of k leads to a high latency implementation. This problem is addressed by (14), but in this case k denotes the times where B is multiplied by itself. Therefore a large k provokes high computational complexity and can lead to possible numerical overflows. If the selected k is smaller than Z, as in this paper, the line 3 of Alg. 1 is rewritten as follows which is known as Rayleigh quotient [46]. It must be supplemented that the product b H k b k equals to 1 when the method converges, therefore line 3 of Alg. 1 and (15) are equivalent for k = Z. Exploiting (15), it is proved that the denominator of (14) has no use. Assuming that it can be derived from (14) and (16) The proposed method relies on (14)- (17) for the approximation of λ max , thus avoiding recursive procedures. In an effort to prevent numerical overflows, the classic power iteration presented in Alg. 1 utilizes the denominator of the ratio in line 2 in order to normalize the b k at each iteration. M1 confronts that issues by exploiting the initialization of b 0 . The vector b 0 acts as a scaling factor with all its elements equal to 2 −φ , where φ is the number of fractional bits required for the hardware implementation. Therefore the multiplication in (16) is reduced to shift-and-add operations.
Minimizing the latency, M1 chooses k = 1 for (16), which means that the estimation is a result of one iteration only. Therefore the largest eigenvalue of B and consequently the IR estimate deviates from the desired value. The IR resulting from M1 by selecting k = 1 always produce an empty hypersphere. IR is crucial for a sphere decoding method and especially for ISD where it plays a key role in node pruning as already mentioned. So to confront that, attempting to ensure the existence of a lattice point inside the hypersphere, a constant δ is added to the estimation of the largest eigenvalue as , indicates an estimation of λ max for k = 1. Based on the fact that the largest eigenvalues are always close in magnitude for all the examined cases in this work, a possible value of δ is estimated through simulations. Specifically, the estimated value of the greatest eigenvalue after k iterations is derived as λ max = ξ + θ k ξ, where 0 < θ k < 1. This creates an upper bound equal to 2ξ for the value of λ max . Consequently the proposed value of δ, which never leads to an empty hypersphere, is δ = ξ. Furthermore, since B is symmetric, the nominator of ξ can be written as where c * 1i indicates the conjugate of c 1i . Therefore the number of the required multiplications has a significant decrease. The complete modified algorithm is presented in Alg. 2.
2) Based on 1-norm and infinity norm (M2) The second approach estimates the 2-norm distance computing the 1-norm or the infinity norm of the matrix. A similar concept is described in [41], but in this contribution the approximation refers to a matrix. The matrix, whose norm distance is approximated, has not to be square for this approach and therefore according to (13) the IR can be derived either as A 2 or as B . Let Q be defined as a ψ×χ complex matrix, then the 1-norm of Q is written as which is the maximum absolute column sum of the matrix [46]. The infinity norm of Q is defined as which is the maximum absolute row sum of the matrix [46]. The selection between 1-norm and infinity norm but also the dimensions of the preferred matrix for the approximation leads to either less sums of more terms or more sums of less terms. Having many terms at each sum makes the design vulnerable to numerical overflow issues, while in contrast more sums require more parallel arithmetic units. It must be noted that B 1 = B ∞ due to the fact that B is symmetric. Aiming for minimum computational complexity, during the calculation of the magnitude of a complex number, techniques to omit a hardware unit responsible for the computation of the square root are employed. It is achieved using the approximations [41], or The above approximations do not differ to a great extent in complexity, but (22) produces higher values compared to (23). In terms of complexity and regarding the selection between A and B for the IR estimate, the estimation of B presupposes the matrix multiplication of A H A to be implemented. However since B is symmetric, it is not necessary to compute the magnitude for all elements in order to approximate its 2-norm distance.
Due to the approximation concerning the magnitude of a complex number but also because of the computation of 1norm or infinity norm instead of the 2-norm distance of a matrix, the final estimate has accuracy issues. However this divergence for a specific MMIMO scheme is stable regardless of the level of noise, hence it can be treated with proper scaling. Scaling is suggested to take place before 2-norm approximation by appropriately prescaling matrix entries, to prevent word length increase. Selection of a power of two value scaling factor converts this process to shift operations. The divergence of the estimation is not affected neither from M nor N T , provided that N R N T , but depends only on the dimensions of the matrix the 2-norm of which, is approximated. These factors are N and N R when IR computation relies on A, while for the case when estimated as B there is no dependency on N .

B. Evaluation
Both methods concerning the approximation of the 2-norm distance of a matrix lead to an IR estimate, which is equally efficient at node pruning. Let V n denote the mean number of visited nodes per tree searching. Table I shows the effectiveness of the proposed IR and compares ISD with the conventional SD algorithm in terms of visited nodes through detection. Assuming an uncorrelated channel and E b /N 0 = 5 dB, a significant decrease caused by the proposed IR, in conventional SD algorithm, is observed for each examined system compared to the total number of visited nodes when no IR (R 0 = ∞) exists. In addition, the vital contribution of ISD to node reduction is displayed. ISD employs the proposed IR not only for its sphere constraint but also for its additional restriction on partial distance of each node.
Evaluating the proposed 2-norm approximations in order to come up with the most prevalent, other criteria must be also considered. Regarding the latency, none of the approaches lags behind. It is notable that no recursive process is required neither for the approximation of the largest eigenvalue, in contrast with the classic power iteration, nor for the computation of 1-norm or infinity norm of a matrix. Therefore the most suitable method is the one with the minimum computational complexity, since they do not differ in matters of latency and node pruning. Table II summarizes the complexity of each approach. The results emerged considering as N = 256 and N R = 128, while the constellation size M does not have an effect, at this point. Table II states that M2 using matrix A is less demanding as to complexity, so is the one which is proposed for the computation of (12) and is utilized at all simulations of this paper.

V. A ONE-NODE-PER-CYCLE ARCHITECTURE
The number of visited nodes may be the most crucial factor for sphere decoding methods. It determines to a great extent not only the overall complexity but also the latency of the detection process. However, in a hardware implementation, apart from the number of visited nodes, the number of clock cycles required for processing at each node is also very important. In this contribution one clock cycle per node is required, which is optimal in terms of latency. It must be noted that in order to process more than one node in a clock cycle, there must be no data dependency, which does not apply in the proposed depth-first tree traversal. In [47], four nodes are investigated in parallel in a clock cycle, which is explained because it describes a breadth-first searching approach without data dependency between the nodes of the same level.
ISD algorithm concerning tree searching, requires a series of tasks to be executed at each node. The top-level block diagram of the proposed one-node-per-cycle architecture is depicted in Fig 1. Let en denote the enable signal of the component and sel denote the selection signal of the depicted multiplexers, which are both driven by the Finite-State-Machine (FSM) of the design. When en = 1, tree searching takes place and the tree parameters are initialized before the first visited node. The level i and the node enumeration nd, which declares which of M branches is examined, are defined by the initial values i 0 and nd 0 respectively. As R 0 we assign the sphere radius IR, which is computed by the proposed method M 2 using matrix A. Excluding the first node IR n , i n and nd n , produced by this component, are selected as tree parameter inputs. Tree searching occurs N times, thereforeŝ sub denotes one of N columns of ZF estimate. In order to achieve the one-node-percycle processing in this work, all memory resources are fully partitioned. Multiplexing is utilized to deliver all the necessary data to the other subcomponents. In addition, the quotients uij uii in (8) and the value of u 2 ii in (7) are computed before the tree exploration, since Cholesky decomposition has preceded. The total number of the required quotients is N T ×(N T −1) 2 , while the square of real-valued u ii has to be calculated for N T times. The calculation of z i takes place initially and then it is the turn of the computation of d i and the check of the partial distance constraint. Depending on that result, it is determined whether the calculation of D i and the sphere constraint affect the value of pr. The 1-bit signal pr determines whether pruning will be done, while D 1 denotes the accumulated distance from the root up to the leaf node. Both signals contribute to the calculation of outputs. In addition, control mechanisms in order to decide whether the sphere constraint needs to be updated and whether the algorithm has to terminate having found the ML solution are applied and determine the values of the rupd and stop 1bit signals respectively. The stop signal is input of FSM, while depending on the rupd signal it is decided if the temporary path path sl has to be stored. When rupd = 1 the stored path vector is renewed, therefore the last update is the solution path of tree searching.
Apart from the memory partition and the precomputation of some data entries, another reason which makes possible the one-node-per-cycle architecture is the implementation of the circuit that computes z i . Specifically, the implemented loop of (8) has to be fully unrolled. This implementation is presented in Fig. 2. The most demanding processing of onenode-per-cycle component concerns the z i computation, which is directly related to the number of transmit antennas since the unroll factor equals to N T . Loop unrolling contributes to the minimization of latency, but has an impact on area utilization. A rise in transmit antennas provokes increase in required arithmetic units and in critical path of the design, in which extra adders are added. Regarding the selection signals of N T multiplexers, a log 2 N T to N T decoder, having as input the level i, is demanded. When i = N T all selection signals equal to 0, while for i = N T the sel N T is always equal to 1. The rest of selection signals are determined according to the value of i.
In [41], a one-node-per-cycle architecture with variations in the tree searching procedure is also achieved. In this work, one unit specifies the next node regardless of whether the sphere constraint or the additional partial distance restriction of ISD are met. Contrariwise in [41] two units operate in parallel, where the one is utilized for forward node selection and the second one estimates the next node in case of pruning or when a temporary path solution is found. Besides, in [41] the SD algorithm is demonstrated for a limited number of levels assuming as IR the infinity value, so an initial sphere constraint does not exist in order to contribute to node pruning. However as mentioned the IR computation for the large-scale systems we examine is mandatory and especially for this paper, where the IR apart from the conventional sphere confinement, also determines the partial distance restriction of ISD during the entire search.
Throughout the detection process, the tree searching takes place N times, which are not related to each other. For that reason the area required by a one-node-per-cycle component is critical, because the number of instances of this unit lead to implementations featured with higher detection throughput rates. The reduction in the word length concerning the data representation, caused by the scaling of all the tree distances, provides a decrease in the required area of one-node-per-cycle component thus allowing more replications, compared to [16]. The results of [16] deal with the calculations that are responsible for the assessment of the next level and the next node. In this work a complete version of the one-node-per-cycle component is described, since in addition to these, mechanisms that consider either a possible radius update or a possible termination of the search have been added. This causes an increase in the critical path of the design, maintaining the one-node-per-cycle architecture though. The computational complexity of the one-node-per-cycle component is presented in Table II. Timing and area reports but also trade-offs between area and detection throughput rate are analyzed at Section VI for both FPGA and ASIC implementations.

VI. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, the proposed ISD and its variations are evaluated, analyzing the effect of spatial correlation. As usual in systems where N R N T , the results are subject to comparison with near-optimal LD. The linear detector chosen for comparisons is ZF, because regarding the examined systems of this work has a similar performance with MMSE and also the low-complexity MF can not be applied. The assumed MMIMO and high-dimensional MIMO schemes, for both SU and MU scenarios, are only those that have the potential for practical implementation. More transmit antennas or more dense modulations lead to systems that are meaningless due to high computational complexity and latency.

A. BER performance and visited nodes
The advantage of SD is that it yields the ML solution and thus surpasses the LD in terms of BER performance. The mean number of visited nodes during the detection process determines whether the decoder is attainable and provided that, what SNR value is required. Spatially correlated channels adversely affect both BER performance and visited nodes. Apart from this paper, in [5] it is also observed that the square values of the diagonal elements of the matrix resulting from the Cholesky decomposition decrease as the correlation level increases. The value of u 2 ii is a crucial factor for a sphere decoding method, since it determines the number of candidate symbols per tree level [5]. Therefore in highly correlated systems, the total paths to be examined and consequently visited nodes increase. However the additional constraint of ISD about partial distance of each node reduces to a great extent this problem. Furthermore, as mentioned correlation causes BER degradation at LD [29]. As a result ZF estimate, which is utilized by ISD both at IR computation and at detection process, is less efficient compared to the uncorrelated scenario. It must be supplemented that the impact on IR selection is significantly important for ISD because besides the initial sphere constraint, the upper bound of d i is defined through it.
The simulation results are based on the system presented in Section II, where, depending on whether it concerns the SU or the MU scenario, the corresponding correlated channel model is employed. For the uncorrelated case, a Rayleigh fading channel is assumed, while for all simulations the IR is computed according to M2 using matrix A, the number of subcarriers equals to N = 256 and the total number of processed M -QAM symbols equals to N T × N . Fig. 3 depicts the BER performance of ISD, compared to ZF, under correlation on both sides for 128 × 16 SU-MIMO and 16-QAM modulation scheme. It is observed that raising the level of correlation both at transmitter and receiver side, the performance of ISD worsens taking into account the uncorrelated scenario but the decoding gain against the linear ZF detector increases. In [5], [10] BER degradation at SD due to spatial correlation is also noticed. For the same SU-MIMO system, Fig. 4 shows the mean number of visited nodes of ISD throughout the detection for various correlation values. As correlation increases, ISD examines more nodes in order to find the ML solution vector. However, regardless of the correlation level ISD achieves its lower bound, which equals to N T × M visited nodes. In highly correlated conditions, a larger E b /N 0 value is demanded though. In [3], it is mentioned that the required operations and in general the complexity of SD decreases provided a high SNR value. This is also evident in Fig. 4, where fewer visited nodes in high SNR regions are translated into fewer arithmetic operations. In Figs. 5 and 6, the effect of one-side correlation is examined. It is inferred from both figures that transmit correlation affects the performance of ISD more than the case when only correlation at the receiver is assumed. This outcome is expected, since as mentioned in [17], [18] the correlation on the side equipped with less antennas, which in this work refers to the transmitter, is more devastating. It goes without saying that the correlation on both sides is worse than one-side correlation. A summary of the measurements for all the examined highdimensional MIMO schemes of this contribution is presented in Tables III to VI. Modifying the coefficients ρ τ and ρ ρ , not only one-side but also correlation at both sides are investigated. The decoding gain compared to ZF detector and the required E b /N 0 value in order ISD to be feasible for implementation for each test condition are displayed. Furthermore, Tables III to VI reveal the number of integral and fractional bits (int.frac) needed to represent data throughout the IR computation and detection process. In this paper, the criterion which determines when the ISD is suitable for implementation is the upper bound of 1024 visited nodes. Computational complexity and latency are the reasons for establishing this threshold. Therefore, for the depicted E b /N 0 value it is implied that ISD examines a maximum of 1024 nodes. It must be noted that the lower bound of ISD for N T = 16 and M = 64 equals to the threshold.
According to Tables III to VI, the decoding gain of ISD against the linear detector ZF, which as already mentioned, grows as the correlation level increases, can reach up to 13 dB in extremely highly correlated scenarios. Correlation causes BER degradation both at ZF and ISD but the impact is greater on LD. Assuming an uncorrelated channel, the decoding gain ranges between 0.5 and 1.25 dB therefore the benefit of ISD in spatially correlated channels is evident. However in a highly correlated environment, a larger SNR value is necessary hence the benefit from BER performance may be expensive. The decoding gain of ISD when no correlation exists is in agreement with [48], which analyzes the BER performance of ML detector as opposed to ZF. So it is concluded that in the absence of correlation ISD achieves ML performance without lurking an error probability at decoding. On the contrary, there is a low probability of error in ISD when correlation is added to the channel, especially for denser constellations, but the decoding gain remains high. This can be observed by comparing the decoding gain depicted in the Tables III to VI for the same correlation level. What one would expect would be the increase in decoding gain as M grows and the largest gain against ZF detector to be achieved for the case of 32 transmit antennas. However in some cases, only when correlation exists, this is not observed due to the additional partial distance constraint of ISD, which by dropping a node does not ensure that the ML solution path is not in the discarded ones and originated from that node. The reason why the highest decoding gain is expected in a system equipped with 32 transmit antennas is because when the ratio N R N T falls, LD become more inefficient [23]. Furthermore, the results in Tables III to VI show that the transmit-side correlation leads to higher decoding gain and SNR requirements in comparison with the case assuming correlation at the receiver only, confirming what was mentioned before; i.e., correlation is more unfavorable on the side of the few antennas.
Besides, Tables III to VI reveal the word length employed for implementation. ISD requires as many bits as ZF for the tree searching process, while IR computation requires additional fractional bits. The accuracy of the IR estimate is crucial for node pruning, because it is utilized for both the sphere constraint and the partial distance restriction of ISD. The number of fractional bits is determined by the BER performance, while the integral bits are selected so as to avoid overflow issues. Further reduction of fractional bits reduces word length but causes BER degradation and increases the number of visited nodes. Scaling at all the tree distances prevents an increase in the required word length in contrast  with [16]. In addition, it is observed that a rise of correlation increases the word length required not only for the detection but also for the IR estimate.
In Figs. 7 and 8, the proposed variations of ISD algorithm are examined concerning the BER performance and visited nodes. The results relate to a 128 × 16 SU-MIMO system featured with 16-QAM modulation scheme, assuming ρ τ = ρ ρ = 0.3. LRU-ISD is tested for different values of permitted radius updates. A loss in BER performance is expected, therefore both LRU-ISD and BC-ISD make sense when the decoding gain of ISD compared to ZF is adequate. The concept is to achieve a compromise between visited nodes and BER performance. Figs. 7 and 8 manifest in which E b /N 0 value the proposed trade-offs are effective. The visited nodes should not exceed the threshold set in this work and a lower BER than ZF must be achieved simultaneously. In high noise conditions, it is observed that LRU-ISD does not perform as expected. This is explained by the fact that ISD needs to update its sphere constraint several times in these SNR regions. However, performance of LRU-ISD changes at higher E b /N 0 , where radius updates are limited due to the efficiency of ZF which in turn leads to a more appropriate IR selection. It must  be noticed that as L increases, the performance of LRU-ISD improves but the mean number of visited nodes increases. Regarding BC-ISD, the finite number of radius renewals in high SNR regions rises the likelihood of not including the ML solution path. It is inferred from Fig. 7 that BC-ISD is more efficient than LRU-ISD at high noise levels. In addition, Fig. 8 shows that each variation achieves its lower bound for visited nodes. It equals to N T ×M 2 for LRU-ISD featured with L = 1, while for L > 1 is N T × M as ISD. As expected, BC-ISD always fluctuates between ISD and LRU-ISD with L = 1 in terms of visited nodes.
As already mentioned, it is generally observed due to the random distribution of users that MU systems are less prone to spatial correlation than SU systems. Single-antenna users are assumed in this paper. The nominal angles φ u are generated randomly for all the users and the obtained results are the average of five cases. A similar method is utilized in [39], where the final results are obtained as the mean value of 10 user drops. The standard deviation of the multipath angle distribution σ φ is considered equal to 10°, as recommended in [40]. The same assumption is made in [39]. Figs. 9 and 10 concern a 128 ×16 MU-MIMO system featured with M = 16 and reveal the BER performance and the visited nodes of ISD under correlation. Both 60°and 120°cell sectors are examined. As expected and shown in the results, a smaller angle range leads to more correlated propagation paths, thus causing a BER degradation and an increase in the mean number of visited nodes throughout the detection process. Table VII depicts, for the examined systems and for both cell sectors, the decoding gain of ISD compared to ZF. In addition, the minimum E b /N 0 value needed in order ISD to be feasible for implementation and the required number of bits are presented. For each MMIMO scheme, the E b /N 0 value and the word length are identified for both cell sectors. The threshold of 1024 visited nodes applies as in SU system and N R = 128 for all the examined cases. As opposed to SU-MIMO, it is observed less decoding gain and lower requirements in terms of word length and SNR. In the case of an 120°cell sector the results are similar with the uncorrelated scenario, while for 60°the decoding gain notes a slight increase. It means that the impact of correlation on MU networks is not so great. However the assumptions of the employed channel model play a key role in this, since scatterers are located around the N T users. Therefore transmit correlation, which is detrimental when N R N T , is not assumed in this concept. According to Table VII the decoding gain compared to ZF in MU-MIMO reaches up to 2.15 dB, but in [33] expectation-propagation detection is examined in real propagation conditions for NLOS In comparison with SD, expectation-propagation detection lags behind in the matter of BER performance [49], therefore ISD may demonstrate a higher decoding gain in real measurements.

B. Computational complexity
The reason why LD are widely used both in highdimensional MIMO and MMIMO systems, except that they have a quite sufficient performance, is their low-complexity. ISD is compared with ZF in this work, whose estimate is utilized for the initialization of the sphere radius and is reused for the tree exploration. Apart from ZF estimate, the Cholesky decomposition of matrix H H H and the IR computation are required before the execution of tree searching. The IR is estimated once for the N T ×N M -QAM symbols by M2 using matrix A and the computational complexity of this method is depicted in Table II. The complexity of ZF is also in Table II, where for the required matrix inversion work from [50] is considered. The IR component and the Cholesky decomposition have a combined complexity practically equivalent with ZF detector. Therefore, since the complexity of ZF is taken into account in the calculation of overall complexity, ISD has almost double complexity compared to ZF without the indispensable computations of tree searching for the N T × N transmitted symbols. This explains why node pruning is critical not only for the latency of the detection process but also for the computational complexity. Table II presents the arithmetic operations, which take place at each node. These are repeated for N × V n times. Table VIII demonstrates the computational complexity of the proposed ISD for the examined schemes, contingent upon the mean number of visited nodes per tree search. More specifically, it shows how many times its complexity is greater than that of ZF. For the results, N = 256 is assumed. The complexity of ISD either concerns SU-MIMO or MU-MIMO system, is presented in Table VIII and depending on the correlation level a suitable E b /N 0 value is required to achieve the displayed bound of V n . Having in mind the overall complexity and the latency of the implementation, the E b /N 0 values depicted in Tables III to VII have been determined for each case. A lower E b /N 0 than the one proposed means more visited nodes during the detection and therefore greater complexity. The computational complexity of the one-nodeper-cycle component shown in Table II, does not differ significantly for LRU-ISD and BC-ISD, hence their complexity is modified only with respect to V n . Table VIII reveals the overall complexity considering the defined threshold of V n = 1024, the lower bound of ISD as well as the bound of LRU-ISD featured with L = 1, which is the minimum number of visited nodes achieved in this paper. BC-ISD examines a number of candidate symbols between those of ISD and LRU-ISD with L = 1, therefore its complexity is intermediate. It must be supplemented, that the total arithmetic operations to be executed are not related to the number of instances of one-node-per-cycle components. The number of instances affects the area and the detection throughput rate, which are discussed below. From Table VIII it is inferred that the higher complexity is observed for M = 64, despite the fact that the examined scheme featured with N T = 32 consists of a tree representation with more nodes as described in Table I. Therefore it is concluded that a high number of tree levels may be more preferable for ISD than extremely dense constellations.

C. Detection throughput rate
Detection throughput rate is also a crucial factor of implementation. Having reduced the mean number of visited nodes and achieved the minimum processing time per node, throughput rate of tree searching process for both FPGA and ASIC implementation is presented here for ISD and its variations. FPGA implementations have been mapped to a Virtex-7 VT690T device. The ASIC results are obtained by synthesis using Mentor Graphics' Catapult as a front-end highlevel synthesis tool and Cadence Genus and Innovus as backend tools, targeting a 28-nm standard-cell library.
Tables IX and X depict the maximum frequency (f max ) and the area report of the one-node-per-cycle component for the FPGA and the ASIC implementation respectively. Trade-offs between area and throughput rate arise by replicating onenode-per-cycle components. Aiming at parallel detection area utilization scores of Table IX, which are based on FPGA resources, determine the maximum number of components. Regarding the ASIC implementation, the cost increases as the design enlarges, effectively imposing area limits. Let α denote the number of components. The detection throughput rate is also displayed in Tables IX and X and is computed in terms of parameters α and V n . Replacing the value of V n according to the respective E b /N 0 , the detection throughput rate can be computed for both ISD and its variations, since the one-nodeper-cycle component for all variations has almost the same computational complexity and therefore the same upper bound of α is imposed. The detection rate is maximized for ISD and its variations when α receives its highest value and V n equals to the lower bound of each proposed decoder. The decoded bits per tree searching for each examined scheme equals to N T × log 2 M .
Tables IX and X concern all the considered systems of this paper for both SU and MU scenario. Since uncorrelated and MU correlated case require exactly the same word length for detection, the results of the tables that assume an uncorrelated channel refer to both instances. For the SU systems, except for the uncorrelated case, spatial channel correlation featured with ρ τ = ρ ρ = 0.3 is examined. It is inferred from Table IX that as M rises, the one-node-per-cycle component becomes more demanding in terms of memory resources. The same holds for increasing N T . In this case, a rise in DSP48 units is also observed, which are critical for the determination of the upper bound of α considering the resources of the selected board. Similarly in Table X the required area increases. In addition, in both implementations the scheme equipped with more transmit antennas causes the greatest decrease in the f max of the design. The results for the case with N T = 32 are reasonable considering Fig. 2, where for the computation of (8) full loop unrolling is implemented with an unroll factor equal to N T . This justifies the sharp increase both in DSP48 units of FPGA implementation and critical path, since more arithmetic operations take place at one clock cycle. Apart from M and N T , the word length at each examined scheme affect the f max and the area utilization.
Although the increase in transmit antennas is more detrimental concerning the critical path and the required area of the design, the lowest detection throughput rate is observed for the examined case featured with M = 64. If we make some comparisons related to FPGA implementations of ISD, where similar conclusions emerge also for the ASIC implementations, the maximum throughput rate for N T = 16 and M = 64 is 278.75 Mbps compared to 338.71 Mbps for N T = 32 and M = 16 assuming uncorrelated or correlated MU channel. For the correlated SU scenario when ρ τ = ρ ρ = 0.3 is 122.48 Mbps and 146.44 Mbps respectively. It is explained due to fewer visited nodes in the case of N T = 32, despite the fact that the system with M = 64 has the capability of using more duplications. So it is proved once again how important the mean number of visited nodes is for a sphere decoding method. It must be noticed that a more complex modulation scheme than M = 16 in combination with N T = 32 would not be so effective. The throughput rate of ISD and its variations is maximized for N T = M = 16, where ISD reaches up to 768.89 Mbps and LRU-ISD featured with L = 1 up to 1537.77 Mbps regarding the FPGA implementation. These results are obtained for the maximum allowable α and the minimum V n for each case. On the contrary, 7088.33 Mbps and 14176.66 Mbps are the maximum values respectively for the detection rate of ASIC implementation, assuming the same number of duplications as in FPGA implementation. As expected, ASIC implementations may support high throughput rates. The effect of correlation leads to a reduction of the maximum throughput rate. In this paper, this fact is observed only at SU-MIMO systems where the impact of correlation is greater but it would be applied also for the MU-MIMO if the assumptions of the employed correlated channel model were different. However higher E b /N 0 value compared to the uncorrelated case is required for both SU and MU systems in order to achieve the maximum detection rate.
Taking into account the results of [16], there is a reduction in the required area in this contribution due to the decrease in the word length thanks to employed scaling at all the tree distances, as described in Section V. However, the critical path increases because apart from the operations of [16] additional functions that examine when the searching terminates and when the IR has to be updated are introduced in this work. Furthermore, the achieved detection throughput rate is higher compared to [16], in spite of the deflation of the f max of the design. This is justified as the decrease in area utilization allows more one-node-per-cycle components to be implemented.

VII. CONCLUSION
In large-scale systems, a sphere decoding method is considered as prohibitive due to high computational complexity and latency, but the results of this paper disrupt conventional thinking and show that there may be a future in certain high-dimensional MIMO or MMIMO schemes. The proposed optimizations render ISD feasible for implementation, even if there is correlation, for both SU and MU scenarios. As the level of correlation rises the decoding gain compared to linear detectors enlarges, reaching up to 13 dB in highly correlated conditions, requiring though a larger value of E b /N 0 . Variations of ISD algorithm are also presented leading to trade-offs between BER performance and computational complexity. The minimum complexity is achieved by LRU-ISD featured with L = 1 and equals to 2.71 times the complexity of ZF. In addition, one-side correlation is investigated for SU-MIMO schemes, confirming that the correlation on the side equipped with less antennas is more detrimental. Furthermore, the large gains observed in SU compared to MU systems are due to the greater effect of correlation, which is partly justified in the assumptions of the employed channel for the MU case. Trade-offs between area and detection throughput rate are also presented for both FPGA and ASIC implementations reaching up to 1537.77 Mbps and 14176.66 Mbps respectively.