A Design of Fixed-Complexity Sphere Decoder Combined With Interference Mitigation Algorithm for Downlink MU-MIMO Systems

In this paper, the design and implementation of a fixed-complexity sphere decoder (FSD) combined with an interference mitigation algorithm for downlink (DL) multiuser (MU) multiple-input multiple-output (MIMO) systems are presented. To overcome the performance degradation caused by inter-user interference in DL MU-MIMO systems, a hardware-friendly, optimized interference mitigation algorithm is proposed. The proposed algorithm comes within 0.5 dB of near-optimal performance, which is 4 dB better than the existing interference whitening filter for the soft-output decision. Both hard-output and soft-output FSD detectors are implemented with the proposed interference mitigation algorithm to perform $4\times 4$ MIMO detection and interference mitigation for four additional streams. A fully pipelined architecture is employed to support 3.6 Gbps at 150 MHz for 64-QAM modulated signals. According to the synthesis results, the designed symbol detector has a gate count of 887 K for the FSD without the pre-processor and a gate count of 1220 K for interference mitigation. Compared with previously implemented MIMO detectors, the proposed design is a promising architecture for DL MU-MIMO systems: it significantly improves performance under inter-user interference and remains compatible with existing hardware while maintaining high throughput within an acceptable gate count.


I. INTRODUCTION
In wireless communications, increasing transmission speed in a limited frequency band has become progressively more important with the spread of broadband access services utilized by various handheld devices such as smart-phones, tablets, and laptops [1], [2]. Higher transmission rates have been achieved by various technologies, and studies to find methods to improve transmission rates are ongoing. For decades, multiple-input multiple-output (MIMO) communication has been considered a key technology for achieving high data rates [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Gian Domenico Licciardo.
Recently, multiuser (MU)-MIMO antenna technology has been studied and applied to many consumer products. For such products, the representative standard is IEEE 802.11ac, which is a widely used wireless local area network standard in MU-MIMO downlink (DL) systems [5]. Standardization trends, such as IEEE 802.11n, have attracted considerable attention, and IEEE 802.11ac aims at a transmission speed of 1 Gbps or more [6]. For IEEE 802.11ac, DL MU-MIMO transmission technology in the 5 GHz band has been proposed [7]. The maximum transfer rate of the physical layer is approximately 7 Gbps with this technology.
Unlike conventional MIMO systems, in MU-MIMO systems, multiple users can communicate simultaneously using a single access point (AP). In other words, in a DL environment, a signal sent from the AP to a given user also reaches other users [8]. These signals are referred to as interfering signals, and users receive both the desired signal and the interfering signals simultaneously [9]. Consequently, performance is reduced by these interfering signals. Therefore, most DL MU-MIMO systems attempt to solve the interference problem by employing precoding technology in the AP [10], [11], [12], [13]. In precoding, the AP multiplies the transmitted signal by a precoding matrix in advance such that users are not affected by the interfering signals [10], [14], [15]. For example, in the representative zero-forcing beamforming technique, the interfering signal can theoretically be made zero, so that each user and the AP form a separate MIMO system and the entire MU-MIMO system is divided into independent MIMO links [14]. However, in real-world environments, errors, referred to as practical errors, inevitably occur due to imperfect channel estimation and incomplete calculation [16]. These practical errors result in an imperfect precoding matrix; thus, some users will still receive interfering signals [17]. Compared to the desired signal, the interfering signal is relatively small; however, it is large enough to affect system performance [18]. Therefore, an interference-aware receiver is required.
Various approaches to mitigate inter-user interference for DL MU-MIMO systems have been proposed. These methods can be categorized into two representative approaches: interference whitening (IW) and interference detection (ID) [19]. IW simply treats interference as Gaussian noise [20]; thus, it is less computationally complex than ID; however, it is unsatisfactory for strong interference. The IW approach has been studied in many interference rejection combiners and their extensions [22], [23], [24], [25], [26]. However, they still cannot achieve high performance with strong interference. In contrast, ID detects both desired signals and interference signals to mitigate interference [21]. ID can achieve high performance (regarded as optimal); however, it suffers from high computational complexity. Several methods have been proposed to reduce this complexity using existing MIMO decoder algorithms [27], [28], [29], [30], [31]. These methods reduce complexity compared to the original ID, but remain much more complex than IW. Thus, ID is not considered for practical implementation. As mentioned previously, the performance of IW is not satisfactory under strong interference conditions, which is an issue for current and future DL MU-MIMO systems.
In this paper, we propose a symbol detector architecture with an interference mitigation algorithm suitable for DL MU-MIMO systems to realize a suitable hardware implementation. To date, many hardware implementation studies related to the MIMO detection algorithms have been published [32], [33], [34], [35], [36], [37]. However, no studies have attempted to develop implementations for DL MU-MIMO systems despite their widespread usage. Therefore, we focus on improving existing MIMO detectors for DL MU-MIMO systems by combining them with an interference mitigation algorithm. In this process, the target is to minimize cost relative to compatibility with existing MIMO detectors, which we expect to simplify the adaptation of the MIMO detectors to DL MU-MIMO systems. The primary contributions of this paper are summarized as follows.
• An improved interference mitigation algorithm is proposed. The proposed algorithm can achieve better performance than existing interference mitigation algorithms in terms of bit error rate (BER) and computational complexity. Furthermore, the proposed algorithm is suitable for a hardware implementation, particularly with tree-search based MIMO detectors.
• An efficient method to combine the proposed interference mitigation algorithm and a fixed-complexity sphere decoder (FSD) algorithm is presented. The proposed interference mitigation algorithm is partially calculated according to the successive architecture of an FSD algorithm [38], [39].
• For high throughput, a fully-pipelined hardware architecture design is presented. The implemented design can support 3.6 Gbps at 150 MHz for 8 × [4, 4] with 64-QAM modulated signals in DL MU-MIMO systems.
The remainder of this paper is organized as follows. Section II reviews MIMO and MU-MIMO system models and briefly discusses MIMO detection and interference mitigation algorithms. Section III describes the proposed interference mitigation algorithm and its implementation. Performance and computational complexity comparisons with other algorithms are presented in Section IV. Section V describes the hardware design and implementation of the FSD with the proposed algorithm. Implementation results are compared to previously implemented MIMO detectors in Section VI, and conclusions are given in Section VII.

II. SYSTEM MODEL
This section introduces a basic review of MIMO and DL MU-MIMO systems. First, a mathematical description and MIMO detection algorithms for MIMO systems are presented. Then, the mathematical description and interference mitigation algorithms for DL MU-MIMO systems are discussed.

A. MIMO SYSTEM

1) MATHEMATICAL DESCRIPTION
We consider a MIMO system with $M$ transmit and $N$ receive antennas [4]. Here, $\mathbf{H}$ denotes an $N \times M$ channel matrix. For simplicity, symbol time in packet transmission is ignored. $x_m$ denotes the symbol transmitted from the $m$-th transmit antenna and $y_n$ denotes the signal received at the $n$-th receive antenna. Then, the signal vector received over the MIMO channel is given by
$$\mathbf{y} = \mathbf{H}\mathbf{x} + \mathbf{n} \tag{1}$$
where $\mathbf{y} = [y_1, y_2, \cdots, y_N]^T$, $\mathbf{x} = [x_1, x_2, \cdots, x_M]^T$, and $\mathbf{n} = [n_1, n_2, \cdots, n_N]^T$ denote the received signal vector, the transmitted signal vector, and the noise vector, respectively. $\mathbf{n}$ is an independent and identically distributed (i.i.d.) complex additive white Gaussian noise vector with zero mean and variance $\sigma^2$. Generally, MIMO channel capacity increases linearly with $\min(M, N)$. Note that controlling the trade-off between receive diversity and multiplexing gain is a fundamental problem in MIMO transmission.

2) MIMO DETECTION
A maximum likelihood (ML) detector offers the best performance; however, it is unsuitable for hardware implementation due to its exponential complexity. This problem has led to the development of tree-search based detectors. The FSD algorithm provides near-ML performance in a predefined tree architecture with fixed complexity [38]. The FSD can be considered a hybrid model that combines a full extension (FE) stage to perform ML detection and a single extension (SE) stage to perform linear detection, as shown in Fig. 1. Applying the QR decomposition $\mathbf{H} = \mathbf{Q}\mathbf{R}$ to the received signal gives
$$\mathbf{z} = \mathbf{R}\mathbf{x} + \mathbf{v} \tag{2}$$
where $\mathbf{z} = \mathbf{Q}^H\mathbf{y}$ and $\mathbf{v} = \mathbf{Q}^H\mathbf{n}$. For the first $p$ levels, all possible branches are extended. For the remaining $M - p$ levels, a single branch determined by successive interference cancellation (SIC) on each node is extended. These processes are defined by the FE and SE stages. The FSD generates a candidate list expressed as follows
$$\mathcal{S} = \{\hat{\mathbf{x}}_k \mid k = 1, \cdots, |\Omega|^p\} \tag{3}$$
where $\hat{\mathbf{x}}_k = [\hat{x}_{k,M}, \cdots, \hat{x}_{k,1}]^T$ and $\Omega$ denotes the constellation set. Here, the $i$-th level in the SE stage is expressed as
$$\hat{x}_{k,i} = Q\left(\frac{z_i - \sum_{j=i+1}^{M} r_{i,j}\hat{x}_{k,j}}{r_{i,i}}\right) \tag{4}$$
where $Q(\cdot)$ denotes a slicing operation and $r_{i,j}$ is the $(i,j)$-th element of $\mathbf{R}$. The FSD solution is given by
$$\hat{\mathbf{x}} = \arg\min_{\hat{\mathbf{x}}_k \in \mathcal{S}} \|\mathbf{z} - \mathbf{R}\hat{\mathbf{x}}_k\|^2 \tag{5}$$
To extend the hard-output FSD to a soft-output FSD (SFSD), a list-extended SD (LSD) algorithm has been proposed, as shown in Fig. 1 (b), such that additional tree searching is performed for the $L$ candidates with the shortest Euclidean distance (ED) in the candidate list [39]. For additional tree searching, only node expansion for the opposite bit in the SE stage is required. For example, for 16-QAM modulated signals from four transmit antennas, if FE = 1 and $L = 2$, the number of additional visited nodes for list expansion can be calculated as 2 × 4 × (3 + 2 + 1) = 48.
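The FE and SE stages described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the implemented detector: it is real-valued, takes an already computed upper triangular `R` and rotated observation `z`, and the names `fsd_detect` and `slice_symbol` are hypothetical.

```python
from itertools import product


def slice_symbol(value, constellation):
    """Slicing operation Q(.): nearest constellation point."""
    return min(constellation, key=lambda s: abs(value - s))


def fsd_detect(z, R, constellation, p=1):
    """Fixed-complexity tree search on z = R x + v (R upper triangular).

    The top p levels (streams M-1 ... M-p) are fully extended (FE stage);
    each remaining level extends a single branch chosen by SIC plus
    slicing (SE stage). Returns the best vector and all (ED, vector) pairs.
    """
    M = len(z)
    candidates = []
    for fe_symbols in product(constellation, repeat=p):
        x = [0.0] * M
        # FE stage: try every symbol combination on the top p levels.
        for k, s in enumerate(fe_symbols):
            x[M - 1 - k] = s
        # SE stage: cancel already detected streams, then slice.
        for i in range(M - p - 1, -1, -1):
            cancelled = sum(R[i][j] * x[j] for j in range(i + 1, M))
            x[i] = slice_symbol((z[i] - cancelled) / R[i][i], constellation)
        # Euclidean distance ||z - R x||^2 of this candidate.
        ed = sum(
            (z[i] - sum(R[i][j] * x[j] for j in range(i, M))) ** 2
            for i in range(M)
        )
        candidates.append((ed, x))
    best = min(candidates, key=lambda c: c[0])[1]
    return best, candidates
```

The number of candidates is fixed at the constellation size raised to the power `p`, regardless of the noise level, which is what makes the search suitable for pipelining.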
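Although the FSD can achieve near-ML performance and is suitable for a fully-pipelined hardware implementation with high throughput, it can have significant computational complexity depending on the total number of levels, and it is particularly sensitive to the number of FE levels.

B. DL MU-MIMO SYSTEM

1) MATHEMATICAL DESCRIPTION
Here, a DL MU-MIMO system with a single AP with $M$ antennas for each user and $K$ users, each having $N$ antennas, is considered, as shown in Fig. 2. The received signal vector for the $i$-th user, $\mathbf{y}_i$, of size $N \times 1$, can be expressed as
$$\mathbf{y}_i = \mathbf{H}_i\mathbf{V}_i\mathbf{x}_i + \sum_{j \neq i}\mathbf{H}_i\mathbf{V}_j\mathbf{x}_j + \mathbf{n}_i \tag{6}$$
where $\mathbf{n}_i$ is an i.i.d. complex zero-mean Gaussian noise vector with variance $\sigma^2$, $\mathbf{H}_i$ is the $N \times M$ DL channel matrix corresponding to user $i$, $\mathbf{V}_i$ is the precoding matrix, and $\mathbf{x}_i$ is the transmitted data vector. Note that the transmit power is not considered for simplicity. In this paper, a block diagonal precoding matrix design algorithm is assumed [14]. This scheme employs a precoding matrix that satisfies $\mathbf{H}_i\mathbf{V}_j = \mathbf{0}$ for $i \neq j$. Note that, under ideal conditions, the interference signal in (6) can be eliminated. However, in practice, the precoding matrix inevitably includes errors that occur due to channel estimation, feedback bit quantization, and subcarrier grouping. Such practical errors cannot be estimated; thus, an imperfect precoding matrix $\tilde{\mathbf{V}}_i$ can be represented by separating the error matrix as follows
$$\tilde{\mathbf{V}}_i = \mathbf{V}_i + \mathbf{E}_i \tag{7}$$
where $\mathbf{E}_i$ denotes the practical error matrix and $\mathbf{V}_i$ denotes the ideal precoding matrix satisfying $\mathbf{H}_i\mathbf{V}_j = \mathbf{0}$ for $i \neq j$. With these terms, (6) can be represented in the following simplified form:
$$\mathbf{y}_i = \mathbf{H}_D\mathbf{x}_D + \mathbf{H}_I\mathbf{x}_I + \mathbf{n}_i \tag{8}$$
where $\mathbf{H}_D = \mathbf{H}_i\tilde{\mathbf{V}}_i$ and $\mathbf{x}_D = \mathbf{x}_i$ denote the effective channel and data vector of the desired signal, and $\mathbf{H}_I$ and $\mathbf{x}_I$ collect the residual interference terms $\mathbf{H}_i\mathbf{E}_j$ and $\mathbf{x}_j$ for $j \neq i$.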

2) INTERFERENCE MITIGATION
As mentioned previously, interference mitigation algorithms can be categorized into two representative approaches [19], i.e., IW and ID. The straightforward approach to inter-user interference is to treat the interference as Gaussian noise [20]. To whiten the effective noise, which is assumed to be colored Gaussian noise, the received signal is multiplied by a whitening filter as follows
$$\mathbf{W}\mathbf{y}_i = \mathbf{W}\mathbf{H}_D\mathbf{x}_D + \mathbf{W}(\mathbf{H}_I\mathbf{x}_I + \mathbf{n}_i) \tag{9}$$
where the whitening filter $\mathbf{W}$ is obtained from the covariance matrix of the effective noise $\mathbf{H}_I\mathbf{x}_I + \mathbf{n}_i$. Note that $\mathbf{W}$ is derived using Cholesky decomposition. The complexity of the IW approach is relatively small because it is a linear filter. Another approach to mitigating interference is to detect the interference signal together with the desired signal [21]. The ML solution of ID using the joint ED can be defined as
$$(\hat{\mathbf{x}}_D, \hat{\mathbf{x}}_I) = \arg\min_{\mathbf{x}_D, \mathbf{x}_I} \|\mathbf{y}_i - \mathbf{H}_D\mathbf{x}_D - \mathbf{H}_I\mathbf{x}_I\|^2 \tag{11}$$
In addition, the FSD application of ID can be performed using a rank-deficient SD technique [28]. However, because the ID approach detects interference symbols directly, complexity increases significantly as the number of interference signals increases. Furthermore, existing MIMO detectors must be expanded to detect more streams, which impacts compatibility with existing designs.
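As a brief worked derivation of why a Cholesky-based filter whitens the effective noise, the following sketch may help. It assumes unit-power, uncorrelated interference symbols that are independent of the noise, and it uses $\boldsymbol{\Sigma}$ and $\mathbf{L}$ as illustrative symbols that are not taken from the original equations; the exact form used in the paper may differ.

```latex
\begin{aligned}
\boldsymbol{\Sigma}
  &= \mathbb{E}\!\left[(\mathbf{H}_I\mathbf{x}_I + \mathbf{n}_i)
       (\mathbf{H}_I\mathbf{x}_I + \mathbf{n}_i)^H\right]
   = \mathbf{H}_I\mathbf{H}_I^H + \sigma^2\mathbf{I}, \\
\boldsymbol{\Sigma}
  &= \mathbf{L}\mathbf{L}^H, \qquad \mathbf{W} = \mathbf{L}^{-1}, \\
\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^H
  &= \mathbf{L}^{-1}\mathbf{L}\mathbf{L}^H\mathbf{L}^{-H} = \mathbf{I}.
\end{aligned}
```

Under these assumptions, the filtered term $\mathbf{W}(\mathbf{H}_I\mathbf{x}_I + \mathbf{n}_i)$ has an identity covariance matrix, so a detector designed for white noise can be applied to the filtered signal unchanged.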

III. PROPOSED ALGORITHM
This section introduces the proposed interference mitigation algorithm. First, the proposed algorithm is derived in consideration of the trade-off between computational complexity and performance. Then, a method to implement the proposed interference mitigation algorithm with the FSD algorithm is presented.

A. ALGORITHM DESCRIPTION
The goal of the proposed algorithm is to mitigate interference efficiently, ensuring high performance under interference conditions without significantly increasing complexity compared to existing MIMO detectors. The proposed algorithm includes both IW and ID methods because the IW method alone does not achieve sufficiently high performance under strong interference conditions. The IW method involves a filtering process to whiten interference signals, whereas the ID method detects interference signals directly, in the same way as desired signals. Therefore, the key idea is to minimize the influence of the interference signal when detecting the desired signal using IW as a preprocessing step, and to detect the interference signal only for the candidate desired symbol vectors detected by the FSD. The IW filter makes the effective channel matrix orthogonal, i.e., $(\mathbf{W}\mathbf{H}_I)^H(\mathbf{W}\mathbf{H}_I) \approx \mathbf{I}$; thus, linear detection using a zero-forcing (ZF) detector is sufficient for adequate performance. Note that the effective channel matrix is not perfectly orthogonal in practice. However, the interference signals are detected to eliminate the effect of interference on the desired signals rather than to decode data bits from the detected interference symbols. In other words, reducing complexity and simplifying the architecture with the ZF detector matters more than detection accuracy in an interference mitigation algorithm.
The overall flow of the proposed algorithm is divided into three stages.
1) IW. Prior to detecting the desired signal with the FSD, an IW filter is applied to minimize the effect of the interference signal. Here, the interference signal is treated as whitened noise, so that it does not have a colored effect on each stream when the desired signal is detected.
2) FSD. A tree-search detector used in existing MIMO systems is utilized to detect the desired signal. Note that we assume an FSD algorithm for the hard-output detector and an SFSD algorithm for the soft-output detector.
3) IC. The interference signal vector is detected based on the desired signal candidate vector obtained by the FSD, and the final output is generated. If a hard-output detector is assumed, the interference signal is calculated according to each candidate signal vector, and the final ED is recalculated to select the desired signal candidate vector that minimizes this value. With the soft-output detector, the ED, including the interference signal according to each desired signal candidate vector, is recalculated to generate the LLR value per bit.

B. IMPLEMENTATION METHOD
The IW filter minimizes the influence of the interference signal, and the FSD detects the desired signal without considering the interference signal. The IC process removes the influence of the interference signal based on the candidate vectors. Here, the $N_c$ candidate vectors of the desired signal obtained by the FSD are denoted $\hat{\mathbf{x}}_{D,1}, \cdots, \hat{\mathbf{x}}_{D,N_c}$. These vectors are determined by the FSD tree architecture. The ED can be obtained for each candidate vector. For the hard-output detector, the candidate with the smallest ED value is used as the final detection vector. In the proposed algorithm, the IC is inserted to remove the influence of the interference signal prior to determining the result. Here, the interference signals corresponding to the candidate vectors, $\hat{\mathbf{x}}_{I,1}, \cdots, \hat{\mathbf{x}}_{I,N_c}$, should be detected. This differs from joint detection in that detecting the interference signal does not affect the detection of the desired signal. As mentioned previously, the detection of interference signals should target low complexity rather than accurate detection, while minimizing the influence on the detection of the desired signal. Therefore, in the proposed algorithm, interference signal detection is achieved quickly and with low complexity using a ZF detector. First, the IW filter is applied to the system model as in (9), and the QR decomposition in (2) is then applied as follows.
$$\mathbf{Q}^H\mathbf{W}\mathbf{y}_i = \mathbf{R}\mathbf{x}_D + \mathbf{Q}^H\mathbf{W}\mathbf{H}_I\mathbf{x}_I + \mathbf{Q}^H\mathbf{W}\mathbf{n}_i \tag{12}$$
where $\mathbf{Q}$ is a unitary matrix and $\mathbf{R}$ is an upper triangular matrix such that $\mathbf{Q}\mathbf{R} = \mathbf{W}\mathbf{H}_D$. The ZF detector of the interference signal $\hat{\mathbf{x}}_{I,i}$ for the $i$-th candidate vector $\hat{\mathbf{x}}_{D,i}$ is implemented as follows.
$$\mathbf{Q}^H\mathbf{W}\mathbf{H}_I\,\mathbf{x}_{I,i,ZF} = \mathbf{Q}^H\mathbf{W}\mathbf{y}_i - \mathbf{R}\hat{\mathbf{x}}_{D,i} \tag{13}$$
By multiplying both sides by the pseudo-inverse of $\mathbf{Q}^H\mathbf{W}\mathbf{H}_I$, (13) can be rewritten in terms of $\mathbf{x}_{I,i,ZF}$ as follows.
$$\mathbf{x}_{I,i,ZF} = \left((\mathbf{Q}^H\mathbf{W}\mathbf{H}_I)^H(\mathbf{Q}^H\mathbf{W}\mathbf{H}_I)\right)^{-1}(\mathbf{Q}^H\mathbf{W}\mathbf{H}_I)^H(\mathbf{Q}^H\mathbf{W}\mathbf{y}_i - \mathbf{R}\hat{\mathbf{x}}_{D,i}) \approx (\mathbf{Q}^H\mathbf{W}\mathbf{H}_I)^H(\mathbf{Q}^H\mathbf{W}\mathbf{y}_i - \mathbf{R}\hat{\mathbf{x}}_{D,i}) \tag{14}$$
Through the assumption in (14), i.e., that the whitened interference channel is approximately orthogonal, the matrix inversion can be avoided, and the effect on actual performance is considered to be minimal. In addition, $\hat{\mathbf{x}}_{I,i}$ is finally obtained through the slicing function as follows.
$$\hat{\mathbf{x}}_{I,i} = \mathrm{slice}(\mathbf{x}_{I,i,ZF}) \tag{15}$$
where $\mathrm{slice}(\cdot)$ denotes the slicing function that maps each element to the nearest point of the QAM constellation. The final output can be re-determined using (11) rather than the existing ED.
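The IC step for a list of FSD candidates can be illustrated with the following sketch. It is a simplified, real-valued example under stated assumptions rather than the implemented design: `z` stands for the whitened and rotated observation $\mathbf{Q}^H\mathbf{W}\mathbf{y}_i$, `G` for $\mathbf{Q}^H\mathbf{W}\mathbf{H}_I$ (assumed to have approximately orthonormal columns), and the function names are hypothetical.

```python
def matvec(A, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def slice_symbol(value, constellation):
    """Nearest constellation point."""
    return min(constellation, key=lambda s: abs(value - s))


def interference_cancel(z, R, G, candidates, constellation):
    """Re-rank desired-signal candidates after detecting the interference.

    For each candidate x_d:
      residual = z - R x_d
      x_i_zf   = G^T residual      (simplified ZF, assumes G^T G ~ I)
      x_i      = slice(x_i_zf)
      ed       = ||residual - G x_i||^2   (joint Euclidean distance)
    Returns the (ed, x_d, x_i) triple with the smallest joint ED.
    """
    Gt = transpose(G)
    results = []
    for x_d in candidates:
        residual = [zi - ri for zi, ri in zip(z, matvec(R, x_d))]
        x_i_zf = matvec(Gt, residual)
        x_i = [slice_symbol(v, constellation) for v in x_i_zf]
        err = [r - g for r, g in zip(residual, matvec(G, x_i))]
        ed = sum(e * e for e in err)
        results.append((ed, x_d, x_i))
    return min(results, key=lambda t: t[0])
```

Because the interference estimate is a single matrix-vector product followed by slicing, the extra work per candidate is small compared with expanding the tree to detect the interference streams jointly.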

IV. ALGORITHM PERFORMANCE ANALYSIS
In this section, BER performance of multiple interference mitigation algorithms is presented for both hard-output and soft-output detection cases under two scenarios with different antenna configurations. In addition, a comparison of computational complexity approximated by the number of multiplications in the tree-search process is also presented for hard-output and soft-output detection cases.

A. BER COMPARISON
Computer simulations were conducted for hard-output and soft-output scenarios to evaluate multiple interference mitigation algorithms, including the proposed algorithm. The simulations considered two scenarios in a DL MU-MIMO system comprising two users with three or four antennas each and an AP with six or eight antennas corresponding to the users, i.e., 6 × [3, 3] and 8 × [4, 4]. For simplicity, all scenarios assumed that the desired and interfering users used 16-QAM. In addition, FSD and SFSD MIMO detection algorithms were used for each scenario. To verify the performance of the interference mitigation scheme, BER performance was evaluated relative to the energy per bit to noise power spectral density ratio ($E_b/N_0$) and relative to the signal-to-interference ratio (SIR), respectively. Here, SIR is defined as $\delta = 10\log_{10}(\|\mathbf{H}_D\|^2 / \|\mathbf{H}_I\|^2)$ to adjust the influence of the interference signal in (8) caused by an error in the precoding matrix in (7). The simulation scenarios and parameters are summarized in Table 1. For scenarios 1 and 2 with different antenna configurations, the uncoded and coded BER was evaluated and is presented in Figs. 3 to 8. Figures 3 and 6 show the BER versus $E_b/N_0$ with the SIR fixed at 15 dB in order to determine the effect of traditional noise on performance due to channel estimation and synchronization errors. On the other hand, Figs. 4 and 7 show the BER versus SIR with $E_b/N_0$ fixed at 40 dB to show how interference affects performance. Finally, in Figs. 5 and 8, the coded BER performance using the channel codec is presented.
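The SIR definition above can be made concrete with a short sketch. It assumes that the norm in the definition of $\delta$ is the Frobenius norm, which the text does not state explicitly, and the helper names `sir_db` and `scale_interference` are hypothetical. The second helper shows one possible way a simulation could scale the interference channel to reach a target SIR.

```python
import math


def frobenius_sq(H):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(abs(h) ** 2 for row in H for h in row)


def sir_db(H_D, H_I):
    """SIR in dB: 10 log10(||H_D||^2 / ||H_I||^2)."""
    return 10.0 * math.log10(frobenius_sq(H_D) / frobenius_sq(H_I))


def scale_interference(H_D, H_I, target_sir_db):
    """Scale H_I so that sir_db(H_D, scaled H_I) equals target_sir_db."""
    current_ratio = frobenius_sq(H_D) / frobenius_sq(H_I)
    gain = math.sqrt(current_ratio / 10.0 ** (target_sir_db / 10.0))
    return [[gain * h for h in row] for row in H_I]
```

Both helpers also accept complex-valued entries, since `abs` returns the magnitude of a complex number.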
The uncoded BER performance was evaluated using hard-output detection with an FSD. As shown in Table 1, four interference mitigation algorithms were compared. Here, ID (D) denotes that D streams of the interference signal are detected selectively according to a previously proposed method [28] because the original ID is too complex, and proposed (F) denotes the proposed algorithm with F FE levels in the FSD, which can generate more candidate outputs from the MIMO detector. The other interference mitigation algorithms do not depend on the number of candidate outputs; thus, F was set to 1 for them. On the other hand, the coded BER performance was evaluated using soft-output detection with the SFSD. All compared interference mitigation algorithms were configured as in the hard-output detection case. To generate more candidate outputs from the MIMO detector, the number of list extensions was set to L in the SFSD for proposed (L). The number of FE levels was set to F = 1 and L = 2 for all cases (except for proposed (L = 16)). For both the hard-output and soft-output detection cases, the performance gap between ID and the proposed algorithm is controlled by the parameters D, F, and L. However, the proposed algorithm is much simpler in terms of computation and more compatible in terms of hardware design than the ID algorithm. In other words, if similar performance can be achieved, we argue that the proposed algorithm is more suitable for DL MU-MIMO receivers.
Consider scenario 1, with the antenna configuration of 6 × [3, 3]. In Fig. 3, due to the influence of interference, the BER converges to a floor and does not decrease even if $E_b/N_0$ increases. On the other hand, ideal (SIR = 0 dB) does not show this convergence, which explains why interference mitigation is required in a DL MU-MIMO system. Similarly, as shown in Fig. 4, as the interference strength increases, performance gaps can be observed among none, IW, and ID. At a BER of $10^{-5}$, IW performs 4 dB better than none, and ID (D = 1) and ID (D = 2) perform 2.5 dB and 7.5 dB better than IW, respectively. The BER convergence in Fig. 3 reflects this significant performance gap. For the proposed scheme, at high SIR, proposed (F = 1) and proposed (F = 2) achieve performance comparable to that of ID (D = 1) and ID (D = 2), within gaps of 0.5 dB and 2 dB, respectively. As shown in Fig. 5, results similar to those of the hard-output detection case were obtained. IW achieved 3 dB better performance than none, and ID (D = 1) and ID (D = 2) achieved 5 dB and 9.5 dB better performance than IW, respectively. In addition, proposed (L = 2) performed within 3 dB of ID (D = 1), and proposed (L = 16) performed within 2 dB of ID (D = 2).
For scenario 2, with the antenna configuration of 8 × [4, 4], the proposed scheme shows a larger benefit because the number of interference signals to detect is increased. In Fig. 6, a result similar to that of scenario 1 can be seen, although the performance gap between IW and ID is reduced. Similarly, as shown in Fig. 7, at a BER of $10^{-5}$, IW performs 5 dB better than none, and ID (D = 1) and ID (D = 2) perform 1 dB and 6 dB better than IW, respectively. Also, proposed (F = 1) and proposed (F = 2) achieve performance comparable to that of ID (D = 1) and ID (D = 2), within gaps of 0.5 dB and 1 dB, respectively. As shown in Fig. 8, IW achieved 3 dB better performance than none, and ID (D = 1) and ID (D = 2) achieved 2.5 dB and 5 dB better performance than IW, respectively. In addition, proposed (L = 2) and proposed (L = 16) performed within 0.5 dB of ID (D = 1) and ID (D = 2), respectively.

B. COMPLEXITY COMPARISON
It is difficult to estimate the design cost of an algorithm directly; thus, the approximated computational complexity was used to compare the algorithms. The FSD algorithm is assumed in this paper; thus, we used the characteristics of the tree search algorithm. The tree architecture of the FSD comprises multiple nodes, and each node performs similar computations, such that the number of visited nodes is usually taken as the total complexity of tree search algorithms. Therefore, for the compared interference mitigation algorithms, we first analyzed the number of visited nodes and approximated the number of multiplications calculated per node. Then, the number of multiplications for ED calculation was added. Note that other processes, such as FSD channel ordering and QR decomposition, are not changed by the interference mitigation algorithms. In addition, for simplicity, pre-processing for IW with Cholesky decomposition was not considered in this analysis because it has much lower complexity than a tree search unit. Table 2 shows the approximated complexity for hard-output detection with an FSD with the different interference mitigation algorithms. The preprocessing for IW is not included; thus, none and IW are the same in this analysis. First, ID and the proposed scheme visit more nodes in the tree search. Next, the proposed algorithm requires IC, which is performed at each node; therefore, the number of multiplications per node is double that of the other algorithms. Finally, the number of ED calculations depends on the number of candidate outputs, and, compared to none and IW, ID and the proposed algorithm require double the number of multiplications for each calculation. In the upper part of the table, the equation for each part is given in terms of the parameters N, P, F, and D. In the lower part of the table, the calculation results for scenario 1 and scenario 2 are shown.
For comparison, a value normalized to none and IW is also presented. As shown in the BER performance results, proposed (F = 1) and proposed (F = 2) show performance that is comparable to that of ID (D = 1) and ID (D = 2), respectively. However, as shown in Table 2, the normalized values of their approximated complexity are eight and nine times lower than those of ID. Table 3 shows the same analysis for soft-output detection using the SFSD. Here, the number of visited nodes was calculated including the additional nodes from the list extension technique. The proposed algorithm can achieve performance close to that of ID with 4 and 15 times lower complexity.

V. ARCHITECTURE DESIGN
In this section, the architecture design of the FSD combined with the proposed interference mitigation algorithm for DL MU-MIMO systems is presented. First, the design target and overall architecture are presented. In addition, issues that should be considered when designing the symbol detector and overall architecture efficiency are discussed. Note that the architecture can be divided into two stages, i.e., the IW stage and the symbol detection stage. Design specifics are explained in the following.

A. DESIGN OVERVIEW
The designed symbol detector aims to improve the communication performance of a DL MU-MIMO system by adding an interference mitigation algorithm to an existing MIMO detector. The considered environment is an 8 × [4, 4] DL MU-MIMO system comprising a 4 × 4 MIMO detector with interference mitigation of four streams. The modulation scheme is set to 16-QAM for both the desired and interference signals, and the MIMO detector is designed using the FSD and SFSD algorithms. Here, we focus on the meaningful hardware unit, i.e., the tree search architecture; thus, the preprocessing unit (e.g., FSD channel ordering and QR decomposition) and the calculator (e.g., minimum value finder and LLR calculator) were not implemented. The overall architecture for the hard-output detector is shown in Fig. 9, and the extra blocks for the soft-output detector are shown in Fig. 10.
DL MU-MIMO was introduced to improve system throughput; therefore, the designed symbol detector must support high throughput. Generally, throughput is proportional to $f_c B / N_{clk}$, where $f_c$ denotes the clock frequency, $B$ is the number of bits per symbol such that $B = \log_2 M$ (with $M$ here denoting the constellation size), and $N_{clk}$ is the number of clock cycles needed to detect one symbol. Note that it is common to use pipelining techniques to improve throughput because throughput is sensitive to $N_{clk}$. Pipelining arranges the hardware such that more than one operation can be performed simultaneously. Our design achieves high throughput by applying a fully-pipelined architecture in which each hardware unit processes one symbol per clock cycle. Therefore, $N_{clk} = 1$ in our design, which makes it possible to improve throughput significantly. The pipelining technique increases throughput efficiently; however, it requires many hardware units, which can increase overall complexity. Therefore, an efficient algorithm that minimizes complexity should be used. As shown in Figs. 9 and 10, the proposed design can be divided into two stages, i.e., the IW and symbol detecting stages. In the IW stage, the IW filtering process is implemented according to an optimized technique. In the symbol detecting stage, the proposed interference mitigation algorithm is implemented efficiently using the existing FSD and SFSD algorithms.
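As a quick arithmetic check of the throughput relation, the sketch below reproduces the figure quoted in the abstract (3.6 Gbps at 150 MHz with 64-QAM). It assumes that the detector outputs one symbol per desired stream per clock cycle, so the number of streams appears as an explicit factor; this factor and the function name `throughput_bps` are assumptions for illustration, not the paper's exact formula.

```python
def throughput_bps(f_clk_hz, bits_per_symbol, n_streams, n_clk=1):
    """Detector throughput in bit/s.

    f_clk_hz        clock frequency f_c
    bits_per_symbol B = log2(constellation size), e.g. 6 for 64-QAM
    n_streams       desired streams detected in parallel (assumed factor)
    n_clk           clock cycles per detected symbol vector (1 if fully pipelined)
    """
    return f_clk_hz * bits_per_symbol * n_streams / n_clk


# Fully pipelined 4-stream detector, 64-QAM, 150 MHz clock.
rate = throughput_bps(150e6, 6, 4, n_clk=1)
print(f"{rate / 1e9:.1f} Gbps")
```

Increasing `n_clk` divides the rate directly, which is why the fully-pipelined choice of one cycle per symbol dominates the throughput budget.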

B. IW STAGE
In the IW stage, finding W according to (10) is the key part. In this equation, generally, the inverse matrix function is performed first and Cholesky decomposition is then performed; however, here, the reverse is performed such that the inverse operation is performed for the triangular matrix. Note that the final calculated matrix has the same characteristic used for IW.
The overall functional flow of the IW stage is shown in Fig. 11. First, the value of $\mathbf{H}_I^H\mathbf{H}_I + \sigma\mathbf{I}$ is computed for the given input $\mathbf{H}_I$, passed through the Cholesky decomposition module, and then passed through the inversion module; the resulting filter is applied to $\mathbf{y}_i$, $\mathbf{H}_D$, and $\mathbf{H}_I$. Finally, $\mathbf{W}\mathbf{y}_i$, $\mathbf{W}\mathbf{H}_D$, and $\mathbf{W}\mathbf{H}_I$ are generated as outputs. First, we perform Cholesky decomposition on the given channel $\mathbf{H}_I$. In the Cholesky decomposition module, the value of $\mathbf{H}_I^H\mathbf{H}_I + \sigma\mathbf{I}$ is given as input, and the Cholesky decomposition is designed to solve the 4 × 4 matrix using the divide and conquer method. The Cholesky decomposition module is implemented in 10 pipelining stages. For the 4 × 4 matrix, four stages are required because the triangular matrix of the final output is obtained by the divide and conquer method (4 × 4, 3 × 3, 2 × 2, and 1 × 1). As shown in Fig. 12 column and updates the value of the remaining 3 × 3 matrix in the third sub-stage.
Note that matrix inversion after Cholesky decomposition is simple because the matrix is triangular. The matrix inversion is also designed to enable full pipelining, as shown in Fig. 12. In addition, matrix inversion can be performed with a delay in each column stage, which makes the design efficient with lower latency.
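The order of operations in the IW stage (Cholesky decomposition first, then inversion of the triangular factor) can be sketched as follows. This is a real-valued software illustration, not the pipelined hardware: it assumes the covariance has the form $\mathbf{H}_I\mathbf{H}_I^T + \sigma^2\mathbf{I}$ (the real counterpart of the usual interference-plus-noise covariance with unit-power symbols), and all function names are hypothetical.

```python
import math


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def covariance(H_I, sigma2):
    """Assumed interference-plus-noise covariance H_I H_I^T + sigma2 * I."""
    S = matmul(H_I, transpose(H_I))
    for k in range(len(S)):
        S[k][k] += sigma2
    return S


def cholesky_lower(A):
    """Lower triangular L with A = L L^T, computed column by column."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        diag = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(diag)
        for i in range(j + 1, n):
            acc = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - acc) / L[j][j]
    return L


def invert_lower(L):
    """Inverse of a lower triangular matrix by forward substitution."""
    n = len(L)
    W = [[0.0] * n for _ in range(n)]
    for j in range(n):  # each column can be processed independently
        W[j][j] = 1.0 / L[j][j]
        for i in range(j + 1, n):
            acc = sum(L[i][k] * W[k][j] for k in range(j, i))
            W[i][j] = -acc / L[i][i]
    return W


def whitening_filter(H_I, sigma2):
    """W = L^{-1}, where L is the Cholesky factor of the covariance."""
    return invert_lower(cholesky_lower(covariance(H_I, sigma2)))
```

Inverting the triangular factor needs only substitutions and one division per diagonal element, and each column depends only on earlier rows of the same column, which is the property that allows a column-staged pipeline.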

C. SYMBOL DETECTING STAGE
The FSD stage can be divided into three key functions, i.e., FSD pre-processing, FSD, and LSD. Note that FSD pre-processing is excluded from the hardware implementation in this paper. The FSD and LSD are designed as a fully-pipelined architecture such that each has the same number of sub-stages as the number of tree levels.
A block diagram of the entire FSD stage (consisting of FSD-LSD) is shown in Fig. 9. First, hard-detection is performed in the FSD stage and the LSD stage is initiated to secure soft output. In this process, there is a minimum selector that determines the minimum candidate value based on the FSD output. Finally, there is an LLR calculator based on the SFSD output.

1) IMPLEMENTATION OF FSD UNIT
For a four-stream antenna configuration, the number of full extensions is fixed to one, which yields near-optimal performance. Figure 14 shows the hardware architecture of the FSD for 4 × 4 64-QAM. Here, each tree level has 64 nodes, and a total of 64 × 4 = 256 nodes are implemented for pipelining. Each node consists of an interference cancellation unit (ICU) that updates the matrix based on the detected symbols, and a node selection unit (NSU) that performs symbol slicing of the current stream. The ICU and NSU calculate (17) and (18), sequentially. In the first stage, which does not have to select a symbol because the first level is FE, only the ICU exists. In addition, in the last stage, which does not have to update the matrix, only the NSU exists.
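The node structure above can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the implemented architecture: it assumes the preprocessing has already produced an upper-triangular matrix `r` and a rotated received vector `z`, it assumes an unnormalized 64-QAM grid with levels −7 to 7, and it uses the usual successive-cancellation and slicing forms for the ICU and NSU, since (17) and (18) are not reproduced here. All function names are hypothetical.

```python
import math

LEVELS = (-7, -5, -3, -1, 1, 3, 5, 7)
CONSTELLATION = [complex(re, im) for re in LEVELS for im in LEVELS]  # 64 points

def slice_axis(v):
    """Nearest odd integer to v, clamped to the 64-QAM range [-7, 7]."""
    return max(-7, min(7, 2 * math.floor(v / 2) + 1))

def nsu(estimate):
    """Node selection unit: slice one complex estimate to a 64-QAM point."""
    return complex(slice_axis(estimate.real), slice_axis(estimate.imag))

def icu(z_i, r_row, x, i):
    """Interference cancellation unit: remove already-detected streams from z_i."""
    return z_i - sum(r_row[j] * x[j] for j in range(i + 1, len(x)))

def fsd_candidates(r, z):
    """Return 64 (euclidean_distance, symbol_vector) pairs, one per FE node."""
    n = len(z)
    out = []
    for top in CONSTELLATION:            # full extension: all 64 symbols, no slicing
        x = [0j] * n
        x[n - 1] = top
        ed = abs(z[n - 1] - r[n - 1][n - 1] * top) ** 2
        for i in range(n - 2, -1, -1):   # remaining levels keep one child per node
            resid = icu(z[i], r[i], x, i)
            x[i] = nsu(resid / r[i][i])
            ed += abs(resid - r[i][i] * x[i]) ** 2
        out.append((ed, x))
    return out

def fsd_hard_decision(r, z):
    """Pick the candidate with the smallest accumulated Euclidean distance."""
    return min(fsd_candidates(r, z), key=lambda c: c[0])
```

Each of the 64 paths performs the same fixed sequence of operations, which is what allows one node per path and per tree level (64 × 4 = 256 nodes) to be laid out as a pipeline.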

2) IMPLEMENTATION OF LSD UNIT
After passing through the four stages of the FSD, a bit-based tree-extension stage is initiated to generate soft output for all bits. This tree extension is performed based on the LSD algorithm, and the list parameter L is fixed to 2. In the SFSD stage, two of the 64 values from the FSD stage are selected to expand the opposite tree of each bit at each tree level. The LSD block is designed as shown in Fig. 15. Since 64-QAM signals have six bits, six tree-extensions are performed on one node. Note that three stages are required to expand at the second tree level, and only two stages are required for expansion at the third and fourth tree levels; thus, 6 × 6 = 36 additional nodes are required for each selected value. Because the SFSD proceeds on two values in total, 36 × 2 = 72 nodes are created. As shown in Fig. 15, some nodes do not require the NSU because such nodes already have a pre-decided symbol. Note that such nodes in the first tree level also do not require the ICU, as in the FSD.

3) IMPLEMENTATION OF IC STAGE
Note that the proposed interference mitigation scheme is implemented in hardware and is designed to minimize additional hardware requirements by adding an IC unit to the FSD and LSD units, as shown in Fig. 13. The interference signal is detected through the operation of (14), which can proceed sequentially in the tree-level structure. For this calculation, after FSD preprocessing is performed, $H_I^H W^H Q$ is calculated by the IC preprocessing unit. As shown in Fig. 16, an IC unit is added to each FSD sub-stage. Each unit calculates a column of the matrix for the IC, $Q^H W y_i - R x_{D,i}$.
After four stages of the IC unit, the ZF matrix is completed and then goes through the slicing unit, which detects interference signals. Finally, the detected interference signal is used with the detected desired signal in the ED calculation unit.
Note that the IC unit for the LSD works in the same manner. Therefore, in total, 64 + 36 = 100 candidates and their recalculated EDs can be obtained. These values are used to calculate the LLR.
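How a candidate list with recalculated EDs can be turned into soft output is sketched below. This is a minimal sketch assuming the common max-log approximation, with the sign convention that a positive LLR favors bit 0 and with clipping when one hypothesis has no candidate; the paper's exact LLR formula is not reproduced here, and the function name, the `clip` value, and the candidate layout are hypothetical.

```python
import math

def max_log_llr(candidates, num_bits, noise_var, clip=8.0):
    """Max-log LLRs from (ed, bits) pairs, where bits is a tuple of 0/1 values.

    For each bit position, the smallest ED among candidates with that bit equal
    to 1 is compared with the smallest ED among candidates with it equal to 0.
    """
    llrs = []
    for b in range(num_bits):
        ed0 = min((ed for ed, bits in candidates if bits[b] == 0), default=math.inf)
        ed1 = min((ed for ed, bits in candidates if bits[b] == 1), default=math.inf)
        if math.isinf(ed0) and math.isinf(ed1):
            llr = 0.0                      # no information at all for this bit
        else:
            llr = (ed1 - ed0) / noise_var  # infinite if one hypothesis is missing
        llrs.append(max(-clip, min(clip, llr)))
    return llrs

# Toy usage with three candidates and two bits (values are illustrative only).
toy = [(1.0, (0, 0)), (3.0, (1, 0)), (2.0, (0, 1))]
print(max_log_llr(toy, num_bits=2, noise_var=1.0))  # [2.0, 1.0]
```

In the design described here, the list would hold the 100 candidates, and each 4-stream 64-QAM vector would carry 4 × 6 = 24 bit positions. The bit-wise tree extension of the LSD serves to make sure that both hypotheses of every bit are represented, so that the clipping branch is rarely needed.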

VI. IMPLEMENTATION RESULT
The proposed soft-output symbol detector design, including the proposed interference mitigation algorithm, was implemented for a DL MU-MIMO system. Here, an 8 × [4,4] antenna configuration was considered, and Verilog HDL and Synopsys Design Compiler were used for RTL design and synthesis, respectively. The synthesis was performed based on a 65-nm CMOS standard cell library with an operating frequency of 150 MHz, and the gate area was 2633 mm². Table 4 shows the gate count of each implemented block. As can be seen, the total gate count is 2107 K. The existing FSD and LSD units account for 887 K (42.1%) of the total gate count, and the other units for interference mitigation account for 1220 K (57.9%) of the total.
Table 5 compares the overall hardware implementation results of previous studies and the proposed design. [32] shows the implementation results of the SD algorithm with hard output. [33], [34], and [35] show the implementation results of the K-best algorithm with hard and soft output, respectively. [36] and [37] show the implementation results of the FSD algorithm with hard and soft output, respectively. The entries labeled Proposed show the implementation results of the proposed algorithm with only the FSD and with both the FSD and IC.
To compare the results fairly under the same conditions, throughput and power consumption were normalized to the 65-nm technology at a supply voltage ($V_{dd}$) of 1.2 V. The normalized throughput is calculated as follows:
$$\text{Normalized Throughput} = \left(\frac{\text{Tech.}}{65\,\text{nm}}\right) \times \text{Throughput}, \tag{19}$$
while the normalized power consumption is calculated as
$$\text{Normalized Power} = \left(\,\cdots\right) \times \text{Power}. \tag{20}$$
As shown in Table 5, [32] shows a gate count of 212 K and a normalized throughput of 132.9 Mbps at an operating clock rate of 193 MHz. Note that the SD algorithm is not friendly to the hardware architecture; thus, the implementation is not efficient and cannot provide high throughput. [33] shows a gate count of 1760 K and a normalized throughput of 100 Mbps at an operating clock rate of 198 MHz. Note that this is the first implementation of the K-best algorithm with a sorter in each stage. Therefore, the implementation is not efficient, which results in a high gate count and low throughput. [34] attempts to optimize the existing K-best algorithm in terms of both throughput and gate count, and it shows a gate count of 298 K and a normalized throughput of 2 Gbps; however, its operating clock rate of 833 MHz is too high for use in a practical implementation. More recently, [35] achieved a normalized throughput of 3.28 Gbps at an operating clock rate of 137 MHz with a gate count of 1753 K.
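The throughput normalization in (19) amounts to a linear scaling by the ratio of technology nodes. A minimal sketch follows; the function name is hypothetical, and the 130-nm input is an illustrative value, not an entry from Table 5. The power normalization in (20) is not sketched.

```python
def normalized_throughput(throughput_bps, tech_nm, ref_nm=65.0):
    """Scale a reported throughput to the reference technology node as in (19)."""
    return throughput_bps * (tech_nm / ref_nm)

# Hypothetical design reported at 1 Gbps in a 130-nm process.
print(normalized_throughput(1.0e9, 130.0) / 1e9)  # 2.0
```

A design already synthesized at 65 nm is left unchanged by this scaling, which is why the proposed design has the same raw and normalized throughput.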
As mentioned previously, the FSD algorithm is proposed as a hardware-friendly implementation. [36] shows much better performance with a gate count of 88.2 K and a normalized throughput of 1.98 Gbps at an operating clock rate of 165 MHz. Here, a potential disadvantage of the FSD algorithm is the difficulty of obtaining sufficient candidates for soft-output generation because it typically uses a much lower number of candidates than the K-best algorithm. [37] shows the soft-output FSD with a gate count of 555 K and a normalized throughput of 3.05 Gbps at an operating clock rate of 370 MHz.
Since there have been no previous implementations of the interference mitigation algorithm for a MIMO detector, the proposed algorithm was first implemented using only the FSD stage for comparison with existing MIMO detectors. In addition, it was implemented with the IC stage to evaluate the cost of extending it to the DL MU-MIMO system.
First, the results of [37] are compared with the FSD component of this study, which shows a gate count of 887 K and a normalized throughput of 3.6 Gbps at an operating clock rate of 150 MHz. Recall that this study employs a fully pipelined architecture; therefore, the throughput is greater even at a lower operating clock rate, although the gate count is greater than that of [37]. Similarly, power consumption can be lower due to the lower operating clock rate. In other words, the proposed implementation is comparable to the existing FSD implementation.
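The 3.6 Gbps figure is consistent with a simple rate calculation, sketched below under the assumption that the fully pipelined detector accepts one 4-stream 64-QAM symbol vector per clock cycle. The text states the architecture and the resulting rate but not this accounting explicitly, and the function name is hypothetical.

```python
def peak_throughput_bps(streams, bits_per_symbol, clock_hz, vectors_per_cycle=1):
    """Peak detected bit rate of a pipeline that completes whole vectors per cycle."""
    return streams * bits_per_symbol * clock_hz * vectors_per_cycle

# 4 streams x 6 bits (64-QAM) x 150 MHz
print(peak_throughput_bps(4, 6, 150_000_000) / 1e9)  # 3.6
```

Under the same accounting, a design that needs several clock cycles per vector would require a proportionally higher clock rate to reach the same throughput.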
Next, the FSD+IC component is compared to the FSD component to determine the cost of extending the MIMO detector to a DL MU-MIMO detector. As shown in Table 3, the proposed FSD with IC requires approximately double the computational complexity of the existing FSD. This result can also be observed in Table 4. If FSD preprocessing is considered, the total gate count ratio is closer to two, because the implementation includes preprocessing only for the IC; FSD preprocessing is usually excluded due to its variable implementation method. As shown in Table 5, with the interference mitigation algorithm, the proposed design achieves a normalized throughput of 3.6 Gbps with a gate count of 2107 K, which is significantly better performance than existing MIMO detectors under DL MU-MIMO systems and than the case in which only IW is included.

VII. CONCLUSION
In this paper, we have proposed a design and implementation of an FSD combined with an interference mitigation algorithm for a DL MU-MIMO system. The simulation results show performance within 0.5 dB of the optimum, which is 4 dB better than the existing IW method for soft-output decisions. Both hard-output and soft-output FSDs are implemented with the proposed interference mitigation algorithm for 8 × [4,4] DL MU-MIMO systems. The proposed design can support 3.6 Gbps at 150 MHz using a fully pipelined implementation. In a comparison with previous MIMO detectors, the proposed design represents a promising architecture for DL MU-MIMO systems: it offers a significant performance improvement under inter-user interference and a compatible hardware implementation that maintains high throughput at an acceptable gate count.