Due to its capability of providing high spectral efficiency and link reliability, multiple-input multiple-output (MIMO) technology has become a key part of many new wireless standards such as IEEE 802.11n and IEEE 802.16e/m. However, one of the implementation challenges for MIMO systems is to develop high-throughput low-complexity MIMO detectors and related signal processing blocks. QR decomposition (QRD) is an essential signal processing task that is required by most MIMO detection schemes to decompose the estimated channel matrix into a unitary matrix and an upper-triangular matrix, providing a suitable framework for sequential detection schemes. Note that this work aims to design a QRD core for MIMO receivers used in high-mobility applications that require high-throughput signal processing blocks to accommodate fast-varying channel environments.

The three basic methods for computing QR decomposition are the Modified Gram-Schmidt Orthonormalization (MGS) algorithm, Householder reflections and Givens rotations. A straightforward implementation of the MGS algorithm or Householder reflections requires multiplication, division and square-root operations, resulting in high hardware complexity and computation latency. For MGS, [1] proposes using log-domain computations to implement these operations with low-complexity adders, subtractors and shifters. However, the scheme presented performs frequent data conversions between the log and linear domains, and it requires large storage space to hold the look-up tables. In [2], a low-complexity approximation is presented to implement the inverse square-root function. However, due to the underlying approximation, it might lead to bit error rate (BER) performance degradation, especially for fixed-precision arithmetic. Householder reflections have the advantage of nulling multiple rows simultaneously. However, this benefit comes with a challenging implementation to carry out multiple reflections in parallel. Since Givens rotations work on only two matrix rows at a time, they can be more easily parallelized. Furthermore, the CORDIC algorithm, in its vectoring and rotation modes, can be used to perform Givens rotations using low-complexity shift and add operations. These two factors make Givens rotations the method of choice for QRD implementation.

However, using the conventional sequence of Givens rotations to decompose matrices with large dimensions leads to high computational complexity, due to the large number of required vectoring and rotation operations. To alleviate this problem, a modified sequence of Givens rotations is presented in [3] that keeps the block-wise symmetry between the sub-matrices intact during the annihilation process. However, this improved sequence still leads to a large number of rotation operations for high-dimensional MIMO systems (e.g., 4 × 4). Furthermore, the sequential nature of element annihilations for certain sub-matrices and the large number of required rotations for each annihilation cause a throughput bottleneck.

To resolve these issues, this paper proposes a *hybrid* QR decomposition scheme based on a combination of Householder reflections, conventional 2D Givens rotations and multi-dimensional Givens rotations. The proposed scheme combines the multi-dimensional annihilation capability of Householder reflections with the low-complexity nature of the conventional 2D Givens rotations. This combination relieves the throughput bottleneck and reduces the hardware complexity, first by decreasing the number of rotation operations required and then by enabling their parallel execution. The computational complexity is further reduced using the extended multi-dimensional CORDIC algorithms and the Householder CORDIC algorithms, presented in [4] and [5] respectively, that only use shift and add operations to perform multi-dimensional vector rotations. Furthermore, this paper presents a novel pipelined architecture for the QRD core that uses 2D, Householder 3D and 4D/2D configurable un-rolled pipelined CORDIC processors, to maximize throughput and resource utilization while minimizing the gate count. Synthesis results in 0.13 μm CMOS indicate that this QRD design computes a new 4 × 4 complex **R** matrix and 4 updated 4 × 1 complex symbol vectors every 40 cycles, at a clock frequency of 270 MHz, and requires 36 K gates.

## II. SYSTEM MODEL AND SPECIFICATION DERIVATION

Let's consider a MIMO system with *N*_{T} transmit and *N*_{R} receive antennas. The complex baseband equivalent model for this system can be given as **ỹ** = **H̃s̃** + **ṽ**, where **s̃** and **ỹ** denote the transmitted and received symbol vectors, respectively. The *N*_{R}-dimensional vector **ṽ** is an independent identically distributed (i.i.d.) complex zero-mean Gaussian noise vector with variance σ^{2}. The matrix **H̃** represents an *N*_{R} × *N*_{T} complex-valued channel matrix. The real-valued system model can be derived using the real-valued decomposition (RVD) algorithm [6], and can be given as **y** = **Hs**+**v**, where the dimensions of **s**, **y** and **H** are 2*N*_{T} × 1, 2*N*_{R} × 1 and 2*N*_{R} × 2*N*_{T}, respectively. Note that QRD using the conventional RVD model produces 4 upper-triangular sub-matrices. However, if the modified RVD model shown in Fig. 1 is used, then we can attain a single upper-triangular 8 × 8 **R** matrix. Note that Fig. 1 uses separate symbols for the real and imaginary parts of the complex elements.
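The real-valued decomposition can be illustrated with a short numerical sketch. The conventional layout below is the standard block form; the interleaved layout is only an assumption about what the modified RVD model of Fig. 1 looks like (each complex entry expanded into a 2 × 2 block), since the figure itself is not reproduced here. All function names are hypothetical.

```python
import numpy as np

def rvd_conventional(H_c):
    """Conventional RVD: block layout [[Re, -Im], [Im, Re]]."""
    return np.block([[H_c.real, -H_c.imag],
                     [H_c.imag,  H_c.real]])

def stack_conventional(v_c):
    """Real vector [Re(v); Im(v)] matching rvd_conventional."""
    return np.concatenate([v_c.real, v_c.imag])

def rvd_interleaved(H_c):
    """Assumed modified RVD: every complex entry becomes a 2x2 real block."""
    n_r, n_t = H_c.shape
    H = np.empty((2 * n_r, 2 * n_t))
    H[0::2, 0::2] = H_c.real
    H[0::2, 1::2] = -H_c.imag
    H[1::2, 0::2] = H_c.imag
    H[1::2, 1::2] = H_c.real
    return H

def stack_interleaved(v_c):
    """Real vector [Re(v1), Im(v1), Re(v2), Im(v2), ...]."""
    v = np.empty(2 * v_c.size)
    v[0::2] = v_c.real
    v[1::2] = v_c.imag
    return v

rng = np.random.default_rng(0)
H_c = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
s_c = rng.standard_normal(4) + 1j * rng.standard_normal(4)
y_c = H_c @ s_c  # noise-free complex model, for the check only

H_conv = rvd_conventional(H_c)
H_int = rvd_interleaved(H_c)
```

In the interleaved layout every even-numbered column is a fixed rearrangement (swap and sign change) of the odd-numbered column before it, which is the kind of adjacent-column symmetry a structured rotation sequence can exploit.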

Many MIMO detection schemes start the estimation of the transmitted symbol vector by decomposing the channel matrix **H** into a unitary matrix **Q** and an upper-triangular matrix **R**. Performing the nulling operation on the received signal by **Q**^{H} results in the updated system equation, **z** = **Q**^{H}**y** = **Rs**+**Q**^{H}**v**. This indicates that the QRD core needs to compute and provide the matrix **R** and the updated symbol vector **z** = **Q**^{H}**y** to the MIMO detector.
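The nulling step can be checked numerically with a minimal sketch. It uses NumPy's library QR routine rather than the rotation-based scheme of this paper, and the symbol alphabet and noise level are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((8, 8))                 # real-valued 8x8 channel model
s = rng.choice([-3.0, -1.0, 1.0, 3.0], size=8)  # illustrative real symbols
v = 0.05 * rng.standard_normal(8)               # illustrative noise
y = H @ s + v

Q, R = np.linalg.qr(H)   # H = Q R with Q orthogonal, R upper-triangular
z = Q.T @ y              # for a real matrix, Q^H is simply Q^T

# Because R is upper-triangular, the last row of z depends on the last
# symbol only, which is what lets a sequential detector start there.
s_last_estimate = z[-1] / R[-1, -1]
```

Since **Q** is orthogonal, the noise term **Q**^{H}**v** keeps the same statistics as **v**, so the triangular system carries the same information as the original one.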

In this work, a QRD core is designed to be used with the K-Best 4 × 4 MIMO detector presented in [6] with K = 10, and hence its performance specifications can be derived as follows. The K-Best detector in [6] requires a new input **z** vector every K = 10 clock cycles, and assumes that the channel is quasi-static and is updated every four channel uses. The QRD core therefore has 4 × 10 = 40 clock cycles per channel update. Hence, it should be designed to generate a new 8 × 8 real **R** matrix and four 8 × 1 real **z** vectors every 40 clock cycles, at a clock frequency of at least 270 MHz, while minimizing power dissipation and gate count, as the target applications are for mobile communications.
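As a quick check on this budget, the figures above can be combined as follows. The symbols $N_{\text{cyc}}$, $T_{\text{QRD}}$ and $f_{\text{clk}}$ are introduced here for illustration only and do not appear elsewhere in the paper.

$$N_{\text{cyc}} = 4 \times K = 4 \times 10 = 40, \qquad T_{\text{QRD}} = \frac{N_{\text{cyc}}}{f_{\text{clk}}} = \frac{40}{270\ \text{MHz}} \approx 148\ \text{ns}$$

At the minimum clock frequency this corresponds to roughly 6.75 million decompositions per second.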

## III. PROPOSED QR DECOMPOSITION SCHEME

As discussed earlier, among the basic methods for QRD computation, the Givens rotations method is superior in terms of performance and hardware complexity. However, QRD of a large **H** matrix using the conventional sequence of Givens rotations requires a large number of vectoring and rotation operations, e.g., for a 4 × 4 complex matrix, a total of 28 vectoring and 252 rotation operations are required. The modified sequence of Givens rotations presented in [3] reduces the number of vectoring and rotation operations required to 16 and 136, respectively, which is still considerably large. Another issue with using the existing sequences of Givens rotations for QRD is the sequential nature of element annihilations for certain sub-matrices. For example, in the nullification sequence shown in Fig. 1 for the first two columns of the **H** matrix, the annihilations of several elements and their corresponding rotations have to be performed sequentially using appropriate pivot elements, since they use common sets of rows. These issues with the existing sequences of Givens rotations lead to larger latency, hence lower throughput, and higher hardware complexity for QRD implementation.
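The operation counts quoted for the conventional sequence can be reproduced with a small sketch. It assumes one particular counting convention, which is not stated in the text: one vectoring per annihilated element of the 8 × 8 real matrix, and one rotation for every remaining column pair that must follow it, including four appended **z** columns. The function name is hypothetical, and the rotations are exact floating-point ones rather than CORDIC.

```python
import numpy as np

def givens_qr_count(A, n_cols):
    """Triangularize the first n_cols columns of A with 2D Givens rotations.

    Any extra columns (e.g. received vectors) are rotated along.
    Returns the transformed matrix and the vectoring/rotation counts.
    """
    A = A.astype(float).copy()
    n_rows, total_cols = A.shape
    vectoring = 0
    rotation = 0
    for j in range(n_cols):
        for i in range(j + 1, n_rows):
            a, b = A[j, j], A[i, j]
            r = np.hypot(a, b)
            c, s = (1.0, 0.0) if r == 0 else (a / r, b / r)
            # Vectoring: align (a, b) with the first axis, nulling A[i, j].
            A[j, j], A[i, j] = r, 0.0
            vectoring += 1
            # Rotation: apply the same angle to every remaining column.
            for k in range(j + 1, total_cols):
                p, q = A[j, k], A[i, k]
                A[j, k] = c * p + s * q
                A[i, k] = -s * p + c * q
                rotation += 1
    return A, vectoring, rotation

rng = np.random.default_rng(2)
H = rng.standard_normal((8, 8))
Y = rng.standard_normal((8, 4))   # four received vectors, one per channel use
out, n_vec, n_rot = givens_qr_count(np.hstack([H, Y]), n_cols=8)
R, Z = out[:, :8], out[:, 8:]
print(n_vec, n_rot)
```

Under this convention the sketch counts 28 vectoring and 252 rotation operations, and every annihilation within a column reuses the same pivot row, which is the sequential dependency described above.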

To reduce the number of rotation operations required and to enable their parallel realization, this paper proposes a hybrid QR decomposition scheme that uses a combination of multi-dimensional Givens rotations, Householder reflections and the conventional two-dimensional Givens rotations. The fundamental idea is to increase throughput by annihilating multiple elements simultaneously, by using multi-dimensional Givens rotations and Householder reflections, and to reduce the circuit complexity by implementing these multi-dimensional rotations using series of shift and add operations.

Multi-dimensional Givens rotations operate on vectors of dimensions larger than 2, to align them with the first axis. A generic way to implement multi-dimensional Givens rotations is to use high-complexity multiply-and-add based algorithms. However, in [4], 3D and 4D CORDIC algorithms are presented as an extension of the conventional 2D CORDIC algorithms, that use low-complexity shift and add operations to carry out Givens rotations for 3 and 4 dimensional vectors. Householder reflections also provide the capability of introducing multiple zeroes simultaneously. In [5], a novel Householder CORDIC algorithm is proposed, that can perform 3 and 4 dimensional vector rotations based on a sequence of simple Householder reflections using shift, carry-save-addition (CSA) and addition operations.
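For reference, the conventional 2D CORDIC that these algorithms extend can be sketched as follows. This is a floating-point illustration, not the hardware design: the multiplications by `2.0 ** -i` stand in for right shifts, the iteration count of 8 is an assumption of the sketch, and the function names are hypothetical. The rotation mode reuses the directions recorded by the vectoring mode instead of an explicit angle.

```python
import math

N_ITER = 8  # assumed iteration count for this sketch

def cordic_gain(n_iter=N_ITER):
    """Processing gain accumulated over n_iter micro-rotations."""
    gain = 1.0
    for i in range(n_iter):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return gain

def cordic_vectoring(x, y, n_iter=N_ITER):
    """Rotate (x, y) onto the x-axis; return the unscaled result and directions."""
    flip = x < 0                  # coarse rotation by 180 degrees
    if flip:
        x, y = -x, -y
    dirs = []
    for i in range(n_iter):
        d = 1 if y >= 0 else -1   # direction that drives y toward zero
        t = 2.0 ** -i             # a right shift by i bits in hardware
        x, y = x + d * t * y, y - d * t * x
        dirs.append(d)
    return x, y, flip, dirs

def cordic_rotation(x, y, flip, dirs):
    """Apply the rotation described by (flip, dirs) to another vector."""
    if flip:
        x, y = -x, -y
    for i, d in enumerate(dirs):
        t = 2.0 ** -i
        x, y = x + d * t * y, y - d * t * x
    return x, y

K = cordic_gain()
xv, yv, flip, dirs = cordic_vectoring(-0.6, 0.8)
xr, yr = cordic_rotation(0.3, -0.5, flip, dirs)
```

Dividing the outputs by `K` recovers the rotated vectors; the residual in `yv` is bounded by the last micro-rotation angle, so more iterations give a more accurate rotation.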

We carefully examined and compared the complexity of the elementary rotation matrices for the 3D and 4D cases of these two algorithms, which led to the choice of the Householder CORDIC algorithm for 3D and the multi-dimensional CORDIC algorithm for 4D Givens rotations. The elementary rotation equations for the *i*^{th} iteration of the Householder 3D CORDIC algorithm [5] can be derived from its elementary rotation matrix as follows:
$$\begin{aligned}
X^{i+1}_1 &= X^i_1 - 2^{-2i+1}X^i_1 + 2^{-i+1}D^i_1X^i_2 + 2^{-i+1}D^i_2X^i_3 \\
X^{i+1}_2 &= -2^{-i+1}D^i_1 X^i_1 + X^i_2 - 2^{-2i+1}D^i_1D^i_2X^i_3 \\
X^{i+1}_3 &= -2^{-i+1}D^i_2X^i_1 - 2^{-2i+1}D^i_1D^i_2X^i_2 + X^i_3
\end{aligned}\tag{1}$$

where the rotation directions *D*^{i}_{1} and *D*^{i}_{2} are obtained from the input operands. The elementary rotation equations for the 4D CORDIC algorithm [4] can be derived as:

$$\begin{aligned}
X^{i+1}_1 &= X^i_1 - 2^{-i}D^i_1 X^i_{2} - 2^{-i}D^i_2 X^i_{3} - 2^{-i}D^i_3 X^i_4 \\
X^{i+1}_2 &= 2^{-i}D^i_1 X^i_{1} + X^i_{2} + 2^{-i}D^i_3 X^i_{3} - 2^{-i}D^i_2 X^i_4 \\
X^{i+1}_3 &= 2^{-i}D^i_2 X^i_{1} - 2^{-i}D^i_3 X^i_{2} + X^i_{3} + 2^{-i}D^i_1 X^i_4 \\
X^{i+1}_4 &= 2^{-i}D^i_3 X^i_{1} + 2^{-i}D^i_2 X^i_{2} - 2^{-i}D^i_1 X^i_{3} + X^i_4
\end{aligned}\tag{2}$$

where the rotation directions *D*^{i}_{1}, *D*^{i}_{2} and *D*^{i}_{3} are likewise calculated from the input operands.
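A useful property of (1) and (2) is that each elementary step is a scaled orthogonal transformation for any choice of directions in {−1, +1}, which is why a fixed output scaling can compensate for the processing gain. The sketch below checks this numerically; the function names and the range of iteration indices tested are assumptions made for illustration.

```python
import itertools
import numpy as np

def householder3d_step(i, d1, d2):
    """Elementary matrix of Eq. (1) for iteration i and directions d1, d2."""
    a = 2.0 ** (-i + 1)        # 2^{-i+1}
    b = 2.0 ** (-2 * i + 1)    # 2^{-2i+1}
    return np.array([
        [1.0 - b,  a * d1,        a * d2],
        [-a * d1,  1.0,          -b * d1 * d2],
        [-a * d2, -b * d1 * d2,   1.0],
    ])

def cordic4d_step(i, d1, d2, d3):
    """Elementary matrix of Eq. (2) for iteration i and directions d1..d3."""
    t = 2.0 ** -i
    return np.array([
        [1.0,     -t * d1, -t * d2, -t * d3],
        [t * d1,   1.0,     t * d3, -t * d2],
        [t * d2,  -t * d3,  1.0,     t * d1],
        [t * d3,   t * d2, -t * d1,  1.0],
    ])

def gain3d(i):
    """Per-iteration gain of Eq. (1): 1 + 2^{-2i+1}."""
    return 1.0 + 2.0 ** (-2 * i + 1)

def gain4d(i):
    """Per-iteration gain of Eq. (2): sqrt(1 + 3 * 2^{-2i})."""
    return np.sqrt(1.0 + 3.0 * 2.0 ** (-2 * i))

def all_steps_scaled_orthogonal(iterations=range(1, 9)):
    """Check M^T M = gain^2 * I for every direction combination."""
    ok = True
    for i in iterations:
        for d in itertools.product((-1, 1), repeat=2):
            M = householder3d_step(i, *d)
            ok &= np.allclose(M.T @ M, gain3d(i) ** 2 * np.eye(3))
        for d in itertools.product((-1, 1), repeat=3):
            M = cordic4d_step(i, *d)
            ok &= np.allclose(M.T @ M, gain4d(i) ** 2 * np.eye(4))
    return bool(ok)

print(all_steps_scaled_orthogonal())
```

Because the gain depends only on the iteration index and not on the directions, the total scale factor is a constant that can be fixed at design time.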

The following describes the proposed QR decomposition scheme. Note that this scheme uses a special sequence of Givens rotations, partially derived from [3], that keeps the symmetry between the adjacent columns of **H** intact. Hence, the scheme only needs to perform vectoring and rotation operations on odd-numbered columns of **H**, and the values for the elements in the even-numbered columns can be derived automatically, without any computations. Also, note that the scheme is described here for QRD of a 4 × 4 complex matrix; however, it can be generalized to any matrix dimensions by appropriately using 2D, 3D and 4D Givens rotations and Householder reflections. The multi-dimensional CORDIC algorithm and the Householder CORDIC algorithm, along with the simplified and efficient 2D, 3D and 4D CORDIC VLSI architectures presented in Section IV, can then be used to develop a high-throughput low-complexity architecture for QRD for any matrix dimensions.

The scheme begins with annihilating the elements in the first column of the **H** matrix in a completely parallel manner using conventional 2D Givens rotations. It then uses 4D Givens rotations to annihilate several elements simultaneously and in parallel, as opposed to the sequential annihilation using the conventional 2D Givens rotations. As a result, the number of corresponding rotation operations is reduced by a factor of 3, from 42 to 14. As the next step, the conventional 2D Givens rotations are used once again to perform parallel annihilation of the elements in the third column of the **H** matrix, shown in Fig. 1. The proposed scheme then uses the 3D Householder CORDIC algorithm to annihilate two elements simultaneously. The effect of element annihilation is propagated to the non-zero elements in rows 2, 3, 4 and 6, 7, 8 in parallel, and this further reduces the number of corresponding rotation operations by a factor of 2. As the last step, this scheme annihilates the remaining elements one after another, in a fixed order, using the conventional 2D Givens rotations.
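The factor-of-3 saving from a 4D rotation can be illustrated with an exact (floating-point, non-CORDIC) sketch. The matrix below has the same sign structure as the 4D elementary rotation equations and nulls three elements of a 4-vector in one step; the function name is hypothetical, and the 14 companion columns are taken from the counts quoted above.

```python
import numpy as np

def rotation_4d(x):
    """Orthogonal 4x4 matrix G with G @ x = (||x||, 0, 0, 0); x must be non-zero."""
    n = np.linalg.norm(x)
    p0, p1, p2, p3 = x[0] / n, -x[1] / n, -x[2] / n, -x[3] / n
    return np.array([
        [p0, -p1, -p2, -p3],
        [p1,  p0,  p3, -p2],
        [p2, -p3,  p0,  p1],
        [p3,  p2, -p1,  p0],
    ])

rng = np.random.default_rng(3)
block = rng.standard_normal((4, 15))   # pivot column plus 14 companion columns
G = rotation_4d(block[:, 0])
rotated = G @ block                    # one vectoring, 14 rotations

# Operation tallies for nulling three elements of the pivot column.
ops_4d = {"vectoring": 1, "rotation": 14}
ops_2d = {"vectoring": 3, "rotation": 3 * 14}
```

With 2D rotations the three annihilations share a pivot row and must run one after another, each dragging the 14 companion columns along; the 4D rotation handles all three in a single pass.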

## IV. QR DECOMPOSITION VLSI IMPLEMENTATION

### A. Overall QR Decomposition Architecture

The improved QRD scheme presented in Section III is used to develop a QRD architecture for 4 × 4 MIMO receivers. The QRD core needs to perform a total of 16 vectoring and 136 rotation operations to output an 8 × 8 **R** matrix and four 8 × 1 **z** vectors every 40 clock cycles. In order to meet these challenging specifications, we propose a novel pipelined architecture that uses un-rolled CORDIC processors iteratively to implement the proposed QRD scheme.

The overall architecture, shown in Fig. 2, consists of 6 pipelined stages. The `Input Controller` and `Output Controller` stages provide interfaces of the QRD core with the preceding and succeeding stages in a MIMO receiver, to read in or write out the input and output matrices. The four central stages, `Stage1-4`, compute the QR decomposition of input **H** matrix, as well as 4 **z** vectors using un-rolled pipelined 2D, 3D and 4D CORDIC processors. The datapath of each of these stages also contains a multiplexer bank, that is used to select the input operands for the CORDIC processor every cycle, and a register bank, that is used to re-direct and hold the CORDIC outputs until the current stage completes its desired computations. Each of these stages also contains an independent `Stage Controller`, that provides control signals to direct appropriate data in and out of the CORDIC processor every cycle. It is also responsible for controlling the CORDIC mode of operation, the rotation direction transfers and re-use of the pipelined CORDIC stages to maximize resource utilization. Finally, note that the CORDIC modules are designed to minimize gate count by performing CORDIC algorithm iterations in each half of the clock cycle, however the Stage Controllers are designed to use full clock cycles for reduced complexity. Thus, the proposed architecture meets the QRD processing latency specification of 40 cycles, while maximizing resource utilization and minimizing the gate count.

### B. 2D, 3D and 4D CORDIC Architectures

In order to satisfy the challenging latency requirements of the QRD core, the CORDIC modules need to perform a large number of vectoring and rotation operations within the limited number of cycles, while trying to achieve the smallest gate count possible. Hence, the 2D, 3D and 4D CORDIC processors are designed with the primary aim of achieving high throughput, and then, as the secondary aim, their gate count is reduced by using various strategies.

In general, the 2D, 3D and 4D CORDIC processors, designed in this work, consist of multiple pipelined core stages, where each core stage is configurable to implement one or more of the CORDIC elementary rotation equations in either vectoring or rotation mode. In addition to the core stages, the CORDIC processors also contain stages to perform input coarse rotation, output inverse coarse rotation and output scaling to compensate for CORDIC processing gain. Based on extensive MATLAB simulations, architectural decisions were made to use 8 CORDIC iterations and the two's complement data format for input, output and internal data with a word-length of 16 bits and 11 bits for the fractional part.
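The data format described above can be sketched as a small quantizer. The word length and fractional bits come from the text; round-to-nearest and saturation on overflow are assumptions of this sketch, as the paper does not state how rounding and overflow are handled, and the function names are hypothetical.

```python
WORD_BITS = 16   # total word length, two's complement
FRAC_BITS = 11   # fractional bits, leaving a sign bit and 4 integer bits

def to_fixed(value):
    """Quantize a real value to a signed 16-bit code (assumed: round, saturate)."""
    scaled = int(round(value * (1 << FRAC_BITS)))
    lo = -(1 << (WORD_BITS - 1))
    hi = (1 << (WORD_BITS - 1)) - 1
    return max(lo, min(hi, scaled))

def to_float(code):
    """Convert a fixed-point code back to its real value."""
    return code / (1 << FRAC_BITS)

def shift_right(code, i):
    """The fixed shift of CORDIC stage i: an arithmetic right shift by i bits."""
    return code >> i

one = to_fixed(1.0)
resolution = to_float(1)
largest = to_float(to_fixed(100.0))
smallest = to_float(to_fixed(-100.0))
```

With this format the representable range is [−16, 16 − 2^{−11}] with a resolution of 2^{−11}, which leaves headroom for the CORDIC processing gain before the output scaling stage.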

One of the strategies used to achieve a lower gate count is implicit angle transfer using the elementary rotation directions, rather than explicitly computing and transferring the actual rotation angles [5]. This results in a hardware saving of around 30%, since the hardware resources in the angle datapath can be removed. Also, since each CORDIC core stage needs to perform a fixed shift, the shift can be implemented by re-wiring the input operands, and hence the area-intensive barrel shifters can be removed. Another hardware saving strategy is to re-use the CORDIC stages to perform more than one elementary rotation per stage. This reduces the number of pipelined stages required and increases the datapath hardware utilization significantly. Also, the QRD design does not use any multiplier, divider, square-root or RAM modules. Thus, these strategies result in a considerable gate count reduction, while still achieving the same performance.

Both the 2D and 3D un-rolled CORDIC processors are made of 4 pipelined stages. For the 2D CORDIC processor, each stage implements 2 sets of conventional 2D CORDIC elementary rotation equations in a single clock cycle. The idea is to use the same set of 16-bit signed adders twice and use multiplexers to select the inputs to these adders, to implement one set of CORDIC equations in each half of the clock cycle. In `Stage 1`, the 2D CORDIC performs four 2D vectoring and 24 2D rotation operations, and in `Stage 4`, it performs three 2D vectoring and 24 2D rotation operations, within 40 cycles. On the other hand, each stage of the 3D un-rolled CORDIC processor implements 2 sets of Householder 3D CORDIC elementary rotation equations within 2 clock cycles. In the single stage architecture shown in Fig. 3 for the 3D CORDIC, the top 2 adders compute one of the three outputs of the Householder 3D CORDIC equations shown in (1) and the bottom 2 adders compute the other two, within a single clock cycle. These outputs are then fed back as inputs to the same stage, and the same procedure is used to compute the next set of outputs, which serve as the final outputs of the stage. The un-rolled 3D CORDIC processor is used in `Stage 3` of the QRD core to perform one 3D vectoring and 12 3D rotation operations.

`Stage 2` of the QRD core contains a 4D/2D configurable un-rolled CORDIC processor. It consists of 8 pipelined stages, each of which is programmable to operate in either 4D or 2D mode. In the 2D mode, each stage implements 2 sets of 2D elementary CORDIC equations, and in the 4D mode of operation, each stage implements 1 set of 4D CORDIC elementary rotation equations, shown in (2). In the single stage architecture shown in Fig. 4, in the 4D mode, the adders are used to compute two of the four outputs of (2) in the first half of the clock cycle and the remaining two in the second half of the clock cycle. The 4D/2D configurable CORDIC processor performs a total of one 4D vectoring, 14 4D rotation, three 2D vectoring and 18 2D rotation operations within 36 clock cycles.

## V. PERFORMANCE AND COMPLEXITY COMPARISON

Let's first compare the performance of the proposed QRD scheme, with the conventional scheme that uses only 2D Givens rotations. Fig. 5 shows the BER curves obtained by simulating the combination of QRD and K-Best MIMO detector for different QRD schemes. It can be noticed that the BER performance for the proposed QRD scheme is identical to that of the conventional scheme, for both floating-point and fixed-point models. Note that these QRD MATLAB models use the CORDIC algorithms for performing Givens rotations, with 8 CORDIC iterations and appropriate scale factors. Fig. 5 also shows the BER curve for QRD using ideal Givens rotations, which is marginally better compared to that using the CORDIC algorithm. This can be explained by the fact that the CORDIC algorithm just approximates actual vector rotations, with the accuracy dependent on the number of iterations and the compensation scale factors [5]. Fig. 5 also shows the BER curves for fixed-point QRD models using the proposed scheme, that use 6 and 10 CORDIC iterations. It can be noticed that QRD using 6 iterations leads to a significant BER performance degradation, while QRD using 10 iterations offers a minimal gain. This justifies our choice of using 8 iterations for 2D, Householder 3D and 4D/2D configurable CORDIC processors.

Table I shows the design characteristic comparison between the QRD architecture presented in this paper and other state-of-the-art QRD implementations for 4 × 4 complex matrices. Note that here Processing Latency refers to the number of cycles after which a new QR decomposition output is ready. The QRD design in [1] requires considerable storage space to hold the look-up tables, and hence incurs large gate count. In contrast, the QRD design in [2] achieves much smaller gate count by using a low-complexity approximation of the inverse square-root function, however, it has a very large QRD processing latency. In comparison, the novel QRD scheme and architecture presented in this paper outputs a new 4 × 4 complex **R** matrix and four 4 × 1 **z** vectors every 40 cycles, and requires 36 K gates. Thus, the proposed QRD architecture achieves the lowest processing latency, while still achieving the second lowest core area.

To summarize, the focus of this paper is on the design of a high-speed low-complexity QRD architecture to be used in MIMO receivers for mobile applications with fast-varying channels. A hybrid QRD scheme is proposed that integrates the simultaneous annihilation capability of Householder transformations and multi-dimensional Givens rotations with the low implementation complexity of the CORDIC algorithms. The proposed scheme reduces the overall computational complexity and allows higher execution parallelism, and is shown through simulations to have the same BER performance as the conventional scheme. Using the improved QRD scheme and the novel un-rolled pipelined architectures of the multi-dimensional CORDIC processors, the designed QRD core is shown to achieve the lowest processing latency and the second lowest core area for QR decomposition of input 4 × 4 complex channel matrices.