Design and Chip Implementation of a SMI/MVDR Dual-Mode Beamformer for Wireless MIMO Communication Systems

This paper presents a low complexity chip design supporting dual-mode beamforming, i.e. sampling matrix inversion (SMI) and the minimum variance distortionless response (MVDR), for wireless Multiple-Input Multiple-Output (MIMO) communication systems. The auto-correlation matrix inversion is the critical computing kernel shared by the two beamforming schemes. To alleviate the computing complexity, the auto-correlation matrix is approximated by a Toeplitz counterpart, which can be decomposed efficiently by applying the Cholesky decomposition and the Schur algorithm. This leads to an <inline-formula> <tex-math notation="LaTeX">$O\left ({N^{3} }\right)$ </tex-math></inline-formula> to <inline-formula> <tex-math notation="LaTeX">$O\left ({N^{2} }\right)$ </tex-math></inline-formula> complexity reduction, where <inline-formula> <tex-math notation="LaTeX">$N$ </tex-math></inline-formula> is the matrix size, while preserving computing parallelism for the hardware design. In addition, a diagonal loading technique is employed to mitigate the stability problem when the matrix is ill-conditioned. Simulation results indicate that no performance loss is observed due to the algorithm simplification measures. A systolic array based mapping procedure converts the two beamforming algorithms to a unified hardware accelerator design with 80% shared circuitry. Complex-valued divisions are achieved by adopting a hardware efficient coordinate rotation digital computer (CORDIC) scheme. In chip implementation, a TSMC 90nm UTM process technology is used and the design specs largely follow the requirements of IEEE 802.11ac standard. The core size of the chip design is 0.68mm<sup>2</sup>. The measurement results show that the chip can operate up to 200MHz with a power consumption of 49.03mW. It can complete the computation of a new beamforming vector (of size 8) every 0.64us and exhibits the highest throughput among the 6 compared designs.


I. INTRODUCTION
Beamforming, also known as spatial filtering, is an essential technique of array signal processing and was traditionally applied to radar systems to achieve the directional transmission/receiving of signals. With the prevailing of wireless communication systems, beamforming schemes have also been employed extensively to enhance the transmission performance, such as extending the service range or improving SNR values at the receiving end. Beamforming controls the phase and relative amplitude of the signal at each transmitter (or receiver) for directional signal transmission (or reception). For wireless communication, this feature is The associate editor coordinating the review of this manuscript and approving it for publication was Jingen Ni . useful in combating path loss problems and supporting spatial multiplexing. For mmWave based wireless communication systems such as 5G, the path and penetration loss problem is even more prominent and the beamforming technique is a must [1]. The directionality of beamforming depends on the size of antenna array. The larger the array size, the better the directionality. The implementation of beamforming can be achieved in the RF section or in the baseband (digital) section. The former one employs a phase shifter for each antenna and requires an additional signal combiner at the receiving end [2]- [4]. The accuracy of phase shift / signal combination and the incurred signal attenuation, which negatively affecting the overall RF link budget, are the major hurdles of this approach. The digital approach [5], although requires extra RF chains (one per antenna) and ADC/DACs, can effectively mitigate these problems. It can also support the combination of beamforming and digital precoding to combat the channel impairments efficiently. In this paper, we will focus on digital beamforming. Figure 1 shows the block diagram of a digital beamformer at the receiving end. Signals received from each antenna are down converted to baseband, digitally sampled, and multiplied respectively with a beamforming weight, before being combined for demodulation. Beamforming weights are complex-valued, which modifies both the amplitude and the phase of the received signal. A beamforming module is in charge of the calculations of beamforming weights.

A. ADAPTIVE BEAMFORMING
Instead of receiving signal along a fixed direction, adaptive beamforming can steer the beam in a specific direction according to the target source position, and can dynamically adjust the beam weight as the signal source moves. By changing of the null direction of receiving, it can also suppress possible interferences to enhance the received signal quality. Various beamforming weight calculation algorithms have been developed subject to different optimization criteria. These criteria include minimum mean square error (MMSE), minimum output energy (MOE), and maximum output signal to interference plus noise ratio (Max SINR). The MMSE algorithm tries to minimize the error with respect a reference signal d(t). The autocorrelation matrix R xx of the received signals x(t) and the cross correlation vector r xd between x(t) and d(t) are first calculated. The solution is commonly known as the Wiener filter. The MOE algorithm minimizes the total output energy while keeping the gain of the array of desired signal a constant. This algorithm does not require a reference signal as in the case of MMSE, but needs the knowledge of spatial signature (or steering vector) h 0 . In line of sight communications, this corresponds to the direction of arrival (DoA). The weight vector can be calculated as When the constant c is set to 1, it leads to a Minimum Variance Distortionless Response (MVDR) beamformer. The DoA estimation can be obtained by using schemes such as Multiple Signal Classification (MUSIC) and Estimation of Signal Parameters via Rotational Invariance Technique (ESPRIT) algorithms. The Max SINR algorithm aims at maximizing the signal to interference plus noise ratio. It also requires the spatial signature information. The optimal weight vector of the max SINR algorithm is similar to that of the MOE scheme by replacing the signal autocorrelation matrix R xx with noise covariance matrix R n .
It can be shown, these three beamforming vectors, although being derived by using different optimal criteria, are actually equivalent.
In implementing practices, the autocorrelation matrix R xx and cross correlation vector r xd can be estimated by taking average of multiple samples of received data. The inverse of the autocorrelation matrix estimate ( R −1 xx ) is then calculated to obtain the optimal beamforming weight vector. This is considered the direct matrix inversion (DMI) or sample matrix inversion (SMI) approach. Note that the computing complexity of matrix inversion is O(N 3 ), where N is matrix size. To alleviate the computing overheads, iterative algorithms based on the principle of steepest descent were developed. These include least mean square (LMS) and recursive least square (RLS) algorithms. In LMS algorithm, autocorrelation matrix and cross-correlation vector are replaced by their snapshot values. The convergence, however, takes a long time and the selection of step size µ is critical. The RLS algorithm is considered to achieve a faster convergence than the LMS algorithm. It computes the matrix inversion incrementally in each iteration and the total computing complexity is actually higher than the DMI approach. A RLS digital beamformer is presented in [6], it uses modified inverse matrix operations to reduce computational complexity. A SMI-MVDR beamformer was proposed in [7], it changes the dimensions of matrix operations and implements matrix decomposition using recursive updates.
As for the implementation platforms, due to intensive computing efforts required for adaptive beamforming, either high end programmable DSP processors or hardwired solutions are plausible for real time operations. The former one is flexible in implementing different algorithms. For example, a Steered Minimum Variance (STMV) adaptive beamforming was proposed in [8] and implemented on the TigerSHARC Digital Signal Processor platform. The programmable option, however, consumes much more power than the hardwired alternative. It is also less efficient in achieving high throughput computations. An ASIC design using QR decomposition to realize adaptive beamforming is proposed [9]. The ASIC is implemented by using a TSMC 0.13mm process technology and can operate up to 150MHz. In [10], a Real-Time QRD-Based MVDR Beamforming is implemented in a Xilinx Virtex-4 platform and can operate at up to 250MHz.

B. PROPOSED WORK AND CONTRIBUTIONS
In this paper, a low-complexity dual-mode adaptive digital beamformer design for packet based wireless communication VOLUME 8, 2020 systems is investigated. In particular, the prevailing MIMO OFDM system is considered as the target. The beamforming technique can be applied to either the time domain (pre-FFT) or the frequency domain (post-FFT) in a MIMO OFDM system. Taking the receiving end as an example, the time domain approaches [11]- [14] combine the baseband signals received from the antenna array by multiplying them with a set of beamforming weights. The frequency domain approaches [15]- [18] performs the beamforming on a per subcarrier basis after transforming the FFT transform and each sub-carrier can have a distinct beamforming weight. This, however, calls for much higher computing efforts. In this paper, the time domain approach is used with a potential application to hybrid beamforming schemes for massive MIMO systems Instead of adopting iterative algorithms with uncertainty in convergence speed, the DMI schemes are employed. The beamforming vector computations are designed to be accomplished within the fixed time frame of packet preamble. In particular, two popular DMI schemes are supported, i.e. MMSE and MVDR. The former one performs the MMSE beamforming using the preamble of packet data as the reference signal while the latter one achieves the beamforming based on given DoA information. The computation of the autocorrelation matrix inverse is the most computationally intensive part in both schemes. A new computing scheme is developed to alleviate its computing complexity. Instead of direct matrix inversion, Cholesky decomposition of the auto-correlation matrix is performed first, which decomposes the matrix into the product of a lower and an upper triangular matrix. We further take advantages of the near Toeplitz structure property of auto-correlation matrix and apply a fast computing Schur algorithm to obtain the decomposition result. This measure successfully reduces the computing complexity by an order, i.e. from O(N 3 ) to O(N 2 ), where N is the matrix size. After matrix decomposition, the matrix inversion is converted to two successive matrix vector multiplications, which also facilitates efficient pipelining in hardware implementation. In hardware accelerator design, we use the framework of 802.11ac as a reference in determining timing and throughput specs. A multi-stage systolic array design is developed. It serves as the shared computing kernel of two beamforming modes and reduces the overall circuit complexity by 41%. Various circuit simplification and power reduction techniques are applied. The design is implemented in a 90nm process technology and measured results are provided.
The remaining of the paper is organized as follows: In Section II, the proposed auto-correlation matrix inversion computing scheme is presented. This includes the Schur algorithm based Cholesky decomposition and extra algorithm modifications to improve hardware design efficiency. In section III, the performance evaluations, as well as the bit true simulations, of the two adaptive beamforming schemes are elaborated. In addition, a diagonal loading technique to tackle matrix rank deficiency problem is also discussed. Section IV addresses the algorithm to hardware mapping issue and discusses various optimization techniques employed to obtain a unified dual-mode beamformer design. Finally, the chip design summary and implementation results are given in section V. A performance comparison with existing works is included. Possible extension to a more general MIMO configurations is also discussed.

II. MMSE/MVDR DUAL-MODE BEAMFORMING COMPUTING SCHEME
A. SCHUR ALGORITHM FOR MATRIX DECOMPOSITION Refer to Eqs (1) and (2), auto-correlation matrix inversion, R −1 xx , is common to the weight vector computations of MMSE and MVDR beamforming scheme. Since direct matrix inversion is too costly in computing complexity, it is seldom computed explicitly. If the matrix inversion is meant for solving a linear system A · b = c with the solution b = A −1 . c, by decomposing the matrix A into the product of a lower triangular matrix L and an upper triangular matrix U(LU decomposition), i.e., A = L · U, the solution vector b can be computed by solving two equations successively.
Because of the special structures of L and U matrices, vector d and then vector b can be calculated by using forward substitution and backward substitution schemes, respectively. If the matrix is Hermitian and positive definite, e.g., an autocorrelation matrix with full rank, the LU decomposition can be converted to a LDL decomposition, a variant of the classical Cholesky decomposition.
where L is a unit lower triangular matrix, D is a diagonal matrix, and the superscript H indicates the conjugate transpose. The algorithm computing complexity of LDL decomposition is roughly one half that of regular LU decomposition [19]. In spite of this, the algorithm complexity is still of O(N 3 ), where N is the matrix size, and poses a big challenge to real time computation.
To further reduce the computing complexity, we take advantage of the special property of the signal vectors received from the antenna array. Consider a uniform linear array (ULA) of antennas, assume there are l signal sources each arrives from a distinctive direction. Define the corresponding steering vectors (or array response vector) The autocorrelation matrix R xx of the received signal vector x(t) can be expressed as where R ss is the auto-correlation matrix (l by l in dimension) of the l signal sources multiplied by the respective path attenuation factor, and A = [a(θ 1 ) · · · a(θ l )] N ×l . R nn is the covariance matrix of the noise vector n(t) observed at the receiving antenna array. If l signal sources and the channel model are both stationary, R ss becomes a Toeplitz one, so does AR ss A H . If all noises are AWGN (additive white Gaussian Noise) and mutually independent, R nn = σ 2 n I. This suggests R xx is also a Toeplitz one. A Toeplitz matrix (or diagonal-constant matrix) has each of its descending diagonals from left to right as constant. Eq. (8) shows a 4×4 matrix example of R xx , which is both Toeplitz and Hermitian.
Note that the diagonal values of the matrix are real-valued and identical. Due to the special matrix structure of Toeplitz matrices, fast decomposition algorithms with a computing complexity of O(N 2 ) can be applied. There are basically two types of fast algorithms, i.e. Levinson type and Schur type. The classical Levinson-Durbin recursion was first proposed by Norman Levinson in [20] and later improved by James Durbin in [21]. It constructs the decomposition of the inverse of the Toeplitz matrix by solving the last columns of the inverses of the leading principle submatrices recursively. A typical Schur type algorithm was presented in [22] as fast Cholesky factorization for positive definite Toeplitz matrices. The computations proceed progressively in arrowwise manner to obtain the decomposition of the Toeplitz matrix. Both types of algorithms have similar computing complexities. More specifically, Schur algorithm uses O(N 2 ) space, whereas Levinson-Durbin recursion uses only O(N ) space, which suggests a smaller memory footprint of the latter in software implementation. Schur algorithm, however, is numerically stable but Levinson-Durbin algorithm is not for some ill-conditioned matrix. From the aspect of hardware implementation, the computing parallelism of the algorithm is essential to the efficiency of hardware acceleration. Both Schur and Levinson-Durbin algorithms are recursive. Full computing parallelism (with an O(1) time complexity) can be exploited from the vector addition and scalar times vector operations in Schur algorithm. However, only partial computing parallelism (O(logN ) time complexity) can be extracted from the vector inner product operations in Levinson-Durbin algorithm. Based on these two reasons (numerical stability and time complexity), Schur algorithm is employed as the basic framework of our auto-correlation matrix inversion scheme. Figure 2 shows the pseudo code of the Schur algorithm, which triangularizes a Toeplitz and Hermitian matrix R xx asL whereL is a unit lower triangular matrix and U is an upper triangular matrix. The elements of matrix R xx are r ij = r * ji = t |i−j| , for i,j = 1∼N and i ≥ j.
The scheme performs matrix triangularization by calculating the rows ofL and U matrices progressively (one row per iteration) from top to bottom. Calculating the first rows ofL and U matrices (iteration 1) is trivial because l t 1 = [1 0 · · · 0] and u t 1 is identical to the first row of R xx . Starting from iteration 2, two row vectors v t i and u t i are updated based on the results of v t i−1 and u t i−1 from the previous iteration. The initial setting of v t 1 and u t 1 are equal to the first two rows of R xx , respectively. The u vector is first shifted right by one position with a new entry added to the leading position. A reflection coefficient κ i is then calculated and the v t i and u t i vectors are updated by performing a butterfly operation, which consists of two elementary row operations using κ i as the multiplication constant. The leading entry of the u t i vector will be nullified and becomes the i th row of the U matrix.

B. EFFICIENT AUTO-CORRELATION MATRIX INVERSION SCHEME
Refer to Eq. (5), after Schur decomposition we have Our goal is calculating the product of R −1 xx with a vector p. The first step result indicated in Eq. (4) can be obtained without performing a forward substitution. Since L = L −1 , we can compute g = L −1 · p by using a matrix vector VOLUME 8, 2020 Although it's possible to obtain the final result by performing a backward substitution using the upper triangular matrix U, the computing sequence is actually opposite to that of the decomposition process. From the hardware implementation aspect, this indicates that these two computation phases, i.e. decomposition and backward substitution, cannot overlap. This leads to a prolonged computing latency. To mitigate the problem, the inverse of matrix U is computed explicitly. The backward substitution is then replaced by a matrix vector multiplication M · g, where M = U −1 . The two computation phases can be chained together to shorten the latency. Refer to Eq. (10), Matrix M can be calculated as D is a diagonal matrix consists of the main diagonal of matrix U. D −1 can be easily obtained by taking the reciprocal of diagonal elements. The pseudo code for computing R −1 xx · p based on the Schur decomposition resultL · R xx =U is shown in Figure 3.

III. ALGORITHM PERFORMANCE SIMULATION RESULTS AND ANALYSES
There are two basic assumptions made in developing the efficient beamforming computing schemes as shown in section II. The first one is that the auto-correlation matrix R xx is positive definite. The second one is that the received signal vector x is stationary so that becomes R xx Toeplitz. The first assumption does not always hold and some remedial measures should be applied. As for the second assumption, in implementation practices, the auto-correlation matrix can only be estimated by taking average over a limited length of data. If the length is not sufficiently long, the estimated R xx may not be a Toeplitz one even if the signal is truly stationary. Some form of approximation has to make, which would inevitably lead to some beamforming performance loss. This effect should also be examined through the simulations. The wireless communication system adopted to evaluate the performance of beamforming schemes largely follows the IEEE 802.11ac standard. A 1 × 8 SIMO (single-input multiple-output) OFDM system configuration with 8 receiving antennas is assumed. The data modulation format is 64-QAM. The propagation path of the channel pulse response is set to 11-taps, and the azimuth direction of arrival (DoA) angle of the signal is assumed to be 0 degree. One VHT-LTF (very high throughput -long training field) is used as the training symbol to calculate the autocorrelation matrix and cross-correlation vector. The training symbol length is 512.

A. DIAGONAL LOADING SCHEME
For the positive definite property of the auto-correlation matrix, refer to Eq. (7), if the noise term R nn is ignored, R xx is positive definite if AR ss A H is full ranked. The full rank condition holds if rank(R ss ) equals N , the number of receiving antennas. This means there should be N independent signal sources, which may not be true if the channel environment is not scattering rich. It is now up to the noise term R nn to ensure the full rank property of R xx . However, when the SNR (signal to noise ratio) value of the received signal is high, the noise term becomes negligible and R xx becomes ill-conditioned. This will adversely affect the beamforming performance. Figure 4 shows the beam patterns of SMI and MVDR beamforming under two SNR values, i.e. 0dB and 15dB. In the 0dB cases where significant AWGNs (additive White Gaussian noises) are observed, the beam patterns of the beamforming schemes both exhibit a dominant peak at degree zero, the assumed DoA. However, in the 15dB cases, the beam patterns of both schemes fail to match the assumed DoA. To improve the robustness of the beamformer, diagonal loading schemes [23]- [25] can be applied by adding a scaled identity matrix to the original R xx . The scaling factor (or loading factor) is γ .
A heuristic rule to determine the loading factor is setting its value proportional to the trace of R xx , i.e., the summation of the main diagonal elements.
where ε is the power ratio of the loading factor, and different magnification values are set according to different antenna array size. Generally, the value is greater than the reciprocal of the number of antennas [26]. Fig. 5 shows the simulation of the SMI scheme power beam patterns under different ε values when SNR = 20dB. It can be observed that a large ε value can achieve better sidelobe suppression. Fig. 6 shows the symbol error rate (SER) simulations of the MVDR scheme with (ε = 0.01 ∼ 2.2) and without (ε = 0) diagonal loading technique. The results show that SER improves with the increase of the ε value. When ε value reaches 0.05 and beyond, the SER curves almost overlap. A magnified plot indicates the SER performance reaches its best when ε = 0.6. Any further increase of ε value leads to a performance rollback. Based on these two simulation results, we choose the ε value as 0.5 for its computing simplicity.

B. EVALUATION OF TOPELITZ APPROXIMATION
We will next examine the performance impact of Toeplitz approximation. Fig. 7 shows the SER simulations of 5 different schemes. The SISO scheme is used as the reference. Exact SMI / MVDR schemes use the Hermitian (but not Toeplitz) auto-correlation matrix. Proposed SMI / MVDR schemes mean the autocorrelation matrix is converted to a Toeplitz one by replacing all elements along a descending diagonal with their average. The results show that the 4 beamforming schemes provide significant SNR improvements over the  SISO scheme. For a SER setting of 10 −2 , the SNR gain is roughly 8dB. In particular, the two MVDR schemes outperform the two SMI schemes by about 1dB. The magnified plot further reveals the proposed schemes with Toeplitz approximation even exhibit a small performance edge over their counterpart schemes (0.25db and 0.13dB, respectively). This means the Toeplitz approximation technique can reduce the beamforming computing complexity by an order without causing any performance loss.

A. ACCELERATOR DESIGN SPECIFICATIONS AND SYSTEM ARCHITECTURE
The design specs of the beamformer processor, the hardware accelerator of the proposed beamforming scheme, are VOLUME 8, 2020 We will next present a unified architecture design, which is basically a collection of distinct DSP modules needed in both algorithms. With the hardware centric deliberation in algorithm development, these DSP modules can be seamlessly pipelined and the number of distinct DSP modules can be minimized, as well. Fig. 8 shows the system architecture of the proposed SMI and MVDR dual mode beamformer. There are 9 modules in total and 7 of them are shared by the two beamforming schemes. An input buffer (SV module) is used to collect data from different antennas. The Toeplitz matrix R xx is generated by the autocorrelation matrix (AM) module. The matrix is updated with the diagonal loading scheme in the DL module. The Schur decomposition is performed next in the SD module to obtain theL and U matrices, respectively. Then the M matrix and the g vector are calculated in the MM and GV modules, respectively, for Cholesky decomposition. Finally the SMI weights w = R −1 xx · p are obtained in the SW module where p is cross-correlation vector. In the MVDR mode, refer to Eq. (15), w = R −1 xx · a (θ) is first calculated in the same way as in the SMI mode. Parameter λ is next calculated, and finally the MVDR weights are performed in the MW module.

B. ALGORITHM TO HARDWARE ARCHITECTURE MAPPING
A canonical mapping procedure [27] from algorithm representations, i.e., dependence graphs, to systolic array designs is applied to derive the hardware design of each computing module. The mapping starts with selecting a scheduling vector and a projection vector. An algebraic method called space-time transform is next applied to derive the scheduling and hardware allocation. Due to space limitation, the mapping process of each module is not elaborated and only the final results are given. Fig. 9 shows the systolic array design of a Schur decomposition scheme. For illustration simplicity, a 4 × 4 matrix size is assumed. It consists of two rows of processing elements (PE), each row of size 5 (N + 1). The first row of R xx (with diagonal loading) enters the array from the top. This two-row PE array design are responsible for the update of two row vectors, v t i and u t i . The leading PE of the first row is also in charge of the computation of reflection coefficient. This calls for a complex-valued division. The finalL and U matrices output from the bottom of the array, with one row generated per computing iteration. Figure 10 shows the pipelined architecture of the SMI weight vector computing module. TheL and U matrices from the Schur decomposition module are fed to a 3-stage PE array. Refer to Eq. (12), stage one computes matrix M row by row. A divider is needed to calculate the reciprocal of the diagonal element d i . N multipliers are employed to compute all elements in one M row in parallel. Stage two performs a matrix vector multiplication to obtain the g vector. A linear systolic array containing N MAC (multiply & accumulate) modules is employed. Stage three performs another matrix-vector multiplication M · g for the SMI beamforming vector result. Its hardware mapping is identical to that of stage 2. Note that the data flows across stages are consistent and can be seamlessly pipelined. Fig. 11 shows the hardware design of an 8 × 8 Schur decomposition corresponding to the linear systolic array of the first computing stage in Figure 10. It takes 8 computing iterations, one for the update of a row, to complete the 8 × 8 Schur decomposition.
The two decoders (A and B) generate the multiplexer control signals for operand selection. The circuit designs for M matrix, g vector and the final SMI weight vector can be derived straightforwardly from the systolic arrays shown  in Fig. 10. Extra pipeline retiming schemes, however, are applied to reduce the critical path delay of the inner product operation. Fig. 12 depicts the hardware designs of computing an 8×8 auto-correlation matrix and an 8×1 cross correlation vector. Both assume a linear systolic array structure. For the auto-correlation matrix module, due to the special structure of Toeplitz matrices, the design requires only one row of calculations, which reduces the hardware complexity by 87% when compared with a full-blown design.  pipelined registers are placed between the multipliers and the adders, and also along the accumulation path

C. COMPLEX-VALUED HARDWARE MODULE DESIGN
Note that all arithmetic operations in this design are complexvalued operations. A complex-valued multiplication requires 4 real-valued multiplications and two real-valued additions. Complex-valued divisions needed in Schur decomposition, M matrix and λ parameter calculations call for extra design efforts to simplify the circuitry. Refer to Eq. (16), in a conventional approach, the denominator is first converted to a real value by multiplying with its complex conjugate (applied to both the numerator and the denominator). After performing a complex-valued multiplication is the numerator, two real-valued divisions follow to obtain the final result. In total, 6 multiplications, 3 additions, and 2 divisions, all real-valued, are needed.
To reduce the circuit complexity, the Coordinate Rotation Digital Computer (CORDIC) is adopted. The CORDIC scheme was first introduced in 1959 by Jack E. Volder and can compute various arithmetic functions by applying a sequence of micro-rotations iteratively using shift and add operations only. There are three types of CORDIC computations, i.e. circular, linear and hyperbolic. Each type of computation is further divided to vector and rotation modes. Trigonometric functions can be realized in the circular type computing while multiplications/divisions are supported in the linear type computing. Note that the accuracy of CORDIC scheme improves with the increase of computing iterations. A complex-valued division is divided into two computing phases. In phase 1, the denominator is treated as a 2-dimensional vector, with the real part as the first entity and the imaginary part as the second entity of the vector. The second entity can be nullified by rotating the vector to the axis of the first dimension and the value of the first entity becomes the norm-2 of the vector. This calls for a circular type vector mode CORDIC computation. The numerator is rotated accordingly by using a circular type rotation mode CORDIC computation. In phase 2, the real-valued divisions are then performed by applying linear type vector mode CORDIC computations. Depending on the area-time complexity tradeoff, the implementation of a CORDIC computing module ranges from a purely sequential form to a fully pipelined one. In a purely sequential form, (or a folded architecture), a single set of adder and shifter is used repetitively to complete all micro-rotations. On the contrary, a fully pipelined design performs one micro-rotation in a dedicated pipeline stage. The former design is most hardware economical but requires the longest computing latency while the one achieves the highest throughput at the expense of hardware complexity. In this design, a balance between the circuit complexity and the throughput is sought. Figure 14 (a) shows the circuit design of phase 1 computation, i.e., converting the denominator to a real value and updating the numerator accordingly. The leading CORDIC module, operating in a circular type vector mode, is in charge of converting the denominator to a real value by nullifying its imaginary part. Two micro-rotations are performed at a time and the results are feedback for the subsequent computing iterations. This can reduce the computing latency by one half with a mild increase in circuit complexity when compared with a purely sequential design. Extra simulations were conducted to determine the required iteration number of the CRODIC scheme. The result indicates 8 iterations are sufficient and the computing latency in this design is thus 4 clock cycles. Note that in the M matrix calculation, all elements in the same row of the auto-correlation matrix need to perform the same complex-valued divisions. Therefore, the micro-rotation sequence generated by the leading CORDIC module is passed to the trailing N CORDIC modules, all operating in a circular type rotation mode, to update their values, respectively. This technique gives rise to an additional 21.8% gate count saving. One more design optimization measure is applied here. According to the original CORDIC algorithm, a scaling factor K must be multiplied after the sequence of rotations to obtain the correct magnitude of the rotated vector. Since the rotations are applied equally to both numerator and denominator, the scaling factors cancel out each other. Figure 14(b) shows two CRODIC modules configured in the linear type vector mode to complete the real-valued divisions for the real part and imaginary part, respectively. N pair of such CRODIC modules are employed to compute all elements in the same matrix row concurrently. The circuit complexity reduction by using the CORDIC scheme is significant. The synthesis results reveal that the gate counts for the complex-valued vector division design using conventional dividers and CORDIC modules are 35.6k and 4.7k, respectively. This means an 86% gate count saving.

D. SYSTEM TIMING DIAGRAM
The timing diagram of the proposed dual-mode 1 × 8 beamformer design is shown in Fig. 15. 128 out of the 512 sampling points from the VTH-LTF symbol are dedicated to estimations of the auto-correlation (AM) matrix and cross-correlation vector (CV). After the estimation results are ready, it takes one clock cycle to perform the diagonal loading (DL). SD stands for the Schur decomposition module and takes 10 clock cycles to complete one row of computations. The M matrix (MM) and the g vector (gV) computations are performed concurrently followed by the calculation of one element of SMI weight vector. Because of this highly parallel computing scheme, it takes only 82 clock cycles to complete the SMI weight vector calculations. For the MVDR mode beamforming, 11 additional clock cycles (and 93 clock cycles in total) are needed to complete the task. This leads to a total latency of 210 and 221 clock cycles, respectively, for the two beamforming modes.

V. CHIP IMPLEMENTATION AND PERFORMANCE EVALUATION
Based on the design presented in section IV, the corresponding hardware accelerator chip design is developed. The synthesis is based on an ARM standard cell library and the implementation technology is a TSMC 90nm UTM CMOS process. The first step is conducting fixed point simulations to obtain a bit true design and evaluating the implementation loss.

A. FIXED POINT SIMULATIONS FOR BIT RATE DESIGN
Fixed point simulations take the windowing and the quantization effects into consideration, and determine the best windowing position and the minimum number of bits (word length) of each variable so that the performance deviation from the floating point version is acceptable. Figure 16 shows the fixed point simulation trials to determine the fractional length of the SMI weight variable. The integral length is bounded by the maximum value collected in the simulation trials, which is 3 bits in this case, so as not to introduce overflow. The fractional length ranges from 1 to 9 bits in simulations. The result shows that the SER improvement saturates when the fractional length reaches 8 bits. The SNR degradation with respect to the floating point version at SER equal to 10 −3 is less than 0.05dB. The word length for the SMI weight variable is thus set to 11 bits. Similar simulations are conducted, proceeding from input to output, to determine  the word lengths of variables at different computing modules. Table 2 summarizes the derived word lengths of output variables in each module. The variable word length of the auto-correlation matrix and the cross-correlation vector modules is larger because of the 128-point-long accumulation phase needed to obtain the estimation results. As mentioned in sub-section IV.C, the CORDIC module iteration number is set to 8.
Based on the word length settings specified in Table 2, Fig. 17 shows the SER curve simulations of the fixed point design versus the floating point version design and the original SMI scheme without any algorithmic simplifications. The results show that the three SER curves basically overlap. The magnified plot indicates the hardware implementation loss is merely 0.2dB (compared with the floating point design). A slight performance edge over the original scheme remains, which is consistent with the simulation results shown in Figure 7.

B. CHIP IMPLEMENTATION AND MEASUREMENT RESULTS
For a design to achieve its desired throughput, 8 complexvalued samples received from the 1×8 antenna array must be admitted every clock cycle. While the input data bandwidth is not an issue for on chip communications, it is problematic for off chip I/O communications when IC implementation is VOLUME 8, 2020 considered because of the extremely large pin count. In this implementation, a CLCC 84-pin packaging is used. To cope with the pin count limitation, only one complex-valued sample can be input per clock cycle. The target clock rate, as stated in Table 1, is equal to the baseband bandwidth, i.e., 160MHz. To receive all eight samples through a single input data port within one clock period, the I/O frequency must be 8 times higher, which is considered not feasible. Since our goal is to achieve silicon verification of the kernel design, a prolonged input latency plus an input data buffer to provide the required data bandwidth internally are acceptable. Similarly, only one data port is allocated for the output of weight vector, which takes 8 clock cycles to complete. A mode selection pin is used to switch the operations between SMI and MVDR beamforming. The synthesis result indicates the circuit complexity of the design is about 205k in gate count. 60% of them is for combinational logic and the remaining 40% is for FFs and sequential logic. To facilitate design testability, a scan chain consisting of 5,990 additional FFs is added to the chip design. This makes the total gate count slightly rising to 216k. Synopsys TetraMax is used to generate test patterns. With a set of 251 test patterns, the fault coverage is up to 99.16%. Fig. 18 shows the hardware complexity breakdown of the design.
The Schur decomposition module consumes the largest portion of the gate count. All other computing modules have similar gate count numbers. This is because their designs are all based on a linear systolic array structure. Fig. 19(a) shows the chip layout accomplished by using Synopsys IC Compiler, and Fig. 19(b) is a die photograph after fabrication. The core size is 0.68 mm 2 and the chip die size reaches 2.13 mm 2 after adding I/O pads for CLCC 84-pin packing. The chip measurements were conducted on an Agilent 93000 SOC Test System and the Shmoo plot subject to different power supply voltages and clock rates is illustrated in Figure 20.   The green tiles indicate the conditions tests pass. For a nominal power supply voltage (1V), the maximum clock rate is 208 MHz.
Note that when we developed the chip design, some design headroom is added to ensure the speed performance in face of process variations. This leads to an implementation slightly outpacing the required clock rate. 208MHz is the corner case and 200MHz can be considered a nominal clock rate functioning correctly in most cases. In real applications, the clock rate must be in line with the baseband bandwidth, i.e., 160MHz. We may lower the Vdd voltage setting to achieve power saving. The power consumption measurement results of the entire chip with a nominal 1V Vdd setting are 44.11mW @160MHz and 49.03mW @200MHz.
Finally, the chip design summary is listed in Table 3. As mentioned before, the prolonged input latency is due to the pin count limitation. If the design were used as an IP in the entire baseband chip design, this limitation can be lifted. The initiation interval is defined as the minimal time span between two successive initiations of beamforming vector weight computations. This is actually bounded by the interval of performing the auto-correlation matrix estimation. The throughput is calculated as the reciprocal of the initiation interval. It is 1.25M BF vectors/sec when the clock rate is 160MHz.

C. DESIGN COMPARISONS
In the proposed dual-mode beamformer design, the SMI algorithm and the MVDR algorithm share most of the computing modules, i.e., the Autocorrelation matrix, the Schur decomposition, M matrix, g vector, diagonal loading and the SMI weight calculation. The shared hardware accounts for about 70% of the entire design. This leads to a significant circuit complexity reduction when implementing two beamforming modes in a unified design.
Although there are many papers addressing the beamforming schemes (particularly those using the time domain approach), very few of them offer the hardwired solutions. Table 4 shows the performance comparison with 5 related designs found in the literature. Two [8], [10] of them support MVDR beamforming and the remaining three [9], [28], [29] use RLS beamforming. Besides the proposed one, design [9] is the only one with an ASIC implementation. It employs a QR decomposition scheme to accomplish the RLS algorithm based beamforming. It is designed for an antenna array of size 41. The design is implemented in a TSMC 0.13um process technology and can operate at 150MHz. The gate count is roughly 5 times the that of the proposed one and the power consumption is formidable (4W). No throughput information is available. Design [10] also adopts a QR decomposition scheme to solve the matrix inversion needed in the MVDR beamforming. The design is implemented in an FPGA platform and the clock rate is 250MHz. It takes 14,190 clock cycles to complete the beamforming weight calculation. Compared with the 128 clock cycles needed in the proposed design, its throughput rate is only one 50 th of the proposed one after normalization with the matrix size. The normalization factor is proportional to the N 3 . Designs [28], [29] implement a variant of RLS algorithm based on dichotomous coordinate descent iterations and the implementation technology is FPGA. The reported performance indexes are update rates, which are different from the throughput rate used in the proposed and other MVDR scheme based designs. In the RLS algorithm based design, the beamforming weights are updated incrementally until convergence. For the comparison purpose, we assume the iteration number needed for convergence is equal to the number of input sample vectors needed for the auto-correlation matrix estimation. This means the update rate should be divided by 128 to obtain an equivalent throughput rate. The simulation results in [28], [29] also indicate that hundreds of input sample vectors are needed for the convergence. The equivalent throughput rates of these two designs are much smaller than that of the proposed one even after the normalization. Design [8] realizes a Steered Minimum Variance (STMV) beamforming, which can compute beamforming vector weights from a single time-series snapshot and thus belongs to the MVDR category. The design adopts a programmable solution where a TigherSHARC digital signal processor (DSP) is used. It supports matrix size operations from 16 × 16 to 48 × 64. The cycle count is up to 30,457. Even the design can operate at the highest clock rate (600MHz) among all designs, its throughput rate still falls behind to the proposed one.
Because of the differences in implementation technology, problem size and beamforming scheme, it is difficult to develop other composite performance indexes for the comparison purpose. However, the throughput rate advantage of the proposed design is obvious. This is because the proposed design is the only one adopts a O(N 2 ) computing scheme, which gives it an inherent advantage in computing throughput.

VI. EXTENSION TO GENERAL MIMO CONFIGURATIONS
The target 1 × 8 MIMO system configuration suggests that there is only signal source (or data stream) from the transmitter. The proposed scheme, however, can be easily extended to handle the M × N configurations by treating it as a superposition of M of 1 × N sub-systems at the receiving end. In particular, the most complex computation in the beamforming schemes, i.e., the inverse of the auto-correlation matrix (R −1 xx ), is common to all signal sources and needs to be computed only once. So is the calculation of auto-correlation matrix. Only the parts distinct to each signal source need VOLUME 8, 2020 to be re-evaluated. For the SMI scheme, it's the estimation of cross correlation vector r xd and the weight vector calculation W MMSE = R −1 xx ·r xd . For the MVDR scheme, the DoA estimation scheme needs to identify the directions of all signal sources. The beamforming weight vector calculation, as shown in Eq. (2) needs to be recalculated for each steering vector h 0 . In chip implementation aspect, if the target throughput remains unchanged, multiple computing modules have to be employed and can operate in parallel. According to the hardware complexity breakdown chart shown in Fig. 18, circuit duplication is needed in only 4 out of the 9 modules, i.e., MVDR, SMI, cross-correlation (CC) and lambda, which account for 40% of the entire chip design. It's also possible to reuse existing modules multiple times to handle all signal sources successively. Observing from the timing diagram shown in Figure 15, except for the CC module, the remaining three modules are not fully occupied. By introducing one more copy of the CC module, we can double the throughput per signal source and the overall throughput is reduced by a factor of K /2, where K is the number of signal sources.
Beamforming conducted at the transmitter site is relatively easy because it does not have to deal with the signal interference and the noise contamination problems. The steering vector can be used directly to steer the beam direction. In 5G systems, as shown in the figure 21, a hybrid beamforming scheme is supported where baseband precoding and beamforming are integrated. The transmitter and the receiver designs are coordinated to determine the optimal beamforming performed at both ends. This is often regarded as a joint transceiver design. Basically, a singular value decomposition (SVD) is performed on the channel matrix. The right singular matrix (for the transmitter) is further decomposed as the product of a beamforming matrix and a precoding matrix. To reduce the computing complexity, the beamforming matrix is constructed by choosing appropriate steering vectors from a codebook using a heuristic scheme such as orthogonal matching pursuit. Interested readers are referred to [30], [31] for details. In most real system implementations, a straightforward beam search scheme is often employed. In this approach, the transmitter and the receiver simply enumerate all possible beam direction combinations to determine the best of each. This approach, however, is applicable to the case of one signal source.

VII. DISCUSSION AND CONCLUSION
A low complexity dual mode SMI/MVDR beamformer chip design was proposed. The design was carefully developed starting from the algorithm development, architecture mapping, circuit optimization to the final chip implementation. Since algorithm complexity plays a crucial role in hardware implementation, various algorithm simplification measures were applied. These include identifying the shared modules of two beamforming modes, approximating the Hermitian type auto-correlation matrix with a Toeplitz type counterpart, and developing an efficient computing scheme based on Cholesky and Schur decompositions for matrix inversion. The computing complexity was successfully reduced by an order, from O(N 3 ) to O(N 2 ). The ill-conditioned matrix inversion problem was properly tackled by a diagonal loading technique and no noticeable performance loss arises due to these algorithm relaxation / simplification measures.
The developed computing schemes were next mapped to a systolic array design containing 6 pipelined computing stages. The two beamforming designs coexist in this unified hardware architecture with 80% of the circuitry are shared. The timing specs of the chip design is compliant with the IEEE 802.11ac standard. To reduce the overall design gate count, complex-valued divisions were implemented by using the CORDIC scheme. A scan chain was also added to improve the testability.
The implementation result on a TSMC 90nm UTM process technology indicates a core size of 0.68 mm 2 and a die size of 2.13 mm 2 . The equivalent logic gate is merely 216.64k. The chip can work successfully at a nominal 160MHz clock rate and can further operate up to 200MHz with a power consumption of 49.03mW. It shows the highest throughput among 6 related beamforming hardware designs. The initiation interval of successive beamforming vector weight updates for a sized 8 antenna array is as small as 128 clock cycles, which is only 0.64us when operating at the highest clock rate (200MHz). The low complexity and high throughput features of the proposed design make it very suitable for applications in massive MIMO communication systems.
Regarding the future work, we have discussed the design extension to a general M × N MIMO system. The principle is treating the system as a superposition of M 1×N subsystems. Since the inverse of the auto-correlation matrix (R −1 xx ), is common to all signal sources, it is computed once and the corresponding computing module can be shared by M sub-systems. Only the weight vectors need to be calculated for each signal source. We may either introduce M parallel computing modules to maintain the throughput or use one module only to process them sequentially.