Efficient Hardware Realization of a New Variable Regularized PAST Algorithm With Multiple Deflation

This paper proposes a new variant of the projection approximation subspace tracking (PAST) algorithm with multiple deflation (MD) and its efficient hardware architecture. It extends the PAST with deflation (PAST-d) algorithm by performing multiple deflations at each step and relies on a recently introduced variable forgetting factor, and variable regularized PAST algorithm to improve the overall convergence rate, steady-state error, and numerical properties. It shares the same simple hardware structure of the PAST-d algorithm in pipeline realization but offering a more flexible tradeoff between complexity and performance. Moreover, methods for estimating the eigenvalues and the dimension of the signal subspace are proposed. Novel simplifications of the proposed variable forgetting factor (VFF) and variable regularization (VR) PAST-MD algorithm are also developed to avoid the expensive cubic root and division operations involved to facilitate its hardware implementation. Moreover, a combined data-regularization update is introduced to avoid the additional QR decomposition (QRD) update associated with the regularization, at the expense of very slight performance degradation. A novel pipelined hardware implementation of the simplified VFF-VR-PAST-MD algorithm based on the QRD and the COordinate Rotation DIgital Computer (CORDIC) is also proposed and implemented in Xilinx field programmable gate array (FPGA). Thanks to the proposed “root- and division- free” schemes, our proposed architecture can achieve around 20.2% higher working speed and save 1.9% lookup tables (LUTs), 1.8% slice register, and 22.8% digital signal processors (DSPs) over conventional implementation of the proposed architecture. Compared to the previous work, which is based on PAST-d algorithm, the proposed QRD-based algorithms offer better performance and a more flexible tradeoff between hardware resources and performance.


I. INTRODUCTION
Subspace estimation and tracking have important applications in array signal processing [1], [2], system identification [3], [4], speech processing [5], directions of arrival (DOA) estimation in radar, sonar, and mobile/wireless communication systems [6]- [8], etc. For instance, subspace-based methods have been proposed for high-resolution spatial domain spectral analysis in multiple signal classification (MUSIC) method [9], the minimum-norm method [10], the estimation of signal parameter via rotational invariance techniques The associate editor coordinating the review of this manuscript and approving it for publication was Diego Oliva .
(ESPRIT) method [11] and the weighted subspace fitting (WSF) method [12]. These methods are also widely used for estimation and tracking of DOA in antenna arrays. Conventional methods for computing the subspace are usually based on the batch eigenvalue decomposition (ED) or singular value decomposition (SVD) of the data covariance matrix, which can be computational intensive, especially for moving sources. To reduce the arithmetic complexity, efficient subspace tracking algorithms with much lower arithmetic complexity have been proposed [1], [13]- [17]. An efficient algorithm is the projection approximation subspace tracking (PAST) [13] method, which has a low computational complexity of only O(Nr), where N and r denote the dimension of the input data vector and the dimension of the subspace. Other effective subspace tracking algorithms include the orthonormal PAST (OPAST) algorithm [18], bi-iteration SVD algorithm [16], Bi-LS [19], QS-decomposition-based algorithms [20], fast approximated power iteration (FAPI) [14], YAST [21], etc. For more information on these and other related algorithms, please refer to [22] for more details.
Compared with other algorithms, the conventional PAST algorithm and related algorithms such as the OPAST algorithm are based on the recursive least squares (RLS) algorithm using a fixed forgetting factor (FF). Consequently, efficient hardware structure such as QR decomposition (QRD) using the COordinate Rotation DIgital Computer (CORDIC) technique can be used for its efficient realization. On the other hand, it is known that using a variable FF in the RLS algorithm can considerably improve its tracking speed in time-varying environment and steady-state error in stationary environment [23]- [26]. Another possible problem of the conventional RLS algorithm is that the data covariance matrix may become ill conditioned when the input is not persistently exciting. This is often encountered in situations with signal fading where the signal power level drops rapidly and hence, the estimation error may increase dramatically. To tackle this problem, regularization techniques are commonly employed to reduce the estimation variance [27].
Due to the wide applications of subspace techniques in real time systems, it is highly desirable to develop an efficient hardware for realizing the PAST algorithm. To our best knowledge, hardware realizations of subspace tracker are limited and pioneering works can be found in [28]- [31]. In [29], an efficient parallel implementation of the ED algorithm was proposed. Another novel hardware structure based on a special case of the PAST algorithm called the PAST with deflation (PAST-d) algorithm was proposed in [31]. The PAST-d algorithm extracts the subspace vectors one at a time using the PAST algorithm via the deflation technique. Therefore, its can be realized efficiently in a pipelined structure with a much lower hardware per stage than direct implementation using the QRD [31]. In fact, the hardware complexity of the CORDIC implementation of QRD grows with O(r 2 ), which can be substantial if r is large. Moreover, it is rather complicated to vary r in applications where r is adaptively determined. For PAST-d, one can cascade more pipelining stages to determine the appropriate dimension to be used. For QRD, increasing r may need a triangular array with much more elements working in pipeline. However, due to possible error accumulation of the deflation technique and the use of a fixed FF, the steady-state error and tracking speed of the PAST-d algorithm have to be considerably compromised.
Motivated by the needs for a subspace tracking algorithm with good tracking performance and an efficient hardware implementation, we propose in this paper a novel variable FF (VFF) and variable regularized (VR) PAST algorithm with multiple deflation (MD), and its efficient hardware architecture. We first extend the conventional PAST-d algorithm to perform multiple deflations at each step, which can be achieved by applying successively the basic PAST algorithm with a subspace of dimension P 1, instead of restricting it to one as in the PAST-d algorithm. This leads to a faster algorithm and improved numerical properties at the expense of slightly increased complexity at each stage. However, since P is fixed, a QRD implementation would require a fixed complexity of O(P 2 ). Moreover, multiple such QRD modules can be cascaded for pipeline implementation and can be made variable to determine the appropriate dimension of the subspace to be used. Thus, the proposed PAST-MD algorithm shares the same hardware advantage of the PAST-d algorithm in reusing the same basic hardware module for modular and pipeline realization. Compared with the basic PAST-d implementation, the error accumulation will be reduced leading to better overall accuracy. Therefore, by choosing an appropriate subspace dimension at each deflation step, P, a more flexible tradeoff between hardware complexity and accuracy over the conventional PAST and PAST-d algorithms in [13], [17], [31] can be achieved. Moreover, we show that the eigenvalues and the dimension of the signal subspace can also be estimated conveniently from the score and residual vectors after each deflation. More precisely, the eigenvalues associated with each extracted subspace can be obtained from the roots of the characteristic equation associated with the covariance matrix of the score vector. Using the ratio of the power of the residual vector to that of the input power, one can also determine the dimension of the signal subspace to retain a given fraction of input power in the signal subspace. To compensate for the degraded convergence and steady mean square error of the PAST-MD algorithm due to the use of multiple deflation, we focus on a recently proposed VFF-VR PAST algorithm [17] as the basic hardware module for improving the convergence speed, steady-state error and stability of the proposed PAST-MD algorithm. In particular, we shall make use of the VFF-VR PAST algorithm in [17] at each stage of the deflation to improve the tracking speed, steady state error and stability. The algorithm in [17] models the channel using a local polynomial and optimizes the FF as well as the regularization to minimize the asymptotic MSE.
Since the VFF scheme of the VFF-VR PAST algorithm in [17] involves cubic root and division operations, novel techniques are proposed in this work to reduce its processing delay and hardware complexity. In particular, novel ''rootand division-free'' discretized VFF and VR schemes are introduced. Moreover, a combined data-regularization update is introduced to avoid the additional QRD update required for incorporating the regularization, at the expense of very slight performance degradation. Furthermore, the constant coefficient multiplications involved in the proposed VFF and VR schemes are implemented using the canonical signed digits (CSD) or sum-of-power-two numbers [32] resulting in multiplier-less realization.
Finally, we propose a novel pipelined hardware implementation of our proposed VFF-VR-PAST-MD algorithm. The architecture extracts a subspace of P dimension from the input at each pipelined stage using the proposed modified VOLUME 9, 2021 VFF-VR PAST algorithm. The QR decomposition and the COordinate Rotation DIgital Computer (CORDIC) technique are used to realize the basic PAST algorithm due to its good numerical property and attractive parallel implementation. Moreover, the three-angle complex rotation (TACR) [33] is adopted to simplify the Givens rotation for complex data. Due to the ''root-and division-free'' schemes, the excessive delay due to division can be avoided, resulting in a lower hardware resources and higher throughput compared to the conventional implementation. To verify the efficiency of the proposed approach, the proposed QRD-based VFF-VR-PAST-MD architecture is implemented in Xilinx Vertex 7 (XC7VX980T) field programmable gate array (FPGA). Compared to the previous work in [31], which is based on the PAST-d algorithm, our proposed QRD-based VFF-VR-PAST-MD algorithms offer better convergence and steady state error performances and a more flexible tradeoff between hardware resources and performance. For a 10-element uniform linear array (ULA), the proposed architecture can be implemented in 22-bit wordlength with a very impressive maximum operating speed of 143 MHz for different values of P at 5 MHz sampling rate. Compared with the conventional implementation of the proposed architecture using multiplier and divider, our proposed work can achieve around 20.2% higher working speed and save 1.9% LUTs, 1.8% Slice Register, and 22.8% DSPs, respectively.
The rest of this paper is organized as follows: the proposed VFF-VR-PAST-MD algorithm will be introduced in section II. The hardware-friendly QRD-based VFF-VR-PAST algorithm and its efficient architecture are described in sections III and IV respectively. Computer simulation, FPGA implementation and comparison with other conventional works will be presented in section V and conclusion is drawn in section VI.

II. THE PROPOSED VFF-VR-PAST-MD ALGORITHM A. SUBSPACE TRACKING AND DOA TRACKING
Subspace technique is frequently used in DOA estimation. Consider for instance a uniform linear array (ULA) with L omni-directional and identical sensors impinged by K far-field narrow-band uncorrelated sources s k [n], k = 1, . . . , K with DOAs θ 1 , . . . , θ K , respectively. The signal vector recorded from the L sensors at time instant n can then be described by the following signal model: ] T denote the sensor signal and the source signal vector respectively. A(θ) = [a(θ 1 ), a(θ 2 ), . . . , a(θ K )] contains the steering vectors associated with the K DOAs, θ = [θ 1 , θ 2 , . . . , θ K ] T , of the sources, and η[n] is the additive sensor noise vector which is modeled as an independent and identically distributed (IID) white Gaussian noise (AWGN) with zero mean and covariance matrix σ 2 I, where I is the identity matrix. For ULAs, the steering vector of a source with angle θ can be written as a(θ) = [1, e j2π λ −1 d sin(θ) , . . . , e j2π λ −1 (L−1)d sin(θ) ] T , (2) where λ is the wavelength of the propagating signals and d is the inter-sensor spacing. From (1), one gets the following relationship between the signal covariance matrix R xx and the matrix A(θ) where ] is the source signal covariance matrix. As the K source signals are uncorrelated, the covariance matrix in (3) can be expressed in terms of the signal and noise subspaces as follows where U S = [u 1 , u 2 , . . . , u K ] and U N = [u K +1 , u K +2 , . . . , u L ] are respectively the signal subspace and noise subspace, and S and N are diagonal matrices containing the eigenvalues associated with the K source signals and sensor noise, respectively. In practice, the covariance matrix is estimated from snapshots of the signal vector , where M is the number of snapshots. Since the desired steering vectors are in the signal subspace and they are orthogonal to the noise subspace, this implies that a H (θ i )U N = 0. Hence, the DOAs can be found from the local peaks of the MUSIC spectrum through grid search in the angles given the steering vector of an array. When the number of sources is small, it is advantageous to estimate the signal subspace U S and hence the noise subspace by U N = I − U S U H S . For DOA tracking, the continuous estimation of U S is computational intensive and recursive algorithm such as the PAST algorithm can significantly reduce the arithmetic complexity. In this paper, we shall focus on the applications of the proposed VFF-VR-PAST-MD algorithm and hardware architecture to the problem of DOA estimation and tracking.

B. THE PROPOSED PAST ALGORITHM WITH MULTIPLE DEFLATION (PAST-MD)
The signal subspace containing the majorP eigenvectors of R xx can be recursively estimated from x[n] by minimizing the following least squares (LS) function  2 2 is the sum of the squared values of its components, the PAST algorithm can also be viewed as the following L LS sub-problems with the same inputȳ[i] . It should be noted that these sub-problems are independent of each other and hence, they can be solved independently and the optimal solution to (7) is where Applying the RLS algorithm for solving (8), one gets the PAST algorithm in Table 1.

1) PAST-D AND PAST-MD
In the PAST with deflation (PAST-d) algorithm, the subspace is extracted one by one by means of deflation. Specifically, let the optimal weight vector at the first stage of the deflation process be w (1) [n] and x (0) [n] = x[n]. Then, the residual vector after extraction, x (1) (1) [n]ȳ (0) [n], will consist of components from the remaining subspace and the process can be repeated for x (1) [n] and so on to give and hence the eigenvectors can be successively extracted as w (r) [n], r = 1, . . . ,P. The final subspace has a dimension of P. The PAST-d algorithm is summarized in Table 2.
One of the advantages of the conventional PAST-d algorithm is its simplicity, especially for hardware implementation as the basic module for extracting one eigenvector can be time multiplexed or pipelined for efficient realization. On the other hand, the deflation process in (9a) may affect the orthogonality of the vectors extracted due to error accumulation and its convergence rate will also be slower. Here, we extend the conventional PAST-d algorithm to perform multiple deflations at each step. This leads to a faster algorithm and improved numerical properties at the expenses of slightly increased complexity. More precisely, we propose to extract a subspace of small dimension successively instead of one. This is possible by using successively the basic PAST algorithm with a subspace of dimension P, say, as follows where W (r) [n] ∈ C L×P , R denotes the number or stage of deflation and the final subspace dimension has a size of RP. Specifically, when P = 1, the PAST-MD algorithm will be reduced to the PAST-d algorithm.
In general, the size of the subspace extracted at each step of deflation may be different. Of course, using the same size has the advantage of possible reuse of hardware modules and modular hardware realization in pipeline realization. Thus, the proposed algorithm offers a more flexible tradeoff between arithmetic and hardware complexity over the conventional PAST and PAST-d algorithms. The proposed PAST-MD algorithm is summarized in Table 3. A comparison of arithmetic complexity of the PAST, PAST-d, and PAST-MD algorithms is given in Table 4. We now proposed novel VFF and VR schemes for improving the convergence speed, steady-state error and stability of these algorithms. The hardware architecture for their efficient realization will be discussed later in Section IV.

2) EIGENVALUES ESTIMATION AND ADAPTIVE SUBSPACE DIMENSION DETERMINATION
One of the advantage of the PAST-d algorithm is its simplicity in estimating the eigenvalues of the subspace, which is in VOLUME 9, 2021 fact the power of the projected scoreȳ (r−1) [n] in (9b). In the PAST-MD algorithm, each extract subspace has a dimension of P, and hence the projected scoreȳ (r−1) [n] will be a P-th vector. Since W (r) [n] ∈ C L×P only span the subspace and they are not eigenvectors, one cannot estimate the corresponding eigenvalues from the power of the elements inȳ (r−1) [n] as in PAST-d. In fact, the eigenvalues can be estimated from the eigenvalues of the covariance matrix of Since P is usually not a very large number, the eigenvalues of C y (r−1) can be estimated by solving the roots of the characteristic equations as follows Moreover, for P less than or equal to 4, analytical formulas are available to compute all the roots of p (r−1) (λ), which allows the eigenvalues to be computed on the fly. From our simulation in the supplementary materials, the eigenvalues can be estimated quickly with high accuracy. Another advantage of the PAST-d and PAST-MD algorithms is its convenient in determining the dimension of the subspace to be tracked. This can be done by examining the ratio of the powers of the residual vector at the r-th deflation step to that of the input: E (r) represents the fraction of the input energy in the residual vector after the r-th stage of deflation. Therefore, if it is sufficiently small, one can terminate the deflation process and determine the dimension of the signal subspace of interest. Next, we shall focus on methods to improve the convergence speed and the steady-state error of the basic PAST algorithm.

C. THE PROPOSED VFF-VR-PAST-MD ALGORITHM
The FF of the PAST algorithm plays an important role in its convergence speed and steady-state error. In stationary environment, a large FF is desirable to achieve a low mean square error (MSE). For time-varying environment, a relatively small forgetting factor should be employed to achieve fast tracking speed. Thus, a variable FF is desirable for both stationary and dynamic environment. Due to the close relationship between the RLS algorithm and the PAST algorithm, the VFF-VR scheme for real-valued RLS algorithm [34] recently proposed by one of the authors can be extended to the complex case [17]. In particular, the steady-state MSD of w l [n] in (7) under the local polynomial time-varying model of the RLS is given by [34], where σ 2 l and σ 2 w l represent the total error variance E[ e[n] 2 2 ] and the variance of the true parameter w l [n], respectively. For notational convenience, we shall drop the subscript (r) in describing the operations at the r-th deflation step in the subsequent discussion. Thus, the VFF and VR procedures described will be applied to the PAST algorithm at each stage of deflation. The expression is the same as the real-valued case but the variance σ 2 l and σ 2 w l are for the complex noise and weight vector. The estimation of these quantities will be discussed later in this section. Since the mean squares deviation (MSD) of the PAST estimator is equal to the sum of all its component, one can obtained from [34] the MSD of W [n] by summing (13) for all l = 1, . . . , L, which yields where σ 2 = L l=1 σ 2 l and σ 2 W = L l=1 σ 2 w l denote the total noise variance and variance of the true parameter W [n] (or system variance), respectively. Similarly to [34], the locally optimal λ[n] at time n, which minimizes J MSD [n], can be obtained by partial differentiating (14) with respect to λ[n] and setting the resultant derivative to zero, which yields ) as in [34], (13) can be reduced to Assuming that the value of µ is much larger than one when λ[n] is near to one, we have µ − 1 ≈ µ and the desired FF can be simplified to where λ L and λ[n] denote respectively a large fixed FF and the FF obtained at time n. Due to the use of a large FF to estimate the steady-state noise, it may experience a lag when sudden system change occurs. To address this issue, a system change detection scheme is employed based on the long-term and short-term estimates of noise variance, in which (17) can be regarded as a long-term estimateσ 2 _L and the short-term estimate is given bŷ where λ S is a relatively small FF compared to λ L . Suppose thatσ 2 _S χσ 2 _L [n] for some sufficiently large constant χ, it suggests that there is a sudden system change. Under this situation, the system variance is roughly equal to the total error variance and a smaller FF should be used for the latter update. Hence, we suggest to estimate , the algorithm is converging and hence the system variance should be updated by (18). This suggests the following equation for updatingσ 2 W [n], shown at the bottom of this page, and χ is an algorithmic parameter which can be chosen as 4, which corresponds to 99.99% confidence that system changes have been detected. From simulation results to be presented in Section V, it is found that the VFF scheme can improve considerably the convergence speed and steady-state error over the conventional PAST algorithm.
In practical applications, the input power may be time-varying and may not be persistently exciting especially at low signal level, the covariance matrix may be in poor condition or even singular, which may affect numerical stability of the PAST algorithm as it is based on the RLS algorithm. To tackle this problem, the regularized RLS algorithm can be used where a regularization term κ[n]D is added to the covariance matrix Rȳȳ[n] in (8) to improve its condition number and hence the variance of the estimator in exchange for a certain bias. The resulting solution is given by where D is a positive definite symmetric matrix and κ[n] 0 is a regularization parameter. Here, we adopt the variable regularization parameter proposed in [27] with D = I and where σ 2 and σ 2 y represent the total noise power and signal power, respectively. W 0 denotes the theoretical weight vector and γ . Since the subspace vectors should be orthogonal to each other and has unit norm, hence W 0 2 2 is equal to P. The arithmetic complexity comparison of various algorithms is shown in Table 4. The arithmetic complexities of the PAST and PAST-MD algorithms are 3LP + O(P 2 ) and 4LP + O(P) per update, respectively, whereP is the dimension of the tracked subspace and L is the dimension of the input data vector. The proposed VFF and VR schemes require additional 3LP + 3L + 7P and 2P operations, respectively, per update. Hence, the total arithmetic complexity of the proposed VFF-VR-PAST-MD is 7LP + 3L + 9P + O(P). It should be noted thatP = RP and when P is equal to one, the PAST-MD approach will reduce to the PAST-d algorithm.
It can be noticed that the proposed approach has comparable and acceptable arithmetic complexity compared to other algorithms.

III. HARDWARE-FRIENDLY SIMPLIFICATIONS OF THE PROPOSED VFF-VR-PAST-MD ALGORITHM
In Section II, we have proposed an efficient class of VFF-VR-PAST-MD algorithms. The proposed VFF and VR schemes lead to better convergence speed and steady-state error. Using deflation with more than one vector at a time, a more flexible tradeoff between arithmetic complexity, convergence speed and numerical accuracy can be achieved. Moreover, it is amenable to hardware implementation in a pipelined manner through iterative reuse of the basic RLS hardware module.
The basic VFF-VR-PAST-MD algorithm however still present certain challenges to efficient hardware implementation: 1) Firstly, from the expression of FF in (16), it can be seen that it requires the cubic root operation, which is rather complicated to evaluate and implement in hardware. Secondly, it also involves a division, which may introduce much latency in pipeline implementation and hence lead to a lower operating frequency. 2) The basic RLS is rather sensitive to numerical error and hence QR decomposition (QRD)-based algorithm using CORDIC is usually preferred. In this case, the term Tr(R −1 yȳ ) is expensive to compute since QRD updates only the Cholesky factor of Rȳȳ and computinĝ VOLUME 9, 2021 R −1 yȳ is expensive. In [34], it is proposed to approximate Rȳȳ as a diagonal matrix so that R −1 yȳ can be more conveniently computed from the reciprocal of the diagonal elements of Rȳȳ. Though this approach considerably simplifies the updating of the VFF and yields good performance, the division operations may limit the maximum operating frequency in hardware implementation. Moreover, the incorporation of the regularization term in the QRD usually requires an additional QRD update, which will also slow down the throughput of the QRD RLS algorithm.
We now propose several novel techniques to facilitate the hardware implementation of the proposed VFF-VR-PAST-MD algorithm and the overall architecture will be presented later in Section IV. In particular, a novel ''cubic root-and division-free'' discretized VFF and VR schemes will be introduced for addressing the first challenge mentioned above. A novel ''division-free'' method is also proposed for computing Tr(R −1 yȳ ). Furthermore, a combined data-regularization update is introduced to avoid the additional QRD update, at the expense of very slight performance degradation.

A. THE DISCRETIZED VFF AND VR SCHEMES
From the expression of FF in (16), it can be seen that it requires the cubic root operation, which is rather complicated to evaluate and implement in hardware. Moreover, it also involves a division, which may introduce much latency in pipeline implementation and hence lead to a lower operating frequency. To tackle this problem, we propose a novel discretized FF scheme to avoid these expansive operations. More precisely, we quantize the FF inside the given range, say [0.9, 1], into a set of representative FF values and select the best one with the lowest MSD as given by (13).
whereλ j , j = 1, . . . , F represent the set of F discretized values of FF. Moreover, these representative values are chosen as canonical signed digits (CSD) or sum-of-powers-of-two (SOPOT) coefficients so that the terms (1 −λ j )/(1 +λ j ) andλ j (1 −λ j ) −2 involvingλ j above can be pre-computed and represented as CSD. Since multiplication with CSD can be implemented as additions and hardware shifts, these multiplications can be realized in additions only. Moreover, the simultaneous multiplication of σ 2 Tr(R −1 yȳ ) and σ 2 W with different values of (1 −λ j )/(1 +λ j ) andλ j (1 −λ j ) −2 for j = 1, . . . , F can be implemented as a multiplier block [35]. From the simulation to be presented in Section V, it is found that F = 4 is sufficient to give a performance close to that of (16)). Therefore, the proposed discretized FF significantly reduces the arithmetic and hardware complexity in realizing the VFF-PAST algorithm.
Meanwhile, since √ γ /P in updating the VF in (8) is also a function of these discretized FFs, they can again be pre-computed and represented as CSD for multiplier-less realization. Using the selected FF, sayλ j , the corresponding value of γ j /P can then be multiplied to σ 2 η σ 2 y . Consequently, the arithmetic and hardware complexity of computing the variable regularization in (22) can also be reduced significantly. In summary, the discretized VFF-VR version further simplifies the computation of the VFF and VR schemes, which avoids the expensive cubic root and division operations and simplifies the evaluation of regularization parameter.

B. QRD-BASED VFF-VR-PAST ALGORITHM
In the QR-RLS algorithm, the rank- ] allows us to reduce the QRD update from two to one. Simulation results show that satisfactory performance can be achieved with a slight performance degradation. The reason is that normally κ[n] is close to zero when the Rȳȳ[n] is well-conditioned. During input signal fading, the presence of a regularization term greatly improves the condition number and hence reduces the variances of the estimator. On the other hand, the exact value of the regularization term is less critical. This variant of the proposed algorithm is referred to as combined VR (CVR) scheme and the resultant algorithm is called VFF-CVR-PAST. As mentioned earlier, when computing the term Tr(R −1 yȳ [n]) for updating the VFF, Rȳȳ[n] is treated as a diagonal matrix and its p-th diagonal value is recursively estimated as follows whereȳ p [n] is the p-th element ofȳ[n] and λ p ∈ (0, 1] is a forgetting factor. Computing Tr(R −1 yȳ [n]) amounts to taking the reciprocal of σ 2 y p [n], p = 1, . . . , P, and adding them together. This requires divisions which may limit the maximum operating speed in hardwire implementation. Here, we propose to quantize σ 2 y p [n] to discrete levels so that the reciprocals of these values can be precomputed and implemented as multiplications. Moreover, since these are constant coefficient multiplications, they can be implemented in CSD using limited additions and shifts. More precisely, suppose that the sensor signals are in normalized fixed-point format with a magnitude less than one. Then 2 2 = L since w p , the p-th column of W , has unit norm. As its power is limited to L, it makes sense to quantize its value to a set of reconstruction levelsŝ j , j = 1, . . . , q, where q is the total number of discrete levels in approximating σ 2 y p [n]. The quantization process can be rewritten aŝ From simulations to be presented in Section V, this approximation gives satisfactory performance when σ 2 y p [n] is uniformly quantized to 16 levels for a L = 10 sensor array. For an L-length sensor antenna, simulated experiments show that discrete levels is 2 log 2 2L or 2 log 2 4L where a denotes the nearest integer smaller than a. Sinceŝ j , j ∈ {1, . . . , q}, is a set of constants, their reciprocals can be computed offline asŝ −1 j , j ∈ {1, . . . , q}. Consequently, the division is now replaced by a constant multiplication from one of these values. Moreover, since they are constant coefficients multiplications, they can be represented as CSD for multiplier-less realization. Our derived efficient QRD-based VFF-VR-PAST-MD algorithm is summarized in Table 3.

IV. HARDWARE ARCHITECTURE OF THE PROPOSED QRD-BASED VFF-VR-PAST-MD ALGORITHM
According to the functionality of each operation, the proposed architecture for QRD-based VFF-VR-PAST is divided into six processing units, including projection approximation unit, variable forgetting factor unit, variable regularization unit, QRD-RLS weight extraction unit, and error computation unit. The processing units work in a pipelined manner and the data flow diagram of the proposed architecture is shown in Fig. 1. This unit-based design can simplify the modification of the hardware architecture for other alternative subspace tracking algorithms. The structures and functionalities of these processing units at the r-th iteration are briefly summarized as follows.

A. PROJECTION APPROXIMATION UNIT
The unit computes the projection approximationȳ[n] = W H [n − 1]x[n] as matrix-vector multiplication and its structure is illustrated in Fig 2. B. VARIABLE FORGETTING FACTOR UNIT As we described in Section II-B, the computation of VFF can be simplified by using the discretized FF scheme. A set of VFF candidates is evaluated with (23) and the one with the minimum value will be chosen as the desired FF. The block diagram of this unit is illustrated in Fig. 3 where the grey area denotes the evaluation of (23) for a particular discretized FF. According to (23), the computation is further divided into three subunits for σ 2 , Tr(R −1 yȳ ) and σ 2 W , which will be described below.  2 2 can be realized using the SOPOT coefficients shown in Table 6 as a series of additions and shifts. The block diagram of this subunit is illustrated in Fig. 4.

2) Tr(R −1 yȳ ) COMPUTATION SUBUNIT
The subunit is based on the proposed division-free approach where the quantization operation in (25) is performed by a quantizer, where the input is successively compared with a set of thresholds arranged in a tree-like structure to determine which region it belongs to. Once the desired region or interval is determined, the pre-stored reciprocalŝ −1 j can be forwarded to a multiplier for approximating the division. In contrast to a division, it only requires log 2 q comparisons, where q is the total number of discrete levels. As DSP blocks in FPGA device is very valuable, our division-free method using the look-up-table to replace the division will result in considerable saving of DSPs. For uniform quantization, the quantizer can be further simplified to the structure in Fig. 5. The diagonal values of the approximated R −1 yȳ are sent to an adder tree for accumulation.

3) σ 2 W COMPUTATION SUBUNIT
From (17)- (20), we can see that the computation of σ 2 W andσ 2 _S has the same computation operations withσ 2 _L . Consequently, the structure for (IV-B1) can also be used for computing σ 2 W andσ 2 _S in (IV-B3). Due to the page limitation, the block diagram of this subunit is omitted.

C. VARIABLE REGULARIZATION UNIT
From (22), we can see that the value of κ[n] is directly dependent on λ[n]. Forming the term σ 2 σ 2 y requires a multiplier for computing the product of σ 2 and σ 2 y and a square root operation. The term γ j P can be precomputed and represented as SOPOT coefficient, which is presented  in Table 6, to simply its multiplication with σ 2 σ 2 y . The block diagram of this unit is illustrated in Fig. 6.

D. QRD-RLS WEIGHT EXTRACTION UNIT
The Givens rotation implementation of QRD is attractive for its efficient parallel hardware implementation [36]- [38] using say the CORDIC algorithm. For the conventional implementation of complex-valued QRD, the complex Givens rotation, Q[n], is employed to zero out the lower left row of the matrix on the right hand side of ( * ) in Table 3  . The process is repeated until the entire row is zeroed out. In summary, the QRD starts with the 1 st row, and use its first element with the newly appended row to compute the corresponding rotation and apply it to the remaining elements of the two rows. This is then repeated and for the j-th step, it takes in the j-th row and the last row and uses its leading nonzero coefficients to determine the rotation and apply it to the remaining elements.
The weight vector can be solved from W H [n] = R −1 [n]U[n] by back-substitution using a divider, which results in slow working speed and high hardware consumption. An alternative method based on the extended QRD-RLS algorithm [13] [39] can provide a fully concurrent computation for the weight extraction as (26), shown at the bottom of the next page, and This algorithm can be easily implemented with a double triangular systolic array [13] [38], which is illustrated in Fig. 7. In our proposed structure, an alternative complex Givens rotation matrix [33], namely three angle complex rota- tion (TACR), is used, which provides significant reduction of latency. In the TACR method, two Givens rotations are operated sequentially in two stages, and each stage has a similar hardware implementation. The basic idea of TACR-based QRD can be illustrated by an example of 2 × 2 complex-valued matrix A where we wish to zero out the lower-left element of the matrix by Givens rotation: x 1 e jθ x 1 x 2 e jθ x 2 y 1 e jθ y 1 y 2 e jθ y 2 , where j = √ −1, x 1 , x 2 , y 1 , y 2 are the magnitudes and θ x 1 , θ x 2 , θ y 1 , θ y 2 are the angles of the complex entries in A. In particular, the complex Givens rotation matrix can be expressed as G = cos(θ 1 )e −jθ x 1 − sin(θ 1 )e −jθ y 1 sin(θ 1 )e −jθ x 1 cos(θ 1 )e −jθ y 1 = cos(θ 1 ) − sin(θ 1 ) sin(θ 1 ) cos(θ 1 ) · e −jθ x 1 0 0 e −jθ y 1 .
The computation process is performed in two stages.
Then, the real-valued Givens rotation matrix is applied in the second stage to introduce a zero at the desired position From (29), we can see that the TACR method converts the diagonal elements in real numbers except the one at the lowest position, which implies that it requires one additional rotation. This operation is illustrated in Fig. 8. The required Givens rotation in the TACR can be efficiently implemented based on the CORDIC algorithm mentioned earlier. The CORDIC-based QRD hardware structure for processing each pair of input rows has its leading element called vector normalization mode (NM), which determines the rotation angle θ. Then, this angle will be applied to other CORDIC elements operating in the vector rotation mode (RM) in parallel for rotating the remaining elements in the two rows. This operation can be pipelined and it suggests an implementation in form of double triangular systolic array as shown in Fig. 7. The fully pipelined triangular CORDIC-based QRD will require (P+1)P/2 elements which can be area and resource intensive for a large P. To reduce the hardware resources, one can employ a linear pipelined CORDIC structure for processing two rows at a time and use it to process the N pairs of rows consecutively. This provides

R[n] U[n] R −H [n] 0 H c[n] g[n]
=   another tradeoff between hardware resources and processing speed and is attractive for lower speed applications. Yet another method to reduce the arithmetic/hardware complexity is to use a small value of P and then use multiple stages of deflation, say R, to compute a subspace of size RP. In this case, the above order-P CORDIC-QRD PAST can be cascaded to yield very high throughput with modest complexity at the expenses of slightly larger error and slower convergence speed due to error accumulation during the deflation process. The resulting hardware and arithmetic complexity is of the order R(P + 1)P/2, rather than (RP + 1)RP/2 if the conventional PAST is used.

This unit computes the error vector e[n] = x[n] − W [n]ȳ[n]
, where each inner product can be realized using L complex multipliers and an adder tree is used to accumulate the final result as shown in Fig. 9. The small grey rectangle stands for one time delay.

F. DEFLATION COMPUTATION UNIT
The unit computes the deflation operation x (r) (r) [n]ȳ (r−1) [n] and its structure of this unit for x (r) i [n] is illustrated in Fig. 10.

V. SIMULATION AND IMPLEMENTATION RESULTS
Simulation results are now presented to evaluate the performance of the proposed QRD-based VFF-VR-PAST-MD, VFF-PAST-MD, and other conventional work [31] in DOA estimation and tracking. Simulations are performed on a ULA with 10 sensors separated by half wavelength. Both stationary and dynamic environments are investigated. Unless specified otherwise, all results are averaged over 100 Monte-Carlo simulations. A Verilog described fixed-point VFF-VR-PAST-MD is also designed based on our proposed architecture to evaluate the performance of the hardware implementation. The SOPOT coefficients of the parameters used in our simulation as presented in Table 6. This architecture has been simulated and synthesized using Xilinx ISE 14.7 and successfully implemented on Xilinx Vertex 7 (XC7VX980T) FPGA, and the implementation results are shown in Table 5. Considering the tradeoff between hardware resource and system performance from our simulation result to be presented later, the wordlength for the QRD-based VFF-VR-PAST-MD algorithm is chosen as 22 bits. The resultant pipelined implementation achieves an impressive maximum working speed of 143 MHz at 5 MHz sampling rate for different values of P. If the proposed VFF-PAST-MD approach is implemented using multipliers and dividers (denoted by VFF-CVR-PAST-MD where C is referred to as combined update) with the same P value, our proposed hardware architecture can achieve around 20.2% higher working speed and save 1.9% LUTs, 1.8% Slice Register, and 22.8% DSPs, respectively. On the other hand, the full implementation of the proposed VFF-VR-PAST-MD architecture can achieve the same working speed with our proposed one, it will double the latency at the QRD update, which leads to decreased system throughput. From the implementation results, we can also see the VR scheme only requires a slight increase in hardware, but it can improve considerably the system robustness. In summary, the proposed architecture offers high throughput rate and different tradeoffs between hardware resources and performances, which will be further elaborated in the following section.

A. DOA ESTIMATION IN STATIONARY ENVIRONMENTS
In this simulation study, four uncorrelated narrow band signals located at directions θ 1 = 8 • , θ 2 = 20 • , θ 3 = 45 • and θ 4 = 60 • with equal power are considered. The short-term and long-term FFs are chosen as 0.9 and 0.99, respectively and the upper bound for the regularization parameter is set at 0.2. The confidence factor χ is chosen as 4, which corresponds to 99.99% confidence that a system change has been detected. The number FF candidates is F = 4.
The FF of the PAST-MD is set to 0.98. The average DOA deviation and the root mean squared error (RMSE) are used to evaluate the performance of different algorithms   where K M and K denote respectively the number of Monte-Carlo runs and the number of signals, θ n represents the n-th DOA andθ i,n is the n-th estimated DOA in the i-th Monte-Carlo simulation.

1) EXPERIMENT 1
This experiment is carried out to evaluate the performance of the VFF-VR-PAST-MD algorithm with different number of discrete levels q for σ 2 y p [n] for P = 4 and L = 10 in a stationary environment. The deviation between the floating-point division and division-free approaches for q = 4, 8, 16, 32 and 64 are shown in Fig. 11. It can be seen that when q is larger than 16, the results are generally satisfactory. The average deviations over the stimulated horizon as shown in Table 7 also reveals a similar observation. Considering the hardware resource required and operation speed, q = 16 is adopted in the subsequent experiments. For an L-length sensor antenna, simulated experiments show that 2 log 2 2L or 2 log 2 4L yields reasonable approximation where a denotes the nearest integer smaller than a.

2) EXPERIMENT 2
This experiment is conducted to determine the appropriate wordlength, including the integer bit and fractional bit,  for hardware implementation. The VFF-VR-PAST-MD algorithm with P = 4 and L = 10 is utilized to compare the performance. Firstly, we shall determine the integer bit. Given sufficient fractional bit, say, 20 bits, the RMSEs under different integer bits are shown in Fig. 12. It can be seen that the RMSE decreases with increasing integer bit. The gap between adjacent integer bits used is also decreasing and the differences after 7 bits are much smaller than before. Therefore, the integer bit is chosen to be 7 bits. Fig. 13 shows the RMSEs under different fractional bits used. As expected, the RMSEs decreases significantly initially when the fractional bits used increases from a relatively small value and then gradually level off after 14 bits. Hence, a fractional part of 14 bits is adopted to guarantee a sufficiently low RMSE. Thus, including the additional sign bit, a total  wordlength of 22 bits is used in our simulation and hardware implementation.

3) EXPERIMENT 3
This experiment examines the DOA estimation accuracy of the proposed algorithm under different single-to-noise ratios (SNRs). A set of SNR levels ranging from -5 dB to 25 dB are recorded at the 800-th snapshot, where the algorithms have been converged. Fig. 14 illustrates the DOA estimation results of different methods under different SNRs in stationary environment. It is seen that the VFF-PAST-MD algorithm performs better than the PAST-MD at high SNR levels, while they are comparable at low SNR levels. The proposed VR scheme further improves the performance at low SNRs. As for VFF-VR-PAST-MD, larger P results in better performance both in high and low SNRs. With the increase of SNR, the superiority gradually increases and the RMSE decreases. The performance degradation due to the CVR QRD is acceptable with only half the arithmetic complexity. The performance of the 22-bit fixed-point implementation with q = 16 has a slightly inferior initial convergence speed as its floating point counterpart. This is due  to the quantization of Tr(R −1 yȳ ) which slow down the update of the FF. On the other hand, the variation of the FF at the steady-state is reduced and hence its steady-state performance is better than its floating point counterpart.

4) EXPERIMENT 4
This experiment inspects the convergence performance of different methods at 5dB SNR. Fig. 15 shows the average deviation between DOAs of the four designated signals and the estimated looking directions. To better illustrate the convergence speed, Table 8 displays the convergence time, which is denoted by the point after which the deviation is no more than 10% of the average maximum deviation value of the given method. It can be seen that the VFF-based algorithms converge faster than the corresponding PAST-MD algorithm with a constant FF. The VFF-VR-PAST-MD algorithm converges slightly better than the VFF-PAST-MD algorithm. The averaged error after convergence which is referred to as the ''convergence error'' gradually decreases with increasing values of P. The performance of CVR update is still comparable with the full update with only slight degradation. The performances in floating-point arithmetic and fixed point arithmetic for VFF-VR-PAST-MD algorithm with P = 4 are similar.

B. DOA TRACKING IN DYNAMIC ENVIRONMENTS
The case of four uncorrelated narrow band signals with equal power is again considered. However, two sources are assumed to be invariant and are located at directions θ 1 = 8 • and   In this experiment, the DOA tracking performance of the proposed method is evaluated in nonstationary cases under SNR with 5 dB. θ 2 and θ 3 are assumed to be linearly and slowly varying as follows θ 2 = 20 − 1 × 10 −2 t, 0 t 800, Fig. 16 depicts the tracking results of different methods for both invariant and time-varying DOAs. It is seen that for P = 1, both VFF-PAST-MD and VFF-VR-PAST-MD algorithms outperform PAST-MD method with smaller tracking errors. Moreover, VR-based algorithms can achieve a slightly better performance. Larger value of P yields a result closer to the ground truth. Although the tracking performance of VFF-CVR-PAST-MD with P = 4 suffers from slight degradation, it is still comparable to the VFF-VR-PAST-MD algorithms with P = 1 or P = 2. The fixed-point implementation of the proposed approach is reasonably close to its floating-point counterpart.

2) EXPERIMENT 2
To further verify the performance of the proposed VFF-VR-PAST-MD algorithm, we carry out an experiment with signal fading in which a short period with low signal power will be where A 0 is the original signal amplitude. Fig. 17 shows the resultant performance of the various algorithms. VR-based algorithms generally offer improved performances over the VFF and fixed FF algorithms due to reduced estimation error variance. VFF-VR-PAST-MD with larger P results in smaller tracking deviation from the ground truth. The CVR-QRD also experienced acceptable performance degradation. The performances between floating-point arithmetic and fixed-point arithmetic are similar for VFF-VR-PAST-MD algorithm with P = 4. In summary, the advantages of this MD algorithm is that 1) it can be used to adaptively estimate the total subspace dimension by monitoring the residuals at each stage. 2) it can be implemented in a pipelined manner which is very efficient and scalable as it only requires the cascade of a core VFF-VR-PAST module of a given size P, which can be easily optimized rather than the original PAST or VFF-VR-PAST algorithms which involves a complexity which grows with O((RP) 2 ), where R is the number of deflation stages which may be unknown in practice. This makes the hardware optimization rather difficult. 3) it has a better performance than the PAST-d algorithm with slight increase in hardware complexity, thus offer a more flexible tradeoff between hardware resources and performance.

VI. CONCLUSION
A new VFF-VR-PAST-MD algorithm for subspace tracking and its related efficient hardware architecture have been presented. The VFF and VR schemes improve the convergence speed, steady-state error and stability of the conventional PAST algorithm. Novel simplifications of the VFF-VR-PAST algorithm are also proposed to avoid the cubic root and division operations involved to facilitate its hardware implementation. A novel pipelined hardware implementation of the simplified VFF-VR-PAST-MD algorithm employing the TACR-based QRD is developed. The proposed QRD-based VFF-VR-PAST-MD architecture is successfully synthesized and implemented in FPGA with a reduced hardware resources and higher operating speed than conventional approaches. The proposed algorithms offer better performance and a more flexible tradeoff between hardware resources and performance. The efficient architecture can also be applied to real-time scenarios including adaptive subspace identification, digital communication, and etc.
WEI ZHAO received the B.Sc. degree in measurement and instrument engineering from the Harbin University of Science and Technology, China, in 2010, the M.Sc. degree in telecommunication engineering from City University, London, U.K., in 2011, and the Ph.D. degree in digital signal processing from The University of Hong Kong, Hong Kong, in 2017. He is currently an Assistant Professor with the Institute of Electrical Engineering, Chinese Academy of Sciences, China, and the Ganjiang Innovation Academy, Chinese Academy of Sciences. His research interests include array signal processing, adaptive filtering, digital hardware design, and embedded systems. JIAN-QIANG LIN received the bachelor's degree in information engineering from Beihang University, Beijing, China, in 2015, and the Ph.D. degree in digital signal processing from the Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, in 2020. His current research interests include array signal processing, subspace identification, adaptive filtering, and mathematics. VOLUME 9, 2021