Radix-2 Self-Recursive Sparse Factorizations of Delay Vandermonde Matrices for Wideband Multi-Beam Antenna Arrays

This paper presents a self-contained factorization for the Vandermonde matrices associated with true-time delay based wideband analog multi-beam beamforming using antenna arrays. The proposed factorization contains sparse and orthogonal matrices. Novel self-recursive radix-2 algorithms for Vandermonde matrices associated with true time delay based delay-sum filterbanks are presented to reduce the circuit complexity of multi-beam analog beamforming systems. The proposed algorithms for Vandermonde matrices by a vector attain $\mathcal{O}(N \log N)$ delay-amplifier circuit counts. Error bounds for the Vandermode matrices associated with true-time delay are established and then analyzed for numerical stability. The potential for real-world circuit implementation of the proposed algorithms will be shown through signal flow graphs that are the starting point for high-frequency analog circuit realizations.

(which are intermediate constant complex multiplications found in FFT algorithms) using microwave or analog IC-based phase-shifter implementations has led to the "Butler Matrix" type multi-beam array beamformers that are well known in the literature. However, such FFT beams suffer from frequency dependent beam directions. Known as "beam squint" because the beam directions are strongly dependent on the temporal frequency of operation, DFT based multi-beam beamformers can only be used for narrowband wireless systems.
The FFT is capable of computing the DFT or its inverse in O (N log N) complexity. Therefore, FFT-based multi-beam beamformers are very useful for wireless systems having narrow bandwidth. However, for emerging 5G mmW systems that exploit increasingly wide bandwidths, the beam-squint problem can be significant.
For emerging 5G mmW systems that fully exploit the available bandwidth for increasing system capacity, one must utilize the true time-delay based multi-beam beamformers described by its own delay Vandermonde matrix (DVM). The DVM, however, is equal to the DFT only at a single temporal frequency. Therefore, FFT-based factorizations are not applicable for the DVM matrix. In this paper, we describe the complexity of an FFTlike factorization algorithm for the Vandermonde matrices, in order to be able to implement truly wideband multi-beam mmW beamformers based on true-time-delay networks albeit at O (N log N) complexity.
The paper is organized as follows. Section 2 contains an introduction to complexity metrics of analog and digital parallel computation systems for matrix-vector products. Section 3 introduces novel self-contained factorizations for Vandermonde matrices and radix-2 algorithms, while in section 4 we will derive arithmetic complexity and elaborate on numerical results based on the proposed algorithms for Vandermonde matrices.
Next, section 5 analyzes error bounds and stability in computing radix-2 algorithms for Vandermonde matrices having true time-delays. In section 6 we will present signal flow graphs of the proposed radix-2 algorithm for Vandermonde matrices. Finally, section 7 concludes the paper.
2 Analog Implementations for 5G and Beyond: Quantifying Complexity Fast analog radio frequency (RF) integrated circuit (IC) realizations of the beamforming algorithms become necessary when the bandwidths of interest are greater than a few GHz. For emerging 5G, 6G and beyond, the bandwidths of interest are too high for digital computing solutions to keep up. The solution is to replace digital systems with fast analog implementations of wideband beamforming algorithms, which in turn, requires a revisit to traditional algorithm complexity theory because of differences in analog parallel architectures compared to conventional digital approaches. In analog implementations, the bandwidth effectively sets the rate at which the analog computation can be updated. The DVM building block employs true time delays that can be realized using transmission line segments and/or all-pass networks followed by amplification stages.
Let us define DVM fast algorithms as consisting of gain-delay-block (GDB) and addition/subtraction blocks.
Instead of computing the number of multiplications for accessing with arithmetic complexity (as one would do for digital systems), we need to count the number of parallel circuit implementations of GDBs in order to access the circuit complexity of analog parallel algorithms. The larger the number of GDBs, the higher the circuit complexity and hence higher chip area and power consumption. In analog fast algorithms, the objective is to factorize the original matrix into products of sparse matrices, such that the total number of GDBs is reduced We remark here that the gain is not equivalent to the coefficient multiplication. Although a delay of t is simply multiplication by e − jωt in the mathematical sense, it requires a separate true time delay circuit in the analog domain. Hence, the multiplication complexity is different from GBD counts.

Matrices
Low complexity and stable algorithms for the delay Vandermonde matrix, A N = [α kl ] N,N−1 k=1,l=0 , where α = e − jω t τ and accounts for the phase rotation associated with the delay τ at frequency f , and ω t = 2π f , have been derived through our previous work [1,16,17]. It is important to realize that the matrix elements are integer powers of α = e − jω t τ which are functions of the temporal frequency variable ω t ; this is an important distinction from the DFT matrix where the elements are constants defined as the primitive Nth roots of unity. Because integer powers of α = e − jω t τ are dependent on ω t the DVM frequency responses are functions of two frequency variables: ω x , which is typically a spatial variable, and ω t which is typically the temporal frequency variable.
The DVM matrix frequency responses are defined using the spatial frequency variable ω x via 2-D filterbank responses that contain ω t as a parameter, and given by the expression for the kth filter for k = 0, 1, . . . , N − 1 as H k ( jω x , jω t ) = i α ki e − jω x i , i = 0, 1, . . . , N − 1. Therefore, considering both ω x and ω t the DVM defines N 2-D frequency responses.
Further, the DVM is the super-class of the DFT matrix without having nice properties like unitary, periodicity, symmetry, and circular shift. There is no self-contained radix-2 DVM algorithm in the literature. The manuscript [17] proposes a self-contained sparse factorization of DVM with O (N 2 ) arithmetic complexity. The displacement structure of Vandermonde-related matrices is used to derive O (N log 2 N) arithmetic complexity algorithms in [7,8] and an O (N) arithmetic complexity algorithm in [14]. The manuscripts [12,13,23] propose O (N 2 ) complexity algorithms to compute Vandermonde matrices (having real nodes) by a vector. The DVM algorithm in [17] extends the results in [12,13,23] utilizing complex nodes without using displacement equations as in [7,8,14]. Moreover, we have addressed the error bounds and stability of the DVM algorithm in [17] by filling the gaps in [12,13,23]. The DVM algorithm in [16] is faster than [17] but does not produce arithmetic complexity of order O (N log N). On the other hand, there are no constraints for nodes of DVM in [17] as opposed to what we propose here.
In this section, we derive novel self-contained factorization for the Vandermode-type matrices and propose a radix-2 algorithm for the Vandermonde matrices. We will account for the phase rotation associated with delay and frequency in the factorization of Vandermonde matrices.

Self-contained Factorization for Vandermonde Matrices
Algorithms operating on analog signals for computing Vandermonde matrix by a vector can be seen as the evaluation of (N −1) th degree polynomial at N points, albeit using a paralleled analog computing circuit as opposed to a digital realization that must operate on samples and quantized signals. Here we derive self-contained factorization of Vandermonde matrices to obtain efficient continuous-time algorithms for implementation on analog circuits while reducing GDB counts.
One can observe the computation of Vandermonde matrix by a vector with arithmetic complexity O (N log 2 N) in [4,7,8]. Here, arithmetic complexity refers to the number of GDBs in an analog RF-IC circuit implementation, unlike the traditional approach of the number of multipliers and adders in a digital system.
There are several mathematical techniques available to derive radix-2 and split-radix FFT algorithms, as described in [3,10,18,20,22]. It has been shown in [15] that Vandermonde matrices are badly ill-conditioned with a narrow class of exceptions whereas cyclic sequences of nodes are equally spaced on the unit circle C(0, 1). In here, we propose self-contained and sparse factorization for the well-conditioned Vandermonde matrices and extend the results for C(0, r), where r > 1 (i.e. circle of radius r centered at the origin in the complex plane).
The proposed factorizations will then be used to derive fast algorithms while reducing GDB counts.
where P N is the even-odd permutation matrix, I N 2 is the identity matrix, Proof: Let us permute rows of V N by multiplying with P N and write the result as the block matrices: It is clear that the (1,1) block of the product P N V N is V N 2 . Now, we consider (1,2), (2,1), and (2,2) blocks of P N V N (2) and represent each of these by V N 2 and the product of diagonal matrices.
By (1,2) block of (2) we get: Since nodes are equally spaced on By (2,2) block of (2) we get: Thus by (3), (4), and (5), we can state (2) as: . Let us consider the product of mth row of V N and lth column of V H N , where V H N is the conjugate transpose of V N . Thus, we have: In the above, the first equality follows as v k ,v k ∈ C(0, 1) for k = 0, 1, . . . , N −1 and the second equality follows as v 2k+1 = v 2k · e 2π j N . Hence, V N is unitary up to scaling by 1 N . By using this we can state (6) as: Now let us consider the product V H . Therefore, we have thatV In the above, the first equality follows as v 2k ,v 2k ∈ C(0, 1) and the second equality follows as v 2k are nodes of Thus, by following the above one can see the (m, l) entry ofV N Notice that even nodes on C(0, 1) can be expressed as v 2k = e j θ+ 4πk N for k = 0, 1, . . . , N 2 −1. Thus, by raising each even node to the power of N 2 and taking average we get c = e jθN 2 where j 2 = −1. Hence, and the claim of the theorem follows.

Remark 3.2.
The last matrix in the factorization (8) has been split into three sparse matrices in (1) to reduce the multiplication counts and hence for efficient hardware implementation.
The following self-contained factorization for the Vandermonde matrices is proposed in connection to the phase rotation associated with delay τ and frequency ω t = 2π f .
Proof: The proof follows similar lines as that of Theorem 3.1, Remark 3.5. Theorem 3.4 has proposed a self-contained factorization, as opposed to a scaled DFT matrix. If one chooses to scale DFT matrices to factor V N , it results in the computation of small complex numbers and leads to zero matrices [9]. The proposed factorization for V N in (10) overcomes this barrier.
Remark 3.7. When θ = 0 and r = 1, the proposed factorization for the Vandermode matrices given in Theorem 3.4, reduces to the well known self-contained DFT matrix factorization [3,19,22,24]. Thus, we can use this property to define a delay Vandermonde matrix to solve the beam squint problem as well as allow high-speed analog realizations for future high bandwidth applications where the slowing down of Moore's law prevents the adoption of digital parallel processing architectures.

Self-recursive Algorithms for Vandermonde Matrices
In the following, we will state self-recursive radix-2 algorithms for Vandermonde matrices with the help of Before stating algorithms, let us use the following notation to denote sparse matrices which will be used

Analog GDB-Complexity
The number of additions and multiplications required to carry out a computation is called the arithmetic complexity in a digital computing system. Here, because our intention is to realize these algorithms as high-

GDB Counts of Analog Fast Algorithms for Vandermonde Matrices
Here we analyze the analog GDB counts of the radix-2 algorithms for Vandermonde matrices presented in Section 3.1. Let us denote the number of complex/real additions (say #aC/#aR respectively) and complex/real multiplications (say #mC/#mR respectively) required to compute y = V N z and y =Ṽ N z having z ∈ C N or R N .
We do not count multiplication by ±1 and permutation.
Let us first analyze the complex GDB counts of the radix-2 algorithms for Vandermonde matrices by a complex input vector. We recall that the GDBs implement a complex multiplication defined in the frequency domain ω t which requires a time-domain delay to implement on the DVM signal flow graphs. We recall that the independent frequency variable of the DVM is ω x and that ω t is the temporal frequency parameter associated with the matrix elements α. This is why the complex multiplication operations, which contain e − jω t τ terms, must in practice be realized in the time domain using time-delays.
Proof: Referring to the algorithm vancc(z, N), we get By following the structures ofD N ,Î N and C N , Thus by using the above, we could state (14) as the first order difference equation with respect to t ≥ 2 Solving the above difference equation using the initial condition #aC(VanCC, 2) = 2, we can obtain #aC(VanCC, 2 t ) = N t. vancc(z, N) and (15), we could obtain another first order difference equation with

Now by using the algorithm
Solving the above difference equation using the initial condition #mC(VanCC, 2) = 1, we can obtain #mC(VanCC, 2 t ) = N t − N + 1. algorithm with z ∈ C N is given by Proof: The multiplication of the diagonal matrixD N with a complex input counts no addition and N 2 − 1 multiplications. Thus by using vanccr(z, N) algorithm and GDB counts in (13), the complex GDB counts can be obtained as in (13). algorithm with z ∈ C N is given by Proof: The multiplication of the diagonal matrixD N with a complex input counts no addition and N 2 − 1 multiplications. Thus by using vancr(z, N) algorithm and GDB counts in (17), the complex GDB counts can be obtained as in (18).
Let us analyze the real GDB counts of the radix-2 algorithms for Vandermonde matrices by a real input vector. Here we count the multiplication of two complex numbers with 2 real additions and 4 real multiplications.
Theorem 4.5. Let N = 2 t (≥ 2) and θ be given. The real GDB counts of the proposed vancc(z, N) algorithm with z ∈ R N is given by Proof: Referring to the algorithm vancc(z, N), we get By following the structures ofD N ,Î N and C N , Thus by using the above, we could state (20) as the first order difference equation with respect to t ≥ 2 #mR(VanCC, 2 t ) − 2 · #mR VanCC, 2 t−1 = 2 · 2 t − 2.
Solving the above difference equation using the initial condition #mR(VanCC, 2) = 1, we can obtain Now by using the algorithm vancc(z, N) and (15), we could obtain another first order difference equation with Solving the above difference equation using the initial condition #aR(VanCC, 2) = 2, we can obtain #aR(VanCC, 2 t ) = N t.
Corollary 4.6. Let N = 2 t (≥ 2), r and θ be given. The real GDB counts of the proposed vanccr(z, N) algorithm with z ∈ R N is given by Proof:D N is a diagonal matrix with real entries so the number of additions will remain the same as in (19) while the number of multiplications will be increased by N − 1 in (19).
Proof:D N is a diagonal matrix with real entries so the number of additions will remain the same as in (23) while the number of multiplications will be increased by N − 1 in (23).
Following Tables 1, 2, and 3, the proposed radix-2 algorithms for the Vandermonde matrices have shown significant arithmetic complexity reduction as opposed to the DVM algorithms presented in [1,16,17]. At the same time, we should recall that the DVM algorithms proposed in [1,16,17] have no restriction for nodes or delays as in this paper. Moreover, the proposed radix-2 algorithms for Vandermonde matrices have reduced GDB counts extensively opposed to the direct computation of Vandermonde matrices by a vector. More importantly, we have achieved the lowest GDB counts of radix-2 algorithms on computing Vandermonde matrices by a vector in the literature while covering radix-2 DFT algorithms as a subclass of the proposed radix-2 algorithms.

Theoretical Analysis
Error bounds and numerical stability when computing the radix-2 Vandermonde algorithms associated with true time delays are the main concern in this section. To derive analytic results for error bound, we will use the perturbation of the product of matrices (stated in [9]). Following the proposed radix-2 algorithms vancc(z, N) and vanc(z, N), we have to compute weights e ±k( 2π j N ) = ω k ± (sa y), where ω ± = e ± 2π j N for k = 0, 1, . . . , N 2 −1. The way Table 2: Real GDB counts of the proposed radix-2 algorithms (i. e. vanc(z, N) and vancc(z, N) we compute weights affects the accuracy of the algorithms. Thus, we will assume that the computed weights ω k ± are used and satisfy for all k = 0, 1, . . . , where µ + := c 1 u andµ − := c 1 u u is the unit roundoff, and c 1 and c 2 are constants that depend on the method [22].
Proof: Using the algorithm vancc(z, N) and the computed matrices B(k) (in terms of computed weights ω k + ) for k = 0, 1, · · · , t − 2: we have has only two non-zeros per row and recalling that we are using complex arithmetic, we get: Using the fact that B(k) are computed using the computed weights ω k + , we get: Each block diagonal matrix C(k) is formed by 2 k number of C N 2 k 's in block diagonal positions. Using the fact that each C N 2 k has only one non-zeros per row and recalling that we are using complex arithmetic, we get: V(t − 1) is a block diagonal matrix and formed by 2 t−1 number of V 2 's in diagonal positions. Hence Thus overall, Hence |V(t − 1)||B(t − 2)||C(t − 2)| · · · |B(1)||C(1)| |B(0)||C(0)||z|.
Since each C(k) is an unitary matrix, and each B(k) and V(t − 1) are unitary matrices up to scaling, we get C(k) 2 = 1 and where ν + = η + γ 3 + η + + γ 3 . Now following V N V H N = N · I N , we get y 2 = n z 2 , and hence the result.
Proof: Theorem 5.1 immediately follows that the proposed radix-2 algorithm for Vandermonde matrices i.e. vancc(z, N) can be computed with tiny forward error provided that the weights i.e. ω k + are computed stably. On the other hand, y = y + ∆y = V N z + ∆y. Thus, we get y = V N (z + ∆z) and ∆z 2 z 2 = ∆y 2 y 2 . If we compute y = V N z using the brute force computation, we get Since V N is unitary w. r. t. scaling, this immediately reduces to As µ + = O (u), the error (26) of the proposed radix-2 algorithm is much more smaller than that in (27). Thus, the proposed algorithm is backward stable. Hence, the proposed algorithm is numerically stable.

Numerical Results
We will now state numerical results in connection to the error bounds of the proposed radix-2 algorithms for Vandermonde matrices and compare the results with the error bound of the radix-2 FFT algorithm analyzed in [9]. With the help of the radix-2 factorization of the DFT matrices in [22], it was proved in [9] that the error bound on computing radix-2 FFT algorithm is given by; where y = f l(F N x), F N is the DFT matrix, N = 2 t , η = µ + γ 4 (1 + µ), and µ depends on the methods for computing the weights as specified in [22]. We compare the error bounds of the proposed radix-2 algorithms for Vandermonde matrices shown in (26) and (28) with the radix-2 FFT algorithm (29) using MATLAB(R2014a version). In these calculations, we have chosen µ = µ + = µ − = 10 −15 and γ N = N u 1−N u where N = 2 t and u is the machine precision. Since µ = O (u), we have chosen u = 10 −15 . Table 4 shows the error bounds of the proposed radix-2 algorithms for Vandermonde matrices and radix-2 FFT algorithm in [9]. Based on the numerical results shown in Table 4, the proposed radix-2 algorithms for Vandermode matrices and radix-2 FFT algorithm have the same error orders except for N = 16, and 256. Even with these two N values, error orders of the proposed algorithms and FFT vary only by 10 −1 and relatively very low. To sum up, Table 4 shows that the proposed radix-2 algorithms for Vandermonde matrices provide tiny forward errors.  . vancc(z, N) and vanc(z, N)) vs radix-2 FFT algorithm [9] N Error Bound Error Bound vancc(z, N)/vanc(z, N) FFT

Signal Flow Graphs for Radix-2 Vandermonde Algorithms
In this section, we use signal flow graphs to illustrate the connection between algebraic operations used in sparse and orthogonal factorization of Vandermonde matrices with the fundamental signal flow graphs (SFG) building blocks (i.e. adders and multipliers). We provide two signal flow graphs to show the simplicity of the proposed radix-2 algorithms for Vandermonde matrices. Being pivotal for efficient physical implementation in hardware, SFGs should represent a numerical algorithm in its fully factorized form in such a way that more sparse matrices are resulted and, as a consequence, less arithmetic operations demanded. Thus, Fig. 1 displays the SFG for the proposed vanc(z, N) algorithm for the case N = 8. The recursive nature is evident as we express the 8-point SFG in terms of the 4-and 2-point SFGs. Notice that, the SFG of the vancc(z, N) algorithm is not presented because the delays have been replaced with time advances that are not realizable in real-time circuits. But for the software implementation purposes, we have proposed vancc(z, N) algorithm in Section 3.2 to effectively compute Vandermonde matrices.

Conclusion
We have proposed novel self-recursive radix-2 algorithms for Vandermonde matrices. These algorithms have sparse and orthogonal factors. We have shown that the well known radix-2 DFT algorithm is a subclass of the proposed algorithms for the Vandermonde matrices. The proposed algorithms attain the lowest gain-delayblock counts for Vandermonde matrices by a vector, in the literature. Theoretical error bounds on computing the radix-2 algorithms and stability of the proposed algorithms are established. Numerical results of the forward error bounds of the proposed radix-2 algorithms are compared with the radix-2 FFT algorithm. The proposed radix-2 algorithms have shown tiny forward and backward errors when weights are computed stably. Signal flow graphs were presented to show the simplicity of the proposed algorithm and to realize highfrequency analog circuits. Using the radix-2 algorithms for Vandermonde matrices associated with true time delay based delay-sum filterbanks, we have reduced the circuit complexity of multi-beam analog beamforming systems significantly.