Fast Radix-32 Approximate DFTs for 1024-Beam Digital RF Beamforming

The discrete Fourier transform (DFT) is widely employed for multi-beam digital beamforming. The DFT can be efficiently implemented through fast Fourier transform (FFT) algorithms, thus reducing chip area, power consumption, processing time, and consumption of other hardware resources. This paper proposes three new hybrid 1024-point DFT approximations and their respective fast algorithms. These approximate DFT (ADFT) algorithms have significantly reduced circuit complexity and power consumption compared to traditional FFT approaches, while trading off a subtle loss in computational precision that is acceptable for digital beamforming applications in RF antenna implementations. ADFT algorithms have not been introduced for beamforming beyond $N = 32$, but this paper anticipates the need for massively large adaptive arrays in future 5G and 6G systems. Digital CMOS circuit designs for the ADFTs show the resulting improvements in both circuit complexity and power consumption metrics. Simulation results show similar or lower critical path delay with up to 48.5% lower chip area compared to a standard Cooley-Tukey FFT. The time-area and dynamic power metrics are reduced by up to 66.0%. The 1024-point ADFT beamformers produce signal-to-noise ratio (SNR) gains between 29.2 and 30.1 dB, a worst-case loss of $\le$ 0.9 dB of SNR gain compared to exact 1024-point DFT beamformers realizable using an FFT.

The computational complexity of the N-point DFT can be reduced via fast Fourier transforms (FFTs), which are fast algorithms for realizing DFTs that reduce the computational complexity to O(N log₂ N) [10]. Thus, multiple DFT beams for both wireless communications applications (e.g., JSDM) and multi-beam radar/imaging systems are often generated by applying an N-point spatial FFT to each temporal sample acquired by the ULA [13,18].
The search for particular N-point FFT methods that minimize the multiplicative complexity is a separate field of research in signal processing, computer science, and applied mathematics, with a multitude of algorithms and implementations available [7,10,22,25,49]. In [19], the theoretical lower bound for the DFT multiplicative complexity was established as a function of N. All FFT algorithms use sparse factorizations of the DFT matrix to provide accurate implementations of the DFT at an arithmetic complexity that approaches this lower bound. However, such high accuracy is of limited practical relevance in digital multi-beam RF beamforming applications, such as radar signal processing, where the accuracy of the results is limited by other system parameters or environmental conditions (e.g., thermal noise in a receiver, the deviation of a practical antenna radiation pattern from the ideal, or harmonic distortion in a microwave mixer or amplifier). In such applications, relentless pursuit of high accuracy in the exact computation of the DFT is not relevant in terms of overall performance, and smart system designers can exploit this fact for power and cost optimization.
High-precision VLSI implementation of FFT algorithms may result in unnecessarily large circuits, exaggerated critical path delays, and wasted power. All of those factors contribute to higher-cost circuits, reduced frequency of operation, and higher operating costs. This is because digital multipliers demand a large amount of circuit resources when compared to simple adders. Reducing the number of multipliers in a given system is therefore crucial when chip area and power must be conserved and high-speed operation is desirable. In particular, the adoption of approximate DFT (ADFT) computations opens up new possibilities for fast algorithms which do not compute the DFT in the strictest mathematical sense, but nevertheless can be good enough for digital multi-beam RF beamforming applications, particularly at mmWave frequencies and above, where reproducibility of antenna patterns becomes more problematic. Because ADFTs are not constrained by the theoretical lower bound on the multiplicative complexity of an exact N-point DFT established in [19], ADFT computations allow greater reductions in computational complexity than traditional FFTs, albeit at the cost of a deterministic loss in performance, namely a small increase in worst-case side-lobe level [6].
The ever increasing data rate demands of wireless communications led to the exploration of millimeter-wave (mmW)/sub-THz/THz frequencies in 5G cellular networks [34,35], where larger antenna array sizes (e.g., N = 64, 128, 256) for beamforming and massive MIMO have become a general requirement [9]. For example, IoT and robotics applications in emerging fifth-generation (5G) and beyond mobile wireless networks will require 6D positioning, which involves both spatial position and device orientation (roll, pitch, yaw) and requires new algorithms that can benefit from large numbers of closely packed low-complexity digital beams [28,41,46]. A similar need occurs in the design of systems for intelligent surfaces, which provide means of communication without line-of-sight [42]. In fact, mmW-based 5G MIMO cellular systems are already being deployed [1]. Moreover, ongoing research in the sub-THz range [8,11,20,23,26,31,36,38,41,51] suggests that the W and G bands will be commercially available within the next 5-10 years. Such sub-THz carrier frequencies require large amounts of beamforming gain to mitigate free-space path loss in the first meter of propagation from the antenna [35,41,47,48]. Thus, communication systems at these frequencies would require much larger numbers of antenna elements in the transceiver arrays; array sizes of the order of N = 1000 elements would not be unrealistic for future sixth generation (6G) cellular systems. Nevertheless, to the best of our knowledge, DFT approximations in the literature are limited to N ≤ 32.
In this paper, we address this important beamforming challenge by introducing three new approximations to the very large N = 1024 (1024-point) DFT. Fast algorithms that allow low-complexity implementations of these approximations are also developed and shown to provide remarkable accuracy with significant cost and power reduction compared to DFT and FFT approaches. The proposed 1024-point ADFTs are based on a recently proposed 32-point DFT approximation and multiplierless fast algorithm [27,40] that furnish a "reasonable" approximation of the 32-point DFT without using multiplications (i.e., using an adder-only signal flow graph). The 1024-point exact DFT can be expressed in terms of 32-point DFTs. We use this fact to derive an approximation for the 1024-point DFT matrix by means of our earlier 32-point ADFT. In particular, we propose three different 1024-point transforms with different trade-offs in computational complexity and computational accuracy compared to the baseline exact DFT. These three transforms differ from each other based on the use of the 32-point ADFT in the derivation, and they can be used to replace the FFT while generating N = 1024 beams from a 1024-element ULA as shown in Fig. 1.
The paper is organized as follows. Section 1 reviews the DFT and selected popular FFT algorithms. In Section 2, we discuss the mathematical background for the 32-point DFT approximation introduced in [40] and describe its associated fast algorithm in matrix form. In Section 3, we present 1024-point DFT approximations and discuss three different algorithms to implement them. Section 4 explores the digital VLSI realization of the proposed 1024-point DFT approximations. In Section 5, we summarize our conclusions.

Review of the DFT and FFT
In order to understand the method used to create accurate ADFT algorithms, we will discuss the mathematical background related to the DFT definition and FFT algorithms.

Mathematical Definition of the DFT
Let $\mathbf{x} = \begin{bmatrix} x[0] & x[1] & \cdots & x[N-1] \end{bmatrix}^\top$ represent a signal with N samples. The DFT maps the input signal $\mathbf{x}$ into an output signal $\mathbf{X} = \begin{bmatrix} X[0] & X[1] & \cdots & X[N-1] \end{bmatrix}^\top$ according to the following relationship:

$$X[k] = \sum_{n=0}^{N-1} x[n]\,\omega_N^{nk}, \quad k = 0, 1, \ldots, N-1,$$

where $\omega_N = e^{-j2\pi/N}$ is the Nth root of unity and $j = \sqrt{-1}$. On the other hand, the inverse DFT (IDFT) is given as

$$x[n] = \frac{1}{N}\sum_{k=0}^{N-1} X[k]\,\omega_N^{-nk}, \quad n = 0, 1, \ldots, N-1.$$

The DFT of $\mathbf{x}$ can be expressed through a matrix-vector multiplication $\mathbf{X} = \mathbf{F}_N \cdot \mathbf{x}$, where $\mathbf{F}_N = \left(\omega_N^{nk}\right)_{n,k=0,1,\ldots,N-1}$ is the N-point DFT matrix [33].
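As a concrete illustration, the matrix-vector form X = F_N · x above can be sketched in a few lines of NumPy (the helper names `dft_matrix` and `dft` are ours, not from the paper):

```python
import numpy as np

def dft_matrix(N):
    """N-point DFT matrix F_N with entries w_N^(n*k), w_N = exp(-j*2*pi/N)."""
    n = np.arange(N)
    return np.exp(-2j * np.pi * np.outer(n, n) / N)

def dft(x):
    """Direct-form DFT as a matrix-vector product: O(N^2) operations."""
    x = np.asarray(x, dtype=complex)
    return dft_matrix(len(x)) @ x
```

For any input, `dft(x)` agrees with a library FFT such as `np.fft.fft(x)` up to floating-point error.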

FFT Algorithms
The DFT was a cornerstone of early digital signal processing, and the FFT made its computation vastly more efficient; here we take the further step from FFTs to ADFTs. The computational complexity associated with performing the N-point DFT operation in direct form is O(N²). This complexity is prohibitive for most engineering applications, since a high number of operations leads to (i) higher energy consumption; (ii) higher latency; (iii) a higher number of gates; and, in consequence, (iv) a higher chance of system failure. To address these issues, FFT factorizations furnish a product of sparse (mostly zero) matrices that reduces the DFT computational complexity to O(N log N). Different FFT algorithms can be identified in the literature [14,16,43,44].
Here we consider three popular algorithms, namely i) the Cooley-Tukey FFT [10], ii) the split-radix FFT [16], and iii) the Winograd FFT [45]; each of these is briefly described below.

Cooley-Tukey Algorithm
A very popular form of the classical Cooley-Tukey algorithm is the radix-2 decimation-in-time FFT, which splits the N-point DFT computation into two N/2-point DFT computations, resulting in an overall reduced complexity [14]. Recursive use of this algorithm reduces the number of multiplications of the DFT from O(N²) down to O(N log₂ N).
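The decimation-in-time split described above can be sketched recursively; this is a textbook illustration in NumPy, not the fixed-point architecture evaluated later in the paper:

```python
import numpy as np

def fft_radix2(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.

    The N-point DFT is split into the N/2-point DFTs of the even- and
    odd-indexed samples, which are combined with N/2 twiddle factors.
    """
    x = np.asarray(x, dtype=complex)
    N = len(x)
    if N == 1:
        return x
    even = fft_radix2(x[0::2])    # N/2-point DFT of even samples
    odd = fft_radix2(x[1::2])     # N/2-point DFT of odd samples
    w = np.exp(-2j * np.pi * np.arange(N // 2) / N)   # twiddle factors
    return np.concatenate([even + w * odd, even - w * odd])
```

Each of the log₂ N recursion levels performs O(N) work, giving the O(N log₂ N) operation count quoted above.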

Split-radix Algorithm
This is a variant of the Cooley-Tukey FFT algorithm which uses a blend of radix-2 and radix-4 by recursively expressing the N-point DFT in terms of one N/2-point DFT and two N/4-point DFT instantiations [16]. The split-radix algorithm can reduce the overall number of additions required to compute DFTs of sizes that are powers of two without increasing the number of multiplications [21].

Winograd Algorithm
The Winograd algorithm implements an efficient FFT by exploiting the multiplicative structure of the DFT data indexing, converting the computation into cyclic convolutions [43,44]. In several particular cases, the Winograd algorithm achieves the theoretical minimum multiplicative complexity [19], as shown in [43], making it more efficient than the Cooley-Tukey and split-radix algorithms. For large DFT block lengths that can be decomposed as a product of small primes, the Winograd algorithm achieves nearly-linear multiplicative complexity [10].

Matrix Representation of the N²-point DFT in terms of the N-point DFT
We now use the matrix definition in subsection 1.1 to derive a matrix representation for the computation of the N²-point DFT in terms of the N-point DFT via a radix-N FFT approach; the goal is to express the 1024-point DFT in terms of 32-point DFTs. Generally speaking, the N²-point DFT computation corresponds to a matrix-vector multiplication with an N² × N² matrix transformation:

$$\mathbf{X} = \mathbf{F}_{N^2} \cdot \mathbf{x}. \tag{4}$$

The expression in (4) can be evaluated by mapping the one-dimensional input into a two-dimensional array. The 1D-to-2D mapping can be accomplished by means of the inverse vectorization operator invvec(·) [17] (cf. [29,30]), which obeys the following mapping:

$$\left[\operatorname{invvec}(\mathbf{x})\right]_{m,n} = x[m + N\cdot n], \quad m, n = 0, 1, \ldots, N-1, \tag{5}$$

i.e., the N²-point input vector is written column by column into an N × N matrix. Based on the 1D-to-2D mapping in (5), the N²-point DFT given in (4) can be represented by the following matrix expression based on the Cooley-Tukey algorithm:

$$\mathbf{X} = \operatorname{vec}\!\left(\left(\mathbf{F}_N \cdot \left(\boldsymbol{\Omega}_N \circ \left(\mathbf{F}_N \cdot \operatorname{invvec}(\mathbf{x})^\top\right)^\top\right)\right)^\top\right), \tag{6}$$

where vec(·) is the matrix vectorization operator [37, p. 239], ∘ is the Hadamard element-wise multiplication [37, p. 251], the superscript ⊤ denotes simple transposition (non-Hermitian), and $\boldsymbol{\Omega}_N$ is the twiddle-factor matrix given by $\boldsymbol{\Omega}_N = \left(\omega_{N^2}^{m\cdot n}\right)_{m,n=0,1,\ldots,N-1}$. Noting that $\boldsymbol{\Omega}_N^\top = \boldsymbol{\Omega}_N$ and $\mathbf{F}_N^\top = \mathbf{F}_N$, (6) can be further simplified. In particular, for N = 1024 = 32², we have

$$\mathbf{X} = \operatorname{vec}\!\left(\left(\mathbf{F}_{32} \cdot \left(\boldsymbol{\Omega}_{32} \circ \left(\operatorname{invvec}(\mathbf{x}) \cdot \mathbf{F}_{32}\right)\right)\right)^\top\right). \tag{7}$$

The inner DFT call corresponds to a row-wise transformation of invvec(x), whereas the outer DFT performs column-wise transformations on the resulting intermediate computation. The formulation shown in (7) is the fundamental expression on which the proposed approximations (i.e., ADFTs) in this work are based.
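The row-wise/twiddle/column-wise structure of this radix-N decomposition can be checked numerically against a library FFT; here is a sketch for a general square block length N² (the helper `radix_n_dft` is illustrative, not the hardware algorithm):

```python
import numpy as np

def radix_n_dft(x, N):
    """N^2-point DFT from N-point DFTs: row-wise DFTs of invvec(x),
    Hadamard product with the twiddle-factor matrix, column-wise DFTs,
    then transposition and column-major vectorization."""
    n = np.arange(N)
    F = np.exp(-2j * np.pi * np.outer(n, n) / N)         # N-point DFT matrix
    Omega = np.exp(-2j * np.pi * np.outer(n, n) / N**2)  # twiddle factors
    Y = np.asarray(x, dtype=complex).reshape(N, N, order='F')  # invvec
    inner = Y @ F                      # row-wise N-point DFTs
    outer = F @ (Omega * inner)        # twiddles, then column-wise DFTs
    return outer.T.flatten(order='F')  # vec of the transposed result
```

With N = 32 this evaluates a 1024-point DFT using only 32-point DFTs plus twiddle-factor products, matching `np.fft.fft` to floating-point accuracy.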

Multiplierless 32-point ADFT
In this section, the adopted multiplierless 32-point ADFT, first introduced in [27,40], is presented, and its complexity and error analysis are discussed. This background is critical because the 1024-point ADFT is realized using the 32-point ADFT as its main building block.

Matrix Representation
The considered 32-point ADFT matrix, denoted by $\hat{\mathbf{F}}_{32}$, can be computed through a product of sparse matrices whose coefficients have real and imaginary parts containing only 0 and ±1 entries. Such simple arithmetic leads to hardware designs that can be realized with adders only.
To present the factorization of $\hat{\mathbf{F}}_{32}$, we need the auxiliary structures in (8)–(17), since these auxiliary factors are key to the matrix factorization. Let $\mathbf{B}_t$ be the t × t real matrix defined in (8), where $\mathbf{I}_k$ and $\bar{\mathbf{I}}_k$ are the identity and counter-identity matrices of order k, respectively. Let also $\mathbf{Z}_1$, $\mathbf{Z}_2$, and $\mathbf{Z}_3$ be the matrices defined in (9)–(11) (for clarity, only their non-zero elements are shown). The 32-point ADFT matrix is then factorized into eight sparse matrices $\mathbf{W}_k$, for k = 0, 1, …, 7, according to

$$\hat{\mathbf{F}}_{32} = \mathbf{W}_7 \cdot \mathbf{W}_6 \cdot \mathbf{W}_5 \cdot \mathbf{W}_4 \cdot \mathbf{W}_3 \cdot \mathbf{W}_2 \cdot \mathbf{W}_1 \cdot \mathbf{W}_0, \tag{12}$$

where $\mathbf{W}_0, \ldots, \mathbf{W}_6$ are given in (13)–(16) and $\mathbf{W}_7$ is given in (17).

Arithmetic Complexity
In this section, we study the arithmetic complexity of the proposed ADFTs by assuming fully sequential execution. That is, we consider all algorithms to execute on a sequential processor, i.e., a central processing unit (CPU) that furnishes the arithmetic operations dictated by the particular algorithm. The execution time is proportional to the number of arithmetic operations, and since multiplication is, in general, more computationally intensive than addition, it takes longer to execute. Therefore, the number of multiplications is the primary metric for quantifying arithmetic complexity.
The discussed 32-point ADFT requires no multiplications, and no bit-shifting operations are required either. The only source of arithmetic complexity is the number of additions in the factorization (12). Considering complex inputs, the matrices W₀, W₁, and W₄ require 60 real additions each, while the matrices W₂, W₃, and W₅ require 28 real additions each. Similarly, the matrix W₆ requires 24 real additions, while the only complex matrix in the factorization, W₇, requires 60 real additions. In total, the transform $\hat{\mathbf{F}}_{32}$ thus requires 348 real additions and no bit-shifting. By comparison, the Cooley-Tukey radix-2 algorithm requires 88 real multiplications and 408 real additions [10,16].

Error Analysis
The rows of a linear transform matrix can be understood as a finite impulse response (FIR) filter bank [32]. The computational savings and exploitation of sparsity give rise to slightly inaccurate representations of the frequency response of this filter bank. Thus we can assess how close the filter bank implied by the proposed ADFT approximation (the factorization in (12)) is to that of the exact DFT.
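The row-wise frequency-response comparison can be sketched as follows. Reproducing the exact factorization of [40] here would be lengthy, so the snippet uses a hypothetical stand-in ADFT (the DFT matrix with real and imaginary parts rounded to {−1, 0, +1}); only the error-analysis procedure itself is the point:

```python
import numpy as np

N = 32
n = np.arange(N)
F = np.exp(-2j * np.pi * np.outer(n, n) / N)   # exact F_32
# Hypothetical stand-in for the ADFT matrix; the true ADFT of [40]
# has a specific sparse factorization and differs from this.
F_hat = np.round(F.real) + 1j * np.round(F.imag)

# Row r of a transform is an FIR filter; its frequency response is
# H_r(w) = sum_n T[r, n] * exp(-j*w*n), evaluated on a dense grid.
w = np.linspace(-np.pi, np.pi, 2048)
E = np.exp(-1j * np.outer(w, n))        # grid points x filter taps
err = np.abs(E @ (F - F_hat).T) ** 2    # squared-magnitude response error
print(err.max())                         # worst-case deviation over all rows
```

Envelopes and quartiles of `err` across rows, as in Fig. 3, summarize how far the approximate filter bank drifts from the ideal one.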

Approximation Methodology
Section 2.2 illustrated the ADFT for N = 32; here we exploit the fact that 1024 = 32² to create a family of ADFTs for N = 1024. Motivated by the promising results achieved for the 32-point ADFT, we extend the approximation to the 1024-point case using the mathematics described in section 1.3. We propose three ADFT algorithms whose filter bank responses deviate only slightly from those of the DFT. We assume that the applications at hand will be tolerant of the given deviations of frequency response, and that such deviations are a small price to pay in exchange for significantly smaller circuit realizations and power consumption relative to traditional fixed-point FFTs. It should be noted that the implementation of such approximate methods is not constrained by the minimum theoretical bounds on multiplicative complexity [19] that apply to the exact DFT. Indeed, the proposed algorithms are not in fact calculating the DFT, but furnishing approximations that are deemed reasonable for most high-speed digital-RF applications.
Based on (7), we propose the replacement of the exact 32-point DFT $\mathbf{F}_{32}$ by the 32-point ADFT proposed in [40]. Therefore, a suite of approximations for the DFT computation emerges. We propose three different algorithms based on the position of the ADFT matrix in the derivation. Let $\hat{\mathbf{X}}_i$, for i = 1, 2, 3, denote the approximations for X given by Algorithm 1, Algorithm 2, and Algorithm 3, respectively. Mathematically,

$$\hat{\mathbf{X}}_1 = \operatorname{vec}\!\left(\left(\hat{\mathbf{F}}_{32} \cdot \left(\boldsymbol{\Omega}_{32} \circ \left(\operatorname{invvec}(\mathbf{x}) \cdot \hat{\mathbf{F}}_{32}^\top\right)\right)\right)^\top\right), \tag{18}$$

$$\hat{\mathbf{X}}_2 = \operatorname{vec}\!\left(\left(\mathbf{F}_{32} \cdot \left(\boldsymbol{\Omega}_{32} \circ \left(\operatorname{invvec}(\mathbf{x}) \cdot \hat{\mathbf{F}}_{32}^\top\right)\right)\right)^\top\right), \tag{19}$$

$$\hat{\mathbf{X}}_3 = \operatorname{vec}\!\left(\left(\hat{\mathbf{F}}_{32} \cdot \left(\boldsymbol{\Omega}_{32} \circ \left(\operatorname{invvec}(\mathbf{x}) \cdot \mathbf{F}_{32}\right)\right)\right)^\top\right). \tag{20}$$

In (18), both the row- and column-wise DFT blocks are replaced by the ADFT; in (19), only the row-wise DFT is approximated; and in (20), only the column-wise DFT is approximated. These combinations of ADFT and DFT yield low-complexity approximations for the 1024-point DFT, whose relatively large block length makes direct numerical search for approximations computationally intractable. Algorithms 1, 2, and 3 have considerably different computational complexities and performance trade-offs, as discussed in subsection 3.2.
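Using the same hypothetical stand-in for the 32-point ADFT as before (the true matrix comes from the factorization of [40]), the three algorithm variants can be sketched and sanity-checked; the exact/exact configuration must reproduce the FFT:

```python
import numpy as np

def split_transform(x, F_row, F_col, N=32):
    """1024-point transform in radix-32 form: row-wise transform F_row,
    twiddle-factor Hadamard product, column-wise transform F_col."""
    n = np.arange(N)
    Omega = np.exp(-2j * np.pi * np.outer(n, n) / N**2)
    Y = np.asarray(x, dtype=complex).reshape(N, N, order='F')
    inner = Y @ F_row.T               # row-wise stage
    outer = F_col @ (Omega * inner)   # twiddles + column-wise stage
    return outer.T.flatten(order='F')

n = np.arange(32)
F32 = np.exp(-2j * np.pi * np.outer(n, n) / 32)
F32_hat = np.round(F32.real) + 1j * np.round(F32.imag)  # stand-in ADFT

x = np.random.default_rng(1).standard_normal(1024)
X1 = split_transform(x, F32_hat, F32_hat)  # Algorithm 1: both approximate
X2 = split_transform(x, F32_hat, F32)      # Algorithm 2: row-wise approximate
X3 = split_transform(x, F32, F32_hat)      # Algorithm 3: column-wise approximate
```

Calling `split_transform(x, F32, F32)` with both blocks exact recovers the exact 1024-point DFT, which makes the structure easy to validate before substituting any approximation.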

Twiddle-factor Matrix
In the three proposed algorithms, only the 32-point DFT computation F₃₂ is subject to approximation; the twiddle-factor matrix Ω₃₂ is left unaltered in its exact form (cf. (7)). Therefore, a minimum number of multiplications remains due to Ω₃₂.
Considering only the nontrivial multiplications, the twiddle-factor matrix requires 961 complex multiplications, which translate into 2883 real multiplications and 2883 real additions. This arithmetic complexity assumes sequential operation on a CPU and will be used in the arithmetic complexity calculations for each of the three algorithms.
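The figure of 961 follows from counting entries of Ω₃₂ that differ from unity; since m·n ≤ 31² = 961 < 1024, an entry ω₁₀₂₄^{m·n} is trivial exactly when m = 0 or n = 0:

```python
import numpy as np

# Count nontrivial entries of the 32x32 twiddle-factor matrix Omega_32,
# whose (m, n) entry is w_1024^(m*n).  An entry equals 1 iff m*n = 0
# (mod 1024), which for 0 <= m, n <= 31 happens only when m or n is 0.
m, n = np.meshgrid(np.arange(32), np.arange(32), indexing='ij')
nontrivial = int(np.count_nonzero((m * n) % 1024))
print(nontrivial)   # 961 nontrivial complex multiplications
```

With a 3-real-multiplication complex-product scheme, this yields the 3 × 961 = 2883 real multiplications quoted above.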

Algorithm 1
Here the only source of multiplicative complexity is the set of twiddle factors between the row- and column-wise 32-point ADFT stages, since both DFT blocks are replaced by the multiplierless ADFT.

Algorithm 2
Here multiplicative costs stem from the twiddle factors and the column-

Algorithm 3
Here the operation count follows the same rationale as for Algorithm 2, with the difference that the roles of the row- and column-wise transforms are swapped. Therefore, Algorithms 2 and 3 have the same arithmetic costs. The arithmetic complexity of the proposed methods is summarized in Table 1.

Performance of the Proposed Approximations
Considering the frequency response error expressed in log-magnitude units, Fig. 3 shows (i) the upper and lower envelopes and (ii) the first, second, and third quartiles of the error resulting from the proposed approximate filter banks [24,39]. For ease of visual inspection, we show only the normalized frequencies on the interval [−π/4, π/4]; the error of the frequency response over the remainder of the interval [−π, π] is just a repetition of the plots in Fig. 3.
Note that the three approximations resulting from Algorithm 1, Algorithm 2, and Algorithm 3 have distinct frequency responses. Fig. 3 indicates that Algorithm 1 presents the largest main-lobe deviation from the exact DFT. This is expected, given that the transform resulting from Algorithm 1 is obtained by substituting both the row- and column-wise DFT blocks with the discussed approximate 32-point DFT. This qualitative analysis is confirmed once we calculate the errors in the frequency responses of the rows of the three proposed approximations. Table 2 displays the minimum (nonzero), mean, and maximum of the squared magnitude of these errors. Notice in Fig. 3 that the transform resulting from Algorithm 1 shows the highest deviations from the expected frequency response of its rows, with a range of 5 dB compared to the filter bank response of the exact DFT matrix. In Table 2, we also show the worst-case side lobe in dB for each of the transforms. All transforms considered here possess a low worst-case side lobe on the order of −12 dB.
The noise rejection of the proposed ADFTs can be evaluated by means of their SNR improvement per frequency bin. The noise present at the antenna array can be modeled as additive white Gaussian noise (AWGN) with zero mean and variance σ². The AWGN power present in each frequency bin is then σ²/N. For a narrowband (monochromatic) plane wave received by the array, the input signal to both the DFT and the three ADFT algorithms follows exp(j2πnk/N) for n, k = 0, 1, …, N − 1, where k represents the DFT/ADFT bin number (corresponding to the specific spatial frequency related to the direction of propagation of each wave) and n is the antenna number in the ULA. The monochromatic signal occupying bin k has its SNR improved by 10 log₁₀(N), which is 30.1 dB for the 1024-point DFT. This is the best-case SNR improvement per bin for the DFT. The adoption of the ADFTs in place of the DFT causes a loss of SNR performance, observed as a reduction in SNR per bin. Let the reduction in SNR for bin k be denoted Δγₓ, where x ∈ {1, 2, 3} indexes the three proposed approximation algorithms.
The worst-case SNR degradation of the ADFTs, obtained through simulations with 10⁵ replicates to estimate the ensemble average for each bin, is shown in Fig. 4, which plots the SNR for each of the beams of the DFT and the three proposed approximations. Notice that no approximation has an SNR gain lower than 29.2 dB in any of the bins, demonstrating that the SNR degradation is ≤ 0.9 dB compared to the DFT, for which the SNR improves by 30.1 dB in every bin.
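The 10 log₁₀(N) per-bin SNR gain of the exact DFT beamformer can be reproduced with a short Monte-Carlo sketch (the bin index, noise variance, and trial count here are arbitrary illustrative choices, not the paper's simulation parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, trials = 1024, 17, 2000       # array size, beam (bin) index, replicates
sigma2 = 0.5                         # per-antenna noise variance
nn = np.arange(N)
s = np.exp(2j * np.pi * nn * k / N)  # monochromatic plane wave, unit power

sig_bin = np.abs(np.fft.fft(s)[k]) ** 2               # signal power at bin k
noise = (rng.normal(0, np.sqrt(sigma2 / 2), (trials, N))
         + 1j * rng.normal(0, np.sqrt(sigma2 / 2), (trials, N)))
noise_bin = np.mean(np.abs(np.fft.fft(noise, axis=1)[:, k]) ** 2)

# Per-bin output SNR divided by per-antenna input SNR, in dB.
gain_db = 10 * np.log10((sig_bin / noise_bin) / (1.0 / sigma2))
print(round(gain_db, 1))   # approaches 10*log10(1024) ~= 30.1 dB
```

Repeating the experiment with an approximate transform in place of `np.fft.fft` yields the per-bin degradation Δγₓ directly.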

Digital VLSI Realization
Next, we explore digital VLSI realizations of the three ADFT approaches outlined in (18), (19), and (20) using a time-multiplexed approach. Traditionally, arithmetic complexity amounts to counts of both multiplication operations and addition operations. However, for semi-parallelized hardware implementations on VLSI platforms, the existence of parallel sub-systems offers a trade-off between circuit complexity and algorithm execution speed, as described by Amdahl's Law [3].
The proposed algorithms are based on radix-32 SFGs, which imply that one 1024-point transform is completed every 32 clock cycles. The radix-32 SFG allows re-use of the ADFT, DFT, and twiddle-factor cores using time-multiplexing up to 32 levels. The use of time-multiplexed operations leads to generalized multiplier structures that do not distinguish trivial multiplications by 0, 1, and −1; the multiplication count for the twiddle-factor block reflects this. The first time-multiplexed ADFT32 block feeds a transpose buffer, which realizes the matrix transposition operation in digital VLSI hardware while operating in step with the system clock. One complete matrix transpose operation is achieved every 32 clock cycles. The transpose buffer feeds the second time-multiplexed ADFT32 after suitable twiddle factors have been applied, which, in turn, furnishes the desired 1024-point ADFT values. In order to minimize the chances of overflow, the second time-multiplexed ADFT32 block in Fig. 5 uses a word length larger by one bit than that of the first time-multiplexed ADFT32 block. The larger word length accommodates the bit growth from the arithmetic operations carried out in the first time-multiplexed ADFT32 and the twiddle factors.

Transpose Buffer and Twiddle Factors
The transpose buffer shown in Fig. 6 consists of a mesh of 1024 delays and 32 parallel multiplexers, each possessing 32 inputs. The transpose buffer block generates the transpose of the first set of frequency bins. The transposition enables the column-wise DFT computation required in (18)-(20).
Twiddle-factor multiplication count consists of 961 non-trivial complex multiplications spread over 32 clock cycles.
However, these are implemented using 32 parallel complex multipliers, each of which consumes 3 real multipliers and 5 adders (Gauss's algorithm for complex multiplication). The twiddle-factor block therefore furnishes the only multiplications present in ADFT1024_1, which results from Algorithm 1 in a radix-32 hardware realization. Each of the column bins (after the transpose buffer) undergoes a multiplication by ω₁₀₂₄^{m·n}, where 0 ≤ m ≤ 31 and 0 ≤ n ≤ 31. Therefore, the precision of the twiddle-factor multipliers plays a critical role in the final area (A), area-time (AT), and area-time-squared (AT²) metrics. In this paper, we have set the twiddle-factor precision equal to the system word size of the inputs to the ADFT1024_1 core; this is a design parameter, and choosing lower precision levels for the twiddle factors would improve the VLSI metrics of all three proposed algorithms. In a sense, hardware designed with such conservative parameters can be thought of as a worst-case benchmark, with more coarsely quantized twiddle factors leading to even better improvements in area, area-time, and area-time-squared metrics.

Circuit Complexity
All circuits operate for 32 clock cycles to produce one 1024-point transform. Complex multiplication is realized using 3 real multiplier circuits and 5 real adder circuits, following Gauss's multiplication algorithm. The twiddle-factor block implementing Ω₃₂ in (18) with Gauss multiplication therefore requires 96 real multiplier circuits and 160 adder circuits. This block is common to all four 1024-point designs (the three proposed ADFTs and the reference DFT).
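The 3-multiplier complex product can be illustrated in software; a minimal sketch of Gauss's algorithm (the helper name `gauss_cmul` is ours):

```python
def gauss_cmul(ar, ai, br, bi):
    """(ar + j*ai) * (br + j*bi) using 3 real multiplications and
    5 real additions/subtractions (Gauss's algorithm).  When one
    operand is a constant, as with twiddle factors, two of the five
    additions involve only constants and can be precomputed."""
    t1 = br * (ar + ai)          # multiplication 1
    t2 = ar * (bi - br)          # multiplication 2
    t3 = ai * (br + bi)          # multiplication 3
    return (t1 - t3, t1 + t2)    # (real part, imaginary part)
```

In hardware, each of the 3 products maps to a real multiplier circuit and each of the 5 sums to an adder circuit, giving the 32 × 3 = 96 multiplier and 32 × 5 = 160 adder counts above.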
ADFT1024_1 core
As shown in Fig. 5, the proposed radix-32 time-multiplexed architecture for Algorithm 1 uses two ADFT32 cores.

ADFT1024_2 core
In ADFT1024_2, the row-wise DFT block is substituted by the ADFT32 block. The DFT32 requires a total of 78 real multiplier circuits and 398 adder circuits. Because the ADFT32 requires 348 adder circuits but no multipliers, ADFT1024_2 has an overall circuit complexity of 398 + 160 + 348 = 906 adder circuits and 96 + 78 = 174 multiplier circuits.

ADFT1024_3 core
The circuit complexity for Algorithm 3 is the same as for ADFT1024_2; the only change is the placement of the elements at the architectural level.
The 1024-point DFT core (denoted DFT1024), obtained by using two DFT32 cores for the row- and column-wise FFTs, would require 78 × 2 + 96 = 252 real multiplier circuits and 398 × 2 + 160 = 956 adder circuits. This is our reference radix-32 FFT circuit for baselining the circuit complexities of the proposed ADFT1024 algorithms. The circuit complexities for the proposed designs as well as DFT1024 are presented in Table 3.

ASIC Synthesis and Place-Route Results: 45nm CMOS
The synthesis and place-and-route results for the 32-point cores are summarized in Table 4. In Table 5, we list the hardware implementation metrics for ADFT1024_1, ADFT1024_2, and ADFT1024_3. Metrics for the DFT1024 core are included as reference values.

Analysis of the Results
The results in Table 4 show that the 32-point ADFT core demands considerably fewer hardware resources than the 32-point exact DFT core. On the other hand, the implementation of the transpose buffer with twiddle-factor multiplication adds a fixed hardware complexity to the system for both the exact and the approximate architectures. As a result, the transpose buffer causes the highest area consumption and a relatively high power consumption in comparison to the 32-point ADFT cores. Thus, it becomes the dominant factor in hardware complexity for the designs of the three 1024-point approximate transforms, as shown in Table 5.
The core ADFT1024_1 gives the best hardware utilization, whereas ADFT1024_2 gives the worst, as can be seen in Table 5. Algorithm 3 gives the best error performance, i.e., it provides the most accurate approximation; moreover, the hardware resource consumption of its physical realization, ADFT1024_3, is close to that of ADFT1024_2. The error performance of Algorithm 2 does not differ much from that of Algorithm 1, which also provides a hardware realization of comparable cost.

ADFT-based 1024-Beam Digital Beamformers
In the proposed system, each ADFT bin corresponds to a unique direction in space. Ideally, these bins should be identical to the spatial DFT bins, but their magnitudes may deviate because of the approximation. The four worst bins for each of the three algorithms are shown in Fig. 7. The resulting errors are small enough to be acceptable for the low-SNR scenarios seen in practical wireless systems. Note that the practical realization of 1024-element ULAs for generating narrow ADFT-based beams in currently-licensed frequency bands (up to the V band) may be challenging due to the large sizes of the resulting apertures. However, due to ongoing research in the sub-THz range [8,11,20,23,26,31,36,38,51], the W and G bands will soon be commercially available for both licensed and unlicensed use. At a carrier frequency of 300 GHz, λ/2 = 0.5 mm, and thus the size of a Nyquist-spaced 1024-element ULA shrinks to a reasonable 51.2 cm.

Conclusions
FFTs are used to reduce the computational cost of evaluating the DFT, generally decreasing the complexity from O(N²) down to O(N log N). In this paper, we show that further savings can be accomplished by means of approximate methods.
The resulting 1024-point DFT approximations present a trade-off between performance and hardware complexity, without significant loss in terms of worst-case side-lobe level and SNR.
Our work shows that larger block-length DFT approximations can be obtained from smaller-size approximations derived using previously-described numerical optimization methods. Our methodology can be directly applied to any DFT whose block length is a perfect square. The choice of algorithm depends on the application and its tolerance for computational error in the DFT block. Highly error-tolerant applications can greatly benefit from Algorithm 1, which has the lowest complexity; Algorithm 2 or 3 may be selected when Algorithm 1 does not furnish sufficient performance.