Skip to Main Content
In this paper we extend a custom FFT vector architecture by adding multiple lane capabilities and study its hardware implementation. We use the six step algorithm to segment a long transform of size N = Z × L into L smaller transforms of size Z. We split the data into pairs of vector registers (for the real and imaginary part), containing Z elements. A vector register pair with its corresponding functional unit form a single lane replicated L times. While smaller transforms proceed iteratively all of them are computed in parallel. The shorter FFT transforms along the X dimension are computed using previously proposed vector permutations while the transforms along the Y dimension are performed using a simple butterfly network that handles inter-lane communication. All data patterns required by the FFT computation are generated implicitly in hardware by a simple control unit. No data transposition is required and the twiddle factors are stored locally inside the functional units. We validated our design through simulation and ASIC synthesis targeting 90nm technology. We compare three possible configurations for computing a 256 point FFT, all running at 217 MHz with Z × L equal to: a) 32 × 8, b) 16 × 16 and c) 8 × 32. Configuration a) is the smallest and the slowest; configuration b) requires 1.43 times fewer cycles and requires 1.64 more area while configuration c) requires 1.76 times fewer cycles and is 3.16 times larger. Unlike other high performance FFT implementations, our design offers the possibility to trade-off execution speed for two types of resources: vector register size and number of lanes. Another important contribution is the possibility to execute 2D FFT without any modifications or adaptations.