By Topic

Implementation Study of FFT on Multi-lane Vector Processors

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

2 Author(s)
Spinean, B. ; Comput. Eng. Lab., Tech. Univ. Delft, Delft, Netherlands ; Gaydadjiev, G.

In this paper we extend a custom FFT vector architecture by adding multiple lane capabilities and study its hardware implementation. We use the six step algorithm to segment a long transform of size N = Z × L into L smaller transforms of size Z. We split the data into pairs of vector registers (for the real and imaginary part), containing Z elements. A vector register pair with its corresponding functional unit form a single lane replicated L times. While smaller transforms proceed iteratively all of them are computed in parallel. The shorter FFT transforms along the X dimension are computed using previously proposed vector permutations while the transforms along the Y dimension are performed using a simple butterfly network that handles inter-lane communication. All data patterns required by the FFT computation are generated implicitly in hardware by a simple control unit. No data transposition is required and the twiddle factors are stored locally inside the functional units. We validated our design through simulation and ASIC synthesis targeting 90nm technology. We compare three possible configurations for computing a 256 point FFT, all running at 217 MHz with Z × L equal to: a) 32 × 8, b) 16 × 16 and c) 8 × 32. Configuration a) is the smallest and the slowest; configuration b) requires 1.43 times fewer cycles and requires 1.64 more area while configuration c) requires 1.76 times fewer cycles and is 3.16 times larger. Unlike other high performance FFT implementations, our design offers the possibility to trade-off execution speed for two types of resources: vector register size and number of lanes. Another important contribution is the possibility to execute 2D FFT without any modifications or adaptations.

Published in:

Digital System Design (DSD), 2012 15th Euromicro Conference on

Date of Conference:

5-8 Sept. 2012