Introduction
The Fourier transform [1] is a key component in 5G communication systems [2]. This mathematical operation transforms a signal from the time domain into the frequency domain. The discrete version of the Fourier transform is called the discrete Fourier transform (DFT). To calculate the DFT, the fast Fourier transform (FFT) algorithm proposed by Cooley and Tukey [3] reduces the operation complexity from $O(N^{2})$ to $O(N \log N)$.
In 5G communications, the size of the FFTs is obtained as a product of powers of 2, 3, and 5, as is detailed in its physical layer description [2]. This motivates the need for designing non-power-of-two (NP2) FFTs. During the 20th century, several algorithms were proposed to make NP2 FFTs more efficient, such as those by Rader [4] and Winograd [5]. Furthermore, other NP2 algorithms such as the prime factor algorithm [6], [7], [8] have been proposed.
When the FFT algorithm is implemented in hardware, pipelined architectures allow for high performance [9], [10], [11], [12], [13], [14], [15]. In fact, the field of pipelined FFT hardware architectures has been extensively developed over the last decades [16]. These designs have reached a high degree of optimization for power-of-two (P2) sizes [17]. By comparison, architectures for NP2 sizes have barely been explored, due to the higher complexity of the algorithms for NP2 sizes [4], [5], [6], [7], [8]. As a consequence, NP2 FFT architectures are rarely used in communication systems. Even when the most suitable FFT size for the system is not a power of two, it is common to use a larger power-of-two size instead.
Nowadays, pipelined FFT hardware architectures for NP2 sizes mostly consider single-path delay feedback (SDF) architectures [18], [19], [20], [21], [22], [23], [24], [25], [26], with the exception of [27]. However, for NP2 sizes, SDF architectures are not as efficient as could be expected: although SDF architectures process data in series at a rate of one sample per clock cycle, the butterflies that they use operate on data in parallel. This means that the butterflies only work for a fraction of the time, whereas the rest of the time they wait for new data. This leads to a utilization of the butterflies in SDF FFTs of
In this paper, new efficient serial pipelined butterflies for radices 2, 3, 4, and 5 are proposed. These butterflies reach a high utilization that allows for achieving low area and high performance simultaneously. The proposed designs focus on minimizing hardware-consuming components such as adders and multipliers. The strategy that has been followed is to divide the complex-valued calculations of the butterflies into operations with real-valued data. Then, these operations are distributed along a pipelined circuit. This, combined with a carefully designed data management, leads to serial butterflies with a high degree of optimization. The preliminary version of the serial butterflies proposed in this paper was developed in the author’s Bachelor Thesis [29]. This paper provides the scientific publication of that work and completes it with new implementations, experimental results, and comparison. The proposed butterflies are suitable for future non-power-of-two serial commutator (SC) FFT architectures [10], where the processing elements operate on data that arrive in series in consecutive clock cycles.
The novelty and contributions of the paper can be observed at various levels. First, this paper is the first work that deals rigorously with the design of serial butterflies. Second, it is the first to highlight and address one of the key problems in NP2 FFTs, namely the low utilization of butterflies. Third, it presents efficient solutions to this problem. Fourth, designing the butterflies required a thorough analysis of the data flow in order to obtain an order of operations that reduces the number of hardware components. Finally, we have aimed to make the paper complete, providing any information related to the proposed butterflies that may be relevant to the reader. The reason is that the design of optimized butterflies is fundamental for the design of efficient NP2 FFTs. Without them, future NP2 FFTs will not be feasible in communication systems, because they would still require a large amount of hardware, as they do nowadays. Thus, the final goal of this work is to develop NP2 FFT architectures that are as efficient as power-of-two ones. Once this goal is reached, communication systems will be able to implement NP2 FFTs instead of being forced to resort to P2 FFT sizes. This ambitious goal of deriving efficient NP2 FFT architectures will be pursued in several steps. In this paper, we lay the first stone by developing efficient butterflies for NP2 FFTs. In future works, we will present new efficient algorithms for NP2 FFTs, shuffling circuits to calculate the permutations in NP2 FFTs, and, finally, the desired efficient NP2 FFT architectures.
The paper is organized as follows: In Section II, the state-of-the-art is reviewed. In Section III, the proposed serial butterflies are presented and analyzed in detail. In Section IV, the proposed designs are compared to previous ones. In Section V, implementation results on FPGA and ASIC are reported and compared with parallel butterflies. Finally, in Section VI, the main conclusions of the paper are provided.
Background
A. The FFT
An $N$-point DFT is calculated as \begin{equation*} X[k] = \sum _{n=0}^{N-1}x[n]\cdot W_{N}^{nk}, \;\;\;\;\;\;\; k = 0, 1, \ldots, N-1, \tag{1}\end{equation*} where $W_{N} = e^{-j2\pi /N}$.
B. Butterflies
A radix-2 butterfly calculates \begin{align*} X[{0}] &= x[{0}] + x[{1}], \tag{2a}\\ X[{1}] &= x[{0}] - x[{1}]. \tag{2b}\end{align*}
It can be observed that these operations correspond to the calculation of the DFT in (1) for $N = 2$.
Fig. 2 shows the signal flow graph of a radix-3 butterfly based on Rader’s algorithm [4]. According to (1), it carries out rotations by 0°, 120° and −120°. The operations that are extracted from the flow graph are \begin{align*} X[{0}] &= x[{0}] + x[{1}] + x[{2}], \tag{3a}\\ X[{1}] &= x[{0}] - \frac {1}{2}(x[{1}]+x[{2}]) -j\frac {\sqrt {3}}{2}(x[{1}]-x[{2}]), \tag{3b}\\ X[{2}] &= x[{0}] - \frac {1}{2}(x[{1}]+x[{2}]) +j\frac {\sqrt {3}}{2}(x[{1}]-x[{2}]). \tag{3c}\end{align*}
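As a quick numerical check, the following Python sketch evaluates (3a)-(3c) and compares the result with a direct evaluation of (1). It assumes the usual convention $W_N = e^{-j2\pi/N}$, which agrees with the rotations by 0°, 120° and −120° mentioned above. The function names `dft` and `radix3_butterfly` are illustrative and are not taken from the paper.

```python
import cmath
import math


def dft(x):
    """Direct evaluation of (1), assuming W_N = exp(-j*2*pi/N)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def radix3_butterfly(x0, x1, x2):
    """Radix-3 butterfly following (3a)-(3c)."""
    s = x1 + x2                         # shared sum x[1] + x[2]
    d = x1 - x2                         # shared difference x[1] - x[2]
    common = x0 - 0.5 * s               # term shared by X[1] and X[2]
    rot = 1j * (math.sqrt(3) / 2) * d   # j * sqrt(3)/2 * (x[1] - x[2])
    return [x0 + s, common - rot, common + rot]


if __name__ == "__main__":
    x = [1 + 2j, -0.5 + 0.25j, 3 - 1j]
    for got, ref in zip(radix3_butterfly(*x), dft(x)):
        print(got, ref)
```

Sharing the sum and the difference of the last two inputs is what keeps the butterfly down to a single non-trivial constant, $\sqrt{3}/2$.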
Fig. 3 represents the flow graph of the radix-4 butterfly.
The operations that are carried out are \begin{align*} X[{0}] &= (x[{0}] + x[{2}]) + (x[{1}] + x[{3}]), \tag{4a}\\ X[{1}] &= (x[{0}] - x[{2}]) - j(x[{1}] - x[{3}]), \tag{4b}\\ X[{2}] &= (x[{0}] + x[{2}]) - (x[{1}] + x[{3}]), \tag{4c}\\ X[{3}] &= (x[{0}] - x[{2}]) + j(x[{1}] - x[{3}]). \tag{4d}\end{align*}
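The two-stage structure of (4a)-(4d) can be sketched in Python as follows. The first stage forms the four sums and differences in parentheses, and the second stage combines them. The comparison against a direct DFT assumes $W_N = e^{-j2\pi/N}$, and the names `dft` and `radix4_butterfly` are illustrative.

```python
import cmath


def dft(x):
    """Direct evaluation of (1), assuming W_N = exp(-j*2*pi/N)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def radix4_butterfly(x0, x1, x2, x3):
    """Radix-4 butterfly following (4a)-(4d)."""
    # First stage: four complex additions/subtractions.
    a = x0 + x2
    b = x0 - x2
    c = x1 + x3
    d = x1 - x3
    # Second stage: four complex additions/subtractions.
    # Multiplying by j only swaps real and imaginary parts with a sign change.
    return [a + c, b - 1j * d, a - c, b + 1j * d]


if __name__ == "__main__":
    x = [1 + 1j, 2 - 3j, -4 + 0.5j, 0.25 + 2j]
    for got, ref in zip(radix4_butterfly(*x), dft(x)):
        print(got, ref)
```

No general multiplier is needed, since the only rotations are by multiples of 90°.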
Fig. 4 shows a signal flow graph of the radix-5 butterfly. It is based on the Winograd’s algorithm [5]. However, in Fig. 4 we have reordered the operations of the third stage so that the first multiplication is by
C. SDF FFT Architectures
SDF FFT architectures are the most common pipelined architectures used to process NP2 FFTs [19], [21], [22], [23], [24], [25], [26], [27]. Fig. 5 shows an SDF stage that uses a radix-2 butterfly. Input data arrive in series during consecutive clock cycles. The first half of the inputs is streamed to the buffer. While the buffer is being filled, the butterfly is not used. When the buffer is full, the output of the buffer is streamed to the upper input of the butterfly to operate these samples with the new input data. When the butterfly starts to work, half of the processed data is streamed to the rotator, while the other half is stored in the buffer. Finally, the data stored in the buffer is streamed to the output. This process repeats periodically as new data arrive at the circuit.
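The behavior described above can be illustrated with a simple behavioral model in Python. This is only a sketch under stated assumptions: it processes one block at a time, it ignores the rotator, and it assumes that the sums are streamed forward while the differences are written back to the buffer. The function name `sdf_radix2_stage` is hypothetical.

```python
def sdf_radix2_stage(block):
    """Behavioral model of one radix-2 SDF stage for a single input block.

    Returns the output stream and the fraction of input cycles in which
    the butterfly is active.
    """
    n = len(block)
    half = n // 2
    buffer = []
    outputs = []
    active_cycles = 0

    # First half: samples are streamed to the buffer, butterfly is idle.
    for t in range(half):
        buffer.append(block[t])

    # Second half: butterfly combines buffered samples with new inputs.
    differences = []
    for t in range(half):
        upper = buffer[t]
        lower = block[half + t]
        outputs.append(upper + lower)       # assumed: sums go forward
        differences.append(upper - lower)   # assumed: differences go to buffer
        active_cycles += 1

    # The buffered results are streamed out afterwards.
    outputs.extend(differences)
    return outputs, active_cycles / n


if __name__ == "__main__":
    out, utilization = sdf_radix2_stage([1, 2, 3, 4])
    print(out, utilization)
```

In this model the butterfly is active in only half of the clock cycles in which input data arrive, which is the utilization problem that motivates serial butterflies.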
In the general case of radix-
Proposed Serial Butterflies
A. Theoretical Limits
For a radix-$r$ serial butterfly, the minimum numbers of real adders and real multipliers are \begin{align*} \text {Real adders}_{\text {min}} &= \left \lceil{ \frac {\text {Real additions in SFG}}{r}}\right \rceil, \tag{5a}\\ \text {Real multipliers}_{\text {min}} &= \left \lceil{ \frac {\text {Real multiplications in SFG}}{r} }\right \rceil, \tag{5b}\end{align*}
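The bounds in (5a) and (5b) can be evaluated with a short Python sketch. The operation counts used in the example are derived from the flow graphs as described in the text: the radix-2 butterfly performs four real additions, see (6a)-(6d), and the radix-4 butterfly performs eight complex additions, i.e., 16 real additions. The function name `min_real_units` is illustrative.

```python
def min_real_units(real_ops_in_sfg, radix):
    """Ceiling of (real operations in the signal flow graph) / radix,
    as in (5a) for adders and (5b) for multipliers."""
    # Integer ceiling division avoids floating-point rounding.
    return -(-real_ops_in_sfg // radix)


if __name__ == "__main__":
    # Radix-2: 2 complex additions = 4 real additions, no multiplications.
    print("radix-2 adders:", min_real_units(4, 2))
    # Radix-4: 8 complex additions = 16 real additions, no multiplications.
    print("radix-4 adders:", min_real_units(16, 4))
    print("radix-4 multipliers:", min_real_units(0, 4))
```

The resulting values, two real adders for radix-2 and four for radix-4, match the adder counts of the proposed radix-2 and radix-4 serial butterflies.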
B. Proposed Radix-2 Serial Butterfly
In the radix-2 butterfly in Fig. 1, data are complex-valued. Thus, it calculates \begin{align*} X_{r,0} &= x_{r,0} + x_{r,1}, \tag{6a}\\ X_{r,1} &= x_{r,0} - x_{r,1}, \tag{6b}\\ X_{i,0} &= x_{i,0} + x_{i,1}, \tag{6c}\\ X_{i,1} &= x_{i,0} - x_{i,1}, \tag{6d}\end{align*}
Table III shows the timing diagram of the proposed radix-2 serial butterfly. Each row of the timing diagram corresponds to a signal of the circuit shown in Fig. 6. Note that letters are added to Fig. 6 to identify these signals. The first two signals in Table III represent the values of the control signals of the multiplexers. The next two rows represent the real and imaginary parts of the input data, which arrive at the same clock cycle. Signals A to H represent intermediate nodes of the circuit. Finally, the last two rows represent the real and imaginary parts of the output data, which are provided at the same clock cycle. Pairs of data to be processed in the butterfly arrive in consecutive clock cycles. Thus, the serial-parallel permutation circuit permutes data so that the real part of the second sample and the imaginary part of the first sample are exchanged at C and D. This permutation makes it possible to operate the real parts of the data first and the imaginary parts during the next clock cycle, according to the set of equations (6). Then, the butterfly provides the real and imaginary parts to the output at the same clock cycle by using the second serial-parallel permutation circuit. As each serial-parallel permutation circuit has a latency of one clock cycle, the butterfly has a total latency of two clock cycles.
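The data reordering described above can be summarized with a functional Python model. It is a sketch and not a cycle-accurate description of Fig. 6: registers and multiplexer control signals are not modeled, and the function name `serial_radix2_butterfly` is hypothetical. The model only shows that one real adder and one real subtractor, reused in two consecutive clock cycles, are enough to evaluate (6a)-(6d).

```python
def serial_radix2_butterfly(samples):
    """Functional model of a serial radix-2 butterfly.

    `samples` is a list of complex values in which each pair (x0, x1)
    arrives in consecutive clock cycles.
    """
    outputs = []
    for k in range(0, len(samples), 2):
        x0, x1 = samples[k], samples[k + 1]

        # First permutation: exchange Im(x0) and Re(x1), so that the
        # real parts are processed first and the imaginary parts next.
        cycle_a = (x0.real, x1.real)
        cycle_b = (x0.imag, x1.imag)

        # The same adder/subtractor pair is used in both cycles.
        sum_r, diff_r = cycle_a[0] + cycle_a[1], cycle_a[0] - cycle_a[1]
        sum_i, diff_i = cycle_b[0] + cycle_b[1], cycle_b[0] - cycle_b[1]

        # Second permutation: regroup real and imaginary parts so that
        # each output sample leaves in a single clock cycle.
        outputs.append(complex(sum_r, sum_i))    # X[0]
        outputs.append(complex(diff_r, diff_i))  # X[1]
    return outputs


if __name__ == "__main__":
    print(serial_radix2_butterfly([1 + 2j, 3 - 1j]))
```

Because the adders receive a new pair of real operands in every clock cycle, they are never idle in this model.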
C. Proposed Radix-3 Serial Butterfly
Fig. 7 shows the implementation of the proposed radix-3 serial butterfly. As in its flow graph in Fig. 2, the proposed hardware implementation distributes the required operations along three stages. The dashed lines placed after the adders and multipliers represent pipeline registers used during the implementation to improve the maximum clock frequency. The number above some dashed lines indicates the number of pipeline registers connected in series, and dashed lines with no number represent a single pipeline register. The proposed circuit reaches the minimum number of real multipliers according to Table II. The multiplication by $\sqrt{3}/2$ is approximated with shifts and additions as \begin{equation*} \frac {\sqrt {3}}{2} \approx \frac {887}{1024} = \frac {((8-1)\cdot 16 - 1)\cdot 8 -1}{1024} \approx 0.8662. \tag{7}\end{equation*}
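The decomposition in (7) maps directly to three shifts and three subtractions. The following Python sketch illustrates it on integer data. The final right shift by 10 bits truncates the result, which is an assumption made here for simplicity, since the rounding used in the actual circuit is not stated. The function name is illustrative.

```python
def times_sqrt3_over_2(x):
    """Approximate x * sqrt(3)/2 as x * 887/1024 using shifts and subtractions.

    Follows (7): 887 = ((8 - 1) * 16 - 1) * 8 - 1.
    """
    t = (x << 3) - x   # 7 * x
    t = (t << 4) - x   # 111 * x
    t = (t << 3) - x   # 887 * x
    return t >> 10     # divide by 1024 (truncation is an assumption)


if __name__ == "__main__":
    for value in (1024, 1000, -512):
        print(value, times_sqrt3_over_2(value), value * 3 ** 0.5 / 2)
```

Only three adder-type operations are used, so no general-purpose multiplier is required for this constant.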
Table IV shows the timing diagram of the proposed circuit. Note that input data arrive in natural order as
D. Proposed Radix-4 Serial Butterfly
The operations required to process a 4-point FFT are described in the flow graph of Fig. 3. There are two clearly distinguished stages, which consist of four complex additions each. Based on it, Fig. 9 shows the proposed radix-4 butterfly. As in its flow graph, the proposed hardware implementation distributes the required operations along two stages. These stages include four real adders and zero multipliers, which correspond to the minimum values according to Table II.
Table VI shows the timing diagram of the circuit and the operations are detailed in Table VII. Input data arrive in natural order as
E. Proposed Radix-5 Serial Butterfly
Fig. 10 shows the proposed radix-5 serial butterfly. The aim of this implementation is to use the minimum possible number of real multipliers, as well as a number of real adders in line with the number of stages in the flow graph of Fig. 4. As in the flow graph, the proposed hardware implementation distributes the required operations along five stages. These stages include 10 real adders and two real multipliers, which means that the minimum number of real multipliers according to Table II is achieved. Table VIII shows the operations that are calculated at each stage. The multiplier constants of the radix-5 serial butterfly in Fig. 10, which also appear in Table VIII, are the ones listed in the first and second columns of Table IX. Note that the magnitude of these constants is the same as the magnitude reported in Table I for the radix-5 parallel butterfly. However, their phase is different in some cases. Additionally, both real multipliers have been implemented with shift-and-add operations as reconfigurable multiple constant multipliers (RMCM) [31]. The upper multiplier, M0, is shown in Fig. 11 and the lower multiplier, M1, is shown in Fig. 12. The multiplier M1 has been designed with the heuristics in [31]. The control signals S9 and S10 that appear in these circuits are the same control signals that appear in Fig. 10. The third column of Table IX shows the approximated values used in the shift-and-add circuits. Both shift-and-add circuits use 3 real adders, which are shared for every constant with the help of additional multiplexers. As a result, the proposed serial radix-5 butterfly uses 10 real adders and two real multipliers implemented with three real adders each.
Table X shows the timing diagram of the circuit. As in all the proposed serial butterflies, inputs arrive in natural order and a circuit that reorders the inputs is needed. This circuit consists of serial-serial permutation circuits linked to a serial-parallel permutation circuit. The serial-serial permutation circuits exchange data from natural order to
Comparison
Table XI shows the comparison between the proposed serial butterflies and previous approaches. Previous works include radix-3 serial butterflies [26], [28], a 2-parallel radix-3 butterfly [27] and a radix-5 serial butterfly [28].
The table compares the works in terms of real multipliers, real adders, real multiplexers, registers, throughput in samples per clock cycle, and latency in clock cycles (cyc.). For the number of real multipliers it is assumed that a complex multiplication uses four real multipliers and two real adders, and a complex multiplication by either a pure complex or a pure real constant requires two real multipliers.
The proposed radix-2 serial butterfly only requires 2 real adders, 4 real multiplexers, 4 registers, and no real multiplier. It processes one sample per clock cycle and has a latency of two clock cycles.
The proposed radix-3 serial butterfly processes one sample per clock cycle with a latency of four clock cycles. It requires 9 real adders, 12 real multiplexers, 8 registers, and no real multiplier. Compared to [26], it halves the number of adders from 18 to 9, which is a significant improvement, and also reduces the number of multiplexers by two. This improvement comes at the cost of a slight increase in registers and latency, which is not significant compared to the large reduction in adders. Compared to [27], the proposed approach is more hardware-efficient when processing serial data, as it halves the number of adders and reduces the number of multiplexers and registers by 53% and 33%, respectively. For 2-parallel data, two instances of the proposed butterfly could be used, which would require approximately the same number of components as [27]. Compared to [28], the proposed radix-3 butterfly uses 5 more real adders and two more registers. However, it saves two real multipliers and two multiplexers. Multipliers are the most hardware-consuming components: the area of a multiplier is similar to that of a number of adders equal to the data word length. For 16 bits, the two multipliers would require an area equivalent to approximately 32 adders, i.e., much more than the adders used in the proposed design.
The proposed radix-4 serial butterfly only requires 4 real adders, 14 real multiplexers, 12 registers, and no real multiplier. It processes one sample per clock cycle and has a latency of six clock cycles.
The proposed radix-5 serial butterfly requires 16 real adders, 45 real multiplexers, and 31 registers, and processes one sample per clock cycle with a latency of 13 clock cycles. Compared to the radix-5 butterfly in [28], the proposed implementation uses 6 more real adders, a similar number of multiplexers, 19 more registers, and takes 7 additional clock cycles to process the inputs. However, it removes the four real multipliers in [28], which are the most hardware-consuming components. For 16-bit data, these multipliers would require around 64 adders, leading to much more hardware cost than in the proposed design. Furthermore, contrary to [28], the proposed approach has the advantage that data are processed in pipeline without feedback loops in the data path. This guarantees that any number of pipeline registers can be added in order to increase the clock frequency.
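The area argument used in these comparisons can be made explicit with a small Python sketch. It applies the rule of thumb stated above, namely that one real multiplier occupies roughly as much area as a number of real adders equal to the word length. The adder counts attributed to [28] are not quoted directly here; they are derived from the differences given in the text (9 − 5 = 4 for radix-3 and 16 − 6 = 10 for radix-5). The function name `adder_equivalent_area` is illustrative.

```python
def adder_equivalent_area(real_adders, real_multipliers, word_length):
    """Rough area estimate in units of one real adder, assuming that a
    real multiplier costs about `word_length` real adders."""
    return real_adders + real_multipliers * word_length


if __name__ == "__main__":
    WL = 16
    designs = {
        "radix-3, proposed": (9, 0),
        "radix-3, [28] (derived)": (4, 2),
        "radix-5, proposed": (16, 0),
        "radix-5, [28] (derived)": (10, 4),
    }
    for name, (adders, multipliers) in designs.items():
        print(name, adder_equivalent_area(adders, multipliers, WL))
```

This estimate ignores multiplexers and registers, so it should be read only as a first-order indication of why removing multipliers outweighs the additional adders.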
Compared to the parallel butterflies in Table II, the proposed serial butterflies in Table XI reduce the number of real adders and real multipliers. Regarding the proposed radix-2 and radix-4 butterflies, the number of real adders in the proposed implementations has been reduced by a factor
Experimental Results
The proposed serial butterflies have been implemented on a Virtex UltraScale+ HBM XCVU37P-FSVH2892-2L-E. They have been designed with parameterizable word length (WL). The quantization noise in the butterflies has been studied and characterized in Table XII. This table shows the signal-to-quantization-noise ratio (SQNR) [32] in dB as a function of the word length of each real and imaginary part of the data, and assuming that the word length is the same along the circuit. The SQNR is calculated as \begin{equation*} \text {SQNR (dB)} = 10\cdot \log _{10}\left ({\frac {E\{ | X_{ID} |^{2}\}}{E\{ | X_{Q}-X_{ID} |^{2}\}}}\right), \tag{8}\end{equation*}
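Equation (8) can be evaluated numerically as in the following Python sketch. It assumes that $X_{ID}$ denotes the ideal output and $X_{Q}$ the quantized output, and it estimates the expectations as sample means. The helper `quantize` and its rounding behavior are assumptions for illustration and do not describe the arithmetic of the proposed circuits.

```python
import math


def sqnr_db(ideal, quantized):
    """SQNR in dB following (8), with expectations estimated as means."""
    signal_power = sum(abs(a) ** 2 for a in ideal) / len(ideal)
    noise_power = sum(abs(q - a) ** 2 for a, q in zip(ideal, quantized)) / len(ideal)
    return 10 * math.log10(signal_power / noise_power)


def quantize(value, frac_bits):
    """Round real and imaginary parts to `frac_bits` fractional bits
    (hypothetical helper; rounding to nearest is an assumption)."""
    scale = 2 ** frac_bits
    return complex(round(value.real * scale) / scale,
                   round(value.imag * scale) / scale)


if __name__ == "__main__":
    ideal = [complex(math.sin(0.37 * k), math.cos(0.91 * k)) for k in range(1, 200)]
    for bits in (8, 10, 12):
        approx = [quantize(v, bits) for v in ideal]
        print(bits, sqnr_db(ideal, approx))
```

With this kind of uniform quantization, each additional bit is expected to improve the SQNR by roughly 6 dB.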
The experimental results in Table XII show that the SQNR grows at a rate of 6 dB per bit for radix-2 and radix-4 butterflies. For radix-3 and radix-5, the 6 dB increase occurs for small word lengths. However, from
Table XIII shows the post-implementation results of the proposed serial butterflies (Prop.) and the parallel butterflies (Par.) in the Virtex UltraScale+ FPGA. The parallel butterflies have been designed as the direct implementation of their flow graphs shown in Section II. For a fair comparison, both serial and parallel architectures are compared under the same conditions: every multiplier is implemented with shift-and-add operations, inputs and outputs are registered, and pipeline registers have been added in order to achieve a higher clock frequency. This entails higher latency in terms of clock cycles than the values reported in Table XI. The figures of merit included in Table XIII are LUTs, registers, CARRY8s, CLBs, clock frequency, SQNR, latency, and power consumption. The power consumption results consider a clock frequency of 650 MHz.
In Table XIII, it can be observed that, compared to the parallel radix-2 butterfly, the proposed radix-2 serial butterfly uses a larger number of LUTs, a similar number of registers, 3 more CLBs, and half the number of CARRY8s. It also has higher latency and similar power consumption. Considering all these figures of merit, both architectures can be considered similar in terms of hardware resources and performance. However, it is worth noting that the proposed radix-2 butterfly processes data in series, whereas the parallel butterfly processes two parallel branches. Therefore, these architectures will be preferable in different scenarios, depending on how data arrive at the butterfly.
Regarding radix-3 butterflies, the proposed serial implementation saves 55 LUTs, 188 registers, and 12 CLBs, and halves the number of CARRY8 compared to the radix-3 parallel butterfly. Likewise, it reduces power consumption by 11%. These improvements in area and power consumption come at the cost of an increase in latency. This increase in latency is an expected result, as the serial butterfly has only one data path to calculate the same operations that a parallel butterfly calculates in parallel, i.e., in the parallel butterflies the operations are distributed among the parallel paths, whereas in the serial butterfly, these operations are distributed in time.
Regarding radix-4 butterflies, the proposed serial butterfly saves 66 LUTs, 65 registers, and 20 CLBs compared to the radix-4 parallel butterfly, which corresponds to savings of 25%, 16%, and 32%, respectively. The reduction of real adders by a factor
The proposed radix-5 butterfly has a similar number of LUTs compared to the radix-5 parallel butterfly. By contrast, the registers and CLBs are reduced by 27% and 11%, respectively. This reduction in registers is caused by the pipeline registers that are needed in the parallel butterfly so that it can reach a frequency of 650 MHz, i.e., the proposed serial butterfly needs fewer registers to reach a frequency of 650 MHz. The proposed radix-5 serial butterfly significantly reduces the number of real adders (see Table XI), which results in a reduction of CARRY8s by 67%. As expected, the latency of the proposed radix-5 serial butterfly increases and the power consumption decreases with respect to the radix-5 parallel butterfly, leading to savings of 20% in power consumption.
Finally, the proposed serial implementations have the same SQNR as the parallel ones, due to the fact that all the mathematical calculations are the same, including the shift-and-add calculation of the multiplications.
In order to further explore the capabilities of the proposed serial designs, ASIC results have been obtained. Table XIV shows the post-synthesis ASIC results for the proposed serial butterflies (Prop.) and the parallel butterflies (Par.) under the same conditions as in Table XIII. The figures of merit included in Table XIV are technology, operating voltage, combinational cells, sequential cells, cell area, SQNR, latency, and power consumption. The power consumption results consider a clock frequency of 800 MHz. The technology used is TSMC 40 nm. The operating voltage is 1.1 V. The values of SQNR and latency are the same as the ones reported in Table XIII, due to the fact that the logic circuit is the same in both FPGA and ASIC implementations.
In Table XIV, it can be observed that the proposed radix-2 serial butterfly uses more combinational cells and a similar number of sequential cells than the parallel one. However, the cell area of the proposed radix-2 serial butterfly is slightly smaller. Both implementations have similar power consumption. The experimental results for both radix-2 ASIC implementations are in line with the FPGA results. Compared to the power consumption reported for the radix-2 FPGA implementations, the radix-2 serial and parallel ASIC implementations reduce the power by 94%.
Regarding radix-3 butterflies, the proposed serial ASIC implementation saves 81 combinational cells and 222 sequential cells, which means a 32% reduction of sequential cells. The area and power consumption are reduced by 29% and 28%, respectively. Compared to the power consumption reported for the radix-3 FPGA implementations, the radix-3 serial and parallel ASIC implementations reduce the power by 92% and 94%, respectively.
Regarding radix-4 butterflies, the proposed serial ASIC implementation saves 63 combinational cells and 63 sequential cells. The cell area and power consumption are reduced by 31% and 23%, respectively. Compared to the power consumption reported for the radix-4 FPGA implementations, the radix-4 serial and parallel ASIC implementations reduce the power by 94% and 93%, respectively.
Finally, the proposed radix-5 serial ASIC implementation saves 719 combinational cells and 547 sequential cells, which means a reduction of 19% and 32%, respectively. The total cell area and power consumption are reduced by 33% and 31%, respectively. Compared to the power consumption reported for the radix-5 FPGA implementations, the radix-5 serial and parallel ASIC implementations reduce the power by 92% and 93%, respectively.
As a result, with the exception of the proposed radix-2 serial butterfly, it can be observed that the proposed serial butterflies reduce area and power by around 30% in the ASIC implementations with respect to the parallel ASIC implementations. The ASIC results reported are in line with the FPGA results, supporting the improvement of the proposed serial designs.
Conclusion
This work has presented new serial butterflies for NP2 FFTs in communication systems for 5G and beyond. Contrary to butterflies in SDF architectures, the serial butterflies proposed in this paper have only one input and one output, which improves their utilization when data are processed in series. Furthermore, the proposed designs efficiently distribute the operations along a pipelined circuit, which reduces the number of hardware components. For radix-2 and radix-4, the proposed serial butterflies achieve the minimum number of real adders and real multipliers. For radix-3 and radix-5, they achieve the minimum number of real multipliers. Additionally, the multipliers have been implemented using shift-and-add operations, which provides further optimization of the circuits.
The proposed circuits have been implemented on an FPGA and ASIC. Their SQNR has been analyzed as a function of the word length. Experimental results show that the proposed serial butterflies achieve a high clock frequency and reduce the area and power consumption with respect to parallel butterflies at the cost of an increase in latency.
The proposed butterflies are suitable for future NP2 serial commutator (SC) FFT architectures, where the processing elements operate on data that arrive in series in consecutive clock cycles.
ACKNOWLEDGMENT
The authors would like to thank Prof. Martin Kumm for providing the adder graphs of the RMCM multipliers used in the proposed designs.