A Low-Complexity Shifting-Based Conflict-Free Memory-Addressing Architecture for Higher-Radix FFT

Conflict-free addressing (CFA) techniques are necessary for Fast Fourier Transform (FFT) hardware. For low radices, well-proven XOR-based addressing architectures are available in the literature. Applications such as wireless communication use higher FFT radices, more processing elements and a larger number of memory sets for a continuous flow of data. In the existing CFA techniques, the complexity increases with increasing radices or memory sets. In the proposed technique, the higher number of memory sets and radix are leveraged to an advantage. A novel scheme is suggested to reduce the complexity using a progressive shifting technique. The mathematical basis of the scheme is derived here and illustrated with an example of 512-point radix-16 FFT. The proposed and existing CFA architectures are designed using Verilog and implemented in the Semiconductor Laboratory, Chandigarh, 180 nm SCL library. The synthesis results show that the proposed scheme achieves a 33% area reduction compared to the best existing schemes. The area reduction factor further improves with a higher number of FFT points.

High-radix Fast Fourier Transform (FFT) processors are in great demand today for handling high-speed systems such as Orthogonal Frequency Division Multiplexing (OFDM) based communication standards including Wireless Personal Area Network (WPAN) (IEEE 802.15c) and Wireless High Definition (HD) (IEEE 802.11ad, IEEE 802.11ay), vehicle and military radars, and medical imaging such as Magnetic Resonance Imaging (MRI). These processors are based on different radices of the FFT algorithm, such as radix 2, 4, 8 and 16. Higher radices (greater than 4) are needed for higher throughput, especially for FFT sizes of 512 points and above. For example, a 512-point FFT architecture with a higher radix (8 or 16) requires only two or three stages, compared to nine stages when using radix 2. The existing XOR or modulo-addition schemes for accessing multiple banks of memory in a conflict-free manner work well for lower radices, but for higher radices, the complexity and area consumed increase significantly. Continuous-flow (CF) high-performance FFTs need at least two memories. Additional memories are required to feed more processing elements (PEs) to boost performance. Therefore, in high-performance FFTs, the number of memories is usually two or three, or sometimes even more. A Conflict-Free Access (CFA) [1] scheme has to be designed for each memory, leading to a sizeable increase in area. In this paper, the higher number of memories is used to an advantage. When higher radices are used, the number of stages is also limited to approximately 3 to 4. When the number of stages and the number of memories are in the same range, each memory is tied to only one or two stages. In such a case, the XOR or modulo-addition logic can be replaced with our proposed "progressive shift" (PS) technique, which is much simpler. The progressive shifting technique is not easy to apply in architectures where one memory supplies data for more than two stages of the system.
To demonstrate the advantages of the proposed technique, it is compared with existing techniques, and the major differences are outlined in Table 1.
The motivation and main contributions of this paper are highlighted below:
• The progressive shift technique is proposed for FFT architectures that have multiple memories and fewer stages serviced by a PE. Such FFT architectures usually find application in high-throughput FFT designs. The main idea is to reduce the complexity of the CFA circuits, which tend to occupy larger areas as the number of memories increases to improve performance.
• A theoretical framework for the technique is also formulated.
• Progressive shifting and modulo addition/XOR switching are embedded in a single mathematical framework.
• Simulation studies are carried out to show the superiority of the proposed technique in terms of the usage of hardware resources, and the results are plotted in the form of a graph.
• To validate the proposed architecture, a chip is fabricated and its layout is presented in Fig. 8.
The proposed architecture finds applications in OFDM-based MIMO communication systems, which demand high throughput and low area. The use of the proposed multi-bank memory architecture can be extended to implement 5G communication systems and Internet of Things (IoT) architectures in the future.

II. CONFLICT FREE ADDRESSING
An FFT processor generally uses some form of the Cooley-Tukey algorithm and a butterfly structure for processing data using a PE. The PE needs r inputs and generates r outputs in every clock cycle, where r is the radix of the FFT algorithm. As FFT algorithms usually require, the data have to be reordered before being applied to the inputs of the PE. This is commonly done by first storing the data in a memory and then retrieving them using an address different from the one used during storage. Since one memory can only give one output at a time, r banks are required in a memory to fulfill the need for r simultaneous outputs. The size of each bank is taken as N/r so that the total storage in the memory is N, where N is the number of FFT points. The main challenge is the distribution of data to the memory banks such that data required simultaneously by the butterfly are stored in different banks. This problem is called bank generation for CFA and uses switching to reorder the data. The switching circuit is called the forward commutator. It consists of a bank and address generator unit (BAGU), which generates the necessary bank number and address, and a multiplexer array, which directs the incoming data and address to that bank. Additionally, after the memory, a realignment commutator is needed that realigns the data with the inputs of the PE. The commutators use a large number of multiplexers to do this and hence consume a significant area. Each memory requires its own switching circuitry, as shown in Fig. 1. When several memories are used for a continuous flow of data, the area for conflict-free switching increases to a significant level. Hence, the development of a CFA scheme with a lower area is a challenging task that needs to be addressed. There have been two distinct approaches to CFA in the literature. The first is based on finding the bits in the data index that help in identifying a bank for it [2], [3], [9]. However, all these schemes explored the radix-2 architecture.
To achieve higher throughputs, in [4], a high radix scheme was proposed using parallel butterflies. On similar lines, [10] derived schedules for CFA. The scheme was elaborated for the case where multiple butterflies operated in parallel in the same stage. It also took pipelining delays into account to take care of folded or pipelined butterflies and hence, different read and write times for a set of data. In [11], the work of [4] was extended, and CFA schemes for the cases of both prime factor FFT and common factor FFT were presented. Various constraint sets were derived and based on them, banks and addresses were generated using XOR or modulo-r addition.
In all these cases, for a particular data index, the bank and row address are generated using XOR or modulo-addition logic, and it is difficult to decipher any definite pattern. This necessitates movement of not only data but also the "row address" from some input port of a switch to some other output port, resulting in a complex switching mechanism. The other addressing scheme was designed by Reisis [7], where the concepts of forward permutation and reverse permutation were proposed. However, rules or guidelines for a hardware-efficient permutation were not discussed. Additionally, it assumed an independent memory for every stage. It dealt with the particular case of one memory per stage, and no solution was provided for the case when a memory fed more stages. Hence, this permutation technique has limited applications. Hsiao [12] focused on modulo addition and suggested a form of permutation but did not study its applications or theory. Chen [13] used a similar technique for the output stage for the simple case of a power-of-two radix implementation but did not deal with other stages. For real FFT realization, Ma in [14] suggested a procedure to fill up an entire memory bank sequentially rather than filling up multiple banks simultaneously. Memory bandwidth usage was inefficient, as other banks were unused during the input or output phases. This inefficiency was later addressed in [15] at the cost of extra exchange stages to attain the bit order required for CF. The CFA scheme from the classic work of Johnson [3] was used. Hence, these works did not truly add anything new to CFA, although they came up with a more efficient architecture for real-valued FFTs. In [16], an XOR-based scheme was derived from the postulates and basic principles suggested by Johnson, with some modification. The design in [8] proposed a CFA for real FFT. The scheme was explained for a 32-point FFT.
The derivation of the CFA scheme was not provided. The implementation used counters to store data depending on the count, but in effect, the switching complexities remained similar. In [17], mixed-radix algorithms were discussed. The design considered only a single memory without banks. Address generation was discussed to obtain the operands. However, the architecture did not use parallelism and was a low-throughput structure. Long [18] reduced the complexity of the CFA circuits developed by [19], which in turn was an improvement of [3]. The advantage of this design was the lower complexity of implementation compared to [3]. However, the design was limited to radix 2 and radix 4; additionally, the complexity increased when the number of memories or PEs was increased. The work in [20] addressed a flexible number of FFT points. It used a single PE with 16-way parallelism. The CFA scheme was challenging, as the PE had to cater to a large variety of points. The number of memories used for CF was three. The presented scheme was directly based on [12] and used a sophisticated switching mechanism for data and addresses. In [6], a storage pattern that initially looked similar to progressive shifting was suggested. However, it was inefficient because it used two multibank memories to realize CFA, whereas two memories could be used to realize both CFA and CF simultaneously [21]. In [22], Wang followed a new trend of using single-port memories. The CFA challenge increased tremendously in this design and was solved using modulo-addition techniques. Although the design saved a significant area in memory by replacing the dual-port memory with a single-port memory, the actual advantage was offset by the use of more complicated CFA circuits. To overcome the above-mentioned limitations, in this paper, we propose a progressive shifting permutation, which is applicable to cases where one or two stages are fed by one memory set.
The proposed technique avoids the complicated XOR based switching mechanism presented earlier. Moreover, as the shifting mechanism in progressive shifting is regular, shifting of addresses is eliminated, thus making the switch simpler. Hence, the proposed scheme is more efficient. It may be extended by combining with the recently proposed single port memories of [22] to further improve chip efficiency.

III. PROPOSED BANK GENERATION SCHEME FOR CFA USING PROGRESSIVE SHIFTS
Hsiao [12] outlined a procedure for bank generation based on modulo addition of the characteristic variable of each stage. Following a similar approach, a method is proposed to determine bank generation with additional help from Lemmas 1 and 2. For the purpose of ease of reading, a table of important symbols is provided in Table 2.
For an N-point FFT, let the factors of N be as
$N = r_1 r_2 \cdots r_S$
where each factor defines a stage and S is the total number of stages. We define the characteristic variable $n_i$ of each stage $i = 1, 2, \ldots, S$:
$n_i = 0, 1, \ldots, r_i - 1$
The input equation of an FFT algorithm is expressed as
$n = X n_1 + Y n_2 + Z n_3 + \cdots$
where the constants X, Y, Z, etc. are defined as
$X = r_2 r_3 \cdots r_S, \quad Y = r_3 r_4 \cdots r_S, \quad Z = r_4 r_5 \cdots r_S, \ \ldots$
The last constant would simply be 1. This is the method of the common factor algorithm, while for the prime factor algorithm, it is difficult to apply the progressive shifting method, as the flow of data is complicated owing to modulo operations [23], [12].
Lemma 1: The input indices to a butterfly have a one-digit difference among them. It is this digit that is used to decide the banks. Here, the digits are $n_1, n_2, \ldots, n_S$.
Proof 1: A stage of the FFT is defined by the $n_i$'s. Hence, for a particular stage, all other $n_i$'s remain constant except for that of the stage in question. Thus, the fundamental definition of a stage ensures that for each stage, the differing digit is the one that controls the stage. Furthermore, it seems logical to use this digit to differentiate banks.
S Agarwal et al.: Low complexity conflict free scheme for FFT
Lemma 2: If a memory is feeding several stages, all the corresponding stage variables must be taken into account while deciding the bank. If any variable is left out, the corresponding stage cannot be fed by the memory in a conflict-free manner.
Proof 2: Let only a single stage variable be used to find the bank. Then, bank = $n_1$ ensures that for the first stage, the data are in conflict-free banks. For any other stage, there will be at least one situation where all other $n_i$'s are the same and only the stage variable is different. However, since this variable is not taken into account, the bank for those inputs is guaranteed to be the same, leading to conflict. Hence, all variables have to be taken into account while determining the banks for all the stages, as in (4).
Theorem 1 (Progressive Shifting): If only one of the $n_i$'s is changed in the bank generation function, it will lead to the CFA scheme of progressive shifting. This happens when the function is addition conducted modulo B, where B is the number of banks, as shown in (5). It should be noted that usually $B \geq r$, where r is the radix of the FFT algorithm. In the case of $B > r$, instead of continuous progressive shifts, shifts may occur every other cycle or even after more cycles.
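The two lemmas and the modulo-B addition of Theorem 1 can be checked numerically. The following Python sketch is illustrative only and is not the authors' implementation. It assumes the common factor index mapping with $n_1$ as the most significant digit, the 512 = 16 × 16 × 2 factorization used later in the case study, and B = 16 banks. The helper names (digits_to_index, conflict_free, bank_all_digits, bank_first_digit_only) are hypothetical.

```python
from itertools import product

RADICES = (16, 16, 2)  # assumed factorization: 512 = 16 * 16 * 2
B = 16                 # assumed number of banks (B >= largest radix)


def digits_to_index(digits, radices):
    """Combine stage digits (n_1, ..., n_S) into a data index.

    Assumes n_1 carries the largest weight and the last weight is 1.
    """
    index = 0
    for digit, radix in zip(digits, radices):
        index = index * radix + digit
    return index


def conflict_free(radices, bank_of, stages):
    """Return True if, for each listed stage, every butterfly reads distinct banks."""
    for s in stages:
        other_ranges = [range(r) for i, r in enumerate(radices) if i != s]
        for fixed in product(*other_ranges):
            banks = set()
            for n_s in range(radices[s]):
                # Only the digit of stage s varies inside one butterfly (Lemma 1).
                digits = fixed[:s] + (n_s,) + fixed[s:]
                banks.add(bank_of(digits))
            if len(banks) != radices[s]:
                return False
    return True


def bank_all_digits(digits):
    """Bank from the sum of all stage digits, taken modulo B."""
    return sum(digits) % B


def bank_first_digit_only(digits):
    """Bank from n_1 alone, ignoring the other stage variables."""
    return digits[0] % B


if __name__ == "__main__":
    print(conflict_free(RADICES, bank_all_digits, stages=[0, 1, 2]))
    print(conflict_free(RADICES, bank_first_digit_only, stages=[0]))
    print(conflict_free(RADICES, bank_first_digit_only, stages=[1]))
```

Under these assumptions, the sum of all digits modulo B serves all three stages without conflict, whereas a bank derived from $n_1$ alone serves only the first stage and fails for the second, which is the situation described in Lemma 2.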
When only one or two stage variables are changed, the order in which data have to be stored follows a simple pattern, leading to simplifications in the switching commutators. In (4), the digits $n_1$, $n_2$, etc. are added together to find the banks. For a stage, only one digit is supposed to be different; this ensures that the required inputs go to different banks. In higher-radix algorithms, the banks are large in number, while the number of stages comes down drastically. Hence, the interaction among the stages does not severely affect the flow of data. Additionally, in cases where each stage has its own memory [7] for demanding throughputs, or in input or output memories [13], there is virtually no interaction among the stages. Under such circumstances, the switching complexity may be reduced considerably by using progressive shifts, as derived below. The derivation of the technique is based on (5). If a memory has to feed only one stage, only one stage variable will increase by 1. Thus, the bank for the data changes by only 1 unit. This means that the data are stored in the banks in a progressively shifting manner. The data required simultaneously by a PE can be shown in a row of a table, as in Table 3. Thus, progressive shift represents increasing the shift amount by one for every consecutive or alternate row. The modulo operation ensures that the data are shifted circularly in a progressive fashion so that the required data are directly stored in different columns, as shown in Table 3. The corresponding hardware is proposed in Section III-A.
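The storage pattern produced by progressive shifting can be illustrated with a toy example. The sketch below is an assumption-laden illustration and does not reproduce Table 3: it assumes r × r data arriving r per cycle in natural order, r banks, and a shift that grows by one on every row. The function names are hypothetical.

```python
def progressive_shift_layout(r):
    """Return layout[row][bank] = data index for r*r data written r per cycle.

    Row k arrives as indices k*r, ..., k*r + r - 1 and is stored with a
    circular right shift of k positions (modulo r).
    """
    layout = []
    for row in range(r):
        incoming = [row * r + col for col in range(r)]
        shift = row % r
        # Bank b receives the input that sits `shift` positions to its left.
        stored = [incoming[(bank - shift) % r] for bank in range(r)]
        layout.append(stored)
    return layout


def banks_of_column(layout, col):
    """Banks holding the data that arrived at input position col of every row."""
    r = len(layout)
    return [layout[row].index(row * r + col) for row in range(r)]


if __name__ == "__main__":
    table = progressive_shift_layout(4)
    for stored_row in table:
        print(stored_row)
    print(banks_of_column(table, 0))
```

For r = 4, the data 0, 4, 8 and 12, which arrive at the same input position in successive cycles, end up in banks 0, 1, 2 and 3, so they can be read simultaneously.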

A. HARDWARE FOR PROGRESSIVE SHIFTING
In this scheme, the banks to which the data are to be shifted are known a priori. Hence, they can be stored in a series of registers, as shown in Fig. 3. This is simpler than the other technique, in which the bank values have to be made available at any of the multiplexers. The two main components of the proposed architecture are the progressive circular right shift unit (PCRSU) and the progressive circular left shift unit (PCLSU), as shown in Fig. 2. The PCRSU is placed before the memory bank. Its purpose is to increase the right shift of the input data by 1 whenever a shift is required. A shift may be required after every clock cycle, after alternate cycles, or even after a few more cycles. The maximum shift amount is r, assuming a radix-r butterfly. To achieve circular progressive switching, multiplexer banks and a shift register array (SRA) of data width $D = \log_2 r$ and length r are used. The multiplexer bank consists of r multiplexers, each with r data inputs of width d (d is the width of the data) and a select input of size D. The r data values from the previous stage are connected to each of the r inputs of all the data multiplexers, as shown in Fig. 3. The select logic is designed using the SRA. It is preloaded with the values 0, 1, 2, . . . , r − 1. The outputs of the r registers of the SRA are connected to the r select inputs of the data multiplexers. On every clock cycle, or as required, the data in the SRA are shifted by one register. At the start, the select inputs of the data multiplexers are 0, 1, 2, . . . , r − 1. This leads to data output without any shift. Another clock edge then shifts the addresses in the register array to the right in a circular fashion. Hence, the select inputs now become r − 1, 0, 1, 2, . . . , r − 2. Now, the data output of the PCRSU appears shifted right by one unit. The multiple-clock-cycle logic for activating the shifting in the SRA can be realized using a counter of an appropriate size. This is demonstrated by considering a 3-bit address as an example.
The shift in the output of the multiplexers with respect to the change in the addresses of the SRA is shown in the accompanying table. The PCLSU is placed after the memory bank to align the right-shifted data with the inputs of the PE. The design of this unit is similar to that of the PCRSU; only the direction of the shift is changed to left circular.
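The behavior of the two shift units can be sketched with a small behavioral model. This is a hedged Python illustration, not the Verilog design used in the paper: the class name ProgressiveShifter and its methods are hypothetical, the memory between the two units is omitted, and a shift is assumed on every call to step().

```python
class ProgressiveShifter:
    """Behavioral model of an r-way multiplexer bank driven by an SRA."""

    def __init__(self, r, direction):
        if direction not in ("right", "left"):
            raise ValueError("direction must be 'right' or 'left'")
        self.direction = direction    # "right" models PCRSU, "left" models PCLSU
        self.sra = list(range(r))     # SRA preloaded with 0, 1, ..., r - 1

    def route(self, data):
        """Multiplexer k outputs the input selected by SRA register k."""
        return [data[select] for select in self.sra]

    def step(self):
        """Circularly shift the SRA contents by one register."""
        if self.direction == "right":
            self.sra = [self.sra[-1]] + self.sra[:-1]
        else:
            self.sra = self.sra[1:] + [self.sra[0]]


if __name__ == "__main__":
    data = list("abcdefgh")           # r = 8, so each select value is 3 bits wide
    pcrsu = ProgressiveShifter(8, "right")
    pclsu = ProgressiveShifter(8, "left")
    for _ in range(3):
        shifted = pcrsu.route(data)
        print(pcrsu.sra, shifted, pclsu.route(shifted))
        pcrsu.step()
        pclsu.step()
```

In this model, after one step the right-shifting unit holds the select values r − 1, 0, 1, . . . , r − 2, as described in the text, and a left-shifting unit stepped the same number of times restores the original alignment.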

IV. BANK GENERATION USING XOR LOGIC
[Table: data indices 0 8 16 24 4 12 20 28 1 9 17 25 5 13 21 29 2 10 18 26 6 14 22 30 3 11 19 27 7 15 23 3]
In [3], it is suggested that the banks for a radix-2 FFT can be differentiated on the basis of the parity of the data indices, and a derivation is given based on simultaneous solutions of the conditions for in-place memory access. Here, this is derived with the help of the simple equations formulated for progressive shifting so as to show the relation between the two techniques. For radix-2 decomposition, the digits $n_i$ of (5) can be replaced by bits $b_i$.
Single-bit addition is equivalent to an XOR operation. The result of XOR defines the parity of the data index. It is thus shown that the two methods are related by a common mathematical framework. The differentiating factor is the constraints imposed on the number of b i 's that are allowed to change simultaneously.
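The equivalence between modulo-2 addition of the index bits and XOR-based parity can be verified directly. The short Python sketch below is an illustration under assumed parameters (a 32-point radix-2 FFT with 5 index bits and two banks); the function names are hypothetical.

```python
NUM_BITS = 5  # assumed example: 32-point radix-2 FFT, index bits b_1 ... b_5


def bank_mod2(index, num_bits=NUM_BITS):
    """Bank as the sum of the index bits, taken modulo 2."""
    bits = [(index >> i) & 1 for i in range(num_bits)]
    return sum(bits) % 2


def bank_xor(index, num_bits=NUM_BITS):
    """Bank as the XOR of the index bits, i.e. the parity of the index."""
    bank = 0
    for i in range(num_bits):
        bank ^= (index >> i) & 1
    return bank


def radix2_pairs_conflict_free(num_bits=NUM_BITS):
    """Check that the two inputs of every radix-2 butterfly lie in different banks."""
    for index in range(1 << num_bits):
        for stage_bit in range(num_bits):
            partner = index ^ (1 << stage_bit)  # differs in exactly one bit
            if bank_xor(index, num_bits) == bank_xor(partner, num_bits):
                return False
    return True


if __name__ == "__main__":
    print(all(bank_mod2(i) == bank_xor(i) for i in range(1 << NUM_BITS)))
    print(radix2_pairs_conflict_free())
```

Because the two inputs of a radix-2 butterfly differ in exactly one bit, their parities always differ, so the same bank function serves every stage.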

V. A CASE STUDY: CFA FOR 512 POINT RADIX-16 FFT
A 512-point FFT chip was designed by Huang [5] using the radix-16 algorithm. The architecture used two radix-16 stages and a radix-2 stage. In our design, to double the throughput, an additional PE is introduced in the pipeline for stage 2 processing. The new architecture is shown in Fig. 4. The design is thus modified so that the input memory feeds only $PE_1$, and there is a second memory to feed $PE_2$. An output memory is finally used to obtain the outputs in natural order. This circuit is well suited for the progressive shifting technique of CFA. The hardware for this design consists of several subunits, as described in Section III-A. These are detailed in Table 7. For XOR-based addressing in higher and mixed-radix FFTs, the results of [4] can be used. The equation derived by Huang is reproduced in (7) for reference. The bank generation logic defined in (7) shows that it is not possible to identify the bank simply by inspection. In Huang, the circuit was implemented using XOR logic, but details were not presented. We redesign and implement the circuit using Verilog and analyze it for the purpose of comparison with the PS technique. The detailed design of the XOR-based forward commutator is shown in Fig. 5. The first step is to generate the index of the data using q-bit counters with offset adders, where $q = \log_2 N$. This data index is then fed to a bank and address generator unit (BAGU). A BAGU implements (7) using XOR gates. In Fig. 5, $Bank_k$ is calculated by $BAGU_0$. This means that $Data_0$ is to be stored at $Bank_k$. Hence, the multiplexer at the kth port must have $Bank_0$ as its selection input. For this transfer, a unique switching unit called the decoder-encoder switch is required, which transfers data from the input port to the output port in such a way that if the input is $Bank_k$ at $port_0$, then the output at $port_k$ is $Bank_0$. Additionally, the address generated by $BAGU_0$ is for $Bank_k$, and it has to be transported to that bank.
Thus, this design requires switching of both data and address, hence requiring extra multiplexers for address switching.
To read data from the banks, a reverse commutator, as shown in Fig. 6, is used. The read order is determined by the FFT algorithm. The BAGUs determine the banks and addresses at which the required data are available. The decoder-encoder switch transfers this information to the select inputs of an address-select multiplexer, and the address is directed to the proper bank. The read data have to be realigned with the inputs of the PE. To achieve this, a second multiplexer bank is used, where the select input is the bank generated by the associated BAGU. Thus, CFA is achieved with the help of both the forward and reverse commutators, the former placed before the memory to write data and the latter placed after the memory to read data. The decoder-encoder switch is made with an array of decoders and encoders, as shown in Fig. 7.
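The role of the decoder-encoder switch in the XOR-based commutator can be sketched as an inverse mapping from ports to banks. The following Python model is a hedged illustration and not the circuit of Fig. 7: the bank numbers are passed in directly rather than computed from (7), and the function names are hypothetical.

```python
def decoder_encoder_switch(banks):
    """Map per-port bank numbers to per-bank select values.

    banks[i] is the bank computed by BAGU i for input port i. The result
    select[k] is the input port whose data must be routed to bank k.
    """
    r = len(banks)
    # Decoder array: one one-hot row per input port.
    one_hot = [[1 if bank == k else 0 for k in range(r)] for bank in banks]
    select = []
    for k in range(r):
        # Encoder for bank k: find the single port whose decoder asserts line k.
        ports = [i for i in range(r) if one_hot[i][k]]
        if len(ports) != 1:
            raise ValueError("bank assignment is not conflict-free")
        select.append(ports[0])
    return select


def forward_commutator(data, addresses, banks):
    """Route both data and row addresses from input ports to their banks."""
    select = decoder_encoder_switch(banks)
    routed_data = [data[s] for s in select]
    routed_addresses = [addresses[s] for s in select]
    return routed_data, routed_addresses


if __name__ == "__main__":
    print(decoder_encoder_switch([2, 0, 3, 1]))
    print(forward_commutator(["a", "b", "c", "d"], [10, 11, 12, 13], [2, 0, 3, 1]))
```

The model makes the extra cost visible: the address list has to pass through the same switching as the data, whereas the progressive shifting hardware of Section III-A switches data only.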

VI. SYNTHESIS RESULTS AND COMPARISON
The design implementations of both the XOR-based technique and the proposed PS-based technique are carried out using Verilog. The correctness of the designs is verified through simulations carried out using Synopsys VCS. For synthesis, an SCL Chandigarh foundry node of 180 nm [24] is used, and synthesis is carried out using Synopsys DC.
Additionally, various parts of the addressing logic are synthesized separately to evaluate their relative complexity, and the corresponding results are presented in Table 7. From the table, it is seen that the proposed scheme uses 65% fewer counters, 26% fewer multiplexers and 52% less selection logic, and achieves an overall 33% area reduction compared to the XOR-based scheme. The area, power and execution time are obtained from the synthesis results and are shown in Table 8. In terms of power, the proposed scheme is 23% better, and the timing performance is 39% faster. Complexity analysis is performed to study the dependence of the area of logic elements on radix R, number of FFT points N and wordlength D, and the results are tabulated in columns 3 and 7 of Table 7. To determine the scalability of the proposed technique, simulations are carried out for different numbers of FFT points with radix sizes of 8, 16 and 32. The results of the simulations are tabulated in Table 9 and plotted in the graph shown in Fig. 9. From the table, it is clear that as the number of points increases, the hardware savings of the proposed PS design over the conventional XOR-based design improve. For both schemes, as the radix increases, complexity grows in an exponential manner, but the PS scheme remains roughly 30-40% more area-efficient for all radices. To estimate the overall savings in silicon area, it must be observed that one CFA circuit is used per memory. Hence, for an FFT with continuous flow using two memories, there are two CFA circuits. To compare the performance, the 512-point CF FFT chip was designed using both the PS and XOR-based techniques. The results showed that an area of 3 mm^2 is needed for the PS design and 3.35 mm^2 for the XOR-based design. It is also observed that the CFA logic portion of the proposed architecture occupies an area of 0.65 mm^2, while that of the XOR-based design occupies 1 mm^2.
Hence, it can be concluded that the area savings of the CFA logic and of the complete chip of the proposed architecture are 35% and 10%, respectively, compared with the XOR-based architecture. The FFT chip with the PS CFA circuit was fabricated in the 180 nm SCL Chandigarh foundry, and its micrograph is shown in Fig. 8.
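The quoted savings follow directly from the reported areas. The small Python check below is only an arithmetic illustration using the figures stated in the text; the function name saving_percent is hypothetical.

```python
def saving_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Areas in mm^2 as reported for the 512-point CF FFT chip.
CFA_XOR, CFA_PS = 1.0, 0.65
CHIP_XOR, CHIP_PS = 3.35, 3.0

if __name__ == "__main__":
    print(round(saving_percent(CFA_XOR, CFA_PS)))    # CFA logic only
    print(round(saving_percent(CHIP_XOR, CHIP_PS)))  # complete chip
```

The CFA logic saving evaluates to 35%, and the complete-chip saving to about 10.4%, which the text rounds to 10%.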
From the above observations, it is inferred that the proposed scheme outperforms the XOR-based bank generation scheme due to the following features:
• Since the address for a bank is generated by its linked address module, multiplexers for address shifting are not required.
• The decoder-encoder switch is replaced with an SRA, which is less complex.
[Table row: Total Area 486800 324900 33]

VII. CONCLUSION
CFA circuits have to be repeated as many times as the number of memories in the FFT architecture. Hence, they contribute to the overall area and complexity of the design. The progressive shifting technique shown in this work is an efficient technique; in a typical case, it is seen to save approximately 33% of the area in commutation circuits. It is best suited for FFTs with a high radix and a lower number of stages or for architectures with multiple memories. Additionally, designs with a memory dedicated to a particular stage of the PE can take advantage of this technique. On the other hand, this scheme is not easy to use for designs in which one memory feeds more than two stages, as in low-radix architectures. More investigation is necessary to adapt it for such cases. Another advantage of this design is that it does not change much from one architecture to another, which facilitates its reuse once designed. By comparison, the other addressing techniques are mathematically rigorous but require individual addressing logic to be designed for each FFT architecture, starting from scratch each time the architecture changes. In conclusion, for high-throughput FFTs, the progressive shifter is expected to result in substantial benefits in terms of area and design effort.