Configurable Mixed-Radix Number Theoretic Transform Architecture for Lattice-Based Cryptography

Lattice-based cryptography continues to dominate the second-round finalists of the National Institute of Standards and Technology post-quantum cryptography standardization process. Computational efficiency is a primary criterion in evaluating candidates for final-round selection. In lattice-based cryptosystems, polynomial multiplication is the most expensive computation and is therefore critical to performance. This paper proposes an efficient number theoretic transform (NTT) architecture to accelerate polynomial multiplication. The proposed design applies a mixed-radix multi-path delay feedback architecture and flexibly adapts to various polynomial sizes. A configurable NTT design is realized to perform forward and inverse NTT computations on unified hardware, which is then used to develop an effective polynomial multiplier. The proposed architectures were implemented on several Xilinx FPGA platforms for direct comparison with state-of-the-art works. The implementation results show that the proposed NTT architectures have a comparable area-time product and demonstrate 1.7× to 17× performance improvement, and that the proposed polynomial multipliers achieve higher performance than previous works. Experimental results confirm the proposed design's applicability to high-speed, large-scale data cryptoprocessors.


I. INTRODUCTION
In the explosive era of the internet-of-things (IoT) and the rapid development of next-generation networks, post-quantum cryptography (PQC) has attracted increasing interest for securing data privacy against potential quantum attacks. The National Institute of Standards and Technology (NIST) second-round report [1] showed that lattice-based cryptography (LBC) schemes have become promising candidates for future standardization. These schemes include public-key encryption (PKE) protocols, key-establishment mechanisms (KEM), and digital signatures with strong theoretical security guarantees, such as CRYSTALS-Kyber (Kyber for short) [2], Saber [3], CRYSTALS-Dilithium [4], and NewHope [5]. Polynomial multiplication is the critical bottleneck in LBC systems. Naïve polynomial multiplication in n-point cyclotomic rings is expensive, with a time complexity of O(n^2) operations. Fortunately, the number theoretic transform (NTT) is a strong tool with efficient memory utilization that performs polynomial multiplication in O(n log n) operations. Thus, designing an efficient NTT architecture has top priority in accelerating cryptoprocessors, particularly for multiconnected cryptosystems and large-scale data encryption applications, e.g., IoT [6], biometrics [7], and video-based facial images [8].
There are many approaches to designing NTT architectures, depending on the trade-off between throughput rate and hardware complexity. Conventional methods construct a shared butterfly circuit and iteratively execute the NTT stages from a memory unit. In each stage, groups of coefficients are fetched from the memory unit, transformed, and written back to the memory in the appropriate order. However, the memory-based NTT architecture is constrained by the cost of memory access operations. The memory access scheme changes from stage to stage, and the memory read/write operations are complex due to the data dependency between adjacent NTT stages.
Several previous studies have investigated strategies to reduce the access pattern complexity of memory-based approaches. Xing and Li proposed an NTT architecture with four multipurpose butterfly units (BUs) and a memory ping-pong strategy to settle the non-in-place writing issue between adjacent stages [9]. The ping-pong technique simplified the memory access operations but doubled the memory size. Zhang et al. presented a low-complexity compact NTT/INTT architecture for a highly efficient NewHope-NIST implementation on an FPGA device [10]. The outstanding strength of their NTT architecture came from the address conflict-free multi-bank memory access scheme, which was efficient with a small number of parallel radix-2 BUs. Yaman et al. subsequently proposed three different hardware architectures (lightweight, balanced, and high-performance) for polynomial multiplication in Kyber [11]. Their unified BU architecture could support butterfly operations and point-wise multiplication (PWM). The NTT, INTT, and polynomial multiplication could be done using a unified hardware design with different numbers of execution clock cycles (CCs). Recently, Bisheh-Niasar et al. presented a state-of-the-art FPGA-based implementation of a high-speed NTT architecture for Kyber [12]. Four BUs were grouped in a 2×2 BU array, together with several optimization strategies, to speed up the NTT computation. Bisheh-Niasar et al. then proposed a reconfigurable, resource-efficient NTT architecture supporting Kyber [13], in which multiple BUs were parallelized based on the memory ping-pong strategy to support various Kyber configurations. However, when the number of parallel BUs is increased for high speed, the number of memory banks doubles and the routing becomes more complex. From the pipelining perspective, the aforementioned memory-based NTT architectures might incur bubble cycles between computation rounds due to the data dependency of adjacent stages. Furthermore, the final NTT results are only produced at the last computational stage after a delay of log n stages, which causes a large gap between two consecutive polynomials. These shortcomings are obstacles to improving the performance of the NTT design.
To speed up the NTT computation, some previous studies deployed multiple BUs in parallel. Feng et al. implemented d lanes of BUs (with d up to 16) to target a high-speed polynomial multiplier [14]. Mert et al. also proposed a flexible NTT design for various parameter sets and increased the number of processing elements up to 32 at the expense of more hardware resources [15]. Deploying multiple BUs in parallel aims to reduce the number of execution CCs. Nevertheless, this strategy often requires high memory bandwidth and complicates the on-chip memory access pattern, which makes it challenging to raise the clock frequency of a high-speed NTT design.
A different approach is to develop highly pipelined NTT architectures based on the systolic array technique. Rentería-Mejía and Velasco-Medina proposed a high-radix multi-path delay commutator NTT architecture to accelerate ring learning with errors based cryptoprocessors [16]. Tan and Lee subsequently proposed an efficient multi-path delay feedback (MDF) NTT architecture specifically for short-term security parameters [17]. Recently, Duong and Lee implemented MDF NTT architectures with k parallel data paths for 1024-point polynomials [18], which confirmed that a reasonable choice of k can achieve high efficiency. However, these works have not fully investigated the flexibility and reconfiguration of the NTT/INTT architecture for different lattice-based cryptography schemes.
This paper focuses on implementing an efficient NTT architecture specifically for high-speed computing environments. The proposed approach deploys all computational stages in a fully pipelined and parallel manner. The NTT design transforms polynomials sequentially, with polynomial coefficients generated every CC on multiple parallel data paths. In addition, advanced cryptography protocols can support various security levels and perform NTT or INTT on demand and out of turn between parties. Thus, expanding previous designs from specific to more generic and configurable settings would be significant for advanced cryptosystems. In NIST's current view, Kyber is the most promising LBC candidate for PKE/KEM standardization at the end of the third round [1]. However, other LBC schemes and their variants remain worthwhile in different research areas owing to their novel ideas, different security standards, and potential for further improvement. In this work, we illustrate the reconfigurable capability of the proposed NTT architecture by employing parameter sets (n, q) of polynomial degree n and prime modulus q, such as (1024, 12289) and (512, 12289) in NewHope [5] and (256, 3329) in Kyber [2], from the second- and third-round NIST PQC submissions, respectively. The proposed NTT engine operates as a major accelerator and flexibly switches between parameter sets. These parameter sets satisfy the various security strength categories required in the NIST PQC Call for Proposals [19].
We summarize the main contributions of the present paper as follows:
1) Building on the prior NTT architecture [18], we select a rational radix value (i.e., k = 4) and propose a flexible mixed-radix MDF NTT architecture supporting various parameter sets (n, q). The proposed NTT architecture has four parallel data paths and is fully pipelined for high-throughput implementation. Output polynomials are produced sequentially after an execution time of n/4 CCs. Additional multiplexers select the configuration corresponding to the given parameter set.
2) A configurable NTT/INTT architecture is realized to perform NTT and INTT computations on unified hardware. The configurable design completes the NTT and INTT computations in the same number of execution CCs and provides a versatile tool for implementing cryptographic algorithms and reducing the hardware costs of high-performance cryptosystems.
3) We develop a polynomial multiplier using the configurable NTT/INTT architecture. Proper scheduling of the configurable modules performs the polynomial multiplication effectively. Experiments verified that the proposed architecture achieves higher performance and better efficiency than previous works.
The remainder of this paper is organized as follows. Section II gives the background of NTT. Section III proposes a flexible mixed-radix MDF NTT architecture and a configurable NTT/INTT based polynomial multiplier. Section IV presents implementation results and discussion. Finally, Section V summarizes and concludes the paper.

Algorithm 1 Iterative radix-2 NTT algorithm [20]
Input: a(x) ∈ Z_q[x]/(x^n + 1); Ψ lists n powers of ψ in bit-reversed order
Output: A = NTT_n(a)
1: A ← a
2: for (m = 1; m < n; m = 2 * m) do
3:   for (i = 0; i < m; i = i + 1) do
4:     for (j = i * n/m; j < (2i + 1)n/(2m); j = j + 1) do
5:       W ← Ψ[m + i]
6:       temp ← ModMult(W, A[j + n/(2m)], q)
7:       A[j + n/(2m)] ← (A[j] − temp) mod q
8:       A[j] ← (A[j] + temp) mod q
9:     end for
10:   end for
11: end for
12: return A

II. NUMBER THEORETIC TRANSFORM
NTT is a form of the fast Fourier transform (FFT) over a finite field of integers and provides an effective tool to perform polynomial multiplication in LBC schemes. Algorithm 1 describes a fast iterative radix-2 NTT computation in the ring R_q = Z_q[x]/(x^n + 1), where n is a power of two and q is a prime number [20]. This approach applies the negative wrapped convolution method to avoid zero-padding and to eliminate the reduction modulo (x^n + 1) in polynomial multiplication. ψ ∈ Z_q is defined as a square root of ω (i.e., the primitive n-th root of unity), and Ψ[i] = ψ^j mod q, where j is the bit reversal of i. We apply the Cooley-Tukey (CT) and Gentleman-Sande (GS) algorithms to perform NTT and INTT, respectively, which helps to avoid expensive reordering steps [21]. The function ModMult() performs the modular multiplication modulo q. For a polynomial a = (a[0], ..., a[n−1]) in R_q, the NTT and INTT forms are respectively defined as follows,

\[ A[i] = \sum_{j=0}^{n-1} a[j]\, \psi^{j} \omega^{ij} \bmod q, \qquad a[i] = n^{-1} \psi^{-i} \sum_{j=0}^{n-1} A[j]\, \omega^{-ij} \bmod q, \qquad i = 0, \ldots, n-1. \]

When merging the powers of ψ^i and ψ^{−i} into ω^{ij} and ω^{−ij}, respectively, the polynomial multiplication of a and b can be computed as follows,

\[ c = \mathrm{INTT}\big(\mathrm{NTT}(a) \circ \mathrm{NTT}(b)\big), \]

where ∘ denotes point-wise multiplication. However, implementing this algorithm on hardware platforms so that it achieves the desired throughput is challenging. Therefore, we propose a mixed-radix NTT algorithm that adapts this method to high-speed hardware accelerators.
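As an illustration of the background above, the following Python sketch multiplies two polynomials in R_q with a CT forward transform in the style of Algorithm 1 and a GS inverse transform. It is a software model only, not the proposed hardware; the helper names (bit_reverse, find_psi, tf_table, ntt, intt, poly_mul) are hypothetical, and the brute-force search for ψ is an assumption made to keep the example self-contained.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def find_psi(n, q):
    """Search for a primitive 2n-th root of unity psi modulo the prime q."""
    assert (q - 1) % (2 * n) == 0, "q must satisfy q = 1 (mod 2n)"
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:  # psi^n = -1, so the order is exactly 2n
            return psi
    raise ValueError("no primitive 2n-th root of unity found")


def tf_table(n, q, psi):
    """Table whose entry i holds psi^(bit reversal of i) mod q."""
    bits = n.bit_length() - 1
    return [pow(psi, bit_reverse(i, bits), q) for i in range(n)]


def ntt(a, q, Psi):
    """Forward CT transform: normal-order input, bit-reversed output."""
    A, n, m = list(a), len(a), 1
    while m < n:
        t = n // (2 * m)                      # butterfly distance n/(2m)
        for i in range(m):
            W = Psi[m + i]
            for j in range(2 * i * t, 2 * i * t + t):
                temp = W * A[j + t] % q
                A[j + t] = (A[j] - temp) % q
                A[j] = (A[j] + temp) % q
        m *= 2
    return A


def intt(A, q, Psi_inv):
    """Inverse GS transform: bit-reversed input, normal-order output."""
    a, n = list(A), len(A)
    t, m = 1, n
    while m > 1:
        h = m // 2
        for i in range(h):
            W = Psi_inv[h + i]
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t]
                a[j] = (u + v) % q
                a[j + t] = (u - v) * W % q
        t *= 2
        m = h
    n_inv = pow(n, q - 2, q)                  # scaling factor n^-1 (q is prime)
    return [x * n_inv % q for x in a]


def poly_mul(a, b, q):
    """c = INTT(NTT(a) o NTT(b)) in Z_q[x]/(x^n + 1)."""
    n = len(a)
    psi = find_psi(n, q)
    Psi = tf_table(n, q, psi)
    Psi_inv = tf_table(n, q, pow(psi, q - 2, q))
    A, B = ntt(a, q, Psi), ntt(b, q, Psi)
    return intt([x * y % q for x, y in zip(A, B)], q, Psi_inv)
```

Because the forward transform leaves its output in bit-reversed order and the inverse transform consumes that order directly, no explicit reordering step appears between them.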

III. PROPOSED MIXED-RADIX NTT ARCHITECTURE
A. PROPOSED MIXED-RADIX 2^k1/2^k2 NTT ALGORITHM
Algorithm 2 performs the NTT computation of an n-point polynomial using the mixed-radix method, assuming that n is decomposed into two components, radix-2^k1 and radix-2^k2 (n = 2^k1 · 2^k2). Hence, the 2^k1-point NTT is performed 2^k2 times to accomplish the first transformation. A subsequent reordering function reorders the coefficients for the next stage. The second transformation performs the 2^k2-point NTT 2^k1 times, and the last reordering function generates the final output in bit-reversed order. With the pre-computed twiddle factor (TF) constants listed in bit-reversed order, the 2^k2 radix-2^k1 NTT computations use the same first 2^k1 TF constants, and the 2^k1 radix-2^k2 NTTs use the remaining 2^k1 groups of 2^k2 − 1 TF constants, respectively. The proposed algorithm can efficiently parallelize the 2^k2 radix-2^k1 NTT computations. The radix-2^k2 NTT operation transforms the received coefficients with parallel BUs in each stage and generates the result in bit-reversed order. Fig. 1 illustrates the data flow of the mixed-radix NTT approach, which clearly shows the butterfly operations and the data dependency between adjacent stages. Algorithm 1 is used to perform the partial NTT computations for radix-2^k1 and radix-2^k2. Each partial NTT requests its TFs ψ from Ψ at every step; the TFs in each stage are divided into m groups, and each group with corresponding order i contains multiple BUs. The number of groups doubles when the stage index increases by one, and the number of BUs in each group halves. Therefore, each radix-2^k1 NTT uses the same group of TF constants for all 2^k2 computational instances, whereas each radix-2^k2 NTT uses a different group of TF constants for its k2 computational stages, indexed by iter.
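The TF assignment described above can be checked with a short software model. The Python sketch below is an assumption-laden illustration rather than Algorithm 2 itself: the explicit reordering functions are replaced by strided and contiguous list slices, and the names ct_stages, ntt_mixed_radix, and phase2_tf_indices are hypothetical. It shows that 2^k2 interleaved radix-2^k1 transforms sharing Ψ[1 .. 2^k1 − 1], followed by 2^k1 block-wise radix-2^k2 transforms whose TF index starts at iter = 2^k1 + r and doubles every stage, reproduce the output of Algorithm 1.

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def find_psi(n, q):
    """Search for a primitive 2n-th root of unity modulo the prime q."""
    for g in range(2, q):
        psi = pow(g, (q - 1) // (2 * n), q)
        if pow(psi, n, q) == q - 1:
            return psi
    raise ValueError("no primitive 2n-th root of unity found")


def tf_table(n, q, psi):
    """Table whose entry i holds psi^(bit reversal of i) mod q."""
    bits = n.bit_length() - 1
    return [pow(psi, bit_reverse(i, bits), q) for i in range(n)]


def ct_stages(x, q, tf):
    """Run all radix-2 CT stages on list x; tf(m, i) supplies the TF of group i."""
    n, m = len(x), 1
    while m < n:
        t = n // (2 * m)
        for i in range(m):
            W = tf(m, i)
            for j in range(2 * i * t, 2 * i * t + t):
                temp = W * x[j + t] % q
                x[j + t] = (x[j] - temp) % q
                x[j] = (x[j] + temp) % q
        m *= 2
    return x


def ntt_reference(a, q, Psi):
    """Algorithm 1 on the full n-point vector."""
    return ct_stages(list(a), q, lambda m, i: Psi[m + i])


def ntt_mixed_radix(a, q, Psi, k1, k2):
    """Mixed-radix model: radix-2^k1 phase, then radix-2^k2 phase."""
    n1, n2 = 1 << k1, 1 << k2
    assert len(a) == n1 * n2
    A = list(a)
    # Phase 1: n2 interleaved 2^k1-point NTTs, all sharing Psi[1 .. n1-1].
    for c in range(n2):
        A[c::n2] = ct_stages(A[c::n2], q, lambda m, i: Psi[m + i])
    # Phase 2: n1 contiguous 2^k2-point NTTs, each with its own TF group.
    for r in range(n1):
        it = n1 + r                      # starting TF index ("iter") of block r
        lo = r * n2
        A[lo:lo + n2] = ct_stages(A[lo:lo + n2], q,
                                  lambda m, i, it=it: Psi[it * m + i])
    return A


def phase2_tf_indices(k1, k2, r):
    """Indices into Psi used by the r-th radix-2^k2 NTT, stage by stage."""
    it = (1 << k1) + r
    return [it * m + i for m in (1 << s for s in range(k2)) for i in range(m)]
```

For n = 16 and k1 = k2 = 2, phase2_tf_indices(2, 2, 0) lists the indices 4, 8, and 9, which agrees with the TF order quoted for the first radix-2^k2 NTT in the n = 16 example of Fig. 2.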
For example, Fig. 2 illustrates the order of the TFs and inverse TFs assigned to the NTT and INTT computational stages when n is 16. Assuming k1 = k2 = 2, the four partial NTT_{2^k1} computations use the same group of TF constants in consecutive order through two computational stages (e.g., Ψ[1] for Stage 1, Ψ[2] and Ψ[3] for Stage 2). Meanwhile, the four partial NTT_{2^k2} computations require a different group of TF constants for each computation (e.g., Ψ[4], Ψ[8], and Ψ[9] for the two respective stages of the first NTT_{2^k2}). iter points to the order in Ψ corresponding to each stage of each NTT_{2^k2} computation. The same ordering scheme of inverse TF constants is used for the INTT computation, except that the scaling factor n^{−1} is pre-multiplied into the inverse TF constants of the Stage 1 INTT computation.
Fig. 3 illustrates the proposed flexible mixed-radix MDF NTT architecture, which is based on the aforementioned algorithms and intended for speed-optimized LBC schemes. Considering the evaluation results of the prior study [18], we select the optimal radix value (k2 = 2) and design a flexible NTT architecture for various parameter sets (n, q). The proposed NTT architecture includes two modules, built from BU1 and BU2, respectively. Module 1 adopts the MDF architecture to perform k1 computational stages with four parallel data paths. This module continuously receives input vectors in normal order and generates four coefficients per CC. Module 2, with two parallel BU2s in each stage, directly transforms the four coefficients received from Module 1. This design methodology naturally eliminates the reordering steps in Algorithm 2. After k2 computational stages, Module 2 generates the final results in bit-reversed order. Four additional multiplexers select the data direction corresponding to the given parameter set. Notably, the NTT operation in Kyber [2] is slightly different in that it uses n-th roots of unity instead of 2n-th roots. However, the NTT computation on a 256-point polynomial can be performed as two separate classic 128-point NTTs according to the parity of the index. The proposed NTT architecture concurrently performs two 128-point NTTs in 7 stages by bypassing the last stage (i.e., Stage 10).
... of specific primes was effectively executed in [22]. We adapt the MR implementation for q = 3329 and q = 12289, as shown in Fig. 5 (a) and (b), respectively. The compact MR architectures are fully pipelined with the same latency (i.e., 5 CCs) and use only bit-shifts and additions to avoid expensive integer multiplications.
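To give a feel for why primes of the form q = k · 2^m + 1 reduce cheaply, the Python sketch below uses the K-RED technique known from the literature as a stand-in; it is not the MR unit of Fig. 5 or of [22], and the function names are hypothetical. Both moduli used in this work have the required form (3329 = 13 · 2^8 + 1 and 12289 = 3 · 2^12 + 1), so the reduction needs only shifts, additions, and one subtraction.

```python
def times_k_shift_add(x, k):
    """Multiply x by a small constant k using only shifts and additions."""
    acc, shift = 0, 0
    while k:
        if k & 1:
            acc += x << shift
        k >>= 1
        shift += 1
    return acc


def k_red(c, k, m):
    """Return a value congruent to k*c modulo q = k*2^m + 1.

    Split c = c_hi * 2^m + c_lo. Since k * 2^m = -1 (mod q), it follows that
    k*c = k*c_lo - c_hi (mod q). The result is shorter than c but is neither
    fully reduced nor free of the extra factor k, which a real design would
    have to compensate for (for example in pre-scaled constants).
    """
    c_lo = c & ((1 << m) - 1)
    c_hi = c >> m
    return times_k_shift_add(c_lo, k) - c_hi
```

For example, k_red(c, 13, 8) handles q = 3329 and k_red(c, 3, 12) handles q = 12289; in both cases the multiplication by k expands into at most three shifted additions.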

B. OVERALL FLEXIBLE NTT ARCHITECTURE DESIGN
The idea behind the mixed-radix MDF NTT architecture is to implement a fully pipelined design without reordering buffers. The continuous transformation between modules completely removes redundant cycles and eliminates additional memory for intermediate results. Taking n = 1024 as an example, Fig. 6 describes the timing diagram of the proposed NTT architecture. For the very first input polynomial, the number of execution CCs through the ten NTT stages is the sum of the FIFO delay (255 CCs), the pipelined BU stages (1 CC for the multiplier, 5 CCs for MR, and 1 CC for modular addition/subtraction, i.e., 7 CCs per stage and 70 CCs over ten stages), and the pipeline registers between stages (9 CCs), i.e., 255 + 70 + 9 = 334 CCs. Because the design is fully pipelined, it accepts the next input polynomial after n/4 CCs. Hence, the NTT design generates four coefficients every CC and transforms input vectors sequentially, one every n/4 CCs. Moreover, the proposed NTT architecture is flexible enough to support various parameter sets with additional multiplexers. The reasonable choice of k2 provides an effective trade-off between throughput and resource utilization for high-performance LBC systems.
... Fig. 4. The scaling factor n^{−1} in the INTT computation is merged in advance into the first two addresses of Ψ^{−1}, as illustrated in Fig. 2, to eliminate redundant cycles. Hence, the configurable architecture performs the NTT and INTT computations in the same number of execution CCs, n/4. In terms of hardware utilization, the configurable architecture consumes more logic than the flexible design because of the additional multiplexers. The configurable NTT/INTT architecture is significant for developing an efficient polynomial multiplier, as presented in the following paragraph. Fig. 8 shows a polynomial multiplier architecture using ...

C. CONFIGURABLE NTT INTT ARCHITECTURE AND EFFICIENT POLYNOMIAL MULTIPLIER
Here, ζ = 17 is the first primitive 256-th root of unity and br_7(i) is the bit reversal of the unsigned 7-bit integer i. Experiments showed that the Four-path ModMult unit is fully pipelined and performs the basecase PWM in 17 CCs, whereas the classic PWM of the remaining parameter sets is completed in 6 CCs with four parallel modular multiplications. In every clock cycle, four coefficients are generated, concatenated, and sequentially stored in the Data Memory block. The memory write and read operations are independent and are simply scheduled by the Top Control Unit.
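The Kyber case can be modeled in software as follows. This Python sketch assumes the standard Kyber formulation, in which the 256-point polynomial is split by index parity into two 128-point NTTs with ζ = 17, and each pair of transformed coefficients is multiplied modulo x^2 − ζ^(2·br_7(i)+1); the function names are hypothetical, and the code models only the arithmetic, not the Four-path ModMult unit or its cycle counts.

```python
Q, ZETA = 3329, 17  # Kyber modulus and primitive 256-th root of unity


def br7(i):
    """Bit reversal of an unsigned 7-bit integer."""
    return int(format(i, "07b")[::-1], 2)


ZETAS = [pow(ZETA, br7(i), Q) for i in range(128)]
ZETAS_INV = [pow(pow(ZETA, Q - 2, Q), br7(i), Q) for i in range(128)]


def ntt128(a):
    """Forward CT transform of a 128-point vector (bit-reversed output)."""
    A, m = list(a), 1
    while m < 128:
        t = 128 // (2 * m)
        for i in range(m):
            for j in range(2 * i * t, 2 * i * t + t):
                temp = ZETAS[m + i] * A[j + t] % Q
                A[j + t] = (A[j] - temp) % Q
                A[j] = (A[j] + temp) % Q
        m *= 2
    return A


def intt128(A):
    """Inverse GS transform of a 128-point vector, including the 1/128 scaling."""
    a, t, m = list(A), 1, 128
    while m > 1:
        h = m // 2
        for i in range(h):
            for j in range(2 * i * t, 2 * i * t + t):
                u, v = a[j], a[j + t]
                a[j] = (u + v) % Q
                a[j + t] = (u - v) * ZETAS_INV[h + i] % Q
        t, m = 2 * t, h
    inv128 = pow(128, Q - 2, Q)
    return [x * inv128 % Q for x in a]


def kyber_style_mul(f, g):
    """Multiply f and g in Z_Q[x]/(x^256 + 1) via two 128-point NTTs each."""
    fe, fo = ntt128(f[0::2]), ntt128(f[1::2])   # even- and odd-indexed parts
    ge, go = ntt128(g[0::2]), ntt128(g[1::2])
    c0, c1 = [], []
    for i in range(128):
        r = pow(ZETA, 2 * br7(i) + 1, Q)        # root of the i-th factor x^2 - r
        # Basecase PWM: (fe + fo*x) * (ge + go*x) mod (x^2 - r)
        c0.append((fe[i] * ge[i] + fo[i] * go[i] % Q * r) % Q)
        c1.append((fe[i] * go[i] + fo[i] * ge[i]) % Q)
    h = [0] * 256
    h[0::2], h[1::2] = intt128(c0), intt128(c1)
    return h
```

Each basecase product costs five modular multiplications instead of one, which is consistent with the basecase PWM taking longer than the classic PWM of the other parameter sets.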

IV. IMPLEMENTATION RESULTS AND DISCUSSION
The proposed mixed-radix MDF NTT architectures were modeled using the Verilog hardware description language and synthesized using Xilinx Vivado 2020.1. To compare directly with related works on similar FPGA platforms, the designs were placed and routed on four 28 nm FPGA devices: (1) a Xilinx Zynq-7000 (xc7z020clg484) that has 53K look-up table (LUT) elements, 106K flip-flops (FFs), 220 digital signal processing (DSP) slices, and 140 Block RAMs (BRAMs); (2) a Xilinx Virtex-7 (xc7vx485tffg1761) that has 303K LUTs, 607K FFs, 2800 DSPs, and 1030 BRAMs; (3) a Xilinx Artix-7 (xc7a200tfbg676) that has 135K LUTs, 269K FFs, 740 DSPs, and 365 BRAMs; and (4) a Xilinx Spartan-7 (xc7s100fgga676) that has 64K LUTs, 128K FFs, 160 DSPs, and 120 BRAMs. Resource consumption and achievable clock frequency were obtained with default place-and-route settings. We introduced the area-time product (ATP) and hardware efficiency metrics to enable a fair comparison with previous works, which use various hardware resource types.
Table 1 shows the key implementation results of the proposed flexible NTT architectures compared with previous studies for various parameter sets (n, q). The second and third columns of this table show the numbers of CCs and the achievable clock frequencies. The fourth column shows the execution time of the NTT designs, where the latency is calculated as Latency (µs) = CCs / Freq. (MHz). For n = 1024, the proposed NTT architecture operates approximately 9.6×, 12×, 8×, and 2× faster than those of [9], [10], [12], and [15], respectively. Xing and Li [9] proposed a ping-pong NTT architecture that used four BUs and required a large number of CCs (i.e., 1280). Zhang et al. [10] used only two parallel BUs for the iterative computation, which utilized hardware resources effectively but consumed many CCs (i.e., 2569). Although Mert et al. [15] significantly reduced the number of CCs (i.e., 200) by parallelizing 32 processing elements, their NTT architecture operated at a lower clock frequency and required more hardware resources. Bisheh-Niasar et al. [12] grouped two NTT stages into each computational round by constructing the 2×2 BU array. Their approach aimed to reduce the access pattern complexity but still required a large number of CCs (i.e., 1591). For n = 512, the proposed NTT architecture runs approximately 12× and 2× faster than those of [10] and [15], respectively. For n = 256, Yaman et al. [11] deployed 16 BUs in a high-performance unified hardware architecture and significantly reduced the CC count of the NTT operation (i.e., 69). Bisheh-Niasar et al. [12] deployed the 2×2 BU array and improved the access pattern to reduce the computational cycles (i.e., 324 CCs). Meanwhile, Bisheh-Niasar et al. [13] employed two configurable BUs in parallel, which required a larger number of CCs (i.e., 474) and performed the NTT computation at a low clock frequency. In contrast, our fully pipelined NTT design has the smallest CC count and outperforms those of [11], [12], and [13] by approximately 1.7×, 7×, and 19.7×, respectively. Thus, the proposed NTT architecture achieves superior performance compared to previous approaches.
To compare efficacy among NTT architectures, we evaluated the ATP metric, which captures the trade-off between area requirement and latency. The fifth through eleventh columns of Table 1 report the figures used for this comparison. The state-of-the-art NTT architecture for n = 1024 in [12] has the smallest overall ATP value (see [13]). For n = 512, the NTT architecture in [10] has the smallest ATP values measured by LUTs, FFs, and DSPs, benefiting from the conflict-free memory access operation designed specifically for two parallel radix-2 BUs. Our NTT architectures have comparable ATPs with slightly higher values but do not consume any BRAM, whereas [10] and [12] use six and two BRAMs, respectively, for n = 1024, and [10] uses five for n = 512. For n = 256, the proposed NTT architecture has a slightly better ATP value for LUTs and comparable ATP values for FFs and DSPs, without requiring BRAM, compared with previous studies. Next, we evaluate and compare the speed of the various NTT designs through their achievable throughput. We introduce the throughput metric to measure the number of bits passing through the NTT accelerator per second.
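The metrics used in this section can be sketched in a few lines of Python. The latency formula follows the definition given for Table 1, and the n/4 cycle count is the one stated for the proposed design. The ATP and throughput formulas are assumptions: ATP is taken as resource count times latency, and throughput as n coefficients of ⌈log2 q⌉ bits delivered per latency period, which may differ from the exact expressions used for Table 1. The function names and the 256 MHz clock in the usage note are hypothetical.

```python
def proposed_cycles(n):
    """Execution CCs of the proposed four-path design for an n-point NTT (n/4)."""
    return n // 4


def latency_us(ccs, freq_mhz):
    """Latency (us) = CCs / Freq. (MHz)."""
    return ccs / freq_mhz


def atp(resource_count, ccs, freq_mhz):
    """Assumed area-time product: resource count (e.g., LUTs) times latency in us."""
    return resource_count * latency_us(ccs, freq_mhz)


def throughput_mbps(n, q, ccs, freq_mhz):
    """Assumed throughput in Mbit/s: n coefficients of ceil(log2 q) bits per transform."""
    bits_per_coeff = (q - 1).bit_length()   # 14 bits for q = 12289, 12 for q = 3329
    return n * bits_per_coeff / latency_us(ccs, freq_mhz)
```

As a usage example with a hypothetical 256 MHz clock, a 1024-point transform with q = 12289 takes proposed_cycles(1024) = 256 CCs, so latency_us(256, 256.0) is 1.0 µs and throughput_mbps(1024, 12289, 256, 256.0) evaluates to 14336.0 Mbit/s.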
The twelfth column of Table 1 compares the throughput of the various NTT designs. The proposed NTT architectures achieve the highest throughput among the NTT designs for all parameter sets. Compared specifically with state-of-the-art studies, our NTT designs deliver approximately 12× and 8× the throughput of [10] and [12] for n = 1024, 12× that of [10] for n = 512, and 1.7×, 7×, and 19.7× that of [11], [12], and [13] for n = 256, respectively.
Table 2 compares the configurable NTT/INTT based polynomial multipliers with previous works. Hardware efficiency is introduced as a fair comparison metric because the polynomial multipliers use different prime modulus sizes. The hardware efficiency evaluates the throughput that one FPGA hardware unit can deliver. The efficiency values of the utilized LUTs, FFs, DSPs, and BRAMs are calculated by equation (5), normalized, and denoted as Eff_LUT, Eff_FF, Eff_DSP, and Eff_BRAM, respectively. For n = 1024, Wang et al. [24] proposed a hardware accelerator for shared polynomial multiplication, which traded parameterization and used one configurable BU to reduce the hardware complexity. However, their polynomial multiplier required a large number of CCs (i.e., 11455). The proposed polynomial multiplier outperforms [24] with approximately 27× speed-up and higher efficiency values. For n = 512, Feng et al. [14] implemented a high-speed polynomial multiplier on the Spartan-6 FPGA platform. The fifth column of this table shows that the proposed polynomial multiplier consumes 5.7K slices (18857 FFs and 13421 LUTs) with higher efficiency values than that of [14]. Unlike Spartan-6, Spartan-7 has some extended features of the 7 series family, such as in its DSPs and BRAMs. However, our register-transfer level design used only basic logic elements apart from the 36Kb BRAM, which operated in simple dual-port mode and was better suited to the Data Memory structure than the 18Kb BRAM included in Spartan-6. For n = 256, the unified hardware architecture in [11] required 256 CCs to complete the polynomial multiplication. Our polynomial multiplier runs faster and has better LUT and BRAM efficiency, but worse FF and DSP efficiency, than [11]. However, the key generation, encryption, and decryption processes in Kyber [2] show that the NTT, PWM, and INTT are performed on demand, and only one or even no NTT computation is required for the two input polynomials right before the PWM. This means that the proposed approach can perform the polynomial multiplication in [2] more efficiently in a highly pipelined manner. Regarding memory usage, the proposed approach allocates one 36Kb BRAM unit in simple dual-port mode for the Data Memory unit (512 addresses of 72 bits). The Data Memory consumes only 64, 128, and 256 addresses of the 36Kb BRAM unit for 256-, 512-, and 1024-point polynomials, respectively.
Table 3 shows the utilized FPGA resource breakdown of the proposed polynomial multipliers. Except for BRAM, the NTT modules occupy most of the hardware resources, and NTT_2 consumes more LUTs than NTT_1 because of its configurable function. For n = 256, the Four-path ModMult unit consumes more LUTs, FFs, and DSPs for the specific PWM than for the classic PWM in the cases of n = 1024 and n = 512. Percentage values indicate the proportion of resources utilized by the respective modules in the polynomial multipliers. Additionally, we report the hardware consumption of submodules such as the configurable BUs in Fig. 4 and the MR units in Fig. 5. Note that the BU1 implementation result is reported with one FIFO register. The MR operations of the two prime moduli can share some of the bit-shift operations.
Thus, the proposed NTT architecture can achieve superior throughput with comparable efficiency compared to previous approaches. Although different parameter sets were implemented and compared, the proposed polynomial multiplier is primarily directed towards supporting potential LBC schemes for the third round NIST finalists.

V. CONCLUSION
This paper proposed an efficient mixed-radix MDF NTT architecture suited to high-performance, large-scale data cryptoprocessors. The flexible design adapts the NTT architecture to various parameter sets (n, q), and the reasonable choice of radix values helps achieve high performance. The proposed configurable NTT/INTT architecture offers a versatile tool to effectively perform the expensive polynomial multiplication in LBC schemes.
In future work, the proposed NTT architectures could be improved with various levels of parallelism in the stages and customized for polynomials of high degree and large modulus. Upcoming work will apply the proposed configurable NTT/INTT architecture to accelerate the NIST lattice-based PQC finalists, particularly on co-designed software and hardware platforms.