A Flexible NTT-Based Multiplier for Post-Quantum Cryptography

In this work, an NTT-based (Number Theoretic Transform) multiplier for code-based Post-Quantum Cryptography (PQC) is presented, supporting Quasi-Cyclic Low/Moderate-Density Parity-Check (QC-LDPC/MDPC) codes. The cyclic matrix product, which is the fundamental operation required in this application, is treated as a polynomial product and adapted to the specific case of the QC-MDPC codes proposed for Rounds 3 and 4 of the National Institute of Standards and Technology (NIST) competition for PQC. The multiplier is a fundamental component in both encryption and decryption, and the proposed solution leads to a flexible NTT-based multiplier, which can efficiently handle all types of required products, where the vectors have a length of $\approx 10^4$ and can be moderately sparse. The proposed architecture is implemented using both Field Programmable Gate Array (FPGA) and Application Specific Integrated Circuit (ASIC) technologies and, when compared with the best published results, it features a 10-times reduction of the encryption time, with the area increased by 3 times. The proposed multiplier, incorporated in the encryption and decryption stages of a code-based PQC cryptosystem, leads to an improvement over the best published results of between 3 and 10 times in terms of the $LC$ product (LUT count times latency).

The NIST competition was launched to find a new standard for Public Key Cryptography. The best proposals rely on functions that are provably hard to invert (NP-hard problems) and admit efficient hardware and software implementations.
The first cryptosystem based on error correcting codes was proposed by McEliece, and uses Goppa codes [5], while recent advancements employ Quasi-Cyclic Low/Moderate-Density Parity-Check (QC LDPC/MDPC) codes, in order to reduce the complexity and the memory requirements of the whole system [6], [7]. The NIST competition has now reached round 4, and three code-based cryptosystems are still competing [8]: Classic McEliece [9], BIKE [10] and HQC [11].
The polynomial product is one of the operations intensively used in the encryption and decryption primitives of these cryptosystems, including one of the winners of the selection process in Public Key Encryption [12], [13]. Important aspects that affect both computational complexity and latency in polynomial multiplication are: the length of the polynomials, the number of non-zero elements, and the type of coefficients. In McEliece cryptosystems based on LDPC/MDPC codes, the encryption involves a dense variable, while the decryption works on a sparse variable.
The main methods known in the literature for polynomial product calculation are the Schoolbook, the Karatsuba [14] and the Schönhage-Strassen [15] algorithms, which represent the state of the art in this domain:
• The Schoolbook algorithm is widely adopted in systems based on QC-LDPC/MDPC codes [16], [17], [18]. The implementation efficiency of this approach strongly depends on the density of the involved matrices: while the efficiency is very good for a low-density matrix, it becomes rather poor when the density of the matrix is large (as on the encoding side).
• The Karatsuba algorithm has a reduced time complexity if used to compute the polynomial products required in the encryption [19], but its efficiency is drastically reduced when applied in the decryption.
• The Schönhage-Strassen algorithm [20] can be implemented to efficiently handle sparse and medium-density vectors as well as dense vectors. Therefore, it is preferable as a solution applied to both encryption and decryption.
This work presents a novel architecture supporting the polynomial products involved both in the encryption and in the decryption. The proposed multiplier is based on the Number Theoretic Transform (NTT) and it has been optimized for binary polynomial multiplications, where input operands are binary and results can be either binary or integer.
The same datapath architecture operates with both sparse and dense matrices. The proposed design has been synthesized for both FPGA and ASIC technologies. Moreover, the achieved results have been compared with state-of-the-art architectures that were previously published for code-based PQC and NTT multipliers able to support similar data sizes.
The work is organized as follows: in Section II, code-based Cryptosystems are presented, while in Sections III and IV the cyclic matrix product and the NTT-based multiplier are described. The details of the proposed architecture are given in Section V, and finally the results and the comparisons against the state of the art are provided in Section VI.

II. CODE-BASED CRYPTOSYSTEMS
Code-based cryptosystems adopt error correcting codes to encrypt a secret message. In communication systems, error correcting codes are commonly employed to remove the errors caused by a noisy channel; on the contrary, in code-based cryptography, errors are intentionally inserted at the encoding (encryption) side and then corrected at the decoding (decryption) side.
The security is mainly guaranteed by the complexity of the decoding of a codeword: with t errors and without knowing the decoding matrix, the problem is proven to be NP-hard [21].
In this approach, the encryption PK and decryption SK keys are the encoding and decoding matrices of an LDPC code: G, the generator matrix and H, the parity check matrix.
The structure of the H matrix is $H = [H_0, H_1, \dots, H_{n_0-1}]$, where each block ($0$ to $n_0 - 1$) is a binary square cyclic matrix of dimension $N$, and $d_v$ is the weight of each column; therefore, the matrix $H$ has size $(N, N \cdot n_0)$.
The generator matrix has the systematic form $G = [I_{N \cdot l} \mid G_l]$, where $I_N$ is the identity matrix of size $N$, $l = n_0 - 1$, and $G_l$ is a quasi-cyclic matrix with density at most $d_v \times d_v$, obtained as $H_{n_0-1}^{-1} \cdot [H_1, \dots, H_{n_0-1}]$. We also use $g_l$ to refer to the first row of $G$. The encoding adds redundancy to the input vector $m$, of length $N \cdot (n_0 - 1)$, such that the encoded vector $c$ (the codeword), of length $N \cdot n_0$, satisfies the set of parity equations defined by the matrix $H$.
Since PK and SK are large, cyclic matrices are adopted to reduce the memory requirement. The structure of a cyclic matrix is:
$$A = \begin{bmatrix} a_0 & a_{n-1} & \cdots & a_1 \\ a_1 & a_0 & \cdots & a_2 \\ \vdots & \vdots & \ddots & \vdots \\ a_{n-1} & a_{n-2} & \cdots & a_0 \end{bmatrix} \quad (1)$$
The first column of $A$, denoted as $a$ (the first row is $a^T$), is sufficient to describe the whole matrix. Moreover, since the codes are sparse/moderately sparse, only a few elements of the column are asserted. Hence, only the positions of the ones in $a$, instead of the complete column, can be stored to reduce the memory footprint. Therefore, to describe matrices $H$ and $G$, we only store the asserted positions of the $h$ and $g_l$ vectors.
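The compact storage of a circulant matrix can be sketched in a few lines of Python (an illustrative sketch, not the paper's implementation; the function names are ours):

```python
# Sketch: a circulant matrix is fully described by its first column a, and a
# sparse a is described by the positions of its ones alone.
def circulant_from_first_column(a):
    """Build the full n x n circulant matrix whose first column is a."""
    n = len(a)
    # Column j is the first column cyclically shifted down by j positions.
    return [[a[(i - j) % n] for j in range(n)] for i in range(n)]

def positions_of_ones(a):
    """Compact storage for a sparse binary vector: indices of asserted bits."""
    return [i for i, bit in enumerate(a) if bit == 1]

a = [1, 0, 0, 1]                    # first column, weight 2
A = circulant_from_first_column(a)
assert positions_of_ones(a) == [0, 3]
assert [row[0] for row in A] == a   # the first column reproduces a
```

Storing only the asserted positions reduces the memory footprint from $N$ bits to $d_v$ indices per circulant block, as described above.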

A. ENCRYPTION
The encryption starts from the secret message, $m$, encoded with $G$ into $c$. The error $e$ is a binary vector of length $N \cdot n_0$ with $t$ positions asserted. Thus, the encryption can be written as
$$x = c \oplus e = m \cdot G \oplus e \quad (2)$$
It is worth noting that the $\oplus$ operation is the sum in the Galois field GF(2); thus, the error flips $t$ bits of the codeword $c = m \cdot G$.
The codeword c satisfies the parity equations defined by the code, and the errors introduced in c make some parity equations unsatisfied.
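The error-injection step above can be sketched as follows (a toy Python sketch with illustrative names; the codeword generation $c = m \cdot G$ is assumed to have happened already):

```python
# Hedged sketch of the encryption step: the codeword c = m·G is corrupted by
# XOR-ing (sum in GF(2)) a weight-t error vector e, stored by its positions.
def encrypt(c, error_positions):
    """Flip the codeword bits at the t asserted error positions."""
    x = list(c)
    for p in error_positions:
        x[p] ^= 1                  # the ⊕ of GF(2) is a bit flip
    return x

c = [0, 1, 1, 0, 1, 0]
x = encrypt(c, [1, 4])             # t = 2 intentional errors
assert sum(ci ^ xi for ci, xi in zip(c, x)) == 2   # exactly t bits differ
```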

B. DECRYPTION
The decryption removes the errors in x in order to retrieve the original message m. In LEDAcrypt and BIKE, the iterative decoding is based on a modified Bit Flipping (BF) algorithm [22]; in both schemes, it has been designed to be suitable for PQC. The decoder receives x and, by means of H, evaluates the Syndrome (s) and the Unsatisfied Parity Checks (upc), which are, respectively, the response of the parity equations to the input x and the number of wrong (unsatisfied) parity check equations in which each bit of x is involved.
The bits with a value of upc over a specific threshold (b) are flipped. The decoder iteratively evaluates s, upc and flips specific bits until s becomes equal to 0 (vector with all elements set to 0).
The LEDAcrypt and BIKE bit flipping based decoder is reported in Algorithm 1, which incorporates two alternative forms. Both the LEDAcrypt and BIKE algorithms use the function $f(\cdot)$ to evaluate the threshold $b_{It}$ (line 4). Alternative options to implement this function have been proposed in [10] and [23].

Algorithm 1 Bit Flipping Based Decoder
As said, the algorithm can be written in two variants, which differ in the processing tasks at lines (8) and (9). The alternative form is shown in Algorithm 1 using the italic font and angle brackets. The algorithm stops when the syndrome is zero or when the maximum number of iterations is reached.
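Algorithm 1 is not reproduced here, but its overall loop can be sketched in Python (a toy model with dense matrices and an illustrative threshold rule; the schemes' actual threshold function $f(\cdot)$ and the two algorithmic variants differ):

```python
# Minimal bit-flipping decoder sketch: evaluate the syndrome s and the
# unsatisfied-parity-check counts upc, flip bits above a threshold, repeat
# until s is the all-zero vector or the iteration limit is reached.
def bit_flip_decode(H, x, max_iters=10):
    x = list(x)
    n = len(x)
    for _ in range(max_iters):
        # Syndrome: response of the parity equations to the current x (GF(2)).
        s = [sum(H[r][c] & x[c] for c in range(n)) % 2 for r in range(len(H))]
        if not any(s):
            break                  # all parity checks satisfied: decoding done
        # upc[c]: number of unsatisfied checks in which bit c is involved.
        upc = [sum(s[r] & H[r][c] for r in range(len(H))) for c in range(n)]
        b = max(upc)               # toy threshold; the real f(·) differs
        for c in range(n):
            if upc[c] >= b:        # flip the bits reaching the threshold
                x[c] ^= 1
    return x
```

For a length-3 repetition-style check matrix, a single injected error is corrected within a few iterations; the production decoders operate on QC-MDPC matrices with $N \approx 10^4$ instead.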
As can be observed, both encryption and decryption need multiplications: in particular, c at the encryption side, and s and upc in the decryption procedure, are obtained as the results of products involving a QC matrix. Given the cyclic structure shown in (1), this product operation is referred to as the cyclic matrix product.
In order to obtain an efficient bit flipping decoding, an effective implementation of the cyclic matrix product is required. The proposed multiplier assumes the parameters listed in Table 1, which are derived from the LEDAcrypt [24] and BIKE [10] proposals.

III. CYCLIC MATRIX PRODUCT
The generic cyclic matrix multiplication is intended as the product $r = A \cdot v$, where $A$ is an $N \times N$ cyclic matrix and $v$ is a vector of length $N$. The result of the multiplication is computed as
$$r_i = \sum_{j=0}^{N-1} a_{(i-j) \bmod N} \cdot v_j, \quad i = 0, \dots, N-1 \quad (3)$$

A. POLYNOMIAL PRODUCT AND INTEGER MULTIPLICATION
The vectors $a$ and $v$ can be seen as polynomials with coefficients $a_i$ and $v_i$. The same description holds for the product $A \cdot v$, and the result in Equation (3) is equivalent to $r = a^T \cdot V$. Thus, the techniques applied in polynomial and integer multiplication can be applied to the cyclic matrix product in LEDAcrypt and BIKE. The choice among the best known polynomial product algorithms mentioned in Section I can be made based on time complexity and implementation complexity. The metric one can use to evaluate the time complexity is the number of element-wise multiplications, which depends on the length of the polynomials ($N$). Using the big-O notation, the time complexities of the Schoolbook, Karatsuba and Schönhage-Strassen algorithms can be expressed as $O(N^2)$, $O(N^{\log_2 3})$ [14] and $O(N \cdot \log(N) \cdot \log(\log(N)))$ [25], respectively (Figure 1).
LEDAcrypt and BIKE algorithms include cyclic products with a binary cyclic matrix, where $N \approx 10^4$. Therefore, both the Karatsuba and Schönhage-Strassen algorithms are preferable in terms of time complexity. Between these two options, we choose the Schönhage-Strassen algorithm because of its lower time complexity, especially with sparse matrices [19], [20], [26].

IV. NTT BASED CIRCULANT MATRIX PRODUCT
The Schönhage-Strassen algorithm [25] is based on the convolution theorem. Given a vector $y$ and a cyclic matrix $X$, the cyclic matrix product $y \cdot X$ is equivalent to the cyclic convolution (conv) between $x$ and $y$, where $x$ is the first column of $X$:
$$z = \mathrm{conv}(x, y) \quad (4)$$
By defining $X$, $Y$ and $Z$ as the frequency-domain representations of signals $x$, $y$ and $z$ respectively, (4) can be replaced with the element-wise multiplication
$$Z = X \circ Y \quad (5)$$
Thus, if one applies known algorithms to calculate the direct Fourier Transform ($DFT(\cdot)$) and the inverse transform ($IDFT(\cdot)$), then (4) can be rewritten as
$$z = IDFT(DFT(x) \circ DFT(y)) \quad (6)$$
The direct Fourier Transform is conveniently replaced with the Number Theoretic Transform (NTT) [27], as it is specifically suited to transform and process integer vectors, avoiding floating point numbers and exploiting modular arithmetic. Generally, an NTT-based multiplier involves the transform of the input values, $x$ and $y$, into $X$ and $Y$, their element-wise multiplication, and the inverse transform to obtain $z$. The NTT of the integer vector $x$ is defined as
$$X_k = \sum_{i=0}^{N-1} x_i \cdot \alpha^{ik} \bmod P, \quad k = 0, \dots, N-1 \quad (7)$$
where $N$ is the length of $x$; $\alpha$ and $P$ are referred to as radix and modulus, respectively. The values of $\alpha$ and $P$ depend on $N$ and have to be selected such that:
• The overflow condition does not occur during the computation of the convolution; thus, $P$ must be larger than $M^2 N$, with $M$ being the largest integer in the input signals (it is '1' for binary vectors).
• $P$ must be a number for which we can define a primitive root of unity $r$.¹
• $\alpha$ is a function of $r$ and $P$; if $P$ is in the form $kN + 1$, we have $\alpha = r^k \bmod P$.
Any prime value of $P$ in the form $kN + 1$, for which a primitive root of unity can be found, is valid. The condition on $P$ restricts the possible values that can be employed but, in general, it is not mandatory for $P$ to be prime, and a low value of $P$ has the advantage of reducing the size of the operators in the NTT.
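The parameter selection rules above can be sketched as a brute-force search (illustrative Python for small $N$; the function name and search strategy are ours, not the paper's):

```python
# Sketch of the NTT parameter search: find a prime P = k·N + 1 above the
# overflow bound P > M^2·N, then a radix alpha = r^k mod P of order exactly N.
def is_prime(p):
    return p > 1 and all(p % d for d in range(2, int(p ** 0.5) + 1))

def find_ntt_params(N, M=1):
    k = M * M + 1                      # start just above the overflow bound
    while True:
        P = k * N + 1
        if P > M * M * N and is_prime(P):
            for r in range(2, P):      # search for a suitable root r
                alpha = pow(r, k, P)
                # alpha must have multiplicative order exactly N modulo P
                if pow(alpha, N, P) == 1 and all(pow(alpha, j, P) != 1
                                                 for j in range(1, N)):
                    return P, alpha
        k += 1

P, alpha = find_ntt_params(8, M=1)     # binary vectors: M = 1
assert (P - 1) % 8 == 0 and pow(alpha, 8, P) == 1
```

For $N = 8$ and binary inputs, the search yields the small toy pair $P = 17$, $\alpha = 9$; the cryptosystem parameters of Table 1 lead to much larger $N$ and $P$.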
In this work, the NTT-based multiplier is conceived for the length and type of vectors employed in the LEDAcrypt and BIKE cryptosystems, which use binary vectors. In the case of the decoder, the involved operands are the ciphertext, $x = [x_0, x_1]$, and $s$, plus the first row of the $H$ matrix (the SK in the decoder), $h = [h_0, h_1]$. The QC-MDPC codes adopted here are block codes with $n_0$ blocks; hence, the variables $x$ and $H$ have a block structure.
As for the encoder, the operands are m and g l . It is worth noting that the length of the vectors, as it is clear from Table 1, is not a power of two; instead, these lengths are prime numbers. In order to meet the NTT requirement, the vector length is extended by means of different padding approaches described in the following. Moreover, the sparsity of h is exploited by the NTT-sparse algorithm, in order to reduce the overall execution time.
The algorithm flow of the convolution-based multiplier is shown in Figure 2. Since $h$ is sparse, it is firstly padded with a specific algorithm and then transformed by the NTT-sparse algorithm, whereas for $x$ a trivial padding is applied and then it is transformed by the usual NTT algorithm; the two transforms are then element-wise multiplied and finally inverse-transformed to obtain the result.
Code-based PQC is not the only field in which the NTT algorithm is required. It appears in a variety of applications, and the parameters of the involved polynomials strongly depend on the domain. In the cryptography domain, we can distinguish between two main applications: Fully Homomorphic Encryption (FHE) and PQC. The difference in the parameters is presented in Figure 3: although Lattice-based and Code-based cryptosystems are both PQC primitives, their values of $N$, $M$ and $P$ are quite different.
¹ A primitive root of unity is any number $r$ such that $r^x \bmod P = 1$ and $r^y \bmod P \neq 1$ for any $y < x$ [27].
In the following, the NTT and NTT-sparse algorithms are described, as key elements in the implementation of the multiplication in (4), where x is the sparse vector, y the dense vector, and z the result. At the decoding side, the two algorithms are applied to the evaluation of s and upc. At the encoding side, only the NTT algorithm is required, as both vectors are dense.

A. ALGORITHMS FOR NTT COMPUTATION
The well-known radix-2, Cooley-Tukey, decimation-in-frequency FFT structure, composed of several butterfly units (see Figure 4), is used to efficiently implement the NTT. Equation (7) can be rearranged accordingly into a sequence of butterfly stages. Figure 5 gives the structure of the butterfly graph to compute the NTT of a vector $x$. The $N = 8$ example shows the sequence of multiplications and additions/subtractions computed in each stage of the NTT computation, all implemented with the basic block in Figure 4. The input vector $x$ (listed in bit-reverse order) is renamed as $v_j(i)$, where the index $j = 0, 1, 2$ refers to the computation stage and $i = 0, \dots, 7$.
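The butterfly-based computation can be sketched as an iterative software model (illustrative Python, not the paper's RTL; we use the decimation-in-time ordering, with the bit-reversed input mentioned for Figure 5):

```python
# Illustrative radix-2 butterfly NTT: bit-reversal of the input indices,
# followed by log2(n) stages of MPY / ADD / SUB butterflies modulo P.
def ntt(x, alpha, P):
    """Transform x (length a power of two) with radix alpha and modulus P."""
    n = len(x)
    x = list(x)
    # Bit-reversal permutation of the input indices.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            x[i], x[j] = x[j], x[i]
    # Butterfly stages: combine blocks of growing length.
    length = 2
    while length <= n:
        w_len = pow(alpha, n // length, P)           # twiddle of order length
        for start in range(0, n, length):
            w = 1
            for k in range(length // 2):
                u = x[start + k]
                v = x[start + k + length // 2] * w % P    # MPY
                x[start + k] = (u + v) % P                # ADD
                x[start + k + length // 2] = (u - v) % P  # SUB
                w = w * w_len % P
        length <<= 1
    return x
```

For small toy parameters (e.g. $N = 8$, $P = 17$, $\alpha = 9$), the output matches the direct evaluation of Equation (7).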
In a straightforward approach, the whole graph is computed uniformly, with the same processing extended to all stages. However, the computation can be drastically simplified if a sparse input vector is received, where most of the elements are equal to 0. It is convenient, for sparse vectors, to process the asserted positions as separate values in the NTT graph, by propagating them independently of the others. In the basic unit of Figure 4, no computation is needed if the inputs are 0, and reduced-complexity calculations are possible when a single input is asserted (the twiddle factor at stage $l$ and position $j$ corresponds to the values $\alpha^{(2j)k}$ and $\alpha^{(2j+1)k}$ in Equation (7)).
To exploit these simplifications, in our approach, the NTT stages are split in two ranges: a sparse NTT computation is applied from the input up to a limit stage (l lim ); then, the asserted values are merged and the normal, dense NTT computation is applied for the remaining stages.
In the example of Figure 5, most of the inputs are 0, except $v_0(3)$ and $v_0(4)$ (asserted positions $Pos_x = [3, 4]$). Since stages $l = 0$ and $l = 1$ work on vectors having several elements equal to 0, in this case $l_{lim} = 2$, and the sparse computation is limited to stages $l = 0$ and $l = 1$. As can be observed, the asserted element $v_0(3)$ occupies a single position at stage $l = 0$. When this input propagates to stage $l = 1$, two values need computation, and four positions must be computed at stage $l = 2$.
The computation required at each propagation of an asserted value is based on the binary representation of the positions of the asserted inputs. For example, for the asserted input $v_0(3)$, the binary representation of position 3 is [0, 1, 1]. Every bit in this representation refers to a butterfly stage, with the least significant bit corresponding to the first stage. A bit equal to 0 implies that the corresponding value only needs to be copied to the two butterfly outputs; on the contrary, a bit equal to 1 implies that both outputs of the butterfly must be calculated in the simplified form. Thus, for the asserted input $v_0(3)$ of Figure 5, with binary position [0, 1, 1], we have a 1 at both stages $l = 0$ and $l = 1$. At stage $l = 0$, one multiplication and one subtraction are required to compute the two output values $v_1$. The same kind of processing is needed at stage $l = 1$, where two multiplications are performed and four values are computed as $v_2(0)$, $v_1(0)$, $P - v_2(0)$ and $P - v_1(0)$. A different computation is done on the asserted element $v_0(4)$, having binary position [1, 0, 0]. In this case, the bits are equal to 0 for both stages $l = 0$ and $l = 1$; therefore, just replicas of the asserted inputs are required at the butterfly outputs. Indeed, at stage 2, the same value, $v_2(4)$, appears in positions from 4 to 7.
Because of the linearity of the NTT, the simplified computations required for each asserted input can run independently of the others, as separate transform contributions. When the $l_{lim}$ stage is reached, the separate transforms are merged and the computation proceeds with the normal, dense NTT. In some cases, the merge operation is as simple as the plain concatenation of values at stage $l_{lim}$: this is the case for the example in Figure 5, where single values are available at the output of stage 1 in every position; therefore, they are concatenated to form the 8-input vector for the last stage. In other cases, multiple values are available for a position at the boundary between sparse and dense computation, because of the separate contributions derived with the described procedure. In this case, the merge operation needs a sum of the multiple values at a given position, before concatenation. An example of this second case is shown in Figure 6, where the asserted inputs $v_0(2)$ and $v_0(3)$ are independently processed, leading to multiple contributions, which are added at the input of the last stage. In a more formal way, the proposed approach can be described with the pseudo-code in Algorithm 2, where the received inputs are: the sparse stage limit $l_{lim}$, the number of non-zero elements $nze$, and their positions in vector $x$, named $Pos_x$. The binary representation of these positions is reported as $PosBin$, and the single bit in the binary pattern is indicated as $PosBin(l)$ for stage $l$. Based on the value of $PosBin(l)$, two possible operations are performed: 1) if $PosBin(l) = 1$, a multiplication and a subtraction are executed to update the butterfly outputs; 2) if $PosBin(l) = 0$, the first butterfly input is simply copied to the two outputs.
At every stage in the range 0 to $l_{lim}$, the computed transform contributions are stored in a matrix $SX(i_v, k)$, which is updated at every stage $l$; its rows, $i_v = 0, \dots, nze$, correspond to the asserted positions, and the number of columns, $k$, is at most $2^{l_{lim}}$.
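The linearity argument above admits a compact software restatement (illustrative Python; this is a mathematical sketch, not the stage-wise hardware procedure of Algorithm 2): the transform of a single '1' at position $p$ is the sequence $\alpha^{pk}$, so the transform of a sparse binary vector is the modular sum of such independent contributions.

```python
# Sparse NTT via linearity: process each asserted input independently and
# merge the contributions with modular additions.
def sparse_ntt(positions, n, alpha, P):
    X = [0] * n
    for p in positions:            # one separate contribution per asserted '1'
        for k in range(n):
            X[k] = (X[k] + pow(alpha, p * k, P)) % P
    return X
```

For the Figure 5 example ($Pos_x = [3, 4]$) this reproduces the dense transform of the corresponding binary vector; the hardware exploits the same property, but propagates the contributions stage by stage.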
Later, in Section V, the multiplication by the twiddle factor (the blue circle in Figure 4) is indicated as MPY, while ADD and SUB correspond to the red and yellow boxes, respectively.

B. MODULAR ARITHMETIC AND NTT COMPUTATION
Modular arithmetic is fundamental in the computation of the NTT. The result of every addition or subtraction must be in the range $[0, P-1]$; thus, if the value falls outside $[0, P-1]$, supplemental computation is needed to adapt the result. Modular multiplication is similar but more complex, as the result is the remainder of the division between the product and $P$. For an efficient implementation, the division must be avoided, and this can be achieved by adopting the Montgomery reduction method [28], which exploits a support modulus $R$, to be conveniently selected.
Let us consider the product $a \cdot b = c$, where $a$ and $b$, the operands, and $c$, the result, are in the range $[0, P-1]$. To calculate the multiplication, the operands must be converted to the Montgomery form [29]. The reduction is performed with respect to a number $R$ selected such that $R > P$ and $\gcd(R, P) = 1$, using an integer $y_R \in [0, R-1]$ selected such that $P \cdot y_R \equiv -1 \bmod R$. The equivalent of $a$ in the Montgomery form is $a_m = (a \cdot R) \bmod P$; similarly, $b_m = (b \cdot R) \bmod P$.
Then, the Montgomery multiplication takes place according to Algorithm 3.

Algorithm 3 Montgomery Multiplication
Input: integer factors of the multiplication, $a_m$, $b_m$;
Input: integer $y_R$;
Output: product in Montgomery format, $c_m$.
Notice that, if $R$ is conveniently selected as a power of 2, then only multiplications by constants, bit shifting and bit masking operations are required in lines 2 to 4 of Algorithm 3. The result must be converted back from the Montgomery to the integer domain, which requires additional complexity. If this method is applied to compute a single operation, the input and output conversions introduce a relevant computational overhead, which limits the advantage of the overall approach. However, since the Montgomery method preserves the distributive property, it is not necessary to apply the conversion to each single operation; thus, only the NTT inputs and INTT outputs are converted, instead.
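The Montgomery multiplication described above can be sketched as follows (illustrative Python; the function names are ours, and the exact line numbering of Algorithm 3 is not reproduced). With $R$ a power of two, the reduction needs only masking, shifting and small constant multiplications:

```python
# Hedged sketch of Montgomery multiplication with R = 2^b.
def montgomery_setup(P, R):
    # y_R satisfies P * y_R ≡ -1 (mod R).
    return (-pow(P, -1, R)) % R

def to_mont(a, P, R):
    return (a * R) % P                      # conversion to Montgomery form

def mont_mul(a_m, b_m, P, R, y_R):
    T = a_m * b_m
    m = ((T & (R - 1)) * y_R) & (R - 1)     # (T mod R) * y_R mod R, via masks
    t = (T + m * P) >> (R.bit_length() - 1) # exact division by R, via shift
    return t - P if t >= P else t           # result is a·b·R mod P
```

For $P = 17$, $R = 32$: $y_R = 15$, and multiplying the Montgomery forms of 5 and 7 yields the Montgomery form of $35 \bmod 17 = 1$, without any division by $P$.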

C. PADDING
If the size of the inputs is not a power of 2, then a proper padding is required. Alternatively, in the case of an input vector length that is a prime number, the Bluestein algorithm [30] could be used. This method efficiently computes the NTT of a prime-length vector with a time complexity of $O(N \log N)$; however, it does not support sparse vectors.
Two padding methods have been selected in this work: the PrimePadding, applied to the sparse vector $x$, and the ZeroPadding, applied to the dense vector $y$. It is convenient to apply the PrimePadding to the sparse vector, so that its sparsity is preserved, while the dense vector is padded with zeros to reduce the complexity of the computation.

1) PRIMEPADDING
The PrimePadding vector extension [30] is based on the idea of extending $a$, the first row of $A$, to $a'$, such that the new matrix $A'$ is still circulant and contains $A$. The extension takes a vector of length $N$ and provides a vector of length $N'$, with $N' > N$ and $N'$ a power of 2.
An example of the PrimePadding extension is shown in (9) for the extension from $N$ to $N'$, with $d_N = N' - 2N + 1$, applied to the first column of matrix $A$ given in (1).
The extension of the base vector consists of concatenating $d_N$ zeros and then the mirrored $N - 1$ elements of $a$ on the right part of the vector. The PrimePadding is applied to vector $x$, the sparse vector in Equation (4). In the complete cryptosystems, this corresponds to the first rows of matrices $H$ and $H^T$ and to the error vectors $e_{It}$.
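The circulant-embedding idea can be sketched and checked in Python (illustrative; the exact layout of (9) is not reproduced, but the defining property — that the extended circulant contains the original matrix — is verified):

```python
# Sketch of the PrimePadding: extend the defining vector a (length N, not a
# power of two) to length Np (a power of two) so that the top-left N x N block
# of the extended circulant equals the original circulant matrix.
def prime_padding(a, Np):
    N = len(a)
    assert Np >= 2 * N - 1
    # N' - 2N + 1 zeros, then the N - 1 remaining elements of a.
    return a + [0] * (Np - 2 * N + 1) + a[1:]

def circulant(col):
    n = len(col)
    return [[col[(i - j) % n] for j in range(n)] for i in range(n)]

a = [1, 0, 1, 0, 0]                 # N = 5, extended to Np = 16
A  = circulant(a)
Ap = circulant(prime_padding(a, 16))
assert all(Ap[i][j] == A[i][j] for i in range(5) for j in range(5))
```

Note that the extended vector stays sparse: each asserted position of $a$ contributes at most two asserted positions of the padded vector, so the NTT-sparse algorithm still applies.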

2) ZEROPADDING
The ZeroPadding is applied to the dense vector $y$ in Equation (4). The dense vectors, in the cryptosystem, are the message $m$ and the encoding matrix $G$.
The whole convolution operation requires only the first $N$ components of the output vector. The idea is shown in (10), where the multiplication is the Syndrome evaluation ($s = H \cdot x$) in the Decoder. In (10), the original circulant matrix is highlighted with a red square, while the dashed orange box represents the first row of the matrix, properly padded with the PrimePadding technique. Similarly, the input vector (blue box) has been zero-padded to $N'$. For the result, we need to consider only the first $N$ components, which are highlighted in green in (11).
The advantage of this padding technique is that the matrix is still sparse and $N' - N$ elements of the input vector are set to zero, which allows skipping part of the computation.
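The complete padded product of (10)-(11) can be sketched end to end (illustrative Python; a naive circular convolution stands in for the NTT pipeline, which computes the same result, and GF(2) outputs are used for illustration):

```python
# End-to-end sketch: PrimePadding on the sparse vector, ZeroPadding on the
# dense vector, length-Np circular convolution, then keep the first N outputs.
def cyclic_product(sparse_a, dense_v, Np):
    N = len(sparse_a)
    a = sparse_a + [0] * (Np - 2 * N + 1) + sparse_a[1:]   # PrimePadding
    v = dense_v + [0] * (Np - N)                           # ZeroPadding
    z = [sum(a[(i - j) % Np] * v[j] for j in range(Np)) % 2
         for i in range(Np)]       # circular convolution (GF(2) result)
    return z[:N]                   # only the first N components are needed

def reference(a, v):
    """Direct cyclic matrix product, Equation (3), reduced modulo 2."""
    N = len(a)
    return [sum(a[(i - j) % N] * v[j] for j in range(N)) % 2 for i in range(N)]

a = [1, 0, 1, 0, 0]                # sparse defining vector, N = 5
v = [1, 1, 0, 1, 1]                # dense vector
assert cyclic_product(a, v, 16) == reference(a, v)
```

The truncation to the first $N$ components is exactly the green selection in (11).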

V. ARCHITECTURE
The NTT-based multiplier architecture handles four operations to compute the convolution: the sparse NTT computation, the dense NTT, the element wise multiplication and the INTT. These four elements of the computation involve multipliers and adders, as arithmetic units.
The whole architecture is divided into Data Path, Control Unit and Memory. To save complexity, part of the Data Path and part of the Memory are shared among the four mentioned operations.
The NTT and INTT coefficients are in the form α ik mod P; they are pre-computed and stored in two dedicated ROMs, referred to as Direct Constant Memory and Inverse Constant Memory.
The intermediate values, involved in the computation of the product, are stored in only two dedicated RAMs, named Complete RAM 1 and 2, that serve the two NTT, the INTT and the element wise product. The sparse NTT needs two additional memories to store the expanded positions: these are the Starting ROM and the Sparse RAM.
The core element in the Data Path is the butterfly unit of Figure 4, which includes one multiplier, one adder and one subtractor. The parallelism of the Data Path is $n_u = 8$; thus, 8 computations are handled at the same time. The basic elements of the Data Path are shown in Figure 7 and will be detailed in the following sub-sections.
The Control Unit (CU) is quite complex due to the presence of shared arithmetic units and memories. A hierarchical organization has been adopted in order to simplify the control (Figure 8).
The Master control unit handles three separate controllers, which manage the three main operations of the multiplier: the Sparse NTT, the Dense NTT and the INTT (plus the element wise product). Two additional control units, the Direct and Inverse Constant Management, provide the twiddle factors to the Data Path.

A. SPARSE-NTT
The Sparse-NTT controller has the most complex behaviour when compared to the conventional NTT and INTT. The computation is split in two phases. Initially, as described in Algorithm 2, the asserted inputs are individually processed and stored; then, after the merging, data are processed as dense vectors.
According to the architecture in Figure 7, the sparse execution works as follows: the asserted position in the sparse vector, Pos x (i v ), is used to address the Starting ROM memory, which contains the pre-computed transform values for stage l p = log 2 (n u ) = 3. Let us assume that a single input is asserted within each block: in this case, a single '1' is received, located at any position from 0 to n u − 1. For each input, the resulting values up to stage l p are stored in the memory with n u rows, corresponding to the location of the input '1'. As an example, Figure 9 shows the content of the Starting ROM for the case of asserted v 0 (3) (Figure 5). The Sparse RAM contains nze rows with the asserted positions expanded up to stage l lim ; the expansion has length 2 l lim . The next step merges the values from Sparse RAM to Complete RAM 1. The memory contains the transform of the sparse input vector x, from Equation (4) up to stage l lim expressed as a vector. This memory content is then used by the Dense-NTT controller, which starts at stage l lim + 1. The vector length is N : it is initially reset to 0 and then each expanded Pos x is assigned to an interval of the complete N vector; when two positions point to the same interval, they are summed together.

B. DENSE-NTT
The Dense-NTT controller evaluates the transform of the sparse vector from stage $l_{lim}$ onwards, as well as the complete transform of the dense vector. As shown in Figure 7, the computation uses Complete RAM 1 as both input and output: the Dense-NTT takes in input the vector of length $N$ and evaluates its transform from stage $l_{lim} + 1$; the vector is updated $n_u = 8$ elements at a time, until all $N$ elements of the transform are computed.
The Dense-NTT controller provides the addresses for Complete RAM 1, which contains the inputs of the $n_u$ MPY units; the twiddle factors are provided by the Direct Constant Management, given the stage and the $n_u$-element section of the complete $N$ vector being processed. The updated values, obtained after the ADD/SUB is evaluated, are stored in Complete RAM 1.
The transform of the dense vector is computed from stage $l = 0$ up to the end via the Dense-NTT, and the result is stored in Complete RAM 2. The processing is the same as the one described for stages $l > l_{lim}$ of the sparse vector.
The transform of the sparse vector is in Complete RAM 1 and the transform of the dense vector is in Complete RAM 2.

C. ELEMENT WISE MULTIPLICATION
The element wise multiplication evaluates the product between the transforms stored in Complete RAM 1 and in Complete RAM 2. The computation involves the MPY units, with one input from Complete RAM 1 and the second input from Complete RAM 2 (see Figure 7). Since the data stored in Complete RAM 2 are not needed in the following steps, they are overwritten with the element wise multiplication results.

E. TWIDDLE FACTOR MANAGEMENT
The twiddle factors are pre-computed and stored in dedicated memories; however, in the sparse processing, they are accessed according to a different procedure than in the dense case. The twiddle factors for the MPY units (the constants) in the Sparse and Dense NTT and in the INTT are provided by the Direct Constant Management for the NTT and by the Inverse Constant Management for the INTT, respectively. These units are connected to the memories (the Direct and Inverse Constants Memories) which store the twiddle factors.
As shown in Figure 10, the memories are organized in 4 blocks, referred to as Constants %4 = i (with i = 0, 1, 2, 3). In order to reduce the latency, n u constants are read in parallel from the memories and forwarded to the n u multipliers MPY. The Direct Constant Management controller is in charge of driving the multiplexers and both the Register File (RF) and Buffer units in order to effectively provide constants for both the Sparse and Dense-NTT; the execution is performed such that, while the constants loaded in Buffer feed the MPY, the Direct Constant Management controller loads the next set of constants in RF.

F. USE OF MAIN HARDWARE RESOURCES
The units in the Data Path and memory elements have been widely reused during the NTT-based multiplication between sparse and dense vectors.
The evolution of the resource use is shown in Figure 11, which indicates the unit active in each stage of the computation. The usage of the arithmetic units is highlighted in orange. As for the RAM components, the use is marked in light blue when only read operations are involved (Direct Constant Memory and Inverse Constant Memory), and in light blue/red when the same memory is used for both read and write operations.

VI. IMPLEMENTATION RESULTS
The NTT-based multiplier has been implemented as an ASIC component, using Synopsys Design Compiler and the UMC 65 nm technology; the same architecture has also been implemented for an Artix-7 200 FPGA target, using Vivado.

A. ASIC RESULTS
The ASIC synthesis results are reported in Table 2 and include the occupied area A when the clock period is set to t_ck = 5 ns (column 3), the shortest achievable clock period t_ck,min before obtaining a negative slack (column 4), and the occupied area A_f when the shortest clock period is set as a constraint (column 5). The critical path, i.e. the longest combinational delay, is found within the element-wise multiplier and depends on the choice of the modulus P, since P affects the size of the arithmetic units.
The Table also provides the percentage area and delay differences with respect to the synthesis with the clock period set to 5 ns: for a large N, the feasible increase of the clock frequency is higher than the corresponding area penalty.
It is worth mentioning that the reported area refers to a flexible polynomial multiplier that performs the product of binary vectors and computes the result either in GF(2) or in the integer field.

B. FPGA RESULTS
The FPGA synthesis results are reported in Table 3 for several choices of the input vector length (N), while the density of the vectors is set to d_v = 10. The clock period has been constrained to 10 ns in order to enable a fair comparison against published implementations, where the same constraint was imposed. It has been verified that the effect of the density on the number of required resources is negligible. In the Table, N is the length of the input vector extended with PrimePadding, as defined in Section IV-C1. Columns 2-6 provide the required number of hardware resources, namely Look-Up Tables (LUT), Flip-Flops (FF), Block RAMs (BRAM), and arithmetic processing units (DSP). It is worth noting that the number of BRAMs increases almost linearly with N, whereas the amount of allocated logic elements (LUTs and FFs) grows sub-linearly with N, improving hardware utilization.

C. LATENCY
The multiplier latency is reported in Table 4 in terms of number of clock cycles, for a few choices of vector length and sparsity. The last two columns also report the latency in ms, assuming for each case the highest clock frequency achieved in the ASIC and FPGA syntheses. From Table 4, it can be seen that the latency scales linearly with N, while it scales logarithmically with the matrix density.

VII. COMPARISON
The latency values provided in Table 4 include three components deriving from the three fundamental processing  parts: i) the transform of the two input operands to the frequency domain, ii) the element-wise multiplication, and iii) the final inverse-transform. However, in LEDAcrypt and BIKE decoders for PQC, the result of a cyclic product frequently becomes the input of another cyclic product. In such cases, it is convenient to keep the results in the frequency domain and avoid unnecessary transforms and inverse-transforms. Therefore, a comparison among complete systems that incorporate the polynomial product as one of the key operations is more meaningful than the comparison between standalone multiplier components.
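The benefit of staying in the frequency domain can be illustrated with a toy software model (a naive O(N²) transform over an illustrative modulus P = 257 with N = 16; nothing here reflects the actual hardware or the cryptosystem parameters): when the result of one cyclic product feeds another, the intermediate inverse-transform and re-transform cancel, so a single final INTT suffices.

```python
# Toy NTT over Z_P: illustrative parameters, naive O(N^2) transform.
P, N = 257, 16
OMEGA = pow(3, (P - 1) // N, P)          # primitive N-th root of unity mod P

def ntt(v, w=None):
    w = OMEGA if w is None else w
    return [sum(x * pow(w, i * k, P) for k, x in enumerate(v)) % P
            for i in range(N)]

def intt(v):
    inv_n = pow(N, P - 2, P)             # N^-1 mod P (Fermat's little theorem)
    w_inv = pow(OMEGA, P - 2, P)         # OMEGA^-1 mod P
    return [(x * inv_n) % P for x in ntt(v, w_inv)]

def cyclic_mul(a, b):
    """Reference flow: transform both operands, multiply element-wise, invert."""
    return intt([x * y % P for x, y in zip(ntt(a), ntt(b))])

a = [(3 * k + 1) % P for k in range(N)]
b = [1, 0, 1] + [0] * (N - 3)
c = [2] + [0] * (N - 1)

# Chained products: keep the intermediate result in the frequency domain
# and apply a single inverse transform at the end.
A, B, C = ntt(a), ntt(b), ntt(c)
chained = intt([x * y * z % P for x, y, z in zip(A, B, C)])

assert chained == cyclic_mul(cyclic_mul(a, b), c)
```

The sequential flow spends two extra transforms (one INTT and one NTT) on the intermediate result; the chained flow removes them, which is the kind of saving exploited in the LEDAcrypt and BIKE decoding loops.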
In this paper, we consider the use of the NTT-based multiplier in the context of a PQC application, encompassing both encryption and decryption. We assume the adoption of the QC-MDPC McEliece cryptosystem with the parameters given in Table 1.
The encoding part requires one cyclic multiplication with two dense vectors, as in (2), and the sum with the error vector. For these operations, Table 5 reports both the latency (expressed as the total number of required clock cycles) and the complexity (limited for simplicity to the number of allocated LUTs) when using the proposed NTT-based architecture and three alternative multipliers from the recent literature.
For the implementation in [17], the reported numbers of cycles are derived from the available formulas. The number of LUTs in [18] refers to the whole system with the parallelism set to n_u = 8. In [19], the area and execution time refer to a multiplier suitable for binary polynomial multiplication, as necessary in the LEDAcrypt and BIKE encoders. An additional implementation proposed in [16] has not been included in Table 5, since it refers to an encoder with a sparse multiplication.
The NTT-based products in the decoder can be arranged by exploiting the properties of the convolution, with the purpose of reducing the overall complexity. The decoding Algorithm 1 includes two multiplications in sequence, at steps 1 (Syndrome Evaluation) and 3 (UPC Evaluation). Therefore, in the first iteration, one transform can be saved when computing the syndrome transform S^It, where h is the first row of matrix H. The evaluation of upc^It then reuses this result, where h^T is the first row of matrix H^T; thus, one inverse transform is also saved in the first iteration.
The upc^It vector is processed in the time domain, since the error positions above the threshold have to be derived. The threshold is computed from the syndrome sum, available in the frequency domain as the value of S^It(0), associated with α^(i·0) from Equation (7).
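The reason the transform coefficient at index 0 gives the syndrome sum is that it weights every time-domain sample by α^(i·0) = 1. A minimal check, with toy parameters (P = 257, N = 16) unrelated to the real system:

```python
# The 0-th NTT coefficient equals the sum of the time-domain samples mod P.
P, N = 257, 16
OMEGA = pow(3, (P - 1) // N, P)    # primitive N-th root of unity mod P

def ntt(v):
    return [sum(x * pow(OMEGA, i * k, P) for k, x in enumerate(v)) % P
            for i in range(N)]

s = [1, 0, 1, 1, 0, 0, 1, 0] + [0] * 8    # toy binary syndrome
assert ntt(s)[0] == sum(s) % P             # S(0) == syndrome sum
```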
The error vector e^It, in the time domain, is transformed into the frequency domain as E^It with the Sparse-NTT Algorithm.
The syndrome vector is then updated directly in the frequency domain by exploiting the linearity of the transform. Then, as in the decoder Algorithm 1, after the check on the syndrome weight, a new iteration of the procedure starts.
The execution time of the first iteration is strongly reduced if some unnecessary NTT/INTT transforms are skipped. Moreover, the execution time of the remaining iterations is further reduced because the transforms of h^T and h have already been computed during the first iteration, and one Sparse-NTT transform is enough.
The resulting decoding execution time, with mixed frequency- and time-domain computation, is reported in the third column of Table 6, in terms of number of cycles. The remaining columns give the same information for three previous implementations, while the last row shows the required number of LUTs. In Table 6, the total numbers of clock cycles reported in [17] and [18] have been scaled to N = 12,323 and d_v = 100, using the formulas provided in the two works. When comparing the different decoder implementations, one can see that the proposed solution achieves a latency similar to [16] and [17] (n_u = 32), while it is much slower than [18]. In terms of LUTs, the proposed decoder is better than [16], very similar to [18] and more expensive than [17].
We now consider the implementation of the cyclic multiplier in the context of a complete system, able to perform both encoding and decoding, and supporting both binary and integer formats. To compare the alternative implementations, we introduce the LC figure of merit, defined as a latency-complexity product, where the complexity C is given by the global number of LUTs allocated in the synthesis of both encoder and decoder, while the latency L is the execution time (in s) evaluated for encoder and decoder, assuming N = 12,323, d_v = 100, a clock frequency of 100 MHz, and It_max = 6 (where possible, the numbers are scaled to match these requirements).
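For clarity, the figure of merit can be written as a one-line helper; the numbers used in the example below are placeholders, not results from any of the compared works.

```python
def lc_product(cycles_enc, cycles_dec, luts, f_clk=100e6):
    """LC = latency (encoder + decoder, in seconds at f_clk) times LUT count."""
    latency_s = (cycles_enc + cycles_dec) / f_clk
    return latency_s * luts

# Placeholder example: 1,000 + 9,000 cycles at 100 MHz with 5,000 LUTs.
example_lc = lc_product(1000, 9000, 5000)
```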
Although the proposed NTT-based multiplier is not the best option when comparing standalone arithmetic units, the results in Table 7 show that it is more efficient than the other implementations proposed in the literature when the unit is used in a complete application, where the same component must be exploited for multiple products in the encoding and decoding stages, which are characterized by different data formats and computational structures. As seen from Table 7, the advantage in terms of LC product over the alternative approaches grows quickly with the size of the problem: when N moves from 12,323 to 37,813, LC is 5 times the initial one. Considering the same N, our proposal is almost 7 times more efficient than [17] and 3 times more efficient than [18]. Moreover, the increase of N takes advantage of the Schönhage-Strassen algorithm and makes our proposal even more efficient. The reported LC figure for [16] is better than all the other ones in Table 7; however, this comparison is not fair, because the encoding in [16] involves a product between a sparse vector and a dense vector, which drastically simplifies the computation with respect to the considered applications.

VIII. CONCLUSION
The NTT-based multiplier offers a series of advantages when applied to code-based cryptosystems. One relevant advantage is that the latency increases only logarithmically with the density of the operands. This property is important for the implementation of a McEliece cryptosystem, where both sparse and dense cyclic products are required. This advantage is even more relevant when the complete execution time is considered for both encoder and decoder, which involve operands with different density values.
Moreover, in the decoding procedure, the computation can be arranged to skip or reduce several operations, leading to a shorter execution time for the multiplication.
A second advantage is the adaptability of the NTT-based approach to different data types. This is particularly important for the implementation of code-based cryptosystems, where variables with different densities and coefficients are used. The flexibility of the proposed solution is not limited to the LEDAcrypt and BIKE cryptosystems: it can be extended to other primitives in the PQC domain, where the polynomial product plays a key role.
The present work proves that an NTT-based multiplier, with proper design choices, is efficient for code-based Post-Quantum Cryptography, making the proposed unit a valid accelerator for different applications.

Open Access funding provided by 'Politecnico di Torino' within the CRUI CARE Agreement.

VOLUME 11, 2023