Efficient Low-Latency Multiplication Architecture for NIST Trinomials With RISC-V Integration

Binary extension field arithmetic is widely used in several important applications such as error-correcting codes, cryptography and digital signal processing. Multiplication is usually considered the most important finite field arithmetic operation, so efficient hardware architectures for multiplication are highly desirable. In this brief, a new architecture for multiplication over finite fields generated by irreducible trinomials $f(x) = x^{m}+x^{t}+1$ is presented. The proposed architecture is based on a polynomial multiplier and a cyclic shift register, and can perform the multiplication in $t-1$ clock cycles. The general architecture is applied to the trinomials recommended by NIST (National Institute of Standards and Technology). Furthermore, a RISC-V instruction set for the proposed multiplier is implemented and validated using VeeR-EL2 on a Nexys A7 FPGA. To the best knowledge of the authors, this is the first work that integrates multiplication based on NIST trinomials into a RISC-V SoC. Results show a latency improvement of two to three orders of magnitude (389 to 2246 times fewer clock cycles) at the cost of less than 50% additional area.

José L. Imaña, Luis Piñuel, Yao-Ming Kuo, Member, IEEE, Oscar Ruano, and Francisco García-Herrero

I. INTRODUCTION
Binary extension field $GF(2^m)$ arithmetic is widely used in several important applications such as error-correcting codes, cryptography and digital signal processing [1], [2], [3], [4], [5]. These applications often require efficient VLSI implementations of arithmetic operations, especially multiplication, which is usually considered the most important finite field arithmetic operation. The complexity of the multiplier depends on the irreducible polynomial $f(x)$ selected for the field. For hardware implementations, low Hamming weight irreducible polynomials, such as trinomials and pentanomials, are normally used [6], [7], [8]. Irreducible trinomials $f(x) = x^m + x^n + 1$ are very important because they are abundant and exhibit the lowest Hamming weight [9]. Specific irreducible trinomials have been recommended by NIST (National Institute of Standards and Technology) for use in digital signatures [10]. Furthermore, some of the public-key encryption and key-establishment algorithms (such as the code-based Classic McEliece) submitted to the ongoing (Round 4) NIST Post-Quantum Cryptography (PQC) standardization process also use irreducible trinomials in their specifications [4].
Efficient methods and architectures for finite field multiplication have been proposed in the literature [6], [8], [9], [11]. Two-step classic $GF(2^m)$ multiplication requires a multiplication of polynomials followed by a reduction modulo an irreducible polynomial [12]. An efficient multiplication method was proposed by Mastrovito, in which a product matrix was introduced to combine the above steps [13]. Other methods use a divide-and-conquer approach (such as the Karatsuba algorithm) for polynomial multiplication [14]. The use of suitable polynomials, such as irreducible trinomials, means that modular polynomial reduction can be efficiently implemented [7], [9], [13], [15].
In this brief, a new architecture for multiplication over finite fields (using the two-step classic method) generated by irreducible trinomials $f(x) = x^m + x^t + 1$ is presented. The proposed architecture is based on a polynomial multiplier and a cyclic shift register, and can perform the multiplication in $t-1$ clock cycles. The general architecture is applied to the irreducible trinomials $f(x) = x^{409} + x^{87} + 1$ and $f(x) = x^{233} + x^{74} + 1$ recommended by NIST and to the irreducible trinomials $f(x) = x^{193} + x^{15} + 1$ and $f(x) = x^{113} + x^{9} + 1$ recommended by SECG (Standards for Efficient Cryptography Group) [16]. Furthermore, a RISC-V instruction set [17] for the proposed multiplier is implemented and validated using VeeR-EL2 on a Nexys A7 FPGA. To the best knowledge of the authors, this is the first work that integrates multiplication based on NIST trinomials into a RISC-V SoC.
This brief is organized as follows. Section II provides notation and mathematical background. The new multiplier for general irreducible trinomials is introduced in Section III, where an example of multiplication for $f(x) = x^6 + x^4 + 1$ and the description of the new hardware architecture are also given. Section IV presents the RISC-V instruction set for the proposed multiplier and experimental results of its implementation using VeeR-EL2 on a Nexys A7 FPGA. Finally, conclusions are given in Section V.

II. BACKGROUND
Any element $A$ in the finite field $GF(2^m)$ can be represented as $A = \sum_{i=0}^{m-1} a_i x^i$, with $a_i \in GF(2) = \{0, 1\}$ and $x$ being a root of an irreducible polynomial $f(y) = \sum_{i=0}^{m} f_i y^i$ over $GF(2)$. Arithmetic operations in $GF(2^m)$ are performed modulo $f(x)$. Addition of polynomials is carried out under modulo 2 arithmetic, so the addition of two elements becomes the bitwise XOR of their binary representations.
1549-7747 © 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
Multiplication in $GF(2^m)$ of two elements, $C = A \cdot B \bmod f(x)$, is usually considered the most important and complex operation. The rest of the operations in $GF(2^m)$, such as inversion or exponentiation, are derived from multiplication. Two-step classic multiplication in $GF(2^m)$ [12] requires a polynomial multiplication followed by a reduction modulo the irreducible polynomial $f(x)$. The polynomial product $D = A \cdot B$ has maximum degree $2m-2$, and its coefficients are determined by the expressions [12]:
$$d_k = \sum_{i=0}^{k} a_i b_{k-i}, \quad k = 0, \ldots, m-1; \qquad d_k = \sum_{i=k-m+1}^{m-1} a_i b_{k-i}, \quad k = m, \ldots, 2m-2. \tag{1}$$
It can be observed in (1) that $d_0, \ldots, d_{m-1}$ are the coefficients of $x^0, \ldots, x^{m-1}$ of the polynomial $D$, respectively, which do not need to be reduced. However, the powers $x^m, \ldots, x^{2m-2}$ of $D$ (with coefficients $d_m, \ldots, d_{2m-2}$, respectively) must be reduced modulo $f(x)$.
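The coefficient expressions in (1) amount to a carry-less (XOR-based) product. The following Python sketch is illustrative only and is not part of the proposed hardware. It assumes that a polynomial is stored as an integer whose bit $i$ holds the coefficient of $x^i$; the helper names `clmul` and `coeff_from_eq1` are hypothetical and introduced here for the illustration.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials (bit i = coefficient of x^i)."""
    d = 0
    shift = 0
    while b >> shift:
        if (b >> shift) & 1:
            d ^= a << shift          # add a(x) * x^shift; addition in GF(2) is XOR
        shift += 1
    return d


def coeff_from_eq1(a: int, b: int, k: int, m: int) -> int:
    """Coefficient d_k of D = A*B, summed term by term as in (1)."""
    lo = max(0, k - m + 1)           # smallest i such that k - i <= m - 1
    hi = min(k, m - 1)               # largest i such that k - i >= 0
    bit = 0
    for i in range(lo, hi + 1):
        bit ^= ((a >> i) & 1) & ((b >> (k - i)) & 1)
    return bit


# Small check with m = 6 coordinates per operand (values chosen arbitrarily).
m = 6
a, b = 0b101101, 0b110011
d = clmul(a, b)
assert all(((d >> k) & 1) == coeff_from_eq1(a, b, k, m) for k in range(2 * m - 1))
print(f"D has {2 * m - 1} coefficients: {d:0{2 * m - 1}b}")
```

Both routines compute the same $2m-1$ coefficients; the shift-and-XOR form is simply the usual software counterpart of the sum in (1).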
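After the polynomial multiplication $D = A \cdot B$ given in (1), a reduction modulo the irreducible polynomial $f(x)$ must be performed. In the modular reduction $C = D \bmod f(x)$, the degree $2m-2$ polynomial $D$ is reduced by the degree $m$ irreducible polynomial $f(x)$, resulting in a polynomial $C$ with maximum degree $m-1$. The product $C$ can be represented in matrix notation as the product of the coefficient vector of $D$ by an $(m \times (2m-1))$ matrix that can be decomposed into an $(m \times m)$ identity matrix $I$ and an $(m \times (m-1))$ reduction matrix $R = (r_{j,i})$, which depends only on the irreducible polynomial $f(x)$. Therefore the reduction modulo $f(x)$ can be given as follows [12]:
$$(c_0, c_1, \ldots, c_{m-1})^T = [\, I \mid R \,] \cdot (d_0, d_1, \ldots, d_{2m-2})^T. \tag{2}$$
The coefficients $r_{j,i} \in GF(2)$ given in (2), with row index $j = 0, \ldots, m-1$ and column index $i = 0, \ldots, m-2$, can be computed as follows [12]:
$$r_{j,0} = f_j; \qquad r_{j,i} = r_{j-1,i-1} + r_{m-1,i-1}\, f_j, \quad i = 1, \ldots, m-2, \tag{3}$$
where $r_{j-1,i-1} = 0$ for $j = 0$. In coordinates, (2) reads $c_j = d_j + \sum_{i=0}^{m-2} r_{j,i}\, d_{m+i}$ for $j = 0, \ldots, m-1$, where the coefficients of the reduction matrix $R$ are given in (3).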
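As a sanity check of the matrix view, the sketch below builds the columns of $R$ from the fact that column $i$ holds the coordinates of $x^{m+i} \bmod f(x)$, each column being the previous one multiplied by $x$. It is an illustrative software model under the same bit-mask convention (bit $j$ = coefficient of $x^j$); the function names are hypothetical.

```python
def reduction_columns(f: int, m: int) -> list[int]:
    """Column i (a bit mask over x^0..x^(m-1)) equals x^(m+i) mod f(x), i = 0..m-2."""
    cols = [f ^ (1 << m)]            # column 0: f(x) without its leading term
    for _ in range(m - 2):
        c = cols[-1] << 1            # next column = previous column times x
        if (c >> m) & 1:
            c ^= f                   # degree reached m: subtract f(x)
        cols.append(c)
    return cols


def reduce_with_matrix(d: int, f: int, m: int) -> int:
    """C = [I | R] * D: keep the m low coefficients, add column i when d_(m+i) = 1."""
    c = d & ((1 << m) - 1)
    for i, col in enumerate(reduction_columns(f, m)):
        if (d >> (m + i)) & 1:
            c ^= col
    return c


def reduce_by_division(d: int, f: int, m: int) -> int:
    """Reference: cancel the terms of degree >= m one at a time."""
    for k in range(d.bit_length() - 1, m - 1, -1):
        if (d >> k) & 1:
            d ^= f << (k - m)
    return d


# Check on f(x) = x^6 + x^4 + 1 for every D of degree <= 2m - 2 = 10.
f6, m6 = 0b1010001, 6
assert all(reduce_with_matrix(d, f6, m6) == reduce_by_division(d, f6, m6)
           for d in range(1 << (2 * m6 - 1)))
print([f"{c:06b}" for c in reduction_columns(f6, m6)])
```

For a trinomial, column 0 has only two non-zero entries (rows $0$ and $t$), which is what makes the shift-register realization of the reduction inexpensive.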
As shown in Section II, the product $P = A \cdot x \bmod f(x)$ can be computed using (2) with $D = A \cdot x$, for which the coefficients $d_0$ and $d_{m+1}, \ldots, d_{2m-2}$ are null. Furthermore, for a given column $i$, the expressions for the computation of $R$ in (3) are:
$$r_{0,i} = r_{m-1,i-1}\, f_0; \qquad r_{j,i} = r_{j-1,i-1} + r_{m-1,i-1}\, f_j, \quad j = 1, \ldots, m-1. \tag{4}$$
These expressions are similar to those obtained for the computation of the product $P = A \cdot x \bmod f(x)$:
$$p_0 = a_{m-1}\, f_0; \qquad p_j = a_{j-1} + a_{m-1}\, f_j, \quad j = 1, \ldots, m-1. \tag{5}$$
The similarity between equations (5) and (4) can be observed. The operation given in (5) can be implemented using a cyclic shift register as shown in Fig. 1(a), where ⊕ denotes an XOR gate and the other two symbols in the figure denote an AND gate and a 1-bit register, respectively. The registers are initially loaded with the coordinates of $A$, and the coefficients $f_i$, $i = 1, \ldots, m-1$, of the irreducible polynomial are connected to the AND gates together with the output of the last 1-bit register of the cyclic shift register. It can be observed that after one clock cycle, the register contents will be the coefficients $(p_0, p_1, \ldots, p_{m-1})$ of the product $P = A \cdot x \bmod f(x)$. The cyclic shift register given in Fig. 1(a) is the basis of the reduction described in the following.
As given in Section II, the terms $x^0, \ldots, x^{m-1}$ of $D$ (with coefficients $d_0, \ldots, d_{m-1}$, respectively) do not need to be reduced. However, the powers $x^m, \ldots, x^{2m-2}$ of $D$ (with coefficients $d_m, \ldots, d_{2m-2}$, respectively) must be reduced modulo $f(x)$. The polynomial $D = d_{2m-2} x^{2m-2} + \ldots + d_m x^m + d_{m-1} x^{m-1} + \ldots + d_0$ can be rewritten as follows:
$$D = D_H\, x^m + D_L, \quad D_H = d_{2m-2} x^{m-2} + \ldots + d_{m+1} x + d_m, \quad D_L = d_{m-1} x^{m-1} + \ldots + d_1 x + d_0, \tag{6}$$
so that, using modulo properties, $D \bmod f(x) = (D_H\, x^m \bmod f(x)) + D_L$. For irreducible trinomials $f(x) = x^m + x^t + 1$, with $t = 1, \ldots, m-1$, we have that $x^m = x^t + 1 \bmod f(x)$; therefore, using modulo properties again, we have
$$D_H\, x^m \bmod f(x) = D_H (x^t + 1) \bmod f(x) = (D_H\, x^t \bmod f(x)) + D_H. \tag{7}$$
Finally, we obtain the following expression (8) for the computation of the product $C = A \cdot B \bmod f(x)$:
$$C = (D_H\, x^t \bmod f(x)) + (D_H + D_L), \tag{8}$$
where the addition $D_H + D_L$ is given as follows:
$$D_H + D_L = d_{m-1} x^{m-1} + \sum_{i=0}^{m-2} (d_{m+i} + d_i)\, x^i. \tag{9}$$
Therefore, $D_H + D_L$ can be implemented with the bitwise XOR of the coefficients as given in equation (9). The cyclic shift register given in Fig. 1(a) for the implementation of $P = A \cdot x \bmod f(x)$ could be used to compute $D_H\, x^t \bmod f(x)$. It can be observed that the product $D_H\, x^t$ (corresponding to the irreducible trinomial $f(x) = x^m + x^t + 1$) can be written as the $t$ products $(\ldots(((D_H x)x) \ldots x)$, so using modulo properties we have:
$$D_H\, x^t \bmod f(x) = (\ldots(((D_H x)\, x \bmod f(x))\, x \bmod f(x)) \ldots x) \bmod f(x), \tag{10}$$
where the first product $D_H x$, of degree at most $m-1$, does not need to be reduced modulo $f(x)$. The lowest trinomial we could use would be $f(x) = x^m + x + 1$ ($t = 1$), so for this trinomial no reduction would be needed and, therefore, no shifts (i.e., 0 clock cycles) would be needed in the structure given in Fig. 1(a) to get the reduction. In this way, the computation of $D_H\, x^t \bmod f(x)$ needs $t-1$ clock cycles to be completed.
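The decomposition $C = (D_H x^t \bmod f(x)) + D_H + D_L$ can be checked in software. The sketch below is an illustrative model, not the authors' implementation: `times_x_mod_f` mimics one clock of a Fig. 1(a)-style register specialized to a trinomial, and the result is compared against a plain reduction. The function names and the random test operands are assumptions made for this example.

```python
import random


def clmul(a: int, b: int) -> int:
    """Carry-less product in GF(2)[x] (bit i = coefficient of x^i)."""
    d, shift = 0, 0
    while b >> shift:
        if (b >> shift) & 1:
            d ^= a << shift
        shift += 1
    return d


def times_x_mod_f(p: int, m: int, t: int) -> int:
    """One clock: P <- P * x mod (x^m + x^t + 1)."""
    fb = (p >> (m - 1)) & 1                # output of the last register
    p = (p << 1) & ((1 << m) - 1)          # shift every coefficient up by one
    if fb:
        p ^= (1 << t) | 1                  # x^m is replaced by x^t + 1
    return p


def mul_trinomial(a: int, b: int, m: int, t: int) -> int:
    """C = (D_H x^t mod f) + D_H + D_L."""
    d = clmul(a, b)
    d_l = d & ((1 << m) - 1)
    d_h = d >> m
    p = d_h
    for _ in range(t):                     # the first step never triggers feedback
        p = times_x_mod_f(p, m, t)
    return p ^ d_h ^ d_l


def mul_reference(a: int, b: int, m: int, t: int) -> int:
    """Two-step multiplication with a plain term-by-term reduction."""
    f = (1 << m) | (1 << t) | 1
    d = clmul(a, b)
    for k in range(d.bit_length() - 1, m - 1, -1):
        if (d >> k) & 1:
            d ^= f << (k - m)
    return d


rng = random.Random(0)
for m, t in [(409, 87), (233, 74), (193, 15), (113, 9)]:
    a, b = rng.getrandbits(m), rng.getrandbits(m)
    assert mul_trinomial(a, b, m, t) == mul_reference(a, b, m, t)
print("decomposition matches the reference for all four trinomials")
```

The loop performs $t$ multiplications by $x$, but the first one is a pure shift because $D_H$ has degree at most $m-2$, which is why only $t-1$ of them require the feedback path.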
Fig. 1(a) could be modified to implement the above behaviour as follows: the $m-1$ coefficients of $D_H$ are stored in the $m-1$ most-significant registers of the shift register, while the least-significant register $p_0$ is not needed, and the AND gate with $f_1$ and the XOR gate feeding register $p_1$ are removed. The output of the most-significant register $p_{m-1}$ is adjusted to feed the 1-bit register $p_1$. Fig. 1(b) shows the final architecture needed to compute $D_H\, x^t \bmod f(x)$ for $f(x) = x^m + x^t + 1$ in $t-1$ clock cycles, where the 1-bit registers $p_{m-1}, p_{m-2}, \ldots, p_2, p_1$ are initially loaded with the coordinates of $D_H = (d_{2m-2}, \ldots, d_{m+1}, d_m)$, respectively.
It can be observed that for $t = 1$, the product $D_H\, x \bmod f(x) = D_H\, x$ does not need to be reduced modulo $f(x)$ and the result (stored in registers $p_{m-1}, \ldots, p_2, p_1$, respectively) does not include a coefficient for $x^0$, as explained before in this section.
For $t = 2$, the product is $D_H\, x^2 \bmod f(x) = (D_H x)\, x \bmod f(x)$, and therefore the result (also stored in registers $p_{m-1}, \ldots, p_2, p_1$, respectively) does not include a coefficient for $x^1$. In general, the product $D_H\, x^t \bmod f(x)$ given in the 1-bit registers does not include a coefficient for the term $x^{t-1}$. Furthermore, it can be proven that the product $D_H\, x^t \bmod f(x)$ computed by the architecture shown in Fig. 1(b) is given by:
$$D_H\, x^t \bmod f(x) = \sum_{i=t}^{m-1} p_i\, x^i + \sum_{i=1}^{t-1} p_i\, x^{i-1}, \tag{11}$$
where $p_i$, with $i = 1, \ldots, m-1$, represent the coefficients stored in the corresponding registers after $t-1$ clock cycles.
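The behaviour described for Fig. 1(b) can be modelled bit by bit. The sketch below is one reading of the text rather than a transcription of the figure: it assumes $m-1$ registers $p_1, \ldots, p_{m-1}$, a plain shift $p_j \leftarrow p_{j-1}$, the wrap-around $p_1 \leftarrow p_{m-1}$, and a single feedback XOR into $p_t$. After $t-1$ clocks, the registers are read so that the missing coefficient is the one of $x^{t-1}$. All names are hypothetical.

```python
import random


def shift_register_fig1b(d_h: int, m: int, t: int) -> int:
    """D_H * x^t mod (x^m + x^t + 1) with m-1 one-bit registers and t-1 clocks."""
    # p[1..m-1] model the registers; p[0] is unused because register p_0 is absent.
    p = [0] + [(d_h >> (i - 1)) & 1 for i in range(1, m)]   # p_i <- d_(m+i-1)
    for _ in range(t - 1):
        fb = p[m - 1]                  # output of the most-significant register
        nxt = [0] * m
        nxt[1] = fb                    # p_(m-1) feeds p_1
        for j in range(2, m):
            nxt[j] = p[j - 1]          # plain shift
        nxt[t] ^= fb                   # single feedback tap of the trinomial
        p = nxt
    # Read-out: p_(m-1)..p_t hold x^(m-1)..x^t and p_(t-1)..p_1 hold x^(t-2)..x^0.
    result = 0
    for i in range(t, m):
        result |= p[i] << i
    for i in range(1, t):
        result |= p[i] << (i - 1)
    return result


def reference(d_h: int, m: int, t: int) -> int:
    """D_H * x^t mod f(x) by t multiply-by-x-and-reduce steps."""
    f = (1 << m) | (1 << t) | 1
    p = d_h
    for _ in range(t):
        p <<= 1
        if (p >> m) & 1:
            p ^= f
    return p


rng = random.Random(0)
for m, t in [(409, 87), (233, 74), (193, 15), (113, 9)]:
    d_h = rng.getrandbits(m - 1)       # D_H has m-1 coefficients
    assert shift_register_fig1b(d_h, m, t) == reference(d_h, m, t)
    assert (shift_register_fig1b(d_h, m, t) >> (t - 1)) & 1 == 0   # no x^(t-1) term
    print(f"m = {m}, t = {t}: {t - 1} clock cycles")
```

Loading $D_H$ one position up already accounts for the first multiplication by $x$, so the register only has to be clocked $t-1$ times.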

A. Example: Multiplication for $f(x) = x^6 + x^4 + 1$
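For the specific case $f(x) = x^6 + x^4 + 1$, the product $C = A \cdot B \bmod f(x)$ is computed using expression (8), in such a way that $C = (D_H\, x^4 \bmod f(x)) + (D_H + D_L)$. The coefficients of the polynomial multiplication $D = (d_{10}, \ldots, d_0)$ are computed using expression (1), where $D_H = (d_{10}, \ldots, d_6)$ and $D_L = (d_5, \ldots, d_0)$. The addition $D_H + D_L$ is determined using expression (9) as follows:
$$D_H + D_L = d_5 x^5 + (d_{10} + d_4) x^4 + (d_9 + d_3) x^3 + (d_8 + d_2) x^2 + (d_7 + d_1) x + (d_6 + d_0). \tag{12}$$
The computation of $D_H\, x^4 \bmod f(x)$ is done in 3 clock cycles using the cyclic shift register given in Fig. 1(b). Table I shows the contents of the 1-bit registers in the 3 clock cycles, where Init. represents the initial contents of the registers. Following (11), the polynomial $D_H\, x^4 \bmod f(x)$ is given as:
$$D_H\, x^4 \bmod f(x) = (d_9 + d_7) x^5 + (d_{10} + d_8 + d_6) x^4 + d_{10} x^2 + d_9 x + (d_{10} + d_8). \tag{13}$$
The product $C$ is computed by the addition of (12) and (13) as follows:
$$C = (d_9 + d_7 + d_5) x^5 + (d_8 + d_6 + d_4) x^4 + (d_9 + d_3) x^3 + (d_{10} + d_8 + d_2) x^2 + (d_9 + d_7 + d_1) x + (d_{10} + d_8 + d_6 + d_0), \tag{14}$$
where in $x^4$, the addition $d_{10} + d_{10} = 0$. It can be proven that the result given in (14) matches the one obtained by direct application of equations (2) and (3).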
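The example can also be traced symbolically. In the sketch below, each register content is represented by the set of indices $k$ of the coefficients $d_k$ that are XORed together, so a set symmetric difference plays the role of the XOR gate. The register update rule is the same reading of Fig. 1(b) assumed throughout these illustrations (shift, wrap-around into $p_1$, feedback into $p_t$); it is a model for checking the algebra, not the authors' code.

```python
M, T = 6, 4  # f(x) = x^6 + x^4 + 1

# p[i] = set of indices k such that d_k contributes to register p_i.
p = {i: frozenset({M + i - 1}) for i in range(1, M)}   # p_5..p_1 <- d_10..d_6


def fmt(s: frozenset) -> str:
    return " + ".join(f"d{k}" for k in sorted(s, reverse=True)) or "0"


print("Init.  ", {f"p{i}": fmt(p[i]) for i in range(M - 1, 0, -1)})
for clock in range(1, T):                   # t - 1 = 3 clock cycles
    fb = p[M - 1]
    nxt = {1: fb}                           # p_5 feeds p_1
    for j in range(2, M):
        nxt[j] = p[j - 1]                   # plain shift
    nxt[T] = nxt[T] ^ fb                    # feedback XOR into p_4
    p = nxt
    print(f"Clock {clock}", {f"p{i}": fmt(p[i]) for i in range(M - 1, 0, -1)})

# Read-out of D_H x^4 mod f(x): p_5, p_4 -> x^5, x^4 and p_3, p_2, p_1 -> x^2, x^1, x^0.
dh_x4 = {i: p[i] for i in range(T, M)}
dh_x4.update({i - 1: p[i] for i in range(1, T)})

# D_H + D_L: coefficient i is d_(m+i) + d_i for i < m-1, and d_(m-1) for i = m-1.
dh_plus_dl = {i: frozenset({i, M + i}) for i in range(M - 1)}
dh_plus_dl[M - 1] = frozenset({M - 1})

c = {i: dh_plus_dl[i] ^ dh_x4.get(i, frozenset()) for i in range(M)}
for i in range(M - 1, -1, -1):
    print(f"c{i} = {fmt(c[i])}")
```

The cancellation of $d_{10}$ in the coefficient of $x^4$ shows up as the index 10 dropping out of the corresponding set.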

B. Hardware Architecture of the Multiplier
Based on the above considerations, the hardware architecture of the new multiplier proposed for the computation of the product $C = A \cdot B \bmod f(x)$ is shown in Fig. 2. The polynomial multiplier module in the upper part of Fig. 2 computes the product of polynomials $D = A \cdot B$ as given in equation (1). It must be noted that this polynomial multiplier can be implemented using the commonly integrated multiplier found in most coprocessors. The lower part of Fig. 2 is specific to trinomials: the cyclic shift register at the left of Fig. 2 provides $D_H\, x^t \bmod f(x)$, and the final addition is performed by the $m-1$ XOR gates at the bottom of Fig. 2 as given in (11). It can be observed that the general architecture given in the lower part of Fig. 2 has a delay of $T_X + \max\{T_X, T_A\}$, where $T_X$ and $T_A$ stand for the delay of 2-input XOR and AND gates, respectively.

IV. IMPLEMENTATION AND HARDWARE RESULTS
To address the challenge of resource sharing and computational speed, the polynomial multiplier in the upper part of Fig. 2 is implemented using the existing general-purpose multiplier integrated in most coprocessors. Our attention is centered on NIST-recommended trinomials, so to ensure the architecture's generality, multiplexers are added to provide support for any of these trinomials. The entire design incorporates pipeline registers at the input and output of each module and divides the polynomial multiplier into 14 pipeline stages, so that its critical path stays shorter than that of the RISC-V CPU. This enables operations to be performed at maximum speed. Additionally, this approach achieves significant area savings, as the bits stored in the shift register also act as pipeline registers, eliminating the need for duplication. This architecture not only facilitates resource sharing but also provides a substantial speedup compared to CPU execution of optimized software code. Other fully combinational architectures and designs with higher degrees of parallelism were explored; however, their integration with the RISC-V CPU proved challenging due to the substantial increase in hardware resources required for working with large-order polynomials (e.g., trinomials of order 409).

A. Control and Final Architecture
To integrate these operations, new instructions are added to the compiler. Given that the operands can be up to 409-bit words, three different instructions have been introduced for each operand ($A$ and $B$). ffloadas (ffloadbs) begins loading the operand from the general-purpose registers, while ffloada (ffloadb) loads the information into the internal pipeline register for inputs $A$ and $B$ (Fig. 2). Finally, ffloadae (ffloadbe) indicates the completion of operand loading. Note that different trinomial orders will require a varying number of ffloada (ffloadb) instructions. However, the control mechanism is simplified and completely generic, thanks to the start and end instructions. Fig. 3 provides a summary of these instructions, following the standard format for RISC-V processors. Furthermore, the integration of the multiplication module follows a similar approach to the division module of the VeeR-EL2 architecture [18], with the distinction that hardware resources are shared with the already integrated multiplier and the Galois Field extension given in [19].
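To give a feeling for why the number of load instructions depends on the trinomial order, the sketch below splits an $m$-bit operand into register-sized words. It is only an illustration under stated assumptions: it assumes that each transfer moves one 32-bit general-purpose register and that words are sent least-significant first. The text does not specify either point, nor how the words map onto the start, middle and end instructions, so `split_operand` and `join_operand` are hypothetical helpers and not part of the described ISA.

```python
WORD_BITS = 32  # assumption: one 32-bit general-purpose register per transfer


def split_operand(value: int, m: int, word_bits: int = WORD_BITS) -> list[int]:
    """Split an m-bit field element into words, least-significant word first."""
    n_words = -(-m // word_bits)                 # ceil(m / word_bits)
    mask = (1 << word_bits) - 1
    return [(value >> (word_bits * i)) & mask for i in range(n_words)]


def join_operand(words: list[int], word_bits: int = WORD_BITS) -> int:
    """Inverse of split_operand: rebuild the operand from its words."""
    value = 0
    for i, w in enumerate(words):
        value |= w << (word_bits * i)
    return value


for m in (409, 233, 193, 113):
    operand = (1 << (m - 1)) | 1                 # x^(m-1) + 1, an arbitrary element
    words = split_operand(operand, m)
    assert join_operand(words) == operand
    print(f"order {m}: {len(words)} words of {WORD_BITS} bits per operand")
```

Under these assumptions, the word count per operand varies with the order of the trinomial, which is consistent with the remark that the number of intermediate load instructions is not fixed.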

B. Hardware Results and Comparisons
Table II presents a comparison between the standard RISC-V-based SoC VeeR-EL2 and our proposed solution, which adds to the VeeR-EL2 the multiplication previously described for the trinomials $x^{409} + x^{87} + 1$ (NIST, SECG), $x^{233} + x^{74} + 1$ (NIST, SECG), $x^{193} + x^{15} + 1$ (SECG), and $x^{113} + x^{9} + 1$ (SECG). The modified EL2 SoC, together with the C code and all the compilation details, can be found in the repository from [20], which allows readers to replicate all the experiments on the Nexys A7-100T board, which features an Artix-7 FPGA. The architecture has been described using SystemVerilog and functionally verified using Maple as a golden model. As shown in Table II, there is approximately a 45% increase in LUTs and a 28% increase in registers. However, as shown in Table III, the number of clock cycles is reduced by a factor of 389 to 2246, with the best case corresponding to the highest-order trinomial. In other words, the latency is reduced by 99.74% to 99.96% compared to the same SoC executing the same operations with optimized bare-metal code.
To provide a fairer comparison, a figure of merit considering area and timing can be computed. For the worst-case scenario with the trinomial of order 113, an improvement of 389/1.45 = 268 is achieved, while for the best-case scenario with the trinomial of order 409, an improvement of 2246/1.45 = 1549 is obtained. In other words, with less than 50% additional area, the latency for computing multiplications with these trinomials is reduced from 3.3 ms to 8.64 µs in the worst case and from 0.79 s to 14.12 µs in the best case, demonstrating the efficiency of the proposed solution.
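The figure of merit above is simply the cycle-count speedup divided by the area factor. The short Python sketch below reproduces the arithmetic from the numbers quoted in the text (speedups of 389 and 2246, and a 45% LUT increase taken as an area factor of 1.45); the variable names are introduced here for illustration only.

```python
AREA_FACTOR = 1.45                     # about 45% more LUTs than the baseline SoC
SPEEDUP = {113: 389, 409: 2246}        # clock-cycle reduction factor per trinomial order

for order, speedup in SPEEDUP.items():
    figure_of_merit = speedup / AREA_FACTOR
    latency_reduction = 100.0 * (1.0 - 1.0 / speedup)
    print(f"order {order}: figure of merit = {figure_of_merit:.0f}, "
          f"latency reduction = {latency_reduction:.2f}%")
```

The same computation gives the 99.74% and 99.96% latency reductions quoted for the worst and best cases, respectively.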
Compared to previous work by the authors on finite field arithmetic [19] and to the recently proposed RISC-V cryptography extensions [21], the speedup is maintained, because the simplifications introduced in Section III reduce the timing complexity from $m$ to $t$. This is possible thanks to the special properties of the trinomials exploited in this brief, which are not considered in [19] or [21].
Table IV compares the multiplier that has been integrated into the proposal with the solutions available in the state of the art. From an examination of the critical parameters, namely the critical path of each design, the maximum latency in terms of clock cycles, and the total number of logic gates for the most restrictive polynomial (order 409), it can be concluded that only two multipliers in this comparison reduce the number of clock cycles when compared to our proposal. Additionally, one multiplier fails to meet the critical path of the RISC-V core [23]. Furthermore, another multiplier, while achieving a performance improvement, comes at a cost of 620k gates, which is equivalent to 21 times the size of our accelerator [22]. Compared to the efficient solution from [15], our proposal has three times lower latency in terms of clock cycles and a shorter critical path, and the overhead in gates, compared to a processor that integrates the CLMUL module, is only 1.9 times larger, as it only requires 5.1 additional Kgates. Also, the proposal described in this brief is fully modular, so it can be applied to polynomials of different orders without adding extra area, while other solutions require a particular implementation for each polynomial.

V. CONCLUSION
To the best knowledge of the authors, this is the first work that integrates multiplication based on NIST and SECG trinomials into a RISC-V SoC. All previous works have been based on the implementation of full-custom accelerators and co-processors, which are isolated from the pipeline of the processor. This brief introduces a new approach to implement this multiplication that takes into account some mathematical properties. These properties allow hardware resource sharing with other functional units of the processor, a reduction of latency with a moderate increase in area, and generalization to future cases thanks to the structure of the designed architecture and the definition of the ISA.
The lower part of Fig. 2 efficiently computes the product based on irreducible trinomials, completing the computation in $t-1$ clock cycles. The $2m-1$ outputs of the multiplier are stored in $2m-1$ 1-bit registers as follows: $D_L = (d_{m-1}, \ldots, d_0)$ is stored in $m$ registers (at the right in Fig. 2) and $D_H = (d_{2m-2}, \ldots, d_m)$ is stored in the $m-1$ registers of the cyclic shift register given in Fig. 1(b). The addition $D_H + D_L$ is performed in Fig. 2 by the $m-1$ XOR gates (within a dotted box) below the 1-bit registers storing $D_L$. The computation of $D_H\, x^t \bmod f(x)$ is performed, after $t-1$ clock cycles, by the cyclic shift register at the left in Fig. 2.

TABLE II
AREA FOR THE VEER-EL2 AND THE MODIFIED VERSION

TABLE III
TIMING RESULTS FOR THE STANDARD VEER-EL2 AND THE MODIFIED VERSION. BOTH DESIGNS HAVE THE SAME CRITICAL PATH, AS THE OPERATOR IS DEFINED FOR NOT BEING THE LIMITING MODULE

TABLE IV
ASIC SYNTHESIS RESULTS FOR A MULTIPLIER WITH A POLYNOMIAL OF ORDER 409