KINA: Karatsuba Initiated Novel Accelerator for Ring-Binary-LWE (RBLWE)-Based Post-Quantum Cryptography

Along with the National Institute of Standards and Technology (NIST) post-quantum cryptography (PQC) standardization process, lightweight PQC-related research and development have also gained substantial attention from the research community. Ring-binary-learning-with-errors (RBLWE), a ring variant of binary-LWE (BLWE), has been used to build a promising lightweight PQC scheme for emerging Internet-of-Things (IoT) and edge computing applications, namely the RBLWE-based encryption scheme (RBLWE-ENC). The parameter settings of RBLWE-ENC, however, do not favor typical fast algorithms like the number theoretic transform (NTT). Following this direction, in this work, we propose a Karatsuba initiated novel accelerator (KINA) for efficient implementation of RBLWE-ENC. The work proceeds in three interdependent stages: 1) we use the Karatsuba algorithm (KA) to derive the major arithmetic operation of RBLWE-ENC into a new form for high-performance operation; 2) we map the proposed algorithm into an efficient hardware accelerator with the help of a number of optimization techniques; and 3) we provide detailed complexity analysis and implementation comparison to demonstrate the performance of the proposed KINA, e.g., the proposed design with $u=2$ achieves 64.71% higher throughput and 15.37% less area-delay product (ADP) than the state-of-the-art design for $n=512$ (Virtex-7). The proposed KINA offers flexible processing speed and is suitable for high-performance applications like IoT servers. This work is expected to be useful for lightweight PQC development.


B, W: Binary polynomial (algorithm derivation and hardware design).
D, T, Z: Integer polynomial (algorithm derivation and hardware design).

I. INTRODUCTION
It has been proven that current public-key cryptosystems such as Rivest Shamir Adleman (RSA) and elliptic curve cryptography (ECC) can be broken by Shor's algorithm [1], [2] operated on a large-scale quantum computer [1]. As it is predicted that a well-established quantum computer will be available in the not-too-distant future, the research community has already started designing next-generation cryptosystems [3], [4], [5], i.e., post-quantum cryptography (PQC). Indeed, the National Institute of Standards and Technology (NIST) has already initiated the PQC standardization process [5], and lattice-based PQC has been regarded as one of the most important categories of PQC schemes [5], [6].
Many of the lattice-based PQC schemes are based on the learning-with-errors (LWE) problem [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17]. While the ongoing NIST PQC standardization process targets general-purpose applications [5], [6], there is also a need to develop lightweight PQC algorithms. This is also confirmed by the very recent National Science Foundation (NSF) Secure and Trustworthy Cyberspace Principal Investigators' Meeting 2022 (SaTC PI Meeting '22), which identified "lightweight PQC" as one of the future research directions [18]. Fortunately, some preliminary works on lightweight lattice-based PQC have already been carried out. In an earlier article, it is shown that for the LWE problem based on binary errors [19], [20], [22], [23], [24], i.e., binary-LWE (BLWE), the hardness of the lattice problem still remains [19] and can be used to build lightweight PQC. Following this proof, the ring variant of BLWE, RBLWE, is introduced to obtain a smaller computational complexity than the regular Ring-LWE-based PQC [24]. The RBLWE-based encryption scheme (RBLWE-ENC) is based on the average-case hardness of the RBLWE problem, and detailed security analysis has shown that it is secure enough for lightweight applications [24].

A. Existing Works
After the initial introduction, Buchmann et al. [24] reported the software implementation results. On the hardware platform: 1) the first hardware implementation of the RBLWE-ENC was released in [25]; 2) the second hardware design for RBLWE-ENC was reported in [26]; 3) a high-speed hardware structure (but incomplete) was then reported in [27]; 4) a compact RBLWE-ENC hardware architecture was presented in [28]; 5) a lookup table (LUT)-like method-based hardware architecture was reported in [29]; 6) another compact structure for RBLWE-ENC was presented in [30]; 7) a new high-speed hardware RBLWE-ENC was introduced in [31]; 8) a pair of low-speed and high-speed RBLWE-ENC hardware accelerators were presented in [32]; and 9) efficient hardware RBLWE-ENC architectures were also recently reported in [33], [34], and [35], respectively. Meanwhile, there also exist other types of implementations like the fault detection scheme of [36] (based on the high-speed structure in [26]). These reports represent the major works in the field.

B. Challenges
The main operation of RBLWE-ENC is a particular polynomial multiplication over the ring $\mathbb{Z}_q/(x^n + 1)$, where one polynomial involves merely binary values and the other consists of integer coefficients. This particular setup is not desirable for the direct deployment of a fast algorithm (e.g., Karatsuba), as the addition-related iterative operations increase the processing bit-width of the small-size coefficients, which might offset the gain from deploying a fast algorithm. Meanwhile, the parameter settings of the RBLWE-ENC do not favor another widely used fast algorithm, i.e., the number theoretic transform (NTT) [37]. In fact, the existing implementations of RBLWE-ENC are all based on schoolbook polynomial multiplication (complexity of $O(n^2)$, e.g., [25], [26], [27]). For resource-constrained applications, the schoolbook-based method may still be a good choice as it allows the basic point-wise operations to obtain a compact implementation [32], [35]. For high-performance applications like Internet-of-Things (IoT) servers that contain enough resources (e.g., field-programmable gate array (FPGA) devices), however, we prefer to accelerate RBLWE-ENC based on a hardware implementation strategy, as it not only offers high-speed operation but also provides opportunities to be further developed into specific integrated circuits. In this case, the schoolbook-based design strategy can be further improved to obtain better performance, i.e., better area-time complexities.
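To make the schoolbook baseline concrete, the following is a minimal Python sketch of the binary-times-integer multiplication over $\mathbb{Z}_q/(x^n + 1)$. The function name `schoolbook_binary_mul` and the default `q = 256` are illustrative assumptions, not part of any reported design. Because one operand is binary, every point-wise "multiplication" reduces to a conditional addition (or a subtraction when the term wraps past $x^n$), which is why the schoolbook method stays compact.

```python
def schoolbook_binary_mul(d, b, q=256):
    """Multiply integer polynomial d by binary polynomial b in Z_q[x]/(x^n + 1).

    d: list of n integers in [0, q); b: list of n bits.
    Hypothetical helper for illustration only.
    """
    n = len(d)
    t = [0] * n
    for j, bit in enumerate(b):
        if not bit:
            continue  # a zero bit contributes nothing
        for i, coeff in enumerate(d):
            k = i + j
            if k < n:
                t[k] = (t[k] + coeff) % q
            else:
                # x^n = -1, so wrapped terms are subtracted
                t[k - n] = (t[k - n] - coeff) % q
    return t


# d(x) = 1 + 2x + 3x^2 + 4x^3, b(x) = 1 + x^3, n = 4
print(schoolbook_binary_mul([1, 2, 3, 4], [1, 0, 0, 1]))
```

The double loop is the $O(n^2)$ cost mentioned above. A Karatsuba-style split would have to add two binary sub-polynomials, so one operand is no longer binary, which is the bit-width growth this subsection refers to.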

C. Major Contributions
Based on the aforementioned considerations, in this article, we propose a Karatsuba initiated novel accelerator (KINA) for efficient implementation of RBLWE-ENC. We have carried out three steps of efforts to finalize the proposed work (main contributions) as follows.
1) We have used the Karatsuba algorithm (KA) to derive the polynomial multiplication of RBLWE-ENC into a new algorithm for high-speed processing.
2) We have then mapped the proposed polynomial multiplication algorithm into a new RBLWE-ENC hardware accelerator (KINA) with the help of a number of optimization techniques.
3) We have conducted thorough complexity analysis and comparison to confirm the efficiency of the proposed RBLWE-ENC accelerator (KINA).
Note that though KA-based polynomial multiplication is a standard technique for LWE-based schemes, how to efficiently employ this technique to obtain high-speed processing of RBLWE-ENC has not been explored in the literature. To the authors' best knowledge, the proposed KINA is the first report of a KA-based RBLWE-ENC accelerator with flexible processing speed for different high-performance applications.
The rest of the article is organized as follows. Section II gives brief preliminaries. The proposed algorithm is presented in Section III. The hardware accelerator is introduced in Section IV. The complexity analysis and comparison are presented in Section V. Related works and future research are described in Section VI. The conclusion is given in Section VII.
1) Key Generation: The key generation is based on $p = r_1 - a \cdot r_2$, where $p$ is the public key that will be sent to Bob ($r_1$ will then be discarded). In this phase, the secret and public keys have $n$ and $n\log_2 q$ bits, respectively.
2) Encryption: The binary message polynomial $m$ is first encoded into $\tilde{m}$ based on (1). Then, three binary polynomials (errors) $e_1$, $e_2$, and $e_3$ are used to produce the ciphertext $c_1$ and $c_2$ for Alice (the length of the ciphertext is $2n\log_2 q$ bits).
3) Decryption: In this phase, Alice uses the secret key $r_2$ to recover the encoded form of the original message $m$.
Of course, a threshold decoder function [24] is then employed to generate the final output: the output is "1" if the coefficient of the obtained value lies in the range $(q/4, 3q/4)$; otherwise, the output is "0". The recent report of [26] proposed an inverted RBLWE-based scheme, i.e., the coefficients of the polynomials are represented in the inverted range $(-\lfloor q/2 \rfloor, \lfloor q/2 \rfloor - 1)$ such that all the modular operations can be performed naturally in two's complement form. The three phases of Fig. 1 remain the same under this strategy, except for the encoding $(m_0, \ldots, m_{n-1}) \rightarrow \sum_{i=0}^{n-1} m_i(-(q/2))x^i$ and the final decode function (the opposite of the original one). In this article, we also adopt this strategy.
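As an illustration of the encoding and threshold decoding described above, the sketch below uses the original (non-inverted) convention: each message bit is scaled by $q/2$, and a coefficient decodes to "1" when it lies in $(q/4, 3q/4)$. The function names and the small noise values are hypothetical and are chosen only to show that moderate noise does not change the decoded bits.

```python
def encode(bits, q=256):
    """Map each message bit m_i to m_i * (q/2), original (non-inverted) form."""
    return [bit * (q // 2) for bit in bits]


def threshold_decode(coeffs, q=256):
    """Return 1 for coefficients strictly inside (q/4, 3q/4), otherwise 0."""
    return [1 if q / 4 < (c % q) < 3 * q / 4 else 0 for c in coeffs]


message = [1, 0, 1, 1]
noise = [5, -7, 20, -30]          # illustrative noise values, not measured data
noisy = [(c + e) % 256 for c, e in zip(encode(message), noise)]
print(threshold_decode(noisy))    # recovers the message bits
```

Under the inverted strategy of [26], the scaling factor becomes $-(q/2)$ and the decision of the decoder is flipped accordingly; the structure of the two functions stays the same.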
Security Level and Parameter Sets: BLWE with a restricted number of samples retains the worst case hardness of the LWE problem [19], while RBLWE-ENC is based on the average-case hardness of RBLWE.A relatively recent security analysis has estimated that RBLWE-ENC achieves 73/84 and 140/190 quantum/classic security bits for the parameter settings of (n, q) = (256, 256) and (n, q) = (512, 256), respectively [21], [22].In this article, we follow the existing reports [22], [24] to use these parameter sets for possible lightweight applications.

B. KA: Karatsuba Algorithm (Binary Field)
The typical two-term KA-like method over the binary field is as follows [38], [39], [40], [41]. The two operands $A'$ and $B'$ are each decomposed into two terms, and $C'$ is then defined as the product of $A'$ and $B'$, reduced by the field polynomial, as given in (3). Equation (3) can be iteratively applied to the polynomial multiplication to obtain subquadratic complexity.
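For readers less familiar with KA, the following minimal Python sketch shows one level of two-term Karatsuba for binary-field polynomials stored as integer bitmasks. It assumes the common low/high split, which may differ in form from (3), and it returns the unreduced product (the reduction by the field polynomial is omitted). The names `clmul` and `karatsuba_two_term` are illustrative.

```python
def clmul(a: int, b: int) -> int:
    """Schoolbook carry-less product of two GF(2)[x] polynomials (bitmasks)."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2) is XOR
        a <<= 1
        b >>= 1
    return result


def karatsuba_two_term(a: int, b: int, m: int) -> int:
    """One Karatsuba level for operands of fewer than m bits (m even).

    Uses three half-size products instead of four.
    """
    half = m // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    low = clmul(a_lo, b_lo)
    high = clmul(a_hi, b_hi)
    # (a_lo + a_hi)(b_lo + b_hi) + low + high gives the cross terms
    mid = clmul(a_lo ^ a_hi, b_lo ^ b_hi) ^ low ^ high
    return low ^ (mid << half) ^ (high << m)


# The Karatsuba result matches the schoolbook carry-less product
print(karatsuba_two_term(0b1011, 0b1101, 4) == clmul(0b1011, 0b1101))
```

Applying `karatsuba_two_term` recursively to the three half-size products is what yields the subquadratic complexity mentioned above.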

A. Major Challenges
The iterative deployment of KA on the polynomial multiplication involves parallel computation and thus is not ideal for hardware implementation (the resource usage will be too large). Meanwhile, as mentioned in the Challenges of Section I, the enlarged processing bit-width of the small-size coefficients brings extra overhead even if we apply only a very small number of iterations.

B. Overall Principle
We thus decided to use only the two-term decomposition of (3) so that the small-size coefficient-related computation will not cause large overhead.Still, the direct mapping of the two-term KA into hardware will incur large resource usage.Therefore, we propose to: 1) process all the input-output in a serial-in and serial-out format (practical for deploying in actual applications) and 2) compute the major arithmetic procedure in an accumulation format to save the resource usage (hardware implementation friendly).The steps below have strictly followed this principle.

C. Extension of KA to RBLWE-ENC
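It is obvious that the main operation of the RBLWE-ENC is the polynomial multiplication, which can be defined as

$$T = D \cdot B \bmod (x^n + 1) \tag{4}$$

where $D = \sum_{i=0}^{n-1} d_i x^i$, $B = \sum_{i=0}^{n-1} b_i x^i$, and $T = \sum_{i=0}^{n-1} t_i x^i$, for $d_i$ and $t_i$ integers in $\mathbb{Z}_q$ and $b_i \in \{0, 1\}$. Then, we have [following (3)]

$$D = D_L + xD_H, \quad B = B_L + xB_H \tag{5}$$

where $D_L = \sum_{i=0}^{n/2-1} d_{2i} x^{2i}$, $D_H = \sum_{i=0}^{n/2-1} d_{2i+1} x^{2i}$, $B_L = \sum_{i=0}^{n/2-1} b_{2i} x^{2i}$, and $B_H = \sum_{i=0}^{n/2-1} b_{2i+1} x^{2i}$. Then, we can have

$$T = (1 - x)T_L + (x^2 - x)T_H + xT_M \tag{7}$$

where $T_L = D_L B_L$, $T_H = D_H B_H$, and $T_M = (D_L + D_H)(B_L + B_H) = D_M B_M$, i.e., the original $n$-size polynomial multiplication becomes the addition of three $(n/2)$-size subpolynomial multiplications.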
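The even/odd two-term decomposition can be checked numerically. The sketch below, with illustrative names and test values, splits $D$ and $B$ into even-indexed ($L$) and odd-indexed ($H$) coefficients, forms $T_L = D_L B_L$, $T_H = D_H B_H$, and $T_M = (D_L + D_H)(B_L + B_H)$, and recombines them as $(1 - x)T_L + (x^2 - x)T_H + xT_M$ before any reduction by $x^n + 1$. It is a sanity check of the identity under these assumptions, not the hardware algorithm itself.

```python
def poly_mul(x, y):
    """Plain (unreduced) product of two coefficient lists over the integers."""
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def ka_even_odd(d, b):
    """Unreduced product D*B from three half-size products (n even)."""
    n = len(d)
    d_l, d_h = d[0::2], d[1::2]          # even / odd coefficients of D
    b_l, b_h = b[0::2], b[1::2]          # even / odd coefficients of B
    d_m = [p + r for p, r in zip(d_l, d_h)]
    b_m = [p + r for p, r in zip(b_l, b_h)]   # values in {0, 1, 2}
    t_l, t_h, t_m = poly_mul(d_l, b_l), poly_mul(d_h, b_h), poly_mul(d_m, b_m)
    out = [0] * (2 * n - 1)
    for i in range(n - 1):               # each sub-product has n - 1 terms in x^2
        out[2 * i] += t_l[i]                          # T_L
        out[2 * i + 1] += t_m[i] - t_l[i] - t_h[i]    # x (T_M - T_L - T_H)
        out[2 * i + 2] += t_h[i]                      # x^2 T_H
    return out


d = [17, 250, 3, 99, 128, 7, 64, 201]    # illustrative integer coefficients
b = [1, 0, 1, 1, 0, 1, 0, 0]             # illustrative binary coefficients
print(ka_even_odd(d, b) == poly_mul(d, b))
```

Note that `b_m` takes values in $\{0, 1, 2\}$, so the $T_M$ path needs a 2-bit operand, which is the bit-width growth discussed in the challenges.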

D. Proposed Algorithmic Derivation
Define $T_L = D_L B_L$, where $D_L$ and $B_L$ follow the decomposition of (5). This product can be rewritten and then expressed as an equivalent matrix-vector product (10) between the matrix $[B_L]$ and the vector $[D_L]$, which can be seen as an $(n-1) \times (n/2)$ circulant matrix-vector product (the blank parts of $[B_L]$ are actually "0"s). For instance, the second column of $[B_L]$ (from left, $(n/2)$ nonzero elements) is the circularly shifted version of the first column (circularly downward by one position), and so are the rest of the columns.
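The matrix-vector view can be sketched in a few lines of Python. The helper `shifted_matrix` below is a hypothetical name that builds the $(n-1) \times (n/2)$ matrix whose columns are successively downward-shifted copies of the coefficients of $B_L$, with zeros in the blank positions; multiplying it by the coefficient vector of $D_L$ gives the $n-1$ coefficients of $T_L$. The example values are illustrative.

```python
def shifted_matrix(b_l):
    """Build the (n-1) x (n/2) matrix [B_L] from the n/2 coefficients of B_L."""
    half = len(b_l)
    rows = 2 * half - 1                      # n - 1 rows
    mat = [[0] * half for _ in range(rows)]
    for col in range(half):
        for k, coeff in enumerate(b_l):
            mat[col + k][col] = coeff        # column col is shifted down by col
    return mat


def mat_vec(mat, vec, q=256):
    """Matrix-vector product with coefficients reduced modulo q."""
    return [sum(m * v for m, v in zip(row, vec)) % q for row in mat]


b_l = [1, 0, 1, 1]        # illustrative even-indexed coefficients of B (n = 8)
d_l = [3, 5, 7, 250]      # illustrative even-indexed coefficients of D
for row in shifted_matrix(b_l):
    print(row)
print(mat_vec(shifted_matrix(b_l), d_l))
```

Since every column holds the same $n/2$ coefficients, only their position changes from column to column, which is what later allows a single shift-register to supply all columns.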

2) Computation Strategy for $[T_L]$:
The direct implementation of (10), however, will involve too much resource usage, and hence we propose to compute the matrix-vector product of (10) as column-wise accumulations, i.e., the elements of one column of $[B_L]$ are multiplied with the corresponding element of the vector-matrix $[D_L]$ and then accumulated with the result of the same operation on the next column (and so on). As the nonzero elements in each column of $[B_L]$ are identical, we can share these coefficients during the accumulation process while the elements of $[D_L]$ are fed in a serial format. Note that we can also process multiple columns of $[B_L]$ at the same time (with the related elements of $[D_L]$) for higher speed applications.
There are in total $(n/2)$ columns in the matrix $[B_L]$. We can thus index the columns of $[B_L]$ from left to right and index the elements of the vector-matrix $[D_L]$ accordingly, such that (10) becomes the form of column-wise accumulations in (11). Define $(n/2) = uv$ ($u$ and $v$ are integers). Then, (11) becomes (12), which has $v$ groups of accumulation, where each group has $u$ items of the form $[B_L]_{ju+i}[D_L]_{1, ju+i}$. These $u$ computations can be executed at the same time to speed up the overall processing.
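The column-wise accumulation with $(n/2) = uv$ can be sketched as follows. The function `column_accumulate` is a hypothetical software model: the outer loop runs over the $v$ groups (one modeled clock cycle each) and the inner loop covers the $u$ columns that the hardware would process at the same time. Zero-based column indexing is an assumption of this sketch.

```python
def column_accumulate(b_l, d_l, u, q=256):
    """Accumulate [B_L] columns times [D_L] elements in v groups of u columns.

    Returns the n-1 accumulated coefficients and the number of modeled cycles.
    """
    half = len(b_l)
    if half % u != 0:
        raise ValueError("u must divide n/2")
    v = half // u
    acc = [0] * (2 * half - 1)
    cycles = 0
    for j in range(v):
        for i in range(u):               # u columns handled in the same cycle
            col = j * u + i
            for k, bit in enumerate(b_l):
                if bit:                  # shared binary coefficient gates d
                    acc[col + k] = (acc[col + k] + d_l[col]) % q
        cycles += 1
    return acc, cycles


b_l = [1, 0, 1, 1]
d_l = [3, 5, 7, 250]
for u in (1, 2, 4):
    print(u, column_accumulate(b_l, d_l, u))
```

The accumulated values do not depend on $u$; only the modeled cycle count changes, from $n/2$ for $u = 1$ down to $n/(2u)$ in general.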
3) Computation for $[T_H]$ and $[T_M]$: Similarly, $[T_H]$ can be expressed in the matrix-vector form of (13), and one can easily follow the same strategy for $[T_M]$.
4) Final Recombination: As seen from (7), each term of the KA-deployed polynomial multiplication has factors such as $(1-x)$, $x$, and $(x^2-x)$, which involve position-shifting-based additions and hence are not hardware implementation friendly (if we calculate all related coefficients at the same time). To save resource usage, we can arrange the coefficients of the final output $T$ to be delivered out in a serial format.
Specifically, we can define $T_L = \sum_{i=0}^{n-2} t_{L,i} x^{2i}$, $T_H = \sum_{i=0}^{n-2} t_{H,i} x^{2i}$, and $T_M = \sum_{i=0}^{n-2} t_{M,i} x^{2i}$. Connecting with (7), we can have (14), which can be substituted with $x^n \equiv -1$ to have (15), where it is found that the final output coefficients $t_i$ are just the addition of the corresponding values of $t_{L,i}$ and/or $t_{H,i}$ and/or $t_{M,i}$. The actual computation process can be seen in the corresponding hardware component description in Section IV.
5) Final Algorithm Formulation: Based on the above mathematical derivation of (4)-(15), we can have the proposed KA-based polynomial multiplication algorithm as follows.
Fig. 2. Proposed RBLWE-ENC accelerator (KINA, $u = 1$), where $Z$ and $W$ denote the additional two polynomials ($Z$ is an integer polynomial and $W$ is a binary polynomial). CSR: circular shift-register. A constant error (Const. err.), delivered from a counter, is also needed when executing the addition with the integer polynomial $Z$, following the suggestion of [34].

Algorithm 1 Algorithm of KA-Based Polynomial Multiplication for RBLWE-ENC
Here, $T_L$, $T_H$, and $T_M$ are processed in parallel to obtain high-performance computation. Note that for RBLWE-ENC, we also need to consider the operations related to other polynomials, like the additions with an integer polynomial as well as the decoder function in the decryption phase. The following section covers the details of the hardware accelerator.
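The final recombination of Section III can be illustrated with a short software model. Writing $T = T_L + x(T_M - T_L - T_H) + x^2 T_H$ and substituting $x^n \equiv -1$, each output coefficient is a signed sum of a few entries of $t_{L,i}$, $t_{H,i}$, and $t_{M,i}$, so the outputs can be produced one per step. The names below are hypothetical, and the index formulas are derived in this sketch from the even/odd decomposition rather than copied from (15).

```python
def poly_mul(x, y):
    out = [0] * (len(x) + len(y) - 1)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            out[i + j] += xi * yj
    return out


def partial_products(d, b):
    """Return (t_l, t_h, t_m), each with n - 1 coefficients."""
    d_l, d_h, b_l, b_h = d[0::2], d[1::2], b[0::2], b[1::2]
    d_m = [p + r for p, r in zip(d_l, d_h)]
    b_m = [p + r for p, r in zip(b_l, b_h)]
    return poly_mul(d_l, b_l), poly_mul(d_h, b_h), poly_mul(d_m, b_m)


def serial_output(t_l, t_h, t_m, n, q=256):
    """Yield t_0, ..., t_{n-1} one at a time (serial-out model)."""
    half = n // 2

    def at(seq, idx):
        # Entries outside 0..n-2 do not exist and count as zero.
        return seq[idx] if 0 <= idx < len(seq) else 0

    def cross(idx):
        return at(t_m, idx) - at(t_l, idx) - at(t_h, idx)

    for k in range(half):
        # Even position x^(2k): T_L and x^2 T_H terms minus the wrapped ones.
        yield (at(t_l, k) - at(t_l, k + half)
               + at(t_h, k - 1) - at(t_h, k - 1 + half)) % q
        # Odd position x^(2k+1): cross term minus its wrapped counterpart.
        yield (cross(k) - cross(k + half)) % q


def negacyclic_schoolbook(d, b, q=256):
    """Reference product in Z_q[x]/(x^n + 1) for comparison."""
    n = len(d)
    full = poly_mul(d, b) + [0]
    return [(full[k] - full[k + n]) % q for k in range(n)]


d = [17, 250, 3, 99, 128, 7, 64, 201]    # illustrative values
b = [1, 0, 1, 1, 0, 1, 0, 0]
t_l, t_h, t_m = partial_products(d, b)
print(list(serial_output(t_l, t_h, t_m, len(d))) == negacyclic_schoolbook(d, b))
```

For $k = 0$ the even-position expression reduces to $t_{L,0} - t_{L,(n/2)} - t_{H,(n/2)-1}$, which agrees with the worked example given for the linear combination component in Section IV.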

IV. KINA: PROPOSED RBLWE-BASED PQC ACCELERATOR
Following Algorithm 1 of Section III, we can have the proposed RBLWE-ENC accelerator (KINA) as described below. As shown in Fig. 1, the major arithmetic operation involved within RBLWE-ENC includes a polynomial multiplication followed by the addition with two polynomials (e.g., ciphertext $c_2$ in the encryption phase), which can be extended to the operations in other phases (the subtraction in the key generation can be easily realized in hardware). Thus, this major arithmetic operation, one polynomial multiplication along with the additions with two other polynomials, is used to construct the proposed accelerator. Meanwhile, we also present a higher speed version of KINA (when $u > 1$). Finally, we propose several optimization techniques to further maximize the design efficiency.

A. KINA: Proposed RBLWE-ENC Accelerator
1) Architectural Overview: As shown in Fig. 2, the proposed RBLWE-ENC accelerator consists of three major components, namely the input processing component, the multiplication component, and the linear combination component. During the actual execution process, e.g., the encryption phase or decryption phase, inputs $B$ and $D$ are first decomposed based on (5) and then loaded into the corresponding shift-registers in the input processing component, i.e., $D_L$/$D_H$ and $B_L$/$B_H$. Besides, two adders are needed to produce the corresponding $B_M$ and $D_M$. After that, three computational units in the multiplication component, $T_L$, $T_M$, and $T_H$, take the processed coefficients as input and execute point-wise multiplications of the related coefficients as well as the accumulation of the matrix-vector products. When this multiplication step has been executed, the three results are delivered to the linear combination component to produce the final results according to Line-11 of Algorithm 1 in Section III (see (15) as well). The final result is delivered out in a serial format until the whole computation process is finished. Note that the output for the encryption phase is 8 bit, while the output for decryption is 1 bit. The detailed internal structures and related functions of these components in Fig. 2 are described below.
2) Input Processing Component: The input processing component is responsible for loading and delivering the decomposed coefficients of $B_L$, $B_H$, $D_L$, and $D_H$, as well as producing and delivering the correct coefficients of $B_M$ and $D_M$ to the following multiplication component. Overall, the input processing component consists of three circular shift-registers [CSRs, length of $(n-1)$] and two adders.
3) Multiplication Component: The multiplication component consists of three parallel processing units for $T_H$, $T_L$, and $T_M$, respectively. The internal structures of the processing units for $T_H$/$T_L$ and $T_M$ are shown in Fig. 4(a) and (b), respectively. As the computation processes of $T_L$ and $T_H$ are very similar [see (10) and (13)], we just use one set of the internal structure to illustrate the detailed design. Basically, the $(n-1)$ bits from the related CSRs are fed to $(n-1)$ sets of a point-wise multiplier followed by an accumulator, where the point-wise multiplier is realized by an 8-bit AND cell [Fig. 4(a)] and the accumulator contains an adder followed by a register (the output of the register is also used as another input of the adder to form the loop). Thus, after $(n/2)$ cycles of accumulation, the output of the register becomes $t_{L,i}$ or $t_{H,i}$ ($0 \le i \le n-2$). Note that we have also inserted a 2-to-1 MUX in the middle of the accumulator such that these final output values of $t_{L,i}$ or $t_{H,i}$ ($0 \le i \le n-2$) can be circularly shifted, which facilitates the delivery of the final output in a serial format (see the linear combination component). The internal structure of the processing unit of $T_M$ is almost the same as that of $T_L$ and $T_H$, except that one input to the point-wise multiplier has now become 2 bit, which can be realized by a MUX-based design as shown in Fig. 4(c). Three precalculated values (based on the input $d_{M,i}$, where $d_{M,i}$ represents the corresponding coefficient of $D_M$), i.e., "0," "$d_{M,i}$," and "$2d_{M,i}$," are attached to the MUX, and the result is determined according to the 2-bit $b_{M,i}$ ($b_{M,i}$ is the corresponding coefficient of $B_M$). After being accumulated, the output of the $T_M$ unit is delivered to the linear combination component together with the outcomes of the other two units for the calculation of the final result $T$. Note that the connected $(n-1)$ registers in the accumulators, through the insertion of MUXes, function as a shift-register to provide correct input signals for the following linear combination component to produce the correct output in a serial format.
Case Example: For a clear demonstration, we use a case example of $n = 8$. Connecting with (10), we have

$$\begin{bmatrix} t_{L,0} \\ t_{L,1} \\ t_{L,2} \\ t_{L,3} \\ t_{L,4} \\ t_{L,5} \\ t_{L,6} \end{bmatrix} = \begin{bmatrix} b_0 & & & \\ b_2 & b_0 & & \\ b_4 & b_2 & b_0 & \\ b_6 & b_4 & b_2 & b_0 \\ & b_6 & b_4 & b_2 \\ & & b_6 & b_4 \\ & & & b_6 \end{bmatrix} \begin{bmatrix} d_0 \\ d_2 \\ d_4 \\ d_6 \end{bmatrix} \tag{16}$$

Meanwhile, the output of the CSR for $T_L$ [length $(n-1) = 7$] is "$b_0, b_2, b_4, b_6, 0, 0, 0$," which becomes "$0, b_0, b_2, b_4, b_6, 0, 0$" in the next cycle, "$0, 0, b_0, b_2, b_4, b_6, 0$" in the third cycle, and finally "$0, 0, 0, b_0, b_2, b_4, b_6$" in the last cycle. This process exactly matches the column-based accumulations of (10). Since each column of the matrix in (16) takes one cycle, the entire multiplication needs four cycles. Based on this case example, we can conclude that a total of $(n/2)$ cycles are needed for the multiplication component.
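The two kinds of point-wise multiplier can be modeled in a few lines. The sketch below assumes $q = 256$ (8-bit coefficients, so wrap-around is a mask with `0xFF`); the names `and_cell` and `mux_cell` are hypothetical and only mirror the behavior described above: a 1-bit operand gates the 8-bit coefficient, while a 2-bit operand selects among "0," "$d_{M,i}$," and "$2d_{M,i}$."

```python
Q_MASK = 0xFF  # assumes q = 256, i.e., 8-bit coefficients


def and_cell(b_bit: int, d: int) -> int:
    """Point-wise multiplier for T_L / T_H: a 1-bit b gates the 8-bit d."""
    return d & Q_MASK if b_bit else 0


def mux_cell(b_m: int, d_m: int) -> int:
    """Point-wise multiplier for T_M: b_m in {0, 1, 2} selects 0, d_m, or 2*d_m."""
    candidates = (0, d_m & Q_MASK, (2 * d_m) & Q_MASK)
    return candidates[b_m]


print(and_cell(1, 77), and_cell(0, 77))
print(mux_cell(0, 200), mux_cell(1, 200), mux_cell(2, 200))
```

Since $b_{M,i} = b_{L,i} + b_{H,i}$ never reaches 3, a three-input selection is sufficient, and no general multiplier is needed on the $T_M$ path.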
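The $n = 8$ case can also be traced in software. The sketch below is a hypothetical model (not the reported RTL) of a length-$(n-1)$ circular shift-register feeding $(n-1)$ gated accumulators: in every cycle one coefficient of $D_L$ is broadcast, added wherever the register holds a "1," and the register then rotates by one position. The numeric inputs are illustrative.

```python
from collections import deque


def csr_multiply(b_l, d_l, q=256):
    """Model the T_L unit: returns the accumulators and the CSR content per cycle."""
    half = len(b_l)
    csr = deque(list(b_l) + [0] * (half - 1))    # length n - 1
    acc = [0] * len(csr)
    snapshots = []
    for cycle in range(half):                    # n/2 cycles in total
        snapshots.append(list(csr))
        d = d_l[cycle]                           # serially fed coefficient of D_L
        for k, bit in enumerate(csr):
            if bit:                              # AND cell followed by accumulator
                acc[k] = (acc[k] + d) % q
        csr.rotate(1)                            # shift by one position
    return acc, snapshots


acc, snapshots = csr_multiply([1, 1, 0, 1], [10, 20, 30, 40])
for cycle, content in enumerate(snapshots):
    print(cycle, content)
print(acc)
```

With $b_0, b_2, b_4, b_6 = 1, 1, 0, 1$, the printed register contents follow the same shifting pattern as the case example, and the four cycles reproduce the seven coefficients of $T_L$.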

4) Linear Combination Component:
The linear combination component is responsible for the calculation of the final result by combining the outputs of the three units in the multiplication component using linear operations (addition/subtraction) according to (14) and (15). As shown in Fig. 5, a set of sign inverters (SIs) and adders are used in this component to obtain the correct output. During the linear combination process at each cycle, the corresponding values stored in the accumulators of the multiplication component are fed into the SIs and adders, respectively, according to the setup of Fig. 5.
Example of the computation inside the linear combination component: As seen from (15), e.g., the first term associated with $x^0 = 1$ is $(t_{L,0} - t_{L,(n/2)} - t_{H,(n/2)-1})$, which can be obtained through the SI and adders attached to $t_{L,0}$, $t_{L,(n/2)}$, and $t_{H,(n/2)-1}$ as well as the related MUXes to deliver the desired coefficient, added with the corresponding coefficients of ($Z$ + Cons. Err.) and $W$, to produce the final output $t_0$. Similarly, the other output values of $T$ can be obtained through the coordination of SIs, adders, and MUXes.
Fig. 5. Linear combination component, where SI denotes the sign inverter. Cons. Err.: constant error. The green and blue signals are connected with the corresponding registers in Fig. 4, while the red signals are control signals, e.g., "odd_sel" = "1" means that the MUX works in the lower channel so that an odd-order output coefficient is produced (similarly for the "cnt" control signals attached to the MUXes, e.g., "cnt = N − 1" means that the MUX works in the lower channel when this counting signal is "1").
5) Control Unit: Finally, a control unit is also required to coordinate the proper operation of the RBLWE-ENC accelerator. Specifically, this control unit is based on a finite state machine (FSM) with five operational states, namely "clear/reset," "load," "compute," "output," and "done." During the "clear/reset" state, the signals and registers in the accelerator are cleared to prepare for the execution of a new task. Then, during the "load" state, all the signals are loaded into the respective CSRs for the following "compute" state. The "compute" state executes the point-wise multiplication-based accumulation to obtain the desired results for $T_L$, $T_H$, and $T_M$. The next state is the "output" period, which executes the linear combination operation to obtain the desired output in a serial format following (15). After all the output coefficients $t_i$ ($0 \le i \le n-1$) are delivered, the accelerator releases the final "done" signal.
6) Overall Operation: After the proposed KINA loads the values of $B_L$ and $B_H$ into the CSRs ($(n/2)$ cycles), the three units for $T_L$, $T_M$, and $T_H$ need $(n/2)$ cycles to produce the correct output values (or decryption output), which are sent to the linear combination component for final serial output ($n$ cycles). Overall, the computation time of the multiplication component can be viewed as the major latency of KINA.
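The state sequence of the control unit can be summarized with a small software model. The sketch below is a hypothetical illustration for the basic version ($u = 1$): the state names follow the text, the cycle counts for "load," "compute," and "output" follow the overall operation described in this section, and the single "clear/reset" cycle is an assumption of this sketch.

```python
from enum import Enum


class State(Enum):
    CLEAR = "clear/reset"
    LOAD = "load"
    COMPUTE = "compute"
    OUTPUT = "output"
    DONE = "done"


def run_controller(n):
    """Return the per-cycle state trace of the modeled FSM for u = 1."""
    durations = [
        (State.CLEAR, 1),         # assumed single reset cycle
        (State.LOAD, n // 2),     # load B_L / B_H into the CSRs
        (State.COMPUTE, n // 2),  # column-wise accumulation
        (State.OUTPUT, n),        # serial delivery of t_0 ... t_{n-1}
    ]
    trace = []
    for state, cycles in durations:
        trace.extend([state] * cycles)
    trace.append(State.DONE)
    return trace


trace = run_controller(8)
print([state.value for state in trace])
```

In an HDL description, the same behavior would typically be expressed with a state register plus a cycle counter that triggers each transition.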

B. KINA: Higher Speed Version
The RBLWE-ENC accelerator of Fig. 2 processes one column ($u = 1$) per cycle. For higher speed applications, however, we can increase the number of column-based accumulations processed at the same time, i.e., parallel processing based on a larger $u$, to obtain a higher speed version of KINA.
The overall data flow of the proposed higher speed RBLWE-based accelerator is very similar to that of the basic version shown in Fig. 2. The coefficients of $B$ and $D$ are each grouped into three groups and are then fed into different multiplication units; the products go through the accumulation and permutation before they are sent to the linear combination component for the final calculation. The output size is also … The modification made in the input processing component is that the shift-registers for $D$ are changed to $u$-parallel output instead of the original serial-out. This is because we need $u$ of the coefficients delivered out at the same time for the parallel calculation. Also, the shift-registers do not shift one position per cycle; they shift $u$ positions every cycle, which is implemented by connecting each register to the $u$th one in front of its position in a circular format (the internal structure is shown in … 7). Each layer of point-wise multipliers takes $(n-1)$ coefficients of $B$ and one coefficient of $D$ as its inputs and calculates the products. A single adder tree takes $u$ products as its inputs and calculates their summation as its output. After the multiplication and permutation are finished, the results are delivered to the accumulators for the final linear combination. One unit of the multiplication component consists of $u \times (n-1)$ point-wise multipliers, $u$ adder trees, and $(n + u - 1)$ accumulators. Meanwhile, each adder tree contains $(n-2)$ adders.
Overall Operation of the Proposed Higher Speed KINA: After all the values of $B_L$ and $B_H$ are loaded into the CSRs [$(n/2)$ cycles], the higher speed KINA needs only $n/(2u)$ cycles to compute $T_L$, $T_M$, and $T_H$, at the cost of increased hardware usage in the multiplication component. The final output still needs $n$ cycles (for the linear combination component). Thus, the major computation time of the higher speed KINA is $n/(2u)$ cycles.
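The cycle counts stated above can be tabulated with a short helper. The function `kina_cycles` is a hypothetical name; it only restates the formulas in the text ($(n/2)$ load cycles, $n/(2u)$ compute cycles, and $n$ output cycles) and assumes that $u$ divides $n/2$.

```python
def kina_cycles(n, u):
    """Cycle counts per phase for polynomial size n and parallelism u."""
    if (n // 2) % u != 0:
        raise ValueError("u must divide n/2")
    return {"load": n // 2, "compute": n // (2 * u), "output": n}


for n in (256, 512):
    for u in (1, 2, 4, 8, 16):
        print(n, u, kina_cycles(n, u))
```

Only the "compute" entry shrinks as $u$ grows, which is why the text treats the multiplication time as the major latency when comparing configurations.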

A. Complexity Analysis
The area-time complexities of the proposed KINA (Fig. 2) are listed as follows.
1) The input processing component requires three CSRs …
The area-time complexities of the proposed designs (both the basic and higher speed versions), in terms of the number of AND gates, XOR gates, adders, MUXes, and latency cycles, are listed in Table I along with those of the existing high-speed designs of [25], [26], [31], [32], [33], and [34]. Note that we have listed the major computation time (clock cycles) as the latency for all the designs, following what we have discussed in Section IV (which was also reported in these existing designs).
Note that we do not include the following designs since: 1) the design of [29] is a special design based on LUT-like method; 2) the structures of [28] and [30] belong to compact designs (similar to [43], which is a compact implementation of an approximate Ring-LWE based scheme); and 3) the designs in [31] and Architecture-II of [33] did not consider the input-output processing resources in structural design.
One can see from Table I that the proposed designs overall have relatively larger area complexities than the existing ones, since the proposed Karatsuba-based design strategy processes three components in parallel. Considering the overall area-time complexities, however, all the existing designs have a complexity of $O(n^2)$, while the proposed one achieves $O(3n^2/4)$. Finally, note that the listed complexities are based on theoretical estimation, and the corresponding implementation reflects a more precise result.

B. FPGA Implementation Results and Related Comparison
1) Experimental Setup: While the comparison of area-time complexities listed in Table I is more on the theoretical side, there is a need for a more detailed comparison. Thus, we have implemented the proposed accelerator on the FPGA platform, and the experiment is set up as follows: 1) we have coded the proposed RBLWE-ENC accelerator (KINA) in VHDL and have verified its functionality through ModelSim; 2) we have followed the existing strategies [25], [26], [31], [32], [33] to synthesize and implement the coded design on Xilinx FPGAs (after place and route), i.e., Virtex-7 XC7V2000t and Kintex-7 XC7K325t devices, respectively, through Vivado 2020.2; 3) we have chosen the same parameter settings as the existing designs of [25], [26], [31], [32], and [33], i.e., $(n, q) = (256, 256)$ and $(n, q) = (512, 256)$ ($\log_2 q = 8$), which correspond to the quantum/classic security of 73/84 bits and 140/190 bits, respectively [22]; 4) the proposed accelerator also includes the third and fourth polynomials $Z$ and $W$ for operations of both encryption and decryption phases as well as related resources; 5) for a more general demonstration, we do not use the other available resources on the FPGA devices such as the block RAM (BRAM), etc.; 6) we have chosen $u = 1$, $u = 2$, $u = 4$, $u = 8$, and $u = 16$ for the proposed KINA, respectively, to showcase the high-speed operational performance under different processing setups; and 7) the obtained area-time complexities, in terms of area usage (the number of LUTs, registers (FFs), and slices), maximum frequency (Fmax, MHz), latency cycles, delay (critical-path × latency cycles), area-delay product (ADP), and throughput, are all listed in Table II.
2) Discussion: From Table II, we can clearly see that the area consumption of the proposed KINA increases as u becomes bigger.This is because the number of adders and point-wise multipliers in the KINA is positively proportional to u, i.e., the number of involved adders and point-wise multipliers increases to execute the parallel calculation and permutation in the multiplication component.However, the latency drops significantly as u increases as more columns of and [D M ]) can be involved in the calculation at the same time, and thus the number of cycles required for the multiplication process decreases rapidly.Finally, one can also find that the higher speed version of the proposed KINA has relatively better performance than the basic one.Meanwhile, in terms of the overall area-time complexities, the proposed design with u = 2 obtains the best ADP among all the cases.
3) Comparison With the Existing RBLWE-ENC Implementations: The comparison with the state-of-the-art designs (i.e., [25], [26], [31], [32], [33]) is given in Table III. We have carefully considered the comparison setup, as described below. First, as the designs of [29] and [32] are reported on the Intel Stratix-V device, we also obtained the performance of the proposed KINA (u = 2) on the same Stratix-V device, as listed in Table IV. Second, some of the existing designs do not include the input processing component in the implementation, and hence we need careful consideration when selecting the proper competing designs. For instance, the designs of [31] do not include the input processing component (so the reported area is smaller than the actual value, similar to Arch.-II of [33]); as the designs of [33] have also shown their efficiency over the ones in [31], we just list the result of [31] for discussion. Note that Arch.-I of [33] is listed for actual comparison due to its complete setup in this aspect (see Fig. 4 of [33]).
4) Comparison Discussion: Compared with the existing designs, the area consumption of KINA is relatively high, but the latency of the proposed design with different choices of u is much lower, e.g., the design with u = 2 involves at least 53.92% less delay than the best of the existing designs on the Virtex-7 device for n = 256. Also, the ADP of the proposed design is at least 15.37% less than that of the existing design of [34] for n = 512 on the Virtex-7 device. Another noticeable advantage of the proposed design is the throughput, which refers to the amount of calculation completed per unit time. From Table III, we can see that the throughput of the proposed design (u = 2) is 2.17× to 5.24× that of the existing designs on the Virtex-7 device. A similar situation holds on the Intel Stratix-V device, as shown in Table IV, where the proposed accelerator (u = 2) has significantly better ADP than [29] and [32] (we followed the existing designs to calculate the ADP). This indicates that the proposed design is well suited for high-speed calculation, such as in IoT servers.

5) Comparison With Software Implementation (CPU):
To demonstrate the efficiency of the proposed accelerator, we have also measured the performance of a software implementation (coded in C language) of RBLWE-ENC deploying the proposed KA strategy. The setup is as follows: 1) we have used the microbenchmark support library from Google [47] as the benchmark library; 2) we have used a single core of an AMD Ryzen Threadripper 3960X processor running at 3.8 GHz; 3) the testing was carried out on the Ubuntu 20.04 LTS OS on a virtual machine; and 4) we have used g++ 9.4.0 to compile the code, with the optimization flag disabled. The software implementation of KA-deployed RBLWE-ENC (decryption) takes 145 762 ns (number of testing iterations is 4798) and 545 954 ns (number of testing iterations is 1282) for n = 256 and n = 512, respectively. From the obtained data, one can see that the proposed hardware KINA accelerator is much faster than the software implementation and hence is preferred for practical deployment. Finally, we want to mention that the FPGA-based implementation can also be extended further to application-specific integrated circuits (ASICs) for potential applications, which the CPU implementation does not offer.
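For readers who want to experiment in software, a one-level Karatsuba decomposition of the ring multiplication can be sketched as below. This is not the authors' C code. It is a minimal Python illustration that assumes the ring is Z_q[x]/(x^n + 1) with q = 256, a binary polynomial b, and the textbook split into low and high halves, with T_L, T_H, and T_M playing the roles of the three half-size products. All function names are hypothetical. Each half-size product costs (n/2)^2 coefficient multiplications, so the three together cost 3n^2/4, which is consistent with the complexity quoted in the article.

```python
import random

Q = 256  # modulus q used in the paper's parameter sets


def schoolbook(a, b, q=Q):
    """Plain polynomial product with coefficients reduced mod q."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % q
    return out


def reduce_negacyclic(c, n, q=Q):
    """Reduce modulo x^n + 1 (assumed ring), using x^n = -1."""
    out = [0] * n
    for k, ck in enumerate(c):
        if k < n:
            out[k] = (out[k] + ck) % q
        else:
            out[k - n] = (out[k - n] - ck) % q
    return out


def karatsuba_ring_mul(b, d, q=Q):
    """One-level Karatsuba product of b and d in Z_q[x]/(x^n + 1)."""
    n = len(b)
    if len(d) != n or n % 2:
        raise ValueError("operands must share an even length n")
    h = n // 2
    b_l, b_h = b[:h], b[h:]
    d_l, d_h = d[:h], d[h:]
    b_m = [x + y for x, y in zip(b_l, b_h)]        # values in {0, 1, 2}
    d_m = [(x + y) % q for x, y in zip(d_l, d_h)]
    t_l = schoolbook(b_l, d_l, q)
    t_h = schoolbook(b_h, d_h, q)
    t_m = schoolbook(b_m, d_m, q)
    full = [0] * (2 * n - 1)
    for k in range(2 * h - 1):
        mid = (t_m[k] - t_l[k] - t_h[k]) % q       # cross term
        full[k] = (full[k] + t_l[k]) % q
        full[k + h] = (full[k + h] + mid) % q
        full[k + n] = (full[k + n] + t_h[k]) % q
    return reduce_negacyclic(full, n, q)


# Self-check against the direct product on random inputs.
random.seed(1)
n = 16
b = [random.randint(0, 1) for _ in range(n)]
d = [random.randrange(Q) for _ in range(n)]
assert karatsuba_ring_mul(b, d) == reduce_negacyclic(schoolbook(b, d), n)
```

The sketch favors clarity over speed and makes no attempt to be constant time.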
6) Discussion About the Performance With Other PQC: We have also listed other lattice-based schemes in Table V for a more comprehensive discussion. In particular, we have selected the available implementations of NIST PQC schemes for discussion: NewHope [44], Kyber [45], and Saber [46]. Note that the existing designs mostly did not report the slice number, and we thus use the number of LUTs to calculate the ADP.
It is seen that the proposed KINA has significantly better area-time complexities than the existing designs. Besides, the designs of [44] and [45] use additional DSPs and BRAMs, and hence their actual area-time complexities are larger than the calculated ADP. Meanwhile, when compared with the public-key scheme Saber of [46] (also a KA-based implementation), the proposed KINA not only has smaller area usage but also involves much faster processing time. One may conclude that the proposed KINA is more suitable for high-performance lightweight applications where processing speed is desired with relatively small resource usage.
Lastly, we also want to mention that both Kyber and Saber explore module rings in constructing the security keys and ciphertext (e.g., Kyber is built on the Module-LWE [49]). Meanwhile, Kyber has utilized a built-in NTT ciphertext structure to improve practical implementation performance [49], while Saber explores the small secret-key size and rounding techniques [50]. Still, it is encouraging to see that comparing the implementation result of KINA with Kyber (and even Saber) also highlights the potential efficiency of Ring-LWE, especially when the parameter (e.g., modulus q) is selected to be small. Due to the module ring setup, Kyber may obtain better efficiency on area usage; a Ring-LWE-based scheme, however, can achieve faster computation (which leads to better overall area-delay efficiency). For lightweight applications, where the security requirement is not that high, one may choose to use the Ring-LWE-based setting with small parameters (i.e., RBLWE in this article) to obtain better feasibility and efficiency.
7) Unique Features: Overall, the proposed KINA possesses two unique features: 1) it is the first RBLWE-based PQC accelerator based on KA and 2) it provides flexible processing throughput, depending on the choice of u, for potential high-speed environments. These two features facilitate the deployment of the proposed KINA in various high-performance applications.
The proposed KINA has constant-time operation and hence is resistant to regular timing attacks [53]. While this article focuses mostly on developing the KA-initiated computation strategy for the major arithmetic operation of RBLWE-ENC (polynomial multiplication over ring) as well as the overall hardware accelerator, the research on designing and implementing a key-encapsulation mechanism (KEM) of RBLWE-ENC and related works (such as side-channel attacks) is out of the scope of this article. Nevertheless, we want to mention that developing an efficient KEM version of RBLWE-ENC can be seen as one of our future research directions. Meanwhile, side-channel analysis and related countermeasures can also be extended further on the proposed accelerator.
Finally, we also want to emphasize that the research and development for lightweight PQC is still an under-explored area (as pointed out in the NSF SaTC PI Meeting'22 [18]), though the NIST PQC standardization has recently selected algorithms like Kyber for general-purpose usage [6]. Therefore, we hope the proposed work in this article can stimulate follow-up investigations from the research community, e.g., scheme development, parameter selection, security analysis, implementation techniques, etc. Besides that, we also hope the proposed KINA design strategy can be extended to the polynomial multiplication used in NIST-selected schemes like Falcon [51] and even Dilithium [52], which are not bound to specific fast algorithms [6] (Kyber is built-in with NTT already [49]). As KINA provides a flexible and extensible way of accelerating large integer polynomial multiplications, it is natural to think about applying similar structures to these NIST PQC schemes. While the high throughput and flexible processing are plausible, several related components still need to be developed for such an endeavor, including modular reduction, point-wise multiplier, sampler design, etc.

VII. CONCLUSION
In this article, we propose an efficient RBLWE-ENC accelerator, KINA, on the hardware platform. The key contributions of this work are as follows: 1) we have used KA to derive an efficient computation of the polynomial multiplication over ring, the major arithmetic operation of the RBLWE-ENC; 2) we have efficiently mapped the proposed algorithm into a new RBLWE-ENC accelerator, KINA (including the higher speed version); and 3) we have conducted analysis and comparison to show the efficiency of the proposed accelerator. In summary: a) the proposed KINA is the first Karatsuba-based RBLWE-ENC accelerator that achieves a complexity of O((3n^2)/4) and b) the proposed accelerator provides flexible processing capabilities. The proposed design strategy and implementation results are expected to help the further development of the RBLWE-based lightweight PQC scheme.
D_H, and T_M = B_M D_M. 1) Detailed Steps to Derive T_L: Let us consider T_L first,

Fig. 3. Details of the CSRs for B_H, B_L, and B_M (from top to bottom), respectively.
to obtain the desired output. Note the selection signals to the MUX are generated from the control unit. The linear combination component can be seen as the

Fig. 4. (a) Processing unit for T_H or T_L. (b) Processing unit for T_M. (c) Point-wise multiplier for T_M.

Fig. 6. Internal structure of the shift-register for B for the proposed higher speed KINA ("ot_i" denotes the output of the individual register).

Fig. 7. Internal structure of the computation unit in the multiplication component, where each unit contains the adder trees (here we have used T_L and the input of B_L is 1-bit). Note that for T_M, we need to set the input (B_M) as 2-bit. Meanwhile, we have also demonstrated the internal structure of the adder tree for u = 4, which can be extended to different values of u.
Fig. Another modification made in the multiplication Unlike the one in the version of Fig. which has only one layer of point-wise multiplier-based accumulator-sets, there are u layers of the multiplier accumulator-sets, where each set has (n − 1) point-wise multipliers and the products are sent to a special component, adder tree, for the addition (see Fig.
(two CSRs, respectively, for B_L and B_H, contain (n/2) 1-bit registers; one CSR for B_M needs (n/2) 2-bit registers), one 1-bit adder, and one 8-bit adder.
2) The computation unit for T_H (and T_L) has (n − 1) AND gates, (n − 1) 8-bit adders, and (n − 1) 8-bit registers, while the computation unit for T_M has (n − 1) 3-to-1 MUXes, (n − 1) 8-bit adders, and (n − 1) 8-bit registers.
3) The linear combination component has six 8-bit adders, five SIs, eight 8-bit MUXes, and an XOR gate.
The proposed KINA takes (n/2) cycles to execute related accumulations (multiplication component) and another n cycles to output the final results in a serial format.
For the higher speed design of KINA, the area consumption of the input processing component remains the same as in the basic version. In the multiplication component, a total of 2 × u × (n − 1) AND gates and u × (n − 1) MUXes are needed for the point-wise multiplications, and 3 × u × (n − 1) 8-bit adders, as well as 3 × (n − 1) + log_2 u × (n − 2) 8-bit registers, will be used for permutation and accumulation. The area usage of the linear combination component remains the same. Finally, the time required for the computation (multiplication component) decreases as u increases, i.e., n/(2u) cycles.
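The resource and cycle counts stated above for the multiplication component of the higher speed design can be tabulated with a small helper. This is a sketch that simply evaluates the expressions as written in the text; the function name, the dictionary keys, and the restriction of u to powers of two that divide n/2 are assumptions made for illustration.

```python
def multiplication_component_counts(n, u):
    """Evaluate the stated gate, adder, register, and cycle counts."""
    if u < 1 or u & (u - 1):
        raise ValueError("u is assumed to be a power of two")
    if n % (2 * u):
        raise ValueError("2u is assumed to divide n")
    log2_u = u.bit_length() - 1
    return {
        "and_gates": 2 * u * (n - 1),
        "muxes": u * (n - 1),
        "adders_8bit": 3 * u * (n - 1),
        "registers_8bit": 3 * (n - 1) + log2_u * (n - 2),
        "multiplication_cycles": n // (2 * u),
    }


# Sweep the values of u used in the experiments for n = 256.
for u in (1, 2, 4, 8, 16):
    print(u, multiplication_component_counts(256, u))
```

Setting u = 1 reproduces the counts given for the basic version: 2(n − 1) AND gates, (n − 1) MUXes, 3(n − 1) adders, 3(n − 1) registers, and n/2 cycles.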

TABLE I MAJOR AREA-TIME COMPLEXITIES FOR THE PROPOSED RBLWE-BASED ACCELERATOR AND THE STATE-OF-THE-ART STRUCTURES

TABLE II FPGA IMPLEMENTATION PERFORMANCE OF THE PROPOSED ACCELERATOR ON AMD-XILINX DEVICES *

TABLE III COMPARISON OF FPGA IMPLEMENTATION PERFORMANCE (AMD-XILINX DEVICES) *

TABLE V DISCUSSION ABOUT THE PERFORMANCE WITH OTHER PQC IMPLEMENTATIONS
about the regular Ring-LWE-based PQC