Efficient Hardware Implementation of Large Field-Size Elliptic Curve Cryptographic Processor

Due to the rapid development of secured technological devices, efficient implementation of large field-size elliptic curve cryptosystems (ECC) is increasingly in demand in many critical applications. This paper therefore presents a new Montgomery point multiplication (PM) algorithm that optimizes and balances the signal flow and resource utilization. We then present an efficient ECC processor architecture over $GF(2^{m})$ with m = 409 and 571 for the proposed Montgomery PM algorithm. Finally, we give a detailed comparison and performance analysis (in terms of area-delay product) to show that the proposed cryptographic processor outperforms the competing designs. Implementation results after place & route on Xilinx Virtex 7 and Kintex Ultrascale+ are provided. The results indicate that the proposed large field-size ECC processor (and the proposed design strategy) can be extended and applied in many security-demanding applications.


I. INTRODUCTION
Elliptic-curve cryptography [1] was first introduced in the mid-1980s. Compared with traditional RSA (representing Rivest, Shamir, and Adleman), cryptosystems based on elliptic curves have relatively shorter operand lengths while maintaining the same security level (which revolutionized public-key cryptography). Due to the rapid development of security technology, ECC with small field-size has gradually become obsolete. Larger field-size ECCs have thus attracted widespread attention from the research community in recent years.
The most critical operation in ECC is the point multiplication (PM), and generally, a complete PM consists of point addition (PA) and point doubling (PD) operations. The PM thus involves several finite field arithmetic components such as addition, squaring, inversion, and multiplication (multiplication is regarded as the most costly operation, and the inversion can be realized by the multiplicative operation) [2][3][4][5][6][7][8][9]. Besides that, the implementation complexity of the PM is also very much determined by the efficiency of related PM algorithms.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohamad Afendee Mohamed.

A. EXISTING WORKS
In order to obtain an efficient hardware ECC over $GF(2^{m})$ (binary field ECC is desirable for hardware implementation), many efforts have been made. Overall, these works can be categorized into two levels: (i) System-level, which mainly refers to the ECC system implementation (mostly based on the efficient computation of the PM, regarded as the main effort in the field) [6][7][8][9][10][11][12]. Among the many algorithms proposed for realizing PM, the Montgomery algorithm is the most frequently used one due to its strong attack resistance [3]. It is also noted that many existing reports mainly focus on small field-size ECC, though the National Institute of Standards and Technology (NIST) has recommended five polynomials for ECC applications [3] ($GF(2^{409})$ and $GF(2^{571})$ are considered large field-sizes). We list several important ECC hardware implementation works below.
In 2008, Chelton et al. [2] designed a high-speed ECC processor. In this paper, a new combined algorithm was developed to perform PA and PD in an efficient format. Besides that, a subpipelined bit-parallel multiplier was deployed to reduce the latency of PM.
In 2013, Mahdizadeh et al. [4] introduced a highly efficient architecture for ECC PM based on a reorganized and reordered Lopez-Dahab critical-path to achieve maximum architectural and timing improvements. The proposed structures are implemented in parallel, and the critical-path operations are diverted to noncritical paths. The proposed design obtains better efficiency than the previous design.
In 2017, Khan et al. [7] proposed two hardware ECC architectures based on a pipelined full-precision multiplier. To reduce the latency, the authors proposed a modified Lopez-Dahab Montgomery PM algorithm to avoid data dependency and reduce the required number of clock cycles. The first, single-multiplier-based ECC processor achieves low resource usage, while the second, three-multiplier-based ECC processor design obtains the fastest computation time.
In 2018, Li et al. [8] proposed a highly efficient architecture for right-to-left PM algorithm on Koblitz curves to allow parallel computation of Frobenius maps and PAs (to achieve short latency and high frequency). The proposed architecture can perform a point multiplication in 2.50 µs over the five NIST Koblitz curves K-163 at 292 MHz and consume 3,670 slices when implemented on the Virtex-7 device.
In the same year, Imran et al. [9] presented an ECC hardware architecture over $GF(2^{m})$ based on BHC (Binary Huff Curves). In this work, a unified and flexible PM hardware architecture based on both ECC and BHC was proposed with the consideration of providing flexibility in the domain of security/reliability. This unified cryptoprocessor obtains better performance than previous ones.
More recently, Imran et al. [10] presented a hardware accelerator for ECC. The authors have proposed an efficient two-stage pipelining architecture with rescheduled PA and PD instructions. It achieves faster computation than the competing ones.
In 2006, Kumar et al. [13] introduced different architectural enhancements in the Least Significant Digit (LSD) multiplier and their hardware implementations. They proposed the Double Accumulator Multiplier (DAM) architecture and the N-Accumulator Multiplier (NAM) architecture to obtain hardware implementation efficiency.
In 2017, Namin et al. [20] presented two digit-level finite field multipliers in $GF(2^{m})$ based on a specific feature of redundant representation in a class of finite fields to minimize hardware resource usage. Theoretical results were then verified by the hardware implementations.
In 2020, José Imana [22] proposed a bit-serial polynomial basis (PB) multiplier architecture over the $GF(2^{m})$ binary field generated by irreducible trinomials and based on the LFSR (Linear-Feedback Shift Register) technique. It can perform the multiplication in m clock cycles and offers a performance/area trade-off that is very useful in resource-constrained applications.
Besides that, fast algorithms such as Toeplitz Matrix Vector Product (TMVP) have been used to obtain low-complexity implementations, including one of the latest [23], which is superior to other current designs.
From the above discussions, it is desirable that: (i) the highly efficient multipliers over $GF(2^{m})$ can be seamlessly integrated with a novel PM algorithm to obtain an efficient ECC implementation; (ii) the employed multipliers and ECC design strategy are well suited to large field-size hardware implementation, which is increasingly in demand.

B. MAIN CONTRIBUTIONS
With this point of view, in this paper, we propose an efficient implementation for the large field-size ECC (hardware platform). We derive the proposed ECC through a combination of three coherent interdependent efforts: (i) An optimized Montgomery PM algorithm is proposed to lay a solid foundation for efficient ECC implementation.
(ii) A new ECC processor is then constructed based on the proposed PM algorithm with the help of several algorithm-architecture co-implementation techniques (i.e., thorough algorithm-to-architecture design details).
(iii) A series of performance comparison and complexity analyses have been carried out to confirm the superior efficiency of the proposed design over the competing ones.
The rest of the paper is organized as follows: The background knowledge is introduced in Section II. Section III derives the proposed PM algorithm. The desired ECC processor is presented in Section IV. The comparison & complexity are provided in Section V. The conclusion is given in Section VI.

II. BACKGROUND INFORMATION

A. ELLIPTIC CURVE CRYPTOGRAPHY
An elliptic curve is the set of points $(x, y)$ over a field $K$ ([1]) satisfying
$$y^2 + a_1 xy + a_3 y = x^3 + a_2 x^2 + a_4 x + a_6, \tag{1}$$
which can be simplified into the form $y^2 + xy = x^3 + ax^2 + b$ with $a, b \in K$ over $GF(2^{m})$. Let $P = (x_1, y_1) \neq O$ ($O$ is the point at infinity) be a point; the inverse of $P$ is $-P = (x_1, x_1 + y_1)$. Let $Q = (x_2, y_2) \neq O$ be a second point with $Q \neq -P$; then $P + Q = (x_3, y_3)$ is given by
$$x_3 = \lambda^2 + \lambda + x_1 + x_2 + a, \quad y_3 = \lambda(x_1 + x_3) + x_3 + y_1, \tag{2}$$
where $\lambda = \frac{y_1 + y_2}{x_1 + x_2}$ ($P \neq Q$), or
$$x_3 = \lambda^2 + \lambda + a, \quad y_3 = x_1^2 + (\lambda + 1)x_3, \tag{3}$$
where $\lambda = x_1 + \frac{y_1}{x_1}$ ($P = Q$). As seen from (2) and (3), PA and PD each require one inversion and two multiplications; there is thus a need to find another point representation that can replace the field inversion with field multiplication.
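The affine group law in (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the processor's datapath: the toy field $GF(2^4)$ with reduction polynomial $x^4 + x + 1$, the integer encoding of field elements, and all function names (`gf_mul`, `gf_inv`, `on_curve`, `affine_add`) are assumptions made here for readability.

```python
# Toy field GF(2^4) with reduction polynomial x^4 + x + 1 (illustrative assumption;
# the paper targets m = 409 and 571).
M = 4
POLY = 0b10011

def gf_mul(a, b):
    """Shift-and-add product of two field elements stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:          # degree reached M: subtract (XOR) the modulus
            a ^= POLY
    return r

def gf_inv(a):
    """Inverse by Fermat's little theorem, a^(2^M - 2); a must be nonzero."""
    r, e = 1, (1 << M) - 2
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def on_curve(P, a, b):
    """Check y^2 + xy = x^3 + a x^2 + b; None stands for the point at infinity O."""
    if P is None:
        return True
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(a, x2) ^ b

def affine_add(P, Q, a):
    """P + Q using (2) when P != Q and (3) when P == Q."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y2 == x1 ^ y1:                       # Q = -P, so the sum is O
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))       # doubling slope, eq. (3)
        x3 = gf_mul(lam, lam) ^ lam ^ a
        y3 = gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3)
    else:
        lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))  # addition slope, eq. (2)
        x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
        y3 = gf_mul(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)
```

Each call to `affine_add` performs exactly one `gf_inv`, which is the cost that projective coordinates are introduced to avoid.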
Take a point $E = (x, y)$ and transform $(x, y)$ from affine to projective coordinates $(X, Y, Z)$ with $Z \neq 0$. Thus, for $x = X/Z^{\alpha}$ and $y = Y/Z^{\beta}$, the elliptic curve equation becomes
$$\left(\frac{Y}{Z^{\beta}}\right)^2 + \frac{X}{Z^{\alpha}} \cdot \frac{Y}{Z^{\beta}} = \left(\frac{X}{Z^{\alpha}}\right)^3 + a\left(\frac{X}{Z^{\alpha}}\right)^2 + b, \tag{4}$$
where $\alpha$ and $\beta$ should be chosen such that the scalar multiplication requires only multiplication and addition over $GF(2^{m})$. Note that here we use the popular Lopez-Dahab (LD) coordinates ($\alpha = 1$ and $\beta = 2$) [5], and (4) can be written as
$$Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4. \tag{5}$$
The choice of appropriate elliptic curves and coordinate systems is the first step in the implementation process. Then, the algorithm for the scalar multiplication should be well chosen. Overall, Montgomery's algorithm is resistant to side-channel attacks because the PA and PD are indistinguishable [24], and we also consider it in this paper.
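For readers who want the intermediate step, the LD form follows from the simplified curve equation by direct substitution. The derivation below is a worked sketch using only the definitions above ($\alpha = 1$, $\beta = 2$, $Z \neq 0$).

```latex
% Substitute x = X/Z and y = Y/Z^2 into y^2 + xy = x^3 + a x^2 + b:
\frac{Y^2}{Z^4} + \frac{XY}{Z^3} = \frac{X^3}{Z^3} + \frac{aX^2}{Z^2} + b
% Multiply both sides by Z^4 (allowed because Z \neq 0):
Y^2 + XYZ = X^3 Z + aX^2 Z^2 + bZ^4
```

No division remains in the projective equation, which is why point operations in these coordinates can defer the single inversion to the final conversion back to affine form.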

B. DIGIT-SERIAL FINITE FIELD MULTIPLIER
First of all, let us consider the fast algorithm of TMVP [25]. Let $V = (V_0, V_1)$ be an $n \times 1$ column vector and let the triple $(T_0, T_1, T_2)$ define an $n \times n$ Toeplitz matrix $T$, where $V_0$ and $V_1$ are two $\frac{n}{2} \times 1$ column vectors, and $T_0$, $T_1$, and $T_2$ are three $\frac{n}{2} \times \frac{n}{2}$ Toeplitz matrices. A TMVP of $C = TV$ in this case is
$$C = TV = \begin{pmatrix} T_1 & T_0 \\ T_2 & T_1 \end{pmatrix} \begin{pmatrix} V_0 \\ V_1 \end{pmatrix}, \tag{6}$$
which can be expressed as
$$C = \begin{pmatrix} P_0 + P_1 \\ P_1 + P_2 \end{pmatrix}, \quad P_0 = (T_0 + T_1)V_1, \quad P_1 = T_1(V_0 + V_1), \quad P_2 = (T_1 + T_2)V_0. \tag{7}$$
Based on (7), we can recursively generate four components (component matrix point (CMP), component vector point (CVP), point-wise multiply (PWM), and reconstruction (R)) of reduced-size matrices. Figure 1 shows the subquadratic complexity TMVP-based architecture for (6), which has three stages: the evaluation point generation (EPG) stage, the PWM stage, and the R stage.
The EPG stage performs two block functions of $\mathrm{CMP}(T)$ and $\mathrm{CVP}(V)$, the PWM stage computes $P = \mathrm{PWM}(\mathrm{CMP}(T), \mathrm{CVP}(V)) = (P_0, P_1, P_2)$, and the R stage performs the operation $C = R(P) = (P_0 + P_1, P_1 + P_2)$.
FIGURE 1. Overall structure of the subquadratic TMVP multiplier [25].
Let symbols $S$ and $D$ be "space" and "delay", respectively, and let $S_{\otimes}(n)$ and $S_{\oplus}(n)$ in the case of $n = 2^i$ ($i > 1$) denote the number of bit-multiplications and the number of bit-additions required for $n \times n$ TMVP multiplication. Meanwhile, let $D_{\otimes}(n)$ and $D_{\oplus}(n)$ denote the number of AND gate delays and the number of XOR gate delays required for TMVP multiplication. In [25], Fan and Hasan have shown that, for 2-way TMVP decomposition, CMP involves $(\frac{3n}{2} - 1)$ XOR gates and $T_{\oplus}$ delay (or $T_X$, the delay time of an XOR gate); CVP has $\frac{n}{2}$ XOR gates and $T_{\oplus}$ delay; PWM contains $3S_{\otimes}(\frac{n}{2}) + 3S_{\oplus}(\frac{n}{2})$ space complexity and $D_{\otimes}(\frac{n}{2}) + D_{\oplus}(\frac{n}{2})$ delay; and the R unit consists of $n$ XOR gates (delay of $T_{\oplus}$). Accordingly, we have obtained the following recurrences on complexities:
$$S_{\otimes}(n) = 3S_{\otimes}(n/2), \quad S_{\oplus}(n) = 3S_{\oplus}(n/2) + 3n - 1, \quad D_{\otimes}(n) = D_{\otimes}(n/2), \quad D_{\oplus}(n) = D_{\oplus}(n/2) + 2. \tag{8}$$
Solving the recurrence equations in (8), the time and space complexities of the 2-way TMVP decomposition can be expressed as follows ($T_{\otimes}$ and $T_{\oplus}$ are the delay times of an AND gate and an XOR gate, respectively):
$$S_{\otimes}(n) = n^{\log_2 3}, \quad S_{\oplus}(n) = 5.5n^{\log_2 3} - 6n + 0.5, \quad D(n) = T_{\otimes} + 2\log_2(n)\,T_{\oplus}.$$
Here $d$ is a power of 2, and $n = m/d$. The product of $A$ and $B$ can then be written in the form (9) (note that we follow the existing notation of the multiplier in [23], which is applicable only in this subsection). Following (9), we can first use the partial product $AB_i$ to obtain the TMVP formula, and then obtain the digit-serial multiplier with sub-quadratic space complexity.
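The two-way TMVP split and the gate-count recurrences described above can be checked with a short Python sketch. It assumes the usual block layout in which $T_1$ sits on the diagonal, $T_0$ in the upper-right block, and $T_2$ in the lower-left block, which is consistent with $R(P) = (P_0 + P_1, P_1 + P_2)$; the list encoding of a Toeplitz matrix by its $2n - 1$ diagonal values and all function names are illustrative assumptions.

```python
def xor_vec(u, v):
    """Element-wise addition over GF(2)."""
    return [a ^ b for a, b in zip(u, v)]

def naive_tmvp(t, v):
    """Reference product: T[i][j] = t[i - j + n - 1], quadratic cost."""
    n = len(v)
    return [sum(t[i - j + n - 1] & v[j] for j in range(n)) & 1 for i in range(n)]

def tmvp(t, v):
    """Recursive two-way TMVP; n = len(v) must be a power of two."""
    n = len(v)
    if n == 1:
        return [t[0] & v[0]]
    h = n // 2
    # Diagonal sequences of the three half-size Toeplitz blocks.
    t0, t1, t2 = t[:2 * h - 1], t[h:3 * h - 1], t[2 * h:]
    v0, v1 = v[:h], v[h:]
    p0 = tmvp(xor_vec(t0, t1), v1)   # P0 = (T0 + T1) V1
    p1 = tmvp(t1, xor_vec(v0, v1))   # P1 = T1 (V0 + V1)
    p2 = tmvp(xor_vec(t1, t2), v0)   # P2 = (T1 + T2) V0
    return xor_vec(p0, p1) + xor_vec(p1, p2)   # R(P) = (P0 + P1, P1 + P2)

def and_gates(n):
    """Bit-multiplications: S(n) = 3 S(n/2), S(1) = 1."""
    return 1 if n == 1 else 3 * and_gates(n // 2)

def xor_gates(n):
    """Bit-additions: CMP (3n/2 - 1) + CVP (n/2) + R (n) + three sub-blocks."""
    return 0 if n == 1 else 3 * xor_gates(n // 2) + 3 * n - 1
```

Three half-size products replace the four of a direct block multiplication, which is where the $n^{\log_2 3}$ growth of the AND-gate count comes from.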
First, assuming that k is the number of processing elements (PE), the product in (9) can be rewritten as Where n x nd according to (10). According to the TMVP-based decomposition, the product C i is directly transferred as Where (0 ≤ k ≤ n − 1) and Where we find that CMP(A (i) ) appears in all partial product R(W j ). In order to reduce the space and time complexities, each partial product C i is then split into a two-step computation process: Step 1: Computing P B j = CVP(B kdi+j ) for 0 ≤ j ≤ k − 1 and P A = CMP(A (i) ).
Step 2:
Based on the two-step calculation process, the digit-serial polynomial multiplication over $GF(2^{m})$ is summarized as Algorithm 1. Figure 2 shows the systolic digit-serial multiplication architecture according to Algorithm 1.
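To make the digit-serial idea concrete, the sketch below processes one $d$-bit digit of $B$ per iteration, so a product takes $\lceil m/d \rceil$ iterations. It is a plain schoolbook (Horner-style) digit-serial multiplier, not the TMVP-based systolic design of [23]; the function names and the integer encoding are assumptions, while $x^{409} + x^{87} + 1$ is the NIST reduction polynomial for $m = 409$.

```python
M = 409
POLY = (1 << 409) | (1 << 87) | 1   # x^409 + x^87 + 1

def clmul(a, b):
    """Carry-less (polynomial) product of two integers over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def reduce_poly(c):
    """Reduce polynomial c modulo POLY by cancelling the top bit repeatedly."""
    while c.bit_length() > M:
        c ^= POLY << (c.bit_length() - 1 - M)
    return c

def digit_serial_mul(a, b, d):
    """Most-significant-digit-first multiplication with digit size d."""
    n = -(-M // d)                    # number of digits, ceil(M / d)
    mask = (1 << d) - 1
    c = 0
    for i in reversed(range(n)):
        digit = (b >> (i * d)) & mask
        # One "clock cycle": shift the accumulator by d and add A * B_i.
        c = reduce_poly((c << d) ^ clmul(a, digit))
    return c
```

A larger `d` means fewer iterations but a wider partial-product unit per iteration, which is the area/latency trade-off that digit-serial designs expose.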

III. PROPOSED LD MONTGOMERY ALGORITHM
Let two points be defined as $P_1, P_2 \in E[GF(2^{m})]$, presented in projective coordinates. Meanwhile, define $P_3$ and $P_4$ such that $P_3 = P_1 + P_2$ (PA) and $P_4 = 2 \times P_1$ (PD). To calculate the mentioned PA and PD, six finite field multiplications, five finite field squaring operations, and four finite field additions are needed [3].
Proposed Algorithmic Strategy & Details. We first consider that the LD Montgomery algorithm's computational latency is equivalent to the computation time of the six field multiplications (the field additions and field squaring operations can be performed simultaneously with the multiplications). Besides, we also consider that the performance of ECC is determined by the number of employed multipliers and the digit-size (assuming digit-serial multipliers are used to implement the ECC), e.g., the frequency of the processor decreases as the digit-size increases (we can increase the number of pipeline stages to improve the maximum operating frequency). Following this strategy, we propose to combine PA and PD to speed up the main computation process, i.e., we propose to employ only two multipliers to achieve a low-latency implementation, as presented in the proposed Algorithm 2.
Algorithm 1 Existing digit-serial systolic multiplication [23].
LD Montgomery Algorithm Against Side Channel Attacks. The scalar multiplication is the most computationally expensive operation in ECC and is the primary target of side-channel attacks [26]-[28]. Overall, two kinds of countermeasures are needed to resist side-channel attacks. The first unifies the calculation procedures of elliptic curve PA and PD to make them indistinguishable to the adversary [29]. The second adapts the scalar multiplication to make the elliptic curve PA and PD independent of the secret key bits. The latter includes the double-and-add-always method [30] and the Montgomery ladder approach [31]. As clearly stated in the proposed algorithm (Algorithm 2), the Montgomery PM algorithm is highly regular, i.e., there are always two multiplications in each step. Hence, both PA and PD are performed in every iteration. Algorithm 2 is therefore resistant to timing attacks and simple power analysis attacks, since the operations are independent of the value of the scalar $k$. Meanwhile, we optimize the modular inversion and modular multiplication algorithms to make the operation time constant to resist timing attacks [32]. Overall, Algorithm 2 makes the proposed ECC processor resistant against simple side-channel attacks (while other types of side-channel attacks are beyond the scope of this paper). Figure 3 shows the related data flow diagram (based on Algorithm 2).
Algorithm 2 Proposed LD Montgomery PM Algorithm (Mul and Sqr denote the multiplication and squaring, respectively)
Input: $k = (k_{m-1}, \cdots, k_1, k_0)$ with $k_{m-1} = 1$, $P = (x, y) \in E(F(2^{m}))$
Output: $kP = (x_3, y_3)$
Initial Step:
If $k_{i+1}$ = '1' then / If $k_{i+1}$ = '0' then
PA: P(X 1 ,Z 1 ) = P(X 1 ,Z 1 ) + Q(X 2 ,Z 2 ); PD: Q(X 2 ,Z 2 ) = 2Q(X 2 ,Z 2 );
S-1-0: Z 1 = Mult(X 2 ,Z 1 );
S-1: Z 2 = Mult(X 1 ,Z 2 ); X 1 = Mult(X 1 ,Z 2 ); X 2 = Mult(X 2 ,Z 1 ); R 1 = Sqr(Z 2 ); R 2 = X 2 ;
S-1-1: Z 2 = Sqr(Z 2 ); R 1 = Sqr(R 1 ); R 2 = Sqr(X 2 );
Generally, in projective coordinates, Montgomery PA and PD require six field multiplications, five field squarings, and four field additions (the delay of the main calculation unit is equal to the delay of six field multiplications). Based on the proposed Algorithm 2, the entire calculation can be decomposed into three steps with only two multipliers used in each step (thus reducing the calculation delay and implementation complexity). Please note that we use the recently published finite field multiplier of [23], which helps further reduce the involved complexity.
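For orientation, the sketch below implements the classic x-only Lopez-Dahab Montgomery ladder in Python; it is not the rescheduled Algorithm 2 of this paper, only the textbook baseline whose combined PA and PD step costs six multiplications and five squarings. The toy field $GF(2^8)$ with polynomial `0x11B`, the affine reference routines used for checking, and all function names are assumptions; y-coordinate recovery is omitted.

```python
M = 8
POLY = 0x11B   # x^8 + x^4 + x^3 + x + 1: a toy field, assumed for illustration

def gf_mul(a, b):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= POLY
    return r

def gf_inv(a):
    r, e = 1, (1 << M) - 2          # Fermat: a^(2^M - 2)
    while e:
        if e & 1:
            r = gf_mul(r, a)
        a = gf_mul(a, a)
        e >>= 1
    return r

def m_add(X1, Z1, X2, Z2, x):
    """x-only PA; x is the affine x-coordinate of the base point (the difference)."""
    t1, t2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
    t3 = t1 ^ t2
    Z3 = gf_mul(t3, t3)
    return gf_mul(x, Z3) ^ gf_mul(t1, t2), Z3

def m_double(X, Z, b):
    """x-only PD: X' = X^4 + b Z^4, Z' = X^2 Z^2."""
    X2, Z2 = gf_mul(X, X), gf_mul(Z, Z)
    return gf_mul(X2, X2) ^ gf_mul(b, gf_mul(Z2, Z2)), gf_mul(X2, Z2)

def ladder_x(k, x, b):
    """Affine x-coordinate of kP (None for the point at infinity); k >= 1, x != 0."""
    X1, Z1 = x, 1                    # P
    X2, Z2 = m_double(x, 1, b)       # 2P
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:             # one PA and one PD in every iteration
            X1, Z1 = m_add(X1, Z1, X2, Z2, x)
            X2, Z2 = m_double(X2, Z2, b)
        else:
            X2, Z2 = m_add(X1, Z1, X2, Z2, x)
            X1, Z1 = m_double(X1, Z1, b)
    return None if Z1 == 0 else gf_mul(X1, gf_inv(Z1))

# Affine reference (one inversion per operation), used only to check the ladder.
def on_curve(P, a, b):
    x, y = P
    x2 = gf_mul(x, x)
    return gf_mul(y, y) ^ gf_mul(x, y) == gf_mul(x2, x) ^ gf_mul(a, x2) ^ b

def affine_add(P, Q, a):
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2:
        if y2 == x1 ^ y1:
            return None
        lam = x1 ^ gf_mul(y1, gf_inv(x1))
        x3 = gf_mul(lam, lam) ^ lam ^ a
        return (x3, gf_mul(x1, x1) ^ gf_mul(lam ^ 1, x3))
    lam = gf_mul(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mul(lam, lam) ^ lam ^ x1 ^ x2 ^ a
    return (x3, gf_mul(lam, x1 ^ x3) ^ x3 ^ y1)

def affine_mul(k, P, a):
    R = None
    for bit in bin(k)[2:]:           # left-to-right double-and-add
        R = affine_add(R, R, a)
        if bit == "1":
            R = affine_add(R, P, a)
    return R
```

Both branches of the ladder loop execute the same pair of operations and differ only in which register pair is updated, which is the regularity that the side-channel argument relies on.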

IV. PROPOSED ECC PROCESSOR
The proposed ECC processor based on Algorithms 1, 2 and 3 is shown in Figure 4, which consists of the following units:
Main Computation Unit. This unit contains two 7-to-1 MUXes, seven registers, and one 1-to-7 DeMUX to carry out the necessary operations along with other units in the cryptoprocessor. In particular, these registers and the related MUXes/DeMUX coordinate with the control unit and the arithmetic logic unit for the operations presented in Algorithms 1, 2 and 3.
Arithmetic Logic Unit. This unit focuses on the processing of PA and PD involved in the proposed PM algorithm (through modular multiplication & XORing operations), which constitutes the main data path components in the processor. Based on Algorithm 2, only two multipliers are needed in the proposed design.
Control Unit. This unit generates control signals for all the other units, including the control signals for the data flow in the processor and the movement of data between the Proj vs Aff unit, the main computation unit, and the Aff vs Proj unit. As presented later, we have used a finite state machine (FSM) to generate all the necessary control signals for coordinating all the system-level operations.
Projective to Affine Coordinates Conversion (Proj vs Aff) Unit. After completing all operations related to scalar multiplication, this unit converts the result from projective to affine coordinates. The result is then sent to the bus interface unit for further processing. The entire conversion is realized by two multipliers and two inversions.
Affine to Projective Coordinates Conversion (Aff vs Proj) Unit. This unit converts the coordinates of the point P(x, y) to the point P (X , Y , Z ). It uses a multiplier and a register to perform the entire conversion, which is achieved by reusing a multiplier and storing the result in each step.
Bus Interface Unit. This unit is responsible for reading input data from the main computation unit and writing output data; it is added to the proposed processor so that it can communicate effectively with the external environment. It is controlled by the "start", "busy", and "clock" signals, as shown in Figure 4. When the "start" signal is set, Affine X, Affine Y, and the key "K" are received eight bits at a time (the "start" signal is asserted after the bus indicates that the device is ready to receive data). Once the calculation and the conversion from projective to affine coordinates are completed, the result from the main computation unit is delivered eight bits at a time as well.
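The eight-bit transfer described for the bus interface implies $\lceil m/8 \rceil$ bus cycles per operand, i.e., 52 for $m = 409$ and 72 for $m = 571$. The Python sketch below illustrates that framing only; the least-significant-byte-first order and the function names are assumptions, since the text does not state the byte order.

```python
def to_bus_bytes(value, m):
    """Split an m-bit operand into 8-bit chunks (assumed least significant first)."""
    n_bytes = (m + 7) // 8            # ceil(m / 8) bus transfers
    return list(value.to_bytes(n_bytes, "little"))

def from_bus_bytes(chunks):
    """Reassemble the operand from chunks received in the same order."""
    return int.from_bytes(bytes(chunks), "little")
```

For example, `len(to_bus_bytes(x, 409))` is 52 for any 409-bit `x`, so loading Affine X, Affine Y, and K takes 156 transfers under this framing.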
Structural Details. The structural details of the proposed processor are introduced as follows.
1) PA and PD. Based on Algorithm 2 (and Figure 3), the calculation of PA and PD depends on the next key bit k i+1 . When k i+1 = '1', the step output will be (X 1 , are prepared at the same time, then two multipliers will be passed). When k i+1 = '0', we start with the multiplication between X 2 and Z 1 and X 1 , Z 2 . In step S-1, regardless of k i+1 , the square output (Z 2 ) is stored in the local register R 1 , and the square output (X 2 ) is stored in the local register R 2 . In step S-2, two multiplications are performed to calculate X 2 and Z 2 . Moreover, the squaring operation of R 3 is executed to obtain Z 1 . Before the squaring operation in step S-2, we add X 1 to Z 1 in step S-1 to obtain Z 1 . In step S-3, the multiplication between the base point x and the value in R 3 , and the multiplication between R 1 and R 3 are calculated. After that, an addition operation is performed to get a new X 1 .

2) Affine to Projective Coordinates Conversion (Aff vs Proj). The Aff vs Proj unit is controlled by the scalar $k$, the affine coordinates (Affine X and Affine Y), the irreducible polynomial, and the activation signals ("CLK", "Start", and "Reset"). The Aff vs Proj block converts the affine coordinates to projective ones using two multiplications and one XOR operation. After conversion, it sends a signal ("Aff_Done") to the control unit.
3) Control Unit. The control unit is the main component of the proposed ECC processor and is responsible for all communications between the components. This unit uses an FSM through which the controller synchronizes with the other ECC units; its details (signal setups) are shown in Figure 5.
In the proposed ECC processor, squaring is carried out by simply interleaving '0' bits between the original bits [33], as shown in Figure 6. Meanwhile, the hardware architecture to execute Algorithm 3 is shown in Figure 7, where the bit-serial inverter uses AND-XOR cells and five m-bit MUXes to update the five registers S, Y, R, B, and D. The proposed inverter, based on the extended Euclidean algorithm (EEA), has a short critical-path delay and a small area, as Algorithm 3 has no modular operations.
6) Polynomial Multiplier. In projective coordinates-based ECC implementation, the cryptoprocessor's overall performance depends on the performance of the polynomial multipliers. We have thus employed the TMVP-based polynomial multiplier of [23] in the proposed ECC processor architecture to obtain low complexity.
7) kP Time (Whole Latency). The multiplier of [23] has a latency of $(2v + 2)$ cycles, where $v = \sqrt{n}$ and $n = m/d$. In the proposed design, we have packed the CMP and CVP units of Figure 1 into the product computation unit and the RP cell, thus shortening the latency to $2v$ cycles.
Moreover, according to the data dependence in Figures 2 and 3, the total clock cycles for the proposed ECC processor is: Where the scalar multiplication takes (m − 1)v cycles to execute the LD Montgomery PM and 2m cycles to apply inversion (the coordinate conversion consists of two multipliers).
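The squaring-by-interleaving technique mentioned in the structural details above can be illustrated in Python: over $GF(2)$ the cross terms of $A(x)^2$ cancel, so squaring only spreads the bits of $A$ to even positions before reduction. The helper names and the integer encoding are assumptions, and the reduction loop is a software stand-in for the fixed XOR network a hardware design would use; $x^{409} + x^{87} + 1$ is the NIST polynomial for $m = 409$.

```python
M = 409
POLY = (1 << 409) | (1 << 87) | 1   # x^409 + x^87 + 1

def reduce_poly(c):
    """Reduce polynomial c modulo POLY by cancelling the top bit repeatedly."""
    while c.bit_length() > M:
        c ^= POLY << (c.bit_length() - 1 - M)
    return c

def spread_bits(a):
    """Insert a '0' between consecutive bits: bit i of a moves to bit 2i."""
    r, i = 0, 0
    while a:
        if a & 1:
            r |= 1 << (2 * i)
        a >>= 1
        i += 1
    return r

def gf_sqr(a):
    """Field squaring: interleave zeros, then reduce modulo POLY."""
    return reduce_poly(spread_bits(a))

def clmul(a, b):
    """Carry-less product, used here only as a reference for checking gf_sqr."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r
```

Because `spread_bits` involves no logic beyond wiring, squaring is far cheaper than a general multiplication, which is why the squarings can run alongside the multipliers.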

V. COMPLEXITY AND COMPARISON

A. IMPLEMENTATION & COMPLEXITY
We have then implemented the proposed ECC processor on a field-programmable gate array (FPGA) platform for the large field-sizes $GF(2^{409})$ and $GF(2^{571})$. The proposed ECC processor (Figure 4) is coded in VHDL and verified with Modelsim (using the test vectors provided by the NIST standard [41]). The design is then implemented through Xilinx Vivado 2019.2 (after place & route) on the Virtex 7 (XC7V2000T) and Kintex Ultrascale+ (XCKU15P) devices. The obtained results, namely the maximum frequency (Fmax, MHz), area usage (slices/CLB), kP time (µs), and area-delay product (ADP = #slices × kP time), are listed in Table 1. Note that, as the static power of the FPGA device accounts for a large portion of the total power consumption, we do not report the power here (this also follows the existing practice in [6], [7], [11], [12], [39], [40]).
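As a small worked example of the ADP metric, the Python sketch below multiplies slice count by kP time and computes a relative reduction between two designs. The function names are assumptions; the sample figures (3,670 slices and 2.50 µs) are the ones quoted in the introduction for the design of [8], not results of the proposed processor.

```python
def adp(slices, kp_time_us):
    """Area-delay product in slice-microseconds: #slices x kP time."""
    return slices * kp_time_us

def adp_reduction_percent(adp_reference, adp_new):
    """Relative ADP saving of a new design against a reference design."""
    return 100.0 * (adp_reference - adp_new) / adp_reference

# Figures quoted for [8]: 3,670 slices and 2.50 us per point multiplication.
example = adp(3670, 2.50)   # 9175.0 slice-microseconds
```

A lower ADP means that less area is occupied for less time per point multiplication, so it rewards designs that balance speed against resource usage rather than optimizing either alone.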

B. FPGA COMPARISON
To further evaluate the actual performance of the proposed design, we have also compared the corresponding FPGA implementation results with those of the available reports in the literature, particularly those that report large field-sizes [6], [7], [11], [12], [39], [40], as listed in Table 2.
As far as only the latency is concerned, the proposed design over $GF(2^{409})$ and $GF(2^{571})$ is 32.9% and 32.4% faster than [6], respectively. Besides that, the work presented in [7] over $GF(2^{571})$ utilizes 65% more FPGA slices than this work. As reports based on $GF(2^{409})$ are very limited, we only compare the results with [6], [12], [39] and find that the proposed ECC processor has significantly lower ADP than [6], [12], [39] (even when the results have been adjusted for Virtex-7 and Kintex devices). Considering the maximum field-size recommended by NIST [3], that is, $GF(2^{571})$, the proposed design involves much lower ADP than the existing ones, especially the latest [11], [39] (the adjusted ADP on the Virtex-7 device is reduced by 59.23% and 58.9%, respectively, according to the results shown in [11], [39]). At the same time, compared with the competing designs of [6], [7], [40] on the Virtex-7 device, the ADP of the proposed processor is 6.02%, 61.32%, and 11.09% smaller, respectively (please note that the overall performance of the design in [7] is actually better than [6], as shown in [7]).
The superior performance of the proposed ECC processor stems from: (i) the proper arrangement of the signal flow & resource usage brought by the proposed LD Montgomery algorithm; (ii) the proposed algorithm-architecture co-implementation techniques for the ECC processor. Further efforts can be made to optimize the finite field multiplier to obtain higher efficiency.

VI. CONCLUSION
In this paper, we combine three efforts to obtain an effective hardware implementation of a large field-size ECC processor: (i) we first propose a novel LD Montgomery PM algorithm to arrange a proper signal flow for ECC; (ii) we then construct a new ECC processor based on a series of algorithm-architecture co-implementation techniques; (iii) lastly, we carry out a detailed implementation-based complexity analysis and comparison to demonstrate the effectiveness of the proposed ECC processor. The proposed ECC processor is highly efficient, and it can be extended for deployment in many critical applications. Future work may focus on developing more efficient finite field multipliers and related cryptographic processors.

Dr. Xie is currently an Assistant Professor with the Department of Electrical & Computer Engineering, Villanova University, Villanova, PA, USA. His research interests include cryptographic engineering, hardware security, post-quantum cryptography, and VLSI implementation of neural network systems. Dr. Xie has served as a technical committee member for many reputed conferences, such as HOST, ICCAD, and DAC. He is also currently serving as an Associate Editor for Microelectronics Journal and IEEE ACCESS. He previously served as an Associate Editor for IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-II: EXPRESS BRIEFS. He received the IEEE Access Outstanding Associate Editor award for the year 2019. He also received the Best Paper Award from HOST'19.