• Abstract

# Flexible GF(2^m) Divider Design for Cryptographic Applications

In cryptographic applications, private key algorithms usually aim at high-throughput data communication, while public key algorithms require much lower throughput for private key exchange and authentication. To increase hardware utilization and reduce area overhead, this paper presents a flexible divider design in GF(2^m), which can be configured to operate in either SIMD or SISD mode. When applied to SIMD applications, the divider can perform multiple divisions in parallel and output results every cycle; thus, it is suitable for AES cryptosystems demanding high throughput. In SISD applications, the divider is scalable and can handle operands of different sizes, such as those specified in ECC standards. A scalable design can also alleviate the potential problem of high fan-out control signals. Complexity analysis shows that the proposed divider, operated in SIMD mode, has lower area complexity and higher throughput than related work.

## I. INTRODUCTION

The explosive growth in data communications and internet services has made cryptography an important research topic for meeting the need for confidentiality, authentication, data integrity, and/or non-repudiation. Generally, private key algorithms [1] offer a high throughput rate and are thus suitable for data communications. In contrast, public key algorithms, which have a much lower throughput rate, are required for private key exchange and authentication [2]. This implies that a flexible cryptographic processor should be capable of dealing with both private key and public key algorithms [3].

This work focuses on applications involving the Advanced Encryption Standard (AES) [1] and Elliptic Curve Cryptography (ECC) [2], [4]. The former is now widely used for secure data communications. The latter offers the most security per bit compared with other public key cryptosystems such as RSA [2], [4], making it suitable for applications with constrained resources. Moreover, both AES and ECC use arithmetic in the Galois field GF(2^m), which is more suitable for fast and compact hardware realization than a prime field GF(p) because no carry propagation exists in GF(2^m). Of the basic operations in GF(2^m), division is the most complicated one and is conventionally replaced by a series of multiplications. A high-speed divider design would be very helpful if it can operate faster than the equivalent multiplication sequence.

This paper proposes a flexible divider design in GF(2^m), which can be configured to operate in either SIMD (single instruction multiple data) or SISD (single instruction single data) mode. Starting from the high-speed divider developed in our previous work [5], this work shows how to increase its hardware utilization and reduce its hardware overhead by maximizing the resource sharing between AES and ECC. Thus, the proposed divider not only operates at very high speed but is also area-efficient. When applied to SIMD applications, the divider can perform multiple divisions in parallel and output results every cycle, so it can be applied to AES cryptosystems aiming at high throughput. In SISD applications, the divider is scalable and can handle operands of different sizes, such as those specified in ECC standards. As a result, the proposed divider is suitable for wireless applications like PDAs and smart cards, which demand secure data communication but have limited resources to offer such an option.

The difficulty of hardware sharing between AES and ECC comes from the considerable discrepancy between public key and private key algorithms. For example, when analyzing the AES algorithm, one can observe that the 128-bit input data of AES is considered as a 4 × 4 array of 8-bit bytes [1], which can be seen as a kind of vector architecture. Thus, a SIMD design can be applied to speed up its computation. In contrast, the operands of a public key cryptosystem must be large enough for security, and the operand sizes may differ as specified in existing standards [2], [4]. Thus, a scalable design, which can handle division of arbitrarily sized operands if there is enough memory capacity [6], is usually preferred in such applications.

Lim [7] proposed SIMD/SISD ALU architectures over GF(2^m). The ALU consists of P Q-bit subword dividers; thus, it can perform P GF(2^n) arithmetic computations (n ≤ Q) in parallel or one GF(2^n) arithmetic operation (Q < n ≤ PQ). This implies that their design is not scalable and has a large area cost because P must be large enough for ECC applications with a large key size (up to 571 bits), assuming that Q = 8 for the AES algorithm. Moreover, a potential problem is the large wire delay of high fan-out control signals because the design is based on the semi-systolic array architecture. This paper proposes a novel SIMD/SISD divider based on the digit-serial systolic array architecture. The complexity comparisons with related work [7] show that the proposed divider has a smaller area and higher throughput in SIMD applications. Finally, the technique developed in this work can be easily extended to design SIMD/SISD multipliers.

This paper is organized as follows. Section II briefly reviews the background and notation used in this work. Section III describes the proposed SIMD/SISD divider architecture. It is followed by the complexity analysis and implementation results in Section IV. Finally, we give our conclusion in Section V.

## II. BACKGROUND AND NOTATION

Let $G(x)$ be the irreducible polynomial of degree m over GF(2) that defines GF(2^m), expressed as $G(x) = g_m x^m + \dots + g_1 x + g_0$, where $g_0 = g_m = 1$ and $g_j \in \{0, 1\}$ for $j = 1, \dots, m-1$, or in the equivalent vector form $G = (g_m, \dots, g_1, g_0)$. Any element $A(x) \in$ GF(2^m) can then be uniquely represented as $A(x) = a_{m-1} x^{m-1} + \dots + a_1 x + a_0$, where $a_j \in \{0, 1\}$ for $j = 0, \dots, m-1$, or $A = (a_{m-1}, \dots, a_1, a_0)$. We next review our previous work [5] to obtain $V(x) = (A(x)/B(x))_G$, where $(C(x))_G$ denotes the operation $C(x) \bmod G(x)$.

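Algorithm DA: The Division Algorithm [5]

Initialization: $(R(x), S(x), U(x), V(x)) \leftarrow (B(x), G(x), A(x), 0)$, $(d, f) \leftarrow (2, 1)$

Result: $V(x) \equiv (A(x)/B(x))_G$

Algorithm:

$$\begin{array}{l}
\textbf{for } i = 1 \textbf{ to } 2m-1 \\
\quad \textbf{if } r_0 = 1 \\
\quad\quad \textbf{if } f = 1 \\
\quad\quad\quad (R(x), S(x), U(x), V(x)) \leftarrow (R(x) + S(x),\ R(x),\ U(x) + V(x),\ U(x)) \\
\quad\quad\quad f \leftarrow 0 \\
\quad\quad \textbf{else} \\
\quad\quad\quad (R(x), U(x)) \leftarrow (R(x) + S(x),\ U(x) + V(x)) \\
\quad (R(x), U(x)) \leftarrow (R(x)/x,\ (U(x)/x)_G) \\
\quad \textbf{if } f = 0 \textbf{ and } d_0 = 0 \\
\quad\quad d \leftarrow d/2 \\
\quad \textbf{else} \\
\quad\quad (d, f) \leftarrow (d \cdot 2,\ 1)
\end{array}$$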

In Algorithm DA, two auxiliary variables, $U(x)$ and $V(x)$, accompany the dividend $S(x)$ and divisor $R(x)$ to determine the value of $(S/R)_G$. The one-hot encoded counter, $(d, f)$, represents the difference between the upper bounds on the degrees of $R(x)$ and $S(x)$, deg(R) and deg(S), where deg(C) denotes the degree of $C(x)$. Here $f \in \{0, 1\}$ is the sign and $d \in \{1, 2, 2^2, \dots, 2^m\}$ is an (m+1)-bit vector that represents the magnitude. The index i counts the iterations, which implies that the division algorithm converges in 2m−1 iterations. Note that the correct result is still obtained if i > 2m−1, which is useful for the proposed divider design in the next section.
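As an illustration only, Algorithm DA can be sketched in software as follows. This is a minimal sketch, not the hardware design: it assumes that polynomials are packed into Python integers (bit j holds the coefficient of x^j), and the function name `gf2m_divide` is hypothetical.

```python
def gf2m_divide(a: int, b: int, g: int, m: int) -> int:
    """Return (a / b) mod g in GF(2^m), following Algorithm DA.

    Assumption: polynomials are bit-packed ints, bit j = coefficient of x^j,
    and g is the degree-m irreducible polynomial with g_0 = g_m = 1.
    """
    if b == 0:
        raise ZeroDivisionError("division by zero in GF(2^m)")
    r, s, u, v = b, g, a, 0
    d, f = 2, 1                      # one-hot magnitude d and sign f
    for _ in range(2 * m - 1):
        if r & 1:                    # r_0 = 1
            if f == 1:
                r, s, u, v = r ^ s, r, u ^ v, u
                f = 0
            else:
                r, u = r ^ s, u ^ v
        r >>= 1                      # R(x)/x, the LSB is zero at this point
        if u & 1:                    # (U(x)/x)_G: add G(x) first when u_0 = 1
            u ^= g
        u >>= 1
        if f == 0 and (d & 1) == 0:  # update the degree-difference counter
            d >>= 1
        else:
            d, f = d << 1, 1
    return v
```

For the AES field, whose irreducible polynomial is x^8 + x^4 + x^3 + x + 1 (0x11B), calling `gf2m_divide(1, 0x53, 0x11B, 8)` computes the multiplicative inverse of 0x53.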

## III. PROPOSED SIMD/SISD DIVIDER ARCHITECTURE

To develop our SIMD/SISD divider, we first construct the dependence graph (DG) and base cells of Algorithm DA and decide the digit size according to the size of the subword (the operand size of SIMD division). Then, we develop a digit-serial systolic subword divider, which can output one digit (subword) per cycle, and determine the number of subword dividers to be cascaded to perform SISD division. Hereafter, we use “iteration” to denote an iteration of the division algorithm and “cycle” to denote a clock cycle.

### A. Subword Divider

Fig. 1(a) depicts the two-dimensional DG of the proposed algorithm. Generally speaking, the array consists of 2m−1 rows, which correspond to the iterations. Each row has m+1 cells: one A-cell and m B-cells. The iterative division operations are performed row by row, i.e., the i-th row for the i-th iteration, and the division result $V(x)$ is available at the bottom row after 2m−1 iterations. For clarity, we define four controlling signals, $Ctrl2 \equiv r_0^i$, $Ctrl3 \equiv u_0^i + r_0^i \cdot v_0^i$, $Ctrl4 \equiv r_0^i \cdot f^i$, and $Ctrl5 \equiv f^{i+1}$, so that one iteration can be written as

$$\begin{aligned}
R^{i+1}(x) &\leftarrow (R^i(x) + Ctrl2 \cdot S^i(x))/x, &\quad (1)\\
U^{i+1}(x) &\leftarrow (U^i(x) + Ctrl2 \cdot V^i(x) + Ctrl3 \cdot G(x))/x, &\quad (2)\\
S^{i+1}(x) &\leftarrow \overline{Ctrl4} \cdot S^i(x) + Ctrl4 \cdot R^i(x), &\quad (3)\\
V^{i+1}(x) &\leftarrow \overline{Ctrl4} \cdot V^i(x) + Ctrl4 \cdot U^i(x), &\quad (4)\\
d^{i+1} &\leftarrow \overline{Ctrl5} \cdot (d^i/2) + Ctrl5 \cdot (d^i \cdot 2). &\quad (5)
\end{aligned}$$

Note that the addition is a bit-wise XOR operation, $R(x)/x \equiv (0, r_{m-1}, \dots, r_1)$ is a right shift operation that decreases the degree of polynomial $R(x)$ by one, and the operation $(U^i(x)/x)_G \equiv (U^i(x) + u_0^i \cdot G(x))/x$ computes $u_{j+1}^i + u_0^i \cdot g_{j+1}$ as the j-th coefficient for $0 \le j \le m-1$, where $u_m^i = 0$. For scalability, the (m+1) bits of d are uniformly distributed into the (m+1) cells in a row of the DG, as done in [8].

Fig. 1. (a) Dependence graph with a cut set (w = 4) (b) B-cell. (c) Control circuit in A-cell.

In summary, the B-cell in Fig. 1(b) is employed to implement the operations defined in (1)–(5). The A-cell in Fig. 1(a) works as the right boundary cell for updating the LSBs and generates the four controlling signals Ctrl2, Ctrl3, Ctrl4, and Ctrl5. Thus, the A-cell consists of a B-cell (datapath) and the required control circuit, as shown in Fig. 1(c).
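The row-level view of (1)–(5) can be mimicked in software as a single function that maps the state of iteration i to the state of iteration i+1. The sketch below is illustrative rather than a model of the actual cell circuits: polynomials are bit-packed Python integers, the names `da_row` and `divide_by_rows` are hypothetical, and the expression used for the next sign bit (f·¬r0 OR d0) is an assumption derived here from the branches of Algorithm DA, not taken from Fig. 1(c).

```python
def da_row(r, s, u, v, d, f, g):
    """One DG row: equations (1)-(5) applied to bit-packed polynomials."""
    r0, u0, v0, d0 = r & 1, u & 1, v & 1, d & 1
    ctrl2 = r0
    ctrl3 = u0 ^ (r0 & v0)
    ctrl4 = r0 & f
    # Assumption: next sign bit derived from Algorithm DA (f is cleared on a
    # swap, then set again whenever the magnitude d is about to double).
    f_next = (f & (r0 ^ 1)) | d0
    ctrl5 = f_next
    r_next = (r ^ (s if ctrl2 else 0)) >> 1                        # (1)
    u_next = (u ^ (v if ctrl2 else 0) ^ (g if ctrl3 else 0)) >> 1  # (2)
    s_next = r if ctrl4 else s                                     # (3)
    v_next = u if ctrl4 else v                                     # (4)
    d_next = (d << 1) if ctrl5 else (d >> 1)                       # (5)
    return r_next, s_next, u_next, v_next, d_next, f_next


def divide_by_rows(a, b, g, m):
    """Apply 2m-1 rows starting from the initialization of Algorithm DA."""
    state = (b, g, a, 0, 2, 1)   # (R, S, U, V, d, f)
    for _ in range(2 * m - 1):
        state = da_row(*state, g)
    return state[3]              # V(x) holds the quotient
```

Writing the update as one pure function per row mirrors how each row of cells consumes the previous row's outputs, which is what makes the later cut-set mapping to processing elements possible.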

To develop a subword divider, the DG is mapped to a digit-serial systolic array using Guo and Wang's method [9] with the digit size (w) equal to the operand size plus one. Fig. 1(a) shows an example with w = 4, where the dotted lines denote the cut set. Figs. 2(a), (b), and (c) depict the subword divider and the processing elements PE2 and PE1; each row of base cells in a PE is allocated to accomplish one iteration of Algorithm DA. The PEs can be further pipelined along the dotted lines to increase their operating frequency and throughput. For simplicity, we use the notation “w × t PE” to represent a PE with a digit size of w bits that can carry out t iterations concurrently. This implies that a w × t PE is composed of (w−1) × t B*-cells and t A*-cells. Since subword division requires 2w−3 iterations, PE1 and PE2 are a w × w PE and a w × (w−3) PE, which perform the first w and the following (w−3) iterations, respectively. In this way, the subword divider can output one subword per cycle; therefore, its throughput is (w−1) bits per cycle and its latency is (2w−1) cycles, where latency is defined as the time to complete one division. An A*-cell contains two parts (a B*-cell and the control circuit in Fig. 2(d)); a B*-cell consists of a B-cell in Fig. 1(b) and the flip-flops of the pipeline stages (dotted lines in Fig. 2(b) and (c)) for $r_{j-1}$, $s_j$, $u_{j-1}$, $v_j$, $g_j$, and $d_j$.

Fig. 2. (a) Subword divider. (b) PE2 (w = 4, t = 1). (c) PE1 (w = 4, t = 4). (d) Control circuit in A* cell.

The sequence of the controlling signals depicted in Fig. 2(d) is as follows: (1) the Ctrl1 signal consists of a zero followed by m ones, with the leading zero used to initialize the divider and to sample the values of $r_0$, $u_0 + r_0 v_0$, and f at the beginning of a division operation; (2) initially, we have f = 1 and d = (00…010); (3) Ctrl2–Ctrl5 are the signals defined in the base cell design in Fig. 1. Note that it takes (t+1) cycles for a w-bit digit to go through one w × t PE.
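The sizing rules for a subword divider stated above can be collected into a short calculation. This is only a bookkeeping sketch of the figures quoted in the text (2w−3 iterations, a w × w PE1, a w × (w−3) PE2, and t+1 cycles per w × t PE); the function name and the dictionary keys are hypothetical.

```python
def subword_divider_params(w: int) -> dict:
    """Derive subword divider figures from the digit size w."""
    operand_bits = w - 1                  # digit size = operand size + 1
    iterations = 2 * operand_bits - 1     # 2m - 1 with m = w - 1, i.e. 2w - 3
    pe1_rows, pe2_rows = w, w - 3         # iterations handled by PE1 and PE2
    # A w-bit digit needs (t + 1) cycles to pass through a w x t PE.
    latency_cycles = (pe1_rows + 1) + (pe2_rows + 1)
    return {
        "operand_bits": operand_bits,
        "iterations": iterations,
        "pe1_rows": pe1_rows,
        "pe2_rows": pe2_rows,
        "latency_cycles": latency_cycles,            # equals 2w - 1
        "throughput_bits_per_cycle": operand_bits,   # one subword per cycle
    }


print(subword_divider_params(4))   # the w = 4 example of Fig. 2
print(subword_divider_params(9))   # w = 9, i.e. 8-bit subwords as in AES
```

The two PE row counts always sum to 2w−3, and the two per-PE delays always sum to 2w−1, which is where the stated latency comes from.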

### B. SIMD/SISD Divider

In SIMD applications like AES, multiple subword dividers can be employed to improve throughput. Assuming that N subword dividers are employed, the throughput of the SIMD divider is N × (w−1) bits per cycle. Fig. 3 shows an example with N = 4. For SISD applications like ECC, we adopt the concept in [6] and reuse subword dividers to perform division for a variety of large operand sizes. This implies that a single subword divider with enough restoring buffers for storing temporary values can carry out division of arbitrarily sized operands. Fig. 4 illustrates the dataflow of $R(x)$ when a subword divider with restoring buffers is employed to perform GF(2^35) division. The column under in Fig. 3 shows input data and output data of the jth row of PE1/PE2 in Fig. 2(c)/(b). Temporary data sit idle in the restoring buffers; therefore, N subword dividers can be cascaded to carry out SISD division to increase performance, as depicted in Fig. 3. As a result, N(2w−3) iterations of Algorithm DA can be performed concurrently in the corresponding pipeline stages. The following inequality defines the minimum number of subword dividers that can perform GF(2^m) division without temporary values idling in restoring buffers:

$$\left\lceil{m+1\over w}\right\rceil \le (2w-1)\times N\eqno{\hbox{(6)}}$$

As shown in Fig. 4, the operand is loaded digit by digit, implying that the first stage of PE1 of the first subword divider is available after $\lceil (m+1)/w \rceil$ cycles. The right-hand side of (6) denotes the number of cycles required to output the first temporary value.

Fig. 3. Proposed SIMD/SISD divider (N = 4).
Fig. 4. Data flow of R(x) in a subword divider (N = 1, w = 4, m = 35).

When the constraint in (6) holds, no restoring buffer is needed because the temporary value output by the last of the serially connected subword dividers can be processed immediately. Under this circumstance, the latency of our SISD divider ($L_{\rm SISD}$) is

$$L_{\rm SISD} = \left\lceil{2m-1\over N\cdot (2w-3)}\right\rceil \times N\cdot (2w-1) +\left\lceil{m+1\over w}\right\rceil-1\eqno{\hbox{(7)}}$$

Algorithm DA requires at least 2m−1 iterations; thus, the data stream must go through the cascaded subword dividers $\lceil (2m-1)/(N\cdot(2w-3)) \rceil$ times, as each subword divider completes (2w−3) iterations and each SISD divider has N serially connected subword dividers. It takes N(2w−1) cycles for each digit to pass through the serially connected subword dividers because the latency of each subword divider is (2w−1) cycles. Therefore, the least significant digit is output after $\lceil (2m-1)/(N\cdot(2w-3)) \rceil \times N\cdot(2w-1)$ cycles, and it takes $\lceil (m+1)/w \rceil - 1$ cycles to output the remaining digits. The throughput of the SISD divider is $m/L_{\rm SISD}$.
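Constraint (6) and latency (7) are straightforward to evaluate numerically. The helper names below are hypothetical, and the sketch only restates the two formulas with integer ceiling division; it does not model the pipeline itself.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)


def needs_no_restoring_buffer(m: int, w: int, n: int) -> bool:
    """Constraint (6): operand digits fit within one pass of the cascade."""
    return ceil_div(m + 1, w) <= (2 * w - 1) * n


def sisd_latency_cycles(m: int, w: int, n: int) -> int:
    """Latency (7) in clock cycles; meaningful only when (6) holds."""
    passes = ceil_div(2 * m - 1, n * (2 * w - 3))
    return passes * n * (2 * w - 1) + ceil_div(m + 1, w) - 1


# Example with the configuration used later in the paper (w = 9, N = 4).
cycles = sisd_latency_cycles(571, 9, 4)
print(cycles, "cycles,", cycles * 1.54e-3, "us at a 1.54 ns clock period")
```

For m = 571 this gives 1423 cycles, or about 2.19 μs at a 1.54 ns clock period. For the Fig. 4 configuration (m = 35, w = 4, N = 1), `needs_no_restoring_buffer` returns `False`, consistent with that figure using restoring buffers.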

## IV. COMPLEXITY ANALYSIS AND IMPLEMENTATION RESULTS

Table I lists the estimated performance and area requirement of our SIMD/SISD divider design and the related work [7] based on the TSMC 0.18 μm technology, where CPD, T, and L denote critical path delay, throughput, and latency, respectively. The unit of area (gates) is the area of a 2-input NAND gate. For a fair comparison, w is 9 in our design since GF(2^8) division is required to implement AES. To increase performance, N is 4, as the maximum field size specified in [4] is 571. For [7], we selected Q = 8 for AES and P = 72 for ECC since PQ must be larger than or equal to 571. The latency and throughput of our work operated in SISD mode are averaged over the field sizes (m) ranging from 113 to 571 in the ECC standard [4]; the detailed timing information is illustrated in Fig. 5. Note that the latency and throughput in Fig. 5 are CPD × $L_{\rm SISD}$ and m/(CPD × $L_{\rm SISD}$), respectively, with w = 9 and N = 4.

Fig. 5. Latency and throughput of SISD division for different field sizes.
TABLE I Comparisons of Area and Time Complexity

Experimental results show that our design has a smaller area and higher throughput for SIMD applications. In SISD applications, ours exhibits slightly larger latency in the complexity analysis. However, since our design is based on the digit-serial systolic array architecture, the length of global control signals is the same as the digit size, where the length of global control signals is defined as the maximum number of base cells driven by the same global control signal. This implies that the proposed divider has a smaller wire delay, because the length of global control signals in the semi-systolic array of Lim's work [7] is 576. The smaller wire delay can compensate for the timing overhead in SISD applications. Moreover, the scalable design can be easily extended to handle larger field sizes for better security.

The developed SIMD/SISD divider design was coded in the Verilog hardware description language and then synthesized using Synopsys tools based on the TSMC 0.18 μm library. In our experiment, the SIMD/SISD divider can operate at 649 MHz (1.54 ns). The reported area is about 43 k gates. The throughput in SIMD applications is 20.7 Gbps, which corresponds to N × (w−1) = 32 bits per cycle at 649 MHz. The time required to complete one division increases with the field size m. For example, it takes about 0.47 μs and 2.2 μs to complete one GF(2^131) and one GF(2^571) division, respectively.

## V. CONCLUSION

This paper presented a SIMD/SISD divider based on the digit-serial systolic array architecture. The proposed divider achieves high throughput when operated in SIMD mode. When applied to SISD applications, the proposed divider is scalable; thus, it can deal with operands of any size if there is enough memory space to store temporary values. Complexity analysis shows that the proposed divider exhibits smaller area and higher throughput in SIMD applications, and that it can alleviate the large wire delay resulting from global control signals when operated in scalable SISD mode. Finally, the mapping technique presented in this work can be easily extended to other arithmetic operations, such as multiplication, to develop a SIMD/SISD multiplier.

## Footnotes

Wen-Ching Lin and Ming-Der Shieh are with the Department of Electrical Engineering, National Cheng Kung University, No.1, Ta-Hsueh Road, Tainan 70101, Taiwan (shiehm@mail.ncku.edu.tw).

Chien-Ming Wu is with the Chip Implementation Center (CIC), National Applied Research Laboratories, Hsinchu 300, Taiwan, ROC.

## References

1. Advanced Encryption Standard (AES), Federal Information Processing Standards Publication, 2001.
2. Standard Specifications for Public Key Cryptography, IEEE Std-1363-2000, Jan. 2000.
3. Y. Eslami, A. Sheikholeslami, P. G. Gulak, S. Masui, and K. Mukaida, "An area-efficient universal cryptography processor for smart cards," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 14, pp. 43–56, Jan. 2006.
4. Certicom Corporation, The Basics of ECC, 2006, http://www.certicom.com/index.php?action=res,_eccfaq
5. C. H. Wu, C. M. Wu, M. D. Shieh, and Y. T. Hwang, "High-speed, low-complexity systolic designs of novel iterative division algorithms in GF(2^m)," IEEE Trans. Computers, vol. 53, pp. 375–380, Mar. 2004.
6. A. F. Tenca and C. K. Koc, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Trans. Computers, vol. 52, no. 9, pp. 1215–1221, Sep. 2003.
7. W. M. Lim and M. Benaissa, "Design space exploration of a hardware-software co-designed GF(2^m) Galois field processor for forward error correction and cryptography," Proc. 1st IEEE/ACM/IFIP Int'l Conf. Hardware/Software Codesign & System Synthesis, 2003, pp. 53–58.
8. Z. Yan and D. V. Sarwate, "New systolic architectures for inversion and division in GF(2^m)," IEEE Trans. Computers, vol. 52, pp. 1514–1519, Nov. 2003.
9. J. H. Guo and C. L. Wang, "Novel digit-serial systolic array implementation of Euclid's algorithm for division in GF(2^m)," Proc. IEEE Int'l Symp. Circuits and Systems, 1998, pp. 478–481.

This paper appears in: International Symposium on Circuits and Systems

Issue Date: 2009

On page(s): 25 - 28

Print ISBN: 978-1-4244-3827-3

INSPEC Accession Number: 10760297

Digital Object Identifier: 10.1109/ISCAS.2009.5117676

Date of Current Version: 26 Jun, 2009
