The explosive growth in data communications and internet services has made cryptography an important research topic for meeting the needs of confidentiality, authentication, data integrity, and/or non-repudiation. Generally, private key algorithms [1] have the benefit of a high throughput rate and are thus suitable for data communications. In contrast, public key algorithms, which have a much lower throughput rate, are required for private key exchange and authentication [2]. This implies that a flexible cryptographic processor should be capable of handling both private key and public key algorithms [3].

This work is focused on the applications involving Advanced Encryption Standard (AES) [1] and Elliptic Curve Cryptography (ECC) [2], [4]. The former is now widely used for secured data communications. The latter offers the most security per bit compared with other public key cryptosystems such as the RSA cryptosystem [2], [4], making it suitable for applications with constrained resources. Moreover, both AES and ECC use arithmetic in the Galois field GF(2^{m}), which is more suitable for fast and compact hardware realization than a prime field GF(p) because no carry propagation exists in GF(2^{m}). Of the basic operations in GF(2^{m}), division is the most complicated and is conventionally replaced by a series of multiplications. A high-speed divider design would be very helpful if it can operate faster than the equivalent multiplication sequence.

This paper proposes a flexible divider design in GF(2^{m}), which can be configured to operate in either SIMD (single instruction multiple data) or SISD (single instruction single data) mode. Starting from the high-speed divider developed in our previous work [5], this work shows how to increase its hardware utilization and reduce its hardware overhead by maximizing the resource sharing between AES and ECC. Thus, the proposed divider not only operates at very high speed but is also area-efficient. When applied to SIMD applications, the divider can perform multiple divisions in parallel and output results every cycle, so it can be applied to AES cryptosystems aiming at high throughput. In SISD applications, the divider is scalable and can handle different operand sizes such as those specified in ECC standards. As a result, the proposed divider is suitable for wireless applications such as PDAs and smart cards, which demand secured data communication on devices with limited resources.

The difficulty of hardware sharing between AES and ECC comes from the considerable discrepancy of the public and private key algorithms. For example, when analyzing AES algorithms, one can observe that the 128-bit input data of AES is considered as a 4 × 4 array of 8-bit bytes [1], which can be seen as a kind of vector architecture. Thus, the SIMD design can be applied to speed up its computation. In contrast, the operands of public key cryptosystem must be large enough for security consideration, and the operand sizes may be different as specified in existing standards [2], [4]. Thus, a scalable design, which can handle division of arbitrarily sized operands if there is enough memory capacity [6], is usually preferred in such applications.

Lim [7] proposed SIMD/SISD ALU architectures over GF(2^{m}). The ALU consists of *P* *Q*-bit subword dividers; thus, it can perform *P* GF(2^{n}) arithmetic computations (*n* ≤ *Q*) in parallel or one GF(2^{n}) arithmetic operation (*Q* < *n* ≤ *PQ*). This implies that their development is not scalable and has large area cost because *P* must be large enough for ECC application with a large key size (up to 571-bit), assuming that *Q* = 8 for AES algorithms. Moreover, the potential problem is the large wire delay of high fan-out control signals because the design is based on the semi-systolic array architecture. This paper proposes a novel SIMD/SISD divider based on the digit-serial systolic array architecture. The complexity comparisons with related work [7] show the proposed divider has a smaller area and higher throughput in SIMD applications. Finally, the technique developed in this work can be easily extended to design SIMD/SISD multipliers.

This paper is organized as follows. Section II briefly reviews the background and notation used in this work. Section III describes the proposed SIMD/SISD divider architecture. It is followed by the complexity analysis and implementation results in Section IV. Finally, we give our conclusion in Section V.

SECTION II

## BACKGROUND AND NOTATION

Let *G*(*x*) be an irreducible polynomial in GF(2^{m}) expressed as *G*(*x*) = *g*_{m} *x*^{m}+ … + *g*_{1} *x* + *g*_{0}, where *g*_{0} = *g*_{m} = 1 and *g*_{j}∊ {0, 1} for *j* = 1,…, *m*−1, or an equivalent vector form *G* = (*g*_{m},…, *g*_{1}, *g*_{0}). Any element *A*(*x*) ∊ GF(2^{m}) can then be uniquely represented as *A*(*x*) = *a*_{m−1} *x*^{m−1}+ … + *a*_{1} *x* + *a*_{0} where *a*_{j}∊ {0, 1} for *j* = 0, …, *m*−1, or *A* = (*a*_{m−1}, …, *a*_{1}, *a*_{0}). We next review our previous work [5] to obtain *V*(*x*) = (*A*(*x*)/*B*(*x*))_{G}, where (*C*(*x*))_{G} denotes the operation *C*(*x*) mod *G*(*x*).

**Algorithm** *DA:* The Division Algorithm [5]

**Initialization:** (*R*(*x*), *S*(*x*), *U*(*x*), *V*(*x*)) ← (*B*(*x*), *G*(*x*), *A*(*x*), 0), (*d*, *f*) ← (2, 1)

**Result:** *V*(*x*) ≡ (*A*(*x*)/*B*(*x*))_{G}

**Algorithm:** for *i* = 1 to 2*m*−1 do

if *r*_{0} = 1 then

if *f* = 1 then

(*R*(*x*), *S*(*x*), *U*(*x*), *V*(*x*)) ← (*R*(*x*) + *S*(*x*), *R*(*x*), *U*(*x*) + *V*(*x*), *U*(*x*))

*f* ← 0

else

(*R*(*x*), *U*(*x*)) ← (*R*(*x*) + *S*(*x*), *U*(*x*) + *V*(*x*))

end if

end if

(*R*(*x*), *U*(*x*)) ← (*R*(*x*)/*x*, (*U*(*x*)/*x*)_{G})

if *f* = 0 and *d*_{0} = 0 then *d* ← *d*/2

else (*d*, *f*) ← (*d* · 2, 1)

end for

In Algorithm *DA*, two auxiliary variables, *U*(*x*) and *V*(*x*), accompany the dividend *S*(*x*) and divisor *R*(*x*) to determine the value of (*S*/*R*)_{G}. The one-hot encoded counter, (*d*, *f*), represents the difference between the upper bounds on the degrees of *R*(*x*) and *S*(*x*), deg(*R*) and deg(*S*), where deg(*C*) denotes the degree of *C*(*x*). *f* ∊ {0, 1} is the sign and *d* ∊ {1, 2, 2^{2}, …, 2^{m}} is an (*m* + 1)-bit vector that represents the magnitude. The index *i* counts the number of iterations, which implies that the division algorithm converges in 2*m*−1 iterations. Note that the correct result can still be obtained if *i* > 2*m*−1, which is useful for the proposed divider design in the next section.
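
As an illustration, Algorithm *DA* can be sketched in software. The following is a minimal bit-level model, not the hardware design: polynomials are packed into Python integers (bit *j* holds the coefficient of *x*^{j}), and the function name `gf2m_divide` and the `extra_iterations` parameter are hypothetical names introduced here. It assumes *G*(*x*) is irreducible and *B*(*x*) ≠ 0.

```python
def gf2m_divide(a: int, b: int, g: int, m: int, extra_iterations: int = 0) -> int:
    """Return (a / b) mod g in GF(2^m), following the steps of Algorithm DA.

    a, b: field elements packed as integers (bit j = coefficient of x^j).
    g:    irreducible polynomial including the x^m term (m + 1 bits).
    extra_iterations: hypothetical knob to run more than 2m-1 iterations.
    """
    if b == 0:
        raise ZeroDivisionError("divisor B(x) must be non-zero")
    r, s, u, v = b, g, a, 0
    d, f = 2, 1  # one-hot magnitude and sign of the degree-bound difference
    for _ in range(2 * m - 1 + extra_iterations):
        if r & 1:                      # r_0 = 1
            if f == 1:
                # Swap-and-add branch: S and V take the old R and U.
                r, s, u, v = r ^ s, r, u ^ v, u
                f = 0
            else:
                r, u = r ^ s, u ^ v
        r >>= 1                        # R(x)/x, exact because r_0 is now 0
        if u & 1:                      # (U(x)/x)_G: add G first if u_0 = 1
            u ^= g
        u >>= 1
        if f == 0 and (d & 1) == 0:    # counter update
            d >>= 1
        else:
            d, f = d << 1, 1
    return v


# Inverse of {53} in the AES field, G(x) = x^8 + x^4 + x^3 + x + 1 (0x11B).
print(hex(gf2m_divide(1, 0x53, 0x11B, 8)))  # 0xca
```

Running the loop for more than 2*m*−1 iterations leaves *V*(*x*) unchanged in this model, which matches the remark above about *i* > 2*m*−1.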

SECTION III

## PROPOSED SIMD/SISD DIVIDER ARCHITECTURE

To develop our SIMD/SISD divider, we first construct the dependence graph (DG) and base cells of Algorithm *DA* and decide the digit size according to the size of subword (the operand size of SIMD division). Then, we develop a digit-serial systolic subword divider, which can output one digit (subword) per cycle, and determine the number of subword dividers to be cascaded to perform SISD division. Hereafter, we use “*iteration*” to denote the iteration of division algorithm and “*cycle*” to present the clock cycle.

### A. Subword Divider

Fig. 1(a) depicts the two-dimensional DG of the proposed algorithm. Generally speaking, the array consists of 2*m*−1 rows, which correspond to the number of iterations. Each row has *m*+1 cells: one A-cell and *m* B-cells. The iterative division operations are performed row by row, i.e., the *i*-th row for the *i*-th iteration, and the division result *V*(*x*) will be available at the bottom row after 2*m*−1 iterations. For clarity, we define four controlling signals *Ctrl*2 ≡ *r*_{0}^{i}, *Ctrl*3 ≡ *u*_{0}^{i} + *r*_{0}^{i} · *v*_{0}^{i}, *Ctrl*4 ≡ *r*_{0}^{i} · *f*^{i} and *Ctrl*5 ≡ *f*^{i + 1}:
$$\eqalignno{&R^{i+1}(x)\leftarrow (R^i(x) + Ctrl2 \cdot S^i(x))/x,&\hbox{(1)}\cr&U^{i+1}(x)\leftarrow (U^i (x) + Ctrl2\cdot V^i (x) + Ctrl3\cdot G(x))/x,&\hbox{(2)}\cr&S^{i+1}(x)\leftarrow \overline{Ctrl4}\cdot S^i(x) + Ctrl4 \cdot R^i (x),&\hbox{(3)}\cr&V^{i+1}(x)\leftarrow \overline{Ctrl4}\cdot V^i (x) + Ctrl4 \cdot U^i(x),&\hbox{(4)}\cr&d^{i+1} \leftarrow \overline{Ctrl5}\cdot (d^i/2) + Ctrl5\cdot (d^i \cdot 2).&\hbox{(5)}}$$

Note that the addition is a bit-wise XOR operation, *R*(*x*)/*x* ≡ (0, *r*_{m−1}, …, *r*_{1}) is a right shift operation that decreases the degree of polynomial *R*(*x*) by one, and the operation (*U*^{i}(*x*)/*x*)_{G} ≡ (*U*^{i}(*x*) + *u*_{0}^{i} · *G*(*x*))/*x* performs *u*_{j}^{i} ← *u*^{i}_{j+1} + *u*_{0}^{i} · *g*_{j+1} for 0 ≤ *j* ≤ *m*−1, where *u*^{i}_{m} = 0. For scalability, the (*m*+1) bits of *d* are uniformly distributed into the (*m*+1) cells in a row of the DG, as done in [8].

In summary, the B-cell in Fig. 1(b) is employed to implement the operations defined in (1)–(5). The A-cell in Fig. 1(a) works as the right boundary cell for updating the LSBs, and generates the four controlling signals *Ctrl*2, *Ctrl*3, *Ctrl*4 and *Ctrl*5. Thus, the A-cell consists of a B-cell (datapath) and the required control circuit, as shown in Fig. 1(c).
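
To make the role of the controlling signals concrete, one row of the DG can be modeled as a single function that evaluates (1)–(5). This is a word-level sketch under stated assumptions, not a gate-level description of the A-cell or B-cell: the names `da_step`, `divide_via_equations`, and `gf_mul` are hypothetical, and the expression used for *f*^{i+1} is derived here from the counter update in Algorithm *DA* rather than taken from the text.

```python
def da_step(r, s, u, v, d, f, g):
    """One DG row: evaluate equations (1)-(5) from the LSB-derived controls."""
    r0, u0, v0, d0 = r & 1, u & 1, v & 1, d & 1
    ctrl2 = r0
    ctrl3 = u0 ^ (r0 & v0)
    ctrl4 = r0 & f
    # Assumption: f^{i+1} = (f AND NOT r0) OR d0, derived from Algorithm DA.
    ctrl5 = (f & (r0 ^ 1)) | d0
    r_next = (r ^ (s if ctrl2 else 0)) >> 1                        # (1)
    u_next = (u ^ (v if ctrl2 else 0) ^ (g if ctrl3 else 0)) >> 1  # (2)
    s_next = r if ctrl4 else s                                     # (3)
    v_next = u if ctrl4 else v                                     # (4)
    d_next = (d << 1) if ctrl5 else (d >> 1)                       # (5)
    return r_next, s_next, u_next, v_next, d_next, ctrl5


def divide_via_equations(a, b, g, m):
    """Apply 2m-1 rows to the initial state (B, G, A, 0) with (d, f) = (2, 1)."""
    r, s, u, v, d, f = b, g, a, 0, 2, 1
    for _ in range(2 * m - 1):
        r, s, u, v, d, f = da_step(r, s, u, v, d, f, g)
    return v


def gf_mul(x, y, g, m):
    """Shift-and-add multiplication mod g, used here only to check quotients."""
    acc = 0
    for _ in range(m):
        if y & 1:
            acc ^= x
        y >>= 1
        x <<= 1
        if (x >> m) & 1:
            x ^= g
    return acc


# Check V * B = A for every A and non-zero B in GF(2^4), G(x) = x^4 + x + 1.
G4, M4 = 0b10011, 4
ok = all(gf_mul(divide_via_equations(a, b, G4, M4), b, G4, M4) == a
         for a in range(16) for b in range(1, 16))
print(ok)  # True
```

Because every control is a function of the least significant bits and the sign *f* only, the sketch reflects why a single boundary cell per row can generate all of them.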

To develop a subword divider, the DG is mapped to a digit-serial systolic array using Guo and Wang's method [9] with the digit size (*w*) equal to the operand size plus one. Fig. 1(a) shows an example with *w* = 4, where the dotted lines denote the cut set. Figs. 2(a), (b), and (c) depict the subword divider and processing elements, PE_{2} and PE_{1}; each row of base cells in PEs is allocated to accomplish an iteration of Algorithm *DA*. The PEs can be further pipelined along the dotted lines to increase their operating frequency and throughput. For simplicity, we use the notation “*w* × *t* PE” to represent a PE with a digit size of *w* bits, which can carry out *t* iterations concurrently. This implies that a *w* × *t* PE is composed of (*w*−1) × *t* B*-cells and *t* A*-cells. Since subword division requires 2*w*−3 iterations, PE_{1} and PE_{2} are a *w* × *w* PE and a *w* × (*w*−3) PE, which perform the first *w* and the following (*w*−3) iterations, respectively. In this way, the subword divider can output one subword per cycle; therefore, its throughput is (*w*−1) bits per cycle and its latency is (2*w*−1) cycles, where latency is defined as the time to complete one division. An A*-cell contains two parts (a B*-cell and the control circuit in Fig. 2(d)); a B*-cell consists of a B-cell in Fig. 1(b) and the flip-flops of the pipeline stages (dotted lines in Fig. 2(b) and (c)) for *r*_{j−1}, *s*_{j}, *u*_{j−1}, *v*_{j}, *g*_{j} and *d*_{j}.

The sequence of the controlling signals as depicted in Fig. 2(d) is as follows: (1) the *Ctrl*1 signal consists of a zero followed by *m* ones with the leading zero used to initialize the divider and to sample the values of *r*_{0}, *u*_{0} + *r*_{0} *v*_{0}, and *f* at the beginning of a division operation; (2) initially, we have *f* = 1 and *d* = (00…010); (3) *Ctrl*2–*Ctrl*5 are signals defined in the base cell design in Fig. 1. Note that it takes (*t*+1) cycles for a *w*-bit digit to go through one *w* × *t* PE.
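
The sizing rules of this subsection reduce to simple arithmetic in *w*. The sketch below collects them in one place; the function name `subword_divider_params` and the dictionary keys are hypothetical, and the cell counts simply apply the (*w*−1) × *t* and *t* rule to both PEs.

```python
def subword_divider_params(w: int) -> dict:
    """Derive subword-divider figures from the digit size w (operand size + 1)."""
    operand_bits = w - 1
    iterations = 2 * operand_bits - 1          # 2m-1 with m = w-1, i.e. 2w-3
    pe1_rows, pe2_rows = w, w - 3              # w x w PE and w x (w-3) PE
    # A w-bit digit needs (t+1) cycles to pass through a w x t PE.
    latency_cycles = (pe1_rows + 1) + (pe2_rows + 1)
    total_rows = pe1_rows + pe2_rows
    return {
        "operand_bits": operand_bits,
        "iterations": iterations,
        "pe1_rows": pe1_rows,
        "pe2_rows": pe2_rows,
        "latency_cycles": latency_cycles,      # equals 2w-1
        "throughput_bits_per_cycle": operand_bits,
        "b_star_cells": (w - 1) * total_rows,
        "a_star_cells": total_rows,
    }


# Digit size used for GF(2^8) division in AES: w = 9.
params = subword_divider_params(9)
print(params["iterations"], params["latency_cycles"])  # 15 17
```

For *w* = 9 the two PEs together cover 9 + 6 = 15 iterations, which is exactly the 2*w*−3 iterations that a GF(2^{8}) division requires.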

### B. SIMD/SISD Divider

In SIMD applications like AES, multiple subword dividers can be employed to improve throughput. Assuming that *N* subword dividers are employed, the throughput of the SIMD divider is *N* × (*w*−1) bits per cycle. Fig. 3 shows an example of *N* = 4. For SISD applications like ECC, we adopt the concept in [6] to reuse subword dividers to perform division for a variety of large operand sizes. This implies that a single subword divider with enough restoring buffers for storing temporary values can carry out division of arbitrarily sized operands. Fig. 4 illustrates the dataflow of *R*(*x*) when a subword divider with restoring buffers is employed to perform GF(2^{35}) division. The column under in Fig. 3 shows input data and output data of the *j*^{th} row of PE_{1}/PE_{2} in Fig. 2(c)/(b). Temporary data are idle in restoring buffers; therefore, *N* subword dividers can be cascaded to carry out SISD division to increase performance, as depicted in Fig. 3. As a result, *N*(2*w*−3) iterations of Algorithm *DA* can be performed in corresponding pipeline stages concurrently. The following inequality defines the minimum number of subword dividers that can perform GF(2^{m}) division without idle temporary values in restoring buffers,
$$\left\lceil{m+1\over w}\right\rceil \le (2w-1)\times N\eqno{\hbox{(6)}}$$

As shown in Fig. 4, the operand is loaded digit by digit, implying that the first stage of PE_{1} of the first subword divider is available after ⌈(*m*+1)/*w*⌉ cycles. The right-hand side of (6) denotes the required cycles to output the first temporary value.

When the constraint in (6) holds true, no restoring buffer is needed because the temporary value outputted from the last one of the serially connected subword dividers can be processed immediately. Under this circumstance, the latency of our SISD divider (*L*_{SISD}) is:
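
As a quick numerical check of (6), the smallest *N* that satisfies the constraint can be computed directly. This is a sketch of the inequality only; the helper names `num_digits` and `min_subword_dividers` are hypothetical.

```python
def num_digits(m: int, w: int) -> int:
    """Number of w-bit digits needed to load an (m+1)-bit operand: ceil((m+1)/w)."""
    return -(-(m + 1) // w)


def min_subword_dividers(m: int, w: int) -> int:
    """Smallest N with ceil((m+1)/w) <= (2w-1) * N, i.e. constraint (6)."""
    return max(1, -(-num_digits(m, w) // (2 * w - 1)))


# w = 9 (GF(2^8) subwords) and the largest field size m = 571.
print(num_digits(571, 9), min_subword_dividers(571, 9))  # 64 4
```

With *w* = 9, three cascaded subword dividers give 3 × 17 = 51 cycles, which is less than the 64 digits of a GF(2^{571}) operand, so four are needed before the restoring buffers can be dropped.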
$$L_{\rm SISD} = \left\lceil{2m-1\over N\cdot (2w-3)}\right\rceil \times N\cdot (2w-1) +\left\lceil{m+1\over w}\right\rceil-1\eqno{\hbox{(7)}}$$

Algorithm *DA* requires at least 2*m*−1 iterations; thus, the data stream must go through the cascaded subword dividers ⌈(2*m*−1)/(*N*(2*w*−3))⌉ times, as each subword divider completes (2*w*−3) iterations and each SISD divider has *N* serially connected subword dividers. It takes *N*(2*w*−1) cycles for each digit to pass through the serially connected subword dividers because the latency of each subword divider is (2*w*−1) cycles. Therefore, the least significant digit is outputted after ⌈(2*m*−1)/(*N*(2*w*−3))⌉ × *N*(2*w*−1) cycles, and it takes ⌈(*m*+1)/*w*⌉ − 1 cycles to output the remaining digits. The throughput of the SISD divider is *m*/*L*_{SISD}.
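
Equation (7) can be evaluated term by term. The sketch below mirrors the three quantities named in the explanation (number of passes, cycles per pass, and cycles to drain the remaining digits); the function name `sisd_latency` is hypothetical, and the code assumes constraint (6) holds for the chosen *m*, *w*, and *N*.

```python
def sisd_latency(m: int, w: int, n: int) -> int:
    """Latency in cycles of the SISD divider, evaluated from equation (7)."""
    digits = -(-(m + 1) // w)                      # ceil((m+1)/w)
    if digits > (2 * w - 1) * n:
        raise ValueError("constraint (6) violated: restoring buffers needed")
    passes = -(-(2 * m - 1) // (n * (2 * w - 3)))  # trips through the cascade
    cycles_per_pass = n * (2 * w - 1)              # N dividers, 2w-1 cycles each
    return passes * cycles_per_pass + digits - 1


# w = 9, N = 4, field size m = 571.
latency = sisd_latency(571, 9, 4)
print(latency)                      # 1423
print(round(571 / latency, 3))      # 0.401 bits per cycle (m / L_SISD)
```

For *m* = 571 the data stream makes 20 passes of 68 cycles each, and the remaining 63 digits follow the least significant digit out of the last subword divider.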

SECTION IV

## COMPLEXITY ANALYSIS AND IMPLEMENTATION RESULTS

Table I lists the estimated performance and area requirement of our SIMD/SISD divider design and the related work [7] based on the TSMC 0.18 μm technology, where *CPD*, *T*, and *L* denote critical path delay, throughput, and latency, respectively. The unit of area (gates) is the area of a 2-input NAND gate. For fair comparison, *w* is 9 in our design since GF(2^{8}) division is required to implement AES. To increase performance, *N* is 4 as the maximum field size specified in [4] is 571. For [7], we selected *Q* = 8 for AES and *P* = 72 for ECC since *PQ* must be larger than or equal to 571. The latency and throughput of our work operated in SISD mode are averaged over the field sizes (*m*) ranging from 113 to 571 in the ECC standard [4]; the detailed timing information is illustrated in Fig. 5. Note that the latency and throughput in Fig. 5 are *CPD* × *L*_{SISD} and *m*/(*CPD* × *L*_{SISD}) with *w* = 9 and *N* = 4, respectively.

Experimental results show that our design has a smaller area and higher throughput for SIMD applications. In SISD applications, ours exhibits slightly larger latency in the complexity analysis. However, since our design is based on the digit-serial systolic array architecture, the length of global control signals is the same as the digit size. This implies that the proposed divider has a smaller wire delay because the length of global control signals based on the semi-systolic array in Lim's work [7] is 576. The length of global control signals is defined as the maximum number of base cells driven by the same global control signal. The smaller wire delay can compensate for the timing overhead in SISD applications. Moreover, the scalable design can be easily extended to handle larger field sizes for better security.

The developed SIMD/SISD divider design was coded in the Verilog hardware description language and then synthesized using Synopsys tools based on the TSMC 0.18 μm library. In our experiment, the SIMD/SISD divider can operate at 649 MHz (1.54 ns). The reported area is about 43 k gates. The throughput in SIMD mode is 20.7 Gbps, i.e., *N* × (*w*−1) = 32 bits per cycle at 649 MHz. In SISD mode, the time required to complete one division increases with the field size *m*. For example, it takes about 0.47 μs and 2.2 μs to complete one GF(2^{131}) and GF(2^{571}) division, respectively.

SECTION V

## CONCLUSION

This paper presented a SIMD/SISD divider based on the digit-serial systolic array architecture. The proposed divider can obtain high throughput when operated in SIMD mode. When applied to SISD applications, the proposed divider is scalable; thus, it can deal with operands of any size if there is enough memory space to store temporary values. Complexity analysis shows that the proposed divider exhibits smaller area and higher throughput in SIMD applications, and can relax the large wire delay resulting from the global control signals when operated in scalable SISD mode. Finally, the mapping technique presented in this work can be easily extended to other arithmetic operations, such as multiplication, to develop a SIMD/SISD multiplier.