Efficient Bit-Parallel Multiplier for All Trinomials Based on n-Term Karatsuba Algorithm

Recently, hybrid multiplication schemes over the binary extension field <inline-formula> <tex-math notation="LaTeX">$GF(2^{m})$ </tex-math></inline-formula> based on <inline-formula> <tex-math notation="LaTeX">$n$ </tex-math></inline-formula>-term Karatsuba algorithm (KA) have been proposed for irreducible trinomials. Their complexities depend on a decomposition of <inline-formula> <tex-math notation="LaTeX">$m$ </tex-math></inline-formula> and the choice of a generation polynomial. However, these multipliers have some limitations on a decomposition of <inline-formula> <tex-math notation="LaTeX">$m$ </tex-math></inline-formula> or generation polynomial <inline-formula> <tex-math notation="LaTeX">$x^{m}+x^{k}+1$ </tex-math></inline-formula> such that <inline-formula> <tex-math notation="LaTeX">$m\geq 2k$ </tex-math></inline-formula>. In this paper, we loosen such limited conditions. We present a new hybrid bit-parallel multiplier based on <inline-formula> <tex-math notation="LaTeX">$n$ </tex-math></inline-formula>-term KA for any irreducible trinomial <inline-formula> <tex-math notation="LaTeX">$x^{m}+x^{k}+1\,\,(0< k< m)$ </tex-math></inline-formula>, where <inline-formula> <tex-math notation="LaTeX">$m$ </tex-math></inline-formula> is decomposed as <inline-formula> <tex-math notation="LaTeX">$m=nm_{0}+r$ </tex-math></inline-formula> with <inline-formula> <tex-math notation="LaTeX">$0< r< m_{0}$ </tex-math></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$1< n$ </tex-math></inline-formula>. (Here, various values for <inline-formula> <tex-math notation="LaTeX">$n,m_{0}$ </tex-math></inline-formula> and <inline-formula> <tex-math notation="LaTeX">$r$ </tex-math></inline-formula> may be chosen.) 
To this end, we generalize the previously proposed multiplication scheme for <inline-formula> <tex-math notation="LaTeX">$x^{nm_{0}+1}+x^{k}+1$ </tex-math></inline-formula> to <inline-formula> <tex-math notation="LaTeX">$x^{nm_{0}+r}+x^{k}+1$ </tex-math></inline-formula>. We evaluate the explicit complexity of the proposed multiplier. Specific comparisons show that the proposed multiplier achieves the lowest space complexity with the same or lower time complexity among hybrid multipliers. Compared to the fastest multipliers, the delay of the proposed multiplier is only one <inline-formula> <tex-math notation="LaTeX">$T_{X}$ </tex-math></inline-formula> higher, while its space complexity is roughly 40% lower, where <inline-formula> <tex-math notation="LaTeX">$T_{X}$ </tex-math></inline-formula> is the delay of one 2-input XOR gate.


I. INTRODUCTION
Efficient hardware implementations of arithmetic in the binary extension field $GF(2^m)$ are desired in various areas such as elliptic curve cryptography, computer algebra, and error-correcting codes ([1]-[3]). For example, the main operation in elliptic curve cryptography is a point multiplication $kP$, where $k$ is an integer and $P$ is a point on an elliptic curve. To implement such point multiplication efficiently in hardware, efficient finite field arithmetic is required. In particular, field multiplication is considered the most important building block. For this reason, a number of multiplication algorithms have been presented. The space and time complexities are the two major measures of the efficiency of a multiplication algorithm. The space complexity is represented as the total number of XOR gates and AND gates used. The corresponding time complexity is defined as the total delay of the circuit implementing the multiplier, which is expressed in terms of $T_X$ (the delay of an XOR gate) and $T_A$ (the delay of an AND gate). Such complexities mainly depend on the choice of the field basis and the generation polynomial. Multiplier designs over $GF(2^m)$ defined by an irreducible trinomial using the polynomial basis (PB) usually have lower complexities than other choices.
The associate editor coordinating the review of this manuscript and approving it for publication was Ailong Wu.
Multiplication over $GF(2^m)$ using the PB is often accomplished in two steps: polynomial multiplication and modular reduction. Many efforts have been made to perform these two steps efficiently. For instance, divide-and-conquer approaches such as the Karatsuba algorithm (KA), the Winograd short convolution algorithm, and the Chinese remainder theorem are used to reduce the space complexity of the polynomial multiplication step ([4]-[6]). However, such methods generally increase the time complexity. For an efficient modular reduction step, a shifted polynomial basis (SPB), which is a variation of the polynomial basis, is adopted in [7], [8]. Mastrovito [9] proposed the Mastrovito approach, which combines the above two steps to reduce the time complexity.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
The resulting bit-parallel multipliers may be classified into three categories on the basis of their space complexity: quadratic, subquadratic, and hybrid multipliers. Quadratic multipliers ordinarily implement the product of two polynomials directly ([8], [10], [11]). On the other hand, subquadratic multipliers usually apply a divide-and-conquer approach recursively ([4], [5], [12]). Quadratic multipliers have the lowest time complexity but the largest space complexity, whereas subquadratic multipliers have the lowest space complexity but the largest time complexity. For practical applications, hybrid approaches have been proposed, which provide a trade-off between the time and space complexities. These approaches perform a few subquadratic iterations and then a quadratic step on small input operands.
In this paper, we consider a hybrid bit-parallel multiplier for $GF(2^m)$ defined by an irreducible trinomial $x^m+x^k+1$ $(0<k<m)$. The fastest bit-parallel multiplication architectures for trinomials were proposed in [8], [11]. These are quadratic multipliers with the following complexity: $m^2$ AND gates, $m^2-1$ XOR gates, and delay $T_A+(1+\lceil\log_2 m\rceil)T_X$ (for a good field, the delay is $T_A+\lceil\log_2 m\rceil T_X$). There are many hybrid multipliers for trinomials ([6], [13]-[16]). They reduce the space complexity by at most about 25% compared to the fastest bit-parallel multipliers, but incur several additional XOR gate delays.
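As a quick numerical check of the figures quoted above, the following minimal Python sketch evaluates the gate counts and the number of XOR levels of the fastest quadratic multipliers for a given $m$. The names `ceil_log2`, `quadratic_multiplier_cost`, and the `good_field` flag are hypothetical and introduced only for illustration; the criterion for a "good field" is not restated in this section, so it is left as a caller-supplied flag.

```python
def ceil_log2(value: int) -> int:
    """Return ceil(log2(value)) for a positive integer, using exact integer arithmetic."""
    return (value - 1).bit_length()


def quadratic_multiplier_cost(m: int, good_field: bool = False) -> dict:
    """Cost of the fastest quadratic multipliers as quoted in the text.

    The delay is T_A plus `xor_levels` times T_X.
    `good_field` is an assumed flag: the text only says the delay drops by one T_X
    for a good field, without restating the condition here.
    """
    xor_levels = ceil_log2(m) + (0 if good_field else 1)
    return {"and_gates": m * m, "xor_gates": m * m - 1, "xor_levels": xor_levels}


# Example with m = 233 (value chosen only for illustration).
cost_233 = quadratic_multiplier_cost(233)
print(cost_233)
```

The integer-based `ceil_log2` avoids floating-point rounding issues when $m$ is a power of two.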
In recent years, Li et al. [17] proposed a hybrid Karatsuba-based multiplier for a special class of trinomials $x^m+x^k+1$ with $m=nk$. The multiplier applies the $n$-term KA once in the polynomial multiplication to reduce the space complexity. It also uses the SPB and the Mastrovito approach to keep the time complexity low. Its lowest asymptotic space complexity is $O(\frac{1}{2}m^2+m^{3/2})$ circuit gates, while its delay is only one $T_X$ more than that of the fastest multipliers. However, irreducible trinomials $x^{nk}+x^k+1$ are not abundant. Park et al. [18] extended the multiplication scheme in [17] to trinomials $x^m+x^k+1$ such that $m$ is decomposed as $m=nm_0$ or $m=nm_0+1$ with $n,m_0>1$. The complexity of this multiplier depends on the choice of the values $n$, $m_0$, and $k$. Although an arbitrary integer $m$ can be decomposed as $m=nm_0$ or $m=nm_0+1$ with $n,m_0>1$, the options for the values of $n$ and $m_0$ are limited. This limitation on the decomposition of $m$ is loosened in [19]. Li et al. [19] use an alternative approach to modular reduction instead of the Mastrovito approach and present an $n$-term Karatsuba multiplier for a trinomial $x^m+x^k+1$, where $m=nm_0+r$ with $0\le r<n,m_0$. However, the trinomial must satisfy $m\ge 2k$. The space complexity of the multiplier roughly matches that of [18], while its time complexity is slightly higher. Nevertheless, the flexible parameters $n$, $m_0$, $r$ may yield lower space or time complexities than [18].
In this paper, we further ease the limitations on the trinomials and on the decomposition of $m$ in [18], [19]. We present a hybrid multiplier for any irreducible trinomial $x^m+x^k+1$ $(0<k<m)$ with a decomposition $m=nm_0+r$, where $0<r<m_0$ and $1<n$. (Since the case $r=0$ is already dealt with in [18, Section III], we do not address it.) To this end, we generalize the multiplication scheme for the trinomial $x^{nm_0+1}+x^k+1$ in [18], which combines the $n$-term KA and the Mastrovito approach, to $x^{nm_0+r}+x^k+1$. We evaluate the explicit space and time complexities of the proposed multiplier. The complexity of the proposed multiplier is comparable with those in [18] and [19]. However, more flexible options for the trinomial and for the values of $n$, $m_0$, and $r$ may yield better complexities. We show this by giving specific comparisons for several fields, including those recommended by NIST. The proposed multiplier achieves the lowest space complexity with the same or lower time complexity among hybrid multipliers. Compared to the fastest multipliers, the delay of the proposed multiplier is one $T_X$ higher, while its space complexity is roughly 40% lower.
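To make the flexibility of the decomposition concrete, the following minimal Python sketch enumerates every triple $(n, m_0, r)$ with $m=nm_0+r$, $0<r<m_0$, and $1<n$. The function name `decompositions` is hypothetical; the sketch only lists the admissible parameters and does not rank them by complexity.

```python
def decompositions(m: int) -> list:
    """List all (n, m0, r) with m = n*m0 + r, 1 < n and 0 < r < m0.

    For a fixed m0 the constraint 0 < r < m0 forces r = m mod m0 and n = m // m0,
    so it is enough to scan the block size m0.
    """
    found = []
    for m0 in range(2, m):
        n, r = divmod(m, m0)
        if n > 1 and r > 0:
            found.append((n, m0, r))
    return found


# Degree 18 is the one used in the worked example of this paper.
options_18 = decompositions(18)
print(options_18)
```

For $m=18$ the list contains $(n,m_0,r)=(3,5,3)$, the parameters used in Example 1.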
The rest of this paper is organized as follows. In Section II, we introduce some notations and review the multiplication scheme for the trinomial $x^{nm_0+1}+x^k+1$ presented in [18]. In Section III, we generalize the multiplication scheme for $x^{nm_0+1}+x^k+1$ to any trinomial $x^{nm_0+r}+x^k+1$ and evaluate the explicit complexity of the presented multiplier. Section IV gives a comparison of the proposed multiplier with the best-known similar multipliers for trinomials. Finally, some conclusions are drawn.

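II. MULTIPLICATION SCHEME FOR TRINOMIAL $x^{nm_0+1}+x^k+1$ BASED ON $n$-TERM KA
Let the binary extension field $GF(2^m)$ be generated by an irreducible trinomial $x^m+x^k+1$ $(0<k<m)$. In order to represent field elements of $GF(2^m)$, a shifted polynomial basis is used, which is defined as follows.
Definition 1 [7]: Let $v$ be an integer and let the ordered set $\{1,x,\cdots,x^{m-1}\}$ be a polynomial basis of $GF(2^m)$ over $GF(2)$. The ordered set $\{x^{-v},x^{-v+1},\cdots,x^{m-1-v}\}$ is called the shifted polynomial basis (SPB) with respect to this polynomial basis.
The SPB $\{x^{-k},x^{-k+1},\cdots,x^{m-1-k}\}$ is chosen to represent field elements of $GF(2^m)$ for an efficient modular reduction ([7]).
First, we introduce some notations which are used throughout this paper. For a matrix $\mathbf{A}$, [...]
Now, we briefly describe the multiplication scheme for trinomials based on the $n$-term KA presented in [18]. Let the degree $m$ of $GF(2^m)$ be decomposed as $m=nm_0+1$ with $n,m_0>1$. (The values of $n$ and $m_0$ can be flexibly chosen.) Two given field elements $A$ and $B$ are split as
$$A=x^{-k}\sum_{i=0}^{m-1}a_ix^i=x^{-k}\Big(\sum_{i=0}^{n-1}A_ix^{im_0}+a_{nm_0}x^{nm_0}\Big),\qquad B=x^{-k}\sum_{i=0}^{m-1}b_ix^i=x^{-k}\Big(\sum_{i=0}^{n-1}B_ix^{im_0}+b_{nm_0}x^{nm_0}\Big),$$
where $A_i=\sum_{j=0}^{m_0-1}a_{im_0+j}x^j$ and $B_i=\sum_{j=0}^{m_0-1}b_{im_0+j}x^j$. Then, the polynomial multiplication $AB$ is equal to
$$AB=x^{-2k}\Big[\Big(\sum_{i=0}^{n-1}A_ix^{im_0}\Big)\Big(\sum_{i=0}^{n-1}B_ix^{im_0}\Big)+x^{nm_0}\Big(a_{nm_0}\sum_{i=0}^{n-1}B_ix^{im_0}+b_{nm_0}\sum_{i=0}^{n-1}A_ix^{im_0}\Big)+a_{nm_0}b_{nm_0}x^{2nm_0}\Big].$$
The first product above is expanded by using the following $n$-term KA.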
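Lemma 1 ([12], [17], or [18, Lemma 2]): Let $A=\sum_{i=0}^{n-1}a_ix^i$ and $B=\sum_{i=0}^{n-1}b_ix^i$. Then the multiplication of the two polynomials is equal to
$$AB=\Big(\sum_{i=0}^{n-1}e_ix^i\Big)\big(x^{n-1}+x^{n-2}+\cdots+1\big)+\sum_{i=1}^{2n-3}\Big(\sum_{u+v=i,\,0\le u<v<n}e_{u,v}\Big)x^i,$$
where $e_j:=a_jb_j$ for $0\le j<n$ and $e_{u,v}:=(a_u+a_v)(b_u+b_v)$ for $0\le u<v<n$.
By using the above lemma, the multiplication $(\sum_{i=0}^{n-1}A_ix^{im_0})(\sum_{i=0}^{n-1}B_ix^{im_0})$ is expanded and the product $AB$ is partitioned into three parts, $AB=S_1+S_2+S_3$, where
$$S_1=x^{-2k}\Big(\sum_{i=0}^{n-1}A_iB_ix^{im_0}\Big)K(x),\qquad S_2=x^{-2k}\sum_{i=1}^{2n-3}\Big(\sum_{u+v=i,\,0\le u<v<n}E_{u,v}\Big)x^{im_0},$$
$K(x)=x^{(n-1)m_0}+x^{(n-2)m_0}+\cdots+1$, $E_{u,v}:=(A_u+A_v)(B_u+B_v)$, and $S_3$ collects the remaining terms of $AB$, i.e., those involving $a_{nm_0}$ or $b_{nm_0}$. Therefore, the field multiplication $C=AB \bmod x^{nm_0+1}+x^k+1$ is equal to
$$C=(S_1 \bmod x^{nm_0+1}+x^k+1)+(S_2 \bmod x^{nm_0+1}+x^k+1)+(S_3 \bmod x^{nm_0+1}+x^k+1).\qquad (1)$$
The computation of $S_1 \bmod x^{nm_0+1}+x^k+1$ in (1) is performed based on the Mastrovito approach ([9]). That is, $S_1 \bmod x^{nm_0+1}+x^k+1$ is represented as a matrix-vector product $\mathbf{M}\mathbf{b}$ for a matrix $\mathbf{M}$ and a vector $\mathbf{b}$. ($\mathbf{M}$ is called the Mastrovito matrix corresponding to $S_1 \bmod x^{nm_0+1}+x^k+1$.) To this end, the polynomial multiplication $A_iB_i$ is first implemented as the matrix-vector product
$$\begin{bmatrix}\mathbf{A}_{i,L}\\ \mathbf{A}_{i,H}\end{bmatrix}\mathbf{b}_i,\qquad (2)$$
where $\mathbf{b}_i$ is the coefficient vector of $B_i$, and $\mathbf{A}_{i,L}$ and $\mathbf{A}_{i,H}$ are $m_0\times m_0$ triangular matrices given by
$$\mathbf{A}_{i,L}=\begin{bmatrix}a_{im_0}&0&\cdots&0\\ a_{im_0+1}&a_{im_0}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ a_{im_0+m_0-1}&a_{im_0+m_0-2}&\cdots&a_{im_0}\end{bmatrix},\qquad \mathbf{A}_{i,H}=\begin{bmatrix}0&a_{im_0+m_0-1}&\cdots&a_{im_0+1}\\ \vdots&\ddots&\ddots&\vdots\\ 0&0&\cdots&a_{im_0+m_0-1}\\ 0&0&\cdots&0\end{bmatrix}.$$
Then the polynomial $\sum_{i=0}^{n-1}A_iB_ix^{im_0}$, and therefore the polynomial $(\sum_{i=0}^{n-1}A_iB_ix^{im_0})K(x)$, can be expressed as a big matrix-vector product $\mathbf{A}\mathbf{b}$, where $\mathbf{A}$ is a $2nm_0\times nm_0$ matrix and $\mathbf{b}$ is an $nm_0\times 1$ vector defined in (3). The $i$th row $\mathbf{A}_{i,*}$ of $\mathbf{A}$ corresponds to the coefficient of $x^{i-1-2k}$ of $S_1$. Finally, the Mastrovito matrix $\mathbf{M}$ related to $\mathbf{A}\mathbf{b} \bmod x^{nm_0+1}+x^k+1$ is obtained by reducing terms of degrees that are out of the range $[-k,m-k-1]$.
The polynomial $S_2$ in (1) is split into some parts using the following lemma.
Lemma 2 ([18, Lemma 4] or [20, Proposition 1]): The polynomial $\sum_{i=1}^{2n-3}(\sum_{u+v=i,\,0\le u<v<n}E_{u,v})x^{im_0}$ can be partitioned as follows: [...]
From the above lemma, we can write $S_2=\sum_{i=1}^{\lambda}\widetilde{G}_i$, where $\lambda=\lfloor n/2\rfloor$ and $\widetilde{G}_i=\sum_j u_j^{(i)}x^j$ with $u_j^{(i)}\in GF(2)$ denotes the polynomial $G_i$ of Lemma 2 multiplied by the appropriate power of $x$; the degrees of the terms of $\widetilde{G}_i$ lie in a range that depends on the parity of $n$ [...]. Next, $S_2 \bmod x^{nm_0+1}+x^k+1$ is obtained by implementing the modular reductions $\widetilde{G}_1,\cdots,\widetilde{G}_\lambda \bmod x^{nm_0+1}+x^k+1$ and adding all these results by a binary XOR tree.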
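The $n$-term KA of Lemma 1 can be checked numerically. The sketch below is a minimal Python illustration, not the authors' hardware design: polynomials over $GF(2)$ are stored as Python integers (bit $i$ is the coefficient of $x^i$), the operands are split into $n$ blocks of $m_0$ bits, and the product is rebuilt as $(\sum_i e_i x^{im_0})K(x)+\sum_{u<v}e_{u,v}x^{(u+v)m_0}$ with $e_i=A_iB_i$ and $e_{u,v}=(A_u+A_v)(B_u+B_v)$. The helper names `clmul`, `split_blocks`, and `ka_n_term` are hypothetical.

```python
import random


def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def split_blocks(poly: int, n: int, m0: int) -> list:
    """Split a polynomial of degree < n*m0 into n blocks of m0 coefficients."""
    mask = (1 << m0) - 1
    return [(poly >> (i * m0)) & mask for i in range(n)]


def ka_n_term(a: int, b: int, n: int, m0: int) -> int:
    """Product of a and b assembled from the two parts of the n-term KA."""
    A, B = split_blocks(a, n, m0), split_blocks(b, n, m0)
    # First part: (sum_i A_i B_i x^(i*m0)) * K(x), with K(x) = sum_j x^(j*m0).
    part1 = 0
    for i in range(n):
        e_i = clmul(A[i], B[i])
        for j in range(n):
            part1 ^= e_i << ((i + j) * m0)
    # Second part: sum over u < v of (A_u + A_v)(B_u + B_v) x^((u+v)*m0).
    part2 = 0
    for u in range(n):
        for v in range(u + 1, n):
            part2 ^= clmul(A[u] ^ A[v], B[u] ^ B[v]) << ((u + v) * m0)
    return part1 ^ part2


rng = random.Random(1)  # fixed seed so the check is repeatable
n, m0 = 3, 5
ka_matches_schoolbook = all(
    ka_n_term(a, b, n, m0) == clmul(a, b)
    for a, b in ((rng.getrandbits(n * m0), rng.getrandbits(n * m0)) for _ in range(200))
)
print(ka_matches_schoolbook)
```

The sketch uses $n(n+1)/2$ block products instead of the $n^2$ of the schoolbook method, which is the source of the space saving.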

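III. GENERALIZED MULTIPLICATION SCHEME FOR TRINOMIAL $x^{nm_0+r}+x^k+1$
In this section, we present a multiplication scheme for any trinomial based on the $n$-term KA, which is a generalization of that for the trinomial $x^{nm_0+1}+x^k+1$ in the previous section. Let $GF(2^m)$ be defined by an irreducible trinomial $F(x)=x^m+x^k+1$ $(0<k<m)$ and let the SPB $\{x^{-k},x^{-k+1},\cdots,x^{m-1-k}\}$ be used to represent a field element. We decompose $m$ as $m=nm_0+r$, where $1\le r<m_0$ and $1<n$. Here, a flexible decomposition of $m$ is possible, that is, various values for $n$, $m_0$, and $r$ may be chosen. Throughout this paper, we define $t:=\lceil k/m_0\rceil$.
We first split two given elements $A$ and $B$ as
$$A=x^{-k}\Big(\sum_{i=0}^{n-1}A_ix^{im_0}+\widehat{A}x^{nm_0}\Big),\qquad B=x^{-k}\Big(\sum_{i=0}^{n-1}B_ix^{im_0}+\widehat{B}x^{nm_0}\Big),$$
where $A_i=\sum_{j=0}^{m_0-1}a_{im_0+j}x^j$, $\widehat{A}=\sum_{j=0}^{r-1}a_{nm_0+j}x^j$, and $B_i$, $\widehat{B}$ are defined similarly. We expand the term $(\sum_{i=0}^{n-1}A_ix^{im_0})(\sum_{i=0}^{n-1}B_ix^{im_0})$ using the $n$-term KA in Lemma 1, and partition the product $AB$ into three parts, $AB=S_1+S_2+S_3$, where $S_1$ and $S_2$ are defined as in Section II and
$$S_3=x^{nm_0-2k}\Big[\Big(\sum_{i=0}^{n-1}B_ix^{im_0}\Big)\widehat{A}+\Big(\sum_{i=0}^{n-1}A_ix^{im_0}\Big)\widehat{B}+\widehat{A}\widehat{B}x^{nm_0}\Big].$$
Therefore, the field multiplication $C=AB \bmod F(x)$ is equal to
$$C=(S_1 \bmod F(x))+(S_2 \bmod F(x))+(S_3 \bmod F(x)).\qquad (5)$$
It is noted that the polynomials $S_1$ and $S_2$ have the same forms as those in (1), respectively, except for the reduction polynomial. Fig. 1 (a) depicts the architecture of the field multiplication $C=AB \bmod F(x)$. Here, the Addition block implements the additions of vectors in Table 3 for the computation of $S_1 \bmod F(x)$. Fig. 1 (b) illustrates the detailed structures of the AND blocks and BTX blocks in Fig. 1 (a), which are the same as those in [21, Fig. 1 (b)]. Fig. 2 (given in Section III-B2) shows the architecture for $Q_1,\cdots,Q_w$, where $S_3 \bmod F(x)=Q_1+\cdots+Q_w$ (see (15)). We expound the computation of each part in the following subsections.
A. COMPUTATION OF $S_1 \bmod F(x)$
The polynomial $S_1$ in (5) can be written as the matrix-vector product $\mathbf{A}\mathbf{b}$ as in Section II. Now, we compute the Mastrovito matrix $\mathbf{M}$ corresponding to $S_1 \bmod F(x)$ by reducing the terms of $S_1$ of degrees that are out of the range $[-k,m-k-1]$. We use the following reduction rules, which follow from $x^m=x^k+1$:
$$x^i=x^{i+k}+x^{i+m}\ \ \text{for } -2k\le i\le -k-1,\qquad x^i=x^{i-m}+x^{i-m+k}\ \ \text{for } m-k\le i\le 2m-2k-2.\qquad (6)$$
We express modular reductions as additions of some matrices.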
To this end, we first define the $m\times(m-r)$ matrices $\mathbf{R}_1$, $\mathbf{R}_2$, and $\mathbf{R}_3$ according to the value of $k$ in Table 1 ($\mathbf{0}$ denotes a zero row). The $i$th row of $\mathbf{R}_1$ corresponds to the coefficient of $x^{i-k-1}$. The terms of $S_1$ of degrees in $[-2k,-k-1]$ (whose coefficients correspond to the first $k$ rows of $\mathbf{A}$) are reduced by (6); these terms are reduced by adding $\mathbf{R}_2$ and $\mathbf{R}_2[\,k]$ to $\mathbf{R}_1$. If $k\ge m-2r$, then there are no more terms to be reduced. Otherwise ($k<m-2r$), the terms of $S_1$ of degrees in $[m-k,2m-2k-2r-2]$ are reduced by (6), and such reductions are implemented by adding $\mathbf{R}_3$ and [...]. Consequently, $S_1 \bmod F(x)$ is equal to the sum of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$.
We give full descriptions of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$. From the definition of $\mathbf{A}$ in (3), the two $m\times(m-r)$ matrices $\mathbf{M}_1$ and $\mathbf{M}_2$ are given as in (7), since $m=nm_0+r$. Here, for the matrix $\mathbf{M}_1$, we use the following facts: [...] We define the vectors $\mathbf{e}_0,\cdots,\mathbf{e}_n$ as in (8).
From now on, we consider the complexity for $S_1 \bmod F(x)$. The vectors $\mathbf{e}_0,\cdots,\mathbf{e}_n$ in (8) are computed by two blocks: the AND blocks and BTX blocks in Fig. 1 (b). Such computations require $nm_0^2$ AND gates, $(m_0-1)(nm_0-1)$ XOR gates, and $T_A+\lceil\log_2 m_0\rceil T_X$ delay according to [18, Section III-A]. After computing the vectors $\mathbf{e}_i$'s, we perform the computation of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$ using the sub-expression sharing technique in Lemma 3.
Lemma 3 ([16], [17]): When two expressions $a_1+\cdots+a_l+a_{l+1}+\cdots+a_q$ and $a_1+\cdots+a_l+a'_{l+1}+\cdots+a'_r$ with $l$ common terms are simultaneously computed by binary XOR trees, $l-W(l)$ XOR gates can be saved by reusing common values, where $W(l)$ is the Hamming weight of $l$.
In Table 3, we summarize the required number of XOR gates for the computations of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$ by using the above lemma. For instance, if the vector $\mathbf{e}_0+\cdots+\mathbf{e}_{n-1}+\mathbf{e}_n[\uparrow r]$ in (v) is computed with $m_0n$ XOR gates, we can save $m_0(t-1-W(t-1))$ XOR gates for the computation of $\mathbf{e}_0+\mathbf{e}_1+\cdots+\mathbf{e}_{t-2}$. Therefore, $\mathbf{e}_0+\mathbf{e}_1+\cdots+\mathbf{e}_{t-2}$ [...] Table 3 are computed. For the vector $\mathbf{M}_2\mathbf{b}$, we need to compute (iv), (vii), and (viii). In the case $k>m-r$, the vector $(\mathbf{e}_1+\cdots+\mathbf{e}_n)[\,k-nm_0]$ in $\mathbf{M}_2\mathbf{b}$ is obtained from the vector $(\mathbf{e}_1+\cdots+\mathbf{e}_n)[\,r]$ in (ii) since $k-nm_0<r$. The last row in Table 3 gives the total number of XOR gates for the computation of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$. Their critical path delay is $\lceil\log_2(n+1)\rceil T_X$. Finally, the sum of the two vectors is implemented with $\max\{k,m-2r\}$ XOR gates and $T_X$ delay. Consequently, the total complexity for $S_1 \bmod F(x)$ is given by (10), using (9) and Table 3, where $J$ is defined in Table 3.
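The saving $l-W(l)$ of Lemma 3 is easy to evaluate. The minimal Python sketch below computes it and, as a plausibility check, compares it with one possible reading of the bound: the $l$ common terms are covered by complete binary subtrees, one of $2^i$ leaves for each set bit $i$ of $l$, and a complete subtree with $2^i$ leaves contains $2^i-1$ XOR gates. This reading is an assumption made here for illustration, not a statement taken from [16], [17]; the function names are hypothetical.

```python
def hamming_weight(l: int) -> int:
    """Number of ones in the binary representation of l."""
    return bin(l).count("1")


def shared_xor_savings(l: int) -> int:
    """XOR gates saved by sharing l common terms, as stated in Lemma 3."""
    return l - hamming_weight(l)


def savings_from_full_subtrees(l: int) -> int:
    """Assumed interpretation: sum of (2^i - 1) over the set bits i of l."""
    return sum((1 << i) - 1 for i in range(l.bit_length()) if (l >> i) & 1)


# The text applies the lemma with l = t - 1 on vectors of m0 components,
# giving m0 * (t - 1 - W(t - 1)) saved gates; t = 2 and m0 = 5 as in Example 1.
t, m0 = 2, 5
saved_in_example = m0 * shared_xor_savings(t - 1)
print(saved_in_example, shared_xor_savings(5), shared_xor_savings(8))
```

With $t=2$ there is a single common vector, so nothing can be shared, which matches the formula.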

Example 1:
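We give a small example to describe the proposed multiplication scheme. We consider the finite field $GF(2^{18})$ generated by the irreducible trinomial $F(x)=x^{18}+x^7+1$ with parameters $n=3$, $m_0=5$, $r=3$, and $k=7$. Then, we have that $t:=\lceil k/m_0\rceil=2$. Field elements of $GF(2^{18})$ are represented using the SPB $\{x^{-7},x^{-6},\cdots,x^{10}\}$. Let two elements $A=x^{-7}\sum_{i=0}^{17}a_ix^i$ and $B=x^{-7}\sum_{i=0}^{17}b_ix^i$ of $GF(2^{18})$ be given. We split the two elements $A$ and $B$ as
$$A=x^{-7}\big(A_0+A_1x^5+A_2x^{10}+\widehat{A}x^{15}\big),\qquad B=x^{-7}\big(B_0+B_1x^5+B_2x^{10}+\widehat{B}x^{15}\big),$$
where $A_i=\sum_{j=0}^{4}a_{5i+j}x^j$, $\widehat{A}=a_{15}+a_{16}x+a_{17}x^2$, and $B_i$, $\widehat{B}$ are defined similarly. According to (5), the field multiplication $C$ of the two elements $A$ and $B$ is equal to $(S_1 \bmod F(x))+(S_2 \bmod F(x))+(S_3 \bmod F(x))$. Based on the previous descriptions, $S_1 \bmod F(x)$ is equal to the sum of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$.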
The computation of $S_1 \bmod F(x)$ is implemented as follows (see the left part of Fig. 1 (a)). First, the four vectors $\mathbf{e}_0,\cdots,\mathbf{e}_3$ are computed by the two blocks in Fig. 1 (b). We note that the vector $(\mathbf{e}_0+\mathbf{e}_1)[\,2]$ in $\mathbf{M}_2\mathbf{b}$ can be obtained from the calculation process of $(\mathbf{e}_0+\mathbf{e}_1)+(\mathbf{e}_2+\mathbf{e}_3[\uparrow 3])$ without any cost. Therefore, the computation of the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$ requires 43 XOR gates and $2T_X$ delays (see Table 3). Finally, $S_1 \bmod F(x)$ is obtained by adding the two vectors $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$ with 12 XOR gates and $T_X$ delay. In summary, the total complexity for $S_1 \bmod F(x)$ is 75 AND gates, $56+43+12=111$ XOR gates, and $T_A+6T_X$ delays (see (10)).
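The gate counts of Example 1 can be re-derived by counting. The Python sketch below is an illustration under a stated assumption: the vectors $\mathbf{e}_0,\cdots,\mathbf{e}_n$ are taken to hold the coefficients of $\sum_{i=0}^{n-1}A_iB_ix^{im_0}$, so each coefficient with $c$ partial products costs $c$ AND gates, $c-1$ XOR gates, and $\lceil\log_2 c\rceil$ XOR levels. The 43 XOR gates for $\mathbf{M}_1\mathbf{b}$ and $\mathbf{M}_2\mathbf{b}$ are taken from the text (Table 3) and are not recomputed. All function names are hypothetical.

```python
from collections import Counter


def ceil_log2(value: int) -> int:
    """ceil(log2(value)) for a positive integer."""
    return (value - 1).bit_length()


def e_vectors_cost(n: int, m0: int) -> tuple:
    """(AND gates, XOR gates, XOR levels) for the coefficients of sum_i A_i B_i x^(i*m0)."""
    terms = Counter()
    for i in range(n):
        for j in range(m0):
            for l in range(m0):
                terms[i * m0 + j + l] += 1  # one partial product a*b per (i, j, l)
    and_gates = sum(terms.values())
    xor_gates = sum(count - 1 for count in terms.values())
    xor_levels = max(ceil_log2(count) for count in terms.values())
    return and_gates, xor_gates, xor_levels


n, m0, r, k = 3, 5, 3, 7
m = n * m0 + r
and_gates, e_xor, e_levels = e_vectors_cost(n, m0)

table3_xor = 43                      # quoted from the text, not derived here
final_add_xor = max(k, m - 2 * r)    # sum of M1*b and M2*b
total_xor = e_xor + table3_xor + final_add_xor
total_levels = e_levels + ceil_log2(n + 1) + 1
print(and_gates, total_xor, total_levels)
```

The counted values agree with the closed forms $nm_0^2$ and $(m_0-1)(nm_0-1)$ for this example.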

Fig. 1 (b), BTX blocks: add all the entries of the same row in the matrices.
The computation of $S_2+S_3 \bmod F(x)$ is dealt with in Examples 2 and 3.
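The role of the SPB in the reduction step can be illustrated in software. The following minimal Python sketch multiplies two SPB elements of $GF(2^{18})$ with $F(x)=x^{18}+x^7+1$: a term $x^d$ of the product with $d<-k$ is replaced by $x^{d+k}+x^{d+m}$, and a term with $d>m-k-1$ by $x^{d-m}+x^{d-m+k}$, both of which follow from $x^m=x^k+1$. It is a bit-serial software model for checking the algebra, not the bit-parallel architecture of the paper; the encoding (bit $i$ of an integer is the coefficient of $x^{i-k}$) and all function names are assumptions made here.

```python
import random


def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials stored as integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def reduce_mod_trinomial(p: int, m: int, k: int) -> int:
    """Ordinary reduction of p modulo x^m + x^k + 1 (reference implementation)."""
    while p.bit_length() > m:
        d = p.bit_length() - 1
        p ^= (1 << d) ^ (1 << (d - m + k)) ^ (1 << (d - m))
    return p


def spb_multiply(a: int, b: int, m: int, k: int) -> int:
    """Multiply two SPB elements; bit i of a, b, and the result weighs x^(i-k)."""
    prod = clmul(a, b)  # bit i of prod is the coefficient of x^(i-2k)
    c = 0
    for i in range(prod.bit_length()):
        if not (prod >> i) & 1:
            continue
        d = i - 2 * k
        if d < -k:              # degrees in [-2k, -k-1]
            targets = (d + k, d + m)
        elif d > m - k - 1:     # degrees in [m-k, 2m-2k-2]
            targets = (d - m, d - m + k)
        else:
            targets = (d,)
        for t in targets:
            c ^= 1 << (t + k)
    return c


m, k = 18, 7
rng = random.Random(2020)  # fixed seed so the check is repeatable
spb_matches_reference = True
for _ in range(200):
    a, b = rng.getrandbits(m), rng.getrandbits(m)
    c = spb_multiply(a, b, m, k)
    # x^(-2k) a(x) b(x) = x^(-k) c(x) is equivalent to a(x) b(x) = x^k c(x) mod F(x).
    lhs = reduce_mod_trinomial(clmul(a, b), m, k)
    rhs = reduce_mod_trinomial(c << k, m, k)
    if c >> m or lhs != rhs:
        spb_matches_reference = False
print(spb_matches_reference)
```

Each out-of-range term needs exactly one substitution, because both replacement degrees already lie in $[-k,m-k-1]$.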

B. COMPUTATION OF $S_2+S_3 \bmod F(x)$
In this section, we consider the computation of $S_2+S_3 \bmod F(x)$, which is implemented in parallel with the computation of $S_1 \bmod F(x)$.

1) COMPUTATION OF $S_2 \bmod F(x)$
Analogous to Section II-B, we can write $S_2$ in (5) as $S_2=\sum_{i=1}^{\lambda}\widetilde{G}_i$, where $\lambda=\lfloor n/2\rfloor$ and [...]. We first compute the polynomials $G_i$'s (i.e., $\widetilde{G}_i$'s) with the complexity in (4). The architecture for the $G_i$'s is given in the middle part of Fig. 1 (a) (see Example 2 for more details). Next, we perform the modular reductions $\widetilde{G}_i \bmod F(x)$ for $1\le i\le\lambda$ and then the summation $\sum_{i=1}^{\lambda}(\widetilde{G}_i \bmod F(x))$. The modular reduction formula for $\widetilde{G}_i \bmod F(x)$ differs according to the parity of $n$. In fact, we can write [...]. Here, the resulting polynomials are obtained without any cost if the polynomials $G_i$'s are given (see Table 4). Since such modular reductions $\widetilde{G}_i \bmod F(x)$ are performed in the same way as in [18] and no special technique is used, we only summarize their results in Table 4. As a result, $S_2 \bmod F(x)$ is written as the summation in (11). The last column of Table 4 denotes the number of XOR gates for each modular reduction. The total number of XOR gates for the summations in (11) is reported in the last rows of Table 4 for each parity of $n$. By adding it to the space complexity for the $G_i$'s in (4), we obtain the space complexity for $S_2 \bmod F(x)$ in (12), where $Y_1$ and $Y_2$ are given in Table 4.
Example 2: Continuing from Example 1, we consider the computation of $S_2 \bmod F(x)$. Since $\lambda=\lfloor n/2\rfloor=1$, we have that $S_2=\widetilde{G}_1$, where $G_1=E_{0,1}+E_{0,2}x^5+E_{1,2}x^{10}$. The modular reduction of $S_2=\widetilde{G}_1=\sum_{j=-9}^{9}u_j^{(1)}x^j$ is performed according to Table 4. In order to compute $S_2 \bmod F(x)$, we first compute the polynomial $\widetilde{G}_1=G_1x^{-9}=\sum_{j=-9}^{9}u_j^{(1)}x^j$ as in [18, Section III-B] (see the middle part of Fig. 1 (a)). The computations of the 6 polynomials $A_u+A_v$ and $B_u+B_v$ $(0\le u<v\le 2)$ require 30 XOR gates and $T_X$ delay. Then, the coefficients of the polynomial $G_1=E_{0,1}+E_{0,2}x^5+E_{1,2}x^{10}$ can be obtained by adding the entries of the same row in the matrix $\mathbf{E}_{G_1}$ using a binary XOR tree, where [...]. The computation of the matrix $\mathbf{E}_{G_1}$ costs 75 AND gates and $T_A$ delay. The additions of the entries of the same row in $\mathbf{E}_{G_1}$ require 56 XOR gates and $3T_X$ delays. (These computations consist of AND blocks and BTX blocks similar to those of $\mathbf{e}_0,\cdots,\mathbf{e}_3$ in Fig. 1 (b).) Consequently, the total complexity for the polynomial $G_1$ (i.e., $\widetilde{G}_1$) is 75 AND gates, $30+56=86$ XOR gates, and $T_A+4T_X$ delays, as in (4). Next, the summation of the polynomials resulting from the modular reduction (see Table 4) is performed with 3 XOR gates.
As a result, the total space complexity for $S_2 \bmod F(x)$ is 75 AND gates and 89 XOR gates (see (12)). Its time complexity is considered in Example 3.
FIGURE 2. Architecture of $Q_1,\cdots,Q_w$ for $S_3 \bmod F(x)$.
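The figures of Example 2 can be reproduced by counting partial products. The Python sketch below is specific to $n=3$, where $G_1=E_{0,1}+E_{0,2}x^{m_0}+E_{1,2}x^{2m_0}$ with $E_{u,v}=(A_u+A_v)(B_u+B_v)$; for other values of $n$ the partition of Lemma 2 would be needed, which is not modeled here. The counting model (one AND per partial product, $c-1$ XOR gates and $\lceil\log_2 c\rceil$ levels for a coefficient with $c$ partial products) and the variable names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def ceil_log2(value: int) -> int:
    """ceil(log2(value)) for a positive integer."""
    return (value - 1).bit_length()


n, m0 = 3, 5
pairs = list(combinations(range(n), 2))  # (0, 1), (0, 2), (1, 2)

# Pre-additions A_u + A_v and B_u + B_v: two m0-bit additions per pair, one XOR level.
pre_add_xor = 2 * m0 * len(pairs)

# Partial products of G_1: E_{u,v} is shifted by (u + v - 1) * m0.
terms = Counter()
for u, v in pairs:
    shift = (u + v - 1) * m0
    for j in range(m0):
        for l in range(m0):
            terms[shift + j + l] += 1

and_gates = sum(terms.values())
tree_xor = sum(count - 1 for count in terms.values())
tree_levels = max(ceil_log2(count) for count in terms.values())

g1_xor = pre_add_xor + tree_xor
g1_levels = 1 + tree_levels  # one level for the pre-additions, then the XOR tree
print(and_gates, g1_xor, g1_levels)
```

The result matches the 75 AND gates, 86 XOR gates, and $T_A+4T_X$ delay stated for $G_1$; adding the 3 XOR gates quoted for the final summation gives the 89 XOR gates of $S_2 \bmod F(x)$.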

2) COMPUTATION OF $S_3 \bmod F(x)$
Here, we perform the computation of $S_3 \bmod F(x)$ in (5). Its architecture is described in Fig. 2. The polynomial can be written as $S_3=\sum_{i=m-2k-r}^{2m-2k-2}f_ix^i$ since the degrees of $A_i$ and $B_i$ (resp. $\widehat{A}$ and $\widehat{B}$) are less than $m_0$ (resp. $r$). We first compute the coefficients $f_i$'s and then implement the modular reduction $\sum_{i=m-2k-r}^{2m-2k-2}f_ix^i \bmod F(x)$. The computation of the coefficients $f_i$'s is performed by computing the two polynomials $(\sum_{i=0}^{n-1}B_ix^{im_0})\widehat{A}$ and $(\sum_{i=0}^{n-1}A_ix^{im_0})\widehat{B}+\widehat{A}\widehat{B}x^{nm_0}$ in parallel, followed by the addition of these two polynomials. First, we consider the computation of $(\sum_{i=0}^{n-1}A_ix^{im_0})\widehat{B}+\widehat{A}\widehat{B}x^{nm_0}$. We use the following notations.
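• $^{r}\mathbf{A}$ := the first $r$ columns of the matrix $\mathbf{A}$,
• $\widehat{\mathbf{b}}$ := the coefficient vector of $\widehat{B}$ with $r$ components,
• $\widetilde{\mathbf{b}}$ := the vector $[\widehat{\mathbf{b}}^T,0,\cdots,0]^T$ with $m_0$ components.
The polynomial $A_i\widehat{B}$ can be written as the matrix-vector product of $[\mathbf{A}_{i,L}^T,\mathbf{A}_{i,H}^T]^T$ and $\widetilde{\mathbf{b}}$, where the $m_0\times m_0$ matrices $\mathbf{A}_{i,L}$ and $\mathbf{A}_{i,H}$ are defined in (2). Then the polynomial $(\sum_{i=0}^{n-1}A_ix^{im_0})\widehat{B}$ is represented as a matrix-vector product of a matrix built from $^{r}\mathbf{A}_{i,L}$ and $^{r}\mathbf{A}_{i,H}$ and the vector $\widehat{\mathbf{b}}$, since the last $(m_0-r)$ components of $\widetilde{\mathbf{b}}$ are all zero. (On the right side of the matrix, we denote the first exponent of the indeterminate $x$ for each line.) Also, we write the polynomial $\widehat{A}\widehat{B}$ as the matrix-vector product of $[\widehat{\mathbf{A}}_L^T,\widehat{\mathbf{A}}_H^T]^T$ and $\widehat{\mathbf{b}}$, where $\widehat{\mathbf{A}}_L$ and $\widehat{\mathbf{A}}_H$ are $r\times r$ triangular matrices (the last row of $\widehat{\mathbf{A}}_L$ is $a_{nm_0+r-1},a_{nm_0+r-2},\cdots,a_{nm_0}$). Since the last $(m_0-r)$ rows of $^{r}\mathbf{A}_{n-1,H}$ are all zero rows, we obtain that [...] (The computations of the matrices $(^{r}\mathbf{A}_{i,H}+{}^{r}\mathbf{A}_{i+1,L})$ and $(^{r}\mathbf{A}_{n-1,H})[\,r]+\widehat{\mathbf{A}}_L$ are all free.) The computation of the above matrix-vector products consists of AND blocks and BTX blocks similar to that of the $\mathbf{e}_i$'s (see Fig. 1 (b)). Since $(^{r}\mathbf{A}_{0,L})[\,r]$ and $\widehat{\mathbf{A}}_H$ are $r\times r$ triangular matrices, the computation of the above matrix-vector products for $(\sum_{i=0}^{n-1}A_ix^{im_0})\widehat{B}+\widehat{A}\widehat{B}x^{nm_0}$ requires $r(nm_0+r)$ AND gates, $(r-1)(nm_0+r-1)$ XOR gates, and $T_A+\lceil\log_2 r\rceil T_X$ delay.
In parallel with the computation of $(\sum_{i=0}^{n-1}A_ix^{im_0})\widehat{B}+\widehat{A}\widehat{B}x^{nm_0}$, the polynomial $(\sum_{i=0}^{n-1}B_ix^{im_0})\widehat{A}$ is similarly computed with the complexity: $rnm_0$ AND gates, $(r-1)nm_0-(r-1)$ XOR gates, and $T_A+\lceil\log_2 r\rceil T_X$ delay. Next, the coefficients $f_i$'s are obtained by adding the two polynomials $(\sum_{i=0}^{n-1}A_ix^{im_0})\widehat{B}+\widehat{A}\widehat{B}x^{nm_0}$ and $(\sum_{i=0}^{n-1}B_ix^{im_0})\widehat{A}$ with $(nm_0+r-1)$ XOR gates and $T_X$ delay. Consequently, the computation of the coefficients $f_i$'s requires
$$r(2nm_0+r)\ \text{AND gates},\quad (r-1)(2nm_0+r-2)+(nm_0+r-1)\ \text{XOR gates},\quad T_A+(1+\lceil\log_2 r\rceil)T_X\ \text{delay}.\qquad (13)$$
Now, we perform the modular reduction of $\sum_{i=m-2k-r}^{2m-2k-2}f_ix^i$ by $F(x)$ using the reduction rules in (6). Such modular reduction depends on the value of $k$. In fact, $\sum_{i=m-2k-r}^{2m-2k-2}f_ix^i \bmod F(x)$ is written as a summation of 3 terms (2 terms if $r=1$), $D_1+D_2+D_3$ [...]. The additions that form $D_1$ or $D_3$ are free since they are additions of non-overlapped parts. The summation [...] Hence, the total space complexity for $S_3 \bmod F(x)$ is obtained as in (14), using (13).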
Since $S_3 \bmod F(x)$ is written as a summation of 3 terms (2 terms if $r=1$) with the delay $T_A+(1+\lceil\log_2 r\rceil)T_X$ by (13), we can write $S_3 \bmod F(x)$ as a summation of $w$ terms
$$S_3 \bmod F(x)=Q_1+\cdots+Q_w\qquad (15)$$
within the delay $T_A+(1+\lceil\log_2 m_0\rceil)T_X$. The space complexities for $(S_2 \bmod F(x))$ and $(S_3 \bmod F(x))$ are given in (12) and (14), respectively. Finally, the addition $(S_2 \bmod F(x))+(S_3 \bmod F(x))$ requires $m\,(=nm_0+r)$ XOR gates. Therefore, the total space complexity for $S_2+S_3 \bmod F(x)$ is given by (16).
Now, we consider the time complexity for the computation of $(S_2 \bmod F(x))+(S_3 \bmod F(x))$ from (11) and (15), where $P_3:=[...]$. According to Table 4, the computation of $P_3$ is free since it consists of non-overlapped parts. Each computation of $P_1^{(i)}$, $P_2^{(i)}$, $P_3$, and $P_2$ is implemented within the delay $T_A+(1+\lceil\log_2 m_0\rceil)T_X$ since they are obtained without any cost from the polynomials $G_i$'s (see (4)). Also, the delay for each $Q_j$ is $T_A+(1+\lceil\log_2 m_0\rceil)T_X$ by (15). Thus, after $T_A+(1+\lceil\log_2 m_0\rceil)T_X$ delays, $(S_2 \bmod F(x))+(S_3 \bmod F(x))$ is written as a summation of at most
$$\begin{cases}2\cdot\frac{n}{2}+w=(n+w)\ \text{terms}&\text{if } n \text{ is even},\\ 2\cdot\frac{n-1}{2}+1+w=(n+w)\ \text{terms}&\text{if } n \text{ is odd}.\end{cases}$$
(Here, we note that if $n$ is odd, $t$ is even, and $k\le tm_0-r-2$, then the ranges of degrees of $P_2^{(\frac{n-1}{2})}$ and $P_2$ are $[m_0-k,-1]$ and $[0,tm_0-k-r-2]$, respectively, from Table 4, and so their addition is free. Therefore, $S_2 \bmod F(x)$ can be regarded as a summation of $(2\cdot\frac{n-1}{2}+1)$ terms in this case, too.) As a result, the total delay for $(S_2 \bmod F(x))+(S_3 \bmod F(x))$ is
$$T_A+(1+\lceil\log_2 m_0\rceil+\lceil\log_2(n+w)\rceil)T_X.\qquad (18)$$
Example 3: Here, we consider the computation of $S_3 \bmod F(x)$ and the time complexity for $S_2+S_3 \bmod F(x)$ in Examples 1 and 2.
We first compute the polynomial $(\sum_{i=0}^{2}A_ix^{5i})\widehat{B}+\widehat{A}\widehat{B}x^{15}$, which can be represented as a matrix-vector product [...] built from matrices such as
$$^{3}\mathbf{A}_{i,L}=\begin{bmatrix}a_{5i}&0&0\\ a_{5i+1}&a_{5i}&0\\ a_{5i+2}&a_{5i+1}&a_{5i}\\ a_{5i+3}&a_{5i+2}&a_{5i+1}\\ a_{5i+4}&a_{5i+3}&a_{5i+2}\end{bmatrix}.$$
The above matrix-vector product can be computed with 54 AND gates, 34 XOR gates, and $T_A+2T_X$ delays. In parallel with the computation of $(\sum_{i=0}^{2}A_ix^{5i})\widehat{B}+\widehat{A}\widehat{B}x^{15}$, the polynomial $(\sum_{i=0}^{2}B_ix^{5i})\widehat{A}$ is similarly computed with 45 AND gates, 28 XOR gates, and $T_A+2T_X$ delays. Then, we get the polynomial $S_3=\sum_{i=1}^{20}f_ix^i$ by adding the two polynomials $(\sum_{i=0}^{2}B_ix^{5i})\widehat{A}$ and $(\sum_{i=0}^{2}A_ix^{5i})\widehat{B}+\widehat{A}\widehat{B}x^{15}$ with 17 XOR gates and $T_X$ delay. Therefore, the polynomial $S_3=\sum_{i=1}^{20}f_ix^i$ is computed with the complexity: $54+45=99$ AND gates, $34+28+17=79$ XOR gates, and $T_A+3T_X$ delays (see (13)).
Next, the modular reduction $\sum_{i=1}^{20}f_ix^i \bmod F(x)$ is written as the summation $D_1+D_2+D_3$ [...]. The addition $D_1+D_2+D_3$ requires 12 XOR gates. Hence, the total space complexity for $S_3 \bmod F(x)$ is 99 AND gates and $79+12=91$ XOR gates (see (14)). Since $\frac{m_0}{2}<r$, $w=3$ and $Q_i=D_i$ in (15) for $1\le i\le 2$. Now, we consider the total time complexity for the computation of $S_2+S_3 \bmod F(x)$. According to the above description and Example 2, we can write $(S_2 \bmod F(x))+(S_3 \bmod F(x))$ as a summation of the polynomials $P_i^{(1)}$'s and $Q_i$'s, which can be computed in parallel within the delay $T_A+4T_X$. This summation is implemented with $3T_X$ delays. Thus, the total time complexity for $S_2+S_3 \bmod F(x)$ is $T_A+7T_X$ delays (see (18)).
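The counts for the coefficients $f_i$ in Example 3 can be checked with the same partial-product counting used for the other parts. The Python sketch below is an illustration under the assumption that each coefficient with $c$ partial products costs $c$ AND gates, $c-1$ XOR gates, and $\lceil\log_2 c\rceil$ XOR levels, and that merging the two polynomials costs one XOR gate per overlapping coefficient. The common factor $x^{nm_0-2k}$ only shifts degrees and is ignored; the 12 XOR gates of the modular reduction are quoted from the text and not recomputed. Function names are hypothetical.

```python
from collections import Counter


def ceil_log2(value: int) -> int:
    """ceil(log2(value)) for a positive integer."""
    return (value - 1).bit_length()


def product_terms(len_a: int, len_b: int, shift: int = 0) -> Counter:
    """Partial products per output degree for a (len_a)-term by (len_b)-term product."""
    terms = Counter()
    for i in range(len_a):
        for j in range(len_b):
            terms[shift + i + j] += 1
    return terms


def tree_cost(terms: Counter) -> tuple:
    """(AND gates, XOR gates, XOR levels) needed to sum the partial products."""
    and_gates = sum(terms.values())
    xor_gates = sum(count - 1 for count in terms.values())
    xor_levels = max(ceil_log2(count) for count in terms.values())
    return and_gates, xor_gates, xor_levels


n, m0, r = 3, 5, 3
# (sum_i A_i x^(i*m0)) * B_hat + A_hat * B_hat * x^(n*m0)
first = product_terms(n * m0, r) + product_terms(r, r, shift=n * m0)
# (sum_i B_i x^(i*m0)) * A_hat
second = product_terms(n * m0, r)

first_cost = tree_cost(first)
second_cost = tree_cost(second)
merge_xor = len(set(first) & set(second))  # one XOR per overlapping coefficient

total_and = first_cost[0] + second_cost[0]
total_xor = first_cost[1] + second_cost[1] + merge_xor
total_levels = max(first_cost[2], second_cost[2]) + 1
print(first_cost, second_cost, merge_xor, (total_and, total_xor, total_levels))
```

The totals agree with the 99 AND gates, 79 XOR gates, and $T_A+3T_X$ delay of the example.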

C. OVERALL COMPLEXITY FOR FIELD MULTIPLICATION C
We implement the computations of $(S_1 \bmod F(x))$ and $(S_2+S_3 \bmod F(x))$ in parallel within the delay $T_A+(1+\lceil\log_2 m_0\rceil+\lceil\log_2(n+w)\rceil)T_X$. Then, the addition of these two polynomials is performed with $m\,(=nm_0+r)$ XOR gates and $T_X$ delay, so that the overall delay is $T_A+(2+\lceil\log_2 m_0\rceil+\lceil\log_2(n+w)\rceil)T_X$. As the final outcome, the overall complexity for the field multiplication $C=AB \bmod F(x)$ is obtained using (10), (16), and (18). Since $\sum_{i=1}^{l}W(i)$ can be roughly written as $\frac{l}{2}\log_2 l$ for a nonzero integer $l$ ([22]), $J$ defined in Table 3 is roughly estimated as $O(m\log_2 n)$. The above number of XOR gates for the computation of the field multiplication can be written as [...] by omitting the linear parts. We note that $(rm-\frac{1}{2}rm_0-\frac{1}{2}r^2)<m_0m$ and $-\frac{1}{2}nk+tk-\frac{1}{2}t^2m_0\le-\frac{1}{2}nk+tk-\frac{1}{2}tk\le\frac{1}{2}k$ since $1\le t\le n+1$.
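The delay bookkeeping of this subsection can be sketched in a few lines of Python. The function `overall_xor_levels` (a hypothetical name) returns the number of $T_X$ units on top of $T_A$: the parallel stage takes $1+\lceil\log_2 m_0\rceil+\lceil\log_2(n+w)\rceil$ levels and the final addition one more. The number of terms $w$ is treated as an input because its defining condition is not reproduced here. The second helper compares the exact value of $\sum_{i=1}^{l}W(i)$ with the rough estimate $\frac{l}{2}\log_2 l$ quoted from [22].

```python
from math import log2


def ceil_log2(value: int) -> int:
    """ceil(log2(value)) for a positive integer."""
    return (value - 1).bit_length()


def overall_xor_levels(n: int, m0: int, w: int) -> int:
    """XOR levels of the whole multiplier; the total delay is T_A plus this many T_X."""
    parallel_stage = 1 + ceil_log2(m0) + ceil_log2(n + w)
    return parallel_stage + 1  # final addition of the two partial results


def hamming_weight_sum(l: int) -> int:
    """Exact value of W(1) + W(2) + ... + W(l)."""
    return sum(bin(i).count("1") for i in range(1, l + 1))


# Parameters of Examples 1 to 3: n = 3, m0 = 5, w = 3.
example_levels = overall_xor_levels(3, 5, 3)

# Exact sum versus the rough estimate (l / 2) * log2(l) for a few values of l.
comparison = [(l, hamming_weight_sum(l), (l / 2) * log2(l)) for l in (8, 64, 512)]
print(example_levels, comparison)
```

For the small field of the examples the sketch yields 8 XOR levels, one more than the $T_A+7T_X$ of the parallel stage.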

IV. COMPARISON
In Table 5, we compare the proposed multiplier with the best-known bit-parallel multipliers for $GF(2^m)$ defined by a trinomial $x^m+x^k+1$. The fastest bit-parallel multipliers for trinomials have been proposed in [8] and [11]. Park et al. [18] proposed multipliers for the trinomial $x^m+x^k+1$ based on the $n$-term KA, where $m$ is decomposed as $m=nm_0$ or $m=nm_0+1$. In [19], this limitation on $m$ is loosened, that is, $m$ satisfies $m=nm_0+r$ with $0\le r<n,m_0$, but the trinomial must satisfy $m\ge 2k$. The multiplier in [18] or [19] achieves the lowest space complexity among similar bit-parallel multipliers.
TABLE 5. Comparison of the best-known bit-parallel multipliers for trinomial $x^m+x^k+1$.
The multiplier proposed in this paper further eases the limitations on the trinomials and on the decomposition of $m$ in [18], [19]. In fact, the multiplier for the trinomial $x^{nm_0+1}+x^k+1$ in [18] can be considered a special case of the proposed multiplier. The complexity of the proposed multiplier is comparable with those in [18] and [19], so it is difficult to tell immediately which multiplier is better. However, more flexible options for the trinomial and for the values of $n$, $m_0$, and $r$ may yield better complexities. In Table 6, we show this by giving specific comparisons for several fields which are addressed in [19].
Compared to [18] and [19], the proposed multiplier has lower space complexity while having the same time complexity as shown in Table 6. In some cases such as m = 439 or 447, the proposed multiplier has not only lower space complexity but also lower time complexity.
Let us compare the proposed multiplier with the fastest multiplier in [8]. The proposed multiplier has one more T X delay. However, its space complexity is much lower. For instance, when m = 439, the proposed multiplier has about 43% reduced space complexity according to Table 6. Our space complexity gain (about 43% reduced space complexity) is much greater than the time complexity loss (about 10% XOR gate delay increment).
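The quoted delay increase of about 10% for $m=439$ can be reproduced from the delay formula of the fastest multipliers. The short Python sketch below counts only XOR levels, ignores $T_A$, and assumes the general delay $T_A+(1+\lceil\log_2 m\rceil)T_X$ rather than the good-field case; these are assumptions made for illustration, and the space figures of Table 6 are not recomputed.

```python
def ceil_log2(value: int) -> int:
    """ceil(log2(value)) for a positive integer."""
    return (value - 1).bit_length()


m = 439
fastest_levels = 1 + ceil_log2(m)      # XOR levels of the fastest quadratic multipliers
proposed_levels = fastest_levels + 1   # the proposed multiplier costs one more T_X
relative_increase = (proposed_levels - fastest_levels) / fastest_levels
print(fastest_levels, proposed_levels, relative_increase)
```

One extra level on top of ten corresponds to the stated increase of about 10% in XOR gate delay.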

V. CONCLUSION
We have proposed a new hybrid multiplier for the trinomial $x^{nm_0+r}+x^k+1$ based on the $n$-term KA, which is a generalization of the multiplication scheme for $x^{nm_0+1}+x^k+1$ in [18]. We have evaluated the explicit complexity of the proposed multiplier and compared it with the best-known bit-parallel multipliers for trinomials. The specific comparisons for several fields show that the proposed multiplier achieves the lowest space complexity with the same or lower time complexity among hybrid multipliers. Compared to the fastest multipliers, the delay of the proposed multiplier is one $T_X$ higher, while its space complexity is much lower. It is noted that the space complexity gain of the proposed multiplier is much greater than its time complexity loss. Consequently, the proposed multiplier outperforms similar bit-parallel multipliers.
CHANGHO SEO received the B.S. and M.S. degrees from Korea University, South Korea, in 1990 and 1992, respectively, and the Ph.D. degree from the Department of Mathematics, Korea University, in 1996. He is currently a Full Professor with the Department of Applied Mathematics, Kongju National University, South Korea. His research interests include cryptography, information security, and system security.