VLSI Architecture of S-Box With High Area Efficiency Based on Composite Field Arithmetic

This work aims at optimizing the hardware implementation of the SubBytes and inverse SubBytes operations in the advanced encryption standard (AES). To this end, composite field arithmetic (CFA) is employed to optimize all building blocks in the S-box (and inverse S-box) of the SubBytes (and inverse SubBytes) transformation. A joint design of the S-box and inverse S-box is also proposed to further enhance the area efficiency. Specifically, the area of the multiplier in the Galois composite field, GF$((2^{2})^{2})$, is reduced. The squaring and multiplication with constant $\lambda$ in GF$((2^{2})^{2})$ are combined and optimized as well. Moreover, the multiplicative inversion in GF$((2^{2})^{2})$ is manually optimized. Furthermore, the S-box and inverse S-box are combined and optimized using the pre_processing and post_processing modules. To increase the throughput, a balanced and pipelined architecture is derived. Using the proposed architecture, a throughput of 5.79 Gbps for the S-box can be achieved on Virtex-6 XC6VLX240T, with an area efficiency 10% better than that of the conventional work. According to the ASIC implementation result, the proposed design still achieves the highest area efficiency, approximately 30% better than conventional works using the TSMC 90 nm process.


I. INTRODUCTION
The National Institute of Standards and Technology (NIST) invited proposals for a new algorithm, the advanced encryption standard (AES), to replace the old data encryption standard (DES). The Rijndael algorithm [1], designed by two Belgian cryptographers, Joan Daemen and Vincent Rijmen, was finally selected as the AES specification and became a FIPS standard [2] in 2001. Nowadays, the AES algorithm is the most popular symmetric encryption algorithm. With the rapid development of transmission technology in communication networks, the data throughput has increased significantly. Therefore, implementing a low-cost but high-throughput AES engine has become an essential issue.
The associate editor coordinating the review of this manuscript and approving it for publication was Paolo Crippa.

Compared to the software solution [3], the hardware implementation is more suitable for high-throughput data applications. Among hardware implementations, the nonlinear SubBytes transformation realized using look-up tables (LUTs) [4]-[6] requires a large area compared to those using the composite field arithmetic (CFA) [7]-[20]. The works [7]-[11] studied low-area implementations based on fully combinational logic. The work [12] presented an S-box based on the multiplexer. The work [13] evaluated 5-, 6-, and 7-stage pipelined S-boxes based on the CFA. By contrast, the studies [14] and [15] proposed a 4-stage pipelined S-box. The work [16] adopted the pre-computation technique with a three-block design of the S-box. In [17], a modified MUX-based S-box was introduced in AES to reduce the area without affecting the throughput. The study [18] proposed a new compact S-box. In [19], a joint AES encryption/decryption with a 7-stage pipeline using the CFA was proposed. The study [20] proposed a parallel pipelined architecture to obtain high data throughput. In [21], a compact AES was obtained by optimizing the mix columns and inverse mix columns transformations. The work [22] shared the circuitry of AES and SHA-3. A 60 Gbps reconfigurable cryptographic processor, including AES and SHA algorithms, was proposed in [23]. The authenticated encryption circuits for the operation modes of AES, AES-CCM and AES-XEX, were considered in [24] and [25], respectively.

Recently, AES research has turned towards countermeasures against the side-channel attack (SCA) [26]-[36]. The side-channel attack is based on correlation power analysis through monitoring the chip supply current signature or electromagnetic emissions. The work [26] proposed a 16-bit serial AES-128 hardware accelerator with randomized byte-order shuffling through heterogeneous S-boxes. The works [27] and [28] adopted an SCA-resistant methodology based on machine learning. The work [29] proposed to use random fast voltage dithering against the SCA, while the works [30] and [31] used asynchronous logic and memristors, respectively. A masked S-box was devised in [32] to cope with the SCA. The work [33] proposed a fast power leakage simulation method for hardware-implemented cryptographic ICs. The works [34] and [35] proposed the collision fault information and incremental fault analysis for attacks on AES, respectively. In [36], the SCA is evaluated for the operation mode of XTS-AES.
The Internet of Things (IoT) paves another way for AES research. The works [37]-[40] proposed lightweight or area-efficient AES cryptographic circuits. In [41], lightweight AES, PRESENT, and GIFT ciphers were evaluated on FPGA.
Since no network is immune to attacks, an efficient network security system is essential to protecting client data. Given the rapid growth of networks and data centers, and the high demand for the greatest possible bandwidth, a new AES circuit is designed in this study with the highest performance, in terms of throughput and area efficiency, for core networks and data centers.
The SubBytes and inverse SubBytes operations usually occupy the largest area and exhibit the longest path delay in AES. Thus, this work aims to develop a new VLSI architecture for the joint S-box and inverse S-box with a high area efficiency based on the CFA. More specifically, this study provides several contributions outlined below.
1) The area of the multiplier in the Galois composite field, GF$((2^{2})^{2})$, is reduced, where GF$(\cdot)$ denotes the Galois field.
2) The squaring and multiplication with constant $\lambda$ in GF$((2^{2})^{2})$ are combined and optimized.
3) The multiplicative inversion in GF$((2^{2})^{2})$ is manually optimized.
4) The S-box and inverse S-box are combined and optimized as well. A pre_processing module is proposed to share the resources in the isomorphic mapping of the S-box and in the inverse affine transformation in tandem with isomorphic mapping of the inverse S-box. Likewise, a post_processing module is proposed to combine the inverse isomorphic mapping in tandem with affine transformation of the S-box and the inverse isomorphic mapping of the inverse S-box.
The rest of this paper is organized as follows. Section II briefly introduces the AES algorithm. Section III presents the proposed SubBytes and inverse SubBytes operations. Section IV demonstrates the implementation results. Finally, Section V draws conclusions.

II. ADVANCED ENCRYPTION STANDARD
AES [1] is the most popular symmetric encryption algorithm, and the same key is used for both encryption and decryption. It divides the plaintext into blocks of a fixed size of 128 bits and processes one block at a time. Several rounds of repeated encryption and decryption processes are performed on the plaintext and ciphertext, respectively.
AES is an iterative algorithm and uses a round function repeatedly. In encryption, each round is composed of four different byte-oriented processing steps: substitute bytes (SubBytes), shift rows (ShiftRows), mix columns (MixColumns), and add round key (AddRoundKey), while the last round does not contain the MixColumns step. In decryption, each round is also composed of four different byte-oriented processing steps: inverse substitute bytes (InvSubBytes), inverse shift rows (InvShiftRows), inverse mix columns (InvMixColumns), and add round key (AddRoundKey), while the last round does not contain the InvMixColumns step.
SubBytes computes the multiplicative inverse of the input element over GF$(2^{8})$ and then applies an affine transformation to that inverse. This step is the only non-linear transformation in AES. For decryption, on the contrary, InvSubBytes applies an inverse affine transformation first. The multiplicative inverse over GF$(2^{8})$ of the output of the inverse affine transformation is the final output of the inverse S-box.
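As a behavioral reference for the two orderings described above, the following minimal Python sketch computes the S-box and inverse S-box directly in GF$(2^{8})$. It is not the composite-field circuit proposed in this work; it only models the input-output behavior that any such circuit must reproduce. The function names are illustrative, and the reduction polynomial $x^{8} + x^{4} + x^{3} + x + 1$ and the affine constants are those of the AES standard rather than of this text.

```python
def gf256_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        a <<= 1
        if a & 0x100:
            a ^= 0x11B           # reduce by the AES polynomial
        b >>= 1
    return result


def gf256_inv(a):
    """Multiplicative inverse as a^254; the input 0 is mapped to 0."""
    result = 1
    for _ in range(254):
        result = gf256_mul(result, a)
    return result if a else 0


def rotl8(x, k):
    """Rotate a byte left by k bit positions."""
    return ((x << k) | (x >> (8 - k))) & 0xFF


def sbox(x):
    """S-box: inversion first, then the affine transformation."""
    b = gf256_inv(x)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


def inv_sbox(y):
    """Inverse S-box: inverse affine transformation first, then inversion."""
    b = rotl8(y, 1) ^ rotl8(y, 3) ^ rotl8(y, 6) ^ 0x05
    return gf256_inv(b)


if __name__ == "__main__":
    print(hex(sbox(0x53)), hex(inv_sbox(sbox(0x53))))
```

Such a model can serve as a golden reference when simulating an RTL description of the joint S-box and inverse S-box.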
The proposed design can be integrated into AES, which has been included in many communication standards, such as Internet engineering task force (IETF) requests for comments, international organization for standardization (ISO), third-generation partnership project (3GPP), and IEEE standards. Therefore, it is a popular and secure encryption algorithm for the media access control (MAC) layer and higher layers, such as the Internet protocol (IP) layer. In fact, AES can be applied in any application that encrypts digital content.

III. PROPOSED SubBytes AND INVERSE SubBytes OPERATIONS
To improve the area efficiency of the AES implementation, the architecture of the SubBytes and inverse SubBytes operations is proposed in Fig. 1, where Fig. 1(a) and Fig. 1(b) are the conventional [19] and proposed architectures, respectively. In Fig. 1(a), $\delta$, $AT$, $(\cdot)^{-1}$, $(\cdot)^{2}$, $\lambda$, $\times$, and $\oplus$ denote the isomorphic mapping, affine transformation, inversion, squaring, multiplication with $\lambda$, multiplication, and addition, respectively. The proposed sub-blocks are introduced below. Notably, the adder in the Galois field corresponds to the XOR operation. Therefore, the adders in Fig. 1 can be simply described in register-transfer level (RTL) code using the Verilog XOR operator.
Let the product $k = qw$, where $k$, $q$, and $w$ are elements in GF$((2^{2})^{2})$. According to the irreducible polynomial of GF$((2^{2})^{2})$, $k$ can be expressed in the two forms (1) and (2). Comparing (1) and (2), one has (3). According to the irreducible polynomial, $x^{2} + x + 1$, in GF$(2^{2})$, one can further reduce (3) by (4)-(7). Similarly, it can be shown that

$$q_{L} w_{L} = (q_{1} w_{1} + q_{0} w_{1} + q_{1} w_{0})x + q_{1} w_{1} + q_{0} w_{0}. \tag{8}$$

Since $k_{H} = k_{3} x + k_{2}$ and $k_{L} = k_{1} x + k_{0}$, substituting (4), (5), (6), (7), (8) into (3) and extracting common factors in $k_{3}$, $k_{2}$, $k_{1}$, $k_{0}$, one finally has (9). It must be emphasized here that $+$ in the Galois field corresponds to the bitwise XOR operation. As presented in (9), only $(q_{3} + q_{2})$, $(q_{1} + q_{0})$, $(w_{3} + w_{2})$, $(w_{1} + w_{0})$, $q_{3} w_{3}$, $q_{2} w_{2}$, $q_{1} w_{1}$, $q_{0} w_{0}$, and $(q_{2} w_{0} + q_{0} w_{2})$ are needed to implement the multiplication in GF$((2^{2})^{2})$. As displayed in Fig. 2, the proposed multiplication requires 18 XOR and 12 AND gates. Its critical path has 4 XOR and 1 AND gates. In comparison, the multiplier in GF$((2^{2})^{2})$ of [19] requires an area of 21 XOR and 9 AND gates, and a critical path of 4 XOR and 1 AND gates.

In this section, two submodules, i.e., squaring and multiplication with constant $\lambda = \{1100\}_{2}$ in GF$((2^{2})^{2})$, are combined and optimized to improve the area efficiency. Let $h$ be an element in GF$((2^{2})^{2})$; its squaring and multiplication with constant $\lambda$ can be written as $l = \lambda h^{2}$. After mathematical derivations (shown in the Appendix), one finally obtains the bit-level expressions of $l$, where $l_{i}$ and $h_{i}$, $i = 0, 1, 2, 3$, are the bits of the elements $l$ and $h$ in GF$((2^{2})^{2})$, respectively. As displayed in Fig. 3, the joint squaring and multiplication with constant $\lambda$ requires 4 XOR gates, and its critical path has 2 XOR gates. The circuit with the same function in [19] requires an area of 7 XOR gates and a critical path of 4 XOR gates.
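The two sub-blocks above can be checked behaviorally with the following minimal Python sketch. It assumes the irreducible polynomial $y^{2} + y + \varphi$ with $\varphi = \{10\}_{2}$ for GF$((2^{2})^{2})$; this constant is a common choice in the composite-field literature but is an assumption here, not something stated in the text. The function names are illustrative, and the XOR-only form in `sq_lambda_xor` was derived for this sketch under that assumption, so it is not claimed to coincide with the bit equations of the proposed design.

```python
PHI = 0b10       # assumed constant of y^2 + y + PHI over GF(2^2)
LAMBDA = 0b1100  # constant lambda = {1100}_2 from the text


def gf4_mul(a, b):
    """Multiply in GF(2^2) modulo x^2 + x + 1; same form as q_L * w_L in (8)."""
    a1, a0 = (a >> 1) & 1, a & 1
    b1, b0 = (b >> 1) & 1, b & 1
    hi = (a1 & b1) ^ (a0 & b1) ^ (a1 & b0)
    lo = (a1 & b1) ^ (a0 & b0)
    return (hi << 1) | lo


def gf16_mul(q, w):
    """Multiply in GF((2^2)^2) using y^2 = y + PHI (assumed polynomial)."""
    qh, ql = q >> 2, q & 0b11
    wh, wl = w >> 2, w & 0b11
    hh = gf4_mul(qh, wh)
    kh = hh ^ gf4_mul(qh, wl) ^ gf4_mul(ql, wh)
    kl = gf4_mul(hh, PHI) ^ gf4_mul(ql, wl)
    return (kh << 2) | kl


def sq_lambda_reference(h):
    """l = lambda * h^2, computed with two general multiplications."""
    return gf16_mul(LAMBDA, gf16_mul(h, h))


def sq_lambda_xor(h):
    """Same map as four XORs (derived for this sketch under the PHI assumption)."""
    h3, h2, h1, h0 = (h >> 3) & 1, (h >> 2) & 1, (h >> 1) & 1, h & 1
    l3 = h2 ^ h1 ^ h0
    l2 = h3 ^ h0
    l1 = h3
    l0 = h3 ^ h2
    return (l3 << 3) | (l2 << 2) | (l1 << 1) | l0


if __name__ == "__main__":
    for h in range(16):
        print(f"{h:04b} -> {sq_lambda_xor(h):04b}")
```

Under these assumptions the combined map needs four XOR operations with a depth of two, which agrees with the gate count and critical path quoted above for the joint squaring and multiplication with constant $\lambda$.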

D. Pre_processing AND Post_processing DESIGNS
The post_processing can output the results of the inverse isomorphic mapping in tandem with affine transformation for the S-box, and of the inverse isomorphic mapping for the inverse S-box. It can be observed that the inputs of the inverse isomorphic mapping and of the inverse isomorphic mapping in tandem with affine transformation are the same, and that their outputs require additions in GF$((2^{2})^{2})$. By sharing the resources involved in the S-box and inverse S-box, the output, $p = \{p_{7} p_{6} p_{5} p_{4} p_{3} p_{2} p_{1} p_{0}\}_{2}$, of the inverse isomorphic mapping and the output, $r = \{r_{7} r_{6} r_{5} r_{4} r_{3} r_{2} r_{1} r_{0}\}_{2}$, of the inverse isomorphic mapping in tandem with affine transformation can be respectively rewritten as (14) and (15), where "+" is the addition in GF$((2^{2})^{2})$, $k = \{k_{7} k_{6} k_{5} k_{4} k_{3} k_{2} k_{1} k_{0}\}_{2}$ denotes the input of post_processing, $t_{8} = t_{4} + t_{0}$, $t_{7} = k_{7} + k_{0}$, $t_{6} = k_{7} + k_{2}$, $t_{5} = k_{7} + k_{6}$, $t_{4} = k_{6} + k_{5}$, $t_{3} = k_{5} + k_{4}$, $t_{2} = k_{4} + k_{3}$, $t_{1} = k_{2} + k_{1}$, and $t_{0} = k_{2} + k_{0}$. The proposed post_processing requires 25 XOR gates, and its critical path has 3 XOR gates. Compared to the separate circuits of the inverse isomorphic mapping and the inverse isomorphic mapping in tandem with affine transformation in [19], which require 32 XOR gates with a critical path of 4 XOR gates, the optimized circuit needs less area and has a shorter critical path.
The pre_processing can be designed similarly, and its derivation is omitted here. It has 22 XOR gates with a critical path of 3 XOR gates. By contrast, the conventional design requires 32 XOR gates with a critical path of 4 XOR gates.
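The resource sharing in the post_processing rests on the intermediate terms $t_{0}$ to $t_{8}$ defined above. The following minimal Python sketch computes only these shared terms from the input byte $k$; the output equations for $p$ and $r$ are not reproduced, and the function name and the dictionary layout are illustrative assumptions.

```python
def post_processing_shared_terms(k):
    """Return the shared XOR terms t0..t8 of the post_processing for input byte k."""

    def bit(i):
        # k_i is bit i of k, with k_0 the least significant bit
        return (k >> i) & 1

    t = {
        0: bit(2) ^ bit(0),
        1: bit(2) ^ bit(1),
        2: bit(4) ^ bit(3),
        3: bit(5) ^ bit(4),
        4: bit(6) ^ bit(5),
        5: bit(7) ^ bit(6),
        6: bit(7) ^ bit(2),
        7: bit(7) ^ bit(0),
    }
    t[8] = t[4] ^ t[0]  # second-level term built from two first-level terms
    return t


if __name__ == "__main__":
    print(post_processing_shared_terms(0b01000101))
```

Each of the nine terms costs one XOR, and $t_{8}$ sits at a depth of two XORs, so both output paths can reuse these signals instead of recomputing them.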

E. DESIGN CHARACTERISTICS OF SUB-BLOCKS AND PIPELINED DESIGN
To obtain a balanced design across the pipelining stages without inserting too many pipeline registers, a 5-stage pipelined joint S-box and inverse S-box is devised, as shown in Fig. 4. Dashed lines represent the positions where registers are placed. Hence, there are 11 4-bit registers inserted in the pipelined architecture. In this design, the critical path with 4 XOR and 1 AND gates is in the second stage.

IV. IMPLEMENTATION RESULTS
First, the proposed architecture of the S-box is evaluated on various FPGA devices that were adopted in conventional pipelined designs in the literature, namely Virtex-6 XC6VLX240T, Virtex-5 XC5VLX20T, Virtex-4 XC4VF100, and Spartan-3 XC3S200. The proposed design is described using the Verilog HDL. The development system is Xilinx ISE Design Suite 14.7.
The FPGA implementation results are displayed in Table 3. As presented, the proposed architecture of the S-box has the highest throughput and area efficiency. Using the proposed architecture on Virtex-6, a throughput of 5.79 Gbps for the S-box can be achieved. Compared to that of [14], the area efficiency of the proposed implementation is increased by 10% on Virtex-6 XC6VLX240T. Moreover, it should be emphasized here that conventional designs focus only on the pure S-box. Therefore, only the S-box is assessed in Table 3. Beyond that, the proposed VLSI architecture can share the resources in the S-box and inverse S-box, i.e., isomorphic mapping and inverse affine transformation in pre_processing, and inverse isomorphic mapping and affine transformation in post_processing.

Next, regarding the ASIC implementation in Table 4, based on the TSMC 90 nm cell library and the Synopsys Design Compiler synthesis tool, at the maximum achievable clock rate for each architecture, the proposed design still achieves the highest area efficiency, approximately 30% better than conventional works using the TSMC 90 nm process. Notably, the works [7]-[11] proposed fully combinational circuits for the S-box (and inverse S-box). Therefore, they can achieve smaller areas than this work and [19]. To further compare this work and [19], both are also synthesized using TSMC 40 nm in Table 4. As presented, in terms of the area efficiency, this work is still 25.5% better than [19] using the TSMC 40 nm process.
The compared designs are described in brief here. The works [7], [8], and [10] use the normal basis to establish the tower field and then map the elements of the Galois field GF$(2^{8})$ to this field. By contrast, this work, [9], [11], and [19] use the composite field representation of GF$((2^{4})^{2})$ to construct the S-box. However, different optimization approaches are adopted to implement the circuits. The work [9] proposed a logic-minimization algorithm and a search for optimum transformation matrices. By contrast, our approach is to combine and optimize each module through mathematical reduction. The work [11] proposed to combine the S-box and the inverse S-box through multiplexers. This work instead proposes the pre_processing module to optimize the inverse affine transformation and isomorphic mapping, and the post_processing module to integrate the affine transformation and inverse isomorphic mapping. The work [19] proposed an efficient S-box using pipelining, while this work further optimizes all building blocks of the S-box and inverse S-box, such as multiplication, squaring and multiplication with constant $\lambda$, multiplicative inversion, and pre_processing and post_processing. Moreover, the positions where registers are inserted in this work differ from those of [19]. Therefore, the number of required registers of this work can be reduced. To analyze the proposed design and [19], which have the highest throughput and area efficiency in the literature, the detailed time complexity and data are given in Tables 5 and 6. As presented, the area and timing of the proposed design are better than those in [19].

V. CONCLUSION AND OUTLOOKS
A new pipelined VLSI architecture for the joint S-box and inverse S-box using the CFA in GF$((2^{2})^{2})$ is proposed. First, excluding the adder, all building blocks in the SubBytes and inverse SubBytes, such as the multiplier, multiplicative inversion, squaring and multiplication with constant $\lambda$, isomorphic function and inverse affine transformation, and inverse isomorphic function and affine transformation, are optimized at the algorithm level. Next, to focus on VLSI implementation, the schematics of the multiplier and of the squaring and multiplication with constant $\lambda$ are plotted in Figs. 2 and 3, respectively. Besides, the Boolean equations of the multiplicative inversion in GF$((2^{2})^{2})$ and of the post_processing are written as (13) and (14) (and (15)), respectively. Therefore, the proposed VLSI design can be simply implemented and replicated. Finally, the superiority of the proposed design is validated using both FPGA and ASIC implementations. Our future work will integrate the proposed S-box and inverse S-box into the whole AES circuit.