High Throughput/Gate AES Hardware Architectures Based on Datapath Compression

This article proposes highly efficient Advanced Encryption Standard (AES) hardware architectures: one supporting encryption only and one supporting both encryption and decryption. New operation-reordering and register-retiming techniques presented in this article allow us to unify the inversion circuits in SubBytes and InvSubBytes without any delay overhead. In addition, a new optimization technique for minimizing linear mappings, named multiplicative-offset, further enhances the hardware efficiency. We also present a shared key scheduling datapath that can work on-the-fly in the proposed architecture. To the best of our knowledge, the proposed architecture has the shortest critical path delay and is the most efficient in terms of throughput per area among conventional AES encryption/decryption and encryption architectures with tower-field S-boxes. The proposed round-based architecture can perform AES encryption even where block-wise parallelism is unavailable (e.g., in cipher block chaining (CBC) mode); thus, our techniques can be applied globally to any type of architecture, including pipelined ones. We evaluated the performance of the proposed and some conventional datapaths by logic synthesis with the NanGate 45-nm open-cell library. The results confirm that our proposed architectures achieve approximately 51-64 percent higher efficiency (i.e., higher bps/GE) and lower power/energy consumption than conventional counterparts.


INTRODUCTION
CRYPTOGRAPHIC applications are essential for many systems that rely on secure communications, authentication, and digital signatures. In accordance with the rapid increase in Internet of Things (IoT) applications, numerous cryptographic algorithms are now required to be implemented in resource-constrained devices and embedded systems with high throughput and efficiency. The Advanced Encryption Standard (AES) is an ISO/IEC 18033 standard symmetric cipher and one of the most widely used ciphers around the world. Since the publication of AES in 2001, many hardware implementations of AES have been proposed and evaluated for CMOS logic technologies. Studies of AES design and implementation are important from both practical and academic perspectives, since the substitution permutation network (SPN) structure employed by AES and its major subfunctions are followed by many other security primitives.
AES encryption and decryption are frequently used in block-chaining modes of operation, such as cipher block chaining (CBC), cipher-based message authentication code (CMAC), and counter with CBC-MAC (CCM), which are used in, for example, IEEE 802.11 wireless LANs and IEEE 802.15.4 wireless sensor networks. Therefore, AES architectures that efficiently perform both encryption and decryption in block-chaining modes are in high demand. Here, we cannot exploit the pipelining tradeoff between throughput and area for encryption in block-chaining modes, although many conventional architectures employ pipelining techniques [1], [2], [3]. This raises the importance of datapath optimization for higher throughput and efficiency. In addition, on-the-fly key scheduling should be implemented in resource-constrained devices because an offline key scheduling implementation requires additional memory to store the expanded round keys. Moreover, on-the-fly key scheduling is sometimes even more important when implementing a block cipher with a tweak [4] (e.g., as used in authenticated encryption), in some of which a temporal key is generated from a master key and a tweak unique to each block [5], [6]. Thus, it would be valuable to develop efficient AES architectures with on-the-fly key scheduling that do not rely on block-wise pipelining techniques.
In this paper, we present new round-based AES architectures, one for encryption only and one for both encryption and decryption, with on-the-fly key scheduling. The proposed architectures achieve the lowest critical path delay (i.e., the fewest serially connected gates in the critical path) with less area overhead compared with conventional architectures with tower-field S-boxes. Our architectures employ new operation-reordering and register-retiming techniques to unify the inversion circuits without any selectors. These techniques also make it possible to unify the affine transformation and linear mappings (i.e., the isomorphisms and constant multiplications) to reduce the total number of logic gates. The proposed and conventional AES datapaths were synthesized and evaluated with an open-cell library. The evaluation results show that our two architectures perform encryption and both encryption and decryption with greater area-time efficiency. In particular, the throughput per gate of the proposed architectures is 51-64 percent higher than that of the corresponding best conventional architectures.
While the basic concepts and preliminary evaluations of our AES encryption/decryption architecture were presented in previous work [7], this paper newly presents more efficient AES architectures with three novel contributions. First, we propose a new optimization technique for minimizing linear operations named multiplicative-offset. The multiplicative-offset provides a larger variety of matrix constructions for linear mappings without any additional block or overhead, which can lead to more compact and/or lower-latency implementations. The proposed multiplicative-offset is given as an extension of a previous method [8] to round-based architectures on the basis of operation-reordering and register-retiming. Thanks to the new method, the proposed architecture achieves a further 7-9 percent higher efficiency than that in the previous version [7]. Second, we newly present a high throughput/gate AES encryption hardware design based on the proposed concepts/techniques (i.e., tower-field arithmetic, register-retiming, unification of linear operations, and multiplicative-offset). In particular, we show that the combination of the unification and multiplicative-offset techniques can significantly reduce the logic depth (i.e., critical path delay) of the encryption datapath. As a result, the proposed AES encryption architecture achieves 58-64 percent higher throughput/gate efficiency than other conventional ones. Third, while the previous study did not include any reports on power consumption, we describe power consumption estimations based on Monte-Carlo gate-level timing simulations with back-annotation, in which the effects of glitches were considered. The results clearly show the advantage of the proposed architectures in terms of power/energy consumption.
The rest of this paper is organized as follows. Section 2 introduces related works on AES hardware architectures. Section 3 presents a new AES encryption/decryption hardware architecture based on our operation-reordering, register-retiming, unification of linear mappings, and multiplicative-offset, and evaluates the proposed datapath by logic synthesis and gate-level timing simulations in comparison with conventional round-based datapaths. Section 4 proposes an AES encryption hardware architecture based on the techniques presented in Section 3 and evaluates the proposed architecture in the same manner. Section 5 discusses variations of the proposed architectures. Finally, Section 6 contains our conclusion.

RELATED WORKS
In this section, we briefly describe the related works. See the previous version [7] for more detail.

Unified AES Datapath for Encryption and Decryption
Architectures that perform one round of encryption or decryption per clock cycle without pipelining are the most typical for AES design and are called round-based architectures in this paper. Round-based architectures can be implemented more efficiently in terms of throughput per area than other architectures by utilizing the inherent parallelism of symmetric key ciphers.
To design such round-based encryption/decryption architectures in an efficient manner, we consider how to unify resource-consuming components, such as the inversion circuits in SubBytes/InvSubBytes, between the encryption and decryption datapaths. There are two conventional approaches for designing such unified datapaths. The first approach is to place two distinct datapaths for encryption and decryption and select one of them with multiplexers, as in [1]. In that architecture, the intermediate value is stored in a register after InvMixColumns instead of after AddRoundKey. Such register-retiming is suitable for pipelined architectures. The second approach is to unify the circuits of the functions SubBytes, ShiftRows, and MixColumns with their respective inverse functions (e.g., [2], [9]), where the order of the decryption operations is changed to be the same as that of the encryption operations. The main drawbacks of conventional architectures are a false critical path and the area and delay overheads caused by several multiplexers. The false critical path reduces the maximum operating frequency obtained by logic synthesis because the synthesizer constrains timing to a longest logic chain that is never activated. The overhead caused by the multiplexers is also nonnegligible for common standard-cell-based designs. In addition, the second approach sometimes requires an additional InvMixColumns circuit for the above reordering [9], which is also considered an overhead.

Inversion Circuit Design and Tower-Field Arithmetic
The design of the inversion circuit used in (Inv)SubBytes has a significant impact on the performance of AES implementations. There are two major approaches to its design, namely, direct mapping and tower-field arithmetic. Inversion circuits based on direct mapping, such as table-lookup, binary decision diagram (BDD), and positive-polarity Reed-Muller (PPRM) forms [1], [10], [11], are faster but larger than those based on tower fields. On the other hand, tower-field arithmetic enables us to design more compact and area-time-efficient inversion circuits in comparison with direct mapping [2], [8], [9], [12], [13], [14], [15], [16], [17], [18]. Therefore, we focus on inversion circuits based on tower-field arithmetic in this paper.
To embed such a tower-field-based inversion circuit in AES hardware, isomorphic mapping between the AES field and the tower field is required because the inversion and MixColumns are performed over the AES field (i.e., the polynomial-basis (PB)-based GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1). Typically, the input to the inversion circuit (in the AES field) is first mapped to the tower field by the isomorphic mapping. After the inversion operation over the tower field, the inverse isomorphic mapping is applied prior to the affine transformation [9]. On the other hand, some architectures perform all of the AES subfunctions (i.e., SubBytes as well as ShiftRows, MixColumns, and AddRoundKey) over the tower field, where the isomorphic mapping and its inverse are performed at the data input (i.e., plaintext) and output (i.e., ciphertext), respectively [2], [19]. In other words, the cost of field conversion is suppressed because the conversion is performed only once during encryption or decryption. However, while inversion is performed efficiently over the tower field, the cost of the constant multiplications in MixColumns over a tower field is worse than that over the AES field. More precisely, in tower-field architectures, such linear mappings including constant multiplications usually require a 3T_XOR delay, where T_XOR denotes the delay of an XOR gate [20]. The XOR-gate count of (Inv)MixColumns over a tower field is also worse than that over the AES field.
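The AES-field arithmetic referred to above can be made concrete with a short sketch. The following Python snippet (an illustration of ours, not part of the proposed hardware) implements multiplication and inversion in the PB-based GF(2^8) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1 (0x11B); the inversion here uses the identity x^{-1} = x^254, whereas the architectures discussed in this paper realize it with tower-field circuits.

```python
def gmul(a: int, b: int) -> int:
    """Multiply a and b in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    r = 0
    while b:
        if b & 1:
            r ^= a        # add a copy of a for this bit of b
        b >>= 1
        a <<= 1
        if a & 0x100:     # reduce modulo the AES polynomial
            a ^= 0x11B
    return r

def ginv(a: int) -> int:
    """Inversion via a^(-1) = a^254 (0 is mapped to 0, as in the S-box)."""
    r = 1
    for _ in range(254):
        r = gmul(r, a)
    return r if a else 0
```

For example, gmul(0x53, 0xCA) evaluates to 0x01, matching the worked inverse pair {53}·{CA} = {01} in FIPS-197.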

PROPOSED ENCRYPTION/DECRYPTION ARCHITECTURE
This section presents a new round-based AES architecture that unifies the encryption and decryption paths in an efficient manner. The key ideas for reducing the critical path delay are summarized as follows: (a) to merge linear mappings, such as MixColumns and the isomorphic mappings, as much as possible by reordering subfunctions; (b) to minimize the number of selectors needed to unify the encryption and decryption paths by means of the above merging and a register-retiming; and (c) to perform the isomorphic mapping and its inverse only once, in the pre- and post-round datapaths. As the effect of (a), we can reduce the number of linear mappings to at most one per round operation. Moreover, as the effect of (b), we can reduce the number of selectors in the unified datapath to only one (a 4-to-1 multiplexer) while the inversion circuit is shared by the encryption and decryption paths. From idea (c), we can remove the isomorphic mapping and its inverse from the critical path. Fig. 1 shows the overall architecture, which consists of the round function and key scheduling parts. Our architecture performs all of the subfunctions over a tower field for both the round function and key scheduling parts and therefore applies isomorphic mappings between the AES and tower fields in the datapaths of the pre- and post-round operations, which are represented as the blocks "Pre-round datapath" and "Post-round datapath" in Fig. 1. "Round datapath" performs one round operation for either encryption or decryption.

Round Function Part
The proposed architecture employs a datapath for encryption and decryption where inversion is unified and applies new operation-reordering and register-retiming techniques to address the conventional issues of a false critical path and additional multiplexers. By using our operation-reordering technique and then merging linear mappings, we can reduce the number of linear mappings on the critical path of the round datapath to at most one. Our reordering technique also allows for the unification of the linear mappings and affine transformation in a round. The unification of these mappings can drastically reduce the critical path delay and the XOR-gate count of linear mappings in a tower-field architecture.
The new operation reordering is derived as follows. First, the round operation of AES encryption is represented by the following equation:

  m^(r+1)_{i,j} = Σ_{e=0}^{3} u_{e-i} S(m^(r)_{e,e+j}) + k^(r)_{i,j},  (1)

where m^(r)_{i,j} and k^(r)_{i,j} are the ith-row and jth-column intermediate value at the rth round input in encryption and the rth round key, respectively, except for the final round, and addition denotes the bitwise XOR. Note that the subscripts of each variable are members of Z/4Z. The function S indicates the 8-bit S-box, and u_0, u_1, u_2, and u_3 are the coefficients of the MixColumns matrix, which are respectively given by β, β + 1, 1, and 1, where β is the element of the AES field corresponding to {02}. We can rewrite Eq. (1) by decomposing S into inversion and affine transformation as follows:

  m^(r+1)_{i,j} = Σ_{e=0}^{3} u_{e-i} (A((m^(r)_{e,e+j})^{-1}) + c) + k^(r)_{i,j},  (2)

where A is the linear mapping of the affine transformation, and c (= β^6 + β^5 + β + 1) is a constant. Performing the inversion over a tower field yields

  m^(r+1)_{i,j} = Σ_{e=0}^{3} u_{e-i} (A(Δ'((Δ(m^(r)_{e,e+j}))^{-1})) + c) + k^(r)_{i,j},  (3)

where Δ is the isomorphic mapping from the AES field to a tower field, and Δ' is the inverse isomorphic mapping. The linear mappings, which include an isomorphism and constant multiplications over the GF, are performed as constant matrix multiplications over GF(2). Therefore, we can merge such mappings to reduce the critical path delay and the number of XOR gates. To unify all linear mappings of one round into at most one mapping, we apply Δ to both sides of Eq. (3) and merge the resulting linear mappings, which yields

  Δ(m^(r+1)_{i,j}) = Σ_{e=0}^{3} U_{e-i}((Δ(m^(r)_{e,e+j}))^{-1}) + Δ(c) + Δ(k^(r)_{i,j}),  (6)

where U_{e-i}(x) = Δ(u_{e-i}(A(Δ'(x)))). Note that U_2(x) = U_3(x) is equal to the affine transformation over the tower field, denoted as C(x) = Δ(A(Δ'(x))). Thus, the linear mappings of a round in Eq. (6) can be merged into at most one, even with a tower-field S-box, whereas the linear mappings in Eq. (3) cannot be. On the other hand, the equations for AES decryption corresponding to Eqs. (1) and (3) are given by

  m'^(r+1)_{i,j} = Σ_{e=0}^{3} v_{e-i} ((A'(m'^(r)_{e,j-e}) + c')^{-1}) + k'^(r)_{i,j},  (7)

  m'^(r+1)_{i,j} = Σ_{e=0}^{3} v_{e-i} Δ'((Δ(A'(m'^(r)_{e,j-e}) + c'))^{-1}) + k'^(r)_{i,j},  (8)
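The decomposition S(x) = A(x^{-1}) + c used above can be checked directly in software. The following hedged Python sketch (the helper names are ours, not the paper's) builds the AES S-box from inversion in GF(2^8), the FIPS-197 affine matrix A, and the constant c = {63} = β^6 + β^5 + β + 1.

```python
def gmul(a, b):
    # Multiplication in the AES field GF(2^8) mod 0x11B.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def ginv(a):
    # Inversion via a^254; 0 is mapped to 0 as in SubBytes.
    r = 1
    for _ in range(254):
        r = gmul(r, a)
    return r if a else 0

def affine_A(x):
    # Linear part A of the AES affine transformation (FIPS-197):
    # output bit i = x_i ^ x_{i+4} ^ x_{i+5} ^ x_{i+6} ^ x_{i+7} (indices mod 8).
    y = 0
    for i in range(8):
        bit = 0
        for k in (0, 4, 5, 6, 7):
            bit ^= (x >> ((i + k) % 8)) & 1
        y |= bit << i
    return y

def sbox(x):
    # S(x) = A(x^{-1}) + c with c = 0x63.
    return affine_A(ginv(x)) ^ 0x63
```

For instance, sbox(0x53) yields 0xED and sbox(0x00) yields 0x63, matching the FIPS-197 S-box table.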
where m'^(r)_{i,j} denotes the ith-row and jth-column intermediate value at the rth round input in decryption, k'^(r)_{i,j} denotes the corresponding round key, and A' indicates the linear mapping of the inverse affine transformation. The coefficients v_0, v_1, v_2, and v_3 are respectively given by β^3 + β^2 + β, β^3 + β + 1, β^3 + β^2 + 1, and β^3 + 1, and c' (= β^2 + 1) is a constant. Here, the linear mappings cannot be merged into one because they are performed both before and after the inversion operation. In addition, if we construct an encryption/decryption datapath based on Eqs. (6) and (8), the inversion circuit cannot be shared by encryption and decryption without a selector because the timings of the inversion operations differ from each other. Therefore, we consider a register-retiming that stores the intermediate value s^(r)_{i,j} = Δ(A'(m'^(r)_{i,j}) + c') computed immediately before the inversion, which transforms Eq. (8) into

  s^(r+1)_{i,j} = Σ_{e=0}^{3} V_{e-i}((s^(r)_{e,j-e})^{-1}) + Δ(A'(k'^(r)_{i,j}) + c'),  (9)

where V_{e-i}(x) = Δ(A'(v_{e-i}(Δ'(x)))). Our round datapath is constructed with a minimal critical path delay according to Eqs. (6) and (9). Fig. 2 shows the proposed reordering technique. In the proposed flow, we perform the isomorphic mapping from/to GF(2^8) to/from GF((2^4)^2) at the data input/output. We first decompose SubBytes into the inversion and (Inv)Affine. In the encryption, Affine, MixColumns, and AddRoundKey can be merged by exchanging Affine and ShiftRows. In the decryption, the inversion circuit is located at the beginning of the round by exchanging the inversion and InvShiftRows. Note that all the operations excluding the isomorphic mappings are performed over the tower field, as mentioned before. Thus, additional selectors for sharing the inversion circuit are not required thanks to the operation-reordering and register-retiming techniques. This is because both inversion operations are performed at the beginning of the round, which means that the data register output can be directly connected to the inversion circuit. Fig. 3 illustrates the proposed round function datapath with the unification of linear mappings.
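The coefficient sets u = (β, β+1, 1, 1) and v = (β^3+β^2+β, β^3+β+1, β^3+β^2+1, β^3+1) correspond to the bytes {02, 03, 01, 01} and {0E, 0B, 0D, 09}. As a sanity check of ours (not taken from the paper), the following sketch verifies over GF(2^8) that the circulant InvMixColumns matrix built from v is the inverse of the MixColumns matrix built from u.

```python
def gmul(a, b):
    # Multiplication in the AES field GF(2^8) mod 0x11B.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

# Circulant matrices: entry (i, e) is u[(e - i) % 4] (resp. v[(e - i) % 4]).
u = [0x02, 0x03, 0x01, 0x01]  # MixColumns coefficients u_0..u_3
v = [0x0E, 0x0B, 0x0D, 0x09]  # InvMixColumns coefficients v_0..v_3
M = [[u[(e - i) % 4] for e in range(4)] for i in range(4)]
N = [[v[(e - i) % 4] for e in range(4)] for i in range(4)]

# Matrix product over GF(2^8): additions are XORs, so M*N should be the identity.
P = [[0] * 4 for _ in range(4)]
for i in range(4):
    for j in range(4):
        acc = 0
        for e in range(4):
            acc ^= gmul(M[i][e], N[e][j])
        P[i][j] = acc
```

Since M·N is the identity, applying InvMixColumns after MixColumns restores each column, consistent with the coefficient definitions above.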
Our architecture employs only one 128-bit 4-in-1 multiplexer, whereas conventional ones employ several 128-bit multiplexers. For example, the datapath in [21] employs seven 128-bit multiplexers (the selectors in SubBytes/InvSubBytes are included in these seven). Fewer selectors reduce the critical path delay and circuit area and solve the false critical path problem. Unified affine and Unified affine^{-1} in Fig. 3 perform the unified linear mappings (i.e., U_0, ..., U_3 and V_0, ..., V_3) and the constant addition. The number of linear mappings on the critical path is at most one in our architecture, whereas that of the conventional architectures is not. We can also suppress the overhead of constant multiplication over the tower field by the unification. The adder arrays in Fig. 3 consist of four 4-input 8-bit adders in MixColumns or InvMixColumns. In the encryption, the factoring technique for MixColumns and AddRoundKey [20] is available for Unified affine, as described in detail in Section 4.1, which makes the circuit area smaller without a delay overhead. As a result, the data width between Unified affine and Adder array in the Encryption path is reduced from 512 to 256 bits because the calculations of U_1 and U_3 are not performed in the Encryption path. In addition, Adder array and AddRoundKey are unified in the Encryption path because both of them are composed of 8-bit adders. (Some architectures, such as [9], [21], unify AddInitialKey and AddRoundKey; we did not unify them to avoid increasing the number of selectors.) On the other hand, since there is no factoring technique for InvMixColumns without delay overheads, the data width from Unified affine^{-1} to Adder array in the Decryption path is 512 bits. Finally, an inactive path can be disabled by using a demultiplexer since our datapath is fully parallel after the inversion circuit. Thanks to this disabling, a multiplexer and AddRoundKey are unified as Bit-parallel XOR. (The addition of Δ(c) in Unified affine should be active only during encryption.) In addition, the demultiplexer can suppress power consumption due to dynamic hazards. Although tower-field inversion circuits are known to be power-consuming owing to dynamic hazards [11], these hazards can be terminated at the input of the inactive path.
Our datapath employs the inversion circuit presented in [8], [15] because it has the highest area-time efficiency among inversion circuits applicable to tower-field architectures and/or both encryption and decryption. We can merge the isomorphic mappings in order to reduce the linear functions on the round datapath to only one, even though the inversion circuit has different GF representations at its input and output. Since the output is given in a redundantly represented basis (RRB), the data width from Inversion to Unified affine (or Unified affine^{-1}) is 160 bits. However, AddRoundKey in the decryption path and Bit-parallel XOR in the post-round datapath are each implemented with only 128 XOR gates because the normal basis (NB) used at the input is equal to the reduced version of the RRB. Note here that, while the path through the Encryption path and Post-round datapath contains two linear mappings (i.e., Unified affine and GF((2^4)^2) to GF(2^8)), this path would not be the critical path because the Adder array (including AddRoundKey) and the 4-in-1 multiplexer basically have a longer delay than GF((2^4)^2) to GF(2^8). In addition, a 1-to-2 DeMUX can be implemented with NOR gates thanks to the redundancy, whereas nonredundant representations require AND gates.

Key Scheduling Part
The on-the-fly key scheduling part is shared by the encryption and decryption processes. For encryption, the key scheduling part first stores the initial key in the Initial key register in Fig. 1 and then generates the round keys during the following clock cycles. For decryption, the final round key should be calculated from the initial key and stored in the Initial key register in advance. The key scheduling part then generates the round keys in reverse order with the Round key generator in Fig. 1. However, conventional key scheduling datapaths such as those in [9], [21] are not applicable to our architecture because they have a loop with a false path and/or a longer true critical path than our optimized round datapath.
To address the above issue, we introduce a new architecture for the key scheduling datapath. For an on-the-fly implementation, the four 32-bit subkeys (i.e., 128 bits) of a round key are calculated in each clock cycle. Therefore, the on-the-fly key scheduling for encryption over a tower field is expressed as

  k^(r+1)_0 = k^(r)_0 + KeyEx(k^(r)_3),
  k^(r+1)_j = k^(r)_j + k^(r+1)_{j-1}  (j = 1, 2, 3),  (14)

where k^(r)_0, k^(r)_1, k^(r)_2, and k^(r)_3 are the 32-bit subkeys of the rth round over the tower field, and KeyEx is the key expansion function that consists of a round-constant addition, RotWord, and SubWord over the tower field. The inverse key scheduling for decryption over the tower field is represented by

  k^(r)_j = k^(r+1)_j + k^(r+1)_{j-1}  (j = 1, 2, 3),
  k^(r)_0 = k^(r+1)_0 + KeyEx(k^(r+1)_3 + k^(r+1)_2).  (15)

Fig. 4 shows the proposed key scheduling datapath architecture, where the KeyEx components (i.e., RotWord, SubWord, and Add constants) are unified for encryption and decryption. Here, the input key is initially mapped to the tower field, and all of the computations (including AddRoundKey) are performed over the tower field. The upper 2-in-1 multiplexer selects the initial key or the final round key as the input to the Initial key register, the middle 2-in-1 multiplexer selects the key stored in the Initial key register or a round key as the input to the Round key generator, and the lower 2-in-1 multiplexers select the encryption or decryption path. Importantly, most of the adders (i.e., XOR gates) for computing k^(r+1)_1, k^(r+1)_2, and k^(r+1)_3 should be nonintegrated in order to keep the critical path shorter than that of the round function part. In addition, the ENC/DEC signal controls the input to RotWord and SubWord by using a 32-bit AND gate. Such a use of an AND gate is useful for shortening the critical path of the key scheduling datapath compared with conventional ones that employ only multiplexers. Moreover, the round-constant addition is performed separately from RotWord and SubWord to reduce the critical path delay. As a result, the critical path delay of the key scheduling part becomes shorter than that of our optimized round function part.
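The forward and inverse key-scheduling recurrences can be exercised in software. The sketch below is a plain software model under our own naming (the paper's hardware performs these steps over the tower field, which we omit here): it expands one AES-128 round key on the fly and then inverts the step.

```python
def gmul(a, b):
    # Multiplication in the AES field GF(2^8) mod 0x11B.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def sbox(x):
    # Inversion via x^254, then the AES affine transformation with c = 0x63.
    r = 1
    for _ in range(254):
        r = gmul(r, x)
    inv = r if x else 0
    y = 0x63
    for i in range(8):
        bit = 0
        for k in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + k) % 8)) & 1
        y ^= bit << i
    return y

def rot_word(w):
    # Rotate a 32-bit word left by one byte.
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF

def sub_word(w):
    # Apply the S-box to each byte of a 32-bit word.
    return sum(sbox((w >> s) & 0xFF) << s for s in (24, 16, 8, 0))

RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def next_round_key(k, r):
    # Forward recurrence: k0' = k0 + KeyEx(k3); kj' = kj + k(j-1)'.
    t = sub_word(rot_word(k[3])) ^ (RCON[r] << 24)
    out = [k[0] ^ t]
    for j in (1, 2, 3):
        out.append(k[j] ^ out[j - 1])
    return out

def prev_round_key(k, r):
    # Inverse recurrence: kj = kj' + k(j-1)'; k0 = k0' + KeyEx(k3' + k2').
    k3 = k[3] ^ k[2]
    k2 = k[2] ^ k[1]
    k1 = k[1] ^ k[0]
    k0 = k[0] ^ sub_word(rot_word(k3)) ^ (RCON[r] << 24)
    return [k0, k1, k2, k3]
```

Starting from the FIPS-197 test key 2b7e1516 28aed2a6 abf71588 09cf4f3c, the first word of round key 1 is a0fafe17, and prev_round_key undoes next_round_key, which is exactly what the shared encryption/decryption key scheduling datapath exploits.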

Optimization of Linear Mappings Based on Multiplicative-Offset
Linear mappings (i.e., the isomorphic mappings, the linear part of the unified affine, and the constant multiplications in MixColumns/InvMixColumns) are realized as XOR matrix operations whose constructions are determined by the defining polynomials of the tower field in the case of a tower-field implementation. The construction of the XOR matrices (especially, the Hamming weights of the matrices) has an impact on the performance of AES hardware. In this subsection, we newly present a method named multiplicative-offset for increasing the variety of constructions of linear mappings, in order to find conversion matrices with lower Hamming weights. Although the basic idea is similar to the method for optimizing tower-field S-box implementations proposed in [8], which uses a fixed multiplicative mask for the inversion, we extend and generalize it in order to optimize the whole tower-field AES encryption/decryption architecture on the basis of the proposed register-retiming and operation-reordering. As shown in Fig. 2, all operations excluding the first and final ones (i.e., "Multiply g / GF(2^8) to GF((2^4)^2)" and "GF((2^4)^2) to GF(2^8) / Multiply g^{-1}," respectively) should be performed over the tower field. The basic idea of multiplicative-offset is to initially multiply all bytes of the input data by a constant value g as an offset, which is a nonzero element of the PB-based GF(2^8); then to multiply the intermediate bytes at each round by g^2 to correct the offset (since inversion turns the offset g into g^{-1}); and finally to multiply by g^{-1} before the data output to remove the offset. Note that, in decryption, the offset value should be given by g^{-1} (and g^{-2} should be multiplied in a round) in order to share the pre- and post-round datapaths. Since multiplication over a GF is a type of linear mapping over the GF, this multiplication can be merged into the isomorphic mapping or the unified affine with further operation-reordering and register-retiming.
Thus, we can increase the variety of conversion matrices by 255 times without any overhead because g can take a value from 255 candidates.
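The offset bookkeeping described above can be sanity-checked numerically. In this hedged Python sketch (with an arbitrarily chosen offset g = {03}; any nonzero byte works), an input carrying the offset g enters inversion, which turns the offset into g^{-1}, and a subsequent multiplication by g^2 restores the offset g, mirroring the per-round correction.

```python
def gmul(a, b):
    # Multiplication in the AES field GF(2^8) mod 0x11B.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def ginv(a):
    # Inversion via a^254 (0 maps to 0).
    r = 1
    for _ in range(254):
        r = gmul(r, a)
    return r if a else 0

g = 0x03                 # arbitrary nonzero offset (one of 255 candidates)
g2 = gmul(g, g)          # g^2, the per-round correction factor

# For every nonzero x: g^2 * (g*x)^(-1) == g * x^(-1),
# i.e., inverting an offset value and then applying the g^2
# correction yields the offset version of the true inverse.
ok = all(
    gmul(g2, ginv(gmul(g, x))) == gmul(g, ginv(x))
    for x in range(1, 256)
)
```

The same algebra with g^{-1} and g^{-2} covers the decryption direction noted above.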
Importantly, the isomorphic mappings with multiplicative-offset at Rounds 0 and 10 and for the round keys are identical in the cases of encryption and decryption, which indicates that we can still unify the pre- and post-round datapaths and the key scheduling part even with the multiplicative-offset. In addition, since the number of merged linear operations is the same as that without multiplicative-offset in Fig. 2, we can confirm that the multiplicative-offset requires no overhead. Note that, while the conventional optimization method in [8] requires g to be multiplied at both the input and output, the proposed multiplicative-offset modifies the timing of the multiplications with register-retiming such that all the linear operations per round can be unified.
More precisely, we first multiply m^(r)_{e,e+j} and the result of the inversion in Eq. (3) by g, adjust c, and multiply both sides of Eq. (3) by g, which yields the multiplicative-offset version of Eq. (6):

  Δ(g m^(r+1)_{i,j}) = Σ_{e=0}^{3} W_{e-i}((Δ(g m^(r)_{e,e+j}))^{-1}) + Δ(gc) + Δ(g k^(r)_{i,j}),

where W_{e-i}(x) = Δ(g(u_{e-i}(A(g(Δ'(x)))))), which denotes the unified linear transformation with the offset correction in the encryption. Note that gc is a constant value for a fixed g, and it should be embedded in the datapath. In the pre-round datapath, the plaintext is not only mapped to the tower field but is also offset by g at Δ_g = Δg, which denotes the merged linear mapping of Δ and the constant multiplication by g. In the post-round datapath, the ciphertext is mapped back to the AES field, and the offset is removed at Δ'_{g^{-1}} = g^{-1}Δ', which denotes the merged linear mapping of Δ' and the constant multiplication by g^{-1}.
On the other hand, in decryption, we multiply s^(r)_{i,j} and the result of the inversion in Eq. (8) by g^{-1}, adjust c', and multiply both sides of Eq. (8) by g^{-1}. As in the encryption, by using the register-retiming and operation-reordering techniques, we then derive the multiplicative-offset version of Eq. (9) as follows:

  t^(r+1)_{i,j} = Σ_{e=0}^{3} L_{e-i}((t^(r)_{e,j-e})^{-1}) + Δ(g^{-1}(A'(k'^(r)_{i,j}) + c')),

where L_{e-i}(x) = Δ(g^{-1}(A'(v_{e-i}(g^{-1}(Δ'(x)))))) and t^(r)_{i,j} = Δ(g^{-1}(Δ'(s^(r)_{i,j}))), which denote the unified linear transformation with offset correction and the intermediate value with the proposed register-retiming and multiplicative-offset in the decryption, respectively. Note here that the offset value is given by g^{-1} instead of g because of the register-retiming. In other words, while the ciphertext bytes are initially offset by g at Δ_g, as in encryption, the first intermediate bytes are computed as t_{i,j} = Δ(g^{-1}(A'(g^{-1}(Δ'(Δ(g z_{i,j})))))), where z_{i,j} denotes the ith-row and jth-column byte of the ciphertext (i.e., the input data in decryption). Moreover, we remove the offset at Δ'_{g^{-1}} in the post-round datapath, as in the case of encryption. Whereas the offset value at the register is g^{-1}, the offset is correctly removed by the multiplication by g^{-1} at Δ'_{g^{-1}} because the offset value is inverted by the inversion, as g(t_{i,j})^{-1}. Since the offset of a round key is given by g in both encryption and decryption, we can still use the unified key scheduling part. Let l^(r)_{i,j} be the ith-row and jth-column byte of the rth round key with the offset over the tower field (i.e., Δ(g k^(r)_{i,j})), and let KeyEx_offset be the key expansion function with the corresponding offset correction. Thus, we can perform AddRoundKey (and AddInitialKey) with the offset by replacing k^(r)_{i,j} and KeyEx in Eqs. (14) and (15) with l^(r)_{i,j} and KeyEx_offset, respectively. The proposed encryption/decryption hardware with the multiplicative-offset can be realized by replacing the operations in Figs. 1, 3, and 4 with the corresponding ones for multiplicative-offset. Since we use common Δ_g and Δ'_{g^{-1}} for encryption and decryption, as shown in Fig. 5, thanks to the offset correction at each round, the multiplicative-offset can be applied to our architecture in Figs. 1, 3, and 4 without any overhead. Thus, we can increase the variety of conversion matrices by 255 times because g can take a value from 255 candidates. The optimal conversion matrices can be searched for in an exhaustive manner. Consequently, we found a set of conversion matrices with a total Hamming weight of 4,016, whereas we found no set of conversion matrices with a Hamming weight of less than 4,416 without the multiplicative-offset when using the state-of-the-art tower-field inversion in [8]. Roughly speaking, the multiplicative-offset method reduces the circuit area for the linear mappings by approximately 9 percent without any overhead. Note that a set of conversion matrices with a low Hamming weight is also useful for reducing the fan-out of each XOR gate, which is related to the latency and power consumption.
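The exhaustive search over g amounts to building, for each candidate, the GF(2) matrices of the affected linear mappings and tallying their Hamming weights. The toy Python sketch below is ours (the real search composes the isomorphisms, unified affines, and MixColumns constants rather than the bare multiply-by-g map); it shows the mechanics: derive the 8x8 GF(2) matrix of a linear map from its action on the basis vectors, then count its nonzero entries.

```python
def gmul(a, b):
    # Multiplication in the AES field GF(2^8) mod 0x11B.
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return r

def matrix_weight(f):
    # Hamming weight of the 8x8 GF(2) matrix of a linear map f:
    # column k of the matrix is f applied to the basis vector 2^k.
    return sum(bin(f(1 << k)).count("1") for k in range(8))

# Tally the weight of the multiply-by-g map for every candidate offset g.
weights = {g: matrix_weight(lambda x: gmul(g, x)) for g in range(1, 256)}
best_g = min(weights, key=weights.get)
```

In this toy setting the trivial offset g = 1 (the identity matrix, weight 8) wins; in the actual architecture the offset is composed with the other conversion matrices, which is why a nontrivial g can lower the total weight.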

Performance Evaluation
Tables 1 and 2 summarize the synthesis results for the proposed AES encryption/decryption architecture obtained with the Synopsys Design Compiler (Version D2010-3) and the NanGate 45-nm open-cell library [22] under the worst-case conditions. Area indicates a two-input NAND-equivalent gate size (i.e., gate equivalents (GEs)); Latency indicates the latency for one block encryption/decryption, which is estimated from the circuit path delay of the datapath under the worst-low condition; Max. freq. indicates the maximum operating frequency obtained from the critical path delay; Throughput indicates the throughput at the maximum operating frequency, where the 10-cycle and 11-cycle figures correspond to the cases where the plaintext and ciphertext blocks are and are not input and output simultaneously in one clock cycle, respectively (see the next paragraph); Power@100 MHz indicates the power consumption estimated by a Monte-Carlo gate-level timing simulation with back-annotation, where Enc. and Dec. indicate the values for encryption and decryption, respectively; Efficiency indicates the throughput per area; and PL product indicates the power-latency product. To conduct a practical performance comparison, an area optimization (which maximizes the effort of minimizing the number of gates without flattening the description) was applied in Table 1, and an area-speed optimization (where an asymptotical search with a set of timing constraints was performed after the area optimization) was applied in Table 2. Fig. 6 also shows the latency and area of the synthesized proposed AES encryption/decryption architecture.
In these tables and figures, the conventional representative datapaths [1], [2], [9], [21] and that in the previous version [7] were also synthesized under the same conditions. The source codes for these syntheses were written by the authors with reference to [1], [2], [9], [21], except for the source codes of Satoh's and Canright's S-boxes in [9], [12], which can be obtained from their websites [23], [24]. To fairly evaluate and compare the performance of the datapaths without the effect of block-wise pipelining, the architectures of [1], [2] were adjusted to round-based nonpipelined architectures corresponding to the proposed datapath. Latency was calculated by assuming that the datapath of [1] requires 10 clock cycles to perform each encryption or decryption and the others require 11 clock cycles. This is because the initial key addition and first-round computation are performed in one clock cycle in [1]. On the other hand, the Throughput of the architectures other than Lutz's [1] is calculated assuming that they require 10 or 11 cycles per block: these architectures can perform consecutive encryptions/decryptions in 10 clock cycles per block if plaintext and ciphertext are input and output simultaneously in one clock cycle; otherwise, they require 11 cycles. Note that, in Lutz's architecture [1], plaintext and ciphertext cannot be input and output simultaneously in one clock cycle. Area includes the initial key, round key, and data registers and the control logic in addition to the combinational circuits of the round datapaths. Note also that the key scheduling parts of [1] and [2] were implemented with the ones presented in this paper because there was no description of their key scheduling parts. (For [1], the isomorphic mapping was removed for application to the round function part.) The results in Tables 1 and 2 show that our datapath achieves the lowest latency and highest efficiency compared with the conventional ones with tower-field inversion circuits.
Although all operations are translated to the tower field in our architecture, the area and delay overheads of MixColumns and InvMixColumns are suppressed by the unification technique. More precisely, the critical path delay of the proposed datapath is given by only T_INV + T_DeMUX + 6T_XOR + T_4:1SEL, where T_INV, T_DeMUX, and T_4:1SEL denote the delays of the inversion circuit, demultiplexer, and 4-to-1 selector, respectively. The 6T_XOR delay comprises the unified linear operation (3T_XOR), adder array (2T_XOR), and AddRoundKey (T_XOR) in the Round datapath. Since the critical path of the conventional architectures includes at least three linear operations, an adder array, one AddRoundKey, one inversion, and three selectors, their critical path delay is larger than T_INV + 12T_XOR + 3T_2:1SEL, where T_2:1SEL denotes the delay of a 2-to-1 selector. Thus, the lower logic depth of the proposed datapath leads to a lower latency than the conventional ones, as indicated in Tables 1 and 2. In particular, even with a tower-field S-box, our architecture has an advantage in latency over Lutz's architecture with table-lookup-based inversion (i.e., a very small T_INV). In addition, we can confirm that the optimization of linear mappings based on the multiplicative-offset clearly improves the area-time efficiency. As a result, our architecture is 51-63 percent more efficient in terms of throughput per area than the conventional best. More precisely, with area optimization, the proposed datapath achieves approximately 63 percent higher efficiency than any conventional architecture and is 9 percent more efficient than the previous version. With area-speed optimization, the proposed datapath has 51 and 7 percent higher throughput/gate efficiency than the conventional best and the previous version, respectively.
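The gap between the two critical-path expressions can be made concrete with a back-of-the-envelope delay model. All gate delays below are hypothetical normalized values chosen only to illustrate the comparison; the paper's tables report the actual synthesized delays.

```python
# Sketch: comparing the two critical-path expressions under assumed
# normalized gate delays (all values hypothetical, arbitrary units).
T_XOR   = 1.0   # two-input XOR gate
T_INV   = 5.0   # tower-field inversion (dominates both paths)
T_DEMUX = 0.5   # demultiplexer
T_SEL21 = 0.5   # 2-to-1 selector
T_SEL41 = 1.0   # 4-to-1 selector

# Proposed: T_INV + T_DeMUX + 6 T_XOR + T_4:1SEL
proposed = T_INV + T_DEMUX + 6 * T_XOR + T_SEL41
# Conventional lower bound: T_INV + 12 T_XOR + 3 T_2:1SEL
conventional = T_INV + 12 * T_XOR + 3 * T_SEL21

print(proposed, conventional)  # 12.5 18.5
```

Under any plausible assignment of positive gate delays, the six fewer XOR levels and cheaper selection outweigh the extra demultiplexer, which is the point the paragraph above makes analytically.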
The results also suggest that the proposed architecture performs an AES encryption or decryption with 38-46 and 44-51 percent less energy than the conventional best, respectively. More precisely, the proposed architecture reduces the power by 35-38 and 38-43 percent compared with the conventional best for encryption and decryption, respectively, because of the compressed datapath and the cutoff of the inactive path by a demultiplexer. Thanks to the lower latency and lower power consumption, the proposed architecture improves the power-latency efficiency over the conventional ones by 46 and 51 percent for encryption and decryption with area optimization, respectively. In the case of area-speed optimization, the improvement in power-latency efficiency is 38 and 44 percent for encryption and decryption, respectively. These results indicate that the proposed architecture can perform AES encryption and decryption with the lowest energy consumption. In addition, thanks to the multiplicative-offset, the proposed architecture improves the PL product over the previous version by 23 percent (9 percent) and 20 percent (7 percent) for encryption (decryption) with area and area-speed optimization, respectively. We can thus also confirm the advantage of the proposed architecture over the previous version in terms of power/energy consumption as well as efficiency.
The performance of the architecture in [2] was relatively low under our experimental conditions because its critical path includes InvMixColumns for the round key owing to operation-reordering and therefore becomes longer than those of the other designs. In addition, InvMixColumns over a tower field is more area-consuming than that over the AES field. This suggests that the architecture in [2] is not suitable for an on-the-fly key scheduling implementation. The architectures in [9], [21] have smaller areas than the proposed architecture; however, our architecture has a higher throughput, and the increase in throughput is larger than the increase in circuit area because the architectures in [9] and [21] use InvMixColumns to compute InvMixColumns for the round key and require several additional selectors, respectively. Moreover, the proposed architecture has a higher efficiency than the previous version because the multiplicative-offset reduces the implementation cost of the linear mappings, as described in Section 3.3.

Round Function Part
While a unified AES encryption/decryption architecture is very important for many existing practical applications, AES hardware that supports only encryption is also in high demand owing to the wide spread of the counter (CTR) mode and of inverse-free authenticated encryption schemes based on AES, such as Galois/Counter Mode (GCM) [26] and Offset Two-Round (OTR) [27].
In this section, we propose an efficient AES encryption hardware architecture based on the same philosophy as in Section 3. Fig. 7 shows an overview of the proposed AES encryption hardware architecture, and Fig. 8 shows the proposed round function part for encryption only; both are based on Eq. (14) and are basically derived by omitting the Decryption path of the proposed unified architecture shown in Figs. 1, 3, and 4. Note that the multiplicative-offset is applied to the architecture in Fig. 7.
Here, we focus on the construction of "Linear operations," which performs the operations corresponding to Unified affine, Adder array, and AddRoundKey of the Encryption path in Fig. 3. Fig. 9 shows the block diagram of Linear operations for the jth column, where f_{i,j} denotes the jth byte of the input of Unified affine and F denotes the affine transformation with the multiplicative-offset over the tower field (i.e., F(f_{i,j}) = D(g(A(g(D'(f_{i,j}))))) + D(gc) = V_2 = V_3), which is given in the same manner as in the encryption/decryption architecture in Section 3. As shown in [20], according to u_0 = b and u_1 = b + 1 in the MixColumns function, we implement only either V_0 or V_1 (corresponding to Fig. 9a or Fig. 9b, respectively). Linear mappings over GF(2^8), including V_1 and F, usually require a delay of 3T_XOR, where T_XOR denotes the delay of a two-input XOR gate. This indicates that Linear operations in Fig. 9 usually requires a 6T_XOR delay because the delay of F is usually 3T_XOR. On the other hand, if we find an optimal conversion matrix for F with a 2T_XOR delay, we can reduce the delay of Linear operations and implement it with only a 5T_XOR delay. In fact, while we found no such matrix with a 2T_XOR delay without the multiplicative-offset, we successfully found a few optimal matrices in an exhaustive search over g. Thus, the proposed encryption architecture has a low latency, and we can again confirm the effectiveness of the multiplicative-offset.
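The role of the matrix search can be understood through the XOR-tree depth of a GF(2) matrix: an output bit that XORs w input bits needs a balanced tree of depth ceil(log2 w), so a mapping whose every row has weight at most 4 fits in a 2T_XOR depth, while a generic row of weight up to 8 needs 3T_XOR. A minimal sketch of this depth metric (the example matrices are hypothetical, not the matrices found in the search):

```python
import math

# Sketch: logic depth of a linear mapping over GF(2) when each output
# bit is computed by a balanced XOR tree. Each row of the 8x8 binary
# matrix is encoded as an 8-bit integer; row weight w costs ceil(log2 w).
def xor_depth(matrix_rows):
    """Return the XOR-tree depth (in T_XOR units) of the deepest output bit."""
    return max(math.ceil(math.log2(bin(r).count("1"))) for r in matrix_rows)

dense_row  = 0b11111110  # weight 7 -> needs a 3-level XOR tree
sparse_row = 0b00010110  # weight 3 -> needs only a 2-level XOR tree

print(xor_depth([dense_row] * 8))   # 3 (a generic mapping: 3 T_XOR)
print(xor_depth([sparse_row] * 8))  # 2 (all row weights <= 4: 2 T_XOR)
```

The exhaustive search over g described above can thus be viewed as looking for an offset that makes every row of the conversion matrix for F sparse enough to keep the depth at 2.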
Thanks to the register-retiming, operation-reordering, and multiplicative-offset, our encryption architecture has the lowest logic depth among the conventional architectures. Let T_AND be the delay of a two-input AND gate. The architecture presented by Nekado et al. [20] has a delay of 4T_AND + 13T_XOR (see footnote 4), which was, to the best of our knowledge, lower than that of any other conventional tower-field architecture until now. The proposed architecture achieves a delay of 3T_AND + 11T_XOR + T_2:1SEL, where T_2:1SEL denotes the delay of a 2-to-1 selector. Note that, as in the encryption/decryption hardware in Section 3, the critical path lies on the Round datapath part rather than the Post-round datapath in Fig. 7b, because the adder array and multiplexer in the Round datapath have a longer delay than the GF((2^4)^2)-to-GF(2^8) mapping in the Post-round datapath. Thus, we can confirm the higher efficiency of the proposed architecture, given that a 2-to-1 multiplexer has a delay similar to that of a two-input logic gate such as an AND or XOR gate [10].

Fig. 10 shows the proposed key scheduling datapath for the encryption hardware. The encryption key scheduling part is basically derived by removing the decryption datapath from the encryption/decryption key scheduling part shown in Fig. 4, as is the round function part, and it can be implemented according to Eq. (16). However, in contrast to the encryption/decryption key scheduling part, some XOR gates excluding the output of KeyEx offset should be factorized to achieve a smaller area, because the factorization can be performed without degrading the critical path delay or changing the functionality, as shown in Fig. 10.

Tables 3 and 4 show the performance of the proposed AES encryption hardware; these data were derived under the same conditions as those of Tables 1 and 2 in Section 3.4. For comparison, we also show performance evaluation results for the conventional hardware (see footnote 5).
As typical architectures, we evaluated two AES encryption hardware architectures denoted by SASEBO IPs [23], which are published as open-source intellectual property cores (IPs) for Side-channel Attack Standard Evaluation Boards (SASEBOs). In the columns of SASEBO IPs [23], "Table" and "Tower field" indicate that the hardware architectures are based on table-based and tower-field S-boxes, respectively. The source codes for these architectures were obtained from [23]. In addition, we also evaluated the state-of-the-art architectures presented by Gueron and Mathew in [28]. In the columns of Gueron and Mathew [28], "Native" and "Mapped" indicate that all round operations and only SubBytes, respectively, are performed over a tower field. Since we found no public source codes for the architectures of [28], the source codes for these syntheses were written by the authors with reference to [28], as in Section 3.4. The SASEBO IPs require 12 clock cycles for one block encryption, while the proposed architecture and the architecture by Gueron and Mathew can be implemented such that one block encryption is performed in 11 clock cycles, according to their description. In Tables 3 and 4, the rows denoted by "10 (11) cycles" and "11 (12) cycles" give the throughput or efficiency when the plaintext and ciphertext blocks are and are not, respectively, input and output simultaneously in one clock cycle. Fig. 11 also shows the latency and area obtained from the synthesis results.

Performance Evaluation
In Tables 3 and 4, while the SASEBO IP with the table-based S-box has a shorter latency and higher throughput than our architecture, the table-based S-box requires an approximately two times larger circuit area than ours. The SASEBO IP with the tower-field S-box has a smaller circuit area, while our architecture has a shorter latency, which results in the higher efficiency of our architecture. Interestingly, the SASEBO IP with the table-based S-box has a lower power consumption than that with the tower-field S-box in spite of its larger circuit area. The tower-field S-box is known to be a power-consuming circuit due to dynamic hazards [11], and Design Compiler would be better at synthesizing the table-based S-box with consideration of not only low latency but also low power consumption than at synthesizing tower-field S-boxes described by hand at the gate level. Nevertheless, our architectures have a lower power consumption and a lower (or at least comparable) power-latency product than the SASEBO IP with the table-based S-box, which clearly shows the usefulness of the proposed unification of linear mappings and the multiplicative-offset. In addition, we can also confirm that our architectures have higher efficiency and lower power/energy consumption than the state-of-the-art architectures by Gueron and Mathew, thanks to the unification technique and multiplicative-offset. Thus, the proposed architecture achieves 64 and 58 percent higher efficiency in terms of bps/GE than the conventional representative and state-of-the-art architectures with area and area-speed optimization, respectively.

4. The delay does not include that of multiplexers because no concrete implementation, datapath, or architecture was shown in [20]; therefore, quantitative comparisons and evaluations with [20] are difficult in this paper.

5. Since the previous version [7] did not present an encryption architecture, we do not compare the proposed encryption architecture with the previous version.
The critical path delay of the proposed datapath is given by T_INV + 5T_XOR + T_2:1SEL, where 5T_XOR corresponds to the delay of Linear operations. The conventional architectures, except for the SASEBO IP with the table-based S-box, have critical path delays greater than T_INV + 8T_XOR + T_2:1SEL or T_INV + 11T_XOR + T_2:1SEL because their critical paths include two or three linear operations, respectively. In contrast, the SASEBO IP with the table-based S-box has a delay of T_INV + 7T_XOR + T_2:1SEL with a very small T_INV thanks to the table-based S-box. As mentioned above, the SASEBO IP with the table-based S-box has a lower latency than the proposed architecture. However, the latency of the proposed architecture is still comparable to that of the SASEBO IP with the table-based S-box, as shown in Tables 3 and 4, whereas the proposed architecture has an approximately 47 percent smaller circuit area with both area and area-speed optimization. This also indicates the area-latency efficiency of the proposed architecture.

Comparison With Other Conventional Architectures
The above comparative evaluation considered the proposed datapath and some conventional but representative datapaths. There are other previous works focusing on area-time efficiency with round-based architectures; however, these works do not provide concrete implementations and/or do not exhibit better performance than the abovementioned conventional datapaths. For example, a hardware AES design with a short critical path was presented in [20], which employed a redundant representation and unification techniques to reduce the critical path delay; however, we could not evaluate its efficiency ourselves because of the lack of a detailed description of the implementation. The AES processor in [29] is also a typical high-throughput encryption architecture, but we cannot evaluate it in this paper because of the lack of descriptions of its building blocks. Another AES encryption/decryption architecture with a high throughput was presented in [21]; however, that architecture had a lower throughput/area efficiency than the architecture in [9] according to that paper. Moreover, the low-latency AES encryption and decryption architectures based on twisted-BDD and hardware T-box in [10] were not evaluated in this paper because this would require a sophisticated back-end design (i.e., place and route).

Possibility of Pipelining
The proposed design employs a round-based architecture without block-wise parallelism such as pipelining. However, block-wise parallelism is exploitable in parallelizable modes and authenticated encryption schemes (e.g., CTR mode, GCM [26], OTR [27], and Offset CodeBook (OCB) [30]) by trading area for throughput via pipelining [28], [31]. A simple way to obtain a pipelined version of the proposed architecture is to unroll the rounds and insert pipeline registers between them. The datapath can be further pipelined by inserting registers into the round datapath. The proposed datapath can be efficiently pipelined by placing the pipeline register at the output of the inversion, with a good delay balance between the inversion and the following circuit. For example, the synthesis results for the proposed datapath using the area-speed optimization with the NanGate 45-nm standard-cell library indicated that the inversion circuit had a delay of 0.60 ns and that the remainder had a delay of 0.66 ns. Accordingly, pipelining would achieve a throughput of 17.63 Gbps, which is nearly twice that without pipelining. Thus, the proposed datapath is also suitable for pipelined implementations.
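The 17.63 Gbps figure follows directly from the reported stage delays: with a pipeline register after the inversion, the slower stage (0.66 ns) sets the clock period, and one block completes every 11 cycles. A minimal sketch of that calculation, assuming the 11-cycle-per-block schedule stated above:

```python
# Sketch: throughput of the two-stage pipelined round datapath, from the
# stage delays reported by synthesis (0.60 ns inversion, 0.66 ns remainder).
BLOCK_BITS = 128
CYCLES_PER_BLOCK = 11           # assumed schedule, as in the text
stage_delays_ns = [0.60, 0.66]  # inversion stage, remaining logic

clock_period_ns = max(stage_delays_ns)  # slowest stage sets the clock
throughput_gbps = BLOCK_BITS / (CYCLES_PER_BLOCK * clock_period_ns)
print(round(throughput_gbps, 2))  # 17.63
```

The "nearly twice" speedup comes from the clock period dropping from roughly the sum of the stage delays (1.26 ns) to the maximum (0.66 ns), at the cost of one extra register stage per round.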

Choice of S-box Implementation
We employed the tower-field S-box presented by Ueno et al. [8], [15]. Ueno's S-box is in fact the most efficient S-box architecture applicable to our AES hardware architecture; it employs tower-field arithmetic, the decomposition of the (inv)S-box into an inversion and an (inv)affine transformation, and the merging of linear operations including the isomorphic mappings. There are several tower-field S-box implementations that have a smaller area, lower latency, and/or higher efficiency than the above one. For example, Boyar et al. presented a very small S-box description based on logic minimization [16]. In 2018, Reyhani-Masoleh et al. presented S-boxes smaller and more efficient than Boyar's and Ueno's S-boxes on the basis of architectural and gate-level optimizations. In 2019, Maximov and Ekdahl further improved the performance of the AES S-box [18] by means of new logic minimization techniques; they presented three S-box designs, which are, respectively, the fastest, smallest, and most efficient to date. However, these S-boxes cannot be applied to our architecture: they unify the isomorphic mapping, the affine transformation, and the linear operations in the inversion into a single large XOR matrix operation at the input and/or output of the S-box to achieve a smaller area and lower latency. Therefore, these (inv)S-boxes cannot be decomposed into an inversion and an (inv)affine transformation and thus cannot be applied to our AES hardware architecture, which relies on the S-box decomposition. An extension of our architecture for the efficient application of such S-boxes remains future work.
In contrast to tower-field S-boxes, direct-mapping-based (or table-based) S-box architectures, including BDD-based ones, can be implemented with very low latency and relatively small power consumption at the cost of circuit area and efficiency. However, we would have to implement the table-based inversion and the (inv)affine transformation separately in our architecture, whereas a table can directly implement the functionality of the whole S-box; that is, the decomposition of the S-box incurs latency and area overheads for a table-based implementation. In addition, our optimization techniques for unifying linear operations may not work well with a table-based implementation, because a table-based implementation does not require the isomorphic mappings that are efficiently unified in our architecture.
Thus, a tower-field S-box that can be efficiently decomposed into an inversion and an (inv)affine transformation, like Ueno's S-box [8], [15], is suitable for our architecture, which is intended to achieve a high throughput/gate efficiency (the S-boxes in, for example, [9], [12], [13] are possible candidates).

Application of Countermeasure Against Side-Channel Attacks
Another discussion point is how the proposed architecture can be made resistant to side-channel attacks, especially differential power analysis (DPA) [32]. A masking countermeasure would be based on a masked tower-field inversion circuit [33]. The major features of this countermeasure are to replace the inversion with a masked inversion and to duplicate the other, linear operations; such a countermeasure can also be applied to the proposed datapath. In addition, hiding countermeasures such as wave dynamic differential logic (WDDL) [34], which replaces the logic gates with a complementary logic style, would also be applicable, and the hardware efficiency would decrease proportionally with respect to the results in Tables 1 and 2.
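The reason masking merely duplicates the linear operations is that any GF(2)-linear map L commutes with the Boolean mask: L(x XOR m) XOR L(m) = L(x), so the masked value and the mask each pass through their own copy of L, and only the nonlinear inversion needs a dedicated masked circuit. A minimal sketch of this identity (linear_L below is a hypothetical linear map, not an AES component):

```python
import secrets

# Sketch: first-order Boolean masking of a linear operation. For any
# GF(2)-linear map L, L(x ^ m) ^ L(m) == L(x), so the masked datapath
# only needs two parallel instances of L, not a redesigned circuit.
def linear_L(x: int) -> int:
    """A hypothetical GF(2)-linear map on a byte (an XOR of shifted copies)."""
    return (x ^ (x << 1) ^ (x >> 3)) & 0xFF

x = 0x53                 # secret intermediate value
m = secrets.randbits(8)  # fresh random mask
masked = x ^ m           # the value actually processed by the datapath

# Two parallel instances of L, then recombination recovers L(x):
assert linear_L(masked) ^ linear_L(m) == linear_L(x)
```

This is why the area overhead of such a countermeasure is concentrated in the masked inversion, while AddRoundKey, MixColumns, and the (inv)affine transformations are simply instantiated twice.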
More sophisticated masking-based countermeasures, such as threshold implementation (TI) and the consolidated masking scheme (CMS) [35], [36], would also be applicable to the proposed datapath, in principle in the same manner as to other conventional ones. On the other hand, such countermeasures, especially those against higher-order DPAs, require a considerable area overhead and more random bits compared with the aforementioned countermeasures. When applying such countermeasures, the area overhead would be critical for some applications. In addition, TI- and CMS-based inversion circuits should be pipelined to reduce the resulting circuit area (i.e., the number of shares). To divide the circuit delay equally, it would be better to insert a pipeline register in the middle of the Encryption and Decryption paths in Fig. 3.

CONCLUSION
This paper presented a new efficient round-based AES architecture that supports encryption only as well as both encryption and decryption. The proposed datapath utilizes new operation-reordering and register-retiming techniques to unify critical components with fewer additional selectors. Consequently, our datapath has the lowest critical path delay compared with conventional ones with tower-field S-boxes. We also presented a new technique, named multiplicative-offset, for optimizing the matrices of the linear operations. The multiplicative-offset improves the efficiency of the AES hardware architecture by approximately 9 percent without any overhead. The proposed and conventional AES datapaths were implemented with compatible round-based architectures and evaluated by logic synthesis with the NanGate 45-nm open-cell library. The synthesis results suggested that the proposed architecture is approximately 51-64 percent more efficient than the best conventional architecture in terms of throughput per area.
In addition, as a result of gate-level timing simulations with back-annotation, we also confirmed that the proposed architecture can perform encryption/decryption with the lowest power/energy consumption.
Rei Ueno is an assistant professor at the Research Institute of Electrical Communication, Tohoku University, and is currently working with JST as a researcher on a PRESTO project. His research interests include arithmetic circuits, cryptographic implementations, formal verification, and hardware security. He received the Kenneth C. Smith Early Career Award in Microelectronics at ISMVL 2017. He is a member of the IEEE.

Sumio Morioka received the BE, ME, and PhD degrees in computer science from Osaka University, Japan, in 1992, 1994, and 1997, respectively. From 1997 to 2016, he was a senior researcher at the central research laboratories of NTT, IBM, Sony, and NEC, and a visiting researcher at Imperial College London. In 2016, he joined Interstellar Technologies Inc., Japan, as the chief designer of the avionics system for commercial space launch vehicles. His research interests include LSI architecture, EDA, formal methods, and security systems. He received the Sony MVP 2004 Award for the development of a hardware security processor for PlayStation Portable and PLAYSTATION 3. He is a member of the IEEE and IEICE and a senior member of the IPSJ.
Noriyuki Miura received the BS, MS, and PhD degrees in electrical engineering from the Keio University, Yokohama, Japan. He is currently an associate professor with Kobe University, Kobe, Japan, and concurrently a JST PRESTO researcher, working on hardware security and next-generation heterogeneous computing system. He is currently serving as a TPC Member for A-SSCC and Symposium on VLSI Circuits. He received the Top ISSCC Paper Contributors 2004-2013 and the IACR CHES Best Paper Award in 2014. He is a member of the IEEE.
Kohei Matsuda received the BS and MS degrees in computer science from Kobe University, Kobe, Japan, in 2015 and 2017, respectively, where he is currently working toward the PhD degree with the Graduate School of System Informatics. His current research interests include circuit-level countermeasure against physical attacks and design methodology for cryptographic processors. He is a member of the IEEE.
Makoto Nagata received the BS and MS degrees in physics from Gakushuin University, Tokyo, in 1991 and 1993, respectively, and the PhD degree in electronics engineering from Hiroshima University, Hiroshima, in 2001. He is a professor of the Graduate School of Science, Technology and Innovation, Kobe University, Kobe, Japan. He served as a technical program chair (2010-2011) and symposium chair (2012-2013) for Symposium on VLSI circuits. He is currently chairing Technology Directions subcommittee for International Solid-State Circuits Conference (ISSCC) and an associate editor for the IEEE Transactions on VLSI Systems. He is a senior member of IEEE and IEICE.
Shivam Bhasin received the bachelor's degree from UP Tech, India, in 2007, the master's degree from Mines Saint-Etienne, France, in 2008, and the PhD degree from Telecom Paristech in 2011. He has been a senior research scientist and principal investigator at the Physical Analysis and Cryptographic Engineering Laboratory, Temasek Labs, Nanyang Technological University, Singapore, since 2015. His research interests include embedded security, trusted computing, and secure designs. Before NTU, he held the position of Research Engineer at Institut Mines-Telecom, France. He was also a visiting researcher at UCL, Belgium (2011) and Kobe University, Japan (2013). He regularly publishes in top peer-reviewed journals and conferences, and some of his research now also forms a part of the ISO/IEC 17825 standard. He is a member of the IEEE.
Yves Mathieu is currently a full professor at Institut Mines-telecom/TELECOM ParisTech. He is the vice-chair for education of the Communication and Electronics Department. He undertakes research activities in the "Safe and Secure Hardware" team with a focus on ASIC design.

Jean-Luc Danger is currently a full professor at Institut Mines-telecom/TELECOM ParisTech. He is the head of the digital electronic system research team, whose main research topics are the security/safety of embedded systems and the implementation of complex algorithms under physical constraints. He has authored more than 200 scientific publications and 20 patents, and he cofounded the company Secure-IC in 2010.
Naofumi Homma received the BE degree in information engineering and the MS and PhD degrees in information sciences from Tohoku University, Sendai, Japan, in 1997, 1999, and 2001, respectively. He is currently a professor at the Research Institute of Electrical Communication, Tohoku University. From 2002 to 2006, he was also with the Japan Science and Technology Agency (JST) as a researcher on the PRESTO project. His research interests include computer arithmetic, EDA methodology, high-performance/secure VLSI computing, and hardware security. He is a senior member of the IEEE.