ELM : A Low-Latency and Scalable Memory Encryption Scheme

Abstract. Memory encryption with an authentication tree has received significant attention due to the increasing threat of active attacks and the widespread use of non-volatile memories. It has also gradually been deployed in real-world systems, as shown by SGX available in Intel processors. The topic of memory encryption has recently been studied extensively, most actively from the viewpoint of system architecture. In this paper, we study the topic from the viewpoint of provably secure symmetric-key design, with a primary focus on latency, an important criterion for memory. Progress in this direction can be observed in the memory encryption scheme inside SGX (the SGX integrity tree, or SIT). It uses dedicated, low-latency symmetric-key components, i.e., a message authentication code (MAC) and an authenticated encryption (AE) scheme based on AES-GCM. SIT has excellent latency; however, it has a scalability issue in its on-chip memory size. By carefully examining the required behavior of the MAC and AE schemes and their interactions in the tree operations, we develop a new memory encryption scheme called ELM. It consists of fully parallelizable, low-latency MAC and AE schemes and utilizes an incremental property of the MAC. Our AE scheme is similar to OCB; however, it improves on OCB in terms of decryption latency. To showcase its effectiveness, we consider instantiations of ELM using the same cryptographic cores as SIT, and show that ELM has significantly lower latency than SIT for large memories. We also conducted preliminary hardware implementations to show that the total implementation size is comparable to that of SIT.


Introduction
Cryptographic protection of memory, or more generally storage data, is widely deployed in modern systems. One typical method of protection is sector-wise encryption, such as XTS [Dwo10]. A sector-wise encryption scheme encrypts each memory sector in an independent and deterministic manner, keeping the secret key in a secure on-chip area. This prevents passive off-line attacks that try to extract the data from the storage devices, such as the Cold Boot Attack [HSH+08]. However, it does not offer sufficient protection against active on-line attacks, as there is no way to detect forgeries. Most notably, replay cannot be detected. If we independently encrypt each sector using a nonce-based authenticated encryption (AE) scheme and store all the nonces in the secure on-chip area, it would provide a strong security guarantee against such an adversary. However, this would also incur a linear increase in the on-chip area. This is usually impractical because the on-chip area is much more expensive than the main (off-chip) memory.
A well-known classical solution to this problem is to use an authentication tree, also known as a Merkle hash tree [Mer88]. By involving every unit of memory data in the tree computation and storing the root hash value in the on-chip area, authenticity against active attackers can be guaranteed. Instead of a cryptographic hash function, we can use a message authentication code (MAC) to build an authentication tree. The classical Merkle tree and its (possibly MAC-based) improvements, such as PAT [HJ06] and the Bonsai tree [RCPS07], provide authenticity of the whole memory with a constant on-chip memory overhead, at the cost of a logarithmic computation overhead for read and write operations. Confidentiality of the memory can be achieved by an additional symmetric-key encryption mechanism, as was done by TEC-tree [ECL+07]. When it is viewed as a mode of a tweakable block cipher (TBC) [LRW02], its latency is optimally small for both encryption and decryption. We call it Flat-OCB for its "flat" structure. As with the original OCB, we prove that our AE scheme is secure under the standard cryptographic assumption on AES, i.e., strong pseudorandomness. Each technique by itself is not ultimately novel. However, we show how to combine them in an optimal manner to reduce latency and computation (e.g., by shaving redundant computations in the update of incremental MACs), which is, to the best of our knowledge, the first time this has been done in the field of tree-based memory encryption.
Our proposal is generic in principle, and the core idea can be instantiated with any block cipher or TBC. To showcase the effectiveness of our proposal, we specify concrete schemes, named ELM1 and ELM2, using the same components as SIT, namely AES-128 and a full 64-bit field multiplier. We compare them with (a generalized variant of) SIT for various memory sizes and tree parameters under a certain practical implementation setting. Our results show that ELM1 and ELM2 have a smaller latency than SIT in most of the cases we examine. In particular, when the memory size gets larger, the difference becomes significant. We also conducted preliminary ASIC implementations, and show that the total implementation size is comparable to that of SIT. In addition, we discuss the optimization of hardware implementations of our proposal depending on the system constraints.

Notation
For a natural number n ∈ N, {0, 1}^n denotes the set of n-bit strings. For binary strings A and B, A‖B or AB denotes the concatenation of A and B. The bit length of A is denoted by |A|, and |A|_n := ⌈|A|/n⌉.

Dividing a string A into n-bit blocks is denoted by A[1], . . . , A[m] ←n A, where m = |A|_n, |A[i]| = n for 1 ≤ i ≤ m − 1, and |A[m]| ≤ n. For t ∈ N with t ≤ |A|, msb_t(A) (resp. lsb_t(A)) denotes the first (resp. last) t bits of A. A sequence of i zeros is written as 0^i. For sets E and E′, we write E ∪← E′ as shorthand for E ← E ∪ E′. When an element K is chosen uniformly at random from a set K, we write K ←$ K. For a function F : K × X → Y with key space K, F(K, ·) may be written as F_K(·).
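As a toy illustration (not part of the paper), the notation above can be sketched in Python, modeling strings as bytes and assuming the bit counts t and n are multiples of 8:

```python
import math

# Toy illustration of the notation: n-bit block division, |A|_n,
# msb_t, and lsb_t, with strings modeled as Python bytes.

def blocks(A: bytes, n: int):
    """Divide A into |A|_n blocks of n bits; the last block may be shorter."""
    nb = n // 8
    return [A[i:i + nb] for i in range(0, len(A), nb)]

def abs_n(A: bytes, n: int) -> int:
    """|A|_n = ceil(|A| / n), the number of n-bit blocks of A."""
    return math.ceil(8 * len(A) / n)

def msb(t: int, A: bytes) -> bytes:
    """The first t bits of A."""
    return A[: t // 8]

def lsb(t: int, A: bytes) -> bytes:
    """The last t bits of A."""
    return A[len(A) - t // 8:]
```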
Computation in a Galois Field. Let GF(2^n) be the finite field of size 2^n, with characteristic 2 and extension degree n ∈ N. We focus on the case n = 128. Following [Rog04, IK03], we use the lexicographically first polynomial to define the field; thus GF(2^128) := F_2[x]/(x^128 + x^7 + x^2 + x + 1), and the multiplicative group of GF(2^n) is generated by x. We regard an element of GF(2^n) as a polynomial in x. We also regard any a ∈ {0, 1}^n as the coefficient vector of an element of GF(2^n). Thus, the primitive root x is interpreted as 2 in decimal representation. For a ∈ GF(2^n), let 2a denote the multiplication of x and a, also called doubling [Rog04]. Similarly, let 3a denote 2a ⊕ a. In GF(2^128), 2a := (a ≪ 1) if msb_1(a) = 0 and 2a := (a ≪ 1) ⊕ (0^120 10^4 1^3) if msb_1(a) = 1, where (a ≪ 1) is a left shift by one bit. For c ∈ N, we can compute 2^c · a by doubling a c times.
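The doubling rule above is directly implementable. A minimal Python sketch (not from the paper), with elements represented as 128-bit integers whose bits are the polynomial coefficients, most significant bit first:

```python
# Doubling in GF(2^128) with the lexicographically-first polynomial
# x^128 + x^7 + x^2 + x + 1.  The low byte 0x87 encodes x^7+x^2+x+1.

R = 0x87
MASK = (1 << 128) - 1

def double(a: int) -> int:
    """Return 2a, i.e., the product of a and x in GF(2^128)."""
    if a >> 127:                       # msb_1(a) = 1: reduce after the shift
        return ((a << 1) & MASK) ^ R
    return (a << 1) & MASK             # msb_1(a) = 0: plain left shift

def triple(a: int) -> int:
    """Return 3a = 2a xor a."""
    return double(a) ^ a

def double_pow(a: int, c: int) -> int:
    """Return 2^c * a by doubling a c times."""
    for _ in range(c):
        a = double(a)
    return a
```

As a sanity check, doubling the element 1 a total of 128 times yields x^128 = x^7 + x^2 + x + 1, i.e., the constant 0x87.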

(Tweakable) Block Cipher
Let K and M be the sets of keys and messages, respectively. Let T be the set of tweaks, where a tweak is a public parameter. A tweakable block cipher (TBC) [LRW02] is a function Ẽ : K × T × M → M s.t. Ẽ(K, T, ·) is a permutation over M for all (K, T) ∈ K × T. It is also denoted by Ẽ_K^T, Ẽ^T, or Ẽ, where K ∈ K and T ∈ T. If T is a singleton (and we thus omit it from the notation), it is a plain block cipher. Namely, a block cipher E is defined as E : K × M → M s.t. E(K, ·) is a permutation over M for all K ∈ K, and is also denoted by E_K or E. A TBC can be built from a block cipher using various modes of operation [LRW02, Rog04].

Security Notion. Let Perm(n) denote the set of all permutations over {0, 1}^n. An n-bit tweakable permutation with t-bit tweak is a function π̃ : {0, 1}^t × {0, 1}^n → {0, 1}^n s.t. π̃(T, ·) ∈ Perm(n) for all T ∈ {0, 1}^t. The set of all n-bit tweakable permutations with t-bit tweak is denoted by TPerm(t, n). Let P with P ←$ Perm(n) be a uniform random permutation (URP), and let P̃ with P̃ ←$ TPerm(t, n) be a tweakable URP (TURP). A block cipher E or a TBC Ẽ is said to be secure if it is computationally hard to distinguish it from the corresponding ideal primitive given oracle access. More precisely, let A be an adversary who (possibly adaptively) queries an oracle O and finally outputs a bit. We write Pr[A^O → 1] to denote the probability that this bit is 1. We define the advantages of A against TBC Ẽ as

Adv^tprp_Ẽ(A) := |Pr[A^{Ẽ_K} → 1] − Pr[A^{P̃} → 1]|  and  Adv^tsprp_Ẽ(A) := |Pr[A^{Ẽ_K, Ẽ_K^{−1}} → 1] − Pr[A^{P̃, P̃^{−1}} → 1]|,

where the first notion is for an adversary with an encryption oracle (i.e., chosen-plaintext queries), and the second is for an adversary with encryption and decryption oracles (i.e., chosen-ciphertext queries). When the advantage is sufficiently small, Ẽ is said to be secure against the corresponding class of adversaries.
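To make the "TBC from a block cipher" idea concrete, here is a self-contained toy sketch in the XEX style of [Rog04]: a mask Δ derived from the tweak is XORed before and after the block cipher call. A hash-based 4-round Feistel network stands in for a real 128-bit block cipher, and the tweak is a pair (N, i) with Δ = 2^i · E_K(N); all names here are illustrative, not the paper's.

```python
import hashlib

MASK128 = (1 << 128) - 1

def _f(k, i, half):                     # Feistel round function (hash-based)
    return hashlib.sha256(k + bytes([i]) + half).digest()[:8]

def _xor(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def enc(k, x):                          # toy 4-round Feistel: a 128-bit permutation
    L, R = x[:8], x[8:]
    for i in range(4):
        L, R = R, _xor(L, _f(k, i, R))
    return L + R

def dec(k, y):                          # its inverse
    L, R = y[:8], y[8:]
    for i in reversed(range(4)):
        L, R = _xor(R, _f(k, i, L)), L
    return L + R

def _double(a):                         # doubling in GF(2^128)
    return ((a << 1) ^ (0x87 if a >> 127 else 0)) & MASK128

def _delta(k, n, i):                    # mask: Delta = 2^i * E_k(n)
    d = int.from_bytes(enc(k, n), "big")
    for _ in range(i):
        d = _double(d)
    return d.to_bytes(16, "big")

def xex_enc(k, n, i, m):
    """XEX-style TBC with tweak (n, i): E~(m) = E(m xor D) xor D."""
    d = _delta(k, n, i)
    return _xor(enc(k, _xor(m, d)), d)

def xex_dec(k, n, i, c):
    d = _delta(k, n, i)
    return _xor(dec(k, _xor(c, d)), d)
```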

Message Authentication Code
A message authentication code (MAC) is a symmetric-key cryptosystem to ensure the integrity of a message. Throughout the paper, we consider nonce-based MACs. A nonce-based MAC takes a nonce (a value that never repeats when used for tag generation) together with a message. For the key space K, the nonce space N, the message space M, and the tag space T, it consists of a tagging function MAC.T : K × N × M → T and a verification function MAC.V : K × N × M × T → {⊤, ⊥}, where ⊤ indicates that the tag is valid and ⊥ indicates rejection. The MAC advantage of an adversary A with access to the tagging and verification oracles is the probability that A makes a verification query (N, M, T) that returns ⊤ while T was not obtained by a tagging query on (N, M). Here, A is assumed to be nonce-respecting, that is, the nonces in the tagging queries are distinct. The nonces in the verification queries have no restriction, and A may repeat a nonce or reuse a nonce that was used in a tagging query.

Authenticated Encryption
Authenticated encryption (AE) [BN00] is used to ensure privacy and authenticity of input data simultaneously. As with MAC, we consider nonce-based AE in this paper. For the key space K, the nonce space N, the message and ciphertext space M, and the tag space T, a nonce-based AE scheme AE consists of two functions: the encryption function AE.E : K × N × M → M × T and the decryption function AE.D : K × N × M × T → M ∪ {⊥}. The ciphertext C ∈ M and tag T ∈ T for key K ∈ K, nonce N ∈ N, and message M ∈ M are derived as (C, T) = AE.E(K, N, M). The tuple (N, C, T) is considered authentic if AE.D(K, N, C, T) returns a message M ∈ M with M ≠ ⊥; otherwise it is rejected.
It is possible to extend AE so that it also accepts associated data [Rog02], information that is authenticated but not encrypted, though we do not need it in this paper.

Security Notion. The security of AE is evaluated by two criteria: the privacy and authenticity advantages. The privacy advantage is the probability that the adversary successfully distinguishes the encryption function of AE from the random oracle $(·, ·): for any query (N, M), if (C, T) ← AE.E(K, N, M), then $(N, M) returns random bits of length |C| + |T|. Thus, Adv^priv_AE(A) := |Pr[A^{AE.E_K} → 1] − Pr[A^{$} → 1]|. The authenticity advantage is the probability that the adversary creates a successful forgery given access to the encryption and decryption functions. It is defined as Adv^auth_AE(A) := Pr[A^{AE.E, AE.D} forges], which means the probability that A receives M′ ≠ ⊥ from AE.D by querying (N′, C′, T′) while (N′, M′) has never been queried to AE.E.
For both advantages, we assume the adversary is nonce-respecting in encryption queries. For authenticity, however, there is no restriction on nonces in decryption queries; that is, A may repeat a nonce or reuse one that was used in an encryption query.
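To make the AE.E/AE.D interface concrete, here is a self-contained toy sketch of a nonce-based AE using a naive encrypt-then-MAC composition over Python's hashlib and hmac modules. This illustrates only the syntax (returning None in place of ⊥ on rejection); it is not any scheme from the paper.

```python
import hashlib, hmac

def _ks(k, n, ln):                      # keystream for a naive stream cipher
    out, c = b"", 0
    while len(out) < ln:
        out += hashlib.sha256(k + n + c.to_bytes(4, "big")).digest()
        c += 1
    return out[:ln]

def ae_enc(k_enc, k_mac, n, m):
    """AE.E: returns (C, T) for nonce n and message m (encrypt-then-MAC)."""
    c = bytes(a ^ b for a, b in zip(m, _ks(k_enc, n, len(m))))
    t = hmac.new(k_mac, n + c, hashlib.sha256).digest()[:16]
    return c, t

def ae_dec(k_enc, k_mac, n, c, t):
    """AE.D: returns m if (n, c, t) is authentic, else None (standing in for the reject symbol)."""
    if not hmac.compare_digest(t, hmac.new(k_mac, n + c, hashlib.sha256).digest()[:16]):
        return None
    return bytes(a ^ b for a, b in zip(c, _ks(k_enc, n, len(c))))
```

Note that the nonce is authenticated together with the ciphertext, so replaying (C, T) under a different nonce is rejected.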

Authentication Tree for Memory Protection
We assume two regions of storage memory: the on-chip and off-chip areas. The on-chip area is assumed to be secure, in which the adversary can neither eavesdrop on nor tamper with the stored data. The off-chip area can be attacked by an adversary who may perform eavesdropping (obtaining information about the plaintext from the ciphertext), tampering (modifying the ciphertext without being detected), and replay (replacing the ciphertext with an old legitimate one). As mentioned in the introduction, tampering can be detected by simply applying a MAC to each data unit and storing the nonce and tag off-chip. If we use a nonce-based AE scheme instead, it also prevents eavesdropping. However, these means are not sufficient to protect against replay attacks, since the adversary can perform a replay of the (nonce, ciphertext, tag) tuple. Moreover, since the on-chip area is generally much more expensive than the off-chip area, it is desirable to thwart all of these attacks with as small an on-chip area as possible.
To address the problem, a number of memory protection tree schemes have been proposed [Mer88, RCPS07, HJ06, UWM19, TSB18, SNR+18]. The classical Merkle hash tree [Mer88] associates each memory data chunk stored off-chip with a leaf node of a tree. The hash values of all intermediate and leaf nodes are stored off-chip, and only that of the root node is stored on-chip. The integrity of a leaf node (data) can be verified by recursively computing the corresponding hash values from the leaf to the root.
A similar scheme can be built by using MACs instead of hash functions, storing the secret key on-chip. Among such schemes, we focus on PAT (Parallelizable Authentication Tree), proposed by Hall and Jutla [HJ06], for its parallelizability of both verify and update operations. It assigns a nonce to each node and stores the nonce associated with the root node in the on-chip area. Here, nonces need to be distinct from each other and have a one-time property to prevent replay attacks. To construct a parallel scheme, PAT employs a MAC that computes a tag over the nonce assigned to the node itself and the nonces of its children.
In this paper, we hereafter use the term authentication tree to refer to a memory protection scheme using a tree construction. Note that we suppose the authentication tree also encrypts the data associated with leaf nodes. We introduce a generic construction of an authentication tree, PAT2 (Fig. 1). It is mostly identical to PAT; however, it achieves confidentiality of memory by applying an AE scheme to the leaf nodes, and it splits each nonce of PAT associated with a node into two values, an address and a local counter. The former is the memory address of the node, and the latter is a counter exclusively assigned to the node.
Let us briefly describe how the PAT2 of Fig. 1 works. Each nonce N_i assigned to node i consists of the address addr_i and the local counter ctr_i, which is initialized to 0 for all nodes. The memory data is split into 4 units, M_3 to M_6. After initialization, the tree keeps (N_i, T_i) for i = 1, . . . , 6 and C_j for j = 3, . . . , 6 in the off-chip area, and N_0 in the on-chip area; only ctr_0 is stored on-chip. When verifying a data unit, say M_3, we check whether (1) AE.D_K(N_3, C_3, T_3) is authentic (i.e., does not return ⊥), (2) MAC.V_K(N_1, ctr_3‖ctr_4, T_1) = ⊤, and (3) MAC.V_K(N_0, ctr_1‖ctr_2, T_0) = ⊤. If all hold, M_3 is considered authentic, and the corresponding local counters (ctr_0, ctr_1, and ctr_3) are incremented. When updating M_3, we first perform the above verification procedure, update the counters, and
then renew (N_3, C_3, T_3), N_1, T_1, and T_0. Observe that the steps in the verification and update procedures are independent and thus parallelizable. This is a crucial advantage of PAT/PAT2 over the classical hash tree, which only allows parallel verification. The nonce format guarantees distinctness across different nodes, and allows reducing the MAC input and the off-chip overhead relative to the original PAT. Since an address is in any case given by the outer legitimate system, it does not need to be explicitly stored. To the best of our knowledge, this technique was first proposed in [RCPS07].
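The walk-through above can be sketched end to end. The following self-contained toy models the tree of Fig. 1 (root 0, inner nodes 1-2, leaves 3-6): HMAC-SHA256 stands in for the MAC, a naive stream-cipher-plus-HMAC composition stands in for the nonce-based AE, and counter updates happen only on writes. None of these primitives is the paper's actual instantiation; conceptually only ctr[0] lives on-chip, while the counters, tags, and ciphertexts are off-chip state.

```python
import hashlib, hmac

K_MAC, K_AE = b"mac-key", b"ae-key"

def mac_tag(nonce, msg):
    return hmac.new(K_MAC, nonce + msg, hashlib.sha256).digest()[:8]

def _ks(nonce, ln):                              # naive keystream
    out, c = b"", 0
    while len(out) < ln:
        out += hashlib.sha256(K_AE + nonce + c.to_bytes(4, "big")).digest()
        c += 1
    return out[:ln]

def ae_enc(nonce, m):
    c = bytes(a ^ b for a, b in zip(m, _ks(nonce, len(m))))
    return c, hmac.new(K_AE, nonce + c, hashlib.sha256).digest()[:8]

def ae_dec(nonce, c, t):
    if not hmac.compare_digest(t, hmac.new(K_AE, nonce + c, hashlib.sha256).digest()[:8]):
        return None                              # reject (the paper's reject symbol)
    return bytes(a ^ b for a, b in zip(c, _ks(nonce, len(c))))

PARENT = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}
CHILDREN = {0: (1, 2), 1: (3, 4), 2: (5, 6)}
ctrs = {i: 0 for i in range(7)}                  # local counters
tags, cts = {}, {}                               # off-chip tags / ciphertexts

def nonce(i):                                    # N_i = (addr_i, ctr_i)
    return i.to_bytes(4, "big") + ctrs[i].to_bytes(4, "big")

def child_ctrs(i):
    l, r = CHILDREN[i]
    return ctrs[l].to_bytes(4, "big") + ctrs[r].to_bytes(4, "big")

def write(leaf, m):
    """Update a leaf: bump counters on the path, re-encrypt, refresh tags."""
    for i in (leaf, PARENT[leaf], 0):
        ctrs[i] += 1                             # one-time nonces via counters
    cts[leaf], tags[leaf] = ae_enc(nonce(leaf), m)
    for i in (PARENT[leaf], 0):
        tags[i] = mac_tag(nonce(i), child_ctrs(i))

def verify_read(leaf):
    """The AE check and the MAC checks on the path are independent (parallelizable)."""
    m = ae_dec(nonce(leaf), cts[leaf], tags[leaf])
    ok = all(hmac.compare_digest(tags[i], mac_tag(nonce(i), child_ctrs(i)))
             for i in (PARENT[leaf], 0))
    return m if (m is not None and ok) else None
```

Replaying an old (ciphertext, tag) pair for a leaf fails because the leaf's local counter, and hence its nonce, has moved on since that pair was produced.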
In fact, by specifying the parameters (e.g., the depth and branch number of the tree and the format of the nonce) and the underlying MAC and AE schemes, the resulting scheme is mostly identical to SIT. Therefore, PAT2 can be seen as an abstraction of SIT. We use PAT2 as our baseline scheme for its simple structure and efficiency, and present our scheme based on it (Section 4).
To the best of our knowledge, the provable security of PAT2 has not been shown in the literature. As described above, many memory encryption schemes have been proposed, but few papers show the provable security of the proposed schemes. Whereas the PAT paper defines the security notion of an integrity tree (i.e., a tree-based memory protection scheme that does not encrypt memory) and proves the security of PAT, PAT has a slightly different tree construction from PAT2, and there is no treatment of the privacy of the plaintext associated with leaf nodes. In Section 5, we define security notions (privacy and unforgeability) for authentication trees and prove the security of PAT2 under each notion. The analysis is not surprising, but to our knowledge such a formal treatment (in particular of the combination of MAC and AE to guarantee privacy and unforgeability) cannot be found in the literature.
Other Schemes. In addition to the above schemes, a number of authentication tree schemes that better handle various criteria (except for latency) have been proposed. TEC-tree [ECL+07] provides confidentiality by encrypting the data stored in all nodes. MAES [UWM19] is an authentication tree providing security against differential power analysis attacks. VAULT [TSB18] and Morphable Counters [SNR+18] reduce the off-chip memory overhead and are suitable for protecting large memories (e.g., larger than the gigabyte order); however, there is a tradeoff with the average latency because their counters are compressed.

Components of ELM
To achieve low-latency operation, we design dedicated MAC and AE schemes. Our MAC scheme, which we call PXOR-MAC, is a simple combination of the nonce-based XOR-MAC [BGR95] and PHASH, the message hashing function of PMAC [BR02, Rog04]. For AE, we propose a new mode named Flat-OCB based on OCB. We give the details below.

Incremental MAC
Specification. Algs. 1 and 2 show the tagging and verification functions of PXOR-MAC. Here, the block cipher E has an n-bit block. The second key K′ is n bits and independent of K. The lengths of the nonce N and the tag T are n bits and τ bits, respectively. Note that we exclude the case of a partial block (i.e., |M| mod n = 0 always holds) for simplicity. As we assume n = 128, this is reasonable for the typical use case of authentication tree schemes. PXOR-MAC computes a tag as the sum of the encrypted plaintext blocks and the encrypted nonce, as depicted in Fig. 2. The input mask to E_K is derived from a multiplication of K′ and the block index over GF(2^n), and L = E_K(0^n).

Properties.
Since L can be computed in advance, the latency of tag computation is essentially the sum of the latencies of a 128-bit multiplication (K′ · i for block index i) and one call of E_K. The cost of the 128-bit multiplication can be large if i has large variation; however, m is not too large in practice, even when the total memory size is huge. Typically, m is upper-bounded by the number of branches, for example at most 2^7 according to [SNR+18]. Therefore, using a Gray code, the hardware implementation is much more efficient than implementing a full 128-bit multiplier (see Section 6). Consequently, the latency of the mask computation becomes negligible. In this setting, PXOR-MAC achieves the optimal latency of one block cipher call for the tagging and verification functions, thanks to the full parallelizability of the block cipher calls.
In addition, PXOR-MAC is an incremental MAC [BGR95], which allows an efficient tag computation when a message is changed in a small number of blocks. Concretely, since the tag is a sum of independently encrypted blocks, when a block of a message is changed, the new tag can be derived from the old tag by XORing out the contributions of the old block and the old encrypted nonce and XORing in those of the new block and the new encrypted nonce, without touching the unchanged blocks (Alg. 3).

Notes on the Incremental Property. Our PXOR-MAC corresponds to an incremental MAC for the replace operation with basic security [BGG94]. We emphasize that the arguments of the update function defined in [BGG94] differ from those of Alg. 3. In detail, the update function of [BGG94] takes the set of block indices to replace and the contents of the new and old blocks, in addition to the old tag T_old. For notational convenience we adopted the presentation of Alg. 3; however, we used the standard form of [BGG94] in our implementations for efficiency. In addition, basic security means that T_old = PXOR-MAC.T_{K,K′}(N_old, M_old) must hold for any (N_old, M_old, T_old) in an input to the update function to guarantee correctness. This implies that the update function cannot be queried by the adversary. Bellare et al. [BGG94] also defined a stronger, tamper-proof security, where the adversary can arbitrarily query the update oracle. This is a crucially different security notion, as the adversary may feed an unauthentic tuple (N_old, M_old, T_old) to the update oracle. Fortunately, an incremental MAC with basic security suffices for our purpose (see Section 4.3).
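The tag-as-a-sum structure and its incremental update can be sketched as follows. A toy 4-round Feistel network over 16-byte blocks stands in for AES, and the masking (block input M[i] ⊕ K′·i, nonce input N ⊕ L with L = E_K(0^n), untruncated tag) is a simplification of Algs. 1-3, which are not reproduced in this excerpt.

```python
import hashlib

def _f(k, i, half):                              # Feistel round function
    return hashlib.sha256(k + bytes([i]) + half).digest()[:8]

def _xor(x, y):
    return bytes(a ^ b for a, b in zip(x, y))

def E(k, x):                                     # toy 128-bit PRP (AES stand-in)
    L, R = x[:8], x[8:]
    for i in range(4):
        L, R = R, _xor(L, _f(k, i, R))
    return L + R

def _double(a):                                  # doubling in GF(2^128)
    return ((a << 1) ^ (0x87 if a >> 127 else 0)) & ((1 << 128) - 1)

def _mask(K2: bytes, i: int) -> bytes:
    """K2 * i in GF(2^128), via shift-and-add over the bits of the small index i."""
    a, r = int.from_bytes(K2, "big"), 0
    while i:
        if i & 1:
            r ^= a
        a = _double(a)
        i >>= 1
    return r.to_bytes(16, "big")

def pxor_tag(K, K2, N, blocks):
    """Tag = E_K(N xor L) xor XOR_i E_K(M[i] xor K2*i), with L = E_K(0^n)."""
    L = E(K, bytes(16))
    s = E(K, _xor(N, L))
    for i, m in enumerate(blocks, 1):
        s = _xor(s, E(K, _xor(m, _mask(K2, i))))
    return s

def pxor_update(K, K2, T_old, N_old, N_new, idx, m_old, m_new):
    """Incremental update: re-encrypt only the changed block and the nonce."""
    L = E(K, bytes(16))
    T = _xor(T_old, _xor(E(K, _xor(N_old, L)), E(K, _xor(N_new, L))))
    return _xor(T, _xor(E(K, _xor(m_old, _mask(K2, idx))),
                        E(K, _xor(m_new, _mask(K2, idx)))))
```

An incrementally updated tag coincides with a tag recomputed from scratch over the new nonce and message, while costing only three block cipher calls instead of m + 1.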
Security. The security bound of PXOR-MAC is shown below. We assume the underlying block cipher is an n-bit URP P; this is an information-theoretic idealization. The computational counterpart, where the underlying block cipher is instantiated by a practically secure block cipher such as AES, is derived from our bound in the standard manner [BDJR97], so we omit it here.

Theorem 1. The MAC advantage of PXOR-MAC is
where A is a nonce-respecting adversary against PXOR-MAC, σ_mac is the total number of accesses to P invoked by the queries, with σ_mac ≤ 2^{n−1}, and q_v is the number of queries to the verification oracle.
Proof. First, we observe that PXOR-MAC can be interpreted as a TBC-based MAC, PXOR-MAC-TBC, defined in Algs. 4 and 5, when the TBC used in PXOR-MAC-TBC is specified appropriately. Here, P is an n-bit URP, P̃ is an n-bit TURP having the same tweak space as Ẽ, and Ẽ is a TBC involving P and an independent key K′, defined as Ẽ^{0^n,i,j}(M) = P(M ⊕ K′ · i ⊕ j · P(0^n)). Also, B is an adversary against Ẽ querying the encryption oracle. In what follows, we evaluate each term of the right-hand side of (1) in turn. Recall that we assume |M| mod n = 0 for any plaintext M.
From the above cases, we obtain the advantage for the single-verification-query case. Finally, we apply the standard conversion from single to multiple verification queries [BDJR97] and obtain the claimed bound on Adv^mac.

Analysis of the Second Term. We evaluate Adv^tprp_Ẽ(B) in (1), following the framework proposed in [MM09]. We define the offset function F so that Ẽ^{(i,j)}(M) = P(M ⊕ F((i, j), P(0^n))) holds for any (i, j, M). Here, we introduce the following definition and lemma for offset functions.

Definition 1 (A Simplified Version of Definition 4.1 [MM09]).
Let V be a uniformly random value over {0, 1}^n. We say that an offset function F is (ε, γ, ρ)-uniform if F satisfies the following conditions.
where q is the number of encryption queries, with q ≤ 2^{n−1}.
We derive the security bound of Ẽ by evaluating (ε, γ, ρ) of Definition 1.

From the above discussions, we obtain the bound, where σ_mac is the number of accesses to P and σ_mac ≤ 2^{n−1}. Combining (1), (2), and (3), we conclude the proof.

Low-Latency Authenticated Encryption
An AE scheme can be built from a block cipher by a mode of operation. While it is possible to build an AE scheme by a generic composition of a MAC mode and an encryption mode (e.g., counter mode) [BN00, Kra01, NRS14], OCB is generally faster. It needs m plus a few block cipher calls to process an m-block input (while a generic composition needs at least 2m calls), and these m calls are parallelizable. Thanks to this property, OCB has quite a small latency. However, there is a gap between the encryption and decryption latencies of OCB. Specifically, the encryption of the plaintext checksum must be done after the main decryption routine. This results in one block cipher call that cannot be computed in parallel, and it adds a significant latency compared to encryption (we give details later). We present a solution to this problem. Because our proposal is essentially an improvement of a TBC-based interpretation of OCB (ΘCB [KR11]), we first describe our scheme at the TBC level, which we call Flat-ΘCB. Then we show two block cipher-based instantiations of Flat-ΘCB, denoted Flat-OCB-f and Flat-OCB-m.
As related work, Qameleon [ABB+19] is an AE scheme submitted to the ongoing NIST standardization project for lightweight cryptography [NIS19]. It is based on ΘCB using the low-latency TBC QARMA [Ava17] and has the same decryption-latency issue as ΘCB.

Specification.
We show Flat-ΘCB in Algs. 6 and 7 and Fig. 3. It is an AE mode based on an n-bit TBC Ẽ. The nonce N is also assumed to be n bits. As in the case of the MAC, we exclude the case of a partial block for simplicity. The structure of Flat-ΘCB is almost the same as that of ΘCB. The crucial difference is the generation of the tag T: while ΘCB encrypts the checksum using Ẽ to produce T, ours first encrypts N and takes the sum with the checksum.
To build a block cipher-based AE, we instantiate E with an n-bit block cipher E K as follows.
where i ∈ {0, 1, 2, . . .}, j ∈ {0, 1}, and the mask ∆ is an n-bit value derived from N. We show two derivations of ∆, MASK1 and MASK2, in Alg. 8 and Alg. 9, respectively. MASK1 explicitly requires n = 128 (or, more specifically, that doubling and tripling yield a safe instantiation of XEX [Rog04]); MASK2 can use any even n. MASK1 computes ∆ by using 4-round AES, denoted AES4, with four independent 128-bit secret keys, as used by existing MAC and TBC constructions based on AES [MT06, Min07] (this is to utilize AES4's differential property without harming the provable security reduction to the entire AES: see below). Thus, it is natural to assume that E in (4) is also AES when MASK1 is used. Here, we assume that 1-round AES is the sequence of operations (AddRoundKey, SubBytes, ShiftRows, MixColumns), and the four independent 128-bit secret keys are XORed in each AddRoundKey individually. MASK2 computes ∆ by splitting the nonce into two n/2-bit words and multiplying them (over GF(2^{n/2})) with four independent n/2-bit keys. Let TBC-f and TBC-m denote the block cipher-based TBCs defined by (4) with MASK1 (using four-round AES) and MASK2 (using multiplication), respectively.

Table 1: Comparison of AE modes. SIT-AE is a GCM-based AE defined by SIT. Enc latency (resp. Dec latency) denotes the encryption (resp. decryption) latency in terms of the number of primitive calls. Here, 1 BC (TBC) denotes a call of a block cipher (TBC), and 1 mult. denotes a multiplication over GF(2^{n/2}). The fourth column denotes the components that need to be implemented in parallel to achieve the latency figures for an m-block input. The last column denotes the total size of the secret key and the preprocessed values needed to achieve the corresponding latency. For simplicity, the encryption and decryption latencies of a (T)BC are assumed to be identical, and the (T)BC has an n-bit block and an n-bit key. For Flat-OCB-f, we assume the BC is AES.
We also write Flat-ΘCB instantiated with TBC-f and TBC-m as Flat-OCB-f and Flat-OCB-m, respectively. By writing Flat-OCB_{K,K′} or Flat-OCB, we mean both Flat-OCB-f and Flat-OCB-m, where K′ = (K_1, K_2, K_3, K_4).

Properties. As shown in Table 1, the latency of Flat-ΘCB to encrypt an m-block input is just one TBC invocation if m + 1 TBC circuits are implemented in parallel. As a mode of a TBC, this latency is essentially the lowest achievable, hence optimal. Moreover, this holds for both encryption and decryption. In the case of ΘCB, the decryption latency costs two TBC calls because it generates the tag by encrypting the checksum of the plaintext blocks. This can be mitigated by changing the decryption procedure to check the match of checksum values instead of tags (by decrypting the tag); however, this is possible only for n-bit tags, which limits usability. Compared with ΘCB in other criteria, Flat-ΘCB has the same key size, the same number of TBC calls for encryption and decryption, and the same security bound up to constants (see the next paragraph for the security). To get a rough idea of latency values, let us assume that an AES4 call and a multiplication over GF(2^{n/2}) have the same latency as one block cipher call. Then Flat-OCB has the same encryption latency as OCB and achieves a lower decryption latency than OCB, as shown in Table 1. Note that E_K(0^n), used in TBC-f and TBC-m, is pre-processed; this increases the memory by n bits (the last column of Table 1), which will be stored in the on-chip area when used in our memory encryption scheme. Although the key size of Flat-OCB is larger than that of OCB, it has the same number of block cipher calls for encryption and decryption. Regarding security, the security bound of Flat-OCB-f decreases to O(2^56) while that of OCB is O(2^64) when n = 128. On the other hand, Flat-OCB-m has the same security bound as OCB up to constants, as in the case of Flat-ΘCB and ΘCB.

In comparison with SIT-AE, Flat-OCB has the same encryption latency and a lower decryption latency; however, SIT-AE needs a circuit of 2m multipliers in addition to m + 1 block cipher cores, while Flat-ΘCB requires only 4 multiplication circuits. Another disadvantage of SIT-AE is its key size: it is linear in m (which has a non-negligible impact on the overhead of the on-chip area), while that of Flat-OCB is constant.

One limitation of Flat-ΘCB and Flat-OCB is that they explicitly need integral input blocks, i.e., the last block must be of n bits, whereas ΘCB and OCB can process inputs of arbitrary length. By introducing a padding with a minor modification of the tweak values, Flat-ΘCB and Flat-OCB can also process inputs of arbitrary length, at the cost of ciphertext expansion. This limitation is not critical for our application, where the input to the AE is typically full blocks and the length is fixed.

Security.
We show the security bounds of Flat-ΘCB, Flat-OCB-f, and Flat-OCB-m in Theorem 2 below. As in Section 3.1, we assume that the underlying block cipher is an n-bit URP P, and we present only an information-theoretic bound based on P.
In a nutshell, Flat-ΘCB has the same advantages as ΘCB (zero privacy advantage and an authenticity advantage of roughly (2^{n−τ} q_d)/(2^n − 1)), hence there is no security penalty up to constants. The same applies to the advantages of Flat-OCB-m compared with those of OCB. When n = 128, Flat-OCB-f has roughly 56-bit security while OCB has 64-bit security. This degradation comes from the use of the differential property of AES4 (see the proof below for details). We stress that the provable security of Flat-OCB-f relies solely on the pseudorandomness of AES, and AES4 does not introduce any additional computational assumption. This is because we use a proved bound on AES4's (expected) differential probability [KS07]; that is, it works as one large S-box with differential probability at most 1/2^113. The technique was introduced by Minematsu and Tsunoo [MT06] for MAC modes and by Minematsu [Min07] for building an AES-based TBC.

Theorem 2. The advantages of Flat-ΘCB and Flat-OCBs are
Adv priv
where A (resp. A±) is the adversary in the privacy (resp. authenticity) game, and σ_priv, σ_auth, and q_d are the parameters of A and A±: σ_priv (resp. σ_auth) is the number of accesses to P in the privacy (resp. authenticity) game, with σ_priv, σ_auth ≤ 2^{n−1}, and q_d is the number of queries to the decryption oracle in the authenticity game. (The number of multipliers in a hardware implementation is determined by the system constraints and architecture in practice; we discuss the details in Section 6.)
Proof. First, we evaluate the security bounds of Flat-ΘCB, then we derive the security bounds of (two versions of) Flat-OCB by evaluating the security bounds of TBC-f and TBC-m. Suppose that all plaintexts M and ciphertexts C in the following proof satisfies |M | (mod n) = 0 and |C| (mod n) = 0.
Proof of Flat-ΘCB. We first evaluate the privacy bound. Since the adversary is nonce-respecting, every TURP call in the privacy game takes a different tweak, and its output is uniformly random; thus, we obtain Adv^priv_{Flat-ΘCB}(A) = 0. We then evaluate the authenticity bound, starting with the case q_d = 1. Suppose, without loss of generality, that the adversary performs its decryption query after all encryption queries. Let Z = {(N_1, M_1, C_1, T_1), . . . , (N_{q_e}, M_{q_e}, C_{q_e}, T_{q_e})} be the transcript obtained by the encryption queries, and let (N′, C′, T′) be the decryption query. Let T* and M* be the valid tag and plaintext corresponding to (N′, C′), respectively. Seeing Z as a random variable, we bound Adv^auth. In what follows, we evaluate Pr[T′ = T* | Z = z] for the following cases.
The TURP that encrypts the nonce takes a different tweak from all the tweaks invoked in the encryption queries; thus, we obtain a bound on Pr[T′ = T* | Z = z]. Next, let m′ = |C′|_n. Since the inverse of the TURP that decrypts C′[m′] takes a different tweak from all the tweaks invoked in the encryption queries, we obtain a corresponding bound. From the above cases, the advantage for the case q_d = 1 is at most 2/2^τ. Finally, we apply the standard conversion from single to multiple decryption queries [BDJR97] and obtain the bound q_d(2/2^τ) for q_d ≥ 1. This concludes the proof for Flat-ΘCB.
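The single-to-multiple conversion in the last step can be written as follows (our reconstruction; it restates the bound q_d(2/2^τ) given above, writing Adv^{auth,1} for the advantage against a single decryption query):

```latex
\mathbf{Adv}^{\mathrm{auth}}_{\mathrm{Flat}\text{-}\Theta\mathrm{CB}}(\mathcal{A}^{\pm})
  \;\le\; q_d \cdot \mathbf{Adv}^{\mathrm{auth},1}_{\mathrm{Flat}\text{-}\Theta\mathrm{CB}}(\mathcal{A}^{\pm}_1)
  \;\le\; q_d \cdot \frac{2}{2^{\tau}} .
```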
Proof of Flat-OCB. By the definition of Flat-OCB, we obtain the following inequalities,
where TBC is TBC-f when Flat-OCB denotes Flat-OCB-f, and TBC is TBC-m when Flat-OCB denotes Flat-OCB-m. Also, B (resp. B^±) is the adversary against TBC querying the encryption oracle (resp. the encryption and decryption oracles). Since we evaluated the privacy and authenticity advantages of Flat-ΘCB in the previous paragraph, all that remains is to evaluate the security bounds of TBC. As in the case of the MAC, we use the methodology proposed in [MM09]. We define the offset function F of TBC as follows,
where q is the number of encryption/decryption queries, with q ≤ 2^{n−1}.
Bound of γ. Since P(0^n) is uniformly random and independent of MASK_K, we can bound the probability that F((N, i, j), ·) takes any fixed value. From the above discussion, we obtain the claimed bounds, where σ_ae is the number of accesses to P and σ_ae ≤ 2^{n−1}. Combining these bounds of TBC_P with the bounds of Flat-ΘCB proved in the previous paragraph, we obtain the security bounds of Theorem 2.

ELM
In this section, we detail our authentication tree scheme, ELM. As described before, we employ the tree construction PAT2. The inner MAC and AE schemes are instantiated by PXOR-MAC and Flat-OCB. Let ELM1 and ELM2 be the instances of ELM employing Flat-OCB-f and Flat-OCB-m as the inner AE schemes, respectively. We show how to combine PAT2, PXOR-MAC, and Flat-OCB in an optimal manner to reduce the latency and computation for updating the tree.

Notations for Tree
We describe the tree structure for ELM in Fig. 4. The number of branches is denoted by b ≥ 2, and d denotes the depth, where the depths of the root and a leaf node are defined as 0 and d, respectively. We assume a balanced tree, hence the number of leaf nodes is b^d. The entire memory (plaintext) to protect is divided into chunks, where each chunk has ℓ bits. We associate the i-th chunk with the leaf node leaf(i), and its ciphertext chunk is stored in leaf(i). For a node u, the memory address, the counter, and the tag are denoted by ADD(u), CTR(u), and Tag(u), respectively. The lengths of the memory address, the counter, and the tag are α, β, and τ, respectively. All the data stored in the on-chip and off-chip areas is denoted by σ, which includes C[i] for 1 ≤ i ≤ b^d, and CTR(u) and Tag(u) for all nodes u. As we adopt PAT2, we exclude the node addresses from σ and assume that they are given by the system when needed. Letting u_root denote the root node, we store CTR(u_root) in the on-chip area; the remaining data of σ is stored off-chip. We may also use σ to mean the tree construction itself. We also write a node u, a leaf node, a plaintext chunk, and a ciphertext chunk of σ as u^σ, leaf(i)^σ, M^σ[i], and C^σ[i], respectively. If no confusion is possible, we omit the superscript σ. For any non-leaf node u^σ and i ∈ {1, . . . , b}, ch_i(u^σ) denotes its i-th child node.
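As a quick sanity check on the notation above, the following sketch (ours, not part of ELM's specification; the helper name `tree_stats` and the exact accounting of which metadata sits off-chip are our assumptions) counts the nodes and the off-chip data of a balanced b-ary tree:

```python
# Illustrative sketch: sizes of a balanced b-ary authentication tree of depth d.
# b: branches per node, d: depth (root = 0, leaves = d), beta: counter length,
# tau: tag length, chunk_bits: plaintext chunk size (the paper's ℓ).

def tree_stats(b: int, d: int, beta: int, tau: int, chunk_bits: int):
    leaves = b ** d
    inner = (b ** d - 1) // (b - 1)      # nodes of depth 0..d-1, incl. the root
    # Every node stores a counter and a tag; only the root counter is on-chip.
    offchip_meta = (inner + leaves) * (beta + tau) - beta
    offchip_data = leaves * chunk_bits   # ciphertext chunks C[1..b^d]
    return leaves, inner, offchip_meta + offchip_data

leaves, inner, offchip = tree_stats(b=8, d=3, beta=64, tau=64, chunk_bits=512)
```

For b = 8 and d = 3 this gives b^d = 512 leaves and (b^d − 1)/(b − 1) = 73 non-leaf nodes, matching the query counts used later in the proofs.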

Specification of ELM
ELM consists of three algorithms, InitTree, Verify, and Update, defined in Algs. 10, 11, and 12, and we write ELM = (InitTree, Verify, Update). Each of them takes the tuple of AE and MAC keys K_T = ((K_1, K′_1), (K_2, K′_2)) as input. The algorithm InitTree initializes the tree: it takes a plaintext M and the tuple of keys as input, and outputs a tree σ. Here, σ consists of the local counters initialized to zero, the tags for the intermediate nodes, and the (ciphertext, tag) pairs for the leaf nodes. We use the incremental property of the MAC (line 11) to efficiently compute the tags for the intermediate nodes, since all message inputs to PXOR-MAC are identical (all zero). The algorithm Verify checks the validity of a specified leaf node and is associated with a memory read operation. Verify takes the index of a leaf node and σ as input, and outputs ⊤ or ⊥.

Features
ELM is designed to achieve low latency by utilizing the incremental property of the MAC and the full parallelizability of the cryptographic components and the tree structure. In particular, the incremental property greatly contributes to the reduced latency of Update. Since Update includes the operation of Verify and a plain update of nodes, we can use an incremental MAC of basic security as described in Section 3.1. Moreover, rather than naively applying an incremental MAC, we optimize Update by defining PXOR-MAC.VU so as to save some redundant computations incurred by invoking both verification and update, which further reduces latency. Suppose α = β = n/2 and some even b. One invocation of Verify needs (1 + 2/b)d block cipher calls for the intermediate and root nodes. One invocation of Update needs (3 + 2/b)d block cipher calls for the intermediate and root nodes, whereas Update with a non-incremental MAC needs at least twice as many block cipher calls as Verify does. In addition, ELM is also scalable in terms of on-chip size, because the sizes of the key and the preprocessed data are constant. When the block cipher key is n bits, ELM1 and ELM2 need 5n + 512 bits and 7n bits for the key and preprocessed data, respectively. Thus, the required on-chip memory is small for any parameter of the tree. In contrast, SIT (here we mean a generalized version, i.e., PAT2 with the MAC and AE schemes used by SIT) needs an on-chip area of size linear in b and β.
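The counts above can be transcribed directly (the helper names are ours; the per-level fractions assume α = β = n/2 as stated, so that two child counters pack into one block-cipher input):

```python
# Direct transcription of the block-cipher-call and on-chip-size claims above.

def verify_bc_calls(b: int, d: int) -> float:
    """Block cipher calls of one Verify, intermediate and root nodes only."""
    return (1 + 2 / b) * d

def update_bc_calls(b: int, d: int) -> float:
    """Block cipher calls of one incremental Update, same nodes."""
    return (3 + 2 / b) * d

def onchip_bits(n: int, scheme: str) -> int:
    """Constant on-chip key + preprocessed data, as claimed in the text."""
    return 5 * n + 512 if scheme == "ELM1" else 7 * n  # scheme == "ELM2"
```

For b = 8 and d = 3, Verify costs 3.75 and Update 9.75 block cipher calls on the path, well below the 2 × 3.75 × 2 a naive non-incremental update would approach.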
Up to this point, we have ignored the off-chip memory overhead caused by storing counters and tags. However, it may be non-negligible as the target memory grows. In such a case, we can combine ELM with a well-known technique to reduce the memory needed for counters, called the split counter [YEP + 06]. The technique incurs an increased average latency and has been adopted by state-of-the-art schemes [TSB18, SNR + 18]. Fortunately, the incremental property of PXOR-MAC remains quite effective even if we adopt the split counter. See Section 7.2 for more details.
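A minimal sketch of the split-counter idea as we understand it from [YEP+06] (the class name and interface are ours): a shared major counter per group of blocks plus small per-block minor counters, where a minor-counter overflow bumps the major counter and forces re-encryption of the whole group, which is the source of the increased average latency mentioned above.

```python
# Hedged sketch of a split counter, NOT the exact scheme of [YEP+06].

class SplitCounter:
    def __init__(self, group_size: int, minor_bits: int):
        self.major = 0
        self.minor = [0] * group_size
        self.limit = (1 << minor_bits) - 1

    def bump(self, i: int) -> bool:
        """Increment block i's counter; return True if the whole group
        must be re-encrypted (minor-counter overflow)."""
        if self.minor[i] == self.limit:
            self.major += 1
            self.minor = [0] * len(self.minor)
            return True
        self.minor[i] += 1
        return False

    def value(self, i: int) -> tuple:
        # Effective per-block counter fed to the nonce.
        return (self.major, self.minor[i])
```

Storing, say, a 64-bit major counter per 64 blocks plus a few bits per block is what shrinks the counter storage relative to one full counter per block.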

Security of PAT2
In this section, we show that the security of PAT2 reduces to the security of the underlying MAC and AE schemes, which immediately implies the provable security of ELM. First, we define the security notions of an authentication tree in Section 5.1. The privacy notion is defined analogously to that of nonce-based AE (Section 2), and the unforgeability notion is mostly identical to that defined in [HJ06]. Then, we evaluate the security of PAT2 in Section 5.2.

Security Notion of Authentication Tree
Suppose that Tree is an authentication tree scheme defined as a tuple of three functions, the initialization function InitTree, the verification function Verify, and the update function Update, denoted by Tree = (InitTree, Verify, Update). Recall that InitTree(M) = σ, Verify(idx, σ) = ⊤ or ⊥, and Update(idx, B, σ) = σ′ or ⊥ (see Section 4 for details). Also recall that σ includes the data stored in the on-chip memory (i.e., the tamper-free area), which we denote by Sec(σ). Security notions. We define two security notions of an authentication tree: privacy and unforgeability. For the privacy of Tree, we define InitTree-$ and Update-$. They return the ciphertexts and tags to be stored in the leaf nodes as random strings whose lengths are the same as those output by InitTree and Update, respectively. For the other variables, e.g., the data associated with the intermediate nodes, they return the same outputs as InitTree and Update. The privacy advantage of Tree is defined as the probability that an adversary A successfully distinguishes (InitTree, Update) from (InitTree-$, Update-$). It is written as follows, where A plays the following game.
For the unforgeability notion for Tree, our definition follows [HJ06]. It is defined as the advantage of an adversary A, querying InitTree and Update, in distinguishing Verify from ⊥_Tree(·, ·), which always returns ⊥ for any input. The unforgeability advantage of A is defined via the following game. Let {(idx_1, B_1, σ_1, σ′_1), . . . , (idx_q, B_q, σ_q, σ′_q)} be the transcript obtained by the update queries. As in the privacy game, we assume that σ_1 = σ_init and σ_i = σ′_{i−1} for 2 ≤ i ≤ q, so that A always obtains an updated tree, not ⊥. 3. A queries (idx′, σ′) to the verification oracle (Verify or ⊥_Tree) and obtains ⊤ or ⊥. Let (u_0, . . . , u_d) be the path of nodes from the root node to leaf(idx′). To exclude a trivial win, we assume that there exists i ∈ {0, . . . , d} such that u_i^{σ′} stores different data from that stored in u_i^{σ′_q}. Moreover, Sec(σ′) = Sec(σ′_q) must also hold, since the data in the on-chip area cannot be tampered with.

Algorithm 15 Verify
Input: K_T = (K_A, K_M), idx, σ. Output: ⊤ or ⊥.
1: (u_0^σ, · · · , u_d^σ) ← path of nodes from the root to the specified leaf (i.e., u_0^σ is the root node and u_d^σ is equal to leaf(idx)).
We also write σ_init as σ′_0. We stress that A can perform a verification query such that u_i^{σ′} stores the same data as that stored in u_i^{σ′_j} for 0 ≤ i ≤ d and 0 ≤ j ≤ q − 1, unless the data stored in u_i^{σ′_j} is the same as that stored in u_i^{σ′_q}, as described in the third operation of the above game. This condition is essential for the unforgeability notion to capture an adversary who performs a replay attack, i.e., an attack that replaces data with old legitimate data.
Rationale of security notions. The privacy notion is defined similarly to that of nonce-based AE. Namely, we evaluate per-node indistinguishability between the ciphertexts and tags associated with leaf nodes and random strings, against an adversary performing a chosen-plaintext attack (IND-CPA) [BN00].
For the unforgeability notion, we follow the one defined in [HJ06], extending it from authentication trees without encryption of leaf nodes to those with encryption. The notion captures an adversary who performs a CPA (via the initialization and update queries) and tampers with the data stored in the off-chip area once (via the verification query). This means that the unforgeability game can simulate, say, an adversary who overwrites the data stored in a certain node with new values, swaps the data associated with two nodes, or performs a replay attack (as described in the definition of the unforgeability game), in addition to the adversary captured by the privacy notion. It is especially important to capture the adversary performing a replay attack, since the security notion of a general MAC does not capture her. By proving that the unforgeability advantage is negligible, we prove that the authentication tree scheme detects tampering (including replay) by such an adversary with sufficiently high probability.

Security Bounds of PAT2
Let MAC_{K_M} and AE_{K_A} be a MAC scheme and an AE scheme, where K_M and K_A are uniformly random and independent. We describe the three functions of PAT2, (InitTree, Verify, Update), using MAC_{K_M} and AE_{K_A} in Algs. 14, 15, and 16, respectively. We note that, when (MAC, AE) is instantiated as (PXOR-MAC, Flat-OCB), each function of PAT2 returns the same computation result as the corresponding function of ELM.

Privacy Bound.
Theorem 3. The privacy advantage of PAT2 is bounded as follows.
where A_ae is a privacy adversary against AE making b^d + q encryption queries.
Proof. We assume that A is given the MAC key K_M, and denote this adversary by A(K_M). Since A(K_M) can compute the data associated with the root and intermediate nodes by itself, we can assume that A(K_M) obtains only the data associated with the leaf nodes from the tree initialization oracle and the update oracle. Let A_ae be the privacy adversary against AE. The adversary A_ae can properly simulate the privacy game of A(K_M). In what follows, we describe how A_ae simulates the two oracles that A(K_M) queries. If A(K_M) queries InitTree (resp. InitTree-$), A_ae can simulate it by querying AE.E (resp. $ defined in Section 2) in the same manner as Alg. 14. Note that the tree initialization query of A(K_M) invokes nonce-respecting encryption queries of A_ae, since ADD(leaf(i)) ∥ CTR(leaf(i)) ≠ ADD(leaf(j)) ∥ CTR(leaf(j)) necessarily holds for 1 ≤ i ≠ j ≤ b^d. Thus, the privacy adversary A_ae can simulate the initialization oracles for A(K_M). Regarding the queries to the update oracles, A(K_M) would invoke decryption queries of A_ae, since the update queries invoke the verification function of the authentication tree (line 1 in Alg. 16). However, the verification function always outputs ⊤, since σ_1 = σ_init and σ_i = σ′_{i−1} for 2 ≤ i ≤ q as defined in Section 5.1. Thus, A_ae can always output ⊤ regardless of the inputs to simulate the subroutine verification function in update queries. The adversary A_ae can simulate the remaining pure update function (lines 4-12 in Alg. 16) in the same manner as the simulation of the initialization oracles: if A(K_M) queries Update (resp. Update-$), A_ae can simulate it by querying AE.E (resp. $) in the same manner as Alg. 16. Also, each update query of A(K_M) invokes a nonce-respecting encryption query of A_ae, owing to the node-unique property of ADD(·) and the one-time property of CTR(·). Thus, the privacy adversary A_ae can simulate the update oracles for A(K_M).
Finally, we can easily confirm that the sequence of queries in the privacy game of A(K_M) invokes only nonce-respecting encryption queries of A_ae; hence the privacy adversary A_ae against AE can properly simulate A(K_M), and we obtain the following evaluation.

Adv^{ptree}_{PAT2}(A) ≤ Adv^{priv}_{AE}(A_ae),

where A_ae makes b^d + q queries to the encryption oracle, because an initialization query of A(K_M) invokes b^d encryption queries of A_ae and the update queries of A(K_M) invoke q encryption queries of A_ae.

Unforgeability Bound.
Theorem 4. The unforgeability advantage of PAT2 is bounded as follows.

Adv^{uftree}_{PAT2}(A) ≤ Adv^{auth}_{AE}(A^±_ae) + Adv^{mac}_{MAC}(A_mac),
where A^±_ae is the authenticity adversary against AE and A_mac is the adversary against MAC. The adversary A^±_ae makes b^d + q queries to the encryption oracle and one query to the decryption oracle. The adversary A_mac makes (b^d − 1)/(b − 1) + qd queries to the tagging oracle and d queries to the verification oracle.
Proof. Note that InitTree invokes nonce-respecting queries to MAC.T and AE.E, since ADD(u) ∥ CTR(u) ≠ ADD(u′) ∥ CTR(u′) holds for all distinct nodes u and u′. In addition, while simulating InitTree, we assume that A_ma keeps two lists to record her queries and responses for MAC.T and AE.E, respectively. These lists will be used in the simulation of Verify, and we describe their role later. Thus, the simulation of InitTree by A_ma is obtained by adding the following operations to Alg. 14.
After line 1:
For the above simulation of InitTree, A_ma makes (b^d − 1)/(b − 1) queries to MAC.T and b^d queries to AE.E. Next, we show how A_ma simulates Update. As in the privacy proof for PAT2, A_ma can simulate Verify, which is a subroutine of Update, without querying any oracle, since all A_ma has to do is return ⊤. Regarding the other operations of Update, A_ma can simulate them by invoking MAC.T and AE.E in the same manner as Alg. 16. Also, Update invokes nonce-respecting queries to MAC.T and AE.E owing to the uniqueness of ADD(·) and the one-time property of CTR(·). As in the simulation of InitTree, A_ma records her queries to and responses from MAC.T and AE.E. Thus, the simulation of Update by A_ma is obtained by adding the following operations to Alg. 16. To simulate q invocations of Update, A_ma makes qd queries to MAC.T and q queries to AE.E. Note that the sequence of queries to Update invokes nonce-respecting queries to MAC.T and AE.E.
We then show how A_ma simulates a query to Verify. In this simulation, List_MAC and List_AE are used to prevent A_ma from performing replay queries to MAC.V and AE.D, respectively. A replay query means that A_ma queries (N, M, T) (resp. (N, C, T)) to MAC.V (resp. AE.D) while (N, M) (resp. (N, M) such that (C, T) = AE.E(N, M)) has already been queried to MAC.T (resp. AE.E). Such queries may appear in the simulation of the unforgeability game, since A can perform a replay attack. The final goal of this proof is to show how to simulate the unforgeability game that A plays using the adversary A_mac against MAC and the authenticity adversary A^±_ae against AE; A_mac and A^±_ae are prohibited from making replay queries as defined in Section 2. Therefore, we have to show here that A_ma can simulate Verify without replay queries. From the above discussion, we define the behavior of A_ma when A queries (idx′, σ′) to Verify as follows.
1. Get the path of nodes from the root to the specified leaf, denoted by (u_0^{σ′}, · · · , u_d^{σ′}). Here, u_0^{σ′} is the root node and u_d^{σ′} is equal to leaf(idx′).
Finally, as in the privacy game, the sequence of queries in the unforgeability game of A invokes nonce-respecting queries to MAC.T and AE.E owing to the node-unique property of ADD(·) and the one-time property of CTR(·). From the above discussion, we observe that A_ma, querying MAC.T, AE.E, MAC.V, and AE.D, can simulate the functions that A queries without performing nonce-repeating queries to MAC.T and AE.E, or replay queries to MAC.V and AE.D. We write the resulting decomposition of A's advantage as (7), where ⊥_AE(·, ·, ·) is the function for decryption queries to AE that always returns ⊥ for any input, and ⊥_MAC(·, ·, ·) is the function for verification queries to MAC that always returns ⊥ for any input. In the rest of this section, we bound the three distinguishing probabilities of A in (7) in Lemmas 3, 4, and 5.
Lemma 3. The first distinguishing probability in (7) is bounded by Adv^{auth}_{AE}(A^±_ae), where A^±_ae is an authenticity adversary against AE making b^d + q encryption queries and one decryption query.
Proof. Suppose that A_ma eventually outputs a bit in her simulation of the unforgeability game and that A outputs the same bit as A_ma. Then we obtain the following inequality.
The rest of the proof is almost the same as that of the privacy bound. We consider A_ma(K_M), who owns the MAC key K_M, and obtain the following inequality. The right-hand side of (8) can be seen as the probability that A_ma(K_M), querying AE.E, successfully distinguishes AE.D from ⊥_AE in the unforgeability game for authentication trees. Without loss of generality, we can assume that this (distinguishing) probability equals the probability that A_ma(K_M), querying AE.E and AE.D, obtains something other than ⊥ from AE.D in the unforgeability game for authentication trees.
Here, let A^±_ae be the authenticity adversary against AE. The adversary A^±_ae can simulate the oracles that A_ma(K_M) queries, because the query sequence of A_ma(K_M) respects the rules of the authenticity game for AE schemes (nonce-respecting queries to AE.E and no replay queries to AE.D). Thus, we obtain the claimed bound, where A^±_ae makes b^d + q queries to AE.E and one query to AE.D.

Lemma 4. The second distinguishing probability in (7) is bounded by Adv^{mac}_{MAC}(A_mac),

where A_mac makes (b^d − 1)/(b − 1) + qd queries to MAC.T and d queries to MAC.V.
Proof. A_ma and A′_ma simulate InitTree and Update in the same way. The adversary A_ma simulates the verification query of A using ⊥_MAC, ⊥_AE, List_MAC, and List_AE, while A′_ma merely forwards it to ⊥_Tree, which returns ⊥ for any input. Namely, the distinguishing probability we evaluate here can be seen as the probability that A, querying InitTree and Update, obtains ⊤ from the tree verification oracle simulated by (⊥_MAC, ⊥_AE) with List_MAC and List_AE. This probability equals the probability that the data associated with the nodes in the path checked in the verification query of A consists only of the data in List_MAC and List_AE (see lines 4 and 8 of the simulation of Verify in the proof of Theorem 4). In the following claim, we prove that this probability is zero.
Claim. Recall that (idx′, σ′) is the tree verification query and (u_0, · · · , u_d) is the path of nodes from the root to the specified leaf; then either (a) or (b) described below must hold.
This claim states that A_ma, querying MAC.T, AE.E, ⊥_MAC, and ⊥_AE, has to query ⊥_MAC or ⊥_AE in her simulation of the tree verification query. Thus, she always obtains ⊥ from ⊥_MAC or ⊥_AE and returns it to A. All that remains is the proof of the claim. If (b) occurs, the claim is immediate; we need to see that if (b) does not occur, (a) must hold. First we discuss u_0 (i.e., the root node). Let (N_{u_0}, M_{u_0}, T_{u_0}) = (ADD(u_0^{σ′}) ∥ CTR(u_0^{σ′}), CTR(ch_1(u_0^{σ′})) ∥ · · · ∥ CTR(ch_b(u_0^{σ′})), Tag(u_0^{σ′})). If (N_{u_0}, M_{u_0}, T_{u_0}) ∉ List_MAC holds, then (a) holds. Suppose that (N_{u_0}, M_{u_0}, T_{u_0}) ∈ List_MAC; then we obtain the following equation.
Finally, we discuss u d (i.e., the leaf node).
Here, ADD(u_d^{σ′}) = ADD(u_d^{σ′_q}) holds since ADD(·) cannot be tampered with, and CTR(u_d^{σ′}) = CTR(u_d^{σ′_q}) holds owing to (10) with i = d − 1. Thus, we have (11), since the nonces included in List_AE are distinct, and hence the element of List_AE including N_{u_d} is uniquely determined as in (11). From (10) and (11), we have proved that the data stored in u_i^{σ′} is the same as that stored in u_i^{σ′_q} for all i ∈ {0, . . . , d}, which is a forbidden query. Therefore, there must exist i ∈ {0, . . . , d − 1} such that (N_{u_i}, M_{u_i}, T_{u_i}) ∉ List_MAC. This concludes the claim. From (7) and Lemmas 3, 4, and 5, we obtain the bound of Theorem 4, where A^±_ae is the authenticity adversary against AE and A_mac is the adversary against MAC. The adversary A^±_ae makes b^d + q queries to the encryption oracle and one query to the decryption oracle. The adversary A_mac makes (b^d − 1)/(b − 1) + qd queries to the tagging oracle and d queries to the verification oracle.

Implementation and Evaluation
In this section, we demonstrate the hardware implementation of the proposed scheme and evaluate it using logic synthesis. We instantiate our scheme with AES as the block cipher, which means n = 128. We assume that the lengths of the memory address and the counter are both 64 bits (i.e., α = β = n/2). The tag is 64 bits (i.e., τ = 64), which corresponds to the security level of SIT (i.e., the bit length of the keys of the inner-product MAC in [Gue16a]) for a fair comparison. In this paper, we focus on a high-throughput and area-time-efficient architecture based on an unrolled and pipelined AES datapath, similar to that of SIT in [Gue16a], whose throughput is one block encryption per clock cycle. This high-throughput architecture suits the context of memory encryption. Note that our architecture can utilize other block ciphers and architectures (e.g., round-based and byte-serial ones) in accordance with the optimization goals. (The variations are discussed in the next section.)

Fig. 6: Proposed MAC hardware architecture.

Figure 6 shows the proposed hardware architecture of PXOR-MAC. The primary inputs consist of a block index, the number of branches len (= b), the nonce (ADD(u_i^σ) ∥ CTR(u_i^σ)), two n-bit keys, and an n-bit segmented plaintext block (CTR(ch_{2j+1}(u_i^σ)) ∥ CTR(ch_{2j+2}(u_i^σ))) (0 ≤ j ≤ b/2 − 2), and the primary output is the tag. One plaintext block is fed to the hardware every clock cycle, and an encoding completes in 11 clock cycles. In this architecture, the AES datapath is fully unrolled and pipelined. The pipeline registers are inserted at the boundaries of each round to increase the throughput. This enables the encryption of one plaintext block per clock cycle, with the operating frequency determined by the critical path of one round datapath.

MAC Hardware Architecture
An up-to-date AES round datapath with a tower-field S-box, presented in [UMM + 20], is adopted for ELM (and for SIT [Gue16a], for the comparison in this paper) in the following hardware implementation. The mask value for the input block of the AES core (K · i in Alg. 13) is generated by multiplying a gray code (converted from a block index) by a key K over GF(2^n). The conversion from a block index to a gray code is given by a combinational circuit, and the generation of a mask value is implemented using an (n × log b)-bit GF multiplier, which generates the mask value for any index of a tree with b branches in one clock cycle. The mask value for the nonce (N_old ⊕ K · m ⊕ L in Alg. 13) is computed as the sum of the last mask value and L, without any GF multiplication. The accumulation in the tag generator is implemented by a feedback loop consisting of a bit-parallel XOR (i.e., a GF(2^{n/2}) adder) and registers, which realizes the for loop at lines 3-9 in Alg. 13.
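The gray-code mask schedule can be sketched in software as follows. This is our illustration, not the paper's implementation: we assume the GCM-style reduction polynomial x^128 + x^7 + x^2 + x + 1 for GF(2^128) (the paper's exact field convention may differ), and the function names are ours.

```python
# Sketch of the K·gray(i) mask schedule described above.

R = (1 << 128) | 0x87   # assumed reduction polynomial x^128 + x^7 + x^2 + x + 1

def gf_mul(a: int, b: int) -> int:
    """Carry-less multiplication reduced mod R (bit i = coefficient of x^i)."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        if a >> 128:
            a ^= R
        b >>= 1
    return acc

def gray(i: int) -> int:
    return i ^ (i >> 1)

def mask(K: int, i: int) -> int:
    return gf_mul(K, gray(i))
```

Since gray(i) and gray(i+1) differ in exactly one bit, consecutive masks differ by a single term K · 2^j, which is why a hardware implementation can also update the previous mask with one XOR instead of a full multiplication.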
During PXOR-MAC.V for Verify and PXOR-MAC.VU for Update, the tag is computed using the upper ciphertext accumulator after the encryption result is truncated to τ bits by msb_τ. The accumulator is given by a feedback loop with a bit-parallel XOR (i.e., a GF(2^{n/2}) adder). When updating the tag, the intermediate value Σ of Alg. 13 computed in the pre-verification process is stored into a register in the lower ciphertext accumulator for the subsequent update procedure, exploiting the incremental property of PXOR-MAC. The lower feedback loop after msb_τ is used for the accumulation that computes Σ. The updated tag is then calculated by XORing Σ and the old tag T_old computed in the pre-verification.

Fig. 7: Proposed AE hardware architecture.

The proposed AE architecture performs a decryption in the pre-verification process and an encryption in the update process simultaneously for Update. The pre-mask and post-mask generators compute the mask values for the input and output of encryption/decryption, respectively. The proposed architecture utilizes two pre-mask generators and two post-mask generators for simultaneous encryption and decryption. The field doubling and tripling for mask-value generation are achieved by combinational circuit blocks denoted by ×2 and ×3, which consist of four and 132 two-way XOR gates, respectively. Two pre-mask generators and two post-mask generators can be implemented with less area than the alternative consisting of one pre-mask generator, one post-mask generator, and two 128-bit-wise first-in first-out (FIFO) buffers. The plaintext accumulator collects truncated plaintext blocks for generating the tag, and consists of a feedback loop with a bit-parallel XOR (i.e., a GF(2^{n/2}) adder). Finally, E_K^{N,0,0}(0^n) is added before outputting the tag.
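The incremental verify-then-update flow above can be illustrated with a toy XOR-style MAC. This is a hedged sketch: `F` below is a stand-in PRF built from SHA-256, not the paper's AES-based PXOR-MAC, and we omit the nonce-mask refresh that PXOR-MAC additionally performs; only the XOR-incrementality is the point.

```python
# Toy illustration of incremental tag update for an XOR-style MAC.
import hashlib

def F(key: bytes, index: int, block: bytes, tau: int = 8) -> bytes:
    # Stand-in PRF (NOT PXOR-MAC's AES core), truncated to tau bytes.
    return hashlib.sha256(key + index.to_bytes(4, "big") + block).digest()[:tau]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def tag(key: bytes, blocks: list) -> bytes:
    t = bytes(8)
    for i, blk in enumerate(blocks):
        t = xor(t, F(key, i, blk))
    return t

def update_tag(key, t_old, i, old_blk, new_blk):
    # Two PRF calls (old block out, new block in) instead of recomputing
    # over all blocks -- the incremental property exploited by PXOR-MAC.VU.
    return xor(t_old, xor(F(key, i, old_blk), F(key, i, new_blk)))
```

The register holding Σ in the hardware plays the role of the partial XOR accumulation here: the pre-verification pass produces it for free, and the update pass consumes it.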

AE Hardware Architecture
The mask value ∆ is generated from a nonce by the ∆ generator module. In Flat-OCB-f, shown in Fig. 7(a) (i.e., the proposed AE with MASK1 in Alg. 8), ∆ is generated by four round datapaths of the aforementioned unrolled-pipelined AES, using four distinct 128-bit keys K_1, K_2, K_3, and K_4 as round keys. In Flat-OCB-m, shown in Fig. 7(b) (i.e., the proposed AE with MASK2 in Alg. 9), ∆ (= (N_1 · K_1 ∥ N_2 · K_2) ⊕ (N_2 · K_3 ∥ N_1 · K_4)) is computed using an ((n/2) × (n/2))-bit GF multiplier in four clock cycles, where four distinct (n/2)-bit secret keys K_1, K_2, K_3, and K_4 are used. The generated ∆ is added to the mask values at the pre/post-mask generators.
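The MASK2 computation can be transcribed as follows (our sketch: we assume the reduction polynomial x^64 + x^4 + x^3 + x + 1 for GF(2^64), which the paper may not use, and the function names are ours):

```python
# Sketch of the MASK2 delta computation: four GF(2^64) multiplications,
# packed into two 64-bit halves and XORed.

R64 = (1 << 64) | 0x1B   # assumed reduction polynomial x^64 + x^4 + x^3 + x + 1

def gf64_mul(a: int, b: int) -> int:
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        if a >> 64:
            a ^= R64
        b >>= 1
    return acc

def delta(n1, n2, k1, k2, k3, k4):
    # (N1*K1 || N2*K2) XOR (N2*K3 || N1*K4), "||" packed via shifts.
    left = (gf64_mul(n1, k1) << 64) | gf64_mul(n2, k2)
    right = (gf64_mul(n2, k3) << 64) | gf64_mul(n1, k4)
    return left ^ right
```

Because n1 (= ADD(u)) is fixed between pre-verification and update, the products n1·k1 and n1·k4 can be cached, which is exactly the two-cycle saving described for Update below.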
For Verify, the proposed architecture first computes ∆. The generated ∆ is added to the doubled or tripled L at the pre-mask generator (for dec.), and the resulting mask value ∆ ⊕ 2L or ∆ ⊕ 3L is added to the input ciphertext block before decryption. After the decryption, the mask value from the post-mask generator (for dec.) is added to the decryption result to obtain the corresponding plaintext. While the first block is being decrypted, E_K^{N,0,0}(0^n) is computed using the encryption core and then added to the above plaintext. The resulting value is truncated to τ bits and stored in the register in the plaintext accumulator for the verification. The second and subsequent blocks are then processed in parallel in the pipelined datapath, and the processed blocks are accumulated in the plaintext accumulator. After processing all blocks, the architecture outputs the verification tag.
For Update, the proposed architecture performs the pre-verification and update processes simultaneously, thanks to the separately implemented decryption and encryption cores. Let ∆_old and ∆_new be the parts of the mask values generated from the nonce for the pre-verification and update processes, respectively. ∆_old is generated first and then ∆_new, using the ∆ generator module. In the case of AES4 in Fig. 7(a), the generations of ∆_old and ∆_new are executed in parallel owing to the pipelined datapath. In contrast, in the case of the GF(2^{n/2}) multiplier in Fig. 7(b), the multiplication results N_1 · K_1 and N_1 · K_4 are reused, because the value of ADD(u^σ) (i.e., half of the nonce) is fixed across the pre-verification and update processes, which means that N_1 · K_1 and N_1 · K_4 are identical for both. Therefore, the number of clock cycles for computing ∆_new can be reduced from four to two. The architecture then generates the verification tag in the same manner as Verify. In addition, after computing E_K^{N_old,0,0}(0^n), we simultaneously compute E_K^{N_new,0,0}(0^n), encrypt the plaintext blocks, and accumulate the results at the plaintext accumulator module (for the update process). After processing all plaintext blocks, the architecture outputs the updated tag.

Performance Evaluation
This subsection reports our performance evaluation of the proposed architectures and of SIT, a major state-of-the-art counterpart, on the basis of logic synthesis results. We assume that one MAC module (hardware) is used at the top and at each middle layer of the tree structure, and one AE module is used at the lowest layer. In other words, an authentication tree with a depth of d utilizes d MAC modules and one AE module, which fully exploits the parallelism provided by the tree structure, as claimed for PAT. Under this assumption, we investigate the best-case performance given a constraint in area and power (i.e., the number of available MAC modules), and clarify the area-latency trade-offs from the performance evaluation results.
For the evaluation, we use Synopsys Design Compiler I-2013.12-SP5 and the Nangate 15 nm Open Cell Library. The performance of authentication trees with d = 3, 5, and 7 is evaluated, these being the major parameter values in [Gue16a]. Table 2 lists the areas obtained from the logic synthesis. ELM1 and ELM2 represent the proposed schemes where Flat-OCB-f and Flat-OCB-m are used as the AE, respectively. We set the timing constraint for the synthesis to an operating frequency of 4 GHz, assuming that the authentication tree is deployed for memory encryption in modern high-end CPUs operating at 3 GHz or faster. We confirmed that no timing violation was found in the synthesis result; therefore, the proposed architecture can be used even for modern high-end CPUs without degrading the system clock frequency.
For comparison, Table 2 also lists the synthesis results of SIT implemented under the same conditions and assumptions as above. SIT was implemented according to [Gue16a]. We utilized an unrolled-pipelined AES encryption core with the same round datapath as our scheme. A 64 × 64-bit GF multiplier to compute the inner-product MAC was also implemented in the same manner as ours. Therefore, the critical path was given by one round datapath of AES, as in ours. SIT also utilizes d + 1 modules when the depth is d, as the AE and MAC of SIT are given by the same module (i.e., an AES encryption core and a GF(2^64) multiplier).
From Table 2, we can see that the area of the proposed architecture becomes comparable to that of SIT as the depth grows. The hardware architectures for Flat-OCB require both encryption and decryption cores, which results in a larger area than the inverse-free AE used in SIT. However, the proposed MAC hardware is implemented with only one AES encryption core as the major component, whereas SIT requires a GF(2^64) multiplier in addition to one AES encryption core. Consequently, we can confirm that the proposed authentication trees have an advantage over SIT in terms of latency and memory regions for the cases with large depths. Table 3 shows the numbers of clock cycles (i.e., latency) and the sizes of the protected memory region (namely, the covered region) of ELM and SIT for various tree parameters. The corresponding comparison graphs for the major parameters are shown in Fig. 8. Incremental SIT (Incr. SIT for short) indicates the evaluation result of SIT when Update is performed in an incremental manner. Note that such an incremental update has not previously been mentioned in the literature [Gue16a]; we evaluate the corresponding incremental SIT for a fair comparison. Each clock-cycle figure shown here is the larger of those of the AE and the MAC. For example, if our authentication tree has eight branches (b = 8) and handles 512-bit chunks (ℓ = 512) at the leaf (or lowest) nodes, ELM1 requires 18 and 24 clock cycles for the MAC and AE, respectively; hence the clock cycles of this authentication tree are given as 24 in the table. The bold-face figures in each row highlight the scheme that achieves the lowest latency (minimal clock cycles) under the parameter condition of the row. The parameters used in Fig. 8 are underlined in the chunk-size column of Table 3.
The table shows the results of all tree structures comprehensively, where the rows hatched in gray indicate parameter sets that are dominated by a white row in terms of the latency required for the covered region. For example, a tree with b = 8 and ℓ = 4,096 requires a larger latency and covers a smaller region than one with b = 16 and ℓ = 512, so there is no reason to use the former rather than the latter. Such meaningless parameters are caused by the gap in latency between AE and MAC, as discussed in Section 6.1.3. Table 3 and Fig. 8 show that the advantage of the proposed schemes (ELM1 and ELM2) grows as the covered region becomes larger. One major reason is that the proposed scheme utilizes a 128-bit block cipher (i.e., AES), whereas SIT processes a plaintext in a 64-bit-wise manner (i.e., an inner-product MAC over GF(2^64)). More precisely, since the MAC module at each layer (and the AE module) must process more bits for larger parameters, the 128-bit-wise computation of PXOR-MAC in the proposed scheme requires fewer calls to the underlying pseudo-random function than the 64-bit-wise computation of SIT, which results in a lower latency. In addition, the number of clock cycles in the update process of AE is reduced by using a distinct decryption core to perform the pre-verification and update processes simultaneously. Note that SIT uses Encrypt-then-MAC for AE, given by counter-mode encryption followed by the inner-product MAC. Since SIT does not utilize any decryption function and the MAC computation becomes critical for the latency, the latency of SIT cannot be reduced in the same manner as our scheme. In contrast, for small parameters, such as b = 8 (or b = 16) and ℓ = 512 (the original parameter set for SIT in [Gue16a]), SIT has a lower latency for the pre-verification and update processes thanks to the lightness of GF(2^64) multiplication.
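The notion of dominated (gray) rows can be illustrated with a minimal Pareto filter. The sample (cycles, covered-region) pairs below are hypothetical placeholders, not values from Table 3:

```python
def pareto_filter(rows):
    """Keep only rows not dominated by another row.

    A row (cycles, covered) is dominated if some other row achieves a
    covered region at least as large with strictly fewer cycles, or a
    strictly larger region with at most as many cycles.
    """
    keep = []
    for i, (c, r) in enumerate(rows):
        dominated = any(
            (c2 <= c and r2 >= r) and (c2 < c or r2 > r)
            for j, (c2, r2) in enumerate(rows) if j != i
        )
        if not dominated:
            keep.append((c, r))
    return keep

# Hypothetical entries: the second dominates the first (fewer cycles, larger region).
rows = [(40, 1 << 20), (32, 1 << 22), (48, 1 << 26)]
print(pareto_filter(rows))  # [(32, 4194304), (48, 67108864)]
```

Only well-balanced parameter sets survive such a filter, which is the criterion used to select the white rows.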
These results suggest that ELM is superior to SIT for most parameters, especially when covering a larger region; as the memory region to be protected grows, the advantage of ELM increases significantly. The on-chip and off-chip memory sizes for each architecture are listed in Table 4. We assume that both counter and tag lengths are 56 bits for comparison with SIT. With respect to on-chip storage, ELM1 requires four 128-bit round keys for ∆ generation, a 128-bit L, a 128-bit plaintext/ciphertext-processing key K, and a 56-bit root counter CTR(u^σ_root). Similarly, ELM2 requires four 64-bit round keys for ∆ generation, a 128-bit L, a 128-bit plaintext/ciphertext-processing key K, and a 56-bit CTR(u^σ_root). The amount of on-chip storage in ELM1 and ELM2 is thus constant regardless of the tree parameters b and ℓ. In contrast, SIT needs to store the 2 × 128-bit keys and the inner-product-MAC key for nonce processing and mask generation, whose sizes depend on the tree parameters. As a result, the on-chip memory size grows with b and ℓ because the key length of the inner-product MAC increases in proportion to the length of the input block (which depends on b and ℓ). For example, the on-chip memory size is 768 bits when b = 8 and ℓ = 512, whereas it is 8,448 bits when b = 128 and ℓ = 8,192.
In addition, the off-chip memory size is determined by the size of the counters (excluding the root) and the tags. In SIT, the counters at d = 1 are originally stored in the on-chip memory, but here we store them in the off-chip memory for comparison. Our tree requires CTR and Tag for the middle/lowest layers and Tag for the root layer, all of which are 56 bits. Because SIT stores one unused bit in each layer, the size of its memory unit is one bit larger than that of ours. For example, when d = 3 and b = 8, the off-chip memory sizes of SIT and ours are 66,049 and 65,464 bits, respectively.
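The off-chip totals above can be reproduced with a short script. The per-unit accounting below (56-bit units for ELM, with SIT's extra unused bit attached to each tag unit) is our assumption, chosen because it reproduces the two totals quoted in the text:

```python
def offchip_bits(b, d, ctr_bits, tag_bits):
    """Off-chip size: one counter per non-root node, one tag per node (incl. root)."""
    non_root = sum(b**i for i in range(1, d + 1))
    return non_root * ctr_bits + (non_root + 1) * tag_bits

# ELM: 56-bit counters and tags.
elm = offchip_bits(8, 3, ctr_bits=56, tag_bits=56)  # 65,464 bits
# SIT: one extra (unused) bit per tag unit (our assumption).
sit = offchip_bits(8, 3, ctr_bits=56, tag_bits=57)  # 66,049 bits
print(elm, sit)
```

With d = 3 and b = 8 this gives 65,464 and 66,049 bits, matching the figures above.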

Design Optimizations
Considering System Constraints. In this paper, we evaluated the performance of our authentication trees without considering system constraints, in order to demonstrate the scalability of the proposed scheme. In practice, the total hardware configuration must be designed and optimized for various system/architecture constraints, including the system clock frequency, the memory size to be protected, the available on-chip resources, the memory bandwidth, the cache memory structure, and so on. The sizes of the MAC and AE modules should also be determined with these constraints in mind, though here we used d MAC modules and one AE module for the authentication-tree implementation, according to the parallelizability of the authentication tree. (Note again that PAT was proposed as the first scheme that offers such parallelizability.) For example, regarding the operating frequency, the result discussed in Section 6.3 suggests that our architectures should not limit the operating frequency, since the system clock frequency of modern high-end CPUs is currently at most 3.8 GHz unless overclocked (e.g., Intel Core i7-10700K and AMD Ryzen 3900X). When the maximum frequency of the MAC and AE modules is far higher than the system clock, and if the system constraints allow, the number of clock cycles for encryption and decryption can be reduced without changing the system clock frequency by removing and rearranging the pipeline registers in the unrolled AES datapath appropriately. Conversely, when the system clock frequency is higher than the maximum frequency of the MAC and AE modules, the datapath should be modified to raise its frequency before deployment.
Mitigating the Gap in Latency between AE and MAC. A gap in latency between AE and MAC leads to a loss of efficiency for the authentication tree, because the entire latency is determined by the larger of the two. (This gap is the reason most rows in Table 3 are denoted in gray; only well-balanced parameters are meaningful.) While we systematically evaluated the typical tree structures in Section 6.3, where the parameters are powers of two, these parameters should be chosen such that the latencies of AE and MAC are well-balanced.
However, it would be difficult to align the latencies of AE and MAC exactly. In such a case, it is effective to reduce the number of pipeline stages of AES in whichever of AE or MAC has the larger latency. In addition, selecting an appropriate S-box implementation would be useful, as the above evaluation utilized a tower-field S-box to achieve high area-time efficiency [UMM + 20]. Since the AES encryption/decryption core is unrolled and pipelined, using a table-based S-box for two consecutive rounds makes it possible to remove the pipeline register between them (i.e., to reduce the number of clock cycles) without significantly degrading the operating frequency. In other words, two rounds can be computed in one clock cycle if we use a table-based S-box for those rounds. We confirmed through logic synthesis with the NanGate 15nm Open Cell Library that such an implementation could operate at 4 GHz.
Here, our architecture for Flat-OCB-f (i.e., the proposed AE with AES4) uses an AES encryption core both for generating ∆ from the nonce and for encrypting plaintext blocks. Reducing the latency of the four-round datapath used for AES4 with a table-based S-box is therefore particularly effective, because in Flat-OCB-f the four-round datapath is used for both nonce processing and plaintext encryption.
In summary, when designing the authentication tree and its hardware, we should first determine the optimal (i.e., well-balanced) tree structure parameters for the required covered region. Then, we can mitigate the remaining gap in latency between AE and MAC on the basis of the above hardware optimization approach.

Application of Split Counter
The split counter is a method to reduce the amount of counters stored in off-chip memory for memory authentication trees [YEP + 06]. It uses two types of counters: major and minor. A major counter is shared by the children of the node of interest (i.e., a parent node), and each child node is equipped with its own minor counter. In other words, in an authentication tree with split counters, children having the same parent share the same major counter as the upper bits of their counters.
ELM can also be combined with the split counter. In the following, we evaluate the off-chip memory size when the split counter is used. We assume here that both tag and counter lengths without the split counter are 64 bits, and that the major and minor counters are 56 and 8 bits, respectively. The off-chip memory size of the entire tree is 128 × Σ_{i=1}^{d} b^i + 64 bits without the split counter and 72 × Σ_{i=1}^{d} b^i + 64 + 56 × Σ_{i=1}^{d−1} b^i + 56 bits with it. The value of 72 is the sum of the tag and minor-counter lengths, and the third and fourth terms of the latter expression give the size of the major counters (for the internal parents and the root, respectively). As an example, when d = 3 and b = 8, the memory size is 74,816 bits without the split counter and 46,200 bits with it, which shows a large reduction in the amount of memory.
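As a sanity check on the expressions above, the following sketch evaluates both totals for d = 3 and b = 8, under the stated assumptions (64-bit tags and counters, 56/8-bit major/minor split):

```python
def offchip_no_split(b, d, unit=64):
    # Each non-root node stores a counter and a tag; the root stores a tag only.
    non_root = sum(b**i for i in range(1, d + 1))
    return 2 * unit * non_root + unit

def offchip_split(b, d, tag=64, major=56, minor=8):
    # Non-root nodes: tag + minor counter; every parent (incl. root): one major counter.
    non_root = sum(b**i for i in range(1, d + 1))
    parents = sum(b**i for i in range(0, d))
    return (tag + minor) * non_root + tag + major * parents

print(offchip_no_split(8, 3), offchip_split(8, 3))  # 74816 46200
```

Both results match the figures in the text, confirming the term-by-term reading of the two expressions.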
We should point out that overflows of the minor counters occur frequently, since each minor counter has a small bit length. When such an overflow occurs, the corresponding major counter is incremented and all the minor counters associated with it are reset to zero. In ELM, b counters are used as the plaintext input for tag generation by the MAC algorithm; that is, b × n/2 bits must be verified by the MAC. The split counter reduces the amount of counters stored off-chip and the average latency of the MACs, because the input to the MAC algorithm becomes shorter. More precisely, let ctr be the counter of the parent node and let ctr_1, ctr_2, . . . , ctr_b be the counters of the children nodes without the split counter, where b is the number of branches in the tree structure. In ELM, a tag T is generated as T = PXOR-MAC.T_{K,K}((ADD ‖ ctr), (ctr_1 ‖ ctr_2 ‖ · · · ‖ ctr_b)), where PXOR-MAC.T_{K,K}(N, M) calculates a tag from a nonce N and a plaintext M. Each counter is n/2 bits, so b × n/2 bits must be encrypted.
In contrast, consider tag generation with the split counter. Let Mctr and mctr be the major and minor counters of a parent node, respectively, and let Mctr′ and mctr_1, mctr_2, . . . , mctr_b be the major counter and minor counters of the children nodes, respectively. In this case, unless an overflow of the minor counters occurs, a tag is generated as T = PXOR-MAC.T_{K,K}((ADD ‖ Mctr ‖ mctr), (Mctr′ ‖ mctr_1 ‖ · · · ‖ mctr_b)).
Here, let s and t be the bit lengths of the major and minor counters, respectively (s + t = n/2). With the split counter, the input is s + bt bits. Since s + bt ≤ b(s + t) = b × n/2, the input bit length of the MAC algorithm is reduced by the split counter. In addition to the tag-generation algorithm (i.e., PXOR-MAC.T), the Verify (i.e., PXOR-MAC.V) and Update (i.e., PXOR-MAC.VU) algorithms work with the split counter as well. The tags of the leaf (or lowest) nodes can also be generated, verified, and updated in a similar manner when the split counter is applied.
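The input-length saving can be made concrete with a two-line sketch, using the parameters from the text (n = 128, so s = 56 and t = 8):

```python
def mac_msg_bits_plain(b, n):
    # Without the split counter: b child counters of n/2 bits each.
    return b * (n // 2)

def mac_msg_bits_split(b, s, t):
    # With the split counter: one shared s-bit major counter + b t-bit minor counters.
    return s + b * t

# For b = 8 and n = 128: 512 bits without vs. 120 bits with the split counter.
print(mac_msg_bits_plain(8, 128), mac_msg_bits_split(8, 56, 8))
```

The inequality s + bt ≤ b(s + t) holds for every b ≥ 1, so the split counter never lengthens the MAC input.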
On the other hand, as described above, when the major counter is incremented due to an overflow of a minor counter, all the minor counters associated with that major counter are reset to 0. Accordingly, we need to update all the tags whose nonces were reset. While a tag update in a tree without the split counter updates only d tags (one per layer), the tag update with the split counter requires b tag updates at the layer where a major counter is incremented (i.e., where the overflow of a minor counter occurs), which is non-trivial in the whole tree-update process (i.e., Update).
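The overflow behavior can be sketched as a toy model that only tracks the counter state of one parent and reports how many child tags must be recomputed per increment; the class and method names are ours, not part of the scheme:

```python
class SplitCounter:
    """Toy model of one parent's split counter: a shared major counter
    plus one minor counter per child."""

    def __init__(self, b, minor_bits=8):
        self.major = 0
        self.minor = [0] * b
        self.limit = 1 << minor_bits

    def bump(self, child):
        """Increment one child's counter; return the number of child tags
        that must be recomputed (1 normally, b on a minor-counter overflow)."""
        self.minor[child] += 1
        if self.minor[child] == self.limit:
            self.major += 1                  # overflow: advance the shared major counter
            self.minor = [0] * len(self.minor)  # reset all minor counters
            return len(self.minor)           # all b children need fresh tags
        return 1
```

With an 8-bit minor counter, 255 out of every 256 increments touch a single tag, while the 256th forces b tag updates at that layer.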
Tables 5 and 6 show the numbers of clock cycles when an overflow occurs, for MAC and AE respectively. We evaluate five different numbers of branches. Since a reset minor counter always takes the same value, the encryption result of the reset counters in the MAC can be pre-computed for both SIT and our trees. Hence, only the major counter (i.e., Mctr′) and the nonce (i.e., (ADD ‖ Mctr ‖ mctr)) need to be computed if the MAC offers the incremental property. Thus, ELM maintains its superiority over SIT even when an overflow occurs. We also found that the proposed AE remains advantageous with the split counter, thanks to the simultaneous execution of pre-verification and update. In particular, as the number of branches increases, the proposed scheme becomes more advantageous compared with the case without the split counter. The split counter is typically applied to trees covering a large memory, where the amount of counters can be critical, so this again confirms the advantage of the proposed scheme.

Conclusion
We have presented ELM, a new memory encryption scheme with tree-based authentication. Unlike many recent proposals from the computer-architecture perspective, we focus on the internal MAC and AE modes, including their interactions, to reduce the entire latency of the tree operations. ELM combines fully parallelizable MAC and AE modes and utilizes the incremental property of the MAC mode. Our AE mode is similar to OCB but has a better decryption latency, and it can be of independent interest as a stand-alone AE mode. We provide provable-security results for these components as well as for the whole authentication tree. Since Intel SGX's scheme (SIT) is a representative work in the same direction, we instantiated ELM using the same AES core, compared ELM with SIT, and presented preliminary hardware implementations. The results show that ELM achieves significantly lower latency while keeping an implementation size comparable to that of SIT. Several future directions can be considered, as follows.

Other Instantiations. The use of AES is not the ultimate choice for latency. As we described, a low-latency block cipher or tweakable block cipher (e.g., PRINCE [BCG + 12], QARMA [Ava17], and Midori [BBI + 15]) would significantly improve our hardware results in both latency and size. It is even possible to use multiple primitives of possibly different block sizes for optimized performance. It is also interesting to study instantiations based on cryptographic permutations, e.g., [NIS15, DEMS16, BKL + 17].
Side-Channel Attacks. Cryptographic hardware frequently needs to resist side-channel attacks in memory-encryption applications. Designing and evaluating a side-channel-resistant hardware architecture for ELM is left for future work. The proposed architectures can employ any other block cipher (satisfying the security criteria) and any type of architecture, instead of the unrolled and pipelined AES encryption and decryption cores used in this paper. This indicates that we can realize side-channel-resistant ELM hardware simply by replacing the AES cores with side-channel-resistant ones, because typical attackers attempt to retrieve the secret key of the underlying block cipher to break confidentiality and authenticity.
For example, the masked round-based AES implementation in [SBHM20] achieves a far lower latency than conventional implementations based on functional decomposition and byte-serial architectures, which would suit the context of memory encryption. However, such masked implementations require a considerably large area overhead and on-the-fly random-number generation (except for [Sug19, WM18]), which makes it impractical to unroll and pipeline the masked AES datapaths for high throughput. Using a (first-order) masking-friendly lightweight (tweakable) block cipher such as PRESENT [BKL + 07], GIFT [BPP + 17], or Skinny [BJK + 16] would be a practical alternative to achieve side-channel resistance with less area overhead and no on-the-fly randomness.
Furthermore, it would be interesting to design leakage-resilient TBC/permutation-based AE (e.g., [DEM + 17, BGP + 19]) and MAC schemes that enable low-latency operation and are suitable for use with ELM.