Reducing the Cost of Implementing AES as a Quantum Circuit

To quantify security levels in a postquantum scenario, it is common to use the quantum resources needed to attack the Advanced Encryption Standard (AES) as a reference value. Specifically, in the National Institute of Standards and Technology's ongoing postquantum standardization effort, different security categories are defined that reflect the quantum resources needed to attack AES-128, AES-192, and AES-256. This article presents a quantum circuit to implement the S-box of AES. Also, leveraging an improved implementation of the key expansion, we identify new quantum circuits for all three AES key lengths. For AES-128, the number of Toffoli gates can be reduced by more than 88% compared to Almazrooie et al.'s and Grassl et al.'s estimates while simultaneously reducing the number of qubits. Our circuits can be used to simplify a Grover-based key search for AES.


Introduction
Reacting to progress in the development of quantum computers, NIST has initiated a process to standardize cryptographic primitives that are designed to remain secure in the presence of large-scale quantum computers [13]. To fix security categories, NIST's call for proposals offers the quantum resources for an exhaustive key search in the case of AES as a reference point. Relevant cost measures include the number of qubits, the number of T- and Clifford gates, and the T-depth. It is not hard to see that, with the exception of the highly structured S-box (the SubByte transform), all of AES can be implemented by means of NOT and CNOT gates.
Contributions. Below we present a new quantum circuit to implement SubByte, which builds on a result by Boyar and Peralta [6]. This approach allows a substantial reduction in the number of T-gates compared to the quantum circuits proposed by Grassl et al. [8] and, more recently, by Almazrooie et al. [1]. Our circuit requires 32 qubits, 55 Toffoli gates, 314 CNOT gates, 4 NOT gates, Toffoli depth 40, and a total (NCT) depth of 298, including "cleaning up" ancillas. This is a reduction of the Toffoli count by more than 88%. There are different options to compile Toffoli gates into Clifford and T-gates, and the common quantum cryptanalytic approach is to first express AES as an NCT circuit, i.e., with NOT, CNOT, and Toffoli gates. Consequently, in this paper we stay at the NCT level, leaving the choice of a particular decomposition of Toffoli gates into more elementary building blocks to a subsequent synthesis step.
Moreover, building on [8,1], we present new quantum circuits for all three standardized key lengths of AES, which simultaneously offer savings in the number of qubits, the number of Toffoli gates, and the number of Clifford gates.
Organization. First, we briefly recall the structure of the S-box of AES and survey prior work to express this functionality as a quantum circuit. Thereafter, we present our design for implementing SubByte in the NCT gate set and integrate it into quantum circuits for AES-128, AES-192, and AES-256. We conclude with updated cost estimates for an exhaustive key search with Grover's algorithm for AES.
For a full specification of AES we refer to [12], but we briefly recall the algebraic structure of SubByte.

The S-box of AES
The AES algorithm uses multiple different transformations, but as detailed in [8], with the exception of SubByte all needed calculations can be expressed with NOT and CNOT gates alone. The non-linear SubByte transformation takes a one-byte input $b \in \mathbb{F}_2^8$ and substitutes it with a byte $S(b) \in \mathbb{F}_2^8$ obtained by applying two operations in sequence: an inversion in the binary field $\mathbb{F}_{2^8}$ (with $b = 0$ mapped to itself), followed by an affine transformation.
In [12, Figure 7], SubByte is expressed as a traditional substitution table, but when aiming at an efficient quantum circuit, one can try to leverage the available algebraic structure:
- Observing that the map $S$ that defines SubByte is a permutation on $\mathbb{F}_2^8$ and therewith inherently reversible, we can try to reduce the number of qubits by evaluating $S$ "in place."
- The affine map in the second step can be expressed with NOT and CNOT gates only, and one can focus on minimizing resources for the binary field inversion step, possibly exploiting the presence of intermediate fields.
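The two-step structure of SubByte can be checked classically. The following is a minimal Python sketch, assuming the standard AES field polynomial $x^8 + x^4 + x^3 + x + 1$ and the standard affine constant 0x63; the function names are illustrative and do not come from the paper.

```python
AES_MODULUS = 0x11B  # x^8 + x^4 + x^3 + x + 1, the AES field polynomial


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes as elements of GF(2^8) modulo the AES polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_MODULUS
        b >>= 1
    return result


def gf_inv(b: int) -> int:
    """Return b^254, which equals b^(-1) for b != 0 and 0 for b == 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, b)
    return result


def rotl8(x: int, shift: int) -> int:
    """Rotate a byte left by `shift` bit positions."""
    return ((x << shift) | (x >> (8 - shift))) & 0xFF


def affine(x: int) -> int:
    """Affine step of SubByte: an invertible linear map plus the constant 0x63."""
    return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63


def sub_byte(b: int) -> int:
    """SubByte as field inversion followed by the affine map."""
    return affine(gf_inv(b))


SBOX = [sub_byte(b) for b in range(256)]
# S is a permutation of the 256 byte values, hence reversible.
print(sorted(SBOX) == list(range(256)))
```

Only `gf_inv` is non-linear over $\mathbb{F}_2$; `affine` uses XORs and a constant only, which is why it translates into CNOT and NOT gates.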

Prior work to implement SubByte as quantum circuit
Several authors looked into implementations of AES and its S-box as quantum circuits. In 2016, Grassl et al. [8] report a quantum circuit that builds on the observation that SubByte is a permutation. They offer a circuit that maps $|x\rangle$ with $x \in \mathbb{F}_2^8$ to $|S(x)\rangle$ using only a single ancilla. However, the circuit (found by solving a word problem in a permutation group) is quite large: while it requires only 9 qubits, it uses 1385 Toffoli plus 1551 CNOT or NOT gates.
In [8], another circuit is offered, which exploits the algebraic structure of SubByte and maps the input $|x\rangle|0^{32}\rangle$ to $|x\rangle|S(x)\rangle|0^{24}\rangle$. With 40 qubits, this circuit needs only 512 Toffoli along with 469 CNOT and 4 NOT gates. Kim et al. [11] suggest an improvement to the design in [8], saving one $\mathbb{F}_{2^8}$-multiplication. Almazrooie et al. [1] improve on Grassl et al.'s Toffoli count, using the same number of $\mathbb{F}_{2^8}$-multiplications as Kim et al. Exploiting again the algebraic structure of the S-box, Almazrooie et al. identify a quantum circuit with 56 qubits, 448 Toffoli gates, 494 CNOT gates, and 4 NOT gates.
When aiming at a reduction of T-gates and T-depth, work on classical reversible circuits offers helpful results, and it seems these have not been fully leveraged yet. For instance, in 2018, Saravanan and Kalpana [14] suggest an implementation of SubByte involving only 35 Toffoli, 152 CNOT, and 4 NOT gates. This design (which again exploits the algebraic structure of the S-box) produces dozens of "garbage outputs," and for our purposes the cost to "clean up" wires is to be taken into account. Still, combining Bennett's method [4] with [14] leads to a quantum circuit with a Toffoli count of only $2 \cdot 35 = 70$, less than one sixth of the Toffoli counts in [8,1].
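The Toffoli counts quoted in this survey can be compared with a few lines of arithmetic. The Python sketch below assumes the simplest reading of Bennett's method (compute, copy the result with CNOT gates, then run the computation in reverse), so every AND gate is paid for twice. The helper name `bennett_toffoli_count` is illustrative.

```python
def bennett_toffoli_count(and_gates: int) -> int:
    """Toffoli count for compute, copy (CNOT only), uncompute."""
    return 2 * and_gates


# Toffoli counts of the structured S-box circuits quoted in the text.
PRIOR_TOFFOLI = {"Grassl et al. [8]": 512, "Almazrooie et al. [1]": 448}

from_ref_14 = bennett_toffoli_count(35)  # Saravanan and Kalpana [14]
from_ref_7 = bennett_toffoli_count(34)   # Boyar and Peralta [7]

print("Bennett applied to [14]:", from_ref_14)
print("Bennett applied to [7]: ", from_ref_7)
for name, count in PRIOR_TOFFOLI.items():
    # "Less than one sixth" means 6 * 70 stays below the prior count.
    print(name, count, 6 * from_ref_14 < count)
```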
Exploring the S-box in AES from the perspective of identifying a low-depth combinational circuit, Boyar and Peralta present in [7] a proposal with only 34 AND gates. Their design again leverages the algebraic structure of SubByte, and by naïvely combining Boyar and Peralta's work with Bennett's method we could derive a quantum circuit for the S-box of AES with no more than 68 Toffoli gates, though possibly with a solid number of ancillas. As starting point for our work we use an older design by Boyar and Peralta [6], which involves only 32 AND gates to evaluate SubByte. Below, we transform the latter into a quantum circuit for SubByte that avoids the direct application of Bennett's method. In particular, this limits the number of Toffoli gates to 55, including all necessary "clean up."

Proposed quantum circuit for the S-box in AES
In [6], Boyar and Peralta discuss a technique for combinational logic optimization, which involves two steps. The first step identifies non-linear circuit components and reduces the number of AND gates, which for our purposes can be interpreted as saving Toffoli gates. The second step finds maximal linear components of the circuit and minimizes the number of XOR gates needed, therewith reducing the number of CNOT gates.

Decomposition of the S-box by Boyar and Peralta
Making use of the intermediate fields of $\mathbb{F}_{2^8}$, Boyar and Peralta decompose SubByte into linear maps, given by matrices $B$ and $U$, and a non-linear function $F : \mathbb{F}_2^{22} \longrightarrow \mathbb{F}_2^{18}$. The matrices $B$ and $U$ are given in [6, Appendix A], and the function $F$ can be computed as shown in Figure 1. From this, we see that no more than 32 Toffoli gates are needed to evaluate SubByte, but we still need to take care of "cleaning up" ancillas, and we would like to keep the number of qubits small. To optimize the linear portion of SubByte, Boyar and Peralta derive short linear programs, which we do not reproduce here; they are available in [6, Appendix C, Figures 2 and 4] and involve XOR and XNOR operations only. The four NOT gates in our quantum circuit originate in the four XNOR gates used by Boyar and Peralta.

Deriving a compact quantum circuit
A naïve conversion of Boyar and Peralta's circuit yields a quantum circuit with 126 qubits, 32 Toffoli gates, 166 CNOT gates, and 4 NOT gates, not yet taking into account the "clean up" cost. The circuit we aim at maps $|x\rangle|0^a\rangle$ to $|x\rangle|S(x)\rangle|0^{a-8}\rangle$ with a small number $a$ of ancilla qubits. Our circuit uses $a = 24$. We also identified a circuit with $a = 23$, but that circuit came at the expense of increasing the Toffoli count by 2, and our primary objective is to reduce the number of Toffoli gates.
To reduce the number of qubits in a straightforward translation, we notice that certain wires, after being accessed for a few immediate computations, remain idle until the end. Uncomputing these wires early on enables us to reuse them instead of introducing additional ancillas. Another observation is that wires that store the output of the S-box do not need to be cleaned up. Thus, we try to have Toffoli gates applied directly to those wires to avoid involving them in the clean-up process. Also, some computations would target a wire, and later on, the result is just added somewhere else. We try to place gates so that such "intermediate wires" are avoided. The final circuit we obtain, including "cleaning up", requires 32 qubits, 55 Toffoli, 314 CNOT, and 4 NOT gates. The Toffoli depth is 40, and the overall S-box depth is 298. Figure 2 gives a high-level view of the circuit; Appendix A gives a detailed description.
To produce the circuit description in Appendix A, we used the open-source software framework for quantum computing ProjectQ [16,10]. The 8-bit input of SubByte is represented by U; T and Z represent ancillas, which are used in the intermediate computations and returned to zero at the end, and S represents the output of the S-box.
The main portion of the source code is the translation of the equations in Figure 1. We treat U[0], ..., U[7] as basis elements, and we update them as we progress to provide the input values needed for a calculation. For instance, to compute $t_2$, we first compute $y_{12}$ and $y_{15}$ and then apply a Toffoli gate. The value $y_{12}$ can be obtained as a linear combination of U[0], U[3], U[5], U[6], and we store this result on U[5]. Similarly, $y_{15}$ is a linear combination of U[0], U[3], U[4], U[6], and we store this result on U[4]. The Toffoli gate uses U[5] and U[4] as controls and targets T[0], which holds $t_2$ at this moment. Our basis elements remain the same except for U[4] and U[5], which now hold U[0]+U[3]+U[4]+U[6] and U[0]+U[3]+U[5]+U[6], respectively, for the next computation. We repeat this technique until $t_{45}$ is computed. Notice that we are able to reuse the qubit T[8].
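The in-place bookkeeping described for $t_2$ can be replayed with a small classical simulation of NOT/CNOT/Toffoli wires. The sketch below is plain Python rather than the ProjectQ code of Appendix A; the helper names (`cnot`, `toffoli`, `compute_t2`) and the order of the CNOT gates are assumptions made for illustration.

```python
from itertools import product


def cnot(state, control, target):
    """CNOT on classical basis states: target ^= control."""
    state[target] ^= state[control]


def toffoli(state, control_a, control_b, target):
    """Toffoli on classical basis states: target ^= control_a AND control_b."""
    state[target] ^= state[control_a] & state[control_b]


def compute_t2(u_bits):
    """Accumulate y12 on U[5] and y15 on U[4], then write t2 onto T[0]."""
    state = {f"U[{i}]": bit for i, bit in enumerate(u_bits)}
    state["T[0]"] = 0
    for source in ("U[0]", "U[3]", "U[6]"):
        cnot(state, source, "U[5]")  # U[5] becomes U[0]+U[3]+U[5]+U[6]
    for source in ("U[0]", "U[3]", "U[6]"):
        cnot(state, source, "U[4]")  # U[4] becomes U[0]+U[3]+U[4]+U[6]
    toffoli(state, "U[5]", "U[4]", "T[0]")
    return state


def check_all_inputs():
    """Compare the wire contents with y12, y15 and their product for all inputs."""
    for u in product((0, 1), repeat=8):
        state = compute_t2(u)
        y12 = u[0] ^ u[3] ^ u[5] ^ u[6]
        y15 = u[0] ^ u[3] ^ u[4] ^ u[6]
        if (state["U[5]"], state["U[4]"], state["T[0]"]) != (y12, y15, y12 & y15):
            return False
    return True


print(check_all_inputs())
```

No ancilla is spent on $y_{12}$ or $y_{15}$: the modified U wires still span the same space as the original input bits, so later linear combinations can be formed from them with further CNOT gates.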
The computations annotated by "for z16" to "for z14" are preparations for later use, because we do not want to uncompute Toffoli gates that can target output qubits directly. Some of the $z_i$ can be computed directly onto an output qubit and copied to other designated locations. For others, we compute them onto the ancilla Z[0], then copy the result to the needed output qubits before cleaning up Z[0] for reuse.
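The difference between the two ways of placing a product $z_i$ can be illustrated on classical bits. The Python sketch below is a toy model and not the circuit of Appendix A: the inputs `a` and `b` stand in for the two controls of a single Toffoli gate, and the assumption that the ancilla route is used when the output wires already hold partial sums is an illustrative reading of the text.

```python
def via_output(a, b):
    """Compute z = a AND b straight onto an empty output wire, then copy it."""
    s = [0, 0]
    toffolis = 0
    s[0] ^= a & b          # Toffoli targeting output wire S[0]
    toffolis += 1
    s[1] ^= s[0]           # CNOT copies z from S[0] to S[1]
    return s, toffolis


def via_ancilla(a, b, s):
    """Add z = a AND b to output wires via the ancilla Z[0], then clean Z[0]."""
    z = 0
    toffolis = 0
    z ^= a & b             # Toffoli: compute z onto Z[0]
    toffolis += 1
    s = [bit ^ z for bit in s]  # CNOT gates: add z to each output wire
    z ^= a & b             # the same Toffoli again returns Z[0] to zero
    toffolis += 1
    return s, z, toffolis


print(via_output(1, 1))
print(via_ancilla(1, 1, [1, 0]))
```

In this toy model the direct route needs one Toffoli gate, while the ancilla route needs two, which is why targeting output qubits directly is preferred whenever it is possible.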

Quantum resource estimates for AES-{128, 192, 256}
Aside from the reduced S-box above, we offer a reduction for AES in terms of circuit depth as well as number of qubits over prior work in [8,1]. This saving is due to a new cost-saving design in the architecture of the key expansion along with the reduced qubit requirements of our S-box. For our round generation, we adopt the "zig-zag" method from [8]. This is kept identical in AES-128 and AES-256, and a minimal change is made at the very end for AES-192. Expanding on ideas in [1], we recognize that by storing all $k_{4n+3}$ for AES-128 and AES-256 and $k_{4n+5}$ for AES-192, where $n$ represents the round number, not only could we use a combination to construct future keys, but we also gain the ability to remove keys once they are no longer used in future constructions. Since there is a direct correlation between T-depth and "S-box depth" (as the S-box is the only use of T-gates in our construction), we stop at S-box depth; the S-box can be replaced if a different design is preferred. Note that the S-box design proposed in this paper uses 8 fewer qubits than [8] and 24 fewer qubits than [1], which allows for greater parallelization and thus a generally reduced "S-box depth."

Savings in AES-128
As part of the key expansion, various keywords $k_i$ are computed. Each $k_{4n}$ requires the use of four S-boxes and an XOR of previous keys. After Round 3 (after $k_{12}$), all keys have a similar structure, as shown in Fig. 3. In our design, we store $k_{4n+3}$ once round $n$ has been fully computed. To save depth, each keyword is constructed at the same time as the round it is used in, except for Round 1. This is because the plaintext and cipher key ($k_0$, $k_1$, $k_2$, and $k_3$) are XORed together to produce Round 0. However, both are required to construct Round 1, so Round 1 and $k_4$ must be constructed sequentially. For the remaining rounds, the parallelization greatly reduces depth. For example, during Round 2, all S-box computations for $k_8$ as well as Round 2 can be computed with an S-box depth of one, using 320 auxiliary qubits. Once these S-boxes and MixColumns are computed, $k_8$ can be XORed onto Round 1, followed by the construction of $k_9$, $k_{10}$, and $k_{11}$, each being XORed onto the Round 1 construction. Thus, Round 2 is fully computed and $k_{11}$ is stored, and this entire construction takes a total S-box depth of one. When 320 auxiliary qubits are not available, not all S-box computations can be done in parallel and the depth must be increased (up to 7). Round 1 (without $k_4$), Round 2, removing Round 1, and Round 5 are all computed with an S-box depth of one.
By storing $k_{4n+3}$ for each $n \leq 7$, once we get to Round 7 and store keyword $k_{31}$, we can remove keywords $k_{15}$, $k_{11}$, and $k_7$ from Rounds 3, 2, and 1, thus gaining storage space to place the keywords for Rounds 8, 9, and 10. This removal is done using S-boxes in reverse after the keys are returned to their $k_{4n}$ values. This is equivalent to the "zig-zag" method used in [8] to remove rounds, but here we use it to remove keys. This saves 96 qubits over [8] and 64 qubits over [1], who only removed one keyword. Since each keyword uses four S-boxes, the removal requires 12 additional S-boxes in exchange for a substantial saving in qubits. The removal of keyword $k_{15}$ can be done during the removal of Round 5 without additional depth. Similarly, the removal of keyword $k_7$ can be done during the construction of Round 8 without additional depth. Thus, the S-box depth for the key expansion is two, which includes computing $k_4$ and removing $k_{11}$, both with an S-box depth of one. If in the future it is found that the saving in qubits is not worth the additional gates and a depth of three S-boxes, this can be ignored and extra qubits can be used. The total depth of this circuit comprises 47 S-boxes, 15 MixColumns computations with a depth of 39 each, and a depth of 142 to apply AddRoundKey to each round.
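The S-box figures for AES-128 can be cross-checked against the per-S-box costs stated earlier. The Python sketch below uses the AES-128 totals quoted in the gate-count discussion of the key search (16,940 Toffoli gates, Toffoli depth 1,880) and assumes, as stated in the text, that the S-box is the only source of Toffoli gates. The variable names are illustrative.

```python
SBOX_TOFFOLI = 55          # Toffoli gates per S-box, including clean-up
SBOX_TOFFOLI_DEPTH = 40    # Toffoli depth per S-box
AES128_SBOX_DEPTH = 47     # S-boxes on the critical path of AES-128
AES128_TOFFOLI = 16_940    # Toffoli count quoted for one AES-128 instance

# Toffoli depth of the whole circuit: S-box depth times depth per S-box.
toffoli_depth = AES128_SBOX_DEPTH * SBOX_TOFFOLI_DEPTH

# Number of S-box evaluations implied by the total Toffoli count.
sbox_evaluations, remainder = divmod(AES128_TOFFOLI, SBOX_TOFFOLI)

print("Toffoli depth:", toffoli_depth)
print("S-box evaluations:", sbox_evaluations, "remainder:", remainder)
```

The product reproduces the quoted Toffoli depth of 1,880, and the total Toffoli count is an exact multiple of 55.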

Savings in AES-192
AES-192 differs slightly in the key generation. Recall that AES-192 only uses an S-box for every sixth key, and since only four keys are needed per round, some rounds need only 16 S-boxes to be fully computed and some need 16 plus the additional 4 for the key generation. So even though there are more rounds than in AES-128, there are fewer keywords generated. By the time keyword $k_{48}$ needs to be computed, $k_{11}$ and smaller keys are no longer needed; thus $k_{11}$ can be reversed to k and then removed using an inverted S-box, saving 32 qubits for an additional 4 S-boxes. Also, the "zig-zag" method used in [8] used the same number of qubits for AES-256 as it did for AES-192. This means there is room for additional rounds or space savings. While we did not reduce the number of qubits for the round generation, we were able to use some of this additional space for the key expansion. Instead of placing Round 12 on the remaining 128 qubits, we can reverse part of Round 10 and reuse those qubits to store part of Round 12, thus gaining enough qubits to store round keys. Thus, when keyword $k_{42}$ is generated, it is generated below where Round 11 is stored, saving another 32 qubits for the cost of another 4 inverted S-boxes. Overall, we were able to save 64 qubits over the results in [8]. The total depth of this circuit comprises 41 S-boxes, 18 MixColumns, and a depth of 208 to apply AddRoundKey to each round.

Savings in AES-256
The methods for AES-128 in Section 4.1 above apply equally here, but since the key constructions in AES-256 require more previous keys, the removal of keys is not as simple. However, after Round 11 is computed and key $k_{47}$ is constructed, the keys for Rounds 3 and 2 ($k_{15}$ and $k_{11}$) can be removed in the same fashion as above, and the keys for Rounds 12 and 13 ($k_{51}$ and $k_{55}$) can be stored in their place. Also, after Round 13, key $k_{23}$ can be removed and replaced with key material for Round 14 ($k_{59}$). This is a total saving of 96 qubits for the increased cost of 12 S-boxes with a total additional depth of 3 S-boxes. The total depth of this circuit comprises 54 S-boxes, 22 MixColumns, and a depth of 267 to apply AddRoundKey to each round.
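The qubit savings quoted for the three key lengths follow one pattern: each 32-bit keyword that is removed (or whose fresh storage is avoided) frees 32 qubits and costs four additional S-box evaluations. A short Python sketch of this bookkeeping follows; the dictionary `SAVING_STEPS` simply counts the 32-qubit saving steps described in the text, and the names are illustrative.

```python
QUBITS_PER_KEYWORD = 32   # one 32-bit keyword
SBOXES_PER_KEYWORD = 4    # four S-boxes are used per keyword

# Number of 32-qubit saving steps described for each key length.
SAVING_STEPS = {"AES-128": 3, "AES-192": 2, "AES-256": 3}


def removal_tradeoff(steps: int):
    """Return (qubits saved, additional S-boxes) for the given number of steps."""
    return steps * QUBITS_PER_KEYWORD, steps * SBOXES_PER_KEYWORD


for variant, steps in SAVING_STEPS.items():
    saved, extra_sboxes = removal_tradeoff(steps)
    print(f"{variant}: {saved} qubits saved for {extra_sboxes} extra S-boxes")
```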
This method of computing the keywords during the round generation and only storing $k_{4n+3}$ for AES-128 and AES-256, and $k_{4n+5}$ for AES-192, means that the keys between $k_{4n}$ and the stored key need to be computed several times; however, this method is comparable to other methods of producing the additional keys directly in the rounds.
Table 1 summarizes the resources needed to implement AES with the approach suggested here. For comparison, in Table 2 we also recall resource counts for designs proposed in [8,1]. Comparing Table 2 with Table 1, we see that the revised S-box design in combination with the changes to handling the key expansion enables attractive resource savings.

Exhaustive key search with Grover's algorithm
For our discussion of leveraging Grover's algorithm for an exhaustive key search, we follow the approach in [8], i.e., we assume a straightforward application of Grover's algorithm, using our AES design to implement the pertinent Grover operator. We leave it for future work to explore possible time-space trade-offs in the spirit of Kim et al.'s work [11]. For Grover's algorithm [9], we need a quantum circuit $U_f : |x\rangle|y\rangle \longmapsto |x\rangle|y \oplus f(x)\rangle$, where $x \in \{0, 1\}^k$ represents a candidate key, and $f(x) = 1$ if the key $x$ matches all given plaintext-ciphertext pairs, and $f(x) = 0$ otherwise. Following Amento-Adelmann et al. [2], we assume that $r_k = \lceil k/128 \rceil$ known plaintext-ciphertext pairs are sufficient to avoid false positives in an exhaustive key search for AES-$k$ ($k \in \{128, 192, 256\}$). Thus, taking into account "cleaning up" of wires, we need to implement $2 \cdot r_k$ AES instances. Here, we do not distinguish between implementing encryption or decryption, as the latter can be obtained by running encryption backwards, thereby not affecting the cost parameters we are looking at.
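The predicate $f$ evaluated by $U_f$ can be written down classically. The Python sketch below reads $r_k$ as $k/128$ rounded up to an integer, which is an assumption about the formula, and uses a hypothetical 8-bit toy cipher `toy_encrypt` in place of AES so that the whole key space can be enumerated. None of these names come from the paper.

```python
def pairs_needed(k: int) -> int:
    """r_k, read here as k/128 rounded up (an assumption about the formula)."""
    return -(-k // 128)


def toy_encrypt(key: int, plaintext: int) -> int:
    """Hypothetical 8-bit stand-in for AES; NOT a real cipher."""
    return ((plaintext + key) * 3) % 256


def f(candidate_key: int, pairs) -> int:
    """Return 1 if the candidate key maps every plaintext to its ciphertext."""
    return int(all(toy_encrypt(candidate_key, p) == c for p, c in pairs))


secret_key = 0xA7
pairs = [(p, toy_encrypt(secret_key, p)) for p in (0x00, 0x5C)]

# Classical exhaustive search over the toy key space; Grover's algorithm
# searches the same space using about sqrt(2^k) evaluations of U_f.
matches = [x for x in range(256) if f(x, pairs)]
print([hex(x) for x in matches])
print({k: pairs_needed(k) for k in (128, 192, 256)})
```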
Remark 5.1. The above choices for $r_{128}$, $r_{192}$, and $r_{256}$ are smaller than the ones used by Grassl et al. [8], but Amento-Adelmann et al.'s [2] reasoning could be applied to argue for a smaller number of AES instances in [8], too. We are not offering anything novel here; our contribution only affects the quantum circuit for AES.

Number of qubits
As noted in [2], multiple plaintext-ciphertext pairs can be tested sequentially or in parallel, trading gates and circuit depth for the number of qubits. Prioritizing a smaller T-depth, here we choose the parallel option, as in [8], which leads to a total qubit count of $r_k \cdot q_k + 1$, where $q_k$ is the number of qubits needed to implement AES-$k$ according to Table 1.
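The qubit formula for the parallel option is easy to evaluate once $q_k$ is known. The Python sketch below takes $q_k$ as an argument because the Table 1 values are not reproduced in this text. The value 100 used in the example is a placeholder and not a figure from the paper, and $r_k$ is again read as $k/128$ rounded up (an assumption).

```python
def pairs_needed(k: int) -> int:
    """r_k, read here as k/128 rounded up (an assumption about the formula)."""
    return -(-k // 128)


def total_qubits(k: int, q_k: int) -> int:
    """Qubits for r_k parallel AES-k instances plus one target qubit for f(x)."""
    return pairs_needed(k) * q_k + 1


# Placeholder q_k = 100; substitute the Table 1 value for the key length used.
for k in (128, 192, 256):
    print(k, total_qubits(k, q_k=100))
```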

Gate counts
Operator $U_f$. Inside the operator $U_f$, we need to compare the 128-bit outputs of the AES instances with the $r_k$ given ciphertexts. For this, we can use a $128 \cdot r_k$-controlled NOT gate (plus some NOT gates, which depend on the given ciphertext(s) and which we neglect). We also budget $2 \cdot (r_k - 1) \cdot k$ CNOT gates to make the input key available to all $r_k$ parallel AES instances (and to uncompute this operation at the end). And, of course, we need to implement the actual AES instances. From Table 1, we obtain the following resource estimates:
- AES-128: Two AES instances require $2 \cdot 16{,}940 = 33{,}880$ Toffoli gates with a Toffoli depth of $2 \cdot 1{,}880 = 3{,}760$.
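The budget for $U_f$ can be collected in one helper. The Python sketch below reproduces the AES-128 figures quoted above. Two points are assumptions rather than statements from the text: that $2 \cdot r_k$ AES instances are needed in general (compute and uncompute for each pair), and that parallel instances leave the Toffoli depth at twice that of one instance. The function name and dictionary keys are illustrative.

```python
def oracle_cost(k: int, aes_toffoli: int, aes_toffoli_depth: int) -> dict:
    """Rough U_f budget for AES-k, given per-instance Toffoli count and depth."""
    r_k = -(-k // 128)              # pairs tested in parallel (assumed ceiling)
    instances = 2 * r_k             # assumed: each instance computed and uncomputed
    return {
        "aes_instances": instances,
        "toffoli": instances * aes_toffoli,
        "toffoli_depth": 2 * aes_toffoli_depth,   # assumed: instances run in parallel
        "key_fanout_cnot": 2 * (r_k - 1) * k,     # copy the key, undo at the end
        "comparison_controls": 128 * r_k,         # controls of the final NOT gate
    }


aes128 = oracle_cost(128, aes_toffoli=16_940, aes_toffoli_depth=1_880)
print(aes128)
```

For AES-128 the key fan-out term vanishes because only one plaintext-ciphertext pair is tested.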
In addition, we need 2 [...]
Grover operator. Grover's algorithm repeatedly applies the operator
$$G = U_f \left( \left( H^{\otimes k} \left( 2\,|0\rangle\langle 0| - \mathbf{1}_{2^k} \right) H^{\otimes k} \right) \otimes \mathbf{1}_2 \right),$$
where $|0\rangle$ is the all-zero basis state of appropriate size. So in addition to $U_f$, further gates are needed. In this paper we do not offer any improvements to those parts of the algorithm. Following [8], for the operator $2\,|0\rangle\langle 0| - \mathbf{1}_{2^k}$, we budget a $k$-fold controlled NOT gate. With $\frac{\pi}{4} \cdot 2^{k/2}$ Grover iterations being used for AES-$k$, we can now give estimates in the Clifford+T model and compare our results with prior work.
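To get a feeling for the iteration count $\frac{\pi}{4} \cdot 2^{k/2}$, it is convenient to work with base-2 logarithms. The Python sketch below uses floating-point arithmetic, so the printed iteration counts are approximations. Multiplying a per-iteration cost by this count to obtain a total is the straightforward accounting assumed here. The function names are illustrative.

```python
import math


def grover_iterations_log2(k: int) -> float:
    """log2 of (pi/4) * 2^(k/2), the number of Grover iterations for AES-k."""
    return math.log2(math.pi / 4) + k / 2


def total_cost_log2(k: int, cost_per_iteration: int) -> float:
    """log2 of the total cost when each iteration costs `cost_per_iteration`."""
    return grover_iterations_log2(k) + math.log2(cost_per_iteration)


for k in (128, 192, 256):
    log_iterations = grover_iterations_log2(k)
    print(f"AES-{k}: about 2^{log_iterations:.2f} = {2.0 ** log_iterations:.3e} iterations")
```

The factor $\pi/4$ lowers the exponent by roughly 0.35 bits relative to $2^{k/2}$.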
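The two operations that define SubByte:
1. Compute the multiplicative inverse $b^{-1}$ of $b$ in $\mathbb{F}_{2^8}$, and replace $b$ with the bitstring $b'$ corresponding to $b^{-1}$. For $b = 0$, set $b' = b$.
2. Apply an affine transformation, which consists of multiplication by an invertible matrix followed by the addition of a vector.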

Fig. 2. Circuit diagram for implementing SubByte with 32 qubits and 55 Toffoli gates; the input value $x$ is stored on the top-most eight wires; the output $S(x)$ of SubByte is stored on the last 8 wires.

Fig. 3. The keys required to construct each key in AES-128. The leftmost column requires four S-boxes, while the rightmost column is what is stored at the end of each round.

Fig. 4. AES-128: diagram of the Round 7 computations, which have an S-box depth of seven. Each column represents an S-box depth of one.

Fig. 5. AES-256: circuit diagram showing when keys are constructed and the S-box depth of each round computation.
Almazrooie et al. [1] improve on Grassl et al.'s Toffoli count, using the same number of $\mathbb{F}_{2^8}$-multiplications as Kim et al. Exploiting again the algebraic structure of the S-box, Almazrooie et al. identify a quantum circuit with 56 qubits, 448 Toffoli gates, 494 CNOT gates, and 4 NOT gates.

Table 1. Quantum resources to implement AES with the approach suggested here.

Table 2. Resource estimates for AES using designs from prior literature.

Table 3. Revised resource estimates in the Clifford+T model for a Grover-based key search for AES-k.