Efficient Construction of a Control Modular Adder on a Carry-Lookahead Adder Using Relative-phase Toffoli Gates

Control modular addition is a core arithmetic function, and we must consider the computational cost for actual quantum computers to realize efficient implementation. To achieve a low computational cost in a control modular adder, we focus on minimizing KQ, defined by the product of the number of qubits and the depth of the circuit. In this paper, we construct an efficient control modular adder with small KQ by using relative-phase Toffoli gates in two major types of quantum computers: Fault-Tolerant Quantum Computers (FTQ) on the Logical layer and Noisy Intermediate-Scale Quantum Computers (NISQ). We give a more efficient construction compared to Van Meter and Itoh's, based on a carry-lookahead adder. In FTQ, $T$ gates incur heavy cost due to distillation, which fabricates ancilla for running $T$ gates with high accuracy but consumes a lot of specially prepared ancilla qubits and a lot of time. Thus, we must reduce the number of $T$ gates. We propose a new control modular adder that uses only 20% of the number of $T$ gates of the original. Moreover, when we take distillation into consideration, we find that we minimize $\text{KQ}_{T}$ (the product of the number of qubits and $T$-depth) by running $\Theta\left(n / \sqrt{\log n} \right)$ $T$ gates simultaneously. In NISQ, CNOT gates are the major error source. We propose a new control modular adder that uses only 35% of the number of CNOT gates of the original. Moreover, we show that the $\text{KQ}_{\text{CX}}$ (the product of the number of qubits and CNOT-depth) of our circuit is 38% of the original. Thus, we realize an efficient control modular adder, improving prospects for the efficient execution of arithmetic in quantum computers.

Many researchers have constructed simple quantum circuits for NISQ machines. Researchers at IBM implemented the first 15 = 3×5 factoring circuit on a liquid NMR machine in 2001 [9]. Since then, researchers have implemented Shor's algorithm on a variety of machines [10]-[15], though care must be taken not to extrapolate too far from these demonstrations [16]. Researchers have also demonstrated small instances of Grover's algorithm [17], [18]. However, we cannot realize large-scale quantum computation on NISQ, due to the high error rate. These errors propagate as the calculation proceeds, and we cannot extract the correct result. Thus, we must reduce the error rate in quantum computers. To realize computation with high accuracy, research on Fault-Tolerant Quantum Computers (FTQ) is proceeding [19]-[21].
Jones et al. [22] proposed a method for constructing FTQ as a layered architecture. Specifically, we conduct the accurate computation on the Logical layer, which is achieved using large numbers of physical qubits with errors.
However, T gates impose an additional cost when run on FTQ. By the Gottesman-Knill theorem [23], we can conduct classical simulation of quantum circuits composed only of Clifford gates, but to realize universal quantum computation, we require non-Clifford gates such as a T gate, taking us into a realm that cannot be simulated classically. We achieve high-fidelity T gates by incorporating distillation [24], which requires many logical qubits and a lot of time; research on optimization of distillation is being carried out [25]. In FTQ, we may realize large-scale quantum algorithms such as Shor's algorithm [26] and Grover's algorithm [27]. Shor's algorithm is of particular interest if it can be implemented effectively, because it solves the factorization problem or the discrete logarithm problem in polynomial time, breaking the security of current cryptosystems, such as RSA [28] or elliptic curve [29], [30], whose security is based on the factorization problem or the discrete logarithm problem, respectively.
In Shor's algorithm, the control modular exponentiation step dominates the total cost, leading many researchers to study its construction [31]-[41]. The control modular exponentiation calculates
$$|j\rangle |1\rangle \rightarrow |j\rangle |a^{j} \bmod N\rangle. \tag{1}$$
In eq. (1), a, N and j are non-negative integers satisfying a < N. One strategy realizes a control modular exponentiation by the repeated calling of control modular additions. More precisely, this construction is realized by the following two steps:
1) Decomposition of a control modular exponentiation into control modular multipliers
2) Decomposition of a control modular multiplier into control modular adders
The first decomposition is based on the following equation, where j can be expressed in n bits, namely $j = (j_{n-1} \ldots j_0)_2$:
$$a^{j} = a^{\sum_{l=0}^{n-1} 2^{l} j_l} = \prod_{l=0}^{n-1} a^{2^{l} j_l}. \tag{2}$$
In eq. (2), $a^{j}$ is decomposed into n multiplications, namely by $a^{2^{l} j_l}$. For example, the exponentiation $2^5$ is decomposed into $2^5 = 2^{101_2} = 2^4 \times 2^1$.
FIGURE 1: Overview of a control modular adder. The first register has a single qubit which is used as a control bit. The second register has n qubits which are used to store the result. a and N are n-bit classical numbers.
From the above discussion, a control modular exponentiation is decomposed into control modular adders. Thus, if we reduce the cost of a control modular adder, the total cost of Shor's algorithm will shrink. In this paper, we focus on the efficient construction of a control modular addition.
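The two-step decomposition above can be illustrated with a purely classical sketch. The function names are hypothetical, and the way the multiplier is built from additions (one addition of $a \cdot 2^{i} \bmod N$ per bit of the multiplicand) is the standard shift-and-add method, assumed here for illustration rather than taken from this paper.

```python
def controlled_mod_add(x, b, a, N):
    """Classical model of |x>|b> -> |x>|b + x*a mod N> for a control bit x."""
    return (b + x * a) % N


def controlled_mod_mul(ctrl, y, a, N, n):
    """Return y * a^ctrl mod N, built from n controlled modular additions.

    Assumes 0 <= y < N < 2**n (shift-and-add; an illustrative assumption).
    """
    if ctrl == 0:
        return y
    acc = 0
    for i in range(n):
        y_i = (y >> i) & 1                      # i-th bit of the multiplicand
        acc = controlled_mod_add(y_i, acc, (a << i) % N, N)
    return acc


def mod_exp(a, j, N, n):
    """Return a^j mod N as a product of factors a^(2^l * j_l), as in eq. (2)."""
    result = 1
    factor = a % N                              # holds a^(2^l) mod N
    for l in range(n):
        j_l = (j >> l) & 1                      # l-th bit of the exponent
        result = controlled_mod_mul(j_l, result, factor, N, n)
        factor = (factor * factor) % N
    return result
```

For instance, `mod_exp(2, 5, 15, 4)` multiplies by $2^1$ and $2^4$ only, mirroring the example $2^5 = 2^4 \times 2^1$.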

A. BACKGROUND
A control modular addition is defined by a control qubit x and n-bit numbers a, b, and N. a and b satisfy 0 ≤ a, b ≤ N − 1, and a and N are classical numbers. A control modular addition calculates $|x\rangle |b\rangle \rightarrow |x\rangle |b + xa \bmod N\rangle$.
An overview is shown in Figure 1. However, the optimal construction of a control modular adder is not obvious. A control modular adder is constructed from simple adders [32], [33], [35], [38]-[41], and there are many kinds of adders [39], [40], [42]-[46]. Previous constructions follow a similar overall structure, but differ in detail. We need to determine which combination is the best.
This paper focuses on minimizing KQ [47] to construct a circuit with low execution cost. KQ is defined by the product of the number of qubits and the depth of the circuit. Minimizing KQ benefits both FTQ [22] and NISQ [18]. Much previous research focuses only on depth or the number of qubits, but reducing only one metric improves only one aspect of performance. We believe KQ more accurately represents the total resource consumption, especially for deep circuits, capturing the total "qubit-time steps" of a circuit.
FIGURE 2: Optimization of a control modular adder. In first-level optimization, we optimize the construction of a control modular adder. In second-level optimization, we minimize KQ for FTQ or NISQ by using relative-phase Toffoli gates.
This paper proposes our new circuit based on Van Meter and Itoh's construction [39], which uses three of Draper et al.'s carry-lookahead adders [43]. Van Meter and Itoh's construction has a smaller KQ value than the other constructions but has room for further minimization of KQ. For example, Thapliyal et al. [48] proposed a means of minimizing the number of T gates in a carry-lookahead adder by using Gidney's relative-phase Toffoli gates [45]. Thapliyal et al.'s construction reduces KQ in FTQ, but a similar optimization can be applied to NISQ by using Maslov's relative-phase Toffoli gates [49]. Thus, we reduce KQ by applying relative-phase Toffoli gates on the Van Meter-Itoh construction.

B. OUR CONTRIBUTION
In this study, we propose a method for optimizing a control modular adder based on a carry-lookahead adder for both FTQ and NISQ. We apply two-level optimization on the original Van Meter-Itoh construction [39] as in Figure 2.
In first-level optimization, we optimize the construction of a control modular adder (Section III). Specifically, we optimize by focusing on the efficiency of the comparator in a carry-lookahead adder and reduce some control operations by taking advantage of the classicality of a and N .
In second-level optimization, we minimize KQ for FTQ or NISQ by using relative-phase Toffoli gates (Section IV). In this study, we assumed all qubits are connected, without considering the physical or logical topology [33], [50], [51]. We assume full connectivity because some current NISQ machines, such as IonQ [7] and Honeywell [8], realize full connectivity. Future work must consider the problem of mapping circuits to other NISQ machines and to FTQ machines.
First, we clarify the definition of KQ in each device, because the cost of gates is different in FTQ and NISQ. For many implementations, the most expensive gates are T gates [22] and CNOT gates [3], [4], respectively. We define $\text{KQ}_{T}$ for use with FTQ and $\text{KQ}_{\text{CX}}$ on NISQ as the product of the number of qubits and T-depth or CNOT-depth, respectively. Next, we use Gidney's relative-phase Toffoli gates [45] in FTQ circuits and Maslov's relative-phase Toffoli gates [49] in NISQ circuits, instead of the standard Toffoli gates. However, the construction for FTQ does not consider the cost of distillation, and there is a trade-off between T-depth and the number of T gates running simultaneously. We show that we achieve the smallest $\text{KQ}_{T}$ when we run $\Theta\left(n / \sqrt{\log n}\right)$ T gates simultaneously.

II. PRELIMINARIES
In this paper, we optimize a carry-lookahead adder by replacing Toffoli gates with relative-phase Toffoli gates. To maintain an accurate calculation, we must carefully consider the role of each Toffoli gate. Moreover, we reduce computational costs by decomposing Toffoli gates into single-qubit gates and CNOT gates.
In subsection A, we explain the quantum gate set used in this paper. Next, to clarify the role of Toffoli gates, we review Draper et al.'s carry-lookahead adder [43] briefly in subsection B. We explain T-minimization [48] by using Gidney's relative-phase Toffoli gates [45] in subsection C. We review the general construction of a control modular adder in subsection D.

A. QUANTUM GATE SET
In this paper, we use the following:
• Clifford gates, including the CNOT gate
• non-Clifford gates: T gate
The CNOT gate is a two-qubit gate, and the others are one-qubit gates. We express X gates as ⊕ in the circuit.
In this paper, we focus on two gates: T and CNOT.

B. DRAPER ET AL.'S CARRY-LOOKAHEAD ADDER
In this subsection, we use the same notation as Draper et al.'s paper [43]. First, we explain the calculation of a + b when a and b are n-bit numbers. We express a as $a_{n-1} a_{n-2} \ldots a_0$ and b as $b_{n-1} b_{n-2} \ldots b_0$, where $a_i$ and $b_i$ are 0 or 1. To calculate a + b, we employ a carry $c_i$. Carry $c_i$ is defined as an overflow from the (i − 1)-th bit to the i-th bit. In more detail, we define $c_i$ as
$$c_0 = 0, \qquad c_i = a_{i-1} b_{i-1} \oplus a_{i-1} c_{i-1} \oplus b_{i-1} c_{i-1} \quad (1 \le i \le n).$$
Then, $(a + b)_i$, the i-th bit of a + b, is calculated as
$$(a+b)_i = a_i \oplus b_i \oplus c_i \quad (0 \le i \le n-1), \qquad (a+b)_n = c_n.$$
Thus, we need carries to calculate an addition.
VOLUME 4, 2016
Now, we give a brief explanation of a carry-lookahead adder. Before calculating an addition, we determine the propagation of a carry from the i-th bit to the j-th bit as a function of the following three conditions:
• propagate: A carry is propagated from the i-th bit to the j-th bit. Namely, $c_j = c_i$.
• generate: A carry is generated in the j-th bit, namely $c_j = 1$, regardless of the value of $c_i$.
• kill: A carry is killed in the j-th bit, namely $c_j = 0$, regardless of the value of $c_i$.
To calculate the propagation, we define two functions, p[i, j] and g[i, j]. p[i, j] is true when the carry from the i-th bit to the j-th bit should be propagated. Similarly, g[i, j] is true when the carry out at the j-th bit is true independent of the value of the carry in at the i-th bit. For adjacent bits, $p[i, i+1] = a_i \oplus b_i$ and $g[i, i+1] = a_i \wedge b_i$. We do not need a separate function for kill, as its value can be inferred from p and g. By using these functions, we can calculate the propagation state over a wider span. Specifically, when i < k < j,
$$p[i, j] = p[i, k] \wedge p[k, j], \tag{9}$$
$$g[i, j] = g[k, j] \oplus \left( g[i, k] \wedge p[k, j] \right), \tag{10}$$
where ∧ is Boolean AND, and ⊕ is Boolean XOR. By using these properties, we calculate $c_j = g[0, j]$. Now, we explain Draper et al.'s carry-lookahead adder for $|a\rangle |b\rangle \rightarrow |a\rangle |b + a\rangle$. This requires an additional n qubits for the carry register $|c\rangle$ and n qubits for register $|p\rangle$, containing p[i, j]. Thus, a carry-lookahead adder requires 4n qubits. Now, we explain the implementation briefly. This implementation consists of five phases: Initialization, P-rounds, G-rounds, C-rounds, and inverse P-rounds.
In each round,
• Initialization: we calculate g[i, i + 1] in $|c_{i+1}\rangle$ and p[i, i + 1] in $|b_i\rangle$,
• P-rounds: we calculate the p-function and write the result in $|p\rangle$,
• G-rounds: we calculate $|c_{2^k}\rangle$ (k ∈ N) by calculating some g-functions,
• C-rounds: we calculate all carries $|c\rangle$ by calculating some g-functions,
and we clean $|p\rangle$ in inverse P-rounds. After inverse P-rounds, we calculate each bit of a + b by using these carries $|c\rangle$. In this calculation, we run P-rounds and G-rounds simultaneously, and we run C-rounds and inverse P-rounds simultaneously. However, the values of the carries remain on $|c\rangle$. Thus, we must clean $|c\rangle$ to $|0\rangle$ except for $c_n$. Draper et al. found that the values of the carries $c_i$ except $c_n$ in a + b are the same as in $a + (2^n - 1 - a - b)$. Therefore, we erase carries by performing the addition $a + (2^n - 1 - a - b)$ on the lower n − 1 qubits. The block level circuit is shown in Figure 3.
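As a classical illustration of the propagate and generate functions, the following sketch computes every carry as g[0, j] by recursively splitting a span at a midpoint k, then forms the sum bits. It models only the Boolean logic; it does not reproduce the round structure or the qubit layout of Draper et al.'s circuit, and the function names are hypothetical.

```python
def pg(a_bits, b_bits, i, j):
    """Return (p[i, j], g[i, j]) by combining two sub-spans [i, k] and [k, j]."""
    if j == i + 1:
        # Adjacent bits: propagate is XOR, generate is AND.
        return a_bits[i] ^ b_bits[i], a_bits[i] & b_bits[i]
    k = (i + j) // 2
    p_ik, g_ik = pg(a_bits, b_bits, i, k)
    p_kj, g_kj = pg(a_bits, b_bits, k, j)
    # p[i,j] = p[i,k] AND p[k,j];  g[i,j] = g[k,j] XOR (g[i,k] AND p[k,j])
    return p_ik & p_kj, g_kj ^ (g_ik & p_kj)


def add_with_lookahead(a, b, n):
    """Add two n-bit integers using carries c_j = g[0, j]."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    carries = [0] + [pg(a_bits, b_bits, 0, j)[1] for j in range(1, n + 1)]
    low = sum((a_bits[i] ^ b_bits[i] ^ carries[i]) << i for i in range(n))
    return low + (carries[n] << n)  # the top bit of the sum is c_n
```

Because each span is split roughly in half, a single g[0, j] needs only logarithmically many combining levels, which is the source of the O(log n) depth of the adder.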
As noted above, a carry-lookahead adder is mainly constructed by a calculation on p and g. We calculate p and g with eq. (9) or (10) respectively, and those are implemented by Toffoli gates as shown in Figure 4. The detailed explanation of Draper et al.'s adder, including which p-function or g-function we calculate, is given in Appendix A.
In this figure, we sort qubits from the lowest qubits to the highest qubits, which is different from Figure 1. $|c_i\rangle$ is given as $|0\rangle$ at the beginning of this circuit and these are cleared to $|0\rangle$ after Erasing Carry. The detailed circuit is shown in Appendix A.
Up to this point, we have explained the construction of an adder. Draper et al. also proposed other operations, such as a subtractor and a comparator, based on their adder. The number of gates and the depth in a subtractor are almost the same as those in an adder. In a comparator, the number of gates is 60% of an adder and the depth is 50% of an adder. Draper et al. implement a comparator using only Initialization, P-rounds, G-rounds, and their inverses. More precisely, Draper et al. regard a and b as $2^{\lceil \log n \rceil}$-bit numbers by padding 0 in higher bits, but we do not use these qubits. If we calculate p[i, j] or g[i, j] when i ≤ n − 1 and j ≥ n, we calculate p[i, n] or g[i, n] respectively. Then, we calculate g[0, n] after G-rounds.

C. T -COUNT MINIMIZATION OF A CARRY-LOOKAHEAD ADDER
Thapliyal et al. [48] proposed T-count minimization by using relative-phase Toffoli gates. The standard Toffoli gate (ST) [23] decomposition is given in Figure 5. However, we can calculate correctly even if we replace some Toffoli gates with Gidney's relative-phase Toffoli gate (GRT) or its inverse (IGRT) [45]. GRT is shown in Figure 6 together with its unitary matrix (11) in the computational basis, and we calculate correctly when the target bit is $|0\rangle$. IGRT is shown in Figure 7. In the carry-lookahead adder, as in many circuits, we must clean our ancilla qubits, returning them to a known, disentangled state, typically $|0\rangle$. In this case, we can reduce our cost by measuring the ancilla on IGRT. Gidney's paper [45] shows that using measurement saves 2 T gates. Using measurement is better because one accurate T gate requires many measurements.
FIGURE 5: The standard decomposition of a Toffoli gate [23]. We call this decomposition ST. The control bits are the first and second qubits, and the target bit is the third qubit. This calculation preserves the phase.
FIGURE 6: Gidney's relative-phase Toffoli gate [45] given by the unitary matrix (11). We call this decomposition GRT. The control bits are the first and second qubits, and the target bit is the third qubit. This calculation preserves the phase only when the target qubit is $|0\rangle$ on input.
FIGURE 7: Inverse of Gidney's relative-phase Toffoli gate [45]. We call this decomposition IGRT. This calculation preserves the phase when we input $|000\rangle$, $|010\rangle$, $|100\rangle$, or $|111\rangle$, which are outputs of GRT having valid phase. Control-Z is a Clifford gate, and we use no T gate.
FIGURE 8: Replacing Toffoli gates in G-rounds and C-rounds in a T-optimized carry-lookahead adder. We call this decomposition PGRT. We replace the first Toffoli gate with GRT and the second Toffoli gate with IGRT. The third qubit is an ancilla qubit. This qubit is measured as part of executing IGRT and will be $|0\rangle$ after running PGRT.
Thapliyal et al. proposed two constructions. The first construction replaced Toffoli gates in Initialization and P-rounds. The second construction replaced all Toffoli gates with GRT or IGRT, increasing the required number of ancilla qubits. Thapliyal et al. call this construction T-optimize. Specifically, we replace Toffoli gates in Initialization, P-rounds, and the inverse of these in the same way as in the first construction. We replace Toffoli gates in G-rounds and C-rounds by the pair of GRT and IGRT as in Figure 8. We call these gates PGRT, where P is the abbreviation of "pair". In this construction, Thapliyal et al. claim that the number of qubits is 6n and the number of T gates is 20n. However, we recalculated these results and our results differ from the results in [48]. In our result, the number of qubits is 4.5n and the number of T gates is 28n. The difference in the number of qubits arises from our method for preparing ancilla qubits. Thapliyal et al. prepare new ancilla qubits for G-rounds and C-rounds respectively, while they recycle ancilla qubits for P-rounds. We apply this recycling to G-rounds and C-rounds similarly.

D. THE GENERAL CONSTRUCTION OF A CONTROL MODULAR ADDER
In this subsection, we explain the calculation of $|x\rangle |b\rangle |0\rangle \rightarrow |x\rangle |b + ax \bmod N\rangle |0\rangle$.
The general construction of a control modular adder is shown in Figure 9. The first register has a single qubit which is used to hold the value of the control. We call this the CTRL qubit.
The second register has n qubits which are used to hold the result of a control modular addition. The third register has a single qubit which is used to hold the result of a comparison temporarily. We call this the COMP qubit. Specifically, we determine whether we subtract N or not based on COMP.
We conduct a comparator with one control qubit and an adder with two control qubits, and we write these as a C-comparator and a CC-adder, respectively.
To execute a control modular adder, we conduct operations in this order:
1) We compare the second register $|b\rangle$ and the classical value N − a. If b ≥ N − a, namely a + b ≥ N, we flip COMP.
2) If both CTRL and COMP are 1, we subtract N − a from the second register. If CTRL is 1 and COMP is 0, we add a. Otherwise, we add no value.
3) If the second register is strictly less than a, we flip COMP.
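The three steps can be checked with a small classical model. This is a sketch only: it assumes both comparisons are conditioned on CTRL (they are C-comparators), and the function name is hypothetical.

```python
def control_modular_add(ctrl, b, a, N):
    """Classical model of the three-step control modular adder.

    Returns (new_b, comp); comp must end at 0 so the COMP qubit is reusable.
    Assumes 0 <= a, b <= N - 1 and ctrl in {0, 1}.
    """
    comp = 0
    # Step 1 (C-comparator): flip COMP if CTRL is 1 and b >= N - a.
    if ctrl == 1 and b >= N - a:
        comp ^= 1
    # Step 2 (CC-adder): subtract N - a, add a, or add nothing.
    if ctrl == 1 and comp == 1:
        b -= N - a
    elif ctrl == 1 and comp == 0:
        b += a
    # Step 3 (C-comparator): flip COMP if CTRL is 1 and the result is < a.
    if ctrl == 1 and b < a:
        comp ^= 1
    return b, comp
```

Step 3 restores COMP because the result is smaller than a exactly when the reduction by N took place: a + b − N < a always holds for b < N, while a + b ≥ a when no reduction is needed.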

III. FIRST-LEVEL OPTIMIZATION: OUR CONSTRUCTION OF A CONTROL MODULAR ADDER
In this section, we explain first-level optimization on the original construction [39]. In the general construction, a comparator has about 1/2 the depth of a carry-lookahead adder. Thus, by constructing a control modular adder using the general construction, the depth is about the same as 2 adders, because a control modular adder is composed of two comparators and one adder. In the original construction, we use 3 adders. Thus, we use only 2/3 of the KQ of the original construction when comparing the two constructions of a control modular adder. The original construction applies two optimizations on repeating control modular adders. Our construction will be more efficient by adopting the same optimizations with some overhead, but the details, such as the amount of overhead, should be evaluated in future work. Based on the above discussion, we need to give the construction of
• C-comparator (subsection A)
• CC-adder (subsection B)
on a carry-lookahead adder. In this construction, we do not decompose Toffoli gates, because the decomposition of Toffoli gates is different in FTQ and NISQ. Thus, we leave Toffoli gates as they are, and we consider the decomposition of Toffoli gates in Section IV.
In our construction, we consider the classicality of a and N as described by Markov and Saeedi [36] to realize higher efficiency. Moreover, we give a construction of a C-comparator that is not given in the original paper. By doing these, we propose a circuit construction of a control modular adder. Based on Figure 9, we construct our circuit as shown in Figure 10. We add the second n-qubit ancilla register for embedding a value with CTRL. In addition to these registers, we use the carry register $|c\rangle$ with n qubits and the p-function register $|p\rangle$ with n qubits to realize the carry-lookahead adder, not represented in Figure 10. Thus, our control modular adder requires 4n + 2 qubits. The number of gates and the depth are given in Table 1, and the breakdown is given in Table 5 in Appendix B. Now, we explain the C-comparator and the CC-adder briefly.
FIGURE 10: Our construction of a control modular adder based on Figure 9. A CC-adder is constructed by embedding, an adder, and resetting. Then, we add the second register $|d\rangle$ as an n-qubit ancilla for embedding the value based on CTRL. The carry register $|c\rangle$ with n qubits and the p-function register $|p\rangle$ with n qubits are not represented in this figure for visibility. In a C-comparator, we do not use the second register. In total, our control modular adder requires 4n + 2 qubits.

A. CONSTRUCTION OF A C-COMPARATOR
In a C-comparator, only COMP is changed and the other qubits do not change. Thus, to implement a C-comparator, it is sufficient to add control operations only on the gates including COMP and leave the other gates unchanged.
In our construction of a control modular adder, we use two types of C-comparators. In the first C-comparator, we flip COMP if CTRL is 1 and b ≥ N − a. In the final C-comparator, we flip COMP if CTRL is 1 and b < a. In both cases, we judge whether b ≥ d or b < d with a classical value of d.
We construct these operations taking advantage of the classicality of d. The intuitive explanation of this operation is that we calculate $b + (2^n - d)$ and check whether there is an overflow in the n-th bit. Specifically,
$$b + (2^{n} - d) = 2^{n} + (b - d),$$
and there is an overflow when b ≥ d. This construction is similar to previous constructions by Markov and Saeedi [36], but slightly different from them because our construction does not require X gates on $|b\rangle$. The number of gates and the depth are given in Table 1. The detailed construction is given in Appendix B. The block level construction of our C-comparator is given in Figure 11, and the example circuits are shown in Figures 22 and 24.
FIGURE 11: Block-level view of our construction of a C-comparator. In this figure, we sort qubits from the low-order qubits to the high-order qubits, top to bottom. This circuit is symmetric about the Toffoli gate surrounded by a dotted box. $|c_i\rangle$ is given as $|0\rangle$ at the beginning of this circuit and these are cleared back to $|0\rangle$ after the computation. The example circuits are shown in Figures 22 and 24 in Appendix B.
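The overflow test can be sketched classically as follows. The function name is hypothetical, and the sketch models only the arithmetic identity, not the carry-lookahead circuit that evaluates it.

```python
def geq_by_overflow(b, d, n):
    """Return 1 if b >= d, read off as the overflow bit of b + (2^n - d).

    Assumes 0 <= b < 2**n and 0 <= d <= 2**n, so the sum fits in n + 1 bits.
    """
    total = b + ((1 << n) - d)
    return (total >> n) & 1  # bit n is set exactly when b - d >= 0
```

Since d is classical, $2^n - d$ is a known constant, so only the carry into the n-th bit has to be computed on the quantum register.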

B. CONSTRUCTION OF A CC-ADDER
In a CC-adder, we embed values before and after an adder, similar to a C-adder [40]. Based on this construction, we apply optimization by considering the classicality of a and N. From this point forward, we mainly focus on embedding on $|d\rangle$. In a CC-adder, we conduct the following:
• If CTRL is 1 and COMP is 1, we add a and subtract N. This operation can be realized by adding $2^n + a - N$ and disregarding the calculation of a carry $c_n$.
• If CTRL is 1 and COMP is 0, we add a.
• Otherwise, we add no value.
Thus, the embedding is conducted as in Figure 12. The resetting is conducted by inverting the embedding circuit.
After embedding, we apply a standard adder. Then, we conduct two optimizations as follows:
• disregarding gates including $|g[0, n]\rangle$.
• eliminating gates in Initialization where we know the control bit is 0.
The number of gates and the depth are given in Table 1. The detailed construction and example circuit of a CC-adder are given in Appendix B.
FIGURE 12: Block-level diagram of the embedding circuit. We omit $|b\rangle$ in Figure 10. We embed $2^n + a - N$ or a on $|d\rangle$ based on CTRL and COMP. The example circuit of the embedding is shown in Figure 23 in Appendix B.
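The embedding rule and the effect of disregarding $c_n$ can be illustrated with a short classical sketch. The function names are hypothetical; the adder is modeled as ordinary addition modulo $2^n$.

```python
def embed(ctrl, comp, a, N, n):
    """Value written to the ancilla register |d> before the adder runs."""
    if ctrl == 1 and comp == 1:
        return (1 << n) + a - N   # adding this mod 2^n equals adding a - N
    if ctrl == 1:
        return a
    return 0


def cc_add(ctrl, comp, b, a, N, n):
    """Add the embedded value to b, dropping the carry c_n (mod 2^n)."""
    d = embed(ctrl, comp, a, N, n)
    return (b + d) % (1 << n)
```

With COMP set to 1 exactly when a + b ≥ N, `cc_add` returns (a + b) mod N, because $b + 2^n + a - N$ and $a + b - N$ agree on the low n bits.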

IV. SECOND-LEVEL OPTIMIZATION: CONSTRUCTING A CONTROL MODULAR ADDER FOR FTQ AND NISQ DEVICES
In this section, we explain our second-level optimization. We evaluate the computational cost for both FTQ on the logical layer, and NISQ, focusing on the decomposition of Toffoli gates. We define KQ more specifically for FTQ and NISQ and minimize this value. For FTQ, we minimize the number of T gates by using Gidney's relative-phase Toffoli gates. However, this construction does not take into consideration the cost of distillation. We take into account the cost of distillation by finding the maximal number of T gates which should be run simultaneously, optimizing $\text{KQ}_{T}$. For NISQ, we apply Maslov's relative-phase Toffoli gates with a small number of CNOT gates [49] and minimize $\text{KQ}_{\text{CX}}$. By doing these, we propose a control modular adder that is more efficient than that of Van Meter and Itoh [39], called the original construction in this section. In the following discussion, we disregard the rounds with O(1) gates. In this section, we explain the optimization for FTQ in subsection A and the optimization for NISQ in subsection B.

A. COMPUTATIONAL COST ON THE FTQ LOGICAL LAYER
Next, we consider the optimal circuit for FTQ on the Logical layer, using Jones et al.'s architecture as a model [22]. This architecture, in common with other error-corrected architectures, provides a fundamental gate set consisting of X, Y, Z, CNOT, and H gates, and measurement; here, we ignore qubit movement in the surface code. To run an S gate, we prepare an ancilla qubit $|Y\rangle = (|0\rangle + i|1\rangle)/\sqrt{2}$ and run the circuit shown in Figure 13. An S† gate can be realized by the reverse circuit of Figure 13.
To achieve universal computation, we also need a non-Clifford gate; the choice of T is typical. To run a T gate, we prepare an ancilla qubit $|A\rangle = (|0\rangle + e^{i\pi/4}|1\rangle)/\sqrt{2}$ and run the circuit shown in Figure 14. To run a T† gate, we apply an S† gate instead of an S gate. To realize accurate T gates, we must prepare an accurate $|A\rangle$ state. Preparing $|A\rangle$ is done by distillation, as shown in Figure 25 in Appendix E. This distillation circuit requires 15 qubits and 6 time steps, even assuming all of the CNOT gates can be implemented concurrently, but this is difficult to realize. Distillation is an expensive operation, and its optimization is an ongoing topic of research [25]. Thus, a T gate is the greatest factor in the cost of an FTQ circuit, leading us to focus on reducing the number of T gates.
FIGURE 13: Running an S gate [22]. The second qubit is $|Y\rangle$. Assuming correct operation on top of error correction, this ancilla passes through the gate execution unmodified, allowing it to be reused.
FIGURE 14: Running a T gate [22]. The second qubit is $|A\rangle = (|0\rangle + e^{i\pi/4}|1\rangle)/\sqrt{2}$; the $|A\rangle$ state is consumed in the process, with the consequence that creation of high-fidelity $|A\rangle$ states is one factor limiting performance.
We now minimize the number of T gates on our control modular adder. We adopt the Thapliyal construction with minor modifications, namely the replacement with relative-phase Toffoli gates, except for G-rounds in a C-comparator. We employ GRT in G-rounds and IGRT in inverse G-rounds as in Figure 15. Our construction calculates correctly because the Toffoli gates in G-rounds and inverse G-rounds are symmetric about the Toffoli gate surrounded by a dotted box, as in the block level circuit of a C-comparator shown in Figure 11.
Our construction requires an additional n qubits to preserve the results of GRT in a C-comparator. Fortunately, we do not use the n qubits for $|d\rangle$ in Figure 10 in a C-comparator. Thus, we realize this construction without an overhead of qubits. We give example circuits in Figures 22 and 24 in Appendix B.
The computational cost of our control modular adder is shown in Table 2, and the breakdown of constructions based on our construction is given in Table 6 in Appendix D. From Table 2, our construction requires 43n T gates. We call this construction a T-optimal control modular adder. The original construction requires 30n Toffoli gates, which when implemented using ST (each requiring 7 T gates) gives 210n T gates in total. Thus, our construction requires only 43/210 ≈ 20% of the number of T gates of the original construction. Now, we focus on KQ of a T-optimal control modular adder. In this circuit, we use O(n) qubits and O(log n) depth, giving a KQ of O(n log n). However, we do not consider the computational costs for distillation in this calculation. We can trade space for time, with substantial flexibility, by allocating more qubits to ancilla "factories", corresponding to increasing the number of T gates that are in concurrent execution [21], [52].
FIGURE 15: Our construction of G-rounds and inverse G-rounds in a C-comparator. In Figure 8, we apply IGRT immediately after the first CNOT gate in G-rounds and inverse G-rounds. In our construction, we calculate the result of GRT in the third ancilla qubit and preserve this qubit until the corresponding Toffoli gate in inverse G-rounds. Then, we clear this ancilla qubit by IGRT.
To realize an efficient circuit, we should consider the tradeoff between the depth and the number of qubits allocated for distillation. For example, Kim et al. [53] showed that it is possible to run Shor's algorithm with as little as 2% of the qubits dedicated to distillation, but this construction runs only a single T gate at a time. Since the circuit still requires O(n) T gates, this construction is unable to run in depth O(log n) and is instead still constrained to O(n) depth. To realize smaller KQ, we must run many T gates in parallel. However, there is an upper bound on the number of T gates that can be usefully run in parallel, with the depth limited by the cascading reuse of the qubits. Paler and Basmadjian also consider this problem [54], and they have concluded that we must determine optimal scheduling methods for T gates. To realize an accurate estimate of the cost and to enable fair comparison with prior research, we must take into account the T gate costs, including the space for distillation [22], [52], [55], allowing a circuit to run at "the speed of data" [55].
However, it is difficult to calculate computational costs for distillation precisely, because the cost depends on many architecture-specific parameters. Instead of KQ, we define a new index $\text{KQ}_{T}$, defined as the product of the number of logical qubits and T-depth. We define $n_T$ as the T-width, the upper bound of the number of T gates running simultaneously. We assume that we require a constant $c_g$ logical qubits for the distillation step. By calculating the $n_T$ that minimizes $\text{KQ}_{T}$, we reduce the computational cost of our control modular adder.
In the above discussion, our control modular adder uses 4n + 2 qubits for calculation, as explained in Section III. In addition, we require ancilla qubits for running $n_T$ T gates. Specifically, to run one T gate, we require one qubit $|Y\rangle$ for running S gates and $c_g$ qubits for generating $|A\rangle$. Thus, when we run $n_T$ T gates simultaneously, we use the following qubits:
• $|y\rangle$ (contains $|Y\rangle$ states): $n_T$ qubits
• $|g\rangle$ (generates $|A\rangle$ states): $c_g n_T$ qubits.
The number of qubits in $|y\rangle$ is given as $n_T$, because we consume one S gate in each T gate. Then, the number of qubits is
$$4n + (c_g + 1) n_T + 2.$$
Now, we calculate T-depth. To calculate T-depth, we assume that we run GRT with the same timing, and each GRT has T-depth 2 from Figure 6. T-depth depends on $n_T$ as in Figure 16. Then, T-depth is
$$\frac{86n}{n_T} + 12 \log n_T - 12.$$
TABLE 2: T-count of our control modular adder and prior work. The latter four constructions are based on our construction proposed in Section III. The breakdown of the latter four constructions is shown in Table 6 in Appendix D.
The detailed calculation is given in Appendix C.

FIGURE 16: Calculating T-depth. Distill means distillation circuits. In the naive construction, we run as many T gates as possible. In our construction, we restrict the upper bound of the number of simultaneous T gates to $n_T$. When we reduce $n_T$, the total number of qubits is smaller and T-depth is larger.
Now, we minimize $\text{KQ}_{T}$ on $n_T$. $\text{KQ}_{T}$ is
$$\left(4n + (c_g + 1) n_T + 2\right)\left(\frac{86n}{n_T} + 12 \log n_T - 12\right). \tag{16}$$
We minimize this over $n_T > 0$. Letting the expression in Eq. (16) be $f(n_T)$, we see that $f''(n_T) > 0$ for $n_T > 0$. Thus, $f(n_T)$ is a convex function, and it is sufficient to search for a single optimal value of $n_T$. To leading order, the optimal value is
$$n_T \approx \sqrt{\frac{86}{3(c_g + 1)}} \cdot \frac{n}{\sqrt{\log n}}.$$
Therefore, the dominant term of $\text{KQ}_{T}$ is $48 n \log n$.
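As a rough numerical check of this minimization, the following Python sketch evaluates the $\text{KQ}_{T}$ expression directly and compares a brute-force minimizer with the leading-order estimate of the optimal $n_T$. The function names, the base-2 logarithm, and the value $c_g = 10$ are assumptions made only for illustration; the text does not fix $c_g$.

```python
import math


def t_depth(n, n_t):
    """T-depth 86n/n_T + 12 log n_T - 12 (base-2 logarithm is an assumption)."""
    return 86 * n / n_t + 12 * math.log2(n_t) - 12


def kq_t(n, n_t, c_g):
    """KQ_T = (number of logical qubits) x (T-depth)."""
    qubits = 4 * n + (c_g + 1) * n_t + 2
    return qubits * t_depth(n, n_t)


def estimate_n_t(n, c_g):
    """Leading-order optimum sqrt(86 / (3 (c_g + 1))) * n / sqrt(log n)."""
    return math.sqrt(86 / (3 * (c_g + 1))) * n / math.sqrt(math.log2(n))


def brute_force_n_t(n, c_g):
    """Scan integer T-widths 1..4n and return the one with the smallest KQ_T."""
    return min(range(1, 4 * n + 1), key=lambda n_t: kq_t(n, n_t, c_g))


if __name__ == "__main__":
    n, c_g = 1024, 10  # c_g = 10 is a hypothetical value
    print("brute-force n_T:", brute_force_n_t(n, c_g))
    print("estimated n_T:  ", round(estimate_n_t(n, c_g)))
```

Because the estimate keeps only the leading terms, the two values are expected to agree only approximately for moderate $n$.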

B. OPTIMIZATION FOR NISQ
Now, we propose a form of the control modular adder that reduces the number of CNOT gates. To reduce this number, we review the decomposition of Toffoli gates into CNOT gates. We use relative-phase Toffoli gates, which act as Toffoli gates up to differences in phase, as in Figures 17 and 18, proposed by Maslov [49]. In the computational basis, the unitary matrix corresponding to Figure 17 changes the phase when we input $|1\rangle|0\rangle|1\rangle$, $|1\rangle|1\rangle|0\rangle$, or $|1\rangle|1\rangle|1\rangle$. We call this relative-phase Toffoli gate RT3, and we call its inverse IRT3. The unitary matrix corresponding to Figure 18 changes the phase when both control bits are 1. We call this relative-phase Toffoli gate RT4, and its inverse IRT4. By using these relative-phase Toffoli gates, we reduce the number of CNOT gates.

FIGURE 17: A relative-phase Toffoli gate with 3 CNOT gates (RT3). This calculation changes the phase when we input $|1\rangle|0\rangle|1\rangle$, $|1\rangle|1\rangle|0\rangle$, and $|1\rangle|1\rangle|1\rangle$. We call the inverse circuit of RT3 IRT3.

Next, we address which Toffoli gates can be replaced with relative-phase Toffoli gates. First, we consider which Toffoli gates can be replaced in a C-comparator. The structure of a C-comparator is shown in Figure 11, and we give example circuits in Figures 22 and 24 in Appendix B. In these figures, all Toffoli gates are arranged symmetrically about the Toffoli gate surrounded by a dotted box in the middle of the circuit. Thus, we can replace the Toffoli gates to the left of the dotted box by RT3 and those to the right of the box by IRT3. Therefore, we can replace all of the Toffoli gates in a C-comparator except the middle one with RT3 or IRT3.
Next, we address which Toffoli gates can be replaced in a CC-adder. We find that those in the P-rounds can be replaced by RT3 and those in the inverse P-rounds by IRT3. The other Toffoli gates are used to calculate the values of carries, and these carries are cleared after the calculation. In the calculation of carries, the values of the control bits change between calculating a carry and erasing it, which would seem to rule out using anything but pure Toffoli gates. However, looking more closely, we see that the value of a carry changes at most once, namely when both control bits are $|1\rangle$. Thus, if we calculate correctly in the other situations, we can calculate and clear carries correctly. RT4 satisfies this condition. Therefore, we can replace Toffoli gates by RT4 in the Initialization, G-rounds, and C-rounds, and we can replace Toffoli gates by IRT4 in the inverse rounds.
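The symmetric replacement of Toffoli gates by RT3 and IRT3 can be illustrated with a small toy model in Python. The sketch below tracks one computational-basis state and its amplitude, and checks that the phase introduced by RT3 is cancelled by the matching IRT3 when the gate in between only reads the target. The phase factor $i$ and all function names are assumptions for illustration; only the set of inputs whose phase changes is taken from the text.

```python
from itertools import product

# Inputs (c1, c2, t) on which RT3 changes the phase, as stated in the text.
PHASED_INPUTS = {(1, 0, 1), (1, 1, 0), (1, 1, 1)}
PHASE = 1j  # hypothetical phase factor; the real values depend on the circuit


def toffoli_bits(bits):
    """Classical action of a Toffoli gate on (control, control, target)."""
    c1, c2, t = bits
    return (c1, c2, t ^ (c1 & c2))


def rt3(amp, bits):
    """Toffoli permutation plus an input-dependent phase."""
    phase = PHASE if bits in PHASED_INPUTS else 1 + 0j
    return amp * phase, toffoli_bits(bits)


def irt3(amp, bits):
    """Inverse of rt3: undo the permutation and apply the conjugate phase."""
    source = toffoli_bits(bits)  # a Toffoli is its own inverse on bit strings
    phase = PHASE if source in PHASED_INPUTS else 1 + 0j
    return amp * phase.conjugate(), source


def compute_use_uncompute(bits):
    """Apply RT3, copy the target onto an extra bit (a CNOT), then apply IRT3."""
    amp, mid = rt3(1 + 0j, bits)
    extra = mid[2]  # the in-between gate only reads the target qubit
    amp, out = irt3(amp, mid)
    return amp, out, extra


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, compute_use_uncompute(bits))
```

Every basis input returns with amplitude 1, so by linearity the pair behaves like a Toffoli and its inverse on superpositions as well. This toy model does not cover the RT4 argument, where the control bits change between computing and erasing a carry.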
As a result, the cost of our control modular adder is shown in Tables 3 and 4. The breakdown of the constructions based on ours is shown in Tables 7 and 8 in Appendix D. From Tables 3 and 4, our construction is better in terms of both the number of CNOT gates and the CNOT-depth. Now, we compare our circuit to the original construction.
First, we compare the CNOT count. Our construction requires $64.75n$ CNOT gates. The original construction requires $30n$ Toffoli gates, each implemented by ST using 6 CNOT gates, and an additional $4.5n$ CNOT gates for embedding or resetting. Thus, the original construction requires $184.5n$ CNOT gates in total. Therefore, our construction reduces the number of CNOT gates to only 35% of the number in the original.
Next, we compare $\text{KQ}_{\text{CX}}$, defined as the product of the number of qubits and the CNOT-depth. Our construction has a $\text{KQ}_{\text{CX}}$ of $120 n \log n$. The original construction has a Toffoli depth of $12 \log n$, with each ST requiring CNOT-depth 6, and requires a CNOT-depth of $6 \log n$ for the embedding step. Thus, the original construction has a CNOT-depth of $78 \log n$ and a $\text{KQ}_{\text{CX}}$ of $312 n \log n$. Therefore, our construction requires only 38% of the $\text{KQ}_{\text{CX}}$ of the original construction.
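The two percentages follow from the leading coefficients quoted above. A minimal Python sketch of the arithmetic, in which the constant and function names are illustrative and lower-order terms are ignored:

```python
# Leading coefficients quoted in the text (per n, or per n log n).
ORIG_TOFFOLI_PER_N = 30         # Toffoli gates in the original construction
CNOT_PER_TOFFOLI = 6            # CNOT gates (and CNOT-depth) of one ST
ORIG_EXTRA_CNOT_PER_N = 4.5     # embedding / resetting in the original
OURS_CNOT_PER_N = 64.75         # CNOT gates in the proposed construction

ORIG_TOFFOLI_DEPTH_PER_LOG = 12  # Toffoli depth of the original, per log n
ORIG_EMBED_DEPTH_PER_LOG = 6     # CNOT-depth of embedding, per log n
ORIG_KQ_PER_NLOGN = 312          # KQ_CX of the original, per n log n
OURS_KQ_PER_NLOGN = 120          # KQ_CX of the proposed construction


def original_cnot_per_n():
    """CNOT count of the original construction, per n."""
    return ORIG_TOFFOLI_PER_N * CNOT_PER_TOFFOLI + ORIG_EXTRA_CNOT_PER_N


def original_cnot_depth_per_log():
    """CNOT-depth of the original construction, per log n."""
    return ORIG_TOFFOLI_DEPTH_PER_LOG * CNOT_PER_TOFFOLI + ORIG_EMBED_DEPTH_PER_LOG


if __name__ == "__main__":
    print("original CNOT count per n:", original_cnot_per_n())
    print("CNOT ratio:", OURS_CNOT_PER_N / original_cnot_per_n())
    print("original CNOT-depth per log n:", original_cnot_depth_per_log())
    print("KQ_CX ratio:", OURS_KQ_PER_NLOGN / ORIG_KQ_PER_NLOGN)
```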

V. CONCLUSION AND FUTURE WORK
In this study, we proposed a method of optimizing a control modular adder based on a carry-lookahead adder [43] and Van Meter and Itoh's construction [39]. First, we showed that the KQ of the general construction given in Figure 9 is about 2/3 of that of the original construction. Then, we constructed a more efficient circuit. We evaluated the computational cost in FTQ and showed that our circuit requires only 20% of the $T$ gates of the original. Moreover, we showed that our circuit achieves its minimum $\text{KQ}_{T}$ when we run $\Theta\left(n / \sqrt{\log n}\right)$ $T$ gates simultaneously. Finally, we proposed an efficient circuit for use in the NISQ era, and showed that our circuit requires only 35% of the CNOT gates and 38% of the $\text{KQ}_{\text{CX}}$ of the original. In this work, we have focused on optimizing Toffoli gates by using relative-phase Toffoli gates. However, in previous research [56], [57], other researchers have used gates such as Fredkin and Peres gates. These gates may also be simplified by replacing them with relative-phase gates. Thus, we expect that those circuits would also show an improvement with these techniques applied.
In this paper, we have considered only a single control modular addition. In future work, the circuits that postpone and summarize multiple modular arithmetic operations, as proposed by Van Meter and Itoh [39], should be addressed using similar optimization techniques. In addition, it is important to minimize KQ by reordering gates [37], [58].
Our construction does not consider the architecture of quantum computers, such as the linear nearest neighbor architecture [33], [50], [51]. Thus, in the next step, we will consider the appropriate architecture and the additional cost for our construction.
Lastly, we focused only on the Logical layer of FTQ in this study. In future work, we must consider the mapping to physical qubits, as well as distillation protocols.

Draper et al.'s carry-lookahead adder is given as follows:
TABLE 3: CNOT count of our control modular adder and prior work. The latter four constructions are based on our construction proposed in Section III. The breakdown of the latter four constructions is shown in Table 7 in Appendix D.

Initialization ($n$ Toffoli gates and $n$ CNOT gates) We calculate $g[i, i+1]$ and $p[i, i+1]$ ($0 \le i \le n - 1$) as follows:
$$g[i, i+1] = a_i b_i, \qquad p[i, i+1] = a_i \oplus b_i.$$
The circuit calculating these is shown in Figure 19. We use $|c_{i+1}\rangle$ as the third qubit. We can run these gates simultaneously for $i = 0$ to $n - 1$.
P-rounds ($n$ Toffoli gates and $\log n$ Toffoli depth) We calculate the $p$-function by using eq. (9). We use a parameter $t_p$ representing the range of the propagation of carry. We increase $t_p$ from 1 to $\lfloor \log n \rfloor - 1$. In each $t_p$, we calculate $p[2^{t_p} i, 2^{t_p}(i+1)]$ ($1 \le i \le \lfloor n/2^{t_p} \rfloor - 1$) by setting $|p[2^{t_p} i, 2^{t_p}(i + 1/2)]\rangle$ and $|p[2^{t_p}(i + 1/2), 2^{t_p}(i+1)]\rangle$ as the control qubits of the Toffoli gate in Figure 4a. These Toffoli gates are applied simultaneously in each $t_p$.
G-rounds ($n$ Toffoli gates and $\log n$ Toffoli depth) We calculate $|c_{2^k}\rangle$ ($k \in \mathbb{N} \cup \{0\}$) by using eq. (10). We use a parameter $t_g$ in the same way as $t_p$ in the P-rounds. We increase $t_g$ from 1 to $\lfloor \log n \rfloor$. In each $t_g$, we calculate $g[2^{t_g} i, 2^{t_g}(i+1)]$ ($0 \le i \le \lfloor n/2^{t_g} \rfloor - 1$) by setting $|c_{2^{t_g} i + 2^{t_g - 1}}\rangle$ and $|p[2^{t_g}(i + 1/2), 2^{t_g}(i+1)]\rangle$ as the control qubits and $|c_{2^{t_g}(i+1)}\rangle$ as the target qubit of the Toffoli gate in Figure 4b. These Toffoli gates are applied simultaneously in each $t_g$. Moreover, the G-rounds with $t_g$ can be run in parallel with the P-rounds with $t_g + 1$.
C-rounds ($n$ Toffoli gates and $\log n$ Toffoli depth) We calculate all carries $|c\rangle$ by using eq. (10). We use a parameter $t_c$ in the same way as $t_p$ in the P-rounds. We decrease $t_c$ from $\lfloor \log(2n/3) \rfloor$ to 1. In each $t_c$, we calculate $|c_{2^{t_c} i + 2^{t_c - 1}}\rangle$ ($1 \le i \le \lfloor (n - 2^{t_c - 1})/2^{t_c} \rfloor - 1$) by setting $|c_{2^{t_c} i}\rangle$ and $|p[2^{t_c} i, 2^{t_c} i + 2^{t_c - 1}]\rangle$ as the control qubits and $|c_{2^{t_c} i + 2^{t_c - 1}}\rangle$ as the target qubit of the Toffoli gate in Figure 4b. These Toffoli gates are applied simultaneously in each $t_c$.
Inverse P-rounds ($n$ Toffoli gates and $\log n$ Toffoli depth) We apply the same gates as in the P-rounds in reverse order. The rounds with $t_p$ can be run in parallel with the C-rounds with $t_p + 1$.
Calculating $|a + b\rangle$ ($n$ CNOT gates) We calculate $(a + b)_i$ ($0 \le i \le n - 2$) on $|b_i\rangle$. We apply CNOT gates with the control qubit $|c_{i+1}\rangle$ and the target qubit $|b_{i+1}\rangle$. These CNOT gates are applied simultaneously.
Erasing Carry ($5n$ Toffoli gates, $2n$ CNOT gates, and $2 \log n$ Toffoli depth) We erase all carries by applying the inverse circuit of $a + (2^n - 1 - a - b)$ on the lower $n - 1$ bits, as shown in Figure 20. We apply gates before the P-rounds and after the inverse Initialization to erase carries. We call these gates PE-rounds and inverse PE-rounds, respectively.
Now, we show an example circuit of Draper et al.'s carry-lookahead adder in Figure 21. In this example, we define $a$ and $b$ as 6-bit values, and we calculate $|a\rangle|b\rangle \to |a\rangle|a + b\rangle$. In Figure 21, in contrast to Figure 9, qubits are sorted from low order to high order.
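The Initialization, P-rounds, G-rounds, and C-rounds above can be checked on classical bit values. The following Python sketch is a classical simulation only: it tracks the values of $c$ and $p$ and does not model the uncomputation or the Erasing Carry step. The function names are illustrative, and the index ranges are written as half-open Python ranges following Draper et al.'s published rounds as we understand them; in particular, the C-round index is assumed to run up to $\lfloor (n - 2^{t-1})/2^{t} \rfloor$.

```python
def cla_carries(a, b, n):
    """Return the carries c[0..n] of a + b computed with the lookahead rounds."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    c = [0] * (n + 1)
    p = {}
    # Initialization: g[i, i+1] is stored in c[i+1]; p[i, i+1] = a_i XOR b_i.
    for i in range(n):
        c[i + 1] = a_bits[i] & b_bits[i]
        p[(i, i + 1)] = a_bits[i] ^ b_bits[i]
    log_n = n.bit_length() - 1  # floor(log2 n)
    # P-rounds: t = 1 .. floor(log n) - 1.
    for t in range(1, log_n):
        for i in range(1, n >> t):
            lo, mid, hi = i << t, (i << t) + (1 << (t - 1)), (i + 1) << t
            p[(lo, hi)] = p[(lo, mid)] & p[(mid, hi)]
    # G-rounds: t = 1 .. floor(log n).
    for t in range(1, log_n + 1):
        for i in range(n >> t):
            mid, hi = (i << t) + (1 << (t - 1)), (i + 1) << t
            c[hi] ^= c[mid] & p[(mid, hi)]
    # C-rounds: t = floor(log2(2n/3)) down to 1 (computed with integers).
    t_max = 0
    while 3 * (1 << (t_max + 1)) <= 2 * n:
        t_max += 1
    for t in range(t_max, 0, -1):
        for i in range(1, ((n - (1 << (t - 1))) >> t) + 1):
            lo, mid = i << t, (i << t) + (1 << (t - 1))
            c[mid] ^= c[lo] & p[(lo, mid)]
    return c


def cla_add(a, b, n):
    """Assemble a + b from the carries: s_i = a_i XOR b_i XOR c_i, s_n = c_n."""
    c = cla_carries(a, b, n)
    total = c[n] << n
    for i in range(n):
        total |= (((a >> i) ^ (b >> i) ^ c[i]) & 1) << i
    return total


if __name__ == "__main__":
    print(cla_add(37, 22, 6))  # two 6-bit inputs
```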

B. DETAILED CONSTRUCTION OF OUR CONTROL MODULAR ADDER
In this section, we explain the details of our control modular adder. We also show example figures of our control modular adder.

1) A C-Comparator
Now, we explain the construction of a C-comparator in more detail. In a C-comparator, we judge whether or not $b \ge d$, where $b$ is a quantum value and $d$ is a classical value. As noted in Section III.A, we conduct this by calculating the carry out of the entire circuit $b + (2^n - d)$. Our construction is given as follows:
Initialization If we conduct Initialization naively, we apply a Toffoli gate and a CNOT gate for each bit. However, compiling a quantum algorithm (selecting the sequence of gates) often has to be adapted to the specific classical values that are inputs to the overall algorithm. Because $2^n - d$ is a classical value, we can convert some Toffoli gates to CNOT gates and eliminate other gates. We calculate each $(2^n - d)_i$ ($0 \le i \le n - 1$). If $(2^n - d)_i = 1$,
1) we apply a CNOT gate with the control qubit $|b_i\rangle$ and the target qubit $|c_{i+1}\rangle$, and
2) we apply an X gate on $|b_i\rangle$.
If we want to flip COMP when $b \ge d$, we apply a Toffoli gate with the control qubits CTRL and $|g[0, n]\rangle$ and the target qubit COMP. If we want to flip COMP when $b < d$, we apply the same Toffoli gate as for $b \ge d$, but we apply NOT gates on $|g[0, n]\rangle$ before and after the Toffoli gate.

FIGURE 20: Erasing $|c\rangle$. We apply gates only on the lower $n - 1$ qubits of $|a\rangle$, $|b\rangle$, and $|c\rangle$. We apply the same gates on the omitted qubits $|a_i\rangle$, $|(a + b)_i\rangle$, and $|c_{i+1}\rangle$. The P-rounds and inverse C-rounds can be run in parallel, as can the inverse G-rounds and inverse P-rounds. We define PE-rounds as the gates before the P-rounds, and inverse PE-rounds as the gates after the inverse Initialization.

Resetting qubits
We conduct the inverse G-rounds and inverse P-rounds as in Draper et al.'s construction. Moreover, we conduct the inverse of our Initialization. Then, we reset all qubits except COMP to their initial values.
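The classical side of the C-comparator can be sketched in Python: which bit positions receive the CNOT and X pair during Initialization, and why the carry out of $b + (2^n - d)$ decides $b \ge d$. The function names are illustrative, and the sketch assumes $1 \le d \le 2^n - 1$ so that $2^n - d$ fits in $n$ bits.

```python
def comparator_gate_positions(d, n):
    """Bit positions i with (2^n - d)_i = 1; these get a CNOT and an X gate."""
    addend = (1 << n) - d
    return [i for i in range(n) if (addend >> i) & 1]


def carry_out(b, d, n):
    """Carry out of the n-bit addition b + (2^n - d); it is 1 exactly when b >= d."""
    return (b + (1 << n) - d) >> n


if __name__ == "__main__":
    # Example with d = 37 and n = 6, so 2^6 - 37 = 27 = 011011 in binary.
    print(comparator_gate_positions(37, 6))
    print([b for b in range(64) if carry_out(b, 37, 6)][:3])
```

For $d = 37$ and $n = 6$, the positions are the first, second, fourth, and fifth bits (counting from the low-order end), which matches the example of the last C-comparator in Appendix B.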

2) A CC-adder
First, we explain the construction of embedding in more detail. We want to embed as follows:
• If CTRL is 1 and COMP is 1, we embed $2^n + a - N$.
• If CTRL is 1 and COMP is 0, we embed $a$.
• Otherwise, we embed no value.
Therefore, we embed on the second register in Figure 10 as follows:
• If CTRL is 1 and $(2^n + a - N)_i = a_i = 1$, the $i$-th qubit is $|1\rangle$.
• If CTRL is 1, COMP is 1, $(2^n + a - N)_i = 1$, and $a_i = 0$, the $i$-th qubit is $|1\rangle$.
• If CTRL is 1, COMP is 0, $(2^n + a - N)_i = 0$, and $a_i = 1$, the $i$-th qubit is $|1\rangle$.
• Otherwise, we do nothing.
In the above conditions, the values of $(2^n + a - N)_i$ and $a_i$ are classical information, and CTRL and COMP are quantum information. Thus, embedding in the first condition can be realized by CNOT gates with the control qubit CTRL. Moreover, embedding in the second and third conditions can be realized by Toffoli gates with the control qubits CTRL and COMP. However, the sets of $i$ for the three classical conditions do not overlap. Therefore, within each set, once we embed one $i$, we can embed the remaining values with CNOT gates. Each set has $n/4$ elements on average, requiring $n/4$ CNOT gates and $O(1)$ additional gates. Thus, the embedding can be implemented with $3n/4$ CNOT gates. Moreover, because we can run these simultaneously, embedding requires $\log n$ CNOT depth. The reset of the embedding can be implemented similarly.
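The three classical conditions can be enumerated with a short Python sketch. The function names are illustrative; the example values $N = 59$, $a = 37$, and $n = 6$ are the ones used in the example of Appendix B.

```python
def classify_embedding(a, N, n):
    """Split bit positions by the classical values (2^n + a - N)_i and a_i.

    Returns three lists: positions set when CTRL = 1 regardless of COMP,
    positions set only when COMP = 1, and positions set only when COMP = 0.
    """
    x = (1 << n) + a - N
    ctrl_only, comp_one, comp_zero = [], [], []
    for i in range(n):
        x_i, a_i = (x >> i) & 1, (a >> i) & 1
        if x_i and a_i:
            ctrl_only.append(i)
        elif x_i:
            comp_one.append(i)
        elif a_i:
            comp_zero.append(i)
    return ctrl_only, comp_one, comp_zero


def embedded_value(a, N, n, ctrl, comp):
    """Value written to the second register for classical CTRL and COMP bits."""
    ctrl_only, comp_one, comp_zero = classify_embedding(a, N, n)
    if not ctrl:
        return 0
    positions = ctrl_only + (comp_one if comp else comp_zero)
    return sum(1 << i for i in positions)


if __name__ == "__main__":
    print(classify_embedding(37, 59, 6))
    print(embedded_value(37, 59, 6, ctrl=1, comp=1))
    print(embedded_value(37, 59, 6, ctrl=1, comp=0))
```

Because the three position lists are disjoint, each bit of the register is touched by the gates of only one condition.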
Next, we explain the optimization in an adder. In our calculation, there is no carry for $g[0, n]$ whether we subtract $N - a$ or add $a$. Thus, we can disregard the calculation of the carry qubit $g[0, n]$. To realize this, we omit the calculation of $p[i, n]$ and $g[i, n]$ ($i < n$). Moreover, by using the classicality of $a$ and $N$, we know that we embed no value in $n/4$ qubits on average on the second register of Figure 10. In these qubits, we can omit Initialization, inverse Initialization, and the CNOT gates with the control qubit $|a_i\rangle$ and the target qubit $|b_i\rangle$ in erasing carry. By considering these optimizations, we save $n/2$ Toffoli gates and $3n/4$ CNOT gates.
The gate count and depth are shown in Table 5.

3) Example of Our Control Modular Adder
We show an example of a 6-bit control modular adder when $N = 59$ and $a = 37$. Circuits are given in Figures 22, 23, and 24. In these example figures, registers are shown with low-order qubits at the top, in contrast to Figure 10. In this subsection, the register $|b\rangle$ contains a quantum value.
The algorithm proceeds in this order: 1) Conduct a C-comparator with the control qubit CTRL. Compare $|b\rangle$ and $a = 37$. If $b < 37$, flip COMP. This is implemented by calculating the carry of adding $2^6 - a = 27$. These steps correspond to Figures 22, 23, and 24, respectively.
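A classical model of the whole example may help in following the three circuits. The Python sketch below chains a first comparator, the CC-adder, and the last comparator on classical values. The threshold of the first comparator (flip COMP when $b \ge N - a$) is an assumption of this sketch, chosen so that COMP selects between adding $a$ and adding $2^n + a - N$; the function name is also illustrative.

```python
def control_modular_add(b, a, N, n, ctrl):
    """Classical model returning (new_b, comp) for 0 <= a, b < N < 2^n."""
    comp = 0
    # First C-comparator (assumed threshold): flip COMP if a + b >= N.
    if ctrl and b >= N - a:
        comp ^= 1
    # CC-adder: add the embedded value modulo 2^n.
    if ctrl:
        addend = ((1 << n) + a - N) if comp else a
        b = (b + addend) % (1 << n)
    # Last C-comparator: flip COMP if the result is smaller than a.
    if ctrl and b < a:
        comp ^= 1
    return b, comp


if __name__ == "__main__":
    for b in (0, 21, 22, 58):
        print(b, control_modular_add(b, 37, 59, 6, ctrl=1))
```

In this model COMP always returns to 0, and the register holds $(a + b) \bmod N$ when CTRL is 1 and is unchanged otherwise.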

C. DETAILED CALCULATION OF T -DEPTH
In this section, we analyze the $T$-depth of our $T$-optimal control modular adder. We assume that we run the GRTs with the same timing, and each GRT has $T$-depth 2 from Figure 6. We focus on the parts that can be run concurrently. Except for Initialization, we run
• P-rounds and G-rounds simultaneously,
• C-rounds and inverse P-rounds simultaneously,
• P-rounds and inverse C-rounds simultaneously, and
• inverse G-rounds and inverse P-rounds simultaneously.

TABLE 5: Gate count and depth of our proposed control modular adder. We omit the rounds whose gate count is $O(1)$ and whose depth is $O(1)$.
Operation | Count | Depth
Rounds | $9.5n$, $4.75n$ | $4 \log n$, $2 \log n$
Total | $17.5n$, $6.75n$ | $8 \log n$, $2 \log n$
In the first and third steps, we run many $T$ gates simultaneously at the start and fewer $T$ gates as the calculation progresses. In the second and fourth steps, we run only a few $T$ gates simultaneously at first and more as the calculation progresses. Thus, the number of $T$ gates we can run simultaneously varies. As noted in Section IV.A, we define $n_T$ as the upper bound on the number of $T$ gates running simultaneously, and we calculate the $T$-depth based on $n_T$ as in Figure 16. In each round, there are parts where we could run more than $n_T$ $T$ gates. However, because of the limit $n_T$, we run these $T$ gates in separate batches. In contrast, in the parts having fewer than $n_T$ $T$ gates, we can run these $T$ gates simultaneously.
First, we consider the parts having fewer than $n_T$ $T$ gates, which occur when we run P-rounds and G-rounds simultaneously, C-rounds and inverse P-rounds simultaneously, P-rounds and inverse C-rounds simultaneously, and inverse G-rounds and inverse P-rounds simultaneously. In these rounds, if we have no restriction on running $T$ gates, the patterns are as follows:
• In the first and the third cases, the number of $T$ gates we can run simultaneously decreases by one half as the calculation progresses. Thus, in the latter part of the calculation, we run fewer than $n_T$ $T$ gates simultaneously. This part has $T$-depth $2 \log n_T$ and $n_T$ $T$ gates in total.
• In the second and the fourth cases, the number of $T$ gates we can run simultaneously doubles as the calculation progresses. Thus, in the former part of the calculation, we run fewer than $n_T$ $T$ gates simultaneously. This part has $T$-depth $2 \log n_T$ and $n_T$ $T$ gates in total.
We have 6 parts each with a small number of $T$ gates, as follows:
• P-rounds and G-rounds in the first C-comparator,
• P-rounds and G-rounds in the CC-adder,
• C-rounds and inverse P-rounds in the CC-adder,
• P-rounds and inverse C-rounds in the CC-adder,
• inverse G-rounds and inverse P-rounds in the CC-adder, and
• P-rounds and G-rounds in the final C-comparator.
Thus, we consume $12 \log n_T$ $T$-depth and $6 n_T$ $T$ gates in these parts.
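Next, we consider the remaining parts. In these parts, we run $n_T$ $T$ gates at a time. The total number of $T$ gates is $43n$ from Table 2, and we run the remaining $43n - 6n_T$ $T$ gates in these parts. Thus, the $T$-depth of this part is given by
$$\frac{2(43n - 6n_T)}{n_T} = \frac{86n}{n_T} - 12.$$
In conclusion, the $T$-depth is given by
$$\frac{86n}{n_T} + 12 \log n_T - 12.$$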

D. DETAILED GATE COUNT ON FTQ OR NISQ
In this section, we detail the gate counts on FTQ and NISQ.
The detailed $T$ gate count on FTQ is shown in Table 6. The detailed CNOT gate count on NISQ is shown in Table 7. The detailed CNOT depth and $\text{KQ}_{\text{CX}}$ on NISQ are shown in Table 8.

E. A DISTILLATION CIRCUIT FOR A T GATE
A distillation circuit for a T gate is given as Figure 25.
FIGURE 24: An example of the last C-comparator. We flip the COMP qubit if $b < 37$. This is achieved by adding $2^6 - 37 = 27 = 011011_2$ and using the carry out. First, we apply pairs of gates, a CNOT and an X gate, on the first, second, fourth, and fifth groups of qubits. In contrast to Figure 22, we apply X gates before and after the center Toffoli gate. This circuit is symmetric about the Toffoli gate surrounded by a dotted box. Init, IP, IG, and IInit mean Initialization, Inverse P-rounds, Inverse C-rounds, Inverse G-rounds, and Inverse Initialization respectively.

SHUMPEI UNO is a Chief Consultant of Mizuho Research & Technologies, Ltd. (MHRT). He holds an M.Sc. and a Ph.D. in Particle Physics from Nagoya University. During his Ph.D., he formulated quantum electrodynamics on a finite volume lattice in order to accurately predict the light quark masses. He entered MHRT in 2011 and became a project researcher of Keio Quantum Computing Center in 2018. He researches quantum computing for use in financial applications, such as derivatives simulation, risk management, optimization, and machine learning. Contact him at shumpei.uno@mizuho-ir.co.jp.
TAKAHIKO SATOH is a project assistant professor of Keio University Quantum Computing Center. He studied at Keio University and the University of Tokyo in 2015. He received a PhD in computer science from UT. His research field is quantum computing and quantum networking, particularly quantum network coding, NISQ algorithm design, and Quantum Internet security. He is a member of the Physical Society of Japan (JPS). Contact him at satoh@sfc.wide.ad.jp.
RODNEY VAN METER (Senior Member) is a professor of Environment and Information Studies at Keio University's Shonan Fujisawa Campus. He is vice center chair of the Keio Quantum Computing Center, a board member of the WIDE Project, and a member of the Quantum Internet Task Force. Besides quantum networking and quantum computing, his research interests include storage systems, networking, and post-Moore's law computer architecture. Van Meter received a PhD in computer science from Keio University. He is a member of ACM, the American Physical Society, and the American Association for the Advancement of Science (AAAS). Contact him at rdv@sfc.wide.ad.jp.