Shor's Algorithm Using Efficient Approximate Quantum Fourier Transform

Shor's algorithm solves the integer factoring and discrete logarithm problems in polynomial time. Therefore, the evaluation of Shor's algorithm is essential for evaluating the security of currently used public-key cryptosystems because the integer factoring and discrete logarithm problems are crucial for the security of these cryptosystems. In this article, a new approximate quantum Fourier transform is proposed, and it is applied to Rines and Chuang's implementation. The proposed implementation requires one-third the number of <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula> gates of the original. Moreover, it requires one-fourth of the <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula>-depth of the original. Finally, a <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula>-scheduling method for running the circuit with the smallest KQ (where K is the number of logical qubits and Q is the circuit depth) is presented.


I. INTRODUCTION

A. BACKGROUND
Evaluating Shor's algorithm [1] is extremely important. Shor's algorithm solves the integer factorization and discrete logarithm problems, which take subexponential time on classical computers [2]. These problems underpin the security of current public-key cryptosystems, including the RSA cryptosystems [3] and elliptic curve cryptosystems [4], [5]. Currently, quantum computers are far too small to break these two public-key cryptosystems [6], [7], [8], [9], [10], [11]. However, the scale of quantum computers is increasing [12], so it is important to estimate when Shor's algorithm will break them, and this estimate requires a precise evaluation of Shor's algorithm. This article discusses Shor's algorithm on a single quantum computer. When multiple quantum computers are available, the recently proposed distributed Shor's algorithm [13] reduces the computational cost. Our results can be combined with that approach, but only a single quantum computer is considered in this article. This article focuses on Shor's algorithm for factoring an n-bit composite number N.
Previous researchers have evaluated the computational cost of Shor's algorithm by constructing quantum circuits for it. Shor's algorithm consists of modular exponentiation and a quantum Fourier transform (QFT). The modular exponentiation has a higher computational cost than the QFT. Previous research works [14], [15], [16], [17], [18] have decomposed a modular exponentiation into smaller arithmetic circuits, namely adders and multipliers. These methods differ in how they address carry, which is an overflow at each digit. The most widely used methods are the following.
1) Ripple-Carry [14], [17]: Carry is calculated sequentially from the least significant bit (LSB) to the most significant bit (MSB).
2) Carry-Lookahead [16]: Carry is calculated first, before the actual calculation.
3) Fourier-Basis [15], [18]: All carries in each digit are calculated by QFT.
The first (ripple-carry) and second (carry-lookahead) methods calculate each carry only on the corresponding digit. Unlike these two methods, the third method (Fourier basis) calculates carries on all digits in the Fourier basis. Table 1 shows the computational cost of each method. Without fault tolerance, the Fourier-basis circuit is better than the other constructions. This difference occurs because the Fourier-basis method calculates carries with higher parallelization than the other methods.
To estimate the actual future computational cost, we must consider fault tolerance. Several researchers [17], [18] have considered fault tolerance in Shor's algorithm. In a fault-tolerant setting, non-Clifford gates, including T gates, have a higher cost than Clifford gates. Non-Clifford gates are usually realized by magic state distillation [19], which is costly. Therefore, previous researchers have represented quantum circuits with Clifford+T gates and minimized the cost of T gates, including the number of T gates (called "T-count" in this article) and the depth of T gates (called "T-depth" in this article).
This article focuses on Shor's algorithm using the Fourier-basis method. As noted previously, the Fourier-basis method has a lower computational cost than the other methods without fault tolerance. However, the computational cost of the Fourier-basis method increases with fault tolerance, as shown in Table 1. In more detail, the T-count and T-depth grow by a factor of O(log n) with fault tolerance. This increase occurs because the Fourier-basis method requires phase gates, each of which requires O(log n) Clifford+T gates [20].
However, the Fourier-basis method proposed by Rines and Chuang [18] needs improvement. Previous research has shown that approximate QFTs, which omit phase gates with a small phase, still yield the desired result in Shor's algorithm [21], [22]. Rines and Chuang's [18] method adopts QFTs but does not consider approximate QFTs. Therefore, we should consider the computational cost of adopting approximate QFTs to evaluate the Fourier-basis method more precisely.

B. OUR CONTRIBUTIONS
We evaluate Shor's algorithm using the Fourier-basis method [18] more precisely. In particular, the controlled modular multiplier using Montgomery reduction is improved by reducing its T-depth. The contributions are as follows.
1) Montgomery multiplication [18] is improved (see Section III).
2) The proposed Montgomery multiplication is applied to Shor's algorithm (see Section IV).
First, the improvement of Rines and Chuang's [18] implementation of the Montgomery multiplication is described in Section III. Their quantum circuit is improved while keeping the dominant term of the number of qubits at 2n, almost the same as in Rines and Chuang's [18] construction. The method for reducing the computational cost of the Montgomery multiplication, namely reducing the computational cost of QFTs, is shown in Fig. 1. In Section III-A, the thresholds in approximation are discussed. Rines and Chuang's [18] original Montgomery multiplication is modified for Shor's algorithm in Section III-B because their method cannot be applied directly to Shor's algorithm. Moreover, a method for reducing the computational cost by approximate QFTs is proposed in Sections III-C and III-D. In Section III-C, a naive application of approximate QFTs is discussed. This implementation requires only one-third the T gates of Rines and Chuang's [18] original implementation. In addition to this naive method, an approximate QFT with a smaller T-depth, which realizes a Montgomery multiplication with a smaller T-depth, is proposed in Section III-D. By applying the proposed approximate QFTs and parallelizing quantum gates with a negligible number of ancilla qubits, this implementation requires only one-fourth of the T-depth of Rines and Chuang's [18] original implementation.
Finally, the proposed Montgomery multiplication is applied to Shor's algorithm in Section IV. First, the proposed Montgomery multiplication is applied directly in Section IV-A. Next, a circuit with a small T-depth is proposed in Section IV-B by using many ancilla qubits. Moreover, similar to Oonishi et al.'s [23] method, a T-scheduling method is proposed in Section IV-C for minimizing KQ [24], the product of the number of logical qubits K and the circuit depth Q. It is then shown that the proposed circuit takes the smallest KQ when $n^{1.5}$ T gates are run simultaneously.

II. PRELIMINARIES
First, notations are provided in Section II-A. Next, evaluation indices for quantum circuits are introduced in Section II-B. Moreover, the previous approximate QFT is described in Section II-C. Finally, the Fourier-basis method for the multiplication proposed by Rines and Chuang [18] is introduced in Section II-D.

A. NOTATIONS
The following gate set is used.
1) Clifford gates: H, S, and CNOT gates.
Phase gates $P(\theta)$ are defined as
$$P(\theta) = \begin{pmatrix} 1 & 0 \\ 0 & \exp(i\theta) \end{pmatrix}.$$
In particular, $P_j$ is defined as $P_j = P(2\pi/2^j)$. For example, $S = P_2$ and $T = P_3$. In addition, the n-qubit register is represented as $|x\rangle_n$ with $x = \sum_{j=0}^{n-1} x_j 2^j$; the subscript n is omitted when n = 1. Moreover, the Fourier-basis register, namely a register to which the QFT has been applied, is marked by a superscript. For example, $\mathrm{QFT}|x\rangle_n$ is written as $|x\rangle_n$ with this superscript.
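The phase-gate notation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names `phase_gate`, `p_gate`, and `matmul2` are illustrative and do not come from the article.

```python
import cmath
import math

def phase_gate(theta):
    """Return P(theta) = diag(1, exp(i*theta)) as a nested list."""
    return [[1, 0], [0, cmath.exp(1j * theta)]]

def p_gate(j):
    """Return P_j = P(2*pi / 2**j)."""
    return phase_gate(2 * math.pi / 2 ** j)

def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

S = p_gate(2)            # lower-right entry is exp(i*pi/2) = i
T = p_gate(3)            # lower-right entry is exp(i*pi/4)
T_squared = matmul2(T, T)  # two T gates compose to one S gate
```

Here `T_squared` agrees with `S` up to floating-point error, which reflects the general rule that applying $P_{j+1}$ twice gives $P_j$.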

B. EVALUATING INDICES FOR QUANTUM CIRCUITS
This article focuses on the number of qubits, T-count, T-depth, and KQ [24]. The emphasis is on T gates because they have the highest cost in this gate set. KQ is the product of the number of qubits and the depth. Jones et al. [25] reported that the probability of correct output is the gate precision to the power of KQ. A smaller KQ lets quantum computation be realized sooner because quantum circuits with a smaller KQ require lower precision in each gate. Therefore, a smaller KQ is important for realizing quantum computation. This article proposes a T-scheduling method for minimizing $KQ_T$, which is defined as the product of the number of qubits and the T-depth.
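To make the role of KQ concrete, the sketch below evaluates the relation attributed to Jones et al. [25], in which the probability of correct output is the gate precision raised to the power KQ. It assumes that "gate precision" means a per-gate success probability; the function names and the sample numbers are illustrative only.

```python
def kq(num_qubits, depth):
    """KQ: product of the number of logical qubits K and the circuit depth Q."""
    return num_qubits * depth

def success_probability(gate_precision, num_qubits, depth):
    """Probability of correct output, modeled as gate_precision ** KQ."""
    return gate_precision ** kq(num_qubits, depth)

# Hypothetical numbers: 100 qubits, depth 1000, per-gate precision 1 - 1e-6.
p_small = success_probability(1 - 1e-6, 100, 1000)
# Ten times the depth at the same precision lowers the success probability.
p_large = success_probability(1 - 1e-6, 100, 10000)
```

With these sample values the first circuit succeeds with probability of roughly 0.90, while the deeper circuit drops to well below one half, which illustrates why a smaller KQ tolerates lower gate precision.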

C. PREVIOUS APPROXIMATE QFT
In this section, the approximate QFT is introduced. First, the QFT without approximation is discussed in Section II-C1. Next, a naive approximate QFT is introduced in Section II-C2. Several studies have focused on how to implement the approximate QFT efficiently. One of them is the approximate QFT of Nam et al. [26]. Their construction realizes a smaller T-count and is introduced in Section II-C3. These constructions have the same dominant term: the n-bit QFT requires approximately n qubits. Moreover, the approximate QFT of Cleve and Watrous [27] is introduced in Section II-C4. Unlike the previous constructions, their construction achieves a smaller T-depth by using many ancilla qubits.

1) QFT WITHOUT APPROXIMATION
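The QFT is one of the fundamental transformations in quantum circuits. Fig. 2 shows the QFT where the number of qubits n is four. As Fig. 2 shows, the QFT consists of H gates and controlled $P_j$ gates. The QFT is now explained in more detail. Let x be an integer satisfying $0 \le x < 2^n$. The QFT transforms $|x\rangle_n$ as
$$\mathrm{QFT}|x\rangle_n = \frac{1}{\sqrt{2^n}} \sum_{y=0}^{2^n-1} \exp\left(\frac{2\pi i\, x y}{2^n}\right) |y\rangle_n. \quad (1)$$
This QFT is realized by Algorithm 1. In Algorithm 1, $|\psi\rangle_n$ is a superposition of some $|a\rangle_n$'s. Qubits are labeled by natural numbers from the left qubit to the right qubit; namely, qubit j corresponds to $|x_{n-j}\rangle$.

Algorithm 2 Approximate QFT
1: for j = 1 to n do
2: Apply the H gate on the qubit j.
3: for k = 1 to min(log(1/ε_q) − 1, n − j) do
4: Apply controlled $P_{k+1}$ gates whose control qubit is the qubit j and target qubit is the qubit j + k.
5: end for
6: end for
7: return AQFT$|\psi\rangle_n$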

2) NAIVE APPROXIMATE QFT
The approximate QFT realizes efficient computation by omitting phase gates with a small phase, which have a small effect on the computation. In the approximate QFT, the threshold $\varepsilon_q$ is determined from the overall computation. Then, controlled phase gates whose phase is less than $\varepsilon_q$ are omitted. Concretely, the upper bound of the for loop in Step 3 of Algorithm 1 is rewritten: n − j is replaced with min(log(1/ε_q) − 1, n − j), as in Algorithm 2.
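The effect of the threshold on the gate list can be sketched as follows. This is an illustrative enumeration of gates, not a simulator; it assumes base-2 logarithms and rounds log(1/ε_q) up to an integer, and the function name and tuple layout are hypothetical.

```python
import math

def qft_gates(n, eps_q=None):
    """List the gates of the QFT on qubits 1..n.

    With eps_q=None every controlled phase gate is kept (exact QFT).
    Otherwise the inner loop stops at min(b - 1, n - j), where
    b = ceil(log2(1 / eps_q)) is an assumed rounding of log(1/eps_q).
    """
    gates = []
    for j in range(1, n + 1):
        gates.append(("H", j))
        limit = n - j
        if eps_q is not None:
            b = math.ceil(math.log2(1 / eps_q))
            limit = min(b - 1, n - j)
        for k in range(1, limit + 1):
            # Controlled P_{k+1} with control qubit j and target qubit j + k.
            gates.append(("CP", k + 1, j, j + k))
    return gates

def count_controlled_phase(gates):
    """Count the controlled phase gates in a gate list."""
    return sum(1 for g in gates if g[0] == "CP")

exact = count_controlled_phase(qft_gates(8))            # n(n-1)/2 gates
approx = count_controlled_phase(qft_gates(8, 1 / 16))   # at most b-1 per qubit
```

For n = 8 and ε_q = 1/16 the exact QFT keeps 28 controlled phase gates and the approximate one keeps 18; for large n the approximate count grows as n log(1/ε_q) instead of quadratically.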

Transactions on IEEE
Oonishi and Kunihiro: SHOR'S ALGORITHM USING EFFICIENT APPROXIMATE QFT

3) APPROXIMATE QFT OF NAM ET AL.
Nam et al.'s [26] approximate QFT realizes a smaller T-count using Gidney's [28] ripple-carry adder. Let b be log(1/ε_q). In Step 4, a phase gate acts when both qubit j and qubit j + k are $|1\rangle$. Let $p_{j,k}$ be one when both qubit j and qubit j + k are $|1\rangle$, and zero otherwise. The change of phase from qubit j is then
$$\exp\left(2\pi i \sum_{k=1}^{b-1} \frac{p_{j,k}}{2^{k+1}}\right).$$
Based on this observation, Nam et al.'s [26] method prepares the following ancilla qubits.
1) b − 1 qubits for storing $|p_{j,k}\rangle$.
2) An auxiliary quantum state $|\phi\rangle_b$.
3) b − 1 ancilla qubits for storing carries in the adder.
For each control qubit j, the method then proceeds as follows.
1) First, $p_{j,k}$ is calculated and stored in an ancilla qubit. Specifically, the AND of qubit j and qubit j + k is calculated using the relative Toffoli gate.
2) Let $|p\rangle_{b-1}$ be $|p_{j,1} \ldots p_{j,b-1}\rangle$.
3) Then, $|p\rangle_{b-1}$ is added into $|\phi\rangle_b$ using Gidney's [28] ripple-carry adder. After this calculation, the register $|\phi\rangle_b$ changes into $\exp(2\pi i\, p/2^b)|\phi\rangle_b$.
4) Finally, $p_{j,k}$ is reset by measurement.
With this method, the T-count is 8n log(1/ε_q) with 3 log(1/ε_q) ancilla qubits. For each control qubit j, this method requires 4b T gates in the calculation of $|p_{j,k}\rangle$ and 4b T gates in the adder. Therefore, this method requires 8n log(1/ε_q) T gates in total.

4) APPROXIMATE QFT OF CLEVE AND WATROUS
We now explain Cleve and Watrous's [27] approximate QFT. The QFT on $|x\rangle_n$ realizes the following state:
$$\mathrm{QFT}|x\rangle_n = \frac{1}{\sqrt{2^n}} \bigotimes_{k=1}^{n} \left(|0\rangle + \exp\left(\frac{2\pi i\, x}{2^k}\right)|1\rangle\right). \quad (2)$$
Now, let b be log(1/ε_q), as in Section II-C3. In the approximate QFT, the phases of $|1\rangle$ in state (2) are replaced by approximations that keep only the contributions of controlled $P_j$ gates with $j \le b$. Their approximate QFT calculates the Fourier-basis state of all qubits simultaneously. In more detail, their method applies the phases corresponding to every qubit k simultaneously. To realize this calculation, we prepare $|x\rangle_n |0\rangle_{n(b-1)} |0\rangle_n |0\rangle_{n(b-1)}$, where the first register is an input value, the second register is b − 1 ancilla qubits for each input qubit, the third register is an output value, and the fourth register is b − 1 ancilla qubits for each output qubit. The approximate QFT is calculated as follows.
1) An H gate is applied to qubit k in the third register.
2) The value of this qubit k is copied onto the corresponding b − 1 ancilla qubits in the fourth register.
4) The values in the second and fourth registers are then erased.
We then obtain the input value in the first register and its approximate Fourier-basis state in the third register. We erase the value of the first register by applying the inverse method from the third register to the first register.
The previous method requires 2n log(1/ε_q) qubits, 2n log(1/ε_q) controlled $P_j$ gates, and a controlled-$P_j$ depth of two. As discussed later in Section III-A, the desired $\varepsilon_q$ is $\Theta(n^{-2})$. Thus, we require 4n log n qubits to run these gates simultaneously, which is more than Nam et al.'s [26] approximate QFT requires. However, this method calculates all $P_j$ gates simultaneously and requires O(log n) CNOT depth and a controlled-$P_j$ depth of two. Therefore, it has a smaller depth than Nam et al.'s [26] approximate QFT.

D. PREVIOUS MULTIPLIER
In this section, we explain Rines and Chuang's [18] controlled modular multiplier using Montgomery reduction. The controlled modular multiplier is the calculation given as
$$|c\rangle |y\rangle_n \mapsto |c\rangle |X^c y \bmod N\rangle_n,$$
where X is a classical n-bit number. This controlled modular multiplier consists of two modular multipliers without control. First, the controlled modular multiplier is decomposed into modular multipliers without control in Section II-D1.
Realizing modular multipliers without control is discussed in Section II-D2.Finally, the calculation cost of a modular multiplier is explained in Section II-D3.

1) DECOMPOSITION OF CONTROLLED MODULAR MULTIPLIER
In this decomposition, $|c\rangle |y\rangle_n |0\rangle_n$ is prepared. The first register $|c\rangle$ is a control qubit. The controlled modular multiplier is realized as follows.
1) The second and third registers are swapped when the control qubit is $|0\rangle$.
2) Next, X times the second register is added into the third register.
3) The second and third registers are swapped when the control qubit is $|1\rangle$.
4) Then, $-X^{-1} \bmod N$ times the second register is added into the third register. In practice, we calculate the inverse of the calculation that adds $X^{-1} \bmod N$ times the second register into the third register.
5) The second and third registers are swapped when the control qubit is $|0\rangle$.
The resulting state is $|c\rangle |X^c y \bmod N\rangle_n |0\rangle_n$.
In the previous calculation, operations 2) and 4) are modular multipliers without control, and each of them requires $\Theta(n^2 \log n)$ T-count and $\Theta(n \log n)$ T-depth. In contrast, operations 1), 3), and 5) each require O(n) T-count and O(n) T-depth. Therefore, operations 1), 3), and 5) are negligible compared with operations 2) and 4). In conclusion, the cost of a controlled modular multiplier is almost that of two modular multipliers without control.
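The five-step decomposition can be checked with a classical simulation on basis states. The sketch below assumes that the swaps in steps 1) and 5) act when the control bit is 0 and the swap in step 3) acts when it is 1, which is the only assignment that returns the third register to zero; the function name is illustrative, and gcd(X, N) = 1 is required so that X has an inverse.

```python
def controlled_modmul(c, y, X, N):
    """Classically trace |c>|y>|0> through the five-step decomposition.

    Returns the final (second, third) register values.
    """
    second, third = y, 0
    if c == 0:                                  # step 1: swap when control is 0
        second, third = third, second
    third = (third + X * second) % N            # step 2: add X * second
    if c == 1:                                  # step 3: swap when control is 1
        second, third = third, second
    x_inv = pow(X, -1, N)                       # modular inverse (Python 3.8+)
    third = (third - x_inv * second) % N        # step 4: add -X^{-1} * second
    if c == 0:                                  # step 5: swap when control is 0
        second, third = third, second
    return second, third

# Example with N = 15 and X = 7 (coprime to 15).
results_on = [controlled_modmul(1, y, 7, 15) for y in range(15)]
results_off = [controlled_modmul(0, y, 7, 15) for y in range(15)]
```

In both cases the third register ends at zero, so it can be reused by the next multiplier; only steps 2) and 4) involve multiplication, which matches the cost statement above.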

2) CONSTRUCTION OF MODULAR MULTIPLIER
First, Montgomery reduction, which realizes efficient modular multiplication, is explained. Consider the multiplication of two n-bit integers x and y. Let R be a constant, and transform x and y into xR mod N and yR mod N. We then calculate the multiplication modulo N by
$$(xR \bmod N)(yR \bmod N)R^{-1} \equiv xyR \pmod{N}.$$
This calculation is efficient when R is a power of two. Let m be log n, and let R be $2^m$ for simplicity.
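The identity behind the Montgomery representation can be verified classically. The following sketch computes the product with an explicit modular inverse of R for clarity, so it demonstrates the representation rather than an efficient reduction; the function names and the sample values N = 21 and R = 32 are illustrative.

```python
def to_montgomery(x, R, N):
    """Map x to its Montgomery form xR mod N."""
    return (x * R) % N

def montgomery_product(a_bar, b_bar, R, N):
    """Return a_bar * b_bar * R^{-1} mod N for Montgomery-form inputs."""
    r_inv = pow(R, -1, N)  # requires gcd(R, N) = 1, true for odd N and R = 2**m
    return (a_bar * b_bar * r_inv) % N

N, R = 21, 32
x, y = 5, 11
product_bar = montgomery_product(to_montgomery(x, R, N),
                                 to_montgomery(y, R, N), R, N)
# The result is the Montgomery form of x*y, so chained products stay in form.
expected_bar = to_montgomery((x * y) % N, R, N)
```

Because the output is again in Montgomery form, a sequence of multiplications needs the conversion only at the start and at the end.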
We now explain how to realize a modular multiplier. This method requires m + 1 more ancilla qubits and realizes transformation (7). The modular multiplier is calculated as follows.
1) Multiplication: We calculate a multiplication that takes state (8) to state (10) through state (9), where t = (XR mod N)(yR mod N). Naively, the following are performed in this order.
a) A QFT on the second register, state (8) → (9).
b) Fourier-basis multiplication, which is represented as "Mult," state (9) → (10).
However, to reduce the T-depth, these calculations are performed simultaneously. The previous calculation consists of H gates and controlled $P_j$ gates. In controlled $P_j$ gates, the control qubit and the target qubit are interchangeable. Therefore, all the controlled $P_j$ gates whose control qubit is in the second register of state (8) are performed simultaneously.
2) Reduction: We apply Montgomery reduction on the second register of state (10) and focus only on this register. We then conduct the following.
a) Estimation: We multiply by $2^{-m} \bmod N$ in this procedure. In more detail, we perform the following for j = 1 to m.
i) The lower-order j − 1 qubits are disregarded.
ii) The H gate is applied on the LSB.
iii) The LSB is set as a control qubit. Then, (N − 1)/2 is subtracted from the remaining qubits with this control qubit.
The calculation for each j corresponds to division by two. The result is state (11). The first register of state (11) takes values from −(N − 1) to N − 1 when the MSB of this register is regarded as the sign bit. This value is corrected to the range from 0 to N − 1 in the Extraction and Correction procedures.
b) Extraction: The sign qubit on the first register of state (11) is extracted, and only this register is focused on. $\mathrm{QFT}^{-1}$ is applied, and the MSB is extracted as the sign qubit. Moreover, QFT is applied to the lower-order n qubits. This procedure changes state (11) into state (12), where $|s\rangle$ is the sign qubit. In Rines and Chuang's [18] construction, $\mathrm{QFT}^{-1}$, the first operation, is combined with the previous Estimation procedure, and QFT, the second operation, is combined with the later Correction procedure.
c) Correction: The exact $t2^{-m} \bmod N$ is calculated in this procedure. Here, the focus is on all registers of state (12). The first register is set as a control qubit, and N is added to the second register using this control qubit. Then, the CNOT gate whose control qubit is the LSB of the second register and whose target qubit is the first register is applied. Moreover, the first register is added as the MSB of the third register. The resulting state is state (13).
3) Uncomputation: The previous procedure yields state (14) because t = (XR mod N)(yR mod N). The focus is on all registers of state (14). First, a QFT is applied to the third register. We then simultaneously conduct the following, as in Multiplication.
a) An inverse QFT is applied to the second register.
b) Then, $(XR \bmod N)N^{-1} \bmod 2R$ times the first register is subtracted from the third register, which is represented as "Inv mult."
Finally, the third register is reset to $|0\rangle$ by applying H gates to these qubits.
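The Estimation and Correction procedures have a simple classical analogue, sketched below. It assumes N is odd and that the input satisfies $0 \le t < 2^m N$; the quantum circuit performs the corresponding steps in the Fourier basis, so this is only an arithmetic illustration with hypothetical function names.

```python
def montgomery_estimate(t, N, m):
    """Divide t by 2**m modulo an odd N, one bit at a time.

    When the LSB is 1, N is subtracted first so that halving is exact.
    This equals dropping the LSB and subtracting (N - 1) / 2.
    The result lies strictly between -N and N when 0 <= t < 2**m * N.
    """
    for _ in range(m):
        if t % 2 == 1:
            t -= N
        t //= 2
    return t

def montgomery_reduce(t, N, m):
    """Estimation followed by the sign-based Correction (add N if negative)."""
    estimate = montgomery_estimate(t, N, m)
    if estimate < 0:
        estimate += N
    return estimate

N, m = 21, 5
r_inv = pow(2 ** m, -1, N)
reduced = [montgomery_reduce(t, N, m) for t in range(2 ** m * N)]
```

Every entry of `reduced` equals `t * r_inv % N` for its input t, and the sign of the estimate is the only information the correction step needs, which is why the circuit extracts a single sign qubit.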

3) COMPUTATIONAL COST OF A MODULAR MULTIPLIER
In the previous construction, controlled $P(\theta)$ gates, each of which requires approximately one $P(-\theta/2)$ gate [18], are the only non-Clifford gates. A controlled $P(\theta)$ gate naively requires one $P(-\theta/2)$ gate and two $P(\theta/2)$ gates, as shown in Fig. 3. We count only the one $P(-\theta/2)$ gate between the two CNOT gates. The other two $P(\theta/2)$ gates can be combined with other phase gates applied to the same qubit if there is no H gate between them. This merging preserves correctness because phase gates do not change the value of a qubit. Therefore, the computational cost of the two $P(\theta/2)$ gates is small compared with that of the $P(-\theta/2)$ gate in Fig. 3.
According to the earlier discussion, a controlled phase gate requires almost one phase gate. A phase gate requires 3 log(1/ε_a) T gates, where $\varepsilon_a$ is the precision of phase gates [20]. In more detail, $\varepsilon_a$ is the upper bound of $\|U - P(\theta)\|$, where U is the approximation of $P(\theta)$. By using this precision $\varepsilon_a$, the T-count of Rines and Chuang's [18] construction is $9n^2 \log(1/\varepsilon_a)$ and its T-depth is $12n \log(1/\varepsilon_a)$, as shown in Table 2.
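The decomposition of a controlled phase gate into one $P(-\theta/2)$ gate between two CNOT gates plus two $P(\theta/2)$ gates can be verified with small matrices. The sketch below assumes the gate order CNOT, $P(-\theta/2)$ on the target, CNOT, then $P(\theta/2)$ on both qubits, which may differ in layout from Fig. 3; all helper names are illustrative.

```python
import cmath

def kron(a, b):
    """Kronecker product of two square matrices given as nested lists."""
    return [[a[i][j] * b[k][l] for j in range(len(a)) for l in range(len(b))]
            for i in range(len(a)) for k in range(len(b))]

def matmul(a, b):
    """Matrix product of two square matrices of equal size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def phase(theta):
    return [[1, 0], [0, cmath.exp(1j * theta)]]

IDENTITY = [[1, 0], [0, 1]]
# CNOT with the first qubit as control; basis order |00>, |01>, |10>, |11>.
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

def controlled_phase_decomposed(theta):
    """CNOT, P(-theta/2) on the target, CNOT, then P(theta/2) on both qubits."""
    middle = matmul(CNOT, matmul(kron(IDENTITY, phase(-theta / 2)), CNOT))
    return matmul(kron(phase(theta / 2), phase(theta / 2)), middle)

def controlled_phase(theta):
    """Reference matrix diag(1, 1, 1, exp(i*theta))."""
    m = [[0] * 4 for _ in range(4)]
    for i in range(4):
        m[i][i] = 1
    m[3][3] = cmath.exp(1j * theta)
    return m

def max_difference(a, b):
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))
```

For any angle the two matrices agree up to rounding error. Since the two outer $P(\theta/2)$ gates are diagonal, they commute with neighboring diagonal gates on the same qubit, which is what allows them to be merged.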

III. IMPROVEMENT OF MONTGOMERY MULTIPLICATION BY APPLYING APPROXIMATE QFTS
In this section, we explain how to improve the Montgomery multiplication proposed by Rines and Chuang [18]. Their quantum circuit is improved while keeping the dominant term of the number of qubits at 2n, almost the same as in their construction. Tables 3 and 4 show the T-count and T-depth of the proposed Montgomery multiplication, respectively. This section mainly discusses the dominant terms. We now explain how to realize an implementation with the computational cost shown in Tables 3 and 4.
Before explaining the proposed method, how to set thresholds ε q and ε a is discussed in Section III-A.Previous research used the thresholds ε q [22], [26] and ε a [18].These thresholds are discussed more rigorously to apply both approximations to the Montgomery multiplication.
In the remaining sections, the proposed method is described.In these discussions, m is log n and R is 2 m for simplicity, as in Section II-D2.
First, a modification of Rines and Chuang's [18] controlled modular multiplication is proposed because their method cannot calculate correctly when using Montgomery reduction. A method for calculating the controlled modular multiplication correctly with Montgomery reduction is provided in Section III-B.
Next, the computational cost of the naive application of approximate QFTs is evaluated in Section III-C. Because approximate QFTs are applied, the T-count is only one-third that of Rines and Chuang's [18] original implementation.
Finally, the T-depth of the previous quantum circuit is minimized in Section III-D. The method minimizes the T-depth by a new approximate QFT proposed in Section III-D1. Moreover, the proposed method minimizes the T-depth by parallelizing controlled phase gates using ancilla qubits, as discussed in Section III-D2. Section III-D3 then shows that the proposed Montgomery multiplication requires only one-fourth of the T-depth of Rines and Chuang's [18] original implementation.

A. EVALUATING APPROXIMATION ERROR FROM THRESHOLDS
In this section, the thresholds $\varepsilon_q$ and $\varepsilon_a$ in Shor's algorithm are discussed. In particular, the approximation error arising from these thresholds is considered. First, each approximation error is analyzed in Section III-A1 so that these errors can be handled easily. Next, the approximation error in Shor's algorithm from the thresholds $\varepsilon_q$ and $\varepsilon_a$ is evaluated in Section III-A2. It is then shown that Shor's algorithm requires $\varepsilon_q = \Theta(n^{-2})$ and $\varepsilon_a = \Theta(n^{-3})$ for outputting the correct answer with a high probability.

1) APPROXIMATION ERROR IN EACH GATE
This section addresses the thresholds $\varepsilon_q$ and $\varepsilon_a$, which deal with different types of error. The threshold $\varepsilon_q$ addresses only phase error. The threshold $\varepsilon_a$ comprises flip error in addition to phase error, and these errors are treated as dependent in previous research [29]. Here, these errors are decomposed into flip error and phase error so that they can be addressed independently. In particular, it is shown that $O(\varepsilon_a^2)$ flip error and $O(\varepsilon_a)$ phase error occur in a phase gate.
We now review the threshold $\varepsilon_a$ [29]. This threshold is used for approximating a phase gate $P(\theta)$. Let U be the matrix that approximates the $P(\theta)$ gate. The matrix of the $P(\theta)$ gate is written in a representation that is consistent with Section II-A up to a global phase. Let u be $(\alpha, \beta)^T$ and z be $(x, y)^T$. Moreover, let $\phi$ be the angle between the two vectors u and z. This angle $\phi$ corresponds to phase error. The threshold $\varepsilon_a$ then satisfies the relation shown by Selinger [29].

2) APPROXIMATION ERROR IN SHOR'S ALGORITHM
Next, the approximation error from the thresholds $\varepsilon_q$ and $\varepsilon_a$ is evaluated for Shor's algorithm. In particular, it is shown that Shor's algorithm requires $\varepsilon_q = \Theta(n^{-2})$ and $\varepsilon_a = \Theta(n^{-3})$ to output the correct answer with a high probability.
A quantum circuit of the Montgomery multiplication adopts the thresholds $\varepsilon_q$ and $\varepsilon_a$ on controlled phase gates. These thresholds cause phase error and flip error.
The analysis of flip error is simple because it focuses only on the correct state without flip. Let $M_f$ be the number of all controlled phase gates. The probability of no flip error is then
$$\left(1 - O(\varepsilon_a^2)\right)^{M_f} \quad (23)$$
because the probability of no flip error in each controlled phase gate is $1 - O(\varepsilon_a^2)$. Next, the phase error is examined. Phase error is calculated on the following procedure, which is iterated in Shor's algorithm:
1) approximate QFT;
2) actual calculation on the Fourier basis;
3) approximate inverse QFT.
It is assumed that the dimension of the approximate QFT is n. Moreover, let $M_p$ be the maximal number of controlled $P_j$ gates on each qubit.
First, an approximate QFT is considered. In this calculation, phase error arises from the thresholds $\varepsilon_q$ and $\varepsilon_a$. The maximal value of phase error from the threshold $\varepsilon_q$ in each qubit is at most $2\pi i \varepsilon_q$. Moreover, the maximal value of phase error from the threshold $\varepsilon_a$ is $iO(\varepsilon_a)$ in each controlled phase gate. There are at most log(1/ε_q) controlled phase gates on each qubit.
Therefore, the maximal value of phase error from the threshold $\varepsilon_a$ in each qubit is $iO(\varepsilon_a)\log(1/\varepsilon_q)$. In conclusion, the phase error in each qubit is $2\pi i(\varepsilon_q + O(\varepsilon_a)\log(1/\varepsilon_q))$.
Next, we consider the actual calculation on the Fourier basis. In the previous discussion, controlled $P_j$ gates in QFTs are omitted. However, we cannot omit phase gates in this procedure because the phase is not always small. The only error source in each qubit is the phase error from the threshold $\varepsilon_a$. Therefore, the phase error in each qubit is $M_p O(\varepsilon_a)$.
Finally, an inverse approximate QFT is considered. Similar to Step 1), the phase error in each qubit is $2\pi i(\varepsilon_q + O(\varepsilon_a)\log(1/\varepsilon_q))$. By gathering the above phase errors, the total phase error in each qubit is
$$4\pi i\left(\varepsilon_q + O(\varepsilon_a)\log(1/\varepsilon_q)\right) + i M_p O(\varepsilon_a). \quad (25)$$
We now discuss how phase error affects the calculation.
Phase error is converted into flip error by an approximate inverse QFT. In the approximate inverse QFT, we transform the Fourier basis into the standard basis from the LSB to the MSB. In particular, each qubit with phase error $\varepsilon$ is recovered by the transformations (26) to (28), where $y \in \{0, 1\}$. The probability $p_I(\varepsilon)$ of obtaining the $|\mathrm{Incorrect}\rangle$ state is then bounded linearly in $\varepsilon$ because $p_I(0) = 0$ and $p_I'(\varepsilon) = \pi\sqrt{(1 + \cos\varepsilon)/8}$. Therefore, the flip error arising from phase error is almost the same as the error (25), which gives the probability of no flip error (32).
We now combine probabilities (23) and (32) and calculate the probability of obtaining the desired quantum states in Shor's algorithm. In Shor's algorithm, the number of controlled $P_j$ gates is $O(n^3)$ [18], which means $M_f = O(n^3)$. Modular exponentiation uses O(n) controlled modular multipliers [1], [30], each with O(1) QFTs and O(n) controlled $P_j$ gates on each qubit. In particular, one controlled modular multiplier outputs the desired state with probability (32), where $M_p = O(n)$. The probability of obtaining the desired quantum states in Shor's algorithm is therefore given by (33). This probability (33) is $\Theta(1)$ when $\varepsilon_q = O(n^{-2})$ and $\varepsilon_a = O(n^{-3})$.
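As a rough check of the flip-error part of this argument, the bound can be worked out explicitly. The derivation below assumes that the flip probability of each controlled phase gate is at most $c\,\varepsilon_a^2$ for some constant c, which stands in for the O(·) notation and is an illustrative symbol rather than one from the analysis above; it then applies Bernoulli's inequality.

```latex
\begin{align*}
\left(1 - c\,\varepsilon_a^{2}\right)^{M_f}
  &\ge 1 - c\,M_f\,\varepsilon_a^{2}
  && \text{(Bernoulli's inequality, } 0 \le c\,\varepsilon_a^{2} \le 1\text{)} \\
  &= 1 - O\!\left(n^{3}\right) O\!\left(n^{-6}\right)
  && \text{(} M_f = O(n^{3}),\ \varepsilon_a = O(n^{-3})\text{)} \\
  &= 1 - O\!\left(n^{-3}\right).
\end{align*}
```

Under these assumptions the flip-error contribution vanishes as n grows, so the constant success probability is determined by the phase-error terms.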
In the previous discussion, the focus is on modular exponentiation. However, Shor's algorithm obtains its output through an inverse QFT. Therefore, we should consider the effect of approximation on this inverse QFT. Let r be the order of y modulo N, namely, r is the minimal natural number satisfying $y^r = 1 \bmod N$.
Consider the calculation of Shor's algorithm [1] in which the pair (j, k) appears with amplitude $p_{j,k}$, where $p_{j,k}$ takes a high value when $y^j \bmod N = k$ and a low value otherwise. In the previous discussion, only the probability of taking a pair (j, k) satisfying $y^j \bmod N = k$ is evaluated. To clarify the effect of approximation, we must evaluate the probability of the other pairs. However, this probability is difficult to analyze because it involves exponentially many branches. To analyze the probability of obtaining the correct output, Assumption 1 is used. Assumption 1 fixes the values of $p_{j,k}$ using a constant $C = \Theta(1)$. It means that all desired pairs take a high probability, and the other pairs take the same small probability. The number of desired pairs is $2^{3n}/r$, namely, there are $2^{2n}/r$ values of j for each k. Moreover, it is assumed that the other pairs take the same probability because the phases in the transformations from (26) to (28) are almost random in the later calculation. How well Assumption 1 reflects the actual calculation must be evaluated in future work. Here, the effect of approximation is discussed based on Assumption 1.
By using Assumption 1, Shor's algorithm transforms state (39) into state (42). Let $p_k$ be $\sum_{j=0}^{2^{2n}-1} p_{j,k} \exp\left(-2\pi i \frac{l}{2^{2n}} j\right)$, which corresponds to (38). In state (42), only the values of k in the set $\{y^j \mid 0 \le j \le r - 1\}$ remain for the following reasons.
1) When k does not satisfy $y^j \bmod N = k$ for any j, $p_k = 0$ because $p_{j,k}$ is a constant value by Assumption 1.
2) When some j satisfies $y^j \bmod N = k$, $p_k$ is given by approximation (43). Approximation (43) holds because $p_{j,k}$ is a constant value when $y^j \bmod N = k$ by Assumption 1, and the phases are distributed uniformly from 0 to $2\pi$.
Therefore, when we apply approximation, we obtain state (37) with $p_k$ satisfying (43). Equations (38) and (43) differ only in the constant term C. Thus, the exact and approximate Shor's algorithms have almost the same output: they obtain l near a multiple of $2^{2n}/r$ with high probability. Similar reasoning is applicable to Ekerå and Håstad's [30] construction: the states that are not calculated in the exact algorithm cancel. Therefore, Shor's algorithm outputs the correct answer with a high probability when $\varepsilon_q = \Theta(n^{-2})$ and $\varepsilon_a = \Theta(n^{-3})$. These $\varepsilon_q$ and $\varepsilon_a$ can vary in each procedure. Such optimization may reduce the computational cost further, but this study focuses on constant $\varepsilon_q$ and $\varepsilon_a$.

B. MODIFICATION OF THE PREVIOUS CONTROLLED MODULAR MULTIPLICATION
In this section, a small modification of Rines and Chuang's [18] controlled modular multiplication is explained. They proposed the modular multiplication given as transformation (7). However, we cannot directly apply this modular multiplication to the controlled modular multiplication given in Step 2 in Section II-D1. In more detail, the result we actually obtain differs from the correct calculation. This difference occurs because previous researchers did not consider Montgomery reduction on $|0\rangle |0\rangle_n |yR \bmod N\rangle_n$.
To correct this miscalculation, we add an m-bit left shift (multiplication by $R = 2^m$) before applying the modular multiplication. This calculation is realized by Algorithm 3. In Algorithm 3, Steps 2-4 can be parallelized. Therefore, the number of gates is O(n) and the depth is O(n/ log n) in Algorithm 3, which is negligible compared with the other calculations.

C. NAIVE APPLICATION OF APPROXIMATE QFTS
This section evaluates the Montgomery multiplication using approximate QFTs, namely phase gates whose phase is less than ε q are omitted.We use the label of qubits in QFT similar to that in Section II-C.In the Montgomery multiplication, two types of QFT exist: log n-bit QFT and n-bit QFT.In the former log n-bit QFT, we use all gates.Therefore, we only focus on the approximation of n-bit QFT.By this approximation, the following costs decrease: 1) an n-bit QFT in Multiplication; 2) two n-bit QFTs in Reduction; 3) an n-bit inverse QFT in Uncomputation.
First, the T-count is evaluated. As Algorithm 2 shows, the n-qubit approximate QFT applies at most log(1/ε_q) $P_k$ gates in Step 3. That is, an approximate QFT applies at most log(1/ε_q) controlled $P_k$ gates to each qubit. As noted in Section II-D3, each controlled $P_k$ gate requires 3 log(1/ε_a) T gates [20]. Therefore, an approximate QFT requires 3n log(1/ε_q) log(1/ε_a) T gates. Thus, using approximate QFTs, the T-count of the Montgomery multiplication is only one-third that of Rines and Chuang's [18] original implementation.
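The T-count estimate of a single QFT can be tabulated as follows. This sketch assumes base-2 logarithms, sets the hidden constants in the thresholds to one, and counts n(n − 1)/2 controlled phase gates for the exact QFT of Algorithm 1; it covers one QFT only, not the whole Montgomery multiplication, and the function name is hypothetical.

```python
import math

def qft_t_count(n, eps_a, eps_q=None):
    """Estimate the T-count of one n-qubit QFT.

    Each controlled phase gate is charged 3 * log2(1 / eps_a) T gates.
    eps_q=None counts the n(n-1)/2 gates of the exact QFT; otherwise the
    per-qubit bound log2(1 / eps_q) from the text is used.
    """
    t_per_gate = 3 * math.log2(1 / eps_a)
    if eps_q is None:
        num_gates = n * (n - 1) / 2
    else:
        num_gates = n * math.log2(1 / eps_q)
    return num_gates * t_per_gate

n = 2048
eps_q = n ** -2   # assumed constant factor of one in the threshold
eps_a = n ** -3   # assumed constant factor of one in the threshold
exact_cost = qft_t_count(n, eps_a)
approx_cost = qft_t_count(n, eps_a, eps_q)
```

For n = 2048 the approximate estimate equals 3n log(1/ε_q) log(1/ε_a) = 4,460,544 T gates, far below the exact-QFT estimate, because the number of controlled phase gates per qubit no longer grows with n.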
Next, the T-depth is evaluated. In the previous construction, all controlled $P_j$ gates whose control qubit is in the second register of state (8) are applied simultaneously. When we apply approximate QFTs, the number of phase gates with the same control qubit decreases. However, the T-depth for each control qubit does not change. Therefore, adopting naive approximate QFTs leaves the T-depth unchanged.

D. OPTIMIZATION OF MONTGOMERY MULTIPLICATION
This section minimizes the $T$-depth of the Montgomery multiplication by using the following: 1) a new approximate QFT with a smaller $T$-depth (see Section III-D1); 2) a method for parallelizing controlled phase gates in the "Inv mult" procedure by using ancilla qubits (see Section III-D2).
We then evaluate the computational cost obtained with these techniques in Section III-D3.

1) PROPOSED APPROXIMATE QFT
We now propose a new approximate QFT whose $T$-depth is smaller than that of Nam et al. [26]. Their method minimizes the number of $T$ gates by replacing phase gates with Gidney's [28] adder. However, the same calculation can be realized by other adders because this calculation only adds two numbers without using any specific property of Gidney's [28] adder. The proposed method adopts Draper et al.'s [16] adder using Gidney's [28] relative Toffoli gates, which was discussed by Thapliyal et al. [31] and reviewed by Oonishi et al. [23]. The proposed method is shown in Algorithm 4. At the cost of more qubits and $T$ gates, this adder reduces the $T$-depth: Draper et al.'s [16] adder requires $O(\log n)$ $T$-depth, whereas Gidney's [28] adder requires $O(n)$ $T$-depth. Table 5 shows the computational cost of the proposed approximate QFT.

4: Apply the H gate on the qubit $j$.
5: for $k = 1$ to $\min(\log(1/\varepsilon_q) - 1, n - j)$ do
6: Calculate the AND operation with the qubit $j$ and the qubit $j + k$ using the relative Toffoli gate. The result is represented as $|p_{j,k}\rangle$.
7: end for
8: Let $|p\rangle_{b-1}$ be $|p_{j,1} \ldots p_{j,b-1}\rangle$.
9: $|p\rangle_{b-1}$ is added into $|\phi\rangle_b$ by Draper et al.'s [16] adder using Gidney's [28] relative Toffoli gates.
10: $|p\rangle_{b-1}$ is reset by measurement.
11: end for
12: return AQFT$|\psi\rangle_n$

First, regarding the number of qubits, the previous method uses $b - 1$ qubits for storing $|p_{j,k}\rangle$, an auxiliary quantum state $|\phi\rangle_b$, and $b - 1$ ancilla qubits for storing carries in the adder. Draper et al.'s [23] adder uses $2.5b$ ancilla qubits in addition to two $b$-qubit numbers. Therefore, the proposed method requires $4.5b$ ancilla qubits.
Next, regarding the $T$-count, the previous method requires the following $T$ gates in the calculation for each control qubit: 1) $4b$ $T$ gates in the calculation of $|p_{j,k}\rangle$; 2) $4b$ $T$ gates in the adder.
Draper et al.'s [23] adder uses $28b$ $T$ gates in the addition of two $b$-qubit numbers. Therefore, the proposed method requires $32b$ $T$ gates.
Finally, the $T$-depths of the previous and proposed methods are discussed. The previous study did not investigate the $T$-depth, so the $T$-depth of the previous method is evaluated here. The previous method consists of storing $|p_{j,k}\rangle$, addition, and resetting $|p_{j,k}\rangle$. This addition requires $2b$ $T$-depth for each control qubit. Now, the remaining procedures, storing and resetting $|p_{j,k}\rangle$, are discussed. Fig. 5 shows a method for storing $|p_{j,k}\rangle$, and Fig. 6 shows a method for resetting $|p_{j,k}\rangle$. For storing $|p_{j,k}\rangle$, $b$ circuits of Fig. 5 are used. The first qubit $|x_{n-j}\rangle$ of Fig. 5 is common to these $b$ circuits, and the other two qubits differ between these $b$ circuits. Therefore, we can parallelize the gates excluding the CNOT gate surrounded by a dotted line, and the $T$-depth of these gates is $O(1)$. Moreover, the $b$ CNOT gates surrounded by a dotted line, whose depth is $b$, are used. This depth is negligible compared with the other depth because a CNOT gate is a Clifford gate. Similar reasoning is applicable to resetting $|p_{j,k}\rangle$ because a controlled-$Z$ gate is a Clifford gate. In conclusion, Nam et al.'s [26] QFT requires $2b$ $T$-depth.
The proposed method consists of storing $|p_{j,k}\rangle$, addition, and resetting $|p_{j,k}\rangle$, similar to the previous method. This addition requires $8\log b$ $T$-depth for each control qubit [31]. The remaining procedures are storing and resetting $|p_{j,k}\rangle$. For storing $|p_{j,k}\rangle$, we run the $b$ CNOT gates surrounded by a dotted line in Fig. 5. Naively, this procedure requires $b$ CNOT-depth, and this depth is nonnegligible compared with the $8\log b$ $T$-depth of the addition. However, we reduce this depth by using the $b$ ancilla qubits prepared for the adder. Specifically, we can store $|p_{j,k}\rangle$ as follows. Therefore, this depth is negligible compared with the depth of the addition. Similar reasoning is applicable to resetting $|p_{j,k}\rangle$ because a controlled-$Z$ gate is a Clifford gate, as shown in Fig. 6. In conclusion, the proposed QFT requires $8\log b$ $T$-depth.
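The per-control-qubit costs quoted in this subsection can be tabulated side by side. The sketch below is illustrative only: the class and function names are hypothetical, logarithms are assumed to be base 2, and the ancilla count of the previous method assumes that the auxiliary state $|\phi\rangle_b$ occupies $b$ qubits.

```python
import math
from dataclasses import dataclass


@dataclass
class AqftCost:
    """Cost attributed to one control qubit of the adder-based approximate QFT."""
    ancilla_qubits: float
    t_count: float
    t_depth: float


def previous_cost(b: int) -> AqftCost:
    """Nam et al. [26] construction with Gidney's adder, as summarized in the text."""
    return AqftCost(
        # (b-1) for |p_{j,k}>, b for |phi>_b (assumed width), (b-1) carry ancillas.
        ancilla_qubits=(b - 1) + b + (b - 1),
        t_count=4 * b + 4 * b,  # AND computations plus adder
        t_depth=2 * b,          # dominated by the ripple-carry addition
    )


def proposed_cost(b: int) -> AqftCost:
    """Proposed construction with Draper et al.'s adder, as summarized in the text."""
    return AqftCost(
        ancilla_qubits=4.5 * b,       # two b-qubit numbers plus 2.5b adder ancillas
        t_count=4 * b + 28 * b,       # AND computations plus adder
        t_depth=8 * math.log2(b),     # logarithmic-depth addition [31]
    )


if __name__ == "__main__":
    for b in (16, 64, 256):  # hypothetical values of b
        print(b, previous_cost(b), proposed_cost(b))
```

Under these formulas the two $T$-depths coincide at $b = 16$, and the proposed construction has the smaller $T$-depth for larger $b$, at the price of four times the $T$-count.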

2) PARALLELIZING CONTROLLED PHASE GATES
This section proposes a method for parallelizing controlled phase gates by using ancilla qubits, which is similar to Cleve and Watrous's [27] study. The $T$-depth is reduced in the following procedures: 1) "Estimation" in the Reduction procedure; 2) "Correction" in the Reduction procedure; 3) "Inv mult" in the Uncomputation procedure.
(46)
Therefore, by using $k\log n$ ancilla qubits, the $T$-depth decreases as in Table 4. In more detail, we generate
$$\frac{k\log n}{\min(\#(\text{control qubits}), \#(\text{target qubits}))} \tag{47}$$
copies of the target or control qubits, whichever has fewer qubits. As many controlled phase gates as possible are then applied simultaneously.

3) COMPUTATIONAL COST OF PROPOSED MONTGOMERY MULTIPLICATION
In Sections III-D1 and III-D2, the $T$-depth of the Montgomery multiplication is minimized. In this construction, QFTs are run independently of the other procedures, and "mult" in the Multiplication procedure becomes the dominant term of the $T$-depth. This $T$-depth is only one-fourth of that of Rines and Chuang's [18] original implementation. The computational cost of the proposed method is given in Tables 3 and 4.

IV. APPLICATION OF PROPOSED MONTGOMERY MULTIPLICATION TO SHOR'S ALGORITHM
In this section, the proposed Montgomery multiplication is applied to Shor's algorithm. First, the proposed Montgomery multiplication is applied directly to Shor's algorithm in Section IV-A. Then, the $T$-depth is minimized in Section IV-B. Moreover, a quantum circuit with the smallest KQ is proposed in Section IV-C using Oonishi et al.'s [23] method. Before evaluating the computational cost in each setting, we review the total cost of Shor's algorithm. Shor's algorithm consists of modular exponentiations and an inverse QFT. Modular exponentiations, which have a higher computational cost than the inverse QFT, require $3n$ Montgomery multiplications [30] as follows.
2) The controlled Montgomery multiplication requires two Montgomery multiplications [18].
In the previous construction, only one control qubit is required for the controlled Montgomery multiplications owing to qubit recycling [32]. We now discuss the computational cost of Shor's algorithm.

A. DIRECT APPLICATION TO PROPOSED MONTGOMERY MULTIPLICATION
This section evaluates the direct application of the proposed Montgomery multiplication. Table 4 and the discussion in Section III-D reveal that the proposed Montgomery multiplication requires $9n^2\log n$ $T$ gates and $9n\log n$ $T$-depth, where $k = \omega(1)$. Moreover, the proposed Montgomery multiplication requires $2n + k\log n$ qubits. This $k\log n$ term is negligible compared with $2n$ when $k = o(n/\log n)$, and there exists $k$ satisfying both $k = \omega(1)$ and $k = o(n/\log n)$. Therefore, the modular exponentiations using the proposed Montgomery multiplication require $27n^3\log n$ $T$ gates and $27n^2\log n$ $T$-depth with $2n$ qubits.
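The totals above follow from multiplying the per-multiplication costs by the $3n$ Montgomery multiplications. A minimal sketch of this arithmetic is given below; it keeps only the dominant terms, assumes base-2 logarithms, and uses a hypothetical function name and a hypothetical bit length.

```python
import math


def direct_shor_cost(n: int) -> dict:
    """Dominant-term cost of Shor's algorithm with the proposed multiplication.

    Assumptions: base-2 logarithms; the k*log(n) ancilla term and the
    final inverse QFT are neglected, as in the text.
    """
    log_n = math.log2(n)
    mult_t_count = 9 * n**2 * log_n  # T gates per Montgomery multiplication
    mult_t_depth = 9 * n * log_n     # T-depth per Montgomery multiplication
    num_mults = 3 * n                # Montgomery multiplications in the exponentiation
    return {
        "t_count": num_mults * mult_t_count,  # 27 n^3 log n
        "t_depth": num_mults * mult_t_depth,  # 27 n^2 log n
        "qubits": 2 * n,
    }


if __name__ == "__main__":
    print(direct_shor_cost(2048))  # hypothetical 2048-bit modulus
```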

B. SHOR'S ALGORITHM MINIMIZING T-DEPTH
This section discusses Shor's algorithm with a low $T$-depth. First, the $T$-depth of each Montgomery multiplication is minimized in Section IV-B1. The $T$-depth of Shor's algorithm is then minimized by parallelizing Montgomery multiplications in Section IV-B2.

1) MINIMIZING T-DEPTH IN EACH MONTGOMERY MULTIPLICATION
The $T$-depth can be minimized by the following techniques: 1) Cleve and Watrous's [27] approximate QFT given in Section II-C4; 2) parallelizing controlled phase gates given in Section III-D2.
Table 6 shows the computational cost obtained by adopting the aforementioned techniques.
First, we minimize the $T$-depth of the approximate QFT. This minimization is applicable to the following procedures: 1) an $n$-bit QFT in Multiplication; 2) two $n$-bit QFTs in Reduction; 3) an $n$-bit inverse QFT in Uncomputation; 4) a $\log n$-bit QFT in Uncomputation.
In the aforementioned computational cost, $\varepsilon_q = n^{-2}$. Moreover, one $P_j$ gate requires $3\log(1/\varepsilon_a)$ $T$ gates and $T$-depth, namely $9\log n$ $T$ gates and $T$-depth, because $\varepsilon_a = n^{-3}$. Therefore, the $n$-bit approximate QFT requires $4n\log n$ qubits, $36n(\log n)^2$ $T$ gates, and $18\log n$ $T$-depth.
Next, Step 4, the $\log n$-bit QFT without approximation in Uncomputation, is considered. The $\log n$-bit QFT requires $(\log n)^2$ qubits, $(\log n)^2/2$ controlled $P_j$ gates, and two controlled $P_j$-depth. Therefore, the $\log n$-bit QFT requires $(\log n)^2$ qubits, $9(\log n)^3/2$ $T$ gates, and $18\log n$ $T$-depth.
Next, the other calculations are considered. We can parallelize controlled phase gates by using ancilla qubits. All gates in these calculations are controlled phase gates. If there are $L$ controlled phase gates, these gates are applied in $L/K$ $P_j$-depth by preparing $2K$ qubits ($K$ control qubits and $K$ target qubits). In the Montgomery multiplication, the Multiplication procedure employs $n^2$ controlled phase gates. Therefore, by preparing $2n^2$ qubits, all the calculation steps require one controlled $P_j$-depth, that is, $9\log n$ $T$-depth.
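The $L/K$ layering argument can be illustrated with a simple scheduler. This is a minimal sketch rather than the construction used in the article: the gate representation and function names are hypothetical, the fan-out cost of copying control and target qubits is ignored, and the $9\log n$ $T$-depth per layer assumes $\varepsilon_a = n^{-3}$ with base-2 logarithms.

```python
import math
from typing import List, Tuple

# Hypothetical representation of a controlled phase gate: (control, target).
Gate = Tuple[int, int]


def schedule_layers(gates: List[Gate], k: int) -> List[List[Gate]]:
    """Pack gates into layers of at most k gates.

    With K copied control qubits and K copied target qubits (2K qubits),
    each gate in a layer acts on its own pair of copies, so gates that
    share an original control or target can still run in the same layer.
    """
    if k < 1:
        raise ValueError("k must be positive")
    return [gates[i:i + k] for i in range(0, len(gates), k)]


def layered_t_depth(num_layers: int, n: int) -> float:
    """T-depth when each layer of controlled P_j gates costs 9 log n."""
    return num_layers * 9 * math.log2(n)


if __name__ == "__main__":
    n = 8  # hypothetical register size
    gates = [(c, t) for c in range(n) for t in range(n)]  # n^2 gates
    for k in (n, n * n):
        layers = schedule_layers(gates, k)
        print(k, len(layers), layered_t_depth(len(layers), n))
```

With $K = n^2$ pairs the whole Multiplication procedure fits in one layer, which matches the single controlled $P_j$-depth stated above.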
The total computational cost of Shor's algorithm is now evaluated. The previous discussion reveals that the number of qubits is $2n^2$. Moreover, the Montgomery multiplication requires $9n^2\log n$ $T$-count and $108\log n$ $T$-depth, as shown in Table 6. Therefore, Shor's algorithm requires $27n^3\log n$ $T$-count and $324n\log n$ $T$-depth.

2) MINIMIZING T-DEPTH BY PARALLELIZING MONTGOMERY MULTIPLICATION
Section IV-B1 considers the computational cost by assuming sequential Montgomery multiplications. However, we can parallelize these Montgomery multiplications [27], [33]. Cleve and Watrous's [27] method realizes modular exponentiations with $O(\log n)$-depth controlled multiplications. In more detail, we realize $M$ Montgomery multiplications using the numbers $y_0, y_1, \ldots, y_{M-1}$ as in Algorithm 5. In Line 6, the multiplication is carried out as where $m = \log n$ and $|g\rangle$ is a garbage register. This garbage state is erased only when $y_0 y_1 \cdots y_{M-1} R \bmod N$ is obtained. The other garbage states are retained because all values except for the obtained value $y_0 y_1 \cdots y_{M-1} R \bmod N$ are erased by reverse computation. This implementation then requires the following procedures: 1) Multiplication; 2) Reduction; 3) "Inv n-QFT" in Uncomputation.
Moreover, to realize the previous computation, we cannot adopt qubit recycling [32] for the control qubit of the controlled Montgomery multiplications. Therefore, we must prepare control qubits for the controlled Montgomery multiplications.
We now discuss how to parallelize the Montgomery multiplications. Parallelizing Montgomery multiplications changes "Mult" in Multiplication. In this procedure, phase gates with two control qubits are applied to the following qubits in the Montgomery multiplication (48): 1) a control qubit from the first register; 2) a control qubit from the second register; 3) a target qubit from the third register. All combinations of these three qubits are used, and the number of combinations is $n^3$ in "Mult" in a Multiplication procedure. Therefore, $n^3$ phase gates with two control qubits are used in one controlled Montgomery multiplication. To run these gates simultaneously, $2n^3$ ancilla qubits are required as follows: 1) $n$ copies of each AND result on the $n^2$ pairs of two control qubits; 2) $n^2$ copies of the $n$ target qubits. Moreover, the $n^2$ copies of the $n$ target qubits are realized in $\log(n^2)$ CNOT-depth. To realize the remaining states, a method similar to that of storing $|p_{j,k}\rangle$ in Section III-D1 is adopted. In more detail, we adopt the AND gates given in Fig. 5, run one-qubit gates on the third qubit simultaneously, and run CNOT gates with $O(\log n)$ depth with $O(n^2)$ ancilla qubits.
Next, the computational cost of the above implementation is evaluated. To calculate the computational cost, the values of $\varepsilon_q$ and $\varepsilon_a$ must be clarified. In a controlled Montgomery multiplication, the number of controlled phase gates applied to each qubit is $n^2$. Therefore, $M_p = n^2$. Moreover, this implementation requires $1.5n^4$ qubits because of the following.
1) In each multiplication, $2n^3$ qubits are prepared. 2) At most, $0.75n$ multiplications are run simultaneously.
This implementation runs $1.5n$ Montgomery multiplications with overhead from running as (48). In each Montgomery multiplication, $O(n^2)$ $T$ gates are required for copying the AND results of control qubits, but this is negligible compared with the $n^3$ controlled phase gates. Therefore, the number of controlled phase gates is $3n^4$, and this means $M_f = 3n^4$. By substituting these values of $M_p$ and $M_f$ into (33), the probability of obtaining the desired quantum states is Thus, when $\varepsilon_q = n^{-2}$ and $\varepsilon_a = n^{-4}$, the probability of obtaining the desired quantum states is (1). Based on the previous discussion, the $T$-count is Each term, from left to right, represents the following.
2) 2: Forward calculation and reverse computation for erasing ancilla qubits. 3) $n^3$: Dominant term of the number of $P_j$ gates in each Montgomery multiplication. 4) $3\log(1/\varepsilon_a)$: The number of $T$ gates for each $P_j$ gate.
Moreover, Montgomery multiplications in this implementation require $33\log(1/\varepsilon_a)$ $T$-depth, which is obtained from Table 6 excluding "Inv mult." Therefore, the $T$-depth is (51)
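The qubit count and the number of controlled phase gates of this parallelized construction can be checked by multiplying the factors listed in the text. The sketch below is illustrative and makes two labeled assumptions: logarithms are base 2, and the first factor of the $T$-count, which is not listed among the terms above, is taken to be the $1.5n$ Montgomery multiplications mentioned earlier in this subsection. The function name is hypothetical.

```python
import math


def parallel_shor_resources(n: int) -> dict:
    """Dominant-term resources of the parallelized construction.

    Assumptions (not confirmed by the text): base-2 logarithms, and the
    leading factor of the T-count is the 1.5n Montgomery multiplications.
    """
    log_n = math.log2(n)
    qubits = (2 * n**3) * (0.75 * n)   # 2n^3 per multiplication, 0.75n at once
    num_mults = 1.5 * n                # assumed leading factor
    phase_gates = num_mults * 2 * n**3 # forward + reverse, n^3 gates each
    t_per_gate = 3 * 4 * log_n         # 3 log(1/eps_a) with eps_a = n^-4
    return {
        "qubits": qubits,              # 1.5 n^4
        "phase_gates": phase_gates,    # 3 n^4, i.e. M_f
        "t_count": phase_gates * t_per_gate,
    }


if __name__ == "__main__":
    print(parallel_shor_resources(16))  # hypothetical small n
```

Under these assumptions the $T$-count evaluates to $36n^4\log n$; this figure is a consequence of the stated factors and should be read as a consistency check rather than as the article's equation.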

C. SHOR'S ALGORITHM MINIMIZING KQ
Next, the KQ-optimized quantum circuit is evaluated. In the earlier sections, the $T$-depth was minimized. These constructions realize a smaller $T$-depth by using more ancilla qubits and running many $T$ gates simultaneously. Therefore, circuits with a smaller $T$-depth require more ancilla qubits for the magic states of the $T$ gate [34], [35].
To realize an efficient circuit, we should consider the above tradeoff between the $T$-depth and the number of ancilla qubits. KQ [24] is an index that considers both the circuit depth and the number of qubits. Thus, an efficient construction based on KQ is proposed. Before this quantum circuit is evaluated, Oonishi et al.'s [23] method for optimizing KQ is introduced. Oonishi et al. [23] decreased KQ on a controlled modular adder using Draper et al.'s [16] adder. Their method decreases KQ by considering the distillation of $T$ gates [19]. Let $n_T$ be the maximal number of $T$ gates running simultaneously, and let $c_g + 1$ be the number of qubits required for a $T$ gate. Note that $c_g$ is the number of qubits for a distillation circuit, and $+1$ is used for an $S$ gate correcting the phase. Based on these parameters, Oonishi et al.'s [23] method calculates $KQ_T$, namely, $\#(\text{qubits}) \times (T\text{-depth})$. Their method then minimizes $KQ_T$. Similar to their method, $KQ_T$ is now calculated and minimized.
We now calculate the $n_T$ realizing the smallest $KQ_T$. Based on the previous discussion, Table 7 shows $KQ_T$ for each proposed construction. As Table 7 shows, $KQ_T$ increases as the $T$-depth decreases. This relationship arises because increasing the number of qubits decreases the $T$-depth of only a part of the circuit. The implementations other than that in Section IV-B2 are now discussed because the implementation in Section IV-B2 has a larger $KQ_T$ than the other implementations in Table 7. In these implementations, $n_T$ is at most $n^2$, and $n \le n_T \le n^2$ is considered. It is assumed that we use $2n_T$ ancilla qubits for copying control and target qubits, and $KQ_T$ is given as follows. Therefore, $KQ_T$ is larger than the value in (57) when $n_T \le n\log n$.
In conclusion, $KQ_T$ takes the smallest value when $n_T = \sqrt{2n^3/(11(c_g + 3))}$, and the dominant term of $KQ_T$ is $27(c_g + 3)n^3\log n$.
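The minimization can be checked numerically. The sketch below evaluates $KQ_T$ as the product of the qubit count $2n + (c_g + 3)n_T$ and the $T$-depth $27n^3\log n/n_T + 297n\log n$, and compares the value at the stationary point $n_T = \sqrt{2n^3/(11(c_g + 3))}$ with the expanded closed form. It assumes base-2 logarithms, the function names and the sample value of $c_g$ are hypothetical, and it does not check whether the stationary point lies in the range $n\log n \le n_T \le n^2$ for which the expression is stated.

```python
import math


def kq_t(n: int, n_t: float, c_g: float) -> float:
    """KQ_T = (number of qubits) * (T-depth) for a given parallelism n_t."""
    log_n = math.log2(n)
    qubits = 2 * n + (c_g + 3) * n_t
    t_depth = 27 * n**3 * log_n / n_t + 297 * n * log_n
    return qubits * t_depth


def optimal_n_t(n: int, c_g: float) -> float:
    """Stationary point of kq_t with respect to n_t."""
    return math.sqrt(2 * n**3 / (11 * (c_g + 3)))


def kq_t_closed_form(n: int, c_g: float) -> float:
    """Expanded value of kq_t at optimal_n_t."""
    log_n = math.log2(n)
    a = c_g + 3
    return (27 * a * n**3 * log_n
            + 54 * math.sqrt(22 * a) * n**2.5 * log_n
            + 594 * n**2 * log_n)


if __name__ == "__main__":
    n, c_g = 4096, 15  # hypothetical problem size and distillation width
    best = optimal_n_t(n, c_g)
    print(best, kq_t(n, best, c_g), kq_t_closed_form(n, c_g))
```

Because the $n_T$-dependent part has the form $\alpha n_T + \beta/n_T$, the stationary point is the unique minimum over positive $n_T$, so moving $n_T$ in either direction increases $KQ_T$.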

V. CONCLUSION AND FUTURE WORKS
We examined Shor's algorithm using the Fourier basis by improving Rines and Chuang's [18] implementation of the Montgomery multiplication. The contributions of this study are as follows.
1) Rines and Chuang's [18] implementation of the Montgomery multiplication was improved (see Section III).
2) The proposed Montgomery multiplication was applied to Shor's algorithm (see Section IV).
First, Rines and Chuang's [18] implementation of the Montgomery multiplication was improved, as discussed in Section III. Their quantum circuit was improved so that the dominant term of the number of qubits is $2n$, almost the same number of qubits as in Rines and Chuang's [18] construction. Their Montgomery multiplication for Shor's algorithm was modified because the original implementation did not consider the case in which the value is left unchanged. Moreover, a method was proposed for reducing the computational cost by two approximate QFTs (a naive method and the proposed method) based on a rigorous analysis of approximation errors. By applying the naive approximate QFT, the implementation requires only one-third of the number of $T$ gates of Rines and Chuang's [18] original implementation. Moreover, the implementation requires only one-fourth of the $T$-depth of Rines and Chuang's [18] original implementation when the proposed approximate QFT is applied.
Next, the proposed Montgomery multiplication was applied to Shor's algorithm in Section IV. First, the proposed Montgomery multiplication was applied directly. Next, a circuit with a small $T$-depth was proposed by adopting Cleve and Watrous's [27] method. Moreover, as in Oonishi et al.'s [23] method, a $T$-scheduling method for minimizing KQ [24], the product of the number of qubits and the depth, was proposed. The construction with the smallest KQ was then given; the smallest KQ is obtained when $n^{1.5}$ $T$ gates are run simultaneously.
We now discuss future work. First, constant values of $\varepsilon_q$ and $\varepsilon_a$ were adopted in this article. However, nonconstant values of $\varepsilon_q$ and $\varepsilon_a$ may lead to a more efficient construction. Therefore, the appropriate values of $\varepsilon_q$ and $\varepsilon_a$ in each procedure should be considered. Moreover, Assumption 1 must be evaluated in more detail.
Next, the quantum computer architecture should be investigated. In this study, it was assumed that all qubits are fully connected, but actual quantum computers may not be fully connected. Therefore, the computational cost on specific structures of quantum computers, as in previous research [36], [37], should be considered.
Moreover, the appropriate costs for phase gates should be examined. Ross and Selinger's [20] method for decomposing a phase gate into Clifford+$T$ gates was adopted. The optimal decomposition for the proposed method requires further investigation. In addition, the optimal distillation and error correction for the proposed method should be determined.
Finally, this study only focused on the Fourier-basis Shor's algorithm, but there are many different constructions. Therefore, it is necessary to clarify the best one based on the result proposed in this article.

FIGURE 1. Intuition of reducing the computational cost for the Montgomery multiplication. The major computational cost of the original circuit consists of multiplication and QFTs. By reducing the computational cost of QFTs, an efficient Montgomery multiplication is realized.

FIGURE 5. AND operation in calculating $|p_{j,k}\rangle$ [26]: We can parallelize the gates excluding the CNOT gate surrounded by a dotted line.

1) We copy the value of qubit $|x_{n-j}\rangle$ into $b$ ancilla qubits. This operation requires $\log b$ CNOT-depth (Clifford), which is negligible compared with the $8\log b$ $T$-depth (non-Clifford) of the addition.
2) We store $|p_{j,k}\rangle$ by parallelized computation. This operation requires $O(1)$ depth, which is negligible compared with the depth of the addition.
3) We reset the value of the $b$ ancilla qubits by the inverse operation of Step 1).

1) $n\log n \le n_T \le n^2$: For these $n_T$, the number of qubits is $2n + (c_g + 3)n_T$ and the $T$-depth is
$$\frac{27n^3\log n}{n_T} + 297n\log n. \tag{52}$$
In (52), the first term, namely $27n^3\log n/n_T$, is the $T$-depth of "mult" in Multiplication. From Table 6, the $T$-depth without multiplication is $108\log n - 9\log n = 99\log n$ in each Montgomery multiplication. The Montgomery multiplication is run $3n$ times, and the $T$-depth except for "mult" in Multiplication is $297n\log n$, which is the second term. Therefore, the $T$-depth is given as (52) and
$$KQ_T = \left(2n + (c_g + 3)n_T\right)\left(\frac{27n^3\log n}{n_T} + 297n\log n\right) \tag{53}$$
$$= 297(c_g + 3)\,n\log n \cdot n_T + \frac{54n^4\log n}{n_T} + C(n) \tag{54}$$
where $C(n)$ is a function independent of $n_T$. The value in (54) is minimized when
$$n_T = \sqrt{\frac{2n^3}{11(c_g + 3)}}. \tag{55}$$
$KQ_T$ is then
$$KQ_T = 27(c_g + 3)n^3\log n + 54\sqrt{22(c_g + 3)}\,n^{2.5}\log n + 594n^2\log n \tag{56}$$
$$= 27(c_g + 3)n^3\log n + O\left(n^{2.5}\log n\right). \tag{57}$$
2) $n_T \le n\log n$: The focus is now on the $KQ_T$ of "Mult" in Multiplication. In this part, $KQ_T$ is
$$\left(2n + (c_g + 3)n_T\right)\frac{27n^3\log n}{n_T} = 27(c_g + 3)n^3\log n + \frac{54n^4\log n}{n_T} \tag{58}$$
$$= 27(c_g + 3)n^3\log n + \Omega\left(n^3\right). \tag{59}$$

TABLE 2. Computational Cost of Montgomery Multiplication of $n$-Bit Number With QFT
$P(\theta)$, where $U$ is an approximate matrix of the phase gate. Therefore, controlled phase gates require almost $3\log(1/\varepsilon_a)$ $T$ gates in Rines and Chuang

TABLE 5. Computational Cost of Approximate QFT Using Adder With Threshold $\varepsilon_q$
Algorithm 4: Proposed Approximate QFT.

TABLE 6. Computational Cost of $T$-Depth Minimized Montgomery Multiplication of $n$-Bit Number, Which Requires $2n^2$ Ancilla Qubits
qubits, $9(\log n)^3/2$ $T$ gates, and $18\log n$ $T$-depth.