Quantum Resources Required to Block-Encode a Matrix of Classical Data

We provide a modular circuit-level implementation and resource estimates for several methods of block-encoding a dense <inline-formula><tex-math notation="LaTeX">$N\times N$</tex-math></inline-formula> matrix of classical data to precision <inline-formula><tex-math notation="LaTeX">$\epsilon$</tex-math></inline-formula>; the minimal-depth method achieves a <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula>-depth of <inline-formula><tex-math notation="LaTeX">$\mathcal {O}(\log (N/\epsilon)),$</tex-math></inline-formula> while the minimal-count method achieves a <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula>-count of <inline-formula><tex-math notation="LaTeX">$\mathcal{O} (N \log(\log(N)/\epsilon))$</tex-math></inline-formula>. We examine resource tradeoffs between the different approaches, and we explore implementations of two separate models of quantum random access memory. As a part of this analysis, we provide a novel state preparation routine with <inline-formula><tex-math notation="LaTeX">$T$</tex-math></inline-formula>-depth <inline-formula><tex-math notation="LaTeX">$\mathcal {O}(\log (N/\epsilon))$</tex-math></inline-formula>, improving on previous constructions with scaling <inline-formula><tex-math notation="LaTeX">$\mathcal {O}(\log ^{2} (N/\epsilon))$</tex-math></inline-formula>. Our results go beyond simple query complexity and provide a clear picture into the resource costs when large amounts of classical data are assumed to be accessible to quantum algorithms.


I. INTRODUCTION

A. Motivation
A commonly used access model for quantum algorithms requiring classical data is a so-called "block-encoding" of the classical data into a unitary operator. Costs associated with these algorithms are often quoted in terms of the asymptotic scaling of the number of queries to the block-encoding oracle. However, if we are to make assessments about the potential of quantum random access memory (QRAM)-based algorithms, we must take into account the cost of each block-encoding query as well. In this work, we provide a detailed implementation and resource comparison of different methods one might use to block-encode a dense matrix A representing classical data.
Block-encodings for N × N matrices are useful because they can, in principle, be implemented in exponentially small time (i.e., with quantum circuits of depth only polylog(N) [1]), albeit still with a space cost of O(N^2) qubits. The block-encoding framework has proven useful for quantum algorithms for a variety of applications. Together with insights from adiabatic quantum computing, block-encoding has been crucial to the discovery of quantum linear system solvers with provably optimal scaling with respect to relevant parameters [2][3][4][5]. The framework has led to faster algorithms for phase estimation [6], quantum gradients [7], and improved Hamiltonian simulation and regression techniques [1,8]. Furthermore, block-encoding has been used to analyze quantum optimization algorithms [9,10] and related algorithms for portfolio optimization [11]. Many of these algorithms make use of the quantum singular value transform (QSVT) algorithm [2,12], which uses a technique known as quantum signal processing (QSP) [8,13] that performs polynomial transformations on the block-encoded matrix.

B. Model
A unitary matrix U_A block-encodes the matrix A ∈ R^{N×N} when the top-left block of U_A is proportional to A, i.e.,
$$U_A = \begin{pmatrix} A/\alpha & \cdot \\ \cdot & \cdot \end{pmatrix}, \qquad (1)$$
where α ≥ ||A|| is a normalization constant. We use || • || to denote the operator norm. The other blocks in U_A are irrelevant, but they must be encoded such that U_A is unitary. Note that we focus on real matrices A, but the extension to complex matrices is straightforward.
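As a purely numerical illustration of this definition, the following sketch builds one possible unitary completion of A/α from a singular value decomposition and checks that the top-left block is A/α. This is a textbook dilation used only to make the definition concrete; it is not one of the circuit constructions discussed in this work, and the function name `dilation_block_encoding` is hypothetical.

```python
import numpy as np

def dilation_block_encoding(A, alpha):
    """Return a 2N x 2N orthogonal matrix whose top-left block is A / alpha.

    Illustrative only: a standard SVD-based dilation for real A,
    assuming alpha is at least the operator norm of A.
    """
    B = np.asarray(A, dtype=float) / alpha
    W, S, Vt = np.linalg.svd(B)                    # B = W @ diag(S) @ Vt
    if S.max() > 1.0 + 1e-12:
        raise ValueError("alpha must be at least the operator norm of A")
    C = np.sqrt(np.clip(1.0 - S**2, 0.0, None))    # complementary singular values
    top = np.hstack([B, W @ np.diag(C) @ W.T])
    bottom = np.hstack([Vt.T @ np.diag(C) @ Vt, -Vt.T @ np.diag(S) @ W.T])
    return np.vstack([top, bottom])

A = np.array([[1.0, 2.0], [3.0, 4.0]])
alpha = np.linalg.norm(A, 2)                       # operator norm of A
U = dilation_block_encoding(A, alpha)
```

The remaining three blocks are fixed here only by the requirement that U be unitary, in line with the remark that they are otherwise irrelevant.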
There is a variety of techniques available to block-encode a matrix [1,2]. For applications where we wish to operate on dense, classical data, as is often encountered in machine learning or finance applications, a particularly relevant method is the QRAM input model and its many proposed implementations [14][15][16][17][18]. More concretely, we refer to QRAM as the quantum circuit that allows query access to classical data in superposition,
$$\sum_j \alpha_j |j\rangle |0\rangle \;\longmapsto\; \sum_j \alpha_j |j\rangle |b_j\rangle, \qquad (2)$$
where j is the address in superposition with amplitude α_j and |b_j⟩ is the classical data loaded into a quantum state. In this work, we discuss two different circuit-level approaches to accomplish the QRAM operation, which are optimized for different objectives. The first is a select-swap (SS) model, which is particularly efficient in terms of T-gate utilization [19]. The second is a bucket-brigade (BB) model [14], which has reduced susceptibility to errors when operated on potentially faulty hardware [18]. We also provide a variant of the select-swap QRAM approach that performs a slightly different version of Eq. (2), where a more general single-qubit state can be loaded controlled by a flag qubit.
The block-encoding constructions presented in our work compile all circuits into Clifford gates and T gates, where attempts are made to minimize either the total number of qubits, the total number of T gates (henceforth called the T-count), or the minimum number of layers of parallel T gates (henceforth called the T-depth). Our estimates, as overviewed in the next subsection, illustrate the trade-offs between these three metrics. We focus on T gates since, in many proposals for fault-tolerant quantum computation based on quantum error-correcting codes like the surface code, Clifford gates are transversal and can be performed essentially for free (or in some cases by simply updating the Pauli frame for the decoder completely in software). Meanwhile, T gates require the expensive process of magic state distillation [20,21]. Indeed, the motivating model for our work is one in which the logical quantum circuits we describe are carried out with lattice surgery [22][23][24][25] on many physical qubits encoded with the surface code. We additionally assume that logical CNOT gates can be performed between arbitrary qubits in constant time, and that the fanout-CNOT operation, i.e., the product of many CNOTs with the same control but different targets, can be performed in constant time [23]. As such, our calculations can be seen as optimistic and could be underestimates if there are additional hardware-specific constraints on logical topology and connectivity.
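To make the QRAM query concrete, here is a minimal state-vector sketch of the map that sends each address |j⟩ with an empty data register to |j⟩|b_j⟩ while preserving the address amplitudes. It only simulates the input/output behavior on a classical computer; the function name `qram_query` and the basis ordering (address as the high-order index) are assumptions made for illustration.

```python
import numpy as np

def qram_query(address_amps, data, data_bits):
    """Return the state sum_j a_j |j>|b_j> given amplitudes a_j and integers b_j.

    The joint basis index of |j>|b> is taken to be j * 2**data_bits + b
    (an illustrative convention, not fixed by the text).
    """
    n_addr = len(address_amps)
    dim_data = 2 ** data_bits
    out = np.zeros(n_addr * dim_data, dtype=complex)
    for j, amp in enumerate(address_amps):
        out[j * dim_data + data[j]] = amp   # amplitude moves from |j>|0> to |j>|b_j>
    return out

amps = np.full(4, 0.5)          # uniform superposition over 4 addresses
data = [3, 0, 2, 1]             # 2-bit classical entries b_0..b_3
state = qram_query(amps, data, data_bits=2)
```

Each address keeps its amplitude and the data register becomes correlated with the address, which is the behavior the circuits of Sec. II are designed to realize.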

C. Overview of results
The main contributions of our work are summarized as follows:
• Clifford+T circuits with minimal T-depth, or alternatively, with minimal T-count, to perform a block-encoding unitary U_A as given in Eq. (1) for N × N matrices A.
• Resource calculations for the block-encoding, including all constant factors, expressed in terms of a small number of tunable parameters.
• Clifford+T circuits for QRAM and state preparation sub-routines, along with associated resource calculations.
• Conceptual improvements to the state preparation procedure yielding a quadratic speedup in T-depth complexity.
Our results for the circuit resource costs in terms of the number of qubits, T-count, and T-depth are shown in the top part of Tab. I, including constant factors. We note that the cost of having the T-depth be exponentially smaller than the time needed even to write down all of the matrix entries is a large overhead of O(N^2) ancilla qubits and total T-count. It is possible to reduce the qubit count and T-count to O(N) at the cost of making the T-depth grow also to O(N).
In addition to parameterized resource estimates, we compute the exact costs to block-encode matrices of sizes [16 × 16, 256 × 256, 4096 × 4096] with precision ε = 0.01. To estimate a matrix norm, needed for the scale factor α, we took the entries to be uniformly distributed random numbers on the interval [5.0, 105.0] as representative of the type of data we might choose to block-encode (e.g., financial market data). Example numbers are shown in the bottom part of Tab. I.

TABLE I. Quantum resources required to block-encode an N × N matrix to precision ε, where we assume that N = 2^n. The top part of the table contains parameterized estimates, while the bottom shows realistic values obtained for chosen values of the parameters. The parameter R_y is the number of T gates needed to synthesize a single-qubit rotation about the Y axis by an arbitrary angle, and t is the number of bits of precision used to store the classical data, where both t and R_y scale as O(log(1/ε)). In the bottom part of the table, we computed the costs to block-encode random real matrices of sizes [16 × 16, ...

In order to make the definition from Eq. (1) precise, we provide quantum circuits that perform a unitary Ũ_A on n + ℓ qubits, where N = 2^n, for which
$$\left\| A - \alpha \left(\langle 0|_{\ell} \otimes I\right) \tilde{U}_A \left(|0\rangle_{\ell} \otimes I\right) \right\| \le \epsilon,$$
where the parameter ℓ denotes the number of ancillas, I denotes the identity on n qubits, and || • || denotes the operator norm. Subscripts on bras and kets are included for clarity to indicate the associated number of qubits in the state. Following standard convention, we call Ũ_A an (α, ℓ, ε)-block-encoding of A. The normalization factor for the block-encoding we treat is α = ||A||_F, where || • ||_F denotes the Frobenius norm.
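All of our constructions follow the prescription laid out in Refs. [2,26,27], which calls for forming U_A as the product of a pair of controlled-state preparation unitaries U_L and U_R. In this prescription,
$$U_A = U_R^{\dagger} U_L,$$
where, controlled on an n-qubit register in the state |j⟩_n, U_R prepares the n-qubit state |ψ_j⟩_n, and U_L prepares the state |φ_j⟩_n with the assistance of QRAM ancilla qubits. That is, we have
$$U_R: |0\rangle_{\ell-n} |0\rangle_n |j\rangle_n \mapsto |0\rangle_{\ell-n} |\psi_j\rangle_n |j\rangle_n, \qquad U_L: |0\rangle_{\ell-n} |0\rangle_n |k\rangle_n \mapsto |0\rangle_{\ell-n} |k\rangle_n |\phi_k\rangle_n.$$
We then choose
$$|\psi_j\rangle_n = \sum_{k=0}^{N-1} \frac{A_{jk}}{\|A_{j,\cdot}\|} |k\rangle_n, \qquad |\phi_k\rangle_n = \sum_{j=0}^{N-1} \frac{\|A_{j,\cdot}\|}{\|A\|_F} |j\rangle_n,$$
where A_{j,•} denotes the j-th row of A and ||A_{j,•}|| the standard Euclidean norm of that vector. Since |φ_k⟩_n is independent of k, U_L is simply state preparation, rather than controlled-state preparation. However, in the q-norm version of the block-encoding presented in App. A 2, |φ_k⟩_n will depend on k. It is easily verified that U_A is an exact block-encoding of A, i.e.,
$$\langle 0|_{\ell-n} \langle 0|_n \langle j|_n \, U_A \, |0\rangle_{\ell-n} |0\rangle_n |k\rangle_n = \langle \psi_j | k \rangle \langle j | \phi_k \rangle = \frac{A_{jk}}{\|A\|_F}. \qquad (7)$$
We refer to Fig. 1 for an overview of this reduction from block-encoding to controlled-state preparation. To implement the controlled-state preparation unitaries U_R and U_L, we combine a QRAM-like data-loading step with a protocol for state preparation of n-qubit states.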
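The reduction can be sanity-checked numerically. The sketch below assumes the usual Frobenius-norm choice, in which |ψ_j⟩ has amplitudes A_jk/||A_{j,•}|| and |φ_k⟩ has amplitudes ||A_{j,•}||/||A||_F (stated here as an assumption), and verifies that the product of overlaps reproduces A_jk/||A||_F. The random 16 × 16 matrix with entries in [5, 105] mirrors the example data described earlier; the seed and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)           # arbitrary seed, for reproducibility only
N = 16
A = rng.uniform(5.0, 105.0, size=(N, N))

row_norms = np.linalg.norm(A, axis=1)    # Euclidean norm of each row A_{j,.}
alpha = np.linalg.norm(A, "fro")         # Frobenius norm, the normalization factor

psi = A / row_norms[:, None]             # psi[j, k] = <k|psi_j>
phi = row_norms / alpha                  # phi[j] = <j|phi_k>, identical for every k

# block[j, k] = <j|phi_k> <psi_j|k>, the claimed matrix element of U_A
block = phi[:, None] * psi
```

Because the row norms cancel in the product, `alpha * block` recovers A exactly up to floating-point error, which is the sense in which the construction is an exact block-encoding.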
FIG. 1. Reduction from block-encoding to controlled-state preparation. The block-encoding unitary U_A is the product of two controlled-state preparation procedures, U_L and U_R†. Given a family of quantum states {|φ_k⟩}_{k=0}^{N−1}, the controlled-SP gate denotes the unitary that performs |0⟩_n → |φ_k⟩_n on the first register conditioned on the final register being |k⟩, using ℓ − n ancillas on the second register that begin and end in the state |0⟩. One can obtain the A_jk matrix element from the unitary by evaluating the ⟨00j| • |00k⟩ matrix element, as in Eq. (7). The generic control symbol indicates that the register acts as a control and may produce non-trivial action on the target for any control setting (in contrast to • and ◦). The controlled-state-preparation gates can be implemented either with the circuit from Fig. 10 or the circuit from Fig. 14, discussed in Sec. III.

E. State preparation
The state preparation problem has been studied in a series of prior works [19,26,28-37]. To prepare an n-qubit state, one basic approach is to perform a sequence of n single-qubit rotations on successive qubits, where the angle of rotation is controlled on previously rotated qubits. That is, for p = 0, . . ., n − 1, a rotation is performed on qubit p by one of 2^p possible rotation angles determined by the setting of qubits 0, . . ., p − 1. Thus, there are Σ_{p=0}^{n−1} 2^p = N − 1 angles (where N = 2^n is the dimension of the Hilbert space) that might be used at some point in the procedure. In Ref. [19], it was shown that given an n-qubit state |ψ⟩_n, this approach can produce a state |ψ̃⟩_n for which
$$\big\| |\psi\rangle_n - |\tilde{\psi}\rangle_n \big\| \le \epsilon$$
in T-depth O(log^2(N/ε)). This scaling comes from the need to do the n = log(N) controlled-rotations in series, and to perform each controlled-rotation by (i) leveraging an O(n)-depth controlled-swap network to copy in an O(log(1/ε))-bit description of the relevant rotation angle, and (ii) for each bit applying an ε-accurate Clifford+T gate decomposition of the corresponding controlled-rotation, costing O(log(1/ε)) gates per bit. In Sec. III C, we implement a version of this approach, which we call the "fixed-precision" method.
The extension to controlled-state preparation calls for creating one of N = 2^n n-qubit states {|φ_k⟩}_{k=0}^{N−1} controlled on an n-qubit register. As there are now N(N − 1) rotation angles that might be needed, the first step is to load the correct subset (which depends on the setting of the control k) of N − 1 angles into an ancilla register. This loading step can be understood as a more general form of the QRAM query from Eq. (2). In Sec. II we provide details of this initial loading step and illustrate two separate circuit-level approaches compatible with the fixed-precision approach: (i) the select-swap (SS) approach, which is based on Ref. [19], and (ii) the bucket-brigade (BB) approach, which is based on Refs. [14,15,18]. Both approaches allow the loading step to be accomplished in as little as O(log(N)) T-depth. The select-swap approach has a better constant factor, while the bucket-brigade approach has enhanced natural noise resilience [18] that may lead to better overall performance when quantum error correction overhead is considered. After loading, a normal state preparation procedure is performed, followed by unloading of the angles to reset the ancillas to |0⟩. As QRAM is a common subroutine in quantum algorithms, our estimates in Sec. II could be useful beyond the application of block-encoding.
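The tree of N − 1 rotation angles described above can be computed classically. The following sketch does so for a normalized vector with nonnegative real amplitudes (a simplifying assumption; signs are not handled) and then rebuilds the state by applying the rotations layer by layer. The function names and the convention that qubit 0 is the most significant bit of the basis index are illustrative choices.

```python
import numpy as np

def rotation_angles(amps):
    """Return one array of 2**p angles per qubit p = 0..n-1 (N - 1 angles total).

    Assumes len(amps) is a power of two and all amplitudes are nonnegative.
    """
    amps = np.asarray(amps, dtype=float)
    n = int(np.log2(len(amps)))
    layers = []
    for p in range(n):
        size = len(amps) >> p                      # block of amplitudes sharing a p-bit prefix
        layer = []
        for b in range(2 ** p):
            seg = amps[b * size:(b + 1) * size]
            left = np.linalg.norm(seg[: size // 2])   # weight with qubit p in |0>
            right = np.linalg.norm(seg[size // 2:])   # weight with qubit p in |1>
            layer.append(2.0 * np.arctan2(right, left))
        layers.append(np.array(layer))
    return layers

def state_from_angles(layers):
    """Apply R_y(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1> qubit by qubit."""
    state = np.array([1.0])
    for layer in layers:
        new = np.empty(2 * len(state))
        new[0::2] = state * np.cos(layer / 2)      # prefix followed by bit 0
        new[1::2] = state * np.sin(layer / 2)      # prefix followed by bit 1
        state = new
    return state

target = np.arange(1.0, 9.0)
target /= np.linalg.norm(target)
angles = rotation_angles(target)
rebuilt = state_from_angles(angles)
```

In the controlled version, one such set of N − 1 angles exists for each of the N control settings, which is where the N(N − 1) total comes from.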

F. Improved state preparation
In addition to the state preparation procedure presented above, we also present a distinct approach to state preparation that we call the "pre-rotated" approach. Pre-rotated state preparation may be of independent interest, as it quadratically improves the T-depth for the task of state preparation and controlled-state preparation from O(log^2(N/ε)) to O(log(N/ε)). The pre-rotated approach achieves logarithmic T-depth by (a) pre-applying all possible single-qubit rotations onto N − 1 ancilla qubits in parallel using O(log(1/ε)) depth, and (b) enacting each of the log(N) controlled-rotations in series using a constant-depth controlled-swap network that injects the appropriate ancilla into the state register. Our controlled-swap networks require only constant depth per controlled-rotation because we wait to uncompute garbage until after all log(N) controlled-rotations have been completed. Note that the fixed-precision method can implement step (b), but not step (a). The pre-rotated method utilizes both steps and also requires a flag mechanism for uncomputing the ancillas that were not injected. The results of this approach are summarized in Tab. V. A similar idea of pre-rotation for state preparation was also explored in Ref. [37], but their construction is not garbage-free.

FIG. 2. Schematic demonstrating our resource estimation procedure for block-encoding. The user specifies a QRAM access model, a parameter λ that allows one to trade depth for width, the method for the state preparation routine, along with the block-encoding normalization factor and whether one wants a standard block-encoding, a controlled version, and/or a symmetric encoding. Note that not all possible combinations are allowed, e.g., the state-preparation method pre-rotated is only available together with the access model flags and the choice λ = n. Details are provided in the main text.

G. Variations and extensions
We discuss further variations of block-encodings that include, in particular, trade-offs for T-count, and versions for improved noise resilience when using imperfect quantum hardware. The available variations are illustrated in Fig. 2. The technical contributions of this work are as follows:
• We demonstrate how to implement a mechanism suggested in Refs. [18,19,39] that trades T-depth for T-count, parameterized by an integer λ ∈ {0, 1, . . ., log(N)}. The T-count for block-encoding is minimized at O(N) using the select-swap QRAM with λ = 0, as reported in Tab. I. However, the corresponding T-depth is also O(N), which is exponentially worse than the minimal depth case. Our results for general λ for the bucket-brigade and the select-swap approaches appear later in Tab. IV. We report exact expressions including constant pre-factors and sub-leading terms for all cases, which allows us to find a minimal T-count construction. These ideas are discussed in Secs. II B and II C.
• We explain how our construction can be adapted to perform a controlled-block-encoding of A, i.e., the operation |0⟩⟨0| ⊗ I + |1⟩⟨1| ⊗ U_A. These controlled block-encodings are important, e.g., when composing block-encodings of multiple matrices [2], and are discussed in Sec. IV D.
• We construct block-encodings of symmetrized matrices, which are useful in some applications and which can be block-encoded more efficiently than the general construction owing to their structure. Symmetrized block-encodings are discussed in App. A 1.
• We give a variation of the symmetrized encoding for which the normalization factor is the q-norm, which can be smaller than the Frobenius norm; the resource counts are similar. This variation is discussed in App. A 2.

H. Outlook
While implementation of QRAM-based algorithms with a large amount of classical data on actual hardware is quite a ways off, we hope our detailed circuit layouts and resource counts challenge the hardware community to think through these requirements and investigate whether or not specialized hardware can meet them with reduced resources. Just as modern computer random-access memory (RAM) is distinct from processing elements and permanent memory, it would be desirable for future quantum computers to have hybrid architectures in which the QRAM elements are distinct from the quantum processor and optimized for the requirements at hand. We hope that our results provide a blueprint for these requirements, as having an efficient means to load classical data into a quantum computer could open up many applications in business, finance, and beyond.

I. Outline
The remainder of our manuscript is structured as follows. In Sec. II, we give circuits and resource estimates for performing the QRAM operation of Eq. (2), as well as a more general data-loading operation that we use in our minimal depth block-encoding construction. In Sec. III, we elaborate on the conceptual approach to state preparation and controlled-state preparation and give circuits and resource estimates for these tasks. We present two distinct approaches, which we call fixed-precision and pre-rotated, where the former relies on the SS-QRAM or BB-QRAM, while the latter relies on the generalized QRAM operation. In Sec. IV, we compute and report the overall resource estimates for block-encoding. In Sec. V, we give an error analysis for finite precision implementations. Finally, in Sec. VI, we offer some concluding remarks. Various technical details and variants are deferred to App. A-C.

A. Overview
As overviewed in the previous section, block-encoding relies on controlled-state preparation, which necessitates an initial data loading step resembling the QRAM query of Eq. (2). We present multiple options for QRAM implementations as depicted in the upper-left block of the flow chart in Fig. 2. We explain each of these options and provide non-Clifford resource counts. Since the QRAM operation is an important primitive in many quantum algorithms, this resource analysis could be of interest independent from our block-encoding estimates.
Note that in this section and throughout the paper we make extensive use of a generic control symbol when depicting (a + b)-qubit gates in which the a-qubit register acts as a control. That is, when the gate can be decomposed as
$$\sum_{j=0}^{2^a-1} |j\rangle\langle j|_a \otimes U_j$$
for some set of b-qubit unitaries {U_j}, we draw this symbol on the a-qubit register in the circuit diagram. When U_j is the identity for all j besides the all-ones control setting, we replace the symbol with •, and when U_j is the identity for all j besides the all-zeros control setting, we replace it with ◦. A similar usage of this symbol appears in Ref. [19].

B. Select-swap data loading
The select-swap QRAM access model allows one to trade circuit depth for width via a tunable integer parameter λ, where 0 ≤ λ ≤ n [19]. The model also allows one to utilize qubits prepared in arbitrary initial states (so-called "dirty" qubits) at the cost of a constant factor increase to the circuit depth. While dirty qubits could lead to resource savings when considering a full architectural implementation of an algorithm, we do not make explicit use of dirty qubits; we assume that the memory is initialized to the |0⟩ state for all circuits in this paper, except where otherwise stated.
The select-swap formalism, introduced in Ref. [19], realizes the QRAM data-loading operation of Eq. (2) for a D-bit data register, except that it leaves additional garbage states in a (Λ − 1)D-qubit ancilla register, where Λ = 2^λ. That is, select-swap realizes the unitary LOAD_ss operation defined by the equation
$$\mathrm{LOAD}_{ss}\, |j\rangle_n |0\rangle_D |0\rangle_{(\Lambda-1)D} = |j\rangle_n |b_j\rangle_D |g_j\rangle_{(\Lambda-1)D},$$
where |b_j⟩_D is the D-bit classical data associated with address j, and |g_j⟩_{(Λ−1)D} is a garbage state that contains shuffled and possibly phase-flipped versions of classical data at other addresses. Note that the garbage could be uncomputed by copying the data into an ancilla register and then running LOAD_ss in reverse, but this will not be necessary in our application.

FIG. 3. Select-swap circuit with variable depth/width, which coherently loads classical data into a D-qubit data register. The n = s + λ qubit control register is divided up into s qubits for the select control and λ qubits for the swap control. The size of the ancilla register is Q_λ = (2^λ − 1)D. Select iterates through all 2^s settings of the first s bits and writes the corresponding 2^λ D bits of classical data to the final two registers, costing O(2^s) depth. Swap moves the correct D-bit register into the first of the 2^λ positions in O(λ) depth. When select is implemented with unary iteration to minimize T-count and T-depth, an additional s − 1 ancilla qubits are needed (not shown) [39]. Implementation details for select and swap are given in Fig. 16 of App. B 1.
The idea behind select-swap is as follows. The n = s + λ qubit control register is divided into s qubits for the select control and λ qubits for the swap control, as shown schematically in Fig. 3. This division corresponds to a partition of the 2^n D-bit entries in the classical database into 2^s subsets of size 2^λ = Λ. The first portion of LOAD_ss is a Select subroutine, which implements unary iteration [39] through the s high bits of the control register and, controlled on each setting, writes the corresponding subset of classical data into Λ D-qubit registers. The depth of Select is exponential in the number of control bits s. The next portion is the Swap subroutine, which, controlled on the λ low bits of the control, swaps the appropriate ancilla register into the top register, leaving the other Λ − 1 registers in some garbage state, as defined by the following equation.

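$$\mathrm{Swap}\, |j\rangle_{\lambda} |\xi_0\rangle_D |\xi_1\rangle_D \cdots |\xi_{\Lambda-1}\rangle_D = |j\rangle_{\lambda} |\xi_j\rangle_D |h_j\rangle_{(\Lambda-1)D},$$
where the states |ξ_i⟩_D are arbitrary and |h_j⟩_{(Λ−1)D} is a garbage state. The depth of Swap is linear in the number of control bits λ. More details of these two subroutines are given in App. B 1.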
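The division of labor between Select and Swap can be mimicked classically for a single basis-state address. The sketch below is only a bookkeeping model under stated assumptions: it tracks which database entry sits in which register, ignores phases and superposition, and uses a simple layered pairing of registers that may differ from the arrangement in App. B 1. The function name `select_swap_load` is hypothetical.

```python
def select_swap_load(data, address, lam):
    """Classically track Select followed by Swap for one n-bit address.

    Returns (loaded_entry, garbage_entries, num_controlled_swaps).
    `data` has 2**n entries; the lam low address bits drive the swap network.
    """
    Lam = 1 << lam
    high, low = address >> lam, address & (Lam - 1)
    # Select: the high bits pick one of the 2**s subsets of size Lam.
    registers = list(data[high * Lam:(high + 1) * Lam])
    swaps = 0
    # Swap: one layer of controlled swaps per low address bit.
    for i in range(lam):
        stride = 1 << i
        for start in range(0, Lam, 2 * stride):
            swaps += 1                      # the gate exists whatever the control value
            if (low >> i) & 1:
                registers[start], registers[start + stride] = (
                    registers[start + stride],
                    registers[start],
                )
    return registers[0], registers[1:], swaps

database = list(range(100, 116))            # 2**4 illustrative entries
entry, garbage, swaps = select_swap_load(database, address=11, lam=2)
```

With λ = 2 the network uses 2^λ − 1 = 3 controlled swaps over two layers, and the three registers that are not read out hold the reshuffled remainder of the selected subset, playing the role of the garbage state.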
Note that during the Select subroutine, fanout CNOT gates are used to write the classical data, where the presence or absence of a target on each qubit is determined by the classical data at compile time. Fanout CNOT is an architectural primitive for surface-code based quantum computers, so we assume its cost is minimal relative to non-Clifford gates [23]. If single-target CNOT gates are required by the architecture instead, the fanout CNOT can be decomposed into a logarithmic depth network of normal CNOT gates [19]. Moreover, the Swap subroutine requires controlled-swap gates, which can be implemented fairly efficiently with a phase-incorrect parallel controlled-swap operation, shown in Fig. 22 of App. C (the incorrect phases can be absorbed into the garbage states |g_j⟩).

C. Bucket-brigade data loading
The bucket-brigade QRAM was initially proposed in Ref. [14]. The primary motivation behind studying this approach is that it has improved noise-resilience compared to the select-swap approach [14-18, 40, 41]. We will not reprise this argument here; we refer the reader to the references, especially Refs. [16,18]. If this noise-resilience can be realized in a physical machine, it can reduce the resources required to error correct the data-loading operation, leading to physical resource savings. Whether SS-QRAM or BB-QRAM is preferred will depend upon the underlying architecture and details on the overhead required to implement quantum error correction.
FIG. 4. Bucket-brigade circuit with variable circuit depth/width, which coherently loads classical data into a D-qubit data register. The circuit iterates through all 2^s settings of the first s bits of the address register, with the first and last iteration shown. In each iteration, a multiply controlled Toffoli gate (where some bits are controlled on |0⟩ and some on |1⟩, drawn with the generic control symbol) computes whether the s bits agree with the address setting |j⟩. If so, the remaining λ = n − s address qubits and the D-qubit bus (third register) in the state |+⟩^{⊗D} are swapped into ancilla registers, where the bucket-brigade circuit (denoted by BB_i, see Fig. 17 of App. B 2 and Ref. [18]) is performed to load the state H^{⊗D}|b_j⟩_D into the bus. No garbage is produced, since during all other iterations the ancillas are in |0⟩, which is a +1 eigenvector of the bucket-brigade circuit. When the sequence of multiply controlled Toffoli gates is implemented with unary iteration to minimize T-count and T-depth, an additional s − 1 ancilla qubits are needed (not shown) [39], which, combined with the q_λ = (2^λ − 1)(2D + 1) − 1 ancillas needed for the bucket-brigade, gives a total of Q_λ = (2D + 1)2^λ + n − D − 2 ancillas for the LOAD_bb operation.
Unlike the SS-QRAM, the BB-QRAM does not leave garbage, and the unitary LOAD_bb can be defined by the equation
$$\mathrm{LOAD}_{bb}\, |j\rangle_n |0\rangle_D |0\rangle_{Q_\lambda} = |j\rangle_n |b_j\rangle_D |0\rangle_{Q_\lambda},$$
where the last register contains Q_λ ancilla qubits that begin and end in |0⟩. Similar to the swap portion of the SS-QRAM circuit (see Fig. 16 of App. B 1), the BB-QRAM circuit is composed primarily of controlled-swap gates, with an arrangement that acts to minimize entanglement and enhance noise resilience. Additionally, as in the SS-QRAM case, we can trade circuit depth for width using a parameter λ [18] as shown in Fig. 4, where the implementation of the gates labeled BB_i is described in more detail in App. B 2. However, reducing circuit width does not reduce the overall T-count as it does with the SS implementation, which is a drawback of this scheme. Note that only the portion of the circuit labeled BB_i will retain the noise resilience, and thus the noise resilience is lost when we choose λ = 0.

D. QRAM operation with flags
We introduce a third and final data-loading operation, which generalizes the previous approaches by allowing data of the form cos(θ/2)|0⟩ + sin(θ/2)|1⟩ to be loaded for any θ, rather than just classical bits |0⟩ or |1⟩. Additionally, this load operation has a flag qubit and only acts non-trivially if the flag is set to 1, which is necessary to implement the minimal depth "pre-rotated" version of state preparation in Sec. III D.
Suppose we have a set of N = 2^n angles {θ^(j)}_{j=0}^{N−1}. Let R_y(θ) denote the unitary rotation about the Y-axis by angle θ, that is,
$$R_y(\theta) = \begin{pmatrix} \cos(\theta/2) & -\sin(\theta/2) \\ \sin(\theta/2) & \cos(\theta/2) \end{pmatrix}.$$
The data-loading with flags operation, which we call LOADF, acts on four registers: an n-qubit "address" register, a single-qubit "flag" register, an N-qubit "angle" register, and an N-qubit ancilla register. The unitary LOADF operation is defined by the equation
$$\mathrm{LOADF}\, |j\rangle_n |f\rangle_1 |0\rangle_N |0\rangle_N = |j\rangle_n |f\rangle_1 \left( R_y\!\left(f\,\theta^{(j)}\right) |0\rangle_1 \otimes |0\rangle_{N-1} \right) |0\rangle_N.$$
When the first address register is in the state |j⟩, the unitary R_y(θ^(j)) is applied to the first position of the angle register if and only if the flag bit f is set to 1.
Our implementation of LOADF involves applying doubly-controlled-R_y gates, which can be done using the decomposition in Fig. 18 of App. C and adding an extra control. Indeed, we will need to apply a gate we call V, which enacts a unitary specified by the equation
$$V = \sum_{f \in \{0,1\}} \sum_{x \in \{0,1\}^N} |f\rangle\langle f| \otimes \left( \bigotimes_{k=0}^{N-1} R_y\!\left(f\, x_k\, \theta^{(k)}\right) \right) \otimes |x\rangle\langle x|_N.$$
In other words, for each k = 0, . . ., N − 1, V performs an R_y(θ^(k)) rotation on the k-th qubit of the angle register if the flag bit is 1 and the k-th qubit of the ancilla register is 1. This can be implemented in O(1) total T-depth using the doubly-controlled-rotation decomposition of Fig. 18 with the parallel-Toffoli construction implicit from Fig. 19.
Since LOADF only applies R_y(θ^(j)) and no other rotations, we must first prepare the state |x_0⟩_1 . . . |x_{N−1}⟩_1 for which x_j = 1 and x_k = 0 for all k ≠ j, so that application of V applies the correct rotation. To prepare this state, we flip the first of the N ancilla bits to |1⟩, and then run the inverse of the controlled-swap network from the LOAD_ss operation in Fig. 3, which moves the |1⟩ state into the j-th ancilla. Application of V computes R_y(f θ^(j))|0⟩ into the j-th position of the angle register, while leaving all other angle registers in |0⟩. Next, the controlled-swap network from LOAD_ss is applied to move the j-th angle register to the first position and the j-th ancilla register back to the first position, so that an X gate on the first ancilla leaves the entire ancilla register back in |0⟩. This implementation realizes the LOADF operation and is depicted in Fig. 5. A complete example for n = 2 is given in Fig. 6.
The LOADF operation loads a single angle conditioned on a single flag. In general, we can load D states R_y(θ_r)|0⟩_1 for r = 1, . . ., D by copying the flag, angle, and ancilla registers D times and running D copies of LOADF in parallel, where the controlled-swap operations for all D copies share a common control (the n-qubit address register). The address qubits only ever act as controls for multi-qubit controlled-swap gates, so increasing D does not increase the T-depth. Note also that the LOADF operation requires controlled-swap gates with the target registers in states of the form R_y(θ)|0⟩, which does not admit the use of the phase-incorrect controlled-swap that can be used for LOAD_ss and LOAD_bb. Therefore, we utilize the parallel implementation shown in Fig. 19. This circuit shows a controlled swap between arbitrary two-qubit states with the assistance of two ancilla qubits. The Toffoli construction from [42] then allows one to implement the Toffoli with a T-depth of 1 and a T-count of 4, at the cost of a single extra clean ancilla qubit.
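The sequence of steps just described can be traced for a fixed basis-state address and flag. The sketch below is a simplified model under explicit assumptions: the address and flag are classical values, the angle register is stored as a list of single-qubit states (valid here because it stays a product state), and the layered swap network is an illustrative stand-in for the controlled-swap network of LOAD_ss. All function names are hypothetical.

```python
import numpy as np

def ry(theta):
    """Rotation about the Y axis: ry(theta) @ [1, 0] = [cos(theta/2), sin(theta/2)]."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def swap_network(registers, address, n, inverse=False):
    """Layered swaps that bring position `address` to position 0 (or back if inverse)."""
    layers = range(n - 1, -1, -1) if inverse else range(n)
    for i in layers:
        if (address >> i) & 1:                      # layer i is controlled on address bit i
            stride = 1 << i
            for start in range(0, len(registers), 2 * stride):
                a, b = start, start + stride
                registers[a], registers[b] = registers[b], registers[a]

def loadf(thetas, address, flag):
    """Trace LOADF for classical `address` and `flag`; returns (angle, ancilla) registers."""
    N = len(thetas)
    n = N.bit_length() - 1
    angle = [np.array([1.0, 0.0]) for _ in range(N)]
    ancilla = [0] * N
    ancilla[0] = 1                                   # X on the first ancilla
    swap_network(ancilla, address, n, inverse=True)  # move the 1 to position `address`
    for k in range(N):                               # gate V: rotate where flag and ancilla are 1
        if flag == 1 and ancilla[k] == 1:
            angle[k] = ry(thetas[k]) @ angle[k]
    swap_network(angle, address, n)                  # bring the rotated qubit to position 0
    swap_network(ancilla, address, n)                # bring the 1 back to position 0
    ancilla[0] ^= 1                                  # final X resets the ancilla register
    return angle, ancilla

thetas = [0.3, 0.7, 1.1, 1.9]
angle, ancilla = loadf(thetas, address=2, flag=1)
```

After the call, the first angle qubit holds R_y(θ^(2))|0⟩, all other angle qubits are |0⟩, and the ancilla register is back in the all-zeros state; with `flag=0` nothing is rotated.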

E. QRAM resource estimates
We now present the non-Clifford resource estimates for the three data-loading implementations discussed above, summarized in Tab. II. For LOAD, we report the resources required for the hybrid scheme that allows one to vary the circuit depth and width via the parameter λ ∈ {0, 1, . . ., log(N)}. For both LOAD_ss and LOAD_bb, the T-depth is minimized at O(n) when λ = n. Taking D = O(1), the LOAD_ss operation achieves a minimum T-count of O(2^{n/2}) when λ ≈ n/2, while LOAD_bb requires Ω(2^n) T-count for all λ. For λ = 0 one has unary iteration [39], which yields the minimal-qubit-count but maximal-depth case.
Our counts can be verified using the circuit diagrams in Figs. 3, 4, and 5, along with the following additional observations, which reference gate decompositions given in App. C.
• The Select gate from Fig. 3 and the sequence of multiply-controlled Toffoli gates in Fig. 4 are accomplished with unary iteration which requires 4(2 s − 1) (non-parallelizable) T gates and s − 1 additional ancilla qubits that are not depicted in those figures (see Ref. [39] for construction).We assume intermediate measurements and subsequent classically controlled Clifford gates are allowable; if not, an additional 4(s − 1) T gates would be needed.
• The Swap gate that appears in Figs. 3 and 5 is accomplished using 2 λ − 1 controlled-swap gates, occuring over λ parallel layers.For LOAD ss , these parallel controlled-swap gates are implemented with the construction in Fig. 22, costing total T -count 4(2 λ − 1) and T -depth 4λ.For LOADF (where λ = n), the goal is to give the minimal possible depth, so we choose to implement the parallel controlled-swap gates using the construction in Fig. 19 (costing 2 n−1 ancilla qubits) where we perform each of the Toffolis using the depth-1 count-4 construction of Ref. [42] (costing 2 n−1 additional ancilla qubits and requiring intermediate measurements).The total T -depth for the Swap gate is n and the total T -count is 4(2 n − 1).The final two swap gates of Fig. 5 are performed in parallel.
• The costs of the BB j gates of Fig. 4 are evaluated by inspection of Fig. 17 and quoted in the caption of that figure.
• The controlled-controlled-rotations appearing within the circuit for the gate $V$ of Fig. 5 are implemented with the construction of Fig. 18, which introduces a set of parallel Toffoli gates and single-qubit rotations. Each of these parallel Toffoli gates is implemented the same way as the parallel controlled-swap gates, reusing the same $2^n$ ancillas and contributing depth 1 and count $4 \cdot 2^n$. Note that if it is known that the flags are set to 1, these Toffolis become CNOTs, which are Clifford gates. Each single-qubit rotation costs $T$-depth and $T$-count equal to $R_y$, where $R_y$ is the number of $T$ gates in the gate synthesis of a single-qubit rotation unitary, for which the leading term is $3\log(1/\delta)$ when the target precision is $\delta$ (see Eq. (49) of the error analysis in Sec. V).
• For LOADF, Fig. 5 is for $D = 1$; to generalize to larger $D$, the flag, angle, and ancilla registers are copied $D$ times, and all gates are performed in parallel, multiplying the $T$-count by a factor of $D$ and leaving the depth unchanged. We do not examine a depth/width tradeoff for LOADF as it is only used in our minimal-depth construction.
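The closed-form counts quoted in the bullets above can be tabulated directly as a sanity check. The following is a minimal sketch, not the authors' accounting: the function names are hypothetical, the counts are for a single data bit ($D = 1$) exactly as stated in the bullets, and the ancilla count for the minimal-depth Swap gate assumes the two ancilla contributions quoted above are each $2^{n-1}$. It makes no attempt to reproduce the full entries of Tab. II.

```python
def unary_iteration_t_count(s, allow_measurement=True):
    """T-count of unary iteration over s >= 1 control bits.

    4(2^s - 1) T gates when intermediate measurements and classically
    controlled Cliffords are allowed; 4(s - 1) more otherwise.
    """
    base = 4 * (2**s - 1)
    return base if allow_measurement else base + 4 * (s - 1)


def swap_network_ss(lam):
    """(controlled-swaps, T-count, T-depth) of the LOAD_ss Swap gate, D = 1."""
    n_cswap = 2**lam - 1            # arranged in lam parallel layers
    return n_cswap, 4 * n_cswap, 4 * lam


def swap_network_loadf(n):
    """(T-count, T-depth, ancillas) of the minimal-depth Swap gate (lam = n)."""
    ancillas = 2 * 2**(n - 1)       # Fig. 19 ancillas plus depth-1 Toffoli ancillas
    return 4 * (2**n - 1), n, ancillas


if __name__ == "__main__":
    for lam in range(4):
        print(lam, swap_network_ss(lam))
```

Comparing `swap_network_ss(n)` with `swap_network_loadf(n)` shows the trade made in the minimal-depth case: the same $T$-count, a factor of four less $T$-depth, and additional ancilla qubits.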
In Sec. I D, we described how block-encoding is reduced to controlled-state preparation. Here we give explicit circuits for state preparation and controlled-state preparation. In Sec. III B, we describe the conceptual approach to state preparation coming from prior literature. In Sec. III C, we give circuits for what we call the "fixed-precision" version of state preparation and controlled-state preparation. Then, in Sec. III D, we give circuits for what we call the "pre-rotated" version of state preparation and controlled-state preparation. Both versions share the same conceptual framework but differ in how the rotation angles enter the circuit. The fixed-precision version of controlled-state preparation utilizes either the SS-QRAM or BB-QRAM circuit from Sec. II as a subroutine, and is compatible with the depth/width trade-off parameterized by $\lambda$. Choosing $\lambda = 0$ with the SS-QRAM leads to our minimal $T$-count construction. Meanwhile, the pre-rotated version is used to achieve minimal $T$-depth. Indeed, the $T$-depth for preparing an $N$-dimensional state to error $\epsilon$ scales as $O(\log(N/\epsilon))$ for the pre-rotated construction, an asymptotic improvement over $O(\log^2(N/\epsilon))$ from Ref. [19].

B. Rotation angles and binary tree data structure
Prior to the specific circuit-level instantiations, we describe our general approach to state preparation, which has appeared throughout the literature [19,26,29,30]. The task is to construct a circuit that creates an $n$-qubit quantum state $|\psi\rangle_n = \|\vec\beta\|^{-1}\sum_{j=0}^{N-1}\beta_j|j\rangle_n$ given a list of its coefficients $\{\beta_j\}$, where $N = 2^n$ is a power of two (zero padding of the classical data can ensure this) and $\|\vec\beta\|$ denotes the Euclidean vector norm, which ensures normalization. We use the notation $\vec\beta_u^v$ to denote the vector of values between the high ($v$) and low ($u$) indices with $v > u$.
We assume knowledge of the real amplitudes $\beta_j$. Given these amplitudes, we can construct a classical binary tree data structure as follows. The tree has depth $n$, and we store the values $|\beta_j|^2$ and $\mathrm{sgn}(\beta_j)$ in the leaf nodes of the tree.$^{8}$ Then, each internal node stores the sum of the two child nodes. This proceeds to the root of the tree, which stores the sum of the squares of all $N$ amplitudes, i.e. $\|\vec\beta\|^2$. We show an example of such a tree in Fig. 7.
The binary tree data structure allows efficient updates if single amplitudes change; that is, if an amplitude is updated, only log(N ) of the 2N − 1 nodes need to be recomputed.Thus, the data structure can be efficiently maintained if matrix entries arrive in an on-line fashion [26].
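To make the update cost concrete, here is a minimal classical sketch of such a tree. It assumes a heap-style array layout (root at index 1, children of node `r` at `2r` and `2r+1`, leaves at indices `N` to `2N-1`); this layout and the names `build_tree` and `update` are illustrative choices, not taken from the paper.

```python
def build_tree(beta):
    """Build the sum-of-squares tree for real amplitudes beta (len a power of 2).

    tree[r] for r = 1..2N-1 holds the node values (index 0 is unused);
    signs[j] holds sgn(beta_j) as +1 or -1.
    """
    N = len(beta)
    assert N > 0 and N & (N - 1) == 0, "zero-pad the data to a power of two"
    tree = [0.0] * (2 * N)
    for j, b in enumerate(beta):
        tree[N + j] = b * b                      # leaves store |beta_j|^2
    for r in range(N - 1, 0, -1):
        tree[r] = tree[2 * r] + tree[2 * r + 1]  # internal node = sum of children
    signs = [1 if b >= 0 else -1 for b in beta]
    return tree, signs


def update(tree, signs, j, new_value):
    """Overwrite amplitude j; return how many internal nodes were recomputed."""
    N = len(signs)
    r = N + j
    tree[r] = new_value * new_value
    signs[j] = 1 if new_value >= 0 else -1
    recomputed = 0
    r //= 2
    while r >= 1:                                # walk from the leaf's parent to the root
        tree[r] = tree[2 * r] + tree[2 * r + 1]
        recomputed += 1
        r //= 2
    return recomputed


if __name__ == "__main__":
    tree, signs = build_tree([1.0, -2.0, 3.0, 0.0])
    print(tree[1])                               # squared norm stored at the root
    print(update(tree, signs, 3, 2.0), tree[1])
```

Only the nodes on the path from the modified leaf to the root are touched, which is the $\log(N)$ behaviour described above.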
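Using the data in the binary tree, we construct a circuit that prepares the state $|\psi\rangle$. The state is initialized to the $|0\rangle_n$ state. In step 1, a $Y$-axis rotation is applied on the first of the $n$ qubits by the angle $\theta_1 = 2\cos^{-1}(\|\vec\beta_0^{N/2}\|/\|\vec\beta\|)$, as in Eq. (14), yielding the state
$$|\psi^1\rangle_n = \left(\cos(\theta_1/2)|0\rangle + \sin(\theta_1/2)|1\rangle\right)|0\rangle_{n-1}.$$
This angle can be computed from the values stored at the root of the binary tree and its children. In step 2, a rotation is applied to the second qubit, either by the angle $\theta_2 = 2\cos^{-1}(\|\vec\beta_0^{N/4}\|/\|\vec\beta_0^{N/2}\|)$ or by the angle $\theta_3 = 2\cos^{-1}(\|\vec\beta_{N/2}^{3N/4}\|/\|\vec\beta_{N/2}^{N}\|)$, where the angle $\theta_2$ is used conditioned on the first qubit being in state $|0\rangle$ and $\theta_3$ is used conditioned on the first qubit being in state $|1\rangle$. Note that $\theta_2$ and $\theta_3$ can each be computed from values stored at one level-2 node and its children. This creates the state
$$|\psi^2\rangle_n = \left[\cos(\theta_1/2)|0\rangle\left(\cos(\theta_2/2)|0\rangle + \sin(\theta_2/2)|1\rangle\right) + \sin(\theta_1/2)|1\rangle\left(\cos(\theta_3/2)|0\rangle + \sin(\theta_3/2)|1\rangle\right)\right]|0\rangle_{n-2}.$$
Collecting terms, we can rewrite this state as
$$|\psi^2\rangle_n = \left(\alpha_{00}|00\rangle + \alpha_{01}|01\rangle + \alpha_{10}|10\rangle + \alpha_{11}|11\rangle\right)|0\rangle_{n-2},$$
where $\alpha_{00} = \|\vec\beta_0^{N/4}\|/\|\vec\beta\|$, $\alpha_{01} = \|\vec\beta_{N/4}^{N/2}\|/\|\vec\beta\|$, $\alpha_{10} = \|\vec\beta_{N/2}^{3N/4}\|/\|\vec\beta\|$, and $\alpha_{11} = \|\vec\beta_{3N/4}^{N}\|/\|\vec\beta\|$. In general, for any $w \in \{1, \ldots, n\}$ and for any $w$-bit binary string $y \in \{0, 1\}^w$, we can define $\alpha_y = \|\vec\beta_{y2^{n-w}}^{(y+1)2^{n-w}}\|/\|\vec\beta\|$, where for the purpose of multiplication on the right-hand side we interpret $y$ as the integer associated with its binary string. Note that the values stored at the $2^w$ nodes of level $w$ of the binary tree are precisely $\|\vec\beta\|^2|\alpha_y|^2$ for $y \in \{0, 1\}^w$. Extrapolating from the pattern in steps 1 and 2, we assert that the state after step $w$ is given by
$$|\psi^w\rangle_n = \sum_{y\in\{0,1\}^w}\alpha_y|y\rangle_w|0\rangle_{n-w}.$$
This formula will hold if we implement the transition from $|\psi^w\rangle_n$ to $|\psi^{w+1}\rangle_n$ as follows:
$$|y\rangle_w|0\rangle|0\rangle_{n-w-1} \mapsto |y\rangle_w\left(\cos(\theta_{1y}/2)|0\rangle + \sin(\theta_{1y}/2)|1\rangle\right)|0\rangle_{n-w-1}.$$
That is, we apply a rotation to qubit $w + 1$ by an angle $\theta_{1y}$, where $\theta_{1y} = 2\cos^{-1}(\alpha_{y0}/\alpha_y)$, which depends on $y$ (the subscript $1y$ is read as the integer whose binary representation is $y$ with a leading 1, consistent with $\theta_1$, $\theta_2$, and $\theta_3$ above).$^{9}$ At the final iteration, the state becomes
$$|\psi^n\rangle_n = \sum_{y\in\{0,1\}^n}\alpha_y|y\rangle_n = \|\vec\beta\|^{-1}\sum_{j=0}^{N-1}|\beta_j|\,|j\rangle_n,$$
where in the final equivalence, we switched from binary notation, denoted by the variable $y$, to decimal notation, denoted by the variable $j$. This is exactly the state that we set out to prepare, up to the sign of the amplitude. To set the correct sign, we apply a $(-1)$ phase to any state $|j\rangle$ for which the $j$-th leaf in the binary tree indicates a sign of $(-1)$. In practice, this is accomplished by loading the sign bit into an ancilla register, applying a Pauli-$Z$ gate, and then unloading the sign bit.$^{10}$
$^{8}$ If we wish to allow $\beta_j$ to be complex, we can store a complex phase in place of $\mathrm{sgn}(\beta_j)$.
The binary tree data structure in Fig. 7 has $2N - 1$ nodes. All nodes must store the value $\|\vec\beta_u^v\|^2$ and the leaf nodes must also store a single bit for the sign. However, we notice that in the above construction, all that matters is the angle of rotation, and there are only $N - 1$ distinct angles, which are each inferred from the values at two of the nodes. (For example, $\theta_1 = 2\cos^{-1}(\|\vec\beta_0^{N/2}\|/\|\vec\beta\|)$.) Thus, in our resource analysis, we assume that these angles are classically pre-computed to avoid the need for any arithmetic on the quantum computer. Ultimately, the information about these angles enters the circuit through the coherent data-loading operations discussed in Sec. II.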
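The classical pre-computation of the $N - 1$ angles, and the claim that the cascade of conditional rotations reproduces the target amplitudes, can be checked numerically. The sketch below is illustrative only: it assumes a heap ordering of the angles (angle `theta[r]` belongs to internal node `r`, with children `2r` and `2r+1`), takes the first qubit to be the most significant bit of $j$, and sets the angle to zero for any subtree of zero norm. The function names are hypothetical.

```python
import math


def rotation_angles(beta):
    """Return theta[1..N-1] in heap order (theta[0] is unused)."""
    N = len(beta)
    tree = [0.0] * (2 * N)
    for j, b in enumerate(beta):
        tree[N + j] = b * b
    for r in range(N - 1, 0, -1):
        tree[r] = tree[2 * r] + tree[2 * r + 1]
    theta = [0.0] * N
    for r in range(1, N):
        if tree[r] > 0:
            # cos(theta/2) = norm of the left child / norm of this node
            theta[r] = 2 * math.acos(math.sqrt(tree[2 * r] / tree[r]))
    return theta


def amplitudes_from_angles(theta, signs):
    """Classically follow the rotation cascade for every basis state |j>."""
    N = len(signs)
    n = N.bit_length() - 1
    amps = []
    for j in range(N):
        r, a = 1, 1.0
        for w in range(n):
            bit = (j >> (n - 1 - w)) & 1          # qubit w+1, most significant first
            a *= math.sin(theta[r] / 2) if bit else math.cos(theta[r] / 2)
            r = 2 * r + bit                       # descend to the selected child
        amps.append(signs[j] * a)                 # sign applied at the end
    return amps


if __name__ == "__main__":
    beta = [3.0, -1.0, 2.0, 0.0, 1.0, 1.0, -2.0, 4.0]
    signs = [1 if b >= 0 else -1 for b in beta]
    print(amplitudes_from_angles(rotation_angles(beta), signs))
```

For the example vector, whose Euclidean norm is 6, the printed amplitudes agree with $\beta_j/6$ up to floating-point error.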
For controlled-state preparation, we are tasked with preparing one of N different arbitrary states, depending on the setting of a control register.To do so, we must compute the N − 1 angles for each of these N states, giving N (N − 1) distinct angles, which can be organized into N separate binary tree data structures.

C. Fixed-precision circuit
The high-level protocol described in the previous section calls for applying n single-qubit rotations by an angle that depends on the setting of other qubits.Our "fixed-precision" state preparation protocol, which we describe in this section, performs each of these rotations by loading a binary representation of the correct angle up to some fixed precision (i.e. an approximation of the angle with a pre-specified number of bits), performing a controlled R y rotation by that angle, and then uncomputing the binary description of the angle.
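As an illustration of the fixed-precision idea, the sketch below rounds an angle to $t$ bits and reassembles it as a sum of fixed rotation angles $\pi, \pi 2^{-1}, \ldots, \pi 2^{-t+1}$, one per bit. It assumes the convention used in the rounding-error analysis of Sec. V, namely that the rounded angle is the nearest multiple of $2\pi/2^t$; other normalizations of the binary encoding are possible, and the function names are hypothetical.

```python
import math


def quantize_angle(theta, t):
    """Round theta in [0, pi] to the nearest multiple of 2*pi/2**t.

    Returns t bits, most significant first; bit j (1-indexed) selects a
    fixed rotation by pi * 2**(1 - j).
    """
    k = round(theta * 2**t / (2 * math.pi))
    return [(k >> (t - j)) & 1 for j in range(1, t + 1)]


def angle_from_bits(bits):
    """Total rotation angle applied by the singly controlled fixed rotations."""
    return sum(b * math.pi * 2.0 ** (1 - j) for j, b in enumerate(bits, start=1))


if __name__ == "__main__":
    t = 8
    bits = quantize_angle(1.0, t)
    err = abs(angle_from_bits(bits) - 1.0)
    print(bits, err, math.pi * 2.0 ** (-t))   # error stays below pi * 2^-t
```

Each bit of the loaded register then controls one rotation by a fixed angle, so no arithmetic is needed on the quantum computer.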
The angle-loading step is essentially a QRAM query, since a different angle must be loaded for each control setting as in Eq. (2). However, in this application, it is allowable to leave garbage in an ancilla register as long as the garbage is eventually uncomputed. Our implementation of the angle-loading step assumes that the initial state has a $t$-bit description of all $N - 1$ angles stored in $N - 1$ $t$-qubit ancilla registers. The circuit consists entirely of controlled-swap gates that shuffle these $N - 1$ ancillas to move the correct angle into the first position, leaving the other $N - 2$ registers in a garbage state that is entangled with the data. These circuits are very similar to the swap portion of the select-swap networks described in Sec. II B.
FIG. 8. Fixed-precision state preparation circuit, which approximately prepares the state $|\psi\rangle_n$ into the first $n$ qubits. The protocol requires an ancilla register initialized in a computational basis state $|\vec\Theta\rangle_{(N-1)t}|\vec b\rangle_N$ containing information about the $N - 1$ rotation angles and $N$ sign bits. Note that $a = (N - 2)t$, for a total of $D = (N - 1)t + N$ ancillas. The circuit alternates between swapping a $t$-bit description of the next angle into the first ancilla register (denoted by $S_p$ for $p = 2, \ldots, n$), and rotating a single qubit by the angle stored in that register (denoted by $R_y$ in the figure). The control on the $t$-qubit register indicates that a different rotation angle is performed for each of the $2^t$ possible settings of the register. The operation $S_1$ is omitted because it is the identity, and the other $S_p$ operations can each be completed in constant $T$-depth; an example of this implementation for $n = 3$ is shown in Fig. 9. The $Z$ gate acts only on the first qubit of the $t$-qubit register, which contains the appropriate sign bit after application of $S_\pm$.
FIG. 9. Example three-qubit fixed-precision state preparation circuit. For simplicity of presentation, the sign bits and the inverse operation $S^\dagger$ are omitted. The two boxes correspond to implementations of $S_2$ and $S_3$ from Fig. 8. The key point is that all controlled-swaps in the implementation of $S_p$ have a single common control, which allows them all to be implemented with $T$-depth 4, independent of $n$, as shown in Fig. 22 of App. C.
Given the set of $N - 1$ angles $\{\theta_1, \ldots, \theta_{N-1}\}$ needed to create the state $|\psi\rangle_n$, let $|\tilde\theta_j\rangle_t$ be a computational basis state corresponding to the binary representation of the quantity $2^t\theta_j/\pi$, rounded to the nearest integer. Moreover, let $\{s_0, \ldots, s_{N-1}\}$ denote the sign bits for each of the $N$ computational basis states. Define
$$|\vec\Theta\rangle_{(N-1)t} = |\tilde\theta_1\rangle_t|\tilde\theta_2\rangle_t\cdots|\tilde\theta_{N-1}\rangle_t, \qquad |\vec s\rangle_N = |s_0\rangle|s_1\rangle\cdots|s_{N-1}\rangle.$$
Then, we assume that the state preparation protocol acts on an initial state of $n + D$ qubits, with $D = (N - 1)t + N$, given by
$$|0\rangle_n|\vec\Theta\rangle_{(N-1)t}|\vec s\rangle_N.$$
This is a computational basis state and thus can be prepared in a single Clifford layer by applying $X$ gates on the appropriate qubits determined by the classical data. In the case of controlled-state preparation, a different set of $N - 1$ angles and $N$ sign bits needs to be loaded depending on the setting of an $n$-qubit control register; in this case the operation LOAD$_{ss}$ or LOAD$_{bb}$ defined in Secs. II B and II C with $D = (N - 1)t + N$ is used to load in the $D$ bits of angular and sign data.
Recall that the state we want to create is $|\psi\rangle_n = \|\vec\beta\|^{-1}\sum_{j=0}^{N-1}\beta_j|j\rangle$. For each basis state $|j\rangle$, a different sequence of $n$ angles and one sign bit is applied. Accordingly, we define $n + 1$ controlled swap networks denoted by $S_1, S_2, S_3, \ldots, S_n, S_\pm$ (note that $S_1$ will always be the identity operation and can be omitted from the circuit). These can each be written in the form
$$S_p = \sum_{j=0}^{N-1}|j\rangle\langle j|_n \otimes S_p^{(j)},$$
where $S_p^{(j)}$ acts on the $D$-qubit ancilla register such that the product $S_p^{(j)}\cdots S_2^{(j)}S_1^{(j)}$ has the action of swapping the $t$-bit description of the angle associated with the $p$-th rotation for $|j\rangle$ into the first $t$-qubit ancilla register, and the product ending in $S_\pm^{(j)}$ has the action of swapping the sign bit for $|j\rangle$ into the first 1-qubit register. Importantly, it will be the case that $S_p$ is controlled only on the first $p - 1$ bits of $|j\rangle$.
To prepare $|\psi\rangle_n$, these swap networks are interleaved with controlled single-qubit rotations: for each $p = 1, \ldots, n$, after the gate $S_p$, a controlled-$R_y$ rotation is implemented on qubit $p$, controlled by the $t$-bit description of the angle stored in the first register. This is accomplished with $t$ controlled-$R_y$ rotations by a fixed angle with a single bit as the control. After the gate $S_\pm$, a $Z$ gate is applied to the first ancilla qubit to apply the correct sign. Finally, the angles are restored to their initial positions by performing the gate $S^\dagger = S_1^\dagger S_2^\dagger \cdots S_n^\dagger S_\pm^\dagger$. This protocol realizes the unitary SP depicted in Fig. 8, which maps the initial state above to an approximation of $|\psi\rangle_n$ on the first $n$ qubits while returning the ancilla register to its initial state.
The controlled swap networks in the SS-QRAM circuit have depth $O(p)$ when there are $O(p)$ controls. A straightforward way to implement $S_p$ is to use these networks first to do the reverse of $S_{p-1}$ in depth $O(p)$ to swap out the $(p - 1)$-th angle, and second to swap in the correct angle in depth $O(p)$. This would suggest that the $T$-depth of performing $S_1, \ldots, S_n, S_\pm$ is $O(n^2)$. However, we give an optimization that reduces the depth of each $S_p$ to $O(1)$, and thus gives an overall depth of $O(n)$; this is seen in the example implementations of $S_2$ and $S_3$ for $n = 3$ shown in Fig. 9. The main idea is to avoid undoing the work already accomplished by $S_{p-1}$, and to note that each $S_p$ can be controlled on just one of the $n$ data qubits.
The full block-encoding circuit in Fig. 1 requires a controlled-state preparation, not state preparation. Here there are $N$ different states $|\phi_k\rangle$, each with its own angle data $\vec\Theta^{(k)}$ and sign data $\vec s^{(k)}$. To perform controlled-state preparation, one simply loads this data conditioned on a control register in state $|k\rangle$ using either LOAD$_{ss}$ or LOAD$_{bb}$ defined in Sec. II. Then the state preparation circuit from Fig. 8 is performed, followed by the inverse of the LOAD operation to clear the data registers. This is shown in Fig. 10.

FIG. 10. Controlled-state preparation circuit diagram for the fixed-precision approach to state preparation. For each of the $N$ possible settings of the bottom register, a different $n$-qubit state $|\phi_k\rangle$ is prepared in the top register. This is accomplished with three subroutines. The LOAD subroutine, which can be either LOAD$_{ss}$ from Fig. 3 or LOAD$_{bb}$ from Fig. 4, loads in a $t$-bit description of the $N - 1$ angles $|\vec\Theta^{(k)}\rangle_{(N-1)t}$ and the $N$ sign bits $|\vec s^{(k)}\rangle_N$ into the second register of size $D = (N - 1)t + N$, with the assistance of $Q_\lambda$ ancillas. The SP subroutine is given in Fig. 8, and prepares the state $|\phi_k\rangle_n$. Finally, the reverse of the LOAD operation is performed to reset the second register and the ancillas to $|0\rangle$. The instances of controlled-state preparation that appear in the block-encoding unitary $U_A$ of Fig. 1 can be accomplished with this circuit.

D. Pre-rotated circuit
In this subsection, we present an alternative approach to state preparation which achieves smaller $T$-depth than the fixed-precision version, both practically and asymptotically. The asymptotic $T$-depth scaling is $O(\log(N/\epsilon))$, with $\epsilon$ the error on the state prepared. The pre-rotated version takes the idea of pre-computing the angles and signs for state preparation one step further, and encodes them as the amplitude of a single qubit. For a given quantum state $|\psi\rangle_n$, we assume that we have classically computed the $N - 1$ angles $\{\theta_r\}_{r=1}^{N-1}$ and $N$ sign bits $\{s_j\}_{j=0}^{N-1}$ stored in the binary tree data structure as discussed previously. Let $|\theta_r\rangle_1$ denote the single-qubit state that encodes the angle $\theta_r$ as an amplitude, and for each $r$ let $V_r$ be a Clifford+$T$ gate decomposition of an $R_y$ rotation by some angle that prepares $|0\rangle \to |\theta_r\rangle$ up to error $\delta$.$^{11}$ Note that if we have classically pre-computed the angle $\theta_r$, we can also classically compute a gate sequence $V_r$ that gives a $\delta$ approximation for $R_y(\theta_r)$ in classical time $\mathrm{polylog}(1/\delta)$ [43]. We assume that the input state is the product state on $n + N - 1$ qubits given by
$$|0\rangle_n\bigotimes_{r=1}^{N-1}|\theta_r\rangle_1 \equiv |0\rangle_n|\Theta\rangle_{N-1}, \quad (30)$$
which also acts as the definition of $|\Theta\rangle_{N-1}$. This product state can be prepared by applying each of the $V_r$ to $N - 1$ ancillas initially in $|0\rangle$ in parallel. In the case of controlled-state preparation, later, we will use the LOADF operation to prepare a different initial state for each setting of a control register.
FIG. 11. Pre-rotated state preparation circuit, which approximately prepares the state $|\psi\rangle_n$ into the first $n$ qubits with garbage. The entanglement between the $n$-qubit data register and the $(N - 1)$-qubit garbage register can be uncomputed using a flag mechanism as described in the main text with additional $O(\log(N/\epsilon))$ cost, where $\epsilon$ is the error on the state prepared.
Given the state in Eq. (30) as input, we perform a nearly identical state preparation circuit to that shown in the previous subsection. The only differences are that the application of the sign bit with $S_\pm$ and a $Z$ gate can be omitted, as the sign bit is built into the state $|\theta_r\rangle$, and that the $nt$ non-Clifford controlled-$R_y$ rotations are replaced by $n$ single-qubit swap gates (which are Clifford gates). That is, rather than use the angle qubit as a control for a rotation, we inject the angle into the data with a simple swap gate, as shown in Fig. 11. For any computational basis state $|j\rangle$, $n$ of the $N - 1$ states $|\theta_r\rangle$ are injected into the data. Once the state preparation procedure is complete, we reverse the swap circuit to return all the angles $|\theta_r\rangle_1$ back to their initial positions, with the exception that the $n$ angles that were injected are replaced by $|0\rangle_1$. Let $f_{r|j} = 1$ if angle $\theta_r$ is an ancestor of the $j$-th leaf in the binary tree, i.e. if $|\theta_r\rangle$ was injected into the state in the computational path associated with $|j\rangle$; otherwise $f_{r|j} = 0$. Then we can state the action of our protocol, denoted by the unitary SPF and depicted in Fig. 11, by the following equation:
$$\mathrm{SPF}\,|0\rangle_n|\Theta\rangle_{N-1} = \|\vec\beta\|^{-1}\sum_{j=0}^{N-1}\beta_j|j\rangle_n|\Theta_j\rangle_{N-1}, \quad (31)$$
where $|\Theta_j\rangle_{N-1}$ is obtained from $|\Theta\rangle_{N-1}$ by replacing $|\theta_r\rangle_1$ with $|0\rangle_1$ for every $r$ with $f_{r|j} = 1$. The state given by Eq. (31) has the correct coefficients for each basis state $|j\rangle_n$, but has $|0\rangle_1$ states in place of $|\theta_r\rangle_1$ for the $n$ angles that were swapped into one of the first $n$ registers. In other words, there is garbage left over that is entangled with the data. In some applications, this garbage might be allowable. However, one can also uncompute the garbage and disentangle the two parts of the state, which we do by computing the "flag" bits $1 - f_{r|j}$ into ancilla registers, using each of them as a control to apply a controlled-$V_r$ operation, and then uncomputing the flag bits. The $N - 1$ flag bits can be computed into $N - 1$ ancillas using the unitary FLAG, depicted in Fig. 12 and defined by the equation
$$\mathrm{FLAG}\,|j\rangle_n|1\rangle^{\otimes(N-1)} = |j\rangle_n\bigotimes_{r=1}^{N-1}|1 - f_{r|j}\rangle_1.$$
The circuit for FLAG is very similar to the SPF circuit, and, in fact, FLAG can be run in parallel with the $S^\dagger$ portion of the SPF gate.
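The bookkeeping behind the indicator $f_{r|j}$ is purely classical and easy to enumerate. The following sketch lists, for a given basis state $|j\rangle$, which angles are injected and what the flag register holds after FLAG. It assumes the angles are indexed in heap order (root is $r = 1$, children of $r$ are $2r$ and $2r + 1$) and that the first data qubit is the most significant bit of $j$; both conventions and the function names are illustrative assumptions.

```python
def injected_angles(j, n):
    """Heap indices r of the n angles on the root-to-leaf path of leaf j."""
    path, r = [], 1
    for w in range(n):
        path.append(r)
        bit = (j >> (n - 1 - w)) & 1   # data qubit w+1 picks left (0) or right (1)
        r = 2 * r + bit
    return path


def flag_bits(j, n):
    """Values 1 - f_{r|j} for r = 1..N-1: 0 where angle r was injected."""
    on_path = set(injected_angles(j, n))
    return [0 if r in on_path else 1 for r in range(1, 2**n)]


if __name__ == "__main__":
    n = 3
    for j in range(2**n):
        print(j, injected_angles(j, n), flag_bits(j, n))
```

Every basis state flips exactly $n$ of the $N - 1$ flags to 0, which matches the count of angles injected along its computational path.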
FIG. 12. Circuit for the adjoint of the FLAG gate used to disentangle the garbage and data registers in the pre-rotated state-preparation protocol. We present FLAG$^\dagger$ rather than FLAG to draw attention to the similarity between FLAG and the first half of SPF from Fig. 11. Controlled on the data qubits in state $|j\rangle$, FLAG switches $n$ of the $N - 1$ flag qubits from $|1\rangle \to |0\rangle$ at locations corresponding to the positions of the $n$ angles that are used in the synthesis of amplitude $|j\rangle$.
The full pre-rotated garbage-free state-preparation protocol, including uncomputation of garbage with flags, is depicted in Fig. 13.
FIG. 13. Circuit for garbage-free state preparation with the pre-rotated method. An arbitrary $n$-qubit state $|\psi\rangle$ is prepared with the assistance of $2(N - 1)$ ancilla qubits. The circuit for the SPF gate is given in Fig. 11 and the FLAG gate is given in Fig. 12. The $X$ gate denotes a Pauli-$X$ on all $N - 1$ qubits. The controlled-$R_y$ gate denotes $N - 1$ completely parallel controlled $R_y(\theta_r)$ gates, with each flag qubit (third register) acting as a control for a rotation on a different angle qubit (second register). To prepare $|\psi\rangle$ to precision $\epsilon$, the single-qubit rotations must be synthesized with a gate sequence of $T$-depth $O(\log(1/\epsilon))$, and meanwhile the SPF and FLAG gates incur $O(\log(N))$ $T$-depth, for a total $T$-depth of $O(\log(N/\epsilon))$.
To perform controlled-state preparation, one must first load one of $N$ initial states $|\Theta^{(k)}\rangle_{N-1}$ into an ancilla register, depending on the setting of an $n$-qubit control register in the state $|k\rangle$. This is accomplished by the LOADF operation depicted in Fig. 5. In particular, we perform LOADF with $D = N - 1$, which is equivalent to $N - 1$ copies of LOADF to load each of the states $\{|\theta_r^{(k)}\rangle\}_{r=1}^{N-1}$. Each of these LOADF operations has its own flag qubit, which we initialize to $|1\rangle_1$, but all share the same $n$-qubit control and use their own $Q_F$-qubit ancilla space. Once $|\Theta^{(k)}\rangle_{N-1}$ is loaded, application of SPF yields the correct state with garbage. To disentangle the garbage and reset all ancillas to $|0\rangle$, we follow a three-step process: first, we apply FLAG, which flips flag bits from 1 to 0 in positions that were injected into the data; second, we run the reverse of LOADF, which sends $|\theta_r\rangle_1 \to |0\rangle_1$ in all the positions where the control bit is 1, which are precisely the positions where it is not already $|0\rangle_1$; third, we apply FLAG again, followed by an $X$ gate to return all flag bits to $|0\rangle_1$. This process is depicted in Fig. 14.

E. State preparation resource estimates
Here, we summarize the non-Clifford resources required for state preparation. As with the QRAM estimates, we utilize the phase-incorrect controlled-swap gates for the fixed-precision version and the phase-correct version for the pre-rotated case. The phase-correct version requires additional ancilla qubits not shown in these circuits. Details of the controlled-swap circuits for both cases are summarized in App. C. The state preparation resource counts are provided in Tab. III. For controlled-state preparation, one simply precedes the state preparation routine with a LOAD operation and follows it with a LOAD$^\dagger$ operation (LOADF in the case of the pre-rotated approach), as in Figs. 10 and 14.
FIG. 14. Controlled-state preparation circuit diagram for the pre-rotated approach to state preparation. For each of the $N$ possible settings of the bottom register, a different $n$-qubit state $|\phi_k\rangle$ is prepared in the top register, which is accomplished with three subroutines. First, an $X^{\otimes(N-1)}$ gate sets the $N - 1$ flag qubits (fourth register) to 1, and $N - 1$ parallel copies of LOADF (Fig. 5) load the state $|\Theta^{(k)}\rangle$ into the second register, conditioned on the last register being $|k\rangle$ (note that since all the flags are set to 1, the doubly-controlled rotations that appear in the circuit for LOADF can be replaced with singly-controlled rotations in this instance). Each of these copies uses the same $n$-qubit control, but its own ancilla space of $Q_F$ qubits, so the total ancilla count is $Q_F = (N - 1)(2N - 1)$. Second, the state is computed into the first register using the SPF operation (Fig. 11), which leaves garbage in the second register. Third, the garbage is uncomputed by setting the flags for the angles that were injected into the circuit to 0 using the FLAG gate (Fig. 12), running LOADF in reverse, and then returning the rest of the flags to $|0\rangle$ with FLAG$^\dagger$. The depth of LOADF is $O(\log(N/\epsilon))$ and the depth of SPF and FLAG is $O(\log(N))$, for a total depth of $O(\log(N/\epsilon))$, where $\epsilon$ is the error on the state $|\phi_k\rangle$. The instances of controlled-state preparation that appear in the block-encoding unitary $U_A$ of Fig. 1 can be accomplished with this circuit.
Our counts can be verified using the circuit diagrams in Figs. 8 and 13, along with the following additional observations, which reference gate decompositions from App. C.
• Each of the $n$ multiply-controlled-$R_y$ gates in Fig. 8 is accomplished by $t$ singly-controlled-$R_y$ rotations in series, by the fixed rotation angles $\pi, \pi 2^{-1}, \pi 2^{-2}, \ldots, \pi 2^{-t+1}$. These are implemented with the construction from Fig. 18, which involves two applications of a single-qubit rotation. These single-qubit rotations are synthesized with a Clifford+$T$ gate sequence with $T$-count denoted by the quantity $R_y$ in the table. With $nt$ singly-controlled rotations in total, each requiring two synthesized rotations, the total $T$-depth and $T$-count is thus $2ntR_y$.
• Each $S_p$ gate with $p = 2, 3, \ldots, n, \pm$ in Figs. 8, 11, and 12 is a parallel controlled-swap gate with a single mutual control and many target pairs. For fixed-precision, each $S_p$ is implemented with the construction from Fig. 22 along with the phase-incorrect decomposition in Fig. 21, yielding $T$-depth 4. The total number of controlled-swap gates that occur during $S_2, \ldots, S_n$ is $(2^n - n - 1)t + (2^n - 1)$, where the first term comes from shuffling the $t$-bit angle data and the second term comes from shuffling the sign bit data. For pre-rotated, the parallel controlled-swaps are implemented with the construction from Fig. 19 (costing $(N - 2)/2$ ancillas), where Toffolis are implemented via the depth-1 count-4 construction of Ref. [42] (costing another $(N - 2)/2$ ancillas). The number of controlled-swaps is $2^n - n - 1$.
• In Fig. 13, the FLAG gate can be performed in parallel with the $S^\dagger$ gate (last gate) of Fig. 11. Note that each requires $N - 2$ ancillas (see previous bullet). The depth of the controlled-swap portion of the circuit is $2(n - 1)$ for SPF and $n - 1$ for FLAG.
Finally, we remark here that one could also utilize the swap networks of the bucket-brigade QRAM to perform the $S_p$ gates in the state preparation routine, which could confer some amount of natural noise resilience. However, due to the interleaved $R_y$ rotations, the parallelism that was previously exploited to ensure log-depth scaling is now broken, so the bucket-brigade state preparation circuit can only achieve a minimum $T$-depth of $O(n^2)$. Furthermore, the constant factors are higher for qubit count and $T$-count. For these reasons, we do not consider a bucket-brigade style state preparation approach, and all of our resource estimates use the select-swap versions presented here.

A. Overview
We now have all the necessary ingredients to estimate the full resources required to block-encode a dense matrix of classical data using the circuit shown in Fig. 1. This is the portion labeled "Block-Encoding" in Fig. 2. A variety of choices can be made with respect to exactly how the matrix is block-encoded. These options are shown in the upper right side of Fig. 2. We outline the controlled version later in this section, and we provide methods to implement the other options, symmetric and $q$-Norm, in App. A.
For the standard Frobenius encoding, the block-encoding procedure reduces to two applications of controlled-state preparation, each requiring both a QRAM-like data-loading operation and a state preparation routine (see Figs. 10 and 14). We can make one optimization to reduce the resource count: note that the state $|\phi_k\rangle$ defined in Eq. (6) is independent of the control register $k$. Therefore, this state can be prepared with standard state preparation. We provide the resource counts for the fixed-precision case in Tab. IV and for the pre-rotated block-encoding in Tab. V, for the two versions of QRAM that we consider.

B. Fixed precision resources
For the fixed-precision case, we leave the resource counts in terms of the parameter λ ∈ {0, 1, . . .The fixed-precision bucket-brigade approach cannot achieve the same scaling with respect to T -count; in all cases the count is at least Ω(N 2 ).The minimal-depth case can achieve the same asymptotic scaling as select-swap, but we note that the constant factors for all resources are higher.Whether this approach can achieve an overall physical resource reduction will depend upon the quantum architecture, error correcting code, and error requirements, which are all beyond the scope of this paper.
For the pre-rotated case, we only consider the minimal-depth circuit; that is, we make no attempt to trade width for depth. Note that any manifestation of the pre-rotated idea would need to satisfy a $T$-count lower bound of $\Omega(N^2)$ due to the need to have controlled-$R_y$ rotations for all $N(N - 1)$ angles somewhere in the circuit. However, the benefits of the pre-rotated approach can be seen in both an improvement in asymptotic scaling and constant-factor improvements to both $T$-depth and qubit count. This technique allows us to achieve $T$-depth of $O(\log N + R_y) \sim O(\log(N/\epsilon))$, which we contrast with the fixed-precision approach that scales as $O(R_y t \log N) \sim O(\log N \log^2(1/\epsilon))$.
The number of qubits required for this approach is approximately four per classical matrix entry (not counting additional qubits needed for routing operations).One qubit is required for the data and one for the flag, and an additional two ancilla are needed to perform the parallel controlled-swap operations (see Fig. 19 in App.C).One could exploit the fact that the parallel controlled-swap gate can use dirty qubits for the ancilla to reduce this to just a single extra ancilla for the T -depth one Toffoli gate [44].However, the first round of controlled-swap gates in the select-swap QRAM circuit operates in parallel across all QRAM data registers, so there are no available qubits to do this.One could split the first set of controlled-swap gates into two rounds to utilize the other data qubits as dirty ancilla, but we choose the shortest possible depth approach and counted the cost of the additional ancilla qubit in our resource analysis.
Note that since all the flags are set to 1 at the beginning of the controlled-state preparation protocol, the Toffolis that appear within the gate $V$ of LOADF in Fig. 5 can be replaced with Clifford CNOTs, saving $T$-depth 2 and $T$-count $2N(N - 1)$.
TABLE V. Block-encoding resource requirements for the pre-rotated method (columns: Resource; Pre-rotated). To compare to Tab. IV take $\lambda = n$. Loading the classical data in pre-rotated form with flag qubits allows us to achieve asymptotic depth scaling of $O(\log N + R_y) \sim O(\log(N/\epsilon))$.

D. Controlled block-encodings
A controlled block-encoding is useful in certain applications, for example in solving linear systems of equations [4]. In particular, we wish to implement the unitary
$$CU_A = |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes U_A,$$
where $U_A$ is the standard form of the block-encoded matrix $A$ given in Eq. (1). One possible implementation is to add zeros to the QRAM and just select these values if the appropriate control bit is 0, but this approach is highly inefficient in qubit count for our assumed case of $N$ being an exact power of two, since the number of qubits in the QRAM must be doubled. Instead, we simply select zeros by adding a single binary tree of all zeros, which will be selected using a single register controlled-swap to replace the loaded register if the control bit is one, as shown in Fig. 15. This method requires one extra qubit for the control, $D$ extra qubits for the $|0\rangle$ state in QRAM, and one extra controlled-swap between $D$-qubit registers.

A. Overview
The unitary implemented by our circuit is ŨA , which is different from the exact block-encoding unitary U A in two respects: (a) the rounding error from the representation of the angles θ using t-qubit registers and (b) the error from the gate synthesis of the R y rotations.Both (a) and (b) affect the fixed-precision block-encoding; only (b) affects the pre-rotated block-encoding.
FIG. 15. A possible modification to the LOAD operation that allows a controlled block-encoding $CU_A = |0\rangle\langle 0| \otimes I + |1\rangle\langle 1| \otimes U_A$ to be efficiently performed. The LOAD operation can be either LOAD$_{ss}$ or LOAD$_{bb}$ from Sec. II, and brings $D$ bits of classical data into the third register, where $D = (N - 1)t + N$. Controlled on the first register, this data is swapped into the second register, which acts as the input to state preparation, as in Fig. 10. A similar idea could be implemented for the LOADF operation with respect to Fig. 14. If the control is $|0\rangle$, the input to state preparation is the trivial state. The cost of this construction is $D$ ancilla qubits and $D$ controlled-swap gates with a common control, which can be performed in $O(1)$ $T$-depth and $O(D)$ $T$-count (in cases where the LOAD operation does not produce garbage, the LOAD ancillas could be reused, in which case no additional ancillas are needed for the controlled block-encoding). Adding additional controls onto the controlled-swap gate yields a multiply-controlled block-encoding with an arbitrary number of controls, without any additional ancillas.

B. Rounding error
For the fixed-precision case, rounding errors cause the ideal angles $\theta \in [0, \pi]$ to be approximated by the angle $\tilde{\theta}$ that is the nearest exact multiple of $2\pi/2^t$, i.e., $|\theta - \tilde{\theta}| \leq \pi 2^{-t}$. We can, therefore, represent $\tilde{\theta}/2\pi$ exactly in binary with $t$ bits $(\theta_1, \ldots, \theta_t)$, where $\theta_i \in \{0, 1\}$ and

$$\tilde{\theta} = 2\pi \sum_{i=1}^{t} \theta_i 2^{-i}.$$

The rotation $R_y(\theta)$ sends $|0\rangle \to \cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$; we need only cover the space $\theta \in [0, \pi]$, since we may assume the matrix entries are all positive (signs are applied later), and hence $\sin(\theta/2) \in [0, 1]$ is sufficient. Note that $R_y(\eta)$ has eigenvalues $e^{\pm i\eta/2}$ and hence

$$\|R_y(\eta) - I\| = |e^{i\eta/2} - 1| \leq |\eta|/2,$$

so that $\|R_y(\theta) - R_y(\tilde{\theta})\| = \|R_y(\theta - \tilde{\theta}) - I\| \leq \pi 2^{-t-1}$.

We now argue that the above implies the controlled-rotations are close to their rounded versions. The controlled rotations in the circuit for $U_A$ perform a different rotation angle, depending on the setting of several control qubits. Suppose that, for some integer $p$, we have a collection of $2^p$ unitaries $V_i$ and approximations $\tilde{V}_i$ such that $\|V_i - \tilde{V}_i\| \leq \delta$ for all $i \in [2^p]$. Let $CV$ and $C\tilde{V}$ be the operations that, controlled on a $p$-qubit register being in the state $i$, perform the operations $V_i$ and $\tilde{V}_i$, respectively. It is then easy to verify that $\|CV - C\tilde{V}\| \leq \delta$, as $CV$ and $C\tilde{V}$ are each block-diagonal matrices with the same block structure, so the norm of their difference is the largest norm of any block. Hence, as long as all angles are correct up to error $\pi 2^{-t}$, the controlled-$R_y(\theta)$ operation in the circuit is $(\pi 2^{-t-1})$-close to controlled-$R_y(\tilde{\theta})$. Let $\tilde{U}_A$ denote the unitary for which all controlled-$R_y(\theta)$ gates are replaced by an exact implementation of controlled-$R_y(\tilde{\theta})$. As the circuit has $2\log(N)$ controlled-rotations ($\log(N)$ each for $U_R^\dagger$ and $U_L$), by the triangle inequality, we have that

$$\|U_A - \tilde{U}_A\| \leq 2\log(N)\,\pi 2^{-t-1} = \pi 2^{-t}\log(N). \tag{37}$$

C. Gate synthesis error
In the fixed-precision approach, the unitary $\tilde{U}_A$ exactly implements controlled-$R_y(\tilde{\theta})$ by loading the bits $(\theta_1, \ldots, \theta_t)$ of $\tilde{\theta}$ into $t$ ancilla registers, and then exactly performing controlled-$R_y(\pi 2^{-j+1})$ operations (with one control qubit) for $j = 1, \ldots, t$ with these ancilla registers acting as the controls. As seen in Fig. 18, a controlled-$R_y(\pi 2^{-j+1})$ is accomplished by decomposition into two CNOTs and two $R_y(\pi 2^{-j})$ operations. The actual circuit $\hat{U}_A$ differs from $\tilde{U}_A$ only in that these $R_y(\pi 2^{-j})$ operations are performed approximately using a decomposition of the $R_y(\pi 2^{-j})$ gate into Clifford+$T$. Denote the unitary enacted by this decomposition by $\hat{R}_y(\pi 2^{-j})$. The decomposition error [43] is then bounded as

$$\|R_y(\pi 2^{-j}) - \hat{R}_y(\pi 2^{-j})\| \leq \delta_{\rm decomp},$$

as long as we choose a gate sequence with $T$-count (note $T$-depth = $T$-count for a single-qubit operation) at least equal to the parameter $R_y$, with

$$R_y = 3\log(1/\delta_{\rm decomp}) + O(\log(\log(1/\delta_{\rm decomp}))).$$

Again using the triangle inequality, we find that replacing the $4t\log(N)$ appearances of an $R_y(\pi 2^{-j})$ gate with the approximate $\hat{R}_y(\pi 2^{-j})$, we incur error bounded as

$$\|\tilde{U}_A - \hat{U}_A\| \leq 4t\log(N)\,\delta_{\rm decomp}. \tag{40}$$

In the pre-rotated approach, there are $2N - 2$ controlled-$R_y$ rotations (actually, they are doubly controlled $R_y$ rotations), but nearly all of them (the ones that do not get injected) are exactly undone. Moreover, by choosing a gate decomposition of $R_y(\theta)$ of the form $H_\theta^* H_\theta$ for some gate sequence $H_\theta$ (where $H_\theta^*$ denotes complex conjugate), the construction of Fig. 18 with $H$ and $H^\dagger$ in place of $R_y(\theta/2)$ and $R_y(-\theta/2)$ guarantees that the action on the target is exactly trivial when the control is set to $|0\rangle$. Ultimately, the error on the state prepared by the protocol is exactly the same as it would be in the fixed-precision method using the decomposition $H_\theta^* H_\theta$ for all the controlled-rotations (rather than a product of $t$ gate decompositions). As there are $2\log(N)$ rotations, and thus $4\log(N)$ applications of some sequence $H_\theta$, the overall error is $4\log(N)\,\delta_{\rm decomp}$. Note that there is no rounding error for this approach.
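As an illustrative check of the rounding argument (a minimal sketch, not part of the original resource analysis), the following Python snippet rounds an angle to $t$ bits and evaluates the exact spectral-norm distance between the ideal and rounded rotations. The function names `round_angle_bits` and `ry_norm_distance` are hypothetical, and the bit ordering (most significant bit first, with bit $j$ weighted by $\pi 2^{-j+1}$) is an assumption chosen to match the controlled-$R_y(\pi 2^{-j+1})$ sequence described above.

```python
import math


def round_angle_bits(theta, t):
    """Round theta in [0, pi] to the nearest multiple of 2*pi / 2**t.

    Returns (theta_rounded, bits), where bits = [b_1, ..., b_t] and
    theta_rounded = sum_j b_j * pi * 2**(-j + 1).
    """
    k = round(theta * 2**t / (2 * math.pi))  # integer number of steps
    k %= 2**t  # defensive wrap; not triggered for theta in [0, pi], t >= 1
    bits = [(k >> (t - j)) & 1 for j in range(1, t + 1)]
    theta_rounded = sum(
        b * math.pi * 2.0 ** (-j + 1) for j, b in enumerate(bits, start=1)
    )
    return theta_rounded, bits


def ry_norm_distance(theta_a, theta_b):
    """Exact spectral norm of R_y(theta_a) - R_y(theta_b).

    R_y(a) - R_y(b) = R_y(b) (R_y(a - b) - I), and R_y(eta) has
    eigenvalues exp(+-i eta / 2), so the norm equals
    |exp(i eta / 2) - 1| = 2 |sin(eta / 4)| with eta = a - b.
    """
    return 2 * abs(math.sin((theta_a - theta_b) / 4))


if __name__ == "__main__":
    t = 8
    theta = 1.0
    theta_rounded, bits = round_angle_bits(theta, t)
    print("bits:", bits)
    print("angle error:", abs(theta - theta_rounded), "<=", math.pi * 2.0**-t)
    print(
        "rotation error:",
        ry_norm_distance(theta, theta_rounded),
        "<=",
        math.pi * 2.0 ** (-t - 1),
    )
```

Because each controlled rotation is block diagonal in the control register, the same per-rotation bound carries over to the controlled operation, and summing it over the $2\log(N)$ rotations gives the accumulated rounding error.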

D. Overall block-encoding error
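For the fixed-precision approach, from Eqs. (37) and (40) and the triangle inequality, we have that

$$\|U_A - \hat{U}_A\| \leq \pi 2^{-t}\log(N) + 4t\log(N)\,\delta_{\rm decomp}.$$

Further, note that by the definition of block-encoding and sub-multiplicativity of the spectral norm, the error in the block-encoded matrix is at most $\alpha\|U_A - \hat{U}_A\|$. If we wish for this error to be smaller than $\epsilon$, it suffices to choose $t = \log(\alpha\pi/\epsilon) + \log(\log(N)) + 1$ and $\delta_{\rm decomp} = \epsilon/(8t\alpha\log(N))$, so that the rounding and synthesis terms each contribute at most $\epsilon/2$. This is achieved with

$$R_y = 3\log(\alpha/\epsilon) + 3\log(\log(N)) + 9 + O(\log(\log(\alpha/\epsilon))) + O(\log(\log(\log(N)))) \tag{46}$$

$$t = \log(\alpha/\epsilon) + \log(\pi) + \log(\log(N)) + 1. \tag{47}$$

These equations should be substituted in the fixed-precision resource counts whenever one encounters them, if one wishes to have resources in terms of the final block-encoding error $\epsilon$.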
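For the pre-rotated approach, there is no rounding error, and the same analysis gives a block-encoding error of at most $4\alpha\log(N)\,\delta_{\rm decomp}$. It therefore suffices to choose $\delta_{\rm decomp} = \epsilon/(4\alpha\log(N))$, which is achieved with

$$R_y = 3\log(\alpha/\epsilon) + 3\log(\log(N)) + 6 + O(\log(\log(\alpha/\epsilon))) + O(\log(\log(\log(N)))).$$

Our numerical estimates use these formulas for estimating exact resources, but we ignore the $O(\cdot)$ terms, which are all doubly logarithmic or smaller.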
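To make the substitution concrete, the sketch below evaluates Eqs. (46) and (47) for given $N$, $\alpha$, and $\epsilon$, dropping the $O(\cdot)$ terms as in the numerical estimates. The function name `fixed_precision_parameters` is hypothetical, logarithms are taken base 2 (an assumption consistent with the constant $9 = 3\log_2 8$), and rounding $t$ up to an integer is a choice made here rather than something stated in the text.

```python
import math


def fixed_precision_parameters(N, alpha, eps):
    """Evaluate Eqs. (46) and (47), ignoring the O(.) terms.

    Returns (t, r_y, rounding_error), where t is rounded up to an integer
    (an assumption of this sketch) and rounding_error is the resulting
    rounding contribution alpha * pi * 2**(-t) * log2(N).
    """
    if N < 2 or alpha <= 0 or eps <= 0:
        raise ValueError("require N >= 2, alpha > 0, eps > 0")
    log_n = math.log2(N)
    # Eq. (47): number of bits of precision for the stored angles.
    t_real = math.log2(alpha * math.pi / eps) + math.log2(log_n) + 1
    t = max(1, math.ceil(t_real))
    # Eq. (46): T-count per synthesized rotation, leading terms only.
    r_y = 3 * math.log2(alpha / eps) + 3 * math.log2(log_n) + 9
    # The rounding term should use at most half of the error budget.
    rounding_error = alpha * math.pi * 2.0 ** (-t) * log_n
    return t, r_y, rounding_error


if __name__ == "__main__":
    t, r_y, rounding_error = fixed_precision_parameters(N=1024, alpha=1.0, eps=1e-3)
    print(f"t = {t}, R_y ~ {r_y:.2f}, rounding error = {rounding_error:.2e}")
```

The returned rounding error can be compared against $\epsilon/2$ as a sanity check that the chosen $t$ respects the error budget.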

VI. CONCLUSION
It is well known that loading classical data into a quantum computer is a challenging problem. However, for many problems of practical interest, one assumes access to such classical data without necessarily considering the full cost to encode the data in the quantum computer. Our results provide concrete resource counts and system sizes required to perform this task for two different QRAM models.
Combining QRAM with a state preparation routine, we provide detailed circuit descriptions and resource estimates for a commonly used algorithmic primitive: a block-encoding. Our modular implementation also allows one the freedom to consider resource counts for different QRAM models that allow for differing optimizations. We provide two such choices: one that minimizes $T$-count or $T$-depth and the other that minimizes the impact of noise. Our results address the practical feasibility of quantum algorithms that require large amounts of classical data, and we note that in fault-tolerant implementations of these algorithms, QRAM implementations can be seen as a bottleneck.
The details of our circuits also elucidate the ingredients that would be necessary in a circuit architecture optimized for block-encoding classical data. The QRAM circuits we describe use a large number of controlled-swap gates that we assume can be applied in parallel. Realizing such parallelism requires applying $T$ gates across many qubits at once, and to achieve polylog($N$) depth (which is necessary in any scenario where an exponential speedup is sought), our constructions require $O(N^2)$ parallel $T$ gates. Even if a processor with so many qubits were available, it could be challenging to implement these parallel operations at large $N$ due to the overhead required for magic-state distillation and decoding latencies, which could significantly decrease the rate at which layers can be applied. Furthermore, we assume that fanout-CNOT gates with arbitrarily long range can be applied in one time step. In surface-code lattice-surgery architectures, this parallelism is possible, but it requires additional communication qubits for performing the required lattice-surgery operations, thereby increasing the qubit resources beyond what we have considered here. Therefore, although we do not prove any rigorous lower bounds on block-encoding, our resource estimates might be viewed as maximally optimistic for block-encoding of dense classical data, and they would increase as architectural constraints are added.
FIG. 19. Phase-correct parallel controlled-swap between two qubit registers in arbitrary states with a single control, used for pre-rotated resource estimates. Two clean ancillas are required (see Ref. [19] for how dirty ancillas can be used at the expense of two additional Toffoli gates). With an additional clean ancilla (not shown), the Toffoli gates can be constructed with a $T$-depth of 1 and a $T$-count of 4 [42,44]. The output state is shown assuming the input control is $|1\rangle$. If the control is $|0\rangle$, the states are not swapped. To perform parallel Toffolis (instead of controlled-swaps), simply omit the CNOT gates in the second and fourth layers.
FIG. 22. Controlled-swap between multi-qubit registers decomposed into a set of $T$ gates, a fanout CNOT, and $G = S^\dagger H T H S$ gates [19,46]. The $G$ (or equivalently $T$) gates and CNOT gates can all be implemented in parallel. The decomposition on the right adds a phase to certain basis states, so it can only be used in situations where that phase is irrelevant [46]. We only use this version for fixed-precision resource estimates. The entire circuit has a $T$-depth of 4. The $T$-count is $4t$, where $t$ is the size of the registers being swapped.

FIG. 5. Circuit that implements LOADF, which loads a single-qubit state of the form $\cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$, where the angle $\theta$ depends on a control register. The controlled-swap gates on registers of size $N$ have the action of swapping the register in position $j$ to position 1; the adjoint of swap moves position 1 to position $j$. The gate $V$ applies the rotation $R_y(\theta^{(k)})$ to the $k$-th angle register controlled on the $k$-th ancilla and flag qubits set to $|1\rangle$, for every $k$. The controlled-swap gates have depth $O(n)$, and the $V$ gate has depth $O(\log(1/\delta))$ if we wish to perform each $R_y(\theta^{(k)})$ unitary up to precision $\delta$. A complete example of LOADF for $n = 2$ is shown in Fig. 6.

, n} that allows one to trade $T$-depth for circuit width and overall $T$-count. For the select-swap QRAM, the minimal $T$-count is achieved by choosing $\lambda = 0$, which yields a $T$-count and $T$-depth of $O(N)$ on $O(N)$ total qubits. The minimal $T$-depth of $O(\log(N))$ is achieved when $\lambda = n$ at the expense of requiring a $T$-count of $O(N^2)$.
FIG. 17. Bucket Brigade QRAM with an address space of four bits, shown here since it can help visualize the general case. Note that when single qubits are swapped into a $D$-qubit register (e.g., the first gate in the circuit), it is assumed that they occupy the first position. As discussed in Fig. 10 of [18], additional parallelization is possible (but hard to depict in the drawing), which reduces the depth of a QRAM with $n$ address qubits from $O(n^2)$ to $O(n)$. An example of this effect is seen in the figure by noting that the region in the dashed box can be shifted left by three layers. The total $T$-count of the $n$-qubit version of this circuit is $16(D+1)(2^n - 1)$ over a $T$-depth of $48(n-1)$, assuming that controlled-swap gates are implemented with the phase-incorrect construction of Fig. 21. The total number of qubits is $n + D + (2D+1)(2^n - 1) - 1$ (where we save a qubit by omitting the first swap gate and identifying the first address qubit directly with the L0 router setting).
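For convenience, the closed-form counts quoted in the caption of Fig. 17 can be evaluated with a short helper. This is only a sketch of the arithmetic; the function name `bucket_brigade_resources` is hypothetical, and the formulas assume the phase-incorrect controlled-swap construction, as stated in the caption.

```python
def bucket_brigade_resources(n, D):
    """Evaluate the bucket-brigade counts from the Fig. 17 caption.

    n: number of address qubits, D: number of data qubits per register.
    Returns (t_count, t_depth, qubits).
    """
    if n < 1 or D < 1:
        raise ValueError("require n >= 1 and D >= 1")
    routers = 2**n - 1  # number of nodes in the binary routing tree
    t_count = 16 * (D + 1) * routers
    t_depth = 48 * (n - 1)
    qubits = n + D + (2 * D + 1) * routers - 1
    return t_count, t_depth, qubits


if __name__ == "__main__":
    # Four address bits, as in the figure, with single-bit data.
    print(bucket_brigade_resources(n=4, D=1))
```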

FIG. 18. Decomposition of controlled-$R_y$ rotation and controlled-controlled-$R_y$ rotation used for resource counts. Since $R_y(\theta)$ is equivalent to $R_y(\theta + 2\pi)$ up to a global phase, controlled-$R_y(\theta)$ is equivalent to controlled-$R_y(\theta + 2\pi)$ up to a controlled-phase, i.e., a $Z$ gate. Thus, for angles $\theta \in [\pi, 2\pi]$, a $Z$ gate should additionally be applied to the first qubit in the top diagram, and a CZ gate should be applied between the top two qubits of the bottom diagram.
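The two-CNOT structure of the controlled-$R_y$ decomposition can be checked numerically. The sketch below assumes one standard gate ordering ($R_y(\theta/2)$, CNOT, $R_y(-\theta/2)$, CNOT, acting on the target); the ordering drawn in Fig. 18 may differ, and the helper names are hypothetical. It computes the action on the target qubit separately for control $|0\rangle$ and control $|1\rangle$.

```python
import math

# Pauli X, the action of a CNOT on the target when the control is |1>.
X = [[0.0, 1.0], [1.0, 0.0]]


def ry(theta):
    """Real 2x2 matrix of R_y(theta): |0> -> cos(theta/2)|0> + sin(theta/2)|1>."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
        for i in range(2)
    ]


def controlled_ry_blocks(theta):
    """Target-qubit action of the assumed circuit for control 0 and control 1.

    Gate order in time: R_y(theta/2), CNOT, R_y(-theta/2), CNOT.
    """
    half, neg_half = ry(theta / 2), ry(-theta / 2)
    # Control |0>: both CNOTs act trivially, so the two rotations cancel.
    block0 = matmul(neg_half, half)
    # Control |1>: X R_y(-theta/2) X = R_y(theta/2), giving R_y(theta) overall.
    block1 = matmul(X, matmul(neg_half, matmul(X, half)))
    return block0, block1


if __name__ == "__main__":
    block0, block1 = controlled_ry_blocks(0.7)
    print("control 0:", block0)
    print("control 1:", block1)
    print("R_y(0.7): ", ry(0.7))
```

The control-$|0\rangle$ branch is the product of a gate and its inverse, which is why replacing the two half-angle rotations by an approximate sequence and its adjoint keeps that branch exactly trivial.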

FIG. 20. Toffoli gate decomposed into a $T$-depth 4 circuit used for fixed-precision resource estimates. The gate $G = S^\dagger H T H S$. The decomposition is exact up to a phase that takes $|100\rangle \to -|100\rangle$.

TABLE II. Resource counts for LOAD and LOADF operations, which coherently load one out of $2^n$ possible $D$-qubit data registers, controlled by an $n$-qubit address register. LOAD loads $D$ bits of classical data, and is accomplished using one of two models of QRAM implementation, select-swap (SS) and bucket-brigade (BB), each of which admits a depth/width tradeoff governed by integer parameter $0 \leq \lambda \leq n$. LOADF performs a more general data loading operation where each of the $D$ data qubits is left in the state $\cos(\theta/2)|0\rangle + \sin(\theta/2)|1\rangle$ for some $\theta \in [0, 2\pi]$. The parameter $R_y$ is the $T$-count of the single-qubit Clifford+$T$ gate sequence used to synthesize each single-qubit state, which scales as $O(\log(1/\delta))$ with the desired precision $\delta$ (see Eq. (

TABLE III. Resource counts for two approaches to state preparation as depicted in Fig. 8 and Fig. 13. The parameter $R_y$ is the number of $T$ gates needed to synthesize a single-qubit rotation about the $Y$ axis by an arbitrary angle, and $t$ is the number of bits of precision used to store the classical data, where both $t$ and $R_y$ scale as $O(\log(1/\epsilon))$, where $\epsilon$ is the error on the state prepared by the protocol.

TABLE IV. Block-encoding resource requirements for fixed-precision implementation. Taking $\lambda = n$ yields the minimal-depth circuit at the cost of needing $O(N^2)$ $T$-count for both QRAM implementations. The select-swap QRAM can achieve a $T$-count of $O(N)$ by taking $\lambda = 0$.