Sampling Overhead Analysis of Quantum Error Mitigation: Uncoded vs. Coded Systems

Quantum error mitigation (QEM) is a promising technique for protecting hybrid quantum-classical computation from decoherence, but it suffers from a sampling overhead that erodes the computational speed. In this treatise, we provide a comprehensive analysis of the sampling overhead imposed by QEM. In particular, we show that Pauli errors incur the lowest sampling overhead among a large class of realistic quantum channels having the same average fidelity. Furthermore, we show that depolarizing errors incur the lowest sampling overhead among all kinds of Pauli errors. Additionally, we conceive a scheme amalgamating QEM with quantum channel coding, and analyse its sampling overhead reduction compared to pure QEM. Notably, we observe that there exists a critical number of gates contained in quantum circuits, beyond which their amalgamation is preferable to pure QEM.

• The submatrix formed by the $i_1$-th to $i_2$-th rows and the $j_1$-th to $j_2$-th columns of A is denoted as $[A]_{i_1:i_2,\,j_1:j_2}$. The notation $[A]_{:,i}$ denotes the i-th column of A, and $[A]_{i,:}$ denotes the i-th row.
• The notation diag(·) denotes a diagonal matrix obtained by placing its argument on the main diagonal, and mdiag(A) denotes the matrix obtained by setting all entries in matrix A to zero apart from the main diagonal.
• The notation vec(A) denotes the vector obtained by vectorizing matrix A, and $\mathrm{vec}^{-1}(\cdot)$ denotes the inverse operation.
• The trace of matrix A is denoted as Tr{A}, and the conjugate transpose of A is denoted as $A^\dagger$. Similarly, the adjoint of an operator X is also denoted as $X^\dagger$.
• The notation A ⊗ B represents the Kronecker product between matrices A and B. The tensor product of operators A and B is also denoted as A ⊗ B. Furthermore, the Cartesian product between sets A and B is denoted as A ⊗ B as well.
• The sign function sgn(x) is defined as sgn(x) = 1 for x > 0, sgn(x) = 0 for x = 0, and sgn(x) = −1 for x < 0.

I. INTRODUCTION
Recent years have witnessed an astonishing development in the area of quantum computing. State-of-the-art quantum computers are typically equipped with between fifty and a few hundred qubits, and have been shown to be difficult to simulate on classical supercomputers [2]. This marks the beginning of the era of noisy intermediate-scale quantum computing [3]. The word "noisy" indicates that noisy intermediate-scale quantum computers suffer from the notorious quantum decoherence effects, which impose a perturbation on each and every quantum operation carried out by a "quantum gate".
In principle, noisy quantum gates do not necessarily prevent quantum computation from being sufficiently accurate. A classical result, namely the threshold theorem [4], states that quantum computation may be carried out in the presence of decoherence with the help of quantum error correction codes (QECCs) [5]-[8], given that the error rate of each quantum gate is below a certain threshold. Generally speaking, QECCs protect a logical quantum bit (qubit) by mapping it to a larger set of physical qubits. The redundancy of the physical qubits ensures that errors perturbing a small fraction of the qubits can be detected and corrected with the help of some ancillary qubits (ancillas). Moreover, the error-correction capability can be further enhanced by concatenating several QECCs, albeit naturally at the expense of a higher qubit overhead [9]-[11].
However, compared to the idealized quantum computing models considered in the threshold theorem, practical noisy intermediate-scale quantum computers may not be capable of supporting fully fault-tolerant operations, due to their limited number of qubits. Consequently, they may not be able to execute algorithms that require a relatively long processing time, such as Shor's factorization algorithm [12] and Grover's quantum search algorithm [13]. Fortunately, there is evidence that algorithms tailored for noisy intermediate-scale quantum computers may yield superior performance compared to classical computers [3], [14]. Most of these algorithms belong to the category of hybrid quantum-classical algorithms, which exploit the power of classical computation to compensate for the short coherence time of quantum processors. As portrayed in Fig. 1, a typical hybrid quantum-classical algorithm is performed in an iterative fashion. The quantum circuit, which will be referred to as the function-evaluation circuit in this treatise, is designed to evaluate an objective function, given a set of input parameters [15]. The value of the objective function is then utilized in a classical optimizer, which computes an adjusted set of parameters for the next iteration. In general, the design of the function-evaluation circuit determines the application of the algorithm. Popular designs include the alternating operator circuit used in the quantum approximate optimization algorithm [14], the "unitary coupled-cluster ansatz" circuit applied in the computation of molecular energy based on the variational eigensolver [16]-[18], as well as other heuristic designs aiming for quantum machine learning [19], [20]. Remarkably, the quantum approximate optimization algorithm has been applied to communication-related problems as well, such as channel decoding [21].
Although the quantum circuits in hybrid quantum-classical algorithms have short depth, their decoherence may still inflict non-negligible computational errors [22], [23]. This necessitates the design of low-qubit-overhead techniques for protecting quantum gates. Recently, quantum error mitigation (QEM) has been proposed, which may correct the computational result without using any ancilla [24]-[26]. Without loss of generality, we may decompose a realistic imperfect quantum gate into a perfect gate followed by a quantum channel. QEM mitigates the deleterious effect of the channel by applying an "inverse channel" right after the imperfect gate, which is implemented using a probabilistic mixture of gates [24].
In contrast to the qubit overhead of QECCs, QEM introduces another type of computational overhead, namely the sampling overhead [24]. This overhead originates from the fact that the "inverse channel" is typically not completely positive trace-preserving (CPTP) (unless the original channel is of unitary nature, and hence it does not impose decoherence) [24], [25]. Consequently, QEM leads to an increased variance in the final computational result, hence additional measurements of the output quantum state are required for achieving a satisfactory accuracy. In effect, increasing the number of measurements slows down the computation. As the depth of the quantum circuit grows, the sampling overhead may accumulate dramatically. Ultimately, the benefit of quantum speedup will be neutralized for computation tasks that require an extremely long coherence time.
In general, the sampling overhead required depends on the channel characteristics. Naturally, a fundamental question concerning the practicality of QEM is: "Can we predict and control the sampling overhead given a limited number of channel parameters?" In this treatise, we investigate this deep-rooted research question from both theoretical and practical perspectives. We first introduce the notion of the so-called sampling overhead factor (SOF) for characterizing the sampling overhead incurred by a quantum channel, and then provide a comprehensive analysis of the SOF of general CPTP channels. Finally, we discuss potential techniques of reducing the SOF of quantum channels.
Our main contributions are summarized as follows.
• We present the general design philosophy of QEM from a communication theoretical perspective, emphasizing its application in hybrid quantum-classical computation, and introduce the notion of SOF to quantify its sampling overhead. Specifically, we highlight that by invoking a "quantum channel precoder" in the quantum circuit, the SOF required by QEM may be reduced.
• We formulate a coherent-triangular error decomposition of memoryless quantum channels, which are tensor products of single-qubit channels, with the aid of their Pauli transfer matrix [27] representation. Specifically, the coherent component of a quantum channel has a zero SOF, meaning that it can be mitigated without any sampling overhead when compensated using QEM in the ideal case, while the triangular component always has a non-zero SOF.
The structure of the treatise is illustrated in Fig. 2, and the rest of this treatise is organized as follows. In Section II, we present preliminary concepts that will be used extensively thereafter, such as quantum states, channels and hybrid quantum-classical computation. Then, in Section III we present the general design and formulation of QEM. Based on this formulation, we analyse the SOF of uncoded quantum gates in Section IV. In Section V, we conceive and analyse the amalgam of QEM and QECCs as well as QEDCs, which will be referred to as the QECC-QEM and QEDC-QEM schemes, respectively. The analytical results are then demonstrated via numerical simulations in Section VI. Finally, we conclude in Section VII.

A. Quantum states
The basic carrier of quantum information is a qubit, namely a two-level quantum system. Ideally, the state of a qubit can be characterized by a vector as
$$|\psi\rangle = \alpha|0\rangle + \beta|1\rangle,$$
satisfying the normalization property of $|\alpha|^2 + |\beta|^2 = 1$. Under the conventional computational basis, the basis vectors $|0\rangle$ and $|1\rangle$ can be expressed as
$$|0\rangle = [1,\ 0]^{T}, \quad |1\rangle = [0,\ 1]^{T}.$$
In practice, the interaction between the qubits and the environment will cause decoherence, namely turning a deterministic quantum state into a probabilistic mixture of states. Specifically, the deterministic states are termed as pure states, while the probabilistic mixtures are termed as mixed states. A mixed state of a qubit can be fully characterized by a 2 × 2 matrix termed as the density matrix ρ given by
$$\rho = \sum_i p_i |\psi_i\rangle\langle\psi_i|,$$
satisfying $p_i \ge 0$ for all i and $\sum_i p_i = 1$. Hence the density matrix is positive semi-definite and has unit trace. The pure states $|\psi_i\rangle$ are the components of the probabilistic mixture. Additionally, under the computational basis, it can be expressed as the linear combination of the following matrices:
$$S_I = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix},\quad S_X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix},\quad S_Y = \begin{bmatrix} 0 & -i \\ i & 0 \end{bmatrix},\quad S_Z = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.$$
In general, a mixed state of an n-qubit system can be characterized by a $2^n \times 2^n$ density matrix. Similar to the single-qubit case, the n-fold tensor products of $S_I$, $S_X$, $S_Y$ and $S_Z$ form a basis for the space of $2^n \times 2^n$ density matrices, termed as the n-qubit Pauli group. To facilitate further discussion, we denote $S_i^{(n)}$ as the i-th operator in the n-qubit Pauli group. The superscript (n) is omitted when it is unambiguous from the context.
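The closeness of two quantum states $\rho_1$ and $\rho_2$ is typically quantified using fidelity, defined as
$$F(\rho_1, \rho_2) = \left(\mathrm{Tr}\left\{\sqrt{\sqrt{\rho_1}\,\rho_2\sqrt{\rho_1}}\right\}\right)^2,$$
which can be simplified as $F(\rho_1, \rho_2) = \mathrm{Tr}\{\rho_1^\dagger \rho_2\}$ if either $\rho_1$ or $\rho_2$ represents a pure state.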

B. Quantum channels
Formally speaking, quantum channels are typically modelled by linear operators acting on quantum states. Naturally, the output of a quantum channel has to be a legitimate quantum state, which is positive semi-definite and has unit trace. Therefore, quantum channels are required to be completely positive, trace-preserving (CPTP) operators [28, Sec. 8.2.4]. Any CPTP operator $\mathcal{C}$ admits an operator-sum representation formulated as follows:
$$\mathcal{C}(\rho) = \sum_i K_i \rho K_i^\dagger, \tag{4}$$
where the operation elements $K_i$ satisfy the completeness condition given by
$$\sum_i K_i^\dagger K_i = I.$$
Alternatively, quantum channels can also be expressed in a matrix form. To elaborate, if we represent a quantum state having a density matrix of ρ as a vector x, a quantum channel $\mathcal{C}$ acting on ρ can be written as a matrix C acting on x, where a basis transition matrix T determines the specific matrix form of the quantum channel. In general, the specific matrix form of a quantum channel depends on the set of bases we choose. In this treatise, we use the Pauli transfer matrix representation [27] of quantum channels, given by
$$[C]_{i,j} = \frac{1}{2^n}\mathrm{Tr}\{S_i\,\mathcal{C}(S_j)\}, \tag{7}$$
for which the basis transition matrix T maps the vectorized density matrix to its coordinates in the Pauli basis. In this sense, the entries of the vector representation x of a density matrix ρ are proportional to $\mathrm{Tr}\{S_i \rho\}$.
A key quality indicator of a quantum channel is its average fidelity. As the terminology "average fidelity" suggests, it is the fidelity between the input and the output states, integrated over the space of all legitimate input states. Formally, the average fidelity of a quantum channel $\mathcal{C}$ is defined as [29]
$$\bar{F}(\mathcal{C}) = \int F\big(|\psi\rangle\langle\psi|,\ \mathcal{C}(|\psi\rangle\langle\psi|)\big)\, d|\psi\rangle.$$
Using the Pauli transfer matrix representation, the average fidelity of $\mathcal{C}$ can be written in closed form as
$$\bar{F}(\mathcal{C}) = \frac{\mathrm{Tr}\{C\} + 2^n}{2^n(2^n + 1)},$$
where n is the number of qubits that $\mathcal{C}$ acts upon. In general, $\bar{F}(\mathcal{C})$ satisfies $0 \le \bar{F}(\mathcal{C}) \le 1$, and $1 - \bar{F}(\mathcal{C})$ is often referred to as the "average infidelity" of $\mathcal{C}$ [29]. For Pauli channels, another important quality metric is the gate error probability (GEP), namely the probability that the output state does not coincide with the input state. For example, for the following channel
$$\mathcal{C}(\rho) = (1-p)\rho + \frac{p}{3}\left(S_X \rho S_X + S_Y \rho S_Y + S_Z \rho S_Z\right), \tag{11}$$
the GEP is p. Many important results on quantum coding, including the threshold theorem, are based on the GEP.
A somewhat perplexing issue is that the GEP is inconsistent with the average fidelity. More precisely, for a Pauli channel $\mathcal{C}$, we have $\bar{F}(\mathcal{C}) \neq 1 - \mathrm{GEP}$. To avoid the difficulty of using two different metrics for Pauli and non-Pauli channels, in this treatise we introduce a generalization of the GEP, which will be referred to as the generalized gate error probability (GGEP) hereafter. Specifically, we define the GGEP of channel $\mathcal{C}$ as
$$\epsilon(\mathcal{C}) \triangleq 1 - 4^{-n}\,\mathrm{Tr}\{C\}.$$
As a channel quality metric, the GGEP has the following advantages.
1) When $\mathcal{C}$ is a Pauli channel, the GGEP degenerates to the conventional GEP, which is p in (11).
2) For a general channel $\mathcal{C}$, which is not necessarily a Pauli channel, the GGEP is proportional to the average infidelity of $\mathcal{C}$, in the sense that
$$\epsilon(\mathcal{C}) = \frac{2^n + 1}{2^n}\left[1 - \bar{F}(\mathcal{C})\right].$$
Thus an operation preserving the average fidelity would also preserve the GGEP.
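The relationship between the GEP, the GGEP and the average fidelity can be checked numerically. The following is a minimal sketch for single-qubit channels. It assumes that the Pauli transfer matrix has entries Tr{S_i C(S_j)}/2, that the GGEP equals 1 − Tr{C}/4^n, and that the average fidelity equals (Tr{C} + 2^n)/(2^n(2^n + 1)); the function names are illustrative and not taken from the treatise.

```python
import numpy as np

# Single-qubit Pauli operators in the order I, X, Y, Z.
PAULIS = [
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
]


def pauli_transfer_matrix(kraus_ops):
    """Single-qubit Pauli transfer matrix with entries Tr{S_i C(S_j)} / 2."""
    ptm = np.zeros((4, 4))
    for j, s_j in enumerate(PAULIS):
        # Apply the channel to the j-th Pauli operator via its Kraus operators.
        channel_output = sum(k @ s_j @ k.conj().T for k in kraus_ops)
        for i, s_i in enumerate(PAULIS):
            ptm[i, j] = np.real(np.trace(s_i @ channel_output)) / 2
    return ptm


def ggep(ptm, n_qubits=1):
    """Assumed GGEP definition: 1 - Tr{C} / 4^n."""
    return 1.0 - np.trace(ptm) / 4**n_qubits


def average_fidelity(ptm, n_qubits=1):
    """Assumed closed form: (Tr{C} + 2^n) / (2^n (2^n + 1))."""
    d = 2**n_qubits
    return (np.trace(ptm) + d) / (d * (d + 1))


def depolarizing_kraus(p):
    """Kraus operators of a depolarizing channel with gate error probability p."""
    weights = [1 - p, p / 3, p / 3, p / 3]
    return [np.sqrt(w) * s for w, s in zip(weights, PAULIS)]


ptm = pauli_transfer_matrix(depolarizing_kraus(0.03))
eps = ggep(ptm)
fbar = average_fidelity(ptm)
print(np.round(np.diag(ptm), 4), eps, fbar)
```

Under these assumptions the GGEP of the depolarizing channel coincides with p, whereas the average infidelity is smaller by the factor 2^n/(2^n + 1), which is the inconsistency discussed above.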

C. Hybrid quantum-classical computation
The evaluation of eigenvalues and eigenvectors is a fundamental subroutine in many existing quantum algorithms, such as Shor's algorithm, the Harrow-Hassidim-Lloyd (HHL) algorithm, and in Hamiltonian simulation [12], [30]-[33]. In early contributions, quantum phase estimation [28, Sec. 5.2] was the default algorithm for eigenvalue evaluation, which requires a long coherence time. To enable eigenvalue evaluation on noisy intermediate-scale quantum computers, hybrid quantum-classical algorithms based on variational optimization have been proposed in [34]-[36].
Mathematically, for a Hermitian matrix H, the eigenvector $\psi_0$ corresponding to the smallest eigenvalue can be calculated as follows
$$\psi_0 = \arg\min_{\psi}\ \psi^\dagger H \psi, \tag{14a}$$
$$\text{subject to}\ \|\psi\|_2 = 1. \tag{14b}$$
This problem is referred to as the variational formulation of eigenvalue evaluation in the literature [16], [34], [37]. If the eigenvectors are reparametrized using a vector θ, yielding ψ = ψ(θ), the task of finding the smallest eigenvalue or the corresponding eigenvector may be achieved by searching for the minimum of the objective function
$$J(\theta) = \psi(\theta)^\dagger H \psi(\theta) \tag{15}$$
in the space of θ, while satisfying the normalization constraint (14b). The formulation (15) has been applied to the electronic structure computation of the hydrogen molecule in [38]. In general, this would be a non-convex problem with respect to θ, which may be solved using iterative non-convex optimization solvers, such as the classic gradient descent and the Nelder-Mead simplex method [39]. In each iteration, the objective function or another function (e.g. the gradient) is first evaluated at a specific point in the parameter space, and then the parameters are updated according to the function values. Under the framework of quantum computation, the problem (14) can be recast as
$$\min_{\theta}\ J(\theta) = \langle\psi(\theta)|H|\psi(\theta)\rangle. \tag{16}$$
For conciseness of our discussion, we shall use the pure-state formulation (16) hereafter, whenever there is no confusion. The essence of hybrid quantum-classical computation is to evaluate the functions using a quantum circuit, whereas the parameter values are updated using a classical computer, as illustrated in Fig. 1. To be more specific, a schematic of the quantum circuit evaluating the objective function J(θ) is portrayed in Fig. 3. The input state of the circuit is typically the all-zero state $|0\rangle^{\otimes n}$. The function-evaluation circuit U(θ) encodes the parameter vector θ, and transforms the input state to the state $|\psi(\theta)\rangle$. The expectation value J(θ) is then computed based on the result of multiple measurements. This is achieved by decomposing the observable H (involving at most K-qubit interactions) as
$$H = \sum_{k=1}^{K}\ \sum_{i_1, \ldots, i_k} h_{i_1, i_2, \ldots, i_k} \prod_{l=1}^{k} \sigma_{i_l}^{(j_l)}, \tag{18}$$
where $\sigma_{i_l}^{(j_l)}$ denotes a Pauli-$j_l$ operator (i.e., $j_l$ may be X, Y or Z) acting on the $i_l$-th qubit.
In light of this, the term $\prod_{l=1}^{k} \sigma_{i_l}^{(j_l)}$ can be implemented using a simple quantum circuit consisting of k single-qubit gates followed by measurements, as shown in the dashed box of Fig. 3. For example, a Pauli-Z operator in (18) corresponds to a direct measurement, whereas a Pauli-X operator corresponds to a Hadamard gate followed by measurement. Thus, the expectation value can be obtained by measuring the outputs of these simple circuits, and evaluating a weighted sum over them using the weights $h_{i_1, i_2, \ldots, i_k}$.
In contrast to "fully quantum" algorithms (e.g. Shor's algorithm and Grover's search algorithm [40]) aiming to arrive at one of the computational basis states at the very end of computation, hybrid quantum-classical algorithms aim to compute expectation values. Therefore, the measurement results have to be averaged over a number of independent circuit executions. In this treatise, we will refer to this process as "circuit sampling". To portray the potential advantage of hybrid quantum-classical computation, we provide a rough complexity comparison between classical computation and hybrid quantum-classical computation. Using classical computation, evaluating J(θ) for $\psi \in \mathbb{C}^{2^n}$ requires on the order of $O(2^{2n})$ operations. By contrast, the complexity of the hybrid scheme depends both on the complexity of the function-evaluation circuit as well as on the structure of H. More precisely, denoting the complexity of the function-evaluation circuit in terms of quantum gates as T, the total complexity of evaluating J(θ) would be $O(T\sum_{k=1}^{K} N_k)$, where $N_k$ denotes the number of weight-k Pauli strings in H. Therefore, for application scenarios where the observable H is "sparse" in the sense that the number of terms $\sum_{k=1}^{K} N_k$ is small (e.g. polynomial in n), a substantial speedup over classical computation may be achieved when using the hybrid approach.
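The weighted-sum evaluation of J(θ) can be illustrated classically. The sketch below decomposes a small two-qubit observable into Pauli strings and recombines the per-string expectation values using the decomposition weights. The example observable, the state and all function names are illustrative assumptions; on a quantum device each per-string expectation would be estimated by circuit sampling rather than computed exactly as done here.

```python
import itertools
import numpy as np

PAULIS = {
    "I": np.eye(2, dtype=complex),
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}


def pauli_string(label):
    """Tensor product of single-qubit Pauli operators, e.g. 'ZZ' or 'XI'."""
    op = np.array([[1.0 + 0j]])
    for ch in label:
        op = np.kron(op, PAULIS[ch])
    return op


def pauli_decomposition(hamiltonian, n_qubits):
    """Return {label: weight} such that H = sum_label weight * pauli_string(label)."""
    weights = {}
    for chars in itertools.product("IXYZ", repeat=n_qubits):
        label = "".join(chars)
        w = np.trace(pauli_string(label) @ hamiltonian).real / 2**n_qubits
        if abs(w) > 1e-12:
            weights[label] = w
    return weights


def expectation_from_terms(weights, psi):
    """Weighted sum of the expectation values of the individual Pauli strings."""
    return sum(
        w * np.vdot(psi, pauli_string(label) @ psi).real
        for label, w in weights.items()
    )


# Hypothetical sparse observable with two Pauli strings.
H = pauli_string("ZZ") + 0.5 * pauli_string("XI")
weights = pauli_decomposition(H, n_qubits=2)

# A fixed pseudo-random normalized state standing in for |psi(theta)>.
rng = np.random.default_rng(0)
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)

j_terms = expectation_from_terms(weights, psi)
j_direct = np.vdot(psi, H @ psi).real
print(weights, j_terms, j_direct)
```

Only the two non-zero Pauli strings contribute to the sum, which reflects the "sparse" scenario described above.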

III. A BRIEF INTRODUCTION TO QUANTUM ERROR MITIGATION
When contaminated by decoherence, the quantum circuits of hybrid quantum-classical computation would produce erroneous expectation values. Fortunately, the weighted-averaging nature of hybrid quantum-classical computation facilitates the conception of a qubit-overhead-free method that mitigates the deviation from the true expectation value, namely QEM [24]. In this section, we introduce the formulation of QEM and the computational overhead it incurs, namely the sampling overhead.

A. The Basic Formulation of QEM
The philosophy of QEM is to insert a probabilistic quantum circuit right after every quantum gate, which reverts the effect of the quantum channel modelling the imperfection of the gate. Conceptually, a QEM-protected gate can be portrayed as in Fig. 4. The imperfect gate (in this case an imperfect CNOT gate) can be decomposed into a perfect gate and a quantum channel $\mathcal{C}$. Given an input state having the density matrix $\rho_{\rm in}$, according to (4), the output state after the imperfect gate is given by
$$\rho_{\rm out} = \mathcal{C}\big(U_g \rho_{\rm in} U_g^\dagger\big) = \sum_i K_i U_g \rho_{\rm in} U_g^\dagger K_i^\dagger,$$
where the matrix $U_g$ corresponds to the effect of the perfect gate. In a vectorized form, the output state can be expressed as
$$x_{\rm out} = C G x_{\rm in},$$
where we have $G = U_g^* \otimes U_g$, and the Pauli transfer matrix C is given in (6). If we have an estimate $\hat{C}$ of C, potentially obtained using methods such as quantum process tomography [41], an estimate of the output of the decoherence-free gate G can be obtained as
$$\hat{x}_{\rm out} = M C G x_{\rm in},$$
where $M = \hat{C}^{-1}$ is the Pauli transfer matrix representation of the probabilistic quantum circuit $\mathcal{M}$ constructed for inverting the channel.
To elaborate further, if a gate is followed directly by measurement, M is implemented by applying different circuits according to a probability distribution in different circuit executions and performing a weighted averaging over the measurement outcomes. In this formulation, $M_k$ is the k-th candidate circuit applied with a probability of $p_k$, $w_k$ is the weight of the measurement outcome of the k-th candidate circuit, and $m_k$ is the Pauli transfer matrix representation of the measurement operator corresponding to the outcome. For a circuit constructed from multiple gates, the weights and probability distributions follow directly by linearity. For example, for a simple circuit containing two consecutive imperfect gates $\tilde{G}^{(1)}$ and $\tilde{G}^{(2)}$, each gate is followed by its own probabilistic circuit, where the superscripts "(1)" and "(2)" indicate the first and the second gate, respectively. Since the expectation evaluation in hybrid quantum-classical computation is implemented by applying a linear transformation (weighted averaging) to the measurement outcomes, it fits nicely with QEM. By contrast, "fully quantum" algorithms such as Shor's algorithm and Grover's algorithm operate in different ways, hence they might not be protectable by QEM.
Optionally, one may apply a quantum channel precoder to the imperfect gate, as shown in Fig. 4. The quantum channel precoder turns the channel C into another (possibly more preferable) channel.

Fig. 4. Schematic of a QEM-protected imperfect CNOT gate equipped with a quantum channel precoder.
For example, the so-called Pauli twirling of [25], [26], [42] may be viewed as a quantum channel precoder turning an arbitrary channel into a Pauli channel.
Similarly, Clifford twirling [43] turns an arbitrary channel into a depolarizing channel. According to the operator-sum representation [28, Sec. 8.2.4], a quantum channel precoder can be implemented by a probabilistic mixture of gates applied both before and after the imperfect gate to be protected.
In order to obtain M, we may first choose a basis matrix B, in which each column is the vectorized Pauli transfer matrix of a candidate circuit [25]. For example, if the first column of B represents an operator $\mathcal{O}$, we have $[B]_{:,1} = \mathrm{vec}(O)$, where O is the Pauli transfer matrix of $\mathcal{O}$. Next, we determine the coefficients as follows:
$$\mu_C = B^{-1}\,\mathrm{vec}(M). \tag{25}$$
Given the coefficients $\mu_C$, we can now express M as
$$M = \mathrm{vec}^{-1}(B\mu_C).$$
This can be realized by applying the candidate circuit corresponding to the i-th column of B with probability $|[\mu_C]_i| / \|\mu_C\|_1$, and assigning a weight $\mathrm{sgn}([\mu_C]_i) \cdot \|\mu_C\|_1$ to the measurement outcome. In light of this, $\mu_C$ is referred to as the quasi-probability representation [24] of channel C.

B. The Sampling Overhead of QEM
In general, the probabilistic implementation of M will incur a sampling overhead. To elaborate, if we wish to compute the expectation value J(θ) to a certain accuracy, we have to execute the circuit a certain number of times (sampling from the output state). When the gates are perfect, the number of samples required is determined by the variance of the observable H given by
$$\mathrm{Var}\{H\} = \langle\psi(\theta)|H^2|\psi(\theta)\rangle - \langle\psi(\theta)|H|\psi(\theta)\rangle^2.$$
In practice, the observable H cannot be measured directly, but has to be estimated using the measurement outcomes of several operators according to the decomposition (18). Upon assuming that the required accuracy is quantified in terms of the variance $\sigma^2$ of the estimate, this may be achieved using $N_s$ samples in the perfect gate scenario. After QEM, the expected value remains unchanged. However, if the number of samples is kept fixed, QEM will lead to a variance increase, since the entries in $\mu_C$ are not necessarily positive. Explicitly, the variance after QEM is given by $\|\mu_C\|_1^2\,\sigma^2$, where $\|\mu_C\|_1 \ge 1$. Therefore, in order to achieve the same accuracy, we have to sample every quantum gate $N_s(\|\mu_C\|_1^2 - 1)$ times additionally. To elaborate a little further, we consider the 'toy' example portrayed in Fig. 5. In this example, we assume that the error-free expectation value is E, and we assume that $\|\mu_C\|_1 = 3$. Observe from the figure that when the circuit is sampled $N_s$ times, the computational result without QEM is randomly distributed around its mean value $\tilde{E}$, which deviates from the true value E. Having been corrected by QEM, the mean value of the computational result equals E. However, the variance of the result is increased by a factor of $\|\mu_C\|_1^2 = 9$.
To ensure that the accuracy meets our requirement, we have to sample the circuit $N_s\|\mu_C\|_1^2 = 9N_s$ times, as illustrated by the dotted curve of Fig. 5. Empirical evidence has shown that applying quantum channel precoders is capable of reducing the sampling overhead for certain types of channels [25].
In the previous example, we have considered the case of a single channel C. In general, a quantum circuit consists of $N_g > 1$ gates, which may be viewed as the cascade of $N_g$ channels, denoted by $C_1, \ldots, C_{N_g}$. To achieve the required computational accuracy, we have to sample the circuit
$$\tilde{N}_s = N_s \prod_{i=1}^{N_g} \|\mu_{C_i}\|_1^2 \tag{29}$$
times. To facilitate our analysis, we define the sampling overhead of a circuit as the additional number of samples imposed by QEM, which equals $\tilde{N}_s - N_s$ in our previous example. We note that the sampling overhead of a circuit increases exponentially with the number of gates.
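The exponential growth of the sampling overhead can be sketched in a few lines. The example below assumes that the per-gate variance factors of $\|\mu_{C_i}\|_1^2$ multiply across the gates of the circuit, consistent with the single-gate factor discussed above; the function names are illustrative.

```python
import math


def required_samples(n_s, l1_norms):
    """Number of circuit samples after QEM: N_s times the product of ||mu_i||_1^2."""
    return n_s * math.prod(norm**2 for norm in l1_norms)


def sampling_overhead(n_s, l1_norms):
    """Additional number of samples imposed by QEM."""
    return required_samples(n_s, l1_norms) - n_s


# Single-gate 'toy' example with an l1-norm of 3: nine times as many samples.
single_gate_total = required_samples(1000, [3])
single_gate_extra = sampling_overhead(1000, [3])

# Identical gates with an l1-norm of 1.0625: the overhead grows exponentially.
for n_gates in (1, 10, 50, 100):
    print(n_gates, sampling_overhead(1000, [1.0625] * n_gates))
```

Even a per-gate factor close to one accumulates rapidly, which is why the number of gates in the circuit plays a central role in the remainder of the analysis.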
According to the previous discussions, we may use the following notion to characterize the sampling overhead incurred by a single channel C when compensated by QEM.
Definition 1 (Sampling Overhead Factor): We define the sampling overhead factor (SOF) of a quantum channel C as
$$\gamma_C \triangleq \|\mu_C\|_1^2 - 1. \tag{30}$$
Remark 1: When there is only a single gate modelled by the associated channel C in the circuit, we can see from (29) that the sampling overhead of the circuit may be represented in terms of the SOF $\gamma_C$ as $N_s\gamma_C$. When there are several gates in the circuit, the sampling overhead of the circuit can be computed using the SOFs of the channels as follows:
$$N_s\left[\prod_{i=1}^{N_g}(1 + \gamma_{C_i}) - 1\right].$$
To provide further intuition about the SOF, let us consider the 'toy example' of a single-qubit depolarizing channel $\mathcal{C}$, having the following operator-sum representation
$$\mathcal{C}(\rho) = 0.97\rho + 0.01\,S_X \rho S_X + 0.01\,S_Y \rho S_Y + 0.01\,S_Z \rho S_Z, \tag{31}$$
which can be observed to have GGEP = 0.03. The corresponding Pauli transfer matrix takes the following form
$$C = \mathrm{diag}\{1,\ 0.96,\ 0.96,\ 0.96\}, \tag{32}$$
and the inverse Pauli transfer matrix is given by
$$C^{-1} = \mathrm{diag}\{1,\ 1.0417,\ 1.0417,\ 1.0417\}.$$
Geometrically, the channel $\mathcal{C}$ corresponds to a homogeneous shrinking of the Bloch sphere, making its radius 0.96 times the original radius, while the inverse channel $\mathcal{C}^{-1}$ corresponds to a homogeneous dilation extending the radius to 1/0.96 = 1.0417 times the original radius, as portrayed in Fig. 6.
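To perform QEM on the channel C, we should first choose a basis. Note that the Pauli operators have the following Pauli transfer matrix representations
$$S_I = \mathrm{diag}\{1, 1, 1, 1\},\quad S_X = \mathrm{diag}\{1, 1, -1, -1\},\quad S_Y = \mathrm{diag}\{1, -1, 1, -1\},\quad S_Z = \mathrm{diag}\{1, -1, -1, 1\},$$
which constitute a complete basis of diagonal Pauli transfer matrices. In light of this, we may choose the Pauli operators as the basis. The corresponding quasi-probability representation can then be computed as
$$\mu_C = [1.03125,\ -0.0104,\ -0.0104,\ -0.0104]^{T},$$
yielding $\|\mu_C\|_1 = 1.0625$. Therefore, given the basis we chose, the SOF of the channel C is given by
$$\gamma_C = 1.0625^2 - 1 \approx 0.1289.$$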
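The numbers of this toy example can be reproduced with a short script. The sketch below assumes that the SOF equals $\|\mu_C\|_1^2 - 1$, as suggested by the additional number of samples discussed in Section III-B, and that each candidate Pauli gate is applied with probability $|[\mu_C]_i| / \|\mu_C\|_1$; the variable names are illustrative.

```python
import numpy as np

# Columns hold the main diagonals of the Pauli transfer matrices of I, X, Y, Z.
PAULI_PTM_DIAGONALS = np.array(
    [
        [1, 1, 1, 1],
        [1, 1, -1, -1],
        [1, -1, 1, -1],
        [1, -1, -1, 1],
    ],
    dtype=float,
)

# Depolarizing channel with GGEP = 0.03 and its inverse, both diagonal.
channel_diag = np.array([1.0, 0.96, 0.96, 0.96])
inverse_diag = 1.0 / channel_diag

# Quasi-probability representation: solve B * mu = diagonal of the inverse channel.
mu = np.linalg.solve(PAULI_PTM_DIAGONALS, inverse_diag)
l1_norm = np.abs(mu).sum()
sof = l1_norm**2 - 1  # assumed SOF definition

# Sampling recipe: gate i is applied with probability |mu_i| / ||mu||_1, and the
# measurement outcome is weighted by sgn(mu_i) * ||mu||_1.
probabilities = np.abs(mu) / l1_norm
weights = np.sign(mu) * l1_norm

print(mu, l1_norm, sof)
```

The identity gate is applied in the vast majority of circuit executions, and the negative weights attached to the X, Y and Z gates are what inflate the variance of the final estimate.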

IV. SAMPLING OVERHEAD FACTOR ANALYSIS FOR UNCODED QUANTUM GATES
In this section, we investigate the SOF of quantum gates that are not protected by quantum codes. This will also lay the foundation for the analysis of coded quantum gates in Sec. V.
The SOF, in essence, may also be viewed as a specific characteristic of the representation of a quantum channel under a specific basis, as implied by (25) and (30). Hence, the particular choice of basis will certainly have an impact on it. Considering realistic restrictions and aiming to simplify our analysis, we make the following assumptions concerning the choice of basis.
1) The basis vectors should correspond to legitimate quantum operations in order to be implementable. Formally, we assume that the basis vectors are the vectorized Pauli transfer matrices of completely positive trace-nonincreasing (CPTnI) operators, for which the operation elements of (4) satisfy
$$\sum_i K_i^\dagger K_i \preceq I.$$
This includes perfect gates (unitary operators), imperfect gates (CPTP operators), and measurements (trace-decreasing operators).
2) We assume that the basis always includes all vectorized Pauli operators. This is not very restrictive for most existing quantum computers, since Pauli gates are among their most fundamental building blocks.
The other (potentially more important) factor influencing the SOF is the quantum channel itself. Various channel models have been proposed in the literature, such as depolarizing channels, phase damping channels, amplitude damping channels, etc. [25], [44], [45]. To maintain the generality of our treatment, we do not explicitly consider a specific channel model, but rather a general CPTP channel.

A. Coherent-triangular decomposition of Memoryless CPTP channels
Without loss of generality, we make the following assumption.
Assumption 1: The first row and the first column of a Pauli transfer matrix correspond to the identity operator $I^{\otimes n}$ in the n-qubit Pauli group.
Note that for a valid density matrix ρ, the condition Tr{ρ} = 1 is always satisfied. According to Assumption 1, this implies that the corresponding vector representation x is a constant multiple of
$$[1,\ \tilde{x}^{T}]^{T}, \tag{34}$$
where $\tilde{x} \in \mathbb{R}^{4^n - 1}$, according to (8). The dimensionality of x is $4^n$, because the number of Pauli operators over n qubits is $4^n$. Thus for any trace-preserving channel, we have
$$C = \begin{bmatrix} 1 & 0^{T} \\ b & \tilde{C} \end{bmatrix}, \tag{35}$$
where $b \in \mathbb{R}^{4^n - 1}$, which amounts to the following result
$$\tilde{x}_{\rm out} = \tilde{C}\tilde{x} + b. \tag{36}$$
It can now be seen from (34), (35) and (36) that a CPTP channel can be viewed as an affine transformation in the $(4^n - 1)$-dimensional space spanned by the Pauli operators excluding the identity. In this regard, we have the following result for single-qubit channels, whose geometric interpretation is illustrated in Fig. 7.
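Lemma 1: Any single-qubit CPTP channel can be expressed as the composition of (up to) two coherent channels and a triangular channel, meaning that
$$C = UDV, \tag{37}$$
where U and V act as the identity on the first (identity) component and as $\tilde{U}$ and $\tilde{V}$ on the remaining three components, respectively, while the lower-right 3 × 3 block of D is $\tilde{D}$. Here $\tilde{U}$ and $\tilde{V}$ are a pair of unitary matrices, and $\tilde{D}$ is a diagonal matrix. The matrix D corresponds to the triangular channel, while the matrices U and V represent the coherent channels.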
Proof: Consider the singular value decomposition of $\tilde{C}$ given by
$$\tilde{C} = \tilde{U}\tilde{D}\tilde{V}.$$
From (35) we obtain directly that D is a triangular channel, and hence it now suffices to show that both U and V can be implemented by unitary gates. Since the entries of Pauli transfer matrices are all real numbers [46], the matrices $\tilde{U}$ and $\tilde{V}$ are both 3 × 3 orthogonal matrices, corresponding to three-dimensional rotations of the Bloch sphere belonging to the special orthogonal group SO(3). They can be implemented by single-qubit unitary gates belonging to SU(2) due to the SO(3)-SU(2) homomorphism.
Compared to the triangular component, the coherent component of a CPTP channel might be easier to deal with, since its effect may be compensated by using unitary gates. This implies that if the unitary gates designed for the compensation are error-free, the effect of coherent channels may be reversed without any sampling overhead. By contrast, the triangular component may have to be compensated by using probabilistic gates, and hence imposes a sampling overhead.
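It is known that Lemma 1 does not hold for general multi-qubit channels [47]. Nevertheless, it is applicable to the case where the channel C is memoryless, hence it can be described by the tensor product of single-qubit channels. To see this, we may rewrite a memoryless channel as $C = C_1 \otimes C_2 \otimes \ldots \otimes C_n$, and apply Lemma 1 to each $C_i$, yielding $C_i = U_i D_i V_i$. By the mixed-product property of the Kronecker product, we then have
$$C = \left(\bigotimes_{i=1}^{n} U_i\right)\left(\bigotimes_{i=1}^{n} D_i\right)\left(\bigotimes_{i=1}^{n} V_i\right).$$
Observe that both $\bigotimes_{i=1}^{n} U_i$ and $\bigotimes_{i=1}^{n} V_i$ correspond to practically implementable single-qubit gates. Since the Kronecker product preserves the triangular structure, we see that $\bigotimes_{i=1}^{n} D_i$ also represents a triangular channel.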
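The coherent-triangular decomposition of a single-qubit channel can be sketched numerically via the singular value decomposition of the lower-right 3 × 3 block of its Pauli transfer matrix. The example below is an illustration under stated assumptions: the test channel (amplitude damping followed by a rotation about the X axis), the sign-fixing step that turns improper rotations into proper ones, and all function names are choices made for this sketch rather than taken from the treatise.

```python
import numpy as np


def embed(block):
    """Embed a 3x3 block into a 4x4 matrix acting trivially on the identity component."""
    out = np.eye(4)
    out[1:, 1:] = block
    return out


def amplitude_damping_ptm(p):
    """Pauli transfer matrix of amplitude damping, ordered as I, X, Y, Z."""
    s = np.sqrt(1 - p)
    return np.array(
        [[1, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [p, 0, 0, 1 - p]], dtype=float
    )


def coherent_triangular_decomposition(ptm):
    """Return (U, D, V) with ptm = U @ D @ V, where U and V are rotations."""
    u, s, vh = np.linalg.svd(ptm[1:, 1:])
    d = np.diag(s)
    # If a factor has determinant -1, move a sign into the diagonal factor so
    # that both orthogonal factors become proper rotations in SO(3).
    flip = np.diag([1.0, 1.0, -1.0])
    if np.linalg.det(u) < 0:
        u, d = u @ flip, flip @ d
    if np.linalg.det(vh) < 0:
        vh, d = flip @ vh, d @ flip
    U, V = embed(u), embed(vh)
    D = U.T @ ptm @ V.T  # lower triangular: first row is [1, 0, 0, 0]
    return U, D, V


# Test channel: amplitude damping followed by a rotation about the X axis.
theta = 0.4
c, s = np.cos(theta), np.sin(theta)
rotation = np.array([[1, 0, 0], [0, c, -s], [0, s, c]])
channel = embed(rotation) @ amplitude_damping_ptm(0.1)

U, D, V = coherent_triangular_decomposition(channel)
print(np.round(D, 4))
```

In this sketch the rotation is absorbed entirely into U and V, so that only the triangular factor D would require probabilistic compensation.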

B. Analysis on triangular channels
According to the discussions in Section IV-A, we are particularly interested in the quasi-probability representation of triangular channels, whose Pauli transfer matrices take the same form as the matrix D in (37). More precisely, we define triangular channels as CPTP quantum channels whose Pauli transfer matrix can be written as follows
$$D = \begin{bmatrix} 1 & 0^{T} \\ b & L \end{bmatrix},$$
where L is a lower triangular matrix. This includes both amplitude damping channels and Pauli channels as representative examples. For example, a single-qubit amplitude damping channel having decay probability p has the following Pauli transfer matrix
$$\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \sqrt{1-p} & 0 & 0 \\ 0 & 0 & \sqrt{1-p} & 0 \\ p & 0 & 0 & 1-p \end{bmatrix},$$
where the rows/columns are ordered for ensuring that they correspond to the Pauli-I, X, Y, and Z operators, respectively. Observe that the matrix does have a triangular structure, which is preserved under the permutation of the Pauli-X, Y, and Z operators. As a direct corollary, for a multi-qubit channel inflicting amplitude damping independently on each qubit, the Pauli transfer matrix is triangular, since the triangular structure is preserved under the Kronecker product.
For a fair comparison, we consider channels having the same GGEP $\epsilon$, meaning that $D \in \mathcal{D}_n(\epsilon)$, where we have:
$$\mathcal{D}_n(\epsilon) = \{D \in \mathcal{D}_n : \text{the GGEP of } D \text{ equals } \epsilon\},$$
and $\mathcal{D}_n$ denotes the set of all CPTP triangular channels over n qubits. In the following proposition, we will show that regardless of the specific choice of the basis B, Pauli channels have the lowest SOF among all triangular CPTP channels. The Pauli channels are defined as channels that transform one Pauli matrix into another. Based on (7), this implies that the corresponding Pauli transfer matrices are diagonal matrices.
Proposition 1 (Pauli channels have the lowest SOF): Given a fixed GGEP $\epsilon$, for any full-rank basis matrix $B$ consisting of the vectorized Pauli transfer matrix representations of CPTnI operators, among all CPTP triangular channels over $n$ qubits, Pauli channels have the lowest SOF.
Proof: Please refer to Appendix I.
Remark 2: We note that there is some empirical evidence supporting that projecting a quantum channel onto the set of Pauli channels might help in reducing the sampling overhead [25]. Here we formally show that this is indeed true, and it is true for the entire family of triangular channels.
Next, we consider some notational simplifications for a further investigation in the context of Pauli channels. First of all, since the Pauli transfer matrices of Pauli channels are diagonal (as exemplified by (32)), we may rewrite the vector representation of a Pauli transfer matrix (or that of its inverse) in a reduced-dimensional manner. Specifically, we could represent the Pauli transfer matrix of a Pauli channel using merely the vector on its main diagonal, i.e. d = vec{mdiag{D}}.
Additionally, the $16^n \times 16^n$ basis matrix $B$ can be reduced to $\tilde{B} \in \mathbb{R}^{4^n \times 4^n}$ for Pauli channels. In light of this, we have For simplicity, we introduce the further notation of Note that each column in $\tilde{B}$ represents the vectorized Pauli transfer matrix of a specific Pauli operator (as a quantum channel). According to the definitions of the Pauli operators and (7), under the computational basis, the vectorized Pauli transfer matrix representations of the single-qubit Pauli-I, X, Y, and Z operators can be expressed as $[1\ 1\ 1\ 1]^T$, $[1\ 1\ {-1}\ {-1}]^T$, $[1\ {-1}\ 1\ {-1}]^T$, and $[1\ {-1}\ {-1}\ 1]^T$, respectively. In this case, it can be seen that the corresponding simplified basis matrix has the form of the Hadamard transform matrix. In general, Pauli operators over $n$ qubits can be expressed as the tensor product of $n$ single-qubit Pauli operators, hence the corresponding $\tilde{B}$ takes the form of $\tilde{B} = \tilde{B}_1^{\otimes n}$, which is simply a Hadamard transform matrix of higher dimensionality, where $\tilde{B}_1$ denotes the matrix $\tilde{B}$ for single-qubit systems. By exploiting the properties of the Hadamard transform, we are now able to obtain the following result.
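The Hadamard structure of the reduced basis matrix can be illustrated as follows. This is a minimal sketch, assuming that the columns are ordered as I, X, Y, Z and that each column holds the main diagonal of the Pauli transfer matrix of the corresponding Pauli operator. The names `B1` and `reduced_basis` are illustrative.

```python
import numpy as np
from functools import reduce

# Each column is the main diagonal of the Pauli transfer matrix of I, X, Y, Z:
# conjugation by a Pauli operator flips the sign of the Paulis it anticommutes with.
B1 = np.array([
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1,  1, -1],
    [1, -1, -1,  1],
])


def reduced_basis(n):
    """n-fold Kronecker power of the single-qubit reduced basis matrix."""
    return reduce(np.kron, [B1] * n)


B2 = reduced_basis(2)
# A Hadamard matrix of order m satisfies H H^T = m I.
is_hadamard = np.array_equal(B2 @ B2.T, 16 * np.eye(16, dtype=int))
```

The orthogonality of the rows is what makes the inverse transform a scaled copy of the transform itself.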
Proposition 2 (Depolarizing channels have the lowest SOF): Among all Pauli channels over $n$ qubits having the same GGEP $\epsilon$, depolarizing channels have the lowest SOF.
Proof: Please refer to Appendix II.
According to Proposition 2, depolarizing channels lend themselves most readily to compensation by QEM. This means that the QEM method has a strikingly different nature compared to the family of quantum error correction schemes in terms of the overhead imposed, since depolarizing channels may be viewed as the most challenging channels for QECCs. To elaborate, depolarizing channels exhibit the lowest hashing bound among all Pauli channels, hence they would require the highest qubit overhead⁵ for QECCs [28].

C. Bounding the SOF of Pauli channels
In Sec. IV-B we have shown that Pauli channels are preferable for QEM in the sense that they have the lowest SOF. In this subsection, we proceed by further investigating Pauli channels and bound their SOF for a given GGEP $\epsilon$. First of all, by explicitly calculating the SOF of depolarizing channels, we can readily obtain a lower bound on the SOF of triangular channels (hence also on Pauli channels), as stated below.
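Corollary 1 (SOF lower bound): For a triangular channel $\mathcal{C}$ having the GGEP $\epsilon$, the SOF incurred by QEM is bounded from below as
$$\gamma_{\mathcal{C}} \ge \frac{4\epsilon}{(1-\epsilon)^2}. \quad (44)$$
The lower bound is attained when $\mathcal{C}$ is a depolarizing channel.
⁵ Given a fixed GEP.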
Proof: According to Proposition 2, it is clear that the channel having the lowest SOF among all $N$-qubit triangular channels is the $N$-qubit depolarizing channel. The SOF of this channel is given by where $H_N$ and $H_N^{-1}$ are the Hadamard transform matrix and the inverse Hadamard transform matrix having dimensionality of $16^N \times 16^N$, respectively, and we have It is straightforward to verify that the right hand side of (45b) is a monotonically decreasing function of $N$, thus its limit as $N \to \infty$ serves as a lower bound for every $N$. Hence the proof is completed.
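The monotonic behaviour used in the proof can be examined numerically. The sketch below rests on two assumptions that are not spelled out in this section: the SOF is taken to be the squared l1-norm of the quasi-probability vector minus one, and the lower bound is taken to be 4ε/(1 − ε)², which is the form consistent with the factors quoted in Corollary 2. The reduced 4^N × 4^N Hadamard matrix is used, and all function names are illustrative.

```python
import numpy as np
from functools import reduce

H1 = np.array([[1, 1, 1, 1],
               [1, 1, -1, -1],
               [1, -1, 1, -1],
               [1, -1, -1, 1]], dtype=float)


def depolarizing_sof(eps, n_qubits):
    """SOF of an n-qubit depolarizing channel, assuming SOF = ||mu||_1^2 - 1."""
    k = 4 ** n_qubits
    h = reduce(np.kron, [H1] * n_qubits)
    # Probability vector: identity with 1 - eps, every other Pauli equally likely.
    eta = np.full(k, eps / (k - 1))
    eta[0] = 1.0 - eps
    d = h @ eta                 # main diagonal of the Pauli transfer matrix
    mu = h @ (1.0 / d) / k      # quasi-probabilities of the inverse channel
    return np.sum(np.abs(mu)) ** 2 - 1.0


def sof_lower_bound(eps):
    """Assumed closed form of the lower bound, 4 eps / (1 - eps)^2."""
    return 4.0 * eps / (1.0 - eps) ** 2


eps = 0.01
sofs = [depolarizing_sof(eps, n) for n in (1, 2, 3)]
```

Under these assumptions the values in `sofs` decrease with the number of qubits while staying above `sof_lower_bound(eps)`.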
Let us now derive an upper bound on the SOF of Pauli channels. For this purpose, we consider a matrix representation specifically designed for Pauli channels, which will be referred to as the Pauli random walk (PRW) representation hereafter. More precisely, a Pauli channel $\mathcal{C}$ over $n$ qubits can be represented by a matrix $C_{\mathrm{PRW}}$ having the following form: Conventionally, an $n$-qubit Pauli channel $\mathcal{C}$ can be characterized using a vector $\eta_{\mathcal{C}}$ satisfying We will refer to $\eta_{\mathcal{C}}$ as the probability vector of $\mathcal{C}$ in the rest of this paper. For examples of $\eta_{\mathcal{C}}$ values corresponding to some simple channel models, please refer to Appendix III. In light of (48), the PRW representation can be expressed as a function of $\eta_{\mathcal{C}}$ as follows To gain further insights into the PRW representation, we will rely on a weighted Cayley graph [48] $\mathcal{G}$ of Pauli groups, in which the $i$-th vertex represents the $i$-th operator in $\mathcal{P}_n$. For a specific channel $\mathcal{C}$, a pair of nodes $i$ and $j$ in the Cayley graph are connected with an edge having a weight of $[\eta_{\mathcal{C}}]_l$, if we have $P_i P_j = P_l$. As a tangible example, the graph $\mathcal{G}$ corresponding to the single-qubit Pauli group is portrayed in Fig. 8. The function $\sigma(O)$ denotes the index of the operator $O$ in $\mathcal{P}$, where we have $\sigma(X) = 2$, $\sigma(Y) = 3$, and $\sigma(Z) = 4$. For a fixed GGEP $\epsilon$, we can rewrite $C_{\mathrm{PRW}}(\eta_{\mathcal{C}})$ as where $A(\mathcal{G}, \eta_{\mathcal{C}})$ is the weighted adjacency matrix of the graph $\mathcal{G}$ corresponding to the channel $\mathcal{C}$, which satisfies Whenever there is no confusion, we will simply denote $C_{\mathrm{PRW}}(\eta_{\mathcal{C}})$ as $C$, and $A(\mathcal{G}, \eta_{\mathcal{C}})$ as $A$.
It can be observed from (50) that the channel $\mathcal{C}$ may be interpreted as a random walk over the graph $\mathcal{G}$, which maps an input state $|\psi\rangle\langle\psi|$ to $P_i|\psi\rangle\langle\psi|P_i^{\dagger}$ with probability $\eta_i$. The goal of the quasi-probability representation method is to find another operator that reverses the random walk process. Specifically, (25) can be simplified as follows where $\alpha = [1\ \mathbf{0}^T_{4^n-1}]^T$, and $\tilde{\mu}_{\mathcal{C}}$ is obtained by extracting the $4^n$ entries corresponding to Pauli operators from $\mu_{\mathcal{C}}$ in (25).
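One way of reading the random-walk picture is sketched below. It assumes that entry (i, j) of the PRW matrix equals the probability of the Pauli operator P_l satisfying P_i P_j = P_l up to a phase, and that reversing the walk amounts to solving the linear system C μ = α. With the ordering I, X, Y, Z on every qubit, the index of the product is the bitwise XOR of the two indices. These conventions and the function names are assumptions made for illustration only.

```python
import numpy as np


def prw_matrix(eta):
    """Matrix with entries eta[i XOR j], assuming the ordering I, X, Y, Z per qubit."""
    eta = np.asarray(eta, dtype=float)
    idx = np.arange(eta.size)
    return eta[np.bitwise_xor.outer(idx, idx)]


def inverse_quasi_probabilities(eta):
    """Solve C mu = alpha, where alpha selects the identity operator."""
    alpha = np.zeros(len(eta))
    alpha[0] = 1.0
    return np.linalg.solve(prw_matrix(eta), alpha)


# A channel with a single type of error (Pauli-X with probability eps).
eps = 0.05
mu = inverse_quasi_probabilities([1 - eps, eps, 0.0, 0.0])
l1_norm = np.sum(np.abs(mu))
```

For this single-error example the solution has a positive weight on the identity and a negative weight on the Pauli-X operator, and the l1-norm equals 1/(1 − 2ε).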
With the aid of PRW representation, we are now ready to present the following SOF upper bound for Pauli channels.
Proposition 3 (SOF upper bound): For an $n$-qubit Pauli channel $\mathcal{C}$, given a GGEP $\epsilon$, the SOF can be upper bounded as
$$\gamma_{\mathcal{C}} \le \frac{4\epsilon(1-\epsilon)}{(1-2\epsilon)^2}. \quad (52)$$
The equality is attained when there is only a single type of error, namely there is only one non-zero entry in $\eta_{\mathcal{C}}$.
Proof: Please refer to Appendix IV.
Note that Pauli channels having only a single type of error correspond to the highest hashing bound, when mitigated by QECCs. Therefore, by considering both Corollary 1 and Proposition 3, one may intuitively conjecture that for Pauli channels having the same GGEP, the SOF increases as the hashing bound increases. The hashing bound of a Pauli channel $\mathcal{C}$ can be expressed as [49]-[51] where $R_{\mathrm{hashing}}$ is the highest affordable coding rate capable of satisfying the hashing bound, and $H(\eta_{\mathcal{C}})$ denotes the entropy of $\eta_{\mathcal{C}}$ viewed as a probability distribution. Mathematically, the entropy $H(\eta_{\mathcal{C}})$ is a Schur-concave function [52, Sec. 2.1] with respect to the probability distribution $\eta_{\mathcal{C}}$. To elaborate, a Schur-concave function $f(x)$ is characterized by $f(Qx) \ge f(x)$ for any doubly stochastic matrix $Q$. This implies that doubly stochastic transformations on $\eta_{\mathcal{C}}$ would lead to an increase of entropy [53]. The term $R_{\mathrm{hashing}}$ in (53) can be seen to have the exactly opposite property, termed Schur-convexity [52, Sec. 2.1], hence the aforementioned conjecture can be formulated as follows: "the SOF is a Schur-convex function with respect to the probability vector $\eta_{\mathcal{C}}$". Next we show that the conjecture is correct, when the channels under consideration are memoryless channels.
Proposition 4: For any $n$-qubit memoryless Pauli channel $\mathcal{C} = \bigotimes_{i=1}^{n} \mathcal{C}_i$, given a fixed GGEP $\epsilon$, the SOF is a Schur-convex function of $\eta_{\mathcal{C}}$, meaning that holds for all doubly-stochastic matrices $Q_i$ preserving the GGEP, where $\mu(x)$ denotes the quasi-probability representation vector of the Pauli channel having the probability vector of $x$.
Proof: Please refer to Appendix V.

D. SOF reduction using quantum channel precoders: Practical considerations
Our previous analysis indicates that depolarizing channels are the most preferable channels in terms of having the lowest SOF. This implies that Clifford twirling [43], [54], [55], a technique that turns an arbitrary channel into a depolarizing channel while preserving the original average fidelity, might be a quantum channel precoder enabling effective SOF reduction. Specifically, given a quantum channel $\mathcal{C}$ over $n$ qubits, the Clifford twirling $\mathcal{T}_{\mathrm{C}}$ transforms the channel such that the output state satisfies where the summation is carried out over the Clifford group on $n$ qubits. Conceptually, the Clifford twirling over two-qubit channels can be implemented as demonstrated in Fig. 10a, where the gates comprising the circuits $U$ and $U^{\dagger}$ are chosen according to a uniform distribution over the set of Clifford gates.
In practice, however, the gates used for implementing Clifford twirling might be imperfect themselves. In light of this, a real-world Clifford twirling would in general impose an average fidelity reduction, and thus lead to additional SOF. For certain channels, the theoretical SOF reduction of Clifford twirling may be outweighed by this additional overhead. A representative example is constituted by the family of Pauli channels, whose SOF is rather close to that of depolarizing channels, according to Proposition 3.
The observation that the Pauli channels have similar SOFs implies that Pauli twirling $\mathcal{T}_{\mathrm{P}}$ might be a more practical quantum channel precoder, which turns an arbitrary channel into a Pauli channel in the following manner [42] The implementation of Pauli twirling is portrayed in Fig. 10b, where the gates $A$ and $B$ are chosen according to a uniform distribution on the set of Pauli gates. In state-of-the-art quantum computers, two-qubit gates, as used in the Clifford twirling shown in Fig. 10a, would result in much more error than single-qubit gates (typically by a factor of 10 or even higher [56]), hence Pauli twirling using single-qubit gates may introduce much lower additional SOF than Clifford twirling.
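The effect of ideal Pauli twirling can be sketched in the Pauli transfer matrix picture. The example below assumes noise-free twirling gates, a single qubit, and the ordering I, X, Y, Z. In this picture, conjugating the channel by a Pauli gate multiplies its transfer matrix from both sides by the diagonal sign matrix of that gate, and averaging over the four gates removes every off-diagonal entry. The function name `pauli_twirl` is illustrative.

```python
import numpy as np

# Row i holds the diagonal of the Pauli transfer matrix of the i-th Pauli gate.
PAULI_SIGNS = np.array([[1, 1, 1, 1],
                        [1, 1, -1, -1],
                        [1, -1, 1, -1],
                        [1, -1, -1, 1]], dtype=float)


def pauli_twirl(ptm):
    """Average D_i @ ptm @ D_i over the four single-qubit Pauli gates."""
    twirled = np.zeros_like(ptm, dtype=float)
    for signs in PAULI_SIGNS:
        d = np.diag(signs)
        twirled += d @ ptm @ d
    return twirled / len(PAULI_SIGNS)


# Amplitude damping with decay probability p as an example of a non-Pauli channel.
p = 0.1
amplitude_damping = np.array([[1, 0, 0, 0],
                              [0, np.sqrt(1 - p), 0, 0],
                              [0, 0, np.sqrt(1 - p), 0],
                              [p, 0, 0, 1 - p]])
twirled = pauli_twirl(amplitude_damping)
```

The twirled matrix is diagonal, so the resulting channel is a Pauli channel whose diagonal entries coincide with those of the original channel.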
In practice, we cannot directly implement twirling at both sides of the channel. Instead, we have to twirl simultaneously both the perfect gate and the channel. Therefore, the techniques should be slightly modified in order to effectively apply the twirling to the channel. Specifically, if we wish to apply Pauli twirling to a channel $\mathcal{C}$ associated with an imperfect gate $\tilde{G} = \mathcal{C} \circ G$, where $G$ denotes the perfect gate, we may apply the following modified twirling to $\tilde{G}$ A similar procedure can also be applied to Clifford twirling. We note that the operation $(G^{\dagger} S_i G)\rho(G^{\dagger} S_i G)^{\dagger}$ can be simplified to $S_m \rho S_m^{\dagger}$ for some $m$, if the perfect gate $G$ is a Clifford gate, since the Clifford group $\mathcal{C}_n$ is the normalizer of the Pauli group $\mathcal{P}_n$ satisfying $\mathcal{C}_n = \{U \in \mathcal{U}(2^n) \mid U \mathcal{P}_n U^{\dagger} = \mathcal{P}_n\}$.

V. SAMPLING OVERHEAD FACTOR ANALYSIS FOR CODED QUANTUM GATES
In this section, we investigate the SOF of gates protected by quantum channel codes, including QECCs and QEDCs. These codes are designed to convert the original channel corresponding to the unprotected gate into a reduced-error-rate channel over more qubits, with the objective of having lower GGEP. Under the framework of QEM, they can also be viewed as channel precoders. Naturally, it is of great interest to us whether quantum codes and QEM can benefit each other when amalgamated.
Specifically, we consider the scenario where every set of $k$ logical qubits is protected using $n$ physical qubits. Using the terminology of quantum coding, this means that we consider $[[n, k, d]]$ codes, where $d$ is the minimum distance of the code [7]. Furthermore, if not otherwise stated, we assume that Clifford gates are implemented using the transversal gate scheme of [44], while non-Clifford gates are implemented via the magic state distillation process of [57]. These are conventional assumptions in the quantum fault-tolerant computing literature [28, Sec. 10.6]. Furthermore, for the conciseness of discussion, we assume that the quantum channels encountered in this section are all Pauli channels, or have been turned into Pauli channels by means of Pauli twirling.

A. Amalgamating quantum codes with QEM: A toy example
To elaborate further on how QEM may be amalgamated with quantum codes, we consider the simple example of protecting a single Hadamard gate. As shown in Fig. 9a, an uncoded imperfect Hadamard gate can be decomposed into a perfect Hadamard gate $H$ and a quantum channel $\mathcal{C}_H$. Given that the channel $\mathcal{C}_H$ is known, we can apply the QEM circuit $\mathcal{M}_H$ to invert it. By contrast, in the coded scheme, the logical qubit is protected using an encoder $V$ exploiting $n$ physical qubits at the input of the circuit, as portrayed in Fig. 9b. In the code space, the original input state $|\psi\rangle$ is expressed as the coded state $|\tilde{\psi}\rangle$, while the coded Hadamard gate may be decomposed into an equivalent perfect Hadamard gate $\tilde{H}$ and another quantum channel $\tilde{\mathcal{C}}_H$. Consequently, the QEM circuit $\tilde{\mathcal{M}}_H$ has to be designed for the transformed channel $\tilde{\mathcal{C}}_H$. More specifically, the Hadamard gate protected using the transversal gate configuration is depicted in more detail in Fig. 9c. The equivalent Hadamard gate is implemented simply by $n$ transversal Hadamard gates. As a result, each physical qubit experiences the same channel $\mathcal{C}_H$. Right after the transversal gates, with the help of $m$ ancillae, the integrity of the output state is examined by the stabilizer check $S$. The subsequent recovery circuit $R$ is capable of correcting a fixed number of Pauli errors, depending on the minimum distance of the code. For example, if Steane's code is applied, $R$ can correct any single Pauli error that appeared within the circuit. The transversal gates along with the stabilizer check and the recovery circuit constitute the transformed channel $\tilde{\mathcal{C}}_H$.
Ideally, since $S$ and $R$ are able to correct errors, the transformed channel $\tilde{\mathcal{C}}_H$ might have a lower GGEP than the original channel $\mathcal{C}_H$. However, this might not be true in practice, because $S$ and $R$ themselves are also prone to errors. Intuitively, assuming that the GGEP of each gate in the circuit is at most $\epsilon$, as $\epsilon$ tends to zero, the GGEP of $\tilde{\mathcal{C}}_H$ is at most on the order of $O(\epsilon^2)$, since all single errors are corrected. Therefore, quantum codes are capable of reducing the channel GGEP, provided that $\epsilon$ is sufficiently small. Specifically, the value of the physical gate GGEP $\epsilon_{\mathrm{th}}$, below which quantum codes become beneficial, is referred to as the fault-tolerance threshold [58].
In general, given a quantum code, the logical gate GGEP would be higher than the physical gate GGEP, when the physical gate GGEP is relatively high. As the physical gate GGEP decreases, the logical gate GGEP eventually falls below it, as sketched in Fig. 11. The intersection point of the two curves is the fault-tolerance threshold of the quantum code. We will refer to the region where the quantum code is beneficial, namely where the logical gate GGEP is lower than the physical gate GGEP, as the error-resilient region. By contrast, the opposite region will be referred to as the error-proliferation region.
In the following subsections, we will first analyse the SOF of coded gates when the code is operating in its error-proliferation region, followed by the opposite scenario.

B. Quantum codes operating in their error-proliferating regions
Using our previous results on uncoded gates in Section IV-C, it may be readily shown that quantum codes operating in their error-proliferating regions may not lead to substantial SOF reduction. Formally, we have the following result.
Corollary 2: For an uncoded gate having a GGEP of $\epsilon$ and SOF $\gamma$, the SOF is lower-bounded by $\gamma \cdot \frac{(1-2\epsilon)^2}{(1-\epsilon)^3}$, when the gate is protected by some quantum code operating in its error-proliferating region. Furthermore, provided that the channel corresponding to the uncoded gate is a depolarizing channel, the lower bound can be further refined to $\gamma \cdot \frac{(3-4\epsilon)^2}{3(1-\epsilon)^2(3-\epsilon)}$.
Proof: This is a direct corollary following from Corollary 1 and Proposition 3. To elaborate a little further, considering the extreme case where the threshold is met exactly so that the output GGEP of the quantum code is equal to $\epsilon$, the generic lower bound is obtained in the form of
$$\frac{4\epsilon}{(1-\epsilon)^2} = \frac{(1-2\epsilon)^2}{(1-\epsilon)^3} \cdot \frac{4\epsilon(1-\epsilon)}{(1-2\epsilon)^2} \ge \gamma \cdot \frac{(1-2\epsilon)^2}{(1-\epsilon)^3}$$
using (44) and (52). The lower bound valid for depolarizing channels is obtained as
$$\frac{4\epsilon}{(1-\epsilon)^2} = \frac{(3-4\epsilon)^2}{3(1-\epsilon)^2(3-\epsilon)} \cdot \frac{12\epsilon(3-\epsilon)}{(3-4\epsilon)^2} \ge \gamma \cdot \frac{(3-4\epsilon)^2}{3(1-\epsilon)^2(3-\epsilon)}$$
using (44) and (45b), and by further exploiting the fact that single-qubit depolarizing channels have the highest SOF among all depolarizing channels sharing the same GGEP.
To demonstrate the implications of Corollary 2 more explicitly, we plot the lower bounds in Fig. 12. Since the fault-tolerance thresholds of most QECCs are as low as $10^{-3}$ to $10^{-2}$, it can be observed from Fig. 12 that, even when the GGEP meets the threshold exactly, the quantum codes can only offer an overhead reduction of about 1% at most. Therefore, amalgamating QEM with codes operating in their error-proliferating regions may not be mutually beneficial.
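The two factors of Corollary 2 are simple enough to evaluate directly. The sketch below reads them as (1 − 2ε)²/(1 − ε)³ and (3 − 4ε)²/[3(1 − ε)²(3 − ε)]; the function names are illustrative.

```python
def generic_factor(eps):
    """Lower-bound factor of Corollary 2 for a generic uncoded channel."""
    return (1 - 2 * eps) ** 2 / (1 - eps) ** 3


def depolarizing_factor(eps):
    """Refined lower-bound factor of Corollary 2 for depolarizing channels."""
    return (3 - 4 * eps) ** 2 / (3 * (1 - eps) ** 2 * (3 - eps))


# Largest possible relative SOF reduction at typical threshold values.
reductions = {eps: 1 - generic_factor(eps) for eps in (1e-3, 1e-2)}
```

At ε = 10⁻² the generic factor stays within roughly 1% of unity, and the depolarizing factor is closer to unity still.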

C. QECCs operating in their error-resilient regions
In light of our previous discussions, it becomes plausible that QECCs operating in their error-resilient regions may contribute to SOF reduction by reducing the gate error probability. As stated in [28, Sec. 10.6.1] for example, their error-correcting capability can be further improved via concatenation. However, the price of concatenating codes is a drastic increase in the qubit overhead. It is thus interesting to investigate whether the amalgamation of QECC and QEM would outperform pure concatenated QECCs, and if so, in what scenarios.
To elaborate, we consider the simple example of transversal Hadamard gates protected by a rate-1/3 repetition code, as portrayed in Fig. 13 and detailed in [59]. By concatenating the repetition code twice, the number of physical qubits protecting a single logical qubit becomes three times that of the non-concatenated code. By contrast, if we amalgamate the rate-1/3 code with QEM, the additional qubits can be used to parallelize the computation, leading to a computational acceleration by a factor of three. In this sense, the QECC-QEM scheme outperforms the concatenated scheme, when the SOF of QEM obeys $\gamma_{\mathcal{C}} \le 2$.
As the number of gates in a circuit increases, the QEM sampling overhead grows exponentially (as indicated by (31)), while the qubit overhead of the concatenated scheme remains constant. Therefore, there would be a critical point where the overall QEM sampling overhead escalation starts to outweigh the parallelization speedup benefits. This may be interpreted as a limitation imposed on the circuit size, beyond which full fault-tolerance becomes necessary. Next we provide an estimate of the critical point, given that the gate error probability is sufficiently low.
Proposition 5 (Lower bound on the critical point): Consider a quantum circuit in which each gate has a GGEP of at most $\epsilon$. If the gates are protected using an $l$-stage concatenated (i.e., concatenated $l$ times) $[[n, k, d]]$ QECC operating in its error-resilient region via the transversal gate configuration, amalgamating the code with QEM is preferable to applying the $(l+1)$-stage concatenated code, when the number of gates $N_l$ satisfies when $\epsilon \ll 1$, and $f(\epsilon)$ is the output GGEP of the single-stage $[[n, k, d]]$ code given the input GGEP $\epsilon$, and $f^{(l)}(\epsilon)$ denotes the $l$-times self-composition of the function $f(\epsilon)$, as exemplified by $f^{(2)}(\epsilon) = f(f(\epsilon))$.
Proof: For a circuit in which every gate is protected using $l$-stage concatenated $[[n, k, d]]$ QECCs, the total computational overhead (including both the QEM overhead and qubit overhead) $\tilde{\gamma}_l$ can be bounded as where $\gamma_l$ denotes the highest SOF of a single logical gate in the circuit, and $N$ is the total number of gates. Hence the critical point $N_l$ between $l$-stage concatenation and $(l+1)$-stage concatenation satisfies According to Corollary 1 and Proposition 3, when $\epsilon \ll 1$, the upper and lower bounds for the SOF of a single gate tend to be equal. Thus the SOF of each uncoded gate can be approximately upper bounded as $\gamma \lesssim 4\epsilon$. For $l$-stage concatenated codes, we have $\gamma_l \lesssim 4 f^{(l)}(\epsilon)$. Therefore we obtain Additionally, using the Maclaurin approximation of $\ln(1 + x) \approx x$ when $x > 0$ is sufficiently small, we obtain (59b). Hence the proof is completed.
Remark 3: As a special case, the pure QEM (i.e., $l = 0$) is preferable to amalgamating a single-stage $[[n, k, d]]$ code with it, when the number of gates satisfies Note that when $(l+1)$-stage QECC concatenation cannot be implemented due to the associated physical limitations (e.g. total number of physical qubits), the amalgam of $l$-stage QECC and QEM may be applied even beyond the critical point.
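The reasoning of the proof can be turned into a rough numerical estimate. The sketch below is not the expression of Proposition 5 itself, which is not legible in this section. It assumes instead that the total overhead of the l-stage scheme scales as n^l (1 + γ_l)^N, that γ_l ≈ 4 f^(l)(ε), and that ln(1 + x) ≈ x, so that the critical gate count becomes ln(n) / (4 [f^(l)(ε) − f^(l+1)(ε)]). The logical error model f(ε) = ε²/ε_th is likewise a hypothetical stand-in, and all names are illustrative.

```python
import math


def compose(f, times, x):
    """Apply f to x the given number of times (zero times returns x)."""
    for _ in range(times):
        x = f(x)
    return x


def critical_gate_count(eps, n, level, f):
    """Estimated circuit size at which (level + 1)-stage concatenation overtakes
    the level-stage code combined with QEM, under the assumptions stated above."""
    gap = compose(f, level, eps) - compose(f, level + 1, eps)
    return math.log(n) / (4 * gap)


# Hypothetical logical error model with an assumed threshold of 1.5e-3.
EPS_TH = 1.5e-3


def logical_error(eps):
    return eps ** 2 / EPS_TH


n0 = critical_gate_count(1e-4, n=7, level=0, f=logical_error)
n1 = critical_gate_count(1e-4, n=7, level=1, f=logical_error)
```

With these assumed inputs the first critical point is of the order of a few thousand gates, and each additional concatenation stage pushes the next critical point out by more than an order of magnitude.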

D. QEDCs operating in their error-resilient regions
Due to their smaller minimum distance than that of QECCs, QEDCs are not capable of correcting any error. Nonetheless, they can be used as important building blocks in the scheme of post-selection fault-tolerance [60], [61]. To expound a little further, post-selection fault-tolerance differs from its conventional counterpart in that it is implemented by detecting potential errors, and only accepting the results if no error is detected. Typically, QEDCs have a shorter codeword length compared to QECCs, hence they often also possess a higher threshold. For instance, the $[[4, 2, 2]]$ QEDC can detect an error at the price of protecting a logical gate using four physical gates, while Steane's $[[7, 1, 3]]$ QECC requires seven gates. This makes post-selection fault-tolerance a preferable scheme, when the gates are relatively noisy [60]-[62].
In the context of QEM, the high threshold of QEDCs appears to make their amalgamation with QEM more beneficial. However, a subtle issue is that QEDCs suffer from their own sampling overhead. To elaborate, if for every gate the probability of successful error detection is $p$, similar to that of QEM, the SOF of the QEDC may be defined as which will be referred to as the QEDC-SOF in the following discussions. Additionally, performing QEM on the post-selected channel (where single errors are eliminated) also incurs an SOF, which in this context will be referred to as the QEM-SOF. Thus, the total SOF can be calculated as follows: where $\gamma_{\mathrm{QEDC}}$ and $\gamma_{\mathrm{QEM}}$ represent the QEDC-SOF and the QEM-SOF, respectively. In this regard, the amalgamation of QEDC and QEM is only beneficial when the total SOF is lower than that of QEM applied directly to uncoded gates.
To compute the QEDC-SOF of each logical gate for a specific QEDC, it suffices to compute the probability that a single Pauli error occurs in the entire physical circuit corresponding to the logical gate. Let us assume that every single-qubit gate incurs a Pauli channel having a GGEP of $\epsilon_1$, and every two-qubit gate incurs a Pauli channel having a GGEP of $\epsilon_2$. For any logical gate implemented using the transversal gate configuration, it is clear that the occurrence probability of a single Pauli error in single-qubit and two-qubit gates, namely $p_1$ and $p_2$, can be expressed as $p_1 = n\epsilon_1 + O(\epsilon_1^2)$ and $p_2 = n\epsilon_2 + O(\epsilon_2^2)$ for any $[[n, k, d]]$ QEDC, since every logical gate is implemented using $n$ physical gates. The terms having the orders of $O(\epsilon_1^2)$ and $O(\epsilon_2^2)$ are negligible when the GEPs are sufficiently small. According to (64), we also have the following result for the corresponding QEDC-SOF: $\gamma_1 \approx n\epsilon_1$ and $\gamma_2 \approx n\epsilon_2$. Among QEDCs capable of detecting a single arbitrary Pauli error, the one having the lowest $n$ is the $[[4, 2, 2]]$ code. In this case, we have $\gamma_1 \approx 4\epsilon_1$ and $\gamma_2 \approx 4\epsilon_2$. The actual QEDC-SOF would be even higher due to the inevitable imperfections in the stabilizer measurements. On the other hand, for small $\epsilon_1$ and $\epsilon_2$, we can see from Corollary 1 and Proposition 3 that when no QEDC is applied, the QEM-SOF is also approximately four times the GGEP. Therefore, we have the following remark.
Remark 4: Amalgamating QEDCs and QEM might not be beneficial in the sense of SOF reduction, given that the logical gates are implemented using the transversal gate configuration.
This result may not be applicable to some logical gates, namely to those that are not implemented transversally. For example, some two-qubit gates processing the two logical qubits within a single block of the $[[4, 2, 2]]$ QEDC may be implemented using simple physical gates. As illustrated in Fig. 14, a logical controlled-Z gate can be implemented using six single-qubit gates $S \otimes (Z \circ S) \otimes (Z \circ S) \otimes S$. Since two-qubit gates typically have much higher GGEP compared to single-qubit gates, the QEDC-SOF may be even lower than the GGEP of a single controlled-Z gate. Similarly, the logical gate $\mathrm{SWAP} \circ H^{\otimes 2}$ can be implemented by simply using four physical Hadamard gates, hence it also has a low QEDC-SOF. Here, the operator SWAP refers to the SWAP gate exchanging a pair of qubits [28, Sec. 1.3.4].
Unfortunately, non-transversal logical gates can only be designed in a case-by-case manner. Moreover, not all of them admit implementations as simple as those shown in Fig. 14. For example, a CNOT gate between the two qubits in a $[[4, 2, 2]]$ code block has to be implemented via a SWAP gate, which has a high QEDC-SOF. By contrast, the transversal gate configuration is a general design paradigm that can be applied to all logical Clifford gates [44]. In this regard, we may draw the conclusion that QEDC-QEM is only beneficial for certain specific non-transversal gate designs.

VI. NUMERICAL RESULTS
In this section, we augment our discussions throughout previous sections by numerical results. Throughout this section, for single-qubit channels, the basis matrix $B$ used for QEM is constituted by the conventional set of quantum operations listed in Table I [25]. The geometric interpretation of these operations is portrayed in Fig. 15. The basis operators of QEM for two-qubit channels are constituted by the tensor product of these operators. The operators $R_x$, $R_y$, and $R_z$ represent $\pi/2$ rotations around the x-, y-, and z-axes of the Bloch sphere, respectively, while $R_{yz}$, $R_{xz}$, and $R_{xy}$ represent $\pi$ rotations around the axes determined by the equations $\{y = z,\ x = 0\}$, $\{x = z,\ y = 0\}$, and $\{x = y,\ z = 0\}$, respectively. Similarly, the operators $\pi_x$, $\pi_y$, $\pi_z$, $\pi_{yz}$, $\pi_{xz}$, and $\pi_{xy}$ represent the measurement operations on the corresponding axes.

A. Uncoded gates
We first characterize Proposition 1 and Proposition 2 via numerical examples. In Fig. 16, the SOF vs. the GGEP is plotted for both single-qubit and two-qubit gates inflicted by coherent errors, amplitude damping and depolarizing channels, as detailed below. Here, the two-qubit channels are restricted to product channels, namely those constructed by the tensor product of two single-qubit channels. Specifically, a single-qubit amplitude damping channel $\mathcal{C}_{\mathrm{damp}}$ is characterized by [28, Sec. 8.3.5]
$$\mathcal{C}_{\mathrm{damp}}(\rho) = E_0 \rho E_0^{\dagger} + E_1 \rho E_1^{\dagger},$$
where the operation elements are given by [28, Sec. 8.3.5]
$$E_0 = \begin{bmatrix} 1 & 0 \\ 0 & \sqrt{1-\delta} \end{bmatrix}, \quad E_1 = \begin{bmatrix} 0 & \sqrt{\delta} \\ 0 & 0 \end{bmatrix},$$
and the parameter $\delta$ is the amplitude damping probability of the channel, namely the probability that the channel turns an excited state $|1\rangle$ into the ground state $|0\rangle$. Notably, the amplitude damping channel is a non-Pauli triangular channel. The single-qubit coherent channel we consider here is the over-rotation channel, which takes the form given in [25].
Table I (columns: Operator, Output state).
The parameter $\varphi$ controls the over-rotation angle of the channel, and $\imath = \sqrt{-1}$ denotes the imaginary unit. Observe from Fig. 16 that the amplitude damping channels affecting both a single and two qubits have higher SOFs than depolarizing channels. This corroborates Proposition 1 and Proposition 2, which imply that depolarizing channels have the lowest SOF among all triangular channels. The over-rotation channels are not triangular channels, yet they exhibit the highest SOF. In general, their SOFs would depend on the specific set of basis operators comprised by the matrix $B$. In fact, coherent channels are represented by unitary transformations. In light of this, in the ideal case that "unitary rotation" gates can be implemented without decoherence, they can be compensated in an overhead-free manner by simply applying their Hermitian conjugates.
In Fig. 16, we can also see the effect of quantum channel precoders, especially that of Pauli twirling. To elaborate, observe in both Fig. 16a and 16b that the Pauli-twirled versions of both the coherent error and the amplitude damping channels have almost the same SOF as depolarizing channels, provided that the gates used in the implementation of Pauli twirling are free from decoherence. By contrast, when Pauli twirling is implemented using realistic imperfect gates, the twirled channels have higher SOF, which is still lower than that of the amplitude damping channels. Another noteworthy fact is that imperfect Pauli twirling of two-qubit gates incurs relatively low overheads, compared to single-qubit gates. This is because two-qubit gates are more prone to decoherence than their single-qubit counterparts. Specifically, in this example we assume that every single-qubit gate (resp. two-qubit gate) has the same GGEP, and follow the convention that two-qubit gates have 10 times higher GGEP than single-qubit gates [56]. In light of these results, Pauli twirling may be a preferable quantum channel precoder, especially for two-qubit gates.
Next we illustrate the bounds of the SOF of Pauli channels presented in Section IV-C. As portrayed in Fig. 17, given a fixed GEP $\epsilon$, all points representing the SOFs of randomly produced Pauli channels fall between the upper bound (52) and the lower bound (44). Moreover, it can be seen that when the GEP is less than $5 \times 10^{-3}$, the upper and lower bounds are nearly identical, and a linear approximation of the SOF (i.e., $4\epsilon$) becomes rather accurate.
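An experiment of this kind can be sketched for single-qubit Pauli channels as follows. The sketch assumes that the SOF equals the squared l1-norm of the quasi-probability vector minus one, and takes the lower and upper bounds to be 4ε/(1 − ε)² and 4ε(1 − ε)/(1 − 2ε)², respectively. The sampling scheme (a Dirichlet split of ε among the three error types) and the names are illustrative choices, not those used for producing Fig. 17.

```python
import numpy as np

H1 = np.array([[1, 1, 1, 1],
               [1, 1, -1, -1],
               [1, -1, 1, -1],
               [1, -1, -1, 1]], dtype=float)


def pauli_sof(eta):
    """SOF of a single-qubit Pauli channel, assuming SOF = ||mu||_1^2 - 1."""
    d = H1 @ eta              # main diagonal of the Pauli transfer matrix
    mu = H1 @ (1.0 / d) / 4   # quasi-probabilities of the inverse channel
    return np.sum(np.abs(mu)) ** 2 - 1.0


rng = np.random.default_rng(0)
eps = 1e-3
# Randomly split the total error probability eps among the X, Y and Z errors.
shares = rng.dirichlet(np.ones(3), size=1000)
sofs = np.array([pauli_sof(np.concatenate(([1 - eps], eps * s))) for s in shares])

lower = 4 * eps / (1 - eps) ** 2
upper = 4 * eps * (1 - eps) / (1 - 2 * eps) ** 2
# A small tolerance absorbs floating-point rounding at the bounds.
within_bounds = bool(np.all((sofs >= lower - 1e-12) & (sofs <= upper + 1e-12)))
```

At this error level the gap between `lower` and `upper` is far smaller than either value, which is why the linear approximation 4ε works well.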

B. Transversal gates protected by QECC
In this subsection, we investigate the SOF when QEM is applied to transversal logical gates protected by QECCs operating in their error-resilient regions. For the simplicity of presentation, we assume that all physical gates are subjected to the deleterious effect of depolarizing channels. Furthermore, we also assume that they can be decomposed into the tensor products of single-qubit depolarizing channels. Finally, we assume that all single-qubit gates have an identical GEP, and so do two-qubit gates.
As presented in Section V-C, QEM and QECCs operating in their error-resilient regions can be beneficially amalgamated to reduce the SOF. Here we consider the amalgam of QEM and an $l$-stage concatenated Steane $[[7, 1, 3]]$ code. According to Proposition 5, for every QECC operating in its error-resilient region, there would be several critical circuit sizes. To elaborate, if the number of gates in a quantum circuit is below the $(l+1)$-th critical point, amalgamating QEM with the $l$-stage concatenated code will be more beneficial than relying on the $(l+1)$-stage concatenated code, and vice versa.
In Fig. 18, we compare the performance of three QECC-QEM schemes for quantum circuits containing various numbers of logical gates. In Fig. 18a, we demonstrate the aforementioned critical points of circuit size. Fig. 18b indicates which scheme is preferable among the three candidates. As portrayed in the figures, for the case where the GEP of physical gates equals $10^{-4}$, pure QEM is preferable when the circuit contains fewer than about $6 \times 10^3$ gates. By contrast, the single-stage QECC-QEM combination may be a good choice for circuits containing between $6 \times 10^3$ and $8 \times 10^4$ gates. An interesting issue is that pure QEM becomes the most preferable option when the GEP is higher than about $10^{-3}$, which is somewhat counter-intuitive. This may be attributed to the fact that the fault-tolerance threshold of the $[[7, 1, 3]]$ code under the assumptions used in this treatise is around $1.5 \times 10^{-3}$.
When the GEP of physical gates is close to their threshold, the error-correction capability of QECCs is not fully exploited.
To elaborate a little further, the term $[f^{(l)}(\epsilon) - f^{(l+1)}(\epsilon)]$ in (59b) would typically be a non-monotonic function of $\epsilon$, with its maximum located close to the threshold. Hence, the critical circuit size increases as the GEP of physical gates decreases, provided that the GEP is sufficiently low.

C. Gates protected by QEDC
In this subsection, we consider a combined QEDC-QEM scheme, for which we make the same assumptions concerning the quantum gates as those stated in Section VI-B. Additionally, we assume that the GEP of two-qubit gates is 10 times as high as that of single-qubit gates.
When logical gates are implemented transversally, according to the discussion in Section V-D, the total SOF of the QEDC-QEM scheme would typically be even higher than that of pure QEM. This is demonstrated in Fig. 19, where we consider the total SOF of a single transversal logical CNOT gate protected by the [[4, 2, 2]] QEDC. It can be seen that most of the overhead is attributed to the QEDC-SOF, which is much higher than the overhead of pure QEM. By contrast, the QEM overhead in the QEDC-QEM scheme is significantly lower than that of pure QEM, implying that the post-selection fault-tolerance threshold of the [[4, 2, 2]] code is higher than 0.01.
As suggested by Fig. 14, some non-transversal logical gates may even outperform transversal gates in terms of requiring a lower QEDC-QEM SOF. In particular, we consider the specific logical gate of $\mathrm{SWAP} \circ H^{\otimes 2}$ implemented in the manner illustrated in Fig. 14d. Since this implementation only involves single-qubit gates, the QEDC-SOF is significantly lower than that of the transversal implementation. Consequently, as portrayed in Fig. 20, the total QEDC-QEM overhead is lower than that of pure QEM. However, it is still not clear whether this fact can justify the practical value of the QEDC-QEM scheme, since designing a low-sampling-overhead non-transversal implementation of all Clifford gates would require substantial effort. This may be an interesting topic deserving further investigation.

VII. CONCLUSIONS
We have presented a comprehensive analysis of the SOF of QEM under various channel conditions. For uncoded gates affected by errors modelled by general CPTP channels, we have shown that Pauli channels have the lowest SOF among all triangular channels (which include the amplitude damping channels) having the same GGEP. Following this line of reasoning, we have furthermore shown that depolarizing channels have the lowest SOF in the family of all Pauli channels.
We have also conceived the QECC-QEM as well as the QEDC-QEM schemes, and have shown that there exist several critical quantum circuit sizes, beyond which sophisticated codes having more concatenation stages are preferable, and vice versa. Specifically, for QEDC-QEM, we have demonstrated that it may not be compatible with the popular transversal gate configuration, but it may still have beneficial applications when the logical gates are appropriately designed.

APPENDIX I PROOF OF PROPOSITION 1
Proof: To prove our claim, it suffices to show that Pauli twirling does not increase the SOF. Without loss of generality, we assume that the specific columns corresponding to Pauli operators in $B$ are the first $4^n$ columns. First we note that the quasi-probability representation corresponding to any CPTP channel having Pauli transfer matrix representation $C$ may be expressed as: while the quasi-probability representation corresponding to the Pauli-twirled channel is given by where the super-operator $\mathcal{T}_P$ represents the Pauli twirling operation. We can rewrite (71) in a matrix form as where $T_P$ denotes the matrix representation of the Pauli twirling operator. Recall that the Pauli-twirled channel is given by Upon introducing $B = [b_1 \ b_2 \ \dots \ b_{16^n}]$, the Pauli operator $P_i$ can be expressed in a matrix form as $P_i = b_i b_i^T$. Thus the Pauli twirling operator can be represented as with $I_P$ being the following matrix $I_P = \begin{bmatrix} I_{4^n} & 0_{4^n \times (16^n - 4^n)} \\ 0_{(16^n - 4^n) \times 4^n} & 0_{(16^n - 4^n) \times (16^n - 4^n)} \end{bmatrix}$.
Thus we have Since the Pauli transfer matrix $C$ is triangular, we have hence (72) can be simplified as Using (70) and (75), we can show that if the statement of holds for a certain matrix $T_L$, the proof can be completed by showing that $\|T_L\|_1 \le 1$, since we have: Next we construct the matrix $T_L$ explicitly. Let us consider the QR decomposition [63] of the matrix $B$ where $Q$ is an orthogonal matrix and $R$ is an upper triangular matrix. Substituting (74) and (76) into (75), we have Similarly, we can obtain Having compared (77) and (78), we may observe that Since Pauli operators are orthogonal to each other, the first $4^n$ columns in $B$ (i.e., $b_1$ through $b_{4^n}$) are also mutually orthogonal, meaning that when $i = 1, 2, \dots, 4^n$ and $j = 1, 2, \dots, 4^n$. Therefore we have Upon introducing $m_B = \max_{i=1,\dots,16^n} \|b_i\|_2$, from (79) we have where $w_i \in \mathbb{R}^{\min\{4^n, i\}}$, and Using (81) Since $B_i = \mathrm{vec}^{-1}\{b_i\}$ corresponds to a CPTnI operator, the complete positivity implies $[w_i]_j \ge 0$ for all $i$ and $j$, and the "trace-non-increasing" property implies $\sum_j [w_i]_j \le 1$. Therefore we have $\|w_i\|_1 \le 1$ for all $i$, and hence Note that $m_B^2 = \max_i \|B_i\|_F^2$. By considering the operator-sum decomposition of $B_i$, we see that where (83d) follows from (33).
Combining (82) and (83), we can see that $\|T_L\|_1 \le 1$, hence the proof is completed.
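The Pauli twirling operation that this proof relies on can be checked numerically for a single qubit. The sketch below is only an illustration and not part of the proof: it assumes the standard Kraus representation of the amplitude damping channel, builds its Pauli transfer matrix, and confirms that twirling retains only the main diagonal (the mdiag(·) operation of the notation list). All helper names are hypothetical.

```python
import math

# Single-qubit Pauli operators as 2x2 nested lists.
I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
PAULIS = [I2, X, Y, Z]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def amplitude_damping(gamma):
    """Channel rho -> K0 rho K0^dag + K1 rho K1^dag with decay probability gamma."""
    kraus = [[[1, 0], [0, math.sqrt(1 - gamma)]],
             [[0, math.sqrt(gamma)], [0, 0]]]

    def channel(rho):
        terms = [matmul(matmul(k, rho), dagger(k)) for k in kraus]
        return [[sum(t[i][j] for t in terms) for j in range(2)] for i in range(2)]
    return channel


def pauli_twirl(channel):
    """Average of P channel(P rho P) P over the four Paulis (P is Hermitian)."""
    def twirled(rho):
        terms = [matmul(matmul(p, channel(matmul(matmul(p, rho), p))), p)
                 for p in PAULIS]
        return [[sum(t[i][j] for t in terms) / 4 for j in range(2)]
                for i in range(2)]
    return twirled


def ptm(channel):
    """Pauli transfer matrix with entries R[i][j] = Tr[P_i channel(P_j)] / 2."""
    rows = []
    for p_i in PAULIS:
        row = []
        for p_j in PAULIS:
            m = matmul(p_i, channel(p_j))
            row.append(complex(m[0][0] + m[1][1]).real / 2)
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in ptm(amplitude_damping(0.19)):
        print(row)
    for row in ptm(pauli_twirl(amplitude_damping(0.19))):
        print(row)
```

The untwirled matrix is lower triangular, with the decay probability appearing in the bottom-left entry, while the twirled matrix is diagonal with the same diagonal entries.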

APPENDIX II PROOF OF PROPOSITION 2
To facilitate our further analysis, we denote the Hadamard transform matrix on the space of Pauli transfer matrix representation of an $n$-qubit system as $H_n \in \mathbb{R}^{4^n \times 4^n}$. The corresponding inverse Hadamard transform is denoted by $H_n^{-1} = \frac{1}{4^n} H_n$. We omit the subscript $n$ whenever there is no confusion. Given these notations, according to (42), the simplified quasi-probability representation vector for a Pauli channel $C$ can be expressed as Since the channel is CPTP, the vector $1/c$ satisfies $[1/c]_1 = 1$.
Next we show that $\|\tilde{\mu}_C\|_1 \ge \|\tilde{\mu}_L\|_1$. Note that the vector $\mu_C$ can be decomposed as where $r = 1/c - 1/l$. From the definition of $\tilde{\zeta}$ we see that $1^T r = 0$, hence $[\tilde{\mu}_C]_1 = [\tilde{\mu}_L]_1 \triangleq \mu_1$. In addition, we have since the channels are CPTP. Therefore we obtain The 1-norm of $\tilde{\mu}_L$ can be calculated explicitly as For $\tilde{\mu}_C$, we denote the sign of its $i$-th entry as $s_i$, thus Finally, since $\epsilon(L) \ge \epsilon(C)$, we may construct a depolarizing channel $L'$ characterized by $\epsilon(L') = \epsilon(C)$, while satisfying Hence the proof is completed.
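The statement of Proposition 2 can be checked numerically in the single-qubit case. The following sketch assumes that the quasi-probability vector of the inverse channel is obtained as H⁻¹(1/c), where c = Hη is the Hadamard transform of the Pauli probability vector η ordered as (I, X, Y, Z). This is our reading of the expression referred to as (42), and the function names are illustrative only.

```python
# Single-qubit Hadamard (character) matrix in the Pauli order (I, X, Y, Z):
# entry +1 if the two Paulis commute, -1 if they anticommute.
H1 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, 1, -1],
      [1, -1, -1, 1]]


def quasi_prob(eta):
    """Assumed quasi-probability vector H^{-1}(1/c), with c = H eta and H^{-1} = H/4."""
    c = [sum(H1[i][j] * eta[j] for j in range(4)) for i in range(4)]
    inv_c = [1.0 / c_i for c_i in c]
    return [sum(H1[i][j] * inv_c[j] for j in range(4)) / 4 for i in range(4)]


def one_norm(eta):
    """1-norm of the quasi-probability vector; its square minus one is taken as the SOF."""
    return sum(abs(m) for m in quasi_prob(eta))


if __name__ == "__main__":
    eps = 0.03  # GEP shared by both channels
    depolarizing = [1 - eps, eps / 3, eps / 3, eps / 3]
    dephasing = [1 - eps, 0.0, 0.0, eps]
    print("depolarizing:", one_norm(depolarizing))
    print("dephasing:   ", one_norm(dephasing))
```

For these two channels the 1-norms have the closed forms (3 + 2ε)/(3 − 4ε) and 1/(1 − 2ε), respectively, so the depolarizing channel yields the smaller value for the same GEP, in line with the proposition.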

APPENDIX III THE VALUES OF η C FOR BASIC PAULI CHANNELS
To elaborate further on the intuition about η C , the values of η C corresponding to some single-qubit Pauli channels are listed in Table II.

APPENDIX IV PROOF OF PROPOSITION 3
Since the graph $G$ is symmetric, each column (resp. row) of $C$ can be obtained by permuting the first column (resp. row) of $C$. Thus we have From (96) we can obtain where $\tilde{A}$ −1 A, and (98c) is obtained using the matrix inversion lemma. Exploiting the sub-multiplicativity of matrix $p$-norms [64, Chap. 5], we have where the equality follows from the fact that $1^T \tilde{A} = 1$ and that all entries in $\tilde{A}$ are non-negative. Substituting (99) into (98), we have Therefore, from (97) we obtain To show that channels having a single type of error achieve the equality, we note that holds for any Pauli operator $P_i$. In light of this, the inverse of their PRW matrix can be shown to satisfy Therefore we have which is identical to (100b). Hence the proof is completed.

APPENDIX V PROOF OF PROPOSITION 4
Proof: Let $\eta$ be the probability vector of a Pauli channel. We first show that the function f (η) = H −1 is Schur-convex with respect to $\eta$. We proceed by first decomposing $f(\eta)$ as where Since $h_2(\eta)$ is an affine function of $\eta$, we see that is element-wise convex with respect to $\eta$. Therefore, to show that $f(\eta)$ is Schur-convex, it suffices to show that $g(x)$ is Schur-convex and increasing for $x$ satisfying $x \succeq 1$.
Next we show the Schur-convexity of $g(x)$. Note that is obtained by removing the first row and the first column from $H_1$. Since doubly stochastic transformations do not affect the term $1^T(x - 1)$, the problem is reduced to showing the Schur-convexity of $\|H_1 x\|_1$ for $x \succeq 0$. To facilitate the analysis, we utilize = $1^T x$ and define $x = [x_1 \ x_2 \ x_3]^T$. Now we see that hence For fixed , 2x− 1 is convex with respect to $x$. In addition, it is also a symmetric function of $x$, meaning that its value is unchanged upon permutation of $x$. Therefore, $g(x)$ is Schur-convex.
To show that $g(x)$ is increasing, we calculate the gradient of $g(x)$ as where $\mathrm{sgn}(\cdot)$ is the sign function satisfying $\mathrm{sgn}(x) = 1$ for $x > 0$, $\mathrm{sgn}(x) = 0$ for $x = 0$ and $\mathrm{sgn}(x) = -1$ for $x < 0$. After some manipulation, one can verify that $H_1 \, \mathrm{sgn}(H_1 (x - 1)) \succeq -1$ according to (108), hence $g(x)$ is increasing. Given that the SOF of single-qubit Pauli channels is Schur-convex, we may generalize the result to $n$-qubit memoryless Pauli channels. To elaborate, the PRW matrix of an $n$-qubit memoryless Pauli channel can be expressed as where $C_i(\eta_{C_i})$ corresponds to the partial channel of the $i$-th qubit. Thus Since $\|C_i^{-1}(\eta_{C_i})\|_1$ is Schur-convex with respect to the corresponding probability vector $\eta_{C_i}$, we have for any doubly stochastic matrix $Q_i$. Therefore Hence the proof is completed.
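The Schur-convexity argument may be illustrated by a small numerical experiment. This is only a sanity check under stated assumptions, not a substitute for the proof: the doubly stochastic matrix is taken to be (1 − t)I + (t/3)·J acting on the three error probabilities only, so that the GEP is preserved; the single-qubit 1-norm is computed as the 1-norm of H⁻¹(1/(Hη)); and the 1-norm of a memoryless n-qubit channel is assumed to be the product of the per-qubit 1-norms. All names are hypothetical.

```python
# Single-qubit Hadamard (character) matrix in the Pauli order (I, X, Y, Z).
H1 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, 1, -1],
      [1, -1, -1, 1]]


def norm1(eta):
    """Assumed 1-norm of the inverse-channel quasi-probabilities of a Pauli channel."""
    c = [sum(H1[i][j] * eta[j] for j in range(4)) for i in range(4)]
    mu = [sum(H1[i][j] / c[j] for j in range(4)) / 4 for i in range(4)]
    return sum(abs(m) for m in mu)


def mix(eta, t):
    """Apply (1 - t) I + (t / 3) J to the error probabilities; eta[0] is untouched."""
    mean = sum(eta[1:]) / 3
    return [eta[0]] + [(1 - t) * p + t * mean for p in eta[1:]]


def memoryless_norm(etas):
    """Product of per-qubit 1-norms for a memoryless multi-qubit Pauli channel."""
    total = 1.0
    for eta in etas:
        total *= norm1(eta)
    return total


if __name__ == "__main__":
    eta = [0.9, 0.1, 0.0, 0.0]  # a single type of Pauli error
    for t in (0.0, 0.5, 1.0):
        print(t, norm1(mix(eta, t)))
    qubits = [[0.9, 0.1, 0.0, 0.0], [0.95, 0.0, 0.03, 0.02]]
    print(memoryless_norm(qubits),
          memoryless_norm([mix(e, 0.7) for e in qubits]))
```

In this example the 1-norm does not increase as the error probabilities are averaged, and it is smallest for the fully averaged (depolarizing) case t = 1. The same ordering carries over to the two-qubit product.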