Bounds on Instantaneous Nonlocal Quantum Computation

Instantaneous nonlocal quantum computation refers to a process in which spacelike separated parties simulate a nonlocal quantum operation on their joint systems through the consumption of pre-shared entanglement. To prevent a violation of causality, this simulation succeeds up to local errors that can only be corrected after the parties communicate classically with one another. However, this communication is non-interactive, and it involves just the broadcasting of local measurement outcomes. We refer to this operational paradigm as local operations and broadcast communication (LOBC) to distinguish it from the standard local operations and (interactive) classical communication (LOCC). In this paper, we show that an arbitrary two-qubit gate can be implemented by LOBC with $\epsilon$-error using $O(\log(1/\epsilon))$ entangled bits (ebits). This offers an exponential improvement over the best known two-qubit protocols, whose ebit costs behave as $O(1/\epsilon)$. We also consider the family of binary controlled gates on dimensions $d_A\otimes d_B$. We find that any Hermitian gate of this form can be implemented by LOBC using a single shared ebit. In sharp contrast, a lower bound of $\log d_B$ ebits is shown in the case of generic (i.e., non-Hermitian) gates from this family, even when $d_A=2$. This demonstrates an unbounded gap between the entanglement costs of LOCC and LOBC gate implementation.
Whereas previous lower bounds on the entanglement cost for instantaneous nonlocal computation restrict the minimum dimension of the needed entanglement, we bound its entanglement entropy. To our knowledge this is the first lower bound of its kind.


I. INTRODUCTION
Distributed quantum computing on a multipartite system can arise in many common scenarios. For example, individuals in two different countries communicating classically with each other might want to combine their computing power to solve a difficult problem together. This type of quantum computation has been studied extensively under the setting of local operations and classical communication (LOCC). Under LOCC, pre-shared entanglement can be manipulated and put to use in some quantum information processing task.
In particular, the parties can transmit quantum states back and forth using teleportation [1], and thus they can simulate any quantum gate that acts globally across their systems.
In this paper, we focus on the setting of local operations and broadcast communication (LOBC). Contrary to the standard LOCC model, in LOBC the classical communication is noninteractive, meaning the parties can just send each other one message that depends only on their own local measurement data. Hence, consecutive rounds of teleportation are forbidden in this model. Research into LOCC has typically made a distinction between protocols in which just a single party sends a message (i.e. one-way protocols) and those in which interactive messages are exchanged between the parties (i.e. two-way protocols). More generally, the subject of LOCC round complexity studies the question of how much more powerful LOCC operations become as more rounds of classical communication are permitted [2]- [6].
There are two main motivations for considering LOBC operations. The first, being practical in nature, is that an LOBC protocol is typically more time efficient than a general LOCC process. More precisely, the duration of an LOBC protocol is no longer than the time it takes a message to be sent between two parties of greatest separation. This is of vital importance for realistic quantum information processing in which maintaining coherence for long time lengths is a formidable challenge. The time-constrained nature of LOBC processing has also found cryptographic application in the task of position verification [7]- [11], and we review this connection in Section III.
A second motivation is more fundamental in nature and it involves understanding interaction as a resource in distributed quantum information. The specific problem we study in this paper is the simulation of some nonlocal gate using pre-shared entanglement and LOBC operations. Historically, this task has been referred to as instantaneous nonlocal computation, but such a title can be misleading as the complete computation requires a nonzero implementation time; see Section II. We consider the question of how much entanglement is needed to simulate a given gate when non-interactive classical communication is allowed. This LOBC entanglement cost can then be compared to the LOCC entanglement cost of simulating the same gate when interactive classical communication is permitted (see Figs. 1 and 2). As a result, quantitative trade-offs can be formulated between shared entanglement and interactive classical communication. Beyond exemplifying this type of resource trade-off, the task of instantaneous nonlocal computation touches on foundational questions in computation theory, as it provides a benchmark for assessing operational capabilities in generalized probability theories [12], [13].

Fig. 1. The LOCC simulation of a nonlocal gate $U$ may involve multiple rounds of interactive communication (see, for example, [5]). Alice and Bob perform local measurements and communicate their measurement outcomes $a_n$ and $b_{n+1}$. The choice of local measurement at each round can depend on the outcomes of previous measurements.

Fig. 2. In the LOBC simulation of a nonlocal gate $U$, two-way signaling is allowed but with no interaction. Protocols of this form are called instantaneous nonlocal computation of the gate $U$. This paper considers how much more entanglement $|\eta\rangle$ is needed in the LOBC model to make up for the lost interactive classical communication.
This paper is structured as follows. We begin in the next section by describing the task of instantaneous nonlocal computation. Known results are reviewed and they are compared to analogous results in the general LOCC setting. In Section III, the cryptographic application of position verification is described in both the classical and quantum settings. Section IV contains our new results which involve deriving improved upper and lower bounds on the entanglement cost of simulating different families of gates using LOBC. The main proofs and protocols are then presented in Section V, and finally Section VI provides some concluding remarks.

II. INSTANTANEOUS NONLOCAL QUANTUM COMPUTATION
In instantaneous nonlocal quantum computation, the goal is to apply a global unitary gate over some multipartite system using local measurements alone. That is, for a given unitary $U$ and arbitrary initial state $|\psi\rangle$, i.e., one whose classical description is unknown to the parties, they wish to invoke the transformation
$$|\psi\rangle \to U|\psi\rangle \tag{1}$$
by performing simultaneous local measurements on their respective subsystems; hence the description "instantaneous nonlocal computation." Of course, the notion of "instantaneous computation" should not be taken literally since this process is not physically possible for two reasons. The first reason is that $U$ may be an entangling gate, and the transformation $|\psi\rangle \to U|\psi\rangle$ could then generate entanglement, something which is not possible using local operations. One can overcome this objection by allowing the parties to consume entanglement in the process. Such a transformation then takes the form
$$|\psi\rangle\otimes|\eta\rangle \to U|\psi\rangle, \tag{2}$$
where $|\eta\rangle$ is some pre-shared entanglement resource known to all the parties. However, this process is still not possible in general due to relativistic constraints. If, for example, $U$ were simply a permutation operator, then the transformation $|\psi\rangle\otimes|\eta\rangle \to U|\psi\rangle$ could allow for instantaneous communication among the spatially separated parties, an impossibility even when using an unbounded amount of entanglement $|\eta\rangle$ [14]. Thus the problem must be further modified if it is to be physically feasible.
One relaxation is to allow for locally correctable errors on the transformed state. The collective outcomes of the different local measurements can be denoted by a variable $m$, so that given particular outcomes $m$, the induced state transformation has the form $|\psi\rangle \to |\varphi_m\rangle$. Instead of aiming to achieve $|\varphi_m\rangle = U|\psi\rangle$ for every $m$, the goal is for
$$|\varphi_m\rangle \overset{LU(m)}{=} U|\psi\rangle,$$
where $\overset{LU(m)}{=}$ means that the two states are related by a local unitary (LU) transformation that can be determined from the measurement data $m$. In this sense, the task of instantaneous nonlocal quantum computation of the gate $U$ means that
$$|\psi\rangle\otimes|\eta\rangle \to |\varphi_m\rangle \overset{LU(m)}{=} U|\psi\rangle \tag{3}$$
using local quantum measurements having outcomes $m$. This could be further relaxed by considering target states $\epsilon$-close to $U|\psi\rangle$ or by allowing the equality to hold not for all measurement outcomes $m$, but only those belonging to some highly probable set. Equation (3) thus describes a process using local operations and broadcast communication (LOBC). Each party makes a suitable local measurement and then broadcasts the outcome. From this globally shared information $m$, the LU error correction can be determined and implemented with no further communication. The resultant transformation is then $|\psi\rangle\otimes|\eta\rangle \to U|\psi\rangle$, and the desired simulation of gate $U$ is achieved. The main focus of this paper is on determining the minimal amount of entanglement $|\eta\rangle$ needed to simulate a given unitary $U$ in this way.
That it is even possible to perform Eq. (3) for every unitary $U$ is not obvious. It was first shown by Vaidman [15] that instantaneous nonlocal computation can always be attained with arbitrarily high probability provided that the parties share enough entanglement. Specifically, in Vaidman's scheme the entanglement consumption scales as $O(2^{\log(1/\epsilon)\cdot 2^{4n}})$, with $\epsilon$ being the error and $n$ being the number of qubits comprising the shared state $|\psi\rangle$. In this protocol, the full entanglement $|\eta\rangle$ must be consumed for every outcome $m$. An improved protocol was devised by Clark et al. in which some of the outcomes $m$ use only part of the initial entanglement, leaving the remainder usable for another task [16].
However, the average entanglement consumed across all outcomes $m$ in this protocol still scales doubly exponentially in the system size. A breakthrough was later made by Beigi and König, who used port-based teleportation [17], [18] as a primary subroutine within their protocol [19]. They were able to develop a general method for instantaneous nonlocal computation that uses only $O(n 2^{8n}/\epsilon^2)$ ebits. Subsequent work has also been conducted on the instantaneous nonlocal computation of certain families of gates. For gates belonging to the so-called Clifford hierarchy, specialized protocols have been devised by Chakraborty and Leverrier [20]. General LOBC protocols were referred to as fast protocols by Yu et al. in Ref. [21], and they were able to construct specific protocols for the nonlocal implementation of unitaries having certain group structure. A different resource analysis has been carried out by Speelman, who related entanglement consumption to the $T$-gate configuration in a quantum circuit realizing a given unitary $U$ [22]. A restricted form of LOBC operations was studied for the task of entanglement distillation under the name of "measure and exchange" (MX) operations [23].
An important problem in the study of instantaneous nonlocal computation is to prove lower bounds on the entanglement cost for implementing certain gates. One automatic lower bound comes from the entangling power of the gate, which was alluded to at the start of this section. The entangling power is defined as the maximum increase in entanglement among all input states acted upon by the gate, and entanglement monotonicity under LOCC prohibits the entanglement implementation cost from being less than the entangling power. Note that since the entangling power is a property of the gate, it cannot be used as a lower bound that differentiates the LOCC and LOBC entanglement costs of implementation. Unfortunately, beyond the entangling-power bound, relatively little else has been proven. While the best upper bounds for simulating an arbitrary gate have entanglement costs that scale exponentially in the system size, it is unknown whether this amount of entanglement is necessary. The best lower bounds on the dimension of the shared entanglement scale linearly in the system dimension of the gate being implemented [19], [24]. A similar lower bound was proven for a BB84-based gate except in terms of the entanglement measure $E_{\max}$ [25]. One drawback of these lower bounds is that they are not given in terms of ebit cost, unlike the upper bounds. This can be problematic for making comparative statements between upper and lower bounds. For example, if one considers the measure $E_{\max}$, which is no greater than the dimension of the entanglement, then the family of states Here $E$ is the entanglement entropy, which quantifies the number of ebits in a bipartite pure state [26], [27]. The divergence of $E_{\max}$ in this example can be easily seen from the fact that $E_{\max}(|\eta_d\rangle\langle\eta_d|)$ coincides with the log-robustness of entanglement [28], which has the form $2\log(\sum_{k=1}^{d}\lambda_k)$ for Schmidt coefficients $\lambda_k$.
Thus, $E_{\max}$ and the entanglement entropy $E$ can behave quite differently, and in terms of ebit cost, no lower bounds have been previously demonstrated for instantaneous nonlocal computation beyond the entangling power. To our knowledge, the same is also true for general LOCC gate simulation. This is particularly relevant to the question of trade-offs between entanglement and interaction described in the introduction. One motivation for this work is to understand classical interaction as a resource in distributed quantum information processing. Its resource character can be quantified in terms of how much entanglement the parties must spend to remove interaction from the general LOCC setting and still complete the given task. Hence, it seems very natural to make this quantification using the standard resource unit of entanglement, which is an ebit. In this paper we provide such an ebit lower bound on the entanglement cost of performing generic bipartite controlled-phase gates using LOBC (Theorem 3).
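The different behavior of the two quantities can be illustrated numerically. The following is a minimal sketch, not taken from the paper: it evaluates the entanglement entropy and the expression $2\log(\sum_k \lambda_k)$ quoted above for a given list of Schmidt coefficients. The function names and the `skewed_state` family are hypothetical and chosen only for illustration; they are not the family of states referred to in the text.

```python
import math


def entanglement_entropy(schmidt):
    """Entanglement entropy E in ebits; Schmidt coefficients lambda_k, probabilities lambda_k^2."""
    probs = [lam ** 2 for lam in schmidt]
    return -sum(p * math.log2(p) for p in probs if p > 0)


def log_robustness(schmidt):
    """The pure-state expression 2*log2(sum_k lambda_k) quoted in the text."""
    return 2 * math.log2(sum(schmidt))


def skewed_state(d, delta):
    """Hypothetical illustrative family: weight 1-delta on one term, delta spread over d-1 terms."""
    return [math.sqrt(1 - delta)] + [math.sqrt(delta / (d - 1))] * (d - 1)


if __name__ == "__main__":
    for d in (2 ** 4, 2 ** 10, 2 ** 16):
        lam = skewed_state(d, delta=0.01)
        print(d, round(entanglement_entropy(lam), 4), round(log_robustness(lam), 4))
```

For a maximally entangled state the two quantities coincide, while for strongly skewed Schmidt spectra the second expression can be much larger than the entropy, which is why a bound stated in one measure does not directly translate into the other.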
To make a comparison between protocols with interactive communication and those without, we now briefly review some relevant results on the task of gate simulation using general LOCC. First note that any $d_A \times d_B$ gate can be implemented using teleportation and interactive communication at a cost of $2\log d_A$ ebits. However, often this is not the optimal protocol. For Clifford gates, the entanglement cost is known to equal the entangling power [29], which can be less than the dimension bound of teleportation. For two qubits, any controlled unitary gate can be implemented under LOCC with just one shared ebit and two bits of classical information [30], [31]. This entanglement cost was later proven to be optimal for resource states having Schmidt rank two [32]. A generalization of this result came in Ref. [33], where it was shown that if an entangled resource state can simulate a unitary gate whose Schmidt rank is the same as the resource state, then the latter must be maximally entangled. Interestingly, these lower bounds no longer hold for resource states having a Schmidt rank that exceeds the Schmidt rank of the simulated gate, and they therefore fail to provide an ebit lower bound on the LOCC entanglement cost of gate simulation. In complementary earlier work, Cirac et al. have shown that the entanglement needed to simulate a family of weakly entangling gates can be smaller than one ebit, and it approaches zero as the entangling power of these gates likewise approaches zero [34]. Our main protocol in Theorem 1 draws inspiration from the protocol described in Ref. [34].
When studying the entanglement cost of implementing a nonlocal unitary using either LOBC or LOCC, the problems of exact simulation versus $\epsilon$-approximate simulation are different in nature. In fact, the entanglement cost could be far less in the $\epsilon$-approximate regime, and arguably this is the more relevant setting to consider for realistic applications. However, the problem of exact simulation is still important from a fundamental perspective as it allows for fundamental separations to be drawn between LOBC and general LOCC. Furthermore, if one places a bound on the dimension of the entanglement resource, then the set of LOBC operations is compact and the cost of exact simulation serves as a limit for the $\epsilon$-approximate cost as $\epsilon \to 0$. In this paper we consider both variants of the problem. Specifically, Theorem 1 pertains to the approximate simulation of an arbitrary two-qubit gate whereas Proposition 1, Theorem 2, and Theorem 3 deal with exact implementations.

III. CLASSICAL AND QUANTUM POSITION VERIFICATION
A concrete application of instantaneous nonlocal quantum computation by LOBC is quantum position verification (QPV). In position verification, a group of verifiers want to check if a prover $P$, who claims to be in position pos, is indeed at that location. A general verification scheme is to send a challenge to $P$ and check if $P$ responds with the correct answer within a specified amount of time. This technique is called distance bounding, and it was introduced in the classical setting by Brands and Chaum [35]. The intuition behind the scheme is that the adversaries, none of whom are at pos, are prohibited by relativistic constraints from correctly responding to the challenge within the allowed time frame. However, this intuition fails, and classical position verification has been shown to be insecure against multiple colluding adversaries [7].
One key step in the classical attacks is the cloning of information by the colluding adversaries. Since general cloning is not allowed in quantum mechanics, scientists attempted to build secure position-verification protocols based on the exchange of quantum information. The first QPV protocols were invented in 2002 under the name "quantum tagging" [8] with independent schemes proposed in Refs. [9] and [36]. However, these protocols are insecure provided the attackers have enough pre-shared entanglement [8], [10]. In general, all these protocols fall to a general attack based on instantaneous nonlocal quantum computation, as presented in detail by Buhrman et al. [12]. The attack relies on teleportation* (teleportation without communication) and the use of multiple "teleportation" channels for each possible Pauli error. Thus, at the end of the protocol, the adversaries share the correct state in one of the channels. By broadcasting their measurement outcomes, they can then identify this channel and fool the verifiers. However, the amount of entanglement consumed in this strategy is doubly exponential in the size of the system. Beigi and König [19] later improved on this result by using "port-based teleportation," which uses an amount of entanglement only exponential in the system size. It remains an important open problem whether or not QPV attacks exist that are sub-exponential in their entanglement consumption, and the best lower bounds only require the dimension of the entanglement to scale linearly with respect to the dimension of the simulated gate.
We should emphasize, however, that an LOBC attack is not the most general attack that can be performed on a QPV scheme. Indeed, LOBC assumes that the adversaries only communicate with one another classically. A conceivably more powerful attack allows the adversaries to exchange quantum information during the protocol as well. In other words, the operational class that encompasses a broader class of QPV attacks consists in local operations and broadcast quantum communication (LOBQC). We do not consider such a model in this paper.

IV. RESULTS
A. Two-Qubit Gates
1) Exact Implementations: We begin by describing a simple protocol that provides an exact implementation of certain two-qubit unitaries.

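Definition 1. Let $L$ be the family of two-qubit unitaries $U$ such that
$$U(R\sigma_j R^\dagger \otimes I)U^\dagger = T_j \otimes V_j \quad \text{for all } j \in \{0,1,2,3\} \tag{5}$$
for some one-qubit unitaries $R$, $T_j$, and $V_j$, where the $\sigma_j$ are the standard Pauli matrices.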

Proposition 1.
Any U ∈ L can be perfectly simulated by LOBC using two ebits and four classical bits of (non-interactive) communication.
Proof. The protocol we describe for performing $U \in L$ is similar in spirit to the protocol of Vaidman and Buhrman et al. [11], [15], and we call it U2E (the "E" in the name stands for "exact"). A subroutine in this protocol is teleportation*, which is the standard teleportation protocol except with no classical communication and no Pauli correction on the receiving end [11]. Thus, at the end of teleportation*, the receiver has the teleported state up to a local Pauli error.
Protocol U2E: Two-ebit protocol for $U \in L$
• Input an arbitrary two-qubit state $|\psi\rangle_{AB}$.
1) Suppose that $U$ satisfies Eq. (5). Using ebit $|\Phi^+\rangle_{A_1B_1} = \frac{1}{\sqrt{2}}(|00\rangle + |11\rangle)_{A_1B_1}$, Alice teleports* $A$ to Bob by measuring in the rotated Bell basis $\{(R\sigma_j R^\dagger \otimes I)|\Phi^+\rangle_{AA_1}\}_j$. This leaves Alice (A) and Bob (B) sharing the state
$$(R\sigma_j R^\dagger \otimes I)|\psi\rangle_{B_1B},$$
where $\sigma_j$ is a Pauli error known to Alice.
2) Bob applies the unitary $U$ on systems $B_1B$, and by Eq. (5) we have
$$U(R\sigma_j R^\dagger \otimes I)|\psi\rangle_{B_1B} = (T_j \otimes V_j)\,U|\psi\rangle_{B_1B}.$$
3) Using ebit $|\Phi^+\rangle_{A_2B_2}$, Bob teleports* $B_1$ back to Alice, they broadcast their results, and then they perform the necessary local error corrections, i.e., Bob's teleportation Pauli error as well as $T_j^\dagger \otimes V_j^\dagger$. In total, Alice and Bob are left in the shared state $U|\psi\rangle$, as desired.
In some cases, Protocol U2E is optimal. For example, consider the swap operator $F \in L$, whose action is given by $F(|\alpha\rangle_A|\beta\rangle_B) = |\beta\rangle_A|\alpha\rangle_B$ for an arbitrary product state $|\alpha\rangle|\beta\rangle$. Since swap has an entangling power of two ebits (when acting on subsystems $AB$ of the state $|\Phi^+\rangle_{AA'} \otimes |\Phi^+\rangle_{BB'}$), Protocol U2E is optimal for the nonlocal simulation of swap.
In fact, it is straightforward to generalize protocol U2E to optimally perform the d-dimensional swap operator using teleportation * with a d-dimensional Bell basis. On the other hand Protocol U2E is sub-optimal for other gates. For example, CNOT is an element of L, and Theorem 2 below shows that CNOT can be implemented by LOBC using just one ebit.
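The claim that CNOT belongs to $L$ can be checked directly. The sketch below assumes the defining condition of $L$ is that $U(R\sigma_j R^\dagger \otimes I)U^\dagger$ is a product unitary $T_j \otimes V_j$ for every Pauli $\sigma_j$, and takes $R = I$ with Alice's qubit as the control; the helper names are hypothetical. It uses plain nested lists so that it runs with the standard library alone.

```python
def kron(a, b):
    """Kronecker product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    """Matrix product of nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))] for i in range(len(a[0]))]


def close(a, b, tol=1e-12):
    """Entrywise comparison of two matrices."""
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


I2 = [[1, 0], [0, 1]]
X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control = first qubit

PAULIS = {"I": I2, "X": X, "Y": Y, "Z": Z}
# Candidate product images (T_j, V_j) of sigma_j (x) I under conjugation by CNOT.
PRODUCT_IMAGES = {"I": (I2, I2), "X": (X, X), "Y": (Y, X), "Z": (Z, I2)}


def cnot_images_are_products():
    """Check CNOT (sigma_j (x) I) CNOT^dagger == T_j (x) V_j for every Pauli sigma_j."""
    for name, (t, v) in PRODUCT_IMAGES.items():
        image = matmul(matmul(CNOT, kron(PAULIS[name], I2)), dagger(CNOT))
        if not close(image, kron(t, v)):
            return False
    return True


if __name__ == "__main__":
    print(cnot_images_are_products())
```

Each Pauli error on the control qubit is mapped to a tensor product of Paulis, so the errors left by teleportation* remain locally correctable after Bob applies the gate.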
Finally, let us briefly comment on the structure of L. First, observe that L is closed under local unitary transformations.
In general, we say that unitaries $U$ and $V$ are locally equivalent if they can be related by local unitaries in this way. Second, consider the two-qubit Pauli group, $P_2 = \{\sigma_j \otimes \sigma_k\}_{j,k=0}^{3} \times \{\pm 1, \pm i\}$, as well as its normalizer, $C_2 = \{U : UgU^\dagger \in P_2 \ \forall g \in P_2\}$. The latter is typically referred to as the Clifford group, and as easily seen from the definitions, any operator locally equivalent to a Clifford operator also belongs to $L$. However, somewhat surprisingly, the converse is also true.

Lemma 1. $U \in L$ if and only if there exist local unitaries $R_n \otimes S_n$ such that $(R_1 \otimes S_1)U(R_2 \otimes S_2) \in C_2$.
The proof is provided in Section V-A, and we suspect this lemma may also find application in other quantum computation tasks. One immediate consequence of Lemma 1 is that Protocol U2E is no stronger in terms of entanglement consumption than the protocol recently given in Ref. [29]. In that paper, the authors provide an LOBC protocol for the implementation of any Clifford gate (in arbitrary dimension). Their protocol differs in that it involves Alice and Bob sharing the Choi state of U as their resource entanglement. Since for two qubits the entanglement of the Choi state can be less than two ebits, their protocol in general will have a smaller entanglement consumption. However, the resource state used in the protocol of Ref. [29] is specific to the gate being simulated, whereas protocol U2E uses a gate-independent resource state. One could modify the protocol of Ref. [29] by first equipping Alice and Bob with some fixed two-ebit resource state, and then have them convert this into the Choi state of a given unitary by LOCC. Doing this would render a protocol very similar to U2E.
2) Approximate Implementations: We now turn to the problem of instantaneous nonlocal computation of an arbitrary two-qubit unitary. We present a new protocol referred to as U2, and its detailed description is given in Section V. Except for certain angles, protocol U2 is probabilistic. It involves diagonalizing a two-qubit unitary in the so-called "magic basis" (see Eq. (15)) and then expressing this diagonalization as a sequence of simple single- and two-qubit gates. The protocol then involves implementing these gates under the LOBC constraint following the "angle-doubling" error correction idea of Ref. [34]. One of the key features of our protocol is that it does not use Vaidman's "tree of teleportation channels" [12], [15], [16], and we therefore avoid an exponential growth in entanglement cost. Its performance is reported in the following theorem.
We can compare the efficiency of protocol U2 to the port-based teleportation scheme of Beigi and König [19]. For a two-qubit gate $U$ and any $\epsilon > 0$, their protocol generates a quantum channel $\mathcal{E}$ which consumes $1 + 3\cdot 2^{12}/\epsilon$ ebits while achieving an approximation of $U$ quantified by $\|\mathcal{E} - \mathcal{U}\|_\diamond \le \epsilon$, where $\mathcal{U}(\rho) = U\rho U^\dagger$ and $\|\cdot\|_\diamond$ is the so-called diamond norm [37]. In the protocol U2, Alice and Bob know when they have perfectly implemented the gate and when they have failed. In the latter case they can simply replace their state with "white noise," and thus U2 implements the quantum channel $\mathcal{E}_{U2}(\rho) = p\,\mathcal{U}(\rho) + (1-p)(I \otimes I)/4$ at the cost of $8N + 1$ ebits and with $p = (1 - 2^{-N})^3$. Setting $\epsilon = 2(1-p)$, a straightforward calculation shows $\|\mathcal{E}_{U2} - \mathcal{U}\|_\diamond \le \epsilon$ while consuming $O(\log(1/\epsilon))$ ebits. Hence in terms of approximation error $\epsilon$, protocol U2 offers an exponential saving in the entanglement cost compared to port-based teleportation protocols. A similar saving holds relative to Vaidman-like schemes [11], [15].
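The logarithmic scaling can be made concrete with a few lines of arithmetic. The sketch below only uses the quantities quoted above (cost $8N+1$ ebits, success probability $p = (1-2^{-N})^3$, and $\epsilon = 2(1-p)$); the function names are hypothetical. Since $1 - (1-x)^3 \le 3x$ for $x \in [0,1]$, one has $\epsilon \le 6 \cdot 2^{-N}$, so the ebit count is at most $8\log_2(6/\epsilon) + 1$.

```python
import math


def u2_success_probability(n_rounds):
    """p = (1 - 2^{-N})^3, as quoted in the text."""
    return (1 - 2.0 ** (-n_rounds)) ** 3


def u2_ebits(n_rounds):
    """Ebit cost 8N + 1, as quoted in the text."""
    return 8 * n_rounds + 1


def u2_error(n_rounds):
    """epsilon = 2(1 - p), the error parameter used in the comparison."""
    return 2 * (1 - u2_success_probability(n_rounds))


if __name__ == "__main__":
    for n in (2, 5, 10, 20):
        eps = u2_error(n)
        # Compare the actual ebit count with the bound 8*log2(6/eps) + 1.
        print(n, u2_ebits(n), eps, 8 * math.log2(6 / eps) + 1)
```

Doubling the number of ebits roughly squares the achievable error, which is the sense in which the cost is logarithmic in $1/\epsilon$.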

B. Exact Implementation of Hermitian Binary-Controlled Gates
We now turn to a class of unitaries in general $d_A \otimes d_B$ systems. These are controlled gates of the form
$$U_c = (I - P) \otimes I + P \otimes V,$$
where $P$ is an arbitrary projector on system $A$ and $V = V^\dagger$ is a Hermitian unitary operator. This can be interpreted as a binary switch that applies $V$ on system $B$ when system $A$ lies in the support of $P$. The LOBC implementation of operators having this form was studied in Ref. [21]. However, in their protocol the amount of consumed entanglement is not explicitly stated.
Here we show that only a single ebit is needed, regardless of the dimensions.
Performing these measurements on the initial state $|\psi\rangle_{AB}|\eta\rangle_{A'B'}$ has outcomes Define the unitary operator $Z = (I - P) - P$ on Alice's system. Then for outcome $A_0B_0$ Alice and Bob do nothing, for outcome $A_0B_1$ they perform $Z \otimes I$, for outcome $A_1B_0$ they perform $I \otimes V$, and for outcome $A_1B_1$ they perform $Z \otimes V$. This attains $U_c|\psi\rangle$ with probability one.
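As a sanity check on the objects appearing in this subsection, the following sketch builds a binary-controlled gate of the form $(I-P)\otimes I + P\otimes V$ and the correction $Z = (I-P) - P$ for one concrete choice, and verifies that both are Hermitian unitaries (each squares to the identity). The specific $P$ and $V$ and all helper names are assumptions made for illustration; the sketch does not simulate the measurement step of the protocol.

```python
def kron(a, b):
    """Kronecker product of nested-list matrices."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def matmul(a, b):
    """Matrix product of nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def dagger(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))] for i in range(len(a[0]))]


def add(a, b):
    """Entrywise sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    """Scalar multiple c * a."""
    return [[c * x for x in row] for row in a]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def close(a, b, tol=1e-12):
    return all(abs(x - y) < tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def controlled_gate(p, v):
    """U_c = (I - P) (x) I + P (x) V for a projector P and Hermitian unitary V."""
    complement = add(identity(len(p)), scale(-1, p))
    return add(kron(complement, identity(len(v))), kron(p, v))


def alice_correction(p):
    """Z = (I - P) - P = I - 2P."""
    return add(identity(len(p)), scale(-2, p))


# Illustrative choice: d_A = 3 with a rank-2 projector, d_B = 2 with V = sigma_x.
P = [[0, 0, 0], [0, 1, 0], [0, 0, 1]]
V = [[0, 1], [1, 0]]
U_C = controlled_gate(P, V)
Z_A = alice_correction(P)

if __name__ == "__main__":
    print(close(matmul(U_C, U_C), identity(6)), close(dagger(U_C), U_C))
```

Because $V^2 = I$ and $P$ is a projector, $U_c^2 = (I-P)\otimes I + P \otimes V^2 = I$, which is the Hermiticity property that distinguishes this family from the generic gates of the next subsection.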

C. An Ebit Lower Bound on the Exact Implementation of Generic Binary-Controlled Gates
We now consider systems of size 2 ⊗ s and show that, in stark contrast to Theorem 2, there are non-Hermitian controlled unitaries whose ebit consumption for implementation depends on the size of s.

Theorem 3. Let
have phase angles $\tau_j \in [0, 2\pi)$ such that $\tau_k \neq \tau_l$ for all $k \neq l \in \{0, \cdots, s-1\}$. An LOBC implementation of the controlled unitary on a $2 \otimes s$ system requires at least $\log s$ ebits of shared entanglement resource.
Note that every controlled gate on 2 ⊗ s controlled from the 2-dimensional side is LU equivalent to U c in Eq. (14), and generically, the phase angles in U τ will be distinct. The proof of Theorem 3 is presented in Section V. It should also be noted that Theorem 3 assumes a pure-state resource, and so the amount of ebits refers to the entanglement entropy of the pure state. If one considers a mixed-state resource, then the entanglement bound in Theorem 3 refers to the entanglement of formation, which is the average pure-state entanglement entropy minimized over all ensembles realizing the resource state.
What is remarkable about this result is that it not only quantifies a lower bound on instantaneous nonlocal computation in terms of ebits, but it also demonstrates an unbounded gap between LOCC and LOBC. Under interactive LOCC, this gate can easily be performed using two ebits: Alice teleports her system to Bob, he performs $U_c$ on both systems, and then he teleports Alice's qubit back to her. Hence, Theorem 3 accomplishes one of the main goals of the paper: a rigorous trade-off has been identified between interactive communication and entanglement consumption.
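The size of the gap is simple arithmetic, sketched below under the assumption that logarithms are taken base 2 (so that entanglement is counted in ebits). The function names are hypothetical; the two numbers compared are the $\log s$ LOBC lower bound of Theorem 3 and the two-ebit teleportation-based LOCC protocol described above.

```python
import math

LOCC_TELEPORTATION_EBITS = 2  # teleport Alice's qubit to Bob and back


def lobc_lower_bound_ebits(s):
    """Theorem 3 lower bound log2(s) for a generic controlled gate on a 2 (x) s system."""
    return math.log2(s)


def smallest_separating_dimension():
    """Smallest s for which the LOBC lower bound exceeds the LOCC teleportation cost."""
    s = 2
    while lobc_lower_bound_ebits(s) <= LOCC_TELEPORTATION_EBITS:
        s += 1
    return s


if __name__ == "__main__":
    for s in (2, 4, 16, 1024):
        print(s, lobc_lower_bound_ebits(s), LOCC_TELEPORTATION_EBITS)
```

The LOCC cost stays fixed at two ebits while the LOBC bound grows without limit in $s$.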

A. The Two-Qubit "Magic Basis" and the Proof of Lemma 1
The magic basis in two qubits [38], [39] is the orthonormal family of states A number of convenient properties emerge when working in the magic basis, and we review them here since many of our proofs make use of them.

and only if it can be written as
where
Proof. Note that the $\sigma_k \otimes \sigma_k$ form a pairwise commuting set for $k = x, y, z$. Thus, we can write
$$e^{i(\alpha\sigma_x\otimes\sigma_x + \beta\sigma_y\otimes\sigma_y + \gamma\sigma_z\otimes\sigma_z)} = e^{i\alpha\sigma_x\otimes\sigma_x}\, e^{i\beta\sigma_y\otimes\sigma_y}\, e^{i\gamma\sigma_z\otimes\sigma_z}.$$
Using the identity $e^{i\theta\sigma_k\otimes\sigma_k} = \cos\theta\, I + i\sin\theta\, \sigma_k\otimes\sigma_k$ and the fact that each magic state is an eigenstate of $\sigma_k\otimes\sigma_k$, it follows that iff the $\varphi_k$ and $\alpha, \beta, \gamma$ are related according to the above relations.
Hence from Eq. (21), we see that Conversely, if $\Omega$ is not in the Clifford group, then there must be some $i, j, k$ for which Eq. (23) does not hold. This means that $\sigma_i \otimes \sigma_j$ anti-commutes with $\sigma_k \otimes \sigma_k$. Thus, which clearly belongs to $P_2$ whenever $\theta$ is an integer multiple of $\pi/4$. As this would be a contradiction, we conclude that $\theta$ cannot be an integer multiple of $\pi/4$.
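For completeness, the exponential identity used in this proof follows from $(\sigma_k \otimes \sigma_k)^2 = I \otimes I$, which lets the power series split into even and odd parts. A short derivation, written here with $I \otimes I$ for the two-qubit identity:

```latex
\begin{align*}
e^{i\theta\,\sigma_k\otimes\sigma_k}
  &= \sum_{n\ge 0}\frac{(i\theta)^n}{n!}\,(\sigma_k\otimes\sigma_k)^n \\
  &= \sum_{m\ge 0}\frac{(-1)^m\theta^{2m}}{(2m)!}\; I\otimes I
   \;+\; i\sum_{m\ge 0}\frac{(-1)^m\theta^{2m+1}}{(2m+1)!}\;\sigma_k\otimes\sigma_k \\
  &= \cos\theta\; I\otimes I \;+\; i\sin\theta\;\sigma_k\otimes\sigma_k .
\end{align*}
```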

Proposition 4 ([39]). Every two-qubit unitary $U$ is locally equivalent to a matrix diagonal in the magic basis. That is, $U$ can be decomposed as
$$U = (R_1 \otimes S_1)\,\Omega\,(R_2 \otimes S_2),$$
where $\Omega$ is diagonal in the magic basis and the $R_i \otimes S_i$ are local unitaries.
A detailed proof is given in Ref. [39]. We next make the connection between the magic basis and a gate's ability to generate entanglement. Here we say that U is non-entangling if U |α |β is a product state for every |α |β .

Proposition 5 ([38]). A two-qubit unitary is non-entangling iff, up to an overall phase, it is real in the magic basis.
Proof. From Proposition 4 we write
$$U = (R_1 \otimes S_1)\,\Omega\,(R_2 \otimes S_2). \tag{26}$$
Our argument will involve first showing that every product unitary is real in the magic basis. Since all product unitaries are non-entangling, Eq. (26) implies that $U$ is non-entangling iff $\Omega$ is non-entangling. With it having been established that every product unitary is real, the proposition will then follow by showing that $U$ is non-entangling iff $\Omega$ is real in the magic basis.
Let us first consider any operator of the form $I \otimes V$ (or alternatively $V \otimes I$), where $V$ is an arbitrary unitary. Up to an overall phase, we can always express $V = aI + i\vec{b}\cdot\vec{\sigma}$ with $a \ge 0$ and $\vec{b}$ a vector with real components. Then Since $\sigma_i\sigma_j = i\epsilon_{ijk}\sigma_k$, we see that $I \otimes V$ is real when expressed in the magic basis. Let us write $\Omega =$ whenever $\sum_{k=0}^{3} c_k^2 = 0$. This requires that $\varphi_k - \varphi_0 = \pm\pi$ for all $k$. In other words, up to an overall phase, $\Omega$ is real.
We now turn to the proof of Lemma 1. It will make use of one more technical fact.
Proposition 6. Let $\sigma_k$ be any Pauli operator and $V$ an arbitrary one-qubit unitary. Then there exists some com- Under unitary conjugation $\sigma_k$ transforms to some other unitary $e^{i\phi}\hat{n}\cdot\vec{\sigma}$, where $\hat{n}$ is a unit vector with real components and $e^{i\phi}$ is an overall phase. Hence $e^{-i\phi}\,\mathrm{Tr}[V\sigma_i V^\dagger \sigma_k]$ is a real number.

Lemma 1. $U \in L$ if and only if there exist local unitaries $R_n \otimes S_n$ such that $(R_1 \otimes S_1)U(R_2 \otimes S_2) \in C_2$.
is obtained from $U$ by local unitaries. Hence, it suffices to show that $\Omega(V\sigma_i V^\dagger \otimes I)\Omega^\dagger$ being a product unitary for all $i$ implies $\Omega \in C_2$. From this it will follow that $(R_1 \otimes S_1)U(R_2 \otimes S_2) \in C_2$ for some local unitaries $R_n \otimes S_n$.
If $\Omega(V\sigma_i V^\dagger \otimes I)\Omega^\dagger$ is a product unitary it is non-entangling and therefore, by Proposition 5, there exists some phase $e^{i\phi}$ such that is real for each $j$ and $k$. Under what conditions is this true? Note that when $j = k$ the component vanishes, and so it suffices to just consider the case of $j \neq k$. First, suppose that $j = 0$ and $k > 0$. Then If these terms are real for all $i$, then by Proposition 6, there must exist some phase $e^{i\phi_k}$ such that $i e^{i\phi} e^{i(\varphi_0 - \varphi_k)} e^{i\phi_k}$ is real. Hence, $\phi + \varphi_0 - \varphi_k + \phi_k = n_{0k}\pi + \pi/2$, $n_{0k} \in \mathbb{Z}$, $\forall k = 1, 2, 3$. Similarly, taking $k = 0$ and $j > 0$ we have From this we infer $\varphi_0 - \varphi_l = (n_{0l} - n_{l0})\pi/2$, $\forall l = 1, 2, 3$, and $\phi + \phi_l = (n_{0l} + n_{l0} + 1)\pi/2$, $\forall l = 1, 2, 3$.
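The last two relations come from adding and subtracting the two reality conditions. The worked step below assumes that the condition obtained for $k = 0$, $j > 0$ has the same form as the one for $j = 0$, $k > 0$ with the roles of $\varphi_0$ and $\varphi_l$ exchanged and integer $n_{l0}$; this form is an assumption, since it is not written out explicitly above.

```latex
\begin{align*}
\phi + \varphi_0 - \varphi_l + \phi_l &= n_{0l}\,\pi + \tfrac{\pi}{2}, \\
\phi + \varphi_l - \varphi_0 + \phi_l &= n_{l0}\,\pi + \tfrac{\pi}{2}
  \qquad \text{(assumed form)}. \\[4pt]
\text{Subtracting:}\quad 2(\varphi_0 - \varphi_l) &= (n_{0l} - n_{l0})\,\pi
  \;\Longrightarrow\; \varphi_0 - \varphi_l = (n_{0l} - n_{l0})\,\tfrac{\pi}{2}, \\
\text{Adding:}\quad 2(\phi + \phi_l) &= (n_{0l} + n_{l0} + 1)\,\pi
  \;\Longrightarrow\; \phi + \phi_l = (n_{0l} + n_{l0} + 1)\,\tfrac{\pi}{2}.
\end{align*}
```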
Now we turn to j, k > 0. We have Since $\sigma_k\sigma_j = i\epsilon_{kjl}\sigma_l$, the RHS of the previous equation becomes Again by Proposition 6, for this to be real, we need for any distinct triple (j, k, l) of nonzero indices. Substituting Eq. (34) into (37) we get and adding Eq. (33) to this yields for any distinct triple j, k, l > 0. Finally, by applying the relations of Eqs. (17)-(20), we have Hence α, β, γ are all integer multiples of π/4. By Proposition 3, it follows that Ω is a Clifford gate.

Proof. We freely interchange the symbols {1, 2, 3} ↔ {x, y, z} to denote the standard Pauli operators. We will also write the identity as $\sigma_0 = I$. The two-qubit controlled-not (CNOT) gate will be denoted as In addition, we define the single-qubit matrices as well as the two-qubit unitary Observe the relations From Propositions 2 and 4, pre- and post-local unitaries can convert a given U into an operator Ω, which in the magic basis has the diagonal form The magic basis can then be rotated into the computational basis using a CNOT gate and local unitaries. Doing so allows us to decompose any two-qubit unitary into the form up to pre- and post-local unitaries [40]. Thus it suffices to implement M(α, β, γ) using LOBC. Similar to Protocol U2E, Protocol U2 relies heavily on the subroutine teleportation*.
Recall that teleportation* is standard teleportation using a maximally entangled two-qubit state, but without the classical communication and Pauli correction at the end.

Protocol U2: LOBC implementation of M(α, β, γ):

Remark. Prior to Step 5b, all operations by Alice (resp. Bob) will depend only on her (resp. his) previous measurement outcomes.
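To make the teleportation* subroutine concrete, the following sketch simulates it for a single qubit and also checks the commutation rule that underlies the halting test used in the protocol: Pauli errors in {I, σ_z} commute with a z-rotation, while errors in {σ_x, σ_y} flip its sign. The assumptions are the convention $R_z(\theta) = e^{-i\theta\sigma_z/2}$ and the Bell-state labels; the function names `teleport_star` and `rotation_sign` are hypothetical and not taken from the paper.

```python
import cmath
import math

S = 1 / math.sqrt(2)
PAULIS = {
    "I": [[1, 0], [0, 1]],
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}
# Bell states on (message qubit, sender's ebit half); amplitudes ordered 00, 01, 10, 11.
BELL = {
    "Phi+": [S, 0, 0, S],
    "Phi-": [S, 0, 0, -S],
    "Psi+": [0, S, S, 0],
    "Psi-": [0, S, -S, 0],
}

def apply(M, v):
    return [sum(M[r][c] * v[c] for c in range(2)) for r in range(2)]

def fidelity(u, v):
    """|<u|v>|^2 / (<u|u><v|v>), insensitive to global phase and norm."""
    inner = sum(complex(a).conjugate() * b for a, b in zip(u, v))
    nu = sum(abs(a) ** 2 for a in u)
    nv = sum(abs(b) ** 2 for b in v)
    return abs(inner) ** 2 / (nu * nv)

def teleport_star(psi):
    """Return {bell_outcome: (probability, pauli_error_label)} for the receiver."""
    ebit = BELL["Phi+"]
    result = {}
    for name, bell in BELL.items():
        # Project (message, sender half) onto a Bell state; r indexes the receiver qubit.
        out = [sum(complex(bell[2 * s + e]).conjugate() * psi[s] * ebit[2 * e + r]
                   for s in range(2) for e in range(2)) for r in range(2)]
        prob = sum(abs(x) ** 2 for x in out)
        label = next(p for p, M in PAULIS.items()
                     if fidelity(apply(M, psi), out) > 1 - 1e-9)
        result[name] = (prob, label)
    return result

def rz(theta):
    # Assumed convention: R_z(theta) = exp(-i * theta * sigma_z / 2).
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]

def rotation_sign(label, theta=0.37):
    """+1 if P R_z(theta) P = R_z(theta), -1 if it equals R_z(-theta)."""
    P, R = PAULIS[label], rz(theta)
    # Paulis are Hermitian, so conjugation by P is P R P.
    conj = [[sum(P[r][a] * R[a][b] * P[b][c] for a in range(2) for b in range(2))
             for c in range(2)] for r in range(2)]
    for sign in (+1, -1):
        target = rz(sign * theta)
        if all(abs(conj[r][c] - target[r][c]) < 1e-12
               for r in range(2) for c in range(2)):
            return sign
    raise ValueError("conjugated operator is not a z-rotation")

psi = [0.6, 0.8j]
print(teleport_star(psi))
print({p: rotation_sign(p) for p in PAULIS})
```

Each Bell outcome occurs with probability 1/4 and leaves the receiver holding the input state up to a Pauli error known only to the sender. The sign rule is consistent with the protocol's description of Bob applying a positive or negative rotation of doubled magnitude in each later round.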
Step 1 - Implement $(H\otimes I)\,\overrightarrow{U}_x$: Using 1 ebit, Alice and Bob implement CNOT using the protocol given in Theorem 2, except that they do not communicate their measurement outcomes to each other. Alice then performs a Hadamard gate. This leaves Alice (A) and Bob (B) sharing the state where $\sigma_a$ (resp. $\sigma_b$) is a Pauli error known to Alice (resp. Bob). Note that a, b ∈ {0, 1}.
Step 2 - Implement $I\otimes R_z(\gamma)$:

a. Initialize round r = 1. On system B, Bob performs $R_z(\gamma)$. Using ebit $|\Phi^+\rangle_{A_1B_1}$, he then teleports* system B to Alice, which leaves her in the state

b. On system $A_1$, Alice applies $\sigma_a$, and she enters the halting subroutine (see below) if a ∈ {0, 3}. Otherwise, using ebit $|\Phi^+\rangle_{A_2B_2}$, she teleports* system $A_1$ to Bob. The resulting shared state is given by

c. This begins round r = 2. applies $R_z(-2\gamma)$. Using ebit $|\Phi^+\rangle_{A_3B_3}$, system $B_2$ is teleported* back to Alice. This leaves them in the state

d. On system $A_3$, Alice applies $\sigma_{a_2}$ and she enters the halting subroutine if $a_2 \in \{0, 3\}$. Otherwise, using ebit $|\Phi^+\rangle_{A_4B_4}$, she teleports* system $A_3$ to Bob. The resulting shared state is given by

e. This continues for N total rounds. In each round, Bob applies either a positive or negative rotation with twice the magnitude of the rotation in the previous round. Whether the rotation is positive or negative depends on the product of all his previous Pauli errors. At the end of N rounds, Alice will have entered the halting subroutine in some round 1 ≤ K ≤ N with probability $1 - 2^{-N}$. If she entered in round K, then the state held on Alice's side at the start of the halting subroutine is and the joint state at the end of N rounds is where the $\sigma_{b_{2j+1}}$ are the Pauli errors introduced by Alice for each round after she halted and $\sigma_{a_{2N}}$ is the teleportation* error from the end of the halting subroutine. If Alice never entered the halting subroutine, then at the end of N rounds Alice and Bob's state is given by

f. Bob applies to system $B_{2N}$ the concatenation of all his Pauli errors $\sigma_b := \prod_{j=0}^{N-1}\sigma_{b_{2j+1}}$. The crucial property of this protocol is that for any halting round K. This holds because in the halting subroutine, Alice is able to distinguish whether Bob's teleportation error belongs to $\{I, \sigma_z\}$ or to $\{\sigma_x, \sigma_y\}$.

g. If either Alice entered the halting subroutine during some round or $\gamma = l\,2^{-N}\pi$ (by Corollary 1), where l is an even integer, then Alice and Bob's final shared state has the form

The amplitude $|\gamma_{a,b}|^2$ is the probability that Alice obtains measurement outcome a ∈ A and Bob obtains b ∈ B. To analyze further, it will be helpful to expand $A_a$ and $B_b$ in an orthonormal basis for systems A and B, respectively. Doing so yields the general forms where $|\alpha_{i',i,a}\rangle$ and $|\beta_{j',j,b}\rangle$ are both vectors in a d-dimensional space.
When expanded in the same basis, the RHS of Eq. (69) reads Thus, substituting (71) and (70) where we use the relation $(\langle\alpha_{i',i,a}|\otimes\langle\beta_{j',j,b}|)(I\otimes\eta)|\Phi^+_d\rangle = \langle\beta_{j',j,b}|\eta|\alpha^*_{i',i,a}\rangle$. Eq. (72) is equivalent to the system of equalities: For any $k, k' \in \{0,\cdots,s-1\}$, take the outer products of Eqs. (E:k) and (E:k'), trace out system A, and sum over a.
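The contraction identity used in this step, $(\langle\alpha|\otimes\langle\beta|)(I\otimes\eta)|\Phi^+_d\rangle = \langle\beta|\eta|\alpha^*\rangle$, is easy to confirm numerically. The sketch below assumes the unnormalised vector $|\Phi^+_d\rangle = \sum_i |i\rangle|i\rangle$; with the normalised state both sides differ by a factor $1/\sqrt{d}$. The function name `identity_gap` and the random test vectors are illustrative only.

```python
import random

def identity_gap(d, seed=0):
    """Return |LHS - RHS| of the contraction identity for random alpha, beta, eta."""
    rng = random.Random(seed)
    rand = lambda: complex(rng.uniform(-1, 1), rng.uniform(-1, 1))
    alpha = [rand() for _ in range(d)]
    beta = [rand() for _ in range(d)]
    eta = [[rand() for _ in range(d)] for _ in range(d)]

    # Unnormalised |Phi_d^+> = sum_i |i>|i>, stored with flat index i*d + j.
    phi = [1.0 if i == j else 0.0 for i in range(d) for j in range(d)]
    # (I (x) eta)|Phi_d^+>: eta acts on the second tensor factor.
    vec = [sum(eta[j][k] * phi[i * d + k] for k in range(d))
           for i in range(d) for j in range(d)]
    # Left-hand side: contract with <alpha| (x) <beta|.
    lhs = sum((alpha[i] * beta[j]).conjugate() * vec[i * d + j]
              for i in range(d) for j in range(d))
    # Right-hand side: <beta| eta |alpha^*>, with alpha^* the entrywise conjugate.
    rhs = sum(beta[j].conjugate() * eta[j][i] * alpha[i].conjugate()
              for i in range(d) for j in range(d))
    return abs(lhs - rhs)

for d in (2, 3, 5):
    print(d, identity_gap(d, seed=d))
```

The printed gaps should be zero up to rounding for every dimension tried.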
When this measurement is performed on $\eta\eta^\dagger$, we find where the second line follows from Eq. (75) and the third line comes from setting k = k' in Eq. (73) and then summing over k on both sides of that equation. On the level of purifications, Eq. (78) says that $(I_A\otimes M^{B}_{b,t})|\eta\rangle_{AB}$ is proportional to an s-dimensional maximally entangled state. Since this holds for every outcome $M_{b,t}$, monotonicity of the entanglement entropy under local measurement implies that $E(|\eta\rangle) \ge \log s$. (79)

Remark. The lower bound of Eq. (79) has been proven for the exact implementation of $U_c$. Thus, its significance lies in establishing the principle that LOBC requires more entanglement than LOCC for simulating certain gates, and that this gap cannot be bounded even when one of the systems is fixed to be a qubit. To have true cryptographic application in tasks such as QPV, one would want a similar result for an ε-approximate simulation of $U_c$. We leave this to future work.
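As a small numerical companion to the bound of Eq. (79), the sketch below computes the entanglement entropy of a pure bipartite state from its squared Schmidt coefficients and evaluates the average post-measurement entropy that monotonicity compares against. It assumes logarithms to base 2, so that entropies are counted in ebits; the function names and the outcome probabilities are hypothetical.

```python
import math

def entanglement_entropy(schmidt_probs):
    """Entropy in ebits of a pure bipartite state, from squared Schmidt coefficients."""
    total = sum(schmidt_probs)
    return -sum((p / total) * math.log2(p / total)
                for p in schmidt_probs if p > 0)

def average_outcome_entropy(outcome_probs, outcome_entropies):
    """Average entropy after a local measurement; monotonicity says the
    initial state's entropy is at least this large."""
    return sum(p * e for p, e in zip(outcome_probs, outcome_entropies))

s = 8
max_ent = entanglement_entropy([1 / s] * s)  # s-dimensional maximally entangled state
print(max_ent)
# If every outcome leaves an s-dimensional maximally entangled state, the
# average equals log2(s) regardless of the outcome probabilities.
print(average_outcome_entropy([0.5, 0.3, 0.2], [max_ent] * 3))
```

Because the average does not depend on the outcome distribution when every branch is maximally entangled, the initial state must carry at least log s ebits, which is the content of Eq. (79).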

VI. CONCLUSION
The LOBC setting is important in distributed quantum computing when time is of the essence. In this paper, we focused on the task of instantaneous nonlocal quantum computation, which is gate simulation using LOBC operations and pre-shared entanglement. We have introduced a general two-qubit protocol that is exponentially better than other known protocols in terms of its entanglement consumption as a function of gate error. We have shown this protocol to be non-optimal for the simulation of certain gates, such as swap, which can be implemented using just two ebits. This two-ebit cost for swap is optimal: even when interactive LOCC operations are permitted, two ebits are required for the implementation. This is somewhat surprising given that swap is the most nonlocal two-qubit gate in the sense that it can generate the most entanglement, and it can be used for simultaneous message exchange between Alice and Bob. Thus, our results suggest that the benefits of interactive communication in LOCC gate simulation mainly pertain to the entanglement cost of simulation rather than the entangling power of the simulated gate.
For a 2 ⊗ s system, we have shown that generic controlled unitary gates controlled from the 2-dimensional side require at least log(s) ebits to implement. Currently we do not know whether this lower bound is close to achievable. The known protocols have an ebit consumption that scales linearly with s and with some function of the error parameter, and it is an important open problem to determine whether this exponential gap can be closed. A more general theoretical question is whether every nonlocal gate can be perfectly implemented by LOBC using a finite amount of entanglement. Even for two qubits, our new protocol has some failure probability unless U(α, β, γ) has special angles. It is unknown whether a protocol with no failure branches exists for every U(α, β, γ).