Compiler Design for Distributed Quantum Computing

In distributed quantum computing architectures, with the network and communications functionalities provided by the Quantum Internet, remote quantum processing units can communicate and cooperate to execute computational tasks that single noisy intermediate-scale quantum devices cannot handle by themselves. To this end, distributed quantum computing requires a new generation of quantum compilers for mapping any quantum algorithm to any distributed quantum computing architecture. With this perspective, in this article we first discuss the main challenges arising in compiler design for distributed quantum computing. Then, we analytically derive an upper bound on the overhead induced by quantum compilation for distributed quantum computing. The derived bound accounts for the overhead induced by the underlying computing architecture as well as the additional overhead induced by the suboptimal quantum compiler, which is expressly designed in this article to achieve three key features: general-purpose, efficient, and effective. Finally, we validate the analytical results and confirm the validity of the compiler design through an extensive performance analysis.


Introduction
Current quantum computers are commonly described as noisy intermediate-scale quantum (NISQ) devices, characterized by a few dozen quantum bits (qubits) with non-uniform quality and highly constrained physical connectivity.
Hence, the growing demand for large-scale quantum computers is motivating research on distributed quantum computing architectures [1,2], and experimental efforts have demonstrated some of the building blocks for such a design [3]. Indeed, with the network and communications functionalities provided by the Quantum Internet [1,2,4-11], remote quantum processing units (QPUs) can communicate and cooperate, through the distributed computing paradigm, as a virtual quantum processor with a number of qubits that scales linearly with the number of remote QPUs [12], executing computational tasks that no single NISQ device can handle by itself.
As overviewed in recent literature such as [3,12], several challenges arise with the design of a distributed quantum computing architecture. In the following, we focus on the problem of designing a quantum algorithm compiler for distributed quantum compilation.
Compiling a quantum algorithm means translating a hardware-agnostic description of the algorithm (i.e., the quantum circuit) into a functionally equivalent one that takes into account the physical constraints of the underlying computing architecture (i.e., the compiled quantum circuit). When it comes to distributed computing architectures, two main issues arise with the compiler design.
First, a fundamental question arises with distributed computation: at what price? Indeed, distributed computation requires the different processors to communicate with each other for coordination and data exchange, and these tasks introduce an overhead that strongly depends on the particulars of the distributed computing architecture. For instance, the induced overhead becomes more severe as the connectivity between the QPUs shrinks or as the number of qubits stored at each QPU decreases. Hence, from a compiling perspective, it is crucial to estimate the effects of this overhead on the compiled quantum circuit, which are generally measured as the depth of the compiled circuit relative to the depth of the original one.
Second, compiling a quantum circuit is a very challenging task even for a single-processor architecture, since it is an NP-complete problem [13]. Hence, optimal circuit compilation for distributed quantum architectures can be achieved only for very small circuit instances. Conversely, the compilation of medium-to-large circuits of practical value induces an additional overhead, whose severity depends on the suboptimality of the quantum compiler, that further increases the depth of the compiled circuit.
With this in mind, in this paper we analytically derive an upper bound on the overhead induced by quantum circuit compilation for distributed quantum computing:
• by considering the overhead induced by the worst-case scenario for a distributed quantum computing architecture, namely a scenario characterized by i) the lowest possible number of qubits at each QPU, and ii) the poorest connectivity among the QPUs;
• and by considering the additional overhead induced by a sub-optimal quantum compiler.
Clearly, with reference to the last point, the additional overhead strongly depends on the particulars of the quantum compiler. For this reason, throughout the manuscript we design a quantum compiler with three key features:
• general-purpose, namely, requiring no particular assumptions on the quantum circuits to be compiled;
• efficient, namely, exhibiting polynomial-time computational complexity so that it can successfully compile medium-to-large circuits of practical value;
• effective, namely, with the total circuit depth overhead induced by the quantum circuit compilation always upper-bounded by a factor that grows linearly with the number of logical qubits of the original quantum circuit.
The rest of the paper is organized as follows. In Section 2 we review some preliminaries about quantum circuits and quantum compilers. Then, in Section 3 we detail the problem of circuit compilation for distributed quantum computing, discussing the challenges that arise with the compiler design and the relevant literature. These basics are crucial for understanding the compiler design as well as the analytical derivation of the overhead bound given in Section 4. Then, Section 5 presents the performance analysis for the proposed compiler design. In particular, we present the implementation of a compiler that is able to cope with the worst-case scenario for a distributed quantum computing architecture; we validate the compiling overhead upper bound; we illustrate experimental results regarding the compilation of several quantum circuits, with our compiler compared to a state-of-the-art solution. Finally, Section 6 concludes the paper.

Background
We refer the reader to [14] for an introduction to the conceptual and notational differences separating quantum computing from conventional computing, and to [15] for an in-depth treatment of the subject.
In this work, we consider QPUs that support the quantum circuit model [16], which is the most popular and developed model for quantum computation. A quantum circuit is a model of a quantum algorithm, where quantum operators are described as quantum gates. A quantum circuit is still a logical abstraction, not to be confused with its realization on an actual quantum hardware device. Hence, in the following, the abstract qubits subjected to quantum gates as specified by the quantum circuit are called logical qubits to distinguish them from the physical qubits embedded within a quantum processor. Figure 1 shows a simple quantum circuit, where each horizontal line represents the time-evolution of the state of a single logical qubit, with time flowing from left to right, dictating the order of execution of the different gates. More specifically, gates affecting the same qubit must be executed sequentially, in agreement with intuition. Conversely, gates acting on different qubits can be performed simultaneously as long as the "ordering" arising from gates affecting multiple qubits is respected. This concept underlies the notion of layer, i.e., a set of gates that can be performed simultaneously because they act on disjoint sets of qubits. The number of layers in a quantum circuit is referred to as the circuit depth. As an example, the quantum circuit given in Figure 1 is composed of 9 layers and hence its depth is equal to 9. The number of gates within the circuit is referred to as the circuit size.
Figure 1: Example of a 5-qubit quantum circuit from [17], with each horizontal line representing the time-evolution of the state of a single logical qubit.
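The notions of layer, depth, and size can be made concrete with a minimal sketch. It assumes a hypothetical encoding in which a circuit is an ordered list of gates and each gate is the tuple of logical-qubit indices it acts on; the function name `circuit_depth` and the toy circuit are illustrative and do not come from the article. Each gate is placed in the earliest layer that follows the last gate on any of its qubits.

```python
from typing import List, Tuple

Gate = Tuple[int, ...]  # qubit indices a gate acts on (hypothetical encoding)


def circuit_depth(num_qubits: int, gates: List[Gate]) -> int:
    """Number of layers when every gate is scheduled as early as possible."""
    # busy_until[q] = index of the last layer that already uses qubit q
    busy_until = [0] * num_qubits
    for qubits in gates:
        # A gate must wait for every qubit it touches to become free
        layer = max(busy_until[q] for q in qubits) + 1
        for q in qubits:
            busy_until[q] = layer
    return max(busy_until, default=0)


# Toy 3-qubit circuit: one-qubit gate on q0, CNOT(q0, q1), CNOT(q1, q2), one-qubit gate on q0
toy_circuit = [(0,), (0, 1), (1, 2), (0,)]
print("size :", len(toy_circuit))
print("depth:", circuit_depth(3, toy_circuit))
```

In the toy circuit the last single-qubit gate on q0 shares a layer with the CNOT on q1 and q2, since the two gates act on disjoint qubits; this is why the depth is smaller than the size.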


Figure 2: Coupling map of the IBM Melbourne quantum processor [21]. The fifteen physical qubits are represented by circles. The arrows denote the possibility to realize a two-qubit CNOT gate between the connected qubits, with the arrow pointing toward the target qubit. As an example, a CNOT between qubits Q1 (control) and Q0 (target) can be directly executed by the quantum processor, whereas a CNOT between qubits Q2 and Q0 cannot.

Quantum Compilation
Given a quantum algorithm, there exist several equivalent quantum circuits modeling the same computation with a different arrangement or different ordering of gates. Circuits with fewer gates (i.e., with lower size) may be preferred to reduce the circuit complexity. However, the execution time of the circuit, rather than its size, is generally considered the key factor to be optimized [18,19]. The rationale is to keep the execution time of the quantum circuit within the coherence time of the underlying quantum hardware architecture [15,20]. As a first approximation, the execution time increases with the number of layers. Therefore, it is crucial to build, for a given quantum algorithm, a quantum circuit characterized by the lowest possible depth. However, two issues arise as a consequence of the quantum processor characteristics.
First, even if there exists an uncountable number of quantum logic gates, the set of gates that can be executed on a certain quantum processor can be limited, as a consequence of the constraints imposed by the underlying qubit technology [3]. In this case, any gate outside this reduced set must be obtained with a proper combination of the allowed gates through a process known as gate synthesis.
Furthermore, regardless of the underlying qubit technology, any quantum processor exhibits physical constraints (arising as a consequence of the noise and the physical-space limitations) on the possible interactions between the different physical qubits. For example, CNOT gates cannot be applied to any physical qubit pair, but they are instead restricted to certain pairs, as shown in Figure 2 with the coupling map of an IBM quantum processor.
From the above, it becomes clear that the execution of a quantum algorithm on a certain quantum processor requires that: i) each logical qubit of the quantum circuit is mapped onto a physical qubit of the quantum processor, and ii) each CNOT operation between non-adjacent (within the coupling map) physical qubits is mapped into a sequence of CNOT operations between adjacent physical qubits, as shown in Figure 3. Regarding i), NISQ technology may in fact require a logical qubit to be mapped onto several physical qubits to implement proper fault-tolerant techniques [23]; nevertheless, in the following we assume a one-to-one mapping for the sake of clarity, without any loss of generality. Regarding ii), the state transfer strategy based on SWAPs is usually preferred over the ancilla strategy [24-26].
This process, known as quantum compilation, must be optimized so that the depth of the compiled circuit (i.e., the equivalent quantum circuit satisfying all the constraints imposed by the quantum processor) is minimized [24,25,27,28].
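To make requirement ii) concrete, the following sketch routes a CNOT between non-adjacent physical qubits with the SWAP-based state transfer strategy. It is only an illustration under stated assumptions: the coupling map is given as an undirected adjacency dictionary (i.e., CNOTs are assumed to be available in both directions between neighbours), the shortest path is found with a plain breadth-first search, and the names `shortest_path` and `route_cnot` are hypothetical. It is not the compiler designed in this article, and it makes no attempt to minimize the overall depth.

```python
from collections import deque
from typing import Dict, List, Tuple

Op = Tuple[str, int, int]  # (gate name, first qubit, second qubit)


def shortest_path(adj: Dict[int, List[int]], src: int, dst: int) -> List[int]:
    """Breadth-first search over the coupling map."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if dst not in parent:
        raise ValueError("qubits are not connected in the coupling map")
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def route_cnot(adj: Dict[int, List[int]], control: int, target: int) -> List[Op]:
    """Move the control state next to the target, apply the CNOT, move it back."""
    if control == target:
        raise ValueError("control and target must differ")
    path = shortest_path(adj, control, target)
    # Swap the control state along the path until it neighbours the target
    swaps = [("SWAP", path[i], path[i + 1]) for i in range(len(path) - 2)]
    cnot = [("CNOT", path[-2], path[-1])]
    # Undo the swaps so that every qubit recovers its initial position
    return swaps + cnot + swaps[::-1]


# Linear coupling map Q0 - Q1 - Q2 (toy example)
line = {0: [1], 1: [0, 2], 2: [1]}
print(route_cnot(line, 0, 2))
```

Since a SWAP can itself be synthesized from three CNOTs between the same pair of qubits, every extra hop on the path adds several layers to the compiled circuit, which is the kind of overhead a compiler tries to keep small.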

Compilers for Distributed Quantum Computing
As highlighted in Section 1, the demand for large-scale quantum computers is motivating research on distributed quantum computing architectures, where multiple small-scale quantum processors interact and cooperate through the Quantum Internet to solve challenging computational tasks. As a consequence, a new generation of quantum compilers is needed for mapping any quantum algorithm to any distributed quantum computing architecture.
Let us consider a toy model for distributed quantum computing, in which a generic quantum algorithm must be executed on two quantum processors interconnected by a quantum link, as shown in Figure 4.

Challenges
Several challenges arise with the design of a quantum compiler for mapping an arbitrary quantum circuit onto a distributed quantum computing architecture, as discussed in the following.

Data Qubits vs Communication Qubits
Similarly to classical distributed computing, a key requirement for distributed quantum computing is the possibility to perform remote operations, namely operations between qubits stored at different processors. Unlike in the classical domain, however, quantum mechanics does not allow an unknown qubit to be copied or even simply read or measured in any way without causing an irreversible loss of the quantum information stored within the qubit [14,15].
Thankfully, entanglement provides an invaluable tool for implementing remote operations without violating quantum mechanics [12]. Entanglement is a property of two (or more, in case of multipartite entanglement) quantum particles that exist in a special type of superposition state, such that any action on a particle instantaneously affects the other particle as well. This sort of quantum correlation, with no counterpart in the classical world, holds even when the particles are far away from each other. For an in-depth discussion about entanglement from an information engineering point of view, we refer the reader to [20].
By exploiting the availability of a Bell state (that is, a state of two maximally-entangled qubits) shared between the two remote processors, it is possible to perform a remote CNOT through a sequence of local CNOTs and single-qubit operations/measurements, as shown in Figure 4b.
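The remote CNOT can be checked numerically with a small state-vector sketch. Since the circuit of Figure 4b is not reproduced here, the gate ordering below is an assumption: it follows the standard entanglement-assisted construction (two local CNOTs, a Z-basis measurement of one communication qubit, an X-basis measurement of the other, and an X or Z correction on the remote data qubit for each measured bit). The qubit layout, the helper names, and the input amplitudes are all illustrative.

```python
import math

# Bit positions of the four qubits in the basis-state index (assumed layout)
DATA_A, COMM_A, COMM_B, DATA_B = 0, 1, 2, 3


def bit(index, q):
    return (index >> q) & 1


def cnot(state, control, target):
    out = [0j] * len(state)
    for i, amp in enumerate(state):
        out[i ^ (1 << target) if bit(i, control) else i] += amp
    return out


def pauli_x(state, q):
    out = [0j] * len(state)
    for i, amp in enumerate(state):
        out[i ^ (1 << q)] += amp
    return out


def pauli_z(state, q):
    return [-amp if bit(i, q) else amp for i, amp in enumerate(state)]


def hadamard(state, q):
    s = 1 / math.sqrt(2)
    out = [0j] * len(state)
    for i, amp in enumerate(state):
        sign = -1 if bit(i, q) else 1
        out[i & ~(1 << q)] += s * amp        # contribution to the |0> component
        out[i | (1 << q)] += sign * s * amp  # contribution to the |1> component
    return out


def project(state, q, outcome):
    """Normalized post-measurement state for a given outcome on qubit q."""
    kept = [amp if bit(i, q) == outcome else 0j for i, amp in enumerate(state)]
    norm = math.sqrt(sum(abs(a) ** 2 for a in kept))
    if norm < 1e-12:
        raise ValueError("outcome has zero probability")
    return [a / norm for a in kept]


def remote_cnot(data_amps, m1, m2):
    """Data-qubit amplitudes after the protocol, in the branch with outcomes (m1, m2)."""
    s = 1 / math.sqrt(2)
    state = [0j] * 16
    for (a, b), amp in data_amps.items():
        for e in (0, 1):  # Bell state (|00> + |11>)/sqrt(2) on the communication qubits
            state[(a << DATA_A) | (e << COMM_A) | (e << COMM_B) | (b << DATA_B)] += s * amp
    state = cnot(state, DATA_A, COMM_A)   # local CNOT at processor A
    state = cnot(state, COMM_B, DATA_B)   # local CNOT at processor B
    state = project(state, COMM_A, m1)    # Z-basis measurement, result sent to B
    state = project(hadamard(state, COMM_B), COMM_B, m2)  # X-basis measurement, result sent to A
    if m1:
        state = pauli_x(state, DATA_B)    # conditional correction on the target
    if m2:
        state = pauli_z(state, DATA_A)    # conditional correction on the control
    return {(a, b): state[(a << DATA_A) | (m1 << COMM_A) | (m2 << COMM_B) | (b << DATA_B)]
            for a in (0, 1) for b in (0, 1)}


def fidelity(u, v):
    return abs(sum(complex(u[k]).conjugate() * v[k] for k in u)) ** 2


# Arbitrary normalized two-qubit input; the ideal output is CNOT(control=A, target=B)
data_in = {(0, 0): 0.5, (0, 1): 0.5j, (1, 0): -0.5, (1, 1): 0.5}
ideal = {(a, a ^ b): amp for (a, b), amp in data_in.items()}
for m1 in (0, 1):
    for m2 in (0, 1):
        print((m1, m2), fidelity(ideal, remote_cnot(data_in, m1, m2)))
```

Enumerating the four measurement branches, rather than sampling them, keeps the check deterministic. In every branch the fidelity with the ideal CNOT output should equal 1 up to floating-point error, since the branches differ from the ideal state only by a global phase.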

Figure 3: Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOT into a sequence of CNOTs that can be directly executed by a given quantum processor.
(a) Reversing CNOT [22]. A CNOT between Q0 (control) and Q1 (target) can be executed with the coupling map given in Figure 2 by performing a CNOT between Q1 (control) and Q0 (target) sandwiched between two H gates. We note that the IBM Melbourne processor (shown in Figure 2) natively supports CNOTs in both directions between neighbor qubits.
(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed through either: i) quantum state transfer, by first swapping qubits Q0 and Q1 so that Q0 and Q2 become adjacent qubits in the coupling map, then by performing a CNOT between Q0 and Q2, and finally by swapping again qubits Q0 and Q1 so that they recover their initial position, or ii) ancilla qubit, by performing four CNOT operations between neighbour qubits with the help of the intermediate qubit Q1.
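The CNOT reversal of Figure 3a can be verified with plain matrix arithmetic. The sketch below assumes the basis ordering |q0 q1> (index = 2·q0 + q1) and, following the usual reading of "sandwiched between two H gates", applies a Hadamard to both qubits before and after the reversed CNOT. It also checks the well-known decomposition of a SWAP into three alternating CNOTs, which is not stated in the caption but underlies the state transfer strategy. The helper names are illustrative.

```python
import math


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def kron(a, b):
    """Kronecker (tensor) product of two matrices."""
    return [[a[i][j] * b[k][l] for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]


def close(a, b, tol=1e-9):
    return all(abs(x - y) < tol for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


s = 1 / math.sqrt(2)
H = [[s, s], [s, -s]]
# Basis order |q0 q1>, i.e. index = 2*q0 + q1
CNOT_01 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]  # control q0, target q1
CNOT_10 = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]  # control q1, target q0
SWAP = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

HH = kron(H, H)
reversed_cnot = matmul(matmul(HH, CNOT_10), HH)           # H gates around the reversed CNOT
three_cnots = matmul(matmul(CNOT_01, CNOT_10), CNOT_01)   # SWAP built from three CNOTs

print("reversal identity holds:", close(reversed_cnot, CNOT_01))
print("SWAP from three CNOTs  :", close(three_cnots, SWAP))
```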
Figure 4: Toy model for distributed quantum computing, with two quantum processors interconnected through a quantum network. Figure 4a shows the network topology along with the processors' coupling maps, whereas Figure 4b provides the quantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed to execute a remote operation.
(a) Two IBM Yorktown quantum processors are interconnected with a quantum link and a classical link. The classical link is used to transmit classical information, whereas the quantum link is needed to distribute Bell states (that is, maximally-entangled two-qubit states) between remote processors to execute remote operations. Indeed, at least one physical qubit at each processor must be reserved for storing the Bell state, as discussed in Figure 4b. Qubits of this kind, dark-blue-colored in the figure, are called communication qubits [2,29] to distinguish them from the remaining physical qubits, white-colored in the figure, which are devoted to computing and referred to as data qubits.
(b) To perform a CNOT between remote physical qubits stored at different processors, say data qubits Q4 and Q1 in Figure 4a, a Bell state such as |Φ+⟩ must be distributed through the quantum link so that each pair member is stored within a communication qubit at each processor. Once the Bell state is available, the remote CNOT is obtained with a local CNOT between the data and the communication qubit at each processor, followed by a conditional gate on the data qubit depending on the measurement of the remote communication qubit. The double line denotes the transmission of one bit of classical information (i.e., the measurement output) between the remote processors.
To distribute Bell states between different quantum processors, at least one qubit at each processor, referred to as communication qubit [29] to distinguish it from the remaining data qubits devoted to processing, must be reserved for remote inter-processor operations. Hence, a crucial trade-off between communication and data qubits arises. Specifically, for each remote CNOT, a Bell state is consumed (Figure 4b) and a new Bell state must be distributed between the remote processors through the quantum link before another remote CNOT can be executed. Hence, the more communication qubits are available within a processor, the more remote CNOTs can be executed in parallel, reducing the overhead induced by the distributed computation. But the more communication qubits are available for inter-processor communication, the fewer valuable resources (i.e., data qubits) are available for computing.
It is unlikely that a data qubit could be dynamically turned into a communication qubit during the compilation, given the dedicated hardware, such as a matter-flying qubit interface [20], required for entanglement distribution. Conversely, it is reasonable to envision that the distributed quantum compiler could easily reserve, when multiple communication qubits are available at the same processor, a subset of the communication qubits for computing. This optimization task represents an interesting yet unaddressed open problem.
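The trade-off can be illustrated with a deliberately simplified counting model. The assumptions are ours and not the article's: two processors with the same number of physical qubits, a batch of mutually independent remote CNOTs, and one Bell state held per communication qubit at a time, so that a batch needs one round of entanglement distribution for every group of remote CNOTs executed in parallel. The function name `tradeoff` is hypothetical.

```python
import math


def tradeoff(physical_qubits: int, comm_qubits: int, remote_cnots: int):
    """Return (data qubits left per processor, entanglement-distribution rounds)."""
    if not 1 <= comm_qubits < physical_qubits:
        raise ValueError("need at least one communication qubit and one data qubit")
    data_qubits = physical_qubits - comm_qubits
    # Each round serves at most one remote CNOT per communication qubit
    rounds = math.ceil(remote_cnots / comm_qubits)
    return data_qubits, rounds


# Five physical qubits per processor, six independent remote CNOTs to execute
for k in (1, 2, 3):
    data, rounds = tradeoff(5, k, 6)
    print(f"{k} communication qubit(s): {data} data qubits, {rounds} rounds")
```

Under these assumptions, each additional communication qubit shortens the sequence of remote operations but removes one data qubit from the computation, which is the tension described above.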

Dynamic Connectivity
As mentioned in Section 2.1, with single-processor quantum computing all the constraints on the possible interactions between different qubits, arising from the underlying physical computing architecture, can be effectively represented with a coupling map. Formally, a coupling map is a visual representation of the directed graph G = (V, E), where V = {v_i} denotes the set of vertices representing the qubits and E = {e_{i,j}}, with v_i, v_j ∈ V, denotes the set of directed
A CNOT between qubits Q 0 (control) and Q 2 (target) can be executed through either: i) quantum state transfer, by first swapping qubits Q 0 and Q 1 so that Q 0 and Q 2 become adjacent qubits in the coupling maps, then by performing a CNOT between Q 0 and Q 2 , and finally by swapping again qubits Q 0 and Q 1 so that they recover their initial position, or ii) ancilla qubit, by performing four CNOT operations between neighbour qubits with the help of the intermediate qubit Q 1 .

FIGURE 3:
Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOT into a sequence of CNOTs that can be directly executed by a given quantum processor.
(a) Two IBM Yorktown quantum processors are interconnected with a quantum link and a classical link. The classical link is used to transmit classical information, whereas the quantum link is needed to distribute Bell states -that is, maximally-entangled two-qubit states -between remote processors to execute remote operations. Indeed, at least one physical qubit at each processor must be reserved for storing the Bell state, as discussed in Figure 4b. This kind of qubits -dark-blue-colored in the figure -are called communication qubits [2], [29] to distinguish them from the remaining physical qubits -white-colored in the figuredevoted to computing and referred to as data qubits.

QUANTUM PROCESSOR #1
QUANTUM PROCESSOR #2 To perform a CNOT between remote physical qubits stored at different processors -say data qubits Q 4 and Q 1 in Figure 4a -a Bell state such as |Φ + must be distributed through the quantum link so that each pair member is stored within a communication qubit at each processor. Once the Bell state is available, the remote CNOT is obtained with a local CNOT between the data and the communication qubit at each processor, followed by a conditional gate on the data qubit depending on the measurement of the remote communication qubit. The double line denotes the transmission of one bit of classical information -i.e., the measurement output -between the remote processors.

FIGURE 4:
Toy-model for distributed quantum computing, with two quantum processors interconnected through a quantum network. Figure 4a shows the network topology along with the processors coupling maps, whereas Figure 4b provides the quantum circuit detailing the classical (2 bits) and the quantum (the Bell state) resources needed to execute a remote operation.
(b) A CNOT between qubits Q0 (control) and Q2 (target) can be executed through either: i) quantum state transfer, by first swapping qubits Q0 and Q1 so that Q0 and Q2 become adjacent in the coupling map, then performing a CNOT between Q0 and Q2, and finally swapping qubits Q0 and Q1 again so that they recover their initial positions; or ii) an ancilla qubit, by performing four CNOT operations between neighboring qubits with the help of the intermediate qubit Q1.

FIGURE 3:
Example of equivalent quantum circuits generated during the quantum compilation for mapping an arbitrary CNOT into a sequence of CNOTs that can be directly executed by a given quantum processor.
instantaneously the other particle as well. This sort of quantum correlation, with no counterpart in the classical world, holds even when the particles are far away from each other. For an in-depth discussion about entanglement from an information engineering point of view, we refer the reader to [20]. By exploiting the availability of a Bell state, that is, a state of two maximally entangled qubits, shared between the two remote processors, it is possible to perform a remote CNOT through a sequence of local CNOTs and single-qubit operations and measurements, as shown in Figure 4b.
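The remote CNOT described above can be checked with a small state-vector simulation. The following is only a minimal sketch in plain Python: the dictionary-based state representation, the qubit ordering, the helper names, and the particular placement of the two classically controlled corrections (an X on the target data qubit and a Z on the control data qubit) are assumptions made for illustration, chosen to be consistent with the description of Figure 4b rather than copied from it.

```python
import math
from collections import defaultdict

# A state is a dict mapping bit tuples to complex amplitudes.
# Assumed qubit order: 0 = control data qubit and 1 = communication qubit
# at processor #1; 2 = communication qubit and 3 = target data qubit at
# processor #2.

def cnot(state, control, target):
    out = {}
    for bits, amp in state.items():
        b = list(bits)
        b[target] ^= b[control]
        out[tuple(b)] = amp
    return out

def x_gate(state, q):
    return {b[:q] + (b[q] ^ 1,) + b[q + 1:]: amp for b, amp in state.items()}

def z_gate(state, q):
    return {b: (-amp if b[q] else amp) for b, amp in state.items()}

def h_gate(state, q):
    out = defaultdict(complex)
    for b, amp in state.items():
        for v in (0, 1):
            sign = -1 if (b[q] and v) else 1
            out[b[:q] + (v,) + b[q + 1:]] += sign * amp / math.sqrt(2)
    return dict(out)

def project(state, q, outcome):
    """Normalized post-measurement state, given qubit q was read as `outcome`."""
    kept = {b: amp for b, amp in state.items() if b[q] == outcome}
    norm = math.sqrt(sum(abs(amp) ** 2 for amp in kept.values()))
    return {b: amp / norm for b, amp in kept.items()}

def remote_cnot(psi, m1, m2):
    """Remote CNOT on the two-qubit state psi, for measurement outcomes (m1, m2)."""
    s = 1 / math.sqrt(2)
    # Attach the shared Bell state (|00> + |11>)/sqrt(2) on qubits 1 and 2.
    state = {(c, e, e, t): amp * s for (c, t), amp in psi.items() for e in (0, 1)}
    state = cnot(state, 0, 1)        # local CNOT at processor #1
    state = cnot(state, 2, 3)        # local CNOT at processor #2
    state = project(state, 1, m1)    # measure communication qubit at #1
    if m1:
        state = x_gate(state, 3)     # first classical bit: fix the remote target
    state = h_gate(state, 2)
    state = project(state, 2, m2)    # measure communication qubit at #2
    if m2:
        state = z_gate(state, 0)     # second classical bit: fix the remote control
    return {(b[0], b[3]): amp for b, amp in state.items()}
```

For any input state and any of the four outcome pairs, the returned state should equal the ideal CNOT output up to a global phase. Two classical bits are exchanged and one Bell state is consumed, in line with the resources listed for Figure 4b.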
To distribute Bell states between different quantum processors, at least one qubit at each processor, referred to as a communication qubit [29] to distinguish it from the remaining data qubits devoted to processing, must be reserved for remote inter-processor operations. Hence, a crucial trade-off between communication and data qubits arises. Specifically, for each remote CNOT, a Bell state is consumed (Figure 4b) and a new Bell state must be distributed between the remote processors through the quantum link before another remote CNOT can be executed. Hence, the more communication qubits are available within a processor, the more remote CNOTs can be executed in parallel, reducing the overhead induced by the distributed computation. But the more communication qubits are available for inter-processor communication, the fewer valuable resources, i.e., data qubits, are available for computing.
It is unlikely that a data qubit could be dynamically turned into a communication qubit during compilation, given the dedicated hardware, such as a matter-flying qubit interface [20], required for entanglement distribution. Conversely, it is reasonable to envision that the distributed quantum compiler could easily reserve, when multiple communication qubits are available at the same processor, a subset of the communication qubits for computing. This optimization task represents an interesting yet unaddressed open problem.

Dynamic Connectivity
As mentioned in Section 2.1, with single-processor quantum computing all the constraints on the possible interactions between different qubits, arising from the underlying physical computing architecture, can be effectively represented with a coupling map. Formally, a coupling map is a visual representation of the directed graph $G = (V, E)$, where $V = \{v_i\}$ denotes the set of vertices representing the qubits and $E = \{e_{i,j}\}_{v_i, v_j \in V}$ denotes the set of directed edges representing the possibility to perform a CNOT with $v_i$ and $v_j$ acting as control and target qubit, respectively. But when it comes to distributed quantum computing, a new kind of constraint arises as a consequence of the underlying physical network topology.
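As a concrete illustration of the graph $G = (V, E)$, the sketch below encodes a coupling map as a set of directed edges and queries it. The five-qubit, bow-tie-shaped map and its edge directions are hypothetical values chosen for the example, and the function names are not taken from any library.

```python
# Hypothetical five-qubit, bow-tie-shaped coupling map. The edge directions
# are illustrative assumptions, not a statement about a specific device.
V = {0, 1, 2, 3, 4}
E = {(0, 1), (0, 2), (1, 2), (3, 2), (3, 4), (4, 2)}

def directly_executable(control, target):
    """True if CNOT(control -> target) is a directed edge of G = (V, E)."""
    return (control, target) in E

def adjacent(u, v):
    """True if the two qubits share an edge in either direction."""
    return (u, v) in E or (v, u) in E

def needs_routing():
    """Unordered qubit pairs with no edge at all between them."""
    return sorted((u, v) for u in V for v in V if u < v and not adjacent(u, v))

print(directly_executable(0, 1))  # edge (0, 1) is in E
print(directly_executable(1, 0))  # the reversed edge is not
print(needs_routing())            # pairs the compiler must route through neighbors
```

Pairs returned by `needs_routing` are the ones for which a compiler has to fall back on the state-transfer or ancilla strategies.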
More in detail, similarly to single-processor quantum compiling, remote operations are restricted to certain fixed pairs. Specifically, they are restricted to pairs composed of data qubits directly connected to a communication qubit within the processor coupling map. For instance, with reference to the network topology shown in Figure 4a, a remote CNOT between data qubits Q2 and Q1 can be directly mapped onto the circuit given in Figure 4b. Conversely, a remote CNOT between data qubits Q2 and Q4 cannot be directly executed between the pair, but it requires distributing the operation through the neighbor qubits as shown in Figure 3b.

FIGURE 5:
With reference to Figure 4a, the two 5-qubit quantum processors constitute an 8-qubit virtual quantum processor with qubits interconnected through both local and remote CNOTs. While the local CNOTs can be concurrently executed, i.e., they are unconstrained, the parallel execution of multiple remote CNOTs is constrained by the availability of multiple Bell states, with one Bell state for each concurrent remote CNOT. Given that only one communication qubit is available at each processor in Figure 4a, out of four remote CNOTs only one can be executed at any time.
However, differently from single-processor compiling, remote operations are subject to two types of temporal constraints.
• Simultaneity Limitations. As previously discussed, each remote CNOT relies on the availability of a Bell state stored within a communication qubit. Hence, even if remote CNOTs can, in principle, be executed between different remote pairs, the number of remote CNOTs that can be executed simultaneously between two processors is limited by the number of communication qubits jointly available at each processor. With reference to Figure 5, out of four possible remote CNOTs (denoted with blue arrows), only one can be executed at any time.
• Consecutiveness Limitations. Each remote CNOT consumes a Bell state as a consequence of the measurement operations on the communication qubits [20]. Accordingly, a new Bell state must be generated and distributed through the quantum link to the communication qubits before a subsequent remote CNOT can be executed. And even if the Bell state distribution can start right after the measurements, it is reasonable to assume, given the several orders of magnitude separating the intra-processor qubit distance from the inter-processor one, that the time needed to entangle the communication qubits significantly exceeds the time required for local CNOTs. Accordingly, we have two major issues. First, the "clock" of the remote operations will be significantly slower than the "clock" of the local operations and, hence, it becomes fundamental to minimize the number of remote, rather than local, operations to preserve the quantum information integrity from decoherence (Section 2.1). Furthermore, there may be periods of time, following the execution of a remote operation up to the successful distribution of a new Bell state, during which the quantum processors are disconnected and only local operations are possible.
These additional constraints must be properly modeled within the coupling map, so that the distributed quantum compiler can optimize the quantum circuit by accounting for the temporal dynamics arising with the distributed architecture. And this represents an open problem.
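To make the two temporal constraints concrete, the following sketch computes greedy start times for a batch of remote CNOTs between two processors. It is a toy model under explicit assumptions: Bell states are already available at time 0, entanglement distribution takes a fixed number of time steps `t_ent` (in practice it is probabilistic), and the function and parameter names are hypothetical.

```python
import heapq

def schedule_remote_cnots(num_cnots, comm_pairs, t_ent, t_cnot=1):
    """Greedy start times for remote CNOTs between two processors.

    comm_pairs: number of communication-qubit pairs (one Bell state each).
    t_ent:      time steps to distribute a fresh Bell state (assumed constant).
    t_cnot:     time steps taken by the remote CNOT itself.
    Bell states are assumed to be available on every pair at time 0.
    """
    ready = [0] * comm_pairs      # time at which each pair holds a Bell state
    heapq.heapify(ready)
    starts = []
    for _ in range(num_cnots):
        t = heapq.heappop(ready)  # simultaneity: wait for a free Bell state
        starts.append(t)
        # Consecutiveness: the Bell state is consumed and must be redistributed.
        heapq.heappush(ready, t + t_cnot + t_ent)
    return starts

print(schedule_remote_cnots(4, comm_pairs=1, t_ent=10))  # one pair: fully serialized
print(schedule_remote_cnots(4, comm_pairs=2, t_ent=10))  # two pairs: two per round
```

With a single communication-qubit pair the four remote CNOTs are fully serialized, whereas a second pair halves the number of rounds; the gaps between start times correspond to the periods during which only local operations are possible.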

Augmented Connectivity
As shown in Figure 3b, single-processor quantum computing must resort to either the state transfer (swapping) or the ancilla strategy to implement a CNOT between non-adjacent (within the coupling map) physical qubits. The rationale for this lies in the impossibility of direct interactions between distant qubits. And the further apart the qubits are within the coupling map, the longer the sequence of additional CNOTs required, regardless of the adopted strategy.
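The equivalence of the two single-processor strategies can be verified exhaustively on three qubits. The sketch below relies on the fact that a circuit made only of CNOTs permutes the computational basis states without introducing phases, so two such circuits implement the same unitary exactly when they agree on every basis state. The gate orderings and the three-CNOT decomposition of the SWAP are one possible choice, assumed here for illustration.

```python
from itertools import product

def run(circuit, bits):
    """Apply a list of (control, target) CNOTs to a tuple of classical bits."""
    b = list(bits)
    for control, target in circuit:
        b[target] ^= b[control]
    return tuple(b)

def swap(a, b):
    """A SWAP decomposed into three CNOTs."""
    return [(a, b), (b, a), (a, b)]

target_gate = [(0, 2)]                               # desired CNOT(Q0 -> Q2)
ancilla = [(1, 2), (0, 1), (1, 2), (0, 1)]           # four neighbor CNOTs via Q1
state_transfer = swap(0, 1) + [(1, 2)] + swap(0, 1)  # swap, CNOT, swap back

def equivalent(c1, c2, n=3):
    """True if both CNOT circuits act identically on all n-bit basis states."""
    return all(run(c1, x) == run(c2, x) for x in product((0, 1), repeat=n))

print(equivalent(ancilla, target_gate), len(ancilla))
print(equivalent(state_transfer, target_gate), len(state_transfer))
```

In this decomposition the ancilla strategy uses four CNOTs and the state-transfer strategy uses seven, which illustrates how the overhead grows with the distance between the qubits.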
Conversely, distributed quantum computing can exploit a strategy, called entanglement swapping [3] and summarized in Figure 6, to implement a remote CNOT between qubits stored at remote processors, even if the processors are not directly connected through a quantum link.
In a nutshell, to distribute a Bell state between remote processors, say quantum processors #1 and #3 in Figure 6a, two Bell states must first be distributed through the quantum links so that one Bell state is shared between the first processor and an intermediate node and another Bell state is shared between the same intermediate node and the second processor. Then, by performing a Bell state measurement (consisting of a CNOT and an H gate, followed by a joint measurement) on the communication qubits at the intermediate node, i.e., qubits Q0 and Q3 in Figure 6b, a Bell state is obtained at the remote communication qubits, i.e., qubits Q0 and Q3 in Figure 6a, by applying some local processing at the remote nodes depending on the (classical) output of the Bell state measurement.
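The swapping procedure just described can be checked with a short plain-Python simulation. This is a sketch under stated assumptions: the qubit ordering, the helper names, and the choice of applying both corrections (X and Z) on the communication qubit of the last processor are illustrative, and the two initial Bell states are taken to be ideal.

```python
import math
from collections import defaultdict

# Assumed qubit order: 0 = communication qubit at processor #1,
# 1 and 2 = communication qubits at the intermediate processor #2,
# 3 = communication qubit at processor #3.

def cnot(state, control, target):
    out = {}
    for bits, amp in state.items():
        b = list(bits)
        b[target] ^= b[control]
        out[tuple(b)] = amp
    return out

def x_gate(state, q):
    return {b[:q] + (b[q] ^ 1,) + b[q + 1:]: amp for b, amp in state.items()}

def z_gate(state, q):
    return {b: (-amp if b[q] else amp) for b, amp in state.items()}

def h_gate(state, q):
    out = defaultdict(complex)
    for b, amp in state.items():
        for v in (0, 1):
            sign = -1 if (b[q] and v) else 1
            out[b[:q] + (v,) + b[q + 1:]] += sign * amp / math.sqrt(2)
    return dict(out)

def project(state, q, outcome):
    """Normalized post-measurement state, given qubit q was read as `outcome`."""
    kept = {b: amp for b, amp in state.items() if b[q] == outcome}
    norm = math.sqrt(sum(abs(amp) ** 2 for amp in kept.values()))
    return {b: amp / norm for b, amp in kept.items()}

def entanglement_swap(m1, m2):
    """State of qubits (0, 3) after swapping, for measurement outcomes (m1, m2)."""
    # Two ideal Bell states: one on qubits (0, 1), one on qubits (2, 3).
    state = {(a, a, b, b): 0.5 for a in (0, 1) for b in (0, 1)}
    state = cnot(state, 1, 2)        # Bell state measurement at processor #2
    state = h_gate(state, 1)
    state = project(state, 1, m1)
    state = project(state, 2, m2)
    if m2:
        state = x_gate(state, 3)     # classically controlled corrections
    if m1:
        state = z_gate(state, 3)
    return {(b[0], b[3]): amp for b, amp in state.items()}
```

For each of the four possible outcomes, the end qubits are left in the state (|00⟩ + |11⟩)/√2, while both Bell states initially held at the intermediate processor have been consumed.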
From the above, it becomes clear that entanglement swapping significantly increases the connectivity within the virtual quantum processor. For instance, qubit Q4 in Figure 6a can interact with just two qubits within the same processor via local CNOTs and two qubits within the neighbor processor via remote CNOTs. However, it can interact with two more qubits, i.e., Q1 and Q2, via entanglement swapping. And the higher the number of available quantum processors, the higher the number of possible interactions. Indeed, the number of additional interactions via entanglement swapping scales linearly with the number of available processors when only two communication qubits are available at each intermediate processor. If this constraint is relaxed, the number of additional interactions via entanglement swapping scales more than linearly.
However, it must be acknowledged that the augmented connectivity provided by entanglement swapping does not come for free. Indeed, entanglement swapping consumes the Bell states stored within the communication qubits at the intermediate processors. And the higher the number of intermediate processors, the higher the number of consumed Bell states.
Hence, a trade-off between "augmented connectivity" and "EPR cost" arises with entanglement swapping, and a distributed quantum compiler must carefully account for these pros and cons.
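The "EPR cost" side of this trade-off can be estimated with a simple path computation over the network of processors. The sketch below assumes that exactly one Bell state is consumed per quantum link along a shortest path (ignoring failed attempts and any entanglement purification); the function name and the example topology are hypothetical.

```python
from collections import deque

def bell_states_consumed(links, src, dst):
    """Bell states needed to entangle src and dst along a shortest path.

    links: iterable of undirected quantum links (u, v) between processors.
    Returns None if dst cannot be reached from src.
    """
    neighbors = {}
    for u, v in links:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    hops = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return hops[node]  # assumed: one Bell state per link traversed
        for nxt in neighbors.get(node, ()):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return None

line = [(1, 2), (2, 3), (3, 4)]          # processors connected in a line
print(bell_states_consumed(line, 1, 2))  # adjacent processors
print(bell_states_consumed(line, 1, 4))  # two intermediate processors
```

Under these assumptions the cost grows by one Bell state for every additional intermediate processor, which is the quantity a distributed compiler would weigh against the extra connectivity.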

Related Work
Most quantum computer proposals are based on variations of the nearest-neighbor, two-qubit, and concurrent execution (NTC) architecture [31]. Depending on the layout of qubits, there are three NTC architectures: 1D, 2D and 3D. The 1D model, called Linear Nearest Neighbor (LNN) [32], consists of qubits located in a single line. In this model, only two neighboring qubits can interact. This is the most challenging scenario. The effects of the LNN model on performance have been investigated for many relevant use cases, such as the quantum Fourier transform [33,34], Shor's algorithm [35,36], and adders [37].
Beals et al. [38] provided algorithms for efficiently moving and addressing quantum memory in parallel. These imply that the standard circuit model can be simulated with low overhead by a more realistic model of a distributed quantum computer. The authors show that for an LNN N-qubit architecture, O(N) time steps are necessary for performing N/2 two-qubit gates in parallel. However, it is worthwhile to note that the developed analysis does not consider any additional overhead induced by the compilation task, and the derived Big-O bound relies on a linear constant that is in the "many, many thousands" [39]. Conversely, in the following we develop an analysis that explicitly considers the additional overhead induced by the compilation task.

(a) By swapping the entanglement at the intermediate nodes, namely quantum processor #2, it is possible to distribute a Bell state between remote processors, namely processors #1 and #3, even if they are not adjacent, i.e., they are not directly connected through a quantum link. Hence, entanglement swapping enhances the network connectivity through virtual quantum links.
(b) A Bell state can be distributed between remote processors by swapping the entanglement at an intermediate node through local processing and classical communication.
(c) Dynamic coupling map for the network topology shown in Figure 6a. The solid blue lines denote remote CNOTs between adjacent processors, whereas the dotted blue lines denote remote CNOTs between distant processors achievable via entanglement swapping.
However, it must be acknowledged that the augmented connectivity provided by entanglement swapping does not come for free. Indeed, entanglement swapping consumes the Bell states stored within the communication qubits at Bell states.
Hence, a trade-off between "augmented connectivity" an "EPR cost" arises with entanglement swapping, and a dis tributed quantum compiler must carefully account for thes pros and cons.

B. RELATED WORK
Most quantum computer proposals are based on variation of the nearest-neighbor, two-qubit, and concurrent executio (NTC) architecture [31]. Depending on the layout of qubits there are three NTC architectures: 1D, 2D and 3D. The 1D model, called Linear Nearest Neighbor (LNN) [32], consist of qubits located in a single line. In this model, only tw neighboring qubits can interact. This is the most challengin scenario. The effects of the LNN model on performance hav been investigated for many relevant use cases, such as th quantum Fourier transform [33], [34], Shor's algorithm [35 [36], and adders [37].
Beals et al. [38] provided algorithms for efficiently movin and addressing quantum memory in parallel. These impl that the standard circuit model can be simulated with low (a) By swapping the entanglement at the intermediate nodes -namely, quantum processor #2 -it is possible to distribute a Bell state between remote processors -namely, processors #1 and #3 -even if they are not adjacent, i.e., they are not directly connected through a quantum link. Hence, entanglement swapping enhances the network connectivity through virtual quantum links.
Journal of L A T E X Class Files, Vol. .., No. 8, August 2015 (a) By swapping the entanglement at the intermediate nodes -namely, quantum processor #2 -it is possible to distribute a Bell state between remote processors -namely, processors #1 and #3 -even if they are not adjacent, i.e., they are not directly connected through a quantum link. Hence, entanglement swapping enhances the network connectivity through virtual quantum links.
A Bell state can be distributed between remote processors by swapping the entanglement at an intermediate node through local processing and classical communication.
(c) Dynamic coupling map for the network topology shown in Figure 6a. The solid blue lines denote remote CNOTs between adjacent processors, whereas the dotted blue lines denote remote CNOTs between distant processors achievable via entanglement swapping. nodes depending on the (classical) output of the Bell state measurement. From the above, it becomes clear that entanglement swapping significantly increases the connectivity within the virtual quantum processor. As an instance, qubit Q 4 in Figure 6a can interact with just two qubits within the same processor via local CNOTs and two qubits within the neighbor processor via remote CNOTs. However, it can interact with two more qubits -i.e., Q 1 and Q 2 -via entanglement swapping. And the higher is the number of available quantum processors, the higher is the number of possible interactions. Indeed, the number of additional interactions via entanglement swapping scales linearly with the number of available processors when only two communication qubits are available at each intermediate processor. If this constraint is relaxed, the number of additional interactions via entanglement swapping scales more than linearly.
However, it must be acknowledged that the augmented connectivity provided by entanglement swapping does not come for free. Indeed, entanglement swapping consumes the Bell states stored within the communication qubits at the intermediate processors. And the higher the number of intermediate processors, the higher the number of consumed Bell states.
Hence, a trade-off between "augmented connectivity" and "EPR cost" arises with entanglement swapping, and a distributed quantum compiler must carefully account for these pros and cons.

B. RELATED WORK
nodes depending on the (classical) output of the Bell state measurement. From the above, it becomes clear that entanglement swapping significantly increases the connectivity within the virtual quantum processor. For instance, qubit $Q_4$ in Figure 6a can interact with just two qubits within the same processor via local CNOTs and two qubits within the neighbor processor via remote CNOTs. However, it can interact with two more qubits, i.e., $Q_1$ and $Q_2$, via entanglement swapping. The higher the number of available quantum processors, the higher the number of possible interactions. Indeed, the number of additional interactions via entanglement swapping scales linearly with the number of available processors when only two communication qubits are available at each intermediate processor. If this constraint is relaxed, the number of additional interactions via entanglement swapping scales more than linearly.
However, the augmented connectivity provided by entanglement swapping does not come for free. Indeed, entanglement swapping consumes the Bell states stored within the communication qubits at the intermediate processors: the higher the number of intermediate processors, the higher the number of consumed Bell states.
Hence, a trade-off between "augmented connectivity" and "EPR cost" arises with entanglement swapping, and a distributed quantum compiler must carefully account for these pros and cons.

FIGURE 6:
(a) By swapping the entanglement at the intermediate nodes (namely, quantum processor #2), it is possible to distribute a Bell state between remote processors (namely, processors #1 and #3) even if they are not adjacent, i.e., they are not directly connected through a quantum link. Hence, entanglement swapping enhances the network connectivity through virtual quantum links. (c) Dynamic coupling map for the network topology shown in Figure 6a. The solid blue lines denote remote CNOTs between adjacent processors, whereas the dotted blue lines denote remote CNOTs between distant processors achievable via entanglement swapping.
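The trade-off between augmented connectivity and EPR cost can be made concrete with a minimal sketch. It assumes processors indexed along a line, one Bell state per traversed link, and one Bell state measurement per intermediate processor; the function name `swapping_cost` and the returned keys are hypothetical and do not come from the original work.

```python
def swapping_cost(src: int, dst: int) -> dict:
    """Resources for one remote CNOT between processors src and dst.

    Assumption: processors sit on a line and are identified by their index,
    so the number of traversed links equals the index distance.
    """
    hops = abs(dst - src)
    if hops == 0:
        raise ValueError("both qubits are on the same processor: local CNOT")
    return {
        # one link-level Bell state is consumed for every traversed link
        "link_bell_states": hops,
        # one Bell state measurement at every intermediate processor
        "swaps": hops - 1,
    }


if __name__ == "__main__":
    # Processors #1 and #3 connected through intermediate processor #2
    print(swapping_cost(1, 3))
```

Under these assumptions, the cost grows with the distance between the two processors, which is the price paid for the extra connectivity.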

B. RELATED WORK
Most quantum computer proposals are based on variations of the nearest-neighbor, two-qubit, and concurrent execution (NTC) architecture [31]. Depending on the layout of qubits, there are three NTC architectures: 1D, 2D and 3D. The 1D model, called Linear Nearest Neighbor (LNN) [32], consists of qubits located in a single line. In this model, only two neighboring qubits can interact. This is the most challenging scenario. The effects of the LNN model on performance have been investigated for many relevant use cases, such as the quantum Fourier transform [33], [34], Shor's algorithm [35], [36], and adders [37].
Beals et al. [38] provided algorithms for efficiently moving and addressing quantum memory in parallel. These imply that the standard circuit model can be simulated with low overhead by a more realistic model of a distributed quantum computer. The authors show that for an LNN N-qubit [...]
Zomorodi-Moghadam et al. [9] proposed a general approach, based on the Kernighan-Lin algorithm for graph partitioning, to optimize the number of teleportations for a DQC consisting of two spatially separated and long-distance quantum subsystems. The same authors also proposed an approach based on dynamic programming [40].
Andrés-Martínez and Heunen [41] proposed an approach that may distribute circuits across any number of quantum devices. The main idea is to turn the quantum circuit into a hypergraph, then find a partitioning that minimizes the number of cuts, as each cut corresponds to a Bell state shared across two QPUs by means of communication qubits. The partitioning problem is addressed by means of the KaHyPar solver [42]. The proposed solution has some drawbacks, in particular that there is no way to define the number of communication qubits of each QPU. In the software implementation of the algorithm, the number of available communication qubits is unlimited and cannot be constrained.

Compiler Design and Overhead Bounds
As discussed in Section 3, several additional constraints arise with the shift from single-processor to distributed quantum compiling. Given that single-processor quantum compiling has already been proved to be NP-complete [13], it is reasonable to expect that optimal distributed quantum compiling is an even harder task. For this reason, in the following, we take a different approach. Specifically, we aim at designing a general-purpose, efficient and effective compiler for distributed quantum computing. General-purpose because our compiler does not require any particular assumption on the quantum circuit to be compiled.
Efficient because, as proved in Section 5.1, our compiler exhibits a time complexity that grows polynomially with the number of logical qubits and linearly with the depth of the quantum circuit to be compiled.
Effective because, as proved in Section 4.2, our compiler assures a polynomial worst-case overhead in terms of both: i) the depth of the compiled quantum circuit and ii) the number of calls to the costliest and most challenging task, i.e., link entanglement generation.

System Model
We consider the worst-case scenario shown in Figure 7. In more detail, we assume that only one data qubit is available at each quantum processor⁶. The rationale for this choice is as follows. Whenever multiple data qubits are available at a single quantum processor, a local CNOT can be executed between these data qubits without incurring any overhead induced by the distributed computation. Conversely, with just one data qubit available at each processor, each and every CNOT within the quantum circuit must be mapped into a remote CNOT, and hence the overhead induced by the distributed computation is the highest possible.
Furthermore, we assume that the quantum processors are interconnected through a one-dimensional nearest-neighbor topology, as shown in Figure 7. Again, the rationale for this choice is to consider the worst-case scenario in terms of overhead induced by the distributed computation. In fact, the considered topology is characterized by the lowest possible number of communication qubits, i.e., $2n - 2$ with $n$ denoting the number of quantum processors, since the removal of any communication qubit would disconnect the network into two disjoint subsets of quantum processors. Moreover, the quantum processors are arranged in a line, rather than in a star, to maximize both the number of non-adjacent quantum processors and the maximum distance, in terms of hops, between two non-adjacent quantum processors.
From the above, it becomes clear that the considered architecture represents the worst-case scenario in terms of overhead induced by the distributed computation. Hence, the actual overhead induced by any real-world architecture will always be upper-bounded by the communication overhead induced by the considered architecture.
Clearly, we need to choose a metric for measuring the overhead induced by the distributed computation. As discussed in Section 2.1, there exists a general consensus on circuit depth as a key performance metric of circuit compilation. Hence, in the following, we measure the overhead in terms of number of additional layers required to distribute the computation of a single layer in the original quantum circuit. Furthermore, we also evaluate the overhead in terms of how many calls to the link entanglement generation process are required from the compiling algorithm.
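As an illustration of why the line is the harder arrangement, the following minimal sketch compares a line and a star with the same number of processors. It assumes one communication qubit per link endpoint; the helper names (`line_topology`, `star_topology`, `communication_qubits`, `max_hops`) are hypothetical and introduced only for this example.

```python
from collections import deque


def line_topology(n: int) -> dict:
    """Adjacency list of n processors arranged in a line."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}


def star_topology(n: int) -> dict:
    """Adjacency list of n processors with processor 0 as the hub."""
    adj = {0: list(range(1, n))}
    for i in range(1, n):
        adj[i] = [0]
    return adj


def communication_qubits(adj: dict) -> int:
    """Assumption: one communication qubit per link endpoint."""
    return sum(len(neighbors) for neighbors in adj.values())


def max_hops(adj: dict) -> int:
    """Largest hop distance between two processors (breadth-first search)."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


if __name__ == "__main__":
    n = 6
    for name, adj in (("line", line_topology(n)), ("star", star_topology(n))):
        print(name, communication_qubits(adj), max_hops(adj))
```

Both arrangements use $2n - 2$ communication qubits under this assumption, but the maximum distance is $n - 1$ hops on the line and only 2 hops on the star.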

Basic Strategies for Distributing CNOTs
Let us consider a single layer of the original n-qubit quantum circuit. Clearly, the number of CNOTs in each layer is lower than or equal to $n/2$, given that at most $n/2$ gates can be executed simultaneously (and thus belong to the same layer) by operating on different pairs of qubits.
As discussed in Section 4.1, we aim at considering the worst-case scenario in terms of overhead induced by the distributed computing architecture. Hence, each CNOT within the quantum circuit, given that it operates on physical qubits stored at different processors as discussed in Section 4.1, is a remote CNOT. As a consequence, the compiler must map at most $n/2$ remote CNOTs in each layer.
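The $n/2$ limit per layer follows from each qubit taking part in at most one gate per layer. A minimal sketch of this layering, assuming a circuit given as an ordered list of (control, target) pairs and a hypothetical helper `split_into_layers` that places each CNOT in the earliest layer where both of its qubits are free:

```python
def split_into_layers(cnots):
    """Group CNOTs into layers so that no qubit appears twice in a layer."""
    layers = []
    last_layer = {}  # qubit -> index of the last layer that uses it
    for control, target in cnots:
        # earliest layer after the last use of either qubit
        index = max(last_layer.get(control, -1), last_layer.get(target, -1)) + 1
        if index == len(layers):
            layers.append([])
        layers[index].append((control, target))
        last_layer[control] = last_layer[target] = index
    return layers


if __name__ == "__main__":
    n = 6
    circuit = [(0, 1), (2, 3), (1, 2), (4, 5)]
    for layer in split_into_layers(circuit):
        # a layer can never hold more than n/2 two-qubit gates
        assert len(layer) <= n // 2
        print(layer)
```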

Entanglement Swapping Based Strategy
The first strategy for implementing remote CNOTs is based on the entanglement swapping technique discussed in Section 3.1 and shown in Figure 6.
Accordingly, each remote CNOT is implemented by first generating link entanglement [43] between neighbor nodes. To this aim, different techniques for entanglement generation can be employed, depending on the particulars of the underlying qubit technology [20]. Nevertheless, link entanglements can be generated simultaneously, given that each processor is equipped with two communication qubits. Once generated, the entanglement is simultaneously⁷ swapped at the intermediate nodes so that a Bell state is distributed between the two remote processors and, finally, the remote CNOT is obtained as shown in Figure 4b.
⁷ In general, the capability to generate (and to re-generate, once depleted) and distribute entangled Bell states through different links in parallel depends on the quantum resources available, i.e., both the number of communication qubits at each processor and the inter-connection (shared bus vs. point-to-point) among the communication qubits. In contrast, the possibility to simultaneously swap the entanglement at the intermediate nodes depends only on classical resources, i.e., the possibility to simultaneously transfer classical information.

FIGURE 7:
Worst-case scenario in terms of overhead induced by the distributed computation: the quantum processors are interconnected through a one-dimensional nearest-neighbor topology, and only one data qubit is available at each quantum processor. Intra-processor coupling between communication qubits omitted for the sake of simplicity.

FIGURE 8:
[...] Figure 8a requires two preliminary tasks: i) Link Entanglement, for distributing the entanglement between neighbor nodes, and ii) Entanglement Swapping, for entangling the two remote processors involved within the CNOT. Clearly, the swapping task is omitted whenever the CNOT operates between data qubits stored at processors that are neighbors within the network topology, as for the i-th layer.

The entanglement swapping based strategy is outlined in Figure 8b in terms of basic tasks. Within the figure, the particulars of each task are omitted for the sake of clarity. For instance, entanglement swapping, although depicted as a single block, is actually obtained with a quantum circuit composed of three layers, as shown in Figure 6b. Similarly, link entanglement generation requires a quantum circuit with a depth equal to or greater than two, depending on the particulars of the quantum technology underlying entanglement generation and distribution [20].
Nevertheless, the figure⁸ provides a clear intuition of both: i) the sequentiality constraints between the different tasks, and ii) the parallelism achievable within each task. Specifically, whenever the CNOTs overlap⁹ within the network topology (as for the CNOTs of layer #i+1 in Figure 8a), they must be executed sequentially. Conversely, CNOTs that do not overlap (as for the CNOTs of the i-th layer in Figure 8a) can be executed simultaneously. Since we are interested in assessing the worst-case overhead induced by distributed computation, in the following we consider the worst-case scenario in which all the CNOTs of an arbitrary layer of the quantum circuit overlap within the network topology.
⁸ We note that, for the sake of simplicity, in Figure 8b we simply mapped the j-th logical qubit $Q_j$ of layer #i in Figure 8a onto the (j+1)-th processor, thus ignoring any optimization achievable with a proper mapping of the logical qubits of the quantum circuit onto the physical qubits of the quantum processor.
⁹ The term "overlap" indicates the case when the execution of the considered CNOTs involves overlapping sets of intermediate processors, as a consequence of the constraint we imposed on the network topology of having $2n - 2$ communication qubits. With reference to the example in Figure 8a, the CNOT between $Q_0$ and $Q_2$ in layer #i+1 overlaps with the CNOT between $Q_1$ and $Q_4$, since both need the communication qubits at processors #1 and #2. Conversely, the CNOT between $Q_3$ and $Q_5$ does not overlap with the CNOT between $Q_0$ and $Q_2$ and, hence, they can be performed in parallel.
Hence, the depth overhead of the entanglement swapping based strategy does not exceed:
$$\frac{n}{2}\, d_{es} \tag{2}$$
where $n$ denotes the number of logical qubits within the quantum circuit and $d_{es}$ is a constant factor (independent of the characteristics of the original quantum circuit) given by:
$$d_{es} = c_{le} + c_{bsm} + c_{cx}$$
with $c_{le}$ and $c_{bsm}$ denoting the number of layers required to perform the link entanglement task and the entanglement swapping task, respectively, and $c_{cx}$ denoting the number of layers required to perform a remote CNOT once the Bell state has been distributed between two processors. The actual values of $c_{le}$, $c_{bsm}$ and $c_{cx}$ depend on the particulars of the underlying hardware technology.
From (2), we have that the actual depth of an arbitrary d-depth quantum circuit compiled with the entanglement swapping based strategy will always be lower than $\frac{n}{2} d$, neglecting the constant $d_{es}$. Hence, the depth overhead grows linearly with the number of logical qubits of the quantum circuit to be compiled. And given that this result holds for the worst-case scenario (one-data-qubit processors arranged in a one-dimensional network topology), the actual depth overhead induced by any arbitrary distributed architecture will always be upper-bounded by (2).
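As a small worked example of this bound, the sketch below multiplies the per-layer worst case from (2) by the depth of the original circuit. The function name `es_depth_upper_bound` and the numeric values are hypothetical; in particular, the value used for the constant $d_{es}$ is an arbitrary placeholder, since it depends on the hardware technology.

```python
def es_depth_upper_bound(n: int, d: int, d_es: int) -> int:
    """Worst-case compiled depth with the entanglement swapping strategy.

    n    : number of logical qubits (at most n/2 CNOTs per layer)
    d    : depth of the original circuit
    d_es : hardware-dependent constant from (2), taken here as an input
    """
    if n < 2 or d < 0 or d_es < 1:
        raise ValueError("expected n >= 2, d >= 0 and d_es >= 1")
    return (n // 2) * d_es * d


if __name__ == "__main__":
    # 6 logical qubits, original depth 10, placeholder constant d_es = 7
    print(es_depth_upper_bound(6, 10, 7))
```

Doubling the number of logical qubits doubles the bound, which is the linear growth noted above.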
We further note that classical information must be exchanged between the quantum processors. For instance, the entanglement swapping task requires the transmission of classical information (i.e., the measurement output) throughout the quantum network. Hence, in case of long-distance quantum processors, the actual execution time of the compiled quantum circuit may be affected by the latency induced by the classical communications.
Finally, due to the complex and stochastic nature of the physical mechanisms underlying quantum entanglement [43], several attempts can be required for establishing a link entanglement, and this may also impact the execution time of the compiled quantum circuit. Indeed, we should consider link entanglement as the critical task for distributed quantum computation, given that the remaining tasks require only local quantum operations and classical communications. From this perspective, the entanglement swapping based strategy requires at most $n/2$ repetitions of the link entanglement task, regardless of the original quantum circuit and regardless of the characteristics of the network topology underlying the distributed computing architecture.
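The sequentiality constraint behind the $n/2$ worst case can be sketched as follows. This is an illustrative first-fit grouping, not the compiler described in this article. It assumes a line of processors indexed from 0, with link k joining processors k and k+1, and it treats two remote CNOTs as overlapping when they need at least one common link; the names `links`, `overlap` and `schedule_rounds` are hypothetical.

```python
def links(cnot):
    """Links used by a remote CNOT between two processors on a line."""
    a, b = sorted(cnot)
    return set(range(a, b))  # link k joins processors k and k + 1


def overlap(cnot_a, cnot_b):
    """Two CNOTs overlap when they need at least one common link."""
    return bool(links(cnot_a) & links(cnot_b))


def schedule_rounds(layer):
    """First-fit grouping: CNOTs in the same round can run in parallel."""
    rounds = []
    for cnot in sorted(layer, key=min):
        for group in rounds:
            if not any(overlap(cnot, other) for other in group):
                group.append(cnot)
                break
        else:
            rounds.append([cnot])
    return rounds


if __name__ == "__main__":
    # Same interaction pattern as layer #i+1 in Figure 8a, 0-based indices
    print(schedule_rounds([(0, 2), (1, 4), (3, 5)]))
    # Nested CNOTs all overlap: n/2 = 3 sequential rounds
    print(schedule_rounds([(0, 5), (1, 4), (2, 3)]))
```

Each round needs its own link entanglement generation, so in this sketch the number of rounds equals the number of repetitions of the link entanglement task, which never exceeds the number of CNOTs in the layer.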

Data-Qubit Swapping Based Strategy
The entanglement swapping based strategy takes full advantage of the augmented connectivity enabled by the communication qubits, as discussed in Section 3.1, to allow interactions between remote processors within each layer.
Nevertheless, whenever the original quantum circuit presents repetitions of the same CNOT interaction pattern between logical qubits, as for layers #i+1 and #i+2 in Figure 9a, a more elaborate strategy based on moving the data qubits can provide better performance.
The strategy is shown in Figure 9: the objective is to arrange (i.e., to swap) the data qubits within the quantum processors so that eventually each CNOT of the original layer operates on qubits stored at neighbor processors within the network topology.
Intuitively, the goal of the strategy can be modeled as an array sorting problem. Indeed, similarly to classical sorting, the $n$ data qubits (representing the values to be sorted) must be ordered within the network topology (representing an array with size equal to or greater than $n$). However, differently from classical sorting, where any pair of values can be swapped regardless of their position within the array, with data-qubit swapping the constraints arising from the underlying network topology must be carefully taken into account. To this aim, by taking advantage of sorting network theory, it is easy to model the network topology constraints through the notion of insertion network (or, equivalently, bubble network). As a consequence, the overall depth of the equivalent quantum circuit grows with the number $n$ of logical qubits as [44]:
$$O(n)$$
instead of a logarithmic $\log n$ depth factor as for classical sorting. Nevertheless, sorting networks, and classical sorting in general, are based on the assumption that there exists a total (monotonic) order over the array elements. Hence, there exists a unique solution to the sorting problem. Conversely, the data-qubit swapping based strategy admits several equivalent solutions for the arranging problem, as exemplified in Figure 10. We now formalize these considerations with the following theorem.

Algorithm 1 Data-Qubit Swapping
Input: n-qubit circuit layer L with mod(n, 4) = 0 and n/2 CNOTs
Output: layer L with each CNOT operating on neighbor qubits
 1: function Sort(L)
 2:   if ∃ CNOT(q_i, q_j) with i, j ≤ n/2 then
 3:     // ∃ CNOT(q_k, q_l) with k, l > n/2
 4-10: [...]
11:     Swap(q_i, q_{n/2−1})
12:   end if
13:   [...]
14:   if L ≠ ∅ then
15:     Sort(L)
16:   end if
17: end function

Theorem 1. Let us consider the i-th layer of an arbitrary n-qubit quantum circuit. The depth of the corresponding compiled quantum circuit, obtained through the data-qubit swapping based strategy, does not exceed:
$$\frac{n}{4}\, d'_{qs} + d''_{qs}$$
where $d'_{qs}$ and $d''_{qs}$ are constant factors (independent of the characteristics of the original quantum circuit) given by: [...] with $c_{le}$ and $c_{bsm}$ denoting the number of layers required to perform the link entanglement task and the entanglement swapping task, respectively, and $c_{cx}$ denoting the number of layers required to perform the remote CNOTs once the Bell state has been distributed between two processors.
Proof: The proof easily follows by recognizing that: i) the function SORT(·), defined in Algorithm 1, is called at most $n/4$ times, and ii) after these calls to SORT(·), all the CNOTs, by acting on qubits stored at neighbor processors, can be executed at once through link entanglement followed by local operations, as shown in Figure 4b. More specifically, in each call we have two disjoint cases (line 2).
In the former case (lines 3-6), there exists a CNOT acting within the first half, i.e., the first $n/2$ logical qubits, of the original layer. Since we are considering the worst case, namely, a layer with $n/2$ CNOTs, there exists at least one CNOT acting on the last half, i.e., the last $n/2$ logical qubits, of the original layer. Hence, the two CNOTs do not overlap and two simultaneous SWAP operations can be executed, one in each half of the original quantum circuit, as shown here:
[Diagram: qubit line ..., $q_k$, $q_l$, ..., $q_{k+1}$, ..., $q_n$ with the two SWAP operations]
so that, within the compiled circuit, the two CNOTs act on qubits stored at neighbor processors. Within the previous diagram, as well as in Algorithm 1, we omitted some minor particulars for the sake of simplicity. For instance, we implicitly assumed that $i$ (and $k$) is odd, i.e., mod(i, 2) = 1, so that $q_j$ must be swapped with $q_{i+1}$. Clearly, if $i$ is even, $q_j$ must be swapped with $q_{i-1}$.
In the latter case (lines 8-12), each and every CNOT acts on two logical qubits belonging to different halves of the original layer. Let us consider, without loss of generality, the CNOT acting on the $n/2$-th logical qubit (i.e., the last qubit of the first half) and let us denote as $q_l$ the second qubit on which such a CNOT operates. Since we are considering a layer with $n/2$ CNOTs, there exists a CNOT acting on the $(l-1)$-th qubit. By denoting as $q_i$ the second qubit on which such a CNOT operates, it follows that $i < n/2$. Although the two CNOTs overlap, by properly selecting two SWAP operations as shown here: [...]

FIGURE 9:
Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageous whenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2 in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it is necessary to swap data qubits stored at processors that are not neighbors within the network topology.

FIGURE 9:
Data-qubit swapping based strategy. Swapping data qubits between remote quantum processors can be advantageous whenever the original quantum circuit presents repetitions of the same CNOT interaction pattern, as for layers #i+1 and #i+2 in Figure 9a. Although not shown in the figure, the entanglement swapping tasks (as in Figure 8b) are needed whenever it is necessary to swap data qubits stored at processors that are not neighbors within the network topology. Panel (c) shows an alternative equivalent arrangement.
The depth of a layer of remote CNOTs, compiled with the data-qubit swapping based strategy, does not exceed the depth given in (5), where $d_{qs}$ and $d'_{qs}$ are constant factors (independent of the characteristics of the original quantum circuit) given in (6) and (7), with $c_{le}$ and $c_{bsm}$ denoting the number of layers required to perform the link entanglement task and the entanglement swapping task, respectively, and $c_{cx}$ denoting the number of layers required to perform the remote CNOTs once the Bell state has been distributed between two processors.
Proof. The proof follows by recognizing that: i) the function SORT(·), defined in Algorithm 1, is called at most $n/4$ times, and ii) after these calls to SORT(·), all the CNOTs act on qubits stored at neighbor processors and can therefore be executed at once through link entanglement followed by local operations, as shown in Figure 4b. More specifically, in each call we have two disjoint cases (line 2).

In the former case (lines 3-6), there exists a CNOT acting within the first half of the original layer, i.e., on the first $n/2$ logical qubits. Since we are considering the worst case, namely a layer with $n/2$ CNOTs, there exists at least one CNOT acting within the last half of the original layer, i.e., on the last $n/2$ logical qubits. Hence, the two CNOTs do not overlap and two simultaneous SWAP operations can be executed, one in each half of the original quantum circuit, as shown here (qubit ordering before and after the SWAPs):

$$
\begin{array}{l}
q_1 \dots q_i \; q_{i+1} \dots q_j \dots q_{n/2} \dots q_k \; q_{k+1} \dots q_l \dots q_n \\
q_1 \dots q_i \; q_j \dots q_{i+1} \dots q_{n/2} \dots q_k \; q_l \dots q_{k+1} \dots q_n
\end{array}
$$

so that, within the compiled circuit, the two CNOTs act on qubits stored at neighbor processors. Within the previous diagram, as well as in Algorithm 1, we omitted some minor details for the sake of simplicity. For instance, we implicitly assumed that $i$ (and $k$) is odd, i.e., $\mathrm{mod}(i, 2) = 1$, so that $q_j$ must be swapped with $q_{i+1}$. Clearly, if $i$ is even, $q_j$ must be swapped with $q_{i-1}$.

In the latter case (lines 8-12), every CNOT acts on two logical qubits belonging to different halves of the original layer. Let us consider, without loss of generality, the CNOT acting on the $n/2$-th logical qubit (i.e., the last qubit of the first half) and let us denote as $q_l$ the second qubit on which such a CNOT operates. Since we are considering a layer with $n/2$ CNOTs, there exists a CNOT acting on the $(l-1)$-th qubit. By denoting as $q_i$ the second qubit on which such a CNOT operates, it follows that $i < n/2$. Although the two CNOTs overlap, by properly selecting two SWAP operations as shown here (qubit ordering before and after the SWAPs):

$$
\begin{array}{l}
q_1 \dots q_i \dots q_{n/2-1} \; q_{n/2} \dots q_{l-1} \; q_l \dots q_n \\
q_1 \dots q_{n/2-1} \dots q_i \; q_{l-1} \dots q_{n/2} \; q_l \dots q_n
\end{array}
$$

we have that the SWAPs can be simultaneously executed so that, within the compiled circuit, the two CNOTs act on qubits stored at neighbor processors. Regardless of which case holds, each call to the SORT(·) function compiles two CNOTs. By recalling that at most $n/2$ CNOTs are present in a layer, the thesis follows.
From (5), the depth of an arbitrary $d$-depth quantum circuit compiled with the data-qubit swapping based strategy is always lower than $\frac{n}{4} d$, neglecting the constant factors. Hence, the depth overhead is asymptotically lower than the overhead induced by the entanglement swapping based strategy. However, an explicit comparison between the two strategies depends on the particulars of the underlying qubit technology through the exact expressions of $d_{es}$ and $d_{qs}$. Furthermore, it also depends on the repetitions of the same CNOT interaction patterns within the original quantum circuit.
As regards the number of repetitions of the link entanglement task, in general it depends on the characteristics of the underlying qubit technology, as discussed in Section 4.3. With reference to the IBM quantum processors, where a SWAP operation is obtained through a sequence of three CNOTs, from (5)-(7) we have that the data-qubit swapping based strategy requires at most $2n$ repetitions of the link entanglement task, regardless of the original quantum circuit and regardless of the characteristics of the network topology underlying the distributed computing architecture.

Discussion
As already mentioned above, the performance of the two strategies strongly depends on the particulars of the underlying hardware technology through the parameters $d_{es}$, $d_{qs}$ and $d'_{qs}$ given in equations (3), (6) and (7).
To clarify this point, let us consider $d_{qs}$, which inherently denotes the cost of a SWAP operation. Having assumed throughout the manuscript that the CNOT is the fundamental multi-qubit gate, a single remote SWAP operation can be obtained through three remote CNOTs as in Figure 3b. This is the rationale for the constant factor equal to 3 in (6), which accounts for the cost of three remote CNOTs.
Clearly, by changing the assumptions on the underlying hardware technology, the expression of $d_{qs}$ changes as well. For instance, photonic technology can provide the SWAP gate as the native operation [45], and in such a case $d_{qs}$ is equal to 1. Nevertheless, the main result, i.e., equation (5), continues to hold. Furthermore, whenever the SWAP gate is the native operation, a single CNOT can be obtained through two consecutive SWAPs interleaved by single-qubit operations [46]. Hence, the expression of $d_{es}$ must change accordingly, but the main result, i.e., equation (3), continues to hold as well.
Indeed, it is worthwhile to note that, despite the differences between the performance of the two strategies, there exists a one-to-one mapping between the strategies. Specifically, there exists an admissible transformation that maps the compiled circuit obtained with one strategy into the compiled circuit obtained with the other strategy. The corresponding computational task exhibits a polynomial-time complexity$^{10}$ for every original circuit.
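The three-CNOT construction of a SWAP can be checked on computational basis states. The following is a minimal sketch, not part of the compiler: the helper names `cnot` and `swap_via_three_cnots` are hypothetical, and the check relies on the fact that CNOT and SWAP are permutation matrices, so agreement on every basis state implies equality of the two unitaries.

```python
from itertools import product


def cnot(bits, control, target):
    """Apply a CNOT to a classical basis state given as a tuple of bits."""
    out = list(bits)
    out[target] ^= out[control]
    return tuple(out)


def swap_via_three_cnots(bits, a, b):
    """Exchange the bits at positions a and b using three alternating CNOTs."""
    bits = cnot(bits, a, b)
    bits = cnot(bits, b, a)
    return cnot(bits, a, b)


# Exhaustive check over the four two-qubit basis states.
for state in product((0, 1), repeat=2):
    assert swap_via_three_cnots(state, 0, 1) == (state[1], state[0])
```

In the distributed setting each of the three CNOTs is a remote CNOT, which is why the cost of one remote SWAP is three times the cost of one remote CNOT.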

Performance Analysis
Here we analyze the performance of the compiler designed in Section 4.
More in detail, in Section 5.1 we illustrate the algorithmic implementation of the compiler, showing its attractive feature from a computational perspective: a time complexity that grows polynomially with the number of logical qubits and linearly with the depth of the quantum circuit to be compiled. Then, in Section 5.2 we validate the theoretical upper bounds on the number of layers that result from compiling a layer of remote CNOTs, derived in Section 4.2, against an extensive set of medium-size quantum circuits of practical interest. Finally, in Sections 5.3 and 5.4 we conclude the performance analysis through an unfair comparison (as clearly shown in Figure 14) with the state of the art for two different network topologies.

Compiler Implementation
We implemented the strategies discussed in Section 4.2 in Python, using Qiskit [26] as the development framework. Given a quantum circuit described in the QASM format, the compiler instantiates a distributed architecture that mimics the one described in Section 4.1. To model the worst-case scenario depicted in Figure 7, each QPU has one data qubit and two communication qubits. Moreover, each QPU has two neighbor QPUs, with the exception of the outer QPUs, which have only one neighbor QPU. Each qubit of the circuit is assigned to the data qubit of one QPU.

The routine that adds a remote CNOT between a control and a target qubit reads:

    if control and target are not on neighboring QPUs then
        qpu_0 ← QPU of control
        qpu_n ← QPU of target
        for i ∈ {0, .., n − 2} do
            add link entanglement between qpu_i and qpu_{i+1} to E
        end for
        add entanglement swap between qpu_0 and qpu_n to E
    end if
    add link entanglement between qpu_0 and qpu_n to E
    add a remote CNOT between control and target to E

    function Teleport(q1, q2)
        if q1 and q2 are not on neighboring QPUs then
            qpu_0 ← QPU of q1
            qpu_n ← QPU of q2
            for i ∈ {0, .., n − 2} do
                add link entanglement between qpu_i and qpu_{i+1} to E using two available EPR pairs between QPUs
            end for
            e_01, e_02 ← the two communication qubits at qpu_0
            e_n1, e_n2 ← the two communication qubits at qpu_n
            add entanglement swap between e_01 at qpu_0 and e_n1 at qpu_n to E
            add entanglement swap between e_02 at qpu_0 and e_n2 at qpu_n to E
            add quantum teleportation between q1 and e_n1 to E
            add quantum teleportation between q2 and e_02 to E
            add local swap between q1 and e_02
            add local swap between q2 and e_n1
        end if
    end function

The compilation process is summarized and illustrated in Figure 11. Reading the circuit from left to right, the front layer, i.e., a layer comprising only CNOT gates that can be executed in parallel, is updated. To this end, one-qubit gates are immediately mapped to the compiled circuit, while CNOT gates are added to the front layer. This is done until all logical qubits are involved in a CNOT or there are no more CNOT gates that can be executed in parallel in the current front layer. Then, the front layer is compiled, rendering all currently involved CNOT gates executable. This process is repeated until no more front layers can be computed, meaning that all the circuit gates have been mapped to the distributed architecture.
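The front-layer construction described above can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the implementation used for the experiments: the gate representation (a list of `(name, qubits)` tuples in program order) and the function name `split_front_layers` are assumptions made for the example.

```python
def split_front_layers(gates):
    """Split a gate list into passes of (local one-qubit gates, front layer).

    gates: list of (name, qubits) tuples in program order, where qubits is a
    tuple with one entry (one-qubit gate) or two entries (CNOT).
    """
    schedule = []
    remaining = list(gates)
    while remaining:
        blocked = set()      # qubits that must wait for the next pass
        local, front, deferred = [], [], []
        for name, qubits in remaining:
            if blocked.intersection(qubits):
                # Depends on a gate of this pass: defer it and keep the order
                # by blocking every qubit it touches.
                blocked.update(qubits)
                deferred.append((name, qubits))
            elif len(qubits) == 1:
                # One-qubit gates are mapped to the compiled circuit at once.
                local.append((name, qubits))
            else:
                # Independent CNOT: it joins the current front layer.
                front.append((name, qubits))
                blocked.update(qubits)
        schedule.append((local, front))
        remaining = deferred
    return schedule


circuit = [("h", (0,)), ("cx", (0, 1)), ("cx", (2, 3)),
           ("x", (1,)), ("cx", (1, 2)), ("h", (3,))]
schedule = split_front_layers(circuit)
```

For the example circuit, the first pass collects the two non-overlapping CNOTs in one front layer, while the CNOT on qubits 1 and 2 has to wait for the second pass. Each pass consumes at least the first remaining gate, so the loop always terminates.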
Front layer compilation is based on Algorithm 2, where one can choose between the two strategies discussed in Section 4.2. When the entanglement swapping based strategy is adopted, remote CNOT gates preceded by entanglement swapping are applied whenever the involved QPUs are not neighbors. Note that to perform entanglement swapping, as well as remote CNOTs, we need to generate link entanglement between all involved QPUs.
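To make the entanglement swapping based strategy concrete, the sketch below lists the operations needed for one remote CNOT on a linear topology in which QPU `i` is a neighbor of QPU `i + 1`. It is a simplified reading of the routine, under stated assumptions: the function name `remote_cnot_events`, the event labels, and the integer indexing of the QPUs are all hypothetical.

```python
def remote_cnot_events(control_qpu, target_qpu):
    """Return the ordered operations for a remote CNOT on a linear topology.

    QPUs are identified by integer positions along the line (an assumption
    of this sketch); each event is a tuple (label, qpu_a, qpu_b).
    """
    if control_qpu == target_qpu:
        raise ValueError("control and target must sit on different QPUs")
    lo, hi = sorted((control_qpu, target_qpu))
    events = []
    if hi - lo > 1:
        # Non-neighbor QPUs: one link per hop, then entanglement swapping
        # to obtain a Bell state shared by the two end points.
        for i in range(lo, hi):
            events.append(("link_entanglement", i, i + 1))
        events.append(("entanglement_swap", lo, hi))
    else:
        # Neighbor QPUs: a single link is enough.
        events.append(("link_entanglement", lo, hi))
    events.append(("remote_cnot", control_qpu, target_qpu))
    return events


events = remote_cnot_events(0, 3)
```

In this sketch the number of link entanglement events equals the distance between the two QPUs along the line, which is the reason non-neighbor CNOTs are costly in the worst-case topology.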
Regarding the more advanced data-qubit swapping based strategy, the list representing the interaction between qubits is prepared and then Algorithm 3 is used to compute the data swap operations needed to reorder the qubits. Referring to Figure 12a, the list representing the interaction between qubits before swapping would be [1, 1, 2, 3, 2, 3], while the sorted list after swapping would be [1, 1, 2, 2, 3, 3], meaning that no overlapping CNOTs are left in the layer.
The data-qubit swapping routine operates on lists whose number of elements, i.e., the number of logical qubits, is a multiple of 4. For this reason, a pair of dummy values, set to −1, may have to be appended to the list whenever the number of CNOTs is odd. Given that the algorithm searches for swaps from left to right, the dummy pair at the end is left untouched. After creating masks to keep track of already swapped qubits, the algorithm finds all necessary swaps in exactly $n/4$ steps, where $n$ is the number of elements in the list, i.e., the number of logical qubits involved in CNOTs. Taking into account the topology of the worst-case scenario shown in Figure 7, at least three remote CNOTs are needed to perform a swap. After all the data-qubit swaps have been applied, the necessary remote CNOTs have to be placed. Ideally, this last step would involve only remote CNOTs between neighbor QPUs, but when not all qubits of the circuit are involved in the front layer, such as $Q_1$ in Figure 12a, it may still be necessary to perform some entanglement swapping operations, as shown in Figure 12b. This is because, when sorting pairs, our algorithm does not take into account QPUs that are not involved in the current front layer. Consequently, the implementation of the data-qubit swapping based strategy is actually a hybrid between the two strategies described in Section 4.2, avoiding data-qubit swapping when it is not necessary and resorting instead to entanglement swapping.
As shown in Figure 12, starting from the layer in Figure 12a and following the data-qubit swapping based strategy, in Figure 12b we apply a remote SWAP between qubit 3 and qubit 4, after which all remote CNOTs can be executed in parallel.
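The reordering of the interaction list can be illustrated with a simple greedy routine. This sketch is not Algorithm 3: it processes the pairs sequentially instead of finding two simultaneous swaps per step, and it needs no dummy padding. The function name `pair_up` is hypothetical, and the input is assumed to have even length with every CNOT label appearing exactly twice.

```python
def pair_up(interactions):
    """Bring the two qubits of every CNOT to adjacent positions.

    interactions: list in which position p holds the label of the CNOT that
    acts on the qubit stored at position p (each label appears twice).
    Returns the reordered list and the position swaps that were applied.
    """
    order = list(interactions)
    swaps = []
    for i in range(0, len(order), 2):
        if order[i] == order[i + 1]:
            continue  # this pair is already adjacent
        # Locate the partner of the qubit at position i, further to the right.
        j = order.index(order[i], i + 1)
        order[i + 1], order[j] = order[j], order[i + 1]
        swaps.append((i + 1, j))
    return order, swaps


sorted_list, swaps = pair_up([1, 1, 2, 3, 2, 3])
```

On the list of Figure 12a the routine returns the sorted list [1, 1, 2, 2, 3, 3] with a single swap between positions 3 and 4 (counting from 0), in line with the single remote SWAP of Figure 12b.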
Moving from the topology illustrated in Figure 7 to the one in Figure 13, we can devise a different strategy to execute swaps by exploiting the augmented connectivity, described in Algorithm 2. Specifically, to perform a SWAP between QPU $Q_x$ and $Q_y$, our compiler uses one communication qubit at each QPU as a buffer memory and exploits quantum teleportation to move the state of a data qubit from $Q_x$ to $Q_y$ and vice versa, in parallel. After the two parallel teleportations, we can execute two parallel local SWAPs between communication qubits and data qubits at each QPU to effectively achieve data-qubit swapping between $Q_x$ and $Q_y$. This is clearly beneficial, as it requires only one layer of link entanglement generation between the involved QPUs, unlike the previous scenario where we needed three layers of link entanglement generation to perform three remote CNOTs.
The difference between data-qubit swapping and entanglement swapping lies in the fact that, if the subsequent front layers are similar (in terms of CNOT interaction pattern) to the one just compiled, not much data swapping and very little entanglement swapping (given that the front layers involve most of the qubits) will be necessary to compile those layers.
Regarding the computational complexity, Alg. 4 reads the circuit from left to right while updating the front layer, so its computational complexity is $O(d)$, where $d$ is the depth of the circuit, i.e., the number of layers. Alg. 2 loops through all necessary swaps found by Alg. 3, and applies remote CNOTs. As we must account for possible entanglement swapping operations between QPUs, applying a remote CNOT has a computational complexity of $O(n)$, with $n$ being the number of QPUs. Given that we need at most $n/4$ swaps, the computational complexity of Alg. 2 turns out to be $O(n^2)$. The overall computational complexity of distributing a circuit is therefore $O(dn^2)$.

Compiling Overhead Validation
We validate the theoretical upper bounds (derived in Section 4.2) on the number of layers that result from compiling a layer of remote CNOTs, considering an extensive set of medium-size quantum circuits (the largest ones requiring 16 qubits, with the exception of a GHZ circuit and two random circuits with 20 qubits). Specifically, we consider quantum circuits that are publicly available and widely adopted for testing quantum compilers [24,25]$^{11}$, plus a few quantum chemistry circuits for the implementation of the unitary quantum Coupled-Cluster [47,48] (qUCC) and the RYRZ heuristic [49] wavefunction Ansätze.

FIGURE 12:
Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, but neighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.

each QPU memory to one data qubit but we could not impose any limitation over the number of communication qubits per QPU nor the topology of the quantum network, which is always assumed to be a hypercube. Such a topology is shown in Figure 14b, where one can clearly see the connectivity disparity compared to the worst-case topology illustrated in Figure 14a. In Figures 15-17, the results of the comparative evaluation are plotted. Figure 15 shows the number of Link Generation Layers, i.e., the number of layers in the distributed circuit that comprise only Bell state generation and distribution between QPUs. It is clear that our compiler requires fewer layers of link generation for almost every tested circuit. Having fewer layers of link generation reduces the time that data qubits have to spend idle, i.e., possibly affected by decoherence, while waiting to be able to perform remote operations. Figure 15c also shows that choosing the data-qubit swapping based strategy over the entanglement swapping based strategy is generally the better choice.
Regarding the depth of the distributed circuits, illustrated in Figure 16, we can see that for some circuits our compiler clearly outperforms that of Andrés-Martínez. Figure 16c confirms that the data-qubit swapping based strategy is better than the entanglement swapping based one.
With respect to the number of generated Bell states, depicted in Figure 17, it can be observed that our compiler consumes noticeably more Bell states than Andrés-Martínez's compiler. This was expected and it is mostly due to the fact that Andrés-Martínez's compiler benefits from a hypercube network topology, as shown in Figure 14b. Using such a topology means that, in most cases, a link between two QPUs can be directly generated with just one Bell state, which is in direct contrast with the worst-case topology that we used, depicted in Figure 14a. In our linear topology, to generate a link between two non-neighboring QPUs we need to perform entanglement swapping, generating and consuming Bell states shared by all the other QPUs in between. Nevertheless, the time needed to generate one Bell state should be the same as the time needed to generate $n$ Bell states in parallel, and with Figure 15 we already showed that our compiler usually needs fewer layers of link generation.
It is worthwhile to note that with network topologies different from the considered one, namely the worst-case topology where each CNOT must be mapped into a remote CNOT, new optimization challenges arise. For instance, whenever multiple data qubits are available at each node (or at some nodes), only a subset of CNOTs must be mapped into remote operations. Hence, the compiler should be able to optimize choices such as which sub-circuit should be mapped to which node or which CNOT should be performed via communication qubits. Clearly, the optimal strategies, as well as the metrics to measure the optimality of a strategy, represent interesting open problems.

(a) A layer with three parallel CNOTs. (b) The layer distributed with the Sort strategy.
It is worthwhile to note that with network topologies different from the considered one -namely, the worst-case topology where each CNOT must be mapped into a remote CNOT -new optimization challenges arise. As instance, whenever multiple data qubits are available at each (or some nodes), only a subset of CNOTs must be mapped into remote operations. Hence, the compiler should be able to optimize choices such as which sub-circuit should be mapped to which node or which CNOT should be performed via communication qubits. Clearly, the optimal strategies -as well as the metrics to measure the optimality of a strategy -represents interesting open problems.   : Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, but neighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.    each QPU memory to one data qubit but we could not impose any limitation over the number of communication qubits per QPU nor the topology of the quantum network, which is always assumed to be an hypercube. Such a topology is shown in Figure 14b, where one can clearly see the connectivity disparity, compared to the worst-case topology illustrated in Figure 14a. In Figures 15-17, the results of the comparative evaluation are plotted. Figure 15 shows the number of Link Generation Layers, i.e., the number of layers in the distributed circuit that comprise only Bell state generation and distribution between QPUs. It is clear that our compiler requires less layers of link generation for almost every tested circuit. Having fewer layers of link generation reduces the time that data qubits have to spend idle, i.e., possibly affected by decoherence, while waiting to be able to perform remote operations. 
Figure 15c also shows that choosing the data-qubit swapping based strategy against the entanglement swapping based strategy to the fact that Andrés-Martínez's compiler benefits from a hypercube network topology, as showed in Figure 14b. Using such a topology means that, in most cases, a link between two QPUs can be directly generated with just one Bell state, which is in direct contrast with the worst-case topology that we used, depicted in Figure 14a. In our linear topology, to generate a link between two non neighboring QPUs we need to perform entanglement swapping, generating and consuming Bell states shared by all the others QPUs in between. Nevertheless, the time needed to generate one Bell state should be the same as to generate n Bell states in parallel, and with Figure 15 we already showed that our compiler usually needs fewer layers of link generation.
It is worthwhile to note that with network topologies different from the considered one -namely, the worst-case topology where each CNOT must be mapped into a remote CNOT -new optimization challenges arise. As instance, Figure 13: Improving over the worst-case scenario topology. Only one data qubit is available at each quantum processor, but neighboring quantum processors are connected with two quantum links, realized by 2 EPR pairs.  Table 1: Validation of the theoretical upper bounds derived in Section 4.2 against a heterogeneous set of quantum circuits. Each circuit is characterized by a number of qubits n and a number of CNOT layers, i.e., layers that comprise only CNOT gates. For each compiling strategy, there is a theoretical upper bound on the number of layers that are necessary to realize the remote CNOTs and an actual number of layers resulting from the compilation process. The ratio between the latter and the former is also reported.
In more detail, Table 1 reports a sample of the results collected by compiling the circuits with both the entanglement swapping based and the data-qubit swapping based strategies. Within the table, the first column shows the name of the circuit, the second column shows the number n of logical qubits, and the third column shows the number of CNOT layers. The fourth and the seventh columns show the theoretical upper bounds, given in equations (2) and (4), respectively, whereas the fifth and the eighth columns show the depth of the compiled CNOT layers. For computing the upper bound values and collecting the experimental results, we set the parameters c_le, c_bsm and c_cx in equations (3), (6) and (7) as unit factors, thus obtaining d_es = 3, d_qs = 9 and d_qs = 2. Table 1 shows that the upper bounds on the number of layers resulting from compiling the layers of remote CNOTs hold, with a wide margin, for all the considered examples. Indeed, comparing the actual depth with the theoretical one makes it evident that the bounds given in Section 4.2 overestimate the depth. The reason lies in the number of CNOTs in each layer, which is usually significantly lower than assumed. For instance, in the GHZ circuits each layer contains a single CNOT, and hence the derived bounds, which assume n/2 CNOTs in each layer, overestimate the depth. Nevertheless, the discrepancy can be easily fixed, with no loss of generality, by substituting the n/2 factor with an estimate of the average number of CNOTs in each layer.
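The substitution of the n/2 factor mentioned above can be illustrated with a minimal sketch. It computes the average number of CNOTs per layer and its ratio to the worst-case value n/2 assumed by the bounds. The function names and the chain-shaped GHZ circuit (one CNOT between consecutive qubits per layer) are assumptions made for illustration; they are not taken from the compiler described in this paper.

```python
from fractions import Fraction


def ghz_cnot_layers(n):
    """CNOT layers of an n-qubit GHZ circuit (assumed chain form):
    layer i holds the single gate CNOT(i, i + 1)."""
    return [[(i, i + 1)] for i in range(n - 1)]


def average_cnots_per_layer(layers):
    """Average number of CNOT gates over all CNOT layers."""
    return Fraction(sum(len(layer) for layer in layers), len(layers))


def tightening_factor(layers, n):
    """Ratio between the average CNOT count per layer and the
    worst-case count n/2 assumed when deriving the bounds."""
    return average_cnots_per_layer(layers) / Fraction(n, 2)


layers = ghz_cnot_layers(8)
print(average_cnots_per_layer(layers))  # 1
print(tightening_factor(layers, 8))     # 1/4
```

For an 8-qubit GHZ circuit the average is one CNOT per layer against an assumed four, so under this illustrative model a bound built on the average would be four times smaller than the worst-case one.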

Experimental Results for the Worst-Case Topology
We compared our compiler with the one proposed by Andrés-Martínez et al. [41]. With that compiler, we were able to restrict each QPU memory to one data qubit, but we could not impose any limitation on the number of communication qubits per QPU or on the topology of the quantum network, which is always assumed to be a hypercube. Such a topology is shown in Figure 14b, where one can clearly see the connectivity disparity compared to the worst-case topology illustrated in Figure 14a. In Figures 15-17, the results of the comparative evaluation are plotted. For better readability of the figures, we omit data related to a 20-qubit random circuit that presents values far greater than the rest of the data set, for all compiling strategies including the state of the art (such data have been included in Table 1). Figure 15 shows the number of Link Generation Layers, i.e., the number of layers in the distributed circuit that comprise only Bell state generation and distribution between QPUs. It is clear that our compiler requires fewer layers of link generation for almost every tested circuit. Having fewer layers of link generation reduces the time that data qubits have to spend idle, i.e., possibly affected by decoherence, while waiting to be able to perform remote operations. Figure 15c also shows that choosing the data-qubit swapping based strategy over the entanglement swapping based strategy is generally the best choice.
Regarding the depth of the distributed circuits, illustrated in Figure 16, we can see that for some circuits our compiler clearly outperforms the compiler by Andrés-Martínez et al. Figure 16c confirms that the data-qubit swapping based strategy is better than the entanglement swapping based one.
With respect to the number of generated Bell states, depicted in Figure 17, it can be observed that our compiler consumes more Bell states than Andrés-Martínez's compiler. This was expected, and it is mostly due to the fact that Andrés-Martínez's compiler benefits from a hypercube network topology, as shown in Figure 14b. Using such a topology means that, in most cases, a link between two QPUs can be directly generated with just one Bell state, which is in direct contrast with the worst-case topology that we used, depicted in Figure 14a. In our linear topology, to generate a link between two non-neighboring QPUs we need to perform entanglement swapping, generating and consuming Bell states shared by all the other QPUs in between. Nevertheless, the time needed to generate one Bell state should be the same as the time needed to generate n Bell states in parallel, and with Figure 15 we already showed that our compiler usually needs fewer layers of link generation.
It is worthwhile to note that new optimization challenges arise with network topologies different from the considered one, namely, the worst-case topology where each CNOT must be mapped into a remote CNOT. For instance, whenever multiple data qubits are available at each node (or at some nodes), only a subset of the CNOTs must be mapped into remote operations. Hence, the compiler should be able to optimize choices such as which sub-circuit should be mapped to which node, or which CNOT should be performed via communication qubits. Clearly, the optimal strategies, as well as the metrics to measure the optimality of a strategy, represent interesting open problems.
Figure 15: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler with Andrés-Martínez's compiler [41], in terms of link entanglement generation layers. Our compiler distributes circuits on the worst-case topology, with only one link between neighboring QPUs (illustrated in Figure 7), while Andrés-Martínez's one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterized by one data qubit per QPU.
Figure 16: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler with Andrés-Martínez's compiler [41], in terms of circuit depth. Our compiler distributes circuits on the worst-case topology, with only one link between neighboring QPUs (illustrated in Figure 7), while Andrés-Martínez's one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterized by one data qubit per QPU.
Figure 17: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler with Andrés-Martínez's compiler [41], in terms of consumed EPR pairs. Our compiler distributes circuits on the worst-case topology, with only one link between neighboring QPUs (illustrated in Figure 7), while Andrés-Martínez's one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterized by one data qubit per QPU.
Figure 18: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler with Andrés-Martínez's compiler [41], in terms of link entanglement generation layers. Our compiler distributed circuits on the topology illustrated in Figure 13, with two links between neighboring QPUs (illustrated in Figure 7), while Andrés-Martínez's one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterized by one data qubit per QPU.
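The gap in Bell-state consumption between a linear chain and a hypercube can be illustrated with a small hop-count sketch. It assumes shortest-path routing, one Bell pair per hop, and one entanglement swap at every intermediate QPU. The hypercube is assumed to label QPUs with bit strings, so that the hop count is the Hamming distance and the number of QPUs is a power of two. All function names are hypothetical, and neither compiler is claimed to route links exactly this way.

```python
def line_hops(i, j):
    """Hops between QPUs i and j on a linear chain."""
    return abs(i - j)


def hypercube_hops(i, j):
    """Hops between QPUs i and j on a hypercube (Hamming distance of labels)."""
    return bin(i ^ j).count("1")


def bell_pairs_and_swaps(hops):
    """One Bell pair per hop, one entanglement swap per intermediate QPU."""
    return hops, max(hops - 1, 0)


def average_hops(n_qpus, hops):
    """Average hop count over all unordered pairs of distinct QPUs."""
    pairs = [(i, j) for i in range(n_qpus) for j in range(i + 1, n_qpus)]
    return sum(hops(i, j) for i, j in pairs) / len(pairs)


n = 8  # 8 QPUs: a chain of 8 nodes versus a 3-dimensional hypercube
print(bell_pairs_and_swaps(line_hops(0, 7)))       # (7, 6)
print(bell_pairs_and_swaps(hypercube_hops(0, 7)))  # (3, 2)
print(average_hops(n, line_hops))                  # 3.0
print(round(average_hops(n, hypercube_hops), 2))   # 1.71
```

Under these assumptions, linking the two farthest QPUs costs seven Bell pairs on the chain against three on the hypercube, which is in line with the higher Bell-state consumption observed for the worst-case topology.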

Additional Experimental Results
To show that the proposed compiling strategies can be applied to more complex topologies, we tested them on a slight variation of the worst-case scenario. This new topology is depicted in Figure 13, where we doubled the number of EPR pairs per QPU and hence the number of links between neighboring QPUs. As described in Section 5.1, this setting enables our compiler to perform data-qubit swapping in a more efficient way and also greatly reduces the number of layers dedicated to link entanglement generation, as we can generate and use double the number of links in parallel. Such a performance improvement is clearly shown in Figure 18.
The same considerations apply to the depth of the compiled circuits, illustrated in Figure 19, where we can observe an appreciable improvement over both the worst-case scenario and the state-of-the-art compiler. As for the number of generated EPR pairs, aside from an imperceptible difference with the worst-case scenario, the advantage of a hypercube topology is still evident.
Figure 19: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler with Andrés-Martínez's compiler [41], in terms of circuit depth. Our compiler distributed circuits on the topology illustrated in Figure 13, with two links between neighboring QPUs (illustrated in Figure 7), while Andrés-Martínez's one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterized by one data qubit per QPU.
Figure 20: Comparing the entanglement swapping based and data-qubit swapping based strategies of our compiler with Andrés-Martínez's compiler [41], in terms of consumed EPR pairs. Our compiler distributed circuits on the topology illustrated in Figure 13, with two links between neighboring QPUs (illustrated in Figure 7), while Andrés-Martínez's one exploits a more favorable topology, i.e., the hypercube illustrated in Figure 14b. Both topologies are characterized by one data qubit per QPU.
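The effect of doubling the links between neighboring QPUs on the number of link generation layers can be sketched with a toy model. It assumes a linear chain, one Bell pair consumed on every hop of each remote CNOT, and a fixed number of Bell pairs that each edge can generate per round. The names `edge_load` and `link_generation_rounds` are hypothetical, and the model ignores storage limits and the scheduling actually performed by the compiler.

```python
from collections import Counter
from math import ceil


def edge_load(remote_cnots):
    """Count how many Bell pairs each chain edge (k, k + 1) must supply
    when every remote CNOT (i, j) uses one pair on each hop between i and j."""
    load = Counter()
    for i, j in remote_cnots:
        lo, hi = sorted((i, j))
        for k in range(lo, hi):
            load[(k, k + 1)] += 1
    return load


def link_generation_rounds(remote_cnots, links_per_edge):
    """Rounds needed when each edge yields `links_per_edge` Bell pairs per round;
    the most loaded edge determines the number of rounds."""
    load = edge_load(remote_cnots)
    if not load:
        return 0
    return max(ceil(uses / links_per_edge) for uses in load.values())


# One layer of remote CNOTs on six QPUs, each acting on distinct qubits.
layer = [(0, 5), (1, 4), (2, 3)]
print(link_generation_rounds(layer, 1))  # 3
print(link_generation_rounds(layer, 2))  # 2
```

In this example the middle edge must supply three Bell pairs, so a single link per edge needs three rounds while two links per edge need two, a qualitative reduction of the kind that doubling the links is expected to give.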

Conclusion
In this paper, we have discussed the main challenges arising with compiler design for distributed quantum computing. Then, we analytically derived an upper bound of the overhead induced by quantum compilation for distributed quantum computing. The derived bound accounts for the overhead induced by the underlying computing architecture as well as the additional overhead induced by the sub-optimal quantum compiler. To this aim, we designed a quantum compiler with three key features: i) general-purpose, namely, requiring no particular assumptions on the quantum circuits to be compiled, ii) efficient, namely, exhibiting a polynomial-time computational complexity so that it can successfully compile medium-to-large circuits of practical value, and iii) effective, namely, with the total circuit depth overhead induced by the compilation always upper-bounded by a factor that grows linearly with the number of logical qubits of the original quantum circuit. We validated the theoretical upper bound against an extensive set of medium-size quantum circuits of practical interest, and we confirmed the validity of the compiler design through an extensive performance analysis.