Distributed Quantum Computing and Network Control for Accelerated VQE

Interconnecting small quantum computers will be essential in the future for creating large-scale, robust quantum computers. Methods for distributing monolithic quantum algorithms efficiently are thus needed. In this work we consider an approach for distributing the accelerated variational quantum eigensolver (AVQE) algorithm over distributed quantum computers of arbitrary size, in terms of number of qubits. We consider approaches for distributing qubit assignments of the Ansatz states required to estimate the expectation value of Hamiltonian operators in quantum chemistry in a parallelized computation, and provide a systematic approach to generating distributed quantum circuits for distributed quantum computing. Moreover, we propose an architecture for a distributed quantum control system in the settings of centralized and decentralized network control.


Introduction
Executing large-scale quantum algorithms on a quantum computer will require a large number of qubits, both for error correction and for computation. One path to creating quantum computers with many qubits is to construct a network of smaller-scale quantum computers and perform distributed computing amongst them. This scheme is known as distributed quantum computing or quantum multi-computing. Envisioned in IBM's road-map for scaling quantum devices is a plan to create quantum interconnects that network dilution refrigerators, each holding a million qubits, to create a massively parallel quantum computer [1]. In any system that is converted from monolithic to distributed, a layer of communication complexity is added in order to perform distributed operations across devices. In the context of distributed quantum computing, the control system is tasked with handling the needed communication and potentially synchronization between devices. The specifics of the stack strongly depend on the architecture of the distributed system.
Connecting smaller quantum computers is one way to gain more power out of near-term quantum devices. Another possibility is to use variational quantum algorithms [2]. Variational quantum algorithms are hybrid classical-quantum algorithms: they leverage classical optimization together with reduced-depth quantum circuits to generate an approximate solution to a problem. One such example is the variational quantum eigensolver (VQE). VQE can be used in quantum chemistry to estimate the ground state energy of molecular chemical Hamiltonians. An implementation of VQE on existing quantum computers has been presented by the quantum group at Google [3].
A modified version of VQE called Accelerated VQE, or α-VQE, has been specifically designed to make the best use of near-term quantum hardware. The α in α-VQE represents the trade-off parameter between run-time (which could be long for some variational algorithms) and circuit depth [4]. In other words, α allows one to fine-tune the run time to cope with the limited coherence time of near-term quantum machines. α-VQE uses a more efficient method for estimating the expectation value of a Hamiltonian than standard VQE, namely a quantum algorithm called Accelerated Quantum Phase Estimation, or α-QPE. α-VQE replaces the expectation value estimation stage of standard VQE with α-QPE, thereby potentially enhancing VQE when longer qubit stability is achieved.
In this work, we take these concepts and combine them to construct a method for running α-VQE on a distributed system of quantum computers. We begin with a technical overview of the higher-level concepts that are used throughout the project. Next, we begin to decompose α-VQE. To estimate expectation values in VQE, an Ansatz state has to be initialized. In Section 3 we consider various approaches for distributing the Ansatz states over an arbitrary distributed quantum computer and we propose a method for distributing the circuits needed to perform the Ansatz initialization. In Section 4, we describe two different architectures for performing distributed quantum computing and propose network control systems based on Deltaflow.OS.

Summary of Contributions
In this work, we construct a framework for performing accelerated VQE on a distributed system of quantum computers. Our framework requires two inputs: the number of qubits in a distributed collection of QPUs and the circuitry needed to run VQE. As an outcome, we produce a mapping of the monolithic system to a distributed system such that the Hamiltonian expectation estimation can be performed in a parallelized and distributed computation. Moreover, we design a control system architecture that can be used to execute the distributed quantum gate instructions. This process is not strictly confined to α-VQE: many of the ideas can be adapted to VQE in its standard form (when α = 0) and potentially to other variational quantum algorithms. The strategy for distributing qubits and scheduling can also be adapted to other types of quantum algorithms, not necessarily variational.

Related Work
We expand on the work of the accelerated variational quantum eigensolver in [4], extending it to distributed quantum architectures. The method of decomposing quantum algorithms that we use in our framework has been explored in [5,6], and a full example of decomposing Shor's algorithm was proposed in [7]. Control systems for quantum computing have been proposed in [8,9], but these do not discuss the control between networked quantum computers. In [10], a quantum multi-computer architecture optimized to perform Shor's algorithm is proposed. Here we consider splitting Ansatz states and use a distributed and parallelized approach for executing α-VQE, but there are overlapping ideas in these listed works in terms of the requirements for networking quantum computers.

Technical Prerequisites
In this paper, we make use of theory from distributed quantum computing, software based control systems, distributed operating systems and dataflow programming schemes. In this section, we give a brief overview of each of these topics as a primer.

Software Control Systems of Quantum Hardware
To perform the gate operations on the qubits in a quantum computer requires a system that can translate gate instructions to physical interactions with the qubits. Currently, quantum algorithms are generally written in terms of circuits of quantum gates. The circuits are designed with the assumption of noiseless qubits. The software takes these circuits as input, optimizes them, and converts them to a data format such that the control system controlling the qubits can execute the instructions.
A first step into defining how such a system functions is to draft a model of the full stack of the quantum computer. Such an architecture has been proposed in [9], defining the software and hardware stack for a quantum computer. Here a protocol is defined to convert the classical and quantum instructions to binary strings such that they can be executed at the machine level. This stack incorporates error correction into the model and injects the additional instructions between gate operations when needed. To execute the instructions, hardware is in place that quickly reads the binary strings and then runs the optical control on the qubits. Such hardware is proposed in [20]. Here cryogenic field programmable gate arrays (FPGAs) are incorporated into the control system architecture to manipulate semiconductor-based qubits. In [21], it is explained how moving the parts of the classical control of quantum system to lower-latency hardware like FPGAs can greatly benefit near-term quantum computing.
An important part of this system stack is the ability to control the number of messages being passed to the hardware. The number of messages can grow very quickly when error correction is considered, and the bandwidth of the system can be used up completely with just instructions for error correction. In [8], QuEST (Quantum Error-Correction Substrate), an architecture that delegates the task of quantum error correction to the hardware, is proposed to overcome this.
Software and hardware will have to work closely together in a highly optimized way in order to reduce the number of instructions while executing them with as low a latency as possible. As quantum hardware technology improves and as research into quantum software deepens, this area of quantum computing will become central to executing quantum algorithms on large-scale quantum hardware.

Distributed Operating Systems
In general, an operating system (OS) is a system that manages the resources on a computer, such as the random access memory or the CPU, such that multiple programs can run simultaneously without undesirably interfering with each other. An OS also provides an interface between the user and the hardware. We will refer to the class of OSs that run on a single computer as centralized OSs. A distributed OS is an OS that runs on a cluster or group of computers which are physically separated and connected via a network [22]. To the user of a distributed OS, it should appear as if their programs are running on a centralized OS. More specifically, a distributed OS should behave as an ordinary centralized OS, with the caveat that the programs could be running at any physical location, which is not known to the user.
A distributed OS can be deployed in multiple ways [23, Chapter 8]. One way is to deploy the OS such that there is a distinction between the types of nodes in the distributed system, "nodes" meaning the computers in the cluster. The distinction is generally that there is one computer which controls the rest of the system, and the controlled nodes follow all the commands of this "controller" unit. An alternative configuration is that the computers in the cluster are connected via an internet, and a set of internet protocols is used to request resources from the nodes in the cluster and perform inter-process communication. One can think of this as a client-server relationship [24]. In this case, we call the operating system a network OS. The main difference between a distributed OS and a network OS is that in a network OS, the user is aware that multiple systems are being used, albeit programs appear to be running on a single system.
One can deploy their systems as a distributed OS or a network OS or as a hybrid of the two. When deploying a distributed operating system one needs to find a good balance of some key properties to ensure that the operating system is robust, efficient, and can be scaled up. For example, one can potentially make the system very robust by adding abundant inter-node communication, but this could make the system inefficient or can overload the processors with messages to process.
In this work, we are focused on a specific system with a specific use case: systems that have classical control, together with hardware that establishes quantum entanglement and classical communication. We explore how one can design a distributed OS where the distributed part we focus on is a cluster of quantum computers. We take into account two models: the client-server model over an entanglement network and the single-controller model. We use a specific control system, namely Deltaflow.OS, which we explain in Section 4, to explore these two models in depth.

Dataflow Programming
Dataflow programming is a method of programming that uses a network flow, or a directed graph, approach for developing algorithms [25]. The nodes in the network hold the logic of the program -or are constant valued -and the flow in the network represents the inputs to the next nodes in the flow which is then processed and output to the next node in the network until the program is complete. Generally, in dataflow programming the nodes run in parallel and asynchronously. In real implementations, the nodes run idly, waking when they receive input to process. When the program starts, some nodes are selected to initialize without input, triggering the start of the program. Commonly used hardware programming languages like Verilog and VHDL use dataflow programming as a paradigm.
In Figure 1 is an example of a simple dataflow program. The constants 3 and 5 are inputs to the + node, which takes two inputs and outputs their sum. The output of the addition is passed to the × node, which takes two inputs and outputs their product. In this case, 2 and 3 + 5 are the inputs to ×, and the complete output is 16. In this work, we use Deltaflow to add control to the network hardware. Deltaflow is built on the dataflow programming paradigm. We will discuss in Section 4 how Deltaflow can be used to define the logic in the control blocks for an overall network control.
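As an illustration, the example of Figure 1 can be sketched in plain Python, where each node fires once all of its inputs have arrived. The `Node` class and its wiring are our own illustrative construction, not Deltaflow's API:

```python
# A minimal sketch of the dataflow example in Figure 1: nodes are callables
# wired into a directed graph; a node "wakes" and fires once all of its
# inputs are present, passing its result downstream.

class Node:
    def __init__(self, fn, n_inputs):
        self.fn, self.n_inputs = fn, n_inputs
        self.inputs, self.outputs = [], []   # pending values, downstream nodes

    def send(self, value):
        self.inputs.append(value)
        if len(self.inputs) == self.n_inputs:   # all inputs arrived: fire
            result = self.fn(*self.inputs)
            for node in self.outputs:
                node.send(result)

results = []
add = Node(lambda a, b: a + b, 2)
mul = Node(lambda a, b: a * b, 2)
sink = Node(results.append, 1)
add.outputs.append(mul)
mul.outputs.append(sink)

add.send(3); add.send(5)   # constants 3 and 5 feed the + node
mul.send(2)                # constant 2 feeds the x node
print(results[0])          # (3 + 5) * 2 = 16
```

In a real dataflow runtime the nodes would run asynchronously on queues; here the firing is simulated with direct calls purely to show the graph structure.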

Distributing α-VQE
A problem to overcome when dealing with near-term quantum computing devices is that the ability to run deep circuits is greatly reduced due to the low coherence time of qubit systems without error correction. A classical-quantum hybrid class of algorithms called "variational quantum algorithms" allows running reduced-depth circuits by performing some of the algorithm on near-term quantum hardware and some on classical hardware. In particular, the variational quantum eigensolver (VQE) algorithm is a hybrid variational quantum algorithm that can be used to find the minimum eigenvalue of a chemical Hamiltonian. For a particular Ansatz, it uses the quantum hardware to estimate expectation values of the Pauli operations that combine to form the Hamiltonian; these expectation values are then combined to find an expectation value of the full Hamiltonian [26].
Using classical optimization techniques, various Ansätze (plural of Ansatz) are prepared with the goal of finding an estimate of the eigenstate with the lowest eigenvalue. The drawback of VQE is that the number of times the Ansatz state and expectation value need to be prepared is proportional to 1/ε², where ε is the desired precision, which could lead to long run-times [2]. Another way to estimate eigenvalues of unitary operations is the quantum phase estimation (QPE) algorithm, explained more in depth in Section 3.2.1. The advantage of using QPE is that the number of times the experiment must be conducted to find the estimate is constant. The downside is of course that the circuit depth grows proportionally to 1/ε.
As quantum hardware technology improves, it will allow for longer qubit coherence times and in turn for deeper quantum circuits. To make use of this ability, and to "squeeze" as much power as possible out of the quantum hardware that is available, Wang et al. proposed the Accelerated VQE (α-VQE) algorithm [4]. We again attempt to squeeze more power out of our quantum hardware by considering how one could implement α-VQE on a distributed quantum computer.
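The trade-off that α controls can be illustrated numerically. Following the scalings reported in Ref. [4] (number of repetitions roughly 1/ε^(2(1−α)) and maximum circuit depth roughly 1/ε^α, constant factors omitted), a short sketch:

```python
# Illustrative scaling of the alpha-VQE trade-off from Ref. [4]: the number
# of repetitions scales roughly as 1/eps**(2*(1 - alpha)) while the maximum
# circuit depth scales as 1/eps**alpha. Constant factors are omitted; this
# only shows how alpha interpolates between VQE (alpha = 0, many shallow
# repetitions) and QPE (alpha = 1, few deep circuits).

def repetitions(eps, alpha):
    return eps ** (-2 * (1 - alpha))

def max_depth(eps, alpha):
    return eps ** (-alpha)

eps = 1e-3
for alpha in (0.0, 0.5, 1.0):
    print(alpha, repetitions(eps, alpha), max_depth(eps, alpha))
# alpha = 0.0: ~1e6 repetitions at depth ~1 (standard VQE sampling)
# alpha = 1.0: ~1 repetition scale at depth ~1e3 (QPE-like)
```

Picking α between 0 and 1 therefore trades run-time for the circuit depth the hardware's coherence time can support.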
When using VQE for quantum chemistry applications, it is common to prepare parameterized circuits that generate entangled Ansatz states. A commonly used Ansatz is the unitary coupled cluster Ansatz [27], whose qubit requirements grow as the Hamiltonian complexity increases. A critical part of using a distributed quantum computer for quantum chemistry is therefore preparing Ansatz states over an array of quantum computers. When distributing any quantum circuit across devices, the main complication arises when a controlled two-qubit gate needs to be applied across two QPUs. We assume that only entanglement and classical communication are used to achieve this; alternatively, one could physically move qubits between QPUs, but this is a much noisier task and we ignore this option. We consider two approaches: the first is teleporting one of the two qubits to the other QPU so that both are on the same QPU, and then performing the two-qubit gate locally; the second is to use the mechanism introduced in Ref. [5], where Yimsiriwattana et al. introduce the "cat-entangle" and "cat-disentangle" protocols seen in Fig. 2.
Comparing these two approaches in terms of the number of operations needed, we find that the approach of Yimsiriwattana et al. is more efficient. In order to use teleportation in a distributed system, we would require 2 Bell pairs to teleport the qubit from one QPU and back again. The method of Yimsiriwattana et al. requires just 1 Bell pair to perform a non-local control gate, and this Bell pair can also be used to perform multiple control gates when the control qubit is the same, as is done in [28] for the distributed quantum Fourier transform.
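This accounting can be made concrete with a toy count, under the simple assumptions that teleportation is done per gate (there and back) and that all k non-local gates share the same control qubit so the cat pair can be reused:

```python
# A small sketch comparing entanglement cost for k non-local controlled
# gates that share the same control qubit: a teleportation round trip costs
# 2 Bell pairs per gate, while the cat-entangle/cat-disentangle method of
# Ref. [5] can reuse a single Bell pair for the whole run of gates.

def bell_pairs_teleport(k):
    # teleport the control qubit to the remote QPU and back, once per gate
    return 2 * k

def bell_pairs_cat(k):
    # one cat-entangle / cat-disentangle pair brackets all k gates
    return 1 if k > 0 else 0

print(bell_pairs_teleport(3), bell_pairs_cat(3))   # 6 vs 1 for three gates
```

A smarter teleportation schedule could also batch gates between round trips, but even then it needs 2 Bell pairs where the cat method needs 1.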
Using the approach of Yimsiriwattana et al., in the first subsection we consider how, given a collection of QPUs and an electronic molecular Hamiltonian, we can generate a schedule that can be used to estimate the expectation value of the Hamiltonian. We develop two approaches for solving this: the first is a greedy algorithm and the second uses constraint programming. In the next subsection, we consider how we can perform the α-QPE step that is required for estimating the expectation value over a distributed system, and merge the ideas to produce a complete version of a distributed α-VQE.

Scheduling Hamiltonians
An electronic molecular Hamiltonian H can be written as a sum of a polynomial number (with respect to the system size) of Pauli matrices in the form of Eq. (1),

H = Σ_i a_i P_i,    (1)

where each P_i ∈ {I, σ_x, σ_y, σ_z}^⊗n is a tensor product of n single-qubit Pauli operators (or the identity), called a Pauli string, and each a_i ∈ ℝ. In order to more effectively use a networked quantum computer, we wish to use a parallelized and distributed approach to expectation value estimation. We motivate the approach as follows. Given the linear nature of estimating ⟨ψ|H|ψ⟩, we can break up the summation into its pieces. We need to prepare an n-qubit Ansatz for each piece of the sum in order to estimate each ⟨ψ|P_i|ψ⟩ independently, and later rejoin the expectation values to estimate ⟨ψ|H|ψ⟩. Given the distributed QPU architecture, we need to allocate the qubits in such a way that Ansatz states can be prepared for each P_i in the sum. Later, the coefficients a_i can be merged to produce a single value for ⟨ψ|H|ψ⟩.
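The linearity argument can be checked directly with a small numerical sketch; the 2-qubit Hamiltonian and Bell-state "Ansatz" below are illustrative choices, not tied to any molecule:

```python
# The expectation of H = sum_i a_i P_i in a state |psi> is the weighted sum
# of the independent expectations <psi|P_i|psi>, so each Pauli string can be
# estimated on its own (possibly distributed) Ansatz copy and the results
# merged classically. Verified here against the direct matrix expectation.

import numpy as np
from functools import reduce

I = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli_string(ops):
    return reduce(np.kron, ops)

terms = [(0.5, [Z, I]), (0.3, [I, Z]), (0.2, [X, X])]     # H = 0.5 ZI + 0.3 IZ + 0.2 XX
psi = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)  # Bell state as "Ansatz"

# estimate each <psi|P_i|psi> independently, then merge with coefficients a_i
expectations = [np.real(psi.conj() @ pauli_string(P) @ psi) for _, P in terms]
total = sum(a * e for (a, _), e in zip(terms, expectations))

H = sum(a * pauli_string(P) for a, P in terms)
assert np.isclose(total, np.real(psi.conj() @ H @ psi))   # matches direct value
print(total)
```

For this Bell state the ZI and IZ terms vanish and only the XX term survives, so the merged value is 0.2.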
For Hamiltonians that require a large number of qubits, in this subsection we consider methods that distribute the expectation calculation of the Pauli strings across a given distributed quantum computer. Here we model a collection of quantum processors {QPU_1, ..., QPU_m}, where QPU_i holds q_i ∈ ℕ qubits, all located in the same device. Given a set of QPUs and a Hamiltonian in the form of a summation of Pauli strings, a distributed layout of the qubits with the required allocation of communication qubits is produced.
We enforce the following restrictions. Because the goal is to run α-VQE, we know ahead of time that one additional qubit (additional to the qubits in the Ansatz) is reserved for each Ansatz to perform α-QPE. On top of this, we need to reserve qubits for entanglement between QPUs, which is necessary when an Ansatz is split between QPUs. The worst case occurs when there is a three-qubit control gate (equivalent to a Toffoli gate) whose chain of qubits is allocated on different QPUs while performing α-QPE. In this case, since we are using the method of cat-entangling and disentangling, we need to reserve 2 qubits from each QPU for entanglement. We depict such a distribution in Figure 3. We formalize this as a problem:

Problem 1 (Ansatz Distribution Problem). Given a Hamiltonian H = Σ_{i=1}^{n} a_i P_i, where each Pauli string P_i ∈ {I, σ_x, σ_y, σ_z}^⊗n_i, and a collection of m QPUs described by the number of qubits on each system [q_1, q_2, ..., q_m], output a series of rounds that can be used to estimate, for a given Ansatz |ψ⟩, the expectation ⟨ψ|H|ψ⟩. In order to prepare an Ansatz, when P_i is split between two QPUs, 2 qubits from each QPU have to be allocated in order to perform the non-local operations for preparing the Ansatz |ψ⟩ across two or more QPUs. Moreover, 1 qubit needs to be reserved for α-QPE. The solution to this problem outputs a schedule of distributions which one can run over the distributed system to obtain an estimate of ⟨ψ|H|ψ⟩.
For the task of distributing the qubits, we take various approaches to this problem. In essence, this is a resource allocation problem, and we can therefore gain insight from common solutions to such problems. Common approaches for resource allocation problems are greedy algorithms and constraint programming. We propose an algorithm of each type in this section.

Figure 3: Distribution of an 11-qubit Ansatz on three QPUs with 6 qubits each. One qubit, in green, is reserved for α-QPE. Communication qubits are reserved in orange. The Ansatz qubits are in red. Two qubits are reserved for communication to accommodate any control-control gates that need to cross QPUs while running α-QPE.

Greedy Ansatz Distribution
In the greedy algorithm approach, we greedily fill the QPUs with as many Ansatz states as can possibly fit, and the remaining needed qubits we split across the QPUs, reserving communication qubits as needed. When the QPUs cannot fit any more Ansätze, the execution of those estimations is moved to the next round. In detail, we propose Algorithm 1. We refer to an algorithm called doesNotFit, which runs similar logic to the main algorithm but just ensures a distribution exists for one particular Ansatz. We refer the reader to Appendix A, Algorithm 10 for the detailed algorithm.
Algorithm 1 Greedy Ansatz Distribution
Input:
• n, the number of qubits for the Ansatz
• p, the number of Pauli strings to distribute
• parameters for recursion, defaulted to schedule = {} and round = 1
Output: An Ansatz distribution schedule used to compute ⟨ψ|H|ψ⟩ for an Ansatz |ψ⟩ of size n qubits.
GreedyDistribution(Q, n, schedule, round):
1: if p = 0 or n = 0:
2:   return schedule
3: if j = 1:
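A much-simplified sketch of this greedy strategy (whole Ansätze placed locally where they fit, at most one split Ansatz per round, 2 communication qubits reserved per QPU involved in a split) may help fix ideas; it omits the detailed bookkeeping of Algorithm 1 and `doesNotFit`:

```python
# Simplified greedy scheduling: fill QPUs with whole Ansätze (n qubits plus
# 1 alpha-QPE qubit) where possible, split at most one Ansatz across the
# leftover capacity (reserving 2 communication qubits per involved QPU),
# and push whatever does not fit into the next round. Illustrative only.

def greedy_schedule(qpu_sizes, n, p):
    rounds = []
    remaining = p
    while remaining > 0:
        free = list(qpu_sizes)
        placed = 0
        for i in range(len(free)):
            while remaining - placed > 0 and free[i] >= n + 1:
                free[i] -= n + 1          # whole Ansatz plus its QPE qubit fits locally
                placed += 1
        # try one split Ansatz over leftover capacity (2 comm qubits per QPU)
        leftover = sum(max(f - 2, 0) for f in free if f > 2)
        if remaining - placed > 0 and leftover >= n + 1:
            placed += 1
        if placed == 0:
            raise ValueError("Ansatz does not fit on this system")
        rounds.append(placed)
        remaining -= placed
    return rounds

# Figure 3 setting: an 11-qubit Ansatz on three 6-qubit QPUs -> one split
# Ansatz per round, so 4 Pauli strings take 4 rounds.
print(greedy_schedule([6, 6, 6], n=11, p=4))
```

Allowing only one split Ansatz per round is a deliberate simplification; the full algorithm packs splits as long as capacity remains.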

Constraint Programming Approach
As another approach to solving Problem 1, we use constraint programming. The trade-off with constraint programming is that setting up a collection of constraints is generally straightforward, but solving constraint problems over a finite domain is generally NP-complete, trading simplicity for time. We construct the multi-objective constraint program in detail in Constraint Program 2. Using this constraint program repeatedly, we can produce a schedule by running it on the maximum number of Ansätze that fit in the system and using a solution from the output, once per round, until all Ansätze are covered.

Constraint Program 2 Constraint Programming Distribution
Input:
• Q = [q_1, ..., q_n], ∀i, q_i ∈ ℕ, a list of the number of qubits for each QPU in the system
• A ∈ ℕ, the number of qubits in the Ansatz
• m ∈ ℕ, the number of Ansätze to fit
Variables:
• x_ij ∈ {0, ..., A}: the number of qubits from Ansatz i allocated on QPU j
• y_ijk ∈ {0, 2}: the number of qubits used to split Ansatz i between QPUs j and k
Objective Functions:

Constraints:
1. There is only one QPE qubit per Ansatz.
2. If the Ansatz is split, then both QPUs use qubits.
3. The Ansatz is completely covered with one QPE qubit: Σ_{j=1}^{n} x_ij + y_ij = A + 1, ∀i ∈ {1, ..., m}.
4. Qubits allocated do not exceed the number of qubits on the QPU. Note we can recycle the splitting qubits for multiple splits of the same Ansatz.
5. The Ansatz fits on one QPU or it is split.
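To make the constraints concrete, the following sketch checks feasibility of a candidate assignment. The encoding (per-QPU qubit counts `x`, split pairs, and an α-QPE qubit assumed to sit with part of its Ansatz) is our own simplification of the program above, and the objective functions are omitted:

```python
# Hedged feasibility check for an Ansatz distribution: x[i][j] is the number
# of qubits of Ansatz i placed on QPU j, split[i] the set of QPU pairs over
# which Ansatz i is split (2 communication qubits per involved QPU, recycled
# across splits of the same Ansatz), and qpe[i] the QPU holding its
# alpha-QPE qubit. This only tests the coverage and capacity constraints.

def feasible(x, split, qpe, Q, A):
    m, n = len(x), len(Q)
    for i in range(m):
        if sum(x[i]) != A:                         # Ansatz completely covered
            return False
        used = {j for j in range(n) if x[i][j] > 0}
        if x[i][qpe[i]] == 0:                      # QPE qubit sits with its Ansatz (assumption)
            return False
        for (j, k) in split[i]:                    # splits join two QPUs actually used
            if j not in used or k not in used:
                return False
        if len(used) > 1 and not split[i]:         # a split Ansatz needs comm qubits
            return False
    for j in range(n):
        load = sum(x[i][j] for i in range(m))
        load += sum(1 for i in range(m) if qpe[i] == j)       # alpha-QPE qubits
        load += sum(2 for i in range(m)                       # comm qubits, recycled
                    if any(j in pair for pair in split[i]))   # across splits of Ansatz i
        if load > Q[j]:
            return False
    return True

# Figure 3 layout: an 11-qubit Ansatz over three 6-qubit QPUs
print(feasible([[4, 4, 3]], split=[{(0, 1), (1, 2)}], qpe=[2], Q=[6, 6, 6], A=11))
```

A constraint solver would search over such assignments while optimizing the objectives; this checker only validates one candidate.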

Distributing α-VQE
As discussed in earlier sections, the variational quantum eigensolver (VQE) is a variational algorithm that uses a combination of quantum and classical components and can be used to estimate ground state energies of electronic molecular Hamiltonians. To perform chemical calculations, VQE uses a statistical sampling sub-routine to estimate expectation values for a given Ansatz, with a classical optimizer picking the parameters that minimize the expectation value. In Ref. [4], a generalization of VQE is proposed, called α-VQE. The generalization replaces the statistical sampling step with a subroutine called α-QPE, which for α = 0 behaves as VQE does, but becomes more efficient for α > 0, which requires the ability to run deeper circuits on quantum hardware. In this section, we take the α-VQE proposed in [4] and map it to a distributed system. The main theme in this section is applying non-local control gates over separated QPUs. We follow the approach of Refs. [5,6], using entanglement and classical communication to perform control gates across distributed systems, relying on the pre-allocated qubits from the previous section to hold the entanglement across devices.

Distributing α-QPE
The quantum phase estimation (QPE) algorithm is an essential ingredient of many popular quantum algorithms, one such being Shor's algorithm. First discussed by Kitaev in [29], QPE is used to estimate the phase that a quantum state |ψ⟩ acquires after applying a specific unitary operation U to it, where |ψ⟩ is an eigenstate of U. Specifically, QPE aims to estimate, with high probability, the phase φ in U|ψ⟩ = e^{2πiφ}|ψ⟩. In Fig. 4, we depict a circuit representation of QPE applied to a qubit |ψ⟩, where n qubits are used to estimate φ.
Here, we adapt a modified version of QPE developed in Ref. [4], called α-QPE, to a distributed system. α-QPE is a modified version of rejection filtering phase estimation (RFPE), whose circuit diagram is given in Fig. 5. α-QPE uses a free parameter α that is chosen depending on the available circuit depth of the specific hardware running the algorithm. With this α, M and θ are selected as M = 1/σ^α and θ = µ − σ. Here, σ and µ are the parameters of a normal N(µ, σ²) prior distribution used in the first round of α-QPE for sampling values of φ, the "eigenphase" in U|φ⟩ = e^{±iφ}|φ⟩. Here U is modified to be a rotation operator that rotates an Ansatz |ψ⟩ by an angle φ in the plane spanned by {|ψ⟩, P|ψ⟩}, where P is a Pauli string. More precisely, with the goal of estimating |⟨ψ|P|ψ⟩|, given an Ansatz preparation circuit R := R(λ) for some parameter vector λ ∈ ℝ^n and a reflection operator Π := I − 2|0⟩⟨0|, we set U := RΠR†PRΠR†P†, and the circuit depicted in Fig. 5 is executed to obtain a value E. When E is obtained, rejection sampling is performed to produce a posterior distribution, which can be shown to again be normal, from which to again sample values of φ. This process is repeated until sufficient accuracy is reached. Once an estimate for φ is obtained, one can recover |⟨ψ|P|ψ⟩| using the relation |⟨ψ|P|ψ⟩| = cos(φ/2). In [4], mechanisms to recover the sign of ⟨ψ|P|ψ⟩ are provided.
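The α-QPE update loop described here can be simulated classically for a known eigenphase, using the RFPE outcome likelihood P(E = 0 | φ; M, θ) = (1 + cos(M(φ − θ)))/2 together with M = 1/σ^α and θ = µ − σ; the round and sample counts below are arbitrary illustrative choices:

```python
# Classical simulation of the rejection-filtering update behind alpha-QPE:
# each round draws a simulated measurement outcome, keeps prior samples in
# proportion to their likelihood, and refits a normal posterior. Purely a
# numerical sketch of the Bayesian loop, not the quantum circuit itself.

import math, random

def alpha_qpe(phi_true, alpha, rounds=300, samples=400, seed=7):
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0                       # normal prior N(mu, sigma^2)
    for _ in range(rounds):
        M, theta = 1.0 / sigma**alpha, mu - sigma
        p0 = (1 + math.cos(M * (phi_true - theta))) / 2
        E = 0 if rng.random() < p0 else 1      # simulated measurement outcome
        accepted = []
        for _ in range(samples):               # rejection filtering update
            s = rng.gauss(mu, sigma)
            like = (1 + math.cos(M * (s - theta))) / 2
            if E == 1:
                like = 1 - like
            if rng.random() < like:
                accepted.append(s)
        if accepted:                            # refit a normal posterior
            mu = sum(accepted) / len(accepted)
            var = sum((a - mu) ** 2 for a in accepted) / len(accepted)
            sigma = max(math.sqrt(var), 1e-12)
    return mu, sigma

mu, sigma = alpha_qpe(phi_true=0.75, alpha=0.5)
print(mu, sigma)   # mu should approach 0.75 as sigma shrinks
```

Larger α shrinks σ (and hence the error) in fewer rounds at the price of larger M, i.e. deeper circuits, matching the trade-off described above.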
In this subsection we tackle three key steps in adapting α-QPE to a distributed system: the first is mapping the state preparation circuit R(λ) across multiple QPUs, the second is mapping U to a distributed system, and the third is performing the controlled operation in Fig. 5. We solve these in order. The solution to the first task takes the Ansatz preparation circuit R(λ) and develops a mechanism such that it can be applied when some qubits are physically separated. Here we consider R(λ) a variational form, a parameterized circuit used to prepare an Ansatz. We give an algorithm to achieve this in Algorithm 3.
The high-level idea of Algorithm 3 is, given the circuit representation of R(λ) as a series of layers, where each layer is a collection of gates, and a mapping of qubits, to search for any control gates where the control and target are physically separated between two QPUs. When one is found, we insert, between the current layer and the next layer in the circuit, the necessary steps to perform the control gate in a non-local way using the cat-entangling method. We also ensure that entanglement is established between the two QPUs ahead of time by prepending an entanglement generation step. As an optimization, the cat-disentangler step can be shifted to a later layer if the non-local control gates share the same control qubit and there are no operations on that control qubit in between the controlled gates. Note that we can generate a distributed R(λ)† in the same way. From the previous subsection, the proposed solutions to Problem 1 ensure that there are two qubits reserved on each QPU for the entanglement qubits needed for non-local operations. Producing the layering of a circuit can be done in a straightforward way, and we assume that this structure is the input to the algorithm. We depict an example of running the algorithm in Fig. 6.

Algorithm 3 Local to Distributed Circuit
Input:
• A circuit representation of a unitary U, where U is a list of lists of gates. Each list represents a layer in the circuit. Gates have the form Gate(ID) or CONTROL(G, ID1, ID2), where G is the gate to be applied using control qubit ID1 and target qubit ID2. Each ID has the form (i, j), where i is the QPU and j the qubit on that QPU.
• A qubit layout map qubitMap on a collection of ID tuples of the form (ID1, ID2).
Output: An equivalent circuit that accommodates non-local controlled gates.
DistributedRemapper(U, qubitMap):
⊲ Add a placeholder layer in case an extra first layer is needed
2: for layer_l ∈ U do
3:   for gate ∈ layer_l do
5:     if gate is CONTROL:
         if entanglement is not established between QPUs i and s:
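A stripped-down sketch of this remapping idea (without the disentangler-deferral optimization or the full gate format of Algorithm 3) might look as follows; gate and step names such as `EPR` and `CAT-ENTANGLE` are illustrative labels:

```python
# Scan a layered circuit for controlled gates whose control and target live
# on different QPUs and bracket them with entanglement generation plus
# cat-entangle / cat-disentangle steps, inserted as extra layers.

def distribute_circuit(layers, qpu_of):
    """layers: list of layers; a gate is ('G', qubit) or ('C', name, ctrl, tgt).
    qpu_of: dict mapping qubit id -> QPU index."""
    out = []
    for layer in layers:
        local, pre, post = [], [], []
        for gate in layer:
            if gate[0] == 'C' and qpu_of[gate[2]] != qpu_of[gate[3]]:
                a, b = qpu_of[gate[2]], qpu_of[gate[3]]
                pre.append(('EPR', a, b))                 # establish entanglement
                pre.append(('CAT-ENTANGLE', gate[2], b))  # share the control state
                local.append(gate)                        # gate acts via the cat qubit
                post.append(('CAT-DISENTANGLE', gate[2], b))
            else:
                local.append(gate)
        if pre:
            out.append(pre)
        out.append(local)
        if post:
            out.append(post)
    return out

# qubits 0,1 on QPU 0 and qubit 2 on QPU 1: the CX(0,2) becomes non-local
print(distribute_circuit([[('C', 'X', 0, 2)], [('G', 1)]], {0: 0, 1: 0, 2: 1}))
```

The deferral optimization from the text would merge consecutive `CAT-DISENTANGLE`/`CAT-ENTANGLE` pairs that share a control qubit; here each non-local gate is bracketed individually for clarity.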

Algorithm 4 GetSeriesControlGates
Input:
• U, the circuit as described in Algorithm 3
• layer_l, the current layer to decompose in U
• s, the QPU for the target qubit
• (i, j), the control qubit
Output: The series of gates directly following from layer_l that are control gates with control qubit (i, j).
GetSeriesCGates(U, layer_l, s, (i, j)):
1: layers ← {layer_{l+1}, ..., layer_n} ⊆ U ⊲ Skip current layer
2: gates ← [ ]
3: for layer ∈ layers do
4:   for gate ∈ layer do
5:     if gate is CONTROL and gate = (·, (i, j), (s, ·)): ⊲ Same control qubit and target QPU
6:       gates.add(gate)

The next step is to map U := RΠR†PRΠR†P† to a distributed system. One immediate observation is that, since P is a Pauli string, P† = P, so no additional steps are needed to map P†. P is a separable operation (i.e., it contains no two-qubit gates), and therefore we can apply each piece of P in a single layer with no added inter-QPU communication. For mapping R(λ)† to a distributed system, as discussed, given a non-distributed circuit for R(λ), we can obtain R(λ)†; to obtain the mapping, we can run Algo. 3 with R(λ)† as the input with the same Ansatz distribution. Next, we consider the n-qubit reflection operator Π, which can be decomposed (locally) as a series of single-qubit gates and CNOT operations. We can therefore again use Algo. 3 to map a provided reflection Π to a distributed architecture, given the Ansatz distribution as input.
For the control part of α-QPE, we consider the controlled version of U, c-U. Because of the structure of U, one can see that the only operation in need of control is the reflection Π, since if Π is not applied, c-U reduces to the identity. Here it will be the case that we need to execute control-control gates (CC-gates). If the Ansatz is split between QPUs, then two qubits need to be reserved on each QPU to accommodate the CC-gates. This is guaranteed by the scheduling algorithm in the previous subsection: there will always be two free qubits reserved, such that we can apply Algo. 3 again after adding a control connection to each gate of the circuit representing Π in distributed form, excluding the previously added non-local steps, to produce a circuit that achieves the controlled version of Π.
The remaining steps of α-QPE address the two complications discussed in Ref. [4], Section 2b. At each iteration of α-QPE, the Ansatz |ψ⟩ = (|φ⟩ + |−φ⟩)/√2 needs to be collapsed into either |φ⟩ or |−φ⟩. In Ref. [4], Wang et al. propose a statistical sampling method which one can apply for a constant number of iterations in order to, with high confidence, both estimate the sign of ⟨ψ|P|ψ⟩ and ensure that |⟨ψ|P|ψ⟩| > δ. When this bound holds, then with high confidence, |ψ⟩ can be efficiently collapsed to either one of |φ⟩ or |−φ⟩. Once this is performed, we apply the α-QPE procedure as normal. If high confidence cannot be achieved, then instead of using the α-QPE circuitry, statistical sampling continues. Statistical sampling in this setting means repeatedly preparing |ψ⟩ and applying the single-layer Pauli string P in order to estimate ⟨ψ|P|ψ⟩. When the bound does not hold, statistical sampling is performed until ⟨ψ|P|ψ⟩ is estimated with sufficient precision in the normal VQE sense. We follow the method of Wang et al., but use the modified R(λ) circuit needed to prepare the Ansatz over a distributed quantum computer. We write this whole procedure in Algorithm 5.

Distributed α-VQE
To conclude the mapping of the localized, monolithic version of α-VQE to the distributed version, we replace the α-QPE subroutine with the distributed α-QPE version from the previous section. For completeness, we write out distributed α-VQE in Algorithm 6.

Analysis
In this section, we analyse the properties of the distributed quantum circuits in relation to the Ansatz size. First, we compare the duration of computation using three methods of estimating the expectation values: in parallel, on a single QPU large enough to hold the Ansatz, and using combined parallel and distributed computing. When running in parallel, one Pauli string is estimated per QPU; the limitation is that the Ansatz can only be as large as the smallest QPU, minus the qubit reserved for α-QPE. In the single-QPU case, we assume the full Ansatz fits on the QPU, so no gates are distributed. Finally, in the distributed and parallel case, Pauli strings are estimated as in the parallel case, but multiple Ansätze can be placed on a single QPU as well as split between multiple QPUs with distributed control gates.
To estimate the number of gates used, we analyse the pieces of the U operator defined in the previous section. The reflection operator Π has cost equivalent, up to 2n single-qubit gates, to an (n + 1)-qubit Toffoli gate [4, Section II.B]. Without ancilla qubits, the circuit depth to implement such a gate currently grows linearly, O(n) [30], with an improved linear scaling using 1 ancilla qubit [31]. When ⌈(n−2)/2⌉ ancilla qubits are available, the gate can be implemented with 6n − 6 CNOT gates at a depth of O(log n) [32]. The additional ancilla qubits needed to decrease the circuit depth could be incorporated in the Ansatz distribution phase of Subsection 3.1; we leave analysing this change to future work and here assume no additional ancilla qubits. For the Ansatz preparation R(λ), in most applications to date the circuit depth is Ω(n) [33], that is, it grows at least linearly in the number of qubits, which could be the most significant overhead in this process.
We demonstrate the time trade-off in Fig. 7, where we assume a QPU cluster with 5 QPUs, each with 10 qubits. We determine a rough upper bound on the number of gates needed to perform distributed computing and summarize the execution time weights and gate quantity scaling in Table 1. In Fig. 8, we show the maximum number of qubits an Ansatz can comprise for four different QPU sizes, as additional QPUs of that size are added to the distributed system.
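Under these assumptions, a rough capacity model for the maximum Ansatz size can be sketched as follows; the reservation counts (one α-QPE control qubit overall, two entanglement qubits per participating QPU when the Ansatz is split) are distilled from the discussion above, not exact hardware figures:

```python
def max_ansatz_qubits(num_qpus, qpu_size):
    """Rough capacity model: one qubit is reserved for the alpha-QPE control
    and, when an Ansatz is split across QPUs, two qubits per participating QPU
    are reserved for entanglement used by the non-local gates."""
    if num_qpus == 1:
        # Single QPU: only the alpha-QPE control qubit is reserved.
        return qpu_size - 1
    # Split case: 2 entanglement qubits on each QPU, plus 1 control qubit total.
    return num_qpus * (qpu_size - 2) - 1

# Example matching the text's cluster of QPUs with 10 qubits each.
sizes = [max_ansatz_qubits(k, 10) for k in range(1, 6)]
```

The linear growth in `sizes` illustrates why, past a small scale, splitting the Ansatz pays off despite the reserved entanglement qubits.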

Table 1: Execution time weight and quantity scaling per operation (columns: Operation, Execution time weight, Quantity scaling).

Applications for Quantum Chemistry
In this section, we take as an example the electronic molecular Hamiltonian of H2. To estimate the Hamiltonian for this molecule, with 2 electrons and 2 active orbitals, we require 4 qubits when using the Bravyi-Kitaev transformation. We can quickly obtain the Hamiltonian using the PennyLane Python library [36]. We are concerned with the number of terms in the sum rather than the constant factors; under the Bravyi-Kitaev transformation, performing α-VQE requires estimating 15 Pauli strings. In this example, we consider a distributed quantum system of 3 QPUs, each containing 9 qubits. Using these parameters as input to the algorithms in Section 3.1.1, the output configuration is the one depicted in Figure 9. In one round, 4 Ansätze fit across this distributed system, so at least 4 rounds need to be executed. We can use the same allocation for the first 3 rounds and, in the last round, eliminate the distributed Ansatz in order to reduce the need for cross communication between QPUs.
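The qubit allocation just described can be sketched as a greedy counter; the reservation rules (one α-QPE qubit per local copy, two entanglement qubits per QPU participating in a split, one control qubit per copy) are assumptions distilled from the text:

```python
def count_ansatz_copies(qpu_sizes, ansatz_qubits):
    """Greedy allocation sketch: place as many local Ansatz copies as possible
    (each needs ansatz_qubits + 1 alpha-QPE control qubit), then check whether
    the leftover qubits can host one more copy split across QPUs, reserving
    2 entanglement qubits per participating QPU plus 1 control qubit."""
    local_cost = ansatz_qubits + 1
    free = []
    copies = 0
    for size in qpu_sizes:
        n_local, rem = divmod(size, local_cost)
        copies += n_local
        free.append(rem)
    # One distributed copy over the QPUs that still have more than 2 free qubits.
    usable = [f - 2 for f in free if f > 2]   # data qubits left after ent. pairs
    if sum(usable) >= ansatz_qubits + 1:      # + 1 control qubit
        copies += 1
    return copies

# 3 QPUs with 9 qubits each and the 4-qubit H2 Ansatz from the text.
n = count_ansatz_copies([9, 9, 9], 4)
```

For the parameters above this reproduces the count of 4 Ansätze per round (3 local plus 1 distributed), consistent with the configuration in Figure 9.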
For the Ansatz preparation, we use the circuit R(λ) depicted in Fig. 10 (a). Of the 4 Ansätze, three can run the α-QPE step without distribution of the Ansatz. The fourth Ansatz, on the other hand, is distributed and needs the circuit in Fig. 10 (b) for preparation. For simplicity, we include arbitrary qubit rotations, represented by the R(λ1, λ2, λ3) gates with λi ∈ [−π/2, π/2] for i ∈ {1, 2, 3}. Next we need to perform the reflection operation Π described in Section 3.2.1, whose circuit is shown in Figure 11 (a). An equivalent circuit is also shown, which decomposes the 4-qubit Toffoli gate into a series of controlled and single-qubit gates. We again need a distributed version of the reflection operation to support the distributed Ansatz; we show this circuit in Fig. 11 (b), where we introduce gates for the cat-entangler and cat-disentangler sequences and 4 qubits are allocated for performing the non-local gates. Now, for running α-QPE, we need a circuit for c−Π, the controlled part of c−U. Here it is critical to have 2 entanglement qubits for each splitting of the Ansatz on each QPU since, as seen in Figure 12, there are control-control gates that occur across QPUs. With this collection of gates, we can run α-QPE and therefore, using the algorithm in Section 3.2, run α-VQE.
Figure 9: A distributed Ansatz of size 4 on three QPUs with 9 qubits each. The green outlined qubits are reserved for running α-QPE, the red outlined qubits are for the Ansätze, and the orange outlined qubits are reserved for entanglement between QPUs for non-local gates. One qubit is left idle. The red dashed lines represent the individual QPUs.
Figure 10: Distributed circuit mapping for R(λ). The square gates acting on 4 qubits represent the cat-entangler/disentangler sequence.
To run α-QPE, one needs to perform the controlled-U operation M times, where U = RΠR†PRΠR†P. The portion requiring control is c−Π. We depict the c−Π part; the other parts of U, which do not need to be controlled, are applied before and after what is depicted.

Networked Control Systems for Distributed QC
Because it will be difficult in the near future to construct large, monolithic quantum computers, a viable option is instead to connect smaller quantum computers together over a network in a distributed manner. One can therefore consider networked control systems (NCSs) to manage the distribution of resources for running quantum algorithms. Such a system could allow more flexibility in hardware configurations and the ability to dynamically add more devices while minimizing integration overhead. An NCS is a set of devices connected by a network in order to perform a specific mutual task orchestrated by a control system [37,38]. Among other things, NCSs are used for distributed or parallel computing, controlling fleets of robots or drones, and smart grid systems deployed in modern cities [39]. Networked control systems can have various architectures for the control system part. These systems can either have a centralised controller, where communication amongst the nodes is restricted to a local area network (LAN), or a decentralised controller system connected via an internet or wide area network (WAN). These two scenarios resemble how distributed quantum computers could potentially be networked. In the first case, one can consider a single owner of multiple quantum devices, all located in the same room or building; in particular, the network owner knows the network topology and the hardware in the network. In the second setting, multiple quantum computers, possibly located far apart, are potentially connected by multi-hop connections where the owners of the hardware between the hops may differ. Here, more advanced protocols that consider security and robustness will be needed, potentially leading to a fully fledged quantum Internet.
In order to use a network of distributed quantum computers efficiently, it is important to develop robust communication protocols such that communication and control between the quantum devices in the network are efficient and reliable. In this section, we consider quantum systems with classical control and separated quantum processing units. We consider a QPU to be the three-layered system depicted in Fig. 13: a layered system with inputs and outputs to a communication network through a classical computer, or CPU. The CPU interfaces with the network and, based on the control instructions it receives, drives the FPGA, which in turn controls the qubits to perform quantum operations on the qubit layer. Qubit measurements and other classical messages are transmitted back to the network via the reverse path.
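The layered control path can be illustrated with a toy Python model; the class names, instruction strings, and stubbed measurement values are purely illustrative assumptions, not part of any real QPU stack:

```python
class FPGA:
    """Gate-level layer: executes instructions and returns (stubbed) results."""
    def __init__(self):
        self.executed = []
    def run(self, instruction):
        self.executed.append(instruction)
        # Stub: a real FPGA would return hardware measurement outcomes.
        return 1 if instruction.startswith("MEASURE") else 0

class CPU:
    """Network-facing layer: receives control instructions from the network,
    drives the FPGA, and returns measurement results toward the network."""
    def __init__(self, fpga):
        self.fpga = fpga
    def handle(self, instructions):
        # Results would be transmitted back over the classical network.
        return [self.fpga.run(i) for i in instructions]

qpu_cpu = CPU(FPGA())
out = qpu_cpu.handle(["H q0", "CNOT q0 q1", "MEASURE q0"])
```

The one-way instruction flow (network to CPU to FPGA to qubits) and the reverse measurement path mirror the layering in Fig. 13.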
We consider the two network configurations described above and go into more detail about how these systems could be implemented in practice. We list the communication requirements needed to perform distributed quantum computations and explore some available protocols to achieve these requirements under two scenarios. In the first, there is a centralised controller of the system; communication to devices is classical, and quantum entanglement can be sent directly to other quantum processors without routing. In the second, control over the network is not centralised, but there is a single user. We then propose a control system using Deltaflow.OS to orchestrate distributed quantum computing.
Figure 13: Internal layering of a QPU. We assume a layered architecture: the CPU instructs the FPGA, which in turn controls and measures the qubits. The CPU also interfaces with the network.
Figure 14: A networked control system with a centralised controller. The triangular arrowheads represent classical connections and the diamond-shaped arrowheads represent quantum connections used for establishing EPR pairs. Here we assume the network between the QPUs is completely connected in terms of quantum and classical connections, that is, each QPU has the same connections to every other QPU. In this completely connected network, the inter-QPU links are dedicated to low-latency transmission. Moreover, each QPU is connected to a common bus line that handles all latency-tolerant message exchanges among the nodes.
Figure 15: An inter-networked distributed quantum computer with decentralized controllers (Controller 1 through Controller n). Here, independent controllers control their respective quantum processing systems. Entanglement is generated over an entanglement network, possibly requiring multi-hop entanglement routing. The controller is placed between the QPU and the quantum network since, in this scenario, a quantum processing layer will be needed.

Control System Architectures for Distributed Quantum Computing
In this section, we discuss two possible network architectures for distributed quantum computing control systems and describe each in depth. The major difference between the two is the centralization of control: in the first, we consider a distributed architecture with a centralized control; in the second, the control is split such that each QPU in the system has its own controller. In later sections, we detail the communication requirements needed to run these systems and potential protocols to achieve them.

Centralised-Controlled Distributed Quantum Systems
The first distributed quantum computing scenario we consider is depicted in Figure 14. Here there is a single controller, and the quantum hardware behaves only according to the instructions fed from this controller. The QPU systems are connected to the controller via a classical network and, further, are connected to each other both classically and quantumly, so that they can generate entanglement amongst themselves. The main idea is that the CPUs in the network have static IP addresses and can be accessed by the centralized control. The finer synchronization between the QPU nodes is delegated, ahead of execution time, from the centralized control to the CPU controlling the FPGA layer of each QPU. The CPUs control the FPGAs, and the FPGAs communicate over fixed low-latency links whose latency can be accounted for in control-instruction scheduling. This network scheme is best suited to small and medium scales; when many nodes are added to the network, a system with distributed control is better suited, which we discuss in the next subsection.

Decentralised-Controlled Distributed Quantum Systems
With a decentralised control system, the nodes in the network are no longer in a "master-slave" relationship because the hardware is no longer controlled by a single entity. Resources in this setting need to be requested from various parties, and there is no guarantee that a requested resource will be available at the time of request. Access to the controllers is hidden behind a firewall, and their IP, MAC, and inner network configuration are potentially not exposed. We assume that the QPUs are offered by various vendors that have agreed to offer a base set of services: they provide access to quantum hardware for a maximum amount of time per instruction-set execution, they offer classical communication input and output to a pre-specified IP address where the communication stream is established prior to execution, and they allow remote entanglement to be established between specified quantum devices. In this case, control information must pass between QPUs, and we will need a low-latency protocol that works at the network layer so that the control messages can be routed.

Distributed Quantum Algorithm Scheduling
In order for networked quantum hardware to execute instructions synchronously, a method of dictating to the devices when the instructions should be executed is needed. In this section, we propose a temporal operation schedule, that is, a schedule of the operations with precise timestamps for execution. These schedules can then be sent to each QPU in the network together with a time to begin execution. Because quantum gates generally have an upper bound on how long they take to execute, we can use this information when generating the schedule. Here, we assume that all gates have a known execution time and that latency times for classical communication and entanglement generation are known. Comparing and combining a hybrid temporal planning approach with constraint programming for quantum circuit scheduling has been investigated in Ref. [40] for the max-cut problem. The difference there is that the level of compilation is deeper, as swap operations are included because the interactions are limited to nearest-neighbour qubits; a temporal planning and constraint programming approach is therefore sensible. Here, we do not enforce nearest-neighbour interaction, and assume the swapping of qubits to their nearest neighbours (with qubit indices remapped so they do not have to be swapped back) is included in the worst-case timings for two-qubit gates. We assume that at the end of each layer of gates every qubit is free to be operated on and no swapping is needed, and therefore we do not use any constraint programming.
The output schedule for each QPU has the form of Table 3; Table 2 shows an intermediate schedule used before splitting the schedules per QPU. For a list of all possible commands and their descriptions, see Appendix A. We approach the problem as follows. We start with high-level instructions: entanglement generation, single-qubit gates, classical communication, and control gates. We generate an instruction list using this gate set, then break the high-level circuits down into finer control instructions. Once the full schedule is created, we split the instructions so that each instruction schedule belongs to a single QPU. The instruction sets can then be sent to their respective QPUs and the algorithm can start. To ensure gates are performed in the correct order, we layer the circuits as in the previous section and schedule them layer by layer, iteratively constructing a full schedule. In complete form, we propose Algorithm 7.
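The layer-by-layer scheduling just described can be sketched in Python under the stated assumptions that worst-case gate times and the set of QPUs each gate touches are known; the gate-naming convention in the toy example below is our own:

```python
def distributed_schedule(qpus, layers, gate_time, qpus_of):
    """Sketch of the layered scheduler: gates in a layer all start at the
    layer's start time, each layer ends at the worst-case finish time of its
    gates, and the flat schedule is then split per QPU. `gate_time(g)` gives a
    gate's worst-case duration and `qpus_of(g)` the QPUs it acts on."""
    layer_end = 0.0
    flat = []
    for layer in layers:                  # first pass: layer-by-layer timestamps
        for g in layer:
            flat.append((g, qpus_of(g), layer_end))
        layer_end += max(gate_time(g) for g in layer)
    per_qpu = {q: [] for q in qpus}       # second pass: split by QPU
    for g, involved, t in flat:
        for q in involved:
            per_qpu[q].append((g, t))
    return per_qpu

# Toy example: one local gate on each of two QPUs, then a non-local CNOT.
layers = [["H@A", "X@B"], ["CNOT@A,B"]]
sched = distributed_schedule(
    ["A", "B"], layers,
    gate_time=lambda g: 2.0 if "CNOT" in g else 1.0,
    qpus_of=lambda g: g.split("@")[1].split(","),
)
```

Each QPU receives only its own timestamped list, matching the splitting step described above; the non-local CNOT appears in both lists with the same start time so the QPUs act synchronously.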

Table 3: Schedule format for a single QPU (columns include Command and QPUs); T_i is the time to execute the particular gate. The output of Algorithm 7 is a collection of schedules in this form, one for each QPU.

Protocols
In order to run a distributed quantum algorithm on a distributed quantum computer using the architectures proposed in the previous section, certain communication requirements need to be met.

Algorithm 7 Distributed Scheduler
Input:
• QPUs, the collection of QPUs in the system
• C = {l1, ..., ln}, the circuit to schedule as a series of layers, where each li = {g1, ..., gm}
• gateTime, a mapping of gate names to the time the gate takes to execute on each QPU
Output: A schedule of gate operations for each QPU to run.
1: layerEndtime ← 0
2: gateSchedule ← [ ]
3: for li ∈ C do ⊲ Make a first-pass schedule based on the layers of the circuit
4:   for gj ∈ li do
5:     gateSchedule.add((gj, QPUs(gj), layerEndtime))
6:   end for
7:   layerEndtime ← maxj(gateTime(gj)) + layerEndtime
8: end for
9: QPUSchedules ← {}
10: for QPU ∈ QPUs do ⊲ Split the schedules so each QPU has its own
11:   for step ∈ gateSchedule do
13:     if QPU ∈ QPUs(step) ∧ |QPUs(step)| = 1

The first requirement considered is the classical communication between the controllers and the QPUs. What is needed here is a method for sending the computation instructions to the QPUs, which can tolerate higher latency, as well as a method for sending low-latency control bits between the QPUs. We explore methods for achieving this in both cases. Clock synchronization is a commonly used scheme in distributed control systems, and we consider an example of architectures using clock synchronization at large scale. Lastly, in the multi-vendor case, we discuss certification steps needed to ensure multiple vendors are able to execute distributed quantum algorithms cooperatively.
Selecting the specific hardware that can execute these protocols is left to future work, as tasks such as entanglement generation are still in a primitive state and may not exist at the scale we need for years to come. Moreover, as qubit technologies improve, the need for as-low-as-possible latency could be loosened and other protocols substituted. Here we explore examples that could potentially achieve what is needed to perform distributed quantum computing.

Classical Communication
In order to run distributed quantum algorithms, there are specific non-local tasks that need to be carried out by the distributed system, such as receiving control commands, sending measurement results to the controller, and sending qubit measurements between the QPUs at low latency to perform non-local control gates. In this section, we explore existing, industry-grade communication protocols which the control system can use to accomplish these tasks.
For the centralized control case, we neglect routing of information and assume each node is connected both classically and quantumly to every other. As discussed, we propose a classical network connecting the QPUs to a centralized controller, forming a "master-slave" relationship, with an additional network of dedicated connections between the QPUs for the sole purpose of low-latency communication. This communication does not go through the CPU of the QPU, but directly between the FPGAs, to perform the non-local gates.
When using a centralized control system, one could consider two options for the slower communication between the QPUs and the controller. The first is to simply connect the CPU portions of the network to the controller using a local area network and communicate from the controller to the QPUs using TCP/IP; in this case, one would need to closely monitor that communication traffic does not overwhelm the network. Another approach, which has this monitoring built in, is to use a protocol common in industrial control systems: the Modbus communication protocol [41]. Modbus is an open protocol used in centralized-controller master-slave settings such as this one. It is a messaging structure that allows heterogeneous devices to communicate with a centralized controller and to receive control messages from it. The Modbus protocol can be run over a local network using TCP/IP, making it easy to install on commonly used Ethernet networks. With Modbus, the controller can send control instructions to the CPU portion of each QPU, which forwards them to the FPGA to perform its portion of the quantum algorithm; the controller can also receive the qubit measurement results from the QPUs once the algorithm is complete.
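The Modbus/TCP wire format is simple enough to build by hand. The sketch below packs a standard "Read Holding Registers" (function code 0x03) request, consisting of the MBAP header (transaction id, protocol id 0, remaining length, unit id) followed by the PDU; using register reads to poll a QPU's CPU is our hypothetical example of how the controller might use it:

```python
import struct

def modbus_read_holding_registers(transaction_id, unit_id, start_addr, count):
    """Build the raw bytes of a Modbus/TCP 'Read Holding Registers' request.
    PDU: function code 0x03, starting register address, register count.
    MBAP: transaction id, protocol id 0x0000, length of the rest, unit id."""
    pdu = struct.pack(">BHH", 0x03, start_addr, count)
    mbap = struct.pack(">HHHB", transaction_id, 0x0000, len(pdu) + 1, unit_id)
    return mbap + pdu

# Transaction 1, unit 0x11, read 3 registers starting at 0x006B.
frame = modbus_read_holding_registers(1, 0x11, 0x006B, 3)
```

In practice one would use an existing Modbus/TCP library rather than hand-packed frames; the sketch only shows how lightweight the controller-to-QPU messaging layer can be.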
For low-latency communication of short messages (< 1 byte) between FPGAs, there are existing methods that achieve ultra-low latency (< 1 ms). In the high-performance computing domain, FPGA networks for ultra-low-latency, high-bandwidth communication have been realized; here we must consider that the FPGAs may be meters apart. Connecting the FPGAs with, for example, 10 Gigabit Ethernet, a custom communication protocol, and small form-factor pluggable (i.e. SFP+) transceivers, a latency of 300 nanoseconds per link is possible [42]. With fibre, the latency can be reduced even further.
Another approach that can be integrated again comes from the industrial control domain. Industrial control, especially in the power sector, faces situations where devices need constant monitoring and reactions to changes must happen at very high speed. One method used is the Mirrored Bits protocol [43], a communication protocol for ultra-low-latency communication that adds a latency of approximately 200 µs for message processing on top of the latency of the communication link itself. Mirrored Bits could be used in this setting to transmit qubit measurement data. The devices that implement the Mirrored Bits protocol, manufactured by Schweitzer Engineering Laboratories, are programmable and can trigger different routines on the FPGA depending on the input bit. These devices are commonly used, for example, to monitor sensor data at high frequency in order to trigger emergency shut-offs as fast as possible.
In the decentralized case, a dedicated wide area network could be established between the vendors such that a link-layer (in the OSI model) protocol is used for the classical control information between the QPUs. Low-latency communication can be achieved using the link-layer protocol called Generic Object Oriented Substation Event (GOOSE), defined in the IEC 61850 standard: an Ethernet protocol meeting the time-sensitive, high-speed communication requirements of automation applications. Experimental results show that GOOSE over an Ethernet network can transmit at the 0.5 ms scale at the link layer [44]. When routing is involved, the latency naturally grows. If TCP/IP protocols over the Internet are considered, it is unlikely one could make any latency guarantees. If instead there are dedicated wide area networks with routing, one could consider the network-layer version of GOOSE, called Routable GOOSE [45]. In [44], R-GOOSE was analysed over a wide area network using a particular data distribution service and was shown to transmit with an average latency of around 8 ms.
Overall, there is much to learn from the power automation industry: many low-latency, fast-reaction systems have been developed there that carry over to distributed quantum computing. These protocols have been tested for robustness and security and could potentially fit well for performing distributed quantum computing over a wide area network.

Clock Synchronization
For centralized control, the assumptions of clock synchronization and full connectivity at small scale are not overly restrictive. High-precision clock synchronization can be achieved even in very large configurations (i.e. comprising a large number of devices) using methods such as the White Rabbit Project [46]. White Rabbit is used at CERN to synchronize over 1000 nodes with sub-nanosecond accuracy. This is achieved over Ethernet with link lengths of up to 10 km, with experiments demonstrating an average skew of 160 ps between clocks in similar environments (with respect to variables such as temperature) after several hours [47]. This protocol can be integrated in the centralized-controller case so that all of the hardware has synchronized clocks. Once the number of nodes in the network becomes large, routing and efficient network topologies become critical.
In the decentralized case, the controllers will need to perform a coarse-grain time synchronization via classical network synchronization protocols, followed by a fine-grain synchronization through which a precise notion of time can be shared. GPS receivers directly connected to the FPGAs can be a solution where shielding does not block the incoming signals. This approach is common in distributed physics experiments such as the Super-Kamiokande detector [48] and the CERN-OPERA experiment [49]. Each node will need to implement extra steps to guarantee that the timing information remains accurate.
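The coarse-grain step can use the standard NTP-style offset estimate; a minimal sketch, assuming a symmetric path delay (the numeric scenario below is our own):

```python
def clock_offset(t1, t2, t3, t4):
    """NTP/Cristian-style offset of the server clock relative to the client:
    t1/t4 are the client's send/receive times, t2/t3 the server's
    receive/send times. Valid under the symmetric-delay assumption."""
    return ((t2 - t1) + (t3 - t4)) / 2

def round_trip_delay(t1, t2, t3, t4):
    """Total network round-trip time, excluding server processing time."""
    return (t4 - t1) - (t3 - t2)

# Client clock 5 ms behind the server, 2 ms one-way latency each direction.
off = clock_offset(t1=100.0, t2=107.0, t3=107.5, t4=104.5)
```

The fine-grain layer (e.g. GPS-disciplined clocks, as mentioned above) then corrects the residual error that this coarse exchange cannot remove.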

Entanglement Generation
To perform the non-local control gates needed in the distributed circuits, the ability to share high-quality entanglement between quantum processors is critical. Entanglement generation has been achieved in various qubit technologies, such as optical photons, NV-centres, and superconducting qubits [50], but entanglement generation in quantum networks remains an ongoing research topic. We consider deterministic entanglement generation schemes, such that a guaranteed entangled pair, shared between the quantum processors, is available whenever it is needed. Experimental results demonstrating deterministic delivery of entanglement using NV-centres in diamond as qubits have been shown in [51], generating heralded entanglement at a rate of 39 Hz, three orders of magnitude better than previously known results, and guaranteeing an entangled pair every 100 milliseconds with fidelity greater than 0.5, without pre- or post-selection. Methods for improving these results further are also proposed. This gives evidence that using entanglement to perform distributed quantum computing can become feasible across various qubit technologies. As entanglement generation and qubit stability technologies improve, the deterministic entanglement generation rate will improve with them.

Customer-Vendor Certification
In the case of a decentralized controller, additional protocols are required to ensure that all parties are able to execute the distributed quantum algorithms and that an execution schedule can be made such that non-local operations are performed synchronously. In this setting, the user has no control over the quantum hardware, so a protocol ensuring that the user's instructions can be executed is needed. In Protocol 9 we give a protocol for creating contracts between vendors and the user to ensure the desired instructions are carried out as specified.

Deltaflow as a Networked Control System
To orchestrate a centralized control system, we propose a scheme that uses Deltaflow.OS as the control. Deltaflow is based on the dataflow programming paradigm described in Section 2.4. The Deltaflow language is a Python-based language that allows users to specify graphs whose nodes and edges represent the dataflow between them. It is a hosted domain-specific language: the nodes are filled with code corresponding to the hardware they represent. Deltaflow.OS is a tool for running Deltaflow programs on a specific piece of hardware. Given a Deltaflow program, it performs compilation steps to transform the program into chunks that run on the native hardware, runs those chunks, and exists in parallel to provide system services, networking and communication abstractions, and time-sharing facilities. Deltaflow.OS provides the same functionality as an OS-plus-compiler combination such as Linux+GCC [52].
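The dataflow style can be illustrated with a minimal toy graph. This is a conceptual sketch only, not the actual Deltaflow API; the node names and message shapes are our assumptions:

```python
class Node:
    """Minimal dataflow node: a body function consumes an incoming message and
    its output is forwarded along all outbound edges."""
    def __init__(self, name, body):
        self.name, self.body = name, body
        self.targets, self.inbox = [], []

class Graph:
    """Toy dataflow graph: nodes hold logic, edges carry messages."""
    def __init__(self):
        self.nodes = {}
    def add(self, name, body):
        self.nodes[name] = Node(name, body)
    def connect(self, src, dst):
        self.nodes[src].targets.append(self.nodes[dst])
    def push(self, name, msg):
        node = self.nodes[name]
        out = node.body(msg)
        for target in node.targets:
            target.inbox.append(out)

g = Graph()
g.add("controller", lambda m: ("schedule", m))   # wraps requests into schedules
g.add("qgn", lambda m: m)                        # quantum-gates node would run them
g.connect("controller", "qgn")
g.push("controller", "run-avqe")
```

In the real system, the node bodies would contain hardware-specific code and the edges would map onto the classical network links described earlier.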

Centralized Control
As discussed, Deltaflow uses nodes in a graph to control the flow of information within a hardware network. The nodes contain the logic of which instructions to perform when an input message is received. In this section, we describe the highest-level nodes and their instruction logic for conducting a distributed α-VQE when there is a single centralized control. Within the high-level nodes there can be further, hardware-specific nodes, which we leave open for future work. In Figure 16, we depict a Deltaflow graph laid on top of the hardware. There are four unique nodes in the DeltaGraph: Controller, Classical Communication, Quantum Gates, and Time Reference. We describe the controller responsibilities and the controlled-node tasks below.

Protocol 9 Contract Creation Protocol (final steps)
7: When all checks pass, respond positively to the user with the latest possible start time of execution, adjusted for the user's system time difference
8: Await start-time confirmation, send an acknowledgement to the user, and await the final acknowledgement from the user
Controller Node: The controller node is the main interface between the user and the distributed quantum hardware. This interaction closely resembles that of a distributed operating system, as discussed in Section 2.3. The controller receives the algorithm parameters from the user; namely, the user sends the Ansatz preparation circuit designed for a single QPU and a Hamiltonian. Once received, the controller node handles the following:
1. Takes the user-specified Ansatz preparation circuit and decomposed Hamiltonian, and confirms the execution parameters.
2. When confirmed, distributes the Hamiltonian across the QPUs and creates a gate execution schedule.
3. Splits the schedule according to the locality of execution, that is, the entire instruction set is not sent to each node, but rather only the parts executed by that particular node.
4. Once all instructions are sent, the controller listens for incoming messages from any node in the DeltaGraph signaling an error. If such an error is received, then the execution process is aborted and the user is informed.
5. Once the scheduled run time has elapsed, the controller listens for measurement results from the QGNs and sends them to the user.
Classical Communication Node (CCN): The CCN handles listening for and sending classical data (information encoded into bits) to other QPUs in the network, and collecting and forwarding classical control information from and to the quantum gates node (QGN) to perform any non-local control operations. The CCN receives the precise times at which to listen and transmit in the compiled instruction set from the controller. At the scheduled time, the CCN communicates with the QGN to collect the measurement results of the entangled qubits needed to perform the non-local gates and transmits the information to the paired QPU. The paired QPU has its own CCN, which will be listening for classical input and will know what to do with the information based on the predetermined instructions. Each CCN communicates with the controller to receive instructions at the beginning of execution. CCNs can also inform the controller of any communication failures so that the controller can abort the run-time process.
Quantum Gates Node (QGN): The QGN controls the quantum hardware, the gate operations performed on the qubits, and their measurement. It communicates with the CCN and the controller. The link with the CCN is used for sending the measurement results of the entangled qubits in the cat-entangler and cat-disentangler steps to the respectively paired QPU. Communication with the controller is needed to receive instructions and to transmit the qubit measurements at the end of the algorithm. Moreover, the QGN conducts its half of an entanglement generation scheme, controlling the particular hardware as per the instructions and the selected entanglement generation protocol.
Time Reference Node (TRN): The TRN is a node used to maintain synchronous time within the DeltaGraph. The duty of the TRN is simply to periodically broadcast the system time so that all nodes in the network can update their own clocks, correcting any clock drift that occurs during run time. Latencies from the TRN to the nodes can be accounted for, improving the overall resolution and precision of the timing reference.
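The latency-compensated broadcast can be illustrated with a few lines: each node overwrites its drifted clock with the broadcast time plus the pre-measured one-way latency of its link to the TRN. The data structures are a sketch, not the actual protocol.

```python
def resync(node_latencies, trn_time):
    """Sketch of the TRN broadcast with latency compensation: a
    node at one-way latency L from the TRN receives the broadcast
    value t and sets its clock to t + L, since the broadcast is
    that old by the time it arrives. Structures are illustrative."""
    return {node: trn_time + latency
            for node, latency in node_latencies.items()}

# Two QPU stacks with known link latencies (in ms) to the TRN.
print(resync({"QPU_A": 0.25, "QPU_B": 0.5}, trn_time=1000.0))
# {'QPU_A': 1000.25, 'QPU_B': 1000.5}
```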

Decentralized Control
In the decentralized controller setting, the DeltaGraph is split between the QPUs in the network. Each QPU has its own DeltaGraph, and the inter-QPU communication is handled by the nodes in the DeltaGraph. In this case, there needs to be an agreed-upon contract between the participating parties, which we have provided in Protocol 9. Here, for simplicity, we assume that there are no malicious parties and no eavesdroppers. Unlike the centralized case in the previous section, each remote QPU has its own controller and time reference node. The DeltaGraph in this case is depicted in Figure 17. The main differences are that the independent time reference nodes allow different clock hardware for each vendor. Different mechanisms could be in place at each vendor to adjust the time, controlled by the respective time reference node. Time reference nodes are connected such that the vendors can perform different pre-selected clock synchronization protocols. The controller nodes' main tasks are interfacing with the user and orchestrating the instructions provided by the user for the vendor's specific hardware. The controller for each vendor interfaces with the user similarly to how it is described in the alternative method of deploying a distributed operating system in Section 2.3, where the user is aware of the network topology and hardware capabilities of each node.
Controller Node: Each controller node has the responsibility of interfacing with the user and distributing the execution instructions to the other nodes. Many of the responsibilities of the controller in this case match those of the centralized controller, but here the controller is only aware of the instructions that execute on the hardware in one QPU stack rather than of the entire instruction set. In the centralized case, it was the duty of the controller to map the Ansatz circuit and schedule the Ansätze onto the distributed quantum hardware. In this case, that duty is moved to the user.
The user queries QPU vendors, gathering the necessary hardware parameters in order to create a schedule that is executed on each vendor's hardware. The benefit of this is that the user has more control over how many qubits they wish to run on each vendor's hardware. Variables such as the cost of execution, the duration of execution, and vendor availability can be integrated into the user's decision when distributing their circuits across multiple vendors. The controller can choose to accept or reject instruction sets depending on the vendor's hardware and availability. We summarize the responsibilities of the controller as follows:
Figure 17: A DeltaGraph for a decentralized control distributed quantum computer.
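The user-side scheduling decision described above can be sketched as a simple assignment routine. The vendor parameters (`capacity`, `cost`, `available`) and the greedy cheapest-first policy are assumptions for illustration; a real user could weigh execution time, cost, and availability differently.

```python
def assign_qubits(n_qubits, vendors):
    """Illustrative user-side scheduling across QPU vendors:
    greedily place the Ansatz qubits on the cheapest available
    vendors, respecting each vendor's qubit capacity. The vendor
    fields stand in for the hardware parameters a user would
    query from each vendor; this is a sketch, not the paper's
    scheduling algorithm."""
    plan, remaining = {}, n_qubits
    for name, v in sorted(vendors.items(), key=lambda kv: kv[1]["cost"]):
        if not v["available"] or remaining == 0:
            continue
        take = min(v["capacity"], remaining)
        plan[name] = take
        remaining -= take
    if remaining:
        raise ValueError("not enough distributed capacity")
    return plan

vendors = {
    "A": {"capacity": 5, "cost": 1.0, "available": True},
    "B": {"capacity": 4, "cost": 0.5, "available": True},
    "C": {"capacity": 8, "cost": 2.0, "available": False},
}
print(assign_qubits(7, vendors))  # cheapest vendor B first, then A
```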

Conclusions and Outlook
In summary, we have explored how the generalized VQE algorithm α-VQE can be distributed across arbitrarily sized quantum computers connected with entanglement and classical communication networks. We proposed various approaches for splitting the Ansatz across the distributed system and provided algorithms for splitting the circuitry needed to perform α-QPE, the central algorithm of α-VQE. Our analysis shows that with this approach, larger Ansatz states can be used on distributed systems at the cost of run time. Next, we explored how one could network together a distributed quantum computer using two different architectures, and we collected the necessary protocols to achieve this. We finished with a network control proposal using Deltaflow.OS, the software-based control system for controlling quantum systems at the various classical and quantum hardware levels. What remains open is to implement this system at a small scale, using both simulation and physical systems. Initial effort has been made in this direction, but more work is needed for a complete proof of concept. Already, an implementation using the quantum network simulator QuNetSim [53] together with Deltaflow has shown that distributed quantum computing can be simulated. Moreover, Deltaflow has been shown to work with distributed circuit boards, a first step towards simulating distributed quantum algorithms. The methods for distributing the Ansatz states could consider more parameters for further optimization, such as the coherence times of the qubits and the location of the control gates in the circuits. We aim to explore this in more detail in future work.
As discussed, distributed quantum computing is a promising path towards developing large-scale quantum computers. Much effort has gone into distributed computing in the classical domain, and the overlap between the fields is high. We can use this knowledge to design robust and secure distributed quantum computers, and as quantum technologies improve, this will surely become a reality.

A.2 Control System Commands
Two-qubit gate G on qubits with memory IDs qID1, qID2, where qID1 is the control if needed. qID1 is potentially a classical register.
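As an illustration, the two-qubit gate command above could be represented as a small data structure, where the control field accepts either a qubit memory ID or a classical register name. The field and class names are assumptions, not the paper's command format.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class TwoQubitGateCommand:
    """Sketch of the two-qubit gate command: qID1 is the control
    when the gate needs one (a qubit memory ID, or a classical
    register when the gate is classically controlled, as in the
    cat-disentangler corrections); qID2 is the target."""
    gate: str
    qid1: Union[int, str]  # qubit memory ID or classical register name
    qid2: int

# A CNOT classically controlled on register "creg_0", targeting qubit 3.
cmd = TwoQubitGateCommand(gate="CNOT", qid1="creg_0", qid2=3)
print(cmd.gate, cmd.qid1, cmd.qid2)
```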