Simultaneous execution of quantum circuits on current and near-future NISQ systems

In the NISQ era, multi-programming of quantum circuits (QCs) helps to improve the throughput of quantum computation. However, crosstalk, a major source of noise on NISQ processors, may degrade the performance of concurrently executed QCs, and its characterization cost grows quadratically in processor size. To address these challenges, we introduce palloq (parallel allocation of QCs), which improves the performance of quantum multi-programming on NISQ processors by paying attention to the combination of QCs executed in parallel and their layout on the quantum processor, and by reducing unwanted interference between QCs caused by crosstalk. We also propose a software-based crosstalk detection protocol that efficiently and successfully characterizes the hardware's suitability for multi-programming. We found a trade-off between the success rate and the execution time of multi-programming. Our approach would be attractive not only to quantum computing service providers but also to users around the world who want to run algorithms of suitable scale on NISQ processors, which have recently attracted great attention and are being actively investigated.


I. INTRODUCTION
Current processors, called Noisy Intermediate-Scale Quantum (NISQ) [6], are not immune to noise, which causes a high error rate and greatly affects the reliability of the computation.
With the advent of cloud quantum computing systems, quantum computing has become familiar to researchers and developers around the world. As the number of users and tasks with various demands and backgrounds increases, it becomes more important to maximize the throughput of NISQ processors. Executing multiple quantum tasks concurrently can be one solution for operating efficiently. However, this method is not trivial and involves fundamental challenges [7]-[10].
It is difficult to explain the whole range of errors of computation on NISQ processors by using only information from standard error characterization techniques such as randomized benchmarking [11], [12].
Though quantum tomography estimates the statistical reliability of a quantum state, complete state tomography requires an amount of data that is exponential in the number of qubits. To maximize the utilization and performance of a NISQ processor, we should take into account not only the standard model of isolated qubits but also the detection model for context-dependent errors, which requires a small cost [13].
In the case of superconducting qubit systems, the effect of crosstalk on gate errors is a serious problem [14], [15]. When multiple quantum circuits (QCs) are executed in parallel, as resource usage of the processor increases, unwanted interference may occur due to crosstalk noise between independent QCs, which may affect the calculation results.
In this work, we introduce palloq (parallel allocation of QCs), a system for improving the performance of multi-programming on NISQ processors. Our system consists of two parts: first, the multiple circuit composer, which searches for an effective combination of the application circuits (a knapsack-like problem); second, the crosstalk-aware layout method, which allocates hardware qubits to multiple QCs taking into account both the local error rate and non-local noise (crosstalk). We also take into account the cost of crosstalk characterization and propose a novel detection method. We show the performance of our system by executing dozens to low hundreds of queued quantum tasks as a multi-programming workload on the real-world cloud quantum computing platform, IBM Quantum Experience, and measure the success rate of the individual QCs and the total execution time.
This paper is organized as follows. In Sec. II, we review the cloud quantum computing environment, quantum multi-programming, and the crosstalk effect on NISQ processors. In Sec. III, we first describe the crosstalk characterization approach in previous work, then introduce our novel characterization method. In Sec. IV, we propose compilation methods for efficient multi-programming. In Sec. V, we show the experimental results on quantum processors and evaluate our proposed approach. Finally, we conclude the paper in Sec. VI.

II. BACKGROUND

A. CLOUD QUANTUM COMPUTING
With the advent of cloud quantum systems, i.e., quantum computing as a service (QCaaS) [16]-[22], many researchers and developers from a variety of domains are becoming quantum users. QCaaS provides the quantum resource that allows opportunities ranging from conducting basic experiments [23] to developing applications that include quantum simulation, quantum machine learning, and optimization [24].
Cloud quantum computing architectures consist of components that include Quantum Processors, Analog Digital control, Real-time control systems, WEB servers and classical databases, the Internet, and End users, as in Fig. 1.
Why use a cloud quantum system instead of a local server or a quantum laptop? In general, there are several reasons for the migration of services from desktops and corporate server rooms to a cloud platform [25], [26]. For individual users, total control of the software, OS, and low-level utilities, and of subsequent revisions to other programs, comes with a price. For service providers, an internet-based service can be developed, tested, and operated on the platform of the provider's choice instead of coping with various users' environments. In the case of quantum computation, the development and operation of a quantum computer, which includes daily calibration, are very expensive and specialized tasks [27], [28]. Languages, tools, and environments for the development of quantum programs are still not sufficient. By providing them comprehensively as cloud services, users can utilize the quantum resources without being bothered by maintenance.
The rapid increase in users, urgent access for limited quantum resources, and the number of queued jobs are becoming serious issues.

B. PERFORMANCE OF QUANTUM COMPUTER
Several metrics for performance analysis of quantum computers have been proposed [29]-[31]. Quantum computing performance is governed by three factors: 1). the size of the problem that can be encoded, which is determined by the number of physical qubits on a processor, 2). the size of a quantum circuit that can be faithfully executed, which is mainly determined by the error rates of each operation and the lifetime of qubits, and 3). the number of circuits that can be executed per unit time, which is related to the quantum and classical processing speed. In this paper, we focus on improving factor 2, the output fidelity of QCs, and factor 3, the number of QCs executed at a time, by quantum multi-programming.

C. QUANTUM MULTI-PROGRAMMING
Quantum multi-programming is a method for improving the throughput and utilization of the NISQ processor by executing multiple QCs simultaneously, instead of keeping the unemployed qubits idle, as shown in Fig. 2. In previous work, several challenges were discussed as follows [7], [8]: 1). Fair hardware resource allocation for every individual task. The difficulty of this issue comes from the variation in the characteristics of each physical qubit in the processor, including operational error rates and qubit lifetimes [32]. To solve this, the compiler needs to take this error information into account to optimize the circuit. 2). Avoidance of unwanted interference between the individual QCs. 3). Optimization of the operational timing of each circuit to minimize unnecessary decoherence. When multi-programming several QCs with different depths (durations of execution), the shorter circuits must wait until the longer circuit's operations end, which may cause the quantum state prepared by the shorter circuits to decay and reduce the output reliability.
Improving the utilization of the processor by executing multiple programs concurrently can increase the unwanted interference between independent QCs. To reduce serious destructive interference, one option is to monitor and compare the performance of parallel execution and to feed the result into the next execution phase, either single or in parallel [7]. In contrast, Ohkura [9], and Dou and Liu [10] focused directly on the crosstalk noise on the device, which causes non-local errors on QCs in multi-programming. They tried to characterize the crosstalk in the processor and optimize qubit allocation accordingly. The problem is that the cost of crosstalk characterization grows quadratically in the number of hardware qubits, as we discuss in Sec. III.

D. CROSSTALK IN NISQ PROCESSOR
Crosstalk is known to be a significant source of noise in the quantum processor. This type of error can be explained from several aspects, but it is simply the unwanted interaction between coupled qubits in the processor. It is known that there is a trade-off between the strength of qubit interaction and the magnitude of unwanted crosstalk noise [33], [34]. One type of crosstalk is caused by simultaneous operations between specific pairs of qubits. In this paper, we focus on the unwanted interaction due to two-qubit (CX gate) operations. This type of crosstalk is known to occur in current quantum architectures, including superconducting systems and trapped ions [35], [36].
The tuning and mitigation of crosstalk become major challenges when developing larger processors [34], [37], [38]. Several software approaches to reducing crosstalk error were introduced in previous work. In the case of tunable quantum processors, including Google's architecture [39], we can tune qubit frequencies or control specific couplers to disable and shut down the leakage errors [14], [40]. In contrast, in fixed-frequency qubit systems, including the IBM Q System, we can optimize the circuit scheduler to avoid concurrent execution of correlated qubits in the processor [41]. In this paper, we focus on this fixed-frequency qubit system and provide a solution by a novel layout method in the circuit compilation process.

III. CROSSTALK CHARACTERIZATION
To understand the performance of quantum processors affected by local and non-local errors, and their ability to concurrently execute multiple QCs, we need to characterize crosstalk in the processor. To simplify the problem, we only take into account how large the average crosstalk on the processor is rather than where it occurs, i.e., the location of the qubits.

A. THE COST OF CHARACTERIZATION
The complexity of crosstalk characterization often scales exponentially with the system size. Recently, several works showed ways to suppress the cost of crosstalk characterization. One practical protocol introduced in [10], [41] is the comparison of the error rate in the case of individual and parallel execution by using Simultaneous Randomized Benchmarking (SimRB). For example, in Fig. 3, some pairs of two-qubit errors can be detected in parallel, e.g., $(q_i, q_j)$ and $(q_k, q_l)$. If the error rate from SimRB $(q_i, q_j)|(q_k, q_l)$ is significantly different from that of the individual RB $(q_i, q_j)$, there is an unwanted correlation between them.
The number of combinations of these two-qubit pairs grows quadratically in the size of the processor. To avoid this situation, in a previous study [41], Murali et al. provide some rules to reduce this overhead. 1). Characterize one-hop pairs. Through experiments, they found that crosstalk tends to occur only at one hop on the IBM Q system they used. This rule suppresses the detection cost by ignoring the pairs more than one hop apart. They also pointed out that some older devices have long-range crosstalk strong enough to be a concern. 2). Characterize high-crosstalk pairs only. They also found that the existence of crosstalk tends to be stable in time and space.
Although this method can detect crosstalk distribution on the processors, it still takes several hours, and it may be impractical in the current situation because the size of processors is continuously getting bigger, and these experiments are queued and run on the cloud system in the presence of other participants.
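To make the cost argument concrete, the following is a minimal sketch that counts how many SimRB pair experiments a coupling map would require, with and without the one-hop rule. The helper names (`bfs_distances`, `simrb_pairs`), the 10-qubit chain, and the definition of hop distance between two CX pairs as the smallest distance between their endpoints are assumptions made for illustration; they are not taken from [41] or from palloq.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Shortest hop count from `source` to every reachable qubit."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def simrb_pairs(edges, max_hops=None):
    """List pairs of CX edges that could be benchmarked simultaneously.

    Edges sharing a qubit are skipped. If `max_hops` is given, only pairs
    whose closest endpoints are at most `max_hops` apart are kept
    (an assumed reading of the one-hop rule).
    """
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {q: bfs_distances(adj, q) for q in adj}
    pairs = []
    for e1, e2 in combinations(edges, 2):
        if set(e1) & set(e2):
            continue  # the two gates share a qubit
        gap = min(dist[a].get(b, float("inf")) for a in e1 for b in e2)
        if max_hops is None or gap <= max_hops:
            pairs.append((e1, e2))
    return pairs


# Hypothetical 10-qubit chain: 0-1-2-...-9
chain = [(i, i + 1) for i in range(9)]
all_pairs = simrb_pairs(chain)
one_hop_pairs = simrb_pairs(chain, max_hops=1)
print(len(all_pairs), len(one_hop_pairs))
```

On this chain the exhaustive count grows with the square of the number of edges, while the one-hop count grows only linearly, which is the saving the first rule aims for.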

B. PHYSICAL BUFFER AND SUCCESS RATE
In this section, we introduce a physical buffer, which is the number of idle qubits between QCs, to mitigate the crosstalk effect in the concurrent execution of multiple QCs. We conducted a preliminary experiment to quantify how the physical buffer affects the output reliability of individual tasks in concurrent execution. We used the Toffoli gate as the benchmark. After compilation, which includes qubit routing along the topology of the processor and a SWAP gate consisting of three CX gates, the Toffoli gate shown in Fig. 4 contains a total of 10 CX gates.

FIGURE 4: Toffoli gate on chain topology
We varied the size of the physical buffer between circuits and the number of Toffoli gates as parameters to see how the success rates change, as shown in Fig. 5. Fig. 6 shows that the success rate drops significantly only in the case of adjacent circuits, i.e., no physical buffer. As the buffer size increases, the success rate recovers. This change appears only for circuits of more than 20 CX gates. This leads us to the following insights. For concurrent execution in practice, 1). we can ignore crosstalk for the shorter circuits, and 2). there is a threshold number of hops of physical buffer that can improve the output fidelity. In this case, a physical buffer of 1 is enough until the CX circuit depth reaches 30.

C. BENCHMARK OF CROSSTALK IMMUNITY
Although crosstalk is the major source of error and may decrease the reliability of concurrent execution, the cost to characterize the noise increases quadratically with the number of qubits in the processor. We show a novel detection protocol combining RB methods with relatively low detection costs. To analyze the crosstalk tolerance of processors, we utilized the coefficient of variation of gate errors as the metric and compared several processors.
We focus on how the crosstalk impacts the average and variance of error rates. First, we apply RB for every qubit or two-qubit pair in the processor and calculate the average and variance of error rates in the single execution case. Then, we run SimRB for all the qubits at the same time and also calculate the average and variance. Comparison of the average and variance of error rates between those two cases leads to the quantitative analysis of crosstalk effects on the whole performance and immunity of the processor.
We conducted this benchmark on several current IBM quantum processors and show the comparison of their performance in Fig. 7. We measured the CX gate error rate of each processor by utilizing RB. Blue box plots represent the distribution of CX gate error rates, and black dots are the error values of each physical two-qubit connection. The orange box plots also represent error rates but in the case of concurrent execution (SimRB). We ran several patterns of CX gate combinations that can be executed concurrently and took the average value of the error rates of each case.
For all processors, the variance and average error rates increase in the case of concurrent execution, indicating the presence of crosstalk noise. In particular, IBMQ Toronto, IBMQ Sydney, and IBMQ Manhattan show significant interference from operations performed on other regions. The performance of multi-programming directly depends on this crosstalk interference, as we discuss in Sec. V.

IV. THE PALLOQ SYSTEM
We propose palloq, a system including layout synthesis for multiple QCs and a job scheduler to manage efficient and high-fidelity quantum multi-programming. The details of the procedure are explained in the Pseudocode appendix. We published the source code at https://github.com/rum-yasuhiro/palloq. palloq is responsible for the compilation phase and the user's job management, which are provided by the WEB server in the cloud computing architecture shown in Fig. 1.
This compiler pass takes several QCs written in OpenQASM [42] and the local gate error information of the device as input.
Our layout synthesis consists of a heuristic based on the Noise-Adaptive layout, which analyzes the device's calibration data and searches for a better allocation using a greedy approach [43]. First, it parses the calibration data and hardware qubit connections to create a weighted graph $G_{HW}(V, E)$. The vertices $V$ represent the physical qubits, the edges $E$ represent the two-qubit connections of the hardware, and the weight of each edge is the reliability $r = 1 - \epsilon$, where $\epsilon$ is the error rate of the two-qubit operation between the physical qubits. In the same way, each input QC is treated as a weighted directed graph $G_{QC}(V, E)$, where the vertices $V$ are the qubits in the QC, the edges $E$ are two-qubit gates, and the weight is the number of two-qubit gates performed on the same two-qubit pair. The compiler heuristically searches for the most reliable physical qubit candidates and allocates them to the highest-weighted edges in the circuit's graph. This procedure is repeated until all the circuit components are placed.
As we discussed in Sec. III-B, for the shorter circuits we ignore crosstalk, and for the relatively larger circuits we only consider the physical distance between circuits and optimize them locally. Our software takes the physical qubit distance as input. Each time a circuit is allocated to hardware qubits, the qubits around it are disabled to keep a distance from the other circuits.
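The interplay between the reliability weights and the physical buffer can be sketched as follows. This is a deliberately simplified illustration and not the palloq implementation: it only places two-qubit circuits (one hardware edge each), whereas the method described above maps arbitrary circuit graphs. The function names and the CX error values are hypothetical.

```python
def reliability_graph(cx_errors):
    """Weight each hardware edge by its reliability r = 1 - error rate."""
    return {edge: 1.0 - err for edge, err in cx_errors.items()}


def qubits_within(adj, allocated, hops):
    """Qubits at most `hops` away from any allocated qubit."""
    frontier = set(allocated)
    seen = set(allocated)
    for _ in range(hops):
        frontier = {n for q in frontier for n in adj[q]} - seen
        seen |= frontier
    return seen - set(allocated)


def place_two_qubit_circuits(cx_errors, n_circuits, buffer):
    """Greedily give each two-qubit circuit the most reliable free edge,
    then disable that edge's qubits and the buffer qubits around it."""
    weights = reliability_graph(cx_errors)
    adj = {}
    for a, b in weights:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    blocked = set()
    layout = []
    for _ in range(n_circuits):
        free = [e for e in weights if not (set(e) & blocked)]
        if not free:
            break  # no room left; remaining circuits wait for the next round
        best = max(free, key=weights.get)
        layout.append(best)
        blocked |= set(best) | qubits_within(adj, best, buffer)
    return layout


# Hypothetical CX error rates on a 7-qubit chain
errors = {(0, 1): 0.010, (1, 2): 0.020, (2, 3): 0.030,
          (3, 4): 0.015, (4, 5): 0.020, (5, 6): 0.012}
dense = place_two_qubit_circuits(errors, n_circuits=3, buffer=0)
sparse = place_two_qubit_circuits(errors, n_circuits=3, buffer=1)
print(dense, sparse)
```

With no buffer, three circuits fit on the chain; with a buffer of one qubit, only two fit and the third would be deferred, which mirrors the throughput cost of a larger buffer.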

V. EXPERIMENTS AND EVALUATION
To evaluate our proposal, we conducted an experiment varying the physical buffer among the multiple circuits. Throughout the experiment, we focus on the output reliability, total execution time, and hardware usage.
We use small benchmark circuits from previous work [44], as shown in Tab. 1. The details of the processors and the software we used are shown in Tab. 3.

A. METRICS
To quantify the performance of our proposed method, we utilized the Probability of Successful Trial (PST) [32]. For NISQ systems that run the given programs a specified number of times, known as "shots", the quantum computation outputs the answer as a probability distribution of candidate bit strings. The number of successful trials is defined as how many of the shots hit the correct bit strings. We chose benchmark circuits whose computational results include only one or a small number of answers in the solution space for ease of handling with PST. When the QC finishes successfully without any error, PST is 1.
As a baseline, we compare the results from a noisy quantum device to a classical simulator as a noiseless case. PST is defined as follows:
$$\mathrm{PST} = \frac{\text{Number of Successful Trials}}{\text{Number of Total Trials}} \tag{1}$$
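As a small worked example of Eq. (1), the sketch below computes PST from a dictionary of measured bit-string counts. The function name `pst`, the counts, and the choice of "111" as the correct answer are hypothetical values chosen for illustration.

```python
def pst(counts, correct):
    """Probability of Successful Trial: shots that hit a correct
    bit string divided by the total number of shots."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no shots recorded")
    hits = sum(n for bits, n in counts.items() if bits in correct)
    return hits / total


# Hypothetical result of 1024 shots where "111" is the only correct answer
counts = {"111": 820, "110": 90, "011": 60, "000": 54}
value = pst(counts, {"111"})
print(round(value, 3))
```

Passing a set of correct bit strings covers the benchmark circuits whose solution space contains a small number of answers rather than exactly one.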

B. EXECUTION DURATION AND OUTPUT RELIABILITY
First, we prepared 100 QCs from the benchmark set and queued them as an input to our compiler. Varying the physical qubit buffer from 0 to 3, we measured the output PST and the total execution time of concurrent execution. Fig. 8 shows how the PST of the 100 circuits varies with the physical buffer. For all quantum devices, with a buffer of more than one, the average PST is better than with a buffer of zero, which is the densest layout and the highest-throughput case. This means that, in these experiments, we obtain higher reliability of computation at the expense of the throughput of concurrent execution.
Fig. 9 shows the total execution time of the 100 circuits as we vary the physical buffer. For the execution time, we sum the circuit duration (in units of dt, where 1 dt = 2/9 ns) over all concurrent execution rounds. It shows that a larger buffer reduces hardware usage and increases the total execution time linearly. For all devices, around 80% of physical qubits are used in the case of buffer 0. In the case of buffers 2 and 3, the hardware usage is only half of the densest case. On the other hand, for all devices, the total circuit duration time increased by a factor of two from the densest to the sparsest layout.
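For readers who want to convert durations, the sketch below turns a duration in dt into microseconds and computes the number of circuits executed per dt. It assumes 1 dt = 2/9 ns (about 0.222 ns); the duration values and function names are hypothetical and are not measurements from the experiments.

```python
from fractions import Fraction

DT_NS = Fraction(2, 9)  # assumed duration of one dt in nanoseconds


def dt_to_microseconds(duration_dt):
    """Convert a duration expressed in dt into microseconds."""
    return float(duration_dt * DT_NS) / 1000.0


def throughput_per_dt(n_circuits, total_duration_dt):
    """Number of executed circuits per dt."""
    if total_duration_dt <= 0:
        raise ValueError("total duration must be positive")
    return n_circuits / total_duration_dt


# Hypothetical totals for a dense and a sparse layout of 100 circuits
dense_dt, sparse_dt = 900_000, 1_800_000
print(dt_to_microseconds(dense_dt), dt_to_microseconds(sparse_dt))
print(throughput_per_dt(100, dense_dt) / throughput_per_dt(100, sparse_dt))
```

Doubling the total duration for the same 100 circuits halves the throughput, which is the size of the effect reported between the densest and sparsest layouts.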

C. ANALYSIS
Here we describe the trade-off between crosstalk and throughput of concurrent execution. In general, we desire successful computation with higher reliability and throughput. The success rate of concurrent computation depends on the gate fidelity (local errors) and crosstalk. The throughput, which is defined as the number of executed QCs per dt, where dt is the duration time of quantum computation, relies on many factors such as gate duration time, measurement duration time, and the size of processors. Based on the experimental results shown in Fig. 8, we defined the improvement (gain) of output reliability $g$ in terms of $\mathrm{PST}_i$ and $\mathrm{PST}_j$, where $\mathrm{PST}_i$ is the average PST of the concurrent-execution layout with a physical qubit buffer of $i$. In this experiment, for all cases, $i = 0$ and $j$ is 2 or 3. Based on the result shown in Fig. 7, we defined the crosstalk presence in the processor, $ct$, utilizing the coefficient of variation (CV), $\mathrm{CV} = \sigma / \mu$, where $\sigma$ is the standard deviation and $\mu$ is the average of the given distribution. In this case, the crosstalk presence $ct$ represents the degree of change in the variation of the CX error rate from RB to SimRB. Fig. 10 shows the relationship between the presence of crosstalk on the device and the gain of improvement by the physical buffer.
We used three 27-qubit devices and two 65-qubit devices, and in both cases we can see a positive correlation. Referring to Fig. 9, as we increase the physical qubit buffer, the throughput of computation drops. For processors with relatively strong crosstalk, it is worth reducing the throughput to gain improved output fidelity of the computation.
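The coefficient of variation used above can be computed as in the following sketch. The error-rate lists are hypothetical, the use of the population standard deviation is an assumption, and the final ratio is only one possible way to express a change in variation from RB to SimRB; it should not be read as the definition of $ct$ used in the analysis.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """CV = sigma / mu, using the population standard deviation (assumed)."""
    mu = mean(values)
    if mu == 0:
        raise ValueError("mean must be non-zero")
    return pstdev(values) / mu


# Hypothetical CX error rates for the same edges under RB and SimRB
rb_errors = [0.010, 0.012, 0.011, 0.013, 0.009]
simrb_errors = [0.012, 0.025, 0.013, 0.030, 0.010]

cv_rb = coefficient_of_variation(rb_errors)
cv_simrb = coefficient_of_variation(simrb_errors)

# Illustrative summary only: how much the spread grows under SimRB
cv_change = cv_simrb / cv_rb
print(cv_rb, cv_simrb, cv_change)
```

Because CV normalizes the spread by the mean, it allows processors with different baseline error rates to be compared on the same scale.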

VI. CONCLUSION
This paper proposed and evaluated a compiler method for concurrent execution of multiple quantum circuits, which includes layout and scheduling. First, we showed a practical crosstalk characterization method that reduces the detection cost, which is critical for near-term-scale quantum processors. In the evaluation, we showed that our compiler efficiently processes multiple quantum circuits while avoiding crosstalk, and that there is a trade-off between the success rate of quantum circuits and the throughput of the processors. Taking crosstalk noise into account is also meaningful not only for improving the performance of multi-programming but also when considering the security of future cloud quantum computation.
VOLUME 4, 2021

APPENDIX. SETUP FOR EXPERIMENTS
The quantum circuits used to benchmark our proposed software are listed in Tab. 1. The processors we use are listed in Tab. 2, and the versions of the software packages we use are listed in Tab. 3. Each experiment was conducted on the dates listed in Tab. 4.

APPENDIX. PSEUDOCODE
Here we show the pseudocode of the algorithm we introduced in Sec. IV.

FIGURE 1: Cloud Quantum Computing Architecture. Various implementations of quantum processors have been proposed [1]-[5]. As the core of the service, the quantum processor provides its computing resources by performing quantum operations and measurements. Digital Analog control exchanges the quantum and classical information by converting program instructions into analog signals and measurement results into classical data. The real-time control system is responsible for classical-quantum interaction, mid-circuit measurements, and feed-forward operations. The other end of cloud computing is the browser-based user interface. End users create requests (jobs) on a web browser, send them to the system via a web server through the Internet, and receive the results of the computation. The browser-based user interface provides AAA (authentication, authorization, accountability) and, in some cases, a quantum programming tool and its development environment, such as a GUI-based quantum circuit composer.

FIGURE 2: The idea of multiple execution. The lattice graph represents the quantum processor; nodes denote qubits and edges are two-qubit connections. The QC with the green box represents quantum circuits placed on physical qubits of the processor. Fig. 2a represents single quantum circuit execution. Because only circuits of limited size are tolerable on a NISQ system, the idle qubits (yellow circles) reduce the throughput of the processor. Fig. 2b describes the idea of executing multiple circuits concurrently. This approach reduces idle qubits and is expected to increase the throughput.

FIGURE 3: Simultaneous Randomized Benchmarking (SimRB). The upper diagram shows ordinary RB on two qubits. RB applies random Clifford operations with varying gate sequence lengths and estimates the error rate. Simultaneous RB applies RB on two or more hardware areas at once and compares the error rates to the single RB case to measure the conditional error rates of hardware qubits. In this research, we only conducted two-qubit RB and SimRB.

FIGURE 5: Multiple Toffoli placement varying physical buffer. The graph denotes the quantum processor; nodes are qubits and edges are two-qubit connections of the superconducting qubit system. The blue boxes represent the quantum circuits of the Toffoli operation placed on physical qubits. We placed 5 circuits and varied the physical buffer, then measured the success rate of the Toffoli placed in the center.

FIGURE 6: Physical buffer of multiple circuits and success rate. Each line denotes the Probability of Successful Trial (PST, Eq. (1)) of Toffoli gates in the case of concurrent execution. The horizontal axis is the physical qubit buffer introduced in Fig. 5. We compared the PST to the leftmost data point, which is the single execution case. Each color describes the number of CX gates contained in the operation: 10, 20, and 30. Since we utilize the Toffoli gates with SWAP (Fig. 4), one Toffoli gate contains 10 CX gates.

FIGURE 8: Success rate and the physical buffer. These box plots represent the distribution of PST of benchmark circuits executed on the real quantum devices.

FIGURE 10: Crosstalk presence and improvement of concurrent execution.
Crosstalk immunity. Each black dot represents the CX gate error rate of a two-qubit pair. (i, j) denotes the label of a two-qubit pair with a remarkably high error rate.

TABLE 3: Software version