Quantum Volume in Practice: What Users Can Expect from NISQ Devices

Quantum volume (QV) has become the de facto standard benchmark to quantify the capability of Noisy Intermediate-Scale Quantum (NISQ) devices. While QV values are often reported by NISQ providers for their systems, we perform our own series of QV calculations on 24 NISQ devices currently offered by IBM Q, IonQ, Rigetti, Oxford Quantum Circuits, and Quantinuum (formerly Honeywell). Our approach characterizes the performance that an advanced user of these NISQ devices can expect to achieve with a reasonable amount of optimization, but without white-box access to the device. In particular, we compile QV circuits to standard gate sets of the vendor using compiler optimization routines where available, and we perform experiments across different qubit subsets. We find that running QV tests requires significant compilation effort, and that the QV values achieved in our tests typically lag behind officially reported results and also depend significantly on the classical compilation effort invested.


Introduction
Quantum volume (QV) [1] has been designed as a benchmark measure for Noisy Intermediate-Scale Quantum (NISQ) devices. Informally speaking, a NISQ backend that has passed a QV protocol test of 2^m will largely correctly execute any quantum circuit on m qubits consisting of m layers of random two-qubit gates applied across those qubits, thus giving users of the device a good guideline as to what circuit sizes and depths appear reasonable to run on the device. We will give the more formal definition later. In this research, we aim to characterize the QV values of different NISQ backends as they would likely be experienced by regular, albeit somewhat sophisticated, users of these systems.
QV has been defined to include the compilation from an abstract circuit representation to the hardware connectivity and native gateset of NISQ devices. This is a necessary part of such a benchmarking definition; however, it also means that heavy circuit compiler optimization can be a very large factor impacting the computed QV measure. Our QV testing approach thus starts from a compiler-agnostic perspective. Specifically, the circuits we send to the different backends are initially compiled using the Qiskit [2] transpiler when required, and otherwise submitted directly to the backend. However, once the circuits are sent to the backend, further circuit optimization may occur. A user of NISQ [3,4] Quantum Processing Units (QPUs) will generally not have the tools, inclination, expertise, or time to perform heavy circuit compilation. Instead, they will use available open source software. This is why we use the Qiskit [2] transpiler as an initial basis for comparison (not all systems allow the user to directly handle compilation of the quantum circuits).
Our main findings are: 1. Preparation of QV circuits can be remarkably time-intensive depending on the system and is not well standardized across different providers, even though OpenQASM [5] is used by several vendors; there is no standardized way across vendors to turn a logical QV circuit, expressed in a standard circuit description language, into an optimized, device-specific instruction sequence.
2. The qubit mapping and routing problem that quantum compilers address is quite difficult [6,7]. The compilation toolchain within the software ecosystem and the backend itself greatly impacts circuit execution quality. Therefore, comparisons between QV values should also take into account the compilation method that was used. In particular, we find that using IBM Q's custom QV compiler for IBM Q devices indeed improves QV values over a standard Qiskit compilation method. 3. QV values achieved by users typically lag officially reported values, and in some cases the QV values of backends are not reported. The highest QV values we measured for each vendor are IBM at 16 using default transpilation and 32 when using more optimized compilation, Quantinuum at 512 (a lower bound), IonQ Harmony at 8, Rigetti at 8, and Oxford Quantum Circuits failing to pass the smallest test (n = 2). These results are consistent with the backend error rates. 4. Providing a qubit subset choice for the logical-to-physical qubit mapping allows us to characterize the regions of the backend (i.e. the qubits and two-qubit gate connections) that give the highest fidelity QV results. Not all qubits and connections on a NISQ device are of the same quality; thus even if a device passes a QV test, such success often relies on selecting a good qubit subset, which is not trivial to identify. This fact slightly compromises the original intuitive appeal of the QV measure: passing a QV 2^m test does not necessarily imply that the device will generally handle any circuit of depth and width m well, because a user may not start with a good qubit subset choice.
Overall, we find that hardware vendors have made great progress in the past five years in their device quality. While even advanced users may not quite reach the officially reported QV values, they can expect to lag only a small factor behind. Nevertheless, our findings also point to the need for advancing compilation tools for quantum circuit optimization in any practical application: while we have had the opportunity to run an extensive amount of QV test runs, most quantum computing users should not have to go through such intense optimization procedures to test their algorithms.
This article is structured as follows. Following a literature review on the current state of quantum volume research, in Section 2 we summarize the methods and backends we will test. In particular, we differentiate the tests between black-box execution on all backends, and then more customized compilation and execution on the IBM Q backends, including connected subgraph compilation and more heavily optimized compilation from a toolkit provided by IBM Q. Next, we present the results of this analysis across all 24 NISQ devices in Section 3. We conclude with a discussion in Section 4 about what these results show about the state of NISQ computers and the QV metric. The QV circuits as well as the corresponding measurement counts are publicly available [8].

Literature review
Quantum Volume was proposed as a near-term metric for modern quantum computers that encompasses all aspects of the computational ability of the QPU including connectivity, qubit number, compiler software, and error rates [1]. Following this, more advanced compiler and routing techniques [7] and Qiskit Pulse optimization [9] allowed measurement of QV 64 on some IBM Q systems [10]; we use this advanced implementation for some of our experiments.
There are several other NISQ hardware independent benchmarks similar to quantum volume that have been proposed, including mirror circuits [15], application oriented benchmarks [16,17], volumetric benchmarks [18] which generalize the QV benchmark, and machine learning motivated metrics such as the proposed q-BAS score [19]. Using six scalable application oriented benchmarks, numerous devices (across IBM Q, Rigetti, and IonQ) have been benchmarked [20]. As another NISQ benchmark, Atos has proposed a metric called Q-score as an application relevant benchmark [21]. One of the other key computation metrics that needs to be established for NISQ computers, as it was for classical computers, is a notion of speed. To this end, Circuit Layer Operations per Second (CLOPS) has been proposed as a viable NISQ computer speed metric [22].
A recent interesting topic is applying error mitigation to the quantum volume protocol in order to demonstrate an increased effective quantum volume [23].

Methods
A QV circuit consists of sequences of random qubit index permutations, each followed by a layer of random two-qubit unitaries drawn from the special unitary group of degree 4 (SU(4)). Compiling layers of SU(4) gates with arbitrary connectivity between qubits is especially challenging for compilers [25], in large part due to the hardness of circuit mapping and qubit routing [26] on restricted connectivities. As given in [1], a generic square QV circuit with depth d and width m is a sequence of d circuit layers:

    U = U^(d) · · · U^(2) U^(1),
where each layer U^(t), indexed by t ranging from 1 to d, has the form

    U^(t) = U^(t,1)_{π_t(1),π_t(2)} ⊗ · · · ⊗ U^(t,m′)_{π_t(2m′−1),π_t(2m′)},

acting on m′ := ⌊m/2⌋ pairs of qubits (meaning that if m is odd then one qubit in each layer will be idle). Each layer is created by choosing a uniform random permutation π_t of the m qubit indices, and then applying the two-qubit unitary U^(t,i)_{π_t(2i−1),π_t(2i)} ∈ SU(4) to qubits π_t(2i−1) and π_t(2i) for each of the m′ pairs of qubits used in the layer. Figure 1 is an example diagram of the logical structure of a quantum volume circuit.
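As an illustration, the layer structure above can be sketched in a few lines of Python with NumPy. The helper names (`random_su4`, `qv_layer`) are ours, not from the paper's codebase; each layer pairs up a random permutation of the qubit indices and assigns a Haar-random SU(4) gate to every resulting pair:

```python
import numpy as np

def random_su4(rng):
    """Haar-random 4x4 unitary via the QR trick, phase-fixed to land in SU(4)."""
    z = (rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))) / np.sqrt(2)
    q, r = np.linalg.qr(z)
    q = q @ np.diag(np.diag(r) / np.abs(np.diag(r)))  # correct phases for Haar measure
    return q / np.linalg.det(q) ** 0.25               # rescale so det = 1

def qv_layer(m, rng):
    """One QV layer: permute the m qubit indices, then place an SU(4) gate
    on each of the floor(m/2) resulting pairs (one idle qubit if m is odd)."""
    perm = rng.permutation(m)
    pairs = [(int(perm[2 * i]), int(perm[2 * i + 1])) for i in range(m // 2)]
    return [(pair, random_su4(rng)) for pair in pairs]

layer = qv_layer(5, np.random.default_rng(0))  # m = 5: two SU(4) gates, one idle qubit
```

Compiling these dense SU(4) blocks onto a restricted native gateset and connectivity is exactly the step that makes QV circuits hard for transpilers.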
For a given QV circuit, the relevant question is how well the quantum device implemented and executed the circuit [1]. To this end the QV protocol uses the heavy output generation problem [27]. Given a QV circuit U acting on m qubits, its ideal output distribution is

    p_U(x) = |⟨x|U|0^m⟩|^2,

where x is an output bitstring of length m. The central idea of the heavy output generation problem is to partition all possible observable bitstrings into two equal-sized sets: one with lower ideal output probability, and one with higher ideal output probability (the set of heavy outputs). The quantum hardware implementation of U is then considered successful if more than 2/3 of the measured output bitstrings fall into the heavy output set. More formally, given the full ideal probability distribution p_U(x), we sort the probabilities such that p_1 ≤ p_2 ≤ · · · ≤ p_{2^m}. We can then partition according to the median p_median of these probabilities in order to obtain the heavy output set of bitstrings for U:

    H_U = {x : p_U(x) > p_median}.

The heavy output probability (HOP) for a quantum circuit U implemented on a backend is then defined as the number of heavy bitstrings found in the measured distribution (i.e. the number of samples falling in H_U) divided by the total number of samples taken on the backend. This measurement is repeated for multiple QV circuits (say k distinct QV circuits, each with its own random permutations and random seeds) in order to determine whether the quantum backend in question can reliably sample heavy outputs with probability greater than 2/3. In the limit of many QV circuits (and for large circuit size m and depth d), the expected mean HOP approaches (1 + ln 2)/2 ≈ 0.85 [1,27]. For k distinct QV circuits, each circuit QV_i has a measured heavy output probability denoted HOP(QV_i). Treating the HOP outcome as a binomial distribution (i.e.
either the backend passes 2/3 or it fails to pass 2/3), over many circuits this can be approximated as a normal distribution. Using this approximation we can compute confidence intervals on the resulting distribution: Equations (5) and (6) give the formulas for the mean HOP and the standard deviation of the distribution, and Equations (7) and (8) give the formulas for the 99% confidence interval. These statistical tests are important because they show when a particular distribution of heavy output probabilities is above the 2/3 threshold with a high confidence level.

Figure 2: Impact of compilers: starting from the original QV circuit (top), defined in Qiskit using u3 and cx gates, we compile to backends with different connectivities, gatesets, and software. We use a simple QV = 8 (n = 3) circuit for illustration purposes. Note that the backend compiled circuits for IonQ and Quantinuum are not shown; these backends do not currently support returning the backend executed circuits to users.
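The heavy-set construction and HOP estimate follow directly from the definitions above. The sketch below (our own helper names, with made-up toy numbers rather than real device data) computes the heavy output set from an ideal distribution and then the HOP of a measured counts dictionary:

```python
import numpy as np

def heavy_set(ideal_probs):
    """Indices of bitstrings whose ideal probability exceeds the median."""
    p_median = np.median(ideal_probs)
    return {x for x, p in enumerate(ideal_probs) if p > p_median}

def hop(ideal_probs, counts):
    """Heavy output probability: fraction of measured shots landing in the heavy set."""
    heavy = heavy_set(ideal_probs)
    shots = sum(counts.values())
    return sum(c for bitstring, c in counts.items() if bitstring in heavy) / shots

# Toy m = 2 example: ideal probabilities for |00>, |01>, |10>, |11>
ideal = np.array([0.05, 0.15, 0.45, 0.35])
counts = {0: 4, 1: 10, 2: 52, 3: 34}  # 100 hypothetical measured shots
print(heavy_set(ideal), hop(ideal, counts))  # -> {2, 3} 0.86
```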
In order to determine if a device (or qubit subset of the device) passed the QV protocol, we use the following criteria: 1. The mean of the heavy output probabilities (HOP) is above 2/3 (see Equation (5)). 2. 2σ below the mean HOP is also above 2/3 (see Equation (6)). 3. The distribution is above 2/3 with a 0.99 z-confidence interval (see Equations (7) and (8)). These requirements increase in strictness; 2σ corresponds to a z-confidence of 0.977 (which corresponds to a z value of 2), compared to the 0.99 z-confidence. The QV metric was originally defined on a quantum computer with at least m qubits [1], with d(m) denoting the largest d such that criteria 1) and 2) are satisfied:

    log2 QV = max over m of min(m, d(m)).

When trying to find the quantum volume, we can restrict ourselves to tests with d = m. For simplicity, in the remainder of the article we will denote n = log2 QV, with n = m = d. By this definition, the best QV value found on a backend will be the QV value of that backend. In particular this means that distinctions between different qubit subsets have not been specifically reported when applying the QV protocol. One of the methods we investigate is distinguishing how the QV circuits perform on specific qubit subsets, as opposed to simply taking the best performing value found on the device (see Section 2.3). However, this is not possible on all devices because some systems do not currently support specifying a qubit subset to compile the logical circuit to.
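The three pass criteria can be checked mechanically. The sketch below uses a binomial/normal approximation with σ = sqrt(p(1−p)/k) for k circuits; this particular variance formula is our assumption of the standard treatment (implementations differ in how they weight shots per circuit), and z = 2.326 corresponds to the one-sided 0.99 confidence level:

```python
import math

def qv_criteria(hops, z99=2.326):
    """Return the three increasingly strict pass checks for a list of per-circuit HOPs."""
    k = len(hops)
    mean = sum(hops) / k
    sigma = math.sqrt(mean * (1.0 - mean) / k)  # binomial/normal approximation
    threshold = 2.0 / 3.0
    return (mean > threshold,                   # criterion 1: mean above 2/3
            mean - 2.0 * sigma > threshold,     # criterion 2: mean - 2*sigma above 2/3
            mean - z99 * sigma > threshold)     # criterion 3: 0.99 z-confidence

print(qv_criteria([0.72] * 400))  # many circuits comfortably above 2/3 -> (True, True, True)
```

Note how criterion 2 and 3 reward running more circuits: with few circuits, σ is large and even a mean well above 2/3 may not pass.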
In order to standardize the circuits used on all backends, 1,000 QV circuits per QV circuit size are generated using the Qiskit quantum volume method [28]. Figure 4 shows the gate counts for the raw uncompiled circuits. Table 1 summarizes the details and published hardware metrics of the 24 NISQ backends we test. Table 1 also summarizes the QV values we found on each of the 24 backends. These values are differentiated between 1. the QV value found when using the black-box execution method (little or no qubit mapping or basis gate conversion), and 2. the best QV value found across all circuits executed on that backend that were compiled using more time-intensive compilation procedures (these heavier compilation procedures are specific to the IBM Q and Rigetti backends; see Sections 2.2, 2.3, and 2.4). The procedure on all backends is to start at a small QV circuit size (e.g. n = 3), and then iterate to larger or smaller circuit sizes depending on the results of the initial test. Once the device clearly fails to pass at a given n, the procedure is terminated. However, this procedure has not yet fully completed for the H1-2 Quantinuum backend due to usage limitations, so far passing experiments up to n = 9. Therefore, its reported value of QV 512 is a lower bound on the true (black-box) quantum volume of the device.
Next we will detail the full QV circuit compilation procedure, a high-level diagram of which is shown in Figure 3. Across all backends (except the Quantinuum backend) we use the Qiskit transpiler method [2] to take the quantum volume circuits and compile them to a specific gateset and a specific qubit layout. The gateset supplied for compilation corresponds, to the best of our knowledge, to the native gateset supported by the backend. No other information (e.g. gate execution times, error rates, etc.) is supplied to the transpiler. The software versions we use are qiskit=0.33.1, qiskit-terra=0.19.1, and amazon-braket-sdk=1.9.5 up to amazon-braket-sdk=1.21.0.
When submitting jobs through the Amazon Braket SDK (this includes the backend providers IonQ, Rigetti, and OQC), there is a small but important difference between the Amazon Braket SDK and Qiskit: Amazon Braket does not have a measure gate. Instead, measure gates are applied implicitly to the qubits on which gates were applied. This becomes important in the n = 3 QV circuit case, where it is possible to create a QV circuit in which all 3 layers act on only 2 qubits throughout the entire circuit, resulting in an idle qubit. In the Qiskit implementation, this qubit is still measured (even though no gate operations were applied to it). Therefore, to maintain consistency with Qiskit and to avoid non-contiguous qubit indexing errors, in the Amazon Braket implementation a single identity gate was applied to the idle qubit.
The IBM Q backends allow jobs to be submitted to the backend, where a job is a collection of circuits. However, there are constraints on the number of circuits (and the number of samples) that can be used per job. For all IBM Q experiments, we submit 4 jobs, each composed of 250 QV circuits, in order to execute each sequence of 1,000 QV circuits. Thus, while these jobs are submitted in sequence, there is possibly a time delay between these 4 blocks of circuits.
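Splitting the circuit list into job-sized batches is straightforward; a minimal sketch (the 250-circuit batch size here simply mirrors the batching we used, not a documented hard limit of the service):

```python
def batch_circuits(circuits, per_job=250):
    """Split a list of circuits into consecutive fixed-size jobs for submission."""
    return [circuits[i:i + per_job] for i in range(0, len(circuits), per_job)]

jobs = batch_circuits(list(range(1000)))  # integers stand in for the 1,000 QV circuits
print(len(jobs), [len(j) for j in jobs])  # -> 4 [250, 250, 250, 250]
```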

Black-box quantum volume
The simplest method we use is to submit the uncompiled circuits directly to the specified backend (and let the backend or system handle compilation). However, how this method is implemented varies depending on the backend. In some cases, directly submitting the uncompiled circuits to the backend is not possible because the gateset or software is incompatible, requiring custom code to be developed to handle this conversion (for example, for the backends provided through Amazon Braket). In other cases the system allows for very quick and direct submission of the uncompiled circuits (for example, Quantinuum). We give details in the following subsections for each hardware vendor.

Quantinuum
The black-box method to access the H1-2 Quantinuum backend is simple: the Honeywell API allows users to submit OpenQASM [5] code directly to the backend with no initial compilation needed.
In accordance with the black-box execution approach, we do not compile or optimize these circuits at all before submitting them to the backend via the provided Quantinuum Python API. The API allows an optional no-opt compiler flag to be specified; for the black-box approach we leave it at its default value of False, which allows the backend to perform compiler optimizations. The current published QV value of the H1-2 system is 4096 [14] (increased from the previous 2048 [13]).
Our access to the Quantinuum backend was granted through Oak Ridge National Laboratory's OLCF program.

Black-box IBM Q
For the IBM Q backends, the black-box method we use is to call the Qiskit execute method [2,31,32] using the flags optimization_level=3 and layout_method=noise_adaptive. Although not specified by the user, sx, rz, cx, x are the IBM Q basis gates. This method compiles the original QV circuits onto the backend. Note that the Qiskit transpiler does not always successfully compile a group of circuits; the transpiler exits after unsuccessfully compiling for 1,000 iterations. Therefore, for a group of circuits we run the execute method repeatedly until it successfully compiles. Overall, this makes the method inefficient in terms of wall-clock time.
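For reference, the black-box submission amounts to a single call with the two flags above. The Qiskit call itself is shown commented out so the sketch stays self-contained; qv_circuits and backend are placeholders for the circuit list and the target device:

```python
# Options passed to qiskit.execute for the black-box IBM Q runs
# (option names as in the Qiskit 0.33-era API used for these experiments).
execute_options = dict(
    optimization_level=3,             # heaviest built-in optimization preset
    layout_method="noise_adaptive",   # choose physical qubits from calibration data
    shots=100,                        # samples per circuit, as used in Section 3
)

# from qiskit import execute
# job = execute(qv_circuits, backend, **execute_options)
# counts = job.result().get_counts()
```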

IonQ
The 11 qubit IonQ Harmony backend was accessed through Amazon Braket. In order to submit jobs through this service, the circuits need to be specified using the supported gates. In order to help the compilation, we compile the QV circuits to the IonQ gateset. The compiler we use in this stage, as with the other backends, is the Qiskit [2] transpiler. Here the Qiskit gateset we use for compilation is rxx, ry, rz, rx. Once converted to QASM, we convert the Qiskit gateset to a supported Amazon Braket gateset: xx, ry, rz, rx (with XX being a native two-qubit gate supported by the IonQ backend [33]). These gates can then be converted to Amazon Braket SDK code and submitted to the IonQ backend. However, the circuit that is compiled and run on the backend is not visible to the user.
IonQ has not published the quantum volume of the 11 qubit trapped ion quantum computer Harmony, available through Amazon Braket [16]. However, there has been application oriented benchmarking of the 11 qubit device [34].

Oxford Quantum Circuits (OQC)
The Oxford Quantum Circuits backend Lucy [35] can be accessed through Amazon Braket. Using the Qiskit transpiler, the QV circuits were compiled using the uni-directional Linear-Nearest-Neighbors (LNN) ring gate connectivity, optimization level 3, and basis gates rz, sx, x, and ECR (these are the basis gates of the OQC Lucy backend). These circuits are then converted into Amazon Braket syntax and submitted to the backend. As with the other Amazon Braket available backends, we let the compilation software stacks of Amazon Braket and OQC handle all aspects of compilation and qubit assignment. One of the options the user has when submitting circuits to the OQC Lucy backend via Amazon Braket is to turn off all circuit optimizations (accomplished in the Amazon Braket case by creating the circuit with a verbatim box option), or to leave all circuit optimizations on. By default we leave all circuit optimizations on, which allows the OQC compiler to select which qubits to use and to optimize the circuit (Figure 2 circuit (10) shows what one of these optimized circuits looks like, having been returned from the backend).

Black-box Rigetti
The route we took for the black-box compilation on Rigetti devices was to access the Aspen-11 and Aspen-M-1 devices through Amazon Braket. The Qiskit gateset we use for compilation is cz, rz, rx. Once compiled to QASM, we convert the Qiskit gateset to the supported Amazon Braket gateset cz, rz, rx. Note that XY is also a native gate of the Rigetti devices [36]; however, it is not currently a supported gate for the Qiskit transpiler. Additionally, CPHASE is another native gate of the Rigetti backends, but it was not used by Qiskit to implement the QV circuits in this compilation route. These gates (CPHASE, XY, cz, rz, rx) form the native gateset of the Aspen-11 and Aspen-M-1 devices [37]. Once each circuit is represented in the Amazon Braket compatible gateset, it is submitted to the Rigetti backends. From there, the Rigetti QUILC compiler [38] compiles the supplied circuit to the backend connectivity based on the latest calibration data of the backend. The resulting compiled circuit is sent back to the user as Quil code, allowing us to analyze the circuit that was executed on the backend [37,39,40,41]. In other words, for the black-box approach we let the compilation stacks of Amazon Braket and Rigetti handle compilation and qubit assignment.
The quantum volume of the Aspen-11 and the Aspen-M-1 have both been measured to be 8 [42], and previous Rigetti Aspen devices (specifically Aspen-4) have had a measured quantum volume of 8 [37] as well.

Rigetti Qubit Subset Enumeration
For a more detailed analysis of the Rigetti Aspen-11 and Aspen-M-1 devices, we used the Rigetti Quantum Cloud Services (QCS) platform in order to perform qubit subset enumeration for the QV circuits. Figure 3 shows the data pipeline for this compilation: starting from the uncompiled QV QASM circuits, we employ the Qiskit transpiler to convert the circuits into a connectivity matching a qubit subset of either Aspen-11 or Aspen-M-1 using the gateset rz, rx, and cx. Next this representation was converted into the Quil instruction set [39], and then the Rigetti QUILC compiler [38] was called in order to compile the Quil program into a device compatible form. Lastly the circuit was submitted to the device to be executed (using 100 samples).
One important distinction between this method and the black-box Rigetti approach described in Section 2.1.5 is the use of Active Reset [37]. Active Reset decreases the amount of computation time used on the device by speeding up the process of qubit reset; the cost is a decrease in qubit fidelity. For the Aspen-11 and Aspen-M-1, the Active Reset fidelities, based on vendor provided calibration data on the QCS platform are typically approximately 99%. The Amazon Braket execution process (i.e. the blackbox approach of Section 2.1.5) always uses Active Reset. By contrast, in the case of the qubit subset enumeration done on the QCS platform, we do not use the Active Reset capability. This difference can be directly seen in Figure 2; circuit (6) was submitted through Amazon Braket and it specifies Active Reset for the three qubits at the start of the circuit, whereas circuit (7) was compiled using QUILC for the purpose of the qubit subset enumeration procedure, and no active reset is used in that circuit.

Qiskit Transpiler with Qubit Subset Enumeration: IBM Q
In order to more thoroughly test the achievable QV values, we specify which groups of qubits to compile the QV circuits to, instead of letting the compiler (local or backend) handle this. This is specified using the initial_layout parameter in the Qiskit transpiler.
The Qiskit gateset we use for compilation is sx, rz, cx, x. We compile each of the 1,000 QV circuits onto each of the connected subgraphs of each of the IBM Q backends. This compilation process is quite time-intensive and requires HPC multiprocessing. We spent approximately 100,000 CPU hours (including the custom QV passmanager compilation time) to compile the n = 3 through n = 7 circuits onto 19 different IBM Q backends (see Table 2). The main reason for the enormous amount of compile time required is that the transpiler often reaches its maximum 1,000 iteration error, thus possibly requiring many attempts to compile a given circuit onto a connected subgraph of the hardware. Although this is an allowed component of the QV protocol, it is not time efficient for users; the black-box compilation (Section 2.1.2) is closer to what a typical user would implement.
The arguments we use for the transpiler method are the coupling map of the backend, optimization_level=3, the initial_layout of the connected subgraph, and basis_gates=x, sx, cx, rz. Occasionally, the Qiskit transpiler will use some neighboring qubits outside of this connected subgraph in an attempt to improve circuit fidelity. Based on the compilation described in [1], this is an accepted part of the definition of QV. One cause of the transpiler reaching the maximum 1,000 iteration error is a poorly chosen qubit subset. For this procedure, we do not care about the order of the qubit subset (because the order of the qubits used in the circuit can be remapped for different connectivities); therefore we also randomly shuffle the qubit subset while attempting to compile each circuit.
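Enumerating the candidate connected subgraphs can be done with a simple flood fill over the backend coupling map. The sketch below is our own helper, using a small illustrative T-shaped coupling map rather than any real backend's; it lists every size-n qubit subset that is connected and could therefore serve as an initial_layout choice:

```python
from itertools import combinations

def connected_subsets(coupling_map, n):
    """All size-n qubit subsets that form a connected subgraph of the device."""
    adj = {}
    for a, b in coupling_map:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)

    def is_connected(subset):
        # Flood fill from an arbitrary member, restricted to the subset.
        seen, stack = set(), [next(iter(subset))]
        while stack:
            q = stack.pop()
            if q not in seen:
                seen.add(q)
                stack.extend(adj.get(q, set()) & subset)
        return seen == subset

    return [s for s in map(frozenset, combinations(adj, n)) if is_connected(s)]

cmap = [(0, 1), (1, 2), (1, 3), (3, 4)]  # illustrative 5-qubit T-shaped layout
print(sorted(sorted(s) for s in connected_subsets(cmap, 3)))
# -> [[0, 1, 2], [0, 1, 3], [1, 2, 3], [1, 3, 4]]
```

On real heavy-hex devices the number of such subsets grows quickly with n, which is one reason the full enumeration is so compute-heavy.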

Custom Compilation using QV Passmanager: IBM Q
Lastly, we use the custom QV compiler techniques introduced in [7,9,10] in order to compare how heavier compilation affects the measured QV relative to the standard Qiskit transpiler. The software used in these experiments is published on the qiskit-tutorials Github [43,44]. Specifically, we implement the same qubit subset compilation as in Section 2.3, except we now use the custom QV compiler [43,44]. In particular, we attempt to compile a sorted qubit subset, and then several random permutations of the qubit subset. However, some circuits fail to compile to some qubit subsets using the custom QV compiler. The passmanager flags are all left at their defaults; in particular, we specify: basis gates, coupling map, qubit subset, backend_props=backend.properties(), instruction durations (provided from the backend), synthesis_fidelity=0.99, pulse_optimize=True.
This method requires even more computation time than the compilation with qubit subset enumeration from Section 2.3. The routing and qubit assignment optimization [7,10] is done with the CPLEX solver [45]; meaning that this method requires a CPLEX license to compile these circuits. For all compilation we set the BIPMapping (i.e. the CPLEX optimization of routing and qubit assignment) timeout to 5000 seconds [46]. This custom transpilation also uses Qiskit Pulse [9,47] optimization to increase circuit fidelity. The use of Qiskit Pulse additionally requires precise timing of the gate instructions, which is enforced by delay gates in the circuits. An example of the usage of these delay gates can be seen in Figure 2 circuit (4).
Additionally, we cut off compilation of all of the circuits after a few days using HPC resources; in some cases there were circuits (and qubit subset choices) that we did not attempt to compile because of this time constraint. Therefore, either because of compiler errors preventing compilation or because of the time constraint we imposed, not all circuits for all qubit subsets could be compiled across the backends we tested using this compilation method. We also restricted these compilations to a subset of the available IBM Q backends, as opposed to Section 2.3 where we compiled circuits for all available IBM Q backends.

Results
We compile and execute the QV circuits on the backends listed in Table 1. First we show results for black-box compilation and execution, which is the most general and widely available method across the backends (for example, not all backends allow full specification of which qubits to use in the circuit). Next we show results for IBM Q backends when we enumerate compilation across the qubit subsets (of size n) of a backend. This allows us to characterize the QV protocol results across the entire chip of the IBM Q backends. Lastly, we evaluate the custom IBM Q QV passmanager for circuit compilation [10,43,44] on a restricted set of IBM Q backends and connectivities on those backends. Unless otherwise noted, all experiments used 100 samples for each circuit execution. Figure 2 shows the differences in circuit compilation, starting from the original uncompiled circuit to the compiled circuits that are submitted to the different devices. In order from the top to bottom circuit drawings in Figure 2: (1) uncompiled QV circuit; all subsequent circuits use this description to compile to the other gatesets or connectivities.
(2) Qiskit transpiled circuit on qubits 0, 1, 2 on ibm perth with optimization level 3. Corresponding methodology described in Section 2.3.
(3) Transpiled using the black-box execute method on ibm perth. Corresponding methodology described in Section 2.1.2.

In Figure 2, the difference in structure that results from the same logical n = 3 circuit is perhaps surprising, even though some of the diversity can be explained by the different native gate sets that the compilers aim to optimize to. Interestingly, when comparing circuit (6) and circuit (5) in Figure 2, we see that although the supplied QV circuit used 6 two-qubit cz gates, the software stack that compiled the circuits chose to use a total of 9 cz gates in the final Quil compiled circuit, which is sub-optimal. Note that although the OQC Lucy backend native gateset uses ECR as the two-qubit gate, the returned QASM code that was compiled for the backend (circuit (10)) used CNOT gates for two-qubit interactions. Figure 4 shows the average circuit statistics in terms of gate depth, one-qubit gate counts, and two-qubit gate counts, across the different device compiled circuits. Across different QV circuit sizes, the custom QV passmanager compilation reduces the CNOT count on average compared to the two other IBM Q compilation procedures.
In order to visualize how the QV protocol progresses as we execute each circuit on the given QPU, we show cumulative heavy output probability figures where the x-axis is the circuit index and the y-axis is the Heavy Output Probability (HOP). In these figures, we plot the ideal HOP distribution, the individual circuit measured HOP values, the cumulative mean of the HOP values (up to index i in the plot), 2σ below the cumulative HOP mean, and lastly we shade the region with z-confidence > 0.99 if more circuits are executed past that confidence level (this can be seen in Figures 6, 11, and 14). We encourage close attention to the x-axis whenever comparing HOP plots: while we usually plot up to the full 1,000 circuits, we cut off earlier when it is clear that the test has been passed.
In all HOP figures we plot both the mean HOP (solid orange horizontal line), as well as the individual ideal HOP values for each circuit (shown as high transparency orange points). Note that the 1,000 QV circuits have smaller ideal HOP at n = 2 and n = 3 compared to larger values of n. This can be seen in the ideal distributions of Figure 9 for n = 2 compared to the plots for n = 5 or greater (for example, Figure 5). This is to be expected for smaller QV circuit sizes [12], even though in the limit the ideal HOP distribution approaches (1 + ln 2)/2.

Black-box Quantinuum
As described in Section 2.1.1, the QV circuits were submitted directly to the backend as the uncompiled QASM file (the circuit statistics on the uncompiled QV circuits can be seen in Figure 4), which is comprised entirely of CNOT and U3 gates (see Figure 2). No user-side circuit optimization, basis gate conversion, or transpilation was performed on these circuits. Figure 5 shows that the H1-2 device passes the QV test for circuit sizes up to n = 8. In order to save resources, as with IonQ, execution was terminated once the QV protocol criteria were met. Due to usage constraints, larger circuit sizes are still being tested. Therefore, for the H1-2 backend we can only provide a lower bound (n = 9) on the QV value of the QPU. The n = 5 experiments used 100 shots for each circuit. The n = 6, 8, 9 experiments used 20 shots for each circuit. As a comparison, n = 8 was reached after only 140 circuits, which is considerably fewer than most IBM Q backends took to reach their best results.

Black-box IBM Q
The QV results were nearly identical across all IBM Q backends when using the black-box execute method. That is, every backend passed the n = 3 QV test, and failed n = 4 and n = 5 with the exception of a particular qubit subset of ibmq manila for n = 4. Table 1 summarizes these results. Figure 6 shows the results where ibmq manila passed for n = 4 sized QV circuits, but failed to pass at n = 5. Figure 7 shows that the 11 qubit IonQ backend passes the QV test at n = 3 and fails to pass at n = 4. For n = 3, execution was stopped once the results passed the z-confidence threshold of 0.99. Because the mean HOP for n = 4 was definitively lower than 2/3, we stopped execution at 500 circuits. Figure 8 shows the HOP distribution from executing the 1,000 QV circuits at n = 2 on the OQC Lucy backend. This plot shows that the mean HOP is consistently below 2/3. In the compilation of these circuits, the OQC compiler applied optimizations to the circuit(s) that were submitted through Amazon Braket. Figure 9 shows that using black-box compilation and job submission, the Aspen-11 device fails to pass the QV test at n = 3 and n = 2. Note that n = 2 was tested, unlike the other backends, since the n = 3 test failed. We additionally tested the Aspen-M-1 backend for n = 2, which also failed to pass the QV protocol.

Rigetti Qubit Subset Enumerations
Here we summarize the results from executing the qubit subset enumeration QV protocol on Aspen-M-1 and Aspen-11 on the QCS platform [37]. Not all qubit subsets of Aspen-11 and Aspen-M-1 were tested. For Aspen-11 at n = 2, 12 out of the possible 48 connected qubit subsets were tested, 8 of which passed with a 99% z-confidence level. For Aspen-11 at n = 3, 51 out of the possible 72 connected qubit subsets were compiled to and executed, none of which passed with high confidence. For Aspen-M-1 at n = 3, 26 out of the possible 184 connected qubit subsets were tested, one of which passed with a 99% z-confidence level. Figure 10 shows four specific examples of cumulative HOP distributions where n = 2 both passed and failed on Aspen-11, and where n = 3 both passed and failed on Aspen-M-1. Table 2 shows the qubit subset enumeration results for each of the IBM Q backends when the QV protocol is applied using the Qiskit transpiler with no modifications to the transpilation procedure; only compilation flag level 3 (which is the highest flag supported by this method), the connectivity graph, and the required basis gates are provided as additional transpiler arguments. Because of the size of the backend, only some of the qubit subsets of ibm washington at n = 4 were tested. Running n = 3, 5 on ibm washington would have also required significant additional QPU time. Figure 11 shows two HOP plots for two different IBM Q backends at n = 4. Importantly, with this compilation procedure the highest QV value found across all tested IBM Q backends was n = 4 (QV = 16). Figure 13 shows heatmaps of several IBM Q backends in terms of QV protocol success counts across the entire chip. Notably, as with the error rates on the chip, the distribution of higher success rate qubits is not uniform across all qubits.
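The subset counts quoted above (e.g. the 72 connected 3-qubit subsets on Aspen-11) come from enumerating the connected subgraphs of the device coupling map. A minimal stdlib sketch of that enumeration follows; the function name is ours, and for a real backend the edge list would come from the vendor-reported coupling map:

```python
from itertools import combinations

def connected_subsets(edges, n):
    """Enumerate all connected qubit subsets of size n in a coupling graph
    given as a list of (qubit, qubit) edges."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    found = []
    # brute-force over all size-n combinations; fine for the small n used here
    for combo in combinations(sorted(adj), n):
        nodes = set(combo)
        # depth-first search restricted to the candidate subset
        stack, seen = [combo[0]], {combo[0]}
        while stack:
            q = stack.pop()
            for nb in (adj[q] & nodes) - seen:
                seen.add(nb)
                stack.append(nb)
        if seen == nodes:
            found.append(combo)
    return found
```

For larger chips, growing connected sets outward from each node scales better than this brute force, but the result is the same.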
While this is to be expected, this result shows the importance of backend connectivity and the error rates of particular gates; the QV value for a backend does not necessarily hold for all the qubits on the backend. Figure 12 shows the distribution of mean HOP values from the Qiskit transpiled QV circuits across all IBM Q backends and connectivities, organized into three histograms corresponding to n = 3, 4, 5. Two observations are noteworthy: first, all three histograms seem to have bimodal characteristics, which could correspond to different processor generations. Second, although no n = 5 QV protocol passed a z-confidence of 0.99, the histogram shows that some mean HOP values did cross the 2/3 threshold, but the amount over 2/3 was not significant.

device       # qubits   IBM log2 QV
lima         5          3
belem        5          4
quito        5          4
jakarta      7          4
bogota       5          5
manila       5          5
lagos        7          5
perth        7          5
casablanca   7          5
guadalupe    16         5
sydney       27         5
toronto      27         5
brooklyn     65         5
hanoi        27         6
cairo        27         6
mumbai       27         7
montreal     27         7
auckland     27
washington   127

Table 2: Successful quantum volume experiments across the IBM Q backends compared to the vendor-provided log2 QV. The denominator is the number of connected subgraphs of size n on the backend; the numerator is the number of those subgraphs that passed the quantum volume protocol test. *On ibm washington, in part due to the significantly larger chip size than the other IBM Q backends, not all circuits were able to be run; the true number of connected qubit subsets on ibm washington is 272, but we only tested 262 of those.

Custom Compilation using QV Passmanager: IBM Q
Table 3 summarizes the QV results on a subset of the IBM Q backends when using the custom QV passmanager for circuit compilation. Compilation for ibmq mumbai and ibm auckland was successful, but the circuits failed to execute on the backends due to an internal error relating to the pulse instruction durations. The pulse gate duration and timing need to be specified very precisely (see Figure 2, circuit (4), where the QV passmanager specifies Delay gates in order to make the Pulse gates work correctly on the backend); it appears that the compilation to the ibmq mumbai and ibm auckland backends failed because of an error related to the circuit timing. Lastly, additional circuits compiled using this custom passmanager were not executed on ibm washington due to the significant QPU time usage it would require. Importantly, this custom compiler increased the measured quantum volume on many of the backends compared to the default Qiskit transpiled circuits, going from n = 4 when using the default Qiskit transpiler to n = 5. However, this custom passmanager did not work on all backends, required heavy computation time, and did not consistently find the same QV values reported by the vendor, although not all qubit subsets were evaluated on all of the backends we tested. Therefore, although it does improve circuit fidelity, these more custom compilation techniques are not feasible for a typical user. Figure 14 shows a side-by-side of (different) qubit subset HOP results from ibmq toronto and ibmq montreal, showing that which qubits are used on a chip can change the QV result.

Quantum Volume over time
Another important aspect of a benchmark for quantum computers is that the noise profile of a device changes over time, which means that benchmarks on a specific part of the hardware may not be consistent over time. Figure 15 shows the mean HOP after having run all 1,000 QV circuits for n = 3 and n = 4 on fixed qubit subsets of the ibm auckland device (qubits (10,12,13) for n = 3 and qubits (10,12,13,14) for n = 4). The mean HOP achieved at each of these time points noticeably varies over the approximately 400 hours of the data collection. At some points this set of qubits fails to pass the n = 3 QV test (because the mean HOP is below 2/3), but at other points the mean HOP for n = 4 goes above 2/3. This shows that there is significant variation in the QV test over time. Therefore it is important to show the aggregate behavior of a NISQ device on the QV benchmark, as for example in Figure 13.
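Drift of this kind can be monitored by grouping HOP results by collection session and checking each group against the 2/3 threshold. The following toy sketch uses made-up session labels; the grouping scheme and function name are our own illustration:

```python
from collections import defaultdict

TWO_THIRDS = 2.0 / 3.0

def mean_hop_by_session(records):
    """records: iterable of (session_id, hop) pairs.
    Returns, per session, the mean HOP and whether it clears 2/3."""
    buckets = defaultdict(list)
    for session, hop in records:
        buckets[session].append(hop)
    return {
        s: (sum(v) / len(v), sum(v) / len(v) > TWO_THIRDS)
        for s, v in buckets.items()
    }
```

Tracking these per-session means over a few hundred hours reproduces the kind of threshold crossings seen in Figure 15.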

Discussion
Quantum Volume is designed to be a benchmark that can compare quantum backends with different underlying hardware. What we found is that the particular QV protocol used (i.e. how many circuits are run) and how the QV circuits are compiled massively impact the measured quantum volume. For end users who will employ the simple compiler methods available in the quantum SDKs of the hardware vendors, the heavy-compilation quantum volume results do not reflect the backend fidelity they would experience when running jobs, because they are not using those more advanced compiler options. The qubit subset enumeration QV results (for IBM Q) reveal far more detail than the black-box method. These results give a more detailed analysis of the quantum device's capabilities, not only allowing more accurate comparisons across different backends, but also a more detailed picture of which regions of the device give the best results. Figure 13 shows that the specific qubits used to execute circuits on the backend greatly impact the circuit fidelity. Compiling the same circuits to different qubit subsets of a backend allows an even finer-grained evaluation of the performance of a backend; applying this methodology to application benchmarks [16,17] is interesting future work.
Although we were not able to exhaustively evaluate the more advanced IBM Q compilation features across all backends and qubit subsets, the more advanced compilation methods clearly increased the measured QV values (see Table 3). Lower error rates and better connectivity clearly translate to higher quantum volumes. The Quantinuum H1-2 backend had the lowest overall error rate across all backends we tested, and it also had the highest quantum volume by a significant margin.
We showed that, importantly although not surprisingly, the QV benchmark drifts over time even for fixed qubits. It is known that NISQ noise profiles have significant time dependence [48], and therefore it follows that benchmarks for these devices would also see time dependence. This is notable because it means that NISQ benchmarks also need to take into account the aggregate behavior of the device, not just a subset of qubits.
Other NISQ benchmarks, such as CLOPS for quantifying execution speed, should also continue to be evaluated in order to provide additional context for the performance of these backends. Another future research area is to quantify the correlation between NISQ benchmarks (such as QV) and error metrics such as aggregate error [49,50], cross-talk, and average qubit fidelity. On average it is clear that error rates, as well as connectivity and compilers, are the primary factors impacting NISQ device performance. However, exactly quantifying the error experienced by a circuit during execution can be difficult because it relies on time-sensitive calibration data, as well as knowing the exact circuit that was executed on the backend. Additionally, QV circuit execution can occur over an extended period of time. Therefore, determining the relationship between aggregate error and NISQ benchmarks is an interesting research avenue.
Due to time and usage limitations (in particular, current trapped-ion qubit technology costs more time per sample than superconducting qubit technology), we were unable to complete the full quantum volume tests on the Quantinuum H1-2 device. One limitation of the QV benchmark in this respect is that it requires many samples and many circuit executions, which is not always feasible on all platforms. One possible modification of the QV protocol would be to make use of data re-sampling in order to quantify HOP distributions using a smaller number of samples. This has been introduced in the literature [12] and would make the QV protocol more accessible.
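The resampling idea can be sketched with a generic percentile bootstrap over the measured per-circuit HOP values: resample with replacement, recompute the mean each time, and read a confidence interval off the bootstrap distribution. This is a standard bootstrap sketch with our own function name, not the specific procedure of [12]:

```python
import random

def bootstrap_hop_ci(hops, n_boot=2000, alpha=0.05, seed=11):
    """Percentile bootstrap confidence interval for the mean HOP."""
    rng = random.Random(seed)
    n = len(hops)
    means = sorted(
        sum(rng.choice(hops) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the lower endpoint of the interval sits above 2/3, the test passes at the requested confidence while using fewer raw samples than re-running circuits would.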
Overall we find that quantum volume gives a good basis for comparing different NISQ backends if the settings for such a comparison are held constant. There are many particular details which can impact the measured quantum volume of a NISQ device. The most significant appears to be the compiler; on one hand, heavy optimization can yield better circuit fidelity, but on the other hand a poor choice of qubit layout can significantly hinder it. Other important settings include how many circuits are used in the test. For instance, Figure 11 shows that the successful measurement of n = 4 for ibmq guadalupe would not have occurred if we had executed fewer than 400 circuits. Therefore, the quantum volume metric is only useful if there is a consistent basis for comparison (i.e. a relatively consistent compiler and consistent experimental settings).
An important takeaway for users of NISQ computers running near-term applications is that it is prudent to try different qubit subsets on the hardware and different compilation strategies (for example, noise-adaptive compiler settings versus specifying particular qubits to use) in order to get a good sense of the device's full capabilities.

A.1 2-qubit gates
We first define the parametrized 2-qubit gates used in this paper:
• XY is one of the native two-qubit gates for the Rigetti systems, and it is parameterized by an angle θ. For the fixed angle θ = π, the XY(π) gate is also known as iSWAP. The iSWAP gate is used for illustration purposes to show a Quil compiled circuit in Figure 2. XY gates are native to the Rigetti hardware but currently not supported in Qiskit; the iSWAP gates in Figure 2 correspond to actual XY(π) gates.
• RXX is an angle-parameterized gate used as the only two-qubit gate when compiling the raw QV circuits in Qiskit to be submitted to the 11-qubit IonQ Harmony device through Amazon Braket.

RXX(θ) = e^(−i(θ/2) X⊗X)
• CPHASE is one of the native two-qubit gates for the Rigetti systems, and it is parameterized by the angle ϕ.
Next, we define the more common and perhaps better-known unparametrized 2-qubit gates:
• CZ is the third two-qubit gate for the Rigetti systems; it is not parameterized.
• The ECR gate is the two-qubit gate used by the OQC Lucy backend. We compile the raw QV circuits to the OQC Lucy native gateset before submitting the circuits to the Amazon Braket system.
• ZZ is the native two-qubit gate for the Quantinuum H1-2 device. In the workflow for this paper, the QV circuits are not compiled to the Quantinuum gateset on the user side; instead that compilation is done entirely on the server side.
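The parameterized gates above have simple closed-form matrices that can be checked numerically, for instance that XY(π) equals iSWAP and that RXX(θ) = e^(−i(θ/2) X⊗X) is unitary. A numpy sketch with our own helper names:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)

def rxx(theta: float) -> np.ndarray:
    """RXX(theta) = exp(-i*(theta/2) * X (x) X), in closed form."""
    XX = np.kron(X, X)
    return np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * XX

def xy(theta: float) -> np.ndarray:
    """Rigetti XY(theta); for theta = pi this is the iSWAP gate."""
    c, s = np.cos(theta / 2), 1j * np.sin(theta / 2)
    return np.array(
        [[1, 0, 0, 0],
         [0, c, s, 0],
         [0, s, c, 0],
         [0, 0, 0, 1]], dtype=complex)

ISWAP = np.array(
    [[1, 0, 0, 0],
     [0, 0, 1j, 0],
     [0, 1j, 0, 0],
     [0, 0, 0, 1]], dtype=complex)
```

Since (X⊗X)² = I, the closed form follows directly from the exponential series, and unitarity reduces to cos² + sin² = 1.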

A.2 1-qubit gates
We again present parametrized gates first:
• The rz gate is the most commonly used single-qubit gate; it is used by the Quantinuum, IonQ, IBM Q, OQC, and Rigetti devices.
• U1 is one of the native single-qubit gates for the Quantinuum H1-2 device (the other single-qubit gate is rz). As with the ZZ gate, we do not use this gate for compiling and submission to the H1-2 device, as the compilation is handled server side.
• The U3 gate is used when the raw QV circuits are generated using Qiskit, as its parameters cover the whole space of single-qubit unitaries. Those raw QV circuits are defined entirely in terms of U3 and CX gates. Subsequent compilations transform those circuits into other hardware-specific gatesets. The compiled circuits returned from the OQC Lucy backend also make use of the U3 gate.
We conclude with the definitions of the fixed single-qubit gates:
• sx is a native single-qubit gate on IBM Q and OQC devices.
• The Pauli X gate is a native single-qubit gate used on IBM Q and OQC systems.
x = ( 0 1
      1 0 )
• The Hadamard gate is used in the description of the compiled QV circuits returned by the OQC device (written in OpenQASM).
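The claim that U3's parameters cover the whole space of single-qubit unitaries can be checked concretely: each fixed gate above is a U3 instance, up to a global phase. A numpy sketch using the standard identities x = U3(π, 0, π) and h = U3(π/2, 0, π) (the helper `equal_up_to_phase` is our own):

```python
import numpy as np

def u3(theta, phi, lam):
    """General single-qubit rotation used for the raw QV circuits."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array(
        [[c, -np.exp(1j * lam) * s],
         [np.exp(1j * phi) * s, np.exp(1j * (phi + lam)) * c]])

X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
SX = 0.5 * np.array([[1 + 1j, 1 - 1j], [1 - 1j, 1 + 1j]])  # sqrt of X

def equal_up_to_phase(a, b):
    """Compare two unitaries, ignoring an overall global phase."""
    k = np.argmax(np.abs(b))
    phase = a.flat[k] / b.flat[k]
    return np.allclose(a, phase * b)
```

The sx gate equals U3(π/2, −π/2, π/2) up to the global phase e^(iπ/4), which is physically irrelevant and so is ignored by `equal_up_to_phase`.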