Effects of Dynamical Decoupling and Pulse-Level Optimizations on IBM Quantum Computers

Currently available quantum computers are prone to errors. Circuit optimization and error mitigation methods are needed to design quantum circuits that achieve better fidelity when executed on NISQ hardware. Dynamical decoupling (DD) is generally used to suppress decoherence errors, and different DD strategies have been proposed. Moreover, circuit fidelity can be improved by pulse-level optimization, such as creating hardware-native pulse-efficient gates. This article implements all the popular DD sequences and evaluates their performance on IBM quantum chips with different characteristics for various well-known quantum applications. We also investigate combining DD with the pulse-level optimization method and apply the combination to QAOA to solve the Max-Cut problem. Based on the experimental results, we find that DD benefits only certain types of quantum algorithms, while the combination of DD and pulse-level optimization methods always has a positive impact. Finally, we provide several guidelines for users on how to use these noise mitigation methods to build high-fidelity circuits for quantum applications on IBM quantum computers.


I. INTRODUCTION
Quantum computing has grown rapidly in recent years, and various technologies have been developed for different quantum platforms, the leading candidates being superconducting and trapped-ion devices. Several companies, such as IBM, Rigetti, and IonQ, provide publicly available cloud-based services that allow users to access their platforms remotely. However, today's quantum computers are still prone to errors caused by either unavoidable interactions with the environment or imperfect quantum controls. They are therefore classified as Noisy Intermediate-Scale Quantum (NISQ) computers [1]. The largest quantum chip to date, released by IBM, has 127 qubits [2]. Quantum error correction (QEC) has been proposed to eliminate the impact of noise and help achieve a fault-tolerant quantum device [3,4]. However, implementing quantum error correction codes requires a large number of ancilla qubits, which is not feasible on current hardware. Therefore, alternative approaches are needed to tackle the noise issue.
To reduce the impact of noise and improve quantum circuit fidelity, several quantum software tools have been designed, and they have contributed greatly to different stages of circuit design, such as circuit synthesis [5,6] and qubit mapping [7,8]. In addition, quantum error mitigation (QEM) was introduced for error suppression on NISQ devices [9]. There are different QEM techniques, such as readout error mitigation [10], dynamical decoupling (DD) [11], crosstalk mitigation [12,13], and zero-noise extrapolation (ZNE) [14]. Most of these methods require supplementary circuit executions to build the error map for mitigation. DD, in contrast, is one of the simplest strategies: it aims at mitigating the decoherence error (also called idle error) without any circuit overhead, and it is the focus of this paper. The idea of DD is to periodically insert a series of pulses on the idle qubits that return the qubits to their original states. Idle qubits are very likely to occur during execution because of the variation in gate latencies and the limited parallelism caused by anti-commutative gates. It has been shown in [15] that idle qubits can be almost 10 times more subject to errors if adjacent two-qubit operations are executing at the same time on an IBM superconducting device. DD plays an important role in reducing the idle error and has been used in quantum volume experiments [16], noise spectrum characterization [17], and decoherence-protected quantum gate implementation [18]. Moreover, there are different DD strategies, such as Hahn echo [19], CPMG [20], XY4 [21], and robust KDD [22], which have diverse impacts on decoherence error suppression for different quantum devices. It has been shown in [23] that DD with the XY4 sequence is able to extend the lifetime of one-qubit states as well as entangled two-qubit states on IBM and Rigetti devices. However, this sequence has been shown to be vulnerable to experimental imperfections, while robust DD was demonstrated to be capable of correcting pulse errors on Rigetti devices [24]. Recently, IBM released Qiskit Pulse [25,26], which allows users to design and customize gate pulse implementations. Some works have already attempted to optimize the pulse controls and reduce the pulse durations with the help of Qiskit Pulse [27][28][29]. However, such methods often need additional calibrations, which are time-consuming and require a deep familiarity with quantum control. In [30,31], the authors proposed a new technique to create more hardware-native pulse-efficient gates, which improves the gate fidelity without the overhead of extra calibrations.

* siyuan.niu@lirmm.fr
† aida.todri@lirmm.fr
So far, the aforementioned techniques were only tested separately for limited benchmarks and quantum hardware. Several questions still remain unclear: (1) What are the impacts of different DD sequences on specific quantum algorithms? (2) For a certain benchmark, does the impact of different DD sequences vary across different quantum chips? (3) Will the combination of DD and pulse-level optimization methods further improve the circuit fidelity? In our work, we address these questions and our main contributions can be listed as follows:
• We explore the performance of different DD sequences on various quantum applications and evaluate the experiments on several IBM devices with different qubit numbers and quantum volumes. To the best of our knowledge, this is the first attempt to illustrate the behavior of applying robust KDD sequence to IBM quantum devices.
• We combine the DD technique with the pulse-efficient optimization method and demonstrate their benefits on the Quantum Approximate Optimization Algorithm (QAOA) applied to the Max-Cut problem.
• Based on the experimental results, we provide guidelines and insights for users to apply application-oriented dynamical decoupling and pulse-level optimization techniques.

A. Dynamical decoupling
Dynamical decoupling is widely used to suppress the decoherence error by reducing the interaction between the system and the environment. Consider the system-bath Hamiltonian $H$ in (1), where $H_S$ and $H_B$ are the Hamiltonians of the system and the bath, respectively, and $H_{SB}$ is the interaction term given in (2):

$$H = H_S + H_B + H_{SB} \qquad (1)$$

$$H_{SB} = \sum_i \sum_{\alpha \in \{x, y, z\}} \sigma_i^{\alpha} \otimes B_i^{\alpha} \qquad (2)$$

Here $\sigma_i^{\alpha}$ is the Pauli matrix acting on qubit $i$, $B_i^{\alpha}$ is the corresponding operator of the environment, and $\alpha \in \{x, y, z\}$. DD aims at reducing the impact of the system-environment interaction, and various DD protocols have been developed to improve the performance of quantum computers. Here, we review the main strategies of DD implementations; a summary of these DD sequences is shown in Table I.
• Hahn echo. The spin-echo sequence [19] is used to reduce the inhomogeneous effects of the environmental magnetic field. It applies a π pulse to the spin system to invert the spins after a period of time t and lets the system refocus over the same duration. At that point, $T_2^*$ effects are removed and a $T_2$ echo is obtained.
• CP and CPMG. Carr and Purcell proposed a series of π pulses separated by a constant interval, known as the CP sequence, to further reduce the effect of self-diffusion in the inhomogeneous field [32]. However, extra pulses can introduce more errors and destroy the state of the system. Therefore, Meiboom and Gill improved the CP sequence by developing the CPMG sequence [20], which retains the CP pulses but introduces an additional phase shift in the first pulse to compensate for pulse errors.
• UDD. Hahn echo and CP/CPMG use equidistant pulses. Uhrig dynamical decoupling (UDD), in contrast, optimizes the π-pulse sequence of CP/CPMG by varying the intervals between the pulses [34]. It is proven to be the optimal sequence for suppressing low-frequency noise and is insensitive to thermal fluctuations. It outperforms the equidistant DD sequences especially for systems whose spectral densities have high frequencies with a sharp cutoff. For other, more general cases, equidistant DD sequences were shown to perform better [35].
• XY4. All the DD sequences above rotate around one single axis only. They are useful exclusively when the system-environment interaction is orthogonal to the rotation axis. XY4 is the simplest DD sequence that suppresses the system-environment interaction generally, along all three directions [21]. It alternately inserts π rotations around the X and Y axes, independently of the initial state.
• XY8 and XY16. DD can sometimes be detrimental to the fidelity because of the accumulation of errors caused by pulse imperfections. One approach to reducing these errors is to combine a basic sequence with its inverse for self-correction, so that the unwanted terms cancel [33]. XY4 is usually chosen as the basic cycle. XY8 is composed of the XY4 sequence and its inverse, while XY16 contains XY8 and its inverse.
• KDD. The other approach to counteracting pulse imperfections is to replace each pulse of a DD sequence by a robust composite pulse, which is designed to generate ideal rotations even in the presence of pulse imperfections [36]. KDD constructs a DD block from a 5-pulse composite π pulse and combines two such blocks, with and without a phase shift of π/2, as $(\mathrm{KDD}_{\phi}, \mathrm{KDD}_{\phi+\pi/2})^2$ [22]. It is composed of 20 pulses in total.
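As a sanity check on the sequences listed above, the following minimal NumPy sketch multiplies ideal, instantaneous π pulses and confirms that each multi-pulse sequence returns the qubit to its initial state up to a global phase. Several details are assumptions that the text above does not spell out: the KDD block phases (π/6, 0, π/2, 0, π/6, each shifted by φ) follow the commonly cited composite pulse, the "inverse" used for XY16 is taken to be XY8 with every pulse direction flipped, and all function names are illustrative.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)


def pulse(theta, phi):
    """Ideal rotation by theta about the axis cos(phi) X + sin(phi) Y."""
    axis = np.cos(phi) * X + np.sin(phi) * Y
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * axis


def xy4():
    # Phases of the four pi pulses in time order: X, Y, X, Y.
    return [0.0, np.pi / 2, 0.0, np.pi / 2]


def xy8():
    # XY4 followed by XY4 in reversed order.
    return xy4() + xy4()[::-1]


def xy16():
    # Assumption: XY8 followed by XY8 with every pulse direction flipped.
    return xy8() + [phi + np.pi for phi in xy8()]


def kdd_block(phi):
    # Assumption: commonly cited 5-pulse composite pi pulse.
    return [np.pi / 6 + phi, phi, np.pi / 2 + phi, phi, np.pi / 6 + phi]


def kdd():
    # (KDD_phi, KDD_{phi + pi/2}) repeated twice with phi = 0: 20 pulses.
    return (kdd_block(0.0) + kdd_block(np.pi / 2)) * 2


def net_unitary(phases):
    """Product of ideal pi pulses; later pulses multiply from the left."""
    u = I2
    for phi in phases:
        u = pulse(np.pi, phi) @ u
    return u


def is_identity_up_to_phase(u, tol=1e-9):
    # For a 2x2 unitary, |trace| / 2 equals 1 only if u is a phase times I.
    return abs(abs(np.trace(u)) / 2 - 1) < tol
```

A single Hahn-echo pulse fails this check by construction, which is consistent with the later remark that its inverse has to be absorbed into neighboring gates.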
DD has been demonstrated to mitigate decoherence errors on IBM and Rigetti platforms using XY4 sequences [23]. The experiment is performed as follows: it prepares different initial states by varying the angles of the rotation gates, inserts XY4 sequences, and compares the input and output states. The impacts of different DD sequences on a Rigetti device are reported in [24]. A large number of DD sequences are evaluated, including XY4, XY8, and KDD, and quantum process tomography is used to characterize the evolution. KDD is shown to be the most robust pulse sequence against pulse imperfections. Both papers evaluate DD sequences on simple benchmarks instead of real quantum applications, and there is no experiment that applies KDD to IBM quantum devices. Moreover, the experiments in [15] show that the naive implementation of DD (inserting DD sequences on all the idle qubits whenever possible) cannot always improve the circuit fidelity. Therefore, Das et al. proposed an Adaptive Dynamical Decoupling framework that estimates the DD impact for each circuit and adjusts the DD sequence to ensure that it improves the circuit fidelity [15]. This method achieves a fidelity improvement but introduces a large overhead for characterizing the DD impact of a given application. Also, Ravi et al. proposed VAQEM [37], an approach that dynamically inserts DD sequences for variational algorithms, with the overhead of tuning the DD sequence. In our paper, instead of carefully adjusting DD sequences for better error mitigation at the cost of extra circuit overhead, we exhaustively evaluate DD sequences on widely used quantum applications, to give a high-level idea of whether general DD techniques can really be beneficial for these applications. It is also the first attempt to implement KDD on IBM devices and check its performance. The experiments are performed on various IBM devices with different qubit numbers and quantum volumes to provide general insights about application-oriented DD noise mitigation on IBM quantum devices.

B. Pulse-efficient technique
The CNOT gate is the only two-qubit operation included in the basis gates of IBM quantum devices, and its calibrations are provided through the IBM quantum dashboard. On the device, it is implemented by an $R_{ZX}(\pi/2)$ gate along with some single-qubit gates [38]. The $R_{ZX}$ gate is realized by the echoed cross-resonance gate [39], which is specific to IBM fixed-frequency superconducting transmon qubit devices and is used for its low overhead and high fidelity. When executing a circuit on IBM quantum hardware, every multi-qubit operation needs to be transpiled to the CNOT basis, which is inflexible and less efficient.
Therefore, [30,31] proposed a pulse-efficient circuit transpilation framework, which has been shown to achieve higher fidelity than CNOT-basis transpilation for certain benchmarks. Instead of only using $R_{ZX}(\theta)$ with a fixed angle $\theta = \pi/2$ for the CNOT gate, a flexible echoed $R_{ZX}$ gate is implemented with an arbitrary angle as $R_{ZX}(\theta) = X R_{ZX}(-\theta/2) X R_{ZX}(\theta/2)$, which enables transpilation to the $R_{ZX}$ basis. This method does not require any additional pulse calibration, since the calibration of the $R_{ZX}$ gate can be easily calculated from the CNOT calibrations.
FIG. 1: [...] (c) The transpilation result with the basis gate set $\{R_Z, SX, X, R_{ZX}\}$, with a duration of 1025 dt. We replace $R_Z(m\pi/2)\sqrt{X}R_Z(n\pi/2)$ by $U$ to simplify the circuit; $m, n \in \mathbb{R}$ can vary across different transpilation passes.
An example of transpiling an $R_{ZZ}$ gate to two different basis gate sets is shown in Fig. 1. We first transpile the $R_{ZZ}$ gate to the IBM CNOT-basis gate set (Fig. 1(b)) and then transpile the original circuit again to the $R_{ZX}$-basis gate set (Fig. 1(c)). The duration of the second circuit is 51.1% shorter than that of the first. This technique exposes the echo of the cross-resonance gates, which allows at most one single-qubit gate between each pair of non-echoed $R_{ZX}$ gates and thus shortens the total circuit duration.
Quantum algorithms that require many two-qubit rotation gates, such as $R_{ZZ}$ or $R_{YY}$, which can be compiled directly to $R_{ZX}$ gates along with some single-qubit gates, benefit particularly from this pulse-efficient transpilation framework. Examples include QAOA [40] and quantum Hamiltonian simulation [41].
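The gate identities behind this section can be checked numerically. The sketch below is a minimal NumPy illustration rather than the transpiler pass of [30,31]: it verifies the echoed construction of the ZX rotation, one way of obtaining a ZZ rotation from a ZX rotation with Hadamards on the second qubit, and one valid choice of single-qubit corrections that turns a ZX rotation by π/2 into a CNOT up to a global phase. The qubit ordering (first tensor factor is the control) and the specific single-qubit corrections are assumptions made for illustration.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)
CNOT = np.array(
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex
)


def rot(theta, pauli):
    """exp(-i * theta / 2 * pauli) for any operator whose square is identity."""
    dim = pauli.shape[0]
    return np.cos(theta / 2) * np.eye(dim) - 1j * np.sin(theta / 2) * pauli


def rzx(theta):
    return rot(theta, np.kron(Z, X))


def rzz(theta):
    return rot(theta, np.kron(Z, Z))


def echoed_rzx(theta):
    # X acts on the first qubit; the rightmost factor is applied first.
    x_first = np.kron(X, I2)
    return x_first @ rzx(-theta / 2) @ x_first @ rzx(theta / 2)


def rzz_from_rzx(theta):
    # Hadamards on the second qubit map Z(x)X to Z(x)Z.
    h_second = np.kron(I2, H)
    return h_second @ rzx(theta) @ h_second


def cnot_from_rzx():
    # One possible choice of single-qubit corrections (an assumption).
    local = np.kron(rot(-np.pi / 2, Z), rot(-np.pi / 2, X))
    return local @ rzx(np.pi / 2)


def equal_up_to_phase(a, b, tol=1e-9):
    # Unitaries agree up to a global phase iff |Tr(a^dagger b)| equals dim.
    return abs(abs(np.trace(a.conj().T @ b)) / a.shape[0] - 1) < tol
```

Because the ZZ rotation needs only one ZX rotation of the same angle here, a small angle translates into a short cross-resonance pulse, which is the source of the duration savings described above.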
FIG. 2: Examples of different quantum circuits. Note that, the Graph State circuit is constructed linearly based on the target hardware topology.

III. METHODS
We construct DD sequences according to Table I, where n is the number of repetitions of the basic DD pulse. For Hahn echo, we insert one X or Y gate during the idle time. We set n = 2 for CP/CPMG, with CP (resp. CPMG) implemented as the sequence X − X (resp. Y − Y). For XY4, XY8, and XY16, n is set to 1, so that the sequence of pulses is inserted once. All the aforementioned DD sequences are equidistant and symmetrical: there is a delay of τ/2 at the beginning and at the end of the idle time, and a delay of τ between consecutive DD pulses. UDD, in contrast, is composed of non-equidistant pulses and acts differently from CPMG for all n > 2. As there is no clear conclusion about which repetition number n gives the best performance of UDD on IBM quantum devices, we set n = 8, as demonstrated in the Qiskit tutorial. The repeated gate is set to either X or Y, and the two variants are marked as UDD X and UDD Y in the results. For KDD, $(\theta)_\phi$ represents a rotation of θ around the axis defined by φ. Such a rotation can be built from three gates, $(\theta)_\phi = R_Z(\phi) R_X(\theta) R_Z(-\phi)$, so for the 20 pulses of KDD we insert 60 gates in total. A delay is only introduced between consecutive $(\theta)_\phi$, which means there is no space between the three gates constructing $(\theta)_\phi$. Note that, since the $R_Z$ gate is free on IBM quantum devices (it is implemented virtually as a frame change), KDD is translated to "only" 20 pulses in the end.
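The pulse placement described in this paragraph can be sketched as follows. The equidistant rule is the one stated above (τ/2 at both ends, τ between pulses). The UDD rule uses the standard Uhrig timing t_j = T sin²(jπ/(2n+2)), which is not written out in this paper and is therefore an assumption here, as is the simplification that pulses are instantaneous; the function names are illustrative.

```python
import math


def equidistant_times(total, n):
    """Pulse centers with tau/2 at both ends and tau between pulses."""
    tau = total / n
    return [tau * (j - 0.5) for j in range(1, n + 1)]


def udd_times(total, n):
    """Uhrig timings t_j = T * sin^2(j * pi / (2n + 2)) for j = 1..n."""
    return [
        total * math.sin(j * math.pi / (2 * n + 2)) ** 2
        for j in range(1, n + 1)
    ]


def same_times(a, b, tol=1e-12):
    return len(a) == len(b) and all(abs(x - y) < tol for x, y in zip(a, b))
```

With these definitions the two placements coincide for n = 2 and differ for n = 8, which matches the statement that UDD departs from CPMG only for n > 2.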
First, we evaluate the DD effects on IBM quantum computers by applying different DD sequences to various well-known quantum applications, including the Bernstein-Vazirani (BV) algorithm, the Hidden Shift (HS) algorithm, the Quantum Fourier Transform (QFT), and Graph State (GS) preparation. The basic circuit structures of these quantum applications are shown in Fig. 2. The BV algorithm implements an oracle function f(x), which represents the dot product between x and a secret string s, with the objective of finding s. The HS algorithm constructs an oracle that encodes two functions f and g for which there exists a secret string s such that g(x) = f(x + s). The goal of the algorithm is to find s. The QFT is the quantum version of the discrete Fourier transform and is an essential part of many other quantum algorithms, such as Shor's algorithm [42] and the quantum phase estimation algorithm [43]. A Graph State is a quantum state prepared based on a graph. Specifically, we can build a Graph State according to a given hardware topology, where the graph has an edge whenever two qubits are coupled, and each edge is represented by a CZ gate. It can entangle all the qubits of the device and is important for error correction.
Second, we evaluate the performance of combining different DD sequences with the pulse-efficient transpilation technique. We apply them to QAOA to solve the Max-Cut problem, since the QAOA ansatz is composed of $R_{ZZ}$ gates along with some single-qubit gates and can therefore benefit from the pulse-efficient method. As benchmarks, we randomly generate 3-regular graphs using ReCirq [44] and random graphs using NetworkX, with different numbers of nodes. For example, we use 3-reg-4 to denote a 3-regular graph with 4 nodes (qubits) and rand-4-0.5 to denote a random graph with 4 nodes (qubits) and an edge-creation probability of 0.5 (see Fig. 3).
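To make the Max-Cut benchmark concrete, here is a small pure-Python sketch for a 4-node instance. It assumes an Ising-type cost (the sum of z_i z_j over the edges, with z = ±1), which is one common formulation and may differ from the exact cost used in the experiments; it also assumes that character i of a bitstring belongs to node i. The edge list is hypothetical, although any 3-regular graph on 4 nodes is necessarily the complete graph K4.

```python
from itertools import product

# Hypothetical "3-reg-4" instance: the complete graph on 4 nodes.
EDGES_3_REG_4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]


def cut_size(bits, edges):
    """Number of edges whose endpoints fall on different sides of the cut."""
    return sum(bits[i] != bits[j] for i, j in edges)


def ising_cost(bits, edges):
    """Sum of z_i * z_j over the edges, with z = +1 for '0' and -1 for '1'."""
    z = [1 if b == "0" else -1 for b in bits]
    return sum(z[i] * z[j] for i, j in edges)


def min_cost(n, edges):
    """Brute-force minimum of the cost over all 2^n bitstrings."""
    return min(ising_cost("".join(b), edges) for b in product("01", repeat=n))


def expected_cost(counts, edges):
    """Average cost over measured bitstrings, given a dict bitstring -> shots."""
    shots = sum(counts.values())
    return sum(c * ising_cost(b, edges) for b, c in counts.items()) / shots
```

Under this formulation the cost equals the number of edges minus twice the cut size, so minimizing the cost maximizes the cut, and the measured expectation can be positive even though the optimum is negative.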
We choose the following metrics for different benchmarks to demonstrate the impact of DD and pulse-level optimization technique on application fidelity.
• Probability of Successful Trial (PST) [45]. This metric is defined as the ratio of the number of trials that give the expected result to the total number of trials; higher is better. It is suited to benchmarks with a single correct result, such as the BV and HS algorithms.
• Jensen-Shannon Divergence (JSD). It measures the similarity between two probability distributions; lower is better. It is suitable for the Graph State and QFT circuits, whose output is a distribution.
• Approximation ratio [46]. It is specifically designed to evaluate the performance of QAOA circuits and is defined as $C/C_{\min}$, where $C$ is the expectation value obtained by the quantum computer and $C_{\min}$ is calculated classically by the NumPyMinimumEigensolver. We aim at maximizing the approximation ratio, and a value of 1 means a perfect solution.
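The three metrics above can be sketched in a few lines of standard-library Python. The logarithm base for the JSD is not specified in the text; base 2 is assumed here so that the value lies between 0 and 1, and whether the experiments use the divergence or its square root (the Jensen-Shannon distance) is likewise an assumption. Function names and input formats are illustrative.

```python
import math


def pst(counts, expected):
    """Fraction of shots that returned the expected bitstring."""
    shots = sum(counts.values())
    return counts.get(expected, 0) / shots


def jsd(p, q):
    """Base-2 Jensen-Shannon divergence of two dicts outcome -> probability."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the union of outcomes.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def approximation_ratio(c, c_min):
    """Measured expectation value divided by the classical minimum."""
    return c / c_min
```

For example, 6144 correct outcomes out of 8192 shots give a PST of 0.75, identical distributions give a JSD of 0, and distributions with disjoint support give a JSD of 1.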

IV. EXPERIMENTAL RESULTS
We perform the experiments on various IBM quantum devices with different qubit numbers and quantum volumes (QV) (see Table II). The size of the benchmarks varies depending on the quantum device. In the first experiment, which evaluates the effects of applying DD sequences to the BV, HS, QFT, and GS applications, the circuit size varies from 3 to 6 qubits on IBM Q 7 Jakarta. On the other devices, the circuit size ranges from 3 to 12 qubits, since too much noise accumulates to obtain meaningful results for benchmarks with more than 12 qubits. In the second experiment, which explores the performance of combining DD and the pulse-efficient optimization technique on QAOA, the number of nodes of the two types of graphs (3-regular and random) ranges from 4 to 6 on IBM Q 7 Jakarta and from 4 to 12 on the other devices. The QAOA ansatz has one layer, with parameters initialized randomly and optimized on the simulator using the COBYLA optimizer. We only execute the final ansatz with optimized parameters on the quantum hardware. All the benchmarks are compiled using Qiskit with the highest optimization level and executed 8192 times.
There are some limitations when inserting certain DD pulses into the idle time. If a DD sequence contains a large number of pulses, as XY16 and KDD do, it might not be possible to insert it into small benchmarks whose idle time is not long enough. Moreover, Hahn echo can only be applied if the inverse of the inserted single X or Y gate can be absorbed into the neighboring gates, so that the quantum state remains equivalent. Therefore, for each experiment we check whether the DD pulses are actually inserted into the benchmark, and we remove circuits without any inserted DD pulses from the results.
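The fit check described here can be sketched as a small helper that, for an equidistant sequence, either returns the delays to place around the pulses or reports that the idle window is too short. The pulse duration of 160 dt used in the example values is a hypothetical figure chosen for illustration, and real hardware additionally requires delays to respect the backend's timing alignment, which this sketch ignores.

```python
def dd_delays(idle, n_pulses, pulse_duration):
    """Delays [tau/2, tau, ..., tau, tau/2] around n pulses, or None if no fit.

    All arguments share one time unit (for example dt).
    """
    free = idle - n_pulses * pulse_duration
    if n_pulses < 1 or free < 0:
        return None  # the pulses alone are longer than the idle window
    tau = free / n_pulses
    return [tau / 2] + [tau] * (n_pulses - 1) + [tau / 2]
```

With an idle window equal to one single-qubit gate duration, as reported below for the HS circuits, even a two-pulse sequence returns `None`.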
To show the impact of the various DD sequences and the pulse-efficient optimization method clearly, we use relative results for the three metrics. The original benchmark with no optimization method applied serves as the baseline. For PST, the relative result is the PST with DD divided by the baseline PST; for JSD, where lower is better, it is the baseline JSD divided by the JSD with DD. In both cases, a relative result larger than one indicates an improvement, and larger is better. For the QAOA experiments, because the approximation ratio can be negative, we instead use the difference between the approximation ratio with a DD sequence or the pulse-efficient technique and the baseline approximation ratio as the relative result. A difference larger than zero indicates an improvement, and again larger is better. Each experiment has been repeated three times, and results with similar trends were obtained. All reported results are the average of the three runs.
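The relative results defined in this paragraph amount to the following helpers. This is a minimal sketch with illustrative function names; the averaging over three runs is shown as a plain arithmetic mean, which is an assumption about how the average was taken.

```python
def relative_pst(pst_dd, pst_base):
    """PST with DD divided by the baseline PST (> 1 means improvement)."""
    return pst_dd / pst_base


def relative_jsd(jsd_dd, jsd_base):
    """Baseline JSD divided by the JSD with DD (> 1 means improvement)."""
    return jsd_base / jsd_dd


def relative_approx_ratio(ratio_opt, ratio_base):
    """Difference of approximation ratios (> 0 means improvement)."""
    return ratio_opt - ratio_base


def mean_of_runs(values):
    """Arithmetic mean over repeated runs of the same experiment."""
    return sum(values) / len(values)
```

Inverting the ratio for JSD keeps the reading uniform across metrics: a lower JSD with DD yields a relative result above one, just as a higher PST does.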
We first demonstrate applying DD sequences to various quantum applications. Hahn echo is not applicable to all the benchmarks. XY16 and KDD are too long to insert into certain small BV circuits (fewer than 4 qubits). None of the DD sequences can be applied to the HS algorithm, since the duration of the idle time is always equal to the duration of one single-qubit gate and is thus too short to fit any DD sequence. The relative PST results for BV circuits and the relative JSD results for QFT and Graph State circuits are shown in Fig. 4, Fig. 5, and Fig. 6, respectively.
The quantitative analysis of the relative results is shown in Table III. For the BV algorithm (see Fig. 4), if the circuit involves a large number of qubits, the PST can drop so dramatically that all the PSTs are below 0.1, even with DD applied for error mitigation. Therefore, we only show the results whose PSTs are larger than 0.1. The relative PST results demonstrate that all the DD sequences are able to enhance the BV circuit fidelity, especially UDD X. On average, the fidelity is improved by 1.09x, 3.82x, 2.79x, and 1.68x compared with the baseline for IBM Q Jakarta, Guadalupe, Toronto, and Montreal, respectively. For QFT, in contrast, inserting DD sequences is more favorable for circuits with more than 5 qubits. Therefore, the relative JSD results on IBM Q 7 Jakarta are not encouraging. Overall, DD can be beneficial for QFT circuits, but its effect is not as stable as for BV circuits. The fidelity is improved by 1.14x, 1.18x, and 1.28x on average for IBM Q Guadalupe, Toronto, and Montreal. For the Graph State circuit, the performance of DD is also not always steady and differs across IBM quantum devices. DD is helpful for Graph State circuits on IBM Q Guadalupe, with an average fidelity increase of 1.17x, but not on the other devices. The variance of the results across DD techniques becomes smaller for relatively large-scale QFT and Graph State circuits, which might be due to accumulating errors.
To conclude, DD always improves the fidelity of the BV algorithm, to different extents across IBM quantum chips. For other applications, however, the performance of DD remains uncertain and can differ across quantum devices. Moreover, for certain benchmarks such as the HS algorithm, the idle time is always so short that DD is not applicable regardless of the circuit size.
Second, we show the results of using QAOA to solve the Max-Cut problem for different graphs by employing (1) diverse DD sequences (see Fig. 7) and (2) DD sequences combined with the pulse-efficient optimization technique (see Fig. 8). The quantitative analysis of the relative results is shown in Table IV. Most of the DD sequences help enhance the approximation ratio for QAOA, especially the CP sequence. The approximation ratio over all the graphs is improved by 6.8%, 5.9%, 3.7%, and 3.7% on average for IBM Q Jakarta, Guadalupe, Toronto, and Montreal, respectively. There is no clear relationship between the level of improvement and the size of the benchmark. Compared with the baseline circuit without any error mitigation technique, the pulse-efficient method alone is already able to raise the approximation ratio by 8.9%, 5.5%, 5.3%, and 4.8% on average for the four devices. Combining the pulse-efficient technique with DD sequences can further improve the approximation ratio, and the negative impact induced by DD can mostly be canceled. The combination of DD and the pulse-efficient technique improves the approximation ratio by 26.5%, 49.2%, 81.1%, and 75.7% compared with only applying DD for the four IBM devices.
Interestingly, even though KDD is designed to be robust against pulse imperfections and has shown improvements on a Rigetti device, its performance on IBM quantum devices is not as good as expected. For all the quantum applications tested, including the QAOA circuits, KDD yields no gain in circuit fidelity compared with the other DD strategies, and it actually decreases the circuit fidelity in most cases.

A. Design guideline
It is important to explore different noise mitigation and pulse-level optimization methods, including DD and pulse-efficient optimization technique to better build the quantum circuits. Here, we list several guidelines that can help the community utilize these methods to design circuits with higher fidelity.
• DD cannot always improve the circuit fidelity, and its effect is highly application-dependent. It is recommended to use DD on specific quantum algorithms, including the Bernstein-Vazirani algorithm and QAOA.
• When applying DD sequences, it is recommended to check the transpiled circuit to verify if DD is suitable to insert.
• The robust sequence KDD does not work well on IBM quantum devices for most of the quantum algorithms tested.
• The pulse-efficient optimization technique is favorable for QAOA circuits and, combined with DD, can further improve the approximation ratio.

B. Future work
As the experiments have demonstrated some discouraging effects when inserting DD sequences for certain quantum applications, exploring the hardware physics behind them becomes important. Application-dependent adaptive dynamical decoupling [15] and the variational-algorithm-focused VAQEM [37] have been proposed to provide the most beneficial DD sequences, but with a large overhead and a limited analysis of different DD strategies. A smarter approach for designing appropriate application-oriented DD sequences that takes hardware physics into account is left to future work. Recently, DD has been shown to suppress ZZ crosstalk on fixed-frequency superconducting transmon devices [47]. It would be interesting to further investigate the performance of DD-based crosstalk suppression at the application level, so that it can contribute to quantum parallel execution [48,49]. Finally, the pulse-level optimization experiments can be extended to other quantum applications, such as VQE [50] and quantum simulation [51], to provide more guidelines.

VI. CONCLUSION
Today's quantum hardware is prone to noise in the NISQ era. Therefore, circuit optimization and error mitigation approaches are required to increase the output fidelity. In this paper, we focus on two pulse-level circuit optimization methods: dynamical decoupling and the pulse-efficient optimization technique. First, we implement various DD strategies on IBM quantum devices, including non-universal, universal, and robust ones. Second, we apply these DD sequences to several well-known quantum applications, such as QFT and the Bernstein-Vazirani algorithm, to evaluate comprehensively the impact of diverse DD techniques on IBM quantum devices with different characteristics. We also merge DD with the pulse-efficient transpilation method and investigate the combination on QAOA circuits for the Max-Cut problem. Based on the experimental results, we found that DD techniques always have a positive impact on some benchmarks, while on others DD demonstrates some discouraging effects.
As there is no overhead-free application-oriented DD approach, we provide a list of design guidelines for users to better understand these pulse-level optimization methods and figure out how to improve the circuit design for various quantum applications.

Table caption fragment: If the relative result is > 0, it means an improvement in fidelity. Otherwise, a negative impact on fidelity is introduced.