Mutation Testing of Quantum Programs: A Case Study With Qiskit

As quantum computing is still in its infancy, there is an inherent lack of knowledge and technology to test a quantum program properly. In the classical realm, mutation testing has been successfully used to evaluate how well a program's test suite detects seeded faults (i.e., mutants). In this article, building on the definition of syntactically equivalent quantum operations, we propose a novel set of mutation operators to generate mutants based on qubit measurements and quantum gates. To ease the adoption of quantum mutation testing, we further propose QMutPy, an extension of the well-known and fully automated open-source mutation tool MutPy. To evaluate QMutPy's performance, we conducted a case study on 24 real quantum programs written in IBM's Qiskit library. Furthermore, we show how better test suite coverage and improvements to test assertions can increase the test suites' mutation score and quality. QMutPy has proven to be an effective quantum mutation tool, providing insight into the current state of quantum tests and how to improve them.


I. INTRODUCTION
Quantum computation uses the quantum bit (qubit), the quantum mechanical analog of the classical bit, as its fundamental unit. Whereas classical bits can take on only one of two basic states (e.g., 0 or 1), qubits can take on superpositions of those basic states (e.g., α|0⟩ + β|1⟩), where α and β are complex scalars such that |α|² + |β|² = 1, allowing a number of qubits to theoretically hold exponentially more information than the same number of classical bits. Thus, quantum computers can, in theory, quickly solve problems that would be extremely difficult for classical computers. Such computation is possible because of qubit properties such as the superposition of both 0 and 1, the entanglement of multiple qubits, and interference [1], [2].
The field of quantum computing is evolving at a pace faster than originally anticipated [3]. For example, in March 2020, Honeywell announced a revolutionary quantum computer based on trapped-ion technology with quantum volume 64, the highest quantum volume ever achieved and twice the state of the art previously accomplished by IBM (announcement: https://www.honeywell.com/us/en/press/2020/03/honeywell-achievesbreakthrough-that-will-enable-the-worlds-most-powerful-quantum-computer). Quantum volume is a unit of measure indicating the fidelity of a quantum system. This important achievement shows that the field of quantum computing may reach industrial impact much sooner than initially anticipated.
While the fast-approaching universal access to quantum computers is bound to break several computation limitations that have lasted for decades, it is also bound to pose major challenges for many, if not all, computer science disciplines [4], e.g., software testing. Testing is one of the most used techniques in software development to ensure software quality [5], [6]. It refers to the execution of software in in vitro environments that replicate (as closely as possible) real scenarios to ascertain its correct behavior. Although testing has been extensively investigated in the classical computing realm, and several approaches and tools have been proposed [7]-[12], such approaches for quantum programs (QPs) are still in their infancy [13]-[15]. It is worth noting that 1) QPs are much harder to develop than classical programs, and therefore, programmers, mostly familiar with the classical world, are more likely to make mistakes in the counterintuitive quantum programming one [16] and 2) QPs are necessarily probabilistic and impossible to examine without disrupting execution or without compromising their results [17]. Thus, ensuring a correct implementation of a QP is even harder in the quantum computing realm [18].
IEEE Transactions on Quantum Engineering. Fortunato et al.: MUTATION TESTING OF QUANTUM PROGRAMS: A CASE STUDY WITH QISKIT
Mutation testing [19], [20] has been shown to be an effective technique in improving testing practices, hence helping to guarantee program correctness. Big tech companies, such as Google and Facebook, have conducted several studies [21]-[23] advocating for mutation testing and its benefits. The general principle underlying mutation testing is that the bugs considered to generate buggy program versions represent realistic mistakes that programmers often make. Such bugs are deliberately seeded into the original program by simple syntactic changes to create a set of buggy programs called mutants, each containing a different syntactic change.
To assess the effectiveness of a test suite at detecting mutants, these mutants are executed against the input test suite. If the result of running a mutant is different from the result of running the original program for at least one test case in the input test suite, the seeded bug denoted by the mutant is considered detected or killed.
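The notion of a killed mutant described above can be illustrated with a minimal sketch. The function names and the toy test suite below are hypothetical and only serve to show the comparison between the original program and a mutant; they are not taken from any of the studied programs.

```python
# Toy illustration of "killing" a mutant; all names here are hypothetical.

def original_max(a, b):
    return a if a > b else b


def mutant_max(a, b):
    # Seeded fault: the relational operator ">" was replaced by "<".
    return a if a < b else b


# A test suite as a list of (arguments, expected result) pairs.
TEST_SUITE = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4)]


def run_suite(func, suite):
    """Return one boolean per test case: True if the test passes."""
    return [func(*args) == expected for args, expected in suite]


def is_killed(original, mutant, suite):
    """A mutant is killed if at least one test outcome differs from the original's."""
    return run_suite(original, suite) != run_suite(mutant, suite)
```

With the full suite the mutant is killed, whereas a suite containing only the `(4, 4)` case would let it survive, which is exactly the kind of weakness mutation testing is meant to expose.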
Just et al. [24] performed a study on whether mutants are a valid substitute for real bugs in classical software testing, and they concluded that 1) test suites that kill more mutants have a higher real bug detection rate and 2) mutation score is a better predictor of test suites' real bug detection rate than code coverage. We have no reason to believe that it would be any different in quantum computing. Thus, and to shed light on whether manually written test suites for QPs are effective at detecting mistakes that programmers might often make, in this article, we aim to investigate the application of mutation testing on real QPs.
In this article, we focus our investigation on the most popular open-source full-stack library for quantum computing [25], IBM's Quantum Information Software Kit (Qiskit) [26]. Qiskit is one of the first software development kits for quantum computing to be released publicly and provides tools to develop and run QPs on either prototype quantum devices on IBM Quantum Experience infrastructure or simulators on a local computer. In a nutshell, Qiskit translates QPs written in Python into a lower level language called OpenQASM [27], which is its quantum instruction language. Many famous quantum algorithms such as Shor [28] and Grover [29] have already been implemented using Qiskit's API.
Building on our previous work [30], the main contributions of this article are the following:
1) a set of five novel mutation operators, leveraging the notion of syntactically equivalent gates, tailored for QPs;
2) a novel Python-based toolset named QMutPy that automatically performs mutation testing for QPs written with Qiskit's [26] full-stack library;
3) an empirical evaluation of QMutPy's effectiveness and efficiency on 24 real QPs;
4) a detailed discussion on extending test suites for QPs to kill more mutants and, therefore, detect more bugs.
To the best of our knowledge, the study described and evaluated in this article is the first comprehensive mutation testing study on real QPs. Our results suggest that QMutPy can generate fault-revealing quantum mutants, and they surfaced several issues in the test suites of the real QPs used in the experiments. We discuss two improvements to test suites, viz. increasing code coverage and improving the quality of the test assertions. Such improvements significantly increase the mutation score of the test suites, hence leading to QPs of higher quality.
The rest of this article is organized as follows. We present currently available open-source mutation tools and detail the extension done for QMutPy in Section II. We detail how our experiment was conducted and how subjects were selected in Section III. We present our results in Section IV. We discuss and carry out improvements to current test suites, and report how the suites were impacted, in Section V. In Section VI, we mention published works about mutation tools and current quantum testing tools. Finally, Section VII concludes this article.

II. MUTATION TESTING OF QUANTUM PROGRAMS
In this section, we explain our mutation strategy, including the five novel mutation operators tailored for QPs, and the implementation details of QMutPy, our proposed Python-based toolset to automatically perform mutation testing for QPs written in Qiskit [26].

A. QUANTUM MUTATION OPERATORS
Similar to classical programs, a QP is fundamentally a circuit in which qubits are initialized and go through a series of operations that change their state. These operations are commonly known as quantum gates. Two of the most used quantum gates are the not gate and the Hadamard gate, usually referred to as the x gate and the h gate, respectively. They are single-qubit operations, i.e., they change the state of one qubit [31]. The x gate is analogous to the classical NOT gate; it simply inverts the current qubit state. The h gate is quantum specific; it puts the qubit in a perfect state of superposition (i.e., equal probability of being 1 or 0 when measured). At the time of writing this article, Qiskit v0.29.0 provides support for more than 50 quantum gates. This includes single-qubit gates (e.g., h gate), multiple-qubit gates (e.g., cx gate), and composed gates, also known as circuits (e.g., QFT circuit). Given their importance to the execution and result of a QP (a simple typo in the name of a gate could cause bugs that developers may not be aware of), our set of mutation operators to generate faulty versions of QPs is based on single- and multiqubit quantum gates, in particular, syntactically equivalent gates. We argue that our quantum mutants match real-world bugs as: 1) Liu et al. [32] described quantum mutation to be helpful to assess the correct behavior of QPs and 2) three out of the eight common bug patterns in Qiskit programs described by Zhao et al. [33] are related to quantum gates, as are the majority of our mutation operators. Nevertheless, and as part of our future work, we will investigate and develop novel quantum mutation operators based on conceptual mistakes a developer might make when developing QPs.
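To make the behavior of the x and h gates concrete, the following sketch simulates them on a single qubit using plain Python and the standard 2x2 matrices of both gates. It deliberately does not use Qiskit; the helper names (`apply_gate`, `probabilities`) are ours and purely illustrative.

```python
import math

# A single-qubit state alpha|0> + beta|1> stored as the pair (alpha, beta).
ZERO = (1 + 0j, 0 + 0j)

# Standard matrices of the x (NOT) and h (Hadamard) gates.
X = ((0, 1), (1, 0))
_S = 1 / math.sqrt(2)
H = ((_S, _S), (_S, -_S))


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix by a two-amplitude state vector."""
    alpha, beta = state
    return (gate[0][0] * alpha + gate[0][1] * beta,
            gate[1][0] * alpha + gate[1][1] * beta)


def probabilities(state):
    """Probability of measuring 0 and of measuring 1."""
    return tuple(abs(amplitude) ** 2 for amplitude in state)


flipped = apply_gate(X, ZERO)        # x inverts |0> into |1>
superposed = apply_gate(H, ZERO)     # h yields an equal superposition
```

Here `probabilities(flipped)` puts all the weight on outcome 1, while `probabilities(superposed)` splits it evenly between 0 and 1 (up to floating-point rounding). This also shows why swapping one gate for the other, although syntactically trivial, changes the measured outcome.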
Formally, a gate g is considered syntactically equivalent to gate j if and only if the number and the type of arguments required by both g and j are the same (optional arguments are not taken into consideration). At the time when we performed our experiment, we had identified 40 gates that had syntactical equivalents. Fig. 1 lists all gates and their syntactically equivalent ones. For instance, the h gate has ten syntactically equivalent gates: i, id, s, sdg, sx, t, tdg, x, y, and z. Note that these gates do not perform or compute the same operation; they are simply used in the same manner and require the same number and type of arguments.
The following subsections briefly describe the five quantum mutation operators proposed in this article. Our examples are based on the implementation of [...]
LISTING 1. Example of the QGR operator.
LISTING 2. Example of the QGD operator. In Python, a pass statement is a no-op: nothing happens when it is executed. It is useful as a placeholder when a statement is required syntactically, but no code needs to be executed [34].
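The definition of syntactic equivalence given above can be sketched as a lookup over gate signatures. In the sketch below, only the single-qubit group (h and its ten equivalents) comes from the text; the encoding of signatures as tuples and the cx, cz, and rx entries are illustrative assumptions added for contrast.

```python
# Required-argument signature of each gate (hypothetical encoding).
SINGLE_QUBIT = ("qubit",)

GATE_SIGNATURES = {
    # Group taken from the text: h and its ten syntactic equivalents.
    **{name: SINGLE_QUBIT
       for name in ("h", "i", "id", "s", "sdg", "sx", "t", "tdg", "x", "y", "z")},
    # Assumed entries, included only to show gates with other signatures.
    "cx": ("qubit", "qubit"),
    "cz": ("qubit", "qubit"),
    "rx": ("angle", "qubit"),
}


def syntactic_equivalents(gate, signatures=GATE_SIGNATURES):
    """Gates requiring the same number and type of arguments as `gate`."""
    target = signatures[gate]
    return sorted(other for other, signature in signatures.items()
                  if other != gate and signature == target)
```

For example, `syntactic_equivalents("h")` returns the ten single-qubit gates listed above, regardless of the fact that none of them computes the same operation as h.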

1) QUANTUM GATE REPLACEMENT (QGR)
This mutation operator first identifies each call to a quantum gate function (e.g., circuit.x()) and then replaces it with each of its syntactically equivalent gates (e.g., circuit.h()), one at a time. For instance, for the x quantum gate, ten mutants are generated as there are ten syntactically equivalent gates (see Fig. 1). Listing 1 exemplifies the QGR operator.
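As a rough illustration of the idea behind QGR, and not of QMutPy's actual implementation, the sketch below uses Python's standard `ast` module to rename one gate call at a time. The function name `qgr_mutants` and the shortened equivalence table are hypothetical; `ast.unparse` requires Python 3.9 or later.

```python
import ast
import copy


def qgr_mutants(source, equivalents):
    """Yield one mutated source string per (gate call, equivalent gate) pair."""
    tree = ast.parse(source)
    for index, node in enumerate(ast.walk(tree)):
        is_gate_call = (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)
                        and node.func.attr in equivalents)
        if not is_gate_call:
            continue
        for replacement in equivalents[node.func.attr]:
            # Work on a copy so that each mutant carries exactly one change.
            mutant = copy.deepcopy(tree)
            # The copy has the same shape, so the same walk index is the same call.
            list(ast.walk(mutant))[index].func.attr = replacement
            yield ast.unparse(mutant)


# Illustrative subset of the equivalence table; the real one is given in Fig. 1.
EQUIVALENTS = {"x": ["h", "y", "z"]}
program = "circuit.x(0)\ncircuit.measure(0, 0)\n"
mutants = list(qgr_mutants(program, EQUIVALENTS))
```

Each element of `mutants` is a complete program in which `circuit.x(0)` has been replaced by exactly one equivalent gate and the rest of the code is untouched.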

2) QUANTUM GATE DELETION (QGD)
Adding and removing quantum gates from a QP can significantly impact its output. The Quantum Gate Deletion (QGD) operator deletes an invocation of a quantum gate. Listing 2 exemplifies the QGD operator.
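A deletion in the spirit of QGD can be sketched with `ast.NodeTransformer`, replacing the n-th gate call statement with `pass` so that the enclosing block stays syntactically valid. The class name, the gate subset, and the indexing scheme are assumptions made for this example only (`ast.unparse` requires Python 3.9 or later).

```python
import ast

GATES = {"h", "x", "y", "z", "cx"}  # illustrative subset of gate names


class DeleteNthGate(ast.NodeTransformer):
    """Replace the n-th gate call statement (0-based) with a pass statement."""

    def __init__(self, n):
        self.n = n
        self.seen = -1

    def visit_Expr(self, node):
        call = node.value
        if (isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and call.func.attr in GATES):
            self.seen += 1
            if self.seen == self.n:
                return ast.copy_location(ast.Pass(), node)
        return node


def qgd_mutant(source, n):
    tree = DeleteNthGate(n).visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))
```

For instance, `qgd_mutant("circuit.h(0)\ncircuit.cx(0, 1)\n", 1)` keeps the h gate and turns the cx call into `pass`.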

3) QUANTUM GATE INSERTION (QGI)
This quantum mutation operator performs the opposite action of the QGD operator. Instead of deleting a call to a quantum gate, it inserts a call to a syntactically equivalent gate. For each quantum gate in the source code, this mutation operator creates one mutant per syntactically equivalent gate, plus one mutant that inserts the gate itself, which also counts as a valid mutant. For example, for the x gate, which has ten syntactically equivalent gates, it creates 11 mutants. Listing 3 shows an example of the Quantum Gate Insertion (QGI) operator.
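The following sketch illustrates gate insertion under two simplifying assumptions that are ours: only top-level statements are considered, and the new call is placed immediately after the original one with the same arguments. The function name is hypothetical (`ast.unparse` requires Python 3.9 or later).

```python
import ast
import copy


def qgi_mutants(source, equivalents):
    """Return one mutant per inserted gate (the gate itself plus its equivalents)."""
    tree = ast.parse(source)
    mutants = []
    for position, stmt in enumerate(tree.body):
        call = getattr(stmt, "value", None)
        if not (isinstance(stmt, ast.Expr)
                and isinstance(call, ast.Call)
                and isinstance(call.func, ast.Attribute)
                and call.func.attr in equivalents):
            continue
        gate = call.func.attr
        for inserted in [gate] + equivalents[gate]:
            mutant = copy.deepcopy(tree)
            extra = copy.deepcopy(stmt)          # same receiver and arguments
            extra.value.func.attr = inserted     # but a different gate name
            mutant.body.insert(position + 1, extra)
            mutants.append(ast.unparse(mutant))
    return mutants


X_EQUIVALENTS = {"x": ["h", "i", "id", "s", "sdg", "sx", "t", "tdg", "y", "z"]}
x_mutants = qgi_mutants("circuit.x(0)\n", X_EQUIVALENTS)
```

With the ten equivalents of the x gate, the sketch produces 11 mutants, matching the count given in the text.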

4) QUANTUM MEASUREMENT INSERTION (QMI)
In quantum computing, measuring a qubit breaks the state of superposition and the qubit's value becomes either 1 or 0 (as in classical computing), which can be considered a mutation by design. This operator adds a call to the measure function for each quantum gate call. Listing 4 shows an example of the Quantum Measurement Insertion (QMI) operator.

5) QUANTUM MEASUREMENT DELETION (QMD)
Contrary to QMI, the Quantum Measurement Deletion (QMD) operator removes each measurement from a QP, one at a time. Without a measure call, the QP keeps the superposition state and, as a consequence, the qubit does not collapse to either 1 or 0. Listing 5 shows an example of the QMD operator.
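Both measurement operators can be sketched in the same AST-based style. The sketch below makes several assumptions of ours: only top-level statements are handled, the arguments of the inserted measure call are a fixed placeholder (`measure_args`), and a deleted measurement is replaced by `pass`, as done for gate deletion. None of these details is confirmed by the text for the actual tool (`ast.unparse` requires Python 3.9 or later).

```python
import ast
import copy

GATES = {"h", "x", "cx"}  # illustrative subset of gate names


def _call_name(stmt):
    """Method name if stmt is a bare call such as circuit.h(0), else None."""
    call = getattr(stmt, "value", None)
    if (isinstance(stmt, ast.Expr) and isinstance(call, ast.Call)
            and isinstance(call.func, ast.Attribute)):
        return call.func.attr
    return None


def qmi_mutants(source, measure_args="0, 0"):
    """QMI sketch: one mutant per gate call, adding a measure call after it."""
    tree = ast.parse(source)
    mutants = []
    for position, stmt in enumerate(tree.body):
        if _call_name(stmt) in GATES:
            receiver = ast.unparse(stmt.value.func.value)
            measure = ast.parse(f"{receiver}.measure({measure_args})").body[0]
            mutant = copy.deepcopy(tree)
            mutant.body.insert(position + 1, measure)
            mutants.append(ast.unparse(mutant))
    return mutants


def qmd_mutants(source):
    """QMD sketch: one mutant per measure call, replacing it with pass."""
    tree = ast.parse(source)
    mutants = []
    for position, stmt in enumerate(tree.body):
        if _call_name(stmt) == "measure":
            mutant = copy.deepcopy(tree)
            mutant.body[position] = ast.Pass()
            mutants.append(ast.unparse(mutant))
    return mutants
```

Applied to the two-line program `circuit.h(0)` followed by `circuit.measure(0, 0)`, each function yields exactly one mutant, since the program contains one gate call and one measurement.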

B. QMUTPY TOOLSET
QPs written in Python and using the Qiskit library are a mix of classical operations (e.g., initialization of variables and loops) and quantum operations (e.g., initialization of quantum circuits and measuring qubits). Thus, we foresee that the most suitable mutation tool for QPs would be one that: 1) supports Python programs and the two widespread testing frameworks for Python, unittest and pytest; 2) supports various classical mutation operators (e.g., Assignment Operator Replacement and Conditional Operator Insertion); 3) supports the creation of a report that could be shown to a developer or easily parsed by an experimental infrastructure (as the one described in Section III); and 4) fosters wide adoption, i.e., has a low learning curve to install, configure, and use.
In this section, we first describe the most relevant mutation testing tools currently available and which requirements they fulfill. Then, we select a tool to build upon and describe its workflow and the features we added.

1) PYTHON-BASED MUTATION TESTING TOOLS
Mutatest [38], mutmut [36], MutPy [35], and Cosmic Ray [37] are the most popular mutation testing tools for Python that are available through pip (the package installer for Python). Table I reports the most relevant features of each mutation tool. Regarding testing frameworks, mutmut supports all test runners (because mutmut only needs an exit code from the test command). In the following subsections, we describe their advantages and disadvantages. Albeit being open-source, fully automated, and supporting classical mutation operators, not all tools fulfill all our requirements.
Mutatest [38] only supports pytest, whereas, e.g., the programs in the Qiskit-Aqua repository require unittest. Nor does it produce a report of a mutation testing session. Thus, any postmortem analysis (e.g., statistical analysis) could not be easily performed.
mutmut [36] does not allow one to instantiate the tool with a single mutation operator or a defined set of mutation operators. Thus, a developer that decides to use it would have to wait for all mutants to be analyzed. This can be severely time consuming as a program could have thousands of mutants, and more importantly, a developer would not be able to, e.g., only select quantum mutation operators. Thus, using mutmut would be unproductive.
MutPy [35] and Cosmic Ray [37] are similar in nature. Both provide a reporting system, support unittest and pytest, and allow one to select a subset of mutation operators. However, from our own experience installing and running the tools, MutPy's learning curve is more gradual than Cosmic Ray's.
The tool that best fulfills all the requirements we aimed for in a mutation tool is MutPy [35], which we extended and named QMutPy (details in Section II-B3).

2) MUTPY FLOW
MutPy's workflow is composed of four main steps. Given a Python program P, its test suite T, and a set of mutation operators M, MutPy: 1) loads P's source code and test suite; 2) executes T on the original (unmutated) source code; 3) applies M and generates all mutant versions of P; and 4) executes T on each mutant and provides a summary of the results either as a YAML or HTML report.
Since steps 1 and 2 are self-explanatory, we will focus on steps 3 and 4. In step 3, MutPy parses the code and, for each mutation operator (see footnote 12), checks if there are mutants to be generated. Mutants in MutPy are generated through the Python Abstract Syntax Tree (AST). When a possible mutation is found, the corresponding node from the AST is removed, and a mutated node is created and injected into the unmutated source code.
In step 4, MutPy executes T on the mutated version and produces a report. Each report includes information such as the number of mutants; whether each mutant was killed, survived, was incompetent (e.g., mutants that make the source code uncompilable), or timed out; the time it took to run T on P; and the time it took to run T on each mutant.

3) QMUTPY
QMutPy (see footnote 13) is built on top of the open-source Python mutation testing tool MutPy. Installing and using QMutPy is simple and straightforward. One only needs to clone QMutPy's repository and follow the installation and usage instructions available in the README file (see footnote 14).
We extended MutPy by implementing the quantum mutation operators described in Section II-A, which developers can freely use to perform mutation testing on their QPs written in Qiskit. In addition, to address the technical challenges of implementing the quantum operators, we added support to MutPy for mutating AST calls (see footnote 15), which is not possible in its original version. Interested readers can find more information on the technical challenges we faced to implement the quantum mutation operators described in Section II-A and how we addressed them in Fortunato et al.'s [39] recent work.

III. EMPIRICAL STUDY
We have conducted an empirical study to evaluate QMutPy's effectiveness and efficiency at performing mutation testing on QPs. In particular, in this study, we aim to answer the following research questions.
RQ1: How efficient is QMutPy at creating quantum mutants?
RQ2: How many quantum mutants are generated by QMutPy?
RQ3: How do test suites for QPs perform at killing quantum mutants?
RQ4: How many test cases are required to kill or timeout a quantum mutant?
RQ5: How are quantum mutants killed?
Footnote 12: MutPy supports 20 classical mutation operators and seven experimental mutation operators. If a user does not specify any mutation operator, MutPy applies all of them in alphabetical order.
Footnote 13: QMutPy is publicly available at https://github.com/danielfobooss/mutpy.
Footnote 14: https://github.com/danielfobooss/mutpy/blob/master/README.rst
Footnote 15: https://docs.python.org/3/library/ast.html#ast.Call
As a baseline, we have compared the results achieved by QMutPy's quantum mutation operators with MutPy's classical mutation operators. Note that existing works [40]-[43] on quantum mutation are very preliminary and no other classical or quantum mutation tool could have been used in our empirical study as a baseline (see Sections II-B1 and VI for more information).
We show our commitment to open science [44] by making QMutPy and our experimental infrastructure (data and scripts) available to the research community to assist in future research. The QMutPy tool is available at https://github.com/danielfobooss/mutpy and all data and scripts are available at https://github.com/jose/qmutpy-experiments.

A. EXPERIMENTAL SUBJECTS
To conduct our empirical study, we require: 1) real QPs written in Qiskit's framework [26] (as, currently, QMutPy only supports Qiskit's quantum operations); 2) QPs written in Python; 3) an open-source implementation of each QP; and 4) a test suite for each QP. To the best of our knowledge, there are four primary candidate sources of QPs that fulfill 1): the Qiskit-Aqua repository itself, the "Programming Quantum Computers" book's repository from O'Reilly, the "Qiskit Textbook Source Code" repository from the Qiskit Community, and the official "Qiskit tutorials" repository.
Qiskit-Aqua's repository provides the implementation of 24 QPs in Python, including the successful Shor [28], Grover [29], and HHL [45], and a fully automated test suite for each program. Hence, it fulfills all our requirements.
O'Reilly's book provides the implementation of 182 QPs, 29 written using Qiskit's framework. However, no test suite is provided for any of the 182 programs. Hence, it does not fulfill 4). The "Qiskit Textbook Source Code" and "Qiskit tutorials" repositories provide Jupyter Python notebooks with examples of how to interact with Qiskit's framework. No test suite is available for any of the examples. Hence, they fulfill neither 2) nor 4).
Table II note: The test suite of each QP was identified and selected based on each program's name. In Qiskit, a QP is named after the algorithm it implements, and its test suite is given the prefix "test_". For example, the test suite test_shor.py corresponds to the program shor.py. Code coverage was measured using the Coverage.py tool.
Table II lists the QPs considered in our study along with their lines of code (LOCs), the number of corresponding test cases, the time required to run the tests, and the code coverage at line level of the tests.
On average, the considered QPs have 184 LOC, where the smallest program has 56 LOC (numpy_ls_solver) and the largest has 443 (vqc). The number of tests and the time required to run all tests differ significantly. The number of tests ranges from 1 test (classical_cplex and numpy_ls_solver) to 593 tests (grover), and the runtime ranges from nearly 0 s (numpy_ls_solver) to 1627 s (vqc).
Regarding code coverage, on average, QPs' test suites cover 90% of all LOCs. This is in accordance with best practices [46] and also with a previous study conducted by Fingerhuth et al. [25], where the ratio of code exercised by QPs' tests was slightly above the industry-expected standard.

B. EXPERIMENTAL SETUP
All the experiments were executed on a machine with an AMD Opteron 6376 CPU (64 cores) and 64 GB of RAM. The operating system installed on this machine was CentOS Linux 7. We used Python version 3.7.0 in our experiments because it is the version supported by QMutPy and one of the required versions of Qiskit. We used the GNU Parallel tool [47] to run all the experiments in parallel.
In our experiments, we ran QMutPy with two configurations: with classical mutation operators only and with quantum mutation operators. For both configurations, we used MutPy's default parameters. For each QP/test suite, we collected the number of generated mutants, the number of mutated LOCs, the ratio of mutants per LOC, the number of mutants killed, the number of mutants that survived and were exercised by the test suite, the number of mutants that survived and were not exercised by the test suite, the number of incompetent mutants, the number of timeout mutants, the mutation score calculated both with and without the survived mutants not exercised by the test suite, and, finally, the time it took to run all mutants.

C. EXPERIMENTAL METRICS
To be able to compare the effectiveness of each test suite at killing mutants, we first compute its mutation score [19], i.e., the ratio of killed mutants to the total number of mutants (excluding incompetent mutants, e.g., mutants that introduce noncompiling changes). Formally, the mutation score of a test suite T is given by

$$\text{MS}(T) = \frac{\sum_{o \in O} |K_o|}{\sum_{o \in O} \left(|M_o| - |I_o|\right)}$$

where O represents the set of mutation operators, o is a single mutation operator, |M_o| is the number of mutants injected by o, |I_o| is the number of incompetent mutants generated by o, and |K_o| is the number of mutants (of o) killed by T. As some mutants might not be killed by T because the mutated code is not even executed by T, in our empirical analysis, we also report a mutation score that ignores mutants that are not executed by T. This score allows one to assess the maximum mutation score T could achieve. Formally, this score is computed as

$$\text{MS}_{E}(T) = \frac{\sum_{o \in O} |K_o|}{\sum_{o \in O} |E_o|}$$

where |E_o| represents the number of mutants injected by o and exercised by T.
Regarding time, we compute and report three different runtimes: 1) total time to perform mutation analysis on test suite T, which includes the time to create the mutants and run all tests on all mutants (Runtime column in Table IV); 2) time to inject a mutant in a nonmutated code (Generate mutant in Fig. 3); and 3) time to create a mutated module after injecting the mutant (Create mutated module in Fig. 3).
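As a worked illustration of the two scores, the sketch below computes them from per-operator counts. The function name, the dictionary layout, and all numbers are made up for the example; in particular, reading the second score as killed mutants over exercised mutants is our interpretation of the definitions above.

```python
def mutation_scores(counts):
    """Return (score over all valid mutants, score over exercised mutants).

    `counts` maps an operator name to a dict with the keys
    "mutants", "incompetent", "killed", and "exercised".
    """
    killed = sum(c["killed"] for c in counts.values())
    valid = sum(c["mutants"] - c["incompetent"] for c in counts.values())
    exercised = sum(c["exercised"] for c in counts.values())
    return killed / valid, killed / exercised


# Hypothetical counts for two operators (not results from the study).
EXAMPLE_COUNTS = {
    "QGR": {"mutants": 10, "incompetent": 0, "killed": 6, "exercised": 8},
    "QMD": {"mutants": 2, "incompetent": 0, "killed": 1, "exercised": 2},
}

score_all, score_exercised = mutation_scores(EXAMPLE_COUNTS)
```

In this toy case, 7 of 12 valid mutants are killed (about 58%), but 7 of the 10 exercised mutants are killed (70%). The gap between the two values indicates how much of the score is lost to code the test suite never runs.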
We also perform the Kruskal-Wallis nonparametric test [48], with a significance level of 0.01, and Cohen's d effect-size measure to evaluate the statistical significance of the results reported in Section IV. Note that, in Section V, we performed ad-hoc experiments on specific tests, and therefore, there are not enough data points to perform a statistical analysis.
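For readers unfamiliar with the effect-size measure, the sketch below computes Cohen's d for two samples using only Python's standard library. It assumes the common pooled-standard-deviation form of d, which the text does not specify, and the sample values are invented for illustration.

```python
import math
import statistics


def cohens_d(sample_a, sample_b):
    """Cohen's d with a pooled standard deviation (assumed variant)."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b)
                          / (n_a + n_b - 2))
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd


# Invented measurements, e.g., mutant-generation times of two operator groups.
effect = cohens_d([2, 4, 6], [1, 3, 5])
```

Both samples have a variance of 4 and their means differ by 1, so the pooled standard deviation is 2 and the resulting effect size is 0.5.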

D. THREATS TO VALIDITY
Based on the guidelines in [49], we discuss the threats to validity.

1) THREATS TO EXTERNAL VALIDITY
The QPs used in our empirical evaluation might not be representative of the whole QP population. Moreover, the set of test cases selected for each QP might not be complete (i.e., we may have missed other test cases in Qiskit-Aqua that test the QPs' code). Note that the lack of real-world QPs is a well-known challenge [40], [50]. Another threat is that we compared the results for only one, albeit popular, quantum framework (Qiskit). Caution is required when generalizing to other frameworks (e.g., Cirq).

2) THREATS TO INTERNAL VALIDITY
The main threat to internal validity lies in the complexity of the underlying tools leveraged to build QMutPy as well as the ones supporting our experimental infrastructure. To mitigate this threat, the authors have peer-reviewed the code before making the changes final.

3) THREATS TO CONSTRUCT VALIDITY
The parameters for drawing our conclusions may not be sufficient. In particular, by default, MutPy (hence, QMutPy) runs a test case t on a mutant m for five times the time t takes to run on the nonmutated version. Increasing this number may lead to different results (i.e., fewer timeouts).

IV. RESULTS
Section III defines the methodology and protocol for our mutation analysis and poses a set of research questions related to QMutPy's effectiveness and efficiency. The following subsections answer these questions in detail. Fig. 2 summarizes our results, detailing and classifying all of our mutation operations for each QP and mutation operator. Fig. 3 shows the distribution of time QMutPy takes to generate a mutant using classical and quantum mutation operators. On the one hand, the time taken to remove or inject new nodes into the program's AST is higher on all quantum mutation operators (except QMD) than on classical mutation operators. The latter take up to a maximum of 2.68 s (SCD), whereas the former take up to 5.53 s (QGD), 11.36 s (QMI), 61.13 s (QGR), and 75.04 s (QGI). On the other hand, the time taken to create a mutated version, i.e., to convert the mutated AST back to Python code, is relatively small (less than 0.1 s) for all classical and quantum mutation operators. According to Fig. 3, there is no runtime difference between creating a mutated version with a classical mutation operator or a quantum mutation operator.
Table IV note: Although 24 QPs were considered in our study, here we only list the ones for which QMutPy was able to generate at least one mutant (either classical or quantum). Column "Quantum Program" lists the subjects used in our experiments. Column "# Mutants" reports the number of mutants per subject. Column "# Mutated LOC" reports the number of LOCs with at least one mutant and the ratio of mutants per line of code. Column "# Killed" reports the number of mutants killed by the subject's test suite. Column "# Survived" reports the number of mutants that survived and were exercised by the test suite, and the number of mutants that survived and were not exercised by the test suite. Note that any buggy code or mutant that is not exercised by the test suite cannot be detected or killed. Column "# Incompetent" reports the number of mutants considered incompetent, e.g., mutants that make the source code uncompilable. Column "# Timeout" reports the number of mutants for which the subject's test suite ran out of time. Column "% Score" reports two values: the mutation score considering all mutants killed and survived (but excluding incompetents), and the mutation score considering all mutants killed by the test suite and all mutants that survived and were exercised by the test suite. Column "Runtime" reports the time, in minutes, QMutPy took to run all mutants and each mutant on average.

A. RQ1: HOW EFFICIENT IS QMUTPY AT CREATING QUANTUM MUTANTS?
QMutPy is statistically significantly slower (up to 16×; p-value < 2.20e-16 for an effect-size measure of 2.03) at generating quantum mutants than at generating classical mutants.
Based on our experience developing the quantum operators, we hypothesize the following reasons to explain this performance.

1) Mutation operators based on function calls (i.e., calls to quantum gates): Our set of quantum mutation operators, unlike the classical ones, is based on function calls. Mutating a function call is more complex than mutating, for example, a constant or an operator (e.g., "+") since specific grammar exists (e.g., ast.BinOp) for these types of mutations. It is worth noting that classical mutation operators that also modify function calls (e.g., SCD) are also more time consuming than operators that work at, e.g., logical operator level, such as LOD.
2) Search for quantum gates: Quantum mutation operators QGR, QGD, QGI, and QMI first visit all nodes of the AST and, for each function call, check whether it is a call to a quantum gate. As the number of function calls in a program is typically high, we estimate that the consecutive checking is time consuming. A possible solution to address this problem would be to create a new type of operation in the Python AST, analogous to logical operators, but dedicated explicitly to quantum gates.
3) Modifying or adding nodes in the AST: Although quantum mutation operators QGR, QMD, and QGD only modify one node of the program's AST, QGI and QMI not only modify one node but also add another to the end of the AST. We estimate this to increase the runtime of these operators.
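The first two reasons can be illustrated with Python's standard `ast` module: binary operators have a dedicated node type, whereas a gate call is an ordinary `ast.Call` node that can only be recognized by inspecting its name. The source snippet and the gate subset below are made up for the example.

```python
import ast

SOURCE = (
    "total = a + b\n"
    "circuit.h(0)\n"
    "print(total)\n"
    "circuit.cx(0, 1)\n"
)
GATES = {"h", "cx"}  # illustrative subset of gate names

tree = ast.parse(SOURCE)

# Operators are found directly by node type.
binops = [node for node in ast.walk(tree) if isinstance(node, ast.BinOp)]

# Gate calls require visiting every call and then checking its name.
calls = [node for node in ast.walk(tree) if isinstance(node, ast.Call)]
gate_calls = [call for call in calls
              if isinstance(call.func, ast.Attribute)
              and call.func.attr in GATES]
```

In this snippet, the single `+` is identified by its type alone, while all three calls must be examined to find the two that are gate calls. The extra per-call check is the cost hypothesized in reason 2.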

B. RQ2: HOW MANY QUANTUM MUTANTS ARE GENERATED BY QMUTPY?
To answer this research question, we analyze our data at two different levels: 1) mutation operator, i.e., how many mutants are generated by each quantum mutation operator (see Table III); and 2) program level, i.e., how many quantum mutants are generated per program (see Table IV). For these subresearch questions, we focus on the columns "# Mutants" and "# Mutated LOC" in both tables. As shown in Table III (column "# Mutants"), on average, our set of quantum mutation operators generated 140 mutants. The operator that generated the fewest mutants is QMD (12 mutants), whereas QGI (328 mutants) generated the most. These results show the following.

1) Quantum measurements are not that common in QPs (as only 12 measurements were mutated).
2) Out of the 40 quantum gates with at least one syntactically equivalent gate, 28 appear in the evaluated QPs.
3) The insertion and replacement of quantum gates with their syntactically equivalent ones represent 90% of all quantum mutants. This shows the importance of syntactically equivalent gates, tailored for QPs, in mutation testing.
It is worth noting that the average number of mutants generated by our quantum mutation operators is slightly below the number of mutants generated by classical mutation operators (140 versus 186, the latter being heavily dominated by CRP). As there are many more LOCs that could be targeted by classical mutation operators (e.g., usage of constants) and many more classical operators (18 versus our set of 5 quantum ones), it is expected that there are more classical mutants than quantum mutants. Nevertheless, the top-two quantum mutation operators (i.e., QGI and QGR) generated more mutants than 15 out of the 18 classical mutation operators.
On average, for 11 out of 24 QPs, QMutPy mutates four LOCs and generates 14 different quantum mutants per mutated line. It generates a total of 696 quantum mutants, 140 per mutation operator. Overall, the number of quantum mutants generated by QMutPy is statistically significantly lower (p-value = 5.98e-06 for an effect-size measure of 0.17) than the number of classical mutants.
As we can see in Table IV (column "# Mutants"), QMutPy generates at least one quantum mutant for 11 out of the 24 QPs. This means that the remaining programs use neither quantum gates nor measurements. It is worth noting that the quantum technique used impacts the number of generated quantum mutants, e.g., fewer (or no) mutants were generated for hybrid algorithms (e.g., vqe and qaoa) compared to purely quantum algorithms (e.g., classical_cplex and cplex_optimizer). Thus, more quantum mutation operators should be investigated and developed to support those QPs.
On average, QMutPy generated 64 quantum mutants per program (e.g., one mutant for vqe and qsvm and 207 mutants for shor). Given that our set of mutation operators targets function calls, which might not occur as often as, e.g., classical arithmetic operations in a program, QMutPy only mutated four LOCs on average, with an average of 13 mutants per line (see the column "# Mutated LOC"). In contrast, at least one classical mutant was generated for all programs: 147 mutants on average (+83) and 64 LOCs mutated (+60), with an average of three mutants per line (−10). Note that QPs are composed of more traditional programming blocks, such as conditions, loops, and arithmetic operations, than calls to the quantum API. Thus, as there are many more LOCs that can be mutated using classical mutation operators than using quantum mutation operators, it is expected to have fewer quantum mutants in a QP.

C. RQ3: HOW DO TEST SUITES FOR QPS PERFORM AT KILLING QUANTUM MUTANTS?
This question aims to analyze the quality and resilience of test suites designed to verify QPs. As mentioned before, the idiosyncrasies underlying QPs (e.g., superposition and entanglement) make testing far from trivial. We argue that QMutPy's mutants can be used as benchmarks to assess the quality of tests designed to verify QPs. Table IV reports the results of performing mutation testing on the 24 QPs described in Table II, whereas Table III summarizes the results per mutation operator.
As we can see in Table III, out of the 696 mutants generated by our quantum mutation operators, 325 (46.70%) were killed by the programs' test suites. For QGI, the mutation operator that generated the most mutants, 102 out of 328 mutants were killed, whereas for QGR 170 out of 300 were killed. The mutants that were not killed either survived the test suites (307, 44.11%), were not even exercised by the test suites (two QMD mutants, 0.29%), or resulted in a timeout (62, 8.91%). In comparison, out of the 3527 mutants generated by classical mutation operators, 1264 (35.84%) were killed, 971 (27.53%) survived, 353 (10.01%) were not exercised by the test suites, and 885 (25.10%) timed out. Note that the share of killed quantum mutants is 10.86 percentage points higher than that of classical ones and that only 0.29% of all the quantum mutants are not exercised by the test suites, as opposed to 10.01% (+9.72 percentage points) of all the classical mutants. These results show that the programs' test suites might have been designed to mainly verify the quantum aspect of each program.
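As a quick sanity check, the outcome shares reported for the quantum mutants can be recomputed from the raw counts given in the text. The helper name `outcome_shares` is introduced here purely for illustration.

```python
def outcome_shares(outcomes):
    """Return each outcome's share of all mutants, in percent, rounded to two decimals."""
    total = sum(outcomes.values())
    return {name: round(100 * count / total, 2) for name, count in outcomes.items()}


# Counts of quantum mutants per outcome, as reported in the text.
quantum_outcomes = {"killed": 325, "survived": 307, "not covered": 2, "timeout": 62}
quantum_shares = outcome_shares(quantum_outcomes)
```

The four counts add up to the 696 generated quantum mutants, and the resulting shares match the percentages quoted above.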
To verify that quantum mutants are not killed by chance but rather because the tests were tailored to verify quantum behavior, we conducted a small experiment on two QPs, i.e., shor and grover. We first removed all the assertions from shor's and grover's test suites, then reran our mutation analysis on each QP, and finally recomputed the mutation scores. The mutation scores achieved in this experiment dropped from 53.34% to 24.22% (shor) and from 50.32% to 20.00% (grover). This further shows that the intention of testing specific quantum behavior is the main reason tests kill quantum mutants.
At program level, on average, the mutation score achieved by all programs' test suites was 57.69% if all mutants are considered [see (1)] and 62.23% if only mutants covered by the test suite are considered [see (2)]. Recall that noncovered mutants would never be killed by any test as the mutated code is never executed. The mutation score achieved by each test suite ranged from 0% (vqc and vqe, more on this in Section V-A) to 100% (hhl and qsvm). The mutation score achieved by all programs' test suites on classical mutants was 33.51% on average (considering all programs) and 41.61% if we only consider the same set of 11 programs for which quantum mutation operators were able to generate at least one mutant. The programs' test suites achieved a higher mutation score on quantum mutants than on classical mutants, +20.62% (62.23% versus 41.61%), hence reinforcing the idea that the test suites have been designed to mainly verify the quantum characteristics of each QP.
Regarding the time required to run mutation testing, on average, test suites took 148.18 min to run on quantum mutants. Note that, as different programs have different numbers of mutants and test cases, the runtime of each QP's test suite on quantum mutants differs largely. For instance, shor's test suite, the QP with the most quantum mutants, took 779.68 min; qsvm, the QP with the fewest mutants and tests, took 47.85 min; and grover, the QP with the most tests, took 212.24 min. Programs' test suites took longer to run on quantum mutants than on classical ones. For example, qsvm's test suite took 47.85 min to run on the only generated quantum mutant and 4.79 min on average (674.82 min / 141 classical mutants) on each classical mutant. The reasons behind these time differences are explained in Section IV-A.
Test suites for QPs achieved a mutation score on quantum mutants that is statistically significantly higher than the mutation score achieved on classical mutants (62.23% versus 33.51%; p-value = 2.00e-05 for an effect-size measure of 0.92).

D. RQ4: HOW MANY TEST CASES ARE REQUIRED TO KILL OR TIMEOUT A QUANTUM MUTANT?
This question aims to understand the effectiveness of current quantum test suites. Fig. 4 shows the distribution of the number of tests required to kill or timeout each mutant per mutation operator and per QP.
At the mutation operator level, the average number of tests needed to kill or timeout each quantum mutant is 9 (e.g., one test for QMI and 73 tests for QMD). The average number of tests needed to kill or timeout each classical mutant is 26, with 10 out of 18 classical mutation operators executing more than 500 tests.
At program level, the average number of tests needed to kill or timeout a quantum mutant is 13 (e.g., one test for bernstein_vazirani, iqpe, and qsvm, and 73 for grover). Regarding classical mutants, the average number of tests needed to kill or timeout each classical mutant was 18 (considering all programs) or 64 if only the ten programs for which at least one quantum mutant was generated and either killed or timed out are considered.
Although on average quantum mutants require 65.38% fewer tests to be killed or timeout than classical mutants (9 versus 26), there is no statistically significant difference (p-value = 0.52 for an effect-size measure of −0.10) between the number of tests required to either kill or timeout a classical mutant and a quantum mutant.

E. RQ5: HOW ARE QUANTUM MUTANTS KILLED?
With this question, we aim to analyze what kills quantum mutants. Fig. 5 depicts the overall number of mutants killed by an assertion or an error. Fig. 6 shows the same data per mutation operator.
Out of the 1589 killed mutants, we observed that two-thirds were killed by errors (1067) and the other one-third by test assertions (522). Fig. 6 reports the number of mutants killed by errors and test assertions per mutation operator. Overall, the majority of classical mutants are killed by errors. As already mentioned, we argue that Qiskit's test suites are mainly designed to check for the correct behavior of QPs. Therefore, they are less resilient to classical mutations, which are likely to be killed by errors instead of test assertions. This observation does not hold for quantum mutants.
QGD, QGR, QGI, and QMD mutants are killed more often by test assertions than by errors. We also observed that QMI mutants, as expected, are killed by errors only. The reason is that Qiskit does not have a fail-safe mechanism for inserting measurements. When a measurement operation is inserted in a random position, the circuit may become unprocessable and an error is thrown. Developing better approaches to reduce the number of design errors of QMI mutants remains, however, as future work.
On the one hand, classical mutants are mainly killed by errors. Quantum mutants, on the other hand, are mainly killed by test assertions (with the exception of QMI mutants) and are statistically significantly more likely to be killed this way (p-value = 0.01 for an effect-size measure of 0.80).
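The distinction between a mutant killed by a test assertion and one killed by an error maps naturally onto how Python's unittest separates failures from errors. The sketch below illustrates that distinction only; it is not how QMutPy records kill causes, and `kill_cause`, `make_test`, and the three toy "programs" are hypothetical names and data introduced here.

```python
import unittest


def kill_cause(test_class):
    """Run a TestCase class and report why (or whether) the program under test was killed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(test_class)
    result = unittest.TestResult()
    suite.run(result)
    if result.errors:    # an unexpected exception was raised while running a test
        return "error"
    if result.failures:  # a test assertion did not hold
        return "assertion"
    return "survived"


def make_test(program):
    """Build a test class that checks the (hypothetical) measurement counts of a program."""
    class CountsTest(unittest.TestCase):
        def test_counts(self):
            self.assertEqual(program(), {"00": 512, "11": 512})
    return CountsTest


def original():
    return {"00": 512, "11": 512}


def wrong_output_mutant():
    return {"00": 1024}


def crashing_mutant():
    # Stands in for a circuit that can no longer be processed.
    raise RuntimeError("circuit could not be processed")


causes = [kill_cause(make_test(p)) for p in (original, wrong_output_mutant, crashing_mutant)]
```

In this toy setting, a mutant that changes the observed counts is killed by the assertion, whereas a mutant that makes the program raise an exception is killed by an error, mirroring the behavior described for QMI mutants.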

V. IMPROVING QUANTUM TEST SUITES
The results in Section IV suggest that there is room for improvement in Qiskit's test suites. For example, we observed that 150 out of the 207 quantum mutants generated for shor survived. We draw on two hypotheses to guide our discussion on how to improve QPs' test suites to kill more quantum mutants:
h1: The low mutation score achieved by each test suite is due to their low coverage.
h2: The low mutation score achieved by each test suite is due to their low number of test assertions.
Note that the described mutations and improvements to the test suites are available at https://github.com/jose/qmutpy-experiments.

A. IMPROVING COVERAGE
Fig. 7 shows the relation between coverage and mutation score overall, for each test suite, and for each mutation operator. We computed the Spearman-rank correlation coefficient between coverage and mutation score of each test suite and observed that mutation score and coverage are correlated (+0.28, i.e., mutation score increases with coverage, p-value = 1.02e-06). Thus, with this first hypothesis, we aim to investigate whether increasing the coverage of QPs, e.g., covering mutated LOCs that are not exercised by the program's test suite, leads to a higher mutation score.
Table IV shows that two QPs (hhl and vqc) each have one mutant, generated by the QMD operator, that is not covered by any test and therefore survived the test suites. These mutants are located in uncovered methods: construct_circuit (see Listing 6) and get_optimal_vector (see Listing 8), respectively. We extended hhl's and vqc's test suites (footnotes 23 and 24), as shown in Listings 7 and 9, respectively, to cover these methods and added a more specific test assertion to each test. The test assertions verify that the number of combinations of qubit measurements is correct, which it would not be if no measurement was performed. We verified that our hypothesis holds by rerunning the mutation analysis using the augmented test suites. In both QPs, the mutants that survived our initial mutation analysis are killed by the augmented test suites. That is, hhl's mutation score increased from 50% to 100% (coverage increased from 86.55% to 89.16%), and vqc's mutation score from 0% to 50% (coverage increased from 93.26% to 94.43%).
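For readers unfamiliar with the rank correlation used above to relate coverage and mutation score, the following self-contained sketch computes a Spearman coefficient as the Pearson correlation of average ranks. The coverage and mutation score values are hypothetical and chosen only to keep the arithmetic easy to follow; they are not the study's data and do not reproduce the reported +0.28.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-test-suite values, for illustration only.
coverage = [60.0, 72.5, 80.0, 91.0, 95.0]
mutation_score = [10.0, 40.0, 35.0, 70.0, 90.0]
rho = spearman(coverage, mutation_score)
```

A positive coefficient indicates that test suites with higher coverage tend to rank higher in mutation score, which is the relation h1 builds on.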

B. IMPROVING TEST ASSERTIONS
As mentioned before, QPs are probabilistic in nature. Consider a quantum circuit with two qubits. When read, these qubits could be 00, 01, 10, or 11. Suppose that the correct behavior is to observe 00 with 25% probability and 11 with 75%. If, instead, a mutant that measured 00, 01, 10, and 11 with some probability survived, then we would have a false negative since the mutant should have been killed. We argue that asserting the number of measurements in the test suites is necessary to avoid these false negatives, hence improving the mutation score. To verify this intuition, we augmented shor's test suite (footnote 25), the QP with the most generated quantum mutants (see Table IV), with additional test assertions, as shown in Listing 10. The added assertions check the correctness of the number of obtained measurement values.
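The two-qubit scenario above can be sketched as follows. This is an illustration under stated assumptions and not the content of Listing 10: the measurement counts are hypothetical, and `passes_weak_check` and `passes_strict_check` are names introduced here to contrast a test that only looks for the expected outcomes with one that also asserts how many distinct measurement values were observed.

```python
EXPECTED_OUTCOMES = {"00", "11"}


def observed_outcomes(counts):
    """Return the set of bit strings that were measured at least once."""
    return {bits for bits, hits in counts.items() if hits > 0}


def passes_weak_check(counts):
    # Only checks that every expected outcome was observed.
    return EXPECTED_OUTCOMES <= observed_outcomes(counts)


def passes_strict_check(counts):
    # Additionally asserts the number of distinct measurement values.
    return passes_weak_check(counts) and len(observed_outcomes(counts)) == len(EXPECTED_OUTCOMES)


# Hypothetical counts over 1024 shots.
original_counts = {"00": 262, "11": 762}
mutant_counts = {"00": 240, "01": 260, "10": 255, "11": 269}

mutant_survives_weak = passes_weak_check(mutant_counts)
mutant_survives_strict = passes_strict_check(mutant_counts)
```

Under the weak check the mutant survives even though it produces outcomes the original program never does; the strict check kills it while still accepting the original program.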
As with h1, we reran the mutation analysis using the augmented test suite to verify that h2 holds. The mutation score achieved by shor's original test suite was 53.34% (50 mutants killed and 150 survived out of 207). The augmented test suite achieved a mutation score of 72.81% (109 mutants killed and 91 survived). In detail, the augmented test suite killed six out of eight QGD mutants (+3 compared with the original test suite), 32 out of 99 QGI mutants (+19), 63 out of 91 QGR mutants (+37), and the same QMD and QMI mutants (one out of one and seven out of eight, respectively) as the original test suite.

VI. RELATED WORK
To the best of our knowledge, there are four works in the literature that have performed quantum mutation on QPs [40]-[43]. However, these are preliminary attempts to conduct quantum mutation testing. Empirically evaluating these prior works on the same set of QPs and tests, and comparing those tools' performance with QMutPy, is impossible due to several limitations.
Ali et al. [40] performed mutation analysis on automatically generated tests for QPs to assess their effectiveness at finding seeded faults. Their study introduces four mutation operators: QGD, QGI, and QGR (with no concept of syntactically equivalent gates), and a classical operator named "replace mathematical operator." Such studies could further benefit from a fully automated tool such as QMutPy, which supports a more extensive set of mutation operators, including 20 classical operators.
Footnote 25: https://github.com/Qiskit/qiskit-aqua/blob/stable/0.9/test/aqua/test_shor.py
Mendiluze et al. [41] proposed Muskit, a Python mutation tool for Qiskit QPs. Muskit supports the mutation of 19 of Qiskit's gates, the mutation operator QGD, as defined in Section II-A2, and the mutation operators QGI and QGR but with no concept of syntactically equivalent gates. QMutPy, on the other hand, supports two additional mutation operators, i.e., QMD and QMI, which can mutate measurement calls, and is able to mutate 40 gates (21 more than Muskit). To use Muskit, one must provide the specification of the QP so that Muskit is able to assess whether a mutant has been killed by a test. This requires expertise in quantum computing and/or in Qiskit. As the manually written tests used in our study are equipped with test assertions, QMutPy does not require any program specification to assess whether a test kills a mutant. Mendiluze et al. [41] also conducted an experimental evaluation of Muskit on four QPs, one of which is shared with our study: the Bernstein-Vazirani cryptography algorithm. They reported that Muskit generated 343 mutants for that algorithm (255 generated by the QGI operator, nine by QGD, and 79 by QGR) and achieved a mutation score of 77.35%. In our study, QMutPy only generated 88 mutants (44 generated by the QGI operator, four by QGD, and 40 by QGR) but achieved a mutation score of 91.32%. These differences can be explained as follows.

1) Mendiluze et al.'s [41] implementation of the Bernstein-Vazirani algorithm is 14 lines long and contains nine gates (footnote 26), whereas the implementation available on Qiskit-Aqua's repository (and used in our study) is 80 lines long and contains only four gates (footnote 27).
2) We performed mutation analysis with the 33 manually written tests as opposed to the 64 automatically generated tests used by Mendiluze et al. [41]. As the manually written tests achieved a higher mutation score, further research on the automatic generation of tests for QPs should be conducted (see, e.g., [40]).
3) As put forward by Mendiluze et al. [41], the large number of mutants generated by the mutation operators QGI and QGR that survived might be equivalent/irrelevant mutants. QMutPy only injects or replaces syntactically equivalent gates, thus keeping the number of equivalent mutants, if any, low.
Finally, MTQC [43] is a Java quantum mutation tool that supports Qiskit and Q# QPs. MTQC supports the mutation of 17 of Qiskit's gates (versus 40 in QMutPy) and a subset of the operations performed by our QGR operator (52 versus 225; see dark squares in Fig. 1). At the time of writing this article, no study has been conducted with MTQC. We could not include MTQC in our study because 1) it does not support unittest, a requirement to run Qiskit-Aqua's manually written tests, and 2) it requires one to manually use its GUI to perform the mutation analysis, one program at a time, which is time consuming and prone to mistakes.

VII. CONCLUSION
In this article, we proposed a mutation-based technique to test QPs, named QMutPy, that is capable of mutating QPs for Qiskit, the IBM quantum framework. This is a first attempt to perform mutation testing on QPs with a tool that is easy to use and works at scale. Furthermore, QMutPy offers classical mutation operators as well as more quantum mutation operators than the approaches/tools proposed in the literature.
To demonstrate the effectiveness of QMutPy, we carried out an empirical study with 24 real QPs (selected from Qiskit). We observed several issues that may lead to future failures: nonoptimal code coverage, low mutation scores, and a minimal number of test cases. Furthermore, we observed that quantum mutants required fewer test cases to be killed than classical mutants. This is likely due to the objective of the designed test suites, i.e., checking for the QP's behavior. As a consequence of our observations, we draw on two potential ways to improve test suites: coverage and assertion improvements. We show how both improvements can increase the mutation score significantly on the QPs considered in our study (footnote 28).
As for future work, we plan to extend QMutPy with other mutation operators, offer it to other quantum frameworks (e.g., Cirq and Q#), and run our mutation analysis on real quantum computers.