Mutation Testing in Practice: Insights From Open-Source Software Developers

Mutation testing drives the creation and improvement of test cases by evaluating their ability to identify synthetic faults. Over the past decades, the technique has gained popularity in academic circles. In practice, however, little is known about its adoption and use. While there are some pilot studies applying mutation testing in industry, the overall usage of mutation testing among developers remains largely unexplored. To fill this gap, this paper presents the results of a qualitative study among open-source developers on the use of mutation testing. Specifically, we report the results of a survey of 104 contributors to open-source projects using a variety of mutation testing tools. The findings of our study provide helpful insights into the use of mutation testing in practice, including its main benefits and limitations. Overall, we observe a high degree of satisfaction with mutation testing across different programming languages and mutation testing tools. Developers find the technique helpful for improving the quality of test suites, detecting bugs, and improving code maintainability. Popularity, usability, and configurability emerge as key factors for the adoption of mutation tools, whereas performance stands overwhelmingly as their main limitation. These results lay the groundwork for new research contributions and tools that meet the needs of developers and boost the widespread adoption of mutation testing.

Mutation Testing in Practice: Insights From Open-Source Software Developers Ana B. Sánchez , José A. Parejo , Sergio Segura , Member, IEEE, Amador Durán , and Mike Papadakis Abstract-Mutation testing drives the creation and improvement of test cases by evaluating their ability to identify synthetic faults.Over the past decades, the technique has gained popularity in academic circles.In practice, however, little is known about its adoption and use.While there are some pilot studies applying mutation testing in industry, the overall usage of mutation testing among developers remains largely unexplored.To fill this gap, this paper presents the results of a qualitative study among opensource developers on the use of mutation testing.Specifically, we report the results of a survey of 104 contributors to opensource projects using a variety of mutation testing tools.The findings of our study provide helpful insights into the use of mutation testing in practice, including its main benefits and limitations.Overall, we observe a high degree of satisfaction with mutation testing across different programming languages and mutation testing tools.Developers find the technique helpful for improving the quality of test suites, detecting bugs, and improving code maintainability.Popularity, usability, and configurability emerge as key factors for the adoption of mutation tools, whereas performance stands overwhelmingly as their main limitation.These results lay the groundwork for new research contributions and tools that meet the needs of developers and boost the widespread adoption of mutation testing.Index Terms-Mutation testing, mutation tools, qualitative study, GitHub.

I. INTRODUCTION
M UTATION testing assesses the adequacy of tests based on their ability to detect artificial faults.Such faults are generated by applying mutation operators, which introduce simple syntactic changes into the source code of a computer program.For instance, a mutation operator can introduce a fault by changing a relational operator or negating a conditional expression.The injected faults are known as mutations, and Ana B. Sánchez, José A. Parejo, Sergio Segura, and Amador Durán are with the Universidad de Sevilla, 41012 Seville, Spain (e-mail: anabsanchez@us.es;japarejo@us.es;sergiosegura@us.es;amador@us.es).
Digital Object Identifier 10.1109/TSE.2024.3377378 the new faulty versions of the program under test are known as mutants.The adequacy of test suites is measured by the mutation score (a.k.a mutation coverage), which is the ratio of detected (a.k.a.killed) mutants over the total number of mutants.Mutants that are not killed by the employed test cases, i.e., that remain alive, are considered valuable as they showcase ways for improving the test suites.
The origin of mutation testing can be traced back to the 70s in the seminal papers by DeMillo et al. [1], [2].Since then, mutation testing has flourished until becoming a wellestablished testing technique and a recurrent research topic in software testing and software engineering venues [3].Its ability to evaluate and compare the effectiveness of testing approaches has also made mutation testing a popular technique for research purposes.Recent literature reviews on the topic have identified more than 400 research papers on mutation testing and about 130 mutation testing tools-most of them research prototypes-for a variety of programming languages and artefacts, including Java, C, C++, C#, JavaScript, HTML/CSS, Ruby, and UML models, among others [3], [4].
The impact and widespread adoption of mutation testing in research comes in contrast to the general belief of the current practice.This can be attributed to the lack of evidence of the use of mutation testing in practice.While some case studies [5] and technical reports on large companies such as Google [6] and Facebook [7] have been published, the general use of the mutation technology by practitioners remains unknown.As a result, the overall impact of mutation testing beyond research is still largely unexplored [4], [8].To address this issue, Sánchez et al. [4] made a quantitative study on the use of mutation testing tools in practice using repository mining.Specifically, they systematically searched and analysed thousands of GitHub repositories that include evidence of the use of mutation testing tools.Among other findings, their results revealed a significant adoption of mutation testing for development purposes in opensource projects, with tools like Infection [9] (PHP) and PIT [10] (Java) being among the most widely used ones.
In this work, we seek to investigate "how" developers are using mutation testing and to gain insights into the benefits and limitations of the method as perceived by the practitioners working on open-source projects.That is, we focus on the usage of mutation testing within software development, including the specific tasks performed with the help of mutation, the perception of the developers on the value provided by the technique, and the key benefits and limitations of existing tools.All-in-all we investigate the extend to which the practitioners views differ from that of researchers with the aim to identify potential points for improvement.
To this end, we aim to answer the following research questions: RQ 1 How mutation testing tools are used in real-world cases?We aim to study when and how mutation testing tools are used in practice, including the factors that drive their adoption.RQ 2 What is the perceived impact of mutation testing in practice?We aim to study how developers assess the impact of mutation testing in their projects, including both positive (e.g., improving test case quality) and negative points (e.g., execution cost).RQ 3 What are the key benefits and limitations of mutation testing tools (as perceived by practitioners)?We plan to investigate the degree of satisfaction of developers with current tool support and their views on how to improve it.
To answer these questions, we perform a survey on the use of mutation testing among developers of open-source software projects.To do so, we used the data collected by Sánchez et al. [4], based on the analysis of over 3.5K GitHub repositories including evidence of the use of mutation testing tools.That data, allowed us to find the survey participants, i.e., open-source project contributors who are using mutation testing tools.Overall, we collected and reported results based on 104 valid responses from mutation testing practitioners.
The results of our study reveal a high level of satisfaction with mutation testing across various programming languages and mutation testing tools.Developers find the technique beneficial for enhancing their test suites, identifying bugs, and deciding when to stop testing.Additionally, mutation coverage is often used to guide code refactoring (e.g., removing redundant or irrelevant code), contributing to the overall quality of the code.Mutation testing tools are commonly integrated into continuous integration/deployment workflows and build tools.The adoption of mutation tools is mostly driven by their popularity, usability, and configurability.Performance stands out as the primary impediment to the practical implementation of the technique, along with the challenge of seamlessly integrating the tools with Integrated Development Environments (IDEs).Together, these results provide a new perspective on the state of practice of mutation testing, laying the foundation for novel research contributions and tools that meet the needs of developers and foster its broad adoption.
The rest of the article is organised as follows.Section II describes the background and related work.Our research methodology is detailed in Section III.The findings of our study are presented in Section IV.The potential threats to validity are discussed in Section V. Finally, we present the conclusions of our work in Section VII.

II. BACKGROUND AND RELATED WORK
In this section, we first introduce mutation testing and then discuss related empirical studies on mutation testing, as well as related qualitative studies on software testing.

A. Mutation Testing
Mutation testing is a well-known fault-based technique that aims at evaluating and improving the fault-revealing potential of test suites.This technique not only encourages testers to exercise as much code as possible, but also to uncover possible programming mistakes.The technique relies on injecting faults, known as mutations, which are simple syntactic modifications of the code.The faulty versions of the program under test where mutations are injected are known as mutants, which, when not detected by test cases, have the potential to reveal real faults Chekam et al. [11].
Two underlying hypotheses support the mutation testing approach.The first one is the so-called Competent Programmer Hypothesis [12], which states that programmers tend to write programs that are largely correct, in some sense missing simple cases, similar to mutants in mutation testing.The second one is the so-called Coupling Effect Hypothesis [1], which states that tests detecting simple mutants are also able to detect others, including the majority of the more complex ones.
Every mutant as well as the original program are executed against the test suite.The program outputs of the original and mutant programs are compared in order to identify any difference that would indicate that the injected fault was triggered by the test suite and would classify that mutant as detected or killed.On the contrary, when the output is the same, the mutant remains alive and requires further analysis, as it can point out a deficiency in the fault detection ability of the test suite.This is not always the case, however, because a mutant can turn out to be functionally equivalent to the original program; these are the so-called semantically-equivalent mutants.It follows that a tester should aspire to kill as many mutants as possible to increase the detection power of the test suite.The number of killed mutants over the whole set of non-equivalent mutants is called mutation score or mutation coverage.
The injection of mutations is generally systematised with the development of mutation tools, which implement different mutation operators.These operators are applied each time a pattern is found in the program (e.g., each appearance of the relational operator '>' is replaced by '<').As we show in followup sections, more than one hundred mutation testing tools have already been developed and these cover most of the widely-used programming languages, including Java, C/C++, Python or C#.The list of the mutation testing tools has grown significantly in the last years with many mutation testing tools specialised in emerging domains such as Deep Learning [13] and Smart Contracts [14].

B. Studies on Mutation Testing
Mutation testing has been extensively analysed and studied in the scientific literature.Several systematic literature reviews [3], [15], [16] have addressed mutation testing from a general point of view, while others have focused on more particular issues within the topic, such as the techniques to reduce its cost [17], [18].Some other studies have examined various mutation testing tools and compared them following a variety of perspectives.A recent survey by Kintis et al. [19] assesses the effectiveness of three Java mutation testing tools (PIT [10], MuJava [20] and Major [21]) in the detection of faults.The conducted experiments revealed that, while none of these mutation tools completely subsumed the others, an improved version of PIT with research purposes was the most effective tool at inducing test cases that could reveal real faults.
Another group of studies deepens on technical aspects regarding the usage of mutation testing tools, such as efficiency, controllability, compatibility, and integration with a test environment.Although most of these studies focus on Java tools [22], [23], some other programming languages have been considered too, like C# [24].The study by Delahaye and du Bousquet [22] identifies three different profiles that can influence the election of a mutation tool: teaching, research, and industry.They conclude that PIT is a good choice for the industry and the teaching profiles, where tools should be easy to apply and, in the particular case of the industry, should have a good balance between efficiency and meaningfulness of results.
Regarding the application of mutation tools in the development of software projects, some studies have shown the possible benefits of transferring mutation testing concepts from academia to industry, carrying out empirical studies with opensource projects [11], [25] and industrial projects [5].In fact, a recent study by Petrović et al. [26] showed that mutation testing has positive long-term effects on the testing practices of developers.Also, some recent studies analyse the use of the technique in large companies.For example, in [6], it is reported how Google implements its own mutation system for seven programming languages and applies a diff-based probabilistic approach to reduce the high computational cost of traditional mutation analysis.The study by Beller et al. [7] reports that more than half of the mutants generated-based on some errorinducing patterns-were not detected by the tests developed at Facebook.Sánchez et al. [4] reported the findings of a study that investigated the use of mutation testing in practice by looking into GitHub.Specifically, the authors performed a systematic search for GitHub repositories including traces of the use of 127 mutation testing tools.Then, the authors focused on the top ten more widely used tools and manually revised the repositories importing them, over 3.5K.Among others, the results of the study showed 1) a notable upturn in interest in mutation testing in recent years, mostly focused on a small set of highly popular tools; 2) the predominant use of mutation testing in development, followed by teaching & learning, and research; and, 3) some of the most popular mutation testing tools in GitHub are rarely referenced in research papers, e.g., such as Infection [9] or Humbug [27] for PHP.

C. Qualitative Studies on Software Testing
Prado and Vincenzi [28] conducted a qualitative study of professionals with unit testing experience aiming to understand how to improve the cognitive support provided by the testing tools, taking into account the perspective of practitioners on their unit testing review practices.The responses of practitioners revealed some primary tasks which require cognitive support, including monitoring of pending and executed unit testing tasks, and navigating across unit testing related artefacts.
Beller et al. [29] reported the results of a large-scale field study with 416 software engineers whose development activity was closely monitored over the course of five months, with the objective of knowing when, how and why developers (do not) test in their IDEs.
A recent qualitative study by Habchi et al. [30] explores the sources, impacts, and mitigation strategies of flaky tests through interviews with 14 practitioners.Flaky tests are tests that are non-deterministic, i.e., for the same versions of code and test, can pass and fail on different runs.Their results shows that flakiness stems from interactions between system components, testing infrastructure, and external factors.They also highlight the impact of flakiness on testing practices and product quality and show that the adoption of guidelines together with a stable infrastructure are key measures in mitigating the problem.
In [31], Perry et al. conduct a large-scale user study with 47 participants examining how users interact with an artificial intelligence (AI) code assistant to solve a variety of security tasks.AI code assistants enable developers to write code faster by suggesting code and helping during edition.The authors found that participants with access to an AI code assistant often produced more security vulnerabilities than those without access, possibly due to a false sense that they write more secure code with this kind of assistants.

III. RESEARCH METHODOLOGY
In this section, we describe the research process followed, summarised in Fig. 1.Specifically, we provide details about the selection of potential participants, the design of the survey, the protocol for contacting participants, and the qualitative analysis of the collected data.To ensure clarity, Table I displays the number of repositories and potential participants associated with each step of the process for each mutation tool.

A. Participant Selection
Our goal was to select practitioners who possess experience in working with mutation testing.To accomplish this, we initially identified open-source repositories that included evidence of utilising mutation tools, and then we reached out to the contributors of these repositories and encouraged them to participate in the survey.This process was performed in two primary steps, described below.

1) Repository Selection (Step 1):
We started by identifying open-source projects including evidence of the use of mutation testing tools for development purposes as follows.
Previous repository selection (step 1.1).In the previous study conducted by Sánchez et al. [4], the ten most frequently utilised mutation tools on GitHub (listed in the first column of Table 1) were identified.Following that, they performed a manual examination and categorisation of 3,581 active GitHub repositories that imported these tools.These repositories serve as the foundation for the present study.The number of selected Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.repositories per mutation testing tool is documented in the second column of Table I.

Development repository filtering (step 1.2).
We were interested in surveying developers using mutation testing for development purposes (i.e., as a method to evaluate and improve the quality of test suites) rather than for research or teaching.To achieve this, we used the data collected by the manual revision of the repositories performed by Sánchez et al. [4].In particular, we included repositories classified as "development purposes" and with owners from industry or public institutions only, and excluded those related to teaching (e.g., course materials) and research (e.g., replication packages).This reduced the number of target repositories from 3,581 to 1,574 (44%), leaving two of the tools out of the study, MuJava and Major, as shown in the third column of Table I.
2) Contributor Selection (Step 2): Potential participants were selected among the owners and contributors of the 1,574 repositories previously selected in the three steps described below.
Contributor data extraction (step 2.1).We proceeded to gather information on the owners and contributors of each selected repository with the aim of establishing communication with them.Specifically, we used the REST API 1 of GitHub to automatically extract owner and contributors' data from the 1,574 selected repositories.In particular, we collected the name, bio, email, Twitter profile, company, location and type (user, organization, or bot) of each contributor/owner.Repositories that were no longer available on GitHub at the date of the data revision (Oct 10th, 2022) were discarded.This reduced the number of target repositories from 1,574 to 1,468, as shown in the fourth column of Table I.Retrieved data were saved into CSV files for further analysis and to enable replicability.
Repository and contributor filtering (step 2.2).To complete this work with reasonable resources, we focused on the most popular repositories and their most active contributors.Specifically, we considered up to 100 repositories for each tool (ordered by popularity, i.e., number of stars) and up to 100 contributors for each repository (ordered by their number of commits to the project).This made a total of 503 repositories under study (as shown in the fifth column of Table I) with a potential number of participants of 3,375 based on the number of contributors found on them (as can be seen in the sixth column of Table I).
Exclusion of bots & contributors without contact information (step 2.3).Finally, we first discarded the contributors who were bots, and secondly, those who did not have email to contact them.Table I summarises the resulting data in the seventh and eighth columns, i.e., 3,207 and 1,192 contributors, respectively.This made a total of 503 repositories and 1,192 potential participants for our study.

B. Survey Design
To design the survey, we followed Creswell's guidelines [40] for conducting qualitative research questions.The process took several iterations where four of the authors independently reviewed the questionnaire and discussed it until a consensus was reached (e.g., grouping similar questions together to make them more coherent and concise).The final version of our questionnaire, included in the Appendix (see the supplemental material), contains 21 questions, 17 closed-ended, in which additional comments can also be included, and 4 open-ended.
The first four questions are related to the participant profile demographics: age, gender and years of experience as a developer and as a user of mutation testing.The next two questions ask the participants to indicate which mutation tools they have used and how they have learned about them.The rest of the survey contains the core questions aimed to respond the research questions of our study, which are summarised in Table II.Some of them use a 5-level Likert scale.We include a last question for additional comments or suggestions.Each row in Table II shows a question including the research question to which it is related, an identifier, the type of question (MC: multiple choice, OC: one choice, or FT: free text), and the number of respondents that answered it.

C. Participant Contact
With the aim of maximising the number of responses, we followed a series of recommendations to improve the survey participation rate [41], [42]: i) Use the word questionnaire (not survey).The word "survey" often has a negative connotation (especially in inboxes).ii) Ask participants for help.iii) Avoid the spam filter.Avoid unknown senders, such as info@ or noreply@, instead, use a corporate email.iv) Personalize.Personalize the subject line and/or, at least, the content of the email, e.g., the greeting.v) The subject line must be appealing.vi) Explain the purpose of the survey.vii) Publish the results.viii) Be grateful.
We contacted the participants in two different ways.Firstly, we wrote a personalized email, following the recommendations mentioned above, to the selected contributors.Secondly, we created a GitHub issue for each selected repository requesting the participation of any contributor with experience using the specific mutation testing tool.The e-mail and issue templates are available in the supplementary material [43].
In total, we automatically sent 1,192 emails, one to each selected contributor, and we posted 503 issues, one for each selected repository.The survey was sent out initially on November 29th 2022.A kind reminder was sent later, on December 15th, only to those participants who had replied indicating that they were out of the office or would reply later, and they had not yet done so.During the review process, reviewers required additional responses to increase the significance of the conclusions drawn in this paper.Thus, we performed an additional sent out of the survey on December 7th 2023 (448 emails and 330 issues), we stopped accepting responses on January 2nd 2024.
We received positive feedback from 26 people who replied to our e-mail or GitHub post expressing congratulations for the initiative.Conversely, 8 developers complained about the creation of an issue in their repository.One of the responses received requested permission to share the survey with other colleagues working with mutation testing tools.Other of the participants shared the link to the survey on Twitter encouraging to participate2 .

D. Qualitative Analysis
In this section, we comment on the survey responses received and the analysis of the collected data.

1) Survey Responses:
The 127 questionnaire responses received correspond to a response rate of 10.65% if we consider only the emails sent (notice that we also posted issues on GitHub).The raw data of the responses provided by all the participants is available in the supplementary material [43].
Twenty-three responses were excluded from the analysis for several reasons.One participant did not consent to the conditions of the survey and did not submit any further information.Three participants did not acknowledge the use of any mutation testing tool.Other nineteen responses were discarded because their participants were not able to identify the mutation testing tool used in their projects and they did not answer any other question in the survey.The final dataset contains 104 valid responses (see the last column of Table II for the detailed number of answers per question).
2) Qualitative Analysis of Survey Data: For the analysis of the qualitative data obtained through the survey, we used a combination of narrative analysis, different kinds of charts, and correlation computed using the Mathews coefficient [44] to analyse the groups of answers that were usually answered together for multiple option questions.
Specifically, for the analysis of responses using a Likert scale, we use divergent stacked bar charts as recommended in [45].In order to show the distribution of responses among tools, we use horizontal stacked bar charts, with the specific answers provided in the y-axis and the number of participants in the x-axis, using the hue of the bars to denote the specific tools used by the participants (see Figs. 2 and 3 in the next section).

IV. FINDINGS
In this section, we present the findings after the analysis of the 104 valid responses to the survey (described in the Appendix, available online).Specifically, we first provide demographic data of the participants and then respond to each of the RQs by analysing the answers to the related questions in the questionnaire (see Table II).

A. Demographics
Table III shows the demographic information of the 104 valid participants.Most of them were men (93%), between 26 and 55 years old (89%), with a long experience as developers (73.1% have more than ten years of experience), and have been using mutation testing for several years (only 19% of the participants has less than one year experience using mutation testing).
The most popular tool among participants is Infection [9] with 27.9% of participants using it in their projects.The different flavours of Striker are also quite popular among participants: 17.3% of participants use the Javascript version [38], 8.7% use the.Net version [39].PIT [10] is used by 26.9% of the participants.Mujava [32], Major [21], Stryker for Scala [46], and a custom tool designed for the Rust programming language were each reported as being used by only one participant in the survey.Interestingly, Stryker for Scala and the custom tool were not initially considered for the purposes of repository and contributor selection.It is plausible that the participants who used these tools learned about the survey through a colleague or the tweet mentioned above.Participant P 106 stated using the "mocha" testing framework as the mutation tool.Given that the known method of employing Mocha for mutation testing is through integration with StrikerJS3 , we interpreted this as an indirect reference to StrikerJS as the mutation tool.
Out of the 104 remaining participants, 41 (39.4%) learned about mutation testing by recommendation of a friend or colleague, and 35 (33.7%) learned it from a blog or publication, as shown in Fig. 2.

B. Use of Mutation Testing Tools Among Software Developers (RQ 1 )
In this section, we respond to RQ 1 by analysing the answers to questions Q 1 -Q 7 in the questionnaire.
1) Roles on the Use of Mutation Testing (Q 1 ): Most participants (86.5%) took an active role in the use of mutation testing by actively improving their test cases based on the results of the mutation coverage.A small portion of participants (7.7%), however, indicated that they had just seldom checked the mutation score, without taking further actions.A significant portion of the respondents (73.1%) contributed to the integration and setup of the mutation testing tools in their projects, and over half of them (55.8%)changed the configuration of the mutation tool to meet their needs (see Section IV-B6 for more details).
A positive correlation was found (0.64 using the Matthews correlation coefficient [44]) between participants who contributed to the integration of the mutation testing tools into their projects and those who customised the configuration of the tool.
2) Aim of Using Mutation Testing (Q 2 ): The primary objective of using mutation testing is measuring and improving the quality of the test cases (86.5% of respondents).Among those marking the others option (13.5%), some respondents used the free-text answer to specify their motivations for the use of mutation testing-all of them related to software quality improvement.For example, participant P 30 , user of Infection [9], reported that (s)he uses mutation testing "for removing unnecessary tests and code".Similarly, P 17 , user of Mutant [33], uses the technique "to reduce the code to the minimum needed to pass tests", and P 29 , also a user of Mutant, reported its use for "reducing unused and unwanted code, arriving at most narrow meaning of the code".These answers indicate that mutation testing is not only used to improve the fault-detection capability of test cases, but also to improve code efficiency and maintainability through refactoring, such as eliminating redundant or irrelevant code.It is noteworthy that participant P 103 , user of PIT, indicated the adoption of mutation testing as a measure "to meet the project policies", implying its institutionalization within the project testing guidelines.However, the participant also noted that they "unfortunately stopped using it on a regular basis after the project terminated".
3) When and How Often Mutation Testing Tools Are Used (Q 3 ): Participants stated that they use mutation testing tools on demand (38.5%), every time test cases are executed (26.9%), before committing changes to the repository (19.2%), and prior to deployment in production (19.2%).Nine participants (8.6%) Authorized licensed use limited to the terms of the applicable license agreement with IEEE.Restrictions apply.
4) Specific Type of Test Cases (Q4): The majority of participants (70.2%) do not limit the calculation of mutation coverage results to a particular type of test case. However, among those participants who do restrict the calculation, the majority focus on unit tests (73.8%), while a smaller portion considers integration tests (21.4%).
5) Invocation of the Mutation Testing Tool (Q5): A majority of the participants (63.5%) commonly utilise mutation testing tools by integrating them into their continuous integration/deployment workflows using tools like Jenkins or GitHub Actions. Additionally, build tools such as Maven or Make are employed by 43.3% of the respondents. It is worth noting that approximately two-thirds of the participants also execute mutation tools directly from the command line.
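As an illustration of the build-tool integration reported above, PIT attaches to a Maven build via its official `pitest-maven` plugin. The excerpt below is a sketch: the plugin coordinates are real, but the version and the `com.example` package names are placeholders that would vary per project:

```xml
<!-- pom.xml excerpt; invoked with: mvn org.pitest:pitest-maven:mutationCoverage -->
<plugin>
  <groupId>org.pitest</groupId>
  <artifactId>pitest-maven</artifactId>
  <version>1.15.0</version> <!-- example version; use the latest release -->
  <configuration>
    <targetClasses>
      <param>com.example.*</param> <!-- hypothetical package to mutate -->
    </targetClasses>
    <targetTests>
      <param>com.example.*</param> <!-- hypothetical tests to run against mutants -->
    </targetTests>
  </configuration>
</plugin>
```

Binding the plugin to a lifecycle phase instead of invoking the goal manually is what makes mutation coverage part of every build, as most respondents reported.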
6) Modification of Configuration (Q6): Half of the participants (50%) have modified the configuration of the mutation testing tool used in their projects. This highlights configurability as a relevant issue for the adoption of mutation testing tools. Adjustments include, among others, selecting specific mutation operators, restricting mutation to certain file patterns, or defining timeouts. The most usual adaptations are the filtering of the code to be mutated (34.9% of the reported adaptations) and the selection/customisation of the mutation operators to be applied (25.6% of the reported adaptations). As an example, participant P26 reported modifications on Infection for "adjusted mutations, restricted mutated code, adjusted timeouts and limits, specified multiple output/report targets". Similarly, participant P70, user of Stryker4s, reported modifications for "restricting to file patterns that should be mutated. Restricted the test suites that the mutation framework should run. Added custom build tasks to run mutation tests check on multiple restricted modules at once with a single build command. Used annotation to escape specific cases, etc.".
7) Selection of the Mutation Testing Tool (Q7): As summarised in Fig. 3, the causes that motivated the choice of the specific mutation testing tool were mainly popularity (53.8%) and usability (46.2%), followed by performance (21.2%) and configurability (16.3%).
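Configuration adjustments like those reported for Q6 are typically expressed declaratively in a tool-specific file. For instance, an Infection-style `infection.json` can restrict the mutated sources, tune timeouts, and enable or disable individual mutators; the sketch below is illustrative, and the exact keys may differ between tool versions:

```json
{
    "source": {
        "directories": ["src"],
        "excludes": ["Generated"]
    },
    "timeout": 10,
    "mutators": {
        "@default": true,
        "TrueValue": false
    },
    "logs": {
        "text": "infection.log"
    }
}
```

Here the `excludes` entry filters the code to be mutated (the most common adaptation reported) and the `mutators` block customises the operator set (the second most common).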

Summary of answers to RQ 1
1) The majority of participants (86.5%) employ mutation testing tools as a means to assess and enhance the quality of their test cases.
2) Some participants (13.5%) go beyond solely focusing on test cases and utilise mutation testing tools to improve the overall quality of their code, e.g., removing redundant or irrelevant code.
3) Mutation testing tools are commonly integrated into continuous integration/deployment workflows (63.5%) and build tools (43.3%), although they are also run from the command line (66.7%).
4) The choice of a specific mutation testing tool is mainly driven by its popularity (53.8%) and usability (46.2%).
5) Half of the developers (50%) have modified the configuration of the tool to meet their needs.

C. Perceived Impact of Mutation Testing in Practice (RQ 2 )
In this section, we respond to RQ2 by examining the answers to questions Q8-Q10.
1) Quality Improvement (Q8): Overall, a large majority of participants (96.2%) indicated that mutation testing had contributed to improving the quality of their projects. However, we must exercise caution when interpreting such perceived improvement. As we contacted contributors of repositories using a mutation testing tool, included as a dependency in their project, it is likely that they have a positive opinion about the technique. Those contributors who used a mutation testing tool but did not perceive any positive impact may have removed the tool dependency from their projects and thus were not included in our survey.
When asked about the specific improvements achieved, participants reported that mutation testing helped in improving existing test cases (83.7%), guiding the creation of new test cases (76.0%), detecting more bugs (65.4%), and deciding when to stop testing (48.1%). Fig. 4 shows the specific contributions to quality reported by participants.
Some participants utilised the optional free-text answer to elaborate on the ways in which mutation testing proved advantageous for them. For instance, P26, a user of Infection, stated that "It discovered security issues". Additionally, several participants such as P29, P68, and P69 leveraged the tool to eliminate dead code, and P48 stated that it "Helped prune out unnecessary code branches".
Four participants (3.8%) reported no improvement in their projects. One participant stated that the tool only provided false positives, while two participants noted that the tool was so slow to run that it was effectively useless, or that it crashed when they tried to broaden the scope of its application in the project. These comments highlight potential areas for improvement in mutation testing tools. The fourth participant who perceived no improvement reported that the tool identified missing test coverage, but did not uncover any actual bugs. It is noteworthy, however, that these negative opinions still suggest that mutation testing provides value, identifying areas lacking coverage and aiding in the development of necessary tests that would enhance confidence in the software, potentially preventing unnoticed or future bugs.
2) New Testing Practices (Q9): Mutation testing tools appear to have a significant impact on testing methodologies in projects, with 51.9% of participants stating that they contributed to the adoption of new testing practices. When asked about the specific testing practices induced, 33.7% of participants reported generating mutation testing coverage reports as part of the build process, 34.6% of participants defined a minimal mutation coverage threshold, and 27.9% enforced a minimal mutation score during the build process. These findings seem to suggest that emerging research on incremental [47] and delta-aware mutation testing [48], [49] is a fruitful research direction. Additionally, in a response to Q14, participant P10, user of MutMut [34], indicated that "mutation testing encourage the team to adopt Test-Driven Development", which shows the potential of mutation testing to impact testing practices.
3) General Satisfaction (Q10): The majority of participants (89.4%) are either very satisfied (40.4%) or satisfied (49.0%) with the impact of mutation testing in their project. Nine participants (8.7%) expressed neutrality, answering 'neither satisfied nor dissatisfied'. Only one participant chose the option 'not satisfied', while another single participant selected 'not at all satisfied'. As previously mentioned, this result must be taken with caution since we contacted contributors of projects where mutation testing is already in use, and therefore it is likely that they are positive about the technique.
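The threshold definition and enforcement practices reported in Q9 are also typically declarative. For example, a StrykerJS `stryker.conf.json` can be configured to fail the build when the mutation score drops below a limit; the sketch below uses illustrative threshold values and assumes a Jest-based project:

```json
{
  "testRunner": "jest",
  "mutate": ["src/**/*.js"],
  "thresholds": {
    "high": 80,
    "low": 60,
    "break": 50
  }
}
```

With this configuration, a mutation score below `break` causes a non-zero exit code, which in turn fails the CI pipeline, institutionalising a minimal mutation score in the build process.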

Summary of answers to RQ 2
1) The majority of the developers asked (96.2%) consider that mutation testing contributed to improving the quality of their software projects.
2) Developers perceive mutation testing as a helpful technique for guiding the improvement (83.7%) and creation (76.0%) of test cases, as well as deciding when to stop testing (48.1%).
3) Almost two out of every three participants (65.4%) find mutation testing helpful for detecting bugs in their software.
4) Mutation testing fosters the adoption of new testing practices, including the integration of mutation coverage computation as a part of the build process (33.7%) and the definition (34.6%) and enforcement (27.9%) of minimal coverage thresholds.
5) A vast majority of the participants (89.4%) are either satisfied (49.0%) or very satisfied (40.4%) with the impact of mutation testing in their projects.

D. Developer Satisfaction With Mutation Testing Tools (RQ 3 )
In this section, we respond to RQ3 by analysing the answers to questions Q11-Q14.
1) Usability, Performance, and Usefulness (Q11): Fig. 5 shows a diagram of divergent stacked bars summarising the answers to this question. The specific distribution of responses regarding usability, performance, and usefulness for each specific tool is depicted in Fig. 6. A minimum of 71% of respondents provided positive responses (agree or strongly agree) regarding satisfaction on all the specific aspects evaluated through the questionnaire: usability, performance, and usefulness. Conversely, negative responses (disagree or strongly disagree) on any aspect evaluated were consistently below 14%. Overall, the mutation testing tools are perceived as useful (93% of positive responses), user-friendly (80% of positive responses), and performant (71% of positive responses). Negative views are mostly related to performance (14% of negative responses), followed by usability (7% of negative responses).
2) Main Benefits of Mutation Testing Tools (Q12): When asked about the main benefits of using mutation testing tools in their projects, a free-text question, the participants identified as key benefits the enhancement of test suite quality, the improvement of test cases, the identification of missing test cases, and the detection of bugs. As expected, this is in line with the responses to Q8 regarding how mutation testing has contributed to improving the quality of their projects. Fig. 7 shows a word cloud generated from the responses, in which the size of each word is determined by the number of appearances in the answers. Common terms such as "mutation", "test", "testing", "code", and "cases" were excluded from the word cloud as they appeared in almost every response. Notably, the words "quality" and "improving" (and the related term "improve") appear prominently, indicating that they were frequently mentioned by the participants. Some participants stressed the ability of mutation testing to measure code coverage and identify relevant test inputs: "When I code a feature I have in mind all the corner cases, but then not always write tests for them - then mutation tests will help me to 'remember' finding hidden bugs and missing test cases" (P75, using Infection); whereas others emphasised the ability of the technique to improve test oracles: "To prove code is tested, necessary, and that the code and test are making the strongest assertions they can." (P78, using Mutant). P113, using StrykerJS, stated that the main benefit was "Helping developers who don't do test first development". More generally, some participants pointed out that the main benefit of mutation testing was that developers "feel more secure about the quality of the unit tests" (P63, using PIT), or that it contributes to "finding undiscovered bugs and help increasing test coverage" (P35, using PIT).
3) Main Limitations of Mutation Testing Tools (Q13): When asked about the limitations of the tools, performance emerged as the most significant issue for the users of 6 out of the 10 mutation testing tools under study. Specifically, 17 out of the 36 answers referred to performance as a key limitation.
This finding is highlighted in the word cloud shown in Fig. 8, generated from the answers to this question. The words "performance", "speed", and "time" appeared frequently and were particularly prominent in the cloud, indicating that they were major concerns for the participants. For instance, P26 stated that "tools' performance on large projects makes it only viable for libraries", and P31 claimed that "it's impossible to run on the whole project, due to performance/scale". Participants P46, P47, and P51 just answered "performance", "performance", and "speed", respectively. These findings suggest that, despite research advances in cost-reduction techniques for mutation testing, at least those included in the tools used (e.g., bytecode mutation and coverage-driven mutant execution), performance is still a major obstacle for the adoption of mutation testing in practice. It must be noted here that the recent advances in learning-based mutant selection and mutant prioritisation [49], [50], [51], which promise to reduce overheads, have not been integrated into the studied tools, indicating the need for better tooling and the adoption of research advances.
Other responses pointed out the need for improvement of the tools in terms of usability (e.g., "ease of use & setup overhead") and configurability ("...Conditional tests, depending on specific versions of language or library are not taken into account").
4) Single Improvement for Mutation Testing Tools (Q14): Consistently with the above answers, when developers were asked about the single improvement they would like to see in mutation testing tools, a majority of participants (25 out of 57) requested better performance. Other responses referred to usability and configurability issues such as "an IDE integration. I would test any class/method in one click" (P37, using Infection [9]), "good documentation on how to set up in different environments (build tools, CI/CD)" (P63, using PIT [10]), and "a better suppression system" (P69, also using PIT [10]). Finally, one participant pointed out that the main single improvement would be "Mostly just educating developers around what it is" (P104, using Infection).

Summary of answers to RQ 3
1) The majority of participants (89.4%) are satisfied with the mutation tool used as a way to enhance test suites, detect bugs and, more generally, improve code quality.
2) Seven out of every ten developers are satisfied with the usefulness, usability and performance of the mutation tool used in their projects.
3) Performance clearly emerges as the main limitation of mutation tools and as the most pressing concern for developers.

V. THREATS TO VALIDITY
To facilitate a comprehensive examination of potential validity threats to our work, we use the terminology and taxonomy established in [52]. We have also tailored the threats to survey studies outlined in [53] to align with the categories in this taxonomy. Next, we detail the specific threats to validity for each category and outline the corresponding actions taken to mitigate them.

A. Internal Validity
Internal validity refers to whether there is sufficient evidence to support the conclusions and the sources of bias that could compromise those conclusions.
Our study draws on the 3,581 GitHub repositories using mutation testing tools resulting from our previous work [4], which is based on the assumption that software projects importing a mutation testing tool as a library most likely use or have used that tool at some point, although this might not always be true. This threat was minimised by the number of manually reviewed repositories, over 3.5K, which dilutes the potential effect of the unlikely cases where the mutation tool is imported but not used.
In the current study, we only selected the repositories classified as development and with owners at industry or non-academic public institutions. The manual review and classification of the repositories in our previous work could also have threatened the validity of the results, since some repositories could have been misclassified. However, the authors participated actively in the classification process, following a common review procedure and discussing doubtful cases in several working sessions until reaching a consensus.
In this work, we proceeded to collect information about the owner and contributors of the selected GitHub repositories. The participant data collection method could also threaten the validity of our work. To mitigate this threat, we resorted to automated queries on the GitHub REST API to obtain the data of contributors and owners of the target repositories.

B. External Validity
External validity in surveys focuses largely on the representativeness of the sample for the target population of the study. In this sense, the implications of the repository mining and tool selection procedures are discussed.
In order to study the use of mutation tools in practice, we focused on the GitHub repositories selected in our previous work [4], as mentioned in the previous section. Widening the scope of our work to other platforms beyond GitHub (e.g., Bitbucket [54] or GitLab [55]) could have yielded different results. However, the size and popularity of GitHub in related mining studies make us confident in the validity of the reported trends.
To ensure that our study was feasible with our available resources, we limited the selection of repositories to a maximum of 100 for each tool (ordered by popularity, i.e., number of stars) and a maximum of 100 contributors for each repository (ordered by contributions, i.e., number of commits). This resulted in 503 repositories under study with 1,192 potential participants, as shown in Table I. We are aware that this may have left contributors who have worked with mutation tools in their repositories out of the survey. However, to reduce this possibility, we worked in two directions. Firstly, as already mentioned, we contacted the 100 contributors with the highest number of contributions in each repository, in order to reach the most valuable and most active contributors in the project. Secondly, we also posted an issue with the link to the survey on the most popular repositories using the mutation tools of the study, trying to reach any interested and mutation-experienced contributor of the repository. In addition to this, we also allowed contributors to share the survey directly with other colleagues in the project who were working with mutation testing.
The method followed for selecting repositories introduces bias, as only those repositories using mutation testing tools as libraries or dependencies were considered. Contributors who used such tools but discontinued their use were not selected. This threat is hardly mitigable, as finding such dependencies or libraries would require going through the entire version history of each repository in GitHub, making the study practically unfeasible. Consequently, the findings and conclusions of our survey are not universally applicable to all open-source developers who used mutation testing. Instead, they specifically pertain to developers who continued using and finding value in the tool. This selection bias may inadvertently overlook significant issues and drawbacks associated with mutation testing. Despite this, the study provides valuable insights into how mutation testing is currently used in practice, highlighting the perspectives of developers on key benefits, drawbacks, and potential improvements in mutation testing tools.
Another factor affecting the generalizability of the study's results is the limited number of responses for some mutation testing tools, as indicated in Table III. To minimize the impact of this limitation on the study's conclusions, we have deliberately refrained from making specific assertions about individual mutation testing tools. The research questions and their corresponding answers are formulated to address mutation testing tools as a whole, rather than focusing on any particular tool.

C. Construct Validity
Construct validity evaluates whether a measurement tool really represents what we are interested in measuring. In this study, most of the questions in our survey focus on the actual practices of participants regarding mutation testing in their projects and their usage of the libraries and tools. For such types of questions, we provided as default options the practices that, according to the existing literature and the authors' experience, are more common. However, in all such cases, multiple responses were allowed. Also, we provided an open text option for participants to provide their views, to avoid a systematic bias towards the authors' perceptions. For some questions, Likert scales were used to measure attitudes and opinions with a greater degree of nuance than a simple yes/no question. For the design of such questions and scales, we followed the guidelines provided in [56]. Specifically, we formulated the statement of the question, followed by a series of five labelled options, where participants choose the option that best corresponds with how they feel about the statement. The specific labels used in such questions were "Strongly disagree", "Disagree", "Neither Agree nor Disagree", "Agree", and "Strongly Agree".
In order to ensure that we introduced no bias in responses due to the wording, structure, and type of questions included in the questionnaire, we followed Creswell's guidelines [40] for conducting qualitative research.

VI. OPEN RESEARCH DIRECTIONS
The results of this work provide a new perspective on the state of practice of mutation testing, laying the foundation for novel research contributions and tools that meet the needs of developers and foster its broad adoption.
During the analysis of the responses provided by participants, the authors realised that some interesting benefits and problems of the application of mutation testing tools were reported by developers through the free-text options of the questionnaire. This fact, along with the difficulties experienced in the design and analysis of the survey, suggests that a grounded theory on how mutation testing impacts software development could be a suitable research direction in the future.
The concerns of developers with the performance of mutation testing tools point to another research direction of practical interest. Using the contact information voluntarily provided by participants who wished to receive additional information about the results of this work, and considering that performance was pointed out as an important issue of mutation testing tools, we traced back a set of 16 different associated open-source repositories for six of the tools. This list of repositories has been added to the laboratory package of this work [43] and could be used by mutation tool developers and researchers as suitable experimental subjects on which performance improvements to such tools could be applied and validated.
Finally, this paper performed a wide and, to some extent, shallow study of the practical use of mutation testing by developers of open-source software. However, a deeper understanding of the phenomenon is still required, and in this sense, a study performing in-depth interviews with industry practitioners applying mutation testing would be a great complement to this paper, emerging as an open research direction for future work.

VII. CONCLUSION
This article presents the findings of a qualitative study on the use of mutation testing in practice. To this end, we contacted developers of GitHub repositories using mutation testing tools and invited them to participate in a survey, obtaining a total of 104 valid responses. Results reveal a strong sense of satisfaction among developers regarding mutation testing as a means to enhance test cases, detect bugs, decide when to stop testing, and improve overall code quality. Mutation coverage computation is mostly automated and integrated as a part of continuous integration and deployment workflows. Developers mostly choose mutation testing tools based on their popularity, but usability and configurability emerge as key adoption drivers too. Mutation testing fosters the adoption of new testing practices such as the definition and enforcement of minimal coverage thresholds. Among the points for improvement, performance emerges overwhelmingly as the most pressing concern for developers. This suggests that further efforts are required to bridge the gap between research advancements in cost-reduction techniques for mutation testing and their application in practice.

Fig. 3. Causes motivating the choice of the mutation testing tools.

Fig. 5. Global perception of the usability, performance, and usefulness of mutation testing tools.

Fig. 6. Distribution of responses regarding usability, performance, and usefulness per specific mutation testing tool.

Fig. 7. Word cloud generated from the responses of participants about the benefits of mutation testing tools.

Fig. 8. Word cloud generated from the responses of participants about the limitations of mutation testing tools.

TABLE I
EVOLUTION OF THE NUMBER OF REPOSITORIES AND CONTRIBUTORS PER TOOL RESULTING FROM EACH STEP OF OUR PARTICIPANT SELECTION PROCESS

TABLE III
DEMOGRAPHIC CHARACTERISTICS OF PARTICIPANTS (N = 63)