Cerebro: Static Subsuming Mutant Selection

Mutation testing research has indicated that a major part of its application cost is due to the large number of low utility mutants that it introduces. Although previous research has identified this issue, no previous study has proposed any effective solution to the problem. Thus, it remains unclear how to mutate and test a given piece of code in a best effort way, i.e., achieving a good trade-off between invested effort and test effectiveness. To achieve this, we propose Cerebro, a machine learning approach that statically selects subsuming mutants, i.e., the set of mutants that resides on the top of the subsumption hierarchy, based on the mutants' surrounding code context. We evaluate Cerebro using 48 and 10 programs written in C and Java, respectively, and demonstrate that it preserves the mutation testing benefits while limiting application cost, i.e., it reduces all application cost factors such as equivalent mutants, mutant executions, and the mutants requiring analysis. We demonstrate that Cerebro has strong inter-project prediction ability, which is significantly higher than that of two baseline methods, i.e., supervised learning on features proposed by the state of the art, and random mutant selection. More importantly, our results show that Cerebro's selected mutants lead to strong tests that kill 2 times as many subsuming mutants as the tests derived from the baselines when selecting the same number of mutants. At the same time, Cerebro reduces the cost-related factors, as it selects, on average, 68% fewer equivalent mutants, while requiring 90% fewer test executions than the baselines.


INTRODUCTION
Research and practice with mutation testing have shown that it can effectively guide developers in improving the strength of their test suites [3], [14], and can be used to reliably compare test techniques [5], [51]. A key issue, though, is that it is expensive, as a large number of mutants are involved, the majority of which are of low utility, i.e., they do not contribute to the testing process [3], [28], [31]. This means that mutation testers should filter their mutant sets using manual analysis to identify equivalent mutants [9], and perform numerous test executions to discard mutants that do not provide testing value, i.e., mutants that are detected by the tests designed to detect other mutants [3], [28], [31].
Working with large real-world systems makes the problem almost intractable due to the vast numbers of mutants involved. Test execution overheads alone can limit the scalability of the technique. For instance, in our experiments, we needed around 48 hours to execute the mutants for a single component of the systems we examined. At the same time, the manual effort required by testers escalates with larger programs, as the number of mutants grows proportionally to program size.
To reduce application cost, it is imperative to limit the number of mutants to those that are actually useful, prior to any manual mutant analysis or test execution. Thus, we need to identify which mutants are killable in order to limit the manual effort involved in their identification, and also to identify the mutants that are subsuming (disjoint)¹ to reduce unnecessary computations, and to provide accurate adequacy measurements [47].
1. The term disjoint mutants refers to a minimal subset of mutants that need to be killed in order to reciprocally kill the original set [31], [46].
• T. Checkam is with SES, Luxembourg.
This problem is known as the mutant selection problem [48] and has been studied in the form of selective mutation [44], [68], i.e., restricting the number of transformations to be used, with limited success [11], [35]. The key issue with mutant selection, though, is the simple syntax-based nature of the selection process: mutants are introduced everywhere with respect to simple language operators, e.g., by replacing an operator with another, in a way that completely ignores the semantics of the program and of the particular location. This operator-matching mutant selection has the unfortunate effect of introducing mutants independent of their context and program semantics.
We propose Cerebro², a machine learning technique that learns to identify interesting mutants given their context. In particular, we learn the associations between mutants and their surrounding code. Our learning scope is a relatively small area around the mutation point that differentiates, locally, the mutants that are useful from those that are not. This allows mutating the program elements to fit best to their context, instead of mutating entire codebases with every possible transformation, enabling inter-project predictions.
Cerebro operates at lexical level, with a simple code preprocessing. In particular, a mutant and its surrounding code is represented as a vector of tokens where all literals and identifiers, i.e., user defined variables, types, and method calls, are replaced with predefined, hence predictable, identifier names. This allows restricting the related vocabulary and learning scope to a relatively small fixed size of tokens around the mutation points. Learning is performed using a powerful and language-agnostic machine translation technique [8] that we train on related code fragments and their labels.
2. Cerebro is a fictional device appearing in Marvel comics used by the X-Men to detect human mutants. More details in https://en.wikipedia.org/wiki/Cerebro.
arXiv:2112.14151v2 [cs.SE] 1 Mar 2022
We consider useful, the subset of mutants that resides on top of the subsumption hierarchy and subsumes the others [34], aka subsuming mutants [28], for the set of all possible mutant instances produced by a given set of mutation operators. Mutant M 1 subsumes mutant M 2 if every test case detecting M 1 also detects M 2 . This implies that the tests detecting the subsuming mutant will also detect the subsumed ones thereby making subsumed mutants redundant.
We implemented Cerebro and evaluated its ability to predict (inter-project predictions) subsuming mutants on a large set of programs, composed of 48 C programs (CoreUtils) and 10 Java projects (Apache Commons, Joda-Time, and Jsoup). Our results demonstrate that Cerebro significantly outperforms both random mutant selection and a supervised machine learning approach (used by previous research) on both the C and Java benchmarks.
In particular, in Java projects, Cerebro obtained 2.81 times higher MCC³ values, an improvement of 82% in F-measure, 68.88% in Precision, and 85.71% in Recall over the state-of-the-art supervised machine learning. In C programs, Cerebro obtained 2.76 times higher MCC values, 3.72 times higher Precision, and a slightly increased Recall value (4% higher). The improvement measured in F-measure is approximately 65%.
To put the predictions into context and understand their influence on mutation testing, we also validated Cerebro in a controlled simulation of the envisioned use case. In particular, we simulate a scenario where testers are guided by mutation testing, i.e., they design test cases based on mutants. Therefore, fewer mutants imply less effort, while stronger mutants imply stronger tests. Our analysis shows that Cerebro achieved more than twice the subsuming mutation scores⁴ in both the C and Java programs that we use. At the same time, Cerebro required significantly less effort in terms of both analyzed equivalent mutants and test executions. In C programs, 3.70% of the mutants analyzed by Cerebro are equivalent, while 55.56% and 53.33% of those analyzed by random mutant selection and supervised learning, respectively, are equivalent; Cerebro also required 91% fewer test executions than both random selection and supervised learning. In Java programs, Cerebro required the analysis of 41% and 36% fewer equivalent mutants, and 92% and 87% fewer test executions than random mutant selection and supervised learning, respectively.
All-in-all our paper makes the following contributions:
1) We present Cerebro, a powerful static subsuming mutant selection technique.
2) We provide evidence suggesting that Cerebro successfully predicts subsuming mutants with 0.85 Precision, 0.33 Recall and 0.46 MCC.
3) We show that Cerebro significantly outperforms the current state-of-the-art, i.e., random mutant selection and previously proposed machine learning technique, by revealing 2 times the subsuming mutants, while analyzing 64% to 67% fewer equivalent mutants and requiring 89% to 92% fewer test executions.
The remainder of the paper is organized as follows. Section 2 introduces preliminary concepts necessary in subsequent sections. Section 3 describes the envisioned use case for Cerebro and elaborates on a particular motivating example. Section 4 describes the approach in detail. Section 5 introduces the research questions and Section 6 details the experimental setup. The results of our experimental evaluation are summarized in Section 7. We discuss threats to validity in Section 9. In Section 8 we also discuss the impact of the abstraction process and mutants' context size on Cerebro's prediction performance. Finally, we discuss related work in Section 10, and present our conclusion and future work in Section 11.
3. The Matthews Correlation Coefficient (MCC) [41] is a reliable metric of the quality of prediction models [56], relevant when the classes are of very different sizes, e.g. in case of C programs, 10.2% subsuming mutants (positives) over 89.8% non-subsuming mutants (negatives).
4. Subsuming mutation score (MS*) is the ratio of the killed and the total number of subsuming mutants.

Subsuming Mutants
Mutation is a test adequacy criterion in which test requirements are represented by mutants that are obtained by performing slight syntactic modifications to the original program. Then, the tester needs to design test cases in order to kill the mutants, i.e., to distinguish the observable behavior between the mutant and the original program. Some mutants cannot be killed as they are functionally equivalent to the original program. Hence, the quality of a test suite is measured by the mutation (adequacy) score, a percentage metric obtained by the ratio of killed mutants over the total number of (non-equivalent) generated mutants.
Mutation testing is a promising, empirically validated software testing technique that hasn't achieved its full potential yet [48]. It is often considered as computationally expensive, mainly due to the large number of mutants that it introduces, which require analysis and execution with the related test suites. One may notice that the number of mutants is disproportionate with the number of test cases to kill them, since one test case can kill several mutants at the same time. Thus, the effort put into analyzing and executing mutants that do not help to improve test suites is wasted. Hence, it is desirable to analyze only the mutants that add value, i.e., subsuming mutants [3], [28], [31], [34].
Intuitively, subsuming mutants are the minimum subset of all mutants that when killed, by any possible test suite, results in killing the entire set of killable mutants. Given two mutants M 1 and M 2 , it is said that M 1 subsumes M 2 if every test suite T killing M 1 also kills M 2 . Unfortunately, identifying subsuming mutants is undecidable as it is not possible to know a mutant's behavior under every possible input. Thus, researchers typically approximate them through test suites [3], [28], [35], [46], [47].
More precisely, let M 1 and M 2 be two mutants and T a test suite, where T 1 ⊆ T and T 2 ⊆ T are the sets of tests from T that kill mutants M 1 and M 2 , respectively, and T 1 ≠ ∅ and T 2 ≠ ∅, indicating that both M 1 and M 2 are killable mutants. We say that mutant M 1 subsumes mutant M 2 if and only if T 1 ⊆ T 2 . In case T 1 = T 2 , we say that mutants M 1 and M 2 are indistinguishable for T . The mutants that are both killable and subsumed only by indistinguishable mutants are called subsuming mutants.
For example, consider a set of 3 mutants (M 1 , M 2 , and M 3 ) and a test set T = {t 1 , t 2 , t 3 }, where M 1 is killed by T 1 = {t 1 }; M 2 is killed by T 2 = {t 1 , t 2 }; and M 3 is killed by T 3 = {t 3 }. Notice that every time we run a test (t 1 ) that kills mutant M 1 we also kill mutant M 2 . However, the opposite does not hold, since t 2 kills M 2 but not M 1 . Thus, we have two subsuming mutants, i.e., M 1 and M 3 .
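The test-suite-based approximation above can be sketched in a few lines of Python. This is only an illustration of the definition, not the tooling used in the study; the function name `subsuming_mutants` and the dictionary layout of the kill data are assumptions made for the example.

```python
from typing import Dict, Set


def subsuming_mutants(kills: Dict[str, Set[str]]) -> Set[str]:
    """Return the killable mutants that are subsumed only by indistinguishable mutants.

    `kills` maps each mutant to the set of tests (from the available suite) that kill it.
    """
    # Equivalent or unkilled mutants have an empty kill set and are ignored.
    killable = {m: tests for m, tests in kills.items() if tests}
    result = set()
    for mutant, tests in killable.items():
        # A mutant is strictly subsumed when another mutant's kill set
        # is a proper subset of its own kill set.
        strictly_subsumed = any(
            other_tests < tests
            for other, other_tests in killable.items()
            if other != mutant
        )
        if not strictly_subsumed:
            result.add(mutant)
    return result


# Kill sets from the example in the text.
kills = {"M1": {"t1"}, "M2": {"t1", "t2"}, "M3": {"t3"}}
print(sorted(subsuming_mutants(kills)))  # ['M1', 'M3']
```

Because the kill sets depend on the available tests, the result is an approximation with respect to that suite, as noted in the text.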
Subsuming mutation score (MS*) is the ratio between killed subsuming mutants over the total number of subsuming mutants [47]. Subsuming mutation score has been proposed [3], [31], [47] as a reliable metric to evaluate the effectiveness of testing techniques as it does not consider the presence of subsumed mutants. Subsumed mutants can artificially inflate the mutation score of a testing technique and can mislead its apparent ability to detect faults. For instance, following our previous example, a test suite {t 1 , t 2 } kills 66.7% of all the mutants (i.e., M 1 and M 2 ), but 50% of the subsuming ones (M 3 is not killed).
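The difference between the two scores in this example can be checked with a short calculation. The helper name `score` is hypothetical; the kill sets and the subsuming set {M 1 , M 3 } are the ones given in the running example.

```python
from typing import Dict, Set


def score(kills: Dict[str, Set[str]], suite: Set[str], mutants: Set[str]) -> float:
    """Fraction of `mutants` killed by at least one test in `suite`."""
    killed = [m for m in mutants if kills[m] & suite]
    return len(killed) / len(mutants)


kills = {"M1": {"t1"}, "M2": {"t1", "t2"}, "M3": {"t3"}}
suite = {"t1", "t2"}

ms = score(kills, suite, set(kills))        # all (killable) mutants
ms_star = score(kills, suite, {"M1", "M3"})  # subsuming mutants only
print(f"MS = {ms:.1%}, MS* = {ms_star:.1%}")  # MS = 66.7%, MS* = 50.0%
```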
Interestingly, killing subsuming mutants leads to the killing of all killable mutants; thus, testers need to focus mutation analysis on subsuming mutants. The problem, though, is that one needs to know the subsumption relations between mutants in advance, before starting to analyze the mutants and design tests. To deal with this issue, we introduce Cerebro, a static technique that predicts subsuming mutants without requiring any dynamic analysis, with the aim of helping testers decide which mutants to use when performing mutation-guided test generation [23], [49].

Machine Translation
Machine Translation can be considered as a transformation function transform(X ) = Y , where the input X = {x 1 , x 2 , . . . , x n } is a set of entities that represents a component to be transformed, to produce the output Y = {y 1 , y 2 , . . . , y n }, which is a set of entities that represent a transformed (desired) component. In the training phase, the transformation function learns on the example pairs (X , Y ) available in the training dataset. In our context, X contains the source code with an annotation that indicates the location and type of the mutation operator applied, and Y contains the same information, plus a label that indicates whether the mutant is subsuming or not.
The transformation function is trained to append the label to a given mutant by training the function on the example pairs (Code+MutationAnnotation, Code+MutationAnnotation+Label), where Code+MutationAnnotation represents the source code with an annotation in the statement to indicate the mutation operator type applied. This learned transformation is used as our prediction model for predicting subsuming mutants. Among the several machine translation algorithms that have been suggested over the past years, we use the RNN Encoder-Decoder which is established and is used by many recent studies [59], [61], [62].

RNN Encoder-Decoder architecture
The RNN Encoder-Decoder machine translation is composed of two major components: an RNN Encoder to encode a sequence of terms x into a vector representation, and an RNN Decoder to decode the representation into another sequence of terms y. The model learns a conditional distribution over an (output) sequence conditioned on another (input) sequence of terms: P (y 1 ; . . . ; y m |x 1 ; . . . ; x n ), where n and m may differ. For example, given an input sequence x = Sequence in = (x 1 ; . . . ; x n ) and a target sequence y = Sequence out = (y 1 ; . . . ; y m ), the model is trained to learn the conditional distribution: P (Sequence out |Sequence in ) = P (y 1 ; . . . ; y m |x 1 ; . . . ; x n ), where x i and y j are space-separated tokens. A bi-directional RNN Encoder [8] (formed by a backward RNN and a forward RNN) is considered the most efficient to create representations as it takes into account both past and future inputs while reading a sequence [6].

Use Case Scenario
Figure 1 shows an overview of how the testing process is performed when it is guided by mutation. We adapted this figure from the one published in [4, Figure 5.2]. Given a program P as input, the mutation testing process starts by creating a set M of mutants forming the test requirements. Test requirements are satisfied when tests kill the mutants. Since the number of mutants is excessive and forms the key cost factor of mutation testing [48], testers select a subset M' of mutants from M to focus their analysis on. Then, testers pick a mutant m ∈ M' and design a test t capable of killing m, or judge it as equivalent and discard it. The process is repeated until the designed tests are capable of killing a predefined ratio of mutants (threshold). Finally, the designed test suite T is used to check the correctness of program P (w.r.t. test suite T ). If test suite T detects some bug in program P , then P has to be fixed and the same mutation testing procedure can again be employed.
It is worth mentioning that there are two major cost factors in mutation testing: equivalent mutants and subsumed mutants. Both introduce overheads during test generation and test execution, while leading to only minor test effectiveness improvements. Therefore, to reduce mutation testing effort while preserving its effectiveness, it is essential to focus on subsuming mutants. Hence, we develop Cerebro, a machine learning technique that learns from mutants' surrounding context to predict which mutants are subsuming. Given the input program P and the set M of mutants, Cerebro selects a subset M' of mutants that is probably subsuming (predicted as subsuming by Cerebro), to be used for mutation testing (to guide testers and evaluate test effectiveness). Based on M' , testers and/or automatic test generation techniques can focus on the few strong mutants and design effective test cases.
Figure 2 (c): Our motivating example shows that mutants selected by Cerebro lead to stronger test suites than those designed to kill randomly selected mutants, when an equal number of mutants is analyzed.

Motivating Example
Let us consider the code snippet of function max of Figure 2a, which takes three integers as input and returns the maximum number among them. Also, consider (for simplicity) that ... For the sake of a thorough demonstration, we observed scenarios under the following testing conditions: a test case invoking max(1,2,0) and expecting 2 as a result kills mutant M 3 , as well as mutants M 0 , M 2 , M 5 , M 8 , and M 9 . But tests invoking max(2,0,1), max(1,0,2), and max(0,2,1) will kill mutants M 0 , M 2 , M 5 , M 8 , and M 9 , but not M 3 . Figure 2b shows a graph representation of the subsumption relation between the 9 killable mutants. Moreover, Figure 2b shows that M 3 subsumes M 0 , M 5 , M 8 and M 2 . Notice in particular that mutants M 5 and M 8 are indistinguishable, since they are killed by the same tests, and subsume mutant M 9 . Mutants M 1 and M 6 , in contrast, are equivalent.
In summary, mutants M 3 , M 4 and M 7 are subsuming, indicating that in order to kill every killable mutant it is sufficient to kill only these 3 subsuming mutants.
Cerebro will take as input the program max and the set of mutants, and it will point to those that are most likely subsuming. In an ideal scenario, Cerebro would point only to M 3 , M 4 and M 7 , but it is possible, as with every machine learning based technique, that it makes some mistakes, i.e., incorrect predictions of subsuming mutants, pointing to some non-subsuming mutants (subsumed or equivalent) as subsuming.
For instance, consider the case in which Cerebro predicts M 3 , M 4 and M 10 as subsuming mutants. A tester will then incrementally design test cases to kill all the predicted mutants. Assume that the tester starts by analyzing mutant M 3 and designs a test to kill it, e.g., by invoking max(1,2,0). This test does not kill the rest of the selected mutants. The tester then proceeds to analyze the surviving mutant M 4 , and designs a test that invokes max(2,0,1) to kill it. Finally, the tester designs a test by invoking max(0,1,2), which kills mutant M 10 and also the (non-selected) subsuming mutant M 7 . Notice that this test suite, designed to kill all mutants selected by Cerebro, progressively increases the MS*: the first test kills subsuming mutant M 3 , leading to an MS* of 33.33%; the second test kills subsuming mutant M 4 , obtaining 66.66% of MS*; and finally, the third test collaterally kills subsuming mutant M 7 , leading to an MS* of 100%.
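The MS* progression of this scenario can be replayed with a small sketch. The kill sets below list only the kills that the text mentions for each test (they are not a complete kill matrix), and the function name `ms_star_progress` is hypothetical.

```python
from typing import List, Set, Tuple

SUBSUMING = {"M3", "M4", "M7"}

# Tests in the order the tester designs them, with the kills stated in the text.
tests_in_order: List[Tuple[str, Set[str]]] = [
    ("max(1,2,0)", {"M3", "M0", "M2", "M5", "M8", "M9"}),
    ("max(2,0,1)", {"M4"}),
    ("max(0,1,2)", {"M10", "M7"}),
]


def ms_star_progress(tests: List[Tuple[str, Set[str]]], subsuming: Set[str]) -> List[float]:
    """Subsuming mutation score after each test is added to the suite."""
    killed: Set[str] = set()
    progress = []
    for _, kill_set in tests:
        killed |= kill_set & subsuming  # only subsuming mutants count towards MS*
        progress.append(len(killed) / len(subsuming))
    return progress


progress = ms_star_progress(tests_in_order, SUBSUMING)
print([f"{p:.0%}" for p in progress])  # ['33%', '67%', '100%']
```

The third step reaches 100% only because the test written for the mispredicted mutant M 10 happens to kill M 7 as well.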
Consider now a scenario in which mutants are selected randomly. For instance, assume that M 9 is the first one to be selected for analysis, and that a test case invoking max(0,2,1) is designed to kill it. This test collaterally kills mutants M 5 and M 8 , but it does not kill any subsuming mutant. Then, assume that equivalent mutant M 1 is randomly selected, adding no value to the testing process, but requiring analysis anyway. Afterwards, mutant M 0 is randomly selected, and a test case invoking max(2,0,1) is designed to kill it, which fortunately also kills subsuming mutant M 4 . Then, mutant M 2 is randomly selected, and the tester designs a test to kill it by invoking max(1,0,2). This test also kills mutant M 10 , but no subsuming mutant is killed. After that, the tester randomly selects mutant M 3 for analysis and designs a test by invoking max(1,2,0) to kill it. This test kills subsuming mutant M 3 and also mutant M 2 . Finally, mutant M 4 is randomly selected for which the tester designs a test to kill it, by invoking max(2,0,2). Hence, all subsuming mutants are killed.
In this particular scenario we can observe that MS* remains at 0% after analyzing the first 2 randomly selected mutants, and reaches an MS* of 33.33% after analyzing the third randomly selected mutant. The analysis of the fourth selected mutant (non-subsuming) did not add value (MS* remains the same). Finally, the fifth and sixth analyzed mutants were subsuming, leading to a test suite that obtains an MS* of 100% after analyzing 6 mutants. Figure 2c depicts the progress of MS* obtained by the test suites when guided by Cerebro and by random mutant selection in the previously described scenarios. Through this example we demonstrate a case where two approaches analyze the same number of mutants (same effort), with Cerebro having higher effectiveness (MS*) than the random mutant selection baseline. At the same time, in order to reach the same MS* as Cerebro, random mutant selection needs more effort, i.e., it requires the analysis of more mutants than Cerebro (in the example, the random baseline analyzed twice as many mutants as Cerebro).
There are several points we want to highlight about the particular scenarios just described. First, it is essential to notice that mutants selected by Cerebro will be as close as possible to subsuming in the subsumption relation. Killing these (almost subsuming) mutants can help in killing subsuming mutants predicted as non-subsuming by Cerebro; for instance, the test that kills subsumed mutant M 10 also kills subsuming mutant M 7 , which was incorrectly predicted as non-subsuming by Cerebro. Second, it is also important to notice that Cerebro selects the least possible number of equivalent mutants, saving the tester analysis time (in the example, Cerebro did not predict any equivalent mutant as subsuming). Third, notice that the prediction performance obtained by Cerebro does not necessarily reflect its effectiveness in practice, since mutant kills are not independent of one another. While Cerebro reached 66.66% Precision and 66.66% Recall in the example, in practice, the test suite designed to kill all selected mutants obtains a subsuming mutation score (MS*) of 100%. Fourth, it is worth studying the trade-off between the effectiveness and effort of the different mutant selection techniques. We consider all these points in our empirical evaluation to assess the prediction performance, effectiveness, and effort required by Cerebro and the related mutant selection techniques.
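The Precision and Recall figures of the example can be reproduced as follows. The MCC value printed by this sketch is not reported in the text; it is derived here under the assumption that the example contains exactly the 11 mutants M 0 to M 10 (9 killable plus 2 equivalent). All function names are hypothetical.

```python
from math import sqrt
from typing import Set, Tuple


def precision_recall_mcc(predicted: Set[str], actual: Set[str],
                         universe: Set[str]) -> Tuple[float, float, float]:
    """Precision, Recall and Matthews Correlation Coefficient of a subsuming-mutant prediction."""
    tp = len(predicted & actual)             # predicted subsuming, truly subsuming
    fp = len(predicted - actual)             # predicted subsuming, actually not
    fn = len(actual - predicted)             # subsuming mutants that were missed
    tn = len(universe - predicted - actual)  # correctly predicted as non-subsuming
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return precision, recall, mcc


universe = {f"M{i}" for i in range(11)}  # assumption: mutants M0..M10
predicted = {"M3", "M4", "M10"}          # Cerebro's selection in the example
actual = {"M3", "M4", "M7"}              # true subsuming mutants

precision, recall, mcc = precision_recall_mcc(predicted, actual, universe)
print(f"Precision={precision:.4f} Recall={recall:.4f} MCC={mcc:.4f}")
```

With 2 true positives, 1 false positive, 1 false negative and 7 true negatives, Precision and Recall are both 2/3 and the MCC is 13/24.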

APPROACH
The main objective of Cerebro is to automatically learn the salient features/patterns of the context surrounding subsuming mutants, without requiring any feature definition and/or selection by human intervention, so that we can later predict whether mutants in unseen source code are likely to be subsuming or not. Thus, we train a machine translator (viz. an encoder-decoder model) to identify subsuming mutants, by feeding it with source code where the statement (to mutate) is annotated with the mutant type and its label (subsuming or not). Machine translators have been successfully used to translate text from one language to another, as they automatically recognize (i) the features of the language (to be translated) and (ii) the required translation (to the desired language). In our case, the translator is used to automatically identify the features of subsuming mutants without any investment of time and/or resources to define features.
After training, one can input to the translator an unseen mutant (source code where the statement to mutate is annotated with the mutation annotation). The translator will append the label to the mutant given as input, to predict whether it is subsuming or not. Figure 3 shows an overview of the implementation. For training, Cerebro takes a set of mutants and their corresponding label. In each mutant source code, the statement (to mutate) is annotated with the mutation annotation, and the model learns the label to be appended to this annotation, that indicates whether the mutant is subsuming or non-subsuming. We can summarize Cerebro's pre-processing, training and testing steps as follows:
1) Abstraction: Producing abstracted code of the actual source code by removing irrelevant information (e.g. comments) and replacing user-defined identifiers and literals (e.g. variable names) by predictable tokens;
2) Pairs Generation: Generating the pairs (input, expected output) to be used for training, by adding the corresponding label into the mutation annotations;
3) Training: Training the machine translator to learn which label is to be appended to the mutation annotations;
4) Testing: Utilizing the trained translator to predict and append labels to the mutation annotations present in unseen mutant source code.
In the remainder of this section we describe each of the aforementioned phases of our approach, in detail.

Abstracting the Irrelevant Information
A major challenge in dealing with raw source code is the huge vocabulary created by the abundance of identifiers and literals used in the code. A vocabulary of such a large scale may hinder the goal of learning features surrounding the subsuming mutants. Thus, to reduce vocabulary size, we abstract source code by replacing user-defined entities with re-usable identifiers. Figure 4 shows an actual code snippet (Figure 4a) converted into its abstract representation (Figure 4b). The purpose of this abstraction is to replace any reference to user-defined entities (function names, types, goto labels, variable names and string literals) by identifiers that can be reused across source code files, hence reducing the vocabulary size. Thus, our abstraction approach first detects user-defined entities before replacing them with unique identifiers (new IDs).
New IDs follow the regular expression (fn|tp|lb|vr|lr)_(num)+, where num stands for numbers 1, 2, 3, . . . assigned in a sequential and positional fashion based on the occurrence of that entity. All the user-defined Function names, Type names, Labels, Variable names, and String Literals are replaced with fn_num, tp_num, lb_num, vr_num, and lr_num, respectively. Thus, the first function name found receives the ID fn_1, the second receives the ID fn_2, and so on. If any of these entities appears multiple times in a source code file, it is replaced with the same ID.
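The ID assignment described above can be illustrated with a minimal sketch. It assumes the code has already been tokenized and that a parser has classified each user-defined entity; that classification is passed in as the hypothetical `kinds` mapping, and the function name `abstract_tokens` is likewise an assumption, not the authors' implementation.

```python
from typing import Dict, List

# Entity kind -> ID prefix, following the naming scheme in the text.
PREFIX = {"function": "fn", "type": "tp", "label": "lb", "variable": "vr", "literal": "lr"}


def abstract_tokens(tokens: List[str], kinds: Dict[str, str]) -> List[str]:
    """Replace user-defined entities with sequential, reusable IDs.

    `kinds` maps each user-defined token to its entity kind (assumed to come
    from a parser); tokens absent from `kinds` are kept unchanged.
    """
    counters = {prefix: 0 for prefix in PREFIX.values()}
    ids: Dict[str, str] = {}
    abstracted = []
    for token in tokens:
        kind = kinds.get(token)
        if kind is None:
            abstracted.append(token)  # keyword, operator or punctuation
            continue
        if token not in ids:          # first occurrence: allocate the next number
            prefix = PREFIX[kind]
            counters[prefix] += 1
            ids[token] = f"{prefix}_{counters[prefix]}"
        abstracted.append(ids[token])  # repeated occurrences reuse the same ID
    return abstracted


tokens = ["int", "total", "=", "add", "(", "total", ",", "count", ")", ";"]
kinds = {"total": "variable", "count": "variable", "add": "function"}
print(" ".join(abstract_tokens(tokens, kinds)))
# int vr_1 = fn_1 ( vr_1 , vr_2 ) ;
```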
Additionally, we remove code comments and add mutation annotations to encode the mutation operator and the corresponding label (to be learned by the translator). Our mutation annotations have the general shape shown in Figure 4c (for example, MST[ReturnValsMutator]MSP[S] once the label is appended). Between the last brackets [], our trained model adds one of the labels S or N, indicating that the mutant obtained by applying the mutation operation is predicted as subsuming or non-subsuming, respectively.
Figure 3: Model is trained on mutant annotations to further append the label (subsuming/non-subsuming). Trained model is provided with an unseen mutant annotation to append the label. The appended label acts as the prediction for the unseen mutant annotation.

Pairs Generation
The mutation operation (ReturnValsMutator⁵) shown in Figure 4c represents a mutant in which the statement return null is replaced by throw new java.lang.RuntimeException(). Notice that this mutant is labeled as subsuming in our dataset, since there is only one test that can kill it, when the input option is null. Hence for training we consider S as the label to be learned by the translator to predict this mutant as subsuming.
To do so, we train in pairs (MutantAnnotation, MutantAnnotation+Label), where the first component is the annotated code shown in Figure 4c, and the second component is the same code with the predicted label, i.e., MST[ReturnValsMutator]MSP[S] in our case, to indicate that the mutant is subsuming. The resulting text is arranged in a single sentence to represent a sequence of space-separated entities (the representation supported by the machine translator). The only difference between the input sequence given to the translator and the expected output sequence produced by it is the predicted label S or N. Using these sequences, we intend to capture as much code as possible around the mutant without incurring the exponential increase in training time.
5. https://pitest.org/quickstart/mutators/#RETURN_VALS
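As a rough illustration of pair generation, the sketch below builds one (input, expected output) pair from an abstracted token sequence. The function name `make_pair`, the placement of the annotation immediately before the mutated statement, and the empty-bracket form MST[...]MSP[] for the unlabeled input are assumptions made for the example; only the labeled form MST[ReturnValsMutator]MSP[S] is taken from the text.

```python
from typing import List, Tuple


def make_pair(tokens: List[str], position: int, operator: str,
              subsuming: bool) -> Tuple[str, str]:
    """Build the (input, expected output) training pair for one mutant.

    `position` is the token index where the annotation is inserted
    (assumed here to be the start of the mutated statement).
    """
    label = "S" if subsuming else "N"
    unlabeled = f"MST[{operator}]MSP[]"
    labeled = f"MST[{operator}]MSP[{label}]"
    source = tokens[:position] + [unlabeled] + tokens[position:]
    target = tokens[:position] + [labeled] + tokens[position:]
    # Both sequences are single sentences of space-separated tokens.
    return " ".join(source), " ".join(target)


abstracted = ["if", "(", "vr_1", "==", "null", ")", "return", "null", ";"]
src, tgt = make_pair(abstracted, 6, "ReturnValsMutator", subsuming=True)
print(src)  # if ( vr_1 == null ) MST[ReturnValsMutator]MSP[] return null ;
print(tgt)  # if ( vr_1 == null ) MST[ReturnValsMutator]MSP[S] return null ;
```

The two sequences differ only in the label inside the last brackets, which is what the translator is trained to fill in.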

Building the Machine Translator
To build our machine translator, we train an encoder-decoder model that can transform an input sequence to a desired output sequence. In our representation, a sequence consists of tokens separated by spaces that ends with a newline character. Thus, we train the encoder-decoder by feeding it with pairs of sequences, produced in the previous step. The translator learns to replicate the abstracted source code with the mutation annotation and to append the label (S/N) that will be used as a prediction for the mutant.
We found that training the translator on sequences of maximum 100 tokens in length is computationally feasible, but expensive (740 training hours required on a Tesla V100 GPU). Hence, we also experiment with sequences of 50 tokens in length and demonstrate that the computation cost of training the translator can be further contained (360 training hours required). We name Cerebro trained on sequences of 100 tokens in length as Cerebro-100. Following our naming convention, we name Cerebro trained on sequences of 50 tokens in length as Cerebro-50.
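One simple way to cap sequences at 50 or 100 tokens is to keep a window of tokens around the mutation annotation. The text does not specify how the window is positioned, so the centering used below is an assumption, as is the function name `window_around`.

```python
from typing import List


def window_around(tokens: List[str], center: int, max_len: int) -> List[str]:
    """Keep at most `max_len` tokens, centered on index `center` where possible."""
    if len(tokens) <= max_len:
        return list(tokens)
    start = max(0, center - max_len // 2)
    end = start + max_len
    if end > len(tokens):  # window ran past the end: shift it back
        end = len(tokens)
        start = end - max_len
    return tokens[start:end]


tokens = [f"t{i}" for i in range(200)]
window = window_around(tokens, center=100, max_len=50)
print(len(window), window[0], window[-1])  # 50 t75 t124
```

A smaller window shortens the sequences the encoder-decoder has to process, which is consistent with the lower training cost reported for Cerebro-50.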

Predicting from appended labels
To predict whether or not a certain mutation at a particular position in an unseen code is subsuming, we abstract the unseen code followed by sequence generation which results in abstracted code sequence attached with mutation annotation as depicted in Figure 3. We feed this sequence into the trained machine translator to yield an output sequence with an appended label. The appended label acts as a prediction (subsuming/non-subsuming) for this specific mutation. If the translator produces an output sequence with a change other than appending the predicted label, the input sequence is predicted as non-subsuming, by default. In our experiments reported in Section 7, this happened in 4.2% and 0.1% of the sequences for C and Java programs, respectively.
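The default-to-non-subsuming rule can be sketched as a small post-processing step on the translator's output. This is an illustrative sketch only: the function name `predict_label` and the assumption that the input sequence carries an annotation with empty label brackets (MST[...]MSP[]) are not taken from the text.

```python
import re

# Matches an annotation such as MST[ReturnValsMutator]MSP[S].
ANNOTATION = re.compile(r"MST\[(?P<op>[^\]]*)\]MSP\[(?P<label>[^\]]*)\]")


def predict_label(source: str, output: str) -> str:
    """Return 'S' or 'N' for one mutant, defaulting to 'N' on any unexpected output."""
    match = ANNOTATION.search(output)
    if match is None or match.group("label") not in ("S", "N"):
        return "N"
    # Remove the appended label and check that nothing else was changed.
    stripped = output[:match.start("label")] + output[match.end("label"):]
    return match.group("label") if stripped == source else "N"


source = "vr_1 = fn_1 ( ) ; MST[ReturnValsMutator]MSP[] return null ;"
good = "vr_1 = fn_1 ( ) ; MST[ReturnValsMutator]MSP[S] return null ;"
altered = "vr_2 = fn_1 ( ) ; MST[ReturnValsMutator]MSP[S] return null ;"
print(predict_label(source, good))     # S
print(predict_label(source, altered))  # N
```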

RESEARCH QUESTIONS
We start by checking the prediction ability of Cerebro and ask:
RQ1 Prediction Performance: How effective is Cerebro in predicting subsuming mutants?
We leverage two datasets, made of C and Java programs, for which extensive mutation analysis has been performed to identify subsuming mutants. We reimplemented 2 techniques that we use as baselines in our analysis. The first baseline is random mutant sampling, while the second is a supervised machine learning method based on manually designed features that were used by previous work [11] (e.g., data flow, control flow, etc.). These features are used to train a binary classifier in order to predict whether a mutant is subsuming or not. Further details about the baselines can be found in Section 6.3.
After analyzing the predictions, we turn our attention to the envisioned application scenario; measuring test effectiveness of the predicted mutants. It is important to check the application case because a) predictions may select weak mutants [11] (weak subsuming mutants result in lower test effectiveness than the strong ones), b) selected mutants may not be diverse as they may include mutually subsuming mutants [34], and c) tester benefits are unclear. Thus, we ask: RQ2 Effectiveness Evaluation: How does Cerebro compare with the baselines in terms of subsuming mutation score?
We perform a simulation of a mutation testing scenario where a tester analyzes the selected mutants in order to generate tests [5], [11], [35]. For test effectiveness, we measure the subsuming mutation score (MS*) achieved by the tests that kill the selected mutants. In essence, we evaluate the guidance offered by the mutants when testers design tests to kill the selected mutants. It is worth noticing that in this part of the experiment we control the number of mutants, i.e., all techniques analyze the same number of mutants. Such simulation is typical in mutation testing literature [5], [11], [35] and aims at quantifying the benefit of an approach over the other.
Complementary to the previous question, we compare the effort required by each technique to obtain the same level of test effectiveness. We first investigate the human effort, measured as the number of mutants analyzed by the tester to reach the same subsuming mutation score using Cerebro and the baselines. Hence, we ask: RQ3 Manual Effort: How many mutants require manual analysis in order to reach a given level of subsuming mutation score?
We perform a similar simulation of a testing scenario in which we measure how many mutants the tester needs to analyze (generate a test case to kill or judge equivalence), until he/she obtains a determined subsuming mutation score. This allows us to quantify the human effort required by each approach to obtain the same benefit.
Related to the previous question, we also investigate the number of test executions necessary to reach the same subsuming mutation score, by following the incremental process of mutation analysis, i.e., a tester picking a mutant and analyzing it. If the picked mutant is killable, he/she generates a test case that kills it, and then checks if the remaining alive (not analyzed and not killed) mutants are collaterally killed by the same test (by executing the generated test on alive mutants). The killed mutants are removed from the set of alive mutants. Then, we ask: RQ4 Computational Effort: How many test executions are required in order to reach a given level of subsuming mutation score?
We perform a simulation as before, but in this case, every time that a test is generated, we count the number of test executions and measure the attained subsuming mutation score, until we reach a given subsuming mutation score.

Benchmarks and Ground Truth
In order to show that our approach is language agnostic, we make our evaluation on a set of C and Java programs.
C-Benchmark: To perform our study that requires strong test suites, we used an independently built dataset from related work [12]. It includes C programs from the GNU Coreutils 6 , that consist of file, text and shell utility programs widely used in Unix systems. The data-set is composed of 48 GNU Coreutils (v8.22) programs aka subjects (mentioned in Table 1), each packaged with an accompanying system test suite, generated by developers. The size of these programs ranges from 1,000 to 14,000 lines of code (LOC), with a median size of 3,500 LOC. For each subject, the data-set includes a mutant-test killing matrix that records, for each mutant, a set of tests that kill it.
The mutant-test killing matrices were obtained by generating mutants using the Mart mutant generation tool [13] and executing them against large test pools. The test pools were built by considering developer tests and adding automatically generated tests using a 24 hours run of KLEE [10]. Additionally, mutation-based test suites were automatically generated using 128 different configurations of SEMu [12], each running for 2 hours, and an additional 'seeded' test generation of KLEE. To reduce the total execution cost, for each program, the 3 functions that were covered by the largest number of developer tests were selected for mutation analysis, i.e., mutants were generated only for these functions.
We use these mutant-test killing matrices to compute the mutant subsumption, following the definition given in Section 2.1, and label each mutant as either subsuming or non-subsuming. To make the problem as balanced as possible (to assist in machine learning), we mark as subsuming all mutants in the top of the hierarchies, including mutually subsumed mutants.
6. https://www.gnu.org/software/coreutils/
Needless to say, it is possible to have some noise in our labeling process in the sense that mutants labeled as subsuming may be non-subsuming. The data-set reduced this noise by augmenting the test suites with multiple large and diverse test suites generated by different state-of-the-art tools. Please refer to the threat in Section 9 for a related discussion.
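The labeling step can be illustrated with a short sketch. It assumes the commonly used definition of subsumption (a killable mutant m1 subsumes m2 when every test that kills m1 also kills m2) and a killing matrix stored as a dictionary from mutant id to the set of killing test ids; both the data layout and the function name are assumptions made for illustration.

```python
from typing import Dict, Set


def label_subsuming(kill: Dict[str, Set[str]]) -> Set[str]:
    """Return the mutants at the top of the subsumption hierarchy.

    kill maps each mutant id to the set of test ids that kill it.
    Mutants with identical killing sets (mutually subsuming) are all
    kept, mirroring the labeling choice described in the text.
    """
    killable = {m: frozenset(tests) for m, tests in kill.items() if tests}
    top = set()
    for m, tests_m in killable.items():
        # m is strictly subsumed if some other mutant is killed by a
        # proper subset of the tests that kill m.
        strictly_subsumed = any(
            tests_o < tests_m for o, tests_o in killable.items() if o != m
        )
        if not strictly_subsumed:
            top.add(m)
    return top
```

Mutants that no test kills never enter the hierarchy, so they are labeled non-subsuming. The pairwise comparison is quadratic in the number of killable mutants, which is acceptable for a sketch but may need indexing for large matrices.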
Java-Benchmark: For Java we select a set of well-tested open source projects from GitHub. We select projects from the Apache Commons Proper 7 repository of reusable Java components, Joda-Time 8, a date and time library, and Jsoup 9, an HTML manipulation library. The set counts 10 projects: commons-cli, commons-codec, commons-collections, commons-csv, commons-io, commons-lang, commons-net, commons-text, jsoup, joda-time. These projects contain up to 284 classes. Table 1 reports the version/commit of each project we used for our study. Following a procedure similar to the one used for C in [12], we also build test pools by using developer tests and adding automatically generated tests by running EvoSuite [23] for each project with the default running time, but with multiple coverage metrics 10 . The mutant-test killing matrices were obtained using Pitest [17]. For each project, we run the mutants on the test pools for 48 hours. To reduce execution time, we select the classes processed during that time lapse. Table 2 records the total number of mutants, number (and percentage) of killable and subsuming mutants, and number of test cases conforming to the mutant-test killing matrices. Please note that the difference in the ratio of subsuming mutants with respect to previous research [3], [34], [47] is due to the inclusion of all mutually subsuming mutants. As already explained, we include all subsuming mutants to avoid misleading our learner.

Equivalent Mutants
Early research on mutation testing has demonstrated that deciding whether a mutant is equivalent is an undecidable problem [9]. Mutation testing may produce a mutant that is syntactically different from the original, yet semantically identical, aka equivalent mutant [32]. Undecidability of equivalences means that it is impossible to automatically discard them all. As a result, the tester may never know whether he or she has failed to find a killing test case because the mutant is particularly hard to kill, yet remains killable (a 'stubborn' mutant [65]), or whether failure to find a killing test case derives from the fact that the mutant is equivalent. The best options we have are effective algorithms that can remove most equivalent mutants, e.g., for the C data-set [12] the authors applied TCE (Trivial Compiler Equivalence) [25], [32] to filter out equivalent and duplicated mutants. Interestingly, early research on mutation testing [2] has shown that humans also make many mistakes (approximately 20%) when judging mutants (as being equivalent or not). This means that it is unrealistic to expect that automated tools (or testers, in case of manual test case design) kill all killable mutants.
To make a fair approximation of killable mutants we used state-of-the-art test generation tools (KLEE [10], SEMu [12], and EvoSuite [23]), together with mature developer test suites to identify killable mutants. For the remaining live mutants (i.e., mutants that are killed neither by developer-written nor by automatically generated test suites) we assumed that they are equivalent. Although this assumption may have some impact on our results (refer to Section 8.4 for an analysis of the impact of this assumption), it allows quantifying the effort involved by testers in analyzing low utility mutants when using the current state-of-the-art advances. Moreover, since Cerebro performs machine learning, it learns from the employed data. This means that the availability of clean data, with a clear signal to learn, will allow Cerebro to make better predictions, thereby potentially improving its performance.

Baselines
We consider 2 baselines. The first one is the Random mutant sampling that samples uniformly from the entire set of mutants. The second baseline is a Decision Tree classification based on the features proposed by related work [11], [30].
Previous works showed a strong connection between mutant utility and surrounding code (utility captured through CFG, data flows, AST, etc. features). Thus, we use the mutant features to predict subsuming mutants in both C and Java. Features belong to 4 categories: Mutant Type related features, Control-Flow graph related features, Control and Data dependency related features, and AST related features. In total we used the 28 features, used by the related work [11], for the C programs, and implemented 16 of those features for Java 11 . We excluded features such as AstChildHasIdentifier and AstChildHasLiteral that we found unfeasible to implement in the employed tools, i.e., Pitest works at byte-code level making it difficult to identify the original source code expression. Nevertheless, the excluded features were approximated by mutant type.
After extracting the features, following the related work [11], we trained a stochastic gradient boosted Decision Tree model by using the same configuration as the related work [11]. We followed the same validation setup for Cerebro.

Implementation and Model Configuration
We rely on the srcML tool [18] to convert source code into an XML format to tag literals, keywords, identifiers, comments, and our mutation annotations. This helps in separating user-defined identifiers and string literals (the largest part of the vocabulary) from language keywords, as srcML supports C, Java and other languages. Then, we implement the ID replacement to generate the abstracted code.
We follow the sequence pair generation procedure mentioned in Section 4.2 to generate sequences from the abstracted code. These sequences serve as training input for our encoder-decoder model, which we build using tf-seq2seq [1], a general-purpose encoder-decoder framework. Following previous works [61], [62], we configure our model with a bidirectional encoder. We use a Gated Recurrent Units (GRU) network [16] to act as the Recurrent Neural Network (RNN) cell, which was shown to perform better than possible alternatives such as simple RNNs in related prediction tasks [57]. To achieve good performance with acceptable model training time, we utilize AttentionLayer-Bahdanau [7] as our attention class, configured with 2 layered AttentionDecoder and 1 layered BidirectionalRNNEncoder, both with 256 units.
To determine an appropriate number of training epochs, we conducted a preliminary study involving a validation set, independent of both the training and test sets that we use in our evaluation. Here we incrementally train the model, checking after every epoch to monitor model accuracy. We continue training the model until its performance on the validation set no longer improves. We found 15 epochs to be a good default for our validation sets. Once model training is complete, we follow the procedure explained in Section 4.4 to predict whether an unseen mutant annotation sequence is subsuming or not.

Experimental Procedure
In the first experimental part, we evaluate the prediction ability of our approach, answering RQ1, while in the second part, we evaluate cost-effectiveness of Cerebro, answering RQs2-4.

First Experimental Part
We start by evaluating the prediction performance of Cerebro, and the baselines, using four typical metrics, namely, Precision, Recall, F-measure, and Matthews Correlation Coefficient (MCC) [41]. A confusion matrix is computed for each one of the studied methods, which stores the correct and incorrect predictions. Given a subsuming mutant, if it is predicted as subsuming, then it is a true positive (TP); otherwise, it is a false negative (FN). Given a non-subsuming mutant, if it is predicted as non-subsuming, then it is a true negative (TN); otherwise, it is a false positive (FP). Then we can use the confusion matrix to quantitatively evaluate the prediction performance of Cerebro and Decision Trees prediction models.
Intuitively, Precision is the ratio of mutants truly subsuming among all the mutants predicted as subsuming. Recall is the ratio of mutants correctly predicted as subsuming among all the subsuming mutants. F-measure indicates the weighted harmonic mean of Precision and Recall. Matthews Correlation Coefficient (MCC) [41] is a reliable metric of the quality of prediction models [56] that, in contrast to the previous metrics, also takes into account the True Negatives (correctly predicted non-subsuming mutants). It is generally regarded as a balanced measure that can be used even when the dataset is unbalanced, i.e., the classes are of very different sizes, e.g., in the case of C programs, 10.2% subsuming mutants (Positives) over 89.8% non-subsuming mutants (Negatives). MCC returns a coefficient between -1 and 1. An MCC value of 1 indicates a perfect prediction, whereas a value of -1 indicates a perfect inverse prediction, i.e., a total disagreement between prediction and reality. An MCC value of 0 indicates that the prediction performance is equivalent to random guessing.
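The four metrics follow directly from the confusion matrix. The sketch below uses the standard formulas and assumes equal weights for Precision and Recall in the F-measure (i.e., F1); returning 0 when a denominator vanishes is a convention chosen here, not something stated in the text.

```python
import math
from typing import Dict


def prediction_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Precision, Recall, F-measure (F1) and MCC from a confusion matrix."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # MCC is the only one of the four that uses the true negatives.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }
```

For example, `prediction_metrics(tp=5, fp=0, tn=5, fn=0)` describes a perfect prediction, while swapping every prediction (`tp=0, fp=5, tn=0, fn=5`) describes a perfect inverse prediction.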
The mutants selected by Cerebro are the ones predicted as subsuming. For Decision Trees baseline, as it computes a probability of a mutant being subsuming, we followed the probability margin convention and considered those mutants whose predicted probability was higher than 0.5 [11].
To assess the performance, we perform an inter-project evaluation. We use 5-fold cross validation, where we evenly split each benchmark into 5 parts (10 programs and 2 projects per fold for the C and Java benchmarks, respectively). Then, for each benchmark, we repeatedly use 1 fold for testing and 4 folds for training (1 of these 4 folds is used for validation).
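A minimal sketch of such an inter-project split is shown below. Two details are assumptions: projects are assigned to folds round-robin, and the validation fold is simply the fold following the test fold, since the text does not state how either choice was made.

```python
from typing import Dict, Iterator, List


def inter_project_folds(projects: List[str], k: int = 5) -> Iterator[Dict[str, List[str]]]:
    """Yield k train/validation/test splits at project granularity."""
    # Round-robin assignment keeps the fold sizes as even as possible.
    folds = [projects[i::k] for i in range(k)]
    for test_idx in range(k):
        # Assumption: the validation fold is the one after the test fold.
        val_idx = (test_idx + 1) % k
        train = [
            p
            for i, fold in enumerate(folds)
            if i not in (test_idx, val_idx)
            for p in fold
        ]
        yield {"train": train, "validation": folds[val_idx], "test": folds[test_idx]}
```

Splitting by project rather than by mutant ensures that no code from a test project is seen during training, which is what makes the evaluation inter-project.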

Second Experimental Part
To study the cost and test effectiveness of our approach and the baselines, we simulate a testing scenario where a tester selects a subset of mutants, to use for mutation analysis, and designs tests to kill them. Algorithm 1 provides the pseudo-code of the simulation process we follow in our experiments. It takes as input a set M of mutants to analyze, the test pool P and a target subsuming mutation score tMS*, and returns a test suite T that kills every mutant from M (or reaches the pre-specified subsuming mutation score). Additionally, it returns the subsuming mutation score obtained by the test suite T (currMS*), number of analyzed mutants (analyzedMut), number of equivalent mutants analyzed (equivMut), and number of test executions (tExec) required to generate test suite T during the simulated mutation testing scenario.

Algorithm 1
 7:     if the test pool P can kill mutant m then
 8:         t ← randomlyPickTestKilling(m, P)
 9:         add test t to the suite
10:         tExec += size(C)        (run t on mutants from C)
11:         remove from set C all mutants killed by t
12:     else
13:         equivMut++              (m is judged as equivalent)
14:     end if
15:     currMS* ← calculateMS*(M, T)
16: end while
17: return T, currMS*, analyzedMut, equivMut, tExec

The simulation starts by picking (pickNextMutant) the top mutant m, according to the technique used (Cerebro, Decision Trees, and Random), among survived mutants from set C (initialized with all mutants from M). It then checks if there exists some test in the test pool P that kills m (this process simulates a tester picking, analyzing, and designing a test to kill a mutant). If no test kills mutant m, we judge it as equivalent and remove it from C. Otherwise, we randomly pick one test t from the pool that kills m. Then, we run the test t on every mutant from C to check if the same test consequently kills other mutants (killed mutants are then removed from C).
This process continues by taking the next survived mutant and finding a test to kill it until every mutant in C has been killed or until the desired subsuming mutation score is reached. We run this simulation with the set of mutants selected by Cerebro, Decision Trees, and Random, respectively, and use the reported values to compare their cost-benefit performance for answering RQ2-4. Since Algorithm 1 includes some random decisions, we repeat this process 1,000 times for all the approaches.
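The simulation can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' code. It assumes that the killing matrix is a dictionary from mutant id to the set of killing test ids, that the picked mutant is removed from C before the new test is executed on the remaining mutants, and that MS* is measured against a reference set of subsuming mutants passed in by the caller; all names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set, Tuple


def subsuming_score(suite: Set[str], kill: Dict[str, Set[str]], subsuming: Set[str]) -> float:
    """MS*: fraction of the reference subsuming mutants killed by the suite."""
    if not subsuming:
        return 0.0
    killed = sum(1 for m in subsuming if kill.get(m, set()) & suite)
    return killed / len(subsuming)


def simulate(
    ranked_mutants: List[str],
    kill: Dict[str, Set[str]],
    subsuming: Set[str],
    target_ms: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Tuple[Set[str], float, int, int, int]:
    """Return (T, currMS*, analyzedMut, equivMut, tExec)."""
    rng = rng or random.Random()
    alive = list(ranked_mutants)  # set C, kept in the technique's order
    suite: Set[str] = set()       # test suite T
    analyzed = equiv = t_exec = 0
    curr_ms = 0.0
    while alive and curr_ms < target_ms:
        m = alive.pop(0)          # pickNextMutant: top-ranked survivor
        analyzed += 1
        killers = kill.get(m, set())
        if killers:
            t = rng.choice(sorted(killers))  # random killing test from the pool
            suite.add(t)
            t_exec += len(alive)             # run t on every remaining mutant
            alive = [x for x in alive if t not in kill.get(x, set())]
        else:
            equiv += 1                       # no test kills m: judged equivalent
        curr_ms = subsuming_score(suite, kill, subsuming)
    return suite, curr_ms, analyzed, equiv, t_exec
```

Passing a seeded `random.Random` makes a run reproducible; repeating the call with different seeds corresponds to the repeated simulations described above.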
To answer RQ2, we measure the effectiveness (benefit) of the approaches in terms of the subsuming mutation score (MS*), i.e., the ratio between killed and total number of subsuming mutants, achieved by the generated test suites when analyzing the selected mutants. The subsuming mutation score reduces the influence of redundant mutants [34], [47].
For assessing the effectiveness of the approaches, we aim at controlling the number of mutants selected by each tool. In the case of Cerebro, the mutants selected are the ones predicted as subsuming by our model. For Decision Trees baseline, we rank (in descending order) the mutants according to the predicted probability of being subsuming, and follow the ranking to pick mutants (from highest probability to lowest) for analysis. Random baseline randomly ranks the mutants to be selected. Initially, we consider the same number of selected mutants for the 3 approaches, defined as the number of mutants predicted as subsuming by Cerebro. For instance, if Cerebro predicts 20 mutants as subsuming, then Decision Trees and Random baselines will also select the top 20 ranked mutants. Our intention is to compare the effectiveness reached by each approach, when the number of selected mutants is equal.
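The equal-budget selection can be sketched as below. The inputs (a set of mutants predicted as subsuming by Cerebro, a dictionary of Decision Trees probabilities, and the full mutant list) and the function name are assumptions made for illustration.

```python
import random
from typing import Dict, List, Set


def select_same_budget(
    cerebro_selected: Set[str],
    dt_probability: Dict[str, float],
    all_mutants: List[str],
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Give every technique the same number of mutants as Cerebro selects."""
    k = len(cerebro_selected)
    # Decision Trees: rank by predicted probability, highest first.
    dt_ranked = sorted(dt_probability, key=dt_probability.get, reverse=True)
    # Random: a random ranking of all mutants.
    random_ranked = list(all_mutants)
    random.Random(seed).shuffle(random_ranked)
    return {
        "Cerebro": sorted(cerebro_selected),
        "DecisionTrees": dt_ranked[:k],
        "Random": random_ranked[:k],
    }
```

Fixing the budget to Cerebro's selection size isolates the quality of the selected mutants from the quantity selected.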
Additionally, we study the number of equivalent mutants selected by each approach (as these are an important source of redundancy during mutation testing), as well as, the required number of mutants selected by the baselines in order to reach the same subsuming mutation score as Cerebro.
To answer RQ3 and RQ4, we study the effort (cost) required by each approach in two ways. We measure the human effort in terms of the number of analyzed mutants, killable or not, that are presented to testers for analysis (i.e., either designing a test to kill these or judging these as equivalent), when applying mutation testing. Intuitively, for a given set of mutants, the number of analyzed mutants can be considerably smaller than the entire set's size because a test designed by analyzing one mutant can kill other mutants as well. Hence, we also measure the computational effort in terms of the number of test executions performed, during the mutation analysis procedure, i.e., we count the test executions required at every step where a new test is created. As for RQ2, here we also study the number of test executions and the number of mutants that require analysis by the baselines, to reach the same subsuming mutation score as Cerebro.

Prediction Performance (RQ1)
Table 3 records the average (and median) performance metrics. Figure 5 shows the performance comparison in box plot format showing the distribution of performance indicators (MCC, F-measure, Precision, and Recall) for both approaches in the C and Java benchmarks.
On average, Cerebro obtains a high Precision, i.e., 0.93 and 0.76 (Cerebro-100), and 0.82 and 0.72 (Cerebro-50) in C and Java benchmarks, respectively. Testers focusing on mutants selected by Cerebro can be confident that these are very likely to be subsuming, providing high utility to the testing process. On the other hand, the Recall achieved is low, i.e., 0.26 and 0.39 (Cerebro-100), and 0.21 and 0.31 (Cerebro-50) in C and Java benchmarks, respectively. This indicates that many subsuming mutants are mistakenly predicted as non-subsuming by Cerebro. In practice these mutants can still be collaterally killed by tests designed for other (mutually subsumed) subsuming mutants correctly predicted as subsuming by Cerebro (which is often the case, as we will show when answering RQ2 in the following section). Needless to say, any complementary mutation testing and mutant selection technique can be employed to analyze the remaining mutants that are not killed by test suites designed to kill mutants selected by Cerebro.
In comparison with the baselines, we observe that Cerebro clearly achieves much higher prediction performance than Decision Trees in both benchmarks. The differences are statistically significant. 12 In C-Benchmark, on average, Cerebro with its MCC of 0.47 (Cerebro-100), and 0.39 (Cerebro-50) outperforms Random (0.0 MCC). Cerebro also outperforms Decision Trees, on average, with 2.76 times higher MCC and 64% improvement in F-measure. It is worth mentioning that while Cerebro achieves 3.72 times higher precision than Decision Trees, Cerebro also offers an improvement of 4% in Recall over Decision Trees.
In summary, Cerebro achieves a prediction capability (MCC) 2.78 times higher than Decision Trees.

Effectiveness Evaluation (RQ2)
Figure 6a and 6d show the average subsuming mutation score (MS*) obtained when selecting the same number of mutants (by all techniques). In C-Benchmark, on average, Cerebro-100 obtains an MS* of 87.50%, which is 2.39 and 2.63 times higher MS* than Decision Trees and Random, respectively. Moreover, Cerebro-50 obtains an MS* of 71.43%, which is 2.02 and 2.17 times higher MS* than Decision Trees and Random, respectively.
In Java-Benchmark, on average, Cerebro-100 obtains an MS* of 95.90%, which is twice as high as Decision Trees, and a 69.53% improvement over Random. Moreover, Cerebro-50 obtains an MS* of 95.66%, which is 2.20 times higher than Decision Trees, and an 83.33% improvement over Random. The differences are statistically significant, according to the computed p-value. We also compared them with the Vargha-Delaney A measure (Â12) [63], showing that Cerebro achieves better MS* than Decision Trees and Random in 92.4% and 95.7% of the cases, respectively.
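For reference, the Vargha-Delaney A measure estimates the probability that a value drawn from one sample is larger than a value drawn from the other, counting ties as one half. A minimal sketch of the standard definition follows; the function name is hypothetical.

```python
from typing import Sequence


def vargha_delaney_a12(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Probability that a value from xs exceeds a value from ys (ties count 0.5)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))
```

A value of 0.5 means the two samples are indistinguishable in this sense, while values close to 1 mean the first sample is almost always larger.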
We also study the selection size needed by Decision Trees and Random to achieve the same MS* obtained by Cerebro. For C-Benchmark, Figure 6b shows that while Cerebro-100 selects only 2.35% of the mutants, Decision Trees and Random need to select 85.42% (36.35 times higher) and 87.61% (37.28 times) of the mutants to achieve the same MS* as Cerebro. Also, Figure 6e shows that while Cerebro-50 selects only 2.52% of the mutants, Decision Trees and Random need to select 34.23% (13.57 times higher) and 42.37% (16.79 times) of the mutants to achieve the same MS* as Cerebro. For Java-Benchmark, while Cerebro-100 selects 9.85% of the mutants, Decision Trees and Random need to select 44.80% (4.55 times higher) and 78.97% (8.02 times) of the mutants to achieve the same MS* as Cerebro-100. Also, while Cerebro-50 selects 11.60% of the mutants, Decision Trees and Random need to select 41.77% (3.60 times higher) and 75.09% (6.48 times) of the mutants to achieve the same MS* as Cerebro-50. We obtained a statistically significant p-value and Â12 when comparing these values, evidencing that Cerebro selects fewer mutants than Decision Trees and Random in more than 98.5% and 99.1% of the cases, respectively.
12. We compared the MCC values using Wilcoxon signed-rank test and obtained a p-value < 5.07e−3 in comparison to Decision Trees. We also compared the MCC values with the Vargha-Delaney A measure [63] and observed that in all (100%)
We also measure the percentage of equivalent mutants selected. For C-Benchmark, Figure 6c shows that 1.10% of mutants selected by Cerebro-100 are equivalent, whereas 24.44% and 26.09% of the mutants selected by Decision Trees and Random are equivalent. Also, Figure 6f shows that 4.37% of mutants selected by Cerebro-50 are equivalent, whereas 24% and 26.23% of the mutants selected by Decision Trees and Random are equivalent. In Java-Benchmark, 9.95% of the mutants selected by Cerebro-100 are equivalent, whereas for Decision Trees and Random, 15.11% (51.86% more) and 19.33% (94.27% more) of the selected mutants are equivalent. Also, 5.45% of the mutants selected by Cerebro-50 are equivalent, whereas for Decision Trees and Random, 15.86% (2.91 times higher) and 19.26% (3.53 times higher) of the selected mutants are equivalent. The differences are statistically significant. Â12 shows that Cerebro selects fewer equivalent mutants than Decision Trees and Random in more than 90% and 98.4% of the cases, respectively. These results provide evidence that our approach can significantly reduce this long-standing problem of mutation analysis.
Figure 7 (e): C-Benchmark: Cerebro-50 analyzes 1.02% mutants, whereas Decision Trees, and Random analyze 11.92%, and 13.17% to reach the same MS* as Cerebro-50. Java-Benchmark: Cerebro-50 analyzes 2.52% mutants, whereas Decision Trees, and Random analyze 12% and 17.19% to reach the same MS*.

Figure 7 (f): C-Benchmark: 11.31%, 50%, and 50% of the mutants selected by Cerebro-50, Decision Trees, and Random are equivalent. Java-Benchmark: 23.72%, 56.08%, and 57.38% of the mutants selected by Cerebro-50, Decision Trees, and Random are equivalent.

Number of Analyzed Mutants (RQ3)
Figures 7a and 7d show the average subsuming mutation score (MS*) obtained by each technique for the same number of analyzed mutants. In C-Benchmark, on average, Cerebro-100 achieved an MS* of 78%, which is an improvement of 89.41% and 71.20% over the MS* of Random and Decision Trees, respectively. Moreover, Cerebro-50 achieved an MS* of 65.75%, which is 2.14 times higher than Random and an improvement of 97% over Decision Trees. In Java-Benchmark, on average, Cerebro-100 achieved an MS* of 94.90%, an improvement of 49.24% and 71.21% over Decision Trees and Random, respectively. Moreover, Cerebro-50 achieved an MS* of 95.65%, an improvement of 78.65% and 91.94% over Decision Trees and Random, respectively. The differences are statistically significant, according to the computed p-value and Â12. We observed that Cerebro is better than Decision Trees and Random in more than 96.2% and 98.4% of the cases, respectively.
We also study the percentage of mutants that Decision Trees and Random need to analyze to achieve the same MS* as Cerebro. For C-Benchmark, Figure 7b shows that while Cerebro-100 analyzes 1.21% mutants, Decision Trees and Random need to analyze 22.33% (18.45 times higher) and 22.80% (18.84 times higher) of mutants to reach the same MS* as Cerebro-100. Also, Figure 7e shows that while Cerebro-50 analyzes 1.02% mutants, Decision Trees and Random need to analyze 11.92% (11.58 times higher) and 13.17% (12.78 times higher) of mutants to reach the same MS* as Cerebro-50. In Java-Benchmark, while Cerebro-100 analyzes 3.22% mutants, Decision Trees and Random need to analyze 12.07% (3.75 times higher) and 18.05% (5.61 times higher) of mutants to reach the same MS* as Cerebro-100. Moreover, while Cerebro-50 analyzes 2.52% mutants, Decision Trees and Random need to analyze 12.00% (4.76 times higher) and 17.19% (6.82 times) of mutants to reach the same MS* as Cerebro-50. We obtained a statistically significant p-value and Â12, showing that Cerebro analyzes fewer mutants than Decision Trees and Random in more than 99% of the cases.
We also measure the percentage of equivalent mutants analyzed by each technique. For C-Benchmark, Figure 7c shows that, on average, Cerebro-100 analyzes 3.70% equivalent mutants, while 53.33% (14.41 times higher) and 55.56% (15.02 times higher) of the mutants analyzed by Decision Trees and Random are equivalent. Also, Figure 7f shows that Cerebro-50 analyzes 11.31% equivalent mutants, while 50% (4.42 times higher) of the mutants analyzed by Decision Trees and Random are equivalent. For Java-Benchmark, on average, 33.48% of the mutants analyzed by Cerebro-100 are equivalent, while Decision Trees and Random analyze 52% (55.31% more) and 57.04% (70.37% more) equivalent mutants. Also, 23.72% of the mutants analyzed by Cerebro-50 are equivalent, while Decision Trees and Random analyze 56.08% (2.36 times higher) and 57.38% (2.42 times higher) equivalent mutants. This indicates that the baselines spend a large effort on analyzing redundant mutants, in comparison to Cerebro. The differences are statistically significant. Â12 suggests that Cerebro analyzes fewer equivalent mutants than Decision Trees and Random in more than 98% of the cases.

Number of Test Executions (RQ4)
Figure 8a and 8c show the average subsuming mutation score (MS*) when the number of test executions is fixed. In C-Benchmark, on average, Cerebro-100 achieves an MS* of 74%, outperforming Decision Trees and Random by 62.82% and 66.52% (Decision Trees and Random achieve 45.45% and 44.44% of MS*). Also, Cerebro-50 achieves an MS* of 65.52%, outperforming Decision Trees and Random by 80.95% and 77.14% (Decision Trees and Random achieve 36.21% and 36.99% of MS*). In Java-Benchmark, on average, Cerebro-100 and Cerebro-50 achieve an MS* of 95.65% in both simulations, an improvement of approx. 67% and 58% over Decision Trees and Random (Decision Trees and Random achieve 57.38% and 60.40% of MS* in the first simulation when compared against Cerebro-100, and 50.41% and 60.21% of MS* in the second comparison simulation against Cerebro-50). We obtained a statistically significant p-value. Also, Â12 suggests that Cerebro is better than Decision Trees and Random in 94.15% and 95.7% of the cases, respectively.
We also measure the test executions required by the baselines to achieve the same MS* as Cerebro. Figure 8b shows that, in C-Benchmark, Cerebro-100 requires 291 test executions (median), while Decision Trees, and Random require 3,345, and 3,149. Also, Figure 8d shows that Cerebro-50 requires 125 test executions (median), while Decision Trees, and Random require 1,785, and 2,182. This shows that Cerebro-100 is 10-12 times less and Cerebro-50 is 14-17 times less expensive (computationally) than the baselines.
In Java-Benchmark, Decision Trees, and Random require 517,040, and 795,304 test executions (median) to achieve the same MS* as Cerebro-100, for which 65,741 test executions are required. Moreover, Decision Trees, and Random require 560,866, and 894,494 test executions to achieve the same MS* as Cerebro-50, for which 50,622 test executions are required. This shows that the baselines require 7 to 12 times, and 11 to 17 times higher computational effort than Cerebro-100, and Cerebro-50.
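The cost ratios quoted above can be reproduced from the reported median test executions. The snippet below only restates the numbers from the text; the dictionary layout is an illustrative choice.

```python
# Median test executions reported in the text: (Cerebro, Decision Trees, Random).
median_executions = {
    ("C", "Cerebro-100"): (291, 3345, 3149),
    ("C", "Cerebro-50"): (125, 1785, 2182),
    ("Java", "Cerebro-100"): (65741, 517040, 795304),
    ("Java", "Cerebro-50"): (50622, 560866, 894494),
}

# Ratio of baseline cost to Cerebro cost for each benchmark and model.
cost_ratios = {
    key: (dt / cerebro, rnd / cerebro)
    for key, (cerebro, dt, rnd) in median_executions.items()
}

for (benchmark, model), (dt_ratio, rnd_ratio) in cost_ratios.items():
    print(f"{benchmark} {model}: Decision Trees x{dt_ratio:.2f}, Random x{rnd_ratio:.2f}")
```

The resulting ratios fall within the ranges stated in the text, i.e., roughly 10-12 and 14-17 times for C, and 7 to 12 and 11 to 17 times for Java.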
These differences are statistically significant. The Â12 value indicates that in more than 98.7% of the cases, Cerebro executes fewer tests than Decision Trees and Random.

DISCUSSION
Cerebro is a learning-based method, and thus its performance depends on a number of parameters and design decisions we made. To this end, we discuss the key (intuitive) parameters that make the Machine Translation approach we use effective (Section 8.1), together with empirical results demonstrating the potential impact on the model's performance given the design decisions of using unabstracted code sequences (Section 8.2), sequences with decreased length during training (Section 8.3), and the impact of assuming unkilled mutants as equivalent mutants during testing (Section 8.4).

Why is Cerebro a good candidate for subsuming mutant prediction?
There are three main factors that make Machine Translation a good candidate for subsuming mutant prediction. The first one is that it learns to select mutants using the exact local context (the entire code snippet, composed of 50-100 tokens and represented as a sequence), while previous work considers AST and data-flow abstractions [11], ignoring the exact formulation of the code snippet. In a sense, the key determining factor is the order in which code tokens appear in the local context (the considered code snippet). The second reason is that the machine translator includes a powerful self-attention mechanism, which together with the encoder-decoder architecture makes the learning resistant to noise [60] and able to learn from imbalanced data. Overall, previous research has shown that this architecture often makes the best predictions for many NLP tasks [20]. This is actually the reason why Machine Translation has been successfully used in code analysis tasks such as mutant generation, code clone detection, test assertion generation, etc. The third reason is the diversity of the selected mutants, i.e., Cerebro selects a few mutants per code block, which allows eliminating local redundancies while spreading testing across the entire code-base.
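To illustrate what a bounded local context of tokens may look like, the following Python sketch tokenizes a code snippet and cuts a window of at most a given number of tokens around a mutated position. It is only a sketch under stated assumptions: the regular-expression tokenizer, the centering of the window on the mutated token, and all names are hypothetical and do not describe Cerebro's actual preprocessing.

```python
import re

# Simplistic tokenizer (an assumption): identifiers, numbers, a few
# two-character operators, then any other single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|==|!=|<=|>=|&&|\|\||[^\s\w]")


def tokenize(code):
    """Split a code snippet into a flat list of tokens."""
    return TOKEN_RE.findall(code)


def local_context(tokens, mutated_index, max_len=50):
    """Return at most max_len consecutive tokens that include the mutated token."""
    if not 0 <= mutated_index < len(tokens):
        raise IndexError("mutated_index is outside the token sequence")
    half = max_len // 2
    start = max(0, mutated_index - half)
    end = min(len(tokens), start + max_len)
    # Shift the window left when it was cut short by the end of the sequence.
    start = max(0, end - max_len)
    return tokens[start:end]


tokens = tokenize("if (a <= b) return a + 1;")
window = local_context(tokens, tokens.index("<="), max_len=5)
```

The window keeps the tokens in their original order, which is the property the argument above relies on.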

Impact of removing code abstraction
We analyzed the impact of using unabstracted code sequences to train our models, instead of the proposed abstracted code sequences, and how this affects the model prediction performance (RQ1). In this experiment, we only removed the code comments and kept everything else as is. We found a reduction in prediction performance for projects in both the C and Java benchmarks. For C-Benchmark, the model performance deteriorated by 18

Impact of reducing the sequence length
We also analyzed the impact of reducing the length of the sequences that we use to train our models and how this affects the model prediction performance (RQ1). In this experiment, we reduced the sequence length from 50 tokens per sequence to 25 tokens per sequence. Figure 9 and Table 4 show the average and median scores achieved by the models. For simulation details on Effectiveness Evaluation (RQ2), Number of Analyzed Mutants (RQ3) and Number of Test Executions (RQ4), please refer to our online repository. From these results, we found that reducing the length of the training sequences also deteriorated the model prediction performance for projects in both the C and Java benchmarks. For C-Benchmark, the model performance deteriorated by 23.5% in MCC, 22.2% in Precision and 18.1% in Recall. For Java-Benchmark, although we found an improvement of 18.7% in Recall, the overall performance deteriorated by 24.7% in MCC and 28.6% in Precision.
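For readers less familiar with the three prediction metrics compared above, the following Python sketch derives Precision, Recall and MCC (Matthews Correlation Coefficient) from the four cells of a binary confusion matrix. The function name and the counts are hypothetical and chosen only for illustration; they are not results from our experiments.

```python
import math


def precision_recall_mcc(tp, fp, fn, tn):
    """Compute Precision, Recall and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is empty.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return precision, recall, mcc


# Hypothetical counts: mutants predicted as subsuming vs. their actual labels.
precision, recall, mcc = precision_recall_mcc(tp=40, fp=10, fn=20, tn=130)
```

Unlike Precision and Recall, MCC uses all four cells, which is why it can drop even when Recall improves, as observed for Java-Benchmark.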

Impact of considering as equivalent mutants that are subsuming, i.e., impact of potential mistakes in our evaluation
In our experiments, we considered the mutants that were not killed by our test suite as unkillable, a.k.a. equivalent. Although identifying equivalent mutants is an undecidable problem (as we elaborated in Section 6.2), we analyzed what would have happened if the mutants that we considered as equivalent were subsuming instead. We addressed this by introducing noise in our evaluation, i.e., we assumed 2% of the equivalent mutants in our evaluation set to be subsuming, and analyzed the change in performance (MS* achieved) for all the approaches (Cerebro, Decision Trees and Random). We gradually increased the noise percentage from 2% to 10% (i.e., 2%, 4%, 6%, 8%, 10%) and analyzed whether the MS* achieved by each approach increases or decreases as the noise increases.
We found that Cerebro's and Decision Trees' performances are more or less inversely related to the noise in the evaluation (Figure 10). The higher the noise, the lower the MS* achieved by both approaches (with the exception of 10% noise in the C benchmark, where Decision Trees performed better than with 8% noise, as detailed in Table 5). For Random selection, the performance also deteriorated in most of the cases, with the exceptions of 10% noise in the C benchmark, and 6% and 8% noise in the Java benchmark, where Random's MS* improved by 6.48%, and by 0.23% and 1.26%, respectively. Despite the reduction in performance due to the introduced noise, Cerebro still achieves higher MS* than the baselines (Figure 10).
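The relabeling step of this noise experiment can be sketched as follows in Python. This is a minimal illustration under assumptions: the label strings, the function name, the fixed seed, and the uniform random choice of which equivalent mutants to relabel are all hypothetical, and the sketch deliberately stops before recomputing MS*.

```python
import random

NOISE_LEVELS = (0.02, 0.04, 0.06, 0.08, 0.10)


def inject_label_noise(labels, noise, seed=0):
    """Return a copy of labels in which a fraction `noise` of the mutants
    labeled 'equivalent' is relabeled as 'subsuming'."""
    rng = random.Random(seed)
    equivalent = sorted(m for m, label in labels.items() if label == "equivalent")
    flipped = set(rng.sample(equivalent, round(noise * len(equivalent))))
    return {m: ("subsuming" if m in flipped else label) for m, label in labels.items()}


# Hypothetical evaluation set: 100 equivalent, 20 subsuming, 30 subsumed mutants.
labels = {f"e{i}": "equivalent" for i in range(100)}
labels.update({f"s{i}": "subsuming" for i in range(20)})
labels.update({f"d{i}": "subsumed" for i in range(30)})

noisy_sets = {noise: inject_label_noise(labels, noise) for noise in NOISE_LEVELS}
```

Each noisy label set can then be used in place of the original one when the MS* of an approach is recomputed.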

THREATS TO VALIDITY
External Validity: Threats may relate to the subjects we used. Although our evaluation expands to both C and Java projects of different sizes, the results may not generalize to other projects or programming languages. We consider this threat as low since we have a large sample of programs, i.e., we perform one of the largest mutation testing studies to date.
Another external threat lies in the operators we used, since our prediction approach might not work for different types of mutants. To reduce this threat, we employ modern mutation tools for both C and Java that implement a large variety of mutation operators. For the C-Benchmark, taken from [12], 816 simple operators across 18 categories were considered, while for creating our Java-Benchmark, we consider the group "ALL" of mutation operators provided by Pitest [17], resulting in 112 simple operators across 29 categories.
Internal Validity: Threats may relate to the restriction that we impose on sequence length, i.e., a maximum of 100 tokens. This was done to enable a reasonable model training time, approximately 740 hours. Moreover, restricting the sequence length to 50 helped to reach an appropriate training time of 360 hours. However, it resulted in a prediction performance deterioration of approximately 15%, as discussed in Section 8. Other threats may be due to the use of machine translation for classification. This choice was made for simplicity, to use the related framework out of the box, similar to the related studies [61], [62]. Still, a potential "sequence to class classifier" may yield better results, though such improvements should be marginal given the low number of unexpected labels we get, i.e., on average, 2.15% of the mutants do not get a valid label (4.2% in C and 0.1% in Java).
Threats may also relate to the features we implemented for training the Decision Trees baseline. We follow the guidelines provided in [11] to extract the 16 features for our Java dataset. Unfortunately, many of the 28 features for C programs presented in [11] depend on the semantics of the C language, which we found infeasible to replicate in Java. However, the prediction performance of Decision Trees in Java is in line with the results obtained for C, indicating that the impact of this threat is low.
Other internal validity threats could be related to the test suites we used and the mutants considered as subsuming and equivalent. To deal with this issue, we used well-tested programs and state-of-the-art tools to generate extensive pools of tests (KLEE [10], SEMu [12], and EvoSuite [23]). Since identifying subsuming and equivalent mutants is an undecidable problem, in our experimental setup, we approximate them through an extensive pool of tests. This has been a typical process followed in related mutation testing studies [3], [28], [35], [46], [47]. To be more accurate, our underlying assumption is that the extensive pool of tests used in our experiments is a valid representation of all possible tests that a tester can manually or automatically generate. This assumption allowed us to identify the minimal set of mutants (i.e., subsuming mutants) that a tester needs to kill in order to kill every other killable mutant (i.e., subsumed mutants). Also, we assumed that unkilled mutants are equivalent. Even if this may not be the case, it is likely that testers guided by mutation will not be able to kill all the killable mutants. Here it must be noted that since Cerebro is quite precise, feeding it with less noisy data, i.e., correct labels, will make it perform better, i.e., more accurate labelling in training will result in better predictions. Nevertheless, we also investigated the impact of having such noisy data and found minor discrepancies; please refer to Section 8.4.
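The approximation described above can be sketched in Python as follows: given, for each mutant, the set of tests from the pool that kill it, unkilled mutants are set aside as (assumed) equivalent, mutants killed by exactly the same tests are grouped, and a group is kept as subsuming when no other killable mutant is killed by a strict subset of its tests. The data structure, the function names and the example kill sets are illustrative assumptions, not our actual implementation.

```python
def subsuming_mutants(kills):
    """kills maps a mutant id to the set of test ids that kill it.
    Returns the groups of mutually indistinguishable subsuming mutants."""
    # Mutants killed by no test are treated as equivalent and ignored here.
    killable = {m: frozenset(tests) for m, tests in kills.items() if tests}
    groups = {}
    for mutant, tests in killable.items():
        groups.setdefault(tests, []).append(mutant)
    subsuming = []
    for tests, members in groups.items():
        # A group is subsumed if some other mutant dies on a strict subset of its tests.
        if not any(other < tests for other in groups):
            subsuming.append(sorted(members))
    return sorted(subsuming)


def subsuming_mutation_score(groups, kills, selected_tests):
    """MS*: fraction of subsuming groups killed by the selected tests."""
    if not groups:
        return 0.0
    killed = sum(1 for group in groups if set(kills[group[0]]) & set(selected_tests))
    return killed / len(groups)


# Hypothetical kill sets obtained from a test pool.
kills = {"m1": {"t1"}, "m2": {"t1", "t2"}, "m3": {"t3"}, "m4": set(), "m5": {"t3"}}
groups = subsuming_mutants(kills)
ms_star = subsuming_mutation_score(groups, kills, {"t1"})
```

In this hypothetical example, m2 is subsumed by m1, m4 is assumed equivalent, and m3 and m5 form a single subsuming group, so a test suite containing only t1 kills one of the two subsuming groups.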
Cerebro's use may also pose additional threats. In particular, Cerebro required approximately 5 minutes for preprocessing of the projects and 5 minutes for classification (decoding results). While this time overhead is low compared to the hours of test executions, it may still be important. Our implementation is non-optimal and involves no parallelism; however, our encoding and decoding can easily be parallelized, since mutant instances are independent of one another.
Construct Validity: Our assessment metrics, i.e., subsuming mutation score, number of equivalent mutants and number of test executions, may not reflect the actual testing cost/effectiveness values. These metrics have been suggested by the literature [5], [35], [48] and are intuitive, i.e., the number of selected and analyzed mutants essentially simulates the manual effort involved by testers, the subsuming mutation score reflects the level of covering the test requirements [3], [47], and the number of test executions captures the computational effort involved. Here it should be noted that automated test generation tools may reduce this cost, but they require testers to check the related test oracles. Similarly, equivalent mutant detection techniques and related heuristics may also reduce the manual effort involved [33]. However, in C we applied TCE (Trivial Compiler Equivalence) [25], [32] to filter out equivalent and duplicated mutants and our approach still provided significant benefits. Similarly, the number of test executions captures the computational effort involved independently of the test execution framework and optimizations used [15], [48], [64], [69], the machines and the level of parallelization used during test execution. Nevertheless, the differences are substantial, making such threats unlikely to materialize. Overall, we mitigate these threats by following suggestions from the mutation testing literature [5], [35], [48], using state-of-the-art tools, performing several simulations, forming very large and diverse test pools, and obtaining consistent and stable results across our subjects.

RELATED WORK
Mutation testing has been established as one of the strongest test criteria [3], [14]. Despite its potential, mutation is considered to be expensive since it introduces too many mutants. To this end, random mutant sampling [19], [50] and selective mutation [44] (restricting mutant instances according to their types) have been proposed as potential solutions. Unfortunately, these approaches fail to capture relevant program semantics and perform similarly to random mutant sampling [11], [35], [68].
Other attempts regard the selection of relevant program locations, which should be mutated. Sun et al. [58] proposed selecting mutants that reside in diverse static control flow graph paths. Gong et al. [24] identified dominator nodes (using static control flow graph) to select mutants.
More recent attempts regard the identification of interesting mutants (pairs of mutant types and related locations). Petrovic and Ivankovic [52] and Just et al. [30] proposed using the code AST in order to identify "useful" mutants. Petrovic and Ivankovic used what they called arid nodes (special AST nodes), while Just et al. used the AST parent and child nodes, in order to identify high utility mutants. Mirshokraie et al. [42] employed complexity metrics together with test executions to select killable mutants. Similarly, Titcheu et al. [11] employed static features, including data flow analysis, complexity and AST information, in order to perform mutant selection with respect to mutants linked with real faults.
In our analysis, we approximate the performance of the above approaches through the two baselines we adopt and show that our approach significantly outperforms them. Random mutant sampling performs comparably to operator-based mutant selection [68], while the supervised baseline we consider simulates the AST-based and complexity-based approaches.
Perhaps the closest work to ours is that of Marcozzi et al. [39], which attempts to identify subsumed mutants using verification techniques (such as weakest precondition). While Marcozzi et al.'s approach is particularly powerful, it targets weak mutation and not strong mutation as we do. This results in several false positives in the strong mutation case due to failed error propagation [14]. Moreover, Marcozzi et al.'s approach is time consuming and requires complex computations and infrastructure, while Cerebro is fast and simple. Nevertheless, future research should attempt to combine these methods.
Tufano et al. [61] proposed using Neural Machine Translation to learn mutations from bug fixes with the aim of introducing mutations that are syntactically similar to real bugs. Cerebro relies on the same technology, though it targets a different problem: the identification of high utility mutants among those given by regular mutation testing tools, while Tufano et al. aim at generating mutants regardless of their potential. This indicates that Cerebro can complement Tufano et al. by selecting relevant mutants. Nevertheless, we focus on subsuming mutants, which could help in measuring test adequacy and designing test suites; these are unlikely to be supported by Tufano et al., as there is no notion of subsumption in the bug-fixing sets they use. Moreover, we make no assumption on the availability and repetitiveness of historical bugs and their fixes.
Predictive mutation testing (PMT) [67] attempts to predict whether a given test can kill a given mutant without performing any mutant execution. The approach relies on a set of both static and dynamic features (relying on coverage and code attributes) and achieves relatively good results (on average with 10% error). However, PMT mainly targets intra-project predictions, while Cerebro targets inter-project predictions. Moreover, PMT is not directly comparable to Cerebro since it aims at evaluating test execution results, while we perform mutant selection prior to any test execution. In other words, we aim at identifying the mutants to be used for test design/generation, while PMT aims at verifying whether mutants are killed by some tests. Therefore, the two methods target different but complementary problems.
Evolutionary Mutation Testing (EMT) [21] utilises dynamic features (execution traces) in order to identify interesting locations and mutant types. As such, EMT requires tests and user feedback, which make it different from but complementary to ours; Cerebro can set a starting point for EMT or integrate its predictions within EMT's fitness function. Higher-order mutation [28] aims at dynamically optimizing mutants based on given test suites. This means that Cerebro can be directly applied to support test generation before any tests exist, while higher-order mutation is only applicable after test generation. Perhaps more importantly, Cerebro does not introduce any expensive dynamic mutant execution, while higher-order mutation introduces major mutant execution overheads.

CONCLUSION AND FUTURE WORK
We presented Cerebro, a method that learns to select subsuming mutants (the subset of mutants that subsumes the others, i.e., tests killing them also kill all the mutants of the given mutant set) from given mutant sets. Experiments with 58 programs showed that Cerebro identified subsuming mutants with 0.85 precision and 0.33 recall in an inter-project scenario (trained on different projects than the ones it was evaluated on). These predictions enable testers to design test cases capable of killing more than two times the subsuming mutants that they would kill if they were using either randomly selected mutants or another previously proposed machine learning-based mutant selection technique. At the same time, Cerebro entails the analysis of 66% fewer equivalent mutants and 90% fewer mutant executions, indicating a large reduction in the practical effort/cost of the approach.
Recently, it has become increasingly common to pre-train the entire model on a data-rich task, which causes the model to develop general-purpose abilities and knowledge that can then be transferred to downstream tasks [53]. In applications of this practice, known as Transfer Learning, to computer vision [27], [45], pre-training is typically done via supervised learning on a large labeled data set like ImageNet [54]. In contrast, modern techniques for transfer learning in Natural Language Processing (NLP) often pre-train using unsupervised learning on unlabeled data [20], [37]. The resulting pre-trained models are further trained on specialized datasets to accomplish the desired tasks. Unsupervised pre-training for NLP is attractive and seems a good fit for neural networks, as they have been shown to exhibit remarkable scalability, i.e., it is often possible to achieve better performance simply by training a larger model on a larger data set [26], [29], [38], [55]. It will be worthwhile to explore such available pre-trained models [22], [40] and whether these can be further refined to address our specific prediction task.
On the other hand, as we have shown that Cerebro is proficient in capturing the salient features and patterns of the code context, it is promising to explore Cerebro in security-specific tasks such as the prediction of zero-day vulnerabilities, which pose a very high risk [66]. Vulnerabilities are fewer in comparison to defects, limiting the information one can learn from. Also, their identification requires an attacker's mindset [43], which developers or code reviewers may not possess. Lastly, the continuous growth of codebases makes it difficult to investigate them entirely and track all code changes. For instance, the Linux kernel, which is one of the projects with the highest number of publicly reported vulnerabilities, reached 27.8 million LoC (Lines of Code) at the beginning of 2020 [36]. Hence, it will also be rewarding to explore Cerebro in this line of work.