Test Input Prioritization for Machine Learning Classifiers

Machine learning has achieved remarkable success across diverse domains. Nevertheless, concerns about the interpretability of black-box models, especially Deep Neural Networks (DNNs), have become pronounced in safety-critical fields like healthcare and finance. Classical machine learning (ML) classifiers, known for their higher interpretability, are preferred in these domains. Like DNNs, classical ML classifiers can exhibit bugs that could lead to severe consequences in practice. Test input prioritization has emerged as a promising approach to ensure the quality of an ML system: it prioritizes potentially misclassified tests so that such tests can be identified earlier with limited manual labeling costs. However, when applied to classical ML classifiers, existing DNN test prioritization methods are constrained from three perspectives: 1) coverage-based methods are inefficient and time-consuming; 2) mutation-based methods cannot be adapted to classical ML models due to mismatched model mutation rules; 3) confidence-based methods are restricted to a single dimension when applied to binary ML classifiers, depending solely on the model's prediction probability for one class. To overcome these challenges, we propose MLPrior, a test prioritization approach specifically tailored for classical ML models. MLPrior leverages the characteristics of classical ML classifiers (i.e., interpretable models and carefully engineered attribute features) to prioritize test inputs. The foundational principles are: 1) tests more sensitive to mutations are more likely to be misclassified, and 2) tests closer to the model's decision boundary are more likely to be misclassified. Building on the first principle, we design mutation rules to generate two types of mutation features (i.e., model mutation features and input mutation features) for each test. Drawing on the second, MLPrior generates attribute features for each test based on its attribute values, which can indirectly reveal the proximity between the test and the decision boundary. For each test, MLPrior combines all three types of features into a final vector. Subsequently, MLPrior employs a pre-trained ranking model to predict the misclassification probability of each test based on its final vector and ranks tests accordingly. We conducted an extensive study to evaluate MLPrior based on 185 subjects, encompassing natural datasets, mixed noisy datasets, and fairness datasets. The results demonstrate that MLPrior outperforms all compared test prioritization approaches, with an average improvement of 14.74%∼66.93% on natural datasets, 18.55%∼67.73% on mixed noisy datasets, and 15.34%∼62.72% on fairness datasets.


Xueqi Dang, Yinghua Li, Mike Papadakis, Jacques Klein, Member, IEEE, Tegawendé F. Bissyandé, and Yves Le Traon, Fellow, IEEE

I. INTRODUCTION
Machine learning classifiers have seen remarkable success in various domains [1], including image recognition [2], natural language processing [3], [4], and recommendation systems [5], [6]. However, the prevalence of black-box models, especially in deep learning, has raised concerns about their lack of interpretability, which refers to the extent to which a model's internal mechanism and decision-making processes can be comprehended and explained transparently to humans. Interpretability becomes particularly vital in safety-critical domains like healthcare and finance [7], where model decisions can profoundly impact individuals' lives and societal well-being.
Compared to black-box models, classical machine learning (ML) algorithms (e.g., XGBoost [8], decision tree [9] and logistic regression [10]) offer more interpretable solutions, making them an appealing choice for domains that prioritize transparency and comprehensibility.
While classical ML classifiers are inherently interpretable, ensuring their accuracy and reliability remains a challenge. Testing is a fundamental practice for ensuring the quality of ML systems. However, a significant challenge in ML testing is the labeling cost issue [11] (i.e., labeling test inputs to verify the correctness of predictions can be costly). This challenge arises due to several factors: 1) manual annotation is still the mainstream for labeling; 2) test sets can be large-scale, which increases labeling efforts; 3) domain-specific knowledge can be required in certain domains for labeling tabular data, such as the medical domain [12], [13], [14]. For instance, when applying XGBoost for chronic kidney disease (CKD) detection [12], labelling the CKD dataset for model training/testing requires specialized medical expertise to determine whether a patient has CKD.
To deal with the labelling cost problem, one intuitive solution is to prioritize tests that can cause the ML model to behave incorrectly (i.e., inputs that are more likely to be misclassified by the model). Early identification and labelling of such tests can save manual labelling effort and enhance the overall efficiency of the testing process. In the literature, various test prioritization approaches [15], [16] have been proposed in the field of DNN testing. These techniques can be broadly classified into three categories: coverage-based [17], [18], [19], confidence-based [16], [20], and mutation-based [15] approaches.
Coverage-based approaches prioritize test inputs based on the neuron coverage of DNNs. Confidence-based methods identify possibly-misclassified test inputs by quantifying the classifier's output confidence for each test. One notable confidence-based approach is DeepGini [16], which leverages the Gini score as a metric to quantify confidence levels for effective test prioritization. Recently, Weiss et al. [20] conducted a comprehensive study to assess existing test prioritization methods, containing the evaluation of a series of confidence-based metrics, including Vanilla Softmax, Prediction-Confidence Score (PCS), and Entropy. Mutation-based techniques propose a set of mutation operations and utilize the mutated results for test prioritization. While these approaches have made considerable progress in prioritizing potentially-misclassified test inputs, they still face certain challenges and limitations.
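The confidence metrics mentioned above can be sketched as follows. This is a minimal illustration: the function names and the most-uncertain-first ranking convention are ours, not the exact implementations from the cited papers.

```python
import numpy as np

def gini_score(probs):
    # DeepGini-style impurity: higher value means lower model confidence
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2)

def entropy_score(probs):
    # Shannon entropy of the prediction probability vector
    probs = np.asarray(probs, dtype=float)
    p = probs[probs > 0]
    return float(-np.sum(p * np.log(p)))

def pcs_score(probs):
    # Prediction-Confidence Score: gap between the two largest
    # probabilities; a small gap signals uncertainty, so negate for ranking
    top2 = np.sort(np.asarray(probs, dtype=float))[-2:]
    return -(top2[1] - top2[0])

def prioritize(prob_vectors, metric):
    # Rank test indices from most to least uncertain under the given metric
    scores = [metric(p) for p in prob_vectors]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

For example, `prioritize([[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]], gini_score)` places the second test first, since its probability vector is closest to uniform.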
First, prior studies [16] have demonstrated that coverage-based methods are ineffective and time-costly compared to confidence-based approaches. Second, the mutation-based test prioritization approach, PRIMA [15], is not applicable to classical ML models due to the lack of adapted model mutation operators. Third, while confidence-based test prioritization approaches can be adapted for classical ML models, there are several limitations associated with their application in this context. We outline the main limitations as follows. Specific details can be found in the background section (cf. Section II).
• Single dimension on binary classification models: Binary classification models categorize test inputs into two classes, and in confidence-based approaches, the likelihood of a test being misclassified primarily relies on the model's prediction probability p. Tests with p values closer to 0.5 will be consistently prioritized regardless of the specific method used, as demonstrated through experimental results.
• Lack of model-specific insights: Confidence-based approaches, viewing the model as a black box and relying solely on its prediction probabilities, do not take into account the transparency and interpretability provided by classical ML models, leading to suboptimal prioritization.
• Ignoring attribute features: Confidence-based methods neglect the attribute features of classical ML test datasets, which can directly map tests into space and indirectly reflect the distance between samples and the model's decision boundary. Confidence-based approaches ignore this crucial feature information in the process of test prioritization.

In this paper, we propose MLPrior (Classical ML-oriented Test Prioritization), a test prioritization approach specifically tailored for classical machine learning (ML) models. MLPrior addresses the aforementioned limitations, leveraging the characteristics of classical ML classifiers (i.e., interpretable models and carefully engineered attribute features) to prioritize test inputs. The core ideas behind MLPrior are twofold: 1) tests more sensitive to the injected mutations are more likely to reveal bugs, and 2) test inputs closer to the decision boundary of the model are more likely to be predicted incorrectly. Both premises have been validated by existing studies [21], [22], [23], [24], with a detailed explanation provided in the Background section. Building upon these premises, MLPrior utilizes the characteristics of classical ML classifiers to prioritize test inputs, addressing the limitations of confidence-based methods in the following way.
• Premise 1: tests more sensitive to the injected mutations are more likely to reveal bugs. Based on this premise, we design mutation rules specifically based on the characteristics of classical ML models and their datasets. 1) Model mutations: Leveraging the white-box nature of most classical ML models, we design mutation rules specifically tailored for classical ML models. These rules involve modifying the model's architecture parameters or weight parameters to perform model mutations. 2) Input mutations: Considering the tabular format of classical ML datasets, which is different from the complex data structures of DNN datasets (e.g., text and images), we design input mutation rules specifically tailored for classical ML datasets.
• Premise 2: test inputs closer to the decision boundary of the model are more likely to be predicted incorrectly. To effectively capture the spatial relationship between a test input and the decision boundary, we aim to transform the attribute features of each test into a vector to indirectly reveal the underlying proximity between the input and the decision boundary. Recognizing the carefully selected features of the classical ML test set, we design transformation rules to convert the original attributes of each test into a feature vector for test prioritization.

Using model mutation rules and input mutation rules, we create a feature vector for each test. More specifically, we generate mutants based on the mutation rules. These mutants are then executed to generate mutation features for the purpose of assessing the sensitivity to the injected mutations. As a result, we obtain three types of features for each test: model mutation features (MMF), input mutation features (IMF), and original attribute features (OAF).
• Model mutation features (MMF): MMF capture the impact of model mutations on a test input. If an input can kill many mutated models (i.e., the predictions for this input via the mutated models and the original model are different), indicating that this input is sensitive to model mutations, MLPrior considers this input more likely to be misclassified.
• Input mutation features (IMF): IMF capture the impact of mutations on test inputs. If the prediction result for a given test input is different from that of many of its mutated inputs, indicating that the predictions for the input are sensitive to the mutations, MLPrior considers this input more likely to be misclassified.
• Original attribute features (OAF): OAF capture the spatial relationship between a test input and the decision boundary. They directly reflect the original attribute information of each test.
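As an illustration of the two kinds of mutation features, the following sketch derives binary kill vectors from weight-perturbed model mutants and noise-perturbed input mutants. The mutation rules shown here (Gaussian perturbation of logistic regression coefficients and of attribute values) are simplified stand-ins for MLPrior's actual rules, and all names and data are hypothetical.

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical tabular data: the class depends on the first two attributes
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X, y)

def model_mutation_features(model, x, n_mutants=20, scale=0.3):
    """MMF sketch: 1 where a weight-perturbed model mutant flips the
    original prediction (i.e., the input 'kills' that mutant)."""
    original = model.predict(x.reshape(1, -1))[0]
    feats = []
    for _ in range(n_mutants):
        mutant = deepcopy(model)
        mutant.coef_ = mutant.coef_ + rng.normal(scale=scale, size=mutant.coef_.shape)
        feats.append(int(mutant.predict(x.reshape(1, -1))[0] != original))
    return np.array(feats)

def input_mutation_features(model, x, n_mutants=20, scale=0.3):
    """IMF sketch: 1 where a noise-perturbed copy of the input is
    predicted differently from the original input."""
    original = model.predict(x.reshape(1, -1))[0]
    mutants = x + rng.normal(scale=scale, size=(n_mutants, x.size))
    return (model.predict(mutants) != original).astype(int)
```

Under this sketch, an input near the decision boundary tends to kill more mutants than one far from it, which is exactly the sensitivity signal the features are meant to encode.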
MLPrior combines the three types of features for each test input in the target test set to generate a final feature vector. This vector is then used by a pre-trained ranking model to effectively predict the probability of misclassification for that input. MLPrior offers several advantages:
• Generality: MLPrior can be adapted to a wide range of classical ML models by making simple adjustments to the model mutation rules (i.e., enabling them to target the architecture parameters or weight parameters of the evaluated model).

Our proposed approach MLPrior is designed to leverage the attribute features of ML test sets for test prioritization and demonstrates broad applicability across various contexts. One specific application pertains to banking loan operations, where classical ML models are employed to determine whether a loan can be granted to a user. In this scenario, classical ML models utilize a set of user attributes (e.g., gender, age, and transaction history) to predict the viability of granting a loan to a user. Incorrect predictions can lead to significant losses for the bank. For instance, if the bank mistakenly grants a loan to a user without the ability to repay, these users can fail to meet their repayment obligations, increasing the risk of default and causing damage to the bank's assets. In this context, MLPrior can identify and prioritize users who are more likely to be misclassified by the model. Consequently, two main advantages arise: first, these potentially misclassified users can be prioritized for manual inspection, resulting in a decrease in losses caused by inaccurate predictions generated by the model; second, developers can manually inspect the attributes of misclassified users and analyze which attributes led to prediction errors, using this information to optimize the model.
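A minimal sketch of the final-vector construction and ranking step might look as follows. The ranking model, feature dimensions, and training data are all illustrative assumptions, not MLPrior's actual configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_final_vector(mmf, imf, oaf):
    # Concatenate model-mutation, input-mutation, and original attribute features
    return np.concatenate([mmf, imf, oaf])

def rank_tests(ranking_model, final_vectors):
    """Order test indices by predicted misclassification probability, descending."""
    probs = ranking_model.predict_proba(final_vectors)[:, 1]
    return np.argsort(-probs)

# Hypothetical "historical" tests whose misclassification outcome (0/1)
# is already known, used to pre-train the ranking model
rng = np.random.default_rng(1)
hist_vectors = rng.normal(size=(100, 12))
hist_labels = (hist_vectors[:, 0] > 0.5).astype(int)
ranker = RandomForestClassifier(n_estimators=50, random_state=0).fit(hist_vectors, hist_labels)

# Rank ten new tests, each with 4-dimensional MMF, IMF, and OAF
new_vectors = np.stack([
    build_final_vector(rng.normal(size=4), rng.normal(size=4), rng.normal(size=4))
    for _ in range(10)
])
order = rank_tests(ranker, new_vectors)
```

The point of the design is that the ranking model, once trained on tests with known outcomes, can be reused on unlabeled test sets: only the feature extraction must be rerun.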
We conducted an extensive study to evaluate MLPrior's performance utilizing 185 subjects (i.e., paired datasets and ML models). The evaluation encompassed different types of test inputs, including natural data, mixed noisy data, and fairness data. Ensuring fairness in machine learning is essential to prevent bias and discrimination against specific groups during predictions. Fairness has become a critical ethical consideration in diverse machine learning domains, including recruitment, loan approvals, and medical diagnosis [25]. In these domains, the absence of fairness can lead to unjust treatment of particular groups, affecting individuals' lives and rights. Therefore, the evaluation of MLPrior's effectiveness on fairness datasets assumes crucial importance. To generate the fairness datasets, we followed the approach of prior research [26]. Specifically, we selected a group of test inputs and modified their gender and age attribute values while retaining their original labels. Moreover, we carefully selected a group of test prioritization approaches that can be adapted to prioritize test inputs in the context of classical ML models as the comparative methods, which have been demonstrated effective in existing studies [16], [20]. Additionally, we utilize random selection as the baseline approach.
The experimental results demonstrate the superior performance of MLPrior compared to existing methods, with an average improvement of 14.74%∼66.93% on natural datasets, 18.55%∼67.73% on mixed noisy datasets, and 15.34%∼62.72% on fairness datasets. We publish our dataset, results, and tools to the community on Zenodo.
To sum up, our work has the following major contributions:
• Approach. We propose MLPrior, a novel test prioritization approach specifically designed for classical ML models.
• Study. We conduct an extensive study based on 185 subjects involving natural, mixed noisy, and fairness test inputs. We compare MLPrior with existing DNN test prioritization approaches. Our experimental results demonstrate the effectiveness of MLPrior.
• Performance Analysis. We assess the influence of various ranking models on MLPrior's effectiveness. Furthermore, we evaluate the contributions of different types of features to MLPrior's effectiveness. Additionally, we explore the impact of parameter settings on MLPrior's effectiveness.

II. BACKGROUND

A. Machine Learning and ML Testing
Machine Learning (ML) has gained widespread adoption in various domains, demonstrating significant utility in safety-critical sectors like autonomous vehicle systems [27] and medical intervention protocols [28]. Existing literature [11] pointed out that ML can be broadly classified into two primary branches: classical Machine Learning [8], [29] and Deep Learning [30], [31]. Classical Machine Learning encompasses a range of approaches, including decision trees [9] and logistic regression [10]. These classical algorithms remain widely employed in various industrial applications [32], [33]. DNNs consist of interconnected nodes (neurons) organized in layers, with each layer responsible for learning and abstracting different levels of features from input data. In contrast to DNNs, classical ML models are generally more interpretable [34]. Interpretability in machine learning refers to the degree to which a model's internal mechanisms and decision-making processes can be understood and transparently explained to humans. Interpretability is crucial in domains where transparency is essential, such as healthcare [35] and finance [36]. Therefore, classical machine learning models retain distinct advantages in certain application domains.
In order to emphasize the importance of interpretability in safety-critical domains, we present several typical harms caused by black-box ML systems in the financial and healthcare industries:
1) Risk Management Challenges in Finance: Weber et al. [37] highlighted that, in the financial field, a high degree of transparency and interpretability is required for effective risk management. The lack of interpretability in black-box models can make it challenging for financial institutions to understand how decisions are made, thereby increasing the difficulty of risk management.
2) Legal and Ethical Issues in Finance: Chen et al. [38] pointed out that, according to legal and ethical principles, financial companies are required to provide clear explanations for the reasons behind specific loan application rejections. However, with black-box models, loan applicants are unaware of how their scores are calculated. Even if model explanations are provided, there can be a disconnect between the explanations for loan rejection and the actual model calculations, as the explanations could be created after the fact.
3) Trust Issues in Healthcare: Adadi et al. [39] discussed the constrained acceptance of black-box models in clinical settings due to trust and transparency issues. Moreover, Verdicchio et al. [40] raised a vital question: "If doctors cannot understand why a black-box model diagnoses, why should patients trust the treatment recommendations?" This implies that black-box models lack interpretability, making it difficult to explain the fundamental reasons behind their diagnostic or treatment recommendations. Therefore, patients and doctors can be skeptical of the system's suggestions and even refuse to follow its recommendations because they cannot be certain if these recommendations are based on sound medical reasoning. This lack of trust and understanding can significantly affect patients' confidence in the proposed treatments, potentially hindering their willingness to undergo specific medical procedures.
4) Responsibility Issues in Healthcare: Smith et al. [41] pointed out that if patients are harmed due to recommendations from an opaque AI system (AIS) adopted by clinicians, questions arise about how responsibility will be assigned. Specifically, in the healthcare field, doctors are expected to take responsibility for their decisions. If a black-box system provides incorrect recommendations, doctors will find it challenging to explain why they followed the system's advice, potentially raising legal and ethical liability concerns.
Based on the existing studies [42], [43], [44], in the following, we provide the quantification of the loss resulting from the lack of interpretability in black-box models. Specifically, we employ descriptive terms to quantify the degree of loss in two specific scenarios: medical and financial.
• Medical Scenario: Amann et al. [42] pointed out that, in the medical domain, the lack of interpretability in black-box models can lead to serious legal and ethical uncertainty. Without adequate consideration of interpretability, these technologies can neglect regulatory issues and result in significant harm. Moreover, Grote et al. [43] pointed out that in the face of a black-box model lacking interpretability, its clinical decision support can constrain the capabilities of physicians. Specifically, physicians can rigidly adhere to the output of the black-box model to avoid being held accountable. This situation poses a serious threat to the autonomy of physicians.
• Financial Scenario: Yan et al. [44] pointed out that, in the financial domain, the lack of interpretability in the decision mechanisms of black-box models poses a challenge for financial practitioners and regulatory authorities in understanding the factors influencing the model's decisions. This can significantly impact the fairness of loan decisions, potentially resulting in substantial financial losses.

Although interpretability is a valuable trait, it is not the sole factor taken into account when deploying models, especially in the healthcare industry [45], [46]. Deep learning has also demonstrated remarkable success in healthcare applications [47]. However, there are compelling reasons that test prioritization for classical models remains highly necessary.
• Applicability to Structured Medical Data: Deep learning finds extensive use in the field of medical imaging [46], aiding in the automatic detection of diseases and tumors. However, a substantial portion of data in the healthcare sector exists in structured tabular formats. Classical machine learning models have demonstrated superior performance when dealing with structured medical data, outperforming deep learning methods [48], [49]. For instance, Shwartz et al. [48] pointed out that when handling tabular datasets, the classical ML technique XGBoost outperforms the evaluated DL models.
• Need for Interpretability: In healthcare [40], when clinicians need to justify their decisions to patients, having an understanding of the reasoning behind model predictions is essential. Classical machine learning models can provide this crucial information [50].
• Regulatory Approvals: Regulatory bodies can require models to elucidate their decision-making processes to facilitate comprehensive treatment risk assessment [7]. The interpretability that classical ML models can provide is crucial for obtaining regulatory approvals.

Machine learning testing involves systematically evaluating and validating machine learning models to ensure their accuracy, reliability, and effectiveness in prediction or decision-making [7], [51], [52], [53]. The primary goal is to reveal disparities between intended and actual behaviors exhibited by ML systems [11]. Compared to traditional software systems, machine learning testing presents distinct challenges. One pivotal challenge is the Oracle Problem [54], which pertains to the difficulty of acquiring accurate labels or ground truth for training and testing data. In the context of testing ML-based systems, automated testing oracles are typically unavailable. Therefore,
manual labeling remains the mainstream method, which can lead to substantial labeling costs. In the literature, numerous fields are dedicated to addressing labeling cost concerns, such as test selection [55], [56] and test prioritization [15], [16]. In our study, we concentrate on test prioritization, which will be further elaborated in the subsequent section.

B. Test Case Prioritization
In the field of traditional software testing, test case prioritization aims to determine the sequence in which test cases are executed to uncover defects more effectively. In the literature, numerous techniques for test prioritization have been proposed. The majority of these approaches are rooted in code coverage analysis. Notably, two primary coverage-based techniques are the Coverage-Total Method (CTM) and the Coverage-Additional Method (CAM) [57]. CTM operates by sequentially selecting tests with the highest coverage rates, followed by those with progressively lower rates. In cases where tests share the same coverage rate, the method introduces randomness to determine the prioritization. In contrast, CAM strategically utilizes feedback from previous selections, iteratively opting for tests that target previously uncovered code structures, thereby incrementally expanding the coverage.
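The two strategies can be sketched as follows, with each test mapped to the set of code elements it covers. This is a simplified illustration: CTM here breaks ties by input order rather than randomly, and the reset-on-saturation behavior of CAM is one common variant.

```python
def ctm(coverage):
    """Coverage-Total Method: order tests by total number of covered
    elements (ties kept in input order instead of random tie-breaking)."""
    return sorted(coverage, key=lambda t: len(coverage[t]), reverse=True)

def cam(coverage):
    """Coverage-Additional Method: greedily pick the test covering the
    most not-yet-covered elements; reset once no test adds coverage."""
    order, covered, remaining = [], set(), dict(coverage)
    while remaining:
        best = max(remaining, key=lambda t: len(remaining[t] - covered))
        if not remaining[best] - covered:  # nothing new left: reset and re-pick
            covered = set()
            best = max(remaining, key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order
```

For `{"t1": {1, 2, 3}, "t2": {1, 2}, "t3": {4, 5}}`, CTM yields t1, t2, t3 (raw totals), whereas CAM yields t1, t3, t2, since t2 adds no new coverage after t1.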
Test input prioritization in the field of Deep Neural Networks (DNNs) [15], [16], [20], [58] aims to enhance the efficiency of testing by focusing on test inputs that are more likely to expose model misclassifications, thereby revealing potential bugs earlier. This approach ensures that crucial test inputs are identified and labeled promptly within the constraints of limited time. Previous research [16] has indicated that confidence-based approaches outperform the aforementioned coverage-based methods. These confidence-based approaches prioritize tests based on the model's confidence. One notable approach is DeepGini [16], which surpasses all existing coverage-based prioritization methods in terms of both effectiveness and efficiency. A recent comprehensive investigation conducted by Weiss et al. delved into the capabilities of various confidence-based DNN test input prioritization techniques, such as Vanilla Softmax, Prediction-Confidence Score (PCS), and Entropy. They demonstrated the effectiveness of these approaches in identifying potentially misclassified test inputs.
However, while confidence-based test prioritization methods have been proven effective [16] and can be adapted for classical ML models, their application in the context of test prioritization for classical ML models is hindered by several limitations. We discuss these limitations as follows.
• Single dimension on binary classification models: Binary classification models [59], [60] categorize test inputs into two distinct classes, which limits the application of confidence-based test prioritization approaches to a single dimension. Specifically, when applying confidence-based approaches to these models, the first step is calculating the probabilities for each classification, denoted as (p, 1 − p).
If the model's prediction probability for a test is (0.5, 0.5), it means the model is most uncertain about this test [16], indicating this test is more likely to be misclassified. The closer a test's p value is to 0.5, the more uncertain the model is about that particular test. Consequently, uncertainty is solely determined by p.
Regardless of the specific confidence-based test prioritization method employed, tests with p values closer to 0.5 will be prioritized over others. To illustrate this point, consider a hypothetical test set with three tests, where the model's probability vectors are as follows: t1 = (0.9, 0.1), t2 = (0.7, 0.3), t3 = (0.8, 0.2). Irrespective of the chosen confidence-based test prioritization method, the resulting ranking will be t2 → t3 → t1, because t2 has the p value (0.7) closest to 0.5, followed by t3 (p = 0.8), while t1 has the p value farthest from 0.5 (p = 0.9).
The above conclusions have been confirmed through our experimental results. For each subject, all confidence-based methods yield identical effectiveness, indicating they produce the same ranking for a given test set.
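This behavior can be checked directly: for probability vectors of the form (p, 1 − p), each common confidence metric reduces to a monotone function of |p − 0.5|, so all of them induce the same ranking. The metric definitions below are simplified to a single argument p for the binary case.

```python
import math

def gini(p):
    # DeepGini impurity for the binary vector (p, 1 - p)
    return 1 - (p ** 2 + (1 - p) ** 2)

def entropy(p):
    # Shannon entropy of (p, 1 - p), assuming 0 < p < 1
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def pcs(p):
    # Negated top-2 probability gap: higher means more uncertain
    return -abs(p - (1 - p))

tests = {"t1": 0.9, "t2": 0.7, "t3": 0.8}

def ranking(metric):
    # Most-uncertain-first ordering of the test names
    return sorted(tests, key=lambda t: metric(tests[t]), reverse=True)

# Each metric is a monotone function of |p - 0.5|, so the rankings coincide
assert ranking(gini) == ranking(entropy) == ranking(pcs) == ["t2", "t3", "t1"]
```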
• Lack of model-specific insights: Confidence-based approaches for test prioritization consider the model a black box and rely solely on its prediction probability vectors. This neglects the transparency and interpretability of classical ML models, which are mostly white-box and have an understandable decision-making process. As a result, confidence-based approaches fail to incorporate crucial model-specific insights from classical ML models, leading to suboptimal test prioritization.
• Ignoring attribute features: Furthermore, confidence-based approaches ignore a crucial aspect of the test datasets for classical ML models, namely, the attribute features. These features are carefully engineered by domain experts to effectively capture and represent crucial aspects of the underlying data. They can directly reflect the attribute information of each test input. However, confidence-based approaches ignore this crucial feature information in the process of test prioritization.

To overcome the aforementioned limitations, we propose MLPrior, a test prioritization approach specifically tailored for classical ML models. MLPrior leverages the characteristics of classical ML classifiers (i.e., interpretable models and carefully engineered attribute features) to prioritize test inputs. The core premises behind MLPrior are twofold: 1) tests more sensitive to the injected mutations are more likely to reveal bugs, and 2) test inputs closer to the decision boundary of the model are more likely to be predicted incorrectly.
The first premise is grounded in the well-established practice of traditional mutation testing [21], [22], [23], [61], [62], which considers that test cases sensitive to mutations (i.e., able to capture mutants) have a higher capability to detect bugs in software. The second premise has been identified and demonstrated in prior work [24].

C. Mutation Testing
Mutation testing [63], [64] is a systematic software testing technique that has gained significant attention in both academic and industrial research communities [65], [66].
The fundamental principle is to introduce small, intentional modifications, called mutants, into the source code of a software system [67]. These mutations simulate potential faults that may occur during the execution of the program. A well-designed test suite should be able to detect the presence of these mutants, indicating its capability to detect real faults in the code [68]. In the context of mutation testing, the term "kill" refers to the ability of a test case to detect a specific mutant [69]. When a test case "kills" a mutant, the test case reveals a difference in behavior between the original program and the mutated version of the program. A test suite with a high mutation kill rate is considered more effective and reliable, as it demonstrates a greater ability to detect potential faults or deviations from the expected behavior.
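As a minimal toy illustration of the "kill" notion (not taken from the paper): a mutant replaces one operator in the original program, and a test case kills the mutant if the two versions disagree on its input.

```python
def max_of(a, b):          # original program
    return a if a >= b else b

def max_of_mutant(a, b):   # mutant: ">=" replaced with "<="
    return a if a <= b else b

def kills(a, b):
    """A test case (a, b) kills the mutant if the two behaviors differ on it."""
    return max_of(a, b) != max_of_mutant(a, b)

print(kills(3, 5))  # True: this test case kills the mutant
print(kills(4, 4))  # False: equal inputs cannot distinguish the two versions
```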

D. Automated Labeling Approaches for Machine Learning
Data labeling is a labor-intensive task that is indispensable in the development of supervised machine learning systems [70]. Conventional data labeling methods typically rely on manual effort, which is time-consuming and costly. Moreover, in specialized fields like medicine and finance, manual labeling necessitates domain-specific expertise, further increasing its cost. In recent years, various automated or semi-automated data labeling methods [71], [72] have emerged, aimed at reducing the burden of manual labeling and improving overall labeling efficiency.
Desmond et al. [72] introduced a semi-automated data labeling system that views the labeling task as a collaborative effort between human annotators and machine annotators, the latter implemented as predictive models. The core of this approach is a human-machine coactive process facilitated by a semi-supervised predictive model and an active learning selector. In each iteration, the active learning selector prioritizes the most uncertain examples for annotation by human annotators based on the model's predictions. The consistency between human decisions and machine predictions is continuously monitored and presented at various checkpoints, allowing annotators to assess the machine's performance in the labeling task. Once annotators are satisfied with the machine's performance, they can delegate the remaining labeling tasks to the machine (automatic labeling).
Wu et al. [73] proposed a semi-automated labeling method based on active learning and label informativeness. Specifically, their SLMAL algorithm selects the most informative example-label pairs for annotation by combining the uncertainty of examples and the informativeness of labels. During this process, the algorithm first identifies and prioritizes the example-label pairs most in need of labeling and subsequently employs the nearest neighbors of these highly uncertain pairs to predict their partial labels.
However, semi-automatic labeling comes with several limitations:
• Human Involvement: Human intervention is still required in the semi-automatic labeling process, especially in complex decision-making. This can increase overall labeling time and costs, particularly in situations requiring domain expertise.
• Scalability: Semi-automatic labeling methods can face challenges when dealing with large-scale datasets, primarily regarding processing speed and resource utilization.
• Sensitivity to Labeling Quality: The performance of the model largely depends on the quality of the initial labeled data used for training. Low-quality or biased labeled data may degrade model performance.
Despite the availability of semi-supervised learning, to label tests more accurately and with higher quality, manual labeling remains the mainstream in industry [71].
Automated labeling methods offer a potential solution to the aforementioned limitations. Nevertheless, due to the constraints outlined below, few automated labeling methods are specifically designed for classical machine learning. To the best of our knowledge, the known method applicable to labeling for classical machine learning models is Programmatic Labeling [74]. Programmatic labeling automates the labeling process through scripts and programming algorithms, significantly improving the efficiency of data preparation. However, Programmatic Labeling typically requires specialized programming skills to create labeling rules, which may pose a barrier for researchers without a technical background.
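Programmatic labeling expresses labeling rules as code. The sketch below shows a hypothetical rule for an income dataset; the attribute names and thresholds are invented for illustration, not taken from [74].

```python
# Hypothetical labeling rule: assign ">50K" based on two attributes.
def label_income(row):
    if row["education_years"] >= 16 and row["hours_per_week"] >= 45:
        return ">50K"
    return "<=50K"

print(label_income({"education_years": 18, "hours_per_week": 50}))  # >50K
print(label_income({"education_years": 12, "hours_per_week": 50}))  # <=50K
```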
In the following, we outline the challenges that make it difficult to develop automated labeling methods specifically designed for classical machine learning, resulting in the current scarcity of such methods.

A. Overview
In this paper, we propose MLPrior, a test prioritization approach specifically designed for classical ML models. Fig. 1 illustrates the workflow of MLPrior. Given a test set T and an ML model M, MLPrior produces a sorted test set T′, where test cases that are more likely to be mispredicted by the model are placed at the front. We outline the steps of MLPrior as follows.
❶ Attribute feature generation: In the initial stage, MLPrior converts the attribute values of each test t ∈ T into a feature vector, denoted as V_t^D. This involves transforming non-numeric attributes into a numeric format. To accomplish this, we create a mapping dictionary that pairs each non-numeric attribute value with a corresponding numeric value. For instance, for the attribute "gender," the values "male" and "female" are mapped to 0 and 1, respectively.
❷ Mutation feature generation (model): Based on the model mutation rules described in Section III-B, MLPrior generates a set of mutated models from the original ML model M. For each test t ∈ T, MLPrior identifies whether t "kills" each mutated model (i.e., whether the mutated model and the original model make different predictions for t). This allows MLPrior to construct a model mutation feature vector, denoted as V_t^M, in which each element corresponds to a specific mutated model. More specifically, MLPrior sets the i-th element of t's model mutation vector to 1 if t kills the i-th mutated model; otherwise, the element is set to 0.
❸ Mutation feature generation (inputs): Based on the input mutation rules outlined in Section III-B, MLPrior generates mutated inputs for each test instance t ∈ T. By comparing the predictions of model M on the i-th mutated input with its prediction on the original test input t, MLPrior constructs an input mutation vector, denoted as V_t^I. If the prediction of model M for the i-th mutated input differs from that for the original test input t, the i-th element of V_t^I is set to 1; otherwise, it is set to 0.
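The kill-vector construction in steps ❷ and ❸ can be sketched as follows. ThresholdModel is a toy stand-in for a trained binary classifier, so the concrete models and numbers are purely illustrative.

```python
class ThresholdModel:
    """Toy binary classifier: predicts 1 iff the first feature exceeds t."""
    def __init__(self, t):
        self.t = t
    def predict(self, x):
        return int(x[0] > self.t)

original = ThresholdModel(0.5)
mutants = [ThresholdModel(0.4), ThresholdModel(0.6), ThresholdModel(0.45)]

def kill_vector(original, mutants, x):
    """Element i is 1 iff mutant i's prediction for x differs from the original's."""
    base = original.predict(x)
    return [int(m.predict(x) != base) for m in mutants]

# A test near the decision boundary kills more mutants than one far from it.
print(kill_vector(original, mutants, [0.48]))  # [1, 0, 1]
print(kill_vector(original, mutants, [0.95]))  # [0, 0, 0]
```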
❹ Feature Concatenation: For each test t ∈ T, MLPrior concatenates the three types of feature vectors constructed in the previous steps (i.e., V_t^D, V_t^M, and V_t^I) to obtain a final feature vector, denoted as V_t.
❺ Learning-to-Rank: For each test instance t ∈ T, MLPrior feeds its final feature vector V_t into the pre-trained XGBoost ranking model [8], which produces the probability of this input being misclassified. Finally, MLPrior ranks all the tests in T based on their probability scores in descending order, thereby prioritizing the possibly misclassified tests.
In MLPrior, the concept of feature is crucial. To demonstrate the processes of feature extraction, combination, and concatenation more intuitively, we provide a typical example. In this example, we delve into the specifics of how MLPrior generates features for a given test t, illustrating each step of the process in detail. Furthermore, we visually illustrate this example in Fig. 2 to enhance the presentation of MLPrior's feature generation process.
• Feeding Attributes of t to MLPrior: Given a classical ML model M and its corresponding test set T, let t be a test instance from T. Since the dataset for a classical ML model is in tabular format, we denote the attribute values of t as t = (s_1, s_2, ..., s_n), where each s_i may be numeric or non-numeric (such as a string). In this step, we input the attributes of the test t into MLPrior.
• Generation of Original Attribute Features: We input the attribute vector of test t, namely (s_1, s_2, ..., s_n), into MLPrior. MLPrior then converts all non-numeric attributes into numeric format to construct the original attribute vector of t, represented as (i_1, i_2, ..., i_n).

• Generation of Input Mutation Features: Subsequently, MLPrior generates N mutated inputs of the test t, denoted as (t_1, t_2, ..., t_N). MLPrior then feeds these mutated inputs into the original ML model to make predictions. If the model's output for t_i differs from its output for the original sample t, the i-th element of the input mutation feature vector is set to 1; otherwise, it is set to 0. In this manner, we obtain the input mutation feature vector for t, e.g., (0, 1, ..., 0). This example vector indicates that for the first mutant of t, denoted as t_1, the model's prediction is the same as for t, whereas for the second mutant, t_2, the model's prediction differs from that for t.
• Generation of Model Mutation Features: For the original model M, MLPrior generates K mutated models, denoted as (m_1, m_2, ..., m_K), and inputs the original sample t into these mutated models for prediction. If the prediction of the i-th mutated model for t differs from that of the original model M, the i-th element of the model mutation feature vector is set to 1; otherwise, it is set to 0. Through this method, we obtain the model mutation feature vector for t, e.g., (1, 0, ..., 1). This example vector indicates that the first mutated model of M, denoted as m_1, predicts differently for t compared to the original model M, whereas the second mutated model, m_2, makes the same prediction for t as M.
• Feature Combination: MLPrior concatenates the three types of features obtained from the previous steps (i.e., Original Attribute Features, Input Mutation Features, and Model Mutation Features) to form the final feature vector for t: (Original Attribute Features, Input Mutation Features, Model Mutation Features) = (i_1, i_2, ..., i_n, 0, 1, ..., 0, 1, 0, ..., 1).
The primary purpose of this step is to encapsulate the attribute information of each test instance t ∈ T into a feature vector, which will then be utilized as input to the ranking models for test prioritization. Since the ranking models require numeric inputs, MLPrior converts all non-numeric attribute values of t into a numeric format. To this end, we construct a mapping dictionary that specifies the numeric value corresponding to each non-numeric attribute value. For instance, for the attribute "gender," the value "male" is transformed into 0, while "female" is transformed into 1. The motivation behind extracting these original features is explained as follows.
Prior research [24] pointed out that test inputs situated closer to the decision boundary of a model are more likely to be misclassified. To effectively capture the spatial relationship between a test input and the decision boundary, and to preserve the carefully selected, low-dimensional features of the classical ML test set, we directly generate the feature vector of each input from its original attribute values.
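The attribute-to-numeric conversion via a mapping dictionary can be sketched as below; the "gender" mapping follows the paper's example, while the "workclass" attribute and its mapped values are hypothetical.

```python
# Mapping dictionary for non-numeric attribute values; "gender" follows the
# paper's example, "workclass" values are hypothetical.
mapping = {"gender": {"male": 0, "female": 1},
           "workclass": {"private": 0, "government": 1, "self-employed": 2}}

def encode(test, schema):
    """Convert a test's attribute values into a numeric feature vector."""
    return [mapping[name][value] if name in mapping else value
            for name, value in zip(schema, test)]

schema = ["age", "gender", "workclass"]
print(encode([39, "male", "government"], schema))  # [39, 0, 1]
```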

B. Mutation Rule Specification
In this stage, we propose two types of mutation rules designed specifically for classical ML models and their corresponding datasets. The principle underlying our utilization of mutation testing in test prioritization is: if a test input exhibits high sensitivity to the injected mutations, this input is more likely to detect faults in the system. This principle is derived from previous research in traditional mutation testing [23], [61], [62]. We extend it to ML systems by correspondingly designing model mutation rules and input mutation rules. The key insights of MLPrior are: 1) if an input can kill many mutated models (i.e., the predictions made for the input by the mutated models and the original model differ), indicating that this input is sensitive to model mutations, MLPrior considers this input more likely to be misclassified; 2) if the prediction result for a given test input differs from that of many of its mutated inputs, indicating that the predictions for the input are sensitive to the mutations, MLPrior considers this input more likely to be misclassified. In the following sections, we provide a detailed explanation of our mutation approaches.

1) Model Mutation Rules:
The model mutation rules are designed to make slight changes to the architecture parameters or weight parameters of pre-trained ML models to generate mutated models. We ensure that the new parameter values are close to their original values in order to achieve slight mutations. Notably, this process does not involve any retraining. Therefore, the total execution time of generating model mutants is short, with an average duration of 3 seconds, as shown in Table IV.
In our study, we evaluated the effectiveness of MLPrior using five classical ML models, namely Decision Tree [9], K-Nearest Neighbors (KNN) [75], Logistic Regression (LR) [10], XGBoost [8], and Gaussian Naive Bayes (GaussianNB) [8]. The rationale behind selecting these models is twofold: 1) they have gained widespread adoption in various industries due to their interpretability and proven performance [32], [76]; 2) they have been extensively utilized in recent ML testing studies [26]. It is important to note that MLPrior's applicability extends beyond the evaluated models. By making simple adjustments to the model mutation rules (i.e., enabling them to target the architecture parameters or weight parameters of the evaluated model), MLPrior can be adapted to a diverse range of interpretable ML models. We elaborate on the specific details of conducting model mutation as follows.
❶ Decision Tree [9] A decision tree is a machine learning method that classifies data step-by-step based on features. During prediction, attribute values are used to make decisions at the internal nodes of the tree, determining which branch to enter based on each decision outcome until a leaf node is reached, which yields the classification result.

Input to Decision Tree:
The input to a Decision Tree consists of a dataset containing instances with associated features. The Decision Tree algorithm utilizes these input features to create a hierarchical structure that facilitates effective classification.

Process of Classification:
A decision tree operates by sequentially making decisions at each split node of the tree. For a given input, it begins at the root node and evaluates the features of the input to determine the appropriate branch to follow at each split node. This process iterates until a leaf node is reached, signifying a classification outcome.
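The traversal process can be sketched with a toy hand-built tree; the features, thresholds, and labels are invented for illustration.

```python
# Each internal node tests one feature against a threshold; traversal
# descends left or right until a leaf node carrying the class label.
tree = {"feature": 0, "threshold": 30,
        "left": {"label": "low"},                 # taken when feature 0 <= 30
        "right": {"feature": 1, "threshold": 5,   # taken when feature 0 > 30
                  "left": {"label": "low"},
                  "right": {"label": "high"}}}

def classify(node, x):
    while "label" not in node:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node["label"]

print(classify(tree, [25, 9]))  # low  (25 <= 30)
print(classify(tree, [40, 9]))  # high (40 > 30, then 9 > 5)
```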

Mutating Decision Tree:
To induce mutation in the Decision Tree model, we randomly select a set of split nodes and introduce random deviations to their threshold values, thereby influencing the predictive outcomes of the Decision Tree model. We explain below why changing the thresholds can alter the predictive results of a decision tree. Consider a situation where a given test sample t passes through the nodes of the original tree: based on the decisions made at the split nodes, it arrives at leaf node A and is thus classified into category A. After slight adjustments to the thresholds of a group of decision nodes, when sample t traverses the mutated tree, the modified decision thresholds at the split nodes can lead it to reach leaf node B instead.
❷ K-Nearest Neighbors (KNN) [75] […]
❸ Logistic Regression [10] Logistic regression applies the Sigmoid function to map prediction scores onto the [0, 1] interval, representing the probability of belonging to class 1. This enables the classification of input samples. Weight Coefficient: In the Sigmoid function of logistic regression, the weight coefficients determine the impact of different features on the predicted output. Each feature is assigned a corresponding weight coefficient. For example, in Formula 1, the weight coefficient for the feature x_0 is w_0.

Mutating Logistic Regression:
To introduce mutation to the Logistic Regression model, we randomly select a feature from the Sigmoid function and modify its weight coefficient, thus affecting the model's predictions. For example, consider Formula 1, which represents a trained Logistic Regression model taking four input features: x_0, x_1, x_2, and x_3. In this formula, w_0, w_1, w_2, and w_3 denote the weight coefficients of the features, and f(x) represents the prediction score. We mutate the model by randomly selecting one of the four weight coefficients and making a slight adjustment to it. This mutation directly influences the output value of f(x), consequently impacting the classification results of the model.

f(x) = 1 / (1 + e^(-(w_0 x_0 + w_1 x_1 + w_2 x_2 + w_3 x_3)))    (1)

❹ XGBoost [8] XGBoost is a widely used gradient boosting algorithm designed for enhanced predictive modeling. It is a variant of the boosting algorithm [77], which aims to integrate multiple weak classifiers into a robust classifier. As a boosting tree model, XGBoost aggregates multiple tree models to form a powerful classifier.
In binary classification tasks, XGBoost outputs 0 or 1 by default, representing the two classes. Internally, XGBoost calculates a probability value p and compares it to a threshold (with a default value of 0.5) to determine the final class output: values exceeding 0.5 yield an output of 1, whereas values below 0.5 yield an output of 0.
Mutating XGBoost: To mutate XGBoost, we apply a random slight offset to the internal threshold of the XGBoost model, thereby generating model mutants. For instance, consider the original XGBoost threshold of 0.5; upon introducing a minor offset, the threshold becomes 0.4 for the mutated XGBoost model. Under this mutation, the following scenarios arise: 1) given a test input t1 with a predicted p value of 0.45, the original XGBoost predicts an outcome of 0 (p < 0.5), whereas the mutated XGBoost predicts an outcome of 1 (p > 0.4); 2) given another test input t2 with a p value of 0.3, both the original and the mutated XGBoost models predict an outcome of 0 (p < 0.4; p < 0.5). It can be observed that t1 is more sensitive to the injected mutation than t2, and we consider that t1 is more likely to be misclassified by the model. This mutation rule can be reasonably interpreted from an uncertainty perspective: when a slight adjustment in the model's classification threshold can alter the test's classification result, the model's prediction probability for that test must be close to 0.5. According to prior work [16], the closer a prediction probability is to 0.5, the greater the model's uncertainty regarding that test, making it more prone to misclassification.
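The threshold-offset mutation and the resulting kill decisions for the two inputs can be reproduced directly; note that p = 0.3 falls below both thresholds, so t2 receives class 0 from both models.

```python
def classify(p, threshold):
    """Binary decision from a probability p for class 1."""
    return int(p > threshold)

# Original threshold 0.5, mutated threshold 0.4, as in the example above.
for name, p in [("t1", 0.45), ("t2", 0.3)]:
    original = classify(p, 0.5)
    mutated = classify(p, 0.4)
    status = "killed" if original != mutated else "survived"
    print(name, original, mutated, status)
# t1 0 1 killed    <- sensitive to the mutation, likely misclassified
# t2 0 0 survived
```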
❺ Gaussian Naive Bayes (GaussianNB) [8] Gaussian Naive Bayes (GNB) is a probabilistic machine learning classification technique based on the Gaussian distribution. It assumes that each parameter (a feature) possesses independent predictive power for the output variable. The combination of the predictions from all parameters yields the final prediction.
Mutating GaussianNB: To induce mutations in GaussianNB, we introduce a random slight adjustment to the internal threshold of the GaussianNB model, resulting in the generation of model mutants.
2) Input Mutation Rules:
Prior work [24] introduced a mutation operator, noise perturbation, for mutating inputs in image format, which adds noise to data for mutation. A common type of image noise is occlusion noise [78], achieved by overlaying a black block on part of the image. This black block typically consists of a matrix filled with 0: the method replaces the matrix of pixel values at the original location in the image with this zero-filled matrix (black block). Inspired by this technique, MLPrior's input mutation rule randomly selects a specific feature from the feature vector of t and changes its value to 0. Beforehand, MLPrior converts all attributes of t into a corresponding numeric feature vector. The objective is to alter the attribute value of this particular feature, thus affecting the model's predictions. To gain deeper insight into the impact of input mutation rules on model predictions, we provide explanations using the five classical ML models evaluated in our study as examples. Note that our input mutation rules are applicable to a wide range of datasets for classical ML models.
❶ Decision Tree Given a test input, changing a specific feature value of this input to 0 could alter the decision path that the input takes down the tree. This mutation can cause the input to be categorized differently than it would have been without the mutation.
❷ K-Nearest Neighbors (KNN) For KNN, changing the value of a feature to 0 can alter the distance calculations between this input and other instances. This shift in distances can lead to a different set of k nearest neighbors being considered, thereby potentially affecting the classification result of the input.
❸ Logistic Regression In logistic regression, setting a feature's value to 0 zeroes out the contribution of that feature's weighted term in the linear combination. This changes the output of the logistic function, causing the instance's predicted probability to shift and ultimately affecting the classification outcome.
❹ XGBoost For a given sample, setting one of its features to 0 can change how the features contribute across the ensemble of decision trees. This can lead to different decision paths through the trees during prediction, thereby affecting the final prediction for the sample.
❺ Gaussian Naive Bayes (GaussianNB) For a given sample, setting a feature's value to 0 can change the class-conditional probabilities computed under the Gaussian distribution assumption, which can influence the final classification result.
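A sketch of the zero-out input mutation, assuming scikit-learn is available; the KNN model and the synthetic data are illustrative stand-ins for an actual subject.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)          # synthetic binary labels
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

def input_mutants(x):
    """One mutant per feature: a copy of x with that feature set to 0."""
    mutants = np.tile(x, (len(x), 1))
    np.fill_diagonal(mutants, 0.0)
    return mutants

t = X[0]
base = model.predict(t.reshape(1, -1))[0]
# 1 where a zeroed feature flips the model's prediction for t.
imf = (model.predict(input_mutants(t)) != base).astype(int)
print("input mutation feature vector:", imf)
```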

C. Mutation Feature Generation
For each test t ∈ T, based on the aforementioned mutation rules, we generate mutants and subsequently build mutation feature vectors. The detailed procedures are elaborated below.
• Input Mutation Features (IMF): Based on the input mutation rules presented in Section III-B, MLPrior generates a set of mutated inputs for each test t and feeds them into the original model M. If the prediction of M for the i-th mutated input differs from its prediction for the original test t, the i-th element of the input mutation feature vector will be set to 1; otherwise, the i-th element will be set to 0. An example of the resulting feature vector is (1, 0, ..., 0).

D. Feature Concatenation
Based on the aforementioned steps, for each test sample t ∈ T, MLPrior generates three types of feature vectors: the attribute feature vector, the input mutation vector, and the model mutation vector. Subsequently, for each t ∈ T, MLPrior concatenates these three types of features to obtain the final feature vector, which is then used as input to the ranking model.

E. Learning-to-Rank
After obtaining the feature vector for each t ∈ T, MLPrior trains a ranking model to automatically learn the probability of a test input t being misclassified by the ML model M based on its feature vector. In the following, we describe the process of constructing the ranking model and explain how to utilize it for test prioritization.
Ranking model building: MLPrior leverages the XGBoost ranking algorithm [8], an optimized distributed gradient boosting learning algorithm, to construct the ranking model. Given the classical ML model M with dataset D, we first split D into two partitions, the training set R and the test set T, in a 7:3 ratio [79]. The test set remains untouched for the purpose of evaluating MLPrior. Based on the training set R, our objective is to construct a training set R′ for training the ranking models. To achieve this, we generate the final feature vector for each r ∈ R, following the steps described in Section ?? to Section III-D. These features are used as the training features for the dataset R′. Next, we utilize the original ML model M to classify each instance r ∈ R and compare the model's predictions with the corresponding ground truth of r. By doing so, we can identify whether r is misclassified by the model M. If r is misclassified, we label it as 1; otherwise, we label it as 0. As a result, we obtain the labels for the training set R′. Based on the constructed training set and the corresponding training labels, we can proceed to train the ranking model of MLPrior.
Test prioritization via ranking model: It is essential to emphasize that the XGBoost ranking algorithm, upon completion of its training, is a binary classification algorithm: it assigns a test to one of two categories instead of providing an estimate of the misclassification probability. Therefore, we made specific adjustments to the original XGBoost algorithm. Specifically, we extract the intermediate value from the model's output, which was originally used to determine whether a test instance would be predicted incorrectly. Typically, if the intermediate value surpasses the threshold, the input is classified as "misclassified"; otherwise, it is classified as "not misclassified". Instead of proceeding with the final classification, we directly employ this intermediate value as the misclassification probability score. A high value denotes that a test instance has a high probability of being misclassified. Finally, we sort all the tests in the test set T in descending order based on their misclassification probability scores, resulting in the prioritized test set T′.
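A condensed sketch of this pipeline, assuming scikit-learn is available: GradientBoostingClassifier stands in for the XGBoost ranking model, and the raw attributes stand in for the concatenated feature vectors.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(600, 6))
y = (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int)  # noisy labels

# 7:3 split: R trains the ML model under test, T is the set to prioritize.
X_R, X_T, y_R, y_T = train_test_split(X, y, test_size=0.3, random_state=0)
M = LogisticRegression().fit(X_R, y_R)

# Label each training instance 1 if M misclassifies it, 0 otherwise,
# then fit the ranking model on these misclassification labels.
mis_R = (M.predict(X_R) != y_R).astype(int)
ranker = GradientBoostingClassifier(random_state=0).fit(X_R, mis_R)

# Use the intermediate probability (not the hard 0/1 decision) as the
# misclassification score, and sort the test set in descending order.
scores = ranker.predict_proba(X_T)[:, 1]
prioritized = np.argsort(-scores)
print("first 5 prioritized test indices:", prioritized[:5].tolist())
```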

F. Variants of MLPrior
To explore the influence of different ranking models on the effectiveness of MLPrior, we propose four variants, denoted MLPrior_T, MLPrior_K, MLPrior_L, and MLPrior_N. These variants utilize different ranking models for test prioritization, namely decision tree [80], K-nearest neighbors (KNN) [29], logistic regression [10], and Gaussian Naive Bayes (GaussianNB) [81], respectively. They differ solely in the selection of the ranking model; the remaining workflow is identical.
• MLPrior_T incorporates the decision tree ranking model. The decision tree algorithm partitions the dataset into subsets at split nodes, iteratively branching until reaching leaf nodes that provide the final classification.
• MLPrior_K integrates the KNN algorithm. KNN is a well-established machine learning technique that operates on the fundamental principle of proximity: the classification of a sample is determined by the majority labels of its K nearest neighbors in the feature space.
• MLPrior_L integrates the logistic regression algorithm [10]. Logistic regression employs the logistic function to transform a linear combination of the independent variables into a value between 0 and 1, and this probability value is then used to perform classification.
In MLPrior, we employ XGBoost [8] as the ranking model for test prioritization. In this research question, we investigate the impact of different ranking models on the effectiveness of MLPrior. To this end, we construct four variants employing different ranking models: decision tree [80], K-nearest neighbors (KNN) [29], logistic regression [10], and Gaussian Naive Bayes (GaussianNB) [81]. By evaluating the effectiveness of these variants, we explore the influence of ranking models.

• RQ4: To what extent does each type of feature contribute to the effectiveness of MLPrior?
To construct the feature vector for a given test input, MLPrior generates three types of features: model mutation features, input mutation features, and attribute features.
In this research question, our objective is to investigate the extent to which each type of features contributes to the effectiveness of MLPrior.
• RQ5: How does the selection of MLPrior's main parameters impact its effectiveness?
We investigate the influence of the main parameters in MLPrior. Our objective is to evaluate whether MLPrior can consistently outperform the compared test prioritization approaches when these main parameters fluctuate.

B. Subjects
In our research, we utilized 305 subjects to assess the effectiveness of MLPrior. A subject in this context refers to a combination of a classical ML model and a dataset. The description of these subjects can be found in Table I. Out of the 305 subjects, 25 subjects (5 datasets × 5 ML models) were generated using natural datasets, while 250 subjects were generated using mixed noisy datasets. Additionally, 30 subjects were generated using fairness datasets. Below, we explain the construction method for the mixed noisy datasets and fairness datasets.
• Mixed noisy datasets blend natural data with noisy data, with the natural data accounting for 70% and the noisy data for 30%. We chose 30% because a high noise ratio, such as 90%, would lead to a substantial proportion of noisy test inputs; in that scenario, a significant number of misclassified tests would be chosen by any prioritization method, making it difficult to demonstrate the effectiveness of MLPrior. Therefore, to ensure an effective evaluation of MLPrior and the compared approaches, we chose a reasonable noise generation ratio (i.e., 30%). For each of the five natural datasets, we generated 10 mixed noisy datasets, resulting in a total of 50 (5 × 10) mixed datasets. Each mixed dataset was paired with five classical ML models, leading to 250 subjects (50 datasets × 5 models).
• Fairness datasets are carefully constructed with a specific focus on avoiding biases related to individual attributes, such as gender and age. In our study, we generated each fairness dataset from a natural dataset following the approach of prior work [26]: we randomly selected a subset of instances and modified their gender and age attribute values while keeping their original labels untouched. Employing this approach, we generated 6 fairness datasets. We pair each dataset with five classical ML models, leading to 30 subjects (6 datasets × 5 models).
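The paper does not specify how the noisy 30% is generated; as one plausible sketch, the snippet below flips the labels of a random 30% of instances (label-flip noise is an assumption for illustration, not the paper's stated procedure).

```python
import numpy as np

def make_mixed_noisy(y, noise_ratio=0.3, seed=0):
    """Return labels where a noise_ratio fraction is randomly flipped."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    idx = rng.choice(len(y), size=int(noise_ratio * len(y)), replace=False)
    y_noisy[idx] = 1 - y_noisy[idx]    # binary labels: flip 0 <-> 1
    return y_noisy

y = np.zeros(100, dtype=int)           # toy natural labels
y_mixed = make_mixed_noisy(y)
print("noisy fraction:", float((y_mixed != y).mean()))  # 0.3
```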
• Adult [82], [90], [91]: The Adult dataset is designed to predict whether an individual's annual income exceeds 50K based on various demographic and financial attributes. It consists of 48,842 instances, each representing a single individual. The instances are divided into two classes: >50K and <=50K. Each individual is described by 14 attributes, such as age, occupation, education level, and workclass.
• Bank [83], [90]: The Bank dataset is used to forecast whether a client will subscribe to a term deposit, based on their demographic, financial, and social information. It consists of 49,732 instances, classified into two classes: subscribing to the term deposit or not subscribing. Each instance encompasses 16 attributes, such as age, education, loan, and balance.
• Stroke [84]: The Stroke dataset is employed for predicting the occurrence of a stroke in patients. It comprises 40,907 instances, classified into two classes: having a stroke or not having a stroke. Each instance is described by 10 attributes, such as age, heart disease, hypertension, work type, residence type, and smoking status.
• Diabetes [85]: The diabetes dataset is utilized for predicting diabetes occurrence in patients.It comprises 253,680 survey responses related to diabetes.This dataset is categorized into three classes: 0 for no diabetes or diabetes only during pregnancy, 1 for prediabetes, and 2 for diabetes.
• Heartbeat [86]: The Heartbeat dataset is used for classifying heartbeat signals.
In our experiments, we used 30,000 heartbeat signal sequence data. Each sample in the dataset has a consistent sampling frequency and equal length in its signal sequence. The Heartbeat dataset is divided into 4 classes, which correspond to heartbeat signal types (0, 1, 2, 3).
2) Classical ML Models: We evaluate the effectiveness of MLPrior using five well-established classical ML models: Decision Tree [9], K-Nearest Neighbors (KNN) [75], Logistic Regression (LR) [10], XGBoost [8], and Gaussian Naive Bayes (GaussianNB) [95]. These models were chosen for two primary reasons. First, their widespread adoption in various industries owing to their interpretability and demonstrated performance [12], [32], [76], [92].
In the industry, the five classical ML models we evaluated are broadly implemented, and their accuracy is crucial, as their prediction errors could have serious consequences. Therefore, thorough testing and test prioritization of these classical ML models are essential.
• Hospitality industry [76] The logistic regression model can utilize financial data to predict whether a hotel business is at risk of bankruptcy. Investors in the hotel industry will rely on these models to make crucial financial and operational decisions. If the predictions are inaccurate, investors can make erroneous investment decisions, such as investing in businesses that are at risk of bankruptcy.
• Service industry [32] The decision tree model can be employed to analyze the impact of information and communication technology (ICT) on service industry performance using global service industry data from the World Bank. Service industry companies will depend on such analyses to formulate strategies, such as investing in ICTs. Incorrect predictions could result in misallocation of resources, affecting the company's long-term performance and competitiveness.
• Financial industry [93], [94] The XGBoost algorithm can be utilized for personal credit risk assessment. Rao et al. [93] employed XGBoost to predict an individual's credit risk for determining loan approval decisions. Moreover, KNN can be used for credit scoring (i.e., assessing the credit risk of loan applications) [94].
• Healthcare industry [95] The Gaussian Naive Bayes model can be leveraged for diagnosing cancer based on the patient's medical information [95].
To better illustrate the utility of MLPrior, we provide a specific example. In the above scenario where XGBoost is used for personal credit risk assessment, MLPrior can be utilized to identify misjudged loan approvals (cases where the XGBoost model incorrectly classifies applicants who should not receive loans as qualified borrowers, thus approving their loan applications). This enables financial institutions to detect and focus on potential high-risk cases earlier, thereby not only reducing losses but also enhancing their overall efficiency in risk management.
Second, their extensive use in recent ML testing studies [26], [96], [97], [98], [99]. Importantly, it should be noted that MLPrior's applicability is not limited to the evaluated models. With minor adjustments to the model mutation rules (i.e., making them target the architecture parameters or weight parameters of the assessed ML model), MLPrior can be adapted to various interpretable ML models.
• XGBoost [8] XGBoost, an ensemble method that belongs to the family of boosting algorithms, functions by integrating the forecasts of multiple Classification and Regression Trees (CART) [100] to create a robust classification mechanism. This algorithm amalgamates weak learners to engineer a powerful model with superior predictive capacity.
• Gaussian Naive Bayes (GaussianNB) [95] Gaussian Naive Bayes, a probabilistic classifier based on Bayes' theorem with an assumption of independence among predictors, is known for its efficacy in multiclass classification problems and its robustness against irrelevant features.
• Logistic Regression (LR) [10] Logistic Regression is a widely-adopted statistical model employed in binary classification tasks. This model is founded on the principles of probability and the logistic function, offering an interpretable mathematical framework.
• Decision Tree [9] A decision tree constructs a tree-like structure, where internal nodes represent decision points based on feature values, and leaves represent the predicted outcomes.
• K-Nearest Neighbors (KNN) [75] KNN is a widely-adopted classification algorithm that assigns labels to instances based on the majority vote of their K neighboring data points. The KNN algorithm is known for its simplicity and flexibility in handling classification tasks.
• DeepGini [16] DeepGini operates by assessing the model's uncertainty in its predictions for tests. The fundamental premise of DeepGini is that tests for which the model exhibits greater uncertainty in its predictions are deemed to have a higher likelihood of being incorrectly predicted. Consequently, these tests are prioritized higher. The mechanism for calculating this uncertainty is encapsulated in Formula 2, in which ξ(t) denotes the model's uncertainty regarding its prediction for a particular test t. The higher the value of ξ(t), the greater the uncertainty associated with the model's prediction for t, and the higher t is prioritized. By prioritizing tests with higher values of ξ(t), DeepGini can identify and prioritize test inputs that are potentially misclassified.
ξ(t) = 1 − Σ_{i=1}^{N} p_{t,i}²   (2)

where N is the number of classes, and p_{t,i} denotes the probability of the model predicting that t belongs to class i.
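A minimal sketch of this DeepGini-style scoring and ranking, assuming the model's prediction probabilities are available as a matrix (the function names are ours):

```python
import numpy as np

def deepgini(probs):
    """DeepGini score xi(t) = 1 - sum_i p_{t,i}^2 for each row of a
    (n_tests, n_classes) probability matrix; higher = more uncertain."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2, axis=1)

def prioritize(probs):
    # Test indices ordered from most to least uncertain.
    return np.argsort(-deepgini(probs))

probs = [[0.9, 0.1], [0.5, 0.5], [0.7, 0.3]]
order = prioritize(probs)  # the maximally uncertain (0.5, 0.5) test comes first
```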
• VanillaSM [20] The VanillaSM algorithm ranks all the tests by computing, for each test, the difference between 1 and the highest activation probability within the output softmax layer. The calculation is defined by Formula 3:

V(t) = 1 − max_{i=1}^{N} l_i(t)   (3)

where N is the number of classes and max_{i=1}^{N} l_i(t) represents the model's prediction probability for the most confident classification of test t among all N classes. A higher value of V(t) indicates that the test is more likely to be misclassified by the model.
• Prediction-Confidence Score (PCS) [20] PCS prioritizes test inputs by calculating the difference between the probabilities of the model's most confident class and the second most confident class for each test. The formula is given as Formula 4:

PCS(t) = p_1(t) − p_2(t)   (4)

where p_1(t) is the predicted probability of the model for the most confident class of test t, and p_2(t) is the predicted probability for the second most confident class of test t. A smaller PCS(t) indicates that a test is more likely to be mispredicted by the model.
• Entropy [20] Entropy ranks all tests by calculating the entropy value of the model's predicted probability vector for each test. A higher entropy value for a test indicates that it is more likely to be mispredicted by the model.
• Random selection [101] In random selection, the order of test input execution is determined randomly.
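The remaining confidence-based baselines can be sketched in a few lines; these follow the textual definitions above, with illustrative function names:

```python
import numpy as np

def vanilla_sm(probs):
    # VanillaSM: 1 minus the highest softmax probability (higher = riskier).
    return 1.0 - np.max(probs, axis=1)

def pcs(probs):
    # PCS: top-1 minus top-2 probability (smaller = riskier).
    top2 = np.sort(probs, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def entropy(probs):
    # Entropy of the predicted probability vector (higher = riskier).
    p = np.clip(probs, 1e-12, 1.0)
    return -np.sum(p * np.log(p), axis=1)

probs = np.array([[0.9, 0.1], [0.6, 0.4]])
# vanilla_sm(probs) -> [0.1, 0.4]; pcs(probs) -> [0.8, 0.2]
```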

D. Measurements
Following the existing work [16], we employed two metrics to evaluate the effectiveness of MLPrior, the compared approaches, and the variants of MLPrior: Average Percentage of Fault Detection (APFD) [57] and Percentage of Faults Detected (PFD) [16].
• Average Percentage of Fault Detection (APFD) APFD is a well-established metric for evaluating the effectiveness of test prioritization. A higher APFD value indicates greater effectiveness. The APFD values are computed using Formula 5:

APFD = 1 − (Σ_{i=1}^{k} o_i) / (k × n) + 1 / (2n)   (5)

where n denotes the number of test inputs in the test set, and k represents the number of misclassified inputs. o_i is the index of the i-th misclassified test within the prioritized test set. Below, we explain from a formula perspective why larger APFD values indicate higher test prioritization effectiveness. In the formula, since n is a constant, a larger APFD value means that the value of Σ_{i=1}^{k} o_i (i.e., the total index sum of misclassified tests within the prioritized list) is smaller. A smaller Σ_{i=1}^{k} o_i implies that the misclassified tests are positioned toward the front of the prioritized test set. This indicates that the misclassified tests are indeed prioritized at the beginning of the test set by the test prioritization approach, demonstrating its high effectiveness. Following prior work [16], we normalize the APFD values to [0, 1]. A prioritization approach is considered better when its APFD value is closer to 1.
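The APFD computation described above can be sketched as follows (function and variable names are illustrative):

```python
def apfd(order, is_misclassified):
    """APFD = 1 - (sum of 1-based fault positions)/(k*n) + 1/(2n).

    order: test indices in prioritized order.
    is_misclassified: per-index fault flag (True = misclassified).
    """
    n = len(order)
    positions = [rank + 1 for rank, t in enumerate(order) if is_misclassified[t]]
    k = len(positions)
    return 1.0 - sum(positions) / (k * n) + 1.0 / (2 * n)

# Perfect prioritization puts both faulty tests (indices 2 and 3) first:
order = [2, 3, 0, 1]
faults = [False, False, True, True]
score = apfd(order, faults)  # 1 - (1+2)/(2*4) + 1/8 = 0.75
```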
• Percentage of Faults Detected (PFD) PFD quantifies the ratio of detected misclassified test inputs to the total number of misclassified tests. A higher PFD value suggests that a test prioritization approach is more effective. The calculation of PFD follows Formula 6:

PFD = #F_d / #F   (6)

where #F_d is the number of detected misclassified test inputs and #F is the total number of misclassified test inputs.
In our study, we measured the PFD values of MLPrior and the compared test prioritization approaches using varying ratios of prioritized tests.
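A corresponding sketch of PFD at a given prioritization ratio (names are illustrative):

```python
def pfd(order, is_misclassified, ratio):
    """Fraction of all misclassified tests found after labeling the
    first `ratio` portion of the prioritized test set."""
    n_checked = int(len(order) * ratio)
    found = sum(1 for t in order[:n_checked] if is_misclassified[t])
    return found / sum(is_misclassified)

order = [2, 3, 0, 1]
faults = [False, False, True, True]
half = pfd(order, faults, 0.5)  # labeling half of this prioritized set finds both faults: 1.0
```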

E. Implementation and Configuration
In terms of the compared approaches, we employed the available implementations provided by their respective authors [16], [20]. Concerning the XGBoost ranking model, we utilized XGBoost version 1.4.2 [8]. For the ranking models Decision Tree, KNN, Logistic Regression, and GaussianNB, we utilized the packages provided by scikit-learn 0.24.2 [102]. Regarding the parameters of the ranking models, we set the n_estimators parameter of XGBoost to 100. We set the max_iter parameter of Logistic Regression to 100. For the Decision Tree ranking algorithm, we set the min_samples_split parameter to 2. The var_smoothing parameter of GaussianNB was set to 1e-9. Additionally, we set the n_neighbors parameter of KNN to 5.
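In scikit-learn and XGBoost, the configuration above corresponds roughly to the following sketch (all listed values are the libraries' own defaults; the dictionary layout is ours):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# Ranking-model configurations listed in the text; each parameter is
# spelled out explicitly even though it matches the library default.
ranking_models = {
    "xgboost": XGBClassifier(n_estimators=100),
    "logistic_regression": LogisticRegression(max_iter=100),
    "decision_tree": DecisionTreeClassifier(min_samples_split=2),
    "gaussian_nb": GaussianNB(var_smoothing=1e-9),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
```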
Furthermore, concerning model mutation, we generated 100 mutant models for each original classical ML model. For dataset mutation, we generated 20 mutant datasets for each natural dataset; in other words, MLPrior generates 20 mutated inputs for each test. Moreover, we conducted a statistical analysis to mitigate the impact of randomness: for each subject (i.e., a dataset with a model), we repeated the experiments 5 times and reported the average results. We conducted the experiments on a high-performance cluster, where each node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16G SXM2 GPU. In terms of data processing, we conducted the corresponding experiments on a MacBook Pro laptop with Mac OS Big Sur 11.6, an Intel Core i9 CPU, and 64 GB RAM.

A. RQ1: Effectiveness and Efficiency of MLPrior
Objectives: We evaluate the effectiveness and efficiency of MLPrior in prioritizing test inputs for classical ML models.
Experimental design: We conducted experiments to evaluate the performance of MLPrior from three perspectives:
• Effectiveness To assess the effectiveness of MLPrior, we carefully designed 15 subjects consisting of three prevalent datasets, each paired with five classical ML models. Detailed information regarding the subjects can be found in Table I. Moreover, we compared MLPrior against a range of DNN prioritization approaches, namely DeepGini [16], Vanilla Softmax [20], Prediction-Confidence Score (PCS) [20], Entropy [20], and Random Selection. To measure the effectiveness, we used the APFD metric [57] and the PFD metric [16], which are widely-adopted measures for evaluating test prioritization techniques.
• Efficiency We evaluated the efficiency of MLPrior by quantifying the time required for each step of MLPrior, as well as the time cost of each compared approach.
• Statistical analysis Considering the randomness associated with the training process of the ML models and the MLPrior approach, we conducted a statistical analysis to ensure the stability of our results. More specifically, we replicated all the experiments a total of five times, calculating average results to report in this section. Furthermore, we calculated the p-values to evaluate the statistical significance of our findings.
Results: The experimental results pertaining to RQ1 are presented in Table II, Table III, Fig. 3, Table IV, and Table V. We highlight the approach with the highest effectiveness in grey to facilitate quick and easy interpretation of the results.
When applied to natural inputs, MLPrior outperforms all the compared methods in terms of APFD across all subjects, with an average improvement of 14.74%∼66.93% over the compared approaches. Table II exhibits the effectiveness of MLPrior in comparison to the compared test prioritization approaches across different subjects. From the table, we see that MLPrior outperforms all the compared methods across all subjects. Specifically, the APFD values of MLPrior range from 0.787 to 0.990, while those of the compared approaches span from 0.494 to 0.837. Table III demonstrates the effectiveness of MLPrior and the compared test prioritization methods on multiclass classification datasets. We see that in all cases, the effectiveness of MLPrior is higher than that of all the compared methods. Specifically, the APFD range of MLPrior is from 0.639 to 0.915, while the APFD range for the compared methods is from 0.475 to 0.852. These experimental results demonstrate that MLPrior's effectiveness surpasses all compared methods on multiclass datasets.
Table V shows the comparison of effectiveness between MLPrior and other test prioritization methods on all subjects in both binary and multi-class datasets. The evaluation metrics include the number of cases where each method performs the best (denoted as #Best cases), the average APFD value of each test prioritization approach (denoted as Average APFD), and the improvement of MLPrior relative to each comparison method (denoted as Improvement(%)). From Table V, we can see that MLPrior performs the best across all cases, whether in binary or multi-class datasets. In binary datasets, the average APFD of MLPrior is 0.854. In multi-class datasets, it is 0.812, and across all subjects (including both binary and multi-class), it is 0.833. The average APFD of the comparison methods across all subjects ranges from 0.499 to 0.726.
Moreover, under all subjects, the average improvement of MLPrior relative to all the compared test prioritization methods ranges from 14.74% to 66.93%. More specifically, in binary datasets, the improvement range of MLPrior relative to all comparison methods is from 18.78% to 70.46%. In multi-class datasets, the improvement range is from 10.93% to 63.05%. These experimental results demonstrate that MLPrior's effectiveness surpasses all other test prioritization methods on natural test inputs.
Fig. 3 provides a visual comparison between MLPrior and other test prioritization approaches in terms of PFD on the Bank dataset with the GaussianNB model. In this figure, the effectiveness of MLPrior is represented by the red curve, while the blue curve represents the effectiveness of the confidence-based test prioritization methods. Additionally, the black curve depicts the baseline effectiveness. It is noteworthy that all confidence-based approaches are consolidated into a single line due to their identical effectiveness across all cases, as evidenced in Table II.
The reason why all confidence-based methods yield the same experimental results on binary classification ML models is as follows: given a binary classification model, suppose the probability of a test t belonging to category 1 is p; then the probability of it belonging to the other category is 1 − p. Regardless of the confidence-based method used, tests with p values close to 0.5 are deemed more uncertain [16] and thus are prioritized to the front. Therefore, the experimental results of all these test prioritization methods are the same. We explain this in detail below.
Feng et al. [16] demonstrated that in a binary classification model, if the model's prediction probability for a test is (0.5, 0.5), the model is most uncertain about this test, indicating this test is more likely to be misclassified. The closer a test's p value is to 0.5, the more uncertain the model is about that particular test. Consequently, uncertainty is solely determined by p. Regardless of the specific confidence-based test prioritization method employed, tests with p values closer to 0.5 will be prioritized over others.
To illustrate this point, consider a test set with three tests, where the model's probability vectors for these tests are as follows: t1 (0.9, 0.1), t2 (0.7, 0.3), t3 (0.8, 0.2). Irrespective of the chosen confidence-based test prioritization method, the resulting ranking will be t2 → t3 → t1, because t2 has the p value (0.7) closest to 0.5, followed by t3 (p = 0.8), while t1 has the p value farthest from 0.5 (p = 0.9).
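This equivalence can be checked directly. The small script below reproduces the three-test example; the metric implementations follow their standard definitions, and the helper names are ours:

```python
import numpy as np

# Probability vectors of a binary classifier for the three example tests.
probs = np.array([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
p = probs[:, 0]

gini = 1 - np.sum(probs ** 2, axis=1)            # DeepGini (higher = riskier)
vanilla = 1 - probs.max(axis=1)                  # VanillaSM (higher = riskier)
pcs = np.abs(p - (1 - p))                        # PCS (lower = riskier)
ent = -np.sum(probs * np.log(probs), axis=1)     # Entropy (higher = riskier)

# Each score is a monotone function of |p - 0.5|, so all four metrics
# induce the same priority order: t2 -> t3 -> t1.
rank = lambda s: tuple(np.argsort(s))
assert rank(-gini) == rank(-vanilla) == rank(pcs) == rank(-ent) == (1, 2, 0)
```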
From Fig. 3, we see that MLPrior consistently outperforms all the compared methods across different prioritization ratios. These experimental results strongly suggest that MLPrior exhibits higher effectiveness than other test prioritization approaches in classical ML test prioritization. As stated in the experimental design, due to the inherent randomness associated with the training process, we conducted a statistical analysis. This analysis involved repeating all experiments a total of five times. The p-value of the experimental results was found to be significantly less than 10^−6, which suggests that MLPrior can stably outperform the compared test prioritization approaches.
MLPrior showcases acceptable efficiency, with an average execution time of less than 20 seconds. In addition to evaluating its effectiveness, we also compared the efficiency of MLPrior with other test prioritization approaches, and the experimental results are presented in Table IV. The findings indicate that the average total running time of MLPrior on each subject is under 20 seconds, which can be broken down into three main components: feature generation (3 seconds), ranking model training (15 seconds), and prediction (55.133 ms). Here, 'ms' refers to milliseconds. The prediction times for the confidence-based test prioritization methods are as follows: DeepGini: 1.323 ms; VanillaSM: 1.020 ms; PCS: 1.355 ms; Entropy: 114.483 ms. While confidence-based test prioritization techniques exhibit higher efficiency with a running time of less than 1 second, the computational cost of MLPrior remains reasonable in practical scenarios, especially considering the laborious and costly nature of manual labeling. Despite being slightly less efficient than confidence-based methods, the considerable improvement in effectiveness demonstrated by MLPrior, ranging from 18.78% to 70.46% compared to those techniques, underscores its overall performance.

Answer to RQ1: When applied to natural inputs, MLPrior outperforms all the compared methods in terms of APFD across all subjects, with an average improvement of 14.74%∼66.93% over the compared approaches. Moreover, MLPrior showcases acceptable efficiency, with an average execution time of less than 20 seconds.

B. RQ2: Effectiveness of MLPrior on Different Types of Test Inputs
Objectives: In addition to assessing MLPrior's performance on natural test sets, we also evaluate its effectiveness on different types of test inputs, encompassing mixed noisy data and fairness data. Mixed noisy datasets are composed of 70% natural data and 30% noisy data. Fairness datasets are constructed with the aim of avoiding biases associated with individual attributes, such as gender and age. Ensuring fairness in machine learning is crucial to prevent bias and discrimination against specific groups during predictions. Fairness has emerged as a critical ethical consideration across diverse machine learning domains, such as recruitment, loan approvals, and medical diagnosis [25]. In these domains, the absence of fairness can result in unjust treatment of certain groups, significantly impacting individuals' lives and rights.
Our investigation revolves around two primary sub-questions:
• RQ-2.1 How does MLPrior perform on mixed noisy data?
• RQ-2.2 How does MLPrior perform on fairness data?
Experimental design: We conduct the following experiments to answer the aforementioned sub-questions.
[Experiment ❶] In the first step, we generate noisy data from the three natural datasets used in RQ1 (i.e., Adult, Bank, and Stroke). To this end, we mix 30% noisy data with 70% natural data to create mixed noisy data. The reason we chose a noise generation ratio of 30% is as follows: a high noise ratio, such as 90%, would result in a significant proportion of noisy test inputs, and a substantial number of misclassified tests would be selected by any prioritization method, thereby complicating the demonstration of MLPrior's effectiveness. Therefore, in order to ensure an efficacious evaluation of both MLPrior and the comparative approaches, we opted for a reasonable noise generation ratio (i.e., 30%). For each of the three natural datasets, we generate ten mixed noisy datasets, resulting in 30 (3 × 10) mixed datasets. Each mixed dataset is paired with five classical ML models, leading to a total of 150 subjects (30 datasets × 5 models). Based on these generated subjects, we compare the effectiveness of MLPrior with other test prioritization methods.
[Experiment ❷] To generate fairness data for evaluation, we adopt the approach used in previous research [26]. Specifically, for each natural dataset utilized in RQ1 (i.e., Adult, Bank, and Stroke), we randomly selected a subset of instances from the original test set and modified their gender and age attribute values while keeping the original labels untouched. The reason for keeping the labels untouched is as follows: in the context of ensuring fairness, the model should maintain consistent classification results when protected attributes (such as gender and age) are changed, while all other attributes remain unaltered.
Concretely, for the attribute "gender", we changed half of the "male" instances to "female" and half of the "female" instances to "male". Regarding the attribute "age", following prior work [87], we modified the "middle age" (30∼59) instances in the test set to "young age" (18∼29) while converting the "young age" test instances to "middle age". Using the generated fairness test sets, we compare the effectiveness of MLPrior with other test prioritization methods.
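A sketch of this protected-attribute mutation, assuming a pandas DataFrame with the illustrative column names gender, age, and label:

```python
import numpy as np
import pandas as pd

def make_fairness_set(df, seed=0):
    """Flip protected attributes while keeping labels untouched.

    Half of the male instances become female (and vice versa), and
    middle-aged instances (30-59) swap age groups with young ones
    (18-29). The column names gender/age/label are illustrative.
    """
    rng = np.random.default_rng(seed)
    out = df.copy()

    # Pick the flip sets from the *original* frame so the two swaps
    # do not interfere with each other.
    male_idx = df.index[df["gender"] == "male"]
    female_idx = df.index[df["gender"] == "female"]
    out.loc[rng.choice(male_idx, size=len(male_idx) // 2, replace=False), "gender"] = "female"
    out.loc[rng.choice(female_idx, size=len(female_idx) // 2, replace=False), "gender"] = "male"

    middle = df["age"].between(30, 59)
    young = df["age"].between(18, 29)
    out.loc[middle, "age"] = rng.integers(18, 30, size=middle.sum())
    out.loc[young, "age"] = rng.integers(30, 60, size=young.sum())
    return out

df = pd.DataFrame({"gender": ["male"] * 4 + ["female"] * 4,
                   "age": [25, 35, 45, 28, 55, 22, 40, 33],
                   "label": [0, 1, 0, 1, 0, 1, 0, 1]})
fair = make_fairness_set(df)  # labels in `fair` are identical to those in `df`
```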
Results: The experimental findings pertaining to RQ2.1 are presented in Table VI, Table VII, and Table VIII. We highlight the approach with the highest effectiveness in grey to facilitate easy interpretation of the results. On mixed noisy inputs, MLPrior consistently performs better than all the compared approaches, with an average improvement of 18.55%∼67.73%. From Table VI, we see that MLPrior consistently outperforms all the compared methods in each case. Remarkably, the APFD values achieved by MLPrior range from 0.810 to 0.982, while those of the compared methods range from 0.497 to 0.766. Table VII presents the effectiveness of MLPrior compared to other test prioritization methods on noisy datasets for multiclass classification. We see that MLPrior outperforms all other test prioritization methods across all multiclass classification subjects. The range of APFD values for MLPrior is from 0.639 to 0.916, whereas the range for the compared test prioritization methods is from 0.485 to 0.851. We conclude that on noisy datasets for multiclass classification, the effectiveness of MLPrior surpasses that of the compared test prioritization methods.
Table VIII provides an overall comparison of the effectiveness of MLPrior and other test prioritization methods on binary classification datasets, multiclass classification datasets, and all subjects (both binary and multiclass). The evaluation metrics include the number of cases where each method performs the best (denoted as #Best cases), the average APFD value of each test prioritization approach (denoted as Average APFD), and the improvement of MLPrior relative to each comparison method (denoted as Improvement(%)).
In Table VIII, we observe that MLPrior performs the best across all subjects, regardless of whether they are binary or multiclass. The average APFD of MLPrior on all subjects (including both binary and multiclass) is 0.837. Specifically, the average APFD of MLPrior in binary classification is 0.858, while in multiclass classification, it is 0.815. In contrast, the range of the average APFD for the comparison methods across all subjects is from 0.499 to 0.706. Moreover, across all subjects, the average improvement of MLPrior relative to the comparison test prioritization methods ranges from 18.55% to 67.73%.

Answer to RQ2.1:
On mixed noisy inputs, MLPrior consistently performs better than all the compared approaches, with an average improvement of 18.55%∼67.73%.
The experimental results of RQ2.2 are presented in Table IX, Table X, and Table XI. Table IX displays the effectiveness differences between MLPrior and all the comparative methods on the fairness datasets in terms of APFD. The gray shading indicates the best-performing method for each case.
On fairness data, MLPrior consistently performs better than all the compared approaches, with an average improvement of 15.34%∼62.72%. We see that MLPrior achieves the highest effectiveness across all cases, with an APFD range of 0.813 to 0.897. In contrast, the comparative methods have an APFD range of 0.484 to 0.788.
Table X showcases the effectiveness of MLPrior compared to other test prioritization methods on fairness datasets for multiclass classification. We can see that MLPrior exceeds the performance of all other test prioritization methods in all multiclass classification subjects. The APFD values for MLPrior range from 0.765 to 0.801, while the compared test prioritization methods range between 0.495 and 0.759. The experimental results demonstrate that, in the context of fairness datasets for multiclass classification, MLPrior's effectiveness is superior to that of the other compared test prioritization methods.
Table XI presents a comparative analysis of the effectiveness between MLPrior and other test prioritization methods across all fairness subjects within binary and multi-class datasets. The evaluation metrics encompass the number of instances where each method is most effective (denoted as #Best cases), the average APFD value for each test prioritization approach (denoted as Average APFD), and the relative improvement of MLPrior compared to each method (denoted as Improvement(%)). According to Table XI, MLPrior consistently outperforms other methods in all scenarios, whether in binary or multi-class datasets. Specifically, in binary datasets, MLPrior's average APFD is 0.847. In multi-class datasets, it is 0.776, and the overall average across all subjects (encompassing both binary and multi-class) stands at 0.812. The average APFD for the comparison methods across all subjects varies from 0.499 to 0.704.
Furthermore, across all fairness subjects, the average improvement of MLPrior compared to all other test prioritization methods ranges from 15.34% to 62.72%. More specifically, within binary datasets, MLPrior's improvement over the comparison methods varies from 20.14% to 70.08%. In multi-class datasets, this improvement range is between 10.54% and 61.69%. These experimental results indicate that MLPrior's effectiveness is superior to all other test prioritization methods when dealing with fairness test inputs.

Answer to RQ2.2:
On fairness data, MLPrior consistently performs better than all the compared approaches, with an average improvement of 15.34%∼62.72%.

C. RQ3: Impact of Ranking Models on the Effectiveness of MLPrior
Objectives: We investigate the impact of different ranking models on the effectiveness of MLPrior.
Experimental design: To investigate the impact of different ranking models, we propose four variants of MLPrior, denoted as MLPrior_T, MLPrior_K, MLPrior_L, and MLPrior_N. These variants employ the ranking models decision tree [80], K-nearest neighbors (KNN) [29], logistic regression [10], and Gaussian Naive Bayes (GaussianNB) [81], respectively. The only difference between these variants and the original MLPrior lies in the selection of the ranking model, while the rest of the workflow remains unchanged. We utilize the APFD metric to evaluate the effectiveness differences of MLPrior, these variants, and the compared test prioritization methods on natural, mixed noisy, and fairness datasets.
Results: The experimental results for RQ3 are presented in Table XII to Table XVII. Tables XII and XIII display the effectiveness of MLPrior, its variants, and the compared test prioritization methods on natural datasets. Tables XIV and XV show their effectiveness on noisy datasets. Table XVI presents their effectiveness on fairness datasets. Table XVII illustrates their average performance across all datasets (including natural, noisy, and fairness datasets), as well as the improvements of MLPrior relative to its variants and the compared test prioritization methods.

MLPrior outperforms all its variants in test prioritization, indicating that among all ranking models, the XGBoost model (utilized by the original MLPrior) can better utilize the generated features of test inputs for test prioritization. Table XII and Table XIII demonstrate the effectiveness of MLPrior on natural datasets, including binary classification datasets (Table XII) and multiclass classification datasets (Table XIII). We see that, whether on binary or multiclass datasets, the effectiveness of MLPrior (measured by APFD) consistently surpasses all its variants. On binary natural datasets (Table XII), the APFD range for MLPrior is from 0.811 to 0.990, while the range for its variants is from 0.589 to 0.898. On multiclass natural datasets (Table XIII), the APFD range for MLPrior is from 0.639 to 0.915, while the range for its variants is from 0.580 to 0.890. We conclude that, on natural datasets, the effectiveness of MLPrior exceeds all its variants.

Table XVI demonstrates the effectiveness of MLPrior, its variants, and the compared test prioritization methods on fairness datasets. We see that, whether on fairness datasets constructed based on age or those constructed based on gender, the effectiveness of MLPrior outperforms both its variants and all the compared test prioritization methods. Specifically, the APFD range for MLPrior is from 0.765 to 0.897, while the range for its variants is from 0.648 to 0.821. We conclude that, on fairness datasets, the effectiveness of MLPrior exceeds all its variants.

D. RQ4: Contribution of Different Types of Features to the Effectiveness of MLPrior

Experimental design: To assess the impact of different feature types on the effectiveness of MLPrior, we adopt the cover metric from the XGBoost algorithm [8] as the measurement tool. Firstly, within the context of each subject, we compute the importance scores for each generated feature. Subsequently, we identify the top N most contributing features. Based on these scores, we investigate the extent to which each type of feature contributes to the effectiveness of MLPrior. Below, we explain the working principle of the XGBoost cover metric.
The Working Principle of the XGBoost Cover Metric: The cover metric in XGBoost quantifies feature importance by evaluating the average coverage of each instance across the leaf nodes in a decision tree. Specifically, the cover metric calculates the frequency at which a specific feature is utilized to partition the data in all trees of the ensemble. The coverage values associated with each feature across all trees are then summed. Subsequently, the resulting coverage value is normalized by the total number of instances, providing the average coverage of each instance by the leaf nodes. The significance of a particular feature is determined by its derived coverage value, with features exhibiting higher coverage values being assigned greater importance.
Results: Table XVIII presents the contributions of different feature types to the effectiveness of MLPrior. In this table, we utilize the abbreviations MMF, IMF, and OAF to represent model mutation features, input mutation features, and original attribute features, respectively. The numbers after the feature abbreviations denote the indices of the corresponding features. For instance, IMF-123 denotes the input mutation feature with index 123. We conducted the feature contribution analysis on both binary classification datasets (Adult, Bank, and Stroke) and multiclass classification datasets (Diabetes and Heartbeat).
All three types of features (i.e., model mutation features, input mutation features, and original attribute features) visibly contribute to the effectiveness of MLPrior. In Table XVIII, we find that in binary classification datasets, for the majority of cases (14 out of 15), all three types of features are present among the top-N most contributing features. For instance, in the dataset Adult with the LR model, IMF features account for 40% of the top 10 critical features, MMF features account for 50%, and OAF features account for 10%. In the case of dataset Bank with the Tree model, IMF features contribute to 20% of the top 10 critical features, MMF features account for 70%, and OAF features account for 10%. Moreover, regarding the multiclass classification datasets, we find that in all cases (10 out of 10), all three types of features are present among the top-N most contributing features. These experimental findings demonstrate that each type of feature makes a visible contribution to the effectiveness of MLPrior.
Answer to RQ4: All three types of features (i.e., model mutation features, input mutation features, and original attribute features) visibly contribute to the effectiveness of MLPrior.

E. RQ5: Impact of Main Parameters in MLPrior
Objectives: We delve into the impact of main parameters on the effectiveness of MLPrior.
Experimental design: Building upon the existing research by Wang et al. [15], we explore the impact of three main parameters within MLPrior's ranking model. These parameters include max_depth, which denotes the maximum tree depth for each XGBoost model; colsample_bytree, representing the sampling ratio of feature columns during the tree construction process; and learning_rate, indicating the boosting learning rate utilized in the XGBoost ranking model. To achieve the research objectives, we conducted a series of experiments using natural datasets. We carefully modified the aforementioned three main parameters and observed the variations in the effectiveness of MLPrior (measured by APFD).
Results: The experimental results of RQ5 are presented in Fig. 4, illustrating the fluctuations in MLPrior's effectiveness when the values of the main parameters are altered. The X-axis represents the parameter values, while the Y-axis represents MLPrior's effectiveness (measured by APFD). The solid red line corresponds to MLPrior, while the dashed lines represent the confidence-based test prioritization approaches. We investigated the influence of the main parameters on MLPrior's effectiveness across both binary classification datasets (Adult, Bank, and Stroke) and multiclass classification datasets (Diabetes and Heartbeat).
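For reference, the APFD metric plotted on the Y-axis can be computed in a few lines. This sketch follows the common adaptation in ML test prioritization where each misclassified test counts as one fault; the function name and boolean input encoding are illustrative choices, not from the paper.

```python
def apfd(is_misclassified):
    """APFD for a prioritized test list. is_misclassified[i] is True if
    the test ranked at position i+1 is misclassified (reveals a fault)."""
    n = len(is_misclassified)
    fault_ranks = [i + 1 for i, bad in enumerate(is_misclassified) if bad]
    m = len(fault_ranks)
    # Classic APFD: 1 - (sum of fault-revealing ranks)/(n*m) + 1/(2n).
    return 1 - sum(fault_ranks) / (n * m) + 1 / (2 * n)

# A perfect ordering puts both misclassified tests first...
best = apfd([True, True, False, False])
# ...while the worst ordering leaves them last.
worst = apfd([False, False, True, True])
print(best, worst)  # 0.75 0.25
```

Higher APFD means misclassified tests are surfaced earlier in the prioritized list, which is why it serves as the effectiveness measure throughout the evaluation.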
MLPrior consistently outperforms the confidence-based test prioritization approaches, even when the values of the main parameters are altered. Notably, MLPrior consistently outperforms the confidence-based test prioritization methods across all subjects, as evidenced by the red line persistently positioned above the blue dashed lines. For example, in Fig. 4(e), we observe that when the parameter colsample_bytree varies, MLPrior's APFD ranges from 0.86 to 0.88, whereas the confidence-based methods' APFD is approximately 0.75. Moreover, on the multiclass dataset Heartbeat, when colsample_bytree changes, MLPrior's APFD ranges from around 0.84 to 0.85, whereas the APFD of the confidence-based methods ranges from around 0.745 to 0.750. On the multiclass dataset Diabetes, when learning_rate changes, MLPrior's APFD ranges from around 0.77 to 0.78, whereas the APFD of the confidence-based methods is around 0.71.
The parameter colsample_bytree has a relatively small impact on the effectiveness of MLPrior, while the parameters max_depth and learning_rate have relatively large effects. Specifically, we observe that colsample_bytree, which determines the sampling ratio of feature columns during the construction of each tree, has a relatively modest impact on the effectiveness of MLPrior. In other words, the effectiveness of MLPrior remains relatively stable even when colsample_bytree is altered. In contrast, max_depth (the maximum tree depth) and learning_rate (the boosting learning rate) exert a relatively high impact on the performance of MLPrior.
Answer to RQ5: MLPrior consistently outperforms the confidence-based test prioritization approaches, even when the values of the main parameters are altered. The parameter colsample_bytree has a relatively small impact on the effectiveness of MLPrior, while the parameters max_depth and learning_rate have relatively large effects.

A. Generality of MLPrior
While we employed five ML models in our study, MLPrior can actually be adapted to a broad range of classical ML models through simple modifications to the model mutation rules, specifically by enabling them to target the architecture parameters or weight parameters of the evaluated model. We explain below why MLPrior exhibits generality. First, the core element of MLPrior is feature generation, which involves generating three essential types of features from the target tests: model mutation features, input mutation features, and attribute features. Once the features are generated, MLPrior can utilize the ranking model to learn from these features for the purpose of test prioritization. Concerning model mutation features, making the aforementioned simple adjustments (i.e., enabling model mutation rules to target the architecture parameters or weight parameters of the evaluated model) allows for the generation of model mutation features. For input mutation features and attribute features, MLPrior is capable of directly generating them. Consequently, MLPrior can be applied to a diverse range of classical ML models.
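To make this adaptation concrete, the following hypothetical sketch applies a weight-parameter mutation rule to a toy linear classifier. The `LinearModel` class and `mutate_weights` rule are illustrative stand-ins for an actual classical ML model and MLPrior's real mutation rules, which the paper defines elsewhere.

```python
import random

class LinearModel:
    """Toy white-box binary classifier standing in for a classical ML model."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias
    def predict(self, x):
        s = sum(w * v for w, v in zip(self.weights, x)) + self.bias
        return 1 if s > 0 else 0

def mutate_weights(model, scale=0.3, seed=0):
    """One illustrative mutation rule: perturb the weight parameters
    directly -- no retraining is needed."""
    rng = random.Random(seed)
    noisy = [w + rng.uniform(-scale, scale) for w in model.weights]
    return LinearModel(noisy, model.bias)

original = LinearModel(weights=[0.8, -0.5], bias=0.0)
mutant = mutate_weights(original)

# Intuition behind model mutation features: with this particular seed,
# a test near the decision boundary flips its prediction under mutation,
# while a test far from the boundary stays stable.
near_boundary, far_away = [0.1, 0.18], [-5.0, 5.0]
print(original.predict(near_boundary), mutant.predict(near_boundary))
print(original.predict(far_away), mutant.predict(far_away))
```

Because mutation only edits the exposed parameters, the same pattern transfers to any white-box model whose internals can be read and modified.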
Moreover, to better demonstrate the generality of MLPrior, we provide a detailed explanation of how to apply MLPrior to a new type of ML model.

B. Threats to Validity
Threats to Internal Validity. The internal threats to validity pertain primarily to the implementation of the compared approaches. To mitigate this threat, we implemented the compared approaches based on the implementations published by their respective authors. Another internal threat arises from the randomness inherent in the training process of the ML models. To mitigate this potential issue, we conducted a statistical analysis. Specifically, we repeated all the experiments five times and reported the average experimental results. Furthermore, we calculated the p-value of the experimental results to demonstrate the stability of our findings.
Threats to External Validity. The external threats to validity arise from the ML models and test datasets employed in our study. To mitigate these threats, we carefully selected a variety of ML models and datasets that are utilized by several top-level conferences [26], [89], [103] in the field of ML testing. Moreover, our evaluation of MLPrior extended beyond natural datasets to a spectrum of scenarios, encompassing mixed noisy datasets (comprising both natural and noisy data) as well as fairness-oriented datasets. This allowed us to substantiate the efficacy of MLPrior across various contexts.

A. Test Prioritization Techniques
Test prioritization aims to establish an optimized sequencing of tests with the objective of early detection of system bugs. In the field of traditional software testing, numerous test prioritization approaches have been proposed [104], [105], [106], [107], [108]. Lou et al. [109] introduced an innovative approach to prioritize test cases, focusing on the inherent ability of individual test cases to detect faults. Their approach consists of two distinct models: a statistics-based model and a probability-based model, both of which quantify the fault detection capability of each test case. Through empirical evaluations, they demonstrated that the statistics-based model outperformed alternative methods, underscoring the significance of incorporating fault detection capability within the realm of test case prioritization. Henard et al. [110] conducted a thorough comparative study of existing test prioritization techniques, finding that the difference between white-box strategies [111] and black-box strategies [112] is small. Chen et al. [113], in pursuit of accelerating compiler testing, introduced the LET (Learning and Scheduling-based Test prioritization) framework. This framework is underpinned by two salient processes: a learning process, designed to discern program features and predict the potential of a novel test program to reveal bugs, and a scheduling process, which prioritizes test programs based on their propensity to unveil bugs.
In addition to the traditional field of software engineering, multiple test input prioritization strategies have been proposed in the literature for Deep Neural Networks (DNNs) [15], [16], [20], [114] to tackle the labeling-cost issue. Feng et al. [16] introduced DeepGini, which prioritizes tests by utilizing the Gini score to measure model confidence for each test input. Byun et al. [115] assessed various white-box metrics for ranking bug-revealing inputs, encompassing widely-used measures such as softmax confidence, Bayesian uncertainty, and input surprise. Furthermore, Weiss et al. [20] extensively investigated diverse test input prioritization techniques for DNNs, particularly focusing on uncertainty-based metrics such as Vanilla Softmax, Prediction-Confidence Score (PCS), and Entropy. These metrics have demonstrated effectiveness in identifying potentially misclassified test inputs and have played a crucial role in facilitating test prioritization. Moreover, Wang et al. [15] proposed a mutation-based test prioritization approach for DNNs, which is described in Section VII-D.
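The Gini score used by DeepGini can be computed directly from a test's predicted class probabilities: the score is the Gini impurity 1 − Σp², so low-confidence (evenly spread) predictions rank first. The probability vectors below are illustrative values, not from any dataset.

```python
def gini_score(probs):
    """DeepGini's score: the Gini impurity of the predicted class
    probabilities. Higher score = lower confidence = prioritized earlier."""
    return 1.0 - sum(p * p for p in probs)

# Illustrative predicted probability vectors for two test inputs.
tests = {
    "confident": [0.97, 0.02, 0.01],
    "uncertain": [0.40, 0.35, 0.25],
}
ranking = sorted(tests, key=lambda t: gini_score(tests[t]), reverse=True)
print(ranking)  # the uncertain test is prioritized first
```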

B. DNN Testing
In addition to test prioritization, the domain of DNN testing encompasses several other pivotal areas, such as test selection [55], [56], test input generation [17], [116], and test adequacy. Test selection aims to select a representative subset from the original test set to estimate the accuracy of the entire test set. Various test selection approaches have been proposed in the literature. Li et al. [56] proposed CES (Cross Entropy-based Sampling), which performs test selection by minimizing the cross-entropy between the selected test set and the entire test set, ensuring that the distribution of the selected test set closely matches the original set. Chen et al. [55] proposed PACE, which selects representative test inputs based on clustering, prototype selection, and adaptive random testing. First, PACE divides all test inputs into clusters based on their testing capabilities. Then, PACE utilizes the MMD-critic algorithm [33] to select prototypes from each group. For tests not belonging to any group, PACE leverages adaptive random testing [117] to select test inputs by considering diversity.
Within the domain of test input generation, researchers have proposed a multitude of techniques aimed at generating diverse and effective inputs for DNN systems. Pei et al. [17] proposed DeepXplore, a white-box differential technique that focuses on generating test inputs capable of effectively evaluating the robustness of real-world DL systems. By leveraging the notion of neuron coverage, DeepXplore generates inputs that cover distinct regions of the neural network. Tian et al. [116] presented DeepTest, a method specifically tailored for generating test inputs to assess the performance of autonomous driving systems. DeepTest employs a greedy search strategy in conjunction with nine realistic image transformations to produce a diverse set of challenging input data. By systematically exploring the input space, DeepTest aims to uncover potential failures or limitations in autonomous driving systems, thereby enhancing their safety and reliability.
Regarding test adequacy, Ma et al. [18] proposed a set of multi-granularity testing criteria, including k-multisection neuron coverage, neuron boundary coverage, and strong neuron activation coverage. These criteria identify corner behaviors and uncover potential vulnerabilities in DNN systems by comprehensively examining the coverage of various aspects of the neural network's behavior. Kim et al. [118] introduced surprise adequacy as a novel test adequacy criterion for testing DL systems. The surprise adequacy criterion emphasizes that a good test input should be sufficiently challenging and informative while still adhering reasonably to the underlying training data distribution.

C. Mutation-Based Test Prioritization for Traditional Software
Mutation testing [63] entails generating intentional defects, referred to as mutants, within the software code to assess the test suite's quality. In the field of traditional software testing [23], [109], [119], mutation testing can be employed to assess the fault-detection capabilities of individual test cases, thereby achieving test prioritization. Lou et al. [109] introduced a novel test-case prioritization approach that determines the order of test cases by considering their fault detection ability. This ability is defined based on the analysis of mutation faults simulated from real software faults. By strategically ordering the test cases, this approach aims to maximize the efficiency of the testing process by prioritizing the detection of critical faults. Papadakis et al. [23] conducted a mutation analysis as an alternative technique to Combinatorial Interaction Testing (CIT). Their research suggests that the mutants generated using their approach demonstrate a stronger correlation with code-level faults than the input interactions targeted by the CIT approach. This underscores the potential of mutation analysis to offer valuable insights into underlying faults within software systems and to guide test case prioritization. Furthermore, Shin et al. [119] proposed a novel test case prioritization method that combines mutation-based and diversity-based approaches. They demonstrated that mutation-based prioritization is as effective as, or more effective than, random prioritization and coverage-based prioritization.

D. Mutation Testing and Mutation-Based Test Prioritisation for Deep Learning
Mutation Testing for DNNs: The field of mutation testing for DNNs has seen significant exploration, with numerous studies contributing to the evolution of various mutation operators and frameworks [24], [114], [120]. A notable contribution in this domain is from Ma et al. [24], who introduced DeepMutation. This approach is designed to assess the quality of test data for DL systems through comprehensive mutation testing. DeepMutation encompasses a diverse array of mutation operators at both the source and model levels. These operators are meticulously crafted to inject faults into different components of DL systems, including training data, programming code, and the models themselves. Building upon this foundation, Hu et al. further expanded this work with the development of DeepMutation++ [120], an advanced mutation testing tool specifically tailored for DL systems. DeepMutation++ introduced a set of new mutation operators that are particularly suited for feed-forward neural networks (FNNs) and Recurrent Neural Networks (RNNs). A key feature of this tool is its capability to dynamically mutate the runtime states of RNNs, a critical aspect for evaluating the resilience of these networks under various operational conditions. Humbatova et al. [114]
made a significant stride in the field by developing DeepCrime, the first mutation testing tool that implements DL mutation operators grounded in actual DL faults. DeepCrime is characterized by its comprehensive set of 24 newly defined mutation operators. These operators are not just theoretical constructs but are based on real-world faults observed in DL systems, making DeepCrime a highly practical tool for testing and improving the reliability of these systems.
Mutation-Based Test Prioritization for DNNs: Wang et al. [15] introduced PRIMA, an innovative test input prioritization technique founded on intelligent mutation analysis. PRIMA is applicable to both classification and regression models and possesses the capability to handle test inputs generated through adversarial input generation techniques, which have an enhanced probability of misclassification. However, PRIMA's model mutation rules cannot be adapted to classical ML models.
In this study, we proposed MLPrior, a mutation-based test input prioritization approach specifically designed for classical ML models. The significant differences between MLPrior and PRIMA are discussed below.

VIII. CONCLUSION
In order to solve the labeling cost problem for classical ML models, we propose MLPrior, which prioritizes tests that are more likely to be misclassified. MLPrior leverages the unique characteristics of classical ML classifiers, including their interpretability and carefully engineered dataset features, to effectively prioritize test inputs. The foundational principles of MLPrior are twofold: first, tests exhibiting higher sensitivity to mutations are more likely to be misclassified; second, tests situated closer to the decision boundary of the model are more susceptible to misclassification. Capitalizing on these principles, we design mutation rules specifically for classical ML models and their datasets. For each test, we generate mutation features while simultaneously transforming its attributes into a feature vector that can indirectly quantify the proximity between the test and the decision boundary. Concatenating these features, MLPrior constructs a final vector for each test, which is fed into a pre-trained ranking model to predict its misclassification probability. Finally, MLPrior ranks all the tests according to their misclassification scores in descending order. We conducted an extensive study to evaluate MLPrior, utilizing 185 different types of subjects that encompass natural, noisy, and fairness datasets. The experimental results demonstrate that MLPrior exhibits higher effectiveness compared to existing test prioritization methods, yielding an average improvement of 14.74%∼66.93% on natural datasets, 18.55%∼67.73% on mixed noisy datasets, and 15.34%∼62.72% on fairness datasets.
• Efficient: The total duration for test prioritization using MLPrior is around 20 seconds, covering model/input mutation, feature generation, ranking model training, and test prioritization. One crucial factor is that MLPrior does not require any retraining in the model mutation process: mutants are generated by directly modifying the architecture parameters or weight parameters of the evaluated models.
• Model-specific insights: Compared to confidence-based test prioritization approaches, MLPrior leverages the interpretability of classical ML models and introduces mutations through modification of the model's architecture parameters or weight parameters, thus achieving effective test prioritization.
• Attribute feature inclusion: In contrast to DNN test data, classical ML datasets typically possess lower-dimensional features, rendering them more cost-effective and time-efficient for test prioritization. Moreover, these features are typically carefully selected by domain experts, providing a direct reflection of attribute information for each test input.

• Diversity of Domain Knowledge
• Domain Adaptation Challenges
• Difficulty in Quantifying Domain Knowledge: Encoding domain expertise into an automated labeling system can be a complex task.
• Input Mutation Features (IMF): Based on the input mutation rules, MLPrior generates a set of input mutants for each test t ∈ T. Subsequently, MLPrior compares the predictions of model M for each input mutant with that for the original input t to construct the input mutation vector. During this process, if the prediction for the i-th mutated input differs from that of the original test input t, the corresponding i-th element of the feature vector is assigned a value of 1; otherwise, it is assigned a value of 0. An example of the resulting feature vector is (0, 1, . . ., 0).
• Model Mutation Features (MMF): Based on the model mutation rules described in Section III-B, MLPrior generates a set of mutated models for the original ML model M. For each test t ∈ T, MLPrior identifies whether t "kills" each of the mutated models (i.e., whether the predictions made by the mutated model and the original ML model for t differ) to construct the model mutation vector. More specifically, if t kills the i-th mutated model, the i-th element of t's model mutation vector is set to 1.
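Both vector constructions above follow the same kill-based rule and can be sketched together. The stub prediction values below are illustrative placeholders for real model outputs, not results from the paper.

```python
def mutation_vector(original_pred, mutant_preds):
    """1 at position i iff the i-th mutant's prediction disagrees with
    the original prediction."""
    return [1 if p != original_pred else 0 for p in mutant_preds]

pred_on_t = 1  # prediction of the original model M on the original test t

# Input mutation features: the ORIGINAL model's predictions on the
# mutated inputs t'_1, t'_2, ... (stub values, illustrative only).
preds_on_input_mutants = [1, 0, 1, 1]
imf = mutation_vector(pred_on_t, preds_on_input_mutants)

# Model mutation features: the MUTATED models' predictions on the
# original input t; a 1 means t "kills" that mutated model.
preds_of_model_mutants = [0, 1, 1, 0]
mmf = mutation_vector(pred_on_t, preds_of_model_mutants)

print(imf, mmf)
```

A test whose vectors contain many 1s is highly mutation-sensitive, which is exactly the signal the ranking model learns from.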

How does MLPrior perform in terms of effectiveness and efficiency?
To solve the labeling cost problem, we propose MLPrior, a test input prioritization approach specifically designed for classical ML models. In this research question, we evaluate the effectiveness and efficiency of MLPrior by comparing it with several existing test prioritization approaches [16], [20].

How does MLPrior perform on different types of test inputs?
In order to evaluate the effectiveness of MLPrior in various scenarios, we constructed mixed noisy datasets and fairness datasets.We compare the effectiveness of MLPrior against various test prioritization approaches on the generated datasets.

TABLE II EFFECTIVENESS COMPARISON AMONG MLPRIOR AND DNN TEST PRIORITIZATION APPROACHES IN TERMS OF APFD ON NATURAL DATASETS (BINARY CLASSIFICATION)

TABLE IV TIME COST OF MLPRIOR AND THE COMPARED TEST PRIORITIZATION APPROACHES

TABLE VI EFFECTIVENESS COMPARISON AMONG MLPRIOR AND DNN TEST PRIORITIZATION APPROACHES IN TERMS OF APFD ON MIXED NOISY DATASETS (BINARY CLASSIFICATION)

Table VI (together with Table VII and Table VIII) showcases the effectiveness difference between MLPrior and the compared test prioritization methods when applied to mixed noisy inputs. The evaluation metric employed is the Average Percentage of Faults Detected (APFD).

TABLE VIII EFFECTIVENESS IMPROVEMENT OF MLPRIOR OVER THE COMPARED APPROACHES IN TERMS OF APFD ON MIXED NOISY DATASETS

TABLE IX EFFECTIVENESS COMPARISON AMONG MLPRIOR AND DNN TEST PRIORITIZATION APPROACHES IN TERMS OF APFD ON FAIRNESS DATASETS (BINARY CLASSIFICATION)

TABLE XII EFFECTIVENESS COMPARISON AMONG MLPRIOR, MLPRIOR VARIANTS AND DNN TEST PRIORITIZATION APPROACHES IN TERMS OF APFD ON NATURAL DATASETS (BINARY CLASSIFICATION)

Table XVII displays the effectiveness of MLPrior, its variants, and the compared test prioritization methods across all subjects.

TABLE XV EFFECTIVENESS COMPARISON AMONG MLPRIOR, MLPRIOR VARIANTS AND DNN TEST PRIORITIZATION APPROACHES IN TERMS OF APFD ON MIXED NOISY DATASETS (MULTICLASS CLASSIFICATION)

The experimental results above demonstrate that MLPrior performs better than its variants, indicating that, among all the ranking models evaluated, the XGBoost model used in the original MLPrior demonstrates a better capability in utilizing the generated features of test inputs for test prioritization. We investigate the contributions of three types of features (i.e., model mutation features, input mutation features, and original attribute features) to the effectiveness of MLPrior.

• Skills needed to apply MLPrior to new ML models: When an ML testing practitioner aims to apply MLPrior to a new type of ML model, they need to possess the following skills: 1) An understanding of the internal parameters and mechanisms of the new machine learning model, to effectively carry out model mutation operations.

• Characteristics needed for models to utilize MLPrior: When a model exhibits the following characteristics, it can be added to the set of models that can use MLPrior: 1) The dataset of the model is in tabular format, as our input mutation and attribute feature generation operations are specifically crafted for classical ML models that utilize tabular datasets; 2) The model is a white-box model, which allows for modifications to its internal structure or parameters, facilitating the implementation of MLPrior's model mutation operations.

Furthermore, we offer the following protocol to guide an ML testing practitioner in adapting MLPrior to new model classes. It details the systematic process for generating the model mutation features, original attribute features, and input mutation features.
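As a hypothetical end-to-end sketch of this protocol, the snippet below concatenates the three feature types into a final vector and ranks tests by a stub score. In MLPrior itself, the stub scoring function would be replaced by the pre-trained XGBoost ranking model; all names and values here are illustrative.

```python
def final_vector(mmf, imf, oaf):
    """Concatenate the three feature types into one final vector."""
    return mmf + imf + oaf

def rank_tests(tests, score_fn):
    """Rank tests by predicted misclassification score, descending."""
    return sorted(tests, key=lambda t: score_fn(tests[t]), reverse=True)

# Illustrative tests: 2 model mutation bits, 2 input mutation bits,
# and 2 original attribute values each.
tests = {
    "t1": final_vector([1, 1], [1, 0], [0.5, 3.2]),
    "t2": final_vector([0, 0], [0, 0], [0.1, 7.9]),
}

# Stub in place of the pre-trained XGBoost ranking model: score a test
# by the fraction of mutants it kills (its first four vector elements).
def stub_score(v):
    return sum(v[:4]) / 4

print(rank_tests(tests, stub_score))  # t1 is more mutation-sensitive
```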
• Model Mutation Feature (MMF) Generation

The significant differences between MLPrior and PRIMA are as follows:
• Different Approaches for Model Mutation: MLPrior and PRIMA leverage different model mutation approaches. In MLPrior, model mutations are specifically designed for white-box classical machine learning models. These mutations are based on the interpretable nature of these models and involve modifying the architecture parameters or weight parameters of the evaluated model. PRIMA, on the other hand, is primarily focused on DNNs, which are non-interpretable black-box models. Examples of model mutations in PRIMA include adding noise to the weights of neurons and altering the structure of DNN layers.
• Attribute Feature Inclusion: Another significant difference is that MLPrior employs the inherent attribute features of classical ML model datasets for test prioritization. In contrast, PRIMA does not incorporate this information into its test prioritization procedure. The motivation behind MLPrior's use of attribute features is that classical ML datasets typically exhibit lower-dimensional features compared to DNN test data. Additionally, these features are carefully selected by domain experts, directly reflecting the attribute information associated with each test input.
• Feature Generation Strategy: In terms of model and input mutation, compared to PRIMA, MLPrior emphasizes generating mutation features directly from mutation results. For example, in model mutation, the i-th element in the vector indicates whether the i-th mutated model is "killed" by this input. This method is intuitive and reproducible.
• Use of Multiple Ranking Models: MLPrior employs five different ranking models and assesses their effectiveness in utilizing mutation features for test prioritization. In contrast, PRIMA utilizes only a single ranking model. By comparing multiple ranking models, MLPrior can identify the most effective model for learning mutation features in the context of test prioritization.