Test Input Prioritization for Graph Neural Networks

GNNs have shown remarkable performance in a variety of classification tasks. The reliability of GNN models needs to be thoroughly validated before their deployment to ensure their accurate functioning. Therefore, effective testing is essential for identifying vulnerabilities in GNN models. However, given the complexity and size of graph-structured data, the cost of manually labeling GNN test inputs can be prohibitively high for real-world use cases. Although several approaches have been proposed in the general domain of Deep Neural Network (DNN) testing to alleviate this labeling cost issue, these approaches are not suitable for GNNs because they do not account for the interdependence between GNN test inputs, which is crucial for GNN inference. In this paper, we propose NodeRank, a novel test prioritization approach specifically for GNNs, guided by ensemble learning-based mutation analysis. Inspired by traditional mutation testing, where specific operators are applied to mutate code statements to identify whether provided test cases reveal faults, NodeRank operates on a crucial premise: if a test input (node) can kill many mutated models and produce different prediction results with many mutated inputs, this input is considered more likely to be misclassified by the GNN model and should be prioritized higher. Through prioritization, these potentially misclassified inputs can be identified earlier with limited manual labeling cost. NodeRank introduces mutation operators suitable for GNNs, focusing on three key aspects: the graph structure, the features of the graph nodes, and the GNN model itself. NodeRank generates mutants and compares their predictions against those of the initial test inputs. Based on the comparison results, a mutation feature vector is generated for each test input and used as the input to ranking models for test prioritization.
Leveraging ensemble learning techniques, NodeRank combines the prediction results of the base ranking models and produces a misclassification score for each test input, which can indicate the likelihood of this input being misclassified. NodeRank sorts all the test inputs based on their scores in descending order. To evaluate NodeRank, we build 124 GNN subjects (i.e., a pair of dataset and GNN model), incorporating both natural and adversarial contexts. Our results demonstrate that NodeRank outperforms all the compared test prioritization approaches in terms of both APFD and PFD, which are widely-adopted metrics in this field. Specifically, NodeRank achieves an average improvement of between 4.41% and 58.11% on original datasets and between 4.96% and 62.15% on adversarial datasets.

Yinghua Li, Xueqi Dang, Weiguo Pian, Andrew Habib, Jacques Klein, Member, IEEE, and Tegawendé F. Bissyandé

I. INTRODUCTION
Recent years have witnessed widespread adoption of graph machine learning for modeling, predictive, and analytics tasks on graph-structured data, while the emergence of Graph Neural Networks (GNNs) [1] has led to unprecedented performance in a variety of applications in drug design [2], [3], [4], recommender systems [5], [6], and social network analysis [7], [8]. As they are increasingly adopted, the debugging of GNNs becomes essential, especially in safety-critical and security-sensitive domains. A key perspective in that domain is developing effective and efficient techniques for GNN testing to achieve quality assurance.
Unfortunately, Deep Neural Networks (DNNs), including GNNs, are notoriously difficult to test due to the limitations in the availability of a test oracle [9], [10], [11]. Indeed, DNN testing is challenged by the fact that it is costly and time-consuming to label test inputs: 1) automated labeling is not yet mainstream; 2) datasets can be substantially large, and the data can be complex, as in the case of GNNs; 3) labeling may require deep domain-specific knowledge, which is prohibitively expensive to acquire. Therefore, to achieve efficient and effective testing of DNN-based systems, researchers and practitioners generally focus on identifying only the relevant test inputs that are likely to cause the system to behave incorrectly (i.e., bug-revealing test inputs). Diagnosing those inputs is then expected to provide insights for debugging the DNNs.
Prior work has developed various techniques to identify and prioritize bug-revealing test inputs, which allows testers/developers to focus on the most critical inputs [10], [11], [12], [13]. Such test prioritization techniques aim at optimizing the time as well as the required resources for testing. A large majority of DNN test prioritization approaches fall within three categories [11]: coverage-based, confidence-based, and surprise-based approaches. Confidence-based approaches, such as DeepGini [10], prioritize test inputs based on model confidence: a test input is more likely to be incorrectly predicted by a DNN model if that model outputs similar prediction probabilities for each class. Coverage-based approaches, such as CTM [14], simply adapt coverage-based test prioritization from traditional software systems testing into DNN testing and have been shown to underperform against confidence-based approaches [10]. Surprise-based methods [13], [15] perform test prioritization based on the surprise of test inputs. This "surprise" is quantified by measuring the distance in neuron activation patterns between a test input and the training data. However, existing studies [16] have demonstrated that surprise-based methods are less effective than confidence-based approaches. Furthermore, surprise-based methods typically come with higher computational costs due to the need for more parameter tuning.
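To make the confidence-based intuition concrete, the Gini-impurity scoring used by DeepGini can be sketched as follows (a minimal sketch of the published idea; the function and variable names are ours, not DeepGini's actual API):

```python
def deepgini_score(probs):
    """Gini impurity of a softmax output: 1 - sum(p_i^2).
    Higher means the model is less confident, so the input
    is prioritized earlier."""
    return 1.0 - sum(p * p for p in probs)

# A near-uniform output scores higher than a confident one,
# so it is ranked first.
tests = {"t1": [0.9, 0.05, 0.05], "t2": [0.34, 0.33, 0.33]}
ranked = sorted(tests, key=lambda t: deepgini_score(tests[t]), reverse=True)
```

Here `t2`, whose prediction probabilities are nearly uniform across the three classes, is prioritized above the confidently predicted `t1`.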
Although confidence-based approaches have demonstrated effectiveness in the context of DNNs, they suffer from several limitations when applied to GNNs. Notably, they do not account for the interdependence inherent in graph-structured test inputs composed of nodes and edges. These approaches were originally designed for DNNs, where tests are independent of each other. Additionally, confidence-based approaches operate under the assumption that test inputs for which the model exhibits low confidence are more likely to be misclassified and, therefore, should be given higher priority. However, in the presence of adversarial attacks, the model's confidence can be higher for incorrect predictions, leading to erroneous outputs.
More recently, novel approaches such as PRIMA [11] are being introduced in the literature of DNN testing, leveraging techniques such as mutation analysis. However, PRIMA, the state-of-the-art in DNN test prioritization, cannot be applied to GNNs since its mutation operators are not adapted to graph-structured data and models. Dang et al. [17] proposed GraphPrior, a test prioritization method specifically designed for GNNs. Despite GraphPrior also relying on mutation analysis, there are significant differences between NodeRank and GraphPrior:
• Incorporating Input Mutations. GraphPrior performs test prioritization solely based on model-specific mutations, whereas NodeRank not only considers model mutations but also takes into account mutations specific to the input. NodeRank considers two types of input mutations: 1) Node feature mutations, which are designed to perturb the feature attributes of selected nodes, consequently influencing the representation and information flow within the graph; 2) Graph structure mutations, which aim to alter the interdependence of the test inputs within the graph by introducing additional edges, thus changing the structural properties of the graph.
• Leveraging Ensemble Learning Techniques for Learning-to-Rank. In contrast to GraphPrior, which employs a single ranking model to learn the misclassification probability of test inputs, NodeRank leverages ensemble learning techniques to integrate multiple base ranking models with the aim of optimizing its performance. Existing studies [18], [19], [20] have demonstrated that ensemble learning typically achieves higher accuracy than single ML models. Furthermore, our analysis delves into the influence of different ensemble techniques on NodeRank and illustrates that the sum-based ensemble technique yields the best performance.
• Considering Different Killing Methods. GraphPrior simply assumes that a mutated model is considered "killed" if the predictions of the original model and the mutated model for the test input differ. However, prior research [21] has highlighted that in the context of DNN mutation analysis, variations in the outputs between a mutated model and the original model can occur solely due to the inherent randomness in the training process rather than because the mutant is actually discriminated from the original model. Therefore, we utilized the killing method provided by DeepCrime [21] for test prioritization and generated relevant variants of NodeRank. In DeepCrime, the killing process involves iteratively training both the original model and the mutated model, then comparing the distribution difference in their outputs to determine whether the mutated model is "killed." This approach can contribute to mitigating the impact of randomness in the training process. Based on the DeepCrime approach, by comparing NodeRank variants utilizing model mutation rules and those not utilizing model mutation rules, we demonstrated that mutations generated by the model mutation rules of NodeRank contribute to its effectiveness.
This paper. We propose NodeRank (Node Ranking for graph-structure test inputs), a novel test input prioritization approach targeting GNNs. NodeRank leverages ideas from traditional mutation testing [22], [23] to prioritize potentially misclassified test inputs so that such tests can be identified earlier with limited manual labeling costs. More specifically, the core idea of NodeRank is that a test is considered more likely to be misclassified if this test can kill many mutated models and produce different prediction results with many mutated inputs.
NodeRank is a test prioritization approach that is model-based, input-based, and mutation testing-based. It applies mutation operations to GNN models and tests, generating mutation features for test prioritization. The specific mutation operations applied are described below.
In NodeRank, we developed three distinct types of mutations, namely graph structure mutation (GSM), node feature mutation (NFM), and GNN model mutation (GMM), based on the characteristics of GNNs and the graph test dataset. GSM aims to modify the interdependence of the graph test inputs by introducing additional edges, thereby altering the structural properties of the graph. NFM, on the other hand, perturbs the feature attributes of selected nodes, thereby influencing the representation and information flow within the graph. Both GSM and NFM can be categorized as input mutations, as they directly modify the characteristics of the dataset. In contrast, GMM is specifically developed to mutate GNN models, with the objective of modifying the message passing of the GNNs by changing specific training parameters. The GMM mutation type thus falls under the category of model mutations.
For each test input, NodeRank generates these three types of mutations, as described above. Subsequently, by comparing the prediction results before and after the mutation, NodeRank generates a mutation feature vector for each test. Specifically, for graph input mutation (i.e., GSM and NFM), if a mutated input fails (i.e., the predictions for the mutated input and the original input are different), the corresponding element in the relative feature vector is marked as 1; otherwise, it is marked as 0. For GNN model mutation (i.e., GMM), if a mutated model is killed (i.e., the prediction results for this input via the mutated model and the original model are different), the corresponding element in the relative feature vector is marked as 1; otherwise, it is marked as 0. The mutation feature vector of each test is then fed into pre-trained ranking models, which automatically predict a misclassification score indicating the likelihood of this input being misclassified by the model.
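The comparison step described above can be sketched as follows, under the simplifying assumption that predictions are plain class labels (the helper name is illustrative):

```python
def mutation_feature_vector(original_pred, mutant_preds):
    """One element per mutant: 1 if the mutant's prediction for this
    test input differs from the original prediction (i.e., the mutant
    is killed / the mutated input fails), 0 otherwise."""
    return [1 if p != original_pred else 0 for p in mutant_preds]

# The original model predicts class 2; four mutants predict 2, 0, 2, 1.
vec = mutation_feature_vector(2, [2, 0, 2, 1])  # -> [0, 1, 0, 1]
```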
To further enhance the performance of our ranking models, we adopt ensemble learning techniques that combine the predictions from multiple base ranking models. The idea draws inspiration from the field of ensemble learning [18], [19], [24], which aims to improve overall performance by integrating predictions from two or more base machine learning models. Notably, ensemble learning methods have achieved state-of-the-art outcomes across various machine learning applications [25], [26], [27], [28]. In the NodeRank framework, we employ three distinct ensemble methods [18], [29], [30] to effectively combine the outputs of the individual ranking models.
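As an illustration of how base ranking models can be combined, a sum-based combination (the strategy our experiments later identify as best-performing) can be sketched as follows; this is only a schematic stand-in, with hypothetical scores in place of NodeRank's actual trained ranking models:

```python
def sum_ensemble(score_lists):
    """Combine base ranking models by summing their per-input
    misclassification scores."""
    return [sum(scores) for scores in zip(*score_lists)]

# Three hypothetical base models each score four test inputs;
# a higher score means "more likely to be misclassified".
m1 = [0.2, 0.8, 0.5, 0.1]
m2 = [0.3, 0.7, 0.6, 0.2]
m3 = [0.1, 0.9, 0.4, 0.0]
combined = sum_ensemble([m1, m2, m3])
# Prioritize by combined score, descending.
order = sorted(range(4), key=lambda i: combined[i], reverse=True)
```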
It is important to note that NodeRank differs from the state-of-the-art test prioritization approach, PRIMA, in several respects: the target (GNN vs. DNN) as well as the approach (mutation rules and ranking strategies).
• Target. NodeRank is designed to address the test prioritization problem in GNNs and, therefore, operates on datasets that exhibit complex interdependence between individual test inputs. In contrast, PRIMA is intended for traditional DNNs, where each sample in the dataset is independent.
• Mutation rules. NodeRank's mutation rules can affect the interdependency between test inputs from two perspectives. First, NodeRank's model mutation rules can directly or indirectly affect the message passing between nodes in graph data. More specifically, in the mutated GNN model, the manner in which nodes acquire information from their neighboring nodes is slightly different from that of the original GNN model. Second, NodeRank's node mutation rules modify the interdependence between nodes by adding edges to nodes. When adding a new edge from node A to node B, a new connection is built, and the prediction of node A is now impacted by the newly connected node B, thus changing the node interdependence. In contrast, the mutation rules of PRIMA are specifically designed for independent test inputs and, therefore, do not impact the relationships between tests.
• Ranking strategies. NodeRank leverages ensemble ranking models to learn from mutation results for test prioritization. These models are constructed by combining different base ranking models, thereby improving the overall performance of the model [18], [19], [24]. In contrast, PRIMA employs a single ranking model for test prioritization.
We evaluate the performance of NodeRank based on 124 subjects (i.e., a pair of dataset and GNN model). Our evaluation considers both natural inputs and graph adversarial inputs, which are generated by eight graph adversarial attacks [31], [32], [33], [34]. We compare NodeRank with multiple test prioritization approaches. Our experimental results demonstrate that, on natural datasets, the average improvement of NodeRank over the compared approaches, in terms of APFD, is between 4.41% and 58.11%. On graph adversarial inputs, the average improvement of NodeRank over the compared approaches in terms of APFD ranges from 4.96% to 62.15%.
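For reference, APFD can be computed as follows under its standard definition in the test prioritization literature (positions are 1-based; helper names are ours):

```python
def apfd(order, faulty):
    """APFD for a prioritized list: `order` is the ranked list of test
    ids, `faulty` the set of misclassified ones (assumed non-empty).
    APFD = 1 - (sum of positions of faulty inputs) / (k * n) + 1 / (2n),
    where n is the total number of tests and k the number of faulty ones."""
    n, k = len(order), len(faulty)
    positions = [i + 1 for i, t in enumerate(order) if t in faulty]
    return 1 - sum(positions) / (k * n) + 1 / (2 * n)

# Perfect ranking puts both faulty inputs (of 4) first; the worst
# ranking puts them last. Higher APFD means better prioritization.
best = apfd(["a", "b", "c", "d"], {"a", "b"})   # positions 1, 2
worst = apfd(["c", "d", "a", "b"], {"a", "b"})  # positions 3, 4
```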
NodeRank can be applied across diverse real-world contexts. For instance, a typical use case of node classification in GNNs is fraud detection [35] in banking transfer transaction systems. Here, each account can be represented as a node, while the transactional interactions between them can be represented as edges. Through node classification, these accounts can be categorized as normal or fraudulent. When developers use GNNs to predict whether each node (account) is a fraudulent account, the GNNs can exhibit wrong prediction behavior, such as predicting fraudulent accounts as normal accounts, which can lead to losses for the bank. In this scenario, NodeRank can be utilized to prioritize potentially misclassified accounts (with those more likely to be misclassified ranked at the top). These sorted accounts can be provided to bank staff, allowing them to quickly perform manual checks on the accounts that are more likely to be misclassified, thus reducing losses.
The contributions of this paper are as follows:
• Approach. We propose a novel approach, NodeRank, to prioritize test inputs for GNN models. NodeRank introduces three distinct types of mutation rules that target the mutation of graph structure, node features, and GNN models, respectively, and adopts ensemble-learning-based learning-to-rank to intelligently combine mutation results for effective test input prioritization.
• Study. We conducted a large-scale study based on 124 subjects to evaluate the effectiveness of NodeRank on both natural and adversarial inputs. The experimental results demonstrate its effectiveness.
• Performance Analysis. We provide an extensive analysis of the performance of NodeRank by investigating the influence of the different ensemble learning strategies as well as by performing an ablation study to showcase the contributions of the different mutation feature sets.
Our dataset, code, and results are made publicly available in a replication package for the community.

II. BACKGROUND
We now briefly introduce the key domain concepts for our work.

A. Graph Neural Networks
Graph neural networks [36], [37], [38] have achieved great success in solving machine learning problems on graph-structured data [7], [39], [40]. Initial models learned representations of target nodes by propagating neighborhood information through recurrent neural architectures in an iterative manner until a stable fixed point was reached. Subsequently, several variations have been proposed in the literature: Kipf et al. [41] proposed graph convolutional networks (GCN), which adapt convolution techniques from classical convolutional neural networks to graph data. GCN implements message passing over multi-order neighborhoods by superimposing several convolutional layers. More recently, other GNN architectures have been proposed to take into account advancements in the field of deep learning: for example, Veličković et al. [38] proposed graph attention networks (GAT), which use attention techniques to assign different weights according to the importance of nodes in the graph.
In GNNs, a graph is usually defined as a data structure composed of nodes and edges. We denote a graph as G = (V, E), where V = {1, 2, ..., N} refers to the set of N nodes, and E ⊆ V × V refers to the set of edges. In GNN datasets like Cora (a node classification dataset), each node represents a scientific paper, while edges represent citation relationships between papers. In this dataset, test inputs typically refer to new nodes (scientific papers) that have not been seen during the training process. In the case of the Cora dataset, given a test input (a scientific paper), a GNN model is used to classify the paper into specific categories. In other words, the GNN model predicts the categories that best describe the content of the given paper. For instance, these categories can be "reinforcement learning" and "neural networks," implying that the paper belongs to the "reinforcement learning" or "neural networks" category.
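The G = (V, E) notation maps directly onto a simple edge-set representation; the following toy sketch (not the actual Cora format) illustrates nodes, edges, and neighborhood lookup:

```python
# A toy citation graph: nodes are papers, directed edges are citations.
V = {1, 2, 3, 4}
E = {(1, 2), (2, 3), (4, 2)}  # E is a subset of V x V

def neighbors(v, edges):
    """Papers cited by v (targets of v's outgoing edges)."""
    return {t for (s, t) in edges if s == v}
```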
[GNN training process] GNNs undergo a training process similar to other neural networks. The inputs required for GNN training typically include: 1) Graph Structure. Graph structure information encompasses the connections between nodes in the graph; 2) Node Features. Each node typically comes with associated feature vectors, which reflect the attributes of the node; 3) Target Labels. In the training data for GNN node classification, "Target Labels" refer to the category to which each node belongs. These labels are typically predefined.
During the training process of GNNs, several components are continually trained and optimized: 1) Model Parameters. The primary aim of GNN training is to refine the model parameters. These parameters include weights and biases linked to operations like graph convolutions and aggregation functions within the GNN architecture; 2) Node Embeddings. GNNs comprise layers with associated parameters, and part of the training process involves learning node embeddings. Node embeddings are vector representations of individual nodes within the graph. They capture a node's structural and feature-based information and evolve as the model trains; 3) Loss Function. The loss function plays a pivotal role in GNN training. It quantifies the disparity between the model's predictions, typically pertaining to nodes or graph-level attributes, and the actual ground truth labels for the given task. Throughout training, model parameters are iteratively adjusted to minimize this loss function.
The specific training process for GNNs typically consists of the following steps:
• Initialization: All GNN parameters are randomly initialized, typically with small random values.
• Forward Propagation: For each node, its node embedding is updated based on the information from its neighbors. This is typically achieved using weight matrices and aggregation functions (e.g., mean or max pooling). This aggregation process can go through multiple layers, allowing information to propagate further in the graph.
[GNN inference] In the context of GNNs, inference refers to using a pre-trained GNN model to perform prediction on new graph data. For example, in node classification tasks, the GNN model utilizes its learned parameters and weights to classify a given node. The input typically consists of the features of the node and the graph structure of the graph it belongs to. The output is the classification of the node.
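A single forward-propagation step of the kind described above can be sketched as follows, under the strong simplifications of scalar embeddings, mean aggregation, and no learned weight matrices:

```python
def propagate(embeddings, adjacency):
    """One forward-propagation step: each node's new embedding is the
    mean of its own embedding and its neighbors' embeddings."""
    new = {}
    for v, h in embeddings.items():
        neigh = adjacency.get(v, [])
        vals = [h] + [embeddings[u] for u in neigh]
        new[v] = sum(vals) / len(vals)
    return new

# A 3-node path graph 1 - 2 - 3 with scalar embeddings.
h = {1: 1.0, 2: 3.0, 3: 5.0}
adj = {1: [2], 2: [1, 3], 3: [2]}
h1 = propagate(h, adj)  # node 2 becomes (3 + 1 + 5) / 3 = 3.0
```

Stacking such steps lets information travel one extra hop per layer, which is the multi-layer propagation described in the text.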

B. Mutation Testing
Mutation testing [22] is a software testing method that aims to evaluate the quality of a test suite by intentionally introducing small changes (called mutations) into the source code and observing the test suite's reaction. The core objective is to determine the effectiveness of test suites in finding code bugs. The intuition is that if a test case can detect intentionally introduced errors, this test case is more likely to detect real bugs in practice. Mutation testing provides a comprehensive evaluation of the test suite by creating and testing multiple variations (mutants) of the code, ensuring that the test cases thoroughly cover different scenarios and even edge cases, which is difficult to achieve through traditional testing methods.
In mutation testing, "kill" and "fail" are terms used to describe the results of running a test suite on a set of artificially created code changes, or mutants, to evaluate the quality of the test suite. Specifically, a mutant is regarded as "killed" if the behavior of this mutant differs from that of the original code, indicating that the test suite is capable of detecting the fault introduced by the mutant. A test input is said to "fail" if the program under test does not pass it (i.e., produces an incorrect result on that input).

C. Ensemble Learning
Ensemble learning [18] is a meta-approach in machine learning in which multiple, generally diversified, ML models are combined to achieve better performance and generalization. Common examples of ensemble learning strategies include majority voting and stacking. Majority voting is a straightforward approach that counts, for each class, the number of predictions yielded by the different models: the class with the most votes is then output by the ensemble model. Stacking, on the other hand, utilizes a meta-model, such as logistic regression, to learn how to optimally combine the predictions from base ML models.
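Majority voting as described can be sketched in a few lines (an illustrative sketch, not a library API):

```python
from collections import Counter

def majority_vote(predictions):
    """Each base model casts one vote (its predicted class); the class
    with the most votes is the ensemble output."""
    return Counter(predictions).most_common(1)[0][0]

# Two of three base models predict "fraud", so the ensemble does too.
label = majority_vote(["fraud", "normal", "fraud"])  # -> "fraud"
```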
Ranking is crucial to many real-world applications, notably in the field of information retrieval. In software engineering, test prioritization assumes the ability to rank test cases according to their ability to reveal faults. In recent studies [18], ranking has been formalized as a machine learning problem, and ensemble ranking employs ensemble learning techniques to learn optimal weights for combining multiple ranking algorithms.

III. APPROACH

A. Overview
NodeRank is a model-based, input-based, and mutation testing-based test prioritization approach. By employing mutation operations on both GNN models and test inputs, NodeRank produces mutation features for each test input and predicts the misclassification probability of the input in order to perform test prioritization. Fig. 1 presents an overview of the different steps in our NodeRank test prioritization approach. First, we offer detailed explanations for certain elements in Fig. 1 and provide reasons for the symbols utilized for them, with the goal of enhancing the understanding of the figure.
• The second dotted square in Step 1 represents an N × M matrix. This matrix is used to encapsulate the feature vectors of all nodes. Specifically, each row of the matrix represents a node's feature vector. There are N rows in the matrix, corresponding to the N nodes in the dataset; the M columns represent that each node has M features.
• The reason we use a dotted representation for node feature vectors and mutation features is that, in our experiments, these features are both represented using matrices. The dotted square can serve as a visual abstraction of a matrix. A matrix comprises multiple values, and we use dots to represent the values within the matrix abstractly. For example, in the second dotted square in Step 1, the third dot in the second row represents the value of the third feature of the second node in the graph dataset.
In Fig. 1, a chromosome with a break indicates the mutations applied to the GNN model or inputs, resulting in mutation results. These mutation results are subsequently used for mutation feature generation in order to perform test prioritization. Moreover, each box represents a step in the NodeRank workflow. In the following section, we offer a general overview of each step, as depicted in Fig. 1. Specific details for each step can be found from Section III-B to Section III-D.
❶ Generating mutants. NodeRank generates mutants for three different inputs: the graph structure itself (which represents the interdependence of samples in the datasets), the node features (which represent sample data), and the GNN model (which is learned and is the target of testing). To that end, we develop specific mutation rules that are carefully designed for GNN testing. Section III-B details those rules, which are applied to generate mutants for a given test set T, the graph structure G of the data, and the GNN model M under test.
❷ Extracting and combining mutation features. NodeRank then obtains the model predictions for the mutants and the original test inputs. By comparing the predictions, NodeRank generates the mutation feature vector for each input. The detailed description is as follows. Given M′, a mutant of M, NodeRank considers that a test input kills M′ if the prediction on this test input by M′ is different from the prediction by M. For a given mutant of a test input t ∈ T, NodeRank considers that this mutant failed if it leads to a prediction that is different from the prediction using t. Given G′, a mutant of G, NodeRank considers that the mutant fails if the prediction of the GNN using G′ is different from its prediction when using G.
Based on the execution outputs, NodeRank builds feature vectors to train a ranking model. These are referred to as mutation features and are of three types: node mutation features, graph structure mutation features, and model mutation features (cf. Section III-C for details).
❸ Ranking test inputs using ensemble ranking models. Eventually, for each test input, NodeRank produces three vectors, which represent the three types of mutation features. These vectors are then concatenated to produce a mutation feature vector v for each test input t ∈ T. Given all test inputs from T, NodeRank leverages ensemble ranking models based on their associated mutation features to predict ranking scores of the test inputs. These scores, ordered in a descending way, are used to prioritize the associated test inputs accordingly.
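Steps ❷ and ❸ can be sketched end-to-end as follows, with a stand-in scoring function in place of NodeRank's trained ensemble ranking models (all names are illustrative):

```python
def prioritize(tests, gsm_feats, nfm_feats, gmm_feats, score):
    """Concatenate the three mutation feature vectors of each test
    input, score the result, and sort inputs by score, descending."""
    feats = {t: gsm_feats[t] + nfm_feats[t] + gmm_feats[t] for t in tests}
    return sorted(tests, key=lambda t: score(feats[t]), reverse=True)

# Stand-in ranking model: fraction of killed/failed mutants.
score = lambda v: sum(v) / len(v)
gsm_feats = {"t1": [0, 1], "t2": [1, 1]}
nfm_feats = {"t1": [0, 0], "t2": [1, 0]}
gmm_feats = {"t1": [1, 0], "t2": [1, 1]}
ranked = prioritize(["t1", "t2"], gsm_feats, nfm_feats, gmm_feats, score)
```

Input `t2` kills or fails five of six mutants, so it is ranked above `t1`, which affects only two.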
The findings presented in Section V provide compelling evidence for the effectiveness of NodeRank, which can be attributed, in part, to the careful design of mutation rules and the effective ensemble strategy of ranking models. 1) Our designed mutation rules can effectively generate informative mutation features by leveraging the interdependence of test inputs. The node mutation rules operate by introducing new edges between nodes in the graph dataset, which impacts the interdependence structure of the data. The model mutation rules affect message passing between nodes in the GNN prediction process, leading to small changes in node interdependence. 2) NodeRank adopts an ensemble ranking model for test prioritization, which leverages the strengths of multiple base ranking models to improve overall performance. By comparing different ensemble strategies, we are able to identify the most suitable approach for use in NodeRank's test prioritization process.
In the remainder of the section, we describe in detail the mutation rules that we have designed for NodeRank (cf. Section III-B), the construction process of the mutation feature vectors (cf. Section III-C), the setup of the ensemble ranking model (cf. Section III-D), and the application of NodeRank (cf. Section III-E).

B. Specifying Mutation Rules
We design mutation rules that are adapted to the three main ingredients of a GNN: the graph structure of the data, the nodes in the graph, and the GNN model itself. These are explained in detail as follows.
1) Graph Structure Mutation (GSM): Graph structure mutation is designed to introduce slight changes to the input graph by randomly incorporating new edges. Consequently, when provided with a test input node, denoted as t ∈ T, we create mutants by adding one or more edges between node t and randomly selected nodes s ∈ T. For a given node t ∈ T, the following formula provides an intuitive representation of the GSM mutation:

G' = G + Σ_{i=1}^{n} addEdge(t, s_i)    (1)

where G represents the original graph and G' represents the mutated graph structure. In each iteration, we use the addEdge function to generate an edge from node t to a randomly selected node s_i ∈ T. We use the symbol "+" to denote the addition of the newly generated edge to the original graph G. This process is repeated n times, resulting in the addition of n edges to the original graph G.
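A minimal sketch of this rule, assuming the graph is stored as a plain edge list (a real GNN pipeline would typically hold an adjacency matrix or edge-index tensor instead; `gsm_mutate` and its argument names are our own illustrative choices):

```python
import random

def gsm_mutate(edges, t, candidates, n, seed=0):
    """Sketch of the GSM rule (Formula 1): return a mutated copy of the
    edge list with n new edges from test node t to random nodes s_i."""
    rng = random.Random(seed)
    mutated = list(edges)            # keep the original graph G intact
    for _ in range(n):
        s = rng.choice(candidates)   # randomly selected node s_i in T
        mutated.append((t, s))       # G' = G + addEdge(t, s_i)
    return mutated

# Toy 4-node graph; attach 2 new edges to test node 0
graph = [(0, 1), (1, 2)]
mutant_graph = gsm_mutate(graph, t=0, candidates=[2, 3], n=2)
```

Keeping the original edge list untouched matters here: each mutant is compared against the prediction on the unmodified graph.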
2) Node Feature Mutation (NFM): Given a test set and the features of the test inputs, node feature mutation aims to slightly change the features of the targeted nodes in order to offset their position in the feature space. This offset implies the modification of feature values in the different dimensions.
In the following, we introduce in detail how node feature mutation is performed. Given the original test set T, which consists of n nodes, each node t is characterized by m dimensions, where each dimension corresponds to a specific feature value of the node. In this case, T can be represented as an n × m feature matrix. To perform node feature mutation, we apply an offset to this matrix. Specifically, assuming the degree of offset is denoted as α, Formula 2 represents the mutation process for the test set T. As observed in the formula, the initial step involves multiplying the feature matrix of the original test set T by the offset degree α to calculate the ultimate offset to be applied to T. Subsequently, the feature matrix of the mutated test set, denoted as F(T'), is derived by adding T's feature matrix to the offset α · F(T):

F(T') = F(T) + α · F(T)    (2)

where F(T') is the feature matrix of the mutated test set T', F(T) is the feature matrix of the original test set T, and α is the coefficient of the degree of offset.
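Formula 2 amounts to a single element-wise operation on the feature matrix; a small sketch (the function name and the toy matrix are our own):

```python
import numpy as np

def nfm_mutate(F, alpha):
    """Sketch of the NFM rule (Formula 2): F(T') = F(T) + alpha * F(T)."""
    return F + alpha * F

# Toy 2-node, 2-dimensional feature matrix
F = np.array([[1.0, 2.0],
              [3.0, 4.0]])
F_mut = nfm_mutate(F, alpha=0.01)  # alpha within the paper's 0.005-0.015 range
```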
3) GNN Model Mutation (GMM): Given a trained graph neural network model, the GNN model mutation aims to change the training parameters slightly. Formula 3 offers an intuitive representation of the GMM mutation. For integer- or float-type parameters, the mutation operation involves making slight adjustments to the parameter values. In the case of Boolean-type parameters, the mutation operation involves switching between True and False. Therefore, the formula is as follows:

θ' = θ + β · θ  (numeric θ),    θ' = ¬θ  (Boolean θ)    (3)

where the mutated GNN model M' is obtained by replacing θ with θ' in the original GNN model M, θ refers to a parameter of the original model M, and β refers to the coefficient of change, indicating the magnitude of the parameter change. The symbol ¬ signifies the logical negation operation, which inverts the Boolean parameter θ: if the original value is True, it becomes False, and if the original value is False, it becomes True.
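The two branches of Formula 3 can be sketched over a plain parameter dictionary (the parameter names below are illustrative GAT-style hyperparameters, not the paper's exact configuration; integer rounding is elided):

```python
def gmm_mutate(params, beta):
    """Sketch of the GMM rule (Formula 3): negate Boolean parameters and
    nudge numeric ones by a small coefficient beta."""
    mutated = {}
    for name, theta in params.items():
        if isinstance(theta, bool):      # check bool first: bool is a subtype of int
            mutated[name] = not theta    # theta' = ¬theta
        else:
            mutated[name] = theta + beta * theta  # theta' = theta + beta * theta
    return mutated

# Hypothetical GAT-style parameters (names are illustrative)
original = {"negative_slope": 0.2, "heads": 4, "add_self_loops": True}
mutant = gmm_mutate(original, beta=0.1)
```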
In NodeRank, we consider the following parameters:
• Learning Additive Bias (LAB) [41], [42], [43]: The LAB parameter is a Boolean variable that determines whether to introduce a predetermined offset to the representation vectors of nodes in the GNN model. By enabling the LAB parameter (set to True), a bias parameter is assigned to each node's representation vector. This allows the GNN model to better capture the intrinsic properties of the graph and improve the interdependence between nodes in the prediction process.
• Negative Slope (NS) [42]: NS is a float parameter that controls the slope of the negative part of the activation function used in the Gated Linear Unit (GLU) operation, a commonly used non-linear function for message passing in GNNs. In particular, GLU combines the node features with the weighted sum of their neighboring nodes' features, which is the message passed between nodes in the graph. The negative slope parameter of the activation function in the GLU operation determines the rate of decrease for negative input values and can affect the message passing between nodes. As such, the value of NS plays a crucial role in determining the sensitivity of the GNN model to negative input values and the resulting impact on the interdependence between nodes in the graph.
• Changing Multi-head Attentions (CMA) [42]: CMA is an integer-type parameter that determines the number of attention heads employed by the GNN model, with an increase in CMA leading to an expanded model capacity and an improved ability to capture the interdependencies among the nodes in the graph.
• Concat (CON) [42]: The CON parameter is a Boolean-type parameter that determines the method used to integrate node embeddings from neighboring nodes. When set to True, the concatenation operation is employed to combine the node embeddings of adjacent nodes, resulting in a more sophisticated and expressive node representation. This, in turn, enhances the capacity of the GNN model to capture more interdependencies between nodes.
• Adding Self Loops (ASL) [41], [42]: The ASL parameter is a Boolean parameter that governs the addition of self-loops to the input graph. By setting ASL to True, self-loops are introduced to each node in the graph, enabling the aggregation of intrinsic information from nodes into their representation vectors. This operation modifies the weighting of neighboring nodes and can affect the interdependence of nodes during the prediction process.
• Adding Layer Computations (ALC) [41]: ALC is a Boolean-type parameter that determines whether or not to include additional layers of computation in the GNN. When ALC is set to True, additional layers are introduced to the network, which allows for more complex transformations of the node features. As a result, the message passing process becomes more refined and capable of capturing more intricate dependencies among the nodes.
• Hidden Channel (HC) [41], [42], [43], [44]: The HC parameter is an integer configuration parameter that governs the dimensionality of the hidden representation in each layer of the GNN. As such, modifications to this parameter can impact the interdependence of nodes in a given graph by allowing the GNN to learn more expressive and informative node embeddings.
We explain how the mutation rules of NodeRank utilize node interdependence to generate mutations as follows:
• For model-level mutants: NodeRank's mutation rules can directly or indirectly affect the message passing between nodes in graph data. More specifically, in the mutated GNN model, the manner in which nodes acquire information from their neighboring nodes is slightly different from that of the original GNN model.
• For node-level mutants: NodeRank modifies the interdependence between nodes by adding edges to nodes. When adding a new edge from node A to node B, a new connection is built, and the prediction of node A is now impacted by the newly connected node B, thus changing the node interdependence.
Note that the mutation rules of NodeRank are specifically developed for GNNs, and their applicability in the context of DNNs has not yet been examined. Specifically, regarding node mutation rules, NodeRank focuses on modifying the connection relationships between nodes in a graph. However, in DNNs, the samples within a dataset are independent and lack any inherent connectivity, rendering the proposed mutation rules unsuitable for such datasets. Moreover, the model mutation rules of NodeRank are designed to impact the message passing between nodes during the prediction process, either directly or indirectly. In contrast, conventional DNNs generally consist of independent samples within a dataset, implying that such mutation rules are unlikely to influence the transmission of information between distinct tests.

C. Constructing Mutation Features Vectors
Leveraging the three types of mutation rules introduced in the previous steps, we generate a mutation feature vector for each test input. To this end, we execute the three mutation rules, thereby generating three distinct feature vectors for each input. These feature vectors are concatenated to build the final mutation feature vector. In the following, we explain the generation of each feature vector of the different mutation types.
Dataset mutation (NFM and GSM): Given a test input t and a GNN model M, we denote the mutants of t as {t_1, t_2, ..., t_n}, which are obtained using the NFM mutation rules. We associate a vector V of size n to the test input t, where n is the number of mutants and V[k] maps to the execution output for the mutant t_k. If t_k fails (i.e., the prediction of t_k is different from that of t), then V[k] is set to 1; otherwise, it is set to 0. We use the same procedure to build a graph mutation feature vector for t using the GSM mutation rules. Formula 4 describes the process of dataset mutation in our mutation testing operation:

V[k] = 1 if M(t_k) ≠ M(t), and V[k] = 0 otherwise    (4)

where M(t_k) represents the prediction of the model M for the mutated input t_k and M(t) represents its prediction for the original test input t.
Model mutation (GMM): Similarly, given the mutated models {M_1, M_2, ..., M_n} of M, V[k] is set to 1 if the mutated model M_k predicts t differently from the original model M; otherwise, it is set to 0. Formula 5 presents the process of model mutation in our mutation testing operation:

V[k] = 1 if M_k(t) ≠ M(t), and V[k] = 0 otherwise    (5)

where M(t) represents the prediction of the original model M for the test input t and M_k(t) represents the prediction of the mutant model M_k for the same test input t.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
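Both Formulas 4 and 5 reduce to the same comparison of predicted labels; a sketch with hypothetical class labels (the function name is ours):

```python
def mutation_feature_vector(original_pred, mutant_preds):
    """Sketch of Formulas 4/5: V[k] = 1 if the k-th mutant disagrees with
    the original prediction for the test input, else 0."""
    return [1 if p != original_pred else 0 for p in mutant_preds]

# Hypothetical labels: the original model predicts class 2; five mutants predict:
V = mutation_feature_vector(2, [2, 0, 2, 1, 2])
# The per-rule vectors are then concatenated into one mutation feature vector
feature = mutation_feature_vector(2, [2, 0]) + mutation_feature_vector(2, [1, 2])
```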

D. Building an Ensemble Ranking Model
Based on the previous step, NodeRank generates a feature vector V_i for each test input t_i ∈ T. This feature vector is then used as the input to the ensemble ranking model for predicting the misclassification probability of t_i. The design of the ensemble ranking models is motivated by the principles of learning-to-rank [45] and ensemble learning [18]. In particular, we adopt four base ranking models, namely Logistic Regression [46], Random Forest [47], XGBoost [48], and LightGBM [49], to form ensemble models that can leverage the strengths of each individual model. NodeRank uses the sum-based ensemble learning method [18], which combines the scores of the base ranking models for a given test input. By inputting V_i into the sum-based ensemble ranking model, NodeRank obtains a misclassification score for t_i, which can be used to estimate the probability that the GNN model M will misclassify t_i.
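The sum-based combination can be sketched as follows, with stub rankers standing in for the four trained base models (the class names and the sklearn-style `predict_proba` interface are our own assumptions, not the paper's implementation):

```python
import numpy as np

class SumEnsemble:
    """Sum-based ensemble sketch: a test input's misclassification score is
    the sum of the scores produced by the base ranking models."""
    def __init__(self, models):
        self.models = models
    def score(self, X):
        return np.sum([m.predict_proba(X)[:, 1] for m in self.models], axis=0)

class StubRanker:
    """Stand-in for a trained base ranker (e.g. Random Forest); it returns a
    fixed misclassification probability p for every input."""
    def __init__(self, p):
        self.p = p
    def predict_proba(self, X):
        return np.tile([1.0 - self.p, self.p], (len(X), 1))

ensemble = SumEnsemble([StubRanker(0.2), StubRanker(0.7)])
scores = ensemble.score(np.zeros((3, 4)))  # three inputs -> three summed scores
```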
Our experiments further consider two other ensemble learning methods (i.e., stacking-based [29] and voting-based [18]) to build variants of NodeRank and assess the effectiveness of our design choices (cf. Section IV-F).

E. Usage of NodeRank
The inputs of NodeRank are a test set T and a GNN model M. The output is the prioritized test set T_P. NodeRank generates mutants for the test set T and the GNN model M and exploits the execution outputs of the GNN on these mutants to build feature vectors that can be utilized to learn to prioritize test inputs using ensemble ranking models. We present the training process of each ranking model as follows.
❶ Dataset Split: Given a GNN model M with dataset T, we partition the dataset T into two subsets: a training set R and a test set. Following common practice in the field [50], we allocate 70% of the data to the training set and consider the remaining 30% as the test set. We emphasize that the test set is kept entirely separate from the training process and is only utilized to evaluate NodeRank.
❷ Training set construction: Based on the given training set R, the objective of this step is to build a training set R' for training the ranking models. Firstly, for each input r_i ∈ R, three types of mutants are generated, and based on the execution of these mutants, the mutation feature vector V_i of r_i is obtained. Subsequently, the mutation feature vector of r_i is utilized to build the features of the training set R'. Secondly, the original GNN model M is used to classify each input r_i ∈ R, and the prediction is compared with the ground truth of r_i. This step identifies whether r_i is misclassified by the GNN model M. If r_i is misclassified by M, it is labeled as 1; if not, it is labeled as 0. This process builds the labels of the ranking model training set R'.
❸ Training ranking models: After building R', we train the ranking models based on it. Notably, the training set R' contains binary labels (i.e., 1 or 0), whereas the ranking models are expected to output continuous values, referred to as misclassification scores. To address this, we made certain modifications to the ranking algorithms we employed, such as the random forest. During the classification process, these algorithms calculate an intermediate value, which is used to decide whether an input belongs to a particular class. If the intermediate value exceeds a predefined threshold of 0.5 (which is configurable), the input is classified into the first class; otherwise, it is classified into the other class. Rather than outputting the binary label, we directly output the intermediate value, representing the misclassification score. This score indicates the likelihood of a test input being misclassified by the GNN model, with a higher score indicating a greater probability of misclassification.
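Once the intermediate scores are available, prioritization itself is a descending sort; a sketch with hypothetical scores (the function name is ours):

```python
import numpy as np

def prioritize(scores):
    """Rank test inputs by misclassification score, highest first, and
    return the indices of the prioritized test set T_P."""
    return np.argsort(-np.asarray(scores))

# Hypothetical intermediate values output by a modified ranking model
order = prioritize([0.1, 0.9, 0.4, 0.8])  # inputs 1 and 3 look riskiest
```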

IV. EVALUATION DESIGN
To assess NodeRank, we enumerate various research questions (cf. Section IV-A) and describe the performance metrics (cf. Section IV-B) for test input prioritization on a diverse set of GNN subjects (cf. Section IV-D). Beyond the prioritization performance of NodeRank in uncovering model misclassifications, we also consider its performance under adversarial settings (cf. Section IV-E). In this section, we also present the design of the different variants of NodeRank (cf. Section IV-F), which vary based on the ensemble ranking strategy. Finally, information about implementation and configuration setup is provided in Section IV-G.

A. Research Questions
We investigate the following research questions: • RQ1: What is the effectiveness of NodeRank?
Building on studies in traditional software testing [51], [52], effective test prioritization techniques should be able to prioritize possibly-misclassified test inputs.
• RQ2: How does NodeRank perform on adversarial inputs? Graph adversarial attacks [32], [33] can induce GNN models to be confident in their incorrect predictions. In the original NodeRank, for a given test input, we employ the killing approach from traditional mutation testing [54] to generate model mutation features. These features are then utilized to predict the misclassification probability for this input. However, the model mutation features generated by such a killing approach can contain information from both the model mutation rules and the randomness inherent in mutated model training, both of which can contribute to the effectiveness of NodeRank. In this research question, we aim to demonstrate that the model mutation rules actually contribute to the effectiveness of NodeRank by employing the killing approach in DeepCrime [21], which takes into account the training randomness of the mutated models during the killing process.
• RQ6: How do the parameter ranges of the newly designed mutation operators impact the effectiveness of NodeRank?
In NodeRank, we developed a set of novel mutation operators tailored for GNNs.In this research question, we investigate how the parameter ranges of these newly designed mutation operators affect the performance of NodeRank.

B. Performance Metric
We evaluate the effectiveness of test prioritization based on the common Average Percentage of Fault-Detection (APFD) [14] metric. Specifically, higher APFD values indicate faster misclassification detection rates. Given a GNN model M under the test set T, the APFD values are calculated via Formula 6.

APFD = 1 − (Σ_{i=1}^{k} o_i) / (k × n) + 1 / (2n)    (6)

where n is the number of test inputs in T; k is the number of test inputs in T that will be misclassified by M; and o_i represents the position of the i-th misclassified test within the prioritized test set. When the sum of the index values for the first k misclassified tests, i.e., Σ_{i=1}^{k} o_i, is small, it indicates that the prioritized test set ranks the misclassified tests higher, leading to a larger APFD score. Consequently, a higher APFD score indicates better prioritization effectiveness.
Following prior work [10], we perform normalization on the APFD values, making them fall in the range [0, 1] to facilitate comparison. We thus consider a test prioritization approach better if its APFD value is closer to 1.
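A direct implementation of Formula 6 over a prioritized list (function and variable names are ours; normalization is omitted):

```python
def apfd(prioritized, misclassified):
    """Formula 6: APFD = 1 - sum(o_i)/(k*n) + 1/(2n), with o_i the 1-based
    position of the i-th misclassified input in the prioritized test set."""
    n = len(prioritized)
    positions = [i + 1 for i, t in enumerate(prioritized) if t in misclassified]
    k = len(positions)
    return 1 - sum(positions) / (k * n) + 1 / (2 * n)

# A perfect ranking puts both faulty inputs ("a", "b") first
best = apfd(["a", "b", "c", "d"], {"a", "b"})
worst = apfd(["c", "d", "a", "b"], {"a", "b"})
```

As expected, ranking the two faulty inputs first yields a higher APFD than ranking them last.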
To conduct a more detailed evaluation, we employ the Percentage of Fault Detected (PFD) metric [10] to quantify the fault detection rate of each test prioritization approach across varying ratios of prioritized test inputs. High PFD values indicate higher effectiveness in identifying misclassified test inputs. PFD is calculated via Formula 7:

PFD = F_c / F_t    (7)

where F_c is the number of misclassified test inputs that are correctly detected and F_t is the total number of misclassified test inputs.
In this study, we compare the PFD of NodeRank and the uncertainty-based test prioritization approaches against different ratios of prioritized tests. We use PFD-n to represent the PFD over the first n% of prioritized test inputs.
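PFD-n can be computed by truncating the prioritized list at the given ratio; a sketch with hypothetical tests (names are ours):

```python
def pfd_at(prioritized, misclassified, ratio):
    """Formula 7 at a budget: PFD = F_c / F_t, where F_c counts misclassified
    inputs found in the first `ratio` fraction of the prioritized test set."""
    cutoff = int(len(prioritized) * ratio)
    detected = sum(1 for t in prioritized[:cutoff] if t in misclassified)
    return detected / len(misclassified)

order = ["a", "b", "c", "d", "e"]   # hypothetical prioritized tests
faults = {"a", "b", "e"}            # F_t = 3 misclassified inputs
p20 = pfd_at(order, faults, 0.2)    # PFD-20: inspect the first 20% of tests
p100 = pfd_at(order, faults, 1.0)   # the full test set finds every fault
```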

C. Compared Approaches
This study utilized six compared approaches: a baseline approach (i.e., random selection), four DNN test prioritization techniques, and one GNN-specific test prioritization technique. The selection of these methods was driven by several factors. Firstly, we aimed to consider approaches that could be feasibly adapted to GNN test prioritization. Secondly, the chosen techniques have been demonstrated to be effective for DNNs in the existing literature [10], [55], [56]. Lastly, open-source implementations of these techniques are available.
• DeepGini [10] employs the Gini coefficient as a statistical measure of the likelihood of misclassification, thereby enabling the ranking of test inputs. The calculation of the Gini score is presented in Formula 8:

ξ(x) = 1 − Σ_{i=1}^{N} p_i(x)²    (8)

where ξ(x) refers to the likelihood of the test input x being misclassified, p_i(x) refers to the probability that the test input x is predicted to be label i, and N refers to the number of labels.
• Vanilla Softmax [55] calculates the difference between the value of 1 and the maximum activation probability in the output softmax layer, as depicted in Formula 9:

VanillaSM(x) = 1 − max_c l_c(x)    (9)

where l_c(x) belongs to a valid softmax array in which all values are between 0 and 1 and their sum is 1.
• Prediction-Confidence Score (PCS) [55] measures the difference between the predicted class and the second most confident class in the softmax likelihood. PCS is calculated via Formula 10; low PCS values indicate a high probability of being misclassified:

PCS(x) = l_k(x) − l_j(x)    (10)

where l_k(x) refers to the most confident prediction probability and l_j(x) refers to the second most confident prediction probability.
• Entropy [55] measures uncertainty in a classification model's prediction for a given test by computing the entropy of the softmax likelihood.
• GraphPrior [17] is a test prioritization method specifically designed for GNNs. GraphPrior generates mutated models for GNNs and regards tests that kill many mutated models as more likely to be misclassified.
• Random selection [57]: In random selection, the order of execution for test inputs is determined randomly.
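The four uncertainty baselines (Formulas 8-10 plus entropy) can be computed directly from a softmax vector; a sketch with two hypothetical outputs (function names are ours):

```python
import math

def deepgini(p):      # Formula 8: xi(x) = 1 - sum_i p_i(x)^2
    return 1 - sum(q * q for q in p)

def vanilla_sm(p):    # Formula 9: 1 - max softmax activation
    return 1 - max(p)

def pcs(p):           # Formula 10: top-1 minus top-2 probability (low = risky)
    top = sorted(p, reverse=True)
    return top[0] - top[1]

def entropy(p):       # entropy of the softmax likelihood (high = uncertain)
    return -sum(q * math.log(q) for q in p if q > 0)

confident = [0.9, 0.05, 0.05]   # hypothetical softmax outputs
unsure = [0.4, 0.35, 0.25]
```

On these examples, the less confident output scores as riskier under all four measures (for PCS, risk corresponds to a lower value).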

D. GNN Subjects 1) Graph Datasets:
Our study utilizes four benchmark datasets commonly used in the field of graph neural networks (GNNs). The Cora and CiteSeer datasets are composed of machine learning publications, represented as nodes in a graph structure, with edges representing citation links between the publications. The PubMed dataset, on the other hand, contains bio-medicine publications. The LastFM Asia Social Network dataset consists of the relationships between users of the Last.fm music service in Asia, where users are represented as nodes and their mutual follower relationships are represented as edges. These datasets have been widely adopted in existing research on graph neural networks [58], [59], [60], [61], [62].
Overall, we built 124 subjects to evaluate the effectiveness of NodeRank, including 16 subjects of natural datasets and 108 subjects of adversarial datasets.
• Cora [63]: Cora comprises 2,708 scientific publications and 5,429 links between them. Publications are considered nodes and are classified into seven classes.
• CiteSeer [63]: CiteSeer is composed of 3,327 scientific publications and 4,732 links between them. Publications (nodes) are classified into six classes.
• PubMed [63]: PubMed is composed of 19,717 diabetes-related publications and 44,338 links between them. Publications (nodes) are classified into three classes.
• LastFM Asia Social Network [64]: LastFM Asia Social Network comprises 7,624 nodes and 27,806 edges.
2) GNN Models: We consider four GNN models that have been widely studied in the literature on neural network testing, specifically under adversarial attacks.
• Graph Convolutional Network (GCN) [41] is a class of neural networks that use graph convolutions. GCN leverages the information of edges to aggregate node information to generate new node representations.
• Graph Attention Network (GAT) [42] introduces a graph attention layer to weigh the importance of different nodes within a neighborhood. Each node is assigned an attention score so that more important neighbors can be identified.
• Topology Adaptive GCN (TAGCN) [44] designs a set of fixed-size learnable filters to perform convolution operations on graphs. These filters adapt to the topology of the graph while it is scanned for convolution.
• Graph Sample and Aggregate (GraphSAGE) [43] generates node embeddings through sampling and aggregating features of neighbor nodes. For computational efficiency, GraphSAGE samples a fixed number of neighbors for each node.

E. Graph Adversarial Attacks
In RQ2, we aim to investigate the effectiveness of NodeRank on test inputs generated through diverse graph adversarial attacks. Graph adversarial attacks refer to the manipulation of the graph structure or node features to generate graph adversarial perturbations that fool GNN models. To evaluate the performance of NodeRank against such attacks, we applied a range of adversarial attacks in our experiments. We introduce these attacks as follows.
• Delete internally, connect externally (DICE) [31]: DICE randomly inserts or deletes an edge for each perturbation. DICE follows two crucial rules: 1) only removing edges between nodes that are from the same class, and 2) only inserting edges between nodes that are from different classes.
• Min-max attack (MMA) [32]: The min-max attack is a type of untargeted white-box GNN attack, which formulates the attack problem as a min-max optimization problem. In this setup, the inner maximization objective is to update the model's parameters (θ) by maximizing the attack loss, and it can be efficiently solved using gradient ascent. Meanwhile, the outer minimization is achieved using the Projected Gradient Descent (PGD) [65] algorithm, which iteratively perturbs the graph within a bounded ℓp-norm constraint to ensure that the generated perturbations are not too large.
• Node embedding attack-Add (NEAA) [33]: In the node embedding attack-add, attackers manipulate the original graph structure by adding new edges while ensuring that a predetermined budget constraint is not exceeded.
• Node embedding attack-Remove (NEAR) [33]: In the node embedding attack-remove, adversarial attacks modify the original graph structure by selectively removing edges while adhering to a budget constraint.
• PGD attack (PGD) [32]: The PGD attack leverages the Projected Gradient Descent algorithm to search for optimal structural perturbations to attack GNNs.
• Random Attack-Add (RAA) [34]: RAA randomly adds edges to the input graph to generate perturbations.
• Random Attack-Remove (RAR) [34]: RAR randomly removes edges from the input graph to generate perturbations.
• Random Attack-Flip (RAF) [34]: RAF randomly flips edges in the input graph to generate perturbations.

F. Variants of NodeRank
In this paper, NodeRank refers to the approach that utilizes the sum-based ensemble learning method (cf. Section III-D) on top of the four considered base models, namely Logistic Regression [46], Random Forest [47], XGBoost [48], and LightGBM [49]. We also implemented two variants using the stacking-based and voting-based ensemble methods.
1) NodeRank_S: With this variant, we implemented a stacking-based ensemble method, which uses meta-learning [66] to learn from the outputs of the base ranking models to make more accurate predictions. Given a GNN model M that classifies nodes into n classes and a test set T_test, NodeRank_S performs as follows: (1) first, each base ranking model RM_i is trained using the mutation features of the training input set T_train of M; (2) then, NodeRank_S uses the output of each ranking model to create a new dataset. More specifically, NodeRank_S inputs the mutation results of the training set to each ranking model to obtain the outputs. For each training input, NodeRank_S obtains four probability scores, which are considered as new features, while the label is 1 or 0. Here, 1 means the training input is misclassified by the GNN model M, while 0 means it is correctly classified. Since the training set has ground truth for each input, we can build a new dataset in this way. (3) NodeRank_S uses the new dataset to train the meta-learner. Here, each input has four features, which are the outputs from the four ranking models. The ground truth is whether an input is misclassified by the GNN model M. (4) After training the meta-learner, NodeRank_S inputs the mutation results of the test set T_test to the ranking models. Then, NodeRank_S feeds the outputs of the ranking models to the meta-learner, which provides a score for each test input in T_test. Based on these scores, NodeRank_S prioritizes all the test inputs.
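Step (2) above, building the meta-learner's dataset from the base rankers' scores, can be sketched as follows (the stub class and sklearn-style `predict_proba` interface are our own assumptions):

```python
import numpy as np

def stack_features(base_models, X):
    """Stacking sketch: each input's new feature vector is the set of scores
    produced by the base ranking models; a meta-learner is trained on it."""
    return np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])

class StubRanker:
    """Stand-in for a trained base ranker returning a fixed probability p."""
    def __init__(self, p):
        self.p = p
    def predict_proba(self, X):
        return np.tile([1.0 - self.p, self.p], (len(X), 1))

bases = [StubRanker(p) for p in (0.1, 0.2, 0.3, 0.4)]   # four base rankers
meta_X = stack_features(bases, np.zeros((2, 5)))        # 2 inputs x 4 meta-features
```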
2) NodeRank_V: With this variant, we implemented the majority voting-based ensemble learning method [67] to combine the prediction results of the different ranking models. Majority voting sums the predictions for each class and returns the class with the majority vote as the ensemble prediction. Given a GNN model M and a test set T_test, NodeRank_V performs as follows: (1) first, each base ranking model RM_i is trained using the mutation features of the training input set T_train of M; (2) for a test input in T_test, NodeRank_V inputs its mutation features to the N ranking models, obtaining N scores (i.e., misclassification probabilities) for this input. Then, NodeRank_V transforms each score into 0 or 1: scores below 0.5 are converted to 0, and the others to 1. In this way, NodeRank_V obtains an N-length vector for each input, for example, {0, 1, ..., 0}. NodeRank_V regards 1 as a vote for misclassification (i.e., the input will be misclassified by the GNN model M) and 0 as a vote for correct classification. (3) After voting, for each input, NodeRank_V sums its votes from all ranking models. NodeRank_V ranks all the test inputs based on their votes for misclassification.
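The binarize-and-sum step can be sketched over hypothetical per-ranker scores (function and variable names are ours):

```python
def misclassification_votes(scores, threshold=0.5):
    """Voting sketch: binarize each base ranker's score at the threshold and
    sum the votes for misclassification; inputs are ranked by total votes."""
    return sum(1 for s in scores if s >= threshold)

# Hypothetical per-ranker scores for two test inputs
votes_a = misclassification_votes([0.8, 0.6, 0.3, 0.9])  # 3 of 4 rankers vote 1
votes_b = misclassification_votes([0.1, 0.2, 0.6, 0.4])  # 1 of 4 rankers votes 1
ranking = sorted(["a", "b"], key={"a": votes_a, "b": votes_b}.get, reverse=True)
```

Input "a" collects more misclassification votes and is therefore prioritized first.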

G. Implementation and Configuration
We implemented NodeRank in Python based on the PyTorch [68] framework. We also integrated the available implementations of the compared approaches [10], [16], [69] into our experimental pipeline to adapt them to the GNN prioritization problem. Regarding the GNN models selected as subjects in our study, the range of their accuracy is: GAT: 71%∼77%, GCN: 70%∼73%, GraphSAGE: 71%∼73%, TAGCN: 72%∼81%. Regarding our mutation rules, for the GNN model mutation, we generated 144 mutants on average; for the graph structure mutation, 265 mutants on average; and for the node feature mutation, 147 mutants on average. Concerning the configuration of the node mutation rules in the experiments of this paper, we made the following design choice: we slightly modify attributes, with an offset between 0.005 and 0.015.
We conducted all learning experiments on a high-performance computer cluster, where each cluster node runs a 2.6 GHz Intel Xeon Gold 6132 CPU with an NVIDIA Tesla V100 16G SXM2 GPU. For the data processing, we conducted our experiments on a MacBook Pro laptop with Mac OS Big Sur 11.6, an Intel Core i9 CPU, and 64 GB RAM. Overall, our experiments involved 124 subjects, of which 16 were based on natural inputs and 108 on adversarial inputs.

V. EXPERIMENTAL RESULTS
For each research question, we present the experimental objective, design, and results before discussing the findings.

A. RQ1: Performance of NodeRank
Objective: We evaluate the performance of NodeRank in prioritizing test inputs for GNNs. To that end, we compare NodeRank against six existing test prioritization approaches. Experimental design: We use our initial subjects (4 datasets and 4 GNN models, leading to 16 combinations of natural inputs, i.e., without any adversarial attacks introduced). Moreover, we compare NodeRank with six test prioritization approaches, which include one test prioritization method for GNNs (GraphPrior), four test prioritization methods for traditional DNNs (i.e., DeepGini, VanillaSM, PCS, and Entropy), and a baseline method (random selection). Specific details about these compared methods can be found in Section IV-C. All subjects are applied to NodeRank as well as the six compared approaches. Beyond effectiveness, we also investigated the efficiency of NodeRank by analyzing the time cost of each step involved in its execution. Furthermore, due to the randomness in the GNN model training process, we conducted a statistical analysis to ensure the stability of our findings. Following prior work [70], we repeated all the experiments 30 times. The following results are the averages obtained from the 30 repeated experiments.
To demonstrate the statistical significance of the improvement of NodeRank relative to the compared test prioritization approaches, we utilized the Mann-Whitney U test [71] to compute the p-value of the repeated experimental results. The Mann-Whitney U test is a statistical method used to determine whether there is a notable distinction between two sets of data. Furthermore, in addition to showcasing the average experimental results, we also evaluate the variability of these results in order to ensure a fairer comparison between the effectiveness of NodeRank and existing test prioritization approaches. The specific steps of these experiments are elucidated below:
• Effectiveness distributions between NodeRank and the compared approaches: As previously mentioned, we conducted 30 repetitions of all experiments. Subsequently, based on the results generated from these 30 repetitions, we used box plots to illustrate the distribution of results for the various test prioritization methods. The rationale behind employing box plots is that: 1) they offer an intuitive representation of data distribution, including key statistics such as the median, quartiles, and outliers, and this visual format enables a quick understanding of data characteristics; and 2) box plots offer a visual tool for easily comparing the distribution of experimental results across various test prioritization approaches: when multiple box plots are displayed side by side, the differences between them can be clearly exhibited.

• Confidence interval between NodeRank and the compared approaches
Based on the results of the 30 repeated experiments, we calculated the confidence interval of each test prioritization approach. Following the existing study [72], we employed Formula 11 to compute the upper and lower bounds of the confidence interval. We calculated the confidence intervals for the different test prioritization methods across two metrics (PFD and APFD) and two scenarios (natural and adversarial datasets):

[X̄ − Z_{α/2} · σ/√n,  X̄ + Z_{α/2} · σ/√n]    (11)
where X represents the average value, σ represents the standard deviation, n represents the sample size, and Z α/2 represents the confidence coefficient.Results: The experimental results of RQ1 are presented in Tables I-VII, Figs. 2 and 3. We highlight the approach with  III.Our analysis demonstrates that NodeRank consistently surpasses GraphPrior, all the confidence-based approaches, and random selection in terms of average PFD, regardless of the proportion of prioritized tests.Furthermore, the effectiveness of NodeRank is visually apparent in Fig. 3.In the figure, NodeRank is represented by the red line, GraphPrior by the blue line, and the baseline method by the pink line.It is evident that NodeRank consistently outperforms GraphPrior,  all the confidence-based approaches, and the baseline.These experimental findings further confirm the high effectiveness of NodeRank.Table VI presents the results of statistical analysis.We use the Mann-Whitney U test [71] as the metric to calculate the p-value of the experimental results.Our objective is to demonstrate that the improvement of NodeRank over other testing methods is statistically significant.Within Table VI, we see that the range of p-values is from 7.7245 × 10 −7 to 0.0092.These values are all less than 0.05, indicating that the improvement of NodeRank compared to other test prioritization methods is statistically significant.
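Formula 11 is straightforward to apply. The following sketch computes the interval bounds for a set of repeated APFD results; the values and the 95% level (i.e., Z_{α/2} = 1.96) are illustrative assumptions, since the paper does not restate the confidence level here:

```python
import math
import statistics

def confidence_interval(samples, z=1.96):
    """Formula 11: X-bar +/- Z_{alpha/2} * sigma / sqrt(n).

    z = 1.96 corresponds to a 95% confidence level; this value is an
    assumption for illustration.
    """
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    half_width = z * sigma / math.sqrt(len(samples))
    return mean - half_width, mean + half_width

# Illustrative APFD values from repeated runs (not the paper's data)
apfd = [0.78, 0.79, 0.77, 0.80, 0.78]
lower, upper = confidence_interval(apfd)
```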
Moreover, Fig. 2 presents and compares the effectiveness (in terms of APFD) of NodeRank with that of other test prioritization methods using box plots. The box plots highlight the distribution of results over multiple repeated experiments for NodeRank and the other test prioritization methods. In Fig. 2, we see that the median APFD value of NodeRank exceeds that of the other test prioritization methods across all natural datasets. Moreover, in the two specific examples shown in the box plots, which correspond to subject (CiteSeer, TAGCN) and subject (LastFM, GCN), respectively, the median of NodeRank across the repeated experiments is also the highest.
Regarding the interquartile range, NodeRank's interquartile range (i.e., the height of the box) exhibits some variation across datasets, but overall, its upper quartile is higher than that of the other methods. Analyzing outliers, the box plots do not show significant outliers, indicating that NodeRank's performance across different datasets is relatively stable, with no extreme cases of inefficiency. In summary, based on the distribution of results from multiple repeated experiments, we conclude that NodeRank outperforms all compared test prioritization methods in terms of APFD, demonstrating that NodeRank is more effective at test prioritization than the other methods.
Furthermore, we calculated the confidence intervals for all test prioritization methods; the results are presented in Tables IV and V. In Table IV, we see that NodeRank's APFD has the highest lower and upper bounds among all test prioritization methods, with values of 0.7798 and 0.7867, respectively. Notably, NodeRank's lower bound (0.7798) even exceeds the upper bounds of all other comparative methods. GraphPrior's upper bound is 0.7536, while the upper bounds of the other test prioritization methods range from 0.4966 to 0.7109. Table V exhibits the confidence intervals of all test prioritization methods in terms of PFD; the gray highlights indicate the approaches that achieve the maximum PFD in each scenario. In Table V, we see that NodeRank also demonstrates the highest lower and upper bounds in terms of PFD when prioritizing different ratios of tests. These experimental findings highlight that, in terms of confidence intervals, the effectiveness of NodeRank exceeds that of the comparative test prioritization methods.
In addition to its effectiveness, we also analyze NodeRank's efficiency in Table VII. We offer a comprehensive breakdown of the time taken by each step of NodeRank and compare it with GraphPrior, the confidence-based test prioritization methods, and the baseline approach (random selection). As shown in Table VII, the time required by NodeRank is divided into four parts: mutant generation, feature extraction, ranking model training, and NodeRank prediction. Among these steps, mutant generation is the most time-consuming, taking approximately 35 minutes, followed by ranking model training, which takes approximately 3 minutes. Overall, NodeRank requires a total of approximately 38 minutes. However, it is worth noting that the prediction step of NodeRank is extremely fast, taking less than 1 s once the ranking model is trained and the mutation features are extracted. The overall runtime of GraphPrior is similar to that of NodeRank, approximately 38 minutes. In contrast, the confidence-based test prioritization methods have an overall runtime of less than 1 second. While NodeRank is less efficient than the uncertainty-based test prioritization approaches (which take less than 1 s), its time cost remains acceptable compared to the prohibitively expensive manual labeling.
Answer to RQ1: On natural test inputs, NodeRank consistently exhibits better effectiveness than GraphPrior, all confidence-based approaches, and the baseline method across all subjects, as evidenced by both the APFD and PFD metrics.
In terms of APFD, NodeRank achieves an average improvement of between 4.41% and 58.11% over the compared approaches. Additionally, the efficiency of NodeRank is within an acceptable range, demonstrating its practical usefulness.

B. RQ2: Prioritization of Adversarial Inputs
Objective: We evaluate the effectiveness of NodeRank on adversarial test inputs. We assume that, among natural test inputs (cf. RQ1), it is relatively easy to discriminate which ones are more likely to reveal bugs. In contrast, adversarial inputs are, by construction, all generated to make the probability of a wrong classification label as high as possible. Thus, a test prioritization approach may be challenged to rank adversarial inputs adequately. Yet, such prioritization is still necessary to ensure a fast assessment of GNN model robustness. Experimental design: To investigate the effectiveness of NodeRank on adversarial datasets, we generated adversarial test inputs using eight graph adversarial attack methods [31], [33], [34]. We set the attack level to 0.3, which indicates that 30% of the test inputs in the test set are adversarial. It is worth noting that a high attack level, such as 90%, would result in a significant proportion of adversarial test inputs; under such circumstances, any prioritization method could select a large number of bug-revealing cases, making it difficult to effectively demonstrate the efficacy of NodeRank. Thus, to ensure a proper evaluation of NodeRank and the compared approaches, we selected a reasonable attack level (i.e., 0.3), which effectively limits the proportion of adversarial test inputs.
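The attack-level construction can be sketched as follows; the function and data below are hypothetical and only illustrate mixing 30% adversarial inputs into a test set:

```python
import random

def build_attacked_test_set(natural_inputs, adversarial_pool, attack_level=0.3, seed=0):
    """Replace a fraction `attack_level` of the test set with adversarial inputs.

    Illustrative sketch: an attack level of 0.3 means 30% of the final
    test inputs are adversarial. Names and representation are hypothetical.
    """
    rng = random.Random(seed)
    n_adv = int(attack_level * len(natural_inputs))
    kept = rng.sample(natural_inputs, len(natural_inputs) - n_adv)
    chosen_adv = rng.sample(adversarial_pool, n_adv)
    test_set = kept + chosen_adv
    rng.shuffle(test_set)
    return test_set

# 100 natural inputs (ints) and a pool of adversarial inputs (strings)
mixed = build_attacked_test_set(list(range(100)), [f"adv_{i}" for i in range(100)])
```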
Eventually, we construct 108 subjects (i.e., combinations of a GNN model and an adversarial input set). Consistent with the experimental design employed in RQ1, we evaluate the prioritization effectiveness of NodeRank and the compared approaches using both the APFD and PFD metrics. Similar to RQ1, we conducted 30 repetitions of all experiments and report the average outcomes. Aside from presenting the average experimental findings, we assessed the variability of these results to ensure a fairer comparison between the effectiveness of NodeRank and existing test prioritization methods. Detailed steps for these experiments can be found in the experimental design of RQ1 (refer to Section V-A). Results: The experimental results of RQ2 are presented in Tables VIII-XIV and Figs. 4 and 5. Table VIII presents the APFD scores of NodeRank and the compared approaches on DICE-based graph adversarial inputs. Again, NodeRank performs the best across all subjects. Experiment results on all the subjects are available on our GitHub. Table IX presents the average APFD values for NodeRank and the compared approaches, as well as the average improvement of NodeRank over the compared approaches across different adversarial attacks. We can see that, across all cases, NodeRank consistently outperforms GraphPrior, the confidence-based approaches, and random selection. Specifically, NodeRank achieves an average APFD ranging from 0.7570 to 0.8041, whereas GraphPrior averages between 0.7046 and 0.7625. The remaining test prioritization methods show APFD values ranging from 0.4922 to 0.7337. In terms of improvement over GraphPrior, NodeRank demonstrates an average improvement ranging from 4.69% to 8.78%. NodeRank's improvement over the other test prioritization methods varies from 6.72% to 62.15%.
As in RQ1, statistical analysis confirms that the improvement of NodeRank in comparison to other test prioritization methods is statistically significant. Table X presents the confidence intervals for all test prioritization methods with respect to the APFD metric. We see that NodeRank's APFD has the highest lower and upper bounds compared to the other test prioritization methods; specifically, the lower bound is 0.7686, and the upper bound is 0.7749. These experimental results underscore that, in terms of APFD and considering confidence intervals, NodeRank demonstrates better effectiveness than the other test prioritization methods.
In addition to the APFD metric, we also computed the PFD of NodeRank and the compared approaches under adversarial attack scenarios; the results are presented in Table XI and Fig. 4. As shown in Table XI, NodeRank outperformed the compared approaches in terms of PFD values for all attacks and any prioritization ratio of test inputs. Notably, NodeRank detected more than 90% of the bugs when approximately 50% of the test inputs were prioritized.
Furthermore, Fig. 4 offers two visual examples for assessing the effectiveness of NodeRank compared to the other approaches on the CiteSeer and LastFM datasets. In the figure, NodeRank is represented by a red line, GraphPrior by a blue line, and the baseline method by a pink line. We see that NodeRank consistently outperforms GraphPrior, as well as all confidence-based approaches and the baseline method. These experimental results demonstrate that the effectiveness of NodeRank exceeds that of all compared approaches under adversarial attack scenarios, indicating its efficacy in detecting bugs in adversarial datasets.
The box plots in Fig. 5 illustrate NodeRank's effectiveness (in terms of APFD) compared to other test prioritization methods on adversarial datasets. They present the distribution of results from multiple repeated experiments for both NodeRank and the compared approaches. In Fig. 5, we see that, across all adversarial datasets, NodeRank's median effectiveness, as indicated by the median line within the box, surpasses that of the other methods.
Regarding the interquartile range, NodeRank's interquartile range (i.e., the height of the box) exhibits some variation across datasets, but overall, its upper quartile is higher than that of the other methods. This difference is particularly noticeable in the subjects "RAR, Cora, GAT" and "NEAA, Cora, GCN". In terms of outliers, the box plots do not show significant outliers, indicating that NodeRank's performance across different datasets is relatively stable. Based on the above experimental results, we conclude that NodeRank outperforms all compared test prioritization methods in terms of APFD based on the distribution of data from multiple experimental results, demonstrating that NodeRank exhibits higher effectiveness in test prioritization than the other methods on adversarial datasets. From the perspective of confidence intervals, NodeRank likewise shows higher effectiveness than the other test prioritization methods when prioritizing different ratios of tests.
Table XIV displays the confidence intervals in terms of APFD. In Table XIV, NodeRank's APFD shows the highest lower and upper bounds compared to the other test prioritization methods, with values of 0.7760 and 0.7783, respectively. Remarkably, NodeRank's lower bound (0.7760) even surpasses the upper bounds of all other comparative methods. GraphPrior's upper bound is 0.7344, while the upper bounds of the other test prioritization methods range from 0.5022 to 0.6987.
Answer to RQ2: On adversarial test inputs, NodeRank consistently demonstrates better effectiveness than GraphPrior, all confidence-based approaches, and the baseline method across all subjects in terms of both the APFD and PFD metrics. Regarding APFD, NodeRank exhibits an average improvement of between 4.96% and 62.15% over the compared methods.

C. RQ3: Influence of Ensemble Learning Methods
Objective. We investigate the impact of ensemble learning strategies on NodeRank's effectiveness in the learning-to-rank process. Experimental design. We employ NodeRank and its variants, namely NodeRank V and NodeRank S (cf. Section IV-F for details), to prioritize test inputs in both natural and adversarial scenarios, and evaluate their effectiveness in terms of APFD. These variants differ in the ensemble learning strategies used in the learning-to-rank process. Results. Table XV presents the average effectiveness of NodeRank and its variants, along with several compared approaches, on both natural and adversarial datasets. The upper part shows the average effectiveness under different models, while the bottom part shows the average effectiveness across different datasets. From Table XV, we observe that the average effectiveness of NodeRank and its variants outperforms that of all the compared approaches (i.e., GraphPrior, the confidence-based approaches, and random selection) in every case. Additionally, the effectiveness of NodeRank is better than that of its variants. Across different GNN models, NodeRank performs the best in 100% of the cases on natural data; on adversarial data, NodeRank also performs the best in 100% of the cases. From the perspective of datasets, on natural data, NodeRank performs better than all the variants in every case; on adversarial data, NodeRank has the highest average effectiveness across all adversarial datasets. Overall, the final average effectiveness of NodeRank is 0.7833 and 0.7772 on natural and adversarial datasets, respectively. These experimental results demonstrate that the sum-based ensemble learning strategy used in NodeRank is more suitable for test prioritization.
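The sum-based combination the paper credits can be illustrated with a minimal sketch; the base-model scores below are invented, whereas NodeRank's actual base ranking models are learned:

```python
def ensemble_scores(base_model_scores):
    """Sum-based ensemble: combine per-input misclassification scores
    from several base ranking models by summing them (an averaging
    variant would only rescale the ranking, not change it)."""
    n_inputs = len(base_model_scores[0])
    return [sum(model[i] for model in base_model_scores) for i in range(n_inputs)]

def prioritize(test_ids, combined):
    # Sort test inputs by misclassification score in descending order
    return [t for t, _ in sorted(zip(test_ids, combined),
                                 key=lambda pair: pair[1], reverse=True)]

# Three hypothetical base ranking models scoring four test inputs
scores = [[0.2, 0.9, 0.4, 0.1],
          [0.3, 0.8, 0.5, 0.2],
          [0.1, 0.7, 0.6, 0.3]]
order = prioritize(["t0", "t1", "t2", "t3"], ensemble_scores(scores))
```

With these illustrative scores, "t1" accumulates the highest combined score and is prioritized first.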
Answer to RQ3: On both natural and adversarial datasets, NodeRank offers better effectiveness, in terms of APFD, than its variants. We also note that every variant of NodeRank outperforms all the compared approaches in GNN test prioritization.

D. RQ4: Ablation Study of Mutation Operators
Objective: We investigate the effect of each category of mutation operators (i.e., GSM, NFM, and GMM). To this end, we analyze the contributions of the features generated by each type of mutation operator and conduct corresponding ablation studies. We proceed as proposed by Meyes et al. [53]: we measure the impact of a component on an ML system by removing or replacing this component and observing whether the performance of the ML system is affected. The objective is not to comprehensively check which feature-set combinations provide good performance but rather to check that each set contributes to the performance. Experimental design: We consider the node feature mutation (NFM) features to be a key component of the NodeRank approach. The graph structure mutation (GSM) features, which are obtained from the dataset, are considered the next most important feature set. Finally, the graph model mutation (GMM) features are considered the first that can be removed in the ablation study, following the process in [53]. The experimental steps for checking the contributions of each subset of mutation features to the performance of NodeRank are thus as follows: 1) We compute the test prioritization performance of NodeRank when all mutation features are used. 2) We compute the test prioritization performance of a variant of NodeRank whose ranking model is learned with vectors that do not consider the GMM features. 3) We compute the test prioritization performance of a variant of NodeRank whose ranking model is learned only with NFM feature vectors (i.e., by removing the GMM and GSM features). 4) Finally, we also consider the case where no features are used; NodeRank therefore does not implement ensemble learning to rank, and we instead consider a random ranking approach to prioritize the test set. Note that we do not attempt to perform experiments that compare the value of the different feature sets. Indeed, the mutation space of GNNs is complex, and mutations of different types can produce
feature vectors of various sizes, which may implicitly impact the learning performance, making any performance comparison biased or uninformative. Results: The results of the ablation experiment are reported in Table XVI. As expected, the Random prioritization approach,
which employs no mutation features for learning to rank, performs the worst in terms of APFD. In contrast, the NodeRank approach that learns to rank by incorporating all three mutation rule sets (pertaining to nodes, graph structure, and the graph model) exhibits the highest performance. Remarkably, the exclusion of the graph model mutation features leads to a decline in learning performance of approximately 17.84% and 14.90% in terms of APFD on natural and adversarial datasets, respectively. On the other hand, employing only the node mutation features yields a significant improvement over Random prioritization, with a performance gain of approximately 21.98% and 18.75% in terms of APFD on natural and adversarial datasets, respectively.
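The ablation variants differ only in which feature subsets are concatenated into the vectors fed to the ranking model. A minimal sketch with invented binary mutation features (e.g., kill/disagreement indicators):

```python
def build_feature_vector(nfm, gsm, gmm, use_gsm=True, use_gmm=True):
    """Assemble a mutation feature vector from the feature subsets kept
    in a given ablation variant (full / no-GMM / NFM-only)."""
    vec = list(nfm)
    if use_gsm:
        vec += list(gsm)
    if use_gmm:
        vec += list(gmm)
    return vec

# Invented per-input mutation features for one test input
nfm, gsm, gmm = [1, 0, 1], [0, 1], [1, 1, 0, 1]
full = build_feature_vector(nfm, gsm, gmm)                                    # all features
no_gmm = build_feature_vector(nfm, gsm, gmm, use_gmm=False)                   # GMM removed
nfm_only = build_feature_vector(nfm, gsm, gmm, use_gsm=False, use_gmm=False)  # NFM only
```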
Moreover, by comparing against the performance of the uncertainty-based DNN test prioritization approaches and GraphPrior, we note that the combination of the three categories of mutation features is necessary to achieve state-of-the-art performance in GNN test prioritization.

Answer to RQ4:
The design choice in NodeRank to include all three types of mutation operators was effective. Indeed, although the node mutation operator alone enables NodeRank to outperform random prioritization, it is the combination of the NFM, GSM, and GMM operators that together leads to the SOTA performance of NodeRank.

E. RQ5: Investigating the Contributions of Model Mutation Rules on NodeRank Effectiveness
Objective: In this research question, our aim is to demonstrate that the model mutation rules of NodeRank actually contribute to its effectiveness. In the original NodeRank, we utilize the killing approaches of traditional mutation analysis for DNNs. This killing process is used to generate the model mutation features of a given test input. The features are then utilized to predict the misclassification probability of this input. In this
process, the model mutation features generated by killing may contain information resulting from both the model mutation rules and randomness in model training, each of which may contribute to the effectiveness of NodeRank. In this research question, by utilizing the killing approach of DeepCrime [21], which considers the training randomness in the killing process, we aim to demonstrate that the model mutation rules actually contribute to the effectiveness of NodeRank. Experimental design: To demonstrate the aforementioned objective, we designed three variants of NodeRank: 1) NodeRank DeepCrime, a variant that utilizes DeepCrime's killing method to mitigate the influence of randomness when generating model mutation features. DeepCrime's killing approach takes into account the training randomness of the mutated model. Specifically, this killing approach requires repeating the training process n times for both the original model N = N_1, ..., N_n and its mutated model M = M_1, ..., M_n. A test is considered killed if the difference between the outputs of the original and mutated models is statistically significant with a non-negligible and non-small effect size. 2) NodeRank Random, which does not incorporate model mutation rules and solely relies on random generation of model mutation features. 3) NodeRank withoutGMM, which does not utilize model mutation features. We validated whether the model mutation rules contribute to the effectiveness of NodeRank by comparing the effectiveness of these three variants. If NodeRank DeepCrime outperforms both NodeRank Random and NodeRank withoutGMM, we consider that the model mutation rules contribute to the effectiveness of NodeRank.
In the subsequent sections, we first describe the detailed implementation of DeepCrime.Then, we present the details of the variants of NodeRank and how we leverage these variants to demonstrate that model mutation rules contribute to NodeRank's effectiveness.

1) Implementation of DeepCrime
Given an original GNN model N and a test t, the DeepCrime approach follows the following method to determine whether a test is "killed". ❶ For the original GNN model N, we repeated its training process n times, resulting in n GNN models: N_1, ..., N_n.
Similarly, for the mutated model M, we repeated its training process n times, obtaining M_1, ..., M_n. Consistent with previous research [21], we set n = 20 in our experiments. ❷ For N_1, ..., N_n, we used each GNN model to make predictions on the test t and obtained the predicted classification of t from each model. Similarly, for M_1, ..., M_n, we used each mutated model to predict the test t and obtained the predicted classification of t from each model. ❸ Prior work [21] suggests that the mutated model M is considered "killed" if, for the given test t, the difference between the outputs of the original and mutated models, denoted as A_N(t) = A_{N_1}, ..., A_{N_n} and A_M(t) = A_{M_1}, ..., A_{M_n}, is statistically significant with a non-negligible and non-small effect size. Therefore, we measure whether the mutated model M is "killed" using Formula 12:
isKilled(N, M, t) = p_value(A_N(t), A_M(t)) < 0.05 ∧ effectSize(A_N(t), A_M(t)) ≥ β    (12)

In Formula 12, isKilled indicates whether the test t "kills" the mutated model M. A_N(t) represents the series of predictions (outputs) for test t from the models N_1, ..., N_n. Similarly, A_M(t) refers to A_{M_1}, ..., A_{M_n}, representing the set of predictions (outputs) for test t from the models M_1, ..., M_n.
The term "effect size" [73] quantitatively measures the difference between two distributions of results. One commonly used measure of effect size is Cohen's d. This value can be interpreted using the thresholds provided by Cohen [74]: |d| < 0.2 indicates a "negligible" effect, 0.2 ≤ |d| < 0.5 a "small" effect, 0.5 ≤ |d| < 0.8 a "medium" effect, and |d| ≥ 0.8 a "large" effect. The prior study on DNN mutation analysis [21] pointed out that the effect size should be non-small. The effect-size threshold β can have an impact on the effectiveness of the DeepCrime killing method in the context of NodeRank for test prioritization. In our experiments, we set β to 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, and 0.9, covering a wide range of effect sizes. By adopting different effect-size values, we can observe whether, under different effect sizes, the effectiveness of the NodeRank variants using DeepCrime and the mutation rules exceeds that of the variant not using mutation rules but instead randomly generating mutated models, thereby better validating the contributions of the model mutation rules.
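A simplified sketch of this killing criterion is shown below: it combines a pooled-standard-deviation Cohen's d with a two-sample z-test that uses a normal approximation for the p-value. The statistical test and the per-run output encoding (here, per-run output scores such as correctness) are assumptions for illustration; DeepCrime's actual statistical test may differ.

```python
import math

def cohens_d(a, b):
    """Cohen's d with a pooled standard deviation."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled if pooled > 0 else 0.0

def p_value_normal_approx(a, b):
    """Two-sample z-test p-value via a normal approximation (a sketch)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se = math.sqrt(va / na + vb / nb)
    if se == 0:
        return 1.0 if ma == mb else 0.0
    return math.erfc(abs(ma - mb) / se / math.sqrt(2))

def is_killed(outputs_original, outputs_mutant, alpha=0.05, beta=0.5):
    """Formula 12: killed iff the difference is significant AND the
    effect size is at least the threshold beta (non-small)."""
    significant = p_value_normal_approx(outputs_original, outputs_mutant) < alpha
    return significant and abs(cohens_d(outputs_original, outputs_mutant)) >= beta

# Invented per-run output scores for the n repeated trainings
orig_runs = [0.90, 0.91, 0.89, 0.92, 0.90, 0.91, 0.89, 0.90, 0.92, 0.91]
mut_runs = [0.70, 0.72, 0.69, 0.71, 0.70, 0.68, 0.72, 0.71, 0.69, 0.70]
killed = is_killed(orig_runs, mut_runs)
```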

2) Variants of NodeRank
In the following, we explain how we design the variants of NodeRank and how we utilize them to demonstrate the effectiveness of NodeRank's model mutation rules. The results tables show that, in each case, NodeRank DeepCrime performs the best. Furthermore, every variant of NodeRank DeepCrime, regardless of the effect size setting, exhibits higher effectiveness than NodeRank Random.
According to the experimental design mentioned above, if the effectiveness of NodeRank DeepCrime is higher than that of NodeRank Random and NodeRank withoutGMM, we consider that the model mutations (generated by the model mutation rules) in NodeRank DeepCrime contribute to its effectiveness. Therefore, the above experimental results indicate that the model mutation rules of NodeRank indeed contribute to its effectiveness.

Answer to RQ5:
The model mutation rules of NodeRank actually contribute to its effectiveness.

F. RQ6: Influence of Mutation Operator Parameters on NodeRank
Objective: In NodeRank, we designed a set of new mutation operators specifically for GNNs. In this research question, we explore the influence of the parameter ranges of these mutation operators on NodeRank. Experimental design: First, we selected multiple mutation operators with integer/float-type parameters. This choice was made because, for Boolean-type mutation operators, mutations involve toggling between True and False, resulting in only one possible parameter value and rendering parameter changes unfeasible. Following the approach of the existing study [11], for each investigated mutation operator, we systematically varied its parameters multiple times while keeping the parameters of the other mutation operators in their initial states. Subsequently, we recorded NodeRank's effectiveness (measured by APFD) after each parameter change. We used line graphs to visually depict the impact of parameter changes on NodeRank's effectiveness for each mutation operator. Specifically, NodeRank comprises three types of mutation operators: graph structure mutation (GSM), node feature mutation (NFM), and GNN model mutation (GMM). Since these mutation operators aim to introduce subtle modifications to the original test set or the GNN model, we aim to ensure that, after adjusting the parameter ranges, the new parameter values also result in relatively slight changes.
• Graph structure mutation (GSM): GSM includes a mutation operator that slightly changes the structure of the input graph by randomly adding edges. Consequently, the parameter for this mutation operator is the number of added edges. This parameter was set to 1, 2, 3, and 4 to investigate the impact of its range on the effectiveness of NodeRank.
• Node feature mutation (NFM): NFM includes a mutation operator that changes the features of the targeted nodes to adjust their positions in the feature space. Consequently, the parameter for NFM is the node feature offset. This parameter was set to 0.05, 0.10, 0.15, and 0.20 to investigate the impact of its range on the effectiveness of NodeRank.
• GNN model mutation (GMM): GMM comprises multiple mutation operators that make slight changes to the training parameters in NodeRank. We selected mutation operators with integer/float-type parameters and adjusted their ranges. These include "Negative Slope", with parameter values (0.1, 0.2, 0.3, 0.4), and "Hidden Channel", with parameter values (5, 10, 15, 20).
Results: The experimental results of RQ6 are presented in Fig. 6. The experiments are conducted on 16 natural subjects. Fig. 6(a) shows the impact of changing the number-of-edges parameter for the mutation operator targeting the graph structure. Fig. 6(b) illustrates the influence of the node-feature-offset parameter for the mutation operator targeting node features. Fig. 6(c) demonstrates the impact of the negative-slope parameter for the mutation operator targeting the GNN models. Fig. 6(d) displays the influence of the hidden-channel parameter for the mutation operator targeting the GNN model. In these figures, the red line represents NodeRank. First, we see that across all parameter settings of the mutation operators, NodeRank's effectiveness consistently exceeds that of all the comparative test prioritization methods (i.e., GraphPrior, the confidence-based approaches, and random selection). Moreover, we found that NodeRank performs stably when the parameter values of the newly designed mutation operators change. For example, when modifying the "Number of edges" parameter, the APFD values of NodeRank vary within the range of approximately 0.778 to 0.785. Similarly, when adjusting the "Node feature offset" parameter, the APFD values of NodeRank fluctuate between approximately 0.777 and 0.785.
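The GSM edge-addition operator with its number-of-edges parameter can be sketched as follows; the edge-set graph representation and function name are illustrative assumptions, not NodeRank's actual implementation:

```python
import itertools
import random

def add_random_edges(num_nodes, edges, k, seed=0):
    """GSM sketch: mutate a graph by randomly adding k edges that are
    not already present (undirected, no self-loops).

    `k` corresponds to the "Number of edges" parameter varied in RQ6.
    """
    rng = random.Random(seed)
    existing = {tuple(sorted(e)) for e in edges}
    candidates = [e for e in itertools.combinations(range(num_nodes), 2)
                  if e not in existing]
    return existing | set(rng.sample(candidates, k))

# A small illustrative graph with 5 nodes and 3 edges; add k=2 edges
original = {(0, 1), (1, 2), (2, 3)}
mutant = add_random_edges(5, original, k=2)
```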
Answer to RQ6: Across all parameter settings of the newly designed mutation operators, NodeRank's effectiveness consistently exceeds that of the other comparative test prioritization methods. Moreover, the effectiveness of NodeRank remains stable when the parameter values change.

A. Generality of NodeRank
Our proposed NodeRank and its variants perform test prioritization for GNNs via ensemble-learning-based mutation analysis. The evaluation on 124 subjects demonstrates their effectiveness on both natural and adversarial datasets. The scheme of NodeRank (i.e., slightly changing graph inputs and graph models) can also be generalized to edge-level and graph-level GNN tasks. In the future, we will carefully design relevant mutation rules to further adapt NodeRank to other GNN tasks.
Additionally, we discuss the potential applicability of NodeRank to regression tasks. Currently, the mutation rules and ranking models of NodeRank are designed explicitly for classification tasks; to extend NodeRank to regression tasks, modifications to the model mutation rules and ranking models would be required. If appropriate model mutation rules can be identified for regression tasks and suitable ranking models can be designed, NodeRank could also be a promising approach for regression tasks.

B. Challenges of NodeRank
NodeRank requires a sufficiently large training set to train its internal ranking model.This training set includes labels (i.e., samples that the model predicts incorrectly are labeled as 1, while correctly predicted samples are labeled as 0).If the original model has very high accuracy, it can result in very few training samples labeled as 1, potentially leading to an imbalanced dataset during the training of NodeRank's ranking model.
An imbalanced dataset can cause a decrease in performance when dealing with samples labeled as 1, as there are not enough examples to learn how to rank these samples correctly.
For example, consider a scenario involving bank transfer transactions, where each account represents a node and edges represent transfer transactions between accounts; GNN models can be used to identify fraudulent accounts (i.e., whether a node is a fraudulent account or not). If the GNN model has very high accuracy (few nodes predicted incorrectly), there will be very few samples labeled as 1 in the NodeRank training set. This directly affects the training of the ranking model in NodeRank, and under these conditions, the effectiveness of NodeRank in prioritizing misclassified accounts will be reduced.
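The labelling scheme and the imbalance it produces for a highly accurate model can be illustrated as follows (the fraud-detection numbers are invented):

```python
def ranking_labels(predictions, ground_truth):
    """Label each test input 1 if the model misclassifies it, else 0
    (the labelling scheme used to train NodeRank's ranking model)."""
    return [int(p != g) for p, g in zip(predictions, ground_truth)]

def imbalance_ratio(labels):
    """Fraction of misclassified (label-1) samples; a very accurate
    model yields a small ratio and hence an imbalanced training set."""
    return sum(labels) / len(labels)

# A hypothetical highly accurate fraud-detection GNN: 2 mistakes in 100 nodes
preds = [0] * 98 + [1, 1]
truth = [0] * 100
labels = ranking_labels(preds, truth)
ratio = imbalance_ratio(labels)  # 0.02 -> few positive examples to learn from
```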

C. Differences in Approaches for NodeRank
In this section, we discuss NodeRank from three perspectives: the differences in how we evaluate NodeRank, the differences between NodeRank and its variants, and the different NodeRank configurations with different types of features.
[Differences in evaluating NodeRank methods] In addition to evaluating NodeRank on natural datasets, we assess its effectiveness from three different perspectives, as presented in RQ2 through RQ4, because these perspectives cover key aspects and contribute to a comprehensive understanding of NodeRank's performance. In RQ2, we assess the efficacy of NodeRank when confronted with adversarial test inputs. In RQ3, we explore how ensemble learning strategies influence NodeRank's effectiveness within the context of learning-to-rank. In RQ4, we examine the individual contributions of each category of mutation features (GSM, NFM, and GMM) that are generated for NodeRank's learning-to-rank model.
[Differences between NodeRank and its variants] In RQ3, we propose several variants of NodeRank that differ in the ensemble learning strategies used to combine the base ranking models. Apart from this distinction, the workflows of the NodeRank variants remain identical to that of NodeRank.
[Different NodeRank approaches with different types of features] In RQ4, we design different NodeRank approaches, which apply different types of mutation features for test prioritization. Specifically, NodeRank-NFM only applies the NFM features. NodeRank-NFM+GSM applies both the NFM and GSM features. Our aim is to investigate the contributions of each feature type to the effectiveness of NodeRank.
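As an illustrative sketch of how such ablated variants relate to each other (the function name and data layout are hypothetical, not NodeRank's actual implementation), each variant can be seen as selecting which feature categories are concatenated before ranking:

```python
def build_feature_vector(nfm, gsm, gmm, use=("NFM", "GSM", "GMM")):
    """Concatenate only the selected categories of mutation features
    for one test input. nfm, gsm, and gmm are lists of floats.
    (Hypothetical helper for illustration only.)"""
    parts = {"NFM": nfm, "GSM": gsm, "GMM": gmm}
    vector = []
    for name in ("NFM", "GSM", "GMM"):  # fixed order keeps dimensions stable
        if name in use:
            vector.extend(parts[name])
    return vector

# NodeRank-NFM uses only NFM features; NodeRank-NFM+GSM adds the GSM ones.
v_nfm = build_feature_vector([0.1, 0.9], [0.5], [0.3], use=("NFM",))
v_nfm_gsm = build_feature_vector([0.1, 0.9], [0.5], [0.3], use=("NFM", "GSM"))
```

Keeping a fixed concatenation order ensures that the same feature category always occupies the same positions of the vector across all test inputs, which the downstream ranking model requires.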

D. Threats to Validity
Threats to Internal Validity. The internal threat to validity mainly exists in the implementation of NodeRank, its variants, and the compared approaches. To reduce this threat, we implemented all approaches based on the widely used library PyTorch. Concerning the compared test prioritization approaches, we adopted the implementations released by the authors. Another internal threat lies in the randomness of the model training process. To mitigate this threat, we conducted a statistical analysis involving ten repetitions of the model training process for both the original and mutated models. We then used these results to calculate the statistical significance of the experiments. The selection of the mutation rules used in our study represents another potential threat to internal validity. Despite our best efforts to identify model mutation rules, it is possible that there are other unknown training parameters that could serve as mutation rules. To mitigate this potential threat, we deliberately chose model mutation rules that could directly or indirectly impact node interdependence in the prediction process.

Threats to External Validity. The external threats to validity mainly stem from the selection of the graph datasets as well as the GNN models adopted for our study. This threat is mitigated by the diversity of the subjects, as well as by the fact that we assess not only natural inputs but also adversarial inputs.

Threats to Construct Validity. Our mutation rules are similar to the attacks used under graph adversarial settings. This may, in theory, create a bias in the experimental results related to adversarial test input prioritization. However, this threat is mitigated by two elements: first, we also apply NodeRank to natural inputs; second, the objective of the mutation is ultimately to generate features for learning to rank the initial inputs, not to generate new samples that become part of the test suite.

A. Test Prioritization Techniques
Test prioritization focuses on finding the ideal ordering of tests to detect more bugs within a limited time budget. In traditional software testing, a variety of approaches [14], [75], [76], [77], [78], [79] have been proposed. Mutation analysis has also been explored for test prioritization: Shin et al. [51] use a diversity-aware mutation adequacy criterion and demonstrate its effectiveness on large-scale developer-written test cases. Papadakis et al. [80] proposed mutating Combinatorial Interaction Testing models for test prioritization. Gökçe et al. [81] introduced a prioritized testing approach aimed at enhancing the testing capacity of ESG-based testing algorithms. ESG-based algorithms, as discussed by Belli et al. [82], focus on generating software test suites that meet specific criteria related to both coverage and execution cost. Gökçe et al.'s approach leverages adaptive competitive learning algorithms for training the neural networks utilized in this process. The core objective of their work is to improve the test capacity of existing algorithms by prioritizing the testing process. Gökçe et al. [83] also introduced a model-based approach to test prioritization. Their method focuses on providing an effective algorithm for ordering test cases based on the perceived degree of preference by the tester. Unlike code-based approaches, which rely on prior knowledge such as fault counts or source code, their approach does not require prior knowledge about the system under test (SUT), making it suitable for a wide range of testing scenarios.
For DNNs, Feng et al. [10] proposed DeepGini, which prioritizes test inputs based on model uncertainty: a test input is more likely to be incorrectly predicted if the DNN model outputs similar probabilities for the different classes. PRIMA [11] is currently the state-of-the-art DNN test prioritization approach. It is based on intelligent mutation analysis guided by learning-to-rank. NodeRank shares similarities with PRIMA in the use of mutation analysis. Unfortunately, PRIMA's mutation rules are not applicable to GNNs and their inputs. Our work is thus the first approach that specifically leverages mutation testing adapted to GNNs in order to achieve test input prioritization.
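DeepGini's uncertainty criterion mentioned above has a simple closed form: for a softmax output p, the score is 1 − Σᵢ pᵢ². A minimal pure-Python sketch (simplified for illustration; the original tool of course operates on batched tensors):

```python
def deepgini_score(probs):
    """Gini impurity of a model's softmax output: higher values mean
    the probabilities are spread more evenly, i.e., the model is less
    certain, so DeepGini ranks the input earlier."""
    return 1.0 - sum(p * p for p in probs)

# A near-uniform output is ranked before a confident one.
uncertain = deepgini_score([0.35, 0.33, 0.32])
confident = deepgini_score([0.98, 0.01, 0.01])
```

Sorting test inputs by this score in descending order yields DeepGini's prioritization; note that, unlike NodeRank, the score depends only on the model's own output probabilities, not on any mutation analysis.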

B. Mutation Testing
Mutation testing is commonplace in traditional software engineering [51], [80], where it constitutes a widely validated way to assess the quality of test cases. Mutation rules for traditional software have therefore been iteratively refined in the community. Recent studies have extended the applicability of mutation testing to various domains by adapting new mutation rules. Beyond simple bugs, Loise et al. [23] proposed 15 security-aware mutant operators to improve security testing. Beyond plain Java code, Deng et al. [84], [85] proposed novel mutant operators that are specifically designed to test Android applications (e.g., with event handling and activity lifecycle mutant operators).
Furthermore, in addition to the context of traditional software, several studies have investigated the application of mutation testing to DNNs and have proposed different mutation operators and frameworks. For instance, Ma et al. [86] proposed DeepMutation, a method to assess the quality of test data for DL systems using mutation testing. To achieve this, they designed a collection of source-level and model-level mutation operators to inject faults into the training data, programs, and DL models. The effectiveness of the test data is evaluated by analyzing the extent to which the injected faults can be detected. Later, Hu et al. [87] extended their work into a mutation testing tool for DL systems named DeepMutation++. This tool introduced new mutation operators for feed-forward neural networks (FNNs) and Recurrent Neural Networks (RNNs) and enabled the mutation of run-time states of an RNN. Another notable contribution is DeepCrime [21], a mutation testing tool that implements a set of DL mutation operators based on real DL faults. Shen et al. [88] proposed MuNN, a mutation analysis method for neural networks. MuNN defined five mutation operators based on the characteristics of neural networks.

C. Deep Neural Network Testing
To improve the test efficiency of DNNs, existing studies [9], [10], [11], [87], [89], [90], [91], [92] have proposed several approaches to optimize the testing process, which are mainly divided into two categories. The first is test input prioritization, which has been elaborated in the section above. The second is test selection, which focuses on selecting a small group of test inputs to precisely estimate the accuracy of the whole test set, thereby reducing labelling costs. Li et al. [9] proposed Cross Entropy-based Sampling (CES) to select representative test inputs for DNN accuracy estimation; it minimizes the cross-entropy between the selected set and the entire test set to ensure the distribution of the selected test set is similar to that of the original test set. Chen et al. [89] proposed PACE for test selection and accuracy estimation. PACE clusters all the inputs in a test set into different groups and leverages the MMD-critic algorithm [93] to select prototypes from each group. In addition to improving DNN testing efficiency, existing studies [15], [86], [94], [95], [96] have also focused on measuring DNN testing adequacy. Pei et al. [95] proposed neuron coverage to assess the extent to which a test set covers the DNN model logic. Ma et al. [96] proposed DeepGauge, a set of coverage-based metrics that consider neuron coverage a good indicator to evaluate the adequacy of test inputs. Kim et al. [15] proposed surprise adequacy, which assesses the adequacy of test inputs by measuring their surprise with respect to the training set.

VIII. CONCLUSION
To relieve the labelling-cost problem and improve the efficiency of GNN testing, we propose a novel test prioritization approach, NodeRank, which prioritizes test inputs that are more likely to be misclassified by the evaluated GNN model. NodeRank fills a gap in the literature: prioritization approaches that achieve state-of-the-art performance on DNNs are not suitable for GNNs since they ignore the interdependence between test inputs in graph-structured datasets. NodeRank leverages the concepts of mutation testing to perform test prioritization, with the aim of reducing the labelling cost in the process of evaluating a GNN model. Overall, NodeRank is a test prioritization approach that is model-based, input-based, and mutation-testing-based. It applies mutation operations to both GNN models and test inputs to generate mutation features for each test input, facilitating test prioritization. The core idea is that if a test input (node) can kill many mutated models and produce different prediction results with many mutated inputs, this input is considered more likely to be misclassified by the GNN model and should be prioritized higher. The process of NodeRank consists of two core steps: (1) NodeRank introduces three types of mutation rules to generate mutants from the perspectives of the graph structure, node features, and the GNN model, respectively. (2) After obtaining the mutation results, NodeRank generates mutation feature vectors and utilizes ensemble ranking models for test prioritization. Experimental results on 124 diverse subjects, considering natural and adversarial inputs, demonstrate the effectiveness of NodeRank. More specifically, NodeRank outperformed all the compared test prioritization approaches with an average improvement between 4.41% and 62.15%. Moreover, ablation experiments confirm that all the different types of mutation features contribute to the effectiveness of NodeRank.

Fig. 4. Test prioritization effectiveness among NodeRank and the compared approaches for CiteSeer with GCN attacked by MMA and LastFM with GraphSAGE attacked by PGD. X-axis: the percentage of prioritized tests; Y-axis: the percentage of detected misclassified tests.

Fig. 5. Effectiveness distributions between NodeRank and the compared approaches on adversarial test inputs.

• Loss Calculation: This process calculates the loss value based on the GNN's output and the true labels.
• Backpropagation: This process computes the gradients of the loss function with respect to each parameter.
• Parameter Weight Updates: This process updates each parameter weight based on the gradient values.
• Iterate: This process repeats the above steps (forward propagation, loss calculation, backpropagation, parameter updates) until a stopping condition is met, such as reaching a predefined number of iterations.
It is important to note that graphs used in GNN training differ from normal graphs. Specifically, GNN training graphs include feature attributes for nodes. Furthermore, in tasks like node classification, nodes in the graphs have category labels. In contrast, normal graphs usually comprise only the topological structure of nodes and edges, without specific labels or node attributes.
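The training steps above can be sketched in miniature. As a simplifying assumption, a one-parameter-pair linear model y = w·x + b stands in for the GNN's forward pass, and the gradients are written out by hand; real GNN training would instead rely on an automatic-differentiation framework such as PyTorch.

```python
def train(xs, ys, lr=0.1, epochs=500):
    """Minimal gradient-descent sketch of the training loop described
    above, with a linear model as a stand-in for a GNN."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):                                        # Iterate
        preds = [w * x + b for x in xs]                            # Forward propagation
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n    # Loss calculation (MSE)
        # Backpropagation: gradients of the loss w.r.t. w and b
        dw = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        db = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        w -= lr * dw                                               # Parameter weight updates
        b -= lr * db
    return w, b, loss

# Fit y = 2x + 1; the loop converges close to w = 2, b = 1.
w, b, loss = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

Here the stopping condition is a fixed number of iterations; in practice, early stopping on a validation loss is also common.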

• It is important to note that the meaning of the dotted squares in Step 1 and Step 2 is different. However, since they both represent matrices, we use dotted squares with different colors to distinguish them. In Step 1, the dotted square represents the node feature vector, while in Step 2, the dotted square represents mutation features generated from the mutation results.
• We utilize arrows to illustrate processes and operations. For instance, the graph data undergoes graph structure mutation within Step 1 and feature generation in Step 2, resulting in graph input mutation features. Another example involves the node feature vector, which undergoes node feature mutation in Step 1 and feature generation in Step 2 to yield node mutation features.
• The chromosome symbols represent mutation results.
Model mutation (GMM): Given a test input t, a GNN model M, and its mutants {M_1, M_2, ..., M_n}, we associate with t a vector V of size n (i.e., the number of mutants of M), where V[k] maps to the execution output for the mutant M_k with test input t. The input t kills the mutated model M_k if the prediction of t via the original model M and via the mutated model M_k differ. For input mutants, M(t_k) represents the prediction of the GNN model M for mutant t_k, and M(t) represents the prediction for the original test input t.
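These definitions can be sketched directly (a simplified illustration of the kill vector V and the input-mutation disagreements, not NodeRank's full feature generation):

```python
def kill_vector(t_pred_orig, t_preds_mutants):
    """V[k] = 1 if test input t kills mutant M_k, i.e., the prediction
    of t via M_k differs from its prediction via the original model M."""
    return [int(p != t_pred_orig) for p in t_preds_mutants]

def input_disagreements(m_pred_orig, m_preds_on_mutated_inputs):
    """For input mutation: 1 if M(t_k), the prediction for mutated input
    t_k, differs from M(t), the prediction for the original input t."""
    return [int(p != m_pred_orig) for p in m_preds_on_mutated_inputs]

# Node t is predicted class 2 by M; three of five model mutants disagree,
# so t kills three mutants and its kill vector is [0, 1, 1, 0, 1].
v = kill_vector(2, [2, 0, 1, 2, 1])
```

Per NodeRank's premise, an input whose vectors contain many 1s (many kills, many disagreements) is more likely to be misclassified and is ranked earlier.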

Do the model mutation rules of NodeRank contribute to its effectiveness?
Thus, existing confidence-based test prioritization approaches are likely to fail. We demonstrate the superior performance of NodeRank under such settings.

TABLE I. EFFECTIVENESS COMPARISON AMONG NODERANK, RANDOM, DEEPGINI, VANILLASM, PCS, AND ENTROPY IN TERMS OF THE APFD VALUES ON NATURAL DATASETS. The gray shade indicates the approach with the highest effectiveness.

The Mann-Whitney U test does not require the assumption of normal distribution for the data. Therefore, it can be used for both normally and non-normally distributed data. The Mann-Whitney U test transforms the data into ranks, calculates a test statistic based on these ranks, and uses this as a basis for computing the p-value to assess whether there is a statistically significant difference between the two sets of data. A p-value < 0.05 is generally considered indicative of significance.
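The rank transformation behind the test can be sketched in pure Python. The sketch below computes only the U statistic; obtaining the p-value would additionally require the normal approximation or exact tables (in practice, a library routine such as scipy.stats.mannwhitneyu handles both steps).

```python
def mann_whitney_u(a, b):
    """Mann-Whitney U statistic for sample a versus sample b: pool both
    samples, assign ranks (ties receive the average rank), sum the ranks
    of a, and subtract the minimum possible rank sum n1*(n1+1)/2."""
    pooled = sorted((value, idx) for idx, value in enumerate(a + b))
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1  # extend over the run of tied values
        avg_rank = ((i + 1) + j) / 2.0  # average of 1-based ranks i+1..j
        for k in range(i, j):
            ranks[pooled[k][1]] = avg_rank
        i = j
    r1 = sum(ranks[idx] for idx in range(len(a)))  # rank sum of sample a
    return r1 - len(a) * (len(a) + 1) / 2.0

# Completely separated samples give the extreme values 0 and n1*n2.
u_low = mann_whitney_u([1, 2, 3], [4, 5, 6])
u_high = mann_whitney_u([4, 5, 6], [1, 2, 3])
```

Because only ranks are used, the statistic is insensitive to the shape of the underlying distributions, which is why the test suits both normal and non-normal data.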

TABLE V. CONFIDENCE INTERVAL OF NODERANK AND THE COMPARED APPROACHES IN TERMS OF PFD ON NATURAL TEST INPUTS

TABLE VI. STATISTICAL ANALYSIS ON NATURAL TEST INPUTS (IN TERMS OF P-VALUE UNDER THE MANN-WHITNEY U TEST)

Test prioritization effectiveness among NodeRank and the compared approaches for CiteSeer with TAGCN and PubMed with GAT. X-axis: the percentage of prioritized tests; Y-axis: the percentage of detected misclassified tests.

TABLE X. CONFIDENCE INTERVAL OF NODERANK AND THE COMPARED APPROACHES IN TERMS OF APFD ON DICE-BASED GRAPH ADVERSARIAL TEST INPUTS

TABLE XI. AVERAGE COMPARISON RESULTS AMONG NODERANK AND THE COMPARED APPROACHES ON ADVERSARIAL DATA IN TERMS OF PFD

TABLE XII. CONFIDENCE INTERVAL OF NODERANK AND THE COMPARED APPROACHES IN TERMS OF PFD ON ADVERSARIAL DATASETS. Note: The gray shade indicates the approach with the highest effectiveness.

TABLE XIII. STATISTICAL ANALYSIS ON ADVERSARIAL DATASETS (IN TERMS OF P-VALUE UNDER THE MANN-WHITNEY U TEST)

Tables XII and XIV display the confidence intervals of all test prioritization methods. Table XII displays the confidence intervals in terms of PFD. We see that, in terms of PFD, NodeRank also demonstrates the highest lower and upper bounds compared to the other test prioritization approaches when prioritizing different ratios of test inputs.

TABLE XIV. CONFIDENCE INTERVAL OF NODERANK AND THE COMPARED APPROACHES IN TERMS OF APFD ON ADVERSARIAL TEST INPUTS

NodeRank-withoutGMM represents NodeRank without the use of model mutation features. NodeRank-Random incorporates model mutation features, but the mutated models are not generated by model mutation rules; instead, it alters the initial random seed to produce different but equivalent GNN models as the mutated models. NodeRank-DeepCrime utilizes model mutation rules to generate mutated models. After completing the above process, we consider that if the effectiveness of NodeRank-DeepCrime is higher than that of NodeRank-Random and NodeRank-withoutGMM, the model mutation rules operated in NodeRank-DeepCrime contribute to its effectiveness. Results: Table XVII presents the experimental results for RQ5. We highlight the approach with the highest effectiveness in grey to facilitate quick and easy interpretation of the results. The upper table showcases the average effectiveness of NodeRank-withoutGMM, NodeRank-Random, and NodeRank-DeepCrime across different subjects in terms of models. The lower table shows the average effectiveness in terms of datasets.

Contributions of Mutation Features:
In RQ4, we delve into the individual contributions of each category of mutation features (GSM, NFM, and GMM) to NodeRank. Understanding these differences is essential to identify which features are most critical for NodeRank's effectiveness, guiding further research and development efforts.