Significance tests of feature relevance for a black-box learner

An exciting recent development is the uptake of deep neural networks in many scientific fields, where the main objective is outcome prediction with the black-box nature. Significance testing is promising to address the black-box issue and explore novel scientific insights and interpretation of the decision-making process based on a deep learning model. However, testing for a neural network poses a challenge because of its black-box nature and unknown limiting distributions of parameter estimates while existing methods require strong assumptions or excessive computation. In this article, we derive one-split and two-split tests relaxing the assumptions and computational complexity of existing black-box tests and extending to examine the significance of a collection of features of interest in a dataset of possibly a complex type such as an image. The one-split test estimates and evaluates a black-box model based on estimation and inference subsets through sample splitting and data perturbation. The two-split test further splits the inference subset into two but require no perturbation. Also, we develop their combined versions by aggregating the p-values based on repeated sample splitting. By deflating the bias-sd-ratio, we establish asymptotic null distributions of the test statistics and the consistency in terms of Type II error. Numerically, we demonstrate the utility of the proposed tests on seven simulated examples and six real datasets. Accompanying this paper is our Python library dnn-inference (https://dnn-inference.readthedocs.io/en/latest/) that implements the proposed tests.


I. INTRODUCTION
D EEP neural networks [1] are a representative of black- box models, in which the learning process between features and outcomes is usually difficult to track due to the lack of knowledge about complex hidden patterns inside.The primary goal of deep learning (DL) is to fit a deep neural network for predicting outcomes with high predictive accuracy.Driven by its superior prediction performance [1], scientists seek accountability and interpretability beyond prediction accuracy.In particular, they demand significance tests based on DL to explore novel discoveries on scientific domain knowledge, for example, if a specific lung region is significantly associated with COVID-19.
Given a dataset, the goal of statistical significance tests is to examine if a collection of features of interest is associated with the outcome.A problem of this kind frequently occurs in classical parametric models or non-black-box models, for instance, in statistical genetics such as Alzheimer's disease (AD) studies, where a gene is routinely examined and tested for AD association based on a linear model.Yet, significance testing based on a black-box model for a more complicated dataset remains understudied.In Section I-A, we discuss the existing methods and their limitations and issues.In Section I-B, we summarize our contributions to highlight the novelty of the proposed methods in addressing the existing issues.

A. Existing methods and their limitations
In the existing literature, inference methods can be categorized into two groups: non-black-box tests and black-box tests.Non-black-box tests, such as the Wald test [2] and the likelihood-ratio test [3], [4], perform hypothesis testing for hypothesized features (i.e.features of interest) based on the asymptotic distribution of the estimated parameters in a parametric model such as a linear model.Black-box tests focus on a model-free hypothesis, such as Model-X knockoffs, conditional randomization tests (CRT; [5]), holdout randomization test (HRT; [6]), permutation test (PT; [7]), conditional permutation test (CPT; [8]), and leave-one-covariate-out test (LOCO; [9]).Specifically, Model-X knockoffs conduct variable selection with FDR control based on a specified variable importance measure on each of the individual features.CRT, HRT, and CPT examine the independence between the outcome and each feature conditional on the remaining features (at least for their simulations and implementation).PT examines the marginal independence between the outcome and hypothesized features.LOCO introduces the excess prediction error for each feature to measure its importance for a given dataset.
Limitations of existing works.Despite the merits of the methods developed, they have their limitations.(i) First, for non-black-box tests, it is difficult to derive the asymptotic distribution of the parameter estimates from black-box models, especially for over-parametrized neural networks.Moreover, the explicit feature-parameter correspondence may be lost for a black-box model, such as a convolutional neural network (CNN; [10]) using shared weights for spatial pixels, and recurrent neural networks (RNN; [11]) using shared weights for subsequent states.(ii) Second, most existing black-box tests focus on variable importance or inference on a single feature, yet simultaneous testing of a collection of features is more desirable in some applications.For example, in image • To address issues (i) and (ii), we propose a flexible risk invariance null hypothesis for a general loss in (2), which measures the impact of a collection of hypothesized features on prediction.Its relation to conditional independence is given in Lemma 1. • To address issues (iii) and (iv), the one-split and twosplit tests bypass the requirement of estimating the conditional distributions of features and testing based on the differenced empirical loss with sample splitting subject to computational constraints.• We provide theoretical guarantee of the proposed tests, c.f., Theorems 2 and 4, and Theorems A.1-A.4 in Appendix A. The theory is illustrated by extensive simulations.
• We compare the proposed tests with other existing tests and demonstrate their utility on four benchmarks on various deep neural networks in Section VI.We develop a Python library dnn-inference to implement the proposed tests.Overall, the proposed tests relax the assumptions and reduce the computational cost, providing more practical, feasible, and reliable testing for a black-box model on a complex dataset.
This article is structured as follows.Section II introduces the one-split test as well as its combined test.Section III performs Type II analysis and establishes the consistency of the proposed tests.Section IV develops sample splitting schemes.Section V is devoted to simulation studies and applications to four real datasets.The Appendix encompasses the two-split test, additional numerical examples, and technical proofs.

II. THE PROPOSED BLACK-BOX TESTS
In DL, a deep neural network f (X) is fitted to predict an outcome Y based on features X ∈ R d , and its prediction performance is evaluated by a loss function l(f (X), Y ).Our objective is to test the significance of a subset of features X S = {X j : j ∈ S} to the prediction of Y , where S is an index set of hypothesized features and X S c = {X j : j / ∈ S} with S c indicating the complement set of S. Note that X S can be a collection of weak features in that none of these features is jointly significant to prediction, but collectively they are.For example, in image analysis, the impact of each pixel is negligible but a pattern of a collection of pixels (e.g. in a region) may instead become significant.
To formulate the proposed hypothesis, we first generate dual data (Z, Y ) by replacing X S by some irrelevant constants as Z S such as where M is an arbitrary deterministic constant.Note that the dual data Z satisfies that Z S ⊥ Y | Z S c and Z S c = X S c , thus we aim to use differences between (X, Y ) and (Z, Y ) to measure the impact of X S on the prediction of the outcome Y .To proceed, we introduce the corresponding risks: Then, the differenced risk is defined as R(f * ) − R S (g * ) to measure the significance of X S , that is, compare the best prediction performance with/without the hypothesized features X S with existence of X S c .Here f * = argmin f R(f ) and g * = argmin g R S (g) are the optimal prediction functions in population.
To determine if X S is significantly relevant to the prediction of Y , consider null H 0 and alternative H a hypotheses: Rejection of H 0 suggests that the feature set X S is relevant to the prediction of Y .It is emphasized that, in (2), the targets are the two true or population-level functions f * and g * , instead of their estimates (based on a given sample) as implemented in some existing tests.
In the next section, we demonstrate the relation between the proposed hypothesis and the independence hypothesis.More discussion about the differences between the hypotheses in HRT and LOCO can be found in Sections II-E and V-A.

A. Connection to independence
This subsection illustrates the relationships among the risk invariance hypothesis in (2), marginal independence, and conditional independence; the latter two are defined as: Lemma 1: For any loss function, conditional independent implies the proposed risk invariance, that is, Moreover, if the negative log-likelihood or the cross-entropy l(f (X), Y ) = −1 Y log(f (X)) is used in (2) as a loss function, then H 0 is equivalent to conditional independence almost surely under the marginal distribution of X, that is, R(f * ) − R S (g * ) = 0 ⇐⇒ For any y, As suggested by Lemma 1, conditional independence always implies risk invariance, but they can be almost surely equivalent with some particular loss functions.Hence, at any significance level, a rejection of the null hypothesis of risk invariance implies a rejection of the null hypothesis of conditional independence.Yet, such a relationship does not exist for marginal independence.Next, we present three cases with disparate loss functions to illustrate their relationships.Case 1.
As shown in Figure 1, conditional independence implies risk invariance in Cases 1 and 2 while they are equivalent in Case 3, as suggested by Lemma 1.In general, conditional independence or risk invariance does not yield marginal independence and vice versa.
It is worthwhile mentioning that different loss functions can lead to different conclusions.We interpret a significance test according to the loss function being used.For example, consider the misclassification error (MCE) and the crossentropy loss for testing the relevance of X S respectively.With the existence of X S c , the H 0 under the cross-entropy loss indicates that the hypothesized features are irrelevant to the conditional distribution of Y given X, yet the H 0 under MCE suggests that the hypothesized features are irrelevant to classification accuracy.

B. One-split test
, where N is the number of total samples, n = ζN and m = N − n are the sample sizes of estimation and inference subsets, and ζ is a splitting ratio.On this ground, the dual estimation subset (Z i , Y i ) i=1,••• ,n and the dual inference subset (Z j , Y j ) j=n+1,••• ,n+m can be generated based on the masking processing in (1).The sample splitting intends to reduce the potential bias and to prevent overfitting, especially for an overparametrized black-box model, which has been considered elsewhere for a different purpose in [12], [13], [4].
To access the null hypothesis in (2), we conduct a two-level estimation to R(f * )−R S (g * ), that is, using the (dual) estimation subset to empirically estimate predictive models (f * , g * ), and the (dual) inference subset to empirically estimate two risks R(•) and R S (•).Specifically, given the (dual) estimation subset, we obtain an estimator ( f n , g n ) to approximate (f * , g * ), for example, by minimizing a regularized empirical loss of a deep neural network based on the (dual) estimation subset.Then, the differenced empirical loss is evaluated on the (dual) inference subset based on the estimator ( f n , g n ): As a remark, for flexibility, we do not specify the estimation procedure of ( f n , g n ), the only explicit condition is summa-rized in Assumption A, which properly requires that ( f n , g n ) is a consistent estimator of (f * , g * ).
One difficulty in inference is that under H 0 the bias of R( f n ) − R S ( g n ) approximating R(f * ) − R S (g * ) could dominate its standard error; that is, the ratio of the bias to the standard derivation, called the bias-sd-ratio, could be severely inflated, making the asymptotic distribution of R( f n )−R S ( g n ) invalid for inference.This aspect is explained in detail in Section II-D.To circumvent this difficulty, we present the onesplit test with data perturbation to guard against the potentially inflated bias-sd-ratio by adding an independent noise: where σ n is the sample standard deviation of {∆ n,j } j=1,••• ,m given f n and g n : n,j , and ε j ∼ N (0, 1); j = 1, • • • , m are independent noise, and ρ n > 0 is the perturbation size.Note that our proposed test is in principle similar to classical hypothesis tests using a single test statistic.For example, if we use the negative log-likelihood as the loss function, it can be regarded as an extension of the likelihood ratio test (LRT; [14]) to a black-box model.According to the asymptotic null distribution of Λ (1) n in Theorem 2, we calculate the p-value P (1) = Φ(Λ Note that m is a subsequence of n, and m → ∞ as n → ∞.To derive the asymptotic null distribution of Λ (1) n , we make the following assumptions.Assumption A (Estimation consistency).For some constant γ > 0, ( where O p (•) denotes stochastic boundedness [15].
Assumption A concerns the rate of convergence in terms of the differenced regret, where R( f n ) − R(f * ) ≥ 0, known as the prediction regret with respect to a loss function l(•, which says that the rate n −γ is no worse than the least favorable one between the regrets of f n and g n .In the literature, the convergence rates for the right-hand of (5) have been extensively investigated.For example, the rate is n −ξ/(2ξ+d) for nonparametric regression [16], and the rate is dn −2ξ/(2ξ+1) log 3 n for a regularized ReLU neural net [17], where ξ is the degree of smoothness of a d-dimensional true regression function.Note that an over-parametrized model may slow the convergence rate γ, yet an under-parametrized model may violate Assumption A, since the approximation error may not vanish.This fact is supported by Example 7 in Simulation (Section V-B).n ).Assume that for some constant µ > 0, where ∆ n,1 is defined in (3), and E • |E n is the conditional expectation of inference samples given the estimation samples.Assumption C (Variance condition for Λ (1) where Var(•|E n ) denotes the conditional variance of inference samples given the estimation samples.
Assumptions B and C are used in applying the central limit theorem for triangle arrays [18], which are verifiable under some mild conditions, c.f., Lemma C.1 in Appendix C.
The asymptotic null distribution for Λ (1) n is indicated in Theorem 2.
Theorem 2 (Asymptotic null distribution of Λ where d −→ denotes convergence in distribution.Theorem 2 says that the proposed test is valid under the splitting condition of m = o(n 2γ ).As a result, the estimation/inference splitting ratio needs to be suitably controlled.In Section IV, we propose a "log-ratio" splitting scheme, in which the splitting condition is automatically satisfied, c.f., Lemma 6.
As an alternative, we present the two-split test in Appendix A to address the bias-sd-ratio issue, where we divide inference samples further into two equal subsets for inference, in which no data perturbation is needed.

C. Combining p-values over repeated random splitting
Combining p-values via repeated random sample splitting can strengthen the one-split test (3).First, it stabilizes the testing result.Second, it can often empirically compensate for the power loss by combining evidence across different split samples, as illustrated in [19], [20] and our simulations in Section V. Subsequently, we use the order statistics of the pvalues to combine the evidence from different splitting, though we could apply other types of combining such as the corrected arithmetic and geometric means [21], [22].
Given a splitting ratio, we repeat the random splitting scheme U ≥ 2 times; that is, each time, we randomly split the original dataset into an estimation/inference subsets.In practice, U cannot be large due to computational constraints and is usually 3-10 for large-scale applications.Then we compute the p-value P (1) u on the u-th splitting; u = 1, • • • , U , and combine them in two ways: the q-order and Hommel's weighted average of the order statistics [23].Specifically, (q-order) P (1) = min U q P (1) where q , and P (q) is the q-th order statistic of U .The q-order combined test ( 7) is a generalized Bonferroni test with the Bonferroni correction factor U q .The Hommel combined test renders robust aggregation and yields a better control of Type I error, where C U is a normalizing constant.
In Theorems 3 and 5, we further generalize the result of [23] to control Type I and Type II errors of the proposed tests asymptotically.A computational scheme for the combined tests is summarized in Algorithm 1.
Theorem 3 (Type I error for the combined one-split test): Under Assumptions A-C, if m = o(n 2γ ), then under H 0 , for any 0 < α < 1 and any U ≥ 2, the combined one-split test for (3) achieves where P (1) is defined in (7).

D. Role of data perturbation
This subsection discusses the role of the data perturbation for the one-split test.Now consider the one-split test without perturbation, that is, n into three terms: Under H 0 , T 3 = 0, and T 2 is the bias-sd-ratio introduced in Section II-B.Specifically, under H 0 , as n → ∞, σ As a result, T 1 may not satisfy the assumption of the central limit theorem.Furthermore, T 2 may not converge to zero.For example, are vanishing in the same order.Thus, the asymptotic null distribution in (6) breaks down since Λ (1) n is dominated by T 2 .By comparison, with data perturbation, ρ n → ρ > 0, (σ (1) n in ( 6) is valid.Moreover, a "log-ratio" sample splitting scheme is proposed in (8), where the splitting condition is automatically satisfied, as indicated in Lemma 6.
In later simulations (cf.Table VII), we will show numerically that, if no data perturbation is applied in the one-split test, it leads to increasingly inflated Type I errors with larger datasets in a neural network model.

E. Comparison with existing black-box tests
The one-split test in (3) has some characteristics that distinguish it from other existing black-box tests, including CRT [5], HRT [6], CPT [8] and LOCO tests [9].
CRT, CPT, and HRT test the conditional independence of a single feature individually, and the LOCO test measures the increase in prediction error due to not using a specified feature in a given dataset.The differences between the proposed tests and other existing tests can be summarized in three folds.First, for CRT, CPT, HRT, and LOCO tests, it is unclear how to test a set of multiple features X S , which is the target of our tests.Second, the significance of relevance is defined in different ways.The LOCO test conducts a significant test for the estimated model based on a given dataset with the mean absolute error, yet CRT, CPT, HRT, and the proposed tests conduct testing at the population level; that is, the former three examine conditional independence, while the last one focuses on the risk invariance as specified in (2).Third, CRT, CPT, and HRT require well-estimated conditional probabilities of every feature given the rest, which is often difficult in practice.Finally, the proposed tests are advantageous over CRT and CPT with reduced computational cost by avoiding a large number of model refitting.

III. TYPE II ERROR ANALYSIS
This section performs Type II error analysis of the one-split test (3) and its combined version (7).
Theorems 4 and 5 suggest that the one-split test and its combined test are consistent in that their asymptotic Type II error tends to zero as δ → ∞.
Theorem 4 (Limiting Type II error of the one-split test): Suppose that the one-split test (3) satisfies Assumptions A-C and m = o(n 2γ ), then Given the results of Theorem A.3 in Appendix A, we note that the one-split test is more powerful than the two-split test in terms of the asymptotic Type II error.
Theorem 5 (Limiting Type II error of the combined tests): Suppose that the one-split test (3) satisfies Assumptions A-C and m = o(n 2γ ), then for P (1) defined as the q-order combined test in (7), we have and for P (1) defined as the Hommel combined test (7), we have 3 ) + Φ(−h 0 ), h 0 = δ √ 2σ (1) and T (h, a) is Owen's T function [24].Note that the upper bound in Theorem 5 can be further improved if the explicit dependency structures of the p-values from repeated sample splitting are known.

IV. SAMPLE SPLITTING
The one-split and two-split tests require the sample splitting ratio ζ to satisfy the requirement m = o(n 2γ ) to control the Type I error.In this section, we develop two computing schemes, namely "log-ratio" and "data-adaptive" tuning schemes, to estimate ζ in addition to the perturbation size ρ for the one-split test.

A. Log-ratio sample splitting scheme
This subsection proposes a log-ratio splitting scheme to ensure automatically the splitting condition m = o(n 2γ ).Specifically, given a sample size N ≥ N 0 , where N 0 is a minimal sample size required for the hypothesis testing, the estimation and inference sizes n and m are obtained by: where Lemma 6: Suppose that the estimation/inference sample sizes (n, m) are determined by the log-ratio sample splitting scheme in (8), then they satisfy the splitting condition m = o(n 2γ ) for any γ > 0 in Assumption A. B. Heuristic data-adaptive splitting (tuning) scheme The log-ratio splitting scheme in ( 8) is relatively conservative as the inference sample size m increases in the logarithm of the estimation sample size n.To further increase a test's power, we develop a heuristic data-adaptive tuning scheme as an alternative.
The data-adaptive tuning scheme selects (ζ, ρ) by controlling the estimated Type I error on permutation datasets.To proceed, we define the permutation on hypothesized features, that is, for where π is a permutation mapping.Note that the hypothesized feature X i,S is conditional independence to the outcome Y i for the permutation sample.Alternatively, the null hypothesis H 0 is true for permutation datasets.On this ground, we aim to use the proportions of rejecting over T -times permutation as an estimate of the Type I error, and select (ζ, ρ) which is able to control the estimated Type I error.Ideally, re-fitting and re-evaluation are required for each permutation dataset.To reduce the computational cost, we only fit ( f , g) based on a permutation estimation subset, and estimate Type I error by reevaluating them at T -times permutation on an inference subset.The detailed procedure is summarized in following Steps 1-4.
Step 1 (Sample splitting).Given a splitting ratio ζ, split the original sample into the estimation and inference samples.
Step 4 (Estimate Type I error).Permute the inference subset T -times, and generate the corresponding dual samples via (1).
For the fixed estimators ( f n , g n ), compute the (combined) pvalues for each permuted (dual) inference samples under the perturbation size ρ, denote as Then an estimated Type I error is computed as: The splitting ratio ζ controls the trade-off between Type I and Type II errors.Specifically, a small ζ value yields biased estimators ( f n , g n ), leading an inflated Type I error, yet could reduce the Type II error because of an enlarged inference subset.The perturbation size ρ, as mentioned early, controls the bias-sd-ratio to ensure the validity of the asymptotic null distribution.
For the one-split test, the data-adaptive scheme estimates (ζ, ρ) as the smallest values in some candidate sets that controls the estimated Type I error.In the process of searching candidate sets ζ and ρ, it stops once the termination criterion Err (1) (ρ, ζ) ≤ α is met, which intends to reduce the computational cost.In particular, where Err (1)

A. Numerical comparison with existing black-box tests
This subsection presents a simple example to illustrate the differences between the proposed tests and other existing black-box tests, including the holdout randomization test (HRT; [6]), the leave-one-covariate-out test (LOCO; [9]), the permutation test (PT; [25], [7]), and the holdout permutation test (HPT; [6]).For PT, we use the scheme of [7] to permute multiple hypothesized features X S , on which we refit the model, and the permutation size is 100.Algorithm 2 in Appendix B summarizes the procedure for the permutation test.Note that we exclude CRT here due to its enormously expensive computing in refitting a model many times.
To alleviate the high computational cost of refitting, HPT uses data-splitting into a training sample and a test sample.Then it fits only one time on training data and performs the permutation test over the test sample with the trained model.In our context, we extend HPT in [6] by simultaneously permuting multiple hypothesized features X S .
One issue with the PT and HPT is that permutations of hypothesized features usually alter the dependence structure between X S and X S c .As a result, the sampling distribution based on permuted samples may differ from the null distribution.For example, the simulated example in Appendix B.2 indicates that both HPT and PT lead to dramatically inflated Type I errors.
In this section, we generate a random sample of size Second, the outcome Y is generated as Y = 0.02(X 1 + X 2 + X 3 ) + 0.05 , where ∼ N (0, 1).
A simulation study is performed for the one-split and twosplit tests, HRT, LOCO, and HPT.For HRT, we use the code of [6] available at GitHub with a default mixture density network with 2 components.For other methods, we fit a linear function based on stochastic gradient descent (SGD) with the same fitting parameters, that is, epochs is 100, batch _ size is 32, and early stopping with validation _ split being 0.2 and patience being 10, where patience is the number of epochs until termination if no progress is made on the validation set.For HRT, LOCO, HPT, and PT, the sample splitting ratio is fixed as 0.8, and the data-adaptive scheme is used for the proposed tests.
The returning values are summarized in Table II: the onesplit and two-split tests return valid p-values for the hypothesis in (2) with S = {1, 2, 3}, HRT and LOCO return p-values for individual features of conditional independence and errorinvariance for a given dataset, respectively.PT and HPT provide p-values for marginal independence.Therefore, the proposed tests are the only ones targeting the specified null hypothesis in (2).

B. Simulations
Consider a nonparametric regression model, where only depends on a subset of features of x, in which z S0 = 0 and z S c 0 = x S c 0 with S 0 = {1, • • • , |S 0 |}.Given a hypothesized index set S, our goal is to test if X S is relevant to predicting the outcome Y , as specified in (2).
For illustration, we set the regression function as a neural network f l−1 , w * l,j is the j-th column of the matrix W * l , τ > 0 is a constant, d l is the width for the l-th layer, and and L is the depth of the network.Clearly, f * ∈ H, where H is a candidate class defined as: We perform simulations under the model in (12), where X ∼ N (0, BΣ), Σ ij = r |i−j| , r ∈ [0, 1), r represents the correlation coefficient of features, B controls the magnitude of the features, (L, , τ ) denotes the depth, width, and the L 2norm of the neural network, S 0 = {1, • • • , |S 0 |} is an index set of the true non-discriminative features, and S is an index set of hypothesized features.For hypotheses in (2), we examine four index sets of hypothesized features S: These four sets are illustrated in Figure 2. Note that S S 0 = S 0 in (i), implying that it is for Type I error analysis, while S S 0 = S 0 in (ii)-(iv), suggesting Type II error analysis.From (ii) to (iv), the distance (or correlation) between the hypothesized features S and those non-discriminative features in S 0 is increasing (or decreasing), thus the Type II error is expected to go down.Seven examples are considered for (i)-(iv) hypothesized feature sets.
Example 1. (Impact of the sample size and splitting method) This example (Table III [6]), the leave-one-covariate-out test (LOCO; [9]), permutation test (PT), and holdout permutation test (HPT; [6]).For a test's Type I and II errors, we compute the proportions of its rejecting H 0 out of 1000 simulations under H 0 and out of 100 simulations under H a , respectively.
When implementing the log-ratio splitting scheme, (n, m) is determined by (8) with N 0 = 1000, and ρ = 0.01; for the dataadaptive scheme, the grids of ζ are set as {0.2, 0.4, 0.6, 0.8}.Moreover, the grids for searching the optimal perturbation size are {0.01,0.05, 0.1, 0.5, 1.0}.For combined tests, the number of repeated random splitting is set as 5.The hyperparameters of fitting a neural network are the same as in Section V-A.Runtime.The combined tests may double the runtime of their non-combined counterparts based on the data-adaptive tuning scheme.This result suggests that the one-split/two-split, and their combined tests are practically feasible for black-box testing subject to computational constraints as in the case of applying deep neural networks to large data.
Combining p-values.As suggested by Table B.4 in Appendix, the Hommel combining method controls the Type I error while having reasonably good power in reducing the Type II error.The Bonferroni and Cauchy methods have an issue of failing to control Type I error, whereas other combining methods are conservative in the first case of Example 6.
Over/under-parameterized models.As suggested in Table V, an under-parameterized model ( = 32) has inflated Type I errors, which agrees with the theoretical analysis in Section II-B; an over-parameterized model ( = 128) is able to control the Type I error, and provides similar performance in the power or Type II error to that of the perfectly specified model (with exactly the same network structure of the ground truth model), it is partially because the early-stopping is conducted as a regularization for overparameterized models.For the (combined) two-split test, both the over-and under-parameterized models perform similarly to the perfectly specified model.One plausible explanation (for its no inflation of Type I errors) is that the two-split test is conservative in the finite-sample setting.
We summarize the advantages of the different tests and the combining/tuning methods in Table VI.

C. One-split test and perturbation
Consider a regression model in (12), where S 0 = {1, 2, 3}, X ∼ N (0, BΣ), where Σ 1j = Σ j1 = 0.1; j = 1, • • • , p, and Σ ij = 0, if i, j = 1 and i = j.In this case, let S = S 0 , then H 0 is true in the population level.Furthermore, only partial features are observed in a dataset (x Then, we simulate a dataset (x , and N = 2000, 6000, 10000.For implementation, we set ζ = 0.2 for the one-split and two-split tests and ρ = 1.0 for the one-split test.The fitting parameters of a neural network remain the same as in Section V-B.Then, the Type I errors based on the two-split test, and the one-split tests with/without perturbation are reported in Table VII.
As indicated in Table VII, the two-split test and the one-split with perturbation approximately control Type I errors across all situations, whereas the one-split test without perturbation has inflated Type I errors significantly exceeding the nominal level α = 0.05.

A. MNIST handwritten digits
This subsection applies the proposed test to the MNIST handwritten digits dataset [10].The MNIST dataset is a standard benchmark for XAI methods [26], in part because the results of detection could be easily evaluated by human visual intuition.In particular, we extract 14, 251 images from the dataset with labels '7' and '9' to discriminate between these two digits.Our primary goal is to test certain image features  In this application, we consider three different types of masked regions, as displayed in Figure 3.
To proceed, we specify the underlying model as the default convolution neural network (CNN) provided by Keras for the MNIST dataset.Finally, we apply the one-split test, the twosplit tests, and their combined tests based on the data-adaptive tuning scheme with a significance level of α = 0.05.As suggested by Table VIII, the (combined) one-/two-split tests all fail to reject H 0 when it is true in Cases 1-2, but all reject H 0 in Case 3 when it is false.Overall, the test results confirm our intuition that the hypothesized regions in Cases 1-2 are visually indistinguishable, whereas that in Case 3 is visually discriminative, as illustrated in Figure 3.

B. Mechanisms of Action (MoA) prediction for new drugs
This subsection applies the proposed tests to examine the significance of "treatment", "gene expression", and "cell viability" to MoA prediction of new drugs.The dataset consists of 23814 drug-MoA annotation pairs with three types of features ("treatment", "gene expression", and "cell viability"), and 207 binary labels indicating multiple targets of MoA responses, as illustrated in Figure 4. Specifically, "treatment" includes "treatment duration" (continuous) and "treatment dose" (categorical); "gene expression" and "cell viability" include 773 gene expression data (continuous), and 100 human cells' responses (continuous) to drugs [27], [28], respectively.
In this application, we consider the significance of those three types of feature sets, as displayed in Figure 4.
For implementation, we use TabNet [29] as the predictive model for our proposed tests.The results are summarized in Table IX, which indicates that all tests fail to reject H 0 at α = 0.05 for "gene expression" features.For Cases 1 and 3, all tests consistently reject H 0 , identify "treatment" and "cell viability" as significant features to MoA prediction.

C. Chest X-rays for pneumonia diagnosis
This subsection illustrates the application of the proposed tests to chest X-ray images in a pneumonia diagnosis dataset [30].This dataset consists of 5,863 X-ray images, each labeled as "Pneumonia" or "Normal".To proceed, we crop an image to produce a version of the image that focuses on the lung fields, based on DeepXR.Then, we use a square cropping region to retain important areas containing parenchymal anatomy and retrocardiac anatomy.
For implementation, we specify the learning model as a convolution neural network (CNN) and apply the one-split test, two-split test, and their combined tests based on the dataadaptive tuning scheme at a significance level of α = 0.05.Similarly, we also consider three different types of hypothesized regions, as displayed in Figure 5.As suggested by Table X, all tests fail to reject H 0 at α = 0.05 in Case 2 when H 0 is likely to be true.For Cases 1 and 3, only the (combined) one-split test rejects both the H 0 , but other tests fail to do so when H 0 is likely to be false.In agreement with the earlier results, the one-split test seems more powerful to detect a discriminative region.

D. Significance of keypoints to facial expression recognition
This subsection examines the significance of five keypoints (left eye, right eye, eyes, nose, and mouth) on seven facial expressions: 'angry', 'disgust', 'fear', 'happy', 'sad', 'surprise', 'neutral') on the FER2013 dataset, consisting of 48x48 pixel grayscale facial images.The facial images have been automatically registered.For each facial image, an emotion label is provided as one of seven expressions.Given a facial image, we produce the keypoints based on the existing facial landmark detection libraries dlib and open-cv.The primary goal is to deliver the significance of the keypoints to facial expression recognition.After preprocessing, we obtain 11709 triples of images, labels, and keypoints.The scatter plot for the keypoints is provided in Figure 6, from which we consider five different collections of hypothesized regions corresponding to five keypoints: left eye, right eye, eyes, nose, and mouth, respectively.The hypothesized regions based illustrative examples are displayed in Figure 7.For implementation, we use the same VGG deep neural network and the same training hyperparameters as in [31].Note that the adopted VGG network in [31] is one of the state-of-art facial expression recognition methods (Rank 4) in FER2013 papers-with-code Leaderboard [32].
As suggested by Table XI, all tests fail to reject H 0 in Cases 1, 2, and 4. For Cases 1-2, it is partly because the predictive information in the left/right eye is symmetrically leaked in the other eye.For Case 4, the result confirms the visual intuition that "nose" is not a discriminative keypoint to facial expression.For Cases 3 and 5, all tests consistently reject H 0 , suggesting that "eyes" and "mouth" are discriminative regions, which are visually confirmed by illustrative samples in Figure 7.Note that the proposed tests are equally applicable to more substantial computer vision applications, for which the testing results could provide instructive information for visual sensor management and construction.

E. Evaluating significance of localization in CIFAR100
Note that the proposed methods are equally applicable to significance tests with instance adaptive hypothesized features.Therefore, they can be used to evaluate the effectiveness of discriminative localization methods, such as the class activation maps (CAM; [33]) and Grad-CAM [34].In this subsection, we demonstrate a significant test in the CIFAR100 dataset based on adaptive hypothesized features localized by Grad-CAM.Specifically, in the training set, we apply Grad-CAM to a fitted AlexNet to produce importance/heatmaps of features/pixels of all images, see 6 demonstrative examples in Figure 8.Then, 4 cases of hypothesized tests are provided by taking top-5%, top-10%, top-15%, top-30% important features as hypothesized features.Next, the proposed one/two-split tests are conducted in the testing set with a ResNet50 network, and the resulting p-values are summarized in Table XII.
Overall, the test results confirm our intuition, the inconsistent results in Cases 2 and 3 by one-split and two-split tests may be caused by the power loss of two-split tests.It is worth mentioning that the sequence of pairs (top important hypothesized/localized features, p-values) produced by the proposed tests can be an evaluation of the effectiveness of the localization method.

F. Significance of keywords in sentiment analysis
This subsection examines the significance of keywords in sentiment classification based on the IMDB dataset [35].This dataset provides 50,000 highly polar movie reviews for binary sentiment classification.We also obtain lists of positive, negative, and neutral opinion words from [36].In this application, we apply the proposed tests to examine the significance of positive/negative/neutral words contributing to sentiment analysis.For illustration, we report the results based on the top 350 frequent positive and negative-sentiment words and 350 randomly selected neutral-sentiment words in the IMDB dataset.
For implementation, we use a Bidirectional LSTM model as a prediction model for sentiment classification and apply the one-split/two-split tests and their combined tests based on the log-ratio splitting method at a significance level of α = 0.05.Overall, the test results in Table XIII confirm our intuition, where the positive and negative-sentiment words significantly contribute to sentiment analysis, but not neutral-sentiment words.Inconsistent results in Cases 1 and 2 by one-split and two-split tests may be caused by a power loss of the two-split test.

VII. CONCLUSIONS
This article proposes two novel risk-invariance tests, onesplit and two-split tests, to assess the impact of a collection of hypothesized features on prediction.Theoretically, we have established asymptotic null distributions of test statistics and their consistency in Type I/II errors.Numerically, we have demonstrated the utility of the proposed tests on simulated and real datasets.Next, we summarize some strengths and limitations of the proposed tests.
Strengths.(i) The proposed tests provide a practical inference tool for black-box models on complex data, which considerably relax assumptions in the existing literature.For example, CRT and HRT require a well-estimated conditional probability for features, which is often impractical.(ii) The proposed tests work for general risk-invariance testing on a collection of features of interest, which encompasses the conditional independence test when the log-likelihood loss is used.(iii) The proposed tests involve a limited number of model refitting, which suitable for large-scale problems.
Limitations.(i) One-split/two-split tests split over the original dataset at the expense of reduced power or increased Type II error.(ii) The log-ratio splitting scheme is conservative in that it prefers situations with a large estimation subset and a small inference subset.

Algorithm 1 : 3 Shuffle the data; 4 5
(ρ, ζ) is the estimated Type I error computed via(10), ζ and ρ represent sets of candidate ζ and ρ values.Overall computational cost.Algorithm 1 summarizes the computational scheme of the one-split test.For the noncombining test in Algorithm 1, the data-adaptive scheme usually requires 2-3 times of training and evaluations since the loop for the tuning of (ζ, ρ) usually terminates in one or two iterations.For the combined test, the data-adaptive scheme based on 5 random splits usually requires 10 times of training and evaluations.The running time for the proposed test is indicated in TablesIII and B.1 in Appendix B. One-split test for region significance Input : Data: (x i , y i ) N i=1 ; Set of hypothesized features: S; Number of splitting: U Output: p-value for testing (2) 1 Estimate ( ρ, ζ) from (11) ; 2 for u = 1, • • • , U do Split the data into an estimation subset and an inference subset, where m = ζN and n = N − m; Generate dual samples via (1) for estimation/inference subsets; EXAMPLES This section examines the proposed tests for their capability of controlling Type I and Type II errors in both simulated and real examples.All tests are implemented in our Python library dnn-inference (https://github.com/statmlben/dnn-inference).
) concerns the performance of the proposed tests in relation to the sample size N based on logratio and data-adaptive splitting methods, where N ranges from 2000 to 10000, B = 0.4, r = 0.25, p = 100, = 128, L = 3, |S 0 | = 5.Example 2. (Impact of the strength of features of interest) This example (Table IV) concerns the performance of the proposed tests with respect to the magnitude of features B, where B = 0.2, 0.4, 0.6, N = 6000, p = 100, r = 0.25, = 128, L = 3, and |S 0 | = 5.The data-adaptive tuning scheme is applied for this example.Example 3. (Impact of the depth and width of a neural network) This example (Table B.1 in Appendix B) concerns the performance of the proposed tests in terms of the width and depth L of a neural network, where N = 6000, L = 2, 3, 4, = 32, 64, 128, B = 0.4, r = 0.25, p = 100 and |S 0 | = 5.Example 4. (Impact of the number of hypothesized features) This example (Table B.2 in Appendix B) concerns the proposed tests with respect to the number of hypothesized

001 Fig. 2 :
Fig.2: Illustration of four index sets of hypothesized features in simulations: (i) Type I error analysis, (ii)-(iv): Type II error analysis.Note that the impact of the hypothesized features S on S 0 decreases while the Type II error is expected to decrease from (ii) to (iv).
Type I/II errors of the (combined) one-split/two-split tests.As indicated in Tables III-IV and Tables B.1 -B.5, the one-split/two-split tests perform well in all examples with respect to controlling Type I/II errors.In particular, Type I errors are close to the nominal level α = 0.05, whereas Type II errors decrease to 0 as the sample size N increases.As expected, the one-split test outperforms the two-split test in terms of Type II error, which agrees with Theorems 4 and Theorem A.3 in Appendix A. The combined tests consistently improve the performance in terms of both Type I/II errors.

Fig. 3 :
Fig. 3: The hypothesized regions (HRs) in Cases 1-3 for differentiating digits 7 and 9 in Section VI-A.Case 1: a HR is (19 : 28, 13 : 20), which indicates that H 0 is true; Case 2: a hypothesized region is (21 : 28, 4 : 13), which indicates that H 0 is true; Case 3: a hypothesized region is (7 : 16, 9 : 16), which indicates that H a is true.Note that the p-values in the top are given by the one-split test.

Fig. 5 :
Fig. 5: The hypothesized regions (HRs) in Cases 1-3 for discriminating "Normal" (first row) versus "Pneumonia" (second row) X-ray images in Section VI-C.Case 1: a HR is (50 : 200, 20 : 110), for which H 0 is likely to be false; Case 2: a HR is (50 : 200, 100 : 150), for which H 0 is likely to be true.Case 3: a HR is (50 : 200, 150 : 240), for which H 0 is likely to be false.Note that the p-values in the top are given by the one-split test.

Fig. 6 :
Fig. 6: The scatter plot for the keypoints (left eye, right eye, nose, and mouth) in FER2013 facial expression recognition dataset, yielding that the hypothesized regions in Cases 1-4 cover the corresponding keypoints in most faces.

TABLE I :
(8)ustration of split sample sizes (n, m) using the log-ratio splitting scheme(8)as the total sample size N increases from 2000 to 100000 while N 0 = 2000 is fixed.

TABLE II :
Returning values of the one-split/two-split tests and other existing black-box tests.Here one-split, two-split, HRT, LOCO, PT, and HPT denote the proposed tests in Algorithms 1 and Algorithm 1 in Appendix A, holdout randomization test (HRT;

TABLE III :
Empirical Type I/II errors of the (combined) one-/two-split tests, and their combined tests in Example 1 at α = 0.05.

TABLE IV :
Empirical Type I/II errors of the (combined) onesplit and two-split tests in Example 2.

TABLE V :
Empirical Types I/II errors of the (combined) one-/two-split tests in Example 7 based on 0 = 64 (width for the truth model) and different s (width for a learning model).

TABLE VI :
Advantage for different tests, combining, and tuning methods.

TABLE VII :
Type I errors of the one-split tests with/ without perturbation (PTB) and the two-split test in Section 6.4.
differentiating digit '7' from digit '9', where a marked region of an image specifies hypothesized features.

TABLE VIII :
P-values of (combined) one-/two-split tests in the MNIST dataset.Significant p-values for testing feature irrelevance are underlined at a nominal level α = 0.05.

TABLE IX :
P-values of (combined) one-/two-split tests in the MoA prediction dataset.

TABLE X :
P-values of the (combined) one-/two-split tests in the chest X-ray dataset.

TABLE XI :
P-values of the (combined) one-/two-split tests in the FER2013 dataset based on five keypoints: left eye, right eye, eyes, nose, and mouth.

TABLE XII :
P-values of the (combined) one-/two-split tests in the CIFAR100 dataset.The percentages of hypothesized features are 5%, 10%, 15%, 30%, corresponding to the top important features ranked by Grad-CAM localization heatmaps.

TABLE XIII :
P-values of the (combined) one-/two-split tests in the IMDB dataset with hypothesized features as: Case 1: the top 350 frequent positive words; Case 2: top 350 negativesentiment words; Case 3: 350 randomly selected neutralsentiment words.