A Comparison of Global Sensitivity Analysis Methods for Explainable AI With an Application in Genomic Prediction

Explainable Artificial Intelligence (XAI) is an increasingly important field of research, required to bring AI to the next level in real-world applications. Global sensitivity analysis (GSA) methods play an important role in XAI, as they can reveal which (groups of) parameters have a high influence on the predictions of machine learning models and on the output of simulators and real-world processes. In this paper, we survey global sensitivity analysis methods in an XAI context and present both a qualitative and a quantitative analysis of these methods under different conditions. In addition to the overview and comparison, we propose an open-source application, GSAreport, that makes it easy to generate extensive reports using a carefully selected set of global sensitivity analysis methods, chosen according to the number of dimensions and samples, to gain a deep understanding of the role of each feature for a given model or data set. Finally, we apply the discussed methods to a complex real-world application in genomic prediction and draw conclusions about when to use which GSA methods.

uncertainties are incorporated in the data and the model are just a few of the questions that domain experts and engineers want answered before deploying AI methods to production. SA plays an important role in the field of XAI, as it can answer part of these questions without depending on a particular machine learning model or even a sampling strategy. Especially when the number of model inputs is large, recognizing the factors on which to focus resources in data collection and data-driven modeling efforts becomes crucial. SA is used in many real-world applications, including but not limited to: understanding how chemical models work [1], measuring the influence of input parameters on biological models [2], and understanding and analyzing factors affecting CO2 emissions in the construction industry [3]. SA methods can be divided into global and local SA methods. Global approaches focus on the variation of all inputs.

The second experiment, in Section VIII, focuses on the accuracy of these methods under different conditions for (relatively simple) randomly generated linear functions. In Section IX, we combine the two previous quantitative results with a qualitative comparison of the different methods. In Section X, we propose the open-source software package GSAreport, which makes it easy to use the aforementioned XAI and GSA methods to gain a deep understanding of the feature sensitivities of any model or process. We conclude the work with a detailed example of a real-world application dealing with genomic prediction in Section XI, which shows the applicability of these methods, and a summary in Section XII.

II. GLOBAL SENSITIVITY ANALYSIS

While there are already some excellent reviews of sensitivity analysis methods, such as [8] and [9], none of these reviews compare GSA methods in terms of robustness under different sample sizes and numbers of dimensions, or in terms of applicability to explainable AI. Most related work is limited to a small number of dimensions (typically fewer than five) and compares only two or three methods. A recent work [10] proposes a meta-function to benchmark different GSA methods on their ability to recover ''ground truth'' sensitivity indices. However, the ''ground truth'' was computed with a large sample size and only allows comparison of specific SA methods, since not all SA methods provide similar information (first-order sensitivity scores) as output, making it difficult to compare these algorithms. When working with a small sample of high-dimensional data, a different approach may be required than when working with many samples of low-dimensional data. Therefore, this study presents and compares common sensitivity analysis methods, both qualitatively and quantitatively, in order to provide rules of thumb for which methods to use and when, and extends the analysis to the more complex case of high-dimensional data for which only a relatively small sample set is available. The review of SA methods in this paper is limited to global sensitivity measures; comparing local SA methods would require a completely different approach that is beyond the scope of this work.

For demonstration and comparison purposes, to explain the different SA methods, in Sections III, IV, V and VI we use the transformed, multi-modal Rastrigin function (f3) from the Black-Box Optimization Benchmarking (BBOB) test suite within the COCO framework [11] as the function we want to analyze.
Figure 1 shows a surface plot of the function with n = 2 input parameters. In the examples in Sections III, IV, V and VI, n = 5 is used.

The following sections provide a comprehensive list of GSA methods, divided into four categories: variance-based, derivative-based, density-based, and model-based GSA methods.
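For readers who want to reproduce the examples, the untransformed Rastrigin core can be sketched in a few lines. This is a minimal sketch only: the actual BBOB f3 additionally applies shift, asymmetry and oscillation transformations that are omitted here.

```python
import numpy as np

def rastrigin(x):
    """Plain Rastrigin function. The BBOB f3 used in this paper applies
    further transformations (shift, asymmetry, oscillation) on top of
    this multi-modal core, which are omitted in this sketch."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

# Global minimum of the untransformed function is at the origin
print(rastrigin(np.zeros(5)))  # -> 0.0
```

Any of the GSA sketches below can then be run against this function as the black box under analysis.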

Variance-based methods rest on the assumption that variance is sufficient to describe the output uncertainty, an assumption made by Saltelli et al. [12]. Variance-based methods calculate the sensitivity of the input parameters via an ANOVA-like decomposition of the function.

For a formal definition of how FAST calculates the sensitivity, we refer to [16].
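To make the variance-based idea concrete, the following minimal sketch estimates first-order Sobol indices with a pick-and-freeze Monte Carlo scheme (a Saltelli-style estimator, not the Fourier-based FAST approach referenced above). The test function and sample size are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def sobol_first_order(f, d, n, rng):
    """Pick-and-freeze Monte Carlo estimate of first-order Sobol indices,
    using the estimator S1_i = E[f(B) * (f(AB_i) - f(A))] / Var(Y)."""
    A = rng.random((n, d))
    B = rng.random((n, d))
    fA, fB = f(A), f(B)
    var_y = np.var(np.concatenate([fA, fB]))
    s1 = np.empty(d)
    for i in range(d):
        AB = A.copy()
        AB[:, i] = B[:, i]          # freeze input i at B's values
        s1[i] = np.mean(fB * (f(AB) - fA)) / var_y
    return s1

# Toy additive function with analytic indices S1 = (0.2, 0.8, 0.0)
f = lambda X: X[:, 0] + 2.0 * X[:, 1]
s1 = sobol_first_order(f, d=3, n=50_000, rng=np.random.default_rng(42))
```

For the additive toy function above, the estimates land close to the analytic values (0.2, 0.8, 0.0), up to Monte Carlo noise.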

The Random Balance Designs Fourier Amplitude Sensitivity Test (RBD-FAST) [17] does not depend on a specific sampling scheme and works well with a basic Latin Hypercube Sampling (LHS) scheme. The hybrid RBD-FAST method is computationally much more efficient than classical FAST and gives similar performance according to [17]. An example of the first-order (S1) sensitivity calculated using RBD-FAST for f3 is given in Figure 4.

A. MORRIS
The Morris method [18] is a qualitative measure that offers several advantages compared to other SA strategies. First of all, it can work directly on discrete domains. In addition, it includes multi-dimensional averaging (i.e., the evaluation of the effect of a factor while the others are also varying) and allows grouping of factors, which makes it particularly suitable for very large data sets. Finally, the modified version considering the absolute means of the distribution of elementary effects is robust with respect to type II errors.

Given a random sampling of a vector X = (X_1, X_2, ..., X_n) from an input space Ω, which is an n-dimensional p-level grid, the Morris method generates a distribution F_i of elementary effects for every i-th input. If we consider r randomly sampled inputs, and build a trajectory on the input space by changing one parameter at a time by adding or subtracting a term Δ, the elementary effect for the i-th factor on the j-th trajectory is defined as

    EE_i^(j) = ( Y(X_1, ..., X_i + Δ, ..., X_n) - Y(X_1, ..., X_n) ) / Δ.

Based on this definition, two sensitivity indices are computed for each factor: the mean of the absolute values of the elementary effects, μ*, and their standard deviation, σ.

FIGURE 2. Sobol sensitivity indices. On the left, a stacked bar plot of the first (S1) and total sensitivity (ST) for each parameter, with the 95% confidence intervals for the first (S1 Conf) and total order (ST Conf) shown as dark blue error bars. On the right, the second-order sensitivities between pairs of parameters (dark blue is high, white is 0).
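The elementary-effect computation described above can be sketched as follows. This is a simplified one-at-a-time design with a fixed step on [0, 1]^n, not the full p-level trajectory design of the Morris method; the step size and test function are illustrative assumptions.

```python
import numpy as np

def elementary_effects(f, d, r, delta=0.25, rng=None):
    """Simplified Morris-style screening: for r random base points, perturb
    each input in turn by +delta and record the elementary effect
    EE_i = (f(x + delta*e_i) - f(x)) / delta.
    Returns mu* (mean of |EE|) and sigma (std of EE) per input."""
    rng = rng or np.random.default_rng(0)
    ee = np.empty((r, d))
    for j in range(r):
        x = rng.random(d) * (1.0 - delta)   # keep x + delta inside [0, 1]
        fx = f(x)
        for i in range(d):
            xp = x.copy()
            xp[i] += delta
            ee[j, i] = (f(xp) - fx) / delta
    return np.abs(ee).mean(axis=0), ee.std(axis=0)

# x1 acts linearly (sigma near 0), x2 nonlinearly (large sigma), x3 is inert
f = lambda x: x[0] + 4.0 * x[1] ** 2
mu_star, sigma = elementary_effects(f, d=3, r=200)
```

For this toy function, μ* for the inert input is exactly zero, the linear input has μ* = 1 with σ ≈ 0, and the quadratic input shows both a large μ* and a large σ, which is how the (μ*, σ) plane separates nonlinear or interacting factors from purely linear ones.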

FIGURE 3. Sobol network plot. Each node is a parameter; the size of the node is the total sensitivity index of the parameter, and the blue halo around the node is the confidence interval. The thickness of each edge denotes the second-order interaction between the two parameters it connects.
The first measures the influence of the i-th feature on the output: the larger it is, the more the feature contributes to the output variance.

where the features X_i1, X_i2, and X_i3 belong to the same group.

If ∂f/∂x_i ∈ L^2, meaning that it is a square-integrable function, the most used DGSM measure is defined as the mean value of (∂f/∂x_i)^2:

    ν_i = E[ (∂f/∂x_i)^2 ].

The higher the measure ν_i, the more influential the i-th factor. DGSM also supports grouping of factors. For further details, the reader is referred to [21]. An example of the ν_i indices found by DGSM is given in Figure 6.
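A minimal Monte Carlo sketch of the ν_i measure, using central finite differences in place of analytic derivatives (the test function, step size, and sample size are illustrative assumptions):

```python
import numpy as np

def dgsm_nu(f, d, n, h=1e-5, rng=None):
    """Monte Carlo estimate of the DGSM measure nu_i = E[(df/dx_i)^2],
    using central finite differences at points drawn from [0, 1]^d."""
    rng = rng or np.random.default_rng(0)
    X = h + rng.random((n, d)) * (1.0 - 2.0 * h)   # keep the stencil inside the box
    nu = np.empty(d)
    for i in range(d):
        Xp, Xm = X.copy(), X.copy()
        Xp[:, i] += h
        Xm[:, i] -= h
        nu[i] = np.mean(((f(Xp) - f(Xm)) / (2.0 * h)) ** 2)
    return nu

# The gradient of this toy function is (1, 3, 0) everywhere,
# so nu should come out close to (1, 9, 0)
f = lambda X: X[:, 0] + 3.0 * X[:, 1]
nu = dgsm_nu(f, d=3, n=2_000)
```

Note that for expensive models the finite-difference stencil doubles the evaluation cost per input; DGSM implementations typically use dedicated derivative-aware sampling instead.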

Density-based SA methods consider the entire probability density function (PDF) of the model output in order to calculate the sensitivity of the inputs and their interactions. Density-based SA methods are popular for their ability to overcome certain limitations associated with the interpretation of variance-based measures in the presence of dependencies among the model inputs. However, their estimation runs the risk of becoming infeasible when the number of model inputs is large (high dimensionality) or when a single evaluation of the model or function takes longer than a few minutes. Below we discuss two of the most popular density-based SA methods, DELTA and PAWN.

The DELTA (δ) [22] method is a density-based SA method that is independent of the sample generation method. The method provides both the first-order sensitivity and the δ index (similar to the total sensitivity) for each input parameter. DELTA aims at assessing the influence of the entire input distribution on the entire output distribution, without reference to a particular moment of the output. The moment-independent sensitivity indicator δ for a factor X_i is calculated as

    δ_i = (1/2) E_{X_i} [ ∫ | f_Y(y) - f_{Y|X_i}(y) | dy ]

FIGURE 7. δ sensitivity plot, indicating X_4 as the most influential parameter.
where:
• X is the vector of input factors and x ∈ X;
• f_Y(y) is the unconditional probability density function of the output Y;
• f_{Y|X_i}(y) is the conditional density of Y given that one of the parameters, X_i, assumes a fixed value.
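A crude histogram-based estimator can illustrate this definition: compare the output histogram inside equal-mass slices of X_i with the unconditional output histogram. This is an illustrative sketch only, not the estimator of [22]; the slice and bin counts are arbitrary choices, and finite sampling introduces a small positive bias on all inputs, including inert ones.

```python
import numpy as np

def delta_estimate(X, Y, n_slices=20, n_bins=20):
    """Histogram approximation of
    delta_i = 0.5 * E_{X_i}[ integral |f_Y(y) - f_{Y|X_i}(y)| dy ],
    comparing the conditional output histogram in each X_i slice
    with the unconditional one."""
    n, d = X.shape
    edges = np.histogram_bin_edges(Y, bins=n_bins)
    p_marginal = np.histogram(Y, bins=edges)[0] / n
    delta = np.zeros(d)
    for i in range(d):
        order = np.argsort(X[:, i])
        for idx in np.array_split(order, n_slices):   # equal-mass slices of X_i
            p_cond = np.histogram(Y[idx], bins=edges)[0] / len(idx)
            delta[i] += 0.5 * (len(idx) / n) * np.abs(p_cond - p_marginal).sum()
    return delta

rng = np.random.default_rng(1)
X = rng.random((20_000, 3))
Y = X[:, 0] + 5.0 * X[:, 1]     # the third input does not affect Y
d_hat = delta_estimate(X, Y)
```

On this toy problem the dominant input receives the largest δ estimate and the inert input the smallest (its estimate is only the sampling bias floor), which is the qualitative ranking one expects from the definition.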

An example of the calculated δ values for the BBOB f3 function is presented in Figure 7.

Model-based GSA methods fit (non-)linear functions in order to estimate the sensitivity of the input features.

Linear regression is a linear approach for modeling the relationship between the input features and the output. Using multiple linear regression (more than one variable), a set of coefficients of the linear function is fit to the data by minimizing the residual sum of squares between the observed targets and the predicted targets. The learned coefficients directly represent the relative importance of the input variables. In the experiments in this work, we scale the coefficients between 0 and 1 to resemble feature-importance indices.

In general, the linear regression model fits a function of the form of Equation 11:

    y = β_0 + Σ_{i=1}^{n} β_i X_i + ε,    (11)

where y is the dependent variable, β_0 is the intercept of the model, X_i corresponds to the i-th explanatory variable of the model, and ε is the random error with expectation 0 and variance σ^2.
where d is the number of dimensions. All other methods use exactly b samples. This means that only Morris, Sobol and DGSM use d additional samples, but since the number of samples is much larger than the dimensionality in this experiment, the effect of this can be neglected.

In the following experiment we run seven SA methods (DGSM, Delta, FAST, Morris, RBD-FAST, Sobol and PAWN), the simple statistical method of Pearson correlations, and two model-based methods (linear regression and random forest). We test these methods over a wide range of sample sizes (from 2^7 to 2^15 samples) for the 24 different noiseless BBOB functions [11]. Each method that requires a specific sampling scheme uses this specified scheme, while the other methods, such as Pearson, RBD-FAST, random forest and linear regression, use Latin Hypercube sampling. We repeat these experiments 10 times with different random seeds.

When we look at the highly multi-modal function f11 in Figure 11 and at function f21 in Figure 12,

The values given are the averages of these runs. We did not include TreeSHAP because it is model dependent and computationally more expensive.

From Figure 13 and Figure 14, it can be seen that most methods require many more samples as the total number of dimensions increases. In Figure 14, one can see that FAST in particular requires a minimum number of samples that depends on the number of features to be analyzed (the gray areas did not have enough samples to execute the algorithm). Two clear winners emerge from the tau statistics, namely the linear model and the Morris method, both of which perform very well even when the number of dimensions is high and the sample size is low. Note that the linear model has an advantage in this setup, since only linear functions are considered. In real-world applications this is unlikely to be the case, and the linear model would therefore be less useful. Sobol and FAST perform well when the sample size is sufficiently large. In terms of efficiency, the Delta and RF methods require the most computational effort and Pearson the least (although it is less effective and also relies heavily on the linearity assumption).
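The tau statistic referred to above is the Kendall rank correlation between a method's sensitivity scores and the ground-truth importances. A naive sketch of the statistic, with hypothetical score vectors for illustration:

```python
from itertools import combinations

def kendall_tau(a, b):
    """Naive O(n^2) Kendall rank correlation between two score vectors:
    (concordant pairs - discordant pairs) / total pairs."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical ground-truth importances vs. estimates from an SA method:
# the values differ, but the ranking is identical, so tau = 1.0
truth = [0.50, 0.30, 0.15, 0.05]
estimate = [0.45, 0.35, 0.10, 0.02]
print(kendall_tau(truth, estimate))  # -> 1.0
```

Because tau depends only on the ordering, it allows comparing methods that output sensitivity information on different scales (e.g., Sobol indices vs. Morris μ*).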

In this paper, we present methods that are global and model-free, i.e., they are independent of assumptions about the model, such as linearity, additivity, etc. Although the main goal is to provide model-based analytical tools to study how uncertainties in model output are related to uncertainties in input, quantitative methods are useful whenever the goal is to understand the extent to which a particular factor is more important than another. Variance-based measures are an example of quantitative methods. The choice between quantitative and qualitative methods usually depends on the costs and characteristics of the case study under investigation. Relatedly, there are several desirable characteristics for GSA methods:

• Computation of first-, second-, and total-order sensitivities. Depending on the purpose of the study, a GSA method should be able to compute different sensitivity indices.

VOLUME 10, 2022

In addition to this set of characteristics that a GSA method can support, the results of the first and second quantitative experiments are included in Table 2.

In order to make Global Sensitivity Analysis methods easy to apply, to combine the information that can be extracted with this large variety of methods, and to automatically select the right methods, we propose the GSAreport application. With GSAreport one can create a detailed and interactive HTML report for a specific dataset, machine learning model or real-world process. The application selects the best GSA methods to use based on the number of dimensions and the number of samples per dimension. If the number of dimensions is over 64, Sobol and PAWN are omitted. If the number of samples per dimension is less than 50, Sobol, Delta, and PAWN are not applied. These rules follow the observations from Table 2. FAST and DGSM are not included in the application due to their relatively low performance. The GSAreport application can be easily installed by downloading the executable, using the provided Docker image, or setting up the Python dependencies. No programming skills are required to use the package. The software generally works in two steps: a sampling step and an analysis and report-generation step. In the sampling step, the software can generate the different designs of experiments required for the application of all supported methods.

The Morris method is here applied to provide an example of GSA in a specific real-world scenario dealing with genomic prediction, i.e., prediction of phenotype (output) based on genomic data (input).
The target of the application is to understand which parts of the genome, i.e., which Single Nucleotide Polymorphisms (SNPs), play a major role in the predicted outcome. Among the GSA techniques discussed, the Morris method was chosen for this particular application because it can handle the very large number of discrete variables characterizing this application study and includes multi-dimensional averaging, i.e., it is able to account for interactions among factors.

TABLE 2. Comparison of different GSA and XAI methods based on different aspects and supported features. A + indicates that the feature is present for a method. For the performance in accuracy and time, a distinction is made between the best methods (++), good methods (+), worse methods (−) and the worst method for a given situation (−−). *The TreeSHAP method is model dependent; however, classical Shapley values are model independent. **The linear model has an advantage due to the experimental setup and is therefore not included in this overview regarding accuracy performance. ***The TreeSHAP method was not included.

The GSA methodology based on the Morris method comprises several building blocks, which are illustrated in Figure 15. Using a breeding simulation platform, several generations of random mating are simulated to create a population of 1000 plants. In this simulation the genotype of a plant is represented using 27205 SNPs, split over 5 chromosomes. The exact location of a SNP on a chromosome is sampled uniformly at random. To calculate the trait value (phenotype) of a plant, we assume an epistatic trait model based on 20 SNPs, listed in Table 4. In an epistatic trait model, the contribution of a SNP to the trait is based on both the states of that SNP and the states of (some) other SNPs through interactions. This population is split randomly, 80% for training and 20% for testing.
A Convolutional Neural Network (CNN) model is trained on this data set. In accordance with the literature [31], and in order to be able to draw sound statistics from the results, the training is repeated 10 times with different random seed initializations. Once the models are trained, a Morris sensitivity analysis is performed for each of them, following the steps described in Section IV-A: a sample matrix is generated by considering 10 initial random designs and varying one SNP at a time on a k-dimensional 3-level (0, 1, 2) grid, where k = 27205 is the number of SNPs and, as a consequence, the number of factors of our sensitivity analysis. Once the Morris sample matrix is generated, the phenotype associated with each genotype, i.e., the trait value associated with each sequence of SNPs, is evaluated on the CNN prediction models. Finally, the GSA is performed, considering both the standard and the grouping version of the Morris method.

The ground truth contains a dense range of features between indices 25500 and 26800. Let us take a look at Figure 16, which provides an example of the (μ*, σ) distribution for the 27205 SNPs under analysis. Although we obtained different point distributions over the (μ*, σ) plane for different CNN training random seeds, all of them show, in the top right corner, SNPs belonging to the same range.

Hence, to draw sound conclusions, we present the two line plots in Figure 17, which depict the trend of the μ* sensitivity measure. Grouping factors into subgroups thus allows us to run a much cheaper GSA while providing results that are easily readable and consistent with the full SA over all factors.
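The one-SNP-at-a-time sample generation described above can be sketched as follows. This is a simplified illustration with a small k; the wrapping one-level step is an assumption of this sketch, and the real design may choose perturbation steps differently.

```python
import numpy as np

def morris_snp_design(n_designs, k, rng=None):
    """Build a Morris-style one-at-a-time design on a k-dimensional
    3-level (0, 1, 2) grid: for each random base genotype, emit the base
    row plus k rows in which exactly one SNP is moved one level."""
    rng = rng or np.random.default_rng(0)
    rows = []
    for _ in range(n_designs):
        base = rng.integers(0, 3, size=k)
        rows.append(base)
        for i in range(k):
            x = base.copy()
            x[i] = (x[i] + 1) % 3    # perturb SNP i only (wrapping step)
            rows.append(x)
    return np.asarray(rows)

X = morris_snp_design(n_designs=10, k=50)   # small k for illustration
```

With 10 initial designs and k factors, the design has 10 * (k + 1) rows; at k = 27205 this already means 272060 model evaluations, which is why the grouping version of the method discussed below is attractive.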
From the plots, it can be observed that all the major peaks detect some SNPs belonging to the ground truth, but not the other way around, meaning that some ground-truth features are located in ranges with low μ* according to the Morris analysis. However, the highest peaks are located in correspondence with the neighbourhoods with the highest density of ground-truth features, highlighting space for further investigation in localized portions of the factor domain. In addition, the most important features/groups seem to influence the importance of the neighbouring features/groups, suggesting a correlated behaviour between them. This is because neighbouring SNPs on the same chromosome are very likely to be inherited together.

To take advantage of the biological reality of SNP interaction, a different logic for grouping factors, based on neighbourhood, is used. We split the factors into groups based on the pairwise correlation between consecutive SNPs, computed on the training data. In particular, we evaluate correlations using the Pearson product-moment correlation coefficient R_ij = Cov(X_i, X_j)/(σ_i σ_j), where Cov is the covariance function and σ_i is the standard deviation of a random variable X_i. Given three thresholds for this coefficient, namely R = 0.90, 0.98, 0.99, we split the factor sequence into different groups whenever the pairwise correlation of a factor with one of its two neighbours falls below the threshold. Hence, we obtain 5, 11, and 38 different groups for the three thresholds, respectively. Figure 18 shows the averaged line plots representing the values calculated for the μ* sensitivity measure for all feature groups.
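The threshold-based splitting can be sketched as follows; the synthetic columns stand in for SNPs, and the block sizes and noise level are illustrative assumptions.

```python
import numpy as np

def group_by_neighbour_correlation(X, threshold):
    """Split consecutive columns (SNPs) into groups: start a new group
    whenever the Pearson correlation between a column and its predecessor
    falls below the threshold."""
    d = X.shape[1]
    groups, current = [], [0]
    for i in range(1, d):
        r = np.corrcoef(X[:, i - 1], X[:, i])[0, 1]
        if abs(r) < threshold:
            groups.append(current)
            current = []
        current.append(i)
    groups.append(current)
    return groups

# Synthetic data: two blocks of near-identical columns with a break between
rng = np.random.default_rng(0)
base1, base2 = rng.random(300), rng.random(300)
X = np.column_stack(
    [base1 + 0.01 * rng.normal(size=300) for _ in range(3)]
    + [base2 + 0.01 * rng.normal(size=300) for _ in range(4)]
)
print([len(g) for g in group_by_neighbour_correlation(X, 0.9)])  # -> [3, 4]
```

Lowering the threshold merges neighbouring blocks into fewer, larger groups; raising it produces more, smaller groups, matching the 5/11/38 group counts reported for R = 0.90, 0.98, 0.99.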

It is evident that by choosing R = 0.90 we have SNPs from the ground truth falling into each group. In this case, the 5 groups most likely correspond exactly to the 5 chromosomes we used for simulating the data. According to our SA, high values of μ* correspond to a high number of relevant SNPs belonging to a group. In fact, the fifth group, which is the one with the highest μ* value, is composed of the SNPs ranging from index 20886 to 27205, and hence also contains the highest number of features listed in Table 4. By choosing R = 0.98, some groups do not contain SNPs from the ground truth (G2 and G8). These groups are irrelevant according to the Morris SA. On the other hand, G6, ranging from feature 14564 to 16757, does not contain any ground-truth SNP either, but has some relevance according to the results of the SA. In any case, according to the data from the ground truth, G11 = [23665, 27205] is the densest group, and it has a sufficiently high value of μ* to attract our attention and be worth further, more focused SA. By choosing R = 0.99 we obtain even higher agreement between experimental results and ground truth. In fact, all the groups that are considered relevant according to our SA also contain some SNPs in Table 4. However, it has to be stressed that the correlation-based grouping of SNPs creates factor groups of different cardinalities and hence tends to favour bigger groups, except for the first subplot in Figure 18, where all groups have similar cardinality.

For future research, new methods could be incorporated into the experiments, and a methodology to test accuracy on more complex functions could be explored. This work was limited to the comparison of global SA methods; there are also many local SA methods, which would require different experimental setups for comparison.