Gaussian Process Regression with Local Explanation

Gaussian process regression (GPR) is a fundamental model used in machine learning. Owing to its accurate prediction with uncertainty and versatility in handling various data structures via kernels, GPR has been successfully used in various applications. However, in GPR, how the features of an input contribute to its prediction cannot be interpreted. Herein, we propose GPR with local explanation, which reveals the feature contributions to the prediction of each sample, while maintaining the predictive performance of GPR. In the proposed model, both the prediction and explanation for each sample are performed using an easy-to-interpret locally linear model. The weight vector of the locally linear model is assumed to be generated from multivariate Gaussian process priors. The hyperparameters of the proposed models are estimated by maximizing the marginal likelihood. For a new test sample, the proposed model can predict the values of its target variable and weight vector, as well as their uncertainties, in a closed form. Experimental results on various benchmark datasets verify that the proposed model can achieve predictive performance comparable to those of GPR and superior to that of existing interpretable models, and can achieve higher interpretability than them, both quantitatively and qualitatively.


Introduction
Gaussian processes (GPs) have been well studied for constructing probabilistic models as priors of nonlinear functions in the machine learning (ML) community. They have demonstrated great success in various problem settings, such as regression [Rasmussen, 2003, Wilson et al., 2012, classification [Rasmussen, 2003, Csató et al., 2000, time-series forecasting [Roberts et al., 2013], and black-box optimization [Snoek et al., 2012]. A fundamental model on GPs is Gaussian process regression (GPR) [Rasmussen, 2003]; owing to its high predictive performances and versatility in using various data structures via kernels, it has been used in not only the ML community, but also in various other research areas, such as finance [Gonzalvez et al., 2019], geostatistics [Camps-Valls et al., 2016], material science [Zhang et al., 2020] and medical science [Cheng et al., 2017, Futoma, 2018.
GPR is defined on an infinite-dimensional feature space via kernel functions. Therefore, it requires the values of the kernels defined on pairs of samples, i.e., a covariance matrix of the samples as an input, rather than the samples themselves. Owing to the nonlinearity of the kernel, GPR enables nonlinear predictions. In terms of interpretability, Figure 1: Example of explanation for prediction by GPX for a sample on the Boston housing dataset [Harrison Jr and Rubinfeld, 1978]. Red and blue bars indicate positive and negative predicted feature contributions, respectively, and error bars indicates predicted standard deviation of the feature contributions. We provide further examples and feature description in Appendix A.
the covariance is useful for understanding the relationship between the samples; however, since the kernels make the features invisible, GPR cannot explain which features contribute to the predictions, like linear regression models. Therefore, it prevents us from judging whether the predictions by GPR are reasonable and performed by fair decision.
For the interpretability of ML, several methodologies that explain the features that contribute to the outputs of prediction models, including GPR, have been proposed; in this case, the prediction models are often regarded as black-boxes [Molnar, 2019, Chen et al., 2018. Their representative methods are local interpretable model-agnostic explanations (LIME) [Ribeiro et al., 2016] and Shapley additive explanations (SHAP) [Lundberg and Lee, 2017], which approximate the prediction for each test sample by a locally linear explanation model. Since the weights of the learned explanation model represent feature contributions to the prediction, they can assist ML practitioners and scientists to understand the behavior of the prediction models and the functionality of the features in the prediction. However, some limitations exist in these methods. First, because the forms of the prediction and explanation models differ, it is unsure whether the estimated feature contributions reflect those of the prediction model. Second, because the explanation model is learned on each test sample, it may not obtain consistent explanations on similar samples.
To overcome the aforementioned limitations, we propose a novel framework for GP-based regression models, Gaussian process regression with local explanation, called GPX, which reveals the feature contributions to the prediction for each sample, while maintaining the predictive performance of GPR. In GPX, both the prediction and explanation for each sample are performed using an easy-to-interpret locally linear model. Therefore, no gap exists between the prediction and explanation. The weight vector of the locally linear model is assumed to be generated from multivariate GP priors [Álvarez et al., 2012]. As the multivariate GP priors have a covariance function defined as kernels on the samples, GPX ensures that similar samples have similar weights. The hyperparameters of GPX are estimated by maximizing the marginal likelihood, in which the weight vectors for all the training samples are integrated out. For a test sample, the predictions with their uncertainties of the target variable and weight vector are obtained by computing their predictive distributions. The explanation for the predicted target variable is provided using the estimated weight vector with uncertainty, as shown in Figure 1. Depicting the explanation with uncertainty helps users of GPX judge the reasonability of the predicted weights.
In experiments, we evaluated GPX both qualitatively and quantitatively in terms of predictive performance and interpretability on various benchmark datasets. The experimental results show that 1) GPX can achieve predictive errors comparable to GPR and lower errors compared with existing interpretable methods, 2) it can outperform model-agnostic interpretable methods and locally linear methods in terms of three interpretability measurements, and 3) the feature contributions produced by GPX are appropriate.

Related Work
Linear regression models are simple types of interpretable models, as their weight for each feature directly represents the contribution to the output of the models if the feature has a binary value, where the contribution helps us to interpret the effectiveness of the feature on a task. For real-valued features, the contribution for each feature is calculated as the product of the feature value and its corresponding weight. A number of studies introducing various regularizations have been conducted to produce methods such as the ridge regression [Hoerl and Kennard, 1970] and lasso [Tibshirani, 1996] methods. These models have a global weight vector that is shared across all samples. In kernel methods, automatic relevance determination (ARD), which considers the global relevance of each feature contained in kernel functions, is widely used [Neal, 2012, Wipf andNagarajan, 2008]. The above approaches are beneficial in understanding global effectiveness of features. However, since linear regression models often does not fit real complicated problems, resulting in low predictive accuracy, the estimated weights may be also unreliable. In addition, these approaches cannot estimate weights/relevances appropriate for individual samples, which means that they cannot cope with the need of the significant changes of the weights/relevances among the samples. For example, as shown in the result on Digits dataset (Figure 3), this is crucial in image classification tasks that important pixels (i.e., features) and their weights change depending on individual images (i.e., samples).
On the other hand, some locally linear models for regression have been proposed, such as the network lasso [Hallac et al., 2015] and localized lasso [Yamada et al., 2017], which have a weight vector for each sample. Therefore, these methods can avoid the drawbacks of globally linear models. To receive the benefit, we focus on generating predictions with explanations using locally linear models. In the network and localized lasso, the weights of locally linear models are estimated via optimization with network-based regularization, where the network must be defined on samples in advance. If the network is not provided, as assumed in standard regression problems, we can construct a k-nearest-neighbor graph of samples to create a network, where k is a hyperparameter that must be optimized via cross validation. Meanwhile, GPX can estimate weights and their uncertainties without constructing graphs by assuming that weights are determined by functions generated from GPs.
For GP models, [Paananen et al., 2019] proposed feature selection methods that quantify the feature relevances by measuring the difference between the predictive posterior for each sample and that for its vicinity. The feature relevance can be used for evaluating the importance of the feature in prediction. Meanwhile, the feature contribution produced by GPX indicates whether the feature makes a positive or negative contribution to an individual prediction and how much it contributes.
With regard to research on deep neural networks (DNNs), a number of studies have been conducted on making predictions generated by DNNs interpretable [Chen et al., 2019, Arras et al., 2017, Ying et al., 2019. Some of these studies have developed methods that make interpretable predictions by generating locally linear models for each sample using DNNs [Melis and Jaakkola, 2018, Schwab et al., 2019, Yoshikawa and Iwata, 2020. These concepts inspired our study, but we formalize our model without DNNs. To the best of our knowledge, our study is the first to develop a GP-based regression model with local explanations. Compared to the DNN-based locally linear models, GPX has mainly two benefits: 1) GPX can produce the feature contributions with uncertainty, and 2) GPX can be straightforwardly used for tasks in which GPR can be used advantageously, such as Bayesian optimization [Snoek et al., 2012, Golovin et al., 2017. [Wilson et al., 2012] proposed Gaussian process regression networks, which have similar structure with multi-layer perceptron. Herein, the weights of the networks are generated from GPs. Although their idea is related to GPX, their work did not focus on improving the explainability of the predictions by GPR.

Proposed Model
In this section, we describe the proposed model, i.e., Gaussian process regression with local explanation, called GPX.
We consider a scalar-valued regression problem. Suppose that training data D = {(x i , y i , z i )} n i=1 containing n samples is provided. x i ∈ X is an original input representing the ith sample, where X is an original input space. Although a typical representation for x i is a vector, it can be any data representation on which kernel functions are defined, such as graphs [Vishwanathan et al., 2010] and sets [Muandet et al., 2012, Yoshikawa et al., 2014. y i ∈ R is a target variable for the sample. z i ∈ R d is a d-dimensional vector of simplified representation for x i . Because GPX explains the prediction via a simplified representation, the meaning of each dimension of z i should be easily understood by humans, e.g., tabular data and bag-of-words representation for text. z i is an optional input; therefore, if x i can be used as a simplified representation, one can define In GPX, both the prediction of target variables y and their explanations are performed via easy-to-interpret locally linear models, i.e., target variable y i for the ith sample is assumed to be obtained using locally linear function f i : R d → R, defined as follows: where w i ∈ R d is a d-dimensional weight vector for the ith sample, and y ∼ N (0, σ 2 y ) is a Gaussian noise with variance σ 2 y > 0. Here, the explanation for the ith sample is obtained using either weight vector w i or feature ∈ R n×d without any constraints is an ill-posed problem because the number of free parameters in W , nd, is larger than that of target variable n. To avoid this problem in GPX, we assume that functions determining W are generated from a multivariate GP. More specifically, weight vector w i for the ith sample is obtained as follows: Gaussian noise with variance σ 2 w > 0, and I d is an identity matrix of order d. Here, vector-valued function g : X → R d is a function that determines the weight vector for each sample, and each element of g is generated from a univariate GP independently, as follows: where where m(x) is the mean function, and k θ (x, x ) is the covariance function with set of parameters θ. Herein, we use zero mean function for m(x). Covariance function k θ (x, x ) is a kernel function defined on two inputs x, x ∈ X . For example, one can use a scaled RBF kernel with parameters θ = {θ 1 , θ 2 } as the kernel function when x, x are vectors, defined as follows: By using g generated as such, GPX ensures that two similar samples, i.e., those having a large kernel value, have similar weight vectors.
We let G = (g(x i ) ) n i=1 ∈ R n×d . Based on the generative process above, the joint distribution of GPX is written as follows: where where G i,· and G ·,l denote the ith row and lth column vectors of G, respectively, and K = (k θ (x i , x j )) n i,j=1 is a Gram matrix of order n, which is identical to the requirement in GPR.

Training and Prediction
In this section, we describe the derivation of the marginal likelihood of GPX, the hyperparameter estimation for GPX, and the derivation of the predictive distributions of target variables and weight vectors for test samples.

Marginal likelihood
To ease the derivation of the marginal likelihood, we first modified the formulation of the joint distribution (6), while maintaining its mathematical meanings, as follows: whereK is a block diagonal matrix of order nd whose block is K, and vec(·) is a function that flattens the input matrix in a column-major order. Here,Z where diag(·) is a diagonal matrix whose diagonal elements possess the values of the input vector. In (6), d functions that output n-dimensional column vectors in W are generated from GPs; however, in (10), it is rewritten such that a single function that outputs an nd-dimensional flatten vector vec(W ) is generated from a single GP. Consequently, the likelihood of target variables y can be rewritten as a single multivariate normal distribution.
Subsequently, we derived the marginal likelihood by integrating out G and W in (10). Owing to the property of normal distributions, it can be obtained analytically, as follows:

Hyperparameter estimation
If k θ (x, x ) is differentiable with respect to θ, all the hyperparameters, i.e., θ, σ w , and σ y , can be estimated by maximizing the logarithm of the marginal likelihood with respect to them for the training data using gradient-based optimization methods, e.g., L-BFGS [Liu and Nocedal, 1989].

Predictive distributions
For a new test sample (x * , z * ), our goal is to infer the predictive distributions of target variable y * and weight vector w * . First, the predictive distribution of y * is obtained similarly as in the standard GPR, as follows: where ∈ R n and c * * = σ 2 y + k θ (x * , x * ) + σ 2 w z * z * ∈ R. Second, the predictive distribution of w * is obtained by solving the following integral: where we define where each block is an n-by-1 matrix, and (l, l)-block of the block matrix is (k θ (x * , x i )) n i=1 for l = 1, 2, · · · , d, and the other blocks are zero matrices. Solving the integral analytically according to the property of the normal distributions, we obtain We provide the detailed derivation of predictive distributions (14) and (18) in Appendix B.
The marginal likelihood (13) and the predictive distribution for y * (14) are similar to those of GPR, except that GPX can obtain the predictive distribution for w * (18). Since GPX can be used with the same input as GPR if Z = X, it can be employed in existing ML models, instead of GPR.

Computational efficiency
As with ordinary GPR, the computational cost of GPX is dominated by the inverse computation. The computation of A requires inverting a square matrix of order nd,K + σ 2 w I nd . However, because the matrix is block diagonal and every diagonal block comprises K + σ 2 w I n , a square matrix of order n, A can be obtained by inverting K + σ 2 w I n only once. The remaining inverse matrix C −1 is of order n. Therefore, all the inverse matrices appearing in GPX can be obtained using a naive implementation with a computational complexity of O(n 3 ), which is the same as that in GPR. To significantly reduce the computational cost, efficient computation methods for GPR, such as the inducing variable method [Titsias, 2009] and KISS-GP [Wilson and Nickisch, 2015], can be used for GPX. In addition, becausē k * ,K andZ are sparse matrices, one can obtain the predictive distributions efficiently using libraries for sparse matrix computation.

Experiments
In this section, we demonstrate the effectiveness of the proposed model, GPX, quantitatively and qualitatively, by comparing various interpretable models. Through a quantitative evaluation, we evaluated the models based on the following perspectives: • Accuracy: How accurate is the prediction of the interpretable model?
• Faithfulness: Are feature contributions indicative of "true" importance?
• Sufficiency: Do k-most important features reflect the prediction?
• Stability: How consistent are the explanations for similar or neighboring examples?
In addition, we qualitatively evaluated whether the feature contributions produced by the models were appropriate by visualizing them. Subsequently, we experimentally compared the computational efficiency of the models.
All the experiments were done with a computer with Intel Xeon Gold 6132 2.6GHz CPU with 16 cores, and 120GB of main memory.

Preparation Datasets
We used eight datasets in the UCI machine learning repository [Dua and Graff, 2017], referred to as Digits, Abalone, Diabetes, Boston, Fish, Wine, Paper and Drug in our experiments. We provide the details of the datasets in Appendix C. Digits dataset is originally a classification dataset for recognizing handwritten digits from 0 to 9. To use it as a regression problem, we transformed the labels into target variables y of scalar values, i.e., the target variables for the labels from 0 to 4 were −1, and those for the remaining labels were 1. With Paper and Drug datasets whose samples were represented as sentences, the original input X and the simplified input Z differed, i.e., we used the 512-dimensional sentence vectors obtained using Sentence Transformers [Reimers and Gurevych, 2020] as X, while we used bag-of-words binary vectors for the sentences as Z. Each of the remaining datasets had the same X and Z. In all the datasets, the values of X and y were standardized before training and prediction. For a quantitative evaluation of each dataset, we evaluated the average scores over five experiments performed on different training/test splittings, where the training set was 80% of the entire dataset, whereas the remaining was the test set.

GPX setup
In GPX, we consistently used a scaled RBF kernel defined as (5). The hyperparameters of GPX were estimated based on the method described in Section 4, where they were initialized with θ 1 = 1.0, σ y = 0.1 and σ w = 0.1. In addition, we initialized bandwidth parameter θ 2 using median heuristics [Garreau et al., 2017].

Comparing methods
We compared GPX with several methods with globally or locally linear weights that can used as interpretable feature contributions or relevances for predictions. Lasso [Tibshirani, 1996] and Ridge [Hoerl and Kennard, 1970] are standard linear regression models with 1 and 2 regularizers, respectively, where their weights are globally shared across all samples. ARD is a GPR model with ARD kernel that identifies the relevance of each feature. The network lasso ("Network" for short) is a locally linear model that regularizes the weights of nodes such that neighboring nodes in a network have similar weights [Hallac et al., 2015]. In our case, each node represents a sample, and the network is a k-nearest neighbor graph on the samples based on the cosine similarity on X. The localized lasso ("Localized" for short) is an extension of the network lasso; it can estimate the sparse and exclusive weights of each sample by further incorporating an 1,2 regularizer into the network lasso [Yamada et al., 2017].
To compare model-agnostic interpretable methods with GPX in terms of explainablity for prediction, we used LIME [Ribeiro et al., 2016] and Kernel SHAP [Lundberg and Lee, 2017], which produce a locally linear model   for each test sample to explain the prediction by a black-box prediction model. For a fair comparison, we used GPR with RBF kernel as the prediction model. In addition, we used a Kullback-Leibler (KL) divergence-based feature selection method for GPR with ARD kernel ("KL" for short) [Paananen et al., 2019]. The hyperparameters of GPR were estimated by maximizing marginal likelihood, similarly as for GPX. Meanwhile, those of the remaining comparing methods were optimized by grid search. We provide the detailed description of the comparing methods in Appendix D.

Accuracy
First, we demonstrate the predictive performances of GPX and the comparing methods in Table 1. GPX achieved the lowest predictive errors on all the datasets, compared to the other globally or locally linear models. In addition, their predictive errors were comparable to that of GPR on all the datasets. This result indicates that GPR can be replaced by GPX to achieve similar predictive performances.

Faithfulness
Assessing the correctness of the estimated contribution of each feature to a prediction requires a reference "true" contribution for comparison. As this is rarely available, a typical approach for measuring the faithfulness of the contributions produced by interpretable models is to rely on the proxy notion of the contributions: observing the effect of removing features on the model's prediction. Following previous studies [Melis andJaakkola, 2018, Bhatt et al., 2020], we computed the faithfulness score by removing features one-by-one, measuring the differences between the original predictions and the predictions from the inputs without the removed features, and calculating the correlation between the differences and the contributions of the removed features. Table 2 shows the faithfulness scores of GPX and the comparing methods. Here, we denote the results of LIME and Kernel SHAP using GPR as the black-box prediction model by GPR+LIME and GPR+SHAP, respectively. We found that GPX achieved the best faithfulness scores on all the datasets. As GPX predicts and explains using a single locally linear model for each test sample, when removing a feature from the input, the contribution of the feature is subtracted from the prediction directly. Meanwhile, because GPR+LIME, GPR+SHAP and KL have different prediction and explanation models, a gap may exist between the estimated contribution in the explanation model and the latent contribution in the prediction. Because the predictions by GPX and GPR were performed using similar calculations, their faithfulness differences were likely due to the gap. With ARD, it cannot estimate feature relevances appropriate for each sample, as the feature relevances are shared over samples; therefore, it produced relatively low faithfulness scores.

Sufficiency
In general, the inputs contain many irrelevant features that do not contribute to the predictions, and discovering important features in all the features is difficult for users of the models. Therefore, a desirable property of the interpretable models is that it can assign high contributions only for important features that affect the predictions well. To quantify how each method satisfies the property, we define the sufficiency score at k, where k is the number of important features. In particular, the sufficiency score at k was computed by identifying k important features in the descending order of the absolute values of their estimated contributions, predicting from the inputs having only k important features, and comparing them against the original predictions. Because the number of important features varied according to the sample and dataset, we evaluated them at k = 1, 2, · · · , 10. Figure 2 shows the sufficiency scores of GPX and the comparing methods. Independent of the k values, GPX outperformed the others on Digits dataset, whereas GPX, GPR+LIME, and GPR+SHAP produced the best sufficiency scores on Diabetes, Fish and Wine datasets. These results indicate that GPX was appropriately assigned high contributions for the important features. On Abalone and Boston datasets, GPX was slightly inferior to the localized lasso at k = 1, 2, although GPX outperformed it at k ≥ 3. This is because the localized lasso has a regularizer that induces sparse weights. This result suggests that GPX can be further improved by employing the mechanism for generating sparse weights.

Stability
To generate meaningful explanations, interpretable methods must be robust against local perturbations from the input, as explanations that are sensitive to slight changes in the input may be regarded as inconsistent by users. In particular, flexible models such as locally linear models might be sensitive to such changes for achieving better predictions. As with the work by [Melis and Jaakkola, 2018], we used the following quantity for measuring the stability of the estimated weights for test sample (x * , z * ), as follows:  where, D te = {(x * , z * )} is a set of test samples; w * and w * are the standardized estimated weights associated with test samples (x * , z * ) and (x * , z * ), respectively; > 0 is a parameter that determines neighboring samples; m is the dimensionality of x * . We set = 0.05 in our experiments. Intuitively, the stability score will be high when the estimated weights for the sample and its neighboring samples are similar. Subsequently, we computed the stability score on a dataset by averaging the quantity (19) on all the test samples in the dataset. Table 3 shows the stability scores on each dataset. GPX achieved the best stability scores on all the datasets. With GPR+LIME and GPR+SHAP, their stability scores were lower than that of GPX, although the prediction powers of GPX and GPR were comparable. This would be because LIME and Kernel SHAP estimated the weights independently over the test samples. The stability score of KL was as good as that of GPX only on Paper dataset; however, on the other datasets, KL was inferior to than GPX.

Qualitative comparison
Finally, we qualitatively compared the estimated weights using GPX and the comparing methods on Digits dataset, in which the appropriate contributions for predictions were apparent. For this comparison, we rescaled the inputs X and Z to be within [0, 1]. Figure 3 shows the estimated weights on two samples. We provide the results for all the digits in Appendix E. On this dataset, the appropriate weights can be obtained by assigning weights having the same sign with the target variable to black pixels. We found that the methods except for GPX and the localized lasso could not estimate reasonable weights. Meanwhile, the weights estimated by GPX and the localized lasso were appropriate, although they exhibited different characteristics, i.e., dense weights from GPX, whereas sparse ones from the localized lasso. The task determines the better explanation; however, as showing important regions rather than pixels is meaningful for images, the estimated weights using GPX would be easier to interpret on Digits dataset. Furthermore, the degree of sparsity in the localized lasso can be changed as a hyperparameter; if the value of the hyperparameter is zero, the localized lasso is identical to the network lasso. However, because the estimated weights using the network lasso were inappropriate, those using GPX cannot be mimicked by the localized lasso. Table 4: Training, prediction and total times (seconds) of each method on the Digits dataset. The training times of GPR+LIME/SHAP and KL are those of GPR and ARD, respectively. The prediction times of GPR+LIME/SHAP and KL are those of producing explanations for all the test samples. GPX (ours) GPR+LIME GPR+SHAP KL Localized Network ARD Computational efficiency Table 4 shows the computational times of each of the methods on Digits dataset. First, the training time of GPX was much the same as those of GPR and ARD, and significantly faster than those of the localized and network lasso. Since the localized and network lasso requires the hyperparameter search, the their actual training times were about 48 and 12 times longer than the times shown in the table, respectively. Second, the prediction time of GPX was significantly faster than those of GPR+LIME/SHAP and KL. This is because GPX does not require learning model parameters at prediction phase. On the other hand, since GPR+LIME/SHAP learns an explanation model for each sample at that time, and KL requires producing a lot of predictions with slight changes in the value of each dimension of the input, their prediction times lead to increase.

Conclusion
We proposed a GP-based regression model with sample-wise explanations. The proposed model assumes that each sample has a locally linear model, which is used for both prediction and explanation, and the weight vector of the locally linear model are generated from multivariate GP priors. The hyperparameters of the proposed models were estimated by maximizing the marginal likelihood, in which all the weight vectors were integrated out. Subsequently, for a test sample, the proposed model predicted its target variable and weight vector with uncertainty. In the experiments, we confirmed that the proposed model outperformed the existing globally and locally linear models and achieved comparable performances with the standard GPR in terms of predictive performance, and the proposed model was superior to the existing methods, including model-agnostic interpretable methods, in terms of three interpretability measurements. Then, we showed that the feature weights estimated by the proposed model were appropriate as the explanation.
In future studies, we will confirm the effectiveness of the proposed model by applying its concept into various problems in which GPs have been successfully used, such as classification, black-box optimization, and time-series forecasting. In addition, we will extend the proposed model for further improvements in interpretability, e.g., by employing the mechanism of inducing sparsity for the weight vectors.

Acknowledgment
This work was supported by JSPS KAKENHI Grant Number 18K18112.

A Feature Description and Additional Examples for the Boston Housing Dataset
The Boston housing dataset, referred to as "Boston" in our experiments, contains information collected by the U.S. Census Service regarding housing in the area of Boston, Massachusetts [Harrison Jr and Rubinfeld, 1978] and is used for predicting house prices based on the information. Table 5 lists the names of the features and their descriptions for the Boston housing dataset. Figure 4 presents four examples of feature contributions estimated by GPX. We found that each of these examples has different feature contributions, although some of the features, such as "AGE" and "DIS," had consistent positive or negative contributions, respectively.

B Detailed Derivation of Predictive Distributions
In this appendix, we describe the derivation of predictive distributions in detail. For a new test sample (x * , z * ), our goal is to infer the predictive distributions of the target variable y * and weight vector w * . The predictive distribution of y * is obtained similarly to the standard GPR [Rasmussen, 2003]. In Section 4, we demonstrated that the marginal distribution of training target variables y for GPX is defined as where C = σ 2 y I n + (K + σ 2 w I n ) ZZ . According to (20), the joint marginal distribution of y and y * is defined as where The predictive distribution of y * is the conditional distribution of y * given y with training and testing inputs. Therefore, it can be obtained by applying the formula of conditional distributions for normal distributions [Petersen and Pedersen, 2012, Eq. (354)] to (21) as follows: Predictive distribution of w * The predictive distribution of w * can be obtained by solving the following equation: where the first integrand p(w * | W , X, x * ) is the conditional distribution of w * and the second integrand p(W | D) is the posterior distribution of W . The conditional distribution of w * is derived similarly to the conditional distribution where W ·,l is the lth column vector of W . According to this, the joint distribution of W and w * is defined as where we let k * = (k θ (x * , x i )) n i=1 and k * * = k θ (x * , x * ) + σ 2 w . Subsequently, we can obtain the conditional distribution of w * by applying the formula of conditional distributions for normal distributions [Petersen and Pedersen, 2012, Eq. (354)] to (25) as follows: Here, we can rewrite (26) as a single d-dimensional multivariate normal distribution as follows: whereK is a block diagonal matrix of order nd whose block is K, vec(·) is a function that flattens the input matrix in column-major order, andc * * = k θ (x * , x * ) + σ 2 w I d .k * is an nd-by-d block matrix, where each block is an n-by-1 matrix the and (l, l)-block of the block matrix is (k θ (x * , x i )) n i=1 for l = 1, 2, · · · , d, while the other blocks are zero matrices. By letting A =k * K + σ 2 w I nd −1 , we obtain To derive the posterior distribution of W , p(W | D), we first consider the joint distribution of W and D. This distribution is straightforwardly obtained as which can be rewritten as whereZ = (diag(Z ·,1 ), diag(Z ·,2 ), · · · , diag(Z ·,d )) ∈ R n×nd . By applying the formula of conditional distributions of normal distributions [Bishop, 2006, Eqs. (2.113)-(2.117)] to (30), we can obtain where Here, the computation of Σ requires inverting a square matrix of order nd with a computational complexity of O(n 3 d 3 ). By using the Woodbury identity [Petersen and Pedersen, 2012, Eq. (156)] to compute this inversion efficiently, we can transform Σ into S =K −KZ D −1ZK , which requires inverting a matrix of order n, D = σ 2 y I n + K ZZ . Consequently, we obtain p(W | D) = N (vec(W ) | σ −2 y SZ y, S).
From (28) and (33), one can see that (23) can be represented by the following equation: This integral can be obtained in a closed form, as shown in [Bishop, 2006, Eqs. (2.113)-(2.117)]. Therefore, we can obtain the predictive distribution of w * as follows:

C Specification of Datasets
We considered eight datasets from the UCI machine learning repository [Dua and Graff, 2017], which were referred to as Digits 2 , Abalone 3 , Diabetes 4 , Boston [Harrison Jr and Rubinfeld, 1978], Fish 5 , Wine 6 , Paper 7 , and Drug 8 in our experiments.
The first six datasets consisted of tabular data. We treated the original inputs X and simplified inputs Z identically in our experiments. The Digits dataset was originally developed as a classification dataset for recognizing handwritten digits from zero to nine. As described in Section 5.1, we used this dataset for a regression problem by transforming the digit labels into binary values of 1 or −1. Here, we used only the testing set from the original Digits dataset because that is how scikit-learn [Pedregosa et al., 2011]   age of abalone based on physical measurements. The Diabetes dataset is a dataset for predicting the onset of diabetes based on diagnostic measures. The Boston dataset is a dataset for predicting house prices, as described in Appendix A. The Fish dataset is a dataset for predicting acute aquatic toxicity toward the fish pimephales promelas for a set of chemicals. The Wine dataset is a dataset for predicting the quality of white and red wines based on physicochemical tests. The remaining two datasets are text datasets. The Paper dataset is a dataset for predicting evaluation scores for papers based on review texts written mainly in Spanish. The Drug dataset is a drug review dataset for predicting 10-star ratings for drugs based on patient review texts. For each dataset, X and Z are different. Specifically, we used the 512-dimensional sentence vectors obtained using sentence transformers [Reimers and Gurevych, 2020] as X and used bag-of-words binary vectors of the sentences as Z, where the cutoff frequencies for words were set to two and five for the Paper and Drug datasets, respectively. Table 6 lists the number of samples n and number of features d in each dataset.

D Detailed Description of Comparing Methods
In this appendix, we describe the implementation and hyperparameter search methods used for comparing methods.
We implemented GPR using PyTorch v1.5.0 9 . All hyperparameters for GPR were estimated by maximizing marginal likelihood [Rasmussen, 2003], where we initialized the hyperparameters to the same values as those for GPX. For Lasso and Ridge, we used the implementations provided by scikit-learn [Pedregosa et al., 2011]. The hyperparameters that regularize the strengths of the 1 and 2 regularizers in Lasso and Ridge, respectively, were optimized through a grid search using functions provided by scikit-learn (i.e., sklearn.linear_model.LassoCV and sklearn.linear_model.RidgeCV) with the default options. The search range for the hyperparameters for Lasso was limited to within 100 grid points such that the ratio of its minimum value to its maximum value was capped at 0.001, while that for Ridge was limited to within a range of {0.1, 1, 10}. For ARD, we implemented it as with GPR. Herein, the kernel function used in ARD is defined as where α is a scale parameter; θ l is the relevance of the lth feature; θ = {α, θ 1 , θ 2 , · · · , θ d } is a set of parameters to be estimated as with GPR. For the localized lasso, we used the original implementation written in Python 10 . The hyperparameters and their search ranges for the localized lasso are the strength of network regularization λ 1 ∈ {1, 3, 5, 7}, strength of the 1,2 regularizer λ 2 ∈ {0.01, 0.1, 1, 10}, and k ∈ {5, 10, 15} for the k-nearest-neighbor graph. The hyperparameters were optimized through a grid search. The network lasso is a special case of the localized lasso. If λ 2 for the localized lasso is zero, then the localized lasso is identical to the network lasso. Therefore, we used the implementation of the localized lasso and set λ 2 = 0 for the network lasso. The hyperparameter search for the network lasso was the same as that for the localized lasso, except for the setting of λ 2 . For LIME and Kernel SHAP, we used the original implementations 11 . For KL, we implemented it by mimicking its original implementation 12 . Herein, we set to amount of perturbation ∆ = 0.001 throughout our experiments. Figure 5 presents additional examples of estimated weights for the Digits dataset. We found that the weights estimated by GPX were appropriately assigned such that the regions of black pixels have weights with the same signs as those of the target variables.

E Additional Results for the Digits Dataset
In terms of the stability of explanations, estimated weights for the same digit should be similar. Figure 6 presents three examples of estimated weights for the digit two. We found that GPX estimated similar weights for all three examples. 9 https://pytorch.org/ 10 https://riken-yamada.github.io/localizedlasso.html 11 LIME: https://github.com/marcotcr/lime, Kernel SHAP: https://github.com/slundberg/shap 12 https://github.com/topipa/gp-varsel-kl-var Figure 5: Examples of estimated weights for digits ranging from zero to nine for the Digits dataset. The five upper rows present the weights of samples with digits of zero to four (y = −1), whereas the five bottom rows present those for samples with digits from five to nine (y = 1). Red and blue denote positive and negative weights, respectively, and their color strengths represent their magnitudes.