Disentangled Explanations of Neural Network Predictions by Finding Relevant Subspaces

Explainable AI aims to overcome the black-box nature of complex ML models like neural networks by generating explanations for their predictions. Explanations often take the form of a heatmap identifying input features (e.g. pixels) that are relevant to the model's decision. These explanations, however, entangle the potentially multiple factors that enter into the overall complex decision strategy. We propose to disentangle explanations by extracting, at some intermediate layer of a neural network, subspaces that capture the multiple and distinct activation patterns (e.g. visual concepts) that are relevant to the prediction. To automatically extract these subspaces, we propose two new analyses, extending principles found in PCA or ICA to explanations. These novel analyses, which we call principal relevant component analysis (PRCA) and disentangled relevant subspace analysis (DRSA), maximize relevance instead of, e.g., variance or kurtosis. This allows for a much stronger focus of the analysis on what the ML model actually uses for predicting, ignoring activations or concepts to which the model is invariant. Our approach is general enough to work alongside common attribution techniques such as Shapley Values, Integrated Gradients, or LRP. Our proposed methods prove to be practically useful and compare favorably to the state of the art, as demonstrated on benchmarks and three use cases.


INTRODUCTION
Machine learning techniques, especially deep neural networks, have been successful at converting large amounts of data into complex and highly accurate predictive models. As a result, these models have been considered for a growing number of applications. Yet, their complex nonlinear structure makes their decisions opaque, and the model behaves as a black box. In the context of sensitive and high-stakes applications, it is crucial to thoroughly verify the decision strategy of these models before deployment. This important aspect has contributed to the emergence of a research field known as Explainable AI [1], [2], [3], [4], [5], [6] that aims to make ML models and their predictions more transparent to the user.
A popular class of Explainable AI techniques, commonly referred to as 'attribution', identifies for a given data point the contribution of each input feature to the prediction [7], [8], [9], [10]. Attribution techniques have demonstrated usefulness in a broad range of applications. They can identify contributing features in nonlinear relations of scientific interest [11], [12], [13], [14], [15], or enable further validation of the models at hand [16], [17]. However, for certain applications and data types, a simple attribution of the decision function to the input features may be of limited use. Specifically, it may fail to expose the multiple reasons why a particular input feature contributes, or which component of the decision strategy is responsible for that contribution.
These limitations have led to the advent of richer structured explanations. The development encompasses 'higher-order explanations' [13], [18], [19], [20] that aim to extract the contribution of input features in relation to other input features, and 'hierarchical explanations' [21], [22], [23] where concepts (e.g. directions in activation space) are first extracted and then leveraged to identify joint feature-concept contributions. Proposals for hierarchical or concept-based explanations typically construct a latent space that maximally correlates with some ground-truth annotations [24], [17] or learn a latent space that maximizes some statistics of projected activations [25], [22]. These approaches, however, do not guarantee a specific focus on the features that are most relevant for the model to arrive at its decision; they may in some cases extract directions in activation space to which the model is mostly invariant.
To address the general need for more structured and focused explanations, we propose to equip Explainable AI with a new form of representation learning: extracting a collection of subspaces that disentangles the explanation (i.e. separates it into multiple semantically distinct components contributing to the model's overall prediction strategy). Technically, we contribute two novel analyses, principal relevant component analysis (PRCA) and disentangled relevant subspace analysis (DRSA), that achieve two particular flavors of the representation learning objective. PRCA can be seen as an extension of the well-known PCA that incorporates the model response into the analysis [28]. In a similar fashion, DRSA can be seen as an extension of ICA-like subspace analysis [29], [30], [31]. As a result, PRCA and DRSA inherit advantageous properties of the methods they build upon, such as simplicity and ease of optimization.

Fig. 1. Disentangled explanations produced by our proposed methods for the logit 'basketball' of a pretrained VGG16 network [26], [27]. From left to right: the input image; a standard explanation (LRP); a decomposition onto the first PRCA component and the residue (computed at Conv4_3); the disentangled explanation produced by DRSA (4 subspaces extracted at Conv4_3). Red color indicates pixels that contribute evidence for the prediction through the given component or subspace, and blue color indicates pixels that speak against it. The outcome of DRSA can be interpreted as decomposing the prediction strategy into four sub-strategies, here, the detection of the basketball field, the outfits, the faces, and the ball, as highlighted in red in each column.
Furthermore, our contributed PRCA and DRSA methods integrate transparently into a number of popular attribution techniques, in particular, Integrated Gradients [9], Shapley Values [32], [7], [10], and Layer-wise Relevance Propagation [8], [33], [34], [35]. Hence, any explanation produced by these common attribution techniques can now be disentangled by our method into several meaningful components. Moreover, our proposed methods preserve useful properties of the underlying attribution techniques such as conservation [32], [8] (a.k.a. completeness or efficiency) and their computational/robustness profile. Fig. 1 shows examples of disentangled explanations produced by our PRCA/DRSA approach, where we observe that the overall prediction strategy of a VGG16 network for the class 'basketball' decomposes into multiple sub-strategies including detecting the basketball field, the outfits, the faces, and the ball.
Through an extensive set of experiments on state-of-the-art models for image classification, we demonstrate qualitatively and quantitatively that our approach yields superior explanatory power compared to a number of baselines proposed by us or other approaches from the literature. In particular, we observe that our disentangled explanations capture more distinctly the multiple visual patterns used by the model to predict.
Lastly, we present three use cases for the proposed disentangled explanations: (1) We show that disentangled explanations enable a simple and efficient interface for the user to identify and remove Clever Hans effects [36] in some model of interest. (2) We demonstrate on a subset of ImageNet containing different butterfly classes how disentangled explanations can enrich our understanding of the relation between visual features and butterfly classes. (3) We use PRCA to analyze explanations that are adversarially manipulated through a perturbation of the input image (see e.g. [37], [38]), allowing us to disentangle the original explanation from its adversarial component.

Concept-Based Explanations
Building on findings that deep neural networks encode useful concepts in their intermediate layers [44], [45], [46], a first set of related works considers the problem of explanation in terms of abstract concepts that are represented well in intermediate layers. For example, IBD [47] learns from ground-truth concept annotations an interpretable basis, allowing one to decompose the prediction in terms of the different concepts. The TCAV method [24] builds for each concept a linear classifier in activation space, using ground-truth annotations, and generates per-instance concept sensitivity scores using derivatives in activation space. [48] extends the TCAV framework by using clustering algorithms to find directions in latent space, bypassing the need for a concept dataset. Alternatively, [49] finds that using non-negative factorization yields higher fidelity than using clustering approaches. [25] views concepts as subspaces of the representation formed in some intermediate layer and proposes a sparse clustering algorithm to extract such subspaces. [50] introduces "virtual layers", converting the input signal to frequency space and back, in order to produce explanations in the frequency domain. The NetDissect framework [51] offers a way of matching hidden neurons to concepts (obtained from the Broden dataset [52]). It enables partitioning the activation space into multiple subspaces, each of them representing a distinct concept. Concept-based explanations have also been proposed to investigate similarity predictions [53]. We refer to [54] for a broader overview and taxonomy of recent developments in concept-based explanations.

Structured and Higher-Order Explanations
Another line of work aims to extract structured explanations, either joint contributions of input features or of input features and concepts. Higher-order methods [55], [19], [13], [56], [20] enable the construction of these explanations and also better account for interaction effects present in ML models such as graph neural networks [20]. [22] builds an interpretable model, called prototype networks, that supports joint explanations in terms of concepts and input features. [23] enables such structured explanations in a post-hoc manner, by extending the LRP framework to filter the explanation signal that passes through different activation maps representing different concepts. Another propagation-filtering approach is applied at each layer in [57] in order to build a hierarchical explanation. [21] learns a surrogate graph-based model at multiple layers of a trained neural network in order to produce hierarchical explanations. [58] proposes the contextual decomposition approach to extract hierarchies of input features that explain the prediction of an NLP model. The contextual decomposition approach is further extended in [59], in particular, addressing the question of how to explain combinations of two phrases. [60] extracts a hierarchical explanation through the use of a second-order attribution method. In contrast to structured or higher-order explanations, other methods such as [61] aim to simplify a standard first-order explanation by making it sparse.

Applications to Model Validation and Improvement
A number of works leverage the joint usage of explanation techniques and intermediate representations for the purposes of model validation or improvement. [62] proposes a data-agnostic framework that uses synthetic images to investigate whether the intermediate representation of a trained model exhibits any potential Clever Hans effects. [17], [63] model Clever Hans phenomena in a trained model via an application of LRP and build a transformation in activation space to prune these Clever Hans effects, while [64] mitigates such effects by relearning the last layer of the model using a reweighting dataset.

Representation Learning and Disentanglement
Beyond the field of Explainable AI, a broad range of works have addressed the question of learning disentangled representations [29], [30], [65]. Related to our focus on relevant subspaces, some of these works take label information or model response into account [66] or study the guarantees of concept discovery algorithms [67]. Other works focus not on learning disentangled representations, but on evaluating them (e.g. [68], [69]), including the study of how representations evolve from layer to layer in neural networks [44], [70], [71], [72]. In contrast to the works above, our paper proposes techniques that specifically address the problem of disentangling explanations.

JOINT PIXEL-CONCEPT EXPLANATIONS
We place our focus on the commonly studied problem of attribution, which asks to what extent each input feature has contributed to a given prediction. Shapley values [32], [7], [10], Integrated Gradients [9], or Layer-wise Relevance Propagation (LRP) [8], [34] can be called 'standard' attribution methods as they solve the problem of decomposing the output score into contributions of individual input pixels (or patches). An overview of these attribution techniques is provided in Supplementary Note A. One common limitation of these methods is that they do not provide information on the underlying reason (or concept) that makes a particular pixel relevant [24], [47].
More recent Explainable AI approaches, such as [23], [73], [50] or [74], have thus aimed at resolving input feature contributions in terms of intermediate concepts that the ML model uses to produce the overall decision strategy. Consider the favorable case where those concepts are readily identifiable at some intermediate layer of the neural network, more specifically, where the input-output relation can be formulated via the two-step mapping x → h → y, with x = (x_p)_{p=1}^P the collection of P pixels (or patches) forming the input image, h = (h_k)_{k=1}^K the groups of neurons encoding each of the K concepts, and y the output of the network (e.g. the evidence built by the network for the image's class). From this two-step mapping, one can generate a richer joint pixel-concept explanation via the corresponding two-step explanation process:

(R_k)_{k=1}^K = E(y, h)    (1)
(R_pk)_{p=1}^P = E(R_k, x)    (2)

The notation E(a, b) reads "explain a in terms of b", or "attribute a onto b". The score R_k indicates the contribution of concept k to the prediction. The score R_pk can then be interpreted as the joint contribution of input pixel p and concept k to the prediction. As an example, in Fig. 1, for some given input image, the score R_k (with k = 2) would measure the contribution of the 'outfit' to the prediction 'basketball', and R_pk would be the contribution of a particular pixel within the outfit. The collection of all the R_pk's forms the joint pixel-concept explanation.
In practice, when using backpropagation methods such as LRP, such a two-step explanation process takes the form of a filtering of the backward pass through specific neurons, a procedure we highlight graphically in Fig. 2 (right). The filtering approach is also found e.g. in [39], [20], [23]. We provide the derivations of our two-step procedure for non-propagation attribution techniques, such as the Shapley value [7], [10] and Integrated Gradients [9], in Supplementary Note B.1.
When the attribution technique used at each step obeys a conservation principle (which is the case for all the attribution methods stated above up to some approximation), one gets the conservation equation Σ_{p,k} R_pk = y. Furthermore, under some reasonable assumptions about the model and the attribution technique (cf. Supplementary Note B.1), one gets the stronger form of conservation ∀p : Σ_k R_pk = R_p. The joint pixel-concept explanation then becomes a decomposition of the standard pixel-wise explanation into multiple sub-explanations, and conversely, the standard pixel-wise explanation can be seen as a reduction (or coarse-graining) of the joint explanation.

Concepts as Orthogonal Subspaces
We have so far assumed an intermediate representation in the model, with the variables (h_k)_{k=1}^K encoding distinctly the multiple concepts used by the model to predict. In most deep neural networks, however, we only have a sequence of layers, each of which is a large, mainly unstructured, collection of neurons whose role or contribution to the neural network output is not always easily identifiable and possibly not well-disentangled (see e.g. [75], [51]). As some earlier works have demonstrated [24], [76], [77], [25], meaningful concepts are typically recoverable (e.g. using a linear transformation) from the collection of activations a = (a_i)_{i=1}^D at some well-chosen layer. We propose to append to such a layer of activations a virtual layer that maps the activations to some latent representation using some orthogonal matrix U of size D × D and back. The matrix is structured as

U = [U_1 | ... | U_K]    (3)

where each block U_k, a matrix of size D × d_k, is associated with the concept k and defines a projection onto a subspace of dimension d_k. The virtual layer is depicted in Fig. 2 (left) and its mapping can be expressed as:

h_k = U_k^⊤ a,    a' = Σ_k U_k h_k = U U^⊤ a = a    (4)

highlighting (1) its property of keeping the overall decision function unchanged due to the orthogonality property, and (2) the extraction of the variable h_k which is required for producing the joint pixel-concept explanation according to Eqs. (1) and (2).
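The virtual layer can be sketched in a few lines of NumPy. The dimensions and block sizes below are arbitrary illustration choices, not values from our experiments:

```python
import numpy as np

rng = np.random.default_rng(0)
D, dims = 8, [3, 3, 2]          # subspace dimensions d_k summing to D

# Build a random orthogonal matrix U = [U_1 | ... | U_K] via QR decomposition.
U, _ = np.linalg.qr(rng.standard_normal((D, D)))
blocks = np.split(U, np.cumsum(dims)[:-1], axis=1)   # the blocks U_k

a = rng.standard_normal(D)       # activations at the chosen layer

# Virtual layer: project onto each concept subspace (h_k) and back (a').
h = [Uk.T @ a for Uk in blocks]
a_prime = sum(Uk @ hk for Uk, hk in zip(blocks, h))

# Orthogonality of U makes the round-trip the identity:
# the decision function of the network remains unchanged.
assert np.allclose(a_prime, a)
```

Because U U^⊤ = I, inserting this layer leaves every downstream activation, and hence the prediction, untouched, while exposing the latent variables h_k needed for the joint explanation.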

Expressing Concept Relevance
There are many ways to learn the matrices U_k, for example, using PCA or other unsupervised analyses on a set of activation vectors a, similar to [25]. However, we aim in this work to learn subspaces that are specifically relevant to the decision function, i.e. with high relevance scores R_k. To achieve this, we will first need to express R_k (the outcome of Eq. (1)) in terms of the transformation matrix U_k. We give the demonstration for the LRP [8] attribution technique. Let us recall the definition of the virtual layer in Eq. (4), and observe that each reconstructed activation a'_j can be expressed in terms of the concepts h_k in the layer below as:

a'_j = Σ_k ⟨(U_k)_j, h_k⟩

with (U_k)_j the jth row of the matrix U_k. Assume we have propagated the neural network output using LRP down to the layer of the reconstructed activations and obtained for each a'_j a relevance score R_j.^1 The LRP-0 rule [8], [34] lets us propagate these scores to the layer below representing concepts:

R_k = Σ_j ⟨(U_k)_j, h_k⟩ · R_j / a'_j    (5)

where j runs over all activated neurons. After some rearranging, the same R_k can be restated as:

R_k = a^⊤ U_k U_k^⊤ c    (6)

where c is a D-dimensional vector whose elements are given by c_j = R_j / a'_j for all activated neurons j and c_j = 0 otherwise. The vector c can be interpreted as the model response to a local variation of the activations, and we refer to it in the following as the 'context vector'.
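A minimal NumPy sketch of the context vector and of the concept relevance of Eq. (6); the random values stand in for real activations and LRP relevance scores:

```python
import numpy as np

rng = np.random.default_rng(1)
D, dims = 6, [2, 2, 2]
U, _ = np.linalg.qr(rng.standard_normal((D, D)))
blocks = np.split(U, np.cumsum(dims)[:-1], axis=1)

a = np.abs(rng.standard_normal(D))   # activations (a'_j = a_j since U U^T = I)
R = rng.standard_normal(D)           # relevance propagated down to a'
c = np.where(a != 0, R / a, 0.0)     # context vector: c_j = R_j / a'_j

# Eq. (6): relevance of concept k as an inner product in its subspace.
R_k = [float((Uk.T @ a) @ (Uk.T @ c)) for Uk in blocks]

# Conservation: the concept relevances sum to the total relevance sum_j R_j.
assert np.isclose(sum(R_k), R.sum())
```

The final assertion illustrates the conservation property discussed later in the paper: since Σ_k U_k U_k^⊤ = I, the relevances of the K concepts sum to the relevance entering the virtual layer.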
In Supplementary Note B, we show that other attribution methods such as Gradient × Input [78] and Integrated Gradients [9] (with zero reference value) also produce relevance scores of the form of Eq. (6), and we provide an expression for their respective context vector c. Because methods based on the Shapley value [7], [10] do not yield the form of Eq. (6), one requires an approximation of the latter. In particular, our solution consists of first computing Shapley values w.r.t. the activations a_j (or groups thereof), and then performing the propagation step onto concepts using the LRP rule of Eq. (5).

LEARNING RELEVANT SUBSPACES
Having expressed the concept relevances R_k in terms of known quantities, specifically, for each data point, (i) activation vectors a that we can collect and (ii) context vectors c representing the model response that we can compute (e.g. using LRP), we can formulate various concept relevance maximization objectives over the transformation matrices U_k.

Principal Relevant Component Analysis (PRCA)
The first objective we propose is to extract a subspace that is maximally relevant to the model prediction. Consider a virtual layer with the simple block structure

U = [U | Ū]    (7)

where U ∈ R^{D×d} defines a projection onto a subspace of fixed dimension d and Ū ∈ R^{D×(D−d)} projects onto the orthogonal complement. We ask: what matrix U yields a subspace that is maximally relevant for the prediction?
Starting from the expression of relevance in Eq. (6), we can formulate the search for such a maximally relevant subspace via the optimization problem:

max_U  E[ a^⊤ U U^⊤ c ]    subject to:  U^⊤ U = I

where the expectation denotes an average over some dataset (e.g. images of a given class and the associated activation and context vectors). Using linear algebra identities, the same optimization problem can be reformulated as:

max_U  Tr( U^⊤ · ½ E[a c^⊤ + c a^⊤] · U )    subject to:  U^⊤ U = I

where a cross-covariance term between the activations a and the context vector c appears. The solution to this optimization problem is given by the eigenvectors associated to the d largest eigenvalues of the symmetrized cross-covariance matrix (cf. Supplementary Note C.2 for the derivation). In practice, the orthogonal matrix of Eq. (7) can therefore be computed using a common eigenvalue solver:

V Λ V^⊤ = ½ E[a c^⊤ + c a^⊤]

and, assuming the eigenvectors are sorted by decreasing eigenvalues, we recover the blocks of that matrix as U = [v_1, ..., v_d] and Ū = [v_{d+1}, ..., v_D].

The proposed PRCA differs from standard PCA by also taking into account, via the context vector c, how the output of the network responds to the activations a. Thus, PRCA is able to ignore high-variance directions in the data when the model is invariant or responds negatively to these variations. Fig. 3 provides an illustration of this effect on two-dimensional data, showing that the PRCA subspace aligns more closely with the model response than PCA.

1. For our method to be applicable, one further requires that any activation a'_j = 0 has relevance R_j = 0. This property is satisfied when applying standard LRP rules (cf. Supplementary Note A).
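The PRCA computation reduces to a single eigendecomposition of the symmetrized cross-covariance matrix. The sketch below (with a hypothetical `prca` helper and synthetic data) illustrates how PRCA, unlike PCA, ignores a high-variance direction to which the model response is invariant:

```python
import numpy as np

def prca(A, C, d):
    """PRCA: top-d eigenvectors of the symmetrized cross-covariance matrix.

    A, C: (N, D) arrays of activation and context vectors (e.g. one class).
    Returns U of shape (D, d) spanning the maximally relevant subspace.
    """
    S = 0.5 * (A.T @ C + C.T @ A) / len(A)   # (1/2) E[a c^T + c a^T]
    _, eigvecs = np.linalg.eigh(S)           # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d]           # keep the d largest

# Toy data: the model responds only along axis 0 (encoded in c), while
# axis 1 carries five times more variance but the model is invariant to it.
rng = np.random.default_rng(2)
A = rng.standard_normal((1000, 2)) * np.array([1.0, 5.0])
C = np.zeros_like(A)
C[:, 0] = A[:, 0]

U = prca(A, C, d=1)
# Dominant weight lands on axis 0: PRCA discards the invariant
# high-variance direction that standard PCA would have selected.
```

Running uncentered PCA on `A` instead (eigenvectors of `A.T @ A / len(A)`) would return the high-variance axis 1, illustrating the difference shown in Fig. 3.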
Note that several related approaches to refocus PCA on task-specific features have been proposed, although in a different context than Explainable AI. This includes 'directed PCA' [79], [80], where a subset of task-related features is selected before running PCA. It also includes 'supervised PCA' [66], which formulates a trace maximization problem involving both inputs and labels, and methods based on partial least squares [81].

Disentangled Relevant Subspace Analysis (DRSA)
We now extend the approach above for the purpose of separating an explanation into multiple components representing the different concepts contributing to the overall decision strategy. Specifically, we partition the activation space into multiple subspaces, optimized so that they maximize some higher-order statistics of relevance scores.
Let the orthogonal matrix U associated with our virtual layer be structured in the general form of Eq. (3). Let n ∈ {1, ..., N} index the different data points and

R_nk = max(0, a_n^⊤ U_k U_k^⊤ c_n)

be the positive part of the relevance associated to subspace k according to Eq. (6) for a given data point n. The rectification operation allows us to focus our subsequent analysis on extracting salient positive contributions. In particular, we define our disentanglement-inducing objective as:

max_{U_1,...,U_K}  M_q over k ( M_2 over n ( R_nk ) )    subject to:  U^⊤ U = I

where we use in practice q = 0.5. The operator M_p denotes a generalized F-mean with function F(t) = t^p. The special cases M_{0.5} and M_2 can be interpreted as soft min- and max-poolings respectively. The soft max-pooling over data points serves to encourage subspaces to align with instances n with particularly high relevance scores. These instances can be interpreted as prototypes for each identified component of the decision strategy. The soft min-pooling over subspaces, on the other hand, serves to favor solutions that balance the total relevance attributed to the different subspaces. Other nested pooling structures are found in the Independent Subspace Analysis algorithms of [31], [30]. These nested structures have also been studied in more depth in [82]. The behavior of DRSA on a synthetic two-dimensional activation space and a single one-dimensional subspace is illustrated in Fig. 4. While the optimization problem above is non-convex and does not have a closed-form solution, we can proceed iteratively, starting from a random orthogonal matrix and, similarly to [30], alternating gradient ascent and orthogonalization steps (U ← U (U^⊤ U)^{−1/2} [84]).
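A toy-scale sketch of this optimization loop. For self-containedness it uses finite-difference gradients and random data; a real implementation would use an automatic-differentiation framework and the collected activation/context vectors:

```python
import numpy as np

def drsa_objective(U, A, C, dims, q=0.5):
    """Soft-min over subspaces k of a soft-max over data points n of R_nk."""
    blocks = np.split(U, np.cumsum(dims)[:-1], axis=1)
    # R_nk: rectified subspace relevance for each sample n and subspace k
    R = np.stack([np.maximum(0.0, np.einsum('nd,nd->n', A @ Uk, C @ Uk))
                  for Uk in blocks], axis=1)            # shape (N, K)
    soft_max = np.mean(R ** 2, axis=0) ** 0.5           # M_2 over samples n
    return np.mean(soft_max ** q) ** (1 / q)            # M_q over subspaces k

def orthogonalize(U):
    """Project back onto the orthogonal group: U <- U (U^T U)^{-1/2}."""
    W, _, Vt = np.linalg.svd(U, full_matrices=False)
    return W @ Vt

rng = np.random.default_rng(3)
D, dims, N = 4, [2, 2], 200
A = rng.standard_normal((N, D))
C = rng.standard_normal((N, D))
U = orthogonalize(rng.standard_normal((D, D)))

lr, eps = 0.1, 1e-5
for _ in range(50):
    base = drsa_objective(U, A, C, dims)
    grad = np.zeros_like(U)
    for i in range(D):
        for j in range(D):
            Up = U.copy()
            Up[i, j] += eps
            grad[i, j] = (drsa_objective(Up, A, C, dims) - base) / eps
    U = orthogonalize(U + lr * grad)    # ascent step, then re-orthogonalize
```

The re-orthogonalization after every ascent step keeps U a valid virtual layer (Eq. (4)) throughout the optimization, and the objective typically increases over the iterations.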

Theoretical Properties
The proposed relevant subspace analyses have a number of desirable theoretical properties:

Proposition 1. For any orthogonal matrix U = [U_1 | ... | U_K] and relevance scores given by Eq. (6), with c such that R_j = a'_j c_j, we have the conservation property Σ_k R_k = Σ_j R_j. Furthermore, when c = ξa with ξ ≥ 0, then we necessarily have R_k ≥ 0.
(A proof can be found in Supplementary Note C.1.) These properties follow from the orthogonality constraint on the matrices (U_k)_k. The first property (conservation) ensures that the two-step explanation produced by our method retains the conservation properties of the original explanation technique it is based on. The second property (positivity) ensures that an absence of contradiction in the decision function (e.g. a perfect alignment between activations and model response) results in a similar absence of contradiction w.r.t. concepts.
The following result links the proposed PRCA and DRSA algorithms to well-known analyses such as PCA and ICA.
Proposition 2. When the context vector c is equivalent to the activation vector a, the PRCA analysis reduces to uncentered PCA. Furthermore, if we assume whitened activations, i.e., E[a] = 0 and E[a a^⊤] = I, and each matrix U_k projecting to a subspace of dimension 1, then the DRSA analysis with parameter q = 2 reduces to ICA with kurtosis as a measure of subspace independence.
(A proof can be found in Supplementary Note C.1.) In other words, with some restrictions on the parameters, our proposed algorithms reduce to PCA and ICA when the model response is perfectly aligned with the activations. Unlike PCA and ICA, our analyses are able to extract subspaces that are relevant to the prediction even when the model response does not align well with the activations.
In terms of the level of access to the model it requires, our method can be characterized as a white-box method. Specifically, it requires access to at least one intermediate layer to perform PRCA and DRSA. When used together with attribution methods such as LRP, access to all layers of the model is required in order to implement the LRP propagation rules.

QUANTITATIVE EVALUATION
To evaluate the performance of our PRCA and DRSA methods at extracting relevant subspaces and producing disentangled explanations, respectively, we perform experiments on the ImageNet [85] and Places365 [86] datasets. For ImageNet, we consider a subset of 50 classes^3 and three publicly available pre-trained models. These models are two VGG16 [26] models, from the TorchVision (TV) [27] and NetDissect (ND) [51] repositories and denoted by VGG16-TV and VGG16-ND^4 respectively, and the NFNet-F0 model (the smallest variant of the more recent Normalizer-Free Network architecture [88]). For Places365, we consider a subset of classes^5 and a pre-trained ResNet18 model [27].

3. For ease of reproducibility and maximizing class coverage, we choose ImageNet classes with indices {0, 20, 40, ..., 980}.
4. We remark that VGG16-ND is our PyTorch version of the model (originally in Caffe [87]'s format) on which the relation between concepts and feature maps has been analyzed in [51], allowing for a more direct comparison between the previous work and our DRSA approach. The original model is available at http://netdissect.csail.mit.edu/dissect/vgg16_imagenet/.
We evaluate our proposed methods with Shapley Value Sampling (an approximation of the classic Shapley value) and LRP; these two attribution techniques are chosen based on patch-flipping experiments [91] (see Supplementary Note D). We use the implementation of Shapley Value Sampling from Captum [92]. Our LRP implementation for VGG16 is based on LRP-γ used in [19]. For NFNets, we contribute a novel LRP implementation (see Supplementary Note K). For ResNet18, we use the LRP implementation from Zennit [93]. We provide the details of these attribution methods' hyperparameters in Supplementary Note D.
In the following, we focus on evaluating the proposed approaches using activations from VGG16 at Conv4_3 (after ReLU), NFNet-F0 at Stage 2, and ResNet18 at Layer 4.^6 We refer to ablation studies on the choice of layers and the number of subspaces in Supplementary Note F.
To extract subspaces, we randomly choose 500 training images of each class and take their feature map activations at the layer of interest. For each image, we randomly pick 20 spatial locations in the feature maps. The procedure results in a collection of 10,000 activation vectors for each class. We also collect the corresponding context vectors (w.r.t. the target class) associated with the Shapley Value Sampling and LRP attribution methods. We summarize these details in Supplementary Note E. All our evaluations are performed on a validation set disjoint from the data used for training the networks and optimizing the PRCA/DRSA subspaces.
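The per-image sampling step can be sketched as follows; `sample_vectors` is a hypothetical helper, and the feature-map shape merely mimics a VGG16 Conv4_3 output:

```python
import numpy as np

def sample_vectors(feature_map, context_map, n_locations=20, seed=0):
    """Sample paired activation/context vectors at random spatial locations.

    feature_map, context_map: (D, H, W) arrays for one image, where the
    context map holds the per-neuron model response (e.g. R_j / a_j).
    """
    rng = np.random.default_rng(seed)
    D, H, W = feature_map.shape
    idx = rng.choice(H * W, size=n_locations, replace=False)
    ys, xs = np.unravel_index(idx, (H, W))
    a = feature_map[:, ys, xs].T    # (n_locations, D) activation vectors
    c = context_map[:, ys, xs].T    # matching context vectors
    return a, c

# 500 images x 20 locations per class yields 10,000 (a, c) pairs.
fmap = np.random.rand(512, 28, 28)     # stand-in for a Conv4_3 feature map
cmap = np.random.randn(512, 28, 28)    # stand-in for the model response
a, c = sample_vectors(fmap, cmap)
```

Sampling spatial locations rather than whole feature maps treats each location as one data point n, matching the per-vector form of Eq. (6).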

Evaluation of PRCA
We test the ability of our PRCA method to extract a low-dimensional subspace of the activations that retains the input features used by the model to build its prediction. The extraction of what is relevant (vs. irrelevant) in an ML model has recently found application in the context of model compression (e.g. [94]). We first recall from Section 3.2 that any subspace of the activations, with U the matrix of size D × d that projects onto it, embodies an amount of relevance expressible as:

R = a^⊤ U U^⊤ c

The relevance can then be attributed to the input space, using LRP or Shapley instantiations of E(R, x). We quantify how closely (in spite of the dimensionality bottleneck) the produced explanation describes the neural network prediction strategy using pixel-flipping [8], [91], a common evaluation scheme, sometimes also referred to as deletion/insertion experiments [95].
Pixel-flipping (in our case, 'patch-flipping') proceeds by removing patches from the input image, from most to least relevant, according to the explanation.^7 As patches are being removed iteratively, we keep track of the network output and then compute the "area under the patch-flipping curve" (AUPC) [91]:

AUPC = E[ Σ_{τ=1}^{T} w(τ) · f(x^{(τ)}) ]

where x^{(τ)} denotes the image after τ removal steps, T is the number of steps until all patches have been removed from the image, E denotes an average over images of class t in the validation set, and f is the neural network output for class t to which we have applied the rectification function to focus on positive evidence. The weighting function w(τ) ∈ (0, 1) is the difference between the percentage of patches flipped at the τ-th and (τ - 1)-th steps. To make experiments executable in a reasonable time, we measure relevance over patches of size 16 × 16, and we flip τ^2 such patches at each step. The lower the AUPC score, the better the explanation, and the better the subspace U.

5. These Places365 classes are similar to the ones visualized in Ref. [47]'s Fig. 4.
6. We adopt the layer-name conventions of VGG16 from [51], of NFNet-F0 from [88], and of ResNet18 from [27].
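The AUPC computation for a single image can be sketched as follows, assuming the rectified network outputs along the flipping trajectory have already been recorded (all values below are illustrative):

```python
import numpy as np

def aupc(outputs, frac_flipped):
    """Area under the patch-flipping curve for one image.

    outputs: class logits f(x^(tau)) recorded after each removal step tau.
    frac_flipped: cumulative fraction of patches removed at each step.
    """
    f = np.maximum(0.0, np.asarray(outputs, dtype=float))  # rectification
    # w(tau): difference in the fraction of patches flipped between steps.
    w = np.diff(np.concatenate(([0.0], np.asarray(frac_flipped, dtype=float))))
    return float(np.sum(w * f))

# An explanation whose removals destroy the evidence faster scores lower.
frac = np.array([0.25, 0.5, 0.75, 1.0])
good = aupc([2.0, 0.5, 0.1, 0.0], frac)   # evidence collapses quickly
bad = aupc([4.0, 3.5, 2.0, 0.5], frac)    # evidence decays slowly
```

Averaging this per-image score over the validation images of a class, then over classes, gives the numbers reported in Table 1.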
To the best of our knowledge, there are no existing baselines from the literature that are designed to extract a subspace of the activations that preserves the decision strategy of a given class.^8 Hence, for comparison, we consider several baselines: 1) a random subspace;^9 2) the subspace of the first d eigenvectors of the standard (uncentered) PCA on the activations; and 3) the subspace corresponding to the d most relevant feature maps (Max-Rel) [23].
We can view the choice of baselines as a special ablation study of PRCA. Specifically, PCA corresponds to PRCA with the context vector c representing the model response set to the activation vector a (Proposition 2); Max-Rel is a reduction of PRCA where the bases of the subspaces are canonical; and the random approach can be thought of as an 'untrained' PRCA.
Results are given in Table 1 for subspace size d = 1. We observe that PRCA strongly surpasses the baseline methods across configurations. This observation indicates that PRCA can identify a subspace of the activations that is relevant to the prediction.
Next, we investigate the influence of the subspace dimension d on the quality of the subspace. We analyze the AUPC score as a function of d. We perform the experiment on the VGG16-TV model with LRP. Fig. 5 shows that PRCA has the lowest AUPC scores across different values of d. The result supports the conclusion that the top few principal components of PRCA accurately capture the evidence the neural network builds in favor of the image's class. The fact that PRCA (and also Max-Rel) performs better than no subspace projection (i.e. retaining the whole activation space) suggests that there exists some amount of inhibitory signal in the activations that conceals mildly relevant features in the original heatmaps. By construction, the first few PRCA components are maximally relevant directions, and projecting activations onto them decreases the amount of such inhibitory signal. The showcase we present in Section 6.3, where we use PRCA to robustify an explanation under adversarial manipulations, corroborates the above interpretation.

7. We opt for the removal-based rather than the insertion-based variant of patch-flipping because it makes the task of inpainting missing patches easier and thereby reduces evaluation bias. The replacement values for removed patches are set according to the TELEA [96] algorithm (a neighborhood-based inpainter) from OpenCV [97].
8. Ref. [98] studies a related problem: completeness in latent space, although its objective is to verify whether extracted concept vectors can restitute the full accuracy of the model. In contrast, our objective is to extract a subspace that maximally expresses the predicted evidence for a certain class.
9. We take the first d columns of a random orthogonal matrix.

TABLE 1
Patch-flipping evaluation of PRCA for subspace size d = 1. Evaluation is performed over different combinations of models, datasets, and underlying attribution techniques (columns). We report for each method in our evaluation (rows) the AUPC score. The AUPC scores are computed by averaging over instances within each class and then averaging over classes. The best method for each setting is shown in bold, and the second best with underline. The proposed PRCA method performs best in all settings. (†) average from three seeds.

Fig. 5. AUPC scores of different subspace methods when varying the subspace dimensionality (the variable d); lower is better. The analysis is performed on VGG16-TV with LRP (same as column 1 in Table 1). Each curve is an average over the means of the 50 classes, and shaded regions represent one standard error (over classes). The horizontal dashed line represents the AUPC of no subspace projection.

Evaluation of DRSA
The second question we have considered in this paper (and for which we have proposed the DRSA analysis) is whether the explanation can be disentangled into multiple components that are distinctly relevant for the prediction. Specifically, we have set the problem of disentanglement as partitioning the space of activations into K subspaces, defined by their respective transformation matrices $(U_k)_k$. From these K subspaces, one can retrieve each component of the explanation as $E(R_k, x)$. All methods in our benchmark yield such a decomposition onto a fixed number of K components (they only differ in the choice of the matrices $U_k$).
For evaluation purposes, we propose an extension of the patch-flipping procedure used in Section 5.1. The extended procedure allows us to quantify the level of disentanglement, specifically, verifying that the multiple explanation components highlight distinct (spatially non-overlapping) visual elements contributing to the neural network's prediction. Our extension consists of running multiple instances of patch-flipping in parallel (one per component k) and aggregating the patch removals coming from each component. More specifically, denoting by $M_k^\tau$ an indicator vector of patches removed based on the $k$th component after $\tau$ steps, we define the overall set of patches to be removed after these steps to be $M^\tau = \bigcup_{k=1}^{K} M_k^\tau$, where the union operator applies element-wise. An illustration of the modified patch-flipping procedure is given in Fig. 6. Similar to Section 5.1, as the patch-flipping proceeds, we keep track of the model output and compute the AUPC score. Our extended patch-flipping procedure shares similarity with the evaluation procedures in [48], [25], as the latter also remove features based on distinct explanation components (or concepts).
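The aggregation of per-component removals described above can be sketched as follows; this is a simplified version in which each component removes exactly one patch per step, and the function and variable names are ours:

```python
import numpy as np

def combined_masks(relevances, T):
    """Per-step union of removal masks for K explanation components.

    relevances: (K, P) patch relevance per component.
    Returns a (T, P) boolean array where row t marks the patches removed
    after t+1 steps, i.e. the element-wise union of the K per-component
    masks M_k^tau.
    """
    K, P = relevances.shape
    # removal order per component: most relevant patch first
    order = np.argsort(-relevances, axis=1)
    M = np.zeros((T, P), dtype=bool)
    removed = np.zeros((K, P), dtype=bool)
    for t in range(T):
        for k in range(K):
            removed[k, order[k, t]] = True  # one patch per component per step
        M[t] = removed.any(axis=0)          # element-wise union over components
    return M
```

If the components highlight spatially distinct regions, the union grows quickly, and the model output (hence the AUPC) drops faster than for overlapping components.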
We evaluate our DRSA method against a number of baselines. The first baseline is random subspaces constructed from a random D × D orthogonal matrix. The second baseline, called DSA, is an ablation of the objective of DRSA where we replace the context vector with the activation vector itself.

Fig. 6. Illustration of our modified patch-flipping procedure to evaluate explanation disentanglement. The first column shows a two-concept explanation (k = 1, 2) with red color intensity indicating patch relevance. The next three columns show the indicator vectors for the first three steps (τ = 1, 2, 3) of the modified patch-flipping procedure, with white color indicating removed patches.

TABLE 2
Patch-flipping evaluation of DRSA for K = 4 subspaces. Evaluation is performed over the same dataset, model and attribution settings (columns) as in Table 1. We report for each method in our evaluation (rows) the AUPC score. Like in Table 1, the AUPC scores are first averaged over instances and then over classes. The best method is shown in bold, and the second best with underline. Entries with '×' are not applicable. (†) average from three seeds.
For the ImageNet experiments, the third baseline is NetDissect [51], a state-of-the-art framework linking neurons to a large set of real-world concepts extracted from the Broden database [52]. The approach associates each filter in the layer's feature map with a concept available in the dataset. For each identified concept, we define its subspace as the span of the standard basis vectors of the associated filters. We refer to Supplementary Note H for our reproduction details of NetDissect. For the Places365 experiments, the third baseline is the concept directions from IBD [47]. Because these concept vectors do not form an orthogonal basis, we adapt our formulation of the virtual layer accordingly (details are provided in Supplementary Note I).
We set the number of subspaces to K = 4. For the random subspaces, DSA, and DRSA, we choose the dimension of each subspace to be D/K. To build the DSA and DRSA subspaces, we use each class's set of activation (and context) vectors, similar to the setup in Section 5.1. We provide the training details of DSA and DRSA in Supplementary Note E. Because Shapley Value Sampling is computationally demanding, we report for this method an average over only 10 validation instances per class.
Table 2 shows the AUPC scores across setups. We observe that our proposed approach (DRSA) outperforms the baseline methods by reaching the lowest scores across these setups. The observation also aligns with the visual inspection of Fig. 1 earlier in the paper, where spatially distinct concepts could be identified from the DRSA explanations. The result suggests that the subspaces of DRSA capture distinctively relevant components in the decision of the neural network.
When ranking the 50 ImageNet classes based on the AUPC scores obtained by DRSA on VGG16-TV, we observe that the class 'zebra' comes first. A subsequent visual inspection reveals that evidence for the class 'zebra' arises from multiple, spatially disentangled concepts, such as the zebra's shape, its unique texture, and the environment in which zebras are located. We provide the details of the class comparison as well as qualitative examples in Supplementary Note G.
We further conduct ablation studies on the choice of layers and the number of subspaces. The results of these studies align with the conclusions from Table 2. In addition to these studies, we also perform experiments that verify certain intrinsic properties of the produced explanations. We refer to Supplementary Note F for these results.

APPLICATION SHOWCASES
We showcase three possible applications of the proposed PRCA and DRSA methods, namely (1) building a more trustworthy ML model by detecting and removing Clever Hans strategies in the model, (2) getting better insights into the data by highlighting the multiple ways input and output variables are related, and (3) bringing further understanding to the problem of adversarially manipulated explanations.

Detecting and Mitigating Clever Hans Effects
A common issue with ML models is that they sometimes rely not on the true features that should support the decision, but on artifactual features that spuriously correlate with the target labels on the available data. Such flawed strategies of an ML model are commonly referred to as 'Clever Hans' [36], [17]. Clever Hans models evade traditional model validation techniques, such as cross-validation, when the spurious correlation is present in both the training and test data. Nevertheless, Explainable AI can reveal these Clever Hans strategies; specifically, the user would inspect the explanations of a number of decision strategies and verify that artifactual features are not highlighted as 'relevant' in the explanation.
We demonstrate in this showcase how the proposed DRSA analysis enables us to detect and mitigate Clever Hans effects in a highly efficient manner. In contrast to existing state-of-the-art approaches to Clever Hans detection [36], [62] and mitigation [17], our approach can leverage the multiple sub-strategies readily identified by DRSA, some of which can be of Clever Hans nature. In particular, for detecting Clever Hans strategies, one can let the user inspect one or a few representative examples of each decision strategy identified by DRSA.
We test our approach on a known example of a Clever Hans strategy: the reliance of VGG16-TV on Hanzi watermarks for predicting 'carton' [17]. Fig. 7 (top) shows three training images¹⁰ from the class 'carton' and their standard and DRSA heatmaps using LRP (DRSA is applied at Conv4_3 with K = 4). From the heatmaps, we can see that the subspace S4, unlike the other subspaces, captures the Hanzi watermark quite prominently when the watermark is salient (e.g. the first and second examples). We therefore identify that S4 corresponds to the Hanzi Clever Hans strategy. We find that S4 is able to discriminate Clever Hans from non-Clever Hans instances with an AUROC of 0.909. We compare the Clever Hans detection ability of DRSA with that of SpRAy [36]. The SpRAy method consists of performing a clustering of standard LRP heatmaps and inspecting individual clusters. In our case, we choose the same number of clusters as DRSA subspaces. In comparison, SpRAy's most discriminative cluster achieves a slightly lower AUROC of 0.842. Details of the experiment and full ROC curves are provided in Supplementary Note J.1. We expect further gains from our approach over SpRAy when the Clever Hans features occur at different locations in the input images. Overall, our experiment demonstrates the effectiveness of DRSA at identifying Clever Hans effects.

10. We select the examples based on the procedure outlined in Supplementary Note E.2. The rationale for using training images rather than validation images is that the training set is more likely to contain features causing the CH effect, and thus more useful for inspection purposes.
When it comes to mitigating Clever Hans strategies, we propose again to leverage DRSA. Specifically, building on the subspace(s) we have identified using DRSA to be of Clever Hans nature, we propose to refine the prediction of the class by subtracting the relevance scores associated to those subspaces from the prediction:

$$y^{\text{refined}} = y - \sum_{k \in \mathcal{K}_{\text{CH}}} R_k, \qquad (12)$$

where $\mathcal{K}_{\text{CH}}$ denotes the set of identified Clever Hans subspaces. In practice, we find that subspaces identified to be of Clever Hans type still contain residual non-Clever Hans contributions, especially negative ones. Hence, we propose to only consider the excess relevance given by

$$R_k^{(\text{excess})} = \max\big(0,\; R_k - \mathbb{E}[R_k]\big),$$

where the expectation is computed over a set of training images from the class, here 'carton'. We then use $R_k^{(\text{excess})}$ in place of $R_k$ in Eq. (12).

We now apply the proposed method to mitigate the influence of the Hanzi watermark, focusing on the class 'carton' and other classes that VGG16-TV tends to confuse with 'carton'. We say VGG16-TV confuses a class with the class 'carton' if it has class 'carton' in the top-3 predictions of its validation images with a frequency of at least 10%. With this criterion, these classes are 'crate', 'envelope', 'packet', and 'safe'. Using validation images of these classes and of class 'carton', we then construct a classification problem in which some of the non-carton images are poisoned with a random Hanzi watermark (from one of the three we have prepared; cf. Supplementary Note J.1). We apply 25% poisoning, i.e. 25% of non-carton images are inpainted with Hanzi watermarks. We observe that the classification accuracy of the original model decreases on the poisoned data (from around 82% to 76%). The decrease indicates that our poisoning procedure effectively fools VGG16-TV. Fig. 7 (bottom) shows the difference in accuracy between the original and refined models on the 25%-poisoned data. We see that the refined model based on excluding the contribution of S4 has the highest classification accuracy (adding 3.6% to the accuracy score of the unrefined model). We further investigate the structure of errors in Fig. 8, which shows the confusion matrices between predicted and target classes for the original and refined models. After the refinement, we observe that the number of misclassified non-carton examples decreases substantially.

We finally compare our model refinement method with the method of [64], which consists of retraining the last layer of the model on poisoned data. We find that the retraining approach achieves a gain of 4.8% accuracy, slightly above our method based on DRSA. Our method, however, comes with the additional advantages of neither having to synthesize artificial Clever Hans instances nor having to choose a particular poisoning level for retraining. These advantages are decisive in contexts where Clever Hans features are tightly interwoven with the other objects contained in the image. We refer to Supplementary Note J.1 for the details of the experiments, including different poisoning levels.
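A minimal sketch of the refinement step described above, under the assumption that the refined class score subtracts the excess relevance of the identified Clever Hans subspaces from the original score; the function and argument names are hypothetical:

```python
import numpy as np

def refined_score(y, R, ch_subspaces, R_train_mean):
    """Refine a class score by removing Clever Hans contributions.

    y:            original class score for one example
    R:            (K,) relevance of each subspace for this example
    ch_subspaces: indices of subspaces identified as Clever Hans
    R_train_mean: (K,) mean subspace relevance over training images
                  of the class
    """
    # excess relevance: only the part above the class-typical level,
    # which removes residual (especially negative) non-CH contributions
    excess = np.maximum(0.0, R - R_train_mean)
    return y - excess[list(ch_subspaces)].sum()
```

The max with zero ensures that subspaces whose relevance is at or below the class-typical level leave the prediction unchanged.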
Overall, this showcase has demonstrated that DRSA can be an effective tool for detecting and mitigating Clever Hans effects in complex ML models. Furthermore, we stress that our approach is purely unsupervised: it requires neither assembling a dataset of examples labeled according to the strategy the model employs to predict them, nor generating synthetic examples where the Clever Hans features have been stripped or artificially added. Moreover, our Clever Hans mitigation approach is 'post-hoc': except for the DRSA analysis, our method does not require any training or retraining of the neural network model.

Better Insights via Disentangled Explanations
Explainable AI has been shown to be a promising approach for extracting insights into the data and into the systems or processes that generate this data [2], [6]. Several recent works have shown successful usage in biomedical or physics applications. For example, Explainable AI enabled a better understanding of which geometrical aspects of molecules are predictive of toxicity [99] (or 'toxicophores'). It also allowed the prediction of protein interactions in a human cell [14], thereby supporting research on identifying signaling pathways. There are many further examples of successful uses of Explainable AI for extracting scientific insights in geology [12], hydrology [11], quantum chemistry [100], [101], neuroscience [102], [103], histopathology [104], [15], etc. In these works, the authors often resort to standard heatmaps highlighting the extent to which one feature or a group of features contributes to the overall prediction.
The amount of insight one can extract from a standard explanation is however restrained by the fact that multiple concepts are entangled, and it is therefore difficult to gain a structured understanding of the relation between input and output. We showcase in the following how our proposed DRSA-LRP method enables the extraction of more sophisticated insights. We consider for illustrative purposes the task of gaining insights into the visual differences between six classes of butterflies present in the ImageNet dataset: 'admiral', 'ringlet', 'monarch', 'cabbage', 'sulphur', and 'lycaenid'.
For this showcase, where the objective is for the user to gain insights from the model, it is natural to choose the best model available. We choose NFNet-F0, which achieves an overall top-1 accuracy of 82%, compared to VGG16-TV and -ND, which achieve 72% and 70% respectively. We select 125 training images from each of these butterfly classes to form a training set. We use activation and context vectors from NFNet-F0 at Stage 1, and use LRP (with parameter γ = 0.1) to compute the explanations. We extract eight subspaces using DRSA, with optimization details similar to Section 5.2 (see also Supplementary Note E).
First, we would like to build a correspondence table between classes and concepts, indicating for each class which concepts are specific to it. We propose the following simple statistical test, which accounts for the fact that concepts are typically expressed only in a subset of images from the given class. Denote by $\mathcal{D}_\omega$ the set of class ω's validation images and by $\mathcal{D}$ the set of all validation images from the investigated classes (in our showcase, all butterfly images). We consider subspace k to be specific to class ω if

$$Q_\alpha\big(\{R_k\}_{x \in \mathcal{D}_\omega}\big) > Q_\beta\big(\{R_k\}_{x \in \mathcal{D}}\big), \qquad (13)$$

where $Q_\alpha$ is the α-quantile of the given distribution and α < β. In our experiments, we choose α = 0.75 and β = 0.85. In this equation, scores $R_k$ are measured via $\sum_i R_{ik}$. Fig. 9 illustrates the process of matching classes with DRSA subspaces. The right border of the rectangles and the dashed lines correspond to the left- and right-hand sides of Eq. (13). The analysis reveals 10 class-subspace matchings (highlighted in red). We observe that each DRSA subspace is associated to one type of butterfly, except for subspaces S1 and S4, which match multiple classes, thereby indicating visual concepts that are shared between multiple classes. Furthermore, the number of concepts associated to a particular class varies from one (sulphur butterfly) to three (cabbage butterfly).
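The class-concept matching test of Eq. (13) can be sketched as follows, assuming per-image subspace relevances are available as arrays; the function and variable names are ours:

```python
import numpy as np

def specific_subspaces(R_class, R_all, alpha=0.75, beta=0.85):
    """Indices of subspaces deemed specific to a class.

    R_class: (N_class, K) subspace relevances on the class's images
    R_all:   (N_all, K) subspace relevances on all investigated images
    A subspace k is class-specific if its alpha-quantile within the
    class exceeds its beta-quantile over all images (alpha < beta).
    """
    q_class = np.quantile(R_class, alpha, axis=0)  # left-hand side
    q_all = np.quantile(R_all, beta, axis=0)       # right-hand side
    return np.flatnonzero(q_class > q_all)
```

Because the within-class quantile is taken at a lower level than the global one, the test tolerates concepts that appear in only a subset of the class's images.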
Fig. 10 (left) explores, using a three-dimensional scatter plot, how the relation between butterflies and their respective classes is resolved by the subspaces S4, S5, and S7 of our DRSA analysis. Each point in this plot corresponds to one example, and its coordinates are given by the scores $R_k$. As already noted in Fig. 9, we observe that 'monarch' is jointly expressed along axes S4 and S5, and 'admiral' is jointly expressed along axes S4 and S7. These subspaces are not relevant for the other classes; hence, their respective examples appear near the origin. Fig. 10 (right) shows pixel-wise explanations for the most prototypical examples of a few selected classes¹¹. We observe that S1 corresponds to yellow colored surfaces, which seem to be common to ringlet and sulphur butterflies. S4 corresponds to a white-dot texture, which is found on the monarch's wings and body and on the admiral's wings. S5's pattern is specific to the orange/black texture on the wings of the monarch species. S7 captures the prominent orange pattern on the wings of the admiral butterfly. Lastly, we find that S8 captures the distinct dotted pattern that appears on the wings of the ringlet species. We provide the complete set of these subspace heatmaps in Supplementary Note J.2.

Overall, throughout this showcase, we have demonstrated that our method is capable of providing further insights into the complex relation between visual features and class membership. In addition to highlighting features that are predictive of class membership, we have identified distinct visual concepts, such as dotted patterns or yellow textures, that are shared between multiple classes. These shared visual patterns provide a structured understanding of the nonlinear relation between butterfly species and their visual characteristics.

Analyzing Manipulated Explanations using PRCA
One of the premises of Explainable AI is to foster trust among stakeholders, but previous works [37], [105] show that explanation techniques are vulnerable to manipulation. More concretely, a slight perturbation of the input can lead to substantial changes in its explanation while maintaining visual similarity to the input and other statistics (e.g. the model output). Crucially, [37] shows that such a perturbation can lead to arbitrary changes in explanations, having neither relation to the input nor to the original heatmap. Fig. 11 contrasts such a scenario (where a perpetrator perturbs an image to manipulate its explanation) with a regular Explainable AI scenario.
11. We show for the selected classes the example $\arg\max_n \min_{k \in \mathcal{K}} R_{k,n}$, where $\mathcal{K}$ is the set of subspaces associated to the given class, and $R_{k,n}$ is the contribution of subspace k for example n to its associated class.

Fig. 10. Left: Three-dimensional scatter plot resolving the relation between ImageNet's butterfly images and classes along three DRSA subspaces (S4, S5, and S7). Each point is an example, and its coordinates are the relevance scores of the three subspaces. Right: Prototypical examples and their standard and DRSA heatmaps. We show only the heatmaps of the class-subspace configurations that pass the selection criterion (Eq. 13). We provide the complete set of heatmaps in Supplementary Note J.2.

Fig. 11. Application of PRCA for shedding light on a forgery in Explainable AI. Top: Regular scenario. Middle: Adversarial scenario, following the approach of [37], in which a perpetrator imperceptibly perturbs an image in a way that its heatmap (here the LRP heatmap associated to the output 'tibetan terrier' of VGG16-TV) is steered maliciously towards an incorrect target heatmap. Bottom: PRCA of the perturbed image at Conv4_3. The analysis decomposes the manipulated heatmap into the heatmap of class 'tibetan terrier''s maximally relevant direction and its residue. Red and blue colors indicate positive and negative contributions.
Certainly, the vulnerability to perturbation raises not only practical concerns, but also theoretical questions as to how such a phenomenon can happen. As a result, a number of theoretical analyses [37], [105] have been conducted to investigate the cause of this vulnerability. In particular, the investigation of [37] elucidates that the degree to which an explanation can change is partially upper-bounded by the principal curvature evaluated at the data point. Furthermore, [37] shows that, for neural networks with ReLU, the principal curvature can be reduced by approximating ReLU with the softplus activation. By controlling the smoothness parameter of the softplus function, [37] shows that robustness against explanation manipulation can be effectively increased in a post-hoc manner.
Nevertheless, from the perspective of layer-wise representations, it is still unclear how a perturbation causes such dramatic changes in the explanation, or how such changes manifest at a certain layer. We therefore aim to demonstrate that PRCA might provide a clue to answering such questions.
As a proof of concept, we study the PRCA decomposition of LRP explanations from validation images of class 'tibetan terrier' in the ImageNet dataset [85], on VGG16-TV at Conv4_3. More precisely, we perform PRCA on a set of activation and context vectors from 500 training images of the class (details similar to the setup of Section 5.1).
To manipulate explanations, we use the optimization procedure proposed by [37] to find a perturbation that causes arbitrary changes in the explanation of each image, while retaining the same level of model response and visual similarity between the original and perturbed images.
The arbitrary changes are induced by a target explanation, which is the explanation of a random image from a different class.In addition, we also constrain the original and manipulated explanations to have similar total relevance scores.We summarize the details of the algorithm in Supplementary Note J.3.
Qualitatively, Fig. 11 (bottom) shows that the heatmap generated from the first PRCA component preserves features highlighted in the original heatmap, while the residual heatmap (orthogonal complement of the first PRCA component) contains features from both the original and target heatmaps. Looking at the positive and negative parts of the residual heatmap, we observe that the former substantially resembles the target heatmap, while the latter closely matches the part of the original heatmap expressed in the PRCA heatmap, with opposite sign. When using more PRCA components, the PRCA heatmap becomes similar to the target heatmap (see Fig. J.8 in the Supplementary Notes). This behavior suggests that, for VGG16-TV at Conv4_3 and class 'tibetan terrier', the first PRCA component is the direction least affected by the perturbation.
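The decomposition underlying Fig. 11 (bottom) amounts to splitting the layer activations into their projection onto the leading PRCA direction and the orthogonal residue, each of which would then be propagated separately to the input to yield the component and residual heatmaps. A sketch of the split, with names of our own choosing:

```python
import numpy as np

def decompose_activations(a, U1):
    """Split an activation vector into its first-PRCA-component part
    and the orthogonal residue.

    a:  (D,) activation vector
    U1: (D, 1) first PRCA direction (orthonormal column)
    """
    a_prca = U1 @ (U1.T @ a)   # projection onto the PRCA direction
    a_res = a - a_prca         # orthogonal complement (residue)
    return a_prca, a_res
```

By construction the two parts sum back to the original activations, so the corresponding heatmaps decompose the manipulated explanation additively.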
Quantitatively, Fig. 12 shows the mean squared error between the manipulated heatmaps (and their PRCA-decomposed versions) and the original or rescaled target heatmaps; the error is averaged over the 50 validation images of class 'tibetan terrier'. We first observe that the manipulated heatmaps have lower error when compared to the target heatmaps than to the original heatmaps. This confirms that the optimization proposed by [37] is indeed effective and also works well with the additional constraint we impose.
Secondly, when looking at the error of the manipulated heatmaps projected on the first PRCA component (PRCA-1), we observe that these heatmaps are closer to the original heatmaps than to the target ones. The difference between the two errors is interesting because it indicates that PRCA indeed captures parts of the class-specific representation that are less affected by the perturbation. This insight may provide a new perspective towards understanding and increasing robustness against explanation manipulation (cf. also [38]).

CONCLUSION AND DISCUSSION
In this work, we have proposed to disentangle explanations of neural network models into multiple components in order to provide more useful information to the user compared to a standard explanation.
Technically, the desired disentanglement is achieved via an unsupervised analysis at some intermediate layer of the neural network model. A unique aspect of the proposed method is that it jointly analyzes the data and the model's response to the data. Hence, unlike a purely data-driven approach, our method enables a more focused disentanglement that efficiently ignores aspects of the data to which the model is invariant. Besides, our approach does not require any specialized datasets or concept annotations and can be applied to any deep neural network model whose predictions can be attributed to input features. Our method works together with a broad range of state-of-the-art attribution frameworks, such as the Shapley value and LRP.
We have demonstrated the high performance of our disentanglement approach, which scores significantly higher than other methods in our benchmark. Through an implementation of LRP that we have contributed for the state-of-the-art NFNet model, we have further demonstrated that our approach can shed more light on highly sophisticated prediction functions. Building upon existing attribution techniques, our method also inherits the challenges of the attribution problem, such as the need to adapt to the rapidly increasing complexity of ML models.
On a practical note, we have demonstrated the use of our method on three application showcases: 1) detection and mitigation of Clever Hans strategies in the model, 2) extraction of structured insights from the data, and 3) a better understanding of adversarially manipulated explanations.

generalized LRP-γ [9]:

$$R_j = \sum_k \frac{a_j\big(w_{jk} + \gamma\,(\mathbb{1}[a_j > 0]\, w_{jk}^{+} + \mathbb{1}[a_j < 0]\, w_{jk}^{-})\big)}{\sum_{0,j} a_j\big(w_{jk} + \gamma\,(\mathbb{1}[a_j > 0]\, w_{jk}^{+} + \mathbb{1}[a_j < 0]\, w_{jk}^{-})\big)}\, R_k$$

$z^B$-rule [10]:

$$R_i = \sum_j \frac{x_i w_{ij} - l_i w_{ij}^{+} - h_i w_{ij}^{-}}{\sum_i x_i w_{ij} - l_i w_{ij}^{+} - h_i w_{ij}^{-}}\, R_j$$

where we have used the notation $(\cdot)^+ = \max(0, \cdot)$ and $(\cdot)^- = \min(0, \cdot)$, and denoted by $\mathbb{1}[\cdot]$ an indicator function. These multiple rules have different application conditions. For example, the LRP-γ rule requires positive input activations and an output activation function ρ that maps negative values to zero and positive values to positive values, e.g. ReLU. This restriction is dropped for LRP-0, LRP-ϵ and generalized LRP-γ, where inputs can be both positive and negative, and where the only restriction is that the activation function ρ is sign-preserving, e.g. tanh, LeakyReLU, GELU [11], etc. The $z^B$-rule is specialized for input layers receiving pixels, and the parameters $l_i \leq 0$ and $h_i \geq 0$ in this rule are the lowest and highest possible values of $x_i$ (e.g. corresponding to the values of black and white pixels). The rules above address convolution and dense layers. Propagation through max-pooling layers typically follows a winner-take-all [7] (or winner-take-most) strategy. Batch normalization layers are typically fused with the preceding linear layer [8]. Backpropagation strategies have also been developed for LSTM and transformer architectures (cf. [12], [13]). In the absence of neuron biases, most of these rules implement the layer-wise conservation $\sum_j R_j = \sum_k R_k$, and such layer-wise conservation implies the overall conservation $\sum_i R_i = f(x)$.
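As an illustration of the $z^B$-rule above, here is a numpy sketch for a dense input layer. The layer shapes and helper name are ours; real implementations operate on convolutional layers and add a stabilizer to the denominator, as done below with a small ϵ:

```python
import numpy as np

def lrp_zB(x, W, R_out, l, h, eps=1e-9):
    """z^B-rule propagation through a dense input layer (no bias).

    x:     (I,) input (pixel) values
    W:     (I, J) weight matrix
    R_out: (J,) relevance of the layer's outputs
    l, h:  (I,) lowest/highest possible input values (l <= 0 <= h)
    """
    Wp, Wm = np.maximum(W, 0.0), np.minimum(W, 0.0)
    # z_j = sum_i (x_i w_ij - l_i w_ij^+ - h_i w_ij^-)
    z = x @ W - l @ Wp - h @ Wm
    # stabilized element-wise division (assumes z_j != 0)
    s = R_out / (z + eps * np.sign(z))
    # redistribute: R_i = sum_j (x_i w_ij - l_i w_ij^+ - h_i w_ij^-) s_j
    return x * (W @ s) - l * (Wp @ s) - h * (Wm @ s)
```

Since each column of the numerator sums to $z_j$, the rule is conservative up to the stabilizer: the input relevances sum to (approximately) the output relevances.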

SUPPLEMENTARY NOTE B DISENTANGLED EXPLANATIONS WITH VARIOUS ATTRIBUTION TECHNIQUES
In this note, we show how to apply our approach to disentangling explanations in conjunction with attribution techniques other than LRP, specifically Gradient × Input, Integrated Gradients, and the Shapley value. For this, we need to express for each of them the two steps of attribution, and to verify that the relevance scores $R_k$ have the structure required by the PRCA/DRSA analyses. In the following, we use two decompositions of the neural network function f: a first (coarse) decomposition, used for the derivation of the two-step attributions, and a second (fine) decomposition, used to analyze the structure of $R_k$.

B.1 Deriving Two-Step Attributions
We start with the two-step attribution process described generically in the main paper. We have explained that, when using the LRP attribution technique, this two-step process can be readily implemented by filtering the backpropagation flow to only retain what passes through the neurons with index k. We now demonstrate how to perform these two steps of explanation for non-backpropagation methods, in particular Gradient × Input, Integrated Gradients, and the Shapley value.
where $[\cdot]_{\text{cst.}}$ denotes a constant approximation of the expression evaluated at the current point, and where we use the convention 0/0 = 0. With this approximation, one can more efficiently attribute to the input features by performing the subsequent integrated gradient calculation, which, unlike Eq. (B.6), does not have a nested integral. Furthermore, assuming the functions $f_1$ and $f_2$ are zero at the origin, we get the conservation property $\sum_p R_p = \sum_{p,k} R_{pk} = \sum_k R_k = y$; however, $\sum_k R_{pk} \neq R_p$ in general. Hence, we have a weaker form of conservation than the one obtained with Gradient × Input and LRP, and this is due to the relevance modeling step.

B.1.3 Application to Shapley Values
Using the Shapley value as an attribution method for each step (with reference points x = 0 and h = 0), the nesting again makes the computation expensive, as for Integrated Gradients. We proceed similarly to Integrated Gradients by building a relevance model, with the convention 0/0 = 0. Under this relevance model, we can more efficiently compute the joint relevance scores. Note that if the functions $f_1$ and $f_2$ are zero-valued at the origin, then we have the conservation property $\sum_p R_p = \sum_{p,k} R_{pk} = \sum_k R_k = y$. However, as for Integrated Gradients, the relevance modeling step implies that $\sum_k R_{pk}$ typically differs from the original Shapley value $R_p$. This weaker form of conservation is again due to the relevance modeling step.

B.2 Verifying the Structure of R k
Recall from the main paper that for PRCA/DRSA to be applicable, the relevance scores $R_k$ associated to the vectors $h_k$ should be expressible in terms of the orthogonal matrix U and the activation/context vectors as $R_k = a^\top U_k U_k^\top c$. We verify that the $R_k$ have this structure for Gradient × Input and Integrated Gradients, and show which approximation can be made to recover such structure when using Shapley values.
B.2.1 Structure of $R_k$ with Gradient × Input

When using Gradient × Input, the desired structure can be identified from an application of the chain rule for derivatives,

$$R_k = h_k^\top \frac{\partial y}{\partial h_k} = h_k^\top \Big(\frac{\partial a'}{\partial h_k}\Big)^{\!\top} \frac{\partial y}{\partial a'},$$

and observing that $h_k = U_k^\top a$ and that $\partial a'/\partial h_k = U_k$, we get $R_k = a^\top U_k U_k^\top c$, with $c = \partial y / \partial a'$.
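The structure $R_k = a^\top U_k U_k^\top c$ can be checked numerically: with a complete orthonormal set of blocks $U_k$, the subspace relevances sum to the total Gradient × Input relevance $a^\top c$ at the layer. A sketch, with a function name of our own choosing:

```python
import numpy as np

def subspace_relevance(a, c, U_k):
    """Relevance of subspace k in the form required by PRCA/DRSA:
    R_k = a^T U_k U_k^T c, with c the gradient of the output with
    respect to the reconstructed activations a'."""
    return float(a @ U_k @ (U_k.T @ c))
```

This is the quantity that PRCA/DRSA optimize over the choice of the orthonormal blocks $U_k$.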

B.2.2 Structure of R k 's with Integrated Gradients
When using Integrated Gradients with a linear integration path from 0 to $h_k$, we state the Integrated Gradients equation and apply the chain rule for derivatives. Taking the constant terms out of the integral and expressing $h_k$ as a function of a, we get $R_k = a^\top U_k U_k^\top c$, with $c_j = \int_0^1 \frac{\partial y}{\partial a'_j}\, dt$. This is similar to the Gradient × Input case, except that the gradient $\partial y / \partial a'$ is evaluated and averaged over all activation vectors encountered on the linear integration path. This specific structure of $R_k$ lets us revisit the relevance model $\widehat{R}_k$ proposed in Supplementary Note B.1 for performing the second step of attribution. In particular, an inspection of Eq. (B.23) suggests an alternate relevance model in which c is replaced by $[c]_{\text{cst.}}$, a constant approximation of c evaluated at the current data point.

B.2.3 Structure of R k 's with Shapley Value
The Shapley value equation does not allow for factoring out the transformation matrices U_k as was the case for the methods above. To incorporate Shapley value attribution (with baseline h_k = 0), we have discussed in Supplementary Note B.1.3 how to perform Shapley value attribution on the activation layer. Also, because the Shapley value method is typically slow in high dimensions, attribution can in practice be performed on groups of activations (e.g. the collection of activations in a feature map j).
Let us denote by a_j the activations in feature map j, and by R_j the attribution on this group of activations obtained with the Shapley value framework (with baseline a = 0). We apply LRP to further propagate to the concept k. More precisely, we use the standard LRP rule to redistribute the contribution of the concept k to the sum of activations in each group j: where 1_j denotes a vector of ones of the same size as a_j, and (U_k)_j is the j-th row of the block U_k in the orthogonal matrix U connecting the concept k to the group of activations j. The relevance score can then be further developed as: As for the standard Shapley value, the scores R_k produced by our modified Shapley value formulation depend on the input features in an intricate way (here, through the vector c, which itself depends on the regular Shapley attribution scores R_j). Similarly to the Integrated Gradients case, an inspection of Eq. (B.28) suggests the relevance model where [c]_cst. is a constant approximation of c evaluated at the current data point.

B.3 Analytical Calculation of Total Relevance
In practice, it can be useful to quickly predict certain properties of the explanation without computing the full explanation. For the Shapley value and Integrated Gradients, the sum of obtained relevances (∑_pk R_pk) can be calculated without performing the second step of the explanation procedure. In particular, we can show that: where we have denoted by [E(a, x)]_p the vector containing the attribution of all elements of a onto feature x_p. From (B.31) to (B.32), we have used the linearity of Shapley values to pull the constant multiplicative factors out of the attribution function. From (B.33) to (B.34), we have used the conservation property of Shapley values to express the sum of scores forming the explanation as a difference of two function evaluations. Overall, the final formulation of the total relevance ∑_pk R_pk only involves, in addition to the prediction, the computation of the vector c and the activations associated to the reference point x.
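The conservation argument used from (B.33) to (B.34) rests on the efficiency property of Shapley values: attributions sum to a difference of two function evaluations. This can be verified with a brute-force Shapley computation on a toy function (a sketch assuming a small input dimension; the function f and its dimensionality are illustrative only):

```python
import numpy as np
from itertools import combinations
from math import factorial

def exact_shapley(f, x, baseline):
    """Exact Shapley values of f at x w.r.t. a baseline (brute force over coalitions)."""
    P = len(x)
    phi = np.zeros(P)
    players = list(range(P))
    for p in players:
        others = [q for q in players if q != p]
        for size in range(P):
            for S in combinations(others, size):
                w = factorial(size) * factorial(P - size - 1) / factorial(P)
                z = baseline.copy(); z[list(S)] = x[list(S)]   # coalition S present
                z_p = z.copy();      z_p[p] = x[p]             # add player p
                phi[p] += w * (f(z_p) - f(z))
    return phi

rng = np.random.default_rng(1)
x, x0 = rng.normal(size=4), np.zeros(4)
f = lambda z: float(np.tanh(z).sum() + z[0] * z[1])   # toy nonlinear model

phi = exact_shapley(f, x, x0)
# efficiency/conservation: attributions sum to f(x) - f(baseline)
assert np.isclose(phi.sum(), f(x) - f(x0))
```

The sampling-based Shapley implementation used in our experiments approximates these exact values, but the efficiency property holds in both cases.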

SUPPLEMENTARY NOTE C PROOFS AND DERIVATIONS
In this note, we provide the proofs of Propositions 1 and 2 of the main paper, and the derivation of the eigenvalue formulation of our PRCA objective.

C.1 Proofs of Propositions
Recall that a = (a_j)_j is a vector of activations at some layer of the neural network and R_j is the relevance of neuron j for the model output. Recall that R_j decomposes as R_j = a′_j c_j, and let c = (c_j)_j. We restate the first proposition of the paper and provide the proof.

Proposition 1. Let U = (U_k)_k be an orthogonal matrix formed by the blocks U_k. Using the formulation of relevance R_k = (U_k^⊤ a)^⊤ (U_k^⊤ c) with c such that R_j = a′_j c_j, we have the conservation property ∑_k R_k = ∑_j R_j. Furthermore, when c = ξa with ξ ≥ 0, we necessarily have R_k ≥ 0.
Proof. We get the conservation result by observing that: For the positivity property, we first recall the assumption c = ξa with ξ ≥ 0. Then, we have:

Proposition 2. When the context vector c is equal to the activation vector a, the PRCA analysis reduces to uncentered PCA. Furthermore, if we assume whitened activations, i.e. E[a] = 0 and E[aa^⊤] = I, and each matrix U_k projecting to a subspace of dimension 1, then the DRSA analysis with parameter q = 2 reduces to ICA with kurtosis as a measure of subspace independence.
Proof.We divide the proof of the proposition into two parts: 1) the reduction from PRCA to uncentered PCA; and 2) the reduction from DRSA to ICA.
(Part 1: Reduction from PRCA to PCA) Let U ∈ ℝ^{D×d} and recall that the PRCA objective is to maximize E[(U^⊤a)^⊤(U^⊤c)] w.r.t. U subject to U^⊤U = I_d. Setting c = a, the objective can be further developed as:

(Part 2: Relation between DRSA and ICA) Let a_n ∈ ℝ^D and c_n ∈ ℝ^D be the activation and context vectors of a data point n ∈ D. Because we consider the case where each subspace is one-dimensional, there are K = D such subspaces. We denote the collection of these subspaces by K = {1, . . ., D}. We also fix q = 2. A reduction of the DRSA objective to this setting gives: subject to: where the quantity inside the mean is the rectified relevance on the subspace k of the data point n; where the matrix U = (U_k ∈ ℝ^{D×1})_{k∈K} is the concatenation of D one-dimensional transformation matrices U_k; and where M^p is a generalized F-mean with function F(t) = t^p. We now replace the context vector c_n in (C.17) with a_n, which gives us: where the last step follows from the fact that U_k ∈ ℝ^{D×1}. We then inject this expression in (C.15), which gives: Optimizing the objective above is therefore equivalent to maximizing: Recall the definition of kurtosis for a random variable Y (see Eq. 8.5 in [14]): Let Y_k = U_k^⊤ a_n be the random variable associated with the (random) activation vector a_n. Because U_k ∈ ℝ^{D×1} and the activation vectors a_n are whitened, we have: Therefore, the maximization objective becomes the sum of kurt(Y_k).
Remark. Unlike the setting of Proposition 2, which uses q = 2 (i.e. M²_{k∈K}), our DRSA models use the parameter q = 0.5 in order to balance the contribution of each subspace. Such balancing is not necessary in ICA because the latter is always preceded by a whitening transform.
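The role of whitening in this reduction can be illustrated numerically. In the sketch below (synthetic heavy-tailed data, not drawn from a real network), whitening fixes the second moment of any unit-norm projection to one, so that maximizing the fourth moment is equivalent to maximizing kurtosis; we use the excess-kurtosis convention kurt(Y) = E[Y⁴] − 3 E[Y²]²:

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 100000, 3
a = rng.laplace(size=(N, D))                 # heavy-tailed synthetic "activations"

# whiten: E[a] = 0 and E[a a^T] = I
a = a - a.mean(axis=0)
w, E = np.linalg.eigh(np.cov(a.T))
a = a @ E @ np.diag(w ** -0.5) @ E.T

u = rng.normal(size=D)
u /= np.linalg.norm(u)                       # a one-dimensional subspace U_k
Y = a @ u

second = (Y ** 2).mean()                     # fixed to ~1 by whitening
kurt_excess = (Y ** 4).mean() - 3 * second ** 2

assert abs(second - 1.0) < 0.05              # second moment is constant ...
assert kurt_excess > 0.0                     # ... so maximizing E[Y^4] maximizes kurtosis
```

Because the second moment is pinned at one for every direction u, the only degree of freedom left to the objective is the fourth moment, which is the essence of the reduction to kurtosis-based ICA.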

C.2 Derivation of the PRCA Objective
Let A = {(a, c)} be a set of activation-context vector pairs. Let V ∈ ℝ^{D×d}, where d is the number of dimensions chosen by the user. Optimizing the PRCA objective leads to the eigenvalue problem E[ac^⊤ + ca^⊤] U = U Λ, where Λ ∈ ℝ^{d×d} is a diagonal matrix containing the d largest eigenvalues and U ∈ ℝ^{D×d} is the concatenation of the d corresponding eigenvectors.
Proof.We divide the proof into three steps: 1) reformulating (V ⊤ a) ⊤ (V ⊤ c) in terms of trace; 2) constructing a constrained optimization problem using the method of Lagrange multipliers; and 3) finding the critical points of the constructed Lagrangian.
(Step 1): Observing that (V^⊤a)^⊤(V^⊤c) ∈ ℝ, we can write it as a trace: where the last step uses the fact that the trace is invariant under cyclic permutation.
(Step 2): Because the condition V^⊤V = I_d induces d(d + 1)/2 equality constraints, the Lagrangian of the objective is: where S ∈ ℝ^{d×d} is a symmetric matrix of Lagrange multipliers for the d(d + 1)/2 equality constraints.

(Step 3): Taking the derivative of L(V, S) w.r.t. V and setting it to zero yields: Because S = S^⊤, it can be diagonalized. Suppose its diagonalization is S = EΛE^⊤, where E ∈ ℝ^{d×d} is an orthogonal matrix and Λ ∈ ℝ^{d×d} is a diagonal matrix. Right-multiplying Eq. (C.34) with E leads to: We therefore arrive at the eigenvalue problem: where each column U_{:,τ} ∈ ℝ^D is the eigenvector corresponding to the τ-th largest eigenvalue Λ_ττ.
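The derivation above can be condensed into a few lines of NumPy: form the symmetrized cross-covariance matrix E[ac^⊤ + ca^⊤], take its top-d eigenvectors, and compare the attained objective against a random orthonormal projection (synthetic a and c; a sketch, not the experimental code):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, d = 5000, 10, 2
A = rng.normal(size=(N, D))                      # activation vectors a_n (rows)
C = A + 0.1 * A @ rng.normal(size=(D, D))        # correlated context vectors c_n

# symmetrized cross-covariance E[a c^T + c a^T]
M = (A.T @ C + C.T @ A) / N

# top-d eigenvectors -> PRCA projection U
w, V = np.linalg.eigh(M)                         # eigh returns ascending eigenvalues
U = V[:, np.argsort(w)[::-1][:d]]

# PRCA objective E[(B^T a)^T (B^T c)] estimated on the sample
obj = lambda B: float(np.mean(np.einsum('nd,nd->n', A @ B, C @ B)))
Urand, _ = np.linalg.qr(rng.normal(size=(D, d)))

assert np.allclose(U.T @ U, np.eye(d), atol=1e-8)   # orthonormal columns
assert obj(U) >= obj(Urand)                         # eigen-solution attains a higher objective
```

Since the objective equals tr(B^⊤MB)/2, its maximum over orthonormal B is half the sum of the d largest eigenvalues of M, which is exactly what the eigenvector solution attains.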

SUPPLEMENTARY NOTE D SELECTION OF THE ATTRIBUTION METHOD
As a starting point for our experiments, we need to select a suitable attribution method. It serves both to extract the relevance scores necessary to build subspaces and to compute disentangled explanations based on the learned subspaces.

D.1 Evaluation Baselines
We conduct the evaluation of attribution methods and parameter selection on the ImageNet dataset.We consider Shapley value, Gradient × Input, Integrated Gradients, and LRP attribution techniques.
For the Shapley value, we use the 'Shapley Value Sampling' approximation from the Captum library [15]. Additionally, we coarse-grain pixels into disjoint 16 × 16 pixel patches and perform attribution on these patches, making the attribution for one data point achievable within a reasonable time. We choose the number of permutations in Shapley Value Sampling to be 25. We use the reference point x = 0, which corresponds to setting removed patches to a uniform gray color, and a = 0 when applying the method to a function of activations.
For Integrated Gradients, we set the reference points x and a to zero, similarly to the Shapley value setting, and choose the linear integration path between the reference points and the actual points. We perform 10 integration steps when attributing on input features, and 100 steps when attributing on activations.
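For illustration, a minimal hand-rolled Integrated Gradients along such a linear path (a Riemann-sum sketch with an analytic gradient on a toy model; our experiments use the Captum implementation instead):

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=100):
    """IG along a linear path from `baseline` to `x` (midpoint Riemann sum)."""
    alphas = (np.arange(steps) + 0.5) / steps
    avg_grad = np.mean([grad_f(baseline + a * (x - baseline)) for a in alphas], axis=0)
    return (x - baseline) * avg_grad

# toy model y = sum(tanh(w * x)) with its analytic gradient
w = np.array([0.5, -1.0, 2.0])
f = lambda x: float(np.tanh(w * x).sum())
grad_f = lambda x: w * (1 - np.tanh(w * x) ** 2)

x = np.array([1.0, 2.0, -0.5])
attr = integrated_gradients(f, grad_f, x, np.zeros(3))

# completeness: attributions sum to f(x) - f(baseline)
assert np.isclose(attr.sum(), f(x) - f(np.zeros(3)), atol=1e-3)
```

The completeness check mirrors the conservation properties discussed in Supplementary Note B; the accuracy of the sum improves with the number of integration steps.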
For Layer-wise Relevance Propagation (LRP), we choose for the VGG16 architecture the heuristics of [16]. Specifically, we use LRP-γ, where we set γ to 0.5 in the first two convolution blocks, 0.25 in the third one, 0.1 in the fourth one, and 0.0 in the last convolution block and in the classification head. For the NFNet architecture, we use our novel LRP implementation that utilizes the generalized LRP-γ rule [9]. We choose the generalized LRP-γ because activations in NFNets can be positive or negative. We use the same value of γ for all layers in NFNets and perform parameter selection based on the evaluation metric described in the following. We refer to Supplementary Note K for the details of our NFNet-LRP implementation.
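To make the LRP-γ heuristic concrete, here is a sketch of the rule for a single dense layer with non-negative inputs (the generalized rule for NFNets and the treatment of convolutions differ; the weight matrix and values are illustrative):

```python
import numpy as np

def lrp_gamma(a, W, R_out, gamma=0.25, eps=1e-9):
    """LRP-gamma for one dense layer z = W^T a (a >= 0 assumed, as after ReLU).

    Positive weights are amplified by (1 + gamma) before relevance is
    redistributed proportionally to each input's contribution."""
    Wg = W + gamma * np.clip(W, 0, None)        # gamma-modified weights
    z = a @ Wg
    z = z + eps * np.sign(z)                    # stabilize the division
    s = R_out / z                               # element-wise relevance ratio
    return a * (Wg @ s)                         # redistribute to inputs

rng = np.random.default_rng(4)
a = np.abs(rng.normal(size=5))                  # non-negative (ReLU) activations
W = rng.normal(size=(5, 3))
R_out = rng.normal(size=3)

R_in = lrp_gamma(a, W, R_out)
# conservation: relevance is (approximately) preserved across the layer
assert np.isclose(R_in.sum(), R_out.sum(), atol=1e-6)
```

Larger γ values push the redistribution toward positive contributions, which is why earlier layers (γ = 0.5) receive a stronger positivity bias than later ones (γ = 0).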

TABLE D.1
Area under the Patch-Flipping Curve (AUPC) of different attribution methods for different models. We average the obtained scores over 5000 random validation images from the ImageNet dataset [18] for all methods, except Shapley Value Sampling, where we use only 10% of the images for computational reasons. The largest error bars of Shapley Value Sampling and the other methods are ±0.13 and ±0.06, respectively. We perform patch-flipping over patches of size 16 × 16. We show the best AUPC score of each model in bold.

D.2 Patch-Flipping Evaluation
Similar to Section 4.1 in the main paper, we evaluate the basic (one-step) attribution techniques (and their parameters) using the patch-flipping method and the area under the patch-flipping curves (AUPC) [17].
Experimental Setup: We construct a dataset of 5000 images by randomly selecting 5 validation images of each class in the ImageNet dataset [18]. We compare five attribution methods, namely random attribution, Gradient × Input, Integrated Gradients, Shapley Value Sampling, and LRP. For computational reasons, we use only 10% of the dataset for Shapley Value Sampling. We perform the comparison on the three ImageNet-pretrained models used in the main paper (VGG16-TV, VGG16-ND, and NFNet-F0).
Table D.1 shows the AUPC scores of the different attribution methods and models. From the table, as expected, we first observe that all methods are substantially better than the random baseline. Secondly, we see that Shapley Value Sampling and LRP perform better than Gradient × Input and Integrated Gradients across the three models. Compared to Shapley Value Sampling, LRP is on par on VGG16-ND, slightly worse on VGG16-TV, and better on NFNet-F0. We refer to Fig. D.1 for the corresponding patch-flipping curves. For NFNet, we performed a grid search on γ ∈ {0, 0.001, 0.01, 0.1, 1.0} and found that γ = 0.1 yields a satisfying AUPC of 3.64. A slightly better AUPC of 3.54 could be obtained with γ = 0.01, but with visually noisier (and less interpretable) heatmaps.

D.3 Computational Efficiency of Attribution Methods
Computational efficiency of the explanation method is an important aspect in practice. Gradient × Input and LRP perform favorably, with both approaches requiring only one forward/backward pass through the network. LRP comes with a small additional cost due to operationalizing the LRP rules (e.g. via forward hooks). Although the cost is implementation-dependent, we find empirically that it does not exceed an order of magnitude of the original computation. In comparison, the cost of Integrated Gradients grows linearly with the number of integration steps, and that of Shapley Value Sampling grows linearly with the number of input features and sampled permutations.
We note that, unlike a typical (i.e. one-step) attribution scenario, the two-step explanation we consider in our paper generates not a single but K explanations per data point. In other words, the overall runtime increases for all methods by a linear factor K compared to the typical setup, making computational efficiency an important criterion when selecting the underlying attribution method.

SUPPLEMENTARY NOTE E TRAINING DRSA AND DSA

E.1 Preprocessing and Optimization Parameters
Let X be a set of randomly selected training images of the class of interest with |X| = N. We take activation (and context) vectors at n random spatial locations from each of these N training images. We generate the context vectors w.r.t. the logit of the class using a chosen attribution method. Denote by A = {(a ∈ ℝ^D, c ∈ ℝ^D)} the set of these activation-context vector pairs. Suppose i ∈ {1, . . ., |A|} and j ∈ {1, . . ., D}. We optimize DRSA on Â = {(â, ĉ)}, where: We found that the normalization helps stabilize the optimization process. We initialize the training of DRSA with a random D × D orthogonal matrix, which we partition into K blocks according to the numbers of dimensions d_k chosen by the user. As mentioned in the main paper, each optimization iteration contains two steps, namely 1) batch gradient ascent and 2) orthogonalization. Because the objective of DRSA (Eq. 10 in the main paper) is non-convex, we perform τ runs using different orthogonal matrices. Among these τ runs, we select the run that achieves the highest objective value as the solution of the optimization. We sort the blocks U_k of the solution in descending order and form the final orthogonal matrix U accordingly.
We select activation vectors at n = 20 different spatial locations from N = 500 examples. The optimization procedures are run for 5000 iterations, and we perform τ = 3 runs, retaining the best solution. With these parameters, the optimization time of DSA/DRSA is approximately 10 and 60 minutes on an Nvidia Quadro RTX 5000 for activations with D = 512 and D = 1536, respectively; the former is the case of VGG16 at Conv4 3, while the latter is that of NFNet-F0 at Stage 2.
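The two alternating steps of each iteration can be sketched as follows; this toy version substitutes a simplified rectified-relevance surrogate for the full DRSA objective (Eq. 10 of the main paper), uses a finite-difference gradient for brevity, and performs the orthogonalization step by projecting onto the nearest orthogonal matrix via SVD:

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, K = 512, 8, 4
A = rng.normal(size=(N, D))                 # activation vectors a
C = A + 0.1 * rng.normal(size=(N, D))       # context vectors c (correlated with A)

def orthogonalize(U):
    """Project U onto the nearest orthogonal matrix (via SVD)."""
    P, _, Qt = np.linalg.svd(U)
    return P @ Qt

def objective(U):
    """Simplified surrogate for Eq. 10: mean rectified relevance per subspace."""
    scores = [np.clip(((A @ Uk) * (C @ Uk)).sum(axis=1), 0, None).mean()
              for Uk in np.split(U, K, axis=1)]
    return float(np.mean(scores))

U = orthogonalize(rng.normal(size=(D, D)))  # random orthogonal initialization
lr, eps = 0.1, 1e-4
for _ in range(30):
    grad = np.zeros_like(U)                 # finite-difference gradient (brevity only)
    for i in range(D):
        for j in range(D):
            dU = np.zeros_like(U); dU[i, j] = eps
            grad[i, j] = (objective(U + dU) - objective(U - dU)) / (2 * eps)
    U = orthogonalize(U + lr * grad)        # 1) gradient ascent, 2) orthogonalization

assert np.allclose(U.T @ U, np.eye(D), atol=1e-8)
assert objective(U) > 0.0
```

The projection step keeps the iterate exactly orthogonal after every gradient update, which is the role of the orthogonalization step described above; a practical implementation would use automatic differentiation rather than finite differences.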

E.2 Procedure for Selecting Subspace Prototypes
We develop a procedure to select a set of prototypical images for visualizing what semantic features DRSA subspaces represent. For example, we use the procedure to generate Fig. 1 in the main paper. We design the procedure such that all subspaces are approximately equally expressed. We achieve this goal by utilizing the objective of DRSA (Eq. 10 in the main paper).
Let U = (U_1 | . . . | U_k | . . . | U_K) be the learned orthogonal matrix from DRSA for a given class. Let n be the desired number of prototypes and N be the number of random candidate subsets. Denote by X a set of images from the class. The procedure goes as follows: 1) we construct N random candidate subsets, each containing n images from X; 2) we compute, for each candidate subset, the DRSA objective using the activation and context vectors of the n images in the subset; 3) we take the subset with the largest objective as the set of prototypical images.
We use N = 1000.
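The three-step procedure can be sketched as follows (random features and a random orthogonal matrix stand in for real activations and a learned DRSA matrix; the subset score is a simplified stand-in for the DRSA objective):

```python
import numpy as np

rng = np.random.default_rng(6)
n_images, D, K = 40, 8, 4
feats = rng.normal(size=(n_images, D))             # per-image activation summaries
U, _ = np.linalg.qr(rng.normal(size=(D, D)))       # stand-in for the learned DRSA matrix

def subset_score(idx):
    """Stand-in for the DRSA objective: every subspace should be expressed."""
    R = [((feats[idx] @ Uk) ** 2).sum(axis=1).mean()
         for Uk in np.split(U, K, axis=1)]
    return float(np.prod(R) ** (1 / K))            # geometric mean favors balance

n, N_candidates = 6, 1000
# 1) draw candidate subsets, 2) score each, 3) keep the best one
best = max((rng.choice(n_images, size=n, replace=False)
            for _ in range(N_candidates)), key=subset_score)

assert len(best) == n
assert subset_score(best) > 0.0
```

The geometric mean in the stand-in score is one way to reward subsets in which all subspaces are expressed rather than a single dominant one, mirroring the balancing effect of the actual objective.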

SUPPLEMENTARY NOTE F ABLATION STUDIES

F.1 Ablation on PRCA and DRSA Formulations
The goal of PRCA is to find a d-dimensional subspace that is maximally relevant for the prediction. As discussed in Section 4.1 of the main paper, this is equivalent to finding a matrix U ∈ ℝ^{D×d} that defines a projection onto such a subspace and captures the most relevance, i.e. maximize_U E[R] with R = (U^⊤a)^⊤(U^⊤c), subject to U^⊤U = I_d. As stated in the main paper and shown in Supplementary Note C.2, the solution to this optimization is given by the d eigenvectors associated with the largest eigenvalues of the symmetrized cross-covariance matrix E[ac^⊤ + ca^⊤]. In this ablation analysis, we aim to verify that the solution of PRCA indeed preserves the most relevance. We quantify this property through the 'Total Relevance' score of the input features (x_p)_{p=1}^P: We remark that, for the Shapley value, we can compute this score efficiently, where ϕ is the function mapping the input x to the activation vector a, where the reference point is x = 0, and where c is the context vector computed by attributing the neural network output f(x) onto a. We refer to Supplementary Note B.3 for the derivation. We compare PRCA with three ablations of its formulation:
• Ablation 1: we replace the context vector c in the objective function with the activation vector a; the optimization problem then leads to the formulation of (uncentered) PCA (cf. Proposition 2).
• Ablation 2: we construct U from only standard basis vectors, i.e. the matrix U has only one non-zero entry in each row and column.
The experimental setup is similar to Section 5.1 in the main paper. From Table F.1, we observe that, as expected, PRCA has the highest total relevance score compared to the three ablations. This observation confirms that the two properties of PRCA, 1) maximizing relevance and 2) doing so over any orthogonal projection, are both important in order to concisely capture the relevant part of the decision strategy. As a further experiment, we analyze the total relevance score as a function of the subspace dimension d. We perform the experiment using the VGG16-TV model. From Fig. F.1, we observe that PRCA is superior to the three ablations for every subspace size. In particular, PRCA is able to 1) extract in the top few principal components a large amount of positive evidence for the output neuron and 2) strongly suppress negative contributions.
Together, the results of these experiments therefore substantiate that PRCA indeed finds the subspace that is maximally relevant.
The second question we have considered in this paper (and for which we have proposed the DRSA analysis) is whether the explanation can be disentangled into semantic components that contribute to the model's decision strategy.In Section 4.2 of the main paper, we have translated the question into the problem of finding an orthogonal matrix U = (U 1 | . . .|U k | . . .|U K ) that partitions the D-dimensional activation space into K subspaces.
Because our focus is on the case of CNNs, in which semantic patterns in the input are spatially separate, such a matrix U disentangles explanations into spatially non-overlapping components. The goal of this ablation study is therefore to verify the non-overlapping property of explanation components directly at the level of joint pixel-concept relevance scores (rather than at the level of changes in the model's output, as with the AUPC score).
To quantify this property, we propose separability and peakness scores of explanation components. A low separability score occurs, for example, when only a single-component explanation (i.e. a standard explanation) is available or when all components of the explanation are the same. Conversely, the separability score is high when the contributions associated with different components correspond to different input features; in other words, these components are spatially separated. A high peakness occurs when the components of the explanation focus strongly on distinct aspects of the decision strategy.
In the following, we consider DRSA and two ablations of its formulation, namely • Ablation 1: we substitute the context vector c in the objective of DRSA with the activation vectors a; this ablation is considered in the main paper and called DSA.
• Ablation 2: we use a random orthogonal matrix.
The experimental setup is similar to Section 5.2 in the main paper. Table F.2 shows the separability and peakness scores across setups. From the table, we observe that DRSA has the highest separability and peakness scores. This result reflects the visual inspection of Fig. 1 in the main paper, where we can identify distinct concepts from the DRSA explanations. Furthermore, with different choices of layers or numbers of subspaces, the observation from Table F.2 still applies. We discuss these additional experiments in Supplementary Note F.2.

F.2 Choosing Different Layers or Number of Subspaces
We investigate the effect of two important parameters in our evaluations of DRSA against other baselines, namely the choice of layer and the number of subspaces. Because DSA is the strongest baseline in the evaluation of Section 5.2 of the main paper, we compare DRSA against it for different values of these parameters.
We perform the experiments with LRP-γ and the VGG16-TV and VGG16-ND models on the 50 classes of the ImageNet dataset, similar to the main experiments (cf. Section 5.2 of the main paper). We report the Area Under the Patch-Flipping Curve (AUPC) and the separability and peakness scores defined above. We remark that, for AUPC, a lower score is better, while, for separability and peakness, a higher score is better.

F.2.2 Comparison between DSA and DRSA on Different Numbers of Subspaces
We investigate whether the difference in AUPC, separability, and peakness scores between DSA and DRSA remains consistent when varying the number of subspaces K. We consider K ∈ {2, 4, 8, 16} and fix the layer of interest to Conv4 3. Across these values of K, DRSA achieves higher scores than DSA. This result suggests that the superiority of DRSA is not due to the number of subspaces.

SUPPLEMENTARY NOTE G CLASS-WISE AUPC SCORE ANALYSIS
We conduct an additional analysis of the experimental results presented in Section 5.2 of the main paper. Our goal is to investigate the level of explanation disentanglement across different classes. Because the outputs of different classes are often on different scales, we use the following class statistic to account for this effect and compare the disentanglement level of different classes; the higher the ∆AUPC score, the more disentangled explanation components the class has.

SUPPLEMENTARY NOTE H DETAILS ABOUT NETDISSECT
We briefly describe the NetDissect method which we use in our benchmark evaluations in the main paper.
NetDissect [20] is a framework that associates high-level concepts with units (filters in a given layer) of neural networks. The framework constructs a set of concepts K from the semantic categories of the Broden dataset [21]. Let J = {1, . . ., D} be the set of units in a given layer. The framework performs three steps to associate a unit j ∈ J with a concept k ∈ K: 1) Gathering Activation Maps: images from the Broden dataset {x ∈ ℝ^{3×h×w}} are fed to the model. Their unit-j activation maps {A_j(x) ∈ ℝ^{h′×w′}} are gathered to determine the 99.5-th percentile τ_j of the unit j's overall response, i.e. P(a_j > τ_j) = 0.005.
2) Producing Binary Response Masks: each activation map A_j(x) is resized to the spatial dimensions of the input, S_j(x) = upsample(A_j(x)) ∈ ℝ^{h×w}, and then binarized with the percentile τ_j to produce a response mask, i.e. M_j(x) = 1[S_j(x) > τ_j] ∈ {0, 1}^{h×w}.
3) Quantifying the Alignment between Concept k and Unit j: Let D_k ⊂ D be the subset of Broden images whose pixels are annotated with the concept k. Denote the concept-k annotation mask of each image x as L_k(x) ∈ {0, 1}^{h×w}. NetDissect quantifies the alignment between the concept k and the unit j using the Intersection over Union (IoU) ratio: If the criterion IoU(j, k) > α is satisfied, the concept k is added to the unit j's concept set K_j ⊆ K.
NetDissect then assigns the concept with the largest IoU score as the concept of the unit j, i.e. the concept argmax_{k∈K_j} IoU(j, k).
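The three NetDissect steps can be sketched as follows (random activation and annotation masks stand in for real network responses and Broden labels):

```python
import numpy as np

def iou(response_mask, annotation_mask):
    """Intersection over Union between binary response masks M_j(x) and
    concept annotation masks L_k(x), accumulated over a set of images."""
    inter = np.logical_and(response_mask, annotation_mask).sum()
    union = np.logical_or(response_mask, annotation_mask).sum()
    return inter / union if union > 0 else 0.0

rng = np.random.default_rng(7)
n_imgs, h, w = 100, 16, 16
S = rng.normal(size=(n_imgs, h, w))            # upsampled activation maps S_j(x)
tau = np.quantile(S, 0.995)                    # 99.5-th percentile threshold tau_j
M = S > tau                                    # step 2: binary response masks
L = rng.random(size=(n_imgs, h, w)) > 0.9      # toy annotation masks L_k(x)

score = iou(M, L)                              # step 3: concept/unit alignment
assert 0.0 <= score <= 1.0
assert np.isclose(M.mean(), 0.005, atol=0.001)  # threshold keeps ~0.5% of responses
```

In the full framework, this IoU score is computed for every (unit, concept) pair, and a unit is assigned the concept with the highest score above the threshold α.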
We adapt 'NetDissect-Lite' (provided by [20]) to dissect two publicly available ImageNet-pretrained VGG16 [22] models. These two models are from TorchVision [23] (VGG16-TV) and from NetDissect's model repository (VGG16-ND). We reproduce the NetDissect results at two layers, namely Conv4 3 and Conv5 3. We refer to our extended code repository for the technical details of the reproduction. Fig. H.1 shows the distribution of concepts assigned to filters in the two layers of VGG16-TV and VGG16-ND. We see that the concept distributions of the two models generally agree. In particular, we observe that lower layers tend to capture low-level concepts (e.g. 'part' and 'texture'), while higher layers capture high-level semantics (e.g. 'object' and 'scene'); these observations are similar to what was discussed in [20]. For VGG16-ND specifically, our result is also similar to what [20] reported. Table H.1 shows the number of unique detected concepts in each concept category.
As a remark, for the experiments in Section 5.2 of the main paper, we do not use the IoU > α criterion to construct subspaces from NetDissect (i.e. K_j = K). As a result, each filter is associated with a concept. Furthermore, because the assignment procedure of NetDissect yields unequal numbers of filters per concept, we rank the concepts based on R_k/N_k, where R_k is the relevance of concept k for a data point and N_k is the number of filters corresponding to the concept.

SUPPLEMENTARY NOTE I DETAILS ABOUT INTERPRETABLE BASIS DECOMPOSITION (IBD)
We briefly outline the IBD method which we use in our benchmark evaluations in the main paper. IBD [26] is a framework that decomposes the decision of a neural network into a linear combination of concept vectors. Suppose there exist 1) a set of concepts K = {k} and 2) a concept vector u_k ∈ ℝ^D corresponding to each concept k. Let w_t ∈ ℝ^D be the weight vector (in the last layer of the neural network) corresponding to the class t. IBD decomposes the weight vector as: where K_t ⊂ K is a set of class-compatible concepts, α_t^k ∈ ℝ is a class-concept coefficient, and r_t ∈ ℝ^D is a residue vector. By linearity, the output of the neural network for the class t is decomposed into the contributions of the concepts (via the concept vectors). In practice, IBD constructs the set of concepts K from the Broden dataset [21]. Each concept vector u_k is based on the weight vector of a classifier trained to determine the presence of the concept k. IBD uses a greedy algorithm to determine the set of class-compatible concepts K_t and the class-concept coefficients {α_t^k}_{k∈K_t}.
To integrate IBD in our evaluation, we use the sets of class-compatible concepts K_t and the concept vectors u_k that are provided by the authors of IBD.
We now describe how we construct a virtual layer from the IBD concept vectors. We assume that 1) we have the set of class-compatible concepts K_t and 2) the concept vectors u_k are linearly independent. Let U ∈ ℝ^{D×|K_t|} be the matrix with the concept vectors as columns and U⁺ = (U^⊤U)^{-1}U^⊤ be its left pseudo-inverse. Denote by a, c ∈ ℝ^D an activation vector and its context vector. To address the non-orthogonality of the concept vectors, we adapt our formulation of the virtual layer accordingly by expressing: where a_⊥ is a residue vector. The concept relevance is: where the row vector u_k⁺ ∈ ℝ^{1×D} is the corresponding row of the left pseudo-inverse matrix U⁺. We note that this adaptation of the virtual layer has no guarantee of 'positivity' (cf. Proposition 1).

training data increases. Overall, DRSA and DFR both produce significant accuracy gains compared to the original model. We note that, unlike DFR, our approach requires neither creating a modified training set with artificially added artifacts, nor setting a poisoning-rate hyperparameter, thereby making our approach easier to deploy.
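The virtual-layer construction for IBD described in this note can be sketched numerically; under the assumption that the relevance of concept k takes the form (u_k⁺ a)(u_k^⊤ c), the concept relevances plus the residue term conserve the total relevance a^⊤c (random vectors stand in for real activations and concept vectors):

```python
import numpy as np

rng = np.random.default_rng(8)
D, K = 16, 5
U = rng.normal(size=(D, K))                  # non-orthogonal concept vectors u_k (columns)
a = rng.normal(size=D)                       # activation vector
c = rng.normal(size=D)                       # context vector

# left pseudo-inverse U+ = (U^T U)^{-1} U^T
U_pinv = np.linalg.solve(U.T @ U, U.T)

# decompose a into concept components plus a residue outside span(U)
coeffs = U_pinv @ a                          # one coefficient per concept
a_res = a - U @ coeffs
assert np.allclose(U.T @ a_res, 0, atol=1e-8)   # residue is orthogonal to all u_k

# concept relevances R_k = (u_k^+ a)(u_k^T c), plus a residue term
R = coeffs * (U.T @ c)
R_res = float(a_res @ c)
assert np.isclose(R.sum() + R_res, float(a @ c))  # conservation of total relevance
```

Conservation holds by construction here, but, as noted above, the non-orthogonality of the concept vectors means there is no positivity guarantee on the individual R_k.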

J.2 Showcase 2: Better Insights via Disentangled Explanations
We provide additional results complementing the discussion in Section 6.2 of the main paper.

J.3 Showcase 3: Analyzing Manipulated Explanations using PRCA
We provide additional details for Section 6.3 of the main paper. They include 1) the procedure that perturbs an input and causes changes in its explanation; 2) the experimental setup; and 3) additional qualitative results. To ease visualization, we dim the class-subspace configurations that do not pass the selection criteria (Eq. 13 in the main paper).

J.3.1 Finding Perturbation that Causes Arbitrary Changes in Explanations
We use the optimization procedure proposed by [31] to find a perturbation that leads to changes in the explanation. Consider an input x ∈ ℝ^P and its label t ∈ C. Let f : ℝ^P → ℝ^{|C|} be a neural network and E be an attribution method that produces an explanation E(f_t(x), x) ∈ ℝ^P. Reference [31] formulates an optimization problem that finds a perturbed version x̃ of x such that 1) the two inputs x and x̃ are visually indistinguishable; 2) the model f behaves approximately the same on these inputs, i.e. softmax(f(x)) ≈ softmax(f(x̃)); 3) but the original explanation E(f_t(x), x) and the manipulated explanation E(f_t(x̃), x̃) are substantially different.

In practice, the difference is induced by making E(f_t(x̃), x̃) similar to some target explanation E_target.

More precisely, the objective of this constrained optimization problem is x̃ ← arg min of the objective: where λ ∈ ℝ₊ is a hyperparameter and where we choose E_target to be the explanation of a random input that does not belong to class t. In addition, we rescale the target heatmap E_target by ∑_p [E(f_t(x), x)]_p / ∑_p (E_target)_p.

This rescaling constrains the total relevance scores of the manipulated and original explanations to be approximately the same, i.e. ∑_p [E(f_t(x̃), x̃)]_p ≈ ∑_p [E(f_t(x), x)]_p.
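The rescaling step can be sketched as follows (random heatmaps stand in for real explanations):

```python
import numpy as np

def rescale_target(E_orig, E_target):
    """Rescale the target heatmap so its total relevance matches the original."""
    return E_target * (E_orig.sum() / E_target.sum())

rng = np.random.default_rng(9)
E_orig = rng.random(size=(224, 224))          # original explanation E(f_t(x), x)
E_target = rng.random(size=(224, 224)) * 3.0  # explanation of a random other-class image

E_scaled = rescale_target(E_orig, E_target)
assert np.isclose(E_scaled.sum(), E_orig.sum())  # total relevance is preserved
```

Keeping the total relevance fixed ensures the manipulation changes only the spatial distribution of relevance, not its overall magnitude.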

J.3.2 Experimental Setup
We focus on manipulating LRP-γ heatmaps derived from VGG16-TV. More specifically, we perform the manipulation on the 50 validation images of the class 'tibetan terrier' from the ImageNet dataset [18]. We choose the target heatmaps to be the heatmaps of 50 random images from other classes in the dataset; these heatmaps are produced w.r.t. the class of each random image. We extend the code provided by [31] to support 1) LRP-γ with the same heuristics we use for VGG16-TV and 2) the relevance preservation constraint. As in the original work, we use the same gradient-based iterative procedure and parameters to perform the optimization. We refer to our extended code repository for the details. We perform PRCA using the activation and context vectors from 500 training images of the class 'tibetan terrier', and we extract these vectors at Conv4 3. A first set of results shows a number of selected images and their manipulated heatmaps. We see that these manipulated heatmaps are different from the original heatmaps but similar to the target ones. The results confirm that the optimization procedure proposed by [31] also works under the relevance preservation constraint we impose. A second set of results shows the decomposition of the manipulated heatmaps onto PRCA subspaces of different sizes and the corresponding residue (from the orthogonal complement of each subspace). We observe that the PRCA heatmaps generally preserve the main characteristics of the original heatmaps, while the positive part of the heatmap residues tends to contain the structure of the target heatmaps. More importantly, we also observe that incorporating more PRCA components makes the PRCA heatmaps more similar to the target heatmaps. This evidence suggests that the first PRCA component is the least affected by explanation manipulation.

SUPPLEMENTARY NOTE K LRP IMPLEMENTATION FOR NFNETS
LRP is a propagation-based attribution method that comes with a number of propagation rules. One employs these rules to successively propagate a relevance quantity (e.g. the logit value of a target class) from the last layer to the input layer. To use the method, one needs to choose an appropriate LRP propagation rule for each layer of the architecture.
Software packages like iNNvestigate [33], Zennit [34], and Captum [35] provide ready-to-use LRP implementations for common architectures. To the best of our knowledge, these packages have not yet implemented LRP for the recent state-of-the-art Normalizer-Free Network (NFNet) architecture [36]. Therefore, we close this technical gap by contributing a PyTorch [25] implementation of LRP for NFNets.
In the following, we first describe the NFNet architecture (Supplementary Note K.1). Second, we categorize the layer patterns in the NFNet architecture into a number of cases (Supplementary Note K.2); for each case, we define an appropriate LRP rule and outline its implementation. Finally, we discuss a technical step that improves the numerical stability of the implementation (Supplementary Note K.2.7). We refer to Fig. D.2 for example explanations from our LRP-NFNets implementation.
We also write layers in NFNets in gray-boxed text: for example, Conv is a convolution layer, while Act is a ρ-activation layer. We write 'FH-Conv' for the forward hook of Conv. We use (•)† to denote that the quantity (•) is overridden in a forward hook, and [•].detach() to denote the 'detach' functionality in PyTorch [25], which detaches the variable [•] from the underlying computational graph. In the following, we assume that:

Assumption 1. Let ρ : R → R be a sign-preserving activation function taking z → ρ(z) = a. The input z and the output a have the same relevance, i.e. R(z) = R(a).
These layers compute z_k = Σ_j a_j w_jk (Conv3 or FC). Therefore, the relevance of a_j and its forward hook are similar to what is described in Supplementary Note K.2.1, except that we substitute a_k in Eq. (K.1) with z_k.

K.2.3 Case : Orphan Activation Layers
These orphan activation layers are Act1 of every NormFreeBlock. The computation of these layers is a_k = ρ(z_k) (Act1). With Assumption 1, we have R(z_k) = R(a_k). Therefore, we override the output of these orphan activation layers in the forward hook FH-Act1 so that the relevance passes through unchanged.

K.2.4 Case : Attention Layer (also known as Squeeze-and-Excitation Block)

The attention layer AttnLast is the second-to-last layer of NormFreeBlock. The layer implements the Squeeze-and-Excitation layer [42], which performs dimension-wise modulation of the output. More specifically, the layer has a function A : R^d' → [0, 1]^d' and output a_k = ξ_k z_k, where ξ_k = A(z)_k. This dimension-wise multiplicative structure is similar to the gating mechanism in LSTMs [45] and GRUs [46]. For this structure, [47], [48] argue that the modulation of the function A already influences the relevance R(z_k) in the forward pass and propose to compute the relevance R(z_k) directly from a_k, without considering the dependency of a_k on z_k through the variable ξ_k. We therefore have R(z_k) = R(a_k). We can implement the rule by overriding the layer output in a forward hook as a_k† = ξ_k.detach() · z_k.
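In PyTorch, the rule amounts to detaching the gate ξ from the computational graph so that, under gradient × input-style relevance propagation, R(a_k) flows entirely to z_k. A small self-contained sketch (the sigmoid gate here is an arbitrary stand-in for A(z), not the actual NFNet attention function):

```python
import torch

z = torch.tensor([1.0, -2.0, 3.0], requires_grad=True)
xi = torch.sigmoid(0.5 * z)       # stand-in gate xi = A(z), which depends on z

a = xi.detach() * z               # overridden output: the gate is detached

a.backward(torch.ones_like(a))    # propagate unit relevance per output dimension
R_z = z.detach() * z.grad         # gradient x input at the layer input
R_a = a.detach()                  # relevance assigned to the output (a_k * 1)

assert torch.allclose(R_z, R_a)   # R(z_k) = R(a_k): the gate modulates, but absorbs nothing
```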

Case : Shortcut Connections
The shortcut connection Shortcut is the last step of NormFreeBlock. The step combines inputs from two computational paths: the residual path and the shortcut path. Let (z_k)_{k=1}^{d'} and (s_k)_{k=1}^{d} be the inputs of these two paths respectively. The input (z_k)_k comes from AttnLast, while the input (s_k)_k is based on the input of NormFreeBlock. The computation of the step is a_k = z_k + s_k. We can interpret this step as a linear layer with w_z = w_s = 1, and the generalized LRP-γ rule reduces to R(z_k) = (z_k + γ z_k^+) / ((z_k + s_k) + γ (z_k^+ + s_k^+)) R(a_k), and analogously for R(s_k). We can implement the rule above with a forward hook. In practice, however, the shortcut connection is implemented as an in-place operation, to which one cannot attach any forward hook. To overcome this issue, one needs to replace the in-place operator with a PyTorch module that takes the two inputs (z_k) and (s_k) and performs the addition.
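The replacement can be a minimal two-input module; the names below are illustrative, not those of our released code:

```python
import torch
import torch.nn as nn

class ShortcutAdd(nn.Module):
    """Replaces the in-place `out += shortcut` so that forward hooks can attach."""
    def forward(self, z, s):
        return z + s

add = ShortcutAdd()
calls = []
add.register_forward_hook(lambda mod, inputs, output: calls.append(len(inputs)))

z = torch.randn(4, requires_grad=True)
s = torch.randn(4, requires_grad=True)
a = add(z, s)

assert calls == [2]                  # the hook fired and saw both inputs
assert torch.allclose(a, z + s)
```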

Fig. 2. Overview of our proposed approach for generating disentangled explanations. Left: The neural network to explain is augmented with a virtual layer performing an orthogonal transformation onto subspaces representing distinct concepts and back. The transformation matrices (U_k)_k are optimized to extract subspaces maximizing some statistics of the associated relevance scores (R_k)_k (Eqs. (8) or (10) with R_k defined in (6)). Right: Once the virtual layer has been built, it is used to support the generation (via Eqs. (1) and (2)) of more informative pixel-concept explanations.

Fig. 3. Left: Comparison between the one-dimensional subspace extracted by (uncentered) PCA and PRCA in a synthetic two-dimensional activation space. Right: Average relevance as a function of the angle of the subspace. The vertical lines correspond to the angles of the vectors in the left panel. By design, PRCA maximizes average relevance.

Fig. 4. Left: One-dimensional subspaces extracted by DRSA and DSA (a reduction of DRSA where a is used in place of c) in a synthetic two-dimensional activation space. Unlike DSA, our proposed approach DRSA takes the model response (shown as a contour plot) into account, thereby being able to focus on the relevant component. Right: Normalized objective of DSA and DRSA at different angles.

Fig. 7. Top: Training images from class 'carton' with their standard and DRSA heatmaps from VGG16-TV at Conv4 3 using the LRP backend. The heatmaps are generated w.r.t. the logit of class 'carton'. Red and blue colors indicate positive and negative pixel-concept contributions to the logit of the class. Bottom: Effect of model refinement (subtraction of each concept's contribution from the logit) on the accuracy when the validation data is poisoned with watermarks at a rate of 25%.

Fig. 8. Confusion matrices from the original and refined models on clean and 25%-poisoned data. The refined model excludes the evidence of the subspace S4 from the prediction using Eq. (12).

Fig. 9. Statistics of DRSA relevance scores R_k from different butterfly classes. The dashed lines indicate the 0.85-quantile of relevance scores for each subspace. The left and right borders of the rectangles are the 0.25- and 0.75-quantiles of class-conditioned relevance scores. The selection criterion of Eq. (13) can be visually interpreted as the right side of a box surpassing the dashed line; we highlight boxes in red if the corresponding quantile satisfies the selection criterion.

Fig. 12. Mean squared error between the original or target heatmaps and different sets of heatmaps, averaged over 50 validation images of class 'tibetan terrier'. These sets are the manipulated heatmaps and their decompositions onto the first PRCA component (PRCA-1). Two heatmaps are similar if the error between them is small. Vertical lines represent one standard error.

= E[Tr(U^⊤ a a^⊤ U)] (C.13)
= Tr(U^⊤ Σ U), (C.14)

where (C.13) uses the cyclic permutation property of the trace operator Tr(·), and where we define Σ = E[a a^⊤] in (C.14). The last line is the canonical formulation for finding the d leading principal components, which minimize the ℓ2 reconstruction error E[∥a − U(U^⊤ a)∥²]. Therefore, it shows that PRCA becomes equivalent to uncentered PCA in this special case.
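This equivalence is easy to verify numerically: the top-d eigenvectors of Σ maximize Tr(U^⊤ Σ U) and, among orthonormal d-frames, minimize the ℓ2 reconstruction error. A quick check on random data (variable names are our own, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 6))        # rows are activation vectors a
Sigma = A.T @ A / len(A)                 # uncentered second-moment matrix E[a a^T]
d = 2

eigval, eigvec = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
U = eigvec[:, -d:]                       # top-d eigenvectors: maximize Tr(U^T Sigma U)

def recon_err(V):
    return np.mean(np.sum((A - (A @ V) @ V.T) ** 2, axis=1))

V, _ = np.linalg.qr(rng.standard_normal((6, d)))   # some other orthonormal d-frame
assert recon_err(U) <= recon_err(V) + 1e-12
assert np.isclose(np.trace(U.T @ Sigma @ U), eigval[-d:].sum())
```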

Fig. D.1. Patch-flipping curves from different attribution methods and models. We average the scores from 5000 random validation images from the ImageNet dataset [18] for all methods, except Shapley Value Sampling, where we use only 10% of these images for computational reasons. We perform patch-flipping over patches of size 16 × 16.
Fig. E.1 shows the training curves of the optimization for the three models: VGG16-TV, VGG16-ND, and NFNet-F0.

Fig. E.1. Training curves of the DSA and DRSA optimization across different models and attribution methods. In each plot, each curve corresponds to one of the 50 ImageNet classes used in the main experiments (see Section 5 of the main paper).

• Ablation 3: we use the first d columns of a random orthogonal matrix, i.e. no training is involved.

Fig. F.2. Area Under the Patch-flipping Curve (AUPC), separability, and peakness scores of subspaces U extracted by DSA and DRSA at (a) different layers with 4 subspaces or (b) Conv4 3 with different numbers of subspaces.

F.2.1 Comparison between DSA and DRSA at Different Layers

We investigate whether the difference in AUPC, separability, and peakness scores between DSA and DRSA remains consistent when choosing different layers. We consider three different layers of the VGG16 architecture, namely Conv3 3, Conv4 3, and Conv5 3. We fix the number of subspaces to K = 4.

Results: Fig. F.2a shows the AUPC, separability, and peakness scores of DSA and DRSA at the three different layers of the VGG16 models and across 50 ImageNet classes. We observe that DRSA generally has better scores than DSA. The results indicate that the superiority of DRSA is not due to the choice of layers.
Fig. F.2b shows the AUPC, separability, and peakness scores of disentangled explanations from different collections of subspaces. For the majority of configurations, we observe again that DRSA yields better scores than DSA.

Fig. G.1. Per-class ∆ Area Under the Patch-Flipping Curve (∆AUPC) of disentangled explanations produced by DSA and DRSA for VGG16-TV using LRP; higher is better.
Results: Fig. G.1 shows the ∆AUPC scores across different classes for VGG16-TV. We find that the top three classes are 'zebra', 'drilling platform', and 'steam locomotive'. Fig. G.2 shows the DSA and DRSA explanation components of three validation images from class 'zebra'. From the figure, we observe that DRSA decomposes the prediction strategy for class 'zebra' into four sub-strategies, namely the detection of the zebra body, the legs, the top part of the body, and other features.
The comparison indicates no difference between the original Caffe model and our PyTorch-converted version. Complementary to Fig. H.1, Table H.1 reports the number of unique concepts that NetDissect detects.

Fig. H.1. Percentage of filters from two layers of VGG16-TV and -ND that NetDissect associates with concepts from the Broden dataset. Areas with light and dark shading indicate the percentages without and with the IoU criterion, respectively.

Fig. J.1. (i) Three training images from class 'carton' and their corresponding Hanzi watermarks. (ii) ImageNet classes that VGG16-TV often confuses with 'carton'. The horizontal axis is the number of validation images for which the VGG16-TV prediction includes 'carton' in the top-3. (iii) Confusion matrices from the original and refined VGG16-TV models on clean and 50%-poisoned data. The refined model weakens the prediction of class 'carton' using the excess relevance from the subspace S4 (Eq. 12 in the main paper).

Fig. J.2. Top: Clusters of carton training images from Spectral Relevance Analysis [28] (SpRAy). The second row shows the heatmaps of these training images, while the third row shows the heatmaps after post-processing (cf. Section J.1.4). Bottom: Receiver Operating Characteristic (ROC) curves of DRSA and SpRAy in detecting the Hanzi watermark in a set of carton images. Asterisks indicate the setups in which we use the negative of the distance to the corresponding cluster center.

Fig. J.3. Accuracies of VGG16-TV models that are refined using Deep Feature Reweighting (DFR) or our DRSA approach to mitigate the influence of the Hanzi watermark on the prediction of non-carton images.
Fig. J.4 is a detailed version of Fig. 9 in the main paper. It illustrates the complete distributions of relevance scores across different subspaces and six butterfly classes. In the figure, we also highlight a prototypical example of each class, except for class 'lycaenid', whose example we selected randomly; we note that the examples for classes 'monarch', 'admiral', 'sulphur', and 'ringlet' are part of Fig. 10 in the main paper. Fig. J.5 shows the standard and DRSA heatmaps of these examples.

Fig. J.4. Distribution of relevance scores across six butterfly classes and subspaces extracted using DRSA with activation and context vectors from NFNet-F0 at Stage 1. The red circles are prototypical examples of each class, except the 'lycaenid' image, which we selected randomly. The thick blue vertical lines represent the β-quantile of the six-class distribution for each subspace, while the thin red lines indicate the α-quantile of each class-subspace configuration. To ease visualization, we dim the class-subspace configurations that do not pass the selection criterion (Eq. 13 in the main paper).

Fig. J.5. Prototypical butterfly examples with their standard and DRSA heatmaps from NFNet-F0 using LRP-γ. We extract subspaces using activation and context vectors from NFNet-F0 at Stage 1. We generate the heatmaps w.r.t. the target class of each example.

Fig. J.6 depicts that the total relevance scores of the original and manipulated heatmaps are highly correlated, indicating that the two sets of heatmaps are approximately on the same scale. Fig. J.7 shows selected examples of target, original, and perturbed images together with their heatmaps.

Fig. J.6. Pearson correlation between the total relevance scores of original and manipulated heatmaps.

Fig. J.8. Decomposition of manipulated heatmaps onto PRCA subspaces of different sizes, together with the corresponding residues (from the orthogonal complement of each subspace).

Fig. J.7. First row of each subplot: a random image used to produce a target heatmap, the original image, the perturbation noise, and the perturbed image. Second row of each subplot: the target heatmap, the heatmap of the original image, and the heatmap of the perturbed image.

K.2.1 Case : Pairs of Convolution and Activation Layers

A pair of convolution and activation layers is a common layer pattern in the NFNet architecture. The computation of the two layers is z_k = Σ_{j=1}^d w_jk a_j (Conv) and a_k = ρ(z_k) (Act). With Assumption 1, we have R(z_k) = R(a_k), and the generalized LRP-γ rule yields

R(a_j) = Σ_k [ (a_j w_jk + γ (a_j w_jk)^+) / (Σ_{j'} (a_{j'} w_{j'k} + γ (a_{j'} w_{j'k})^+)) ] R(a_k). (K.1)
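A minimal numpy sketch of this redistribution (an illustration of the generalized LRP-γ computation, not our PyTorch forward-hook code; in practice a small stabilizer is added to the denominators, cf. Supplementary Note K.2.7):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4)               # input activations a_j
W = rng.standard_normal((4, 3))          # weights w_jk (bias-free layer)
gamma = 0.25
R_k = rng.standard_normal(3)             # relevance arriving at the outputs

C = a[:, None] * W                       # contributions a_j * w_jk
Cg = C + gamma * np.clip(C, 0, None)     # a_j w_jk + gamma (a_j w_jk)^+
denom = Cg.sum(axis=0)                   # one denominator per output k

R_j = Cg @ (R_k / denom)                 # redistribute R_k proportionally

assert np.isclose(R_j.sum(), R_k.sum())  # conservation: total relevance is preserved
```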
B.1.1 Application to Gradient × Input

Let f_1 and f_2 be two piecewise linear functions. Using Gradient × Input as an attribution method for each step, we get R_k = h_k · ∂y/∂h_k and R_pk = x_p · ∂/∂x_p (h_k · ∂y/∂h_k). Observing that ∂y/∂h_k is piecewise constant w.r.t. h (and therefore piecewise constant w.r.t. x), we can take this expression out of ∂/∂x_p (·), which gives us Eq. (B.4): R_pk = (∂y/∂h_k) · x_p · ∂h_k/∂x_p, where both derivatives can be computed separately, and we can identify in Eq. (B.4) the multivariate chain rule for derivatives, filtered to only include the term that depends on k. Thus, we can relate the two-step and the one-step Gradient × Input as Σ_k R_pk = R_p. Furthermore, if f_1 and f_2 are first-order positively homogeneous, then we have Σ_p R_p = Σ_{p,k} R_pk = Σ_k R_k = y.

B.1.2 Application to Integrated Gradients

Let f_1 and f_2 be two differentiable functions. Using Integrated Gradients as an attribution method for each step with a linear integration path starting at the origin, we obtain a double-integral expression for R_pk (Eq. (B.3)). This double integral is expensive in practice. For practical purposes, we can locally approximate the complicated function R_k by a 'relevance model' R̃_k, specifically, a linear function of h_k. A possible relevance model is:
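The two-step/one-step relation Σ_k R_pk = R_p derived above for Gradient × Input can be checked numerically on a small bias-free ReLU network (a toy setup of our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))    # f1: x -> h = ReLU(W1 x), piecewise linear
w2 = rng.standard_normal(5)         # f2: h -> y = w2 . h
x = rng.standard_normal(3)

z = W1 @ x
h = np.maximum(z, 0)
y = w2 @ h
mask = (z > 0).astype(float)        # dh_k/dz_k, piecewise constant

# One-step Gradient x Input: R_p = x_p * dy/dx_p
R_p = x * ((w2 * mask) @ W1)

# Two-step: R_pk = (dy/dh_k) * x_p * dh_k/dx_p, since dy/dh_k factors out
R_pk = x[:, None] * (w2 * mask)[None, :] * W1.T

assert np.allclose(R_pk.sum(axis=1), R_p)   # two-step sums to one-step
assert np.isclose(R_p.sum(), y)             # positive homogeneity: relevance sums to y
```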

TABLE F.1. Total relevance score (Eq. (F.1)) for PRCA and three ablations. Results are shown for different combinations of models, datasets, and underlying attribution techniques (columns). We highlight for each column the best subspace method (highest total relevance score) in bold. Results are averaged over 50 classes of the ImageNet dataset or 7 classes of Places365. (†) average over three seeds.

Fig. F.1. Total relevance of PRCA and three ablations when varying the subspace dimensionality (the variable d); higher is better. The analysis is performed on VGG16-TV with LRP (same as column 1 in Table F.1). Each curve is an average over the means of these 50 classes, and shaded regions represent one standard error (over classes). The horizontal solid line represents the total relevance of no subspace projection.

TABLE F.2. Separability and peakness scores of subspaces U extracted by DRSA and two ablations (rows). Results are shown for the same dataset/model/attribution settings (columns) as in Table F.1. We highlight for each column the best method (highest scores) in bold. Results are averaged over 50 classes of the ImageNet dataset or 7 classes of Places365. (†) average over three seeds.

TABLE H.1. Number of unique concepts that NetDissect detects from filters in two layers of VGG16-TV and VGG16-ND. The numbers outside and inside parentheses are without and with the IoU > α criterion, respectively.