A Generalized Explanation Framework for Visualization of Deep Learning Model Predictions

Attribution-based explanations are popular in computer vision but of limited use for the fine-grained classification problems typical of expert domains, where classes differ by subtle details. In these domains, users also seek understanding of “why” a class was chosen and “why not” an alternative class. A new GenerAlized expLanatiOn fRamEwork (GALORE) is proposed to satisfy all these requirements, by unifying attributive explanations with explanations of two other types. The first is a new class of explanations, denoted deliberative, proposed to address the “why” question by exposing the network insecurities about a prediction. The second is the class of counterfactual explanations, which have been shown to address the “why not” question but are now more efficiently computed. GALORE unifies these explanations by defining them as combinations of attribution maps with respect to various classifier predictions and a confidence score. An evaluation protocol that leverages the object recognition (CUB200) and scene classification (ADE20K) datasets, combining part and attribute annotations, is also proposed. Experiments show that confidence scores can improve explanation accuracy, that deliberative explanations provide insight into the network deliberation process, which correlates with that performed by humans, and that counterfactual explanations enhance the performance of human students in machine teaching experiments.


I. INTRODUCTION
While deep learning systems have enabled significant advances in computer vision, their black-box nature creates difficulties for many applications. In general, it is difficult to trust a system that cannot justify its decisions. This has motivated a large literature on explainable AI (XAI) methods, which complement network predictions with human-understandable explanations [1], [2], [3], [4], [5], [6], [7], [8], [9], [10]. In computer vision, the dominant XAI paradigm is that of visual explanations computed by attribution functions, which generate heatmaps localizing the image pixels [8], [11], [12], [13] or regions [14], [15], [16], [17] responsible for network predictions. While attributive explanations provide a coarse justification for the predictions, e.g., localizing the object within a larger background or highlighting one among distinct objects in the field of view, they are not sufficient for applications that require fine-grained classification. This can be seen in Fig. 1, where it is clear that the highlighted pixels belong to the bird but unclear which regions of the bird are responsible for the 'Cardinal' prediction. While the explanation would be satisfactory for a classification problem opposing 'Birds' to 'Dogs,' it is not helpful for one opposing 'Cardinals' to 'Summer Tanagers' or other bird species. In this case, the attributive explanation selects the entire bird and it is hard to know what differentiates one class from the other.
Fine-grained classification problems are prevalent in expert domains, such as medical imaging or biology, where there is a need to distinguish objects that differ in subtle details, and even for everyday applications that involve a large number of classes. For such problems, users are likely to demand more from the explanation system. As Fig. 1 illustrates, given the relatively low confidence value of 0.76, a user may want to know exactly why the system chose the 'Cardinal' label. Beyond the post-hoc analysis of classification results, where the user is passive, explanations also play a critical role in interactive applications, such as machine teaching systems where users are taught to annotate images [18], [19], [20]. In this case, users naturally ask counterfactual questions, such as "why is this a Cardinal and not a Summer Tanager?" where an alternative or counter-class ('Summer Tanager') is provided. None of these questions can be satisfied by existing attribution-based visual explanations.
In this work, we propose a GenerAlized expLanatiOn fRamEwork (GALORE) for the solution of all these problems. Beyond the popular attributive explanations, GALORE includes a new class of explanations, denoted as deliberative, and a new version of counterfactual explanations that are easier to compute than those previously available in the literature. Deliberative explanations, illustrated in the left of Fig. 1, address the "why?" question by visualizing insecurities about model predictions. These are the regions that the model considered most ambiguous, together with the classes that define the ambiguity. In the example of the figure, insecurities refer to body parts of bird classes that are confusable with 'Cardinal,' such as 'Pine Grosbeak,' 'Purple Finch,' or 'Summer Tanager'. Counterfactual explanations, illustrated in the right of the figure, address the "why not?" question by visualizing the input changes needed to elicit the prediction of a user-provided counter class. In the example of the figure, the explanation shows that the two classes differ mostly in terms of the bird head. The unification is based on the definition of all explanations as combinations of multiple attribution maps, which vary according to the explanation type. Since attributions are very efficient to compute, the proposed framework establishes a family of low-complexity explanations that can be used in various applications, ranging from naive to expert domains, and supporting both passive post-hoc analysis of predictions and interactive applications such as machine teaching.
A core requirement of deliberative and counterfactual explanations is the ability to reason in terms of the difficulty posed to the classification by different image regions. Understanding why the classifier chose a class requires knowing what other classes could have been plausibly selected, and what image regions made those alternatives plausible, i.e., what image regions the classifier found ambiguous for the decision. This is the essence of deliberative explanations, which produce a list of such regions, denoted as insecurities, as illustrated in the left of Fig. 1. On the other hand, counterfactual explanations require the identification of regions that discriminate the predicted from the counterfactual class, i.e., which have high probability under the predicted class and low probability under the counterfactual. These regions can then be shown to the user, as illustrated in the right of Fig. 1, to identify corresponding parts in objects from the predicted and counter classes.
Reasoning about ambiguities or class probabilities requires the classifier to produce confidence scores [26], [27], [28], [29], i.e., measure the confidence with which the image belongs to each of the possible classes. From these scores, it is possible to derive how difficult the classification is (the probability of the ground-truth class), how ambiguous it is (similarity between the probabilities of the top classes), or how much the image discriminates between two classes (large probability for one and small for another). We refer to the ability to measure these quantities as self-awareness, since it allows a classifier to quantify the confidence in its decisions. One of the insights of this work is that attributions of confidence scores allow the extension of these measures to image regions, so as to identify which regions are ambiguous, discriminant, or difficult to classify. This is naturally integrated in the GALORE framework, by simply combining the attribution maps for self-awareness with the attributions for class predictions required to compute the different explanations.
Beyond explanations, a significant challenge to XAI is the lack of explanation ground truth for performance evaluation. Besides user-based evaluations [30], whose results are difficult to replicate, we propose a quantitative metric based on a proxy localization task. This relies on standard metrics from the object detection literature and attribute annotations for different object parts or scene components. We show that these metrics can be adapted to the evaluation of the different types of explanations proposed with minor specializations. Compared to human experiments, the proposed proxy evaluation has the advantages of being substantially easier to perform and fully replicable.
Overall, the paper makes three contributions. First, it proposes the unified GALORE framework to generate attributive, deliberative, and counterfactual explanations. Deliberative explanations are a newly proposed family of explanations that visualize the deliberations made by a network to reach its predictions. GALORE also redefines counterfactual explanations as combinations of attributive explanations, significantly increasing their computational efficiency. Second, the paper shows how to leverage self-awareness to improve explanation accuracy, for different types of explanations. Third, it proposes a new experimental protocol for quantitative evaluation of deliberative and counterfactual explanations. Experimental results, using both this protocol and human experiments, show that the proposed deliberative explanations are intuitive, suggesting that the deliberative process of modern networks correlates with human reasoning, and that counterfactual explanations can substantially benefit applications like machine teaching.
Contrastive and Counterfactual Explanations. Counterfactual visual explanations transform an image of class A so as to elicit its classification into the counter class B [38], [56], [57], [58], [59], [60]. The simplest examples are adversarial attacks [23], [56], which optimize perturbations to map an image of class A into class B. However, these perturbations usually push the perturbed image outside the boundaries of the space of natural images. Generative methods have been proposed to address this problem, computing large perturbations that generate realistic images [57], [61], [62], [63]. This is guaranteed by the introduction of regularization constraints, auto-encoders, or GANs [64]. However, because realistic images are difficult to synthesize, these approaches have only been applied to simple datasets in the style of MNIST or CelebA [65] and to domains that do not require expertise [37], [61], [63]. StylEx [59] and C3LT [66] are two recent methods that leverage a GAN to produce the explanations. However, they require training on large-scale data, which other methods do not. A more plausible alternative is to exhaustively search the space of features extracted from a large collection of images, to find replacement features that map the image from class A to B [30]. While this has been shown to perform well on fine-grained datasets, exhaustive search is too complex for interactive applications.
XAI Evaluation. Explanations are frequently evaluated through human-in-the-loop experiments that measure their consistency with human intuition [16], [23], [53], [67] or evaluate if explanations improve user performance on some task [30]. It is also possible to assemble a dataset to generate human-driven ground-truth explanations [68]. An alternative approach is automated evaluation, using a proxy task without human participation. A typical example is to erase or add features and observe how the model predictions change [69], [70], [71], [72]. Another is localization, where regions of features deemed important by the explanation are compared to regions deemed intuitive for classification by humans [16], [73]. Another component of the evaluation of explanations is to test their robustness via sanity checks [74], [75], [76], [77]. In this work, we introduce a quantitative protocol for the evaluation of both deliberative and counterfactual visual explanations, which includes sanity checks.
Self-Awareness. Self-aware systems have some ability to measure their limitations or predict failures. This includes out-of-distribution detection [78], [79], [80], [81] or open set recognition [82], [83], [84], [85], where classifiers are trained to reject nonsensical images, adversarial attacks, or images from classes on which they were not trained. All these problems require the classifier to produce a confidence score for image rejection. The most popular solution is to guarantee that the posterior class distribution is uniform, or has high entropy, outside the space covered by training images [86], [87]. This, however, is not sufficient for deliberative explanations, which have to precisely characterize the ambiguity of image regions, or counterfactual explanations, which require precise confidence scores for classes A and B. These explanations are more closely related to realistic classification [88], where a classifier must identify and reject examples that it deems too difficult to classify.

III. A UNIFIED VIEW OF EXPLAINABLE AI
In this section, we discuss the different types of explanations implemented by the proposed GALORE framework. The detailed computations required to produce the explanations are discussed in Section IV.

A. Attributive Explanations
Attributive explanations identify pixels responsible for a classifier prediction. This is intuitive but prone to generating explanations that are too generic. For example, when asked "why is an object a truck?" an attributive system would answer "because it has wheels, a hood, seats, a steering wheel, a flatbed, head and tail lights, and rearview mirrors," i.e., generate a list of all the truck parts. After all, all parts are responsible for the 'truck' label. The problem is that, while correct, the explanation does not reveal what distinguishes the truck from, for example, a car. The explanation for 'car' would share all components other than the flatbed.
Similarly, visual attributive explanations tend to highlight all pixels of objects in the predicted class. This is sensible for coarse-grained classification, e.g., 'birds' versus 'cats,' but not for fine-grained classification, e.g., the CUB birds dataset [89] from which the images of Figs. 1, 2 and 3 were taken. On this dataset, where most images contain a single bird, methods like Grad-CAM [16] (used in these examples) produce heatmaps that 1) cover most of the bird, and 2) vary little across the classes of largest posterior probability, leading to very uninformative explanations. In this work, we seek better explanations for the fine-grained setting.

B. Deliberative Explanations
In this setting, visual concepts differ in subtle ways. There are frequently two or more classes of very similar appearance, and the classification can be quite ambiguous. This is illustrated in both Figs. 1 and 2, which present several similar birds, difficult to differentiate for a layperson. Due to this ambiguity, even an expert could reasonably oscillate between different interpretations while deliberating about the class to predict. An extreme example of this process is the visual illusion depicted in the left of Fig. 2, where different image regions provide support for conflicting image interpretations. In this example, the image could depict a 'country scene' or a 'face.' Most humans would consider the two interpretations while deliberating on a final prediction. When asked to explain the latter, they would say something like: "I see a cottage in region A, but region B could be a tree trunk or a nose, and region C looks like a mustache, but could also be a shirt. Since there are sheep in the background, I am going with country scene." More generally, different regions can provide evidence for two or more distinct predictions and there may be a need to deliberate between multiple classes.
Having access to this deliberative process is important to trust an AI system. For example, in medical diagnosis, a single prediction can appear unintuitive to a doctor, even if accompanied by a heatmap. The doctor's natural reaction would be to ask "why did you reach that conclusion?" Ideally, instead of simply outputting a predicted label and a heat map, the AI system should visualize its deliberations, producing a list of image regions that support other plausible predictions. For example, when categorizing medical images with respect to interstitial lung diseases [90], [91], the AI system should explain a prediction of 'emphysema' by highlighting the regions of greatest uncertainty between this and alternative predictions, such as 'normal' or 'fibrosis'. We denote these regions as insecurities, since they cast doubt on the validity of the predicted label. To accomplish this, we propose a new type of explanations based on heatmaps of network insecurities. These are denoted as deliberative explanations, since they visualize the network deliberations.
As illustrated in the right of Fig. 2, the deliberative explanation provides a list of insecurities (center inset), each consisting of 1) an image region and 2) an ambiguity, formed by the pair of classes that led the network to be uncertain about the region. Example images from the ambiguous classes can also be displayed, as shown in the right inset. For example, the first insecurity of Fig. 2 reflects the fact that the head of the Pelagic Cormorant is similar to those of the Brandt Cormorant and the Common Raven. Hence, this region raises uncertainty about the 'Pelagic Cormorant' label predicted by the classifier. The detailed implementation of deliberative explanations is discussed in Section IV-D.

C. Counterfactual Explanations
Returning to the 'truck' example, domain experts will likely not be satisfied by a simple listing of all truck parts. Instead, they are likely to request more precise explanations, for instance asking the question "Why is it a truck and not a car?" The answer "because it has a flatbed. If it did not have a flatbed it would be a car," is known as a counterfactual explanation [23], [30], [38], [92]. Counterfactual explanations, by supporting a specific query with respect to a counterfactual class (B), allow expert users to zero in on a specific ambiguity between two classes, which they already know to be plausible predictions. Unlike attributions, these explanations scale naturally with user expertise. As the latter increases, the class and counterfactual class simply become more fine-grained. In computer vision, counterfactual explanations are usually implemented as "correct class is A. Class B would require changing the image as follows," where "as follows" is some visual transformation. Possible transformations include image perturbations akin to those used in adversarial attacks [23], image synthesis [37], [60], or replacing image regions by regions of some images in the counter class B, found by the exhaustive search of a large feature pool [30]. However, image perturbations and synthesis frequently leave the space of natural images, only working on simple non-expert domains, and feature search is too complex for interactive applications.
In this work, we propose the computation of counterfactual explanations by a simple and robust procedure, based on attributions. We start by introducing discriminant explanations that, as shown in Fig. 3, connect attributive to counterfactual explanations. Like attributive explanations, they consist of a single heatmap. This, however, is an attribution map for the discrimination between classes A and B, assigning high scores to image regions that are informative of A but not of B and for which classification confidence is high, indicating that the discrimination between the two classes is clear and easy to identify. The detailed generation of discriminant explanations is discussed in Section IV-E. The final counterfactual explanation is then composed of two discriminant explanations, with the roles of A and B reversed. It identifies the image regions informative of A but not B and the regions informative of B but not A.
As illustrated in Figs. 1 and 3, the presentation of these regions side by side allows the user to visualize how the image of A would need to be changed in order to be classified as B (and vice-versa). This shows that counterfactual explanations can be seen as a generalization of attributive explanations, computed by a combination of attribution and confidence prediction methods that is much more efficient to compute than previous methods. In fact, our experiments show that their computation is 50× to 1000× faster for popular networks. This is quite important for applications such as machine teaching, where explanation algorithms should operate in real-time, ideally in low-complexity platforms such as mobile devices.

IV. IMPLEMENTATION OF GALORE
In this section, we discuss a unified framework for implementation of the explanations discussed above.

A. Explanation Framework
Consider an object recognition system H : X → Y, mapping images x ∈ X into classes y ∈ Y = {1, . . . , C}, using a classifier usually implemented by a convolutional neural network (CNN). The classifier is denoted self-aware if it produces a confidence score s(x) ∈ [0, 1], encoding the strength of its belief that the image x belongs to the predicted class y^*. The confidence score can be generated by the classifier itself, in which case it is denoted as self-referential, or by a complementary network, in which case it is non-self-referential. Both the classifier and the confidence score generator are learned from a training set D of labeled images. In this work, we propose a GenerAlized expLanatiOn fRamEwork (GALORE) to unify various visualization-based explanations, accounting for both confidence scores and a set C of class labels of interest beyond the prediction y^*. All GALORE explanations are implemented with a heat map

M_{i,j}(x, y^*, C) = m_α(a_{i,j}(h_{y^*}(x))) · ∏_{y^c ∈ C} m_β(a_{i,j}(h_{y^c}(x))) · m_γ(a_{i,j}(s(x))),    (2)

where · denotes multiplication and a_{i,j}(·) is an attribution function, which measures how the spatial feature of x at location (i, j) contributes to a prediction. m_α, m_β, and m_γ are three functions that depend on the explanation type. The detailed implementation of these functions for each type of explanation is discussed in the following sections and summarized in Table I.
The definition of (2) as a multiplication of attribution maps strengthens the heat map M at the locations where all the attributions are large and attenuates it when at least one of them is low. This can be seen as a measure of agreement of the different attributions that drastically penalizes disagreements. In this way, only locations that receive significant attribution from the different components are identified as salient, resulting in sharp heat maps that are informative of object details, as illustrated in Figs. 2 and 3. The process can also be seen as equating attribution maps to probability density functions of independent random variables and M to the resulting joint distribution. While this is not exact, since the attributions of h_{y^*}(x), h_{y^c}(x), and s(x) are not independent, it provides a computationally efficient approximation. Explanations are provided in the form of collections of image segments [54], [93], [94] obtained by thresholding the heat map. We next discuss how (2) is used to implement different visualization strategies.
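The multiplicative combination of (2) can be sketched in a few lines of numpy. This is a minimal sketch, assuming the attribution maps have already been computed as 2-D arrays; the function and argument names are illustrative, not part of any released GALORE code.

```python
import numpy as np

def galore_heatmap(a_pred, a_classes, a_conf,
                   m_alpha=lambda a: a,
                   m_beta=lambda a: np.ones_like(a),
                   m_gamma=lambda a: np.ones_like(a)):
    """Combine attribution maps multiplicatively, in the spirit of (2).

    a_pred    -- attribution map of the predicted class, a(h_{y*}(x))
    a_classes -- list of attribution maps a(h_{y^c}(x)), one per class in C
    a_conf    -- attribution map of the confidence score, a(s(x))
    The defaults (identity, constant one, constant one) reproduce a
    plain attributive explanation.
    """
    M = m_alpha(a_pred) * m_gamma(a_conf)
    for a_c in a_classes:
        M = M * m_beta(a_c)
    return M

def segment_mask(M, T):
    """Threshold the heat map into a binary segmentation mask."""
    return M > T
```

With `m_gamma` set to the identity, the same function yields the self-aware attributive map a(h_{y^*}(x)) · a(s(x)) discussed below; other choices of the three modulation functions yield the remaining explanation types.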

B. Attributive Explanations
Attributive explanations visualize how strongly the prediction y^* is attributed to different regions of image x [8], [11], [12], [13], [51]. They are obtained from (2) by setting m_α(x) = x, m_β(x) = m_γ(x) = 1, leading to heat map (for brevity, we omit the location subscript in the rest of the paper)

M(x, y^*) = a(h_{y^*}(x)).    (3)

The attribution function a(·) is usually applied to a tensor of activations F ∈ R^{W×H×D}, of spatial dimensions W × H and D channels, extracted at some layer of a deep network with x at the input. While many attribution functions have been proposed, they are usually some variant of the gradient of h_{y^*}(x) with respect to F. This results in an attribution map A, where the amplitude of A_{i,j} encodes the attribution of the prediction to each entry (i, j) along the spatial dimensions of F. Two attributive heatmaps of an image of a "Cardinal," with respect to predictions "Cardinal" and "Summer Tanager," are shown in the top row of Fig. 3.
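For intuition, a gradient-based attribution of the kind described above can be sketched with numerical differentiation on a toy classification head (global average pooling followed by a linear layer and softmax). This is only an illustration: the pooling head, the weight shapes, and the gradient×activation form are assumptions for the sketch, not the paper's exact attribution function.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attribution_map(F, W, y, eps=1e-5):
    """Gradient x activation attribution of prediction h_y to the
    spatial locations of activation tensor F (shape W x H x D), for a
    toy head: global average pool -> linear weights W (D x C) -> softmax."""
    def h_y(Ft):
        return softmax(Ft.mean(axis=(0, 1)) @ W)[y]

    base = h_y(F)
    A = np.zeros(F.shape[:2])
    for i in range(F.shape[0]):
        for j in range(F.shape[1]):
            for d in range(F.shape[2]):
                Fp = F.copy()
                Fp[i, j, d] += eps
                grad = (h_y(Fp) - base) / eps   # numerical dh_y / dF_ijd
                A[i, j] += grad * F[i, j, d]
    return np.maximum(A, 0)  # keep positive evidence only, as in Grad-CAM
```

In practice the gradient would be obtained with automatic differentiation; the finite-difference loop above only keeps the sketch dependency-free.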

C. Self-Aware Attributive Explanations
Attributive explanations can be extended to account for confidence scores by setting m_γ(x) = x. In this case, the attributive explanation becomes

M(x, y^*) = a(h_{y^*}(x)) · a(s(x)).    (4)

GALORE is compatible with any classification confidence score s(x).
A few examples that we compare in our experiments are discussed in Section V-B. Large heatmap entries indicate regions that not only contribute to the prediction but also make the classifier confident about it. When compared to standard attributive explanations, the self-aware version emphasizes more class-specific regions. In experiments, we will see that these regions usually cover the attributes discriminant for the predicted classes, providing a sharper and more convincing explanation for the classifier prediction.

D. Deliberative Explanations
A deliberative explanation consists of a set of Q insecurities {(r_q, a_q, b_q)}_{q=1}^{Q} that provide insight on the reasoning performed by the classifier to reach prediction y^*. Each insecurity is a triplet (r, a, b), where r is the segmentation mask of a region responsible for classifier uncertainty, and (a, b) an ambiguity composed by a pair of class labels. Altogether, the insecurity shows that the network is insecure as to whether the image region defined by r should be attributed to class a or b. Note that neither a nor b has to be the prediction y^*, although this could happen for one of them. In Fig. 2, y^* is the label "Pelagic Cormorant," and appears in insecurities 2, 5, and 6, but not in the remaining ones. This reflects the fact that certain parts of the bird could actually be shared by many classes.
Insecurities are generated by first identifying the set C = {y_1, . . . , y_E} of the E classes y of largest posterior probability h_y(x). A candidate class ambiguity set A is then created with all class pairs in C. For each ambiguity (a, b) ∈ A, an ambiguity map is computed using (2),

I_{i,j}(x, a, b) = a_{i,j}(h_a(x)) · a_{i,j}(h_b(x)) · a_{i,j}(1 − s(x)).    (5)

Using as self-awareness score the complement 1 − s(x) of the belief in the prediction assigns larger scores to the regions where the prediction is most ambiguous, reflecting the difficulty of the classifier decision. I_{i,j} is large only when location (i, j) is deemed difficult to classify (large difficulty attribution a_{i,j}(1 − s(x))) and this difficulty is due to large attributions to both classes a and b. The ambiguity map is thresholded to obtain the segmentation mask

r{a, b}(x) = {(i, j) | 1_{[T,∞)}(I_{i,j}(x, a, b)) = 1},    (6)

where 1_S is the indicator function of set S and T a threshold. The ambiguity (a, b) and the mask r{a, b}(x) form an insecurity.
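Assuming per-class attribution maps and the difficulty attribution a(1 − s(x)) are available as arrays, the generation of insecurities can be sketched as follows (function and variable names are illustrative):

```python
import numpy as np
from itertools import combinations

def deliberative_explanation(attr, attr_difficulty, posteriors, E=3, T=0.5):
    """Build insecurities {(r, a, b)} following (5)-(6).

    attr            -- dict: class label -> attribution map a(h_y(x))
    attr_difficulty -- attribution map of the difficulty score, a(1 - s(x))
    posteriors      -- dict: class label -> posterior probability h_y(x)
    """
    # C: the E classes of largest posterior probability
    C = sorted(posteriors, key=posteriors.get, reverse=True)[:E]
    insecurities = []
    for a, b in combinations(C, 2):              # candidate ambiguity set A
        I = attr[a] * attr[b] * attr_difficulty  # ambiguity map, as in (5)
        r = I > T                                # segmentation mask, as in (6)
        insecurities.append((r, a, b))
    return insecurities
```

The resulting list can then be sorted by the explanation strength of Section IV-G to surface the most important insecurities first.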

E. Counterfactual Explanations
Counterfactual explanations assume an image x, a prediction y^*, and a user-provided counterfactual class y^c ≠ y^*. A popular approach is to highlight the differences between x and an image x^c from class y^c by displaying matched bounding boxes on the two images. [30] showed that explanation performance is nearly independent of the choice of x^c, i.e., it suffices to use a random image x^c from class y^c. We adopt a similar strategy in this work, implementing counterfactual explanations as

M(x, x^c) = D(x, y^*, y^c) ⊕ D(x^c, y^c, y^*),    (7)

where D(x, y^*, y^c) and D(x^c, y^c, y^*) are discriminant heatmaps for images x and x^c, respectively, and ⊕ denotes their side-by-side presentation. The first map identifies the regions of x that are informative of the predicted class but not the counter class, while the second identifies the regions of x^c informative of the counter class but not of the predicted class. Altogether, the explanation shows that the regions highlighted in the two images are matched: the region of the first image depicts features that only appear in the predicted class while that of the second depicts features that only appear in the counterfactual class. The discriminant map of x is thresholded to obtain the segmentation mask

r{y^*, y^c}(x) = {(i, j) | 1_{[T,∞)}(D_{i,j}(x, y^*, y^c)) = 1}.    (8)

Similarly, a segmentation mask is generated for x^c using

r{y^c, y^*}(x^c) = {(i, j) | 1_{[T,∞)}(D_{i,j}(x^c, y^c, y^*)) = 1}.    (9)

Fig. 3 illustrates the construction of a counterfactual explanation with two discriminant explanations.
To compute the heatmaps of (7), [30] proposed to exhaustively compare all combinations of features in x and x^c, which is expensive. We propose a much simpler and more effective procedure that leverages a new class of attributive explanations, denoted as discriminant, defined as in (2) with

m_α(x) = m_γ(x) = x,  C = {y^c},  m_β(a(·)) = 1 − a(·),    (10)

i.e., m_β takes the complement of the attribution, leading to heatmap

D(x, y^*, y^c) = a(h_{y^*}(x)) · [1 − a(h_{y^c}(x))] · a(s(x)).    (11)

This is large only at locations (i, j) that contribute strongly to the prediction of class y^* but little to that of class y^c, and where the discrimination between the two classes is easy, i.e., the classifier is confident. This, in turn, implies that location (i, j) is strongly specific to class y^* but not to class y^c, which is the essence of the counterfactual explanation. Discriminant explanations have commonalities with both attributive and counterfactual explanations. Like counterfactual explanations, they consider both the prediction y^* and the counterfactual class y^c. Like attributive explanations, they compute a single attribution map D. The difference is that this map attributes the discrimination between the predicted class y^* and counter class y^c to regions of x, identifying pixels strongly informative of class y^* but uninformative of class y^c. Fig. 3 shows how these explanations benefit from the fact that the self-awareness attribution map is usually much sharper than the other two maps. This is critical to identify the object details that differentiate the two classes.
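Given attribution maps normalized to [0, 1], the discriminant map of (11) is a three-term elementwise product. A minimal sketch (names are illustrative):

```python
import numpy as np

def discriminant_map(a_pred, a_counter, a_conf):
    """Discriminant heat map in the spirit of (11): large where the
    attribution to the predicted class is high, the attribution to the
    counter class is low, and the classifier is confident. Assumes
    maps normalized to [0, 1] so that (1 - a) is a valid complement."""
    return a_pred * (1.0 - a_counter) * a_conf
```

Thresholding D(x, y^*, y^c) and D(x^c, y^c, y^*) as in (8) and (9) then yields the two segments that are presented side by side in the counterfactual explanation.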

F. Multi-Class Extensions
So far, we considered explanations involving single classes or class pairs. More generally, explanations may require, or benefit from, considering multiple classes. For example, deliberative explanations may involve ambiguities between several classes, such as a region compatible with the "Brandt Cormorant," "Fish Crow" and "Common Raven" classes in Fig. 2. In the extreme, the class posterior distribution h(x) could be approximately uniform for certain image regions. Similarly, for counterfactual explanations, a user could have more than a single counterfactual class in mind. We now consider the multi-class extension of GALORE, for both deliberative and counterfactual explanations. We define the dimension of ambiguity V as the number of classes involved.
For deliberative explanations of dimension V, the candidate class ambiguity set A is first assembled by finding all class V-tuples in the candidate class list C. This is illustrated in Fig. 4, where V = 3, C contains the five classes shown on the left (green), and the set A includes the five ambiguities composed of 3-tuples of these classes, as shown on the right. For each ambiguity (a_1, a_2, . . . , a_V) ∈ A, an ambiguity map is then computed using (2),

I_{i,j}(x, a_1, . . . , a_V) = ∏_{v=1}^{V} a_{i,j}(h_{a_v}(x)) · a_{i,j}(1 − s(x)).    (12)

This leads to large I_{i,j} only when location (i, j) has strong attributions for all classes in the ambiguity and is deemed difficult to classify by the self-awareness predictor. The thresholding of (6) is finally used to create a segmentation mask. Counterfactual explanations of dimension V and counterfactual class set C = {y_1, . . . , y_V} are implemented as

M(x, x^1, . . . , x^V) = ⊕_v [D(x, y^*, C) ⊕ D(x^v, y_v, C_v)],    (13)

where D(x, y^*, C) is the discriminant explanation for counterfactual class set C, C_v = C \ {y_v} ∪ {y^*}, and ⊕_v represents the side-by-side concatenation of explanations, as illustrated in Fig. 5. Similarly to (11), discriminant explanations are heat maps computed using (2) with the m_α(·), m_β(·), and m_γ(·) definitions of (10), i.e.,

D(x, y^*, C) = a(h_{y^*}(x)) · ∏_{y ∈ C} [1 − a(h_y(x))] · a(s(x)).    (14)

As shown in the top row of Fig. 5, attributions are first computed with respect to the prediction h_{y^*}(x), the predictions h_{y_v}(x) of all the other classes y_v ∈ C, and the confidence score s(x), for image x. This is then repeated for the images x^v, replacing x by each x^v, as shown in the remaining rows. The discriminant maps D(x, y^*, C) and D(x^v, y_v, C_v) are then computed with (14), as shown in the green box. These maps emphasize regions that are predictive of class y^* but unpredictive of all other classes in C, highlighting the class-specific features of y^* that are discriminant with regard to C. The explanations are finally thresholded using (8) and (9) to obtain r{y^*, C}(x) and r{y_v, C_v}(x^v) for all v ∈ {1, . . . , V}.
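The multi-class discriminant map of (14) simply extends the complement factor to every class in the counterfactual set. A minimal sketch (names are illustrative; maps are assumed normalized to [0, 1]):

```python
def multiclass_discriminant(a_pred, a_counters, a_conf):
    """Discriminant heat map in the spirit of (14): the product of the
    predicted-class attribution, the complemented attribution of every
    class in the counterfactual set C, and the confidence attribution."""
    D = a_pred * a_conf
    for a_c in a_counters:
        D = D * (1.0 - a_c)
    return D
```

With a single counter class this reduces to the two-class discriminant map of (11).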

G. Explanation Strength
The clarity of explanations that involve several regions and several classes, such as deliberative or counterfactual explanations, can benefit from a quantitative score, which we denote the explanation strength, summarizing the relative importance of the different components. For example, ordering insecurities by degree of ambiguity helps guide user attention to the most important ones. To allow this type of manipulation, we define the strength of insecurity r 2 as the average intensity of the ambiguity map of (5) or (12) within the associated image segment. Similarly, adding strengths to counterfactual explanations informs how much the explanation differentiates the prediction from each counter class, and we define the strength of a discriminant explanation as the average intensity of the discriminant map within the associated segment. Note that we make sure the segment sizes of the two discriminant explanations are equal, i.e., |r{y * , y c }(x)| = |r{y c , y * }(x)|, by tuning the thresholds T of (8) and (9). This follows [30] and works well when the objects have roughly the same size in each image, which is the case for the datasets considered in our experiments. A different strategy may be needed in other cases; we leave the optimal threshold tuning strategy as a topic for future research. Multi-class counterfactual explanations have analogous strengths.

2. Here we omit the ambiguity (a, b) or C for brevity.

V. IMPLEMENTATION

Table I summarizes how GALORE produces the different visualization-based explanations, including the different types of attributive, deliberative, and counterfactual explanations. All
explanations are obtained by combinations of attribution maps and classification confidence scores using (2). In this section, we discuss how these are computed.

A. Attribution Maps
Given a feature tensor F(x) in some deep network layer, attribution map a i,j (h y (x)) quantifies how the activations F i,j (x) at locations (i, j) contribute to prediction y. This could be either a class prediction or the prediction of a confidence score. In this section, we make no distinction between the two, simply denoting p(x) = g p (F(x)), where g is the mapping from activation tensor F into prediction vector g(F) ∈ [0, 1] P . For class predictions P = C, the prediction p is a class y, and g p (F(x)) = h y (x). For confidence predictions P = 1, the prediction is a confidence score, and g p (F(x)) = s(x).
GALORE is compatible with any attribution function in the literature [8], [11], [16], [17], [41], [51]. One of the most popular classes of such functions is that of gradient-based attributions [11], [16], [51], which are derived from ∇g p (F(x)) and F(x), i.e., have the form q([∇g p (F(x))] i,j , F i,j (x)) for some function q. Our implementation uses the vanilla gradient-based function of [11], which computes the product of the partial derivatives of prediction p with respect to the activations F(x) with these activations. Here we omit the dependency on x for simplicity. This is compared to two more complex attribution functions, integrated gradient (InteGrad) [51] and Grad-CAM [16]. InteGrad is based on the Riemann approximation of the integral of the gradient ∇g p along a linear path from a reference F 0 to the observed activation tensor F, where Ω is the number of steps in the approximation, set to 50. The reference F 0 is defined by the user and often chosen to be the image that induces zero activation. Unlike (18), which only uses the partial derivative at activation F i,j (x), InteGrad computes the average gradient along the linear path from F 0 to F. Grad-CAM [16] assigns a unique weight per activation channel k, given by the spatial mean of the gradient with respect to the activations of this channel. In our implementation, the attribution maps of (18), (19), (20) are normalized to [0,1] by min-max normalization, i.e., subtracting the minimum value and dividing by the resulting maximum.
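The three gradient-based attribution functions admit a compact sketch. In this minimal NumPy version, `grad` and `grad_fn` stand for gradients that a real implementation would obtain by backpropagation through the network; here they are supplied by the caller.

```python
import numpy as np

def vanilla_grad(grad, act):
    """Vanilla gradient attribution of [11], as in (18): elementwise
    product of the gradient of the prediction with the activations."""
    return grad * act

def integrated_grad(grad_fn, F, F0, steps=50):
    """Riemann approximation of InteGrad [51], as in (19): average the
    gradient along the linear path from the reference F0 to F, then
    scale by the displacement F - F0."""
    avg = np.zeros_like(F)
    for t in range(1, steps + 1):
        avg = avg + grad_fn(F0 + (t / steps) * (F - F0))
    return (F - F0) * avg / steps

def grad_cam(grads, acts):
    """Grad-CAM [16], as in (20): one weight per channel (the spatial
    mean of that channel's gradient), then a ReLU of the weighted
    channel sum. `grads` and `acts` have shape (K, H, W)."""
    w = grads.mean(axis=(1, 2))               # one weight per channel k
    cam = np.tensordot(w, acts, axes=(0, 0))  # weighted sum over channels
    return np.maximum(cam, 0.0)

def minmax(a):
    """Min-max normalization to [0, 1] applied to all maps."""
    a = a - a.min()
    return a / (a.max() + 1e-12)
```

For a linear prediction head, `integrated_grad` recovers the gradient times the displacement exactly, a useful correctness check.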
GALORE is also compatible with non-gradient-based attribution functions [17], [53], [54], [55]. In the experiments, we present results for score-CAM [17] and SHAP [53], two representatives of these methods. Like Grad-CAM, the attribution map of score-CAM is a weighted sum of channel activation maps, but the weight w k of (20) is not derived from gradients, involving forward computations only. SHAP quantifies the strength of each element of an attribution map by its Shapley value. We omit the details for brevity.

B. Confidence Scores
Beyond attribution maps, GALORE is compatible with many classification confidence scores. We consider three scores of different characteristics. The softmax score [28] is the largest class posterior probability. It is computed by adding a max-pooling layer to the network output. The certainty score is the complement of the normalized entropy of the softmax distribution [29]. Its computation requires an additional layer of log non-linearities and average pooling. These two scores are self-referential. We also consider the non-self-referential easiness score s hp (x) of [88], which is computed by an external predictor S that predicts the difficulty of classifying each example and is trained jointly with the classifier. S is implemented by a network s hp (x) : X → [0, 1] whose output is a sigmoid unit.

Fig. 6 shows a network implementation of (2). Given a query image x of class y * , a user-selected counter class y c ≠ y * , a predictor h y (x), and a confidence predictor s(x) are used to produce the explanation. Note that s(x) can share weights with h y (x) (self-referential) or be separate (non-self-referential). x is forwarded through the network, generating activation tensors F h (x), F s (x) in pre-chosen network layers and predictions h a (x), h b (x), s(x), which depend on the explanation strategy. For deliberative explanations, the predictions are those of classes a, b from the candidate ambiguity set. For counterfactual explanations, they are h y * (x), h y c (x), s(x). The attributions of a, b and s(x) to x, i.e., A(x, a), A(x, b), A(x, s(x)), are then computed with (18), (19), or (20), which reduce to a backpropagation step with respect to the desired layer activations and a few additional operations. These attributions can also be computed by other, non-gradient-based functions. Finally, the attributions are combined with (5) or (11).
Thresholding the resulting heatmap with (6) or (8) produces the deliberative explanation r{a, b}(x) or the discriminant explanation r{y * , y c }(x). For counterfactual explanations, the network is then simply applied to x c to compute r{y c , y * }(x c ). Multi-class deliberative extensions simply require a larger set of classes and replace (5) by (12). For multi-class counterfactual explanations, (11) is replaced by (14) and the process is repeated for each counterfactual image x v .
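The two self-referential scores have simple closed forms and can be sketched as below; the easiness score is omitted since it requires the jointly trained difficulty predictor S.

```python
import numpy as np

def softmax_score(posteriors):
    """Softmax score of [28]: the largest class posterior probability
    (a max-pool over the network's softmax output)."""
    return float(np.max(posteriors))

def certainty_score(posteriors, eps=1e-12):
    """Certainty score of [29]: complement of the normalized entropy
    of the softmax distribution. Close to 1 for a one-hot posterior,
    0 for a uniform one."""
    p = np.clip(posteriors, eps, 1.0)
    entropy = -np.sum(p * np.log(p))
    return float(1.0 - entropy / np.log(len(posteriors)))
```

For instance, a uniform posterior over four classes yields a certainty of 0, while a one-hot posterior yields a certainty of (essentially) 1.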

VI. EVALUATION
Explanations can be difficult to evaluate, since ground truth is usually not available. Two major classes of evaluation strategies have been proposed.

A. User Experiments
One possibility is to perform Turk experiments, e.g., measuring whether humans can predict a class label given a visualization, or identify the most trustworthy of two models that make identical predictions from their explanations [16]. We use a similar strategy for deliberative explanations, by measuring whether, given an insecurity produced by the explanation algorithm, humans can predict the associated ambiguities. For counterfactual explanations, we use instead a machine teaching setting, testing whether the explanation helps humans distinguish different classes. While these strategies directly measure how intuitive the explanations appear to humans, they require subject experiments that are somewhat cumbersome to perform and difficult to replicate.

B. Proxy Tasks
A second evaluation strategy uses a proxy task, such as localization [15], [16] on datasets with object bounding boxes. While this is much easier to implement, there is usually no ground-truth for the regions of importance to the classification of an image. We overcome this problem by leveraging datasets annotated 3 with parts and attributes. Specifically, the k th part of an object of class c is annotated with a semantic descriptor φ k c containing the attributes present in this class. For example, in a bird dataset, the "eye" part can have color attribute values "green," "blue," "brown," etc. The descriptor is a probability distribution over these values, characterizing the variability of the attribute values of the part per class. Explanation ground-truth is derived from these attribute distributions, as described next.
1) Deliberative Explanations: For deliberative explanations, we define insecurities as ambiguous parts, namely object parts common to multiple object classes or scene parts (e.g., objects) shared by scene classes. This reduces evaluation to insecurity localization.
For binary explanations, the similarity between classes a and b according to part k is defined as α k a,b = γ(φ k a , φ k b ), where γ is a dataset-dependent similarity measure. This reflects the strength of the ambiguity between classes a and b, declaring as ambiguous the parts that have similar attribute distributions under the two classes. To generate ground-truth, the values of α k a,b are computed for all parts p k and class pairs (a, b).
The M tuples (p k , a, b), k ∈ {1, . . ., K}, of largest similarity are selected as insecurity ground-truth, where K is the total number of parts. For multi-class explanations, given an ambiguity class set V = {a 1 , . . ., a V }, the similarity of the V classes according to part k is defined as α k V = η(φ k a 1 , . . ., φ k a V ), where η is a dataset-dependent function. The similarities α k V are computed for all p k and V, and the M tuples of largest similarity are selected as insecurity ground-truth.
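The ground-truth assembly above can be sketched as follows. Histogram intersection is used here as an illustrative stand-in for the dataset-dependent similarity γ; the actual choices are discussed in the evaluation setup.

```python
import numpy as np
from itertools import combinations

def gamma(p, q):
    """Histogram intersection, an illustrative choice for the
    dataset-dependent similarity between attribute distributions."""
    return float(np.minimum(p, q).sum())

def insecurity_ground_truth(descriptors, M):
    """Rank all (part, class-pair) triplets by ambiguity strength
    alpha^k_{a,b} = gamma(phi^k_a, phi^k_b) and keep the M strongest
    as insecurity ground-truth. `descriptors[c][k]` is the attribute
    distribution phi^k_c of part k under class c."""
    classes = sorted(descriptors)
    triplets = []
    for a, b in combinations(classes, 2):
        for k in range(len(descriptors[a])):
            alpha = gamma(descriptors[a][k], descriptors[b][k])
            triplets.append((alpha, k, a, b))
    triplets.sort(key=lambda t: -t[0])
    return [(k, a, b) for _, k, a, b in triplets[:M]]
```

The multi-class case is analogous, ranking V-tuples by the dataset-dependent function η instead of γ.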
Given this ground-truth, two metrics are used to evaluate the quality of the explanations, depending on the nature of the part annotations. For datasets where parts are labelled with a single location (usually the geometric center of the part), i.e., p i is a point, the quality of segment r{a, b}(x) is computed with precision (P) and recall (R), where the number of correct detections is the number of ground-truth parts included in the insecurities that compose the explanation. Precision-recall curves are produced by varying the threshold T of (6). For datasets where parts have segmentation masks, the quality of r{a, b}(x) is computed with the intersection over union (IoU) metric IoU = |r ∩ p| / |r ∪ p|.

2) Counterfactual Explanations: For counterfactual explanations, where the goal is to localize a region predictive of class A but unpredictive of class B, ground-truth is assembled by identifying parts with attributes specific to A that do not appear in B. This enables the evaluation of counterfactual explanations as a class-specific part localization problem.

3. Note that part and attribute annotations are only required to evaluate the accuracy of insecurities, not to compute the visualizations. These require no annotation.
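For point-annotated parts, the precision/recall computation reduces to a membership test of part locations in the explanation mask. A sketch follows; the paper's exact counting protocol may differ slightly.

```python
import numpy as np

def point_precision_recall(mask, part_points, gt_parts):
    """Precision/recall for point-annotated parts: a part counts as
    covered when its annotated point falls inside the explanation
    mask. `part_points` maps part id -> (row, col)."""
    covered = {k for k, (i, j) in part_points.items() if mask[i, j]}
    hits = len(covered & set(gt_parts))
    precision = hits / max(len(covered), 1)
    recall = hits / max(len(gt_parts), 1)
    return precision, recall

def mask_iou(r, p):
    """IoU = |r intersect p| / |r union p| for mask-annotated parts."""
    union = np.logical_or(r, p).sum()
    return float(np.logical_and(r, p).sum() / max(union, 1))
```

Sweeping the threshold T of (6), and recomputing `mask` at each value, traces out the precision-recall curve.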
For two-class explanations, where α k a,b measures the similarity between two classes according to part k, a small α k a,b indicates that part k discriminates between the two classes. To generate ground-truth, the N parts of smallest similarity in G are selected as counterfactual ground-truth. For multiple counterfactual classes V = {y 1 , . . ., y V }, ground-truth consists of the set of parts that discriminates class a from those in V. For two-class counterfactual explanations, evaluation is based on the precision-recall and IoU metrics used for deliberative explanations. For multi-class explanations, the definitions are generalized to account for the multiple counterfactual classes. On datasets with point-based ground-truth, evaluation is based on the precision and recall of the generated counterfactual regions. On datasets with mask-based ground-truth, the IoU is used.
We also define a metric that captures the semantic consistency of two segments, r{a, b}(x) and r{b, a}(x c ), by calculating the consistency of the parts included in them. This is denoted the part IoU (PIoU). This metric provides a fair comparison of different explanations if their counterfactual regions have the same size. Region size is controlled by the threshold T of (8) and (9).

User expertise has an impact on counterfactual explanations. Beginner users tend to choose random counterfactual classes, while experts tend to pick counterfactual classes similar to the true class. Hence, explanation performance should be measured for the two user types. In this work, users are simulated by choosing a random counterfactual class b for beginners and the class predicted by a small CNN for advanced users. Class a is the prediction of the classifier used to generate the explanation, which is a larger CNN.
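The PIoU can be sketched as a set IoU over the parts covered by the two segments. The masks come from different images (x and x c), so each carries its own part annotations; the part names below are illustrative.

```python
def part_iou(mask_q, points_q, mask_c, points_c):
    """PIoU: IoU of the SETS of parts covered by the query-image
    segment r{a,b}(x) and the counter-image segment r{b,a}(x_c).
    It measures semantic consistency rather than pixel overlap, so
    the two masks need not be aligned or even the same size.
    `points_*` maps part name -> (row, col) in the respective image."""
    parts_q = {k for k, (i, j) in points_q.items() if mask_q[i][j]}
    parts_c = {k for k, (i, j) in points_c.items() if mask_c[i][j]}
    union = parts_q | parts_c
    return len(parts_q & parts_c) / max(len(union), 1)
```

If both segments cover the same bird part (say, the beak) in their respective images, the PIoU is 1; if they cover disjoint parts, it is 0.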
3) Attributive Explanations: For attributive explanations, ground-truth consists of parts with unique attributes, present in the ground truth class and lacking in all other classes. This is similar to the ground truth of multi-class counterfactual explanations but V now contains all dataset classes other than y * . However, it is frequently impossible to find a part whose attributes appear in a single class. Hence, we randomly select L classes from Y \ {y * }, to create a label set L = {y 1 , . . ., y L } and use the evaluation metrics discussed for multi-class counterfactual explanations with V = L. The difference is that, in the counterfactual setting, V is selected by the user.

VII. EXPERIMENTS
In this section we discuss an experimental evaluation of the explanations generated by GALORE.

A. Experimental Setup
Datasets. Experiments were performed on the CUB200 [89] and ADE20K [95] datasets. CUB200 [89] is a densely-labeled dataset of fine-grained bird classes, annotated with parts. Fifteen part locations (points) are annotated, including back, beak, belly, breast, crown, forehead, left/right eye, left/right leg, left/right wing, nape, tail and throat. Attributes are defined and assigned to each part according to [89]. ADE20K [95] is a fine-grained scene image dataset with more than 1000 scene categories and segmentation masks for 150 objects. In this case, objects are seen as scene parts and each object has a single attribute, its probability of appearance in a scene. Both datasets were subject to standard normalizations. All results are presented on the standard CUB200 test set and the official validation set of ADE20K.
Evaluation. On CUB200, where all semantic descriptors φ k c are multidimensional, the similarities α k a,b = γ(φ k a , φ k b ) are computed with a dataset-specific similarity measure γ; for multi-class ambiguities, α k V is the minimum similarity γ(φ k a , φ k b ) over all class pairs in V. To generate ground-truth for insecurities and discriminant regions, the set G of region and class tuples was divided into two subsets. The size M of the set of ground-truth insecurities was set to the 20% of insecurities (p i , a i , b i ) or (p i , V) of strongest ambiguity. The size N of the set of discriminant ground-truth regions was set to the remaining 80% of parts (p i , a i , b i ) or (p i , V), those of smallest similarity. This division reflects the fact that dissimilar parts dominate G. Since parts are labelled with points, accuracy is measured with precision and recall.
On ADE20K, the semantic descriptors φ k c are scalar (with k ∈ {1, . . ., 150}), namely the probability of occurrence of part (object) k in scenes of class c. This is estimated by the relative frequency with which the part appears in scenes of the class. Only parts such that φ k c > 0.3 are considered. For deliberative explanations, ambiguity strengths α k a,b are defined to be large when object k appears very frequently in both classes, i.e., when the object adds ambiguity. Due to the sparsity of the matrix of ambiguity strengths α k a,b , the number M of ground-truth insecurities is set to the 1% of triplets of strongest ambiguity. On the other hand, counterfactual ground-truth consists of the triplets (p i , a i , b i ) with φ k a > 0 and φ k b = 0, i.e., where object k appears in class a but not in class b.
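On ADE20K-style annotations, these rules can be sketched directly from the occurrence probabilities. Using min(φ a, φ b), gated by the φ > 0.3 rule, as the ambiguity strength is an illustrative assumption; the paper's exact formula is not reproduced in this text.

```python
import numpy as np

def ade_ambiguity(phi_a, phi_b, thresh=0.3):
    """Ambiguity strength of each object k for a class pair: large
    when the object occurs frequently in BOTH classes. min(.,.) gated
    by the phi > 0.3 rule is an illustrative choice."""
    keep = (phi_a > thresh) & (phi_b > thresh)
    return np.where(keep, np.minimum(phi_a, phi_b), 0.0)

def ade_counterfactual_parts(phi_a, phi_b):
    """Counterfactual ground-truth: objects that appear in class a
    (phi_a > 0) but never in class b (phi_b = 0)."""
    return np.flatnonzero((phi_a > 0) & (phi_b == 0))
```

Here `phi_a[k]` is the occurrence probability of object k in scenes of class a, estimated by relative frequency as described above.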
Since deliberative explanations aim to explain examples that are difficult to classify, explanations are produced only for the 100 test images of largest difficulty score on each dataset. The W = 5 top classes are used to produce the class ambiguity set (see Section IV-D). In counterfactual explanations, AlexNet predictions [98] are used to mimic advanced users. For multi-class explanations, V is set to 3 for deliberative and 2 for counterfactual explanations. This reflects the fact that users typically do not pose counterfactuals involving large numbers of classes.

B. Ablation Study
Self-Awareness Scores. Fig. 7 shows the impact of the confidence scores of (21)-(23) on precision-recall curves (on CUB200) and IoU (on ADE20K) for the three explanation strategies. Some conclusions can be drawn. First, self-awareness is useful for all explanations. For attributive explanations, self-awareness attribution functions highlight more class-specific features. For counterfactual explanations, the gains are larger for expert users than for beginners. This is because the counter and predicted classes are more similar for the former, producing attribution maps that overlap. Second, the easiness score substantially outperforms the remaining scores, for all but counterfactual explanations with beginner users, where counter classes are easy to distinguish. Third, for deliberative explanations, only the easiness score s e (x) improves on the baseline. This suggests that self-referential difficulty scores are not always reliable. For this reason, the easiness score is used in the remaining experiments.
Attribution Function. GALORE is compatible with any attribution function. Fig. 8 (left) compares different functions: the baseline gradient ('Grad'), the integrated gradient of [51] ('InteGrad'), Grad-CAM [16], score-CAM [17], and SHAP [53]. For brevity, we only present deliberative and counterfactual results for advanced users. A few conclusions are possible. First, while the four more complex functions always outperform Grad, the differences are small, especially on ADE20K. This is probably because ADE20K is more difficult (more than 1000 categories and only about 16 examples per category) than CUB200 (200 categories and 26 examples per category). Second, while GALORE benefits from advanced attribution functions, there is little difference between InteGrad, Grad-CAM, SHAP and score-CAM. No attribution function is consistently better than all others.
Network Architectures. Fig. 8 (right) compares the explanations produced by ResNet-50, VGG16 and AlexNet. For counterfactual explanations, only the former two are compared, because AlexNet is used to simulate the users. On CUB200, ResNet-50 has the best performance. Interestingly, although ResNet-50 and VGG16 have similar classification performance on these two datasets, the ResNet segments are much more accurate than those of VGG16. This suggests that the ResNet architecture uses more intuitive, i.e., human-like, deliberations. On ADE20K, where the classification task is harder (< 60% mean accuracy), there is no clear difference between the three architectures.

C. Multi-Class Explanations

Multi-class explanations were compared to those obtained with binary explanations. An interesting observation is that, for a given recall level, the precision of deliberative explanations is even higher than for binary insecurities. This is seemingly counterintuitive, since more classes should increase the difficulty of the explanation. We hypothesize that this happens because the three classes are very similar, having many attributes in common: combining three attribution maps decreases the risk of missing common attributes. Another observation is that, similarly to binary deliberative and counterfactual explanations, the differences between attribution functions are small.

D. Segment Strength
The accuracy of segment strengths was evaluated by the Pearson correlation coefficient between strength and quality of the explanation, measured by segment precision. Table II shows a strong positive correlation for all explanations. This is sensible because strength is defined as the average intensity  of the attribution map inside the segment. Hence, the explanation should be more class-specific for larger strengths, corresponding to segments of higher quality.
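The statistic of Table II is a standard Pearson coefficient, which can be computed directly with NumPy. The per-segment numbers below are hypothetical, for illustration only, and are not the paper's measurements.

```python
import numpy as np

# Hypothetical per-segment strengths and precisions (illustrative only).
strength = np.array([0.9, 0.7, 0.4, 0.2])
precision = np.array([0.8, 0.6, 0.5, 0.1])

# Pearson correlation between explanation strength and segment precision,
# the statistic reported in Table II.
pearson = np.corrcoef(strength, precision)[0, 1]
```

A coefficient near 1 indicates that stronger segments are indeed the more precise ones.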

E. Sanity Checks
Recent works have shown that attribution maps can be sensitive to data shifts and model variance [76], [77]. Data-shift checks [76] test the robustness of the explanation to input shifts. For this, test images were randomly translated by 1 to 10 pixels along four directions. The resulting insecurities and counterfactual segments were compared to those obtained without translations, by measuring the similarity (IoU) between segments. The average IoU across all segments and examples is shown in Fig. 10 as a function of the threshold T. While these are plots for the 'easiness-Grad-VGG' configuration, they are typical. The average IoU is almost always above 75%, showing that the explanations of GALORE are robust to image shifts. Parameter randomization tests [77] compare the explanations of well-trained and randomly initialized models. Similar outputs indicate that the explanation method is insensitive to the model parameters, which is undesirable. Fig. 11 shows that all attribution functions passed the sanity check, since pre-trained models always outperformed random initialization. This was especially true for score-CAM, and the differences were larger for counterfactual explanations.
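The data-shift check reduces to realigning segments and computing their IoU; a minimal sketch follows.

```python
import numpy as np

def shift_mask(mask, di, dj):
    """Translate a binary segment mask by (di, dj), zero-padding the
    borders, to realign explanations of a shifted image with those of
    the original image."""
    h, w = mask.shape
    out = np.zeros_like(mask)
    src = mask[max(0, -di):h - max(0, di), max(0, -dj):w - max(0, dj)]
    out[max(0, di):h - max(0, -di), max(0, dj):w - max(0, -dj)] = src
    return out

def segment_iou(a, b):
    """IoU between two binary segments, the statistic reported by the
    data-shift sanity check."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / max(union, 1))
```

Averaging `segment_iou` over all segments and test images, for shifts of 1 to 10 pixels, reproduces the kind of curve shown in Fig. 10.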

F. Comparison to State of the Art
GALORE was compared with state-of-the-art explanation methods, with the results of Table III. The left side of the table presents a counterfactual explanation comparison between GALORE, the method of [30], and CounteRGAN [63], for the two user types considered in this work. To the best of our knowledge, there have been no other attempts in the literature to produce deliberative explanations. The right side of the table compares the deliberative explanations of GALORE to a baseline that we have designed, inspired by the method of [30] for counterfactual explanations.
This baseline is as follows. Given the query image x and the associated candidate class ambiguity set A, a pair of images is randomly sampled from the training set for each ambiguity (a, b) ∈ A: x a,0 , of class a, and x b,0 , of class b. A sliding window is defined over x. For each window W, we exhaustively search for matching windows W a in x a,0 and W b in x b,0 . The matching is defined as follows. Let x a (x b ) be x with W replaced by W a (W b ). The matching windows are those that minimize the change of prediction when inserted in x. Regions W a and W b should then have features that are common to the two ambiguous classes, and thus be most confusing for the classifier.
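This matching step can be sketched as below, with two simplifications relative to the actual baseline: the window is pasted at its own location, and prediction change is measured at the classifier output rather than by feature-space matching. `predict` stands for any function mapping an image to a class-score vector.

```python
import numpy as np
from itertools import product

def best_matching_window(predict, x, x_src, win):
    """Exhaustive search, in the spirit of the baseline of [30], for
    the window of x_src that, when pasted into x, changes the
    classifier prediction the least. Returns the window's top-left
    corner and the prediction-change cost."""
    h, w = x.shape[:2]
    base = predict(x)
    best, best_cost = None, np.inf
    for i, j in product(range(h - win + 1), range(w - win + 1)):
        x_mod = x.copy()
        x_mod[i:i + win, j:j + win] = x_src[i:i + win, j:j + win]
        cost = float(np.abs(predict(x_mod) - base).sum())
        if cost < best_cost:
            best, best_cost = (i, j), cost
    return best, best_cost
```

The quadratic number of windows, squared again by the search over both x a,0 and x b,0 , is what makes this baseline so much slower than GALORE.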
For a fair comparison, these experiments use the softmax score of (21), so that model sizes are equal for both [30] and the proposed approach. The size of the counterfactual (or deliberative) region is the receptive field size of one unit (1/(14 × 14) ≈ 0.005 of the image size for VGG16 and 1/(7 × 7) ≈ 0.02 for ResNet-50). This is constrained by the speed of the algorithm of [30], where the counterfactual region is determined by exhaustive feature matching. For CounteRGAN, we guarantee the same region size by thresholding the residual outputs of the generator.
Several conclusions can be drawn from the table. First, GALORE outperforms the counterfactual explanations of [30], [63] and the baseline deliberative explanation for almost all metrics. Second, GALORE is much faster, improving the speed of [30] by 1000+ times on VGG and 50+ times on ResNet. This is because it does not require exhaustive feature matching. These gains increase with the size of the counterfactual (or deliberative) region, since computation time is constant for GALORE but exponential in region size for [30]. Third, due to the small region size used in these experiments, PIoU is relatively low for all methods. It is, however, larger for GALORE explanations, with large gains in some cases (VGG & advanced). Fig. 14 shows that PIoU can rise to 0.5 for regions of 10% (VGG) or 20% (ResNet) of the image size. This suggests that, for such region sizes, region pairs have matching semantics.

TABLE III: COMPARISON TO THE STATE OF THE ART IN COUNTERFACTUAL EXPLANATIONS (IPS: IMAGES PER SECOND, IMPLEMENTED ON AN NVIDIA TITAN XP; RESULTS ARE OMITTED FOR COUNTERGAN [63] DUE TO ITS VERY LONG TRAINING TIMES).

The upper insecurity is due to an ambiguity with the classes 'California gull' and 'Herring gull,' that also have leg color 'buff,' belly color 'white,' and belly pattern 'solid'. The lower insecurity covers the bill/forehead region of the gull, due to an ambiguity between the 'Glaucous gull' and the 'Western gull,' with whom the 'Glaucous gull' shares a 'hooked' bill shape and a 'white' colored forehead. The right side of the figure shows insecurities for a 'Black tern,' due to a tail ambiguity with 'Artic' and 'Elegant' terns and a wing ambiguity with 'Elegant' and 'Forsters' terns. These insecurities are much more informative of class ambiguity than those produced by the baseline, which sometimes localizes irrelevant regions, like backgrounds. Fig. 13 shows single GALORE insecurities from four images of ADE20K.
In all cases, the insecurities correlate with regions of attributes shared by different classes. This shows that deliberative explanations unveil truly ambiguous image regions, generating intuitive insecurities that help understand network predictions. Note, for example, how the visualization of insecurities tends to highlight classes that are semantically very close, such as the different families of gulls or terns and class subsets such as 'plaza,' 'hacienda,' and 'mosque' or 'bedroom' and 'living room'. All of this suggests that the deliberative process of the network correlates well with human reasoning. Fig. 15 shows two examples of counterfactual visualizations on CUB200. The regions selected in the query and counter-class images are shown in red. For CounteRGAN [63], the generated explanatory images are shown. The true class y * and counter class y c are shown below the images, followed by the ground-truth discriminative attributes for the image pair. Note how GALORE explanations identify semantically matched and class-specific bird parts in both images, for example, the throat and bill that distinguish Laysan from Sooty Albatrosses. This feedback enables a user to learn that Laysans have white throats and yellow bills, while Sootys have black throats and bills. This is unlike the regions produced by [30], also shown in the figure, which sometimes highlight irrelevant cues, such as the background. CounteRGAN only generates some patterns from the counterfactual classes (zoom in for more detail), but not realistic images. This is consistent with the well-known difficulty of GANs in translating images across hundreds of fine-grained classes. Fig. 16 presents similar figures for ADE20K, where the proposed explanations tend to identify scene-discriminative objects, for example, that a promenade deck contains the objects 'floor,' 'ceiling,' 'sea,' while a bridge scene includes 'tree,' 'river' and 'bridge'.

Fig. 17 shows the interface of the human experiment used to evaluate deliberative explanations on Amazon MTurk. The region of support of the insecurity is shown on the left and examples from five classes are displayed on the right. These include the two ambiguous classes a and b found by the explanation algorithm, the "Laysan Albatross" and the "Glaucous Winged Gull". The Turker is asked to select, among the five classes shown, the two to which the segment on the left is most likely to belong. If these two classes match the ambiguities found by the explanation algorithm, the insecurity is considered intuitive. Otherwise, it is not. Turker performance was compared for insecurities generated by the explanation algorithm and randomly cropped regions of the same size. Turkers agreed amongst themselves on classes a and b for 59.4% of the insecurities and 33.7% of the randomly cropped regions. They agreed with the algorithm for 51.9% of the insecurities and 26.3% of the random crops. This shows that 1) insecurities are much more predictive of the ambiguities sensed by humans, and 2) the algorithm predicts those ambiguities with significant levels of consistency. In both cases, the "Don't know" rate was around 12%.

B. Application to Machine Teaching
Goyal et al. [30] used counterfactual explanations to design an experiment that teaches humans to distinguish two bird classes. During a training stage, learners are asked to classify birds. When they make a mistake, they are shown counterfactual feedback of the type of Fig. 15, using the true class as y * and the class they chose as y c . This helps them understand why they chose the wrong label, and learn how to better distinguish the classes. In a test stage, learners are then asked to classify birds without visual aids. Experiments reported in [30] show that this is much more effective than simply telling them whether their answer is correct/incorrect, or other simple training strategies. We made two modifications to this set-up. The first was to replace bounding boxes with highlighting of the counterfactual regions, as shown in Fig. 18. We also instructed learners not to be distracted by the darkened regions. Unlike [30], this guarantees that they do not exploit cues outside the counterfactual regions to learn bird differences. Second, to verify this, we added two experiments where 1) highlighted regions are generated randomly (without telling the learners), and 2) the entire image is lit. If these produce the same results, one can conclude that the explanations do not promote learning.
We also chose two bird classes, the Setophaga Citrina and the Kentucky Warbler (see Fig. 18), that are more difficult than those of [30]. These classes have large intra-class diversity and cannot be distinguished by color alone, unlike those of [30]. The experiment has three steps. The first is a pre-learning test, where humans are asked to classify 20 examples of the two classes, or choose a 'Don't know' option. The second is a learning stage, where counterfactual explanations are provided for 10 bird pairs. The third is a post-learning test, where humans are asked to answer 20 binary classification questions. In this experiment, all students chose 'Don't know' in the pre-learning test. However, after the learning step, they achieved 95% mean accuracy, compared to 60% (random highlighted regions) and 77% (entire image lit) in the control settings. These results suggest that the proposed counterfactual explanations can help teach naive humans to distinguish categories from an expert domain.

IX. CONCLUSION
In this work, we have proposed a new framework, GALORE, for visualization-based explanations of deep neural network predictions. GALORE unifies attributive, counterfactual, and deliberative explanations, aiming to satisfy the requirements of a diverse set of end-users. Attributive explanations visualize how different pixels contribute to a class prediction, deliberative explanations address the "why?" question, and counterfactual explanations the "why not?" question. All explanations are based on a combination of attributions with respect to class predictions and confidence scores. This makes them very efficient to compute, in some cases orders of magnitude faster than the state of the art. We have also introduced an experimental protocol to evaluate explanation accuracy, which sidesteps the difficulty of replicating user experiments. We believe this will facilitate research on the visualization-based XAI problem. Both this protocol and human experiments were used to evaluate GALORE on two fine-grained datasets, demonstrating that its explanations are more accurate than those previously available, are intuitive, and correlate with human perception. In this process, we have also validated the importance of self-awareness, both to define the different explanations and to increase their accuracy. Finally, counterfactual explanations were shown to be beneficial for machine teaching.