Introduction
Deep neural networks (DNNs) have received tremendous attention thanks to large-scale training datasets [1]–[3], drastic advances in computing hardware [4]–[9], and the advent of cloud computing [10]–[13], which have enabled the implementation of large numbers of layers and neurons and thereby much higher performance. DNNs have been developed in various network structures such as the multi-layer perceptron (MLP) [14], convolutional neural network (CNN) [15], recurrent neural network (RNN) [16], [17], generative adversarial network (GAN) [18], [19], and graph neural network (GNN) [20]. They have also been employed in a variety of applications such as classification [21], [22], object detection [23], segmentation [24], super resolution [25], machine translation [26], natural language processing [27], image/caption/speech generation [28]–[30], algorithmic trading [31], style transfer [32], circuit design [33], recommendation systems [34], and so on. Until recently, much research has focused on improving the performance of the networks at the cost of increased computational complexity.
In contrast to the substantial achievements with respect to performance, the increased complexity has made it difficult to understand the logic behind the decisions of the networks. Therefore, DNNs are often referred to as black boxes. Since DNNs cannot guarantee perfect performance, such as an accuracy of 100%, their decisions need to be explained in applications such as transportation, health care, legal, finance, and military areas [35]. Whereas the transportation, health care, and military fields involve life-critical decisions, the legal and finance areas are regulated by law to provide reasons for decisions.
There have been many studies to provide explainability for DNNs, collectively addressed as explainable artificial intelligence (XAI). The importance of a feature was estimated by calculating the expectation of the output over the remaining features or by fixing them at randomly selected values [36], [37]. The sensitivity of the output, computed as its derivative with respect to a specific input feature, was considered as the feature's importance, based on the assumption that more important features would cause larger variations of the outputs even for small changes [38], [39]. The output of a specific target class was simply back-propagated toward the input layer through a given network by means of deconvolution, resulting in a heatmap image that represented feature importance scores [40]. On the one hand, because simpler models are easier to interpret, the distillation scheme was proposed, where the knowledge of a high-capacity model was transferred to a small model by training the small one on the soft targets generated by the complicated model [41], [42]. Layer-wise relevance propagation (LRP) [43] used the back-propagation of relevance scores from the output layer to the input layer based on the relevance conservation rule; the propagated relevance values in the input layer were regarded as the importance of the corresponding features. There were also techniques to explain models via prototypes and criticisms [44], which can be seen as good and bad examples, respectively. On the other hand, after subsets of features were randomly selected by masking data and processed by the deep network model, a simple linear model was built between the masking data and the corresponding outputs [45], and the weights of the linear model were interpreted as the features' importance. The class activation map (CAM) method [46]–[48] was introduced to focus on the highest-level feature map, i.e., the last convolutional layer. In addition, the counterfactual approach was proposed to show how the model would have to be different for a desirable output to occur [49].
Because no XAI method can provide perfect explanations of the decisions of DNNs, their explainability needs to be evaluated to figure out which one is the best explanation approach for a given application. To the best of our knowledge, most previous XAI evaluation approaches, which this paper classifies as human-centric methods, have focused on how similar their explanations look to what people's decisions rely on [50]. In contrast, this paper proposes a machine-centric evaluation method that takes into account how well the feature importance extracted by an XAI method reflects the actual contribution of the features to the decision of the network. Consequently, while the results of previous methods depend on the dataset, the proposed evaluation scheme selects the best XAI method with high consistency across datasets. Although this paper uses network models for the image classification problem, the proposed idea is readily extensible to other decision problems.
This paper is organized as follows. Section II gives an overview of the previous evaluation methods along with their pros and cons, Section III describes the proposed HAAS evaluation scheme, Section IV presents the evaluation results and discussions, and Section V concludes this paper.
Previous Evaluation Methods
A human-friendly explanation is one of the key properties that XAI methods must have [35]; that is, explainability methods should provide representations that people can understand. However, too much emphasis on human-friendliness in previous methods resulted in human-centric explanations, biased toward how human beings would make a decision for a given problem rather than what the current model actually based its conclusion on. Therefore, in most previous evaluation methods, better explainability was equivalent to what people judged to look better, that is, a subjective criterion.
SmoothGrad [39] was introduced to provide cleaner explanations by removing the many scattered points in the saliency map [38]. These scattered points could be eliminated by adding noise to the input features, without considering whether they were important for the decision or not. The resulting human-centric sensitivity map showed only the group of points gathered on the target object, which is what people's detection would normally be based on.
Guided back-propagation [51] represented the feature importance by propagating backwards only through neurons with positive gradients in order to visualize the parts that most strongly activate the decision. Because the representations of positive gradients at the first layer were emphasized, the edges of objects were highlighted. Guided back-propagation thus gave rise to edge-based explanations showing the shapes of objects, which are believed to be the basis for people's judgment. Therefore, this method can also be classified as human-centric.
Grad-CAM [47] and Grad-CAM++ [48] were proposed to localize the area of a target object in a given image rather than to figure out per-feature importance. Grad-CAM++ obtained the weighting parameters of the feature maps from a weighted average and localized the entire object area, unlike Grad-CAM, which could cover only parts of it. Notably, its performance was evaluated objectively in terms of average drop %, % increase in confidence, and win %, compared to Grad-CAM. Average drop % was the drop in confidence caused by occluding parts of the most important regions, % increase in confidence was the increase caused by occluding unimportant regions, and win % was the number of cases among the given images in which the fall in confidence for a map generated by Grad-CAM++ was lower than that for Grad-CAM. The performance of Grad-CAM++ was higher than that of Grad-CAM; however, this was expected because Grad-CAM++ highlighted larger regions containing those of Grad-CAM.
LRP [43] produced a fine-grained pixel-wise explanation by propagating relevance scores from the output of a target class to the inputs in a layer-by-layer fashion, where the relevance values were computed from the weights as well as the activation outputs. Whereas other XAI schemes could focus only on features of positive importance, LRP was able to explain positive as well as negative influences. Besides a basic propagation rule, several variants such as the ε-rule and the αβ-rule have been defined, as described in Section IV.
On the other hand, there have been quantitative evaluation approaches based on the heatmap visualization for image classification problems. The representative one was the area over the MoRF perturbation curve (AOPC) [54], where MoRF stood for most relevant first. The image was divided into predefined grid regions, which were perturbed one after another in order of decreasing relevance according to the recursion in (1), where $x_{\mathrm{MoRF}}^{(0)}$ is the original image, $\boldsymbol{r}_{k}$ is the $k$-th most relevant region, $g$ is the perturbation function that replaces that region, and $L$ is the number of perturbation steps.
\begin{equation*} x_{\mathrm{MoRF}}^{(k)} = g\left(x_{\mathrm{MoRF}}^{(k-1)}, \boldsymbol{r}_{k}\right),\quad 1 \leq k \leq L \tag{1}\end{equation*}
AOPC evaluation based on the iterative MoRF process.
In each recursive step, the difference between the model output for the original image, $f(x_{\mathrm{MoRF}}^{(0)})$, and that for the perturbed image, $f(x_{\mathrm{MoRF}}^{(k)})$, is computed, and AOPC is obtained by averaging the accumulated differences over the perturbation steps and over the test images, as defined in (2), where $\langle\cdot\rangle$ denotes the average over the dataset.
\begin{equation*} \text{AOPC} = \left\langle \frac{1}{L} \sum_{k=1}^{L} \left( f(x_{\mathrm{MoRF}}^{(0)}) - f(x_{\mathrm{MoRF}}^{(k)}) \right) \right\rangle \tag{2}\end{equation*}
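To make the procedure concrete, the following Python sketch computes AOPC for a single image. It is a minimal illustration under stated assumptions: `model(image, target_class)` is assumed to return the pre-softmax score of the target class, `heatmap` is a per-pixel relevance map, and the perturbation replaces each grid region with uniform noise; the grid size, step count, and perturbation type are illustrative rather than those of [54].

```python
import numpy as np

def aopc_single_image(model, image, heatmap, target_class, grid=8, steps=100, rng=None):
    """Area over the MoRF perturbation curve for one image (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    h, w = heatmap.shape
    # Aggregate pixel relevance into non-overlapping grid regions.
    regions = []
    for y in range(0, h, grid):
        for x in range(0, w, grid):
            regions.append((heatmap[y:y + grid, x:x + grid].sum(), y, x))
    # Most relevant first (MoRF) ordering.
    regions.sort(key=lambda t: t[0], reverse=True)

    x_morf = image.copy()
    f0 = model(image, target_class)          # f(x_MoRF^(0))
    drops = []
    for _, y, x in regions[:steps]:           # L perturbation steps
        # g(x_MoRF^(k-1), r_k): replace the k-th most relevant region with noise.
        patch = x_morf[..., y:y + grid, x:x + grid]
        x_morf[..., y:y + grid, x:x + grid] = rng.uniform(-1, 1, patch.shape)
        drops.append(f0 - model(x_morf, target_class))
    return float(np.mean(drops))              # averaged over k; average over images externally
```

The final AOPC value of (2) would then be the mean of this quantity over the whole test set.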
Another quantitative metric was the inside-total relevance ratio $\mu$ in (3), which measures how much of the total relevance $R_{tot}$ falls inside the bounding box of the target object, $R_{in}$. Its size-weighted version $\mu_{w}$ in (4) compensates for the size of the bounding box, where $S_{in}$ and $S_{tot}$ denote the areas of the bounding box and of the whole image, respectively.
\begin{align*} \mu &= \frac{R_{in}}{R_{tot}} \tag{3}\\ \mu_{w} &= \mu \cdot \frac{S_{tot}}{S_{in}} \tag{4}\end{align*}
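A minimal sketch of (3) and (4), assuming `heatmap` holds per-pixel relevance and `box_mask` is a boolean mask of the object bounding box; accumulating only positive relevance is one possible convention, not necessarily the one used in the cited work.

```python
import numpy as np

def inside_total_ratio(heatmap, box_mask):
    """Inside-total relevance ratio (3) and its size-weighted version (4)."""
    pos = np.clip(heatmap, 0, None)        # keep positive relevance only (assumed convention)
    r_in, r_tot = pos[box_mask].sum(), pos.sum()
    s_in, s_tot = box_mask.sum(), box_mask.size
    mu = r_in / r_tot
    mu_w = mu * (s_tot / s_in)
    return mu, mu_w
```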
Proposed Heatmap Assisted Accuracy Score
The proposed heatmap assisted accuracy score (HAAS) scheme has four advantages over previous quantitative evaluation methods. First, HAAS is much simpler than the AOPC and inside-total relevance ratio schemes since it does not require any iterations for sequential occlusion or information removal. Second, HAAS directly provides an accuracy ratio as the quantitative evaluation metric, while previous methods focus on the variation of only the target output and need additional processes such as averaging, plotting, generating bounding boxes, summing relevance, and estimating areas. Because the accuracy is determined by the maximum over all the output units, HAAS takes all the output variations into account. Third, HAAS is able to evaluate both positive and negative influences of features by taking all the heatmap scores into account directly, whereas previous schemes consider only the positive influences. Lastly, HAAS is a machine-centric method because XAI algorithms are evaluated directly based on the performance of the given model, not from people's perspectives: their explainability is assessed by the accuracy changes of the given model on images modified according to the features' importance. In addition, whereas AOPC results vary across datasets, HAAS provides robust evaluations.
A heatmap is a visualization method that shows the importance of each feature for a given decision in color or gray scale. As depicted in Fig. 2, which shows a heatmap extracted by LRP from a DNN for handwritten digit classification, red and blue colors represent features of positive and negative influence, respectively, while green colors indicate that those features have little impact on the output. In other words, while the data in red areas contribute to the increase of a specific output, blue regions lead to its decrease.
Example of heatmap visualization for the digit 3 in the handwritten digit classification. The heatmap image was obtained by LRP, where red and blue colors mean positive and negative influences, respectively.
Our hypothesis for HAAS is that, since heatmaps carry information about the features' influence, the performance of the model should be improved by modifying input images according to their heatmap scores. The more accurate the explanation a heatmap provides with respect to a network model, the more the performance will be enhanced. Based on this hypothesis, the proposed HAAS scheme is illustrated in Fig. 3 and defined in (5), where $x$ is an input image, $h$ is its heatmap, $HA(x,h)$ is the heatmap-assisted (HA) image generated from them, and $Acc(\cdot)_{N}$ denotes the classification accuracy of the model $f$ over $N$ test images.
\begin{equation*} HAAS = \frac{Acc(f(HA(x,h)))_{N}}{Acc(f(x))_{N}} \tag{5}\end{equation*}
For the generation of HA images, red areas are highlighted and blue areas are de-emphasized. Here, while highlighting features means making positive values more positive and negative values more negative, de-emphasizing features is equivalent to moving positive values toward negative and negative values toward positive. To support highlighting as well as de-emphasizing, we normalize both an input image, $x_{Norm}$, and its heatmap, $h_{Norm}$, to the range of $-1$ to $+1$, and the HA image is generated by (6), where the result is clipped to the same range.
\begin{equation*} HA = \max\{-1,\min\{1, x_{Norm} \cdot (1+h_{Norm})\}\} \tag{6}\end{equation*}
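The HA generation in (6) and the HAAS ratio in (5) reduce to a few lines of code. The sketch below assumes images and heatmaps already normalized to [−1, +1] and a hypothetical `predict` callable that returns the predicted class label of the evaluated model.

```python
import numpy as np

def ha_image(x_norm, h_norm):
    """Heatmap-assisted image of (6): emphasize positive-, suppress negative-relevance pixels."""
    return np.clip(x_norm * (1.0 + h_norm), -1.0, 1.0)

def haas(predict, images, heatmaps, labels):
    """HAAS of (5): accuracy over HA images divided by accuracy over the original images."""
    acc_orig = np.mean([predict(x) == y for x, y in zip(images, labels)])
    acc_ha = np.mean([predict(ha_image(x, h)) == y
                      for x, h, y in zip(images, heatmaps, labels)])
    return acc_ha / acc_orig
```

A HAAS value above 1 therefore indicates that the heatmaps improved the model's accuracy, which is the behavior the hypothesis predicts for a faithful explanation.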
Evaluation Results
A. Evaluation Setup
The evaluation is conducted for two CNN models, LeNet-5 [56] and VGG-16 [57], with four datasets, MNIST [58], CIFAR-10 [59], STL-10 [60], and ILSVRC2012 [61], which are widely used for classification. While the LeNet-5 network for MNIST and the VGG-16 networks for CIFAR-10 and STL-10 are re-trained with training data normalized to the range of −1 to +1, a pre-trained VGG-16 network is employed for ILSVRC2012. As presented in Fig. 4, the last Softmax function is omitted during the XAI evaluation because the polarity information at the output is needed to express both positive and negative influences of features. However, during the training phase, all the networks include Softmax functions for the class outputs.
Classification models: (a) MNIST + LeNet-5, (b) CIFAR-10 + VGG-16, (c) STL-10 + VGG-16, (d) ILSVRC2012 + VGG-16.
The XAI algorithms used for the evaluation are the saliency map, deconvolution, and LRPs with various rule configurations, as investigated in the previous AOPC paper [54], because the performance of HAAS is compared to AOPC's results. These XAIs provide an explanation for each feature, i.e., each pixel of an image, rather than an object localization. Saliency maps [38] are constructed by (7), where $M_{p,c}$ is the absolute value of the derivative of the class-$c$ output $f_{c}(x)$ with respect to the pixel $x_{p}$, and the heatmap score $h^{SM}_{p,c}$ in (8) is obtained by normalizing $M_{p,c}$ by its maximum over all pixels.
\begin{align*} M_{p,c} &= \left| \frac{\partial f_{c}(x)}{\partial x_{p}} \right| \tag{7}\\ h^{SM}_{p,c} &= \frac{M_{p,c}}{\max\{M_{c}\}} \tag{8}\end{align*}
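Equations (7) and (8) correspond to a single backward pass through the network. Below is a hedged PyTorch sketch, assuming `model` returns pre-softmax class scores of shape (1, num_classes); aggregating the per-channel gradients by their maximum is an assumption made for illustration.

```python
import torch

def saliency_heatmap(model, x, target_class):
    """Saliency map of (7)-(8): |d f_c / d x_p|, normalized by its maximum."""
    x = x.clone().requires_grad_(True)        # x: (1, C, H, W)
    score = model(x)[0, target_class]         # f_c(x), pre-softmax
    score.backward()
    m = x.grad.abs().amax(dim=1)[0]           # max over color channels -> per-pixel sensitivity
    return m / m.max()                        # h^SM in [0, 1]
```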
When the model consists of convolutional layers, the forward computation of layer $l$ is given by (9), where $a^{(l)}$ is the activation output and $\theta^{(l,l+1)}$ is the set of filter parameters between layers $l$ and $l+1$. The deconvolution method [40] propagates the response $R^{(l+1)}_{c}$ of the target class $c$ backwards through the transposed filters as in (10), and the heatmap score $h^{DC}_{p,c}$ in (11) is obtained by normalizing the result at the input layer by its maximum magnitude.
\begin{align*} z^{(l+1)} &= Conv(a^{(l)},\theta^{(l,l+1)}) \tag{9}\\ D^{(l)}_{c} &= Deconv(R^{(l+1)}_{c},\theta^{(l,l+1)}) \tag{10}\\ h^{DC}_{p,c} &= \frac{D^{(0)}_{p,c}}{\max\{|D^{(0)}_{c}|\}} \tag{11}\end{align*}
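The sketch below illustrates one backward step of (10) using a transposed convolution with the forward filters; the handling of pooling and nonlinear layers required by a full deconvolution pipeline [40] is omitted here, and the stride/padding values are illustrative.

```python
import torch.nn.functional as F

def deconv_layer(r_upper, weight, stride=1, padding=1):
    """Propagate a response map backwards through one conv layer, as in (10).

    r_upper: (1, C_out, H, W) response from the upper layer.
    weight : (C_out, C_in, kH, kW) filters used in the forward pass (9).
    """
    # A transposed convolution with the same filters maps the response back to the lower layer.
    return F.conv_transpose2d(r_upper, weight, stride=stride, padding=padding)

def normalize_heatmap(d0):
    """Heatmap score of (11): signed values scaled by the maximum magnitude."""
    return d0 / d0.abs().max()
```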
LRPs [43] are also investigated with 5 propagation rules in various configurations (LRP1 to LRP9). Letting $a^{(l)}_{i}$ be the activation output of neuron $i$ in layer $l$, $w^{(l+1)}_{ij}$ the weight between neurons $i$ and $j$, and $b^{(l+1)}_{j}$ the bias, the individual contribution $z^{(l+1)}_{ij}$ and the total pre-activation $z^{(l+1)}_{j}$ are defined in (12) and (13).
\begin{align*} z^{(l+1)}_{ij} &= a^{(l)}_{i} \cdot w^{(l+1)}_{ij} \tag{12}\\ z^{(l+1)}_{j} &= \sum_{i} z^{(l+1)}_{ij} + b^{(l+1)}_{j} \tag{13}\end{align*}
Then, the ε-rule redistributes the relevance $R_{j}^{(l+1)}$ of neuron $j$ in layer $l+1$ to neuron $i$ in layer $l$ in proportion to its contribution $z^{(l+1)}_{ij}$, as given in (14), where the small stabilizer $\varepsilon$ prevents division by a vanishing denominator.
\begin{equation*} R_{i\leftarrow j}^{\epsilon,(l,l+1)} = \frac{z^{(l+1)}_{ij}}{z^{(l+1)}_{j} + \varepsilon \cdot sign(z^{(l+1)}_{j})}\cdot R_{j}^{(l+1)} \tag{14}\end{equation*}
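For a fully connected layer, the ε-rule of (12)–(14) together with the aggregation of (27) can be sketched in a few lines of numpy; convolutional layers follow the same pattern once the weights are unfolded. The function interface is an illustrative assumption.

```python
import numpy as np

def lrp_epsilon(a_l, w, b, r_upper, eps=1e-6):
    """epsilon-rule relevance redistribution for one dense layer.

    a_l    : (I,)  activations of layer l
    w      : (I, J) weights, b: (J,) biases
    r_upper: (J,)  relevance of layer l+1
    returns: (I,)  relevance of layer l, i.e. (27) applied to (14)
    """
    z_ij = a_l[:, None] * w                   # (12): individual contributions
    z_j = z_ij.sum(axis=0) + b                # (13): total pre-activations
    denom = z_j + eps * np.sign(z_j)          # stabilized denominator of (14)
    return (z_ij / denom) @ r_upper           # sum over j as in (27)
```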
For the αβ-rule, the positive and negative parts of the contributions and the biases are first separated as in (15)–(18).
\begin{align*} z_{ij}^{+,(l+1)} &= \begin{cases} z_{ij}^{(l+1)}, & \text{if}~z_{ij}^{(l+1)} > 0\\ 0, & \text{otherwise.} \end{cases} \tag{15}\\ z_{ij}^{-,(l+1)} &= \begin{cases} z_{ij}^{(l+1)}, & \text{if}~z_{ij}^{(l+1)} < 0\\ 0, & \text{otherwise.} \end{cases} \tag{16}\\ b_{j}^{+,(l+1)} &= \begin{cases} b_{j}^{(l+1)}, & \text{if}~b_{j}^{(l+1)} > 0\\ 0, & \text{otherwise.} \end{cases} \tag{17}\\ b_{j}^{-,(l+1)} &= \begin{cases} b_{j}^{(l+1)}, & \text{if}~b_{j}^{(l+1)} < 0\\ 0, & \text{otherwise.} \end{cases} \tag{18}\end{align*}
With these separated terms, the positive and negative pre-activations are accumulated in (19) and (20), and the relevance is redistributed by the αβ-rule in (21), where $\alpha$ and $\beta$ weight the positive and negative contributions, respectively.
\begin{align*} z_{j}^{+,(l+1)} &= \sum_{i} z_{ij}^{+,(l+1)}+b_{j}^{+,(l+1)} \tag{19}\\ z_{j}^{-,(l+1)} &= \sum_{i} z_{ij}^{-,(l+1)}+b_{j}^{-,(l+1)} \tag{20}\\ R_{i\leftarrow j}^{\alpha\beta,(l,l+1)} &= \left(\alpha \cdot \frac{z_{ij}^{+,(l+1)}}{z_{j}^{+,(l+1)}}+\beta \cdot \frac{z_{ij}^{-,(l+1)}}{z_{j}^{-,(l+1)}} \right)\cdot R_{j}^{(l+1)} \tag{21}\end{align*}
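A dense-layer sketch of (15)–(21) under the same assumptions as the ε-rule sketch above; the default α and β values are illustrative, and the small stabilizers added to the denominators are an implementation detail not shown in the equations.

```python
import numpy as np

def lrp_alpha_beta(a_l, w, b, r_upper, alpha=2.0, beta=-1.0, eps=1e-12):
    """alpha-beta rule for one dense layer, following (15)-(21)."""
    z_ij = a_l[:, None] * w
    z_pos, z_neg = np.clip(z_ij, 0, None), np.clip(z_ij, None, 0)        # (15), (16)
    b_pos, b_neg = np.clip(b, 0, None), np.clip(b, None, 0)              # (17), (18)
    zp_j = z_pos.sum(axis=0) + b_pos                                     # (19)
    zn_j = z_neg.sum(axis=0) + b_neg                                     # (20)
    frac = alpha * z_pos / (zp_j + eps) + beta * z_neg / (zn_j - eps)    # (21)
    return frac @ r_upper                                                # aggregate as in (27)
```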
The flat-rule is simply defined to distribute the relevance of a higher layer uniformly to the lower layer, as presented in (22), without any consideration of the activation outputs and parameters.
\begin{equation*} R_{i\leftarrow j}^{\flat,(l,l+1)} = \frac{1}{\sum_{i} 1}\cdot R_{j}^{(l+1)} \tag{22}\end{equation*}
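For a fully connected layer, (22) followed by the summation of (27) reduces to splitting the total upper-layer relevance equally among the inputs, as in this small sketch:

```python
import numpy as np

def lrp_flat(num_inputs, r_upper):
    """Flat-rule (22) plus aggregation (27) for a dense layer: uniform redistribution."""
    return np.full(num_inputs, r_upper.sum() / num_inputs)
```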
Lastly, the bounded-rule is defined with the lower and upper bounds, $low_{i}$ and $high_{i}$, of the input values, and is therefore typically applied to the input layer whose values lie in a known range. The positive and negative parts of the weights are separated as in (23) and (24), the bounded contribution is computed as in (25), and the relevance is redistributed as in (26).
\begin{align*} w_{ij}^{+,(l+1)} &= \begin{cases} w_{ij}^{(l+1)}, & \text{if}~w_{ij}^{(l+1)} > 0\\ 0, & \text{otherwise.} \end{cases} \tag{23}\\ w_{ij}^{-,(l+1)} &= \begin{cases} w_{ij}^{(l+1)}, & \text{if}~w_{ij}^{(l+1)} < 0\\ 0, & \text{otherwise.} \end{cases} \tag{24}\\ z_{ij}^{B,(l+1)} &= z_{ij}^{(l+1)} - low_{i}\, w^{+,(l+1)}_{ij} - high_{i}\, w^{-,(l+1)}_{ij} \tag{25}\\ R_{i\leftarrow j}^{B,(l,l+1)} &= \frac{z_{ij}^{B,(l+1)}}{\sum\limits_{i} z_{ij}^{B,(l+1)}+\epsilon}\cdot R_{j}^{(l+1)} \tag{26}\end{align*}
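A dense-layer sketch of (23)–(26) for the input layer; `low` and `high` are assumed to be per-input bound arrays (e.g., −1 and +1 after the normalization used in this paper).

```python
import numpy as np

def lrp_bounded(x, w, r_upper, low, high, eps=1e-12):
    """Bounded-rule for the input layer, following (23)-(26).

    x: (I,) input values, w: (I, J) weights, r_upper: (J,) relevance of layer 1,
    low, high: (I,) lower/upper bounds of the inputs.
    """
    w_pos, w_neg = np.clip(w, 0, None), np.clip(w, None, 0)                 # (23), (24)
    z_b = x[:, None] * w - low[:, None] * w_pos - high[:, None] * w_neg     # (25)
    return (z_b / (z_b.sum(axis=0) + eps)) @ r_upper                        # (26), summed over j
```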
Consequently, the relevance $R_{i}^{(l)}$ of neuron $i$ in layer $l$ is obtained by summing the redistributed relevances from all neurons of the upper layer as in (27), and the LRP heatmap score $h^{LRP}_{p,c}$ of pixel $p$ for class $c$ in (28) is the relevance at the input layer normalized by its maximum magnitude.
\begin{align*} R_{i}^{(l)} &= \sum_{j}R_{i\leftarrow j}^{(l,l+1)} \tag{27}\\ h^{LRP}_{p,c} &= \frac{R^{(0)}_{p,c}}{\max\{|R^{(0)}_{c}|\}} \tag{28}\end{align*}
The eleven XAI algorithms evaluated in the following subsection consist of the saliency map, the deconvolution, and nine LRP configurations (LRP1 to LRP9) constructed from the propagation rules above.
B. HAAS Evaluation
All the networks are investigated by both the AOPC and HAAS methods for these eleven XAI algorithms. AOPC is conducted with 100 perturbation iterations per image over predefined grid regions.
Heatmap and HA images for MNIST images of digits 0 to 4 for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for MNIST images of digits 5 to 9 for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Second, VGG-16 is investigated with the CIFAR-10 test dataset of 10,000 images, and the results are shown in Table 3.
Heatmap and HA images for CIFAR-10 images of plane, car, bird, cat, and deer for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for CIFAR-10 images of dog, frog, horse, ship, and truck for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Third, VGG-16 is investigated with the STL-10 test dataset of 8,000 images, and the results are shown in Table 4.
Heatmap and HA images for STL-10 images of airplane, bird, car, cat, and deer for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for STL-10 images of dog, horse, monkey, ship, and truck for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Lastly, VGG-16 is investigated with the ILSVRC2012 test dataset of 50,000 images, and the results are shown in Table 5.
Heatmap and HA images for ILSVRC2012 images of kite, crane, boxer, beaver, and baseball for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for ILSVRC2012 images of cradle, mask, scoreboard, teapot, and cup for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
While the best XAI algorithm selected by AOPC is strongly dependent on the dataset, the proposed HAAS consistently recommends LRP1 and LRP3 regardless of the dataset. We believe that if an evaluation method is machine-centric, its XAI selection should be independent of the dataset. For a quantitative comparison of the consistencies of AOPC and HAAS, we employ the Kullback–Leibler divergence $D_{KL}$ defined in (29). For each dataset $k$, the evaluation scores $E_{k}(i)$ of the XAI algorithms $i$ are min–max normalized as in (30), where the small offset $\gamma$ avoids zero probabilities, and converted into the probability distribution $P_{k}(i)$ in (31). The divergence between the distributions obtained from different datasets then quantifies the consistency of each evaluation method, where a lower divergence corresponds to a more consistent XAI ranking across datasets.
\begin{align*} D_{KL}(p\|q) &= \sum_{i}p(i)\log\frac{p(i)}{q(i)} \tag{29}\\ N_{k}(i) &= \frac{E_{k}(i)-\min(E_{k})}{\max(E_{k})-\min(E_{k})}+\gamma \tag{30}\\ P_{k}(i) &= \frac{N_{k}(i)}{\sum_{i}N_{k}(i)} \tag{31}\end{align*}
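The consistency comparison of (29)–(31) can be reproduced with a short Python sketch. Here `E` is assumed to be a (datasets × algorithms) array of evaluation scores, and the average pairwise divergence between datasets is reported; the exact pairing and the value of γ used in the paper are not restated here, so these are assumptions.

```python
import numpy as np
from itertools import combinations

def consistency_kl(E, gamma=0.01):
    """Average pairwise KL divergence between per-dataset score distributions, per (29)-(31).

    E: (K, I) array of K datasets x I XAI algorithms; lower result = more consistent ranking.
    """
    # (30): min-max normalize each dataset's scores, with offset gamma to avoid zeros.
    lo, hi = E.min(axis=1, keepdims=True), E.max(axis=1, keepdims=True)
    N = (E - lo) / (hi - lo) + gamma
    P = N / N.sum(axis=1, keepdims=True)            # (31): convert to distributions
    kl = lambda p, q: np.sum(p * np.log(p / q))     # (29)
    return float(np.mean([kl(P[a], P[b]) for a, b in combinations(range(E.shape[0]), 2)]))
```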
C. Inverted HA Test
To further verify our hypothesis for HAAS that emphasizing the pixels of an image according to its heatmap scores enhances the accuracy of a given classification network, the inverted HA images ($invHA$) are generated by (32), where the sign of the heatmap term in (6) is flipped so that positively contributing features are suppressed and negatively contributing features are emphasized. If the hypothesis holds, the accuracy over the inverted HA images should decrease.
\begin{equation*} invHA = \max\{-1,\min\{1, x_{Norm} \cdot (1-h_{Norm})\}\} \tag{32}\end{equation*}
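As a sketch, (32) only flips the sign of the heatmap term relative to the HA generation shown earlier, again assuming inputs and heatmaps normalized to [−1, +1]:

```python
import numpy as np

def inv_ha_image(x_norm, h_norm):
    """Inverted HA image of (32): suppress positive-, emphasize negative-relevance pixels."""
    return np.clip(x_norm * (1.0 - h_norm), -1.0, 1.0)
```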
Heatmap and HA images for ILSVRC2012 for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9.
Like the HAAS evaluation, the inverted HA test is conducted with the two DNNs, LeNet-5 and VGG-16, and the four test datasets, MNIST, CIFAR-10, STL-10, and ILSVRC2012. The output value at the label class (i.e., the ground-truth class) is then examined for the inverted HA images.
Conclusion
Whereas DNNs have accomplished dramatic performance improvements in many areas, understanding the inside of the networks has become more difficult. Consequently, various XAI algorithms have been proposed to interpret the logic behind the decisions of complex neural networks, and at the same time, it has become necessary to study methods to evaluate their explainability. This paper proposes the machine-centric HAAS scheme to evaluate the explainability of XAI methods, whereas most previous evaluation methods focus on how close their results are to what people rely on. Furthermore, unlike AOPC, which needs many iterations and focuses only on the output of a target class, HAAS provides a quantitative score directly by feeding HA images, generated from the original images and their heatmap scores, into the given model without any iterations. Because it estimates accuracy changes, it takes all the outputs into account at the same time. In addition, HAAS uses both positive and negative influences of features. In particular, over the four datasets for the classification networks, HAAS achieves a lower Kullback–Leibler divergence than AOPC, that is, a more consistent selection of the best XAI algorithm across datasets.