Heatmap Assisted Accuracy Score Evaluation Method for Machine-Centric Explainable Deep Neural Networks

Many studies have addressed explainable artificial intelligence (XAI), which explains the logic behind complex deep neural networks known as black boxes. At the same time, researchers have tried to evaluate the explainability performance of various XAIs. However, most previous evaluation methods are human-centric, that is, subjective: they rely on how similar the explanation results are to what people base their decisions on, rather than on what features actually affect the decision in the model. Their XAI selections are also dependent on datasets. Furthermore, they focus only on the output variation of a target class. In contrast, this paper proposes a heatmap assisted accuracy score (HAAS) scheme that is robust over datasets and helps select machine-centric explanation algorithms showing what actually leads to the decision of a given classification network. The proposed method modifies the input image with the heatmap scores obtained by a given explanation algorithm and then puts the resultant heatmap assisted (HA) images into the network to estimate the accuracy change. The resultant metric (HAAS) is computed as the ratio of the network's accuracies over HA and original images. The proposed evaluation scheme is verified on image classification models of LeNet-5 for MNIST and VGG-16 for CIFAR-10, STL-10, and ILSVRC2012, over a total of 11 XAI algorithms: saliency map, deconvolution, and 9 layer-wise relevance propagation (LRP) configurations. Consequently, for LRP1 and LRP3, MNIST showed the largest HAAS values of 1.0088 and 1.0079, CIFAR-10 achieved 1.1160 and 1.1254, STL-10 had 1.0906 and 1.0918, and ILSVRC2012 got 1.3207 and 1.3469. While LRP1 consists of ϵ-rules for the input, convolutional, and fully-connected layers, LRP3 adopts a bounded-rule for the input layer and the same ϵ-rules as LRP1 for the other layers. The consistency of the evaluation results of HAAS and AOPC has been compared by means of the Kullback-Leibler divergence, showing that HAAS is a more robust evaluation method than AOPC independently of datasets, since HAAS has a much lower average divergence of 0.0251 than AOPC's 0.3048. In addition, the validity of the proposed HAAS scheme is further investigated through the inverted HA test, which employs inverted HA images built from inverted heatmap scores and estimates the accuracy degradation caused by applying them to the network. The XAI algorithms with the largest HAAS results experience the biggest accuracy degradation in the inverted HA test.

I. INTRODUCTION
In contrast to their substantial performance achievements, the increased complexity of deep neural networks (DNNs) has made it difficult to understand the logic behind their decisions. Therefore, DNNs are referred to as black boxes. Since DNNs cannot guarantee perfect performance, that is, an accuracy of 100 %, their decisions need explanation in applications such as transportation, health care, legal, finance, and military areas [35]. Whereas the transportation, health care, and military fields involve life-related decisions, the legal and finance areas are required by law to provide reasons for decisions.
There have been many studies on providing explainability for DNNs, collectively referred to as explainable artificial intelligence (XAI). The importance of a feature was estimated by calculating the expectation of the output over the other remaining features or by fixing them at randomly selected values [36], [37]. The sensitivity of the output, computed as its derivative with respect to a specific input feature, was considered the feature's importance, based on the assumption that more important features would cause larger variations of the outputs even at small changes [38], [39]. The output of a specific target class was simply back-propagated toward the input layer through a given network by means of deconvolution, resulting in a heatmap image that represented feature importance scores [40]. On the one hand, because simpler models were easier to interpret, the distillation scheme was proposed, where the knowledge of a high-capacity model was transferred to a small model by training the small one on the soft targets generated by the complicated model [41], [42]. Layer-wise relevance propagation (LRP) [43] used the back-propagation of relevance scores from the output layer to the input layer based on the relevance conservation rule. The propagated relevance values in the input layer were regarded as the importance of the corresponding features. There were also techniques to explain models via prototypes and criticisms [44], which can be seen as good and bad examples, respectively. On the other hand, after subsets of features were randomly selected by masking data and processed by the deep network model, a simple linear model was built between the masking data and the corresponding outputs [45]. The weights of the linear model were interpreted as the features' importance. The class activation map (CAM) method [46]–[48] was introduced to focus on the highest-level feature map, that is, the last convolutional layer. In addition, the counterfactual approach was proposed to show how the model would have to be different for a desirable output to occur [49].
Because it is impossible to realize an XAI method that provides perfect explanations of DNN decisions, we need to evaluate their explainability to figure out which one is the best explanation approach for a given application. To the best of our knowledge, most previous XAI evaluation approaches, which this paper classifies as human-centric methods, have focused on how similar their explanations look to what people's decisions rely on [50]. In contrast, this paper proposes a machine-centric evaluation method that takes into account how well the feature importance extracted by an XAI reflects its actual contribution to the decision of the network. Consequently, while the results of previous methods depend on the dataset, the proposed evaluation scheme achieves high consistency across datasets in selecting the best XAI. Although this paper uses network models for the image classification problem, the proposed idea is expandable to other decision problems.
This paper is organized as follows. Section II gives an overview of previous evaluation methods along with their pros and cons, and Section III describes the proposed HAAS evaluation scheme. Section IV shows the evaluation results and discussions. Section V concludes this paper.

II. PREVIOUS EVALUATION METHODS
Human-friendly explanation is one of the key properties that XAIs must have [35]: explainability methods should provide representations that people can understand. However, too much emphasis on human-friendliness in previous methods resulted in human-centric explanations that were biased toward how human beings make decisions for given problems rather than what the current model actually based its conclusion on. Therefore, in most previous evaluation methods, better explainability was equivalent to what people judged to look better, that is, a subjective criterion.
SmoothGrad [39] was introduced to provide a cleaner explanation by removing the many scattered points in the saliency map [38]. The scattered points could be eliminated by adding noise to the input features, without considering whether those scattered points in the saliency map were important for the decision or not. The resultant human-centric sensitivity map showed only the group of points gathered on the target object that people's detection would normally be based on.
Guided back-propagation [51] represented the feature importance by propagating backwards only through neurons with positive gradients in order to visualize the part that most significantly activates the decision. Because the representations of positive gradients at the first layer were emphasized, the edges of objects were highlighted. Guided back-propagation thus gave rise to an edge-based explanation showing the shapes of objects that were believed to be the basis for people's judgment. Therefore, this method can also be classified as human-centric.
Grad-CAM [47] and Grad-CAM++ [48] were proposed to localize the area of a target object in a given image rather than to figure out the feature importance. Grad-CAM++ obtained the parameters of feature maps from a weighted average and localized the entire object area, unlike Grad-CAM, which could cover only parts of it. Notably, its performance was evaluated objectively in terms of average drop %, % increase in confidence, and win %, compared to Grad-CAM. Average drop % was the drop in confidence caused by occluding parts of the most important regions, % increase in confidence was the increase caused by occluding unimportant regions, and win % was the number of cases in the given images where the fall in confidence for a map generated by Grad-CAM++ was lower than that for Grad-CAM. The performance of Grad-CAM++ was higher than that of Grad-CAM; however, this was natural because Grad-CAM++ highlighted bigger regions containing those of Grad-CAM.
LRP [43] gave rise to a fine-grained pixel-wise explanation by propagating relevance scores from the output of a target class to the inputs layer by layer, where relevance values were computed with weights as well as activation outputs. Whereas other XAI schemes could focus only on features of positive importance, LRP was able to provide explanations about positive as well as negative influences. Besides the basic ϵ-rule, additional relevance propagation rules such as the αβ-rule, flat-rule, and bounded-rule [52], [53], which are described in more detail in Section IV, were proposed to accomplish a more human-centric explanation. Previous LRP papers subjectively evaluated their explainability based on how well the target object area was highlighted.
On the other hand, there have been quantitative evaluation approaches based on heatmap visualization for image classification problems. The representative one was the area over the MoRF perturbation curve (AOPC) [54], where MoRF stands for most relevant first. The image was divided into predefined grid regions (r_1, ..., r_L) whose locations were ordered according to their heatmap scores, so that the score of r_i is equal to or larger than that of r_j when i < j. Heatmap scores indicate how important a location is for representing the output class. The MoRF process was conducted recursively by adding perturbations to the highest-score regions of the image as described in (1) and Fig. 1, where g was a function eliminating the information of r_k from the modified image after (k − 1) MoRF operations (x_MoRF^(k−1)). In each recursive step, the difference (Diff(k)) between the model outputs for the original and perturbed images x_MoRF was plotted, and the final AOPC value was the area under the difference curve, as expressed in (2), leading to the conclusion that a larger AOPC is obtained from a better XAI method. ⟨·⟩ is the average over all images. Consequently, AOPC results showed that LRPs were superior to the saliency map [38] and deconvolution [40]. However, AOPC requires many iterations to obtain evaluation results while considering only the highest heatmap scores. Since AOPC is mainly determined by the first highest-score location, it cannot accomplish a fair evaluation of the other heatmap score areas. Also, AOPC makes use of the output value difference only at a target class of a given DNN model, regardless of the accuracy variation and the changes of other outputs. Therefore, its evaluation results vary depending on the dataset used. On top of AOPC, the heatmap images were compressed and their file sizes were compared to figure out which XAI algorithms highlight the relevant regions and not more. The file sizes of LRP heatmaps were also smaller than those of the saliency map and deconvolution; however, the assumption that better XAI schemes should produce smaller file sizes was also human-centric, since an explanation with more scattered points was considered worse, as in SmoothGrad.
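To make the MoRF procedure concrete, the following is a minimal sketch of AOPC for a single image, assuming a PyTorch classifier that maps a (1, C, H, W) tensor to class scores and a NumPy heatmap of shape (H, W); the grid size, the number of steps, and the uniform-noise perturbation g(·) are illustrative assumptions rather than the exact configuration of [54].

```python
import numpy as np
import torch

def aopc(model, x, heatmap, grid=8, steps=50, seed=0):
    """Sketch of AOPC (2): perturb the most relevant grid regions first
    (MoRF) and average the output drop at the predicted class.
    `model`: PyTorch classifier; `x`: (1, C, H, W); `heatmap`: (H, W)."""
    rng = np.random.default_rng(seed)
    with torch.no_grad():
        out = model(x)
        c = out.argmax(dim=1).item()            # target class
        f0 = out[0, c].item()                   # f_c(x) for the original image
        # order grid cells by the sum of their heatmap scores (MoRF order)
        H, W = heatmap.shape
        cells = sorted(((heatmap[i:i + grid, j:j + grid].sum(), i, j)
                        for i in range(0, H, grid)
                        for j in range(0, W, grid)), reverse=True)
        x_pert, diffs = x.clone(), []
        for _, i, j in cells[:steps]:
            # g(.): remove the region's information with uniform noise
            shape = x_pert[..., i:i + grid, j:j + grid].shape
            x_pert[..., i:i + grid, j:j + grid] = torch.from_numpy(
                rng.uniform(-1, 1, shape)).float()
            diffs.append(f0 - model(x_pert)[0, c].item())  # Diff(k)
    return float(np.mean(diffs))                # area under the Diff curve
```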
The inside-total relevance ratio µ [55] was introduced based on the hypothesis that pixels in the object area are more important for the model's decisions. Therefore, µ was obtained as the ratio of the inside positive relevance (R_in) in the object area to the total positive relevance (R_tot), as presented in (3). Additionally, the dependency on the object size was avoided by multiplying µ by the ratio of the total area (S_tot) to the object area (S_in), resulting in µ_w as described in (4). Because this method used only positive relevance values along with the assumption that the model's decision should rely on the information in the object area, it was also human-centric. This assumption of object-centricity was verified to some extent by investigating the output changes for the occlusion of object and context areas; however, the evaluation results also showed that some features in the image context influenced the decision. Object-centricity is what the human desires for the model, but may not be what the decision is actually based on.
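From the prose above, (3) and (4) can plausibly be reconstructed as follows (the exact notation of [55] may differ):

$$\mu = \frac{R_{in}}{R_{tot}} \quad (3), \qquad \mu_w = \mu \cdot \frac{S_{tot}}{S_{in}} \quad (4)$$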

III. PROPOSED HEATMAP ASSISTED ACCURACY SCORE
The proposed heatmap assisted accuracy score (HAAS) scheme has four advantages over previous quantitative evaluation methods. First, HAAS is much simpler than the AOPC and inside-total relevance ratio schemes since it does not require any iterations for sequential occlusion or information removal. Second, HAAS directly provides the accuracy ratio as the quantitative evaluation metric, while previous methods focus on the variation of only the target output and need additional processes such as averaging, plotting, generating bounding boxes, summing relevance, and estimating area. Because the accuracy is determined by the maximum over all the output units, HAAS takes all the output variations into account. Third, HAAS is able to evaluate both positive and negative influences of features by taking all the heatmap scores into account directly, whereas previous schemes consider only the positive influences. Lastly, HAAS is a machine-centric method because XAI algorithms are evaluated directly based on the performance of the given model, not from people's perspectives. Their explainability is investigated through the accuracy changes of the given model using images modified according to the features' importance. In addition, whereas AOPC results vary across datasets, HAAS provides a robust evaluation.

A heatmap is a visualization method showing the importance of each feature for a given decision in color or gray scale. As depicted in Fig. 2, which was extracted by LRP from a DNN for handwritten digit classification, red and blue colors represent features with positive and negative influences, respectively. Green colors indicate that those features have little impact on the output. In other words, while the data in red areas contribute to the increase of a specific output, blue regions lead to its decrease.
Our hypothesis for HAAS is that since heatmaps include information about the features' influence, the performance of the model should be improved by modifying input images according to the heatmap scores. The more accurate the explanation the heatmap provides with respect to a network model, the more the performance will be enhanced. Based on this hypothesis, the proposed HAAS scheme is illustrated in Fig. 3, where f(·) is the model, x is the input image, h is the heatmap, HA(x, h) is the HA image for x and h, and Acc(·)_N is the accuracy over N images. After the decision of a classification model is interpreted in the shape of a heatmap by a given XAI algorithm, the input images are modified by the resultant scores assigned to pixels. The modified images are referred to as HA images here. These HA images are put into the model again, and the explainability of the given XAI algorithm is evaluated by the performance improvement metric, HAAS, which is computed as a ratio of the accuracies over HA and original images as described in (5). When HAAS is larger than 1.0, the HA images improve the accuracy of the classification model, leading to the conclusion that the heatmap scores extracted by the XAI algorithm explain the features' importance well. Conversely, an HAAS of less than 1.0 means that the HA images have deteriorated the performance due to an incorrectly extracted explanation.
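Based on the definitions above, (5) is presumably of the following form (reconstructed from the prose, so the exact notation is an assumption):

$$\mathrm{HAAS} = \frac{Acc\big(f(HA(x, h))\big)_N}{Acc\big(f(x)\big)_N} \quad (5)$$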
For the generation of HA images, red areas are highlighted and blue areas are de-emphasized. Here, highlighting features means making positive values more positive and negative values more negative, while de-emphasizing features is equivalent to moving positive values toward negative and negative values toward positive. To support highlighting as well as de-emphasizing, we normalize both the input (x_Norm) and the heatmap (h_Norm) to the range of −1 to +1. Then, HA images (HA) are composed by clipping the product of x_Norm and 1 + h_Norm to within −1 and +1 as presented in (6). Finally, the proposed HAAS method measures the accuracy variation of the output decisions obtained after putting the HA images into the classification model again.
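A minimal NumPy sketch of (6) and (5) is shown below; the min-max normalization of x and the max-absolute-value normalization of h are assumptions consistent with the normalizations in (8) and (11), and `model_acc` is a hypothetical helper returning the accuracy of the classifier on a batch of images.

```python
import numpy as np

def ha_image(x, h):
    """HA image of (6): scale each normalized pixel by (1 + h_norm), so
    positive-relevance pixels are amplified and negative ones damped."""
    x_norm = 2 * (x - x.min()) / (x.max() - x.min()) - 1  # x -> [-1, +1]
    h_norm = h / np.abs(h).max()                          # h -> [-1, +1]
    return np.clip(x_norm * (1 + h_norm), -1.0, 1.0)

def haas(model_acc, images, heatmaps):
    """HAAS of (5): accuracy over HA images divided by the accuracy
    over the original images; HAAS > 1.0 means the heatmaps helped."""
    ha = np.stack([ha_image(x, h) for x, h in zip(images, heatmaps)])
    return model_acc(ha) / model_acc(np.stack(images))
```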

IV. EVALUATION RESULTS AND DISCUSSIONS

A. EVALUATION SETUP
The evaluation is conducted for two CNN models, LeNet-5 [56] and VGG-16 [57], with four datasets, MNIST [58], CIFAR-10 [59], STL-10 [60], and ILSVRC2012 [61], which are widely used in classification applications. While the LeNet-5 network for MNIST and the VGG-16 networks for CIFAR-10 and STL-10 are re-trained with training datasets normalized to the range of −1 to +1, a pre-trained VGG-16 network is employed for ILSVRC2012. As presented in Fig. 4, the last Softmax functions are omitted during the XAI evaluation because the polarity information at the output should be used to express both positive and negative influences of features. However, during the training phase, all the networks include Softmax functions for the class outputs. The XAI algorithms used for the evaluation are the saliency map, deconvolution, and LRPs with various rule configurations, as investigated in the previous AOPC paper [54], because the performance of HAAS is compared to AOPC's results. These XAIs bring out an explanation for each feature, that is, a pixel in an image, rather than an object localization. Saliency maps [38] are constructed by (7), where M_p,c is the magnitude of the sensitivity value for the input feature (x_p) at a pixel p and the target class c, and f_c(x) is the model output at the class c for the input image x. Therefore, M_p,c does not include information about the polarity of the influence on the output. The resultant heatmap scores of the saliency map (h^SM_p,c) are obtained by normalizing M_p,c with the maximum value as shown in (8).
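As a concrete illustration of (7) and (8), the following is a minimal PyTorch sketch assuming a classifier without the final Softmax; pooling the gradient magnitude over color channels with a maximum is an assumption, as the paper does not state how channels are combined.

```python
import torch

def saliency_heatmap(model, x, c):
    """Sketch of (7)-(8): gradient-magnitude sensitivity M_{p,c} for
    class c, normalized by its maximum. `x` has shape (1, C, H, W)."""
    x = x.clone().requires_grad_(True)
    model(x)[0, c].backward()                   # d f_c(x) / d x_p
    m = x.grad.abs().squeeze(0)                 # |gradient|, (C, H, W)
    m = m.max(dim=0).values                     # pool channels -> (H, W)
    return m / m.max()                          # h^SM_{p,c} in [0, 1]
```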
When the model consists of convolutional layers (Conv) as expressed in (9), the deconvolution method [40] gives rise to an explanation by deconvolutional layers (Deconv) without ReLU activation functions as described in (10), where a^(l) is the activation output of the l-th layer, z^(l+1) is the output of the (l+1)-th convolutional layer, θ^(l,l+1) is the parameter of the (l+1)-th convolutional layer, and D_p,c is the deconvolution result at a pixel p for the target class c. The resultant heatmap scores (h^D_p,c) are obtained by normalizing D_p,c with its maximum absolute value as shown in (11).
LRPs [43] are also investigated for five propagation rules: the ϵ-rule, α1β0-rule, α2β1-rule, flat-rule, and bounded-rule. First of all, z^(l+1)_ij is defined in (12) as the product of an activation output a^(l)_i and a parameter w^(l,l+1)_ij. Then, the ϵ-rule generates the relevance (R^ϵ,(l,l+1)_i←j) propagated from the j-th output to the i-th input as described in (14), with a stabilizer (ϵ) that keeps the denominator from getting too close to zero.
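For reference, the following reconstruction matches the standard ϵ-rule of [43]; (13) presumably aggregates the contributions per output, and the bias term and the sign-scaled stabilizer are assumptions drawn from that standard formulation:

$$z^{(l+1)}_{ij} = a^{(l)}_i w^{(l,l+1)}_{ij} \quad (12), \qquad z^{(l+1)}_j = \sum_i z^{(l+1)}_{ij} + b^{(l+1)}_j \quad (13)$$

$$R^{\epsilon,(l,l+1)}_{i \leftarrow j} = \frac{z^{(l+1)}_{ij}}{z^{(l+1)}_j + \epsilon \cdot \mathrm{sign}\big(z^{(l+1)}_j\big)}\, R^{(l+1)}_j \quad (14)$$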
With z^+,(l+1)_j and z^−,(l+1)_j obtained as the sums of the positive and negative parts as expressed in (19) and (20), the αβ-rules are described in (21). For α1β0 and α2β1, the numbers following α and β are their assigned values, respectively.
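The standard αβ-rule of [52] takes the following form, which (21) presumably matches; the conservation constraint α − β = 1 belongs to that standard formulation:

$$R^{\alpha\beta,(l,l+1)}_{i \leftarrow j} = \left( \alpha\, \frac{z^{+,(l+1)}_{ij}}{z^{+,(l+1)}_j} - \beta\, \frac{z^{-,(l+1)}_{ij}}{z^{-,(l+1)}_j} \right) R^{(l+1)}_j, \qquad \alpha - \beta = 1 \quad (21)$$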
The flat-rule is simply defined to uniformly distribute the relevance of a higher layer to a lower layer as presented in (22), without any consideration of activation outputs or parameters, where Σ_i 1 is the total number of inputs connected to the j-th output of the (l+1)-th layer.
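Under that reading of the normalizing constant, (22) is presumably:

$$R^{F,(l,l+1)}_{i \leftarrow j} = \frac{1}{\sum_i 1}\, R^{(l+1)}_j \quad (22)$$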
Lastly, the bounded-rule is presented with z^B,(l+1)_ij defined in (25). w^+,(l+1)_ij and w^−,(l+1)_ij are the positive and negative parameters as expressed in (23) and (24), low_i is the minimum value of the features connected to w^+,(l+1)_ij, and high_i is the maximum value of the features linked to w^−,(l+1)_ij. Therefore, it is guaranteed that z^B,(l+1)_ij is not negative. The resultant propagated relevance of the bounded-rule (R^B,(l,l+1)_i←j) with a stabilizer ϵ is shown in (26).
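Assuming (23)-(26) follow the standard z^B-rule of [52], [53], they can be reconstructed as:

$$w^{+,(l+1)}_{ij} = \max\big(0, w^{(l,l+1)}_{ij}\big) \quad (23), \qquad w^{-,(l+1)}_{ij} = \min\big(0, w^{(l,l+1)}_{ij}\big) \quad (24)$$

$$z^{B,(l+1)}_{ij} = a^{(l)}_i w^{(l,l+1)}_{ij} - low_i\, w^{+,(l+1)}_{ij} - high_i\, w^{-,(l+1)}_{ij} \quad (25)$$

$$R^{B,(l,l+1)}_{i \leftarrow j} = \frac{z^{B,(l+1)}_{ij}}{\sum_i z^{B,(l+1)}_{ij} + \epsilon}\, R^{(l+1)}_j \quad (26)$$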
Consequently, the relevance (R^(l)_i) at the i-th input is defined as the sum of all the relevance values propagated into the i-th input as in (27), and the LRP heatmap scores (h^LRP_p,c) are obtained as the normalized relevance at the input layer of l = 0 as presented in (28).
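Reconstructed from this description (normalization by the maximum absolute relevance is an assumption consistent with (8) and (11)):

$$R^{(l)}_i = \sum_j R^{(l,l+1)}_{i \leftarrow j} \quad (27), \qquad h^{LRP}_{p,c} = \frac{R^{(0)}_p}{\max_p \big| R^{(0)}_p \big|} \quad (28)$$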
While the ϵ-rule, flat-rule, and bounded-rule are considered for the input layer, the ϵ-rule, α1β0-rule, and α2β1-rule are applied for the internal convolutional layers. All fully-connected (FC) layers are addressed only by the ϵ-rule. Consequently, this evaluation comprises a total of 9 combinations of LRP rules as summarized in Table 1.

B. EVALUATION RESULTS
First, LeNet-5 is evaluated with the MNIST test dataset, where LRP1 of only ϵ-rules achieves the best HAAS of 1.0088 and LRP3 takes the second place of 1.0079. Second, VGG-16 is investigated with the CIFAR-10 test dataset, where LRP3 with a bounded-rule for the input layer and ϵ-rules for the Conv and FC layers achieves the best HAAS performance of 1.1254 and LRP1 of only ϵ-rules takes the second place of 1.1160. Heatmap and HA images over 10 original test images leading to misclassification are illustrated in Figs. 7 and 8. For convolutional layers, the αβ-rules present the object areas better in the heatmap image with respect to localization, but the ϵ-rules give rise to a larger increase in both output and accuracy. Therefore, the ϵ-rules for all convolutional layers represent the feature importance of each pixel better.
Third, VGG-16 is investigated with the STL-10 test dataset of 8,000 images as shown in Table 4, where the mean value of f_c(x) and the accuracy for the original images are 21.38 and 91.35 %, respectively. The HAAS results over STL-10 are similar to those of CIFAR-10. The maximum HAAS of 1.0918 is accomplished at LRP3 while the maximum AOPC of 346.21 is achieved at LRP5. LRP1 of only ϵ-rules again takes the second place of 1.0906. Heatmap and HA images over 10 original test images leading to misclassification are illustrated in Figs. 9 and 10. As in the HAAS evaluation results on CIFAR-10, the object areas are better marked in the heatmap images with αβ-rules in the convolutional layers; however, the accuracy and output are more enhanced with ϵ-rules.
Lastly, VGG-16 is investigated with the ILSVRC2012 test dataset of 50,000 images as shown in Table 5, where the maximum HAAS of 1.3469 is achieved at LRP3 and LRP1 takes the second place of 1.3207. While the best XAI algorithm selected by AOPC is strongly dependent on the dataset, the proposed HAAS consistently recommends LRP1 and LRP3 regardless of the dataset. We believe that if the evaluation method is machine-centric, the XAI selection should be independent of datasets. For a quantitative comparison of the consistencies of AOPC and HAAS, we employ the Kullback-Leibler divergence (D_KL) [62] as described in (29), where p and q are two probability distributions and i is the index of possible outcomes. The more similar two distributions are, the closer D_KL is to 0, and D_KL is never negative. If the evaluation method is consistent over datasets, the distributions of results over the XAI algorithms must be similar, leading to a smaller D_KL between different datasets. The evaluation results (E_k(i)) are normalized into N_k(i) by (30), where i = 1, 2, ..., 11 is the index of the eleven XAI algorithms from the saliency map to LRP9 and k = 1, 2, 3, 4 is the index of the four datasets from MNIST to ILSVRC2012. Because a probability of 0 must be avoided in the D_KL computation, a small stabilizer (γ) of 0.001 is added to the normalized values. After these values are converted into probabilities (P_k(i)) by dividing by their total sum as in (31), divergence values are computed for all six pairs of datasets. Whereas AOPC results in D_KL values of 0.2761, 0.5337, 0.0329, 0.3685, 0.4289, and 0.1888 with an average of 0.3048, HAAS obtains divergence values of 0.0119, 0.0335, 0.0654, 0.0071, 0.0260, and 0.0067, whose average of 0.0251 is smaller than that of AOPC by more than a factor of 10. This ensures that the distribution of the HAAS evaluation is less dependent on datasets than that of AOPC. Therefore, we conclude that HAAS establishes a more robust evaluation of XAI algorithms. Our evaluation results lead to the conclusion that the ϵ-rule and bounded-rule for the input layer and the ϵ-rule for the convolutional and FC layers are the most suitable to achieve a machine-centric, robust evaluation of XAIs over DNNs.
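A small NumPy sketch of the consistency computation of (29)-(31) follows; the min-max form of the normalization in (30) is an assumption, since the paper only states that the results are normalized before the stabilizer γ is added.

```python
import numpy as np

def kl_divergence(e_a, e_b, gamma=0.001):
    """D_KL of (29) between two datasets' evaluation results, where
    e_a and e_b each hold one score per XAI algorithm (11 values)."""
    def to_prob(e):
        e = np.asarray(e, dtype=float)
        n = (e - e.min()) / (e.max() - e.min()) + gamma  # (30) + stabilizer
        return n / n.sum()                               # (31)
    p, q = to_prob(e_a), to_prob(e_b)
    return float(np.sum(p * np.log(p / q)))              # (29)
```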

C. INVERTED HA TEST
To further verify our hypothesis of HAAS, namely that emphasizing the pixels of an image according to heatmap scores enhances the accuracy of a given classification network, inverted HA images (invHA) are generated with inverted heatmap scores as described in (32) and Fig. 13 and put into the network to estimate the change of the accuracy, compared to the HA images generated with the original heatmap scores. If the proposed HAAS scheme reflects the feature importance properly, the inverted HA images must lead to larger accuracy degradation for the XAI algorithms with higher HAAS results.
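A minimal sketch of the inverted HA image of (32), assuming that inversion means negating the normalized heatmap so that each pixel's scaling factor becomes 1 − h_Norm instead of 1 + h_Norm:

```python
import numpy as np

def inverted_ha_image(x, h):
    """Inverted HA image: de-emphasize positive-relevance pixels and
    amplify negative ones by flipping the sign of the heatmap scores."""
    x_norm = 2 * (x - x.min()) / (x.max() - x.min()) - 1  # x -> [-1, +1]
    h_norm = h / np.abs(h).max()                          # h -> [-1, +1]
    return np.clip(x_norm * (1 - h_norm), -1.0, 1.0)      # cf. (6) and (32)
```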
Like the HAAS evaluation, the inverted HA test is also conducted with the two DNNs of LeNet-5 and VGG-16 and the four test datasets of MNIST, CIFAR-10, STL-10, and ILSVRC2012. The output value at the label class (f_c(invHA(x, h))), the accuracy at the inverted HA images, and the accuracy difference (∆Accuracy) between the inverted HA and original HA images are measured, as summarized in Tables 6, 7, 8, and 9. The XAI algorithms with the largest HAAS results experience the biggest accuracy degradation in this test, which further verifies that HAAS is a machine-centric evaluation way that can explain how well the given XAI algorithm actually addresses each feature's importance.

V. CONCLUSION
Whereas DNNs have accomplished dramatic performance improvements in many areas, understanding the inside of the networks has become more difficult. Consequently, various XAI algorithms have been proposed to interpret the logic behind the decisions of complex neural networks, and at the same time, it has been necessary to study methods to evaluate their explainability. This paper proposes a machine-centric HAAS scheme to evaluate the explainability of XAIs, while most previous evaluation methods focus on how close their results are to what people rely on. Furthermore, unlike AOPC, which needs many iterations and focuses only on the output of a target class, HAAS provides the quantitative scores directly by putting HA images generated from the original images and heatmap scores into the given model without any iterations. Because the accuracy changes are estimated, all the outputs are taken into account at the same time. In addition, HAAS uses both positive and negative influences of features. Especially, over four datasets for classification networks, HAAS achieves a lower D_KL of 0.0251 than AOPC's 0.3048 on average. The low D_KL means that HAAS provides a more robust evaluation than AOPC independently of datasets. The evaluation through the accuracy changes of the network and the high consistency of the best XAI selection lead to the conclusion that HAAS is a machine-centric evaluation. Although the proposed HAAS has been investigated only for the classification problem, the proposed idea is expandable to other decision problems.