Introduction
Deep neural networks (DNNs) have received tremendous attention thanks to large-scale training datasets [1]–[3], drastic advances in computing hardware [4]–[9], and the advent of cloud computing [10]–[13], which have enabled the implementation of large numbers of layers and neurons and thereby much higher performance. DNNs have been developed in various network structures such as the multi-layer perceptron (MLP) [14], convolutional neural network (CNN) [15], recurrent neural network (RNN) [16], [17], generative adversarial network (GAN) [18], [19], and graph neural network (GNN) [20]. They have also been employed in a variety of applications such as classification [21], [22], object detection [23], segmentation [24], super resolution [25], machine translation [26], natural language processing [27], image/caption/speech generation [28]–[30], algorithmic trading [31], style transfer [32], circuit design [33], recommendation systems [34], and so on. Until recently, much research has focused on improving the performance of the networks at the cost of increased computational complexity.
In contrast to the substantial achievements with respect to performance, the increased complexity has made it difficult to understand the logic behind the decisions of the networks. Therefore, DNNs are often referred to as black boxes. Since DNNs cannot guarantee perfect performance, such as an accuracy of 100%, their decisions need to be explained in applications such as transportation, health care, legal, finance, and military areas [35]. Whereas the transportation, health care, and military fields involve life-critical decisions, the legal and finance areas are regulated by law to provide reasons for decisions.
There have been many studies to provide explainability for DNNs, collectively addressed as explainable artificial intelligence (XAI). The importance of a feature was estimated by calculating the expectation of the output over the remaining features or by fixing them at randomly selected values [36], [37]. The sensitivity of the output, computed as its derivative with respect to a specific input feature, was considered as the feature's importance, based on the assumption that more important features would cause larger variations of the outputs even for small changes [38], [39]. The output of a specific target class was simply back-propagated toward the input layer through a given network by means of deconvolution, resulting in a heatmap image that represented feature importance scores [40]. On the one hand, because simpler models are easier to interpret, the distillation scheme was proposed, where the knowledge of a high-capacity model was transferred to a small model by training the small one on the soft targets generated by the complicated model [41], [42]. Layer-wise relevance propagation (LRP) [43] used the back-propagation of relevance scores from the output layer to the input layer based on the relevance conservation rule; the propagated relevance values in the input layer were regarded as the importance of the corresponding features. There were also techniques to explain models via prototypes and criticisms [44], which can be seen as good and bad examples, respectively. On the other hand, after subsets of features were randomly selected by masking data and processed by the deep network model, a simple linear model was built between the masking data and the corresponding outputs [45], and the weights of the linear model were interpreted as the features' importance. The class activation map (CAM) method [46]–[48] was introduced to focus on the highest-level feature map, i.e., the last convolutional layer. In addition, the counterfactual approach was proposed to show how the model would have to be different for a desirable output to occur [49].
Because no XAI method can provide perfect explanations of the decisions of DNNs, their explainability needs to be evaluated to figure out which one is the best explanation approach for a given application. To the best of our knowledge, most previous XAI evaluation approaches, which this paper classifies as human-centric methods, have focused on how similar their explanations look to what people's decisions rely on [50]. In contrast, this paper proposes a machine-centric evaluation method that takes into account how well the feature importance extracted by an XAI method reflects the actual contribution of the features to the decision of the network. Consequently, while the results of previous methods depend on the dataset, the proposed evaluation scheme selects the best XAI method with high consistency across datasets. Although this paper uses network models for the image classification problem, the proposed idea is readily extensible to other decision problems.
This paper is organized as follows. Section II gives an overview of the previous evaluation methods along with their pros and cons, Section III describes the proposed HAAS evaluation scheme, Section IV presents the evaluation results and discussions, and Section V concludes this paper.
Previous Evaluation Methods
A human-friendly explanation is one of the key properties that XAI methods must have [35]; that is, explainability methods should provide representations that people can understand. However, too much emphasis on human-friendliness in previous methods resulted in human-centric explanations, biased toward how human beings would make a decision for a given problem rather than what the current model actually based its conclusion on. Therefore, in most previous evaluation methods, better explainability was equivalent to what people judged to look better, that is, a subjective criterion.
SmoothGrad [39] was introduced to provide cleaner explanations by removing the many scattered points in the saliency map [38]. These scattered points could be eliminated by adding noise to the input features, without considering whether they were important for the decision or not. The resulting human-centric sensitivity map showed only the group of points gathered on the target object, which is what people's detection would normally be based on.
Guided back-propagation [51] represented the feature importance by propagating backwards only through neurons with positive gradients in order to visualize the parts that most strongly activate the decision. Because the representations of positive gradients at the first layer were emphasized, the edges of objects were highlighted. Guided back-propagation thus gave rise to edge-based explanations showing the shapes of objects, which are believed to be the basis for people's judgment. Therefore, this method can also be classified as human-centric.
Grad-CAM [47] and Grad-CAM++ [48] were proposed to localize the area of a target object in a given image rather than to figure out per-feature importance. Grad-CAM++ obtained the weighting parameters of the feature maps from a weighted average and localized the entire object area, unlike Grad-CAM, which could cover only parts of it. Notably, its performance was evaluated objectively in terms of average drop %, % increase in confidence, and win %, compared to Grad-CAM. Average drop % was the drop in confidence caused by occluding parts of the most important regions, % increase in confidence was the increase caused by occluding unimportant regions, and win % was the number of cases among the given images in which the fall in confidence for a map generated by Grad-CAM++ was lower than that for Grad-CAM. The performance of Grad-CAM++ was higher than that of Grad-CAM; however, this was expected because Grad-CAM++ highlighted larger regions containing those of Grad-CAM.
LRP [43] produced a fine-grained pixel-wise explanation by propagating relevance scores from the output of a target class to the inputs in a layer-by-layer fashion, where the relevance values were computed from the weights as well as the activation outputs. Whereas other XAI schemes could focus only on features of positive importance, LRP was able to explain positive as well as negative influences. Besides a basic propagation rule, several variants such as the ε-rule and the αβ-rule have been defined, as described in Section IV.
On the other hand, there have been quantitative evaluation approaches based on the heatmap visualization for image classification problems. The representative one was the area over the MoRF perturbation curve (AOPC) [54], where MoRF stood for most relevant first. The image was divided into predefined grid regions, which were perturbed one after another in order of decreasing relevance according to the recursion in (1), where $x_{\mathrm{MoRF}}^{(0)}$ is the original image, $\boldsymbol{r}_{k}$ is the $k$-th most relevant region, $g$ is the perturbation function that replaces that region, and $L$ is the number of perturbation steps.
\begin{equation*} x_{\mathrm{MoRF}}^{(k)} = g\left(x_{\mathrm{MoRF}}^{(k-1)}, \boldsymbol{r}_{k}\right),\quad 1 \leq k \leq L \tag{1}\end{equation*}
AOPC evaluation based on the iterative MoRF process.
In each recursive step, the difference between the model output for the original image, $f(x_{\mathrm{MoRF}}^{(0)})$, and that for the perturbed image, $f(x_{\mathrm{MoRF}}^{(k)})$, is computed, and AOPC is obtained by averaging the accumulated differences over the perturbation steps and over the test images, as defined in (2), where $\langle\cdot\rangle$ denotes the average over the dataset.
\begin{equation*} \text{AOPC} = \left\langle \frac{1}{L} \sum_{k=1}^{L} \left( f(x_{\mathrm{MoRF}}^{(0)}) - f(x_{\mathrm{MoRF}}^{(k)}) \right) \right\rangle \tag{2}\end{equation*}
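To make the procedure concrete, the following Python sketch computes AOPC for a single image. It is a minimal illustration under stated assumptions: `model(image, target_class)` is assumed to return the pre-softmax score of the target class, `heatmap` is a per-pixel relevance map, and the perturbation replaces each grid region with uniform noise; the grid size, step count, and perturbation type are illustrative rather than those of [54].

```python
import numpy as np

def aopc_single_image(model, image, heatmap, target_class, grid=8, steps=100, rng=None):
    """Area over the MoRF perturbation curve for one image (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    h, w = heatmap.shape
    # Aggregate pixel relevance into non-overlapping grid regions.
    regions = []
    for y in range(0, h, grid):
        for x in range(0, w, grid):
            regions.append((heatmap[y:y + grid, x:x + grid].sum(), y, x))
    # Most relevant first (MoRF) ordering.
    regions.sort(key=lambda t: t[0], reverse=True)

    x_morf = image.copy()
    f0 = model(image, target_class)          # f(x_MoRF^(0))
    drops = []
    for _, y, x in regions[:steps]:           # L perturbation steps
        # g(x_MoRF^(k-1), r_k): replace the k-th most relevant region with noise.
        patch = x_morf[..., y:y + grid, x:x + grid]
        x_morf[..., y:y + grid, x:x + grid] = rng.uniform(-1, 1, patch.shape)
        drops.append(f0 - model(x_morf, target_class))
    return float(np.mean(drops))              # averaged over k; average over images externally
```

The final AOPC value of (2) would then be the mean of this quantity over the whole test set.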
Another quantitative metric was the inside-total relevance ratio $\mu$ in (3), which measures how much of the total relevance $R_{tot}$ falls inside the bounding box of the target object, $R_{in}$. Its size-weighted version $\mu_{w}$ in (4) compensates for the size of the bounding box, where $S_{in}$ and $S_{tot}$ denote the areas of the bounding box and of the whole image, respectively.
\begin{align*} \mu &= \frac{R_{in}}{R_{tot}} \tag{3}\\ \mu_{w} &= \mu \cdot \frac{S_{tot}}{S_{in}} \tag{4}\end{align*}
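A minimal sketch of (3) and (4), assuming `heatmap` holds per-pixel relevance and `box_mask` is a boolean mask of the object bounding box; accumulating only positive relevance is one possible convention, not necessarily the one used in the cited work.

```python
import numpy as np

def inside_total_ratio(heatmap, box_mask):
    """Inside-total relevance ratio (3) and its size-weighted version (4)."""
    pos = np.clip(heatmap, 0, None)        # keep positive relevance only (assumed convention)
    r_in, r_tot = pos[box_mask].sum(), pos.sum()
    s_in, s_tot = box_mask.sum(), box_mask.size
    mu = r_in / r_tot
    mu_w = mu * (s_tot / s_in)
    return mu, mu_w
```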
Proposed Heatmap Assisted Accuracy Score
The proposed heatmap assisted accuracy score (HAAS) scheme has four advantages over previous quantitative evaluation methods. First, HAAS is much simpler than the AOPC and inside-total relevance ratio schemes since it does not require any iterations for sequential occlusion or information removal. Second, HAAS directly provides an accuracy ratio as the quantitative evaluation metric, while previous methods focus on the variation of only the target output and need additional processes such as averaging, plotting, generating bounding boxes, summing relevance, and estimating areas. Because the accuracy is determined by the maximum over all the output units, HAAS takes all the output variations into account. Third, HAAS is able to evaluate both positive and negative influences of features by taking all the heatmap scores into account directly, whereas previous schemes consider only the positive influences. Lastly, HAAS is a machine-centric method because XAI algorithms are evaluated directly based on the performance of the given model, not from people's perspectives: their explainability is assessed by the accuracy changes of the given model on images modified according to the features' importance. In addition, whereas AOPC results vary across datasets, HAAS provides robust evaluations.
A heatmap is a visualization method that shows the importance of each feature for a given decision in color or gray scale. As depicted in Fig. 2, which shows a heatmap extracted by LRP from a DNN for handwritten digit classification, red and blue colors represent features of positive and negative influence, respectively, while green colors indicate that those features have little impact on the output. In other words, while the data in red areas contribute to the increase of a specific output, blue regions lead to its decrease.
Example of heatmap visualization for the digit 3 in the handwritten digit classification. The heatmap image was obtained by LRP, where red and blue colors mean positive and negative influences, respectively.
Our hypothesis for HAAS is that, since heatmaps carry information about the features' influence, the performance of the model should be improved by modifying input images according to their heatmap scores. The more accurate the explanation a heatmap provides with respect to a network model, the more the performance will be enhanced. Based on this hypothesis, the proposed HAAS scheme is illustrated in Fig. 3 and defined in (5), where $x$ is an input image, $h$ is its heatmap, $HA(x,h)$ is the heatmap-assisted (HA) image generated from them, and $Acc(\cdot)_{N}$ denotes the classification accuracy of the model $f$ over $N$ test images.
\begin{equation*} HAAS = \frac{Acc(f(HA(x,h)))_{N}}{Acc(f(x))_{N}} \tag{5}\end{equation*}
For the generation of HA images, red areas are highlighted and blue areas are de-emphasized. Here, while highlighting features means making positive values more positive and negative values more negative, de-emphasizing features is equivalent to moving positive values toward negative and negative values toward positive. To support highlighting as well as de-emphasizing, we normalize both an input image, $x_{Norm}$, and its heatmap, $h_{Norm}$, to the range of $-1$ to $+1$, and the HA image is generated by (6), where the result is clipped to the same range.
\begin{equation*} HA = \max\{-1,\min\{1, x_{Norm} \cdot (1+h_{Norm})\}\} \tag{6}\end{equation*}
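The HA generation in (6) and the HAAS ratio in (5) reduce to a few lines of code. The sketch below assumes images and heatmaps already normalized to [−1, +1] and a hypothetical `predict` callable that returns the predicted class label of the evaluated model.

```python
import numpy as np

def ha_image(x_norm, h_norm):
    """Heatmap-assisted image of (6): emphasize positive-, suppress negative-relevance pixels."""
    return np.clip(x_norm * (1.0 + h_norm), -1.0, 1.0)

def haas(predict, images, heatmaps, labels):
    """HAAS of (5): accuracy over HA images divided by accuracy over the original images."""
    acc_orig = np.mean([predict(x) == y for x, y in zip(images, labels)])
    acc_ha = np.mean([predict(ha_image(x, h)) == y
                      for x, h, y in zip(images, heatmaps, labels)])
    return acc_ha / acc_orig
```

A HAAS value above 1 therefore indicates that the heatmaps improved the model's accuracy, which is the behavior the hypothesis predicts for a faithful explanation.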
Evaluation Results
A. Evaluation Setup
The evaluation is conducted for two CNN models, LeNet-5 [56] and VGG-16 [57], with four datasets, MNIST [58], CIFAR-10 [59], STL-10 [60], and ILSVRC2012 [61], which are widely used for classification. While the LeNet-5 network for MNIST and the VGG-16 networks for CIFAR-10 and STL-10 are re-trained with training data normalized to the range of −1 to +1, a pre-trained VGG-16 network is employed for ILSVRC2012. As presented in Fig. 4, the last Softmax function is omitted during the XAI evaluation because the polarity information at the output is needed to express both positive and negative influences of features. However, during the training phase, all the networks include Softmax functions for the class outputs.
Classification models: (a) MNIST + LeNet-5, (b) CIFAR-10 + VGG-16, (c) STL-10 + VGG-16, (d) ILSVRC2012 + VGG-16.
The XAI algorithms used for the evaluation are the saliency map, deconvolution, and LRPs with various rule configurations, as investigated in the previous AOPC paper [54], because the performance of HAAS is compared to AOPC's results. These XAIs provide an explanation for each feature, i.e., each pixel of an image, rather than an object localization. Saliency maps [38] are constructed by (7), where $M_{p,c}$ is the absolute value of the derivative of the class-$c$ output $f_{c}(x)$ with respect to the pixel $x_{p}$, and the heatmap score $h^{SM}_{p,c}$ in (8) is obtained by normalizing $M_{p,c}$ by its maximum over all pixels.
\begin{align*} M_{p,c} &= \left| \frac{\partial f_{c}(x)}{\partial x_{p}} \right| \tag{7}\\ h^{SM}_{p,c} &= \frac{M_{p,c}}{\max\{M_{c}\}} \tag{8}\end{align*}
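Equations (7) and (8) correspond to a single backward pass through the network. Below is a hedged PyTorch sketch, assuming `model` returns pre-softmax class scores of shape (1, num_classes); aggregating the per-channel gradients by their maximum is an assumption made for illustration.

```python
import torch

def saliency_heatmap(model, x, target_class):
    """Saliency map of (7)-(8): |d f_c / d x_p|, normalized by its maximum."""
    x = x.clone().requires_grad_(True)        # x: (1, C, H, W)
    score = model(x)[0, target_class]         # f_c(x), pre-softmax
    score.backward()
    m = x.grad.abs().amax(dim=1)[0]           # max over color channels -> per-pixel sensitivity
    return m / m.max()                        # h^SM in [0, 1]
```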
When the model consists of convolutional layers, the forward computation of layer $l$ is given by (9), where $a^{(l)}$ is the activation output and $\theta^{(l,l+1)}$ is the set of filter parameters between layers $l$ and $l+1$. The deconvolution method [40] propagates the response $R^{(l+1)}_{c}$ of the target class $c$ backwards through the transposed filters as in (10), and the heatmap score $h^{DC}_{p,c}$ in (11) is obtained by normalizing the result at the input layer by its maximum magnitude.
\begin{align*} z^{(l+1)} &= Conv(a^{(l)},\theta^{(l,l+1)}) \tag{9}\\ D^{(l)}_{c} &= Deconv(R^{(l+1)}_{c},\theta^{(l,l+1)}) \tag{10}\\ h^{DC}_{p,c} &= \frac{D^{(0)}_{p,c}}{\max\{|D^{(0)}_{c}|\}} \tag{11}\end{align*}
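The sketch below illustrates one backward step of (10) using a transposed convolution with the forward filters; the handling of pooling and nonlinear layers required by a full deconvolution pipeline [40] is omitted here, and the stride/padding values are illustrative.

```python
import torch.nn.functional as F

def deconv_layer(r_upper, weight, stride=1, padding=1):
    """Propagate a response map backwards through one conv layer, as in (10).

    r_upper: (1, C_out, H, W) response from the upper layer.
    weight : (C_out, C_in, kH, kW) filters used in the forward pass (9).
    """
    # A transposed convolution with the same filters maps the response back to the lower layer.
    return F.conv_transpose2d(r_upper, weight, stride=stride, padding=padding)

def normalize_heatmap(d0):
    """Heatmap score of (11): signed values scaled by the maximum magnitude."""
    return d0 / d0.abs().max()
```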
LRPs [43] are also investigated with 5 propagation rules in various configurations (LRP1 to LRP9). Letting $a^{(l)}_{i}$ be the activation output of neuron $i$ in layer $l$, $w^{(l+1)}_{ij}$ the weight between neurons $i$ and $j$, and $b^{(l+1)}_{j}$ the bias, the individual contribution $z^{(l+1)}_{ij}$ and the total pre-activation $z^{(l+1)}_{j}$ are defined in (12) and (13).
\begin{align*} z^{(l+1)}_{ij} &= a^{(l)}_{i} \cdot w^{(l+1)}_{ij} \tag{12}\\ z^{(l+1)}_{j} &= \sum_{i} z^{(l+1)}_{ij} + b^{(l+1)}_{j} \tag{13}\end{align*}
Then, the ε-rule redistributes the relevance $R_{j}^{(l+1)}$ of neuron $j$ in layer $l+1$ to neuron $i$ in layer $l$ in proportion to its contribution $z^{(l+1)}_{ij}$, as given in (14), where the small stabilizer $\varepsilon$ prevents division by a vanishing denominator.
\begin{equation*} R_{i\leftarrow j}^{\epsilon,(l,l+1)} = \frac{z^{(l+1)}_{ij}}{z^{(l+1)}_{j} + \varepsilon \cdot sign(z^{(l+1)}_{j})}\cdot R_{j}^{(l+1)} \tag{14}\end{equation*}
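For a fully connected layer, the ε-rule of (12)–(14) together with the aggregation of (27) can be sketched in a few lines of numpy; convolutional layers follow the same pattern once the weights are unfolded. The function interface is an illustrative assumption.

```python
import numpy as np

def lrp_epsilon(a_l, w, b, r_upper, eps=1e-6):
    """epsilon-rule relevance redistribution for one dense layer.

    a_l    : (I,)  activations of layer l
    w      : (I, J) weights, b: (J,) biases
    r_upper: (J,)  relevance of layer l+1
    returns: (I,)  relevance of layer l, i.e. (27) applied to (14)
    """
    z_ij = a_l[:, None] * w                   # (12): individual contributions
    z_j = z_ij.sum(axis=0) + b                # (13): total pre-activations
    denom = z_j + eps * np.sign(z_j)          # stabilized denominator of (14)
    return (z_ij / denom) @ r_upper           # sum over j as in (27)
```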
For the αβ-rule, the positive and negative parts of the contributions and the biases are first separated as in (15)–(18).
\begin{align*} z_{ij}^{+,(l+1)} &= \begin{cases} z_{ij}^{(l+1)}, & \text{if}~z_{ij}^{(l+1)} > 0\\ 0, & \text{otherwise.} \end{cases} \tag{15}\\ z_{ij}^{-,(l+1)} &= \begin{cases} z_{ij}^{(l+1)}, & \text{if}~z_{ij}^{(l+1)} < 0\\ 0, & \text{otherwise.} \end{cases} \tag{16}\\ b_{j}^{+,(l+1)} &= \begin{cases} b_{j}^{(l+1)}, & \text{if}~b_{j}^{(l+1)} > 0\\ 0, & \text{otherwise.} \end{cases} \tag{17}\\ b_{j}^{-,(l+1)} &= \begin{cases} b_{j}^{(l+1)}, & \text{if}~b_{j}^{(l+1)} < 0\\ 0, & \text{otherwise.} \end{cases} \tag{18}\end{align*}
With these separated terms, the positive and negative pre-activations are accumulated in (19) and (20), and the relevance is redistributed by the αβ-rule in (21), where $\alpha$ and $\beta$ weight the positive and negative contributions, respectively.
\begin{align*} z_{j}^{+,(l+1)} &= \sum_{i} z_{ij}^{+,(l+1)}+b_{j}^{+,(l+1)} \tag{19}\\ z_{j}^{-,(l+1)} &= \sum_{i} z_{ij}^{-,(l+1)}+b_{j}^{-,(l+1)} \tag{20}\\ R_{i\leftarrow j}^{\alpha\beta,(l,l+1)} &= \left(\alpha \cdot \frac{z_{ij}^{+,(l+1)}}{z_{j}^{+,(l+1)}}+\beta \cdot \frac{z_{ij}^{-,(l+1)}}{z_{j}^{-,(l+1)}} \right)\cdot R_{j}^{(l+1)} \tag{21}\end{align*}
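A dense-layer sketch of (15)–(21) under the same assumptions as the ε-rule sketch above; the default α and β values are illustrative, and the small stabilizers added to the denominators are an implementation detail not shown in the equations.

```python
import numpy as np

def lrp_alpha_beta(a_l, w, b, r_upper, alpha=2.0, beta=-1.0, eps=1e-12):
    """alpha-beta rule for one dense layer, following (15)-(21)."""
    z_ij = a_l[:, None] * w
    z_pos, z_neg = np.clip(z_ij, 0, None), np.clip(z_ij, None, 0)        # (15), (16)
    b_pos, b_neg = np.clip(b, 0, None), np.clip(b, None, 0)              # (17), (18)
    zp_j = z_pos.sum(axis=0) + b_pos                                     # (19)
    zn_j = z_neg.sum(axis=0) + b_neg                                     # (20)
    frac = alpha * z_pos / (zp_j + eps) + beta * z_neg / (zn_j - eps)    # (21)
    return frac @ r_upper                                                # aggregate as in (27)
```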
The flat-rule is simply defined to distribute the relevance of a higher layer uniformly to the lower layer, as presented in (22), without any consideration of the activation outputs and parameters.
\begin{equation*} R_{i\leftarrow j}^{\flat,(l,l+1)} = \frac{1}{\sum_{i} 1}\cdot R_{j}^{(l+1)} \tag{22}\end{equation*}
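For a fully connected layer, (22) followed by the summation of (27) reduces to splitting the total upper-layer relevance equally among the inputs, as in this small sketch:

```python
import numpy as np

def lrp_flat(num_inputs, r_upper):
    """Flat-rule (22) plus aggregation (27) for a dense layer: uniform redistribution."""
    return np.full(num_inputs, r_upper.sum() / num_inputs)
```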
Lastly, the bounded-rule is defined with the lower and upper bounds, $low_{i}$ and $high_{i}$, of the input values, and is therefore typically applied to the input layer whose values lie in a known range. The positive and negative parts of the weights are separated as in (23) and (24), the bounded contribution is computed as in (25), and the relevance is redistributed as in (26).
\begin{align*} w_{ij}^{+,(l+1)} &= \begin{cases} w_{ij}^{(l+1)}, & \text{if}~w_{ij}^{(l+1)} > 0\\ 0, & \text{otherwise.} \end{cases} \tag{23}\\ w_{ij}^{-,(l+1)} &= \begin{cases} w_{ij}^{(l+1)}, & \text{if}~w_{ij}^{(l+1)} < 0\\ 0, & \text{otherwise.} \end{cases} \tag{24}\\ z_{ij}^{B,(l+1)} &= z_{ij}^{(l+1)} - low_{i}\, w^{+,(l+1)}_{ij} - high_{i}\, w^{-,(l+1)}_{ij} \tag{25}\\ R_{i\leftarrow j}^{B,(l,l+1)} &= \frac{z_{ij}^{B,(l+1)}}{\sum\limits_{i} z_{ij}^{B,(l+1)}+\epsilon}\cdot R_{j}^{(l+1)} \tag{26}\end{align*}
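A dense-layer sketch of (23)–(26) for the input layer; `low` and `high` are assumed to be per-input bound arrays (e.g., −1 and +1 after the normalization used in this paper).

```python
import numpy as np

def lrp_bounded(x, w, r_upper, low, high, eps=1e-12):
    """Bounded-rule for the input layer, following (23)-(26).

    x: (I,) input values, w: (I, J) weights, r_upper: (J,) relevance of layer 1,
    low, high: (I,) lower/upper bounds of the inputs.
    """
    w_pos, w_neg = np.clip(w, 0, None), np.clip(w, None, 0)                 # (23), (24)
    z_b = x[:, None] * w - low[:, None] * w_pos - high[:, None] * w_neg     # (25)
    return (z_b / (z_b.sum(axis=0) + eps)) @ r_upper                        # (26), summed over j
```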
Consequently, the relevance $R_{i}^{(l)}$ of neuron $i$ in layer $l$ is obtained by summing the redistributed relevances from all neurons of the upper layer as in (27), and the LRP heatmap score $h^{LRP}_{p,c}$ of pixel $p$ for class $c$ in (28) is the relevance at the input layer normalized by its maximum magnitude.
\begin{align*} R_{i}^{(l)} &= \sum_{j}R_{i\leftarrow j}^{(l,l+1)} \tag{27}\\ h^{LRP}_{p,c} &= \frac{R^{(0)}_{p,c}}{\max\{|R^{(0)}_{c}|\}} \tag{28}\end{align*}
The eleven XAI algorithms evaluated in the following subsection consist of the saliency map, the deconvolution, and nine LRP configurations (LRP1 to LRP9) constructed from the propagation rules above.
B. HAAS Evaluation
All the networks are investigated by both the AOPC and HAAS methods for these eleven XAI algorithms. AOPC is conducted with 100 perturbation iterations per image over predefined grid regions.
Heatmap and HA images for MNIST images of digits 0 to 4 for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for MNIST images of digits 5 to 9 for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Second, VGG-16 is investigated with the CIFAR-10 test dataset of 10,000 images, and the results are shown in Table 3.
Heatmap and HA images for CIFAR-10 images of plane, car, bird, cat, and deer for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for CIFAR-10 images of dog, frog, horse, ship, and truck for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Third, VGG-16 is investigated with the STL-10 test dataset of 8,000 images, and the results are shown in Table 4.
Heatmap and HA images for STL-10 images of airplane, bird, car, cat, and deer for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for STL-10 images of dog, horse, monkey, ship, and truck for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Lastly, VGG-16 is investigated with the ILSVRC2012 test dataset of 50,000 images, and the results are shown in Table 5.
Heatmap and HA images for ILSVRC2012 images of kite, crane, boxer, beaver, and baseball for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
Heatmap and HA images for ILSVRC2012 images of cradle, mask, scoreboard, teapot, and cup for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9. Red and blue colors represent wrong and correct classifications, respectively.
While the best XAI algorithm selected by AOPC is strongly dependent on the dataset, the proposed HAAS consistently recommends LRP1 and LRP3 regardless of the dataset. We believe that if an evaluation method is machine-centric, its XAI selection should be independent of the dataset. For a quantitative comparison of the consistencies of AOPC and HAAS, we employ the Kullback–Leibler divergence $D_{KL}$ defined in (29). For each dataset $k$, the evaluation scores $E_{k}(i)$ of the XAI algorithms $i$ are min–max normalized as in (30), where the small offset $\gamma$ avoids zero probabilities, and converted into the probability distribution $P_{k}(i)$ in (31). The divergence between the distributions obtained from different datasets then quantifies the consistency of each evaluation method, where a lower divergence corresponds to a more consistent XAI ranking across datasets.
\begin{align*} D_{KL}(p\|q) &= \sum_{i}p(i)\log\frac{p(i)}{q(i)} \tag{29}\\ N_{k}(i) &= \frac{E_{k}(i)-\min(E_{k})}{\max(E_{k})-\min(E_{k})}+\gamma \tag{30}\\ P_{k}(i) &= \frac{N_{k}(i)}{\sum_{i}N_{k}(i)} \tag{31}\end{align*}
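The consistency comparison of (29)–(31) can be reproduced with a short Python sketch. Here `E` is assumed to be a (datasets × algorithms) array of evaluation scores, and the average pairwise divergence between datasets is reported; the exact pairing and the value of γ used in the paper are not restated here, so these are assumptions.

```python
import numpy as np
from itertools import combinations

def consistency_kl(E, gamma=0.01):
    """Average pairwise KL divergence between per-dataset score distributions, per (29)-(31).

    E: (K, I) array of K datasets x I XAI algorithms; lower result = more consistent ranking.
    """
    # (30): min-max normalize each dataset's scores, with offset gamma to avoid zeros.
    lo, hi = E.min(axis=1, keepdims=True), E.max(axis=1, keepdims=True)
    N = (E - lo) / (hi - lo) + gamma
    P = N / N.sum(axis=1, keepdims=True)            # (31): convert to distributions
    kl = lambda p, q: np.sum(p * np.log(p / q))     # (29)
    return float(np.mean([kl(P[a], P[b]) for a, b in combinations(range(E.shape[0]), 2)]))
```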
C. Inverted HA Test
To further verify our hypothesis for HAAS that emphasizing the pixels of an image according to its heatmap scores enhances the accuracy of a given classification network, the inverted HA images ($invHA$) are generated by (32), where the sign of the heatmap term in (6) is flipped so that positively contributing features are suppressed and negatively contributing features are emphasized. If the hypothesis holds, the accuracy over the inverted HA images should decrease.
\begin{equation*} invHA = \max\{-1,\min\{1, x_{Norm} \cdot (1-h_{Norm})\}\} \tag{32}\end{equation*}
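As a sketch, (32) only flips the sign of the heatmap term relative to the HA generation shown earlier, again assuming inputs and heatmaps normalized to [−1, +1]:

```python
import numpy as np

def inv_ha_image(x_norm, h_norm):
    """Inverted HA image of (32): suppress positive-, emphasize negative-relevance pixels."""
    return np.clip(x_norm * (1.0 - h_norm), -1.0, 1.0)
```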
Heatmap and HA images for ILSVRC2012 for the 11 XAI algorithms: saliency map (S. Map), deconvolution (Deconv), and LRP1 to LRP9.
Like the HAAS evaluation, the inverted HA test is conducted with the two DNNs, LeNet-5 and VGG-16, and the four test datasets, MNIST, CIFAR-10, STL-10, and ILSVRC2012. The output value at the label class (i.e., the ground-truth class) is then examined for the inverted HA images.
Conclusion
Whereas DNNs have accomplished dramatic performance improvements in many areas, understanding the inside of the networks has become more difficult. Consequently, various XAI algorithms have been proposed to interpret the logic behind the decisions of complex neural networks, and at the same time, it has become necessary to study methods to evaluate their explainability. This paper proposes the machine-centric HAAS scheme to evaluate the explainability of XAI methods, whereas most previous evaluation methods focus on how close their results are to what people rely on. Furthermore, unlike AOPC, which needs many iterations and focuses only on the output of a target class, HAAS provides a quantitative score directly by feeding HA images, generated from the original images and their heatmap scores, into the given model without any iterations. Because it estimates accuracy changes, it takes all the outputs into account at the same time. In addition, HAAS uses both positive and negative influences of features. In particular, over the four datasets for the classification networks, HAAS achieves a lower Kullback–Leibler divergence than AOPC, that is, a more consistent selection of the best XAI algorithm across datasets.