Visual Interpretation of CNN Prediction Through Layerwise Sequential Selection of Discernible Neurons

In recent years, researchers have been working to interpret the insights of deep neural networks in the pursuit of overcoming their opaqueness and so-called 'black-box' tag. In this work, we present a new post-hoc visual interpretation technique that identifies the discriminative image regions contributing most to a network's prediction. We select the most discernible set of neurons per layer and engineer the forward-pass operation to gradually reach the most discriminative image locations. While searching for discernible neurons, existing approaches either end up with low-resolution visualization maps or suffer from a lack of neuron discriminativity along the way. Moreover, some methods concentrate only on current-layer information, overlooking meaningful information from adjacent layers and limiting the overall scope of selection. We address these issues by exploring the layer-to-layer connected structure of a neuron and obtaining contributions from its current layer along with its adjacent layers, i.e., the succeeding and preceding layers. We introduce a score function in which these contributions are assembled with appropriate priorities. Layerwise discernible neurons are then selected based on the top scores, ensuring a reliable and credible selection. We validate our proposal through objective and subjective evaluations, examining its performance in terms of model faithfulness and human trust, where we show its efficacy over other existing methods. We also perform sets of sanity-check experiments on our method to show its overall reliability as a visualization map.


I. INTRODUCTION
With the rise of unprecedented performance of deep learning methods in various computer vision tasks over the last few years, network architectures have also become complex in order to capture diverse variations [1], [2]. Despite offering higher performance, these complex architectures provide little explanation of the question: 'how does it actually work?'. As a result, deep learning models have been tagged as black-box models, raising the necessity of eradicating their opaqueness so that they become more understandable and transparent for general use [3], [4].
The associate editor coordinating the review of this manuscript and approving it for publication was Kaustubh Raosaheb Patil .
To understand the inner representations of deep networks, it is important to see how and what a network learns in practice. One possible way is to look for the salient image regions that contribute most to the network's prediction. Thus, we know how a network is making its prediction, and also get an idea of which portion of the input data is guiding the network towards a misprediction. Such regions, in fact, can act as visual explanations for a predicted label, exhibiting class-specific patterns learned by a model. In this way, these post-hoc visualizations aid in interpreting the insights of a CNN's prediction.

A. PREVIOUS WORKS
The major works in this area utilize gradient information to visualize salient regions [5]-[13]; however, following the related comparative discussions in [14]-[21], we broadly divide the existing works into the following categories.

1) PERTURBATION-BASED APPROACH
These approaches mainly observe the network behavior after perturbing specific inputs or neurons, aiming to understand the influence on later activations and the prediction. For example, Zeiler and Fergus [6] visualized the change in activations (of later layers) by occluding different parts of the input image. Zintgraf et al. [22] also analyzed the change in the prediction after marginalizing over each input patch. These approaches, however, often become practically inefficient due to the additional perturbation strategies, which usually require many separate forward propagations through the network. In this line of work, we also observe a few works focusing on learning interpretable models on top of the model predictions [23], perturbing the inputs to see the reaction of the black-box model [24], or both [25].

2) BACKPROPAGATION-BASED APPROACH
Unlike the above methods, backpropagation-based approaches propagate discernible signals from the output node to the input data through the layers in a single pass. The dominant group of works in this category is, in fact, gradient-based. For instance, Simonyan et al. [5] compute the sensitivity of the classification score as the partial derivative of the classification score for a given class with respect to changes in pixel values. Deconvolution-based works [6] take a similar approach to visualize salient feature concepts across different layers. Nonetheless, the lack of neuron discriminativity in these methods may affect the visualization, as pointed out in [26]. Guided Backprop [7] utilizes the gradients by modifying them for a better qualitative visual representation. Such gradient-based visualization maps often suffer from noise due to local variations in partial derivatives. Smilkov et al. [10] address this issue with SmoothGrad, which smooths the gradients with a Gaussian kernel. Other works [11], [27] come up with a gradient×input strategy, which is basically an element-wise product between the visualization maps (e.g., [5]) and the input. Gradients are also used after integrating them over a series of interpolated images, attributing the prediction of a deep network to its inputs [12].

3) CAM BASED APPROACH
The works based on the popular Class-specific Activation Map (CAM) [8], [28] concept are categorized in this group. CAM generates salient feature maps by combining intermediate feature maps before the global average pooling layer. Although this technique provides better flexibility than the prior approaches in terms of interpreting the prediction, its architecture-specific design limits its adaptation to different architectures and applications, as also pointed out in [29]. Improvements over this method utilizing gradient information were later introduced as Grad-CAM [9]. Selvaraju et al. [9] also perform an element-wise product between the scores from Grad-CAM and Guided Backprop, aiming at a more fine-grained visualization, which they name Guided Grad-CAM. Nevertheless, this strategy inherits the issues of Guided Backprop caused by zeroing out negative gradients during back-propagation, as pointed out in [11]. Moreover, Grad-CAM still uses low-resolution maps, which often limits performance in practical applications, e.g., weakly-supervised tasks [30]. Another line of works in this category are gradient-free CAMs [31]-[33], which basically capture the importance of each activation map by the target score in forward propagation. However, these methods are time-consuming and suffer from parameter dependency (e.g., Gaussian noise parameters). Use of the ReLU non-linearity may also make some of these maps (e.g., [31]) noise-prone. As a result, the concentration of the maps often spreads out and creates ambiguity while interpreting misclassified samples, as pointed out in [32].

FIGURE 1. Input image and corresponding discriminative locations found by our method for a pre-trained VGG-16 model. The visualization map (c) is generated from the discriminative locations (b) after Gaussian blurring. The example is generated after VGG-16 predicts the input image as 'shetland sheepdog.'

4) STRUCTURE-BASED APPROACH
We also observe a group of works [22], [34]-[37] taking a relevance score for every feature with respect to a class and estimating whether the prediction changes in the absence of that feature. Large changes in the prediction indicate the importance of the feature, while small changes do not affect the decision. Some other approaches [35] take probabilistic routes to find the contribution of each image patch (or pixel) to the classification, detailing their understanding. Zhang et al. [38] compute marginal winning probabilities for neurons at each layer, where a distinct attention map is computed as the sum of these probabilities across the feature maps. Huber et al. [39] utilize only the most relevant neurons of each convolution layer and thus generate more selective saliency maps. Mopuri et al. [29], in the same line, propose CNN-Fixations, which selects important neurons to trace salient image regions in order to interpret the prediction. One important part of [29] is the selection of salient neurons at each layer, where the authors suggest a naive approach of considering the neurons with high activations. During selection, this approach only concentrates on information from the target layer, e.g., activations, and thus overlooks available meaningful information from the neighboring layers.

B. OUR MOTIVATION AND PROPOSAL
As discussed, structure-based approaches select and utilize discriminative neurons per layer to search for salient image locations, and have so far shown their efficacy. We also observe in the former discussion that a lack of neuron discriminativity affects the visualization in some backpropagation-based methods. Hence, motivated by the importance of selecting discriminative neurons for a good visualization, we concentrate on a reliable strategy for selecting such neurons per layer, which are sequentially propagated towards the input layer to find the most discriminative image locations. However, while selecting these neurons, some approaches, e.g., [29], rely only on current-layer activations, ignoring other significant information available from adjacent layers, which limits the overall scope and reliability of the selection. Observing the layer-to-layer connected structure of a network, where the information at one layer contributes heavily to the outputs of its next layer, we infer that the impact of a neuron can be better understood by analyzing its influence on its current and adjacent layers. We explore this layerwise connected structure of a neuron with the notion that information from the current layer, together with information from its adjacent layers, simultaneously contributes towards selecting a neuron as an important one. We also draw motivation from a recent work on network pruning [40] that explores such a layerwise structure, where the authors prune a specific layer using weight tensors from adjacent layers along with its own layer weights, exhibiting superior results. However, we incorporate the idea of using adjacent-layer information carefully, assembling each of their contributions with the current layer's contribution in a prioritized way. Contributions from all these layers are the key to gaining confidence for a neuron to be selected as an important one; nonetheless, priorities are added to put appropriate attention on each of them.
For a traditional neural net, the above procedure reliably ensures the selection of important neurons per layer, starting from the last fully-connected (fc) layer down to the initial input data. As soon as we reach the input image, we generate the visualization map by Gaussian blurring the salient neurons, i.e., image locations. One sample visualization of our method is provided in Fig. 1. Our contributions can be summarized as follows:
• We explore the layer-to-layer connected structure of a neuron to understand its impact through the contributions from its current and adjacent layers.
• We introduce a score-function to assemble the layerwise contributions with corresponding priorities, where a reliable way of selecting such priorities is also presented. Most impactful neurons are selected per layer based on top scores, and gradually propagated towards input data to locate and highlight desired discriminative regions.
• We conduct extensive experiments to examine the efficacy of proposed technique against other existing methods through objective and subjective evaluations.
We also conduct sets of sanity check experiments to show its overall reliability as a visualization map.

II. METHODOLOGY
Our key purpose is to visually explain a CNN prediction by highlighting the most important image pixels responsible for the prediction. To search for these pixels, we trace the prediction back sequentially from the last layer to the input data, finding the layerwise most discriminative neurons so that the important neurons (pixels) of the input image are identified. In practice, our method takes a trained neural network, through which we pass a test image to make a prediction, which is then traced back from the final softmax layer. Neurons with the top n probabilities are considered the neurons of interest at this layer, for which we try to find a set of neurons from the prior layer (e.g., the last fc layer) contributing most to activating them. In this scenario, the last fc layer is considered our current (target) layer, the softmax layer is the succeeding layer, and the second-last fc layer (if it exists, otherwise the available conv/pool layer) is considered the preceding layer (such a scenario is depicted in Fig. 2). That is, for a set of neurons at the succeeding layer, we search for the neurons of the current layer contributing most towards activating the succeeding layer's neurons of interest. The contribution of a neuron is defined through a proposed score function formulated from collective information from its current and adjacent layers. Important neurons at each layer are then carefully selected by exploring the score values. We iterate this selection process from the softmax layer down to the input data to determine the most discriminative image pixels, which are then highlighted for interpretation purposes.
The proposal of the contribution-score, mentioned above, is the key highlight of our method, through which we assess a neuron's impact at a layer. For a neuron, its activation is typically considered the key measure of its impact [29]. We ask a couple of questions in this regard: 'why does the activation get high/low?' and 'how does a specific activation contribute next?'. Regarding the former question, we observe that strong connections from the preceding layer, having high activation and weight, basically cause the current activation to be higher. We find this an important cue for analyzing the generation of an impactful activation, while getting an essence of how it may influence later layers. For the latter question, we observe that a neuron with high activation may contribute to firing multiple succeeding neurons, which also provides a discernible cue to estimate its impact. Thus, a comprehensive understanding of this issue can be gained by exploring these aspects. Hence, while generating the contribution-score of a neuron, we accumulate individual scores from its adjacent preceding and succeeding layers analyzing the above aspects, and carefully aggregate them with the current-layer score.

A. FORMULATION
From now on, we denote the current layer as c, and its adjacent preceding and succeeding layers as p and s, respectively. A neuron from these layers is denoted with a hat as ĉ, p̂, ŝ; and a selected discriminative neuron additionally carries a star (e.g., ĉ*, p̂*, ŝ*). Now, let us assume there is a target layer c for which we would like to generate the desired set of ĉ*, and we have a number of selected ŝ* neurons from layer s. For each ŝ*, we first calculate contribution-scores for all ĉ, and then exploit these scores to get the set of ĉ* for each respective ŝ*.

FIGURE 2. Examples of an fc layer and a conv layer, through which we sequentially find discriminative neurons. Purple neurons denote the set of important neurons for a layer. For such a neuron (purple with glow) from the succeeding layer, we calculate contribution-scores for a target neuron (green with glow). Z_p, Z_c and Z_s denote the scores from the preceding, current and succeeding layers. Z_p and Z_c are derived by multiplying activations and the associated weights between the neurons under consideration; Z_s = 3 and Z_s = 2 denote that the target neuron contributes most to activating three and two selected neurons of the succeeding layer, in the respective cases. Calculation details of these Z-scores can be found in Section II-A.
Finally, we carefully unify all the generated sets of ĉ* to get the final desired set of neurons for layer c. We describe this process elaborately in the following part of this subsection.

1) GENERATING CONTRIBUTION-SCORE OF NEURONS
Here, we discuss how we generate the score of a neuron ĉ for a selected ŝ*. For this, we obtain individual contributions from its own layer c along with its adjacent layers p and s within respective score functions. At first, we derive a score for layer c itself. For this, we look at the strength of the connection between ĉ and ŝ*, since a stronger connection shows a higher impact of ĉ on ŝ*. We define it by multiplying the activation with the connection weight as

Z_c = A_ĉ · W_ŝ*ĉ, (1)

where Z_c is the score for layer c, A_ĉ is the activation value of ĉ, and W_ŝ*ĉ is the connection weight between ĉ and ŝ*. To obtain the contribution from layer p, we scrutinize the impact of the preceding neurons on ĉ's activation in order to get an essence of its generation, strength and probable impact. Now, in a forward pass, the signal is transferred from a prior to a later layer through a weighted sum of the activations of all prior neurons. Although every prior neuron contributes, not all of them contribute significantly, due to either low activations or low connection weights. Hence, to understand the key source of the generation of A_ĉ, we only consider the contribution from the most impactful neuron of the prior layer. One may consider multiple such neurons; nonetheless, for simplicity, we stick to the most impactful one. This is because a target ĉ backed by a most impactful p̂ of high activation and connection weight is likely to influence more than a ĉ whose strongest p̂* has moderate or low activation and connection weight. Obtaining the information of the prior connection in this way gives us an essence of the generation and strength of A_ĉ, and brings more explainability to the score function.
To generate this score, we first observe the individual connection strengths of all p̂ by multiplying their activations with the corresponding connection weights (associated with the target ĉ), and select the highest value among them as

Z_p = arg max over p̂ of (A_p̂ · W_ĉp̂), (2)

where Z_p is the score for layer p; the arg max(F) function returns the highest value from operation F; for a p̂, A_p̂ denotes its activation value, and W_ĉp̂ is the connection weight between p̂ and ĉ. Lastly, we generate the score for layer s by analyzing the influence of the target ĉ on activating the neurons of the succeeding layer. We observe this influence by calculating the number of times ĉ possesses the highest contribution, in terms of the Z_c scores, for all respective ŝ*. By contributing to activating more neurons of the later layer, a target ĉ explicitly shows its impact at that layer, and becomes a strong candidate to be picked. Hence, we calculate the number of times ĉ contributes most for the ŝ* neurons, and consider this value as the score for layer s. We define it as

Z_s = Σ over all selected ŝ* of G(ĉ, ŝ*), (3)

where Z_s is the score for layer s; the function G(·) takes the target ĉ and one selected ŝ* as input, and returns 1 if the target ĉ has the highest contribution for ŝ* (in terms of the multiplication of its activation A_ĉ and connected weight W_ŝ*ĉ) among all existing ĉ. We iterate this process for all selected ŝ* through a counter, and assign the final value to Z_s. A lower default value is assigned to Z_s in case no such contribution is found.
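For a dense fc layer, the two activation-based scores above reduce to simple array operations. The following is a minimal NumPy sketch, not the authors' implementation; the array names (`A_c`, `W_pc`, `W_cs`) and the dense weight-matrix layout are our assumptions:

```python
import numpy as np

def layer_scores_fc(A_c, A_p, W_pc, W_cs, s_star):
    """Per-neuron Z_c and Z_p scores for a fully-connected target layer c.

    A_c    : (n_c,)      activations of the current layer c
    A_p    : (n_p,)      activations of the preceding layer p
    W_pc   : (n_p, n_c)  weights connecting p -> c
    W_cs   : (n_c, n_s)  weights connecting c -> s
    s_star : index of one selected neuron in the succeeding layer s
    """
    # Z_c (Eq. 1): strength of each c-neuron's connection to s*.
    Z_c = A_c * W_cs[:, s_star]
    # Z_p (Eq. 2): for each c-neuron, the single strongest incoming
    # contribution (activation x weight) from the preceding layer.
    Z_p = np.max(A_p[:, None] * W_pc, axis=0)
    return Z_c, Z_p
```

Vectorizing over all candidate ĉ at once avoids an explicit loop over neurons, which matters when the fc layer is wide.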
After we obtain the individual scores from all the layers, we aggregate them to generate the final score function. There are multiple ways to aggregate them; nonetheless, we opt for a weighted-sum approach. The 'weight' here denotes the priority or importance we want to give to a layer's score.

Algorithm 1 (excerpt):
5: Select the k discriminative neurons with the highest scores, and define this set as H_ŝ*(ĉ*) = arg max_k { Φ_ŝ*(ĉ) : ∀ĉ }.
6: end for
7: Select the final set of discriminative neurons for layer c by taking the union of all H_ŝ*(ĉ*) as H(ĉ*) = ∪_ŝ* H_ŝ*(ĉ*).
Now, note that due to the disparate types of contributions from the layers, we need to aggregate them carefully. Specifically, Z_c and Z_p are derived from the corresponding activation values, and hence we can sum them directly. Z_s, in contrast, is a numerical measurement exhibiting the number of times ĉ contributes most for the selected ŝ* neurons. Hence, we use Z_s as a multiplier to the weighted sum of Z_c and Z_p in order to amplify the importance of the target ĉ. Formally, we devise it as

Φ_ŝ*(ĉ) = Z_s × (α_c Z_c + α_p Z_p), (4)

where Φ_ŝ*(ĉ) denotes the final score of ĉ for a selected ŝ*; α with subscripts c and p denotes the priorities for the c and p layers, respectively. Values of α are learned beforehand, as discussed in Section II-B2 (which also explains why no separate priority is attached to Z_s).
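The aggregation of Eq. 4 can be sketched as below. This is a hedged illustration, not the paper's code: `Z_c_all` is our name for the matrix holding the Z_c score of every candidate ĉ for each selected ŝ*, and the default value 0.5 for a never-winning Z_s is our assumption, since the text only states that a 'lower default value' is used:

```python
import numpy as np

def final_scores(Z_c_all, Z_p, alpha_c=0.9, alpha_p=0.3, z_s_default=0.5):
    """Contribution-scores per Eq. 4: Phi = Z_s * (a_c*Z_c + a_p*Z_p).

    Z_c_all : (n_c, n_sel) Z_c of each c-neuron for every selected s*.
    Z_p     : (n_c,)       preceding-layer score of each c-neuron.
    """
    # Z_s (Eq. 3): how many selected s* neurons each c-neuron 'wins',
    # i.e., for how many s* it has the top Z_c contribution.
    winners = np.argmax(Z_c_all, axis=0)
    Z_s = np.bincount(winners, minlength=Z_c_all.shape[0]).astype(float)
    Z_s[Z_s == 0] = z_s_default      # assumed lower default when never a winner
    # Weighted sum of the activation-based scores, amplified by Z_s.
    return Z_s[:, None] * (alpha_c * Z_c_all + alpha_p * Z_p[:, None])
```

Keeping Z_s as a multiplicative factor (rather than a third summand) mirrors the paper's argument that a count and an activation product live on different scales.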

2) SELECTION OF DISCERNIBLE NEURONS
For a selected ŝ*, we calculate the score (Eq. 4) for all ĉ in the above way, and select the most influential k neurons ĉ based on their highest score values. We define this set of selected neurons as

H_ŝ*(ĉ*) = arg max_k { Φ_ŝ*(ĉ) : ∀ĉ }, (5)

where the arg max_k operator returns the k top neurons based on the highest values, and H_ŝ*(ĉ*) is the selected set of ĉ* for a particular ŝ*.
In this way, we obtain the respective sets for all ŝ*. We now take the union of all these sets, removing repetitions of the same neurons, to generate the desired final set of discriminative neurons for layer c. We define it as

H(ĉ*) = ∪_ŝ* H_ŝ*(ĉ*), (6)

where the ∪ operation combines the neurons from all H_ŝ*(ĉ*) sets into the final set H(ĉ*) after removing repetitions of the same neurons.
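The per-ŝ* top-k selection followed by the union amounts to a few lines; a minimal sketch, with the function name and score layout being ours:

```python
import numpy as np

def select_discernible(scores, k):
    """Union of the k top-scoring c-neurons over all selected s* (Eqs. 5-6).

    scores : (n_c, n_sel) final score of each c-neuron per selected s*.
    """
    selected = set()
    for j in range(scores.shape[1]):
        top_k = np.argsort(scores[:, j])[-k:]   # indices of the k best for this s*
        selected.update(int(i) for i in top_k)  # the set union removes repeats
    return sorted(selected)
```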
In the above way, we sequentially select the set of most discernible neurons at each layer. Finally, we reach the input image and obtain its most contributing neurons (i.e., image pixels). As soon as we reach these pixels, we generate the visualization map by Gaussian blurring the pixel locations, as depicted in Fig. 1. The Gaussian parameters are found empirically.
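The final map construction (mark the traced pixel locations, then blur) can be sketched with a plain separable Gaussian in NumPy. The σ value below is our placeholder, since the paper only states that the Gaussian parameters were found empirically:

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """1D normalized Gaussian kernel, truncated at ~3 sigma by default."""
    radius = radius or int(3 * sigma)
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def make_visualization(pixel_locations, image_shape, sigma=8.0):
    """Binary mask at the traced pixel locations, blurred into a smooth map."""
    mask = np.zeros(image_shape[:2], dtype=float)
    for r, c in pixel_locations:
        mask[r, c] = 1.0
    k = gaussian_kernel(sigma)
    # Separable Gaussian blur: convolve down the columns, then along the rows.
    blurred = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 0, mask)
    blurred = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 1, blurred)
    return blurred / (blurred.max() + 1e-12)   # normalize to [0, 1]
```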

B. DISCUSSION ON METHOD-ATTRIBUTES
In this part, we elaborate the descriptions of different method attributes, such as the selection process for conv/pooling layers, the selection of the priority values (denoted in Eq. 4), and the visualization of different method variants.

1) SELECTION PROCESS IN CONVOLUTION/ POOLING LAYER
In a conv layer, the pixels within a receptive field are convolved with kernel weights to generate the activation at the same spatial location of the next layer. One ŝ* in a conv layer is basically a pixel location in one of the channels of layer s (as denoted by the purple colored pixel in Fig. 2). For such a ŝ*, we find the desired set of neurons ĉ* (pixel locations) by analyzing the locations at the same (x, y) spatial position among all the channels of layer c. That is, scores are calculated along the channels of layer c at the same (x, y) locations that in fact aid in activating ŝ*. Lastly, the most contributing locations are selected. This selection follows the same procedure as in Section II-A, except for some modifications in the score calculations due to the convolution process, which we elaborate here.
Among the scores, Z_c is calculated through the multiplication of the activations of neuron ĉ and the respective weights between c and ŝ*. In a conv layer, we regard the activation of a location in c through its surrounding pixel values within a receptive field, as depicted in the conv-layer part of Fig. 2, and the weights are its associated kernel weights. Both the activations and weights appear as 3D blobs having spatial locations (x, y) along with channel information. To replicate the multiplication, we compute the Hadamard product between them, which also results in a 3D blob of the same size as the activation blob. For each channel in this output blob, we sum the respective values across the (x, y) spatial region, so that each channel is represented by a single numeric value. These values are basically the respective Z_c values for the candidate locations. Note that Z_p for such a candidate location ĉ is calculated in a similar way, but for layer p, while picking the maximum value among the channels. Besides, Z_s is simply the numeric count assessing the effect of ĉ on the later layer, and hence requires no modification. Other procedures, such as the selection of the most contributing locations, are the same as detailed in Section II-A2.
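The conv-layer Z_c computation just described (Hadamard product of the activation blob with the kernel weights, summed spatially per channel) can be sketched as follows; the shapes and names are illustrative, not the authors' code:

```python
import numpy as np

def conv_z_scores(patch, kernel):
    """Per-channel Z_c scores for a candidate location in a conv layer.

    patch  : (C, kh, kw) receptive-field activations around the location
    kernel : (C, kh, kw) kernel weights producing the s* activation
    """
    prod = patch * kernel            # element-wise (Hadamard) product, a 3D blob
    return prod.sum(axis=(1, 2))     # spatial sum -> one score per input channel
```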
In a pooling layer, an activation in layer s is basically the maximum value present in the corresponding receptive field in layer c. Hence, while backtracking an activation across such a layer, we locate the maximum activation of layer c within its receptive field. That is, for pooling layers, we first extract the 2D receptive field of the corresponding neuron in layer c, and then find the location having the highest activation. We choose the maximum activation since most models normally use max-pooling to sub-sample the feature maps.
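Backtracking through a max-pooling layer thus reduces to an argmax inside the receptive field; a minimal sketch, assuming the common 2×2 window with stride 2 (the paper does not fix these values):

```python
import numpy as np

def backtrack_maxpool(feature_map, y, x, pool=2, stride=2):
    """Locate the max-activation position in layer c that produced the
    pooled activation at (y, x) of the succeeding layer."""
    y0, x0 = y * stride, x * stride
    window = feature_map[y0:y0 + pool, x0:x0 + pool]   # 2D receptive field
    dy, dx = np.unravel_index(np.argmax(window), window.shape)
    return y0 + dy, x0 + dx
```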
Sample illustrations for the fc and conv layers in extracting the important neurons are given in Fig. 2, and the general algorithm is provided in Alg. II-A1.

2) SELECTION OF PRIORITY-VALUES
In this part, we discuss the optimal selection of the priority values (α) that are multiplied with each Z-score in the score function (Eq. 4). Assigning such layerwise priorities is important since different layers contribute differently; the priorities let us give appropriate importance to each layer's information while bringing more generality to the score function (Eq. 4). Note that among the Z-scores, Z_s is a multiplier that amplifies a neuron's chance to be selected as an important one. Putting α_s also as a multiplier (to Z_s) would not contribute much in this context. For this reason, we simply omit α_s from the equation. Now, there can be multiple ways of selecting the optimal values of α_p and α_c. One way is to search for their best experimental outcomes on a sample validation set based on an evaluation metric, such as the IoU (Intersection over Union) score. To generate the IoU score, a bounding box is first generated in the predicted image (using the visualization map is one way of generating such a box), and it is compared with the ground-truth bounding box in terms of the ratio between their overlap area and their union area. This strategy shows how accurately a model identifies class-specific object regions by rewarding predicted bounding boxes that heavily overlap with the ground truth. Our idea is that a map with a good parameter choice would successfully trace back the discriminant locations in a test image, and thus significantly aid the generation of an accurate bounding box and the respective IoU score. Hence, we first generate maps for different α_p and α_c values, and then calculate IoU scores using the bounding boxes generated from each of these maps. We then find the combination of priority values providing the best IoU scores.
For the experiment, we randomly select 1000 images from the ImageNet validation dataset [41] and calculate the average IoU scores for them using our method, for a range of pre-defined α_p and α_c values ([0.1 ∼ 1.0]). For a test image, after reaching its discriminative regions using the proposed method, we generate a bounding box around the target object simply by tracking the largest connected components.1 The average IoU scores for different α_c and α_p are given in Fig. 3. We observe that the IoU scores increase as we increase α_c, and decrease as we decrease α_c. For α_p, we observe the opposite: as we increase α_p, the IoU decreases, and vice versa. These results lead us to the fact that performance depends heavily on the correct combination of priority values. In fact, the linear relationship between α_c and the IoU scores indicates that performance mainly depends on the contribution from the current layer, while being partially dependent on the contribution from the preceding layer. The highest score is obtained at α_c = 0.9 and α_p = 0.3, which also supports the prior observation. Hence, we select these as the optimal values for α_c and α_p.
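The search described above is a plain two-dimensional grid search; a sketch, where `iou_fn` is our stand-in for running the full tracing pipeline and averaging the IoU over the sampled validation images:

```python
import numpy as np

def grid_search_priorities(iou_fn, values=np.arange(0.1, 1.01, 0.1)):
    """Pick (alpha_c, alpha_p) maximizing mean IoU on a validation set.

    iou_fn(alpha_c, alpha_p) is assumed to generate the visualization maps
    with those priorities and return the average IoU score.
    """
    best = (None, None, -1.0)
    for a_c in values:                 # sweep current-layer priority
        for a_p in values:             # sweep preceding-layer priority
            iou = iou_fn(a_c, a_p)
            if iou > best[2]:
                best = (a_c, a_p, iou)
    return best                        # (best alpha_c, best alpha_p, best IoU)
```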
Note that we keep the priority values the same for all layers from top to bottom. Considering that these layers produce diverse information, one may prioritize them differently, e.g., using different priority values for the top, middle and bottom layers. One may also consider deriving priority values on other datasets to see whether those values differ for other images. We leave these issues as potential future research directions.

3) METHOD-VARIANTS
In this part, we explain how the inclusion of adjacent-layer information benefits the proposed visualization. Note that in the prior discussions (Section I-B and the preamble of Section II), we discuss the primary motivation and idea of including such information. Here, we explicitly illustrate how such information aids the current-layer information in generating accurate, credible and aesthetic visualizations. For this, we assess visualization maps with and without the adjacent layers' information after modifying the score function (Eq. 4). We modify it for the different cases as:

Case A: only the layer-c score (α_c Z_c);
Case B: layer-c and layer-p scores (α_c Z_c + α_p Z_p);
Case C: layer-c and layer-s scores (Z_s × α_c Z_c);
Case D: all three scores, i.e., the full score function of Eq. 4;
Case E: layer-s and layer-p scores only (Z_s × α_p Z_p).

We provide visual maps for each of these cases for different objects in Fig. 4. We observe that, compared to Case A where we use layer-c information only, the target objects are covered more accurately as we add information from layer p (Case B) and layer s (Case C), respectively. Using both these layers along with layer c (Case D) produces visually the most aesthetic and accurate maps among the given cases. We also derive Case E, where we do not use layer c at all, relying on information from layers s and p only. Despite neglecting information from the most important layer, the objects are still covered, through a somewhat scattered visualization, in this case. This is a key case in our proposal, showing that the adjacent layers possess meaningful knowledge of the current activation and thus can significantly aid the current layer's information in our gradual search for important neurons.

III. EXPERIMENTAL RESULTS
A visualization work of this kind must show its consistency with the model's prediction (faithfulness) while gaining human trust. In this regard, we conduct objective and subjective evaluations to compare our method against other existing methods in terms of faithfulness and human trust. We also conduct some sanity checks (e.g., a model randomization test and a class-sensitivity analysis) to demonstrate its overall reliability as a visualization map.

A. OBJECTIVE EVALUATION
For objective evaluation, we evaluate the faithfulness of our method with respect to the model prediction. This can be understood by analyzing a map's ability to accurately explain the function learned by a model, as pointed out in [9] and [43]. Ribeiro et al. [25] also suggested that explanations should be locally accurate in the vicinity of the predicted object in order to be faithful to the model. It is therefore interesting to see how restricting the input to the explanation-map region influences the prediction confidence. This analysis can be a significant cue to understanding the consistency of a map with the model, which we aim to observe in this part. In this regard, we evaluate the performance of our method against others on two different metrics: (a) average drop % and (b) % increase in confidence, each of which is described below.

1) AVERAGE DROP %
In the first experiment, we compare the prediction confidence on an input image against the confidence obtained when its corresponding explanation map is used as the input. A good map always highlights the respective class-specific parts, and thus keeps the most important such regions of the image intact. Still, when an explanation map is used as input, the relevant object parts may be occluded to some extent. Since a network mainly searches for important object patterns to make a decision, occluding such parts would mostly lower the confidence of the model compared to the confidence on the original image for the particular class. We expect this drop in confidence to be sufficiently low in practice for a good map. We define the metric (% drop) via the ratio of confidence scores between the explanation map and the original input image, and the average result is reported over the samples under consideration. A lower value, in this regard, indicates better consistency of the method.
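Under the common formulation of this metric (as used, e.g., in Grad-CAM++ [43]), the average drop can be computed as below; the function name is ours:

```python
import numpy as np

def average_drop(orig_conf, map_conf):
    """Average drop % between class confidences on the original images and
    on their explanation maps; lower means the map kept the evidence intact.

    orig_conf, map_conf : per-sample confidences for the predicted class.
    """
    orig = np.asarray(orig_conf, dtype=float)
    expl = np.asarray(map_conf, dtype=float)
    drop = np.maximum(orig - expl, 0.0) / orig   # per-sample relative drop
    return 100.0 * drop.mean()
```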
The idea of this experiment comes from [43], where the results are generated using PASCAL VOC 2007 [44] and ImageNet [41] validation images. Similar to [43], 2510 validation images of the PASCAL VOC 2007 dataset have been used for the experiment. For ImageNet, we randomly select three subsets of 2510 validation images and average the results in order to maintain consistency with the PASCAL VOC 2007 setting. VGG-16 is chosen to generate the results. We compare our results with other existing methods, including Integrated Gradients [12], Gradient×Input [11], [27], Guided Backprop, different Grad-CAM variations, including Grad-CAM [9], Guided Grad-CAM [9] and Grad-CAM++ [43], and CNN-Fixations [29]. Results are given in Table 1. As we observe, our method achieves the lowest score in both datasets, exhibiting its efficacy in accurately tracking the predicted object region, which, in turn, shows its consistency with the model's prediction. The lowest values in both datasets also suggest that the proposed visualization includes more of what is relevant for a correct decision, pointing to its adjustability to the model function. Note that for this experiment, we use the implementation of [45] to generate visualization maps for Grad-CAM, Integrated Gradients, Gradient×Input, and Guided Backprop. The others are generated using the code repositories of the corresponding papers. Visualization maps of all these methods for some sample objects are also provided in Fig. 5 in order to compare them qualitatively.

2) % INCREASE IN CONFIDENCE
Since a CNN mainly looks at the most relevant object region for a prediction, at times the entirety of that region may be highlighted by the map. In this case, providing only the explanation-map region as input can increase the prediction confidence compared to that of the original image, since insignificant image parts are ignored. We measure this effect by counting the number of times such an improvement is observed for the particular class when only the explanation-map region is provided as input. We report the result as a percentage in Table 2. For this experiment, we keep the settings the same as in the prior experiment and generate results for the other methods under consideration. The results in Table 2 show that the proposed approach gains a boost in confidence more often than the other methods on both datasets, which again shows its efficacy in discriminative localization of the target object.
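The complementary counting metric described here admits an equally simple sketch, under the same assumption that per-sample confidences are available; the function name and data layout are illustrative:

```python
def percent_increase(orig_conf, map_conf):
    """Percentage of samples whose confidence rises when only the
    explanation-map region is given as input."""
    increases = sum(1 for o, m in zip(orig_conf, map_conf) if m > o)
    return 100.0 * increases / len(orig_conf)
```

A higher value indicates that the map more often isolates the object evidence the network actually relies on.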
The superior results in the above experiments demonstrate that the proposed method generates more faithful explanations than the other methods, confirming its consistency with the model's prediction. In both experiments, CNN-Fixations comes closest to ours, while CAM-based variants, e.g., Grad-CAM++ and Guided Grad-CAM, also achieve competitive results.

B. SUBJECTIVE EVALUATION
So far, we have explored the faithfulness property of our method in the above experiments. We now conduct a couple of subjective evaluation experiments to analyze its interpretability in terms of class-discrimination and human-trust.

1) CLASS-DISCRIMINATION EVALUATION
In this experiment, we measure the visual discriminativity of the visualization methods in representing the predicted object by examining which of them best represents the class-specific features from a human visual perspective. For the experiment, we select 10 different classes from the ImageNet validation set, and for each class we randomly select 5 different images. For the selected images, we generate maps for the different visualization methods with the VGG-16 network. We then provide these maps to 20 attendees along with the original input image and corresponding class label, after hiding the methods' identities, and ask them 'Which map best describes the class-object present in the input image?' In this way, we aim to obtain a human perspective on which visualization map best exhibits the class-specific object while providing an aesthetic visualization.
VOLUME 10, 2022
To keep this evaluation simple for the attendees, we decide to provide them a limited set of maps, rather than all of those used in Section III-A. Motivated by the dominant performance in the prior objective evaluation, we consider Grad-CAM++, Guided Grad-CAM, CNN-Fixations, and the proposed method for this test. That is, for each given image, we provide 4 visualization maps to each attendee along with the original image and corresponding class label. Attendees answer through a Google Form, where the answers are given in a checkbox manner. If an attendee chooses a single option (visualization map) for a question, a score of 1 is added to that option; when multiple answers are provided, the scores are normalized to sum to 1. Thus, for each question, we obtain normalized scores for the given options. These scores are then added up, yielding a maximum achievable total score of 50. As the final scores in Table 3 show, the proposed method achieves a score of 26.41, which is by far better than the other methods, e.g., Guided Grad-CAM (7.00), Grad-CAM++ (7.42), and CNN-Fixations (9.17). Hence, according to these human evaluations, our method provides the most aesthetic and class-specific visualization maps among the given methods.
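The checkbox aggregation described above can be sketched as follows, under the assumption that each response is recorded as the list of method names an attendee ticked for one question; how per-attendee scores are combined into the final per-question score is not fully specified in the text, so this tally is illustrative only:

```python
def tally_scores(responses):
    """Aggregate checkbox answers: each answered question contributes
    a total of 1, split equally among the options the attendee ticked.

    responses: list of lists, each inner list holding the method names
               selected for one (question, attendee) pair.
    Returns a dict mapping method name -> accumulated score.
    """
    scores = {}
    for picks in responses:
        if not picks:            # skip unanswered questions
            continue
        share = 1.0 / len(picks) # normalize multiple selections to 1
        for method in picks:
            scores[method] = scores.get(method, 0.0) + share
    return scores
```

With one attendee per question, 50 questions then yield a maximum total of 50, matching the scale reported in Table 3.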

2) HUMAN TRUST EVALUATION
In this experiment, we examine which of several maps, produced by a visualization method for different networks, seems more trustworthy from a human perspective. In essence, we examine whether this human perception (of one network's map dominating the others) complies with the precalculated machine accuracy of the respective networks; should they comply with each other, it would help place trust in the explanation. We compare our visual maps produced for AlexNet and VGG-16 with the knowledge that VGG-16 shows better accuracy than AlexNet on image classification tasks [9], [46]. To check whether the proposed visualization complies with this result, we select 58 ImageNet samples on which both networks make correct predictions, and then generate maps for them using our method. For each image, we provide the corresponding maps for AlexNet and VGG-16 to 12 attendees and ask them 'Which map best describes the object present in the image?' Notably, all attendees voted for VGG-16 as producing a better map than AlexNet. We also ask them to rate the reliability of both maps by providing a score between 1 and 5 (worst to best) via a radio-button option. The obtained scores are then normalized, where we find that VGG-16 (38.08) has a higher score than AlexNet (36.07). The result complies with the previous finding that VGG-16 shows superior performance over AlexNet for object classification. In this way, based on explanations consistent with human perception, our visualization method can help users place trust in a model's prediction.
The superior results in the above experiments show that the proposed method enables discriminative visualization of the predicted classes (class-discrimination) and helps users pick more reliable models (human-trust) in practice. Note that in the Appendix of this paper, we provide screenshots of sample questionnaires for both of the above-mentioned subjective evaluation experiments.

C. SANITY CHECKS
Adebayo et al. [47] point out that visualization methods do not always show dependency on the corresponding model and data; hence, sole reliance on visual inspection can often be misleading. To assess the reliability of a method, the authors stress conducting sanity checks before deploying it in practice. Following this recommendation, we conduct a couple of sanity-check experiments on our method, which we describe in this part.

1) MODEL RANDOMIZATION TEST
In this test, we check the sensitivity of our visualization map to model properties, e.g., model parameters. The assumption is that if a visualization method depends on the learned parameters of a model, we should expect a substantially different map when the model parameters change. If the output remains similar, the method can be regarded as insensitive to model properties and hence inferred to be unreliable for interpretation.
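The randomization protocol of this test can be outlined with a toy sketch, in which the model is abstracted as a plain list of per-layer weight lists (top layer first) and `make_map` stands in for the full visualization pipeline; both names are hypothetical placeholders, not the paper's code:

```python
import random

def cascading_randomization(layers, make_map, rng=random.Random(0)):
    """Cascading model-randomization test: starting from the top
    layer, successively replace learned weights with random values
    and record the visualization map after each step.

    layers:   list of weight lists (top layer first), modified in place
    make_map: callable producing a visualization map from the weights
    """
    maps = [make_map(layers)]              # map of the intact model
    for layer in layers:                   # top -> bottom
        for i in range(len(layer)):
            layer[i] = rng.gauss(0.0, 1.0)  # re-initialize this layer
        maps.append(make_map(layers))      # map after randomizing so far
    return maps
```

Each recorded map can then be compared against the first (intact) one with a rank-correlation or similarity measure.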
To conduct this test, we compare the visual map of a target image for a trained network against its map for the same network with randomly initialized (untrained) parameters. For this, we randomize the network weights in a cascading fashion; that is, we start randomizing from the top layer and continue down to the bottom layer. After randomizing up to a specific layer, we measure the difference between the corresponding map and the original map. We first calculate this difference in terms of a popular correlation measure, Spearman's rank correlation [48]. Spearman's rank correlation (denoted as ρ or r_s in the literature) measures the strength of association between two ranked variables. An r_s value varies between −1 and +1, where a value of ±1 indicates a perfect degree of either positive or negative association between the two variables. In this regard, a lower r_s value after randomizing up to a particular layer denotes a significant change between the original map and the map of the partially randomized model.
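For reference, Spearman's r_s is simply the Pearson correlation of the ranks of the two score lists (with average ranks assigned to ties), which a short sketch makes explicit; in practice a library routine such as `scipy.stats.spearmanr` would typically be used on the flattened maps:

```python
def spearman_rho(x, y):
    """Spearman's rank correlation between two equally long lists,
    computed as the Pearson correlation of their ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(order):
            j = i
            while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
                j += 1                      # extend the tied block
            avg = (i + j) / 2.0 + 1.0       # average rank for ties
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Identically ordered maps give r_s = 1, reversed orderings give r_s = −1, and values near 0 indicate the randomized model's map bears little rank relation to the original.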
Using the proposed method, Spearman's correlation values are generated for 1000 randomly selected ImageNet samples, and the layerwise (i.e., randomization from the top layer down to a target layer) average results for AlexNet and VGG-16 are shown in Fig. 6a and Fig. 6b, respectively. As evident from the figures, as weight randomization proceeds in both networks, we observe gradually decreasing correlation values, indicating substantial dissimilarities in the respective maps. For visual inspection, we also provide qualitative results for a sample image (junco) for both AlexNet (Fig. 7) and VGG-16 (Fig. 8).
Here, we explicitly observe how the visualization gets scattered as we advance from the top to the bottom layers, destroying the learned weights along the way.
For a comparative evaluation, we also generate results under the same experimental settings for the other existing visualization methods, including Integrated Gradients, Gradient×Input, Guided Backprop, Guided Grad-CAM, Grad-CAM, Grad-CAM++, and CNN-Fixations. Results are provided in Fig. 6a and Fig. 6b, respectively. We observe that Guided Grad-CAM and Guided Backprop show insensitivity to higher-layer weights, although they improve as the randomization moves towards the lower layers. Gradient×Input and Integrated Gradients show consistently high r_s values, indicating correspondence between the original map and the distorted map throughout. In contrast, Grad-CAM, Grad-CAM++, and CNN-Fixations show lower r_s values than the above-mentioned methods. Note that similar behavior is also observed for the respective methods in the sanity checks conducted in [47]. The proposed method, however, achieves the lowest r_s values across all the distorted layers, demonstrating its strong sensitivity to the model compared to the other methods. To assess the consistency of the above findings, we use another similarity measure, the Structural Similarity Index (SSIM) [49], to observe the difference between the maps. SSIM measures the perceptual difference between two images; its values range between 0 and 1, where 1 means a perfect match between the images being compared. Our results (with SSIM) for AlexNet and VGG-16 are given in Fig. 6c and Fig. 6d, respectively, where we again observe that as weight randomization continues from the top to the bottom layers, the SSIM values steadily decrease, demonstrating consistent changes in the maps. Similar to the prior experiment, we compare the performance against the other methods in this setting and provide results in Fig. 6c and Fig. 6d. Guided Backprop and Guided Grad-CAM show very high SSIM values, while Grad-CAM, Grad-CAM++, and CNN-Fixations show moderately low SSIM values. The proposed method achieves the lowest SSIM values across all the distorted layers.
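The standard SSIM of [49] is computed over a sliding window; as a rough, hedged illustration of the underlying formula, a single-window ("global") variant over whole flattened maps can be sketched as follows, with the stabilizing constants c1 and c2 chosen arbitrarily here:

```python
def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Simplified single-window SSIM between two flattened maps
    (assumed scaled to [0, 1]); the standard measure applies the
    same luminance/contrast/structure formula per local window."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n               # means
    vx = sum((a - mx) ** 2 for a in x) / n        # variances
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx * mx + my * my + c1) * (vx + vy + c2))
```

In practice a library implementation such as `skimage.metrics.structural_similarity` would be used; the sketch only shows why identical maps score 1 and structurally divergent maps score much lower.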
We observe that the results are in line with the prior experiment. Thus, the comparative quantitative measures (Fig. 6) together with the individual qualitative results (Fig. 7 and Fig. 8) demonstrate that as the model parameters change, our method consistently follows the respective changes by shifting its focus, showing its dependency on and sensitivity to the model parameters.

2) CLASS-SENSITIVITY ANALYSIS
We conduct another sanity check on our method in terms of its sensitivity to the predicted class. Similar experiments are found in the literature [50], [51], where the purpose is to check whether a visualization map focuses on the representative class-object in a correct and reliable manner. For instance, if the current focus of the image is intentionally distracted, an unreliable visualization map would fail to follow the change and can hence be regarded as class-insensitive. Should the map be class-sensitive, its focus must shift substantially with respect to the change in the image.
To check this aspect, we generate an adversarial pair of an input image and produce visualization maps for both of them to examine their difference in focus. An adversarial pair is generated through an adversarial attack that brings subtle changes to the image, leading the network to predict a different class. In Fig. 9, we provide visualization maps of some sample images and their adversarial pairs, generated for VGG-16 using DeepFool [52]. As we observe, in the adversarial images, the main concentration of the visualization map shifts to other areas; for example, from bassinet to a triangle-like tripod shape (top-left example), and so on. Furthermore, for quantitative analysis, we collect 1000 ImageNet samples and calculate the difference between the visual maps of the original images and their adversarial pairs. As before, the difference is calculated using Spearman's rank correlation (r_s) and SSIM values, and average results are reported in Table 4 for our method along with the other existing methods. In comparison, our method achieves the lowest similarity values among the methods for both the AlexNet and VGG-16 networks. Such low r_s and SSIM values indicate a considerable amount of dissimilarity between the original and adversarial maps, pointing to substantial shifts of concentration in the adversarial maps. Thus, the above results show that the proposed visualization map is not biased towards the image itself; rather, as the concentration in the image changes, it tracks the changes quite successfully, exhibiting its consistency with the respective image class.
From the experimental results reported above (Sections III-A∼III-C), we observe that some methods, e.g., Guided Grad-CAM and Guided Backprop, show poor results in both sanity-check experiments although they show moderate performance in the objective evaluation. We find Integrated Gradients and Gradient×Input performing poorly, while Grad-CAM, Grad-CAM++, and CNN-Fixations show comparably better results than the above-mentioned methods in the objective evaluation and sanity-check experiments.
The proposed method, however, achieves superior results in both the objective and subjective evaluations, as well as strong model-dependency in the sanity-check experiments.

IV. CONCLUSION AND FUTURE DIRECTION
In this work, we present a new visual interpretation method that sequentially selects discernible neurons per layer and gradually traces the prediction back to the input image to find the discriminative image locations responsible for the prediction. While selecting such neurons per layer, we consider contributions from the adjacent layers along with the current layer, combined with layer-specific priority values. In the proposal, we present reliable ways of extracting these layerwise contributions and layer-specific priority values, and show their consistency in our experiments. We conduct comparative subjective and objective evaluations and a couple of sanity checks to evaluate the proposal's overall reliability as a visualization map, demonstrating its superior performance over other existing methods.
However, there is still room to further explore the attributes and applicability of our method. For instance, in the score function (Eq. 4), we currently consider only the immediately adjacent layers' information; how information from multiple sequentially adjacent layers would affect the visualization is an interesting research direction. In Section II-B2, we present an IoU-based approach for layerwise priority-value selection. We acknowledge that such priorities may vary in diverse cases, and hence a more generalized solution could be formulated to facilitate cases where the boundary information necessary for the IoU calculation is unavailable. Moreover, in the current work, we show the applicability of our idea to traditional architectures, like AlexNet and VGG-16. It is worth mentioning that some minor tweaks in the design would be needed for recent modules, like inception, skip-connection, and LSTM units. Lastly, such a visualization map can be utilized in various applications, such as object localization, image captioning, visual question answering, and knowledge distillation (as also done in [9], [20], [29], and [43]), where we are yet to test the utility of our proposal. We leave these issues as our future endeavors.

APPENDIX
Here we provide screenshots of example questionnaires for the two subjective evaluation experiments detailed in Section III-B. Since the attendees are not subject experts, we first briefly explain good and bad visualization representations to them with some examples; one such example slide is given in Fig. 10. In Fig. 11, we provide a couple of representative screenshots for both subjective evaluation experiments. In the given images, Subjective Evaluation I denotes the class-discrimination evaluation of Section III-B1, and Subjective Evaluation II denotes the human-trust evaluation of Section III-B2. We hide the exact experiment titles in order to prevent any bias towards the experiments.