Interpretable CNNs for Object Classification

Abstract. This paper proposes a generic method to learn interpretable convolutional filters in a deep convolutional neural network (CNN) for object classification, where each interpretable filter encodes features of a specific object part. Our method does not require additional annotations of object parts or textures for supervision. Instead, we use the same training data as traditional CNNs. Our method automatically assigns each interpretable filter in a high conv-layer with an object part of a certain category during the learning process. Such explicit knowledge representations in conv-layers of CNN help people clarify the logic encoded in the CNN, i.e. answering what patterns the CNN extracts from an input image and uses for prediction. We have tested our method using different benchmark CNNs with various structures to demonstrate the broad applicability of our method. Experiments have shown that our interpretable filters are much more semantically meaningful than traditional filters.


INTRODUCTION
In recent years, convolutional neural networks (CNNs) [8], [13], [16] have achieved superior performance in many visual tasks, such as object classification and detection. In spite of this good performance, a deep CNN has long been considered a black-box model with weak feature interpretability. Boosting the feature interpretability of a deep model has recently attracted increasing attention, but it presents significant challenges for state-of-the-art algorithms.
In this paper, we focus on a new task: revising a CNN, without any additional annotations for supervision, so that its high conv-layers (e.g. the top two conv-layers) encode interpretable object-part knowledge. The revised CNN is termed an interpretable CNN.
More specifically, we propose a generic interpretable layer to ensure that each filter in the proposed interpretable layer learns specific, discriminative object-part features. Filters in the interpretable layer are supposed to have some introspection of their feature representations and to regularize their features towards object parts during end-to-end learning. We trained interpretable CNNs on several benchmark datasets, and experimental results show that filters in the interpretable layer consistently represented the same object part across input images.
In addition, as discussed in [2], filters in low conv-layers usually describe textural patterns, while filters in high conv-layers are more likely to represent part patterns. Therefore, we focus on part-based interpretability and propose a method to ensure that each filter in a high conv-layer represents an object part.
Fig. 1 visualizes the difference between a traditional filter and our interpretable filter. In a traditional CNN, a filter usually describes a mixture of patterns. For example, the filter may be activated by both the head part and the leg part of a cat. In contrast, the filter in our interpretable CNN is expected to be activated by a certain part.
Thus, the goal of this study can be summarized as follows. We propose a generic interpretable conv-layer to construct the interpretable CNN. Feature representations of the interpretable conv-layers are interpretable, i.e. each filter in the interpretable conv-layer learns to consistently represent the same object part across different images. In addition, the interpretable conv-layer needs to satisfy the following properties:
• The interpretable CNN needs to be learned without any additional annotations of object parts for supervision. We use the same training samples as the original CNN for learning.
• The interpretable CNN does not change the loss function of the classification task, and it can be broadly applied to different benchmark CNNs with various structures.
• As this is exploratory research, learning strict representations of object parts may slightly hurt the discrimination power. However, we need to control the decrease within a small range.

Fig. 2. [...] During the forward propagation, our CNN assigns a specific mask to each interpretable filter w.r.t. each input image during the learning process.

Method: As shown in Fig. 2, we propose a simple yet effective loss. We simply add the loss to the feature map of each filter in a high conv-layer, so as to construct an interpretable conv-layer. The filter loss is based on the assumption that only a single target object is contained in the input image. The filter loss pushes the filter towards the representation of a specific object part.
Theoretically, we can prove that the loss encourages a low entropy of inter-category activations and a low entropy of spatial distributions of neural activations. In other words, this loss ensures that (i) each filter must encode an object part of a single object category, instead of representing multiple categories; (ii) the feature must consistently be triggered by a single specific part across multiple images, rather than be simultaneously triggered by different object regions in each input image. It is assumed that repetitive patterns on different object regions are more likely to describe low-level textures than high-level parts.
Value of feature interpretability: Such explicit object-part representations in conv-layers of a CNN can help people clarify the decision-making logic encoded in the CNN at the object-part level. Given an input image, the interpretable conv-layer enables people to explicitly identify, without ambiguity, which object parts are memorized and used by the CNN for classification. Note that the automatically learned object part may not have an explicit name, e.g. a filter in the interpretable conv-layer may describe a partial region of a semantic part or the joint of two parts.
In critical applications, clear disentanglement of visual concepts in high conv-layers helps people trust a network's prediction. As analyzed in [45], good performance on testing images cannot always ensure correct feature representations, considering potential dataset bias. For example, in [45], a CNN used an unreliable context (eye features) to identify the "lipstick" attribute of a face image. Therefore, people need to semantically and visually explain what patterns are learned by the CNN.
Contributions: In this paper, we focus on a new task, i.e. end-to-end learning of an interpretable CNN without any part annotations, where filters of high conv-layers represent specific object parts. We propose a simple yet effective method to learn interpretable filters, and the method can be broadly applied to different benchmark CNNs. Experiments show that our approach has significantly boosted the feature interpretability of CNNs.
A preliminary version of this paper appeared in [46].

RELATED WORK
The interpretability and the discrimination power are two crucial aspects of a CNN [2]. In recent years, different methods have been developed to explore the semantics hidden inside a CNN. Our previous paper [48] provides a comprehensive survey of recent studies in exploring visual interpretability of neural networks, including (i) the visualization and diagnosis of CNN representations, (ii) approaches for disentangling CNN representations into graphs or trees, (iii) the learning of CNNs with disentangled and interpretable representations, and (iv) middle-to-end learning based on model interpretability.

Interpretation of pre-trained neural networks
Network visualization: Visualization of filters in a CNN is the most direct way of exploring the pattern that is encoded by the filter. Gradient-based visualization [19], [28], [40] showed the appearance that maximized the score of a given unit. Furthermore, Bau et al. [2] defined and analyzed the interpretability of each filter. They classified all potential semantics into the following six types: objects, parts, scenes, textures, materials, and colors. We can further summarize the semantics of objects and parts as part patterns with specific contours and consider the other four semantics as textural patterns without explicit shapes. Recently, [20] provided tools to visualize filters of a CNN. Dosovitskiy et al. [5] proposed up-convolutional nets to invert feature maps of conv-layers to images. However, up-convolutional nets cannot mathematically ensure that the visualization result reflects actual neural representations.
Although the above studies can produce clear visualization results, theoretically, gradient-based visualization of a filter usually selectively visualizes the strongest activations of a filter in a high conv-layer, instead of illustrating knowledge hidden behind all activations of the filter; otherwise, the visualization result will be chaotic. Similarly, [2] selectively analyzed the semantics of the highest 0.5% activations of each filter. In comparison, we aim to purify the semantic meaning of each filter in a high conv-layer, i.e. letting most activations of a filter be explainable, instead of extracting meaningful neural activations for visualization.
Pattern retrieval: Unlike passive visualization, some methods actively retrieve certain units with certain meanings from CNNs. Just like mid-level features [30] of images, pattern retrieval mainly focuses on mid-level representations in conv-layers. For example, Zhou et al. [49], [50] selected units from feature maps to describe "scenes". Simon et al. discovered objects from feature maps of conv-layers [26], and selected certain filters to represent object parts [27]. Zhang et al. [42] extracted certain neural activations of a filter to represent object parts in a weakly-supervised manner. They also disentangled CNN representations via active question-answering and summarized the disentangled knowledge using an And-Or graph (AOG) [43]. [44] used human interactions to refine the AOG representation of CNN knowledge. [7] used a gradient-based method to explain visual question-answering. Other studies [12], [17], [34], [36] selected filters or neural activations with specific meanings from CNNs for various applications. Unlike the retrieval of meaningful neural activations from noisy features, our method aims to substantially boost the interpretability of features in intermediate conv-layers.
Model diagnosis: Many approaches have been proposed to diagnose CNN features, including exploring semantic meanings of convolutional filters [32], evaluating the transferability of filters [39], and the analysis of feature distributions of different categories [1]. The LIME [21] and the SHAP [18] are general methods to extract input units of a neural network that are used for the prediction score. For CNNs oriented to visual tasks, gradient-based visualization methods [6], [24] and [14] extracted image regions that are responsible for the network output, in order to clarify the logic of network prediction. These methods require people to manually check image regions accountable for the label prediction for each testing image. [10] extracted relationships between representations of various categories from a CNN. In contrast, given an interpretable CNN, people can directly identify object parts or filters that are used for prediction.
As discussed by Zhang et al. [45], knowledge representations of a CNN may be significantly biased due to dataset bias, even though the CNN sometimes exhibits good performance. For example, a CNN may extract unreliable contextual features for prediction. Network-attack methods [11], [31], [32] diagnosed network representation flaws using adversarial samples of a CNN. For example, influence functions [11] can be used to generate adversarial samples, in order to fix the training set and further debug representations of a CNN. [15] discovered blind spots of knowledge representation of a pre-trained CNN in a weakly-supervised manner.
Distilling neural networks into explainable models: Furthermore, some methods distilled CNN knowledge into another model with interpretable features for explanations. [33] distilled knowledge of a neural network into an additive model to explain the knowledge inside the network. [47] roughly represented the rationale of each CNN prediction using a semantic tree structure. Each node in the tree represented a decision-making mode of the CNN. Similarly, [41] used a semantic graph to summarize and explain all part knowledge hidden inside conv-layers of a CNN.

Learning interpretable feature representations
Unlike the diagnosis and visualization of pre-trained CNNs, some approaches were developed to learn meaningful feature representations in recent years. Automatically learning interpretable feature representations without additional human annotations poses new challenges to state-of-the-art algorithms. For example, [22] required people to label dimensions of the input that were related to each output, in order to learn a better model. Hu et al. [9] designed logic rules to regularize network outputs during the learning process. Sabour et al. [23] proposed a capsule model, where each feature dimension of a capsule may represent a specific meaning. Similarly, we invent a generic filter loss to regularize the representation of a filter to improve its interpretability.
In addition, unlike the visualization methods (e.g. the Grad-CAM method [24]) using a single saliency map for visualization, our interpretable CNN disentangles feature representations and uses different filters to represent the different object parts.

ALGORITHM
Given a target conv-layer of a CNN, we expect each filter in the conv-layer to be activated by a certain object part of a certain category, while remaining inactive on images of other categories¹. Let $\mathbf{I}$ denote a set of training images, where $\mathbf{I}_c \subset \mathbf{I}$ represents the subset that belongs to category $c$ ($c = 1, 2, \ldots, C$). Theoretically, we can use different types of losses to learn CNNs for multi-class classification and binary classification of a single class (i.e. $c = 1$ for images of a category and $c = 2$ for random images).
In the following paragraphs, we focus on the learning of a single filter $f$ in a conv-layer. Fig. 2 shows the structure of our interpretable conv-layer. We add a loss to the feature map $x$ of the filter $f$ after the ReLU operation. The filter loss $\mathrm{Loss}_f$ pushes the filter $f$ to represent a specific object part of the category $c$ and keep silent on images of other categories. Please see Section 3.2 for the determination of the category $c$ for the filter $f$. Let $\mathbf{X} = \{x \mid x = f(I) \in \mathbb{R}^{n \times n}, I \in \mathbf{I}\}$ denote a set of feature maps of $f$ after a ReLU operation w.r.t. different images. Given an input image $I \in \mathbf{I}_c$, the feature map in an intermediate layer $x = f(I)$ is an $n \times n$ matrix, $x_{ij} \ge 0$. If the target part appears, we expect the feature map $x = f(I)$ to activate exclusively at the target part's location; otherwise, the feature map should remain inactive.
Therefore, a high interpretability of the filter $f$ requires a high mutual information between the feature map $x = f(I)$ and the part location, i.e. the part location can roughly determine activations on the feature map $x$.
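Accordingly, we formulate the filter loss as the negative mutual information, as follows.
$$\mathrm{Loss}_f = -MI(\mathbf{X}; \Omega) = -\sum_{\mu \in \Omega} p(\mu) \sum_{x \in \mathbf{X}} p(x|\mu) \log \frac{p(x|\mu)}{p(x)} \qquad (1)$$
where $MI(\cdot)$ denotes the mutual information, and $\Omega$ denotes the set of candidate part locations $\mu$, i.e. the $n^2$ locations on the feature map together with a dummy location $\mu^-$ for the case in which the target part does not appear.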
Given an input image, the above loss forces each filter to match one and only one of the templates, i.e. the feature map of the filter contains at most a single significant activation peak. This ensures that each filter represents a specific object part.
• $p(\mu)$ measures the probability of the target part appearing at the location $\mu$. If annotations of part locations are given, then the computation of $p(\mu)$ is simple. People can manually assign a semantic part to the filter $f$, and then $p(\mu)$ can be determined using part annotations.
However, in our study, the target part of filter $f$ is not pre-defined before the learning process. Instead, the part corresponding to $f$ needs to be determined during the learning process. More crucially, we do not have any ground-truth annotations of the target part, which increases the difficulty of calculating $p(\mu)$.
• The conditional likelihood $p(x|\mu)$ measures the fitness between a feature map $x$ and the part location $\mu \in \Omega$. In order to simplify the computation of $p(x|\mu)$, we design $n^2$ templates for $f$, $\{T_{\mu_1}, T_{\mu_2}, \ldots, T_{\mu_{n^2}}\}$. As shown in Fig. 3, each template $T_{\mu_i}$ is an $n \times n$ matrix. $T_{\mu_i}$ describes the ideal distribution of activations for the feature map $x$ when the target part mainly triggers the $i$-th unit in $x$. In addition, we also design a negative template $T^-$ corresponding to the dummy location $\mu^-$. The feature map can match $T^-$ when the target part does not appear in the input image. In this study, the prior probability is given as
$$p(\mu_i) = \frac{\alpha}{n^2}, \qquad p(\mu^-) = 1 - \alpha,$$
where $\alpha$ is a constant prior likelihood.
Note that in Equation (1), we do not manually assign filters to different categories. Instead, we use the negative template $T^-$ (i.e. the dummy location $\mu^-$) to help the assignment of filters. That is, the negative template ensures that each filter represents a specific object part (if the input image does not contain the target part, then the input image is supposed to match $\mu^-$), which also ensures a clear assignment of filters to categories. Here, we assume that two categories do not share object parts, e.g. eyes of dogs and those of cats do not have similar contextual appearance.
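We define $p(x|\mu)$ below, which follows a standard form widely used in [25], [38].
$$p(x|\mu) = p(x|T_\mu) = \frac{1}{Z_\mu} \exp\left[\mathrm{tr}(x \cdot T_\mu)\right]$$
where $Z_\mu = \sum_{x \in \mathbf{X}} \exp[\mathrm{tr}(x \cdot T_\mu)]$, $\mathrm{tr}(\cdot)$ indicates the trace of a matrix, and $\mathrm{tr}(x \cdot T) = \sum_{ij} x_{ij} t_{ij}$.
Part templates: As shown in Fig. 3, a negative template is given as $T^- = (t^-_{ij})$, $t^-_{ij} = -\tau < 0$, where $\tau$ is a positive constant. A positive template corresponding to $\mu$ is given as $T_\mu = (t^+_{ij})$, $t^+_{ij} = \tau \cdot \max\left(1 - \beta \frac{\|[i,j] - \mu\|_1}{n}, -1\right)$, where $\|\cdot\|_1$ denotes the L-1 norm distance and $\beta$ is a constant parameter. Note that the lowest value in a positive template is -1 instead of 0. This is because the negative value in the template penalizes neural activations outside the domain of the highest activation peak, which ensures that each filter has at most a single significant activation peak.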
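The template construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: it assumes the positive template has the form $\tau \cdot \max(1 - \beta \|[i,j] - \mu\|_1 / n, -1)$ and the negative template is the constant $-\tau$, it borrows the values $\tau = 0.5/n^2$ and $\beta = 4$ from the implementation details reported later, it reads $\mathrm{tr}(x \cdot T)$ as $\sum_{ij} x_{ij} t_{ij}$, and it uses 0-based indices. All function names are hypothetical.

```python
import numpy as np

def positive_template(mu, n, tau, beta):
    """Assumed positive template T_mu: peak tau at mu, L-1 decay, floor at -tau."""
    rows, cols = np.indices((n, n))
    l1_dist = np.abs(rows - mu[0]) + np.abs(cols - mu[1])
    return tau * np.maximum(1.0 - beta * l1_dist / n, -1.0)

def negative_template(n, tau):
    """Assumed negative template T^-: every entry equals -tau."""
    return np.full((n, n), -tau)

def build_templates(n, beta=4.0):
    """Return the n^2 positive templates (row-major order) and the negative one."""
    tau = 0.5 / n**2  # value taken from the implementation details
    positives = np.stack([positive_template((i, j), n, tau, beta)
                          for i in range(n) for j in range(n)])
    return positives, negative_template(n, tau)

def trace_score(x, template):
    """tr(x . T), read here as sum_ij x_ij * t_ij (an assumption of this sketch)."""
    return float(np.sum(x * template))
```

A feature map with a single peak at location $\mu$ scores highest against $T_\mu$, because every other positive template weights that peak less and the negative template weights it negatively.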

Part localization & the mask layer
Given an input image $I$, the filter $f$ computes a feature map $x$ after the ReLU operation. Without ground-truth annotations of the target part for $f$, in this study, we determine the part location on $x$ during the learning process. We consider the neural unit with the strongest activation, $\hat{\mu} = \arg\max_{\mu = [i,j]} x_{ij}$, $1 \le i, j \le n$, as the target part location.
Regarding the templates in Fig. 3, the algorithm also supports a round template based on the L-2 norm distance. Here, we use the L-1 norm distance instead to speed up the computation.
As shown in Fig. 2, we add a mask layer above the interpretable conv-layer. Based on the estimated part position $\hat{\mu}$, the mask layer applies a specific mask to $x$ to filter out noisy activations. The mask operation is separate from the filter loss in Equation (1). Our method selects the template $T_{\hat{\mu}}$ w.r.t. the part location $\hat{\mu}$ as the mask. We compute $x^{\mathrm{masked}} = \max\{x \circ T_{\hat{\mu}}, 0\}$ as the output masked feature map, where $\circ$ denotes the Hadamard (element-wise) product. The mask operation supports gradient back-propagation.
Fig. 4 visualizes the masks $T_{\hat{\mu}}$ chosen for different images, and compares the original and masked feature maps. The CNN selects different templates for different images.
Note that although a filter usually has much stronger neural activations on the target category than on other categories, the magnitude of neural activations is still not discriminative enough for classification. Moreover, during the testing process, people do not have ground-truth class labels of input images. Thus, to ensure stable feature extraction, our method only selects masks from the $n^2$ positive templates $\{T_{\mu_i}\}$ and omits the negative template $T^-$ for all images, regardless of whether input images contain the target part. This operation is conducted during the forward pass in both training and testing.
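The mask operation described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the positive template uses the L-1 form $\tau \cdot \max(1 - \beta \|[i,j] - \hat{\mu}\|_1 / n, -1)$ with $\tau = 0.5/n^2$ and $\beta = 4$, indices are 0-based, and the function names are hypothetical. Only positive templates are used, as stated in the text.

```python
import numpy as np

def l1_template(mu, n, tau, beta):
    """Assumed positive template centered at mu (L-1 distance, floor at -tau)."""
    rows, cols = np.indices((n, n))
    l1_dist = np.abs(rows - mu[0]) + np.abs(cols - mu[1])
    return tau * np.maximum(1.0 - beta * l1_dist / n, -1.0)

def mask_feature_map(x, beta=4.0):
    """Return (x_masked, mu_hat) for one post-ReLU feature map x of shape (n, n)."""
    n = x.shape[0]
    tau = 0.5 / n**2
    # Estimated part location: the unit with the strongest activation.
    mu_hat = tuple(int(v) for v in np.unravel_index(np.argmax(x), x.shape))
    template = l1_template(mu_hat, n, tau, beta)
    # Hadamard product with the selected template, then clip negatives to zero.
    x_masked = np.maximum(x * template, 0.0)
    return x_masked, mu_hat
```

Activations far from the peak fall where the template is negative, so they are zeroed out, while activations near the peak are kept and rescaled.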

Learning
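We train the interpretable CNN in an end-to-end manner. During the forward-propagation process, each filter in the CNN passes its information in a bottom-up manner, just like traditional CNNs. During the back-propagation, each filter in an interpretable conv-layer receives gradients w.r.t. its feature map $x$ from both the final task loss $L(\hat{y}_k, y^*_k)$ on the $k$-th sample and the filter loss $\mathrm{Loss}_f$, as follows:
$$\frac{\partial \mathrm{Loss}}{\partial x_{ij}} = \lambda \frac{\partial \mathrm{Loss}_f}{\partial x_{ij}} + \frac{1}{N} \sum_{k=1}^{N} \frac{\partial L(\hat{y}_k, y^*_k)}{\partial x_{ij}}$$
where $\lambda$ is a weight and $N$ is the number of samples. Then, we back-propagate $\frac{\partial \mathrm{Loss}}{\partial x_{ij}}$ to lower layers and compute gradients w.r.t. feature maps and gradients w.r.t. parameters in lower layers to update the CNN.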
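For implementation, gradients of $\mathrm{Loss}_f$ w.r.t. each element $x_{ij}$ of feature map $x$ are computed as follows.
$$\frac{\partial \mathrm{Loss}_f}{\partial x_{ij}} = -\sum_{\mu \in \Omega} \frac{p(\mu)\, t_{ij}}{Z_\mu} e^{\mathrm{tr}(x \cdot T_\mu)} \left\{ \mathrm{tr}(x \cdot T_\mu) - \log\left[Z_\mu\, p(x)\right] \right\} \approx -\frac{p(\hat{\mu})\, \hat{t}_{ij}}{Z_{\hat{\mu}}} e^{\mathrm{tr}(x \cdot T_{\hat{\mu}})} \left\{ \mathrm{tr}(x \cdot T_{\hat{\mu}}) - \log Z_{\hat{\mu}} - \log p(x) \right\} \qquad (4)$$
where $T_{\hat{\mu}}$ is the target template for feature map $x$, and $t_{ij}$ and $\hat{t}_{ij}$ denote the $(i,j)$-th elements of $T_\mu$ and $T_{\hat{\mu}}$, respectively. If the input image $I$ belongs to the target category of filter $f$, then $\hat{\mu} = \arg\max_{\mu = [i,j]} x_{ij}$. If image $I$ belongs to other categories, then $\hat{\mu} = \mu^-$. Please see the appendix for the proof of the above equation.
Considering that $\forall \mu \in \Omega \setminus \{\hat{\mu}\}$, $e^{\mathrm{tr}(x \cdot T_{\hat{\mu}})} \gg e^{\mathrm{tr}(x \cdot T_\mu)}$ and $p(\hat{\mu}) \gg p(\mu)$ after initial learning episodes, we make the above approximation to simplify the computation. Because $Z_{\hat{\mu}}$ is computed using numerous feature maps, we can roughly treat $Z_{\hat{\mu}}$ as a constant when computing gradients in the above equation. We gradually update the value of $Z_{\hat{\mu}}$ during the training process. More specifically, we can use a subset of feature maps to approximate the value of $Z_\mu$, and continue to update $Z_\mu$ when we receive more feature maps during the training process. Similarly, we can approximate $p(x)$ using a subset of feature maps. We can also approximate p [...]
Determining the target category for each filter: We need to assign each filter $f$ a target category $\hat{c}$ to approximate gradients in Equation (4). We simply assign the filter $f$ the category $\hat{c}$ whose images activate $f$ the most, i.e. $\hat{c} = \arg\max_c \mathbb{E}_{x = f(I): I \in \mathbf{I}_c} \sum_{ij} x_{ij}$.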
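The category assignment and the resulting choice of target location can be illustrated with a short NumPy sketch. It is a hedged example rather than the authors' implementation: categories are assumed to be integer labels in `[0, C)`, feature maps of one filter are assumed to be stacked in an array of shape `(N, n, n)`, the dummy location $\mu^-$ is represented by `None`, and both function names are hypothetical.

```python
import numpy as np

def assign_target_category(feature_maps, labels, num_categories):
    """c_hat = argmax_c E_{x: I in I_c} sum_ij x_ij for a single filter."""
    total_activation = feature_maps.reshape(len(feature_maps), -1).sum(axis=1)
    mean_per_category = np.full(num_categories, -np.inf)
    for c in range(num_categories):
        in_c = labels == c
        if in_c.any():  # categories without images can never be selected
            mean_per_category[c] = total_activation[in_c].mean()
    return int(np.argmax(mean_per_category))

def target_location(x, label, c_hat):
    """mu_hat: strongest unit if the image is of the target category, else mu^- (None)."""
    if label != c_hat:
        return None
    return tuple(int(v) for v in np.unravel_index(np.argmax(x), x.shape))
```

In practice the expectation would be re-estimated as training proceeds, in the same spirit as the running estimates of $Z_\mu$ and $p(x)$ mentioned above; the sketch computes it once over a fixed set of maps.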

Understanding the filter loss
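The filter loss in Equation (1) can be re-written as
$$\mathrm{Loss}_f = -H(\Omega) + H(\Omega'|\mathbf{X}) + \sum_{x} p(\Omega^+, x)\, H(\Omega^+|X = x) \qquad (5)$$
where $\Omega' = \{\mu^-, \Omega^+\}$, and $H(\Omega) = -\sum_{\mu \in \Omega} p(\mu) \log p(\mu)$ is a constant prior entropy of part locations. Thus, the filter loss minimizes two conditional entropies, $H(\Omega'|\mathbf{X})$ and $H(\Omega^+|X = x)$. Please see the appendix for the proof of the above equation.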

• Low inter-category entropy: The second term $H(\Omega' = \{\mu^-, \Omega^+\}|\mathbf{X})$ is computed as
$$H(\Omega'|\mathbf{X}) = -\sum_{x} p(x) \sum_{\mu \in \{\mu^-, \Omega^+\}} p(\mu|x) \log p(\mu|x)$$
where $\Omega^+ = \{\mu_1, \ldots, \mu_{n^2}\} \subset \Omega$ and $p(\Omega^+|x) = \sum_{\mu \in \Omega^+} p(\mu|x)$. We define the set of all real locations $\Omega^+$ as a single label to represent category $c$. We use the dummy location $\mu^-$ to roughly indicate matches to other categories.
This term encourages a low conditional entropy of inter-category activations, i.e. a well-learned filter $f$ needs to be exclusively activated by a certain category $c$ and keep silent on other categories. We can use a feature map $x$ of $f$ to identify whether or not the input image belongs to category $c$, i.e. whether $x$ fits $T_{\hat{\mu}}$ or $T^-$, without significant uncertainty.

Fig. 4. Given an input image $I$, from left to right, we consecutively show the feature map $x$ of a filter after the ReLU layer, the assigned mask $T_{\hat{\mu}}$, the masked feature map $x^{\mathrm{masked}}$, and the image-resolution RF of activations in $x^{\mathrm{masked}}$ computed by [49].

• Low spatial entropy: The third term in Equation (5) is given as
$$H(\Omega^+|X = x) = -\sum_{\mu \in \Omega^+} \tilde{p}(\mu|x) \log \tilde{p}(\mu|x)$$
where $\tilde{p}(\mu|x) = \frac{p(\mu|x)}{p(\Omega^+|x)}$. This term encourages a low conditional entropy of the spatial distribution of $x$'s activations. That is, given an image $I \in \mathbf{I}_c$, a well-learned filter should only be activated in a single region $\hat{\mu}$ of the feature map $x$, instead of being repetitively triggered at different locations.
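The split of the negative mutual information into a prior entropy, an inter-category entropy, and a spatial entropy can be checked numerically. The sketch below assumes the decomposition $-MI(\mathbf{X};\Omega) = -H(\Omega) + H(\Omega'|\mathbf{X}) + \sum_x p(\Omega^+, x) H(\Omega^+|X = x)$ and verifies it on an arbitrary strictly positive joint table, with row 0 playing the role of $\mu^-$. The table is random and purely illustrative; it is not derived from any CNN, and the function names are hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (natural log) of a probability vector."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def decomposition_terms(joint):
    """joint[m, k] = p(mu_m, x_k) > 0; row 0 is mu^-, rows 1.. form Omega+."""
    p_mu = joint.sum(axis=1)
    p_x = joint.sum(axis=0)
    # Negative mutual information between locations and feature maps.
    neg_mi = -float((joint * np.log(joint / np.outer(p_mu, p_x))).sum())
    prior_entropy = entropy(p_mu)              # H(Omega)
    posterior = joint / p_x[None, :]           # p(mu | x)
    p_pos = posterior[1:].sum(axis=0)          # p(Omega+ | x)
    num_x = joint.shape[1]
    # H(Omega' | X): uncertainty between mu^- and "some real location".
    inter = sum(p_x[k] * entropy(np.array([posterior[0, k], p_pos[k]]))
                for k in range(num_x))
    # sum_x p(Omega+, x) H(Omega+ | X = x): uncertainty over real locations.
    spatial = sum(p_x[k] * p_pos[k] * entropy(posterior[1:, k] / p_pos[k])
                  for k in range(num_x))
    return neg_mi, prior_entropy, inter, spatial
```

For any such table the first returned value equals `-prior_entropy + inter + spatial` up to floating-point error, which is the chain rule of entropy applied to the grouping of all real locations into one label.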

EXPERIMENTS
In experiments, we applied our method to modify four types of CNNs with various structures into interpretable CNNs and learned interpretable CNNs based on three benchmark datasets, in order to demonstrate the broad applicability. We learned interpretable CNNs for binary classification of a single category and multi-category classification. We used different techniques to visualize the knowledge encoded in interpretable filters, in order to qualitatively illustrate semantic meanings of these filters. Furthermore, we used two types of evaluation metrics, i.e. the object-part interpretability and the location instability, to measure the clarity of the meaning of a filter.
Our experiments showed that an interpretable filter in our interpretable CNN usually consistently represented the same part through different input images, while a filter in an ordinary CNN mainly described a mixture of semantics.
We chose three benchmark datasets with part annotations for training and testing, including the ILSVRC 2013 DET Animal-Part dataset [42], the CUB200-2011 dataset [35], and the VOC Part dataset [4]. These datasets provide ground-truth bounding boxes of entire objects. For landmark annotations, the ILSVRC 2013 DET Animal-Part dataset [42] contains ground-truth bounding boxes of heads and legs of 30 animal categories. The CUB200-2011 dataset [35] contains a total of 11.8K bird images of 200 species, and the dataset provides center positions of 15 bird landmarks. The VOC Part dataset [4] contains ground-truth part segmentations of 107 object landmarks in six animal categories.
We used these datasets because they contain ground-truth annotations of object landmarks² (parts) to evaluate the semantic clarity of each filter. As mentioned in [4], [42], animals usually consist of non-rigid parts, which present considerable challenges for part localization. As in [4], [42], we selected animal categories in the three datasets for testing.
We learned interpretable filters based on structures of four typical CNNs for evaluation, including the AlexNet [13], the VGG-M [29], the VGG-S [29], and the VGG-16 [29]. Note that skip connections in residual networks [8] make a single feature map contain patterns of different filters. Thus, to simplify the evaluation, we did not use residual networks for testing. Given a CNN, all filters in the top conv-layer were set as interpretable filters. Then, we inserted another conv-layer with M filters above the top conv-layer, which did not change the size of output feature maps. Specifically, we set M = 512 for the VGG-16, VGG-M, and VGG-S networks, and M = 256 for the AlexNet. Filters in the new conv-layer were also interpretable filters. Each filter was a 3 × 3 × M tensor with a bias term.

2. To avoid ambiguity, a landmark is referred to as the central position of a semantic part (a part with an explicit name, e.g. a head, a tail). In contrast, the part corresponding to a filter does not have an explicit name.

Fig. 5. Visualization of filters in top conv-layers. (top) We used [49] to estimate the image-resolution receptive field of activations in a feature map to visualize a filter's semantics. Each group of four feature maps for a category is computed using the same interpretable filter. These images show that each interpretable filter is consistently activated by the same object part through different images. Four rows visualize filters in interpretable CNNs, and two rows correspond to filters in ordinary CNNs. (bottom) The clear disentanglement of object-part representations helps people to quantify the contribution of different object parts to the network prediction. We show the explanation for part contribution, which was generated by the method of [3].

Implementation details:
We set parameters as $\tau = \frac{0.5}{n^2}$, $\alpha = \frac{n^2}{1+n^2}$, and $\beta \approx 4$. $\beta$ was updated during the learning process. We set a decreasing weight for filter losses, i.e. $\lambda \propto \frac{1}{t} \mathbb{E}_{x \in \mathbf{X}} \max_{i,j} x_{ij}$ for the $t$-th epoch. We initialized fully-connected (FC) layers and the new conv-layer, but we loaded parameters of the lower conv-layers from a CNN that was pre-trained using [13], [29]. We then fine-tuned parameters of all layers in the interpretable CNN using training images in the dataset. To enable a fair comparison, when we learned the traditional CNN as a baseline, we also initialized FC layers of the traditional CNN, used pre-trained parameters in conv-layers, and then fine-tuned the CNN.

Fig. 6. Heatmaps for distributions of object parts that are encoded in interpretable filters (each panel pairs an image with its heatmap). We use all filters in the top conv-layer to compute the heatmap. Interpretable filters usually selectively modeled distinct object parts of a category and ignored other parts.

Experiments
Binary classification of a single category: We learned interpretable CNNs based on the above four types of network structures to classify each animal category in the above three datasets. We also learned ordinary CNNs using the same data for comparison. We used the logistic log loss for binary classification of a single category from random images. We followed experimental settings in [41], [42] to crop objects of the target category as positive samples. Images of other categories were regarded as negative samples.
Multi-category classification: We learned interpretable CNNs to classify the six animal categories in the VOC Part dataset [4] and also learned interpretable CNNs to classify the thirty categories in the ILSVRC 2013 DET Animal-Part dataset [42]. In experiments, we tried both the softmax log loss and the logistic log loss³ for multi-category classification.

Qualitative Visualization of filters
We followed the method proposed by Zhou et al. [49] to compute the receptive fields (RFs) of neural activations of a filter. We used neural activations after ReLU and mask operations and scaled up RFs to the image resolution. As discussed in [2], the traditional idea of directly propagating the theoretical receptive field of a neural unit in a feature map back to the image plane cannot accurately reflect the real image-resolution RF of the neural unit (i.e. the image region that contributes most to the score of the neural unit). Therefore, we used the method of [49] to compute real RFs.

3. We considered the output $y_c$ for each category $c$ independent of outputs for other categories, so that a CNN made multiple independent binary classifications of different categories for each image. Table 7 reported the average accuracy of the multiple classification outputs of an image.
Studies in both [49] and [2] have introduced methods to compute real RFs of neural activations on a given feature map.For ordinary CNNs, we simply used a round RF for each neural activation.We overlapped all activated RFs in a feature map to compute the final RF of the feature map.
Fig. 5 shows RFs⁴ of filters in top conv-layers of CNNs, which were trained for binary classification of a single category. Filters in interpretable CNNs were mainly activated by a certain object part, whereas feature maps of ordinary CNNs after ReLU operations usually represented various object parts and textures. The clear disentanglement of object-part representations can help people to quantify the contribution of different object parts to the network prediction. Fig. 5 shows the explanation for part contribution, which was generated by the method of [3].
We found that interpretable CNNs usually encoded head patterns of animals in their top conv-layers for classification, although no part annotations were used to train the CNNs. We can understand such results from the perspective of the information bottleneck [37] as follows. (i) Our interpretable filters selectively encode the most distinct parts of each category (i.e. the head for most categories), which minimizes the conditional entropy of the final classification given feature maps of a conv-layer. (ii) Each interpretable filter represents a specific part of an object, which minimizes the mutual information between the input image and middle-layer feature maps. The interpretable CNN "forgets" as much irrelevant information as possible.

Fig. 7. Grad-CAM visualizations [24] of the traditional conv-layer and the interpretable conv-layer. Unlike the traditional conv-layer, the interpretable conv-layer usually selectively modeled distinct object parts of a category and ignored other parts.
In addition to the visualization of RFs, we also visualized heatmaps for part distributions and the Grad-CAM attention map of an interpretable conv-layer. Fig. 6 shows heatmaps for distributions of object parts that were encoded in interpretable filters. Fig. 7 compares Grad-CAM visualizations [24] of an interpretable conv-layer and those of a traditional conv-layer. We chose the top conv-layer of the traditional VGG-16 net and the top conv-layer of the interpretable VGG-16 net for visualization. Interpretable filters usually selectively modeled distinct object parts of a category and ignored other parts.

Quantitative evaluation of part interpretability
Filters in low conv-layers usually represent simple patterns or object details, whereas those in high conv-layers are more likely to describe large-scale parts. Therefore, in experiments, we used the following two metrics to evaluate the clarity of part semantics of the top conv-layer of a CNN.

Evaluation metric: part interpretability
The metric was originally proposed by Bau et al. [2] to measure the object-part interpretability of filters. For each filter $f$, $X$ denotes a set of feature maps after ReLU/mask operations on different input images. Then, the distribution of activation scores over all positions in all feature maps was computed. [2] set a threshold $T_f$ such that $p(x_{ij} > T_f) = 0.005$ to select the strongest activations among all positions $[i, j]$ of $x \in X$ as valid activations for $f$'s semantics. The criterion $IoU^I_{f,k} > 0.2$ was stricter than $IoU^I_{f,k} > 0.04$ in [2], because object-part semantics usually need a stricter criterion than the textural semantics and color semantics in [2]. The average probability of the $k$-th part being associated with the filter $f$ was reported as $P_{f,k} = \mathbb{E}_{I:\,\text{with } k\text{-th part}}\, \mathbb{1}(IoU^I_{f,k} > 0.2)$. Note that a single filter may be associated with multiple object parts in an image. The highest probability of part association for each filter was used as the interpretability of filter $f$, i.e. $P_f = \max_k P_{f,k}$.
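The metric can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the evaluation code used in the experiments: the function names are hypothetical, the image-resolution regions of the filter are assumed to be given as boolean masks (their computation by [49] is not reproduced), and each image is assumed to come with a dictionary of the ground-truth part masks it contains.

```python
import numpy as np

def activation_threshold(feature_maps, tail_prob=0.005):
    """Threshold T_f such that p(x_ij > T_f) = tail_prob over all positions of all maps."""
    scores = np.concatenate([np.asarray(x, dtype=float).ravel() for x in feature_maps])
    return np.quantile(scores, 1.0 - tail_prob)

def iou(region_a, region_b):
    """Intersection-over-union of two boolean masks of equal shape."""
    union = np.logical_or(region_a, region_b).sum()
    if union == 0:
        return 0.0
    return np.logical_and(region_a, region_b).sum() / union

def part_interpretability(filter_regions, part_masks, iou_threshold=0.2):
    """Return (P_f, {part: P_{f,k}}) for one filter.

    filter_regions: one boolean image-resolution mask per image (the filter's RF region).
    part_masks: one dict {part_name: boolean mask} per image, listing the parts present.
    """
    hits, counts = {}, {}
    for region, parts in zip(filter_regions, part_masks):
        for name, mask in parts.items():
            # P_{f,k} averages only over images that contain the k-th part.
            counts[name] = counts.get(name, 0) + 1
            hits[name] = hits.get(name, 0) + int(iou(region, mask) > iou_threshold)
    p_fk = {name: hits[name] / counts[name] for name in counts}
    return max(p_fk.values()), p_fk
```

Averaging the returned P_f over all filters of the top conv-layer gives the layer-level score reported in the tables.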
For the binary classification of a single category, we used testing images of the target category to evaluate the feature interpretability. In the VOC Part dataset [4], four parts were chosen for the bird category. We merged segments of the head, beak, and l/r-eyes as the head part, merged segments of the torso, neck, and l/r-wings as the torso part, merged segments of l/r-legs/feet as the leg part, and used the tail segment as the fourth part. We used five parts for both the cat category and the dog category. We merged segments of the head, l/r-eyes, l/r-ears, and nose as the head part, merged segments of the torso and neck as the torso part, merged segments of frontal l/r-legs/paws as the frontal legs, merged segments of back l/r-legs/paws as the back legs, and used the tail as the fifth part. Part definitions for the cow, horse, and sheep categories were similar to those for the cat category, except that we omitted the tail part of these categories. In particular, we added l/r-horn segments of the horse to the head part. The average part interpretability $P_f$ over all filters was computed for evaluation.
For the multi-category classification, we first determined the target category $\hat{c}$ for each filter $f$, i.e. $\hat{c} = \arg\max_c \mathbb{E}_{x = f(I):\, I \in \mathbf{I}_c} \sum_{i,j} x_{ij}$. Then, we computed $f$'s object-part interpretability using images of the target category $\hat{c}$, following the procedure above.
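The assignment of a target category can be illustrated as follows. This is a small sketch with a hypothetical function name and input layout: for one filter, the feature maps are assumed to be grouped by the category of the input image, and the category with the largest mean summed activation is selected.

```python
import numpy as np

def target_category(feature_maps_by_category):
    """Return (c_hat, scores), where c_hat maximizes the mean over images of sum_ij x_ij.

    feature_maps_by_category: dict {category: list of 2-D feature maps of one filter}.
    """
    mean_scores = {
        category: float(np.mean([np.sum(x) for x in maps]))
        for category, maps in feature_maps_by_category.items()
    }
    return max(mean_scores, key=mean_scores.get), mean_scores
```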
4. [49] computes the RF when the filter represents an object part. Fig. 5 used RFs computed by [49] to visualize filters. However, when a filter in an ordinary CNN does not have consistent contours, it is difficult for [49] to align different images to compute an average RF. Thus, for ordinary CNNs, we simply used a round RF for each valid activation. We overlapped all activated RFs in a feature map to compute the final RF as mentioned in [2]. For a fair comparison, in Section 4.3.1, we uniformly applied these RFs to both interpretable CNNs and ordinary CNNs.

Evaluation metric: location instability
The second metric measures the instability of part locations, which was used in [41], [46]. It is assumed that if $f$ consistently represented the same object part across different objects, then distances between the inferred part $\mu$ and some ground-truth landmarks² should remain stable among different objects. For example, if $f$ represented the shoulder part without ambiguity, then the distance between the inferred position and the head would not change much among different objects. Therefore, the deviation of the distance between the inferred position $\mu$ and a specific ground-truth landmark among different images was computed. The location $\mu$ was inferred as the neural unit with the highest activation on $f$'s feature map. We reported the average deviation w.r.t. different landmarks as the location instability of $f$. Please see Fig. 8. Because not every landmark appeared in every testing image, for each filter $f$, the metric only used inference results on the top-ranked 100 images with the highest inference scores to compute $D_{f,k}$. In this way, the average of relative location deviations of all the filters in a conv-layer w.r.t. all $K$ landmarks, i.e. $\mathbb{E}_f \mathbb{E}_{k=1}^{K} D_{f,k}$, was reported as the overall location instability. We used the most frequent object parts as landmarks to measure the location instability. For the ILSVRC 2013 DET Animal-Part dataset [42], we used the head and frontal legs of each category as landmarks for evaluation. For the VOC Part dataset [4], we selected the head, neck, and torso of each category as landmarks. For the CUB200-2011 dataset [35], we used the head, back, and tail of birds as landmarks.
In particular, for multi-category classification, we first determined the target category for each filter $f$ and then computed the relative location deviation $D_{f,k}$ using landmarks of $f$'s target category. Because filters in baseline CNNs did not exclusively represent a single category, we simply assigned filter $f$ to the category whose landmarks achieved the lowest location deviation to simplify the computation. That is, for a baseline CNN, we used $\mathbb{E}_f \min_c \mathbb{E}_{k \in Part_c} D_{f,k}$ to evaluate the location instability, where $Part_c$ denotes the set of part indexes belonging to category $c$.
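The location-instability computation can be sketched as below. The sketch relies on the definitions given elsewhere in the paper: the normalized distance $d_I$ is the distance between a landmark and the center of the inferred unit's RF divided by the image diagonal, and $D_{f,k}$ is the variation of $d_I$ over images. The function names and array layouts are hypothetical, the variation is computed with `np.var` as the text literally states, and the selection of the top-ranked 100 images is assumed to have been done beforehand.

```python
import numpy as np

def normalized_distance(landmark_xy, inferred_xy, width, height):
    """d_I: landmark-to-inferred-position distance divided by the image diagonal."""
    return np.linalg.norm(np.subtract(landmark_xy, inferred_xy)) / np.hypot(width, height)

def relative_location_deviation(distances):
    """D_{f,k}: variation of d_I over the selected images (np.var, an assumption)."""
    return float(np.var(distances))

def location_instability(deviations):
    """Mean of D_{f,k} over all filters and landmarks; deviations has shape (F, K)."""
    return float(np.mean(deviations))

def baseline_location_instability(deviations, parts_by_category):
    """E_f min_c E_{k in Part_c} D_{f,k}, used when filters have no target category.

    parts_by_category: dict {category: list of landmark column indexes}.
    """
    deviations = np.asarray(deviations, dtype=float)
    per_category = np.stack(
        [deviations[:, idx].mean(axis=1) for idx in parts_by_category.values()], axis=1
    )
    # Each filter is assigned the category with the lowest mean deviation.
    return float(per_category.min(axis=1).mean())
```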

Comparisons between metrics of filter interpretability and location instability
Although the filter interpretability [2] and the location instability [46] are two state-of-the-art metrics to evaluate the interpretability of a convolutional filter, these metrics still have some limitations.
Firstly, the filter interpretability [2] assumes that the feature map of an automatically learned filter should closely match the ground-truth segment of a semantic part (with an explicit part name), an object, or a texture. For example, it assumes that a filter may represent the exact segment of the head part. However, without ground-truth annotations of object parts or textures for supervision, there is no mechanism to assign explicit semantic meanings to filters during the learning process. In most cases, filters in an interpretable CNN (as well as a few filters in traditional CNNs) may describe a specific object part without an explicit name, e.g. the region of both the head and neck or the region connecting the torso and the tail. Therefore, in both [2] and [46], people did not require the inferred object region to describe the exact segment of a semantic part, and simply set a relatively loose criterion $IoU^I_{f,k} > 0.04$ or $0.2$ to compute the filter interpretability.
Secondly, the location instability was proposed in [46]. The location instability of a filter is evaluated using the average deviation of distances between the inferred position and some ground-truth landmarks. This evaluation metric also rests on an assumption, i.e. the distance between an inferred part and a specific landmark should not change much across different images. As a result, people cannot set landmarks as the head and the tail of a snake, because the distance between different parts of a snake continuously changes when the snake moves. Generally speaking, there are two advantages of using the location instability for evaluation:
• The computation of the location instability [46] is independent of the size of the receptive field (RF) of a neural activation. This solves a big problem with the evaluation of filter interpretability, i.e. state-of-the-art methods of computing a neural activation's image-resolution RFs (e.g. [49]) can only provide an approximate scale of the RF. The metric of location instability only uses central positions of part inferences of a filter, rather than the entire inferred part segment, for evaluation. Thus, the location instability is a robust metric to evaluate the object-part interpretability of a filter.
• The location instability allows a filter to represent an object part without an explicit name (e.g. a half of the head).
Nevertheless, the evaluation metric for filter interpretability is still an open problem.

Robustness to adversarial attacks
In this experiment, we applied adversarial attacks [32] to both original CNNs and interpretable CNNs. The CNNs were trained to distinguish birds in the CUB200-2011 dataset from random images. Table 15 compares the average adversarial distortion of the adversarial signal among all images between original CNNs and interpretable CNNs, where $I$ represents the input image and $I'$ denotes its adversarial counterpart. Because interpretable CNNs exclusively encoded object-part patterns and ignored textures, original CNNs usually exhibited stronger robustness to adversarial attacks than interpretable CNNs.

Experimental results and analysis
Feature interpretability of different CNNs is evaluated in Tables 1, 2, 3, 4, 5, and 6. Tables 1 and 2 show results based on the metric in [2]. Tables 3, 4, and 5 list location instability of CNNs for binary classification of a single category. Table 6 reports location instability of CNNs that were learned for multi-category classification.
We compared our interpretable CNNs with two types of CNNs, i.e. the original CNN and the CNN with an additional conv-layer on the top (termed AlexNet/VGG-16/VGG-M/VGG-S+ordinary layer). To construct the CNN with a new conv-layer, we put a new conv-layer on top of the existing top conv-layer. The filter size of the new conv-layer was 3 × 3 × channel number, and output feature maps of the new conv-layer were the same size as the input feature maps. Because our interpretable CNN had an additional interpretable conv-layer, we designed the baseline CNN with a new conv-layer to enable fair comparisons. Our interpretable filters exhibited significantly higher part interpretability and lower location instability than traditional filters in baseline CNNs over almost all comparisons. Table 7 reports the classification accuracy of different CNNs. Ordinary CNNs exhibited better performance in binary classification, while interpretable CNNs outperformed baseline CNNs in multi-category classification.
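As a quick check of the size constraint on the added conv-layer, the standard output-size formula for a convolution shows that a 3 × 3 filter preserves the spatial size when used with stride 1 and one pixel of zero padding. The padding and stride values are assumptions made for this illustration; the text only states the filter size and that the output size equals the input size.

```python
def conv_output_size(input_size, kernel_size=3, padding=1, stride=1):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1
```

For instance, a 14 × 14 input feature map stays 14 × 14 under these settings, whereas the same filter without padding would shrink it to 12 × 12.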
In addition, to demonstrate the discrimination power of the learned filters, we further tested the average accuracy when we used the maximum activation score in a single filter's feature map as a metric for binary classification between birds in the CUB200-2011 dataset [35] and random images. In the scenario of classifying birds from random images, filters in the CNN were expected to learn the common appearance of birds, instead of summarizing knowledge from random images. Thus, we chose filters in the top conv-layer. If the maximum activation score of a filter exceeded a threshold, then we classified the input image as a bird; otherwise not. The threshold was set to the one that maximized the classification accuracy. Table 8 reports the average classification accuracy over all filters. Our interpretable filters outperformed ordinary filters.
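The single-filter classification test can be sketched as follows. This is an illustrative sketch with hypothetical function names: each filter is reduced to its maximum activation per image, the decision rule is "bird if the score exceeds a threshold", and the threshold is selected to maximize accuracy on the same set of images, as described above.

```python
import numpy as np

def best_threshold_accuracy(max_activations, is_bird):
    """Return (accuracy, threshold) of the rule 'bird if score > threshold' at its best threshold."""
    scores = np.asarray(max_activations, dtype=float)
    labels = np.asarray(is_bird, dtype=bool)
    # Candidates: one value below all scores (everything is a bird) plus each observed score.
    candidates = np.concatenate(([scores.min() - 1.0], np.unique(scores)))
    accuracies = [np.mean((scores > t) == labels) for t in candidates]
    best = int(np.argmax(accuracies))
    return float(accuracies[best]), float(candidates[best])

def mean_filter_accuracy(max_activations_per_filter, is_bird):
    """Average of the best single-filter accuracies over all filters."""
    return float(np.mean([best_threshold_accuracy(a, is_bird)[0]
                          for a in max_activations_per_filter]))
```

Only the observed scores need to be tried as thresholds, because the predictions change only when the threshold crosses one of them.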
Given a CNN for binary classification of an animal category in the VOC Part dataset [4], we manually annotated the part name corresponding to the learned filters in the CNN. Table 9 reports the ratio of interpretable filters that correspond to each object part.
Besides, we also analyzed samples that were incorrectly classified by the interpretable CNN. We used VGG-16 networks for the binary classification of an animal category in the VOC Part dataset [4]. We annotated the object-part name corresponding to each interpretable filter in the top interpretable layer. For each false positive sample without the target category, Fig. 9 localized the image regions that were incorrectly detected as specific object parts by interpretable filters. This figure helped people understand the reason for misclassification.

Effects of the filter loss
In this section, we evaluated effects of the filter loss. We compared the interpretable CNN learned with the filter loss with the one learned without the filter loss (i.e. only using the mask layer without the filter loss).

Semantic purity of neural activations
We proposed a metric to measure the semantic purity of neural activations of a filter.If a filter was activated at multiple locations besides the highest peak (i.e. the one corresponding to the target part), we considered this filter to have low semantic purity.
The semantic purity of a filter was measured as the ratio of neural activations within the range of the mask to all neural activations of the filter. In other words, the purity of a filter indicated whether the filter was learned to represent a single part or multiple parts. Let $x \in \mathbb{R}^{n \times n}$ be neural activations of a filter (after the ReLU layer and before the mask layer). The corresponding mask was given as $T_{\mu} \in \mathbb{R}^{n \times n}$. The purity of neural activations was defined as $\text{purity} = \frac{\sum_{i,j} \mathbb{1}(T_{\mu,ij} > 0)\, x_{ij}}{\sum_{i,j} x_{ij}}$. $\mathbb{1}(\cdot)$ was the indicator function, which returns 1 if the condition in the parentheses is satisfied, and returns 0 otherwise. The purity was supposed to be higher if neural activations were more concentrated.
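A minimal sketch of the purity computation for a single feature map is given below. It assumes that "within the range of the mask" means the positions where the mask value is positive, and it returns 0 for a feature map without any activation; both choices, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def semantic_purity(x, mask):
    """Ratio of the activation mass inside the mask to the total activation mass.

    x: non-negative activations of one filter (after ReLU, before the mask layer).
    mask: the template assigned to this feature map, same shape as x.
    """
    x = np.asarray(x, dtype=float)
    inside = np.asarray(mask) > 0  # assumed meaning of "within the range of the mask"
    total = x.sum()
    if total == 0:
        return 0.0  # no activation at all; purity is set to 0 by convention here
    return float(x[inside].sum() / total)
```

A filter that fires only around its peak yields a purity close to 1, whereas scattered activations outside the mask lower the score.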
We compared the purity of neural activations between the interpretable CNN and the CNN trained with the mask layer but without the filter loss. We constructed these CNNs based on the architectures of VGG-16, VGG-M and VGG-S, and learned the CNNs for binary classification of an animal category in the VOC Part dataset and the ILSVRC 2013 DET Animal-Part dataset. Experimental results are shown in Table 10 and Table 11. The results demonstrated that filters learned with the filter loss exhibited higher semantic purity than those learned without the filter loss. The filter loss forced each filter to exclusively represent a single object part during the training process.

Visualization of filters
Besides the quantitative analysis of neural activation purity, we visualized filter activations to compare the interpretable CNN and the CNN trained without the filter loss. Fig. 10 visualized neural activations of filters in the first interpretable conv-layer of VGG-16 before the mask layer. Visualization results demonstrated that filters trained with the filter loss could generate more concentrated neural activations.

Location instability
We compared the location instability among interpretable filters learned with the filter loss, those learned without the filter loss, and ordinary filters. Here, we used filters in the first interpretable conv-layer of the interpretable CNN and filters in the corresponding conv-layer of the traditional CNNs for comparison.
We constructed the competing CNNs based on the VGG-16 architecture, and these CNNs were trained for single-category classification based on the VOC Part dataset and the ILSVRC 2013 DET Animal-Part dataset.As shown in Table 12 and Table 13, the filter loss forced each filter to focus on a specific object part and reduced the location instability.

Activation magnitudes
We further tested effects of the interpretable loss on neural activations among different categories. We used the VGG-M, VGG-S, and VGG-16 networks with either the logistic log loss or the softmax loss, which were trained to classify animal categories in the VOC Part dataset [4]. For each interpretable filter, given images of its target category, we recorded their neural activations (i.e. recording the maximal activation value in each of their feature maps). At the same time, we also recorded neural activations of the filter on other categories. The interpretable filter was supposed to activate much more strongly on its target category than on other (unrelated) categories. We collected all activation records on corresponding categories of all filters, and their mean value is reported in Table 14.
In comparison, we also report in Table 14 the mean value of all activations on unrelated categories of all filters. This table shows that the interpretable filter was usually activated more strongly on the target category than on other categories. Furthermore, we also compared the proposed interpretable CNN with the ablation baseline (w/o filter loss), in which the CNN was learned without the filter loss.
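The two mean values compared in this experiment can be computed as in the sketch below. The function name and the input layout are hypothetical: the maximal activation of every filter on every image is assumed to be stored in an (F, N) array, together with the category of each image and the target category of each filter.

```python
import numpy as np

def mean_activation_by_relation(max_activations, image_categories, filter_targets):
    """Return (mean activation on target categories, mean activation on other categories).

    max_activations: array of shape (F, N), maximal activation of filter f on image n.
    image_categories: length-N sequence of category labels.
    filter_targets: length-F sequence with the target category of each filter.
    """
    acts = np.asarray(max_activations, dtype=float)
    # on_target[f, n] is True when image n belongs to the target category of filter f.
    on_target = np.asarray(filter_targets)[:, None] == np.asarray(image_categories)[None, :]
    return float(acts[on_target].mean()), float(acts[~on_target].mean())
```

A large gap between the two returned values indicates that filters respond mainly to their own target category.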

CONCLUSION AND DISCUSSIONS
In this paper, we have proposed a general method to enhance the feature interpretability of CNNs. We design a loss to push a filter in high conv-layers towards the representation of an object part during the learning process without any part annotations. Experiments have shown that each interpretable filter consistently represents a certain object part of a category across different input images. In comparison, each filter in the traditional CNN usually represents a mixture of parts and textures.
Meanwhile, the interpretable CNN still has some drawbacks. First, in the scenario of multi-category classification, filters in a conv-layer are assigned to different categories. Thus, when we need to classify a large number of categories, theoretically, each category can only obtain a few filters, which slightly decreases the classification performance. Otherwise, the interpretable conv-layer must contain many filters to enable the classification of a large number of categories. Second, the learning of the interpretable CNN relies on a strong assumption, i.e. each input image must contain a single object, which limits the applicability of the interpretable CNN. Third, the filter loss is only suitable for learning high conv-layers, because low conv-layers usually represent textures instead of object parts. Finally, the interpretable CNN is not suitable for encoding textural patterns.

Fig. 10. Visualization of neural activations (before the mask layer) of interpretable filters learned with and without the filter loss. We visualized neural activations of the first interpretable conv-layer before the mask layer in the CNN. In comparison, visualization results in Fig. 5 correspond to feature maps after the mask layer. Filters trained with the filter loss tended to generate more concentrated neural activations and have higher semantic purity than filters learned without the filter loss.

Fig. 1. Comparison of an interpretable filter's feature maps with a filter's feature maps in a traditional CNN.

Fig. 2. Structures of an ordinary conv-layer and an interpretable conv-layer.Solid and dashed lines indicate the forward and backward propagations, respectively.During the forward propagation, our CNN assigns each interpretable filter with a specific mask w.r.t. each input image during the learning process.

Fig. 3. Templates of $T_{\mu_i}$. We show a toy example of $n = 3$. Each template $T_{\mu_i}$ matches a feature map $x$ when the target part mainly triggers the $i$-th unit in $x$. In fact, the algorithm also supports a round template based on the L-2 norm distance. Here, we use the L-1 norm distance instead to speed up the computation.
Fig. 4. Given an input image $I$, from left to right, we consecutively show the feature map of a filter after the ReLU layer $x$, the assigned mask $T_{\mu}$, the masked feature map $x^{\text{masked}}$, and the image-resolution RF of activations in $x^{\text{masked}}$ computed by [49].

Fig. 5. Visualization of filters in top conv-layers (top) and quantitative contribution of object parts to the prediction (bottom). (top) We used [49] to estimate the image-resolution receptive field of activations in a feature map to visualize a filter's semantics. Each group of four feature maps for a category is computed using the same interpretable filter. These images show that each interpretable filter is consistently activated by the same object part across different images. Four rows visualize filters in interpretable CNNs, and two rows correspond to filters in ordinary CNNs. (bottom) The clear disentanglement of object-part representations helps people quantify the contribution of different object parts to the network prediction. We show the explanation for part contribution, which was generated by the method of [3].
Fig. 7. Grad-CAM visualizations [24] of the traditional conv-layer and the interpretable conv-layer. Unlike the traditional conv-layer, the interpretable conv-layer usually selectively modeled distinct object parts of a category and ignored other parts.
Then, image-resolution RFs of valid neural activations of each input image $I$ were computed⁴. The RFs on image $I$, termed $S^I_f$, corresponded to part regions of $f$. The fitness between the filter $f$ and the $k$-th part on image $I$ was reported as the intersection-over-union score $IoU^I_{f,k} = \frac{\|S^I_f \cap S^I_k\|}{\|S^I_f \cup S^I_k\|}$, where $S^I_k$ denotes the ground-truth mask of the $k$-th part on image $I$. Given an image $I$, the filter $f$ was associated with the $k$-th part if $IoU^I_{f,k} > 0.2$.

Given an input image $I$, $d_I(p_k, \mu) = \frac{\|p_k - p(\mu)\|}{\sqrt{w^2 + h^2}}$ denotes the normalized distance between the inferred part and the $k$-th landmark $p_k$, where $p(\mu)$ is the center of the unit $\mu$'s RF and $\sqrt{w^2 + h^2}$ is the diagonal length of $I$. $D_{f,k} = \mathrm{var}_I[d_I(p_k, \mu)]$ is termed the relative location deviation of filter $f$ w.r.t. the $k$-th landmark, where $\mathrm{var}_I[d_I(p_k, \mu)]$ is the variation of $d_I(p_k, \mu)$ over images.

Fig. 9. Examples that were incorrectly classified by the interpretable CNN.

TABLE 2
Part interpretability of filters in CNNs that are trained for multi-category classification based on the VOC Part dataset [4]. Filters in our interpretable CNNs exhibited significantly better part interpretability than ordinary CNNs in all comparisons.

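TABLE 3
Location instability of filters ($E_{f,k}[D_{f,k}]$) in CNNs that are trained for the binary classification of a single category using the ILSVRC 2013 DET Animal-Part dataset [42]. Filters in our interpretable CNNs exhibited significantly lower localization instability than ordinary CNNs.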

TABLE 4
Location instability of filters ($E_{f,k}[D_{f,k}]$) in CNNs that are trained for binary classification of a single category using the VOC Part dataset [4]. Filters in our interpretable CNNs exhibited significantly lower localization instability than ordinary CNNs.

TABLE 6
Location instability of filters ($E_{f,k}[D_{f,k}]$) in CNNs that are trained for multi-category classification. Filters in our interpretable CNNs exhibited significantly lower localization instability than ordinary CNNs in all comparisons.

TABLE 9
Statistics of semantic meanings of interpretable filters. "-" indicates that the part is not selected as a label to describe the filter in a CNN. Except for CNNs for the bird and the horse, CNNs for other animals paid attention to detailed structures of the head. Thus, we annotated fine-grained parts inside the head for these CNNs.

Table 14 shows that the filter loss made each filter more prone to being triggered by a single category, i.e. boosting the feature interpretability.

TABLE 10
Semantic purity of neural activations of interpretable filters learned with and without the filter loss from the VOC Part dataset.Filters learned with the filter loss exhibited significantly higher semantic purity than those learned without filter loss.

TABLE 11
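Semantic purity of neural activations of interpretable filters learned with and without the filter loss from the ILSVRC 2013 DET Animal-Part dataset. Filters learned with the filter loss exhibited significantly higher semantic purity than those learned without the filter loss.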

TABLE 12
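Location instability of filters ($E_{f,k}[D_{f,k}]$) in the first conv-layer. The CNN was trained for the binary classification of a single category using the VOC Part dataset [4]. For the baseline, we added a conv-layer to the ordinary VGG-16 network, and selected the corresponding conv-layer in the network to enable fair comparisons. Interpretable filters with both the filter loss and the mask layer exhibited much lower localization instability than those learned with the mask layer but without the filter loss.

TABLE 13
Location instability of filters ($E_{f,k}[D_{f,k}]$) in the first interpretable conv-layer. The CNN was trained for the binary classification of a single category using the ILSVRC 2013 DET Animal-Part dataset [42]. For the baseline, we added a conv-layer to the ordinary VGG-16 network, and selected the corresponding conv-layer in the network to enable fair comparisons. Interpretable filters learned with both the filter loss and the mask layer exhibited much lower localization instability than those learned with the mask layer but without the filter loss.

TABLE 14
The mean value of neural activations on the target categories and those on other categories. Filters learned with the filter loss were usually more discriminative than those learned without the filter loss.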

TABLE 15
Average adversarial distortion of the original CNN and the interpretable CNN.