Extraction of an Explanatory Graph to Interpret a CNN

This paper introduces an explanatory graph representation to reveal object parts encoded inside convolutional layers of a CNN. Given a pre-trained CNN, each filter in a conv-layer usually represents a mixture of object parts. We develop a simple yet effective method to learn an explanatory graph, which automatically disentangles object parts from each filter without any part annotations. Specifically, given the feature map of a filter, we mine neural activations from the feature map that correspond to different object parts. The explanatory graph is constructed to organize each mined part as a graph node. Each edge connects two nodes whose corresponding object parts usually co-activate and keep a stable spatial relationship. Experiments show that each graph node consistently represents the same object part across different images, which boosts the transferability of CNN features. The explanatory graph transfers features of object parts to the task of part localization, and our method significantly outperforms other approaches.


INTRODUCTION
In this paper, we investigate the disentanglement of intermediate-layer feature representations of a CNN pre-trained for object classification. We notice that each filter in a CNN usually encodes mixed features of object parts and textural patterns. Therefore, in this paper, given a CNN, we propose to learn an explanatory graph without any part annotations. The explanatory graph automatically reveals how object-part features are organized in the CNN. The explanatory graph 1. disentangles features of object parts from mixed features in intermediate layers of a CNN; 2. encodes which object parts are usually co-activated and keep a stable spatial relationship.
As Fig. 1 shows, the explanatory graph encodes the compositional hierarchy of object parts encoded inside conv-layers of the CNN, as follows.
• The explanatory graph consists of multiple layers. Each layer of the graph corresponds to a conv-layer of the CNN and contains thousands of nodes.
• Each node represents an object part that is encoded in a filter of a conv-layer. A filter in a conv-layer is usually activated by multiple parts and textural patterns. As Fig. 1 shows, a filter's feature map 1 may be activated by both the head and the neck of a horse. Given the feature map of a filter, a graph node can identify neural activations in the feature map which correspond to a specific object part. Theoretically, a CNN with ReLU layers can be considered to encode high-order piecewise linear representations. An object part corresponding to a node is encoded inside a specific feature space divided by the piecewise partitions. Multiple nodes are learned for each filter, i.e. neural activations in its feature map are divided and explained as multiple different parts.
• A graph edge connects two nodes in adjacent layers. The two connected nodes represent two object parts, which usually appear simultaneously in an image and keep a stable spatial relationship across different images. For example, the ear part and the face part of a horse usually co-appear in different images with similar spatial relationships.

1. The output of a conv-layer is called the feature map of a conv-layer. Each channel of this feature map is produced by a filter, so we call a channel the feature map of a filter.
Constructing the explanatory graph is a process of mining object parts from intermediate conv-layers. Nodes in the explanatory graph represent all candidate parts learned from the entire set of training images by the CNN. Consequently, in the inference process, given an input image, the explanatory graph automatically selects a small ratio of nodes. These chosen nodes identify neural activations in intermediatelayer feature maps, which correspond to specific object parts. Given different images, the explanatory graph selects different sets of nodes for explanation. Moreover, since the same part may appear at various locations, given different images, the same node may identify neural activations at different positions as the target part.
Fig. 1. An explanatory graph represents the compositional hierarchy of object parts encoded in conv-layers of a CNN. Each filter in a pre-trained CNN may be activated by different object parts. Our method disentangles object parts from each filter in an unsupervised manner.

The explanatory graph mainly has two advantages, i.e. the disentanglement and the transferability, as follows.
Disentangling object parts from a single filter is the core technique of building an explanatory graph. In this study, we develop a simple yet effective method to automatically disentangle different object parts from a single filter without using any annotations of object parts, which presents considerable challenges for state-of-the-art algorithms. In this way, the explanatory graph exclusively localizes neural activations of object parts in the feature map, and ignores noisy activations and activations of textural patterns.
More specifically, for each input image, the explanatory graph (i) infers which parts (nodes) are responsible for the feature map of a filter and (ii) localizes these parts.
Graph nodes with high transferability: The explanatory graph contains off-the-shelf features of object parts in a compositional hierarchy, like a dictionary. Thus, the explanatory graph enables us to accurately transfer such object-part features to other tasks. Since all filters in the CNN are learned to encode common features shared by numerous training images, each graph node can be regarded as a transferable detector for common parts among different images.
To demonstrate the above advantages, we learn different explanatory graphs for various CNNs (e.g. the VGG-16, residual networks, and the encoder of a VAE-GAN) and analyze the explanatory graphs from various perspectives as follows.
Visualization & reconstruction: We visualize object parts encoded by graph nodes using the following two approaches. First, for each graph node, we draw image regions corresponding to the node's part localizations on different input images. Second, we learn another neural network, which uses activation states of graph nodes to reconstruct the input image.
Evaluation of part interpretability of graph nodes: Given an explanatory graph, we propose a new metric to quantitatively evaluate whether a node consistently represents the same part in different images.
Examination of location instability of graph nodes: Besides part interpretability, we also use a new metric, namely location instability, to measure the semantic clarity of each graph node. It is assumed that if a graph node consistently represents the same object part, then the distances between the inferred part and certain ground-truth landmarks of the object should not change much across different images. Thus, the evaluation metric uses the deviation of such relative distances over images to measure the instability of the part representation.
Testing transferability: The transferability of graph nodes is tested in the scenario of few-shot part localization. We associate certain graph nodes with explicit part names based on feature maps of very few images, in order to localize the target part. The superior localization performance proves the good transferability of graph nodes.
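A minimal sketch of the location-instability computation described above (the array shapes, the use of Euclidean distances, and the averaging over landmarks are our assumptions, not the paper's exact protocol):

```python
import numpy as np

def location_instability(part_positions, landmark_positions):
    """Location instability of one graph node.

    For each ground-truth landmark, compute the distance between the
    inferred part and the landmark in every image, then take the standard
    deviation of that distance over images; finally average over landmarks.
    A node that consistently represents the same part yields a low value.

    part_positions: array-like [n_images, 2]
    landmark_positions: array-like [n_images, n_landmarks, 2]
    """
    parts = np.asarray(part_positions, dtype=float)[:, None, :]
    lms = np.asarray(landmark_positions, dtype=float)
    dists = np.linalg.norm(parts - lms, axis=-1)   # [n_images, n_landmarks]
    return float(dists.std(axis=0).mean())
```

A perfectly stable node, whose inferred position keeps a fixed offset from a landmark in every image, has instability zero.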
Contributions of this paper are summarized as follows.
• In this paper, we, for the first time, propose a simple yet effective method to extract and summarize object parts encoded inside intermediate conv-layers of a CNN and organize them using an explanatory graph. Experiments show that each graph node consistently represents the same object part in different input images.
• The proposed method can be used to learn explanatory graphs for various CNNs, e.g. VGGs, residual networks, and the encoder of a VAE-GAN.
• Graph nodes have good transferability, especially in the task of few-shot part localization. Although our graph nodes were learned without part annotations, our transfer-learning-based part localization still outperformed approaches using part annotations to learn part representations.
A preliminary version of this paper appeared in [43].

Semantics in the CNN
The interpretability of neural networks has received increasing attention in recent years [51]. Different methods have been developed to explore visual concepts encoded inside a CNN.
Visualization & interpretability of CNN filters: Visualization of filters in a CNN is the most direct way of diagnosing representations of a CNN. Dosovitskiy et al. [6] proposed up-convolutional nets to invert feature maps of conv-layers to input images. Gradient-based visualization [19], [21], [22], [31] showed the appearance that maximized a network output, the activation score of a specific filter, or certain neural activations in a feature map. Furthermore, Bau et al. [3] defined and analyzed the interpretability of each filter. In recent years, [23] provided a reliable tool to visualize filters in different conv-layers of a CNN.
[3] selectively analyzed the semantics among the highest 0.5% activations of each filter. In contrast, our method provides a solution to explaining both strong and relatively weak activations from each filter, instead of exclusively extracting significant neural activations.
Active network diagnosis: Going beyond "passive" visualization, some methods "actively" diagnose a pre-trained CNN to obtain an insightful understanding of CNN representations. Many statistical methods [1], [35] have been proposed to analyze CNN features. [35] explored semantic meanings of convolutional filters. [1], [17] computed feature distributions of different categories.
Model bias and dataset bias are typical problems in deep learning, which have been illustrated in recent studies [14], [20], [24]. Zhang et al. [47] proposed a method to discover biased representations caused by dataset bias. A CNN usually uses unreliable contexts for classification. For example, a CNN may extract features from hair as a context to identify the smiling attribute.
Therefore, in order to ensure the correctness of feature representations, network-attack methods [12], [34], [35] diagnosed network representations by computing adversarial samples for a CNN. In particular, influence functions [12] were proposed to compute adversarial samples, create training samples to attack the learning of CNNs, fix the training set, and further debug a CNN. [13] discovered blind spots of CNN representations in a weakly-supervised manner. In comparison, our method disentangles features of object parts from a pre-trained CNN and builds an explanatory graph to reveal object parts encoded inside the CNN. This is because, just like And-Or graphs [39], [40], [41], our explanatory graph naturally represents the local, bottom-up, and top-down information to construct a hierarchical object representation.
Diagnosis of network predictions: Some previous studies aimed to explain the reason for each network prediction. Methods of [7], [27] propagated gradients of feature maps w.r.t. the CNN loss back to the image, in order to estimate the image regions that directly contribute to the network output. The LIME [24], the SHAP [18], and [4], [42] extracted input units that were closely related to a specific prediction.
Pattern retrieval: Some studies retrieve specific activation units with specific meanings from intermediate-layer feature maps. Like middle-level feature extraction [33], pattern retrieval mainly learns mid-level representations of CNN features. Zhou et al. [52], [53] selected activation units from feature maps to describe scenes. In particular, [52] accurately computed the image-resolution receptive field of neural activations in a feature map. Theoretically, the actual receptive field of a neural activation is smaller than that computed using the filter size. Simon et al. discovered objects from feature maps of unlabeled images [29], and selected a filter to describe each part in a supervised fashion [30]. However, most methods simply assumed that each filter mainly encoded a single visual concept, and ignored the case that a filter in high conv-layers encoded a mixture of object parts and textural patterns. [44], [45], [46] extracted certain neural activation units from a filter's feature map to describe an object part in a weakly-supervised manner (i.e. learning from active question answering and human interactions).
In this study, the explanatory graph disentangles features of different object parts from the CNN without part annotations. Compared to raw feature maps, graph nodes are well disentangled and more interpretable.
CNN semanticization: Compared to the diagnosis of CNN representations and the pattern retrieval, semanticization of CNN representations is closer to the spirit of building interpretable representations.
Hu et al. [11] designed logic rules for network outputs, and used these rules to regularize neural networks and learn meaningful representations. [3], [52] extracted visual semantics from intermediate layers of a CNN. [37] distilled representations of a neural network into an additive model to explain the network. [53] also used additive structures, i.e. the global average pooling layer to explain neural networks. [50] used a tree structure to approximate the rationale of the CNN prediction on each specific sample. Capsule nets [26] and interpretable CNNs [49] used specific network structures and loss functions, respectively, to make the network automatically encode interpretable features in intermediate layers.
In comparison, we aim to explore the compositional hierarchy of object parts encoded inside conv-layers of a CNN. The explanatory graph boosts the transferability of CNN features to other part-based tasks.

Fig. 2. Schematic illustration of the explanatory graph. The explanatory graph encodes spatial and co-activation relationships between object parts. Nodes in high layers help localize nodes in low layers. From another perspective, we can regard low-layer nodes as representing compositional parts of high-layer nodes.

Weakly-supervised knowledge transferring
Knowledge transferring has been widely used in deep learning. Typical research includes end-to-end finetuning and transferring CNN representations between different datasets [8]. In contrast, a transparent representation of the explanatory graph will create a new possibility of transferring object-part features to other applications. Experiments have demonstrated the superior transferability of graph nodes in few-shot part localization.

ALGORITHM
A single filter is usually activated by different parts of the object (see Fig. 2). Let us assume that given an input image, a filter is activated at N positions, i.e. there are N activation peaks on the filter's feature map. Some peaks represent common parts of the object. Other activation peaks may correspond to background noise or textural patterns. Our goal is to disentangle the activation peaks corresponding to object parts from a filter's feature map, i.e. to select certain neural activations that represent specific object parts. We propose an explanatory graph for the disentanglement. Each activation peak of a filter corresponding to an object part is represented as a graph node. Let an activation peak represent a specific object part. Then, it is assumed that the CNN usually contains other filters to represent neighboring parts of the target part, i.e. some activation peaks of other filters must keep stable spatial relationships with the target part. Such spatial relationships are encoded in the edges of the explanatory graph, which connect each node in a layer to some nodes in the neighboring upper layer.
Object parts are mined layer by layer. Given object parts mined from the upper layer, we extract activation peaks that keep stable spatial relationships with specific upper-layer parts through different images, as parts in the current layer.
Nodes in high layers usually represent large-scale object parts, while nodes in low layers mainly describe small and relatively simple shapes, which are usually components of high-layer parts. Nodes in high layers are usually discriminative, and the explanatory graph uses high-layer nodes to filter out noisy activations. Nodes in low layers are disentangled based on their spatial relationships with high-layer nodes.
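As a concrete illustration of mining activation peaks from one filter's feature map, here is a minimal NumPy sketch. The threshold value and the 8-neighbor peak rule are our assumptions (the text only specifies a parameter τ = 0.1, not its exact role):

```python
import numpy as np

def activation_peaks(channel, tau=0.1):
    """Find local activation peaks in one filter's 2-D feature map.

    A unit counts as a peak if its activation exceeds the threshold `tau`
    and is no smaller than all 8 of its neighbors. Returns a list of
    (row, col) positions; later steps would decide which peaks are object
    parts and which are noise or texture.
    """
    h, w = channel.shape
    # Pad with -inf so border units compare correctly against "outside".
    padded = np.pad(channel, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for i in range(h):
        for j in range(w):
            v = channel[i, j]
            if v <= tau:
                continue
            window = padded[i:i + 3, j:j + 3]   # the 3x3 neighborhood of (i, j)
            if v >= window.max():
                peaks.append((i, j))
    return peaks
```

The returned positions would then be explained either by graph nodes (object parts) or by the dummy noise component.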

Explanatory graph
Before introducing the technical details of the algorithm, we first give a brief overview of the explanatory graph.
We are given a CNN, which is learned using a set of training samples I. We construct an explanatory graph G based on this CNN and all training samples in I. As Fig. 4 illustrates, G contains several layers, each corresponding to a single conv-layer in the CNN. Each layer of the explanatory graph is composed of hundreds/thousands of nodes, which represent object parts encoded in this conv-layer. Each node is linked to some graph nodes in the upper layer. The linkage/edge indicates that the object parts of the two linked nodes usually co-appear in images with a stable spatial relationship. In this way, an explanatory graph can be considered as a dictionary of object parts, which are extracted from various images.
In the training phase, each node in G is supposed to disentangle a part from a conv-layer's feature maps. In the inference phase, given feature maps of an input image I, the explanatory graph G uses its nodes to localize neural activations corresponding to different parts.
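The graph structure described above can be sketched as a small data structure (all class and field names here are hypothetical, not from the paper's code):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One disentangled part: tied to a filter of a conv-layer, with a
    prior image-plane position (mu_V) and edges to parent nodes in the
    upper layer (E_V)."""
    layer: int                 # index of the graph layer
    filter_id: int             # channel d whose activations this node explains
    mu: tuple                  # prior location of the part in the image plane
    parents: list = field(default_factory=list)   # E_V: connected upper-layer nodes

@dataclass
class ExplanatoryGraph:
    """Layers of nodes; each graph layer corresponds to one conv-layer."""
    layers: list = field(default_factory=list)

    def add_node(self, node):
        # Grow the layer list on demand and register the node.
        while len(self.layers) <= node.layer:
            self.layers.append([])
        self.layers[node.layer].append(node)
        return node
```

Edges only point from a node to its parents in the upper layer, mirroring the top-down organization of the graph.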

Top-down iterative learning of the explanatory graph
TABLE 1. Notation.
V — a node in the explanatory graph
N_{L,d} (or N_d) — the number of nodes extracted from the d-th channel of the L-th conv-layer
Ω_{L,d} (or Ω_d) — the set of nodes extracted from the d-th channel of the L-th conv-layer
Ω_L (or Ω) — the set of nodes extracted from the L-th conv-layer
θ — parameters of nodes in the L-th layer
X^I — the feature map of the L-th conv-layer given input image I
x ∈ X^I — a neural activation unit in the feature map X^I
R^I — position inference results of nodes in the (L+1)-th layer, represented using spatial coordinates
p_x — the center of the receptive field of the neural activation unit x in the image plane
p_V — the position inference result (i.e. the spatial coordinate) of node V in the image plane given input image I
µ_V — the average position of node V in the image plane
E_V — the set of parent nodes of node V, which are localized in the upper layer

Given all training images I, we expect that (i) all nodes in the explanatory graph can be well fitted to
feature maps of all images, and (ii) nodes in the lower layer always keep consistent spatial relationships with nodes in the upper layer given each input image. Therefore, the learning of an explanatory graph is conducted in a top-down manner as follows.
The learning of an explanatory graph is conducted layer by layer. We first disentangle parts from the top conv-layer of the CNN and construct the top layer of the explanatory graph. Then, we conduct position inferences for all nodes in the top layer (the inference process will be introduced in Section 3.3). We use inference results to help disentangle parts from the neighboring lower conv-layer. In this way, the lower layer of the explanatory graph is constructed using inference results of the neighboring upper layer.
Construction of the L-th layer 2: In the following paragraphs, we will introduce how to recursively learn the L-th layer of the explanatory graph given the (L + 1)-th layer.
Our method disentangles the d-th filter of the L-th conv-layer into N_{L,d} parts. These parts are modeled as a set of N_{L,d} nodes in the L-th layer of G, denoted by Ω_{L,d}. Ω_L = ∪_d Ω_{L,d} denotes the entire node set for the L-th layer. In the following paragraphs, we simply omit the subscript L without ambiguity. θ represents parameters of nodes in the L-th layer, which mainly encode spatial relationships between these nodes and nodes in the (L + 1)-th layer. Table 1 summarizes the notation used in this paper.
Given an input image I ∈ I, the L-th conv-layer of the CNN generates a feature map 1, denoted by X^I. Then, for each node V ∈ Ω_d, the explanatory graph infers whether or not the part indicated by V appears in the d-th channel 1 of X^I, as well as its part location (if the part appears).

2. Note that our method is not limited to using consecutive conv-layers to learn the explanatory graph. People can select inconsecutive conv-layers. Without loss of generality, the L-th ranked layer among all conv-layers, which are selected from the CNN, is termed the L-th conv-layer for simplicity.
For each node V in the L-th layer, our method learns the following two terms: (i) the parameter µ_V ∈ θ and (ii) a set of nodes E_V ∈ θ in the upper layer that are connected to V. µ_V denotes the prior location of V. Thus, for each node V′ ∈ E_V, µ_V − µ_{V′} corresponds to the prior displacement between V and node V′ in the upper layer. The explanatory graph uses the displacement µ_V − µ_{V′} to model the spatial relationship between nodes.
Just like an EM algorithm, we use the current explanatory graph to fit feature maps of training images. Then, we use matching results as feedback to modify the prior location µ V and edges E V of each node V in the L-th layer, in order to make the explanatory graph better fit the feature maps. We repeat this process iteratively to obtain the optimal prior location and edges for V .
In other words, our method automatically extracts pairs of related nodes and learns the optimal spatial relationships between them during the iterative learning process, which best fit feature maps of training images.
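To make the EM-style loop above concrete, here is a minimal NumPy sketch of how a node's prior displacement to one parent might be refined across images. The soft-assignment weighting, the single-parent simplification, and the fixed bandwidth are our assumptions, not the paper's exact update:

```python
import numpy as np

def refine_prior_displacement(peaks_list, parent_list, mu, sigma=8.0, iters=5):
    """EM-style refinement of a node's prior displacement to one parent.

    peaks_list[i]: array [k_i, 2] of candidate activation-peak positions
        (image plane) in image i.
    parent_list[i]: inferred position of the connected upper-layer node
        in image i.
    mu: current prior displacement between the node and its parent.

    E-step: soft-assign each peak by a Gaussian on (peak - parent) around mu.
    M-step: re-estimate mu as the responsibility-weighted mean displacement.
    """
    mu = np.asarray(mu, dtype=float)
    for _ in range(iters):
        num = np.zeros(2)
        den = 0.0
        for peaks, parent in zip(peaks_list, parent_list):
            d = np.asarray(peaks, dtype=float) - np.asarray(parent, dtype=float)
            w = np.exp(-np.sum((d - mu) ** 2, axis=1) / (2 * sigma ** 2))
            num += (w[:, None] * d).sum(axis=0)
            den += w.sum()
        if den > 0:
            mu = num / den
    return mu
```

Peaks whose displacement to the parent stays stable across images dominate the update, while inconsistent peaks (noise, texture) receive vanishing weight, which is the intuition behind disentangling parts from mixed activations.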
Therefore, the objective function of learning the L-th layer is formulated as

θ∗ = argmax_θ ∏_{I∈I} P(X^I | R^I, θ)    (1)

Let us focus on the feature map X^I of image I. Without ambiguity, we omit the superscript I to simplify notations in the following paragraphs. We can regard X as a distribution of "neural activation entities." The neural response of each unit x ∈ X can be considered as the number of "activation entities." In other words, each neural activation unit x in the feature map X is identified by its spatial position p_x 3 and its channel number d_x (i.e. an activation unit of the d_x-th filter). We use F(x) = β · max{f_x, 0} to measure the number of activation entities at the location p_x, where f_x is the normalized activation value of x and β is a constant. We use R^I to represent position inference results of all nodes in the upper conv-layer (i.e. the (L+1)-th conv-layer).
Just like a Gaussian mixture model, all nodes in Ω_d comprise a mixture model, which explains the distribution of activation entities on the d-th channel of X:

P(p_x | R, θ) = Σ_{V ∈ Ω_d ∪ {V_none}} P(V) · P(p_x | V, R, θ)    (2)

where each node V ∈ Ω_d is treated as a hidden variable or an alternative component in the mixture model to describe activation entities. P(V) = 1/(N_d + 1) is a constant prior probability. P(p_x | V, R, θ) measures the compatibility of using node V to describe an activation entity at p_x. In particular, we add a dummy node V_none to the mixture model for noisy activations, in order to explain neural activations unrelated to object parts, e.g. those of noises and textural patterns. The compatibility between V and p_x is based on the spatial relationship between V and its connected nodes in G, which is approximated as

P(p_x | V, R, θ) = γ ∏_{V′ ∈ E_V} [ P(p_x | p_{V′}, θ) ]^λ    (3)

In the above equations, V has M related nodes in the upper layer. The set of nodes E_V ∈ θ connected to V is determined during the learning process. The overall compatibility P(p_x | V, R, θ) is divided into the spatial compatibility between node V and each related node V′, P(p_x | p_{V′}, θ). ∀V′ ∈ E_V, p_{V′} ∈ R denotes the position inference result of V′, which has been given. λ = 1/M is a constant for normalization. γ is a constant, which roughly ensures ∫ P(p_x | V, R, θ) dp_x = 1 and can be eliminated during the learning process.

3. To make unit positions in different conv-layers comparable with each other (e.g. µ_{V′→V} in Eq. 4), we project the position of unit x to the image plane. We define the coordinate p_x on the image plane, instead of on the feature-map plane.
As Fig. 3 shows, an intuitive idea is that the relative displacement between V and V′ should not change much among different images. Then, p_x − p_{V′} will approximate the prior displacement µ_V − µ_{V′}, if node V can well fit the activation at p_x. Given E_V, we assume the spatial relationship between V and V′ follows the Gaussian distribution in Eqn. 4:

P(p_x | p_{V′}, θ) = N(p_x | µ_{V′→V}, σ²_{V′})    (4)

where we define µ_{V′→V} = µ_V − µ_{V′} + p_{V′} as the prior localization of V given V′. The variance σ²_{V′} can be estimated from data 4.
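The product-of-Gaussians compatibility of Eqns. (3) and (4) can be sketched numerically as follows (isotropic Gaussians and all function names are our assumptions; the constant γ is dropped, as the text notes it can be eliminated):

```python
import numpy as np

def gaussian_compat(p_x, parent_positions, mu_v, mu_parents, sigmas):
    """Compatibility P(p_x | V, R, theta) up to the constant gamma.

    For each connected parent V', the prior localization of V is
    mu_{V'->V} = mu_V - mu_{V'} + p_{V'}; the overall score is the
    geometric mean (lambda = 1/M) of the per-parent Gaussian densities.
    """
    p_x = np.asarray(p_x, dtype=float)
    M = len(parent_positions)
    log_score = 0.0
    for p_par, mu_par, sig in zip(parent_positions, mu_parents, sigmas):
        # Prior localization of V predicted from this parent's position.
        center = np.asarray(mu_v, float) - np.asarray(mu_par, float) + np.asarray(p_par, float)
        diff = p_x - center
        # Log-density of an isotropic 2-D Gaussian N(p_x | center, sig^2).
        log_score += -np.sum(diff ** 2) / (2 * sig ** 2) - np.log(2 * np.pi * sig ** 2)
    return float(np.exp(log_score / M))
```

An activation at the position predicted by the parents scores highest; activations far from every prediction are better explained by the dummy noise component.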
The explanatory graph is learned in a top-down manner, and the learning process is summarized in Algorithm 1. Our method first learns nodes in the top layer of G, and then learns for the neighboring lower layer. For the sub-graph in the L-th layer, our method recursively estimates µ_V and E_V for nodes in the sub-graph.

Algorithm 1: Learning the sub-graph in the L-th layer
Inputs: feature map X of the L-th conv-layer; inference results R in the upper conv-layer.
for each iteration do
  for each node V in the L-th layer do
    update µ_V and construct E_V based on a greedy strategy, which maximizes ∏_{I∈I} P(X|R, θ)
  end for
end for

4. We can prove that for each
The special case is a node in the top conv-layer. For each node V in the top conv-layer, we simply define E_V = {V_dummy}, where V_dummy is a node in the dummy layer above the top conv-layer. Based on Eqns. (3) and (4), we obtain P(p_x | V, R, θ) = γ · N(p_x | µ_V, σ²_V).

Part localization
Given feature maps of an input image, we can assign nodes to different activation peaks on the feature maps, in order to infer object parts represented by these neural activations. The explanatory graph simply assigns node V ∈ Ω_d the unit x̂ = argmax_{x∈X: d_x=d} S^I_{V→x} on the feature map as the inference result of its part location.
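The assignment rule above reduces to an argmax over the units of one channel. A sketch (the score S^I_{V→x} is passed in as a callable, since its exact computation combines the activation value with the spatial compatibility):

```python
import numpy as np

def localize_part(feature_map, d, score_fn):
    """Assign a node tied to filter d the unit with the highest score
    S_{V->x} among units of the d-th channel.

    feature_map: array [D, H, W]; score_fn(i, j) returns the score S for
    unit (i, j) of channel d. Returns the chosen unit and its score.
    """
    channel = feature_map[d]
    h, w = channel.shape
    scores = np.array([[score_fn(i, j) for j in range(w)] for i in range(h)])
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return idx, scores[idx]
```

Because the argmax is taken independently per image, the same node can localize the same part at different positions in different images, as described earlier.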

EXPERIMENTS
In this section, we conducted several experiments to demonstrate the effectiveness, broad applicability, and high accuracy of our method. We learned explanatory graphs to interpret four types of CNNs, i.e. the VGG-16 [32], the 50-layer and 152-layer Residual Networks [10], and the encoder of the VAE-GAN [15]. These CNNs were learned using a total of 37 animal categories in three datasets: the ILSVRC 2013 DET Animal-Part dataset [44], the CUB200-2011 dataset [38], and the VOC Part dataset [5]. As discussed in [5], [44], animals usually contain non-rigid parts, which presents a key challenge for part localization. Thus, we selected the animal categories in these three datasets for testing.
We designed three experiments to evaluate the explanatory graph from different perspectives. In the first experiment, we visualized object parts corresponding to nodes in the explanatory graph. The second experiment was designed to evaluate the interpretability of nodes, i.e. checking whether or not a node consistently represents the same object part among different images. We compared our nodes with three types of middle-level features and network features. In the third experiment, we used our graph nodes for the task of few-shot part localization, in order to test the transferability of nodes. We learned an And-Or graph (AOG) with very few part annotations, which associated the well learned nodes with explicit part names. We used the AOG to conduct part localization and compared its performance with fourteen baselines.

Implementation details
We first trained/fine-tuned a CNN using object images of a category, which were cropped using object bounding boxes. Then, we set the parameters τ = 0.1, M = 15 (except for results in Table 9), T = 20, and β = 1 to learn an explanatory graph for the CNN.
We learned explanatory graphs for the VGG-16, residual networks, and the VAE-GAN. We mainly extracted object parts from high conv-layers of these neural networks, because as discussed in [3], high conv-layers contain large-scale parts.
•VGG-16: The VGG-16 was first pre-trained using the 1.3M images in the ImageNet dataset [25]. We then fine-tuned all conv-layers of the VGG-16 using object images in a category. The loss for fine-tuning was for binary classification between the target category and background images. The VGG-16 has thirteen conv-layers and three fully connected layers. We selected the ninth, tenth, twelfth, and thirteenth conv-layers of the VGG-16 as four valid conv-layers, and accordingly, we built a four-layer graph. We extracted N_d nodes from the d-th filter of the L-th layer, where we set N_d = 40 for all channels of the first and second layers (L = 1 or 2) and set N_d = 20 for all channels of the third and fourth layers (L = 3 or 4).
•Residual Networks: Two residual networks, i.e. the 50-layer and 152-layer ones, were used in experiments. The fine-tuning process for each network was exactly the same as that for the VGG-16. We built a three-layer graph based on each residual network by selecting the last conv-layer with a 28 × 28 × 128 feature map, the last conv-layer with a 14 × 14 × 256 feature map, and the last conv-layer with a 7 × 7 × 512 feature map as valid conv-layers. We set N_d to 40, 20, and 10 for all channels in the first, second, and third conv-layers, respectively.
•VAE-GAN: For each category, we used the cropped object images to train a VAE-GAN. We learned a three-layer graph based on all three conv-layers of the encoder of the VAE-GAN. We set N_d to 52, 26, and 13 for all channels of the first, second, and third conv-layers, respectively.
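For reference, the layer selections and N_d settings above can be collected into a single configuration structure (the keys and field names are hypothetical; the numbers are those given in the text):

```python
# Hypothetical summary of the graph configurations described above.
GRAPH_CONFIG = {
    "vgg16":     {"graph_layers": 4, "nodes_per_filter": [40, 40, 20, 20]},
    "resnet50":  {"graph_layers": 3, "nodes_per_filter": [40, 20, 10]},
    "resnet152": {"graph_layers": 3, "nodes_per_filter": [40, 20, 10]},
    "vae_gan":   {"graph_layers": 3, "nodes_per_filter": [52, 26, 13]},
}

def check_config(name):
    """Sanity check: one N_d entry per graph layer; returns the N_d list."""
    cfg = GRAPH_CONFIG[name]
    assert cfg["graph_layers"] == len(cfg["nodes_per_filter"])
    return cfg["nodes_per_filter"]
```

Each entry records how many graph layers are built for a network and how many nodes are extracted per filter in each of those layers.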

Experiment 1: part visualization
The global structure of an explanatory graph for a VGG-16 network is visualized in Fig. 4. Fig. 12 shows the histogram of P(p_x | p_{V′}, θ) values among all edges in an explanatory graph. In general, the distribution of P(p_x | p_{V′}, θ) satisfied the Gaussian assumption. Fig. 13 demonstrates the convergence of our method.
We visualized object parts of graph nodes from the following three perspectives.
Top-ranked patches: For each image I, we performed part localization on its feature maps. For a node V, we extracted a patch at the location of p_{x̂_V} 5 with a fixed scale of 70 pixels × 70 pixels to represent V. Fig. 5 shows a part's image patches that had the highest inference scores. In this figure, we used two different methods to infer the object part for each node. The first method was x̂ = argmax_{x∈X: d_x=d} S^I_{V→x}, as mentioned before. The second method incorporated gradients to localize parts,

5. We projected the unit to the image to compute its position.
Fig. 5. Visualization of graph nodes via top-ranked patches. (Legend: a CNN learned to classify a single category from random images vs. a CNN learned to classify multiple categories from random images.) (1) The top nine rows visualize nodes corresponding to CNNs, each learned for a single category. We used two methods to infer the image patch for each node. In the top nine rows, the part location was inferred as x̂ = argmax_{x∈X: d_x=d} S^I_{V→x}. In the following three rows, parts were localized via x̂ = argmax_{x∈X: d_x=d} f_x · ∂y/∂f_x. (2) The bottom four rows visualize image patches of graph nodes, when the CNN was learned to classify multiple categories. In this case, each node usually encoded parts shared by different categories. Texts before each group of image patches indicate their corresponding categories. Part location was inferred as x̂ = argmax_{x∈X: d_x=d} S^I_{V→x}. Please read the text for detailed explanations.
i.e. x̂ = argmax_{x∈X: d_x=d} f_x · ∂y/∂f_x, where y and f_x denote the classification output of the target class and the activation value of the neural activation unit x, respectively. f_x · ∂y/∂f_x is a classical evaluation of the numerical attribution of the neural activation unit x [47].
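The attribution f_x · ∂y/∂f_x can be illustrated with finite differences standing in for backpropagation (a toy sketch; a real implementation would read the gradient from the framework's autograd):

```python
import numpy as np

def attribution_scores(f, y_fn, eps=1e-5):
    """Numerical illustration of the attribution f_x * dy/df_x used above.

    f: 1-D array of activation values f_x; y_fn maps the activation
    vector to the scalar class score y. Central finite differences
    approximate the gradient dy/df_x for each unit.
    """
    f = np.asarray(f, dtype=float)
    grads = np.zeros_like(f)
    for k in range(f.size):
        up, dn = f.copy(), f.copy()
        up[k] += eps
        dn[k] -= eps
        grads[k] = (y_fn(up) - y_fn(dn)) / (2 * eps)
    return f * grads
```

For a linear score y = Σ w_k f_k, the attribution of unit k is simply w_k · f_k, which matches the intuition that a unit matters when it is both active and influential.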
Note that in this study, we assumed that the CNN was learned to classify a single category from random images. However, it would be quite interesting to visualize graph nodes corresponding to a CNN encoding parts of multiple categories. To this end, we learned a VGG-16 network to classify six animal categories (bird, cat, cow, dog, horse, sheep) from the other fourteen categories in the VOC Part dataset [5] and built an explanatory graph for the CNN. Fig. 5 also visualizes nodes in this explanatory graph. Each node usually represented parts that were shared by multiple categories.

Heatmaps of the distribution of object parts:
Given part-localization results w.r.t. a cropped object image I, we drew heatmaps to show the spatial distribution of the inferred parts. We drew a heatmap for each layer L of the graph, in which each part V ∈ Ω was visualized as a Gaussian distribution weighted by its inference score. Fig. 6 shows heatmaps of the top-50% parts with the highest scores of $S^{I}_{V\to x}$. Due to the lack of ground truth for explanations, it is difficult to evaluate the attribution/attention/saliency map of a neural network. In general, two criteria have to be considered: (1) whether the attribution map fits human cognition, and (2) whether the attribution map objectively reflects the true reasons for the network prediction. From the first perspective, in Fig. 6, the results of Grad-CAM fit human cognition better than our method. On the other hand, Fig. 6 visualizes the distribution of graph nodes, whose semantic meanings were verified in experiments. Therefore, the explanatory graph can reveal the object parts encoded in the CNN better than Grad-CAM.

[Fig. 6. Heatmaps of the distribution of object parts. Each heatmap visualizes the spatial distribution of the top-50% object parts with the highest inference scores in the L-th layer of the explanatory graph. We also compare our heatmaps with the Grad-CAM [27] of the feature map. Unlike Grad-CAM, our heatmaps mainly focus on the foreground of an object and pay uniform attention to all parts, rather than focusing only on the most discriminative parts.]
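Rendering such a heatmap amounts to summing, over the selected parts of one graph layer, a Gaussian centered at each part's inferred position and weighted by its inference score. The sketch below illustrates this; the positions, scores, and bandwidth are illustrative placeholders.

```python
import numpy as np

# Each entry: ((row, col) of the inferred part position, inference score).
# These values are made up for illustration.
H, W, sigma = 64, 64, 6.0
parts = [((20, 30), 0.9), ((40, 22), 0.7), ((33, 48), 0.5)]

ys, xs = np.mgrid[0:H, 0:W]
heatmap = np.zeros((H, W))
for (r, c), score in parts:
    # score-weighted isotropic Gaussian centered at the part position
    heatmap += score * np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2 * sigma ** 2))
```

The heatmap's peak lands at the position of the highest-scoring part, and each part contributes a soft blob rather than a single pixel, which is what makes the visualization cover all parts uniformly.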
Node-based image synthesis: We used the up-convolutional network [6] to visualize the parts of graph nodes. Given an object image I, we used the explanatory graph for part localization, i.e. assigning each node V a certain neural activation unit $x_V$ as its inferred position. We considered the top-10% nodes with the highest scores of $S^{I}_{V\to x}$ as valid ones. We filtered out all neural responses of units that were not assigned to valid nodes (setting these responses to zero). We then selected the filtered feature map corresponding to the second graph layer and used the up-convolutional network to invert it back to the image space. Fig. 7 shows the image-synthesis results, which can be regarded as visualizations of the inferred nodes.

Experiment 2: semantic interpretability of nodes
In this experiment, we evaluated whether each node consistently represented the same object part across different images. Four explanatory graphs were built for a VGG-16 network, two residual networks, and a VAE-GAN, respectively. These networks were learned using the CUB200-2011 dataset [38]. We used the following two metrics to measure the interpretability of nodes.
Part interpretability of nodes: This evaluation metric was inspired by Zhou et al. [52]. For each given node V, we used V to localize object parts among all images. We regarded the inference results with the top-K inference scores $S^{I_i}_{V}$ among all images as valid representations of V. We required the K highest inference scores on images $\{I_1, \ldots, I_K\}$ to take about 30% of the inference energy, i.e. we computed K such that $\sum_{i=1}^{K} S^{I_i}_{V} = 0.3 \sum_{I\in\mathbf{I}} S^{I}_{V}$. We asked human raters to count the number of the top-K inference results that described the same object part, in order to compute the purity of part semantics of node V. In addition, as mentioned before, $f_x \cdot \frac{\partial y}{\partial f_x}$ is a classical evaluation of the numerical attribution of the neural activation unit x [47]. Thus, we designed a baseline, namely Ours with top-ranked $f_x \cdot \frac{\partial y}{\partial f_x}$, which selected the inference results whose $f_x \cdot \frac{\partial y}{\partial f_x}$ values took 30% of the total $f_x \cdot \frac{\partial y}{\partial f_x}$ score over all images.

The table in Fig. 8 (top-left) shows the semantic purity of the nodes in the second layer of the graph. Let the second graph layer correspond to the L-th conv-layer with D filters. The raw filter maps baseline used all neural activations in the feature map of a filter to describe a part. The raw filter peaks baseline considered the highest peak on a filter's feature map as the part detection. Like our method, the two baselines also visualized the top-K part inferences (the K feature maps whose neural activations took 30% of the activation energy over all images). We back-propagated the center of the receptive field of each neural activation to the image plane and drew the image region corresponding to each neural activation. Fig. 8 compares the image region corresponding to each graph node with the image regions corresponding to the feature maps of each filter. Our graph nodes represented explicit object parts, whereas raw filters encoded mixed semantics. Because the baselines simply averaged the semantic purity over the D filters, we also computed the average semantic purity over the top-D nodes with the highest scores of $\sum_{I\in\mathbf{I}} S^{I}_{V}$ to enable a fair comparison.

[Fig. 8. Purity of part semantics (top-left). We compared object parts corresponding to nodes in the explanatory graph with features of raw filters: raw feature maps of filters (left), the highest activation peaks on feature maps of filters (middle), and image regions corresponding to each node in the explanatory graph (right). Based on such visualization results, human raters annotated the semantic purity of each node/filter.]

Location instability of inference positions: We defined the location instability of each node as another evaluation metric of interpretability. Note that we used the localization of object parts, rather than the localization of entire objects, to evaluate the clarity of the semantic meaning of each node. We assumed that if a node was always activated by the same object part across different images, then the distance between the node's inferred position and a ground-truth landmark of the object part should not change much among various images. As Fig. 9 shows, we compared the location instability of the explanatory graph with three baselines. The first baseline treated each filter in a CNN as a detector of a certain part: given the feature map of a filter (after the ReLU operation), we used the method of [52] to localize the unit with the highest response value as the part position. The other two baselines were typical methods to extract middle-level features from images [33] and to extract parts from CNNs [30], respectively.

[Table 3. Normalized distance of part localization on the CUB200-2011 dataset [38]. The second column indicates whether the baseline used all object-box annotations in the category to fine-tune a CNN.]
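The 30% inference-energy criterion used above to choose K can be sketched as follows: rank the per-image scores of a node in descending order and take the smallest prefix whose cumulative sum reaches 30% of the total. The score values below are synthetic placeholders.

```python
import numpy as np

def top_k_for_energy(scores, fraction=0.3):
    """Return K and the image indices whose top-K scores cover `fraction`
    of the node's total inference energy."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]          # images ranked by score, descending
    cum = np.cumsum(scores[order])
    k = int(np.searchsorted(cum, fraction * cum[-1]) + 1)
    return k, order[:k]

scores = np.array([5.0, 1.0, 3.0, 0.5, 2.0, 0.2])   # S_V^{I_i} per image (made up)
k, top_images = top_k_for_energy(scores)            # here a single image suffices
```

Here the total energy is 11.7, so 30% is 3.51, and the single highest score (5.0) already covers it, giving K = 1.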
For each method, we chose the top-500 parts, i.e. the 500 nodes with the top scores in the explanatory graph, the 500 filters with the strongest activations in the CNN, and the top-500 middle-level features, respectively. For each node, we selected the position inferences on the top-20 images with the highest scores to compute the location instability. Table 2 compares the location instability of the different methods; nodes in the explanatory graph had significantly lower location instability than all baselines.

[Table 6. Normalized distance of part localization on the ILSVRC 2013 DET Animal-Part dataset [44]. The second column indicates whether the baseline used all object-box annotations in the category to fine-tune a CNN.]
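The location-instability metric above can be sketched as the spread of the normalized distances between a node's inferred positions and a ground-truth landmark across images; a node that always fires on the same part yields a small spread. The positions and the image diagonal below are synthetic placeholders, and the use of a standard deviation is our hedged reading of "should not change much".

```python
import numpy as np

def location_instability(inferred, landmarks, diag):
    """inferred, landmarks: (N, 2) arrays of (row, col) positions over N images;
    diag: image diagonal used for normalization."""
    d = np.linalg.norm(inferred - landmarks, axis=1) / diag  # normalized distances
    return float(d.std())                                    # spread across images

# Toy case: the node's position tracks the landmark with a constant offset,
# so the distance never changes and the instability is zero.
inferred = np.array([[10.0, 12.0], [11.0, 12.5], [9.5, 11.0]])
landmarks = np.array([[8.0, 10.0], [9.0, 10.5], [7.5, 9.0]])
inst = location_instability(inferred, landmarks, diag=np.sqrt(2) * 64)
```

A constant offset between the inferred position and the landmark is harmless under this metric; only variation of the offset across images counts as instability.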

Hybrid And-Or graph for semantic parts
The explanatory graph makes it plausible to transfer intermediate-layer features of a CNN to semantic object parts. In this section, we further designed a hybrid And-Or graph (AOG) on top of the explanatory graph, which associated nodes in the explanatory graph with explicit part names. We used the AOG to test the transferability of nodes in the explanatory graph, because the AOG has been demonstrated to be a classical model suitable for representing the compositional hierarchy of objects [28], [54]. Adapting nodes of the explanatory graph in this way enabled us to evaluate the clarity of the compositional hierarchy encoded in a pre-trained CNN.
The structure of the AOG is inspired by [48], and the learning of the AOG was originally proposed in [44]. As Fig. 10 shows, the AOG encodes a four-layer hierarchy for each semantic part, i.e. the semantic part (OR node), part templates (AND node), latent parts (OR nodes, i.e. nodes in the explanatory graph), and neural activation units (terminal nodes).

Layer | Name          | Node type     | Notation
1     | semantic part | OR node       | V^sem
2     | part template | AND node      | V^tmp ∈ Ω^tmp
3     | latent part   | OR node       | V^lat ∈ Ω^lat
4     | neural unit   | terminal node | x ∈ Ω^unt

Latent parts correspond to nodes from the explanatory graph.
In the AOG, each OR node (e.g. a semantic part or a latent part) contains a list of alternative appearance (or deformation) candidates. Each AND node (e.g. a part template) uses a number of latent parts to describe its compositional regions.
• The OR node of a semantic part contains a total of m part templates, which represent alternative appearance or pose candidates of the part.
• Each part template (AND node) retrieves K latent parts from the explanatory graph as its children. These latent parts describe the compositional regions of the part.
• Each latent part (OR node) takes all units in its corresponding filter's feature map as children, which represent its deformation candidates on image I.

[Fig. 10. Schematic illustration of an And-Or graph for semantic object parts. The AOG encodes a four-layer hierarchy for each semantic part, i.e. the semantic part (OR node), part templates (AND nodes), latent parts (OR nodes, taken from the explanatory graph), and neural activation units (terminal nodes). The OR node of a semantic part contains a number of alternative appearance candidates as children; each OR node of a latent part encodes a list of neural activation units as alternative deformation candidates; each AND node (a part template) uses a number of latent parts to describe its compositional regions.]

Technical details: Based on the AOG, we use the extracted latent parts to infer semantic parts in a bottom-up manner. We first compute the inference scores of different units at the bottom layer w.r.t. different
latent parts, and then we propagate inference scores up to the layers of part templates and the semantic part for part localization.
The top OR node of the semantic part V^sem contains a total of m part templates, which represent alternative appearance or pose candidates of the part. We manually define the composition of the m part templates. During the part-inference process, given an image I, V^sem selects its best child as the true part template, i.e. $\hat{V}^{tmp} = \operatorname{argmax}_{V^{tmp}\in\Omega^{tmp}} S_{V^{tmp}}$, where $S_{V^{X}}$, X ∈ {sem, tmp, lat, unt}, denotes the inference score of $V^{X}$. Then, each part template V^tmp uses a number of latent parts to describe the sub-regions of the part. In the scenario of one-shot learning, we only annotate one part sample belonging to each part template. Then, we retrieve latent parts (nodes) related to the annotated part from all nodes in the explanatory graph. Given the inference score $S_{V^{lat}}$ and the inferred position $p_{V^{lat}}$ of each latent part V^lat on I, we retrieve the top K latent parts with the highest scores of $S_{V^{lat}} \cdot \mathcal{N}(p_{V^{lat}} \mid \mu = p^{*}_{V^{tmp}}, \sigma^{2})$ as children of V^tmp, where $p^{*}_{V^{tmp}}$ denotes the annotated position of the part V^tmp and $\sigma^{2} = (0.3 \times \text{image width})^{2}$ is a constant variance.
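The retrieval rule just described ranks every candidate node by its inference score weighted by a Gaussian on the distance to the annotated position. Below is an illustrative sketch of this ranking; all scores, positions, and the image width are made-up values.

```python
import numpy as np

def retrieve_latent_parts(scores, positions, p_star, sigma, K):
    """Rank candidate latent parts by S_{V_lat} * N(p_{V_lat} | mu=p_star, sigma^2)
    (up to a constant normalization factor) and return the indices of the top K."""
    positions = np.asarray(positions, dtype=float)
    sq_dist = ((positions - p_star) ** 2).sum(axis=1)
    weights = np.asarray(scores) * np.exp(-sq_dist / (2 * sigma ** 2))
    return np.argsort(weights)[::-1][:K]

scores = [0.9, 0.8, 0.4]                       # S_{V_lat} for three candidate nodes
positions = [(50, 50), (10, 10), (52, 48)]     # inferred positions p_{V_lat}
p_star = np.array([51.0, 49.0])                # annotated part position
children = retrieve_latent_parts(scores, positions, p_star,
                                 sigma=0.3 * 100, K=2)   # sigma = 0.3 * image width
```

Note how the second candidate, despite its high raw score, is demoted by the Gaussian because it lies far from the annotated position, so the two nearby candidates are retrieved.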
Once we have extracted a set of latent parts for a part template, given a new image, we can use the inference results of the latent parts to localize the part template, e.g. by letting each latent part vote for the template position, $p_{V^{tmp}} = \frac{1}{K}\sum_{V^{lat}} \big(p_{V^{lat}} + \Delta p_{V^{lat},V^{tmp}}\big)$, where $\Delta p_{V^{lat},V^{tmp}}$ denotes a constant displacement from V^lat to V^tmp.
Each latent part V^lat takes a channel of units as children, which represent its deformation candidates on image I. The score of each unit x is given as $S_{V^{lat}\to x} = F(x)\,P(p_x \mid V^{lat}, R, \theta)$. The OR node of V^lat selects the unit with the maximum score as its deformation configuration, i.e. $\hat{x} = \operatorname{argmax}_{x} S_{V^{lat}\to x}$. Please see [44] for details of the AOG.
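The bottom OR-node inference above can be sketched as scoring every unit in a filter's feature map by its activation strength times a spatial prior, then taking the argmax. The activation values and the Gaussian form of the prior below are synthetic stand-ins for F(x) and P(p_x | V^lat, R, θ).

```python
import numpy as np

H = W = 8
rng = np.random.default_rng(0)
F = rng.random((H, W))                    # stand-in for activation strengths F(x)

mu = np.array([3.0, 5.0])                 # expected part position (assumed)
ys, xs = np.mgrid[0:H, 0:W]
# stand-in spatial prior P(p_x | V_lat, R, theta): Gaussian around mu
P = np.exp(-((ys - mu[0]) ** 2 + (xs - mu[1]) ** 2) / (2 * 2.0 ** 2))

S = F * P                                 # S_{V_lat -> x} for every unit x
x_hat = np.unravel_index(S.argmax(), S.shape)   # chosen deformation (row, col)
```

The spatial prior suppresses strong activations that appear far from where the part is expected, so the selected unit balances activation strength against geometric plausibility.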

Experimental settings of three-shot learning
Given a fine-tuned VGG-16 network, we learned an explanatory graph and built the AOG upon it, following the few-shot-learning scenario in [44]. For each category, we set three templates for the head part (m = 3) and used three part-box annotations for the three templates. Note that we used object images without part annotations to learn the explanatory graph, and we used the three part annotations provided by [44] for each part to build the AOG. All object-box annotations and part annotations were equally provided to all baselines to enable fair comparisons (besides part annotations, all baselines also used the object annotations contained in the datasets for learning). We set $K = 0.1\sum_{L,d} N_{L,d}$ to learn AOGs for categories in the ILSVRC Animal-Part and CUB200 datasets, and $K = 0.4\sum_{L,d} N_{L,d}$ for VOC Part categories. Then, we used the AOGs to localize semantic parts on objects.

Baselines: We compared our AOGs with a total of fourteen baselines for part localization. The baselines included (i) approaches for object detection (i.e. directly detecting target parts from objects), (ii) graphical/part models for part localization, and (iii) methods that select CNN features to describe object parts.
The first baseline was the standard fast-RCNN [9], namely Fast-RCNN (1 ft), which directly fine-tuned a VGG-16 network based on part annotations. The second baseline, namely Fast-RCNN (2 fts), first used massive object-box annotations in the target category to fine-tune the VGG-16 network with the loss of object detection, and then, given part annotations, further fine-tuned the VGG-16 to detect object parts. We used [30] as the third baseline, namely CNN-PDD. CNN-PDD selected certain filters of a CNN to localize the target part; here, the CNN was pre-trained on the ImageNet dataset [25]. Just like Fast-RCNN (2 fts), we extended [30] as the fourth baseline, CNN-PDD-ft, which fine-tuned a VGG-16 network using object-box annotations before applying the technique of [30]. The fifth and sixth baselines were DPM-related methods, i.e. the strongly supervised DPM (SS-DPM-Part) [2] and the technique in [16] (PL-DPM-Part), respectively. The seventh baseline, namely Part-Graph, used a graphical model for part localization [5]. For weakly supervised learning, "simple" methods are usually insensitive to model over-fitting; thus, we designed six further baselines as follows. First, we used object-box annotations in a category to fine-tune the VGG-16 network. Then, given a few well-cropped object images, we used selective search [36] to collect image patches and used the VGG-16 network to extract fc7 features from these patches. The baselines fc7+linearSVM, fc7+RBF-SVM, and fc7+NN used a linear SVM, an RBF-SVM, and the nearest-neighbor method (selecting the patch closest to the annotated part), respectively, to detect the target part. The other three baselines, fc7+sp+linearSVM, fc7+sp+RBF-SVM, and fc7+sp+NN, combined both the fc7 feature and the spatial position (x, y) (−1 ≤ x, y ≤ 1) of each image patch as features for part detection. The last competing method was the weakly supervised mining of parts from the CNN [44], namely supervised-AOG.
Unlike our method (unsupervised), supervised-AOG used part annotations to extract parts.

Comparisons:
We divided all baselines into three groups. The first group, namely not-learn parts, included traditional methods that did not use deep features, such as SS-DPM-Part, PL-DPM-Part, and Part-Graph. (Representation learning in these methods only used object-box annotations, which is independent of part annotations; a few part annotations were only used to select off-the-shelf pre-trained features.) The second group, termed super-learn parts, contained Fast-RCNN (1 ft), Fast-RCNN (2 fts), CNN-PDD, CNN-PDD-ft, supervised-AOG, fc7+linearSVM, and fc7+sp+linearSVM. These methods learned deep features using part annotations, e.g. the fast-RCNN methods used part annotations to learn features, and supervised-AOG used part annotations to select filters from the CNN to localize parts. The third group (unsuper-learn parts) included CNN-PDD, CNN-PDD-ft, and our method. These methods learned deep features using object-level annotations rather than part annotations.

[Figure: visualization of edges between the 1st and 2nd layers, the 2nd and 3rd layers, and the 3rd and 4th layers of the explanatory graph.]

[Fig. 13. Convergence of the learning process: the average value of log P(X_I | R_I, θ) in Layers 1–4 after different iterations of the learning process.]

Fig. 11 visualizes localization results based on AOGs, which were learned using three annotations of the head part of each category. We used the normalized distance (used in [30], [44]) and the traditional intersection-over-union (IoU) criterion to evaluate the localization performance. Tables 3, 4, 5, 6, and 7 show the part-localization results on the CUB200-2011 dataset [38], the VOC Part dataset [5], and the ILSVRC 2013 DET Animal-Part dataset [44]. AOGs built upon our graph nodes outperformed all baselines in few-shot learning. Note that our AOGs simply localized the center of an object part without sophisticatedly modeling the scale of the part; thus, detection-based methods, which also estimated the part scale, performed better in very few cases. Table 8 compares the unsupervised and supervised learning of parts; in this experiment, our method outperformed all baselines, even including approaches that learned part features using part annotations. Finally, Table 9 compares the part-localization performance when we set different edge numbers M for each node. Explanatory graphs whose nodes each contained 15 edges usually performed better under the IoU criterion, whereas explanatory graphs whose nodes each contained 25 edges exhibited lower normalized distances of part localization.
Note that we tested the explanatory graph and its corresponding AOG from the perspective of part localization, instead of evaluating their performance on object recognition. This is because the explanatory graph was proposed to explain object-part semantics in intermediate layers of the CNN, and the AOG was designed for part localization (i.e. estimating the part location under the condition that the image contains the target part), instead of object recognition (i.e. identifying whether or not the target object appears). Moreover, it was theoretically difficult for nodes in the explanatory graph to outperform the original CNN, because the explanatory graph selectively retrieved part-alike neural activations from high conv-layers and ignored other activations, whereas the fully-connected layers of the CNN used all information (including both object parts and textures) to recognize objects; i.e. the original CNN used much richer information than the explanatory graph.

CONCLUSION AND DISCUSSIONS
In this paper, we have developed a simple yet effective method to learn an explanatory graph that reveals the compositional hierarchy of object parts encoded inside the conv-layers of a pre-trained CNN. The explanatory graph filters out noisy activations, disentangles object parts from each filter, and models the co-activation and spatial relationships between parts. Experiments showed that our graph nodes had significantly higher stability than baselines. More crucially, our method can be applied to different types of networks, including VGG-16, residual networks, and the VAE-GAN, to explain their conv-layers.
The transparent representation of the explanatory graph boosts the transferability of CNN features. Part-localization experiments well demonstrated the good transferability of graph nodes; our method even outperformed the supervised learning of part representations. Nevertheless, the explanatory graph is just a rough representation of the CNN, and it remains difficult to fully disentangle textural patterns from the filters of the CNN.

[Fig. 11. Localization results based on AOGs that were learned using three annotations of the head part.]