SIGN: Statistical Inference Graphs Based on Probabilistic Network Activity Interpretation

Convolutional neural networks (CNNs) have achieved superior accuracy in many vision-related tasks. However, the inference process through a CNN's intermediate layers is opaque, making it difficult to interpret such networks or develop trust in their operation. In this article, we introduce SIGN, a method for modeling a network's hidden-layer activity using probabilistic models. The activity patterns in layers of interest are modeled as Gaussian mixture models, and transition probabilities between clusters in consecutive modeled layers are estimated to identify paths of inference. For fully connected networks, the entire layer activity is clustered, and the resulting model is a hidden Markov model. For convolutional layers, spatial columns of activity are clustered, and a maximum-likelihood model is developed for mining an explanatory inference graph. The graph describes the hierarchy of activity clusters most relevant for network prediction. We show that such inference graphs are useful for understanding the general inference process of a class, as well as for explaining the (correct or incorrect) decisions the network makes about specific images. In addition, SIGN provides interesting observations regarding hidden-layer activity in general, including the concentration of memorization in a single middle layer in fully connected networks and the highly local nature of column activities in the top CNN layers.


INTRODUCTION
Thanks to their impressive performance, convolutional neural networks (CNNs) are the leading architecture for tasks in computer vision [2], [3], [4]. However, current deep-learning methods suffer from poor interpretability of the inference process conducted by their hidden layers. Due to their end-to-end training and complex architecture, the reasoning behind their decision-making process is hard to interpret. This lack of transparency undermines the trustworthiness and reliability of deep networks, especially in sectors where explainability is deemed crucial, such as healthcare and autonomous driving. From this need emerged the research field of Explainable Artificial Intelligence (XAI), aimed at promoting intuitive, human-understandable explanations of the AI decision-making process [5], [6], [7].
A majority of XAI techniques focus on explaining local decisions for specific images or neurons [9], [10], [11]. These methods are popular due to their simplicity and ease of use. However, this kind of interpretation can easily miss important factors in the decision chain leading to a specific prediction. Understanding CNN reasoning by decomposing it into layer-wise stages can provide insights about cases of failure, and reveal weak spots in the network architecture, training scheme, or data collection mechanism. In turn, these insights can lead to more robust networks and allow us to develop more trust in CNN decisions. Some efforts have been made in this direction in recent years [12], [13]; however, it remains a nascent field with many challenges and opportunities ahead.
In this work, we seek to enable a better human understanding of the deep network inference process. As we see it, this mission requires facing several challenges: (1) Transforming a distributed high-dimensional representation into a discrete representation amenable to human reasoning. Deep networks operate through a series of distributed layer representations, manifested by a single activation vector in fully connected (FC) layers and a collection of spatial activation vectors (i.e., columns) in convolutional layers. Human language, however, is made up of discrete symbols, i.e., words, whose meaning is grounded by their reference to objects in the world and their interrelations. Given the richness of a distributed representation and the sheer size of modern networks, discretizing an internal representation may require thousands of visual words. Hence, (2) Selecting a relevant subset of visual words and connections for a specific analysis task (e.g., class-specific or image-specific analysis) is inherent to this endeavor. (3) Developing a visualization system that enables understanding of the visual words and their interrelations, while taking into account human perceptual and cognitive limitations and display capacity [14].
In this paper we introduce SIGN, a statistical explanation framework for the inference process conducted by a deep neural network, based on a probabilistic interpretation of network activity. A demonstration of the SIGN framework output is shown in Fig. 1. In this example, the framework provides an interpretation of a false prediction made by VGG-16, which classified a pineapple image as "swing". SIGN points out which visual words are crucial for the network's prediction and where these visual words occur in the image.
The SIGN framework, illustrated in Fig. 2, is composed of several stages. A probabilistic model is learned on top of the network architecture, either simultaneously with network training or post-hoc, to generatively describe the activity of network layers and the dependencies between them. Activity vectors in each layer are modeled as arising from a multivariate Gaussian mixture model (GMM). The layer activity in FC layers, or the spatial-location activity in convolutional layers, is associated with one of K clusters (GMM components), each representing a visual word. Together, all visual words within the same layer form the layer's dictionary. Connections between visual words of consecutive layers are modeled using conditional probabilities. For a multi-layer perceptron (MLP) network, a full model with efficient inference can be obtained using a hidden Markov model (HMM). For convolutional layers, each spatial location has its own hidden variable, so full exact inference is infeasible due to the high induced width of the resulting graphical model. Instead, the suggested model ignores dependencies between neighboring words, and describes dependencies among visual words in consecutive layers using conditional probability tables. Given a selected subset of images to be explained, the decision process of the network can be described using an inference graph, whose levels represent the visual words used to explain the subset of images in different layers of the network, together with their weighted connections. As the full graph may contain thousands of visual words across all network layers, a useful explanation has to find an informative subgraph containing the most explanatory words for clarifying the network decision. We therefore suggest a maximum-likelihood-based node selection algorithm for finding such informative subgraphs. Finally, based on the node selection algorithm, we visualize the explanation of the network's decision process as an inference graph.
With SIGN we aim to provide an inference investigation tool for deep models with the following contributions:

A class-specific inference graph: The class inference graph provides a succinct summary of the inference process toward a specific class, as it progresses through the network layers. The connections between visual words discovered by SIGN in consecutive layers provide clear insights into the feature aggregation process for this class.

An image-specific inference graph: The inference graph for a specific image highlights the visual words contributing most to a class decision on this image (see Fig. 1). Such a graph is highly useful as a debugging tool for analyzing network failures, as it enables finding the main features leading to a false class prediction, the layers in which these features appear, and where they appear in the input image.

Differences in layer activity behavior between MLP networks and CNNs: Our model demonstrates significant differences in activity behavior between the two network types as inference progresses through the network. For MLP networks, visual words gradually converge, from multiple input-related words to unique class-related words. In contrast, CNN behavior remains local and diverse even at the uppermost layers, with each class represented by a combination of multiple distinct words.

Overfitting capacity of a single layer: Following Zhang et al. [15], we analyze a case of extreme overfit by learning with random labels. Our model discovers that for an MLP network forced into such extreme overfit, the overfitting transformation is concentrated in a single (the middle) hidden layer. In such a case, our suggested tools enable, for the first time as far as we are aware, characterizing where in the network the overfit occurs.

Fig. 1. An ILSVRC [8] dataset image of a pineapple (red-circled), falsely classified by VGG-16 to the "swing" class. This is a partial inference graph with "swing" as the inferred class at the top, and the three most influential visual words from a high-level layer (block5_conv1), forming it. The three visual words show what the network found to be the most important features for the classification of the inspected image. Each such word is represented using six images containing it, with the visual word itself in a red rectangle. The analyzed image is presented above each visual word, with red dots showing where this visual word is found. As seen in the graph, the image is classified based on three "swing"-related visual words describing "rope," "sand," and "grass". The full inference graph is shown in Fig. 9, Section 4.3.

Fig. 2. Illustration of the SIGN framework flow. (A) Modeling neural network layers as arising from a probabilistic generative model, and forming a visual word dictionary for each modeled layer. The generative model can be trained simultaneously with the network or post-hoc. (B) Forward pass through the modeled network on a subset of test images. Calculate the co-occurrence matrix statistics for all visual words that appear in the image. Apply the node selection algorithm on the matrix to obtain the most explanatory visual words. Produce an image inference graph with words as nodes and their connection strengths as edges. (C) The same as (B), but with a subset of analyzed images that represent a class predicted by the network.

RELATED WORK
An important contribution of XAI methods involves explaining neural network decisions and internal mechanisms. Methods can generally be categorized by the scope of their explanations: instance-level explanation methods, and methods explaining internal network behavior. We review below some of the important works in this vast domain. For a more comprehensive review of XAI methods, readers are referred to recent surveys [5], [6], [7].

Instance-Level Explanations
A popular group of methods for network interpretation involves local explanations of specific data instances or network components. This is mainly achieved via backpropagation techniques [9], [10], [11], [16], [17] and perturbation methods [18], [19], [20]. Backpropagation-based methods use the gradient information back-propagated from the output prediction layer to the input layer. Among the popular techniques are activation maximization and attribution. With activation maximization [9], [16], [17], the input space is randomly initialized and optimized to maximize the score of a specific feature, producing an image of what this feature is looking for. Zeiler et al. [19] introduced deconvolution layers showing which input pattern originally induced a given activation. This is done by creating an input map that keeps the examined activation values and sets all others to zero. Attribution methods [10], [11], [21] highlight the input regions that are most valuable for the network prediction on a specific image. This technique was first introduced as saliency maps by Simonyan et al. [10], which show the gradients of a class prediction with respect to an input image. Zhou et al. [21] suggested Class Activation Mapping (CAM), a simple modification to global average pooling that reveals how regions in an input image are correlated with a specific class using a single forward pass. This idea was later enhanced by Grad-CAM [11], which generalized the CAM method to a broader range of CNN models.
Perturbation methods search for correlations between the input and the output while the input image is changed (i.e., perturbed). This is mostly done by occluding groups of image pixels and observing the changes in the network's prediction. LIME [18] uses a superpixel method to occlude pixel groups, and then approximates the network behavior with a linear model that can be interpreted. In [20], the authors search for the perturbation mask that has the maximal effect on the network's output among all masks.
While such visualizations produce good local explanations and are mostly easy to implement, they are anecdotal and insufficient for understanding the network's full reasoning. In addition, these methods are sensitive to input noise [22] and hyper-parameter tuning [23]. As recently shown in [24], in most cases they are no more effective for user understanding than showing the nearest training-set examples.
Instance-level explanations can be enhanced with the addition of network architecture context, to provide a hierarchical analysis of how the decision is propagated through the network's layers. Olah et al. [25] proposed a tool for visualizing the network path for a single image. They decomposed each layer's activations into neuron groups using matrix factorization, and visualized each group both by activation maximization and by attribution. Then, groups from consecutive layers were connected to form a graph structure according to the strength of their weight connections.
The SIGN method offers instance-level explanations within the context of the network architecture, showing the critical features in each layer leading to the classification. This forms a decision chain of the instance along the network layers. Feature visualizations are created based on aggregated information from the entire training set. Visual words are represented by image patches, thus keeping the context of real-world objects.

Internal Network Behavior Explanation
Some explanation efforts are aimed at promoting transparency of hidden-layer behavior. One approach tries to quantify features by their roles across different layers and provide a layer-wise summary of the model. Bau et al. [26] defined six types of semantic patterns (colors, textures, materials, parts, objects, and scenes) identified by CNNs, and labeled image pixels accordingly. In their later work [13], they increased the number of semantic concept roles, thus gaining finer-grained feature groups. In [27], the authors present a model explanation summary in which images with similar explanations are grouped together and then sub-grouped based on the similar features that explain them. They evaluate explanations based on information theory.
Another form of network-centric XAI techniques aims to look beyond individual features and provide a layer-wise description of the inner decision mechanism operating to produce the desired outputs. Some methods do so by enforcing hidden features to carry meaningful representations [28], [29]. Similar to [26], Zhang et al. [28] aimed to associate filters with object parts, but in this work the network filters are enforced to do so. Chen et al. [29] impose interpretability by encouraging spatial columns of the topmost convolutional layer to represent part prototypes of a specific class, with each prototype equated with a spatial patch of an input image belonging to the same class. Loss terms are added to enforce between-class separation and within-class consistency.
Some works aim to explain the overall internal model behavior without altering the network representations. CNNVis [30] provides a model summary, with neurons in each layer clustered into groups having similar activity patterns. A graph between neuron clusters of subsequent layers is then formed based on the average weight strengths over the clusters' neurons. Hohman et al. [12] proposed Summit, an interactive visualization tool presenting a class attribution graph. This graph visualizes the aggregated top channel activations in each layer across all images within the same class. Connections between channels in consecutive layers are quantified based on the influence of the former channel on the latter. This graph reveals interesting and unexpected connections between channels associated with different classes across layers.
The above methods offer a wider perspective on how these intelligent systems work. Inspired by these works, we suggest a novel approach for representing hidden layers. We go beyond specific features and simple clustering techniques to offer a holistic, generative model for full network explanation.

METHOD
Inference graphs for an MLP, for which a full graphical model can be suggested, are presented in Section 3.1. The more general case of a CNN is discussed in Section 3.2, and its related graph-mining algorithm in Section 3.3. Models can be trained on the full set of network layers or on a subset, indexed by $l \in \{1, \ldots, L\}$.

Inference Graphs for MLPs
A network composed of FC layers can be modeled by a single probabilistic graphical model based on the following assumptions: (a) The activity of a layer can be modeled by a single mixture model. (b) Conditional independence holds between the activity of layer $l$ and the activities of layers preceding $l-1$, given the activity of layer $l-1$. (c) Layer activity is generated by a rectified normal distribution [31], censored at zero according to the ReLU operation. For such a network, the activity of the hidden layers is modeled by an HMM structure, enabling closed-form inference. The model structure is shown in Fig. 3.
For the $l$th FC layer with $D_l$ neurons, denote the activation vector as $x^l = (x^l[1], \ldots, x^l[D_l]) \in \mathbb{R}^{D_l}$. The distribution of $x^l$ is modeled using a mixture of $K_l$ hidden states (i.e., clusters), with a discrete hidden variable $h^l \in \{1, \ldots, K_l\}$ denoting the cluster index. To model the ReLU operation, each neuron activation $x^l[d]$ is generated from a rectified Gaussian distribution. The conditional probability $P(x^l \mid h^l)$ is hence assumed to be a rectified multivariate Gaussian distribution with a diagonal covariance matrix. Connections between hidden variables in consecutive layers are modeled by a conditional probability table (CPT) $P(h^l \mid h^{l-1})$.
Using this generative model, an activity pattern for the network is sampled in three steps. First, a path $(h^1, \ldots, h^L)$ of hidden states is generated according to the transition probabilities

$$P(h^l = k \mid h^{l-1} = k') = t^l_{k,k'}, \qquad (1)$$

where $t^l \in \mathbb{R}^{K_l \times K_{l-1}}$ is a learned CPT. For notational simplicity, we define $h^0 = \{\}$, so $P(h^1 \mid h^0)$ is actually $P(h^1)$, parametrized by $P(h^1 = k) = t^1_k$. After path generation, "pre-ReLU" Gaussian vectors $(y^1, \ldots, y^L)$, with $y^l \in \mathbb{R}^{D_l}$, are generated based on the chosen hidden variables. A single variable $y^l[d]$ is formed according to

$$P(y^l[d] \mid h^l = k) \sim \mathcal{N}\big(y^l[d] \mid \mu^l_{d,k}, \sigma^l_{d,k}\big), \qquad (2)$$

where $\mu^l_{d,k}$ and $\sigma^l_{d,k}$ are the mean and standard deviation of the $d$th element in the $k$th component of layer $l$. Since the observed activity $x^l[d]$, generated as $x^l[d] = \max(y^l[d], 0)$, is a deterministic function of $y^l[d]$, its conditional probability $P(x^l[d] \mid y^l[d])$ can be written as

$$P(x^l[d] \mid y^l[d]) = \mathbb{1}_{y^l[d] > 0}\,\delta_{(x = y^l[d])} + \mathbb{1}_{y^l[d] \le 0}\,\delta_{(x = 0)}, \qquad (3)$$

with $\delta_{(x = c)}$ the Dirac delta function concentrating the distribution mass at $c$. The full likelihood of the model is given by

$$P(X, Y, H) = \prod_{l=1}^{L} P(h^l \mid h^{l-1})\, P(y^l \mid h^l)\, P(x^l \mid y^l),$$

where $P(h^l \mid h^{l-1})$, $P(y^l \mid h^l)$, and $P(x^l \mid y^l)$ are stated in (1), (2), and (3), respectively. $Y, H, X$ are tuples representing their respective variables across all layers, e.g., $H = \{h^l\}_{l=1}^{L}$. The set of parameters learned to optimize the model likelihood is $\Theta = \{t^l, \mu^l_{d,k}, \sigma^l_{d,k}\}_{l,d,k}$.

Fig. 3. The model structure. Each orange rectangle is a layer activation vector after the ReLU operation. The activation $x^l[d]$ of neuron $d$ in layer $l$ is assumed to be generated from a rectified Gaussian density, resetting values lower than zero to zero. $y^l[d]$ is the parent of $x^l[d]$, describing the original Gaussian density before it was rectified. $h^l$ is a hidden variable generating the vector of multivariate Gaussians $y^l$; $h^l$ takes $K_l$ different states, creating a mixture of multivariate Gaussians for layer $l$.
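To make the sampling process described above concrete, the following minimal NumPy sketch draws one activity pattern from the model: a hidden-state path is sampled from the CPTs, pre-ReLU vectors are drawn from the selected Gaussian components, and the rectification is applied. All sizes and parameter values are illustrative placeholders, not parameters of an actual trained model.

```python
# A minimal sketch (NumPy) of sampling one activity pattern from the
# generative model above. All sizes and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)
L, D, K = 3, 8, 4                                 # layers, neurons per layer, clusters per layer
t1 = np.full(K, 1.0 / K)                          # P(h^1 = k)
T = rng.dirichlet(np.ones(K), size=(L - 1, K))    # T[l-1, k', :] = P(h^l = . | h^{l-1} = k')
mu = rng.normal(size=(L, K, D))                   # pre-ReLU cluster means
sigma = np.abs(rng.normal(size=(L, K, D))) + 0.1  # pre-ReLU cluster stds

# Step 1: sample a path (h^1, ..., h^L) of hidden states.
h = [rng.choice(K, p=t1)]
for l in range(1, L):
    h.append(rng.choice(K, p=T[l - 1, h[-1]]))

# Steps 2-3: sample pre-ReLU vectors y^l and rectify, x^l[d] = max(y^l[d], 0).
x = []
for l, k in enumerate(h):
    y = rng.normal(mu[l, k], sigma[l, k])
    x.append(np.maximum(y, 0.0))
```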
Training Algorithm. In [32], an EM formulation was suggested for training a mixture of censored Gaussians. We extend this idea to the HMM formulation in an online setting. Following [33], the online EM algorithm tracks the sufficient statistics using running averages, and updates the model parameters using these statistics. We train with mini-batches, which enables scalable learning for large-scale networks. Training iterates between updating the running averages of the sufficient statistics (an online approximation of the E-step), and updating the parameters based on these averages (the M-step). This procedure is shown in [33] to be consistent (i.e., finding a stationary point of the data log-likelihood with probability 1) and asymptotically efficient.

Update Equations. In a batch EM formulation, model parameters are updated based on sample statistics of interest. Each statistic is defined as the sample average $\frac{1}{N}\sum_{i=1}^{N} f(X_i)$ for a function $f(X)$ of interest. In the online setting, for each such function $f(X)$, an online sample estimator is kept, denoted here by $\langle f(X) \rangle$. Given a batch of examples $\{X_i\}_{i=1}^{B}$ and an adaptation parameter $\alpha > 0$, $\langle f(X) \rangle$ is updated in iteration $q+1$ as a running average,

$$\langle f(X) \rangle_{q+1} = (1 - \alpha)\, \langle f(X) \rangle_{q} + \alpha\, \frac{1}{B}\sum_{i=1}^{B} f(X_i).$$

The tracked sufficient statistics are used in the update of the model parameters as follows. The transition probability between hidden states, $t^l_{k,k'}$, is updated by normalizing the tracked joint statistic,

$$t^l_{k,k'} = \frac{\langle P(h^l = k, h^{l-1} = k' \mid X, \Theta) \rangle}{\sum_{k''} \langle P(h^l = k'', h^{l-1} = k' \mid X, \Theta) \rangle}.$$

The average joint distribution of clusters from consecutive layers, $\langle P(h^l = k, h^{l-1} = k' \mid X, \Theta) \rangle$, is a tracked statistic, computed for each example using the forward-backward algorithm [34]. For the first layer, $t^1_k$ is updated analogously using the tracked statistic $\langle P(h^1 = k \mid X, \Theta) \rangle$.

The mean $\mu^l_{d,k}$ of an estimated Gaussian before rectification (dropping the $l$ index for notational convenience) is

$$\mu_{d,k} = \frac{\langle P(h = k \mid x[d], \Theta)\, \hat{y}[d] \rangle}{\langle P(h = k \mid x[d], \Theta) \rangle}, \qquad
\hat{y}[d] = \begin{cases} x[d], & x[d] > 0, \\ M_1(\mu_{d,k}, \sigma_{d,k}), & x[d] = 0. \end{cases} \qquad (8)$$

The new mean is a weighted average of the examples' $y$ activities, with each example contributing according to its probability of belonging to the cluster. When $x[d] = 0$, the value of the activity prior to the ReLU operation is estimated using the first moment of a rectified Gaussian, $M_1(\mu_{d,k}, \sigma_{d,k})$, with the $\mu_{d,k}$ and $\sigma_{d,k}$ values taken from the previous iteration. $M_1(\mu_{d,k}, \sigma_{d,k})$ has a closed-form solution [32] for a known mean and variance, expressed in terms of $G(-\frac{\mu}{\sigma} \mid 0, 1)$ and $C(-\frac{\mu}{\sigma} \mid 0, 1)$, the density and cumulative values of the standard normal distribution at $-\frac{\mu}{\sigma}$. Since $\hat{y}[d]$ has two cases, two running statistics are tracked for the computation of the numerator in Eq. (8): $\langle P(h = k \mid x[d], \Theta) \cdot x[d]\, \mathbb{1}_{x[d] > 0} \rangle$ and $\langle P(h = k \mid x[d], \Theta) \cdot \mathbb{1}_{x[d] = 0} \rangle$.

The standard deviation $\sigma^l_{d,k}$ of an estimated Gaussian density before rectification (again dropping the $l$ index) is updated analogously, by a formula that can be seen as a weighted sum-of-squares together with a correction factor involving $M_2(\mu_{d,k}, \sigma_{d,k})$, the second moment of a censored Gaussian distribution, which also has a closed-form solution [32].
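The core of the online E-step is the running-average update of each tracked statistic. The sketch below shows the assumed form of this update (an exponentially weighted average over mini-batches) applied to one of the statistics mentioned above; array shapes and names are illustrative, not taken from a reference implementation.

```python
# A minimal sketch of the running-average update used by online EM:
# each tracked statistic <f(X)> is blended with its mini-batch average.
import numpy as np

def update_running_stat(stat, batch_values, alpha):
    """stat: current running average <f(X)>; batch_values: f(X_i) over the
    mini-batch (first axis = examples); alpha: adaptation parameter in (0, 1]."""
    return (1.0 - alpha) * stat + alpha * batch_values.mean(axis=0)

# Example: tracking <P(h=k | x[d]) * x[d] * 1[x[d] > 0]> for one cluster k.
x_batch = np.maximum(np.random.randn(32, 100), 0.0)        # post-ReLU activations
resp_k = np.random.dirichlet(np.ones(5), size=32)[:, 0]    # stand-in responsibilities
stat = np.zeros(100)
stat = update_running_stat(stat, resp_k[:, None] * x_batch * (x_batch > 0), alpha=0.05)
```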

Inference Graphs for CNNs
In a CNN, the activation output of the $l$th convolutional layer is a tensor $X^l \in \mathbb{R}^{H_l \times W_l \times D_l}$, where $H_l$, $W_l$, and $D_l$ correspond to the height, width, and number of maps, respectively. We consider the activation tensor as consisting of $H_l \times W_l$ spatial column examples, $x^l_p \in \mathbb{R}^{D_l}$, located at $p = (i, j) \in \{1, \ldots, H_l\} \times \{1, \ldots, W_l\}$, and wish to model each such location as containing a separate visual word from a dictionary shared by all locations. The number of hidden variables (one per location) is much larger than in an FC layer (where a single hidden variable per layer was used), and their connectivity pattern across layers is dense, leading to a graphical model with high induced width for which exact inference is infeasible [35]. Hence, we turn to simpler modeling and training techniques that are scalable to the size and complexity of CNNs. In this model, the activities of different layers are modeled independently, each using a Gaussian mixture model, and transition probabilities between clusters in consecutive layers are modeled a-posteriori (i.e., they are not part of the generative model).
Layer Dictionaries. Similarly to the full activity vector in FC layers, the activity of a spatial column $x^l_p$ is described as arising from a GMM of $K_l$ clusters, regarded as visual words forming the layer dictionary. Using a training image set $S_T = \{(I_n, y_n)\}_{n=1}^{N_T}$, a GMM is trained independently for each layer of interest. While each location in layer $l$ has a separate hidden random variable $h^l_p$, the GMM parameters are shared across all spatial locations of that layer, i.e., a single GMM is trained using all spatial columns of layer $l$ over all training images. Training is done using a gradient descent procedure on mini-batches, enabling straightforward GPU implementation. After model training, the activity tensor of layer $l$ for a new example $I$ can be mapped into a tensor $P \in \mathbb{R}^{H_l \times W_l \times K_l}$ holding $P(h^l_p(I) = k)$. We say that the visual word $h^l_p(I) = k^*$ (an activation column of image $I$ in position $p$ is assigned to cluster $k^*$) iff $k^* = \arg\max_k P(h^l_p(I) = k)$. Accordingly, visual word $k$ in layer $l$ is the cluster $C^l_k = \{(I, p),\ I \in S_T : h^l_p(I) = k\}$ containing the activations, over all positions and all images in the training set $S_T$, for which cluster $k$ has the highest $P(h^l_p(I) = k)$. When the CNN also contains FC layers, these can be modeled using a GMM trained on the layer's activity vectors. This can be regarded as a degenerate case of convolutional-layer modeling, where the number of spatial locations is one. Specifically, the output layer $X^L$ of the network, containing $B$ class pseudo-probabilities (output neurons after softmax), is modeled using a GMM of $B$ components (one component per class). This GMM is not trained; instead, it is fixed such that $\mu_{d,b} = 1$ for $d = b$ and $0$ otherwise, with a constant variance parameter $\sigma_{d,b} = 0.1$. In this setting, cluster $b$ of the output layer contains the images that the network predicts to be of class $b$.
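The sketch below illustrates the mapping just described: the activation tensor of a modeled layer is reshaped into spatial columns, per-column responsibilities under a diagonal-covariance GMM are computed, and each location is assigned its visual word by the arg-max rule. The GMM parameters are assumed to be already trained; function and variable names are illustrative.

```python
# A minimal sketch of assigning visual words to the spatial columns of a
# conv layer under a trained diagonal-covariance GMM (illustrative names).
import numpy as np

def assign_visual_words(X, mu, sigma, pi):
    """X: activation tensor (H, W, D); mu, sigma: (K, D); pi: (K,) mixture weights.
    Returns hard word assignments (H, W) and responsibilities (H, W, K)."""
    H, W, D = X.shape
    cols = X.reshape(-1, D)                                   # spatial columns x^l_p
    # log N(x | mu_k, diag(sigma_k^2)) for every column and every component
    log_gauss = -0.5 * ((((cols[:, None, :] - mu[None]) / sigma[None]) ** 2)
                        + 2.0 * np.log(sigma[None]) + np.log(2.0 * np.pi)).sum(axis=-1)
    log_post = np.log(pi)[None, :] + log_gauss                # unnormalized log P(h = k | x)
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)
    words = post.argmax(axis=1)                               # h^l_p = argmax_k P(h^l_p = k)
    return words.reshape(H, W), post.reshape(H, W, -1)
```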
Probabilistic Connections Between Layer Dictionaries. Transition probabilities between visual words in consecutive layers are modeled a-posteriori. For two consecutive modeled layers $l'$ and $l$ ($l' < l$), the receptive field $R(p)$ of location $p$ in layer $l$ is defined as the set of locations $\{q = p + o : o \in O\}$ in layer $l'$ used in the computation of $x^l_p$, where $O$ is a set of integer offsets $\{(\Delta x, \Delta y)\}$. Using a validation sample $S_V = \{I_n\}_{n=1}^{N_V}$, we compute for each two consecutive modeled layers $l$ and $l'$ the co-occurrence matrix $N \in M_{K_l \times K_{l'}}$ between the visual words these two dictionaries contain,

$$N(k, k') = \big|\{(I_n, p, q) : h^l_p(I_n) = k,\ h^{l'}_q(I_n) = k',\ q \in R(p)\}\big|. \qquad (13)$$

Using $N$, we can obtain the following first- and second-order statistics:

$$\hat{P}(h^l = k) = \frac{\sum_{k'} N(k, k')}{\sum_{k'', k'} N(k'', k')} \quad \text{(and analogously for layer } l' \text{ using column sums)}, \qquad (14)$$

$$\hat{P}(h^{l'} = k' \mid h^l = k) = \frac{N(k, k')}{\sum_{k''} N(k, k'')}. \qquad (15)$$

The transition probabilities as defined above are abbreviated in the following discussion as $\hat{P}(h^{l'} = k' \mid h^l = k)$. As defined, these probabilities are averaged over the specific positions in the receptive field, since modeling position-specific transition probabilities separately would lead to a proliferation in the number of parameters.

Training Algorithm. The GMM parameters $\Theta_l$ of layer $l$ are trained by associating a GMM layer with each modeled layer of the network. Since we do not wish to alter the network's behavior, the GMM gradients are not propagated toward lower layers of the network. We considered two different optimization approaches for training $\Theta_l$:

Generative loss - The optimization objective is to minimize the negative log-likelihood of the layer's spatial columns under the GMM,

$$L_G(X^l(I_n); \Theta_l) = -\sum_{p} \log \sum_{k=1}^{K_l} \pi^l_k\, G\big(x^l_p(I_n) \mid \mu^l_k, \sigma^l_k\, I_{\mathrm{eye}}\big), \qquad (16)$$

where $G$ is the Gaussian distribution function, $\pi^l_k$ is the mixture probability of the $k$th component in layer $l$, and $I_{\mathrm{eye}} \in M_{D_l \times D_l}$ is the identity matrix.

Discriminative loss - The probability tensor $P$ is summarized into a histogram of visual words $\mathrm{Hist}_l(X^l(I_n)) \in \mathbb{R}^{K_l}$ using a global pooling operation. A linear classifier $W \cdot \mathrm{Hist}_l(I_n)$ is formed and optimized by minimizing a cross-entropy loss, where $W$ holds the classifier weights:

$$L_D(X^l(I_n); \Theta_l, y_n) = -\log P\big(\hat{y}_n = y_n \mid W \cdot \mathrm{Hist}_l(X^l(I_n); \Theta_l)\big). \qquad (17)$$

Here, $y_n$ is the true label of image $I_n$ and $\hat{y}_n$ is the predicted output after a softmax transformation. These two losses are separate approaches for optimizing the model's parameters; an empirical comparison between them is given in Section 4.3.
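A minimal sketch of the a-posteriori connection estimation follows: per-image word maps of two consecutive modeled layers are scanned, the co-occurrence matrix $N$ of Eq. (13) is accumulated over receptive-field offsets, and the transition probabilities of Eq. (15) are obtained by row normalization. Grid alignment between the two layers is assumed to follow the $q = p + o$ convention above; all names are illustrative.

```python
# A minimal sketch of estimating the co-occurrence matrix N (Eq. (13)) and
# the transition probabilities between two consecutive modeled layers.
import numpy as np

def cooccurrence(words_hi_list, words_lo_list, offsets, K_hi, K_lo):
    """words_hi_list / words_lo_list: per-image word maps of layers l and l';
    offsets: the receptive-field offsets O as (di, dj) pairs."""
    N = np.zeros((K_hi, K_lo))
    for w_hi, w_lo in zip(words_hi_list, words_lo_list):       # one pair per image
        Hl, Wl = w_lo.shape
        for i in range(w_hi.shape[0]):
            for j in range(w_hi.shape[1]):
                for di, dj in offsets:                          # q = p + o, q in R(p)
                    qi, qj = i + di, j + dj
                    if 0 <= qi < Hl and 0 <= qj < Wl:
                        N[w_hi[i, j], w_lo[qi, qj]] += 1
    return N

def transition_probs(N):
    # P_hat(h^{l'} = k' | h^l = k): normalize each row of N (Eq. (15))
    return N / np.maximum(N.sum(axis=1, keepdims=True), 1e-12)
```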
For ImageNet-scale networks, full modeling of the entire network at once may require thousands of visual words per layer. Training such large dictionaries is not feasible with current GPU memory limitations (12 GB for a TitanX). Our solution is to train a class-specific model, explaining network behavior for a specific class b and its "neighboring" classes, i.e., all classes erroneously predicted by the network for images of class b. The set of neighboring classes is chosen based on the network's confusion matrix computed on the validation set. The model is trained on all training images of class b and its neighbors.

Node Selection Algorithm
Consider a graph in which the column activity clusters (i.e., visual words) $\{C^l_k\}_{l=1,\,k=1}^{L,\,K_l}$ are the nodes, and transition probabilities between clusters of consecutive layers quantify the edges between the nodes. Typically, this graph contains thousands of nodes and, thus, is not feasible for human interpretation. However, specific subgraphs may have high explanatory value. Specifically, the clusters $C^L_k$ of the final layer in this graph represent images for which the network predicted class $k$. To understand this decision, we evaluate clusters $C^{L-1}_{k'}$ in the previous layer using a score based on the transition probabilities $P(h^L = k \mid h^{L-1} = k')$. This step of finding a set of "explanatory" clusters in layer $L-1$ is then repeated for lower layers. Below, we develop a suitable iterative algorithm. Given a validation subset of images $V = \{I_n\}_{n=1}^{N}$, it outputs a subgraph of the nodes that most "explain" the network decisions on $V$, where "explanation" is defined in the maximum-likelihood sense. We first explain node selection for a single visual word in a single image, and then extend this notion to a full algorithm operating on multiple visual words and images.

Explaining a Single Visual Word
Consider an instance of a single visual word $h^l_p(I) = s$, derived from the column activity at location $p$ in layer $l$ for image $I$. Given this visual word, we look for the visual words in $R(p)$ that contribute most to its likelihood, given by (omitting the image notation $I$ in $h^l_p(I)$ for brevity)

$$P\big(h^l_p = s \mid \{h^{l'}_q\}_{q \in R(p)}\big) = \frac{P\big(\{h^{l'}_q\}_{q \in R(p)} \mid h^l_p = s\big)\, P(h^l_p = s)}{P\big(\{h^{l'}_q\}_{q \in R(p)}\big)} \approx \frac{P(h^l_p = s) \prod_{q \in R(p)} P\big(h^{l'}_q \mid h^l_p = s\big)}{\prod_{q \in R(p)} P\big(h^{l'}_q\big)}. \qquad (18)$$

In the last step, two simplifying assumptions were made: conditional independence over locations in the receptive field (numerator) and independence of locations (denominator). Taking the logarithm, we decompose the expression and see the contribution of the visual words to the likelihood,

$$\log P\big(h^l_p = s \mid \{h^{l'}_q\}_{q \in R(p)}\big) \approx \log P(h^l_p = s) + \sum_{q \in R(p)} \log \frac{P\big(h^{l'}_q \mid h^l_p = s\big)}{P\big(h^{l'}_q\big)}. \qquad (19)$$

Denote by $Q^{l'}_t(I, p) = \big|\{q : h^{l'}_q = t,\ q \in R(p)\}\big|$ the number of times visual word $t$ appears in the receptive field of location $p$. We look for a subset of words $T \subseteq \{1, \ldots, K_{l'}\}$ that contributes the most to the likelihood of $h^l_p = s$. Thus, the problem we solve is

$$\max_{T \subseteq \{1, \ldots, K_{l'}\},\ |T| = Z}\ \sum_{t \in T} Q^{l'}_t(I, p)\, \log \frac{P\big(h^{l'}_q = t \mid h^l_p = s,\ q \in R(p)\big)}{P\big(h^{l'}_q = t\big)}. \qquad (20)$$

The solution is obtained by choosing the first $Z$ words for which the score

$$S^{l'}(I, p, s, t) = Q^{l'}_t(I, p)\, \log \frac{P\big(h^{l'}_q = t \mid h^l_p = s,\ q \in R(p)\big)}{P\big(h^{l'}_q = t\big)} \qquad (21)$$

is the highest. Intuitively, the score of visual word $t$ is the product of two terms: $Q^{l'}_t(I, p)$, which measures the word's frequency in the receptive field, and the log-ratio term, which measures how likely it is to see word $t$ in the receptive field compared to seeing it in general. To compute the probabilities in the log term of the score, we use the estimations $\hat{P}(h^l = k)$ and $\hat{P}(h^{l'} = k' \mid h^l = k)$ given by Eqs. (14) and (15), respectively.
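The score of Eq. (21) and the top-$Z$ selection of Eq. (20) reduce to a few lines once the receptive-field word counts and the estimated probabilities are available. The sketch below assumes `trans[s, t]` approximates $\hat{P}(h^{l'} = t \mid h^l = s)$ and `prior[t]` approximates $\hat{P}(h^{l'} = t)$; all names are illustrative.

```python
# A minimal sketch of scoring lower-layer words for a single word s at
# location p, and selecting the Z highest scorers (Eqs. (20)-(21)).
import numpy as np

def explain_single_word(word_map_lo, p, offsets, s, trans, prior, Z):
    """word_map_lo: lower-layer word assignments (H', W'); p: (i, j) in layer l;
    trans: (K_l, K_l') conditional estimates; prior: (K_l',) marginal estimates."""
    Hl, Wl = word_map_lo.shape
    Q = np.zeros(trans.shape[1])                       # Q^{l'}_t(I, p): word counts in R(p)
    for di, dj in offsets:
        qi, qj = p[0] + di, p[1] + dj
        if 0 <= qi < Hl and 0 <= qj < Wl:
            Q[word_map_lo[qi, qj]] += 1
    score = Q * np.log((trans[s] + 1e-12) / (prior + 1e-12))   # Eq. (21)
    return np.argsort(score)[::-1][:Z], score                  # top-Z words, Eq. (20)
```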

Explaining Multiple Words and Images
The optimization problem presented in Eq. (20) can be extended to multiple visual words in multiple images using column-position and image independence assumptions. Assume a set of validation images $V$ is being analyzed, and a set of words $S \subseteq \{1, \ldots, K_l\}$ from layer $l$ has to be explained by lower-layer words for these images. We would like to maximize the likelihood of the set of all column activities $\{h^l_p(I_n) : h^l_p(I_n) \in S,\ I_n \in V\}$ in which a word from $S$ appears. Assuming column-position independence, this likelihood decomposes into terms similar to Eq. (18),

$$\log P\big(\{h^l_p(I_n) : h^l_p(I_n) \in S,\ I_n \in V\} \mid \{h^{l'}_q(I_n)\}\big) = \sum_{I_n \in V}\ \sum_{p\,:\,h^l_p(I_n) \in S} \log P\big(h^l_p(I_n) \mid h^{l'}_q(I_n),\ q \in R(p)\big).$$

Repeating the derivation given in Eqs. (18), (19), and (20) for this expression, we get a similar optimization problem, where $Q^{l'}_{t,s}(V)$ is the aggregation of $Q^{l'}_t(I, p)$ over multiple positions and images,

$$Q^{l'}_{t,s}(V) = \sum_{I_n \in V}\ \sum_{p\,:\,h^l_p(I_n) = s} Q^{l'}_t(I_n, p).$$

That is, $Q^{l'}_{t,s}(V)$ is the number of occurrences of word $s$ with word $t$ in its receptive field over all the images in $V$. The solution is given by choosing the $Z$ words in layer $l'$ for which the score

$$S^{l'}(V, S, t) = \sum_{s \in S} Q^{l'}_{t,s}(V)\, \log \frac{P\big(h^{l'}_q = t \mid h^l_p = s,\ q \in R(p)\big)}{P\big(h^{l'}_q = t\big)} \qquad (25)$$

is maximized. Like the score in Eq. (21), the contribution of word $t$ to the explanation of a single word $s$ is the product of its frequency in the relevant receptive fields and its discriminative-value term.

The inference graph is generated by going over the layers backwards, from the top layer, for which the decision has to be explained, down toward the input layer, selecting the explaining nodes using the score of Eq. (25): at each layer $l$, the algorithm chooses $z^l_1, \ldots, z^l_Z$ to be the $Z$ cluster indices with the largest scores $S^l(V, S, t)$, sets $S = \{z^l_1, \ldots, z^l_Z\}$, and assigns edge weights $e^l_{i,j} = S^l(V, z^{l+1}_i, z^l_j)$ for all $i, j$. See Algorithm 1 for details.
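The following sketch mirrors the backward sweep just described, using the aggregated score of Eq. (25): starting from the word(s) to be explained in the top modeled layer, the $Z$ best-scoring lower-layer words are selected per layer and become the set $S$ for the next step. The data structures (per-layer count tables, transition estimates, and priors) are assumed to be precomputed; all names are illustrative.

```python
# A minimal sketch of the node selection sweep (Algorithm 1), using the
# aggregated score of Eq. (25). All containers are illustrative.
import numpy as np

def node_selection(Q_counts, trans, priors, top_words, Z):
    """Q_counts[i][s], trans[i][s]: (K_i,) arrays relating word s of modeled
    layer i+1 to the words of modeled layer i; priors[i]: (K_i,) marginals.
    top_words: word indices of the top modeled layer to be explained."""
    selected = {len(Q_counts): list(top_words)}
    S = list(top_words)
    for i in range(len(Q_counts) - 1, -1, -1):           # walk from the top layer down
        score = np.zeros(priors[i].shape[0])
        for s in S:                                       # aggregate over the words in S
            score += Q_counts[i][s] * np.log((trans[i][s] + 1e-12) / (priors[i] + 1e-12))
        S = list(np.argsort(score)[::-1][:Z])             # the Z most explanatory words
        selected[i] = S
    return selected                                       # inference-graph nodes, per layer
```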

Visualization Techniques
We consider visualization at several levels, starting from a single visual word and progressing to path inference, and relate them to the two types of networks.

Simple Cluster
We visualize a cluster $C^l_k$ by showing the $m$ examples ($m = 6$) with the highest $P(h^l_p(I) = k)$ across $(I, p) \in S_V$. For FC layers, the full image is shown for each representative example. For convolutional layers, visual word examples are typically sub-regions of the input image (if the receptive field does not cover the image entirely). For receptive fields larger than 25% of the image size, the visual word occurrence is visualized by drawing the relevant image with a red rectangle around the relevant receptive field (see Fig. 1). When the receptive field is smaller, only the region of the receptive field is shown instead of the entire image.

Cluster as a Decision Junction
In MLP networks, the entire layer activity is assigned to a single cluster. We consider such a cluster as a "decision junction," where a decision regarding the consecutive layer's cluster is made. To visualize such a decision, the activity vectors assigned to $C^l_k$ are labeled according to their cluster index in the consecutive layer, thereby forming sub-clusters. We use linear discriminant analysis (LDA) [36] to find a two-dimensional projection of the activities that maximizes the separation of the examples with respect to their sub-cluster labels. To understand the semantics of the sub-clusters, we draw the three most typical representative examples from each sub-cluster near the sub-cluster centroid. We define the most typical examples as the three images with minimal $\ell_2$ distance to the sub-cluster center (see Fig. 4).
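One possible realization of this visualization step, using scikit-learn's LDA, is sketched below: the activity vectors assigned to one cluster are projected onto (at most) two discriminant dimensions according to their next-layer cluster labels, and the most typical members of each sub-cluster are picked by distance to the sub-cluster center (here computed in the projected space for simplicity). Function and variable names are illustrative.

```python
# A minimal sketch of the decision-junction projection with scikit-learn LDA.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def decision_junction_projection(activations, next_layer_labels, n_typical=3):
    """activations: (N, D) vectors assigned to cluster C^l_k;
    next_layer_labels: (N,) their cluster index in the consecutive layer."""
    n_components = min(2, len(np.unique(next_layer_labels)) - 1)
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    proj = lda.fit_transform(activations, next_layer_labels)
    typical = {}
    for c in np.unique(next_layer_labels):
        members = np.where(next_layer_labels == c)[0]
        center = proj[members].mean(axis=0)
        dists = np.linalg.norm(proj[members] - center, axis=1)
        typical[c] = members[np.argsort(dists)[:n_typical]]   # most typical examples
    return proj, typical
```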

Inference Graphs
For MLP networks, the inference path of a specific example $I$ contains a single visual word at each layer. It can be defined by the maximum a-posteriori (MAP) cluster sequence, i.e., the sequence

$$H = (h^1, \ldots, h^L) = \arg\max_{h^1, \ldots, h^L} \log P(h^1, \ldots, h^L \mid X(I)).$$

$H$ can be found using the Viterbi algorithm [37]. The path nodes are visualized using the decision-junction technique of Section 3.4.2. For CNNs, multiple spatial words are active at each layer. We use the technique explained in Section 3.3.2 (Algorithm 1) to generate inference graphs highlighting the main active words. Such graphs can be built for an entire class, by choosing the input $V$ of Algorithm 1 to be the set of all images predicted to belong to the class, or for a single example. In the visualization of such graphs, each visual word is displayed using its $m$ most representative spatial examples in the validation set. Connections between words are characterized by their contribution to the log-ratio component of the maximum-likelihood score (Eq. (25), right term). For inference graphs of a specific image $I$, the node visualization also shows all the spatial locations in $I$ belonging to the corresponding visual word.
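For the MLP case, the MAP path can be recovered with a standard Viterbi pass over the modeled layers. The sketch below assumes the per-layer emission log-likelihoods $\log P(x^l \mid h^l = k)$ and the learned log-CPTs have already been computed; names are illustrative.

```python
# A minimal Viterbi sketch for recovering the MAP cluster sequence
# (h^1, ..., h^L) of one example in the MLP model.
import numpy as np

def viterbi_path(log_emission, log_t1, log_T):
    """log_emission: list of L arrays, one per layer, holding log P(x^l | h^l = k);
    log_t1: (K_1,) initial log-probabilities; log_T: list of (K_l, K_{l-1}) log CPTs."""
    L = len(log_emission)
    delta = [log_t1 + log_emission[0]]
    back = []
    for l in range(1, L):
        scores = log_T[l - 1] + delta[-1][None, :]      # shape (K_l, K_{l-1})
        back.append(scores.argmax(axis=1))              # best predecessor per state
        delta.append(scores.max(axis=1) + log_emission[l])
    path = [int(delta[-1].argmax())]                     # best final state
    for l in range(L - 1, 0, -1):
        path.append(int(back[l - 1][path[-1]]))          # backtrack
    return path[::-1]
```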

Method Limitations and Assumptions
The probabilistic model presented requires $2 D_l K_l + K_l$ additional parameters for each newly modeled layer. As we use a GMM for activation modeling, a single forward pass calculates the conditional probability of each spatial location over the $K_l$ modeled Gaussians. This yields an activation tensor of size $W_l \times H_l \times K_l$. Since $K_l$ typically grows linearly with the number of classes whose inference is modeled, the model is less suitable for modeling many classes simultaneously. Furthermore, in backward steps, the derivative of a Gaussian pdf is calculated with some additional computational cost. However, if training is done post-hoc, typically only about 20 epochs are required until model convergence.
In order to enable a feasible probabilistic model, some assumptions were made in SIGN. The most prominent is the assumption of independence between consecutive layers and between spatial column activations. This does not strictly hold, since the receptive fields of nearby activations overlap and are propagated from the same regions in lower layers. However, such simplifications are necessary for modeling each layer separately using a GMM, as well as for calculating the gain score of a single visual word (Eq. (18)). A second simplifying assumption is the use of a diagonal covariance matrix in the GMM components, required to avoid an $O(KD^2)$ parameter complexity. This amounts to an independence assumption between different maps within the same layer.
Despite these simplifying assumptions, the resulting model enables statistical inference and provides a useful XAI framework.

Implementation Details
The HMM formalism for MLPs was tested by training fully connected networks on the MNIST [38] and CIFAR10 [39] datasets, each containing 10 classes. The networks included six layers, with the first five containing 1,000 neurons each (the last layer has 10 neurons, matching the number of classes). Based on a preliminary evaluation, the number of visual words $K_l$ was set to 40 for all layers.
CNN models included ResNet20 [40] trained on CIFAR10, and VGG-16 [10] and ResNet50 trained on the ILSVRC 2012 dataset [8]. For ResNet20, the outputs of all add-layers after each skip connection were modeled, as these outputs are expected to contain aggregated information. For ResNet50, there are 16 add-layers, and the outputs of add-layers 3, 7, 13, and 16 were modeled. For VGG-16, the first convolutional layer of each block was modeled (block1_conv1, ..., block5_conv1). The numbers of visual words were set to 60, 100, 200, 450, and 1,500 for layers 1-5, respectively, according to the GPU memory limitation.
In all experiments and modeled layers, the GMM's mean parameters were initialized using $K_l$ randomly selected examples. The variance parameters were initialized to the variances computed from 1,000 random examples. Prior probabilities were uniformly initialized to $1/K_l$.

Inference Modeling in MLP Networks
Sequential Path. Fig. 4 depicts how inference paths, drawn as sequences of decision nodes, are useful for error diagnosis. In Fig. 4 (top), the path of an erroneous "car" example in the CIFAR10 network is partially presented. The sub-clusters containing the example are marked with full cyan circles. While layers fc-2 and fc-4 primarily make decisions based on color, the wrong decision leading to the misclassification of the car as "truck" is made at layer fc-3. The example's cluster in layer fc-3 contains six sub-clusters, leading to car and truck clusters in the consecutive layer. At this point, the example was wrongly associated with the sub-cluster representing "truck" due to its exceptional rear appearance, resembling the appearance of a truck front. From this point onwards, the path is associated with "truck" clusters, up until the classification layer. In Fig. 4 (bottom), the path of an erroneous "nine" example in the MNIST network is partially presented with full blue clusters. Correct network decisions are made in layers fc-1 and fc-2, where the network associates the example with primary "nine" sub-clusters. The wrong decision is made in layer fc-3, where the network "sends" the example to a "four" cluster in layer fc-4, continuing with this pattern up until the classification layer.

Cluster Similarity Development Across Layers. Progressing through the FC layers, activity clusters tend to become more class oriented, i.e., dominated by examples from a single class. Furthermore, these clusters become increasingly similar as the layer index progresses, indicating convergence of class examples toward a single class-specific representation. For each cluster, we define its class as the class whose examples are the most frequent among the cluster's examples, and its class-dominance index as the percentage of examples belonging to that dominant class. In Fig. 5 (top), cluster purity and inter-cluster distances are shown for the CIFAR10 FC network layers. The similarity of clusters representing the same class gradually increases in layers 3-5, as evident from the emerging block structure. It can be observed that a significant portion of the training occurs in the middle layer of the network, specifically in the transition between the third representation (the last layer of the first half) and the fourth (the first of the second half). This phenomenon is enhanced in the case of severe overfit discussed below.
Overfitting Capacity in a Single Layer. It is known that networks can be trained to an extreme overfit condition by using randomly generated labels in training [15]. The resulting classifier has a training error of 0, meaning that it has successfully memorized a mapping from all training images to their pseudo labels, but its generalization error is at chance level. We utilize our model to understand how this memorization mapping is formed. A six-FC-layer network was trained on the 60,000 examples of the MNIST data with random labels, until reaching zero training error. In Fig. 6, the similarity matrices between cluster centers for layers 1-5 are shown (top), with typical clusters at each layer (bottom), presented using their label histograms and representative images.
Surprisingly, one can observe a concentration of the network's overfit behavior in a single layer transformation, between the third and fourth layers. In layers 1-3, clusters are input-related. They contain images with similar appearance and, hence, with similar true labels (and uniformly distributed pseudo labels). These clusters are not close in the Euclidean sense. In layer 4, there is a sharp transition to clusters which are completely (pseudo-)label dominated, as indicated by their class purity (histogram) and the block structure in the similarity matrix. The fact that the memorization of the full data (60,000 instances) concentrates almost entirely in a single transformation in the middle of the network (between layers 3 and 4) is a novel, unexpected observation, for which we do not currently have a good explanation.

Inference Modeling in CNNs
Loss and Dictionary Sizes. The discriminative quality of a visual dictionary can be quantified by using it to form word histograms and then checking the error of a linear classifier on this representation (a bag-of-words methodology). Fig. 7 (left) shows the errors obtained for dictionaries trained with the losses $L_G$ (16) and $L_D$ (17) on intermediate convolutional layers of a ResNet20 network on CIFAR10. The graphs show error rates as a function of the layer index and dictionary size. It can be seen that errors decrease with the layer index (from 1 to 9), reaching the original network error of 0.088. As expected, the discriminative loss $L_D$, whose minimization directly decreases the error, leads to a higher accuracy than the generatively optimized loss. Therefore, all models presented below were trained with the $L_D$ loss. In addition, the error rate decreases monotonically with the dictionary size for all layers, but the gain from dictionary sizes larger than 500 is minor in most cases.
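As a reference for how such an evaluation can be set up, the sketch below pools per-location word assignments into per-image histograms and scores a linear classifier on top of them (scikit-learn; the data arrays and names are illustrative, not the exact setup used in the experiments).

```python
# A minimal sketch of the bag-of-words evaluation: visual-word histograms
# per image, followed by a linear classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def word_histograms(word_maps, K):
    """word_maps: (N, H, W) integer word assignments; returns (N, K) normalized histograms."""
    hists = np.stack([np.bincount(w.ravel(), minlength=K) for w in word_maps])
    return hists / hists.sum(axis=1, keepdims=True)

def dictionary_error(train_maps, y_train, test_maps, y_test, K):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(word_histograms(train_maps, K), y_train)
    return 1.0 - clf.score(word_histograms(test_maps, K), y_test)
```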
To validate the SIGN clustering scheme, the accuracy of the suggested representation is compared to a baseline clustering method in which clusters correspond to channels of the output tensor. In the baseline method, each column activation $x^l_p$ is assigned to a cluster based on its maximal activity channel. The cluster histogram is then fed to a linear classifier using the same procedure as mentioned above. Error rates of SIGN representations are compared to those of the max-channel clustering representations across the modeled layers in Fig. 7 (right). For the SIGN method, we present the error rate based on 500 visual words in all modeled layers, for both the discriminative and generative losses. For the max-channel clustering method, we used the discriminative loss (i.e., cross-entropy based on the clusters' bag-of-words). The discriminative GMM loss produced the lowest error rate across all layers, with a significant margin over the plain channel-wise clustering. The relative improvement over the baseline technique ranges from 20% in lower layers up to 45% in the highest layer, add_9.
Cluster Development Across Layers. Fig. 5 (bottom) shows the cluster distance matrix for several convolutional layers of a ResNet20 network trained on CIFAR10. Unlike in MLPs, the learned clusters represent activity at specific spatial locations. Even though the receptive fields of clusters from advanced layers cover the entire input space, there is no apparent similarity between clusters with the same dominant class. This indicates that activity columns of advanced layers remain local, an observation also supported by the inference graph visualizations shown below. While clusters do become more class-specific, they are less specific than in the MLP network, and the final classification is based on several class-specific words appearing simultaneously in different image regions.
Class Inference Graph. An example of a class inference graph for the class "pineapple" in VGG-16 is presented in Fig. 8. This graph is generated by training a clustering model on the "pineapple" and related classes (see Section 3.2), then applying the node selection algorithm (Section 3.3) to the set V of pineapple images included in the validation set. The graph shows that the most influential words in the top convolutional layer can be roughly characterized as "grassy-head," "pineapple-body," and "rough-round-edge". The origins of these words can be traced back to lower layers. For example, "grassy-head" is composed of words capturing mostly "lengthy-vegetation" in the layer below. The "pineapple-body" is composed of words containing "vegetation" types and "cross" textures (block4_conv1), which are in turn generated from words describing mostly green and yellow textures and shapes (block3_conv1).

Fig. 6. Cluster development for an extreme overfit case. Cluster distances and average purity for a network trained with random labels are shown. Top: cluster similarity matrices for layers 1-5. Bottom: typical clusters at these layers. For each cluster, the (pseudo-)label histogram is shown, as well as some representative cluster images. An abrupt transition from an input-dominated to an output-dominated representation occurs in the transformation between the third and fourth layers.

Fig. 7. Classification accuracy analysis as a function of dictionary size and layer depth. Error rates of linear classifiers based on cluster/visual-word histograms are shown. All experiments were conducted on five ResNet20 conv-layers trained on CIFAR10. Left: error rates based on the SIGN generative and discriminative loss functions (Eqs. (16) and (17)), as a function of dictionary size. Right: error rates of the SIGN clustering method and plain clustering based on the maximal active channel (see text for explanation).
Image Inference Graphs. Fig. 9 shows an image inference graph for a pineapple image wrongly classified to the "swing" class. Using the inference graph, we can analyze the dominant (representative) visual words that led to this erroneous classification:

block5_conv1 (top layer): The visual words connected directly to the "swing" class, hence causing the error, can be characterized as "grass/foliage-texture," "sand," and "vertical-rope". Such words are indeed statistically related to the presence of a swing in images, and many map locations in the inspected image are assigned to them.

Layers block4_conv1 and block3_conv1: The "vertical-rope" word (block5_conv1) originates from a similar visual word, "vertical-stripe," in layer block4_conv1, which in turn depends strongly on the "isolated-vertical-line" word in layer block3_conv1. The foliage word (block5_conv1) mainly originates from the "grassy-ground" word in layer block4_conv1, which in turn heavily depends on the two "ground-structure" and "grass-structure" words in layer block3_conv1.

Layer block2_conv1: The main explanatory words are green and bright vertical edges and lines, which are combined to construct the "isolated-vertical-line" and "grass-structure" words in layer block3_conv1.

In Fig. 10, we show part of the inference graph for a successfully classified zebra image, focusing on the bottom three layers. The gradual development of discriminative stripe-based features can be seen. Visual words in block3_conv1 (the top layer shown) are each characterized by a single orientation: vertical (left), leaning to the right (middle), or leaning to the left (right). These words are less sensitive to spatial frequency: they abstract over spatial frequency by combining words from layer block2_conv1 that mostly differ with respect to their line spatial frequency and edge patterns. In layer block1_conv1, we encounter the edge-feature patterns composed to create the words of layer block2_conv1.
In Fig. 11, we show partial inference graphs for ResNet50. On the left, upper nodes from the inference graph of the "pineapple" image from Fig. 9 are shown. Unlike VGG-16, ResNet50 successfully classifies this image. As can be seen in layer add_16, ResNet50 successfully detects the pineapple location in the image (marked with a green circle), where both visual words presented contain strong "pineapple" features. On the right, we show the upper nodes of the ResNet50 inference graph for the "zebra" image from Fig. 10, also successfully classified by ResNet50. The top word shown usually captures the zebra's head, with its receptive-field center located on the neck. This "zebra-head-neck" visual word is formed from the bottom-left word, representing a near-head stripe texture, and the bottom-right word, which captures a diamond shape located between the zebra's eyes. Both words are discriminative, as indicated by their log-odds scores (higher than 2).

Biased Data Experiment
A user experiment with human subjects was carried out in order to test whether SIGN can effectively enable the detection of artifacts in the data. Specifically, we tested whether SIGN enables a user to detect class-specific artifacts using class inference graphs. That is, we injected class-specific biases into the data, such that all images of a certain class were marked with a unique recurring artifact. For the experiment, we trained five ResNet20 models. One was trained to classify CIFAR10 (baseline), and four others were trained on CIFAR10 versions in which images were corrupted with watermarks. In each experiment with corrupted data, a single purple English letter was placed at a random position in the images of 2 of the 10 classes. In two of the experiments, images contained clearly visible opaque watermarks, and in the remaining two experiments, images contained hard-to-detect transparent ($\alpha = 0.5$) watermarks. Class inference graphs were formed for a couple of classes in each experiment, and users were asked to determine, using the graph, whether the data of the corresponding experiment was clean or corrupted. Since SIGN finds features associated with a certain class, we expected the distinctive features to be found by the graph.

Fig. 8. Pineapple inference graph. The graph is generated by training a model on the "pineapple" class and its neighboring classes. The top node is a visual word of the output layer, representing the predicted class "pineapple". The lower levels in the graph show the three most influential words in the preceding modeled layers (block5_conv1, ..., block1_conv1). Visual words are manifested by the six representative examples for which $P(h^l = k \mid x^l_p)$ is the highest. For modeled layer block5_conv1, examples are presented by showing the example image with a rectangle highlighting the receptive field of the word's location. For lower layers, the receptive-field patches themselves are shown. Images are annotated by their true label. Arrows are shown for the two most significant connections of each lower visual word. When the log-ratio term (the right element in Eq. (25)) is positive, it is colored (1) black: 0 < log-ratio < 1, (2) light green: 1 ≤ log-ratio < 2, (3) mild green: 2 ≤ log-ratio < 3, or (4) dark green: 3 ≤ log-ratio. In addition, a tag above each visual word was added by the authors for convenience. The figure is best inspected by zooming in on clusters of interest.
A total of 60 subjects of different ages and backgrounds volunteered to participate in the experiment. Each participant was presented with 16 inference graphs. Among the 16 graphs, 4 were produced from clean data, 6 from a model trained with opaque watermarks, and 6 from a model trained with transparent watermarks. Among the 12 graphs from watermark-corrupted datasets, we presented 6 graphs of the corrupted classes and 6 graphs of the non-corrupted classes.
Experiment results are shown in Fig. 12, which reports user accuracy for each bias scenario. Participants classified clean data correctly with 90% accuracy. For graphs with opaque watermarks, participants detected the corruption with 93% accuracy when presented with class inference graphs of the corrupted class, and with 70% accuracy when shown inference graphs of the non-corrupted classes. This difference intensifies with transparent watermarks, where the watermarks are less visible: 90% of the participants correctly classified the data as corrupted when shown the corrupted-class inference graphs, and 53% when shown class inference graphs of non-corrupted classes. When producing class inference graphs of corrupted classes (Fig. 13), the SIGN model finds clear watermark-related visual words in all modeled layers as a strong feature of the class. It can be observed that the red center point of the receptive-field patch is always located in proximity to the watermark. In the upper layers add_6 and add_8, it can be seen that without the red center point marking the watermark location, it is difficult to detect this type of corruption. When producing graphs of non-corrupted classes, watermark-related visual words do not appear in lower layers, and users find it harder to detect the watermarks. For the suggested experiment, where images of an entire class are contaminated by a unique watermark, we can expect attribution methods to detect the watermark. Indeed, as shown in Fig. 14, Grad-CAM [11] deployed on the last convolutional layer of ResNet20 highlights the watermarks in the stained classes. However, Grad-CAM visualization only analyzes the impact of the final convolutional layer on the decision. The SIGN method provides a deeper analysis by enabling an understanding of the full inference process through the hidden layers, based on statistical analysis of the full training population. Such analysis enables understanding the semantics of error sources (Fig. 9), feature development (Fig. 10), and even identifying the layers responsible for overfitting (Fig. 6).

Fig. 9. An image inference graph of an erroneous image: an inference graph for a pineapple image wrongly classified to the class "swing". The model is trained using "pineapple" and its neighboring classes (same as in Fig. 8), where the neighbor class "swing" is included. The graph is generated by applying the node selection algorithm (Section 3.3) to a set V containing this single erroneous image. The analyzed image is shown at the top of each cluster node, with red dots marking the spatial locations assigned to the cluster. In the top node, the pineapple object is marked with a red circle for clarification.

Fig. 11. Sub-graph inference for ResNet50. Left: the pineapple image wrongly classified to class "swing" by VGG-16, presented in Fig. 9, is correctly classified to the "pineapple" class by ResNet50 (the pineapple object is marked with a green circle). The sub-graph presents the visual word from the top layer (add_16) connected to a visual word from the lower layer (add_13). Right: the zebra image from Fig. 10, correctly classified by VGG-16, is correctly classified by ResNet50 as well. The sub-graph includes a visual word from the top layer (add_16) aggregated from two visual words from the lower layer (add_13).

Fig. 12. Biased data experiment results. Users were asked to determine whether the inference graphs shown to them contain corrupted data. The graphs shown were produced using ResNet20 trained either with clean data (blue), or with opaque/transparent watermarks induced in two classes of CIFAR10. For the datasets with watermarks, users were shown inference graphs of corrupted classes (green) and non-corrupted classes (orange). Users easily detected corruptions in class inference graphs of classes stained with watermarks. When shown inference graphs of non-corrupted classes from the same corrupted dataset, user accuracy is lower, especially with transparent watermarks.

Fig. 13. Corrupted data debugging with SIGN. Class inference graph of the "truck" class from ResNet20, trained on biased CIFAR10. The data was corrupted by inducing 2 classes with transparent watermarks ($\alpha = 0.5$) at random places. The above graph shows the "truck" class inference graph in an experiment where the classes "truck" and "cat" were corrupted with purple letters "T" and "A", respectively. It is recommended to zoom in for better inspection.

CONCLUSION
In this paper, we introduced SIGN, a new approach for interpreting the hidden-layer activity of deep neural networks, based on learning activity-cluster dictionaries and transition probabilities between clusters of consecutive modeled layers. We formalized a maximum-likelihood criterion for mining explanatory clusters, and an algorithm for constructing inference graphs of manageable size. Inference graphs can be constructed for entire classes, to understand the general network reasoning for a class, or for specific images for which error analysis is sought.
The tools developed here can be used to verify the soundness of the network's reasoning and to better understand the network's hidden mechanisms, or conversely, to reveal weaknesses and main error causes. Network debugging is currently a difficult and daunting task, and we believe the suggested tools may be a useful component in a developer's debugging toolbox. Beyond its utility in network interpretation and debugging, the suggested approach and tools revealed several surprising network behavior patterns, such as the extreme locality of activity columns in top CNN layers, and the concentration of memorization in a single middle layer in fully connected networks.
Several interesting avenues are open for future work. One such avenue may be re-training more explainable networks by enforcing, during training, only the activity of clusters and connections with high explanatory value. Another direction is to use the models in a task-transfer scenario: network refinement for a new task may be constrained to use only relevant activation clusters, as determined by a human observer. In this way, a human may use the suggested tool to define a relevant prior for the new task. Finally, we may try to design explanatory capabilities that go beyond statistical analysis and maximum-likelihood justification, toward causal analysis based on intervention.