From Clustering to Cluster Explanations via Neural Networks

A recent trend in machine learning has been to enrich learned models with the ability to explain their own predictions. The emerging field of explainable AI (XAI) has so far mainly focused on supervised learning, in particular, deep neural network classifiers. In many practical problems, however, the label information is not given and the goal is instead to discover the underlying structure of the data, for example, its clusters. While powerful methods exist for extracting the cluster structure in data, they typically do not answer the question why a certain data point has been assigned to a given cluster. We propose a new framework that can, for the first time, explain cluster assignments in terms of input features in an efficient and reliable manner. It is based on the novel insight that clustering models can be rewritten as neural networks—or “neuralized.” Cluster predictions of the obtained networks can then be quickly and accurately attributed to the input features. Several showcases demonstrate the ability of our method to assess the quality of learned clusters and to extract novel insights from the analyzed data and representations.


I. INTRODUCTION
Clustering is an important class of unsupervised learning models that aims to reflect the intrinsic heterogeneities of common data generation processes [1], [2], [3], [4]. Natural cluster structures are observed in a variety of contexts, from e.g. gene expression [5] and ecosystems composition [6] to textual data [7]. Methods that can accurately identify the cluster structure have thus been the object of sustained research over the past decades [8]. Basic techniques such as k-means [9] have been extended to operate in kernel feature spaces [10], [11], or on the representations built by a deep neural network [12], [13], [14], [15].
Due to the ever growing complexity of ML models and their use in increasingly sensitive applications, it has become crucial to endow these models with the capability to explain their own predictions in a way that is interpretable for a human. Explainable AI (XAI) has emerged as an important direction for machine learning, and excellent results have been reported in selected tasks such as explaining the predictions of popular DNN classifiers [16], [17], [18], [19], [20].
In this paper, we bring these newly developed explanation capabilities to clustering, a much-needed functionality, considering that one of the main motivations for performing a clustering in the first place is knowledge discovery. Especially in high-dimensional feature spaces, a clustering for knowledge discovery can only provide a few prototypical data points for each cluster. Such prototypes, however, do not reveal which features made them prototypical. Instead, we would like to let the clustering model explain itself in terms of the very features that have contributed to the cluster assignments. To the best of our knowledge, our work is the first attempt to systematically and comprehensively obtain such explanations. Specifically, we are able to supply explanations of why each individual point is clustered in the way it is.
The method we propose puts forward the novel insight that a broad range of clustering models can be rewritten, without retraining, as functionally equivalent neural networks, which then serve as a backbone to guide the explanation process. Technically, we suggest to apply the following two steps: (1) the cluster model is 'neuralized' by rewriting it as a functionally equivalent neural network with standard detection/pooling layers; (2) cluster assignments formed at the output of the neural network are then propagated backwards using an LRP-type procedure (cf. [17], [21], [22]) until the input variables (e.g. pixels or words) are reached.
The proposed 'neuralization-propagation' procedure (NEON for short) is tested on a number of datasets and clustering models, including recent deep clustering models such as SCAN [23]. Each time, NEON accurately explains cluster assignments and extracts useful insights. Experiments also demonstrate the practical value of our two-step approach compared to a potentially simpler one-step approach without neuralization. Our contributions can be summarized as follows:
• Introduction of XAI to clustering, specifically, explanation of the assignment of individual data points onto clusters, in terms of input features.
• Formulation of the clustering decisions of a broad range of clustering models as functionally equivalent neural networks, thus enabling the application of state-of-the-art XAI techniques to these models.
• Theoretical embedding of our neuralization-propagation approach to explaining clustering, specifically an interpretation of our approach, for special cases, in terms of Shapley values.
• Demonstration of the benefit of bringing XAI to clustering, showcased on two real-world examples, and extensive quantitative validation of our proposed explanation method.

Fig. 1. Overview of our contributions. B1: We enrich the cluster assignment with an explanation highlighting what input features mostly contribute to the cluster decision. B2: We achieve this technically by observing that the clustering decision can be rewritten as a neural network (neuralization), enabling fast and robust explanations via the LRP technique (propagation).

Fig. 1 shows a cartoon of our contributions in order to provide the general underlying intuition to the reader. We stress that our method applies to many popular clustering algorithms and is a generic blueprint: it relies neither on predesigned interpretability structures or algorithms, nor on any retraining. This will prove useful in the future for shedding new light on existing cluster-based typologies used e.g. in computational biology [24], [25] or consumer data [26], which researchers and practitioners have started to use increasingly to support their scientific reasoning and to take insightful decisions.

A. Related Work
So far, research on explanation methods has been overwhelmingly focused on the case of supervised learning. Methods based on the gradient [27], [28], [29], local perturbations [16], [30], or surrogate functions [18] do not make specific assumptions about the structure of the model and are thus applicable to a wide range of classifiers. Other methods require the classifier to have a neural network structure and apply a purposely designed backward propagation pass [17], [21], [22], [31], [32], [33] to produce accurate explanations at low computational cost. While recent work has extended the principle to other types of models such as one-class SVMs [34], LSTM networks [35], or graph neural networks [36], the method we propose here contributes by offering a solution to the so far unsolved problem of explaining cluster assignments.
Note that the few existing cluster interpretability techniques are based on surrogate decision trees [37], [38], [39], [40], [41], where the decision tree is trained to approximate the k-means clustering as closely as possible, and where the cluster assignment is then interpreted using Explainable AI techniques specific to decision trees. With such a surrogate approach, the user typically has to trade off faithfulness to the original model against explainability.
Related to the connections we establish in this paper between clustering models and neural networks, some works explore ways of merging the two in order to produce better, more flexible ML models. For example, deep clustering approaches typically build a clustering objective on top of deep representations [12], [14], [15], [42], [43]. Other models, in particular the k-meansNet [44], design the neural network in a way that simulates a clustering model, so that the learned neural network solution can be interpreted as a clustering solution. Note that in all these works, the purpose is to enhance a basic clustering model with the flexibility of neural network representation and training, whereas our work focuses on making existing popular clustering algorithms explainable.
Another set of related works focuses on the problem of learning a good clustering model by identifying a subset of relevant features that support the cluster structure. Some methods identify relevant features by running the same clustering algorithm multiple times on different feature subsets [45]. Other approaches simultaneously solve feature selection and clustering by defining a joint objective function to be minimized [46]. While feature selection can identify the set of features required to represent the overall cluster structure, our work goes further by identifying, among those features, which ones are truly responsible for a given cluster or a given cluster assignment.
Further related works focus on quantitatively validating clustering solutions. Examples of validation metrics are compactness / separation of clusters [47], cluster stability under resampling / perturbations [48], [49], or purity, i.e. the absence of examples with different labels in the same cluster [50]. Our work enhances the validation of clustering models by producing human-interpretable feedback, a critical step for identifying whether cluster assignments are supported by meaningful features or by what the user would consider to be artifacts.
Lastly, user interfaces have been developed to better navigate cluster structures as they occur, e.g., in biology applications [51], [52]. Also, the use of prototypes has been proposed to visualize deep image clustering models [15] or explain kernel methods for property prediction of chemical compounds [53]. While these works produce useful and informative visualizations which may help to guide the process of clustering, our approach contributes by answering the precise question of why a given data point is assigned to a particular cluster.

II. EXPLAINING K-MEANS CLUSTER ASSIGNMENTS
The k-means algorithm [9] is one of the best known approaches to clustering and is used in many scientific and industrial applications (e.g. [54], [55], [56]). This section presents our neuralization-propagation approach for explaining a k-means cluster assignment in terms of input features. Due to the simplicity of the k-means model, this section also has a tutorial purpose. More complex and powerful clustering models based on kernels [11], deep neural networks [12], [14], or more general clustering techniques, are discussed in Sections III-V.
The k-means algorithm finds a set of centroids that minimizes the total squared distance between each data point and its nearest centroid. The k-means model assigns points to clusters based on their distance to each centroid $\mu_k \in \mathbb{R}^d$; specifically, the model assigns a point $x \in \mathbb{R}^d$ to cluster c if

$\forall k \neq c: \quad \|x - \mu_c\| < \|x - \mu_k\|. \qquad (1)$

In principle, it is conceivable to use Explainable AI techniques such as prediction difference analysis [16], [57] or LIME [18], as they apply out-of-the-box to any model or decision function. However, these approaches require evaluating the function multiple times to test for the effect of each input dimension. This can become slow when the data is high-dimensional, e.g. when clustering images or gene expression data [25]. Also, local perturbations may not faithfully depict the overall contribution of a feature to the clustering decision, especially if multiple features need to be perturbed in order to affect the decision.
In the context of supervised learning, more efficient Explainable AI techniques have been proposed, which rely on a model that induces the decision function, and from which meaningful gradient information and intermediate representations can be extracted. Such methods include, among others, integrated gradients [29], or Layer-wise Relevance Propagation (LRP) [17], [22], [58]. The LRP method in particular leverages the neural network structure of the prediction to produce a robust explanation at the cost of a single forward/backward pass. The LRP method has been used in a wide range of applications (e.g. [22], [36], [59], [60], [61], [62], [63], [64]), and can be embedded in the framework of deep Taylor decomposition [31].

A. Neuralization of the Cluster Assignment
In order to bring these efficient XAI techniques to clustering, we propose to enrich the clustering decision function $g_c(x)$ with a neural network model. The latter is designed to exactly replicate the cluster assignments of the original clustering model and is more amenable to explainability. Furthermore, we also require that such a neural network model is obtained readily from the cluster solution (i.e. the centroids) without incurring any additional training step. We call the process of obtaining such a neural network "neuralization."

Proposition 1. The decision function of Eq. (1) can be reproduced by a two-layer neural network composed of a standard linear layer and a (min-)pooling layer:

$h_k = w_k^\top x + b_k \qquad \text{(layer 1)}$
$f_c(x) = \min_{k \neq c} \{ h_k \} \qquad \text{(layer 2)}$

where $w_k = 2(\mu_c - \mu_k)$ and $b_k = \|\mu_k\|^2 - \|\mu_c\|^2$, and assigning to cluster c if $f_c(x) > 0$.
(cf. Appendix A of the Supplement for a derivation.) The first layer corresponds to a collection of linear functions aligned with the different cluster centroids. The min-pooling selects which linear function is active at a given location. These two layers together build a piecewise linear function. A simple two-dimensional example with three clusters is shown in Fig. 2. We observe that the neural network output $f_c(x)$ (right) exactly reproduces the true cluster decision boundary, specifically, the Voronoi partition associated to the given k-means model (left). The neural network above can also be interpreted in neuroscientific terms as the alternation of 'simple cells' and 'complex cells' [65], or 'executive organs' and 'restoring organs' in automata theory [66]. We also note that earlier works have already linearized elements of the cluster model, such as the square distance, for the purpose of training [44]. Here, our contribution differs by extracting a piecewise linear view of the whole model, and additionally, identifying a neural network structure for this piecewise linear form. We provide similar neuralization results for the soft k-means case, as well as a probabilistic interpretation, in Appendix E of the Supplement. We will also study more complex neuralization scenarios in Sections III and IV when considering kernel-based clustering and deep clustering.
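As a minimal illustration of Proposition 1, the following NumPy sketch (with our own naming, assuming centroids from any fitted k-means model) builds the two-layer network and checks that it reproduces the k-means assignment:

```python
import numpy as np

def neuralize_kmeans(mu, c):
    """Build the two-layer network (linear + min-pooling) for cluster c.
    mu: (K, d) array of centroids; c: index of the cluster of interest."""
    others = [k for k in range(len(mu)) if k != c]
    W = 2.0 * (mu[c] - mu[others])                      # (K-1, d) detection weights w_k
    b = (mu[others] ** 2).sum(1) - (mu[c] ** 2).sum()   # (K-1,) biases b_k
    def f_c(x):
        h = W @ x + b                                   # layer 1: linear detections
        return h.min()                                  # layer 2: min-pooling
    return f_c

# sanity check: f_c(x) >= 0 iff x is assigned to cluster c by k-means
rng = np.random.default_rng(0)
mu = rng.normal(size=(3, 2)); x = rng.normal(size=2)
assigned = int(np.argmin(((x - mu) ** 2).sum(1)))
assert neuralize_kmeans(mu, assigned)(x) >= 0
```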

B. Propagation of the Cluster Assignment
So far, we have rewritten the k-means decision function for each cluster as a neural network. This initial step gives access to a broader range of explanation techniques such as integrated gradients [29], or layer-wise relevance propagation (LRP) [17], [67]. The LRP technique, in particular, leverages the neural network structure to produce a robust explanation in a single forward/backward pass. Unlike the standard gradient propagation pass, which provides a highly localized view of the function, LRP applies propagation rules that redistribute the quantity to explain from layer to layer. These rules are purposely designed for the task of explanation. LRP ensures certain desirable properties of an explanation, such as conservation of predicted evidence and local continuity of explanations [17], [67].
Let us start with the output of the neural network $f_c$, which we wish to attribute as a first step to the neurons $(h_k)_k$ in the intermediate layer, by propagating through the min function. Similar to [34], we follow a min-take-most (MTM) strategy, where the smallest inputs to that function receive the largest share of the quantity to redistribute. In particular, we apply the propagation rule:

$R_k = \frac{\exp(-\beta h_k)}{\sum_{k' \neq c} \exp(-\beta h_{k'})} \, f_c(x) \qquad (2)$

where $R_k$ is the 'relevance' of neuron $h_k$ to the cluster assignment $f_c$, and where $\beta \in \mathbb{R}^+$ is a stiffness hyperparameter. The stiffness parameter interpolates between a uniform redistribution strategy (β = 0) and a min-take-all strategy (β → ∞). Note that compared to these two extreme cases, our approach allows to contextualize the explanation (i.e., not redistributing onto cluster competitors that are too far away and therefore irrelevant), and at the same time, ensures continuity of the explanation as we transition from one nearest cluster competitor to another. We propose to set this parameter according to the simple heuristic:

$\beta = \mathbb{E}[f_c]^{-1} \qquad (3)$

where the expectation is computed over the whole dataset. In other words, considering $f_c$ to be a 'typical' score in the pool, we want the stiffness parameter to be inversely proportional to it.
We now consider how to further redistribute the intermediate relevance scores $R_k$ to the input layer, where the dimensions correspond to observed quantities that are assumed to be interpretable by the user. To achieve this, we propose the LRP propagation rule:

$R_i = \sum_{k \neq c} \frac{[w_k]_i \, [x - m_k]_i}{\sum_{i'} [w_k]_{i'} \, [x - m_k]_{i'}} \, R_k \qquad (4)$

where $m_k = (\mu_c + \mu_k)/2$ is the mid-point between the centroids of the cluster of interest and the competitor. In other words, we attribute onto dimensions where the input activation relative to the mid-point, $x - m_k$, matches the model response $w_k$.
It can be noted that the proposed propagation rules ensure a certain number of desirable properties of an explanation: in particular, they satisfy the conservation property $\sum_i R_i = f_c(x)$, they preserve the continuity of $f_c(x)$, and they are invariant to any translation of the clustering in input space.
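A compact sketch of the full NEON procedure for k-means, combining Eqs. (2) and (4), is given below (our own naming; the generic case $h_k \neq 0$ is assumed):

```python
import numpy as np

def explain_kmeans(mu, c, x, beta):
    """NEON for k-means: min-take-most rule (Eq. 2) followed by the
    first-layer rule (Eq. 4). Returns one relevance score per input feature."""
    others = [k for k in range(len(mu)) if k != c]
    W = 2.0 * (mu[c] - mu[others])                  # detection weights w_k
    M = (mu[c] + mu[others]) / 2.0                  # mid-points m_k
    contrib = W * (x - M)                           # (K-1, d); rows sum to h_k
    h = contrib.sum(1)                              # h_k = w_k^T (x - m_k)
    f_c = h.min()
    # Eq. (2): redistribute f_c onto competitors, favoring the smallest h_k
    p = np.exp(-beta * (h - h.min()))               # shifted for numerical stability
    R_k = f_c * p / p.sum()
    # Eq. (4): redistribute each R_k onto input dimensions
    R_i = (contrib / h[:, None] * R_k[:, None]).sum(0)
    return R_i                                      # conservation: R_i.sum() == f_c
```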

C. Theoretical Embedding
We provide further theoretical support for the rules in Eqs. (2) and (4) by showing that their application produces, for special cases, explanations that coincide with the Shapley value. The Shapley value [30], [68], [69], originally proposed in the context of game theory, is an axiomatic solution to the problem of attributing the value of a coalition of players to the individual players in the coalition. For our comparison, we interpret the set of players as the individual input features (or activations), and the withdrawal of a player from the coalition as replacing the corresponding feature value $x_i$ by some reference value $\tilde{x}_i$.
Proposition 2. The redistribution performed by Eq. (2) with parameter β = 0 corresponds to the Shapley value of the function $f_c(h)$ with reference point $\tilde{h} = 0$.
(The proof is given in Appendix B of the Supplement.) The parameter β = 0 corresponds to a uniform redistribution of $f_c$ onto the cluster competitors. The corresponding reference point $\tilde{h} = 0$ can be interpreted as the image of a point x in input space that is equidistant from all cluster centroids. (Note that this point may not exist in low-dimensional spaces.)

Proposition 3. In the two-cluster case, the explanation produced by the composition of Eqs. (2) and (4) corresponds to the Shapley value of the function $f_c(x)$ with reference point $\tilde{x} = m_k$.

(See Appendix B of the Supplement for a proof.) In other words, the explanation coincides with Shapley values with the reference point $\tilde{x}$ chosen at the mid-point between the cluster centroids $\mu_k$ and $\mu_c$. Such a reference point is a natural choice for explaining why a point is a member of cluster c and not of cluster k.

III. EXTENSION TO KERNEL K-MEANS
The standard k-means clustering algorithm has strong limitations in terms of representation power, as it only allows to represent clusters that are pairwise linearly separable. The kernel k-means model [11] is a straightforward extension of k-means where the data is first mapped to a feature space via some map $x \mapsto \Phi(x)$ induced by some kernel function $K(x, u)$. The decision function implemented by kernel k-means assigns a point x to cluster c if

$\forall k \neq c: \quad \|\Phi(x) - \mu_c\|^2 < \|\Phi(x) - \mu_k\|^2 \qquad (5)$

where the centroids are also defined in feature space.
If we were to apply the same explanation framework as in Section II, we would obtain an explanation in terms of dimensions of the feature space, and we would then need to further backpropagate through the feature map Φ. While this is technically possible (e.g. for a Gaussian kernel $K(x, u) = \exp(-\gamma \|x - u\|^2)$, one can use random approximations of the feature map), we consider instead a more intuitive formulation, specific to the Gaussian kernel case, where the distance to a particular cluster is modeled by a soft minimum over the distances to the cluster members. Specifically, we consider in place of Eq. (5) the decision function that assigns x to cluster c if

$\forall k \neq c: \quad \mathrm{LME}^{-\gamma}_{i \in C_c} \{ \|x - u_i\|^2 \} < \mathrm{LME}^{-\gamma}_{j \in C_k} \{ \|x - u_j\|^2 \} \qquad (6)$

where $(u_i)_i$ and $(u_j)_j$ are sets of data points (or support vectors) representing the two clusters, $C_c, C_k \subset \mathbb{N}$ are the non-overlapping sets of indices of support vectors that represent these clusters, and where $\mathrm{LME}^{-\gamma}$ denotes a generalized F-mean with $F(t) = e^{-\gamma t}$, i.e.

$\mathrm{LME}^{-\gamma}_{i} \{ t_i \} = -\frac{1}{\gamma} \log \Big( \frac{1}{n} \sum_{i=1}^{n} e^{-\gamma t_i} \Big).$
The latter can be interpreted as a soft min-pooling, and it converges to a hard min-pooling when γ → ∞. The two distance measures on which the decision functions of Eqs. (5) and (6) are based are illustrated in Fig. 3 for a toy one-dimensional cluster c composed of 6 data points.
Fig. 3. Distance between some data point x and a cluster c depicted as a collection of black dots. The distance is either computed in feature space, or using the soft min-pooling of Eq. (6).
While the two functions clearly differ, one can also observe that they build comparable level sets. In fact, we show in Proposition 4 that these two measures of distance are essentially the same up to some monotonic nonlinear transformation, thereby leading to the same decision function.
Proposition 4. Let $\mu_c = \frac{1}{Z_c} \sum_{i \in C_c} \Phi(u_i)$, where Φ is some feature map associated to the Gaussian kernel $K(x, u) = \exp(-\gamma \|x - u\|^2)$ and $Z_c$ is a normalization factor. The two distance functions appearing in Eqs. (5) and (6) can then be related through a monotonically increasing function $g_c$, defined in terms of $\mathrm{Li}_1$, the polylogarithm of order 1, up to two additive correction terms $\Delta_c$ and $H_c$.

A proof is given in Appendix C of the Supplement. Formally, equivalence between the two decision functions (Eqs. (5) and (6)) is ensured when the function $g_c$ does not depend on the choice of cluster c. When choosing the normalization factor $Z_c = |C_c|$ (standard kernel k-means), the term $H_c$ vanishes but the term $\Delta_c$ remains, and the converse happens when setting $\|\mu_c\| = 1$, i.e. $Z_c = \|\sum_{i \in C_c} \Phi(u_i)\|$ (spherical kernel k-means). In practice, both terms remain near zero if each cluster is equally heterogeneous and consequently has the same norm in feature space. In that case, the two decision boundaries become equivalent. An advantage of the latter decision function is that it can be exactly reproduced by a neural network.
Proposition 5. The decision function in Eq. (6) can be reproduced by a four-layer neural network composed of a linear layer followed by three pooling layers:

$h_{ijk} = w_{ij}^\top x + b_{ij} \qquad \text{(layer 1)}$
$h_{jk} = \mathrm{LME}^{\gamma}_{i \in C_c} \{ h_{ijk} \} \qquad \text{(layer 2)}$
$h_k = \mathrm{LME}^{-\gamma}_{j \in C_k} \{ h_{jk} \} \qquad \text{(layer 3)}$
$f_c(x) = \min_{k \neq c} \{ h_k \} \qquad \text{(layer 4)}$

where $w_{ij} = 2(u_i - u_j)$ and $b_{ij} = \|u_j\|^2 - \|u_i\|^2$, where $\mathrm{LME}^{\gamma}$ and $\mathrm{LME}^{-\gamma}$ can be interpreted as soft max-pooling and soft min-pooling respectively, and assigning to cluster c if $f_c(x) > 0$.
The proof is given in Appendix D of the Supplement. An example showing the equivalence between the neural network output and Eq. (6) is given in Fig. 4. The neural network we have proposed can now be used to support the process of explanation. Because the network is again composed of linear and pooling layers, the propagation rules proposed for the k-means case remain applicable. In particular, redistribution in the pooling layers can be achieved using Eq. (2) (switching the sign for the soft max-pooling case). The directional redistribution in the first layer can be achieved using Eq. (4). However, we must handle the case where some relevance lands on a deactivated (or weakly activated) neuron $h_{ijk}$, as the latter does not provide directionality in input space. Such a special case can be handled by only propagating part of the relevance (and dissipating the rest), specifically, by performing a reassignment that lets the relevance continuously converge to zero as the neuron $h_{ijk}$ becomes deactivated.
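To make Proposition 5 concrete, the following NumPy sketch implements the forward pass of the neuralized kernel k-means; the function and variable names are ours, and a complete implementation would also expose the first-layer quantities needed for the backward (LRP) pass:

```python
import numpy as np

def lme(t, gamma, axis):
    """Generalized F-mean with F(t)=exp(gamma*t): soft max-pooling for
    gamma > 0, soft min-pooling for gamma < 0."""
    n = t.shape[axis]
    return (1.0 / gamma) * (np.logaddexp.reduce(gamma * t, axis=axis) - np.log(n))

def neuralized_kernel_kmeans(U, members, c, gamma):
    """Four-layer network of Proposition 5. U: (n, d) support vectors;
    members: list of index arrays, one per cluster; c: cluster of interest."""
    def f_c(x):
        d2 = ((x - U) ** 2).sum(1)          # squared distances to all support vectors
        h_k = []
        for k, C_k in enumerate(members):
            if k == c:
                continue
            # layer 1: h_ijk = ||x-u_j||^2 - ||x-u_i||^2 (a linear function of x)
            h_ijk = d2[C_k][None, :] - d2[members[c]][:, None]
            h_jk = lme(h_ijk, +gamma, axis=0)        # layer 2: soft max over i in C_c
            h_k.append(lme(h_jk, -gamma, axis=0))    # layer 3: soft min over j in C_k
        return min(h_k)                              # layer 4: hard min over competitors
    return f_c
```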
In terms of computational cost, we note that the number of neurons in our neuralized kernel k-means model grows quadratically with the number of support vectors per cluster, whereas the complexity of a simple evaluation of the decision function is linear in the number of support vectors. For NEON to remain computationally favorable, the number of support vectors must be kept small, typically in the order of 10 support vectors per cluster. Practical approaches to produce a limited number of support vectors include e.g. reduced sets [70], [71], [72], vector quantization [73], or representing each cluster as a mixture model with finitely many mixture elements (we use this approach in Section VI-A). Alternatively, when for modeling purposes it is necessary to maintain a large number of support vectors per cluster, one can adopt a pruning strategy, where we only evaluate in the forward and backward pass the most relevant part of the network, i.e. the neurons to which the min- and max-pooling functions in the network are effectively sensitive.

IV. EXTENSION TO DEEP CLUSTERING
Unlike kernel k-means, deep k-means makes use of a feature map given explicitly as a sequence of layer-wise mappings $\Psi = \psi_L \circ \cdots \circ \psi_1$, and the feature map is typically learned via backpropagation to produce the desired cluster structure.
Various formulations of deep k-means have been proposed in the literature. Clustering solutions produced by [14], [15] optimize a hard k-means objective based on distances in feature space. Using the same assignment model as for k-means, but this time in feature space, we decide for cluster c if

$\forall k \neq c: \quad \|\Psi(x) - \mu_c\| < \|\Psi(x) - \mu_k\|.$

This lets us rewrite the full model as the stacking of the L layers of the neural network Ψ with the neuralized k-means model defined in Proposition 1, applied to the feature vector $a = \Psi(x)$. Note that beyond a simple application of standard k-means on top of a given layer, there have been many proposals for deep clustering.
Other quite popular formulations make use of a soft cluster assignment model, specifically, a softargmax model [23], [42], or a t-Student similarity model [12], [43]. These soft clustering approaches bring a probabilistic interpretation of cluster assignments, and enable entropy-based optimization criteria. In the soft k-means models of [23], [42], the data is first projected on some direction $\mu_c$ associated to the cluster, and mapped to a probability score using a softmax:

$P(\omega_c \,|\, x) = \frac{\exp(\mu_c^\top a)}{\sum_k \exp(\mu_k^\top a)} \quad \text{with} \quad a = \Psi(x). \qquad (12)$

Here we first consider the explanation of the clustering outcome; in other words, we place the decision boundary at the location where there is as much evidence for the given cluster assignment as for the assignment onto the nearest competitor.

Proposition 6. The decision function of Eq. (12) can be expressed by the neural network:

$h_k = w_k^\top a \qquad \text{(layer L+1)}$
$f_c(x) = \min_{k \neq c} \{ h_k \} \qquad \text{(layer L+2)}$

(neuralized deep soft clustering, relative), where $w_k = \mu_c - \mu_k$, and testing for $f_c \geq 0$. Furthermore, $f_c$ has a probabilistic interpretation as the log-likelihood ratio $\log \big[ P(\omega_c | x) \,/\, \max_{k \neq c} P(\omega_k | x) \big]$.

A proof is given in Appendix E of the Supplement. The solutions in [12], [43] are also based on a soft-assignment model, where the exponential terms are replaced by t-Student distributions. The latter do not allow for a similar neural network reformulation as above; however, they still converge to hard k-means when the clusters become increasingly distant.
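The equivalence stated in Proposition 6 can be checked numerically; the following sketch, with our own variable names, verifies that the min-pooled projections coincide with the log-likelihood ratio:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = rng.normal(size=(5, 8))                  # cluster directions mu_k
a = rng.normal(size=8)                        # feature vector a = Psi(x)

z = mu @ a
p = np.exp(z - z.max()); p /= p.sum()         # softmax cluster probabilities
c = int(p.argmax())

# f_c = min_{k != c} (mu_c - mu_k)^T a equals the log-likelihood ratio
# between cluster c and its nearest competitor
h = (mu[c] - np.delete(mu, c, axis=0)) @ a
f_c = h.min()
llr = np.log(p[c]) - np.log(np.delete(p, c).max())
assert np.isclose(f_c, llr) and f_c >= 0
```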
Alternatively, one can be interested in why an assignment onto a cluster exceeds a particular probability threshold. Specifically, we would like to explain the decision function

$g_c(x) = 1_{\{P(\omega_c | x) > \theta\}} \qquad (13)$

where the probability scores are defined in the same way as in Eq. (12), and where θ is some value between 0 and 1.
Proposition 7. The decision function of Eq. (13) can be expressed by the neural network:

$h_k = w_k^\top a \qquad \text{(layer L+1)}$
$f_c(x) = -\log \sum_{k \neq c} \exp(-h_k) - \log \frac{\theta}{1 - \theta} \qquad \text{(layer L+2)}$

(neuralized deep soft clustering, absolute), and testing for $f_c \geq 0$. The second layer is a soft min-pooling (an LME computation) over the detection neurons, so that $f_c$ can be interpreted as a log-likelihood ratio plus an offset.

Fig. 5. Examples of clustering models whose cluster assignments can be explained with our NEON approach. The neuralized models, each of which can be expressed as combinations of detection layers and pooling layers, are depicted along with the propagation rules applied at each layer.
A proof is given in Appendix E of the Supplement. As for the k-means and kernel k-means cases, the min-take-most propagation rule can be applied to the top layer. For the last neuralized variant featuring the LME computation, one also needs to handle the case where non-zero relevance scores $R_k$ land on deactivated neurons ($h_k = 0$). To avoid this, we perform the reassignment $R_k \leftarrow R_k \cdot (h_k / f_c)$. For further propagation of the relevance scores into the neural network, we notice that all layers up to layer L+1 form a standard neural network. Hence, propagation rules designed in the context of neural networks are applicable. For propagation rules specific to deep neural networks, we refer to the papers [58], [74], which cover in particular convolutional layers and LSTM blocks.
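As an illustration, the reassignment can be combined with the min-take-most rule as in the following sketch (our own combination and naming):

```python
import numpy as np

def mtm_with_reassignment(h, f_c, beta):
    """Min-take-most rule (Eq. 2) for the LME variant, followed by the
    stabilizing reassignment R_k <- R_k * (h_k / f_c)."""
    p = np.exp(-beta * (h - h.min()))
    R = f_c * p / p.sum()            # Eq. (2): redistribute f_c over competitors
    return R * (h / f_c)             # reassignment; vanishes where h_k -> 0
```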

V. EXTENSION TO ANY CLUSTERING
Not all clusterings can be readily obtained by standard / kernel / deep k-means or combinations of them. Algorithms such as DBSCAN [75], hierarchical agglomerative clustering [76], or spectral clustering [77], [78] are based on different principles and typically lead to different cluster solutions. For these clusterings, we observe however that the decision function they implement is typically based on evaluating distances between individual data points. Hence, the kernel k-means model we have proposed provides a natural surrogate for modeling the cluster assignments of these models. In particular, the identified four-layer architecture can be kept fixed, and the parameters (e.g. data point weightings) can be fine-tuned to fit the decision boundary. Once the model boundaries coincide, the model can be used in a second step to extract explanations. The same fine-tuning strategy can be used to handle cluster solutions that are not the sole result of a clustering algorithm but that have instead been curated by humans to match their expert knowledge.
Compared to a standard surrogate approach that would use a generic classifier to fit the cluster assignments, using a standard / kernel / deep k-means surrogate ensures that the needed adjustment is minimal, thereby preventing the decision strategies of the two models from becoming substantially different. In particular, one minimizes the risk of introducing a Clever Hans effect into the surrogate model (cf. [79]), or of removing such an effect from it. The risk would indeed be that the surrogate model yields a false interpretation (too optimistic or too pessimistic) of the original model's decision strategy.

VI. APPLICATIONS
We have proposed to extend Explainable AI to clustering, and have contributed the neuralization-propagation technique (NEON) to efficiently extract these explanations. In the following, we demonstrate on three showcase examples how one benefits in multiple ways from enriching cluster assignments with explanations.

A. Better Validation of a Clustering Model
The following showcase demonstrates how an explanation of cluster assignments can serve to produce a rich and nuanced assessment of cluster quality that goes beyond conventional metrics such as cluster purity.
We consider for this experiment the 20newsgroups dataset [80], which contains messages from 20 public mailing lists, recorded around the year 1996. Headers, footers and quotes are removed from the messages. Each document D is represented as a collection of words, defined as any sequence of letters of length at least three. Stop words are removed. Document vectors are then produced by mapping each word t the document contains to its tok2vec representation $\phi(t)$ (similar to word2vec [81]), and computing the average $x = \frac{1}{|D|} \sum_{t \in D} \phi(t)$. We cluster the data using a kernel k-means model with 10 support vectors per cluster. Initializing the kernel clustering with ground truth labels and training the kernel k-means model with an EM-style procedure (see Appendix F of the Supplement for details), the cluster assignment converges to a local optimum, with the final assignment visualized in Fig. 6 (middle).
We now focus on assessing the quality of the learned clusters. The Adjusted Rand Index (ARI) metric gives a score of 32%, whereas the same model trained with fixed assignments to the true labels reaches 45%. From this score, one could conclude that the algorithm has learned 'bad' clusters. Instead, cluster explanations, which expose to the user what in a given document is relevant for its membership to a certain cluster, will give a quite different picture. We first note that a direct application of the NEON method to obtain such explanations would result in an explanation in terms of the dimensions of the input vector x, which is not interpretable by a human, as word and document embeddings are usually abstract. A more interpretable word-level explanation can be achieved by observing that the mapping from words to document (an averaging of word vectors) and the first layer of the neuralized kernel k-means are both linear. Thus, they can be combined into a single 'big' linear layer that takes each word distinctly as input. The resulting scores can then be pooled over word dimensions [82], leading to a single relevance score $R_t$ for each individual word t. These explanations can be rendered as highlighted text.
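A sketch of this word-level pooling is given below, assuming the detection weights W and mid-points M of the neuralized model and relevance scores R_k from Eq. (2); the names are ours, and the pooling follows directly from the linearity argument above:

```python
import numpy as np

def word_relevance(phi, W, M, R_k):
    """Pool relevance onto words. phi: (T, d) embeddings of the words in the
    document; W, M: (K-1, d) detection weights and mid-points; R_k: (K-1,)
    relevance per competitor. Since x = mean of phi rows, each word's
    contribution to h_k is (1/T) * w_k^T (phi_t - m_k)."""
    T = len(phi)
    c_tk = ((phi[:, None, :] - M[None, :, :]) * W[None, :, :]).sum(-1) / T
    h = c_tk.sum(0)                                   # recovers h_k over words
    R_t = (c_tk / h[None, :] * R_k[None, :]).sum(1)   # Eq. (4) applied word-wise
    return R_t                                        # one relevance score per word
```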
We select a few messages that we show in Fig. 6. The two messages on the left are assigned to the same cluster but were posted to different newsgroups (i.e. they have different labels, and thus hurt the ARI). Here, NEON highlights in both documents the term "version". Closely related terms like "DOS", "windows" and "ghostscript" are highlighted as well. The fact that "version" was found in both messages and that other related words were present constitutes an explanation and justification for these two messages being assigned to the same cluster.
As a second example, consider the messages on the right in Fig. 6, posted on two different groups but assigned to the same cluster. The top message is discussing specifics of Mercury's motion, whilst the bottom message draws an analogy between physical objects and morals. The most relevant terms are related to physics, such as "Einstein" or "atoms". Also more broadly used terms (that may appear in other clusters too) like "motion" or "smallest" provide evidence for cluster membership. Here again, the words that have been selected hint at a meaningful similarity between these two messages, thus justifying the assignment of these messages to the same cluster.
Overall, in this showcase experiment, minimizing the clustering objective has led to a rather low ARI. According to common validation procedures, this would constitute a reason for rejection. Instead, the cluster membership explanations produced by NEON could pinpoint to the user meaningful cluster membership decisions that speak in favor of the learned cluster structure.

B. Getting Insights into Neural Network Representations
Our second showcase example demonstrates how cluster explanations can be applied beyond cluster assessment, in particular, how they can be used as a way of getting insights into some given data representation Ψ, e.g. some layer of a neural network. A direct inspection of the multiple neurons composing a neural network layer is generally infeasible, as there are many such neurons, and their relation to the input is highly nonlinear. The problem of understanding deep representations has received significant attention in recent years [79], [83], [84].
We consider the data representations built by the well-known VGG-16 convolutional network [85]. The VGG-16 network consists of a classifier built on a feature extractor. The feature extractor is composed of five blocks, each alternating multiple convolutions and ReLU activations. Each block terminates with a 2 × 2 spatial pooling, thereby creating increasingly abstract and spatially invariant representations.
To analyze the representations produced by VGG-16, we feed some image of interest into the network, leading to spatial activation maps at the output of each block. Collecting the activations at the output of a given block, we build a dataset, where each spatial location in the block corresponds to one data point. After this, we apply k-means with K = 8 on these data points (rescaled to unit norm) and neuralize the model.
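As an illustration of this procedure, the following sketch (assuming a recent torchvision and scikit-learn; the block boundary and the random stand-in image are our own choices) builds the dataset of spatial activations and clusters it:

```python
import torch, torchvision
from sklearn.cluster import KMeans

vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
block3 = vgg.features[:17]                     # conv blocks 1-3 (up to third max-pool)

x = torch.randn(1, 3, 448, 448)                # stand-in for a preprocessed image
with torch.no_grad():
    act = block3(x)[0]                         # (256, 56, 56) activation maps
vecs = act.flatten(1).T                        # one data point per spatial location
vecs = vecs / vecs.norm(dim=1, keepdim=True).clamp_min(1e-9)  # unit norm

km = KMeans(n_clusters=8, n_init=10).fit(vecs.numpy())
# km.cluster_centers_ can now be neuralized as in Proposition 1
```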
For each cluster, we consider the model outputs $f_c$, and propagate these outputs backward through the network, using LRP in the neuralized model and further down into the VGG-16 layers, to form a collection of pixel-wise heatmaps associated to each cluster. When computing the explanations, we set β according to our heuristic in Eq. (3), and in convolution layers, we use LRP-γ [58] with γ = 0.25 in blocks 1-3 and γ = 0.1 in blocks 4-5.
Cluster explanations are shown in Fig. 7 for an artificial spiral image and one of the well-known "dogs playing poker" images, titled "Poker Game" by Cassius Marcellus Coolidge, 1894. Images were fed to the network at resolution 448 × 448. In the artificial spiral image, clusters at the output of block 3 map to edges with certain angle orientations as well as colors (black and white) or edge types (black-to-white, or white-to-black). Interestingly, strictly vertical and strictly horizontal edges fall in clusters with very high angle specificity, whereas edges with other angles fall into broader clusters. When building clusters at block 4, color and edge information become less prominent. Clusters are now very selective for the angle of the curvature, something needed to represent higher-level concepts. Hence, this analysis reveals to the user a specific property of the VGG-16 neural network, namely the progressive building of curvature in deep representations. In the Poker Game image, we observe at block 3 a cluster that spans the green texture in the background, one that spans the fur texture associated with the dogs, and further clusters that react to edges of various orientations. After block 5, the clusters once again form higher-level concepts. There is a cluster for the big lamp at the top of the image, a cluster for the painting in the upper right, and a cluster that represents the dogs. Note that the latter only represents the most discriminative part of the dog, and builds invariance w.r.t. other parts of the dogs, in particular the fur texture. This reveals to the user how VGG-16 progressively builds high-level abstractions and becomes invariant to certain visual features.
To summarize, our cluster explanations could extract useful insights about the way VGG-16 represents its input from a small selection of images. In particular, our analysis does away with the high dimensionality of the neural network representation by providing an explanation that fits in only 8 heatmaps, and hence is easily interpretable by the user.

C. Getting Insights into the Data
While Explainable AI techniques have proven helpful to shed light on the decision strategies associated to specific models and data representations, they also provide a useful tool to extract insight into the data distribution itself (exploratory data analysis). This is often desirable in scientific applications [25], [61], where the model serves to discover interesting correlations in the data rather than being of interest on its own. Our last showcase demonstrates that NEON, in conjunction with a well-functioning clustering model, can extract such insight into the data. In particular, we find that clusters of the data can be linked to contiguous patterns in pixel space, often corresponding to the image segments provided by the user.
To demonstrate this property of the data, we consider the PASCAL VOC 2007 dataset [86], which comes with segmentation masks separating the different objects. We consider a similar setting to Section VI-B, where we build a collection of K-cluster models based on activation vectors at different spatial locations and at a given layer of the pretrained VGG-16 network. The assignment of these activation vectors onto the learned clusters is then attributed to the input pixels using our NEON explanation framework to form a collection of K heatmaps. Fig. 8 (top) shows an example of the heatmaps we get for an image of a kid with a small motorbike. We observe that the attribution of cluster membership onto pixels highlights that each cluster represents distinct objects in the image, here, the kid, the motorbike and the background. We perform an experiment where we measure to what degree the explained clusters match the different segmentation masks. Similarity between heatmaps and segmentation masks is measured by a maximum weight matching (Hungarian algorithm) between masks and clusters, where the weight is given by their cosine similarity. The procedure is depicted in Fig. 8. For comparison, we construct two simple baselines that do not make use of clustering: the first baseline takes the top-k most activated (in the $L_\infty$ sense) feature maps (FM); the second baseline takes the top-k most activated locations (LO). In addition, we consider a recently proposed method, NetDissect [84], which identifies meaningful segments of an image by thresholding spatial activation maps. The thresholds applied by NetDissect are learned in a supervised manner to match a rich set of concepts (e.g. wood, red or carpet) from the Broden dataset. The NetDissect1 baseline takes the top-K segmentation maps; NetDissect2 takes K centroids from all segmentation maps. For every method in our benchmark, we fix K = 4 (the average number of objects in the dataset) and apply the same LRP propagation rules for NEON, FM and LO. Examples of heatmaps produced by each method are given in Appendix J.
Average cosine similarities for each method applied at the output of each block are given in Fig. 8 (bottom). The NEON approach clearly and consistently delivers the best results, except for block 5, where NetDissect2 shows a better performance. Interestingly, the highest correlation is found in lower layers, confirming that low-level features such as color or texture are good descriptors of the spatial occupancy of an object, whereas higher-level features may build too much invariance to comprehensively highlight segments (see also Section VI-B). The higher performance of NetDissect in higher layers can be attributed to the smoother way it renders explanations in pixel space (cf. Appendix J in the Supplement), thereby 'undoing' some of the invariances the neural network might have built.
Overall, our NEON approach allows us to shed light on the statistics of complex data distributions, for example, by finding that clusters in image data, especially those coding for low-level information content such as texture or color, substantially correlate with image segments.

VII. EVALUATION
While the section above has demonstrated the multiple practical benefits one can get from bringing Explainable AI to clustering, we would like to study here more specifically the technical ability of NEON as an explanation method for clustering. We consider a broad spectrum of desiderata of an explanation method, and evaluate NEON against a number of simple baselines. We stress that the baselines we use were originally proposed for explaining classification; however, with some adaptations that we propose, they can be extended to the clustering case and therefore serve as baselines in our evaluation.
In particular, we consider integrated gradients (IG) [29], where the explanation scores are computed by integrating the gradient of the model output along a linear path between the origin and the data point x. We then apply Prediction Difference Analysis (PDA) [57], [87], where we score the different dimensions based on the effect on the decision function of removing the corresponding feature. The missing feature is either set to zero (PDA$_0$) or imputed using a KDE conditional sampler (PDA$_{cs}$), which we describe in Appendix G of the Supplement. Finally, we include four simple baselines: random attribution, squared features $x^2$, sensitivity analysis $(\nabla f)^2$, which computes the square of the derivative along each input dimension, and a method specific to standard k-means, 'nearest centroid analysis' (NCA), which computes $(x - \mu_k)^2 - (x - \mu_c)^2$, where $\mu_c$ and $\mu_k$ are the centroids of the assigned cluster and the nearest competing cluster respectively, and where the squaring operation applies element-wise.
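As an illustration, a minimal sketch of the NCA baseline is given below (our own naming; mu is the centroid matrix and c the assigned cluster):

```python
import numpy as np

def nca(x, mu, c):
    """'Nearest centroid analysis': element-wise evidence for the assigned
    centroid mu_c relative to the nearest competing centroid mu_k."""
    d2 = ((x - mu) ** 2).sum(1)
    d2[c] = np.inf                       # exclude the assigned cluster
    k = int(np.argmin(d2))               # nearest competitor
    return (x - mu[k]) ** 2 - (x - mu[c]) ** 2
```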

A. Desiderata and Evaluation Metrics
In the context of explaining image classifiers, [88] proposed the 'pixel-flipping' technique for evaluating explanations. The technique consists of constructing a plot that keeps track of the decision function (in our case, the cluster indicator function $g_c(x) = 1_{\{x \to \text{cluster } c\}}$) as we add or remove features by order of relevance according to the explanation, and measuring the area under the curve (AUC). We start from this algorithm and adapt it to our setting. In particular, instead of flipping pixels, we consider general features and, similar to [36], start from an 'empty' data point and add the features from most to least relevant. Missing features are inpainted using a conditional sampler built on a simple kernel density estimation (KDE) model, the details of which we provide in Appendix G of the Supplement, or replaced by zero when the input features are activations of a deep neural network. The procedure for computing the AUC is detailed in Algorithm 1, where the AUC output is a number between 0 and 100. The higher the AUC, the better the explanation. The analysis can be extended to a whole dataset by averaging the AUC obtained for each individual data point, and repeating the whole procedure multiple times to reduce the variance produced by the KDE sampling.
Algorithm 1: Area under the curve (AUC) computation for a data point z ∈ R^d and the explanation (R_i)_i ∈ R^d of its prediction.

curve ← [ ]
x ← data point with all features missing
for i in argsort(R, descending) do
    x_i ← z_i  (add feature i; remaining features stay imputed)
    curve.append(g_c(x))
end for
return area_under(curve) · 100 / d

Consider now the five desiderata of an explanation listed in [89], namely, fidelity, understandability, sufficiency, low construction overhead, and runtime efficiency. We argue that Algorithm 1 captures the first three of them to a reasonable extent.

Fidelity (D1): Algorithm 1 keeps track of the model output as we add features. This favors techniques that explain the model output rather than some other function.

Understandability (D2): It is desirable that the explanation is understandable by its user, e.g. expressible in terms of input features, and simple enough (e.g. a few relevant features). Algorithm 1 implements this desideratum by verifying whether the few most relevant features returned by the explanation produce a substantial increase of the model output.

Sufficiency (D3): The explanation should be sufficient for its user, i.e. provide sufficient information about the model's decision strategy. Algorithm 1 requests a score for each individual feature (or at least a full ranking of those features). This favors explanations with this level of resolution compared to more coarse-grained explanations.
To assess the fulfilment of the last two desiderata, we proceed as follows:

Low construction overhead (D4): The explanation technique should not be too complex or costly to implement. Our evaluation will rank explanation methods depending on whether they only need access to the decision function, access to some differentiable function reproducing the decision function, or access to the neural network internals of that function.

Runtime efficiency (D5): The explanation should be computable quickly. In our evaluation, we will provide the algorithmic complexity of each explanation method and perform additional runtime comparisons.
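For concreteness, a minimal Python sketch of Algorithm 1 is given below; g_c and impute are placeholders for the cluster indicator and the KDE-based conditional sampler described above, and the trapezoidal normalization is one reasonable choice:

```python
import numpy as np

def auc_score(z, R, g_c, impute):
    """Algorithm 1: add features from most to least relevant, tracking the
    cluster indicator g_c; impute(z, mask) fills in still-missing features."""
    d = len(z)
    mask = np.zeros(d, dtype=bool)               # True = feature revealed
    curve = []
    for i in np.argsort(-R):                     # most relevant first
        mask[i] = True
        x = np.where(mask, z, impute(z, mask))   # reveal z_i, impute the rest
        curve.append(g_c(x))
    return np.trapz(curve, dx=1.0) * 100 / d     # normalized to (roughly) [0, 100]
```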

B. AUC Evaluation Results
To test desiderata D1-D3, we first perform the AUC evaluation presented in Algorithm 1 on a set of models trained on different datasets of various dimensionality and complexity. We consider first a set of standard k-means models trained on a number of datasets from the UCI repository (details and links to the datasets are provided in Appendix H of the Supplement), where the number of clusters K is determined using the elbow method [90]. Then, we consider more complex kernel k-means models, which we train on further datasets from the UCI repository. We also consider the kernel k-means model trained on the 20newsgroups dataset [80] (news in Table I), which we have showcased in Section VI-A. The training algorithm we have used for kernel k-means is detailed in Appendix F of the Supplement. Finally, we consider deep k-means models built on the popular STL-10 [91] image recognition dataset. We consider either a standard k-means model built on the features at the output of block 5 of the VGG-16 deep neural network pretrained on ImageNet (VGG-s), or the same VGG-16 network without supervised pretraining (VGG-u) coupled with the recently proposed SCAN [23] model for deep clustering. For each dataset and model, we set the NEON hyperparameter according to the heuristic in Eq. (3). For deep models, we choose β in the same way and furthermore choose the LRP rule LRP-γ [58], with the parameter γ set heuristically to 0.1. For these two deep clustering models, we consider as the unit of interpretability the 256 feature maps at the output of block 3 of the VGG-16 network, and thus produce explanations in $\mathbb{R}^{256}$. Results are shown in Table I.
We observe that the proposed NEON explanation method is superior to all baselines for the vast majority of considered clustering models and datasets. We note the relatively poor performance of PDA, where the removal of individual features seems insufficient to capture the more global structure of the cluster assignment. To get further insights into the performance of NEON, we perform an experiment where we take an existing dataset, the wine dataset, and generate scenarios of varying complexity by training clustering models with K = 2 to K = 64 clusters, and also removing input features to generate dataset dimensions from d = 2 to d = 13. The results are shown in Fig. 9. We observe that in every regime, NEON has equal or superior performance to all baselines. Anecdotally, NEON performs equivalently to NCA for K = 2, but it starts to outperform it as soon as the number of clusters grows.

C. Sensitivity of NEON to Hyperparameters
Unlike the other baseline methods in our benchmark, NEON comes with a 'stiffness' hyperparameter β, which we have proposed to set heuristically following Eq. (3). For deep clustering, one also needs to choose the parameter γ associated with the propagation in convolution layers. We would like to test the sensitivity of NEON to these parameters, first to verify the soundness of our heuristic, but also to check whether other choices of parameters lead to further improvements or, conversely, to a degradation of NEON's performance. Results are given in Fig. 10, where we superpose on the same plot the performance at the heuristically set value of the hyperparameter (orange dot), the performance for other values of the hyperparameter (solid gray line), and the performance of the best performing baseline (dotted blue line).

Fig. 10. The first two rows show the effect of the min-take-most parameter β, with the orange marker indicating the proposed heuristic $\beta = \mathbb{E}[f_c]^{-1}$; the dotted line is the best performing baseline (cf. Table I). The last row shows the effect of the LRP convolution parameter γ, with the orange marker indicating our heuristic γ = 0.1.

We observe that the simple heuristic proposed in Eq. (3) correlates nicely with the peak of AUC performance, thereby providing empirical justification for the proposed heuristic. We note that even if the hyperparameter β is chosen inadequately, AUC performance degrades in most cases only to a minor extent. Conversely, an optimization of the NEON hyperparameters brings slight additional gains on the AUC score. Notably, the seemingly limited performance of NEON on deep clustering with K = 1000 can be overcome by choosing a larger value for the parameter γ, in turn making NEON again the best performing method. In addition to maximizing the AUC score, the hyperparameters of NEON and the possibility to optimize them can be especially useful when bringing explainability to new tasks with specific performance metrics.

TABLE II: Construction overhead (D4) and runtime (D5) of each explanation method.

D. Construction Overhead and Runtime
Lastly, we would like to study the fulfillment by NEON of desiderata D4 (low construction overhead) and D5 (runtime efficiency), compared to the other methods in our benchmark. We resort to a qualitative analysis for D4, where we categorize methods according to what needs to be constructed in addition to the clustering decision function. Results are shown in Table II (second column). The symbol '-' indicates that we do not even need the decision function, '$g_c$' indicates that we need the decision function only, '$\nabla f_c$' indicates that we need a differentiable surrogate function $f_c$ and its gradient, '$(\mu_c)_c$' indicates that we need the cluster centroids, and finally, 'NN' indicates that we need the neural network equivalent of the surrogate function $f_c$. The proposed NEON method has the highest overhead in our benchmark, as it requires a neural network equivalent. However, since we have already derived these neural network equivalents in the technical sections, there is no significant obstacle to applying NEON to the studied models (k-means, kernel k-means, deep clustering, and related).
Regarding runtime efficiency (D5), we perform a complexity analysis of the different explanation methods, where d is the number of input dimensions, K is the number of clusters, and p is the number of support vectors per cluster in the kernel k-means case. Results are shown in Table II (last column). We observe that for k-means, NEON's computational cost is lower than or equal to that of most explanation methods, requiring only a single forward and backward pass, whereas several explanation methods need to evaluate the model multiple times. (An empirical runtime comparison to all baselines for various k-means models can be found in Appendix I of the Supplement.) For kernel k-means, the results are more balanced, with NEON being slower than simple sensitivity analysis, but running faster than the more advanced PDA and IG competitors if the number of support vectors is smaller than the number of input dimensions or the number of integration steps, respectively. Hence, while for standard k-means we can generally claim that NEON has high efficiency, for kernel k-means one needs to additionally ensure that the number of support vectors remains small, typically less than 10.
Overall, our evaluation has demonstrated that NEON fares best on average, comparing favorably to all competitors when considering the multiple aspects that enter into the assessment of an explanation method. NEON therefore constitutes, so far, the most appropriate and powerful method for tackling the problem of explaining cluster assignments.

VIII. CONCLUSION
We have contributed by bringing, for the first time, Explainable AI to clustering, and have proposed a general framework, called neuralization-propagation, for explaining the cluster assignments of a broad range of clustering models. The proposed method converts, without retraining, the clustering model into a functionally equivalent neural network composed of detection and pooling layers. This conversion step, which we have called 'neuralization', enables cluster assignments to be efficiently attributed to input variables by means of a reverse propagation procedure.
Quantitative evaluation shows that our explanation method is capable of identifying cluster-relevant input features in a precise and systematic manner, from the simplest k-means model to some of the most recent proposals such as the SCAN deep clustering model [23]. The performance remains high across all considered data types, in particular, abstract vector data, text, natural images, and neuron activations.
The method we have proposed complements standard cluster validation techniques by providing rich interpretable feedback on the nature of the clusters that are built. Furthermore, when paired with a well-functioning clustering algorithm, it provides a useful tool for exploratory data analysis and knowledge discovery, where complex data distributions are first summarized into finitely many clusters, which are then exposed to the human in an interpretable manner.

APPENDIX A NEURALIZATION OF K-MEANS
We recall that the decision function implemented by k-means can be expressed as

$\forall k \neq c: \quad \|x - \mu_c\|^2 < \|x - \mu_k\|^2 \qquad (1)$

i.e. one assigns a given data point x to the cluster c if every other cluster k ≠ c has a higher distance between the data point and the cluster centroid.
Proposition 1. The decision function of Eq. (1) can be reproduced by a two-layer neural network composed of a standard linear layer and a (min-)pooling layer (neuralized k-means):

$h_k = w_k^\top x + b_k \qquad \text{(layer 1)}$
$f_c(x) = \min_{k \neq c} \{ h_k \} \qquad \text{(layer 2)}$

where $w_k = 2(\mu_c - \mu_k)$ and $b_k = \|\mu_k\|^2 - \|\mu_c\|^2$, and assigning to cluster c if $f_c(x) > 0$.
Proof. We first note that the decision function of Eq. (1) can be rewritten more compactly by testing only for the nearest cluster competitor, i.e.

$\min_{k \neq c} \{ \|x - \mu_k\|^2 \} > \|x - \mu_c\|^2. \qquad (2)$

If this holds, the same necessarily holds for all remaining cluster competitors. We now rewrite the function $f_c(x)$ computed at the output of the neural network model in a way that lets distances appear:

$f_c(x) = \min_{k \neq c} \{ w_k^\top x + b_k \} = \min_{k \neq c} \{ \|x - \mu_k\|^2 - \|x - \mu_c\|^2 \} = \min_{k \neq c} \{ \|x - \mu_k\|^2 \} - \|x - \mu_c\|^2.$

It can now be seen easily that the predicate $f_c(x) > 0$ is equivalent to Eq. (2). □

Lemma 1. Attributing a min-function with the origin as reference point yields a uniform redistribution strategy. Specifically, the Shapley value of $f(x) = \min_i \{ x_i \}$ with $x_i \geq 0$ and reference point $\tilde{x} = 0$ is $\phi_i = f(x)/d$.

Proof. Starting from the definition of the Shapley value,

$\phi_i = \sum_{S \subseteq \Omega \setminus \{i\}} \alpha_S \, [f(x_{S \cup \{i\}}) - f(x_S)] \qquad (4)$
$\phantom{\phi_i} = \sum_{S \subseteq \Omega \setminus \{i\}} \alpha_S \, f(x_{S \cup \{i\}}) \qquad (5)$
$\phantom{\phi_i} = \alpha_{\Omega \setminus \{i\}} \, f(x_{\Omega}) \qquad (6)$
$\phantom{\phi_i} = f(x)/d \qquad (7)$

where $\alpha_S = |S|! \, (d - 1 - |S|)! / d!$ and $x_S$ denotes the data point with features outside S set to their reference value. From (4) to (5), we have made use of the fact that setting one of the inputs to zero in the min function suffices to make the output of that function zero; specifically, the rightmost term, which never includes feature i, is therefore always zero and can be dropped. From (5) to (6), we have observed that only one element in the sum has an associated function value that is non-zero, specifically, the one for which all features are included. From (6) to (7), we simply replace the coefficient $\alpha_{\Omega \setminus \{i\}}$ by its numerical value 1/d, and observe that $x_\Omega$ and $x$ denote the same quantity. □

A. Attribution on Intermediate Layer
We recall that in the main paper, we have proposed to perform redistribution in the second layer of neuralized k-means using the rule

$$R_k = \frac{\exp(-\beta h_k)}{\sum_{k' \neq c} \exp(-\beta h_{k'})}\, f_c(x) \tag{11}$$

which interpolates between min-take-all redistribution when β → ∞ and uniform redistribution when β = 0.
Proposition 2. The redistribution performed by Eq. (11) with parameter β = 0 corresponds to the Shapley value of the function $f_c(h)$ with the reference point $\tilde{h} = 0$.
Proof. This is a direct consequence of Lemma 1, which shows that attributing a min-function using the origin as a reference point yields a uniform redistribution strategy. □
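A minimal sketch of this redistribution, assuming Eq. (11) takes the soft-min weighting form written above (our reading of the rule, not a verbatim copy of the main paper's code):

```python
import numpy as np

# Minimal sketch of min-take-most redistribution in the second layer.
def redistribute(h, fc, beta):
    w = np.exp(-beta * (h - h.min()))   # shifted for numerical stability
    w /= w.sum()                        # uniform at beta=0, min-take-all as beta -> inf
    return w * fc

h = np.array([1.0, 3.0, 0.5])           # competitor scores h_k
fc = h.min()                            # output of the min-pooling layer
print(redistribute(h, fc, beta=0.0))    # uniform: fc/3 for every competitor
print(redistribute(h, fc, beta=50.0))   # nearly all relevance on argmin h_k
```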

B. Two-Cluster Case
We observe that when there are only two clusters (the one to which the data point is assigned and its competitor, respectively denoted by c and k), we get by application of Eq. (11) that multiple terms are equal: $R_k = f_c(x) = h_k$. We also recall that the rule we use for the first layer is given by

$$R_i = \sum_{k \neq c} \frac{w_{ki}(x_i - m_{ki})}{\sum_{i'} w_{ki'}(x_{i'} - m_{ki'})}\, R_k \tag{13}$$

where $m_k = (\mu_c + \mu_k)/2$ denotes the midpoint between the two centroids. Observing that the denominator of the propagation rule is equivalent to $h_k$, we get the closed-form attribution

$$R_i = w_{ki}(x_i - m_{ki}) \tag{14}$$

This result also follows from application of Lemma 2 to the function $f_c(x)$ with replacement value $\tilde{x} = m_k$, and observing that the resulting Shapley value matches the output of NEON given in Eq. (14).
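The closed form can be checked numerically; the following sketch (ours) also verifies the conservation property, i.e. that the attributions sum to $f_c(x)$, which holds because $f_c$ vanishes at the midpoint:

```python
import numpy as np

# Minimal sketch: in the two-cluster case, the NEON attribution has the
# closed form R_i = w_i (x_i - m_i), with m the centroid midpoint.
rng = np.random.default_rng(1)
d = 4
mu_c, mu_k = rng.normal(size=d), rng.normal(size=d)
x = rng.normal(size=d)

w = 2 * (mu_c - mu_k)                   # first-layer weights
b = mu_k @ mu_k - mu_c @ mu_c           # first-layer bias
m = (mu_c + mu_k) / 2                   # reference point x~ = m_k

R = w * (x - m)                         # closed-form attribution (Eq. (14))
assert np.isclose(R.sum(), w @ x + b)   # conservation: sum equals f_c(x)
```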

APPENDIX C DECISION FUNCTION OF KERNEL K-MEANS
We recall that, to address the limitations of a simple discriminant built in terms of distances in kernel feature space (Eq. (15)), we have proposed the alternative discriminant of Eq. (16), where Φ is some feature map associated to the Gaussian kernel $K(x, u) = \exp(-\gamma \|x - u\|^2)$ and $Z_c$ is a normalization factor. The two distance functions appearing in Eqs. (15) and (16) can be related through a monotonically increasing function $g_c$, defined in terms of $\mathrm{Li}_1$, the polylogarithm of order 1, and $\Delta_c = (1 - \|\mu_c\|^2)/2$.

Proof. We inject the original squared distance into the function $g_c$ and recall that the polylogarithm of order 1 is given by $\mathrm{Li}_1(t) = -\log(1 - t)$, a convex function. After a few steps of derivation, we arrive at the function used in the second discriminant. □

APPENDIX D NEURALIZATION OF KERNEL K-MEANS

Proposition 5. The decision function in Eq. (16) can be reproduced by a four-layer neural network composed of a linear layer followed by three pooling layers (Eq. (30)), where $\mathrm{LME}^{\gamma}$ and $\mathrm{LME}^{-\gamma}$ can be interpreted as soft max-pooling and soft min-pooling respectively, and assigning to cluster c if $f_c(x) > 0$.
Proof. First, we show that the $\mathrm{LME}^{\gamma}$ operator is commutative w.r.t. additive scalars, i.e. $\mathrm{LME}^{\gamma}_k\{h_k + t\} = \mathrm{LME}^{\gamma}_k\{h_k\} + t$ for any scalar $t$. This allows for a higher-level point of view that holds for hard as well as soft min-pooling: a difference of minima equals a minimax of differences,

$$\min_k\{a_k\} - \min_j\{b_j\} = \min_k \max_j \{a_k - b_j\}$$

By exploiting this fact multiple times, we derive a reformulation of the discriminant for kernel clustering, which gives us the form of Eq. (30). □
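The two pooling properties used in this proof are easy to verify numerically; the sketch below assumes the standard log-mean-exp definition $\mathrm{LME}^{\gamma}(h) = \frac{1}{\gamma}\log\big(\mathrm{mean}_k \exp(\gamma h_k)\big)$:

```python
import numpy as np

# Minimal sketch of log-mean-exp (LME) pooling.
def lme(h, gamma):
    return np.log(np.mean(np.exp(gamma * h))) / gamma

h = np.array([0.2, 1.5, -0.7])
# Soft max-pooling (gamma > 0) and soft min-pooling (gamma < 0)
assert lme(h, 10.0) <= h.max() and lme(h, -10.0) >= h.min()
# Commutativity w.r.t. additive scalars: LME(h + t) = LME(h) + t
t = 3.0
assert np.isclose(lme(h + t, 5.0), lme(h, 5.0) + t)
```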
The results provided by Lemmas 3 and 4 will help us derive neuralized forms for several practical soft clustering models, in particular soft k-means and a deep clustering model.

A. Neuralization of Soft K-Means
We consider a soft version of k-means where the probabilities are generated using the softmax function of Eq. (23) and where the logits entering the softmax function are (rescaled) squared distances. The parameter τ is a scaling parameter, commonly referred to as 'stiffness'. Applying Lemmas 3 and 4, and observing that a difference of logits can be written as a linear function of x, we can reproduce the decision functions of Eq. (24) by a neural network of the same detection/pooling structure (neuralized soft k-means, relative).
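The linearity of logit differences can be checked directly. The sketch below assumes logits of the form $z_k(x) = -\tau \|x - \mu_k\|^2$; the exact scaling convention is our assumption, not necessarily the one of the main paper:

```python
import numpy as np

# Minimal sketch: soft k-means probabilities and linearity of logit differences.
rng = np.random.default_rng(2)
d, K, tau = 3, 4, 0.7
mu = rng.normal(size=(K, d))
x = rng.normal(size=d)

z = np.array([-tau * np.sum((x - mu_k) ** 2) for mu_k in mu])
p = np.exp(z - z.max()); p /= p.sum()     # softmax cluster probabilities

# A difference of logits is linear in x: z_c - z_k = tau * (w_k^T x + b_k)
c, k = 0, 1
w = 2 * (mu[c] - mu[k])
b = mu[k] @ mu[k] - mu[c] @ mu[c]
assert np.isclose(z[c] - z[k], tau * (w @ x + b))
```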

B. Neuralization of Deep Soft Clustering
We recall that the decision function implemented by deep soft clustering (cf. Section IV of the main paper) is given by:

$$f_c(a) = \min_{k \neq c}\{w_k^\top a\}$$

where $w_k = \mu_c - \mu_k$, and testing for $f_c \geq 0$. Furthermore, $f_c$ has a probabilistic interpretation as the log-likelihood ratio $\log(p_c(x) / \max_{k \neq c}\{p_k(x)\})$.
Proof. The proof follows from observing that the decision function is of the same type as Eq. (23) with logit $z_c(x) = \mu_c^\top a$, applying Lemma 3, and observing that the difference of logits simplifies to $z_c(x) - z_k(x) = (\mu_c - \mu_k)^\top a$, i.e. a homogeneous linear model built on the activation vector a. □
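A quick consistency check (ours; logits $z_k(a) = \mu_k^\top a$ as in the proof above):

```python
import numpy as np

# Minimal sketch: f_c(a) = min_{k != c} (mu_c - mu_k)^T a is nonnegative
# exactly when cluster c has the highest softmax probability, and equals
# the log-likelihood ratio log(p_c / max_{k != c} p_k).
rng = np.random.default_rng(3)
K, h = 5, 16
mu = rng.normal(size=(K, h))
a = rng.normal(size=h)                   # activation vector

z = mu @ a                               # logits
p = np.exp(z - z.max()); p /= p.sum()    # softmax probabilities
c = int(np.argmax(p))
fc = min((mu[c] - mu[k]) @ a for k in range(K) if k != c)
assert fc >= 0
assert np.isclose(fc, np.log(p[c] / np.delete(p, c).max()))
```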
We recall that the decision function implemented by deep soft clustering, when considering an absolute probability threshold (cf. Section IV of the main paper), is given by

$$p_c(x) > \theta \tag{33}$$

where $p_c(x)$ is defined in the same way as in Eq. (32).

Proof. The proof proceeds in a similar way as for Proposition 6, but making use of Lemma 4 instead of Lemma 3. □

This appendix gives additional details and intermediate results for the quantitative experiment of Section VI-C of the main paper. We consider images from the PASCAL VOC 2007 dataset [1] along with their segmentation masks. For each image, we add a 'background' segment representing all non-segmented areas. For a selection of images, we compute the heatmap summaries obtained by applying NEON or one of the three other methods in our benchmark (FM, LO and NetDissect [2]) at a given layer of the VGG-16 network [3].
Figure 2 shows the best, median and worst case images as evaluated by our heatmap-segmentation matching metric at the output of block 5. Images from the dataset are ranked based on their matching scores averaged over NEON, FM, LO and NetDissect. We superimpose them with their matched segmentation mask, displayed as a black contour. The 'background' segment is displayed with a solid border frame. The scores in the bottom right of each heatmap are cosine similarities between the heatmap and the matched segmentation mask.
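For reference, a minimal sketch of this matching metric (the array shapes and random placeholder data are hypothetical):

```python
import numpy as np

# Minimal sketch: cosine similarity between a flattened heatmap and each
# binary segmentation mask; the heatmap is matched to its best segment.
def cosine(a, b):
    a, b = a.ravel(), b.ravel()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(4)
heatmap = rng.random((224, 224))                  # one cluster heatmap
masks = rng.random((5, 224, 224)) > 0.5           # candidate segments (binary)
scores = [cosine(heatmap, m.astype(float)) for m in masks]
best = int(np.argmax(scores))                     # matched segment index
```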
We observe that the best case image depicts a simple scene composed of clearly distinguishable objects with different colors and textures. Typically, objects that appear several times form a cluster, although sometimes there is no specific segmentation for that kind of object, e.g. the windows or candles in the middle row. On the other hand, there are also segments for background objects that do not form their own cluster, such as the chair. We can also observe that even for the worst case example, heatmaps may still focus on distinct objects, e.g. windows or table clutter.
Figure 3 shows heatmap summaries at the output of block 5 for NEON and the LO, FM, and NetDissect methods, on an image with typical matching quality.
We observe for this image that the heatmaps produced by FM focus multiple times on the same region of the input image, thereby not delivering an appropriate spatial summary. The LO heatmaps are too localized to exhaustively capture large image parts (e.g. the background). Similarly to FM, the NetDissect1 method (top-4 concepts) is affected by a redundancy problem where the different concepts (e.g. texture, color, object) are bound to the same image regions. In the NetDissect2 variant, where concepts are grouped into 4 clusters, the summary becomes more complete, leading to a higher cosine similarity. Finally, our proposed method NEON, although not making use of any supervised information besides what is already contained in the feature representation, produces qualitatively the most satisfying summary. The cosine similarity is here mainly limited by the heatmap sparsity, which is especially strong at the output of block 5. Sparsity prevents a full correlation with the dense segmentation masks. While we have used the raw NEON output in our comparison, further benchmark-specific post-processing steps, e.g. spatial interpolation, could in principle be used to further improve the score.

Fig. 1. From clustering to cluster explanations via neural networks. A. Standard clustering scenario where data are assigned to clusters according to the clustering model. B. Overview of our contributions. B1. We enrich the cluster assignment with an explanation highlighting which input features contribute most to the cluster decision. B2. We achieve this technically by observing that the clustering decision can be rewritten as a neural network (neuralization), enabling fast and robust explanations via the LRP technique (propagation).

Fig. 2. Left: Decision function of a k-means clustering model with centroids µ1, µ2, µ3. Data points in the region highlighted in red are assigned to the cluster c = 1. Right: Contour plot of the function f_c(x) for the cluster c = 1.


Fig. 4. Left: Partition implemented by a kernel k-means clustering with three clusters supported by seven support vectors each. Right: Neural network output f_c(x) associated to the first cluster.
Fig. 5. Examples of clustering models whose cluster assignments can be explained with our NEON approach. The neuralized models, each of which can be expressed as a combination of detection layers and pooling layers, are depicted along with the propagation rules applied at each layer.

Fig. 6. Application of NEON to the clustering of newsgroup data. Newsgroup texts are shown with words relevant for cluster membership highlighted. Gray words are out of vocabulary.

Fig. 7. NEON analysis of images represented at different layers of a deep neural network (pretrained VGG-16). K-means clustering with K = 8 is performed at the output of these two blocks. Each column shows the pixel-wise contributions for one of these clusters.

Fig. 8. Quantitative evaluation of NEON's ability to extract meaningful summaries. Top: The cluster explanation is matched with ground-truth object segmentation masks by means of cosine similarity. Bottom: Comparison of NEON to other methods. For each method, we show the average cosine score over the whole dataset. Results are shown for different blocks on the x-axis.

Fig. 9. Effect of the number of retained dimensions d and the number of clusters K on the AUC performance of each explanation method on the wine dataset.

Fig. 10. Evaluation of NEON hyperparameters on a selection of clustering models. 1st row: k-means models; 2nd row: kernel k-means models; 3rd row: deep models (VGG-u / SCAN). The y-axis shows the pixel-flipping AUC. The first two rows show the effect of the min-take-most parameter β, with the orange marker indicating the proposed heuristic β = E[f_c]^{-1} and the dotted line indicating the best performing baseline (cf. Table I). The last row shows the effect of the LRP convolution parameter γ, with the orange marker indicating our heuristic γ = 0.1, and where we set β = E[f_c]^{-1}.
Let $x \in \mathbb{R}^d$ be our data point. Let $I$ denote a collection of input features, and $x_I$ denote the data point $x$ where values for features $i \notin I$ have been replaced by the corresponding values of some reference data point $\tilde{x}$. In particular, denoting by $\Omega$ the set of all $d$ features and by $\emptyset$ the empty set, we have $x_\Omega = x$ and $x_\emptyset = \tilde{x}$. We consider a function $f : \mathbb{R}^d \to \mathbb{R}$ mapping the data points to real-valued scores. In this context, the Shapley value provides a way of attributing this score to the individual input features via the formula

$$\phi_i = \sum_{I \subseteq \Omega \setminus \{i\}} \alpha_I \,\big(f(x_{I \cup \{i\}}) - f(x_I)\big) \qquad \text{with} \qquad \alpha_I = \frac{|I|!\,(d - |I| - 1)!}{d!}$$

where $\alpha_I$ is a weighting coefficient satisfying $\sum_{I \subseteq \Omega \setminus \{i\}} \alpha_I = 1$.
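The formula can be evaluated exactly for small $d$ by enumerating all feature subsets; the following sketch (ours) does so and sanity-checks the known result for a linear function:

```python
import itertools, math
import numpy as np

# Minimal sketch: exact Shapley values by subset enumeration, directly
# following the formula above (tractable only for small d).
def shapley(f, x, x_ref):
    d = len(x)
    phi = np.zeros(d)
    for i in range(d):
        others = [j for j in range(d) if j != i]
        for r in range(len(others) + 1):
            for I in itertools.combinations(others, r):
                alpha = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
                x_I = x_ref.copy(); x_I[list(I)] = x[list(I)]
                x_Ii = x_I.copy(); x_Ii[i] = x[i]
                phi[i] += alpha * (f(x_Ii) - f(x_I))
    return phi

# Sanity check: for a linear function with reference 0, phi_i = w_i * x_i.
w, x = np.array([1.0, -2.0, 0.5]), np.array([3.0, 1.0, -4.0])
assert np.allclose(shapley(lambda v: w @ v, x, np.zeros(3)), w * x)
```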

Proposition 3. When the number of clusters is equal to 2, the model reduces to $f_c(x) = w_k^\top x + b_k$, and redistribution by Eqs. (11) and (13) corresponds to the Shapley value of the function $f_c(x)$ with the reference point $\tilde{x} = m_k$.

Fig. 1. Runtime of explaining the predictions of a k-means model on the wine dataset, as a function of the number of retained input dimensions and the number of clusters.

Fig. 2. NEON applied to a k-means model built at the output of block 5 of the VGG-16 model, shown for images with the best / median / worst matching between heatmaps and segmentations.

Fig. 3. NEON compared to the FM, LO and NetDissect baselines, for a model built at the output of block 5 of the VGG-16 model. We show an image corresponding to a median matching quality between heatmaps and segmentation masks.

TABLE I
AUC score computed with Algorithm 1 and serving as a proxy for the fulfillment of desiderata D1-D3. The higher the AUC score, the better the explanations. We find that the proposed NEON method scores the highest for the vast majority of clustering models. Entries where methods are inapplicable or computationally prohibitive are denoted by '-'.

TABLE II
Fulfillment of the low construction overhead and runtime efficiency desiderata for the methods in our benchmark.
This document contains supplementary material supporting the results and experiments from the main paper. Appendices A-E contain proofs and justifications for some of the non-trivial steps taken in Sections II-IV of the main paper to neuralize the k-means models. Appendix F describes the modified training procedure used for producing the kernel k-means models of Sections VI-A and VII of the main paper. Appendices H-J give additional information, evaluations, and results for the experiments of Sections VI-C and VII of the main paper.

We first present two (known) intermediate results that we can use to prove Propositions 2 and 3 of the main paper.

Lemma 1. Let $x \in \mathbb{R}^d_+$. The Shapley value of the function $f(x) = \min\{x_1, \ldots, x_d\}$ with replacement value $\tilde{x} = 0$ is the uniform attribution $\phi_i = \frac{1}{d}\min\{x_1, \ldots, x_d\}$.

Proof. Since $x \in \mathbb{R}^d_+$, setting one of the inputs to zero in the min function suffices to make the output of that function zero. Hence, $f(x_I) = 0$ for every $I \neq \Omega$, and only one element in the Shapley sum has a non-zero marginal contribution, specifically the one for which all features are included, i.e. $I = \Omega \setminus \{i\}$. Replacing the coefficient $\alpha_{\Omega \setminus \{i\}}$ by its numerical value $1/d$, and observing that $x_\Omega$ and $x$ denote the same quantity, yields $\phi_i = \frac{1}{d} f(x)$. □
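Lemma 1 can be verified numerically; the following self-contained sketch (ours) enumerates all feature subsets, with the reference point fixed to the origin:

```python
import itertools, math
import numpy as np

# Quick numerical check of Lemma 1: exact Shapley values of f(x) = min(x)
# for nonnegative x with reference point 0 are uniform, phi_i = min(x)/d.
x = np.array([0.9, 2.0, 1.4]); d = len(x)
phi = np.zeros(d)
for i in range(d):
    others = [j for j in range(d) if j != i]
    for r in range(d):
        for I in itertools.combinations(others, r):
            alpha = math.factorial(r) * math.factorial(d - r - 1) / math.factorial(d)
            x_I = np.zeros(d); x_I[list(I)] = x[list(I)]     # reference is 0
            x_Ii = x_I.copy(); x_Ii[i] = x[i]
            phi[i] += alpha * (x_Ii.min() - x_I.min())
assert np.allclose(phi, np.full(d, x.min() / d))
```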