Diffeomorphic Counterfactuals With Generative Models

Counterfactuals can explain classification decisions of neural networks in a human-interpretable way. We propose a simple but effective method to generate such counterfactuals. More specifically, we perform a suitable diffeomorphic coordinate transformation and then perform gradient ascent in these coordinates to find counterfactuals which are classified with high confidence as a specified target class. We propose two methods that leverage generative models to construct such suitable coordinate systems, which are either exactly or approximately diffeomorphic. We analyze the generation process theoretically using Riemannian differential geometry and validate the quality of the generated counterfactuals using various qualitative and quantitative measures.


Introduction
Deep neural network models are widely used to solve complex problems from computer vision (e.g. [1, 2, 3, 4, 5]), strategic games and robotics (e.g. [6, 7, 8]), to medicine (e.g. [9, 10, 11]) and the sciences (e.g. [12, 13, 14, 15, 16]). However, they are traditionally seen as black-box models: given the network model, it is often unclear to the user, and even to the engineer designing the algorithm, what has been most important for reaching a particular output prediction. This can cause serious obstacles for applications since, say, networks using spurious image features that are only present in the training data (Clever Hans effect [17, 18]) might go unnoticed. Such undesired behaviour, hampering the network's generalization ability, is particularly problematic in safety-critical areas.
A common class of explanation methods produces saliency maps for classifiers or regressors [39], which highlight areas of the input that were particularly important for the classification.
A different approach to explaining a neural network is to provide counterfactuals to the original inputs [40, 41, 42]. These are realistic-looking images which are semantically close to the original but differ in distinct features so that their classification matches the desired target class, cf. Figure 1. Counterfactuals aim to answer questions like "Why was this input classified as A and not as B?" or "What would need to change in the input so that it is no longer classified as A but instead as B?" [40, 41, 42] and thereby provide an explanation for the classifier. Unlike attribution methods, counterfactuals do not provide a relevance map, but an image that is similar to the original input and serves as a kind of counterexample or hypothetical alternative to the original prediction.
Crucially, the counterfactual is required to be a realistic sample from the data distribution in order to elucidate the behaviour of the network on the data. This requirement poses the greatest practical challenge to computing counterfactuals, since naively optimizing the output of the network with respect to the input via gradient ascent yields adversarial examples [43] which essentially add a small amount of noise to the original input, as illustrated in the example given in Figure 1. This behavior can be understood using the manifold hypothesis: the images are assumed to lie on a low-dimensional manifold embedded in the high-dimensional input space, cf. Figure 2 (a). The gradient ascent algorithm then walks in a direction orthogonal to the decision boundary, which is with high probability also orthogonal to the data manifold, resulting in a small perturbation which is not semantic, as illustrated in Figure 2 (b).
We propose to use insights from the mathematical discipline of differential geometry to mitigate this problem. Differential geometry can be understood as analysis on curved (hyper-)surfaces and thus provides the appropriate tools to study gradients on the data manifold. It has been valuable for the field of ML in general [44, 45, 46] and XAI specifically [47, 48, 49]. A cornerstone of differential geometry is the idea that geometric quantities can be described equivalently in different coordinate systems. However, not all coordinate systems are equally useful in practice. This phenomenon is ubiquitous in physics: for instance, the mathematical expressions governing planetary motions greatly simplify in a heliocentric (sun-centered) coordinate system as opposed to a geocentric (earth-centered) coordinate system. In heliocentric coordinates, the relevant degrees of freedom are easily recognizable (Kepler ellipses) and, as a result, physical intuition and interpretations are much easier to deduce. Similarly, we attribute the difficulty of constructing counterfactuals by optimizing the output of a neural network classifier with respect to its input, as in Figure 2 (b), to the poor choice of coordinates X in the input space given by the raw image data. In contrast, in a suitably chosen coordinate system Z, the data manifold would extend more evenly in all directions, allowing for an optimization that stays on the data manifold and thereby produces a counterfactual that has been changed semantically when compared to the original. In order to find such a coordinate transformation (called a diffeomorphism in differential geometry) between X and Z, we use a normalizing flow trained on the image data set under consideration. Since the flow is by construction bijective and differentiable with a differentiable inverse, it satisfies the technical conditions for a diffeomorphism in differential geometry. Furthermore, the base distribution of the flow is fixed to be a standard Gaussian and
hence free of pathological directions. Moreover, this change of coordinate system leads to no information loss, which is in stark contrast to existing methods for generating counterfactuals.
In our method, the counterfactual is computed by taking the gradient in the gradient ascent update with respect to the representation in the base space of the normalizing flow, as opposed to the input of the classifier (Figure 2 (b)). This method comes with rigorous theoretical guarantees and we refer to it as diffeomorphic counterfactuals. In particular, we show that this introduces a metric into the update step which shrinks the gradient in directions orthogonal to the data manifold. Furthermore, we propose two separate methods which only approximately lead to a diffeomorphism. While these approximate methods come with weaker theoretical guarantees and can, in practice, lead to some information loss, they scale easily to very high-dimensional datasets, as we demonstrate experimentally. We refer to these methods as approximate diffeomorphic counterfactuals. We theoretically prove that these methods also stay on the data manifold under suitable assumptions. Our theoretical analysis therefore provides a unified mathematical framework for the application of generative models in the context of counterfactuals. Importantly, we can not only optimize the output of a classifier network on the data manifold in this manner, but also that of a regressor.
This analysis is supported by our experimental results for various application domains, such as computer vision and medical radiology, and a number of architectures for classifiers, regressors and generative models. Note that we lay emphasis on using quantitative metrics, as opposed to only qualitative analysis, to evaluate the proposed methods; some quantitative results are exemplified in Figure 2 (d) and (e).
The paper is structured as follows: in Section 2, we introduce the proposed methods. Specifically, we introduce diffeomorphic counterfactuals in Section 2.3 and their approximate versions in Section 2.4. We then analyze the proposed methods theoretically using Riemannian differential geometry in Section 3. This is followed by Section 4, which provides a detailed experimental analysis of diffeomorphic counterfactuals, in Section 4.1, and approximate diffeomorphic counterfactuals, in Section 4.2. In Section 5, we give an extensive discussion of related work. The code for a toy example and our main experiments is publicly accessible.

Methods
In this section, we will introduce in detail our novel diffeomorphic and approximately diffeomorphic counterfactuals.For this, we will start by reviewing the basics of counterfactual explanations and then present our two proposed methods.

Counterfactual Explanations
Consider a classifier f : X → R^C which assigns to an input x ∈ X the probability f(x)_c of belonging to class c ∈ {1, . . ., C}. Counterfactual explanations of the classifier f provide minimal deformations x′ = x + δx such that the prediction of the classifier is changed.

Figure 3: (b) Gradient ascent in the input space X along ∂f_t/∂x flips the prediction, but the resulting image is an adversarial example that looks indistinguishable from the original for a human observer. The changes to the original image are not semantic, but limited to unstructured noise; the Euclidean difference between adversarial and original is therefore very small when measured in X but large when measured in Z. (c) We use the normalizing flow g to obtain the latent space representation z = g^(-1)(x) of our original image x. We then perform gradient ascent in the latent space Z. The prediction flips, but this time the resulting image is a counterfactual: the changes to the original image are semantic, and the Euclidean difference between counterfactual and original is small when measured in both X and Z.

In many cases of practical relevance, the data lies approximately on a submanifold D ⊂ X of significantly lower dimensionality N_D than the dimensionality N_X of the input space X. This is known as the manifold hypothesis in the literature (see e.g. [50]). For counterfactual explanations, as opposed to adversarial examples, we are interested in deformations x′ which lie on the data manifold. Additionally, we require the deformations of the original data to be minimal, i.e. the perturbation δx should be as small as possible. The relevant norm is, however, measured along the data manifold and not calculated in the input space. For example, a slightly rotated number in an MNIST image may have a large pixel-wise distance but should be considered an infinitesimal perturbation of the original image.
We mathematically formalize the manifold hypothesis by assuming that the data is concentrated in a small region of extension δ around D. As we will show in Section 3, this implies that the support S of the data density p is a product manifold

S = D × I_δ^(N_X − N_D) ,    (1)

where I_δ = (−δ/2, δ/2) is an open interval of length δ (with respect to the Euclidean distance on the input space X). We assume that δ is small, i.e. the data lies approximately on the low-dimensional manifold D and thus fulfills the manifold hypothesis. We can think of the intervals I_δ as arising from the inherent noise in the data.
Furthermore, we define the set of points in S classified with confidence Λ ∈ (0, 1) as class t ∈ {1, . . ., C} by

S_{t,Λ} = { x̃ ∈ S | f(x̃)_t ≥ Λ } .    (2)

The counterfactual x′ of an input x is then a closest point in this set,

x′ = argmin_{x̃ ∈ S_{t,Λ}} d_γ(x̃, x) ,    (3)

where d_γ(x′, x) is the distance computed with the Riemannian metric γ on S (which is induced from the flat metric by the diffeomorphism given by the generative model). We will review the necessary concepts of Riemannian geometry in Section 3.1.

Generation of Counterfactuals
Often, counterfactuals are generated by performing gradient ascent in the input space X; see [41] for a recent review on counterfactuals. More precisely, for step size η and target class t, one performs the gradient ascent step

x^(i+1) = x^(i) + η ∂f(x^(i))_t / ∂x    (4)

until the classifier has reached a threshold confidence Λ, i.e. f(x^(i+1))_t > Λ. The resulting samples will however often not lie on the data manifold and differ from the original image x only in added unstructured noise rather than in an interpretable and semantically meaningful manner. Especially when applied to high-dimensional image data, such samples are usually referred to as adversarial examples and not counterfactuals. We therefore propose to estimate the counterfactual x′ of the original data point x by using a diffeomorphism g : Z → S. We then perform gradient ascent in the latent space Z, i.e.

z^(i+1) = z^(i) + λ ∂(f ∘ g)(z^(i))_t / ∂z    (5)
with step size λ ∈ R⁺. This has the important advantage that the resulting counterfactual will lie on the data manifold. Furthermore, since we consider a diffeomorphism g, and thus in particular a bijective map, no information will be lost by considering the classifier f ∘ g on Z instead of the original classifier f on the data manifold S, i.e. there exists a unique z = g^(-1)(x) ∈ Z for any x ∈ S. We show pseudo code for our approach in Algorithm 1.
As illustrated in Figure 3, gradient ascent in X and Z are well-suited to generate adversarial examples and counterfactuals, respectively.

Algorithm 1 Generating counterfactuals
Require: x, f, g, g^(-1), t, Λ, λ, N
1: z ← g^(-1)(x)
2: for i = 1, . . ., N do
3:    ∇_z ← ∂(f_t ∘ g)/∂z (z)
4:    z ← optimizer.step(λ, ∇_z)
5:    if f_t(g(z)) > Λ then
6:       return g(z)
7:    end if
8: end for
9: return None
Note: x is the input for which we desire to find a counterfactual explanation, f the predictive model, g the generative model, g^(-1) the (approximate) inverse of g, t the target class, Λ the target confidence, λ the learning rate and N the maximum number of update steps.
For regression tasks there is no explicit decision boundary, but we can still follow the same algorithm by directly maximizing (or minimizing) the output r of the regressor f(x) until we reach the desired target regression value.
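As a minimal runnable sketch of Algorithm 1, the following toy example uses a unit circle as data manifold, an angle as latent coordinate, and finite differences instead of automatic differentiation. All names and the specific g, f are hypothetical stand-ins for illustration, not the paper's implementation:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Toy setup: data manifold = unit circle, generator g maps a latent angle z
# to the circle, classifier outputs the probability of the "upper half" class.
def g(z):
    return np.array([np.cos(z[0]), np.sin(z[0])])

def g_inv(x):
    return np.array([np.arctan2(x[1], x[0])])

def f_t(x):  # confidence for the target class t = "upper half plane"
    return sigmoid(5.0 * x[1])

def num_grad(fun, z, eps=1e-5):
    # central finite differences, avoiding any autodiff dependency
    grad = np.zeros_like(z)
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = eps
        grad[i] = (fun(z + dz) - fun(z - dz)) / (2 * eps)
    return grad

def diffeomorphic_counterfactual(x, f, g, g_inv, Lambda=0.99, lr=0.1, N=1000):
    """Algorithm 1: gradient ascent in the latent space Z."""
    z = g_inv(x)
    for _ in range(N):
        if f(g(z)) > Lambda:
            return g(z)
        z = z + lr * num_grad(lambda zz: f(g(zz)), z)
    return None

x = g(np.array([-0.3]))  # original input, classified as "lower half"
x_cf = diffeomorphic_counterfactual(x, f_t, g, g_inv)
print(f_t(x), f_t(x_cf))  # confidence rises above Λ while x_cf stays on the circle
```

Because every iterate is of the form g(z), the result lies exactly on the toy manifold by construction, which is the property the algorithm is designed to guarantee.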

Novel Method 1: Diffeomorphic Counterfactuals
We propose to model the map g by a normalizing flow and will refer to the correspondingly modified data x′ as diffeomorphic counterfactuals in the following. Specifically, a flow g is an invertible neural network which equips, by the change-of-variables theorem, the input space X with a probability density

q(x) = q_Z(g^(-1)(x)) |det ∂g^(-1)/∂x| ,

where q_Z is a simple base density, such as a standard normal density, on the latent space Z. The flow can be trained by maximum likelihood, i.e. by minimizing

KL(p|q) = −E_{x∼p} log q(x) + const ≈ −(1/M) Σ_{i=1}^{M} log q(x_i) + const ,
where x_i ∼ p are samples from the data density p. Since the flow is bijective on the entire input space X, it will, in particular, be bijective on the data manifold S ⊂ X. Furthermore, we will also rigorously show in Section 3.3 that a well-trained flow maps (to very good approximation) only to the data manifold, i.e. g(Z) ≈ S. Therefore, flows guarantee that no information is lost when performing gradient ascent in the latent space Z and also ensure that the resulting counterfactuals lie on the data manifold S. Indeed, the flow can be understood as inducing a certain coordinate change of the input space X which is particularly suited for the generation of counterfactuals.
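The maximum-likelihood objective can be made concrete with a one-dimensional toy flow. The affine map g(z) = a·z + b and the plain gradient-descent loop below are illustrative assumptions, not the flow architecture used in the paper; for this flow the change-of-variables density is q(x) = q_Z((x − b)/a)/a, and minimizing the negative log-likelihood recovers the mean and scale of the data:

```python
import numpy as np

# Toy 1-D affine flow x = g(z) = a*z + b with standard-normal base density q_Z.
# Per-sample NLL: 0.5*((x - b)/a)**2 + log(a) + const.
rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.0, size=4000)  # "data density" p = N(3, 2^2)

a, b = 1.0, 0.0
lr = 0.05
for _ in range(2000):
    u = (x - b) / a
    grad_b = np.mean(-u / a)               # d NLL / d b
    grad_a = np.mean(-u**2 / a + 1.0 / a)  # d NLL / d a
    a -= lr * grad_a
    b -= lr * grad_b

# the trained flow matches p: g(Z) concentrates where the data lies
print(a, b)  # close to (2.0, 3.0)
```

At the optimum b equals the sample mean and a the sample standard deviation, i.e. the pushed-forward density q matches p, which is the 1-D analogue of g(Z) ≈ S.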

Novel Method 2: Approximate Diffeomorphic Counterfactuals
While the method of the last section is very appealing as it comes with strong guarantees, it may be challenging to scale to very high-dimensional data sets. This is because flows have a very large memory footprint on such datasets, as each layer has the same dimensionality as the data space X to ensure bijectivity. We therefore propose an alternative method, called approximate diffeomorphic counterfactuals, which comes with less rigorous theoretical guarantees, but scales better to very high-dimensional data. Specifically, we propose two varieties of approximate diffeomorphic counterfactuals:
Autoencoder-based: the reconstruction loss of an autoencoder (AE), i.e.

L(e, g) = E_{x∼p} ‖g(e(x)) − x‖² ,
with encoder e : X → Z and generator g : Z → X, is minimized if the encoder is the inverse of the generator on the data manifold S, i.e.

g(e(x)) = x for all x ∈ S .
This implies, in particular, that g(Z) = S if dim(Z) = dim(S). As for normalizing flows, the image of the autoencoder is the data manifold if the model has been perfectly trained. However, an autoencoder will only be invertible on the data manifold in this perfect training limit and if the latent space Z has the same dimension as the data manifold S. This is in contrast to normalizing flows, which are invertible on all of X by construction. As a result, the autoencoder will necessarily lead to a loss of information unless the model is perfectly trained and the latent space dimensionality exactly matches the dimension of the data.
GAN-based: Generative Adversarial Networks (GANs) consist of a generator g : Z → X and a discriminator d : X → {0, 1}. Training then proceeds by minimizing a certain minimax loss; see [51] for details. It can be shown that the global minimizer of this loss function ensures that samples of the optimal generator g are distributed according to the data distribution, i.e.

q = p .
We refer to Section 4.1 of [51] for a proof. However, the optimal generator g is not necessarily bijective on the data manifold. This implies that even for a perfectly trained GAN, there may not exist a unique z ∈ Z for a given data sample x ∈ X such that x = g(z). Furthermore, there is no built-in mechanism to obtain the corresponding latent sample z ∈ Z for a given input x ∈ X. This is in contrast to normalizing flows and autoencoders, for which the inverse map g^(-1) : X → Z is known exactly or approximately, respectively. However, there is an extensive literature on GAN inversion; see [52] for a recent review. For a given generator g and data sample x ∈ X, these methods aim to find a latent vector z ∈ Z such that x ≈ g(z). This is often done by minimizing the difference between the activations of an intermediate layer of some auxiliary network h, i.e.

z* = argmin_{z ∈ Z} ‖h(g(z)) − h(x)‖² .
For example, h can be chosen to be an intermediate layer of an Inception network [53] trained on samples from the data density p. Note that these inversion methods do not come with rigorous guarantees, as the optimization objective is non-convex and it is unclear whether the values of the intermediate-layer activations are sufficient to distinguish different inputs.
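A minimal sketch of such feature-matching inversion, assuming (purely for illustration) a linear "generator" g(z) = A z + c and a linear "feature extractor" h(x) = B x standing in for an intermediate layer of an auxiliary network. For this linear toy the objective is convex and plain gradient descent recovers the latent vector exactly:

```python
import numpy as np

# Hypothetical linear generator and feature map (toy stand-ins, not a real GAN)
A = np.array([[1., 0.], [0., 1.], [1., 1.], [1., -1.]])  # generator weights
c = np.array([0.5, -0.2, 0.0, 0.1])                      # generator bias
B = np.array([[1., 0., 1., 0.], [0., 1., 0., 1.], [1., 1., 0., 0.]])

g = lambda z: A @ z + c
h = lambda x: B @ x

z_true = np.array([0.7, -1.3])
x = g(z_true)                    # the sample we want to invert

z = np.zeros(2)
lr = 0.05
for _ in range(2000):
    residual = h(g(z)) - h(x)            # feature-space mismatch
    z -= lr * 2 * (B @ A).T @ residual   # gradient of ||h(g(z)) - h(x)||^2

print(np.linalg.norm(g(z) - x))  # near zero: x ≈ g(z)
```

For a real GAN the objective is non-convex, so this loop would only find a local minimum; the toy merely illustrates the optimization target.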

Theoretical analysis
In this section, we employ tools from differential geometry to show that for well-trained generative models, the gradient ascent update (5) in the latent space Z does indeed stay on the data manifold, as confirmed by the experimental results presented in Section 4. Intuitively, since in (5) we take small steps in Z, where the probability distribution is, for example, a normal with unit variance, we do not leave the region of high probability in the latent space and hence stay in a region of high probability also in X .
We prove this statement for the case of diffeomorphic counterfactuals, i.e. for normalizing flows, and -under stronger assumptions -also for approximate diffeomorphic counterfactuals, i.e. for autoencoders and generative adversarial networks.

Differential Geometry
In this section, we briefly introduce the most fundamental notions of differential geometry used in our discussion below. For a comprehensive textbook, see e.g. [54].
Differential geometry is the study of smooth (hyper-)surfaces. The central notion of this branch of mathematics is that of an n-dimensional (differentiable) manifold M, which is equipped with coordinate functions x^µ : M → R^n (so-called charts, which are assembled into an atlas). These coordinates allow for explicit calculations in R^n, but geometric objects (tensors) are independent of the chosen coordinates and transform in well-defined ways under changes of coordinates. Such a change of coordinates can be interpreted as a differentiable bijection φ : M → M whose inverse is also differentiable, a so-called diffeomorphism.
At each point p ∈ M, we attach an n-dimensional vector space T_pM, the tangent space at p. Coordinates x^µ induce a basis in T_pM and we will denote the components of v ∈ T_pM in this basis by v^µ. Under a diffeomorphism φ, the components of v transform as

v^µ → (∂φ^µ/∂x^ν) v^ν .

To capture the notion of distance (and curvature) on M, a metric tensor γ(p) : T_pM × T_pM → R is introduced. The metric defines a canonical isomorphism between T_pM and its dual space T*_pM. Following the usual convention in the general relativity literature, we use lower indices v_µ to denote the components of the dual vector γ(p)(v, ·). This implies that contraction with the metric is used to raise and lower indices,

v_µ = γ_µν v^ν ,   v^µ = γ^µν v_ν ,

where we used the Einstein summation convention to sum over repeated upper and lower indices and introduced γ^µν for the inverse of γ_µν, the components of γ in the basis induced by the coordinates x^µ.
Given a metric, it is natural to consider shortest paths between points on M. The corresponding curves are called geodesics. If the length of the tangent vector of a geodesic σ is constant (as measured by the metric) along σ, the geodesic is affinely parametrized. Importantly, the notion of an affinely parametrized geodesic is coordinate independent and can therefore itself be used to construct coordinates on M, as we will see below.

Mathematical Setup
In order to analyze the gradient ascent (5) in the latent space Z, we define in this section the necessary manifolds and coordinates.
As above, let X be an N_X-dimensional manifold which is the input space of the classifier f : X → R^C with C classes. An implementation of the classifier corresponds to a function on R^(N_X), and we denote the coordinates on X in which our classifier is given by x^α. These coordinates could, e.g., be suitably normalized pixel values. We furthermore use an N_Z-dimensional manifold Z as the latent space for our generative model g : Z → X. For GANs and AEs, we typically have N_Z < N_X, and for normalizing flows N_Z = N_X. In the latter case, we moreover have X = Z and g bijective with differentiable inverse, implying that g is a diffeomorphism. Similarly to the classifier, the generative model is implemented in specific coordinates on Z, which we denote by z^a.
We equip Z with a flat Euclidean metric δ_ab. Then, the generative model g induces an inverse metric γ^αβ on g(Z) by

γ^αβ = (∂g^α/∂z^a)(∂g^β/∂z^b) δ^ab ;

in the case of N_Z < N_X, γ is singular. This metric is the crucial new ingredient when performing the gradient ascent update in the latent space (5), as opposed to in the input space (4), as the following calculation shows. One step of gradient ascent in the latent space Z is given by the image under g of the update step (5). In x^α coordinates and to linear order in the learning rate λ, it is given by

x^(i+1)α = x^(i)α + λ γ^αβ ∂f(x^(i))_t/∂x^β + O(λ²) .    (15)

If we start from the same point, x^(i) = g(z^(i)), the difference between gradient ascent in latent space (5) and input space (4) is thus just given by the contraction of the gradient of f with respect to x with the inverse induced metric γ^αβ. Hence, in order to understand why the prescription (5) stays on the data manifold, we will in the following investigate the properties of γ for the case of well-trained generative models.
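The induced inverse metric can be computed numerically from the Jacobian of the generator. The following sketch uses a hypothetical one-dimensional helix generator and finite differences, so γ⁻¹ = J Jᵀ is a rank-one, singular 3×3 matrix exactly as described above for N_Z < N_X:

```python
import numpy as np

# Toy generator g embedding a 1-D latent space into R^3 (a helix); the
# helix parametrization is an illustrative assumption, not the paper's model.
def g(z):
    return np.array([np.cos(z[0]), np.sin(z[0]), 0.1 * z[0]])

def jacobian(g, z, eps=1e-6):
    # finite-difference Jacobian dg^alpha / dz^a, shape (N_X, N_Z)
    cols = []
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((g(z + dz) - g(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

z0 = np.array([0.4])
J = jacobian(g, z0)
gamma_inv = J @ J.T  # induced inverse metric on g(Z)

# N_Z < N_X, so gamma^{-1} is singular: rank N_Z = 1 with N_X - N_Z zero eigenvalues
eigvals = np.sort(np.linalg.eigvalsh(gamma_inv))[::-1]
print(eigvals)  # one positive eigenvalue, two (numerically) zero
```

Contracting a classifier gradient with this matrix, as in (15), annihilates the components in the two null directions, which is the mechanism by which the latent-space update suppresses off-manifold steps.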
Before returning to γ, we will first discuss the structure of the data. The probability density of the data on X is denoted by p : X → R and the probability density induced by g is denoted by q : X → R. For q in x^α coordinates, we use the notation q_x : R^(N_X) → R. The data is characterized by S = supp(p) ⊂ X, which becomes S_x ⊂ R^(N_X) in x^α coordinates. We will assume that the data lives approximately on a submanifold D ⊂ S of X with dimension N_D ≪ N_X. In relation to the dimension of our generative model, we assume that N_D ≤ N_Z ≤ N_X. As a subset of X and in x^α coordinates, D will be denoted by D_x ⊂ R^(N_X). To capture that the data does not extend far beyond D, we assume that S has Euclidean extension δ ≪ 1 normal to D in x^α coordinates, i.e.

sup_{x ∈ S_x} min_{x̃ ∈ D_x} ‖x − x̃‖₂ ≤ δ/2 .    (16)

Next, we will define coordinates in a neighborhood of D which separate the directions tangential and normal to D, as illustrated in Figure 4. Our construction is similar to the constructions of Riemannian and Gaussian normal coordinates, adapted for a submanifold of codimension larger than one. First, we choose coordinates y on D and, for each p ∈ D, a basis {n_i} for the normal space T_pD^⊥ to D at p. Following the usual construction of Riemannian normal coordinates, we assign coordinates to a point q in some neighborhood of D by constructing an affinely parametrized geodesic σ : [0, 1] → X which satisfies σ(0) = p and σ(1) = q and which has tangent vector σ′(0) ∈ T_pD^⊥. The coordinates of q are then (y, y_⊥), where the i-th component of y_⊥ is given by the i-th component of σ′(0) in the basis {n_i}. In a sufficiently small neighborhood around D, we can find a unique base point p ∈ D and geodesic σ for every q.
One important aspect of this construction is that by rescaling the basis vectors {n_i}, we can rescale the components of σ′(0). This means we can rescale the y_⊥ coordinates arbitrarily, and we use this freedom to bound S in y coordinates by the same δ that appeared in (16),

|y_⊥^i| ≤ δ/2 on S for all i .    (17)

Furthermore, in g(Z), we can choose the basis {n_i} orthogonal with respect to the (singular) metric γ and obtain in some neighborhood of D ∩ g(Z) the block-diagonal form

γ = diag(γ_D, γ_⊥1, . . ., γ_⊥(N_X−N_D)) .    (18)

Note that this form of the metric together with (17) implies in particular that S takes the product form mentioned in (1). In the following, we will show that for well-trained generative networks and thin data distributions (i.e. for small δ), γ⁻¹_⊥i → 0. To understand the consequences for the gradient ascent update step, consider (15) in y^µ coordinates,

y^(i+1) = y^(i) + λ γ_D⁻¹ ∂f_t/∂y + λ Σ_i γ⁻¹_⊥i ∂f_t/∂y_⊥^i + O(λ²) .    (19)

For γ⁻¹_⊥i → 0 and ∂x/∂y_⊥ bounded, the second term vanishes and we arrive at

y^(i+1) = y^(i) + λ γ_D⁻¹ ∂f_t/∂y + O(λ²) ,    (20)

and hence the orthogonal directions in the update step (15), leading away from the data manifold D, are suppressed. Therefore, (15) produces counterfactuals instead of adversarial examples.

Diffeomorphic Counterfactuals
In this section, we show that for well-trained normalizing flows, the orthogonal components of the inverse metric γ⁻¹_⊥i vanish for thin data manifolds, as formalized in the following theorem.

Theorem 1. For ε ∈ (0, 1) and g a normalizing flow with Kullback-Leibler divergence KL(p, q) < ε, it holds that γ⁻¹_⊥i → 0 as δ → 0.

The main argument of the formal proof given in Appendix A.1 proceeds as follows: First, we show that a small Kullback-Leibler divergence implies that most of the induced probability mass lies in the support of the data distribution,

∫_{S_x} q_x(x) dx ≥ 1 − ε̃ ,    (21)

where ε̃ is small for small ε. Next, we write q_x as the pull-back of the latent distribution q_z under the flow g using the familiar change-of-variables formula for normalizing flows. In the y^µ coordinates introduced above, the resulting integral then factorizes according to the block-diagonal structure (18) of the metric, with integration domain [−δ/2, δ/2] for the y_⊥^i directions. As δ → 0, the bound (21) can only remain satisfied if the associated metric component γ_⊥i diverges, implying that γ⁻¹_⊥i → 0. Following the steps at the end of Section 3.2, we see that this necessarily implies that the gradient ascent update (5) stays on the data manifold, since ∂x/∂y_⊥ is constant (and therefore bounded) as δ → 0.

Approximate Diffeomorphic Counterfactuals
We now present a theorem similar to Theorem 1 for the case of approximate diffeomorphic counterfactuals, i.e. for AEs and GANs, showing that these models can also be used to construct counterfactuals. This, however, necessitates stronger assumptions, since the generative model is in this case not bijective. In particular, we will assume that the generative model captures all of the data, i.e. that D ⊂ g(Z), implying that in y coordinates, although γ is singular for N_Z < N_X, the component γ_D is non-singular. Therefore, we split the y_⊥^i directions into N_X − N_Z singular directions and N_Z − N_D non-singular directions. Since the inverse metric vanishes by definition in the singular directions, the theorem focuses on the non-singular directions and can then be stated as follows.

Theorem 2. If g : Z → X is a generative model with D ⊂ g(Z) and image g(Z) which extends in any non-singular orthogonal direction y_⊥^i outside of D, then γ⁻¹_⊥i → 0 for δ → 0 for all non-singular orthogonal directions y_⊥^i.

The proof can be found in Appendix A.2 and proceeds as follows: First, we construct a curve τ : [0, 1] → Z which cuts through S along the y_⊥^i-coordinate line and lies completely in g(Z), as illustrated in Figure 5. Then, the length L(τ) of this curve (with respect to γ) computed in y^µ coordinates is, for small δ, approximately given by

L(τ) ≈ √(γ_⊥i) δ

inside S. Bounding the coordinate extent by δ and using that L(τ) stays constant then forces γ_⊥i to diverge as δ → 0, which yields the desired result. As in the case of Theorem 1 above, this implies again that the gradient ascent update (5) does not leave the data manifold, as shown in (20).

Experiments
Equipped with our theoretical results, we are now ready to present our experimental findings. We start by illustrating diffeomorphic counterfactuals using a toy example in three-dimensional space. This allows us to directly visualize the data manifold and the trajectories of gradient ascent in X and Z.
We then apply our diffeomorphic counterfactual method, using normalizing flows, to four different image data sets. We use MNIST, CelebA and CheXpert for classification tasks and the Mall data set for a regression task. We evaluate the results qualitatively and quantitatively. Furthermore, we discuss approximate diffeomorphic counterfactuals, using VAEs and GANs, which allow us to consider high-resolution data.
For all experiments, we use the same setup: We require a pretrained generator g and a pretrained classifier f. We start with a data point x from the test set that is predicted by the classifier f as belonging to the source class. We define a target class t and target confidence Λ. To produce an adversarial example, we then update the original data point following the gradient in X, ∂f_t(x)/∂x, until we reach the desired target confidence. To produce a counterfactual, we first project the original data point into the latent space of the generative model g by applying the inverse generative model g^(-1)(x) = z, or an appropriate approximation (for GANs).

Figure 6: Gradient ascent in X leads to points that lie significantly off-manifold while gradient ascent in Z moves along the data manifold. The ground truth for different classes is depicted in orange (source class) and gray (target class).

We then update the original latent representation
z following the gradient in Z, ∂(f_t ∘ g)(z)/∂z, until we reach the desired target confidence.
For more details on model configuration, training and hyperparameters we refer to Appendix B.

Toy example
We consider data uniformly distributed on a one-dimensional manifold, a helix, that is embedded in three-dimensional space and train a simple normalizing flow that approximates the data distribution. As illustrated in Figure 6, we divide the data into two classes corresponding to the upper and the lower half of the helix and train a classifier.
We then generate counterfactuals by gradient ascent optimization in the input space X and in the latent space of the flow Z, i.e. by using (4) and (5), respectively.
Starting from the original data point x, we observe that gradient ascent in X leads to points that lie significantly off the data manifold S. In contrast, the updates of gradient ascent in the latent space Z follow a trajectory along the data manifold, resulting in counterfactuals with the desired target classification which lie on the data manifold. We illustrate this in Figure 6.
As we have an analytic description of the data manifold, we can reliably calculate the distances to the data manifold for all points found via gradient ascent in X or Z. We compare 1000 successful optimizations (all optimizations reach the desired target confidence) in the input space X and the latent space Z. The median distance to the data manifold is 2.34 when performing gradient ascent in X and 0.01 when performing it in Z (see also Figure 16 in the appendix). This clearly illustrates the benefit of performing gradient ascent in the latent space Z.
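The distance-to-manifold evaluation can be sketched as follows. The specific helix parametrization h(t) = (cos t, sin t, 0.2 t) and the brute-force minimization over a dense parameter grid are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

# Hypothetical helix parametrization of the 1-D data manifold in R^3
def helix(t):
    return np.stack([np.cos(t), np.sin(t), 0.2 * t], axis=-1)

def dist_to_helix(p, t_min=-20.0, t_max=20.0, n=200001):
    # brute-force distance: minimize over a dense grid of curve parameters
    t = np.linspace(t_min, t_max, n)
    return np.min(np.linalg.norm(helix(t) - p, axis=-1))

on_manifold = helix(np.array(1.3)) + 1e-3   # tiny perturbation, like a counterfactual
off_manifold = np.array([0.0, 0.0, 0.26])   # on the helix axis, like an adversarial example

print(dist_to_helix(on_manifold))   # small (~1e-3)
print(dist_to_helix(off_manifold))  # order 1
```

For the actual experiment the same idea applies: points produced in Z land within noise level of the curve, while points produced in X end up a finite distance away from it.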

Tangent space of data manifold
A non-trivial consequence of our theoretical insights is that we can infer the tangent space at each point on the data manifold from our flow g. Specifically, we perform a singular value decomposition of the Jacobian, ∂g/∂z = U Σ Vᵀ, and rewrite the inverse induced metric as

γ⁻¹ = (∂g/∂z)(∂g/∂z)ᵀ = U Σ Σᵀ Uᵀ .

As we saw in Section 3, for data concentrated on an N_D-dimensional data manifold D in an N_X-dimensional embedding space X, the inverse induced metric γ⁻¹ has N_X − N_D small eigenvalues. Furthermore, the eigenvectors corresponding to the large eigenvalues will approximately span the tangent space of the data manifold. For our toy example from Section 4.1.1, we can directly show the parallelepiped spanned by the three eigenvectors in three-dimensional space. Figure 7 (left) indeed shows that the parallelepipeds are significantly contracted in two of the three dimensions, making them appear as one-dimensional lines. For the high-dimensional image data sets, which are discussed in Section 4.1.3, we show the sorted eigenvalues, averaged over 100 random data points per data set, cf. Figure 7 (right). Our experiments confirm the theoretical expectation that the eigenvectors with large eigenvalues indeed span the tangent space of the manifold.
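The SVD-based tangent-space extraction can be sketched numerically. The toy generator below, a gently curved sheet in R³, is a hypothetical stand-in for the trained flow; the left singular vectors with non-negligible singular values span the tangent plane, and the remaining direction is the manifold normal:

```python
import numpy as np

# Toy generator g embedding a 2-D latent space into R^3 (illustrative assumption)
def g(z):
    return np.array([z[0], z[1], 0.1 * np.sin(z[0])])

def jacobian(g, z, eps=1e-6):
    # finite-difference Jacobian dg/dz, shape (N_X, N_Z)
    cols = []
    for i in range(len(z)):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((g(z + dz) - g(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

z0 = np.array([0.3, -0.5])
J = jacobian(g, z0)          # shape (3, 2)
U, S, Vt = np.linalg.svd(J)  # J = U diag(S) V^T

# The first two columns of U (non-zero singular values) span the tangent plane;
# their cross product is therefore the normal direction of the sheet.
normal = np.cross(U[:, 0], U[:, 1])
analytic_normal = np.array([-0.1 * np.cos(z0[0]), 0.0, 1.0])
analytic_normal /= np.linalg.norm(analytic_normal)
print(abs(normal @ analytic_normal))  # ≈ 1: the recovered and analytic normals agree
```

This mirrors the paper's construction: the eigenvectors of γ⁻¹ = U Σ Σᵀ Uᵀ with large eigenvalues are exactly these tangent directions.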

Image classification and regression
We now demonstrate applications of diffeomorphic counterfactuals to image classification and regression in several domains.

Classifiers
For the Mall data set, we train a U-Net [62] that outputs a probability map of the size of the image and a scalar regression value, which corresponds to the approximate number of pedestrians in the picture.
Following the definitions by Ribera et al. [63], our trained U-Net reaches an RMSE for the head count of 0.63. When we run our gradient ascent algorithm, we aim to maximize or minimize only the scalar regression value, i.e. the number of pedestrians: we maximize the regression value r (threshold at r = 10) if few pedestrians were detected in the original image x and minimize the regression value (threshold at r = 0.01) if many pedestrians were detected in the original image x. We use Adam to optimize in X or Z until the respective threshold is reached.

Qualitative analysis: Our diffeomorphic counterfactuals produced by the normalizing flows indeed show semantically meaningful deformations, in particular when compared to adversarial examples produced by gradient ascent in the data space X.
We show examples in Figure 8. The counterfactuals resemble images from the data set that have the target class as the ground truth label. At the same time, the counterfactuals are similar to their respective source images with respect to features that are irrelevant for the differentiation between source and target class.
For MNIST, the stroke width and the writing angle remain unchanged in the counterfactuals, while the gap in the upper part of the 'four' changes to the characteristic upper loop of the 'nine'.
For CelebA, the changes in the counterfactuals are focused on the hair area, as evident from the heatmaps. Facial features and background stay (approximately) constant.
The counterfactuals for the CheXpert data set mostly brighten the pixels in the central region of the picture, leading to the appearance of an enlarged heart. The other structures in the image remain mostly constant.
Also for pictures taken from the Mall data set, we observe that the counterfactuals remain close to the original images. When maximizing the regression value, pedestrians are generated at the picture's edge or appear around darker areas in the original image. When minimizing pedestrians, we observe that the counterfactuals reproduce the darker parts of the floor and lines between the tiles.

Quantitative analysis: To quantitatively assess the quality of our counterfactuals, we use several measures, as detailed in the following.
Oracle: We train a 10-class SVM on MNIST (test accuracy of 92%) and binary SVMs on CelebA (test accuracy of 85%) and CheXpert (test accuracy of 70%). The counterfactuals found by performing gradient ascent in the base space of the flow generalize significantly better to these simple models, suggesting that they indeed use semantically more relevant deformations than conventional adversarial examples produced by gradient ascent in X space.
For the Mall data set, we train a slightly larger U-Net (RMSE for the head count of 0.72) and calculate regression values for the original images, the images modified with gradient ascent in X-space and the images modified with gradient ascent in Z-space. As expected, the regression values for the counterfactuals are significantly closer to the target values (0.01 for minimizing pedestrians and 10 for maximizing pedestrians) than those of original images and adversarial examples. Figure 9 summarizes these findings.
In Figure 10, we show the localization of heads for the counterfactuals and the adversarial examples for the Mall data set from Figure 8 using the original and the oracle U-Net. In order to find the head locations, the regression value is rounded to the closest integer representing the number of pedestrians in the image. A Gaussian mixture model with the number of pedestrians as components is then fitted to the probability map. Finally, the head positions are defined as the means of the fitted Gaussians. The original U-Net is deceived by the adversarial examples: when maximizing pedestrians (second row), the original U-Net produces false positives, leading to markers at head locations where there are no pedestrians. When minimizing pedestrians, the adversarial examples (fourth row) fool the original U-Net into making false negative errors, that is, failing to detect pedestrians although they are clearly present. The oracle U-Net, on the other hand, produces regression values and probability maps that enable correct identification of pedestrians' head positions (or lack thereof) for the adversarial examples when maximizing (second row) and minimizing (fourth row) pedestrians.
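The head-localization procedure can be sketched with scikit-learn's Gaussian mixture model; sampling pixel coordinates proportionally to the probability map is an illustrative way to fit the mixture, not necessarily the exact implementation used:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def locate_heads(prob_map, regression_value, n_samples=2000, seed=0):
    # round the regression value to get the number of pedestrians k
    k = int(round(regression_value))
    if k == 0:
        return np.empty((0, 2))
    # sample pixel coordinates proportionally to the probability map
    ys, xs = np.nonzero(prob_map > 0)
    p = prob_map[ys, xs].astype(float)
    p /= p.sum()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(p), size=n_samples, p=p)
    pts = np.stack([ys[idx], xs[idx]], axis=1).astype(float)
    # fit a k-component Gaussian mixture; its means are the head positions
    gmm = GaussianMixture(n_components=k, random_state=seed).fit(pts)
    return gmm.means_  # shape (k, 2), positions as (row, col)
```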
For the diffeomorphic counterfactuals (first and third row in Figure 10), the predictions of the two U-Nets are similar, showing that these counterfactuals generalize to the independently trained oracle U-Net.
Nearest prototypes: The first interpretability metric IM1 is defined by

IM1 = ||x' − AE_t(x')||²₂ / (||x' − AE_{c₀}(x')||²₂ + ε) ,   (24)

where x' is the modified image, AE_{c₀} and AE_t are two autoencoders which were each trained on data from only one class (original class c₀ and target class t, respectively), and ε is a small positive value.
The second metric IM2 is defined by

IM2 = ||AE_t(x') − AE(x')||²₂ / ||x'||₁ ,   (25)

where AE is an autoencoder trained on all classes. IM1 and especially IM2 have been repeatedly criticized [67][68][69]. For IM2, one divides by the one-norm ||x'||₁ of the modified image. This value is large if the image has more bright pixels. Consequently, images with brighter pixels will tend to have a smaller IM2, even though they might not be more interpretable. We therefore limit our evaluation to IM1; the results are reported in Table 1.
Distances: Counterfactuals are not required to be close to their source images in the Euclidean metric of X, since the relevant distance is to be computed with the induced metric on the data manifold S or, equivalently, the flat metric in the latent space Z. However, our counterfactuals still preserve high similarity to the respective source image. We confirm this by calculating the Euclidean distances in X and Z between counterfactuals and all images of the source class (this effect is illustrated for the CelebA dataset in Figure 12). The average Euclidean norm between counterfactuals and the respective source images is significantly lower than the average Euclidean norm between counterfactuals and all images of the source class. For adversarial examples, we expect the Euclidean distances in X to the respective source image to be very small, while the Euclidean distances in Z should be larger. Figure 12 shows the distribution of distances in X and Z between counterfactuals/adversarials and their respective source images as well as distances between counterfactuals/adversarials and all images of the source class for the CelebA data set.
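The IM1 metric can be computed as follows; the autoencoders are assumed to be callables mapping an image to its reconstruction:

```python
import numpy as np

def im1(x_cf, ae_target, ae_source, eps=1e-8):
    # reconstruction error under the target-class autoencoder ...
    num = np.sum((x_cf - ae_target(x_cf)) ** 2)
    # ... relative to the error under the source-class autoencoder
    den = np.sum((x_cf - ae_source(x_cf)) ** 2) + eps
    return num / den  # lower values: closer to the target-class distribution
```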
We refer to the Appendix B.7 for graphs for the other data sets.
In Table 2 and Table 3, we show the averaged Euclidean norms of the distances in X and Z for counterfactuals and adversarials, respectively, confirming our expectations.

Approximate Diffeomorphic Counterfactuals
In this section, we present our experimental analysis of approximate diffeomorphic counterfactuals, for which we use variational autoencoders (VAEs) [70] and generative adversarial networks (GANs) [51]. As explained in Section 2.4, an important downside of approximate diffeomorphic counterfactuals is that the latent representation z of the original image x is generically lossy, i.e. g(z) ≠ x, since the diffeomorphism is only approximate and not exact. On the other hand, an advantage of this approximation is that the method can be scaled to data of very high dimensionality.
Both of these statements will be demonstrated experimentally in the following. We use a simple convolutional VAE (cVAE) on the MNIST data set. Results are shown in Figure 13 in the leftmost block. The encoded and decoded image x (second column of the block) appears slightly fuzzy but approximately reproduces the original image. Approximate diffeomorphic counterfactuals, found by gradient ascent in the latent space of the VAE, replicate features irrelevant for classification, such as stroke width and writing angle, while structurally modifying the image so that it resembles an image of the digit nine.
As discussed in Section 2.4, GANs generically do not require an encoder during the training process, and we apply GAN-inversion methods to find an encoding of the source image. Specifically, for these relatively low-dimensional data sets, we find the latent representation z by minimising the Euclidean norm between the decoded latent representation g(z) and the original image x. We apply our method to a simple convolutional GAN (dcGAN) for MNIST and a progressive GAN (pGAN) [71] for CelebA. Results are shown in Figure 13 in the middle and right blocks.
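The optimization-based GAN inversion can be sketched as follows; `generator` is an assumed callable z → image, and real inversion pipelines (e.g. for StyleGAN) are typically more elaborate:

```python
import torch

def invert_gan(x, generator, z_dim, lr=0.02, steps=2000, seed=0):
    torch.manual_seed(seed)
    # start from a random latent vector and optimize it
    z = torch.randn(z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        # minimise the squared Euclidean norm between g(z) and the original x
        loss = torch.sum((generator(z) - x) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```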
The dcGAN on MNIST produces some random pixel artifacts, but the generated images are sharper than those produced by the cVAE.
For the CelebA images generated with the pGAN, we see that the decoded optimized latent representation of the original image deviates slightly from the original. This is especially visible if the composition is not typical (the arm is not properly reproduced in the first row) or the background is highly structured (second row). For the approximate diffeomorphic counterfactuals, we observe even larger changes in the background. This may be attributed to the imperfect inversion process and the quality of the pGAN, i.e. the fact that the diffeomorphism is only approximate and not exact.
To demonstrate that approximate diffeomorphic explanations can scale to very high-dimensional data, we use a pretrained StyleGAN [72,73] for images of resolution 1024 × 1024 from the CelebA-HQ data set [71]. To find the initial latent representation, we use the HyperStyle [74] GAN-inversion technique. In order to use the same classifier as before, we downscale the images to 64 × 64 resolution before using them as input to the classifier. As demonstrated by Figure 14, approximate diffeomorphic counterfactuals lead to semantically meaningful and interpretable results even on these very high-dimensional data sets.

Related work
In this section, we compare our approach with existing methods.
This work builds upon and substantially extends our workshop contribution [75]. Specifically, we introduce both diffeomorphic and approximate diffeomorphic counterfactuals and discuss their relationship. To this end, we theoretically analyze this broader class of generative models in a unified manner, which allows us to compare the relative strengths and weaknesses of these approaches. In addition, we provide an expanded discussion of the rigorous construction of the different coordinate systems. In our experimental studies, we consider additional generative models, specifically GANs and AEs, to find approximately diffeomorphic explanations for different data sets, demonstrating that our method scales to high-resolution data sets. Furthermore, we include experiments for additional data sets, such as CelebA-HQ and Mall, as well as further tasks, i.e. regression in addition to classification, and neural network architectures. We also consider a variety of quantitative evaluations of counterfactuals to evaluate the performance of the proposed methods.

Counterfactuals with generative models
A comparatively small number of publications consider normalizing flows, which started to gain attention relatively recently, in the context of generating counterfactuals. Hvilshøj et al. [76] argue that a simple linear interpolation along the vector defined by the difference between two class centres in the base space of a flow can be interpreted as a counterfactual. In contrast to our method, their approach does not involve the classifier and thus does not guarantee that the counterfactual is classified as the target class. In addition, their approach requires access to labelled training data, as the class-specific centres in the base space have to be computed.
Sixt et al. [77] train a linear binary classifier directly in the base space of the flow. Adding the weight vector corresponding to the target class to the base space representation and projecting back to image space then produces a counterfactual with semantically changed features. Unlike our method, this approach requires training of a classifier and does not work for a general classifier of arbitrary architecture. In addition, their approach relies on the classifier being linear and binary, so that the direction in which the image is modified can be determined analytically. Our method is more modular in the sense that the classifier can be pretrained and is independent of the generative model. Furthermore, we allow for the classifier to be non-binary as well as non-linear, and we also apply our framework to regression.
Our approach is closest in spirit to the one taken by Joshi et al. [78], who introduce an algorithm that performs gradient ascent in the latent space of a generative model (they mention VAEs and GANs), while minimizing the difference between the original and the modified data point. Their application concentrates on recourse for tabular data. The examples shown for image data are limited to a VAE, which results in relatively low-quality counterfactuals, and lack quantitative evaluation. Our approach focuses on image data and embeds their method into our more general framework with a detailed theoretical foundation. Furthermore, we propose to use normalizing flows as generative models, for which the diffeomorphism is exact. We theoretically discuss the relation between diffeomorphic and approximately diffeomorphic methods and conduct extensive experiments for various tasks, data sets, and classifier architectures. To reliably evaluate the resulting counterfactuals, we propose a number of approaches, specifically emphasising quantitative evaluation.
Other works also use autoencoders to generate counterfactuals.
Dhurandhar et al. [79] use elastic net regularization to keep the perturbation δ to the original data small and sparse. Furthermore, they use an autoencoder to minimize the reconstruction loss of the modified image and thus make sure the counterfactual lies on the data manifold. This approach was expanded by adding a prototype loss [66] that guides the counterfactuals towards the closest prototype that does not represent the source class and thus speeds up optimization and produces more interpretable counterfactuals. The prototypes are defined in the latent space of an autoencoder. Both approaches apply their algorithm to tabular data and MNIST. Our approach differs from these works as we do not use generative models as a regularizer but directly modify the latent space representation. Our method also does not require access to labelled training data in order to compute prototypes, is applicable to high-dimensional image data sets, and has no hyperparameters weighting different loss components.
Kim et al. [80] specifically train a Disentangled Causal Effect Variational Autoencoder (DCEVAE) and then generate counterfactuals conditioned on the original image and the label they aim to change. Our method, on the other hand, produces classifier-dependent counterfactuals and thus has the potential to give insight into the decision processes of different classification models. Another advantage of our approach is that, in contrast to [80], the generative models for our method are not required to be trained with labeled training data.
A number of references use GANs to generate counterfactuals.
Zhao et al. [81] perturb the latent representation of a Wasserstein GAN using exhaustive search or continuous relaxation until they achieve a desired target classification. To get to the initial latent representation, they train an inverter. Zhao et al. refer to the resulting manipulated data points as natural adversarial examples, which we understand as effectively counterfactuals since they share characteristics like lying on the data manifold and having meaningful semantic deviations from the original. Although their counterfactuals in latent space are restricted to be close to the initial latent representation, the resulting images deviate significantly from the original images for high-dimensional data (LSUN), which makes it hard to identify the features most relevant for the decision process. This could be due to inaccurate inversion or the randomized search process. Our diffeomorphic counterfactuals can circumvent the problem of inaccurate inversion by using normalizing flows. In addition, we use gradient ascent, which has the advantage of potentially faster retrieval of counterfactuals and helps to guide changes to only the relevant features.
Chang et al. [82] use a conditional GAN for infilling regions that were previously removed from the original image. Their proposed algorithm aims to find an infilling mask which maximizes or minimizes the classification confidence while penalizing the size of the region that is replaced in the original. Compared to this, our approach has the advantage of not only directly influencing which pixels are changed but also how they are changed. We do not penalize the size of the region that changes, as counterfactuals for different classes might have large differences in the size of relevant features (for example, a hair color change will require more changed pixels than a lipstick color change). Samangouei et al. [83] propose to use classifier-prediction-specific encoders together with a GAN which are trained to generate a reconstruction, a counterfactual, and a mask indicating which pixels should change between the counterfactual and the original. A similar approach is proposed by Singla et al. [84], who also train a GAN, conditioned on the classifier predictions, jointly with an encoder to produce realistic-looking counterfactuals. In both works, the classifier is incorporated in the training process of the GAN. After the training, the GAN generates counterfactuals without querying the classifier. As a consequence, information about the classifier is integrated into the GAN purely during training, while our approach can be applied to independently trained models, which allows us to find counterfactuals for different classifiers using the same generative model.
Goetschalckx et al. [85] learn directions in the latent space by differentiating through a classifier and the generator, so that cognitive properties of generated images, such as memorability, can be modified by moving in those directions. They do not specifically aim to produce counterfactuals, but their approach touches on related concepts. A difference to our work is that the latent representation is restricted to be modified along a single direction, while for our method the direction of change is dictated by the gradient over several update steps.
Lang et al. [86] train a StyleGAN with a classifier-specific style space. The image can then be manipulated along the learned style coordinates. Our approach does not require training of additional models but can be applied to existing generators and classifiers. In contrast to our approach, their approach does not use a gradient-based method to find directions that are sensitive to the classifier output. Instead, coordinate directions in style space that correspond to specific classes are detected by testing how changing a coordinate affects the classification of a number of samples.
Lius et al. [87] use a GAN specifically trained for editing, which they condition on the original query image and the desired attributes. They apply gradient descent to find attributes that cause the GAN to generate an image that the classifier predicts as the target class, while at the same time enforcing the image to be close to the original. In contrast to this approach, we directly modify the latent representation, rendering our approach independent of the exact structure of the generative model. We also observe that our counterfactuals stay close to the original image without explicitly enforcing similarity.
Shen et al. [88] train a linear SVM in the latent space of a GAN using generated images, for which the facial attributes are determined after projection to the image space using a pretrained classifier. They can then modify the latent representation of an inverted image linearly along the normal directions of the learned SVMs. In contrast to our approach, their method requires sampling and training. They do not use gradients and can only linearly modify the latent code. The classifier is only indirectly used for labeling the generated samples prior to training the SVM.

Quantitative Metrics for Counterfactuals
Many of the above works are limited to a qualitative assessment of counterfactuals. The quantitative assessment of counterfactuals is still an active research area; a summary of quantitative measures can be found in [67].
The most reliable evaluation of counterfactuals can be obtained by a large study that lets human agents evaluate the generated counterfactuals. As those are relatively costly to conduct and can introduce unexpected biases if not designed carefully, their application is often infeasible. Nevertheless, a few works [77,81,86,89] undertake small user studies (9 ≤ N ≤ 60) on a relatively limited set of generated counterfactuals. We aim to approximate an independent human evaluation by testing our counterfactuals on newly trained models that serve as oracles.
Some works [80,84,90] apply a metric commonly used for generative model evaluation, the Fréchet Inception Distance (FID) score [91], measuring the quality of the generated explanation compared to samples from the data set. As we find counterfactuals by moving directly in latent space, the FID for our counterfactuals would be very similar to the FID of the generative model itself. We therefore do not consider the FID to be a meaningful metric for (approximate) diffeomorphic counterfactuals.
Van Looveren and Klaise [66] propose two metrics to test interpretability: IM1 (defined in (24) above) uses two autoencoders which were each trained on data from only one class and computes the relative reconstruction error of the counterfactual. The second metric IM2 (defined in (25) above) calculates the normalized difference between a reconstruction of an autoencoder trained on the target class and an autoencoder trained on all classes. Van Looveren and Klaise use these two metrics to compare how different loss functions affect the relative interpretability measured by IM1 and IM2 for the MNIST data set. We limit the quantitative evaluation of our counterfactuals to IM1, since IM2 has been subject to controversies.
Other works [80,83] check substitutability. They train classifiers on a training set consisting of generated counterfactuals and compare their performance on the original test data set to a classifier trained on the original training set.
As we can generate counterfactuals that are classified with very different confidence, this method may not be useful as results may be highly dependent on the choice of confidence.
Other methods aim to evaluate explanations by replacing pixel values or entire regions based on the importance of features in the explanation [27,83,92,93,94] and testing the performance of a classifier on the modified images. Those methods may suffer from creating images that lie off the data manifold, so that a thorough comparison may require extensive retraining [95].

Conclusion
In this work, we proposed theoretically rigorous yet practical methods to generate counterfactuals for both classification and regression tasks, namely exact and approximate diffeomorphic counterfactuals. The exact diffeomorphic counterfactuals are obtained by following gradient ascent in the base space of a normalizing flow, while approximate diffeomorphic counterfactuals are obtained with the help of either generative adversarial networks or variational autoencoders. Our thorough theoretical analysis, using Riemannian differential geometry, shows that for well-trained models our counterfactuals necessarily stay on the data manifold during the search process and consequently exhibit semantic features corresponding to the target class. Approximate diffeomorphic counterfactuals come with the risk of information loss but allow excellent scalability to higher-dimensional data. Our theoretical findings are backed by experiments which both quantitatively and qualitatively demonstrate the performance of our method on different classification as well as regression tasks and for numerous data sets.
The application of our counterfactual explanation method is straightforward and requires no retraining, so it can be readily applied to investigate common problems in deep learning, such as identifying biases in classifiers or training data, or scrutinizing falsely classified examples, all common tasks for applications in computer vision and beyond.
For future work, we intend to investigate the benefit of our counterfactuals in the sciences, in particular for medical applications such as digital pathology [11] or brain-computer interfaces [96].
Furthermore, the method presented in this work is not restricted to the generation of counterfactuals for image data or in computer vision. In particular, one could imagine applications, e.g., in chemistry and physics, where the technique proposed here may be used to optimize desired properties of stable molecules which are restricted to minima of the associated potential energy surface [14,97,98,99].
In practical applications, it is often beneficial to incorporate symmetries as an inductive bias, following a well-established paradigm in machine learning [100,101]. It is straightforward to incorporate symmetries into our method by employing equivariant normalizing flows as constructed, e.g., in [102].
In conclusion, our method is applicable to a broad range of computer vision problems and beyond as it provides a way to optimize the output of a predictive model on the data manifold given only indirectly by a trained generative model.
Since S has, by construction, (Euclidean) extension δ orthogonal to D in the y^µ coordinates, with δ ≪ 1, we perform a Taylor expansion of the metric and obtain the first-order expression. Again, since S has range δ in the y^⊥_i-direction, we have (x^⊥_{1,i} − x^⊥_{0,i}) < δ. We now change the x^α- and y^µ-coordinates such that δ → 0, corresponding to a data distribution which is more and more concentrated on D. As we change coordinates, L(τ) is constant as a geometric invariant, and we obtain from (39) the desired result.

B Details on Experiments B.1 Toy Example
The flow used for the toy example is composed of 12 RealNVP-type coupling layer blocks.Each of these blocks includes a three-layer fully-connected neural network with leaky ReLU activations for the scale and translation functions.
For training, we sample from the target distribution. We train for 5000 epochs using a batch of 500 samples per epoch. We use the Adam optimizer with standard parameters and learning rate λ = 1 × 10⁻⁴. Training takes around 10 minutes on a standard CPU. After successful training, we can map samples from a multivariate standard normal distribution to the data distribution, see Figure 15.
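A minimal RealNVP-style coupling block of the kind described above might look as follows; layer sizes are illustrative, and the alternation between blocks (permuting which coordinates pass through unchanged) is omitted for brevity:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One coupling block: half of the coordinates pass through unchanged and
    parameterize scale and translation for the other half, so that the
    transformation is invertible with a cheap log-determinant."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.d = dim // 2
        # three-layer fully-connected net with leaky ReLUs, as in the toy flow
        self.net = nn.Sequential(
            nn.Linear(self.d, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, 2 * (dim - self.d)),
        )

    def forward(self, x):
        a, b = x[:, :self.d], x[:, self.d:]
        s, t = self.net(a).chunk(2, dim=1)
        s = torch.tanh(s)  # bound the scale for numerical stability
        y = torch.cat([a, b * torch.exp(s) + t], dim=1)
        return y, s.sum(dim=1)  # transformed batch and log|det J|

    def inverse(self, y):
        a, b = y[:, :self.d], y[:, self.d:]
        s, t = self.net(a).chunk(2, dim=1)
        s = torch.tanh(s)
        return torch.cat([a, (b - t) * torch.exp(-s)], dim=1)
```

Stacking several such blocks with permutations in between yields the 12-block flow used for the toy example.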
In order to train a classifier, we first define the ground truth: points with z-coordinate smaller than zero belong to one class and points with z-coordinate larger than zero belong to the other class. [Figure 16 caption: Boxes extend from lower to upper quartile, red lines mark the medians, whiskers mark the 1.5×IQR and circles mark outliers.] We train a neural network with 256 hidden neurons with ReLU activations and one output neuron with sigmoid activation to near-perfect accuracy on this classification task.
We then run the gradient ascent optimization in image space X and in the base space of the flow Z.We start from samples from the true data distribution and set the target to 0.1 if the network predicted a value larger than 0.5 for the original data point, otherwise we set the target to 0.9.
Figure 16 shows that counterfactuals found in Z lie significantly closer to the data manifold than adversarials found in X. For more details we refer to our github implementation.
attribute is labeled as uncertain. Using this technique, we obtain 25717 training images labelled as healthy and 20603 training images labelled as cardiomegaly. We do not treat the imbalance but train on the data as is. We train for 9 epochs using a learning rate of 1 × 10⁻⁴. We test on the test set, which was produced in the same way as the training set. We get a balanced test accuracy of 86.07% by averaging over the true positive rate (84.83%) and the true negative rate (87.27%).

B.5 U-Net:
The U-Net [62] follows an hourglass structure. The first part consists of multiple convolutional, batch normalization, ReLU and pooling layers that gradually reduce the spatial dimensions while increasing the channel dimensions. The second block consists of upsampling, concatenation of feature maps from the first part, convolutional, batch normalization and ReLU activation layers. The last layer has the same spatial dimension as the input but only one channel, corresponding to a probability map. For using the U-Net to count pedestrians, Ribera et al. [63] add an additional fully connected layer with ReLU activations that combines the information from the last layer and the central encoding layer to estimate the number of objects of interest present.

B.6 Optimization of counterfactuals and adversarial examples
Counterfactuals and adversarial examples are found using the Adam optimizer with standard parameters. We vary only the learning rate λ. For our main experiments, we use the base space of normalizing flows to find the counterfactuals. We set the threshold for the confidence of the target class high when searching for counterfactuals and adversarial examples; we therefore get more visually expressive results. Of course, in practice one might wish to find counterfactuals with lower target confidence. We show an example optimization with different confidence thresholds. We note that for distances in X the distributions for original images, adversarial examples and counterfactuals are very similar, while for distances in Z the distribution of distances for adversarial examples is notably shifted to the right, meaning that adversarial examples are further away from random data samples when the distances are measured in Z, that is, on the manifold. The effect is most notable for CelebA and CheXpert, for which the distances in Z of counterfactuals closely match the distribution of distances between images from the data set.
The original distribution of images from the Mall data set is strongly skewed towards few pedestrians. We can therefore not expect to achieve insights from comparing distributions of manipulated images. We compare against the training set (as the test set is relatively small and the data is unevenly distributed, with many images with few pedestrians and very few with many pedestrians). We use the Euclidean norm as a distance measure.

C Examples for Counterfactuals
In this appendix, we present results on randomly selected images from the four data sets for which we produce counterfactuals via the flow. For the heatmaps, we visualize both the sum over the absolute values of the color channels as well as the sum over the color channels.

Figure 1:
Figure 1: Example of a counterfactual from the CelebA data set. The original is classified as not blond. The adversarial example is classified with high confidence as blond, but the difference to the original resembles unstructured noise. The counterfactual is also classified with high confidence as blond, but in contrast to the adversarial example it shows semantic differences to the original.

Figure 2:
Figure 2: (a) Image data usually lies on a lower-dimensional data manifold, which is embedded in a high-dimensional space. We want to know what image features would have to change so that the classification flips. (b) If we follow the gradient of our target class with respect to the input, ∂f_t/∂x, the prediction flips, but the resulting image is an adversarial example that looks indistinguishable from the original for a human observer. The changes to the original image are not semantic, but are limited to specific noise. The Euclidean difference between adversarial example and original is therefore very small when measured in X but large when measured in Z. (c) We use the normalizing flow g to obtain the latent space representation z = g⁻¹(x) of our original image x. We then perform gradient ascent in the latent space Z. The prediction flips, but this time the resulting image is a counterfactual. The changes to the original image are semantic. The Euclidean difference between counterfactual and original is small when measured in X and in Z. (d) Left: quantitative evaluations show that counterfactuals generalize to simple classifiers, in contrast to adversarial examples. Right: counterfactuals are similar to images of the target class. We show results for the CelebA data set. (e) Histograms for original images, adversarial examples and counterfactuals are indistinguishable when measuring the distances in the input space X. When measuring the distances in the latent space Z, we see that adversarial examples have larger distances. This confirms the hypothesis that adversarial examples lie off-manifold. We show results for the CelebA data set.

Figure 3:
Figure 3: When the gradient ascent optimization of the target class is performed in the input space of the classifier, one leaves the data manifold and obtains an adversarial example. If instead the gradient ascent is performed in the latent space of a generative model, one stays on the data manifold, resulting in a counterfactual example.

Figure 4:
Figure 4: Construction of the y^µ coordinates, which are aligned with the data manifold D.

Figure 5:
Figure 5: Construction of the curve τ used in Section 3.4.

Figure 7:
Figure 7: Left: as expected from the theoretical analysis, the parallelepiped spanned by all three eigenvectors of the inverse induced metric, scaled by the corresponding eigenvalues, is to a good approximation one-dimensional, i.e. of the same dimension as the data manifold, and tangential to it. Right: the Jacobians of the trained flows have a low number of large and a large number of small eigenvalues, suggesting that the images lie approximately on a low-dimensional manifold. Both axes are scaled logarithmically.

Figure 8:
Figure 8: Counterfactuals for MNIST ('four' to 'nine'), CelebA ('not-blond' to 'blond'), CheXpert ('healthy' to 'cardiomegaly'), Mall ('few' to 'many') and Mall ('many' to 'few'). Columns of each block show the original image x, the counterfactual x′, and the difference h for three selected data points. The first row of each block shows our diffeomorphic counterfactuals, i.e. obtained by gradient ascent in Z space. The second row of each block shows standard gradient ascent in X space. Heatmaps h show the difference |x − x′| summed over color channels.

Figure 9:
Figure 9: Left: accuracy with respect to the target class k generalizes better to an SVM for diffeomorphic counterfactuals. Right: regression values of the oracle are closer to the target values for Z-based counterfactuals (bars show means and errors denote one standard deviation).

Figure 10:
Figure 10: Head locations for pedestrians in counterfactuals and adversarial examples when maximizing the number of pedestrians (upper two rows) and minimizing it (lower two rows). The original U-Net is fooled by the adversarial examples, leading to false positives (second row) and false negatives (fourth row) when detecting pedestrians. The oracle U-Net generalizes to the diffeomorphic counterfactuals found by gradient ascent in Z (odd rows) but not to the adversarial examples found by gradient ascent in X (even rows).

Figure 11:
Figure 11: Left: the ground truth class of the ten nearest neighbours (NNs) matches the target value ('9', 'blond' and 'cardiomegaly') more often for the counterfactuals found in Z. Right: ground truth pedestrian counts averaged over the three nearest neighbours are closer to the target values for diffeomorphic counterfactuals. Bars show means and errors denote one standard deviation.

Figure 12:
Figure 12: Euclidean distances in X and Z for adversarial examples (first row) and counterfactuals (second row) for the CelebA data set. Counterfactuals lie closer to their respective source image than adversarial examples when measured in Z, i.e. along the data manifold.

Figure 13:
Figure 13: Counterfactuals for the cVAE on MNIST (left block), the dcGAN on MNIST (middle block) and the pGAN on CelebA (right block). Columns of each block show the original image, the decoded latent representation of the original, the counterfactual and the absolute difference |x − x′| summed over color channels.

Figure 14:
Figure 14: Counterfactuals generated with HyperStyle and CelebA-HQ. Columns show the original, the decoded latent representation, the counterfactual and the absolute difference |x − x′| summed over color channels.

Figure 15:
Figure 15: From left to right: distribution in the base space of the flow, target distribution, learned distribution.

Figure 16:
Figure 16: Statistics for 1000 counterfactuals/adversarial examples. Boxes extend from the lower to the upper quartile, red lines mark the medians, whiskers mark 1.5×IQR and circles mark outliers.
MNIST: We use λ = 5×10⁻⁴ for conventional adversarial examples and λ = 5×10⁻² for counterfactuals found via the flow. We do a maximum of 2000 steps, stopping early when we reach the target confidence of 0.99. We perform attacks on 500 images of the true class 'four'. All conventional attacks and 498 of the attacks via the flow reached the target confidence of 0.99 for the target class 'nine'.
CelebA: We use λ = 7×10⁻⁴ for conventional adversarial examples and λ = 5×10⁻³ for counterfactuals found via the flow. We do a maximum of 1000 steps, stopping early when we reach the target confidence of 0.99. We perform attacks on 500 images of the true class 'not-blond'. 492 conventional attacks and 496 of the attacks via the flow reached the target confidence of 0.99 for the target class 'blond'.

Figure 19:
Figure 19: Euclidean distances in X and Z for adversarial examples (top row) and counterfactuals (bottom row) for the MNIST data set.

Figure 20:
Figure 20: Euclidean distances in X and Z for adversarial examples (top row) and counterfactuals (bottom row) for the CelebA data set.

Figure 21:
Figure 21: Euclidean distances in X and Z for adversarial examples (top row) and counterfactuals (bottom row) for the CheXpert data set.

Figure 22:
Figure 22: Euclidean distances in X and Z for adversarial examples (top row) and counterfactuals (bottom row) for the Mall data set. Source images have few pedestrians.

Figure 23:
Figure 23: Euclidean distances in X and Z for adversarial examples (top row) and counterfactuals (bottom row) for the Mall data set. Source images have many pedestrians (r > 3).

Figure 24:
Figure 24: Distributions of Euclidean distances in X and Z for test images, adversarial examples and counterfactuals for three data sets.

Table 1:
Interpretability metric IM1 values for MNIST and CelebA, calculated for the original images, the adversarial examples produced by gradient ascent in X space, and the diffeomorphic counterfactuals produced by gradient ascent in Z space. We show mean and standard deviation. A low value of IM1 means that the image is better represented by an autoencoder trained on only the target class, i.e. better interpretability.

Table 2:
Euclidean norms L₂ in X for adversarial examples found via gradient ascent in X and counterfactuals found via gradient ascent in Z. We show mean and standard deviation.

Table 3:
Euclidean norms L₂ in Z for adversarial examples found via gradient ascent in X and counterfactuals found via gradient ascent in Z. We show mean and standard deviation.