iVAE-GAN: Identifiable VAE-GAN Models for Latent Representation Learning

Remarkable progress has been made within nonlinear Independent Component Analysis (ICA) and identifiable deep latent variable models. Formally, the latest nonlinear ICA theory enables us to recover the true latent variables up to a linear transformation by leveraging unsupervised deep learning. This is of significant importance for unsupervised learning in general, as the true latent variables are of principal interest for meaningful representations. These theoretical results stand in stark contrast to the mostly heuristic approaches used for representation learning, which provide no analytical relation to the true latent variables. We extend the family of identifiable models by proposing an identifiable Variational Autoencoder (VAE) based GAN model we name iVAE-GAN. The latent space of most GANs, including VAE-GAN, is generally unrelated to the true latent variables. With iVAE-GAN we present the first principled approach to a theoretically meaningful latent space by means of adversarial training. We implement the novel iVAE-GAN architecture and confirm its identifiability experimentally. The GAN objective is believed to be an important addition to identifiable models as it is one of the most powerful deep generative models. Furthermore, no requirements are imposed on the adversarial training, leading to a very general model.


I. INTRODUCTION
One of the biggest challenges facing machine learning, and unsupervised learning in particular, is meaningful representation learning. A very recent leap in representation learning is the understanding of identifiability in deep latent variable models by [7,8]. Identifiability has origins in early econometrics, as shown by the "problem of confluent relations (or problem of arbitrary parameters)" [12]. The formulation states that if two or more parametrizations of the same model lead to the same joint distribution over observed random variables, they are indistinguishable on the basis of observations and therefore unidentifiable. Until recently, the general consensus in the literature was that arbitrary nonlinear functions, such as those modeled by neural networks, were almost surely unidentifiable under the assumption of independent latent variables [13,4]. It is now understood that identifiability results can be achieved if the model assumes conditionally independent latent variables. That is, given an additionally observed variable under which the latent variables are independent, they may be estimated up to a linear transformation and in certain cases reduced to a simple scaled permutation. The nonlinear maps from observable to latent variables need not preserve dimensionality, but if they do, it is worth noting that the identifiability results become interpretable as nonlinear ICA [5].
Generative models are widely used and GANs are no exception. However, GANs do not learn a mapping from data space to latent space which, by definition, is necessary in order to construct an identifiable model. They do nonetheless rely on a latent space constructed from an explicit latent distribution from which samples are drawn as input to the generator (typically N (0, I)). In this work we leverage the new identifiability results to propose the first identifiable GAN by using variational inference (iVAE-GAN) as shown in FIGURE 1.
It should be highlighted that in the proposed model the latent distribution is not chosen but learnt through the encoder and auxiliary network, and, more importantly, it is identifiable.

FIGURE 1: The proposed iVAE-GAN architecture. The latent variable model consists of four neural networks: an encoder E that learns the latent variables, a decoder/generator G that generates data in the original data space, a discriminator D that discriminates generated data from real data, and an auxiliary network A that learns the natural parameters, λ_i(u), of an exponential family from the observed auxiliary variables.

This results in a model architecture that is similar to VAE-GAN [11], but with key differences as elaborated in the following. Most importantly, iVAE-GAN is identifiable and therefore requires the additional auxiliary network, as opposed to VAE-GAN. The generator of iVAE-GAN is updated according to the standard GAN loss and does not require a weighting of a VAE reconstruction loss against a GAN loss. Thus iVAE-GAN directly benefits from improvements in the GAN minimax game while being applicable to the same problems with the same exceptional generative capabilities known from the GAN literature. Due to identifiability, the latent space of the GAN becomes meaningful as it is related to the true latent variables of the data. Therefore iVAE-GAN presents a principled approach to disentanglement in GANs that can be used to understand the true factors of variation in data. Thus, our model not only allows us to disentangle data into the true features; we can also use it to understand how the generative model recreates data and insert new latent codes in the latent space to generate novel data. Our contribution is three-fold: 1) the first identifiability proof in GAN, 2) the identifiable VAE-GAN model (iVAE-GAN), and 3) an implementation of the novel iVAE-GAN model used to validate and compare its identifiability against existing identifiable models.
The source code for the model implementation and the experiments will be made publicly available after publication.

II. RELATED WORK
This section reiterates the theoretical results needed to propose iVAE-GAN. This is important, as it not only shows the theoretical justification for identifiability in iVAE-GAN, but also establishes that the theoretical framework we follow is general and could potentially be applied to a wider family of deep learning models.

a: Identifiability
A model is said to be identifiable if only one parametrization of the model can lead to the observed data distribution, i.e.

p_{\theta_1}(x) = p_{\theta_2}(x) \implies \theta_1 = \theta_2. (1)

On the other hand, if p_{\theta_1}(x) = p_{\theta_2}(x) but the parametrization is not unique, \theta_1 \neq \theta_2, the model is said to be unidentifiable on the basis of observations. It is clear that identifiability is of interest in deep latent variable models, as elaborated in the following. Latent variable models commonly model the joint distribution as

p_\theta(x, z) = p_\theta(x|z)\, p_\theta(z), (2)

for x \in R^d and z \in R^n (lower-dimensional, n \leq d), but only provide training guarantees on the marginal distribution of the model (such as learning a lower bound on the observed marginal distribution)

p_\theta(x) = \int p_\theta(x, z)\, dz. (3)

Unfortunately, deep latent variable models as in (2) do not learn the true joint distribution and as a result cannot recover the original latent variables. In contrast, it is sufficient for identifiable models to learn the marginal distribution, because only one parametrization of the joint distribution produces the seen marginal distribution and therefore the latent variables can be recovered. [8,7] have derived a very general deep latent variable model that is identifiable up to linear equivalence relations. Their work highlights that it is pivotal that the prior distribution is conditioned on an additional observed variable, u. Therefore, the general form of the identifiable model becomes

p_\theta(x, z|u) = p_f(x|z)\, p_{T,\lambda}(z|u), (4)

where u \in R^m is the auxiliary variable observed alongside the data and \theta = (f, T, \lambda) are the parameters of the conditional generative model. The conditional latent distribution, p_{T,\lambda}(z|u), is assumed to belong to an exponential family of independent variables:

p_{T,\lambda}(z|u) = \prod_{i=1}^{n} \frac{Q_i(z_i)}{Z_i(u)} \exp\Big( \sum_{j=1}^{k} T_{i,j}(z_i)\, \lambda_{i,j}(u) \Big), (5)

where Q_i is the base measure, Z_i(u) is the normalizing constant, T_i = (T_{i,1}, \ldots, T_{i,k}) are the sufficient statistics, and \lambda_{i,j}(u) are the natural parameters of the family. The dependence of the natural parameters on u is learnt by the network we name A in our implementation. The assumption that the latent distribution must belong to an exponential family is not considered restrictive, as exponential families have been shown to have universal approximation capabilities by [16].
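As a small, hypothetical illustration of the conditional prior in (5), the sketch below implements a zero-mean Gaussian family whose natural parameters λ_i(u) depend on a discrete segment index u. In the full model these parameters would be produced by the auxiliary network A; the lookup table, variable names, and sizes here are our own stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_segments = 2, 5  # latent dimension and number of auxiliary values u

# Stand-in for the auxiliary network A: a lookup table mapping each segment
# index u to natural parameters lambda_i(u) of a zero-mean Gaussian family,
# p(z_i|u) ∝ exp(T_i(z_i) * lambda_i(u)) with T_i(z) = z^2 and
# lambda_i(u) = -1 / (2 * sigma_i(u)^2).
lam = -0.5 / rng.uniform(0.5, 2.0, size=(n_segments, n))

def sample_prior(u, size):
    """Draw z ~ p(z|u): independent Gaussians whose variance depends on u."""
    sigma = np.sqrt(-0.5 / lam[u])
    return rng.normal(0.0, sigma, size=(size, n))

z = sample_prior(u=3, size=10000)
# Empirical variances should match the variances implied by lambda_i(u=3).
print(z.var(axis=0), -0.5 / lam[3])
```

The conditioning on u is exactly what makes the latent variables conditionally independent: within one segment the z_i are independent, while across segments their scales vary.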
The decoder, p_f(x|z), is defined as

p_f(x|z) = p_\epsilon(x - f(z)), (6)

allowing x to be decomposed into x = f(z) + \epsilon, where the noise is distributed according to p_\epsilon(\epsilon) and f : R^n \to R^d is assumed injective.

c: The auxiliary variable
The auxiliary variable u is pivotal to the definition of the identifiable model. It is an observed variable such that the collected dataset contains pairwise observations of both the data, x, and u: D = {(x^{(1)}, u^{(1)}), \ldots, (x^{(N)}, u^{(N)})}. From (5) it can be seen that it is critical that the latent variables are independent given u. It would be natural to wonder: how do we know the latent variables, which by definition are never observed, are independent given u? In short, we do not know. Often this will be application specific and rely on knowledge of the data at hand. For most labeled datasets, such as MNIST, u could simply be the label.

The identifiability theorem of [8] states that, given that the data we observe follow the marginal distribution of some generating process with joint distribution conditioned on u and true generating parameters (f, T, \lambda),

p_{f,T,\lambda}(x|u) = \int p_f(x|z)\, p_{T,\lambda}(z|u)\, dz, (7)

and a deep generative model of the same form learns to approximate the marginal distribution of observed data with parameters (\tilde{f}, \tilde{T}, \tilde{\lambda}) such that

p_{f,T,\lambda}(x|u) = p_{\tilde{f},\tilde{T},\tilde{\lambda}}(x|u), (8)

then the parameters (f, T, \lambda) and (\tilde{f}, \tilde{T}, \tilde{\lambda}) are said to be \sim_A-identifiable such that

T(f^{-1}(x)) = A\, \tilde{T}(\tilde{f}^{-1}(x)) + c, (9)

for some nk \times nk invertible matrix A and vector c. We provide a walk-through of the original proof in the Appendix. Main points from the proof include that, with a small assumption on the nature of the noise in (6), the underlying noiseless distributions of the models must be equal. By using said equality a system of equations can be constructed, because the noiseless distributions are equal for all u. This system of equations has a matrix representation of the form

L^T\, T(f^{-1}(x)) = \tilde{L}^T\, \tilde{T}(\tilde{f}^{-1}(x)) + b.

The entries of the matrix L are a function of points of u. It is assumed that there exist at least nk + 1 points of u such that the matrix L is invertible.
Lastly, an assumption on the sufficient statistics, T, is made to prove the final equivalence relation.
This theoretical result is significant because it states that the trained deep generative model will have recovered the original latent variables, f^{-1}(x) = z, up to a linear transformation of the sufficient statistics.
The theory requires that estimation models follow a deep latent variable model with a conditional prior, as seen from (4), and that they are able to approximate the seen marginal distribution. The proposed iVAE-GAN model learns both a variational approximation, q_\phi(z|x, u), of the posterior, p_\theta(z|x, u), and a generative model, and is therefore an appropriate deep latent variable model. In the next section we show that the iVAE-GAN model also fulfills the second condition, thereby making the first link between identifiability and GAN.

III. IVAE-GAN
iVAE-GAN has a hybrid loss function that consists of a divergence loss for the encoding of the latent space with respect to the (conditional) prior distribution and an adversarial loss for the generated samples, such that

L_{iVAE-GAN} = L_{prior} + L_{GAN}, (10)

where we define

L_{prior} = -KL( q_\phi(z|x,u) \,\|\, p_\theta(z|u) ) (11)

and

L_{GAN} = E_{p(x)}[\log D(x)] + E_{q_\phi(z|x,u)}[\log(1 - D(G(z)))]. (12)

Training is then performed according to

\min_G \max_D\; L_{iVAE-GAN}. (13)

In the following we show that the loss is a lower bound on the difference between the log probability of the data and the expected log likelihood of the data generated by the decoder,

\log p(x) - E_{q_\phi(z|x,u)}[\log p_\Phi(x|z)] \geq L_{iVAE-GAN}, (14)

and that by maximizing L_{iVAE-GAN} the expected log likelihood of the data generated by the decoder approaches the log probability of the data.
L_{prior} of the iVAE-GAN loss is related to the ELBO loss ([10]) such that

\log p(x) - E_{q_\phi(z|x,u)}[\log p_\Phi(x|z)] \geq -KL( q_\phi(z|x,u) \,\|\, p_\theta(z|u) ) = L_{prior}. (15)

Now we show that the same inequality is also fulfilled by L_{prior} + L_{GAN}, but that, in contrast to (15), the data distribution may be learnt by maximizing L_{prior} + L_{GAN}. We assume an optimal discriminator, D^*, and use the result of [2]. Therefore we can write the optimization of L_{GAN} as

\min_G L_{GAN} = \min_G V(G, D^*) = -\log(4) + 2 \cdot JSD( p(x) \,\|\, p_\Phi(x) ) (16)

(see the Appendix for proof).
To write our loss function only as a function that is to be maximized, we pose the minimization over G as a maximization:

\max_G -V(G, D^*) = \max_G\; \log(4) - 2 \cdot JSD( p(x) \,\|\, p_\Phi(x) ). (17)

Since we mean to maximize this function using a deep neural network, the constant \log(4) is inconsequential to the loss function. Therefore:

L_{GAN} \to -2 \cdot JSD( p(x) \,\|\, p_\Phi(x) ). (18)

Since the negated Jensen-Shannon divergence is nonpositive, it can always be added to the lesser side of an inequality without altering the inequality. Therefore we recover our lower bound by adding -2 \cdot JSD( p(x) \,\|\, p_\Phi(x) ) to (15):

\log p(x) - E_{q_\phi(z|x,u)}[\log p_\Phi(x|z)] \geq L_{prior} - 2 \cdot JSD( p(x) \,\|\, p_\Phi(x) ). (19)

The right-hand side of (19) can be recognized as the iVAE-GAN loss (updated according to (16) and (18)):

\log p(x) - E_{q_\phi(z|x,u)}[\log p_\Phi(x|z)] \geq L_{iVAE-GAN}. (20)

Thus the iVAE-GAN loss is a lower bound on the difference between the log probability of observed data and the expected log likelihood of the data generated by the decoder. To, hopefully, make (19) a little more interpretable we can make use of Jensen's inequality:

\log p_\Phi(x) = \log E_{q_\phi(z|x,u)}[p_\Phi(x|z)] \geq E_{q_\phi(z|x,u)}[\log p_\Phi(x|z)]. (21)

Therefore we can write

\log p(x) - \log p_\Phi(x) \leq \log p(x) - E_{q_\phi(z|x,u)}[\log p_\Phi(x|z)]. (22)

By using the transitive property of inequalities we may write the lower bound as

\log p(x) - \log p_\Phi(x) \geq L_{prior} - 2 \cdot JSD( p(x) \,\|\, p_\Phi(x) ). (23)

The Jensen-Shannon divergence measures the distance between two distributions and is therefore closely related to the difference of log probabilities, so as the lower bound is maximized the difference between log probabilities is minimized. In fact, the only condition for which the Jensen-Shannon divergence vanishes is p(x) = p_\Phi(x), at which point the left-hand side becomes zero and the lower bound becomes

0 \geq -KL( q_\phi(z|x,u) \,\|\, p_\theta(z|u) ), (24)

which is of course the normal bound for the negative KL divergence. Therefore, by maximizing L_{iVAE-GAN} we learn the data distribution while simultaneously maximizing the negative KL divergence between the encoded distribution, q_\phi(z|x,u), and the prior distribution, p_\theta(z|u), thus achieving an identifiable model. This is, to the best of the authors' knowledge, the first work to actually extend and apply the theoretical framework to a deep latent variable model not contained in the original works by [8,7].
The developed identifiability theory is claimed to be very general and extendable to a wide range of models and applications, a belief the authors of this paper share. Therefore it is not insignificant that we have shown how it extends to GAN. The training algorithm for iVAE-GAN is shown in Algorithm 1.
It is no coincidence that we consider GANs a valuable framework to make identifiable. GANs are a very active area of research with state-of-the-art generative models and a wide range of applications. Providing proof of identifiability and initial experiments expands the toolkit researchers have at their disposal when meaningful latent spaces are desired in GANs. Importantly, the results we have shown do not impose or assume any restrictions on the adversarial training; therefore identifiability should be attainable in a large variety of GAN flavors that have other desirable properties such as stable training or alternative formulations of the minimax game.

IV. EXPERIMENTS
The core premise of the problem we aim to solve is that the latent variables are, by definition, never observed. Only the data, which are a nonlinear function of the latent variables, are observed. This premise is of great practical interest because it almost always reflects the true nature of data collection, and the latent variables carry valuable information about the data. In our case, where we also learn a generative model, not only can we make inferences about the origin of the data but we can also generate unseen data. However, this also means that datasets with known latent variables are very limited, even for datasets where the latent variables would intuitively be very simple. Consider e.g. MNIST. It would be very intuitive to expect the true latent space to consist of ten independent distributions, one for each digit. Yet, because it is very difficult to observe the latent space, we cannot use such datasets to validate our model. Therefore we have strictly used a synthetic dataset such that there is no ambiguity with respect to the true latent variables. Of course, the latent variables are of greatest interest in real data, but the scope of this work has been to show that identifiability is possible in adversarial networks.

a: Dataset
We have created our dataset with the data generator graciously provided in [7]. There are two main reasons for this choice: as discussed above, datasets with known latent variables are very limited, and the same data generator has been used with other identifiable models, making it suitable for comparing models across the same data and parameters. The data are generated in segments such that they become a non-stationary Gaussian time series. All segments are generated with equally many samples. The latent variables are drawn from an exponential family distribution, with \lambda_i generated randomly and independently for each segment, and passed through an untrained (randomly initialized) multilayer perceptron (MLP) to produce data that are a nonlinear function of the latent variables.
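A minimal sketch of this generation scheme is given below. The sizes, the Gaussian scale prior, and the one-hidden-layer mixing network are hypothetical stand-ins; the actual generator of [7] differs in details.

```python
import numpy as np

rng = np.random.default_rng(0)
n_segments, per_seg, n, d = 40, 100, 2, 2  # hypothetical sizes

# Non-stationary latents: for each segment, draw random scales (playing the
# role of the natural parameters lambda_i) and sample independent Gaussians.
scales = rng.uniform(0.5, 3.0, size=(n_segments, n))
z = np.concatenate([rng.normal(0.0, s, size=(per_seg, n)) for s in scales])
u = np.repeat(np.arange(n_segments), per_seg)  # segment index = auxiliary variable

# Untrained (randomly initialized) MLP as the nonlinear mixing f: R^n -> R^d.
W1 = rng.normal(size=(n, 8))
W2 = rng.normal(size=(8, d))
h = z @ W1
x = np.maximum(h, 0.2 * h) @ W2  # leaky-ReLU hidden layer

print(x.shape, u.shape)  # (4000, 2) (4000,)
```

Only (x, u) would be handed to the model; z is kept aside purely to evaluate how well the latents are recovered.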

b: MCC metric
The Mean Correlation Coefficient (MCC) was used to quantify identifiability in [8,7] and we adopt the same metric to quantify identifiability in iVAE-GAN. Given two sets of observations of m random variables each, the MCC metric calculates the interclass correlation coefficients (either Pearson or Spearman's correlation coefficients) between the m random variables of each set. Since every recovered latent variable should correspond to exactly one true latent variable, a linear sum assignment problem is solved such that each recovered latent variable is assigned to exactly one true latent variable and the sum of the assigned correlation coefficients is maximized. The MCC score is then the mean of the correlation coefficients after assignment. A high MCC score thus reflects that the recovered latent variables are highly correlated with the true latent variables.

FIGURE 2: 2-Dimensional data and latent spaces. We have omitted axes to emphasize the linear indeterminacy as a rotation. a) The original generating latent variables. b) The latent variables recovered by iVAE-GAN. c) Input data. d) Data generated by iVAE-GAN.
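The metric can be sketched with scipy's linear-sum-assignment solver. The helper name `mcc` and the toy example are our own illustration; the 90° rotation mirrors the linear indeterminacy discussed in these experiments and, being an axis permutation with a sign flip, yields a score of exactly 1.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.stats import spearmanr

def mcc(z_true, z_rec):
    """Mean Correlation Coefficient between true and recovered latents."""
    m = z_true.shape[1]
    # Absolute m x m cross-correlations between true and recovered variables.
    corr = np.abs(spearmanr(z_true, z_rec).correlation[:m, m:])
    # Assign each recovered latent to one true latent, maximizing total correlation.
    row, col = linear_sum_assignment(-corr)
    return corr[row, col].mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))
rot = np.array([[0.0, 1.0], [-1.0, 0.0]])  # 90 degree rotation
print(round(mcc(z, z @ rot), 3))  # 1.0
```

Spearman correlation is invariant to monotone transforms and the absolute value discards sign flips, so MCC measures recovery up to scaled permutation, exactly the indeterminacy the theory permits.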
c: Network architecture
The reason multiple values are stated for the discriminator is that different widths were required for 100, 200, 500, 1000 and 2000 observations per segment. We believe this is because more observations per segment mean the discriminator has to discriminate based on more points, which in turn requires a larger network in order to perform well. We also think this is consistent with common findings where more data points result in more complex GAN networks; e.g., it is harder to get reasonable results with higher resolution images than with lower resolution images.

d: Optimizer
Two optimizers were used: one for the discriminator and one for the remaining three networks. Both were Adam optimizers with the same parameters: a learning rate of 0.001 and (\beta_1, \beta_2) = (0.5, 0.999).

e: Training
All iVAE-GAN models were trained for 300,000 iterations across 10 different seeds, while the iVAE and ICE-BeeM models were trained for 70,000 iterations across 10 different seeds, with the segment index as the auxiliary variable. iVAE used its default batch size of 256 and ICE-BeeM its default batch size of 128. For this particular iVAE-GAN experiment the batch size was simply set to the size of the dataset, because the entire dataset could reside in the memory of the Tesla V100 32GB GPU made available by the university. For more details see the Appendix.
During training we have noted the following remarks:
• iVAE-GAN inherits common training instabilities associated with adversarial training.
• Training is not consistently seen to monotonically converge. See Figure 6.
• Latent variables are only recovered well if the network learns to generate good data.
• Discriminator size is vital for well-behaved training. In practice we have used a VAE model to find a decoder network sufficiently complex to express the data and then tuned the discriminator hyperparameters.

As can be seen from FIGURE 2, iVAE-GAN generates similar but slightly different data; most importantly, the recovered latent variables are related to the true latent variables by a linear transformation, in this case a 90° clockwise rotation. To experimentally show identifiability we compare our model to iVAE and ICE-BeeM [8,7], as shown in FIGURE 3.
Our experiments indicate that identifiability is achievable not only in the model we have presented here, but in adversarial training generally without sacrificing the desired properties that make adversarial training appealing. Since the training stability of our model greatly resembles the notoriously challenging training of most GANs, future work could benefit from various developed methods aimed at stabilizing GAN training as described by [14,3,1,17,6].

V. DISCUSSION
The main contribution of this paper is to show that GAN models can be made identifiable. This was achieved by showing that adversarial training fulfills the assumption that in the limit of infinite data the model learns the true data distribution, while allowing the inference model to learn the conditional prior distribution, as shown in (24). In our comparison of iVAE-GAN (orange dotted) to the existing models iVAE (blue solid) and ICE-BeeM (green dashed) in FIGURE 3 we achieve comparable performance. In particular, iVAE-GAN performs better with few observations per segment than the other models but starts to struggle when more observations are available. We suspect this trend is caused by our quite simple GAN implementation, which tends to become unstable on large data. This could hopefully be remedied using various methods to stabilize training, regaining performance when more observations per segment are used.
GANs have been associated with ethical concerns following the generation of images and speech that are virtually indistinguishable to the human eye and ear. This work is not application specific and has therefore not produced any content that may be ethically concerning. Instead, we hypothesize that identifiability in GAN could be a remedying factor, not because it prohibits malicious applications, but because it forces the generation to be based on the true generating process. Therefore there should be nothing inherently evil about an identifiable model. Identifiable models could in fact be of interest when sensitive or safety-critical data are used, because the operator runs little to no risk of introducing unwanted bias in the model.

VI. CONCLUSION
We have proven that Generative Adversarial Networks (GANs) can be made identifiable and proposed the identifiable model iVAE-GAN. As validation we have implemented the model and compared it to state-of-the-art identifiable models on the same data. This is the first proof of identifiability in GAN and it does not impose constraints on the adversarial training; therefore, the results apply broadly to a variety of different GAN flavors. We found through experiments that the training dynamics of iVAE-GAN are prone to the same difficulties found in most GANs, which is therefore a topic of interest for further work.

VII. FUTURE WORK
In future work the model should be tested on real data. This work has proven identifiability theoretically for iVAE-GAN as well as verified and compared its identifiability on a synthetic dataset. We propose that Amazon's Dinner Party Corpus (DiPCo) would be ideal for this and other identifiable models, as the cocktail party problem is the epitome of applied identifiability and DiPCo has unambiguous and easily interpretable latent variables.

APPENDIX. DERIVATION OF GLOBAL OPTIMALITY IN GAN
This derivation follows directly from [2], only stated more explicitly. GAN is formulated as a two-player minimax game according to

\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]. (27)

To simplify the derivations that are to follow, the second term is rewritten using the law of the unconscious statistician such that

V(D, G) = \int_x p_{data}(x) \log D(x)\, dx + \int_x p_g(x) \log(1 - D(x))\, dx. (28)

To express this in a way where the behaviour of the generator can be examined, an optimal discriminator is assumed, such that the generator will try to minimize the following function:

V(G, D) = \int_x \big[ p_{data}(x) \log D(x) + p_g(x) \log(1 - D(x)) \big]\, dx. (29)

The integrand can be recognized as the function f(y) = a \log(y) + b \log(1 - y), which attains a maximum at y = \frac{a}{a + b}, implying that the optimal discriminator is given by

D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}. (30)

Thus, if p_g = p_{data} the optimal discriminator will become D^*_G(x) = \frac{1}{2} and by (28) we can find the optimum value at which the generated data will be indistinguishable from the true data:

V(G, D^*) = \int_x p_{data}(x) \log \tfrac{1}{2}\, dx + \int_x p_g(x) \log \tfrac{1}{2}\, dx = -\log 4. (31)

Now we need to verify that -\log 4 is indeed a minimum of V(G, D^*):

V(G, D^*) = \int_x p_{data}(x) \log\Big( \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \Big) dx + \int_x p_g(x) \log\Big( \frac{p_g(x)}{p_{data}(x) + p_g(x)} \Big) dx. (32)

The first term can be recognized, up to a constant, as the Kullback-Leibler divergence KL\big( p_{data} \,\|\, \frac{p_{data} + p_g}{2} \big) = \int_x p_{data}(x) \log\big( \frac{p_{data}(x)}{p_{data}(x) + p_g(x)} \big) dx + \log 2, and the second term can also be rewritten as a KL divergence since

KL\Big( p_g \,\Big\|\, \frac{p_{data} + p_g}{2} \Big) = \int_x p_g(x) \log\Big( \frac{p_g(x)}{p_{data}(x) + p_g(x)} \Big) dx + \log 2. (33)

Therefore (32) can be written as

V(G, D^*) = -\log 4 + KL\Big( p_{data} \,\Big\|\, \frac{p_{data} + p_g}{2} \Big) + KL\Big( p_g \,\Big\|\, \frac{p_{data} + p_g}{2} \Big). (34)

The last step is achieved by rewriting the equation such that the two KL divergences can be expressed as a Jensen-Shannon divergence between p_{data} and p_g, multiplying the argument of each logarithm by \frac{2}{2} and distributing terms:

V(G, D^*) = -\log 4 + 2 \cdot JSD(p_{data} \,\|\, p_g). (35)

Since the Jensen-Shannon divergence is always nonnegative and attains its minimum only when p_{data} = p_g, it is concluded that \min_G V(G, D^*) = -\log(4) only when p_{data} = p_g. In other words, the global optimum of adversarial training, under the assumption of an optimal discriminator, occurs only when the generated data follow the same distribution as the observed data.
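The two key claims of the derivation, that the claimed optimal discriminator maximizes V and that at the optimum V equals −log 4 plus twice the Jensen-Shannon divergence, can be verified numerically on discrete distributions. This is a small sanity check of ours, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete surrogate: x takes 5 values with probabilities p_data and p_g, and
# V(G, D) = sum_x [ p_data(x) log D(x) + p_g(x) log(1 - D(x)) ].
p_data = rng.dirichlet(np.ones(5))
p_g = rng.dirichlet(np.ones(5))

d_star = p_data / (p_data + p_g)  # the claimed optimal discriminator

def V(d):
    return float(np.sum(p_data * np.log(d) + p_g * np.log(1.0 - d)))

# Any perturbed discriminator does no better than D*.
for _ in range(100):
    d = np.clip(d_star + rng.normal(0.0, 0.05, 5), 1e-6, 1 - 1e-6)
    assert V(d) <= V(d_star) + 1e-12

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

# V(G, D*) = -log 4 + 2 JSD(p_data || p_g), hence V(G, D*) >= -log 4,
# with equality exactly when p_g = p_data.
assert np.isclose(V(d_star), -np.log(4) + 2 * jsd(p_data, p_g))
```

The pointwise argument carries over unchanged: each summand has the form a log(y) + b log(1 − y), maximized at y = a/(a + b).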

APPENDIX. IDENTIFIABILITY PROOF
All the proofs stated herein come from [8], only with a more explicit walk-through.
Given two sets of parameters (f, T, \lambda) and (\tilde{f}, \tilde{T}, \tilde{\lambda}) such that p_{f,T,\lambda}(x|u) = p_{\tilde{f},\tilde{T},\tilde{\lambda}}(x|u), the noise-free distributions, \tilde{p}, as they will be defined below, are also equal. The marginal distribution in the generative model can be written as

p_{f,T,\lambda}(x|u) = \int p_f(x|z)\, p_{T,\lambda}(z|u)\, dz. (36)

Substituting the decoder with the definition from (6) yields

p_{f,T,\lambda}(x|u) = \int p_\epsilon(x - f(z))\, p_{T,\lambda}(z|u)\, dz. (37)

We now change the domain of the integral from Z to X by introducing \bar{x} = f(z). We also introduce the notion of matrix volume, denoted vol A = \sqrt{\det(A^T A)}, which acts as a replacement for the absolute determinant of the Jacobian introduced as a result of the change of variable:

p_{f,T,\lambda}(x|u) = \int_X p_\epsilon(x - \bar{x})\, p_{T,\lambda}(f^{-1}(\bar{x})|u)\, vol\, J_{f^{-1}}(\bar{x})\, d\bar{x}. (38)

We now introduce the following shorthand:

\tilde{p}_{T,\lambda,f,u}(x) = p_{T,\lambda}(f^{-1}(x)|u)\, vol\, J_{f^{-1}}(x)\, 1_X(x), (39)

where 1_X is the indicator function, assuring that the expression has measure zero if x is not contained in the image of f:

p_{f,T,\lambda}(x|u) = \int_{R^d} p_\epsilon(x - \bar{x})\, \tilde{p}_{T,\lambda,f,u}(\bar{x})\, d\bar{x}. (40)

We recognize this to be the convolution between \tilde{p}_{T,\lambda,f,u} and p_\epsilon:

p_{f,T,\lambda}(x|u) = (\tilde{p}_{T,\lambda,f,u} * p_\epsilon)(x). (41)

Transforming the functions to the Fourier domain allows us to simplify the expression further:

F[\tilde{p}_{T,\lambda,f,u}](\omega)\, \varphi_\epsilon(\omega) = F[\tilde{p}_{\tilde{T},\tilde{\lambda},\tilde{f},u}](\omega)\, \varphi_\epsilon(\omega). (42)

Note here that we assume the characteristic function \varphi_\epsilon to be nonzero, which means it can be factored out, yielding the final result, from which it is evident that

\tilde{p}_{T,\lambda,f,u}(x) = \tilde{p}_{\tilde{T},\tilde{\lambda},\tilde{f},u}(x). (43)

Therefore the noise-free distributions have to be the same. In the following we wish to examine the relationship between the true parameters (T, \lambda, f) and the estimated parameters (\tilde{T}, \tilde{\lambda}, \tilde{f}), given that our model learns to accurately approximate the true data distribution \tilde{p}_{T,\lambda,f,u}(x). First we use (39) to write the expression for the noise-free distribution, with z = f^{-1}(x) by definition. By inserting the expression for the prior distribution given an auxiliary variable u, p_{T,\lambda}(z|u) = \prod_i \frac{Q_i(z_i)}{Z_i(u)} \exp\big( \sum_{j=1}^{k} T_{i,j}(z_i)\, \lambda_{i,j}(u) \big) from (5), we can write the distribution over x as

\tilde{p}_{T,\lambda,f,u}(x) = vol\, J_{f^{-1}}(x) \prod_{i=1}^{n} \frac{Q_i(f_i^{-1}(x))}{Z_i(u)} \exp\Big( \sum_{j=1}^{k} T_{i,j}(f_i^{-1}(x))\, \lambda_{i,j}(u) \Big). (44)

Here we can safely drop the indicator function, 1_X(x), as the expression no longer contains integration, and we write z = f^{-1}(x) again to emphasize that latent variables are inferred from data.
For simplicity we shall work with the log pdf, since it greatly simplifies the exponential term:

\log \tilde{p}_{T,\lambda,f,u}(x) = \log vol\, J_{f^{-1}}(x) + \sum_{i=1}^{n} \Big( \log Q_i(f_i^{-1}(x)) - \log Z_i(u) + \sum_{j=1}^{k} T_{i,j}(f_i^{-1}(x))\, \lambda_{i,j}(u) \Big). (45)

Thus we can rewrite (44) and investigate the relation between true and estimated parameters:

\log vol\, J_{f^{-1}}(x) + \sum_{i=1}^{n} \Big( \log Q_i(f_i^{-1}(x)) - \log Z_i(u) + \sum_{j=1}^{k} T_{i,j}(f_i^{-1}(x))\, \lambda_{i,j}(u) \Big) = \log vol\, J_{\tilde{f}^{-1}}(x) + \sum_{i=1}^{n} \Big( \log \tilde{Q}_i(\tilde{f}_i^{-1}(x)) - \log \tilde{Z}_i(u) + \sum_{j=1}^{k} \tilde{T}_{i,j}(\tilde{f}_i^{-1}(x))\, \tilde{\lambda}_{i,j}(u) \Big). (46)

Each side of the equation contains nk unknown parameters in T_{i,j} and \tilde{T}_{i,j} respectively, since they are summed over n latent variables and k sufficient statistics per latent variable. Each side of the equation also has nk unknown parameters in \lambda_{i,j} and \tilde{\lambda}_{i,j}. Therefore a system of equations is created for nk + 1 different points u^{(0)}, \ldots, u^{(nk)}. In time series data divided into segments, this step may intuitively be thought of as calculating the probability of seeing a given sample in each of nk segments. It can also be seen as a consequence of (44), where we have equality between the distributions for all choices of u. Thus we get the following system of equations:

\log \tilde{p}_{T,\lambda,f,u^{(l)}}(x) = \log \tilde{p}_{\tilde{T},\tilde{\lambda},\tilde{f},u^{(l)}}(x), \quad l = 0, \ldots, nk. (47)

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
A neat trick is used to simplify this system of equations: any of the nk + 1 equations may be used as a pivot, and we shall simply use u^{(0)}. By pivot it is understood that we consider a ratio of pdfs, or in this case, since we are dealing with logarithms, a difference of log pdfs. This is the motivation for using nk + 1 points in our system of equations, as we use one equation to pivot such that we end up with a system of nk equations. The consequence of this choice is that our equations no longer express how likely a sample is for a given u^{(l)}, but rather how likely it is compared to u^{(0)}. This is of little importance, as we are interested in the relation between the parameters of the models and not the exact likelihood of seen samples. Therefore the system of equations becomes

\log \tilde{p}_{T,\lambda,f,u^{(l)}}(x) - \log \tilde{p}_{T,\lambda,f,u^{(0)}}(x) = \log \tilde{p}_{\tilde{T},\tilde{\lambda},\tilde{f},u^{(l)}}(x) - \log \tilde{p}_{\tilde{T},\tilde{\lambda},\tilde{f},u^{(0)}}(x), \quad l = 1, \ldots, nk. (48)

By eliminating terms we get rid of \log Q_i(f_i^{-1}(x)) and, interestingly, \log vol\, J_{f^{-1}}(x), which is typically notoriously difficult to evaluate. Therefore these equations can be reduced to

\sum_{i=1}^{n} \sum_{j=1}^{k} T_{i,j}(f_i^{-1}(x)) \big( \lambda_{i,j}(u^{(l)}) - \lambda_{i,j}(u^{(0)}) \big) - \sum_{i=1}^{n} \log \frac{Z_i(u^{(l)})}{Z_i(u^{(0)})} = \sum_{i=1}^{n} \sum_{j=1}^{k} \tilde{T}_{i,j}(\tilde{f}_i^{-1}(x)) \big( \tilde{\lambda}_{i,j}(u^{(l)}) - \tilde{\lambda}_{i,j}(u^{(0)}) \big) - \sum_{i=1}^{n} \log \frac{\tilde{Z}_i(u^{(l)})}{\tilde{Z}_i(u^{(0)})}. (49)

By factoring terms and distributing sums for an arbitrary point u^{(l)}, and noting that T_{i,j} and \lambda_{i,j} are elements of the tall vectors T and \lambda, the first term can be recognized as the inner product between T(f^{-1}(x)) and \bar{\lambda}(u^{(l)}), where \bar{\lambda}(u^{(l)}) is defined as

\bar{\lambda}(u^{(l)}) = \lambda(u^{(l)}) - \lambda(u^{(0)}). (50)

Therefore (49) can be written as

\langle T(f^{-1}(x)),\, \bar{\lambda}(u^{(l)}) \rangle = \langle \tilde{T}(\tilde{f}^{-1}(x)),\, \bar{\tilde{\lambda}}(u^{(l)}) \rangle + b_l, (51)

where b_l collects the normalizing constants, b_l = \sum_{i=1}^{n} \log \frac{Z_i(u^{(l)})\, \tilde{Z}_i(u^{(0)})}{Z_i(u^{(0)})\, \tilde{Z}_i(u^{(l)})}. Across all nk equations T(f^{-1}(x)) is the same; therefore we can collect all the equations in a single matrix product by defining the nk \times nk matrices L = \big( \bar{\lambda}(u^{(1)}), \ldots, \bar{\lambda}(u^{(nk)}) \big) and \tilde{L} = \big( \bar{\tilde{\lambda}}(u^{(1)}), \ldots, \bar{\tilde{\lambda}}(u^{(nk)}) \big), and the vector b = (b_1, \ldots, b_{nk})^T, such that

L^T\, T(f^{-1}(x)) = \tilde{L}^T\, \tilde{T}(\tilde{f}^{-1}(x)) + b. (52)

In the final step we assume that the true matrix of natural parameters, L, is invertible to obtain the following result:

T(f^{-1}(x)) = A\, \tilde{T}(\tilde{f}^{-1}(x)) + c, (53)

where A = (L^T)^{-1} \tilde{L}^T and c = (L^T)^{-1} b. Thus we see that the sufficient statistics of the true latent variables are a linear transformation of those of the recovered latent variables. The last step is to prove an equivalence relation such that the opposite is also true: that the recovered latents are also a linear transformation of the true latent variables.
To do so it is assumed that the Jacobian of T exists and has full rank n. Differentiating the linear relation above with respect to x therefore gives

J_{T \circ f^{-1}}(x) = A\, J_{\tilde{T} \circ \tilde{f}^{-1}}(x).

By using the following inequality for the rank of a matrix product,

rank(AB) \leq \min\big( rank(A),\, rank(B) \big),

and noting that the left-hand side J_{T \circ f^{-1}} has rank n, we may deduce that the rank of both A and J_{\tilde{T} \circ \tilde{f}^{-1}} is at least n. Since J_{\tilde{T} \circ \tilde{f}^{-1}} is an nk \times n matrix, we can conclude that it exists and has full rank. If k = 1, then A will be a square n \times n matrix with full rank and thus invertible, such that the linear relation can be shown to hold in both directions.
For k > 1 the matrix A must be invertible in order to establish the equivalence relation. In the following we show how A is invertible under the assumption that each latent variable follows a strongly exponential distribution. A strongly exponential distribution is one that almost surely contains the exponent and thus cannot be reduced to the base measure. Formally:

\big( \exists\, \theta \;|\; \langle T(x), \theta \rangle = const \;\; \forall x \in X \big) \implies \big( \theta = 0 \;\; or \;\; \Lambda(X) = 0 \big), (65)

where \Lambda denotes the Lebesgue measure. This means that the exponent of a strongly exponential distribution only reduces to a constant if \theta = 0, in which case the inner product becomes zero, \langle T(x), 0 \rangle = 0, or if the set X has Lebesgue measure 0. The following three lemmas are used to derive useful properties of the derivative of the sufficient statistic, T'(x), of a strongly exponential distribution, which are of relevance for the Jacobian matrix. The dimension, k, of all considered distributions is assumed minimal. That is, the distributions cannot be rewritten with a k' < k.

a: Lemma 1
Consider an exponential family distribution with k \geq 2 components. [...] The components of the sufficient statistic T are linearly independent.
If the components of T were not linearly independent, then one of the components, T_k(x), could be written as a combination of the remaining components,

T_k(x) = \sum_{i=1}^{k-1} a_i\, T_i(x),

for an a \neq 0. If that were possible, we would have contradicted the assumption that the dimension of the distribution, k, is minimal.

b: Lemma 2
Consider a strongly exponential family distribution such that its sufficient statistic $T$ is differentiable almost surely. Then $T_i' \neq 0$ almost everywhere on $\mathbb{R}$ for all $1 \le i \le k$.

We provide an alternative proof to the original, simply because we used this alternative to verify our understanding of the original proof. If we consider an exponential distribution that is not strongly exponential, then we necessarily have
$$\exists\, \theta \neq 0 \mid \forall x \in \mathcal{X}: \langle T(x), \theta\rangle = \text{const}, \quad \Lambda(\mathcal{X}) > 0. \tag{67}$$
The derivative would then become
$$\langle T'(x), \theta\rangle = \sum_{i=1}^{k} T_i'(x)\,\theta_i = 0.$$
Thus, for an exponential distribution that is not strongly exponential, the derivative of the exponent must equal zero. This can be achieved in several ways: $\theta = 0$, $T'(x) = 0$, or their weighted sum equal to zero. For a strongly exponential distribution the exponent can only equal a constant if $\theta = 0$ (see (65)), and therefore the derivative can also only be zero if $\theta = 0$. From Lemma 1 we have that the components of the sufficient statistic cannot be written as functions of each other. It follows that $T'(x) \neq 0$ and, moreover, $T_i'(x) \neq 0$ for every $i$: if any $T_i'(x)$ were equal to zero, the corresponding $\theta_i$ could be an arbitrary nonzero number while the rest of $\theta$ is zero, so that $\theta \neq 0$ yet the derivative equals zero, which violates the statement that the distribution is strongly exponential. Thus, we may conclude that for a strongly exponential distribution $T_i'(x)$ must be different from zero.
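A concrete example, chosen by us for illustration, is the Gaussian, whose sufficient statistic $T(x) = (x, x^2)$ gives $T'(x) = (1, 2x)$. The sketch below checks numerically that $\langle T'(x), \theta\rangle = 0$ across sampled points admits only the trivial solution $\theta = 0$, consistent with the strongly exponential property, and that the components of $T'$ are nonzero almost everywhere:

```python
import numpy as np

rng = np.random.default_rng(2)

# Gaussian sufficient statistic T(x) = (x, x^2), so T'(x) = (1, 2x).
xs = rng.normal(size=100)
Tprime = np.stack([np.ones_like(xs), 2 * xs], axis=1)  # 100 x 2

# Full rank means the only theta with <T'(x), theta> = 0 at all
# sampled points is theta = 0, i.e. the null space of Tprime is {0}.
assert np.linalg.matrix_rank(Tprime) == 2

# T_1'(x) = 1 never vanishes; T_2'(x) = 2x vanishes only at the
# single point x = 0, a Lebesgue-null set.
assert np.all(Tprime[:, 0] != 0)
```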

c: Lemma 3
Consider a strongly exponential distribution of size $k \ge 2$ with sufficient statistic $T(x) = (T_1(x), \dots, T_k(x))$. Further assume that $T$ is differentiable almost everywhere. Then there exist $k$ distinct points $x_1, \dots, x_k$ such that $(T'(x_1), \dots, T'(x_k))$ are linearly independent in $\mathbb{R}^k$.

Recall that in order for the distribution to be strongly exponential, the only choice of parameter that can make the exponent constant for all $x$ is $\theta = 0$. Since both $T'(x)$ and $\theta$ lie in $\mathbb{R}^k$, this necessarily means that the vectors $T'(x)$ must span all of $\mathbb{R}^k$. That is, there exist at least $k$ points $x_1, \dots, x_k$ such that the matrix
$$B = \big[\,T'(x_1)\ \ T'(x_2)\ \ \dots\ \ T'(x_k)\,\big]$$
has full rank. If $\operatorname{rank}(B) \neq k$, then the nullity of $B$ is at least 1, and any vector from the orthogonal complement of the column space of $B$ can be picked as $\theta^*$, such that $\theta^* \neq 0$ and $\langle T'(x), \theta^*\rangle = 0$ for all $x$. However, in that case the distribution is not strongly exponential, since Lemma 2 shows that only a distribution which is not strongly exponential can have $\langle T'(x), \theta\rangle = 0$ for $\theta \neq 0$. Therefore, in a strongly exponential distribution there must exist $k$ points $x_1, \dots, x_k$ such that the column vectors of $B$ are linearly independent.
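Continuing the Gaussian illustration ($k = 2$, $T'(x) = (1, 2x)$), any two distinct points already give linearly independent derivative vectors, so a full-rank $B$ as in Lemma 3 exists. The two points below are arbitrary choices for the sketch:

```python
import numpy as np

# For the Gaussian, T(x) = (x, x^2) and T'(x) = (1, 2x).
def T_prime(x):
    return np.array([1.0, 2.0 * x])

# Any two distinct points x1 != x2 give linearly independent columns:
# det(B) = 2*x2 - 2*x1 != 0.
x1, x2 = -0.5, 1.3
B = np.column_stack([T_prime(x1), T_prime(x2)])  # k x k with k = 2

assert np.linalg.matrix_rank(B) == 2
assert abs(np.linalg.det(B)) > 0
```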
These three lemmas establish the important property that for univariate exponential distributions which are minimal in $k$ and strongly exponential, there exist at least $k$ points $x_1, \dots, x_k$ such that the vectors $T'(x_1), \dots, T'(x_k)$ are linearly independent. We can now use this to show that the $nk \times nk$ matrix $A$ in
$$J_{T\circ f^{-1}}(x) = A\, J_{\tilde{T}\circ \tilde{f}^{-1}}(x) \tag{70}$$
is invertible, under the assumption that the $nk \times n$ Jacobian matrix of $T(f^{-1}(x))$, $J_{T\circ f^{-1}}(x)$, exists and has rank $n$. To make the proof easier to follow, we first examine the form this Jacobian takes. Remembering that $f^{-1}$ maps $x$ to $\mathbb{R}^n$, so that $T$ is a function of the $n$ (latent) variables $f_1^{-1}(x), \dots, f_n^{-1}(x)$, and that each $T_i$ depends only on the $i$-th of these, the chain rule gives
$$J_{T\circ f^{-1}}(x) = J_T\big(f^{-1}(x)\big)\, J_{f^{-1}}(x),$$
where $J_T(z)$ is an $nk \times n$ block diagonal matrix whose $i$-th diagonal block is the $k$-vector $T_i'(z_i)$. Evaluating (70) at $k$ suitably chosen points $x^{(1)}, \dots, x^{(k)}$ and concatenating the resulting Jacobians yields $nk \times nk$ matrices $Q$ and $\tilde{Q}$ with $Q = A\tilde{Q}$, where $Q$ is (up to a permutation of its columns) block diagonal. A block diagonal matrix is invertible if all its diagonal blocks are invertible. Since the points $x^{(1)}, \dots, x^{(k)}$ are chosen as in Lemma 3, every diagonal block of $Q$ is exactly the matrix $B$ from Lemma 3 (one block for each of the $n$ latent variables). Every diagonal block of $Q$ is therefore invertible, because $B$ has full rank as proven in Lemma 3, and thus $Q$ is also invertible. Since $Q = A\tilde{Q}$ with $Q$ invertible, we can write
$$nk = \operatorname{rank}(Q) \le \min\big(\operatorname{rank}(A), \operatorname{rank}(\tilde{Q})\big),$$
which means that both $A$ and $\tilde{Q}$ are invertible. Since $A$ is invertible, we have proven the equivalence relation for $k \ge 1$.
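The block diagonal argument can be sketched numerically (all matrices below are synthetic stand-ins, with the same $B$ as in the Gaussian illustration): a block diagonal matrix built from full-rank blocks is invertible, and an invertible product $Q = A\tilde{Q}$ forces $A$ to be invertible as well:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 3, 2

# Each diagonal block plays the role of B from Lemma 3: T'(x) = (1, 2x)
# evaluated at the two distinct points x = -0.5 and x = 1.3.
B = np.array([[1.0, 1.0],
              [-1.0, 2.6]])
Q = np.kron(np.eye(n), B)  # nk x nk block diagonal, one block per latent

# Block diagonal with invertible blocks => invertible.
assert np.linalg.matrix_rank(B) == k
assert np.linalg.matrix_rank(Q) == n * k

# If Q = A Q_tilde and Q is invertible, rank(Q) <= rank(A) forces A
# to have full rank nk.
A = rng.normal(size=(n * k, n * k))          # synthetic invertible A
Q_tilde = np.linalg.solve(A, Q)              # Q_tilde consistent with Q = A Q_tilde
assert np.allclose(A @ Q_tilde, Q)
assert np.linalg.matrix_rank(Q) <= np.linalg.matrix_rank(A)
```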

APPENDIX. EXPERIMENT DETAILS
In this section we provide the experimental setup used to achieve the reported results. To monitor training, we logged metrics of interest, such as the loss, the MCC, and the percentage of generated images classified as real, every $n$ iterations. In addition, images of the generated output were saved along with plots of the magnitude of the gradients at each hidden dimension. To visualize and draw meaningful conclusions from all the logged data, we created an interactive plot that simultaneously shows interactive graphs of the logged data, meaning that all graphs can be zoomed, dragged, and scaled, and individual points, as well as the saved images, can be inspected at cursor hover. Since every datapoint and every saved image is associated with an iteration, whenever the mouse hovers over a datapoint the plot updates to show the images and gradients of that particular iteration, as can be seen in FIGURE 4 and 5.

a: Hyperparameter tuning
Adversarial training is known to require extensive hyperparameter tuning, since a balance must be struck between the generator and the discriminator: if either is too complex or too simple, training collapses. Our approach was to use the same decoder size as the iVAE model, because this allows a direct comparison of the two models and it was clear that the iVAE decoder was sufficiently complex to represent the data faithfully. In this way the hyperparameter tuning was simplified to finding an adequate discriminator.

FIGURE 4: Interactive plots with updating images on hover
FIGURE 5: Interactive plots with updating images on hover

By sweeping over different discriminator sizes, a discriminator size that would not collapse training and that produced faithful reconstructions of the data could be found for each number of observations per segment. The resulting discriminator sizes are those seen in TABLE 1. Batch size and discriminator size are also closely related, as the batch size determines how many samples are presented to the discriminator; larger batch sizes therefore tend to require a larger discriminator. We found no trivial tendency such as, e.g., doubling the discriminator size whenever the batch size is doubled. In practice, a batch size that seemed appropriate for the used device was chosen first, and the above-mentioned sweep over discriminator sizes was then performed.
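The sweep described above can be sketched as follows. Note that `train_fn` and the collapse criterion are hypothetical stand-ins for the actual training loop and monitoring metrics, not our implementation:

```python
# Hedged sketch of the discriminator-size sweep: try increasing sizes
# until a run does not collapse. All names here are illustrative.
def sweep_discriminator_sizes(sizes, train_fn, collapse_check):
    """Return the first discriminator size whose run does not collapse."""
    for size in sizes:
        metrics = train_fn(discriminator_size=size)
        if not collapse_check(metrics):
            return size, metrics
    return None, None

# Toy stand-in so the sketch is runnable: pretend sizes below 64
# collapse (discriminator too weak to balance the generator).
def fake_train(discriminator_size):
    return {"collapsed": discriminator_size < 64, "size": discriminator_size}

size, metrics = sweep_discriminator_sizes(
    sizes=[16, 32, 64, 128],
    train_fn=fake_train,
    collapse_check=lambda m: m["collapsed"],
)
```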

b: Data generation
The data was generated using the same data generator as in [8,7]. The parameters of the data generator are summarized in TABLE 2. The same parameters are used for the generated data in all three models.
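The generator of [8,7] is commonly described as drawing nonstationary, segment-wise conditionally independent Gaussian latents whose scale depends on the segment label, then mixing them through a randomly initialized leaky-ReLU network. The sketch below is a minimal stand-in under that assumption; the function name, network shape, and all parameter values are illustrative, not the TABLE 2 settings:

```python
import numpy as np

def generate_segment_data(n_segments=5, per_segment=100, n=2, seed=0):
    """Sketch: segment-dependent Gaussian latents mixed by a random
    leaky-ReLU network. Returns observations x, latents z, labels u."""
    rng = np.random.default_rng(seed)
    scales = rng.uniform(0.5, 3.0, size=(n_segments, n))  # per-segment std
    W1 = rng.normal(size=(n, n))  # random mixing weights
    W2 = rng.normal(size=(n, n))
    xs, zs, us = [], [], []
    for u in range(n_segments):
        z = rng.normal(size=(per_segment, n)) * scales[u]
        h = np.maximum(0.2 * (z @ W1), z @ W1)  # leaky ReLU
        xs.append(h @ W2)
        zs.append(z)
        us.append(np.full(per_segment, u))
    return np.concatenate(xs), np.concatenate(zs), np.concatenate(us)

x, z, u = generate_segment_data()
```

The segment label `u` is exactly the additionally observed variable that the identifiability theory conditions on.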