Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models

Deep generative models are a class of techniques that train deep neural networks to model the distribution of training samples. Research has fragmented into various interconnected approaches, each of which makes trade-offs including run-time, diversity, and architectural restrictions. In particular, this compendium covers energy-based models, variational autoencoders, generative adversarial networks, autoregressive models, and normalizing flows, in addition to numerous hybrid approaches. These techniques are compared and contrasted, explaining the premises behind each and how they are interrelated, while reviewing current state-of-the-art advances and implementations.


INTRODUCTION
Generative modelling using neural networks has its origins in the 1980s, with the aim of learning about data without supervision, potentially providing benefits for standard classification tasks. Collecting training data for unsupervised learning is naturally much cheaper and lower effort than collecting labelled data, yet unlabelled data still carries considerable information, making it clear that generative models can be beneficial for a wide variety of applications.
Beyond this, generative modelling has numerous direct applications including image synthesis: super-resolution, text-to-image and image-to-image conversion, inpainting, attribute manipulation, pose estimation; video: synthesis and retargeting; audio: speech and music synthesis; text: summarisation and translation; reinforcement learning; computer graphics: rendering, texture generation, character movement, liquid simulation; medical: drug synthesis, modality conversion; and out-of-distribution detection.
The central idea of generative modelling is to train a generative model whose samples x̃ ∼ p_θ(x̃) come from the same distribution as the training data, x ∼ p_d(x). The earliest neural generative models, energy-based models, achieved this by defining an energy function on data points corresponding to an unnormalised negative log-likelihood; however, these struggled to scale to complex high dimensional data such as natural images, and require Markov Chain Monte Carlo (MCMC) sampling during both training and inference, a slow iterative process. In recent years there has been renewed interest in generative models driven by the advent of large freely available datasets as well as advances in both general deep learning architectures and generative models, breaking new ground in terms of visual fidelity and sampling speed. In many cases, this has been achieved using latent variables z which are easy to sample from and/or calculate the density of, instead learning p(x, z); this requires marginalisation over the unobserved latent variables, which in general is intractable. Generative models therefore typically make trade-offs in execution time or architecture, or optimise proxy functions. Choosing what to optimise for has implications for sample quality, with direct likelihood optimisation often leading to worse sample quality than alternatives. Interrelated with generative models is the field of self-supervised learning, where the focus is on learning good intermediate representations that can be used for downstream tasks without supervision [106]. As such, generative models can in general also be considered self-supervised; however, not all self-supervised models are generative models.
Types of self-supervised objectives include auxiliary classification losses such as predicting the rotation of inputs, masked losses where the model must predict the true value of some inputs which have been masked out, and contrastive losses which learn an embedding space where similar data points are close and different points are far apart.
There exists a variety of survey papers focusing on particular generative models such as normalizing flows [126], [177], generative adversarial networks [71], [251], and energy-based models [204], however, naturally these dive into the intricacies of their respective method rather than comparing with other methods; additionally, some focus on applications rather than theory. While there exists a recent survey on generative models as a whole [174], it is less broad, diving deeply into a few specific implementations.
This survey provides a comprehensive overview of generative modelling trends, introducing new readers to the field, comparing and contrasting so as to explain the modelling decisions behind each respective technique. Additionally, advances old and new are discussed in order to bring the reader up to date with current research. A specific focus on image models is taken, reflecting their predominance in the literature; however, concepts are often relevant across modalities. In particular, this survey covers energy-based models (unnormalised density models); variational autoencoders (variational approximation of a latent-based model's posterior); generative adversarial networks (two models set in a mini-max game); autoregressive models (which model data decomposed as a product of conditional probabilities); and normalizing flows (exact likelihood models using invertible transformations). This breakdown is defined to closely match the typical divisions within research; however, numerous hybrid approaches exist that blur these lines, and these are discussed in the most relevant section, or in both where suitable.
For a brief insight into the differences between architectures, we provide Table 1, which contrasts a diverse array of techniques. For the column "Exact Density", ✓ represents tractable densities, (✓) approximate densities, and ✗ intractable densities. On a number of the properties assessed we use a star system to allow easy comparisons, with rules defined in Table 2 based on CIFAR-10. In particular, we acknowledge that ranking measures such as training speed in days can be considered anecdotal since they depend on the year and the compute available. Nevertheless, this allows a comparison based on properties such as stability and convergence rates which cannot be easily judged, for instance, by simply looking at the number of function evaluations per iteration.

arXiv:2103.04922v4 [cs.LG] 28 Mar 2022

Table 1: Comparison between deep generative models in terms of training and test speed, parameter efficiency, sample quality, sample diversity, and ability to scale to high resolution data. Quantitative evaluation is reported on the CIFAR-10 dataset [127] in terms of Fréchet Inception Distance (FID) and negative log-likelihood (NLL) in bits-per-dimension (BPD).

[Rating columns Train Speed, Resolution Scaling, and Free-form Jacobian: entries not recoverable]

Method                                 Exact Density   FID      NLL (BPD)
Variational Autoencoders
  Convolutional VAE [123]              (✓)             106.37   ≤ 4.54
  Variational Lossy AE [29]            (✓)             -        ≤ 2.95
  VQ-VAE [184], [235]                  (✓)             -        ≤ 4.67
  VD-VAE [31]                          (✓)             -        ≤ 2.87
Autoregressive Models
  PixelRNN [234]                       ✓               -        3.00
  Gated PixelCNN [233]                 ✓               65.93    3.03
  PixelIQN [173]                       ✓               49.46    -
  Sparse Trans. + DistAug [32], [110]  ✓               14.74

ENERGY-BASED MODELS
Energy-based models (EBMs) [133] are based on the observation that any probability density function p(x) for x ∈ R^D can be expressed in terms of an energy function E(x) : R^D → R which associates realistic points with low values and unrealistic points with high values,
\[ p(x) = \frac{e^{-E(x)}}{\int_{\tilde{x}} e^{-E(\tilde{x})} \, d\tilde{x}}. \tag{1} \]
Modelling data in such a way offers a number of perks, namely the simplicity and stability associated with training a single model; the use of a shared set of features, thereby minimising required parameters; and the lack of any prior assumptions, which eliminates related bottlenecks [46]. Despite these benefits, scaling to high dimensional data is difficult, although recent advances have made substantial strides.
A key issue with EBMs is how to optimise them; since the denominator in Eqn. 1 is intractable for most models, a popular proxy objective is contrastive divergence, where energy values of data samples are 'pushed' down while samples from the energy distribution are 'pushed' up. Formally, the gradient of the negative log-likelihood loss L(θ) = E_{x∼p_d}[− ln p_θ(x)] has been shown to approximately satisfy the following property [23], [208],
\[ \nabla_\theta L(\theta) \approx \mathbb{E}_{x^+ \sim p_d}\left[ \nabla_\theta E_\theta(x^+) \right] - \mathbb{E}_{x^- \sim p_\theta}\left[ \nabla_\theta E_\theta(x^-) \right], \tag{2} \]
where x⁻ ∼ p_θ is a sample from the EBM found through a Markov Chain Monte Carlo (MCMC) generating procedure.

Early Energy-Based Models
Before moving to recent advances, we start with some of the earliest neural generative models.

Boltzmann Machines
A Boltzmann machine [83] is a fully connected undirected network of binary neurons (Fig. 1a) that are turned on with a probability determined by a weighted sum of their inputs, i.e. for some state s_i, p(s_i = 1) = σ(Σ_j w_{i,j} s_j). The neurons are divided into visible units v and hidden units h, and the energy of a joint state is
\[ E_\theta(v, h) = -\frac{1}{2} v^\top L v - \frac{1}{2} h^\top J h - v^\top W h, \tag{3} \]
where W, L, and J are symmetrical learned weight matrices. In order to train Boltzmann machines via contrastive divergence, equilibrium states are found via Gibbs sampling; however, this takes an amount of time exponential in the number of hidden units, making scaling impractical.

Restricted Boltzmann Machines
Many of the issues associated with Boltzmann machines can be overcome by restricting their connectivity. One approach, known as the restricted Boltzmann machine (RBM) [84], is to remove connections between units in the same group (Fig. 1b), allowing the hidden unit probabilities to be calculated exactly given the visible units. Although obtaining negative samples still requires Gibbs sampling, it can be parallelised, and in practice a single step is sufficient if v is initially sampled from the dataset [84]. By stacking RBMs, using features from lower down as inputs for the next layer, more powerful functions can be learned; these models are known as deep belief networks [85]. Training an entire model at once is intractable, so instead they are trained greedily layer by layer, composing densities and thus improving the approximation of p(v).
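As an illustration of the single-step procedure described above, the following is a minimal sketch of one contrastive divergence (CD-1) weight update for a tiny RBM. It assumes a bias-free RBM with energy E(v, h) = −vᵀWh, and the function names (`hidden_probs`, `visible_probs`, `cd1_update`) are hypothetical names chosen for this example rather than anything from the cited works.

```python
import math
import random


def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def hidden_probs(v, W):
    # p(h_j = 1 | v): exact, because hidden units are conditionally
    # independent once within-group connections are removed.
    return [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(W[0]))]


def visible_probs(h, W):
    # p(v_i = 1 | h), by the same argument.
    return [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(W))]


def sample_binary(probs, rng):
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(v_data, W, lr, rng):
    """Return a new weight matrix after one CD-1 step (W is not mutated)."""
    ph_data = hidden_probs(v_data, W)            # positive phase, from data
    h = sample_binary(ph_data, rng)
    v_model = sample_binary(visible_probs(h, W), rng)  # single Gibbs step
    ph_model = hidden_probs(v_model, W)          # negative phase, from model
    # Lower the energy of the data state, raise it for the model sample.
    return [[W[i][j] + lr * (v_data[i] * ph_data[j] - v_model[i] * ph_model[j])
             for j in range(len(W[0]))]
            for i in range(len(W))]
```

Calling `cd1_update` repeatedly over visible vectors drawn from a dataset gives a basic training loop; bias terms and mini-batching are omitted here for brevity.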

Deep EBMs via Contrastive Divergence
To train more powerful architectures through contrastive divergence, one must be able to efficiently sample from p_θ. Specifically, we would like to model high dimensional data using an energy function with a deep neural network, taking advantage of recent advances in discriminative models [253]. MCMC methods such as random walk and Gibbs sampling [85], when applied to high dimensional data, have long mixing times, making them impractical. A number of recent approaches [46], [249] have advocated the use of stochastic gradient Langevin dynamics [188], [245], which permits sampling through the following iterative process,
\[ x_0 \sim p_0(x), \qquad x_{i+1} = x_i - \frac{\alpha}{2} \frac{\partial E_\theta(x_i)}{\partial x_i} + \epsilon, \tag{4} \]
where ε ∼ N(0, αI), p_0(x) is typically a uniform distribution over the input domain, and α is the step size. As the number of updates N → ∞ and α → 0, the distribution of samples converges to p_θ [245]; however, α and the variance of ε are often tweaked independently to speed up training. While Langevin MCMC is more practical than other approaches, sampling still requires a large number of steps. One solution is to use persistent contrastive divergence [46], [216], where a replay buffer stores previously generated samples that are randomly reset to noise; this allows samples to be continually refined with a relatively small number of steps while maintaining diversity. Short-run MCMC [166], which samples using as few as 100 update steps from noise, has also been used to train deep EBMs; however, since the number of steps is so small, samples are not truly from the correct probability density. Nevertheless, there are other advantages such as allowing image interpolation and reconstruction (since short-run MCMC does not mix) [167]. Other approaches include initialising MCMC chains with data points [249] and samples from an implicit generative model [248], as well as adversarially training an implicit generative model, mitigating mode collapse somewhat by maximising its entropy [66], [121], [130].
Improved/augmented MCMC samplers with neural networks can also improve the efficiency of sampling [63], [89], [135], [201], [217].
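The Langevin sampling procedure discussed in this section can be sketched on a toy problem. The example below assumes a one-dimensional quadratic energy E(x) = (x − µ)²/(2s²), whose density is the Gaussian N(µ, s²), and applies the update x ← x − (α/2)∇E(x) + ε with ε ∼ N(0, α), starting from a uniform initial distribution. The energy, step size, and chain counts are illustrative assumptions, not settings from the cited works.

```python
import math
import random


def grad_energy(x, mu=2.0, s=1.0):
    # Gradient of the toy energy E(x) = (x - mu)^2 / (2 s^2).
    return (x - mu) / (s * s)


def langevin_sample(grad_e, n_steps, alpha, rng, low=-5.0, high=5.0):
    # p_0 is uniform over [low, high]; each step follows the negative
    # energy gradient and injects Gaussian noise with variance alpha.
    x = rng.uniform(low, high)
    noise_std = math.sqrt(alpha)
    for _ in range(n_steps):
        x = x - 0.5 * alpha * grad_e(x) + rng.gauss(0.0, noise_std)
    return x


def run_chains(n_chains=2000, n_steps=200, alpha=0.1, seed=0):
    rng = random.Random(seed)
    return [langevin_sample(grad_energy, n_steps, alpha, rng)
            for _ in range(n_chains)]


if __name__ == "__main__":
    samples = run_chains()
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    print("sample mean:", mean, "sample variance:", var)
```

With a finite step size the chain is slightly biased (its stationary variance is a little above s²), which reflects the statement that exact convergence requires α → 0.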
One application of EBMs of this form comes by using standard classifier architectures, f_θ : R^D → R^K, which map data points to logits used by a softmax function to compute p_θ(y|x). By marginalising out y, these logits can be used to define an energy model that can be simultaneously trained as both a generative and classification model [65],
\[ p_\theta(x) = \sum_y p_\theta(x, y) = \frac{\sum_y \exp(f_\theta(x)[y])}{Z(\theta)}, \qquad E_\theta(x) = -\ln \sum_y \exp(f_\theta(x)[y]). \tag{5} \]
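A small sketch of how classifier logits can double as an energy may help here. It assumes the energy is defined as the negative log-sum-exp of the logits, E(x) = −ln Σ_y exp(f(x)[y]); the helper names are hypothetical and the logits are supplied directly rather than produced by a network.

```python
import math


def logsumexp(values):
    # Stable computation of ln(sum(exp(v))).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def class_probs(logits):
    # Softmax: p(y | x) from the classifier logits.
    lse = logsumexp(logits)
    return [math.exp(l - lse) for l in logits]


def energy(logits):
    # Energy of x obtained by marginalising the labels out of the logits.
    return -logsumexp(logits)


if __name__ == "__main__":
    logits = [1.0, 2.0, 3.0]
    shifted = [l + 1.0 for l in logits]
    print(class_probs(logits), energy(logits))
    print(class_probs(shifted), energy(shifted))
```

Adding a constant to every logit leaves p(y|x) unchanged but lowers the energy by that constant, which is the spare degree of freedom that lets the same network carry a density over x.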

Score Matching and Denoising Diffusion
Although Langevin MCMC has allowed EBMs to scale to high dimensional data, training times are still slow due to the need to sample from the model distribution; additionally, the finite nature of the sampling process means that samples can be arbitrarily far away from the model's distribution [64]. An alternative approach is score matching [101], which is based on the idea of minimising the difference between the derivatives of the data and model's log-density functions. The score function is defined as s(x) = ∇_x ln p(x), which does not depend on the intractable denominator and can therefore be applied to build an energy model [209] by minimising the Fisher divergence between p_d and p_θ,
\[ L = \frac{1}{2} \mathbb{E}_{p_d(x)}\left[ \| s_\theta(x) - s_d(x) \|_2^2 \right]; \tag{6} \]
however, the score function of the data, s_d(x), is usually not available. Various methods exist to estimate the score function including spectral approximation [196], sliced score matching [203], finite difference score matching [176], and notably denoising score matching [239], which allows the score to be approximated using corrupted data samples q(x̃|x). In particular, when q = N(x̃|x, σ²I), Eqn. 6 simplifies to
\[ L = \frac{1}{2} \mathbb{E}_{p_d(x)} \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)}\left[ \left\| s_\theta(\tilde{x}) + \frac{\tilde{x} - x}{\sigma^2} \right\|_2^2 \right]. \tag{7} \]
That is, s_θ learns to estimate the noise, thereby allowing it to be used as a generative model [192], [202]. Since the Langevin update step uses ∇_x ln p(x), it is possible to sample from a score matching model using Langevin dynamics [226]. This is only possible, however, when trained over a large variety of noise levels so that x̃ covers the whole space.
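To make the denoising score matching objective concrete, the sketch below fits a one-parameter linear score model s_a(x̃) = −a·x̃ to one-dimensional data x ∼ N(0, 1) corrupted with Gaussian noise of standard deviation σ. The regression target is the score of the corruption kernel, −(x̃ − x)/σ². The toy data distribution, the linear model, and the closed-form least-squares solution are all assumptions made for illustration; in this setting the noisy data follow N(0, 1 + σ²), so the true score has slope a = 1/(1 + σ²).

```python
import random


def fit_linear_score(sigma=1.0, n=50000, seed=0):
    """Least-squares minimiser of the denoising score matching loss
    for the model s_a(x_noisy) = -a * x_noisy."""
    rng = random.Random(seed)
    num = 0.0
    den = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)                      # clean data sample
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)    # corrupted sample
        target = -(x_noisy - x) / sigma ** 2         # score of q(x_noisy | x)
        # Minimising (-a * x_noisy - target)^2 over a gives
        # a = sum(x_noisy * (-target)) / sum(x_noisy^2).
        num += x_noisy * (-target)
        den += x_noisy * x_noisy
    return num / den


if __name__ == "__main__":
    sigma = 1.0
    print("fitted slope:", fit_linear_score(sigma))
    print("true slope:  ", 1.0 / (1.0 + sigma ** 2))
```

The fitted slope approaches that of the noisy data's true score even though the data score itself is never evaluated, which is the point of the denoising formulation.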

Denoising Diffusion Probabilistic Models
Closely related are diffusion models [1], [11], [87], [199], which gradually destroy data x_0 by adding noise over a fixed number of steps T using a noise schedule β_{1:T} determined so that x_T is approximately normally distributed. The forward process is defined by a discrete Markov chain,
\[ q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1}), \qquad q(x_t | x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right). \tag{8} \]
The parameterised reverse process is trained to gradually remove noise, i.e. approximate p_θ(x_{t−1}|x_t), by optimising a re-weighted variant of the ELBO, similar to Eqn. 7. Diffusion models have also been applied to categorical data; multinomial diffusions [93] define a forward process where each discrete variable switches randomly to a different value and the reverse process is trained to approximate the noise. Self-supervised language models such as BERT [41] have similar training objectives: variables are randomly masked out and the model is trained to predict the original values; these models can be viewed as Markov random fields and sampled using Gibbs/Metropolis-Hastings via iterative sampling of the masked distributions [61], [240].
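The forward noising chain can be sketched in a few lines. The example assumes the Gaussian transition x_t = √(1 − β_t)·x_{t−1} + √β_t·ε with ε ∼ N(0, 1) and a linearly increasing schedule; the schedule end points (1e-4 to 0.02 over T = 1000 steps) are an illustrative assumption, as are the function names.

```python
import math
import random


def linear_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    # Linearly spaced noise levels beta_1 .. beta_T (assumed schedule).
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]


def forward_chain(x0, betas, rng):
    # Simulate the Markov chain x_0 -> x_1 -> ... -> x_T for a scalar x_0.
    x = x0
    for beta in betas:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x


def marginal_stats(x0, betas):
    # Mean and variance of x_T given x_0, propagated step by step:
    # mean_t = sqrt(1 - beta_t) * mean_{t-1}, var_t = (1 - beta_t) * var_{t-1} + beta_t.
    mean, var = x0, 0.0
    for beta in betas:
        mean = math.sqrt(1.0 - beta) * mean
        var = (1.0 - beta) * var + beta
    return mean, var


if __name__ == "__main__":
    betas = linear_schedule()
    print(marginal_stats(3.0, betas))
    print(forward_chain(3.0, betas, random.Random(0)))
```

Under this schedule the conditional mean of x_T shrinks towards zero and its variance approaches one, so x_T is close to a standard normal regardless of x_0.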

Speeding up Sampling
Sampling from score-based models requires a large number of steps, which has led to various techniques being developed to reduce this. A simple approach is to skip steps at inference: cosine schedules [162] spend more time where larger visual changes are made, reducing the impact of skipping; another approach is to use dynamic programming to find which steps should be taken to minimise the ELBO given a computation budget [243]. Taking the continuous time limit of a diffusion model results in a stochastic differential equation (SDE); numerical solvers can therefore be used, reducing the number of steps required [108], [206]. Another proposed approach is to model noisy data points as q(x_{t−1}|x_t, x_0), allowing the generative process to skip some steps using its approximation of end samples x_0 [200].

Correcting Implicit Generative Models
While EBMs offer powerful representation ability due to unnormalized likelihoods, they can suffer from high variance training, long training and sampling times, and struggle to support the entire data space. In this section, a number of hybrid approaches are discussed which address these issues.

Exponential Tilting
To eliminate the need for an EBM to support the entire space, an EBM can instead be used to correct samples from an implicit generative network, simplifying the function to learn and allowing easier sampling. This procedure, referred to as exponentially tilting an implicit model, is defined as
\[ p_{\theta,\phi}(x) = \frac{1}{Z_{\theta,\phi}} q_\phi(x)\, e^{-E_\theta(x)}. \tag{9} \]
By parameterising q_φ(x) as a latent variable model such as a normalizing flow [3], [165] or VAE generator [247], MCMC sampling can be performed in the latent space rather than the data space. Since the latent space is much simpler, and often uni-modal, MCMC mixes much more effectively. This limits the freedom of the model, however, leading some to jointly sample in latent and data space [3], [247].

Noise Contrastive Estimation
Noise contrastive estimation [52], [75] transforms EBM training into a classification problem using a noise distribution q_φ(x) by maximising the objective
\[ \mathbb{E}_{x \sim p_d}\left[ \ln \frac{p_\theta(x)}{p_\theta(x) + q_\phi(x)} \right] + \mathbb{E}_{x \sim q_\phi}\left[ \ln \frac{q_\phi(x)}{p_\theta(x) + q_\phi(x)} \right], \tag{10} \]
where p_θ(x) = e^{−E_θ(x)−c} and c is a learned estimate of the log normalising constant. This approach can be used to train a correction via exponential tilting [165], but can also be used to directly train an EBM and normalizing flow [55]. Eqn. 10 is equivalent to GAN Equation 18; however, training formulations differ, with noise contrastive estimation explicitly modelling likelihood ratios.
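The classification view of noise contrastive estimation can be illustrated on a toy problem. The sketch assumes the sign convention p_θ(x) = exp(−E(x) − c) (low energy means high density), a fixed energy E(x) = x²/2, data drawn from N(0, 1), and a wider Gaussian N(0, 2²) as the noise distribution; only the offset c is estimated, here by a simple grid search rather than gradient ascent. All of these choices are assumptions made for illustration. Since the data are standard normal, the correct offset is ln √(2π) ≈ 0.919.

```python
import math
import random

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def log_p_model(x, c):
    # ln p_theta(x) = -E(x) - c with the toy energy E(x) = x^2 / 2.
    return -0.5 * x * x - c


def log_q_noise(x, s=2.0):
    # Log density of the noise distribution N(0, s^2).
    return -0.5 * (x / s) ** 2 - math.log(s) - LOG_SQRT_2PI


def log_sigmoid(a):
    # Stable ln(1 / (1 + exp(-a))).
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))


def nce_objective(c, data, noise):
    # ln p/(p+q) = log_sigmoid(ln p - ln q) on data samples,
    # ln q/(p+q) = log_sigmoid(ln q - ln p) on noise samples.
    pos = sum(log_sigmoid(log_p_model(x, c) - log_q_noise(x)) for x in data)
    neg = sum(log_sigmoid(log_q_noise(x) - log_p_model(x, c)) for x in noise)
    return pos / len(data) + neg / len(noise)


def estimate_log_normaliser(n=2000, seed=0):
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise = [rng.gauss(0.0, 2.0) for _ in range(n)]
    grid = [0.05 * i for i in range(41)]           # candidate c in [0, 2]
    return max(grid, key=lambda c: nce_objective(c, data, noise))


if __name__ == "__main__":
    print("estimated c:", estimate_log_normaliser())
    print("true c:     ", LOG_SQRT_2PI)
```

Maximising the objective recovers the normalising constant as an ordinary parameter, without ever computing the partition function integral.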

Alternative Training Objectives
As mentioned above, energy models trained with contrastive divergence approximately maximise the likelihood of the data; likelihood, however, does not correlate directly with sample quality [215]. Training EBMs with arbitrary f-divergences is possible, yielding improved FID scores [252]. Since score estimates have high variance, the Stein discrepancy has been proposed as an alternative objective, requiring no sampling and more closely correlating with likelihood [64]. A middle ground between denoising score matching and contrastive divergence is diffusion recovery likelihood [12], which can be optimised via a sequence of denoising EBMs conditioned on increasingly noisy samples of the data, the conditional distributions being much easier to sample from with MCMC than typical EBMs [56].

VARIATIONAL AUTOENCODERS
One of the key problems associated with energy-based models is that sampling is not straightforward and mixing can require a significant amount of time. To circumvent this issue, it would be beneficial to explicitly sample from the data distribution with a single network pass.
To this end, suppose we have a latent-based model p_θ(x|z) with prior p_θ(z) and posterior p_θ(z|x); unfortunately, optimising this model through maximum likelihood is intractable due to the integral in
\[ p_\theta(x) = \int_z p_\theta(x|z)\, p_\theta(z) \, dz. \]
Variational inference addresses this by introducing an approximation of the true intractable posterior, q_φ(z|x) = arg min_q D_KL(q_φ(z|x) || p_θ(z|x)), that allows a tractable bound on p_θ(x) to be formed. In particular, variational autoencoders amortize the inference process, that is, approximate q_φ(z|x) using a feedforward inference network, allowing scaling to large datasets [123], [187]. From the definition of KL divergence we get
\[ D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) = \mathbb{E}_{q_\phi(z|x)}\left[ \ln \frac{q_\phi(z|x)}{p_\theta(z|x)} \right] = \mathbb{E}_{q_\phi(z|x)}\left[ \ln q_\phi(z|x) - \ln p_\theta(x, z) \right] + \ln p_\theta(x), \]
which can be rearranged to find an alternative expression for ln p_θ(x) that does not require knowledge of p_θ(z|x),
\[ \ln p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\left[ \ln p_\theta(x|z) \right] - D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z)) = \mathcal{L}(\theta, \phi; x), \]
where L is known as the evidence lower bound (ELBO) [109]. To optimise this bound with respect to θ and φ, gradients must be backpropagated through the stochastic sampling process z̃ ∼ q_φ(z|x). This is permitted by reparameterizing z̃ using a differentiable function [187].
Monte Carlo gradient estimators can be used to approximate the expectations; however, this yields very high variance, making it impractical. Alternatively, if D_KL(q_φ(z|x) || p_θ(z)) can be integrated analytically then the variance is manageable. A prior with such a property needs to be simple enough to sample from but also sufficiently flexible to match the true posterior; a common choice is a standard normal prior, p_θ(z) = N(0, I), together with a normally distributed approximate posterior with diagonal covariance, z̃ ∼ q_φ(z|x) = N(z; µ, σ²I), sampled as z̃ = µ + σ ⊙ ε with ε ∼ N(0, I). In this case, the loss simplifies to
\[ \tilde{\mathcal{L}}_{VAE}(\theta, \phi) \simeq \frac{1}{K} \sum_{i=1}^{K} \ln p_\theta(x | \tilde{z}_i) + \frac{1}{2} \sum_{j=1}^{J} \left( 1 + \ln \sigma_j^2 - \mu_j^2 - \sigma_j^2 \right), \]
where K is the number of Monte Carlo samples and J is the dimensionality of z.
Despite success on small scale datasets, when applied to more complex datasets such as natural images, samples tend to be unrealistic and blurry [45]. This blurriness has been attributed to the maximum likelihood objective itself and to the MSE reconstruction loss; however, there is evidence that limited approximation of the true posterior is the root cause [260], with MSE causing highly non-Gaussian posteriors. As such, the Gaussian posterior implies an overly simple model which, when unable to fit perfectly, maps multiple data points to the same encoding, leading to averaging.
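The Gaussian reparameterisation and the analytic KL term discussed in this section can be checked numerically. The sketch below assumes a standard normal prior N(0, I) and a diagonal Gaussian posterior N(µ, diag(σ²)), for which the KL divergence has the closed form −½ Σ_j (1 + ln σ_j² − µ_j² − σ_j²); it compares that expression against a Monte Carlo estimate built from reparameterised samples. Function names and the example values of µ and σ are illustrative assumptions.

```python
import math
import random


def reparameterise(mu, sigma, rng):
    # z = mu + sigma * eps with eps ~ N(0, I); the randomness is moved into
    # eps so z is a differentiable function of mu and sigma.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def kl_to_standard_normal(mu, sigma):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ).
    return -0.5 * sum(1.0 + math.log(s * s) - m * m - s * s
                      for m, s in zip(mu, sigma))


def kl_monte_carlo(mu, sigma, rng, n=20000):
    # Estimate E_q[ln q(z) - ln p(z)]; the (2*pi) constants cancel.
    total = 0.0
    for _ in range(n):
        z = reparameterise(mu, sigma, rng)
        log_q = sum(-0.5 * ((zi - m) / s) ** 2 - math.log(s)
                    for zi, m, s in zip(z, mu, sigma))
        log_p = sum(-0.5 * zi * zi for zi in z)
        total += log_q - log_p
    return total / n


if __name__ == "__main__":
    mu, sigma = [0.5, -1.0], [0.8, 1.5]
    print("analytic KL:   ", kl_to_standard_normal(mu, sigma))
    print("Monte Carlo KL:", kl_monte_carlo(mu, sigma, random.Random(0)))
```

The analytic value is exact and noise-free, whereas the sampled estimate fluctuates from run to run, which illustrates why integrating the KL term analytically keeps the variance of the gradient estimate manageable.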
There are a number of other issues associated with limited posterior approximation, namely under-estimation of the variance of the posterior, resulting in poor predictions, and biases in the MAP estimates of model parameters [224]. Additionally, amortized inference leads to an amortization gap, the difference in ELBO for the amortized posterior and optimal approximate posterior [37]. Increasing the capacity of the encoder and decoder can reduce this gap by improving the posterior approximation and better fitting the choice of approximation respectively. Other proposed improvements include combining with adversarial training [98], [132], [150], improving the ELBO [21], as well as using different regularisation such as Wasserstein distance [218].
Reweighting the ELBO by multiplying D KL with an extra hyperparameter β allows the capacity of the latent representation to be altered. When β > 1 a more disentangled representation is learned where each latent unit is responsible for a single generative factor [82]. This approach has been generalised, allowing more precise states in the compression-representation trade-off to be targeted [2].

Beyond Simple Priors
One approach to improve variational bounds and increase sample quality is to improve the priors used, for instance by careful selection for the task or by increasing their complexity [90]. Complex priors can be learned by warping simple distributions and inducing variational dependencies between the latent variables: variational Gaussian processes permit this by forming an infinite ensemble of mean-field distributions [220]; EBMs and score matching can be used to model flexible priors [175], [229]; and normalizing flows (see Section 6) transform distributions through a series of invertible parameterised functions [14], [62], [97], [125], [186], [191].
By rewriting the VAE training objective to have two regularisation terms [150], the latter of which is the cross entropy between the aggregate posterior and the prior, the prior can be defined as the aggregate posterior, thus obtaining a rich multi-modal latent representation that combats inactive latent variables.
Since the true aggregate posterior is intractable, VampPrior [219] approximates it for a set of pseudo-inputs, tensors with the same shape as data points learned during training. Exemplar VAEs [169] scale this approach up, using the full training set to approximate the aggregate posterior, by approximating the prior using k-nearest-neighbours. Alternatively, the aggregate posterior can be approximated with a learned prior; this has been achieved with a learned rejection sampling procedure that transforms a base distribution [7]. In some instances, it can be helpful to compress data to discrete latent representations [18], [111], however, gradients through discrete sampling procedures are ill-defined.
The Gumbel-Softmax/Concrete distribution is a differentiable continuous approximation of a categorical distribution containing a temperature coefficient that converges to a discrete distribution in the limit [104], [148].
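A minimal sketch of Gumbel-Softmax sampling follows. It assumes the standard construction in which independent Gumbel(0, 1) noise is added to the logits before a temperature-scaled softmax; the helper names are hypothetical.

```python
import math
import random


def sample_gumbel(rng, eps=1e-12):
    # Gumbel(0, 1) noise via -ln(-ln(u)), with u clipped away from 0 and 1.
    u = min(max(rng.random(), eps), 1.0 - eps)
    return -math.log(-math.log(u))


def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    # Relaxed one-hot sample: approaches a discrete one-hot vector as the
    # temperature tends to zero, and stays differentiable in the logits.
    noisy = [(l + sample_gumbel(rng)) / temperature for l in logits]
    return softmax(noisy)


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [math.log(0.1), math.log(0.3), math.log(0.6)]
    for tau in (5.0, 1.0, 0.1):
        print(tau, gumbel_softmax(logits, tau, rng))
```

At high temperature the samples are close to uniform mixtures, while at low temperature they concentrate on a single category whose index is distributed according to the softmax of the logits.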
Alternatively, it has been argued that simple Gaussian priors are not a hindrance. When data of dimension d lies on a sub-manifold of dimension r with r < d, global VAE optima exist that do not recover the data distribution; however, when r = d, global optima do recover the data distribution. As such, two-stage VAEs that first map data to latents of dimension r and then use a second VAE to correct the learned density can better capture the data [38].

Hierarchical VAEs
Hierarchical VAEs build complex priors with multiple levels of latent variables, each conditionally dependent on the last, forming dependencies depthwise through the network. Ladder VAEs [211] achieve this conditioning structure using a bidirectional inference network where a deterministic "bottom-up" pass generates features at various resolutions, then the latent variables are processed from top to bottom with the features shared (Fig. 3). Specifically, they model latents as normal distributions conditioned on the last latent,
\[ p_\theta(z) = p_\theta(z_L) \prod_{i=1}^{L-1} p_\theta(z_i | z_{i+1}), \qquad p_\theta(z_i | z_{i+1}) = \mathcal{N}\left(z_i; \mu_{\theta,i}(z_{i+1}), \sigma^2_{\theta,i}(z_{i+1})\right). \]
By introducing skip connections around the stochastic sampling process, latents can be conditioned on all previously sampled latents [125], [147], [228]. Such an architecture generalises autoregressive models; inferring latents in parallel allows for significantly fewer steps compared to typical autoregressive models, since many latents are statistically independent, and allows different latent levels to correspond to global or local details depending on their depth. It has been argued that a single level of latents is sufficient since Gibbs sampling performed on that level can recover the data distribution [259]. Nevertheless, Gibbs sampling converges slowly, making hierarchical representations more efficient; in support of this, deeper hierarchical VAEs have been shown to improve likelihood, independent of capacity [31].

Regularised Autoencoders
Related to VAEs are regularised autoencoders (RAEs), which apply regularisation to the latent space of a deterministic autoencoder and subsequently train a density estimator on this space to obtain a complex prior [58]. Since the approximate posterior is a degenerate distribution, RAEs have little connection with variational inference. Vector Quantized-Variational Autoencoders (VQ-VAE) [183], [235] achieve this by training an autoencoder with a discrete latent space, then approximating encodings with an autoregressive model (see Section 5). The encoder's outputs are compared to a codebook of latent vectors and set to the code they are closest to; the gradient of this discretisation process is approximated using the straight-through estimator [10]. Meanwhile, latent vectors in the codebook are moved closer to the encoder's outputs. To model larger images, hierarchies of codes have been applied [184], as well as adversarial learning to increase the compression rate [53].
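The codebook lookup described above can be sketched as a nearest-neighbour search. The example assumes squared Euclidean distance and a simple interpolation step for moving a code towards an encoder output; the straight-through gradient itself needs an automatic differentiation framework and is not shown. Function names and the step size are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantise(z_e, codebook):
    # Replace the encoder output with its nearest codebook vector.
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(z_e, codebook[k]))
    return index, list(codebook[index])


def update_code(code, z_e, lr):
    # Move the selected code a fraction lr of the way towards the
    # encoder output (a stand-in for the codebook loss gradient step).
    return [c + lr * (z - c) for c, z in zip(code, z_e)]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
    z_e = [0.9, 0.8]
    index, z_q = quantise(z_e, codebook)
    codebook[index] = update_code(z_q, z_e, lr=0.5)
    print(index, z_q, codebook[index])
```

In a full model the decoder receives the quantised vector, while the encoder is updated as if the quantisation step were the identity function.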

Data Modelling Distributions
Unlike energy-based models, VAEs must model an explicit density p(x|z). For efficient sampling, typically this distribution is decomposed as a product of independent simple distributions, allowing unrestricted architectures to be used to parameterise the chosen distributions. Common instances include modelling variables as Bernoulli [142], Gaussian [123], multinomial distributions, or as mixtures [190].

Autoregressive Decoders
To introduce dependencies between the output variables, numerous works have used powerful autoregressive networks [73]. While these approaches allow complex distributions to be learned, they increase the runtime and often suffer from posterior collapse: early in training the approximate posterior contains little knowledge about x, meaning that it is easy to minimise D_KL, which in turn reduces the gradient between the encoder and decoder, making it difficult to escape this minimum [18]; in fact, for a sufficiently powerful generative distribution, this can occur even at optimum solutions [29]. Various methods to prevent posterior collapse have been proposed: by restricting the autoregressive network's receptive field to a small window, it is forced to use latents to capture global structure [29]; a mutual information term can be added to the loss to encourage high correlation between x and z [261]; and by encouraging the posterior to be diverse, controlling its geometry to evenly cover the data space, redundancy is reduced and latents are encouraged to learn global structure [146].

Bridging Amortized and Stochastic Inference
While variational approaches offer substantial speedup over MCMC sampling, there is an inherent discrepancy between the true posterior and approximate posterior despite improvements in this field. To this end, a number of approaches have been proposed to find a middle ground, yielding improvements over amortized methods with lower costs than MCMC. Semi-amortised VAEs [122] use an encoder network followed by stochastic gradient descent on latents to improve the ELBO; however, this still relies on an inference network. The inference network can be removed by assigning latent vectors to data points, then optimising them with Langevin dynamics or gradient descent during training; although this allows fast training, convergence for unseen samples is not guaranteed and there is still a large discrepancy between the true posterior and latent approximations due to lag in optimisation [16], [78]. Short-run MCMC has also been applied; however, it has poor mixing properties [168]. Gradient Origin Networks [17] replace the encoder with an empirical Bayes approximation of the posterior that only requires a single gradient step. VAEBMs offer a different perspective: rather than performing latent MCMC sampling based on the ELBO, they use an auxiliary energy-based model to correct blurry VAE samples, with MCMC sampling performed in both the data space and latent space. This setup is defined by
\[ p_{\theta,\psi}(x, z) = \frac{1}{Z_{\theta,\psi}}\, p_\theta(z)\, p_\theta(x|z)\, e^{-E_\psi(x)}, \]
where E_ψ is the auxiliary energy function with parameters ψ.

GENERATIVE ADVERSARIAL NETWORKS
Another approach to eliminating the Markov chains used in energy models is the generative adversarial network (GAN) [59]. GANs consist of two networks: a discriminator D : R^n → [0, 1], which estimates the probability that a sample comes from the data distribution x ∼ p_d(x), and a generator G : R^m → R^n, which, given a latent variable z ∼ p_z(z), captures p_d by tricking the discriminator into thinking its samples are real. This is achieved through adversarial training of the networks: D is trained to correctly label training samples as real and samples from G as fake, while G is trained to minimise the probability that D classifies its samples as fake. This can be interpreted as D and G playing a mini-max game, as with prior work [194], [195], optimising the value function V(G, D),
\[ \min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_d}\left[ \ln D(x) \right] + \mathbb{E}_{z \sim p_z}\left[ \ln(1 - D(G(z))) \right]. \tag{18} \]
For a fixed G, the objective for D can be reformulated as
\[ \max_D V(G, D) = \mathbb{E}_{x \sim p_d}\left[ \ln \frac{p_d(x)}{p_d(x) + p_g(x)} \right] + \mathbb{E}_{x \sim p_g}\left[ \ln \frac{p_g(x)}{p_d(x) + p_g(x)} \right] = D_{KL}\left(p_d \,\Big\|\, \tfrac{1}{2}(p_d + p_g)\right) + D_{KL}\left(p_g \,\Big\|\, \tfrac{1}{2}(p_d + p_g)\right) + C. \]
Therefore the loss is equivalent to the Jensen-Shannon divergence between the generative distribution p_g and the data distribution p_d, and thus with sufficient capacity the generator can recover the data distribution. The symmetric JS-divergence is well behaved when both distributions are small, unlike the asymmetric KL-divergence used in maximum likelihood models. Additionally, it has been suggested that reverse KL-divergence, D_KL(p_g || p_d), is a better measure for training generative models than normal KL-divergence, D_KL(p_d || p_g), since minimising it maximises E_{x∼p_g}[ln p_d(x)] [100]; while reverse KL-divergence is not a viable objective function, JS-divergence is, and it behaves more like reverse KL-divergence than KL-divergence alone. With that said, JS-divergence is not perfect; if zero mass is associated with a data sample in a maximum likelihood model, KL-divergence is driven to infinity, whereas this can happen with no consequence in a GAN.
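The link between the value function and the Jensen-Shannon divergence can be verified numerically on small discrete distributions. The sketch below assumes distributions with full support and uses the optimal discriminator D*(x) = p_d(x) / (p_d(x) + p_g(x)); under these assumptions the value function equals 2·JSD(p_d, p_g) − ln 4, so the constant in the reformulation is −ln 4. The example distributions are arbitrary.

```python
import math


def kl(p, q):
    # KL divergence between discrete distributions (terms with p_i = 0 vanish).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def optimal_discriminator(p_d, p_g):
    # Assumes p_d(x) + p_g(x) > 0 for every outcome x.
    return [pd / (pd + pg) for pd, pg in zip(p_d, p_g)]


def value_function(p_d, p_g, d):
    # V(G, D) = E_{p_d}[ln D(x)] + E_{p_g}[ln(1 - D(x))] for a discrete x.
    real = sum(pd * math.log(di) for pd, di in zip(p_d, d) if pd > 0)
    fake = sum(pg * math.log(1.0 - di) for pg, di in zip(p_g, d) if pg > 0)
    return real + fake


if __name__ == "__main__":
    p_d = [0.5, 0.3, 0.2]
    p_g = [0.2, 0.2, 0.6]
    d_star = optimal_discriminator(p_d, p_g)
    print(value_function(p_d, p_g, d_star))
    print(2.0 * js_divergence(p_d, p_g) - math.log(4.0))
```

When the generator matches the data exactly, the optimal discriminator outputs ½ everywhere and the value function reaches its minimum of −ln 4.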

Stabilising Training
The adversarial nature of GANs makes them notoriously difficult to train [4]; Nash equilibrium is hard to achieve [189] since non-cooperation cannot guarantee convergence, and thus training often results in oscillations of increasing amplitude. As the discriminator improves, gradients passed to the generator vanish, exacerbating this problem; on the other hand, if the discriminator remains poor, the generator does not receive useful gradients. Another problem is mode collapse, where one network gets stuck in a bad local minimum and only a small subset of the data distribution is learned. The discriminator can also jump between modes, resulting in catastrophic forgetting, where previously learned knowledge is forgotten when learning something new [213]. This section explores proposed solutions to these problems.

Loss Functions
Since the cause of many of these issues can be linked with the use of JS-divergence, other loss functions have been proposed that minimise other statistical distances; in general, any $f$-divergence can be used to train GANs [170]. One notable example is the Wasserstein distance, which intuitively indicates how much "mass" must be moved to transform one distribution into another. Wasserstein distance is defined formally in Eqn. 19a, which by the Kantorovich-Rubinstein duality is equivalent to Eqn. 19b [238]:
$$W(p_d, p_g) = \inf_{\gamma \in \Pi(p_d, p_g)} \mathbb{E}_{(x, y) \sim \gamma}\left[\|x - y\|\right], \tag{19a}$$
$$W(p_d, p_g) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim p_d}[f(x)] - \mathbb{E}_{x \sim p_g}[f(x)], \tag{19b}$$
where $\Pi(p_d, p_g)$ is the set of joint distributions with marginals $p_d$ and $p_g$, and the supremum is taken over all 1-Lipschitz functions, that is, $f$ such that for all $x_1$ and $x_2$, $|f(x_1) - f(x_2)| \le \|x_1 - x_2\|$. Optimising Wasserstein distance, as described in Table 5a, offers linear gradients, thus eliminating the vanishing gradients problem (see Fig. 5b). Moreover, Wasserstein distance is also equivalent to minimising reverse KL-divergence [157], offers improved stability, and allows training to optimality. Numerous approaches to enforce 1-Lipschitz continuity have been proposed: weight clipping [5] invalidates gradients, making optimisation difficult; applying a gradient penalty within the loss is heavily dependent on the support of the generative distribution, and computation with finite samples makes application to the entire space intractable [72]; spectral normalisation (discussed below) applies global regularisation by estimating the singular values of the weights.

[Table 5a. Columns: Name, Discriminator Loss, Generator Loss.]

The catastrophic forgetting problem can be mitigated by conditioning the GAN on class information, encouraging more stable representations [19], [156], [255]. Nevertheless, labelled data, if available, only covers limited abstractions. Self-supervision achieves the same goal by training the discriminator on an auxiliary classification task based solely on the unsupervised data. Proposed approaches are based on randomly rotating inputs to the discriminator, which learns to identify the angle rotated separately to the standard real/fake classification [28].
Extensions include training the discriminator to jointly determine rotation and real/fake to provide better feedback [223], and training the generator to trick the discriminator at both the real/fake and classification tasks [223]. A more explicit approach is to model the generator with a normalizing flow, avoiding collapse by jointly optimising the GAN and likelihood objectives [70].

Spectral Normalisation
Spectral normalisation [157] is a technique to make a function globally 1-Lipschitz, utilising the observation that the Lipschitz constant of a linear function is its largest singular value (spectral norm). The spectral norm of a matrix $A$ is
$$\sigma(A) = \max_{h \ne 0} \frac{\|Ah\|_2}{\|h\|_2} = \max_{\|h\|_2 \le 1} \|Ah\|_2,$$
thus a weight matrix $W$ is normalised to be 1-Lipschitz by replacing the weights with $W_{SN} := W / \sigma(W)$. Rather than using singular value decomposition to compute the norm, the power iteration method is used; for randomly initialised vectors $v \in \mathbb{R}^n$ and $u \in \mathbb{R}^m$, the procedure is
$$v_{t+1} = \frac{W^T u_t}{\|W^T u_t\|_2}, \qquad u_{t+1} = \frac{W v_{t+1}}{\|W v_{t+1}\|_2}, \qquad \sigma(W) \approx u_{t+1}^T W v_{t+1}.$$
Since weights change only marginally with each optimisation step, a single power iteration step per global optimisation step is sufficient to keep $v$ and $u$ close to their targets.
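The power iteration described above can be sketched in a few lines of NumPy. This is an illustrative implementation under simple assumptions (a single dense weight matrix, many iterations run at once so that the estimate can be compared against an SVD); the function name and the random test matrix are not from the cited works.

```python
import numpy as np

def spectral_norm_power_iteration(W, u, n_steps=1):
    """Estimate the largest singular value of W by power iteration.

    W has shape (m, n); u is the running estimate of the leading left singular
    vector, shape (m,). Returns the estimate and the updated u for reuse.
    """
    for _ in range(n_steps):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v
    return sigma, u

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))   # toy weight matrix
u0 = rng.normal(size=6)       # random initial vector

sigma_est, _ = spectral_norm_power_iteration(W, u0, n_steps=1000)
sigma_true = np.linalg.svd(W, compute_uv=False)[0]
W_sn = W / sigma_est          # spectrally normalised weights
print(sigma_est, sigma_true)
```

During training one would typically keep `u` between optimisation steps and call the function with `n_steps=1`, relying on the weights changing slowly.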
As mentioned above, enforcing the discriminator to be 1-Lipschitz is essential for WGANs; however, spectral normalisation has also been found to dramatically improve sample quality and allow scaling to datasets with thousands of classes across a variety of loss functions [19], [157]. Spectral collapse has been linked to discriminator overfitting when spectral norms of layers explode [19], as well as to mode collapse when spectral norms fall in value significantly [139].
Additionally, regularising the discriminator in this manner helps balance the two networks, reducing the number of discriminator update steps required [19], [255].

Data Augmentation
Augmenting training data to increase its quantity is common practice; when training GANs, however, the permitted augmentations are limited to simple ones such as cropping and flipping to prevent the generator from creating undesired artefacts. Several approaches independently proposed applying augmentations to all discriminator inputs, allowing more substantial augmentations to be used [115], [222], [262], [263]; the training procedure for a WGAN with augmentations, written as losses to be minimised, is
$$L_D = -\mathbb{E}_{x \sim p_d}[D(T(x))] + \mathbb{E}_{z \sim p_z}[D(T(G(z)))], \qquad L_G = -\mathbb{E}_{z \sim p_z}[D(T(G(z)))],$$
where $T$ is a random augmentation. These approaches have been shown to improve sample quality on equivalent architectures and stabilise training. Each work offers a different perspective on why augmentation is so effective: the increased quantity of training data in conjunction with the more difficult discrimination task prevents overfitting and in turn collapse [19], and notably this applies even on very small datasets (100 samples); the nature of GAN training leads to the generated and data distributions having non-overlapping supports, complicating training [210], and strong augmentations may cause these distributions to overlap further. If an augmentation is differentiable and represents an invertible transformation of the data space's distribution, then the JS-divergence is invariant, and the generator is guaranteed to not create augmented samples [115], [222].

Discriminator Driven Sampling
In order to improve sample quality and address overpowered discriminators, numerous works have taken inspiration from the connection between GANs and energy models [258]. Interpreting the discriminator of a Wasserstein GAN [5] as an energy-based model means samples from the generator can be used to initialise an MCMC sampling chain which converges to the density learned by the discriminator, correcting errors learned by the generator [160], [225]. This is similar to pure EBM approaches; however, training the two networks adversarially changes the dynamics. The slow convergence rates of high dimensional MCMC sampling have led others to instead sample in the latent space [24], [207].

GANs without Competition
Originally proposed as a proxy to measure GAN convergence [69], the duality gap is an upper bound on the JS-divergence that can be directly optimised [68], defined as
$$DG(G, D) = \max_{D'} V(G, D') - \min_{G'} V(G', D).$$
Cooperative training simplifies the optimisation procedure, avoiding oscillations. Each training step, however, requires optimising for $D$ and $G$, which slows down training and could suffer from vanishing gradients.

Architectures
Careful network design is a key component for stable GAN training. Scaling any deep neural network to high-resolution data is non-trivial due to vanishing gradients and high memory usage, but since the discriminator can classify high-resolution data more easily, GANs notably struggle [171]. Early approaches designed hierarchical architectures, dividing the learning procedure into more easily learnable chunks. LapGAN [40] builds a Laplacian pyramid such that at each layer, a GAN conditioned on the previous image resolution predicts a residual adding detail. Stacked GANs [99], [256] use two GANs trained successively: the first generates low-resolution samples, then the second upsamples and corrects the first, thus fewer GANs need to be trained. A related approach, progressive growing [114], [117], iteratively trains a single GAN at higher resolutions by adding layers to both the generator and discriminator that upscale the previous output once the previous resolution has converged. Training in this manner, however, not only takes a long time but also leads to high frequency components being learned in the lower layers, resulting in shift artefacts [118].
Accordingly, a number of works have targeted a single GAN that can be trained end-to-end. DCGAN [182] introduced a fully convolutional architecture with batch normalisation [102] and ReLU/LeakyReLU activations. BigGAN [19] employs a number of tricks to scale to high resolutions, including using very large mini-batches to reduce variation, spectral normalisation to discourage spectral collapse, and using large datasets to prevent overfitting. Despite this, training collapse still occurs, thus requiring early stopping. Another approach is to include skip connections between the generator and discriminator at each resolution, allowing gradients to flow through shorter paths to each layer, providing extra information to the generator [113], [118], [230]. By treating subsets of the generator's parameters as smaller generators, Anycost GANs extend this approach, allowing samples to be generated at multiple resolutions and speeds [137]. To learn long-range dependencies, GANs can be built with self-attention components [105], [236], [255]; however, full quadratic attention does not scale well to high dimensional data.

Training Speed
The mini-max nature of GAN training leads to slow convergence, if achieved at all. This problem has been exacerbated by numerous works as a byproduct of improving stability or sample quality. One such example is that by using very large mini-batches, reducing variance and covering more modes, sample quality can be improved significantly, however, this comes at the cost of slower training [19]. Small-GAN [197] combats this by replacing large batches with small batches that approximate the shape of the larger batch using core set sampling [197], significantly improving the mode coverage and sample quality of GANs trained with small batches.
While strong discriminator regularisation stabilises training, it allows the generator to make small changes and trick the discriminator, making convergence very slow. Rob-GAN [141] includes an adversarial attack step [149] that perturbs real images to trick the discriminator without altering the content inordinately, adapting the GAN objective into a min-max-min problem. This provides a weaker regularisation, enforcing small Lipschitz values locally rather than globally. This approach has been connected with the follow-the-ridge algorithm [242], [264], an optimisation approach for solving mini-max problems that reduces the optimisation path and converges to local mini-max points.
Another approach to improve training speed is to design more efficient architectures. Depthwise convolutions [33], which apply separate convolutions to each channel of a tensor, reducing the number of operations and hence also the runtime, have been found to have comparable quality to standard convolutions [161]. Lightweight GANs [138] achieve fast training using a number of tricks including small batch sizes, skip-layer excitation modules which provide efficient shortcut gradient flow, and a self-supervised discriminator that forces good features to be learned.

AUTOREGRESSIVE LIKELIHOOD MODELS
Autoregressive generative models [9] are based on the chain rule of probability, where the probability of a variable that can be decomposed as $x = x_1, \ldots, x_n$ is expressed as
$$p(x) = p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_1, \ldots, x_{i-1}).$$
As such, unlike GANs and energy models, it is possible to directly maximise the likelihood of the data by training a recurrent neural network to model $p(x_i \mid x_{1:i-1})$ by minimising the negative log-likelihood,
$$-\ln p(x) = -\sum_{i=1}^{n} \ln p(x_i \mid x_1, \ldots, x_{i-1}).$$
While autoregressive models are extremely powerful density estimators, sampling is inherently a sequential process and can be exceedingly slow on high dimensional data. Additionally, data must be decomposed into a fixed ordering; while the choice of ordering can be clear for some modalities (e.g. text and audio), it is not obvious for others such as images and can affect performance depending on the network architecture used.
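As a small worked example of the chain rule factorisation, the sketch below builds a toy joint distribution over three binary variables (randomly generated, purely illustrative), derives each conditional $p(x_i \mid x_{<i})$ by marginalisation, and checks that the summed conditional log-probabilities reproduce the joint log-probability. In a real model the conditionals would be produced by a network rather than a lookup table.

```python
import numpy as np

rng = np.random.default_rng(0)
joint = rng.random((2, 2, 2))
joint /= joint.sum()  # toy joint p(x1, x2, x3) over binary variables

def prefix_marginal(joint, k):
    """p(x_1..x_k): sum the joint over every variable after the first k."""
    out = joint
    for _ in range(joint.ndim - k):
        out = out.sum(axis=-1)
    return np.asarray(out)

def conditional(joint, x, i):
    """p(x_i | x_1..x_{i-1}) for 0-based position i, by marginalisation."""
    numerator = prefix_marginal(joint, i + 1)[tuple(x[: i + 1])]
    denominator = prefix_marginal(joint, i)[tuple(x[:i])]
    return numerator / denominator

def nll(joint, x):
    """Negative log-likelihood as a sum over the autoregressive conditionals."""
    return -sum(np.log(conditional(joint, x, i)) for i in range(joint.ndim))

x = (1, 0, 1)
print(nll(joint, x), -np.log(joint[x]))  # the two values should agree
```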

Architectures
The majority of research is focused on improving network architectures to increase their receptive fields and memory, ensuring the network has access to all parts of the input to encourage consistency, as well as increasing the network capacity, allowing more complex distributions to be modelled.

Masked Multilayer Perceptrons
One approach to build autoregressive models is to mask the weights of simple multilayer perceptron (MLP) autoencoders so as to satisfy the autoregressive property. The neural autoregressive density estimator (NADE) [131], which can be viewed as a mean-field approximation of a restricted Boltzmann machine, achieves this for binary data by placing time-dependent masks on an MLP with one hidden layer. Specifically, at time step $i$, weights are masked so that the entire hidden state $h_i$ and output $p(x_i \mid x_{<i})$ are dependent only on $x_{<i}$; formally this can be defined as
$$p(x_i \mid x_{<i}) = \sigma(V_{i,\cdot} h_i + b_i), \qquad h_i = \sigma(W_{\cdot,<i} x_{<i} + c),$$
where $\sigma$ is the logistic sigmoid, $W_{\cdot,<i}$ is the first $i - 1$ columns of a shared weight matrix $W$, $V$ is the output weight matrix, and $b_i$ and $c$ are biases. The RNADE [227] generalises NADE to real valued data by instead modelling $p(x_i \mid x_{<i})$ with mixture distributions parameterised by the network. An alternative masking procedure known as MADE [57] allows for parallel density estimation by placing a mask fixed over time on an MLP so that no connections exist between $p(x_i \mid x_{<i})$ and $x_{\ge i}$. Additionally, MADE is more readily vectorisable and does not suffer from neuron saturation since the number of inputs to all neurons is constant with respect to time.
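To make the fixed-mask idea concrete, here is a minimal sketch of a degree-based mask construction in the style commonly used for MADE, assuming a single hidden layer. The function name, the random assignment of hidden degrees, and the layer sizes are illustrative assumptions. The check at the end confirms that output $i$ has no path to inputs $x_{\ge i}$.

```python
import numpy as np

def made_masks(n_inputs, n_hidden, rng):
    """Binary masks for a one-hidden-layer masked autoencoder.

    Input i has degree i (1..D). Each hidden unit gets a degree in 1..D-1 and
    may see inputs whose degree is <= its own. Output i may see hidden units
    whose degree is < i, so output i depends only on inputs 1..i-1.
    """
    in_deg = np.arange(1, n_inputs + 1)
    hid_deg = rng.integers(1, n_inputs, size=n_hidden)             # values in [1, D-1]
    mask_in = (hid_deg[:, None] >= in_deg[None, :]).astype(float)  # shape (H, D)
    mask_out = (in_deg[:, None] > hid_deg[None, :]).astype(float)  # shape (D, H)
    return mask_in, mask_out

rng = np.random.default_rng(0)
D, H = 5, 16
mask_in, mask_out = made_masks(D, H, rng)

# Entry (i, j) counts the paths from input j to output i through the hidden layer.
connectivity = mask_out @ mask_in
print((connectivity > 0).astype(int))  # strictly lower triangular pattern
```

The masks would be multiplied element-wise with the weight matrices of the two layers; because they do not change over time, all conditionals can be evaluated in one forward pass.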

Recurrent Neural Networks
A natural architecture to apply is that of standard recurrent neural networks (RNNs) such as LSTMs [88], [214], [234] and GRUs [35], [152] which model sequential data by tracking information in a hidden state. However, RNNs are known to forget information, limiting their receptive field thus preventing modelling of long range relationships. This can be improved by stacking RNNs that run at different frequencies allowing long data such as multiple seconds of audio to be modelled [35]. Nevertheless, their sequential nature means that training can be too slow for many tasks.

Causal Convolutions
An alternative approach is that of causal convolutions, which apply masked or shifted convolutions over a sequence [30], [190], [233]. When stacked, this only provides a receptive field linear with depth, however, by dilating the convolutions to skip values with some step the receptive field can be orders of magnitude higher.
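The effect of dilation on the receptive field can be illustrated with a short sketch. The code below is an assumption-laden toy (plain 1-D signals, zero left-padding, hypothetical function names) that implements a causal dilated convolution and compares the receptive field of ten undilated layers with ten layers whose dilation doubles at each layer.

```python
import numpy as np

def causal_conv1d(x, kernel, dilation=1):
    """Causal 1-D convolution: out[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation
    x_padded = np.concatenate([np.zeros(pad), x])  # left-pad so no future values are used
    out = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            # kernel[j] multiplies the input j * dilation steps in the past
            out[t] += kernel[j] * x_padded[pad + t - j * dilation]
    return out

def receptive_field(kernel_size, dilations):
    """Number of input time steps visible to one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(2, [1] * 10))                     # 11: linear in depth
print(receptive_field(2, [2 ** i for i in range(10)]))  # 1024: exponential in depth
```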

Self-Attention
Fig. 6: Autoregressive models decompose data points using the chain rule and learn conditional probabilities.
Neural attention is an approach which at each successive time step is able to select where it wishes to 'look' at previous time steps. This concept has been used to autoregressively 'draw' images onto a blank 'canvas' [67] in a manner similar to human drawing. More recently self-attention (known as Transformers when used in an encoder-decoder setup) [236] has made significant strides improving not only autoregressive models, but also other generative models due to its parallel nature, stable training, and ability to effectively learn long-distance dependencies. This is achieved using an attention scheme that can reference any previous input where an entirely independent process is used per time step so that there are no dependencies. Specifically, inputs are encoded as key-value pairs, where the values $V$ represent the inputs, and the keys $K$ act as an indexing method. At each time step a query $q$ is made; taking the dot product of the queries and keys, a similarity vector is formed that describes which value vectors to access. This process can be expressed as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V,$$
where $d_k$ is the key/query dimension and is used to normalise gradient magnitudes. Since the self-attention process contains no recurrence, positional information must be passed into the function. A simple effective method to achieve this is to add sinusoidal positional encodings which combine sine and cosine functions of different frequencies to encode positional information [236]; alternatively others use trainable positional embeddings [32]. The infinite receptive fields of attention provide a powerful tool for representing data; however, the attention matrix $QK^T$ grows quadratically with data dimension, making scaling difficult. Approaches include scaling across large quantities of GPUs [20], interleaving attention between causal convolutions [30], attending over local regions [179], and using sparse attention patterns that provide global attention when multiple layers are stacked [32]. More recently, a number of linear transformers have been proposed whose memory and time footprints grow linearly with data dimension [34], [119], [241]. By approximating the softmax operation with a kernel function with feature representation $\phi(x)$, the order of multiplications can be rearranged to
$$\left(\phi(Q)\phi(K)^T\right)V = \phi(Q)\left(\phi(K)^T V\right),$$
allowing $\phi(K)^T V$ to be cached and used for each query.
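The reordering behind linear attention can be verified directly. The sketch below is a non-causal toy in NumPy, assuming the feature map $\phi(x) = \mathrm{elu}(x) + 1$ (one choice that has been used for linear attention, adopted here as an assumption) and random inputs. It computes kernelised attention both the quadratic way and the reordered way and checks that they agree, alongside standard softmax attention for reference.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard (non-causal) scaled dot-product attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def feature_map(x):
    """phi(x) = elu(x) + 1, a positive feature map (an assumption in this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def kernel_attention_quadratic(Q, K, V):
    """Kernelised attention computed as (phi(Q) phi(K)^T) V, O(N^2) in length N."""
    A = feature_map(Q) @ feature_map(K).T
    return (A @ V) / A.sum(axis=-1, keepdims=True)

def kernel_attention_linear(Q, K, V):
    """Same quantity reordered as phi(Q) (phi(K)^T V), O(N) in length N."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V        # (d_k, d_v), computed once and shared by all queries
    z = phi_k.sum(axis=0)   # (d_k,), normaliser shared by all queries
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
N, d_k, d_v = 7, 4, 3
Q, K, V = rng.normal(size=(N, d_k)), rng.normal(size=(N, d_k)), rng.normal(size=(N, d_v))

out_quadratic = kernel_attention_quadratic(Q, K, V)
out_linear = kernel_attention_linear(Q, K, V)
print(np.abs(out_quadratic - out_linear).max())
```

For autoregressive use the sums over keys would be replaced by running prefix sums so that each query only sees earlier positions; that variant is omitted here for brevity.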

Multiscale Architectures
Even with a linear autoregressive model, O(N ) for N pixels, scaling to high-resolution images grows quadratically with resolution. One multi-scale approach reduces this complexity to O(ln N ) by successively upscaling images, making the assumption that when upscaling, each pixel is dependent only on its adjacent area and the previous resolution image, allowing scaling to high resolutions [185].
To avoid making independence assumptions, [155] partition images in an interleaving pattern so that sub-images are the same size and capture global structure. Sub-images are generated autoregressively pixel-wise and are conditioned on previously generated sub-images; while this reduces the memory required, sampling times are still slow.

Data Modelling Decisions
When generating text, output variables are often modelled using a multinomial distribution since tokens are discrete and are in general unrelated. However, this modelling assumption can cause complications or be infeasible in other cases such as 16-bit audio modelling, in which magnitude would not be intrinsically modelled and 65,536 output neurons would be required. Solutions proposed include:
• Applying µ-law, a logarithmic companding algorithm which takes advantage of human perception of sound, then quantizing to 8-bit values [231].
• First predicting the first 8 bits, then predicting the second 8 bits conditioned on the first.
• Modelling output probabilities using a mixture of logistic distributions (MoL), which has the benefits of providing more useful gradients and allowing intensities never seen to still be sampled [190].
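The first of these options, µ-law companding followed by 8-bit quantisation, can be sketched as follows. This assumes the standard µ-law formula with $\mu = 255$ and a waveform normalised to $[-1, 1]$; the function names and rounding scheme are illustrative choices.

```python
import numpy as np

MU = 255  # 8-bit mu-law

def mu_law_encode(x, mu=MU):
    """Compand a waveform in [-1, 1] and quantise it to integers in [0, mu]."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=MU):
    """Map integers in [0, mu] back to an approximate waveform in [-1, 1]."""
    compressed = 2.0 * q.astype(np.float64) / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

waveform = np.linspace(-1.0, 1.0, 101)   # toy signal covering the full range
codes = mu_law_encode(waveform)
reconstruction = mu_law_decode(codes)
print(codes.min(), codes.max(), np.abs(reconstruction - waveform).max())
```

Because the companding is logarithmic, quantisation error is smallest for quiet samples and largest near full scale, which reduces the 65,536-way output to a 256-way one.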
Nevertheless, these assumptions restrict the expressiveness of the network, for instance, MoLs struggle to model high frequency signals as found in raw image data; a simple solution in this case is to add Gaussian noise, reducing the Lipschitz constant of the data distribution [153]. This restriction can be removed at the expense of less efficient sampling by learning an autoregressive energy model, for instance, by approximating normalising constants [159] or through score matching [154]. Alternatively, quantile regression, which minimises Wasserstein distance, can be used to learn an approximation of the inverse cumulative distribution [173]. When modelling images, many works use "raster scan" ordering [190], [233], [234] where pixels are estimated row by row. Alternatives have been proposed such as "zig-zag" ordering [30] which allows pixels to depend on previously sampled pixels to the left and above, providing more relevant context. Another factor when modelling images is how to factorise sub-pixels. While it is possible to treat them as independent variables, this adds additional complexity. Alternatively, it is possible to instead condition on whole pixels, and output joint distributions in a single step [190].

NORMALIZING FLOWS
While training autoregressive models through maximum likelihood offers plenty of benefits including stable training, density estimation, and a useful validation metric, the slow sampling speed and poor scaling properties handicap them significantly. Normalizing flows are a technique that also allows exact likelihood calculation while being efficiently parallelisable as well as offering a useful latent space for downstream tasks. Consider an invertible, smooth function $f : \mathbb{R}^d \to \mathbb{R}^d$; by applying this transformation to a random variable $x \sim p(x)$, the distribution of the resulting random variable $y = f(x)$ can be determined through the change of variables rule (and application of the chain rule),
$$p(y) = p(x)\left|\det \frac{\partial f^{-1}}{\partial y}\right| = p(x)\left|\det \frac{\partial f}{\partial x}\right|^{-1}, \qquad x = f^{-1}(y). \tag{29}$$
Consequently, arbitrarily complex densities can be constructed by composing simple maps and applying Eqn. 29 [237]. This chain is known as a normalizing flow [186] (see Fig. 7). The density $p_K(x_K)$ obtained by successively transforming a random variable $x_0$ with distribution $p_0$ through a chain of $K$ transformations $f_k$, with $x_k = f_k(x_{k-1})$, can be defined as
$$\ln p_K(x_K) = \ln p_0(x_0) - \sum_{k=1}^{K} \ln \left|\det \frac{\partial f_k}{\partial x_{k-1}}\right|.$$
Each transformation therefore must be sufficiently expressive while being easily invertible and having an efficiently computable Jacobian determinant. While restrictive, there have been a number of works which have introduced more powerful invertible functions (see Table 3). Nevertheless, normalizing flow models are typically less parameter efficient than other generative models. One disadvantage of requiring transformations to be invertible is that the input dimension must be equal to the output dimension, which makes deep models inefficient and difficult to train. A popular solution to this is to use a multiscale architecture [43], [124] (see Fig. 8) which divides the process into a number of stages; at the end of each, half of the remaining units are factored out and treated immediately as outputs. This allows latent variables to sequentially represent coarse to fine features and permits deeper architectures.

Coupling and Autoregressive Layers
A simple way of building an expressive invertible function is the coupling flow [42], which divides inputs into two and applies a bijection $h$ on one half parameterised by the other,
$$y^{(1:d)} = x^{(1:d)}, \qquad y^{(d+1:D)} = h\left(x^{(d+1:D)}; f_\theta(x^{(1:d)})\right),$$
where $f_\theta$ can be arbitrarily complex, i.e. a neural network. $h$ tends to be selected as an elementwise function, making the Jacobian triangular and allowing efficient computation of the determinant, i.e. the product of elements on the diagonal.

Affine Coupling
A simple example of this is the affine coupling layer [43],
$$y^{(1:d)} = x^{(1:d)}, \qquad y^{(d+1:D)} = x^{(d+1:D)} \odot f_\sigma(x^{(1:d)}) + f_\mu(x^{(1:d)}),$$
where $f_\sigma$ and $f_\mu$ denote the scaling and translation coefficients output by the network. This layer has a simple Jacobian determinant and can be trivially rearranged to obtain a definition of $x^{(d+1:D)}$ in terms of $y$, provided that the scaling coefficients are not 0. This simplicity, however, comes at the cost of expressivity; while stacking numerous such flows increases their expressivity, allowing them to learn representations of complex high dimensional data such as images [124], it is unknown whether multiple affine flows are universal approximators [177].
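A minimal sketch of an affine coupling layer is given below, assuming the common form in which the second half of the input is scaled and shifted by functions of the first half. The conditioner here is a fixed random linear map standing in for a neural network, and every name in the code is hypothetical. The example checks that the inverse recovers the input and that the analytic log-determinant matches a finite-difference Jacobian.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 6, 3
# Hypothetical conditioner weights: random linear maps standing in for a network.
A_s, A_t = rng.normal(size=(D - d, d)), rng.normal(size=(D - d, d))

def conditioner(x1):
    """Return (scale, shift) computed from the untouched half; scale is strictly positive."""
    return np.exp(np.tanh(A_s @ x1)), A_t @ x1

def coupling_forward(x):
    x1, x2 = x[:d], x[d:]
    scale, shift = conditioner(x1)
    y = np.concatenate([x1, x2 * scale + shift])
    log_det = np.sum(np.log(np.abs(scale)))  # triangular Jacobian: product of the diagonal
    return y, log_det

def coupling_inverse(y):
    y1, y2 = y[:d], y[d:]
    scale, shift = conditioner(y1)           # y1 equals x1, so the same parameters are recovered
    return np.concatenate([y1, (y2 - shift) / scale])

def numerical_jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian, used only to check the analytic log-determinant."""
    cols = []
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        cols.append((f(x + e) - f(x - e)) / (2 * eps))
    return np.stack(cols, axis=1)

x = rng.normal(size=D)
y, log_det = coupling_forward(x)
jac = numerical_jacobian(lambda v: coupling_forward(v)[0], x)
print(log_det, np.linalg.slogdet(jac)[1])
```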

Monotone Functions
Another method of creating invertible functions that can be applied element-wise is to enforce monotonicity. One possibility to achieve this is to define $h$ as an integral over a positive but otherwise unconstrained function $g$ [244]; however, this integration requires numerical approximation. Alternatively, by choosing $g$ to be a function with a known integral solution, $h$ can be efficiently evaluated. This has been accomplished using positive polynomials [103] and the CDF of a mixture of logistics [86]. Both cases, however, do not have analytical inverses and have to be approximated iteratively with bisection search. Another option is to represent $g$ as a monotonic spline: a piecewise function where each piece is easy to invert. As such, the inverse is as fast to evaluate as the forward pass. Linear and quadratic splines [158], cubic splines [50], and rational-quadratic splines [51] have been applied so far.

Autoregressive Flows
For a single coupling layer, a significant proportion of inputs remain unchanged. A more flexible generalisation of coupling layers is the autoregressive flow, or MAF [178],
$$y^{(t)} = h\left(x^{(t)}; f_\theta(x^{(1:t-1)})\right).$$
Here $f_\theta$ can be arbitrarily complex, allowing the use of advances in autoregressive modelling (Section 5), and $h$ is a bijection as used for coupling layers. Some monotonic bijectors have been created specifically for autoregressive flows, namely Neural Autoregressive Flows (NAF) [96] and Block NAF [39]. Unlike coupling layers, a single autoregressive flow is a universal approximator. Alternatively, an autoregressive flow can be conditioned on $y^{(1:t-1)}$ rather than $x^{(1:t-1)}$; this is known as an Inverse Autoregressive Flow, or IAF [125]. While coupling layers can be evaluated efficiently in both directions, MAF permits parallel density estimation but sequential sampling, and IAF permits parallel sampling but sequential density estimation.

Probability Density Distillation
Inverse autoregressive flows [125] offer the ability to sample from an autoregressive model in parallel; however, training via maximum likelihood is inherently sequential, making this infeasible for high dimensional data. Probability density distillation [232] has been proposed as a solution to this, where a second pre-trained autoregressive network is used as a 'teacher' network while an IAF network is used as a 'student' and mimics the teacher's distribution by minimising the KL divergence between the two distributions:
$$D_{KL}(p_S \,\|\, p_T) = H(p_S, p_T) - H(p_S),$$
where $p_S$ and $p_T$ are the student's and teacher's distributions respectively, $H(p_S, p_T)$ is the cross-entropy between $p_S$ and $p_T$, and $H(p_S)$ is the entropy of $p_S$. Crucially, this never requires the student's inverse function to be used, allowing it to be computed entirely in parallel.

Convolutional
A considerable problem with coupling and autoregressive flows is the restricted triangular Jacobian, meaning that all inputs cannot interact with each other. Simple solutions involve fixed permutations on the output space such as reversing the order [42], [43]. A more general approach is to use a 1×1 convolution which is equivalent to a linear transformation applied across channels [124]. Numerous works have been proposed to generalise these to larger kernel sizes. A number of these apply variations on causal convolutions [231], including emerging convolutions [91] whose inverse is sequential, MaCow [145] which uses smaller conditional fields allowing more efficient sampling, and MintNet [205] which approximates the inverse using fixed-point iteration.
Alternative approaches to causal masking involve imposing repeated (periodic) structure [112], however in general this is not a good assumption for image modelling, as well as representing convolutions as exponential matrix-vector products, exp(M )x, approximated implicitly with a power series, allowing otherwise unconstrained kernels [92].

Residual Flows
Residual networks [80] are a popular technique to build deep neural networks that alleviate the vanishing gradients problem. By restricting $f_\theta$, invertible residual networks can be built by stacking blocks of the form $y = x + f_\theta(x)$.

Matrix Determinant Lemma
If a function has a certain residual form, then its Jacobian determinant can be computed with the matrix determinant lemma [186]. A simple example is planar flow [186], which is equivalent to a 3 layer MLP with a single neuron bottleneck:
$$y = x + u\,h(w^T x + b),$$
where $u, w \in \mathbb{R}^d$, $b \in \mathbb{R}$, and $h$ is a differentiable nonlinearity function. Planar flows are invertible provided some simple conditions are satisfied; however, the inverse is difficult to compute, making them only practical for density estimation tasks. A higher rank generalisation of the matrix determinant lemma has been applied to planar flows, known as Sylvester flows, removing the severe bottleneck and thus allowing greater representation ability [14], [79].
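For a planar flow $y = x + u\,h(w^T x + b)$ the Jacobian is $I + h'(w^T x + b)\,u w^T$, so the matrix determinant lemma gives $\det = 1 + h'(w^T x + b)\,w^T u$. The sketch below checks this numerically, assuming $h = \tanh$ and random parameters (all values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
u, w, b = rng.normal(size=dim), rng.normal(size=dim), 0.3  # toy parameters
x = rng.normal(size=dim)

def planar_forward(x):
    """Planar flow with h = tanh."""
    return x + u * np.tanh(w @ x + b)

a = w @ x + b
h_prime = 1.0 - np.tanh(a) ** 2                    # derivative of tanh at a

det_lemma = 1.0 + h_prime * (w @ u)                # O(d) via the matrix determinant lemma
jacobian = np.eye(dim) + h_prime * np.outer(u, w)  # full d x d Jacobian
det_full = np.linalg.det(jacobian)                 # O(d^3) reference value
print(det_lemma, det_full)
```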

Lipschitz Constrained
By restricting the Lipschitz constant of $f_\theta$ such that $\|f_\theta\|_L < 1$, this block is invertible [8]. The inverse, however, has no closed form definition but can be found through fixed-point iteration, which by the Banach fixed-point theorem converges to a unique fixed point at an exponential rate dependent on $\|f_\theta\|_L$. The authors originally proposed a biased approximation of the log determinant of the Jacobian as a power series where the Jacobian trace is approximated using Hutchinson's trace estimator (see Table 3), but an unbiased approximator known as a Russian roulette estimator has also been proposed [26]. Unlike coupling layers, residual flows have dense Jacobians, allowing interaction. Enforcing Lipschitz constraints has been achieved with convolutional networks [60], [139], [157] as well as self-attention [120].
Making strong Lipschitz assumptions severely restricts the class of functions learned; an $N$ layer residual flow network is at most $2^N$-Lipschitz. Implicit flows [143] bypass this by solving implicit equations of the form
$$F(x, y) = f_\theta(x) - f_\phi(y) + x - y = 0,$$
where $f_\theta$ and $f_\phi$ both have Lipschitz constants less than 1. Both the forwards (solve for $y$ given $x$) and backwards (solve for $x$ given $y$) directions require solving a root finding problem similar to the inverse process of residual flows; indeed, an implicit flow is equivalent to the composition of a residual flow and the inverse of a residual flow. This allows them to model arbitrary Lipschitz transformations.
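The fixed-point inversion of a contractive residual block can be sketched as follows. This is a toy example, assuming a residual branch of the form $\tanh(Mx)$ with the spectral norm of $M$ rescaled to 0.5 so that the Lipschitz constant is below 1; the names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5
M = rng.normal(size=(dim, dim))
M *= 0.5 / np.linalg.norm(M, 2)  # rescale so that the spectral norm equals 0.5

def f_theta(x):
    """Contractive residual branch: tanh is 1-Lipschitz, so Lip(f_theta) <= 0.5."""
    return np.tanh(M @ x)

def residual_forward(x):
    return x + f_theta(x)

def residual_inverse(y, n_iters=100):
    """Solve x = y - f_theta(x) by fixed-point iteration, starting from x = y."""
    x = y.copy()
    for _ in range(n_iters):
        x = y - f_theta(x)
    return x

x = rng.normal(size=dim)
y = residual_forward(x)
x_rec = residual_inverse(y)
print(np.abs(x_rec - x).max())
```

The error contracts by at least the Lipschitz constant on each iteration, so a smaller constant gives faster inversion at the cost of a less expressive block.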

Surjective and Stochastic Layers
Restricting the class of functions available to those that are invertible introduces a number of practical problems related to the topology-preserving property of diffeomorphisms. For example, mapping a uni-modal distribution to a multimodal distribution is extremely challenging, requiring a highly varying Jacobian [44]. By composing bijections with surjective or stochastic layers these topological constraints can be bypassed [164]. While the log-likelihood of stochastic layers can only be bounded by their ELBO, functions surjective in the inference direction permit exact likelihood evaluation even with altered dimensionality. Surjective transformations have the following likelihood contributions:
$$\mathbb{E}_{q(y \mid x)}\left[\ln \frac{p(x \mid y)}{q(y \mid x)}\right],$$
where $p(x \mid y)$ is deterministic for generative surjections, and $q(y \mid x)$ is deterministic for inference surjections. One approach to build a surjective layer is to augment the input space with additional dimensions, allowing smoother transformations to be learned [25], [48], [95]; the inverse process, where some dimensions are factored out, is equivalent to a multi-scale architecture [43]. Another approach known as RAD [44] learns a partitioning of the data space into disjoint subsets $\{\mathcal{Y}_i\}_{i=1}^{K}$, and applies piecewise bijections to each region $g_i : \mathcal{X} \to \mathcal{Y}_i$, $\forall i \in \{1, \ldots, K\}$. The generative direction learns a classifier on $\mathcal{X}$, $i \sim p(i \mid x)$, allowing the inverse to be calculated as $y = g_i(x)$. Similar to both of these approaches are CIFs [36], which consider a continuous partitioning of the data space via augmentation equivalent to an infinite mixture of normalizing flows. Other approaches include modelling finite mixtures of flows [47].
Some powerful stochastic layers have already been discussed in this survey, namely VAEs [123] and DDPMs [87]. Stochastic layers have been incorporated into normalizing flows by interleaving small energy models, sampled with MCMC, between bijectors [246].

Discrete Flows
The normalizing flow framework can be extended to discrete distributions, by restricting transformation functions to be discrete e.g. f : X d → X d . Integer discrete flows (IDF) achieve this using additive coupling layers, rounding translation values to the nearest integer and approximating gradients with the straight-through estimator [94]; discrete flows [221] apply affine coupling layers in modulo space while also restricting the translation and scaling coefficients to a finite number of possible values. In this case the change of variables rule (Eqn. 29) simplifies to [94], [221] p(x) = p(f (x)).
Unlike the continuous case, there is no Jacobian determinant term; intuitively this term adjusts for volume changes, but in a discrete space there is no volume. As such, there is no requirement for $f$ to have an efficiently computable Jacobian determinant [221]. The absence of this term is restrictive, however: discrete flows can only permute the values of $p(x)$, not change them, i.e. a uniform base distribution can only be mapped to another uniform distribution [177]. Nevertheless, this can be avoided by embedding the data into a space with more values than the data, making IDFs more flexible than discrete flows [13].
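The permutation property can be seen in a tiny example. The sketch below assumes a two-variable discrete space with $K = 3$ values per variable and a coupling layer that shifts the second variable modulo $K$ by an amount depending on the first; the `shift` function is a hypothetical stand-in for a network and the base probabilities are arbitrary.

```python
import itertools

K = 3
states = list(itertools.product(range(K), repeat=2))
weights = [w / 45.0 for w in range(1, 10)]  # the integers 1..9 sum to 45
base = dict(zip(states, weights))           # toy base distribution over 9 states

def shift(x1):
    """Hypothetical conditioner: translation for x2 as a function of x1."""
    return (2 * x1 + 1) % K

def flow(x):
    """Discrete coupling layer in modulo space: x1 is unchanged, x2 is shifted."""
    x1, x2 = x
    return (x1, (x2 + shift(x1)) % K)

# Discrete change of variables: p(x) = p_base(f(x)), with no Jacobian term.
model = {x: base[flow(x)] for x in states}

print(sorted(model.values()) == sorted(base.values()))  # True: only a permutation
```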

Continuous Time Flows
It is possible to consider a normalizing flow with an infinite number of steps that is defined instead by an ordinary differential equation specified by a Lipschitz continuous neural network f with parameters θ, that describes the transformation of a hidden state x(t) ∈ R D [27], Starting from input noise x(t 0 ), an ODE solver can solve an initial value problem for some time t 1 , at which data is defined, x(t 1 ). Modelling a transformation in this form has a number of advantages such as inherent invertibility by running the ODE solver backwards, parameter efficiency, and adaptive computation. However, it is not immediately clear how to train such a model through backpropagation. While it is possible to backpropagate directly through an ODE solver, this limits the choice of solvers to differentiable ones as well as requiring large amounts of memory. Instead, the authors apply the adjoint sensitivity method which instead solves a second, augmented ODE backwards in time and allows the use of a black box ODE solver. That is, to optimise a loss dependent on an ODE solver: the adjoint a(t) = ∂L ∂x(t) can be used to calculate the derivative of loss with respect to the parameters in the form of another initial value problem [180], which can be efficiently evaluated by automatic differentiation at a time cost similar to evaluating f itself.
Despite the complexity of this transformation, the continuous change of variables rule is remarkably simple:
$$\frac{\partial \log p(x(t))}{\partial t} = -\mathrm{Tr}\left(\frac{\partial f}{\partial x(t)}\right),$$
and can be computed using an ODE solver as well. The resulting continuous-time flow is known as FFJORD [62].
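The continuous change of variables can be sketched by integrating the state and the change in log-density together. The example below is a minimal illustration under stated assumptions: the dynamics tanh(Wx) with a fixed 2×2 matrix `W` stand in for a trained network, the trace is computed exactly rather than with a stochastic estimator, and the solver is a hand-written fixed-step RK4. The accumulated log-density change is then compared with the log-determinant of the flow map's Jacobian, estimated by finite differences.

```python
import numpy as np

W = np.array([[0.5, -1.0], [1.0, 0.3]])   # hypothetical fixed "network" weights

def f(x):
    """Toy autonomous dynamics standing in for a neural network."""
    return np.tanh(W @ x)

def trace_jacobian(x):
    """Exact trace of df/dx; the Jacobian is diag(1 - tanh^2(Wx)) @ W."""
    return float(np.sum((1.0 - np.tanh(W @ x) ** 2) * np.diag(W)))

def rk4(func, state, t0, t1, steps=200):
    """Fixed-step RK4 for an autonomous ODE."""
    h = (t1 - t0) / steps
    for _ in range(steps):
        k1 = func(state)
        k2 = func(state + 0.5 * h * k1)
        k3 = func(state + 0.5 * h * k2)
        k4 = func(state + h * k3)
        state = state + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return state

def augmented(state):
    """Joint dynamics of x and the log-density change: d(logp)/dt = -Tr(df/dx)."""
    x = state[:-1]
    return np.append(f(x), -trace_jacobian(x))

def flow(x0, t0=0.0, t1=1.0):
    """Return x(t1) and log p(x(t1)) - log p(x(t0))."""
    out = rk4(augmented, np.append(x0, 0.0), t0, t1)
    return out[:-1], out[-1]

x0 = np.array([0.4, -0.8])
x1, delta_logp = flow(x0)

# Independent check: log|det dx(t1)/dx(t0)| via central finite differences.
eps = 1e-5
J = np.zeros((2, 2))
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    J[:, j] = (flow(x0 + e)[0] - flow(x0 - e)[0]) / (2 * eps)
logdet = np.log(abs(np.linalg.det(J)))
```

Up to discretisation error, `delta_logp` should equal `-logdet`, matching the discrete change of variables formula while only requiring a trace rather than a full determinant.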
Since the length of the flow tends to infinity (an infinitesimal flow), the true posterior distribution can be recovered [186]. As previously mentioned, invertible functions suffer from topological problems; this is especially true for Neural ODEs since their continuous nature prevents trajectories from crossing. Similar to augmented normalizing flows [95], this can be solved by providing additional dimensions for the flow to traverse [48]. Specifically, homeomorphisms on a p-dimensional Euclidean space can be approximated by a Neural ODE operating in a (2p + 1)-dimensional space [254].

Regularising Trajectories
ODE solvers can require large numbers of network evaluations, notably when the ODE is stiff or the dynamics change quickly in time. By introducing regularisation, a simpler ODE can be learned, reducing the number of evaluations required. The works discussed here all draw on optimal transport theory to encourage straight trajectories. Monge-Ampère Flow [257] and Potential Flow Generators [250] parameterise a potential function satisfying the Monge-Ampère equation [22], [237] with a neural network. RNODE [54] applies transport costs to FFJORD and also regularises the Frobenius norm of the Jacobian, encouraging straight trajectories. Combining these approaches, OT-Flow [172] uses the optimal transport formulation to derive an exact trace computation with cost similar to that of stochastic estimators.
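The two kinds of penalty mentioned above can be sketched as integrals accumulated along a trajectory: a kinetic (transport) cost on the squared velocity and a squared Frobenius norm of the Jacobian. The code below is only an illustration under stated assumptions. It uses Euler integration, an exact Jacobian, and toy dynamics, whereas the cited methods may weight, estimate, or discretise these terms differently; the function `trajectory_penalties` and the matrix `W` are hypothetical.

```python
import numpy as np

def trajectory_penalties(f, jac, x0, t0=0.0, t1=1.0, steps=100):
    """Euler-integrate dx/dt = f(x) and accumulate two regularisers:
    kinetic  ~ integral of ||f(x(t))||^2 dt       (transport cost)
    frob     ~ integral of ||df/dx(x(t))||_F^2 dt (Jacobian penalty)
    """
    h = (t1 - t0) / steps
    x = np.asarray(x0, dtype=float)
    kinetic, frob = 0.0, 0.0
    for _ in range(steps):
        v = f(x)
        kinetic += h * float(v @ v)
        frob += h * float(np.sum(jac(x) ** 2))
        x = x + h * v
    return x, kinetic, frob

# Curved toy dynamics: f(x) = tanh(Wx), Jacobian diag(1 - tanh^2(Wx)) @ W.
W = np.array([[0.5, -1.0], [1.0, 0.3]])   # hypothetical weights
f_curved = lambda x: np.tanh(W @ x)
jac_curved = lambda x: (1.0 - np.tanh(W @ x) ** 2)[:, None] * W

# Straight-line dynamics: constant velocity, so the Jacobian is zero everywhere.
v_const = np.array([1.0, 2.0])
f_straight = lambda x: v_const
jac_straight = lambda x: np.zeros((2, 2))

_, kin_s, frob_s = trajectory_penalties(f_straight, jac_straight, [0.0, 0.0])
_, kin_c, frob_c = trajectory_penalties(f_curved, jac_curved, [0.4, -0.8])
```

A constant velocity field incurs no Jacobian penalty, which gives some intuition for why adding such terms to the training loss pushes the learned dynamics towards straight, easy-to-integrate paths.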

EVALUATION METRICS
A major challenge when developing generative models is how to evaluate and compare them effectively. Qualitative comparison of random samples plays a large role in the majority of state-of-the-art works; however, it is subjective, and comparing many works in this way is time-consuming. Calculating the log-likelihood on a separate validation set is popular for tractable likelihood models, but comparison with implicit likelihood models is difficult, and while log-likelihood is a good measure of diversity, it does not correlate well with sample quality [215].
One approach to quantify sample quality is the Inception Score (IS) [189], which takes a trained classifier and determines whether a sample has low label entropy, indicating that a meaningful class is likely, and whether the distribution of classes over a large number of samples has high entropy, indicating that a diverse range of images can be sampled. However, a perfect IS can be scored by a model that creates only one image per class [144]. This led to the creation of the Fréchet Inception Distance (FID) [81], which models the activations of a particular layer of a classifier as multivariate Gaussians for real and generated data, and measures the Fréchet distance between the two.
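Both metrics reduce to short computations once classifier outputs are available. The sketch below assumes the class probabilities and feature statistics have already been extracted; no classifier is run, and the inputs are synthetic. The helper names `inception_score` and `frechet_distance` are illustrative. IS is computed as the exponential of the mean KL divergence between each p(y|x) and the marginal p(y), and the Fréchet distance between two Gaussians as ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}), with the trace of the matrix square root obtained from the eigenvalues of Σ1Σ2.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, K) array of class probabilities p(y|x) for N generated samples."""
    marginal = probs.mean(axis=0, keepdims=True)            # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2)."""
    diff = mu1 - mu2
    # Tr((sigma1 sigma2)^{1/2}) equals the sum of square roots of the eigenvalues.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

# Degenerate "generator": exactly one confidently classified sample per class.
one_per_class = np.eye(10)
is_degenerate = inception_score(one_per_class)

# Uninformative predictions: every sample gets a uniform class distribution.
is_uniform = inception_score(np.full((10, 10), 0.1))

# Frechet distance on synthetic features (stand-ins for classifier activations).
rng = np.random.default_rng(0)
feats = rng.normal(size=(500, 4))
mu, sigma = feats.mean(axis=0), np.cov(feats, rowvar=False)
fd_same = frechet_distance(mu, sigma, mu, sigma)
fd_shifted = frechet_distance(mu, sigma, mu + 1.0, sigma)
```

The degenerate case reaches the maximum score for ten classes despite containing only ten distinct samples, which illustrates the weakness of IS noted above; the Fréchet distance, by contrast, compares generated statistics against real ones.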
Both metrics can be trivially optimised by memorising the dataset, and are less applicable to data other than natural images. Kernel Inception Distance (KID) [15] instead calculates the squared maximum mean discrepancy in feature space; however, pretrained features may not be sufficient to detect overfitting. Another approach is to train a neural network to distinguish between real and generated samples, similar to the discriminator from a GAN; while this detects overfitting, it increases the complexity and time required to evaluate a model and is biased towards adversarial models [74].
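A minimal sketch of the squared maximum mean discrepancy underlying KID is given below. It assumes features have already been extracted, using synthetic Gaussian vectors in their place, and uses a cubic polynomial kernel k(a, b) = (a·b/d + 1)³ with the unbiased estimator; the function names are illustrative.

```python
import numpy as np

def poly_kernel(a, b):
    """Cubic polynomial kernel k(a, b) = (a.b / d + 1)^3 on feature vectors."""
    d = a.shape[1]
    return (a @ b.T / d + 1.0) ** 3

def squared_mmd(x, y):
    """Unbiased estimate of the squared MMD between feature sets x and y."""
    m, n = len(x), len(y)
    k_xx, k_yy, k_xy = poly_kernel(x, x), poly_kernel(y, y), poly_kernel(x, y)
    # Exclude diagonal terms so the within-set averages are unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return float(term_xx + term_yy - 2.0 * k_xy.mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 8))              # stand-in for real features
fake_good = rng.normal(size=(200, 8))         # same distribution as real
fake_bad = rng.normal(size=(200, 8)) + 1.0    # shifted distribution

kid_good = squared_mmd(real, fake_good)
kid_bad = squared_mmd(real, fake_bad)
```

Because the estimator is unbiased, its value for matching distributions fluctuates around zero and can be slightly negative, whereas a mismatch in the feature distribution yields a clearly positive value.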

APPLICATIONS
In general, the definition of a generative model means that any technique can be used on any modality or task; however, some models are better suited to certain tasks. Standard autoregressive networks are popular for text/audio generation [20], [32], [231]; VAEs have been applied but posterior collapse is difficult to mitigate [6], [18]; GANs are more parameter efficient but struggle to model discrete data [163] and suffer from mode collapse [128]; some normalizing flows offer parallel synthesis, providing substantial speedup [181], [221], [266]. Video synthesis is more challenging due to its exceptionally high dimensionality. Typically, approaches combine a latent-based implicit generative model, used to generate individual frames, with an autoregressive network used to predict future latents [6], [129], [134], similar to how world models are constructed in reinforcement learning [76], [77]. Modality conversion has been achieved using GANs [265], VAE-GANs [140], and DDPMs [193].

Implicit Representation
Typically, deep architectures discussed in this survey are built with data represented as discrete arrays, thus using discrete components such as convolutions and self-attention. Implicit representation, on the other hand, treats data as continuous signals, mapping coordinates to data values [198], [212]. Implicit Gradient Origin Networks (GONs; Fig. 9a) [17] form a latent variable model by concatenating latent vectors with coordinates which are passed through an implicit network; here latent vectors are calculated as the gradient of a reconstruction loss with respect to the origin. By sampling using a finer grid of coordinates, super-resolution beyond resolutions seen during training is possible. Other approaches to learn an implicit generative model as a GAN include directly feeding latents through an implicit network with upsampling [116] and mapping latents to the weights of an implicit function using a hyper-network [49] (Fig. 9b).
Fig. 9: Implicit networks model data continuously permitting arbitrarily high resolutions. Dashed lines represent gradients, F is an implicit network, and H is a hypernetwork. (a) Implicit GON [17]. (b) Implicit GAN [49].
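The gradient-origin idea and resolution-independent sampling can be illustrated with a deliberately simplified stand-in. In the sketch below the implicit function is linear in the latent, F(c, z) = Φ(c)z with fixed Fourier features Φ, rather than a trained network that concatenates latents with coordinates; this is an assumption made so that the gradient can be written analytically, and all names are hypothetical. The latent is obtained as the negative gradient of the reconstruction loss at the origin, and the same latent is then decoded on a finer coordinate grid.

```python
import numpy as np

def features(coords, n_freq=4):
    """Fourier features of 1-D coordinates in [0, 1), scaled to unit mean square."""
    k = np.arange(1, n_freq + 1)
    ang = 2.0 * np.pi * coords[:, None] * k[None, :]
    return np.sqrt(2.0) * np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def implicit_net(coords, z):
    """Toy implicit function F(c, z), linear in the latent z (an assumption)."""
    return features(coords) @ z

def recon_loss(x, coords, z):
    """Mean squared reconstruction error over the sampled coordinates."""
    return 0.5 * np.mean((x - implicit_net(coords, z)) ** 2)

def gon_latent(x, coords):
    """Latent as the negative gradient of the loss w.r.t. z0, taken at z0 = 0."""
    phi = features(coords)
    z0 = np.zeros(phi.shape[1])
    grad = -phi.T @ (x - phi @ z0) / len(coords)   # analytic dL/dz0 for this F
    return -grad

def signal_fn(c):
    """A band-limited test signal observed on a coordinate grid."""
    return np.sin(2 * np.pi * c) + 0.5 * np.cos(4 * np.pi * c)

coarse = np.linspace(0.0, 1.0, 16, endpoint=False)   # training resolution
z = gon_latent(signal_fn(coarse), coarse)

# Decode the same latent on a 4x finer grid of coordinates.
fine = np.linspace(0.0, 1.0, 64, endpoint=False)
upsampled = implicit_net(fine, z)
```

In an actual GON the network is trained so that this single gradient step yields a useful latent; here the fixed orthogonal features play that role, which is why the toy signal is recovered exactly on the finer grid.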

CONCLUSION
While GANs have led the way in terms of sample quality for some time now, the gap to other approaches is shrinking; the diminished mode collapse and simpler training objectives of these alternatives make them more enticing than ever, although the number of parameters required and their slow run-times pose a substantial handicap. Despite this, recent work in hybrid models offers a balance between extremes, at the expense of extra model complexity that hinders broader adoption. The varied connections between these systems mean that advances in one field inevitably benefit others; for instance, improved variational bounds are beneficial for VAEs, diffusion models, and surjective flows, and the application of innovative data augmentation strategies has been found to offer benefits across numerous model classes without necessitating more powerful architectures. When it comes to scaling models to high-dimensional data, attention is a common theme, allowing long-range dependencies to be learned; recent advances in linear attention will aid scaling to even higher resolutions. Implicit networks are another promising direction, allowing efficient synthesis of arbitrarily high resolution and irregular data. Similar unified generative models capable of modelling continuous, irregular, and arbitrary length data, over different scales and domains, will be key for the future of generalisation.