Enhancing variational generation through self-decomposition

In this article we introduce the notion of Split Variational Autoencoder (SVAE), whose output $\hat{x}$ is obtained as a weighted sum $\sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$ of two generated images $\hat{x}_1,\hat{x}_2$, where $\sigma$ is a {\em learned} compositional map. The composing images $\hat{x}_1,\hat{x}_2$, as well as the $\sigma$-map, are automatically synthesized by the model. The network is trained as a usual Variational Autoencoder, with a negative loglikelihood loss between training and reconstructed images. No additional loss is required for $\hat{x}_1,\hat{x}_2$ or $\sigma$, nor any form of human tuning. The decomposition is nondeterministic, but follows two main schemes, which we may roughly categorize as either \say{syntactic} or \say{semantic}. In the first case, the map tends to exploit the strong correlation between adjacent pixels, splitting the image into two complementary high-frequency sub-images. In the second case, the map typically focuses on the contours of objects, splitting the image into interesting variations of its content, with more marked and distinctive features. In this case, according to empirical observations, the Fr\'echet Inception Distance (FID) of $\hat{x}_1$ and $\hat{x}_2$ is usually lower (hence better) than that of $\hat{x}$, which clearly suffers from being the average of the former. In a sense, a SVAE forces the Variational Autoencoder to make choices, in contrast with its intrinsic tendency to {\em average} between alternatives in order to minimize the reconstruction loss towards a specific sample. According to the FID metric, our technique, tested on typical datasets such as Mnist, Cifar10 and CelebA, allows us to outperform all previous purely variational architectures (not relying on normalization flows).


I. INTRODUCTION
Generative modeling (see e.g. [27] for an introduction) is one of the most fascinating problems in Artificial Intelligence, with many relevant applications in different areas, including computer vision, natural language processing, medicine and reinforcement learning. The goal is not only to be able to sample new realistic examples starting from a given set of data, but also to gain insight into the data manifold, and into the way a neural network is able to extract and exploit the characteristic features of the data. In the case of high-dimensional data, the generative problem can only be addressed by means of Deep Neural Networks, and the topic has driven tremendous research in many different directions, particularly nourishing the recent field of unsupervised representation learning.
Among the different kinds of generative models that have been investigated, Variational Autoencoders (VAEs) [18], [26] have always exerted a particular fascination [29], [34], [35], mostly due to their strong theoretical foundations, which will be briefly recalled in Section II. Unfortunately, results have remained below expectations, and the generative quality of VAEs is systematically outperformed by different generative techniques such as Generative Adversarial Networks.
A particularly annoying problem is that VAEs produce images with a characteristic blurriness, which is very hard to remove with traditional techniques [20], [28]. The source of the problem is not easy to identify, but it is likely due to averaging, implicitly underlying the VAE framework and, more generally, any autoencoder approach. As observed in [11], in the presence of multimodal output, a loglikelihood objective typically results in averaging, and hence blurriness. A GAN does not have this problem, since its goal is to fool the Discriminator, not to reconstruct a given input. Since Variational Autoencoders are intrinsically multimodal, both due to dimensionality reduction and to the sampling process during training, a certain amount of blurriness is unfortunately to be expected.
Starting from the averaging assumption [2], it is natural to try to address blurriness by offering to the Variational Autoencoder the possibility to create multiple images, and then to synthesize a result as a (learned) weighted combination of them. This is precisely what our Split Variational Autoencoder (SVAE) is supposed to do: the generator returns two images $\hat{x}_1, \hat{x}_2$ and a probability map $\sigma$ with the same spatial dimension as the images, and synthesizes the resulting image $\hat{x} = \sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$, where $\odot$ is point-wise multiplication (broadcasted over channels). The Autoencoder is trained by minimizing the reconstruction loss between $x$ and $\hat{x}$, together with the traditional regularization component over latent variables. No additional loss is imposed over $\hat{x}_1$, $\hat{x}_2$ or $\sigma$.

The resulting decomposition is non-deterministic, mostly depending on the network architecture and the dimension of the latent space. However, it seems to follow two main schemes, which we call "syntactic" (see Figures 1 and 3) and "semantic" (see Figures 2 and 4). In these Figures, the top line is the $\sigma$ map, the second line is $\hat{x}_1$, the third line is $\hat{x}_2$, and in the last line we have $\hat{x} = \sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$. Let us also remark that all images in the pictures have been generated, not reconstructed.

In the first case, the map takes advantage of the strong correlation between adjacent pixels, splitting the image into two complementary high-frequency sub-images. Each image has more freedom in filling the ignored parts, easing the generative task.

In the second, even more interesting case, the map focuses on the contours of objects, splitting the image into interesting variations around them, typically resulting in more marked and distinctive features. In this case, the Fréchet Inception Distance (FID) of $\hat{x}_1$ and $\hat{x}_2$ may even be lower (hence better) than that of $\hat{x}$, which apparently suffers from being the average of the former.

FIGURE 1: Example of "syntactic" decomposition for Mnist. The image is decomposed into complementary high-frequency sub-images; this usually helps to decorrelate adjacent pixels in the latent encoding.

FIGURE 2: Example of "semantic" decomposition for Mnist. Digits are usually decomposed into a "fat" and a "thin" version following the contours of objects. The compound image $\hat{x} = \sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$ is particularly neat and smooth.

FIGURE 3: Example of "syntactic" decomposition for CelebA. In this case, the FID score for $\hat{x}_1$ and $\hat{x}_2$ is usually bad. Still, the decomposition helps to achieve stable and robust training, typically resulting in good generative results for the compound image $\hat{x} = \sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$.

FIGURE 4: Example of "semantic" decomposition for CelebA. This is the most interesting case. The map focuses on the contours of objects, emphasizing them in opposite directions. This is frequently rewarding in terms of the FID score for $\hat{x}_1$ and $\hat{x}_2$, which is usually better than that of $\hat{x}$.
An interesting aspect of SVAEs, and possibly one of the reasons behind their effectiveness, is that they allow us to work with a number of latent variables considerably higher than usual, hence implicitly addressing the variable-collapse phenomenon [1], [6], [25], [31], [36]. This seems to be an indication that self-splitting is indeed a convenient way to induce the model to synthesize a large number of uncorrelated latent features.
We tested SVAE on typical datasets such as Mnist, Cifar10 and CelebA, and in all cases we observed substantial improvements w.r.t. the "vanilla" approach. Excluding models that make use of sophisticated techniques like normalizing flows [16], [23], [32] (which typically require thousands of latent variables, practically hindering a fruitful exploration of the latent space), SVAE outperforms all previous variational architectures.
The code for this work can be accessed in the following public GitHub repository: https://github.com/asperti/Split-VAE. Pretrained weights for the models discussed in the article are available at the following page: https://www.cs.unibo.it/~asperti/SVAE.html.

A. STRUCTURE OF THE ARTICLE
In Section II we briefly recall the theory behind Variational Autoencoders, show their encoder-decoder architecture (Section II-A), and discuss some aspects related to the dimension of the latent space and the so-called variable-collapse phenomenon (Section II-B). Section III introduces the notion of Split Variational Autoencoder and provides a detailed description of the ResNet-like architecture used in our experiments. In Section IV we outline our experimental setting, discussing the metrics and datasets used for the benchmarks. Quantitative results are given in Section V, along with a critical discussion. Ablation studies are discussed in Section VI. In the Conclusions (Section VII), we summarize the content of the article and outline research directions for future developments.

II. BACKGROUND
There exist in the literature several good introductions to Variational Autoencoders (VAEs) [4], [9], [19], so in this section we only provide a short introduction to the topic, mostly with the purpose of fixing notation and terminology.
In a latent variable approach, the probability distribution $p(x)$ of a data point $x$ is expressed through marginalization over a vector $z$ of latent variables:
$$p(x) = \int p(x|z)\,p(z)\,dz$$
where $z$ is the latent encoding of $x$, distributed according to a known distribution $p(z)$ called the prior distribution. If we can learn a good approximation of $p(x|z)$ from the data, we can use it to generate new samples via ancestral sampling:
• sample $z \sim p(z)$;
• sample $x \sim p(x|z)$.
Supposing we have a parametric family of probability distributions $\{p_\theta(x|z)\}$ (e.g. modelled by a neural network), the goal is to find the $\theta^*$ that maximizes the loglikelihood over all $x \in D$ (MLE):
$$\theta^* = \operatorname{argmax}_\theta \sum_{x \in D} \log p_\theta(x)$$
Addressing the previous optimization problem directly is usually computationally infeasible. For this reason, VAEs exploit another probability distribution $q_\phi(z|x)$, called the inference (or encoder) distribution, expressing the relation between a data point $x$ and its associated latent representation $z$. Hopefully, $q_\phi(z|x)$ approximates $p_\theta(z|x)$, so that their Kullback-Leibler divergence is small. Expanding this divergence, we get:
$$D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) = \log p_\theta(x) - \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] + D_{KL}(q_\phi(z|x)\,\|\,p(z))$$
Hence,
$$\log p_\theta(x) - D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\,\|\,p(z))$$
Recalling that $D_{KL}$ is always positive, we get
$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\,\|\,p(z)) \le \log p_\theta(x)$$
stating that the left-hand side is a lower bound for the loglikelihood of $p_\theta(x)$, known as the Evidence Lower Bound (ELBO). Since the ELBO is more tractable than the MLE objective, it is used as the cost function for training the neural network. By optimizing the ELBO we are jointly improving the loglikelihood of $p_\theta(x)$ and implicitly minimizing the distance between $q_\phi(z|x)$ and $p_\theta(z|x)$.
The ELBO has a form similar to an autoencoder: the inference distribution q φ (z|x) encodes the input x to its latent representation z, and p θ (x|z) decodes z back to x.
For generative sampling, we just exploit the decoder, sampling latent variables according to the prior distribution p(z) (that must be known).

A. VANILLA VAE AND ITS TRAINING
In the vanilla VAE, we assume $q_\phi(z|x)$ to be a Gaussian distribution $G(\mu_\phi(x), \sigma^2_\phi(x))$, so that learning $q_\phi(z|x)$ amounts to learning its first two moments. It is important to know the variance $\sigma^2_\phi(x)$, since during training we need to sample according to $q_\phi(z|x)$.
Supposing that the model approximating the decoder function µ θ (z) is sufficiently expressive, the shape of the prior distribution p(z) does not really matter, and it is traditionally assumed to be a normal distribution p(z) = G(0, I).
Under these assumptions, the term $D_{KL}(q_\phi(z|x)\,\|\,p(z))$ is the KL-divergence between the two Gaussian distributions $G(\mu_\phi(x), \sigma^2_\phi(x))$ and $G(0, I)$, which has the following closed-form expression:
$$D_{KL}(q_\phi(z|x)\,\|\,p(z)) = \frac{1}{2}\sum_{i=1}^{k}\left(\mu^2_{\phi,i}(x) + \sigma^2_{\phi,i}(x) - \log \sigma^2_{\phi,i}(x) - 1\right)$$
where $k$ is the dimension of the latent space.
Coming to the reconstruction loss $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$: under the Gaussian assumption, the logarithm of $p_\theta(x|z)$ is proportional to the quadratic distance between $x$ and its reconstruction $\mu_\theta(z)$; the variance of this Gaussian distribution can be understood as a parameter balancing the relative importance of the reconstruction error and the KL-divergence [5], [9].
The problem of integrating sampling with backpropagation during training is addressed by the so-called reparametrization trick [18], [26]: a sample $\epsilon \sim G(0, I)$ is drawn from a standard distribution outside of the backpropagation flow, and then rescaled and shifted with the predicted moments: $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$.
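As an illustration, the following minimal sketch shows reparametrized sampling together with the two components of the (negative) ELBO discussed above. It is written in TensorFlow-style Python; `encoder`, `decoder` and the log-variance parametrization are assumptions made for the example, not the actual code of our repository.

```python
import tensorflow as tf

def vae_loss(encoder, decoder, x, beta=1.0):
    # The encoder predicts the first two moments of q(z|x);
    # the variance is parametrized through its logarithm.
    mu, logvar = encoder(x)

    # Reparametrization trick: sample from a standard Gaussian
    # outside the backpropagation flow, then rescale and shift.
    eps = tf.random.normal(tf.shape(mu))
    z = mu + tf.exp(0.5 * logvar) * eps

    x_hat = decoder(z)

    # Reconstruction term: under the Gaussian assumption, the
    # negative loglikelihood is proportional to the squared error
    # (summed over the spatial and channel axes of the images).
    rec = tf.reduce_sum(tf.square(x - x_hat), axis=[1, 2, 3])

    # Closed-form KL divergence between G(mu, sigma^2) and G(0, I).
    kl = 0.5 * tf.reduce_sum(
        tf.square(mu) + tf.exp(logvar) - logvar - 1.0, axis=-1)

    return tf.reduce_mean(rec + beta * kl)
```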
It is important to stress that sampling at training time has no relation with ancestral sampling for generation (we do not have $x$ at generation time!). The purpose of sampling at training time is to provide estimates of the moments of $q_\phi(z|x)$, which are then subject to KL-regularization. In turn, the final goal of this regularization is to bring the marginal inference distribution $q(z) = \mathbb{E}_{x \in D}\, q_\phi(z|x)$ close to the prior $p(z)$.

B. THE DIMENSION OF THE LATENT SPACE
A critical aspect of VAEs is the dimension of the latent space. Typically, having many latent variables reduces the compression loss and improves reconstruction. However, this may not result in an improvement of the generative model: the more variables we have, the harder it is to ensure their independence and force them to assume the desired prior distribution. We may try to tame them by strengthening the KL-regularization component in the loss function, in the spirit of a β-VAE [7], [13], but this typically results in the collapse of the less informative variables, which get completely ignored by the decoder [1], [6], [25], [31], [36]. A collapsed variable $z$ has a very characteristic behaviour: since it is ignored by the decoder, it is free to minimize the KL-regularization, with a mean value $\mu_z(x) = 0$ and a variance $\sigma^2_z(x) = 1$ for any $x$. A more expressive architecture typically results in a better exploitation of latent variables, allowing us to work with a larger number of them. The dimension of the latent space also reflects the complexity of the data manifold: for instance, with non-hierarchical architectures, it is customary to work with 16 variables for Mnist, 128 variables for Cifar10 and 64 variables for CelebA. As we shall see, with a SVAE we can substantially enlarge these numbers.
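This characteristic behaviour suggests a simple diagnostic for counting inactive variables. Below is a minimal sketch in NumPy; the arrays of encoder moments and the tolerance threshold are illustrative assumptions, not values taken from our experiments.

```python
import numpy as np

def count_collapsed(mu, sigma2, tol=0.01):
    """mu, sigma2: arrays of shape (n_samples, k) holding the
    encoder moments computed over a dataset."""
    # A collapsed variable is ignored by the decoder, so it is free
    # to match the prior: its mean is ~0 and its variance ~1 for
    # every input. The spread of its predicted means across the
    # dataset is therefore close to 0 as well.
    mean_spread = np.var(mu, axis=0)
    avg_variance = np.mean(sigma2, axis=0)
    collapsed = (mean_spread < tol) & (np.abs(avg_variance - 1.0) < tol)
    return int(np.sum(collapsed))
```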

III. SPLIT-VAE
The general notion of VAE does not impose any constraint on the architecture of the encoder and the decoder, and many different variants have been investigated in the literature: dense, convolutional, residual, with autoregressive flows, hierarchical, and so on (see e.g. [4], [34] for a discussion).
A Split-VAE (SVAE) is just another architectural variant: we do not touch the theory or the loss function. In a SVAE the output $\hat{x}$ is computed as a weighted sum of two generated images $\hat{x}_1, \hat{x}_2$, where $\sigma$ is a learned compositional map. Typically, to turn a vanilla VAE into a Split-VAE it is enough to change the number of channels of the last layer of the network: in the case of a grayscale image, passing from 1 to 3, and in the case of a color image, passing from 3 to 1+3+3=7 (the $\sigma$ map and the two color images $\hat{x}_1, \hat{x}_2$). The compound image $\hat{x}$ can be computed internally or externally to the network, as part of the loss function. Since the increased number of channels only concerns the very last layer, the total number of parameters remains comparable to the vanilla version.
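As an illustration for the color case, here is a minimal sketch of such a final layer in TensorFlow-style Python; the sigmoid activations and the kernel size are assumptions made for the example, not the actual code of our repository.

```python
import tensorflow as tf

def split_output(decoder_features):
    """Map the last decoder feature map to the SVAE output: a
    7-channel convolution split into the sigma-map and the two
    component images x1, x2."""
    raw = tf.keras.layers.Conv2D(7, 3, padding="same")(decoder_features)
    # sigma is a single-channel probability map, broadcast over
    # the color channels of x1 and x2.
    sigma = tf.sigmoid(raw[..., 0:1])
    x1 = tf.sigmoid(raw[..., 1:4])
    x2 = tf.sigmoid(raw[..., 4:7])
    # Compound image: pixel-wise convex combination of x1 and x2.
    x_hat = sigma * x1 + (1.0 - sigma) * x2
    return x_hat, x1, x2, sigma
```

Only $\hat{x}$ enters the reconstruction loss; $\hat{x}_1$, $\hat{x}_2$ and $\sigma$ receive no direct supervision.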
The philosophy underlying a SVAE has already been discussed in the introduction: we merely create an opportunity to be exploited by the model. From a more practical perspective, it can be understood as a way to induce diversification in the features learned by the network, via a simple but highly effective self-attention mechanism [33]. From this point of view, it is not too far from techniques like squeeze and excitation [14] or feature-wise linear modulation [24]; the main difference is that we operate on the visible level, and along spatial dimensions. This allows us, among other things, to provide intelligible visualizations of the splitting learned by the network.
One of the main characteristics of Split-VAEs is that they allow us to work with a considerably larger number of latent variables: 32 for Mnist, 200 for Cifar10 and 150 for CelebA. This testifies to the diversification of latent features, and partially explains the improved generative quality.

A. ENCODER-DECODER ARCHITECTURE
For the implementation of the encoder and the decoder we adopted a ResNet-like architecture derived from [8], which we already used in previous works [4], [5]; this allows us to make a fair comparison of the split technique with previous approaches, without additional biases. The network architecture is schematically described in Figure 6. The encoder is a fully convolutional model where the input is progressively downsampled a configurable number of times, jointly doubling the number of channels. Before downsampling, the input is processed by a so-called Scale Block, which is just a sequence of Residual Blocks. A Residual Block is an alternated sequence of BatchNormalization and spatially-preserving Convolutional layers, intertwined with residual connections. The number of Scale Blocks at each scale of the image pyramid, the number of Residual Blocks inside each Scale Block, and the number of convolutions inside each Residual Block are user-configurable hyperparameters.
In the encoder, after the last Scale Block, a global average pooling layer extracts spatially-agnostic features. These are first passed through a so-called Dense Block (similar to a Residual Block, but with dense layers instead of convolutions), and finally used to synthesize the mean and variance of the latent variables. The decoder first maps the internal encoding $z$ to a small map of dimension $4 \times 4 \times \mathit{base\_dim}$ via a suitably reshaped dense layer. This is then up-sampled to the final expected dimension, inserting a configurable number of Scale Blocks at each scale.
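The following sketch (Keras-style Python) illustrates the structure of the Residual and Scale Blocks described above; the activation function and the default hyperparameter values are assumptions for the example, not the repository's actual configuration.

```python
from tensorflow.keras import layers

def residual_block(x, n_convs=2):
    # Alternated sequence of BatchNormalization and spatially
    # preserving convolutions, with a residual connection.
    shortcut = x
    channels = x.shape[-1]
    for _ in range(n_convs):
        x = layers.BatchNormalization()(x)
        x = layers.Activation("swish")(x)  # activation choice is an assumption
        x = layers.Conv2D(channels, 3, padding="same")(x)
    return layers.Add()([shortcut, x])

def scale_block(x, n_res_blocks=2):
    # A Scale Block is just a sequence of Residual Blocks.
    for _ in range(n_res_blocks):
        x = residual_block(x)
    return x

def downsample(x):
    # Between Scale Blocks, the encoder halves the spatial
    # resolution and doubles the number of channels.
    return layers.Conv2D(2 * x.shape[-1], 3, strides=2, padding="same")(x)
```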

IV. EXPERIMENTAL SETTING
We compare the performance of SVAE with state-of-the-art variational autoencoders, comprising Two-Stage models [4], [5], [8] and Regularized Autoencoders [10]. We do not consider models relying on normalizing flows, such as [16], [23], [32]: these models typically require thousands of latent variables, making them of relatively little interest from the point of view of representation learning.
For the comparison, we used traditional datasets such as MNIST, CIFAR-10 [21] and CelebA [22]; the metric adopted is the usual Fréchet Inception Distance (FID) [12], shortly discussed in Section IV-A. All comparative results reported in Section V are borrowed from the original publications.
In addition to the FID score for generated images (GEN field in the Tables), we also provide an ex-post estimation of the probability distribution of the latent space. This is done through a second VAE in [5], [8], and by fitting a Gaussian Mixture Model (GMM) in [10] (a normalizing autoregressive flow can be used for a similar purpose [17], [23]). Although possibly less expressive, the GMM technique is simple and effective, so we use it in our experiments (this aspect is somewhat orthogonal to the content of this article). The FID score after resampling in the latent space is reported in the GMM entry of the following Tables.
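A minimal sketch of the GMM-based ex-post estimation, using scikit-learn; `encoder` and `decoder` are placeholders for the trained models, and the covariance type is an assumption for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_resample(encoder, decoder, data, n_components=100, n_samples=10000):
    # Encode the training set and fit a GMM on the latent codes,
    # obtaining an ex-post estimate of the marginal inference
    # distribution q(z), usually closer to it than the prior G(0, I).
    mu, _ = encoder(data)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(np.asarray(mu))
    # Sample latent codes from the fitted mixture and decode them.
    z, _ = gmm.sample(n_samples)
    return decoder(z)
```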
For each architecture, we also provide the number of parameters as an indicative measure of its complexity and energetic footprint (according to recent investigations [3], the number of parameters seems to provide a more reliable measure of efficiency than the number of Floating Point Operations).

A. FRÉCHET INCEPTION DISTANCE
The Fréchet Inception Distance (FID) [12] does not try to assess the "quality" of a single generated sample, but merely compares the overall probability distribution of generated vs. real images. The dimension of the visible space is typically too large to allow a direct comparison; the main idea behind FID is to use, instead of the raw data, their internal representations as computed by some third-party, agnostic network. In the case of FID, the Inception v3 network [30] trained on Imagenet is used for this purpose. The activations traditionally used are those of the last pooling layer, resulting in a vector of 2048 features.
Let $a_1$ and $a_2$ be the activations relative to real and generated images, and $\mu_i$, $C_i$ ($i = 1, 2$) their empirical means and covariance matrices, respectively. Then the Fréchet Distance between $a_1$ and $a_2$ is just the squared Wasserstein distance between the corresponding Gaussians, namely:
$$FID(a_1, a_2) = \|\mu_1 - \mu_2\|^2 + Tr\left(C_1 + C_2 - 2(C_1 C_2)^{1/2}\right)$$
where $Tr$ is the trace of the matrix.
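As a sketch, the distance can be computed in a few lines of NumPy/SciPy; here `a1` and `a2` are assumed to be arrays of Inception activations of shape (n, 2048).

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(a1, a2):
    # Empirical means and covariance matrices of the activations.
    mu1, mu2 = a1.mean(axis=0), a2.mean(axis=0)
    C1 = np.cov(a1, rowvar=False)
    C2 = np.cov(a2, rowvar=False)
    # Matrix square root of the product of the covariances; a small
    # imaginary component may appear for numerical reasons.
    covmean = sqrtm(C1 @ C2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2)
                 + np.trace(C1 + C2 - 2.0 * covmean))
```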

V. NUMERICAL RESULTS
In this section we give numerical results for Mnist, CIFAR-10 and CelebA, discussing the training process and the relevant hyperparameters.

A. MNIST
In the case of Mnist we worked with a latent space of dimension 32, in contrast with the traditional dimension of 16. We tested several versions, with different balancing β-factors between reconstruction and KL-regularization: we report results for β = 8 and β = 3. Even when starting with the relatively high balancing factor β = 8, the number of inactive variables at the end of training is low: between 2 and 5. In all our experiments, the β-factor is progressively reduced during training in order to preserve the initial balance between the two components, as described in [5].
In the case of Mnist, both syntactic and semantic decomposition (see Figures 1 and 2) usually give good results on the compound image, slightly better in the latter case.
Training lasted 300 epochs, using the Adam optimizer with a learning rate of 1.0e-3. Numerical results in terms of FID scores are given in Table 1. The GMM value refers to the ex-post estimation of the latent space distribution via a Gaussian Mixture Model, in the spirit of [10]. For MNIST, we use a GMM with 20 components.
In [10], the authors observed that enlarging the number of components beyond 10 was not beneficial. However, in our case, presumably due to the larger number of latent variables, we found it convenient to use a larger mixture. In Table 1 we compare the cases of 20 and 100 components. To obtain an acceptable generative score before GMM-resampling we need to work with a high β factor, e.g. β = 8; however, resampling compensates for the need for regularization, and we obtain the best results with β = 3 and 100 components. See Figure 7 for examples of generated Mnist-like digits.

B. CIFAR-10
CIFAR-10 confirms its somewhat pathological nature. The complexity of the dataset can be readily appreciated by looking at the pictures in Figure 9, where we compare the mean images for CIFAR-10 and CelebA: the former is completely gray, while the latter is a relatively well-defined "average" face (also observe, by the way, the strong bias of the CelebA dataset towards feminine, frontal, young, smiling faces). Another interesting indicator of the complexity of CIFAR-10 is given by the FID score between each category and the full dataset (we derive 10,000 images per category by flipping images in the training set, and compare them with the test set). The score is extremely high, in spite of the fact that the "texture" is apparently quite similar. The FID score "magically" drops to 5.3 for a random mix.
Coming to SVAE, in the case of CIFAR-10 splitting produces results of the kind described in Figure 8. The FID scores for $\hat{x}_1$ and $\hat{x}_2$ are usually lower than that of $\hat{x}$, testifying that the network is attempting a "semantic" decomposition; however, our networks failed, apart from a few exceptions, to generate recognizable contours. In spite of this problem, the numerical results, reported in Table 3, are quite good. These results have been obtained exploiting a latent space of dimension 200 (in contrast with the traditional dimension of 128) and a balancing factor β = 3 between reconstruction and KL-regularization. Training lasted 110 epochs (fast!), using the Adam optimizer with an initial learning rate of 1.0e-3. For the ex-post re-estimation of the distribution of the latent space we used a GMM with 100 components. Examples of generated CIFAR-10-like images are given in Figure 10.

C. CELEBA
The splitting technique works particularly well for CelebA, automatically producing remarkable "semantic" maps similar to drawings (see Figures 4 and 13). The quality and precision in the design of details is impressive and largely unexpected.
Comparative values are reported in Table 4. In this case we provide FID scores for $\hat{x}$, $\hat{x}_1$ and $\hat{x}_2$. We worked with a latent space of dimension 150, in contrast with more traditional dimensions like 64 or 128. The initial β-factor was 3. In the case of SVAE, the best result is obtained by taking a random mix of images from $\hat{x}_1$ and $\hat{x}_2$.
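As a sketch of this random mixing (NumPy; the per-sample choice below is one natural interpretation of the procedure):

```python
import numpy as np

def random_mix(x1, x2, seed=None):
    # For each generated sample, output either x1 or x2 at random,
    # instead of their sigma-weighted combination.
    rng = np.random.default_rng(seed)
    pick = rng.integers(0, 2, size=len(x1))  # 0 -> x1, 1 -> x2
    return np.where(pick[:, None, None, None] == 0, x1, x2)
```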
Training lasted 160 epochs, using the Adam optimizer with an initial learning rate of 1.0e-3, but already after 40-50 epochs we get excellent FID scores for $\hat{x}_1$ and $\hat{x}_2$. A typical evolution of the FID scores during training is shown in Figure 11; a more thorough investigation is given in the next section. Additional examples of generated CelebA-like images and splitting masks are given in Figures 12 and 13, respectively.

VI. ABLATION
The splitting technique is simple and non-invasive. There are, however, a few additional modifications suggested and induced by splitting, most notably the increased number of latent variables, and one could naturally wonder whether these are the actual source of the observed improvements in FID scores.
To clarify the point, we compared the behaviour of precisely the same architectures, changing only the final layer of the decoder. We focused our attention on the most interesting cases of CIFAR-10 and CelebA, comparing the evolution of the FID score for $\hat{x}$, $\hat{x}_1$ and $\hat{x}_2$ over 5 different trainings for each dataset. All scores have been computed after resampling in the latent space according to a GMM with 100 components. Results are given in Figures 14 and 15. The improvement for $\hat{x}$ in the case of the split network is marginal, but the FID score for $\hat{x}_1$ and $\hat{x}_2$ is significantly smaller, frequently even smaller than the reconstruction FID, which is traditionally supposed to be a lower bound for this metric on generated samples. Our understanding of the phenomenon is that splitting allows the network to decompose each image towards higher-density regions in the given neighbourhood, hence creating more realistic variants of the usual "average" result.

FIGURE 12: Examples of generated images. In the case of CelebA, the quality of samples generated with a variational approach should be judged on the details with the highest variability: hair, background, accessories. Note also the wide differentiation in pose, illumination, colors, age and expressions.

FIGURE 13: Examples of generated boolean maps. The quality and precision of contours is both unexpected and remarkable.
On the other hand, it is also interesting to remark on the high sensitivity of the FID score to apparently minor modifications of the generated images (a phenomenon already pointed out in [2]).

VII. CONCLUSIONS
In this article, we introduced the notion of Split Variational Autoencoder (SVAE). In a SVAE the output $\hat{x}$ is computed as a weighted sum $\sigma \odot \hat{x}_1 + (1-\sigma) \odot \hat{x}_2$, where $\hat{x}_1, \hat{x}_2$ are two distinct generated images and $\sigma$ is a learned compositional map. A Split-VAE is trained as a normal VAE: no additional loss is imposed over the split images $\hat{x}_1$ and $\hat{x}_2$. Splitting is meant to offer the network a way to generate variants of the expected result, in an attempt to overcome the averaging problem inherent to the adoption of a loglikelihood loss function. At the same time, the network may specialize its generative capabilities towards more oriented and specific subsets of the data manifold, possibly learning additional and differentiated features. As a side result, even with a relatively high balancing factor for KL-regularization, the variable-collapse phenomenon is less constraining, and the possibility of exploiting a larger number of latent variables improves the quality and diversity of generated samples. This has been experimentally confirmed on traditional benchmarks such as Mnist, Cifar10 and CelebA. The SVAE architecture systematically improves over its vanilla counterpart, and outperforms state-of-the-art loglikelihood-based generative models such as Two-Stage architectures or Regularized Autoencoders. We intentionally avoided testing the architecture on high-resolution datasets such as CelebA-HQ [15], mostly for ethical and ecological reasons: they are too demanding in terms of computational resources. We think that there are many interesting problems to be investigated and solved even on relatively cheap datasets, so there is no actual need to move to high-resolution domains.
As for future developments of this work, a particularly interesting research direction seems to be the possibility of adding control over the splitting operation, possibly segmenting the input image into other interesting and meaningful components, and specializing subnets for their respective processing.
Code. The code for this work is available in the following GitHub repository: https://github.com/asperti/Split-VAE. Pretrained weights for the models discussed in the article are available at the following page: https://www.cs.unibo.it/~asperti/SVAE.html.

Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
LAURA BUGO Laura was born in Bologna, Italy, in 1995. She received the B.S. degree in computer science from the University of Bologna in 2019. She is currently pursuing the master's degree at the University of Bologna. Her research interests include Artificial Intelligence applied to the improvement of well-being for the most vulnerable people. She is also an athlete of judo kata; she has won a bronze medal at the European Kata Championships in 2021, 3 silver medals at the European Kata Championships U24 and a silver medal at the World Judo Kata Grand Slam U35.
DANIELE FILIPPINI Daniele Filippini was born in Ostiglia, Italy, in 1996. He received the B.S. degree in information science for management from the University of Bologna in 2019. He is currently pursuing the master's degree in computer science at the University of Bologna. His research interests include artificial intelligence and deep learning.