Lessons Learned from the Training of GANs on Artificial Datasets

Generative Adversarial Networks (GANs) have made great progress in synthesizing realistic images in recent years. However, they are often trained on image datasets with either too few samples or too many classes belonging to different data distributions. Consequently, GANs are prone to underfitting or overfitting, making the analysis of them difficult and constrained. Therefore, in order to conduct a thorough study on GANs while obviating unnecessary interferences introduced by the datasets, we train them on artificial datasets where there are infinitely many samples and the real data distributions are simple, high-dimensional and have structured manifolds. Moreover, the generators are designed such that optimal sets of parameters exist. Empirically, we find that under various distance measures, the generator fails to learn such parameters with the GAN training procedure. We also find that training mixtures of GANs leads to more performance gain compared to increasing the network depth or width when the model complexity is high enough. Our experimental results demonstrate that a mixture of generators can discover different modes or different classes automatically in an unsupervised setting, which we attribute to the distribution of the generation and discrimination tasks across multiple generators and discriminators. As an example of the generalizability of our conclusions to realistic datasets, we train a mixture of GANs on the CIFAR-10 dataset and our method significantly outperforms the state-of-the-art in terms of popular metrics, i.e., Inception Score (IS) and Fréchet Inception Distance (FID).


I. Introduction
The past few years have witnessed the rising popularity of generative models. Generative models have brought strong momentum to tasks in image processing (e.g., image super-resolution and editing) and machine learning (e.g., reinforcement learning and semi-supervised learning) [1]. Typically, a generative model learns a distribution P_g to approximate the true distribution P_r, given a set of observed samples.
The Generative Adversarial Network (GAN) [2] is, without doubt, the most prevalent generative model. It is composed of a generator G that maps random noise to synthesized data points, and a discriminator D that aims to tell whether its input comes from the real data distribution P_r or the generative distribution P_g. During training, D and G are updated simultaneously or alternately. In a vanilla GAN, D gives an estimate of the Jensen-Shannon divergence between P_r and P_g while G tries to minimize it [2].
Unfortunately, the objective of G can saturate when P_g and P_r do not have a non-negligible overlapping manifold, causing vanishing gradients for the generator [3]. Let Z and X be the domain and codomain of G respectively. G(Z) is contained in a countable union of manifolds of dimension at most dim Z. Then, according to [3], if the dimension of Z is less than that of X, G(Z) is a set of measure 0 in X, so P_r and P_g can be distinguished with accuracy 1 by D and no gradient is provided to G. Besides, GANs suffer from mode collapse, the phenomenon that the samples of the generator lack the diversity exhibited in P_r. [4] prove that the generator can fool the discriminator by generating a limited number of images from the training set. In other cases of mode collapse, the generated samples are even meaningless, as G needs only to fool D in the current iteration. When mode collapse happens, the model fails to generate diverse and realistic data.
To cope with these challenges, variants of GAN have been proposed (e.g., [5]-[10]). Because these methods are applied to high-dimensional realistic datasets with inadequate samples from each class, the behavior of GANs is still not completely understood. Another problem with realistic datasets is that the performance of GANs can degrade simply due to data scarcity or insufficient model complexity [4], [11].
Considering that we aim to study the behavior of GANs, conventional image datasets might not be good choices. Hence we train GANs on artificially constructed datasets (e.g., mixtures of Gaussians in high dimensional space), applying neural networks with sufficiently high capacity. In this way, we can avoid the influence of the aforementioned factors and focus on the inherent problems of GAN training.
The contributions of this work can be summarized as follows:
• We propose a set of metrics for evaluating GANs trained on the artificial datasets.
• We design controlled experiments in which we can adjust the network width/depth, the mixture of networks, and the training set size, and then relate them to the performance of GANs.
• Our empirical study suggests that GANs may fail to learn the real data distribution, even if at least one set of optimal parameters exists for the generator by design.
• In terms of model complexity, our experimental results demonstrate that when the networks are already reasonably large, training a mixture of GANs is more beneficial than increasing the complexity of standalone networks, as the generation task can be divided among multiple generators and the variance of the discrimination model is reduced when using an ensemble of discriminators. We further validate this conclusion on the CIFAR-10 dataset and achieve state-of-the-art Inception Score and Fréchet Inception Distance.

II. Related Work
There have been attempts to make a GAN converge to an equilibrium [4], [10], [12]. However, even if a GAN reaches an equilibrium, it might fail to learn the desired real data distribution. To support this conjecture, [13] adopt the Birthday Paradox to measure the diversity of the generative distribution. They present empirical evidence that P_g has smaller support than P_r. However, it should be noted that this problem might also be due to the dimension of the manifold of the latent distribution being lower than the dimension of the manifold of P_r [3]. In order to rule out this possibility, we set the dimension of z to be no lower than the dimension of x in our experiments on the artificial datasets. We share a similar goal with [13], but we conduct experiments on artificially constructed datasets with infinite data samples.
Consistent with [13], our experiments reveal that even when a GAN converges to a diverse distribution, it still differs from the true distribution. Considering that the birthday paradox test in [13] is rather restrictive on continuous data, we propose to use some other measures for validating whether GANs can learn the real data distribution.
Recently, large-scale GAN training (e.g., [11], [14]-[16]) has proven effective on the ImageNet [17] dataset. The superiority of these models over previous ones is mainly due to high model complexity and large batch sizes. While current state-of-the-art GAN models on ImageNet are still limited by model complexity and batch size, our work focuses on synthetic datasets that allow the batch size and model complexity to be sufficiently high, which enables us to explore the properties of GANs in ideal cases.
Some previous work has studied the feasibility of using multiple discriminators [18], multiple generators [19], [20], or both [4] to improve the performance of GANs. Our experiments on artificial datasets are based on MIX+GAN [4] and we find it beneficial to use multiple generators and discriminators. Further, our experimental results unveil the relations between the number of generators and discriminators and the performance of GANs. As the computation of MIX+GAN is expensive or even infeasible, we modify it to allow larger mixtures and achieve state-of-the-art results on CIFAR-10. In our work, we also explore how factors such as network depth, network width and training set size affect the performance of GANs.

III. Models
WGAN-GP [8] has been gaining popularity (e.g., [21]-[23]) for its stability, while MIX+GAN [4] guarantees the existence of an approximate equilibrium using a mixture of generators and discriminators. MIX+GAN is also effective in modeling multi-modal data, which is common in realistic datasets. Therefore, we combine WGAN-GP and MIX+GAN for our experiments on the artificial datasets.

A. WGAN-GP
In a vanilla GAN, the generator tries to minimize the approximate Jensen-Shannon divergence defined by the discriminator. Unlike in a vanilla GAN, the discriminator in WGAN calculates an approximate Wasserstein distance between the real and fake data distributions. The discriminator in WGAN is also referred to as the "critic". We use both terms interchangeably in this paper.
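The minimax game for WGAN is formulated as
    \min_G \max_{D \in \mathcal{D}_1} \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})],    (1)
where \mathcal{D}_1 is the set of all 1-Lipschitz functions and P_g is the model distribution implicitly defined by z ∼ p(z), \tilde{x} = G(z). Note that Eq. 1 can be reformulated as
    \min_G \max_{D \in \mathcal{D}_k} \frac{1}{k} \left( \mathbb{E}_{x \sim P_r}[D(x)] - \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] \right),    (2)
where \mathcal{D}_k is the set of all k-Lipschitz functions. WGAN [7] adopts a weight clipping approach to enforce the Lipschitz constraint. However, it can lead to optimization problems and pathological behaviors. To overcome these problems, an improved version of WGAN was proposed in [8], introducing a new objective for the critic:
    L = \mathbb{E}_{\tilde{x} \sim P_g}[D(\tilde{x})] - \mathbb{E}_{x \sim P_r}[D(x)] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right],    (3)
where \hat{x} comes from the distribution P_{\hat{x}} whose samples are interpolated between samples from P_g and P_r. This choice is based upon the fact that the L2-norm of the gradient of the optimal D is 1 between the manifolds of P_g and P_r [8].
The last term can be interpreted as a regularizer that forces the gradient between the real and fake datasets to be at a moderate scale, so that P_g is moved smoothly to the real data distribution.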
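To make the structure of the WGAN-GP critic objective concrete, the following minimal Python sketch evaluates it for a linear critic D(x) = <w, x>. The linear critic is an assumption made purely for illustration, because its gradient with respect to x is w and no automatic differentiation is needed; the function names are hypothetical and this is not the implementation used in the experiments.

```python
import math
import random

def critic(w, x):
    """Linear critic D(x) = <w, x>; its gradient with respect to x is w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def wgan_gp_critic_loss(w, real_batch, fake_batch, lam=10.0, rng=random):
    """E[D(fake)] - E[D(real)] + lam * E[(||grad D(x_hat)||_2 - 1)^2].

    Assumes real_batch and fake_batch have the same length.
    """
    n = len(real_batch)
    mean_fake = sum(critic(w, x) for x in fake_batch) / n
    mean_real = sum(critic(w, x) for x in real_batch) / n
    penalty = 0.0
    for x_r, x_g in zip(real_batch, fake_batch):
        eps = rng.random()  # interpolation coefficient in [0, 1)
        # x_hat is where the gradient would be evaluated for a general critic.
        x_hat = [eps * a + (1.0 - eps) * b for a, b in zip(x_r, x_g)]
        # For the linear critic the gradient is w at every point, x_hat included.
        grad = w
        grad_norm = math.sqrt(sum(g * g for g in grad))
        penalty += (grad_norm - 1.0) ** 2
    return mean_fake - mean_real + lam * penalty / n
```

With a unit-norm w the penalty vanishes and only the difference of critic outputs remains; a critic with a larger gradient norm is penalized quadratically.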

B. MIX+GAN
A group of datasets that the GANs in this paper are tasked with is mixtures of Gaussians. A generator can learn any n-dimensional Gaussian distribution from n-dimensional isotropic Gaussian input noise by simply learning an affine transformation. However, this problem becomes less straightforward if P_r is a mixture of Gaussians. Another problem is that different modes in the dataset can be discontinuous (which is common in realistic datasets) and thus cannot be learned by a continuous generator network, posing another challenge to GAN training. Therefore, we use MIX+GAN [4] to model mixtures of Gaussians. In MIX+GAN, there are n_G generators and n_D discriminators. Each G_i and each D_j has a weight, w_i and v_j respectively, that indicates its relative importance. The weights are produced by the softmax function on learnable log-probabilities; therefore, the w_i are positive and sum to 1, and so do the v_j. In a MIX+GAN, both players play mixed strategies: the generators' weighted probability density at point x is
    p_g(x) = \sum_{i=1}^{n_G} w_i \, p_{G_i}(x)
and the discriminators' weighted output at point x is
    D(x) = \sum_{j=1}^{n_D} v_j \, D_j(x).
To encourage the weights to stay close to the discrete uniform distribution, entropy regularization terms are added to the loss of the generators and the loss of the discriminators respectively. The resulting overall losses for the discriminators and for the generators are expressed through L_{D,real}, L_{D,fake} and L_G, which are functions of the outputs of the discriminators. For example, in a vanilla GAN [2], L_{D,real} = -log D(x) and L_{D,fake} = -log(1 - D(G(z))), while L_G = log(1 - D(G(z))) in the minimax formulation.
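The weighting scheme of MIX+GAN can be sketched in a few lines of Python. The softmax weights and the weighted discriminator output follow the description above; the entropy regularizer is written in the form -(1/n) Σ log w_i, which is an assumption modeled on MIX+GAN [4] rather than a form confirmed by this text, and all function names are hypothetical.

```python
import math

def softmax(log_probs):
    """Turn learnable log-probabilities into positive weights that sum to 1."""
    m = max(log_probs)  # subtract the maximum for numerical stability
    exps = [math.exp(lp - m) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_discriminator_output(v, outputs):
    """Weighted output sum_j v_j * D_j(x) for a single input x."""
    return sum(vj * dj for vj, dj in zip(v, outputs))

def entropy_regularizer(weights):
    """Assumed form: -(1/n) * sum_i log w_i, which is smallest for uniform weights."""
    return -sum(math.log(w) for w in weights) / len(weights)
```

Under this assumed regularizer, uniform weights over n networks give the minimum value log n, so adding it to a loss pushes the mixture toward using all networks equally.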

IV. Experiments on the artificial datasets

A. Datasets
Popular GANs usually focus on learning high-dimensional and complicated datasets (e.g., realistic images and natural languages). As a result, the GAN models are sensitive to almost every hyperparameter. Furthermore, these datasets contain either too many categories or too few samples in each category; thus GAN models can easily underfit or overfit [24], i.e., generate samples of low visual quality or encounter mode collapse. In this paper, we conduct experiments on the following simple artificial datasets with infinite samples:
1) Mixture of Gaussians. In our experiments, a dataset of a mixture of Gaussians consists of samples from independent high-dimensional Gaussian components with equal prior probabilities. The Gaussian components have their centers lying on the axes of a Cartesian coordinate system in 1024-dimensional space. Specifically, the i-th center is e_i = (0, ..., 0, 1, 0, ..., 0), whose i-th entry is 1 and whose other entries are 0; the covariance matrices are all 0.09I.
2) Output of a randomly initialized network. This dataset consists of the outputs of a network R that has the same input noise as a generator. In the case of a single generator G, if R and G have the same architecture and G has learned the parameters of R, then no classifier can distinguish P_r and P_g.
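As an illustration of the first dataset, the following Python sketch draws one sample from the mixture of Gaussians: a component is chosen uniformly, and isotropic noise with standard deviation 0.3 (covariance 0.09I) is added to the corresponding unit vector. The function name and its parameters are hypothetical, and pure-Python lists are used only to keep the sketch self-contained.

```python
import random

def sample_mixture(n_components, dim=1024, std=0.3, rng=random):
    """Return (component index, sample) from a mixture of Gaussians centered at e_i.

    Components have equal prior probability; std=0.3 corresponds to covariance 0.09 I.
    The index is 0-based here, so component i is centered at the (i+1)-th axis.
    """
    i = rng.randrange(n_components)
    x = [rng.gauss(0.0, std) for _ in range(dim)]
    x[i] += 1.0  # shift the noise to the center e_i
    return i, x
```

Because samples are generated on demand, the training set is effectively infinite; a finite training set can be emulated by drawing a fixed number of samples once and reusing them.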

B. Design of Model Architectures
In order to compare across different experimental settings, we design our model architectures following the rules below to reduce unnecessary interference and maintain simplicity:
• All neural networks consist of affine layers and LeakyReLU non-linearities only.
• Each hidden affine layer is followed by a LeakyReLU activation layer.
• The input dimension and output dimension of each generator is 1024. In addition to having LeakyReLU activations, each generator has no fewer than 1024 neurons in each of its hidden layers, so that it can learn an injective mapping.
• If a network has hidden layers, then all of its hidden layers contain the same number of neurons.
• Throughout this section, the number of layers refers to the number of hidden affine layers plus one input layer and one output layer, excluding LeakyReLU layers, e.g., a 2-layer network is an affine transformation from the input space to the output space; a 5-layer network has 3 hidden layers.
• Unless stated otherwise, each network has 5 layers and has 1024 neurons in each hidden layer.
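The layer-counting convention above can be made explicit with a small Python helper that lists the shape of every affine map in a network. The helper and its parameter names are hypothetical; it only encodes the rule that the number of layers equals the number of hidden layers plus one input and one output layer.

```python
def affine_layer_shapes(num_layers=5, width=1024, in_dim=1024, out_dim=1024):
    """Return the (fan_in, fan_out) pair of every affine map in the network.

    num_layers counts the input layer, the hidden layers and the output layer,
    so a network with num_layers layers has num_layers - 2 hidden layers.
    """
    if num_layers < 2:
        raise ValueError("a network needs at least an input and an output layer")
    hidden = num_layers - 2
    dims = [in_dim] + [width] * hidden + [out_dim]
    return list(zip(dims[:-1], dims[1:]))
```

For example, `affine_layer_shapes(2)` yields a single 1024-to-1024 affine map, while the default 5-layer setting yields four affine maps with three hidden layers of 1024 neurons between them.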

C. Evaluation metrics
Evaluating different generative models accurately and objectively remains challenging. Currently, there are some reasonable and widely accepted metrics, namely the Turing test, the Inception Score [25], [26], the Fréchet Inception Distance [12] and the approximate Wasserstein distance [24]. The evaluation metrics we use are detailed as follows:
1) Visualization and Turing test: This is perhaps the simplest way to evaluate a generative model. It is done by using human inspectors to check the quality of (the projection of) the generated data. If a human inspector cannot distinguish whether the generated data are real or fake, then one can conclude that the generative model is very successful. If the inspectors say that samples generated by one model are significantly better than those generated by another model, then it can also be concluded that one model is better than the other. On the other hand, if a human inspector cannot tell a significant difference, then one may want to resort to more objective and more accurate metrics. To inspect the generated synthetic high-dimensional samples manually, we project the generated samples onto the plane determined by a = (1, 0, 0, 0, ..., 0), b = (0, 1, 0, 0, ..., 0) and c = (0, 0, 1, 0, ..., 0). In the projection plane, the origin is a, the x-axis points in the direction of the vector from a to b, and the y-axis is perpendicular to the x-axis, as shown in Figure 1. The projections of sample data can be seen in Figure 2.
2) Fréchet Distance: [12] proposed to use the Fréchet Inception Distance (FID) as a metric for evaluating generative models. It is based on the Fréchet Distance (FD, also known as the Wasserstein-2 distance) between two Gaussian distributions [27]. During the computation of FID, images from the real and fake distributions are fed into the Inception model [28] to get their activations in the last pooling layer. The distributions of the activations are approximately treated as Gaussian so that their means and covariances can be used to compute the FID. In this paper, since we are dealing with artificial data and the Inception model was intended for realistic data, we use the means and covariances of the artificial data directly without passing them through the Inception Network. 50,000 data points are sampled from each of P_r and P_g for computing the Fréchet Distance.
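For two Gaussians, the squared Fréchet distance has the closed form ||μ_1 - μ_2||^2 + Tr(Σ_1 + Σ_2 - 2(Σ_1 Σ_2)^{1/2}). The Python sketch below evaluates it under the simplifying assumption that both covariances are diagonal, in which case the trace term reduces to a sum of squared differences of per-coordinate standard deviations. This diagonal restriction is an assumption made to keep the sketch dependency-free; the metric described above uses full covariance matrices, and the function name is hypothetical.

```python
import math
import statistics

def frechet_distance_sq_diag(xs, ys):
    """Squared Frechet distance between Gaussians fitted to xs and ys.

    xs and ys are lists of equal-dimension samples. Both covariances are
    assumed diagonal, so only per-coordinate means and variances are used.
    """
    total = 0.0
    for d in range(len(xs[0])):
        a = [x[d] for x in xs]
        b = [y[d] for y in ys]
        mu_a, mu_b = statistics.fmean(a), statistics.fmean(b)
        var_a = statistics.pvariance(a, mu_a)
        var_b = statistics.pvariance(b, mu_b)
        total += (mu_a - mu_b) ** 2 + (math.sqrt(var_a) - math.sqrt(var_b)) ** 2
    return total
```

The distance is zero when the fitted means and variances coincide, which is necessary but not sufficient for the two distributions to be equal; this is one reason why additional metrics are used alongside it.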
3) Critic output: As noted in [7], the loss of the critic provides a meaningful estimate of the Wasserstein distance between P_r and P_g. We can log the average value of D(x_r) − D(x_g) during each iteration with almost no additional computation cost. If it is positive, then it tells us that P_r is different from P_g. Moreover, it is an indicator of the training dynamics of WGAN.
4) Wasserstein distance: The above estimate of the Wasserstein distance can be inaccurate due to adversarial training. Alternatively, one can train an independent critic to approximate the Wasserstein distance after training a GAN [24]. Note that the gradient penalty term might be larger than 0 and the critic may be a k-Lipschitz function; thus we normalize the estimated Wasserstein distance using Eq. 2.
For fair comparison, we train an independent critic with the same architecture across different experiments. Specifically, it has 5 layers and 1024 neurons in each hidden layer. In our experiments, we estimate the approximate Wasserstein distance W(P_r, P_g) with 25,600 sample points from each of P_r and P_g.
5) "Judge" accuracy: In all our experiments, an independent classifier called "Judge" is trained to distinguish samples from P_g and P_r. The accuracy of the Judge is an objective metric for evaluating all of our GANs. After the Judge is fully trained, its classification accuracy is expected to range between 0.5 and 1. If the generator(s) has learned the distribution, then the Judge should have an accuracy of around 0.5. Conversely, if the generator(s) produces a distribution different from P_r, the Judge is expected to have an accuracy higher than 0.5. According to Theorem 2.2 of [3], if two distributions P_g and P_r have supports contained in two closed manifolds M and P that do not perfectly align and do not have full dimension, and P_g and P_r are continuous in their respective manifolds, then there exists a perfect classifier that has accuracy 1.
One can show that the expected Judge accuracy is related to the total variation distance: Proposition 1. Let J be a deterministic classifier for samples from two distributions P_r and P_g with equal prior probabilities. Let δ(P_r, P_g) be the total variation distance between P_r and P_g; then δ(P_r, P_g) ≥ 2E[J_acc] − 1. The proof of Proposition 1 is provided in Appendix A. Proposition 1 is intuitive: if the total variation distance between two distributions is very low, then it is hard for any classifier to tell them apart and the accuracy of a classifier can hardly get above 0.5; if a classifier has an accuracy of 1, then the total variation distance between them is high. One can in turn show that the total variation distance is related to the Kullback-Leibler Divergence [29].
For fair comparison, we train an independent Judge with the same architecture across different experiments. Specifically, it has 5 layers and 1024 neurons in each layer. In our experiments, we estimate J_acc with 25,600 sample points from each of P_r and P_g.
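The relation between Judge accuracy and total variation distance can be checked numerically on a toy example. The Python sketch below uses discrete distributions given as probability vectors, which is an assumption made for illustration (the distributions in the experiments are continuous); the function names are hypothetical. For the Bayes-optimal classifier with equal priors, the bound of Proposition 1 holds with equality.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def optimal_judge_accuracy(p_r, p_g):
    """Expected accuracy of the classifier that predicts 'real' where p_r >= p_g.

    With equal priors, each outcome contributes half of the larger probability.
    """
    return 0.5 * sum(max(a, b) for a, b in zip(p_r, p_g))
```

For p_r = (0.5, 0.5, 0) and p_g = (0, 0.5, 0.5), the optimal accuracy is 0.75 and the total variation distance is 0.5, so 2 * 0.75 - 1 matches the distance exactly; any weaker classifier gives a smaller left-hand side.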

D. Training
We follow some experimental setups of [8] for toy data: the batch size is 256; there are 100,000 GAN iterations, each of which includes 1 generator update and 5 discriminator updates; after the training of the GAN, we train the Judge and the independent critic for another 100,000 iterations each. Adam optimizers [30] are used for optimizing all models.
Our experiments differ from [8] in the following ways: in order to imitate the training of GANs on high-dimensional realistic data, the data points lie in 1024-dimensional space; motivated by [3], in order to allow the manifold of P_g to have the same dimension as that of P_r, the input noise z follows a 1024-dimensional Gaussian distribution and the activation layers are chosen to be LeakyReLU layers, which are injective; in all the experiments, λ in WGAN-GP is set to 10 to improve stability; we adopt a "two time-scale update rule" (TTUR) [12]: after some hyperparameter searching, the learning rate of the discriminator(s) is set to 1e-4 and the learning rate of the generator(s) is set to 1e-5.
During each iteration of GAN training, the generator(s) and the discriminator(s) are updated in the following order:
1) Draw a batch of real data from the real data distribution.
2) Each generator generates n_D batch(es) of fake data, which are distributed to the n_D discriminator(s).
3) Compute the loss for the generator(s) as described in Eq. 8 and take an optimization step on the generator(s).
4) Compute the loss for the discriminator(s) as described in Eq. 6-7 and take an optimization step on the discriminator(s).
5) Repeat steps 1), 2) and 4) another 4 times.
To stabilize GAN training, some tricks were proposed in [31]. However, most of them are not necessary in the WGAN-GP setting. Besides, we do not incorporate into our models other potentially beneficial techniques such as normalization techniques [9], [32]-[34], as they may limit the capacity of the models.
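The update order listed above can be written out as an explicit schedule. The Python sketch below only enumerates the sequence of operations in one iteration; the step labels are hypothetical names, and the actual loss computation and optimizer calls are omitted.

```python
def iteration_schedule(n_discriminator_updates=5):
    """Return the ordered list of operations in one GAN training iteration.

    The first pass performs steps 1) to 4); the remaining passes repeat
    steps 1), 2) and 4), so the generators are updated only once.
    """
    steps = ["draw_real", "generate_fake", "update_generators", "update_discriminators"]
    for _ in range(n_discriminator_updates - 1):
        steps += ["draw_real", "generate_fake", "update_discriminators"]
    return steps

def fake_batches_per_pass(n_generators, n_discriminators):
    """Each generator produces one fake batch per discriminator in step 2)."""
    return n_generators * n_discriminators
```

With the default setting, the schedule contains one generator update and five discriminator updates, and a mixture of 3 generators and 3 discriminators produces 9 fake batches in every pass, which reflects the O(n_G n_D) cost of MIX+GAN.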

E. Results
In this part, we will report the experimental results on the artificial datasets.
1) Generation of Mixtures of Gaussians: In Figure 2, we present the qualitative results on the 3-Gaussians dataset. The projections of real data points and generated data points are indicated by red and blue dots, respectively. The contour plot shows that the output of the critic of WGAN-GP is quite smooth. In Figure 3, we compare MIX+GANs with different combinations of mixtures quantitatively using the aforementioned metrics.
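The projection used for such plots can be sketched as follows. The plane passes through a, b and c, the origin is a, and the x-axis points from a to b, as described in the evaluation metrics; choosing the y-axis on the side of c via a Gram-Schmidt step is an assumption of this sketch, and all function names are hypothetical.

```python
import math

def unit_vector(index, dim):
    """Standard basis vector with a 1 at the given 0-based index."""
    v = [0.0] * dim
    v[index] = 1.0
    return v

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def plane_basis(a, b, c):
    """Orthonormal axes (u, v) of the plane through a, b, c with origin a."""
    ab = [q - p for p, q in zip(a, b)]
    ac = [q - p for p, q in zip(a, c)]
    norm_ab = math.sqrt(dot(ab, ab))
    u = [t / norm_ab for t in ab]                  # x-axis: direction from a to b
    along = dot(ac, u)
    w = [t - along * s for t, s in zip(ac, u)]     # remove the x-component of ac
    norm_w = math.sqrt(dot(w, w))
    v = [t / norm_w for t in w]                    # y-axis: perpendicular, toward c
    return u, v

def project(x, a, u, v):
    """2-D coordinates of x in the plane spanned by u and v with origin a."""
    d = [q - p for p, q in zip(a, x)]
    return dot(d, u), dot(d, v)
```

Under this construction, the three Gaussian centers a, b and c map to the vertices of an equilateral triangle with side length sqrt(2).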
In each experiment, MIX+GAN successfully learns a 3-modal mixture, but it differs from the real distribution. Nevertheless, the generative distribution P_g is closer to P_r with larger mixtures.
There are at least two ways for the generator(s) to win the game. First, Corollary 3.2 in [4] states that low-capacity discriminators are unable to detect a lack of diversity; thus the generator(s) can memorize a large quantity of training data to win the game. Second, since the generator(s) can learn to be injective with all hidden dimensions being 1024, the same as the input and output dimensions, a mixture of 3 generators can learn the 3 individual Gaussian components perfectly. But in GAN training, the generator(s) does not win: Figure 3b shows that the discriminator(s) can distinguish the real and generative data distributions.
An intriguing phenomenon is observed when the number of generators equals the number of Gaussian components. In Figures 4, 5 and 6, we show different fractions of samples generated by different generators in a MIX+GAN. The results in Figure 4 and Figure 5 show that when the number of generators equals the number of Gaussian components, MIX+GAN can roughly make each generator capture one Gaussian component. When the number of generators exceeds the number of Gaussian components, as shown in Figure 7, each generator generates a small portion of the data. The above empirical results indicate that increasing the mixture size can improve the generative distribution, partly by dividing the generation and discrimination tasks across multiple generators and discriminators.
In Figures 8 and 9, we show the quantitative results of varying the depth or width of the networks. In these experiments, we use 1 generator and 1 discriminator for generating 3 Gaussians. We do not see any significant improvement when increasing the complexity of the networks compared to increasing the number of generators and discriminators. Increasing the depth may not help because the dataset is too simple and more layers do not give better modeling power. Also, training deep MLPs can be unstable. Increasing the width may not help because hidden dimensions of 1024 can preserve
For all the metrics, lower is better.

2) Generation of datasets defined by neural networks:
In this part, we define the real data distribution as the distribution of the output of a neural network R that has the same input and architecture as the generator(s). The parameters of R are randomly initialized with the Glorot uniform initializer [35] and fixed thereafter. There are also at least two ways in which the generator(s) can win the game: either memorize a large sample of the training data according to [4], or learn to have the same parameters as R (of course, there are other sets of parameters that enable the generator(s) to generate P_r due to the symmetry and complexity of neural networks). We consider the simplest situation where R has only two layers, that is, it defines an affine transformation from R^1024 to R^1024. Therefore, this dataset is in fact a 1024-dimensional Gaussian distribution with randomly initialized mean and covariance. We plot the quantitative results in Figure 10. The results show that GAN training can have difficulty in learning an affine transformation.

3) Varying the training set size:
Now that we have access to infinite training data, we are able to study the influence of the training set size on the quantitative metrics; the results are shown in Figure 11. In this set of experiments, we have a MIX+GAN consisting of 3 generators and 3 discriminators, each of which has 5 layers and 1024 neurons in every hidden layer. There are infinite samples in the test set. The only factor of variation is the training set size. The results show that GANs perform worse with smaller training sets. In contrast, the GAN trained on the largest training set performs among the best in terms of all the metrics. We can see that the distances to the training set are larger with smaller training sets. This is counterintuitive, since one might expect that it is easier for the generator(s) to overfit smaller training sets.
A possible explanation is that with less training data, the discriminator(s) can memorize the training set and reject fake samples more easily, providing less informative feedback to the generator. This explanation is consistent with the one in Section 4.2 of [11]. Note that the dimension of our data is 1024 = 32 × 32, which is the same as the spatial dimension of the CIFAR-10 dataset [36]. However, the CIFAR-10 dataset is more complex and consists of only 50,000 training images. Therefore, one can expect a performance boost when there are more training data for the training of GANs on CIFAR-10 or other small-scale image datasets.

V. Extended experiments on CIFAR-10
In this section, we show that the lessons we learned from artificial datasets apply to realistic datasets. Inspired by our empirical finding on the artificial datasets that increasing the mixture size can improve the performance of GANs, we modify MIX+GAN and train mixtures of GANs on the CIFAR-10 [36] dataset.
Fig. 12: Illustration of the generation and discrimination of fake samples when there are 5 generators and 5 discriminators distributed across 5 devices. The input noise to each generator is omitted.
Since the time and space complexity of MIX+GAN is O(n_G n_D), it is computationally infeasible to train very large mixtures of GANs. Thus, we propose to use a modified version of MIX+GAN. We assume that the weights w_i are uniform (which is true for the data distribution of CIFAR-10 and many other datasets). The batch generated by each G_i is split uniformly into n_D parts, which are fed to different discriminators. Therefore, the actual batch size for each generator and each discriminator remains unchanged, but each discriminator can receive samples from different generators. Inspired by the finding that different generators can capture different modes in a distribution, we do not make each generator generate samples for all 10 classes, but only for max{10/n_G, 1} classes, which eases the difficulty of generation for each generator. In this way, the generators can be viewed as a mixture of experts [37] and the discriminators can be viewed as an ensemble of discriminative models. We use a model-parallelism setting where generators and discriminators are distributed across different devices. If we have n GPU/TPU devices, then G_i and D_j are allocated to device ((i − 1) mod n) + 1 and device ((j − 1) mod n) + 1 respectively. In this way, there is no need to synchronize parameters across different devices, and load balance can be achieved if both n_G and n_D are divisible by n. Figure 12 illustrates the flow of the generation and discrimination of fake samples when there are 5 generators and 5 discriminators distributed across 5 devices.
For CIFAR-10, we use MHingeGAN [38] as the base model. MHingeGAN is based on BigGAN [11] but uses multi-class hinge losses. In a MHingeGAN, D is a (K + 1)-class classifier where class 0 represents fake data and classes 1 to K represent the K classes in the dataset. The intuition behind the multi-class hinge loss is to make the affinity of D for the target class exceed its affinity for every other class by a margin of at least 1. In the loss for D, D_y(x) is the y-th element of the output vector D(x) and represents D's affinity for class y given input x, and D_{¬y}(x) is D's highest affinity for any class that is not y, i.e., D_{¬y}(x) = max_{k ≠ y} D_k(x), k = 0, 1, 2, ..., K.
The loss for G is a combination of the multi-class hinge loss (13) and a feature matching loss (14), where D_feat(x) is the feature of x after the last pooling layer of D. The loss we use for G differs from that of [38] and involves a coefficient λ = 0.05. Our network architectures are the same as in [38]. We use shared embedding, hierarchical input noise, and a moving average of the weights for G as in BigGAN [11]. The dimension of the input noise z is 80. The batch size for each generator and each discriminator is 50. We use the Adam optimizer [30] with β_1 = 0, β_2 = 0.9 and a learning rate of 0.0002 for all Gs and Ds. The proposed models are trained for 100,000 iterations. There are 4 discriminator updates and 1 generator update per iteration. The training of a mixture of 10 generators and 10 discriminators takes 1.5 days with 5 Nvidia GTX 1080Ti GPUs and 1 day with a TPU-V3. We show the supervised and unsupervised Inception Score and FID in Table I and Table II respectively. We refer to our method as "MIX-GAN", to distinguish it from MIX+GAN. Note that the "GAN" can be substituted by the name of a specific GAN model. We evaluate the Inception Score and the FID with 50,000 samples from each distribution. Since the test set of CIFAR-10 has only 10,000 samples, it is repeated 5 times (which does not change the moments used for calculating FID). Using 10 generators and 10 discriminators, we improve the state-of-the-art IS and FID on CIFAR-10 significantly.
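The allocation rules described above are simple enough to state in code. In the Python sketch below, the device rule follows the text directly; the use of integer division for max{10/n_G, 1} and the assumption that the batch size is divisible by n_D are choices of this sketch, since the text does not specify how non-divisible cases are handled. All function names are hypothetical.

```python
def device_for(index, n_devices):
    """1-based device for the network with 1-based index: ((index - 1) mod n) + 1."""
    return (index - 1) % n_devices + 1

def classes_per_generator(n_generators, n_classes=10):
    """max{n_classes / n_G, 1}, using integer division (an assumption of this sketch)."""
    return max(n_classes // n_generators, 1)

def split_batch(batch, n_discriminators):
    """Split one generator's batch uniformly into n_D parts, one per discriminator.

    Assumes len(batch) is divisible by n_discriminators.
    """
    size = len(batch) // n_discriminators
    return [batch[k * size:(k + 1) * size] for k in range(n_discriminators)]
```

With 10 generators, 10 discriminators and a batch size of 50, each generator handles one class and sends 5 samples to each discriminator, so every discriminator still receives 50 fake samples in total.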
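The remark that repeating the test set does not change the moments can be verified on a toy example. The Python sketch below uses scalar values in place of Inception activations, which is an assumption made for brevity; it shows that the mean and the population variance are unchanged by repetition, whereas the unbiased sample variance (normalized by N − 1) changes slightly.

```python
import statistics

def moments(values):
    """Mean and population variance (normalized by N) of a list of numbers."""
    mu = statistics.fmean(values)
    return mu, statistics.pvariance(values, mu)

values = [1.0, 2.0, 4.0, 7.0]
repeated = values * 5  # the same samples repeated 5 times

mean_once, var_once = moments(values)
mean_rep, var_rep = moments(repeated)
```

Whether an FID implementation is exactly invariant to such repetition therefore depends on which covariance normalization it uses; with 10,000 samples the difference between the two normalizations is very small.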
A random sample of a supervised MIX-MHingeGAN with 10 generators and 10 discriminators is shown in Figure 13.
Since the multi-class hinge loss is only applicable for conditional image generation, we use BigGAN [11] as the

Method                          IS          FID(train)  FID(test)
ACGAN [39]                      8.25±0.07   -           -
SGAN [40]                       8.59±0.12   -           -
Splitting GAN [41]              8.87±0.09   -           -
WGAN-GP [8]                     8.42±0.10   -           -
cGANs with Projection D [42]    8.62        17.5        -
CT-GAN [43]                     8.81±0.13   -           -
BigGAN [11]                     9

Figure 14. To some extent, the 10 generators can learn different concepts automatically without label supervision, although they do not correspond to the 10 classes perfectly.

VI. Conclusions and Future Work
In this work, we explore different distance measures to investigate whether GAN training succeeds in learning the distribution. Our empirical results show that even when the distances between P_g and P_r are short, there exists a simple classifier with a model complexity similar to the discriminator that can distinguish P_g and P_r accurately. This suggests that the manifolds of P_g and P_r have little non-negligible overlap [3]. Empirically, we also find that even when an optimal set of generator parameters exists, GAN training fails to find it. Therefore, it remains an open question whether GANs should be replaced by non-adversarial generative models (e.g., [45]-[49]).
In our experiments on the synthetic datasets, increasing the size of the training set can improve the performance of GANs, even when it is already very large. On the other hand, a small training set can negatively affect GANs. Therefore, current datasets might not be large enough to make GANs learn the real data distribution or even result in overfitting.
Our experimental results show that training a mixture of GANs is more beneficial than simply increasing the complexity of standalone networks (that are sufficiently complex) for modeling multi-modal data. It is an interesting topic to devise different ways to combine models in the mixtures. It is also promising to measure and promote the diversity of the ensemble [50], [51] of discriminators.
Finally, while current state-of-the-art GAN models such as BigGAN [11], CR-BigGAN [16] and LOGAN [15] use a number of TPU cores that is the same as the height or width of the images, we are not able to conduct such large-scale experiments. But we believe that with more computing power, a large mixture of GANs can be trained on datasets such as ImageNet 128 × 128 and improve on the current state of the art.

Appendix A. Proof of Proposition 1
Proposition 1. Let J be a deterministic classifier for samples from two distributions P_r and P_g with equal prior probabilities. Let δ(P_r, P_g) be the total variation distance between P_r and P_g; then δ(P_r, P_g) ≥ 2E[J_acc] − 1.
Proof. Without loss of generality, assume that the label y of a sample point x equals 1 if x is from P_r and 0 otherwise. Let J_opt be an optimal classifier with the highest expected accuracy. For all x such that p_r(x) + p_g(x) > 0, we have P(y = 1|x) = p_r(x) / (p_r(x) + p_g(x)) and P(y = 0|x) = p_g(x) / (p_r(x) + p_g(x)). Then there exists a J_opt that predicts J_opt(x) = 1 if p_r(x) ≥ p_g(x) and J_opt(x) = 0 if p_r(x) < p_g(x). Therefore,
    \mathbb{E}[J_{acc}] \le \mathbb{E}[(J_{opt})_{acc}] = \frac{1}{2} \int \max(p_r(x), p_g(x)) \, dx = \frac{1}{2} + \frac{1}{4} \int |p_r(x) - p_g(x)| \, dx = \frac{1}{2} + \frac{1}{2} \delta(P_r, P_g).
It follows that δ(P_r, P_g) ≥ 2E[J_acc] − 1.