JGAN: A Joint Formulation of GAN for Synthesizing Images and Labels

Image generation with an explicit condition or label generally works better than unconditional methods. In modern GAN frameworks, both the generator and the discriminator are formulated to model the conditional distribution of images given labels. In this article, we provide an alternative formulation of GAN which models the joint distribution of images and labels. This joint formulation has two advantages over conditional approaches. First, the joint formulation is more robust to label noise if it is properly modeled. This alleviates the burden of producing noise-free labels and allows the use of weakly supervised labels in image generation. Second, we can use any kind of weak label or image feature that is correlated with the original image data to enhance unconditional image generation. We show the effectiveness of our joint formulation on the CIFAR-10, CIFAR-100, and STL datasets with a state-of-the-art GAN architecture.


Introduction
Owing to the success of Generative Adversarial Networks (GANs) in modeling distributions of real-world data, they have become widely used for image generation. Since their introduction by Goodfellow and his colleagues [1], many researchers have improved their stability and accuracy through new loss functions [2,3], new network architectures [4,5], improved training processes and regularization [5,6], imposed conditions [7,8,9,10,11,12,13], and progressive methods [14]. Among these, imposing explicit conditions is one of the easiest ways to improve the quality of image generation when well-defined labels exist. In modern GAN frameworks, both the generator and the discriminator are formulated to model the conditional distribution of images given labels.
In this paper, we propose an alternative formulation of GAN which models the joint distribution of images and labels. We will show that this joint formulation has two advantages over conditional approaches. The first advantage is that the joint formulation is more robust to label noise. Typical labels used in image synthesis are annotated by human workers or generated by other machine learning methods, and it is generally difficult to guarantee their completeness or correctness. Since conditional image generation regards labels as a given condition, noise in the labels may degrade the quality of image generation. Our joint formulation instead regards labels as additional information for modeling the joint distribution, so it is more robust to noise in the labels. We will show that the joint formulation provides the same level of image generation quality with defect-free labels, and is more robust to noise in labels. Second and more importantly, we can use any kind of weak label (or additional information which depends on the original image data) to enhance unconditional image generation, since our joint GAN formulation doesn't require those labels as input at generation time but actually generates them. In a conventional conditional formulation, it's impossible to feed these additional data into the generator since we don't know what kind of data should be added to the generator. Our experiments show that better image generation is possible without explicitly feeding image labels.
Our contributions are summarized as follows:
• We propose a novel GAN formulation which models the joint distribution of images and labels, and show that this joint formulation increases robustness to noisy or weak labels.
• We demonstrate that this joint formulation can be used to increase the quality of unconditional image generation by incorporating weak labels, or any kind of additional information which depends on the image data, into the training process. Since the labels are used only for training and our GAN generates both images and labels, we don't need to feed labels when generating images.
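
A Joint Formulation of GAN for Modeling p(I, L)
The standard adversarial loss for the discriminator for modeling p(I|L), in which I and L are images and labels respectively, is given by:

$$\mathcal{L}(D) = -\mathbb{E}_{L \sim q(L)}\big[\mathbb{E}_{I \sim q(I|L)}[\log D(I, L)]\big] - \mathbb{E}_{L \sim p(L)}\big[\mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z, L), L))]\big], \tag{1}$$

where z is input noise, and q and p are the true distribution and the generator distribution, respectively. The generator loss is defined as:

$$\mathcal{L}(G) = -\mathbb{E}_{L \sim p(L)}\big[\mathbb{E}_{z \sim p(z)}[\log D(G(z, L), L)]\big]. \tag{2}$$

In our joint formulation, we can rewrite the discriminator and generator losses with a new generator G_{I,L}(z), which generates both I and L, as follows:

$$\mathcal{L}(D) = -\mathbb{E}_{L \sim q(L)}\big[\mathbb{E}_{I \sim q(I|L)}[\log D(I, L)]\big] - \mathbb{E}_{z \sim p(z)}[\log(1 - D(G_{I,L}(z)))], \tag{3}$$

$$\mathcal{L}(G) = -\mathbb{E}_{z \sim p(z)}[\log D(G_{I,L}(z))]. \tag{4}$$

As you can see, no modification is made to the discriminator, since the discriminator already has a joint formulation which takes p(L) and p(I|L), and G_{I,L} generates I and L simultaneously. Figure 1 illustrates the basic difference between the conditional formulation and our joint formulation in how a label is exploited. The benefits of the joint formulation over the conditional formulation are limited when well-defined labels exist, i.e., labels made carefully by human workers or an external oracle. It is well known that modeling a joint distribution is generally more difficult than modeling a conditional distribution because of the increased dimension of the probability distribution.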
The discriminator represents the joint distribution by the lower-dimensional probability distributions p(L) and p(I|L). The only difference here is how the label is incorporated in the generator. Common choices for imposing a condition on the generator are input or hidden concatenation [7, 8, 9, 10] and conditional batch normalization [11,12]. Our joint formulation doesn't require labels as a condition but actually generates labels, along with images, from the given input noise. To do this, we add a function approximator as a part of the generator (refer to the Experiment section for the choices of label function approximators). Since this joint formulation doesn't use labels as a prior for lowering the dimension of the probability distribution of the data, it can be more robust to noise in labels if we can properly model the joint distribution in the original dimension of the probability distribution.
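As a minimal sketch of the joint losses described above, the following NumPy code assumes the standard log-loss form of the adversarial objective and a toy generator with a shared trunk and two heads. The functions `joint_generator` and `discriminator_logit`, the parameter names, and all sizes are hypothetical and chosen only for illustration; they are not the architecture used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def joint_generator(z, params):
    """Hypothetical G_{I,L}(z): one shared trunk, two heads (image and label)."""
    h = np.maximum(z @ params["trunk"], 0.0)     # shared ReLU features
    image = np.tanh(h @ params["image_head"])    # fake image (flattened)
    label = softmax(h @ params["label_head"])    # fake label (soft one-hot)
    return image, label

def discriminator_logit(image, label, params):
    """Hypothetical D(I, L): an unconditional term plus a label-image inner product."""
    return image @ params["v"] + np.sum(label * (image @ params["embed"]), axis=1)

def discriminator_loss(real_logit, fake_logit, eps=1e-12):
    # -E[log D(I, L)] on real pairs, -E[log(1 - D(G_{I,L}(z)))] on generated pairs.
    real_term = -np.mean(np.log(sigmoid(real_logit) + eps))
    fake_term = -np.mean(np.log(1.0 - sigmoid(fake_logit) + eps))
    return real_term + fake_term

def generator_loss(fake_logit, eps=1e-12):
    # -E[log D(G_{I,L}(z))]
    return -np.mean(np.log(sigmoid(fake_logit) + eps))

# Toy sizes and random data, for illustration only.
rng = np.random.default_rng(0)
n, z_dim, h_dim, img_dim, n_cls = 16, 8, 32, 48, 10
g_params = {
    "trunk": 0.1 * rng.normal(size=(z_dim, h_dim)),
    "image_head": 0.1 * rng.normal(size=(h_dim, img_dim)),
    "label_head": 0.1 * rng.normal(size=(h_dim, n_cls)),
}
d_params = {
    "v": 0.1 * rng.normal(size=img_dim),
    "embed": 0.1 * rng.normal(size=(img_dim, n_cls)),
}

real_images = rng.uniform(-1.0, 1.0, size=(n, img_dim))
real_labels = np.eye(n_cls)[rng.integers(0, n_cls, size=n)]   # labels from the training set
fake_images, fake_labels = joint_generator(rng.normal(size=(n, z_dim)), g_params)

real_logit = discriminator_logit(real_images, real_labels, d_params)
fake_logit = discriminator_logit(fake_images, fake_labels, d_params)  # label comes from G
loss_d = discriminator_loss(real_logit, fake_logit)
loss_g = generator_loss(fake_logit)
```

The discriminator code path is identical for real and generated pairs; only the source of the label changes, which is why the discriminator needs no modification in the joint formulation.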

Boosting Unsupervised Image Generation
With our joint formulation we can add any kind of additional information that depends on the original data as a weak label for the generator. Figure 2 illustrates how features from another classification network ϕ can be added to boost the quality of unsupervised image generation. Typical choices for ϕ are feature extractors from other tasks such as ImageNet classification and object detection. Unsupervised learning algorithms such as k-means clustering or autoencoders [15] can also be used. This is a unique advantage of JGAN over conditional GANs, since the additional information is modeled simultaneously by the generator, and the discriminator uses this fake information as a condition for its decision. As you can see in Equation 3, the discriminator actually models the joint distribution with the prior equal to the training label distribution. This additional information can boost the quality of synthesized images since it acts like a weak label for the discriminator.
Figure 2: Enhancing unsupervised image generation by using an additional feature extractor ϕ. The label generator part of JGAN models the distribution of ϕ, and the generated label is fed into the discriminator in a conventional way.
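One way to turn the output of ϕ into a low-dimensional weak label is the truncated SVD that the Experiment section applies to the inception features. The sketch below shows that reduction with plain NumPy. The function name `truncated_svd_features` and the synthetic `features` matrix are assumptions made for illustration: the matrix is random low-rank data standing in for ϕ(I), not real network activations, and the features are not centered, which is one common convention for truncated SVD.

```python
import numpy as np

def truncated_svd_features(features, k):
    """Project row-wise features onto their top-k right singular vectors."""
    _, singular_values, vt = np.linalg.svd(features, full_matrices=False)
    components = vt[:k]                  # (k, d) projection basis
    reduced = features @ components.T    # (n, k) weak labels for training
    return reduced, components, singular_values[:k]

# Synthetic stand-in for phi(I): 200 samples, 256-dim features, roughly rank 16.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 16))
features = latent @ rng.normal(size=(16, 256)) + 0.01 * rng.normal(size=(200, 256))

weak_labels, components, top_singular_values = truncated_svd_features(features, k=32)

# Relative reconstruction error indicates how much information the reduction keeps.
reconstruction = weak_labels @ components
relative_error = np.linalg.norm(features - reconstruction) / np.linalg.norm(features)
```

In this setup the rows of `weak_labels` would play the role of L for the real pairs shown to the discriminator, while the label head of the generator learns to produce vectors from the same distribution.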

Experiment
We used CIFAR-10, CIFAR-100, and STL for our experiments, and resized STL images to 48x48 from their original size of 96x96. For all experiments, we fixed the discriminator architecture in order to assess the effect of our joint formulation. We followed the design used by Miyato et al. [6] as the baseline framework for all experiments. Tables 1 and 2 show the architectures of our generator and discriminator, respectively. We removed batch normalization and applied spectral normalization to all layers of the discriminator. We used one discriminator update for each generator update, and all results are evaluated at 100K generator updates, except for STL, for which we used 200K generator updates for better convergence. We used the Adam optimizer with β1 = 0.5 and β2 = 0.999, with a learning rate of 0.0004 for the discriminator and 0.0001 for the generator. We report the average inception score [4] of the last five iterations of several runs.
We first show that our joint formulation is as good as the conditional formulation when modeling the conditional distribution p(I, L) for clean labels, and more robust to label noise. For comparison, we used input concatenation [9,10] and conditional batch normalization [11,13] in the generator, together with the projection discriminator proposed by Miyato et al. [13], which shows the state-of-the-art result for conditional image generation. To generate labels, we added a function approximator consisting of a few neural network layers right after the last ReLU layer of the generator in Table 1. Table 3 describes the network architecture for the label generation part of the generator. Dropout [16] is applied to all dense layers of the label generator with a rate of 0.5 to avoid overfitting. We added label noise by randomly selecting a subset of the entire dataset and then applying a random offset to each selected label. Table 4 summarizes how the inception score changes with the amount of label noise. As you can see, our joint formulation shows a competitive result on clean labels, and remains robust at high label noise ratios.
Our next experiment focuses on improving unconditional image generation by incorporating additional information which depends on the image data. We used the "pool3" layer output f_pool3 (the layer right before the logit layer) of the inception network as a starting point for the additional information. We used the same inception network version used in [4]. Since f_pool3 has 1000 dimensions, which is relatively high, and it is difficult to find the optimal network architecture to capture its distribution, we applied truncated singular value decomposition (SVD) to f_pool3 to reduce its dimension (to 32-128) and simplify the problem. Table 6 summarizes the comparison between unsupervised and joint image generation. We used the same network architecture for both the unsupervised and joint settings except for the additional label function approximator. We used a label generator slightly different from the one in Table 3; Table 5 describes the network for weak label generation. As you can see, JGAN consistently generates images with higher inception scores than the unsupervised baseline. Note that we didn't feed those labels in the image generation stage.
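The label-noise procedure used in the first experiment (select a random subset, then apply a random offset to each selected label) can be sketched as follows. The text does not specify how the offset is drawn, so this sketch assumes a uniform offset in [1, K - 1] applied modulo the number of classes K, which guarantees that every selected label actually changes. The function name `add_label_noise` and the toy label array are hypothetical.

```python
import numpy as np

def add_label_noise(labels, noise_ratio, num_classes, rng):
    """Corrupt a random fraction of integer class labels with a random offset."""
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_noisy = int(round(noise_ratio * labels.size))
    # Randomly select the subset of samples whose labels will be corrupted.
    selected = rng.choice(labels.size, size=n_noisy, replace=False)
    # Assumption: offsets are uniform in [1, num_classes - 1], wrapped modulo num_classes.
    offsets = rng.integers(1, num_classes, size=n_noisy)
    noisy[selected] = (labels[selected] + offsets) % num_classes
    return noisy, selected

rng = np.random.default_rng(0)
clean_labels = rng.integers(0, 10, size=1000)   # toy stand-in for CIFAR-10 labels
noisy_labels, noisy_index = add_label_noise(clean_labels, noise_ratio=0.3,
                                            num_classes=10, rng=rng)
changed_fraction = np.mean(noisy_labels != clean_labels)
```

With this choice of offset, `changed_fraction` equals the requested noise ratio exactly; drawing the offset from [0, K - 1] instead would leave some selected labels unchanged and lower the effective noise level.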

Conclusion
In this paper, we proposed a novel GAN framework which models the joint probability distribution of images and labels. We showed that this joint formulation generates image quality as good as conventional conditional image generation with clean labels, and remains robust when the labels are noisy. We also applied our method to improve the quality of unconditional image generation by incorporating additional information which depends on the image data. We think this joint formulation provides an easy way to feed any kind of relevant information into the GAN framework with a simple modification of the generator. Interesting directions for future work include finding optimal network architectures for the label generator and testing other methods of generating the additional information used with our joint formulation.

Figure 1: Three different GAN formulations: (left) unsupervised GAN modeling p(I); (middle) conditional GAN modeling p(I|L); (right, ours) a joint formulation of GAN modeling p(I, L), which generates fake images and labels simultaneously.

Table 3: Label generation part of the generator. D_r = 32 for CIFAR and D_r = 48 for STL; C_l = 128 for CIFAR-10 and STL and C_l = 256 for CIFAR-100; D_o = 10 for CIFAR-10 and STL and D_o = 100 for CIFAR-100. We used a one-hot vector representation for labels.
Output of the last ReLU of the generator ∈ R^{D_r × D_r × 256}
7 × 7 conv, stride=4, 256
BN, ReLU, dense, C_l
BN, ReLU, dense, C_l
BN, ReLU, dense, D_o

Table 4: Inception scores on CIFAR-10 and CIFAR-100 with different label noise ratios. Note that the joint formulation is more robust than the conditional one at high noise ratios. The conditional formulation gains almost no benefit from labels with 50% noise, but the joint formulation still shows a small improvement.

Table 6: Inception scores on CIFAR-10, CIFAR-100, and STL when adding weak labels (Inception pool3 features with dimension reduced by truncated SVD). Note that, unlike conditional GANs, our joint GAN doesn't require any label information when generating images (it actually generates the labels).