Conditional Activation GAN: Improved Auxiliary Classifier GAN

A conditional generative adversarial network (cGAN) is a generative adversarial network (GAN) that generates data with a desired condition from a latent vector. Among the different types of cGAN, the auxiliary classifier GAN (ACGAN) is the most frequently used. In this study, we describe the problems of an ACGAN and propose replacing it with a conditional activation GAN (CAGAN) to reduce the number of hyperparameters and improve the training speed. The loss function of a CAGAN is defined as the sum of the loss of each GAN created for each condition. The proposed CAGAN is an integration of multiple GANs, where each GAN shares all hidden layers, and their integration can be considered as a single GAN. Therefore, the structure of the integrated GANs does not significantly increase the number of computations. Additionally, to prevent the conditions given to the discriminator of a cGAN from being ignored under batch normalization, we propose mixed batch training, in which every batch for the discriminator keeps the ratio of real to generated data consistent.


I. INTRODUCTION
A conditional generative adversarial network (cGAN) [1] is a generative adversarial network (GAN) [2] that can generate data with a desired condition from a latent vector. Among the various cGANs [3], [4], the most frequently used is the auxiliary classifier GAN (ACGAN) [5]-[11]. Some studies have used variations of ACGAN [10], [11] without detailing the rationale for those variations. In this study, we describe the reasons for modifying an ACGAN and its disadvantages.
In an ACGAN, when the real and generated data distributions are the same, the auxiliary classifier of the discriminator and the generator can be considered as a group of GANs, wherein each GAN trains one condition using the cross-entropy adversarial loss and shares all hidden layers. Considering an ACGAN as a set of GANs, the generated-data classification loss in the ACGAN discriminator loss interferes with the training of each GAN and hence is removed in the modified ACGAN.
The associate editor coordinating the review of this manuscript and approving it for publication was Muhammad Sharif .
Each GAN can only be trained when the real and generated data distributions are the same; hence, a problem might arise wherein each individual GAN may not be trained at the beginning of the ACGAN training.
Additionally, to use the advanced adversarial loss, as applied in the least squares GAN (LSGAN) [12] or the Wasserstein GAN-gradient penalty (WGAN-GP) [13] in ACGAN, it is necessary to determine a hyperparameter that adjusts the ratio of adversarial loss to classification loss.
We propose a conditional activation GAN (CAGAN) that can replace an ACGAN to reduce the number of hyperparameters and improve the training speed (i.e., performance increase per epoch), overcoming the ACGAN problems mentioned above. The loss of a CAGAN is defined as the sum of the losses of the GANs created for each condition. Each GAN shares all hidden layers, and thus a CAGAN, created through the conceptual aggregation of individual GANs, can be considered a single GAN.
Unlike an ACGAN, which uses two losses (i.e., adversarial and classification losses), a CAGAN uses a single conditional activation loss; thus, there is no need to find the proper ratio of adversarial to classification loss. An ACGAN begins to train each condition only when the real data distribution is the same as the generated data distribution, whereas a CAGAN always trains all conditions simultaneously, which means that it produces meaningful gradients even during the early training stage. Therefore, the CAGAN's performance is better than that of ACGAN.

VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
A cGAN trained with batch normalization [14] applied to the discriminator may cause the generator to distort the distribution of the input conditions. When batch normalization is applied to the discriminator and the distributions of the real and generated data differ, the discriminator may use the condition distribution of the batch to discriminate between the real and generated data. As a result, the condition distribution of the generated data follows that of the real data rather than the distribution of the input target conditions.
To prevent the generator from ignoring the distribution of the input target conditions, we suggest applying a mixed batch training technique, in which each batch for the discriminator is configured with the same ratio of real to generated data, such that every batch contains a mixture of real and generated data in the same ratio. However, if mixed batch training is applied from the start of training, it is easy for the discriminator to separate the generated data from the real data within a single batch, and the training of the discriminator hardly proceeds. Therefore, to apply mixed batch training, the ratio of real to generated data in each batch is gradually changed to the target ratio as training progresses. After the ratio of each batch reaches the target ratio, training proceeds without changing the ratio.

II. RELATED WORKS
cGAN [1] was proposed to generate data with the desired conditions. The generator of a cGAN generates data with the desired condition by receiving a pair consisting of a latent vector and a condition vector. The discriminator discriminates a pair consisting of the real data and the real-data condition vector (labels) as real, and a pair consisting of generated data and the corresponding condition vector as fake. Subsequently, ACGAN, which is an extended version of cGAN, was proposed.
There is an auxiliary classifier in the ACGAN discriminator. This classifier is trained to classify the conditions of real and generated data correctly. Additionally, the generator is trained so that the auxiliary classifier classifies the generated data correctly. It is possible to change an image attribute by using an image instead of a latent vector as the input to the ACGAN generator and adding a reconstruction loss to the generator [10], [11]. Vanilla GAN [2] uses cross-entropy as an adversarial loss, which reduces the Jensen-Shannon divergence between the real and generated data. However, when the real and generated data distributions are far apart, as they are before training, the Jensen-Shannon divergence remains almost constant, which means there is almost no gradient [25]. Therefore, the results obtained from such GAN training are insufficient. Different divergences or distances between the two distributions, and adversarial losses computed using different methods [22]-[25], were proposed to solve this problem. Among these, LSGAN and WGAN-GP are widely used. Mario et al. [20] compared the performance of several adversarial losses. Sebastian et al. [22] proposed a method to improve GAN performance wherein each batch was composed of real and generated data selected at random by Bernoulli trials; they subsequently proposed a GAN with a unique adversarial loss that trains the discriminator to infer the ratio of real data in each batch.

III. ANALYSIS OF AUXILIARY CLASSIFIER GAN
The loss of an ACGAN is defined as follows:

L_d = L_adv^d + L_cls^r + L_cls^g (1)
L_g = L_adv^g + L_cls^r + L_cls^g (2)

In (1) and (2), L_d is the loss of the discriminator, and L_g is the loss of the generator. L_adv^d and L_adv^g are the adversarial losses of the discriminator and generator, respectively. Similarly, L_cls^r and L_cls^g are the classification losses of the real data and generated data, respectively.
L_cls^r = E_{x,cnd}[−log D_cls(cnd | x)] (3)

In (3), E is the expectation over the given variables; x is the real data, and cnd is a binary vector that expresses the condition of real data x. D_cls(k) is the probability distribution produced by the auxiliary classifier of the discriminator for input data k, and −log D_cls(b | k) is the cross-entropy loss between the binary vector b and D_cls(k). Minimizing −log D_cls(b | k) implies that D_cls is being trained to estimate the binary vector b when the input data is k.
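As a concrete illustration, the cross-entropy in (3) can be sketched in NumPy. The function name and the treatment of each condition node as an independent binary output are our own assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def classification_loss(d_cls_out, cnd, eps=1e-12):
    """Cross-entropy between a binary condition vector `cnd` and the
    auxiliary classifier's per-condition probabilities `d_cls_out`,
    treating each condition node as an independent binary output.
    `eps` guards against log(0)."""
    return -np.sum(cnd * np.log(d_cls_out + eps)
                   + (1 - cnd) * np.log(1 - d_cls_out + eps))

# A classifier that is confident and correct incurs a small loss.
probs = np.array([0.9, 0.1, 0.1])   # classifier outputs for three conditions
cnd = np.array([1.0, 0.0, 0.0])     # the sample has only the first condition
loss = classification_loss(probs, cnd)
```

Minimizing this quantity drives `d_cls_out` toward the binary vector `cnd`, which is exactly what "estimate binary vector b when the input data is k" means.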
L_cls^g = E_{cnd,z}[−log D_cls(cnd | G(cnd, z))] (4)

In (4), cnd is the target condition binary vector, and z is the latent vector. G(cnd, z) is the data generated by generator G from the target condition binary vector cnd and latent vector z.
L_adv^d = E_x[−log D_adv(x)] + E_{cnd,z}[−log(1 − D_adv(G(cnd, z)))] (5)
L_adv^g = E_{cnd,z}[−log D_adv(G(cnd, z))] (6)

In (5) and (6), D_adv is the adversarial module of the discriminator, and D_adv(k) is the probability that input k is real, as estimated by the adversarial module.
Note that L_cls^r in L_g does not play any role, because the generator does not affect the calculation of L_cls^r. In an ACGAN, when the real and generated data distributions are the same, the auxiliary classifier of the discriminator and the generator can be considered as a group of GANs, wherein each GAN trains one condition using the cross-entropy adversarial loss and shares all hidden layers, as depicted in Fig. 1.
Assume that an ACGAN training three independent conditions (A, B, and C) does so using only the adversarial loss, and that the real and generated data distributions are the same. Node A of the discriminator is trained by L_cls^r[A] in L_d to output 1, representing real data, when it receives real data under condition A, and 0, representing generated data, when it receives real data under the not-A condition.
When the generator receives 1 as its node A input, it attempts to generate data under condition A using L_cls^g[A] in L_g, and it trains node A of the discriminator to output 1.
If the generator attempts to generate data under condition A but fails, the generated data distribution will instead be close to the real data distribution under the not-A condition, because the real and generated data distributions are assumed to be the same.
Thus, the hidden layers of the discriminator together with node A, and the hidden layers of the generator together with the latent-vector input and node A, can be considered as a single GAN A that generates data with condition A, trained by L_cls^r[A] in L_d and L_cls^g[A] in L_g. However, L_cls^g[A] in L_d trains node A of the discriminator to output 1, which represents a real value, when the discriminator receives the generated data. Therefore, L_cls^g[A] in L_d interferes with the training of GAN A. Additionally, when the generator receives 0 as its node A input, it can be considered as a GAN that generates data under the not-A condition.
An ACGAN uses cross-entropy loss as an adversarial loss. However, to use an advanced adversarial loss, such as that of LSGAN or WGAN-GP, a hyperparameter is needed to adjust the ratio of adversarial loss to classification loss.
To solve these problems, the loss of the modified ACGAN used in StarGAN [10] or AttGAN [11] is defined as follows:

L_d = L_adv^d + λ_cls L_cls^r (7)
L_g = L_adv^g + λ_cls L_cls^g (8)

In (7) and (8), L_d is the loss of the discriminator, and L_g is the loss of the generator. L_adv^d and L_adv^g are the advanced adversarial losses of the discriminator and generator, respectively. L_cls^r and L_cls^g are the same as in (3) and (4), respectively. λ_cls is a hyperparameter representing the classification loss weight. As described above, a modified ACGAN can also be considered as a group of GANs. However, each GAN can only be trained as a GAN for its condition if the real and generated data distributions for that condition are the same.
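The role of λ_cls in (7) and (8) can be made concrete with a small sketch; the function name and numeric values are illustrative assumptions, not taken from the paper:

```python
def modified_acgan_losses(l_adv_d, l_cls_r, l_adv_g, l_cls_g, lambda_cls):
    """Modified ACGAN losses: the generated-data classification term is
    dropped from the discriminator loss, and lambda_cls weighs the
    classification loss against the (advanced) adversarial loss."""
    l_d = l_adv_d + lambda_cls * l_cls_r  # (7)
    l_g = l_adv_g + lambda_cls * l_cls_g  # (8)
    return l_d, l_g

# With lambda_cls = 0.1, classification contributes a tenth as strongly
# as the adversarial loss; finding this ratio is the extra search cost.
l_d, l_g = modified_acgan_losses(1.0, 2.0, 0.5, 3.0, lambda_cls=0.1)
```

The single scalar `lambda_cls` is precisely the additional hyperparameter that CAGAN removes.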
In other words, if the real and generated data distributions differ at early stages of training, the training does not proceed with the classification loss but only with the adversarial loss, as shown in Fig. 2.
By training using adversarial loss, the real and generated data distributions become closer. As these distributions move closer to each other, the classification loss gradually acts as the cross-entropy adversarial loss of each GAN, producing meaningful gradients, and training is performed to generate data under each condition.
An ACGAN has the disadvantage of requiring one additional hyperparameter to adjust the ratio of adversarial to classification loss in both the discriminator and generator and does not produce meaningful gradients during the early training stage.

IV. CONDITIONAL ACTIVATION GAN (CAGAN)
To solve these ACGAN problems, we propose a CAGAN that is similar to having multiple GANs, each of which is defined to train the corresponding condition.
The loss of a CAGAN is the sum of the losses of the individual GANs, wherein each GAN trains only one condition, as expressed by the following:

L_d = Σ_{∀c∈cnd} L_d^{GAN_c} (9)
L_g = Σ_{∀c∈cnd} L_g^{GAN_c} (10)

In (9) and (10), L_d and L_g represent the discriminator and generator losses of the CAGAN, respectively. In Σ_{∀c∈cnd}(), c is a specific binary condition scalar in the binary condition vector cnd, and GAN_c is an individual GAN trained for only condition c.
In addition, G_c and D_c are the generator and discriminator of GAN_c, respectively, where G_c receives a binary activation value along with a latent vector. If G_c receives 1 as the activation value, G_c tries to trick D_c, and D_c in turn tries to discriminate the data generated by G_c as fake.
If G_c receives 0 as the activation value, the generated data does not involve G_c and D_c. D_c is only concerned with discriminating real data that have condition c and pays no attention to other real data, including real data with the not-c condition.
L_d^{GAN_c} = E_{x,c}[c · f_d^r(D_c(x))] + E_z[f_d^g(D_c(G_c(1, z)))] (11)

In (11), c is a binary scalar that expresses the condition of real data x, and G_c(1, z) is the data generated by G_c when it receives latent vector z with 1 as the activation value.
Here, f_d^r is a function used to calculate the adversarial loss of the discriminator corresponding to the real data, and f_d^g is a function that calculates the adversarial loss of the discriminator corresponding to the generated data.

L_g^{GAN_c} = E_z[f_g(D_c(G_c(1, z)))] (14)

In (14), f_g is a function that calculates the adversarial loss of the generator.
For example, with the adversarial loss given in LSGAN [12], f_d^r(k) = (k − 1)^2, f_d^g(k) = k^2, and f_g(k) = (k − 1)^2.
In a CAGAN, because each GAN shares all hidden layers, the conditional activation loss can be rewritten as follows:

L_d = E_{x,cnd}[f_d^r(D(x)) · cnd] + E_{cnd,z}[f_d^g(D(G(cnd, z))) · cnd] (15)
L_g = E_{cnd,z}[f_g(D(G(cnd, z))) · cnd] (16)

In (15) and (16), ''·'' is an inner product (element-wise sum of products).
The following equations express the loss of a CAGAN using the adversarial loss of LSGAN:

L_d = E_{x,cnd}[(D(x) − 1)^2 · cnd] + E_{cnd,z}[D(G(cnd, z))^2 · cnd] (17)
L_g = E_{cnd,z}[(D(G(cnd, z)) − 1)^2 · cnd] (18)
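A minimal NumPy sketch of the conditional activation loss in its LSGAN form, with one discriminator output per condition and an inner product with the condition vector as in (15) and (16); all function and variable names are our own illustrative assumptions:

```python
import numpy as np

def cagan_lsgan_losses(d_real, cnd_real, d_fake, cnd_fake):
    """Conditional activation loss, LSGAN form.

    d_real / d_fake: discriminator outputs, one scalar per condition
    (shape: batch x num_conditions). cnd_*: binary condition vectors.
    The inner product with cnd activates only the per-condition GAN
    whose condition is present; inactive conditions contribute nothing.
    """
    # Discriminator: push active real outputs to 1, active fake outputs to 0.
    l_d = np.mean(np.sum((d_real - 1.0) ** 2 * cnd_real, axis=1)) \
        + np.mean(np.sum(d_fake ** 2 * cnd_fake, axis=1))
    # Generator: push active fake outputs to 1.
    l_g = np.mean(np.sum((d_fake - 1.0) ** 2 * cnd_fake, axis=1))
    return l_d, l_g

# One sample, two conditions; only the first condition is active.
d_real = np.array([[1.0, 0.5]])
cnd_real = np.array([[1.0, 0.0]])
d_fake = np.array([[0.5, 0.0]])
cnd_fake = np.array([[1.0, 0.0]])
l_d, l_g = cagan_lsgan_losses(d_real, cnd_real, d_fake, cnd_fake)
```

Note that the second condition's outputs are masked out entirely: that per-condition GAN receives no gradient from this sample, exactly as intended.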
Likewise, the loss of a CAGAN using the adversarial loss of WGAN-GP can be defined as follows:

L_d = E_{x,cnd}[−D(x) · cnd] + E_{cnd,z}[D(G(cnd, z)) · cnd] + λ_gp · gp_loss (19)
gp_loss = E_{x̂}[average((‖∇_{x̂} D(x̂)‖_2 − 1)^2)] (20)

In (19), λ_gp and gp_loss represent the gradient penalty loss weight and the gradient penalty loss, respectively. gp_loss is the average of each GAN's gradient penalty loss. In (20), x̂ is data uniformly sampled on a straight line between the real and generated data. average is a function that calculates the average of the input vector.
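The sampling of x̂ in (20) can be sketched as follows. In practice the gradient comes from automatic differentiation; here `grad_fn` is a stand-in we supply by hand, and all names are our own assumptions:

```python
import numpy as np

def gradient_penalty(x_real, x_fake, grad_fn, rng):
    """Gradient penalty as in (20): sample x_hat uniformly on the segment
    between paired real and generated samples, then penalize gradient
    norms that deviate from 1."""
    eps = rng.uniform(size=(len(x_real), 1))        # one mixing weight per pair
    x_hat = eps * x_real + (1.0 - eps) * x_fake     # points on the segment
    grads = grad_fn(x_hat)                          # dD/dx at x_hat
    norms = np.linalg.norm(grads, axis=1)
    return np.mean((norms - 1.0) ** 2)

# Toy check with a linear critic D(x) = w . x, whose gradient is w everywhere.
w = np.array([3.0, 4.0])                            # gradient norm is 5
grad_fn = lambda x_hat: np.tile(w, (len(x_hat), 1))
rng = np.random.default_rng(0)
gp = gradient_penalty(np.zeros((4, 2)), np.ones((4, 2)), grad_fn, rng)
```

For the linear critic the penalty is (5 − 1)^2 = 16 regardless of where x̂ lands, which makes the toy case easy to verify.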
In ACGAN, GAN A, which trains condition A, generates data with the not-A condition as well as those with condition A.
However, in a CAGAN, because GAN A disregards the not-A condition when training with condition A, a new GAN must be added to train the not-A condition.
In CAGAN, meaningful gradients are generated even at the early stages of training because each GAN can be trained through an advanced adversarial loss that generates meaningful gradients, even if the real and generated data distributions differ. For this reason, the performance of CAGAN is better than that of ACGAN.
Additionally, unlike ACGAN, which uses two losses (adversarial and classification losses), CAGAN uses only one loss (conditional activation loss), and thus, there is no need to find the proper adversarial to classification loss ratio. This indicates that it takes less time to search for an important hyperparameter, that is, the ratio of adversarial to classification loss.

V. MIXED BATCH TRAINING
A cGAN trained with batch normalization applied to the discriminator may cause the generator to distort the distribution of the input conditions. When batch normalization is applied to the discriminator and the distribution of the target conditions used for training differs from that of the real data, the discriminator may use the condition distribution of the batch to discriminate between real and generated data. As a result, the condition distribution of the generated data follows that of the real data rather than the distribution of the input target conditions. To prevent the generator from ignoring the distribution of the input target conditions, we suggest using mixed batch training.
The mixed batch training technique configures each batch for the discriminator so that the ratio of real to generated data is kept constant. Because every training batch maintains the same ratio of real to generated data, the discriminator cannot discriminate between the real and generated data through the distribution of conditions, and the generator will not attempt to follow the distribution of conditions in the real data.
Additionally, if the ratio of real or generated data in each batch is not constant, for example, when (real data size)/(batch size) ∼ U(0, 1), then because the generator cannot determine the ratio of real to generated data in each batch, it is advantageous for the generator to follow the distribution of conditions in the real data. This causes the generator to ignore the input condition, as if mixed batch training were not applied. Therefore, to apply mixed batch training, each batch should always consist of the same ratio of real and generated data.
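A sketch of how such a fixed-ratio batch might be assembled; all names are our own, and in practice the generated samples would come from the generator at each step rather than from a fixed array:

```python
import numpy as np

def make_mixed_batch(real_data, fake_data, batch_size, real_fraction, rng):
    """Assemble one discriminator batch with a fixed real:generated ratio,
    then shuffle it so ordering carries no signal. A constant real_fraction
    (rather than a random one) keeps batch-level condition statistics
    uninformative for a batch-normalized discriminator."""
    n_real = int(round(batch_size * real_fraction))
    real_idx = rng.choice(len(real_data), n_real, replace=False)
    fake_idx = rng.choice(len(fake_data), batch_size - n_real, replace=False)
    batch = np.concatenate([real_data[real_idx], fake_data[fake_idx]])
    labels = np.concatenate([np.ones(n_real), np.zeros(batch_size - n_real)])
    perm = rng.permutation(batch_size)
    return batch[perm], labels[perm]

# Toy data: real samples are non-negative, generated samples are negative.
rng = np.random.default_rng(0)
real = np.arange(100, dtype=float).reshape(100, 1)
fake = -1.0 - np.arange(100, dtype=float).reshape(100, 1)
batch, labels = make_mixed_batch(real, fake, 32, 0.5, rng)
```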
However, if mixed batch training is applied from the start of training, it is easy for the discriminator to discriminate the generated data from the real data in the batch, and the training hardly proceeds.
Therefore, the ratio of real data to generated data in each batch gradually changes to the target ratio as training progresses. For example, when the target ratio of ''real data : generated data'' is 50:50, at the beginning of training, the ratios of the batches are ''100:0 and 0:100.'' As training progresses, the ratios change to ''80:20 and 20:80,'' then ''60:40 and 40:60,'' and finally become ''50:50 and 50:50.'' After the ratio of each batch reaches the target ratio, training proceeds without changing the ratio. The amount by which the ''real data : generated data'' ratio changes per epoch or iteration is an additional hyperparameter of mixed batch training. This hyperparameter exists to prevent training from failing early, so it has little impact on the model's final performance and is very easy to find. Additionally, when the training of the generator and discriminator is unbalanced, the target ratio may differ from 50:50, but in general, 50:50 is used.
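The schedule above can be sketched as a small function; the name and default step size (1 percentage point per epoch, matching our MNIST setup) are illustrative choices:

```python
def mixed_batch_ratio(epoch, change_per_epoch=0.01, target=0.5):
    """Real-data fraction for the 'mostly real' batches at a given epoch.

    Starts at 100:0 and moves toward the target by `change_per_epoch`
    (e.g. 1 percentage point) each epoch, then stays at the target.
    The complementary 'mostly generated' batches use 1 minus this value.
    """
    return max(1.0 - change_per_epoch * epoch, target)

# After 20 epochs at 1%p per epoch, the pair of batch ratios is 80:20 / 20:80.
r = mixed_batch_ratio(20)
```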

VI. MATERIAL AND METHODS
We conducted experiments on the MNIST handwriting number dataset [15] and the Celeb A dataset [26]. Tensorflow2 was used [18]. To evaluate the proposed network, the average Fréchet inception distance (FID) [19] over all conditions was used; the lower the FID, the better the GAN's performance. In all experiments, the Adam optimizer [17] with beta1 = 0.9, beta2 = 0.999, and batch size = 32 was used. The size of the generated dataset was the same as the size of each test dataset during the evaluation.
All experiments were conducted three times, and the results were averaged.
The same row and column of the generated images have the same latent vector and condition vector, respectively.

A. MNIST EXPERIMENT
In this experiment, we used the MNIST handwriting number dataset, which has 60,000 training images and 10,000 test images, with a pixel resolution of 28 × 28 and a channel size of 1. The conditions to train are the type of number (0-9). The average of the FIDs for each number was used for the evaluation.
The basic design of a deep convolutional GAN (DCGAN) [16], applying instance normalization to both the generator and discriminator, was used for the model architecture. The adversarial loss of LSGAN was used.
For both ACGAN and CAGAN, the generator received a 10-dimensional condition vector and a 256-dimensional latent vector following a normal distribution.
The output dimension of the ACGAN discriminator was 1 + 10 = 11, and that of the CAGAN discriminator was 10. We used a learning rate of 10^−5 for both generator and discriminator training.
In the original MNIST handwriting number training dataset, the number of images for each number is approximately the same. For the mixed batch training experiment, we intentionally used a dataset consisting of 5,500 zeros and 500 of each of the other numbers from 1 to 9 (all drawn from the MNIST handwriting number training dataset) to create an unbalanced dataset. Zeros occupy 55% of the 10,000 images, and each of the remaining numbers (1-9) accounts for 5%. Because the size of the training data was reduced to 1/6, the number of epochs was doubled (epoch = 100), and the learning rate was tripled (learning rate = 3 × 10^−5) for this experiment. Since the number of epochs doubled, the FID was measured every two epochs. We replaced all instance normalization layers in the discriminator with batch normalization layers. For mixed batch training, we used a batch ratio change per epoch of 1%p. This means that the batch ratio gradually changed to the target ratio by 1%p over epochs 1-50. Epochs 51-100 were trained with real data : generated data = 50 : 50.
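The composition of this unbalanced subset works out as follows (a quick arithmetic check; the variable names are ours):

```python
# Unbalanced MNIST subset: 5,500 zeros plus 500 of each digit 1-9.
counts = {0: 5500, **{d: 500 for d in range(1, 10)}}
total = sum(counts.values())
zero_share = counts[0] / total    # zeros' share of the dataset
other_share = counts[1] / total   # share of each remaining digit
```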

B. CELEB A DATASET
In this experiment, we used the Celeb A dataset, which has 162,770 training images, 19,867 validation images, and 19,962 test images. We used the validation images for hyperparameter tuning and the test images for evaluation. We cropped the 128 × 128 pixels at the center of each image. The basic design of a DCGAN, with instance normalization on the discriminator and batch normalization on the generator, was used for the model architecture. The adversarial loss of WGAN-GP with λ_gp = 0.1 was used.
For ACGAN, the generator received a 3-dimensional condition vector and a 512-dimensional latent vector following a normal distribution. The output dimension of the ACGAN discriminator was 1 + 3 = 4.
In the mixed batch training experiment, we trained the unbalanced conditions ''Black hair,'' ''Bangs,'' and ''Young,'' which occupy 23.90%, 15.17%, and 77.89% of the training data, respectively. We replaced all instance normalization layers in the discriminator with batch normalization layers. Because WGAN-GP cannot apply batch normalization to the discriminator [13], we changed the adversarial loss to the LSGAN adversarial loss. Because the adversarial loss changed from WGAN-GP to LSGAN and the normalization changed from instance to batch normalization, we also changed λ_cls and the learning rates: we used λ_cls = 0.1, a learning rate of 3 × 10^−5 for the generator optimizer, and a learning rate of 3 × 10^−6 for the discriminator optimizer in the mixed batch training experiments. We used a batch ratio change per epoch of 2.5%p for mixed batch training. Therefore, the batch ratio gradually changed to the target ratio by 2.5%p over epochs 1-20, and epochs 21-50 were trained with real data : generated data = 50 : 50.

VII. RESULTS
A. ACGAN AND CAGAN
First, we compared the performance of the modified ACGAN with and without the generated-data classification loss L_cls^g in the discriminator loss, under the same adversarial loss.
The figures show that the performance of CAGAN is similar to or better than that of the modified ACGAN when a good hyperparameter (λ_cls) is used. Fig. 13 shows the data generated by CAGAN in the Celeb A experiment after 50 epochs; columns 1-4 have the condition ''Not black or brown hair,'' and columns 5-8 have the condition ''Black or brown hair.'' Figs. 17 and 18 show the data generated by the CAGAN after 100 epochs, without and with mixed batch training, respectively, and Fig. 19 compares these two cases based on FID. The generated data clearly show that both the modified ACGAN and CAGAN without mixed batch training ignore the conditional vectors and generate many zeros, following the distribution of conditions in the training data. The results shown in Figs. 14-19 clearly demonstrate that the performance with mixed batch training is better than that without it, for both the modified ACGAN and CAGAN.
In particular, Figs. 15 and 18 show that mixed batch training in a conditional GAN can prevent the conditional vector from being ignored.
The same experiment was repeated on the Celeb A dataset.

VIII. CONCLUSION
In this study, we interpreted an ACGAN as a set of GANs, described why the generated data classification loss of the discriminator loss in an ACGAN interferes with the training, and confirmed this theory experimentally. Based on this interpretation, we proposed a novel CAGAN, which can be interpreted as an integration of GANs in which each individual GAN trains only one condition. Unlike the modified ACGAN, CAGAN generates a meaningful gradient even at the early stages of training; consequently, the training is significantly faster, as demonstrated by our experiments.
CAGAN is expected to be used as a replacement for the modified ACGAN in many GAN applications because the former has fewer hyperparameters and trains faster while remaining compatible with ACGAN.
We also predicted that the discriminator, with batch normalization, might use a distribution of conditions in each batch to discriminate between real and generated data in a cGAN, which degrades the performance.
To prevent this degradation, we proposed the use of mixed batch training, in which each batch for the discriminator is configured with the same ratio of real to generated data, such that every batch always has the same distribution of conditions. The experiments confirmed that mixed batch training improves the performance of cGANs (i.e., modified ACGANs and CAGANs).
Mixed batch training is expected to help train cGANs using batch normalization for discriminators.
In conclusion, the proposed CAGAN provides better performance than ACGAN in terms of training speed (performance increase per epoch) and hyperparameter search. Mixed batch training also improves the performance of a partially trained conditional GAN by inducing healthy competition between the generator and discriminator.