Stochastically Flipping Labels of Discriminator’s Outputs for Training Generative Adversarial Networks

Generative Adversarial Networks (GANs) play an adversarial game between two neural networks: the generator and the discriminator. Many studies treat the discriminator's outputs as an implicit posterior distribution conditioned on the input images. Under this view, increasing the discriminator's output dimensions can represent richer information than a single output dimension. However, increasing the output dimensions also produces a very strong discriminator, which can easily surpass the generator and break the balance of adversarial learning. Resolving this conflict while elevating the generation quality of GANs remains challenging. Hence, we propose a simple yet effective method to solve this conflict problem via a stochastic selection strategy that extends the flipped and non-flipped non-saturating losses in BipGAN. We organized our experiments around the well-known BigGAN and StyleGAN models for comparison. Our experiments validate that our approach strengthens generation quality within limited output dimensions, evaluated with several standard metrics on real-world datasets, and achieves competitive results in the human face generation task.


I. INTRODUCTION
Generative Adversarial Networks (GANs) [1] and their variants are among the most successful generative models, maintaining a dynamic balance between two neural networks: the generator and the discriminator. Owing to the discriminator and adversarial training, GANs can generate highly realistic images and outperform many other generative models, such as VAEs [2], Flow-based models [3], etc.
The discriminator in GANs can be treated as a binary classifier that decides whether an input image comes from the real image source or the generated fake image source. Meanwhile, some GAN studies (such as f-GAN, IPM-GANs, etc. [4], [5], [6], [7], [8], [9]) treat the discriminator's outputs as an implicit posterior distribution, conditioned on the parameters in the discriminator.
Thus, using multi-dimensional discriminator outputs rather than a single dimension is a helpful way to represent richer information. However, the generation quality stops improving once the discriminator's output dimensions become too high, as shown in an ablation study of repulsive MMD-GAN [9]. Below that point, the richer information from the discriminator output helps to train a better generator; beyond it, further increasing the output dimensions lets the discriminator gradually surpass the generator, which breaks the balance of adversarial learning and harms the generation quality. Besides, some well-known GAN models [9], [10], [11] must use multi-dimensional discriminator outputs for specific purposes. For instance, PatchGAN quadratically increases the discriminator's output dimensions, which can easily lead to an imbalance between the generator and the discriminator. Further research is necessary to avoid this conflict at higher dimensions while keeping the balance of adversarial learning.
BipGAN [12] explored another way to benefit GAN training from multi-dimensional discriminator outputs. The authors gave the discriminator two outputs and guided them toward two different optimization directions. The two outputs functionally work as two different kinds of discriminators: one for authenticity discrimination and the other for falsity discrimination. Logically, these two kinds of discriminators can work together to increase the posterior reliability [13] of the whole discriminator, benefiting the adversarial training process and improving the generation quality. BipGAN inspired us to solve the conflict problem mentioned above: we can use limited output dimensions to represent the two distributions learned from the two types of outputs in BipGAN [12].
First, we set the discriminator's output dimensions as N pairs of outputs (2N in all). In each pair, one output carries the authenticity loss and the other the falsity loss. Then, we randomly select part of the pairs to back-propagate only the authenticity loss, while the rest back-propagate only the falsity loss. On this basis, our discriminator keeps double the output dimensions while holding the same volume of parameters as traditional GANs during back-propagation, avoiding the conflict problem.
All in all, we highlight our contributions in this work: • We extend BipGAN to larger-dimensional outputs without causing the conflict problem via our random selection method; • We tested the empirically best ratio and selection policy between authenticity and falsity outputs; • Our experiments validate the effectiveness of our approach on conditional GANs with multi-dimensional discriminator outputs on several real-world datasets, achieving competitive generation quality in the human face generation task.

II. BACKGROUND
The discriminator in GAN models has been developed in many directions in past research [9], [14], [15]. Increasing the output dimensions of the discriminator is a feasible and useful way to improve GAN training, although performance degrades once the dimensionality passes its empirical peak.
In this section, we will introduce several past works based on multi-dimensional outputs in discriminators.

A. MULTI-DIMENSIONAL OUTPUTS IN THE DISCRIMINATOR
As validated in repulsive MMD-GAN [9], using multi-dimensional discriminator outputs can significantly improve the generation quality due to an information-rich representation of the input data. In contrast, the model's generation ability faces the conflict problem when adding too many output dimensions. The Maximum Mean Discrepancy (MMD) [16] loss function requires multi-dimensional discriminator outputs to accommodate enough information in the Hilbert space. Repulsive MMD-GAN improved one successful past work, MMD-GAN [17], which utilized MMD as the loss metric [11], [18] to minimize the distance between the training data distribution and the generated image distribution. Eq. 1 is the original MMD loss [19], [20] for the discriminator step:

L_D^att = −MMD²(P_data, P_G) = −E_{a,a′∼P_data}[k(D(a), D(a′))] + 2E_{a∼P_data, b∼P_G}[k(D(a), D(b))] − E_{b,b′∼P_G}[k(D(b), D(b′))].  (1)

Moreover, Eq. 2 is the repulsive MMD loss for the discriminator step:

L_D^rep = E_{a,a′∼P_data}[k(D(a), D(a′))] − E_{b,b′∼P_G}[k(D(b), D(b′))],  (2)

where k(D(a), D(b)) is the kernel embedding function (e.g., the RBF kernel) and (a, b) can be the multi-dimensional discriminator outputs x or y as defined in [9]. Such a function kernelizes x or y from an N-dimensional vector into an N × N matrix. P_data and P_G are the real image distribution and the generated image distribution, respectively (the same below). The authors showed that, in terms of Inception Score (IS) [21], the generation quality increases from 1 to 16 dimensions, while Fréchet Inception Distance (FID) [22] keeps improving up to 64 dimensions. Both FID and IS degrade when the discriminator's output dimensions reach 256.
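The two kernelized losses above can be sketched in a few lines of PyTorch. This is a hedged sketch, not the reference implementation: the helper names are ours, and a single-bandwidth RBF kernel stands in for the mixed kernel used in practice.

```python
import torch

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2));
    # maps two (B, N) batches of discriminator outputs to a (B, B) matrix.
    d2 = torch.cdist(a, b) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Biased estimate of squared MMD between discriminator outputs for
    # real samples x and generated samples y, both shaped (B, N).
    return (rbf_kernel(x, x, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean()
            + rbf_kernel(y, y, sigma).mean())

def repulsive_d_loss(x, y, sigma=1.0):
    # Repulsive variant of the discriminator loss: it keeps only the
    # within-distribution kernel terms, with opposite signs.
    return rbf_kernel(x, x, sigma).mean() - rbf_kernel(y, y, sigma).mean()
```

The discriminator step then minimizes `-mmd2(x, y)` (Eq. 1) or `repulsive_d_loss(x, y)` (Eq. 2) over mini-batches of outputs.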

B. BipGAN
BipGAN [12] creates a novel way to improve the generator with two-dimensional outputs in the discriminator. BipGAN's discriminator guides one output in one learning direction, as in traditional GANs, and guides the other in the opposite direction. The authors name the two outputs and losses authenticity and falsity. Hence one output evaluates the probability that the input belongs to the real image distribution, and the other, vice versa, the probability that it belongs to the fake image distribution. Leading an output distribution in one direction can ensure the discriminative confidence for either the real distribution or the generated distribution, yet it is hard to guide both distributions simultaneously. Hence, one natural idea is to separate the discriminative task into two discriminator outputs: one for the real distribution (authenticity loss L_A) and the other for the generated distribution (falsity loss L_F). These two outputs offer two types of discrimination and work together to promote the posterior confidence of the whole discriminator:

L_A = −E_{x∼P_data}[log D_A(x)] − E_{z∼P_z}[log(1 − D_A(G(z)))],  (3)

L_F = −E_{x∼P_data}[log(1 − D_F(x))] − E_{z∼P_z}[log D_F(G(z))],  (4)

where D_A refers to the discriminator head with the authenticity loss output and D_F to the head with the falsity loss output. By optimizing the sum of L_A and L_F as the discriminator loss, the discriminator receives information from the different outputs, thus enhancing adversarial learning.
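The two losses amount to training one head with the usual real/fake targets and the other with flipped targets. A minimal sketch in PyTorch (the function name is ours, and the heads are assumed to emit raw logits):

```python
import torch
import torch.nn.functional as F

def bipolar_d_losses(d_a_real, d_a_fake, d_f_real, d_f_fake):
    # Inputs are raw logits from the two discriminator heads.
    ones = torch.ones_like(d_a_real)
    zeros = torch.zeros_like(d_a_real)
    # Authenticity head D_A: standard targets (real -> 1, fake -> 0).
    loss_a = (F.binary_cross_entropy_with_logits(d_a_real, ones)
              + F.binary_cross_entropy_with_logits(d_a_fake, zeros))
    # Falsity head D_F: flipped targets (real -> 0, fake -> 1).
    loss_f = (F.binary_cross_entropy_with_logits(d_f_real, zeros)
              + F.binary_cross_entropy_with_logits(d_f_fake, ones))
    # The discriminator minimizes loss_a + loss_f.
    return loss_a, loss_f
```

Both heads can share every layer except the final linear projection, which is what makes them one physical discriminator.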

C. NON-SATURATING LOSS AND LABEL FLIPPING
Here we give an extended analysis of the authenticity loss from D_A and the falsity loss from D_F. As explained in [1] and [23], the two types of generator losses correspond to two kinds of adversarial games: the mini-max game and the heuristic non-saturating game:

J^(G) = E_{z∼P_z}[log(1 − D(G(z)))]  (saturating)  or  J^(G) = −E_{z∼P_z}[log D(G(z))]  (non-saturating).  (5)

In Eq. 5, the saturating generator loss corresponds to the mini-max game, and the non-saturating generator loss corresponds to the heuristic non-saturating game [23].
One past work [21], the author's talks, and a related practical project [24] mentioned that flipping the target labels (setting real=0 and fake=1 in the discriminator loss and fake=0 in the generator loss) can obtain empirically better results than the non-flipped version of the non-saturating game.
Here, we explain that the saturating generator loss can work as a non-saturating loss under flipped targets. As shown in Fig. 1, the authenticity loss in BipGAN [12] is exactly the same as in this non-saturating game, and this state (before label flipping) uses D_A (the authenticity discriminator). First, we show the two curves for J^(G) = −log(D(G(z))) (green curve) and J^(G) = log(1 − D(G(z))) (blue curve), as shown in [23]. The author in [23] explained why the blue curve is saturating: since the discriminator can easily recognize real and fake images in the initial training steps, D_A(G(z)) is almost 0 at the beginning. Hence, the gradient of log(1 − D_A(G(z))) is almost 0 and can hardly supply effective gradients for learning (blue arrow). As one solution, using J^(G) = −log(D_A(G(z))) improves the learning efficiency in the initial steps (green arrow). Then, if we flip the training target (which is the same as the falsity output's target in [12]), D_F(G(z)) is almost 1 at the beginning. Thus, we can ascend gradients from J^(G) = log(1 − D_F(G(z))) to avoid saturation (red arrow), whereas this term is saturated in the non-flipped version. As shown by the different arrows, the initial value of D(G(z)) determines the criterion value and thus affects the gradients and training effectiveness of the back-propagation steps.
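The saturation argument can be checked numerically. The hedged sketch below (the concrete logit value is illustrative) differentiates both generator criteria with respect to the discriminator's pre-sigmoid logit when D(G(z)) ≈ 0.01, the regime described above:

```python
import torch

# Early in training the discriminator spots fakes easily, so the logit s
# for a fake sample is very negative and D(G(z)) = sigmoid(s) is near 0.
s = torch.tensor(-4.6, requires_grad=True)  # sigmoid(-4.6) ~ 0.01

# Saturating (mini-max) criterion: J = log(1 - D(G(z))).
# dJ/ds = -sigmoid(s), which vanishes as D(G(z)) -> 0.
sat = torch.log(1 - torch.sigmoid(s))
g_sat, = torch.autograd.grad(sat, s)

# Non-saturating criterion: J = -log(D(G(z))).
# dJ/ds = sigmoid(s) - 1, which stays near -1 in the same regime.
non_sat = -torch.log(torch.sigmoid(s))
g_ns, = torch.autograd.grad(non_sat, s)

print(g_sat.item(), g_ns.item())  # roughly -0.01 vs -0.99
```

The roughly 100× gap in gradient magnitude is exactly why the blue curve stalls while the green and red arrows supply useful learning signal.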
Besides, the label flipping trick can be unified into a single value function (the negative version of V(G, D) in [1]) under the Jensen-Shannon divergence [1], [23]. Therefore, the label flipping trick inherits the theoretical conclusions from the mini-max game in [1], yet it has not been proved for the heuristic non-saturating game, as introduced in [8] and [25].
All related literature points to the fact that flipping the discriminator's targets is a useful way to enhance GAN training, and how to fully exploit this mechanism remains to be studied.

D. VECTOR-BASED DISCRIMINATOR
Here we define a discriminator with multi-dimensional outputs (using a linear layer to map the last feature map to a multi-dimensional vector) as a vector-based discriminator. Some past works [14], [15] and our preliminary experiments in Tab. 1 validated the empirical advantages (before reaching peak performance) of using the vector-based discriminator, yet did not determine the reason. Reference [26] proved that a vector-based discriminator can acquire richer information under the Wasserstein distance. In this work, we adapt the corollary from [26] to prove a similar conclusion under the f-divergence [4].
Corollary 1: Define the f-divergence between two distributions P and Q as Div^(m)(P‖Q) based on a discriminator with m-dimensional outputs. If two positive integers satisfy n < n′, then for any distributions P, Q ∈ P(R), we have Div^(n)(P‖Q) ≤ Div^(n′)(P‖Q).
Proof: Div(P‖Q) can be rewritten as the problem of maximizing the discrepancy between two expectations based on the Fenchel conjugate [27]:

Div^(m)(P‖Q) = sup_{T^(m) ∈ 𝒯^(m)} ( E_{x∼P}[T^(m)(x)] − E_{x∼Q}[f*(T^(m)(x))] ),  (6)

where f is the convex function used to define the f-divergence as in [4], f* is its conjugate function as defined in [27] and [4], and 𝒯^(m) is an arbitrary class of functions T^(m). Hence a larger m offers more candidate functions in 𝒯^(m) when searching for the optimal T^(m) that achieves the maximal discrepancy, which completes the proof.

E. PATCH-BASED DISCRIMINATOR
The patch-based discriminator is a vital part of pix2pix [10]. After remarkable improvements in training stability and network architectures, GANs can generate vivid, high-quality images. At the same time, high-resolution images bring extra instability and learning difficulty into GAN training. Some works address this with progressive training policies or ever-larger networks, while one effective approach is the patch-based discriminator. Dividing a large image into several small patches during GAN training can significantly increase each patch's generation quality and diversity.
Based on past works' conclusions, the patch discriminator can hardly obtain performance improvements by increasing output dimensions, because its number of output dimensions is already higher than others'. Either quadratically increasing the number of patches or duplicating a certain number of patches along a third dimension can easily cause the conflict problem. Therefore, one feasible way to acquire the advantages of both multi-dimensional discriminator outputs and the patch discriminator is to apply our stochastic discriminator to the patch discriminator.

III. STOCHASTIC OUTPUTS IN THE DISCRIMINATOR

A. MOTIVATION: CONFLICT PROBLEM WHEN INCREASING OUTPUT DIMENSIONS
As validated in repulsive MMD-GAN [9], increasing the discriminator's output dimensions brings significant performance improvements at first but causes damage once the dimensions become too high. After passing the empirically best dimensionality, the conflict problem appears as the output dimensions keep increasing. This problem occurs with the MMD metric as well as with GANs using the non-saturating loss. Hence, this problem restricts the model performance, and how to break through it remains open.

B. METHOD: RANDOMLY SELECTING OUTPUTS FOR BACK-PROPAGATION
To overcome the conflict problem brought by higher-dimensional discriminator outputs, we propose to modify BipGAN's bipolar discriminator into a randomly selected discriminator. In BipGAN, the authors combined the non-saturating loss and the flipped non-saturating loss to obtain better posterior confidence in the discriminator; such a discriminator gains advantages from both types of losses. Naturally, we can extend the discriminator's two outputs to higher dimensions: one half for the authenticity loss and the other half for the falsity loss. If we then randomly assign some outputs to the authenticity loss and the rest to the falsity loss, the practical 2N dimensions become N. Based on the conclusions from repulsive MMD-GAN and our preliminary experiments, this reduction helps the discriminator overcome the conflict problem. We further generalize our method to patch-based discriminators, where duplicating all patch outputs or quadratically increasing the number of patches can easily trigger the conflict problem. In a word, our random selection method can help discriminators with multiple outputs eliminate this restriction.
In Fig. 2, the discriminator outputs are shaped as (2N²): the patch-based outputs in red are for the authenticity loss, and those in blue are for the falsity loss. The significant difference between this work and BipGAN is that BipGAN uses all red and blue patch-based outputs for back-propagation, while ours randomly selects part of the red authenticity outputs; the remaining parts (in light red) are neglected, and the corresponding falsity outputs (in dark blue) are enabled instead. Finally, the patch-based outputs from the two losses (dark red and dark blue) are concatenated together as the stochastic output loss, with shape (N²), for the backward step. Since outputs are selected randomly for the backward step, the mixing method and the ratio between authenticity and falsity outputs are significant, and we explore these issues in our experimental ablations.
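The selection step above can be sketched in a few lines. This is a minimal, hedged sketch assuming the per-patch loss maps for both heads are already computed; the function name and shapes are ours:

```python
import torch

def stochastic_patch_loss(auth_loss_map, fals_loss_map, p=0.5):
    # auth_loss_map, fals_loss_map: per-patch loss maps of shape (B, N, N),
    # one from the authenticity head and one from the falsity head.
    # For each patch, a Bernoulli(p) draw keeps the authenticity loss;
    # otherwise the falsity loss is used, so only one value per output
    # pair reaches the backward step (2N^2 outputs reduce to N^2 losses).
    k = torch.bernoulli(torch.full_like(auth_loss_map, p))
    return (k * auth_loss_map + (1 - k) * fals_loss_map).mean()
```

With `p = 1.0` this degenerates to a purely authenticity-trained discriminator and with `p = 0.0` to a purely falsity-trained one; the ablations in Section IV explore intermediate ratios.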

C. ADVERSARIAL LOSS FUNCTION
Our method extends BipGAN's generator loss function [12] to n-dimensional cases:

L_G = −E_{z∼P_z}[ (1/n) Σ_{i=1}^{n} ( log D_A^(i)(G(z)) + log(1 − D_F^(i)(G(z))) ) ].  (7)

Our discriminator loss function is based on Eq. 3 and Eq. 4, with a random vector mask k to control the back-propagation, extending them to n dimensions:

L_D = (1/n) Σ_{i=1}^{n} ( k_i L_A^(i) + (1 − k_i) L_F^(i) ),  k_i ∼ B(1, p),  (8)

where L_A^(i) and L_F^(i) apply Eq. 3 and Eq. 4 to the i-th output pair (D_A^(i), D_F^(i)). In Algorithm 1, we show how our method works in a patch-based discriminator.

FIGURE 2. A general view illustrating our stochastic discriminator. Each input image from the input mini-batch is separated into small patches corresponding to the matrix-based outputs with randomly selected labels. Details are in Section III-B.

Algorithm 1 Our Method via the Randomly Selecting Training Strategy in the Discriminator With Multi-Dimensional Outputs
Require: discriminator output dimensions 2N, learning rates (α_g, α_d), batch size B, discriminator training iterations m per generator step, training data distribution P_data, random selection probability parameter p.
Initialize G's parameters θ and D's parameters φ (D_sync shares the same parameters as D);
while θ has not converged do
    for j = 1, . . . , m do
        Sample real samples
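A hedged sketch of one discriminator iteration in the spirit of Algorithm 1 follows. The names are illustrative; `D` is assumed to return a pair of per-sample logit vectors `(d_auth, d_fals)`, each of shape (B, N), from its two heads:

```python
import torch

def discriminator_step(D, G, real, opt_d, p=0.5, z_dim=128):
    # One discriminator update with the randomly-selecting strategy.
    z = torch.randn(real.size(0), z_dim)
    fake = G(z).detach()  # generator is frozen during the D step
    bce = torch.nn.functional.binary_cross_entropy_with_logits

    for x, real_target in ((real, 1.0), (fake, 0.0)):
        d_auth, d_fals = D(x)
        # Authenticity head keeps the usual targets; falsity head flips them.
        t_a = torch.full_like(d_auth, real_target)
        t_f = 1.0 - t_a
        l_a = bce(d_auth, t_a, reduction='none')
        l_f = bce(d_fals, t_f, reduction='none')
        # Per-output Bernoulli mask k ~ B(1, p) selects which loss
        # back-propagates for each of the N output pairs.
        k = torch.bernoulli(torch.full_like(l_a, p))
        loss = (k * l_a + (1 - k) * l_f).mean()
        loss.backward()
    opt_d.step()
    opt_d.zero_grad()
```

Note that both heads can share all feature-extraction layers, so the parameter count matches a conventional discriminator up to the last linear layer.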

IV. EXPERIMENTS

A. SETTINGS
We set the original non-saturating loss in vanilla GAN [1], the flipped non-saturating loss in [21], and the combination of both losses in BipGAN [12] as baseline methods for comparison with our work. Besides, since we use a multiple-output discriminator like repulsive MMD-GAN, we also set it as another baseline (mixed RBF kernel with bandwidths {1, √2, 2, 2√2, 4}). First, we validate the empirically best number of dimensions with a set of ablation studies. Next, we subdivide the stochastic methods into 'Hard-mix' and 'Soft-mix' and test their performance. Afterward, we visualize the trend charts of these ablations. Finally, we test our methods against three past works on several real-world datasets.

1) DATASETS
We tested and compared our methods with past works on four widely used real-world datasets.
Tiny-ImageNet [29] (100k training images, 200 classes, 64 × 64 resolution) is a replacement for the standard 1k-class ImageNet dataset. Training on standard ImageNet is expensive in time and electricity, normally costing more than two weeks for a single model. Therefore, we used Tiny-ImageNet to test and compare our methods with past works.
CelebA [30] (203k training images) is a famous human face dataset; after alignment and cropping, each image has a 64 × 64 resolution. Compared with the CIFAR and Tiny-ImageNet datasets, CelebA contains only human face images and no other categories.

2) NETWORK ARCHITECTURE
We organized all experiments on one common and powerful architecture, namely BigGAN [31], for the experimental comparisons. BigGAN is concise in its losses and easy to train compared with other state-of-the-art models. To avoid uncertainty from other factors (such as penalties or other loss metrics), we chose BigGAN and edited only its discriminator output layer. We removed the final pooling layer in the patch-based experiments to create receptive fields for the input patches. We used the default hyper-parameters from the BigGAN experiments, and all models were trained with the Adam optimizer [32]. The exponential moving average (EMA) was started after the first 1k training iterations. We trained each experiment for 10k iterations and report its best FID and the corresponding IS. The batch size was 64 for the Tiny-ImageNet and CelebA experiments and 256 for the CIFAR experiments. The dimensionality of the input noise was 128 for all experiments. We used PyTorch [33] version 1.10.0 for all experiments. We also tested unconditional generation tasks with StyleGAN2 [34], [35], a famous architecture in the StyleGAN series [36], which can reflect the performance of each loss metric.

3) EVALUATION METRICS
We used Fréchet Inception Distance [22] (FID, lower is better) and Inception Score [21] (IS, higher is better) as the evaluation metrics. We calculated and recorded the IS and FID every 1k training steps, using 50k random samples. IS compares the feature-level similarity between sampled images and the ImageNet dataset via a pretrained ImageNet classifier [37]; therefore, for non-ImageNet-like datasets, a higher IS mainly reflects a higher similarity to ImageNet. FID usually reflects the generation quality better since it uses both generated and training samples [21], [38].

B. PRELIMINARY EXPERIMENT: DIMENSIONS
As validated in repulsive MMD-GAN [9], the best generation quality was obtained at 16 dimensions for IS and 64 dimensions for FID, while both FID and IS worsened once the output dimensions reached 256. In our experiments, we use the powerful BigGAN model instead, so the training situation may change. Therefore, we run several ablation studies on the CIFAR-10 dataset to determine the empirically best output dimensions for our setting.
We test two cases corresponding to two types of outputs. The trivial vector-based outputs are the duplicated version of a traditional discriminator with a single output: we duplicate the output from one dimension to {2, 8, 16, 64, 256} dimensions. From Tab. 1, we find that the generation quality increases up to eight dimensions and runs into the conflict problem after 16 dimensions in terms of FID and IS. To show the trend, we plot the FID scores against the output dimensions in Fig. 3(a). The other case is the patch-based outputs in the discriminator. In our ablations, we separate the input image into 8 × 8 patches, so the de facto number of output dimensions is 64. We also duplicated these 64 patches four times to obtain 256 dimensions; however, this further duplication does not improve the generation quality in FID or IS, as shown in Tab. 1.

C. PRELIMINARY EXPERIMENT: MIXING RATIO
Our stochastic methods mix authenticity outputs and falsity outputs to acquire both advantages without doubling the number of last-layer parameters involved in back-propagation. Hence, how to mix these two types of outputs is worth discussing. We control the ratio between authenticity outputs and falsity outputs in steps of ten percent, namely 'Hard-mix'. For instance, a 10%/90% ratio between authenticity and falsity outputs means that about 10% of the authenticity outputs are selected under a Binomial distribution. Besides, to determine whether reducing half of the parameters in the backward step is what avoids the conflict problem, we also set up a comparison group, namely 'Soft-mix'. The 'Soft-mix' experiments apply a Uniform distribution U(0, p′) to each specific output, similar to k ∼ B(1, p) in Eq. 8. Here, p′ controls the maximum ratio of authenticity outputs as in Eq. 9.
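The two mixing schemes differ only in how the per-output authenticity weight is drawn. A hedged sketch (the function names are ours):

```python
import torch

def hard_mix_mask(shape, p):
    # Hard-mix: each output is assigned entirely to one head.
    # k_i ~ Bernoulli(p) selects the authenticity output; the falsity
    # output is used wherever k_i = 0.
    return torch.bernoulli(torch.full(shape, p))

def soft_mix_mask(shape, p_max):
    # Soft-mix: each output gets a continuous authenticity weight drawn
    # from U(0, p_max), so both heads back-propagate partial gradients
    # for every output.
    return torch.rand(shape) * p_max
```

Under hard-mix, exactly one head receives gradient per output, halving the effective dimensionality; under soft-mix, all 2N outputs stay active, which is the comparison the ablation is designed to isolate.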
In this case, both authenticity and falsity outputs back-propagate gradients under the specific weights k′. As shown in Tab. 2, the generation quality in the 'Soft-mix' group is generally poorer than in the 'Hard-mix' group.
These results indicate that keeping more active outputs causes the conflict problem and makes it hard to reach ideal performance. Besides, we plot the FID trend curves against the ratio in Fig. 3(b, c). Both the hard and soft versions show a similar phenomenon: once the ratio of either authenticity or falsity outputs reaches 20%, the generation quality improves. Thus, both Fig. 3(b) and (c) show a W-shaped trend. Based on these ablations, we test three combinations in our main experiments: {20%/80%, 50%/50%, 80%/20%}.

D. RESULTS ANALYSIS
As shown in Tabs. 3 and 4, we compare our methods to three past methods: vanilla GAN with the non-saturating loss, the flipped non-saturating loss, and BipGAN. We use four different settings for our methods. As shown in Tab. 2, soft-mix methods cannot match hard-mix methods; thus, we show the ''50%/50% soft-mix'' setting as the lower bound of our method, and the ''50%/50% hard-mix'' setting as another comparison. Due to the outstanding performance of the ''20%/80%'' setting and its reverse in hard-mix mode, we use them to demonstrate the upper-bound performance of our methods. Besides, we repeat all settings for both vector-based and patch-based discriminator outputs. The vector-based setting is the unidirectional expansion of the traditional single-output discriminator in the output layer, where the discriminator can use multiple dimensions to represent the real or fake image distributions. The patch-based setting separates input images into small patches.
In the CIFAR-10 and CIFAR-100 experiments: 1) our methods achieve superior performance to past works under the 20% and 80% combinations on multi-category datasets, in both Inception Score and FID evaluations; 2) changing the output ratio between authenticity and falsity outputs obviously affects the generation quality; 3) generally, the ''50%/50%'' combination performs worse in both the soft and hard mixing methods.
In the CelebA experiments, our methods perform differently in the vector-based and patch-based settings.
In the vector-based experiments, the discriminator's output dimensions are lower than in the patch-based ones; thus, patch-based discriminators surpass the generators more easily. Our ''20%/80%'' method demonstrates its advantage against the LSGAN and WassersteinGAN losses in such cases. In Tab. 2, neither the non-saturating nor the flipped non-saturating loss can defeat our hard-mixing methods at some ratios on CIFAR-10, although the flipped non-saturating loss performs better in FID. All in all, other losses obtain advantages on specific datasets or settings, while ours is generally the better choice in most cases.
Based on the analysis of these real-world datasets, we found that our method supplies the possibility of changing the ratio parameter to achieve superior model performance, and more complex datasets (with more categories) obtain a more significant improvement. Therefore, we tested a more complex real-world dataset to validate this hypothesis.
In the Tiny-ImageNet experiments, the number of categories increases to 200. With the increased learning difficulty, our methods demonstrated their advantages over past methods. In Tab. 5, the mixing ratio showed phenomena similar to the other experiments: the ''50%/50%'' case gave even worse results than past methods, while 20% authenticity loss with 80% falsity loss obtained the best performance in both Inception Score and FID. Furthermore, the repulsive MMD method performed quite unstably across different datasets, being generally better in complex cases. We conjecture that MMD-based losses are sensitive to the kernel parameters, while we used the default settings.
Finally, we questioned whether the effect of our method depends on the different performances in conditional (CIFAR, Tiny-ImageNet) and unimodal (CelebA) training tasks. Therefore, we tested another network architecture on an unconditional training task, namely StyleGAN2 [34], [35]. In Tab. 6, the unconditional CIFAR-10 generation results are similar to those of the vector-based BigGAN experiments: the repulsive MMD and BipGAN losses are outstanding in IS, while our ''80%/20%'' hard-mixing method always gains an advantage. Thus, we conclude that the different loss metrics are more sensitive to the output dimensionality and data resolution than to the number of classes or conditional labels.
Besides, we show some samples from our patch-based experiments in Fig. 4 and Fig. 5. All methods avoid mode collapse when generating images. Generally, our methods generate more details in some cases or samples and thus achieve higher evaluation scores.

V. RELATED WORKS

A. D2GAN
D2GAN [39] proposed a two-discriminator method to stabilize GAN training. The significant differences between our work and D2GAN include two aspects. First, D2GAN has two discriminators, while ours has only one physical discriminator: in BipGAN and this work, although different output dimensions indicate authenticity or falsity, the feature-extracting function is shared across the other layers. In contrast, D2GAN needs two discriminators, together with its dual loss functions, to keep training stable. Second, our work is based on a stochastic selection policy, while D2GAN is similar to our soft-mix experiments with a fixed ratio. Our experiments showed that hard-mix methods obtained better results because they halve the effective dimensions.

TABLE 3. Vector-based discriminator experimental results. Here we show the best FID (lower is better) and the corresponding IS (higher is better) during the training processes. As IS is meaningless for the human face dataset, we only list FID for the CelebA experiments.

B. DROPOUT
Dropout [40] is a well-known technique for improving neural networks: randomly masking some fully connected neurons in each training iteration can alleviate several issues. In our work, we randomly select the discriminator outputs, which is similar to applying Dropout to the last layer, with two significant differences. First, Dropout simply neglects the masked units, while ours replaces them with the other loss. Second, many past works use Dropout in all layers, so gradient vanishing can occur at different layers; in contrast, we only change the last layer, because the two losses share the feature extractor as one physical discriminator, reducing the computational cost.

C. MIXUP
Mixup [41] is a useful way to augment data by linearly mixing input images and output labels. In our 'soft-mix' experiments, we only mix the discriminator outputs; these outputs come from the posterior distribution, which does not correspond to linear interpolation in the original space.

VI. CONCLUSION AND DISCUSSION
This work presented an improved way to fully exploit the advantages of the flipped and non-flipped losses by supplying two types of output distributions, which offer richer information without doubling the output dimensions during back-propagation. The flipped and non-flipped losses have equivalent topological strength [42] and thus can work together. Other types of loss metrics may not be as convenient as the binary cross-entropy loss; for instance, reversing the MMD loss may cause the attraction problem described in [9]. If more suitable loss metrics are found, the discriminator's outputs could represent even more information within limited output dimensions.
In the future, we would like to investigate a non-parametric way to determine the optimal ratio between the different loss metrics. Besides, developing the f-divergence [4] or IPM [5], [6], [7], [8], [9] families may offer more candidate metrics for our stochastic discriminator.