Pseudo Conditional Regularization for Inverse Mapping of GANs

Inverse mapping of Generative Adversarial Networks (GANs), which projects data back to the latent space, has recently been introduced, and it has been shown that inverse mapping models trained by bidirectional adversarial learning enable novel and practical operations, including interpolation between real data. However, existing techniques still do not ensure a consistent mapping between data and their latent representations, so the models hardly converge throughout the training steps. Our discussion begins with empirical investigations of the inconsistency issue in prior techniques, and we further propose a novel adversarial learning method, Pseudo Conditional Bidirectional GAN (PC-BiGAN), for training the inverse mapping of GANs with a high degree of consistency and similarity-awareness. Our models are specifically guided by pseudo conditions defined by the proximity relationships among data in an unsupervised learned feature space. We demonstrate that our novel bidirectional adversarial learning framework improves performance in sample reconstruction, generation, and interpolation.


I. INTRODUCTION
Generative Adversarial Networks (GANs) [1] have been recognized as one of the most powerful frameworks for learning data generating distributions. In the GAN framework, a generative model, the generator, is trained to imitate real data by learning a mapping from a latent distribution to the data space. Meanwhile, an adversarial model, the discriminator, is concurrently trained to distinguish between generated and real samples. Training the two adversarial models is formulated as a two-player minimax game, which leads the generator to closely approximate the real data distribution once the discriminator can hardly distinguish between true and fake samples.
The ability of GANs has been sufficiently validated in both theoretical and practical aspects. For example, GANs trained on image datasets have produced magnificent achievements [2]-[6]. In particular, interpolation and arithmetic among latent variables [7]-[9] produce smooth semantic variations and attribute changes via the generator, respectively. Such impressive results confirm that the learned latent space reflects the semantic information of the data [10]-[13]. (The associate editor coordinating the review of this manuscript and approving it for publication was Yan-Jun Liu.)
Recently, an interesting question has arisen and been investigated: can an inverse mapping of GANs that projects data back to the latent space be learned and utilized? Bidirectional GANs (BiGAN) [14] and Adversarially Learned Inference (ALI) [15] have been proposed to learn an additional model, the encoder, that maps samples from the real data distribution to the latent space. The discriminator is then trained to distinguish between two kinds of joint samples: 1) a real datum and its latent representation predicted by the encoder, and 2) a fake datum and the latent variable fed to the generator. The goal of both the generator and the encoder is to fool the discriminator. These bidirectional approaches are well formulated to match the two joint distributions of the generator and encoder.
Nonetheless, Li et al. [16] demonstrated a scenario where ALI and BiGAN produce inconsistent inverse mappings across training steps, and presented a solution, ALICE (ALI with Conditional Entropy), based on the conditional entropy expressed by cycle-consistency [17]-[19]. ALICE relieved the non-identifiability problem of ALI and improved sample reconstruction performance; however, it lacks consideration of similarity-preserving mapping. For instance, if two samples are considered similar (or close) in the data space, then their corresponding latent vectors should also be located in close proximity in the latent space. However, this similarity-based criterion is not explicitly considered in BiGAN, ALI, or ALICE.
In this paper, we propose a novel bidirectional learning method, Pseudo Conditional BiGAN (PC-BiGAN), that aims to produce latent representations preserving the similarity among data by guiding the encoder to form a similarity-aware latent space. We first compute initial features of the data using an unsupervised model. Although our method places no restriction on the choice of feature extraction model, in this paper we consistently utilize the trained encoder of a BiGAN to clearly validate the benefits of our approach over BiGAN. Once we have the initial representations of the data, we perform clustering to define pseudo classes and assign a pseudo label to each datum. These pseudo labels coarsely guide the encoder to map samples within the same pseudo class to latent vectors in close proximity. As a learning strategy, we simultaneously train two models, PC-BiGAN and its inference model PC-BiGAN-Inf, where the generator and encoder of PC-BiGAN are explicitly regularized by the pseudo labels whereas those of PC-BiGAN-Inf are not; instead, the discriminator of PC-BiGAN-Inf is frequently synchronized with that of PC-BiGAN. Since the generator and encoder are trained under the judgment of the discriminator, this synchronization strategy lets the generator and encoder of PC-BiGAN-Inf learn joint distributions similar to those of PC-BiGAN without explicit guidance by pseudo conditions.
We extensively evaluated our proposed method, especially PC-BiGAN-Inf, on sample reconstruction, generation, and interpolation tasks. The experimental results validate that our method achieves higher performance than prior techniques specialized for the inverse mapping of GANs. This paper is organized as follows. In Sec. II, we first summarize the literature related to inverse mapping of GANs and conditional GANs. In Sec. III, we explain details of bidirectional GANs and discuss their inconsistent mapping across training steps. In Sec. IV, we then present our method for alleviating the aforementioned issue: similarity-preserving mapping in bidirectional GAN frameworks. Experimental comparisons against prior bidirectional GAN frameworks are reported in Sec. V. We finally conclude the paper in Sec. VI.

II. RELATED WORK

A. INVERSE MAPPING OF GANs
Inverse mapping from the data space to the latent space of GANs has been explored in recent years [14]-[16], [20]-[23]. Notable examples are BiGAN [14] and ALI [15], which add an additional parametric model to the GAN framework that infers latent representations of data samples. Li et al. [16] proposed to regularize ALI by cycle-consistency [17] to improve sample reconstruction performance. Our proposed method also aims to regularize bidirectional GANs; however, we guide models to be aware of a global structure of the latent space induced by data similarity. The merits of our global guidance scheme over the sample-by-sample approach [16] are clearly validated in Sec. V.
An alternative approach is to compute latent representations based on back-propagation [20]-[23]. These methods first generate a fake sample from a random latent code and then update the latent code by back-propagation according to predefined loss functions. Creswell and Bharath [20] utilized a pixel-level reconstruction loss for the back-propagation, while Yeh et al. [22] used both a weighted pixel-wise difference and adversarial losses. Although these methods have the strength that they can be applied to any pretrained GAN, the predicted latent representations are sensitive to low-level features since their objectives mostly concentrate on pixel-wise reconstruction. Furthermore, they require substantial computation time because iterative back-propagation updates are unavoidable to produce a latent code for even a single datum.

B. CONDITIONAL GANs
Conditional GANs (cGANs) [4], [24]-[26] have shown successful results in class-specific data generation. Mirza and Osindero [26] first proposed the cGANs framework and validated that the learned generator can produce samples according to given conditional labels. The cGANs framework has been further applied to various applications [27]-[31] such as image generation from text [32]-[34] and image-to-image translation [35]. Note that all the aforementioned methods are based on supervised labels or supervised feature representations. At a high level, our proposed method can be seen as a conditional BiGAN framework since we regularize BiGAN by conditional labels. However, the main difference between our technique and ordinary cGANs is that we define and utilize pseudo labels obtained in a totally unsupervised manner based on data similarity.

III. BACKGROUND AND MOTIVATION

A. BACKGROUND
Let us define the notation that we will use throughout this paper. We denote the distribution of data x ∈ X as P_X(x). The goal of a GAN [1] is to approximate P_X(x). The GAN framework consists of two parametric models: a generator G and a discriminator D. The generator maps a latent distribution P_Z(z) for z ∈ Z into P_X(x). The discriminator distinguishes whether a particular x is drawn from P_X(x) or produced by the generator.
The BiGAN [14] has an additional parametric model, an encoder E, for the inverse mapping from data x to the latent representation E(x). The discriminator D then predicts whether a given joint sample is drawn from the real joint distribution P_XE(x, z) or the fake joint distribution P_GZ(x, z). The objective of BiGAN is described as follows:

min_{G,E} max_D V(D, E, G),   (1)

where

V(D, E, G) = E_{x∼P_X}[log D(x, E(x))] + E_{z∼P_Z}[log(1 − D(G(z), z))].   (2)

It is theoretically verified that the perfect inverse relationship between the generator and the encoder (i.e., G^{-1} = E) can be reached with the optimal discriminator D*. In practice, however, BiGAN hardly provides the sample reconstruction G(E(x)) = x.
Note that Adversarially Learned Inference (ALI) [15] is an identical model to BiGAN. In this paper we explain our models and experimental results on top of BiGAN for the sake of succinctness.

B. MOTIVATION
BiGAN hardly converges to a perfect inverse mapping mainly because the latent representation E(x) changes significantly during training. Since these latent representations E(x) are fed to the discriminator D (Eq. 2), such inconsistency disturbs the training of D, which must distinguish between real and fake samples. This in turn makes learning the encoder E difficult, because the encoder is trained based on the judgment of D. As a result, the inconsistent mapping degrades the inverse mapping accuracy of BiGAN.
We experimentally validate the inconsistent mappings during training of BiGAN. In the experiment, we investigate the latent space by measuring the change of neighboring codes. Given a set of real data samples X = {x_1, . . . , x_n}, we compute their latent representations L_t = {E_t(x_1), . . . , E_t(x_n)} predicted by the trained encoder E_t(·) at the t-th training epoch. For each E_t(x_i), we then collect its k-nearest neighbors among the latent codes by the following measure:

N(i, k, t) = { j ≠ i | R(E_t(x_i), E_t(x_j), L_t, d(·,·)) ≤ k },   (3)

where R(y_i, y_j, Y, d(·,·)) is a function that computes the ranking of y_j within a set Y according to the distance measure d(·,·) in non-decreasing order. In this experiment, the L2 distance ||y_i − y_j|| is used. Thus, N(i, k, t) is the set of k-nearest neighbor indices of E_t(x_i) among all the latent codes predicted by the encoder at the t-th training epoch. We specifically measure how many local neighboring latent codes are preserved across two consecutive training epochs t − 1 and t, to investigate the consistency of the learned latent representations during training:

I(t, k) = (1/n) Σ_{i=1}^{n} |N(i, k, t) ∩ N(i, k, t − 1)| / k,   (4)

where |·| is the cardinality operator (i.e., the number of elements in a set). For instance, I(3, 10) = 0.6 indicates that, on average, 60% of the 10 neighboring samples in the learned latent space at the second training epoch still remain 10-nearest at the third epoch. One might expect I(t, k) to start with a low value while the global structure of the latent space is being set up, and to increase as the models converge. However, the experimental results show that neighborhoods in the learned latent space are hardly preserved across training epochs (Fig. 1). Specifically, the intersection rates of BiGAN, ALI, and ALICE fluctuated and stayed below 0.4. These results imply that the latent distribution induced by the encoder E(x) changes drastically and hardly converges, even for local neighborhoods.
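The intersection-rate measure can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the authors' code; `knn_indices` and `intersection_rate` are hypothetical helper names, and each latent code is represented as a plain list of floats.

```python
def knn_indices(codes, i, k):
    """Indices of the k nearest codes to codes[i] under L2 distance (excluding i itself)."""
    dists = []
    for j, c in enumerate(codes):
        if j == i:
            continue
        d = sum((a - b) ** 2 for a, b in zip(codes[i], c)) ** 0.5
        dists.append((d, j))
    dists.sort()
    return {j for _, j in dists[:k]}

def intersection_rate(codes_prev, codes_cur, k):
    """I(t, k): mean fraction of k-nearest neighbors preserved between two epochs,
    given the latent codes of the same samples at epochs t-1 and t."""
    n = len(codes_cur)
    total = 0.0
    for i in range(n):
        prev_nn = knn_indices(codes_prev, i, k)
        cur_nn = knn_indices(codes_cur, i, k)
        total += len(prev_nn & cur_nn) / k
    return total / n
```

If the encoder were perfectly consistent between two epochs, the two code sets would be identical and the rate would be 1.0; the paper reports rates below 0.4 for BiGAN, ALI, and ALICE.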
Note that we conducted the experiment on the CIFAR-10 benchmark.
In this paper, we specifically aim to alleviate the aforementioned inconsistency problem of prior techniques for the inverse mapping of GANs. As shown in Fig. 1, our proposed method predicts more consistent latent representations during adversarial learning, and we further validate experimentally in Sec. V that this leads to improved inverse mapping performance over the baseline methods.

IV. OUR APPROACHES
We propose to regularize our encoder to locate latent vectors of similar data in close proximity in the learned latent space. In order to train an encoder under this regularization, we propose a joint regularization scheme for both the generator and the encoder. Note that all of our learning strategies utilize fully unsupervised pseudo labels to guide the global structure of the latent space.
Specifically, the similarity among data is defined by initial feature representations of the data. Even though our method places no restriction on the initial feature representations, we describe it using the latent vectors predicted by the encoder of an ordinary BiGAN.

[Figure caption: (a) BiGAN and ALI are learned only from the adversarial loss, so they can yield inconsistent and unstable mappings between the data space and the latent space during training. (b) ALICE explicitly incorporates cycle-consistency to improve sample reconstruction; however, the consistency is limited to a sample-wise one-to-one inverse relationship and does not preserve local neighborhoods or the global structure of the latent space. (c) Our method regularizes the bidirectional GAN framework based on pseudo labels defined by data similarity. The pseudo labels describe a global structure of the latent space and guide the encoder to map data belonging to the same pseudo class into close proximity in the latent space.]
Let us denote the latent vector (or an arbitrary feature representation) of a sample x extracted by a trained BiGAN encoder (or any unsupervised feature extractor) as f(x). We construct pseudo classes according to the similarity defined by the features f(x); specifically, k-means clustering [36] is performed to define the pseudo classes.
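The clustering step can be any off-the-shelf k-means; a minimal pure-Python sketch of Lloyd's algorithm (an illustrative stand-in, not the authors' implementation; the function name `kmeans` is hypothetical) is:

```python
import random

def kmeans(features, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on a list of feature vectors f(x).
    Returns (centroids, labels), where labels[i] is the pseudo class of sample i."""
    rng = random.Random(seed)
    centroids = [list(f) for f in rng.sample(features, k)]
    labels = [0] * len(features)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        for i, f in enumerate(features):
            labels[i] = min(range(k),
                            key=lambda j: sum((a - b) ** 2 for a, b in zip(f, centroids[j])))
        # update step: each centroid moves to the mean of its members
        for j in range(k):
            members = [features[i] for i in range(len(features)) if labels[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels
```

In the paper's setting, `features` would be the BiGAN-encoder features of all training samples and k = 500.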

A. PSEUDO CONDITIONAL BiGAN (PC-BiGAN)
One essential condition for the optimal inverse mapping E = G^{-1} is that the encoder and generator share a unified latent space. To achieve this, we jointly regularize the latent spaces of the generator and the encoder in an explicit manner. To provide guidance in an explicit and unified way, we introduce a regularization method based on the conditional GAN architecture. We specifically define a fixed global structure of the latent space based on the pseudo classes and guide both the encoder and the generator to learn under this predefined space (see Fig. 2).
We compute a single pseudo label y_i for each datum x_i, and k similarity groups S_1, . . . , S_k based on the pseudo labels as follows:

y_i = argmin_{j ∈ {1,...,k}} d(f(x_i), c_j),   S_j = {x_i | y_i = j},   (5)

where c_j is the centroid of the j-th cluster computed by k-means clustering on the initial features f(x), and d(A, B) denotes the Euclidean distance between A and B. For each similarity group S_i, we compute the mean μ_i and covariance matrix Σ_i. This mean and covariance explicitly describe the region corresponding to each similarity group in the latent space. Specifically, we guide the encoder to map a real datum x_i to a latent vector near μ_{y_i}, and the generator to map a latent variable z drawn from the multivariate Gaussian distribution N(μ_j, Σ_j) to a sample similar to the data in S_j. Note that the k-means clustering is performed only once, as a pre-computation that defines the pseudo labels of the training samples. We also define two functions, h_x(x) and h_z(z), that compute the pseudo labels of a datum x and a latent representation z, respectively.
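The per-group statistics and the class-conditioned sampling of z can be sketched as follows. This is a simplified illustration, not the authors' code: the paper uses a full covariance matrix Σ_j, while the sketch assumes a diagonal covariance so that sampling reduces to independent Gaussians per dimension; `group_stats` and `sample_latent` are hypothetical names.

```python
import random

def group_stats(latents, labels, k):
    """Per-pseudo-class mean mu_j and (diagonal, as a simplifying assumption)
    variance of the latent codes belonging to each similarity group S_j."""
    stats = []
    for j in range(k):
        members = [latents[i] for i in range(len(latents)) if labels[i] == j]
        mu = [sum(col) / len(members) for col in zip(*members)]
        var = [sum((v - m) ** 2 for v in col) / len(members)
               for col, m in zip(zip(*members), mu)]
        stats.append((mu, var))
    return stats

def sample_latent(stats, j, rng=random):
    """Draw z ~ N(mu_j, diag(var_j)) for pseudo class j."""
    mu, var = stats[j]
    return [rng.gauss(m, v ** 0.5) for m, v in zip(mu, var)]
```

With these statistics fixed, the generator receives z drawn from the Gaussian of a chosen pseudo class, and the encoder is pulled toward the corresponding μ_j.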

1) PSEUDO CONDITIONAL BiGAN
Once we determine a fixed global structure of the latent space based on (μ_i, Σ_i), we train our encoder to map each similarity group S_i to latent variables belonging to a single fixed region bounded by (μ_i, Σ_i). To do this, we modify the BiGAN framework by incorporating the conditional GAN [26] so that all models receive a pseudo label as an additional input (see Fig. 3). Specifically, the generator, encoder, and discriminator receive the joint inputs (z, h_z(z)), (x, h_x(x)), and (x, z, h), respectively, where h is a pseudo label index. Note that the z fed to the generator is a random variable drawn from N(μ_i, Σ_i) for 1 ≤ i ≤ k, and the input fed to the discriminator is either a real tuple (x, E(x), h_x(x)) or a fake tuple (G(z), z, h_z(z)). Adversarial learning with these joint inputs trains the encoder and generator to produce outputs strongly tied to the predefined global structure given by the pseudo labels, as in conditional GANs under supervised settings. Consequently, we formulate our adversarial loss as follows:

min_{G_P, E_P} max_{D_P} V_P(D_P, E_P, G_P),   (6)

where

V_P(D_P, E_P, G_P) = E_{x∼P_X}[log D_P(x, E_P(x, h_x(x)), h_x(x))] + E_{z∼N(μ_h, Σ_h)}[log(1 − D_P(G_P(z, h_z(z)), z, h_z(z)))],

and the models D_P, G_P, and E_P denote the discriminator, generator, and encoder of our pseudo conditional BiGAN (PC-BiGAN), respectively. Note that z denotes a vector randomly sampled from the multivariate Gaussian distribution N(μ_h, Σ_h) of a chosen pseudo class h.
In order to further regularize the encoder to map a sample x ∈ S_{h_x(x)} to a latent vector near μ_{h_x(x)}, we add the Euclidean distance between E_P(x, h_x(x)) and μ_{h_x(x)} as a distance-based loss L_dist:

L_dist = E_{x∼P_X}[ ||E_P(x, h_x(x)) − μ_{h_x(x)}||_2 ].   (7)

[Figure caption: Our PC-BiGAN explicitly receives pseudo labels to train its generator, encoder, and discriminator to be aware of the global structure of the latent space defined by the pseudo classes. In contrast, the generator and encoder of the inference model (PC-BiGAN-Inf) do not require pseudo labels as input, so they can be used at inference time without explicit computation of the pseudo labels. To guide the encoder and generator of the inference model, we frequently synchronize the learned parameters of the two discriminators. More details are described in Sec. IV. The green circles in the illustration represent the pseudo class regions in the latent space.]
As a result, our objective function for training the pseudo conditional BiGAN is:

min_{G_P, E_P} max_{D_P} V_P(D_P, E_P, G_P) + λ_dist L_dist,   (8)

where λ_dist is a small constant balancing the distance-based loss with the adversarial loss. In all experiments, we consistently use λ_dist = 0.01.

2) BINARY PSEUDO LABEL
One straightforward way to provide the pseudo labels to the models as input is to encode the labels as one-hot vectors [24], [37], [38]. However, this significantly increases the number of parameters as more similarity groups are defined; in practice, we set the number of similarity groups k to 500 for all experiments. In order to reduce the model complexity, we convert each pseudo label index into a binary string of fixed length and feed that to the models. For instance, the one-hot vector corresponding to a datum within similarity group index 3, [0, 0, 1, 0, . . . , 0]^T ∈ {0, 1}^500, is converted to the 9-dimensional binary vector [0 0 . . . 0 1 1]^T ∈ {0, 1}^9. We found that converting pseudo labels to binary representations and feeding them to the models significantly reduces the training time without any loss in accuracy.
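The conversion is the standard base-2 encoding of the class index; 9 bits suffice because 2^9 = 512 ≥ 500. A minimal sketch (the helper name `binary_label` is ours, not the paper's):

```python
def binary_label(index, length=9):
    """Encode a pseudo-class index as a fixed-length binary vector,
    e.g. index 3 -> [0, 0, 0, 0, 0, 0, 0, 1, 1] for length 9."""
    bits = format(index, '0{}b'.format(length))
    return [int(b) for b in bits]
```

This replaces a 500-dimensional one-hot conditioning input with a 9-dimensional one, shrinking the conditioning layers accordingly.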

B. INFERENCE MODEL WITHOUT PSEUDO LABELS (PC-BiGAN-INF)
Although our pseudo conditional BiGAN can learn the inverse mapping in a more consistent manner than BiGAN, it has the limitation that it requires explicit pseudo labels h_z(z) and h_x(x) when generating samples (i.e., G_P(z, h_z(z))) or mapping real data back to the latent space (i.e., E_P(x, h_x(x))). For instance, to encode x to a latent vector, we need to compute the pseudo label corresponding to x by extracting the initial feature f(x) and identifying the similarity group to which x belongs. Pseudo label assignment at inference time has a non-negligible computational cost, since it requires computing the distances from f(x) to all the cluster centers {c_j}_{j=1}^{k} (Eq. 5). In order to remove this overhead at inference time, we propose an alternative learning strategy that trains inference models which do not require any explicit pseudo labels.
The generator G_I(z) and encoder E_I(x) of our inference model, PC-BiGAN-Inf, have almost identical architectures to G_P(z, h_z(z)) and E_P(x, h_x(x)), except that the inference models do not receive pseudo labels as input. On the other hand, the discriminator of PC-BiGAN-Inf, D_I(x, z, h), has the same structure as D_P(x, z, h); since only G_I and E_I are utilized at inference time, we allow D_I to receive explicit pseudo labels as one of its inputs.
Specifically, we simultaneously train PC-BiGAN and PC-BiGAN-Inf. If the adversarial learning based on Eq. 8 can form a regularized latent space with G_P, E_P, and D_P, then it may also seem possible to train G_I and E_I through adversarial learning against D_P, so that they acquire a latent space similar to that of G_P and E_P. In other words, D_P would be in charge of discriminating tuples (x, z, h) predicted by the four models G_P, E_P, G_I, and E_I. However, this strategy has a problem. From D_P's point of view, D_P(x, z, h) should classify tuples predicted by E_I without pseudo conditions, (x, E_I(x), h_x(x)), as real joint samples. Since E_I does not receive any conditional labels, it hardly produces latent representations E_I(x) ∼ N(μ_{h_x(x)}, Σ_{h_x(x)}), lacking the guidance that E_P(x, h_x(x)) relies on. Such poorly predicted samples of E_I, whose distribution differs drastically from the output of E_P, obviously prevent D_P from converging. Similarly, fake tuples estimated by G_I confuse D_P when it learns the decision boundaries between real and fake samples.
Instead of playing an adversarial game among the five models G_P, E_P, G_I, E_I, and the judge D_P, we let D_I join the game. The role of D_I is to discriminate samples produced by G_I and E_I, whereas D_P is the referee for G_P and E_P. However, learning D_I based only on the inference models suffers from the aforementioned problem. To mitigate it, we frequently synchronize D_I with D_P by copying the learned parameters of D_P to D_I. This prevents G_I and E_I from disturbing D_P's learning of a regularized latent space by separating it from them; moreover, D_I, as a copy of D_P, provides appropriate guidance to G_I and E_I. In practice, we synchronize the two discriminators every 10 training epochs.
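The synchronization itself is just a one-way parameter copy at fixed intervals, similar in spirit to target-network updates in deep RL. A toy sketch (not the authors' code; `Net` stands in for any parametric model, with parameters held in a plain dict):

```python
import copy

class Net:
    """Toy stand-in for a discriminator whose parameters are a dict of lists."""
    def __init__(self, params):
        self.params = params

def synchronize(d_src, d_dst):
    """Copy the learned parameters of D_P into D_I.
    A deep copy is used so later updates to D_P do not alias into D_I."""
    d_dst.params = copy.deepcopy(d_src.params)

# In the training loop one would call, e.g.:
#   if epoch % 10 == 0:
#       synchronize(d_p, d_i)
```

In a real framework this corresponds to copying the discriminator's weight tensors (e.g., a state-dict copy) rather than a dict of lists.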
As a result, our minimax game for training PC-BiGAN-Inf is governed by the following objective:

min_{G_I, E_I} max_{D_I} V(D_I, E_I, G_I),   (9)

where

V(D_I, E_I, G_I) = E_{x∼P_X}[log D_I(x, E_I(x), h_x(x))] + E_{z∼N(μ_h, Σ_h)}[log(1 − D_I(G_I(z), z, h_z(z)))].   (10)

We also apply the same regularization loss L_dist, with the same weight λ_dist, to E_I. Then, our final objective for PC-BiGAN-Inf is given as follows:

min_{G_I, E_I} max_{D_I} V(D_I, E_I, G_I) + λ_dist L_dist.   (11)

Balanced Mini-Batch: If we feed the real data x ∼ P_X and latent variables z ∼ P_Z in a totally random manner, the numbers of training samples per pseudo class become imbalanced. For instance, it is highly likely that a mini-batch contains no pair of x and z with h_x(x) = h_z(z). We found that such totally random sampling often makes training fail. To overcome this problem, for a mini-batch of size 2m, we first draw real data {x_1, . . . , x_m} at random. We then draw the latent variables {z_1, . . . , z_m} so that the mini-batch is pseudo class-wise balanced, based on the following criterion:

z_i ∼ N(μ_{h_x(x_i)}, Σ_{h_x(x_i)})   for i = 1, . . . , m,   (12)

so that h_z(z_i) = h_x(x_i) for every pair.

Recall the experimental study reported in Fig. 1: our proposed PC-BiGAN-Inf better preserves the local neighborhood across training epochs compared to the prior techniques, thanks to our regularization based on a predefined, fixed latent space defined by data similarity.
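The balanced mini-batch construction can be sketched as follows. This is an illustrative version under the same diagonal-covariance simplification as before (the paper uses full covariances); `balanced_batch` is a hypothetical name, and `stats[j]` holds the (mean, variance) of pseudo class j.

```python
import random

def balanced_batch(data, pseudo_labels, stats, m, rng=None):
    """Draw m real samples at random, then draw one latent z_i per x_i from the
    Gaussian of x_i's own pseudo class, so that h_z(z_i) = h_x(x_i) and the
    mini-batch of size 2m is pseudo-class-balanced between the x's and z's."""
    rng = rng or random.Random(0)
    idx = rng.sample(range(len(data)), m)
    xs = [data[i] for i in idx]
    zs = []
    for i in idx:
        mu, var = stats[pseudo_labels[i]]
        zs.append([rng.gauss(mval, vval ** 0.5) for mval, vval in zip(mu, var)])
    return xs, zs
```

Pairing each z with a real sample of the same pseudo class guarantees that every class present among the x's is also represented among the z's.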

V. EVALUATION
In this section, we intensively evaluate our proposed method against the state-of-the-art bidirectional adversarial learning techniques including BiGAN, ALI, and ALICE.
Note that the reported quantities marked with a † are taken from the literature. For baselines whose results are not yet available in the literature, we used the source code released by the authors and the hyperparameters suggested in each paper for fair and reproducible experiments.

A. DATASETS AND IMPLEMENTATION DETAILS 1) DATASETS
We conduct experiments on the CIFAR-10 and CIFAR-100 benchmarks [39] 1 to validate the benefits of our proposed method. The CIFAR-10 dataset contains images of 10 object classes, while the CIFAR-100 dataset has instances of 20 coarse and 100 fine-grained classes. Both benchmarks contain 60,000 images at a resolution of 32 × 32, separated into 50,000 training and 10,000 testing images, respectively. We utilize the training set without supervised labels when training the methods, and measure all quantitative and qualitative performance on the testing set.

2) IMPLEMENTATION DETAILS
In all experiments, we set the dimensionality of the latent space to 200 and the number of similarity groups k to 500. Every 10 epochs, we synchronize the learned parameters of the discriminators. We use the ADAM optimizer [40] with a learning rate of 0.0002 and momentum β_1 = 0.5. We set the mini-batch size to 256 and trained our models for 500 epochs. As data augmentation, we scale data to [−1, 1] and perform random horizontal flipping. Although there is no restriction on the choice of neural network architecture, we use the DCGAN [2] structure for fair comparisons. We also feed latent vectors or binary pseudo labels to all input and intermediate layers of the generators, discriminators, and encoders, as done in BiGAN.

B. RECONSTRUCTION AND GENERATION
We first evaluate the performance of our encoder and generator by the data reconstruction and generation tasks. For those experiments, we use the CIFAR-10 dataset.

1) SAMPLE RECONSTRUCTION
Sample reconstruction based on the sequence x → E(x) → G(E(x)) is an appropriate way to evaluate the encoder and generator together. We report the pixel-level reconstruction error measured by Mean Squared Error (MSE) between the original data x and its reconstruction G(E(x)). As reported in Table 1, our method achieves significantly lower reconstruction error than the baseline techniques. We also report Structural Similarity (SSIM) scores [41], and the results show that our method improves over the second-best models, ALI and BiGAN, by a large margin. These improvements over the prior techniques confirm the merit of our technique of providing guidance on the global structure of the latent space. Moreover, our method even outperforms ALICE, which is optimized for sample reconstruction by incorporating a cycle-consistency loss into its objective. These results validate that enforcing structural consistency of the latent space is more beneficial than sample-wise one-to-one constraints.
We also report qualitative comparison of reconstructed images in Fig. 4. We observe that our method provides more accurate sample reconstruction over the state-of-the-art techniques in terms of colors and global shapes.

2) SEMANTIC RECONSTRUCTION
We also investigated the semantic consistency of the reconstructed samples. To quantitatively measure the semantic reconstruction performance, we utilize the misclassification rate of the reconstructed samples, denoted Semantic Reconstruction Error (SRE):

SRE = E_{x∼P_X}[ 1( f_c(G(E(x))) ≠ Y_C(x) ) ],   (13)

where f_c is an external classification model trained with supervision [42], utilized only for evaluation, 1(·) is the indicator function, and Y_C(x) is the true label of the sample x. As reported in Table 1, our method significantly outperforms the tested baselines. These results confirm that our similarity-based guidance helps the latent space better reflect the semantics, and strongly imply that our method can capture semantic classes in the latent codes even in unsupervised settings. While our guidance globally regularizes the latent space to locate similar data in close proximity, the adversarial objective allows the local structure of the latent space to exploit the semantics. On the other hand, ALICE performs even worse than its base model ALI, since the cycle-consistency expressed by pixel-wise L2 distances tends to focus on learning low-level features rather than high-level ones. Note that the classification accuracy of the external classifier f_c on the real samples is 90%.
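Given the external classifier's predictions on the reconstructions, SRE reduces to a plain misclassification rate. A minimal sketch (the function name is ours; `true_labels` are Y_C(x) and `predicted_labels` are f_c(G(E(x)))):

```python
def semantic_reconstruction_error(true_labels, predicted_labels):
    """SRE: fraction of reconstructed samples whose class, as judged by the
    external classifier f_c, differs from the true label Y_C(x)."""
    assert len(true_labels) == len(predicted_labels)
    wrong = sum(1 for t, p in zip(true_labels, predicted_labels) if t != p)
    return wrong / len(true_labels)
```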
86978 VOLUME 8, 2020

[Fig. 7 caption: G(z_a), G(z_b), and G(z_c) generated by the tested methods, respectively. We further test with z_d computed by the vector arithmetic z_d = z_a − z_b + z_c, and the generated images G(z_d) are reported in the fourth row.]

3) SAMPLE GENERATION
We further studied the sample generation performance by measuring the Inception Score (ICP) [43] and Fréchet Inception Distance (FID) [44], which evaluate both the quality and diversity of the generated samples. We specifically generate 50K samples and compute ICP and FID on them. As shown in Table 1, our method provides the highest ICP and lowest FID among the tested methods.
We further claim that our conditional generator G_P produces higher-quality samples, since PC-BiGAN is more explicitly regularized to generate samples belonging to their corresponding similarity groups than its inference model PC-BiGAN-Inf. To validate this claim, we report in Table 2 the ICP and FID scores of PC-BiGAN, which requires explicit computation of the pseudo labels, and of PC-BiGAN-Inf, which uses no condition. The results show that PC-BiGAN further improves image generation quality, confirming that our regularization based on data similarity yields higher performance when applied more explicitly. Note that the pseudo labels used to measure image quality are randomly selected for a fair comparison between PC-BiGAN and PC-BiGAN-Inf.

C. INTERPOLATION AND VECTOR ARITHMETIC
We also validate that the latent space produced by our method captures semantic variations. We specifically report quantitative and qualitative results of the sample interpolation and vector arithmetic operations among latent vectors.

1) SAMPLE INTERPOLATION
One may suspect that the guidance toward a global structure of the latent space disturbs the semantic exploitation of GANs.
To show that our similarity-regularized latent space preserves semantic attributes, we estimate mean softmax values across interpolated samples. Specifically, we first select two real samples x_1 and x_2 from different categorical classes. These samples are encoded to the latent representations E(x_1) and E(x_2), and then decoded to samples G(z_t) with an interpolated latent vector parameterized by t ∈ [0, 1]:

z_t = (1 − t) E(x_1) + t E(x_2).   (14)

We measure the mean softmax values of their own categorical classes across interpolation steps t using the external classifier. As shown in Fig. 5, the mean softmax values of our reconstructed samples (t = 0 for x_1 and t = 1 for x_2) are higher than those of BiGAN. Although these values are outputs of a softmax activation, they show that the classifier predicts the category of the reconstructed samples with higher confidence than for the baseline method. On the other hand, we also found that the derivatives of our softmax values change dramatically from t = 0.3 to t = 0.5. This suggests that our regularized latent space is semantically well clustered, since the semantic changes are concentrated in the middle of the interpolation (i.e., at the boundary between different clusters). Note that the mean softmax value for real samples is 0.87.

Fig. 6 shows qualitative results of the sample interpolation. Specifically, we interpolate between two real samples that have different semantic attributes. Although we predefine the global structure of the latent space, these results confirm that our model captures underlying semantic attributes such as viewpoint and shape. Furthermore, the semantic attributes change smoothly, without collapsing to either side.
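The interpolation step is ordinary linear interpolation in latent space; a minimal sketch (illustrative, with latent codes as plain lists):

```python
def interpolate(z1, z2, t):
    """Linear interpolation z_t = (1 - t) * z1 + t * z2 between two latent codes;
    sweeping t over [0, 1] traces a path whose decodings G(z_t) should morph
    smoothly from G(E(x1)) to G(E(x2))."""
    return [(1 - t) * a + t * b for a, b in zip(z1, z2)]
```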

2) ARITHMETIC OPERATIONS AMONG LATENT VECTORS
One interesting property of the inverse mappings of GANs is that they enable attribute-level arithmetic operations on real data, as demonstrated in natural language processing [45]. We conduct experiments with the car and truck classes of the CIFAR-10 benchmark to validate that our inverse mapping and latent space have this property; the qualitative results are reported in Fig. 7. In each experiment, 9 real samples {x^k_i} are utilized, where k ∈ {a, b, c} and i ∈ {1, 2, 3}. We first examine the generated data G(z^k), expected to have the average attributes and class label of the three real samples x^k_1, x^k_2, and x^k_3, from the mean latent vector

z^k = (1/3) (E(x^k_1) + E(x^k_2) + E(x^k_3)),   (15)

for each k ∈ {a, b, c}. Note that E(·) is the latent representation of the real data predicted by the tested techniques, including our PC-BiGAN-Inf. While all tested methods generated reasonable samples in terms of attributes and classes, our PC-BiGAN-Inf produced the clearest image. Note that the term neutral indicates that many attributes are mixed together, so it is hard to pick a single description [2].
We further study the arithmetic operations among latent vectors. We specifically compute a latent vector z_d as

z_d = z_a − z_b + z_c,   (16)

where z_a, z_b, and z_c are computed by Eq. 15. The last (4th) rows of Fig. 7(a) and 7(b) show the generated samples from z_d by ALI, ALICE, BiGAN, and our PC-BiGAN-Inf. As reported, our method produces the most accurate samples in terms of attribute, class, and appearance. Especially in Fig. 7(c), ALI and ALICE produced trucks and BiGAN generated a left-heading car, while the result of our PC-BiGAN-Inf correctly reflects the desired attribute and class.
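The two operations above can be sketched as follows. Here `encode` is a hypothetical stand-in for the learned inverse mapping E(·), and z_a − z_b + z_c is the standard form of such latent arithmetic (as in word-vector analogies), which is how we read the elided equation:

```python
import numpy as np

def mean_latent(encode, samples):
    """Eq. 15 (as reconstructed): average the latent representations
    of the given real samples."""
    return np.mean([encode(x) for x in samples], axis=0)

def latent_arithmetic(z_a, z_b, z_c):
    """Attribute-level arithmetic among mean latent vectors:
    z_d = z_a - z_b + z_c (assumed standard form)."""
    return z_a - z_b + z_c
```

Decoding z_d with the generator then yields a sample expected to carry the attributes of group a, minus those of group b, plus those of group c.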
These experimental results validate that our inverse mapping and latent space support vector arithmetic among latent vectors well, thanks to our novel bidirectional adversarial learning framework regularized by the structure-wise guidance.

D. RECONSTRUCTION ON CIFAR-100
In order to test the scalability of our method with respect to the complexity of the data distribution, we evaluate the MSE and SRE of our method against BiGAN on the CIFAR-100 benchmark. As reported in Table. 3, our method significantly outperforms BiGAN in both pixel- and semantic-level reconstruction. The performance improvement over BiGAN is consistent with respect to the complexity of the datasets. Note that we use the 20 coarse labels to measure SRE. For the external classifier, we utilize a ResNet-18 model [46] that achieves a classification accuracy of 0.83 on real samples.
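The two metrics can be sketched as follows. The MSE is the usual pixel-level error between a sample and its reconstruction; the SRE definition below is an assumption for illustration (the fraction of reconstructions whose coarse label, as predicted by the external classifier, differs from the original sample's label), since the exact definition is given elsewhere in the paper:

```python
import numpy as np

def reconstruction_mse(x, x_rec):
    """Pixel-level mean squared error between real samples and reconstructions."""
    x, x_rec = np.asarray(x, dtype=float), np.asarray(x_rec, dtype=float)
    return float(np.mean((x - x_rec) ** 2))

def semantic_reconstruction_error(labels, predicted_labels):
    """Assumed SRE: fraction of reconstructions whose predicted (coarse)
    label disagrees with the label of the original sample."""
    labels = np.asarray(labels)
    predicted_labels = np.asarray(predicted_labels)
    return float(np.mean(labels != predicted_labels))
```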
The qualitative results of the sample reconstruction on CIFAR-100 are provided in Fig. 8. We found that our method robustly reconstructs the samples with various shapes and classes.

E. ABLATION STUDIES
In this section, we discuss our method through various ablation studies. We specifically analyze the impacts of the initial features, the frequency of weight synchronization between D_P and D_I, the regularization methods, and the incorporation of cycle-consistency.

1) INITIAL FEATURE EXTRACTOR
We first analyze the impact of the initial feature extractor utilized to compute the pseudo labels. We test our method with the bottleneck representations of an Auto-Encoder (AE) as the initial features instead of the latent vectors of BiGAN, since the AE is one of the most popular unsupervised feature extractors [47]- [50]. As reported in Table. 4, our method w/ AE clearly outperforms BiGAN across all the tested reconstruction tasks. Although an AE generally focuses more on local features such as colors or textures, our method w/ AE reconstructs semantic features more accurately and provides a lower SRE than BiGAN. This ablation study strongly implies that our learning strategy has a low sensitivity to the choice of initial feature extractor, as long as it can capture the data similarities. Note that the experiments are conducted on the CIFAR-10 test data.

2) FREQUENCY OF WEIGHT SYNCHRONIZATION
Table. 5 shows SRE scores of our method with various frequencies of weight synchronization between the two discriminators D_P and D_I. We observed that our method with frequencies from 10 to 25 epochs consistently reached an SRE score of 0.50. Furthermore, frequencies shorter or longer than the best setting, such as 5 or 50 epochs, did not cause noticeable performance degradation, and our method achieved higher accuracy than BiGAN regardless of the frequency (0.52 at 5 epochs vs. 0.61 for BiGAN). These results confirm that our method has a low sensitivity to the weight synchronization frequency. Note that we train the models for up to 500 epochs, and a frequency of 10 epochs is used in all the other experiments.
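The synchronization schedule can be sketched as follows. This is a minimal sketch with hypothetical names: `train_epoch` stands in for one epoch of the paper's adversarial training, and the discriminator weights are represented as plain dictionaries for illustration:

```python
def train_with_sync(d_p_weights, d_i_weights, epochs, sync_every, train_epoch):
    """Run `train_epoch` each epoch and copy D_P's weights into D_I
    every `sync_every` epochs (the schedule ablated above)."""
    for epoch in range(1, epochs + 1):
        train_epoch(epoch, d_p_weights, d_i_weights)
        if epoch % sync_every == 0:
            d_i_weights.update(d_p_weights)  # synchronize D_I <- D_P
    return d_i_weights
```

With `sync_every=10` and 500 training epochs, this reproduces the default schedule used in the other experiments.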

3) REGULARIZATION METHODS
Our proposed method regularizes the BiGAN by providing pseudo labels to both the generator and the encoder (see Sec. IV). An alternative approach could be to modify only the encoder to produce a structured latent space. Specifically, we can stack a linear classification layer on top of the encoder to predict the pseudo label. This approach can implicitly form a clustered latent space according to the pseudo labels. We name this alternative approach PC-BiGAN-Cls.
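The classification head of the PC-BiGAN-Cls variant can be sketched as follows. This is a minimal numpy sketch with hypothetical shapes; `z` stands in for the encoder's latent code, and `W`, `b` are the parameters of the stacked linear layer:

```python
import numpy as np

def linear_classifier_head(z, W, b):
    """Linear layer on top of the encoder's latent code z, followed by a
    softmax over pseudo labels (the PC-BiGAN-Cls variant sketched above)."""
    logits = z @ W + b
    e = np.exp(logits - np.max(logits))  # numerically stable softmax
    return e / e.sum()

def pseudo_label_loss(z, W, b, label):
    """Cross-entropy of the head's prediction against the pseudo label."""
    probs = linear_classifier_head(z, W, b)
    return float(-np.log(probs[label] + 1e-12))
```

Minimizing this loss through the encoder pushes codes with the same pseudo label together, which is what implicitly clusters the latent space.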
We conduct experiments to compare our original method with the alternative PC-BiGAN-Cls, and the results are reported in Table. 6. Interestingly, PC-BiGAN-Cls outperforms BiGAN in both SRE and MSE. This result is notable since classifying the pseudo labels with the encoder does not provide direct guidance for the generator to match its latent distribution with the encoder's, but it marginally alleviates the inconsistent mapping of latent codes across training steps. In other words, guiding the BiGAN framework toward consistent mappings during training helps to improve the performance of the inverse mapping. Meanwhile, our proposed method outperforms both PC-BiGAN-Cls and BiGAN in both MSE and SRE by a large margin. These results confirm the unique merit of our method, which guides both the encoder and the generator to produce a shared global structure of the latent space. Note that all tested models have exactly the same neural network architecture in the inference phase.

4) CYCLE-CONSISTENCY
Our method regularizes the BiGAN with guidance on the global structure of the latent space based on the pseudo conditions defined by the data similarities, while ALICE utilizes the cycle-consistency expressed by sample-wise reconstruction. Although our method outperforms ALICE even in the sample reconstruction task (Table. 1), we investigate our method when incorporating the cycle-consistency in addition to our objectives. To do this, we add the following cycle-consistency loss L_cycle^PC-BiGAN(G_P, E_P), scaled by a small constant λ_cycle:

E_{x∼P_X} || G_P(E_P(x, h_x(x)), h_x(x)) − x ||_1 + E_{z∼P_Z} || E_P(G_P(z, h_z(z)), h_z(z)) − z ||_1.   (17)

A corresponding cycle-consistency loss is added to the objective of PC-BiGAN-Inf in the same manner. Table. 7 shows the MSE and SRE scores when we add the cycle-consistency loss to our objective with different values of λ_cycle. As reported, the cycle-consistency is not beneficial to our method. Moreover, our models hardly converge, or generate blurry or semantically meaningless images, if we increase λ_cycle beyond 10^−3. This is mainly because our regularization scheme based on the structure-wise guidance conflicts with the sample-wise one. More importantly, we highlight that our original method, without any explicit objective on sample reconstruction, significantly outperforms ALICE, as shown in Table. 1.
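A batch estimate of the L1 cycle-consistency loss in Eq. 17 can be sketched as follows. `G`, `E`, `h_x`, and `h_z` are hypothetical stand-ins for the generator G_P, the encoder E_P, and the pseudo-condition functions; the L1 norm is taken here as a mean over dimensions for illustration:

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference between two vectors (L1-style distance)."""
    return float(np.mean(np.abs(np.asarray(a, dtype=float) - np.asarray(b, dtype=float))))

def cycle_consistency_loss(xs, zs, G, E, h_x, h_z, lam=1e-4):
    """Batch estimate of Eq. 17, scaled by lambda_cycle:
    lam * ( E_x ||G(E(x, h_x(x)), h_x(x)) - x||_1
          + E_z ||E(G(z, h_z(z)), h_z(z)) - z||_1 )."""
    data_term = np.mean([l1(G(E(x, h_x(x)), h_x(x)), x) for x in xs])
    latent_term = np.mean([l1(E(G(z, h_z(z)), h_z(z)), z) for z in zs])
    return lam * float(data_term + latent_term)
```

In the ablation, increasing `lam` beyond 10^-3 is what destabilizes training, since this sample-wise term pulls against the structure-wise pseudo-condition guidance.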

VI. CONCLUSION
In this paper, we have proposed a novel pseudo conditional bidirectional adversarial learning framework to train the inverse mapping of GANs so that the resulting latent space can reflect the similarity among data and expose their semantic structure. While the prior techniques hardly provide a consistent mapping between the data and their latent representations, we guide the models with a predefined global structure of the latent space formed by the proximity relationships among data in an unsupervised learned feature space. We specifically define pseudo labels by clustering in the feature space of an unsupervised model. We train our pseudo conditional bidirectional GAN (PC-BiGAN) with the pseudo labels so that the encoder maps data to the region of the latent space corresponding to their pseudo label. Since PC-BiGAN requires pseudo labels as its input, we further propose to learn an inference model that does not require any explicit feed of pseudo labels. We extensively evaluate our method against the state-of-the-art bidirectional learning approaches, and the experimental results validate that our method consistently outperforms all the tested techniques in various tasks including sample reconstruction, generation, and interpolation. The experimental results confirm the unique merit of our work: fixed guidance for the latent space, based on sample similarity, in bidirectional adversarial learning.