Collaborative Training of GANs in Continuous and Discrete Spaces for Text Generation

Applying generative adversarial networks (GANs) to text-related tasks is challenging due to the discrete nature of language. One line of research resolves this issue by employing reinforcement learning (RL) and optimizing the next-word sampling policy directly in a discrete action space. Such methods compute the rewards from complete sentences and avoid error accumulation due to exposure bias. Other approaches employ approximation techniques that map the text to a continuous representation in order to circumvent the non-differentiable discrete process. In particular, autoencoder-based methods effectively produce robust representations that can model complex discrete structures. In this paper, we propose a novel text GAN architecture that promotes the collaborative training of the continuous-space and discrete-space methods. Our method employs an autoencoder to learn an implicit data manifold, providing a learning objective for adversarial training in a continuous space. Furthermore, the complete textual output is directly evaluated and updated via RL in a discrete space. The collaborative interplay between the two adversarial trainings effectively regularizes the text representations in different spaces. The experimental results on three standard benchmark datasets show that our model substantially outperforms state-of-the-art text GANs with respect to quality, diversity, and global consistency.


I. INTRODUCTION
Generating realistic text is an important task with a wide range of real-world applications, such as machine translation [1], dialogue generation [2], image captioning [3], and summarization [4]. A language model is the most common approach for text generation, and it is typically trained via maximum likelihood estimation (MLE), specifically in an autoregressive fashion.
Although MLE-based methods have achieved great success in text generation, there are two fundamental issues that call for further research. The first problem is that MLE suffers from exposure bias [5]: during training, the model sequentially generates the next word depending on the ground-truth words; however, the model relies on its previously generated words at inference time. Therefore, the cumulative effect of incorrect predictions in the text sequence results in low-quality samples. The second problem is that the objective functions of MLE-based methods are rigorous [6]; the models are forced to learn every word in the target sentence. Under this strict guidance, the ability of language models to generate diverse samples can be severely limited.
In recent years, Generative Adversarial Networks (GANs) [7] have drawn attention as a remedy to the above problems. However, applying GANs to text-related tasks is challenging due to the discrete nature of text. In the inference phase of text generation, the model iteratively samples the next word from a distribution over the vocabulary. As this sampling step blocks the backpropagation of gradients from the discriminator, several approximation methods have been proposed to avoid the non-differentiability issue [8]-[11].
Depending on the data space in which the GAN's discriminator operates, text GANs can be categorized into two groups: continuous-space methods and discrete-space methods. For discrete spaces, one prominent research line adopts the reinforcement learning (RL) technique to address the non-differentiability issue directly [8], [9], [12]. In the RL setting, GANs treat the generator as a stochastic policy to synthesize realistic samples. The generator is optimized via policy gradient methods by incorporating the reward signals from the discriminator. These signals are computed from a complete sequence rather than individual words in text. This RL approach can consider the final form of the text, and thus resolves the discrepancy between the training and inference stages in the MLE method. However, it has significant limitations, such as an excessive dependency on MLE pretraining, and severe mode collapse [13].
VOLUME 4, 2016. arXiv:2010.08213v2 [cs.CL] 4 Nov 2020
Other methods employ approximation techniques to transform discrete text into continuous representation. Such approaches include substituting next-word sampling in the generation phase with continuous relaxation [14], [15] or adopting an autoencoder architecture to learn an implicit data manifold in a continuous space instead of directly modeling the discrete text [10], [16]. In these approaches, the discriminator distinguishes between the synthetic and real text representations in the continuous space. As the discriminator only learns to distinguish an approximated representation of text, these approaches cannot provide direct feedback concerning the entire text's correctness.
In this work, we propose a novel text GAN architecture, called ConcreteGAN, which promotes the collaborative training of the continuous-space and discrete-space methods. Specifically, in the continuous space, a latent code representation of the synthetic text is learned jointly with an autoencoder. Then, the textual output generated from the latent code is further updated via RL training. In this way, ConcreteGAN simultaneously regularizes the text generation process within the continuous and discrete data spaces. The interplay between adversarial trainings in the two data spaces offers the following advantages: 1) it reduces RL training variance through the regularization of the latent code representation; and 2) it alleviates exposure bias in continuous-space methods. To the best of our knowledge, our proposed method is the first work to train a text GAN combining both continuous-space and discrete-space methods. We evaluate our model on three benchmark datasets: the COCO Image Caption corpus, the Stanford Natural Language Inference corpus, and the EMNLP 2017 WMT News corpus. Extensive experiments show that our model surpasses the existing text GAN models and achieves a substantial improvement with respect to quality, diversity, and global consistency. In addition, we provide comprehensive analyses of the latent code space. Compared to the GANs that work only in a continuous space, the synthetic code space generated by our model is more similar to the latent code space of real text. This behavior demonstrates that the proposed approach effectively regularizes the latent code space, which helps to reduce the variance of RL training.

II. BACKGROUND
In this section, we first give a brief description of GANs. Then we introduce two lines of research for text generation, including continuous-space methods and discrete-space methods.

A. GENERATIVE ADVERSARIAL NETWORKS
GANs are implicit generative models that do not require a tractable likelihood function. Thus, they can be applied to practical situations such as imitating the distribution of high-dimensional complex data.
In general, GANs have a generator and a discriminator as their basic components. The generator tries to imitate the real data distribution, and the discriminator tries to distinguish generated samples from real data. The iterative interplay between these two components improves their strength against each other and provides significant performance enhancement in each of them. One can formulate the objective function of the GAN as a minimax game:
$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]$$
where G and D are the functions for the generator and discriminator, respectively, p_data is the real data distribution, and p_z is the prior over the noise input z.
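As a small numerical illustration of the minimax value function from [7], the sketch below estimates V(D, G) from discriminator outputs. The function name `gan_value` and the sample values are assumptions made for this example and do not come from the paper.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated data outputs 0.5
# everywhere, which gives V = -log 4.
undecided = gan_value([0.5, 0.5], [0.5, 0.5])

# A discriminator that separates the two sets well obtains a larger value,
# which is what the inner maximization over D rewards.
confident = gan_value([0.9, 0.95], [0.1, 0.05])
```

The discriminator is trained to increase this value, while the generator is trained to decrease it by pushing D(G(z)) upward.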
GANs have shown significant achievements in various deep learning applications, especially in computer vision research [17]-[19]. However, when applied to text generation, GANs suffer from the non-differentiability issue due to the discrete nature of text. Recently, various methods have been proposed to circumvent this issue, which can be broadly classified into two categories: continuous-space methods and discrete-space methods.

B. CONTINUOUS-SPACE METHODS
Several studies on GANs sidestep the non-differentiability by reformulating the learning objective in a continuous space. The adversarially regularized autoencoder (ARAE) [10], as a representative model, employs an autoencoder to learn an implicit data manifold, mapping discrete text into a continuous latent representation. In this model, the encoder and the generator are trained adversarially. Based on the ARAE, LATEXT-GAN [16] uses an additional approximated representation of text called soft-text, which is the reconstructed output of the autoencoder. It then employs two discriminators, one for each approximated representation in the continuous spaces.

C. DISCRETE-SPACE METHODS
SeqGAN [8] is the first work addressing the non-differentiability issue within a discrete space by introducing an RL technique into GAN training. Specifically, this approach considers the generated words as the current state and the generation of the next word as an action. In this scenario, the generator is optimized via a policy gradient, where the reward is computed by the discriminator through a Monte Carlo search. MaliGAN [21] proposes a normalized maximum likelihood objective. Combined with several reduction techniques, it reduces the variance of the reinforcement learning rewards and the instability of the GAN training dynamics. LeakGAN [22] devises a hierarchical architecture for the generator to address the sparsity issue in long text generation. The generator is guided by the latent feature leaked from the discriminator at all generation steps. RankGAN [9] relaxes the binary restriction of the discriminator by exploiting relative ranking information between the real sentences and generated ones. This increases the diversity and richness of the sentences. All of the above methods use maximum likelihood pretraining, followed by small amounts of adversarial fine-tuning. ScratchGAN [12] first achieved a performance comparable to that of MLE methods without any pretraining.

FIGURE 1. The overall architecture of our model, which is composed of an autoencoder, a code-generator, a code-discriminator and a text-discriminator. Each training iteration of our model has three steps: (a) autoencoder reconstruction, (b) adversarial training of code-generator and code-discriminator in a continuous space, and (c) adversarial training of decoder and text-discriminator in a discrete space. The red dotted lines specify the modules to be updated in each step. The collaborative interplay between the two adversarial trainings in the continuous (latent code) and discrete (natural text) spaces regularizes the text representations in different spaces.

III. CONCRETEGAN
In this work, we propose a novel text GAN architecture that promotes the collaboration of two adversarial trainings in a continuous space and a discrete space, respectively. We adopt the alternating training of the two methods in each iteration, rather than the MLE pretraining commonly used for discrete-space methods. The architecture of our model is shown in Fig. 1.
Our model consists of the following four components: (1) The RNN-based autoencoder is composed of an encoder and a decoder. The encoder provides a latent code representation of real text in a continuous space. The decoder, as a text-generator, yields textual outputs by interpreting the latent code from the encoder or the code-generator. (2) The code-generator maps random noise to a latent code representation with the goal of imitating the distribution of the encoder. (3) The code-discriminator is the code-generator's opponent, and is adversarially trained to distinguish between latent codes from the encoder and the code-generator. (4) The text-discriminator evaluates complete sequences from the real data distribution or the decoder (or text-generator) distribution. The computed scores are used as the rewards to train the decoder via the policy gradient algorithm. The interplay between the two adversarial trainings has a complementary effect, improving both the quality and the diversity of generated text.

A. AUTOENCODER RECONSTRUCTION
Let x ∈ X be the input sequence and z ∈ Z be the latent code of an autoencoder. We use a conventional RNN autoencoder that consists of two parts: an encoder network and a decoder network. The encoder network f_φ: X → Z (parameterized by φ) maps the input sequence x to the latent code z, which is represented as the last hidden state. The decoder function f_ψ: Z → X (parameterized by ψ) reconstructs the original input x conditioned on the encoded latent code z. Here, we use a gated recurrent unit (GRU) for both the encoder and decoder networks, whose parameters are trained using the cross-entropy loss function:
$$\mathcal{L}_{\text{rec}}(\phi, \psi) = -\log p_{\psi}\left(x \mid f_{\phi}(x)\right)$$
where p_ψ(x | z) denotes the probability that the decoder assigns to the sequence x given the latent code z.
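The reconstruction step can be illustrated with a minimal sketch of token-level cross-entropy. The helper names, the toy logits, and the choice to average over tokens are assumptions for illustration; the decoder itself is not modeled here, only the loss applied to its outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reconstruction_loss(step_logits, target_ids):
    """Mean token-level cross-entropy between decoder outputs and the input.

    step_logits: one list of vocabulary logits per decoding step
                 (hypothetical decoder outputs, conditioned on the code z).
    target_ids:  token ids of the original input sequence x.
    """
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll / len(target_ids)

# Toy vocabulary of 3 tokens and a sequence of length 2.
uniform = reconstruction_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0, 2])
peaked = reconstruction_loss([[5.0, 0.0, 0.0], [0.0, 0.0, 5.0]], [0, 2])
```

A decoder that spreads probability uniformly over the vocabulary pays log |V| per token, while one that concentrates mass on the correct tokens drives the loss toward zero.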

B. ADVERSARIAL TRAINING IN THE LATENT CODE SPACE
The step following autoencoder reconstruction is the adversarial training of the code-generator G_θ(µ) and the code-discriminator D_ω(z). The code-generator aims to imitate the distribution of real text in the continuous latent code space that is represented as the last hidden state of the encoder. Given a random noise vector µ from a fixed distribution, such as a standard Gaussian distribution, the code-generator G_θ(µ) outputs a vector z̃ that has the same shape as the last hidden state of the encoder. On the other hand, the code-discriminator D_ω(z) learns to distinguish the code-generator's output from the latent representations of real text. We use multilayer perceptrons (MLPs) with residual connections for both G_θ(µ) and D_ω(z) and adopt WGAN with gradient penalty (WGAN-GP) for optimization.
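For reference, the standard WGAN-GP objectives can be written with this section's symbols as follows. This is the generic formulation of WGAN-GP, not necessarily the exact form used by the authors; the penalty weight λ, the interpolation variable ẑ, and the mixing coefficient ε are notation assumed here, with z̃ = G_θ(µ) and z = f_φ(x).

```latex
\begin{aligned}
\mathcal{L}_{D}(\omega) &= \mathbb{E}_{\mu}\left[D_{\omega}(\tilde{z})\right]
  - \mathbb{E}_{x}\left[D_{\omega}(z)\right]
  + \lambda\, \mathbb{E}_{\hat{z}}\left[\left(\left\lVert \nabla_{\hat{z}} D_{\omega}(\hat{z}) \right\rVert_{2} - 1\right)^{2}\right], \\
\mathcal{L}_{G}(\theta) &= -\,\mathbb{E}_{\mu}\left[D_{\omega}\left(G_{\theta}(\mu)\right)\right], \\
\hat{z} &= \epsilon\, z + (1 - \epsilon)\, \tilde{z}, \qquad \epsilon \sim \mathcal{U}[0, 1].
\end{aligned}
```

The gradient penalty term pushes the norm of the critic's gradient toward 1 along straight lines between real and generated codes, which replaces the weight clipping of the original WGAN.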

C. ADVERSARIAL TRAINING WITH TEXTUAL OUTPUTS
Along with the adversarial training in the latent code space, we build another adversarial training loop that operates on the discrete textual outputs.
Given a code-generator with fixed weights, we model the decoder, which yields the textual outputs, as a policy and apply a policy gradient method to optimize it. The text-discriminator D_ρ is utilized to evaluate the generated sequence and provide the reward R_t. Following previous works, we use REINFORCE [23], a Monte Carlo (MC) variant of the policy gradient algorithm, for gradient estimation of the decoder training.
Since the reward signal can be calculated only when the entire sequence is completely generated, several approximation methods have been proposed to obtain an intermediate reward for each generated token. While an MC search with a rollout policy [8] is the method adopted in most research, it is computationally expensive even with a feed-forward discriminator. From our preliminary experiments, we find that a GRU-based sequential discriminator shows better performance than a CNN discriminator with MC search in terms of computation time and evaluation results. With this empirical intuition, we use a GRU-based discriminator. In the resulting policy gradient, γ is a discount factor such that 0 < γ < 1 and N is the size of the mini-batch. The overall learning procedure is shown in Algorithm 1.
As a result of adversarial training in a continuous space, the code-generator can provide a regularized latent representation of a text sequence. This effectively restricts the search space of the RL-policy decoder, acting as a guideline for generating a sentence within a bounded space. In adversarial training in a discrete space, the decoder learns to better capture the structure of text, such as a phrase, rather than the choice of words. This process helps mitigate the exposure bias of the autoencoder, which further affects the training process in the continuous space.

Algorithm 1 ConcreteGAN Training
Require: text encoder f_φ; shared decoder f_ψ; code-generator G_θ; code-discriminator D_ω; text-discriminator D_ρ; real text data x ∈ X; random noise vector µ
for each training iteration do
    (1) Train the autoencoder for reconstruction
    (2) Train the code-generator and code-discriminator adversarially in the continuous space
    (3) Train the decoder and text-discriminator adversarially in the discrete space
        Train f_ψ via policy gradient ∇_ψ
end for
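The role of the discount factor γ and the mini-batch size N in a REINFORCE-style update can be sketched as follows. The return definition used here, R_t as the discounted sum of per-token discriminator scores from step t onward, is a common choice and an assumption of this sketch rather than the paper's confirmed formula; the function names and toy values are likewise illustrative.

```python
def discounted_returns(step_rewards, gamma):
    """R_t = sum over s >= t of gamma^(s - t) * r_s, computed right to left."""
    returns = [0.0] * len(step_rewards)
    running = 0.0
    for t in range(len(step_rewards) - 1, -1, -1):
        running = step_rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_loss(batch_log_probs, batch_step_rewards, gamma):
    """Surrogate loss whose gradient w.r.t. the decoder is the REINFORCE estimate.

    batch_log_probs:    per sequence, log-probabilities of the sampled tokens.
    batch_step_rewards: per sequence, per-token scores from the discriminator.
    The returns are treated as constants when differentiating.
    """
    total = 0.0
    for log_probs, rewards in zip(batch_log_probs, batch_step_rewards):
        returns = discounted_returns(rewards, gamma)
        total += sum(r * lp for r, lp in zip(returns, log_probs))
    return -total / len(batch_log_probs)  # average over the N sequences

# Rewards arriving later in the sequence are discounted when credited
# to earlier tokens.
returns = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)  # [1.5, 1.0, 2.0]
loss = reinforce_loss([[-1.0, -2.0]], [[1.0, 1.0]], gamma=0.5)  # 3.5
```

Because a sequential discriminator can emit a score after every token, no Monte Carlo rollout is needed to obtain the intermediate rewards in this sketch.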

IV. EXPERIMENTS
To demonstrate the efficacy of our proposed method, we evaluate our model on various real-world datasets. In what follows, we give a detailed description of the whole evaluation process, from the experimental settings to the experimental results. We provide a performance comparison with state-of-the-art models as well as several analyses on the code space.

A. DATASET
We carry out experiments on three standard benchmark datasets for evaluating text GANs: the COCO Image Caption (COCO) dataset [24], the Stanford Natural Language Inference (SNLI) corpus [25], and the EMNLP 2017 WMT News (EMNLP) dataset. The statistics of each dataset are presented in Table 1.
For the SNLI dataset, considering the data distribution, we set a maximum sentence length of 15 and a vocabulary size of 11k. Each dataset represents a different experimental environment, which has a critical impact on the unsupervised training of the text generation model: COCO for small-sized data with short text, SNLI for large-sized data with short text, and EMNLP for mid-sized data with long text.

B. EXPERIMENTAL SETTINGS
We implement our model using TensorFlow 1.15 and train the model for up to 200,000 iterations. For all experiments, we use the same model, loss function, and hyperparameters across the datasets, but different vocabulary sizes.

1) Autoencoder
The autoencoder is made up of an encoder GRU and a decoder GRU with 300 hidden units. We use 300-dimensional GloVe word embeddings trained on 840 billion tokens to initialize both the encoder and the decoder, and they are fine-tuned separately during training. The encoder output is normalized with l2-normalization. The input to the decoder is augmented by the output of the previous time step with a residual connection at every decoding time step. Additive Gaussian noise is injected into the encoder output and decays with a factor of 0.995 every 100 iterations. We use the ADAM [26] optimizer with an initial learning rate of 1e-03. Gradient clipping is applied if the norm of the gradients exceeds 5.
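The noise schedule and the gradient clipping rule described above can be sketched in a few lines. The initial noise level of 0.2 is a hypothetical value, since the paper does not state it here, and clipping by the global norm of a flat gradient list is an assumption about how the threshold of 5 is applied.

```python
import math

def noise_std(initial_std, iteration, decay=0.995, every=100):
    """Standard deviation of the code noise after stepwise exponential decay."""
    return initial_std * decay ** (iteration // every)

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a flat list of gradient values if their L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# After 250 iterations the noise has decayed twice (0.2 is hypothetical).
std_at_250 = noise_std(0.2, 250)

# A gradient of norm 10 is rescaled to norm 5; its direction is unchanged.
clipped = clip_by_global_norm([6.0, 8.0])  # [3.0, 4.0]
```

Over the full 200,000 iterations this schedule multiplies the noise level by 0.995 two thousand times, so the injected noise becomes negligible late in training.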

2) Generator & Discriminators
The code-generator and the code-discriminator are 2-layer 300-dimensional MLPs with residual connections between each layer. We use a Leaky ReLU for the activation function. The text-discriminator is a 1-layer GRU with 300 hidden units that has the same structure as the decoder. We use ADAM [26] optimizers and set the initial learning rate of the code-generator and two discriminators as 5e-06 and 5e-03, respectively. Gradient clipping is applied to the text-discriminator if the norm of the gradients exceeds 5.

C. EVALUATION METRICS
The evaluation of natural language generation models is difficult since there is no single metric to measure the quality of various features of the language. In general, there are two aspects of natural language to be evaluated: quality and diversity.

1) BLEU & Backward BLEU
Following previous works, we use the BLEU score [27] as the metric of quality. For each dataset, we first sample the same amount of generated text as the held-out test data. Then, for each generated text, the corpus-level BLEU score is calculated with the entire test data as the reference [28].
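The core of BLEU is a clipped n-gram precision of a candidate against a set of references, which the following sketch computes for a single n. It omits the brevity penalty and the geometric mean over n-gram orders, so it is not a full BLEU implementation; the function names and toy sentences are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against many references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# The test set plays the role of the reference corpus.
test_set = [["a", "man", "rides", "a", "bike"], ["a", "dog", "runs"]]
generated = ["a", "man", "runs"]

unigram_precision = modified_precision(generated, test_set, 1)  # 1.0
bigram_precision = modified_precision(generated, test_set, 2)   # 0.5
```

Backward BLEU is commonly defined by swapping the roles, scoring test sentences against the generated samples as references, so that missing modes lower the score; that definition is stated here as an assumption, since this section does not spell it out.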

2) Fréchet distance
Reference [30] proposed an automatic evaluation metric called the Fréchet InferSent Distance (FD), which evaluates the outputs of text generation models. The FD calculates the Fréchet distance between real text and generated text in the pretrained embedding space. This metric can capture both quality and diversity along with the global consistency of the text. Since the metric is known to be robust to the embedding model, as suggested in [12], we use Universal Sentence Encoder [31] to compute the sentence embedding of texts for our experiments.
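For reference, the Fréchet distance between two sets of embeddings is usually computed by fitting a Gaussian to each set and applying the closed form below. The symbols µ_r, Σ_r (mean and covariance of the real-text embeddings) and µ_g, Σ_g (those of the generated-text embeddings) are notation assumed here; µ in this block denotes a mean, not the noise vector used elsewhere in the paper.

```latex
d^{2}\left(\mathcal{N}(\mu_{r}, \Sigma_{r}),\, \mathcal{N}(\mu_{g}, \Sigma_{g})\right)
  = \left\lVert \mu_{r} - \mu_{g} \right\rVert_{2}^{2}
  + \operatorname{Tr}\left(\Sigma_{r} + \Sigma_{g} - 2\left(\Sigma_{r}\Sigma_{g}\right)^{1/2}\right)
```

The first term penalizes a shift between the two distributions, and the trace term penalizes a mismatch in their spread, which is why the metric is sensitive to both quality and diversity.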

D. EXPERIMENTAL RESULTS FOR QUALITY & DIVERSITY
We compare our model with an MLE baseline along with other state-of-the-art text GANs, such as SeqGAN [8], RankGAN [9], MaliGAN [21], and ScratchGAN [12], based on Texygen [28], which is an evaluation platform for text GANs. The MLE baseline is an RNN with an MLE objective that has the same structure as the decoder of the proposed model. In addition, we detach the text-discriminator from our model and train the remaining part with the same training strategy as for the ARAE [10], which is one of the most representative continuous-space text GANs. Our RL-detached model shows superior performance over the original ARAE model (detailed information is provided in Appendix A). We call this model ARAE* in the following sections. Every score is averaged over five runs, and the standard deviation is smaller than 0.005.
Table 2 reports the BLEU and B-BLEU scores of text GANs trained on the EMNLP dataset. While the recently proposed ScratchGAN surpasses the previous state-of-the-art text GANs by a significant margin, ConcreteGAN shows superior performance over ScratchGAN. Interestingly, our implementation of an ARAE-like model (see "ARAE*" in the table) performs better than most of the discrete-space methods. The effect of our model stands out in larger n-grams, which means that commonly used combinations of words, such as phrases, can be generated with better quality and diversity.
Then, we compare the performance on another commonly used corpus, which is a part of the original COCO image caption dataset and has a very small amount of training data. As shown in Table 3, ConcreteGAN performs better than most of the discrete-space methods in generating longer combinations of words. However, we find that all of the BLEU and B-BLEU scores of ARAE* are higher than those of discrete-space methods, including the proposed model. We conjecture that the small amount of training data (i.e., 10k samples) cannot provide enough guidance for the RNN-based RL discriminator.
To see the effect of the dataset size, we conduct an additional experiment on the SNLI dataset, which is composed of a large amount of data with short sentences (i.e., 701k samples). In addition to the MLE baseline, we choose ScratchGAN, ARAE*, and ConcreteGAN, which represent the discrete-space methods, continuous-space methods, and combined approaches, respectively, for comparison. Table 4 shows the BLEU and B-BLEU scores of these three models. With a large amount of training data, our proposed method surpasses ARAE* and achieves the best performance compared to the other text GANs and the MLE baseline.

E. EXPERIMENTAL RESULTS FOR FD SCORE
We compare the FD score between the real text distribution and the generated text distribution in the Universal Sentence Encoder embedding space. Table 5 shows the FD score of each state-of-the-art model with a different learning paradigm. Analogous to the results in Section IV-D, the FD scores of all three models on the COCO corpus are fairly high. We explain this result as a natural outcome of the lack of training data. On the other corpora with large training data, our model shows the best performance, which means that it can generate text whose distribution is the most similar to that of the real text.

F. HUMAN EVALUATION
We further conduct a human evaluation of the textual sample quality of ConcreteGAN and other methods. Following previous work [20], we randomly sample 100 sentences from each model and ask ten different people to score each sample on Amazon Mechanical Turk. We provide detailed criteria of the human evaluation in Appendix C. As shown in Table 6, the samples from ConcreteGAN are rated with the highest score compared to the state-of-the-art continuous-space and discrete-space models. Along with the experimental results in the previous sections, the human evaluation further demonstrates that the proposed method can generate human-like samples better than other methods.

G. ANALYSES OF CODE SPACE
In the previous sections, the textual outputs of various text GANs are compared using a range of measurements. To demonstrate the effectiveness of the collaborative adversarial training in both the continuous and discrete spaces, we further analyze the behavior of code-generators. As the goal of the code-generator is to imitate the real distribution of text in the latent code space, we compare the outputs of the encoder and the code-generator to examine the performance of the code-generator. We independently gather the encoder's outputs from the real text inputs (i.e., the test dataset) and the code-generator's outputs from the random noise inputs.

1) Analysis on t-SNE Space
We first visualize the code distribution with t-SNE [32] for in-depth analysis. Fig. 2 shows the t-SNE plots of four different latent code distributions: the code-generator outputs of ConcreteGAN (Ours/G), the encoder outputs of ConcreteGAN (Ours/R), the code-generator outputs of ARAE* (ARAE*/G), and the encoder outputs of ARAE* (ARAE*/R). We see that the encoders of both models map the real text to more compact latent spaces than the code-generators. Furthermore, the latent code space generated by our code-generator is more compact than that of ARAE* in the same embedding space. Considering that the two models (i.e., ConcreteGAN and ARAE*) employ the same encoder architecture, this result demonstrates that ConcreteGAN produces a compact and dense latent code distribution, which is more similar to the latent code space of real text.

2) Fréchet Distance between Latent Code Distribution
We further compare the Fréchet distance between the latent code distributions of real and synthetic text. Latent codes are obtained from the encoder and code-generator, respectively. As the codes are represented as embedding vectors, no external model for computing the sentence embedding is required. As shown in Table 7, ConcreteGAN shows a reduced Fréchet distance compared to ARAE* on both datasets: 24.7 to 15.15 on SNLI and 18.9 to 16.2 on EMNLP. These results demonstrate that our model generates latent codes better than the previous baseline by a significant margin. While the proposed model imitates the encoder better than ARAE*, we see that the gap in Fréchet distance is smaller on the EMNLP dataset than on the SNLI dataset. The average length of a sentence in EMNLP is approximately three times larger than that of SNLI, and it is more difficult for the model to encode lengthy text into a fixed-size code representation. This observation calls for future research investigating the use of a different architecture (e.g., a CNN or Transformer) for the encoder part.

V. CONCLUSION
In this paper, we propose ConcreteGAN, a novel GAN architecture for text generation. Unlike previous approaches, ConcreteGAN promotes the collaborative training of the continuous-space and discrete-space methods. The interplay between the two adversarial trainings has a complementary effect on text generation. Through the continuous-space method, our model effectively reduces the search space of the RL-policy decoder. Meanwhile, discrete-space training enables the model to capture the structure of text and thereby alleviate the exposure bias, which is caused by continuous-space methods. The experimental results on three standard benchmark datasets show that ConcreteGAN outperforms state-of-the-art text GANs in terms of quality, diversity, and global consistency.

APPENDIX A COMPARISON OF ARAE AND ARAE*
We compare our implementation of ARAE* and the original ARAE model on the SNLI dataset, since the authors of ARAE published a pretrained model. In terms of FD score, ARAE achieves 0.011, which is the same as the score of ARAE*. We further compare their BLEU and B-BLEU scores. Table 8 shows that our implementation of ARAE* outperforms the original ARAE with respect to sentence quality and diversity.

APPENDIX B GENERATED SAMPLES
We present samples generated by our proposed ConcreteGAN trained on the EMNLP, COCO, and SNLI datasets in Table 9, Table 10, and Table 11, respectively.

APPENDIX C HUMAN EVALUATION CRITERIA
The human evaluation is based on grammatical correctness and meaningfulness; any text formatting problems (e.g., capitalization, punctuation, spelling errors, extra spaces between words and punctuation marks) are ignored. Workers are asked to score each sample based on the criteria shown in Table 12.

Score    Criterion

-Excellent
It is grammatically correct and makes sense. For example: "if England wins the World Cup next year, it will be the most significant result the sport has seen in more than a decade."

-Good
It has some small grammatical errors and mostly makes sense. For example: "it is useful to have had a doctor who forced her to release him a couple of days before she was cleared."

-Fair
It has major grammatical errors, but the whole still conveys some meaning. For example: "even then once again there's a sign of that stuff is going on the way to work on Christmas eve."

-Poor
It has severe grammatical errors and the whole doesn't make sense, but some parts are still locally meaningful. For example: "we go to work for the moment in life their eyes and, i have been a different race on to go."

-Unacceptable
It is basically a random collection of words.

From 2009 to 2013, he was an Assistant Professor in the department of Computer Science at KAIST. Since 2016, he has been an Associate Professor in the department of Electrical and Computer Engineering at Seoul National University (SNU). He is an adjunct professor in the department of Mathematical Sciences, SNU. His research interests include natural language processing, deep learning and applications, data analysis and web services.