TiVGAN: Text to Image to Video Generation With Step-by-Step Evolutionary Generator

Advances in technology have led to the development of methods that can create desired visual multimedia. In particular, image generation using deep learning has been extensively studied across diverse fields. In comparison, video generation, especially on conditional inputs, remains a challenging and less explored area. To narrow this gap, we aim to train our model to produce a video corresponding to a given text description. We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. In the first phase, we focus on creating a high-quality single video frame while learning the relationship between the text and an image. As the steps proceed, our model is trained gradually on an increasing number of consecutive frames. This step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions. Qualitative and quantitative experimental results on various datasets demonstrate the effectiveness of the proposed method.


I. INTRODUCTION
In the last few years, there has been intensive research on generative models. In particular, the recent developments of variational autoencoders (VAEs) [1] and generative adversarial networks (GANs) [2] represent the forefront of rapid, abundant, and high-quality progress. Further, since the deep convolutional GAN [3], which employs a convolutional neural network (CNN), succeeded in generating realistic output using a GAN framework, many studies have reported impressive results using deep networks. It is now possible to produce photo-realistic images that are difficult to discriminate from real images, even for humans [4].
However, the number of studies on video generation is significantly lower than that on image generation, because video generation is a considerably more challenging task. Image creation is only concerned with the completeness of a single frame, whereas videos also need to consider the connectivity between frames. Even if each image has good quality, well-crafted videos cannot be generated if the continuity between adjacent frames is not guaranteed.
In addition, nearly all public video datasets are extremely diverse and unaligned, thereby further complicating the video generation process.
Conditional video generation has received little research attention, whereas the generation of conditional output for a wide variety of inputs has been widely studied in the image generation field [5]-[7]. For example, a simple one-hot vector can be used as a control code to manipulate the attributes of a resulting image [5], and there is also a network that creates photo-realistic images corresponding to a given text [8]. However, studies on text-to-video generation are lacking and generally operate at a lower resolution than text-to-image generation. Therefore, to broaden the field of video generation, we focus on conditional video generation, which has not yet been investigated thoroughly in this domain. This study introduces a new scheme for text-to-video generation tasks with GANs.
We propose a novel network that generates a video corresponding to a given description. The learning framework of our network is built on the basic concept that connected frames of a video have substantial continuity. If we can create one high-quality video frame, it will be easier to create a linked frame because the two are related. Rather than first finding a mapping function between the text and all video frames, we train our network on one image and gradually extend it to longer frame sequences (Figure 1). We call this scheme TiVGAN, which stands for Text-to-Image-to-Video Generative Adversarial Network.
In the process of progressively learning to generate a large number of adjacent frames, TiVGAN can learn to create long consecutive scenes. Our extensive experimental results show that our model not only produces an accurate video for a given text but also produces qualitatively and quantitatively sharper and better results than those presented in other comparable works.

II. RELATED WORKS
Generative image models have been studied actively in the past few years. Kingma et al. [1] suggested a reparametrization trick to derive a variational lower bound of the data likelihood. GANs [2] show promising results through adversarial training between discriminators and generators. The discriminator is trained to distinguish between the fake and real distributions, and the generator attempts to create realistic data to deceive the discriminator. Since then, generation tasks have been addressed for various datasets, such as human faces, furniture, animals, and others [3], [9], [10].
Following the research on GAN-based image generation, studies on conditional GANs have also been published with various kinds of conditional inputs. InfoGAN [5] uses a one-dimensional vector as a condition to control the output image by concatenating the code with the input noise. Maximizing the mutual information between the code and the generated image enables the network to learn interpretable representations. Furthermore, Reed et al. [11] demonstrated networks that can create text-based images by learning text feature representations and using them to synthesize images. StackGAN [8] extends this structure to stage-1 and stage-2, which enables the generation of 256 × 256 photo-realistic images after the generation of low-resolution images. Moreover, there have been many studies on whole-image translation, such as image domain transfer [12]-[14] and image manipulation [15]-[17].
In contrast, very few experiments have been conducted on video synthesis. Vondrick et al. [18] untangled the background and foreground of the scenes with two streams using 2D spatial convolutions and 3D spatio-temporal convolutions for each scene. TGAN [19] exploits two different generators for temporal vector sampling and multiple frame creation based on the acquired vectors. The developers of MoCoGAN [20] suggested the decomposition of motion and content space for effective video generation. They used a recurrent neural network for sampling from a motion subspace and concatenated the sampled features with the content vector to generate continuous frames. Our goal of text-to-video generation, on the other hand, has rarely been attempted. Li et al. [21] used a conditional VAE to generate a 'gist', which refers to the video background color and object layer; the video content and motion are created based on the gist and text. Pan et al. [22] proposed a new architecture for the text-to-video task by using 3D convolutions in their network and different types of losses. Balaji et al. [23] suggested a multi-scale text conditioning scheme with a GAN to generate desired video frames from the given text. However, despite these examples, the number of studies on text-to-video generation remains small. Therefore, we present a new architecture suitable for video generation conditioned on a text description.

III. METHOD
As emphasized in the sections above, GANs have proven able to create sharp images. From this point of view, starting our text-to-video network training with a text-to-image stage may result in more effective video generation. Based on these ideas, we decomposed the training process into two stages: Text-to-Image Generation and Evolutionary Generation. The overall flow of our proposed architecture is described in Figure 2. We begin by learning text-to-single-image generation, gradually increase the number of produced images, and repeat this training process until the desired video length is achieved. This is our key paradigm. These two stages are described in detail in subsections III-A and III-B, followed by explanations of the techniques we used to stabilize the learning.

A. TEXT-TO-IMAGE GENERATION
We aim to generate a realistic fake video $V_F = (I_1, I_2, \ldots, I_{2^n})$ that matches the given text description t using a recurrent unit R and a generator G, where $n \in \mathbb{N}$ and $I_i$ is the i-th frame of $V_F$. At this stage, we focus only on the text-to-image generation task without considering the image sequence. Our goal is then simplified to generating a single realistic image $I_1$ from t.
To train the model with text, we must first transform the text string into an encoded feature vector. We adopted the pre-trained skip-thoughts vector network [24] to encode text t into a 4,800-dimensional vector. Since the encoded vector is high-dimensional, we used principal component analysis (PCA) to derive meaningful features and reduce its dimensionality. We denote this embedded vector by $\varphi_t$.
We start with a single GRU cell R and a generator G to create one frame. Given the embedded text vector $\varphi_t$, the recurrent unit R outputs the vector $v_1 = R(z_0, (\varphi_t, z_1))$, where $z_0$ and $z_1$ are random noise vectors sampled from $N(0, 1)$. The noise $z_0$ is the initial hidden state, and $(\varphi_t, z_1)$ is the input of R. The vector $v_1$ is the source input for G, which creates the resulting image $I_1 = G(v_1)$ with the same size as the real frame. To ensure that $I_1$ matches the provided text description t and follows the distribution of real data, we train the generator G and the image discriminator $D_I$ adversarially using a GAN framework with slightly modified losses consisting of real, wrong, and fake pairs, similar to those used by Reed et al. [11]. The overall losses are explained in Section III-E.
When training the image discriminator $D_I$, the real image $X_i$ is randomly selected from the $2^n$ frames of the real video $V_R$. Therefore, G and R aim not only to create the corresponding image for the given text but also to model the image distribution of the frames in the given video dataset. After the text-to-image generation stage is trained, G can generate various images when it receives an appropriate input vector. Therefore, if R provides a meaningful sequence of vectors, G can easily generate consecutive frames, which leads to realistic video generation. R can thus act as an instructor that teaches G to synthesize consecutive frames. This initial stage is the basis of the whole training process and plays a significant role in generating a series of frames.
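The dimensionality reduction described above can be sketched as follows. This is a minimal illustration that assumes a plain SVD-based PCA; the helper names `fit_pca` and `project` are hypothetical, and random vectors stand in for the skip-thought encodings, which are not reproduced here. The target dimension of 60 is the value reported in the experiments section.

```python
import numpy as np

def fit_pca(features, n_components):
    """Return (mean, components) of a PCA fitted by SVD on centered rows."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(features, mean, components):
    """Project row vectors onto the retained principal directions."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
# Placeholder for skip-thought encodings: 200 captions x 4,800 dimensions.
text_features = rng.standard_normal((200, 4800))
mean, components = fit_pca(text_features, n_components=60)
phi = project(text_features, mean, components)
print(phi.shape)  # (200, 60)
```

Each row of `phi` plays the role of one embedded text vector $\varphi_t$.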

B. EVOLUTIONARY GENERATION
The evolutionary generation stage begins after the initial stage described in the section above has been trained. We now create a series of consecutive frames based on the model trained in the previous stage. Since G can generate various frames and R is a recurrent unit that can output a series of meaningful vectors, extending the text-to-image generation can lead to successful generation of consecutive frames.
This stage consists of steps 1 to n, and the goal of each step $m \in \{1, 2, \ldots, n\}$ is to stably generate $2^m$ consecutive frames. As the steps proceed, the number of created frames increases, and we finally reach the desired video-level generation. In each step m, $2^m$ images are obtained by iterating R and G as learned in the previous stage. Consider the generation in step 1 as an example. After the text-to-image generation, we forward R once more with $v_1$ as the hidden state and $(\varphi_t, z_2)$ as the input, where $z_2$ is noise randomly sampled from $N(0, 1)$. The next latent vector $v_2 = R(v_1, (\varphi_t, z_2))$ is then obtained from R. This vector $v_2$ is again passed through the same G to create another image, $I_2$.
Unlike in the text-to-image generation stage, the temporal consistency of the generated frames must be managed in this stage. At each step m, an m-th step-discriminator $D_{S_m}$ is newly added to discriminate sequences of $2^m$ frames. $D_{S_m}$ receives the fake input $(I_1, \ldots, I_{2^m})$ and the real input $(X_1, \ldots, X_{2^m})$, where $(X_1, \ldots, X_{2^m})$ are $2^m$ randomly selected connected frames from the real video $V_R$. Since the real input carries temporal information (the images are concatenated in their original order), the fake input must be generated with correct temporal information as well. After this training step converges, we move on to step m + 1. Then, $D_{S_m}$ is removed, and training proceeds with a new step-discriminator $D_{S_{m+1}}$.
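The recurrence $v_k = R(v_{k-1}, (\varphi_t, z_k))$, $I_k = G(v_k)$ can be sketched as below. This is only an illustration under stated assumptions: `GRUCell`, `toy_generator`, and `rollout` are hypothetical names, the weights are random and untrained, the generator is a single linear map rather than a convolutional network, and all dimensions except the 60-dimensional text vector are chosen arbitrarily.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random, untrained weights (illustration only)."""
    def __init__(self, input_dim, hidden_dim, rng):
        self.W = rng.standard_normal((3, hidden_dim, input_dim)) * 0.1
        self.U = rng.standard_normal((3, hidden_dim, hidden_dim)) * 0.1
        self.b = np.zeros((3, hidden_dim))

    def __call__(self, h, x):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])  # update gate
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])  # reset gate
        cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1.0 - z) * h + z * cand

def toy_generator(v, proj, size):
    """Stand-in for G: linear map plus tanh, reshaped to a size x size image."""
    return np.tanh(proj @ v).reshape(size, size)

def rollout(R, G, phi_t, num_frames, noise_dim, hidden_dim, rng):
    """Generate num_frames images by repeatedly applying R and then G."""
    v = rng.standard_normal(hidden_dim)         # z_0: initial hidden state
    frames = []
    for _ in range(num_frames):
        z_k = rng.standard_normal(noise_dim)    # fresh noise for every frame
        v = R(v, np.concatenate([phi_t, z_k]))  # v_k = R(v_{k-1}, (phi_t, z_k))
        frames.append(G(v))                     # I_k = G(v_k)
    return np.stack(frames)

rng = np.random.default_rng(0)
text_dim, noise_dim, hidden_dim, size = 60, 10, 32, 8
R = GRUCell(text_dim + noise_dim, hidden_dim, rng)
proj = rng.standard_normal((size * size, hidden_dim)) * 0.1
G = lambda v: toy_generator(v, proj, size)
phi_t = rng.standard_normal(text_dim)

m = 1  # evolutionary step: 2**m frames
video = rollout(R, G, phi_t, 2 ** m, noise_dim, hidden_dim, rng)
print(video.shape)  # (2, 8, 8)
```

Moving to a later step only changes `num_frames`; R and G themselves are reused, which is what allows the weights learned in earlier steps to carry over.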
To summarize, we use a different step-discriminator [...]

C. STEP-DISCRIMINATOR INITIALIZATION
[...] the adversarial learning of $D_I$ and $D_S$ can even disrupt the learning of G. To prevent this, we present an initialization scheme that starts $D_S$ from a better state than random noise.
For all $m \in \{1, 2, \ldots, n\}$, the weights of $D_{S_m}$ are initialized from the previous step-discriminator $D_{S_{m-1}}$. The image discriminator $D_I$ is used to initialize the first step-discriminator $D_{S_1}$. We designed all the step-discriminators to have the same architecture except for the number of input channels in the first layer. The only difference is that $D_{S_m}$ receives $2^m$ images as input, while $D_{S_{m-1}}$ receives $2^{m-1}$ images. Let $F^{l,k}_m$ be the weights connected to the k-th input channel of the l-th layer of step-discriminator $D_{S_m}$. Then, our step-discriminator initialization can be defined as:

$$
F^{l,k}_m = F^{l,k}_{m-1} \quad (l \geq 2), \qquad F^{1,2i-1}_m = F^{1,2i}_m = \tfrac{1}{2} F^{1,i}_{m-1} \quad (i = 1, \ldots, 2^{m-1}) \tag{1}
$$

All layers except the first are initialized with the same weights as the step-discriminator $D_{S_{m-1}}$. Only the first layer differs slightly: $F^{1,2i-1}_m$ and $F^{1,2i}_m$ are initialized to $F^{1,i}_{m-1}/2$ for all $i = 1, \ldots, 2^{m-1}$. This is illustrated in Figure 3. After the evolution from step m − 1 to step m, the appearance of the generated images should remain similar, but the number of resulting images doubles. To retain the output of the discriminator when the step changes, we divide the weight values by 2. We believe that this initialization technique helps our framework remain stable even under sudden step changes. The effect of this method is discussed further in the results section.
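The first-layer rule can be checked numerically with a short sketch. It assumes one input channel per frame (grayscale) and evaluates the kernel at a single aligned position instead of running a full convolution; `init_first_layer` and `first_layer_response` are hypothetical helper names. In the idealized case where every old frame is followed by an identical new frame, halving the duplicated weights leaves the first-layer response unchanged.

```python
import numpy as np

def init_first_layer(prev_weights):
    """Expand first-layer weights from 2^(m-1) to 2^m input channels.

    prev_weights has shape (out_channels, in_channels, kh, kw). Input channel
    i is duplicated into channels 2i-1 and 2i (1-indexed) and halved.
    """
    return np.repeat(prev_weights, 2, axis=1) / 2.0

def first_layer_response(weights, frames):
    """Kernel response at one aligned position (no sliding window)."""
    return np.einsum("oikl,ikl->o", weights, frames)

rng = np.random.default_rng(0)
w_prev = rng.standard_normal((4, 2, 3, 3))  # previous step: 2 input frames
w_new = init_first_layer(w_prev)            # current step: 4 input frames

frames_prev = rng.standard_normal((2, 3, 3))
# Idealized case: each old frame is followed by an identical new frame.
frames_new = np.repeat(frames_prev, 2, axis=0)

same = np.allclose(first_layer_response(w_prev, frames_prev),
                   first_layer_response(w_new, frames_new))
print(same)  # True
```

Real adjacent frames are similar rather than identical, so in practice the response is only approximately preserved.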

D. INDEPENDENT SAMPLES PAIRING
One of the main failure modes in GAN training is mode collapse. To mitigate this phenomenon, independent samples pairing (ISP) is applied in our training process. When generating the k-th frame ($k = 1, 2, \ldots, 2^n$), our model produces two output images $I_k^a$ and $I_k^b$ from one fixed text description t and two different randomly sampled input noises $z_k^a, z_k^b \sim N(0, 1)$. These two independently generated images are paired by concatenation along the channel dimension and form the fake pair. To make the generator create various examples corresponding to the same text t, we train the discriminator to distinguish between $(I_k^a, I_k^b, \varphi_t)$ and $(X^a, X^b, \varphi_t)$, where $X^a$ and $X^b$ are two dissimilar real images associated with the same text description. Since $X^a$ and $X^b$ are dissimilar, if a mode-collapsed generator produces very similar $I_k^a$ and $I_k^b$, the pair is easily detected as fake by the discriminator. Thus, the generator attempts to create different images from the same t to deceive the discriminator. This independent samples pairing technique is used only for the image discriminator $D_I$.
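The pairing step itself is a channel-wise concatenation, sketched below. The function `pair_gap` is a hypothetical diagnostic added only to show why a collapsed pair is easy to spot; it is not part of the discriminator, and the random arrays stand in for real and generated images.

```python
import numpy as np

def make_pair(img_a, img_b):
    """Concatenate two (C, H, W) images along the channel axis."""
    return np.concatenate([img_a, img_b], axis=0)

def pair_gap(pair):
    """Mean absolute difference between the two halves of a pair."""
    half = pair.shape[0] // 2
    return float(np.abs(pair[:half] - pair[half:]).mean())

rng = np.random.default_rng(0)
# Two dissimilar "real" images that share one text description.
real_pair = make_pair(rng.random((3, 8, 8)), rng.random((3, 8, 8)))
# A mode-collapsed generator returns the same image for both noise samples.
sample = rng.random((3, 8, 8))
collapsed_pair = make_pair(sample, sample.copy())

print(real_pair.shape)           # (6, 8, 8)
print(pair_gap(collapsed_pair))  # 0.0
```

A discriminator that sees both halves in one input can pick up on this zero gap, which is the signal that pushes the generator toward diverse outputs.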

E. TRAINING PROCEDURES
We retain the adversarial training framework of the generator and discriminator. Unlike in a conventional GAN, however, we introduce a slight modification by giving the discriminator two branches, similar to [8]. First, the input passes through several convolution layers of the discriminator to obtain a high-level feature map. Then, one branch calculates the text-image matching loss by concatenation with $\varphi_t$, and the other branch performs patch discrimination without text concatenation. Together, these two independent branches ensure that the image matches the text well while improving the image quality. We use three kinds of text-image losses to train the model for text matching, the same as those used by Reed et al. [11]. The losses are obtained from a real pair $(X, \varphi_t)$, a fake pair $(I, \varphi_t)$, and a wrong pair $(X, \hat{\varphi}_t)$, where $\hat{\varphi}_t$ is an embedded text vector that is not identical to $\varphi_t$.
The overall objective for G, R, $D_I$, and $D_S$ combines an image loss $L_I$ (Eq. 3) and a step loss $L_S$ (Eq. 4). Here, $X_i$ and $I_i$ are randomly selected frames from the real and fake videos, respectively. $D_I(I_i)$ is an image loss without text conditioning, and $D_I(I_i, \varphi_t)$ is the pair loss with text conditioning. At evolutionary step m, $D_S$ denotes $D_{S_m}$. $X_{i,S}$ is a set of $2^m$ consecutive images randomly selected from the real video, and $I_{i,S}$ is a set of $2^m$ generated fake images; the images in each set are concatenated along the channel dimension. $L_S$ is used only in the evolutionary generation stage, whereas $L_I$ is used throughout training. Our overall training algorithm is described in Algorithm 1.
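For orientation, one plausible form of these losses is written out below. This is a hedged sketch, not the exact objective: it assumes the standard GAN log-loss, the real, fake, and wrong pairs of Reed et al. [11] with unit weights, and a step-discriminator with the same two branches as the image discriminator. Here $\hat{\varphi}_t$ denotes a mismatched text embedding; the weighting of individual terms may differ in the actual method.

```latex
\begin{align}
\min_{G,R}\;\max_{D_I,D_S}\; & L_I + L_S, \\
L_I ={}& \mathbb{E}\big[\log D_I(X_i)\big]
        + \mathbb{E}\big[\log\big(1 - D_I(I_i)\big)\big] \nonumber\\
       &+ \mathbb{E}\big[\log D_I(X_i, \varphi_t)\big]
        + \mathbb{E}\big[\log\big(1 - D_I(I_i, \varphi_t)\big)\big]
        + \mathbb{E}\big[\log\big(1 - D_I(X_i, \hat{\varphi}_t)\big)\big], \\
L_S ={}& \mathbb{E}\big[\log D_S(X_{i,S})\big]
        + \mathbb{E}\big[\log\big(1 - D_S(I_{i,S})\big)\big] \nonumber\\
       &+ \mathbb{E}\big[\log D_S(X_{i,S}, \varphi_t)\big]
        + \mathbb{E}\big[\log\big(1 - D_S(I_{i,S}, \varphi_t)\big)\big]
        + \mathbb{E}\big[\log\big(1 - D_S(X_{i,S}, \hat{\varphi}_t)\big)\big].
\end{align}
```

Under this form the discriminators maximize and G and R minimize, which agrees with the update directions stated in the training algorithm.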

IV. EXPERIMENTS
To demonstrate the effectiveness of our proposed method, we conduct several experiments on three diverse datasets: KTH Action, MUG, and Kinetics. To compare with previous text-to-video methods, we reproduced the method of Pan et al. [22] and ran experiments on all three datasets. Since the other methods [21], [23] were evaluated on the Kinetics dataset, we directly compare results on Kinetics. Owing to the lack of published papers in the text-to-video area, we also employ several other existing video generation methods (TGAN [19], MoCoGAN [20]) and train them on each dataset using the same settings. For a fair comparison, we attempted to balance our model and the video generation models by adding text conditioning to each method, yielding TGAN++ and MoCoGAN++.
For all experiments, we used n = 4 steps, which means that we generated 16-frame videos. The detailed structure of the network is given in the supplementary material. The text-to-image generation stage is trained for 30k iterations, and each step of the evolutionary generation stage is trained for 15k iterations. For PCA, we reduce the dimension of the vector to 60, i.e., $\varphi_t \in \mathbb{R}^{60}$.
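The resulting schedule can be tabulated with a few lines of arithmetic. The function name `training_schedule` is hypothetical; the numbers are those stated above.

```python
def training_schedule(n_steps, image_iters, step_iters):
    """List (stage name, frames generated, iterations) for every stage."""
    schedule = [("text-to-image", 1, image_iters)]
    for m in range(1, n_steps + 1):
        schedule.append((f"step {m}", 2 ** m, step_iters))
    return schedule

schedule = training_schedule(n_steps=4, image_iters=30_000, step_iters=15_000)
for name, frames, iters in schedule:
    print(f"{name}: {frames} frame(s), {iters} iterations")

total_iters = sum(iters for _, _, iters in schedule)
print(total_iters)  # 90000
```

With n = 4, the final step produces 16 frames, and the whole schedule amounts to 30k + 4 × 15k = 90k iterations.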

A. KTH ACTION
The KTH Action dataset [25] contains six types of human actions: walking, jogging, running, boxing, hand waving, and hand clapping. Of these, we use the jogging class. In each clip, a man jogs from left to right or from right to left against one of two backgrounds. We extracted 16 frames from each video sequence and resized them to 128 × 128. A total of 200 videos (3,200 frames) are used for training. Our qualitative results are shown in Figure 4. The person in the generated video moves as described in the text description, and image quality is maintained at high resolution.
Quantitative results: For a quantitative evaluation, we use the Fréchet inception distance (FID) [26] to measure the quality of the generated images. FID measures the similarity between two image sets. We collect the 200 video frames generated by each method and then calculate the FID between each image set and the same number of video frames from the training dataset. Our TiVGAN shows the best performance, as shown in Table 1.
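For reference, FID between two feature sets with means $\mu_a, \mu_b$ and covariances $\Sigma_a, \Sigma_b$ is $\lVert \mu_a - \mu_b \rVert^2 + \mathrm{Tr}(\Sigma_a + \Sigma_b - 2(\Sigma_a \Sigma_b)^{1/2})$. The sketch below computes this quantity on arbitrary feature vectors; it assumes the features have already been extracted (random vectors stand in for Inception activations), and `frechet_distance` is a hypothetical helper name.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two (N, D) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^(1/2)) is the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative
    # when both covariances are positive semi-definite.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
feats = rng.standard_normal((500, 8))  # placeholder feature vectors

d_same = frechet_distance(feats, feats)         # identical sets: about 0
d_shift = frechet_distance(feats, feats + 3.0)  # mean shift of 3 in 8 dims: about 72
```

A constant shift leaves the covariance unchanged, so only the squared mean difference (8 × 3² = 72) contributes in the second case.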

B. MUG FACIAL EXPRESSION
MUG is a human facial expression database [27]. There are tens of people in the dataset, and each person shows seven types of facial expressions: 'happy', 'disgust', 'sad', 'neutral', 'surprise', 'fear', and 'anger'. We resized the frames to 128 × 128 resolution, and the models are trained on a total of 1,030 videos. Result images on the MUG dataset and their given captions are shown in Figure 5. We observe that our model generates sharp images corresponding to the given text.
Quantitative results: The inception score is used for the quantitative evaluation of the MUG dataset results. It is a measurement proposed by Salimans et al. [28] that evaluates the quality of a GAN by observing the diversity and classification confidence of its generated images. In this experiment, a simple 5-layer 3D convolutional neural network is used instead of an inception network owing to the limited amount of data and number of classes; each video is classified into seven classes representing human facial expressions. Table 2 shows a comparison of the results on the MUG dataset.
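The inception score is $\exp\big(\mathbb{E}_x\,\mathrm{KL}(p(y \mid x)\,\|\,p(y))\big)$, where $p(y \mid x)$ is the classifier output for a generated sample and $p(y)$ is its average over all samples. A minimal sketch follows, assuming the class probabilities have already been produced by some classifier; the function name and the synthetic probability tables are illustrative only.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (N, K) array of class probabilities p(y|x)."""
    marginal = probs.mean(axis=0, keepdims=True)  # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Seven classes, as in the MUG setting.
uniform = np.full((70, 7), 1.0 / 7.0)    # classifier is never confident
confident = np.tile(np.eye(7), (10, 1))  # confident and evenly spread

score_uniform = inception_score(uniform)      # about 1, the minimum
score_confident = inception_score(confident)  # about 7, the maximum for 7 classes
```

With K classes the score lies between 1 and K, so the values reported for a seven-class classifier are bounded by 7.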
Our results showed the highest inception score among the studied methods.

C. KINETICS
Kinetics is a large-scale, human-focused video dataset from YouTube [29]. The dataset comprises thousands of video URLs covering 600 human action classes. We used six classes from this dataset: 'snow bike', 'swimming', 'sailing', 'golf', 'kite surfing', and 'water ski', which are similar to those used in previous works [21]. We resized the frames to 128 × 128, and every model is trained on a total of 3,032 videos. For Kinetics, the recent text-to-video generation method proposed by Li et al. (T2V) [21] is also compared qualitatively.
The generated qualitative results are shown in Figure 6. The results show that our method produces much higher-quality video. Our generated images are much clearer because of their higher resolution, and, apart from the resolution difference, they also capture more distinctive features of the given text.
Quantitative results: First, the inception score is again used for the evaluation on the Kinetics dataset. A 5-layer 3D convolutional neural network is trained to classify each video into the six classes. Table 3 shows a comparison of the results on the Kinetics dataset. Our result again shows the highest inception score among the compared methods. Next, the video classification accuracy of the generated results is recorded in Table 4, following the settings of previous text-to-video works [21], [23]. Our TiVGAN achieves the highest performance, which is very close to the in-set accuracy.

V. ABLATION STUDIES
We conduct ablation studies to analyze the effectiveness of the proposed architecture. All experiments are conducted on the Kinetics dataset.

A. NEAREST NEIGHBORS
To verify that our model does not simply memorize the dataset, we present the nearest-neighbor images in Figure 7. We observe that our generated results differ from their nearest neighbors in the training set.

B. STEP-BY-STEP GENERATION
To create a video of $2^n$ frames, our network starts by generating a single frame (m = 0) and gradually doubles the number of images it creates (m = 1, 2, 3, 4). To show the advantage of our step-by-step evolutionary generation framework, we perform an ablation study with various cases. Several steps are omitted in the comparison experiments, but the total training time is the same in all cases for a fair comparison. As shown in Table 5, TiVGAN yields a higher inception score than the other settings. The results in the second and fourth rows demonstrate that the initial text-to-image generation plays a significant role in the final step. The result in the third row indicates that skipping a few steps decreases the performance of the model. These experimental results show that our step-by-step generation is critical to producing high-quality video.

C. STEP-DISCRIMINATOR INITIALIZATION
In Sec. III-C, we described our step-discriminator initialization as a training strategy to enhance the network. In this experiment, we record the training loss of $D_S$ with and without our initialization. The results are shown in Figure 8. With random initialization, $D_S$ shows an unstable loss curve at the beginning of every step. Since random initialization does not utilize what was learned previously, the loss rises rapidly when the discriminator is newly added. With the step-discriminator initialization, the loss of $D_S$ is largely unaffected by step changes. This means that the model can reliably handle the generation of a larger number of images owing to the suggested initialization of the step-discriminator.

D. INDEPENDENT SAMPLES PAIRING
We employ Independent Samples Pairing to prevent mode collapse of the generator. The effect of ISP is visualized in Figure 9. Without ISP, the generator often produces identical outputs when the same input text is given with different noise. With ISP, however, we verify that our network generates varied videos when the same text description and different random noise are given.

VI. CONCLUSION
In this paper, we proposed a new and effective learning paradigm for text-to-video generation. Beginning with the creation of a single image, our network evolves progressively to synthesize a video clip of a desired length. Additionally, several techniques were used to stabilize the training. Experimental results on the KTH, MUG, and Kinetics datasets show that our model can accomplish the given task under various conditions. Conditional video generation is still a less explored field, but we believe it will be actively researched in the near future. We hope that our work will invite more interest in this field.

FIGURE 1. Simple overview of our TiVGAN framework and generated videos. (a) The generator starts by producing a single frame and gradually evolves to create longer frame sequences from the given text. (b) Sample videos generated using our TiVGAN framework.

FIGURE 3. Illustration of the first convolution layer of the step-discriminators when the step changes. At the beginning of step m of the evolutionary generation stage, step-discriminator $D_{S_m}$ is initialized using $D_{S_{m-1}}$.
ALGORITHM 1.
(I) Text-to-image generation stage: generate a single image and train G, R, $D_I$
while not converged do
    Get $z_0, z_1 \sim N(0, 1)$, generate $I_1 = G(R(z_0, (\varphi_t, z_1)))$
    Randomly choose one real image X from $V_R$
    Update G, R ← minimize loss via Eq. 3
    Update $D_I$ ← maximize loss via Eq. 3 with independent samples pairing
end
(II) Evolutionary generation stage: at each step m, generate $2^m$ images and train G, R, $D_I$, $D_{S_m}$
for m = 1 to n do
    while not converged do
        Get $z_1, \ldots, z_{2^m} \sim N(0, 1)$, generate $I_1, \ldots, I_{2^m}$ by repeating R and G
        Randomly choose [...]
        Update $D_I$ ← maximize loss via Eq. 3 with independent samples pairing
        Update $D_{S_m}$ ← maximize loss via Eq. 4
    end
end

FIGURE 4. Qualitative results of the models trained on the KTH dataset. (a) Our generation results. (b) Comparative results with previous works.

FIGURE 5. Qualitative results for the MUG dataset. The images above and below are parts of generated video frames created from the same text description. The generated video frames match the given text description well.

FIGURE 6. Example results of text-to-video generation trained on the Kinetics dataset.

FIGURE 8. Ablation study on step-discriminator initialization. Training loss is used as a measure to demonstrate the effectiveness of step-discriminator initialization compared to random initialization. The points where the blue line rises abruptly indicate the times when the step changes.

FIGURE 9. Ablation study on Independent Samples Pairing. Two video clips are independently generated from one text input, 'swimming in the swimming pool', with different noise. With ISP (left), the two generated videos are completely different, yet both follow the text description. Without ISP, the two samples are identical despite the different input noise.

FIGURE 2. Full architecture of our proposed network, TiVGAN, and the training stages. We first train the network to generate a single image in the text-to-image generation stage, and then produce consecutive frames in an evolutionary way through further stages. Although the text-to-image generation stage uses only an image discriminator $D_I$, the evolutionary generation stage uses both an image discriminator $D_I$ and a step-discriminator $D_S$.

TABLE 1. FID score for models trained on the KTH dataset. A lower FID means that the generated images are more similar to the training data.

FIGURE 7. Ablation study on nearest neighbors. Left images are generated samples, and right images are the corresponding nearest neighbors in the training dataset.