Image Creation Based on Transformer and Generative Adversarial Networks

To address the low authenticity of images produced by existing generative models, a transformer super-resolution generative adversarial network (TransSRGAN) model based on the generative adversarial network is proposed. The generator of the model uses the Transformer Encoder sub-module as its basic module: it extracts the features of the input vector and generates a low-resolution image, which is then up-sampled by a convolutional neural network to complete image generation. The discriminator of the model uses a convolutional neural network as its basic module; it extracts image features through the convolutional neural network in order to discriminate real samples from generated fake samples. The experimental results show that the TransSRGAN model brings the distribution of the generated samples closer to that of the training samples, effectively raises the quality and authenticity of the generated samples, and enriches their diversity. No mode collapse or instability occurred during training.


I. INTRODUCTION
A generative model is a class of algorithm that learns reusable features from large unlabeled datasets and generates data that do not exist in the dataset. Generative models have been a focus of research in recent years. The earliest generative model is the variational autoencoder [1], based on variational inference and Bayesian theory. The variational autoencoder can generate not only pictures but also text [2] and audio [3]. Although the variational autoencoder is simple and effective, it tends to generate noisy data irrelevant to the training set owing to its assumption of a simple normal distribution as the original sample distribution.
With the development of deep learning, its application to generative models has begun. Deep learning not only learns the characteristics of a dataset better but also fits the real data distribution of the samples better. Among generative models, the generative adversarial network (GAN) [4] based on deep learning achieves an excellent generation effect. GANs can utilize a large number of unlabeled samples to learn good intermediate feature representations of samples. Because the GAN adapts well to samples, it is suitable for various generative tasks, such as the generation of videos [5], images [6], and audio [7] in different styles. In recent years, generative adversarial networks have also been widely used in the energy field, for example in energy scheduling [8], [9].
With the development of computer vision, an increasing number of generative models are being used to generate images. Because the network structure of a GAN determines its ability to extract sample features, a reasonable network structure enables a GAN to generate more realistic and higher-definition pictures. The exploration of reasonable GAN structures has become a research hotspot [10].

II. RELATED WORK
The variational autoencoder [1] proposed by Kingma et al. first learns reusable features from large collections of unlabeled images. It generates images that do not exist in the dataset, but the generated images are relatively blurry. Owing to the development of deep learning, GANs can generate clearer images than variational autoencoders. The GAN [4], a machine learning architecture proposed by Ian Goodfellow of the University of Montreal in 2014, is a neural network based on the minimax algorithm of game theory. A GAN consists of a discriminator and a generator. The purpose of the generator is to maximize the realism of fake samples; the purpose of the discriminator is to discriminate between real and fake samples. During training, the discriminator and generator gradually converge toward the optimal state.
Because the GAN is an unsupervised learning model, it cannot extract the features of labeled data. For labeled data, researchers have proposed many GANs with label classification, such as CGAN [11] and ACGAN [12]. ACGAN, a generative model proposed by Odena et al. in 2016, can generate images conditioned on tags. ACGAN performs supervised learning on labeled data by fusing the classification tags into the loss function.
Although a GAN can effectively generate images, it is difficult to train [13]. When a generative adversarial network is trained, two situations occur. First, if the distributions of the real and generated samples do not overlap, the gradient of the generator is always 0, so the generator is not updated. Second, the generator tends to generate repeated, safe samples, leading to mode collapse. One solution is to use methods based on integral probability metrics, such as the Wasserstein distance (WGAN) [14], kernel MMD [15], and the Cramér distance [16]. Another is to add a gradient penalty term to stabilize the training of GANs [16], [17], [18]. Among them, DRAGAN [18], proposed by Kodali et al., stabilizes GAN training by adding a gradient penalty to the discriminator, which is equivalent to restricting the discriminator D(x) to be Lipschitz continuous with constant K. Compared with WGAN, the training of DRAGAN is more stable, and its update direction coincides with the gradient direction when a momentum-based optimization algorithm is used for training.
The generator and discriminator of a GAN can be constructed using various networks. The DCGAN [19], proposed by Alec Radford, builds a GAN with multilayer convolutional neural networks (CNNs). The DCGAN dramatically improves the quality and stylistic richness of the generated results compared with a GAN built from multilayer perceptrons. However, DCGAN still has some problems. First, the generated samples deviate from the real sample distribution. Second, the generator is prone to collapse during training. To address the deviation of the generated samples from the real distribution, the style-based StyleGAN [20] was proposed by T. Karras et al. StyleGAN introduces a style network that generates images through feature fusion after learning the features of style images. StyleGAN brings the distributions of generated and real samples closer, but some defects remain, such as a single generation style and a complex network structure. Because the input of StyleGAN is not a random vector but a style vector, StyleGAN does not generate images from scratch; it is closer to style transfer. As a result, this method is not suitable for images with varying styles.
SRResNet [21], based on DCGAN, was proposed by Y. Jin et al. In contrast to the vanilla DCGAN, the generator and discriminator of this network introduce the residual neural network ResNet [22] proposed by K. He, and the CNNs of the generator and discriminator are deepened. The SRResNet generator can generate images with a higher resolution, and the discriminator has a stronger discriminative ability. Compared with the vanilla DCGAN, the model generates more realistic images. However, the model still suffers from the low authenticity of the generated samples.
At present, generative models based on GANs suffer from the generated samples deviating from the original samples. To bring the generated samples closer to the original samples, a GAN consisting of a Transformer Encoder and a CNN is proposed. This network can effectively minimize the distance between the generated and original sample distributions without resorting to style transfer, and can therefore generate more realistic images. Compared with existing models, this model adopts super-resolution technology after the transformer-based generator produces the images. Compared with DCGAN, the proposed model generates more realistic samples; compared with a GAN composed entirely of transformers, it requires less computation.

III. BACKGROUND
A. CONVOLUTIONAL NEURAL NETWORK
A convolutional neural network (CNN) is a neural network that uses convolution operations for feature extraction. A general CNN consists of pooling layers, normalization layers, convolutional layers, and fully connected layers. Each layer is usually followed by an activation function.
The convolutional layer is a neural network layer built from convolution operations. A convolution is obtained by sliding the convolution kernel over the image and multiplying and summing the corresponding pixels. The convolution operation can be expressed as:

$$c_i = w \cdot X_{[i:i+h-1]} + b \tag{1}$$

where b is the bias parameter, w is the weight matrix of the h×k convolution kernel, and X_{[i:i+h-1]} represents rows i through i+h-1 of the matrix X. Generally, a CNN consists of multiple pooling and convolutional layers, which can be expressed as:

$$\text{MultiConv}(X) = f_n(\cdots f_2(f_1(X)) \cdots) \tag{2}$$

where f is a normalization, activation, or pooling function and n is the number of layers. Normalization reduces the influence of distribution changes caused by parameter updates in the CNN and maps the data distribution to a specific interval. Normalization can be expressed as:

$$y = \frac{x - E[x]}{\sqrt{\text{Var}[x] + \epsilon}} \cdot \gamma + \beta \tag{3}$$

where Var is the variance, E is the expectation, ε is a small value that prevents the denominator from being 0, and β and γ are learnable parameters. BatchNorm calculates the expectation and variance over the batch dimension of the input, whereas LayerNorm calculates them over the channel dimension.
The activation function de-linearizes the linearly transformed feature matrix; the commonly used activation functions are ReLU, sigmoid, and tanh:

$$\text{ReLU}(x) = \max(0, x) \tag{4}$$

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}} \tag{5}$$

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{6}$$

where max is the maximum value.
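As an illustration of formulas (1) through (6), the following PyTorch sketch stacks a convolutional layer, batch normalization, and a ReLU activation; the layer sizes are illustrative assumptions, not settings from this paper.

```python
# A minimal sketch of the "convolution + normalization + activation" composition.
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),  # convolution, formula (1)
    nn.BatchNorm2d(64),                          # normalization over the batch dimension, formula (3)
    nn.ReLU(),                                   # activation, formula (4)
)

x = torch.randn(8, 3, 32, 32)    # a batch of 8 RGB images (sizes are assumptions)
print(conv_block(x).shape)       # torch.Size([8, 64, 32, 32])
```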

B. GENERATIVE ADVERSARIAL NETWORK
A generative adversarial network (GAN) can be expressed as:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \tag{7}$$

where G refers to the generator, D refers to the discriminator, z refers to the noise vector, x refers to the real data, E refers to the expectation, and P refers to the distribution function. Y. Jin proposed the new network SRResNet based on DCGAN.
The network can be represented as:

$$F(x) = W_2 \, \sigma(W_1 x) \tag{8}$$

$$H(x) = F(x) + x \tag{9}$$

where H(x) is the residual neural network mapping. ACGAN is a generative model that can generate images conditioned on tags. For the sample set, this model requires not only the sample images but also the class tag corresponding to each sample, and its loss function can be expressed as:

$$L_s = \mathbb{E}[\log P(S = \text{real} \mid X_{real})] + \mathbb{E}[\log P(S = \text{fake} \mid X_{fake})] \tag{10}$$

$$L_{cls} = \mathbb{E}[\log P(C = c \mid X_{real})] + \mathbb{E}[\log P(C = c \mid X_{fake})] \tag{11}$$

where L_s is the real-fake discrimination loss, L_cls is the classification loss, P is the distribution function, and E is the expectation. DRAGAN is a stable training method. The loss function of DRAGAN can be expressed as:

$$L_D = -\mathbb{E}_{x \sim P_{data}}[\log D(x)] - \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] + \lambda \, \Omega \tag{12}$$

$$\Omega = \mathbb{E}_{x \sim P_{data},\, \delta \sim N(0, cI)}\big[(\|\nabla_x D(x + \delta)\|_2 - K)^2\big] \tag{13}$$

where ∇ represents the gradient, N is the normal distribution, c and λ are parameters, and K is the gradient penalty parameter.
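The gradient penalty of formulas (12) and (13) can be sketched in PyTorch as follows; the Gaussian perturbation scale c and the penalty target K are illustrative assumptions of this sketch.

```python
# A hedged sketch of a DRAGAN-style gradient penalty, formula (13).
import torch

def dragan_penalty(discriminator, real_x, c=0.5, K=1.0):
    # Perturb real samples with noise, then penalize the deviation of the
    # discriminator's input-gradient norm from K at those points.
    perturbed = (real_x + c * real_x.std() * torch.randn_like(real_x)).requires_grad_(True)
    scores = discriminator(perturbed)
    grads = torch.autograd.grad(scores.sum(), perturbed, create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - K) ** 2).mean()
```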

C. TRANSFORMER
The Transformer [23] neural network was proposed by Ashish Vaswani et al. and was originally used for natural language processing. The Transformer is composed of an Encoder and a Decoder. The Encoder is responsible for encoding the data into hidden vectors; the Decoder is responsible for decoding the hidden vectors back into data. The Transformer Encoder consists of two blocks. The first block is a multi-head self-attention block. The second block is a feed-forward fully connected network with a ReLU activation function. Normalization is used before each of the two blocks, and both blocks have residual links. The Transformer Decoder has three blocks. The first block is a multi-head self-attention block. The second block is a multi-head attention block that extracts the relationship between the latent vectors and the encoder output. The third block is a feed-forward fully connected network with a ReLU activation function. Normalization is used before each block, and all three blocks have residual links.
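The following is a minimal PyTorch sketch of the pre-norm Transformer Encoder block described above (normalization before each of the two sub-blocks, residual links around both); the head count and layer widths are illustrative assumptions.

```python
# A minimal pre-norm Transformer Encoder block (a sketch, not the paper's exact code).
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=64, heads=4, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        h = self.norm1(x)                  # normalization before the attention block
        x = x + self.attn(h, h, h)[0]      # multi-head self-attention with residual link
        x = x + self.ffn(self.norm2(x))    # feed-forward block with residual link
        return x

x = torch.randn(8, 256, 64)               # batch of 8 sequences of 256 tokens
print(EncoderBlock()(x).shape)             # torch.Size([8, 256, 64])
```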

D. SELF-ATTENTION
Self-attention was first proposed by Ashish Vaswani et al. and applied in the Transformer. The difference between self-attention and the convolution operation lies in the range of the receptive field: the convolution operation has a local receptive field, whereas self-attention has a global receptive field. Self-attention describes the dependency between any two data points and can be regarded as a particular case of embedded Gaussian [24]. The input data are linearly transformed into three matrices Q, K, and V. Then Q and K are multiplied, divided by the square root of the scaling factor, passed through a Softmax, and finally multiplied with the matrix V. Self-attention can be expressed as:

$$Q = D W_q, \quad K = D W_k, \quad V = D W_v \tag{14}$$

$$\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \tag{15}$$

where W_q, W_k, W_v are trainable parameters, d_k is the scaling parameter, and D is the input data. A single self-attention allows the model to extract feature information from only one subspace; multi-head attention allows the model to extract feature information from multiple subspaces. Multi-head self-attention can be expressed as:

$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) \, W_0, \quad \text{head}_i = \text{Attention}(D W_i^Q, D W_i^K, D W_i^V) \tag{16}$$

where W_0, W^Q, W^K, W^V are trainable parameters and Concat is matrix concatenation. Multi-head attention is the result of a linear transformation applied to the concatenation of multiple attention heads.
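A from-scratch sketch of formulas (14) and (15), single-head scaled dot-product self-attention, is given below; the dimensions are illustrative assumptions.

```python
# Single-head scaled dot-product self-attention, formulas (14)-(15).
import torch
import torch.nn.functional as F

def self_attention(D, Wq, Wk, Wv):
    Q, K, V = D @ Wq, D @ Wk, D @ Wv               # formula (14): linear projections
    dk = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / dk ** 0.5   # scaled dot product
    return F.softmax(scores, dim=-1) @ V           # formula (15)

D = torch.randn(256, 64)                   # a sequence of 256 tokens, 64 channels each
Wq = Wk = Wv = torch.randn(64, 64) * 0.02
print(self_attention(D, Wq, Wk, Wv).shape)  # torch.Size([256, 64])
```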

IV. TRANSFORMER-BASED GENERATIVE MODEL TransSRGAN
To minimize the distance between the generated and original sample distributions, the TransSRGAN model based on the GAN is proposed. Unlike the vanilla GAN, the generator of TransSRGAN uses the Transformer Encoder to build its sub-modules and uses CNN sub-modules for the upsampling operation. The discriminator of TransSRGAN is constructed with CNN sub-modules. The CNN sub-module used by TransSRGAN contains only normalization and convolutional layers, with an activation function for de-linearization. TransSRGAN can be expressed as:

$$x_{lr} = \text{Transformer}(z) \tag{17}$$

$$G(z) = \tanh(\text{MultiConv}(x_{lr})) \tag{18}$$

where Transformer refers to the Transformer Encoder (Figure (2)), MultiConv refers to the multiple convolutions of formula (2), and tanh refers to the hyperbolic tangent of formula (6). The network structure of the generator is shown in Figure (1), and the structure of the Transformer Encoder is shown in Figure (2). In Figure (1), s is the convolution stride, k is the convolution kernel dimension, and n is the number of output channels. The generator contains 16 layers of Transformer Encoder, and each Transformer Encoder includes a multi-head self-attention layer composed of four self-attention heads. Constructing a generative adversarial network entirely with Transformer Encoders would result in extremely high memory consumption; therefore, a three-layer CNN is used to upsample the 16×16 image generated by the Transformer Encoder. First, the generator takes as input a 162-dimensional noise vector consisting of 128 random floating-point numbers and a 34-dimensional classification tag. The random floating-point numbers are drawn from a normal distribution with an expectation of 0 and a standard deviation of 1, and the classification tag uses 34-dimensional one-hot encoding. The noise vector then passes through a fully connected network to output a feature vector with dimensions 64×16×16. After normalization, this feature vector passes through 16 layers of Transformer Encoder and a normalization layer to produce a 16×16×64 image, which is upsampled by the three-layer CNN; the final output is a 128×128×3 image. A residual neural network is used to abstract the shallow features: it not only alleviates vanishing or exploding gradients but also improves the extraction of shallow network features.
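A hedged PyTorch sketch of this data flow follows: a 162-dimensional vector passes through a fully connected layer, 16 Transformer Encoder layers, and three PixelShuffle upsampling stages to produce a 128×128×3 image. The kernel sizes, feed-forward width, and exact token arrangement are assumptions where the text does not fix them.

```python
# A sketch of the generator's overall data flow (not the paper's exact code).
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(162, 64 * 256)
        layer = nn.TransformerEncoderLayer(d_model=256, nhead=4, dim_feedforward=512,
                                           batch_first=True, norm_first=True)
        self.encoders = nn.TransformerEncoder(layer, num_layers=16)
        self.norm = nn.LayerNorm(256)

        def up(c):  # conv to 4c channels, then PixelShuffle(2) doubles H and W
            return nn.Sequential(nn.Conv2d(c, c * 4, 3, padding=1),
                                 nn.PixelShuffle(2), nn.BatchNorm2d(c), nn.ReLU())

        self.upsample = nn.Sequential(up(64), up(64), up(64),
                                      nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

    def forward(self, z):
        x = self.fc(z).view(-1, 64, 256)    # 64 sequences of length 256
        x = self.norm(self.encoders(x))
        x = x.view(-1, 64, 16, 16)          # 16x16 low-resolution feature map
        return self.upsample(x)             # 16 -> 32 -> 64 -> 128

z = torch.cat([torch.randn(1, 128), torch.eye(34)[:1]], dim=1)  # noise + one-hot tag
print(Generator()(z).shape)  # torch.Size([1, 3, 128, 128])
```
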
Because calculating a large attention matrix consumes substantial computing resources, the network divides the data into 64 sequences, each of length 256, so that only a 64×64 attention matrix is generated. The sequences output a 16×16×64 image after passing through multiple layers of the Transformer Encoder. Because larger Transformer Encoder outputs increase the memory footprint, a CNN is used for the upsampling. Under normal circumstances, the convolution operation is a downsampling operation that reduces the height and width of the feature matrix. However, the Pixel Shuffle algorithm [25] obtains a feature matrix with r² channels through the convolution operation and then performs upsampling through periodic screening, increasing the resolution of the output image by a factor of r over the input. With these methods, the memory occupation and computation of the algorithm are effectively reduced, and the calculation speed is accelerated.
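The following snippet demonstrates the Pixel Shuffle upsampling step in isolation (here r = 2; the channel count is an illustrative assumption):

```python
# PixelShuffle upsampling: convolution to r^2 channels, then periodic screening.
import torch
import torch.nn as nn

r = 2
upsample = nn.Sequential(
    nn.Conv2d(64, 64 * r ** 2, kernel_size=3, padding=1),  # expand to r^2 * 64 channels
    nn.PixelShuffle(r),                                    # rearrange channels into space
)
x = torch.randn(1, 64, 16, 16)
print(upsample(x).shape)  # torch.Size([1, 64, 32, 32]), height and width doubled
```
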
The generator is represented as Algorithm (1), where W_1, W_2, W_3, W_4, W_q, W_k, W_v are the parameters of the generator. Before training, the initial parameter values of the generator are drawn from a normal distribution with a standard deviation of 0.02 and an expectation of 0. After training, the parameters of the generator are fixed at the generator's local optimum, and these trained parameters are used when the generator generates images. The feature vector z is a 162-dimensional noise vector, of which the first 128 entries are drawn from a normal distribution with an expectation of 0 and a standard deviation of 1, and the last 34 entries are a randomly generated one-hot code.
The discriminator is shown in Figure (3). Because adding a sigmoid activation function would cause mode collapse, the final sigmoid activation of the ACGAN-based discriminator is removed. The main structure of the discriminator is a multi-layer CNN. The input of the network is a 3×128×128 image, and the output is a 35-dimensional vector composed of a 34-dimensional tag and a 1-dimensional real-fake flag. If the real-fake flag outputs 0, the sample is discriminated as generated; if it outputs 1, the sample is discriminated as original. The initial values of all discriminator parameters obey a normal distribution with a standard deviation of 0.02 and an expectation of 0.
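A minimal sketch of such a discriminator is given below; the channel widths and the LeakyReLU activation are assumptions, while the 3×128×128 input, the 35-dimensional output, and the absence of a final sigmoid follow the description above.

```python
# A hedged sketch of a multi-layer CNN discriminator with a 35-d output.
import torch
import torch.nn as nn

def block(cin, cout):
    # a stride-2 convolution halves the spatial resolution at each stage
    return nn.Sequential(nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                         nn.BatchNorm2d(cout), nn.LeakyReLU(0.2))

discriminator = nn.Sequential(
    block(3, 64), block(64, 128), block(128, 256), block(256, 512),  # 128 -> 8
    nn.Flatten(),
    nn.Linear(512 * 8 * 8, 35),  # 1 real/fake score + 34 tag logits, no sigmoid
)

x = torch.randn(4, 3, 128, 128)
print(discriminator(x).shape)  # torch.Size([4, 35])
```
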
Because the vanilla generative adversarial network cannot perform feature extraction on data with classification tags, an auxiliary classifier is added to the vanilla GAN loss function as a weighted term. To solve the problem of mode collapse in the GAN, a gradient penalty term is likewise added to the loss function as a weighted term. Through these two schemes, the GAN can extract the features of the image classification tags while training stably. The loss function of the network is shown in formula (19):

$$L_G = -\mathbb{E}_{z \sim P_z}[\log D(G(z))] + \lambda_2 \, L_{cls}^{G} \tag{19.1}$$

$$L_D = -\mathbb{E}_{x \sim P_x}[\log D(x)] - \mathbb{E}_{z \sim P_z}[\log(1 - D(G(z)))] + \lambda_1 \, \mathbb{E}_{\sigma}\big[(\|\nabla_x D(x_\sigma)\|_2 - K)^2\big] + \lambda_2 \, L_{cls}^{D} \tag{19.2}$$

where L_D is the discriminator loss, whose first two terms are the real-fake losses, whose third term is the gradient penalty loss, and whose fourth term is the classification loss; L_G is the generator loss, whose first term is the real-fake loss and whose second term is the classification loss; λ_1 is the parameter of the gradient penalty term of formulas (12) and (13); λ_2 is the parameter of the classification loss of formulas (10) and (11); z is the noise vector; x is the real data; σ is a random variable used to sample the points x_σ at which the gradient penalty is evaluated; E is the expectation; and P is the distribution function. The model is trained with stochastic gradient descent, a batch-based optimization algorithm that makes the network parameters converge to a minimum of the loss function through continuous iteration. Owing to the limited memory size, it is impossible to process all the data in one iteration, so the training samples are split into batches.
The training process of the network is shown in Algorithm (2), where N is the normal distribution, U is the uniform distribution, BCE is the binary cross-entropy, CE is the cross-entropy, GradientDescent is the gradient descent algorithm, grad is the gradient, and GradClip is gradient clipping.

Algorithm 2 Model Training
Input: real data r, real data tags r_l, real data quantity r_n, feature vector z, discriminator D(x), generator G(x), learning rate lr, number of epochs e, batch size b, gradient clipping parameter C, gradient penalty parameter K.
Output: new discriminator parameters D_θ2, new generator parameters G_θ2.
1: D_θ, G_θ ∼ N(0, 0.0004) {Initialize generator and discriminator parameters G_θ, D_θ}
2: for j from 1 to e do
3: for i from 1 to r_n/b do
{Optimize the discriminator:}
4: x, x_l ← D(r) {Input real samples to the discriminator to get a 34-dimensional tag x_l and a 1-dimensional real-fake flag x}
5: L_rd ← BCE(x, 1) + 0.02 * CE(r_l, x_l) {After inputting the real samples, the discriminator loss L_rd is obtained}
6: x, x_l ← D(G(z)) {Input the generated samples to the discriminator to get the 34-dimensional tag x_l and the 1-dimensional real-fake flag x}
7: L_fd ← BCE(x, 0) + 0.02 * CE(z_l, x_l) {After inputting the generated samples, the discriminator loss L_fd is obtained, where z_l is the last 34 dimensions of z}
8: x_r ∼ U(0, 1) {Generate a random matrix x_r from a uniform distribution}
9: L_d ← L_rd + L_fd + 0.5 * (‖grad(D(x_r), D_θ)‖ − K) {Use formula (19.2) to calculate the discriminator loss L_d}
10: L_dgrad ← GradClip(grad(L_d, D_θ), C) {Find the gradient of the discriminator through the loss function and clip it to C}
11: D_θ2 ← GradientDescent(L_dgrad, lr) {Get the new discriminator parameters D_θ2 through gradient descent}
12: D_θ ← D_θ2 {Update the discriminator parameters D_θ}
{Optimize the generator:}
13: x, x_l ← D(G(z)) {Input generated samples through the discriminator to get the 34-dimensional tag and the 1-dimensional real-fake flag}
14: L_g ← BCE(x, 1) + 0.02 * CE(z_l, x_l) {Use formula (19.1) to calculate the generator loss L_g, where z_l is the last 34 dimensions of z}
15: L_ggrad ← grad(L_g, G_θ) {Find the gradient L_ggrad of the generator through the loss function}
16: G_θ2 ← GradientDescent(L_ggrad, lr) {Get the new generator parameters G_θ2 by gradient descent}
17: G_θ ← G_θ2 {Update the generator parameters G_θ}
18: lr ← lr * 0.1^((r_n * j + i * b) / 50000) {Update the learning rate lr}
19: end for
20: end for
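A condensed PyTorch sketch of one iteration of Algorithm 2 follows. The weights (0.02 for classification, 0.5 for the gradient penalty) and the logits-based losses follow the algorithm; `gen`, `disc`, the optimizers, `real_tags` as class indices, and the squared penalty form are assumptions of this sketch.

```python
# One discriminator/generator update (a sketch, not the paper's exact code).
import torch
import torch.nn.functional as F

def train_step(gen, disc, real, real_tags, opt_d, opt_g, K=1.0, C=1.0):
    b = real.size(0)
    z = torch.cat([torch.randn(b, 128),
                   F.one_hot(torch.randint(0, 34, (b,)), 34).float()], dim=1)
    z_l = z[:, 128:].argmax(dim=1)  # class indices of the sampled one-hot tags

    # ---- Discriminator step, formula (19.2) ----
    out_r = disc(real)
    l_rd = F.binary_cross_entropy_with_logits(out_r[:, 0], torch.ones(b)) \
         + 0.02 * F.cross_entropy(out_r[:, 1:], real_tags)
    out_f = disc(gen(z).detach())
    l_fd = F.binary_cross_entropy_with_logits(out_f[:, 0], torch.zeros(b)) \
         + 0.02 * F.cross_entropy(out_f[:, 1:], z_l)
    x_r = torch.rand_like(real).requires_grad_(True)       # random matrix from U(0, 1)
    g = torch.autograd.grad(disc(x_r)[:, 0].sum(), x_r, create_graph=True)[0]
    l_d = l_rd + l_fd + 0.5 * (g.view(b, -1).norm(2, dim=1) - K).pow(2).mean()
    opt_d.zero_grad(); l_d.backward()
    torch.nn.utils.clip_grad_norm_(disc.parameters(), C)   # gradient clipping to C
    opt_d.step()

    # ---- Generator step, formula (19.1) ----
    out = disc(gen(z))
    l_g = F.binary_cross_entropy_with_logits(out[:, 0], torch.ones(b)) \
        + 0.02 * F.cross_entropy(out[:, 1:], z_l)
    opt_g.zero_grad(); l_g.backward(); opt_g.step()
    return l_d.item(), l_g.item()
```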

V. EXPERIMENT AND RESULT ANALYSIS
In this section, the SRResNet model [21] proposed by Y. Jin et al. is used as the baseline, and experiments show that TransSRGAN has better generation ability.

A. DATASET
The training set consists of 43,740 pictures of anime characters crawled from the Internet. These images come from different characters and have different resolutions. To extract the character avatars, the AdaBoost-based Lbpcascade Animeface detector is used to locate and crop the faces in the full pictures. The face images are then uniformly scaled to a size of 128 × 128.
Illustration2Vec is used to classify the samples in the training set. This is a CNN-based classifier that can add categorical tags to images (e.g., ''green hair'', ''red eyes''). To compare TransSRGAN with the baseline model, 34 tags consistent with the baseline model are selected for image classification. The classification tags are shown in Table (1).

B. INCEPTION SCORE
The Inception Score (IS) is introduced to evaluate the definition and diversity of the samples generated by generative adversarial networks. IS [26] was proposed by T. Salimans et al. in 2016. Using a pretrained InceptionV3 network, it computes the KL divergence between the conditional classification vector and its marginal distribution as a measure of the quality of the generative model:

$$\text{IS} = \exp\big(\mathbb{E}_x \, D_{KL}(P(Y \mid x) \,\|\, P(Y))\big) \tag{20}$$

where x represents a generated sample, Y represents the classification vector, and D_KL is the KL divergence. The larger the IS value, the clearer the generated images and the more diverse the generated samples.
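A hedged sketch of this computation with torchvision's pretrained InceptionV3 follows; it assumes `images` is a tensor of generated samples already resized to 299×299 and normalized for InceptionV3.

```python
# Inception Score, formula (20), as a sketch over preprocessed images.
import torch
import torch.nn.functional as F
from torchvision.models import inception_v3

@torch.no_grad()
def inception_score(images, batch_size=50):
    model = inception_v3(pretrained=True).eval()
    probs = torch.cat([F.softmax(model(images[i:i + batch_size]), dim=1)
                       for i in range(0, len(images), batch_size)])
    marginal = probs.mean(dim=0, keepdim=True)                # p(y), the marginal class distribution
    kl = (probs * (probs.log() - marginal.log())).sum(dim=1)  # KL(p(y|x) || p(y)) per image
    return kl.mean().exp().item()                             # IS = exp(E_x KL(...))
```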

C. FRÉCHET INCEPTION DISTANCE
To evaluate the realism of the samples generated by the GAN, the Fréchet Inception distance (FID) [27] is used:

$$\text{FID}(x, g) = \|\mu_x - \mu_g\|_2^2 + \text{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big) \tag{21}$$

where x and g represent the generated and real samples, Tr is the trace of the matrix, µ_x and µ_g are the means of the middle-layer features extracted from the two sample sets by the same InceptionV3 network, and Σ_x and Σ_g are the covariance matrices of those middle-layer features. The smaller the FID value, the closer the generated samples are to the real samples and the better the generation effect.
FID is more sensitive to mode collapse and more robust to noise. If the generator produces essentially a single image, the FID score will be very high. Therefore, FID evaluates not only the similarity between the generated and real sample sets but also the diversity of the generated sample set.
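Given two sets of 2048-dimensional InceptionV3 features, formula (21) can be computed as follows (a sketch using scipy for the matrix square root):

```python
# FID, formula (21), from precomputed InceptionV3 feature matrices.
import numpy as np
from scipy.linalg import sqrtm

def fid(feat_x: np.ndarray, feat_g: np.ndarray) -> float:
    mu_x, mu_g = feat_x.mean(axis=0), feat_g.mean(axis=0)
    sigma_x = np.cov(feat_x, rowvar=False)
    sigma_g = np.cov(feat_g, rowvar=False)
    covmean = sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):          # discard tiny imaginary parts from numerics
        covmean = covmean.real
    return float(((mu_x - mu_g) ** 2).sum()
                 + np.trace(sigma_x + sigma_g - 2 * covmean))
```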

D. HYPERPARAMETERS
Several groups of hyperparameters were tested in the experiment by binary search. The model is trained on 43,740 samples. Both the discriminator and the generator are optimized with the momentum-based stochastic gradient descent algorithm Adam [28]. All pictures are scaled to 128×128. To prevent a high learning rate at the end of training from making the loss jump out of the minimum and causing training to fail, a dynamic learning rate is used to stabilize the training process: the learning rate drops to 0.1 times its previous value after every 50,000 iterations. Figure (4) shows images generated by the generator of this model. A 128-bit random noise vector and a 34-bit tag vector are input to the generator; the random noise is drawn from a normal distribution with a standard deviation of 1 and an expectation of 0, and the tag vector is a randomly generated one-hot code.
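The dynamic learning rate can be sketched as follows; expressing it with torch.optim.lr_scheduler and a batch size of 64 is an assumed but equivalent-in-spirit formulation, not the paper's exact code.

```python
# Learning rate decaying by a factor of 10 every 50,000 training samples.
import torch

model = torch.nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

def decay(samples_seen: int) -> float:
    return 0.1 ** (samples_seen / 50000)  # multiplicative factor on the base lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer,
                                              lr_lambda=lambda step: decay(step * 64))
# step the scheduler once per batch of 64; after ~781 steps (50,000 samples)
# the learning rate has dropped by a factor of 10
```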

To evaluate the IS of the model, 10,000 images are generated by the generator. The generator's input vectors and tag vectors are random, which ensures that the prior distribution of the sampled tags matches that of the training dataset tags. All images are then input into InceptionV3, and each image yields a 1000-dimensional classification vector. These vectors are substituted into formula (20) to calculate the average IS over the 10,000 images. Table (3) shows the average IS of the proposed model and other models. Compared with the baseline model, the proposed model achieves a higher IS and can generate clearer and more diverse images.
To evaluate the FID of the model, 10,000 images are sampled from the real dataset, and fake data are generated with the same tags as the real data. The generator input vector is random, while the tag vector matches the real dataset, ensuring that the sample tags have the same prior distribution as the training dataset tags. All images are then input into InceptionV3, and each image yields a 2048-dimensional feature vector. The feature vectors are substituted into formula (21) to calculate the FID. Five groups of images are generated, five FIDs are calculated, and their average is taken. Table (3) shows the average FID of the TransSRGAN model and the other models. TransSRGAN achieves a lower FID than the baseline model; while generating images closer to the real samples, it also avoids mode collapse and preserves the diversity of the generated samples.

VI. CONCLUSION
To make the generative model produce more realistic images, TransSRGAN based on the GAN is proposed, which generates clear images of anime characters. The main improvement of this study is the use of the Transformer Encoder as the sub-module of the generative adversarial network's generator. After a series of self-attention calculations and feature upsampling, the generated samples are closer to the distribution of the original samples. How to further reduce the memory usage of the model while making the images more realistic remains a problem worthy of further research.