Instance Mask Embedding and Attribute-Adaptive Generative Adversarial Network for Text-to-Image Synthesis

Existing image generation models can synthesize reasonable individual objects and complex but low-resolution images; generating high-resolution images directly from complicated text remains a challenge. To this end, we propose the instance mask embedding and attribute-adaptive generative adversarial network (IMEAA-GAN). First, a box regression network computes a global layout containing the class label and location of each instance. The global generator then encodes this layout and combines it with the whole-text embedding and noise to generate a preliminary low-resolution image; an instance mask embedding mechanism guides the local refinement generators to obtain fine-grained local features and produce a more realistic image. Finally, to synthesize exact visual attributes, we introduce multi-scale attribute-adaptive discriminators, which provide the local refinement generators with specific training signals to explicitly generate instance-level features. Extensive experiments on the MS-COCO and Caltech-UCSD Birds-200-2011 datasets show that our model obtains globally consistent attributes and generates complex images with local texture details.


I. INTRODUCTION
Conditional deep generative models have made exciting progress in text-to-image generation. The widely used Generative Adversarial Networks (GANs) [1], which jointly learn generators and discriminators, have generated promising individual images on simple datasets. However, once the text contains heterogeneous objects and scenes, the quality of the generated image degrades drastically [2]. This is mainly because most existing approaches only use a global sentence embedding, without considering that each word carries a different level of image-related information. Besides, the ambiguity of text and the unknown shapes of instances make the generation process harder to constrain [3]. As a result, images generated by current models usually have low resolution and blurred texture. Moreover, instance attributes carry important image feature information [4], but existing methods use a sentence-conditional discriminator that only provides coarse-grained training feedback, making it hard for generators to disentangle different regions and learn fine-grained attributes. (The associate editor coordinating the review of this manuscript and approving it for publication was Guitao Cao.)
To address these three limitations, our proposed IMEAA-GAN harnesses a pre-trained box regression network [5] to obtain a global layout containing class labels and bounding boxes, then generates complex images from this layout through a coarse-to-fine process: the global generator initially generates a low-resolution image, and two local refinement generators hierarchically synthesize high-resolution images by combining instance-wise attention and the instance mask embedding. Additionally, our model adopts word-level attribute-adaptive discriminators to provide fine-grained feedback, so the local refinement generators can be instructed to synthesize specific visual attributes.
The contributions of this paper can be listed as follows:
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
1) To overcome the complexity and ambiguity of a whole sentence, we explicitly use word-level embeddings as input and employ a box regression network to obtain a global layout that contains spatial positions, object sizes, and class labels.
2) To make the local refinement generators learn instance-level and fine-grained features, we propose the instance mask embedding mechanism to add pixel-level mask constraints. Therefore, our generators can obtain more details and semantic information for high-resolution image generation.
3) Two word-level attribute-adaptive discriminators, instead of the commonly used sentence-conditional discriminator, are employed to classify each attribute independently and provide exact signals for the generators to synthesize certain visual attributes.

II. RELATED WORK
As one of the most commonly used image generation models, GANs include generators and discriminators. The generator is mainly used to learn pixel distributions and generate realistic images, while the discriminator should distinguish the received images as real or fake. They continually update in order to achieve dynamic equilibrium [6].
Many methods based on GANs have been proposed to improve image quality, with many possible input types. Zhu et al. [7] showed how to modify images using sketches. Based on this, Lu et al. [8] adopted a contextual GAN to synthesize images from sketch constraints. Similarly, Huang et al. [9] proposed an image-to-image translation model. To synthesize images from category labels, Brock et al. [10] introduced a class-conditional model. Sharma et al. [11] improved text-to-image generation by using dialogue. To cope with the complexity of input text, Johnson et al. [12] proposed the sg2im method, which converts the input text into scene graphs for image generation.
Among these various inputs, text is the easiest and most convenient to manipulate. An increasing number of researchers have shown interest in text-to-image generation, and research in this area mainly falls into two directions.

A. SINGLE-STAGE TEXT-TO-IMAGE GENERATION
Many approaches directly generate images from text without intermediate representations. For example, Reed et al. [13] achieved simple image synthesis directly from captions without reasoning about semantic layouts. By contrast, Dong et al. [14] fed both the image and the text into a conditional GAN (CGAN) to generate manipulated contents. Based on CGAN, Li et al. [15] proposed Triple-GAN, which contains an extra classifier to label the generated image with its matching text for data augmentation; the labeled image-text pairs can then be used as training data. Similarly, Dash et al. [16] proposed TAC-GAN to generate diverse images by distinguishing real images from generated ones and classifying real images into their true classes. Nguyen et al. [17] introduced PPGN, which is similar to TAC-GAN and contains a conditional network, to generate images from captions. Furthermore, based on conditional GANs, Cha et al. [18] improved the adversarial training process by forming positive-negative label pairs and employing an auxiliary classifier to predict the semantic consistency of a given image-caption pair.
All of these models produce diverse images directly from descriptions, and their main focus is not on synthesizing high-resolution images, so they only use single-stage generation.

B. MULTI-STAGE TEXT-TO-IMAGE GENERATION
Since it is difficult to directly generate high-quality images from complex text, Denton et al. [19] adopted LapGAN to generate images by constructing a Laplacian pyramid framework. However, this model still has limitations; the most obvious one is that its deep networks increase the training difficulty, resulting in model collapse. To solve this problem, Zhang et al. [20] employed StackGAN, which contains two generators to synthesize images in two stages. Afterward, they improved on this architecture with StackGAN++ [21], which is designed as a tree structure. However, these two models only encode the text into a single sentence vector for image generation. Similar to scene graphs, Hong et al. [22] introduced the text2img method, utilizing inferred layouts to generate images; Li et al. [23] also obtained graphic layouts with wireframe discriminators. Given a coarse layout, Zhao et al. [24] generated images by disentangling each instance into a certain label part and an uncertain appearance part. Hinz et al. [25] evaluated the detection frequency of objects and synthesized multiple instances at various spatial locations based on an object pathway. Likewise, Li et al. [26] improved the grid-based attention mechanism by coupling attention with the layout. To minimize the differences between real and fake images, Yuan and Peng [27] proposed symmetrical distillation networks. Then Sun and Wu [28] put forward a new feature normalization approach to synthesize visually different images from given layouts. Xu et al. [29] introduced AttnGAN, which aggregates the attention mechanism [30] and the DAMSM loss into text-to-image generation.
However, AttnGAN only leverages a global sentence vector and treats all instances equally, so it may miss detailed instance-level information. Our local refinement generators are able to uncover such differences by applying the instance mask embedding. Moreover, the proposed word-level attribute-adaptive discriminators have the capacity to disentangle each attribute independently in order to instruct the two local refinement generators to synthesize certain visual attributes.

A. BOX REGRESSION NETWORK
A box regression network can effectively infer scene layouts from descriptions or scene graphs [31]. This network takes a sentence embedding or final object embedding as input and outputs the predicted bounding boxes $B_{1:T} = \{B_1, B_2, \cdots, B_T\}$. The $t$-th bounding box is parameterized as $B_t = (\mathbf{b}_t, \mathbf{l}_t)$, where $\mathbf{b}_t = (b_t^x, b_t^y, b_t^w, b_t^h)$ indicates the location $(x, y)$ and size $(w \times h)$ of the related object and $\mathbf{l}_t \in \{0, 1\}^{L+1}$ represents the one-hot class label of the $t$-th box. We define $L$ as the number of real object categories and the $(L+1)$-th label as an end-of-text indicator. The joint probability is calculated as:
$$p(B_t) = p(\mathbf{b}_t \mid \mathbf{l}_t)\, p(\mathbf{l}_t), \tag{1}$$
where $p(\mathbf{b}_t \mid \mathbf{l}_t)$ is the box coordinate probability and $p(\mathbf{l}_t)$ represents the label distribution. It is hard to directly model the joint probability since it contains various parameters. Therefore, the coordinate probability of the $t$-th box is decomposed as:
$$p(\mathbf{b}_t \mid \mathbf{l}_t) = p(b_t^x, b_t^y \mid \mathbf{l}_t)\, p(b_t^w, b_t^h \mid b_t^x, b_t^y, \mathbf{l}_t), \tag{2}$$
where $p(b_t^x, b_t^y \mid \mathbf{l}_t)$ and $p(b_t^w, b_t^h \mid b_t^x, b_t^y, \mathbf{l}_t)$ are implemented by two bivariate Gaussian mixtures:
$$p(b_t^x, b_t^y \mid \mathbf{l}_t) = \sum_{k=1}^{K} \pi_{t,k}^{xy}\, \mathcal{N}\big((b_t^x, b_t^y);\, \mu_{t,k}^{xy},\, \Sigma_{t,k}^{xy}\big), \tag{3}$$
$$p(b_t^w, b_t^h \mid b_t^x, b_t^y, \mathbf{l}_t) = \sum_{k=1}^{K} \pi_{t,k}^{wh}\, \mathcal{N}\big((b_t^w, b_t^h);\, \mu_{t,k}^{wh},\, \Sigma_{t,k}^{wh}\big), \tag{4}$$
where $K$ indicates the number of mixture components and, conditioned on the label $\mathbf{l}_t$ of the $t$-th object, $\pi_{t,k}^{xy}, \pi_{t,k}^{wh} \in \mathbb{R}$, $\mu_{t,k}^{xy}, \mu_{t,k}^{wh} \in \mathbb{R}^2$, and $\Sigma_{t,k}^{xy}, \Sigma_{t,k}^{wh} \in \mathbb{R}^{2\times 2}$ are parameters of the Gaussian Mixture Model (GMM) [32], [33]. These parameters are computed from the outputs of an LSTM at each step, where $h_t$ is the hidden state and $c_t$ is the $t$-th cell state; $\pi_{t,k}^{xy}, \pi_{t,k}^{wh}$, $\mu_{t,k}^{xy}, \mu_{t,k}^{wh}$, and $\Sigma_{t,k}^{xy}, \Sigma_{t,k}^{wh}$ are obtained as affine projections of $h_t$. Inspired by the recent progress of the box regression network, we explicitly use it to predict locations for various instances. Different from sg2im [12] and text2img [22], we use word embeddings, rather than final vectors computed by a graph convolutional network [34] or a sentence vector, as input to obtain bounding boxes. Each box in our model not only predicts the location but also indicates the size and class label of each instance, which greatly differs from sg2im [12]; the global layout is thus synthesized for further multi-stage generation.
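As a concrete illustration of the bivariate Gaussian mixtures described above, a coordinate pair can be drawn from the predicted mixture by first sampling a component index from the mixture weights and then sampling from the chosen Gaussian. The following is a minimal NumPy sketch with toy parameters; `sample_box_from_gmm` and all shapes are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def sample_box_from_gmm(pi, mu, sigma, rng):
    """Sample a 2-D quantity, e.g. (x, y) or (w, h), from a bivariate GMM.

    pi:    (K,) mixture weights, summing to 1
    mu:    (K, 2) component means
    sigma: (K, 2, 2) component covariances
    """
    k = rng.choice(len(pi), p=pi)            # pick a mixture component
    return rng.multivariate_normal(mu[k], sigma[k])

rng = np.random.default_rng(0)
pi = np.array([0.7, 0.3])
mu = np.array([[0.5, 0.5], [0.2, 0.8]])
sigma = np.tile(np.eye(2) * 1e-4, (2, 1, 1))  # near-deterministic toy GMM

xy = sample_box_from_gmm(pi, mu, sigma, rng)  # box centre (x, y)
wh = sample_box_from_gmm(pi, mu, sigma, rng)  # box size (w, h)
```

In the network itself, the GMM parameters would come from affine projections of the LSTM hidden state at each decoding step rather than being fixed constants.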

B. MASK REGRESSION NETWORK
The mask regression network [35] has been used for mask segmentation in many computer vision tasks, and Hong et al. [22] constructed shape masks from captions for image generation. As shown in Fig. 1, the mask regression network encodes the bounding box tensor into a binary one $B_t \in \{0, 1\}^{h \times w \times l}$, where $h \times w$ represents the instance size and $l$ is the category label. After a down-sampling block, the encoded features are fed into a Bi-LSTM and concatenated with the noise $z$. The binary tensor $B_t$ is set to 1 if and only if the bounding box contains the related class label; all parts outside the box are set to 0. After this mask operation, the masked features are fed into a residual unit, which gives the network a deeper encoding ability via the skip connection [36]. Afterward, the predicted segmentation mask $p_t \in \mathbb{R}^{h \times w}$, with all elements in the range (0, 1), is obtained through several up-sampling layers for image generation.
Contrary to previous methods that use segmentation mask annotations for both low-resolution and high-resolution image synthesis, our approach employs the predicted pixel-level instance masks only as constraints on the two identical local refinement generators, so that their up-sampling paths preserve the capacity to refine local texture details. Hence, the synthesized instances are coherent with the inferred masks while discarding ambiguous features and retaining pixel-level details.
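The mask-construction step can be sketched as follows: a predicted box is rasterised into a binary map, which a learned head would then refine into a soft mask in (0, 1). This is a toy NumPy sketch; `box_to_binary_tensor` and the sigmoid standing in for the down/up-sampling layers are assumptions, not the actual network.

```python
import numpy as np

def box_to_binary_tensor(h, w, box):
    """Rasterise a bounding box (x, y, bw, bh), all in [0, 1] relative
    coordinates, into a binary h x w map: 1 inside the box, 0 outside."""
    x, y, bw, bh = box
    mask = np.zeros((h, w))
    x0, y0 = int(x * w), int(y * h)
    x1, y1 = int((x + bw) * w), int((y + bh) * h)
    mask[y0:y1, x0:x1] = 1.0
    return mask

B_t = box_to_binary_tensor(8, 8, (0.25, 0.25, 0.5, 0.5))
# A learned mask head would refine B_t into a soft mask p_t in (0, 1);
# here a sigmoid stands in for those up-sampling layers.
p_t = 1.0 / (1.0 + np.exp(-(B_t - 0.5) * 4.0))
```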

IV. IMEAA-GAN
The proposed IMEAA-GAN performs text-to-image synthesis in three steps: the box regression network infers a global layout to obtain the categories, sizes, and locations of objects; the global generator generates a relatively low-resolution global image from this layout; and two local refinement generators finally synthesize high-resolution, photo-realistic images.

A. GLOBAL LAYOUT GENERATION
We employ the box regression network to initially infer a global layout L from word-level embedding vectors. The global layout, as an intermediate representation, contains the corresponding bounding boxes for the related instances. The generation process of a global layout is illustrated in Fig. 2.
The box regression network is designed as an encoder-decoder architecture. Firstly, our IMEAA-GAN takes the text as input; with a pre-trained Bi-LSTM used as the text encoder, the whole text is encoded into word embedding vectors as well as a global text embedding $\varphi$. Every word is related to two hidden states, and we concatenate the two states to represent the semantic information of that word. A feature matrix of all the words is thus obtained, where each column represents a word feature vector. At the same time, we concatenate the last hidden states of the two directions to get the global text embedding $\varphi$. We then take an LSTM [37] as the decoder to approximate, for each instance, the class label $\mathbf{l}_t$ and the coordinates $\mathbf{b}_t$ via the GMM parameters mentioned in function (1). To achieve this, we decompose the conditional joint probability as:
$$p(B_{1:T} \mid \varphi) = \prod_{t=1}^{T} p(\mathbf{l}_t \mid B_{1:t-1}, \varphi)\, p(\mathbf{b}_t \mid \mathbf{l}_t, B_{1:t-1}, \varphi), \tag{5}$$
where $T$ is the number of instances. We first predict the category $\mathbf{l}_t$ for the $t$-th object, then compute $\mathbf{b}_t$ based on $\mathbf{l}_t$. The class label $\mathbf{l}_t$ is calculated by a softmax:
$$\mathbf{l}_t \sim \mathrm{softmax}(\mathbf{e}_t), \tag{6}$$
and the coordinates $\mathbf{b}_t$ are modeled by the GMM:
$$p(\mathbf{b}_t \mid \mathbf{l}_t) = \sum_{k=1}^{K} \pi_{t,k}\, \mathcal{N}(\mathbf{b}_t;\, \mu_{t,k},\, \Sigma_{t,k}), \tag{7}$$
where $\mathbf{e}_t$ is the softmax logit calculated from the $t$-th step outputs of the LSTM. Similarly, the parameters $\pi_{t,k} \in \mathbb{R}$, $\mu_{t,k} \in \mathbb{R}^4$, and $\Sigma_{t,k} \in \mathbb{R}^{4 \times 4}$ mentioned in functions (3) and (4) are computed in the same way, with $K$ indicating the number of mixture components. Finally, a global layout $L$ that includes box coordinates and class labels for all entities is generated.
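The Bi-LSTM encoding step described above (concatenating the two directions at each word, and concatenating the final states of both directions for the global embedding) can be sketched with stand-in hidden states; the shapes here are illustrative assumptions.

```python
import numpy as np

T, d = 5, 4                            # words, hidden size per direction
rng = np.random.default_rng(1)
h_fwd = rng.standard_normal((T, d))    # forward hidden states (stand-ins)
h_bwd = rng.standard_normal((T, d))    # backward hidden states (stand-ins)

# word embedding: concatenate the two directions at each step
word_emb = np.concatenate([h_fwd, h_bwd], axis=1)      # (T, 2d)

# global text embedding phi: concatenate the last state of each direction;
# the backward direction's last state corresponds to the first token
phi = np.concatenate([h_fwd[-1], h_bwd[0]], axis=0)    # (2d,)
```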

B. IMAGE GENERATION
Our IMEAA-GAN takes advantage of the multi-stage text-to-image generation strategy [38]. Although other methods, such as Obj-GAN [26], also use multi-stage generation, they are not robust to complex and ambiguous descriptions, and pixel-level features are not sufficiently exploited for image synthesis.
Obj-GAN achieves image-level semantic consistency. However, during the generation process, Obj-GAN uses segmentation mask annotations for both low-resolution and high-resolution image synthesis, and collecting these annotations is labor-intensive. In addition, applying them to low-resolution image generation cannot efficiently improve image quality, since these images are not finely synthesized and their features behave more like random vectors. By contrast, our approach calculates the pixel-level instance mask embedding instead of collecting mask annotations. More importantly, we adopt the instance mask embedding only in the two local refinement generators. In this way, our IMEAA-GAN retains both the capability of capturing visual features and the flexibility of generating fine-grained instances. Given a coarse layout $L_0$, the global generator $G_0^{img}$ initially generates an image $I_0$ with 64 × 64 resolution. Then the local refinement generator $G_1^{img}$ employs instance-wise attention and the instance mask embedding to refine different regions of the first generated image and synthesize a high-quality image. Two local refinement generators with the same architecture are utilized for generating higher-resolution images. For brevity, we do not show the generation process of the 256 × 256 image, as it is the same as that of the 128 × 128 image.

1) GLOBAL GENERATOR
The global layout provides the semantic structure of the corresponding text. Fig. 3 shows that, given a pre-generated layout $L_0$, the global generator $G_0^{img}$ is designed to produce an image that conforms to both the layout and the text.
We first compute the global layout embedding $\mu_0 \in \mathbb{R}^{h \times w \times d}$ by down-sampling the global layout $L_0$, and add the noise $z$ by spatial replication and depth concatenation. The text embedding $\varphi$ calculated by the pre-trained Bi-LSTM in the box regression network, the layout encoding $\mu_0$, and the noise $z$ are concatenated and fed into a residual unit implemented by several residual layers. Our model jointly aggregates the bounding box and text information into a latent feature representation, and we further apply one up-sampling layer to generate the global hidden feature vector $y_0$ from the latent representation. After the final 3 × 3 convolution layers, the global image with 64 × 64 resolution is initially generated. Specifically:
$$y_0 = F_0(z, \varphi, \mu_0), \qquad I_0 = G_0^{img}(y_0), \tag{8}$$
where $F_0$ is modeled as a neural network and $y_0$ is the global hidden-layer feature vector. Conditioned on $y_0$, the global generator $G_0^{img}$ then generates the low-resolution image $I_0$.
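The assembly of the global generator's input can be illustrated as follows: spatial replication of the text embedding and noise, depth concatenation with the layout encoding, and an up-sampling step. All shapes are toy assumptions, and nearest-neighbour repetition stands in for the learned up-sampling layer.

```python
import numpy as np

def spatial_replicate(vec, h, w):
    """Tile a 1-D vector over an h x w grid -> (h, w, len(vec))."""
    return np.broadcast_to(vec, (h, w, vec.shape[0])).copy()

h, w = 4, 4
mu0 = np.zeros((h, w, 16))              # layout encoding (down-sampled L0)
phi = np.ones(8)                        # global text embedding (stand-in)
z = np.random.default_rng(0).standard_normal(8)   # noise

# depth concatenation of layout, replicated text embedding, and noise
x = np.concatenate([mu0,
                    spatial_replicate(phi, h, w),
                    spatial_replicate(z, h, w)], axis=-1)    # (4, 4, 32)

# nearest-neighbour up-sampling stands in for the learned up-sampling layer
y0 = x.repeat(2, axis=0).repeat(2, axis=1)                   # (8, 8, 32)
```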

2) LOCAL REFINEMENT GENERATOR
In the first stage of generation, local details are not explicitly utilized for instance-level image generation, so most synthesized images lack fine-grained features, resulting in overly smooth textures. To generate high-resolution images, we further employ the local refinement generator; the overall architecture is illustrated in Fig. 4. During the refinement process, we repeat only twice due to GPU memory limitations. With two identical local refinement generators $G_1^{img}$ and $G_2^{img}$, we first generate the 128 × 128 images and then synthesize the 256 × 256 images.

a: INSTANCE-WISE ATTENTION
Our local refinement generator is designed as an encoder-decoder structure. It first encodes the global layout $L_1$ through several down-sampling layers to obtain the layout encoding vector $\mu_1 \in \mathbb{R}^{h \times w \times d}$ ($d$ indicates the layout feature dimension). Traditional grid attention has been successfully used for image captioning [39], image-to-image translation [40], and visual question answering [41], and the attention-based generative adversarial network AttnGAN uses an attention mechanism for image generation; our two local refinement generators, however, need to encode the various context information of $L_1$ along the channel dimension. Hence, as shown at the bottom of Fig. 5, we employ instance-wise attention to select context-relevant features. Specifically, with the sub-region vectors $V_{region}$ of the pre-generated image $I_0$, our local refinement generator retrieves the relevant instance vectors from the layout $L_1$. Afterward, it assigns an instance-wise attention weight to each instance vector $V_t$ and calculates the weighted sum of the input information. The instance-wise context vector of the $t$-th object is calculated as:
$$V_{context}^{t} = w_t V_t, \qquad w_t = \frac{\exp\big(V_{region}^{\top} V_t\big)}{\sum_{j=1}^{T} \exp\big(V_{region}^{\top} V_j\big)}, \tag{15}$$
where $t \in (1, 2, \cdots, T)$ indexes the objects, and $V_t$ and $w_t$ represent the embedding vector and the attention weight of the $t$-th instance, respectively.
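A minimal sketch of the attention computation follows, assuming a dot-product score between a sub-region feature and each instance vector; the scoring function and shapes are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def instance_attention(v_region, V):
    """Score each instance vector against a sub-region feature and return
    softmax attention weights plus the weighted instance vectors."""
    scores = V @ v_region                  # (T,) dot-product scores
    w = np.exp(scores - scores.max())      # stable softmax
    w /= w.sum()
    context = w[:, None] * V               # per-instance weighted vectors
    return w, context

V = np.array([[1.0, 0.0],                  # T = 3 instance vectors
              [0.0, 1.0],
              [1.0, 1.0]])
v_region = np.array([1.0, 0.0])            # a sub-region feature of I_0
w, ctx = instance_attention(v_region, V)
```

Instances whose embeddings align with the region feature (here the first and third) receive larger weights than the unrelated one.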

b: INSTANCE MASK EMBEDDING MECHANISM
Different parts of bounding boxes may overlap during the refinement process, multiple instances may project onto the same pixel, and the output shapes do not always align with the ground truth. These problems can be treated as a spatial sampling issue, where the proposed instance mask embedding poses spatial and morphological constraints on instance feature projection.
In general, many methods use mask annotations, which are not flexible to obtain, to separately specify the shape of each instance. As a result, the generated images as a whole may present poor scene layouts even though each instance is correctly rendered. Instead, we employ the predicted pixel-level instance mask embedding for image synthesis; in this way we avoid consuming too much model capacity and destabilizing training.
As shown at the top of Fig. 5, given a global layout $L_1$, we use the mask regression network to obtain the aggregated mask $P_{global} \in \mathbb{R}^{h \times w}$. Our down-sampling block is made up of a 3 × 3 convolution (stride 2) followed by batch normalization and ReLU activation; the residual unit is implemented with three 3 × 3 convolution layers and a skip connection; and the up-sampling block consists of a 4 × 4 deconvolution (stride 2) followed by batch normalization and ReLU activation. The aggregated mask $P_{global}$ is then cropped to obtain the $t$-th instance mask embedding $P_t$. To clearly represent the overlapping parts and make the generated features comply with the instance mask embedding, the most relevant context vector should be selected by the local refinement generator.
Thus, for the $t$-th instance, we copy the instance-wise context vector $V_{context}^{t}$ onto the instance mask embedding $P_t$; the pixel-level feature vector $V$, which contains latent pixel details, is calculated by:
$$V = \max_{t = 1, \cdots, T}\big(P_t \otimes V_{context}^{t}\big), \tag{16}$$
where $\otimes$ is the vector outer product and $T$ is the number of instances in the image. When several instances cover a single pixel, we perform max-pooling to select the pixel associated with the most related instance-wise context vector $V_{context}^{t}$ and employ that pixel representation at this position.
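The projection-and-pooling step of eq. (16) can be sketched as follows. An element-wise max over instances stands in for the max-pooling selection of the most related context vector, which is a simplification; the shapes and helper name are assumptions.

```python
import numpy as np

def pixel_features(masks, contexts):
    """Project each instance's context vector onto its mask and resolve
    overlaps by max-pooling across instances.

    masks:    (T, h, w)  soft instance masks
    contexts: (T, d)     instance-wise context vectors
    """
    # outer product: each pixel inside mask t carries context vector t
    per_inst = masks[:, :, :, None] * contexts[:, None, None, :]  # (T,h,w,d)
    # element-wise max over instances resolves overlapping pixels
    return per_inst.max(axis=0)                                   # (h, w, d)

masks = np.zeros((2, 4, 4))
masks[0, :2, :2] = 1.0              # instance 0 occupies the top-left
masks[1, 1:3, 1:3] = 1.0            # instance 1 overlaps it at pixel (1, 1)
ctx = np.array([[1.0, 0.0],
                [0.0, 2.0]])
V = pixel_features(masks, ctx)
```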
Meanwhile, in order to propagate the global information from $G_0^{img}$ to $G_1^{img}$, we inject the global hidden-layer feature vector $y_0$ into the refinement stage (see Fig. 4). $y_0$, $\mu_1$, and $V$ are aggregated by concatenation along the channel dimension and subsequently fed into a residual unit. We further apply one up-sampling layer as the decoder to calculate the local hidden feature vector $y_1$. As the input to the final 3 × 3 convolution layers, the hidden-layer vector $y_1$ is subsequently mapped to an image with 128 × 128 resolution. Specifically:
$$y_1 = F_1(y_0, \mu_1, V), \qquad I_1 = G_1^{img}(y_1), \tag{17}$$
where $F_1$ is modeled as a neural network, $y_0$ is the global hidden feature vector, and $\mu_1$ represents the high-resolution layout encoding. The pixel-level feature vector $V$ and the instance-wise context vector $V_{context}$ are calculated and aggregated into the concatenation of $\mu_1$ and $y_0$ to obtain the local hidden feature vector $y_1$. The local refinement generator then outputs a high-resolution image $I_1$ conditioned on this hidden feature vector. Additionally, we apply another local refinement generator $G_2^{img}$ to finally synthesize images with 256 × 256 resolution.

3) ATTRIBUTE-ADAPTIVE DISCRIMINATOR
The discriminator should have a large receptive field to differentiate synthesized images from the ground truth [42]; this requires either larger convolution kernels or a considerably deeper network, resulting in increased model capacity and repetitive image patterns. To this end, we employ multi-scale discriminators $D_0^{img}$, $D_1^{img}$, and $D_2^{img}$ to separately train on images of different resolutions. A sentence-level discriminator is adopted for $D_0^{img}$, while the identical $D_1^{img}$ and $D_2^{img}$ are designed as word-level attribute-adaptive discriminators. Generative models tend to synthesize an ''average'' pattern instead of the related attribute features, mainly because a global sentence-wise discriminator cannot attach to a specific type of visual attribute and only provides coarse training feedback. Therefore, our attribute-adaptive discriminators $D_1^{img}$ and $D_2^{img}$ are trained to recognize each attribute and discriminate whether it exists in the synthesized image. Each attribute-adaptive discriminator is made up of word-level discriminators that disentangle different attributes with fine-grained training signals. The overall structure of the image discriminator and the proposed word-level attribute-adaptive discriminator is shown in Fig. 6.
The attribute-adaptive discriminator consists of a set of word-level discriminators $\{D_1, D_2, \cdots, D_N\}$. Given an image, the image encoder outputs image features (see Fig. 6(b)), and we apply global average pooling to all feature layers to compute the one-dimensional image feature vectors $e$. Meanwhile, we use the text encoder to get the word vectors $\{w_1, w_2, \cdots, w_T\}$ and feed them into the word-level discriminators. Taking the $t$-th word vector $w_t$ as an example, the one-dimensional sigmoid word-level discriminator $F_{w_t}$ decides whether the synthesized image contains the visual attribute related to $w_t$. Specifically, the word-level discriminator $F_{w_t}$ is represented as:
$$F_{w_t}(e_n) = \sigma\big(W(w_t)\, e_n + b(w_t)\big), \tag{19}$$
where $\sigma$ is the sigmoid function, $e_n$ represents the one-dimensional image vector of the $n$-th image feature layer, $W(w_t)$ denotes the weight matrix, and $b(w_t)$ is the bias. We also reduce the influence of less significant words in the discrimination process. For this, we apply word-level instance-wise attention to indicate the degree of correlation between a word and a visual attribute. The attention mechanism mainly has two steps: calculating the attention distribution, and computing the weighted average based on this distribution. Note that the discriminator should have a multi-scale receptive field to detect multi-scale image features; the attention distribution $\alpha_{t,n}$ is calculated as:
$$\alpha_{t,n} = \frac{\exp(S_{t,n})}{\sum_{t'=1}^{T} \exp(S_{t',n})}, \qquad S_{t,n} = v^{\top} w_t, \tag{20}$$
where $\alpha_{t,n}$ is the attention weight assigned to the $t$-th word at the $n$-th image feature layer, $S_{t,n}$ is the attention score calculated by the dot-product model, and $v$ denotes the average of the word vectors. With the attention distribution, the final score of the word-level discriminator is multiplicatively aggregated as:
$$D(I, x) = \prod_{t=1}^{T} \prod_{n=1}^{N} \big[F_{w_t}(e_n)\big]^{\gamma_{t,n}\, \alpha_{t,n}}, \tag{21}$$
where $I$ represents the generated image, $x$ denotes the text, $T$ is the total number of input words, and $\alpha_{t,n}$ represents the attention distribution. $\gamma_{t,n}$ is the softmax weight used to determine the importance of each word for layer $n$.
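One word-level discriminator can be sketched as a per-word logistic score over pooled image-feature vectors. The bilinear form `e @ W @ w_t` and all shapes are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_level_score(e, w_t, W, b):
    """One word-level discriminator: decide whether the image features `e`
    (one pooled vector per feature layer, shape (N, d)) contain the visual
    attribute tied to word vector `w_t`.  W and b are per-word parameters."""
    return sigmoid(e @ W @ w_t + b)        # (N,) one score per feature layer

rng = np.random.default_rng(0)
N, d, dw = 3, 4, 4                         # feature layers, dims (toy sizes)
e = rng.standard_normal((N, d))            # pooled image features per layer
w_t = rng.standard_normal(dw)              # word vector
W = rng.standard_normal((d, dw)) * 0.1     # per-word weight matrix
scores = word_level_score(e, w_t, W, 0.0)

# An attention distribution over words would then weight these per-layer
# scores before the multiplicative aggregation into the final decision.
```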
Hence, compared with the sentence-level discriminator, which operates at a coarse level and only determines whether the synthesized image roughly matches the text, our attribute-adaptive discriminators can provide feedback at different stages and identify the existence of related visual attributes.

C. OBJECTIVE FUNCTION
Our final objective function consists of a GAN adversarial loss [1] and a DAMSM loss [29]. The GAN cross-entropy loss $\mathcal{L}_{GAN}$ is determined by the adversarial training of the image generators and the attribute-adaptive discriminators. Both the generator and discriminator losses consist of an unconditional term and a conditional term. The generator objective is defined as:
$$\mathcal{L}_{G_i} = -\tfrac{1}{2}\,\mathbb{E}_{I \sim P_{G_i}}\big[\log D_i(I)\big] - \tfrac{1}{2}\,\mathbb{E}_{I \sim P_{G_i}}\big[\log D_i(I, x)\big], \tag{22}$$
where the first term is the unconditional loss, the second term is the conditional loss, and $I$ and $x$ denote the synthesized image and the related text, respectively. The adversarial loss for each discriminator also consists of an unconditional and a conditional term:
$$\mathcal{L}_{D_i} = -\tfrac{1}{2}\,\mathbb{E}_{I \sim P_{data}}\big[\log D_i(I)\big] - \tfrac{1}{2}\,\mathbb{E}_{I \sim P_{G_i}}\big[\log\big(1 - D_i(I)\big)\big] - \tfrac{1}{2}\,\mathbb{E}_{I \sim P_{data}}\big[\log D_i(I, x)\big] - \tfrac{1}{2}\,\mathbb{E}_{I \sim P_{G_i}}\big[\log\big(1 - D_i(I, x)\big)\big], \tag{23}$$
where $P_{data}$ represents the distribution of the ground truth. Additionally, we adopt the DAMSM loss introduced in AttnGAN to calculate a fine-grained image-text matching loss. Hence, our final objective is:
$$\mathcal{L} = \mathcal{L}_{GAN} + \lambda_1 \mathcal{L}_{DAMSM}, \tag{24}$$
where $\lambda_1$ is a hyper-parameter and $\mathcal{L}_{DAMSM}$ is the loss of the Deep Attentional Multimodal Similarity Model (DAMSM) pre-trained on ground-truth images and related descriptions. We set the learning rates of the generators and discriminators to 0.0002; the DAMSM loss weight is set to $\lambda_1 = 50$ on MS-COCO and $\lambda_1 = 5$ on CUB. We use the Adam optimizer [43] for adversarial training, with the exponential decay rates $\beta_1, \beta_2 \in [0, 1)$ for the first and second moment estimates set to 0.5 and 0.999, respectively.
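The aggregation of the final objective can be sketched numerically. `generator_loss` and `total_loss` are hypothetical helpers mirroring the unconditional/conditional split and the λ1-weighted DAMSM term; the discriminator outputs are stand-in values.

```python
import numpy as np

def generator_loss(d_uncond, d_cond):
    """Generator loss for one stage: average of the unconditional and
    conditional adversarial terms, with d_* the discriminator outputs
    in (0, 1) for synthesized images."""
    return -0.5 * np.log(d_uncond).mean() - 0.5 * np.log(d_cond).mean()

def total_loss(adv_losses, damsm_loss, lam=50.0):
    """Final objective: sum of per-stage adversarial losses plus the
    DAMSM matching loss weighted by lambda_1."""
    return sum(adv_losses) + lam * damsm_loss

# three stages (one global + two refinement generators), toy outputs of 0.5
adv = [generator_loss(np.array([0.5]), np.array([0.5])) for _ in range(3)]
L = total_loss(adv, damsm_loss=0.1, lam=50.0)
```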

1) DATASETS
We perform experiments on the MS-COCO and CUB datasets. The MS-COCO dataset [44] has pixel-level annotations and contains 82,783 training images, 40,504 validation images, and 40,775 testing images. There are 80 object categories in this dataset, and each image has 5 text descriptions with corresponding instance labels. Derived from the CUB-200 dataset, the CUB dataset [45] includes a total of 11,788 images with class labels, bounding boxes, and bird attribute information. It covers 200 bird categories, and each image has 10 descriptions of the bird's attributes. We use 150 bird categories (8,855 images) as the training set and the remaining 50 categories (2,933 images) as the testing set.
2) EVALUATION METRICS
A pre-trained Inception v3 network [48] is adopted to compute the IS and FID. The IS evaluates image quality and diversity: it measures the distinctiveness of the synthesized images and the number of object categories [49]. The FID calculates the Wasserstein-2 distance [50] between the ground-truth and synthesized images according to final-layer activations; a lower FID indicates a shorter distance between the generated and ground-truth image distributions. Therefore, the larger the IS and the smaller the FID, the better the model performance. Following AttnGAN and MirrorGAN [51], we also apply R-precision to measure the matching degree between image and text. Specifically, we randomly select 99 descriptions from the dataset, then compute the cosine distance between the generated image and each text to indicate their similarity in feature space. We sort these 100 descriptions (including the ground-truth text) and select the top k most similar ones to calculate the R-precision. In practice, we set k = 1, meaning that the R-precision indicates whether the ground-truth text matches the synthesized image more closely than the 99 randomly sampled descriptions.
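The R-precision protocol described above can be sketched as follows, with random vectors standing in for the DAMSM image and text features; the helper name and feature dimensions are assumptions.

```python
import numpy as np

def r_precision(img_feat, true_text_feat, distractor_feats, k=1):
    """Rank the ground-truth caption among random distractor captions by
    cosine similarity to the image feature; with k = 1 the score is 1
    when the ground-truth caption is the closest match."""
    cands = np.vstack([true_text_feat, distractor_feats])
    sims = cands @ img_feat / (
        np.linalg.norm(cands, axis=1) * np.linalg.norm(img_feat))
    top_k = np.argsort(-sims)[:k]
    return float(0 in top_k)            # index 0 is the ground-truth text

rng = np.random.default_rng(0)
img = np.ones(8)                        # image feature (stand-in)
true_txt = img.copy()                   # perfectly matching caption feature
distractors = rng.standard_normal((99, 8))
score = r_precision(img, true_txt, distractors)
```

Averaging this score over many generated images gives the reported R-precision.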

B. QUALITATIVE RESULTS
Our model produces high-fidelity 256 × 256 images containing complex scenes and multiple instances; Fig. 7 shows the synthesized results on MS-COCO. Conditioned on the instance mask embedding, IMEAA-GAN is able to separate instances from the background and reduce overlapping pixels. Given similar inputs, IMEAA-GAN can also synthesize various detailed attributes thanks to the attribute-adaptive discriminators. For example, the sheep in the third column of Fig. 7 show that our approach can distinguish word-level information and generate diverse images corresponding to various features. To demonstrate the generalization ability of IMEAA-GAN, we also perform experiments on the CUB dataset. As shown in Fig. 8, the generated high-quality 256 × 256 images vividly display the color and texture of different birds; with the instance mask embedding mechanism there are almost no indistinguishable instances or overlapping parts. Moreover, with the guidance of the attribute-adaptive discriminators, our images present correct and fine-grained attributes.
We adopt the multi-stage generation strategy to synthesize high-resolution images. During the refinement stage, we attempted to increase the number of stages to four. However, the training process became unstable and difficult to control due to the complexity of the deep networks and GPU memory limitations. Therefore, we apply only one global generator and two local refinement generators for optimal generation. The intermediate results of different stages on CUB and MS-COCO are illustrated in Fig. 9, which shows that IMEAA-GAN is capable of refining images to match the text. The global generator initially generates coarse-grained 64 × 64 images (e.g., Fig. 9(a)), but these synthesized images lack fine-grained textures. Then the two local refinement generators generate fine-grained images (e.g., Fig. 9(b), Fig. 9(c)). Since our generators can obtain the context-wise instance vectors, the synthesized images are further improved, containing more accurate texture features and clearer backgrounds. For example, in the second row on the right of Fig. 9, there is no short beak in the initial 64 × 64 image; our local refinement generators are able to encode the ''short beak'' information and synthesize the missing features.
Further, as illustrated in Fig. 10, we compare IMEAA-GAN with other methods conditioned on the same text. The sg2im method converts the input text into scene graphs to infer semantic layouts and achieves the synthesis of 128 × 128 images. But scene graphs lack core object attributes and spatial information (e.g. positions and sizes), so it is difficult to generate details consistent with the semantic layouts. In addition, the information conveyed by scene graphs is very limited: the features of an instance are determined not only by its position and class label but also by its interactions with others, so sg2im fails to resolve overlapping pixels or separate different object appearances.
As shown in the second row of Fig. 10, AttnGAN synthesizes 256 × 256 images. Conditioned on a sentence vector, it does not fully consider the effect of each word and assigns the same weight to all instances. Thus, lacking word-level embedding and ignoring interactions between different instances, it struggles to generate high-quality images. Besides, it uses sentence-level discriminators that only provide coarse-grained feedback, so its generators tend to generate texture associated with the wrong word. This explains why the synthesized results exhibit realistic features but lack meaningful layouts and correct attributes.
The recent MirrorGAN [51] has made great progress on complex image generation; example results are shown in the third row of Fig. 10. This method outperforms the first two models: it guarantees semantic consistency in multi-object generation, and the synthesized images match the text at the image level. Yet MirrorGAN lacks investigation of uneven instance distribution and feature occlusion, so visual appearance and instance interactions are not finely regulated. For example, the ''cattle'' in the first image of the third row have a reasonable appearance, but the ''green hillside'' is inappropriately shown as a ''dry field''.
Different from the aforementioned methods, IMEAA-GAN adopts word-level attribute-adaptive discriminators. As presented in the last row of Fig. 10, the synthesized instances have correct attributes. Besides, owing to the instance mask embedding and instance-wise attention mechanism, as well as maximum pooling over multiple pixels, overlapping pixels between different instances are resolved. The generated instances thus contain clear shapes and texture features and are more recognizable and semantically meaningful.
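The overlap-resolution idea can be illustrated with a toy example. The sketch below is not the paper's implementation (the response maps are made up): it only shows how per-pixel max pooling over per-instance response maps assigns each contested pixel to the instance with the strongest response.

```python
import numpy as np

# Toy soft response maps for two instances on a 4 x 4 grid;
# their supports overlap in the two middle columns.
resp = np.zeros((2, 4, 4))
resp[0, :, :3] = 0.8   # instance 0 occupies the left three columns
resp[1, :, 1:] = 0.6   # instance 1 occupies the right three columns

# Per-pixel max pooling over the instance axis: every overlapping
# pixel goes to the instance with the strongest response.
assignment = resp.argmax(axis=0)
pooled = resp.max(axis=0)

print(assignment[0])   # -> [0 0 0 1]: contested pixels go to instance 0
```

The pooled map keeps one unambiguous response per pixel, which is the property that lets the generators produce non-overlapping, distinguishable instances.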
We also perform comparative experiments on the CUB dataset, as shown in Fig. 11. Since sg2im mainly targets the positional relationships between different instances, whereas every image in CUB contains only a single object, we compare IMEAA-GAN only with AttnGAN and MirrorGAN.
Observing the second and third columns of Fig. 11, although both of these methods accurately capture attribute features, IMEAA-GAN better displays the main attributes and differentiates the birds from their backgrounds. In general, our approach has the capacity to synthesize individuals with more vivid details and clearer shapes. Figure 12 demonstrates that IMEAA-GAN can generate diverse images from the same input. The results contain various shapes and complex scenes, mainly owing to the word-level attribute-adaptive discriminators, which provide specific training signals. Therefore, by changing only a few words, the generators, under the guidance of the discriminators, can synthesize images with detailed attributes, and these samples look similar yet remain distinct from each other. In Table 1 and Table 2, we measure the performance of different methods in terms of IS, FID, and R-precision; the best results are in bold. On the MS-COCO and CUB datasets, compared with MirrorGAN, we increase IS by 15.19% and 4.17%, and R-precision by 2.96% and 2.63%, respectively. Compared with the officially pretrained AttnGAN, our model decreases FID by 8.03% on MS-COCO and 32.90% on CUB, which confirms that IMEAA-GAN is able to generate images with more diverse objects and higher quality than the other methods. Our model can identify the most relevant instance at positions where overlapping pixels occur, so the synthesized results are closely consistent with the global layouts and ground-truth images. Hence, as demonstrated in Table 1 and Table 2, feeding the generated results into the pre-trained Inception v3 network yields better IS and FID scores. In addition, we also obtain the highest R-precision, which indicates that the images and attributes produced by our generators are most relevant to the descriptions.
In contrast, the other methods produce many overlapping pixels and blurred objects, and the Wasserstein-2 distances between their generated samples and the ground truth are quite large, making it hard for them to adaptively disentangle the corresponding visual features under linguistic expression variants. By comparison, IMEAA-GAN greatly improves the quality and diversity of the generated images, as well as the text-image matching degree.
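For reference, FID is the Fréchet (Wasserstein-2) distance between two Gaussians fitted to Inception-v3 features of real and generated images. The sketch below assumes diagonal covariances for simplicity, and uses random stand-in features rather than real Inception activations; the full metric uses the matrix square root of the covariance product, typically via `scipy.linalg.sqrtm`.

```python
import numpy as np

def fid_diagonal(feats_real, feats_fake):
    # FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)),
    # simplified here by assuming diagonal covariances S1, S2.
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    var1, var2 = feats_real.var(axis=0), feats_fake.var(axis=0)
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum((np.sqrt(var1) - np.sqrt(var2)) ** 2)
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
real = rng.standard_normal((1000, 8))        # stand-in "Inception" features
fake = rng.standard_normal((1000, 8)) + 0.5  # shifted feature distribution
print(fid_diagonal(real, real))              # identical sets -> 0.0
print(fid_diagonal(real, fake))              # shifted set scores higher
```

Lower is better: a small FID means the generated feature distribution is close to the real one, which is why the 8.03% and 32.90% decreases reported above indicate higher image quality.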

We also observe that the IS values on CUB differ significantly from those on MS-COCO. This is because all images in CUB are birds with similar feature distributions, while MS-COCO contains different instance categories and complex scenes whose feature distributions vary greatly among objects. Therefore, the IS values on MS-COCO are generally larger than those on CUB.
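For reference, the IS is defined as IS = exp(E_x[KL(p(y|x) || p(y))]): confident per-image class predictions combined with a diverse marginal class distribution yield high scores, which is why the score saturates lower on the single-category CUB set. A minimal sketch, using made-up class probabilities rather than real Inception-v3 outputs:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    # IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ), computed from the
    # class-probability rows `probs` (shape: n_images x n_classes).
    p_y = probs.mean(axis=0, keepdims=True)   # marginal class distribution
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Confident, diverse predictions score high; uniform ones score ~1.
confident = np.eye(4).repeat(5, axis=0)   # 20 one-hot rows over 4 classes
uniform = np.full((20, 4), 0.25)
print(inception_score(confident))         # close to 4.0 (upper bound = n_classes)
print(inception_score(uniform))           # close to 1.0
```

Since the score is bounded above by the number of distinguishable classes, a many-category dataset such as MS-COCO naturally supports larger IS values than the birds-only CUB.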

D. ABLATION STUDY
To verify the effectiveness of the proposed discriminators, we visualize our word-level attribute-adaptive discriminators, as shown in Fig. 13(a). For comparison, we adopt two commonly used sentence-level discriminators with the same structure as our baseline model; the visualization maps of the sentence-level discriminators are presented in Fig. 13(b). The highlighted regions indicate the feedback information provided by the discriminators; with this feedback, the generators are instructed to synthesize the related attributes and instances. The discriminators in the baseline model are conditioned on the whole sentence, so it is hard for them to highlight word-level regions, resulting in excessively large highlighted areas. Worse, the baseline method even omits highlighting when synthesizing certain attributes; see the images in the third row of Fig. 13(b).
All of this illustrates that sentence-level discriminators provide only coarse-grained information and fail to offer effective feedback signals. In contrast, our word-level attribute-adaptive discriminators provide the generators with detailed attribute feedback and highlight the related regions. Therefore, our generators can focus on the most relevant regions to perform pixel-level attribute generation.
Further, we demonstrate the necessity of instance mask embedding for the local refinement generators. Results of the ablated version and our full model are shown in Fig. 14(a) and Fig. 14(b), respectively. The ablated model has the same settings as our full IMEAA-GAN except that it does not use the instance mask embedding (Fig. 14(a), w/o IME). Images in Fig. 14(a) lack detailed and complete features; for example, the zebras and giraffes in the first row are synthesized with only scattered features. It is difficult for the ablated model to synthesize the corresponding instances in the correct locations, so the image accuracy and fidelity are quite low. With the instance mask embedding (Fig. 14(b), w/ IME), the synthesized images meet the shape and location constraints. Even for complex scenes, for example the giraffes in Fig. 14(b), there are almost no overlapping pixels or indistinguishable instances.

VI. CONCLUSION
In this paper, we present a novel Instance Mask Embedding and Attribute-Adaptive Generative Adversarial Network (IMEAA-GAN) for text-to-image generation. With the instance mask embedding, which provides shape constraints and resolves the overlapping problem between different pixels, our two local refinement generators are able to refine the initial image synthesized by the global generator. We also propose word-level attribute-adaptive discriminators, which focus on individual attributes and provide effective feedback to discriminate whether the generated instances match the attribute descriptions, so as to guide the generators to synthesize accurate features. Experimental results illustrate that our model is capable of generating complex images with high-fidelity attributes on different datasets. However, once the text contains various scene settings and instances, the image quality drops drastically. Our future work will focus on using knowledge graphs to infer the corresponding semantic layouts and on generating multiple high-resolution images from a single semantic layout.