Text-Guided Sketch-to-Photo Image Synthesis

We propose a text-guided sketch-to-image synthesis model that semantically mixes style and content features from the latent space of an inverted Generative Adversarial Network (GAN). Our goal is to synthesize plausible images from human facial sketches and their respective text descriptions. In our approach, we adapt a generative model, termed Contextual GAN (CT-GAN), that efficiently encodes visual-linguistic semantic features, pre-trained on over 400 million text-image pairs, at different resolutions along the model. We also introduce an intermediate mapping network called c-Map that combines textual and visual features into a disentangled latent space $\mathcal{W}^{+}$ for better feature matching. Furthermore, to maximise the computational performance of our model, we implement a linear-based attention scheme along the pipeline of our model to eliminate the drawbacks of inefficient attention modules that are quadratic in complexity. Finally, the hierarchical setting of our model ensures that textual, style and content features are synthesised based on their unique fine-grained details, which results in visually appealing images.

Sketches can be seen as images that contain minimal information bounded by pixels that could be translated into photo-realistic images by a suitable generative model. Sketches might contain key structural information that could aid in providing visual meaning, which is crucial in classifying images as valuable or not. However, sketches do not portray any style information regardless of the mode from which they are crafted, and as a result it becomes quite difficult to translate sketches into perceptually appealing images. While significant work on image synthesis is still ongoing due to its numerous applications, it is still challenging to synthesize natural-looking images from sketches or labels. Recent methods of image synthesis can be fashioned as a form of text-to-image, image-to-image or sketch-to-image synthesis, or a combination of all three techniques. The ability to synthesize high-quality images is the core goal of most generative adversarial models.

Our approach is similar to models that adapt some form of perceptual similarity for standard image-to-image generative models aimed at boosting perceptual quality. Secondly, we introduce the BERT model to properly identify and update the contextual associations between newly derived words defined by the user and the pre-trained text descriptions from the CLIP model. Finally, we deviate from previous techniques that apply a less efficient attention scheme and rather adapt a novel linear-based attention model that computationally ensures better feature interactions at the semantic level. To elaborate on our method further, we trained an encoder capable of obtaining latent codes that align with the hierarchically semantic arrangement of a pre-trained StyleGAN model, inspired originally by [8]. We break down our method into three core contributions.

• We present a visual-linguistic inversion module structured in a hierarchical fashion, where the inverted code of a given sketch image and learned text descriptions can be found in the W latent space of a StyleGAN generator.
• Secondly, to aid application-specific implementations, we incorporated a text-based encoder module to update the contextual associations between newly derived text descriptions and the pretrained features of the CLIP model.

• Finally, we eliminated the quadratic nature of previous attention models by adapting a more efficient linear-based attention scheme, which effectively permits long-sequence interactions on large inputs where contextual information is the key to achieving better feature disentanglement.

Our work is closely related to the literature on text-guided image-to-image translation, with emphasis on sketch-to-image synthesis, which is still a challenging task. Our model is achieved by implementing an improved version of a visual-linguistic GAN inversion model with the additional benefits of a linear-based attention scheme.

In our work, we propose a hierarchy of encoded visual-linguistic features that computes the contextual similarity between image and text pairs extracted at different down-sampled resolutions. The contextual similarity between each pair of text and image is computed for different layers of the model, as shown in Figure 2. Since our design direction is focused on a multi-modal problem where text is used to guide sketch-to-image synthesis, we incorporated a set of linear-based attention models that ensure the local contextual similarity between image regions and specific words is preserved across the coarse, medium and fine layers [7], [9]. Secondly, the StyleGAN inversion encoder combines these features, where φ(x), h and w denote the input image embeddings, the features from the CLIP model, and the average of the StyleGAN latent embeddings, respectively; we use ''*'' to represent the concatenation operation. In the training process, we use D(·), V(·) and φ(·) to represent the discriminator, perceptual and style losses, while λ represents adjustable hyper-parameters; the losses are described in more detail below.

The BERT text encoder is pre-trained with two objectives: a masked language modeling task, where some parts of the input tokens are randomly replaced with a special token (i.e., [MASK]) and the model needs to predict the identity of those tokens, and a sentence prediction task, where the model is given a sentence pair and trained to classify whether they are two consecutive sentences from a document. Finally, an output layer and objective are introduced and fine-tuned on the task data from the pre-trained parameters [14], [32]. We compute the similarity between these features, i.e., the text w_t and image w_v embeddings from the latent space. The multi-modal similarity is learned from these embeddings, where w_v, w_t ∈ R^{L×C} are the features obtained from the image and text embeddings; w_v = E_x(x, h) and w_t = E_t(·). All features are of the same shape for L layers, each with a C-dimensional latent embedding.
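The exact similarity expression is not recoverable from this text; as an illustration only, a common choice consistent with the shapes given above (w_v, w_t ∈ R^{L×C}) is a layer-wise cosine similarity averaged over the L layers. The cosine form and the encoder call signatures below are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multimodal_similarity(w_v: torch.Tensor, w_t: torch.Tensor) -> torch.Tensor:
    """Layer-wise similarity between image and text latent embeddings.

    w_v, w_t: tensors of shape (L, C), one C-dimensional embedding per
    StyleGAN layer, produced by the image encoder E_x(x, h) and the text
    encoder E_t(.) respectively (the cosine form is an illustrative choice).
    """
    # Per-layer cosine similarity between the two modalities.
    sim_per_layer = F.cosine_similarity(w_v, w_t, dim=-1)  # shape (L,)
    # Average over the L layers to obtain a single multi-modal score.
    return sim_per_layer.mean()

# Example with the W+ space of a 1024x1024 StyleGAN (L = 18 layers, C = 512).
w_v = torch.randn(18, 512)
w_t = torch.randn(18, 512)
print(multimodal_similarity(w_v, w_t))
```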

A mapping scheme is necessary to transfer the embeddings from both image and text pairs. In our case, we implemented a contextual mapper (c-Map) that translates contextual features into a latent embedding, f : Z → W, of semantic vectors derived from abstracted data. We also re-scaled the images and then extracted the local feature matrix from the last layer of the image encoder E_x(·), which we feed to the c-Map network. The mapping network mimics a small convolutional network, which gradually reduces the spatial size using a set of 2-strided convolutions followed by LeakyReLU activations [9], [43] and a series of fully connected layers to ensure the model is aware of the inherent information between text and images, which is crucial for feature disentanglement, as shown in Figure 3. A fully connected layer learns features from all the combinations of the features of the previous layer, while a convolutional layer relies on local spatial coherence with a small receptive field (3 × 3 kernel, in most cases). We then added the fully connected layers to address the problem of shared weights in conventional architectures, which prevents the convolutional layers from generating subtle variations in different spatial zones that are needed to produce realistic images.
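As a concrete illustration of the c-Map design described above, the following minimal sketch stacks stride-2 3×3 convolutions with LeakyReLU activations to collapse the spatial grid of the encoder's last-layer feature map, then applies fully connected layers that emit one 512-dimensional style vector per StyleGAN layer. The channel widths, input resolution and layer count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CMap(nn.Module):
    """Contextual mapper (c-Map) sketch: local feature map -> W+ style codes.

    Assumes a (B, in_ch, 16, 16) feature matrix from the last layer of the
    image encoder E_x; all widths are illustrative choices.
    """
    def __init__(self, in_ch: int = 512, n_layers: int = 18, w_dim: int = 512):
        super().__init__()
        self.conv = nn.Sequential(
            # Stride-2 convolutions gradually reduce the spatial size.
            nn.Conv2d(in_ch, 512, kernel_size=3, stride=2, padding=1),  # 16 -> 8
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1),    # 8 -> 4
            nn.LeakyReLU(0.2),
            nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1),    # 4 -> 2
            nn.LeakyReLU(0.2),
        )
        # Fully connected layers mix information across all spatial zones.
        self.fc = nn.Sequential(
            nn.Linear(512 * 2 * 2, 1024),
            nn.LeakyReLU(0.2),
            nn.Linear(1024, n_layers * w_dim),
        )
        self.n_layers, self.w_dim = n_layers, w_dim

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        x = self.conv(feat).flatten(1)
        return self.fc(x).view(-1, self.n_layers, self.w_dim)  # (B, 18, 512)

codes = CMap()(torch.randn(1, 512, 16, 16))
print(codes.shape)  # torch.Size([1, 18, 512]) codes in the W+ latent space
```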

Furthermore, our approach stands out from previous techniques because, in the design of our mapping model, we paid attention to the shortcomings of current contextual encoders that operate at a pixel-to-pixel correspondence [9], [10]. An encoder that operates at a pixel-to-pixel correspondence will be subject to locality bias, which is a major limitation when handling non-local transformations [44]. In our approach, we implemented a model that operates at a global level, where multi-modal synthesis is easier to achieve. Since StyleGAN provides a layer-wise representation, our mapping framework can sample style vectors defined by W ∈ R^512, which makes hierarchical style mixing efficient.
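To illustrate why a layer-wise representation makes hierarchical style mixing cheap, the sketch below assigns different 512-dimensional style vectors to the coarse, medium and fine layer ranges of a W+ code. The split points and the simple averaging of the medium layers are illustrative assumptions, not the paper's exact mixing rule.

```python
import torch

def mix_styles(w_content: torch.Tensor, w_style: torch.Tensor,
               coarse: int = 4, medium: int = 8) -> torch.Tensor:
    """Hierarchical style mixing in W+ (each code has shape (18, 512)).

    Coarse layers (pose/geometry) come from w_content, medium layers from a
    blend, and fine layers (colour/texture) from w_style.
    """
    mixed = w_content.clone()
    mixed[coarse:medium] = 0.5 * (w_content[coarse:medium] + w_style[coarse:medium])
    mixed[medium:] = w_style[medium:]
    return mixed

mixed = mix_styles(torch.randn(18, 512), torch.randn(18, 512))
print(mixed.shape)  # torch.Size([18, 512])
```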

FIGURE 3. Linear-based attention module: A new sequence P ∈ R^{l×d} with fixed (constant) length is introduced as an input, which is referred to as the pack attention. A second scheme, called the unpack attention, is implemented on the input sequence E_x ∈ R^{l×d}. The image on the right depicts the contextual matching mechanism of our model. The concatenated features from the vision and text-based encoders result in a visual-linguistic embedding that is fed to the generator model.

We note that since the length of P is a constant l, the complexity of the attention grows linearly with the length of the input sequence X. We also incorporated a position-wise feed-forward network FFN(·) after the attention layers.
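To make the pack/unpack mechanism of Figure 3 concrete, the sketch below lets a fixed-length learned sequence P attend over the input (pack) and then lets the input attend over the packed result (unpack), so the cost grows linearly with the input length rather than quadratically. The head count, dimensions and use of standard multi-head attention are assumptions; this is not the paper's exact module.

```python
import torch
import torch.nn as nn

class PackUnpackAttention(nn.Module):
    """Linear-complexity attention via a fixed-length pack sequence P.

    Pack step:   P attends over the input X      -> packed context (length l_p).
    Unpack step: X attends over the packed context -> output (length n).
    Both steps cost O(n * l_p), i.e. linear in the input length n.
    """
    def __init__(self, dim: int = 512, pack_len: int = 64, heads: int = 8):
        super().__init__()
        self.p = nn.Parameter(torch.randn(pack_len, dim) * 0.02)  # learned P
        self.pack = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.unpack = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.size(0)
        p = self.p.unsqueeze(0).expand(b, -1, -1)   # (B, l_p, d)
        packed, _ = self.pack(p, x, x)              # P queries X
        out, _ = self.unpack(x, packed, packed)     # X queries the packed context
        return self.ffn(out) + x                    # position-wise FFN + residual

y = PackUnpackAttention()(torch.randn(2, 1024, 512))
print(y.shape)  # torch.Size([2, 1024, 512])
```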

To match the attention conventions to the rest of our model, we represent X := E_x ∈ R^{l×d} and P ∈ R^{l×d} as the outputs of the feed-forward network FFN(·). The attention model is illustrated in Figure 2.

The perceptual loss L_p for our proposed network compares the feature maps of the synthesized and target images.

The style transfer loss comprises the content and style terms [1].

The content in our case is derived from the encoder model E(·). Basically, the expectation over the entire spatial space of the feature maps is compared with a loss function so as to ensure similarity. These are obtained by taking the Gram matrix G_φ(·) of the outputs of X and X̂.
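As a concrete sketch of the comparisons described above, deep feature maps of the synthesized image X̂ and the target X can be compared directly (perceptual/content term) and via their Gram matrices (style term). The use of a frozen VGG-16 and the specific layer cut are assumptions for illustration, not necessarily the paper's exact feature extractor.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Frozen feature extractor phi(.) used for both loss terms (assumed VGG-16).
vgg_features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    """G_phi(.): channel-by-channel correlations of a feature map."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def perceptual_and_style_loss(x_hat: torch.Tensor, x: torch.Tensor):
    f_hat, f = vgg_features(x_hat), vgg_features(x)
    l_perceptual = F.mse_loss(f_hat, f)                       # content/perceptual term
    l_style = F.mse_loss(gram_matrix(f_hat), gram_matrix(f))  # Gram-matrix style term
    return l_perceptual, l_style
```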

We sum up all the loss functions defined above to compute the overall objective, where the variables λ_1, λ_2 and λ_3 are the hyper-parameters used as weight factors for the different loss terms.
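The equation itself is not recoverable from this text; assuming the three weighted terms are the adversarial, perceptual and style losses introduced above (the exact pairing of each λ with a term is an assumption), the weighted sum takes a form such as:

$\mathcal{L}_{\text{total}} = \lambda_{1}\,\mathcal{L}_{\text{adv}} + \lambda_{2}\,\mathcal{L}_{p} + \lambda_{3}\,\mathcal{L}_{\text{style}}$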

We use the Multi-Modal CelebA-HQ [40] dataset for text-guided multi-modal image synthesis. It is a large-scale dataset which provides a high-quality semantic segmentation map, a sketch, descriptive texts, and an image with transparent background for each sample. The text structure comprises ten unique single-sentence descriptions for each image in CelebA-HQ [6]. For training, we divided the dataset into 80% training and 20% test samples, respectively.
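A minimal illustration of the 80/20 split described above follows; the random seed and the index-based layout are assumptions, not the paper's exact protocol.

```python
import random

def split_dataset(indices, train_frac: float = 0.8, seed: int = 0):
    """Shuffle sample indices and split them 80/20 into train/test sets."""
    idx = list(indices)
    random.Random(seed).shuffle(idx)
    cut = int(train_frac * len(idx))
    return idx[:cut], idx[cut:]

# CelebA-HQ (and hence Multi-Modal CelebA-HQ) contains 30,000 images.
train_ids, test_ids = split_dataset(range(30000))
print(len(train_ids), len(test_ids))  # 24000 6000
```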

To train the StyleGAN inversion module [8], we combined features from an image encoder with features from the CLIP model [13], which was trained on over 400 million image and text pairs, as defined in Equation (1). We adapt this technique in order to achieve better semantic meaning between text and images. To this end, we built a visual-linguistic encoder combined with a BERT [14] text encoder to produce embeddings that match the latent space W+ of StyleGAN. In our design, we adapted the BERT encoder specifically to fine-tune the model on newly added text descriptions that are of interest for our model. In line with the approach implemented in [10], we trained only the encoder and discriminator while the generator weights were kept frozen. In our experiments, we evaluate the FID score.
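A schematic training step consistent with the description above is sketched below: only the encoder and the discriminator are updated while the StyleGAN generator stays frozen. The module interfaces, the `losses` helper and the optimizer handling are placeholders for illustration, not the paper's exact procedure.

```python
import torch

def training_step(encoder, discriminator, generator, losses,
                  sketch, text_emb, target, opt_enc, opt_disc):
    """One schematic optimization step with a frozen pre-trained generator.

    `losses.discriminator` and `losses.generator` stand in for the weighted
    objective terms defined earlier; they are assumed helpers, not real APIs.
    """
    # Freeze the pre-trained StyleGAN generator.
    generator.requires_grad_(False)

    # Invert sketch + text into W+ codes and synthesize an image.
    w_plus = encoder(sketch, text_emb)
    fake = generator(w_plus)

    # Update the discriminator on real vs. synthesized images.
    opt_disc.zero_grad()
    d_loss = losses.discriminator(discriminator, target, fake.detach())
    d_loss.backward()
    opt_disc.step()

    # Update the encoder with the adversarial + perceptual + style objective.
    opt_enc.zero_grad()
    g_loss = losses.generator(discriminator, fake, target)
    g_loss.backward()
    opt_enc.step()
    return d_loss.item(), g_loss.item()
```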

The quantitative results are shown in Table 1.

In this section, we evaluate the key components that define the performance of our model. We evaluate the linear-based attention model for the encoder as well as the visual-linguistic impact of the BERT [14] and CLIP [13] models, respectively. Our findings throw more light on the potential of our model in general. Our findings indicate that the model is sensitive to the visual-linguistic relationship between image-word pairs, and we used the CLIP model to confirm the semantic consistency between those pairs.

As the sketches are slightly degraded, the model still maintains its ability to reproduce plausible images at high quality, as reflected in Figure 12. We observe that the model still tries to synthesize similar identities, which shows the attentiveness of the model to the perceptual identity of the subject. Lastly, we pay key attention to the ability of the model to synthesize images and still maintain the contextual meaning of the sentence for each case study. We can clearly identify the subjects regardless of the poor sketch provided. Our approach confirms that regardless of the sketch quality, the visual-linguistic property of the model is still maintained.

The ability to reproduce similar images of the same identity is crucial for sketch-based synthesis. In Figure 10, we showcase our model's ability to reproduce identity-consistent images.

Our results reflect the synthesised images when we change the subject's attributes such as ''gender'', ''age'' and ''hair''. Our results confirm consistency in perceptual quality, identity and sketch diversity. Sketch diversity in this case highlights the fact that the model is able to synthesize an image and still represent the attributes within the specified sentence. In Figure 9 we show the perceptual quality of images generated using some compound sentences.

In our work, we showed that a text-guided sketch-to-image GAN model can be visually appealing and still portray all the facial attributes within an associated text description. Our model leverages the hierarchical structure of the state-of-the-art StyleGAN model to combine visual-linguistic features from a properly disentangled latent space. From our findings, we observe that introducing the CLIP features into our framework encourages better contextual meaning in our results without compromising the identity of the facial results across the board. We also confirm that adapting a linear-based attention module aids in generating plausible images.