Cross Modal Facial Image Synthesis Using a Collaborative Bidirectional Style Transfer Network

In this paper, we present a novel collaborative bidirectional style transfer network based on the generative adversarial network (GAN) for cross-modal facial image synthesis, possibly with a large modality gap. We argue that a representation decomposed into content and style can be effectively exploited for cross-modal facial image synthesis. However, we have observed that unidirectional application of decomposed-representation-based style transfer does not work well for this purpose when the modality gap is large. Unlike existing image synthesis methods that typically formulate image synthesis as a unidirectional feed-forward mapping, our network exploits the mutual interaction between two opposite mappings in a collaborative way to address complex image synthesis problems with a large modality gap. The proposed bidirectional network aligns the shape content from two modalities and exchanges their appearance styles using feature maps of the layers in the encoder space. This allows us to effectively retain the shape content and transfer style details when synthesizing each modality. Focusing on facial images, we consider facial photo, sketch, and color-coded semantic segmentation as different modalities. The bidirectional synthesis results for the pairs of these modalities show the effectiveness of the proposed approach. We further apply our network to style-content manipulation to generate multiple photo images with various appearance styles for the same content shape. The proposed method can be adopted for solving other cross-modal image synthesis tasks. The dataset and source code are available at https://github.com/kamranjaved/Bidirectional-style-transfer-network.


The goal of this research is to synthesize realistic cross-modal face images while retaining the input face identity. We interpret facial images of a person from different modalities as facial images with the same shape content and different appearance styles. We have also observed that a representation decomposed into content and style can bring a great advantage to cross-modal image synthesis [2]. On the other hand, as can be seen in Fig. 1, directly employing style transfer as a unidirectional feed-forward mapping for cross-modal image synthesis does not work well in the case of a large modality gap. Based on this interpretation and observation, we aim to develop a novel bidirectional synthesis network that effectively employs style transfer schemes to achieve our goal. We effectively align the shape content from the two modalities and exchange their appearance styles by exploiting the mutual interaction between two opposite mappings. In this work, we consider facial photo, sketch, and color-coded semantic segmentation as different modalities.
In the layers of the encoder space, we align the shape content from the two modalities and exchange their appearance styles, even when the modality gap is large. We even view facial sketch and color-coded semantic segmentation as facial modalities and present bidirectional image synthesis between them although their modality gap is large. We further demonstrate our network for content-style manipulated synthesis. In this task, we generate multiple photo images from a single segmentation map by conditioning on photos with different styles. The bidirectional synthesis results for the pairs of facial photo, sketch, and color-coded semantic segmentation show that the proposed methodology can be adapted for solving other cross-modal image synthesis tasks.

The main contributions of this work are as follows.

• We present a style-transfer-based bidirectional synthesis network that effectively exploits the mutual interaction between two opposite mappings to address cross-modal image synthesis with a large modality gap.

• We demonstrate challenging bidirectional synthesis from face sketch to semantic segmentation and from semantic segmentation to face sketch.

[24] and face de-occlusion [25]. Despite their promising performance, these methods have not utilized the mutual interaction between two opposite mappings. In contrast, the proposed network effectively takes advantage of the mutual content information of cross modalities through a bidirectional synthesis framework.

Many studies have investigated face photo-to-sketch and face sketch-to-photo synthesis as an image-to-image translation problem using GANs in their models [13], [14]. However, their methods are unable to effectively deal with the large domain gap between photo and sketch. Over the last few years, great progress has been made in developing methods specifically designed for photo-sketch synthesis tasks. Yu et al. [26] incorporate facial composition information into their GAN-based face photo-sketch synthesis. PS²-MAN [15] takes an approach of gradually learning low-resolution to high-resolution images using multi-adversarial networks. Although these methods formulate photo-sketch transformation through end-to-end mapping, they do not utilize the mutual interaction between the two modalities. To effectively reduce the modality gap for the photo-sketch synthesis task, Col-cGAN [16] learns an intermediate modality between photo and sketch by utilizing the mutual interaction of the two opposite mappings. CUT [27] maximizes the mutual information between different modalities based on contrastive learning of corresponding patches. StarGAN v2 [28] learns mappings between multiple modalities by utilizing a style encoder and a mapping network. These approaches produce plausible results when the domain gap is small but struggle in cases where the domain gap is large. On the other hand, face photo-sketch recognition.

The separation of an image into content and style components has been widely studied for artistic style transfer [1], [17], [30], [31]. Image synthesis can be achieved through image style transfer. Gatys et al. [17] showed that the content and style of an image can be separated and recombined using deep convolutional features.

As stated earlier, a synthesis method that decomposes the representation into content and style can bring great advantages to cross-modal image synthesis [2]. In BSTM, the network learns individual domain characteristics and adopts the cross-domain style by incorporating the transferred style factor into the content factor.

As shown in Fig. 2, the encoders extract feature maps F_A and F_B from the inputs of the two modalities. These features F_A, F_B ∈ R^{C×H×W}, where H and W indicate the spatial dimensions and C the number of channels, are fed into a BSTM unit and decomposed into content and style components. The channel-wise mean and standard deviation represent the image style, while the normalized feature map represents the content, or shape, of the image. We obtain the style and content components as follows:

S_A = (μ(F_A), σ(F_A)),    C_A = (F_A − μ(F_A)) / σ(F_A),

where μ(·) and σ(·) denote the channel-wise mean and standard deviation. For simplicity, we show here only the style and content computation for modality A; the style and content representations S_B and C_B for modality B are computed in the same manner. This decomposed representation is then used to transfer the style components across modalities by simply scaling and shifting the content component of one modality with the channel-wise mean μ and standard deviation σ of the other modality. This produces the feature maps F_{A→B} and F_{B→A}, which contain the shape content of one modality with the appearance style of the other:

F_{A→B} = σ(F_B) · C_A + μ(F_B),    F_{B→A} = σ(F_A) · C_B + μ(F_A).

Along with style transfer, we also align the shape contents C_A and C_B from the two modalities by computing the l_1 distance between them. This process is repeated in the next block of the encoder.

The architecture of the proposed encoders is shown in Fig. 3 (a). The encoders consist of two main blocks. The first block consists of two convolution layers, while the second

[Fig. 1: For each transformation problem, the three columns show the input (first column), the result of unidirectional style transfer (second column), and the result of the proposed collaborative bidirectional transfer (third column). Unidirectional style transfer does not work well when the modality gap is large; our network instead exploits the mutual interaction between two opposite mappings in a collaborative way.]

The architecture of both discriminators follows the one used in pix2pix [5]. We use a patch-level discriminator that discriminates the image structure at a patch scale of 70 × 70. The details of the discriminator architecture are given in Fig. 3 (b).
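The BSTM operations above (channel-wise style/content decomposition, cross-modal style exchange, and l_1 content alignment) can be summarized in a short sketch. The following PyTorch snippet is a minimal illustration under our own assumptions (tensor shapes, the epsilon value, and all names are ours); it is not the authors' released implementation, which is linked in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def channel_stats(feat: torch.Tensor, eps: float = 1e-5):
    """Channel-wise mean and std over the spatial dimensions (the 'style')."""
    b, c = feat.shape[:2]
    flat = feat.view(b, c, -1)
    mean = flat.mean(dim=2).view(b, c, 1, 1)
    std = (flat.var(dim=2) + eps).sqrt().view(b, c, 1, 1)
    return mean, std


class BSTMUnit(nn.Module):
    """Style exchange between feature maps of two modalities A and B.

    Content:  C_X = (F_X - mu_X) / sigma_X   (normalized feature map)
    Style:    S_X = (mu_X, sigma_X)          (channel-wise statistics)
    Transfer: F_{A->B} = sigma_B * C_A + mu_B, and symmetrically for B->A.
    """

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        mu_a, std_a = channel_stats(feat_a)
        mu_b, std_b = channel_stats(feat_b)

        content_a = (feat_a - mu_a) / std_a
        content_b = (feat_b - mu_b) / std_b

        # Swap appearance styles while keeping each modality's shape content.
        feat_a2b = std_b * content_a + mu_b
        feat_b2a = std_a * content_b + mu_a

        # Shape-content alignment between the two modalities (l1 distance).
        align_loss = F.l1_loss(content_a, content_b)
        return feat_a2b, feat_b2a, align_loss


# Example: feature maps from one encoder block of each modality.
f_a = torch.randn(4, 64, 128, 128)
f_b = torch.randn(4, 64, 128, 128)
f_a2b, f_b2a, l_align = BSTMUnit()(f_a, f_b)
```

In the full network, the exchanged features F_{A→B} and F_{B→A} would be passed to the next encoder block and on to the decoders, with the alignment term contributing to the training loss.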

We train our bidirectional network using the joint loss function in Eq. 7, which is a weighted combination of multiple loss terms.

More details about this dataset are described in Sec. IV-C. For all experiments, we use images of size 272 × 272, which are randomly cropped to 256 × 256 for training. We train our model for 5,000 epochs for the photo-sketch (Sec. IV-B) and sketch-segmentation (Sec. IV-C) synthesis tasks, and for 200 epochs for the photo-segmentation synthesis task (Sec. IV-A).

We train our model in three steps. For one third of the iterations, we first train the part of the network for synthesis in one direction with the synthesis in the opposite direction fixed. We then train the network for another one third of the iterations for synthesis in the opposite direction with the already trained part fixed. For the remaining iterations, we train the network for bidirectional synthesis with the BSTM units on. Our model uses the BSTM units in alternate epochs: in one epoch we train the network with BSTM, while in the next epoch we do not. However, we apply shape content alignment throughout the training epochs. This training scheme helps our model overcome the problem of directly utilizing the style transfer technique for image synthesis, producing results with correct structure and stylized results with smooth texture. At inference time, we do not use the BSTM module, except for content-style manipulated image synthesis.

We give the performance evaluation of our method for bidirectional cross-modal facial image synthesis for photo-segmentation in Sec. IV-A, photo-sketch in Sec. IV-B, and sketch-segmentation in Sec. IV-C, respectively. We train all the methods to be compared, except Col-cGAN [16], in the two opposite directions separately, as they do not support bidirectional synthesis.

Fig. 4 compares the results for synthesized segmentation maps from photo images (top two rows) and synthesized photo images from segmentation maps (bottom two rows). As can be seen in Fig. 4, the synthesized photos produced by Pix2pix and SPADE contain deformations for complex face semantics. Moreover, Pix2pix also yields noise and messy face texture. Col-cGAN gives

We also provide quantitative comparisons in Table 1. We use Structural SIMilarity (SSIM) and Peak Signal-to-Noise Ratio (PSNR) for segmentation→photo and mean Intersection-over-Union (mIoU) for photo→segmentation. Table 1 indicates that our method outperforms the other methods in terms of PSNR and mIoU, but SPADE gives the best SSIM score. We have additionally experimented on the FFHQ-Aging

To achieve this, we use the model trained for the photo-segmentation synthesis task. We translate all photos from the CUFS dataset into segmentation maps and use those synthesized segmentation maps, along with the corresponding sketches, as paired segmentation/sketch samples. Fig. 7 shows examples of pairs we have created for this task.
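As a rough illustration of the three-step schedule described above, the sketch below returns a per-epoch training plan: one direction only, then the opposite direction, then bidirectional training with the BSTM units used on alternate epochs and shape-content alignment applied throughout. The equal-thirds stage boundaries and all names are our reading of the text, not the released training code.

```python
from dataclasses import dataclass


@dataclass
class EpochPlan:
    """What is trained in a given epoch of the three-step schedule."""
    train_a2b: bool      # update the A->B synthesis branch
    train_b2a: bool      # update the B->A synthesis branch
    use_bstm: bool       # route features through the BSTM units this epoch
    align_content: bool  # shape-content alignment is applied throughout


def epoch_plan(epoch: int, total_epochs: int = 5000) -> EpochPlan:
    stage = total_epochs // 3
    if epoch < stage:        # step 1: one direction, the opposite branch fixed
        return EpochPlan(True, False, False, True)
    if epoch < 2 * stage:    # step 2: opposite direction, the trained branch fixed
        return EpochPlan(False, True, False, True)
    # step 3: bidirectional training; BSTM is used only on alternate epochs
    return EpochPlan(True, True, epoch % 2 == 0, True)


# Example: the 5,000-epoch photo-sketch setting.
print(epoch_plan(0), epoch_plan(2500), epoch_plan(4000), epoch_plan(4001))
```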

Results for sketch-segmentation synthesis are illustrated in Fig. 6. Pix2Pix, SPADE, Col-cGAN, CUT, and StarGAN v2 obtain almost equivalent results for segmentation outputs from a given sketch. However, they are unable to produce plausible sketches from a segmentation map. As can be seen in the last two rows of Fig. 6, SPADE and CUT fail to produce plausible sketches from a segmentation map. Col-cGAN outputs are blurred and largely ignore sketch-like appearance styles in the hair region and pencil-line shadows; they also show artifacts on the hair texture. StarGAN v2 produces plausible results, but fails to synthesize the hair region with finer details. Pix2Pix blends the pencil-line shadows and does not give plausible face semantics, e.g., the ears in the third row of Fig. 6. In contrast, our method not only produces visually pleasing results, but also obtains more diverse outputs that better retain finer details, especially in segmentation-to-sketch synthesis.

We also provide quantitative comparisons in Table 3 using SSIM and PSNR for segmentation→sketch and mIoU for sketch→segmentation. Our method achieves the best SSIM and PSNR scores for segmentation→sketch. For sketch→segmentation, Pix2Pix, Col-cGAN, and our method yield equivalent performance.

We have additionally performed a pilot user study to evaluate our results through the perceptual assessment of people. We asked fifty-two participants to select which output looks more realistic and natural. Each participant is given a total of twenty-four questions, four questions for each synthesis task.

For every test sample, participants are shown the input image along with six images synthesized by different methods for the given input. Table 4 shows that our method significantly outperforms the other representative methods in all three bidirectional synthesis tasks.

We think that a user study like ours can give a better performance evaluation because, except for segmentation, there is no perfect quantitative evaluation metric that quantifies the quality of a generated image.
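For reference, the PSNR and mean IoU values of the kind reported in Tables 1 and 3 are commonly computed as in the minimal sketch below (SSIM is usually taken from an off-the-shelf implementation such as scikit-image's structural_similarity). The value range, label set, and averaging convention here are our assumptions, not necessarily those of the paper's evaluation protocol.

```python
import numpy as np


def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between two images of the same shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean Intersection-over-Union over classes present in either label map."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0


# Example with random label maps; a real evaluation compares predicted and
# ground-truth segmentation maps of the test set.
pred = np.random.randint(0, 10, (256, 256))
gt = np.random.randint(0, 10, (256, 256))
print(mean_iou(pred, gt, num_classes=10))
```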