Uncouple Generative Adversarial Networks for Transferring Stylized Portraits to Realistic Faces

Stylized portraits widely exist in artworks and paintings. It is interesting to restore the original identity of portrait artworks, but considering the rarity and the style diversity of these artworks, it is difficult to pair them and obtain sufficient training data to restore their original identities with existing methods. Therefore, it is challenging to explore a method to restore a single stylized portrait to its original identity. Although CycleGAN can convert paintings into realistic photographs on unpaired datasets, it was not developed specifically for portraits, and photo-realistic faces require more accurate structures; thus, the visual results obtained with CycleGAN are not satisfactory. In this paper, we propose Uncouple-Generative Adversarial Networks (UncGANs) for transferring stylized portraits to realistic faces. Our UncGANs framework is inspired by CariGANs and tackles the visual problems of CycleGAN in obtaining realistic faces from stylized portraits. In addition, we introduce several losses: the semantic style consistency loss and the cycle consistency loss, which effectively guide the training of the generators and discriminators on unpaired datasets; the global and local adversarial losses, which ensure the consistency of appearance characteristics before and after translation; and the location consistency loss, which establishes a precise correspondence between the source domain and the target domain and further assists the discriminators. Extensive experimental results and comparisons with state-of-the-art methods, including Style, Deep-Image-Analogy, UNIT, MUNIT, CycleGAN, CP-GAN, and PS2-MAN, demonstrate that our framework is better at generating realistic faces from stylized portraits with accurate structures and features.


I. INTRODUCTION
Various approaches have been presented for generating stylized portraits; however, the inverse problem of recovering the latent realistic face from a stylized portrait has yet to be investigated thoroughly. This is because facial details and identity-related information can be distorted or lost in stylized images. In addition, considering the rarity and style diversity of these artworks, it is difficult to pair them and obtain sufficient training data to restore their original identities with existing methods. Therefore, developing a recovery process that preserves the identity of a stylized portrait is a great challenge.
In recent years, as the performance of deep learning algorithms has improved, it has become possible to adopt deep learning to assist artists in role design tasks. The seminal works of [4], [5] realized artwork generation with deep learning; their core concept is to use Gram matrices to capture style-related content. Given that the facial features of stylized portraits vary widely, there is no guarantee that the spatial structure of a captured facial image is similar to the target face, so it is hard to achieve high-quality results with such traditional single architectures. CycleGAN [1] demonstrates how to train a model that maps the style from a reference set of images onto a different image without a training set of paired examples. Although CycleGAN [1] can convert paintings into realistic photographs, it was not developed specifically for portraits, so the visual results obtained with CycleGAN are not satisfactory. For example, in Fig. 1, the results obtained with CycleGAN [1] are still partially stylized. We also note that the state-of-the-art style translation approaches [35], [36], [38] do not fully consider how to extract facial features from different stylized images, so they cannot fulfill the task of restoring the original identity of diverse and rare artistic portraits.

FIGURE 1. Comparison between CycleGAN [1] and UncGANs. Our results use a dual-network for all scale factors. First row: input images generated with the method of [3]; second row: results generated by CycleGAN; third row: our results; last row: ground truth. The results obtained with CycleGAN [1] are still partially stylized.

In order to better complete the task of translating stylized portraits to realistic faces, in this paper we propose Uncouple-Generative Adversarial Networks (UncGANs) that automatically map stylized portraits to realistic ones in an end-to-end fashion. UncGANs consist of two stages: the first explores the shared latent information space for cross-domain translation; the second avoids detail deviation by strengthening local constraints. Furthermore, several new loss functions are proposed accordingly. Our asymmetric encoders encode input stylized portraits into a high-dimensional vector; the semantic style consistency loss and cycle consistency loss are used to effectively guide the training of the generators and discriminators on unpaired datasets; the global and local adversarial losses ensure the consistency of appearance characteristics before and after translation; and the location consistency loss is used to establish a precise correspondence between the source domain (stylized portraits) and the target domain (realistic faces) as well as to further assist the global and local discriminators. Progressive Growing of GANs (PGGAN) [6] is used as the infrastructure of our generator. As Fig. 1 and Fig. 2 show, UncGANs can successfully obtain realistic faces from stylized portraits. Experimental results demonstrate that UncGANs are better at generating real faces with accurate structures and features.
The main contributions are listed as follows: 1) We propose Uncouple-Generative Adversarial Networks (UncGANs) for transferring stylized portraits to realistic faces; 2) We introduce new losses, namely, the semantic style consistency loss and cycle consistency loss, the global and local adversarial losses, and the location consistency loss, to adequately address the geometric and structural mismatches between unpaired sets and significantly improve the quality of the results. The rest of this paper is organized as follows. We review the related work in Section 2. Our framework is described in detail in Section 3. In Section 4, we present the experimental results and the comparisons with state-of-the-art methods. Conclusions and future work are given in Section 5.

II. RELATED WORK
Image stylization has been studied for a long time [8]-[24]. In this section, style translation methods are briefly categorized and arranged in chronological order. We select the more representative methods from the past three years and introduce them to facilitate the experimental comparisons.

A. STYLE TRANSLATION BASED ON CONVOLUTIONAL NEURAL NETWORKS (CNNs)
Gatys et al. [5] proposed a universal solution that automatically converts an image into the style of a given artwork. This style translation approach requires two images of different styles, a content image and a style reference image, and transfers the style of the reference image onto the content image. Subsequent work [25] generates high-quality results with reduced computing time, but it specifically requires two images of the same type of scene with different styles as input.

B. DOMAIN TRANSLATION BASED ON GENERATIVE ADVERSARIAL NETWORKS (GANs)
In the domain conversion task, the network needs to transfer an image from one domain into its corresponding image in another domain.
Generative Adversarial Networks (GANs) were proposed by Goodfellow et al. [26]. Given enough training data and time, GANs are able to imitate any appearance style. Zhu et al. [27] described a network, called BicycleGAN, for multimodal translation; it requires paired datasets. Image-to-image translation without paired datasets remained a challenge until the work of Taigman et al. [28]. CycleGAN [1] was proposed to keep the results close to the source domain through forward and backward cycle-consistency losses; similar frameworks include DiscoGAN [29], DualGAN [30], and the Domain Transfer Network (DTN) [31]. The major differences between CycleGAN, DiscoGAN, and DualGAN are that DiscoGAN uses a CNN as the encoder and decoder and a fully connected network as the converter, CycleGAN uses ResNet blocks as the converter, and DualGAN uses a form similar to Wasserstein GAN (WGAN) [32]. Liu et al. [33] proposed unsupervised image-to-image translation (UNIT); their algorithm realizes the mutual translation of different scenarios, such as day to night, seasons, traffic signs, etc. Huang et al. [34] presented multimodal unsupervised image-to-image translation (MUNIT), an extension of UNIT [33]. UNIT assumes that different domains can share a common latent space; MUNIT [34] goes further and assumes a shared content space together with distinct style spaces. UNIT [33] mainly works on unpaired image-to-image translation, whereas MUNIT [34] focuses on many-to-many translation. Cao et al. [2] presented the first GAN network for unpaired photo-to-caricature conversion in this special application scenario, called CariGANs [2]. CariGANs decouple a complex cross-domain conversion task into two simpler tasks: geometric exaggeration and appearance conversion. A user study shows that caricatures generated by CariGANs are closer to those drawn by artists than the results obtained with other state-of-the-art approaches.

C. LATEST PROGRESS
Recently, some interesting papers have been proposed for generating photo-realistic faces. Huang et al. [35] proposed cartoon-to-photo facial translation with generative adversarial networks (CP-GAN) for inverting cartoon faces to generate photo-realistic and related face images. Wang et al. [36] proposed a novel synthesis framework, photo-sketch synthesis using multi-adversarial networks (PS2-MAN), which iteratively generates low-resolution to high-resolution images in an adversarial way; the hidden layers of the generator are supervised to first generate lower-resolution images, followed by implicit refinement in the network to generate higher-resolution images. Menon et al. [37] proposed a method for generating high-quality realistic photos from sketches and sketches from photos. Chen et al. [38] presented a new method for generating face images from sketches; the main idea is to implicitly model the shape space of plausible face images and to synthesize a face image in this space to approximate an input sketch, in a local-to-global process. All these methods perform well; however, they were not developed specifically for converting artworks or paintings to realistic faces.
In summary, it is almost impossible to restore a single portrait to its original identity with existing approaches. For example, Van Gogh painted only about 40 self-portraits, and some artists have even fewer, which are hard to pair with their photographs. For unpaired methods, there are not enough portraits for training.

III. PROPOSED FRAMEWORK
Given the effectiveness of CycleGAN in translation tasks, it has been the foundation of many works over the last three years. We therefore first replaced CycleGAN's generator with the PGGAN generator and attempted to restore portraits to their original identities, but the results had poor facial structures. At the same time, noticing CariGANs' excellent performance in optimizing facial geometry, we refer to the architecture of CariGANs and propose a method that relies solely on a single unpaired portrait with a chaotic style to restore its original identity. Our framework contains two stages. In the first stage, we introduce the global adversarial loss and the semantic style consistency loss, and the PGGAN [6] generator structure is adopted as our generator. The motivation for such a design is to make PGGAN better adapt to the translation task while maintaining its remarkable generative capability, but the results of this stage alone are still defective. We solve this problem in the second stage, which consists of three separate local adversarial losses that improve the facial structures by strengthening local constraints.
First, let Y_1 and Y_2 be the real face domain and the stylized portrait domain, respectively. A real face is denoted by y_i ∈ Y_1, a stylized portrait by y_j ∈ Y_2, and L denotes the landmark set (l ∈ L). The first stage of UncGANs consists of two generators and two encoders (G_1, E_1 for Y_1 and G_2, E_2 for Y_2) and one global discriminator (D^gc_Y1). The realistic face translation can thus be expressed as the mapping G_1: (Y_2, L) → Y_1 applied to y_j, written G_1(y_j), which is considered a process of imitation and assimilation. By using the same weight-sharing strategy between both generators, G_1 is able to generate results that are closest to the style of Y_1 and most similar to the shape of Y_2. In the second stage, we introduce the local adversarial discriminators D^lo_Y1 (three local discriminators: eyes, nose, and mouth). With the assistance of the location consistency loss, D^gc_Y1 and D^lo_Y1 better ensure that the structural correspondence before and after translation is consistent. In this way, we obtain the desired result, whose appearance is most similar to the input y_j but whose style belongs to Y_1.

FIGURE 3. Overall pipeline of our framework. UncGANs decouple a cross-domain translation problem (F: y_j → y_i) into two stages. Our architecture obtains the initial result G_1(y_j) without relying on the local adversarial loss. In this stage, we train the generators and global discriminators, involving the location consistency loss, for both directions. Next, since we already have a coarse but reasonable result G_1(y_j), we adopt the pre-trained real face regressor to predict landmarks on G_1(y_j). With the estimated coordinates, local patches are cropped and used as input to the local discriminators. Lastly, we obtain the final result.

In addition, our objective function contains four types of terms: the semantic style consistency loss, which retains the texture information of the source domain in the target domain; the cycle consistency loss, which prevents the learned mappings G and F from contradicting each other; the location consistency loss, which establishes a precise correspondence between Y_1 and Y_2 and further assists the global and local discriminators; and the global and local adversarial losses, which make the reconstructed content challenging to distinguish from a reference image.

A. THE FIRST STAGE
We present the details of the first stage of UncGANs on the left of Figure 4. Our generator uses PGGAN's generator [6]. To improve the stability of GAN training and speed up the training process, PGGAN progressively grows the generator and discriminator together and alternates between growing stages and reinforcement stages. During the growing phase, the input from a lower resolution is linearly combined with the higher resolution. The linear factor α grows from 0 to 1 as training progresses, allowing the network to gradually adjust to the higher resolution as well as to any newly added variables. During the reinforcement phase, any unused layers for the lower resolution are discarded as the grown network continues training. Our discriminator is trained progressively as well.
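To make the growing procedure concrete, the following minimal TensorFlow sketch illustrates the linear fade-in blending described above; the function and tensor names are ours for illustration and are not taken from any released implementation.

```python
import tensorflow as tf

def grow_blend(low_res_features, high_res_features, alpha):
    """Fade-in blending used while growing to a higher resolution.

    `alpha` ramps linearly from 0 to 1 during the growing phase, so the
    network transitions smoothly from the up-sampled lower-resolution
    output to the newly added higher-resolution layers.
    (Illustrative sketch; tensor names and shapes are assumptions.)
    """
    upsampled = tf.image.resize(low_res_features,
                                tf.shape(high_res_features)[1:3],
                                method="nearest")
    return (1.0 - alpha) * upsampled + alpha * high_res_features
```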
Previous works [40], [41], [42] have demonstrated that using different sets of normalization parameters (γ, β) can train a network to output visually different images of the same object. Inspired by this, we capture the style difference between the two domains Y_1 and Y_2 by using two sets of batch renormalization [43] parameters (one for each domain). The motivation for this design is that both encoders (E_1 and E_2) attempt to encode the same semantic object represented in different styles; by sharing weights in all layers except the normalization layers, we encourage them to use the same latent encoding to represent the two visually different domains. Thus, different from prior works [1], [33], [44], [45], which share parameters only in the higher layers, we choose to share the weights of all layers except the batch renormalization layers. In addition, we use the same weight-sharing strategy for the two generators, which enables us to capture shared semantic information with fewer parameters.
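The following sketch illustrates this weight-sharing scheme: the convolution weights are shared between the two domains while each domain keeps its own batch renormalization parameters. The class and layer names are illustrative assumptions, not the actual implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

class SharedEncoderBlock(tf.keras.layers.Layer):
    """One encoder block whose convolution weights are shared between the
    two domains, while each domain keeps its own batch renormalization
    parameters (gamma, beta). Illustrative sketch; names are assumptions."""

    def __init__(self, filters):
        super().__init__()
        self.conv = layers.Conv2D(filters, 3, padding="same")   # shared weights
        self.bn_y1 = layers.BatchNormalization(renorm=True)     # real-face domain
        self.bn_y2 = layers.BatchNormalization(renorm=True)     # stylized-portrait domain

    def call(self, x, domain, training=False):
        x = self.conv(x)
        bn = self.bn_y1 if domain == 1 else self.bn_y2
        return tf.nn.leaky_relu(bn(x, training=training))
```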
In an encoder-decoder structure using convolutional neural networks, the input is gradually down-sampled in the encoder and up-sampled in the decoder. For the translation task, some details and spatial information may be lost in the down-sampling process. UNet [7] is commonly used in image translation tasks, where details of the input image can be preserved in the output through skip connections. We adapt such a structure: our encoder mirrors the PGGAN [6] generator structure and grows as the generator grows to higher resolutions. The skip connections connect the encoder layers right before down-sampling with the generator layers right after up-sampling.
Specifically, three types of terms are adopted in the first stage: the global adversarial loss helps the generated face maintain its original features while ensuring that it is indistinguishable from the reference samples; the semantic style consistency loss allows the encoder-generator to capture the same semantic information across the different visual domains; and the cycle consistency loss ensures that the input and output portraits contain the same features.

1) GLOBAL ADVERSARIAL LOSS
We modify the CariGANs [2] architecture and encode input stylized portraits into a high-dimensional vector through an asymmetric encoder. Y_1 and Y_2 share the latent encoder and generator, which lets the network understand that Y_1 and Y_2 have a potential semantic association even if their styles vary. In addition, we introduce the constraints of the semantic style consistency loss and the cycle consistency loss to effectively guide the training of the generators and discriminators on unpaired datasets. This part is based on a complete bidirectional adversarial network.
In the objectives of pix2pix [46] and AC-GAN [47], the adversarial loss makes the reconstructed content challenging to distinguish from a reference image. Inspired by this, we design two adversarial losses that differ in their object orientation (global and local, respectively). Here we first introduce the global adversarial loss. Given the input (y_j, l), the output is represented as G_1(y_j) ∈ Y_1; for the mapping F: y_j → y_i, the purpose of D^gc_Y1 is to make the generator produce realistic faces that match the landmark position l. The objective function of the conditional discriminator is expressed in Equation (1), which follows the classic GAN adversarial loss. In L_MatGAN(G_1, D^gc_Y1), L denotes the loss, and G_1 and D^gc_Y1 refer to the generator and the global discriminator for Y_1, respectively. For the discriminator, D^gc_Y1(y_i, l) should be close to 1 (judged as a true, i.e., real-photo, image); in this case, log D^gc_Y1(y_i, l) approaches 0 (= log 1). Further, D^gc_Y1(G_1(y_j, l), l) should be close to 0 (judged as a fake image), which drives log(1 − D^gc_Y1(G_1(y_j, l), l)) toward 0 as well. Thus, the adversarial loss should be maximized to train the discriminator and minimized to train the generator.
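The displayed Equation (1) is not reproduced above. Based on the description, a standard conditional adversarial loss of the following form is a plausible reconstruction (notation as defined in Section III; this is a sketch, not a verbatim copy of the original equation):

\mathcal{L}_{MatGAN}\big(G_1, D^{gc}_{Y_1}\big) =
  \mathbb{E}_{y_i,\,l}\big[\log D^{gc}_{Y_1}(y_i, l)\big]
  + \mathbb{E}_{y_j,\,l}\big[\log\big(1 - D^{gc}_{Y_1}\big(G_1(y_j, l),\, l\big)\big)\big]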
In order to improve the stability of the GANs, we use the objective function of DRAGAN [48] and add it as a penalty term (Equation (2)), where the optimal configuration of the hyperparameters depends on the architecture, the information space, and the dataset. Our settings follow the hyperparameter configuration of DRAGAN: λ_dg ≈ 10, c ≈ 10, k = 1, with real point y_i and perturbed point y_i + δ in our tests.
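The displayed Equation (2) is not reproduced above; the standard DRAGAN gradient penalty that the description and hyperparameters point to has the following form (a reconstruction from the cited work, not a verbatim copy of the paper's equation):

\mathcal{L}_{DRAGAN} = \lambda_{dg}\,
  \mathbb{E}_{y_i \sim Y_1,\ \delta \sim \mathcal{N}(0,\,c\,I)}
  \Big[\big(\|\nabla_{\hat{y}}\, D^{gc}_{Y_1}(\hat{y}, l)\|_2 - k\big)^2\Big],
  \qquad \hat{y} = y_i + \delta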
The final adversarial objective is given in Equation (3): the adversarial loss is minimized to train the generator (expressed as min) and maximized to train the discriminator (expressed as max). We combine Equation (1) and Equation (2) to obtain a more stable adversarial objective.
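A plausible min-max form of this combined objective (Equation (3)), reconstructed from the two preceding terms, is:

\min_{G_1}\ \max_{D^{gc}_{Y_1}}\ \big(\mathcal{L}_{MatGAN} + \mathcal{L}_{DRAGAN}\big)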

2) CYCLE CONSISTENCY LOSS
From our experiments, we found that this loss significantly improves the performance of the network, reduces fuzzy output in cross-domain translation, and ensures that the input and output portraits contain the same characteristics. We also use this loss; please refer to CycleGAN [1] for details.
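For completeness, the standard CycleGAN cycle consistency loss, written in our notation (with G_1 and G_2 understood to include their respective encoders), is:

\mathcal{L}_{cyc}(G_1, G_2) =
  \mathbb{E}_{y_j}\big[\|G_2(G_1(y_j)) - y_j\|_1\big]
  + \mathbb{E}_{y_i}\big[\|G_1(G_2(y_i)) - y_i\|_1\big]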

3) SEMANTIC STYLE CONSISTENCY LOSS
Each style usually has its own tonal distribution, and we hope that the stylized object still retains the same features as the original. In other words, the encoders need to extract the same semantic features from the input and the output. To achieve this, we propose the loss L_sec(E_1, E_2, G_1, G_2) in Equation (4), which is used to maintain the same features. In general, there is no strict direct mapping between different style domains (e.g., a stylized portrait lacks sufficient muscle and skeletal structure compared with a real face); therefore, forcing a direct mapping at the pixel level can easily cause mismatches. In UncGANs, we only adopt this loss on the embedding, i.e., on the encoding of the same semantic object represented in different styles. Thus, we encourage the encoders (E_1 and E_2) to use the same latent encoding to represent the two visually different domains: by sharing weights in all layers except the normalization layers, the latent code of the stylized portrait y_j encoded by E_1 should be the same as the latent code encoded by E_2.
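The displayed Equation (4) is not reproduced above. One plausible instantiation consistent with this description, a latent-code consistency term in the spirit of UNIT [33] (the exact form in the original may differ), is:

\mathcal{L}_{sec}(E_1, E_2, G_1, G_2) =
  \mathbb{E}_{y_j}\big[\|E_1(G_1(y_j)) - E_2(y_j)\|_1\big]
  + \mathbb{E}_{y_i}\big[\|E_2(G_2(y_i)) - E_1(y_i)\|_1\big]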
We now arrive at the full objective function of the first stage, given in Equation (5), where L_Full represents the total loss of the first stage. It is obtained by adding the final adversarial loss L_MatGAN, the cycle consistency loss L_cyc, and the semantic style consistency loss L_sec. The λ terms are hyperparameters used to control the weight of each target.
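A plausible form of this full first-stage objective (Equation (5)), consistent with the weights given in the training configuration below, is:

\mathcal{L}_{Full} = \lambda_{GAN}\,\mathcal{L}_{MatGAN}
  + \lambda_{cyc}\,\mathcal{L}_{cyc}
  + \lambda_{sec}\,\mathcal{L}_{sec}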

4) TRAINING CONFIGURATION
Texture identifiability is critical to our task. Here we empirically set λ_GAN = 100, λ_cyc = 10, and λ_sec = 10; each λ is a coefficient that controls the contribution of the corresponding term. Following [6], we use the Adam solver [49] with a batch size of 1. All networks are trained from scratch with an initial learning rate of 1e-4.

B. THE SECOND STAGE
In this section, we describe the details of the second stage, which uses the local discriminators to further ensure the consistency of characteristics before and after translation. The architecture of the second stage is shown on the right of Figure 4.
In the first stage, the generator outputs G_1(y_j). Subsequently, we use a pre-trained real face regressor (R_Y1) to predict facial features and obtain a five-channel output (each channel represents one landmark point, so no information about the corresponding points is lost) as the landmark heat map l. Lastly, we strengthen the local constraints, which ensures that the semantic properties remain correctly matched and contributes to the final results being realistic and identifiable. Similar to the purpose of the global discriminator, the local discriminators are intended to further ensure the consistency of characteristics before and after translation.
We also propose the location consistency loss to assist the discriminators: using the landmark positions l predicted by the pre-trained real face regressor, we automatically crop patches on G_1(y_j) based on l and use them as inputs to the local discriminator D^lo_Y1, which constrains the patches to ensure the semantic-attribute consistency of the source and target domains. In addition, the location consistency loss can guide the training of D^gc_Y1, which further helps the generator pay more attention to critical facial features during training.

1) LOCATION CONSISTENCY LOSS
First, l of y_i predicted by R_Y1 (as the constraint) and l of y_j predicted by R_Y2 (as the input) are given. We adopt the L2 norm to compute L_cons, expressed in Equation (6), where L represents the input landmark heat-map set (l ∈ L) and R_Y1 denotes a pre-trained regressor. Under the constraint of Equation (6), the landmarks of G_1(y_j) predicted by R_Y1 are kept consistent with the input landmarks predicted on y_j, so the translated result follows realistic face landmarks. We use Equation (6) to establish a precise correspondence between the source and target domains.
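The displayed Equation (6) is not reproduced above. A plausible L2 formulation consistent with this description (landmarks of the translated result predicted by R_Y1 matched against the input landmarks predicted by R_Y2) is:

\mathcal{L}_{cons} =
  \mathbb{E}_{y_j}\Big[\big\|R_{Y_1}\big(G_1(y_j)\big) - R_{Y_2}(y_j)\big\|_2^2\Big]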

2) LOCAL ADVERSARIAL LOSS
To further improve the final results, we propose the local adversarial losses for the mouth, nose, and eyes. With the landmarks of G_1(y_j) as centers, the regions of the eyes, mouth, and nose on G_1(y_j) are cropped as patches and then fed to the local discriminators to produce accurate results. The local adversarial loss is expressed in Equation (7), where L^local_MatGAN (Y_2 → Y_1) represents the local adversarial loss, which is used to ensure the consistency of appearance characteristics before and after translation. There are three patches (p = 1, 2, 3); the adversarial process is carried out three times, and the three terms are summed. D^lo_Y1 is the local discriminator, and λ_lo is a hyperparameter that controls the weight of each term. Equation (7) aims to solve the local issues: y_i(patch) is a local patch of a real face, and the generated patch is defined as G_1(y_j)(patch). For the mapping F: y_j → y_i, after the first stage G_1 first generates a preliminary result, then a pre-trained realistic face regressor predicts the landmark set L. Using the locations provided by the predicted landmarks, we crop the local patches as input to the local discriminators. We then combine the patches of both eyes to make the eyes in the final results more symmetrical and reasonable. Because the gradients can propagate backward through these local patches, UncGANs are trained end-to-end.
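The displayed Equation (7) is not reproduced above. A plausible form, summing a conditional adversarial term over the three local patches (p = 1, 2, 3 for eyes, nose, and mouth), is:

\mathcal{L}^{local}_{MatGAN,\,Y_2 \to Y_1} =
  \sum_{p=1}^{3} \lambda_{lo}\Big(
    \mathbb{E}\big[\log D^{lo}_{Y_1}\big(y_i^{(patch_p)}\big)\big]
    + \mathbb{E}\big[\log\big(1 - D^{lo}_{Y_1}\big(G_1(y_j)^{(patch_p)}\big)\big)\big]\Big)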

3) TRAINING CONFIGURATION
We set the hyperparameter λ_lo = 0.5 empirically and use the Adam solver [49] with a batch size of 1. All networks are trained from scratch with an initial learning rate of 2e-4 and a polynomial decay strategy.
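As an illustration of this configuration, a TensorFlow optimizer with the stated learning rate and a polynomial decay schedule could be set up as follows; the decay horizon and learning-rate floor are assumptions, since the original values are not given.

```python
import tensorflow as tf

# Illustrative optimizer setup for the second stage: Adam, batch size 1,
# initial learning rate 2e-4, polynomial decay (as described in the text).
lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=2e-4,
    decay_steps=200_000,        # assumed total number of training steps
    end_learning_rate=1e-6,     # assumed learning-rate floor
    power=1.0)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)
```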

C. TRAINING SETTINGS
1) PATCH EXTRACTION
Patches are cropped on G_1(y_j), with sizes chosen empirically to cover each facial organ (e.g., eyes, nose, and mouth). For a 1024 × 1024 facial photo, we crop 300 × 300 eye patches (to make the eyes more symmetrical in the final result, we combine the two eye patches into one for the same discriminator), a 230 × 260 nose patch, and a 400 × 200 mouth patch. We therefore extract three patches. Given the landmark position coordinates, we acquire the eye patches and the nose patch with the corresponding landmark as the center, and crop the mouth patch with the left and right mouth landmarks as the boundary.
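A minimal NumPy sketch of this cropping procedure is given below; the landmark ordering, the orientation of the patch sizes (width vs. height), and the way the two eye patches are combined (here, concatenated side by side) are assumptions for illustration.

```python
import numpy as np

def crop_patch(image, center, size):
    """Crop an (h, w) patch from a 1024x1024 face image around `center`.
    Minimal sketch; boundary handling is simplified to clipping."""
    h, w = size
    cy, cx = center
    y0 = int(np.clip(cy - h // 2, 0, image.shape[0] - h))
    x0 = int(np.clip(cx - w // 2, 0, image.shape[1] - w))
    return image[y0:y0 + h, x0:x0 + w]

def extract_local_patches(image, landmarks):
    """`landmarks` is assumed to hold (y, x) coordinates for
    [left_eye, right_eye, nose, mouth_left, mouth_right]."""
    left_eye  = crop_patch(image, landmarks[0], (300, 300))
    right_eye = crop_patch(image, landmarks[1], (300, 300))
    eyes = np.concatenate([left_eye, right_eye], axis=1)   # combined eye patch
    nose = crop_patch(image, landmarks[2], (260, 230))     # 230 x 260 nose patch (orientation assumed)
    # mouth patch bounded by the left and right mouth-corner landmarks
    my = (landmarks[3][0] + landmarks[4][0]) // 2
    mx = (landmarks[3][1] + landmarks[4][1]) // 2
    mouth = crop_patch(image, (my, mx), (200, 400))         # 400 x 200 mouth patch
    return eyes, nose, mouth
```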

2) REGRESSOR TRAINING
We first pre-train two landmark regressors to predict the landmark positions of stylized portraits and real faces, respectively. We adopt the stylized dataset and the CelebA dataset, respectively, to train a UNet [7] architecture and output a five-channel heat map as the predicted scores for the face landmarks. We train each regressor for 80,000 iterations.
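The five-channel heat-map supervision can be generated as in the following sketch; the Gaussian spread sigma is an assumed value, since the original setting is not specified.

```python
import numpy as np

def landmarks_to_heatmap(landmarks, size=1024, sigma=8.0):
    """Render five landmark points as a five-channel Gaussian heat map
    (one channel per landmark), as used to supervise the UNet regressor.
    Illustrative sketch; `sigma` is an assumption."""
    ys, xs = np.mgrid[0:size, 0:size]
    heatmap = np.zeros((size, size, len(landmarks)), dtype=np.float32)
    for c, (cy, cx) in enumerate(landmarks):
        heatmap[..., c] = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
    return heatmap
```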

IV. EXPERIMENTAL RESULTS
In this section, our aim is to learn the mapping from unpaired stylized portraits to realistic faces while ensuring that the generated face retains the identity of the stylized portrait.
In this work, let Y_1 and Y_2 be the real face domain and the stylized portrait domain, respectively. For training, we use {y_i}, i = 1, ..., 13,000, y_i ∈ Y_1, drawn from the CelebA database [39]. For testing, we collected different types of data: the first dataset {y_j}, j = 1, ..., M, M = 1,000, y_j ∈ Y_2, and the second with N = 70 samples. We aim to learn the mapping F: y_j → y_i. This is a canonical cross-domain translation problem, because y_j and y_i may differ significantly in both geometric structure and style appearance; therefore, we decouple F into two stages. Figure 3 shows the pipeline of UncGANs. Below we present our experimental results with a variety of artistic styles and compare them with state-of-the-art methods. UncGANs was developed with TensorFlow in PyCharm, and all experiments were conducted on an NVIDIA Tesla V100-SXM2 GPU.

A. TRAINING DATASETS
To accomplish F: y_j → y_i, we need two types of data, i.e., a stylized portrait dataset (y_j ∈ Y_2) and a facial photo dataset ({y_i}, i = 1, ..., 13,000, y_i ∈ Y_1). For y_i, we chose a popular dataset, the CelebA dataset [39]. For y_j, we selected the Color Pencil Sketch Gallery [3] and collected images from the Internet according to the experimental requirements to establish a new self-built dataset.
1) CelebA DATASET
The CelebA dataset [39] has 200,000 celebrity faces. We chose facial images with frontal and half-profile poses for training and gathered a total of 34,658 portraits. We used about 13,000 of them with clean backgrounds and clear structures, and cropped and resized these images to 1024 × 1024 to form the real portrait dataset. We manually annotated the landmarks (center of each eye, nose tip, and both corners of the mouth) according to our experimental requirements. Our realistic face landmark regressor R_Y1 was also trained on this dataset.

2) STYLIZED PORTRAIT DATASET 1
Inspired by [39], [50], to establish our famous painting and sculpture dataset, we gathered over 1,000 images from museum websites. A pre-trained face analysis tool was used to preprocess the images and obtain a facial bounding box; the images were then cropped along the box and resized to 1,024 × 1,024. Using these operations, we obtained a self-built dataset, referred to as the FP&S dataset, for training and testing.

3) STYLIZED PORTRAIT DATASET 2
Lu et al. [3] introduced a method for synthesizing pencil sketches. We applied this method to generate colored pencil sketches as a stylized dataset for network input and compared our generated results with the original input photos (ground truth) of Lu et al. We extracted the face area with OpenCV and resized it to 1024 × 1024 to form a self-built test dataset for evaluating the conversion ability and robustness of our framework. We chose only color pencil sketch portraits, allowed random flips, and applied input adjustments to hue, brightness, and saturation. In addition, our stylized portrait landmark regressor R_Y2 was also trained on this dataset.

B. COMPARISONS WITH THE STATE-OF-THE-ART METHODS
We compare UncGANs with the state-of-the-art methods, including Style [4], Deep-Image-Analogy [25], UNIT [33], MUNIT [34], CycleGAN [1], CP-GAN [35], and PS2-MAN [37]. All the results of these methods were obtained using the default settings reported in their papers; for methods that do not provide source code, we re-implemented them ourselves. The same datasets were used to train these methods and UncGANs.
We first compare with style translation techniques; these models directly translate texture, color, and appearance from the given reference domains. Here we consider two common style translation methods: Style [4] and Deep-Image-Analogy [25]. As can be seen in Figure 5, neither of these methods converges effectively in realistic portrait translation, so the facial structures are not maintained well, although the style is close to a photograph. Next, two conventional image-to-image translation networks are compared: UNIT [33] and MUNIT [34] attempt to obtain more natural effects, but their actual results are unnatural.

FIGURE 5. Comparison of Style [4], Deep-Image-Analogy [25], UNIT [33], MUNIT [34], CycleGAN [1], and our framework (UncGANs).
Compared with Style [4] and Deep-Image-Analogy [25], our facial features are sharper and more realistic. Compared with UNIT [33] and MUNIT [34], our results are closer to real photos and more natural. Compared with CycleGAN [1], our results are more recognizable, more detailed, and more accurate. Finally, we compare our method with the latest methods (CP-GAN [35] and PS2-MAN [37]) that specialize in translating stylized portraits to real faces in Figures 6 and 7; the results demonstrate that UncGANs produce minimal artifacts while generating realistic and sharper faces. In summary, our method is better able to convert a stylized portrait to a real face while maintaining its original identity.

C. ABLATION STUDY
To prove the effectiveness of our framework, we carried out ablation studies focusing on the effectiveness of the losses and the local discriminators. In Figure 10, we compare the results generated with and without the full set of components. The experimental results shown in Figures 8 and 9 compare our two-stage results with results from a single stage (with the global and local adversarial losses in one stage). We can see that the single-stage results look decent, but the structural details (e.g., the appearance of the mouth corners, eyelids, and hair; the nose and face shape) are significantly inferior to ours. Therefore, our two-stage framework has more advantages in abstract painting translation (e.g., exaggerated portraits by Paul Cezanne or Picasso). This further proves the necessity of developing a two-stage framework.

1) SEMANTIC STYLE CONSISTENCY LOSS AND LOCAL ADVERSARIAL LOSS
The purpose of this experiment is to verify the effect of our losses. First, we verify the effect of the cycle consistency loss. The results after applying the cycle consistency loss (Ls_c, i.e., global adversarial loss + cycle consistency loss) are shown in the second column of Figure 10; we can see that the generated facial features become consistent with the original portrait after adding the cycle consistency loss to the base model. Then, we apply the semantic style consistency loss to the discriminators and generators (Ls_s, i.e., global adversarial loss + cycle consistency loss + semantic style consistency loss). The results are shown in the third column; after adding the semantic style consistency loss, we observe that the facial features of the results become more accurate. Compared with Ls_c, Ls_s is able to reduce the facial defects.
To verify the role of the local constraint, the results of applying the local adversarial loss (lo_d, i.e., global adversarial loss + cycle consistency loss + semantic style consistency loss + local adversarial loss) are shown in the fourth column of Figure 10. Compared with Ls_s, lo_d produces higher-quality results with fewer structural defects and richer facial details. In addition, we remove the semantic style consistency loss from our full framework (w/lo_d, i.e., global adversarial loss + cycle consistency loss + local adversarial loss); the results are shown in the fifth column of Figure 10. We can see that without the semantic style consistency loss, the appearance features of the results are no longer accurate and some important information is not retained; the fact that it is still better than Ls_c depends entirely on the effect of the local constraint. Full represents our final result after adjusting all parameters; the result after tuning is slightly better than lo_d.
In summary, as can be seen from Figure 10, the results generated with only the cycle consistency loss are not visually pleasing. Our full framework is better able to generate real faces with accurate structures and features, and each component of our framework plays a role.

D. PERCEPTUAL STUDY
To further demonstrate the validity of our framework, we distributed questionnaires in different Facebook groups and compiled the results, which are shown in Table 1. The perceptual study evaluates the results of each model from three aspects: Identity (whether the result effectively retains the identity of the source domain), Fidelity (whether the result is closest to the target domain), and Overall evaluation (whether the result leaves an excellent overall impression). We set up seven groups of samples (all results of each method were trained on our datasets, and we selected the best test results as samples). Lastly, we randomly selected 100 users who work in non-art-related industries and do not have color blindness or color weakness for the perceptual study.

1) IDENTITY
This study started by showing users the original photo and its corresponding ground truth in order to familiarize them with the ideal outcome. Next, we randomly selected test objects from the stylized samples and showed our results together with the control group results (the experiment was repeated 14 times, twice for each method); users were asked to pick the pair with the same identity and rate each group, and only the top three results were preserved.

2) FIDELITY
This study also started by showing users the original photo and its corresponding ground truth in order to familiarize them with the ideal outcome. Then, in each subsequent test, results generated by our method and by the control group were randomly displayed to the users, who were asked to rank the samples from ''most similar'' to ''least similar''; only the top three results were kept for statistics.

3) OVERALL EVALUATION
This study started by showing the users a set of sample pairs different from those in the first two studies. The users were then asked to fill out a 10-question questionnaire for each method, with unlimited response time; based on the survey results, we calculated the overall impression of each technique and asked the users to name the three best methods after the survey.
Across the three sets of indicator values in the user study, our results are generally higher than those of prior methods, which indicates that our framework achieves visually high-quality results in translating stylized portraits to realistic faces.

E. QUANTITATIVE EVALUATION
The above perceptual study shows that the differences between results are discernible to the naked eye. However, given that human evaluation is often biased by the visual expression of the generated samples and ignores the overall distribution of quantitative characteristics, we also need quantitative metrics to evaluate our framework. The Fréchet Inception Distance (FID) [51] was used to calculate the similarity between the generated images and the target images. Following the steps of [52], we used different approaches and training datasets to calculate the covariance and mean of 4,096-dimensional feature vectors, and then derived the FID [51] of the different architectures. The evaluation results are shown in Tables 2 and 3. Our framework has the minimum FID, showing that the results generated by our framework are closest to the structural distribution of the target domain and surpass those of existing approaches. In addition, from Table 2 it can be seen that with Ls_c, Ls_s, and lo_d the FID scores improve successively, while eliminating the semantic style consistency loss reduces translation performance. Hence, all the components of our framework are significant in translating stylized portraits to realistic faces.
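For reference, the FID between the real and generated feature distributions is computed as

\mathrm{FID} = \|\mu_r - \mu_g\|_2^2
  + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),

where μ_r, Σ_r and μ_g, Σ_g are the mean and covariance of the feature vectors extracted from real and generated images, respectively.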

F. OTHER STUDY
We also tested the generalization and robustness of our framework on more complex stylized portrait conversion tasks. Additional results generated by our framework are shown in Figures 11 to 17; the input styles range from classic to abstract. The results show that our framework still performs well when dealing with stylized portraits with vastly different structures.

V. CONCLUSION AND FUTURE WORK
In this work, we propose Uncouple-Generative Adversarial Networks (UncGANs) for transferring stylized portraits to realistic faces. Our UncGANs framework is developed to tackle the problems of CycleGAN for portraits. In addition, we introduce several losses: the semantic style consistency loss and the cycle consistency loss, which effectively guide the training of the generators and discriminators on unpaired datasets; the global and local adversarial losses, which ensure the consistency of appearance characteristics before and after translation; and the location consistency loss, which establishes a precise correspondence between the source domain (stylized portraits) and the target domain (realistic faces) and further assists the global and local discriminators. Extensive experimental results and comparisons with state-of-the-art methods demonstrate that our framework is better at generating realistic faces from stylized portraits with accurate structures and features. We also conducted an ablation study, a perceptual study, a quantitative evaluation, and other studies, all of which show that our framework is effective in transferring stylized portraits to realistic faces. In future work, we plan to further optimize the existing architecture and extend our framework to convert the characters in abstract paintings into realistic videos.
WENXIAO WANG is currently pursuing the Ph.D. degree with the Faculty of Information Technology, Macau University of Science and Technology (MUST), Macao, China. He is also an amateur illustrator. He has been engaged in media design, 3D game development, human-computer interaction, film and television special effects, and architectural design. His current research interests include image/video stylization, computer vision, image processing, creative media, digital modeling, 3D animation, virtual reality (VR), and augmented reality (AR).