MHGAN: Multi-Hierarchies Generative Adversarial Network for High-Quality Face Sketch Synthesis

Face sketch synthesis has made significant progress in the past few years. Recently, GAN-based methods have shown promising results on image-to-image translation problems, especially photo-to-sketch synthesis. Because a facial sketch has a highly abstract style and continuous graphic elements, small artifacts and blur in its local details are more visible than in other image styles. Existing face sketch synthesis methods lack models for specific facial regions and usually generate face sketches with coarse structures. To synthesize high-quality sketches and overcome blur and deformation, this paper proposes a novel Multi-Hierarchies GAN (MHGAN), which divides the face image into multiple hierarchical structures to learn the features of different facial regions. It includes three modules: a local region module, a mask module, and a fusion module. The local region module learns the detailed features of different local regions of the face with GANs. The mask module generates the coarse facial structure of a sketch and uses a facial feature extractor to enhance high-level image features and learn features in the latent space. The fusion module generates the final sketch by combining the fine local regions and the coarse facial structure. Extensive qualitative and quantitative experiments show that the proposed method outperforms state-of-the-art methods on the CUFS and CUFSF standard datasets and on photos from the internet.


I. INTRODUCTION
Face sketch synthesis is the process of generating face sketches from face photos. It has been studied for a long time because of its wide range of applications, and it plays an essential role in digital entertainment [1] and in law enforcement based on video surveillance [2]. In law enforcement and criminal cases, an intelligent security system [3] can automatically retrieve photos of suspects from the police face database, so that the judicial authorities can quickly narrow down the set of potential suspects. In practice, however, suspects' photos are usually hard to acquire, so police rely on commercial software or experienced artists to generate a sketch of a suspect from an eyewitness description. Beyond security, face sketch synthesis has also become increasingly popular among smartphone users and on social networks, where sketches are used as profile photos or avatars. Thus, face sketch synthesis is an important practical problem.
The associate editor coordinating the review of this manuscript and approving it for publication was Shuping He.
In the past decade, various methods have been proposed to achieve high-quality face sketch synthesis. These methods can be divided into two categories: data-driven methods and model-driven methods. The data-driven methods first divide the training data into image blocks, then perform nearest neighbor selection and compute linear combination weights, and finally select the best image blocks for sketch stitching. Since each synthesized sketch block is a linear combination of training sketch blocks, the data-driven methods can obtain good textures and facial detail features. Wang et al. [4] computed the linear combination coefficients by projecting the testing photos onto the training photos and used them to construct the target sketch. However, the data-driven methods are time-consuming because of the nearest neighbor selection over image blocks. The model-driven methods focus on learning the mapping between photos and sketches offline from training photo-sketch pairs. In the test phase, they directly transform photos into sketches without searching the training data, so synthesis is faster than with data-driven methods. Thanks to breakthroughs in model-driven methods, Generative Adversarial Networks (GANs) have received widespread attention in image style transfer. Zhang et al. [5] combined high-frequency features of samples with the results of a GAN to refine the texture. Compared with other image styles, sketches have stronger semantic constraints: the loss or displacement of facial features is subjectively unacceptable, even for small local defects such as those around the eyes.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Because the image elements of different facial regions (such as eyes and hair) in a sketch are inconsistent, it is difficult for a single network to learn the features of multiple regions. The state-of-the-art model-driven methods generate barely satisfactory results: they do not assign a dedicated network to each facial region, so they do not capture facial details well, and noise and deformation remain in the synthesized results. Figure 1 compares the synthesis results of common methods with those of the proposed method. The red boxes in Figure 1 show that the FCN and GAN methods cannot generate subtle facial textures and leave blurred areas in local regions, which illustrates the limitation of these methods for face sketch synthesis.
To address the above challenges, we propose a Multi-Hierarchies GAN (MHGAN) for face sketch synthesis. The proposed MHGAN consists of a local region module, a mask module, and a fusion module. The local region module divides the input photo into multiple hierarchies, and each hierarchy uses a GAN model to capture facial features and generate the corresponding sketch. Specifically, each hierarchy network has an independent loss function to reduce noise and detail loss in the synthesis process. This architecture can enhance the light and shadow of the sketches and draw delicate contour curves. The mask module uses a facial feature extractor to calculate the mask_feature loss (MF loss) between the synthesized sketch and the real sketch. We also use a smooth representation of the input image, produced by a white box function, from which high-level features of the photo are extracted; the mask_structure loss (MS loss) is then calculated between the synthesized sketch and this smooth image. The mask module uses adversarial learning to provide the mask sketch for face sketch synthesis and adds a control factor to ensure sufficient training of the generator and the discriminator. The fusion module uses the results of the local region module, the result of the mask module, and facial landmarks to generate the final sketch. To demonstrate the effectiveness of the proposed method, we performed extensive experiments on the Chinese University of Hong Kong face sketch benchmark dataset (CUFS) [6], the CUHK face sketch face recognition technology dataset (CUFSF) [7], and photos from the internet. Compared with state-of-the-art methods, the experimental results show that the proposed method preserves more key facial details.
The main contributions of this paper are as follows:
1) We propose a novel end-to-end MHGAN framework for face sketch synthesis, which can generate high-quality, expressive, and clear face sketch images. In particular, our method can be applied to a variety of styles and to faces of different races.
2) An artist uses a variety of graphic elements for different face regions when creating a sketch. To synthesize a better sketch, the proposed framework divides the face into multiple hierarchies, each of which is controlled by a separate loss function. We also use a facial feature extractor to extract high-level features and texture details. The network introduces a total loss function that makes the model more suitable for face sketch synthesis.
3) We conduct multiple sets of comparative experiments. The ablation study compares qualitative and quantitative results for different components of the proposed method to illustrate the contribution of each component. The comparison with traditional face sketch synthesis methods shows that the proposed method generates sketches with finer textures. The comparison with GAN-based methods shows that the contributions of the proposed method make it more suitable for face sketch synthesis. The comparison with state-of-the-art face sketch synthesis methods shows that the proposed method performs better. These qualitative and quantitative comparisons are carried out on multiple face sketch datasets and real-world photos, and they show that the proposed method outperforms the compared methods in addressing blurred facial features and lost facial details.
The rest of this paper is organized as follows. Section II reviews representative face sketch synthesis methods. Section III introduces the proposed method in detail. Section IV presents the experimental results and a comprehensive analysis. Section V concludes the paper.

II. RELATED WORK
In this section, we discuss existing face sketch synthesis methods, which can be roughly divided into two main types: data-driven methods and model-driven methods. GAN-based synthesis methods, which belong to the model-driven type, are introduced separately.

A. DATA-DRIVEN FACE SKETCH SYNTHESIS METHODS
The data-driven methods usually generate face sketches from linear combinations of training set patches. Tang and Wang [8] first proposed a face sketch synthesis method using feature transformation, but this method loses local details or produces overly smooth textures. Liu et al. [9] used the Local Linear Embedding (LLE) method to synthesize face sketches. Since its introduction, the Markov model has been widely used in many fields; for example, Peng et al. [10]-[12] proposed various applications of Markov-based jumping systems (MJSs). In face sketch synthesis, Wang and Tang [6] proposed using a Markov Random Field (MRF) model to construct a data compatibility function and a smoothness compatibility function and to select the image blocks most similar to the target sketch for synthesis. Zhou et al. [13] used a Markov Weighted Field (MWF) for face sketch synthesis, thereby improving the synthesized result. Gao et al. [14] used sparse representation to measure the linear combination weights of similar training patches and then decomposed the face into a sparse coefficient matrix and a dictionary. Zhang et al. [15] used the sparse coefficient matrix to search candidate image blocks for test photos. Using a sparse coefficient matrix effectively reduces the computational complexity, but the lack of local constraints loses facial information. Wang et al. [16] proposed a face sketch synthesis method based on simple offline Random Sampling and Local Constraints (RSLCR). The main advantage of the above methods is that they can synthesize facial details from linear combinations of training image blocks. However, their heavy computation leads to poor real-time performance and low practicability.

B. MODEL-DRIVEN FACE SKETCH SYNTHESIS METHODS
The model-driven methods aim to learn the mapping between face photos and face sketches offline. Chang et al. [17] proposed a face sketch synthesis method that uses ridge regression and a correlation vector machine to learn the mapping between photo-sketch patch pairs. Zhu et al. [18] divided the training photo-sketch patch pairs into clusters and learned the mapping for each cluster with a simple ridge regression model. Zhang et al. [19] used a Fully Convolutional Network (FCN) to synthesize face sketches. Although the detailed information of a specific identity can be preserved, real texture details are still lost because the network only stacks convolutional layers. Sheng [20] proposed a deep neural representation guidance method using enhanced 3D patch matching and cross-layer cost aggregation. These methods are trained offline, so sketches can be obtained quickly in the test phase. However, the final results of model-driven methods often lack facial details or contain blurry artifacts.

C. GAN BASED FACE SKETCH SYNTHESIS METHODS
GANs have developed rapidly and have been widely used in image style transfer. Isola et al. [21] proposed a conditional GAN to learn the mapping between input images and output images. This mapping can be applied to various style transfer tasks (e.g., labels to street scene, aerial to map, day to night, and edges to photo). Zhang et al. [22] added a probabilistic graphical model to the GAN-based structure and proposed a coarse-to-fine method for synthesizing face sketches. In addition, Zhang et al. [1] imitated the painter's painting process and added more detailed parts to the work of [22] to achieve better results. To make use of unpaired training sets, Zhu et al. [23] proposed a cycle-consistent GAN (Cycle-GAN) to learn the image transfer mapping between unpaired images. A similar idea can be found in the work on PS2-MAN [24]. Zhang et al. [25] proposed a face sketch synthesis method based on Multi-Domain Adversarial Learning (MDAL). Chen et al. [26] proposed an example-based method (FSW) that subdivides the feature map of the input photo into small overlapping pieces and then uses the corresponding sketch pieces in the feature space to form a fake sketch feature representation.
Although face sketch synthesis has achieved remarkable results, existing methods do not provide an end-to-end model for the details of facial regions. Over-smoothed and blurred contours and small artifacts remain in the synthesized sketches of local facial regions. To solve these problems, we combine GANs (for realistic visual effects) with a multi-hierarchies end-to-end framework (to handle coarse facial regions) for face sketch synthesis.

III. PROPOSED METHOD

A. NOTATION
As shown in Figure 2, the proposed MHGAN framework is divided into the local region module, the mask module, and the fusion module. MHGAN models the transformation from the face photo domain $P$ to the face sketch domain $S$ as a learned mapping $P \rightarrow S$. MHGAN divides each face photo into five hierarchical structures, denoted $lp \in local\_parts = \{eye\_l, eye\_r, nose, mouth, mask\}$. In the local region and mask modules, the networks learn from the paired training set $\{(p_{lp,i}, s_{lp,i})\}_{i=1}^{N}$, where $N$ is the number of photo-sketch pairs in the training set. In the fusion module, MHGAN uses the synthesized results of the local region and mask modules together with facial landmarks to generate the fake sketch; this fusion process is expressed as a function $T$. The proposed framework consists of multiple generators and discriminators, all of which are CNNs designed explicitly for the multi-hierarchies face structure. The hierarchical generator $G$ contains multiple partial generators: $G = \{G_{eye\_l}, G_{eye\_r}, G_{nose}, G_{mouth}, G_{mask}\}$. Likewise, the hierarchical discriminator $D$ contains one discriminator for each of the five hierarchical structures: $D = \{D_{eye\_l}, D_{eye\_r}, D_{nose}, D_{mouth}, D_{mask}\}$. Each local discriminator judges the corresponding local sketches to evaluate the characteristics of its region.
MHGAN divides face photos into multiple hierarchical structures because an artist uses different painting techniques for different regions of the face. An artist usually spends more time on the eyes, including details such as the pupils and eye corners, and draws them with short, powerful brush strokes. The image elements drawn for the mouth usually follow the upper and lower lip lines and fill the shape with thin lines. The drawing process therefore differs considerably between facial regions. A standard GAN uses a single generator to synthesize the entire face sketch, so all facial regions share generator parameters, which makes it difficult to generate the features of every region properly. MHGAN's hierarchical design with multiple GANs helps the model learn facial features at different positions and generate high-quality face sketches.
MHGAN obtains fine local sketches from the local region module and a coarse mask sketch from the mask module, and feeds them into the fusion module to generate the fake sketch. The fusion process may produce inconsistent boundaries, which make the semantic content of the synthesized sketch subjectively hard to accept. The fusion module uses the non-conservative guidance field of the foreground sketch and the background sketch to solve this boundary inconsistency problem.

B. LOCAL REGION MODULE
To better learn the facial features of different regions of the input image, the local region module covers four local regions, $lp \in \{eye\_l, eye\_r, nose, mouth\}$. Multiple generators extract sketch features of local areas that preserve the artist's drawing style. Each of the four local networks contains a local generator $G_{lp}$ and a local discriminator $D_{lp}$. After training, each local generator transforms the facial region photo $p_{lp,i}$ into the corresponding facial region sketch $s_{lp,i}$. The local generators are built on a modified U-Net: each of $G_{eye\_l}$, $G_{eye\_r}$, $G_{nose}$, and $G_{mouth}$ is a U-Net with three down-convolution and three up-convolution blocks. A U-Net with skip connections can incorporate multi-scale features, including low-level features, and provides sufficient but not excessive flexibility to learn the artist's drawing techniques for different facial regions. The input of each hierarchical generator is $p_{lp,i}$, a local region of size $h \times w$ centered on the corresponding facial landmark (left eye, right eye, nose, or mouth) obtained by MTCNN [27].
In the local region module, the adversarial loss $L_{local\_adv}$ helps the discriminator correctly distinguish real inputs from synthesized ones. This module uses the cross-entropy loss of the Cycle-GAN method as $L_{local\_adv}$, which is defined in Eq. (2). For each $D_{lp} \in \{D_{eye\_l}, D_{eye\_r}, D_{nose}, D_{mouth}\}$, the input images $p_{lp,i}$, $s_{lp,i}$, and $G_{lp}(p_{lp,i})$ all correspond to the local region handled by $D_{lp}$. As the discriminator $D_{lp}$ maximizes $L_{local\_adv}$ and $G_{lp}$ minimizes it, the synthesized sketch moves closer to the target domain $S$.
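The min-max behavior described above can be illustrated with a minimal sketch of a cross-entropy GAN value summed over the four local regions. This is an assumed textbook form, not necessarily the paper's exact Eq. (2); the function names and the discriminator outputs below are hypothetical.

```python
import math

REGIONS = ("eye_l", "eye_r", "nose", "mouth")


def adversarial_value(d_real, d_fake, eps=1e-12):
    """Cross-entropy GAN value for one region: log D(s) + log(1 - D(G(p))).

    d_real: discriminator probability that the real local sketch is real.
    d_fake: discriminator probability that the synthesized sketch is real.
    """
    return math.log(d_real + eps) + math.log(1.0 - d_fake + eps)


def local_adversarial_value(scores):
    """Sum the per-region values; D tries to raise this sum, G to lower it."""
    return sum(adversarial_value(*scores[lp]) for lp in REGIONS)


# Hypothetical discriminator outputs (d_real, d_fake) for each region.
undecided = {lp: (0.5, 0.5) for lp in REGIONS}
confident = {lp: (0.9, 0.1) for lp in REGIONS}

print(local_adversarial_value(undecided))
print(local_adversarial_value(confident))
```

A discriminator that separates real from synthesized sketches well yields a larger value than an undecided one, which is why the discriminator maximizes the quantity while the generator pushes it down.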
In the local region module, a strict $L_1$ loss is set in each hierarchical structure. For the four local region photos $p_{eye\_l,i}$, $p_{eye\_r,i}$, $p_{nose,i}$, and $p_{mouth,i}$, the loss $L_{local\_l_1}$ penalizes the pixel-wise $L_1$ distance between each synthesized local sketch $G_{lp}(p_{lp,i})$ and the real local sketch $s_{lp,i}$. The local discriminator $D$ focuses on distinguishing whether a generated ''fake'' sketch is a real sketch. Except for the input image size, which differs between discriminators, the network structure is the same as Patch-GAN [21]. The Patch-GAN discriminator examines the style of 70×70 patches rather than the two complete images, which avoids the loss of high-frequency information caused by judging whole images directly. Different facial patches allow the discriminator to learn local patterns and better discriminate real sketches from synthesized ones. The structure of the discriminator $D$ is shown in Figure 3. The loss function of the local region module is formulated as
$$L_{local} = \gamma L_{local\_adv} + \delta L_{local\_l_1},$$
where $\gamma$ and $\delta$ are hyperparameters that control the contributions of $L_{local\_adv}$ and $L_{local\_l_1}$, respectively.
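To make the 70×70 patch size concrete, the following sketch computes the receptive field of a Patch-GAN style discriminator. The layer list is an assumption: it follows the commonly used 70×70 Patch-GAN configuration (five 4×4 convolutions with strides 2, 2, 2, 1, 1 and padding 1), which the text above does not spell out.

```python
# Assumed layer configuration of a 70x70 Patch-GAN: (kernel, stride) pairs.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]


def receptive_field(layers):
    """Walk backwards from one output unit to the input pixels it sees."""
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


def output_grid(size, layers, padding=1):
    """Side length of the patch-score map for a square input of given size."""
    for kernel, stride in layers:
        size = (size + 2 * padding - kernel) // stride + 1
    return size


print(receptive_field(PATCHGAN_LAYERS))   # receptive field in pixels
print(output_grid(256, PATCHGAN_LAYERS))  # patch scores per side for 256x256
```

Under these assumptions each output score depends on a 70×70 input window, so the discriminator judges local style rather than the image as a whole.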

C. MASK MODULE
The mask module handles the learning process of the mask hierarchical region; in this section $lp \in \{mask\}$. The input of the mask module is a real photo, which ensures that the synthesized image does not lose high-level or detailed features. This module includes a mask sketch synthesis network and a facial feature extractor.

1) MASK SKETCH SYNTHESIS NETWORK
The generator of the mask sketch synthesis network generates the coarse facial structure and aims to preserve the positional features of the photo. This module replaces the JS divergence of the standard GAN model with the Earth-Mover (also known as Wasserstein-1) distance $W(P_{lp}, S_{lp})$ proposed by Arjovsky et al. [28]. In the objective function, $P_{lp}$ is the input photo distribution, $S_{lp}$ is the synthesized sketch sample distribution, and $m$ is 1-Lipschitz; $W(P_{lp}, S_{lp})$ approximates the Wasserstein distance between the photo and the synthesized sketch. To ensure that the discriminator network satisfies the 1-Lipschitz constraint, Gulrajani et al. [29] added a gradient penalty term to the objective function. In the gradient penalty term, $P_{\hat{x}}$ is the distribution of random interpolation samples between $P_{lp}$ and $S_{lp}$, and $\hat{x}$ is a point randomly sampled between the real data and the generated data. The Wasserstein distance improves the stability of the mask sketch synthesis network, keeps important features such as the mask free of deformation and noise, and allows the local region module to be more flexible. In the adversarial loss of mask sketch synthesis, $\hat{S}_s$ denotes the random interpolation samples between the real sketch sample distribution and the synthesized sketch sample distribution $S_{lp}$. To prevent the mask sketch synthesis model from ceasing to optimize its parameters because the loss reaches an equilibrium prematurely, we add a weight control factor to the adversarial loss. The control factor keeps the proportion of the adversarial loss small at the beginning of training and increases the weight later, so that the network does not converge too early. This improves the learning ability of the network during training and improves the synthesis result.
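For orientation, the general Wasserstein GAN formulation with gradient penalty from [28] and [29] can be written as below. This is a sketch in generic notation, not the paper's own equations: $\mathbb{P}_r$ and $\mathbb{P}_g$ stand for the two distributions being compared, $f$ ranges over 1-Lipschitz functions, and the penalty weight $\lambda$ is an assumed symbol that does not appear in the text above.

```latex
% Wasserstein-1 distance in its dual form
W(\mathbb{P}_r, \mathbb{P}_g) =
  \sup_{\|f\|_L \le 1}
  \mathbb{E}_{x \sim \mathbb{P}_r}[f(x)] - \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[f(\tilde{x})]

% Critic loss with gradient penalty
L_D = \mathbb{E}_{\tilde{x} \sim \mathbb{P}_g}[D(\tilde{x})]
    - \mathbb{E}_{x \sim \mathbb{P}_r}[D(x)]
    + \lambda \, \mathbb{E}_{\hat{x} \sim \mathbb{P}_{\hat{x}}}
      \left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right]

% Interpolated samples used by the penalty
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \qquad \epsilon \sim U[0, 1]
```

The penalty pushes the gradient norm of the critic toward 1 at the interpolated points, which is how the 1-Lipschitz constraint is enforced softly instead of by weight clipping.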
Therefore, after adding the control factor, the adversarial loss is rewritten in terms of $n$, the current number of iterations, $N$, the total number of iterations, and $\omega$, the attenuation coefficient, whose value is fixed at 0.99. The hierarchical generator $G_{mask}$ in the mask module is a residual network with nine residual blocks [30]; its structure is shown in Figure 4, and its input and output are given in Eq. 9. The network consists of one convolutional layer with stride 1, two convolutional layers with stride 2, nine residual units, two transposed convolutional layers, and a final convolutional layer. Each residual unit is implemented with a shortcut connection and a linear forward path composed mainly of convolutional layers, batch normalization layers, and ReLU activations. The discriminator has the same structure as the hierarchical discriminator in the local region module (Figure 3) and takes the full mask sketch as input.
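The layer sequence of the mask generator can be checked with a small shape trace. The layer order and strides come from the description above; the kernel sizes, paddings, and the stride of the transposed convolutions are assumptions borrowed from common residual generators and are not stated in the text.

```python
def conv_out(n, kernel, stride, padding):
    """Spatial size after a standard convolution."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out(n, kernel, stride, padding, output_padding):
    """Spatial size after a transposed convolution."""
    return (n - 1) * stride - 2 * padding + kernel + output_padding


def trace_mask_generator(size=256, n_res=9):
    """Return the spatial size after each stage of the assumed generator."""
    sizes = [size]
    size = conv_out(size, 7, 1, 3)           # stride-1 convolution (assumed 7x7)
    sizes.append(size)
    for _ in range(2):                       # two stride-2 convolutions (assumed 3x3)
        size = conv_out(size, 3, 2, 1)
        sizes.append(size)
    for _ in range(n_res):                   # residual units keep the size
        size = conv_out(size, 3, 1, 1)
    sizes.append(size)
    for _ in range(2):                       # two transposed convolutions (assumed stride 2)
        size = deconv_out(size, 3, 2, 1, 1)
        sizes.append(size)
    size = conv_out(size, 7, 1, 3)           # final convolution (assumed 7x7)
    sizes.append(size)
    return sizes


print(trace_mask_generator())
```

With these assumed settings, a 256×256 photo is downsampled twice, processed by the residual units at the reduced resolution, and restored to 256×256, matching the input and output sizes reported for the mask module.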
2) FACIAL FEATURE EXTRACTOR
The extraction process of the facial feature extractor is shown in Figure 5. It uses the VGG16 [31] network to extract features, from which MHGAN calculates two loss functions: the MF loss between the synthesized sketch and the real sketch, and the MS loss between the smooth image of the real photo and the synthesized sketch. The VGG16 network is built by repeatedly stacking 3×3 convolution kernels and 2×2 max pooling layers. Stacking small convolution kernels dramatically reduces the number of network parameters and, compared with a layer built from a single large convolution kernel, provides more nonlinear transformations, which makes it more suitable for extracting facial features. The purpose of the facial feature extractor is to compensate, to a certain extent, for the defects of the model-driven method. The standard $L_1$ loss is relatively strict: it computes the pixel-wise absolute difference between the real sketch and the synthesized sketch. Applied here, it would make it difficult to capture the facial details and texture features of the sketch and would defeat the purpose of the facial feature extractor. Therefore, the loss function of the facial feature extractor, $L_{mask\_feature}$, instead compares the two feature maps through the sum of singular values, which improves the detailed features of the synthesized sketch. Here $w$ and $h$ are the dimensions of the feature map, and $\phi(s_{lp,i})$ and $\phi(G_{lp}(p_{lp,i}))$ are the feature matrices extracted from the real sketch and the fake sketch. $L_{mask\_feature}$ bridges the feature gap in the latent space. The smooth representation of the input photo displays sparse color blocks, global content, and clear boundaries.
The input of the white box function is a gray-scale photo, which is divided into separate regions using the Felzenszwalb algorithm. Since superpixel algorithms only consider the similarity of pixels, ignore semantic information, and color each segmented region with the average of its pixel values, we further introduce selective search [32] to merge the segmented regions and extract a sparse segmentation map. Finally, the white box function outputs a bright, smooth representation image. High-level features extracted by feeding the smooth image into the VGG16 network impose spatial constraints between the synthesized result and the photo. The white box function is formulated in Eq. 12 and 13, with $S_{m,n} = (\theta_1 \bar{S} + \theta_2 \tilde{S})^{\mu}$, where $S_{m,n}$ is a single pixel, $\bar{S}$ is the average value of the pixels, $\tilde{S}$ is a similar pixel, and $\mu$ is a fixed parameter. Following [33], $\gamma_1$ and $\gamma_2$ are set to 20 and 40. We found that $\mu = 1.1$ is more suitable for gray-scale images and lets our method generate good results. We define $F_{structure}$ as the smooth image structure extraction process, from which $L_{mask\_structure}$ is computed. The facial feature extractor uses VGG16 parameters pre-trained on ImageNet [34].
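The loss function of the mask module is
$$L_{mask} = \alpha L_{mask\_adv} + \beta L_{mask\_feature} + \eta L_{mask\_structure},$$
where $\alpha$, $\beta$, and $\eta$ are hyperparameters that control the contributions of $L_{mask\_adv}$, $L_{mask\_feature}$, and $L_{mask\_structure}$, respectively.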
Furthermore, we compute a fusion loss $L_{fusion}$ between the fake sketch $\hat{s}_i$ and the real sketch $s_i$ to optimize the network.

E. OPTIMIZATION OF THE SYNTHESIS NETWORK
The final synthesis function is obtained by optimizing the total loss function, in which $\chi$ is the hyperparameter that controls the contribution of $L_{fusion}$. Algorithm 1 describes the synthesis process of the network.

The experiments in this paper are performed on two public face sketch datasets, CUFS and CUFSF, and on photos from the internet. The CUFS dataset consists of 606 photo-sketch pairs: 188 pairs from the Chinese University of Hong Kong (CUHK) student dataset [6], 123 pairs from the AR dataset [36], and 295 pairs from the XM2VTS dataset [37]. The CUFSF dataset consists of 1194 photo-sketch pairs in total. Examples from these standard datasets are shown in Figures 7 and 8, and Table 1 shows how each dataset is divided into training and test sets.

B. EXPERIMENTAL DETAILS

1) EXPERIMENTAL SETTINGS
The experiments in this paper are implemented in PyTorch [38]. All photo-sketch pairs in the datasets are geometrically aligned based on three points (the two eye centers and the mouth center), and the aligned facial images are then cropped to 256×256. The input image sizes of the hierarchical generators in the local region module are 40×56 for the eyes, 48×48 for the nose, and 40×64 for the mouth. The input image size of the mask module is 256×256, and the output image size of the fusion module is also 256×256.
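As an illustration of how local regions of these sizes could be cut out around facial landmarks, the sketch below computes crop boxes inside a 256×256 aligned image. It assumes the sizes above are height × width, that each landmark is the center of its region, and that boxes are shifted to stay inside the image; the landmark coordinates are hypothetical, and none of these choices is confirmed by the text.

```python
# Region sizes from the experimental settings, read as (height, width).
REGION_SIZES = {
    "eye_l": (40, 56),
    "eye_r": (40, 56),
    "nose": (48, 48),
    "mouth": (40, 64),
}


def crop_box(center, size, image_size=256):
    """Return (left, top, right, bottom) of a box centered on a landmark.

    The box is shifted if needed so that it stays inside the image.
    """
    cx, cy = center
    height, width = size
    left = min(max(cx - width // 2, 0), image_size - width)
    top = min(max(cy - height // 2, 0), image_size - height)
    return left, top, left + width, top + height


def local_boxes(landmarks, image_size=256):
    """Map each region name to its crop box."""
    return {
        name: crop_box(landmarks[name], REGION_SIZES[name], image_size)
        for name in REGION_SIZES
    }


# Hypothetical (x, y) landmark centers in an aligned 256x256 face image.
example_landmarks = {
    "eye_l": (90, 110),
    "eye_r": (166, 110),
    "nose": (128, 150),
    "mouth": (128, 195),
}

for name, box in local_boxes(example_landmarks).items():
    print(name, box)
```

Clamping the box instead of padding the image keeps every crop at the exact size expected by its generator.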

2) TRAINING DETAILS
In this subsection, we show how each hyperparameter of the loss function is determined and analyze its sensitivity. We use a grid search: each hyperparameter takes the values 0.02, 0.1, 0.5, 1, 10, 15, and 25, we calculate the average FSIM on the CUFS dataset for each setting, and we select the values with the highest FSIM as the final parameters, which are δ = 25, γ = 1, α = 1, β = 10, η = 0.02, and χ = 0.02.
To analyze the sensitivity of the overall loss to each hyperparameter, we start from the selected values δ = 25, γ = 1, α = 1, β = 10, η = 0.02, and χ = 0.02 and, for each hyperparameter in turn, keep the other five fixed while varying it over 0.02, 0.1, 0.5, 1, 10, 15, and 25 and calculating the average FSIM on the CUFS dataset. The experimental results are given in Table 2.
Table 2 shows that the FSIM value increases continuously as δ in the local region module increases, which indicates that δ has the greatest impact on the local region loss and on the overall loss. The parameter β of the mask module has the second greatest influence. The parameters γ and α, which weight the losses that drive the image translation in the local region module and the mask module, have less influence on the overall loss than δ and β. Small fluctuations of χ and η have the least impact on the total loss, but if η is too large, it has a greater impact on the mask sketch.
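The one-at-a-time sensitivity analysis can be sketched as a simple sweep. The grid and the selected values are taken from the text; the `evaluate` callback is hypothetical and stands for training the model with the given weights and returning the average FSIM on CUFS. The `toy_fsim` function is only a stand-in score so that the sketch runs, and it does not reproduce any result from Table 2.

```python
GRID = [0.02, 0.1, 0.5, 1, 10, 15, 25]
SELECTED = {"delta": 25, "gamma": 1, "alpha": 1, "beta": 10, "eta": 0.02, "chi": 0.02}


def sensitivity_sweep(evaluate, selected=SELECTED, grid=GRID):
    """Vary one hyperparameter at a time while the other five stay fixed."""
    table = {}
    for name in selected:
        rows = []
        for value in grid:
            params = dict(selected)   # copy so the baseline is never modified
            params[name] = value
            rows.append((value, evaluate(params)))
        table[name] = rows
    return table


def toy_fsim(params):
    """Stand-in score, NOT a real FSIM: it simply peaks at delta = 25."""
    return 1.0 - abs(params["delta"] - 25) / 100.0


table = sensitivity_sweep(toy_fsim)
best_delta = max(table["delta"], key=lambda row: row[1])[0]
print(best_delta)
```

Each row of the returned table corresponds to one hyperparameter, which mirrors the layout of a sensitivity table with one varied parameter per row.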
The learning rate and batch size are set to 0.0002 and 1, respectively. All modules are optimized with Adam [39] with β1 = 0.5 and β2 = 0.999, and, following the above analysis, the weight coefficients of the loss functions are χ = η = 0.02, γ = 1, δ = 25, α = 1, and β = 10. Training takes 300 epochs in total.
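The reported settings can be collected in a small configuration sketch. The pairing of weights with loss terms follows the text (γ and δ for the local adversarial and $L_1$ losses, α, β, and η for the mask losses, χ for the fusion loss); treating the total as the plain sum of the module losses plus the weighted fusion loss is an assumption, and the dictionary keys are hypothetical names.

```python
# Values reported in the training details.
TRAINING = {
    "learning_rate": 0.0002,
    "batch_size": 1,
    "adam_betas": (0.5, 0.999),
    "epochs": 300,
}
WEIGHTS = {"gamma": 1, "delta": 25, "alpha": 1, "beta": 10, "eta": 0.02, "chi": 0.02}


def total_loss(terms, w=WEIGHTS):
    """Combine the individual loss values into one scalar.

    ASSUMPTION: total = local loss + mask loss + chi * fusion loss.
    """
    local = w["gamma"] * terms["local_adv"] + w["delta"] * terms["local_l1"]
    mask = (
        w["alpha"] * terms["mask_adv"]
        + w["beta"] * terms["mask_feature"]
        + w["eta"] * terms["mask_structure"]
    )
    return local + mask + w["chi"] * terms["fusion"]


# Hypothetical loss values, all equal to 1, to show the relative weighting.
unit_terms = {
    "local_adv": 1.0,
    "local_l1": 1.0,
    "mask_adv": 1.0,
    "mask_feature": 1.0,
    "mask_structure": 1.0,
    "fusion": 1.0,
}
print(total_loss(unit_terms))
```

With equal raw loss values, the local $L_1$ term and the mask feature term dominate the sum, which is consistent with δ and β being described as the most influential weights.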

C. ABLATION STUDY
MHGAN combines several components for face sketch synthesis. Several ablation studies are conducted on the CUFS dataset to verify the contribution of each component to the proposed method. Qualitative and quantitative evaluations are shown in Figure 9 and Table 3.
The facial features of a face sketch involve a variety of drawing techniques and feature details. As shown in Figure 9 (c) and (g), MHGAN without the local region module (MHGAN W/O LRM) cannot learn the details of local facial regions well with a single GAN, and its results contain blurry regions compared with our full method. The local region module of MHGAN is therefore essential for capturing the texture details and styles of facial features.
As shown in the red boxes in Figure 9 (d) and (g), when the network converges, the results of MHGAN W/O MF loss (d) contain more blurry facial regions and lose some detail features; omitting the MF loss reduces the quality of the synthesized sketches. As shown by the red boxes in Figure 9 (e) and (g), the results of MHGAN W/O MS loss (e) contain artifacts around local regions such as the mouth. The MS loss further optimizes the synthesized result and makes the outlines of the facial features more precise. The MF loss and MS loss in the facial feature extractor of the mask module are crucial for generating a suitable mask.
Finally, we use the fusion module to maintain boundary consistency and obtain the fake sketch. As shown in Figure 9 (f) and (g), if an ordinary guidance field is used in the fusion module instead of the non-conservative guidance field, noticeable editing marks appear at the boundaries of the facial regions of the sketch. These traces significantly degrade the subjective visual effect and are an obvious defect for face sketch synthesis. This shows that the non-conservative guidance field improves the fusion of the synthesized sketch.
As shown in Table 3, the average SSIM, FSIM and LPIPS scores of MHGAN using all components are the best performance, which shows the effectiveness of each component in MHGAN.
As shown in Figures 10, 11, and 12, the results synthesized by the LLE and MRF methods are relatively smooth and lose some common facial structures, such as the hairstyles in the third and fourth rows of AR and the facial contours in the second row of CUHK; many artifacts also appear in the first and third rows of XM2VTS. Although the MWF method can generate new candidate blocks by a weighted combination strategy, it cannot generate specific features that exist only in the test sample, such as the glasses in the second and third rows of XM2VTS. The results synthesized by the RSLCR method are blurred at the facial contour and hairstyle. These data-driven methods only consider the similarity of images at the pixel level, so they cannot describe facial characteristics well, which leads to problems such as blur and a lack of texture features. Although the model-driven methods perform better than the above methods and can generate most facial region features, they introduce some noise and reduce the sharpness of the results. For example, in the results synthesized by the FCN method, artifacts appear on the face, which affects the subjective visual effect. The sketches synthesized by the GAN method also lose detail characteristics, such as headdresses. In contrast, the sketches synthesized by our method have the best quality. They contain clear and sharp facial features, decorations, and other abundant low-level features. For example, the hairpins in the first row of CUHK, the hairstyles of AR, and the glasses in the first and second rows of XM2VTS are rendered well subjectively. From the synthesized results on the CUHK, AR, and XM2VTS datasets, we can see that our results maintain complete facial structure and contours, mainly owing to the MF loss and MS loss in the mask module of our method, which minimize artifacts while ensuring an accurate sketch effect. Furthermore, the Multi-Hierarchies division makes the synthesized results clearer and sharper.

2) QUANTITATIVE EVALUATION ON THE CUFS DATASET
A quantitative analysis is performed to objectively evaluate the effectiveness of MHGAN in face sketch synthesis. Due to the lack of professional and objective evaluation methods for face sketches, we use traditional image quality assessment methods to evaluate the quality of the synthesized sketches. We use the feature similarity index (FSIM) [40], structural similarity (SSIM) [41] and LPIPS [42]. The average FSIM, SSIM and LPIPS scores of the synthesized sketches are listed in Table 4. The higher the FSIM and SSIM values and the lower the LPIPS value, the better the image quality. The numbers in bold in the table are the maximum values of each index. Figure 13 (a) and (b) show the FSIM and SSIM score statistics of all methods on the CUFS dataset. The horizontal axis shows the quality evaluation score, which ranges from 0 to 1. The vertical axis represents the percentage of synthesized sketches whose quality evaluation scores are larger than the score marked on the horizontal axis. Figure 13 (c) shows the LPIPS score statistics of all methods on the CUFS dataset.
Unlike (a) and (b), the vertical axis in (c) represents the percentage of synthesized sketches whose quality evaluation scores are lower than the score marked on the horizontal axis. It can be seen from the curves that MHGAN achieves better performance than the other methods on the SSIM, FSIM and LPIPS scores.
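The curves described for Figure 13 are cumulative percentage curves, and their construction can be sketched in a few lines. This is an illustrative sketch only: the function name `cumulative_curve` and the sample scores are hypothetical and do not come from the paper's evaluation code or data.

```python
def cumulative_curve(scores, thresholds, higher_is_better=True):
    """Percentage of sketches on the good side of each threshold.

    For SSIM and FSIM (higher is better) a sketch is counted when its
    score is larger than the threshold; for LPIPS (lower is better) it
    is counted when its score is lower than the threshold.
    """
    n = len(scores)
    curve = []
    for t in thresholds:
        if higher_is_better:
            count = sum(1 for s in scores if s > t)
        else:
            count = sum(1 for s in scores if s < t)
        curve.append(100.0 * count / n)
    return curve


# Made-up per-sketch scores, used only to show the shape of the output.
example_scores = [0.2, 0.4, 0.6, 0.8]
thresholds = [0.0, 0.5, 1.0]

ssim_like = cumulative_curve(example_scores, thresholds, higher_is_better=True)
lpips_like = cumulative_curve(example_scores, thresholds, higher_is_better=False)
print(ssim_like)   # [100.0, 50.0, 0.0]
print(lpips_like)  # [0.0, 50.0, 100.0]
```

Under this construction, a method whose SSIM or FSIM curve lies further to the upper right, or whose LPIPS curve lies further to the upper left, has better scores.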
As shown in Table 4, the two average quality evaluation scores of MHGAN are higher than those of the other methods on the CUFS dataset, which means that the sketches generated by our method are closer to the real sketches. Although the average SSIM score of the RSLCR method is higher than that of our method on the CUHK dataset, its results have more blurred regions than those of MHGAN in Fig. 10. The work of [43] found that, when the original sketch is used as the reference image, the FSIM score is closely related to human subjective evaluation. Table 4 shows that the average FSIM score of MHGAN on the CUFS dataset is higher than that of the other methods. Furthermore, Table 4 shows that our method achieves the best LPIPS among all the methods, which indicates that the sketches synthesized by our method have the best perceptual quality and texture details most similar to the real sketches. Therefore, considering both the subjective results and the quality evaluation scores, our method is very competitive and can generate high-quality sketches.

3) COMPARISON AND EVALUATION ON THE CUFSF DATASET
Compared with the face photo, the corresponding sketch drawn by a forensic expert in intelligent security applications exhibits shape exaggeration. To verify the robustness of our method on exaggerated images, we also conducted comparative experiments on the CUFSF dataset, in which the sketches drawn by artists are exaggerated in shape and expression. Table 5 shows the average SSIM, FSIM and LPIPS evaluation scores of MHGAN and the other methods on the CUFSF dataset. Figure 13 (d), (e) and (f) show the statistics of the FSIM, SSIM and LPIPS scores of all methods on the CUFSF dataset. In Figure 14, the LLE, MRF, MWF, and RSLCR methods lose characteristic features, such as the hair streaks in the first and second rows. The FCN method produces very complicated artifacts. In the GAN method, some facial components (such as the mouth and eyes) are deformed, and some common facial structures, such as facial contours, are lost. The comparison in Figure 14 shows that the sketches synthesized by the proposed method are more vivid than those of the other methods. The quantitative comparison in Table 5 and Figure 13 (d), (e) and (f) also shows that our method is superior to the other methods.

4) COMPARISON WITH GAN-BASED FACE SKETCH SYNTHESIS METHODS
We compare our method with three other GAN-based face sketch synthesis methods. Figure 15 shows the face sketch synthesis results on the CUFS and CUFSF datasets obtained by the FSW [26] method, the MDAL [25] method, the PS2-MAN [24] method, and our method. Table 6 shows the SSIM, FSIM and LPIPS evaluation results of the four methods on the two datasets. The results of the FSW method on the CUFSF dataset are smooth and blurry. Although the MDAL and PS2-MAN methods synthesize relatively good visual results on CUFS, the facial contours and partial structures are still blurred, and facial region features (such as the region around the eyes) are lost on CUFSF. The proposed method produces sharp facial features and is better than the other methods in the FSIM and LPIPS scores.

5) COMPARISON WITH GAN-BASED IMAGE-TO-IMAGE TRANSLATION METHODS
To further illustrate the effectiveness of MHGAN, it is compared with three other GAN-based image-to-image translation methods. Figure 16 shows the synthesis results on the CUFS and CUFSF datasets obtained by the UNIT [44] method, the Cycle-GAN [23] method, the Dual-GAN [45] method, and our method. The sketches synthesized by the UNIT method are blurrier and less authentic. Although the Cycle-GAN method can generate sketches with relatively good visual effects, its results still contain blurs and artifacts. The sketches synthesized by the Dual-GAN method have severe facial distortion. The proposed method overcomes blur and deformation, and its synthesized results are more realistic and clearer. Table 7 shows the SSIM, FSIM and LPIPS evaluation scores of the four methods on the CUFS and CUFSF datasets. MHGAN is superior to the other three GAN-based image-to-image translation methods in the quality assessment.

6) FACE RECOGNITION
Face sketch recognition is commonly used for the quantitative evaluation of face sketch synthesis [46], and it is also an important application of face sketch synthesis [47]. Higher-quality synthesized sketches should yield higher recognition accuracy. We employ Null-space Linear Discriminant Analysis (NLDA) [48] for face recognition experiments to validate our method. In the recognition stage, all the photos in the dataset are first transformed into sketches by the face sketch synthesis method, and the input sketches are then matched against the synthesized sketches. For the CUFS dataset, we randomly select 150 synthesized sketches and the corresponding real sketches to train the classifier, and use the remaining 188 synthesized sketches and the corresponding 188 original sketches as the testing set. For the CUFSF dataset, we randomly select 300 synthesized sketches and the corresponding real sketches for training, and use the remaining 644 for testing. We randomly divided the data of the CUFS and CUFSF datasets and repeated the face recognition experiment 20 times with NLDA. Figure 17 (a) and (b) show the face recognition rates against the number of dimensions on the CUFS and CUFSF datasets. Table 8 shows the best face recognition rate at a certain number of dimensions. It can be concluded from Figure 17 and Table 8 that the recognition rate of the MHGAN model is the highest on the CUFSF dataset, and that it is also very competitive with the other methods on the CUFS dataset.
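The repeated random-split protocol above can be sketched as follows. This is a simplified illustration under stated assumptions: the NLDA projection itself is omitted, the helper names `repeated_splits` and `rank1_rate` are hypothetical, and matching is done by a plain nearest-neighbour search on whatever feature vectors are supplied. The totals of 338 and 944 pairs follow from the split sizes given in the text (150 + 188 and 300 + 644).

```python
import random


def repeated_splits(n_pairs, n_train, n_repeats=20, seed=0):
    """Return n_repeats random (train_indices, test_indices) partitions."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        indices = list(range(n_pairs))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


def rank1_rate(probe_feats, gallery_feats):
    """Fraction of probes whose nearest gallery item has the same index.

    probe_feats[i] and gallery_feats[i] are assumed to belong to the same
    person; in the paper's setting the features would come from NLDA,
    which is not implemented in this sketch.
    """
    hits = 0
    for i, probe in enumerate(probe_feats):
        dists = [sum((a - b) ** 2 for a, b in zip(probe, g))
                 for g in gallery_feats]
        if dists.index(min(dists)) == i:
            hits += 1
    return hits / len(probe_feats)


cufs_splits = repeated_splits(n_pairs=338, n_train=150)    # 150 train, 188 test
cufsf_splits = repeated_splits(n_pairs=944, n_train=300)   # 300 train, 644 test

# Toy two-person example with made-up 2-D features.
toy_rate = rank1_rate([[0.0, 0.0], [1.0, 1.0]], [[0.1, 0.0], [0.9, 1.0]])
```

The recognition rate for a method would then be averaged over the 20 partitions, once for each number of retained dimensions.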

7) SKETCH SYNTHESIS ON THE REAL WORLD PHOTOS
The photos tested in the above experiments were all taken under specific conditions (such as lighting, background, and facial expressions). However, facial photos taken by users of digital entertainment applications usually show different head poses and changing lighting environments. To further illustrate the applicability of our method in digital entertainment, we compare MHGAN with the MDAL and PS2-MAN methods, which have better visual performance in the above experiments. Figure 18 shows the sketches synthesized by these methods. The photos used in the test are all from the internet, and the training set is from the AR and CUHK datasets, containing 311 individuals. As shown in Figure 18, although the sketches synthesized by the MDAL method have a rich texture, the eye structure of the sketch in the second row is deformed. The sketches synthesized by the PS2-MAN method in the third row lack important facial regions and sketch texture. Compared with the sketches synthesized by the other methods, the sketches synthesized by our method in the fourth row are clearer and more realistic, especially in the facial region details. This is due to the Multi-Hierarchies division of facial images in our method, which does not miss facial region features. The comparative experiments on real-world internet photos show that our method has excellent performance and more robust applicability.

V. CONCLUSION
To address the problem that local regions of sketches easily expose small artifacts and blur, we proposed a multi-hierarchies GAN-based face sketch synthesis method. A face photo is divided into multiple hierarchical structures, which are input to the local region module and the mask module. The local region module learns the detailed features of different local regions of the face and generates local region sketches. The mask module generates a coarse facial structure of a sketch. Finally, the local region sketches and the mask sketch are input to the fusion module to generate the fake sketch. Experiments on different datasets illustrate that the proposed method can synthesize sketches with good details and fine facial textures. Quantitative evaluations show that our method achieves better performance than the state-of-the-art methods and is robust. In the future, we will modify the encoder or the fusion module to improve our method's synthesis speed. We also intend to enhance our method's practical application in the wild.
YANAN GUO received the B.Sc. degree from Hubei Polytechnic University, in 2014, and the Ph.D. degree from Yunnan University, in 2019. She is currently a Teacher of electronic engineering with Beijing Information Science and Technology University. Her research interests include machine learning and computer vision.
TAO WANG received the Ph.D. degree from the School of Electronic and Information Engineering, Beihang University, in 2019. He is currently a Teacher of electronic engineering with Beijing Information Science and Technology University. His research interests include radar signal processing, data fusion, and target localization and tracking in intelligent transportation systems or vehicle intelligent assistance systems.