Multi-Scale Feature Channel Attention Generative Adversarial Network for Face Sketch Synthesis

Face sketch synthesis from photos is an applied research topic that is critical for criminal investigation. However, sketch synthesis still faces challenges such as blur and artifacts in the generated face sketches. To mitigate these problems, we propose a fast Generative Adversarial Network with Multi-scale feature channel Attention, namely MAGAN. In the generator network, multi-scale features are extracted by the proposed multi-scale feature extraction module to produce detailed sketches. Then, a channel attention mechanism is applied to emphasize important feature channels, further enhancing the synthesized sketches. Besides, a patch-wise loss on high-layer features from the VGG-19 network is applied to supervise the generator to synthesize more realistic sketches. To accelerate the training process, the features from the pooling layers are adopted to calculate the pseudo sketch feature loss. The experimental results demonstrate that our MAGAN achieves better performance in both visual and quantitative evaluations (in terms of feature similarity and learned perceptual image patch similarity) than the state-of-the-art methods.


I. INTRODUCTION
Face sketch synthesis aims at generating a face sketch from a given face photo. It is widely used in both digital entertainment and law enforcement [1]. However, face sketch synthesis still faces many challenges in practical applications. For example, the photo of a suspect is often unavailable because of the low quality of surveillance video. A sketch drawn by an artist from the descriptions of witnesses or from low-resolution videos can assist in identifying the suspect; face sketch synthesis can bridge the style gap and improve the face recognition rate. Besides, to protect their privacy, many people tend to use face sketches rather than real photos as avatars for social accounts. However, it is inconvenient for ordinary people to obtain sketches.
Face sketch synthesis techniques allow us to produce sketches in just a few minutes without the help of professional artists.
Many works have focused on face sketch synthesis. Among them, the most typical are exemplar-based methods [2]-[5], which achieve good performance by operating on image patches. Exemplar-based methods divide images into patches and search for neighbors of each input patch; the corresponding sketch patches are then used as references for synthesis. Although some studies [2]-[4] have reported promising results, problems remain in the synthesized outputs. For example, the synthetic sketches are over-smoothed and cannot preserve important details, such as glasses or hair accessories. Besides, the patch matching and weight computation algorithms are usually too time-consuming for practical applications.
In recent years, with the development of deep neural networks, convolutional neural networks (CNNs) have been applied to learn the mapping between photos and sketches directly. Zhang et al. [6] constructed a fully convolutional network (FCN) to learn this nonlinear mapping. However, the synthesized sketches are very blurry and contain many artifacts. The Generative Adversarial Network (GAN) [7] is an effective way to alleviate the blur problem in CNN-based methods: a GAN can generate fake sketches that resemble real ones. However, GAN-based methods also introduce undesirable artifacts into the results. Deep learning based methods [8]-[12] cannot synthesize satisfying sketches because there are not enough paired photo-sketch images for training. Some studies [13]-[17] combine neural networks with exemplar-based methods to further improve synthesis performance. It has been shown that methods combining the two approaches can outperform either kind of method alone.
In this paper, we aim to develop a fast face sketch synthesis method that exploits the characteristics of sketches and produces more detailed sketch images with specific textures. Following FSWild [18], we combine deep learning networks with traditional patch-based methods and benefit from both. A GAN is adopted as the main framework to generate sketch images, with a deep residual network [19] with skip connections as the generator. A Markov random field is applied at the neural patch level to match similar feature patches and obtain the corresponding pseudo sketch features. Besides, we propose a multi-scale feature extraction scheme to capture features at different scales, matching the textures of different scales in the sketches. Meanwhile, a channel attention mechanism is introduced into the generator to learn the inter-relationships of different feature channels. By combining the ideas of exemplar-based methods and GANs, we can synthesize more detailed sketches with better visual quality. Moreover, we extract the high-level features from pooling layers instead of ReLU layers to speed up the training process. The experimental results show that our method achieves better performance in both qualitative and quantitative evaluations at a faster speed.
The main contributions of our work are threefold: (1) Multi-scale feature extraction takes both coarse and fine features into account to generate more exquisite sketch images with sketch textures and shadows. (2) A channel attention mechanism is introduced to weight the different feature channels by exploring their interdependencies, allowing the generator to emphasize key features and suppress trivial ones. (3) The patch-wise feature loss computed from the pooling layers is shown to be robust for generating sketch-like images. Compared with the feature loss calculated from the ReLU layers [18], it accelerates the training process by 3 times.
The rest of the paper is organized as follows: Section II reviews the related work. Section III introduces our face sketch synthesis model in detail. Section IV presents the implementation details. Section V shows the experimental results and comprehensive analysis of the proposed method. Conclusions are drawn in Section VI.

II. RELATED WORK
In this section, we first review the exemplar-based sketch synthesis methods proposed in previous work. Then, we discuss different strategies for producing dense sketch outputs via GANs. Finally, we briefly introduce the attention mechanisms related to our method.
A. EXEMPLAR-BASED PHOTO-SKETCH METHODS
Exemplar-based methods are classical and widely adopted. Tang and Wang [1], [20] and Wang et al. [21] first applied principal component analysis (PCA) [22] to project the test photo onto the eigenspace of the training photos. The same linear combination coefficients are then used to reconstruct the target sketch from the training sketches. However, the linear assumption between photo space and sketch space does not always hold, and the generated sketches sometimes suffer from over-smoothing and fail to preserve subtle content. Inspired by Locally Linear Embedding (LLE) [23], Liu et al. [24] suggested learning a nonlinear mapping to reconstruct sketches. The method splits the facial photos into overlapping patches and learns piecewise linear mappings. The K nearest neighbors are selected from the paired database for each test photo patch, and the reconstruction weights are computed by minimizing the reconstruction error. The target sketch is obtained by a linear combination of the corresponding retrieved sketch patches with the same reconstruction weights. Song et al. [4] explored a real-time sketch synthesis method that refines the synthesis results with an image denoising method. Wang and Tang [2] (MRF) introduced a Markov random field to model the relationship between adjacent patches, taking the smoothness constraints between image patches into consideration. However, the MRF method cannot generate new patches that are not in the training database. To alleviate this problem, Zhou et al. [3] (MWF) designed a weighted MRF model, which generates target sketch patches by weighting K nearest neighbors and converts the MRF problem into a convex quadratic program. Peng et al. [25] segmented the face images into superpixels instead of square patches to preserve the facial structure. Song et al. [5] accelerated the neighborhood search with an offline random sampling strategy and calculated the reconstruction weights under locality constraints. There are also methods based on Bayesian inference [26], [27] and sparse representation [28]-[33]; the main idea of the sparse representation methods is to replace the original sketch patches with trained dictionary atoms. A common problem of the exemplar-based methods is that the synthesized sketches are over-smoothed and lack sketch textures.

B. GAN-BASED PHOTO-SKETCH METHODS
GANs have achieved promising results in image generation and have been widely used in image-to-image translation tasks, including face photo-sketch synthesis. The idea of the GAN is inspired by game theory. The framework consists of a generator (used to generate the target images) and a discriminator (used to distinguish the generated images from the ground truths). The key idea is the adversarial loss, which forces the generated images to be similar to the real ones so that the discriminator cannot tell fake images from real ones. Isola et al. [8] (pix2pix) explored the first common framework for image-to-image translation tasks with paired image datasets. Zhu et al. [9] (CycleGAN) proposed a cyclic framework to learn a bidirectional mapping between two domains using a cycle consistency loss, which can work on unpaired datasets. Wang et al. [10] (DualGAN) proposed the same cyclic framework with a different network architecture. The cyclic framework consists of two pairs of GANs: one for the forward direction (photo → sketch) and the other for the backward direction (sketch → photo). In the photo → sketch direction, a photo is translated to a fake sketch by $G_A$, and the fake sketch is then transformed back to a cycled photo through $G_B$; the reverse sketch → photo direction works similarly. Therefore, the cycle consistency loss [9] between the input images and the cycled images can be added to the loss function to constrain the generators, and both $G_A$ and $G_B$ can be obtained in a single training. Wang et al. [11] proposed a framework for photo-sketch synthesis involving multi-adversarial networks (PS2MAN). It generates images at different resolutions and employs multiple discriminators to supervise the synthesis process. Chao et al. [34] utilized high-level features as a loss to train a GAN to generate photos from sketches. Zhang et al. [35] applied multi-domain adversarial learning to overcome blur and deformation in the synthesis results. Bi et al. [36] constructed a three-layer pyramid model to obtain multi-scale information for synthesizing sketch textures. Yu et al. [12] proposed a composition-aided GAN (CAGAN) that uses pixel-wise face labels to help generate sketches. Di et al. [37] combined two different GANs to synthesize photos from visual attributes. However, these methods still fail to synthesize realistic sketch textures, and they introduce many artifacts into the resulting images.
Some studies combine neural networks with exemplar-based methods to further improve synthesis performance. Zhang et al. [13] used low-rank representation for inter-domain transfer and a GAN for intra-domain transfer to jointly synthesize sketches. Zhang et al. [14] introduced a neural network into a probabilistic graphical model (NPGM). Peng et al. [15] proposed a deep patch representation-based probabilistic graphical model (DeepPGM) for face sketch synthesis in the wild. Zhang et al. [16] proposed a face sketch synthesis framework based on deep latent low-rank representation (DLLRR), which utilizes encoders to generate hidden sketches to augment the training images and uses low-rank representation to synthesize sketches. Zhang et al. [17] proposed a cascaded face sketch synthesis framework that extracts deep features from a VGG network and applies them in a cascaded low-rank representation to handle various illuminations.

C. ATTENTION MECHANISM
Attention plays an important role in human perception [38]. The human visual system focuses only on regions of interest, rather than the entire image or sequence. Attention mechanisms have been incorporated into deep learning frameworks to improve the performance of convolutional neural networks: they make models focus selectively on the salient parts to better capture visual features. Attention mechanisms have proven very effective in many visual tasks, including image classification, image captioning, and image super-resolution. Hu et al. [39] introduced a squeeze-and-excitation block to explore the relationships between channels, in which globally average-pooled features are used to calculate channel-wise attention. Parmar et al. [40] proposed an image transformer model that adds attention to an autoregressive model for image generation. However, attention mechanisms are rarely used in face sketch synthesis. In this paper, we introduce an attention mechanism to face sketch synthesis and further improve synthesis performance.

III. MULTI-SCALE FEATURE ATTENTION GENERATIVE ADVERSARIAL NETWORK
Inspired by GANs, we utilize the GAN framework to generate fake sketches. We are given a training database $F = \{(p_i, s_i)\}_{i=1}^N$, which consists of $N$ paired photo-sketch images: photos $\{p_i \in P\}_{i=1}^N$ and sketches $\{s_i \in S\}_{i=1}^N$. The framework of our method is similar to the classic GAN, with a generator $G$ and a discriminator $D$. The generator converts an input photo $p_t$ to the target sketch domain to obtain the fake sketch $\hat{s}$. The discriminator distinguishes the generated sketch images $\{G(p)\}$ from the real ones $\{s\}$. The generator network consists of several convolutional layers, residual blocks, and deconvolutional layers. Different from the classic GAN, we introduce multi-scale feature extraction, a channel attention mechanism, and a Markov Random Field (MRF) feature representation to guide the generation of sketch-style images.
The framework of our method is shown in Fig. 1. The generator network $G$ is a deep residual network with skip connections, used to generate fake sketches from input photos. Multi-scale feature capture is first used to extract features at different scales. Then, the attention mechanism is applied to learn the interdependencies of different features. Instead of training the generator with the real sketches directly, the pseudo sketch feature loss [18], [41] computed from a pre-trained VGG-19 network is utilized to supervise the synthesis. A total variation loss is also utilized to suppress noise in the generated images. A discriminator network $D$ is applied to minimize the gap between generated sketches and real sketches.
Next, we detail these components of the proposed Multi-scale feature Attention Generative Adversarial Network (MAGAN).

A. MULTI-SCALE FEATURE CAPTURE
Sketch images often contain rich textures and lines at different scales. Meanwhile, there exist large deformations between the photo and sketch domains. Single-scale features may not capture all the spatial information necessary for fine-grained generation. Therefore, we propose a multi-scale feature capture scheme that uses convolutional kernels of various scales to extract features at different scales, as demonstrated in Fig. 2. More specifically, a given input photo $p$ is first fed to three sets of convolutional kernels of three scales: a set of $1 \times 1$ convolutions, a set of $3 \times 3$ convolutions, and a set of $5 \times 5$ convolutions, yielding features $f_1$, $f_2$ and $f_3$, respectively. In addition, since dilated convolutions are effective for multi-scale feature capture [42], a set of $3 \times 3$ dilated convolutions with a dilation rate of 2 is also applied to obtain feature $f_4$. The final features are obtained by concatenating all the resulting features along the depth axis:
$$f = \mathrm{concat}(f_1, f_2, f_3, f_4), \qquad (1)$$
where $\mathrm{concat}(\cdot)$ is a channel-wise concatenation operation. Thus, the multi-scale feature extraction does not increase the depth of the generator. The features are then fed to an instance normalization layer and a ReLU layer. Note that a stride of 1 is applied to all convolutional layers. In this way, the network can deal with both coarse features, such as lines, and fine features, such as eyes.
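To make the block concrete, the following is a minimal PyTorch sketch of the multi-scale feature capture described above. The channel counts (`in_ch`, `branch_ch`) are illustrative assumptions, as the paper does not specify them in this section.

```python
import torch
import torch.nn as nn

class MultiScaleFeatureCapture(nn.Module):
    """Sketch of the multi-scale feature capture block (Eq. 1)."""
    def __init__(self, in_ch=64, branch_ch=16):
        super().__init__()
        # Four parallel branches, all with stride 1 so the spatial size is kept.
        self.conv1x1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.conv3x3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.conv5x5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        # 3x3 dilated convolution with dilation rate 2 (padding 2 keeps the size).
        self.dilated = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=2, dilation=2)
        self.norm = nn.InstanceNorm2d(4 * branch_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        f1, f2, f3 = self.conv1x1(x), self.conv3x3(x), self.conv5x5(x)
        f4 = self.dilated(x)
        # Channel-wise concatenation: f = concat(f1, f2, f3, f4).
        f = torch.cat([f1, f2, f3, f4], dim=1)
        return self.relu(self.norm(f))
```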

B. CHANNEL ATTENTION
After capturing multi-scale features through the proposed feature extraction, we further explore the relationships between the channels of the multi-scale features. Different channel features should have different importance, yet previous CNN-based networks treat all channel-wise features equally, which is inconsistent with the actual situation. To explore the interdependencies between feature channels, we introduce the Squeeze-and-Excitation (SE) block into the generator network. It emphasizes important features and suppresses trivial ones, making the generator more powerful.
The channel attention mechanism with squeeze and excitation is presented in Fig. 3. Following SENet [39], global average pooling is adopted as the channel-wise global information descriptor. Denote the input features as $X = [x_1, \ldots, x_c, \ldots, x_C]$, which contain $C$ channels of feature maps with spatial size $H \times W$. The channel-wise statistic $z$ is obtained by shrinking $X$ through the spatial dimensions $H \times W$; the $c$-th element of $z$ is calculated by:
$$z_c = S(x_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} x_c(i, j), \qquad (2)$$
where $x_c(i, j)$ is the value at position $(i, j)$ of the feature channel $x_c$ and $S(\cdot)$ denotes the squeeze function. The obtained channel statistic $z$ can be viewed as a description of local feature importance: the larger the channel weight, the more important the feature channel. To make use of the information from the squeeze operation and capture channel-wise dependencies, a simple gating mechanism with a sigmoid function is adopted to extract the statistical scaling information:
$$u = f(W_u \, \delta(W_d \, z)), \qquad (3)$$
where $\delta(\cdot)$ refers to the ReLU function and $f(\cdot)$ is the sigmoid gating. $W_d \in \mathbb{R}^{C \times \frac{C}{r}}$ and $W_u \in \mathbb{R}^{\frac{C}{r} \times C}$ denote $1 \times 1$ convolutional layers that down-scale and up-scale the channels at a ratio $r$. The final output of the SE block is obtained by rescaling $x_c$:
$$\hat{x}_c = u_c \cdot x_c, \qquad (4)$$
where $u_c$ is the scaling factor of the $c$-th channel. With this lightweight channel attention mechanism, the network can exploit contextual information in a global view and pay more attention to important features. Besides, the channel attention mechanism in Fig. 3 does not increase the depth of the generator, and from Eq. 2-4 we can see that apart from the two $1 \times 1$ convolutional layers, the squeeze and rescaling operations introduce no learnable parameters.
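A minimal PyTorch sketch of the channel attention in Eq. 2-4 follows; the reduction ratio `r` shown is an assumption, as the paper does not state its value.

```python
import torch.nn as nn

class SEChannelAttention(nn.Module):
    """Squeeze-and-excitation channel attention (Eq. 2-4)."""
    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # Eq. 2: global average pooling
        self.W_d = nn.Conv2d(channels, channels // r, 1)  # learnable 1x1 down-scaling
        self.W_u = nn.Conv2d(channels // r, channels, 1)  # learnable 1x1 up-scaling
        self.relu = nn.ReLU(inplace=True)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        z = self.squeeze(x)                                   # B x C x 1 x 1 statistics
        u = self.sigmoid(self.W_u(self.relu(self.W_d(z))))    # Eq. 3: sigmoid gating
        return x * u                                          # Eq. 4: rescale each channel
```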

C. VGG-BASED MRF FEATURE REPRESENTATION
The features from the VGG-19 network have proven effective for image matching [18], [41]. Following the FSWild method [18], a VGG-based feature representation [18], [41] is utilized to represent the features of an entire image. We divide the feature maps from the VGG network into patch-level local features. In this way, for any photo patch, similar feature patches can always be found in the paired training dataset as references. Therefore, every input photo can obtain its corresponding sketch features in a semi-supervised manner.
The pre-trained VGG-19 network is used to extract the feature map $\Phi(p_t)$ of the input photo $p_t$; the output of the middle layer $l$ is denoted $\Phi^l(p_t)$. The feature representation process is divided into two steps. First, we search for the best candidate photo in the paired training database for the input photo $p_t$; second, we search for the nearest neighbors of each neural patch of $p_t$ within the result of the first step. In the first step, the VGG-19 features of the entire image are used to match the most similar photo. We compute all the feature maps of the photos in the training database off-line to accelerate the search. Only the best-matching photo, together with its sketch, is taken as the reference set $R$ for the second step. In the second step, for each patch $\Psi_i(\Phi^l(p_t))$, we find its nearest neighbor as its feature representation, using the normalized cross-correlation between two patches to determine the nearest neighbor in the reference set $R$. The index $j^*(i)$ of the best candidate is given by:
$$j^*(i) = \arg\max_{j} \frac{\Psi_i(\Phi^l(p_t)) \cdot \Psi_j(\Phi^l(p_r))}{\left\| \Psi_i(\Phi^l(p_t)) \right\| \cdot \left\| \Psi_j(\Phi^l(p_r)) \right\|}, \qquad (5)$$
where $p_r$ is the reference photo in $R$. Since the photos and their corresponding sketches are well aligned, we directly take the corresponding sketch feature patches $\Psi_{j^*(i)}(\Phi^l(s_r))$ at the matched indices as the pseudo sketch feature patches. Finally, the MRF feature representation of an input photo is denoted as $\{\Psi_{j^*(i)}(\Phi^l(s_r))\}_{i=1}^{m}$, where $m$ is the number of patches in the $l$-th layer.
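The patch matching of Eq. 5 can be sketched as follows. This is a sketch under assumptions: the function and variable names are illustrative, and the patch size `k` is not taken from the paper.

```python
import torch
import torch.nn.functional as F

def pseudo_sketch_feature(feat_p, feat_pr, feat_sr, k=3):
    """Build the pseudo sketch feature patches of a test photo.

    feat_p  : VGG features of the test photo,       shape (C, H, W)
    feat_pr : VGG features of the reference photo,  shape (C, H, W)
    feat_sr : VGG features of the reference sketch, shape (C, H, W)
    """
    def patches(f):
        # Extract all overlapping k x k neural patches -> (m, C*k*k).
        return F.unfold(f.unsqueeze(0), kernel_size=k).squeeze(0).t()

    P_t, P_r, S_r = patches(feat_p), patches(feat_pr), patches(feat_sr)
    # Normalized cross-correlation = cosine similarity of flattened patches (Eq. 5).
    sim = F.normalize(P_t, dim=1) @ F.normalize(P_r, dim=1).t()  # (m_t, m_r)
    j_star = sim.argmax(dim=1)  # best-matching reference patch per test patch
    # Photos and sketches are aligned, so the matched indices are reused
    # to pick the corresponding sketch feature patches.
    return S_r[j_star]          # (m_t, C*k*k) pseudo sketch feature patches
```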

D. LOSS FUNCTIONS
1) PSEUDO SKETCH FEATURE LOSS
Since we can obtain the MRF VGG-19 pseudo sketch feature representation for any input image, the difference between the MRF features of the generated fake sketch $\hat{s}$ and the reference sketch features [18], [41] is adopted as the main loss of the generator. Let $\Phi^l(\hat{s})$ denote the feature map of the fake sketch $\hat{s}$ extracted from layer $l$ of the VGG-19 network and $\Psi(\Phi^l(\hat{s}))$ denote all the local overlapping neural patches extracted from $\Phi^l(\hat{s})$. The pseudo sketch feature loss [18], [41] between the features of $\hat{s}$ and the pseudo sketch features is defined as:
$$L_{fea} = \sum_{l=3}^{5} \frac{1}{m_l} \sum_{i=1}^{m_l} \left\| \Psi_i(\Phi^l(\hat{s})) - \Psi_{j^*(i)}(\Phi^l(s_r)) \right\|_2^2, \qquad (6)$$
where $m_l$ is the cardinality of $\Psi(\Phi^l(\hat{s}))$. Each neural patch $\Psi_i(\Phi^l(\hat{s}))$ has a size of $c \times k \times k$, where $k$ is the patch size and $c$ is the number of channels. The feature layers $l = 3, 4, 5$ refer to layers pooling_3, pooling_4, and pooling_5 of the VGG-19 network, respectively.
Previous works [18], [41] use feature maps from ReLU layers (e.g., relu3_1, relu4_1) as the output features. Features from low layers may be very similar to the input, which results in information redundancy, while features of deeper layers after relu3_1 usually provide more meaningful information and are more robust to geometric transformations and appearance changes. However, we believe that the features from the ReLU layers are still too redundant for training the generator. Fig. 4 shows the synthesis results of using different features in the pseudo sketch feature loss $L_{fea}$. As we can see, the low-level features fail to generate sketch textures, while deeper-level features (e.g., relu3_1 and pooling_3) can. This is mainly because deeper layers are closer to the output and have more feature dimensions. The middle layers (such as pooling_3, pooling_4, and relu3_1) provide more information and play important roles in generating sketch textures. The performance of the highest layers (such as pooling_5 and relu5_1) decreases due to the reduction of the feature dimension. Since the differences between some images in Fig. 4 may be hard to perceive, the Structural Similarity (SSIM) [43], Feature Similarity (FSIM) [44] and Learned Perceptual Image Patch Similarity (LPIPS) [45] values are provided in Table 1 for reference; a lower LPIPS value and a higher FSIM value indicate better performance. We can see that using features from pooling layers also obtains good performance. In addition, the training time per iteration using different VGG-19 layers is listed in Table 1: training with pooling layers is much faster than training with ReLU layers, without any performance decrease. To trade off speed against performance, we choose pooling_3, pooling_4 and pooling_5 as the feature layers for robust synthesis.
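Under the same assumptions, the loss of Eq. 6 can be sketched as below, consuming the per-layer outputs of `pseudo_sketch_feature` above; the per-layer averaging convention is an assumption.

```python
import torch.nn.functional as F

def pseudo_sketch_feature_loss(fake_feats, pseudo_patches, k=3):
    """Sketch of L_fea (Eq. 6) over the selected VGG-19 pooling layers.

    fake_feats     : dict {l: features of the generated sketch, (C_l, H_l, W_l)}
    pseudo_patches : dict {l: matched sketch feature patches, (m_l, C_l*k*k)}
    """
    loss = 0.0
    for l in (3, 4, 5):  # pooling_3, pooling_4, pooling_5
        fake_p = F.unfold(fake_feats[l].unsqueeze(0), kernel_size=k).squeeze(0).t()
        # Mean squared difference between corresponding neural patches.
        loss = loss + F.mse_loss(fake_p, pseudo_patches[l])
    return loss
```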

2) ADVERSARIAL LOSS
Following CycleGAN [9], we use the least squares GAN loss (LSGAN) [46] as the adversarial loss in our method for easier convergence. The LSGAN loss functions of the generator $G$ and the discriminator $D$ are defined as:
$$L_{adv}^{G} = \mathbb{E}_{p \sim P}\left[ (D(G(p)) - 1)^2 \right], \qquad (7)$$
$$L_{adv}^{D} = \mathbb{E}_{s \sim S}\left[ (D(s) - 1)^2 \right] + \mathbb{E}_{p \sim P}\left[ D(G(p))^2 \right]. \qquad (8)$$
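In code, the LSGAN objectives of Eq. 7-8 reduce to squared errors on the discriminator outputs (a minimal sketch operating on PyTorch tensors):

```python
def lsgan_g_loss(d_fake):
    # Eq. 7: the generator pushes D(G(p)) towards the real label 1.
    return ((d_fake - 1.0) ** 2).mean()

def lsgan_d_loss(d_real, d_fake):
    # Eq. 8: the discriminator pushes real outputs to 1 and fake outputs to 0.
    return ((d_real - 1.0) ** 2).mean() + (d_fake ** 2).mean()
```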

3) TOTAL VARIATION LOSS
Images generated by convolutional neural networks often suffer from noise and unnatural artifacts. To alleviate this problem, we utilize the total variation loss [41], [47] as a smoothness constraint to further improve the quality of the generated images. The total variation loss $L_{tv}$ of a generated fake sketch $\hat{s}$ is defined as:
$$L_{tv} = \sum_{m,n} \left( (\hat{s}_{m,n+1} - \hat{s}_{m,n})^2 + (\hat{s}_{m+1,n} - \hat{s}_{m,n})^2 \right), \qquad (9)$$
where $\hat{s}_{m,n}$ is the pixel value at position $(m, n)$ of the synthesized sketch $\hat{s}$.
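A direct implementation of Eq. 9 follows (a sketch; the sum-versus-mean reduction is an assumption):

```python
def total_variation_loss(s_hat):
    # Eq. 9: squared differences between vertically and horizontally
    # neighbouring pixels of a batch of sketches, shape (B, C, H, W).
    dh = (s_hat[:, :, 1:, :] - s_hat[:, :, :-1, :]) ** 2
    dw = (s_hat[:, :, :, 1:] - s_hat[:, :, :, :-1]) ** 2
    return dh.sum() + dw.sum()
```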

4) THE OVERALL FUNCTION
To train the generator $G$ and the discriminator $D$, their parameters are updated by back-propagation to minimize the overall loss functions. The overall loss function of the generator $G$ integrates all the losses:
$$L_G = \lambda_{fea} L_{fea} + \lambda_{adv} L_{adv}^{G} + \lambda_{tv} L_{tv}, \qquad (10)$$
and the loss function of the discriminator $D$ is defined as:
$$L_D = L_{adv}^{D}, \qquad (11)$$
where $\lambda_{fea}$, $\lambda_{adv}$ and $\lambda_{tv}$ are the trade-off weights balancing the importance of each loss.
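Putting the pieces together, the generator objective of Eq. 10 can be assembled from the loss sketches above (weights as chosen in Section IV; variable names are illustrative). The discriminator objective of Eq. 11 is simply `lsgan_d_loss`.

```python
def generator_loss(loss_fea, d_fake, s_fake,
                   lambda_fea=1e3, lambda_adv=1.0, lambda_tv=1e-5):
    # Eq. 10: weighted sum of feature, adversarial and total variation terms.
    return (lambda_fea * loss_fea
            + lambda_adv * lsgan_g_loss(d_fake)
            + lambda_tv * total_variation_loss(s_fake))
```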

IV. IMPLEMENTATION DETAILS
A. DATASETS
The widely used CUFS database [2] and CUFSF database [48] are used in this paper to evaluate the performance of our proposed method; Fig. 5 shows example photo-sketch pairs. Photo-sketch pairs from the CUFS and CUFSF databases are used to train the models.
B. TRAINING DETAILS
Since the styles of the CUFS and CUFSF databases are quite different, we train separate models for the two databases. When training the model on the CUFS database, only photo-sketch pairs in CUFS are selected as references, and likewise for CUFSF. The parameters of the generator and discriminator are updated alternately in each iteration. We implemented our model in PyTorch on an Nvidia RTX 2080 Ti GPU with 11 GB of memory. The trade-off weights $\lambda_{fea}$, $\lambda_{adv}$ and $\lambda_{tv}$ are set to $10^3$, 1 and $10^{-5}$, respectively. Adam [51] is used for optimization, with the learning rate decaying from $10^{-3}$ to $10^{-5}$. Each iteration costs only 0.9 s with a batch size of 6, and the model converges after about half an hour of training.
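For reference, the optimizer setup implied by these settings might look as follows. This is a sketch under assumptions: `G` and `D` stand for the generator and discriminator modules, and the exponential decay schedule is our assumption, since the text only states that the learning rate moves from $10^{-3}$ to $10^{-5}$.

```python
import torch

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
# Decay 1e-3 -> 1e-5 over the 20 training epochs: gamma = (1e-2) ** (1/20).
sched_G = torch.optim.lr_scheduler.ExponentialLR(opt_G, gamma=0.01 ** (1 / 20))
sched_D = torch.optim.lr_scheduler.ExponentialLR(opt_D, gamma=0.01 ** (1 / 20))
```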

V. EXPERIMENTAL RESULTS
In this section, we evaluate the proposed MAGAN on both the CUFS and CUFSF databases. To demonstrate its effectiveness, eight state-of-the-art methods are selected for qualitative and quantitative comparison: MWF [3], SSD [4], RSLCR [5], FCN [6], Pix2pix [8], CycleGAN [9], DualGAN [10] and FSWild [18]. MWF, SSD, and RSLCR are traditional exemplar-based methods. FCN is an end-to-end deep learning method with six fully-convolutional layers. Pix2pix, CycleGAN, DualGAN, and FSWild are GAN-based methods. For fair comparison, we directly use the results or the training code released by the authors. SSIM, FSIM, and LPIPS are applied for quantitative evaluation.

A. TRADE-OFF PARAMETERS ANALYSIS
In Section IV-B, the trade-off weights $\lambda_{fea}$, $\lambda_{adv}$ and $\lambda_{tv}$ are set to $10^3$, 1 and $10^{-5}$, respectively. In this subsection, we show how these trade-off parameters are determined and analyze their sensitivity. The objective loss function has three trade-off parameters: $\lambda_{fea}$, $\lambda_{adv}$ and $\lambda_{tv}$. We conduct experiments on the CUFS database, measured by FSIM and LPIPS, to determine the parameters and analyze their sensitivity. We fix $\lambda_{adv}$ to 1 in all experiments and adjust the other two parameters, keeping one fixed while varying the other. We report the LPIPS and FSIM values as $\lambda_{fea}$ varies in the range $[500, 1500]$ and $\lambda_{tv}$ varies in the range $[10^{-6}, 10^{-4}]$. The experimental results are illustrated in Fig. 6 and Fig. 7, where subfigures (a) show the average LPIPS scores and subfigures (b) show the average FSIM scores. The parameter $\lambda_{fea}$ balances the importance of the pseudo sketch feature loss; the fact that $\lambda_{fea}$ is much larger than the other weights indicates that the feature loss plays the most important role in sketch synthesis. As can be seen from Fig. 6, LPIPS and FSIM fluctuate only slightly as $\lambda_{fea}$ varies, and all results are superior to FSWild [18]. This indicates that our method does not rely on parameter tuning to achieve outstanding performance. Combining the LPIPS and FSIM results in Fig. 6 (a) and (b), it can be found that $\lambda_{fea} = 10^3$ gives the highest FSIM value and a very low LPIPS value.
The parameter $\lambda_{tv}$ controls the degree of the smoothness constraint in the generated sketches. The results in Fig. 7 indicate that the performance of the proposed algorithm does not depend on the tuning of $\lambda_{tv}$; nevertheless, the best performance is obtained at $\lambda_{tv} = 10^{-5}$. Therefore, in the remaining experiments, we set $\lambda_{adv} = 1$, $\lambda_{fea} = 10^3$ and $\lambda_{tv} = 10^{-5}$ to reflect the different importance of the loss terms, the pseudo sketch feature loss being the most important in our model.

B. ABLATION STUDY
In this subsection, we empirically demonstrate the effectiveness of the designed modules. In the ablation studies, we adopt ResNet as the base architecture and perform experiments on the CUFS database. Our model can be divided into three parts: multi-scale feature capture, the channel attention mechanism, and the generator network. We focus on the effectiveness of the multi-scale feature capture, the attention mechanism, and the number of ResNet blocks. In addition, the depth of the generator network is also analyzed.

1) MULTI-SCALE FEATURE
To explore the effectiveness of the multi-scale features used in the synthesis process, we compare the performance of different convolutions in feature capture on the CUFS database. The experimental results for various kinds of convolutions are shown in Table 2. Dilated convolutions help capture more contextual information and obtain multi-scale information; they achieve performance similar to large convolution kernels without increasing the number of parameters. By adding the dilated convolution to the feature extraction, we obtain lower LPIPS and higher SSIM values. We can also observe that our method with multi-scale convolution features outperforms the single-scale variants, and the best performance is achieved by aggregating all scales of features. Therefore, we adopt the multi-scale features in our model.

2) CHANNEL ATTENTION
We attempt to validate the attention mechanism and explore the significance of different pooling methods for attention inference. Maximum pooling is commonly used in neural networks and has proven very effective. In this set of experiments, two pooling methods are compared: average pooling and maximum pooling. The experimental results with different pooling methods in the channel attention mechanism are shown in Table 3. To demonstrate the effectiveness of the attention mechanism, we also test a model without the attention mechanism as the baseline. We can observe that both maximum and average pooling are beneficial and introduce no additional learnable parameters; however, average pooling behaves better in our face sketch synthesis model. Therefore, we use average pooling in our channel attention.

3) NUMBER OF RESNET BLOCKS
We adopt ResNet blocks in the generator to enhance performance. In this experiment, we study the effect of the number of ResNet blocks in the generator network. Table 4 shows the LPIPS and FSIM scores against the number of ResNet blocks. As the number of ResNet blocks increases, the quantitative results improve, but the number of parameters also grows significantly. The quantitative results peak when the number of blocks reaches 5, so we set the number of ResNet blocks to 5 in our generator network; blindly adding more ResNet blocks does not improve synthesis performance.

4) THE DEPTH OF GENERATOR NETWORK
In our generator network, the multi-scale feature extraction and channel attention mechanism are introduced to enhance the synthesis performance, and they may slightly increase the depth of the generator. In this subsection, we verify that the performance gain comes from the multi-scale feature extraction and channel attention mechanism, rather than from the increase in generator depth.
For multi-scale feature extraction, we adopt convolutions of various scales to extract features and then concatenate them to form a single layer that feeds the next layer. The FSWild method uses only a single scale ($3 \times 3$) of convolutions in the generator. Compared with FSWild, the multi-scale feature extraction therefore does not increase the depth of the generator network, yet using multi-scale features gives better results than using a single scale.
For the channel attention mechanism, we add an SE block to the generator to attend to the different importance of multi-scale features. Fig. 8(a) shows the SE block added in our method; compared with the generator network without channel attention, it has two additional convolutional layers. To prove that the channel attention itself improves synthesis performance, we design a ResNet block that has the same layers as the SE block except for the SE layer, as shown in Fig. 8(b). The structure of the SE layer in the SE block is shown in Fig. 3. Since the SE layer has no learnable parameters, it does not increase the depth of the generator network; thus, the two blocks shown in Fig. 8 have the same depth. Table 5 lists the performance. As can be seen from the table, the proposed attention mechanism improves synthesis performance without increasing the generator depth. Besides, our ablation study in Section V-B ''Number of ResNet Blocks'' also shows that performance does not simply increase with the depth of the generator network (see 6 blocks in Table 4).
In summary, the improvement in performance is not due to the increase in network depth, but the use of multi-scale features and channel attention in our method.

C. QUALITATIVE COMPARISON
Comparisons of the fake sketch images synthesized on the CUFS and CUFSF databases by different methods are presented in Fig. 9-11. Fig. 9 compares our method with non-GAN-based methods. The images from the top row to the bottom row are from the CUHK student database, the AR database, the XM2VTS database, and the CUFSF database. As we can see, the fake sketches synthesized by exemplar-based methods (MWF [3], SSD [4], RSLCR [5]) are over-smoothed and lack sketch textures. Besides, they often miss identity-specific information, such as hair accessories and glasses (see row 1, columns 3-4). The results of RSLCR keep the characteristic features of the input photos, but they are over-smoothed at the boundary between the background and hair regions, and RSLCR cannot handle faces at slight angles well. Although FCN is a deep-learning-based method, it produces too many artifacts in the fake sketches, especially under different lighting conditions, mainly due to the limitation of the training data. Our method generates the sharpest and most natural fake sketches among the non-GAN-based methods.
The comparison of GAN-based methods (Pix2pix [8], CycleGAN [9], DualGAN [10], FSWild [18] and our MAGAN) on the CUFS and CUFSF databases is shown in Fig. 10. Pix2pix can produce more sketch-style textures, making its results similar to real sketches; however, the fake sketch and the real sketch often do not appear to be the same person, and it introduces undesirable artifacts on the face and discontinuous regions, making the fake sketches look dirty. The results of CycleGAN and DualGAN are more like grayscale photos lacking sketch textures, and they also produce much noise and many artifacts in the facial region. When the hair is light-colored, they fail to produce dark sketch lines (see row 5). The lines and textures in the results of FSWild are too light. The sketch images synthesized by our method effectively alleviate these problems. It can be observed that our results have fine textures, especially in the face and hair areas. Our method generates the most natural sketch images with the fewest artifacts.
Some slight blur still exists in our fake sketches, as it does in those of FSWild. This is mainly due to the total variation loss adopted in training: while removing noise and artifacts, it always introduces some slight blur into the resulting images. However, the blurring effect in our results is much slighter than in the data-driven methods [2], [4], [5] (see Fig. 9). When zooming in on the resulting images, the improvement of our fake sketches over other methods becomes more obvious. More detailed results are compared in Fig. 11: our results are better than those of Pix2pix [8], CycleGAN [9] and DualGAN [10] in the details. To further compare the subjective quality of the GAN-based methods, the Mean Opinion Score (MOS) is applied to evaluate the different results. 50 groups of generated sketches are randomly selected from the CUFS database for rating. The MOS results are summarized in Table 6: our results obtain the highest score, meaning that more raters judged our results better than the other methods, while DualGAN and CycleGAN obtain lower scores due to obvious noise, artifacts, and unrealistic textures.
We can also observe that the synthesized sketches on the XM2VTS and CUFSF databases are worse than those on the CUHK student and AR datasets, because these two databases have more shape exaggerations and more age and illumination variations. Nevertheless, our method performs more robustly than the other methods.

Comparison with Style Transfer
In addition to the face sketch synthesis methods mentioned above, some style transfer methods [41], [52] can also synthesize sketches. In this subsection, we compare our method with one style transfer method, CNNMRF [41]. We can see from Fig. 12 that the results of CNNMRF differ greatly from the ground truths, while our results are more similar to the real sketches. CNNMRF needs only one style image and adds target-style textures to the content image, tending to retain the colors of the content image in the synthesized result, whereas photo-sketch synthesis methods require a large database of photo-sketch image pairs. Although style transfer is interesting, its target and results differ from those of photo-sketch synthesis; the two are different applications with different goals.

D. QUANTITATIVE COMPARISON
1) SSIM
SSIM is used to evaluate the quality of synthesized sketches in many works [3]-[5]. Table 7 summarizes the average SSIM scores of the different methods. Our method achieves results comparable to the state-of-the-art method FSWild, though the SSIM values of both are slightly lower than those of RSLCR. However, as can be seen from Fig. 10, the RSLCR results are much more blurred than those of the GAN-based methods despite the highest SSIM values. Some researchers have pointed out that SSIM is not reliable for perceptual image quality evaluation [18], because SSIM prefers slightly blurred images when the images contain rich textures.
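For reproducibility, SSIM can be computed with scikit-image (a usage sketch; the random arrays stand in for loaded sketch images):

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Grayscale uint8 sketches of the same size (placeholders for real images).
fake = np.random.randint(0, 256, (250, 200), dtype=np.uint8)
real = np.random.randint(0, 256, (250, 200), dtype=np.uint8)
score = ssim(fake, real, data_range=255)  # 1.0 means identical images
print(score)
```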

2) FSIM
Due to the drawbacks of SSIM, FSIM is also used to evaluate the quality of the fake sketches. Compared with SSIM, FSIM is better at evaluating detailed image textures. Table 8 summarizes the average FSIM scores of the different methods. Our method achieves the highest FSIM scores among all methods on all databases, which shows that the features of our results are more similar to the real sketches than those of other methods. This is mainly thanks to our multi-scale feature capture.

3) LPIPS
The LPIPS metric is also adopted to quantitatively evaluate the visual quality of the fake sketches. It emphasizes the perceptual similarity between the fake images and the ground truths, employing deep VGG [53] features to compute similarity, and evaluates perceptual quality better than SSIM. The lower the LPIPS value, the better the perceptual quality. Table 9 summarizes the average LPIPS values of the different methods. Our method achieves state-of-the-art LPIPS performance among all compared methods; in particular, its LPIPS score on XM2VTS is 0.0084 lower than that of FSWild. This shows that our synthesized sketches have the best perceptual quality, with textures and shadows similar to the real sketches.
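LPIPS can be computed with the reference implementation released by the authors of [45] (a usage sketch; inputs are RGB tensors scaled to $[-1, 1]$, and the random tensors stand in for real data):

```python
import torch
import lpips

metric = lpips.LPIPS(net='vgg')  # VGG-based variant, matching the text above
fake = torch.rand(1, 3, 250, 200) * 2 - 1
real = torch.rand(1, 3, 250, 200) * 2 - 1
with torch.no_grad():
    d = metric(fake, real)  # lower distance = better perceptual quality
print(d.item())
```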

4) FACE RECOGNITION
Sketch-based face recognition is an important application of face sketch synthesis and is often used in law enforcement. We follow the protocol of RSLCR [5] and employ Null-space Linear Discriminant Analysis (NLDA) [54] for the face recognition experiments. For the CUFS database, 100 synthesized sketches and their corresponding ground truths are randomly selected to train the classifier, and the remaining 233 sketches form the gallery. For the CUFSF database, 300 sketches are selected for training and the remaining 644 form the gallery. To avoid randomness in the results, the face recognition experiments are repeated 30 times with random data splits. Fig. 13 shows the face recognition rate against the number of dimensions on both the CUFS and CUFSF databases, and Table 10 shows the best face recognition rates at certain dimensions (the numbers in parentheses) tested by NLDA in the sketch domain. Our method achieves a recognition rate comparable to the state-of-the-art methods RSLCR [5] and FSWild [18] on the CUFS database (less than 1% lower) and the best performance on the CUFSF database, reaching 96.57% on CUFS and 77.56% on CUFSF. The gap arises because the photos in the CUFS database have less lighting variance, while the CUFSF database has larger deviations between photos and their corresponding sketches; most methods behave well on CUFS but obtain less satisfying results on CUFSF. In addition, our method achieves the best recognition rates at lower dimensions, indicating that it preserves identity information better than the comparison methods, especially at low dimensions. This is mainly thanks to the proposed multi-scale features and channel attention mechanism.

E. COMPUTATIONAL COMPLEXITY ANALYSIS
In this subsection, we analyze the computational complexity of our method on an RTX 2080 Ti GPU with 11 GB of memory.
There are only 2M parameters in our model, and training requires about 5.07 GB of memory. The model needs only 20 epochs to converge, taking about 31 minutes to complete training. Thanks to the high similarity among the sketches of the three datasets in the CUFS database, the model only needs to be trained once. At test time, it takes only about 0.006 s to synthesize a sketch on the GPU.
In Table 11, we compare the number of parameters, memory consumption, training time, and test time with other sketch synthesis methods. The bold entries are the best scores and the red ones the second best. As can be seen from Table 11, FSWild and our method have far fewer parameters than the others. Meanwhile, our method has the shortest sketch generation time among all compared methods, and its training is much faster than that of FSWild, mainly because we adopt pooling features to reduce redundancy. The Pix2pix method also achieves low computational complexity, but the quality of its generated sketches is much lower than ours. Overall, our method achieves the best trade-off between performance and runtime.

VI. CONCLUSION
In this paper, we propose a Multi-scale feature Attention Generative Adversarial Network (MAGAN) for face sketch synthesis. A GAN with skip connections and residual blocks is applied to translate photos to sketches. A patch-wise feature loss computed from the VGG-19 network is used to supervise the generator to produce more realistic and natural sketch images. To accelerate the training process, we adopt the features from the pooling_3, pooling_4 and pooling_5 layers of the VGG-19 network as the MRF feature representation and pre-store the features of the training images. We extract multi-scale features of the input images to refine image details, and a channel attention mechanism is introduced to further emphasize the important features. Experimental results demonstrate the effectiveness of the proposed method: our generated sketches achieve state-of-the-art performance in terms of FSIM and LPIPS, with less noise, fewer artifacts, more details, and more realistic textures. Furthermore, training is very fast, costing only about 31 minutes on an RTX 2080 Ti GPU. In future work, we will focus on synthesizing photos from sketches and extend our method to faces in the wild.