Joint Face Super-Resolution and Deblurring Using Generative Adversarial Network

Facial image super-resolution (SR) is an important aspect of facial analysis, and it can contribute significantly to tasks such as face alignment, face recognition, and image-based 3D reconstruction. Recent convolutional neural network (CNN) based models have exhibited significant advancements by learning mapping relations using pairs of low-resolution (LR) and high-resolution (HR) facial images. However, because these methods are conventionally aimed at increasing the PSNR and SSIM metrics, the reconstructed HR images might be blurry and have an overall unsatisfactory perceptual quality even when state-of-the-art quantitative results are achieved. In this study, we address this limitation by proposing an adversarial framework intended to reconstruct perceptually high-quality HR facial images while simultaneously removing blur. To this end, a simple five-layer CNN is employed to extract feature maps from LR facial images, and this feature information is provided to two-branch encoder-decoder networks that generate HR facial images with and without blur. In addition, local and global discriminators are combined to focus on the reconstruction of HR facial structures. Both qualitative and quantitative results demonstrate the effectiveness of the proposed method for generating photorealistic HR facial images from a variety of LR inputs. Moreover, it was also verified, through a use case scenario, that the proposed method can contribute more to the field of face recognition than existing approaches.


I. INTRODUCTION
The restoration of blurry, low-resolution (LR) facial images, which are frequently observed in surveillance videos and old video footage, is a fundamental problem in computer vision and image processing. Ensuring high performance is difficult when such degradations affect the facial images used for face-related tasks, such as face landmark detection [1], face recognition [2], face parsing [3], and 3D face reconstruction [4], [5]. Therefore, the need to restore high-quality facial images is rapidly increasing.
Convolutional neural networks (CNNs) have recently achieved remarkable performance gains in general single-image super-resolution (SISR) by learning the mapping between LR and HR pairs [6]-[10]. However, because such a learning scheme aims to optimize general metrics such as PSNR and SSIM, the reconstructed images might be visually unrealistic. In particular, face super-resolution (SR), wherein visually pleasing photorealistic results might be more important than conventional quantitative scores, is a more specific and difficult problem than general SISR. Recent methods have employed various facial geometry priors, e.g., facial landmarks, parsing maps, and 3D morphable models, to reconstruct HR facial images [11], [12]. Moreover, additional tasks, such as estimating the face region mask, facial landmark heatmaps, and parsing maps, improve the quality of the reconstructed HR facial images [13]. However, these approaches share the drawbacks of increased computation and dependency on a labeled dataset.

(The associate editor coordinating the review of this manuscript and approving it for publication was Chang-Hwan Son.)
In contrast, generative adversarial networks (GANs) [16] have been widely used for general image synthesis and facial image restoration [17]. A GAN applies minimax optimization to the generator and discriminator [3], [18], thereby achieving a more visually pleasing restoration than that of a conventional algorithm.

(VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
In this paper, we propose a novel adversarial network structure to solve the joint SR and deblurring problem for facial images by simultaneously generating HR facial images with and without blur. We first increase the spatial resolution of the LR input image through a five-layer CNN to form a feature map. This feature map is then mapped to hidden features in the encoders. These features are conveyed to two branch decoders that generate HR facial images with and without blur, respectively. Local and global discriminators are combined to effectively reconstruct the HR facial structures. An example of HR facial image reconstruction is shown in Fig. 1. The main contributions of this paper can be summarized as follows:
• A novel adversarial network consisting of a generator and two discriminators is proposed to simultaneously perform facial image SR and deblurring.
• The proposed network generates more realistic faces than those generated by other state-of-the-art algorithms, with the additional capability of changing various facial details.
• Through use case scenarios, the proposed method is shown to contribute to real-world face-related problems.

II. RELATED WORK
A. FACE SUPER-RESOLUTION
Several face SR algorithms have been investigated for facial image analysis [19]-[23]. Facial prior information, such as face shapes, face parsing maps, and landmark heatmaps, has been used for face SR [24]. Wang and Tang [25] implemented a mapping between LR facial images and HR facial images using an eigentransformation. Kolouri and Rohde [26] trained a nonlinear Lagrangian model for HR facial images to obtain the optimal model parameters for a given LR facial image and reconstruct the HR facial image. Using these techniques to reconstruct LR facial images with a large upscale factor is difficult because the reconstruction quality depends on the landmark estimation results. CNNs have recently been applied successfully to face SR, with prior face knowledge of diverse types used during training. Song et al. [11] proposed a two-stage method that generates facial components using a CNN and then reconstructs an HR facial image through a component-enhancement method. FSRNet [13] performs HR reconstruction of a facial image using a ''coarse-to-fine'' approach. The algorithm is composed of four networks, namely, a coarse SR network, a fine SR encoder, a prior estimation network, and a fine SR decoder. FSRNet uses face landmark heatmaps and parsing maps, estimated by the prior estimation network, as face prior information. The authors also proposed FSRGAN, which incorporates an adversarial loss into FSRNet. Their approach outperforms existing methods by generating face prior information and reconstructing an HR facial image. However, the aforementioned approach has the disadvantage of requiring prior face information labels for training. Moreover, the face in the reconstructed image may not correspond to that of the person in the LR facial image, which limits the applicability of this method in tasks such as face recognition.

B. JOINT SUPER-RESOLUTION AND DEBLURRING
Conventional SR methods take up-sampled blurry images as inputs. The blur in these images is mainly due to up-sampling, which differs from the motion blur caused by the movement of an object or camera. In practice, sudden movements of the face or camera cannot be fully eliminated when taking photographs. Thus, achieving good results in face-related applications from blurred LR facial images is extremely difficult. However, the restoration of blurred LR images can be advantageous in certain applications, such as object detection and face recognition.
The restoration of a blurry LR facial image to an HR facial image generally involves the sequential connection of a blur-removal algorithm and an SR algorithm. However, serializing these existing algorithms is inefficient because of the high computational cost, inaccuracy, and complexity. Moreover, solving blur removal and SR simultaneously is more complicated than restoring an image with a single type of degradation. In studies using optical flow [27], [28], HR images were produced from LR and blurry video sequences. However, these approaches rely on optical flow estimation results, which complicates their application to a single image. Zhang et al. [29] proposed a deep encoder-decoder network (ED-DSRN) designed to perform blur removal and HR image reconstruction simultaneously. However, ED-DSRN has the disadvantage of being trained on LR images degraded using a uniform Gaussian blur.

C. FACE SYNTHESIS USING GAN
In recent years, generative models have exhibited substantial improvements in face synthesis applications such as face frontalization [5], [30], face completion [17], [31], and face SR [13], [32]. A GAN synthesizes images from a noise instance by applying min-max optimization on the generator and discriminator. Furthermore, GANs have been frequently employed to synthesize HR images from LR images. Karras et al. [32], [33] generated HR images using a GAN that was trained in an unconditional manner. This network can synthesize facial images with a high level of detail; nevertheless, it has the disadvantage of requiring substantial computational power. Notably, no previous studies have been conducted on the use of a GAN to jointly address the problems of blur and LR in facial images.

III. PROPOSED APPROACH
A. OVERVIEW OF PROPOSED METHOD
As shown in Fig. 2, the proposed network consists of the following components: a five-layer CNN, a face region prior, a generator G with a modified U-Net [34] structure, and global and local discriminators. The input image (I_LR) is passed through the five-layer CNN to generate a feature map (I_f^LR) with an 8x spatial resolution. The generated feature map is fed into two branches to generate an HR facial image with blur (Î_HRB) and an HR facial image without blur (Î_HR). We use the feature map before the last convolution of the upper decoder to generate Î_HR. Furthermore, the local discriminator D_l and the global discriminator D_g attempt to determine whether the output of G is a real facial image. The proposed network can generate a sharp and realistic facial image by generating HR facial images with and without blur simultaneously. In contrast with a unified structure, the proposed parallel structure allows HR face reconstruction to be achieved in a coarse-to-fine manner, resulting in faster convergence and better results than those of a unified model.

B. GENERATOR MODULE
The generator network G includes a five-layer CNN that extracts feature maps with an 8x spatial resolution instead of performing simple 8x upscaling of the input images. Conventional methods extract features after employing bilinear or bicubic interpolation of the input images; therefore, their features are corrupted by interpolated information. In contrast, the proposed method extracts features directly from the LR input using the five-layer CNN while gradually increasing the spatial resolution to 8x. Because the five-layer CNN is trained jointly with the other parts of the generator, it can produce a feature map that is helpful for generating an HR facial image.
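As a concrete illustration, such a five-layer feature-extraction CNN might be sketched as follows. The paper does not specify kernel sizes, channel widths, or the upscaling mechanism, so the transposed-convolution layout and the 64-channel width below are our assumptions:

```python
import torch
import torch.nn as nn

class FeatureUpscaler(nn.Module):
    """Sketch of the five-layer CNN that maps a 32x32 LR image to an
    8x-resolution feature map I_f^LR. Channel widths, kernel sizes,
    and the use of transposed convolutions are illustrative assumptions."""
    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            # three stride-2 transposed convolutions realize the gradual 8x upscale
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, feat_ch, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)
```

Because the module outputs features rather than an interpolated image, the downstream encoders never see interpolation artifacts, which is the point the paragraph above makes.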
The feature map I_f^LR is fed into two branches. The first four layers of the upper and lower encoders share their parameters, which helps to reconstruct the facial structure. The remainder of the lower decoder focuses on generating facial details using the final feature map of the upper decoder. To avoid information loss, skip connections between the encoders and decoders are used to restore the LR facial image feature maps. Three skip connections between the upper encoder and upper decoder are used to reconstruct Î_HRB; the last feature map for the Î_HRB reconstruction is I_f. Î_HR is reconstructed by using full skip connections between the lower encoder and lower decoder. The end of the lower decoder takes the last feature map I_f of the upper decoder to generate Î_HR. Thus, the generated HR facial image contains sharp facial details.
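The branch wiring described above can be sketched in a heavily simplified form. The depths, channel widths, and exact skip-connection pattern are assumptions (the real network has deeper encoders and three or more skips); the sketch only shows the structural idea that the lower branch consumes the upper decoder's last feature map I_f:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchGenerator(nn.Module):
    """Simplified sketch of the two-branch encoder-decoder: a shared
    encoder, an upper decoder with skip connections producing the blurred
    HR image, and a lower decoder tail that takes the upper decoder's
    last feature map I_f to produce the sharp HR image. All layer sizes
    are illustrative assumptions."""
    def __init__(self, ch=64):
        super().__init__()
        # encoder stages whose parameters are shared by both branches
        self.enc1 = nn.Sequential(nn.Conv2d(ch, ch, 3, 2, 1), nn.ReLU(True))
        self.enc2 = nn.Sequential(nn.Conv2d(ch, ch, 3, 2, 1), nn.ReLU(True))
        # upper decoder with skip connections -> blurred HR branch
        self.up_dec2 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, 1, 1), nn.ReLU(True))
        self.up_dec1 = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, 1, 1), nn.ReLU(True))
        self.to_hrb = nn.Conv2d(ch, 3, 3, 1, 1)
        # lower decoder tail consumes [input features, I_f] -> sharp HR
        self.low_dec = nn.Sequential(nn.Conv2d(ch * 2, ch, 3, 1, 1), nn.ReLU(True))
        self.to_hr = nn.Conv2d(ch, 3, 3, 1, 1)

    def forward(self, feat):
        e1 = self.enc1(feat)                              # 1/2 resolution
        e2 = self.enc2(e1)                                # 1/4 resolution
        u = F.interpolate(e2, scale_factor=2)
        u = self.up_dec2(torch.cat([u, e1], 1))           # skip from enc1
        u = F.interpolate(u, scale_factor=2)
        i_f = self.up_dec1(torch.cat([u, feat], 1))       # last upper feature map I_f
        i_hrb = self.to_hrb(i_f)                          # blurred HR output
        low = self.low_dec(torch.cat([i_f, feat], 1))     # lower branch uses I_f
        i_hr = self.to_hr(low)                            # sharp HR output
        return i_hrb, i_hr
```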

C. DISCRIMINATOR MODULES
The discriminator determines whether the HR facial image reconstructed from the LR facial image by the generator is real or fake and provides feedback for obtaining photorealistic synthesized HR facial images. Following LSGAN [35], the proposed network replaces the binary cross-entropy (BCE) loss with the mean squared error (MSE) loss and eliminates the sigmoid function of the discriminator to prevent the convergence from slowing down.
The discriminator network consists of a local discriminator D_l and a global discriminator D_g. It is more difficult for the global discriminator to determine whether an image is real or fake because it considers the whole image; therefore, we concatenate the input image and the generated image as its input. The local discriminator considers only the local face region of the generated image, so this concatenation is unnecessary for it. For I_LR, an HR facial image Î_HR is synthesized through the generator, and the masked HR facial image Î_m is obtained by multiplication with a face region mask M_face, which is known face prior information. The ground-truth (GT) HR facial image I_HR is also multiplied by M_face to obtain a masked GT HR facial image I_m. Because our goal is to reconstruct the facial structure and the details of the facial components, we need to focus more on the face region. Therefore, the optimization of the local discriminator on the face region is enforced using the face region mask M_face. The local discriminator D_l receives Î_m and I_m as input. The global discriminator D_g concatenates Î_HR and I_HR with I_LR and receives them as input. The global discriminator D_g enforces statistical consistency, while the local discriminator D_l preserves local facial features. Both discriminators have similar network structures consisting of seven convolution layers. A one-dimensional output after the last layer of each discriminator determines whether the input is real or fake. The min-max optimization over the generator and discriminators forces the model to synthesize facial images with improved visual quality.
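The construction of the two discriminators' inputs can be sketched as follows. We assume here that the LR input is first upsampled to the HR resolution before channel-wise concatenation (the paper does not state how the resolution mismatch between I_LR and I_HR is handled), and the tensor names are ours:

```python
import torch

def discriminator_inputs(i_lr_up, i_hr_fake, i_hr_real, m_face):
    """Sketch of how the global and local discriminators are fed.
    i_lr_up: LR input upsampled to HR size (assumption);
    m_face: binary face region mask, broadcast over channels."""
    # global discriminator: channel-wise concatenation with the input image
    d_g_real = torch.cat([i_lr_up, i_hr_real], dim=1)
    d_g_fake = torch.cat([i_lr_up, i_hr_fake], dim=1)
    # local discriminator: pixel-wise multiplication with the face mask
    d_l_real = i_hr_real * m_face   # I_m
    d_l_fake = i_hr_fake * m_face   # Î_m
    return d_g_real, d_g_fake, d_l_real, d_l_fake
```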

D. NETWORK LOSS
The objective function for the adversarial loss used in the proposed network is defined as follows:

L_GAN(G, D) = L_GAN(G, D_g) + L_GAN(G, D_l),

where L_GAN(G, D) denotes the total adversarial loss, which is the sum of the global loss L_GAN(G, D_g) and the local loss L_GAN(G, D_l), defined as follows:

L_GAN(G, D_g) = E[(D_g(I_LR, I_HR) − 1)^2] + E[D_g(I_LR, Î_HR)^2],
L_GAN(G, D_l) = E[(D_l(I_m) − 1)^2] + E[D_l(Î_m)^2].

Because the proposed framework is based on a GAN, it is expected to provide results that might deviate slightly from the GT. With a pixel loss, all pixel values of the reconstructed image are compared with those of the GT image to increase their similarity. The L1 distance is defined for the reconstructed blurred HR facial image Î_HRB and the blurred HR GT image I_HRB; the reconstructed HR facial image Î_HR and the HR GT facial image I_HR; and the masked HR facial image Î_m and the masked HR GT facial image I_m, which are summed to define the pixel loss as follows:

L_pixel = ||Î_HRB − I_HRB||_1 + ||Î_HR − I_HR||_1 + ||Î_m − I_m||_1.

In addition to the adversarial and pixel losses, a perceptual loss is added to obtain a photorealistic facial image. The perceptual loss is computed using the weights of the pretrained VGG-19 [36] and is defined as follows:

L_perceptual = Σ_i ||f_i(Î_HR) − f_i(I_HR)||,

where f_i represents the i-th feature extracted from the VGG-19 network. In our implementation, the perceptual losses on the Pool1, Pool2, Pool3, Pool4, and Pool5 layers of the pretrained VGG-19 network are utilized. The overall loss function for the training of the proposed network consists of the adversarial, pixel, and perceptual losses:

L_total = L_GAN(G, D) + λ_1 L_pixel + λ_2 L_perceptual.

The weights λ_1 and λ_2 are used to balance the contribution of each loss.
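The loss terms described above can be sketched as follows. The helper names (`lsgan_d_loss`, `pixel_loss`, etc.) are ours, the LSGAN target values follow the usual 1/0 convention, and VGG-19 feature extraction is deliberately omitted (the functions operate on pre-extracted feature lists):

```python
import torch
import torch.nn.functional as F

def lsgan_d_loss(d_real, d_fake):
    """LSGAN discriminator loss: push real scores toward 1, fake toward 0."""
    return (F.mse_loss(d_real, torch.ones_like(d_real))
            + F.mse_loss(d_fake, torch.zeros_like(d_fake)))

def lsgan_g_loss(d_fake):
    """LSGAN generator loss: push fake scores toward 1."""
    return F.mse_loss(d_fake, torch.ones_like(d_fake))

def pixel_loss(hrb_fake, hrb_real, hr_fake, hr_real, m_fake, m_real):
    """Sum of the three L1 terms: blurred HR, sharp HR, and masked HR pairs."""
    return (F.l1_loss(hrb_fake, hrb_real)
            + F.l1_loss(hr_fake, hr_real)
            + F.l1_loss(m_fake, m_real))

def perceptual_loss(feats_fake, feats_real):
    """Distance between pre-extracted VGG-19 pool features (the distance
    metric and any per-layer weighting are assumptions)."""
    return sum(F.mse_loss(a, b) for a, b in zip(feats_fake, feats_real))

def total_g_loss(adv, pix, perc, lam1=100.0, lam2=10.0):
    # overall objective with the lambda values reported in Section IV
    return adv + lam1 * pix + lam2 * perc
```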

IV. EXPERIMENTAL RESULTS
The proposed model is trained using the ADAM optimizer [37] with a learning rate of α = 0.0002, β_1 = 0.5, β_2 = 0.999, and a batch size of 64. The hyper-parameters were found empirically through exhaustive search. Instead of training all networks simultaneously, each network is added gradually: the generator and global discriminator are trained first for 200 epochs, after which the local discriminator is added. The values of λ_1 and λ_2 are set to 100 and 10, respectively, during training. Training the proposed network on the CelebA-HQ dataset takes approximately 5 days on a single Titan X GPU. For testing, only the generator is run, and inference takes less than 1 s per image.
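The optimizer settings above can be captured in a small helper. The function name `make_optimizers` and the use of one Adam instance per network are our assumptions; only the learning rate and beta values come from the text:

```python
import torch

def make_optimizers(generator, d_global, d_local):
    """Adam settings from the paper: lr = 2e-4, betas = (0.5, 0.999).
    Training is staged: the generator and global discriminator are
    optimized for the first 200 epochs, then the local discriminator's
    optimizer joins the loop."""
    cfg = dict(lr=2e-4, betas=(0.5, 0.999))
    return (torch.optim.Adam(generator.parameters(), **cfg),
            torch.optim.Adam(d_global.parameters(), **cfg),
            torch.optim.Adam(d_local.parameters(), **cfg))
```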

A. DATASETS
The CelebFaces Attributes High-Quality (CelebA-HQ) [32], [38] dataset is used for the training and testing of the proposed network. The HR facial images and the facial component masks are consistently resized to 256 × 256. Blurred LR facial images are synthesized from the CelebA-HQ dataset. Most of the facial images in the dataset, wherein faces are located in the foreground, were captured in real life. Thus, the foreground mask of the HR facial image is generated using the CelebA-HQ mask dataset, and pixel-wise multiplication with the HR facial image is conducted to obtain the facial region. The blur on the facial region is synthesized using the algorithm developed by Gong et al. [39]. Consequently, our dataset contains realistic blur patterns. The spatial resolution of the blurred HR facial image is then reduced to 1/8. The proposed network processes a 32 × 32 input LR image and outputs a 256 × 256 HR image. Examples from the generated dataset are shown in Fig. 3. A facial region mask corresponding to each facial image is used as prior information for the local discriminator. Note that the face region mask is used only for training and is not needed for testing. Although the proposed network does not utilize landmarks or segmentation masks, its results are better than those of state-of-the-art methods that employ them as priors. A total of 28,800 images are selected for training, and the remaining 1,200 images are used for testing. Horizontal flipping is used for data augmentation to avoid overfitting. In addition to using a synthesized dataset, the proposed model is tested on real blurred LR facial images (e.g., images from selected YouTube videos).
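The degradation pipeline described above can be sketched as follows. The realistic blur synthesis of Gong et al. [39] is replaced here by a naive box blur purely for illustration, and the 1/8 downscaling is implemented as 8×8 average pooling (the paper does not state its resampling filter), so both are assumptions:

```python
import numpy as np

def box_blur(img, k=3):
    """Naive box blur standing in for the realistic motion-blur
    synthesis of Gong et al. [39] (illustrative substitute)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def downscale_8x(img):
    """Reduce spatial resolution to 1/8 via 8x8 average pooling."""
    h, w = img.shape
    return img.reshape(h // 8, 8, w // 8, 8).mean(axis=(1, 3))

def synthesize_pair(hr, face_mask):
    """256x256 HR image + face mask -> blurred HR image and 32x32
    blurred LR network input (single-channel sketch)."""
    hr_blurred = box_blur(hr * face_mask)   # blur only the masked face region
    lr = downscale_8x(hr_blurred)
    return hr_blurred, lr
```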

B. QUANTITATIVE EVALUATION
To measure the quantitative performance of the proposed network, we consider the conventional metrics, i.e., PSNR and SSIM. In addition, the Fréchet inception distance (FID) [40] is employed to approximate the difference in the feature space. A low FID score indicates that the generated images are statistically similar to the real images.
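For reference, PSNR (the distortion metric used above) reduces to a log-scaled mean squared error against the ground truth; a minimal implementation, assuming 8-bit images with a peak value of 255:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstruction. Identical images yield infinity."""
    mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)
```

SSIM and FID involve windowed statistics and Inception features, respectively, and are normally taken from library implementations rather than re-derived.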
The performance of the proposed network and a comparison with existing methods are shown in Table 1. GAN-based methods usually suffer a drop in PSNR and SSIM; the lowest PSNR/SSIM in HR image synthesis is typically obtained when applying FSRGAN. Note that the proposed method demonstrates better performance than the existing approaches under all metric evaluations. Interestingly, both the lowest and the highest results are obtained from GAN-based methods, which highlights the advantage and the effective structure of our adversarial learning process.

C. QUALITATIVE EVALUATION
In our approach, a GAN is used to reconstruct the HR facial image by simultaneously generating HR facial images with and without blur. Fig. 4 shows the HR facial image reconstruction results on synthetic and real blurred LR facial images, respectively. Note that the images in the test dataset are different from those in the training dataset. As shown in Fig. 4, the test images exhibit various illumination conditions, head poses, and expressions. The proposed method is also tested on a real blurred LR image captured from YouTube, as shown in Fig. 5. The results show that the proposed method can generate HR facial images without blur, in addition to generating photorealistic and clear facial images, even when the input facial image has undergone significant degradation.

D. COMPARISON WITH STATE-OF-THE-ART METHODS
Because no previous studies have had the same goal as that of the present approach, we chose state-of-the-art algorithms for face SR (FSRNet/FSRGAN [13] and DICGAN [22]), general image SR (SRResNet/SRGAN [9], ESRGAN [41], RCAN [42], and SRFBN [43]), and GAN-based image synthesis (Pix2Pix [34]) for comparison. For a fair comparison, we use the author-released codes of the above models and train them with an upscale factor of 8 using the same training set as that used for the present model. Qualitative comparisons with the other methods are shown in Fig. 6, where the input image is either 16×16 or 32×32 but the upscale factor is fixed to 8×. Note that the network must be retrained when the input resolution is switched. Although the upscale factor of our network (8×) is the same as that used in most previous studies, we can generate higher-quality 256 × 256 facial images than those obtained using the other approaches. Previous studies tended to synthesize excessively smooth faces. SRResNet generates results with sharp edges, but it cannot reproduce details of the eyes or mouth, which are important parts of the face. SRGAN generates facial details better than SRResNet does, but it still produces unrealistic images. ESRGAN renders facial details such as the eyes and nose better than SRGAN does; however, it likewise cannot produce realistic images. By contrast, FSRNet provides realistic facial details, but its results are blurry. The HR reconstruction results of FSRGAN have clear facial details; nevertheless, visual artifacts such as inconsistent colors and thick edges are observed owing to the overemphasized facial details. Although FSRNet/FSRGAN [13] use a facial geometry prior, such as a facial landmark heatmap and parsing map, when reconstructing the HR facial images, the results are still blurry. Furthermore, none of these approaches can generate sufficient detail in the hair.
By contrast, even if a blurred LR facial image has a specular region or eyeglasses, the proposed method can still be used to synthesize realistic and sharp HR facial images.

E. APPLICATIONS
As a use case scenario, we demonstrate how the proposed method can contribute to addressing face-related problems in the real world.

1) FACE ALIGNMENT AND FACE PARSING
Fig. 7 shows a qualitative comparison of face alignment [44] and face parsing after applying the baseline and proposed methods. The proposed method detects the facial boundary and subregions more stably. This demonstrates that the proposed method can effectively recover facial details that are not only photorealistic but also sufficiently accurate for various facial applications.

2) SIMULTANEOUS SR AND COLORIZATION
To show that our network is able to perform colorization in addition to SR, we train the network by converting the input image into a grayscale image. Fig. 8 shows that the proposed network performs well even for grayscale, blurry LR images. Although the make-up and background colors might vary, high-resolution colorized images are restored well enough that images of the same person can be identified.

3) FACE RECOGNITION
In Table 2, we show the impact of the compared algorithms on the face recognition task [45]. To this end, we perform face detection and face alignment using MTCNN [46] as a preprocessing step. We then extract the embedding vector using Inception-ResNet-v1 [47] pretrained on VGGFace2 [48]. After measuring the L1 distance between embedding features, we determine whether they correspond to the same person using a specific threshold. We use 1,000 facial images from the CelebA dataset. The face recognition accuracy is measured with the rank-1 retrieval rate. As the results indicate, the proposed method achieves the best performance.
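The matching step described above can be sketched as follows, operating on pre-extracted embedding vectors. The helper names and the example threshold are our assumptions; the paper specifies only "a specific threshold" on the L1 distance:

```python
import numpy as np

def same_person(emb_a, emb_b, threshold):
    """Verification: two embeddings match if their L1 distance is
    within the (assumed) threshold."""
    return float(np.abs(emb_a - emb_b).sum()) <= threshold

def rank1_accuracy(probe_embs, gallery_embs, probe_ids, gallery_ids):
    """Rank-1 retrieval rate: the L1-nearest gallery embedding must
    share the probe's identity."""
    correct = 0
    for emb, pid in zip(probe_embs, probe_ids):
        dists = np.abs(gallery_embs - emb).sum(axis=1)
        if gallery_ids[int(np.argmin(dists))] == pid:
            correct += 1
    return correct / len(probe_ids)
```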

4) FACE DETAILS VARIATION
In a recent work on GANs, Karras et al. [33] elucidated the effect of applying stochastic variation to different subsets of layers: noise affects only the stochastic aspects of an image while leaving the overall composition and identity unchanged. In the current study, we also considered the effect of adding stochastic variation. The variant denoted (w/ N) is defined by adding random Gaussian noise after every convolution layer in the proposed network. Although (w/ N) yields slightly lower quantitative scores, it generates diverse HR facial images from a single LR facial image. Fig. 9 shows that diverse HR facial images are generated using blurred LR facial images as input; the stochastic aspects vary while the overall composition and identity remain unchanged. Occlusions caused by hair, shadows, and eyeglasses can still be handled despite the additional Gaussian noise after every convolution, and the proposed network still produces photorealistic HR facial images.
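The noise injection behind the (w/ N) variant can be sketched as a convolution wrapper. The noise scale and placement details are assumptions; the text says only that Gaussian noise is added after every convolution layer:

```python
import torch
import torch.nn as nn

class NoisyConv(nn.Module):
    """Convolution followed by additive Gaussian noise, sketching the
    (w/ N) variant. The noise standard deviation is an assumed value."""
    def __init__(self, cin, cout, sigma=0.1):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.sigma = sigma

    def forward(self, x):
        y = self.conv(x)
        # fresh noise on every call yields diverse outputs for one input
        return y + self.sigma * torch.randn_like(y)
```

Because the noise is resampled on each forward pass, running the generator twice on the same blurred LR image produces two plausibly different HR faces, matching the diversity effect reported for Fig. 9.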

F. ABLATION STUDY
To validate the effect of the proposed network components, we conducted experiments by training the network in three different ways, namely, as models 1, 2, and 3. As mentioned earlier, our network includes a five-layer CNN, a U-Net structure modified to have two branches, and global and local discriminators. Models 1, 2, and 3 are defined by excluding the five-layer CNN, both discriminators, and the local discriminator from the proposed network structure, respectively. The models are trained using the same training set with an upscale factor of 8×.
All models were evaluated, and their performances are listed in Table 3. The five-layer CNN evidently generates feature maps useful for HR facial reconstruction from an LR blurred facial image, because its reconstruction results are superior to those obtained with bicubic interpolation. In addition, combining the global and local discriminators yields better reconstruction results than using only a global discriminator or no discriminator at all.

V. LIMITATIONS
The proposed method can generate HR facial images from extremely blurry LR images. However, if the face in an LR blurred image is too small or deviates too far from a frontal pose, generating an HR facial image is difficult. Moreover, the blur on the facial region is assumed to be uniform, implying that the face is not extremely close to the camera, which is the common case in the real world. If the face instead exhibits strong motion blur in a particular direction, the proposed deblurring will not be successful.

VI. CONCLUSION
In this study, we presented an adversarial network to reconstruct an HR facial image by simultaneously generating such an image with and without blur. The experimental results demonstrated that the proposed approach quantitatively and qualitatively outperforms previous state-of-the-art approaches. Moreover, we showed that our method is applicable to a variety of face-related applications. Furthermore, the proposed approach can be used to generate diverse HR facial images from blurry LR facial images by adding Gaussian random noise after every convolution layer. We believe that the proposed algorithm for HR facial image reconstruction from a blurry LR image can be successfully used in various face-related applications.