Progressive Semantic Face Deblurring

Previous face deblurring methods have utilized semantic segmentation maps as prior knowledge. Most of these methods generate a segmentation map from the blurred facial image and then restore the image using this map in a sequential manner. However, the accuracy of the segmentation affects the restoration performance, and it is generally difficult to obtain an accurate segmentation map from a blurred image. Instead of such sequential methods, we propose an efficient method that learns the flow of facial component restoration without performing segmentation. To this end, we propose a multi-semantic progressive learning (MSPL) framework that progressively restores the entire face image, starting from facial components such as the skin, followed by the hair and the inner parts (eyes, nose, and mouth). Furthermore, we propose a discriminator that observes the reconstruction flow of the generator. In addition, we present new test datasets to facilitate the comparison of face deblurring methods. Various experiments demonstrate that the proposed MSPL framework achieves higher face deblurring performance than existing methods, both qualitatively and quantitatively. Our code, trained model, and data are available at https://github.com/dolphin0104/MSPL-GAN.


I. INTRODUCTION
Facial image deblurring aims to restore a sharp face image from a blurred one. Although human faces are highly variable, they have hierarchical structures comprising components such as skin, hair, eyes, nose, and mouth. These facial components are the crucial elements that characterize a specific face; each has inherent shapes and textures. Thus, most face deblurring methods utilize prior knowledge of face images (i.e., reference faces [1], [2], 3D faces [3], face landmarks [4], [5], 2D face sketches [6], and semantic segmentation maps [7], [8]) to estimate a unique solution.
Recently, deep learning-based methods [7], [8] have achieved state-of-the-art performance in facial image deblurring by utilizing the semantic segmentation map as prior knowledge. These methods consist of a two-step process that generates the segmentation map from the blurred image and then restores the facial image using this segmentation map in a sequential manner. In these methods, the semantic segmentation map is employed to localize the position of each facial component and the boundaries between them for the deblurring process.
However, these methods have a limitation: owing to their sequential nature, the accuracy of the estimated segmentation map directly affects the restoration performance. In general, it is nontrivial to obtain an accurate segmentation map from a blurred image. Inaccurate segmentation often leads to inaccurate localization of facial components and consequently produces deblurred results with distorted shapes and/or blurred textures [3], [9].
Specifically, the small components of the face (i.e., the eyebrows, eyes, nose, and lips), which characterize the face, are affected by inaccurate segmentation far more than the large components such as hair and skin. In general, the facial components differ in size, with large deviations: for instance, the eyes occupy very few pixels compared to the skin or hair regions. Moreover, these small components are more likely than the large components to lose information to noise and blur artifacts. Owing to this degradation, it is difficult to obtain accurate semantic segmentation maps from blurred images, especially for the small components. Thus, the small components require more attention than the large components if their exact shapes and textures are to be restored. This problem, called class imbalance by Yasarla et al. [8], is one of the major challenges in the previous methods [7], [8].
As investigated in [3], [8], the method of Shen et al. [7] often fails to restore the small components of the face when the generated segmentation results are inaccurate. To address this problem, Yasarla et al. [8] proposed measuring a confidence score for the facial components from the generated segmentation. If the estimated segmentation maps have low confidence, their model reduces the impact of the segmentation maps in the deblurring process. This can effectively reduce the effects of inaccurate segmentation maps. However, their solution is suboptimal, because they do not specify how to utilize the semantic prior when the segmentation map is inaccurate due to severe blur.
To deal with this problem, we propose a multi-semantic progressive learning (MSPL) framework based on the generative adversarial network (GAN) [10]. Our method leverages the semantic prior information of the face without performing segmentation, thereby preventing the side effects of an inaccurate segmentation map. Furthermore, inspired by the success of progressive learning techniques [11], [12], our method progressively restores the face in four steps. Conventional progressive learning [11], [12] does not consider the semantic context of the target object in an image; therefore, we modified the conventional coarse-to-fine approach to better capture the underlying semantic structure of the target object.
Specifically, the proposed generator network has a cascaded architecture with sub-networks that restore the entire face progressively and incrementally, starting with the simpler facial components. Instead of learning all the components of the face image simultaneously, our network is trained to focus on the low-frequency components first and then incrementally restore the smaller, high-frequency components. During training, each sub-network focuses on restoring both the shape and the texture of its assigned class-specific facial components. This is achieved by minimizing the difference between the sharp and output facial components using masks obtained from precise ground-truth segmentation maps. In addition, the architecture of our generator mitigates the class-imbalance problem: the generator consists of multiple sub-networks, each trained to focus on restoring its assigned key facial component. This simple design allows the proposed method to handle the class-imbalance problem of face deblurring more effectively. Fig. 1 clearly shows the effects of the proposed framework: our MSPL framework restores a sharper face with finely detailed facial components compared to the previous methods [7], [8].
To generate more photo-realistic facial images, we propose a multi-semantic discriminator in our GAN framework. It is designed to handle all the intermediate outputs of the generator with a single classifier network, allowing gradients at all semantic components of the discriminator to flow to the generator. Through this, our discriminator oversees the entire reconstruction flow of the facial components.
To the best of our knowledge, only a few public test datasets are available for facial image deblurring. Shen et al. [7] provided a pioneering test dataset for evaluation. However, most of the provided images are of low quality, with unknown blocking artifacts. In addition, all of the test faces are well aligned and centered on the same facial key points, whereas in real-world scenarios faces are captured in various shapes and poses. Thus, these images are not appropriate for fully evaluating the various methods over a range of cases; in view of this, we provide new test datasets that are suitable for a more practical evaluation of face deblurring.
The contributions of our work are summarized as follows:
• We propose an MSPL network, which progressively learns the semantics of a human face for deblurring facial images degraded by complex motion blur. To the best of our knowledge, this is the first time that the idea of a semantic coarse-to-fine approach has been introduced in face deblurring.
• We propose the multi-semantic discriminator, which can observe all the outputs of the generator. With this, our generator reconstructs more photo-realistic facial components in addition to the entire face.
• To conduct a more accurate and practical evaluation of face deblurring, we present new test datasets with extensive, high-quality images. The experimental results show that the proposed model significantly outperforms the previous methods.
The remainder of this paper is organized as follows. Section II provides an overview of previous works on image deblurring. In Section III, we present the details of our proposed framework. Section IV provides the quantitative and qualitative results of the proposed method compared to existing methods. Finally, Section V presents the conclusions and discusses future work.

II. RELATED WORK
Image deblurring has been studied for a long time in the field of image processing and computer vision. In this section, we briefly review the image deblurring methods and recent deep learning-based progressive learning approaches.
A. SINGLE IMAGE DEBLURRING
After the advent of deep learning [21], various convolutional neural network (CNN) models have been proposed to estimate complex blur kernels [22], [23]. Sun et al. [22] proposed predicting the probabilistic distribution of motion blur at the patch level. Chakrabarti et al. [23] predicted the complex Fourier coefficients of a deconvolution filter and applied them to an input patch. These methods combine CNNs with maximum a posteriori (MAP)-based algorithms. On the other hand, several CNN models directly restore the sharp image from the blurred image in an end-to-end manner [24]–[29]. Nah et al. [24] proposed a multi-scale CNN model: they first extended the traditional coarse-to-fine pipeline to CNN-based deblurring and achieved impressive results. Tao et al. [25] investigated a multi-scale strategy with a recurrent neural network (RNN)-based multi-scale architecture. To restore realistic images, Kupyn et al. [30] introduced a GAN-based deblurring model that exploited the Wasserstein GAN with a gradient penalty and a perceptual loss.

B. FACE IMAGE DEBLURRING
While the aforementioned methods perform well for natural image deblurring, they often do not perform satisfactorily on domain-specific images such as face images. Therefore, several studies have proposed estimating various types of prior facial knowledge, such as face alignment [4], face sketches [6], reference faces [2], [31], 3D face models [3], and face segmentation maps [7], [8]. The reference prior-based methods [2], [31] extract information useful for restoring the face image from a sharp face similar to the degraded one. However, these methods require redundant collection and matching computations to utilize the exemplar face images. Ren et al. [3] proposed a video deblurring method for faces by generating 3D facial priors; they trained a 3D face reconstruction network to estimate more textured facial priors. Despite satisfactory results on video deblurring, their model cannot operate on a single image. More recently, Shen et al. [7] and Yasarla et al. [8] proposed using the semantic prior of the face for the deblurring process and achieved state-of-the-art results. These methods follow a two-step process: they first generate semantic labels from the blurred face and then use them as strong prior knowledge for the deblurring process. However, extracting a segmentation map from a blurred face is difficult, and erroneous prior information directly degrades the quality of the reconstructed face image. To reduce the side effects of inaccurate segmentation maps, Yasarla et al. [8] proposed measuring the confidence score of an estimated semantic map. However, they do not specify how to utilize the semantic prior when the segmentation map is inaccurate due to severe blur. Unlike previous works [7], [8], the proposed method exploits only the ground-truth segmentation maps for training, instead of generating them from blurred images.
Using this procedure, our method can be trained with accurate segmentation maps regardless of the degree of blur, preventing the side effects of inaccurate segmentation maps. In addition, the architecture of the proposed generator restores the small components of the face more effectively.

C. PROGRESSIVE LEARNING
Progressive learning is a training strategy that involves starting with an easy task and gradually refining the details. Most existing methods that use progressive learning are based on a multi-scale (coarse-to-fine) approach. Multi-scale frameworks have made significant progress in estimating complex motion blur kernels in single image deblurring [13], [24], [25], [32], [33]. In addition to the single image deblurring field, the multi-scale approach is widely used in other image processing fields such as depth map estimation [34]- [36], and video frame prediction [37]. In recent years, progressive learning [11], [12], [20], [38]- [42] has been actively applied to CNN-based image synthesis. In particular, Karras et al. [11] proposed a progressive growing technique that progressively increases the depth of the layer as well as the resolution of the generated image. Karnewar et al. [12] proposed the multi-scale gradient generative adversarial network (MSG-GAN) by allowing the flow of gradients from the discriminator to the generator at multi-scales. Meanwhile, Yang et al. [43] proposed a method to generate the background and foreground recursively and separately. In contrast to the conventional methods, we suggest progressive learning techniques according to the semantic information of the face.

III. PROPOSED METHOD
As illustrated in Fig. 2, our MSPL framework is composed of a generator (G) and a discriminator (D). G generates a sharp face image I_deblur from a blurred face image I_blur. Our proposed G incrementally generates the facial components step by step in the order of the skin, hair, inner parts (eyes, nose, and mouth), and then the entire face. Meanwhile, the proposed discriminator D oversees all the face image components generated by G. A detailed explanation of each network is provided in the following subsections.

A. SEMANTIC PROGRESSIVE GENERATOR
Following Yasarla et al. [8], we divide the ground-truth segmentation labels into three classes: the skin (M_1), the hair (M_2), and the inner parts of the face, i.e., the eyes, nose, and mouth (M_3). Here, M_i is a binary mask image with 1 for the assigned region and 0 for the other regions. Note that M_4 represents the entire area of the image, including the face and background. From Fig. 2, it can be observed that G consists of multiple functional blocks h, g_i, and t, where 1 ≤ i ≤ 4. First, h is an initial 1 × 1 convolution layer that converts an input RGB image I_blur to a feature map F_init for the following layer g_1; thus, h can be defined as h : I_blur → F_init. Second, g_i is the function of our i-th sub-network, defined as g_i : F_{i−1} → F_i, where F_0 = F_init. Finally, t is a 1 × 1 convolution layer that converts the output feature map F_i generated by g_i to an output RGB image O_i, as t : F_i → O_i. Thus, the entire network G can be defined as a sequential composition of all sub-networks, G = t ∘ g_4 ∘ g_3 ∘ g_2 ∘ g_1 ∘ h, where the intermediate outputs O_i = t(F_i) are also produced at every stage. In our framework, g_i is the key module that focuses on restoring each semantic structure of the face. Each g_i shares the same network architecture; however, their roles differ, as each g_i renders its facial component using the feature maps generated by g_{i−1}. For this, we design each g_i as a fully convolutional U-shaped network consisting of residual blocks [44]. As investigated in [45], [46], we remove the normalization layers from the standard residual blocks, because normalization layers deprive the network of flexibility in low-level tasks. To extract more focused features, we apply a channel-attention mechanism [47], [48] to our residual blocks. The entire architecture of g_i is shown in Table 1. In Table 1, each row of the ''Kernel'' column specifies the kernel size, the number of filters, and the stride. For example, ''3 × 3, 64, s1'' represents 64 filters of size 3 × 3 with stride 1.
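The cascaded design above can be sketched in PyTorch. The block below is a minimal illustration under our own simplifications, not the paper's exact architecture: each g_i is reduced to a single normalization-free residual block with squeeze-and-excitation-style channel attention (the real g_i is a U-shaped sub-network per Table 1), and the names `CABlock` and `MSPLGenerator` are hypothetical.

```python
import torch
import torch.nn as nn

class CABlock(nn.Module):
    """Normalization-free residual block with channel attention,
    standing in for the full U-shaped sub-network g_i of Table 1."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))
        # squeeze-and-excitation style channel attention
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch, 1), nn.Sigmoid())

    def forward(self, x):
        r = self.body(x)
        return x + r * self.att(r)

class MSPLGenerator(nn.Module):
    """h -> g_1 .. g_4, with t applied after every stage to emit O_i."""
    def __init__(self, ch=64, stages=4):
        super().__init__()
        self.h = nn.Conv2d(3, ch, 1)   # h : I_blur -> F_init
        self.g = nn.ModuleList(CABlock(ch) for _ in range(stages))
        self.t = nn.Conv2d(ch, 3, 1)   # t : F_i -> O_i

    def forward(self, x):
        f, outs = self.h(x), []
        for g_i in self.g:             # g_i : F_{i-1} -> F_i
            f = g_i(f)
            outs.append(self.t(f))     # O_1 .. O_4
        return outs

# all four intermediate RGB outputs keep the input resolution
outs = MSPLGenerator()(torch.randn(1, 3, 128, 128))
```

Sharing a single 1 × 1 layer `t` across stages keeps every intermediate image in the same RGB space, which is what lets the discriminator consume all of them.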
Our goal is to train each g_i to reconstruct its assigned facial component perfectly. For this, we define the facial component loss L_c^i as the L_1 distance between the facial components of the ground-truth (GT) image and those generated by g_i:

L_c^i = ‖(O_i ⊙ M_i) − (I_GT ⊙ M_i)‖_1,

where ⊙ represents the Hadamard product. To refine the entire face more naturally and restore the background of the face, we compare the last output image O_4 with the entire target image I_GT (i.e., M_4 is the all-ones mask). This allows all sub-networks to share a common objective and provides stability during training.

TABLE 1. Architecture of the proposed sub-network (g_i). F_{i−1} is the input feature of the i-th sub-network g_i. ''W'' and ''H'' represent the width and height of the feature. All ''downconv'' layers are convolutional layers with stride 2 for downsampling, ''upconv'' layers are transposed convolutions for upsampling, and ''+'' represents a channel-wise sum operation.
The total facial component loss of our generator, L_G, can be formulated as

L_G = Σ_{i=1}^{4} L_c^i.    (5)

The proposed objective function in Eq. (5) allows each g_i to focus on its specified facial component using only ground-truth segmentation maps. Thus, our G is able to reconstruct more precise shapes and finer details of the target face without suffering from the side effects of an inaccurate segmentation map.
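The masked component objective translates directly to code. This is a small sketch, assuming images and masks are tensors in [0, 1] and using a mean-reduced L_1 (the paper does not state the reduction); `component_loss` is a hypothetical helper name, and the all-ones masks are dummies for illustration.

```python
import torch

def component_loss(outputs, gt, masks):
    """L_G = sum_i L1( O_i * M_i, I_GT * M_i ); masks[3] is all ones,
    so the last term covers the whole image including the background."""
    total = torch.zeros(())
    for o_i, m_i in zip(outputs, masks):
        total = total + torch.mean(torch.abs(o_i * m_i - gt * m_i))
    return total

gt = torch.rand(1, 3, 8, 8)
masks = [torch.ones(1, 1, 8, 8)] * 4   # dummy masks for illustration
# identical outputs give zero loss; a constant offset gives a positive one
loss = component_loss([gt.clone()] * 4, gt, masks)
```

Because each term is gated by its own mask, the gradient reaching g_i comes only from the region that sub-network is responsible for.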

B. MULTI-SEMANTIC DISCRIMINATOR
We also propose a multi-semantic discriminator D in our method. Inspired by the MSG-GAN [12], our D handles multiple outputs of G, which allows the restoration of more realistic facial components at all intermediate layers.
As shown in Fig. 2, multiple intermediate images are fed to our single D. Thus, the single network D is a function of multiple input images and predicts a final probability p ∈ [0, 1]:

p = D(x_1, x_2, x_3, x_4),

where x_j is the j-th input RGB image of D. Let d_j be the j-th intermediate layer of D, and let A_j be the output feature map of d_j. Then, d_j can be defined as d_j : (A_{j−1}, x_j) → A_j. Each d_j consists of a 3 × 3 convolutional layer c and a single concatenation operation. Then, A_j is formulated as

A_j = c(A_{j−1} ⊕ x_j),

where ⊕ represents a channel-wise concatenation operation. In this way, D predicts the probability of the multiple input images being real or fake. Table 2 shows the detailed architecture of our discriminator. We applied spectral normalization [49] to all the convolutional layers to stabilize the training of the discriminator. Following Goodfellow et al. [10], we optimized G and D in an alternating manner to solve the following adversarial min-max function V(G, D):

min_G max_D V(G, D) = E_y[log D(y)] + E_{I_blur}[log(1 − D(G(I_blur)))].

Here, y is the set of images corresponding to the ground-truth components of faces, y = {y_i | y_i = (I_GT ⊙ M_i), 1 ≤ i ≤ 3, y_4 = I_GT}. Then, the adversarial loss L_adv is defined as

L_adv = − log(D(G(I_blur))).    (10)
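The concatenation-based discriminator can be sketched as follows. This is a minimal illustration under our own simplifications: one spectrally normalized 3 × 3 convolution per stage, all inputs at the same resolution, and a global-pool head producing p; the layer layout of the actual Table 2 architecture will differ, and the class name is hypothetical.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class MultiSemanticDiscriminator(nn.Module):
    """Single classifier fed all four generator outputs: each stage
    concatenates the next image x_j onto the running features A_{j-1}
    (channel-wise) and applies a spectrally normalized 3x3 conv c."""
    def __init__(self, ch=32, stages=4):
        super().__init__()
        self.first = spectral_norm(nn.Conv2d(3, ch, 3, padding=1))
        self.c = nn.ModuleList(
            spectral_norm(nn.Conv2d(ch + 3, ch, 3, padding=1))
            for _ in range(stages - 1))
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(ch, 1), nn.Sigmoid())

    def forward(self, xs):                 # xs = [x_1, ..., x_4]
        a = torch.relu(self.first(xs[0]))
        for c_j, x_j in zip(self.c, xs[1:]):
            a = torch.relu(c_j(torch.cat([a, x_j], dim=1)))  # A_j
        return self.head(a)                # p in [0, 1]

p = MultiSemanticDiscriminator()([torch.randn(1, 3, 64, 64) for _ in range(4)])
```

Because every x_j enters the same computation graph, the single real/fake decision back-propagates gradients to all intermediate outputs of the generator at once.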
During training the generator, the error is backwardpropagated to the intermediate layers of G from the intermediate layers of the D simultaneously. This provides stability in training, because the sub-networks of the generator can share the same goal. Meanwhile, the discriminator observes not only the final output of G, but also all the intermediate outputs of G.
Recently, the perceptual loss [50] has been widely adopted for better visual quality. To take advantage of this, we employ a VGG-Face loss L_vgg, defined as

L_vgg = ‖φ(I_GT) − φ(I_deblur)‖_2,    (11)

where φ(·) represents the feature extracted from the Pool5 layer of the VGG-Face network [51]. The total loss of our MSPL framework, L_total, is the combination of all the loss functions discussed so far (L_G in Eq. (5), L_adv in Eq. (10), and L_vgg in Eq. (11)):

L_total = L_G + λ_1 L_adv + λ_2 L_vgg,    (12)

where λ_1 and λ_2 are weights used to balance the different loss terms. We empirically set λ_1 = 0.05 and λ_2 = 0.05.
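The weighted combination in Eq. (12) is a one-liner; a tiny sketch with the paper's λ_1 = λ_2 = 0.05 defaults (`total_loss` is a hypothetical helper, and the scalar tensors stand in for the actual loss terms):

```python
import torch

def total_loss(l_g, l_adv, l_vgg, lam1=0.05, lam2=0.05):
    """L_total = L_G + lambda_1 * L_adv + lambda_2 * L_vgg, Eq. (12)."""
    return l_g + lam1 * l_adv + lam2 * l_vgg

# e.g. L_G = 1.0, L_adv = 2.0, L_vgg = 4.0 -> 1.0 + 0.1 + 0.2 = 1.3
lt = total_loss(torch.tensor(1.0), torch.tensor(2.0), torch.tensor(4.0))
```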

IV. EXPERIMENTS

A. DATASETS
1) TRAINING DATA
For training, we used the CelebAMask-HQ dataset [57], which provides 30,000 high-quality (1024 × 1024 resolution) face images. Each image has 19 classes of segmentation labels, such as skin, nose, eyes, eyebrows, ears, mouth, lips, hair, hat, eyeglasses, earring, necklace, neck, and cloth. We regrouped these labels into three classes, namely the skin, hair, and inner parts, as in [8]. Following [58], we synthesized 18,000 motion blur kernels using the method of [23]. As in [7], [58], the size of the generated motion blur kernels ranges from 13 × 13 to 27 × 27. After applying the blur kernel to the image, we added Gaussian noise with σ = 0.03. The generated images were then split into two subsets: the training images (24,183 images) and the validation images (5,817 images).
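The kernel-plus-noise pipeline can be illustrated with a toy example. This sketch uses a simple horizontal linear-motion kernel as a stand-in for the learned kernels of [23] (which we do not reproduce here), zero padding at the borders, and the paper's σ = 0.03 Gaussian noise; `synth_blur` is a hypothetical helper.

```python
import numpy as np

def synth_blur(img, ksize=13, sigma=0.03, seed=0):
    """Blur a grayscale image in [0, 1] with a horizontal motion kernel
    (stand-in for the kernels of [23]) and add Gaussian noise."""
    rng = np.random.default_rng(seed)
    k1d = np.ones(ksize) / ksize
    # convolve each row; zero padding at the borders for simplicity
    blurred = np.apply_along_axis(
        lambda row: np.convolve(row, k1d, mode='same'), 1, img)
    noisy = blurred + rng.normal(0.0, sigma, img.shape)
    return np.clip(noisy, 0.0, 1.0)

# a flat gray image stays roughly flat: blur preserves the mean away
# from the borders, and the added noise is zero-mean
out = synth_blur(np.full((32, 32), 0.5))
```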

2) TEST DATA
For face deblurring, Shen et al. [7] provided a pioneering testset for evaluating motion-blurred faces. However, many GT images in the testset are of low quality, with unknown block artifacts (see Fig. 3(a)). Because the purpose of image deblurring is to restore a sharp, high-quality image, these low-quality GT images are not suitable for evaluating performance. In addition, the facial images in their testset are well aligned on the same facial key points [59]. However, blurred face images in the real world are not aligned, because faces are usually captured under a wide range of conditions. Their testset therefore does not consider the practical situations in which blurred face images occur. Hence, we generated two types of test datasets, called MSPL-Center and MSPL-Random. We collected 240 sharp face images from three different datasets (i.e., 80 images each from CelebA [60], CelebAMask-HQ (CelebA-HQ) [57], and Flickr-Faces-HQ thumbnails (FFHQ) [42]). Note that each dataset is aligned with different facial key points.
Subsequently, we synthesized 240 random motion blur kernels using the method presented in [23]. Following the protocol of Shen et al. [7], the size of the blur kernels ranges from 13 × 13 to 27 × 27. As shown in Fig. 3(b), MSPL-Center contains high-quality, differently aligned face images. Meanwhile, MSPL-Random comprises randomly augmented versions of the MSPL-Center images (samples are shown in Fig. 3(c)). To be specific, we applied random crops, random rotations, and random horizontal flips to the MSPL-Center images and then convolved them with the random blur kernels synthesized using [23].

3) IMAGE QUALITY COMPARISON
Following [61]–[63], we used four no-reference image quality assessment (NR-IQA) metrics (i.e., NIQE [52], BRISQUE [53], NRQM [54], and PIQE [55]) to compare the quality of the images in our testset (MSPL-Center) and the testset provided by Shen et al. [7]. The NIQE and BRISQUE metrics measure image naturalness (or its lack thereof) based on their own natural scene statistics (NSS) models [52], [53]. NRQM provides quality scores for the images based on features extracted from a trained CNN model [54]. PIQE is a perception-based image quality evaluation method that estimates the amount of distortion present in a given image [55]. We also employ the perception index (PI) metric [56], which is formulated as the adjusted mean of NIQE and NRQM. The Shen et al. [7] and MSPL-Center testsets were compared because they both consist of centered facial images. As mentioned above, the face images in MSPL-Random are randomly transformed versions of those in MSPL-Center; thus, we do not compare the MSPL-Random dataset with the Shen et al. [7] testset. All the image assessment results are listed in Table 3. When comparing [7] and MSPL-Center, the NIQE values are comparable; however, the MSPL-Center dataset achieved better BRISQUE, NRQM, and PIQE scores than the Shen et al. [7] testset. For a fair comparison, we also evaluated the image quality of the Shen et al. [7] and MSPL-Center subsets that were both synthesized from CelebA [60]. The results in Table 4 show that the test images in Shen et al. [7] are clearly degraded compared to those in MSPL-Center, even when considering only images selected from the same face dataset [60]. These assessment metrics quantify the noise, artifacts, sharpness, and overall quality of an image. Therefore, the comparative results indicate that the Shen et al. [7] testset consists of low-quality GT images and that the proposed MSPL testsets are more suitable for evaluating face deblurring performance.

B. TRAINING DETAILS
To implement our models, we used PyTorch [64]. The generator and discriminator were trained using the Adam optimizer [65] with β_1 = 0.9 and β_2 = 0.999. The learning rate was initialized to 1 × 10^−5 and decayed exponentially by a factor of 0.99 every epoch. For training, we first resized the 1024 × 1024 images to 512 × 512 using bilinear downsampling. Then, we randomly cropped the images to 448 × 448 and resized them to 128 × 128. We augmented the resized images with random horizontal flips and random rotations in the range [0°, 90°]. We set the batch size to 16 and trained the model on a single NVIDIA TITAN RTX GPU.
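The optimizer and schedule described above map directly onto the PyTorch API; a minimal sketch, with a single convolution standing in for the generator and the training step omitted:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for the generator
opt = torch.optim.Adam(model.parameters(), lr=1e-5, betas=(0.9, 0.999))
# multiply the learning rate by 0.99 after every epoch
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)

for epoch in range(3):
    # ... one epoch of forward/backward/update steps would run here ...
    opt.step()
    sched.step()

# after 3 epochs: lr = 1e-5 * 0.99 ** 3
lr_now = opt.param_groups[0]['lr']
```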

C. EVALUATION METRICS
To evaluate the performance of the various methods, we used PSNR and SSIM [66], which are widely used in image restoration. The feature distance d_VGG of the VGG-Face network [67] was measured to compare the similarity in facial identity between the GT images and the deblurred face images. Following [68], we computed the L_2 distance between the output features of the Pool5 layer of the VGG-Face network [67]. Following the NTIRE 2020 challenge [63], we also employed the LPIPS [69] distance, computed as the L_2 distance between output features of a CNN trained to reflect human visual perception.
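PSNR and the feature-distance idea behind d_VGG are easy to state in code. A small sketch: `psnr` is the standard definition for images in [0, max_val], and `feature_distance` is a hypothetical helper applied here to toy vectors (a real evaluation would use Pool5 features of VGG-Face).

```python
import torch

def psnr(x, y, max_val=1.0):
    """PSNR in dB between two images with values in [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def feature_distance(feat_gt, feat_out):
    """L2 distance between deep feature vectors (the d_VGG idea; a real
    evaluation would use Pool5 features of VGG-Face instead)."""
    return torch.norm(feat_gt - feat_out, p=2)

# a uniform error of 0.1 gives MSE = 0.01, i.e. 20 dB
p = psnr(torch.zeros(1, 3, 8, 8), torch.full((1, 3, 8, 8), 0.1))
fd = feature_distance(torch.zeros(4), torch.ones(4))
```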

D. ABLATION STUDY
1) EFFECT OF THE SEMANTIC PROGRESSIVE GENERATOR
To investigate the impact of the reconstruction procedure on the face components, we gradually modified the baseline model and compared the differences. Table 5 shows the performance of all the models trained with different reconstruction procedures but identical training settings and data. The M_i specified in Table 5 represents the semantic mask used to train the corresponding intermediate output of the MSPL model. MSPL_a is a model trained to restore a blurry input image in the order of the entire image, skin, hair, and inner components. Comparing the results of MSPL_a with the other models, it can be observed that restoring the entire face in the last module is crucial to the accuracy of the restoration process. MSPL_b is a model trained in the reverse order of the proposed method, except for the entire-face component: it restores the small, high-frequency components (inner components) first and the large, low-frequency components (hair, skin) later. The results of MSPL_b also show that the wrong face reconstruction order degrades the deblurring performance. MSPL w/o GAN is a model trained to restore the facial components with the proposed procedure using Eq. (5). The results in Table 5 demonstrate that the architecture trained with the proposed order improves the restoration performance compared to the other orders.
In addition, we can confirm that the PSNR values of all the facial components gradually increase. These results indicate that the proposed method performs deblurring incrementally: every sub-network enhances the quality of the image relative to the output of the previous stage. Furthermore, we can observe that each sub-network gradually improves not only its class-specific component but also the entire face, because all the sub-networks share the goal of restoring the whole face. This lends stability to our progressive training method.
In Fig. 4, we can qualitatively observe the deblurring procedure in our proposed method that progressively restores the face image in the order of skin, hair, inner parts (eyes, eyebrows, nose, and mouth) and the entire face. From a blurred input face, the first sub-network generates the first output image O 1 . At this stage, we can see that the overall shape of the facial skin is restored excluding the other facial components (see Fig. 4(b)). In Fig. 4(c), we confirm that the shape and texture of the hair in O 2 are restored from O 1 . However, some blurred artifacts remain in the facial inner parts and background. In the third stage, the O 3 (Fig. 4(d)) shows that the inner parts of the face are significantly restored compared to O 2 from the previous stage. The final output image O 4 is shown in Fig. 4(e). The final result demonstrates that the final sub-network recovers the entire face and background. In particular, the facial components of O 4 are more natural compared to those of O 3 .

2) EFFECT OF THE MULTI-SEMANTIC DISCRIMINATOR
In our MSPL framework, the multi-semantic discriminator is utilized to recover the faces that are more photo-realistic.
To study its effects, we additionally trained our generator with the proposed discriminator using the total loss L_total in Eq. (12); we denote this model as MSPL_GAN. The results listed in Table 5 indicate that MSPL_GAN achieves slightly poorer results than MSPL w/o GAN. However, the visual results presented in Fig. 5 show that our discriminator helps reconstruct more realistic faces. In Fig. 5, we compare the results of MSPL w/o GAN and MSPL_GAN (the odd and even rows in Fig. 5, respectively) given the same blurred input image. It can be observed that MSPL_GAN restores more realistic facial components than MSPL w/o GAN, especially the nose, mouth, eyes, and hair texture. For example, the output images in the second row contain a more realistic nose and mouth than those in the first row. When comparing the images in the third and fourth rows, we can see a clear effect of the GAN. This demonstrates that our multi-semantic discriminator allows the generator to restore more realistic facial components. In addition, we confirm that our discriminator affects not only the final output image O_4 but also all the intermediate outputs O_1 to O_3. In our experiments, the PSNR/SSIM of MSPL_GAN was slightly lower than that of MSPL w/o GAN; however, MSPL_GAN restores more visually plausible outputs with more natural inner components.

E. COMPARISONS WITH EXISTING METHODS
We compared the performance of our MSPL framework with recent methods based on CNN models [7], [8], [58], [68]. All the experiments were conducted using the official codes provided by the authors [7], [8], [58], [68]. For Xia and Chakrabarti [58], we used the model trained in a supervised manner, as this model has been reported as the best model in their studies.

1) CLASS IMBALANCE PROBLEM
As mentioned earlier, the class imbalance problem is an important and challenging issue for the existing face deblurring methods [7], [8]. To compare the capability of restoring the small and thin components of the face (such as the eyes, lips, and eyebrows), we compared the PSNR value of each individual class in the face using a ground-truth segmentation map, following the experiment presented by Yasarla et al. [8].
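The per-class protocol above amounts to computing PSNR only over the pixels of one segmentation class. A minimal sketch, assuming a binary GT mask that broadcasts over the channel dimension; `class_psnr` is a hypothetical helper.

```python
import torch

def class_psnr(output, gt, mask, max_val=1.0):
    """PSNR restricted to one semantic class: the squared error is
    averaged only over pixels where the GT segmentation mask is 1."""
    err = ((output - gt) ** 2) * mask          # mask broadcasts over channels
    n = mask.expand_as(output).sum().clamp(min=1)
    mse = err.sum() / n
    return 10.0 * torch.log10(max_val ** 2 / mse)

gt = torch.zeros(1, 3, 8, 8)
out = torch.full((1, 3, 8, 8), 0.1)
mask = torch.zeros(1, 1, 8, 8)
mask[..., :4, :4] = 1.0                        # e.g. an "eyes" region
# uniform error of 0.1 inside the mask -> masked MSE = 0.01 -> 20 dB
p = class_psnr(out, gt, mask)
```

Restricting the average this way prevents large, easy regions (skin, hair) from dominating the score of small classes such as the eyes.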
As shown in Table 6, our model significantly outperforms the previous state-of-the-art methods in restoring the individual classes of the face, especially the inner parts, which contain small and important facial features. These results show that our model effectively restores the facial image by reducing the class-imbalance problem compared to the previous methods.

2) COMPARISONS USING SHEN et al. [7] TESTSET
We conducted experiments on the testset provided by Shen et al. [7]. In Table 7, it can be noted that Yasarla et al. [8] and Xia and Chakrabarti [58] achieved the best PSNR and SSIM. However, as can be seen in Fig. 6, the results obtained by the previous methods [7], [8], [58] are overly smoothed; such over-smoothing tends to yield better PSNR and SSIM scores than our model obtains. As mentioned before, the low quality of the images in the Shen et al. [7] testset is noteworthy, and the results in Fig. 6 show this problem clearly. First, we can observe that quite a few GT images have severe blocking artifacts (for example, see the last column in Fig. 6). Second, our model restores sharp images that are even better than the GT images: compared with the GT images, our results have sharper boundaries at the borders of the facial components, without blocking artifacts. These observations indicate that the existing Shen et al. [7] testset is limited in providing an accurate deblurring evaluation.
On the other hand, the proposed MSPL w/o GAN and MSPL-GAN achieved the best d_VGG and LPIPS scores by a large margin, as listed in Table 7. The d_VGG results indicate that our restored faces are close to the GT images in terms of face identity, which suggests that our model is better suited for higher-level vision tasks such as face recognition. LPIPS is a metric that correlates better with human perceptual judgments [63]; the LPIPS results show that the faces restored by our model are more visually plausible in terms of human vision.
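The identity-preservation measure d_VGG compares deep features of the restored and GT faces. The exact feature network and distance definition follow the paper; as a hedged illustration only, a cosine distance between two precomputed face embeddings (one common choice for identity comparison) can be written as:

```python
import numpy as np

def identity_distance(feat_a, feat_b):
    """Cosine distance between two face feature vectors.

    feat_a, feat_b : 1-D embeddings extracted from a face network
    (e.g., a VGG-style face model); smaller values suggest the two
    images are more likely to depict the same identity.
    """
    a = feat_a / np.linalg.norm(feat_a)
    b = feat_b / np.linalg.norm(feat_b)
    return 1.0 - float(np.dot(a, b))
```

A distance near 0 means the restored face preserves the identity encoded by the GT embedding, which is what a low d_VGG score reflects in Table 7.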

3) COMPARISONS USING MSPL TESTSET
In extended experiments on the MSPL-Center and MSPL-Random testsets, we observed that our proposed method achieved the best performance both quantitatively and qualitatively. The quantitative results are listed in Table 8.
The PSNR, SSIM, d_VGG, and LPIPS values indicate that our framework significantly outperformed the existing methods. The visual results in Fig. 7 and Fig. 8 further demonstrate that the proposed method restored sharper and more detailed face images than the previous methods. In our experiments, we observed that the performance of Shen et al. [7] was sensitive to alignment and rotation: when a given blurred face image was aligned or rotated differently from the training face images, its restoration performance was severely degraded (refer to the second-column images of Fig. 7). The results of Yasarla et al. [8] were visually plausible for all the test images; however, the restored small facial components (i.e., eyes, nose, mouth, and teeth) still lacked details and textures when the blur artifacts in the input image were severe. Meanwhile, the proposed framework achieved superior performance compared to the previous methods.

4) REAL-WORLD BLURRED FACIAL IMAGES
We conducted experiments on the twenty facial images distorted by real-world blur provided by [7], [70]. In the real world, images are easily degraded by unknown and complex factors in the camera pipeline, such as motion blur, lens distortion, sensor saturation, nonlinear transform functions, noise, and compression [70]. However, not all of these factors are considered when generating synthetically blurred images. Therefore, this experiment allows us to assess the practical performance of face deblurring methods, which cannot be evaluated using synthetically blurred testsets alone. The comparative results on sample images with real-world blur are shown in Fig. 9. As can be seen, the proposed method produces the sharpest and most natural face images among the compared face deblurring methods. These results indicate that our method is capable of reconstructing high-quality images from facial images with real-world blur.

5) INFERENCE TIME AND MODEL PARAMETERS
As shown in Table 9, we measured the inference time and the number of model parameters of the existing methods. Compared with the recent state-of-the-art model of Yasarla et al. [8], our model has slightly more parameters; however, it is two times faster. This shows that the proposed method is more efficient than the two-stage deblurring method [8], which consists of separate segmentation and deblurring processes.
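A timing comparison of this kind can be carried out with a simple harness like the sketch below. Here `model_fn` stands in for any deblurring model's forward pass, and the warm-up and run counts are hypothetical; wall-clock time is averaged over repeated runs after a few warm-up passes, as is standard practice when benchmarking inference.

```python
import time

def measure_inference_time(model_fn, inputs, warmup=3, runs=10):
    """Average wall-clock time per forward pass.

    model_fn : callable taking one input (a model's forward pass)
    inputs   : list of test images (or tensors)
    """
    for x in inputs[:warmup]:
        model_fn(x)  # warm-up calls, excluded from timing
    start = time.perf_counter()
    n = 0
    for _ in range(runs):
        for x in inputs:
            model_fn(x)
            n += 1
    return (time.perf_counter() - start) / n
```

For GPU models, device synchronization would additionally be needed before reading the clock so that queued kernels are not missed.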

V. CONCLUSION
In this study, we proposed a multi-semantic progressive learning framework for facial image deblurring. Our framework employs an effective GAN-based architecture to restore the semantic structures of the face progressively without performing semantic segmentation. To enable more practical and accurate evaluation of face deblurring methods, we also provided new testsets. Overall, the proposed method outperforms the existing methods both qualitatively and quantitatively. To the best of our knowledge, this is the first study on facial image deblurring that restores facial semantics in a progressive manner. We believe that our framework provides a promising approach for numerous other facial image restoration tasks.