GFNet: A Gradient Information Compensation-Based Face Super-Resolution Network

Face super-resolution (FSR) is defined as the generation of high-resolution face images from low-resolution face images. Existing FSR approaches usually improve performance by combining deep learning with additional tasks such as face parsing and landmark prediction. However, the additional data require manual labeling, and facial landmark heatmaps and parsing maps cannot represent the intrinsic geometric structure of facial components. In this paper, we introduce an FSR network based on gradient information compensation named GFNet, which consists of feature residual blocks (FRBs) and gradient extraction blocks (GEBs). Specifically, the GEB constructs pixel-level gradient maps directly from the feature maps without requiring data labels and extracts gradient features to compensate for the missing high-frequency components in the face features; the FRB extracts the face features in the network. Furthermore, we introduce a feature fusion mechanism between the GEB and the FRB, which fuses the face features with the gradient features. We evaluate the performance of the proposed network on two public datasets: the CelebA-HQ dataset and the Helen dataset. Experimental results show that the proposed method reconstructs fine face images, outperforming state-of-the-art methods such as SRResNet, FSRNet, and MSFSR.


I. INTRODUCTION
Face super-resolution (FSR), also known as face hallucination, generates high-resolution (HR) face images from low-resolution (LR) face images. With the development of video technology, face information detection has become a recent research hotspot [1]-[3]. However, due to practical factors such as imaging devices and shooting distances, captured face images are often low-resolution, which may hinder face information detection.
Unlike single image super-resolution (SISR) tasks [4]-[8], the difficulty of FSR tasks lies in the recovery of facial components, e.g., the eyes and nose. These facial components exhibit large pixel variations in face images, and it is therefore difficult for deep neural networks trained with a mean squared error loss to recover them. In addition, FSR tasks generally require reconstructing face images at high upscale factors (e.g., 8×), while SISR tasks generally focus on reconstructing images at low upscale factors (e.g., 2×, 4×). The FSR task is therefore more complex, since the space of possible LR-to-HR mappings is larger.
(The associate editor coordinating the review of this manuscript and approving it for publication was Wei-Yen Hsu.)
Some methods [9]-[13] assist FSR networks in face image recovery by adding additional tasks such as face parsing and landmark prediction. In [9], the authors used a prior estimation network to predict landmark heatmaps and parsing maps. Although such networks extract more facial structure information by introducing these additional tasks, this approach has the following drawbacks: 1) it requires manual labeling of data for the additional tasks; 2) estimating face priors from low-resolution inputs is itself a difficult task; and 3) facial landmark heatmaps and parsing maps cannot represent the intrinsic geometric structure of facial components, such as the nose bridge.
In this paper, we propose an FSR network based on gradient information compensation. First, to reconstruct fine facial details, we propose a gradient extraction block (GEB), which constructs pixel-level gradient maps directly from face features. Second, we employ a feature residual block (FRB) to extract face features in the network; by introducing a channel attention mechanism, the FRB can focus on important features. Finally, we introduce a feature fusion mechanism between the GEB and the FRB, which allows the network to utilize features at different levels. The primary contributions of this paper are summarized as follows:
1) We propose GFNet, a face super-resolution framework capable of processing very low-resolution face images (e.g., 16 × 16). Without requiring additional training tasks, GFNet is able to generate fine-grained facial components.
2) We develop a GEB and an FRB to effectively extract features. The GEB directly extracts gradient features from the LR input to compensate for the missing high-frequency components of the face features, which removes the need for the additional face priors required by most FSR methods. The feature fusion mechanism applied between the GEB and the FRB enables the FRB to attach channel attention to the fusion of face feature maps and gradient feature maps, and enables the GEB to extract gradient information guided by the face features.
3) Experimental results on the CelebA-HQ and Helen datasets show that the proposed method outperforms state-of-the-art algorithms both qualitatively and quantitatively.

II. RELATED WORK
A. FACE SUPER-RESOLUTION
FSR was first proposed by Baker and Kanade [14]. Following this work, various FSR methods have been proposed. Traditional FSR methods can be roughly classified into two categories: global model-based methods [15]-[20] and part-based methods [21]-[24]. Global model-based methods reconstruct super-resolution images by learning an overall appearance mapping, such as principal component analysis; they require the LR input to be precisely aligned with the HR image. Part-based methods extract facial components, which requires detecting facial landmarks in the LR image. The performance of both categories degrades significantly in FSR tasks with a high upscale factor. Recently, convolutional neural networks have been successfully applied to face super-resolution. Zhu et al. [25] proposed a Cascade Bi-Network (CBN) to recover facial details by estimating the dense correspondence field; due to the difficulty of accurate dense correspondence field estimation at low resolution, CBN performs poorly on faces in varying poses. Huang et al. [26] used wavelet coefficients predicted from LR images to reconstruct HR images. Zhang et al. [27] used enhanced facial boundaries and coarse-to-fine supervision to recover fine faces. Compared to landmark points, enhanced facial boundaries can represent richer contours of facial components; however, the representation of the intrinsic geometric structure is yet to be explored.
Some works utilize additional face priors to train the network. Chen et al. [9] fused face priors with face features and trained the network in an end-to-end manner. Ma et al. [28] introduced a loop iteration for landmark estimation with face recovery in a recurrent iterative network. Dogan et al. [13] proposed GWAInet, which takes the LR image and a high-resolution guide face as inputs, converting the pose of the guide face to match that of the LR image. Although introducing additional face priors helps to improve performance, there are two drawbacks: 1) additional manually labeled data are required, and 2) estimating face priors from LR images is itself a difficult task.
Unlike the above methods, the proposed GFNet uses gradient maps with pixel-level accuracy to compensate for the missing high-frequency components in face features, which means that we can obtain more semantic information even in very low-resolution images. The gradient extraction block constructs the gradient image directly from the face features, so no additional supervised labeling is required.

B. GRADIENT CORRELATION METHOD
In previous SISR studies, researchers have used gradient information to reshape the edges of images [29], [30]. They used a cascade framework consisting of two identical networks to process gradient images and LR images separately; however, this single cascade structure cannot cope with the complexity of FSR tasks. Yang et al. [31] divided the face image into three parts: facial components, contour, and smoothed region. They matched the gradient maps of the three parts with the reconstructed HR image to maintain the image structure, where the gradient map of the facial components is determined by a set of landmark points and two label sets; they later extended their algorithm to compressed face images [32]. Pei et al. [33] proposed an iterative gradient-constrained weighted sparse representation method that exploits the gradient information of the images during patch representation: they combined the l_1 reweighted constraint with the gradient information of the images and fused both into the sparse representation to improve performance. Although these methods improve performance, several shortcomings remain: 1) deep learning-based FSR usually requires a large amount of data, and dividing the face images into different regions demands even more supervised labels; 2) when processing very low-resolution face images, errors in landmark detection can introduce artifacts into the reconstructed images, whereas our gradient extraction block extracts gradient information without landmark detection; 3) to generate a gradient map for the whole face, the network must take a more comprehensive view of the facial structure.

III. PROPOSED METHOD

A. OVERVIEW
Fig. 1 shows the structure of the proposed network, which consists of three main parts: the downscale, feature extraction, and upscale parts. Each part consists of several FRBs and GEBs, and we introduce a feature fusion mechanism between the two blocks.
To enable the network to focus on features at different scales, we designed the downscale and upscale parts. In the downscale part, we modify the structure of the latter FRBs and GEBs to achieve downsampling (see Sections III-B and III-C for details); in the upscale part, we modify the structure of the former FRBs and GEBs to achieve upsampling. Let I_LR, I_SR, and I_HR denote the low-resolution image, the super-resolved image, and the ground-truth HR image, respectively.
First, we upsample I_LR to the same size as I_HR:

I_UP = F_bic(I_LR),  (1)

where F_bic(*) denotes the bicubic interpolation function and I_UP denotes the upsampled image. Then, I_UP is fed into the GFNet:

I_SR = F_GFNet(I_UP),  (2)

where F_GFNet(*) denotes the function of our GFNet.
Given a training set I containing N samples, we optimize GFNet by minimizing the pixel-level l_2 loss:

L(Θ) = (1/N) Σ_{i=1}^{N} || F_GFNet(I_UP^(i)) − I_HR^(i) ||_2^2,  (3)

where Θ denotes the parameter set of GFNet.
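The pipeline of Eqs. (1)-(3) can be sketched in PyTorch as follows. This is a minimal illustration, not the full GFNet: `gfnet` here is a placeholder single convolution standing in for F_GFNet, whose actual architecture is described in the following subsections.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

gfnet = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for F_GFNet (assumption)

def l2_loss(i_lr: torch.Tensor, i_hr: torch.Tensor) -> torch.Tensor:
    # Eq. (1): bicubic upsampling of the LR input to the HR size.
    i_up = F.interpolate(i_lr, size=i_hr.shape[-2:], mode="bicubic",
                         align_corners=False)
    # Eq. (2): the network maps the upsampled image to the SR image.
    i_sr = gfnet(i_up)
    # Eq. (3): pixel-level l2 (MSE) loss against the ground truth.
    return F.mse_loss(i_sr, i_hr)

loss = l2_loss(torch.randn(2, 3, 16, 16), torch.randn(2, 3, 128, 128))
```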

B. FEATURE RESIDUAL BLOCK
The GFNet needs to process both face features and gradient features simultaneously. We therefore add the residual channel attention block (RCAB) [34] to the FRB, which makes the network focus its channel attention on fusing face feature maps and gradient feature maps. As shown in Fig. 1, the FRB consists of two 3 × 3 convolutional layers and an RCAB, where each convolutional layer is followed by batch normalization (BN) and a rectified linear unit (ReLU) activation function. Fig. 2(a) shows the structure of the RCAB, in which channel attention is integrated into a residual block. The residual component is obtained by two 3 × 3 convolutional layers. In the channel attention branch, the weights of the feature map channels are calculated by two 1 × 1 convolutional layers, which rescale the number of channels by a ratio r and adjust the weights automatically; in this paper, we set r to 16. Since residual blocks have achieved great success in SR tasks [35] and FSR tasks [6], [36], we add a residual connection to the FRB and determine the number of channels by a 1 × 1 convolutional layer. Assuming that the input of the i-th FRB is x_{i−1} with 128 channels, the i-th FRB can be expressed as

f_i = Conv_128^128(Conv_128^128(x_{i−1})_3)_3,  (4)
x_i = x_{i−1} + Conv_128^128(R(f_i))_1,  (5)

where f_i denotes the output of the last convolutional layer in the FRB, Conv_in^out(*)_n denotes an n × n convolutional layer with in input channels and out output channels, followed by BN and ReLU, and R(*) denotes the RCAB function.
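The RCAB described above can be sketched as follows. This is a hedged reconstruction from the description (two 3 × 3 convolutions in the residual branch, two 1 × 1 convolutions with reduction ratio r = 16 in the attention branch); layer naming and the omission of BN inside the RCAB are our assumptions.

```python
import torch
import torch.nn as nn

class RCAB(nn.Module):
    """Residual channel attention block (sketch after [34])."""
    def __init__(self, channels: int = 128, r: int = 16):
        super().__init__()
        # Residual component: two 3x3 convolutional layers.
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Channel attention: global pooling, then two 1x1 convolutions
        # that rescale the channel count by ratio r and emit per-channel weights.
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        res = self.body(x)
        # Residual connection with channel-attention-weighted residual.
        return x + res * self.attn(res)

y = RCAB()(torch.randn(1, 128, 16, 16))
```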
As shown in Fig. 3(a), in the downscale part we use a convolutional layer with a stride of 2 to achieve downsampling. In the upscale part, we use a nearest-neighbor upsampling layer followed by a convolutional layer to slightly modify the FRB and achieve upsampling; nearest-neighbor upsampling with a convolutional layer helps to avoid checkerboard artifacts [37]. Thus, equation (5) becomes

x_i = F_scal(x_{i−1} + Conv_128^128(R(f_i))_1),  (6)

where F_scal(*) denotes the scale transformation function.
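The two scale-transformation variants can be sketched as below; the factory function `make_scal` and its signature are our naming, not the paper's.

```python
import torch
import torch.nn as nn

def make_scal(channels: int = 128, mode: str = "down") -> nn.Module:
    """Sketch of F_scal: stride-2 convolution for the downscale part,
    nearest-neighbor upsampling plus convolution for the upscale part
    (the latter avoids the checkerboard artifacts of transposed convs)."""
    if mode == "down":
        return nn.Conv2d(channels, channels, 3, stride=2, padding=1)
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(channels, channels, 3, padding=1),
    )

x = torch.randn(1, 128, 16, 16)
x_down = make_scal(mode="down")(x)  # 16x16 -> 8x8
x_up = make_scal(mode="up")(x)      # 16x16 -> 32x32
```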

C. GRADIENT EXTRACTION BLOCK
Gradient information has been widely used in various image tasks to improve system performance, such as image translation [38], SISR [29], [30], and image restoration [39]. Gradient information can compensate for the missing high-frequency components such as edges and structures, and thus alleviates the tendency of the l_2 loss to produce over-smoothed images in SR tasks. In this paper, we design the GEB to extract gradient information from the features. As shown in Fig. 1, the GEB consists of three convolutional layers and one gradient layer. Since we need to extract the gradient information from three-channel features, we set the kernel size of the second convolutional layer to 1 × 1. In the gradient layer, we obtain the gradient map by calculating the difference between adjacent pixels:

∇I(A) = (I_a(A), I_b(A)),  (7)
T(I)(A) = ||∇I(A)||_2,  (8)

where T(*) denotes the gradient map extraction, and each element of the gradient map is the gradient length at the pixel with coordinate A = (a, b). Here, I_a(A) is the pixel gradient at A along direction a on the image I, computed as the pixel-value difference between the previous and the following pixel in direction a; the same holds in direction b. ∇I(A) is a two-dimensional vector representing the pixel gradient at A, and by calculating the 2-norm of this vector for every pixel of the image I, the gradient map T(I) is obtained. Assuming that the input of the i-th GEB is y_{i−1} with 128 channels, the i-th GEB can be expressed as

y_i = Conv_3^128(T(Conv_128^3(Conv_128^128(y_{i−1})_3)_1))_3,  (9)

where y_i denotes the output of the i-th GEB. As shown in Fig. 3(b), the scale transformation of the GEB is achieved by the same strategy as described in Section III-B.
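The gradient layer T(*) can be sketched as follows: central differences between the previous and following pixel along both axes, with the gradient map given by the 2-norm of the per-pixel gradient vector. The replicate boundary padding is our choice; the paper does not specify the boundary treatment.

```python
import torch
import torch.nn.functional as F

def gradient_map(img: torch.Tensor) -> torch.Tensor:
    """Sketch of T(*): per-pixel gradient length of a (B, C, H, W) tensor."""
    # Replicate-pad by one pixel so border pixels also get a difference.
    p = F.pad(img, (1, 1, 1, 1), mode="replicate")
    grad_a = p[..., 1:-1, 2:] - p[..., 1:-1, :-2]  # difference along width
    grad_b = p[..., 2:, 1:-1] - p[..., :-2, 1:-1]  # difference along height
    # ||grad I(A)||_2 at every pixel; output keeps the input's spatial size.
    return torch.sqrt(grad_a ** 2 + grad_b ** 2)

t = gradient_map(torch.randn(1, 3, 16, 16))
```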

D. FEATURE FUSION MECHANISM
We found that the face features in the FRB are rich in structural information, which is crucial for the GEB to extract gradient features.
Since the gradient features in the GEB carry high-frequency information, they help the FRB obtain richer face features. Therefore, we introduce the feature fusion mechanism, which fuses the features of the two blocks and feeds the fused features into the next FRB and GEB, respectively, as shown in Fig. 1. This process can be expressed as

z_i = c(x_i, y_i),  (10)

where c denotes the concatenation of feature maps, and z_i denotes the feature obtained by fusing the output x_i of the i-th FRB with the output y_i of the i-th GEB.
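A minimal sketch of the fusion step: the i-th FRB and GEB outputs are concatenated along the channel axis. The trailing 1 × 1 convolution restoring 128 channels (so the next block sees its expected input width) is our assumption; the paper only specifies the concatenation.

```python
import torch
import torch.nn as nn

# 1x1 convolution reducing the concatenated 256 channels back to 128
# (assumed; the paper states only the concatenation of feature maps).
fuse = nn.Conv2d(256, 128, 1)

x_i = torch.randn(1, 128, 16, 16)  # face features from the i-th FRB
y_i = torch.randn(1, 128, 16, 16)  # gradient features from the i-th GEB
z_i = fuse(torch.cat([x_i, y_i], dim=1))  # fused feature fed to both next blocks
```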

IV. EXPERIMENT

A. DATASET
We used two commonly used face datasets, CelebA-HQ [40] and Helen [41], to evaluate our model. CelebA-HQ is an updated version of CelebA [42], with a total of 30k images, each with a resolution of 1024 × 1024. We randomly selected 17k images as the training set and 13k images as the test set.
For the Helen dataset, we randomly selected 2000 images as the training set and 330 images as the test set.

B. IMPLEMENTATION DETAILS
We resized the images of the CelebA-HQ dataset to 128 × 128 pixels as the ground-truth HR images. For the Helen dataset, we cropped 128 × 128 pixel face images based on the annotations of the facial components. We then downsampled the HR images to 16 × 16 pixels by bicubic interpolation to obtain the LR inputs, and augmented the training images by random rotations of 90°, 180°, and 270° and horizontal flips. We used PSNR and SSIM as performance metrics to evaluate the model. We trained the model with the Adam optimizer, where β1 = 0.9 and β2 = 0.999. The initial learning rate is 1 × 10^−4 and is halved at 15k, 30k, 45k, 90k, and 180k iterations. The batch size is set to 16. We implemented the code in PyTorch.
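The optimizer setup above can be expressed directly in PyTorch; the placeholder model is ours, standing in for GFNet.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # stand-in for GFNet (assumption)

# Adam with beta1 = 0.9, beta2 = 0.999 and initial learning rate 1e-4.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

# Halve the learning rate at 15k, 30k, 45k, 90k, and 180k iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[15_000, 30_000, 45_000, 90_000, 180_000], gamma=0.5)

# Inside the training loop, call optimizer.step() then scheduler.step()
# once per iteration (the milestones are iteration counts, not epochs).
```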

C. ABLATION STUDY
We conducted a series of ablation experiments on the CelebA-HQ dataset to validate the effectiveness of our proposed method.

1) GFNet-WG
To evaluate the effect of the gradient information, and considering that the gradient layer is a hand-crafted design without learnable parameters whose computational complexity is negligible, we keep the parameters of GFNet constant and remove the gradient layer in the GEB to observe the performance of the network. We define the resulting model as GFNet-WG. The visualization results are shown in Fig. 4. As can be seen from the zoomed-in views, with the help of the gradient information, our network generates a more three-dimensional face structure (e.g., the nose) with more accurate contours than GFNet-WG. Table 1 reports the PSNR and SSIM results of GFNet-WG. Comparing the PSNR and SSIM scores of GFNet-WG with those of GFNet, we conclude that the proposed strategy of compensating face features with gradient information is effective.

2) GFNet-FM
We evaluate the effectiveness of the feature fusion mechanism by eliminating the concatenation of the feature maps between the FRB and the GEB, as shown in Fig. 5. We define the model with the feature fusion mechanism at position M cancelled as GFNet-FM, where M ∈ {0, 8, 15}. When M is set to 15, the cancelled fusion is located between the last FRB and the last GEB. As can be seen from Table 2, the smaller M is, the better the model performs. This means that the feature fusion mechanism improves the performance of the network.

D. COMPARISONS WITH STATE-OF-THE-ART METHODS
We compare our model with state-of-the-art methods on the CelebA-HQ and Helen datasets. Among generic SR methods, we chose SRCNN [4], SRResNet [6], GIDN [30], and VDSR [5]; among FSR methods, we chose PFSRNet [10], FSRNet [9], and MSFSR [27]. Since only the test code of FSRNet was released, we re-implemented FSRNet in PyTorch. All of the above models were trained on the same dataset as our model. The experimental results on the test sets show that our model generates finer facial components than the state-of-the-art methods. Table 3 shows the quantitative comparisons on the test sets of the CelebA-HQ and Helen datasets; compared with the other SISR and FSR methods, GFNet achieved the highest PSNR and SSIM scores. Table 4 shows the computational complexity and number of parameters of each method, where FLOPs are calculated assuming that the generated SR images have a resolution of 128 × 128. Compared to the other FSR methods, GFNet has fewer parameters, although its FLOPs are higher. As can be seen from Fig. 6, most methods generate facial components with blurred shapes; in contrast, GFNet generates results closer to the real images.
We found that SRCNN and VDSR, among the SISR methods, could not achieve satisfactory results in the 8× FSR task. As shown in Table 3, their PSNR and SSIM scores fall far behind those of the FSR methods. The visualization results in Fig. 7 show that their recovered faces are blurred and some facial components are missing, e.g., the eyes and nose. This is because face images are much more complex than the images processed in SISR tasks; moreover, at the 8× upscaling factor we evaluate, additional face information is needed to reconstruct face images from very low resolutions. Note that the images recovered by SRResNet are perceptually closer to the HR images than those of SRCNN and VDSR, owing to SRResNet's deeper network and large number of residual blocks [42]. This indirectly supports that the proposed feature residual blocks, inspired by ResNet, are beneficial for the FSR task. GIDN distills gradient information to improve performance; compared with the other SISR methods, GIDN generates clearer outlines.
We compared the results of GFNet with those of PFSRNet, FSRNet, and MSFSR. From the results in Table 3, GFNet achieves the best PSNR and SSIM scores. The visualization results in Fig. 8 show that FSRNet cannot recover the correct facial components, so the images it generates are not realistic. This is because, although parsing maps can represent the contours of facial components, they cannot represent the intrinsic structures; specifically, as shown in Fig. 9, a parsing map can represent the general shape of the eye but ignores its internal structure. PFSRNet generates facial components with artifacts, which is caused by the difficulty of detecting landmarks in very low-resolution images. In contrast, the proposed GFNet generates more realistic face images without artifacts. The images generated by MSFSR using enhanced facial boundaries are similar to those of GFNet, but the facial structure is still blurred; as shown in Fig. 10, the gradient map captures the facial structure more richly and in more detail than the enhanced facial boundaries. These qualitative comparisons with state-of-the-art methods demonstrate the effectiveness of using gradient information to compensate face features.

E. LIMITATIONS
Reconstructing images at a 16× upscale factor remains a challenge for FSR methods. In Fig. 11, we show the visual results of GFNet at an upscale factor of 16×. Although GFNet compensates face features with gradient information to improve performance, the reconstructed images are not satisfactory and some facial components are missing, e.g., the nose and lips. This is mainly because, at a 16× upscale factor, it is challenging to obtain enough gradient information from the LR images; the network cannot obtain enough additional face information, making it difficult to reconstruct the face images.

V. CONCLUSION
In this paper, we proposed GFNet for the FSR task, which consists of FRBs and GEBs: the FRB extracts face features and the GEB extracts gradient information. We also introduced a feature fusion mechanism between the two blocks, which allows the network to utilize features at different levels. Extensive experimental results on the CelebA-HQ and Helen datasets demonstrate that GFNet generates faces with finer facial components than other SR models at an 8× upscale factor. Following the idea of this work, future work may develop better GEBs and investigate better ways of combining gradient information with face features.