Face Inpainting via Nested Generative Adversarial Networks

Face inpainting aims to repair images damaged by occlusion or covering. In recent years, deep-learning-based approaches have shown promising results for the challenging task of image inpainting. However, they are still limited in reconstructing reasonable structures, because the results are often over-smoothed and/or blurred. The distorted structures or blurred textures are inconsistent with the surrounding areas and require further post-processing to blend the results. In this paper, we present a novel generative model-based approach that consists of two nested Generative Adversarial Networks (GANs): a sub-confrontation GAN inside the generator and a parent-confrontation GAN. The sub-confrontation GAN, which sits in the image generator of the parent-confrontation GAN, finds the location of the missing area and, acting as a prior constraint, reduces mode collapse. To avoid generating vague details, a novel residual structure is designed in the sub-confrontation GAN to deliver richer original image information to the deeper layers. The parent-confrontation GAN includes an image generation part and a discrimination part. The discrimination part includes a global discriminator and a local discriminator, which benefits the overall coherency of the repaired image while preserving local details. Experiments on the publicly available CelebA dataset show that our method outperforms current state-of-the-art techniques both quantitatively and qualitatively.


I. INTRODUCTION
Face inpainting is the challenging task of recovering the details of facial features based on high-level image semantics. It can be applied in many face recognition scenarios, such as faces occluded by sunglasses, by a microphone during a performance, or by a mask. The purpose of inpainting is to repair the damaged part of an image using the known image information. The most important goal is to generate reliable repaired areas without introducing noise into the non-repaired areas. With this technique, noise, gaps and scratches can be removed.
Because of the strong correlation between pixels in one image, lost image information can be restored as much as possible based on the undamaged or unoccluded area of the image and its prior pattern. During the inpainting process, the content information of the whole image is considered, including low-level texture information and high-level semantic information. Traditional inpainting methods rely on low-level cues to find the best matching patches from the uncorrupted sections of the same image [1]-[3]. These methods work well for background completion and repetitive texture patterns. However, low-level features are of limited use for face inpainting, because a face image consists of many unique components and the inpainting process needs to be carried out at a high semantic level [4]-[6]. Traditional methods based on finding patches with a similar appearance therefore do not always perform well.
The associate editor coordinating the review of this manuscript and approving it for publication was Chunbo Xiu.
Rapid progress in deep convolutional neural networks (CNNs) and generative adversarial networks (GANs) [7] has inspired many studies [6], [8]-[10] on restoring damaged images. GAN models [6], [10], [11], [43] have been proposed to deal with both low-level textural features and high-level semantic features, so that they can fill in the blanks in images. However, one of the essential challenges of inpainting with a GAN model is that the reconstructed area is blurry compared with the rest of the image [11]. The reason is that the output of the model approximates the minimum of the global loss, which makes the intensity of the output vague. To tackle this problem, a complete training framework based on a nested generative adversarial network (NGAN) is proposed in this paper. The generation network includes a sub-confrontation GAN and a parent-confrontation GAN. The sub-confrontation GAN finds the location of the missing area and produces a rough result. To avoid losing information about the defect area and degrading the GAN, our model adopts a residual structure to jointly transmit features from different layers to deeper layers. To resolve the ambiguity of the repaired region, a special residual transfer connection is used four times in the sub-confrontation generation network, which reduces the loss incurred during transmission through the convolutional network. In the parent-confrontation GAN, the global and local discriminators are combined to capture both the local continuity of image texture and the pervasive global features of images, with the aim of achieving a high-quality local repair area and overall coordination.
We evaluate our method on the CelebA [12] dataset and compare it with other state-of-the-art methods. The contributions of our work are summarized as follows:
• An NGAN-based framework is proposed for face inpainting, which combines a sub-confrontation network and a parent-confrontation network. The networks produce an a priori semantic constraint to reduce mode collapse.
• A novel residual connection structure is introduced in the sub-confrontation generation network, which helps to generate high-quality details for masked facial images and to eliminate ambiguity.
• A local discriminator and a global discriminator are combined in our framework, which ensures the global consistency of the inpainting results and preserves the details of the local inpainting area.
The remainder of the paper is organized as follows. Section II presents a short review of relevant and recent image inpainting techniques. The details of the NGAN method are presented in Section III. Section IV shows the experimental results before we conclude the paper in Section V.

II. RELATED WORK
As an important branch of digital image processing, the research of image inpainting is extensive. Methods for image inpainting fall mainly into two categories: copy-paste and learning-based.
Copy-paste inpainting methods are based on the information relations between damaged areas and known areas in the image and migrate the surrounding information to the blank area. The idea of the diffusion model is to iteratively propagate the underlying texture information of known image areas to the damaged unknown areas [1]. The basic principle of this type of model comes from the thermal diffusion equations in physics [2], [13]-[15]. Another type of inpainting approach, based on geometric image variational models, imitates the process of image restoration by hand [16]-[19]. In these methods, the universal function is determined based on the prior distribution of the data, and the defect area is repaired using the established model. These copy-paste image restoration techniques have achieved good results on smooth and continuous images with small-scale damage. However, when the lost area is large, or the texture is rich and complex, the diffusion or image data model cannot accurately describe the lost information, resulting in unnatural and unclear results.
To solve the above problem, texture synthesis technology was presented [20]. Texture blocks of appropriate size are determined, and the missing area is synthesized according to the texture similarity of the blocks. Image energy optimization [1], [3], [21], [22] was introduced to measure texture proximity, and the image gradient was integrated into the distance measurement between reconstructed textures [23]. Texture measurement was extended to include image segmentation and texture generation [24]. This type of approach can improve efficiency and achieve real-time image restoration through patch-match algorithms. In addition, some methods for automatically estimating the structure of the scene have also been proposed [25]-[29]. These methods improve the quality of image completion by preserving important structures, such as points of interest [30], lines [31] and perspective distortion [32]. However, image structure guidance is a heuristic constraint based on a particular type of scene, and it is limited to a specific structure. Distinct guiding rules need to be designed for different images, and these rules cannot be applied to arbitrary images. Besides, these approaches have difficulty reconstructing semantic information because they only repair the underlying texture.
Although copy-paste methods perform well in image restoration, they can hardly produce textures that are not present in the original picture. To obtain more information, a large image database was used [33]. However, compared with the general methods, the premise that the database contains a large number of similar or identical scenes greatly limits its applicability.
With the development of deep neural networks, deep-learning-based methods have been introduced to predict the unavailable content and achieve semantic inpainting results. The convolutional neural network-based image inpainting method [8] can obtain pleasing results for small occluded areas. It was applied to repair missing data in MRI and PET [9]. Generative Adversarial Nets (GANs), which are based on two-player game theory and combined with convolutional networks [6], [10], [34]-[36], can produce very realistic repaired images. In these networks, the input image data includes the mask areas to be repaired. These mask areas must be manually annotated in the real world. To address this time-consuming limitation, a novel end-to-end network is proposed in [6], [43], which does not need an additional mask as input information.
Although deep-learning approaches consider both content texture and semantic features and perform well in image inpainting, some features are easily lost, which results in unreasonable reconstructed structures, such as over-smoothed and/or blurry regions [12]. In particular, distorted structures or blurry textures that are inconsistent with the surrounding areas are produced [36]. In this paper, we propose NGAN for semantic face inpainting. In our model, a nested GAN structure is introduced to constrain the generation process and reduce the introduction of noise. A new residual connection is constructed to transmit the information lost during forward propagation to deeper layers. The global and local discriminators are combined to reconstruct an overall coherent image and to obtain local details.

III. APPROACHES
A. GAN REVIEW
The GAN model was proposed by Goodfellow et al. [7]. It consists of two parametrized deep neural networks: a generator, G, and a discriminator, D. G maps a random vector z, sampled from a prior distribution $p_z$, to the image space, while D maps an input image to a likelihood. The target of G is to produce images that are realistic enough, while D discriminates between the image generated by G and the real image, x, sampled from the data distribution $p_{data}$.
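The G and D networks are trained by optimizing the loss function:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))] \quad (1)$$
The generator is trained to acquire minimum loss while the discriminator is trained to acquire maximum loss. The output of the discriminator eventually approaches 0.5 when the training process finishes.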
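As a small numerical illustration of the min-max objective of [7], the sketch below estimates the value function from batches of discriminator outputs. The function name `gan_value` and the sample values are assumptions made for illustration; they are not taken from our implementation.

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real images, values in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated images, values in (0, 1).
    """
    d_real = np.asarray(d_real, dtype=np.float64)
    d_fake = np.asarray(d_fake, dtype=np.float64)
    return float(np.mean(np.log(d_real)) + np.mean(np.log1p(-d_fake)))

# A confident discriminator pushes the value towards its maximum of 0.
confident = gan_value([0.99, 0.98], [0.01, 0.02])

# A discriminator that cannot separate real from generated images outputs
# 0.5 everywhere, which gives the value -2 * log(2).
undecided = gan_value([0.5, 0.5], [0.5, 0.5])
```

The discriminator tries to raise this value and the generator tries to lower it; when the discriminator outputs 0.5 for every input, the value settles at -2 log 2.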

B. NESTING STRUCTURE OF GAN
GAN is an unsupervised learning model that can generate clear and realistic images [6]. We introduce a generative CNN model and a training procedure for the problem of hole filling in face images. Our network has a nested structure with two different generation networks, called the sub-confrontation generation network and the parent-confrontation generation network.
The sub-confrontation generation network identifies the location of image defects, which preserves the original information in the non-repaired area of the image. After the confrontation training, the code generator can produce robust coding information, which is decoded to generate the output image. In addition, the residual structure and the dilated convolutional structure are adopted in the code generator of the sub-confrontation generation network to improve the local details of the output image. Meanwhile, the coding information is used as an a priori semantic constraint to reduce mode collapse. The sub-confrontation generation network consists of a code generator and a code discriminator, and the output of the code discriminator is fed back into the code generator. In the parent-confrontation generation network, the outputs of the global and local discriminators are fed back into the image generator.
The parent-confrontation generation network has two parts: a generation part and a discrimination part. It takes the corrupted image and tries to reconstruct the repaired image. The generation part uses the coding information of the sub-confrontation generation network to recover the input image through multiple convolutional layers. Unlike traditional networks, the discrimination part consists of two discriminators that operate at different scales. The overall structure of our framework is shown in Fig. 1.

1) SUB-CONFRONTATION GENERATION NETWORK
The sub-confrontation generation network consists of a code generator and a code discriminator, and the corrupted image z is used as the input. After adversarial training, the code generator produces code information that the code discriminator judges to belong to the same class as the ground truth coding. The network thereby acquires the ability to extract robust features of the damaged image. Furthermore, it also acts as a prior constraint on the image generator, which effectively reduces mode collapse of the generator.
The code generator is trained together with an additional code discriminator. It learns the features of the occluded image and retains as much of the semantic information of the original image as possible during the coding process. The code generator and the code discriminator form an adversarial structure and are updated alternately until the coding of the corrupted image is consistent with that of the corresponding ground truth.
The code generator consists of five convolutional layers and five dilated convolutional layers. Dilated convolution increases the receptive field of the network without increasing the number of model parameters [37]; it is analyzed in detail in Section III-D. To avoid losing information during computation, a specially designed residual connection is applied, which transports original information to deep layers. The novel residual connection structure is described in Section III-C. The code discriminator consists of three convolutional layers and one fully connected layer. The structure of the sub-confrontation generation network is shown in Fig. 2.

2) PARENT-CONFRONTATION GENERATION NETWORK
The parent-confrontation generation network has two parts: a generation part and a discrimination part. The generation part is composed of the code generator of the sub-confrontation generation network and an image generator. After the coding process, the image generator reconstructs the broken image from the code information. Using the encoded information as input, instead of the image directly, improves the robustness of our model and reduces mode collapse.
FIGURE 3. The parent-confrontation generation network in our framework consists of a complex generator and two discriminators. The defective image is the input of the generation part. The network carries out the semantic repair and outputs the reconstruction result, which is evaluated by the discrimination part.
The discrimination part consists of a global discriminator and a local discriminator. The global discriminator judges the authenticity of the whole image and enforces global consistency on a large scale. In contrast, the local discriminator only constrains the richness of image detail and the local coherency. Both discriminator networks have similar structures; their outputs are concatenated and passed through a fully connected layer. The generation part and the discrimination part form a confrontation structure. Through confrontation training, the generation part learns to reconstruct pleasing images. The structure of the parent-confrontation generation network is shown in Fig. 3.

C. NOVEL RESIDUAL CONNECTION STRUCTURE
When information is propagated forward between layers, convolution kernels with stride 2 or larger reduce the size of the feature map, which results in the loss of detailed texture information and the degradation of the generated image. In addition, the activation function loses information of the original image. For example, for a single image $x_0$ passed through a convolutional network, a conventional feed-forward network connects the output of the $l$th layer to the $(l+1)$th layer by applying the layer transition $x_{l+1} = H_l(x_l)$. To achieve sparse network connections without discarding negative values entirely, the leaky ReLU function is usually used, which attenuates the negative responses of the preceding feature map. However, the leaky ReLU function does not completely account for the impact of information loss on image generation.
To tackle this limitation, our method takes advantage of the original available data by using a novel residual connection structure. The residual network structure was proposed in [38]; it adds a skip connection that bypasses the non-linear transformation with an identity function:
$$x_{l+1} = H_l(x_l) + x_l \quad (2)$$
where $x_l$ is the output of the $l$th layer, $x_{l+1}$ is the output of the $(l+1)$th layer, and $H_l(\cdot)$ is the Leaky ReLU function. With the residual block structure, the original information can be delivered to deep layers, which not only retains details of the input image but also avoids introducing noise. In [39], an improvement was made to reduce the number of residuals and improve network performance. However, these two residual connections can only transport the original information directly. This transfers all the information from the damaged area of the image to the deeper layers and degrades the quality of the resulting image. Therefore, the residual connection structure must be changed so that only valid information is passed on. In our approach, we improve the structure in [39] and modify the residual structure. In the modified structure, $K_l(x_l)$ is the processing function of the $l$th layer, which is a down-sampling operation performed by convolutional layers and pooling layers in our network, and $\varphi_l(\cdot)$ is a robust information extractor for the output of the $l$th layer, which filters out the original information lost through the layers. The definition of the extractor function $\varphi_l(\cdot)$ involves $K^*_l(\cdot)$, an up-sampling operation implemented by a single convolutional layer. The original information $t$ is implicitly and adaptively transported to $K^*_l(K_l(t))$, and the interpolation between the feature map in the deeper layer and the feature map in the shallow layer is the information missing in the feed-forward network. By transmitting the missing information $\varphi_l(t)$ to deeper layers, our approach takes advantage of more primitive and useful semantic information and gains the ability to generate reliable results.
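The idea of passing on only the information that a down-sampling and up-sampling round trip fails to preserve can be illustrated with a toy example. The sketch below is one possible reading of the extractor, not its actual definition: it assumes that the lost information is the difference between a feature map and its round-trip reconstruction, and it replaces the learned layers with fixed average pooling and nearest-neighbour up-sampling. All function names are hypothetical.

```python
import numpy as np

def down_sample(t):
    """Stand-in for K_l: 2x2 average pooling (assumption, not a learned layer)."""
    h, w = t.shape
    return t.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up_sample(t):
    """Stand-in for K*_l: nearest-neighbour 2x up-sampling (assumption)."""
    return np.repeat(np.repeat(t, 2, axis=0), 2, axis=1)

def lost_information(t):
    """Hypothetical extractor: what the down/up round trip does not preserve."""
    return t - up_sample(down_sample(t))

rng = np.random.default_rng(0)
feature_map = rng.normal(size=(8, 8))

# The residual carries exactly the detail removed by the round trip, so
# adding it back to the round-trip result restores the input.
residual = lost_information(feature_map)
restored = up_sample(down_sample(feature_map)) + residual

# A constant feature map loses nothing, so its residual is zero everywhere.
flat_residual = lost_information(np.ones((8, 8)))
```

Under this reading, smooth regions contribute almost nothing to the residual path, while fine detail that down-sampling would discard is forwarded to the deeper layers.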
We compared different residual connection patterns and present the experimental results in Fig. 4. The results illustrate that our method outperforms those in [38] and [39]. The images inpainted with our connection pattern have clear details of the left eye and good consistency with the right eye, whereas details are obviously missing in the results of the other two methods. The results repaired with the element-wise sum [38] show a significant mosaic effect, while the results using depth concatenation [39] are seriously blurred in detail.

D. DILATED CONVOLUTION
When repairing large missing regions in an image, the network needs a large receptive field. Using a large convolutional kernel or a deeper network increases the number of parameters and makes the training process more difficult. To eliminate this disadvantage, dilated convolution [37] is introduced into our network. Because a dilated kernel is equivalent to a larger kernel with zeros inserted between its weights, dilated convolutional layers obtain a large receptive field without increasing the number of parameters.
The receptive field of a dilated convolutional layer is:
$$f_i = f_{i-1} + (K_i - 1) \times S_i \quad (5)$$
where $f_i$ is the receptive field of the $i$th layer, $f_{i-1}$ is the receptive field of the $(i-1)$th layer, $K_i$ is the kernel size of the $i$th layer and $S_i$ is the dilation rate of the $i$th layer. When the size of the convolution kernel is fixed, the receptive field of the network can increase exponentially with the number of layers (for example, when the dilation rate doubles at each layer). The increase of the receptive field through dilated convolution also introduces the gridding effect [44], which may not be good for learning. Because the local information is completely missing and the information can be irrelevant across large distances, the Hybrid Dilated Convolution (HDC) design paradigm [44] is adopted to solve the gridding problem. The dilation rates of adjacent layers must have no common divisor greater than 1, which ensures that every pixel in the receptive field participates in the calculation. The dilation rates are selected to follow the zigzag structure [1, 2, 5], subject to the following rule:
$$M_i = \max\left[M_{i+1} - 2r_i,\; M_{i+1} - 2(M_{i+1} - r_i),\; r_i\right] \quad (6)$$
where $r_i$ is the dilation rate of the $i$th layer and $M_i$ is the max dilation rate of the $i$th layer. As shown in Fig. 5, when the dilation rates are [1, 2, 5], all pixels participate in the calculation and the gridding effect is completely eliminated. The shallow receptive field has a checkerboard effect. As the number of layers increases, the receptive field gradually tends to concentric circles.
FIGURE 5. Illustration of the solution of the gridding problem. The receptive field of a sequence of 9 convolutional layers with dilation rates of [1, 2, 5] and kernel size 3 × 3. The number of times each pixel is counted is represented by the color depth.
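The receptive field growth and the HDC check can be worked through numerically. The sketch below assumes stride-1 layers, the usual recursion in which each layer adds (kernel size - 1) times its dilation rate to the receptive field, and the HDC rule of [44] evaluated backwards from the last layer of a group; the function names are illustrative only.

```python
from math import gcd

def receptive_field(kernel_sizes, dilation_rates):
    """Receptive field of stacked stride-1 layers: f_i = f_(i-1) + (K_i - 1) * S_i."""
    field = 1
    for k, s in zip(kernel_sizes, dilation_rates):
        field += (k - 1) * s
    return field

def hdc_max_gap(dilation_rates):
    """M_2 of the HDC rule, computed backwards starting from M_n = r_n."""
    m = dilation_rates[-1]
    for r in reversed(dilation_rates[1:-1]):
        m = max(m - 2 * r, m - 2 * (m - r), r)
    return m

def adjacent_rates_coprime(dilation_rates):
    """True if no two adjacent layers share a common divisor greater than 1."""
    return all(gcd(a, b) == 1 for a, b in zip(dilation_rates, dilation_rates[1:]))

# Nine 3x3 layers with the zigzag rates [1, 2, 5] repeated three times.
zigzag_field = receptive_field([3] * 9, [1, 2, 5] * 3)

# The same nine layers without dilation cover a much smaller area.
plain_field = receptive_field([3] * 9, [1] * 9)

# HDC asks for M_2 to be no larger than the kernel size (3 here).
zigzag_gap = hdc_max_gap([1, 2, 5])
doubling_gap = hdc_max_gap([2, 4, 8])
```

With 3 × 3 kernels, the group [1, 2, 5] gives a gap of 2, which satisfies the rule, whereas [2, 4, 8] gives 4 and also has adjacent rates that share the factor 2, which is the situation in which gridding appears.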

E. DISCRIMINATION PART
To achieve clarity of detail and overall consistency at the same time, an additional local discriminator for detail discrimination is used in our network. The input of the local discriminator is only the local area of the image, and this discriminator only distinguishes the details. In Fig. 6, we compare the results of using only the global discriminator with those of using both the global and the local discriminator. The left eyes produced with only the global discriminator tend to be blurred, while the results of our method are clear and realistic and are consistent with the surrounding areas and the right eyes. These results indicate that our method performs well in repairing details and generates detailed and consistent reconstructed images regardless of the mask location.
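How the input of a local discriminator might be cut out of the full image can be sketched as follows. The amount of surrounding context (`context`) and the function name are assumptions made for illustration; the section above only states that the local discriminator sees the local area of the image.

```python
import numpy as np

def crop_local_region(image, mask_top, mask_left, mask_size, context=8):
    """Cut a patch around the repaired square, clipped to the image borders.

    `context` is a hypothetical margin of surrounding pixels kept on each side.
    """
    height, width = image.shape[:2]
    top = max(mask_top - context, 0)
    left = max(mask_left - context, 0)
    bottom = min(mask_top + mask_size + context, height)
    right = min(mask_left + mask_size + context, width)
    return image[top:bottom, left:right]

image = np.zeros((128, 128, 3), dtype=np.float32)

# The global discriminator would receive `image` as a whole, while the local
# discriminator would receive only the patch around the 16 x 16 repaired area.
local_patch = crop_local_region(image, mask_top=40, mask_left=50, mask_size=16)
corner_patch = crop_local_region(image, mask_top=0, mask_left=0, mask_size=16)
```

Keeping a small margin around the mask lets a local discriminator also judge how well the repaired area blends with its immediate surroundings; near the image border the patch is simply clipped.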

F. OBJECTIVE FUNCTION
At the training stage, we use a combination of five loss functions. They are optimized jointly via back-propagation using the RMSProp optimizer [41]. In addition, the model is trained with an adaptive step-by-step scheme. We describe each loss function briefly as follows.
Code generator loss is the entropy deviation of information between the input and output of the code generator. Even though it forces the network to produce a blurry output, it guides the network to roughly predict the robust information and the location of the corrupted area. The code generator loss combines the reconstruction loss of the code generator structure, the coding loss and the GAN loss with the code discriminator. It is back-propagated through the code generator and defined in Eq. (7), where $X$ is the ground truth, $\hat{X}$ is the reconstructed image, $C(\cdot)$ is the output of the code generator, $D_{code}(\cdot)$ is the output of the code discriminator, and $\mathrm{MSE}(\cdot, \cdot)$ is the pixel-wise mean square error between two images.
Code discriminator loss is the distance between the synthesized image and the ground truth. It is back-propagated through the code discriminator and takes the form of the discriminator loss in GAN.
Image generator loss combines the loss of the coding reconstruction process and the generator loss of the generative adversarial network. It is back-propagated through the image generator. In its definition, $G(\cdot)$ is the output of the image generator, $Y$ is the reconstruction result of the broken image, and $D_{GL}(\cdot)$ is the sum of the results of the global discriminator and the local discriminator.
Global discriminator loss and local discriminator loss measure the accuracy of distinguishing the synthesized image from the ground truth. The global discriminator operates on the whole image, while the local discriminator operates only on the reconstructed area. The two losses are back-propagated through the global discriminator and the local discriminator separately. In their definitions, $x$ is the missing area of the corrupted image, $y$ is the corresponding region in the ground truth, $D_G(\cdot)$ is the result of the global discriminator, and $D_L(\cdot)$ is the result of the local discriminator.

IV. EXPERIMENTS
A. IMPLEMENTATION
In our work, we utilize the architecture of the deep convolutional GAN (DCGAN) to train the five parts of the model. The implementation environment of the experiment is TensorFlow 1.14.0, CUDA 10.0.130, an Intel(R) Core(TM) i7-6700K CPU and an NVIDIA GeForce GTX 1080. Our NGAN is trained on the CelebFaces Attributes (CelebA) dataset, which consists of 10,177 identities with 202,599 face images. The occlusion dataset CelebA-Mask was created by adding occlusions to the original face images, which then serve as the damaged input images to be repaired. We randomly selected 10 percent of CelebA-Mask as the test set and used the remaining 90 percent of the images as the training set. Our network uses Leaky ReLU as the activation function, which reduces the loss of information caused by suppressing negative values. Our framework consists of five neural networks, and the network is constrained step-by-step during training. The training process is as follows:
1) Train the code generator and the image generator only.
2) Fix the code generator and the image generator, and train the code discriminator, the local discriminator and the global discriminator. The number of iterations is fixed so that the training degree of the discriminators is close to that of the code generator and the image generator.
3) Alternately train the code generator, the image generator and the discriminators using the training method of GAN.
The images are 128 × 128 pixels; the missing area is placed at a random position and has a size of 16 × 16 pixels.
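The construction of an occluded dataset and the 90/10 split described above might be sketched as follows. This is an illustrative sketch only: filling the occluded square with zeros, the helper names and the use of NumPy's random generator are assumptions, not details of how CelebA-Mask was actually produced.

```python
import numpy as np

def add_random_mask(image, mask_size=16, rng=None):
    """Return a copy of `image` with a randomly placed square set to zero."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]
    top = int(rng.integers(0, height - mask_size + 1))
    left = int(rng.integers(0, width - mask_size + 1))
    corrupted = image.copy()
    corrupted[top:top + mask_size, left:left + mask_size] = 0  # assumed fill value
    return corrupted, (top, left)

def split_indices(n_images, test_fraction=0.1, rng=None):
    """Randomly split image indices into a training part and a test part."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(n_images)
    n_test = int(round(n_images * test_fraction))
    return order[n_test:], order[:n_test]

rng = np.random.default_rng(0)
face = np.ones((128, 128, 3), dtype=np.float32)   # placeholder for a face image
corrupted, (top, left) = add_random_mask(face, mask_size=16, rng=rng)
train_idx, test_idx = split_indices(202599, test_fraction=0.1, rng=rng)
```

Returning the mask position alongside the corrupted image is convenient for evaluation, since the repaired region can then be compared with the ground truth separately from the untouched pixels.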

B. QUALITATIVE EVALUATION
To evaluate the effectiveness of our model, a comparison of our results with those of [41] and [36] is presented in Fig. 8. The occluded area of the input image contains a wealth of semantic information, which makes the task very different from simple texture repair.
The results demonstrate that our approach performs best in terms of overall consistency and detail repair. In terms of detail repair, the repaired area of the left eye produced by the methods of [41] and [36] is blurred, while our method achieves much more natural and clear details. In terms of overall consistency, the left eye and the right eye in the results of the compared methods are not consistent, while in our results the left eye and the right eye remain coherent. The enlarged results of the inpainting area are shown in Fig. 9. The area reconstructed by the method in [41] has a clear border, and the hue of the completed area is different from that of the other parts of the face. Likewise, the hue of the area reconstructed by the method of [36] is different. Our method does not have this drawback, and an overall consistent skin tone is obtained.

C. QUANTITATIVE EVALUATION
In order to evaluate our model quantitatively, we tested it on the whole CelebA-Mask dataset. The mean square error (MSE), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM) and FACENET distance [42] of the repaired images are calculated and compared with those of the other two methods. The results of our method (end-to-end and with manual processing) are compared with Pathak et al. [41] and Yu et al. [36], and the evaluation results are shown in Fig. 10. Manual processing means manually replacing the non-mask region of the output with the original pixels at the corresponding positions. The corresponding results are shown in Fig. 10 as "Our".
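Two of these metrics and the manual processing step can be written down compactly. The sketch below uses the standard definitions of MSE and PSNR for 8-bit images; the helper names and the boolean mask convention (True inside the repaired region) are assumptions for illustration. SSIM and the FACENET distance need dedicated implementations and are left out.

```python
import numpy as np

def mse(a, b):
    """Pixel-wise mean square error between two images."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.mean((a - b) ** 2))

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(a, b)
    if error == 0.0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / error))

def manual_processing(output, original, mask):
    """Keep the network output inside the mask and the original pixels elsewhere."""
    output = np.asarray(output)
    mask = np.asarray(mask, dtype=bool)
    if mask.ndim == output.ndim - 1:
        mask = mask[..., np.newaxis]  # broadcast over the colour channels
    return np.where(mask, output, original)

truth = np.full((128, 128, 3), 100.0)
network_output = truth + 10.0            # toy output that is off by 10 everywhere
mask = np.zeros((128, 128), dtype=bool)
mask[40:56, 60:76] = True                # 16 x 16 repaired region

composited = manual_processing(network_output, truth, mask)
```

In this toy case the end-to-end output has an MSE of 100, while the composited image has an MSE of 100 × (16 × 16) / (128 × 128) = 1.5625, because only the repaired square still differs from the ground truth.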
As shown in the figure, our method performs well on MSE, PSNR, SSIM, and FACENET distance, especially for the end-to-end results with post-processing. This indicates that our model balances repaired detail, consistency with the surroundings, and structural restoration, with fewer artifacts. The variant that directly copies the raw image patches has lower MSE and FACENET distance, and higher PSNR and SSIM, which indicates that the repaired area produced by our method is highly consistent with the ground truth and has better facial similarity in the task of facial semantic repair. The difference in quality between our end-to-end results and the manually processed results also demonstrates that the end-to-end model introduces only a little noise into the output images while maintaining the overall consistency and preserving the surrounding information of the input image.
FIGURE 10. Comparison of Pathak [41], Yu [36], our method (end-to-end) and the result with manual processing of our method.
FIGURE 11. Comparison of $\ell_1$ (%) and $\ell_2$ (%) on CelebA-Mask with [36], [41] and our method.
In addition, to compare the deductive repair capability of the networks, we report the mean $\ell_1$ error and the mean $\ell_2$ error of the repaired region. The statistical distributions of the results of [36], [41] and our method with respect to the ground truth are evaluated and compared in Fig. 11. The lower values of $\ell_1$ and $\ell_2$ for our method indicate that the area repaired by the proposed method is statistically more similar to the ground truth. Subjective experiments also verify that our method achieves detail-rich textures and better performance in the repaired areas.
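A region-restricted error of this kind might be computed as in the sketch below. The exact normalisation is an assumption: pixel values are scaled to [0, 1], the mean absolute difference is taken as the mean ℓ1 error, the mean squared difference as the mean ℓ2 error, and both are expressed in percent. The function name is hypothetical.

```python
import numpy as np

def region_errors(output, truth, top, left, size, peak=255.0):
    """Mean l1 and l2 errors (in percent) inside the repaired square region."""
    out = np.asarray(output, dtype=np.float64)[top:top + size, left:left + size] / peak
    ref = np.asarray(truth, dtype=np.float64)[top:top + size, left:left + size] / peak
    diff = out - ref
    mean_l1 = 100.0 * float(np.mean(np.abs(diff)))
    mean_l2 = 100.0 * float(np.mean(diff ** 2))
    return mean_l1, mean_l2

truth = np.zeros((128, 128, 3))
output = truth.copy()
output[10:26, 20:36] = 51.0   # repaired square is off by 51 grey levels (0.2 after scaling)

mean_l1, mean_l2 = region_errors(output, truth, top=10, left=20, size=16)
```

Restricting the error to the repaired square keeps the untouched pixels, which are identical to the ground truth after manual processing, from diluting the comparison.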

D. LIMITATION
Although our model is able to generate semantically plausible and visually pleasing content, it has some limitations. We used various data to test and verify the effectiveness and robustness of our method. In the experiments in Fig. 12, our model fails to reconstruct profile images. Due to the limitation of the training data, this method only works with rectangular patches (16 × 16 in this work). In future work, we plan to merge expression detection and face position detection into our framework to address this issue.

V. CONCLUSION
In this paper, we present a novel deep generative model-based approach that improves the quality of the filled regions while exhibiting fine details. Our network employs a novel nesting structure to find the location of the missing area and, as a prior constraint, to reduce mode collapse. A residual structure is adopted to deliver richer original image information to the deeper layers. Both qualitative and quantitative experiments show that our method performs well in fine details and global uniformity and can achieve end-to-end repair of defective images without additional semantic information.

AILONG MA received the B.S. degree from the China University of Petroleum, Qingdao, China, in 2010, and the Ph.D. degree in photogrammetry and remote sensing from Wuhan University, Wuhan, China, in 2017. He is currently a Research Associate with Wuhan University. His major research interests include remote sensing image processing, evolutionary computing, and deep learning.

VOLUME 7, 2019