A Novel GAN-Based Network for Unmasking of Masked Face

Recent deep learning based image editing methods have achieved promising results for removing objects from an image but fail to generate plausible results when removing large objects of complex nature, especially in facial images. The objective of this work is to remove mask objects in facial images. This problem is challenging because (1) facial masks usually cover a large region of the face, often extending beyond the actual face boundary below the chin, and (2) facial image pairs with and without the mask object do not exist for training. We break the problem into two stages: mask object detection and image completion of the removed mask region. The first stage of our model automatically produces a binary segmentation of the mask region. The second stage then removes the mask and synthesizes the affected region with fine details while retaining the global coherency of the face structure. For this, we employ a GAN-based network with two discriminators, where one discriminator helps learn the global structure of the face and the other then focuses learning on the deep missing region. To train our model in a supervised manner, we create a paired synthetic dataset using the publicly available CelebA dataset and evaluate on real-world images collected from the Internet. Our model outperforms other representative state-of-the-art approaches both qualitatively and quantitatively.


I. INTRODUCTION
The goal of this research, as illustrated in Figure 1, is interaction-free removal of large objects (e.g., face masks) from facial images. In this work, we focus on unmasking masked faces because it is an intriguing problem of great practical value. Given an input masked facial image, we detect the mask region, then feed the input image and a binary map of the detected mask region into a GAN [1] based network and generate an image without the non-face object, which is the mask object in our case.
The trend of wearing masks in public has been growing all over the world in recent years. First, people wear masks to guard themselves against pollution. Second, some people are self-conscious about their looks and want to hide their face and emotions from the public. Removing a mask object that covers almost half of the face might help in guessing one's identity.
To address this task, early non-learning based works [2], [3] erase the unwanted object and synthesize the missing content by matching similar patches from the remainder of the image. In [4], the authors find similar patterns in a database of millions of scene images and paste those patterns into the damaged part. Park et al. [5] remove eyeglasses from facial images using PCA reconstruction and recursive error compensation. However, these non-learning based algorithms are limited to removing small objects from images.
(The associate editor coordinating the review of this manuscript and approving it for publication was Qiangqiang Yuan.)
Recent advances in learning-based methods empower image editing algorithms by learning from large-scale datasets, and thereby outperform non-learning methods for removing unwanted objects from an image. Iizuka et al. [6] use a GAN setup with two discriminators to remove an unwanted object and fill the damaged region of an image with synthetic content. Two-stage networks have been presented in [7]-[12]. In the first stage, they generate a coarse output, which they refine in the second stage. Khan et al. [8] also employ a coarse-to-fine network approach to remove microphone objects from facial images. SPG-Net [12] and EdgeConnect [11] also use a two-stage adversarial approach to remove unwanted objects. In the first stage, they produce some guidance information, and in the second stage they complete the image using that guidance information. We give more detailed descriptions of the relevant literature in the related work section.
In general, learning-based image editing approaches work well for removing objects that have small structural and appearance variations. However, these approaches are not well suited to unmasking masked faces because of the large size and complex nature of the object, i.e., the mask. For example, masks usually cover not only half of the face semantics but also some parts beyond the actual boundary of the face. A mask typically starts at the upper part of the nose (just below the eyes) and ends up covering some part of the neck, as well as some parts beyond the cheeks.
To solve this problem, we propose a novel GAN-based network that automatically removes the mask and completes the missing hole so that the completed face not only looks natural and realistic but is also consistent with the rest of the image. We break the problem into two stages: mask object detection and image completion of the detected mask region. In the first stage, we detect the non-face object, i.e., the mask, and generate a binary segmentation map of the object using an encoder-decoder network. In the second stage, we take an approach of gradually learning global coherency and deep missing semantics. We first train our model using one generator and one discriminator. This discriminator looks at the whole image and hence helps enforce global coherency. Although this setup generates the face structure, especially the chin and cheek parts covered by the mask, so that it is intact with the rest of the face, it is unable to synthesize the deep region of the missing hole well. By 'deep region of the missing hole', we mean the part of the face far away from the occlusion boundary caused by the mask object, e.g., the mouth, and more specifically the lips and teeth. To focus more on generating the deep missing semantics, we add a second discriminator to the model that looks only at the missing region. This scheme makes the two discriminators provide fair feedback to the generator so that it completes the affected region with fine details while maintaining the global structure of the facial image. More details of the model training are discussed in the training part of the experiments section. We also introduce a joint loss function that encourages visually plausible, sharp and semantically consistent results.
Moreover, because facial image pairs with and without the mask object do not exist, we have created a paired synthetic dataset by editing images from the publicly available CelebA dataset.
The main contributions of this work are:
• We propose a novel approach that automatically removes the mask object from a face and synthesizes the affected region with fine details while retaining the original structure of the face.
• To retain structural and appearance consistency of the recovered face, we use a gradually growing network approach with two discriminators, where one discriminator first helps learn the global structure of the face and then another discriminator comes in to focus on learning the deep missing region. This way, we achieve the effect of coarse-to-fine image completion.
• To overcome the data scarcity problem, we have created a synthetic paired dataset using the publicly available CelebA dataset.
• Our unified feed-forward model generates structurally and perceptually plausible facial images for challenging real images although it is trained on the synthetic dataset we created.

II. RELATED WORK
Object removal from an image consists of two main tasks: a) object detection, b) image completion. There has been a considerable amount of non-learning or learning based work in the field of computer vision to tackle the task of object removal in an image. Due to the plethora of related literature, we only review some representative works related to object detection and completion in an image.
A. OBJECT REMOVAL AND IMAGE COMPLETION
Table 1 compares our method with non-learning and learning based state-of-the-art object removal approaches. Non-learning based object removal algorithms [2], [4], [13] erase the unwanted object from an image and complete the missing region by finding similar structure in the input image or in external data. Hays and Efros [4] search millions of scene images for the information most similar to the input sample, and then copy and paste that information into the missing pixels of the input sample.
The methods in [2], [13] complete the hole left behind by the removed object by extending the surrounding contents into the missing region. However, they produce inconsistent content for images with complex semantic structures and diversified texture, e.g., human faces. Park et al. [5] remove eyeglasses from facial images by introducing a regularized factor to adjust the patch priority function used in computing the filling order. Their work only performs well for removing small objects such as eyeglasses and fails to generate plausible content when removing large objects from facial images. On the other hand, learning-based image editing methods outperform those traditional methods both qualitatively and quantitatively. There has been a considerable amount of learning-based work on image editing, mainly describing image inpainting with object removal as the main application. Li et al. (GFCM) [22] and Iizuka et al. (GLCM) [6] train their models to remove an object and reconstruct the damaged part using a GAN setup. To make the generated part locally and globally consistent with the rest of the image, GLCM uses two discriminators (a global and a local discriminator) combined with post-processing. Although GLCM completes randomly damaged regions in facial images, it is limited to relatively low resolutions (178 × 218) and produces artifacts when the damaged part is at the margins of an image. The output of GFCM also suffers when the removed object is large. Dong et al. [23] synthesize high-quality results for filling voids in radar data. They use a shadow-constrained conditional GAN to restore the damaged region. However, this work is limited to radar data restoration.
For object removal, Contextual Attention (GCA) [9] and MRGAN [8] use a two-stage network. The first-stage network produces a coarse result while the second-stage network refines the output of the first stage. GCA introduces a contextual attention layer to explicitly attend to related feature patches. MRGAN removes microphone objects from facial images. It generates a coarse output for the damaged region only in the first stage and refines it in the second stage. Both GCA and MRGAN generate plausible results when removing small objects but produce unnatural content for large, complex missing regions. EdgeConnect [11] and SPG-Net [12] also use a two-stage adversarial approach. Instead of generating a coarse output, they generate an edge map or a segmentation map in the first stage. In the second stage, they generate the missing region using the guidance map along with the input image. These schemes do not work for our problem because the first stage is usually unable to generate a reasonable map when the missing region is large. Moreover, all these deep learning based image editing works [6], [8], [9], [11], [22] assume that users provide the object map at the inference stage. The method in [16] automatically detects the object region and removes it in general scene-level images. However, its output depends heavily on automatic object detection, which often fails to detect the object region because of the large variations in appearance and structure of both mask and face. Moreover, it fills the removed object region by propagating information from the surrounding regions. For these reasons, it is difficult for the method in [16] to automatically remove large objects from facial images. We review object detection in more detail at the end of this section.
EdgeConnect [11] is the closest method to our work in the sense that it generates guidance information in the first stage and edits the image in the second stage. Different from EdgeConnect, we generate a binary segmentation map of the non-face object, while EdgeConnect generates an edge map of the complete image. Moreover, it uses a GAN setup with one discriminator in both stages, while we use a simple encoder-decoder architecture in the first stage for generating the binary segmentation map and employ two discriminators in the second stage as in GLCM [6] and GCA [9]. In contrast, GLCM and GCA train both discriminators jointly at the same time along with the generator to learn global consistency and the deep missing region, while we gradually add them to the model.
R-CNN [24] is a pioneering object detection model that uses a deep convolutional neural network (CNN) for object detection. It first extracts thousands of regions from an image using the selective search algorithm. These regions are then fed into a CNN that produces a feature vector for each proposed region. Finally, an SVM classifies the presence of the object within each candidate region proposal from the extracted features. Fast R-CNN [25] and Faster R-CNN [26] improve the performance of R-CNN by modifying its network architecture. Although these methods produce state-of-the-art results, they require a huge number of training samples and a great deal of computation power. Hence, instead of using these expensive algorithms to automatically detect the non-face object in facial images, we employ a simple segmentation network focused on the mask object.
The fully convolutional network (FCN) [27] is one of the pioneering end-to-end trained networks for image segmentation and uses a CNN-based auto-encoder setup. The FCN encoder is a modified version of a popular classification network in which the fully connected layers are replaced with 1 × 1 convolutions. This produces good results, although fuzzy object boundaries often occur. U-Net, proposed by Ronneberger et al. [28], is one of the most popular end-to-end fully convolutional networks in biomedical image segmentation. The encoder captures the context in the image using a series of convolutions with max pooling layers, while the decoder upsamples the encoded information using transposed convolutions. Moreover, feature maps from the encoder are concatenated to the feature maps of the decoder. This helps the network learn contextual information (relationships between the pixels of the image) better. Because of the simplicity and good performance of the U-Net architecture, we use it with slight changes to detect the non-face object in the image and generate the corresponding binary segmentation map of the object.

III. APPROACH
The overall structure of our framework is illustrated in Figure 2. It consists of two main modules, map module and editing module. Details of each module are explained in the following.

A. MAP MODULE
The output of the map module is a binary segmentation map, $I_{mask\_map}$, with 1 indicating the mask object and 0 indicating the remaining pixels in the image. The map generator, $G_{mask}$, consists of a CNN-based encoder and decoder architecture, which is a modified version of the U-Net [28]. The encoder part of the generator consists of five blocks of convolution layers, shown in Figure 2. Here, each block is a convolution layer followed by a leaky ReLU activation function and an instance normalization layer, except for the first layer of the encoder. The decoder architecture mirrors the encoder architecture except that each convolution is replaced by a deconvolution layer. The last layer of the decoder uses a tanh activation function without a normalization layer. We also combine local information with global information by concatenating the result of each deconvolution layer with the feature maps from the encoder at the same level. These connections, usually referred to as skip connections, are shown in Figure 2. The map generator takes an input image $I_{input}$, which the encoder downsamples to the bottleneck layer. The decoder then upsamples the bottleneck features to predict a binary map. We use a cross-entropy loss between the predicted binary map and the corresponding target map. To get a clean mask, we apply a post-processing step that uses the simple morphological image processing operations of erosion and dilation.
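The post-processing step mentioned above can be illustrated with a minimal, pure-Python sketch. The order of operations (erosion followed by dilation, i.e., a morphological opening), the square 3 × 3 structuring element and the function names are assumptions made for illustration; the text only states that erosion and dilation are used.

```python
def _window(mask, y, x, k):
    """Values in the (2k+1)x(2k+1) window around (y, x); pixels outside the image count as 0."""
    h, w = len(mask), len(mask[0])
    return [
        mask[yy][xx] if 0 <= yy < h and 0 <= xx < w else 0
        for yy in range(y - k, y + k + 1)
        for xx in range(x - k, x + k + 1)
    ]


def erode(mask, k=1):
    """Binary erosion: a pixel stays 1 only if its whole window is 1."""
    return [[int(all(_window(mask, y, x, k))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def dilate(mask, k=1):
    """Binary dilation: a pixel becomes 1 if any pixel in its window is 1."""
    return [[int(any(_window(mask, y, x, k))) for x in range(len(mask[0]))]
            for y in range(len(mask))]


def clean_mask(mask, k=1):
    """Assumed post-processing: erosion followed by dilation (opening) removes small specks."""
    return dilate(erode(mask, k), k)


# Hypothetical predicted map: a 3x3 mask blob plus one isolated false-positive pixel.
noisy = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
cleaned = clean_mask(noisy)
```

In this toy example the isolated pixel in the last row is removed while the 3 × 3 blob is preserved. In practice a library routine would normally be used instead of explicit loops.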

B. EDITING MODULE
The goal of this module is to remove the mask and complete the region left behind in a way that is consistent with the ground truth image in both structure and appearance. Given the input image $I_{input}$, guided by the object map $I_{mask\_map}$, our aim is to generate a complete image without the mask. The main blocks of this module are the editing generator, the discriminators and the perceptual network.

1) EDITING GENERATOR
The editing generator, $G_{edit}$, has the same architecture as the map generator $G_{mask}$. Different from $G_{mask}$, we use a squeeze-and-excitation (SE) block [29] at the output of the first three blocks of the encoder. Moreover, between the encoder and the decoder, we employ four layers of atrous convolution (rates: 2, 4, 8, 16) [30], which capture large fields of view and thus help make the generated missing part coherent with the rest of the face image. The generator takes the input image, $I_{input}$, concatenated with the output of the map module, $I_{mask\_map}$, and produces a generated image, $I_{edit}$.
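To give a sense of why the atrous layers capture a large field of view, the following sketch computes the receptive field of a stack of dilated convolutions. The 3 × 3 kernel size and stride of 1 are assumptions (the text gives only the dilation rates), and the function name is hypothetical.

```python
def stacked_dilated_receptive_field(rates, kernel_size=3):
    """Receptive field (in pixels, along one axis) of stride-1 dilated convolutions applied in sequence.

    Each layer with dilation rate r widens the receptive field by (kernel_size - 1) * r.
    """
    receptive_field = 1
    for rate in rates:
        receptive_field += (kernel_size - 1) * rate
    return receptive_field


# Rates from the text; the kernel size of 3 is an assumption.
atrous_rf = stacked_dilated_receptive_field([2, 4, 8, 16])

# Four ordinary (rate 1) convolutions with the same kernel size, for comparison.
plain_rf = stacked_dilated_receptive_field([1, 1, 1, 1])
```

Under these assumptions the four atrous layers see a 61-pixel-wide context at the bottleneck resolution, versus 9 pixels for four ordinary 3 × 3 convolutions with the same number of parameters.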
To force the editing generator to produce realistic missing content, we use a reconstruction loss that combines the $l_1$ loss and the structural similarity (SSIM) loss [31], expressed as:
$$L_{rc} = L_{l_1} + L_{SSIM} \tag{2}$$
$L_{l_1}$ is the pixel difference between the generated image $I_{edit}$ and the ground truth $I_{gt}$:
$$L_{l_1} = \lVert I_{edit} - I_{gt} \rVert_1$$
SSIM measures the structural similarity between $I_{edit}$ and $I_{gt}$, and its corresponding loss function is written as:
$$L_{SSIM} = 1 - \mathrm{SSIM}(I_{edit}, I_{gt})$$
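As an illustration of a reconstruction loss built from an $l_1$ term and an SSIM term, the sketch below works on flattened grayscale images with values in [0, 1]. It makes several simplifying assumptions that are not taken from the text: SSIM is computed once over the whole image rather than over local windows, the SSIM loss is taken as 1 − SSIM, the $l_1$ term is averaged over pixels, and the two terms are summed with equal weight.

```python
def l1_loss(pred, target):
    """Mean absolute pixel difference (averaging instead of summing is an assumption)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def global_ssim(pred, target, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification of windowed SSIM)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    # Standard SSIM stabilising constants.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )


def reconstruction_loss(pred, target):
    """Assumed form: l1 loss plus (1 - SSIM)."""
    return l1_loss(pred, target) + (1.0 - global_ssim(pred, target))


# Hypothetical 2x2 images, flattened row by row.
ground_truth = [0.0, 0.5, 1.0, 0.5]
inverted = [1.0, 0.5, 0.0, 0.5]

perfect = reconstruction_loss(ground_truth, ground_truth)
poor = reconstruction_loss(inverted, ground_truth)
```

A perfect reconstruction gives a loss of (numerically) zero, while the contrast-inverted image is penalized by both terms.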

2) DISCRIMINATORS
We use two discriminators, $D_{whole\_region}$ and $D_{mask\_region}$, as shown in Figure 2. Both have the same architecture as the discriminator in pix2pix [32] and penalize dissimilar structure at the patch scale of 70 × 70. The role of both discriminators is to force the editing generator to produce visually plausible and semantically consistent images. Instead of training both discriminators at the same time along with the editing generator, we train the editing generator along with $D_{whole\_region}$ for the first 2/5 of the total training iterations. This helps enforce the output produced by the generator to be structurally consistent with the original input face image by minimizing an objective function in which O and S denote the real and synthesized image sets, respectively. However, this loss is not capable of generating plausible content at the deep pixels of the missing region. We therefore optimize $D_{mask\_region}$ to produce good semantics in the missing region only. We add $D_{mask\_region}$ along with $D_{whole\_region}$ to the editing generator and train them jointly for the rest of the training iterations. $D_{mask\_region}$ is trained by minimizing a corresponding objective function that involves $I_{mask\_region} = I_{input} ⊗ (1 − I_{mask\_map}) + (I_{edit} ⊗ I_{mask\_map})$, where ⊗ denotes element-wise multiplication.
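The composition of the image seen by the mask-region discriminator, input pixels outside the mask and generated pixels inside it, can be sketched as follows. The nested-list image representation and the function name are illustrative assumptions.

```python
def compose_mask_region(i_input, i_edit, i_mask_map):
    """Element-wise I_input * (1 - M) + I_edit * M for single-channel images.

    i_mask_map holds 1 where the mask object was detected and 0 elsewhere.
    """
    return [
        [inp * (1 - m) + edit * m for inp, edit, m in zip(row_in, row_edit, row_mask)]
        for row_in, row_edit, row_mask in zip(i_input, i_edit, i_mask_map)
    ]


# Hypothetical 2x3 example: the right two columns are covered by the mask.
i_input = [[10, 20, 30],
           [40, 50, 60]]
i_edit = [[11, 21, 31],
          [41, 51, 61]]
i_mask_map = [[0, 1, 1],
              [0, 1, 1]]

i_mask_region = compose_mask_region(i_input, i_edit, i_mask_map)
```

Only the pixels under the binary map come from the generator; everything else is copied from the input, so this discriminator's feedback concentrates on the synthesized region.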
To train our model in a GAN setup, the generator is trained to fool the discriminators by minimizing two adversarial loss functions, $L_{adv}^{whole\_region}$ and $L_{adv}^{mask\_region}$, one for each discriminator.
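For reference, a common way to write such discriminator and generator objectives is the standard GAN formulation below. This is an assumption given only as a sketch: the exact form used in this work is not recoverable from the text, and $D$ stands for either $D_{whole\_region}$ (applied to $I_{edit}$) or $D_{mask\_region}$ (applied to $I_{mask\_region}$ in place of $I_{edit}$).

```latex
\begin{align}
L_{D} &= -\,\mathbb{E}_{I_{gt} \in O}\big[\log D(I_{gt})\big]
         \;-\; \mathbb{E}_{I_{edit} \in S}\big[\log\big(1 - D(I_{edit})\big)\big], \\
L_{adv} &= -\,\mathbb{E}_{I_{edit} \in S}\big[\log D(I_{edit})\big].
\end{align}
```

Here $O$ and $S$ are the real and synthesized image sets; the discriminator minimizes $L_{D}$ while the generator minimizes $L_{adv}$.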

3) PERCEPTUAL NETWORK
The third block of the editing module is a perceptual network. It is a pre-trained VGG-19 network [33] with fixed weights. The purpose of this network is to encourage the generator output, $I_{edit}$, to have a feature representation similar to that of the ground truth, $I_{gt}$.
We use a perceptual loss $L_{perc}$ [34] to penalize outputs that are not perceptually reasonable by defining a feature-level distance measure between the intermediate feature maps of $I_{edit}$ and $I_{gt}$ computed by a pre-trained network (VGG-19 [33]). With $\phi_i$ denoting the activation map of the $i$-th layer of the network $\phi$, the perceptual loss compares $\phi_i(I_{edit})$ with $\phi_i(I_{gt})$ over the selected layers. We use the intermediate convolution layer feature maps (conv_3, conv_4 and conv_5) of the VGG-19 network (pre-trained on ImageNet [20]) to obtain rich structural information, which helps in recovering a plausible structure of the face semantics. The joint loss function $L_{comp}$ used to train the editing module combines the losses above, with the weight parameters set to $\lambda_{rc} = 100$, $\lambda_{D_{whole\_region}} = 0.3$, $\lambda_{D_{mask\_region}} = 0.7$, $\lambda_{adv}^{whole\_region} = 0.3$ and $\lambda_{adv}^{mask\_region} = 0.7$. $L_{comp}$ helps in generating natural looking, structurally consistent and perceptually plausible output.
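The way the weights enter the generator side of the joint loss might look like the sketch below. Only the weight values (100, 0.3, 0.7) come from the text. The way the terms are combined, the unit weight on the perceptual term and the switch for the second discriminator are assumptions, and the function name is hypothetical.

```python
# Weight values quoted in the text.
LAMBDA_RC = 100.0
LAMBDA_ADV_WHOLE_REGION = 0.3
LAMBDA_ADV_MASK_REGION = 0.7


def generator_joint_loss(l_rc, l_perc, l_adv_whole, l_adv_mask, use_mask_discriminator=True):
    """Assumed weighted sum of the generator-side losses.

    The perceptual term is given weight 1 here because no weight for it is stated.
    """
    total = LAMBDA_RC * l_rc + l_perc + LAMBDA_ADV_WHOLE_REGION * l_adv_whole
    if use_mask_discriminator:
        # Active only once D_mask_region has been added to training.
        total += LAMBDA_ADV_MASK_REGION * l_adv_mask
    return total


# Hypothetical loss values, just to show the relative scale of each term.
early_phase = generator_joint_loss(1.0, 1.0, 1.0, 1.0, use_mask_discriminator=False)
joint_phase = generator_joint_loss(1.0, 1.0, 1.0, 1.0)
```

With these weights the reconstruction term dominates the objective, and between the two adversarial terms the mask-region one carries more than twice the weight of the whole-region one.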

IV. EXPERIMENTS
In this section, we present synthetic dataset creation, training details of our model and comparison of our method visually and quantitatively with other state-of-the-art image editing approaches. Moreover, in the last part of this section we provide ablation studies of our model.

A. SYNTHETIC DATASET GENERATION
There is no publicly available dataset that contains facial image pairs with and without the mask object to train our model in a supervised manner. We construct a synthetic dataset of 10,000 images using the publicly available CelebFaces Attributes Dataset (CelebA) [19]. CelebA is a large-scale face attributes dataset with more than 200K celebrity images. We have used 50 kinds of masks of different sizes, shapes, colors and structures in our synthetic dataset. Some examples of the facial masks in our dataset are shown in Figure 4. To create synthetic samples, we first align the faces in all images using eye coordinates obtained with dlib [35]. We then randomly place a mask on each face using Adobe Photoshop CC 2018. We also generate the corresponding binary map for the mask. Figure 3 shows a couple of examples from our synthetic dataset.
For fair comparison, we have trained the current state-of-the-art approaches of Iizuka et al. [6], Yu et al. [9], EdgeConnect [11] and MRGAN [8] using our synthetic dataset. We also provide the object binary map generated by our map module along with the input image at both the training and inference stages because all these methods assume that the object binary map is given.

B. TRAINING DETAILS
For training the map module, we feed the input image $I_{input}$ into the network and generate a binary map $I_{mask\_map}$ that is close to the target binary map $I_{tm}$. The generated binary map $I_{mask\_map}$, along with the input image $I_{input}$, is then fed into the editing module. We have implemented our model in TensorFlow [36]. Both stages of the model are trained alternately. We have used 10,000 training samples of size 256 × 256 from our synthetic dataset to train our model with a batch size of 10 and the Adam optimizer. We have trained the model for 500,000 iterations.
In the second stage, instead of training the whole network at the same time as done in GLCM [6] and GCA [9], we first train $G_{edit}$ along with $D_{whole\_region}$ for almost half of the training iterations to generate a reasonable global structure of the face. This helps in recovering the actual boundaries of the face region. However, it struggles to generate the deep missing region. Once a reasonable global structure of the face is formed, we add $D_{mask\_region}$ for the rest of the training iterations to focus more on the missing part and generate the deep missing semantics more plausibly.
The training details of our model are as follows. To overcome the problem of the editing generator $G_{edit}$ being too weak at the start compared with the discriminator, we first train $G_{edit}$ only (no discriminator) for 50,000 iterations and then $G_{edit}$ along with $D_{whole\_region}$ for another 200,000 iterations. For the rest of the training iterations, we train the whole network jointly, giving more weight to $D_{mask\_region}$. This training scheme helps both discriminators provide fair feedback to the editing generator. The whole training procedure takes about 100 hours on an NVIDIA GeForce RTX 2080 Ti GPU.
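The three-phase schedule described above can be summarized in a short sketch. The iteration counts are those given in the text; the function and component names are hypothetical.

```python
TOTAL_ITERATIONS = 500_000
GENERATOR_ONLY_ITERATIONS = 50_000   # G_edit alone, no discriminator
WHOLE_REGION_ITERATIONS = 200_000    # G_edit together with D_whole_region


def active_components(iteration):
    """Return the networks that are updated at a given (0-based) training iteration."""
    if not 0 <= iteration < TOTAL_ITERATIONS:
        raise ValueError("iteration outside the training run")
    if iteration < GENERATOR_ONLY_ITERATIONS:
        return ("G_edit",)
    if iteration < GENERATOR_ONLY_ITERATIONS + WHOLE_REGION_ITERATIONS:
        return ("G_edit", "D_whole_region")
    return ("G_edit", "D_whole_region", "D_mask_region")


# With 10,000 samples and a batch size of 10, one pass over the data takes 1,000 iterations.
iterations_per_pass = 10_000 // 10
passes_over_dataset = TOTAL_ITERATIONS // iterations_per_pass
```

Under this schedule $D_{mask\_region}$ joins at iteration 250,000, the midpoint of training, and the 500,000 iterations correspond to 500 passes over the 10,000 training samples.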

C. COMPARISON AND DISCUSSION
In this section, we analyze the results generated by our model and compare them with other state-of-the-art image editing methods, namely Iizuka et al. [6], Yu et al. [9], EdgeConnect [11] and MRGAN [8], both quantitatively and qualitatively on real-world test images. Figure 5 shows samples generated by our model for real test images. Our test samples contain a lot of diversity in terms of background (blank and wild backgrounds) and the size, shape, color and structure of the masks. In each test image, a mask covers almost half of the key facial semantics. As can be seen in Figure 5, our model successfully removes the mask object and generates natural looking outputs with structural consistency. Figure 6 compares our model with the other state-of-the-art approaches. The results show that our approach successfully removes the mask object and completes the face so that it looks not only structurally consistent but also natural. Iizuka et al. [6], Yu et al. [9], EdgeConnect [11] and MRGAN [8] are unable to achieve the task correctly. Iizuka et al. produce plausible new content in scene images but fail to produce plausible results for face images that have missing regions with large structural and appearance variations. The essence of the GCA technique is the contextual attention layer, which learns to generate missing patches by copying feature information from known background patches. This strategy works well for images where there is a high probability of finding the same patterns in neighbouring patches but fails to handle large missing regions in facial images (e.g., it uses the nose to fill in holes at mouth locations), as shown in the third column of Figure 6. EdgeConnect generates better results than GLCM and GCA. However, EdgeConnect's final output depends on the edge map generated by the first stage. For this problem, where the missing region is large, the edge generator cannot generate reasonable edges and hence the final outcome of the EdgeConnect network suffers.
MRGAN generates only the missing region and keeps the rest of the image as it is. That is why it generates the missing semantics well but also produces some unnatural semantics for the missing region that lies outside the actual boundaries of the face. In summary, our proposed model overcomes the limitations of the other state-of-the-art methods by producing realistic and consistent results regardless of the size, shape, structure and color of facial mask.

2) QUANTITATIVE COMPARISON
We compare the results of our method and the other methods using the following quantitative metrics: 1) Structural SIMilarity (SSIM) [31]; 2) Peak Signal to Noise Ratio (PSNR); 3) Frechet Inception Distance (FID) [37]; 4) Naturalness Image Quality Evaluator (NIQE) [38]; and 5) Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) [39]. Table 2 shows the quantitative comparison with Iizuka et al. [6], Yu et al. [9], EdgeConnect [11] and MRGAN [8]. The results show that our model also achieves the best quantitative values. In addition, as can be seen in Table 2, the SSIM value for EdgeConnect and our model is the same. The reason is that the objective of image editing techniques is to synthesize realistic-looking content rather than exactly the same content as the original image. Therefore, we argue that, as reported by many other works [7], [40], quantitative analysis may not be the most effective measure for the image editing task.
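Of the metrics listed, PSNR is the simplest to state explicitly: it is $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$, where MAX is the largest possible pixel value. A minimal sketch for flattened 8-bit images follows; the function name and the toy pixel values are illustrative.

```python
import math


def psnr(pred, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized flattened images."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical 4-pixel images: every pixel is off by exactly one grey level.
reference = [100, 120, 140, 160]
off_by_one = [101, 119, 141, 159]

psnr_identical = psnr(reference, reference)
psnr_off_by_one = psnr(off_by_one, reference)
```

Because PSNR (like SSIM) rewards pixel-level agreement with the ground truth, a plausible but different completion of the mouth region can score lower than a blurrier one, which is consistent with the caution expressed above about relying on these metrics alone.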

3) USER STUDY
We conduct a pilot user study to evaluate our results using the perceptual assessment of people. We asked a total of 20 questions, and for every test sample participants are shown the input image along with five randomized options (results produced by Iizuka et al. [6], Yu et al. [9], EdgeConnect [11], MRGAN [8], and our model). We asked 100 participants to choose the one option out of the five that most effectively removes the mask and completes the image while retaining the natural look and structure of the face. Our results got 76 votes, MRGAN [8] and EdgeConnect [11] earned 10 and 7 votes, respectively, while Iizuka et al. [6] and Yu et al. [9] received 5 and 2 votes.

4) ADDITIONAL RESULTS
We have retrained our model and the other representative state-of-the-art models for inpainting irregular and random rectangular missing holes. For the other state-of-the-art models, we provide the object binary map generated by our map module at both the training and inference stages. As can be seen in Figure 7, our model completes the face images while retaining the naturalness and structure of the faces at least as well as the other state-of-the-art approaches. Moreover, Table 3 gives the quantitative comparison for both irregular and rectangular missing regions. The results show that our model achieves better quantitative performance in most cases when completing diverse missing regions in facial images.

5) LIMITATIONS
Although our model can handle the removal of mask objects of various shapes, sizes, colors and structures, there are some examples, as can be seen in Figure 8, where our model fails to completely remove the mask object. Common failure cases occur when the map module is unable to produce a reasonable segmentation map of the mask object. This happens when mask objects are very different from those in our synthetic dataset in terms of both shape and structure. As can be seen in the first couple of rows of Figure 8, the shape, color and structure of the mask objects are totally different from the mask types used in our synthetic dataset, so the map module fails to detect them properly. In the third row, the network is unable to detect the whole mask region because of the complex mixture of colors. The network fails to detect the part of the mask whose color is similar to the face texture because it is considered part of the face.

D. ABLATION STUDY
1) ROLE OF USING TWO DISCRIMINATORS
We investigate the effectiveness of using one discriminator at a time and of using both ($D_{whole\_region}$ and $D_{mask\_region}$) by gradually adding them to the model. The first column of Figure 9 shows the result of using only $D_{mask\_region}$, with the rest of the setting the same as in our model. As this setting focuses only on generating the affected region and keeps the rest of the image the same as the original input image, it produces good semantics in the missing area, e.g., the mouth and the lower part of the nose. However, as can be seen in Figure 9 (b), it mixes the chin with the neck (especially the neck area covered by the object) by treating it as part of the chin. This setting generates the worst results when the mask color is similar to the neck color. To overcome this problem, we next drop $D_{mask\_region}$ and use only $D_{whole\_region}$ along with the editing generator, with the rest of the setting the same as in our original model. The second column of Figure 9 shows the results produced by this setting. We can see that it produces a more consistent face structure and does not mix the chin with the neck, but this setting is incapable of synthesizing plausible content in the deep region of the missing part: it produces teeth that look neither symmetric nor natural. To generate plausible content in the missing region that is consistent with the rest of the face, we use both discriminators along with the editing generator, adding each discriminator gradually to the network as stated earlier in the training part of the experiments section. The last column of Figure 9 shows that our training strategy with two discriminators not only generates plausible content in the large missing region but also recovers the correct semantic structure.

2) EFFECT OF MAP MODULE
We have dropped the map module and used only the editing network to validate the effect of $I_{mask\_map}$ on object removal and image editing. For this, we feed only $I_{input}$ (the image with the mask) into the editing generator while the rest of the model is kept the same as our baseline network. Figure 10 shows that the image editing network without mask segmentation produces an irregular face structure. For example, in the first row of Figure 10 (b), the texture of the generated region is different from the rest of the face, while in the second and third rows, the lips are mixed with each other and the boundaries of the chin look very unnatural. On the other hand, we can see in Figure 10 (c) that using the segmentation map helps not only to recover the correct texture of the damaged region but also to generate sharp boundaries for the part of the chin covered by the mask. This shows that mask segmentation provides enough information about where the object pixels are and makes the task easier for the image editing network. Hence, using mask segmentation along with the input image in the editing network results in more accurate object removal and more realistic face image editing.

V. CONCLUSION
In this work, we have proposed a novel method for interaction-free removal of large objects from facial images, focusing on the mask object. For image completion, we have employed GAN-based image inpainting through an image-to-image translation approach to produce plausible results. We have shown that the proposed training scheme of two discriminators for gradually learning global coherency and the deep missing region is effective in producing realistic and structurally consistent outputs. Both qualitative and quantitative comparisons show that our model is capable of producing results of high perceptual quality for large missing holes in facial images compared with other state-of-the-art image editing methods.