E2F-GAN: Eyes-to-Face Inpainting via Edge-Aware Coarse-to-Fine GANs

Face inpainting is a challenging task that aims to fill damaged or masked regions of face images with plausibly synthesized content. Based on the given information, the reconstructed regions should look realistic and, more importantly, preserve the demographic and biometric properties of the individual. The aim of this paper is to reconstruct the face based on the periocular region (eyes-to-face). To do this, we propose a novel GAN-based deep learning model called Eyes-to-Face GAN (E2F-GAN), which includes two main modules: a coarse module and a refinement module. The coarse module, together with an edge predictor module, attempts to extract all required features from the periocular region and to generate a coarse output, which is then refined by the refinement module. Additionally, an eyes-to-face synthesis dataset has been generated from the public CelebA-HQ face dataset for training and testing. We perform both qualitative and quantitative evaluations on the generated dataset. Experimental results demonstrate that our method outperforms previous learning-based face inpainting methods and generates realistic and semantically plausible images. We also provide the implementation of the proposed approach to support reproducible research at https://github.com/amiretefaghi/E2F-GAN.


I. INTRODUCTION
Image inpainting is used to complete missing information or to substitute undesired regions of images with plausible and fine-grained content. It covers a wide range of applications such as restoring damaged photos, editing pictures, and removing objects [1] [2]. Many conventional methods use low-level, hand-crafted features from the corrupted input image together with priors or additional data, either by propagating the extracted features from visible, well-structured parts to the missing regions, or by filling small missing areas by searching for and blending similar patches from the same or other images. Although these strategies perform well at completing repetitive structures, they are limited to the regions available in an image and cannot create novel image content. In recent years, learning-based strategies have been proposed to overcome these limitations by utilizing large volumes of training data [3] [4]. Notably, despite the great achievements of learning-based methods in this task, they are limited by at least three challenges (C1-C3). In single-stage approaches, the required information is first extracted from the visible regions; the missing region, then, can be constructed from the extracted information. Alternatively, a two-stage architecture generates an intermediate coarse image after recovering structures in the first stage, and then feeds it to the second stage for improving the texture. Additionally, another category called structural guidance-based methods uses an assistance algorithm to provide more information for the main inpainting method. An edge and a contour generator have been used within a two-stage architecture in [15] and [16], respectively.
It is worth mentioning that in face inpainting, besides the above-mentioned challenges (i.e., C1-C3), we face further requirements. Notably, a facial representation can serve for biometric recognition due to the special topology of the different facial elements (i.e., forehead, eyes, eyebrows, nose, mouth, jaw, chin, cheek) and their distinctive characteristics [42]. Thus, revealing the hidden parts of a face using the other elements, such that the topological face structure and the consistency of face attributes (e.g., demographic and other biometric information [43]) are preserved, is a challenging task, yet it has a strong impact on the feasibility of biometric recognition conducted by human experts (e.g., in forensic investigation [44]), by machine learning [45], or by hand-crafted algorithms [46]. Therefore, the requirements of face inpainting are as follows: R1) the face topological structure should be reconstructed so that all elements are placed in the right position, semantically and continuously. For this, first, the shape of the face (oval, square, round, etc.) should be predicted. Then, all other elements should be placed proportionally within the predicted frame. Additionally, to look more realistic, the head pose should be naturally aligned and integrated with the other elements. These requirements are the main challenges (i.e., C1-C3) of every inpainting method, adapted for face inpainting solutions. Since the aim of this paper is a special case of face inpainting where a large region of the face except the eyes is hidden, besides R1, two other requirements that make the inpainting task more challenging should be considered. R2) Researchers have found that the area of skin around the eyes is useful for determining soft biometric information such as age or gender [17,37].
The proposed inpainting model should utilize the color, texture, and size of the eyes and eyebrows to estimate these demographic attributes and inpaint the other face elements according to the estimated features. R3) The proposed solution should preserve the identity-related biometric properties present in the eye regions [38,18] when generating the full face [39]. Notably, this eye region has been demonstrated to encode a large part of the identity information present in the face [44], enabling both person recognition and fake face detection [40].
Additionally, it is worth mentioning that the hidden portion of the image directly affects the performance of proposed solutions, and clearly large masks make meeting the stated requirements (i.e., R1-R3) more difficult. Since the aim of this paper is to complete the face based on the eye (periocular) region, our mask type covers most parts of the face. In this paper, a novel DL-based architecture is proposed that complies with the stated requirements (i.e., R1-R3, see Fig. 1). Our contributions and novelties can be summarized as follows:
• An effective end-to-end solution for reconstructing the face based on just the eye region is proposed. This innovative GAN-based architecture, called E2F-GAN, combines the advantages of coarse-to-fine, coarse-and-fine, and structural guidance-based architectures. The code for our proposed method is available on GitHub.
• By using various loss functions during the training process [41], not only the quality of the inpainted regions but also demographic and biometric features have been preserved, as measured by several quantitative and qualitative evaluation metrics.
• A new dataset of masked faces called E2Fdb has been generated and made publicly available (same GitHub repository indicated above).
• In terms of selecting the most informative guidance-based method, we experimentally show that edges provide more structural and contextual information than landmarks.

II. RELATED WORKS
In eyes-to-face inpainting, a face (a raw image I of size H×W×N hereafter) is corrupted by a binary image mask M (also of size H×W×N), where H, W, and N denote the height, width, and number of channels of the image, respectively. The corrupted image is denoted by I_M (I_M = I ⊙ M, where ⊙ is the element-wise product). The inpainting model takes I_M and M as input, and its output, the reconstructed face Î, should fulfill R1, R2, and R3 (Î ≅ I). The proposed inpainting methods use different architectures and various types of masks. In this section, we review recent face inpainting methods based on DL architectures and widely used mask types.

Figure 1. Masked image, our result, original.
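As a toy illustration of the masking operation I_M = I ⊙ M described above, the sketch below applies a binary mask to a single-channel image represented as nested Python lists (a stand-in for the H×W×N tensors in the paper; the function name is ours):

```python
def apply_mask(image, mask):
    # Element-wise product I_M = I * M: a mask value of 1 keeps a pixel,
    # while 0 zeroes it out, marking the region to be inpainted.
    # `image` and `mask` are H x W nested lists (one channel), a toy
    # stand-in for the H x W x N tensors used in the paper.
    return [[p * m for p, m in zip(img_row, msk_row)]
            for img_row, msk_row in zip(image, mask)]

# Example: keep only the diagonal pixels of a 2x2 "image".
masked = apply_mask([[10, 20], [30, 40]], [[1, 0], [0, 1]])
```

In the eyes-to-face setting, the mask keeps only the periocular pixels and zeroes out roughly 75% of the face.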

A. FACE INPAINTING METHODS
Apart from traditional methods which utilize low-level features extracted from the same image or a group of images, the learning-based strategy is the main focus of recent proposed methods due to using high-level features that enable them to inpaint the damaged regions semantically. In the following, we review several learning-based existing works that attempted to inpaint corrupted faces, similar to the aim of this paper.
The coarse-to-fine structure has been used in recent face inpainting work. Li et al. [19] proposed a generative coarse-to-fine structure that benefits from an attention layer capturing long-range dependencies between features to generate more realistic images. Yu et al. [13] use a coarse-to-fine structure to inpaint free-form masks. In the same context, Liu et al. [12] proposed a coarse-to-fine architecture with a novel attention layer. Chen et al. [19] proposed a coarse-and-fine structure including a coarse network for extracting global semantic information and a fine network for extracting multi-level local features. Besides the coarse-to-fine strategies, another category, so-called structural guidance, uses additional information to assist the main inpainting module. Nazeri et al. [15] leverage an edge generator to first recover the edges; the corrupted image is then fed to the image inpainting network along with the predicted edges. Chen and Liu [16] use a dual-branch network including texture and edge branches to extract features and recover the structures and textures of missing regions. Some works estimate facial landmarks to assist the main inpainting network [20] [21]. In this paper, we combine the advantages of these different architectures, i.e., coarse-to-fine, coarse-and-fine, and structural guidance.
The above-mentioned methods produce a unique result for each input. In contrast, some approaches inpaint the corrupted regions differently at each execution for the same input. Zheng et al. [22] proposed a dual pipeline based on Variational Auto-Encoders (VAEs), including a reconstructive path that uses the ground truth to learn the prior distribution of the missing regions and a generative path whose conditional prior is tied to the distribution obtained in the reconstructive path. Zhao et al. [23] proposed an unsupervised conditional framework based on generative adversarial networks for diverse image inpainting that can learn a conditional completion distribution. A similar approach using GANs to restore low-quality face images was recently proposed in [47]. It should be noted that, in E2F-GAN, we need a unique output for each input, even after several executions, to fulfill requirements R2 and R3.

B. MASK COVERAGE
The masks used in face inpainting scenarios can be classified into two categories: free-form and fixed-form masks. Widely used free-form masks [8,10,15,21,22,24,26] are irregular shapes randomly placed on the images (Fig. 2a). In contrast, fixed-form masks [13,21,24,25] are regular shapes that cover part of the image and are placed either randomly or purposefully (Fig. 2b) [24] [25]. Since the aim of this paper is to complete the face based on the eyes, our mask type is in the latter category, with a large mask (≈75% of the face).

III. PROPOSED METHOD
The overall network architecture of our proposed method, which is based on a coarse-to-fine design and includes two main modules called coarse and refinement, is shown in Fig. 3. Different from others [2,3,7,13,19], both modules (i.e., coarse and refinement) are GAN-based networks; therefore, each includes a generator and a discriminator. The coarse module, which comprises a generator called the coarse generator (G_c), has a dual encoder that follows the coarse-and-fine structure to capture global semantic features and extract multi-level features from the eye region. Besides this module, a GAN-based refinement module, which consists of a refinement generator (G_r) and a discriminator (D_r), is utilized to improve the coarse outputs. Intuitively, the refinement network sees a more complete scene than the masked images, so it can learn better feature representations than the coarse network. Therefore, our end-to-end method includes two GAN-based modules that are trained to generate the final result. In the following subsections, each module is described in detail.
Notably, facial landmarks [21] and edges [15] are the most widely used forms of structural guidance in image inpainting tasks. In our proposed E2F-GAN, where the mask covers most parts of the face, predicting both landmarks and edges is a challenging problem. Consequently, our proposed method evaluates both landmarks and edges during our experiments, in an effort to use the most effective structure (i.e., landmarks or edges). For facial landmarks, we used the landmark prediction method proposed in [27], and for predicting edges, we used the edge predictor proposed by Nazeri et al. [15]. Both methods have been retrained on our generated dataset, which contains the specific eye masks. As shown in the experiments, our quantitative and qualitative metrics indicate that edge structural guidance provides more effective information for our coarse generator. Therefore, in our final setup we use edges generated by an edge predictor (G_e) as structural guidance for G_c.

A. Coarse Module
The proposed GAN-based coarse module is responsible for extracting the required features from the masked image and generating the first coarse result. To do this, we designed the module with three submodules: the edge predictor (G_e), the coarse generator (G_c), and the discriminator (D_c). In the following, we explain the role of each network, its architecture, and the loss functions used.

1) COARSE GENERATOR
The coarse generator has the main responsibility for meeting the three requirements (i.e., R1-R3). Not only should the biometric and demographic features be extracted from the periocular region, but the initial coarse prediction should also look realistic, and be semantically and continuously structured. This is achieved using three networks: two encoders, the so-called fine encoder (E_f) and pose encoder (E_p), and a decoder. The encoder E_f deals with the finest features of I_M, while E_p deals with the predicted structure of the face obtained from G_e. Therefore, I_M is first fed to G_e to predict the edges of the visible and hidden regions (I_edge), and then I_edge is concatenated with I_M and fed to E_p. This assists in predicting the pose of the different elements of the face. Additionally, I_M is fed to E_f with the aim of extracting identity attributes. Finally, the decoder predicts and inpaints the hidden regions based on the two feature maps received from E_f and E_p. In the following, we describe each of these networks and their roles in our scheme.
Fine Encoder. The aim of this encoder is mainly to extract demographic (e.g., age, gender) and biometric properties (e.g., identity, skin color) from I_M. Therefore, the skin color around the eyes, wrinkles, the size of the eyes and eyebrows, the distance between the two eyes, and other possible properties should be considered. On the other hand, due to the high coverage ratio of the mask, E_f is fed with a lot of unusable information (the black region). To prevent this from deteriorating the quality of the output and to filter out these pixels, the first seven blocks of E_f are configured with gated convolutions (GC) [14]. These blocks contain parallel convolution layers with different activation functions, which help extract an appropriate feature map and suppress features extracted from the masked region. Then, three interleaved gated residual blocks (IGRB) [19] are placed after the GC blocks to extract multi-level features.
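To illustrate the gating idea behind these GC blocks, the sketch below shows the element-wise form of a gated convolution output, tanh(features) · sigmoid(gate): a learned sigmoid gate can drive responses from masked pixels toward zero. The real blocks operate on 2-D convolutional feature maps; the function name here is ours.

```python
import math

def gated_activation(feature_path, gating_path):
    # Core of a gated convolution [14]: the feature branch's activation is
    # scaled by a learned sigmoid gate, so responses dominated by masked
    # (black) pixels can be suppressed. Both inputs stand for the
    # pre-activation outputs of two parallel convolutions over one window.
    gate = [1.0 / (1.0 + math.exp(-g)) for g in gating_path]
    return [math.tanh(f) * s for f, s in zip(feature_path, gate)]
```

A strongly negative gate value zeroes the output regardless of the feature value, which is how the network learns to ignore the masked region.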
Pose Encoder. To extract coarse structure and global semantic features, and consequently preserve the quality as well as the structure of the predicted face, an encoder called the pose encoder (E_p) is placed in the coarse module. It is fed the concatenation of I_edge and I_M, which gives E_p a receptive field suitable for recognizing face structures. However, the inputs I_edge and I_M are both sparse. To extract a meaningful feature map, similar to [29], we use three spatial pyramid dilation (SPD) blocks after six convolution layers. Notably, SPD blocks contain parallel convolution layers with various dilation rates to extract a large receptive field from the given input image.
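The following 1-D sketch illustrates how parallel dilated kernels enlarge the receptive field without adding parameters, which is the principle behind the SPD blocks; the actual blocks are 2-D convolutions, and the rates and function names here are illustrative only.

```python
def dilated_conv1d(signal, kernel, dilation):
    # Valid 1-D convolution where kernel taps are spaced `dilation` apart,
    # so a small kernel covers a wide span of the input.
    span = (len(kernel) - 1) * dilation
    return [sum(kernel[k] * signal[start + k * dilation]
                for k in range(len(kernel)))
            for start in range(len(signal) - span)]

def spd_block(signal, kernel, rates=(1, 2, 4, 8)):
    # Spatial-pyramid-of-dilations: apply the same kernel at several rates
    # and keep the per-rate responses (a stand-in for concatenating the
    # parallel branches channel-wise).
    return {r: dilated_conv1d(signal, kernel, r) for r in rates}
```

Larger rates let the same two-tap kernel aggregate information from pixels farther apart, mimicking a large receptive field over the sparse edge/mask input.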
Decoder. To produce the coarse output from the features extracted by E_p and E_f, a decoder with seven layers (one attention layer and six upsampling convolution layers) is used. In common encoder-decoder approaches, the decoder receives features directly from the encoder, but in our proposed method, the decoder receives two types of features: low-level features extracted with large receptive fields that may lack detailed information (i.e., the output of E_p), and high-level detailed features with small receptive fields (i.e., the output of E_f). Thus, we use a CSAB as the first layer of the decoder to discriminate the more effective features from the others by assigning them larger weights.
Channel and Spatial Attention Block (CSAB). Following the outputs of E_p and E_f, the input to the attention block contains two types of features: a) features with a large receptive field that may lack detailed information, and b) the output of E_f, i.e., high-level detailed features with small receptive fields. We adopt a concatenation operation to aggregate these two types of features. However, this may yield redundant multi-level contextual information, which is not efficient for our goals. Thereby, as shown in Fig. 3, we adopt a specific attention block called the channel and spatial attention block (CSAB) [19] to assign more weight to important features [48] and alleviate the interference of redundant features through channel and spatial attention. Since convolution captures only local contextual information, and discriminative feature representations are essential for inpainting, we leverage the attention mechanism to fulfill this need. The channel attention emphasizes interdependent feature maps by exploiting the dependencies between channels. Meanwhile, the spatial attention encodes a wide range of contextual dependencies within each channel, thereby improving the overall representation capability through the mutual reinforcement of similar features.
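The two attention mechanisms can be sketched in their simplest squeeze-and-scale form over toy channel lists; the actual CSAB [19] learns these weightings with convolutional layers, so this is only a conceptual sketch with our own function names.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(feat):
    # feat: list of C channels, each a flat list of spatial activations.
    # Squeeze each channel to its global average, map it to a (0, 1)
    # weight, and rescale the channel: informative channels are kept,
    # redundant ones are attenuated.
    weights = [_sigmoid(sum(c) / len(c)) for c in feat]
    return [[w * v for v in c] for w, c in zip(weights, feat)]

def spatial_attention(feat):
    # Average over channels at each spatial position, map it to a
    # per-position weight, and rescale every channel with that same
    # spatial map, emphasizing informative locations within each channel.
    n_ch = len(feat)
    weights = [_sigmoid(sum(c[i] for c in feat) / n_ch)
               for i in range(len(feat[0]))]
    return [[w * v for w, v in zip(weights, c)] for c in feat]
```

Applied in sequence, the two operations weight "which channels" and "which positions" matter, which is the role the CSAB plays at the head of the decoder.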

B. REFINEMENT MODULE
The coarse module's output (Î_c) contains the coarse structure of the face, including the placed face elements, the estimated face pose, and the specified skin color, but it lacks fine details. To add more details to Î_c, we propose a GAN-based refinement module.

1) REFINEMENT GENERATOR
Inspired by the U-Net architecture [28] and the refinement network proposed in [29], we propose a more effective architecture by replacing some DL blocks with SPD and self-attention (SA) blocks; it receives the concatenation of Î_c and I_edge as its input. We adopt SPD blocks with four dilation rates in the middle of the architecture to extract features with various receptive fields, and then use SA blocks between the middle layers. SA benefits from the concept of self-similarity, which is useful for recovering the reconstructed pattern based on the remaining ground truth in a masked image. As mentioned before, the duty of this stage is to improve the fine details of the images; hence, we use reconstruction and perceptual losses to adjust the fine details.

C. DISCRIMINATOR
To inpaint and generate more realistic, high-quality faces, both the coarse and refinement modules are designed based on GAN structures; thus, two discriminators are responsible for evaluating the outputs of G_c and G_r. The coarse module's discriminator (D_c) receives Î_c, and correspondingly the refinement module's discriminator (D_r) is fed Î_r. We combine the concepts of SN-GAN [30] and PatchGAN [31] in these discriminators to distinguish real from fake images. Besides this combination, we use the hinge adversarial loss function for our discriminators. These choices help us train our discriminators faster and more stably, distinguishing real from fake images efficiently.
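The hinge adversarial objective can be sketched as follows, with the score lists standing for per-patch discriminator outputs (PatchGAN-style); the function names are ours.

```python
def d_hinge_loss(real_scores, fake_scores):
    # Hinge loss for the discriminator: real patch scores are pushed
    # above +1 and fake patch scores below -1; scores already past the
    # margin contribute nothing, which stabilizes training.
    real = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real + fake

def g_hinge_loss(fake_scores):
    # The generator simply tries to raise the discriminator's score on
    # its outputs (no margin on the generator side).
    return -sum(fake_scores) / len(fake_scores)
```

When real patches already score above +1 and fakes below -1, the discriminator loss is zero, so gradients focus on the examples it still gets wrong.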

D. E2F-GAN END-TO-END TRAINING
The E2F-GAN model is trained in a supervised and end-to-end manner. We define four groups of loss functions [41] for the various parts of our proposed method. To train G_c, we utilize four loss functions: reconstruction loss, perceptual loss, style loss, and adversarial loss; only the reconstruction and perceptual losses are used for training G_r. With the aim of an end-to-end training process, we define the total loss L_total as the weighted sum of the four groups of component losses:

L_total = λ_rec L_rec + λ_perc L_perc + λ_style L_style + λ_adv L_adv.

In the following, the formulation of each loss and the notion behind it are described. The reconstruction loss (L_rec), or per-pixel loss, measures the pixel-wise difference between the synthesized image and the ground truth image. This loss is essential for maintaining texture information. It is calculated as the L1 norm between Î_x and the corresponding ground truth I_gt:

L_rec = || Î_x − I_gt ||_1,

where x is replaced with c or r depending on whether L_rec is used for the coarse or the refinement generator, respectively.
It is worth mentioning that an element-wise loss cannot capture high-level semantics. Accordingly, recent research [19,21,22] suggests using perceptual distances based on a pre-trained network, VGG19, trained on ImageNet. The perceptual loss (L_perc) measures the difference between the features extracted from various layers of the VGG19 network for Î_x and its corresponding ground truth:

L_perc = Σ_i || φ_i(Î_x) − φ_i(I_gt) ||_1,

where φ_i(Î_x) and φ_i(I_gt) are the features extracted from the i-th selected layer of the pre-trained network for Î_x and I_gt, respectively, and x is replaced with c or r depending on whether L_perc is used for the coarse or the refinement generator.
To encourage richer texture, we also employ a style loss (L_style). In the style loss, a Gram matrix captures the correlations between the channels of a feature map; the loss is then computed on the feature maps produced by the pre-trained VGG19 network:

L_style = Σ_i || G(φ_i(Î_x)) − G(φ_i(I_gt)) ||_1,

where G(·) denotes the Gram matrix of a feature map.
For generative adversarial learning, our discriminators are trained to distinguish between generated images and ground truth images; the generators, on the other hand, strive to fool the discriminators by making that classification harder. As mentioned before, we combine the loss functions with appropriate weights as follows: λ_rec = 1, λ_perc = 0.1, λ_style = 250, λ_adv,c = 0.1, and λ_adv,r = 1.
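The way the loss groups are assembled can be sketched with flattened lists standing in for images and feature maps; the λ weights follow the coarse-stage values reported above, and the helper names are ours.

```python
def l1_loss(pred, gt):
    # Per-pixel reconstruction loss: mean absolute difference.
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

def gram_matrix(feat):
    # feat: C channels, each a flat list. Entry (i, j) is the inner
    # product of channels i and j -- the channel correlations compared
    # by the style loss.
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in feat]
            for ci in feat]

def total_loss(l_rec, l_perc, l_style, l_adv,
               weights=(1.0, 0.1, 250.0, 0.1)):
    # Weighted sum of the four loss groups; the defaults use the
    # coarse-stage weights (lambda_adv = 1 for the refinement stage).
    w_rec, w_perc, w_style, w_adv = weights
    return w_rec * l_rec + w_perc * l_perc + w_style * l_style + w_adv * l_adv
```

Note how the large λ_style compensates for the small magnitude of Gram-matrix differences relative to the per-pixel terms.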

IV. EXPERIMENTS AND DISCUSSION
In this section, we evaluate the E2F-GAN performance on a newly generated face dataset (E2Fdb) based on CelebA-HQ. We compare our results with three other methods: EdgeConnect (EC) [15], Pluralistic Image Completion (PIC) [22], and LaFIn [21]. For a fair comparison, the three methods have been trained on E2Fdb. To quantitatively measure the performance differences among the methods, we employ several statistical metrics. Moreover, to measure how well demographic and biometric features are preserved, we calculate the False Non-Match Rate between original and inpainted faces using a competitive face biometric matcher [49] based on ArcFace [36].

A. DATASETS
We conduct all experiments on our generated dataset called E2Fdb (available on the project's GitHub page), extracted from the well-known CelebA-HQ dataset [32,49]. To extract the periocular region from each face image, the images are resized to 256 × 256 and, by utilizing a landmark detector [27], the eyes are detected, similar to [50]. In this way, M and I_M are produced for each image. Moreover, using the WHENet [33] algorithm, we removed misleading samples, including eyes covered by sunglasses and faces rotated more than 45 degrees in any direction (roll, pitch, yaw), which hides one of the eyes. Finally, the total number of samples is 24,554, of which 22,879 are used for training and the remaining 1,675 images for testing.

B. EVALUATION METRICS
We evaluate the image inpainting performance of the proposed model using quantitative and qualitative comparisons. For the quantitative comparison, two types of metrics, statistical and identity metrics, have been measured. In the following, we briefly describe each category and its corresponding metrics.

1) STATISTICAL METRICS
We use five statistical metrics: ℓ1 loss, Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM) [34], Fréchet Inception Distance (FID) [35], and Total Variation (TV). Notably, the ℓ1 loss shows the model's ability to reconstruct images. PSNR measures the visibility of errors between the ground truth I_gt and the inpainted image Î to evaluate image quality. SSIM estimates the perceptual changes in structural information, reflecting human subjective perception more accurately than PSNR. FID is a widely used metric in the image generation field to measure visual quality. TV measures the amount of noise in an image by summing the absolute differences between neighboring pixels.
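Two of these metrics can be sketched over flattened pixel lists (real evaluations run on full 2-D images; the function names are ours):

```python
import math

def psnr(pred, gt, peak=255.0):
    # Peak Signal-to-Noise Ratio in dB: higher means the inpainted image
    # is pixel-wise closer to the ground truth.
    mse = sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(pred)
    return float("inf") if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def total_variation(row):
    # 1-D total variation: the sum of absolute neighboring-pixel
    # differences; lower values indicate less high-frequency noise.
    return sum(abs(a - b) for a, b in zip(row, row[1:]))
```

A perfect reconstruction drives the MSE to zero (infinite PSNR), while a noisy row such as [0, 255, 0] has a large total variation.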

2) IDENTITY METRICS
To measure how well demographic and biometric characteristics are preserved after the inpainting process, we calculate the False Non-Match Rate (FNMR). FNMR is the rate at which a biometric algorithm miscategorizes two captures from the same individual as being from different individuals. Here, we assume that Î and I_gt are two faces of the same individual; using ArcFace [36], we calculate the corresponding embedding vector for each face and then compute the cosine similarity between each pair. Finally, the FNMR is reported for different thresholds.
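The FNMR computation described above can be sketched as follows, with short toy vectors standing in for ArcFace embeddings (the function names are ours):

```python
import math

def cosine_similarity(u, v):
    # Similarity between two embedding vectors: 1.0 for identical
    # directions, 0.0 for orthogonal ones.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def fnmr(genuine_scores, threshold):
    # Fraction of genuine (same-identity) comparison scores that fall
    # below the match threshold, i.e. pairs wrongly declared non-matching.
    return sum(1 for s in genuine_scores if s < threshold) / len(genuine_scores)
```

Sweeping the threshold and plotting fnmr(scores, t) produces curves like those in Fig. 4: the lower the curve, the better the identity information is preserved.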

C. COMPARISON WITH EXISTING WORK
Using the above-mentioned metrics and presenting some outputs, the results of our proposed method have been qualitatively and quantitatively compared against three state-of-the-art approaches: PIC, EC, and LaFIn. We trained the three methods on our generated dataset (i.e., E2Fdb) according to the best configuration of each method as reported in the corresponding paper. In the following subsections, we present the results.

1) QUANTITATIVE COMPARISONS
The results of the statistical metrics calculated on the validation set of E2Fdb, including 1,675 samples, are reported in Table I. As can be seen from the numbers in Table I, E2F-GAN is superior to PIC, LaFIn, and EC in most metrics, except for the ℓ1 loss, for which LaFIn works slightly better. Overall, our E2F-GAN outperforms the others by large margins in terms of the FID, SSIM, PSNR, and TV metrics. More specifically, our large margins in the FID and TV metrics demonstrate that our method can inpaint the masked image with much higher quality than the other methods. Moreover, the FNMR has been measured for E2F-GAN and the three compared methods, as shown in Fig. 4. For different thresholds, E2F-GAN has a lower false non-match rate, which shows the ability of our algorithm to extract identity information from the periocular region and transfer it to the reconstructed face. Notably, since the PIC method generates different outputs for a specific input, we executed this method five times and report the best results. The visual quality of the EC outputs is low compared to our results and those of LaFIn; therefore, although we use an edge predictor in our scheme like EC, there is a large margin between our outcomes. Additionally, to further investigate the models' outputs regarding age and gender prediction based on the periocular region, we present some challenging examples in Fig. 6. The figure shows three faces: two elders (a man and a woman) and a young woman. As seen in these examples, E2F-GAN can assess the age based on the periocular region and reconstruct the face with reasonable quality.

V. ABLATION STUDY
In this section, we first qualitatively and quantitatively analyze the effect of the three main components of our proposed model: the edge predictor, the refinement module, and the attention block. Table II reports statistical metrics indicating the degree of effectiveness of each of the three components in the performance of E2F-GAN. Specifically, the refinement network is the most conspicuous one; it benefits the model by providing conformity and consistency among face components and the skin texture around the eyes, such as wrinkles and skin color. The edge guidance contributes to ensuring that the structure of the face is well preserved (see Fig. 7). Visually, the effectiveness of the attention block may not seem tangible; however, the quantitative results demonstrate its advantages. We also compared the effect of the edge and landmark predictors. As shown in Table III, the edge guidance provides better values in most quantitative metrics, especially SSIM.
Finally, Fig. 8 shows a few challenging examples for preserving the gender of the person based on the periocular region. Our observations show that E2F-GAN can preserve the gender of subjects with a high accuracy.

VI. CONCLUSION
The aim of this paper is a particular case of face inpainting in which we reconstruct the face using just the periocular region. To this end, we presented E2F-GAN, a GAN-based architecture that combines the advantages of coarse-to-fine, coarse-and-fine, and structural guidance-based architectures for face inpainting. It includes three main modules: one for extracting the face's edges (edge predictor), one for coarsely predicting the face elements (coarse generator), and one for refining the coarse predicted image (refinement generator). We analyzed E2F-GAN and compared it with other well-known face inpainting methods to measure the efficiency and quality of its performance. For this, we modified a widely used face dataset called CelebA-HQ such that the whole face except the periocular region is masked and used as the E2F-GAN input, calling the resulting dataset E2Fdb. Our proposed inpainting algorithm E2F-GAN and the E2Fdb dataset are both available in the project GitHub repository. Several qualitative and quantitative metrics have been measured during our experiments to show the performance of E2F-GAN in terms of preserving the identity and non-identity features of each face after inpainting. Experimental results show that our method outperforms previous learning-based face inpainting methods and that E2F-GAN can generate realistic and semantically plausible images.

Figure 8. Illustration of gender preservation by our proposed method (masked image, ours, original).
Future work includes analyzing biometric quality aspects of the resulting faces using recent objective measures [51,52]; analyzing [49] and reducing [53] undesired biases in the face generation process; and combining multiple face generation approaches for better outputs [48].