Image Inpainting With Learnable Edge-Attention Maps

This paper proposes an end-to-end Learnable Edge-Attention Map (LEAM) method to assist image inpainting. To achieve better recovery, we design an edge attention module that extracts feature information from the edge map and re-normalizes the image feature information while automatically updating the edge map. The information of known regions is adopted to help the decoder generate semantically consistent results. A dual-discriminator structure consisting of a local discriminator and a global discriminator is proposed to generate realistic texture details and improve the consistency of the overall structure. Experiments show that our method obtains higher inpainting quality than existing state-of-the-art approaches, improving PSNR by 3.58% and SSIM by 2.27%, and reducing MAE by 9.21% on average.


I. INTRODUCTION
Image inpainting aims at reconstructing missing regions of images according to the known content [1]. These algorithms have a wide range of applications in image editing, such as completing occluded regions [1], removing unwanted objects [32], and restoring damaged areas [2], [3]. The main challenge of image inpainting is to generate realistic texture details in the missing areas while maintaining the semantic structure of the global image [4], both of which strongly affect the visual quality of the result.
Traditional studies handle small holes well using diffusion-based methods, which extract features from the hole boundaries and select matching textures to fill in the missing holes. These methods can generate texture details, but when filling large holes they often fail to recover complex structures in the missing areas [5]. Patch-based algorithms [2], [6], [17], [18] copy information from similar exemplar patches or image collections to fill in the missing holes. However, without a high-level understanding of image contents and structures, these methods usually struggle to reconstruct semantically meaningful content for locally unique regions.
The associate editor coordinating the review of this manuscript and approving it for publication was Mohammad Shorif Uddin .
Deep learning-based approaches learn nonlinear mappings among training samples from massive data and can achieve good results, producing plausible structures even for large missing holes [24], [25], [27]. However, despite their merits, earlier methods [7], [8] cannot efficiently use context information to generate meaningful content, which often leads to blurry results.
Some recent approaches try to use contextual information to obtain inpainting results [20]–[22]. Methods with spatial attention [9], [10] use the surrounding image features to recover the missing area. These methods can ensure the semantic consistency of the generated content, but they only focus on rectangular holes. When dealing with irregular areas, pixel discontinuities often occur, creating an obvious semantic gap.
To effectively deal with irregular holes and reduce ambiguity, Nazeri et al. [11] proposed a model termed EdgeConnect, a two-stage model comprising an edge generation network followed by an image completion network. The edge generation network estimates the possible edges as prior information for the image completion network, which then generates the final recovered image together with the distorted input. However, the edge map of EdgeConnect is only used in the first layer of the image completion network, so it cannot directly propagate to the deep network layers to describe the edges in highly textured regions accurately. When most areas are missing, the recovered images tend to exhibit structural confusion. Moreover, in the image completion network of EdgeConnect, the generated edge information is not learned or updated during training.
In this paper, we propose a learnable edge-attention map method that aims to effectively utilize feature information for generating credible content. To avoid misusing edge information, we design an edge attention module that extracts the feature information of the edge map and re-normalizes the image feature information. The attention module makes the most of the information of known regions to enhance details and restore structure. Meanwhile, U-Net [12] is used as the backbone of our generator to retain the information of different layers via skip connections. Benefiting from end-to-end training, the edge attention module can effectively adapt to irregular holes and propagate through the convolutional layers. Moreover, more feature information can be retained into the deep network layers by the attention module, providing preconditions for generating reasonable structure information.
To effectively handle irregular holes, we introduce a dual discriminator consisting of a global discriminator and a local discriminator. The global discriminator focuses on the overall image and improves the consistency of the overall structure. Simultaneously, the local discriminator focuses on the missing regions, which further improves the quality of details and reduces the generation of artifacts.
EdgeConnect builds its generator from residual blocks, which may limit the propagation of feature information from different layers to the deep layers of the image completion network. In addition, the single discriminator of this method cannot sufficiently handle irregular holes, especially when large areas are missing. Based on these insights, U-Net [12] is used as the backbone of our generator to retain sufficient feature information at each layer. Meanwhile, we use the attention module to better incorporate edge information, providing preconditions for the subsequent image completion process. Moreover, the dual-discriminator improves the quality of the recovered images.
We experimented with standard datasets Paris StreetView [13] and Places [14]. The qualitative and quantitative tests show that our approach can obtain higher quality inpainting results compared with the existing methods.
Overall, the main contributions of this paper are summarized as follows:
1. We propose a Learnable Edge-Attention Map (LEAM) method for improving color consistency, texture fidelity, and semantic coherence. When adapting to irregular holes, it can effectively utilize the edge feature information of images and the feature information of known regions.
2. We design an edge attention module to extract the feature information of the edge map and re-normalize the image feature information. The edge attention module assists the decoder in generating a consistent semantic structure by utilizing the information of known regions.
3. We introduce a dual-discriminator network that helps generate recovered images with overall consistency and realistic details.
4. Experiments on two datasets show that our method achieves higher-quality results than existing state-of-the-art approaches.
This paper is organized as follows: Section II reviews related work on image inpainting; Section III describes the proposed method in detail; Section IV presents the experimental results and analysis; Section V concludes the paper and discusses future work.

II. RELATED WORK
Previous research methods for image inpainting can be roughly divided into two groups: traditional methods and learning-based methods.

A. TRADITIONAL METHODS
Image inpainting appeared well before the wide application of deep learning technology. Traditional image inpainting methods can be divided into two groups: diffusion-based and patch-based. Diffusion-based methods [5], [15], [16] extract features from the image background and select matching textures to synthesize the missing regions. However, these methods cannot capture global information to generate meaningful structures in the missing parts. Patch-based methods [2], [6], [17], [18] fill in the missing regions by copying information from similar patches in the image background or in image collections. However, these methods are not effective for images where the background or the image collection has low similarity with the missing regions. Traditional approaches share a common problem: they cannot capture high-level semantics to produce meaningful content and are not suitable for dealing with large missing areas.

B. LEARNING-BASED METHODS
Learning-based methods usually use generative adversarial networks (GAN) [19] to generate information in the missing holes. Context Encoder [20] used a deep neural network for image inpainting, introducing an encoder-decoder network to predict the missing regions, which improves the visual and semantic rationality of the recovered image. However, its results often lack fine-detailed textures and contain visible artifacts. Shortly thereafter, Iizuka et al. [21] suggested a local and global context discriminator (Global & Local) that improves detail quality and ensures the consistency of generated images. However, the sharpness of the details still needs improvement, and this method is not suitable for generating complex structural textures.
Yang et al. [22] further proposed an inpainting model of multi-scale neural patch synthesis (MNPS) based on the Context Encoder, composed of a content constraint model and a local texture constraint model. It works well for high-resolution images. However, this method significantly increases the computational cost due to the complexity of the optimization process. Yu et al. [23], [29] proposed Contextual Attention and Gated Convolution, which consist of two stages. In the first stage, the network adopts the reconstruction loss to obtain a coarse result. The second stage uses the contextual attention layer to complete the fine details. With these methods, the inpainting results have a visually more reasonable structure and texture. However, both methods require that the coarse estimate of the first stage be reasonably accurate. Besides, Gated Convolution [29] needs an accurate result from the Holistically-nested Edge Detection (HED) [30] edge detector to guide the network in generating the masked regions.
Most inpainting methods are aimed at rectangular missing regions, but in real-world applications the holes are usually irregular. To better handle irregular holes, Liu et al. [24] presented Partial Convolutions (PConv) with an automatic mask update. This method can effectively suppress image blurriness and generate realistic textures. However, PConv adopts a fixed feature re-normalization that may extract image features unreasonably, which limits the method in handling color differences.
Some deep learning methods also introduce prior information for inpainting, such as semantic structure, contours, and edge information, producing more impressive results [11], [25]–[29]. Nazeri et al. [11] used edge information for image inpainting, but their edge generation network may not accurately describe the edges in highly textured regions. Wang et al. [25] introduced a multistage attention module that can flexibly use the feature maps of different layers to obtain information at various scales, improving the structural consistency of the results. However, the module may cause unwanted artifacts. Li et al. [26] proposed Visual Structure Reconstruction (VSR), which gradually adds image structure information during inpainting. However, it is not effective for images with large irregular holes. Yang et al. [27] suggested a multi-task learning framework that learns the relevant structural information and integrates it with the image inpainting process through a parameter-shared generator. However, this method might produce unreasonable details because it does not consider local feature information.

III. APPROACH
The framework of our method is shown in Fig. 1. The inputs of the image completion network are the edge map, the input image, and the mask. In the encoder segment, we first use the edge attention module to extract effective edge feature information and re-normalize the image feature information. Then, in the decoder segment, the edge attention module further extracts information of the known regions to generate the output image. Finally, we use the dual-discriminator to improve the final quality.
The edge map is generated by the edge generation network of EdgeConnect [11], which consists of an encoder that down-samples twice, followed by eight residual modules, and a decoder that up-samples twice to restore the original size [11]. G_EC denotes the generator of the edge generation network, and D_EC denotes its discriminator, a 70×70 PatchGAN architecture [31] that determines whether each 70×70 image patch is real or not. The following describes how the edge map is generated.
The original image is denoted as I_gt, and C_gt denotes the edge map of the original image. M = (1 − m) is the mask (m is the ground-truth mask). The input image is I^m_gt = I_gt ⊙ M, and C^m_gt denotes the edge map of the input image. The grayscale of the input image is represented by I_g. The generated edge map is then

$$C_{pred} = G_{EC}(I_g, C^{m}_{gt}).$$

C_gt and C_pred, conditioned on I_g, are used as inputs of the discriminator D_EC, which predicts whether the edge map is real or not. The edge generation network is trained with the adversarial loss L_EC and the feature-matching loss L_FM [11]:

$$\min_{G_{EC}} \max_{D_{EC}} \; \alpha_{EC} L_{EC} + \alpha_{FM} L_{FM},$$

where α_EC and α_FM are hyper-parameters that balance the contributions of the two losses. For our experiments, we set α_EC = 1, α_FM = 10. The adversarial loss L_EC is expressed as [19]:

$$L_{EC} = \mathbb{E}_{(C_{gt}, I_g)}\big[\log D_{EC}(C_{gt}, I_g)\big] + \mathbb{E}_{I_g}\big[\log\big(1 - D_{EC}(C_{pred}, I_g)\big)\big].$$

The feature-matching loss L_FM compares the activation maps in the intermediate layers of the discriminator to improve the quality of the edge map. It is defined as [11]:

$$L_{FM} = \mathbb{E}\left[\sum_{i=1}^{F} \frac{1}{N_i} \left\| D^{i}_{EC}(C_{gt}) - D^{i}_{EC}(C_{pred}) \right\|_1 \right],$$

where F is the index of the final convolutional layer of the discriminator, N_i is the number of elements in the i-th activation layer, and D^i_EC denotes the activation map of D_EC at the i-th layer.
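To make the feature-matching term concrete, the following is a minimal NumPy sketch (an illustrative implementation, not the authors' code; the activation maps are stand-ins for the discriminator's intermediate outputs):

```python
import numpy as np

def feature_matching_loss(acts_real, acts_fake):
    """L_FM: per-layer L1 distance between discriminator activations
    for the real and generated edge maps, each layer normalized by
    its number of elements N_i, summed over layers."""
    loss = 0.0
    for a_real, a_fake in zip(acts_real, acts_fake):
        n_i = a_real.size  # N_i: number of elements in the i-th activation
        loss += np.abs(a_real - a_fake).sum() / n_i
    return loss
```

With identical activations the loss is zero; it grows as the generated edge map's discriminator features diverge from those of the ground-truth edges.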

A. GENERATOR OF IMAGE COMPLETION NETWORK
1) ENCODER
The bias-free convolution layer is widely used in U-Net [12] for image color filling [31], image style transfer [31], and image inpainting [10], [32]; we use this layer to build the generator of our network. The generator includes the encoder, the decoder, and the attention module, where the attention module helps the network improve the quality of the recovered images by using different strategies in the encoder and decoder segments. The encoder details are shown in Fig. 1 (marked by the green dotted box) and Fig. 2.
Let F_in be an input image or feature map in the U-Net and W be a convolution filter. The convolution of the input image or feature map is defined as:

$$F = W^{T} F_{in}.$$

We use the local edge map E_L as the input of the encoder attention module:

$$E_L = 1 - M + C_{pred,M},$$

where M represents the mask. The network should be told where the mask is to avoid the misuse of invalid data during convolution. C_pred,M = C_pred ⊙ (1 − M) denotes the generated edge map of the missing areas, which helps the completion network extract effective information reasonably. However, the generated edge map inevitably differs from the edge map of the real image when dealing with large missing holes. To improve this situation, we use (1 − M + C_pred,M), which treats the map as an unknown part to help the network accurately extract image features. We use a learnable convolution filter Km_e with size 4×4 to learn the edge and mask feature information from the local edge map E_L and generate the convolved local edge map. Formally, the convolved local edge map E_c is defined as:

$$E_c = Km_e \otimes E_L.$$

The extracted map is then used for image feature re-normalization. ⊙ is interpreted as the element-wise product of the image feature map and the edge feature map, and F_out represents the output feature map:

$$F_{out} = Km3 \otimes \big(F \odot g_A(E_c)\big),$$

where ⊗ denotes the convolution operator and Km3 denotes a convolution kernel with size 3×3 (the subscript denotes the convolution size). Our method uses the element-wise product to re-normalize edge features and image features; this combination alone is relatively rough. To solve this problem, the convolution Km3, which does not change the size of the feature map, is adopted to further extract feature information while keeping the number of operations small. Moreover, Km3 can effectively improve the ability to obtain deep semantic information [4]. g_A(E_c) denotes the edge feature map.
The step of extracting features from the convolved local edge map and updating them to generate the edge feature map is defined as:

$$g_A(E_c) = Km1 \otimes g_a(E_c),$$

where Km1 is a learnable convolution filter with a size of 1×1 and g_a(E_c) is the activation function for the edge feature map, formulated as a Gaussian-shaped function:

$$g_a(x) = \alpha \exp\big(-\sigma (x - \mu)^2\big),$$

where α, µ, σ are learnable parameters; we set them as α = 1.1, µ = 2.0, σ = 1. g_A(E_c) can further increase the network depth, enhance the nonlinearity of the network [33], [34], and improve the ability to obtain deep semantic information. The role of Km1 is the same as that of Km3: to extract features further.
To make the edge map adapt to irregular holes and propagate through the layers of the edge attention module, the convolved edge map E_c needs to be updated reasonably. E_out denotes the updated edge map:

$$E_{out} = g_m(E_c).$$

The activation function used in the edge-map updating step is defined as:

$$g_m(x) = \big(\mathrm{ReLU}(x)\big)^{\theta},$$

where θ is a hyper-parameter and we set θ = 0.8.
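As a concrete illustration, the two activations and the re-normalization step can be sketched as follows. This is a minimal NumPy sketch under two assumptions: g_a takes a Gaussian shape in the parameters α, µ, σ, and g_m is the power-ReLU update with θ = 0.8; the learnable 4×4, 3×3, and 1×1 convolutions of the full module are omitted for brevity:

```python
import numpy as np

def g_a(x, alpha=1.1, mu=2.0, sigma=1.0):
    """Assumed Gaussian-shaped attention activation for the
    convolved edge map (alpha, mu, sigma are learnable in the paper)."""
    return alpha * np.exp(-sigma * (x - mu) ** 2)

def g_m(x, theta=0.8):
    """Edge-map updating activation: (ReLU(x)) ** theta."""
    return np.maximum(x, 0.0) ** theta

def renormalize(feat, edge_conv):
    """Feature re-normalization: element-wise product of the image
    feature map with the edge attention map g_a(E_c)."""
    return feat * g_a(edge_conv)
```

Note how g_a peaks (value α) where the convolved edge response equals µ, so features near edges are emphasized, while g_m keeps the updated edge map non-negative and compresses large responses.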

2) DECODER
Most learning-based methods adopt standard convolution that treats known regions and missing holes identically, which inevitably leads to color difference and blurring [11], [20]–[23]. To focus the decoder on filling irregular holes with edge feature information and the information of known regions, we introduce the learnable edge attention map to avoid the misuse of invalid data and replace the standard convolution.
The decoder details are shown in Fig.1 (marked by the red dotted box) and Fig.2.
We use the global edge map E_G as the input of the decoder attention module:

$$E_G = C + (1 - M) \odot (1 - C),$$

where 1 − M represents the known regions. Image inpainting requires that the generated result be highly consistent with the known regions in quality and vision, so the decoder of the image completion network needs to pay more attention to the known regions and extract their feature information. C denotes the global edge, which is composed of the actual edges of the known regions and the edges generated by the edge generation network in the missing areas. The global edge C is defined as:

$$C = C_{gt} \odot (1 - M) + C_{pred} \odot M,$$

which further extracts the semantic structure features of the whole image, not just the missing area. This can improve semantic consistency and reduce color difference in the results. (1 − M) ⊙ (1 − C) denotes the complement of C in the known regions, which helps the network make reasonable use of the edge feature information. The convolved edge map of the decoder is denoted as E^d_c, which extracts reasonable features of the edge information and the structure information of the known regions. Formally, E^d_c is defined as:

$$E^{d}_{c} = Km_d \otimes E_G,$$

where the learnable convolution filter Km_d learns the mask and edge feature information from the global edge map E_G to generate the convolved edge map E^d_c. The convolution kernel size of Km_d is 4×4.
F^d_out denotes the operation of feature re-normalization using the feature maps extracted from the local edge map and the global edge map:

$$F^{d}_{out} = Km3 \otimes \big(F \odot g_A(E_c) \odot g_A(E^{d}_{c})\big),$$

where g_A(E^d_c) denotes the step of extracting features from the global edge map. g_A(E^d_c) helps the network obtain high-level semantics.
F^d_out combines features reasonably to improve the utilization of information [35] and avoid generating unwanted content.
To adapt to propagation through the layers, the updated convolved edge map of the decoder is defined as:

$$E^{d}_{out} = g_m(E^{d}_{c}).$$

We use the learnable convolution filters Km_e and Km_d, which change the size of the feature map, to learn to extract feature information from the feature map. On the one hand, Km_e and Km_d enable the attention module to update the size of the feature map synchronously with U-Net [12], so that feature information at different levels can be learned and utilized effectively. On the other hand, for each feature layer, these learnable convolution filters help the network distinguish, learn, and process regions in different states (known background and unknown foreground regions), avoiding the abuse of invalid data during convolution. The learnable convolution filters Km3 and Km1, which do not change the size of the feature map, are used to improve the ability to obtain deep information. Km3 and Km1 respectively extract information from the re-normalized feature map and from the edge map. Km3 uses a 3×3 convolution filter because a sufficient receptive field is required for inpainting. Km1 uses a 1×1 convolution filter to further increase the network depth for better extraction of edge information.
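The decoder-side edge maps can be sketched numerically. This is an illustrative NumPy sketch under our reading of the definitions above (binary mask with M marking missing pixels and 1 − M known pixels; the learnable convolutions Km_d, Km3 are omitted):

```python
import numpy as np

def global_edge(c_gt, c_pred, m):
    """Global edge C: actual edges in the known regions (1 - M)
    plus generated edges in the missing regions (M)."""
    return c_gt * (1 - m) + c_pred * m

def global_edge_map(c, m):
    """Decoder attention input E_G: the global edge plus the
    complement of C restricted to the known regions."""
    return c + (1 - m) * (1 - c)
```

Under this reading, E_G equals 1 everywhere in the known regions (edge or not), so the decoder attends to all known pixels, while inside the holes E_G follows the generated edges only.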

B. DISCRIMINATOR OF IMAGE COMPLETION NETWORK
The region marked by the blue dotted box in Fig. 1 is the discriminator, and the details of its structure are shown in Fig. 3. Based on Global & Local [21], we propose a dual-discriminator strategy that is suitable for irregular holes, whereas the discriminator of Global & Local only works well for rectangular holes. Our method utilizes a local discriminator to focus on the missing regions, which can effectively handle irregular holes and generate results with high-frequency details. Meanwhile, we use a global discriminator to improve the consistency between the missing regions and the known parts. The following describes the global and local discriminators.
The input image I^m_gt and the global edge C are used as inputs of the generator of the image completion network. To fill in the missing area, the image completion network finally generates the inpainted image I_pred = G(I^m_gt, C). The adversarial loss of the global discriminator D_1 of the image completion network is expressed as:

$$L_{adv,1} = \mathbb{E}_{(I_{gt}, C)}\big[\log D_1(I_{gt}, C)\big] + \mathbb{E}_{C}\big[\log\big(1 - D_1(I_{pred}, C)\big)\big].$$

The local discriminator D_2 is trained analogously, with its inputs restricted to the missing regions, giving the adversarial loss L_adv,2.
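The discriminator objective above can be sketched as a standard GAN loss (an illustrative NumPy sketch, not the authors' code; the arrays stand in for discriminator outputs in (0, 1) on real and generated images):

```python
import numpy as np

def adversarial_loss_d(d_real, d_fake, eps=1e-12):
    """Discriminator objective: maximize log D(real) + log(1 - D(fake)).
    Returned negated, as a loss to be minimized; eps guards log(0)."""
    return -(np.log(d_real + eps) + np.log(1.0 - d_fake + eps)).mean()
```

A perfect discriminator (D(real) ≈ 1, D(fake) ≈ 0) drives this loss toward zero, while an undecided one (both outputs 0.5) yields 2·log 2; the same form applies to both the global and local discriminators.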

C. LOSS FUNCTIONS
For better recovery of semantics and realistic details, we train our network with Adversarial loss [19], Pixel Reconstruction loss, Perceptual loss [37], and Style loss [38].

1) ADVERSARIAL LOSS
Adversarial loss [19] can improve the visual quality of generated images and is often used for image generation [39] and image style transfer [40]. Moreover, Adversarial loss keeps the generator and discriminator continuously optimizing against each other, improving the detail quality of the generated images [41]. The total adversarial loss [36] of our image completion network is computed as:

$$L_{adv} = \alpha_{adv,1} L_{adv,1} + \alpha_{adv,2} L_{adv,2},$$

where L_adv,1 and L_adv,2 are the adversarial losses of the global and local discriminators, and α_adv,1 and α_adv,2 are pre-defined weights that balance the two learning tasks. For our experiments, we set α_adv,1 = 0.8, α_adv,2 = 0.2.

2) PIXEL RECONSTRUCTION LOSS
The ℓ1-norm pixel reconstruction loss is defined as:

$$L_{\ell_1} = \left\| I_{pred} - I_{gt} \right\|_1,$$

where the pixel reconstruction loss L_ℓ1 [37] measures the per-pixel difference between the inpainted images I_pred and the original images I_gt.
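A minimal sketch of the per-pixel term (NumPy; normalized by pixel count in this sketch, which keeps the value comparable across image sizes):

```python
import numpy as np

def l1_loss(i_pred, i_gt):
    """Pixel reconstruction loss: mean absolute per-pixel difference."""
    return np.abs(i_pred - i_gt).mean()
```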

3) PERCEPTUAL LOSS
Adversarial loss improves texture quality, but it is limited in learning structural information. In some recent methods, Adversarial loss and Pixel Reconstruction loss are used together to train a network for improving image quality. However, these losses still cannot capture high-level semantics and are not suitable for generating images consistent with human perception [38]. Perceptual loss, by contrast, compares features obtained by convolution with those of the ground-truth image. This loss measures the similarity of high-level semantics between images [42], effectively improving the structure of the inpainting results. The Perceptual loss of the image inpainting network is formed as [37]:

$$L_{perc} = \mathbb{E}\left[\sum_{i} \frac{1}{N_i} \left\| \phi_i(I_{pred}) - \phi_i(I_{gt}) \right\|_1 \right],$$

where φ_i is the activation map of the i-th layer of a pre-trained network and N_i is the number of elements in φ_i. In our implementation, φ_i corresponds to the activation maps from layers relu1-1, relu2-1, relu3-1, relu4-1, and relu5-1 of the VGG-16 network pre-trained on the ImageNet dataset [43].
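Given per-layer feature maps φ_i, the perceptual term can be sketched as follows (illustrative NumPy sketch; in practice the feature maps come from a pre-trained VGG-16, which is stubbed out here):

```python
import numpy as np

def perceptual_loss(feats_pred, feats_gt):
    """Sum over layers of the L1 distance between corresponding
    activation maps, each normalized by its element count N_i."""
    return sum(np.abs(fp - fg).sum() / fp.size
               for fp, fg in zip(feats_pred, feats_gt))
```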

4) STYLE LOSS
Although Adversarial loss and Perceptual loss can effectively improve texture quality and enhance detail recovery, they cannot entirely avoid visual artifacts. Therefore, Style loss is added to improve overall consistency. We use the feature maps of VGG-16 pre-trained on the ImageNet dataset [43]; for our experiments, we use relu2-2, relu3-3, relu4-4, and relu5-2. The Style loss is defined as:

$$L_{style} = \mathbb{E}_{j}\left[\left\| G^{\phi}_{j}(I_{pred}) - G^{\phi}_{j}(I_{gt}) \right\|_1 \right],$$

where G^φ_j(·) is a Gram matrix constructed from the activations of the pre-trained network [38], defined as:

$$G^{\phi}_{j}(x)_{c,c'} = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \phi_j(x)_{h,w,c}\, \phi_j(x)_{h,w,c'},$$

where φ_j(x) is the activation of the j-th layer with shape C_j × H_j × W_j.
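The Gram-matrix construction can be sketched as follows (NumPy sketch; phi stands in for a single activation map of shape C×H×W from the pre-trained network):

```python
import numpy as np

def gram_matrix(phi):
    """Gram matrix of an activation map phi with shape (C, H, W),
    normalized by C * H * W."""
    c, h, w = phi.shape
    psi = phi.reshape(c, h * w)      # flatten the spatial dimensions
    return psi @ psi.T / (c * h * w)

def style_loss(phi_pred, phi_gt):
    """L1 distance between the Gram matrices of prediction and target."""
    return np.abs(gram_matrix(phi_pred) - gram_matrix(phi_gt)).sum()
```

Because the Gram matrix discards spatial positions and keeps only channel correlations, this term matches texture statistics rather than exact pixel layout, which is why it suppresses checkerboard-style artifacts.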

5) MODEL OBJECTIVE
Combining the above loss functions, the overall objective of our model is formed as:

$$L_{total} = \alpha L_{adv} + \alpha_p L_{perc} + \alpha_s L_{style} + \alpha_{\ell_1} L_{\ell_1},$$

where α, α_p, α_s, and α_ℓ1 are hyper-parameters that balance the contributions of the different loss terms. In our implementation, we set α = 0.1, α_p = 1, α_s = 250, α_ℓ1 = 1, following [11].
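Putting the weights together, the overall objective is a weighted sum (sketch using the settings stated above):

```python
def total_loss(l_adv, l_perc, l_style, l_l1,
               a=0.1, a_p=1.0, a_s=250.0, a_l1=1.0):
    """Overall training objective: weighted sum of the four loss
    terms with the weights used in this paper."""
    return a * l_adv + a_p * l_perc + a_s * l_style + a_l1 * l_l1
```

The large style weight (250) compensates for the small magnitude of Gram-matrix differences relative to the other terms.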

IV. EXPERIMENTS AND ANALYSIS
We conduct experiments to evaluate our LEAM method on two datasets: Paris StreetView [13] and Places365-Standard (the core set of Places [14]). For Paris StreetView, we use all 14,900 images of the original training set for training. For Places, we select 10 of the 365 categories in Places365-Standard, a total of 50,000 images, for training. The masks used for training come from PConv [24], a total of 12,000 images. The masks and images for training and testing are 256 × 256 pixels. We optimize the model with the Adam algorithm at a learning rate of 0.0001 and train for 200 epochs. All experiments are conducted on a PC equipped with a single NVIDIA Quadro T4000 GPU.

A. QUANTITATIVE COMPARISON
We compare our method quantitatively with Global & Local (GL) [21], Context Attention (CA) [23], Partial Convolutions (PConv) [24], and EdgeConnect (EC) [11]. As shown in Table 1, the performance of all methods on all metrics deteriorates gradually as the missing area increases. Compared with the four methods, ours achieves the highest PSNR and SSIM and the lowest MAE, which indicates that the recovered images have the highest definition, best quality, and lowest distortion. Specifically, our method improves PSNR by 3.58% and SSIM by 2.27%, and reduces MAE by 9.21%.
To quantitatively investigate the effectiveness of color restoration in our method, we calculate the mean color difference between the original and generated images. The smaller the color difference, the more similar the colors of the restored image and the original image. We use the CIE Lab color-difference formula to evaluate the performance of the different methods. Table 2 shows that our method has the smallest color difference and the strongest color restoration ability over all test images.
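The per-pixel color difference can be sketched with the CIE76 formula (an illustrative sketch; we assume images already converted to CIE Lab, e.g. via skimage.color.rgb2lab):

```python
import numpy as np

def mean_delta_e76(lab_a, lab_b):
    """Mean CIE76 color difference between two Lab images of shape
    (H, W, 3): Delta E = sqrt(dL^2 + da^2 + db^2), averaged over pixels."""
    diff = lab_a - lab_b
    return np.sqrt((diff ** 2).sum(axis=-1)).mean()
```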

B. QUALITATIVE COMPARISON
As shown in Fig. 4 and Fig. 5 (the red rectangle areas are inpainting results, and the details are shown in the enlarged images), GL [21] is effective at generating realistic local details, but its results present meaningless textures and fuzzy artifacts. This is mainly because the method cannot reasonably separate the foreground and background boundaries of missing holes and known regions, which leads to inaccurate filling. CA [23], compared with GL, ensures that the inpainting results have a certain degree of semantic coherence. However, it still cannot avoid generating boundary artifacts and confusing colors. This is because CA is not suited to inpainting irregular holes; moreover, its coarse estimate is not reasonably accurate, leading the network to generate visually implausible structures.

FIGURE 6. Results on real-world object removal images. From left to right: original image, input with objects masked (white area), Global & Local (GL) [21], Context Attention (CA) [23], EdgeConnect (EC) [11], and ours. (a), (b) are divided into two parts: the lower part is the enlarged image of the corresponding red rectangle area in the upper image.
EdgeConnect [11] produces smoother and more reliable results, but the continuity of color and lines does not hold well, and a few artifacts are still observed. This is because EdgeConnect is not specifically designed for handling irregular holes: for large irregular missing parts, it may not generate completely accurate edge information, which ultimately leads the network to generate unreasonable content. Compared with these methods, ours handles these problems better, making texture details more realistic and ensuring the semantic coherence of the inpainted image. This is mainly because our method extracts effective edge feature information and uses it to re-normalize the image feature information; the attention module further helps the network utilize known information to generate semantically consistent results, and the dual-discriminator improves the quality of details and reduces color difference.

C. OBJECT REMOVAL
We use the model trained on Places to evaluate the effect of our method on the real-world object removal task. As shown in Fig.6, we use the white outline shape to cover the target area. The red rectangle areas are the inpainting results generated by ours and the competing methods.
When the object is removed, we observe that the results of GL contain obvious artifacts, and the predictions of CA show a semantic gap. EdgeConnect effectively improves the overall structural consistency, but it still generates noise. In contrast, our method generates credible content because the edge attention map helps the network extract and represent feature information accurately, and the use of the dual-discriminator improves the quality of details.

D. ABLATION STUDY
To illustrate the effectiveness of our method, we analyze how the proposed modules of our method contribute to the final performance of image inpainting. We take the U-Net [12] image generator and a single global discriminator as the baseline, then gradually add modules until the whole model is formed. The modules include the edge attention module in the encoder (ABE), edge attention module in the decoder (ABD), and a dual-discriminator (DD).
As shown in Table 3, compared with the baseline, our method performs progressively better as each module is integrated. With the gradual introduction of the edge attention module and the dual-discriminator, the quality of the generated images is significantly improved. The qualitative comparison is shown in Fig. 7. For irregular holes, the whole model gives the best results: the inpainting quality improves step by step as the attention module and the dual-discriminator are added.
To further investigate the effectiveness of the dual-discriminator, we replace it with the global and local context discriminators taken from GL [21] for comparison. As shown in Table 4, our dual-discriminator performs well in improving the quality of the generated images.

V. CONCLUSION
In this paper, we proposed a novel edge attention map method for image inpainting based on a learnable attention module. The module effectively utilizes edge information in both the encoder and the decoder. Specifically, our edge attention module extracts edge information and utilizes the mask information of the missing areas, and the information of known regions is adopted for better detail and structure recovery. Moreover, we introduce a dual-discriminator to improve the high-frequency detail quality and reduce the color difference of the final generated images. Experimental results demonstrate the effectiveness of our approach: compared with the state-of-the-art methods, it improves PSNR by 3.58% and SSIM by 2.27%, and reduces MAE by 9.21% on average. In the future, we plan to extend this approach to other image tasks, such as text-to-image generation and single-image super-resolution. Moreover, we will investigate the influence of prior information, especially structure knowledge, for image inpainting.

He is currently a Lecturer with the University of Shanghai for Science and Technology, Shanghai, China. His current research interests include virtual reality, computer animation, and computer graphics.
MINGXI ZHANG received the Ph.D. degree in computer software and theory from Fudan University, in 2013.
He is currently an Associate Professor with the University of Shanghai for Science and Technology, Shanghai, China. His current research interests include social network analysis, information retrieval, and graph mining.