Multi-Reception and Multi-Gradient Discriminator for Image Inpainting

Many deep learning methods for image inpainting rely on the encoder-decoder architecture to estimate missing contents. When guidance information from uncorrupted regions cannot be adequately represented or utilized, the encoder may have difficulty handling the rich surrounding or background pixels, and the decoder cannot recover visually sophisticated or realistic content. This paper proposes an effective multi-scale optimization network to alleviate these issues and generate coherent results with fine details. It adaptively encodes multi-receptive-field feature maps and feeds multi-scale outputs into a discriminator to guide training. Specifically, we propose a Multi-Receptive feature maps & masks Selective Fusion (MRSF) operator that can adaptively extract features over different receptive fields to handle heavily corrupted images. A multi-gradient discriminator (MGD) module then uses the intermediate features of the discriminator to guide the generator to produce results with natural textures and semantically real contents. Experiments on several benchmark datasets demonstrate that the proposed method can synthesize more realistic and coherent image content.


I. INTRODUCTION
Image inpainting is the task of filling in missing image regions with semantically plausible contents [1]. This problem has attracted active research attention for decades, and the technology can be applied in many real-world applications, such as restoring masked image parts, removing undesired visual contents, and automatic image editing [2], [3], [4], [5].
Traditional approaches are primarily based on exemplary methods that borrow similar patches from known areas or external database images [5], [6]. These methods can effectively fill small regions with homogeneous backgrounds. However, they cannot restore large disrupted regions or generate novel content because they lack the ability to model high-level semantic features.
Recently, deep learning methods with encoder-decoder architectures have been used to effectively extract meaningful information and restore missing pixels [7], [12]. To better extract high-level features and fill the corrupted image regions with visually natural patterns, many methods have been jointly trained with a generative adversarial network (GAN), such as local and global discriminators [7], [8], [15] and PatchGAN [9], [16].
The critical aspect of these approaches is how to encode a corrupted image into a robust feature map that carries usable guidance information and can reconstruct or synthesize desirable contents for the corrupted regions. Some previous inpainting methods [17], [20] mainly adopt one branch with a single receptive field in each convolution layer to extract features, which obtains satisfactory results in some cases. However, when dealing with complex scenes or large damaged regions, they cannot extract adequate features and often produce inpainting results with color distortions, blurred textures, boundary artifacts, or distorted structures, as shown in Figure 1. Some researchers have focused on a multi-scale strategy to extract powerful features. For example, SKNet [21] adaptively adjusts the receptive field size based on multi-scale input features and can capture objects of different sizes. Wang et al. [22] designed a multi-column image inpainting network that uses three subnets with different filter sizes to generate multi-scale features, fuses the branch features, and discriminates the completion results locally and globally. Building on these works, we aim to obtain better results by combining local, global, and different-scale features, introducing a multi-branch structure with different convolution receptive fields, and obtaining features suitable for the inpainting task through an adaptive selection mechanism.

FIGURE 1. Example of inpainting results for faces and buildings with irregular masks. Each column shows the results of different inpainting methods, including PC [17], GC [9], RFR [12], VCN [18], PUT [19], and our network. The other methods suffer from color distortions, blurred textures, boundary artifacts, inconsistent objects, etc.
In this paper, we propose a Multi-Receptive feature maps & masks Selective Fusion (MRSF) operator for image inpainting. Specifically, it uses a group of convolutional kernels with different kernel sizes to extract multi-scale representations of the input image, then fuses these representations and adopts a selective operation to dynamically aggregate the multi-scale features. Furthermore, we propose the Multi-Gradients Discriminator (MGD) module to strengthen the structural consistency and refine the texture details of the restored images. MGD uses gradients produced by the intermediate features of a PatchGAN [23] discriminator to provide richer supervised signals for the generator. In summary, we design an encoder-decoder network following the U-Net architecture [17], which stacks MRSF layers in the encoder stage and feeds the intermediate reconstruction results into the MGD module.
The main contributions of this study are as follows:
• We design the Multi-Receptive feature maps & masks Selective Fusion (MRSF) operator to adaptively extract multi-scale features and better handle heavily corrupted images.
• We define the Multi-Gradients Discriminator (MGD) module, which uses intermediate features of the discriminator to guide the generator to produce results with natural textures and semantically real contents.
• We evaluate our method on multiple benchmark datasets, and the results show that it outperforms state-of-the-art methods in both visual quality and evaluation scores.

II. RELATED WORK
This section briefly reviews the relevant literature in the following categories: traditional inpainting methods, learning-based inpainting methods, attention-based methods, and generative adversarial network-based inpainting methods.

A. TRADITIONAL IMAGE INPAINTING METHODS
Traditional image inpainting methods can be broadly divided into diffusion-based and patch-based methods. Diffusion-based methods fill holes by propagating neighborhood information [24], [25]. These methods are effective for handling color and texture variance in small regions. However, they cannot deal with large missing areas with rich textures, particularly when the surrounding contents cannot provide relevant reference information for propagation. Patch-based methods exploit similar and associated patches from the same image or a dedicated set of reference images [26], [27]. These methods can produce more realistic results for relatively large missing regions, but with high computation costs and memory consumption when iterating across all possible target-source patch pairs. To find more suitable patches, the bidirectional similarity [28] was introduced to summarize the visual information. To improve computational efficiency in terms of processing time and memory usage, PatchMatch [5] used a fast nearest-neighbor field method to search for suitable patches. These methods assume that suitable content for inpainting is available from the same image or reference images, which does not always hold. Therefore, the performance of these traditional methods is limited because they lack the ability to use high-level semantic and structural information to generate complex content.

B. LEARNING-BASED IMAGE INPAINTING
Recently, many learning-based methods have been proposed to model inpainting as a conditional generation problem. Benefiting from deep convolutional neural networks trained on massive data, these methods significantly improve inpainting performance. For example, as one of the pioneering methods in this category, the context encoder [7] trained an encoder-decoder network with a pixel-level reconstruction loss and an adversarial loss for reconstruction. Iizuka et al. [15] proposed local and global discriminators using dilated convolution to enlarge the receptive fields and generate realistic and coherent images. In addition, partial convolution [17] was proposed to handle irregular holes using masked and re-normalized convolution on valid pixels. Gated convolution [9] provided a learnable dynamic feature selection mechanism for each channel and each spatial location in a partial convolution and used a spectral-normalized PatchGAN to achieve better visual results. These methods only use one or two convolutions with a large kernel size, while the remaining layers use convolutions with a small kernel size (e.g., 3 × 3), which makes it difficult to capture content at multiple scales. Some more recent methods used a two-stage architecture to improve inpainting performance. For instance, Yu et al. [20] proposed a two-stage coarse-to-fine architecture in which the coarse network makes an initial estimate, and the refinement network then takes this initial guess and adopts a contextual attention module to borrow spatial information remotely, ensuring color and texture consistency and producing more satisfactory results. EdgeConnect [11] adopted two GANs to estimate the edges of the corrupted regions and to generate a complete image. Xiong et al. [13] and Ren et al. [29] developed similar models that used a contour or structure generator instead of an edge predictor for intermediate results. Song et al. [30] used segmentation to predict labels and separately generate completed results. These two-stage architectures usually have more parameters and are built on the assumption that the intermediate estimate of the first stage is reasonably accurate. Nevertheless, without the capacity to obtain efficient guidance information in an end-to-end manner, the inconsistencies between the two steps often lead to boundary artifacts and distorted textures.
In contrast to two-stage methods, some parallel frameworks attempt to ensure consistent results while reducing computational costs. For example, pluralistic image completion [31] used reconstructive and generative paths to achieve diverse plausible results. Progressive frameworks have also been proposed for the inpainting task. For example, PRVS [10] added progressively reconstructed boundary maps as extra training targets to assist the inpainting process of a U-Net. Zhang et al. [32] used a curriculum LSTM learning strategy for semantic image inpainting to fill the holes progressively. Gong et al. [33] used high- and low-resolution networks to obtain interactive features progressively. A recurrent feature reasoning (RFR) network [12] progressively infers the hole boundaries to extract feature maps, enhancing consistency and improving performance.
Inspired by the primate visual cortex, the Inception series of networks [34], [35], [36], [37] explores multi-scale strategies to efficiently aggregate information from different receptive fields and increase the depth and width of the network. Cai et al. [38] applied a multi-scale body-part mask to facilitate feature extraction for person re-identification. Liu et al. [39] used multiple residual subnets with different receptive fields to extract multi-scale features and proposed a dual-branch module to fuse inter- and intra-correlations of multi-scale features. Some researchers [22], [40], [41] have proposed multi-column subnets and multi-scale attention to generate more realistic and complex results. Zeng et al. [42] learned a pyramid-context encoder and multi-scale decoder to restore the image. Kim et al. [43] utilized a multi-to-single frame scheme in a 3D-2D encoder-decoder network to aggregate temporal features for video inpainting. Li et al. [44] used two generators with multi-scale structures and textures to reconstruct a damaged image. Mo et al. [45] used four discriminators on global and local inpainting results, which improved performance but ignored the linkages between multi-scale features. Zhou et al. [46] proposed real/fake and collocation discriminators based on a pyramid-style extractor and an outfit generator to supervise synthetic fashion items from global and local viewpoints. Yang et al. [47] utilized a shared multi-task generator with edge and gradient branches to complete the corrupted image, encouraging the generator to exploit relevant structural knowledge for inpainting. Wang et al. [48] proposed a multi-scale adaptive partial convolution and a stroke-like mask to restore Thangka murals. This improved partial convolution is similar to our MRSF operator in that it extracts multi-scale features using multiple kernels of different sizes; however, it fuses features with static weights and does not select valuable features.

C. ATTENTION-BASED METHODS
Adopting the attention mechanism [49] has shown improvements across many computer vision tasks such as classification, detection, person Re-ID, etc. It helps the model to allocate features for the most informative regions and simultaneously suppress the irrelevant ones, which is highly demanded in the inpainting task. Contextual attention [20] searched for similar patches from the available contextual patches in a high-level semantic feature space, which led the generation model to obtain semantically consistent results. Song et al. [50] introduced a patch-swap layer that compared the similarity of the feature map from the VGG network and selected the most similar patch from the uncorrupted regions into the corrupted regions. Zeng et al. [42] introduced an attention transfer network (ATN) to learn region affinity in a high-level feature map and transfer the learned attention to the previous feature map of a pyramid context encoder. Xie et al. [51] designed a bidirectional attention model to renormalize features and update masks during the feature generation. Liu et al. [52] proposed coherent semantic attention, which initializes an unknown patch with the most similar known feature patch to guarantee global semantic consistency. It iteratively optimizes adjacent patches to ensure local spatial consistency, encodes more valuable features, and generates more coherent texture results. Liu et al. [53] introduced a bilateral attention layer that synthesizes feature patches by computing the relationships with distance and value, and it can produce visually plausible inpainting results. Qiu et al. [54] adopted spatial and channel attention to obtain long-distance pixels and useful channel features using a two-stage inpainting network.

D. BASED ON GENERATIVE ADVERSARIAL NETWORK (GAN) METHODS FOR INPAINTING
A generative adversarial network (GAN) usually consists of a generator that produces fake samples and a discriminator that differentiates fake from real samples. GAN-based methods formulate the inpainting task as a model-fitting problem that learns the distribution of missing pixels from a given training set, and they have had a lasting influence on image inpainting. GAN-based inpainting models can overcome the problem of blurry images and generate clearer textures. Pathak et al. [7] introduced a GAN into the image inpainting problem with both reconstruction and adversarial losses to generate reasonable semantic results; however, it is difficult to maintain global consistency. Perarnau et al. [55] proposed an invertible conditional GAN to control the high-level attributes of faces. Yeh et al. [8] used a global GAN to control the semantic attributes of generated images. Subsequently, Iizuka et al. [15] and Yu et al. [20] adopted both global and local GANs to complete corrupted images, which improved global and local consistency. Nazeri et al. [11] proposed a PatchGAN-based inpainting network to model patch details. Gated convolution (GC) presented SN-PatchGAN [9] to improve the quality of the completed images. The visual consistent network (VCN) [18] adopts a discriminative model to predict mask regions and then uses spatial normalization to enhance information aggregation in the inpainting network. Zhao et al. [56] proposed a co-modulated GAN with conditional and stochastic style representations that generates diverse and consistent content. PD-GAN [57] generated probabilistically diverse inpainting results by fitting the random noise distribution with a vanilla GAN, modulated features via a proposed SPDNorm to incorporate context constraints, and used a perceptual diversity loss.
Most of these methods discriminate only the final inpainting results, and the characteristics of the mid-level layers are not adequately modeled to control the generated content, making it difficult to maintain sharp textures during image completion.

III. PROPOSED APPROACH
This section first introduces the architecture of our inpainting network based on the two proposed modules. An overview is shown in Figure 2. We then describe the proposed Multi-Receptive feature maps & masks Selective Fusion (MRSF) operator and present the Multi-Gradients Discriminator (MGD) module.

A. NETWORK ARCHITECTURE
We refer to the U-Net architecture used in partial convolution [17] and design a one-stage, end-to-end inpainting network G with 18 layers, as shown in the blue box of Figure 2: eight layers in the encoder stage and ten layers in the decoder stage. Specifically, we use the U-Net with partial convolution as our baseline network. Furthermore, we refer to classic inpainting network structures, such as partial convolution (PC) [17], gated convolution (GC) [9], and recurrent feature reasoning (RFR) [12]. Their encoders often use larger convolution kernels in the first few layers to capture features over large receptive fields and smaller kernels in the deeper layers. Therefore, at the beginning of the encoder, we introduce several Multi-Receptive feature maps & masks Selective Fusion (MRSF) layers to preserve multi-scale information, and we recommend using two to six MRSF layers. Moreover, our network produces intermediate spatial-dimension recovery images with an L1 reconstruction loss in the last three up-sampling decoder layers. Finally, the intermediate recovery images, the final inpainting result, and the corresponding ground-truth images at each dimension flow into the Multi-Gradients Discriminator (MGD) to strengthen the decoder's guidance and obtain a more realistic result. The procedure of our inpainting network is summarized in Algorithm 1.

B. MULTI-RECEPTIVE FEATURE MAPS & MASKS SELECTIVE FUSION (MRSF)
In learning-based image inpainting models, extracting powerful features and obtaining better results are essential. Partial convolution (PC) [17] adopts mask updating and re-normalization to focus on valid pixels and extract more valuable features. However, each partial convolution layer uses one branch with a single receptive field, and the inpainting results still suffer from color distortions, blurred textures, and distorted structures [12]. Therefore, we introduce operators with multiple receptive fields to extract more valuable features. Extracting features with convolutions of different receptive fields is more conducive to capturing adequate features for inpainting models.
Based on these observations, we extract feature maps and masks with adaptive multi-scale receptive fields by using a different partial-convolution kernel size on each branch. Furthermore, motivated by SKNet [21], we propose a nonlinear attention fusion module to combine the multi-receptive feature maps and masks, and then use a global selective operator to obtain the final output. We call this module Multi-Receptive feature maps & masks Selective Fusion (MRSF); its architecture is shown in the orange box of Figure 2. The MRSF operator comprises three steps: extraction, fusion, and selection of multi-receptive-field features. In the MRSF, we use three receptive fields for ease of illustration.

1) EXTRACT MULTI-RECEPTIVE FIELDS FEATURE MAPS AND MASKS
For a given feature map X ∈ R^(H×W×C) and mask M ∈ R^(H×W×C), we adopt multiple partial convolutions PC_i with different kernel sizes to extract multi-receptive-field feature maps F_i ∈ R^(H×W×C) and masks M_i ∈ R^(H×W×C). All kernel sizes are in the set K; we use K = {3 × 3, 5 × 5, 7 × 7} and i ∈ K. The number of elements in K is denoted len(K). The multiple partial convolutions are formulated as follows:

F_i,y,x = β · W_i^T (X_y,x ⊙ M_y,x), if sum(M_y,x) > 0;  F_i,y,x = 0, otherwise, (1)

where X_y,x denotes the feature values in the convolution window at location (y, x), W_i is the i-th convolution kernel weight, β = sum(1)/sum(M_y,x) is the scaling factor that adjusts the result by the rate of valid inputs, and ⊙ is element-wise multiplication.
The mask M has binary values, where 1 represents a valid input and 0 an invalid input. The operator sum(M) sums the mask values in the convolution window. If sum(M) > 0, the partial convolution PC_i operates with the weights W_i and the scaling factor β on the valid inputs to obtain the feature value F_i,y,x, and it updates the location mask M_i,y,x to 1. If sum(M) = 0, the feature value F_i,y,x and mask M_i,y,x are 0. After the partial convolution operation, the mask is thus heuristically determined to be valid or invalid and is updated by the rule that a location's mask is valid when at least one valid input value lies in the previous layer's convolution window.
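The extraction step above can be sketched in NumPy. This is a minimal single-channel illustration, not the authors' PyTorch implementation: a mean kernel stands in for the learned weights W_i, the bias is omitted, and `partial_conv`/`extract_multi_receptive` are hypothetical helper names.

```python
import numpy as np

def partial_conv(X, M, k):
    """Single-channel partial convolution (Eq. (1)) with a k x k window.

    X: (H, W) feature map, M: (H, W) binary mask (1 = valid).
    Returns the re-normalized feature map F and the updated mask M'.
    """
    H, W = X.shape
    p = k // 2
    Xp = np.pad(X, p)                  # zero-pad so the output keeps H x W
    Mp = np.pad(M, p)
    F = np.zeros_like(X, dtype=float)
    M_new = np.zeros_like(M)
    Wk = np.ones((k, k)) / (k * k)     # mean kernel as a stand-in for learned W_i
    for y in range(H):
        for x in range(W):
            m_win = Mp[y:y + k, x:x + k]
            s = m_win.sum()
            if s > 0:
                beta = (k * k) / s     # re-normalize by the rate of valid inputs
                x_win = Xp[y:y + k, x:x + k]
                F[y, x] = beta * np.sum(Wk * x_win * m_win)
                M_new[y, x] = 1        # valid if any input in the window was valid
    return F, M_new

def extract_multi_receptive(X, M, K=(3, 5, 7)):
    """One partial convolution per kernel size in K (the extraction step)."""
    return [partial_conv(X, M, k) for k in K]
```

Note how the larger 7 × 7 branch recovers a small hole in a single layer, while the 3 × 3 branch leaves the hole's interior invalid, which is exactly the multi-receptive behavior MRSF later selects among.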
Partial convolution improved the inpainting quality. However, when the mask values of the previous layer in the convolution window are used to set the mask value of the next layer, partial convolution treats one valid pixel or all valid pixels in the convolution window of the previous layer in the same manner (as shown in equation (1)). Therefore, this study uses the Fuse and Select operator to learn optimal masks and feature maps, which automatically adjust their receptive field sizes according to the content to gradually convert all mask values to valid and extract the adaptive feature maps.

2) FUSE
For neurons to adaptively adjust feature maps and masks of different receptive fields, we use nonlinear operators to fuse them. MRSF takes the features F_i and masks M_i as inputs. First, an element-wise sum combines the multi-receptive-field feature maps and masks:

F = Σ_{i∈K} F_i,  M = Σ_{i∈K} M_i.

Then, we apply global average pooling (GAP) to F and M to compute the channel-wise global information s_F ∈ R^(1×1×C) and s_M ∈ R^(1×1×C). Each element of s is the mean value of F or M over the spatial dimensions H × W.
Next, we apply a shared fully connected (FC) layer with the ReLU function to obtain the compact features z_F ∈ R^(1×1×r) and z_M ∈ R^(1×1×r), where r is the reduced dimension that lowers the computation of the FC layer; we set r = C/8 in the experiments. This can be formulated as follows:

z_F = ReLU(FC(GAP(F))),  z_M = ReLU(FC(GAP(M))). (3)

3) SELECT
To dynamically select information from the different receptive-field feature maps F_i and masks M_i, we design a channel-attention module on the compact features z_F and z_M. First, this module uses multiple FC layers to transform z_F and z_M into z_{F,i} ∈ R^(1×1×C) and z_{M,i} ∈ R^(1×1×C), where i ∈ K and the number of FC layers is len(K).
Subsequently, we concatenate z_{F,i} and z_{M,i} along the second dimension to obtain z_F ∈ R^(1×len(K)×C) and z_M ∈ R^(1×len(K)×C). Next, a softmax operator is applied along the second dimension of z_F and z_M to obtain the channel-attention scores:

a_F = softmax(z_F),  a_M = softmax(z_M),

where a_F ∈ R^(1×len(K)×C) and a_M ∈ R^(1×len(K)×C). Then, a_F and a_M are split along the second dimension into a_{F,i} ∈ R^(1×1×C) and a_{M,i} ∈ R^(1×1×C), where i ∈ K.
Finally, it selects among the different receptive-field feature maps F_i and masks M_i using the channel-attention scores a_{F,i} and a_{M,i}. The final aggregated feature map and mask are defined as:

F_out = Σ_{i∈K} a_{F,i} ⊙ F_i,  M_out = Σ_{i∈K} a_{M,i} ⊙ M_i.

It is worth mentioning that MRSF has fewer parameters than concatenation with a 1 × 1 convolution and achieves better inpainting results, as shown in the ablation experiments.
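The Fuse and Select steps (GAP, shared FC + ReLU, per-branch FC, softmax, weighted sum) can be sketched for the feature path as follows. This is an illustrative NumPy sketch, not the authors' implementation: random matrices stand in for the learned FC weights, and `fuse_and_select` is a hypothetical helper name; the mask path is handled identically in parallel.

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_select(feats, W1, W2s):
    """Fuse and Select over multi-receptive features.

    feats: list of len(K) arrays, each (H, W, C).
    W1: (C, r) shared FC weights; W2s: list of len(K) (r, C) per-branch FC weights.
    The weights here are random stand-ins for learned parameters.
    """
    F = np.sum(feats, axis=0)                       # element-wise sum across branches
    s = F.mean(axis=(0, 1))                         # GAP -> channel descriptor (C,)
    z = np.maximum(s @ W1, 0)                       # shared FC + ReLU, Eq. (3)
    zi = np.stack([z @ W2 for W2 in W2s], axis=0)   # per-branch FC -> (len(K), C)
    a = softmax(zi, axis=0)                         # attention scores sum to 1 per channel
    out = np.sum([a[i] * feats[i] for i in range(len(feats))], axis=0)
    return out, a
```

Because the softmax is taken across the branch axis, each channel's attention weights form a convex combination over the len(K) receptive fields, which is the "selection" behavior described above.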

C. MULTI-GRADIENTS DISCRIMINATOR (MGD)
By training the encoder-decoder network, the pixels in the missing area can be filled with the reconstruction loss but may not be visually consistent. To obtain more realistic results, inspired by [9] and [58], we developed a Multi-Gradients Discriminator (MGD) to strengthen the guiding force of the decoder. The MGD module uses a single PatchGAN [23] discriminator that connects the final inpainting result with the intermediate decoder reconstructions at different scales, letting gradients flow to multiple stages of the inpainting process. The architecture is illustrated by the green box in Figure 2. Compared with using a separate discriminator at each stage, our MGD module has fewer parameters, can share information across stages, and provides more effective guidance for this task.
First, we define the function r_k that outputs the corresponding spatial-dimension recovery image I_out,k at each decoder stage. The recovery images I_out,k correspond to different downsampled versions of the ground-truth image, I_gt,k. We model r_k simply as two convolutions that learn the transformation from feature maps into RGB space, converting the intermediate decoder-layer feature map g_k into the recovery image:

I_out,k = r_k(g_k).

In our experiments, we use the feature maps g_k of the last three decoder resolutions to predict the corresponding recovery images I_out,k. We set k ∈ {1, 2, 3}, where the elements of k correspond to the H/2 × W/2, H/4 × W/4, and H/8 × W/8 spatial dimensions.
Then, the recovery image I out,k is returned to the next decoder layer through a concatenation operation to improve the final inpainting result.
The discriminator d is the network ψ. The inputs of ψ are the recovery outputs I_out^all and the corresponding ground truths I_gt^all. The recovery outputs I_out^all include the final output of the decoder, I_out, and the intermediate recovery images I_out,k. The ground truths I_gt^all include the ground-truth image I_gt and the downsampled ground-truth images I_gt,k. This discriminator makes gradients flow from its intermediate layers to the decoder's intermediate layers. The network ψ consists of resolution layers ψ_i with i ∈ {0, k}: the inputs of the first resolution layer ψ_0 are I_out and I_gt, and the inputs of the other resolution layers are I_out,k and I_gt,k.
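Preparing the multi-scale MGD inputs requires the downsampled ground-truth targets I_gt,k at H/2, H/4, and H/8. The sketch below builds that pyramid; it is illustrative only, `downsample2`/`gt_pyramid` are hypothetical helper names, and 2× average pooling stands in for whatever downsampling the authors use.

```python
import numpy as np

def downsample2(img):
    """2x average-pool downsampling of an (H, W, C) image (H, W assumed even)."""
    H, W, C = img.shape
    return img.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def gt_pyramid(I_gt, levels=3):
    """Ground-truth targets I_gt,k at H/2, H/4, H/8 for the MGD inputs."""
    pyr = []
    img = I_gt
    for _ in range(levels):
        img = downsample2(img)
        pyr.append(img)
    return pyr
```

Each level of this pyramid is paired with the matching recovery image I_out,k: the full-resolution pair (I_out, I_gt) enters the first resolution layer ψ_0, and the pyramid pairs enter the deeper resolution layers.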

D. LOSS FUNCTIONS
Our Multi-Gradients Discriminator (MGD) requires intermediate recovery images, so the generator must produce both the intermediate recovery images and the final inpainting result. We impose a pyramid L1 reconstruction loss [47] to obtain the intermediate recovery images. For the final inpainting result, we adopt several common loss functions to train our inpainting network: the reconstruction loss [17], perceptual loss [59], style loss [23], adversarial loss [60], and variation loss [59]. We define the input image with the mask as I_in, the binary mask as M (1 for valid), the final inpainting prediction as I_out, the ground-truth image as I_gt, the intermediate recovery images as I_out,k, the corresponding downsampled versions of the ground truth as I_gt,k, and the completed image as I_comp. The network obtains the final completed image as follows:

I_comp = M ⊙ I_in + (1 − M) ⊙ I_out.
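The composition step, which keeps the valid input pixels and fills only the holes with the network's prediction, is a one-liner. A minimal NumPy sketch (`compose` is a hypothetical helper name; the mask convention follows the text, 1 = valid):

```python
import numpy as np

def compose(I_in, I_out, M):
    """I_comp = M * I_in + (1 - M) * I_out: keep valid pixels, fill holes."""
    return M * I_in + (1 - M) * I_out
```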

1) ADVERSARIAL LOSS
The MGD module is beneficial for generating more realistic completed images. We denote the adversarial loss [60] as:

L_adv = E[log d(I_gt^all)] + E[log(1 − d(I_out^all))],

where d denotes the MGD discriminator.

2) INTERMEDIATE PYRAMID RECONSTRUCTION LOSS
We use the pyramid L1-norm distance to guide the intermediate recovery images and define the intermediate pyramid reconstruction loss as follows:

L_inter = Σ_k ||I_out,k − I_gt,k||_1,

i.e., the L1-norm distance between the predicted intermediate recovery images I_out,k and the corresponding ground truths I_gt,k.
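The pyramid reconstruction loss is a straightforward sum of per-level L1 distances. A minimal NumPy sketch, using the per-pixel mean at each level (the exact normalization is not specified in the text):

```python
import numpy as np

def pyramid_l1(outs_k, gts_k):
    """L_inter: sum over pyramid levels of the mean L1 distance |I_out,k - I_gt,k|."""
    return sum(np.mean(np.abs(o - g)) for o, g in zip(outs_k, gts_k))
```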

3) RECONSTRUCTION LOSS
The reconstruction losses [17] L_re−hole and L_re−valid calculate the L1-norm distance between the final inpainting output I_out and the ground truth I_gt in the masked and unmasked areas, respectively. They are defined as:

L_re−hole = ||(1 − M) ⊙ (I_out − I_gt)||_1,
L_re−valid = ||M ⊙ (I_out − I_gt)||_1.
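A minimal NumPy sketch of the two reconstruction terms, normalized by the number of pixels in each region (an assumption; the text does not state the normalization):

```python
import numpy as np

def reconstruction_losses(I_out, I_gt, M):
    """Per-pixel L1 over the hole (M = 0) and valid (M = 1) regions, respectively."""
    hole = np.sum(np.abs((1 - M) * (I_out - I_gt))) / max((1 - M).sum(), 1)
    valid = np.sum(np.abs(M * (I_out - I_gt))) / max(M.sum(), 1)
    return hole, valid
```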

4) PERCEPTUAL LOSS
Because the reconstruction loss alone lacks high-level semantics, we use the perceptual loss [59] to compute the L1-norm distance between the high-level semantic features φ_i(I_out) and φ_i(I_gt), where the feature map φ_i has size C_i × H_i × W_i. The perceptual loss is defined as:

L_perc = Σ_i ||φ_i(I_out) − φ_i(I_gt)||_1 / (C_i · H_i · W_i),

where φ is a VGG-16 classification model pre-trained on ImageNet that projects the images into higher-level feature spaces, and φ_i is the activation map of the i-th specified layer. In our study, φ_i corresponds to layers pool_1, pool_2, and pool_3 of VGG-16.
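Given the layer activations, the perceptual loss reduces to normalized L1 distances summed over layers. The sketch below takes pre-computed feature maps as input so that it runs without a VGG-16; any arrays stand in for the real activations, and `perceptual_loss` is a hypothetical helper name.

```python
import numpy as np

def perceptual_loss(feats_out, feats_gt):
    """Sum over layers of ||phi_i(I_out) - phi_i(I_gt)||_1 / (C_i * H_i * W_i).

    feats_out / feats_gt: lists of (C_i, H_i, W_i) activation maps from the
    chosen layers (stand-ins here for real VGG-16 pool_1..pool_3 activations).
    """
    return sum(np.abs(fo - fg).sum() / fo.size
               for fo, fg in zip(feats_out, feats_gt))
```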

5) STYLE LOSS
The style loss [23] is similar to the perceptual loss and calculates the L1-norm distance between the Gram matrices of the activation maps φ_i. The style loss L_style is written as:

L_style = Σ_i ||φ_i(I_out) φ_i(I_out)^T − φ_i(I_gt) φ_i(I_gt)^T||_1,

where φ_i φ_i^T yields a C_i × C_i Gram matrix, and the activation maps φ_i are the same as those used in the perceptual loss.
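The Gram-matrix computation is the only non-trivial part of the style loss. A minimal NumPy sketch, again taking pre-computed activations and normalizing each Gram matrix by C·H·W (a common convention, assumed here):

```python
import numpy as np

def gram(phi):
    """Normalized C x C Gram matrix of a (C, H, W) activation map."""
    C, H, W = phi.shape
    f = phi.reshape(C, H * W)          # flatten spatial dims per channel
    return (f @ f.T) / (C * H * W)

def style_loss(feats_out, feats_gt):
    """Sum over layers of the mean L1 distance between Gram matrices."""
    return sum(np.abs(gram(fo) - gram(fg)).mean()
               for fo, fg in zip(feats_out, feats_gt))
```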

6) VARIATION LOSS
The variation loss [59] L_tv smooths the image intensity in the hole region. It is defined as:

L_tv = Σ_{(m,n)} ||I_comp^{m,n+1} − I_comp^{m,n}||_1 + ||I_comp^{m+1,n} − I_comp^{m,n}||_1,

where m and n are pixel coordinates.
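The variation term sums absolute differences between horizontally and vertically adjacent pixels. A minimal NumPy sketch over the whole image (the restriction to the hole region would simply mask the differences first):

```python
import numpy as np

def tv_loss(I):
    """Total-variation smoothness: mean |horizontal diff| + mean |vertical diff|."""
    dh = np.abs(I[:, 1:] - I[:, :-1]).mean()   # neighbors along columns
    dv = np.abs(I[1:, :] - I[:-1, :]).mean()   # neighbors along rows
    return dh + dv
```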

7) FINAL LOSS
The final objective function for the proposed network can be written as follows:

L_final = λ_rh L_re−hole + λ_rv L_re−valid + λ_p L_perc + λ_s L_style + λ_tv L_tv + λ_a L_adv + λ_inter L_inter, (15)

where λ_rh, λ_rv, λ_p, λ_s, λ_tv, λ_a, and λ_inter are hyperparameters. In our experiments, we set λ_rh = 1, λ_rv = 6, λ_p = 0.05, λ_s = 120, λ_tv = 0.1, λ_a = 0.1, and λ_inter = 1, following several similar methods [17], [61], [62]. The reconstruction losses L_re−hole and L_re−valid [17] minimize the distance between the final inpainting outputs and the ground-truth images. The perceptual loss L_perc [59] and style loss L_style [23] minimize the distance in high-level feature spaces, guiding the network to learn more semantically and stylistically consistent features. The variation loss L_tv reduces checkerboard artifacts and keeps the image smooth [59]. The intermediate pyramid reconstruction loss L_inter recovers intermediate images via the L1-norm distance so that the network can feed the recovered intermediate images into L_adv. The adversarial loss L_adv [60] discriminates the multi-scale recovery results against the ground truth to guide more semantically real content.
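Combining the terms of Eq. (15) with the stated weights is then a weighted sum. A minimal sketch (the `WEIGHTS` keys are shorthand introduced here, not the paper's notation):

```python
# Hyperparameters from Eq. (15): rh, rv, p, s, tv, a, inter.
WEIGHTS = dict(rh=1.0, rv=6.0, p=0.05, s=120.0, tv=0.1, a=0.1, inter=1.0)

def final_loss(losses, w=WEIGHTS):
    """Weighted sum of the loss terms; `losses` maps the same keys to scalars."""
    return sum(w[k] * losses[k] for k in w)
```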

IV. EXPERIMENTS
This section reports the experimental results on two benchmark datasets. Visual quality and quantitative metric scores are compared with several state-of-the-art methods. Moreover, ablation studies of our model are presented.

A. EXPERIMENTAL SETTINGS 1) IMPLEMENTATION DETAILS
The experiments were implemented in PyTorch [63] and trained on a single GTX 1080Ti (11 GB) with a batch size of 10 images. Adam [64] was used to optimize the network. We trained our network from scratch and then fine-tuned it. At the scratch stage, we set the learning rates to 2 × 10^−4 and 2 × 10^−5 for the generator and discriminator, respectively. At the fine-tuning stage, we set the learning rates to 5 × 10^−5 and 5 × 10^−6 for the generator and discriminator, respectively, and froze all Batch Normalization layers.

2) DATASETS
We evaluated our model on the CelebA-HQ [65] and Places2 [66] datasets.
1) CelebA-HQ Dataset: A high-quality version of the CelebA dataset, consisting mainly of human faces. We used the first 2,000 images for testing and the remaining 28,000 images for training.
2) Places2 Dataset: Released by MIT, this dataset contains over 1.8M natural images from 365 different scene categories. We used the original training/validation partition and randomly selected 1,000 images from the original validation set for testing.

3) EVALUATION METRICS
Following previous works [9], [12], [17], we chose the mean L1 distance (L1), peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and Fréchet inception distance (FID) to evaluate our model. L1 and PSNR measure the pixel-level similarity between two images. SSIM measures structural similarity via the distortions of the generated image relative to the ground truth. FID calculates a distance on deep high-level representations, which is closer to human perception [67].
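The two pixel-level metrics are simple to state precisely. A minimal NumPy sketch of mean L1 and PSNR for images in [0, 1] (SSIM and FID require windowed statistics and a pre-trained Inception network, respectively, and are omitted here):

```python
import numpy as np

def mean_l1(a, b):
    """Mean absolute pixel difference between two images."""
    return np.mean(np.abs(a - b))

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming pixel values in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return float('inf') if mse == 0 else 10 * np.log10(peak ** 2 / mse)
```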

B. PERFORMANCE EVALUATION
We compared the proposed model with the following state-of-the-art methods.
1) Partial convolution (PC) [17]: An encoder-decoder inpainting model that uses partial convolution in a U-Net architecture to handle irregularly damaged areas.

FIGURE 3. Qualitative results on CelebA-HQ with PC [17], GC [9], RFR [12], VCN [18], PUT [19], and our model. The last two rows of images display the local details in the zoomed-in red box.

FIGURE 4. Qualitative results on Places2 with PC [17], GC [9], RFR [12], VCN [18], PUT [19], and our model. The last two rows of images display the local details in the zoomed-in red box.
2) Gated convolution (GC) [9]: A two-stage inpainting model uses the gated convolution and the SN-PatchGAN to complete the damaged image.
3) Recurrent feature reasoning (RFR) [12]: A recurrent inpainting model that proposes knowledge-consistent attention and progressively infers missing content with partial convolution in a U-Net architecture.
4) Visual consistent network (VCN) [18]: A two-stage inpainting method that first uses a discriminative model to predict visually inconsistent regions as a mask, and then uses probabilistic context normalization to repair these masked regions.
5) PUT [19]: A transformer-based pluralistic image inpainting method that adopts a patch-based auto-encoder VQ-VAE [14] and an unquantized transformer to reduce computation cost and information loss.

1) QUANTITATIVE COMPARISONS
As can be observed from Tables 1 and 2, our approach achieves good performance when filling irregular masks with different mask ratios. It outperforms all the other benchmark methods in terms of all metrics. The results show that our approach, with the proposed MRSF operator and MGD module, can effectively encode multi-scale feature maps and improve inpainting performance.
2) QUALITATIVE COMPARISONS
Figures 3 and 4 show the qualitative results for the validation sets of CelebA-HQ and Places2, respectively. The inpainting results from PC [17] and GC [9] contain warped structures, noticeable artifacts, and blurry textures. Inpainting results from RFR [12] suffer from artifacts and inconsistencies near the mask boundaries. The recovered images from VCN [18] and PUT [19] have curved edges/hair, inconsistent contents, and incomplete objects. Specifically, as shown by the local details in the zoomed-in red boxes, our method is good at generating distinct and smooth contours (e.g., the edges of glasses, chin, and pool) and more refined details (e.g., the eye and foot regions). Compared with all baselines, our results have better consistency, more complete structures, and richer textures. We attribute this merit to the proposed MRSF, which extracts better global features to encode the image structures, and to the MGD module, which guides the reconstruction by the decoder to achieve better inpainting outputs.

C. ABLATION STUDY
We further studied the effectiveness of each proposed component, the fusion and selection operations in MRSF, and the number and location of the MRSF layers in the encoder.
All the experiments in this section were derived from the CelebA-HQ dataset with irregular masks.

1) EFFECTIVENESS OF PROPOSED COMPONENTS
To analyze the effectiveness of each component in the proposed inpainting framework, we performed experiments based on the U-Net architecture. Our baseline network included 18 layers [23]. Next, we added the MGD module, the intermediate pyramid reconstruction loss (denoted as Pyramid-L1), and the MRSF operator, respectively. Then, we used the MGD module and Pyramid-L1 together. Finally, we integrated the MGD module, Pyramid-L1, and the MRSF operator into the inpainting network, denoted as Ours. The results are presented in Table 3. The performance of our framework improved as each component was integrated. First, comparing ''Baseline'' with ''With MGD'' and ''With Pyramid-L1'' shows that adding either MGD or Pyramid-L1 improves performance. Second, comparing ''Baseline'' with ''With MGD, Pyramid-L1'' shows that performance improves further when both are used. These two analyses demonstrate that the guidance information introduced by MGD and Pyramid-L1 helps to generate better inpainting outputs. Third, after adding the MRSF operator to the baseline network, the overall scores of ''With MRSF'' increased by a large margin, which indicates that the MRSF operator plays an essential role by encoding the corrupted images into strong features for reconstruction. Finally, our full inpainting network achieves the best performance.
Figure 5 presents some examples of the inpainting results for qualitative comparison. The baseline model produced images with boundary artifacts (column 2 in Figure 5), particularly in mask regions with a strip shape. Adding the MGD module reduces these artifacts (columns 3 and 5 in Figure 5). Adding the MRSF operator results in fewer artifacts (column 6 in Figure 5), which verifies the effectiveness of the adaptive multi-receptive-field features in the MRSF operator. Our model combines the MRSF operator, MGD module, and Pyramid-L1 loss, and produces no noticeable artifacts (column 7 in Figure 5). We can conclude that the MRSF operator characterizes better multi-scale representations, the MGD module provides clear guidance for the decoder, and our completion model achieves good outputs with fewer artifacts, better structural consistency, and refined texture details.

2) EFFECTIVENESS OF THE FUSION AND SELECTION OPERATION IN MRSF
To verify the effectiveness of the fusion and selection operations in MRSF, we compared the proposed fusion & selection operation with common alternatives, namely summation and concatenation. This experiment used the different operations to process the multi-receptive-field feature maps and masks and obtain the final results. Note that summation and concatenation only replace the selection step and do not include the fusion module. Specifically, the concatenation operation first concatenates the multi-receptive-field feature maps and then uses a 1 × 1 convolution to fuse the feature maps and masks. The summation operation directly adds the multi-receptive-field feature maps as the outputs. As shown in Table 4, the concatenation operation failed to recover the image; therefore, it is unsuitable for combining multi-receptive-field feature maps. The summation operation also worked, but the fusion & selection operation obtained the best inpainting results by extracting more robust feature maps that consider different receptive fields, at the cost of a few more parameters and FLOPs.
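For reference, the two baseline combination schemes discussed above might look as follows in PyTorch; the class and function names are hypothetical, and this is not the authors' MRSF implementation:

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the two baseline ways of combining
# multi-receptive-field feature maps compared in the ablation;
# names and shapes are illustrative, not the authors' MRSF code.
class ConcatFuse(nn.Module):
    """Concatenate branch outputs, then fuse with a 1x1 convolution."""
    def __init__(self, channels, branches):
        super().__init__()
        self.fuse = nn.Conv2d(channels * branches, channels, kernel_size=1)

    def forward(self, feats):  # feats: list of (N, C, H, W) tensors
        return self.fuse(torch.cat(feats, dim=1))

def sum_fuse(feats):
    """Directly add the branch outputs element-wise."""
    return torch.stack(feats, dim=0).sum(dim=0)
```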

3) NUMBER AND LOCATION OF MRSF LAYER IN ENCODER
To determine the best number and location of the MRSF layers, we started from a U-Net with eight encoder layers. In theory, in deeper layers of the model, the same spatial size on the feature map corresponds to a larger receptive field. Therefore, we separately converted the first two, four, six, and eight layers to MRSF; the comparison results are presented in Table 5. It can be observed that the best inpainting results do not come from the network with the most MRSF layers. Using four or six MRSF layers is better than using eight. This observation indicates that with more than six MRSF layers in the encoder, the improvement from the multi-receptive-field feature maps becomes less significant. We conjecture that this is because, after the sixth layer, the receptive fields of the feature maps are already sufficient for the encoder to generate the final results.
We then conducted experiments to verify the effect of the location of the MRSF layers. Based on the results above on the number of MRSF layers, we used four consecutive MRSF layers to replace layers in the encoder. The experiments placed the MRSF operator in layers 1-4, 2-5, 3-6, 4-7, and 5-8, respectively. The comparison results are presented in Table 6. The best result was obtained with MRSF in layers 1-4. This verifies that MRSF operators placed in the front layers can capture more useful multi-receptive-field features.
Based on these observations, we used the MRSF operator in the first four layers of the encoder.

V. CONCLUSION
We presented the multi-reception and multi-gradient discriminator network for image inpainting, which generates realistic results with reasonable semantics and richer details. Specifically, the encoder adopts the proposed MRSF operator to assist partial convolution, capturing robust feature maps from different receptive fields to suppress artifacts. Furthermore, we adopted the MGD module in the discriminator to enhance guidance for image inpainting. Qualitative studies demonstrated that the proposed approach generates more coherent images with fine details, and the quantitative results demonstrated superior performance over several state-of-the-art inpainting methods.