Image Inpainting Based on Interactive Separation Network and Progressive Reconstruction Algorithm

Recently, learning-based image inpainting has gained much attention. It widely utilizes an auto-encoder structure, whose encoder obtains a compact feature representation to achieve high-quality image inpainting. Although this approach has achieved encouraging inpainting results, it inevitably loses high-resolution representation due to interval downsampling. In order to solve this problem and achieve an excellent image inpainting effect, this paper proposes a brand-new generative network, the Interactive Separation Network, which retains high-resolution information while extracting semantic features from corrupted images. Furthermore, this paper also discusses network designs of different complexity for different application scenarios. Finally, to improve the effectiveness and robustness of our proposal on large corrupted regions, we further propose a flexible and highly reusable reconstruction scheme that gradually completes the inpainting during the prediction process. Experiments show that the proposed generative network and reconstruction scheme can significantly improve the quality of repaired images, and the proposed method significantly outperforms state-of-the-art image inpainting approaches in image quality.


I. INTRODUCTION
Image inpainting, a.k.a. image completion, refers to the process of filling in missing content in damaged images. It plays a critical role in various computer vision problems, such as object or artifact removal, 3D reconstruction, and depth-image-based rendering (DIBR). Due to the complexity of the context in different scenes, image inpainting is one of the challenging problems in imaging tasks.
The last decade has seen a growing trend toward convolutional neural networks (CNN) [1], [2] and generative adversarial networks (GAN) [3], so they have obtained much attention. CNNs can learn high-level representations of images, so they deliver marvelous success in a variety of computer vision tasks (e.g., image classification [4] and object recognition [5]). Besides, based on GANs, the content produced by a generative network is more realistic and conforms more closely to human visual patterns. Benefiting from these techniques, many approaches [6]-[12] are able to recover damaged images with strong prior knowledge for the semantic understanding of scenes. They have achieved encouraging results, which have driven the rapid development of image inpainting in recent years.
The CNN-based encoder-decoder architecture [6] has been widely employed in image inpainting, and most such models are similar to Unet-style [13] structures. Although the pooling layers in a Unet-style network can compress features into a compact representation, they inevitably discard numerous high-resolution signals in the spatial dimension due to interval downsampling, as observed in Figure 1. The core problem is the lack of an efficient one-stage model that pays sufficient attention to both the textured patterns and the semantic analysis of images in a learning-based fashion. Strided convolution or strided pooling reduces the spatial representation of features by interval downsampling; although these operations compress the data efficiently, they inevitably throw away some of the original information.
There are several attempts [7], [9] to perform semantic inpainting in a learning-based fashion. Nevertheless, these approaches do not maintain the trade-off between texture content and structural semantics in images. For example, as shown in Figure 2, the image recovered by GLCIC [7] has a large region of texture artifacts, and Contextual Attention (CA) [9] crafts a well-structured massif but fails on the structure of the clouds. To acquire a better inpainting effect, researchers have also begun to investigate legible inpainting with the perceived semantics of images. One common way is to build an additional network [12] to further refine the general results. The result of EdgeConnect [12] in Figure 2 displays fairly general structures of objects, but many regions contain subtle flaws in the completed textures, which make the overall inpainting effect mediocre. Moreover, these kinds of approaches need a large amount of computation and increase space complexity because they commonly require two large networks.
In order to solve the above problems and achieve a better image inpainting effect, this paper proposes a brand-new generative network, named the Interactive Separation Network (ISNet), which maintains the balance between textured patterns and semantic context through two well-designed network branches. For convenience of interpretation, this paper defines three main operations: Inpainting, Interaction, and Aggregation (the definitions are similar to the literature [14]). These operations manipulate the feature representation and form independent stages in ISNet, and the step-by-step connection of these stages constitutes the main body of ISNet. A brief overview of the ISNet framework is shown in Figure 3. In order to solve an inherent problem of current image inpainting approaches, namely the lack of robustness to large damaged regions, we further propose an efficient, straightforward, and highly reusable algorithm that progressively completes the image during the prediction process. The proposed method is demonstrated to achieve state-of-the-art inpainting results. The pre-trained models and code for the network structure, progressive inpainting, and ablation research can be accessed at https://github.com/GuardSkill/Large-Scale-Feature-Inpainting/tree/journal.
In summary, the main contributions of this article are as follows:
• We propose an efficient completion network (ISNet) that can both understand the scene and recognize the texture pattern of images. Experimental results show that it achieves excellent image inpainting performance;
• We propose an efficient and highly reusable completion scheme that progressively completes images in the prediction stage to improve the robustness of the proposed network structure;
• In order to verify that our proposed method can be widely applied to different situations, the experiments study the efficiency of ISNet under diverse network structure settings, which can leave valuable experience for researchers designing their own models.
II. RELATED WORK
In the past few years, a variety of learning-based approaches have flourished in the field of image inpainting. Benefiting from CNNs, well-trained generative networks can perform high-level recognition of diverse scenes. However, even though several earlier CNN-based approaches [21], [22] were designed to restore corrupted or text-covered images, these early learning-based inpainting approaches lack versatility for irregular masks of different sizes because they only handle very small and thin holes. At that time, the generalization of inpainting models still needed to be improved. In order to achieve higher-quality image inpainting results, several improved methods have been proposed. Specifically, there are three types: learning-based image inpainting approaches, image inpainting approaches that introduce perceptual loss, and progressive image inpainting approaches.

A. LEARNING-BASED INPAINTING APPROACHES
Goodfellow proposed the Generative Adversarial Network (GAN) [3], which calculates an adversarial loss via a primary network called the generator and an auxiliary network called the discriminator. Using this kind of adversarial learning, GAN has attracted major research interest from various fields. One of the typical and classic learning-based inpainting approaches is the Context Encoder [6], in which the encoder embeds the input image into high-level feature maps with a low spatial dimension, and the decoder then exploits the compact features to reconstruct the original image. This approach has shown that a learning-based architecture has the extraordinary ability to understand image context and hallucinate realistic objects in the completed region. Influenced by this exploratory contribution, Unet-style networks [13] have been widely used as generative models among learning-based approaches in the field of image inpainting. However, due to the highly efficient compression scheme and sequential structure of the Context Encoder, the generated content in the resulting images is excessively smooth and visually obscured.
Following the Context Encoder, another well-known learning-based inpainting approach was proposed by Iizuka [7], who built two discriminators to respectively verify the authenticity of the overall image and of the inpainted regions. The discriminators feed realism scores back to the generative model to recover images with comprehensive coherence. However, style inconsistencies still exist between the completed region and the existing region, which makes their results greatly dependent on post-processing. Due to the lack of optimization of the discriminators, the intricate training procedure of Iizuka's work is time-consuming and unstable.
However, previous methods often lead to issues such as inter-frequency conflicts and repair impairments, since they simply apply together different losses that focus on synthesizing content at different frequencies. Therefore, Yu et al. [23] proposed a wavelet-based inpainting network, which decomposes the image into multiple frequency bands and fills the missing regions in each band, applying an L1 reconstruction loss in the low-frequency bands and an adversarial loss in the high-frequency bands, thereby effectively alleviating inter-frequency conflicts.

B. PERCEPTUAL LOSS IN IMAGE INPAINTING
Liu et al. [8] introduced a highly weighted style loss term to produce structural content, which enables high-level recognition of content and style. Although their results are visually plausible, some of their inpainted images contain excessively smooth content in the filled region, and checkerboard artifacts still appear in unofficial reproductions of their work.
In the model of the Partial Convolution Network (PConv) [8], the perceptual loss and style loss [24], [25] are considered two of the objective function terms to be minimized. The perceptual loss encourages the model to generate images whose high-level representation is similar to that of the original image. The perceptual loss can be expressed as Eq. (1):

L_perc = Σ_{i=1}^{L} (1/N_i) ||φ_i(I_pred) − φ_i(I_gt)||_1    (1)

where I_pred and I_gt are the image repaired by the generative model and the corresponding ground-truth image, and φ_i(x) is an image feature extractor, which uses VGG [4] to extract the corresponding features of the image. Here, φ_i corresponds to the values in the feature maps from pool1, pool2, and pool3, so L equals 3 in this paper. N_i represents the number of elements in φ_i, and this formula can be understood as the Mean Absolute Error [26] in the higher-level feature spaces. The style loss penalizes the difference between the predicted image and the ground truth in terms of style and general tone through the correlation of feature maps. The style loss can be formalized as Eq. (2):

L_style = Σ_j (1/(C_j C_j)) ||G_j(I_pred) − G_j(I_gt)||_1    (2)

where G_j is a C_j × C_j normalized Gram matrix and C_j C_j refers to the normalization factor.
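The two losses above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact implementation: the VGG feature extractor is omitted, the function names are our own, and each layer's L1 term is averaged over its N_i elements as in Eq. (1).

```python
import torch
import torch.nn.functional as F

def perceptual_loss(feats_pred, feats_gt):
    """Eq. (1): L1 distance between high-level feature maps, per layer.

    feats_pred / feats_gt are lists of activations (e.g. pool1-pool3 of
    a frozen VGG, not shown here) for the predicted and ground-truth
    images. F.l1_loss already divides by the element count N_i.
    """
    loss = 0.0
    for fp, fg in zip(feats_pred, feats_gt):
        loss = loss + F.l1_loss(fp, fg)
    return loss

def gram_matrix(feat):
    """(B, C, H, W) -> (B, C, C) Gram matrix, normalized by C*H*W."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def style_loss(feats_pred, feats_gt):
    """Eq. (2): L1 distance between normalized Gram matrices."""
    loss = 0.0
    for fp, fg in zip(feats_pred, feats_gt):
        loss = loss + F.l1_loss(gram_matrix(fp), gram_matrix(fg))
    return loss
```

Identical feature lists give zero for both losses, which is a quick sanity check on the implementation.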

C. PROGRESSIVE INPAINTING APPROACHES
Progressive image inpainting, including the structure-to-texture approach and the boundary-to-center approach, has recently been investigated. The structure-to-texture approach usually employs a two-stage network structure, in which the stages respectively generate the structure/edge and the texture features. Yu et al. [9] proposed a contextual attention mechanism and a two-stage inpainting scheme, building two networks to complete high-quality inpainting. The first network in their approach is designed to infer coarse content indicating the rough structure of the missing region, and the second one aims to refine the produced coarse results. Afterward, their follow-up work introduced Gated Convolution and user-guided information [11] into the two-stage approach and achieved further improvement. Another promising two-stage inpainting approach was proposed by Kamyar et al. [12]. This work first restores the general contour of the image from the corrupted image and the corresponding edge map, and then takes the filled edge map as prior structural information to guide colorization.
Although these methods attempt to solve inpainting tasks by adding structural constraints, they still have some problems. First, due to the use of a series-coupled architecture, they are easily subjected to the adverse effects of unreasonable structural preconditions at inference time. For this issue, Liu et al. [27] recovered structures and textures by feeding a structure branch and a texture branch with deep and shallow features, respectively, then concatenating and equalizing the features they output. Guo et al. [28] proposed a two-stream coupled network for image inpainting, which uses a structure-constrained texture synthesis stream and a texture-guided structure reconstruction stream that exploit each other for more rational generation; a Bi-directional Gated Feature Fusion (Bi-GFF) module is also proposed to combine the results of the two streams. Second, the backbone still lacks information for restoring deeper pixels in holes when the corrupted region is relatively large. To solve this problem, Li et al. [29] devised a Recurrent Feature Reasoning (RFR) module, which recurrently infers the hole boundaries of the convolutional feature maps and then uses them as clues for further inference, progressively strengthening the constraints toward the hole center and inpainting the image.
Although these approaches [9], [11], [12], [28], [29] can generate well-defined content via a two-stage inpainting strategy, they need to separately build two different networks for the two stages, which consumes intensive computation. Besides, the performance of the second network suffers if the first-stage network produces poor inpainting predictions. By contrast, this paper proposes a network that can efficiently capture both structural and texture information in a single stage.

III. APPROACH
This paper proposes a GAN-based neural network, called the Interactive Separation Network (ISNet), which is trained to perform image inpainting tasks. For better comparison, we employ a discriminator and objective function similar to those of EdgeConnect [12] in the early stages of the experiment. The novel generative network in ISNet has two branches designed to maintain high-resolution representations and high-level semantic information, as shown in Figure 3, where blue and yellow rectangles represent the high-resolution and low-resolution branches, respectively. In order to describe the proposed model concisely, this paper divides the forward process of the network into 4 consecutive segments (called phases), which manipulate the features of the two branches in similar ways. In this section, we state the internal structure of each phase, the generator's overall structure, and the objective function used in the proposed approach.

A. THREE OPERATIONS
Each phase of ISNet is composed of three defined operations: Inpainting, Interaction, and Aggregation. Except for the first phase, all phases process feature representations separately at two different resolutions. Taking the first two phases as an example, Figure 4 describes the detailed propagation of these phases. It is worth mentioning that the internal structure of phases 3 and 4 in Figure 3 is consistent with that of phase 2 in Figure 4. During the Inpainting process of the second phase, two different ResBlocks [30] are adopted separately in the two branches to handle the two types of features produced by the previous phase. In the first phase, only one branch is equipped with a ResBlock to process the feature.
The Interaction operation aims to exchange information between the two branches and to further downsample the low-resolution branch into a higher channel dimension. It employs convolution with stride 2 and double the number of filters to downsample the feature resolution to half of the previous state. Meanwhile, the stride convolution is repeated n times (where n is the phase index) to progressively scale the first-branch feature down to the same resolution as the second branch; the outputs are then concatenated into the second branch. To propagate the high-resolution branch, the previous second-branch features are put through (n − 1) Sub-Pixel [31] convolutional layers, which adaptively learn an upsampling scheme using CNNs, and then stacked with the previous features of the high-resolution branch. As with the Inpainting operation, the Interaction process only takes the high-resolution branch as input in the first phase, because only the larger-resolution feature branch exists there.
The Aggregation operation is designed to fuse the information produced by the Interaction operation. It exploits convolution with a 3 × 3 kernel in both branches to merge and compress the channels into specific numbers, where the resolution of the current feature determines the compressed channel number. In the proposed ISNet, the channel numbers are 32, 64, 128, 256, and 512 as the feature resolution decreases.
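The Interaction and Aggregation operations described above can be sketched as follows. This is a hypothetical PyTorch rendering under our own naming and channel choices, not the released ISNet code; it only illustrates the stride-2 downsampling path, the Sub-Pixel (PixelShuffle) upsampling path, the concatenation between the two branches, and the 3 × 3 fusion.

```python
import torch
import torch.nn as nn

class Interaction(nn.Module):
    """Exchange information between a high-res and a low-res branch.

    Downward path: a stride-2 conv shrinks the high-res feature to the
    low-res size while doubling channels. Upward path: a Sub-Pixel step
    (conv to 4x channels, then PixelShuffle(2)) enlarges the low-res
    feature. Each path's output is concatenated with the other branch.
    Channel sizes are illustrative, not the paper's exact configuration.
    """
    def __init__(self, ch_hi=32, ch_lo=64):
        super().__init__()
        self.down = nn.Conv2d(ch_hi, ch_lo, 3, stride=2, padding=1)
        self.up = nn.Sequential(nn.Conv2d(ch_lo, ch_hi * 4, 3, padding=1),
                                nn.PixelShuffle(2))

    def forward(self, x_hi, x_lo):
        lo_cat = torch.cat([x_lo, self.down(x_hi)], dim=1)  # 2*ch_lo @ H/2
        hi_cat = torch.cat([x_hi, self.up(x_lo)], dim=1)    # 2*ch_hi @ H
        return hi_cat, lo_cat

class Aggregation(nn.Module):
    """Fuse concatenated features back to a fixed channel count with a
    3x3 convolution; tanh delivers the intermediate feature values."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.fuse = nn.Conv2d(ch_in, ch_out, 3, padding=1)

    def forward(self, x):
        return torch.tanh(self.fuse(x))
```

With a 64 × 64 high-resolution feature (32 channels) and a 32 × 32 low-resolution feature (64 channels), the Interaction output channel counts double on each branch and Aggregation compresses them back.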

B. NETWORK DESIGN
Let us look back to Figure 3. Before inputting the features into the 4 sequential phases, ISNet first embeds the masked RGB images into 32-dimensional feature maps with a resolution of 256 × 256. Because there are two branches processing the data, the resolutions of the outputs produced by the last (4th) phase are 256 × 256 and 16 × 16, respectively. At the end of the generative model, the Final Fusion (FF) process fuses these output features into a high-resolution feature block through stacked Sub-Pixel layers and concatenation. Finally, the features are decoded into a repaired image using a 3 × 3 convolution. To verify the effectiveness of the Final Fusion process, we also try dropping the FF process and decoding the high-resolution branch directly into 3-channel images; Section IV demonstrates that this simplified design results in worse performance than the network equipped with FF. It is noticeable that the final 3-channel features are passed through the tanh(x) function and mapped to the range between 0 and 1, i.e., the output is computed as (tanh(x) + 1)/2. In all intermediate processing of the generator, the tanh(x) activation function is adopted to deliver feature values, and zero padding is used to control the variation of resolution.
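The Final Fusion step can be sketched as below. This is an illustrative PyTorch reading of the description, with our own class name and channel counts (512 channels at 16 × 16 upsampled to meet 32 channels at 256 × 256, following the channel schedule stated for the Aggregation operation); the released code may differ.

```python
import torch
import torch.nn as nn

class FinalFusion(nn.Module):
    """Merge the 16x16 low-res output with the 256x256 high-res output.

    Four stacked Sub-Pixel steps (conv doubling channels, then
    PixelShuffle(2), which divides channels by 4) bring 16x16 up to
    256x256; the branches are concatenated and a 3x3 conv decodes to
    RGB. Channel counts are illustrative.
    """
    def __init__(self, ch_hi=32, ch_lo=512):
        super().__init__()
        ups, ch = [], ch_lo
        for _ in range(4):                 # 16 -> 32 -> 64 -> 128 -> 256
            ups += [nn.Conv2d(ch, ch * 2, 3, padding=1), nn.PixelShuffle(2)]
            ch = ch // 2                   # net effect: channels halve
        self.up = nn.Sequential(*ups)
        self.decode = nn.Conv2d(ch_hi + ch, 3, 3, padding=1)

    def forward(self, x_hi, x_lo):
        x = torch.cat([x_hi, self.up(x_lo)], dim=1)
        # tanh maps to (-1, 1); rescale to (0, 1) as described above
        return (torch.tanh(self.decode(x)) + 1) / 2
```

Feeding the two phase-4 outputs yields a 3-channel image in the (0, 1) range.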
In the discriminator of ISNet, high-level features are collected using 3 consecutive vanilla convolution layers with stride 2, followed by 2 convolution layers with the same padding, which manipulate the number of feature channels and reduce it to 1. The discriminator in this work is based on PatchGAN [11], [32]; therefore, its final output is a single feature map, in which each pixel judges the part of the input region related to it. In order to represent the probability that the receptive region of a neural unit is generated by the network or real, the sigmoid activation function is used in each layer of the discriminator.
As mentioned in [33], the advantage of spectral normalization is that it stabilizes the training process. Spectral normalization constrains the weight matrix of each layer by its maximum singular value, which limits the Lipschitz constant of the network to 1. Spectral normalization was originally used only in the discriminator; however, Odena [34] has demonstrated that it can also keep generators away from dramatic changes in parameters and gradients. As a result, spectral normalization is applied to both the generator and the discriminator in ISNet.
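A discriminator matching this description might be sketched as follows, with hypothetical channel counts and using PyTorch's built-in torch.nn.utils.spectral_norm. We place LeakyReLU between layers and a sigmoid on the output, a common PatchGAN arrangement; the paper's exact activation placement and channel widths may differ.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(c_in, c_out, stride):
    """4x4 conv wrapped with spectral normalization [33]."""
    return spectral_norm(nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1))

class PatchDiscriminator(nn.Module):
    """PatchGAN-style discriminator: 3 stride-2 convs collect features,
    then 2 stride-1 convs reduce the channels to 1. Each output pixel
    scores one receptive patch of the input image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            sn_conv(3, 64, 2), nn.LeakyReLU(0.2),
            sn_conv(64, 128, 2), nn.LeakyReLU(0.2),
            sn_conv(128, 256, 2), nn.LeakyReLU(0.2),
            sn_conv(256, 256, 1), nn.LeakyReLU(0.2),
            sn_conv(256, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return self.net(x)
```

On a 256 × 256 input, the output is a single-channel score map whose values lie in (0, 1).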

C. LOSS FUNCTION AND EXPLORATION
In this paper, the mapping functions of the generator and discriminator of ISNet are denoted as G(x) and D(x), respectively. Following the aforementioned symbols, M refers to the mask, which only includes binary values, and I_gt and I_pred respectively represent the ground-truth image and the image inferred by the generative model. Using I_gt ⊙ M to represent the damaged image, the image generation process can be expressed as I_pred = G(I_gt ⊙ M), where ⊙ is the Hadamard product. To train the discriminator, real images and repaired images are sent to the discriminator in a one-to-one ratio for discrimination. The hinge loss is adopted as the objective function of the discriminator, which maximizes the margin between positive and negative samples. By minimizing this objective, the discriminator can better distinguish whether an image was repaired by the generative model. The hinge loss can be described as Eq. (3):

L_D = E[ψ(m − D(I_gt))] + E[ψ(m + D(I_pred))]    (3)

where m refers to the margin parameter and ψ represents the ReLU function, which is used to filter out negative values. (Table 1 note: all the models in Table 1 are trained for 3,200 iterations and might not be the optimal models.)
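The hinge loss of Eq. (3) can be sketched as a small PyTorch function. The function signature and the default margin of 1.0 are our own assumptions; d_real and d_fake stand for the discriminator's score maps on ground-truth and repaired images.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake, margin=1.0):
    """Hinge loss for the discriminator (Eq. (3)).

    F.relu plays the role of psi, clipping negative values, and
    `margin` is the margin parameter m. Minimizing this pushes real
    scores above +margin and fake scores below -margin.
    """
    return (F.relu(margin - d_real).mean() +
            F.relu(margin + d_fake).mean())
```

When the discriminator is already confident (real scores above the margin, fake scores below its negative), the loss is exactly zero.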
In the early stages of the trial, the generative model is trained on a joint loss function similar to that of EdgeConnect [12]; the objective function to be minimized can be described as Eq. (4):

L_G = λ_1 L_1 + λ_D L_D + λ_style L_style + λ_perc L_perc + λ_FM L_FM    (4)

where the λ terms are the weights of the L1, adversarial, style, perceptual, and feature-matching loss terms, respectively.
For the objective function (Eq. (4)) of the generator, we further investigate the effects of the different components of the loss function and analyze its principal components.
First, we conduct a random hyperparameter search over the weights of each loss term: we train dozens of models with different weight combinations, and each model is iterated nearly 3,200 times over nearly 256,000 samples (the batch size is set to 8). In order to search the weight space widely, all the parameter magnitudes are chosen randomly, and the ranges of magnitudes are selected empirically. Some of the experimental results are described in Table 1.
However, this is very time-consuming work, and we find that the quantitative score is not necessarily proportional to the qualitative effect: models achieving high quantitative scores may generate extremely unpleasant images. For example, an objective function with highly weighted L_1 and L_perc terms can achieve a higher quantitative score, but checkerboard artifacts and blurry content may appear in the generated images. These phenomena can be observed in Figure 5. According to this study, highly weighted L_1, L_perc, and L_style terms improve the quantitative score of the repaired images more than the other components. Paying closer attention to these key components, we find that the L_perc term can produce more structural inpainting, and the L_D term can reduce the checkerboard artifacts produced by L_1 and L_perc, thus making the recovered image more realistic, which can also be observed in Figure 5.
Based on this experience, we further conduct a series of small-scale ablation experiments on the loss function, shown in Table 2 (each experiment trains the model for 255,000 iterations). From the first 4 rows, it can be clearly seen that the L_1, L_D, L_style, and L_perc loss terms each benefit the quantitative score. However, a simple combination of these loss terms produces unpleasant inpainting results (Figure 5). After weighting L_style and L_D to improve the visual effect while maintaining high quantitative scores, the final objective function is defined as Eq. (5).

D. PROGRESSIVE RECONSTRUCTION ALGORITHM
In order to comply with the habit of human painting and improve the robustness of repaired images to large damaged regions, this paper proposes a progressive inpainting strategy that inputs images and masks into the generative model multiple times during the prediction stage. Specifically, suppose a rough recovered image I_comp = I_pred ⊙ (1 − M) + I_gt ⊙ M is obtained from the first inpainting pass; then the mask M is dilated to M′, meaning that the size of the damaged area is reduced to a certain extent (depending on the size of the dilation kernel). Afterward, I_comp ⊙ M′ is input into the model to produce a new recovered image. By repeating the above process, the final recovered image is obtained as our output. The whole inpainting process is defined in Algorithm 1, where sum(x) calculates the sum of the element values of the matrix x and numel(x) counts the number of elements in the matrix x.
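Algorithm 1 can be sketched as follows, assuming the mask holds 1 for valid pixels and 0 for holes (so dilating the valid region shrinks the hole). Mask dilation is implemented here with max-pooling as a stand-in for the paper's dilation operation; the function name and the max_steps safety cap are our own.

```python
import torch
import torch.nn.functional as F

def progressive_reconstruction(generator, image, mask, kernel=15,
                               max_steps=20):
    """Progressive reconstruction (a sketch of Algorithm 1).

    Each step runs the generator on the masked image, keeps its
    prediction only inside the hole, then dilates the valid region so
    the next step faces a smaller hole. Stops once sum(mask) equals
    numel(mask), i.e. no holes remain.
    """
    comp = image
    for _ in range(max_steps):
        if mask.sum() == mask.numel():            # no holes left
            break
        pred = generator(comp * mask)
        comp = pred * (1 - mask) + comp * mask    # fill only the hole
        # grow the valid (=1) region by (kernel-1)/2 pixels per step
        mask = F.max_pool2d(mask, kernel, stride=1, padding=kernel // 2)
    return comp
```

With a stand-in generator, hole pixels take the generator's values while valid pixels are never overwritten, which matches the compositing rule I_comp = I_pred ⊙ (1 − M) + I_gt ⊙ M.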
Intuitively, Figure 6 displays the variation of the mask at each dilation operation. It can be observed that the valid region becomes larger as the algorithm proceeds.
Because the progressive reconstruction scheme takes place in the prediction stage, it improves the results while only slightly increasing the inference time. The comparison of prediction latency can be observed in Table 3. In this experiment, the kernel size of each dilation step is set to 15, and the prediction latency only includes the time for prediction, not the time to initialize the model, calculate scores, etc. It can be observed that the prediction latency per image of our proposals is less than 0.1 seconds. (Table 1 caption: the L_1, L_D, L_style, L_perc, and L_FM columns respectively refer to the weights of each loss term; PSNR and MAE refer to the quantitative scores produced by the corresponding conditional models over 3,000 test samples. Table 2 caption: the same columns, with scores over 5,000 inpainting samples.) In order to validate the improvement of the progressive reconstruction algorithm (PRA), we also perform a comparative experiment on the model optimized by the objective function in the 5th row of Table 2, which is not very robust to large damaged regions. According to the experiment, PRA visually improves the inpainting performance in the case of a large damaged region. Some inpainting results with large holes are displayed in Figure 7.

IV. EXPERIMENT
A. IMPLEMENTATION AND TRAINING SETUP
The proposed architecture and all supplemental experiments are implemented in PyTorch [36]. All the images in the experiments come from two public datasets, Places2 [35] and CelebA [37]. The details of both datasets are shown in Table 4. All images are resized to a uniform resolution of 256 × 256. All mask maps, which mark damaged areas with the value 0, are sampled from the NVIDIA public dataset [8] and resized to a resolution of 256 × 256. The generative loss and adversarial loss are both optimized by a well-known stochastic descent method, the Adam optimizer [38]. The learning rate of the generator is 10^(-4), and the learning rate of the discriminator is set to one tenth of that of the generator; the models in all experiments are trained until their generators converge. It is worth mentioning that the evaluations in this section only measure the difference between the real testing image I_gt and the combined image I_comp = I_pred ⊙ (1 − M) + I_gt ⊙ M. (Figure caption: the results correspond to the loss settings of Table 2; the last column is the result using the progressive reconstruction algorithm.)
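The optimizer setup described above can be sketched as below; the Adam betas are left at PyTorch's defaults, which is our assumption, since the paper does not state them here.

```python
import torch

def build_optimizers(generator, discriminator, g_lr=1e-4):
    """Adam optimizers as described in the training setup: the
    discriminator's learning rate is one tenth of the generator's."""
    g_opt = torch.optim.Adam(generator.parameters(), lr=g_lr)
    d_opt = torch.optim.Adam(discriminator.parameters(), lr=g_lr / 10)
    return g_opt, d_opt
```

For example, with the paper's generator learning rate of 1e-4, the discriminator is trained at 1e-5.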

B. QUANTITATIVE COMPARISON AND ANALYSIS
In order to study the effect of different sizes of damaged areas on inpainting performance, we evaluate the models under different sizes of the damaged region, using the following three indicators for the quantitative evaluation: the Peak Signal-to-Noise Ratio (PSNR), the Structural Similarity Index (SSIM) [41], and the Mean Absolute Error (MAE, i.e., L1 distance) [26].
Given the ground-truth image and the inpainted image, MAE measures the mean absolute difference between their pixel values. A low MAE, computed as Eq. (6), indicates that the quality of the reconstructed image is good:

MAE = (1/N) Σ |I_gt − I_pred|    (6)

where N is the number of pixels.
PSNR is the ratio of the maximum possible signal power to the power of the distortion noise, computed between two homogeneous images (reconstructed/original). The higher the PSNR value, computed as Eq. (7), the better the quality of the inpainted image:

PSNR = 10 log_10(MAX_I^2 / MSE)    (7)

where MAX_I is the maximum fluctuation in the input image data type and MSE is the mean squared error between the two images.
SSIM models three factors of two images, namely correlation loss, luminance distortion, and contrast distortion. Given the input signals (x, y), SSIM combines luminance, contrast, and structure to output a similarity measure, expressed in Eq. (8):

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x^2 + μ_y^2 + c_1)(σ_x^2 + σ_y^2 + c_2))    (8)

where μ and σ denote local means, variances, and the cross-covariance, and c_1, c_2 are small stabilizing constants. The higher the SSIM value, the better the quality of the predicted image.
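The MAE and PSNR metrics of Eqs. (6) and (7) can be sketched in a few lines of NumPy; for SSIM, an off-the-shelf implementation such as scikit-image's structural_similarity is typically used rather than reimplementing Eq. (8) by hand.

```python
import numpy as np

def mae(gt, pred):
    """Mean Absolute Error (Eq. (6)): lower is better."""
    return np.abs(gt.astype(np.float64) - pred.astype(np.float64)).mean()

def psnr(gt, pred, max_i=255.0):
    """Peak Signal-to-Noise Ratio (Eq. (7)) in dB: higher is better.

    max_i is MAX_I, the maximum value representable by the image's
    data type (255 for 8-bit images); identical images give infinity.
    """
    mse = ((gt.astype(np.float64) - pred.astype(np.float64)) ** 2).mean()
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_i ** 2 / mse)
```

For two uint8 images differing by a constant offset of 10, MAE is 10 and PSNR follows directly from the MSE of 100.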
The comparison results show that ISNet achieves better quantitative performance on all indicators regardless of the proportion of the damaged area. This proves the efficiency of the two-branch network in image inpainting tasks and also demonstrates that the high-resolution components play an important role in addressing texture inconsistency in image inpainting. According to the columns 'ISNet(Eq.(4))' and 'ISNet(Eq.(5))' of Table 5, the proposed loss function (Equation 5) achieves better inpainting performance than using Equation 4 as the objective function. Besides, the results in the columns 'ISNet(Eq.(5))' and 'ISNet(Eq.(5))+PRA' of Tables 5 and 6 demonstrate the effectiveness of the proposed progressive reconstruction algorithm. Because PRA can improve the visual performance of inpainting (see Figure 7) with only a slight degeneration of the quantitative score, we consider ISNet+PRA as our main proposal. Figure 8 shows some selected inpainting results from the Places2 test dataset. It can be observed that the proposed approach (ISNet+PRA) produces more visually plausible results and removes the style (color) difference between the filled and non-filled areas. From the yellow box in the second row, the proposed method fills these holes with more reasonable textures, making the inpainted region consistent with its surroundings in terms of color and semantics. (Table 5 caption: results on Places2 [35]; the existing recorded data are taken from the literature [8], [12], [39], [40], and the records of our proposal are estimated over 10,000 test samples from the Places2 dataset; the mask dataset in the training and test phases is provided by Liu [8]. Table 6 caption: results on CelebA [37]; the records of our proposal are estimated over 5,000 samples from the CelebA test dataset; the mask dataset in the training and test phases is provided by Liu [8].)

C. QUALITATIVE COMPARISON AND OBSERVATION
For the inpainting of large damaged regions, the proposed approach outputs a more realistic inpainted scene without serious artifacts. Figure 9 displays the inpainting results of facial reconstruction from the CelebA test dataset. What stands out in the figure is the authenticity of the inpainting results of the proposed method, which is able to reconstruct high-resolution faces while maintaining consistent skin color and the authenticity of decorative objects on the head.

D. ABLATION STUDY
Places2 is a public dataset that collects various natural scenes. It consists of more than 1,000,000 training samples and more than 300,000 images for testing. Considering the abundant training and testing data in the dataset, it is very suitable for conducting the ablation experiments. We quantitatively evaluate the inpainted results on the test dataset with three kinds of measurement: PSNR, SSIM, and MAE.
We initially build a simple model prototype without the sub-pixel layers and the Final Fusion (FF) design. We consider this simple model as a base model and compare our approach with and without these modules/local architectures. It is worth mentioning that adding the Final Fusion increases the number of learnable weights of the generator model. To demonstrate that the improvement does not merely benefit from the increase in the number of weights, we decrease the channel number of some intermediate layers of the base model when conducting the experiment that adds the Final Fusion operation. Because the sub-pixel layer is a technique that lets the network adaptively learn an upsampling scheme, it inevitably adds more learnable weights to the model; hence, the experiment does not further reduce the number of channels when adding the sub-pixel layers. In order to demonstrate the efficiency of ISNet-like generative models, we also introduce a UNet-like [13] generator model from the work [12] and adjust its number of parameters to be close to that of ISNet by adding additional layers to the generator.
Table 7 shows the performance of trained models with different network designs on the Places2 test dataset. It is worth noting that all the environments and training setups are kept constant (e.g., batch size, number of iterations). According to the table, all the techniques used in ISNet improve the quantitative score, and regardless of the number of parameters, the ISNet-style models have better quantitative performance than the UNet-like model. Besides the structural design, this paper also explores the network performance under different numbers of residual blocks added in each Inpainting process. As shown in Table 8, we evaluate the inpainting performance of ISNet with different numbers of residual blocks. The maximum block number is set to 4 due to the limitation of our 11GB GPU device. Apart from validating the effectiveness of applying residual blocks, these results also provide a reference for network deployments requiring diverse RAM resources.

V. CONCLUSION
Because the two-branch generative network plays a pivotal role in maintaining both low-resolution components and high-resolution information, the proposed method (ISNet) is well suited to the inpainting task and obtains excellent inpainting performance both visually and quantitatively. ISNet performs better than other state-of-the-art approaches on the two public image datasets. On the other hand, this paper explores the trade-off between quantitative performance and visual results under different loss combinations, and the proposed composition of the loss function greatly improves the inpainting performance of ISNet. Furthermore, the proposed progressive reconstruction algorithm improves the visual robustness of ISNet for large-damaged-region inpainting (Figure 7).

ACKNOWLEDGMENT
(Jun Gong and Siyuan Li contributed equally to this work.)