A Non-local Enhanced Network for Image Restoration

Non-local modules have been widely studied in image restoration (IR) tasks since they can learn long-range dependencies to enhance local features. However, most existing non-local modules still focus on extracting long-range dependencies within a single image or feature map. On the other hand, most IR methods simply employ a single type of non-local module in the network; a combination of various types of non-local modules to enhance local features can be more effective. In this paper, we propose a batch-wise non-local module to explore richer non-local dependencies among images. Furthermore, we combine various non-local extractors (different attention modules) with the proposed batch-wise non-local module into the Enhanced Batch-wise Non-local Attentive module (EBNA). Beyond exploring richer non-local information, we build the Non-local and Local Information extracting Block (NLIB), in which we combine the EBNA with the DEformable-Convolution Block (DECB) to utilize richer non-local and adaptive local information. Finally, we embed the NLIB within a U-net-like structure and build the Non-local Enhanced Network (NLENet). Extensive experiments on synthetic image denoising, real image denoising, JPEG artifacts removal, and real image super resolution tasks demonstrate that our proposed network achieves state-of-the-art performance on several IR benchmark datasets.


I. INTRODUCTION
Image restoration is a classic computer vision task that aims to restore a high-quality image from its degraded observation. It has been widely applied in many practical applications, such as medical image processing [1], [2], surveillance [3]–[5], synthetic aperture radar (SAR) image processing [6]–[8], image compression [9], and so on.
Traditional methods build handcrafted models to solve the image restoration problem based on prior knowledge of specific degradations [10], [11]. However, such methods, including Block-Matching and 3D filtering (BM3D) [12], non-local means (NLM) [13], and sparse coding [14], usually have limited robustness towards real-world data. To remedy this problem, recent deep neural network (DNN) based methods learn the parameters of the model from massive paired data for a specific degradation, such as SRCNN [15], DnCNN [16], and RDN [17].
SRCNN [15] first introduced the convolutional neural network to IR for the image super resolution task. Recent developments in DNNs showed that larger receptive fields can learn informative features from a larger neighborhood in the image. Therefore, more and more researchers build deeper and wider networks to improve restoration performance, such as RDN [17], VDSR [18], and EDSR [19]. Another way of learning from larger receptive fields is to employ wavelet decomposition to generate multi-scale inputs. For example, in the divide-and-conquer framework [20], the authors first decomposed images into multiple subspaces according to visual importance and used different models to preserve texture details based on prior knowledge. In another work, the authors used CNN-based models for sparse coding (DCSC) [21]; in DCSC, the features were sparsely coded by CNN models instead of a handcrafted coding method. In MWCNN [22], the authors employ wavelet decomposition and reconstruction to generate multi-scale features.
We can acquire non-local information in multiple ways, such as employing self non-local modules (based on non-local means [13]), global pooling modules (channel-wise and spatial-wise attention), multi-scale inputs, long-range connections, and recurrent connections in the network. In IR approaches, for example, many networks employ global pooling to extract long-range dependencies, such as RCAN [23], [24] and RIDNet [25]. In these attention modules, the feature map is squeezed (channel-wise or spatial-wise) to generate attentive weights [26]–[28] based on the whole channel or the spatial location of the feature map (which is used as the non-local feature). Networks like MemNet [29] and REDNet [30] pass non-local information through recurrent connections to obtain richer features. COLA-Net [31] builds a non-local module that uses patches from a single image (or feature map) to extract long-range dependencies.
However, several problems still exist in the above IR approaches. First of all, self non-local information has been explored in methods such as COLA-Net [31] and NLRN [32], but these methods ignore the helpful patch-wise non-local information among multiple images. Secondly, most IR models employ a single type of non-local module in the network, limiting the feature extracting ability; combining various types of non-local modules to enhance local features can be more effective for IR. Finally, besides the non-local features that capture long-range dependencies, more sophisticated local features can complement the restored local texture.
We propose the batch-wise non-local module to extract sophisticated non-local information and build long-range dependencies to tackle the above problems. Different from the previous self non-local modules [31], [33], our proposed batch-wise non-local module can fuse the prior and relevance from a batch of images (or feature maps), which can restore more contextual details.
To further explore diverse information from non-local regions, we propose a novel block named EBNA, which combines the proposed batch-wise non-local module and various existing non-local modules. In the proposed batch-wise non-local module, we intend to extract richer information from a batch of images instead of using only one image. Especially in situations where self-similarity is limited within one image, the chance of extracting more relevant information from a batch of images increases. Unlike the non-local modules that employ global pooling to extract non-local relatedness, the batch-wise non-local module generates the non-local feature based on patch-wise relations within feature maps. In contrast, the channel-wise attention (CA) and spatial-wise attention (SA) modules extract non-local features by global pooling weights among channels and spatial locations within the features. Combining the three can enhance each other and generate more diverse non-local features to improve the restoration of textural and contextual details in the images. FIGURE 1 shows the difference between our proposed EBNA module and several existing non-local modules. The highlighted parts in FIGURE 1 demonstrate the difference between the self non-local module and the proposed batch-wise non-local module: they match patches within a single channel of features and within a batch of features, respectively. We can observe that the batch-wise non-local module in EBNA can explore long-range dependencies among images and extract more sophisticated non-local features. In EBNA, the combination of various non-local modules can generate more diversified features compared to existing networks that employ a single type of non-local module.
Furthermore, based on EBNA, we build a novel block named NLIB to collaborate the local and non-local features. In NLIB, we employ a DEformable-Convolution Block (DECB) to extract local features. Deformable convolution can learn local features from the adaptive receptive field, but the limited receptive field size still restricts the module from learning non-local information. Such cooperation between DECB and EBNA can extract more enhanced features from both local and non-local regions of the image.
Finally, we stack the NLIBs in a U-net-like multi-scale structure model and build the Non-local Enhanced Network (NLENet). Extensive experiments show that NLENet achieves state-of-the-art performance on several IR task benchmark datasets.
The main contributions of the paper can be summarized as follows:
• We propose a novel batch-wise non-local module to explore non-local dependencies among images and build a novel block called EBNA that combines various complementary non-local information.
• To cooperate non-local information with adaptive local information, we further employ EBNA together with the DEformable-Convolution Block (DECB) as a new module, NLIB. Based on the NLIB, we propose our final model, NLENet, which utilizes various non-local information and the adaptive local feature to improve IR performance.
• Extensive experiments on synthetic image denoising, real image denoising, JPEG artifacts removal and real image super resolution tasks show that the proposed model achieves state-of-the-art performance. Furthermore, the ablation study also demonstrates the superiority of the proposed network.
The rest of this paper is organized as follows. In section II, we introduce the related works. In section III, we present the structural details of our proposed model. Extensive experiments are conducted in section IV to evaluate the effectiveness of the proposed network on synthetic image denoising, real image denoising, JPEG artifacts removal and real image super resolution tasks. Furthermore, an ablation study is presented in section V. The conclusion is given in section VI.

II. RELATED WORKS
In this section, we give a brief review of the works related to our proposed network. We first list the typical traditional model-based and recent state-of-the-art DNN based IR methods. Then, we briefly introduce the typical non-local operations and local feature extraction approaches in IR.

A. IMAGE RESTORATION
Image restoration, as a fundamental component in the image processing area, has been widely studied for decades. Traditional IR methods like BM3D [12], SA-DCT [34], and TNRD [35] have provided reasonable results on both accuracy and robustness. However, these algorithms usually have drawbacks, such as high complexity and limited generalization.
Recently, DNN-based IR methods have gained considerable attention and achieved significant performance improvements. Researchers develop deeper, wider models to acquire larger receptive fields and extract pixel-wise relations from a larger region. SRCNN [15] first introduced CNNs for IR tasks. Based on the CNN structure, researchers developed VDSR [18], in which a structure consisting of several cascading filters was proposed to broaden the receptive field and increase the depth of the model. In building VDSR, the authors found that increasing the depth of the model could bring performance improvement. To alleviate gradient problems when training deeper models, DRCN [36] proposed a deeper model together with gradient clipping and a recursive-supervision method, which increased IR performance significantly. DnCNN [16] introduced the residual connection to ease the propagation of the feature flow and solve the gradient vanishing problem in deep IR models. In another work, RCAN [23], a deeper model with a residual-in-residual structure was proposed; RCAN also introduced the channel-wise attention module within the residual-in-residual structure to obtain a deep network and simultaneously learn more adaptive features. To increase the IR model's efficiency, RDN [17] employed dense connections, feature fusing, and residual connections to make full use of the features from different scales. In this paper, we propose a novel network that employs both non-local and local features to improve IR performance, as shown in section III.

B. NON-LOCAL OPERATION
Non-local information has been explored in many areas, such as extracting relevance in video processing [37], [38], building long-range dependency among a sequence of words in natural language processing [39] and text summarization [40]. Besides extracting long-range dependency in the time domain, non-local operations also can build relevance in the space domain, such as in computer vision tasks.
In the computer vision area, non-local operations that extract long-range dependencies among pixels have been used for many tasks, for example object detection [33], [41], semantic segmentation [42], [43], video action recognition [44], image compressive sensing [45], and image restoration [13], [31], [46], [47]. To better understand the non-local operation's efficacy, we can view it as an attention mechanism for pixel-to-pixel relation modeling. This relation is modeled as the dot product between the features of two pixels: a larger dot-product value indicates greater relevance between the two pixels.
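To make the dot-product relation modeling concrete, the toy NumPy sketch below computes pairwise pixel affinities and normalizes them into attention weights; the feature vectors are made up purely for illustration.

```python
import numpy as np

# Toy feature map: 4 pixels, each with a 3-dimensional feature vector.
feats = np.array([[1.0, 0.0, 0.0],
                  [0.9, 0.1, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])

# Affinity between every pair of pixels as a dot product.
affinity = feats @ feats.T            # shape (4, 4)

# Softmax-normalize each row so the weights over all pixels sum to 1.
weights = np.exp(affinity)
weights /= weights.sum(axis=1, keepdims=True)
```

Pixels 0 and 1 are nearly identical here, so their mutual affinity (0.9) exceeds their affinity with the orthogonal pixels 2 and 3 (0.0), which is exactly the relevance signal a non-local module exploits.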
Early traditional methods typically assume Additive White Gaussian Noise (AWGN) and apply TV regularization [48] or coefficient transforms in the Fourier [49] or wavelet [50] domain for different IR tasks. However, it is the idea of non-local means (NLM) denoising [13] that brought the importance of long-range dependencies into IR tasks. Non-local means methods are built upon self-similarity and redundant information in realistic images. Later on, another non-local denoising approach, BM3D [12], was developed.
As for DNN-based methods, NLNet first introduced deep neural networks to perform non-local processing for the color image denoising task, achieving remarkable performance. Non-local information is also widely explored in the image super resolution area [51]–[53]. For example, NLSN [51] proposed a Non-Local Sparse Attention (NLSA) module with a dynamic sparse attention pattern to generate non-local attention with spherical locality sensitive hashing (LSH). Furthermore, in MHNAN [52], non-local information is extracted by a Mixed High-Order Attention (MHA) module. In another work, COLA-Net [31] builds a learnable non-local module to extract long-range dependencies within the degraded image. However, it only extracts relevance within one single image and lacks non-local information among multiple images. In contrast, we propose a novel batch-wise non-local module to extract the non-local information among multiple images, which has not been studied in existing non-local methods. Based on the proposed batch-wise non-local module, we propose NLENet, which combines various non-local features to enhance the local feature and preserve more contextual details in IR.

III. PROPOSED NETWORK
In this paper, we propose a novel network, NLENet, for IR tasks. Here we present an overview of the proposed IR network, including the models for synthetic image denoising, real image denoising, JPEG artifacts removal and real image super resolution. FIGURE 2 illustrates the overall architecture of the proposed network, which is a multi-scale structure embedded with the proposed block NLIB. We can observe that NLIB consists of two proposed blocks, EBNA and DECB. EBNA, based on the proposed batch-wise non-local module, collaborates with different types of non-local extractors to build an enhanced batch-wise non-local and attentive module, while DECB explores the local information by employing deformable convolution. With both modules, NLIB can combine the local and non-local features. More concretely, (1) we propose a batch-wise non-local module to explore the relevance among images; (2) based on the proposed batch-wise non-local module, we propose EBNA, which provides enriched non-local features from various types of non-local modules; (3) to fully utilize the non-local and local information, we propose NLIB, built upon EBNA and DECB. We stack NLIBs in a U-net structure to build NLENet, utilizing enriched non-local and local features to preserve better contextual details in the restored images.

A. BATCH-WISE NON-LOCAL MODULE
Following the idea of the non-local means operation [13], the generic non-local operation can be defined as

y_j = (1 / S(x)) Σ_i f(x_j, x_i) g(x_i),    (1)

in which the output patch y_j at position j has the same size as the input x_j, S represents a normalization factor, and i, j index different patches in image I. f(·) computes a scalar affinity between patch x_j and patch x_i, which represents the relationship between the two patches. g(·) is an embedding function that transforms the input patch to another representation domain. In this way, the non-local operation uses all the predictable information within a single image to restore the current patch. Further applying this idea to DNN-based models, the non-local module employs the same process within each channel of the feature maps to explore self-predictable information.
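A minimal NumPy sketch of this generic non-local operation on Equation 1's terms, assuming an embedded-Gaussian affinity for f(·) and a random linear map for g(·) (both are common illustrative choices, not necessarily the exact ones used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Patches from a single image: n patches, each flattened to d values.
n, d = 16, 8
x = rng.standard_normal((n, d))

# g() as a linear embedding (illustrative weights).
W_g = rng.standard_normal((d, d)) * 0.1

# f(x_j, x_i) = exp(x_j . x_i): embedded-Gaussian affinity.
f = np.exp(x @ x.T)

# S is the per-output normalization factor sum_i f(x_j, x_i).
S = f.sum(axis=1, keepdims=True)

# y_j = (1/S) * sum_i f(x_j, x_i) * g(x_i)
y = (f / S) @ (x @ W_g)
```

Every output patch is thus a convex combination of the embedded patches, weighted by how similar each patch is to the query patch.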
We extend this search region of predictable information from one single image to a batch of images in our work. Similar patches of pixels (or feature maps) are searched to generate more abundant predictable information. We reformulate the single-image non-local operation as the batch-wise non-local (BNL) operation:

y_j = (1 / S(x)) Σ_{i ∈ I_batch} f(x_j, x_i) g(x_i),    (2)

where i, j index patches from a batch of images I_batch. Different from Equation 1, our proposed batch-wise non-local module in Equation 2 expands the 'non-local region' from a single image to a batch of images. In the existing self non-local module, patches are cropped from a single image to perform the patch-matching process, while in the proposed batch-wise non-local module, the patches are extracted from a batch of images, where more relevant information can be found. FIGURE 3 shows the batch-wise non-local module in detail. We take a feature map of size (bs, c, w, h) as input, and q_n × bs represents the number of patches unfolded from a batch of feature maps (used as the Query, Key and Value). q_c represents the number of channels of the feature map, and q_w and q_h are the width and height of the patches (we set the patch size to 4 × 4). The Query feature map is then reshaped and multiplied with the Key feature map to generate a weight matrix. As an evaluation of the relevance among patches, the weight matrix is multiplied with the Value feature map to generate the non-local feature map.
In the batch-wise non-local module, richer information can be extracted from a batch of images instead of only one single image. Especially in situations when self-similarity is limited within one image, the chance of extracting more relevant information from a batch of images is increased.
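The unfold-and-match process described above can be sketched in NumPy as follows. Identity Query/Key/Value embeddings are a simplifying assumption for the sketch; the actual module uses learned embeddings, and the key point is that patches are pooled across the whole batch before matching.

```python
import numpy as np

rng = np.random.default_rng(1)
bs, c, w, h = 4, 2, 8, 8       # a batch of feature maps
p = 4                          # patch size 4x4, as in the paper

x = rng.standard_normal((bs, c, w, h))

# Unfold every feature map into non-overlapping p x p patches and pool
# all patches from the whole BATCH, not just from one image.
patches = (x.reshape(bs, c, w // p, p, h // p, p)
             .transpose(0, 2, 4, 1, 3, 5)
             .reshape(-1, c * p * p))   # (bs * n_patches, c*p*p)

q = k = v = patches            # identity Q/K/V embeddings (sketch only)

# Relevance of every patch to every other patch across the batch.
weight = np.exp(q @ k.T)
weight /= weight.sum(axis=1, keepdims=True)

out = weight @ v               # batch-wise non-local features
```

With b = 1 this degenerates to the self non-local case; increasing the batch enlarges the pool of candidate matches, which is exactly where the extra gain comes from when self-similarity within one image is scarce.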

B. ENHANCED BATCH-WISE NON-LOCAL AND ATTENTIVE MODULE
In this section, we describe the proposed EBNA module in detail. The main idea of developing the EBNA module is to employ diverse non-local information to enhance the local features. Besides the proposed batch-wise non-local module (based on patch-wise relevance among feature maps), we use different modules that employ global pooling operations to extract diverse non-local information.
Non-local relevance can be extracted through various operations, such as global pooling (e.g., channel-wise attention [54] and spatial-wise attention [55]). In CA and SA, a set of weights representing long-range dependencies is built over a specific channel or spatial location of the feature.
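A bare-bones sketch of the two global-pooling attentions follows. Real CA/SA modules insert learned layers (e.g., small MLPs or convolutions) between the pooling and the gating; those are omitted here for brevity, so this only illustrates the squeeze-and-gate data flow.

```python
import numpy as np

rng = np.random.default_rng(2)
bs, c, w, h = 2, 8, 16, 16
x = rng.standard_normal((bs, c, w, h))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Channel-wise attention: squeeze each channel by global average pooling
# over the spatial dims, then gate every channel with a weight in (0, 1).
ca_w = sigmoid(x.mean(axis=(2, 3), keepdims=True))   # (bs, c, 1, 1)
x_ca = x * ca_w

# Spatial-wise attention: squeeze across channels at every location.
sa_w = sigmoid(x.mean(axis=1, keepdims=True))        # (bs, 1, w, h)
x_sa = x * sa_w
```

Because each weight is computed from a global statistic of the whole feature map, every gated value depends on all positions (or all channels), which is what makes these attentions "non-local" despite using no explicit patch matching.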
To take advantage of the above non-local modules, we cooperate CA and SA with the proposed batch-wise non-local module to build a novel block, EBNA. FIGURE 3 shows the overview of the proposed EBNA module.
Taking the feature map x_fm ∈ R^{bs×c×w×h} as input and y_fm ∈ R^{bs×c×w×h} as output, the EBNA module can be defined as

y_fm = Conv(Concat(F_BNL(x_fm), F_CA(x_fm), F_SA(x_fm))),    (3)

where Conv(·) and Concat(·) represent the convolution and feature-map concatenation operations. F_BNL(·) represents the proposed batch-wise non-local module; F_CA(·) and F_SA(·) represent the channel-wise and spatial-wise attention modules, respectively.
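The concatenate-and-fuse step of EBNA can be sketched as below, with scalar multiples standing in for the three branch outputs and a 1×1 convolution expressed as an einsum; all weights and branch outputs are illustrative stand-ins, not the learned modules themselves.

```python
import numpy as np

rng = np.random.default_rng(3)
bs, c, w, h = 2, 4, 8, 8
x = rng.standard_normal((bs, c, w, h))

# Stand-ins for the three non-local branches: in EBNA these would be the
# batch-wise non-local module, channel attention, and spatial attention.
f_bnl = x * 0.5
f_ca  = x * 0.3
f_sa  = x * 0.2

# Concat along the channel axis: (bs, 3c, w, h).
concat = np.concatenate([f_bnl, f_ca, f_sa], axis=1)

# A 1x1 convolution fusing 3c channels back to c, written as an einsum.
W = rng.standard_normal((c, 3 * c)) * 0.1
y = np.einsum('oc,bcwh->bowh', W, concat)
```

The fusing convolution lets the network learn how much to trust each kind of non-local evidence per output channel, rather than fixing the mixture by hand.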
In EBNA, the proposed batch-wise non-local module can extract patch-wise dependencies from a batch of images and build relevance weights among image patches. In contrast, CA and SA modules can extract more general dependencies across channels and spatial locations. Combining the proposed batch-wise non-local module and attention modules can generate enhanced non-local features and extract more diverse non-local information among features.

C. NLIB
Besides abundant non-local information, local features are also essential in IR models. Based on this concept, we build a block named NLIB, which combines various non-local and adaptive local information from the degraded image.
As shown in FIGURE 4, the proposed NLIB consists of two parts, in which EBNA can extract various non-local dependencies and DECB can extract local features. We build the local feature extractor based on the deformable convolution. The DECB includes the deformable convolution and an attentive operation. The deformable convolution can learn from adaptive receptive fields, while the attentive operation with a residual connection can help extract adaptive features with focus. Thus, sophisticated local features can be acquired.
NLIB can be defined as

y = Conv(F_EBNA(x) + F_DECB(x)),    (4)

where Conv(·) represents the convolution operation, F_EBNA(·) represents the proposed EBNA module, and F_DECB(·) represents the DECB module. The output feature maps of EBNA and DECB are added together and passed through a convolution layer.
In NLIB, various non-local features can enhance the local features without the limits of the receptive field and learn from long-range dependencies. Both the non-local and local information extracted by NLIB can help restore more structure and texture details. Furthermore, as the basic component of NLENet, NLIB is applied at every down-sampling and up-sampling stage to extract richer features at each scale.
During training, given the corrupted images {Î_i}, i = 1, …, N, Î_i ∈ R^{H×W×C} (H the height, W the width and C the channels of the image) as inputs, NLENet learns a mapping function f_θ with a set of parameters θ to generate the corresponding restored images {I_i}, I_i ∈ R^{H×W×C}, by employing the ℓ2 loss function formulated as

L(θ) = (1/N) Σ_{i=1}^{N} ||f_θ(Î_i) − I_i||_2^2.    (5)
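As a quick illustration, the ℓ2 loss on a toy 2×2 restored/ground-truth pair; the example averages over all elements, while the exact normalization constant in the paper's formulation may differ.

```python
import numpy as np

# Toy restored output and ground-truth image (values are illustrative).
restored = np.array([[0.2, 0.4],
                     [0.6, 0.8]])
target   = np.array([[0.0, 0.5],
                     [0.5, 1.0]])

# Mean squared error: average of squared per-pixel differences.
l2_loss = np.sum((restored - target) ** 2) / restored.size
```

Here the squared differences are 0.04, 0.01, 0.01 and 0.04, so the loss is 0.10 / 4 = 0.025.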

IV. EXPERIMENTS
We perform extensive experiments in this section to demonstrate the proposed NLENet's effectiveness on four IR tasks: (a) synthetic image denoising, (b) real image denoising, (c) JPEG artifacts removal, and (d) real image super resolution. We test on several benchmark datasets for each task to give a thorough performance evaluation of the proposed network. The model is trained with the Adam optimizer with β1 = 0.9 and β2 = 0.999. We train the models with an initial learning rate of 1×10−4, gradually decreased to 1×10−6. During training, we apply data augmentation (including random horizontal and vertical flipping) for better performance. The training batch size is set to 6 with a patch size of 256 × 256. Similar settings are used in all four IR tasks. The experiments are conducted on an NVIDIA Tesla V100 with the PyTorch library [56].
We list the result comparisons in terms of PSNR and SSIM in TABLES 1, 2 and 3 (for methods whose code and models are not available, we can only compare the published results from the original papers, and "NA" is placed where results are missing). We will release the pre-trained models along with the source code upon the acceptance of the paper.

A. SYNTHETIC IMAGE DENOISING
This section shows the comparison results of the proposed NLENet for AWGN denoising on grayscale images. We train the synthetic image denoising models with the training set of DIV2K [66] in grayscale. Then we evaluate the trained models on the Set12, BSD68 [58], and Urban100 [59] datasets, which are commonly used in the synthetic image denoising task. To fully validate the proposed network's denoising ability, we train models with AWGN at different noise levels, i.e., σ = 15, 25, 50, and 75 (standard deviation σ), and compare with the SOTA methods listed in TABLE 1. TABLE 1 presents quantitative comparisons of PSNR and SSIM [57], where we can observe that the proposed NLENet outperforms the traditional and latest SOTA CNN-based denoising methods at most noise levels. Specifically, compared to the latest non-local network COLA-Net, our algorithm demonstrates a performance improvement of 0.02 to 0.1 dB in PSNR at different noise levels on all test sets.
We also give a visual comparison of the denoised results from the proposed method and the latest methods in FIGURE 5. From FIGURE 5, we can easily find that methods like DnCNN lose fine details, while in the visual results of RIDNet and AINDNet, the restored images show blurred edges.
Compared to the latest non-local network COLA-Net, whose denoised result loses part of the lines in the zoom-in area, our NLENet preserves clear lines. Therefore, NLENet is able to reconstruct the structural information and fine texture of the noisy image.

B. REAL IMAGE DENOISING
To further demonstrate the merits of our proposed method, we compare the proposed network with several SOTA real image denoising approaches. Unlike synthetic image denoising, real images are corrupted by realistic noise during capturing, and we have no prior knowledge of the noise distribution. We train the real image denoising model with the training set of SIDD medium [67], and the SIDD [67] and DND [68] test sets are used for evaluation. The training set of SIDD medium [67] contains 320 very high-resolution image pairs captured by smartphones under different environments. The test set of SIDD contains 1280 images of size 512×512, and DND [68] contains 1000 images of size 512×512.
Result comparisons are summarized in TABLE 2. We can observe that our NLENet outperforms the latest methods on SIDD and achieves competitive results on DND. For instance, compared with another non-local network, COLA-Net, and the latest method, GNSCNet, the PSNR results of NLENet are about 0.6∼0.7 dB higher on SIDD. As for SSIM, GNSCNet achieves the highest result among the comparing methods, while NLENet achieves the second-highest result. However, in the synthetic image denoising task, NLENet gains over 0.1 dB in terms of PSNR and 0.02∼0.03 in terms of SSIM on all the test sets and noise levels compared to GNSCNet.
As an IR structure with generalization ability, NLENet shows superior or comparable performance on different IR tasks and test sets. On the DND dataset, we still achieve a competitive denoising result, while AINDNet and COLA-Net achieve higher performance because they employ extra training data. AINDNet is specially designed for real image denoising and trained with extra data (beyond the SIDD training set) for better performance. COLA-Net also employs extra training data, while we employ only the SIDD training set, as most methods do. When COLA-Net and AINDNet are trained with the same dataset as NLENet, NLENet still achieves the highest PSNR and SSIM at all noise levels on the synthetic image denoising datasets. The visual comparison of the results is shown in FIGURE 6 and FIGURE 7, in which we can see that NLENet recovers cleaner outlines and preserves more textural details than the competing approaches.

C. JPEG ARTIFACTS REMOVAL
In this section, we evaluate our NLENet on the JPEG artifacts removal task. We train the models with the DIV2K [66] training set and test on the classic JPEG artifacts removal test sets CLASSIC5 and LIVE1 [73]. We compare NLENet with SOTA JPEG artifacts removal approaches in terms of PSNR and SSIM [57]. The results are shown in TABLE 3, in which we can see that our proposed method demonstrates the best PSNR results on all test sets and quality factors over previous approaches. For instance, compared with the latest non-local network COLA-Net, NLENet achieves superior performance on both test sets. We can also observe that although NLENet achieves the highest PSNR, QGAC shows a slightly higher SSIM. This is because NLENet is trained with the L2 loss in a single stage, while QGAC requires two-stage training: the L2 loss for initial training, followed by a GAN loss to fine-tune the model. Training with the GAN loss, which contains perceptual terms, improves the SSIM results and visual quality but decreases the PSNR.
The visual quality comparison is shown in FIGURE 8, in which we can observe that the results from DnCNN and RNAN show over-smoothed texture details. In the results of MWCNN and the latest non-local network, COLA-Net, a blurred outline is preserved in the restored image. In contrast, NLENet preserves more subtle texture details and clear edges in the restored image, which further demonstrates its superiority.

D. REAL IMAGE SUPER RESOLUTION
We apply the proposed NLENet to the real image super resolution task and compare with the SOTA SR algorithms (VDSR [18], SRResNet [80], RCAN [23], LP-KPN [79]) and CDC [82] on the RealSR test dataset with upscaling factors of ×2, ×3 and ×4. Note that all the comparing algorithms are trained on the training set of RealSR [79] (the comparison results are also provided by RealSR [79] and CDC [82]). RealSR [79] is a real image super resolution dataset, which contains LR-HR real-world image pairs captured by adjusting the cameras' focal length. RealSR has 183 (×2), 234 (×3), and 178 (×4) very high resolution image pairs for training and 30 image pairs for testing at each scale. In the experiment, we compute the PSNR and SSIM [57] on the Y channel (in YCbCr color space), which is a common practice in SR [18], [23], [80]. The results are summarized in TABLE 4, and we can observe that NLENet shows superior performance among the competitive methods, with a PSNR improvement of around 0.06 to 0.19 dB. In the visual results, the comparing methods restore blurry edges.

V. ABLATION STUDY
In this section, we further explore and investigate the effectiveness of the proposed NLENet. Here we study the impact and effectiveness of each proposed component on the final model performance. The ablation experiments are performed for the grayscale image synthetic denoising task with noise level σ = 25.

A. ABLATION ON THE PROPOSED MODULES
We apply different combinations of the proposed modules to test their effectiveness in the proposed NLENet. TABLE 5 shows the comparison results tested on Set12. In the experiment, the performance of a baseline multi-scale architecture (mostly based on stacked convolution layers) without any of the proposed components is shown in case1. In case2 and case3, we apply only the non-local module proposed by COLA-Net [31] (self non-local) and our proposed batch-wise non-local module, respectively, in the multi-scale structure to compare their influence on IR performance. Case2 shows a good performance gain, demonstrating the effectiveness of building long-range dependencies. The comparison of case2 and case3 demonstrates the superior performance of our proposed batch-wise non-local module compared to the self non-local module.
Subsequently, from case3 to case5, we add the proposed components gradually to explore the performance changes. In case4, we apply the proposed EBNA in the multi-scale model. We can observe that combining various non-local modules achieves better performance than using only one type of non-local module (compare to case3). Case5 is our proposed NLENet, which achieves the highest performance. To further compare the self non-local and the proposed batch-wise non-local module, we replace the batch-wise non-local module in NLENet with the self non-local module [31] in case6. In case7, we remove the CA and SA modules in EBNA and retain the other parts of NLENet to explore the impact of combining CA and SA with the batch-wise non-local module. We can observe that without the CA and SA modules, NLENet still achieves superior performance.
Based on the results in TABLE 5, we can summarize the following observations: (1) Our proposed batch-wise non-local module achieves better performance than the existing self non-local module (used in COLA-Net), as shown in case2 and case3.
(2) Adding the proposed EBNA and NLIB modules in the model can improve the performance, as shown in case4 and case5. Also, case5, as our proposed NLENet, achieves the best performance among all the cases.
(3) We replace the batch-wise non-local module in NLENet with the self non-local module in case6, which further illustrates the superiority of our proposed batch-wise non-local module over COLA-Net's self non-local module [31].
(4) Even with the CA and SA modules removed from EBNA, our NLENet still achieves competitive performance in case 7.
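The key distinction between the self non-local and the batch-wise non-local modules compared above can be illustrated with a minimal sketch. Here features are assumed flattened to a (b, n, c) array; the function name and the plain dot-product affinity with softmax normalization are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def batchwise_nonlocal(feats):
    """Sketch of a batch-wise non-local operation.

    feats: (b, n, c) -- b images, n spatial positions each, c channels.
    Unlike a per-image ("self") non-local module, queries from every
    image attend over the keys of ALL b images in the batch, so with
    b = 1 this reduces to the ordinary self non-local case.
    """
    b, n, c = feats.shape
    q = feats.reshape(b * n, c)              # queries from every image
    k = feats.reshape(b * n, c)              # keys pooled across the batch
    v = feats.reshape(b * n, c)
    sim = q @ k.T / np.sqrt(c)               # (b*n, b*n) cross-image affinities
    sim -= sim.max(axis=1, keepdims=True)    # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=1, keepdims=True)  # softmax over all batch positions
    out = attn @ v                           # aggregate features batch-wide
    return feats + out.reshape(b, n, c)      # residual connection
```

The affinity matrix is (b·n) × (b·n) rather than n × n per image, which is exactly what lets each position borrow information from similar patches in other images of the batch.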

B. ABLATION ON BATCH-WISE NON-LOCAL MODULE
To further explore the impact of the proposed batch-wise non-local module, we train several models with different numbers of images b = 1, 2, 4, 6 employed in the batch-wise non-local module. The training loss and PSNR on Set12 are shown in FIGURE 10. We observe that a larger b accelerates the convergence of the model and achieves better performance, although the performance gain diminishes as b increases. However, a larger b also leads to higher memory consumption in the patch-matching process during training. We therefore set b = 6, the largest value that fits within the memory capacity of the GPU we use.
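The memory limit that caps b can be seen from a back-of-the-envelope calculation: the batch-wise affinity matrix has (b·n)² entries, so its memory grows quadratically in b. The accounting below is a hypothetical sketch that counts only one fp32 affinity matrix and ignores activations stored for backpropagation.

```python
def affinity_memory_mb(b, n, bytes_per_elem=4):
    """Memory (MB) of the (b*n) x (b*n) batch-wise affinity matrix.

    b: number of images matched jointly; n: spatial positions per image.
    Illustrative accounting only: a single fp32 matrix, no gradients.
    """
    size = (b * n) ** 2 * bytes_per_elem
    return size / (1024 ** 2)

# Doubling b quadruples the affinity memory; for n = 1024 positions:
for b in (1, 2, 4, 6):
    print(b, affinity_memory_mb(b, n=1024))  # -> 4.0, 16.0, 64.0, 144.0 MB
```

Under these assumptions, moving from b = 1 to b = 6 already multiplies the affinity memory by 36, which is consistent with the moderate choice of b being dictated by GPU capacity rather than by saturating accuracy.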

C. FEATURE VISUALIZATION OF THE PROPOSED MODULES
To further explore the influence of the proposed modules, we visualize their feature map outputs in FIGURE 11. The feature map extracted by our batch-wise non-local module (in (a)) is clearer and richer than that of the self non-local module (in (d)). We can also find that, compared with the batch-wise non-local module, the CA and SA modules extract non-local features with a clearly focused area in the feature map. Thus, we combine these modules to build EBNA (in (e)), which generates sparser non-local features. To further preserve more delicate local details in the restored images, we add a deformable convolution block (DECB) to cooperate with EBNA. As shown in (f), the DECB extracts more sophisticated local features. With the EBNA and DECB, NLIB combines non-local and local features to generate more enriched features, as shown in (c). As noted in [83], feature maps with lower channel weights contain more noise-like information. Therefore, NLIB can preserve more texture and structural information, since it generates feature maps with higher channel weights (in (c)) after combining the non-local and local features.
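The channel-weight behavior discussed above can be sketched with a squeeze-and-excitation style channel attention (CA): channels are re-scaled by learned gates in (0, 1), so channels that receive low weights (which per [83] tend to carry noise-like information) are suppressed. The weight shapes and the two-layer gating below are generic assumptions, not the paper's exact CA design.

```python
import numpy as np

def channel_attention(feats, w1, w2):
    """Sketch of a squeeze-and-excitation style channel attention.

    feats: (c, h, w) feature maps.
    w1: (c // r, c), w2: (c, c // r) -- hypothetical weights of the two
    fully connected layers with reduction ratio r.
    """
    squeeze = feats.mean(axis=(1, 2))               # global average pool, (c,)
    hidden = np.maximum(w1 @ squeeze, 0.0)          # ReLU
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gates in (0, 1)
    return feats * gates[:, None, None]             # re-scale each channel
```

Because every gate lies strictly between 0 and 1, the module can only attenuate channels, never amplify them; a high surviving channel weight therefore signals a feature map the network considers informative rather than noise-like.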

D. PARAMETERS AND INFERENCE TIME STUDY
FIGURE 12 shows the model parameter size and GPU runtime of the competing methods on the synthetic denoising task. The NVIDIA cuDNN-v7.0 deep learning library is adopted under the Ubuntu 16.04 system. From FIGURE 12 (a), we can observe that despite the superior performance of NLENet, its parameter size is larger than those of RIDNet, SADNet, and COLA-Net, because the proposed non-local module in NLENet includes several extra convolution layers and attention modules.
Meanwhile, the runtime evaluation in FIGURE 12 (b) demonstrates that our proposed model still achieves a competitive speed with outstanding performance. In particular, compared with COLA-Net, NLENet achieves a significant speed improvement. This efficiency comes from the multi-scale structure of NLENet, which saves time by operating at low-resolution scales; furthermore, COLA-Net extracts overlapping non-local patches, which costs more time in the non-local process. NLENet has a longer inference time than SADNet because of its more sophisticated structure: besides the deformable convolution modules that explore adaptive local information, the EBNA module employs various non-local modules to generate enriched non-local features, which takes more time. Therefore, our multi-non-local enhanced module achieves better performance while maintaining a competitive inference speed.
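The speed advantage of running non-local attention at low-resolution scales follows from its quadratic cost in the number of spatial positions. The figures below are a back-of-the-envelope sketch (counting only the two affinity/aggregation matrix products), not the paper's measured FLOPs.

```python
def nonlocal_flops(h, w, c):
    """Approximate multiply-adds of one non-local block: the affinity
    product (n x c)(c x n) plus the aggregation (n x n)(n x c),
    i.e. 2 * n^2 * c with n = h * w spatial positions."""
    n = h * w
    return 2 * n * n * c

full = nonlocal_flops(128, 128, 64)   # attention at full resolution
down = nonlocal_flops(32, 32, 64)     # after 4x downsampling in the U-net
print(full // down)                   # -> 256
```

Because the cost scales with n², a 4x spatial downsampling (16x fewer positions) reduces the attention cost by a factor of 16² = 256, which is why placing non-local modules at coarse U-net scales keeps inference time competitive.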

VI. CONCLUSION
In this paper, we propose a batch-wise non-local module to explore long-range dependencies. We further build the EBNA module on top of the proposed batch-wise non-local module, combining various non-local modules to extract more enriched non-local features. Besides EBNA, we build a novel block named NLIB, which combines various non-local features with adaptive local features to preserve fine contextual details. Finally, we embed the NLIB in a U-net-like structure to build NLENet. Extensive experiments show that NLENet consistently achieves state-of-the-art performance on several image restoration tasks, including synthetic image denoising, real image denoising, JPEG artifact removal, and real image super-resolution.