Mixed High-Order Non-Local Attention Network for Single Image Super-Resolution

Attention has been widely used in many tasks because it guides a network to concentrate on the most important regions of an input pattern. Nevertheless, many advanced works focus on first-order attention design, e.g., channel and spatial attention, and ignore higher-order attention mechanisms. In this work, we propose the Mixed High-Order Attention (MHA) module to model the complex, high-order information in the attention mechanism; it captures subtle texture and outputs a discriminative attention map. In addition, the receptive region of a convolution is local, so it cannot capture global context and long-range dependencies. Therefore, we propose a non-local block to obtain global attention features. We also propose the Mixed High-Order Non-local Attention Network (MHNAN) to improve the richness of attention. Extensive experiments demonstrate the superiority of our MHNAN for super-resolution over several state-of-the-art models.


I. INTRODUCTION
Single image super-resolution (SISR) has attracted a lot of attention. The task of SISR is to recover a clear high-resolution (HR) image from its low-resolution (LR) counterpart. However, many HR solutions can map to the same LR input, so this is an ill-posed problem. Researchers have therefore proposed many SR models, ranging from model-based and interpolation-based methods [1]-[3] to current learning-based methods [4]. These conventional methods are efficient, but they still have some drawbacks: their accuracy may decrease quickly, and they can be time-consuming.
Deep learning has achieved impressive results in many different fields [5], [6]. Recently, many CNN-based SR models have concentrated on learning the mapping from LR images to HR images. CNNs have obtained satisfactory results in SISR [7]-[16] by exploiting image patterns. Dong et al. [7] were the first to employ a CNN for SISR. They proposed a CNN-based model named SRCNN to learn a non-linear transformation from LR to HR images. The works in [10], [17] proposed much deeper networks by integrating recursive learning and residual learning, obtaining significant improvements over SRCNN. Nevertheless, the above methods first upscale the LR image to the desired output size by interpolation and then feed it into the network, thus suffering from reconstruction artifacts and extra computational cost. To deal with this problem, some researchers proposed methods that upscale the final LR feature at the tail of the network. Dong et al. [8] proposed a deconvolution module at the tail of the architecture to upsample the final LR information. Shi et al. [18] proposed an upscale module named the sub-pixel convolution layer to upsample the LR feature to the HR feature at the tail of the network. More and more researchers now employ this efficient post-upsampling approach, which not only reduces the computational load but also allows deeper networks. Recently, Lim et al. [9] proposed a network integrating residual learning, named EDSR, which won the NTIRE2017 SR Challenge [19].
The associate editor coordinating the review of this manuscript and approving it for publication was Hengyong Yu.
Nevertheless, many CNN-based super-resolution models still have some drawbacks: (1) most CNN-based models ignore high-order information and so fail to make full use of the information in the blurry LR pattern, resulting in unsatisfying performance; (2) the receptive region of a convolution is local, so it cannot capture long-range dependencies and rarely exploits global feature correlations, which limits the learning ability of CNNs.
To tackle the above issues, we introduce a deep Mixed High-Order Non-local Attention Network (MHNAN) for mixed high-order feature extraction. Specifically, a mixed high-order attention (MHA) module is introduced to extract and mix high-order information. By exploiting mixed high-order feature information, our MHA adaptively learns feature inter-dependencies. The MHA thus guides our network to focus on the significant pattern and improves performance. In addition, a non-local enhanced group (NEG) is proposed to integrate non-local operations to model long-range dependencies. We extract the pattern from the LR images by stacking non-local residual channel attention groups (NRCAG). To sum up, our contributions are as follows:
• We introduce a Mixed High-Order Non-local Attention Network (MHNAN) to tackle the image super-resolution task. Extensive experiments on multiple datasets show that our MHNAN is superior to many advanced models.
• We propose a mixed high-order attention (MHA) module, which is plug-and-play and can adaptively capture features by considering feature information higher than first order. This MHA mechanism enhances the discriminative learning ability of the network and makes it concentrate on the more important features.
• We propose a non-local enhanced group (NEG), which further integrates non-local operations to extract long-range pattern information.

A. NETWORK FRAMEWORK
As can be seen from Figure 2, our MHNAN primarily comprises the following parts: a shallow feature extractor, a non-local enhanced group (NEG), an upscale layer, and a reconstruction layer. Let $I_{LR}$ and $I_{SR}$ denote the input and output of our MHNAN. Following [9], [20], we use one convolution layer to extract the shallow feature $F_0$ from the low-resolution input:
$$F_0 = H_{SF}(I_{LR}),$$
where $H_{SF}$ stands for the convolution operation. We then feed the shallow feature $F_0$ into the NEG, which produces the deep feature as
$$F_{DF} = H_{NEG}(F_0),$$
where $H_{NEG}$ denotes the non-local enhanced group, which comprises multiple non-local residual channel attention groups to extract long-range information and channel information. The deep feature $F_{DF}$ is then upsampled by the upscale layer through
$$F_{\uparrow} = H_{\uparrow}(F_{DF}),$$
where $F_{\uparrow}$ and $H_{\uparrow}$ are the upsampled feature and the upsample layer, respectively. Many methods can serve as the upscale module, such as deconvolution [8] and ESPCN [18]. We embed the upscale module at the tail of the network to achieve high performance; this design is preferred in current super-resolution models [8], [9], [20]. The upscaled feature is then fed into one convolution layer to reconstruct the high-resolution image:
$$I_{SR} = H_R(F_{\uparrow}) = H_R(H_{\uparrow}(F_{DF})) = H_{MHNAN}(I_{LR}),$$
where $H_R$, $H_{\uparrow}$, and $H_{MHNAN}$ are the reconstruction layer, the upsample layer, and the function of MHNAN, respectively. We then use a loss function to optimize MHNAN. Several loss functions are widely used, such as the L2, L1, and perceptual losses. To demonstrate the effectiveness of our MHNAN, we use the L1 loss. Given a training set with N low-resolution images and their high-resolution counterparts, denoted by $\{I_{LR}^i, I_{HR}^i\}_{i=1}^{N}$, the goal of training MHNAN is to minimize the loss function
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| H_{MHNAN}(I_{LR}^i) - I_{HR}^i \right\|_1,$$
where $\theta$ denotes the parameter set of MHNAN. We apply the Adam optimizer to minimize the loss.
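The L1 objective described above can be illustrated with a minimal pure-Python sketch. It is not the authors' training code: images are assumed to be 2-D nested lists, and `nearest_upscale_2x` is a hypothetical stand-in for the network function so that the example runs on its own.

```python
def l1_distance(pred, target):
    """L1 norm of (pred - target) for two equally shaped 2-D nested lists."""
    if len(pred) != len(target) or any(len(p) != len(t) for p, t in zip(pred, target)):
        raise ValueError("prediction and target must have the same shape")
    return sum(abs(p - t) for p_row, t_row in zip(pred, target)
               for p, t in zip(p_row, t_row))


def l1_loss(model, lr_images, hr_images):
    """Average the per-image L1 distance over the N training pairs."""
    n = len(lr_images)
    return sum(l1_distance(model(lr), hr)
               for lr, hr in zip(lr_images, hr_images)) / n


def nearest_upscale_2x(img):
    """Hypothetical stand-in for the network: nearest-neighbour 2x upscaling."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]  # repeat every pixel horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat every row vertically
    return out


lr_batch = [[[1.0, 2.0]], [[0.0]]]
hr_batch = [[[1.0, 1.0, 2.0, 2.0], [1.0, 1.0, 2.0, 3.0]],
            [[0.0, 0.0], [0.0, 0.5]]]
loss = l1_loss(nearest_upscale_2x, lr_batch, hr_batch)
```

Deep learning frameworks often also average over pixels, which only rescales the objective by a constant for fixed-size patches.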

B. NON-LOCAL ENHANCED GROUP (NEG)
We now describe the proposed non-local enhanced group (NEG) (see Figure 1). The NEG is composed of several non-local residual channel attention groups (NRCAG) and an MHA module. The non-local block extracts long-range features. Each NRCAG further contains M simplified residual channel attention blocks [4]. We first describe the non-local block in detail. Consider an input with shape $(H, W, C_{in})$, where the batch dimension is omitted for simplicity. It can be expressed as:
Stacking several blocks can be a simple and helpful way to improve a network [9], [11]. However, a network constructed in this way is difficult to train and hits a performance bottleneck because of exploding and vanishing gradients, so simply stacking repeated blocks may not achieve better results. To handle this problem, we propose the NEG to bypass abundant low-frequency features and facilitate the training of the network. The NRCAG in the g-th group can be denoted as:
$$F_g = H_g(F_{g-1}),$$
where $H_g$ is the operation of the g-th NRCAG, and $F_g$ and $F_{g-1}$ represent the output and input of the g-th NRCAG. The deep feature can then be extracted as:
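Separately from the deep-feature expression, the non-local operation described earlier in this subsection can be illustrated with a small sketch. It assumes the commonly used softmax-normalized dot-product form of non-local attention with identity embeddings and a residual connection; the learned projections that a real block would apply are omitted, so this shows the idea rather than the block used in MHNAN.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local(features):
    """Apply a simplified non-local operation.

    `features` holds one vector of length C per spatial position (the H x W
    grid flattened). Each output position is the input plus a weighted sum
    over ALL positions, so distant positions can influence each other.
    """
    out = []
    for xi in features:
        # Pairwise similarity between position i and every position j.
        scores = [sum(a * b for a, b in zip(xi, xj)) for xj in features]
        weights = softmax(scores)
        aggregated = [sum(w * xj[k] for w, xj in zip(weights, features))
                      for k in range(len(xi))]
        out.append([a + b for a, b in zip(xi, aggregated)])  # residual connection
    return out


result = non_local([[1.0, 0.0], [1.0, 0.0]])
```

Because every position attends to every other position, the cost grows quadratically with the number of positions H x W.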

C. MIXED HIGH-ORDER ATTENTION
Attention is a tool that biases the allocation of available resources towards the most useful part of the pattern. In CNNs, it is widely used to weight convolutional response maps so as to highlight significant parts and suppress irrelevant information, as in channel attention [21] and spatial attention [22]-[28]. Given an input $X \in \mathbb{R}^{C \times H \times W}$, we briefly formulate spatial and channel attention as a general case:
Here, $|_C$ means copying the spatial mask $M$ along the channel dimension $C$ times, and $M \in \mathbb{R}^{H \times W}$ is a spatial mask. Therefore, Eq. 9 represents the implementation of spatial attention.
Nevertheless, channel attention and spatial attention cannot learn high-order information and thus fail to capture the subtle texture in the image. Therefore, we concentrate on modeling A(I ) with a high-order feature representation.
where α r is the weight vector and σ represents a non-linear activation function, such as the ReLU function. We denote R as the number of orders. A(i) in Eq. 10 is used as the required mixed high-order attention feature for the i.
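To make the idea of mixing orders concrete, here is a hedged sketch for a single local descriptor. It rests on assumptions that the text above does not confirm: the order-r term is taken to be the element-wise product of r linear projections of the descriptor, each term passes through ReLU and is weighted by its vector, and the sum over orders is squashed by a sigmoid. The names `weights` and `alphas` are illustrative.

```python
import math


def matvec(matrix, vector):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def high_order_term(x, projections):
    """Element-wise product of len(projections) linear projections of x.

    With r projections the result is a degree-r polynomial in x, which is
    what makes the term "r-th order" in this sketch.
    """
    z = None
    for w in projections:
        p = matvec(w, x)
        z = p if z is None else [a * b for a, b in zip(z, p)]
    return z


def mixed_high_order_attention(x, weights, alphas):
    """Scalar attention value for descriptor x (assumed form, see lead-in).

    weights[r - 1] holds the r projection matrices of order r, and
    alphas[r - 1] is the weight vector applied to the activated term.
    """
    logit = 0.0
    for projections, alpha in zip(weights, alphas):
        z = high_order_term(x, projections)
        activated = [max(0.0, v) for v in z]  # ReLU as the non-linearity
        logit += sum(a * v for a, v in zip(alpha, activated))
    return 1.0 / (1.0 + math.exp(-logit))     # sigmoid is an assumption


identity = [[1.0]]
orders = [[identity], [identity, identity]]   # R = 2
attention = mixed_high_order_attention([2.0], orders, [[1.0], [0.0]])
```

With R = 4, as in the experimental setup, `weights` would hold four entries containing 1, 2, 3, and 4 projection matrices, respectively.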

A. SETUP
Following [9], [15], we use the 800 training images from the DIV2K dataset [19] as training data. To evaluate the performance of our MHNAN, we use five public datasets: Set5, Set14, BSD100, Urban100, and Manga109. We adopt the Matlab resize function with the bicubic operation as the degradation model. The SR results are evaluated with PSNR and SSIM.
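For reference, PSNR can be computed as in the following sketch. It assumes 8-bit intensities (peak value 255) and 2-D nested lists; which color channel is evaluated and whether image borders are cropped are not stated above, so the function simply scores the arrays it is given.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally shaped 2-D images."""
    diffs = [(r - e) ** 2
             for r_row, e_row in zip(reference, estimate)
             for r, e in zip(r_row, e_row)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


hr = [[10.0, 20.0], [30.0, 40.0]]
sr = [[11.0, 19.0], [31.0, 39.0]]  # every pixel off by one, so MSE = 1
score = psnr(hr, sr)
```

A higher PSNR means a smaller mean squared error, so a gain of a few hundredths of a dB corresponds to a small but consistent reduction in pixel error.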
In the training stage, the LR images are augmented by horizontal flipping and random rotation by 90°, 180°, and 270°. We set R to 4, which means our MHA uses orders {1, 2, 3, 4}. Each mini-batch consists of 16 low-resolution image patches of size 48 × 48. The ADAM algorithm is applied to optimize MHNAN with $\beta_1 = 0.9$, $\beta_2 = 0.99$, and $\epsilon = 10^{-8}$. The learning rate is initialized as $10^{-4}$ and then halved every 200 epochs.
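The step schedule and the augmentation can be sketched as follows. The function names are illustrative, and applying the same transform to the paired HR patch is an assumption that follows from the need to keep LR and HR aligned.

```python
import random


def learning_rate(epoch, base_lr=1e-4, step=200, gamma=0.5):
    """Step decay: start at base_lr and halve it every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


def hflip(img):
    """Mirror a 2-D nested-list image left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate a 2-D nested-list image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment_pair(lr, hr, rng):
    """Apply the SAME random flip and rotation to an LR/HR pair."""
    if rng.random() < 0.5:
        lr, hr = hflip(lr), hflip(hr)
    for _ in range(rng.randrange(4)):  # 0, 90, 180 or 270 degrees
        lr, hr = rot90(lr), rot90(hr)
    return lr, hr


patch = [[1, 2], [3, 4]]
lr_aug, hr_aug = augment_pair(patch, patch, random.Random(0))
```

Under this schedule the learning rate is 1e-4 for epochs 0 to 199, 5e-5 for epochs 200 to 399, and so on.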

B. IMPLEMENTATIONS
We fix the number of NRCAGs as G = 18 in the NEG and embed two MHA modules at the head and tail of the NEG. In each NRCAG, we set M = 10 residual channel attention blocks. Apart from the shallow feature extraction layer and the upscale layer, we set the number of filters to C = 64. For the upscale layer, we follow [4], [15] and use sub-pixel convolution [18] to upscale and reconstruct the deep feature, followed by a 1 × 1 convolution with three filters to output RGB images.
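The rearrangement step of sub-pixel convolution (often called pixel shuffle) can be sketched in pure Python. The convolution that first expands the channels by a factor of r² is omitted, and the channel ordering below follows the convention used by common deep learning frameworks, which is an assumption about the authors' implementation.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C * r * r, H, W) nested list into (C, H * r, W * r).

    Each group of r * r input channels supplies the r x r sub-pixels of one
    output channel, so spatial resolution grows by r in both directions.
    """
    channels, h, w = len(x), len(x[0]), len(x[0][0])
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = channels // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):          # vertical sub-pixel offset
            for j in range(r):      # horizontal sub-pixel offset
                src = x[c * r * r + i * r + j]
                for row in range(h):
                    for col in range(w):
                        out[c][row * r + i][col * r + j] = src[row][col]
    return out


# Four 1x1 channels become one 2x2 channel for a scale factor of 2.
features = [[[0.0]], [[1.0]], [[2.0]], [[3.0]]]
upscaled = pixel_shuffle(features, 2)
```

Because the rearrangement only moves values, all learned computation stays at LR resolution, which is why upscaling at the tail of the network is cheap.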

C. ABLATION STUDY
As can be seen from Figure 1, our MHNAN contains two primary components: the non-local enhanced group (NEG) and the mixed high-order attention (MHA) modules. To demonstrate the effectiveness of each module, we train and test MHNAN and its variants for comparison. The results are shown in Table 1. We set $R_{base}$ as the baseline, which contains only convolutional layers organized into 18 residual channel attention groups (RCAGs) with 10 residual channel attention blocks [4] in each group. $R_{base}$ reaches only 27.43 dB PSNR. When we embed the MHA module before the NEG, $R_a$ reaches 27.57 dB. If we embed the MHA module only after the NEG, $R_b$ reaches 27.59 dB. The results of $R_a$ and $R_b$ demonstrate the effect of our MHA. $R_c$ denotes the variant in which non-local blocks are employed to capture long-range dependencies through a self-attention mechanism; it reaches 27.61 dB. $R_e$ denotes the variant that applies MHA both before and after the NEG; it reaches 27.69 dB, which is better than $R_a$ to $R_c$. When we combine MHA and the non-local block, $R_d$ reaches 27.75 dB.
Figure 3 visualizes the results of different methods. Many super-resolution models fail to reconstruct the lattice and produce severe blurring artifacts, whereas our MHNAN achieves sharper results and reconstructs more texture details. In the case of ''img005'', most models output severe blurring artifacts. Judged against the ground truth, MHNAN restores more image details and gets more reliable results. Although recovering high-frequency texture from the limited LR input information is hard, our MHNAN makes the best use of the limited low-resolution pattern through mixed high-order non-local attention and has a more powerful pattern representation, which results in higher accuracy.

E. COMPONENT ANALYSIS
To explore the effect of the number of orders in the MHA module, we perform quantitative comparisons on MHNAN. As shown in Table 3, MHNAN improves performance on the single image super-resolution task over both RCAN and SAN. In particular, comparing MHNAN-2 with RCAN and SAN shows that using higher-order attention patterns indeed enhances the learning ability of the network. Moreover, the performance improves further with the number of orders: when the number of orders increases from 2 to 4, MHNAN-4 outperforms MHNAN-2. This demonstrates that employing a larger number of orders in MHNAN is beneficial for capturing high-order information, and MHNAN-4 outperforms all the baseline models, showing the effectiveness of our network. However, increasing the number of orders to 6 brought little further improvement, so we do not report those results here.
Table 4 shows the model size and performance of current CNN-based SR models. Among these methods, MemNet and NLRG contain far fewer parameters, at the cost of reduced performance. MHNAN not only has fewer parameters than RDN and RCAN but also achieves better performance, which means that MHNAN achieves a good trade-off between model complexity and performance.
Figure 4 shows the speed comparison for the 2× scale factor with several state-of-the-art models on the Set5 dataset. Our MHNAN achieves the best results compared with the other models. Moreover, the inference time of MHNAN is much shorter than that of EDSR. Although the inference of MHNAN is slightly slower than that of RCAN [15] and SAN [20], our MHNAN achieves better results, which demonstrates that MHNAN obtains a trade-off between inference time and performance.

H. VISUAL ANALYSES OF MHA
To demonstrate the effectiveness of our proposed MHA, we visualize the average feature maps of the input and output of the MHA. As can be seen from the bottom row of Figure 5, the output of the MHA has richer texture than its input, which indicates that the proposed MHA can extract high-order features such as texture. In addition, the output of the MHA has more negative values than the input, which indicates that the proposed MHA suppresses smooth regions and thereby extracts more texture.

IV. CONCLUSION
In this work, we propose the mixed high-order attention (MHA) module to further enhance the discrimination of attention proposals by capturing complex, high-order information. We also propose the Mixed High-Order Non-local Attention Network (MHNAN) to tackle the SISR task. MHNAN achieves promising results by applying the MHA module and NEG to the single image super-resolution task. Extensive experiments demonstrate the superiority of our MHNAN.
YUJUAN SI received the master's and Ph.D. degrees in engineering from the Jilin University of Technology, in 1988 and 1996, respectively. She is currently a Professor with the Institute of Communication Engineering, Jilin University. She is also a Lecturer with the Zhuhai College of Jilin University. Her research interests include embedded systems and biomedical signal processing and recognition.
LINA XU is currently pursuing the Ph.D. degree with the College of Instrument Science and Electrical Engineering, Jilin University. Her research interests include small signal processing and biomedical image signal processing.
CHONGJIN LIU is currently pursuing the bachelor's degree with the Zhuhai College of Jilin University. His research interests include signal processing, image recognition, and machine learning. VOLUME 9, 2021