Lightweight Attended Multi-Scale Residual Network for Single Image Super-Resolution

Recently, deep convolutional neural networks (CNNs) have been widely applied to the single image super-resolution (SISR) task and have achieved significant progress in reconstruction performance. However, most existing CNN-based SR models are impractical for real-world applications due to their numerous parameters and heavy computation. To tackle this issue, we propose a lightweight attended multi-scale residual network (LAMRN) in this work. Specifically, we present an attended multi-scale residual block (AMSRB) to extract multi-scale features, in which we embed the efficient channel attention (ECA) block to enhance the discrimination of features. Besides, we introduce a double-attention fusion (DAF) block to fuse the low-level and high-level features efficiently. We use spatial attention and channel attention to obtain guidance from the low-level and high-level features, which is then used to guide the feature fusion. Extensive experimental results demonstrate that our LAMRN achieves competitive performance against state-of-the-art methods with similar parameters and computational operations.


I. INTRODUCTION
Single image super-resolution (SISR) is a classical computer vision task that aims to reconstruct a high-resolution (HR) image from its low-resolution (LR) counterpart. SISR is an ill-posed problem since multiple HR images may degrade to an identical LR image. It has been studied for decades and is widely applied in medical imaging [1], remote sensing [2], surveillance, and other scenarios that require high-frequency details. SR methods are usually divided into three categories: interpolation-based, reconstruction-based, and learning-based.
Recently, numerous CNN-based SR methods [3]-[8] have achieved remarkably better performance than traditional methods [9], [10]. EDSR [11], RDN [12], and RCAN [13] obtain higher reconstruction accuracy by stacking ResBlocks [14] or Dense Blocks [15], or by embedding attention mechanisms [16]. However, these methods generally achieve their excellent performance by expanding the width or increasing the depth of the network, which limits their application in real-world scenarios. Consequently, designing a lightweight network that maintains comparable performance for SISR becomes a rather tough but promising task.
The associate editor coordinating the review of this manuscript and approving it for publication was Naveed Akhtar.
CARN [17] constructed a lightweight network by stacking cascading residual blocks, achieving comparable results with fewer parameters and computational operations. Hui et al. [18] proposed a lightweight network with an information distillation mechanism (IDN) to better recover high-quality SR images. Based on IDN, IMDN [19] applied channel separation and an advanced channel-wise attention mechanism to improve the network representation. Most of these methods rarely utilize multi-scale structures to extract multi-scale features because such structures bring numerous parameters and computational operations. However, multi-scale feature representations could boost performance by providing more information, as shown in various vision tasks [20]-[23].
On the other hand, the deep feature extraction module in SR models generates features of different levels. The low-level features are rich in spatial details, while the high-level features contain plentiful contextual information. However, previous SR works [11], [13], [17], [19] generally used only the output features, or simply concatenated these features, as the input of the reconstruction module. Most of them did not fully utilize the differences between these features to guide the fusion.
To extract multi-scale features and efficiently fuse the different features produced during deep feature extraction, we propose a lightweight attended multi-scale residual network (LAMRN) for SISR in this work, as shown in Fig.1. Inspired by Res2Net [24], we utilize an attended multi-scale residual block (AMSRB) to extract multi-scale features. Specifically, the AMSRB constructs hierarchical residual-like connections within a single residual block to represent multi-scale features. To improve the discrimination of the features at each scale, we embed the efficient channel attention block [25], which has negligible parameters and computation. Besides, we propose a double-attention fusion (DAF) block to fuse the low-level and high-level features efficiently. Similar to [26], our DAF utilizes spatial attention and channel attention to obtain guidance from the low-level and high-level features, respectively. Then the guidance is used to fuse the low-level and high-level features efficiently. Experiments on benchmark datasets demonstrate the effectiveness of our model with similar parameters and computation.
Overall, the main contributions of our method are three-fold:
• We propose a lightweight attended multi-scale residual network (LAMRN) for single image super-resolution, which utilizes the attended multi-scale residual block to extract multi-scale information effectively.
• We propose a double-attention fusion (DAF) block to efficiently fuse the low-level and high-level features, further boosting the performance.
• The experimental results on benchmark datasets show that the proposed method achieves competitive performance against state-of-the-art lightweight SR methods.
The rest of the paper is organized as follows: Section II introduces the related work. Section III elaborates on the structure of the proposed method. Section IV shows our experimental results on benchmark datasets. Finally, Section V draws the conclusions.

II. RELATED WORK
A. CNN-BASED SINGLE IMAGE SUPER-RESOLUTION
As the pioneering work, SRCNN [3] utilized a three-layer convolutional neural network to resolve the interpolated LR images, outperforming the traditional SR methods [9], [10] in both reconstruction accuracy and speed. VDSR [4] increased the network depth to improve performance, which demonstrated that network depth plays an important role in network performance. FSRCNN [27] extracted features from the original LR image and upscaled the features to the desired size with a deconvolutional layer at the tail of the network, significantly reducing the parameters. Subsequently, some efficient upscaling modules [28], [29] were proposed and widely used. EDSR [11] further increased the network depth and won first place in the NTIRE2017 Super-Resolution Challenge [30]. RDN [31] utilized the residual dense block (RDB) to fully reuse the feature maps, achieving comparable results to EDSR with nearly half the parameters. RCAN [13] used channel attention to rescale the feature maps, achieving state-of-the-art performance.
Recently, one research direction in SR has been the exploration of lightweight but competitive architectures, which are more suitable for resource-constrained devices. Ahn et al. [17] proposed a fast, accurate, and lightweight structure (CARN) with cascading residual blocks. IDN [18] utilized the information distillation mechanism to design a lightweight model. IMDN [19] applied channel separation and advanced channel-wise attention to further improve the network representation. AWSRN [32] made full use of the previous features for reconstruction with an adaptive weighted multi-scale module. Li et al. [33] proposed a flexibly adjustable super-lightweight SR network to improve performance with limited parameters and operations.

B. MULTI-SCALE REPRESENTATIONS
Multi-scale feature representations are of great importance to various vision tasks, such as semantic segmentation [20], object detection [21], and so on. MS-RHDN [23] utilized the atrous-spatial-pyramid-pooling (ASPP) structure to learn the multi-scale features. MSRN [22] introduced a multi-scale residual network to extract multi-scale features, achieving competitive performance. Res2Net [24] constructed hierarchical residual-like connections within one single residual block, exploring the multi-scale representation efficiently.

C. ATTENTION MECHANISM
The attention mechanism has proven to be an effective means of enhancing the performance of CNNs. RCAN [13] incorporated the channel attention mechanism into the residual block to further improve performance. CBAM [34] utilized both channel-wise and spatial-wise attention to obtain consistent improvements in performance. ECA-Net [25] proposed a local cross-channel interaction strategy without dimensionality reduction to learn channel attention, achieving performance competitive with SENet [16] but with fewer parameters.

III. PROPOSED METHOD
This section presents the details of our LAMRN. The overall architecture is first introduced, followed by the detailed descriptions of our proposed attended multi-scale residual block (AMSRB) and double-attention fusion block (DAF).

A. OVERALL ARCHITECTURE
As shown in Fig.1, our model LAMRN mainly consists of four components: the shallow feature extraction, deep feature extraction, feature fusion, and reconstruction modules. Let us denote the input and output images of our LAMRN as I_LR and I_SR, respectively.
Firstly, we use the shallow feature extraction module to extract shallow features F_sf from the original LR input:

F_sf = H_SF(I_LR),

where H_SF denotes the shallow feature extraction module, which contains a 3 × 3 convolutional layer. Then, F_sf is passed to the deep feature extraction module, which is a stack of M attended multi-scale modules (AMSM), so we can further obtain multi-scale features of different levels F_m (m ∈ [1, M]):

F_m = H_AMSM(F_{m−1}), with F_0 = F_sf,

where H_AMSM denotes the attended multi-scale module. These features are then passed into the feature fusion module. The low-level features are rich in spatial information, while the high-level features are full of contextual information. Considering the difference between features of different levels, we use D double-attention fusion (DAF) blocks to fuse these features efficiently in the feature fusion module. First, the low-level features F_1 and the high-level features F_M are fed into the first DAF block. Then, the output features F_d of the previous DAF block and the features F_m are passed into the next DAF block:

F_{d+1} = H_DAF(F_d, F_m),

where H_DAF denotes the DAF block.
Next, we add the output features F_D of the final DAF block to the shallow features F_sf for feature reuse. Finally, we use the reconstruction module to upscale the feature maps to the desired size and reconstruct the SR image I_SR:

I_SR = H_Rec(F_D + F_sf) = H_conv(H_up(F_D + F_sf)),

where the reconstruction module H_Rec consists of an upsampling module H_up and a convolutional layer H_conv. Following [13], [17], [19], our upsampling module contains a convolutional layer to increase the number of channels of the feature maps and a sub-pixel layer to rearrange the feature maps to the desired size. However, we use a 1 × 1 convolutional layer instead of a 3 × 3 convolutional layer to reduce the model parameters and computation.
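The upsampling-plus-convolution tail described above can be sketched in PyTorch as follows; the class name `Reconstruction` and the default channel/scale settings are our own illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

class Reconstruction(nn.Module):
    """Sketch of the reconstruction tail: a 1x1 conv expands the channels to
    C * r^2, a sub-pixel (PixelShuffle) layer rearranges them onto the HR
    grid, and a final 3x3 conv maps to 3 RGB channels."""
    def __init__(self, channels=64, scale=4):
        super().__init__()
        self.up = nn.Sequential(
            # 1x1 instead of 3x3 to save parameters and computation
            nn.Conv2d(channels, channels * scale ** 2, kernel_size=1),
            nn.PixelShuffle(scale),
        )
        self.conv = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.conv(self.up(x))

rec = Reconstruction(channels=64, scale=4)
sr = rec(torch.randn(1, 64, 48, 48))   # a 48x48 LR feature map -> 192x192 SR image
```

Replacing the usual 3 × 3 pre-shuffle convolution with a 1 × 1 one cuts that layer's parameters by a factor of nine, which is the design choice the paper makes here.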
Recently, several loss functions have been investigated to train CNN-based SR models, such as the L1, L2, perceptual, and adversarial losses. Following the previous works [11], [12], [19], [32], we optimize our model with the L1 loss function. Given a training set {I_LR^i, I_HR^i} (i = 1, ..., N), which contains N LR input images and their corresponding HR counterparts, the loss function is defined as:

L(θ) = (1/N) Σ_{i=1}^{N} ||H_LAMRN(I_LR^i) − I_HR^i||_1,

where θ denotes the parameter set of our model.

B. ATTENDED MULTI-SCALE MODULE
As shown in Fig.2, our attended multi-scale module (AMSM) mainly consists of B attended multi-scale residual blocks (AMSRB) and a 1 × 1 convolutional layer. As is well known, multi-scale representations in CNNs [6], [22], [23] can provide more information and boost performance. However, most multi-scale structures are constructed with several convolutional layers of different kernel sizes, which brings numerous parameters and heavy computation on account of the large kernels.
Inspired by Res2Net [24], we propose an attended multi-scale residual block (AMSRB) to extract multi-scale features, which avoids placing a heavy burden on model parameters and computation. As shown in Fig.2, the AMSRB contains several convolutional blocks (CB), two 1 × 1 convolutional layers, and a skip connection.
Firstly, the input feature maps of the b-th AMSRB are fed into a 1 × 1 convolutional layer to increase the number of channels. Then the features are split into s feature map subsets along the channel axis, denoted by X_i (i ∈ {1, 2, ..., s}). Each feature subset X_i has the same spatial size as the input feature maps. Except for X_1, each X_i is passed to a convolutional block H_CB for feature extraction, which contains a convolutional layer, a ReLU, and an efficient channel attention (ECA) block. We denote the output features of the i-th CB as Y_i. The feature subset X_i is added to Y_{i−1} and then passed to the next H_CB for extracting features with different receptive fields. Following [24], we omit the convolutional block for X_1 to reduce parameters and computational operations, which can also be regarded as feature reuse. Y_i can be written as:

Y_i = X_i, i = 1;
Y_i = H_CB(X_i), i = 2;
Y_i = H_CB(X_i + Y_{i−1}), 2 < i ≤ s,

where H_CB(·) = H_ECA(δ(H_c(·))), δ is the ReLU function, H_c denotes the 3 × 3 convolutional layer, and H_ECA is the efficient channel attention block.
To enhance the discrimination of the features of different scales with negligible parameters and computation, we embed the ECA block [25] in each CB. As shown in Fig.3, the input features F_in ∈ R^{H×W×C} are first passed through global average pooling to generate channel-wise statistics z ∈ R^{1×1×C}. Then, instead of using two fully-connected (FC) layers to capture the cross-channel interaction, one 1D convolutional layer with kernel size k is applied to z. In other words, the weight of z_i is calculated by considering only the interaction between itself and its k neighbors. Finally, the output features F_out are obtained by rescaling the input features F_in with the normalized weights:

F_out = σ(H_C1D(H_GAP(F_in))) ⊗ F_in,

where H_C1D denotes the 1D convolutional layer, H_GAP represents global average pooling, σ is the sigmoid function, and ⊗ means element-wise multiplication.
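A minimal PyTorch sketch of such an ECA block follows; the kernel size k = 3 is an assumed setting, as the paper does not fix k here:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Sketch of the efficient channel attention block [25]: global average
    pooling produces per-channel statistics z, a 1D convolution of kernel
    size k models local cross-channel interaction (no dimensionality
    reduction), and a sigmoid yields the rescaling weights."""
    def __init__(self, k=3):
        super().__init__()
        self.conv1d = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))                 # H_GAP: (B, C) channel statistics
        w = self.conv1d(z.unsqueeze(1))        # 1D conv over the channel axis
        w = torch.sigmoid(w).view(b, c, 1, 1)  # normalized weights
        return x * w                           # element-wise rescaling

eca = ECA(k=3)
out = eca(torch.randn(2, 64, 16, 16))
```

Unlike an SE block, the single 1D convolution adds only k weights in total, which is why the paper calls its parameter cost negligible.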
After extracting the multi-scale features Y_i, we concatenate all of them and feed them into a 1 × 1 convolutional layer for channel number reduction, obtaining the fused multi-scale features F_b^AMSRB from the b-th AMSRB:

F_b^AMSRB = H_conv(Concat(Y_1, Y_2, ..., Y_s)).

To fully use the intermediate features, we concatenate all the output features from the B AMSRBs at the end of the m-th AMSM and then compress the channel number with the 1 × 1 convolutional layer, as shown below:

F_m = H_conv(Concat(F_1^AMSRB, F_2^AMSRB, ..., F_B^AMSRB)),

where Concat denotes the concatenation operation and H_conv is the 1 × 1 convolutional layer.
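Putting the pieces together, the AMSRB computation can be sketched as below. This is an illustrative reading of the block, with a simplified SE-style channel attention standing in for the ECA block; all names (`ConvBlock`, `AMSRB`) and the setting s = 4 are our own:

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Stand-in for H_CB: 3x3 conv + ReLU + a simplified channel attention
    (the paper uses an ECA block here)."""
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.Sigmoid())

    def forward(self, x):
        y = torch.relu(self.conv(x))
        return y * self.att(y)

class AMSRB(nn.Module):
    """Sketch of the attended multi-scale residual block: a 1x1 conv widens
    the channels, the result is split into s subsets X_i, Y_1 = X_1 is reused
    as-is, each later subset is added to the previous output and passed
    through a conv block, and the concatenated Y_i are fused by a 1x1 conv
    before the residual skip connection."""
    def __init__(self, c=64, s=4):
        super().__init__()
        self.s = s
        self.expand = nn.Conv2d(c, c * s, 1)
        self.cbs = nn.ModuleList(ConvBlock(c) for _ in range(s - 1))
        self.fuse = nn.Conv2d(c * s, c, 1)

    def forward(self, x):
        xs = torch.chunk(self.expand(x), self.s, dim=1)
        ys = [xs[0]]                              # Y_1 = X_1 (feature reuse)
        for i in range(1, self.s):
            inp = xs[i] if i == 1 else xs[i] + ys[-1]
            ys.append(self.cbs[i - 1](inp))       # growing receptive field
        return x + self.fuse(torch.cat(ys, dim=1))

out = AMSRB(c=64, s=4)(torch.randn(1, 64, 8, 8))
```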

C. DOUBLE-ATTENTION FUSION BLOCK
Generally speaking, the spatial expression ability of the network decreases while the semantic expression ability increases as the network gets deeper. Previous works only used the output features of the deep feature extraction [11], [13], [35], or simply concatenated features at different levels and passed them to a 1 × 1 convolutional layer [19], [22]. Inspired by [26], we use a double-attention fusion (DAF) block to fuse the features from different levels efficiently. As shown in Fig.4, our DAF is composed of two attention blocks: a channel attention (CA) block and a spatial attention (SA) block. Taking the first DAF as an example, the low-level features F_1 are passed through the SA block to obtain the spatial attention Z_SA ∈ R^{H×W×1}. Specifically, we first use two FC layers to generate the spatial statistics along the channel axis and then use the sigmoid function to normalize the statistics. Meanwhile, the high-level features F_M are fed into the CA block to produce the channel attention Z_CA ∈ R^{1×1×C}. Firstly, a global average pooling is used to generate channel statistics. Then, two FC layers are adopted to capture the cross-channel interaction, followed by a sigmoid function for normalization.
The low-level features contain more spatial information, while the high-level features involve rich semantic context. We multiply Z_SA with the high-level features, refining the boundaries of the high-level features. Z_CA is multiplied by the low-level features, providing contextual information for the low-level features. Finally, the rescaled low-level and high-level features are fused by element-wise addition. Hence, the fused features attend to both spatial details and semantic context, further improving the network performance.
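A hedged PyTorch sketch of this double-attention fusion, with the per-position FC layers realized as 1 × 1 convolutions and the reduction ratio chosen arbitrarily:

```python
import torch
import torch.nn as nn

class DAF(nn.Module):
    """Sketch of the double-attention fusion block: spatial attention derived
    from the low-level features refines the high-level features, channel
    attention derived from the high-level features rescales the low-level
    features, and the two rescaled maps are fused by element-wise addition."""
    def __init__(self, c=64, reduction=4):
        super().__init__()
        # SA: two FC layers along the channel axis -> an H x W x 1 map
        self.sa = nn.Sequential(
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, 1, 1), nn.Sigmoid(),
        )
        # CA: global average pooling + two FC layers -> 1 x 1 x C weights
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )

    def forward(self, low, high):
        z_sa = self.sa(low)    # spatial guidance from the low-level features
        z_ca = self.ca(high)   # channel guidance from the high-level features
        return high * z_sa + low * z_ca

daf = DAF(c=64)
fused = daf(torch.randn(1, 64, 24, 24), torch.randn(1, 64, 24, 24))
```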

IV. EXPERIMENTS
In this section, we will first introduce the datasets and evaluation metrics. Then, we illustrate the detail of the experimental settings. Finally, we present the results and ablation study to demonstrate the effectiveness of our proposed method.

A. DATASETS AND EVALUATION METRICS
The DIV2K dataset was published in the NTIRE 2017 Challenge on single image super-resolution [30] and contains 800 training, 100 validation, and 100 testing 2K-resolution images. Following [13], [17], [18], [32], we use the training set of DIV2K to train our models. The low-resolution images are obtained by bicubic downsampling of the high-resolution images. The commonly used evaluation metrics PSNR and SSIM are utilized for quantitative comparisons with other SR methods. Visual results are also provided for more intuitive comparisons with other methods.
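For reference, PSNR on the luma (Y) channel can be computed as follows; the BT.601 conversion coefficients are standard, while the toy inputs are ours:

```python
import numpy as np

def rgb_to_y(img):
    """BT.601 luma (Y of YCbCr) for an 8-bit RGB image, the channel on
    which PSNR/SSIM are conventionally computed in SR papers."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.738 * r + 129.057 * g + 25.064 * b) / 256.0

def psnr(x, y, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8), dtype=np.float64)
b = a + 1.0                       # every pixel off by one grey level -> MSE = 1
print(round(psnr(a, b), 2))       # 48.13
```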

B. IMPLEMENTATION SETTINGS
We randomly crop the LR images into patches of size 48 × 48 as input for each mini-batch. The corresponding patch size of the HR images is 48r × 48r, where r is the scale factor. The batch size is set to 16. Our models are trained with the ADAM optimizer [36], with β1 = 0.9, β2 = 0.999, and ε = 10^−8. The learning rate is initialized to 2 × 10^−4 and halved every 2 × 10^5 iterations of back-propagation. We first train a model for ×2 and then use it as the pre-trained model for the other scale factors. We implement our models in the PyTorch [37] framework with an NVIDIA TITAN Xp GPU.
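The optimizer and learning-rate schedule above can be reproduced roughly as follows; a tiny stand-in convolution replaces the actual LAMRN model:

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)   # stand-in for LAMRN
opt = torch.optim.Adam(model.parameters(), lr=2e-4,
                       betas=(0.9, 0.999), eps=1e-8)
# halve the learning rate every 2 x 10^5 iterations
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200_000, gamma=0.5)

# one L1-loss training step on a random 48x48 mini-batch of size 16
lr_img = torch.randn(16, 3, 48, 48)
hr_img = torch.randn(16, 3, 48, 48)           # real HR patches would be 48r x 48r
sr_img = model(lr_img)                        # the real model would also upscale
loss = torch.nn.functional.l1_loss(sr_img, hr_img)
opt.zero_grad()
loss.backward()
opt.step()
sched.step()
```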

C. MODEL ANALYSIS
The channel number of the intermediate feature maps is an important factor in the model parameters and computational operations. Following [17], [32], we use Multi-Adds to represent the number of computational operations, calculated on an SR image of size 1280 × 720. To obtain models of different complexity, we train them with different channel numbers: 16, 32, and 64, denoted as LAMRN-16, LAMRN-32, and LAMRN, respectively. For scale ×4, the total parameters range from 89K to 1407K, and the Multi-Adds range from 5.6G to 85G. Fig.5 depicts the trade-off between parameters/Multi-Adds and performance for scale ×4. Our LAMRN obtains higher PSNR on Set5 with fewer parameters and less computation than AWSRN [32]. Compared with SRCNN [3] and FSRCNN [27], the smallest model (LAMRN-16) achieves better performance with similar size and computation. The middle model (LAMRN-32) obtains a better trade-off than SRMDNF [31] and IDN [18]. These results show that our models achieve a better trade-off between performance and model size.
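The Multi-Adds convention can be illustrated with simple arithmetic; the layer size below is a hypothetical example, not a layer of LAMRN:

```python
def conv_multi_adds(k, c_in, c_out, h, w):
    """Multiply-accumulate count of one k x k convolution evaluated on an
    h x w output feature map. Per the convention of [17], [32], operations
    are counted assuming a 1280 x 720 SR output image (layers before the
    upsampler run on the proportionally smaller LR grid)."""
    return k * k * c_in * c_out * h * w

# e.g. a single 3x3, 64->64 conv whose output map is the full 1280 x 720 grid:
ops = conv_multi_adds(3, 64, 64, 1280, 720)
print(ops / 1e9)   # roughly 34 G Multi-Adds for this one layer
```

This illustrates why lightweight models keep most convolutions at LR resolution and why channel width dominates the Multi-Adds budget.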
We also vary the number B of AMSRBs in each AMSM to change the depth of LAMRN. The performance of the different models and the corresponding parameters and Multi-Adds for scale ×4 are shown in Table 1. When B increases by 1, the parameters and Multi-Adds increase by 444K and 25G, respectively. The performance improves by about 0.1 dB when B is less than 4, but the increment becomes smaller for the models with B = 5 and B = 6. Note that the model with B = 6 achieves results comparable to EDSR [11] (32.46 dB/43M) with only 2.7M parameters, demonstrating that our LAMRN has strong learning ability with considerably fewer parameters and computational operations. To compare with the state-of-the-art lightweight SR models, we use B = 3 as the final setting in this work.

D. COMPARISON WITH STATE-OF-THE-ART METHODS
We compare the proposed LAMRN with state-of-the-art lightweight SR methods for the ×2, ×3, ×4, and ×8 scales, including SRCNN [3], [38], FSRCNN [27], VDSR [4], DRCN [6], LapSRN [5], EDSR-baseline [11], SRMDNF [31], IDN [18], CARN [17], AWSRN [32], and MSRN [22]. Following [3], [17], [22], [38], [39], PSNR and SSIM are calculated on the Y channel of the YCbCr color space. Table 2 shows the quantitative results on five benchmark datasets for the different SR algorithms. We also present the parameters and Multi-Adds of these models for more intuitive contrast. As shown in Table 2, our LAMRN-16 outperforms some CNN-based SR methods with few parameters and computational operations. And LAMRN-32, which has fewer parameters and operations than AWSRN-S [32] and IDN [18], achieves better results on ×2, ×3, and ×4. Compared with the state-of-the-art SR methods, our LAMRN achieves competitive performance with fewer parameters and Multi-Adds. As the magnification becomes larger, the degraded LR images lose more high-frequency information, making it hard to recover high-quality SR images. Our LAMRN achieves better results with limited model size and operations owing to the effectiveness of the proposed attended multi-scale residual block and double-attention fusion block. Fig.6 and Fig.7 present the visual comparisons for each scale. We select some detail patches from different SR images for better comparison. For image 'img_004', our LAMRN generates more correct textures, while other methods generate distorted textures. For images 'DollGUN' and 'BokuHaSitatakaKun', our LAMRN-16 achieves performance comparable to IDN [18], and LAMRN produces clearer textures than the other lightweight SR methods. Our LAMRN generates clearer textures than the other models for image 'PlatinumJungle'. And for images 'img_093' and 'zebra', the proposed method produces more correct textures than the others, which look closer to the HR patches.
These results demonstrate the effectiveness of our LAMRN.

E. ABLATION STUDY
We conduct several ablation studies to explore the effectiveness of our proposed method.
To evaluate the effectiveness of the efficient channel attention mechanism in AMSRB, we conduct three ablation studies. Specifically, we remove the ECA block in each CB of AMSRB (no ECA), use only one ECA block after fusing the multi-scale features from the different CBs (1ECA), and substitute it with the traditional channel attention mechanism (CA), respectively. As shown in Table 3 and the left figure of Fig.8, embedding an ECA block in AMSRB improves the model without ECA by 0.13 dB on Urban100 while keeping the same model size and operations. Embedding ECA in each CB of AMSRB brings a further performance improvement, indicating that the efficient channel attention is better than the traditional channel attention with fewer parameters in our model.
We also explore the effectiveness of the proposed double-attention fusion block. We use the final feature maps to reconstruct the SR image (w/o Fus), concatenate the features from all AMSMs and pass them to a 1 × 1 convolutional layer for channel compression (1 × 1 Fus), remove the channel attention (w/o CA), and remove the spatial attention (w/o SA), respectively. As shown in Table 4 and the right figure of Fig.8, our model outperforms the models w/o Fus, 1 × 1 Fus, w/o CA, and w/o SA by 0.31 dB, 0.20 dB, 0.07 dB, and 0.17 dB on Urban100, respectively. Note that our model has fewer parameters and operations than the model with 1 × 1 fusion, demonstrating the effectiveness of our DAF block.
To show more intuitively that our DAF attends to both spatial and contextual information, we present the feature visualization of the input and output feature maps of the first DAF block. As depicted in Fig.9, the low-level input feature maps contain more spatial information, such as edge information, while the high-level input feature maps emphasize contextual information. The output features contain both spatial information and semantic context, demonstrating that our DAF block fuses the features from different levels efficiently.

V. CONCLUSION
We propose a lightweight attended multi-scale residual network for single image super-resolution in this paper. Specifically, we use an attended multi-scale residual block as the basic block to extract multi-scale features, in which we further embed the efficient channel attention block to improve the network representations. In this way, discriminative multi-scale features are extracted efficiently at a granular level. Moreover, we utilize the double-attention fusion module to fuse the features at different levels from the deep feature extraction module efficiently. Extensive experimental results demonstrate the effectiveness of our model with similar parameters and computation.