Lightweight Single Image Super-Resolution With Multi-Scale Spatial Attention Networks

Convolutional neural networks (CNNs) generally provide higher performance gains for single image super-resolution (SISR) as their depth and number of parameters increase. However, simply adding layers to a straightforward deep network requires an impractically large number of parameters to obtain state-of-the-art performance. Instead, some researchers have proposed lightweight networks, which are designed with more sophisticated structures to achieve better performance than straightforward networks at the same parameter budget. In this paper, we propose new lightweight Multi-scale Spatial Attention Networks (MSAN) for SISR, which attempt to obtain better performance from a relatively small number of parameters. Specifically, we adopt a dense connection with feature fusion layers to broadcast abundant features to every level of layers, and propose a double residual structure that provides an additional skip-connection. We also design a Multi-scale Spatial Attention Block (MSAB) to exploit multi-scale spatial contextual information. Furthermore, we introduce a spatial attention module which adaptively focuses on the most informative feature scale in a given region of the image. In the experiments, we validate that the proposed MSAN achieves significant accuracy gains over recent lightweight models and performance comparable to the state-of-the-art methods.


I. INTRODUCTION
Recently, deep convolutional neural networks (CNNs) have shown great success in most computer vision tasks, including single-image super-resolution. Since Dong et al. [1] first proposed a light three-layer CNN for SISR, many researchers have proposed better and deeper CNNs for SISR [1]-[16]. Earlier works mostly focused on training deeper models with a straightforward structure (a stack of single-size small-kernel convolution layers) or a large number of filters per layer to achieve state-of-the-art performance, without considering the undue increase of parameters. For instance, one of the state-of-the-art networks, EDSR [7], has about 43 M parameters with 256 filters per layer. Also, RCAN [9] has more than 400 layers, which is enormously large compared to the three-layer SRCNN that needs just 57 K parameters.
The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.
Hence, rather than just deepening the network, some researchers have tried to find more sophisticated structures and methods to achieve better performance with fewer parameters. Specifically, there are relatively lightweight networks which attempt to achieve high performance while requiring fewer than 2 M parameters. For example, IDN [19] was proposed as a fast and accurate SISR network with information distillation. DSRN [17] adopted a dual-state recurrent network and achieved competitive performance. CARN [18] was proposed as a fast, accurate, and lightweight CNN for SISR with some sophisticated techniques.
We note that conventional networks for SISR are based on stacked 3 × 3 convolution layers, which seem to lack flexibility in dealing with the diverse scale of spatial contexts. Precisely, they cannot capture both local and global features at the same time, and rarely learn multi-scale nonlocal self-similarity due to the limited receptive fields. Hence, the features from CNNs are often redundant to deal with image structure or not very useful to exploit multi-scale structure. Moreover, as evidenced in [20], internal structure and multi-scale recurrence are sometimes more important than the evidence from external training patches. To fully utilize such properties, it is required to capture both local and global contextual information simultaneously.
Regarding the above issues, we propose Multi-scale Spatial Attention Networks (MSAN) that can adaptively give attention to the most appropriate scale of features in a specific region of the image. Precisely, we design a Multi-scale Spatial Attention Block (MSAB) as a basic building block for the MSAN. The MSAB includes a multi-scale structure with the spatial attention mechanism and local residual learning (LRL). Also, we adopt a dense connection with feature fusion (FF) to push useful information to the overall layers. In addition, instead of using only one global residual learning (GRL), we propose a double residual learning (DRL) scheme to enable the network to learn more sparse features. We also introduce a variant model with the recursive scheme, namely MSAN-X (eXtremely lightweight model). Experimental results show that the proposed architecture yields superior performance while keeping the number of parameters comparable to that of existing lightweight models such as [17]-[19].
In summary, we propose a new lightweight CNN, where we attempt to reduce the number of parameters as much as possible while achieving a performance gain over the other lightweight models. In developing the structure, our main contributions are as follows:
• We propose a new network architecture (MSAN) that extracts dynamic multi-scale features for feature enrichment.
• This is achieved by developing a new basic building block, namely MSAB, which adopts a multi-scale structure with dilated convolution that enables the use of local residual learning and spatial attention for dynamic feature selection.
• We also propose an extremely lightweight model, namely MSAN-X, with the recursive scheme.
• Extensive experiments show that both MSAN and MSAN-X achieve state-of-the-art performance compared to the other lightweight SISR methods.
Finally, it is worth noting that adding layers to lightweight models such as [17]-[19] and ours does not work well; specifically, they do not converge well when they are deepened. Unlike the classical models, which generally achieve better performance with more layers, the structures of lightweight models are so complicated that they are not trained well when deepened. Rather, lightweight models have the virtue of requiring a practically small number of parameters while achieving performance comparable to the classical deeper networks.

II. RELATED WORK
A. CNN-BASED SINGLE IMAGE SUPER-RESOLUTION
As a pioneering work, Dong et al. introduced SRCNN [1], a light three-layer network. VDSR [2] adopted a very deep structure of 20 convolution layers with residual learning and demonstrated that ''the deeper, the better'' holds in SISR. ESPCN [21] first introduced an efficient sub-pixel convolution layer for the upsampling to save computational cost. DRCN [22] and DRRN [6] adopted the recursive structure and achieved significant results compared to networks with similar depths, but with a smaller number of parameters. Also, MemNet [23] exploited recursive units for the general image restoration problem, and LapSRN [3] adopted Laplacian pyramids in a CNN. SRGAN [4] was proposed as a generative approach for SISR based on the generative adversarial net (GAN) [24], which focused on perceptual quality rather than the mean squared error. Recently, EDSR [7], RDN [8], and RCAN [9] have been proposed and achieved very high performance gains by using very deep architectures with a huge number of parameters, more than 10 M (more than 40 M for EDSR).

B. LIGHTWEIGHT CNNS FOR SISR
In contrast to the networks with a large number of parameters and great depth, some recent works focus on lightweight but competitive SISR architectures. For example, CARN [18] was proposed as a fast, accurate, and lightweight structure with fewer parameters. Also, IDN [19] achieved the same goal via an information distillation network. Dual CNN [25] achieved considerable performance gain with a reduced number of parameters by addressing the dual domains separately. DSRN [17] employed a dual-state recurrent structure for efficient SISR. Liu et al. [10] introduced a non-local module within a recurrent network, which exploits the non-local self-similarity of natural images. Our approach also focuses on the design of lightweight networks with competitive performance.

C. MULTI-SCALE SPATIAL FEATURES
Szegedy et al. proposed the Inception module [26], which consists of kernels of multiple sizes at the same level of layer. Hence, multi-scale contexts in images can be utilized at a layer, and features of various sizes can be extracted at the deeper layers. As a result, the Inception network has shown outstanding performance in the image classification task compared to the VGG-style networks, with far fewer parameters. They also improved the Inception module and proposed ''Inception-v4'' [27], which further improves the accuracy. For SISR, Shi et al. [28] proposed a modification of the Inception module as a dilated convolution layer. Recently, MSRN [29] adopted the multi-scale structure for SISR and has shown competitive performance. In this paper, we also focus on multi-scale spatial features to increase and enrich the quality of features.

D. ATTENTION MECHANISM
It is well-known that human visual system captures information not from the whole scene but the salient local parts for efficiency, which is the attention mechanism. Since it plays an essential role in image perception, many researchers developed neural network modules that can incorporate the attention mechanism [9], [16], [30]- [32]. For example, Wang et al. [30] proposed Residual Attention Network for image classification problem by controlling whole feature maps with attention module. Also, squeeze and excitation network with channel-wise attention was proposed in [31]. RCAN [9] adopted a channel-attention mechanism for SISR and achieved state-of-the-art performance. Recently, CBAM [32] was introduced, which considers both spatial and channel attention with a small number of parameters and thus boosts image classification and object detection performances. Very recently, SAN [16] was introduced, which considers second-order statistics for channel attention to further enhance the performance in SISR. Inspired by CBAM [32], we also consider spatial attention with some modification. The original spatial attention module from the CBAM examines ''where to focus on,'' which is not proper for a low-level vision problem. To be specific with SISR, it is important to focus on the correct scale of the structure to predict its high-frequency component. Hence, it is essential to focus on the most informative features that reflect the appropriate spatial scale, rather than just the saliencies. Thus, our multi-scale spatial attention module is designed to give attention to the features that have the most appropriate scale for the given region in the image.

III. PROPOSED METHODS
In this section, we present our MSAN and MSAN-X for SISR. Our MSAN is designed to extract and exploit enriched features with a reduced number of parameters. Moreover, MSAN-X is designed to further reduce the number of parameters by employing a recursive scheme.

A. SINGLE IMAGE SUPER-RESOLUTION
In this part, we define the SISR problem and present some notations for the rest of the paper. The SISR is to predict a plausible super-resolved image $I_{SR} \in \mathbb{R}^{rH \times rW \times C}$ from a single low-resolution image $I_{LR} \in \mathbb{R}^{H \times W \times C}$, where $r$ is the scaling factor. In this paper, we wish to minimize the difference between the predicted $I_{SR}$ and the true high-resolution image $I_{HR} \in \mathbb{R}^{rH \times rW \times C}$. For the CNN-based methods, the goal is to find a mapping function $f(I_{LR}; \theta)$ parameterized by $\theta$ whose output is $I_{SR}$. The parameters of the CNN are trained by using a paired training dataset $\{I_{LR}^{(i)}, I_{HR}^{(i)}\}_{i=1}^{N}$, where $N$ is the number of training image pairs.

B. NETWORK ARCHITECTURE
The overall architecture of MSAN is shown in FIGURE 1. The proposed network mostly consists of MSABs with some convolution layers and feature fusion (FF) layers. The MSAB will be explained in the later subsection, which is designed to extract dynamic multi-scale features via local residual learning and spatial attention. The figure also shows that we adopt the dense connection to push informative intermediate features to the higher layers. The global feature fusion (GFF) layer in our network fuses all output features from previous MSABs. Similar to recent SISR works [4], [7], [8], [18], the upsampling process is located at the latter part of the network. Hence most of the computation is done in the LR scale, which greatly reduces computational cost compared to the ones that require bicubic interpolation [1], [2].
The proposed MSAN can be divided into three functional parts: feature extractor, feature enhancer, and reconstructor. The first convolution layer works as the feature extractor, which extracts the low-level feature $F_0$ from the input LR image $I_{LR}$. Then, the central part, which consists of FF layers, MSABs, GFF, and the convolution layer for double residual learning (DRL), works as the feature enhancer. The $d$-th FF layer fuses $F_0$ and the outputs of the preceding MSABs, and the fused features are fed to the $d$-th MSAB:
$$H_{d-1} = C_{FF-d}([F_0, F_1, \cdots, F_{d-1}]),$$
$$F_d = C_{MSAB-d}(H_{d-1}),$$
where $[\cdot]$ denotes concatenation, and $C_{FF-d}(\cdot)$ and $C_{MSAB-d}(\cdot)$ denote the $d$-th FF layer and the $d$-th MSAB, respectively ($d = 1, 2, \cdots, D$). After the last MSAB, all output features from the previous MSABs and $F_0$ are concatenated, and then aggregated by GFF. Additionally, we adopt GRL and DRL to provide additional gradient paths. Between the two residual learnings, one convolution layer $C_E(\cdot)$ enhances the features from the previous GRL. Hence, we have
$$F_{GFF} = C_{GFF}([F_0, F_1, \cdots, F_D]) + F_0,$$
$$F_E = C_E(F_{GFF}) + F_0,$$
where $F_{GFF}$ is the features after GFF and GRL, and $F_E$ denotes the final enhanced features after the feature enhancer. Then, we feed $F_E$ to the reconstructor, which consists of an upsampler $U(\cdot)$ and the last convolution layer $C_R(\cdot)$, to upsample and reconstruct the final super-resolved image $I_{SR}$, i.e.,
$$I_{SR} = C_R(U(F_E)).$$
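The wiring of the three parts can be traced with a small plain-Python sketch. It is a dataflow illustration only, not the authors' implementation: the function name `msan_forward` and its arguments are hypothetical, every layer is passed in as a placeholder callable, and it assumes that both the GRL and the DRL skip connections add the low-level feature F 0.

```python
from typing import Callable, List, Sequence


def msan_forward(
    i_lr,
    extract: Callable,
    ff_layers: Sequence[Callable],
    msabs: Sequence[Callable],
    gff: Callable,
    enhance: Callable,
    upsample: Callable,
    reconstruct: Callable,
):
    """Dataflow sketch: dense connections, GFF, GRL and DRL (skips from F_0 assumed)."""
    f0 = extract(i_lr)                 # feature extractor
    feats: List = [f0]                 # F_0, F_1, ... kept for the dense connections
    for ff, msab in zip(ff_layers, msabs):
        h = ff(feats)                  # d-th FF layer fuses all features so far
        feats.append(msab(h))          # d-th MSAB output F_d
    f_gff = gff(feats) + f0            # global feature fusion + GRL
    f_e = enhance(f_gff) + f0          # enhancement convolution + DRL (assumption)
    return reconstruct(upsample(f_e))  # reconstructor


# Toy run with scalar "features" so that the wiring can be followed by hand.
mean = lambda fs: sum(fs) / len(fs)
identity = lambda x: x
D = 16  # number of MSABs given in the implementation details
out = msan_forward(1.0, identity, [mean] * D, [identity] * D, mean, identity, identity, identity)
print(out)  # 3.0
```

With identity blocks every intermediate feature equals the input, so the two skip connections each add the input once, which is why the toy output is three times the input.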

C. MULTI-SCALE SPATIAL ATTENTION BLOCK
The MSAB is the main building block of our proposed network, which is shown in FIGURE 2a. As shown in FIGURE 1, the $d$-th MSAB takes the mid-level features after the $d$-th FF layer, $H_{d-1} \in \mathbb{R}^{H \times W \times G}$, as the input, where $G$ denotes the number of features. At the first part of the MSAB, four parallel 1 × 1 convolution layers $C_{1 \times 1}(\cdot)$ followed by ReLU compress the input features for each branch, which is expressed as
$$h^{l}_{1,d-1} = \mathrm{ReLU}(C_{1 \times 1}(H_{d-1})),$$
where $l = 1, 2, 3, 4$ denotes each branch, which corresponds to the scale of the receptive field, and $h^{l}_{1,d-1}$ denotes the output feature of the $l$-th branch. Then, the features from each branch are fed to dilated convolution layers with different dilation rates, followed by the ReLU activation function. By doing so, we get
$$h^{l}_{2,d-1} = \mathrm{ReLU}(D^{l}_{3 \times 3}(h^{l}_{1,d-1})),$$
where $D^{l}_{3 \times 3}$ denotes a 3 × 3 dilated convolution layer with dilation rate $l$. As a result, the dilated convolution layers provide various sizes of receptive fields without increasing the number of parameters, and hence play a key role in our method.
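The statement that dilation enlarges the receptive field without adding parameters can be checked with simple arithmetic. The sketch below is only an illustration: the helper names `effective_kernel_size` and `conv_weight_count` are hypothetical, and the branch width of 16 channels is an arbitrary assumption, not a value given in the text.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent covered by a dilated kernel: k + (k - 1) * (dilation - 1)."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def conv_weight_count(c_in: int, c_out: int, kernel_size: int) -> int:
    """Number of weights (bias excluded); the dilation rate does not enter the count."""
    return c_in * c_out * kernel_size * kernel_size


# Branches l = 1..4 use a 3x3 kernel with dilation rate l.
extents = {l: effective_kernel_size(3, l) for l in (1, 2, 3, 4)}
weights = {l: conv_weight_count(16, 16, 3) for l in (1, 2, 3, 4)}  # 16 channels: assumed
print(extents)  # {1: 3, 2: 5, 3: 7, 4: 9}
print(weights)  # the same count for every branch
```

The four branches therefore cover 3 × 3, 5 × 5, 7 × 7, and 9 × 9 neighborhoods while each holds the same nine weights per input-output channel pair.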
As stated previously, we note that the important scale of context is spatially variant, i.e., a specific scale of the context in a region might be more informative than the other scales for reconstructing the details. To reflect this property, we adopt an attention module to focus on the scale of features which is the most informative at a given region. Specifically, the features $h^{l}_{2,d-1}$ are fed to the spatial attention module to generate the spatial attention map $M^{l} \in \mathbb{R}^{H \times W \times 1}$, so that the features are further rescaled by $M^{l}$ to get the ''attention-applied features'' $h^{l}_{3,d-1}$. In short, the final output of each branch is
$$h^{l}_{3,d-1} = M^{l} \otimes h^{l}_{2,d-1},$$
where $\otimes$ denotes element-wise multiplication with $M^{l}$ broadcast along the channel dimension. Finally, the features of all scales are concatenated and fused by the local feature fusion (LFF) layer $C_{LFF}(\cdot)$, which works as a bottleneck layer. In addition, local residual learning (LRL) is adopted, which has been shown to boost performance [4], [7], [33]. Formally, the overall output feature $F_d$ is obtained as
$$F_d = C_{LFF}([h^{1}_{3,d-1}, h^{2}_{3,d-1}, h^{3}_{3,d-1}, h^{4}_{3,d-1}]) + H_{d-1}.$$

D. SPATIAL ATTENTION MODULE
For dynamic feature selection regarding the scale significance, we adopt the spatial attention module as shown in FIGURE 2a, which is adopted from [32]. The spatial attention map is obtained as
$$M^{l} = \sigma(C_{7 \times 7}([\mathrm{MaxPool}(h^{l}_{2}), \mathrm{AvgPool}(h^{l}_{2})])),$$
where $C_{7 \times 7}(\cdot)$ denotes a convolution layer with a kernel size of 7, and $\sigma(\cdot)$ denotes the sigmoid function (the subscript $d-1$ is omitted for brevity). 'MaxPool' and 'AvgPool' stand for channel-wise max pooling and average pooling operations, respectively.
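The steps of the spatial attention module (channel-wise max and average pooling, a 7 × 7 convolution, a sigmoid, and the rescaling of the branch features) can be written out in plain Python on nested lists. This is a slow illustrative sketch under stated assumptions, not the released code: the function names are hypothetical, features are stored channel-first, and zero padding is assumed so that the map keeps the H × W size.

```python
import math


def spatial_attention_map(feat, kernel, bias=0.0):
    """feat: C x H x W nested lists; kernel: 2 x 7 x 7 weights (max map, avg map).

    Returns an H x W attention map with values in (0, 1).
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    # Channel-wise pooling: collapse the channel axis at every pixel.
    max_map = [[max(feat[k][y][x] for k in range(c)) for x in range(w)] for y in range(h)]
    avg_map = [[sum(feat[k][y][x] for k in range(c)) / c for x in range(w)] for y in range(h)]
    pooled = [max_map, avg_map]
    pad = len(kernel[0]) // 2
    att = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = bias
            for m in range(2):
                for dy in range(-pad, pad + 1):
                    for dx in range(-pad, pad + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:  # zero padding (assumed)
                            acc += kernel[m][dy + pad][dx + pad] * pooled[m][yy][xx]
            row.append(1.0 / (1.0 + math.exp(-acc)))  # sigmoid
        att.append(row)
    return att


def apply_attention(feat, att):
    """Rescale every channel of feat by the same H x W attention map."""
    return [[[v * a for v, a in zip(f_row, a_row)] for f_row, a_row in zip(ch, att)]
            for ch in feat]


feat = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]  # C = 2, H = W = 2
zero_kernel = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
att = spatial_attention_map(feat, zero_kernel)  # sigmoid(0) = 0.5 everywhere
scaled = apply_attention(feat, att)
print(att, scaled[0])
```

Because one map is shared by all channels of a branch, the module decides how strongly a branch (that is, a scale) contributes at each pixel rather than reweighting individual channels.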

E. IMPLEMENTATION DETAILS
The kernel size of the convolution layers is 3 × 3 unless otherwise stated. The FF layers are implemented with 1 × 1 convolution layer followed by ReLU. LFF and GFF are implemented by 1 × 1 convolution layer. For the upsampler U (·), we adopt one sub-pixel convolution layer [21] for scaling factor ×2 and ×3, while two sub-pixel convolution layers are adopted for ×4. We set the number of features G = 64, and the number of MSABs D = 16.
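The rearrangement step of a sub-pixel convolution layer can be illustrated in plain Python. The sketch below shows only the pixel rearrangement that follows the convolution; the channel ordering is the one commonly used in deep learning frameworks and is an assumption here, as are the hypothetical helper names and the choice of two ×2 stages for the ×4 upsampler.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r) x H x W tensor (nested lists) into C x (H*r) x (W*r)."""
    c_in, h, w = len(feat), len(feat[0]), len(feat[0][0])
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c_out = c_in // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h):
            for x in range(w):
                for i in range(r):
                    for j in range(r):
                        # Channel c*r*r + i*r + j supplies sub-pixel (i, j) of pixel (y, x).
                        out[c][y * r + i][x * r + j] = feat[c * r * r + i * r + j][y][x]
    return out


def upsampler_stages(scale):
    """Per-stage factors: one stage for x2 and x3, two stages for x4 (x2 each, assumed)."""
    return {2: [2], 3: [3], 4: [2, 2]}[scale]


lr_feat = [[[0]], [[1]], [[2]], [[3]]]  # 4 channels, 1 x 1 spatial size
print(pixel_shuffle(lr_feat, 2))        # [[[0, 1], [2, 3]]]
print(upsampler_stages(4))              # [2, 2]
```

Each stage first needs a convolution that expands the channel count by the square of its factor, so that the rearrangement can trade those channels for spatial resolution.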
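For the loss function, we use the mean absolute error (MAE), which is described as
$$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f(I_{LR}^{(i)}; \theta) - I_{HR}^{(i)} \right\|_{1},$$
where $I_{LR}^{(i)}$ and $I_{HR}^{(i)}$ denote the $i$-th of the $N$ training image pairs and $f(\cdot\,; \theta)$ is the network with parameters $\theta$.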
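A minimal plain-Python sketch of an MAE objective is given below for concreteness. The helper names are hypothetical, images are nested lists of numbers, and averaging over all pixels of an image before averaging over the batch is an assumption about the normalization.

```python
def flatten(x):
    """Yield every scalar of an arbitrarily nested list or tuple as a float."""
    if isinstance(x, (list, tuple)):
        for item in x:
            yield from flatten(item)
    else:
        yield float(x)


def mae(pred, target):
    """Mean absolute error between two images of identical shape."""
    p, t = list(flatten(pred)), list(flatten(target))
    if len(p) != len(t):
        raise ValueError("prediction and target must have the same number of elements")
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


def batch_mae(preds, targets):
    """Average of the per-image MAE over a batch of image pairs."""
    return sum(mae(p, t) for p, t in zip(preds, targets)) / len(preds)


sr = [[1, 2], [3, 4]]
hr = [[1, 1], [1, 1]]
print(mae(sr, hr))                    # 1.5
print(batch_mae([sr, hr], [hr, hr]))  # 0.75
```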

A. COMPARISONS TO THE RECENT SISR MODELS
a: COMPARISONS TO RCAN
RCAN [9] adopts the residual-in-residual structure and a channel-attention module. Our proposed model adopts skip-connections and an attention module similarly to RCAN. The skip-connection is currently a widely used scheme, but we note that the difference comes from where the connections are made. We adopt LRL and GRL, where we add one more skip-connection, i.e., DRL. Also, we combine the dense connection with the skip-connection to boost performance. For the attention module, RCAN focuses on channel attention with global average pooling, whereas our model focuses on spatial attention to select spatially important features.

b: COMPARISONS TO CARN
CARN [18] was proposed as a lightweight structure with cascading residual blocks. The key idea of lightweight structure design lies in the construction of group convolution layers. Since our main building block MSAB processes each branch in parallel, it can be considered a cascaded group convolution. Specifically, if we alter dilated convolutions to normal convolution layers, it is just the same as a group convolution with the group size of 4. The group convolution is known to reduce parameters and computations approximately proportional to the group size, which enables the lightweight design for both CARN and our method. However, different from CARN, we adopt dilated convolution and spatial attention module, and hence, our MSAB can be viewed as an extension of a cascade of group convolution layers that can have many informative features.
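The parameter saving of a group convolution can be verified with a short count. The sketch below is illustrative: `conv_params` is a hypothetical helper, biases are ignored, and the 64-to-64-channel 3 × 3 layer is chosen to match the feature width G = 64 rather than any specific layer of the network.

```python
def conv_params(c_in: int, c_out: int, kernel_size: int, groups: int = 1) -> int:
    """Weight count of a (grouped) convolution, bias excluded."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * kernel_size * kernel_size


dense = conv_params(64, 64, 3)              # ordinary 3x3 convolution
grouped = conv_params(64, 64, 3, groups=4)  # four parallel branches
print(dense, grouped, dense // grouped)     # 36864 9216 4
```

Splitting the channels into four branches thus cuts the weights of this layer by a factor of four, and the multiply-accumulate count shrinks by the same factor.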

V. EXPERIMENTAL RESULTS
We present the results of two models: MSAN and MSAN-X. The MSAN-X has recursive MSABs, in which one MSAB is shared D times. We train our networks with DIV2K [34], which is a high-quality dataset with 2K resolution, containing 800 images for training, 100 for validation, and 100 for testing. The ADAM optimizer [35] is used with an initial learning rate of 4 × 10^−4, which is halved every 100,000 iterations until it reaches 5 × 10^−5. The RGB input LR training patches are generated by using the MATLAB bicubic downsampling function, and their size is set to 48 × 48. We evaluate our models on four benchmarks: Set5 [36], Set14 [37], BSD100 [38], and Urban100 [39]. We assess our models by using PSNR and SSIM on the luminance channel of the YCbCr color model. Three scaling factors are evaluated: ×2, ×3, and ×4.
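The learning-rate schedule described above amounts to a step decay with a floor. A minimal sketch follows; the function name `learning_rate` and its parameter names are hypothetical, and the halving is assumed to take effect exactly at multiples of 100,000 iterations.

```python
def learning_rate(iteration: int,
                  base_lr: float = 4e-4,
                  halve_every: int = 100_000,
                  min_lr: float = 5e-5) -> float:
    """Step decay: halve the rate every `halve_every` iterations, never below `min_lr`."""
    return max(base_lr * 0.5 ** (iteration // halve_every), min_lr)


for it in (0, 99_999, 100_000, 200_000, 300_000, 1_000_000):
    print(it, learning_rate(it))
```

Under these settings the rate goes from 4 × 10^−4 to 2 × 10^−4 and 1 × 10^−4, reaches the floor of 5 × 10^−5 after 300,000 iterations, and stays there.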

B. COMPARISONS WITH THE STATE-OF-THE-ART METHODS
For further analysis, we compare our methods with several state-of-the-art methods that need a large number of parameters: EDSR [7], RDN [8], MSRN [29], and RCAN [9]. [...] requires 12.2 times more, and RCAN needs about 8.5 times more parameters than MSAN to achieve 0.315 dB PSNR gain. The MSRN requires just 3.5 times more parameters than MSAN, but the MSAN shows slightly better performance even with a much smaller model size. In summary, the overall comparison of lightweight and deeper models is graphically shown in FIGURE 6(a).

C. ANALYSIS OF COMPUTATIONAL COMPLEXITY
Since the number of operations is more closely related to the computational complexity than the number of parameters, we analyze the number of operations in terms of Mult-Adds in TABLE 3. We directly follow the setting and the method of CARN [18] to calculate the number of Mult-Adds. As in [18], we assume the size of the HR image to be 1280 × 720. The table shows that our MSAN has a number of parameters and a computational cost comparable to those of CARN [18]. Compared to other lightweight models such as DRRN and MemNet [6], [23], MSAN requires far fewer Mult-Adds while using more parameters. In conclusion, our MSAN has advantages in terms of both the number of parameters and the number of operations.
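To make the Mult-Adds measure concrete, the cost of a single convolution layer can be counted as below. This is a hedged sketch and not the counting script of [18]: the helper names are hypothetical, one multiply-accumulate is counted per weight per output position, biases and activations are ignored, and the 64-to-64-channel 3 × 3 layer is only an example.

```python
def conv_mult_adds(height: int, width: int, c_in: int, c_out: int,
                   kernel_size: int, groups: int = 1) -> int:
    """Multiply-accumulates of one convolution evaluated at every output position."""
    return height * width * kernel_size * kernel_size * (c_in // groups) * c_out


def lr_size(hr_height: int, hr_width: int, scale: int):
    """LR resolution for a given HR size and scaling factor (integer division)."""
    return hr_height // scale, hr_width // scale


h, w = lr_size(720, 1280, 2)           # (360, 640) for the x2 setting
ops = conv_mult_adds(h, w, 64, 64, 3)  # one 3x3, 64 -> 64 layer in LR space
print(h, w, ops)                       # 360 640 8493465600
h4, w4 = lr_size(720, 1280, 4)
print(conv_mult_adds(h4, w4, 64, 64, 3))  # a quarter of the x2 cost
```

The same layer evaluated at the HR resolution would cost four times as much for ×2, which is why placing the upsampler at the end of the network keeps the operation count low.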

D. VISUALIZED RESULTS
For qualitative comparisons, we visualize some of the super-resolved results in FIGURES 3, 4, and 5. As shown, MSAN predicts multi-scale high-frequency details better than other methods.

VI. CONCLUSION
In this paper, we have proposed a new lightweight CNN-based SISR method by designing new schemes that can extract multi-scale spatial features. To achieve a performance gain with a limited number of parameters, we focused on the quality of intermediate features. For this, we designed a building block with multi-scale receptive fields, which makes it possible to capture local to global contextual information and to reflect multi-scale self-similarity. Moreover, we adopted the spatial attention module to focus on the region-adaptive informative spatial features. It is shown that the dense connection, along with the skip connection, boosts the performance. In our extensive experiments, we have demonstrated that the proposed networks show competitive performance against other recent CNN-based methods. Our codes are publicly available at https://github.com/JWSoh/MSAN.