Multi-Resolution Space-Attended Residual Dense Network for Single Image Super-Resolution

With the help of deep convolutional neural networks, a large number of single image super-resolution (SISR) methods have been developed and have achieved promising performance. However, these methods suffer from over-smoothness in textured regions because a single-resolution network is used to reconstruct both the low-frequency and the high-frequency information simultaneously. To overcome this problem, we propose a Multi-resolution space-Attended Residual Dense Network (MARDN) that separates low-frequency and high-frequency information for reconstructing high-quality super-resolved images. Specifically, we start from a low-resolution sub-network and add low-to-high resolution sub-networks step by step over several stages. These sub-networks, with different depths and resolutions, produce feature maps of different frequencies in parallel. For instance, the high-resolution sub-network with fewer stages is applied to extract local high-frequency texture information, while the low-resolution one with more stages is devoted to generating global low-frequency information. Furthermore, a fusion block with channel-wise sub-network attention is proposed to adaptively fuse the feature maps from different sub-networks instead of applying concatenation and $1\times 1$ convolution. A series of ablation investigations and model analyses validates the effectiveness and efficiency of our MARDN. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed MARDN over state-of-the-art methods. Our super-resolution results and the source code can be downloaded from https://github.com/Periter/MARDN.

effectiveness of HRNet, which contains multi-resolution sub-networks to extract representations at different resolutions and achieves state-of-the-art results. Inspired by Sun et al.'s work, we propose a novel multi-resolution network to reconstruct low-frequency and high-frequency information separately.
For SISR, the low-frequency and high-frequency information of the super-resolution (SR) images is affected not only by the depth of the network, but also by the resolution of the features. Generally, deeper networks have larger receptive fields that extract global low-frequency semantic information, while shallower networks have smaller receptive fields that focus on local regions containing more high-frequency information. As far as the resolution of the features is concerned, high-resolution features contain more pixels and more high-frequency information than low-resolution ones. In other words, high-resolution features better reflect the regions where the frequencies or signals change rapidly.
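The link between depth, stride and receptive field can be made concrete with a small calculation. The sketch below (a hypothetical helper, using the standard receptive-field recurrence for stacks of convolutions) illustrates why a deeper, lower-resolution branch covers global context with fewer layers:

```python
def receptive_field(num_layers, kernel_size=3, strides=None):
    """Receptive field of a stack of conv layers (no dilation).

    The field grows by (kernel_size - 1) * jump at each layer, where
    jump is the product of the strides of all preceding layers.
    """
    strides = strides or [1] * num_layers
    rf, jump = 1, 1
    for s in strides:
        rf += (kernel_size - 1) * jump
        jump *= s
    return rf

# A deeper stack of 3x3 convs sees a wider window: 5 layers -> 11x11,
# 20 layers -> 41x41. An early stride-2 downsampling doubles the growth
# rate of every later layer, which is why a low-resolution sub-network
# captures global low-frequency structure with fewer layers.
```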
The key challenge of SISR is that low-frequency and high-frequency information cannot be reconstructed well at the same time, which makes SISR prone to failure. To tackle this problem, we propose a novel SISR method based on a Multi-resolution space-Attended Residual Dense Network (MARDN) that extracts different frequency representations in parallel for reconstructing the final super-resolved images. Specifically, we start from a low-resolution sub-network and gradually add low-to-high resolution sub-networks over several stages. In our model, the high-resolution sub-network with few stages is used to extract local high-frequency representations, while the low-resolution sub-network with more stages is applied to extract global low-frequency representations. Additionally, we propose a channel-wise sub-network attention block to adaptively fuse the features from different sub-networks. To accelerate the end-to-end training, we use a Global Skip Connection (GSC) between the shallow features and the deep features, Local Skip Connections (LSC) in the space-Attended Residual Dense Blocks (ARDB), and a Direct Connection (DC) from the shallow features to the sub-network whose features have the same resolution as the original input image. In our basic block, the ARDB, we apply a residual connection, dense connections and an attention module to enhance the feature representation ability. As shown in Fig. 1, with the help of channel attention, RCAN [8] can recover part of the lines in the circled region but still produces severely blurred texture, while our MARDN recovers all the lines correctly and is closer to the HR image, benefiting from the design of multi-resolution sub-networks.
Overall, the contributions of our method can be summarized as follows: • We design a novel multi-resolution space-attended residual dense network for effective image super-resolution. Specifically, different sub-networks extract different frequency information by controlling their resolution and depth.
• We propose an adaptive fusion block based on channelwise sub-network attention for better fusing the feature maps from different sub-networks.
• Extensive experiments on benchmark datasets demonstrate the effectiveness of our model and the results are superior to the state-of-the-art super-resolution methods.
The rest of the paper is organized as follows: Section II introduces the related work. Section III elaborates the structure of the proposed network. Section IV discusses the differences between our model and similar works. Section V presents our experimental results on benchmark datasets, including quantitative and visual comparisons with other methods as well as the effectiveness of each component in our network. Finally, Section VI draws the conclusions.

II. RELATED WORK
Recently, deep convolutional neural networks (CNNs) have brought remarkable improvements in computer vision owing to their powerful learning ability. This section first reviews several CNN-based super-resolution (SR) methods, and then briefly describes HRNet [19], a method using multi-resolution parallel sub-networks in human pose estimation, which inspired our method.

A. CNN-BASED IMAGE SUPER-RESOLUTION
Dong et al. [4] first proposed a CNN model (SRCNN) to learn the end-to-end mapping between interpolated low-resolution (LR) and high-resolution (HR) image patches, which outperforms traditional SR methods in both restoration quality and speed. VDSR [5] and DRCN [7] increase the network depth and introduce residual learning to alleviate the training difficulty, demonstrating that the network depth has an important impact on the reconstruction performance. Using the interpolated LR image as the network input results in redundant computation, particularly when the networks are very deep. Instead, some works take the original LR image as input, extract feature maps from it, and finally use upscaling modules to rescale the feature maps to the desired size, such as the deconvolutional layer [21], the sub-pixel layer [22], [23], and EUSR [24]. The LapSRN method [10] progressively reconstructs the sub-band residuals of the high-resolution images at multiple pyramid levels and uses the robust Charbonnier loss function. The authors also adopted a multi-scale training strategy to train a single model for handling multiple upsampling scales (MSLapSRN) [25]. The EDSR method [11] removes the batch normalization layers of the ResNet [26] and constructs a very deep network, which won first place in the NTIRE 2017 Super-Resolution Challenge [27]. Haris et al. [28] utilized mutually-connected up- and down-sampling stages with an error feedback mechanism for reconstructing SR images. Zhang et al. [13] proposed the residual dense block (RDB) as the building module of their SR network (RDN), which generates comparable or even better results than EDSR with nearly half the parameters. Zhang et al. [8] also proposed RCAN, with a residual-in-residual structure and a channel attention mechanism, to produce appealing SR results. Furthermore, Dai et al. [29] presented a second-order attention network (SAN) with a non-locally enhanced residual group structure, which achieved state-of-the-art performance with fewer parameters.
The aforementioned methods aim to minimize the mean square error (MSE) for a higher PSNR value, but sometimes produce over-smoothed edges. To produce photo-realistic SR images, SRGAN [6] first applied generative adversarial networks (GAN) [30] to SISR. Wang et al. [15] proposed an enhanced GAN-based method (ESRGAN) with the Residual-in-Residual Dense Block (RRDB) and used RaGAN [31] as the discriminator to produce more photo-realistic images; ESRGAN won first place in the 2018 PIRM Challenge on Perceptual Super-Resolution [32]. Recently, Zhang et al. [33] proposed Super-Resolution Generative Adversarial Networks with Ranker (RankSRGAN) to optimize the generator in the direction of perceptual metrics, which achieved visually pleasing results and state-of-the-art performance on perceptual metrics. Zhou and Süsstrunk [34] proposed KMSR, which builds a realistic blur-kernel pool with a GAN and uses it to construct a paired dataset of real photographs for unsupervised training. Lugmayr et al. [35] also presented an unsupervised SR model based on CycleGAN, which transfers LR images from the bicubic degradation domain to the real-world degradation domain; in this way, real-world LR-HR paired training and testing datasets are formed.
The above-mentioned methods focus on increasing the network depth or designing efficient network structures to extract more effective features, improving the reconstruction performance in either accuracy or perceptual quality. However, they do not take multi-resolution features into consideration and use only a single network to extract information at a single resolution. Our proposed method pays attention to different frequency representations and uses a multi-resolution model with several sub-networks to extract different frequency information in parallel.

B. HIGH-RESOLUTION NETWORK
Most existing methods for pose estimation recover high-resolution representations from low-resolution representations produced by a high-to-low resolution network. Similar to hierarchical neural networks [36], Sun et al. proposed HRNet [19] to maintain high-resolution representations during the whole process, which achieved superior pose estimation results on two benchmark datasets: the COCO keypoint detection dataset [37] and the MPII Human Pose dataset [38].
Most networks have high-to-low and low-to-high processes, where the former aims to produce low-resolution representations with high channel counts and the latter is designed to recover high-resolution representations. Compared with these networks, HRNet has two benefits: i) It forms a high-resolution sub-network as the first stage and gradually adds high-to-low resolution sub-networks to form more stages, connecting the sub-networks in parallel rather than in series as most networks do. In this way, HRNet maintains high resolution through the whole process for spatially precise heatmap estimation. ii) It conducts multi-scale fusions across the parallel sub-networks such that each sub-network repeatedly receives information from the other sub-networks. Consequently, the high-resolution representation is boosted and the keypoint heatmap is potentially more accurate and spatially more precise, resulting in better performance than other pose estimation networks.

III. PROPOSED METHOD
This section introduces our MARDN model. The overall architecture is presented first, followed by detailed descriptions of the space-Attended Residual Dense Block (ARDB) and the fusion block.

A. ARCHITECTURE
As shown in Fig. 2, MARDN reconstructs a super-resolution image I_SR from its low-resolution counterpart I_LR. MARDN consists of four modules: the shallow feature extraction, deep feature extraction, upscaling, and reconstruction modules. The pseudo-code is shown in Algorithm 1.
Firstly, the original low-resolution image is fed into the shallow feature extraction module, which transfers the image from the color space to the feature space, resulting in the shallow features F_sf:

F_sf = H_SF(I_LR),

where H_SF represents the function of the shallow feature extraction module, which is constructed by a 3 × 3 convolutional layer. The shallow features F_sf then pass through the deep feature extraction module, producing the deep feature maps F_df:

F_df = H_DF(F_sf),

where H_DF denotes our multi-resolution sub-networks, which are inspired by HRNet [19]. Different from existing SISR methods, which use single-channel networks to extract deep features and reconstruct SR images, we adopt multi-channel sub-networks to extract features of different frequencies effectively. We first downsample the shallow features into low-resolution features through bi-linear interpolation. Taking the low-resolution features as input, the deep feature extraction module starts from a low-resolution sub-network as the first stage, and then adds low-to-high resolution sub-networks gradually over several stages, where each sub-network consists of several ARDBs. The features from the different resolution sub-networks join together at the start of each stage via our specifically designed fusion blocks, which apply channel-wise sub-network attention for adaptive fusion. Each of the low-to-high resolution representations adaptively receives information from the other parallel representations over and over.
Considering the difference in complexity between the low-frequency and the high-frequency information, we extract different frequency information by controlling the depth and resolution of the sub-networks. To make the low-resolution sub-network extract the global low-frequency information with a large receptive field, the data flow in this sub-network passes through more stages and more ARDBs than that in the high-resolution one. Let N_sr denote the sub-network with r-resolution in the s-th stage, whose resolution is 2^(r−1) times that of the first sub-network. As shown in Fig. 2, the deep feature extraction module H_DF contains 3 different resolution sub-networks in parallel. After obtaining the deep features F_df from the deep feature extraction module, we use the upscaling module to produce the features F_up with the same size as the desired super-resolved images:

F_up = H_UP(F_df) + F_sf↑,

where H_UP is the function of the sub-pixel convolution proposed in ESPCN [22], F_sf↑ represents bi-linear upsampling of the shallow features F_sf, and the symbol + represents the global skip connection (GSC). Specifically, we only construct one ×2 higher-resolution sub-network because it does not introduce much computational burden. For ×2 output we deactivate the upscaling module, and for ×4 and ×8 output we activate it. At the end of the network architecture, we use the reconstruction module to output the desired super-resolved image I_SR:

I_SR = H_REC(F_up),

where H_REC is the function of the reconstruction module, consisting of a convolutional layer that maps the feature space back into the color space. Our MARDN is a supervised model, so a suitable loss function is beneficial for training. Several loss functions have been used in existing SR methods, such as L1 [10], [11], [13], L2 [5], [21], [39], [40], and perceptual and adversarial losses [6], [15], [41]. To balance the convergence rate and performance, we follow the optimization process of Zhao et al. [42].
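As a concrete illustration of the upscaling step, the following is a minimal NumPy sketch of the sub-pixel (pixel-shuffle) rearrangement from ESPCN together with the global skip connection; nearest-neighbour upsampling stands in for the bi-linear interpolation used in the paper, and the array shapes and function names are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r^2, H, W) array into (C, H*r, W*r), as in the sub-pixel layer."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, r, r) phase blocks
    x = x.transpose(0, 3, 1, 4, 2)    # interleave phases with spatial axes: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

def upscale_with_gsc(f_df, f_sf, r):
    """F_up = H_UP(F_df) + upsampled F_sf (the global skip connection).

    Nearest-neighbour repetition approximates the paper's bi-linear F_sf upsampling.
    """
    up = pixel_shuffle(f_df, r)
    skip = f_sf.repeat(r, axis=1).repeat(r, axis=2)
    return up + skip
```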
We first train our model with the L1 loss function, which has been shown to speed up the convergence of training compared with the L2 loss function. In the second step, we use the L2 loss function to fine-tune the model, which achieves better PSNR performance. Given a training set of N pairs of LR images and their HR counterparts {I_LR_i, I_HR_i}, i = 1, …, N, the goal of optimizing MARDN is to minimize the objective:

L_p(Θ) = (1/N) Σ_{i=1}^{N} ‖ H_MARDN(I_LR_i) − I_HR_i ‖_p,

where Θ represents the parameter set of the MARDN model. The loss function is L1 when p is 1 and L2 when p is 2. More training details can be found in subsection V-A.
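The two-step objective above can be sketched as follows; a NumPy stand-in replaces the network output, and the staging comments restate the schedule from the text:

```python
import numpy as np

def lp_loss(sr, hr, p):
    """Mean L_p reconstruction loss: L1 for p=1 (first stage), L2 for p=2 (fine-tuning)."""
    diff = np.abs(np.asarray(sr, dtype=float) - np.asarray(hr, dtype=float))
    return float(np.mean(diff if p == 1 else diff ** 2))

# Stage 1: minimize lp_loss(sr, hr, p=1) until convergence (faster convergence).
# Stage 2: switch to lp_loss(sr, hr, p=2) to fine-tune for a higher PSNR.
```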

B. SPACE-ATTENDED RESIDUAL DENSE BLOCK
We now present more details about our basic block, the space-attended residual dense block (ARDB). As shown in Fig. 3, we remove the batch normalization (BN) layers from our network since Lim et al. [11] and Wang et al. [15] have shown that batch normalization not only consumes large computational resources but also introduces BN artifacts in SISR. An ARDB consists of three space-attended residual dense units (ARDU) and a local skip connection (LSC) between the input of the ARDB and the output of the last ARDU. Moreover, each ARDU contains a dense block of five convolutional layers, a space-attended module, and a skip connection between the input of the unit and the output of the space-attended module.
As is well known, an image has different frequency information in different regions. In order to make the sub-networks focus on different frequency features, we construct a space-attended module to exploit the interdependence within the feature space. In the space-attended module, we denote F_uin as the input features of the ARDU and F_rd as the output of the dense block. The size of both features is C × H × W, where C is the number of channels, and W and H are the width and the height of the feature map. We simultaneously apply 1 × 1 convolution, max-pooling, and average-pooling along the channel axis to generate three spatial feature maps, denoted as F_sa1, F_sa2 and F_sa3, respectively:

F_sa1 = Conv_1×1(F_rd), F_sa2 = MaxPool(F_rd), F_sa3 = AvgPool(F_rd),
where Conv_1×1, MaxPool, and AvgPool denote a convolutional layer with 1 × 1 kernel size, max-pooling, and average-pooling, respectively. These three spatial feature maps are then concatenated and forwarded to a 1 × 1 convolution layer to encode the contextual information into a local attention map F_sa of size 1 × H × W:

F_sa = Sigmoid(Conv_1×1([F_sa1, F_sa2, F_sa3])),

where Sigmoid represents the sigmoid activation. The attended output of the unit is F_ardu = F_uin + F_sa ⊗ F_rd, where ⊗ denotes multiplication with broadcasting rules. The final output F_ardb of our basic block (ARDB) can therefore be represented as:

F_ardb = F_bin + H_ARDU(H_ARDU(H_ARDU(F_bin))),

where F_bin is the input of the ARDB, and H_ARDU represents the function of an ARDU, whose output is the attended feature F_ardu.
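The channel-axis pooling and broadcast multiplication described above can be sketched in NumPy as follows; the 1 × 1 convolutions are reduced to weight vectors, and all weights and shapes are illustrative assumptions rather than the trained module:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def space_attention(f_rd, w1, w_fuse):
    """Sketch of the space-attended module.

    f_rd:    (C, H, W) dense-block output
    w1:      (C,) weights standing in for the 1x1 conv producing the first map
    w_fuse:  (3,) weights standing in for the 1x1 conv fusing the three maps
    """
    f1 = np.tensordot(w1, f_rd, axes=([0], [0]))   # 1x1 conv -> (H, W)
    f2 = f_rd.max(axis=0)                          # max-pool along channel axis
    f3 = f_rd.mean(axis=0)                         # avg-pool along channel axis
    stacked = np.stack([f1, f2, f3])               # concat -> (3, H, W)
    mask = sigmoid(np.tensordot(w_fuse, stacked, axes=([0], [0])))  # (H, W) in (0, 1)
    return f_rd * mask                             # broadcast multiply F_sa (x) F_rd
```

An ARDU would then return `f_uin + space_attention(f_rd, w1, w_fuse)`, matching the unit's skip connection.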

C. FUSION BLOCK
In each stage, multi-resolution feature maps are generated from the different sub-networks in parallel. Instead of directly fusing these feature maps via concatenation and 1 × 1 convolutional compression, our proposed fusion block (FB) fuses them with an attention mechanism to enhance feature aggregation. As an example, we illustrate the last fusion block in the network, which receives three feature maps of different resolutions as input. The structure of this fusion block is shown in Fig. 4. The first step of the fusion operation is to interpolate the low-resolution feature maps to the same size as the high-resolution ones. The interpolated feature maps are passed through global average pooling and squeeze-and-excitation layers (the convolutional layers), respectively. Then, the three resulting vectors of size 64 × 1 × 1 are concatenated and processed with a softmax layer along the same channel index to generate an attention matrix of size 64 × 3 × 1. After that, we split the matrix into three attention vectors of size 64 × 1 × 1 for the three sub-networks. Finally, we multiply the interpolated feature maps by the corresponding attention vectors and sum them up to obtain the fused output. The general representation of the fusion blocks is:

F_fb = Σ_{r=1}^{n} v_r ⊗ F_r,

where v_r and F_r denote the attention vectors and the interpolated feature maps of the different sub-networks, respectively, F_fb is the fused output, and n is the number of feature maps to be fused. For the case shown in Fig. 4, n = 3.
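The pooling, per-channel softmax and weighted sum can be sketched as follows; the squeeze-and-excitation layers are reduced to plain matrices, the inputs are assumed already interpolated to a common size, and all names and shapes are illustrative assumptions:

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_block(features, se_weights):
    """Sketch of channel-wise sub-network attention fusion.

    features:   list of n feature maps, each (C, H, W), already interpolated
                to the same spatial size
    se_weights: list of n (C, C) matrices standing in for each branch's
                squeeze-and-excitation layers
    """
    # Global average pooling gives one C-vector per sub-network; SE layer follows.
    vecs = [w @ f.mean(axis=(1, 2)) for f, w in zip(features, se_weights)]
    scores = np.stack(vecs, axis=1)        # (C, n) attention matrix before softmax
    attn = softmax(scores, axis=1)         # softmax over sub-networks per channel
    # Weighted sum: F_fb = sum_r v_r (x) F_r, with v_r broadcast over H and W.
    return sum(attn[:, r, None, None] * features[r] for r in range(len(features)))
```

With identical branches the attention degenerates to uniform weights and the output equals the shared input, which is the behaviour plain averaging would give; the attention only departs from that when the branch statistics differ.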

IV. DISCUSSION
In this section, we compare the proposed model with similar works and discuss the differences between them.

A. DIFFERENCE BETWEEN MARDN AND CNF
CNF [18] and the proposed MARDN both belong to multi-sub-network designs. There are two differences between them.
First, MARDN is trained end-to-end, while CNF trains the different sub-network structures separately and fine-tunes them through a weighted ensemble. Second, MARDN has sub-networks of different resolutions, while CNF only reconstructs information in a single resolution space.

B. DIFFERENCE BETWEEN MARDN AND RDN
RDN [13] and MARDN both utilize residual and dense connections to effectively extract features and stabilize the training of deeper networks. However, MARDN differs from RDN in three aspects. First, MARDN has a spatial attention block after each residual dense unit to focus on the most relevant features, while RDN does not. Second, MARDN has a multi-resolution sub-network architecture, while RDN only contains a single-resolution data flow. Third, MARDN uses a specifically designed channel-wise sub-network attention to better fuse the feature maps from the different resolution sub-networks. In fact, RDN is a special case of MARDN in which the space-attended weights are set to 1 and only the original-resolution sub-network is kept.

C. DIFFERENCE BETWEEN MARDN AND RCAN
Recently, the attention mechanism has been applied to the super-resolution task, and RCAN [8] is an exemplary work using channel attention. There are also three differences between RCAN and MARDN. First, RCAN uses residual skip connections to stabilize the training process, while MARDN contains both residual and dense skip connections. Second, RCAN only considers channel-wise correlations and rescales features with channel attention, while MARDN applies both spatial attention in the ARDBs and channel-wise sub-network attention in the fusion blocks. Third, RCAN utilizes a single network to extract features at a single resolution, while MARDN uses several sub-networks to extract features at different resolutions, handling low- and high-frequency information in different sub-networks.

V. EXPERIMENTS
In this section, we first detail our experiment settings. Then we conduct ablation studies and present quantitative and visual comparisons between our method and state-of-the-art methods under the bicubic and blur-downscale degradation models. Finally, we present the model analyses.

2) DEGRADATION MODEL AND EVALUATION METRICS
We obtain the LR images by down-sampling the HR images using the MATLAB bicubic (BI) kernel function. The experiments are conducted with scaling factors of ×2, ×4, and ×8 between the HR and LR images. The commonly used evaluation metrics PSNR and SSIM are computed on the Y channel for quantitative comparisons with other methods. Visual results for qualitative comparisons are also provided.
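The PSNR-on-Y-channel evaluation can be sketched as follows; the BT.601 luma coefficients are the ones conventionally used for this conversion, and the helper names are illustrative:

```python
import numpy as np

def rgb_to_y(img):
    """ITU-R BT.601 luma of an RGB image in [0, 255], shape (H, W, 3) -> (H, W)."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(sr, hr, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    mse = np.mean((np.asarray(sr, dtype=float) - np.asarray(hr, dtype=float)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```

For a quantitative comparison one would convert both the SR output and the HR ground truth with `rgb_to_y` and report `psnr` on those luma planes.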

3) TRAINING SETTING
As shown in Fig. 2, the proposed network is a fully convolutional network. Each convolutional layer has a kernel size of 3 × 3 by default and is followed by a Leaky ReLU activation layer. The model is first trained with the L1 loss, with the learning rate initialized to 1 × 10^−4 and halved every 1 × 10^5 iterations of back-propagation. After the training process converges, we change the objective function to L2 to fine-tune the parameters. For optimization, we use Adam [47] with β1 = 0.9, β2 = 0.999, and ε = 10^−8. Following Lim et al. [11], we randomly crop the input LR and HR images into 48 × 48 LR patches and the corresponding HR patches according to the scaling factor, and these patches are then augmented with random horizontal flipping and 90° rotation. Finally, we train the network on RGB channels with a mini-batch size of 16. Our MARDN model is implemented in the PyTorch framework [48] on 2 NVIDIA GeForce RTX 2080 GPUs.
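The step-decay schedule described above can be written as a one-line helper; the function name is illustrative:

```python
def learning_rate(step, base_lr=1e-4, halve_every=100_000):
    """Step-decay schedule: the learning rate is halved every `halve_every` iterations."""
    return base_lr * 0.5 ** (step // halve_every)

# At step 0 the rate is 1e-4; after 1e5 iterations it drops to 5e-5, and so on.
```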

B. ABLATION INVESTIGATIONS
To verify the effectiveness of the proposed MARDN model, we conduct a series of ablation investigations, covering the components of the sub-networks, the skip connections and the fusion blocks. Table 1 shows the best PSNR (dB) values of the different ablation cases on Set5 within 100 epochs at the ×4 upscaling factor, and Fig. 5 shows the corresponding convergence curves.

1) COMPONENTS OF SUB-NETWORKS
To demonstrate the effect of our proposed multi-resolution sub-network structure, we conduct two ablation experiments (No. 1 and No. 2 in Table 1) by removing the high- and low-resolution sub-networks, respectively. Although these two experiments reach the same PSNR of 32.06 dB, the No. 1 experiment is hard to train without the high-resolution sub-network. As shown in Fig. 5, the curve of the No. 1 experiment sometimes oscillates heavily. In comparison, our MARDN achieves the best PSNR value, which demonstrates that our multi-resolution sub-networks can extract more effective features for reconstructing the super-resolved images.

2) GSC AND LSC
To verify the effectiveness of the global skip connection (GSC) and the local skip connection (LSC), we remove them separately in the No. 3 and No. 4 experiments. It can be seen from Table 1 and Fig. 5 that the PSNR values are lower than those of the complete MARDN (No. 6) in both cases. Furthermore, removing the LSC results in slower convergence and a lower PSNR than removing the GSC, which indicates that both skip connections, by passing shallow feature information forward, improve the performance.

3) FUSION BLOCKS
In the No. 5 experiment, we remove all the fusion blocks and replace them with concatenation and 1 × 1 convolution. Compared to the complete MARDN, the model without the adaptive channel-wise sub-network fusion blocks reduces the PSNR by 0.12 dB. This demonstrates that the fusion blocks are effective in fusing the feature maps from different sub-networks. These ablation investigations validate the effectiveness of the proposed MARDN designs, including the multi-resolution sub-networks, GSC, LSC and fusion blocks. All of them bring performance improvements.

C. RESULTS WITH BICUBIC (BI) DEGRADATION MODEL
To evaluate the reconstruction ability of our MARDN, we first compare it with 12 state-of-the-art CNN-based SR methods under BI degradation, including SRCNN [21], FSRCNN [39], VDSR [5], LapSRN [10], CNF [18], MemNet [23], EDSR [11], SRMDNF [14], D-DBPN [28], RDN [13], RCAN [8], and SAN [29]. Table 2 presents all the quantitative results on the benchmark datasets with various scaling factors. Our MARDN performs similarly to RCAN and SAN and better than the other state-of-the-art CNN-based methods. In particular, MARDN performs relatively better at scale factors ×4 and ×8, but slightly worse at scale factor ×2. We consider that the inferior performance of our MARDN at scale factor ×2 may be caused by deactivating the upscaling module, which increases the burden on the preceding sub-networks. For upscaling with large scale factors, the input image lacks effective information for image super-resolution. To exploit the limited information, we use the multi-resolution sub-network architecture to separate and reconstruct the different frequency information effectively, resulting in high-quality super-resolved images. For the images in Urban100, the recent methods RDN, RCAN and SAN suffer from artifacts with wrong textures to some extent, while our MARDN recovers more correct textures, similar to the HR images, and has the highest PSNR/SSIM values. In the image 'UchiNoNyansDiary_000', our MARDN keeps the original texture and achieves a sharper result, while the other methods generate blurred or wrong textures around the cat's ear. These observations demonstrate the effectiveness and superiority of our MARDN, which not only obtains more faithful results but also produces fewer or no artifacts via the multi-resolution structure.

D. RESULTS WITH BLUR-DOWNSCALE (BD) DEGRADATION MODEL
1) QUANTITATIVE RESULTS
As is common in [8], [29], we compare ×3 SR results under BD degradation. As shown in Table 3, our MARDN achieves the best performance on each dataset. Compared with the very deep network with second-order attention, SAN, MARDN reconstructs high-quality images with its multi-resolution sub-network structure, indicating that our method can reconstruct high-quality super-resolved images from more severely degraded inputs.

2) QUALITATIVE COMPARISON BY VISUALIZATION
Similar to the BI degradation case, we also provide visual comparisons in Fig. 7. For the challenging texture in image '253027', RDN and RCAN recover some texture information to a degree, but still suffer from over-smoothness. In contrast, our MARDN recovers the correct zebra pattern and reconstructs a clear super-resolved image close to the original high-resolution image. This clearly demonstrates the efficiency of our multi-resolution sub-networks in separating frequency information for high-quality images.

E. MODEL ANALYSES
1) MODEL PERFORMANCE ANALYSIS
To find the best configuration of our MARDN model, we first investigate the correlation between the performance and the number of ARDBs in each sub-network. As shown in Table 4, the model L5O7H3, with 5, 7 and 3 ARDBs in the low-, original- and high-resolution sub-networks, respectively, performs better than those with fewer ARDBs. It can also be seen that L4O7H2 and L5O4H3 have similar numbers of parameters, but L4O7H2, which has more ARDBs in its original-resolution sub-network, clearly performs better than L5O4H3. On the other hand, L6O7H4 is a deeper network but performs worse than L5O7H3. To balance the computational burden and performance, we use L5O7H3 as our final setting.

2) MODEL SIZE ANALYSIS

Fig. 8 shows the performance and model sizes of some state-of-the-art deep CNN-based SR methods. Our MARDN model has the best performance, but it has a large number of parameters. To reduce the parameters, we remove one space-attended residual dense unit from each ARDB, and name the result the MARDN-lite model. It can be seen that the MARDN-lite model has a similar number of parameters to the SAN model, but performs better than the other state-of-the-art methods.

3) TIME COMPLEXITY ANALYSIS
To assess the execution time, we use the publicly released official testing code of recent CNN-based SR methods, including D-DBPN, SAN and RCAN, on a machine with a 4.2 GHz Intel i7 CPU (32 GB RAM) and an NVIDIA RTX 2080 GPU. We run inference five times over the 100 images of the Urban100 dataset, and the mean per-image values are shown in Table 5. It is noteworthy that among these models our MARDN has both the least inference time and the highest PSNR.

4) FEATURE MAPS VISUALIZATION IN MULTI-RESOLUTION SUB-NETWORKS
To verify whether our MARDN recovers features of different frequencies separately, we visualize the feature maps before the last fusion block, as depicted in Fig. 9. In our model, we utilize the low-resolution sub-network to process the low-frequency information, and the high-resolution one to learn the high-frequency information. It is evident that as the resolution increases, the feature maps exhibit more high-frequency texture information, which demonstrates that our multi-resolution sub-networks can process different frequency information separately.

VI. CONCLUSION
This paper proposed a multi-resolution space-attended residual dense network (MARDN) for image super-resolution. Specifically, different frequency feature maps are produced by controlling the resolution and the depth of the sub-networks. To effectively fuse the feature maps from different sub-networks, a channel-wise sub-network attention fusion block was introduced to adaptively join the features. Extensive experiments on SR with the bicubic (BI) and blur-downscale (BD) degradation models showed the effectiveness of our MARDN in terms of quantitative and visual results. Ablation investigations also validated the effectiveness of the multi-resolution sub-networks, skip connections and fusion blocks in our MARDN. Furthermore, the visualization of feature maps in the multi-resolution sub-networks verified that the high-resolution sub-network with few stages extracts high-frequency information, while the low-resolution sub-network with more stages extracts low-frequency information. In our future work, we aim to design a light-weight backbone network to offset the parameter increase brought by the multiple sub-networks.
JIAYV QIN received the B.S. degree in network engineering from the Guilin University of Technology, in 2017. He is currently pursuing the M.S. degree in software engineering with the South China University of Technology, China. His current research interests include single image super-resolution, face hallucination, and generative adversarial networks.
XIANFANG SUN received the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, in 1994. He is currently a Senior Lecturer with the School of Computer Science, Cardiff University, U.K. His main research interests include computer vision, computer graphics, pattern recognition, and artificial intelligence.

XINYI PENG is currently a Full Professor with the School of Software Engineering, South China University of Technology, China. In recent years, he has presided over and participated in more than 20 projects, with project funds of over nine million. Among them, nine projects at or above the provincial level have passed acceptance inspection. His research interests include artificial intelligence and data mining.

VOLUME 8, 2020