Hybrid Residual Attention Network for Single Image Super Resolution

The extraction and proper utilization of convolutional neural network (CNN) features have a significant impact on the performance of image super-resolution (SR). Although CNN features contain both spatial and channel information, current deep SR techniques often struggle to maximize performance because they exploit either the spatial or the channel information, but not both. Moreover, they integrate such information within a deep or wide network rather than exploiting all the available features, which eventually results in high computational complexity. To address these issues, we present a binarized feature fusion (BFF) structure that utilizes the features extracted from residual groups (RG) in an effective way. Each RG consists of multiple hybrid residual attention blocks (HRAB) that effectively integrate a multiscale feature extraction module and a channel attention mechanism in a single block. Furthermore, we use dilated convolutions with different dilation factors to extract multiscale features. We also propose to adopt global, short, and long skip connections and the RG structure to ease the flow of information without losing important feature details. We call this overall network architecture the hybrid residual attention network (HRAN). In our experiments, we observe the efficacy of our method against state-of-the-art methods in both quantitative and qualitative comparisons.


Introduction
In this paper, we address the Single Image Super Resolution (SISR) problem, where the objective is to reconstruct an accurate high-resolution (HR) image from a single low-resolution (LR) image. SISR is an ill-posed problem, since multiple HR images can map to the same LR image, and the problem is intensified as the upsampling factor grows. Because HR images preserve much richer information than LR images, SISR techniques are popular in many practical applications, such as surveillance [43], face hallucination [35], hyperspectral imaging [14], and medical imaging [26].
Numerous deep-learning-based methods have been proposed in recent years to address the SISR problem. Among them, SRCNN [3] is considered the first deep-learning-based solution, with its three convolution layers. SRCNN outperformed the existing SISR approaches that typically used multiple images with different scaling factors and/or handcrafted features. Later, Kim et al. [10] proposed an architecture named VDSR that extended the depth of the CNN up to twenty layers while adding a global residual connection. DRCN [11] also increased the depth of the network through recursive supervision and skip connections, and further improved performance. However, as the networks grew deeper, vanishing gradients prevented them from converging [7]. In the image classification domain, He et al. [7] solved this problem with a residual block, by which a network of over 1000 layers was successfully trained. Inspired by this very deep architecture with residual blocks, Lim et al. [17] proposed much wider and deeper networks for the SISR problem using residual blocks, called EDSR and MDSR, respectively.
Very recently, Zhang et al. [40] proposed RCAN, which utilizes a channel attention block to exploit the interdependencies across feature channels. Moreover, Li et al. [16] proposed MSRN, which improved reconstruction performance by exploiting the information of spatial features rather than increasing the depth of the CNN. MSRN combines the features extracted with different convolution filter sizes and concatenates the outputs of all residual blocks through a hierarchical feature fusion (HFF) structure, utilizing the information of the intermediate feature maps. By doing so, MSRN achieved performance comparable to EDSR [17] despite a 7-times smaller model size. In [42], Zhang et al. proposed DCSR, with a mixed convolution block that combines dilated convolution layers and conventional convolution layers to attain larger receptive fields. Nonetheless, most of these CNN-based methods focus either on increasing the number of layers [10,11,17,40] or on extending the width of a layer [16] to achieve higher performance. In this way, they put less focus on exploiting the by-product CNN features, e.g., spatial and channel information, simultaneously, and thus at times fail to maximize performance.
Moreover, the strong correlation between the input LR and output HR images [16] leads us to assume that, apart from high-level features, low-level and mid-level features also play vital roles in reconstructing a super-resolution (SR) image. Therefore, we argue that they should be treated carefully, which we do in this paper.
In previous work, dense connections were used [32], adding every feature to subsequent features with residual connections. As a variant of dense connections, hierarchical feature fusion (HFF) [32,41,16] was proposed to remove the trivial residual connections and directly concatenate all the output features from the residual blocks for the SISR problem. However, this direct concatenation prevents a smooth transformation of features from low to high levels, resulting in improper utilization of various low-level and mid-level features. It may also introduce redundancy in feature utilization, thus increasing the computational complexity. Our ablation study in Section 4.1 verifies this problem.
To solve this problem, in this paper we propose a binarized feature fusion (BFF) scheme that combines adjacent feature maps with 1×1 convolutions, applied repeatedly until a single feature map remains. This allows all the features extracted from the CNN to be integrated smoothly, thus fully utilizing features of different levels. Moreover, to extract features efficiently, unlike previous work that used plain residual blocks, we propose residual groups (RG) constructed from the proposed hybrid residual attention block (HRAB). Our HRAB extracts both spatial and channel information, on the premise that both are important for reconstructing high-quality SR images and should be extracted simultaneously in a single module.
Moreover, compared to MSRN [16], which concatenates conventional convolution layers with different kernel sizes to enlarge receptive fields, our proposed method concatenates dilated convolution layers with different dilation factors, exploiting much larger receptive fields while significantly decreasing the number of parameters, i.e., convolution weights. Furthermore, to ease the flow of information, we introduce short, long, and global skip connections. We conduct comprehensive experiments to verify the efficacy of our method, where we observe its superiority over other state-of-the-art methods.
We summarize the overall contributions of this work as follows:
• We propose a BFF structure to transfer all the image features smoothly to the end of the network. This structure allows the network to smoothly transform features of different levels and generate an effective feature map in the final reconstruction stage.
• We propose a hybrid residual attention block (HRAB) that considers both channel and spatial attention mechanisms to exploit channel and spatial dependencies. The spatial attention mechanism extracts fine spatial features with larger receptive fields, whereas the channel attention guides the selection of the most important feature channels; in the end, we obtain more discriminative features.
• Unlike previous works, we employ BFF on residual groups (RG) rather than on individual residual blocks (HRAB).
• To extract multiscale spatial features, we propose a mixed dilated convolution block with different dilation factors. Compared to previous work [16] that used large kernel sizes to secure large receptive fields, our method achieves similar performance even with smaller kernels. Moreover, we use dilated convolutions in a manner that avoids the gridding problem of conventional dilated convolution layers.
• To ease the transmission of information throughout the network, we adopt global, short, and long skip connections in our architecture.

Related work
Several CNN-based SISR methods have been proposed in the recent past. Previously, as a preprocessing step, researchers tended to use an LR image interpolated to the desired output size as the input, which lets the network have input and output images of the same size. In contrast, due to the additional computational cost of interpolation, current work emphasizes directly reconstructing the HR image from the LR image without interpolation.
In 2014, Dong et al. [3] proposed SRCNN, the first CNN architecture in the SR domain. It was a shallow three-layer CNN that achieved superior performance over the previous non-CNN methods. Later, He et al. [7] proposed the residual learning technique, and Kim et al. [10,11] then achieved remarkable performance with their proposed VDSR and DRCN. VDSR used a deep (20-layer) CNN with a global residual connection, whereas DRCN [11] used a recursive block to increase the depth without requiring new parameters for the repeated blocks. Tai et al. [28] proposed MemNet, whose memory blocks consist of recursive and gate units. All of these methods use an interpolated LR image as input. Due to this preprocessing, they add computational complexity along with artifacts, as also described in [25].
On the other end, recent state-of-the-art methods directly learn the mapping from the input LR image. Dong et al. [4] proposed FSRCNN, an improved version of SRCNN with faster training and inference. Ledig et al. [15] proposed SRResNet, inspired by ResNet [7], to construct a deeper network; with a perceptual loss function in a GAN framework, they also proposed SRGAN for photorealistic SR. Lim et al. [17] removed the trivial modules (such as batch normalization) from SRResNet and proposed EDSR (wider) and MDSR (deeper), which made a significant improvement on the SR problem and won the first NTIRE SR challenge [30]. EDSR has a large number of filters (256), whereas MDSR has fewer filters but a depth of about 165 layers. This showed that deeper networks can achieve remarkable performance. Consequently, Zhang et al. [40] proposed RCAN, a very deep network for SR; to the best of our knowledge, it has the largest depth in the SR domain. RCAN [40] showed that merely stacking layers cannot improve performance. It proposed the channel attention (CA) [8] mechanism to suppress low-frequency information while selecting the valuable high-frequency feature maps, and the residual in residual (RIR) structure to increase the depth of the network. Nevertheless, the RCAN network is very deep, which makes it difficult to use in real-life applications due to its high inference time.
In contrast, multiscale feature extraction, which is less explored in SISR, has shown significant performance in object detection [18], image segmentation [24], and model compression [2], achieving good tradeoffs between speed and accuracy. Li et al. proposed the multiscale residual network (MSRN) [16] with just 8 residual blocks. It uses multipath convolution layers with different kernel sizes (3×3 and 5×5) to extract multiscale spatial features. Furthermore, it proposed the hierarchical feature fusion (HFF) structure to utilize the intermediate features. The intuition behind HFF is to transfer the middle features to the end of the network, since an increase in depth may cause features to vanish within the network. HFF shows performance comparable to EDSR, but its accuracy is limited. In addition, as the depth or width of a network increases, HFF also increases the computational complexity.
Therefore, we need an efficient multiscale super-resolution CNN that fully utilizes spatial as well as channel information. Considering this, we propose a hybrid residual attention network (HRAN) that combines multiscale feature extraction with the channel attention [8] mechanism. In this paper, we refer to the multiscale feature extraction as spatial attention; thus, the combination of channel and spatial attention is called hybrid attention. We discuss the details of HRAN in the next section.

Network architecture
The proposed HRAN architecture is shown in Figure 1. HRAN can be decomposed into two parts: feature extraction and reconstruction. The feature extraction is further divided into shallow feature extraction and deep feature extraction. The deep feature extraction step comprises residual groups (RG) with the binarized feature fusion (BFF) structure, where each RG contains a sequence of hybrid residual attention blocks (HRAB) followed by a 3×3 convolution. We denote the input and output of HRAN as I_LR and I_SR, respectively. We aim to reconstruct the accurate HR image I_HR directly from the LR image I_LR.
In the shallow feature extraction, we use two convolution layers to extract features from the input image I_LR:

F_0 = H_SF1(I_LR),

where H_SF1(·) represents the convolution operation. F_0 is also used for global residual learning to preserve the input features. As mentioned above, we pass F_0 for further feature extraction,

F_1 = H_SF2(F_0),

where H_SF2(·) represents the second convolution operation. F_1 is the output of the shallow feature extraction step and is used as the input to the deep feature extraction.
The deep features are then obtained as

F_DF = H_DF(F_1) + F_0,

where H_DF(·) represents the deep feature extraction function and F_0 provides a global residual connection, like VDSR [10], at the end of the deep features. The deep features are sequentially extracted through HRAB, RG, and BFF; the details are given in later sections.
The reconstruction step produces the SR image,

I_SR = H_REC(F_DF) = H_HRAN(I_LR),

where H_REC denotes the reconstruction function and H_HRAN the overall HRAN function. For image reconstruction, researchers previously upsampled the input image to obtain the desired output dimensions; instead, we reconstruct I_SR with the same dimensions as I_HR from the deep features of I_LR. Various techniques can serve as upsampling modules, such as the PixelShuffle layer [25], the deconvolution layer [4], and nearest-neighbor upsampling followed by convolution [5]. In this work, we use the MSRN [16] reconstruction module, which enables us to upscale to any factor with minor changes. For optimization, numerous loss functions have been discussed for SISR. The most commonly used are the MSE, L1, and L2 functions, whereas perceptual and adversarial losses are also popular. To keep the network simple and avoid trivial training tricks, we optimize with the L1 loss. Hence, given a training set of N pairs {I_LR^i, I_HR^i}, we define the objective function of HRAN as

L(Θ) = (1/N) Σ_{i=1}^{N} || H_HRAN(I_LR^i) − I_HR^i ||_1,

where Θ denotes the weights and biases of our network.
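As a concrete illustration, a PixelShuffle-based reconstruction head and the L1 objective can be sketched in PyTorch as below. This is a minimal sketch under our own assumptions (module name, filter counts); the paper's actual head follows the MSRN [16] reconstruction module.

```python
import torch
import torch.nn as nn

class ReconstructionHead(nn.Module):
    """Sketch of an MSRN-style reconstruction head: expand channels,
    rearrange them into spatial resolution with PixelShuffle, then
    project to a 3-channel color image."""
    def __init__(self, channels=64, scale=2, out_channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),  # (C*s^2, H, W) -> (C, H*s, W*s)
            nn.Conv2d(channels, out_channels, 3, padding=1),
        )

    def forward(self, x):
        return self.body(x)

# L1 objective: mean absolute error between I_SR and I_HR
l1_loss = nn.L1Loss()
```

Changing the upscaling factor only requires changing `scale`, which matches the claim that the module upscales to any factor with minor changes.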

Binarized Feature Fusion (BFF) structure
Shallow features lack the fine details needed for SISR, so we use deep networks to detect such features. However, in SISR there is a strong correlation between I_LR and I_SR. The features of I_LR must be fully utilized and transmitted to the end of the network, but in a deep network, features gradually vanish during transmission. A possible solution is a residual connection; however, it induces redundant information [16]. MSRN [16] uses the hierarchical feature fusion (HFF) structure to transmit information from all the feature maps to the end of the network. However, concatenating every feature generates a lot of redundant information and also increases the memory cost.
In contrast, we propose the binarized feature fusion (BFF) structure shown in Figure 1. A notable difference in this architecture is the use of residual groups (RG) instead of the multiscale residual block (MSRB) [16]. It is called a residual group because of the residual connections within it: RGs are connected through LSCs, whereas their sub-module HRABs are connected through SSCs. The use of RGs not only helps to increase the depth but also reduces the memory overhead when concatenating the feature maps.
Another difference in this architecture is the feature fusion of adjacent RG blocks. First, we concatenate the outputs of adjacent RG blocks; then we remove their redundant information using a 1×1 convolution. We repeat this procedure over all RG blocks and the resulting fused blocks until they integrate into a single block, which is convolved by a 1×1 convolution to produce the output features. In the end, we element-wise add this output to the shallow features F_0; we refer to this element-wise summation as the global skip connection in Figure 1.
Each feature map is produced as

F_i = H_RG(F_{i−1}),

where H_RG(·) represents the features extracted through a single RG block and F_i denotes the i-th extracted feature map. We explain the details of RG in the next section.
Once all the features are extracted through the RG blocks, we fuse their outputs pairwise (rather than with the HFF architecture) as

M_j = H_1×1([F_i, F_{i+1}]).
Here, the outputs of two adjacent RG blocks are channel-wise concatenated and then passed into a 1×1 convolution layer to remove their redundant information. Thus, the four RG blocks produce two fused blocks, which are then processed in a similar manner, such that F_{i+1} = M_j and F_{i+2} = M_{j+1}; in the next step, M_j and M_{j+1} act as two RG blocks. We repeat this procedure until all the RGs and the resulting fused blocks are integrated into a single output, which is then used as the input to the reconstruction step.
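The pairwise fusion described above can be sketched as follows. This is an illustrative implementation under our own assumptions (the class name and internals are ours); it assumes a power-of-two number of RG outputs, as with the 4 RGs used in this paper.

```python
import torch
import torch.nn as nn

class BFF(nn.Module):
    """Binarized feature fusion (sketch): repeatedly concatenate adjacent
    feature maps and reduce them with 1x1 convolutions until a single
    feature map remains."""
    def __init__(self, channels=64, num_inputs=4):
        super().__init__()
        n, fuses = num_inputs, []
        while n > 1:
            # each fusion step halves the number of feature maps
            fuses.extend(nn.Conv2d(2 * channels, channels, 1) for _ in range(n // 2))
            n //= 2
        self.fuses = nn.ModuleList(fuses)

    def forward(self, feats):
        k = 0
        while len(feats) > 1:
            nxt = []
            for a, b in zip(feats[0::2], feats[1::2]):
                # channel-wise concat, then 1x1 conv to remove redundancy
                nxt.append(self.fuses[k](torch.cat([a, b], dim=1)))
                k += 1
            feats = nxt
        return feats[0]
```

With 4 inputs, this uses 3 fusion convolutions (2 in the first round, 1 in the second), in contrast to HFF's single concatenation of all intermediate maps.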

Residual Groups (RG)
It is shown in [17] that stacked residual blocks enhance SR performance, but beyond a certain point they cause crucial information loss during transmission and also slow down training, limiting the performance gain in SISR [40]. Thus, rather than increasing the depth, we propose residual groups (RG) (see the shaded area of Figure 1) in our architecture to detect deep features. An RG consists of multiple HRABs followed by a 3×3 convolution. We find that adding too many HRABs degrades the SR performance. Thus, to preserve the information, we apply an element-wise summation between the input of the RG and the output of the 3×3 convolution, and refer to it as the long skip connection (LSC).
The RG enables the network to remember information through the LSC, whereas to detect deep features it uses SSCs within its modules, in this case the HRABs. Hence, the flow of information in an RG is smoothly carried through the LSC and SSCs. The details of the HRAB are given in the next section.
Thus, we express a single RG block as

F_RG = W_RG ∗ H_B(H_{B−1}(· · · H_1(F_1) · · ·)),

where H_i represents the i-th of the B hybrid residual attention blocks (HRAB), each taking the features produced by the previous block as input. After stacking the B HRAB modules, we apply a 3×3 convolution with weights W_RG. After applying the LSC, the equation above can be rewritten as

H_RG^1 = F_1 + W_RG ∗ H_B(H_{B−1}(· · · H_1(F_1) · · ·)).

The above equation represents the first RG block, because it takes the shallow features F_1 as input. Since we have multiple RG blocks to extract the deep features, the equation can be written generally as

H_RG^i = H_RG^{i−1} + W_RG ∗ H_B(H_{B−1}(· · · H_1(H_RG^{i−1}) · · ·)),

where i = 1, 2, · · · , R. We have R RG blocks, and each RG block uses the output of the previous block as its input, except the first RG block, which uses the shallow features F_1; thus, for the first RG block, H_RG^0 = F_1.
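The RG structure above (B blocks, a trailing 3×3 convolution, and the LSC) can be sketched as follows. `block_fn` is a stand-in for the HRAB defined in the next section, so this is an illustrative skeleton under our own naming, not the authors' exact module:

```python
import torch
import torch.nn as nn

class ResidualGroup(nn.Module):
    """Residual group (sketch): B stacked blocks followed by a 3x3
    convolution, with a long skip connection (LSC) from input to output."""
    def __init__(self, channels=64, num_blocks=8, block_fn=None):
        super().__init__()
        # placeholder block standing in for HRAB
        make = block_fn or (lambda: nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2)))
        self.blocks = nn.Sequential(*[make() for _ in range(num_blocks)])
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)  # W_RG

    def forward(self, x):
        # LSC: element-wise sum between the RG input and the conv output
        return x + self.conv(self.blocks(x))
```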

Hybrid Residual Attention Block (HRAB)
In this section, we propose a multiscale, multipath residual attention block for feature extraction, called the hybrid residual attention block (HRAB) (see Figure 2). Our HRAB integrates both the spatial attention (SA) and channel attention (CA) mechanisms, and thus has two separate paths for SA and CA.
The block computes

F_i = F_{i−1} + H_SA(F_{i−1}) · H_CA(F_{i−1}),

where H_SA and H_CA denote the spatial attention (SA) and channel attention (CA) functions, respectively, and '·' represents element-wise multiplication between their outputs. Unlike RCAN [40], we use element-wise multiplication between the outputs of SA and CA to extract the most informative spatial features. Like RCAN [40], we also add short skip connections (SSC) to ease the flow of information through the network.
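Putting the pieces together, the HRAB forward pass described above can be sketched with placeholder SA/CA sub-modules (the real ones are detailed in the following sections; all names here are ours):

```python
import torch
import torch.nn as nn

class HRAB(nn.Module):
    """Hybrid residual attention block (sketch): the SA and CA outputs are
    combined by element-wise multiplication, and a short skip connection
    (SSC) adds the block input back."""
    def __init__(self, channels=64, sa=None, ca=None):
        super().__init__()
        # placeholder SA: a single 3x3 convolution
        self.sa = sa or nn.Conv2d(channels, channels, 3, padding=1)
        # placeholder CA: squeeze-and-gate producing per-channel weights
        self.ca = ca or nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        # SSC + element-wise product of SA and CA outputs
        # (CA output is (N, C, 1, 1) and broadcasts over H, W)
        return x + self.sa(x) * self.ca(x)
```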

Spatial Attention (SA)
MSRN [16] shows that multiscale features improve performance with fewer residual blocks. In MSRN [16], the authors use multiple CNN filters with increasing kernel sizes (3×3 and 5×5) to extract multiscale features; the intuition behind the larger kernel size is to take advantage of large receptive fields. But large kernels increase the memory cost. Thus, we propose to use dilated convolution layers with different dilation factors, which can have the same receptive field as a large kernel while keeping memory consumption similar to that of a smaller kernel. However, merely stacking dilated convolution layers produces a gridding effect [36]. To avoid this problem, as illustrated in Figure 2, we apply an element-wise sum between the dilated convolutions with different factors before the concatenation operation. Suppose F_{i−1} is the input of the SA; then the output is F_i.
Here, H_DC1 and H_DC2 denote the convolution layers with dilation factors 1 and 2, respectively. First, we concatenate the outputs of the two convolution branches, which increases the channel size, and at the end we use a 1×1 convolution to reduce the channels; thus, our input and output have the same number of channels. Our SA architecture is inspired by [6], which showed that upsampling and downsampling modules within the architecture improve accuracy in SR. For the activation unit, following [12,29], we prefer LeakyReLU over ReLU, and we use a linear bottleneck layer as suggested in [23].
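A sketch of the SA path under these assumptions: the exact wiring follows Figure 2, which we approximate here with two dilated stages whose outputs are summed element-wise before concatenation, followed by a linear (non-activated) 1×1 bottleneck; all names are illustrative.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """SA module (sketch): parallel 3x3 convolutions with dilation factors
    1 and 2, an element-wise sum between the branches before concatenation
    (to mitigate gridding), and a 1x1 linear bottleneck back to the input
    channel count."""
    def __init__(self, channels=64):
        super().__init__()
        self.dc1a = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.dc2a = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.dc1b = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.dc2b = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.act = nn.LeakyReLU(0.2)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # linear bottleneck

    def forward(self, x):
        b1 = self.act(self.dc1a(x))
        b2 = self.act(self.dc2a(x))
        s = b1 + b2                       # element-wise sum before concatenation
        c1 = self.act(self.dc1b(s))
        c2 = self.act(self.dc2b(s))
        return self.fuse(torch.cat([c1, c2], dim=1))
```

With kernel size 3 and dilation 2, the second branch covers a 5×5 receptive field with only 9 weights per filter, which is the parameter saving argued for above.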

Channel Attention (CA)
The channel attention (CA) mechanism has achieved great success in image classification [8]. In SISR, RCAN [40] introduced the CA layer into the network. CA plays an important role in exploiting interchannel dependencies, because some channels carry trivial information while others carry the most valuable information. Therefore, we use channel-wise features and incorporate the CA mechanism with the SA module in our HRAB. Following [8,40], we use global average pooling to capture the channel-wise global information. We also experimented with global variance pooling, reasoning that the variance could capture more high frequencies; in practice, however, it gave poorer results than global average pooling. Suppose we have C channels in the feature map [x_1, x_2, · · · , x_C]; then we can express each feature map c as a single value,

z_c = (1/(H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j),

where x_c(i, j) is the value at spatial position (i, j) of the c-th feature map.
To extract the channel-wise dependencies, we use the same sigmoid gating mechanism as [8,40]; as in the SA, we replace ReLU with LeakyReLU:

s = f(W_U ∗ LR(W_D ∗ z)),

where LR(·) and f(·) represent the LeakyReLU and sigmoid gating functions, respectively, and W_D and W_U denote the weights of the downscaling and upscaling convolutions. Note that the downscaling and upscaling are channel-wise, with reduction ratio r.
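The CA path admits a compact sketch: standard SE/RCAN-style gating with global average pooling, a channel-reduction ratio r, and LeakyReLU in place of ReLU, as described above (the class name and defaults are our assumptions):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA module (sketch): global average pooling produces z_c, a
    channel-downscaling 1x1 conv (W_D, ratio r), LeakyReLU, a
    channel-upscaling 1x1 conv (W_U), and a sigmoid gate f(.)."""
    def __init__(self, channels=64, r=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # z_c: channel-wise mean
            nn.Conv2d(channels, channels // r, 1),   # W_D
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels // r, channels, 1),   # W_U
            nn.Sigmoid(),                            # gate f(.) in (0, 1)
        )

    def forward(self, x):
        # rescale each channel by its learned gate
        return x * self.body(x)
```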

Implementation details
To train the HRAN network, we employ 4 RG blocks in our main architecture; each RG block contains 8 HRAB modules followed by a 3×3 convolution. For the dilated convolution layers, we use 3×3 convolutions with dilation factors 1 and 2. We use C = 64 filters in all layers except the final layer, which has 3 filters to produce a color image, though our network works for both gray and color images. For the channel downscaling in the CA mechanism, we set the reduction ratio r = 4.

Experimental Results
In this section, we present the experimental analysis of our method, using several public datasets that are considered benchmarks in SISR. We provide both quantitative and qualitative comparisons of our method with several state-of-the-art networks. For training, we follow recent trends [17,39,41,16,30] and use the DIV2K dataset, since it contains high-resolution images. For testing, we choose the widely used standard datasets Set5 [1], Set14 [37], BSD100 [19], Urban100 [9], and Manga109 [20]. For the degradation, we use bicubic interpolation (BI).
We evaluate our results with the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [33] on the luminance channel, i.e., the Y channel of the transformed YCbCr space, and we remove P pixels from each border (P refers to the upscaling factor). We provide results for scaling factors ×2, ×3, ×4, and ×8.
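The evaluation protocol above (PSNR on the Y channel with a P-pixel border shave) can be sketched as follows. The BT.601 luma coefficients are those commonly used in SR evaluation code, an assumption on our part; the function name is ours.

```python
import numpy as np

def psnr_y(sr, hr, scale):
    """PSNR on the luminance (Y) channel (sketch). `sr` and `hr` are uint8
    RGB arrays of shape (H, W, 3); `scale` pixels are shaved from each
    border before comparison."""
    def rgb_to_y(img):
        # ITU-R BT.601 luma transform of YCbCr, on [0, 255] inputs
        return (16.0 + 65.481 * img[..., 0] / 255.0
                     + 128.553 * img[..., 1] / 255.0
                     + 24.966 * img[..., 2] / 255.0)

    y_sr = rgb_to_y(sr.astype(np.float64))[scale:-scale, scale:-scale]
    y_hr = rgb_to_y(hr.astype(np.float64))[scale:-scale, scale:-scale]
    mse = np.mean((y_sr - y_hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(255.0 ** 2 / mse)
```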
For the training settings, we follow [16]. In each training batch, we randomly extract 16 LR patches of size 64×64. We use the ADAM optimizer with learning rate lr = 10^−4, which is halved after every 2 × 10^5 back-propagation iterations. We implement our models in the PyTorch framework on a GeForce RTX 2080 Ti GPU.
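The optimizer schedule above maps directly onto PyTorch's built-in step scheduler (a sketch; the stand-in model is our assumption, and other scheduler choices would also reproduce the halving behavior):

```python
import torch

# Stand-in for the HRAN model; settings from the text: ADAM, lr = 1e-4,
# halved every 2e5 back-propagation iterations.
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=200_000, gamma=0.5)

# per-iteration usage:
#   loss.backward(); optimizer.step(); scheduler.step()
```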

Ablation studies
We conduct a series of ablation studies to show the effectiveness of our model. In the first experiment, we train our model with and without CA and compare the performance with our full HRAB module. For evaluation, we use the Urban100 dataset [9], as it is a large dataset. The results are shown in Table 1. We observe that our SA module alone achieves 32.77 dB PSNR. We also experimented with the CA module alone, though the results were unsatisfactory. When we combine SA with CA, i.e., our HRAB module, it achieves 32.95 dB PSNR. This study suggests that the HRAB module, containing both spatial and channel attention, is needed for accurate SR results. We also investigate our BFF structure using the HRAB module and test both BFF and HFF on MSRN [16] and on our proposed HRAN model to verify the effectiveness of BFF on both models. It is evident from the results that the BFF structure improves the PSNR of MSRN [16] from 32.22 dB to 32.44 dB. Moreover, the proposed HRAN and BFF together significantly increase the accuracy.

Table 3 caption: Quantitative comparisons of state-of-the-art methods for the BI degradation model. Best, 2nd best, and 3rd best results are shown in magenta, blue, and green, respectively.
We show our quantitative evaluation results in Table 3 for the scale factors ×2, ×3, ×4, and ×8. It is evident from the results that our method outperforms most of the previous methods. Our self-ensemble model achieves the highest PSNR among all the models. Although RDN [41] shows slightly better performance, from Figure 4 we observe that RDN [41] attains it by greatly increasing the depth of the network. Hence, this observation indicates that we can improve network performance with HRAB and RG along with BFF without increasing the network depth. It also suggests that our network could further improve its accuracy with more HRABs and RGs, though we aim for greater accuracy while keeping the memory cost in mind. Moreover, we present the qualitative results in Figure 3; the results of the other methods are taken from [40]. In Figure 3, it can be observed from the 'img 004' image that our HRAN recovers the lattices in more detail, while the other methods suffer from blurring artifacts. Similar behavior is observed in the 'Yumeiro-Cooking' image, where other methods produce blurry lines and our HRAN produces lines similar to the HR image. This shows that our model reconstructs fine details in the output SR image through the deep features extracted by the RGs, which are then efficiently utilized by the BFF.

Model Complexity Analysis
Since we target maximum accuracy with limited memory cost, our performance is best seen by viewing Table 3 together with Figure 4. In Figure 4, we compare model size against performance on Set5 [1] (×4). Although our HRAN model has fewer parameters than RDN [41] and EDSR [17], it still achieves comparable performance, whereas our HRAN+ outperforms the state-of-the-art methods. We also provide an analysis at a much larger scale (×8) in the supplementary material. These results demonstrate the effective utilization of features, which results in the performance gain in SISR.

Conclusions
In this paper, we propose a hybrid residual attention network (HRAN) to detect the most informative multiscale spatial features for accurate SR. The proposed hybrid residual attention block (HRAB) fully utilizes the high-frequency information in the input features through a combination of spatial attention (SA) and channel attention (CA). In addition, the binarized feature fusion (BFF) structure allows us to smoothly transmit all the features to the end of the network for reconstruction. Furthermore, we adopt global, short, and long skip connections and residual groups (RG) to ease the flow of information.
Our comprehensive experiments show the efficacy of the proposed model.

Supplementary Material
In this supplementary material, we present more qualitative results at different scaling factors. Furthermore, we compare our method's computational complexity at a large scaling factor.

Model Complexity Analysis
At a large scaling factor, the reconstruction of the SR image becomes more difficult and the SR problem is intensified, due to the very limited information in the LR image. In this section, we compare model complexity (number of parameters) and performance (PSNR) at the large scaling factor ×8 in Figure 5. The results in Figure 5 show that our HRAN and HRAN+ models outperform all the models, including EDSR and MSRN, at scaling factor ×8 with a low number of parameters.
We show experimental results at different scaling factors: ×3, ×4, and ×8. In Figure 6, we can see that most of the methods fail to reconstruct the fine details in 'img 062' and 'img 078' and exhibit blurry artifacts. Although SRMDNF recovers the horizontal and vertical lines, its output is more blurry, whereas our results have no blurry artifacts and show visual quality similar to EDSR.
For further illustration, we also analyze our results on ×8 super-resolution (SR) in Figure 7. As the scaling factor increases, very limited detail remains in the LR image. From the 'img 040' image, we observe that bicubic interpolation does not recover the original patterns. The methods that use interpolation as pre-scaling (SRCNN, MemNet, and VDSR) lose the original structure and generate wrong patterns. Our HRAN results are most similar to EDSR, but unlike EDSR, HRAN does not produce blurry artifacts. Similarly, in the 'TaiyouNiSmash' image, we observe that most of the methods cannot recover the tiny lines clearly and lose the structures, and the blurry artifacts are evident in most of the methods.