Arbitrary Scale Super-Resolution Neural Network Based on Residual Channel-Spatial Attention

In recent years, the performance of convolutional neural networks in single-image super-resolution has improved significantly. However, most state-of-the-art models address the super-resolution problem only for specific scale factors. In this paper, we propose a convolutional neural network for arbitrary scale super-resolution. Specifically, given a range of scale factors, the proposed model can generate super-resolution images with non-integer scale factors within that range. The proposed model incorporates a channel-spatial attention block to which the scale factor is also provided. This block recovers the most relevant information from the low-resolution image given the scale factor and enhances the upsampled image before generating the high-resolution target image, learning both channel and spatial dependencies. Additionally, we incorporate global residual learning so that the model recovers the details missing from a low-resolution image upsampled by interpolation. We evaluated the proposed method through extensive experiments on widely used benchmark datasets for single-image super-resolution. To assess the performance of the model, we used the peak signal-to-noise ratio and the structural similarity index measure. The proposed model achieves averages of 35.36, 31.78, and 29.62 for peak signal-to-noise ratio, and 0.9410, 0.8828, and 0.8334 for structural similarity index measure for the standard evaluation scale factors <inline-formula> <tex-math notation="LaTeX">$\times 2$ </tex-math></inline-formula>, <inline-formula> <tex-math notation="LaTeX">$\times 3$ </tex-math></inline-formula>, <inline-formula> <tex-math notation="LaTeX">$\times 4$ </tex-math></inline-formula>, respectively. The experimental results show a better performance of the proposed model over other state-of-the-art models for arbitrary scale super-resolution, and competitive performance with models trained for specific scale factors.


I. INTRODUCTION
Single-image super-resolution (SISR) is an important subject in image processing and computer vision. The aim of SISR is to reconstruct an accurate high-resolution image (I HR) given its corresponding low-resolution image (I LR). SISR has been widely used in several computer vision applications, including security and surveillance imaging [1], medical imaging [2], object recognition [3], satellite imaging [4], etc. However, obtaining a super-resolution (SR) image is an ill-posed inverse problem since various solutions exist for a given I LR input. The SISR problem has been addressed by traditional techniques, including interpolation-based methods [5], reconstruction-based methods [6], and learning-based models [7], [8], [9]. In recent years, convolutional neural networks (CNN) have notably improved performance in several computer vision tasks, including image super-resolution [10]. There are two main streams in CNN-based super-resolution algorithms: single scale SR and multiple scale SR.

(The associate editor coordinating the review of this manuscript and approving it for publication was Yongjie Li.)

A. SINGLE SCALE SR
Dong et al. [11] proposed the first CNN model for SISR, achieving significant improvement over conventional methods. The model was named super-resolution convolutional neural network (SRCNN). This method first upsamples the low-resolution image to the target size, and then the model improves the quality of the upsampled image to estimate the I HR image. Wang et al. [12] propose a densely connected network based on edge guidance (EGADNet) for SISR. This model is composed of two branches that obtain the content and the high-frequency information, and a final branch that combines the previous information and returns the super-resolution image.

(VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Haris et al. [13] proposed a back-projection CNN model that exploits iterative upsampling/downsampling layers, providing an error feedback mechanism for projection errors at each stage. In this way, the proposal better examines the mutual dependency of low- and high-resolution images. In order to improve the network accuracy, the authors in [13] introduce a dense connection to capture the relationship between I LR and I HR. This also contributes to implementing large scale factors. Nonetheless, the training step of this model needs a very large dataset, which substantially increases the computation time. On the other hand, the proposed network only uses a single-scale convolution kernel, which makes it difficult to extract feature information at different scales [14].
Zhang et al. [15] propose a residual channel attention network for SISR that considers the interdependence among channels. The model leads to the detection of the most important feature maps and also effectively captures the relationships between them. One drawback of the proposal in [15] is its high computational complexity [10]. Chen et al. [16] propose a pre-trained model for low-level computer vision tasks based on image transformers [16], [17]. Their proposal uses different input/output modules trained for specific tasks but keeps the same encoding/decoding transformers. This approach has shown good performance not only in SISR but also in other image processing tasks. However, it requires a very large training dataset and a large number of trainable parameters. Chen et al. [18] propose a gradient-based CNN model that takes advantage of the structural information present in I LR. This model uses a dual-branch structure, in which one branch estimates the high-resolution image, and the other uses gradient feature maps to recover the structure of the image. Since the gradient is sensitive to noise, this could affect the performance of the model. Zhang et al. [19] propose the residual dense network (RDN) model, which improves the feature extraction from I LR images through dense convolutions and local residual learning. Despite the good behavior of the model, this kind of scheme is not efficient for detecting hierarchical features [20]. The main limitation of the previous works is that the model must be trained for each scale factor, which is impractical in many situations.

B. MULTIPLE SCALE SR
Hu et al. [21] propose a meta-upscale module to handle non-integer scale factors. The proposal dynamically estimates the filters used in the upsampling for different scale factors using meta-learning. This module was initially applied as the upsampling module of the RDN and the enhanced deep super-resolution network (EDSR) [22] models. The meta-upscale module receives the scale factor as additional information to estimate the filter values. However, this additional information is not used by the rest of the model, i.e., the backbone parameters of the model are the same for all scale factors for which the model was trained. Fu et al. [23] extend the meta module by combining its original linear input vector with a quadratic and a bicubic encoded vector. This approach increases the complexity of the encoding vector as the degree increases, without necessarily improving the performance of the proposal.
Shen et al. [24] propose an approach based on applying the Laplacian pyramid recursively to generate arbitrary scale I HR images. The scale factor used in this method is in the range (1,2], so in order to generate the desired target size, the method should be applied several times. This module also presents a plug-in element for predefined CNN models to extend their capability to handle non-integer scale factors. Despite the good behavior of the method, including a low number of parameters, the pyramidal approach increases the computational complexity and the memory usage. Kim et al. [25] improve the SRCNN model by adding more convolutional layers and global residual learning, so that the model learns the differences between the upsampled I LR image and the target I HR. The training step of the model is carried out with multiple scales, which improves the performance. However, the architecture considers a bicubic interpolation for the input image, and all subsequent layers work in high resolution; therefore, the computation time and memory usage increase [22]. Chen et al. [26] propose the local implicit image function (LIIF), which combines the RDN model [19] with a multilayer perceptron, increasing the computational cost.
There are also several methods for more realistic applications in which the target scale factor is not an integer value. Wang et al. [27] propose a plug-in module for arbitrary scale SISR, which consists of multiple scale-aware feature adaption blocks and a scale-aware upsampling layer. This module is capable of generating images with different scale factors along the height and width of the image. Feng et al. [28] propose a dynamic kernel generation model that improves the image reconstruction performance. The model consists of two stages. The first one extracts the features from low-resolution images by means of a self-organizing feature mapping network, allowing the construction of a regularized pattern. The regularized pattern contains information at different locations. The second stage uses the regularized pattern and a meta-learning mechanism to predict the weights of the convolution kernels that correspond to the regularized pattern at each location. The deep learning community also addresses the problem of parameter estimation for different scale factors. In this direction, Liu et al. [29] propose a framework based on low-resolution reconstruction, blind kernel estimation, and recursive error compensation. Their approach uses a single set of parameters while handling varying scales, achieving higher efficiency and satisfactory accuracy.
Most existing methods are trained or designed for a predefined degradation process (e.g., bicubic downsampling) from an I HR to an I LR. As a result, these models can struggle with low-resolution images obtained by a degradation process different from the one considered in the training dataset [30]. In recent years, interest in this area of research, known as blind image super-resolution, has increased. Among the methods designed for blind-image SR are those based on a predefined set of different degradation models [31], those based on estimating the degradation model by iteratively reducing the artifacts in the reconstructed SR images [30], and those based on deep learning [32], [33]. Compared with the degradation model commonly used to synthesize low-resolution images, the actual degradation in real-world scenarios is more complex and variable because it can be affected by several factors, such as the imaging system and the imaging environment.
Another recent research direction in SISR is real-world single-image super-resolution [34]. In this case, the pairs of I LR and I HR are generated by collecting images of the same scene using different camera configurations for different image resolutions. Cai et al. [35] proposed a real SR dataset, which consists of pairs of images obtained by adjusting the lens of two digital single-lens reflex cameras, and a progressive image registration framework to achieve pixel-wise registration for the different focal lengths. Wei et al. [36] also proposed a real SR dataset that extends the dataset proposed in [35], but using five digital single-lens reflex cameras, and consists of both outdoor and indoor scenes. Zhang et al. [37] proposed another real dataset, which is composed of pairs of raw images taken at different levels of optical zoom. A Euclidean motion model is applied for image registration to describe the relationship between the images with different resolutions, and it is optimized using the enhanced correlation coefficient criterion. Additionally, the dataset contains both raw sensor data and RGB images for SR from raw data.
In summary, and according to our examination of the reported proposals, many of the models work with low-resolution images previously upsampled through bicubic interpolation to the target size [11], [25]. This strategy allows handling arbitrary scale factors, but it may amplify noise and introduce blurring in the input image. Besides, processing a larger input image increases the computational cost significantly. On the other hand, most CNN models work around this problem by feeding I LR into the model and upsampling it within the network using transposed convolution [38] or sub-pixel upsampling [39] near the end of the model. These upsampling methods limit the models to integer scale factors, reducing their flexibility in applications.
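To make the flexibility difference concrete, the following PyTorch sketch (illustrative only; the layer widths are arbitrary assumptions) contrasts sub-pixel upsampling, whose scale factor is fixed at construction time, with bicubic interpolation, which accepts a non-integer factor at run time:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 48, 48)  # a batch of low-resolution feature maps

# Sub-pixel upsampling: the factor r is baked into the layer at construction
# time (the conv must output C * r^2 channels), so only integer factors work.
r = 2
subpixel = nn.Sequential(nn.Conv2d(64, 3 * r * r, 3, padding=1), nn.PixelShuffle(r))
print(subpixel(x).shape)  # torch.Size([1, 3, 96, 96])

# Bicubic interpolation: the factor is a runtime argument, so non-integer
# factors such as 1.7 are possible with the same weights.
y = F.interpolate(x, scale_factor=1.7, mode="bicubic", align_corners=False)
print(y.shape)  # torch.Size([1, 64, 81, 81])
```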
With the aim of addressing the previous limitations, and considering the significant improvement of CNNs over conventional SR techniques, we propose a residual channel-spatial attention network, namely RCSAN, for arbitrary scale SISR. The proposal is able to generate the super-resolution image without the limitation of integer scale factors, while sufficiently attenuating the noise and blurring caused by scaling. Furthermore, the proposal does not assume the image degradation model that generates the I LR, so in principle, it can be used with any degradation model.
We develop an attention block that combines spatial and channel information along with the scale factor to preserve or shrink the information generated by the residual dense blocks. The model uses bicubic interpolation to perform the upsampling operation before the last section of the model, so most of the processing is done on the low-resolution image. Using bicubic interpolation allows dealing with arbitrary scale factors and performing global residual learning, such that the model enhances the previously upsampled I LR, which speeds up the learning process.
The main contributions of this paper are summarized as follows:
• We propose a neural network for single-image super-resolution with arbitrary scale in a given range (1,4].
• We extend the proposed model for specific scale factors (×2, ×3, ×4).
• We present an attention block that combines the feature maps and the provided scale factor. This block is designed to learn the channel and spatial dependencies of feature maps.
• Additionally, we incorporate, in the last part of the model, a global residual learning using a bicubic interpolation of the low-resolution image. This strategy allows handling arbitrary scales and speeds up the training of the model.

Extensive experiments demonstrate that the proposed method is competitive with state-of-the-art approaches for SISR, even with those trained for specific scale factors.

This paper is organized as follows: Section II explains the proposal and its methodology, Section III describes all the experiments we carried out and the results, Section IV discusses the results obtained, Section V presents some limitations of the proposal, and finally, Section VI gives the conclusions.

II. PROPOSED METHOD
In this section, we first introduce the architecture of the proposed RCSAN model and then describe the building blocks of the model.

A. NETWORK ARCHITECTURE
The proposed RCSAN model, see Fig. 1, consists of three modules and a global residual learning (GRL): the first module is a shallow feature extractor (block in green); the second module is the core of the model and represents a deep feature extractor based on a series of residual dense convolutional networks with an attention block (block in gray); and the last module is a high-resolution reconstructor block (block in yellow), which is finally combined with the GRL.
Module 1: In order to extract the shallow features from the original low-resolution image, I LR, we use 3 × 3 convolutions, based on [19], and the PReLU activation function [40]:

$F_{0} = H_{LRFE2}(H_{LRFE1}(I_{LR})),$

where H LRFE1 (·) and H LRFE2 (·) denote the first and second convolution operations, respectively, see Fig. 1.

Module 2: The deep feature extractor applies a sequence of residual dense groups (RDG):

$F_{i+1} = H_{RDG,i}(F_{i}), \quad i = 0, \ldots, I-1,$

where H RDG,i is the i-th RDG block; details of the RDG block are described in Sec. II-B. That is, F i+1 is obtained after applying H RDG,i to the previous feature map F i. The last element of the sequence, F I, is the output of this module.
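A minimal PyTorch sketch of Module 1, under the assumption of a 64-channel feature width (the width is not stated in this excerpt), could look as follows:

```python
import torch
import torch.nn as nn

class ShallowFeatureExtractor(nn.Module):
    """Module 1 sketch: two 3x3 convolutions with PReLU (widths are assumed)."""
    def __init__(self, in_channels=3, features=64):
        super().__init__()
        self.h_lrfe1 = nn.Sequential(nn.Conv2d(in_channels, features, 3, padding=1), nn.PReLU())
        self.h_lrfe2 = nn.Sequential(nn.Conv2d(features, features, 3, padding=1), nn.PReLU())

    def forward(self, i_lr):
        # F_0 = H_LRFE2(H_LRFE1(I_LR))
        return self.h_lrfe2(self.h_lrfe1(i_lr))

i_lr = torch.randn(1, 3, 48, 48)
f0 = ShallowFeatureExtractor()(i_lr)
print(f0.shape)  # torch.Size([1, 64, 48, 48])
```

Module 2 would then simply chain the RDG blocks on this output, e.g. `for rdg in rdgs: f = rdg(f)`.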

Module 3:
This module is formally written as follows:

$F_{BI} = BI_{\uparrow}(F_{I}), \qquad F_{HR} = H_{HRFE2}(H_{HRFE1}(RDG_{HR}(F_{BI}))),$

where BI ↑ (·) corresponds to a bicubic interpolation, RDG HR (·) is an RDG module that performs deep feature extraction on the upsampled image F BI (see Fig. 2), and H HRFE1 (·) and H HRFE2 (·) are 3 × 3 convolutions with PReLU activation function. This module is shown in Fig. 1 (block in yellow).
We first apply a bicubic interpolation to F I, the output of the deep feature extractor, in order to upsample the image to the desired scale factor. This upsampling method allows the model to deal with arbitrary scale factors, contrary to other methods, like transposed convolution or sub-pixel upsampling, where the scale factor is fixed. Further, the final convolution H HRFE2 (·) reduces the number of feature maps to match the number of channels of the input image, which allows us to introduce global residual learning.
Global Residual Learning: Finally, and based on the previous comment, we perform the global residual learning between the output of the high-resolution reconstructor block, F HR, and the bicubic interpolation of the LR image, BI ↑ (I LR ):

$\hat{I}_{HR} = F_{HR} + BI_{\uparrow}(I_{LR}).$

This type of learning allows the model to learn important details and features of the image that the interpolation process cannot recover.
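Module 3 and the GRL can be sketched in PyTorch as follows; `rdg_hr`, `h_hrfe1`, and `h_hrfe2` are placeholders for the modules named above, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reconstruct_hr(f_deep, i_lr, scale, rdg_hr, h_hrfe1, h_hrfe2):
    """Sketch of Module 3 plus the GRL; the three module arguments are
    stand-ins for the components named in the text."""
    # Bicubic upsampling of the deep features to the target (arbitrary) scale.
    f_bi = F.interpolate(f_deep, scale_factor=scale, mode="bicubic", align_corners=False)
    f_hr = h_hrfe2(h_hrfe1(rdg_hr(f_bi)))  # high-resolution reconstruction
    # GRL: add the bicubic interpolation of the low-resolution input image.
    return f_hr + F.interpolate(i_lr, scale_factor=scale, mode="bicubic", align_corners=False)

# Toy usage with stand-ins (identity modules and one conv mapping 64 -> 3 channels).
f_deep, i_lr = torch.randn(1, 64, 48, 48), torch.randn(1, 3, 48, 48)
out = reconstruct_hr(f_deep, i_lr, 2.5, nn.Identity(), nn.Identity(),
                     nn.Conv2d(64, 3, 3, padding=1))
print(out.shape)  # torch.Size([1, 3, 120, 120])
```

Note how the non-integer scale factor 2.5 is handled with the same weights, which is the point of upsampling by interpolation rather than by a fixed-scale layer.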

B. RESIDUAL DENSE GROUP
The residual dense group (RDG block) is shown in Fig. 2. This module consists of the following elements: one sequence of dense convolutions with an attention block (Fig. 4), one concatenation, a factorization block through asymmetric convolutions, and a residual learning. The last factorization block summarizes the feature maps generated by the dense blocks, and the residual learning is applied between the input of the RDG and the output of the factorization. The (i + 1)-th deep feature image is generated from the i-th deep feature image as follows: we first obtain the sequence of feature maps

$F_{i,j} = DB_{i,j}(F_{i,j-1}), \quad j = 1, \ldots, D, \qquad F_{i,0} = F_{i},$

where DB i,j (·) is the j-th dense block, and then

$F_{i+1} = F_{i} + C_{1\times 1}(C_{1\times 3}(C_{3\times 1}(Cat_{i}))), \qquad Cat_{i} = [F_{i,1}, \ldots, F_{i,D}],$

where Cat i is the concatenation of the features F i,j generated by the dense blocks DB i,j (·) (the brackets [· · ·] represent the concatenation operation between the feature maps inside them), C 3×1 (·) and C 1×3 (·) are the asymmetric convolutions with 3 × 1 and 1 × 3 kernel sizes, respectively, and C 1×1 (·) is a 1 × 1 convolution that summarizes the results of the asymmetric convolutions and reduces the number of feature maps in order to perform the residual learning.
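The RDG computation above can be sketched in PyTorch; the dense attention blocks are replaced here by simple convolutional stand-ins, and the channel widths and the number of dense blocks are illustrative assumptions:

```python
import torch
import torch.nn as nn

class RDG(nn.Module):
    """Residual dense group sketch: D dense blocks, concatenation, asymmetric
    3x1 / 1x3 convolutions, a 1x1 fusion, and a residual connection.
    The stand-in blocks below simplify the actual dense attention blocks."""
    def __init__(self, features=64, n_dense=4):
        super().__init__()
        # Stand-ins for the dense blocks DB_{i,j}; each maps F -> F channels.
        self.dense_blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(features, features, 3, padding=1), nn.PReLU())
            for _ in range(n_dense)
        )
        c = features * n_dense
        self.c_3x1 = nn.Conv2d(c, c, (3, 1), padding=(1, 0))  # asymmetric conv
        self.c_1x3 = nn.Conv2d(c, c, (1, 3), padding=(0, 1))  # asymmetric conv
        self.c_1x1 = nn.Conv2d(c, features, 1)                # fusion / reduction

    def forward(self, f_i):
        feats, x = [], f_i
        for db in self.dense_blocks:
            x = db(x)                 # F_{i,j} = DB_{i,j}(F_{i,j-1})
            feats.append(x)
        cat = torch.cat(feats, dim=1) # Cat_i = [F_{i,1}, ..., F_{i,D}]
        out = self.c_1x1(self.c_1x3(self.c_3x1(cat)))
        return f_i + out              # residual learning over the whole group

x = torch.randn(1, 64, 48, 48)
print(RDG()(x).shape)  # torch.Size([1, 64, 48, 48])
```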

C. DENSE CONVOLUTIONAL BLOCK WITH ATTENTION BLOCK
Now we show the details of the dense attention block (DAB), see Fig. 3. The DAB consists of a sequence of K convolutional layers with 3 × 3 kernel size and PReLU activation function, where every convolution processes all the feature maps generated previously within the block. After the dense convolutions are applied, a 1 × 1 convolution with PReLU activation function, with index K + 1, summarizes all the features generated by the dense convolutions. Subsequently, the attention block is applied in order to focus on the most informative feature maps given the input image, see Fig. 4. The attention block works as follows. First, an adaptive average pooling with a 4 × 4 output size is applied to reduce the image size. This pooling preserves the information of 16 regions of the image instead of squeezing the whole image into a single pixel, as global average pooling does. Afterward, the pooled image is reshaped and concatenated with the target scale factor, and the result is processed by two fully connected layers. The first fully connected layer uses a PReLU activation function, and the second one uses a Sigmoid activation function. Finally, element-wise multiplication, denoted as ⊗, is applied between the output of the attention block and the output of the dense convolution. The result is added to the input of the dense block. This local residual learning speeds up the learning process and mitigates the vanishing gradient problem.
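The attention block of Fig. 4 can be sketched as follows; the hidden layer width is an assumed value, and the block produces one weight per channel, as described above:

```python
import torch
import torch.nn as nn

class ScaleAwareAttention(nn.Module):
    """Sketch of the channel-spatial attention block: adaptive 4x4 average
    pooling, flattening, concatenation with the scale factor, and two fully
    connected layers (PReLU then Sigmoid). The hidden width (256) is assumed."""
    def __init__(self, channels=64, hidden=256):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(4)      # keep 16 spatial regions
        in_features = channels * 4 * 4 + 1       # +1 for the scale factor
        self.fc1 = nn.Sequential(nn.Linear(in_features, hidden), nn.PReLU())
        self.fc2 = nn.Sequential(nn.Linear(hidden, channels), nn.Sigmoid())

    def forward(self, feats, scale):
        b, c, _, _ = feats.shape
        v = self.pool(feats).reshape(b, -1)                 # flatten pooled maps
        s = torch.full((b, 1), float(scale), device=feats.device)
        w = self.fc2(self.fc1(torch.cat([v, s], dim=1)))    # per-channel weights
        return feats * w.view(b, c, 1, 1)                   # element-wise reweighting

x = torch.randn(2, 64, 48, 48)
print(ScaleAwareAttention()(x, 2.5).shape)  # torch.Size([2, 64, 48, 48])
```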
The j-th output within the i-th RDG generated by the dense convolutional block can be defined as follows:

$F_{i,j,k} = C_{i,j,k}([F_{i,j-1}, F_{i,j,1}, \ldots, F_{i,j,k-1}]), \quad k = 1, \ldots, K,$

$F_{i,j,K+1} = C_{i,j,K+1}([F_{i,j,1}, \ldots, F_{i,j,K}]),$

$DAB_{i,j} = FC_{i,j,2}(FC_{i,j,1}([AP(F_{i,j,K+1}), s])),$

$F_{i,j} = F_{i,j-1} + DAB_{i,j} \otimes F_{i,j,K+1},$

where C i,j,k corresponds to the 3 × 3 convolutions with PReLU activation function, C i,j,K+1 corresponds to the 1 × 1 convolution with PReLU activation function, AP(·) denotes the adaptive average pooling and reshaping, s is the scale factor, FC i,j,1 and FC i,j,2 are the fully connected layers, with PReLU and Sigmoid activation functions respectively, and ⊗ is the multiplication between the elements generated by DAB i,j and the feature maps of F i,j,K+1, i.e., every feature map in F i,j,K+1 is weighted by the corresponding entry in the vector DAB i,j.

III. EXPERIMENTS

Table 1 shows the effects of using global residual learning (GRL), the channel-spatial attention block, and the incorporation of the scale factor in the attention block. In this study, we use the Set14 dataset (Section III-B). We compare the performance of the model incorporating these components after the first 200 training epochs. To consider the scale factor information without using the attention block, the scale factor is concatenated as a feature map before the last 1 × 1 convolution of the dense convolutional block, with a negligible increase in the number of trainable parameters. This addition improves the peak signal-to-noise ratio (PSNR) by about 0.03 and 0.22 dB compared with the base model. However, the improvement when this information is combined with GRL is marginal. The attention block by itself improves over the use of the scale factor as additional information. When the attention block is combined with the GRL, the performance improves further; this may be mainly because the model only has to retrieve the details of the previously upscaled low-resolution image instead of estimating the high-resolution image directly.
To visualize the convergence process, Fig. 5 shows the training process for the studied combinations in Table 1. Note that the GRL increases the speed of the training process of the model without incorporating additional parameters.

A. TRAINING DETAILS
The flowchart in Fig. 6 shows the process of training, validation, and testing of the RCSAN model. We use the DIV2K [41] and Flickr2K [42] datasets for training. The DIV2K and Flickr2K training datasets consist of 800 and 2650 high-quality, high-resolution color images, respectively. For validation, we use the DIV2K validation partition, consisting of 100 images with similar characteristics to the training dataset.
For training the proposed model, we randomly generate input/output training patches {(x i , y i )} from the DIV2K training partition and the Flickr2K dataset. The I LR patch size is set to 48 × 48. The scale factor is selected uniformly at random in the range [1.1, 4.0] with step 0.1, such that the whole training batch has the same scale factor. Additionally, we apply data augmentation consisting of random vertical and horizontal flips and random 90° rotations. In addition, we trained a version of the proposed model for specific scale factors, denoted as RCSAN s, to assess the model in this scenario and compare it with state-of-the-art models trained for a single scale factor. All the versions of the model were trained from scratch.
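The scale sampling and augmentation described above can be sketched as:

```python
import random
import torch

def sample_batch_scale():
    """Sample a scale factor uniformly from {1.1, 1.2, ..., 4.0}; the whole
    training batch shares this factor, as described in the training details."""
    return round(random.choice([1.0 + 0.1 * k for k in range(1, 31)]), 1)

def augment(patch):
    """Random horizontal/vertical flips and 90-degree rotations on a CxHxW tensor."""
    if random.random() < 0.5:
        patch = torch.flip(patch, dims=[-1])   # horizontal flip
    if random.random() < 0.5:
        patch = torch.flip(patch, dims=[-2])   # vertical flip
    return torch.rot90(patch, k=random.randint(0, 3), dims=[-2, -1])

s = sample_batch_scale()        # e.g. 2.3
p = augment(torch.randn(3, 48, 48))
```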
In order to optimize the parameters of the RCSAN model, we use the Charbonnier loss function [43] and the AdamW algorithm [44] with parameters β 1 = 0.9, β 2 = 0.999, ε = 10 −8, and weight decay λ = 10 −5. The initial learning rate is α 0 = 10 −4, which is halved every 200 epochs. This optimization algorithm allows us to decouple the loss function from the weight decay term. The RCSAN model was trained with a batch size of 16 for 1000 epochs and 1000 iterations of backpropagation per epoch. Our model was implemented in Python 3.6 using the PyTorch framework. The training time was 14 days on two Nvidia RTX Titan GPUs. The source code and models are available at https://github.com/JavierGurrola/RCSAN.
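The loss and optimizer setup can be sketched as follows; the Charbonnier ε is an assumed value, since it is not stated in this excerpt:

```python
import torch
import torch.nn as nn

def charbonnier_loss(pred, target, eps=1e-3):
    """Charbonnier loss, a differentiable variant of L1: mean of
    sqrt((x - y)^2 + eps^2). eps = 1e-3 is a common choice (assumed here)."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

model = nn.Conv2d(3, 3, 3, padding=1)  # stand-in for the RCSAN model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-5)
# Halve the learning rate every 200 epochs, as in the training details.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=200, gamma=0.5)
```

AdamW applies the weight decay directly to the parameters rather than through the gradient, which is what "decoupling the loss function from the weight decay term" refers to.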

B. TESTING DETAILS
The proposed models were tested on five widely used benchmark datasets for SISR: Set5 [45], Set14 [46], BSD100 [47], Urban100 [48], and Manga109 [49]. To assess the results of the model, we first convert the estimated I HR image into the YCbCr color space and then calculate the corresponding PSNR [50] and structural similarity index measure (SSIM) [51] metrics on the Y channel.
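The Y-channel PSNR evaluation can be sketched as follows, using the BT.601 studio-range luma conversion commonly adopted in SR papers (the authors' exact conversion may differ):

```python
import numpy as np

def rgb_to_y(img):
    """Approximate BT.601 studio-range luma (Y of YCbCr) from RGB in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.257 * r + 0.504 * g + 0.098 * b + 16.0

def psnr_y(sr, hr):
    """PSNR between the Y channels of two uint8-range RGB images."""
    mse = np.mean((rgb_to_y(sr.astype(np.float64)) - rgb_to_y(hr.astype(np.float64))) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(255.0 ** 2 / mse)
```

SSIM would be computed on the same Y channels, e.g. with `skimage.metrics.structural_similarity`.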

TABLE 3.
Quantitative results of different SISR methods on Set5, Set14, BSD100, Urban100 and Manga109 test datasets for ×2 scale factor. The best two results of PSNR (dB) and SSIM are highlighted in red and blue respectively for the models trained for a specific scale factor, and bold and italic for arbitrary scale models respectively.

IV. RESULTS
In Table 2, we compare the performance of the proposed RCSAN model with the baseline methods, bicubic interpolation and Meta-RDN, for the arbitrary scale SR task. In this experiment we consider the BSD100 dataset. We use scale factors from 1.1 to 4.0 with a step length of 0.1.

TABLE 4.
Quantitative results of different SISR methods on Set5, Set14, BSD100, Urban100 and Manga109 test datasets for ×3 scale factor. The best two results of PSNR (dB) and SSIM are highlighted in red and blue respectively for the models trained for a specific scale factor, and bold and italic for arbitrary scale models respectively.
The results show that the proposed RCSAN outperforms the Meta-RDN model by 0.135 dB on average and the bicubic interpolation by 2.832 dB on average. Fig. 7 illustrates an example of arbitrary scale SR for the image ''86000'', taken from the BSD100 dataset. Tables 3-5 show the average PSNR and SSIM metrics of the compared methods on the test datasets for ×2, ×3, and ×4 scale factors. The proposed arbitrary scale RCSAN model is competitive with state-of-the-art models designed for arbitrary scale SR and outperforms them in most cases. Notice that the proposed model excels significantly in the case of the Urban100 dataset, with a difference of 0.14 dB and 0.0023 in the PSNR and SSIM values, respectively, between the RCSAN and the model with the closest value (RSAN) for the ×4 scale factor. The datasets in which the proposed model struggles the most are Set5 and Set14. Although the proposed model also stands out for the BSD100 and Manga109 datasets, the gap between the performance of the proposal and the model with the closest performance is lower. Note that the self-ensemble version outperforms all the compared models. In particular, the performance has a greater increase in the case of Manga109, with 0.25 dB and 0.003 in the PSNR and SSIM values, respectively. Fig. 8 shows reconstructed regions from 5 images using seven methods: Bicubic, HAN, IPT, ArbRCAN, Meta-SR, RSAN, and our proposal. The PSNR and SSIM values are also reported. In all cases, the scale factor applied to the images is ×4. It is worth mentioning that for all the compared methods, we used the source code and pre-trained models provided by the authors. In this case, we used the RCSAN model trained for arbitrary scale.

TABLE 5.
Quantitative results of different SISR methods on Set5, Set14, BSD100, Urban100 and Manga109 test datasets for ×4 scale factor. The best two results of PSNR (dB) and SSIM are highlighted in red and blue respectively for the models trained for a specific scale factor, and bold and italic for arbitrary scale models respectively.

The first row of Fig. 8 depicts the comparison results using the image ''img067'' from the Urban100 dataset. Some methods, such as IPT and ArbRCAN, present a loss of structure and produce blurred regions. On the contrary, the proposed method recovers more high-frequency detail and less blurry regions compared to the other methods. The second row shows the comparison using the ''butterfly'' image from the Set5 dataset. The proposed RCSAN method recovers the image with more contour and structure details. In this case, the RSAN method has the best PSNR value, but the proposed RCSAN model has the best SSIM value; nevertheless, the visual results of both methods are very similar. The third row shows the comparison for the ''Baboon'' image from the Set14 dataset. All methods present difficulties in reconstructing this image, having problems recovering details of the hairs around the nose. We can see that the reconstructed region has a high degree of blurring for all methods. For this case, the RCSAN gets the second-best PSNR and SSIM values. The fourth row shows the comparison for the ''Comic'' image from the Set14 dataset. In this case, none of the methods recovers the gray details and pattern present in the upper right of the zoomed region. However, the RCSAN method retrieves the image with less blur and more detail on the edges. The last row shows the comparison for the ''MariaSamaNihaNaisyo'' image from the Manga109 dataset. For this image, the proposed model is able to recover more texture and details on the borders. In the comparison of performance versus number of parameters, we only consider the methods that have reported their number of parameters. It can be observed that the RCSAN outperforms most of the methods with a better trade-off between the performance and the number of parameters. The GFSR is considered a lightweight model; therefore, its performance is expected to be lower. Note that the IPT model requires a large number of parameters to achieve good performance.
On the other hand, among the methods with arbitrary scaling (RCSAN, Meta-RDN, and RSAN), the proposed RCSAN has fewer parameters and better performance. Table 6 shows the execution time (in seconds) of different SISR deep learning models for a 64 × 64 input with a ×4 scale factor, the number of multiplication-accumulation operations (MACs) required to generate the 256 × 256 images, and the number of parameters of the models. The execution time of the proposed RCSAN is comparable with that of ArbRCAN and slightly longer than those of Meta-RDN and RSAN in the case of GPU. In the case of CPU execution time, the proposed model degrades more than the others. However, the RCSAN is still faster than the IPT and NLSA models. In contrast to the fast GPU execution time, the number of operations for the RCSAN is high in proportion to the low number of trainable parameters. This is mainly because the last section of the model processes the image in high resolution; thus, the number of required operations increases at this stage.

FIGURE 10.
Preliminary results obtained by the RCSAN model, using a realistic dataset for super-resolution. From left to right: low resolution, high resolution and reconstructed high resolution images. The numbers below the low resolution images correspond to the obtained PSNR and SSIM measures. The images are taken from the ImagePairs testing dataset available at https://www.microsoft.com/applied-sciences/projects/imagepairs.

V. LIMITATIONS OF THE STUDY
Considering that the purpose of the article was to propose an arbitrary scale super-resolution model, the design of the experiments considered only one type of degradation. Therefore, the main limitation of the study is the evaluation of the proposed model in the context of blind or real super-resolution. It is worth mentioning that the proposed model makes no explicit assumption about the degradation process; extending it to these tasks would require training the model with databases covering different degradation models and probably adjusting the model accordingly. Another limitation is that the number of multiplication-accumulation operations is high, even though the GPU time is low. This suggests improving the last module (module 3) of the model, since this part works on the high-resolution image. These limitations will be addressed in future work. Fig. 10 shows preliminary results obtained by the proposed RCSAN model in the context of real-world SR. In particular, we show the results for two images from the realistic ImagePairs dataset for super-resolution, which is available at https://www.microsoft.com/applied-sciences/projects/imagepairs. The figure also shows the obtained PSNR and SSIM measures. Even though the RCSAN model was not trained with realistic datasets, the experimental results are quite encouraging.

VI. CONCLUSION
In this work, we presented a residual channel-spatial attention network for arbitrary scale single-image super-resolution. We also extended the model for specific scale factors. The proposed models incorporate an attention block that takes advantage of the spatial information, the channel information, and the scale factor applied to the image to improve the results. Additionally, we incorporated global residual learning using a bicubic interpolation of the low-resolution image. The use of global residual learning allowed the model to enhance previously upsampled low-resolution images instead of directly estimating the high-resolution image. At the same time, it allowed the model to handle arbitrary scales and sped up the model's training. Experiments on several datasets showed that the proposed model is competitive with state-of-the-art models for arbitrary scale super-resolution, even with those trained for specific scale factors. The extended versions of the proposed model, trained for specific scale factors, also yielded competitive results.
The main limitation of this study is its application to blind or real super-resolution. This limitation can be addressed by training the model on databases with different degradation models and possibly adapting the model for these tasks. For this purpose, new components could also be proposed. The model can also be extended to solve denoising and super-resolution problems simultaneously. All this remains as future work.