Multiscale Recursive Feedback Network for Image Super-Resolution

Deep learning-based networks have achieved great success in the field of image super-resolution. However, many networks neither fully combine high-level and low-level information nor fuse local and global information. A multiscale recursive feedback network (MSRFN) for image super-resolution is proposed. First, multiscale convolution is integrated into the feedback network to form multiscale projection units that adaptively capture image features at different scales by driving a multipath information flow. Next, recursive learning is applied to multiscale projection groups composed of up- and down-multiscale projection units to construct a feedback module that exploits high-level information to correct the low-level representation and refine the features in the early layers. Then, global residual learning and local residual feedback are combined to provide more contextual information for the final reconstruction. Experimental results demonstrate that MSRFN can predict more high-frequency details and alleviate the ringing effect and checkerboard artifacts inherent in CNN-based models. Even when the training dataset is relatively small, MSRFN remains superior to most state-of-the-art methods, especially for large scaling factors ( $\times 8$ ).


I. INTRODUCTION
Super-resolution (SR), an important image processing technology in the field of computer vision, is widely applied in medical imaging [1], security and surveillance [2], satellite remote sensing [3], image compression [4], and small object detection [5], [6]. It aims to establish a suitable model for converting a low-resolution (LR) image into a high-resolution (HR) image [7]. Because a given LR image may correspond to a series of possible HR images rather than a single unique image, SR is a challenging ill-posed inverse problem. Numerous SR methods have been proposed to address this problem, and they are primarily divided into three types: interpolation-based, reconstruction-based, and learning-based methods [8], [9]. SR models based on deep learning have gained wide attention in recent years owing to their superior reconstruction performance.
The associate editor coordinating the review of this manuscript and approving it for publication was Tony Thomas.
VOLUME 10, 2022
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
SRCNN [10], [11] is the first network that applies convolutional neural networks (CNNs) to SR; it directly learns the nonlinear mapping from interpolated LR images to HR images in an end-to-end manner. Although it is a simple, shallow, linear network, its performance is superior to that of most traditional methods, which demonstrates the strength of CNNs in solving the SR problem. Subsequently, a series of SR algorithms based on SRCNN were proposed. Depth is a key factor in deep neural networks because it provides larger receptive fields and more contextual information. However, deepening the network causes two problems: vanishing/exploding gradients and a large number of parameters. To alleviate the gradient problem effectively, researchers have introduced residual learning [12] and succeeded in training deeper networks, including VDSR [13] and EDSR [14]. In addition, dense connections [15] are often employed, which not only alleviate the vanishing gradient problem but also encourage feature reuse, as in SR-DenseNet [16], RDN [17], and DBPN [18]. To reduce the number of network parameters, some networks, such as DRCN [19], DRRN [20], and DRFN [21], employ recursive learning to facilitate weight sharing. Owing to these mechanisms, a growing number of algorithms design more complex and deeper networks to obtain a higher peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [22].
Many existing networks suffer from the following problems. First, many SR networks overlook how difficult it is to train deep models to excellent performance, which results in huge training sets, more training tricks, and longer training time. For example, DBPN [18] employs a very large training set, including DIV2K (1,000 2K-resolution images, including 800 training images and 200 evaluation images) [23], Flickr2K [14] (2,650 2K-resolution images), and the ImageNet [24] dataset (over 14 million images). Second, most SR networks learn hierarchical representations of LR images in a feedforward manner, which relies on the limited features of the LR input. In addition, feedforward networks that rely on pre-processing can only accommodate a single upsampling factor, and they require substantial adjustment and retraining each time they are migrated to another upsampling factor, which is extremely inflexible. Owing to the lack of feedback, feedforward networks such as DRRN [20] have difficulty with large scaling factors. Although MSRN [25] and LapSRN [26], which have feedforward architectures, can perform ×8 enlargement, there is still room for improvement in their ×8 reconstruction performance. Third, a few SR studies have introduced feedback mechanisms, but they extract image features at a single scale and do not make full use of image features. Because the features are inadequately utilized, they gradually disappear during transmission, especially for large-factor SR (such as ×8 SR). Networks such as DBPN [18] and SRFBN [27], [28] do not overcome the drawbacks of single-scale feedback networks and cannot learn feature mappings at multiple context scales.
To solve the above problems, we designed a novel multiscale recursive feedback network (MSRFN). The structure is illustrated in Fig. 1. MSRFN uses far less training data than DBPN, with only 800 images from DIV2K, but it outperforms DBPN even for large scaling factors. Moreover, owing to the introduction of multiscale feedback, the MSRFN can not only learn rich hierarchical feature representations at multiple context scales, but also refine low-level information with high-level information and better represent the mutual relationships between LR-HR image pairs. In addition, the MSRFN can be extended to any upscaling factor with only minor adjustments to the network, and its modular end-to-end structure provides the flexibility to define and train networks of different depths. Notably, MSRFN effectively alleviates the ringing and jaggy effects at edge structures and produces more competitive SR results, particularly for ×8 enlargement.
The main contributions of our study are as follows. First, a multiscale projection unit (MSPU) is proposed by incorporating multiscale convolution kernels into the feedback connection. Different kernel sizes are introduced in each branch to drive the multipath information flow for up- or down-sampling operations. The MSPU can adaptively capture image features at different scales, which are regarded as local multiscale features. In addition, the multiscale receptive fields and the information sharing between different branches contribute to the full use of local features. Furthermore, a 1 × 1 convolution layer is applied to achieve dimensional reduction and cross-channel multiscale feature fusion; it also improves the generalization ability of the network by adding a nonlinear activation to the learned representation of the previous layer. This kind of local multipath learning enhances information exchange between branches, further increases the receptive field of the network, and better guides the reconstruction.
Second, in the MSRFN, a pair of up- and down-MSPUs constitutes a multiscale projection group (MSPG) that realizes the local feedback process. The MSPG not only generates HR features from the LR input, but also projects them back to the LR space. Only one MSPG is used for recursive learning to form a feedback scheme. This top-down flow allows earlier layers to access useful information from later layers to refine the low-level representation and enrich the high-level features. Meanwhile, such a recurrent structure with feedback flow can not only constantly correct the mutual relationship between LR and HR features, but also effectively reduce the number of network parameters and support a deeper structure. As a result, MSRFN has a powerful early reconstruction ability. In addition, the reconstruction module concatenates the HR multiscale feature maps generated by the MSPG, which transfers richer information for the reconstruction of the HR images.
Third, in addition to combining high-level and low-level information, we also combine local and global information by fusing local multiscale residual features and global residual features, which maximizes the utilization of image features and overcomes the defect that features disappear during transmission. On the one hand, MSRFN applies the iterative up- and down-sampling framework to provide local residual feedback for the multiscale projection residuals of the MSPG, acquiring finer initial features in the early layers. On the other hand, the global residual skip connection adds the residual image to the global identity mapping from the LR input and helps the network recover the residual between the LR and HR images, greatly reducing the learning difficulty and promoting faster convergence of the network. The combination of local residual feedback and global residual learning promotes feature reuse and provides more contextual information for creating SR images.

II. RELATED WORK
A. IMAGE SUPER-RESOLUTION
SR based on deep learning is a trainable data-driven model that can directly learn the non-linear mapping between LR and HR images in an end-to-end manner [11]. The upsampling operation is the key step because it determines how to generate the HR output from the LR input. According to the location of the upsampling operation in the model, SR frameworks are divided into four types [29]: pre-upsampling, post-upsampling, progressive upsampling, and iterative up- and down-sampling frameworks.
SRCNN is a pioneering network that adopts the pre-upsampling framework [10], [11]. It is characterized by the completion of the upsampling operation in the pre-processing step. The LR image is enlarged to the target size by an interpolation algorithm, and the interpolated image is then fed into the network to establish the mapping relationship with the HR image. As a result, pre-upsampling SR suffers from poor scalability and has difficulty accommodating arbitrary scaling factors with only minor adjustments to the network. Although the framework has a lower learning cost owing to its simple structure, it is subject to side effects, including additional noise from coarse images, noise amplification, blurring, and sharply increasing computational complexity.
To avoid learning most mappings in high-dimensional space, researchers proposed a post-upsampling framework that integrates the upsampling layer at the end of the network and directly learns hierarchical feature representations from the LR input. FSRCNN [30] and ESPCN [31] are representative algorithms that improve the computational efficiency and the quality of SR images compared with SRCNN. However, because the learnable features in the LR images are limited and the upsampling operation is performed only once, it is difficult to characterize the complex mapping from LR to HR images, which greatly increases the learning difficulty for large scaling factors of ×4 and ×8.
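The efficiency argument behind the move from pre-upsampling to post-upsampling can be illustrated with a rough multiply-accumulate count. The sketch below is only an illustration: the function names, the 3 × 3 kernel, the 64 channels, and the 32 × 32 input are assumptions chosen for the example, not settings taken from any of the cited networks.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates of one stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * c_in * c_out


def cost_ratio(lr_height, lr_width, scale, kernel=3, channels=64):
    """Cost of a layer applied to the interpolated (HR-sized) input divided by
    the cost of the same layer applied to the LR input."""
    pre = conv_macs(lr_height * scale, lr_width * scale, kernel, channels, channels)
    post = conv_macs(lr_height, lr_width, kernel, channels, channels)
    return pre / post


# Hypothetical 32 x 32 LR patch at three scaling factors.
ratios = {scale: cost_ratio(32, 32, scale) for scale in (2, 4, 8)}
```

Under these assumptions, each layer that operates on the interpolated image costs the square of the scaling factor times as much as the same layer operating in LR space.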
To overcome this drawback, LapSRN [26] employs a progressive upsampling framework that uses multiple upsampling modules to progressively reconstruct higher-resolution images. By adding a multi-stage design to the feed-forward network and upsampling the image to a higher resolution at each stage, the complex large-scale factor reconstruction can be decomposed into multiple simple small-scale reconstructions. The scheme of gradually reconstructing multiple SR images of different scales reduces the difficulty of learning and improves the SR performance on large scaling factors. However, its essence is the stacking of a single upsampling network, which is still limited by LR features and subjected to feature underutilization.
To address the above problems, Haris et al. proposed the DBPN algorithm and constructed an iterative up- and down-sampling framework, which better explores the mutual dependency of LR-HR images by introducing iterative back-projection [18]. The framework alternates up- and down-sampling operations to generate deeper HR features and combines HR features of different depths to produce the results. The authors also introduced dense connections to improve the network accuracy. This scheme can capture the deep mapping relationship between LR and HR, which improves the reconstruction performance and successfully handles large scaling factors. However, training this network requires an extremely large dataset, more training time, and more training tricks. In addition, the network only uses single-scale convolution kernels, which makes it difficult to extract feature information at different scales.

B. NETWORKS
Based on the above four SR frameworks, researchers have applied different network design strategies to construct various SR networks with distinctive characteristics.
DRCN [19] and DRRN [20] are typical models that apply recursive learning to the pre-upsampling framework; they stack multiple identical layers or units in a recursive manner to increase the network depth. Sharing weights between recursive modules allows the network to greatly reduce the number of parameters and gain a larger receptive field to learn more features. However, recursive learning easily leads to the inherent degradation of deep networks, so it often needs to be combined with residual learning.
Residual learning only learns residual mappings to recover high-frequency information, which avoids direct conversion from LR to HR images. Therefore, it alleviates the degradation problem of deep networks and improves the convergence speed. Unlike DRCN, DRRN replaces a recursive layer consisting of a single convolutional layer with a recursive block consisting of several residual units. ResNet [12] and VDSR [13] applied local residual learning and global residual learning, respectively, to a pre-upsampling framework. Inspired by this, DRRN introduces skip connections in both the local residual units and the global network, which reduces the difficulty of training deep models and alleviates the vanishing or exploding gradient problem. Compared with a simple linear network, the multipath structure designed by DRRN further facilitates learning: the residual path learns high-frequency features, and the identity path transmits rich early image information to the later layers and promotes gradient back propagation. Based on the residual module proposed by He et al. [12], Huang et al. introduced dense connections [15]. Unfortunately, dense connections sharply increase the computational complexity, and both the residual and dense modules use convolution kernels of a single size.
Multipath learning aims to transfer diverse features through multiple branches of the model and fuse them to provide better performance. Under the progressive upsampling framework, LapSRN [26] introduced global multipath learning, which predicts the sub-band residuals with a feature extraction path and reconstructs HR images at different scales through a multipath signal flow. Under the post-upsampling framework, MSRN [25] introduced local multipath learning, which achieves adaptive detection of image features at different scales using the proposed multiscale feature extraction module.
Multiple branches can extract image features of different aspects and continuously exchange information with each other during propagation, further enhancing the ability to learn and extract features.
However, all of the above SR networks learn a one-way mapping from LR to HR in a feed-forward manner. This feed-forward structure prevents early layers from effectively utilizing useful information from later layers. Therefore, a few SR algorithms introduce a feedback mechanism that allows the model to feed the output back as input to correct the previous state. DBPN [18] proposed iterative error feedback based on iterative up- and down-sampling layers to enable the network to implement a self-correcting procedure. SRFBN [27] uses hidden states in an RNN with constraints to construct a feedback module that drives the feedback stream and generates powerful high-level representations. However, all of these feedback networks use a single-scale kernel to learn the mapping functions.
To the best of our knowledge, there is no model that integrates local multiscale feature learning into a feedback network for SR.

III. MULTISCALE RECURSIVE FEEDBACK NETWORK
We first describe the MSPU, which is divided into up and down MSPUs (Fig. 2(a) and (b)), in Section III-A. The feedback module composed of the recursive multiscale projection group (MSPG) is described in Section III-B. Finally, Section III-C analyzes the MSRFN in terms of its three main modules (Fig. 1).

A. MULTISCALE PROJECTION UNIT
Inspired by the idea of GoogLeNet [28], we introduce multiscale convolution kernels in the projection unit: we construct two branches and apply convolution kernels of different scales to the different branches to capture image features at different scales. Such local multipath learning not only enables information sharing between the branches, but also helps make full use of the local features. Following the iterative up- and down-sampling framework, up and down MSPUs are designed for upsampling and downsampling operations, respectively.

1) UP MULTISCALE PROJECTION UNIT
As shown in Fig. 2(a), the up MSPU consists of six steps that map the LR feature, $L^{g-1}$, to the HR feature, $H^{g}$. The details are as follows.
Step 1: Taking the previously calculated LR feature map, $L^{g-1}$, as input, two deconvolution layers with kernels of different sizes, $D^{\uparrow}_{u1}$ and $D^{\uparrow}_{u2}$, perform upsampling operations on two branches, so that $L^{g-1}$ is mapped into the HR feature maps, $H^{g}_{u1}$ and $H^{g}_{u2}$.
$D^{\uparrow}_{u1}$ and $D^{\uparrow}_{u2}$ represent Deconv1($k_1$, $n$) and Deconv2($k_2$, $n$), respectively; $k_1$ and $k_2$ represent the kernel sizes, and $n$ represents the number of kernels.
Step 2: Concatenating the HR feature maps, $H^{g}_{u1}$ and $H^{g}_{u2}$, and using convolution layers with kernels of different sizes, $C^{\downarrow}_{u1}$ and $C^{\downarrow}_{u2}$, to perform downsampling operations on two branches, the concatenated HR feature map is mapped into the LR feature maps, $L^{g}_{u1}$ and $L^{g}_{u2}$.
Step 3: Concatenating the LR feature maps, $L^{g}_{u1}$ and $L^{g}_{u2}$, and using a 1 × 1 convolution to perform feature pooling and dimension reduction, the two LR maps are merged into the LR feature map, $L^{g}_{u}$, to achieve cross-channel feature fusion.
$C_{u}$ represents Conv(1, $n$), and the number of channels in each branch becomes $n$ from $2n$. In addition, the 1 × 1 convolution adds a non-linear activation to the learned representation of the previous layer to improve the expressive ability of the network.
Step 4: The residual, $e^{g}_{u}$, is obtained by calculating the difference between the observed LR map, $L^{g-1}$, and the reconstructed LR map, $L^{g}_{u}$.
Step 5: Two deconvolution layers with kernels of different sizes, $D^{\uparrow}_{e1}$ and $D^{\uparrow}_{e2}$, perform upsampling operations on two branches, so that the residual, $e^{g}_{u}$, is mapped into the residual HR feature maps, $H^{g}_{e1}$ and $H^{g}_{e2}$.
$D^{\uparrow}_{e1}$ and $D^{\uparrow}_{e2}$ represent Deconv1($k_1$, $n$) and Deconv2($k_2$, $n$), respectively, and the number of channels in each branch is $n$.
Step 6: Concatenating the residual HR feature maps, $H^{g}_{e1}$ and $H^{g}_{e2}$, and summing them with the HR feature maps concatenated in Step 2, the HR feature map, $H^{g}$, obtained by a 1 × 1 convolution is the final output of the up MSPU.
$C_{h}$ represents Conv(1, $n$). The number of channels is $2n$ after summing, and Conv(1, $n$) then reduces the number of output channels to $n$, which is consistent with the input. Both the input and output of the MSPU have the same number of channels. This structure allows multiple MSPUs to be connected to one another.
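The data flow of the six steps can be traced with a deliberately simplified, single-channel, one-dimensional stand-in. This is a toy sketch under strong assumptions, not the network itself: the learned deconvolution and convolution branches are replaced by fixed repeat and block-average operators with scalar gains, concatenation followed by a 1 × 1 convolution is replaced by an element-wise mean, and the sign of the residual (observed minus reconstructed) is an assumed convention. All function and parameter names are hypothetical.

```python
def up(x, scale, gain):
    """Stand-in for a deconvolution branch: repeat each sample, then scale."""
    return [gain * v for v in x for _ in range(scale)]


def down(x, scale, gain):
    """Stand-in for a strided-convolution branch: block average, then scale."""
    return [gain * sum(x[i:i + scale]) / scale for i in range(0, len(x), scale)]


def fuse(a, b):
    """Stand-in for concatenation followed by a 1x1 convolution."""
    return [(p + q) / 2 for p, q in zip(a, b)]


def up_mspu(l_prev, scale, up_gains, down_gains, res_gains):
    h1, h2 = (up(l_prev, scale, g) for g in up_gains)        # Step 1: two up branches
    h_cat = fuse(h1, h2)                                     # merged HR features
    l1, l2 = (down(h_cat, scale, g) for g in down_gains)     # Step 2: project back down
    l_rec = fuse(l1, l2)                                     # Step 3: fused LR reconstruction
    err = [a - b for a, b in zip(l_prev, l_rec)]             # Step 4: LR residual
    e1, e2 = (up(err, scale, g) for g in res_gains)          # Step 5: up-project the residual
    h_res = fuse(e1, e2)
    return [a + b for a, b in zip(h_cat, h_res)]             # Step 6: corrected HR output
```

When the down branches exactly invert the up branches, the residual is zero and the output equals the initial HR estimate; when they do not (for example, down gains of 0.5), the up-projected residual corrects the HR output.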

2) DOWN MULTISCALE PROJECTION UNIT
As shown in Fig. 2(b), the down MSPU maps the input HR feature, $H^{g}$, to the LR feature, $L^{g}$. The details are as follows.
Step 1: Taking the HR feature map, $H^{g}$, from the previous up MSPU as input, two convolution layers with kernels of different sizes, $C^{\downarrow}_{d1}$ and $C^{\downarrow}_{d2}$, perform downsampling operations on two branches, so that $H^{g}$ is mapped into the LR feature maps, $L^{g}_{d1}$ and $L^{g}_{d2}$.
$C^{\downarrow}_{d1}$ and $C^{\downarrow}_{d2}$ represent Conv1($k_1$, $n$) and Conv2($k_2$, $n$), respectively, and $k_1$ and $k_2$ represent the kernel sizes.
Step 2: Concatenating the LR feature maps, $L^{g}_{d1}$ and $L^{g}_{d2}$, two deconvolution layers with kernels of different sizes perform upsampling operations on two branches, so that the concatenated LR feature map is mapped back into HR feature maps.
The two deconvolution layers represent Deconv1($k_1$, $2n$) and Deconv2($k_2$, $2n$), respectively. The number of channels in each branch is $2n$.
Step 3: Concatenating these HR feature maps, a 1 × 1 convolution merges them into the reconstructed HR feature map.
$C_{d}$ represents Conv(1, $n$), and the number of channels in each branch is changed from $2n$ to $n$.
Step 4: The residual, $e^{g}_{d}$, is obtained by calculating the difference between the observed HR map, $H^{g}$, and the reconstructed HR map.
Step 5: Two convolution layers with kernels of different sizes perform downsampling operations on two branches, so that the residual, $e^{g}_{d}$, is mapped into the residual LR feature maps, $L^{g}_{e1}$ and $L^{g}_{e2}$.
The two convolution layers represent Conv1($k_1$, $n$) and Conv2($k_2$, $n$), respectively, and the number of channels in each branch is $n$.
Step 6: Concatenating the residual LR feature maps, $L^{g}_{e1}$ and $L^{g}_{e2}$, and summing them with the LR feature maps concatenated in Step 2, the LR feature map, $L^{g}$, obtained by a 1 × 1 convolution is the output of the down MSPU.
$C_{l}$ represents Conv(1, $n$). The number of channels is $2n$ after summing, and Conv(1, $n$) then reduces the number of output channels to $n$, which is the same as the input.

B. RECURSIVE MULTISCALE PROJECTION GROUP
A feedforward structure only maps the rich representation of the input space to the output space, and this one-way mapping is limited to the LR features from the input space. An up MSPU followed by a down MSPU constitutes a multiscale projection group (MSPG), which projects LR multiscale features to the HR space and then back to the LR space. The output of the previous projection group modulates the input of the next iteration to form feedback. As the feedback flow alternates between the up- and down-sampling processes, the projection residual is fed into the sampling layer, and local residual feedback is then employed to update the solution, forming an iterative self-correcting process. Multiple recurrent MSPGs can be regarded as an efficient iterative process for optimizing reconstruction errors, which captures the interdependence between LR and HR images more deeply and enhances the utilization of local features. Notably, our entire network uses only one MSPG, and recursive learning allows it to be shared among all recursive stages, which greatly increases the network depth without increasing the number of parameters. In addition, our network can directly obtain the HR feature output from the MSPG at each stage and then fuse the HR features of different depths from each iteration.
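The recursion described above can be sketched as a loop that applies one shared group repeatedly and keeps the HR output of every stage. The sketch assumes a generic callable `mspg` that returns an (HR feature, LR feature) pair; `run_feedback` and `toy_mspg` are hypothetical names, and the toy group uses fixed repeat and scaled-average operators purely to make the loop executable.

```python
def run_feedback(mspg, l0, num_groups):
    """Apply the same MSPG callable num_groups times (shared weights)."""
    hr_features = []
    l_feat = l0
    for _ in range(num_groups):
        h_feat, l_feat = mspg(l_feat)   # up MSPU followed by down MSPU
        hr_features.append(h_feat)      # kept for the reconstruction module
    return hr_features, l_feat


def toy_mspg(l_feat, scale=2):
    """Hypothetical stand-in for a learned MSPG on a 1-D, single-channel feature."""
    h_feat = [v for v in l_feat for _ in range(scale)]
    l_next = [0.5 * sum(h_feat[i:i + scale]) / scale
              for i in range(0, len(h_feat), scale)]
    return h_feat, l_next
```

Because the same callable is reused at every stage, increasing `num_groups` deepens the unrolled computation without adding new parameters, and the list of per-stage HR features is what a reconstruction step would later concatenate.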
To control the number of parameters and reduce the computational complexity, many network models use 3 × 3 convolutions to complete the feature mapping. This avoids the increase in computational cost and the decrease in convergence speed caused by large kernels, but at the expense of some reconstruction performance. In contrast, the recursive MSPG reuses the same MSPUs at every iteration, which not only shares weights and reduces the number of parameters, but also mitigates the slow convergence and suboptimal results that large kernels may otherwise cause. This allows our network to use large kernels in a multibranch structure. Hence, each branch of our MSPU uses a large kernel, such as 10 × 10, which can extract more image features and improve the reconstruction result.
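The trade-off between kernel size and parameter count, and the effect of weight sharing, can be checked with simple arithmetic. The sketch below assumes square kernels, a bias term per output channel, and a hypothetical width of 64 channels; none of these values is taken from the network configuration.

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Learnable parameters of one convolution (or deconvolution) layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def stacked_params(per_group, num_groups, shared):
    """Parameters of num_groups groups, with or without weight sharing."""
    return per_group if shared else per_group * num_groups


small = conv_params(3, 64, 64)     # 3 x 3 kernel, hypothetical 64 channels
large = conv_params(10, 64, 64)    # 10 x 10 kernel, same width
```

With these assumed numbers, a single 10 × 10 layer holds roughly eleven times the parameters of a 3 × 3 layer, but sharing one group across all recursions keeps the total independent of the number of recursions.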

C. NETWORK STRUCTURE
The MSRFN is divided into three components: the feature extraction, feedback, and reconstruction modules, as shown in Fig. 1. Notably, because global residual learning is applied, the entire network takes the original LR image as input and only needs to learn the residual image between the HR image and the interpolated LR image. Here, let conv($f$, $n$) denote a convolution layer, where $f$ is the kernel size and $n$ is the number of channels. The three modules are introduced as follows.
The original LR image, $I_{LR}$, is input into the feature extraction module to produce the initial LR feature map, $L^{0}$. The feature extraction module is composed of two convolution layers, conv(3, $n_0$) and conv(1, $n$), where $n_0$ is the number of channels in the initial LR feature extraction layer and $n$ is the number of input channels of the MSPG. It first uses conv(3, $n_0$) to generate shallow features carrying LR image information from the input $I_{LR}$, and then uses conv(1, $n$) to reduce the number of channels from $n_0$ to $n$, which yields $L^{0}$.
Subsequently, the initial LR feature map, $L^{0}$, flows into the feedback module formed by the recursive MSPG, which outputs a series of HR feature maps, $H^{g}$:

$$H^{g} = f^{g}_{FM}\left(L^{g-1}\right), \quad g = 1, 2, \cdots, G,$$

where $G$ represents the number of MSPGs, which is equivalent to the total number of recursions, and $f^{g}_{FM}$ represents the feature mapping process of the MSPG at the $g$-th stage in the feedback module. When $g$ is 1, the initial LR feature map $L^{0}$ is taken as the input of the first MSPG, and when $g$ is greater than 1, the LR feature map $L^{g-1}$ generated by the previous MSPG is taken as the current input.
The reconstruction module is expressed as follows:

$$I_{Res} = f_{RB}\left(\left[H^{1}, H^{2}, \cdots, H^{G}\right]\right)$$

Here, $\left[H^{1}, H^{2}, \cdots, H^{G}\right]$ represents the deep concatenation of multiple HR feature maps. $f_{RB}$ represents the operation of the reconstruction module, which concatenates the series of HR feature maps generated in the feedback module and then passes them through conv(3, 3) to generate a residual image, $I_{Res}$.
Through the global residual skip connection, the final output SR image can be expressed as

$$I_{SR} = I_{Res} + f_{US}\left(I_{LR}\right) \tag{22}$$

Here, $f_{US}$ represents an upsampling operation with interpolation. According to the given scaling factor, bilinear interpolation is applied to enlarge the original input image $I_{LR}$ to the target size (other interpolation algorithms may also be used, e.g., bicubic interpolation). The interpolated LR image then bypasses the main body of the network, is transferred to the end of the network, and is summed with the reconstructed residual image $I_{Res}$ to generate the final image $I_{SR}$.
As their names imply, the modules play different roles in our deep neural network, and the three major modules constitute our SR framework. Assuming that the number of MSPGs is $G$, the network contains a total of $(10G + 3)$ layers: two layers are used for feature extraction, $(5 + 5) \times G$ layers are used for feature mapping in the feedback module, and one layer is used for the final reconstruction. We abstract these modules by defining multiple basic blocks and parameterizing the modules in the network in a concise manner. Owing to this modular design, we can change the depth of the network by changing only $G$, which makes it more convenient to train networks with different depths or different numbers of MSPGs. In addition, it is easier to migrate to any upsampling factor with only minor adjustments to the network parameters.
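The layer count and the global residual connection described above reduce to a few lines of arithmetic. The sketch below uses hypothetical function names and represents images as flat lists of pixel values; the upsampled LR image is passed in already interpolated, so no particular interpolation routine is implied.

```python
def network_depth(num_groups):
    """Depth of the unrolled network for a given number of MSPGs."""
    feature_extraction = 2              # conv(3, n0) and conv(1, n)
    feedback = (5 + 5) * num_groups     # up MSPU + down MSPU per recursion
    reconstruction = 1                  # final conv(3, 3)
    return feature_extraction + feedback + reconstruction


def add_global_residual(residual, upsampled_lr):
    """Final SR image: predicted residual plus the interpolated LR image."""
    return [r + u for r, u in zip(residual, upsampled_lr)]
```

For instance, `network_depth(6)` evaluates to 63, which matches the 10G + 3 rule for G = 6.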

IV. EXPERIMENTAL RESULTS AND DISCUSSION
The performance of the MSRFN was evaluated using several benchmark datasets. We first introduce the experimental setting, evaluation metrics, and implementation details, then provide the quantitative comparison results with mainstream methods, and finally show the visualization results of different methods from the perspective of qualitative analysis. Comparative analysis of various SR models demonstrated the superiority of the MSRFN.

A. IMPLEMENTATION AND TRAINING DETAILS
1) EXPERIMENTAL PLATFORM
The operating system was Windows 10, the CPU was an Intel Core i5-7500, and the GPU was an NVIDIA RTX-2080. All experiments were completed using the deep learning framework PyTorch 1.2.0, and the accelerator library was CUDA Toolkit 10.0.

2) DATASETS AND METRICS
We used 800 images in DIV2K [23] as the training set. DIV2K contains 800 2K-resolution training images collected from the Internet. Rotation and flipping are used for data augmentation to fully utilize the training data [14]. During testing, we selected PSNR and SSIM [22] as metrics to evaluate SR image quality on five standard benchmark datasets: Set5 [32], Set14 [33], BSD100 [34], Urban100 [35], and Manga109 [36]. The Set5 dataset has 5 images (''baby,'' ''bird,'' ''butterfly,'' ''head,'' ''woman''). The Set14 dataset consists of 14 images. The BSD100 dataset has 100 test images covering a large variety of content, ranging from natural images to object-specific images such as plants, people, and food. The Urban100 dataset contains 100 images of urban scenes. Manga109 is composed of 109 manga volumes drawn by professional manga artists in Japan. These datasets are commonly used for testing the performance of SR models. The larger the metric value, the better the reconstruction performance. To be consistent with existing networks, all evaluations were performed only on the luminance channel (Y).
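As a concrete reference for the PSNR metric on the luminance channel, the following sketch computes Y from 8-bit RGB values and then the PSNR between two lists of Y values. The ITU-R BT.601 studio-range conversion used here is an assumption, since the exact conversion is not specified in the text, and the function names are hypothetical.

```python
import math


def rgb_to_y(r, g, b):
    """Luma of an 8-bit RGB pixel, assuming ITU-R BT.601 studio range (16..235)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf                  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

A uniform error of one gray level corresponds to a PSNR of about 48.13 dB with a peak value of 255, which gives a sense of scale for the values reported on the benchmarks.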

3) TRAINING SETTINGS
We set the batch size to 16. To take full advantage of the memory resources and the contextual information from LR images, we feed RGB image patches whose size depends on the upscaling factor (Table 1), together with the corresponding HR patches, into the network for training. Bicubic downsampling is used as the degradation model to produce LR images from the ground-truth HR images. We apply the method proposed by He et al. [37] to initialize the weights and use the Adam [38] optimizer to optimize the parameters. The learning rate was initialized to 0.0001 and halved every 200 epochs. We adopted the L1 loss to train the network.
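The learning-rate schedule and the loss described above can be written out explicitly. This is a minimal sketch assuming zero-based epoch indexing and a step decay applied at epochs 200, 400, and so on; the function names are hypothetical, and the L1 loss is shown as a mean absolute error over flat lists of pixel values.

```python
def learning_rate(epoch, base_lr=1e-4, step=200, factor=0.5):
    """Step decay: start at base_lr and halve it every `step` epochs."""
    return base_lr * factor ** (epoch // step)


def l1_loss(prediction, target):
    """Mean absolute error between two equal-length lists of values."""
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(prediction)
```

Under these assumptions, the rate stays at 0.0001 for epochs 0 to 199, drops to 0.00005 at epoch 200, and to 0.000025 at epoch 400.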
We designed different kernel sizes and padding for each branch of the MSPU and adjusted the kernel sizes and strides according to the corresponding scaling factors. Table 1 lists the network parameter settings for the different SR factors. Both the input and output of the network use RGB color channels. Except for the reconstruction layer at the end of the network, PReLU [37] was used as the activation function after all the convolution and deconvolution layers.
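Because the two branches of an MSPU use different kernel sizes but their outputs are concatenated, the kernel size, stride, and padding of each branch must be chosen so that both branches produce the same spatial size. The sketch below checks this with the standard output-size formulas for convolution and transposed convolution (dilation 1, no output padding). The (kernel, stride, padding) triples are hypothetical examples that satisfy the constraint; they are not the settings of Table 1.

```python
def deconv_out(size, kernel, stride, padding):
    """Output length of a transposed convolution (dilation 1, no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def conv_out(size, kernel, stride, padding):
    """Output length of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical (kernel, stride, padding) for the two branches at each scale.
BRANCHES = {
    2: [(6, 2, 2), (8, 2, 3)],
    4: [(8, 4, 2), (12, 4, 4)],
    8: [(12, 8, 2), (16, 8, 4)],
}


def branches_consistent(size, scale):
    """True if both branches map size -> size * scale and back to size."""
    for kernel, stride, padding in BRANCHES[scale]:
        hr = deconv_out(size, kernel, stride, padding)
        if hr != size * scale or conv_out(hr, kernel, stride, padding) != size:
            return False
    return True
```

With these example settings, a larger kernel is paired with a larger padding at the same stride, so both branches enlarge the feature map by exactly the scaling factor and project it back to the original size.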

B. COMPARISON OF RESULTS AND DISCUSSION
For four different scale factors (×2, ×3, ×4, and ×8), we qualitatively and quantitatively compared MSRFN with other recent SR models on five test sets with different characteristics. Set5, Set14, and BSD100 mainly contain natural scenes; Urban100 is composed of many regular patterns in urban scenes and focuses on man-made structures with details in different frequency ranges; Manga109 is a comic dataset drawn by Japanese artists. Table 2 presents the results of the quantitative comparisons. It can be seen that on these five datasets, MSRFN achieves higher objective evaluation metrics in terms of PSNR and SSIM. This indicates that the MSRFN is good at reconstructing not only regular man-made patterns but also irregular natural patterns. In particular, our training set does not contain any comic images, yet excellent experimental results are obtained on Manga109, which indicates that the MSRFN performs well in reconstructing images with fine-structure information such as comic characters. In short, the MSRFN adapts well to various scene features and achieves remarkable SR results for images with different characteristics.

1) QUANTITATIVE ANALYSIS
For small enlargement factors (×2, ×3, ×4), we compared the MSRFN with 21 advanced methods, as shown in Table 2. Because many models are not suitable for a large-scale factor SR (×8), the MSRFN is compared with 11 advanced methods on ×8, as shown in Table 3. For ×2 enlargement, MSRFN obtains the best PSNR results in five benchmark datasets, and the SSIM values of the MSRFN are only slightly lower than MSRN in BSD100, Urban100, and Manga109. However, for the ×3, ×4, and ×8 enlargements, the MSRFN is superior to all other models in terms of PSNR and SSIM. As the upscaling factor increased, the superiority became relatively more obvious. Especially for ×8 SR, it proves the effectiveness of MSRFN to enlarge the image with a large factor, which can generate HR components better than other networks.

2) QUALITATIVE ANALYSIS
For qualitative analysis, Figs. 3 to 17 display the visual results of multiple SR methods on the above five datasets with different scaling factors.
For small SR factors (×2, ×3, and ×4), we compared the MSRFN with eight mainstream methods, including bicubic and SRCNN. Fig. 3 shows the visualization results for the ×2 SR. Owing to the low magnification, the gap between different models is subtle, but the MSRFN still shows an obvious advantage. The text in our reconstructed image is clearer, with no blur or adhesion between the letters. The first letter "M" recovered by the seven CNN-based networks (from SRCNN to LapSRN) has a crack that should not exist, whereas MSRFN avoids this defect. Figs. 4 and 5 show the visualization results for the ×3 SR. For the natural image "baboon," the MSRFN restores sharper beard patterns than other models; for the comic image "Belmondo," the edges of the patterns reconstructed by other models have obvious blur artifacts, while the MSRFN accurately predicts the edges and details of the patterns. Figs. 6 and 7 show the visualization results for the ×4 SR. For the image "Belmondo" with irregular characteristics (Fig. 6), the eye patterns recovered by other models all suffer from different degrees of blurring, but MSRFN recovers more high-frequency information and details, so that the reconstructed pattern contains sharp and accurate edges. For the image "img_096" with regular characteristics from Urban100 (Fig. 7), the edge features recovered by other models are visibly affected by the ringing effect and checkerboard artifacts, but MSRFN largely removes these negative effects and reconstructs clearer patterns of the building and windows, which are very close to the original HR image.

For a large SR factor (×8), we compared the MSRFN with seven mainstream methods: bicubic, SRCNN, FSRCNN, SCN, VDSR, LapSRN, and MSRN on five benchmark datasets (Figs. 8-17). As shown in Fig. 8, the MSRFN has an excellent reconstruction effect for irregular speckle patterns, while the SR results from other models lose more edge details and show relatively severe blurring. Fig. 12 shows that the MSRFN can reconstruct clear text even at large scaling factors, whereas other models have difficulty estimating high-frequency information because of insufficient feature utilization, which reduces their ability to recover text details. In Fig. 9, the other models predict the wrong stripe direction owing to their weak ability to recover high-frequency components, but the MSRFN recovers the high-frequency texture details to the greatest extent and in the correct direction. Figs. 10, 11, and 13 show the visualization results of images from Urban100, from which it can be seen that the MSRFN surpasses other advanced models in reconstructing images containing regular patterns with more mid- and high-frequency information. Figs. 14 to 17 show the reconstruction results of comic images with more complex and fine textures. Other methods have difficulty estimating high-frequency details, so their SR images have smooth edges and blur artifacts, whereas the MSRFN results have finer details such as sharper edges and contours.

Owing to the loss of information during image degradation, especially the loss of high-frequency information, these CNN-based SR models still recover overly smooth image edges. As the scaling factor increases, the edge blurring becomes more severe. However, MSRFN can suppress the smooth component and predict more high-frequency information, which yields SR images with sharper edges and contours and, to a great extent, alleviates the interference of checkerboard artifacts and ringing effects. Notably, MSRFN retains this advantage at large scaling factors, generating the SR results closest to the ground truth among the compared methods.
The above qualitative and quantitative comparisons and analyses show that the MSRFN has a convincing reconstruction performance. Compared with feed-forward networks, it focuses on refining well-developed information; compared with single-scale networks, it can focus on fine details and generate finer high-level representations. It can not only capture image features on multiple context scales and mine more mutual dependencies between LR and HR images, but also create contextual information from the LR input, which better preserves HR features, even at large scaling factors.

V. CONCLUSION
We propose a multiscale recursive feedback network for image super-resolution. Unlike single-scale networks, the proposed multiscale projection unit can adaptively capture image features at different scales by constructing a two-bypass structure with different kernels, in which feature information can be shared between the bypasses to fully use the local features of images. Unlike feed-forward networks, we design recursive multiscale projection groups to form feedback modules that can effectively enhance features. We also combine local and global information by fusing local multiscale residual features and global residual features. The feedback flow exploits the high-level information extracted from deep layers to refine the low-level features from shallow layers, which improves the early reconstruction performance of the MSRFN. Furthermore, the combination of global residual learning and local residual feedback encourages feature reuse and provides more contextual information for the final reconstruction. Therefore, MSRFN not only fuses local and global information, but also combines low-level details with high-level abstract semantics, which helps to produce results that are more faithful to the ground truth. The experimental results show that the MSRFN achieves encouraging performance and is superior to other advanced SR methods, especially for large scale factors (such as ×8).
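The two-bypass idea summarized above can be illustrated with a toy one-dimensional sketch: two branches filter the same features with kernels of different sizes, exchange their outputs, filter again, and the fused result is added back to the input as a local residual. This is only an illustration under stated assumptions, not the MSRFN projection unit itself: the names `conv1d_same`, `relu`, and `two_bypass_unit`, the summation used for sharing, the averaging used for fusion, and the fixed (non-learned) kernels are all hypothetical.

```python
def conv1d_same(signal, kernel):
    """Zero-padded 'same' 1-D correlation with an odd-length kernel."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def two_bypass_unit(features, kernel_small, kernel_large):
    """Toy two-bypass multiscale block with a local residual connection."""
    # Stage 1: each bypass filters the input with its own kernel size.
    small_1 = relu(conv1d_same(features, kernel_small))
    large_1 = relu(conv1d_same(features, kernel_large))

    # Sharing step (assumed here to be a sum): both bypasses see both scales.
    shared = [a + b for a, b in zip(small_1, large_1)]

    # Stage 2: each bypass filters the shared features again.
    small_2 = relu(conv1d_same(shared, kernel_small))
    large_2 = relu(conv1d_same(shared, kernel_large))

    # Fuse the two scales (assumed average) and add the local residual.
    fused = [0.5 * (a + b) for a, b in zip(small_2, large_2)]
    return [x + f for x, f in zip(features, fused)]


if __name__ == "__main__":
    # Hypothetical smoothing kernels of size 3 and 5.
    k3 = [0.25, 0.5, 0.25]
    k5 = [0.1, 0.2, 0.4, 0.2, 0.1]
    print(two_bypass_unit([0.0, 1.0, 0.0, 2.0, 0.0], k3, k5))
```

In a trained network the kernels would be learned two-dimensional convolutions; the sketch only shows how information from the two scales is mixed before the residual is added.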
Future work will mainly proceed in the following directions. If images contain noise, the performance of SR methods may degrade, so we will study SR methods for noisy images by integrating denoising methods [54]-[58]. In training, the Adam optimizer is commonly used in many SR studies. The use of other optimization algorithms [59]-[63], such as the particle swarm optimization algorithm [64]-[66], might improve the SR performance. As the MSRFN has achieved good performance for ×8 SR, we also intend to apply it to higher SR rates such as ×16 and to develop a single model that performs multiscale super-resolution.