Multi-Scale Residual Fusion Network for Super-Resolution Reconstruction of Single Image

Inferring the missing high-frequency details from a low-resolution image is the key to single-image super-resolution reconstruction. To fully extract the feature information in low-resolution images and maximize the deduction of high-frequency details, we introduce a multi-scale cross merge (MSCM) network based on residual fusion. MSCM uses a feature extraction module with convolutional kernels of different sizes to extract multiple features from the low-resolution input image, concatenates them, and sends them into a nonlinear mapping module. The nonlinear mapping module consists of five cross-merge modules, each formed by cascading three residual dual-branch merged structures. This structure promotes the integration of information from different branches. Dense connections and residual connections are integrated into the nonlinear mapping module to improve the transmission of information and gradients. The nonlinear mapping module extracts high-frequency features and sends them to the reconstruction module, which combines an improved sub-pixel up-sampling layer with external and global residuals to generate a high-resolution image. Simulation experiments demonstrate that our MSCM network achieves single-image super-resolution reconstruction and offers objective and subjective quality improvements over mainstream and other state-of-the-art reconstruction methods.


I. INTRODUCTION
Inferring a high-resolution (HR) image from a corresponding low-resolution (LR) image is called single image super-resolution (SISR). It is an important problem in digital image processing, widely applied in satellite remote sensing, medical imaging, video surveillance, unmanned driving, hyperspectral imaging and other fields. During reconstruction, traditional SISR methods often suffer from problems such as loss of high-frequency details and blurred edges. With the rapid progress of computer hardware, SISR methods based on deep network models have received widespread attention.
SRCNN [1] is the first convolutional neural network (CNN) model for SISR, which contains three convolutional layers. To alleviate the gradient dispersion problem, VDSR [2] was proposed based on the residual network ResNet [3]. The multi-level fusion network MFFRnet [4] used a recursive residual network to extract features for image reconstruction. FDSR [5] is a symmetric residual convolution structure, which speeds up training and improves the reconstruction effect without changing the network complexity. DRCN [6] and DRRN [7] introduced recurrent structures, which greatly reduce network parameters and accelerate convergence. The achievements of DenseNet [8] in image classification inspired scholars to use it for super-resolution reconstruction. MemNet [9] used dense connections to stack multiple memory blocks with persistent memory for image restoration. RDN [10] realized the interconnection and fusion of multiple residual dense blocks through residual dense connections. All these works show that residual and dense connections can promote information flow, alleviate gradient vanishing and improve reconstruction performance. To further enhance the reconstruction effect, scholars have tried to construct independent sub-networks in HR reconstruction models. Tang et al. [11] proposed a joint residual network consisting of three parallel sub-networks, two of which learn low-frequency information while the other learns high-frequency information. CMSC [12] utilized multiple cross sub-networks to enhance information extraction. CSN [13] reduced the learning burden of a deep network by distributing feature information among sub-networks.
(The associate editor coordinating the review of this manuscript and approving it for publication was Tao Zhou.)
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
HDRN [14] reduced memory consumption and promoted feature fusion by constructing multiple hierarchical dense residual blocks. MGEP-SRCNN [15] constructed an independent pre-classification network, which selects a sample sub-database highly related to the target image before feature extraction, and then performs nonlinear mapping and image reconstruction. These independent sub-network structures provide a new idea for the construction of reconstruction models.
Mainstream SISR methods mainly obtain more contextual information by deepening the network. Zagoruyko and Komodakis [16] and Zhao et al. [17] found that increasing the network width can promote the full extraction of image features. CNF [18] integrated multiple convolutional layers into the network model and improved performance by fusing the network's contextual information. LapSRN [19] combined the traditional Laplacian pyramid with deep learning to reconstruct images. MCNN [20] utilized competition among multi-scale convolution filters to enhance the inference ability of the network. EEDS [21] explored multi-scale convolution to construct deep and shallow networks that extract different features, respectively. HCNN [22] implemented edge extraction, edge enhancement and HR reconstruction by constructing a three-branch network structure. These good reconstruction results inspired us to extract more information by increasing the network width.
Our investigation shows that both the construction of sub-networks and the increase of network width can promote the full extraction of image features, and that residual dense connections can effectively reduce the difficulty of network training. Therefore, a multi-scale cross merge network (MSCM) for SISR is proposed. The main contributions include three aspects: 1) We design a feature extraction module (FEM), which uses convolutional kernels of different sizes to fully extract features of the input LR image and concatenates them to form multi-range contextual information. FEM achieves complementarity between different types of features.
2) We propose a nonlinear mapping module (NMM) by cascading five cross-merge modules. Dense connection and local residual connection are integrated into the nonlinear mapping module to achieve the fusion of multi-level and multi-scale features. NMM can obtain the necessary high-frequency details for reconstructing the textural details of the HR image.
3) We establish an image reconstruction process which combines external residual, global residual and sub-pixel convolution. The global residual connection is used to merge the low-level features extracted from the shallow layer and the high-level features extracted from the deep layer. The low-frequency information in the LR image is combined with the high-frequency information inferred from the network by using the external residual connection. Sub-pixel convolution is used in the last layer of the network to realize up-sampling.
This article is organized as follows. Section II introduces the structure of CNNs for SISR. Section III describes the proposed MSCM hierarchically. Section IV analyzes the structure of MSCM, presents experimental results and compares them with leading methods on four public datasets. Section V concludes the paper.

II. RELATED WORK

A. CONVOLUTION KERNEL
The convolution kernel is an important part of the CNN architecture. Each convolution kernel extracts one feature map. When the network depth is fixed, the reconstruction quality of the HR image is closely related to the size of the convolution kernels. Every feature has its own suitable convolution scale: at this scale, the image feature is most distinct and can be distinguished accurately from its surroundings. A large-scale convolution kernel can learn complex features but loses detail information, while a small-scale convolution kernel is easy to train and captures more detail but is poor at learning complex features.
SRCNN [1] is the first CNN model for SR, which uses a single 9 × 9 convolution kernel to extract image features. FSRCNN [23], ESPCN [24], VDSR [2], DRCN [6] etc. also extract features through a single-scale convolution layer. GoogLeNet [25], developed by Google, solves the problem of insufficient information extraction caused by a single-scale convolution kernel by increasing the network width, that is, extracting information with multi-scale convolution kernels to enhance the performance of the network. This idea can be introduced into the feature extraction of SR methods.
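As a sketch of this idea, the following NumPy snippet (an illustration, not the paper's or GoogLeNet's actual implementation) extracts features with parallel convolution kernels of different sizes and concatenates them along the channel axis; the image size, kernel sizes and filter count are arbitrary toy values:

```python
import numpy as np

def conv2d_same(x, kernels):
    """Naive 'same'-padded 2D convolution.
    x: (H, W, C_in); kernels: (k, k, C_in, C_out) -> (H, W, C_out)."""
    k = kernels.shape[0]
    pad = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.empty((H, W, kernels.shape[3]))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + k, j:j + k, :], kernels, axes=3)
    return out

rng = np.random.default_rng(0)
lr = rng.standard_normal((8, 8, 1))        # toy 8x8 single-channel LR image
n_filters = 4                              # toy value; real networks use 64+
# Parallel branches with different kernel sizes, then channel-wise concat.
branches = [conv2d_same(lr, 0.1 * rng.standard_normal((k, k, 1, n_filters)))
            for k in (3, 5, 9)]
multi_scale = np.concatenate(branches, axis=-1)   # (8, 8, 3 * n_filters)
```

Each branch keeps the spatial size via "same" padding, so widening the network only grows the channel dimension of the concatenated result.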

B. MULTI-BRANCH STRUCTURE
GoogLeNet is essentially a multi-branch structure, which increases the width of the network and promotes information fusion by paralleling convolution kernels of different scales. Zhao et al. [17] merged two residual branches and then carried out mapping transfer (MR) to realize rapid forward transmission of feature information from front-end to back-end modules and rapid backward propagation of gradients from back-end to front-end modules. Hu et al. [12] changed the convolutional layer in the MR module to multi-scale convolution, and Zhao et al. [13] further introduced residuals into the branch network; both achieved an ideal reconstruction effect. The above methods confirm the feasibility of constructing multi-branch structures.

C. UP-SAMPLING METHOD
The method of [23] first enlarged the LR image to the target size using interpolation, and then used a convolutional network to reconstruct the HR image. Enlarging an LR image cannot bring more detail information, but will destroy the contextual information of the image. In other words, increasing the image size does not contribute to the inference of high-frequency information. Shi et al. [24] proposed a sub-pixel up-sampling method, which directly extracts features from the LR image and obtains the up-sampled image by periodically arranging multiple feature maps extracted by convolution. Networks such as RDN [10] and MFFRnet [4] also use this method for image up-sampling. Sub-pixel up-sampling avoids the loss of LR image features caused by interpolation and enlargement. It preserves contextual information and brings abundant image details, thereby improving the effect of super-resolution reconstruction.

III. MULTI-SCALE CROSS MERGE NETWORK

A. THE MSCM MODEL
The purpose of single-image super-resolution is to infer the missing high-frequency details from an input LR image X, thereby obtaining an HR image Y. Given a training dataset E = {(X^(k), Y^(k))}, k = 1, 2, ..., D, where X^(k) and Y^(k) represent an LR image and the corresponding HR image, the SISR reconstruction model achieves an end-to-end mapping from the LR image to the HR image. The goal of our MSCM model is to learn a deduction model that obtains the inferred value Ŷ^(k) of the HR image from the input real LR image X^(k):

Ŷ^(k) = F(X^(k); Θ),   (1)

where Θ = [ω, b] are the network parameters of MSCM, ω is the weight matrix and b is the bias. The model parameters are determined by minimizing the loss between the reconstructed HR image and the real HR image. We define the loss function as

L(Θ) = (1/D) Σ_{k=1}^{D} ||Ŷ^(k) − Y^(k)||²,   (2)

The process of training MSCM with the training set E is to minimize this loss and find the optimal parameters of the model.
The structure of the MSCM model is shown in Figure 1. It consists of a feature extraction module (FEM), a nonlinear mapping module (NMM) and a reconstruction module (RM). The FEM extracts the shallow features of the input LR image and transmits them to the NMM after concatenating them. The NMM extracts the high-frequency features and sends them to the RM. The RM reconstructs the HR image through the improved sub-pixel up-sampling layer after global feature fusion.
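The training objective can be sketched as follows; treating the loss as a pixel-mean L2 error is an assumption here (the excerpt does not print the formula), consistent with common SR practice such as SRCNN and VDSR:

```python
import numpy as np

def mse_loss(preds, targets):
    """Mean squared error over a batch of D image pairs:
    average over images of the per-pixel squared difference
    between the network output Yhat_k and the ground truth Y_k."""
    return float(np.mean([np.mean((p - y) ** 2)
                          for p, y in zip(preds, targets)]))

# Toy check: a constant error of 2 per pixel gives a loss of 4.
loss = mse_loss([np.full((2, 2), 3.0)], [np.ones((2, 2))])
```

During training, Θ = [ω, b] would be updated to reduce this quantity over the whole training set E.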
Inspired by the propagation mechanisms of ResNet [3], DenseNet [8] and RDN [10], we introduce MR [17] and residual connections into the MSCM model to promote the integration of information from different branches, supervising the predictions of the sub-networks and the output of the MSCM model with a cascaded supervision strategy. The external residual connection (ERC) is used to avoid the destruction of inter-pixel correlation caused by the periodic arrangement of multiple feature maps when directly up-sampling an LR image. A residual learning relationship between input and output is established by the global residual connection (GRC). The local residual connection (LRC) is used to improve information transmission within the nonlinear mapping module.

B. FEATURE EXTRACTION MODULE
Using only a single-scale convolution kernel to extract low-level features causes many feature details to be missed. In this article, the 3_5_9 convolution model composed of 3 × 3, 5 × 5 and 9 × 9 kernels is used in the FEM. This multi-scale convolution model can extract more feature details from the LR image, which is conducive to the reconstruction of HR image details, and makes full use of the complementary advantages of different features. The feature extraction formula is defined as

F_1 = H_3×3(X),   (3)

where X is the input LR image, H represents the convolution operator whose subscript denotes the size of the convolutional kernel used in the layer, and F_1 is the feature map extracted by the 3 × 3 convolutional kernel. Similarly, the LR image is convolved with 5 × 5 and 9 × 9 convolution kernels to obtain the feature maps F_2 and F_3, respectively. Then the three feature maps are concatenated to obtain F, the result of feature extraction by the 3_5_9 convolution model:

F = [F_1, F_2, F_3],   (4)

where [·] represents the concatenation operation. Feature dimensionality reduction of F is performed by a 1 × 1 convolution kernel to avoid too many training parameters, which improves the robustness of the network. Then a 3 × 3 convolution kernel is used to further extract features and obtain the final feature X_0 of the FEM:

X_0 = H_3×3(H_1×1(F)).   (5)

C. NONLINEAR MAPPING MODULE
NMM consists of n cascaded cross-merge modules (CM).
The structure of CM is shown in Figure 2. It is formed by cascading m residual dual-branch merged structures (RDM) into which dense connections are integrated. The outputs of the dual branches of the last RDM are combined by ''Add'' and the dimensions are adjusted by a 1 × 1 convolution kernel. After that, LRC is used to improve the flow of information and gradients. The convolutional layer is a kind of linear filter with cross-correlation properties, so the ReLU in CM is the key to realizing nonlinear mapping; it helps the MSCM network learn complex features of the input image. Acting as the activation function of the convolutional layer, ReLU has nonlinear characteristics: it converts the multiple input signals of a node into one output signal, realizing a nonlinear mapping between the input and output feature maps.

1) RESIDUAL DUAL-BRANCH MERGED STRUCTURE
The RDM structure is shown in Figure 3; it merges two parallel branches through residual branches. RDM feeds its input into two parallel residual branches: the upper branch contains a 3 × 3 convolution layer followed by a ReLU activation layer, and the lower branch contains a 5 × 5 convolution layer followed by a ReLU activation layer. After the two branches are fused by local residuals, the fused result is concatenated with the outputs of the upper branch and lower branch, respectively, to realize the fusion and complementarity of multi-scale contextual information. ''Add'' in the middle of RDM denotes the weighted average of the two branches' feature maps without changing the number of channels, while the subsequent ''concat'' concatenates feature maps together, which increases the number of channels.
Taking the j-th mapping stage in the i-th CM as an example, the mathematical model of RDM is defined as

C^{b1}_{i,j} = ReLU(H_3×3(X^{b1}_{i,j−1})), C^{b2}_{i,j} = ReLU(H_5×5(X^{b2}_{i,j−1})),   (6)
X^{b1}_{i,j} = [(C^{b1}_{i,j} + C^{b2}_{i,j})/2 + I·X^{b1}_{i,j−1}, C^{b1}_{i,j}], X^{b2}_{i,j} = [(C^{b1}_{i,j} + C^{b2}_{i,j})/2 + I·X^{b2}_{i,j−1}, C^{b2}_{i,j}],   (7)

where i = 1, 2, ..., n and j = 1, 2, ..., m. C^{b1}_{i,j} and C^{b2}_{i,j} represent the outputs of the ReLU activation layers after convolving the inputs X^{b1}_{i,j−1} and X^{b2}_{i,j−1} of the upper branch and lower branch with 3 × 3 and 5 × 5 convolution kernels, respectively. X^{b1}_{i,j} and X^{b2}_{i,j} are the outputs of the upper branch and lower branch of the j-th RDM, satisfying X^{b1}_{i,0} = X^{b2}_{i,0} = X_{i−1}, and I represents the identity matrix.
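The channel bookkeeping of one RDM stage can be sketched as follows. The convolutions themselves and the residual identity term are elided (random arrays stand in for the branch outputs): the point is that ''Add'' preserves the channel count while ''concat'' grows it. Shapes are toy values:

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)
rng = np.random.default_rng(1)

# Stand-ins for the branch outputs C_b1 (3x3 conv + ReLU) and
# C_b2 (5x5 conv + ReLU); the convolutions are elided in this sketch.
c_b1 = relu(rng.standard_normal((8, 8, 64)))
c_b2 = relu(rng.standard_normal((8, 8, 64)))

fused = 0.5 * (c_b1 + c_b2)                     # "Add": channels unchanged
x_b1 = np.concatenate([fused, c_b1], axis=-1)   # "concat": channels double
x_b2 = np.concatenate([fused, c_b2], axis=-1)
```

Because both branches see the fused map, multi-scale context flows into each branch of the next stage.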

2) LOCAL RESIDUAL CONNECTION
After cascading m RDMs, the feature mapping results of the upper branch and lower branch are merged by ''Add'' to obtain M_1. M_1 passes through a convolutional layer with a 1 × 1 kernel to obtain M_2, which is added to the input of the i-th CM transferred by LRC, giving the output of the i-th CM:

M_1 = D_c(X^{b1}_{i,m}, X^{b2}_{i,m}), M_2 = H_1×1(M_1), X_i = M_2 + X_{i−1},   (8)

where D_c(·, ·) represents the mapping function of ''Add'', which merges the upper and lower branches. For convenience, the mapping relationship between the input X_{i−1} and the output X_i in Eq. (8) is denoted T^i_c, so the result X_n of the n-th CM can be obtained as

X_n = T^n_c(T^{n−1}_c(· · · T^1_c(X_0) · · ·)),   (9)

where X_0 is the input of the first CM, i.e., the output of the FEM.
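A minimal sketch of the CM cascade with its local residual connection follows; the channel-mixing matrix stands in for the m RDMs and the 1 × 1 convolution (both are assumptions of this toy), while the `+ x` term is the LRC of Eq.-(8) form:

```python
import numpy as np

rng = np.random.default_rng(2)
w = 0.1 * rng.standard_normal((64, 64))    # stand-in for a CM's 1x1 conv

def cross_merge_module(x):
    """Toy CM: channel mixing (standing in for the m RDMs plus the
    1x1 conv) followed by the local residual connection X_i = M_2 + X_{i-1}."""
    m2 = np.maximum(x @ w, 0.0)
    return m2 + x                          # LRC: shapes stay identical

x = rng.standard_normal((8, 8, 64))        # X_0 from the FEM (toy shape)
outputs = []
for _ in range(5):                         # n = 5 cascaded CMs
    x = cross_merge_module(x)
    outputs.append(x)
```

The residual form keeps every X_i the same shape, which is what lets the later global fusion concatenate all n CM outputs.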

D. RECONSTRUCTION MODULE
The image reconstruction module includes two parts: global feature fusion and image restoration.

1) GLOBAL FEATURE FUSION
By fusing the nonlinear mapping results with dense connections and ''concat'', the outputs of the n CMs are connected into a tensor X_M:

X_M = [X_1, X_2, . . . , X_n],   (10)

where X_M is the global feature obtained by fusing the local features of all channel blocks of the NMM. X_M passes through a 1 × 1 convolution and a 3 × 3 convolution in turn to obtain the first-level feature F̃_1. Adding it to the feature F_1 of the FEM transferred by GRC gives the second-level feature F̃_2, which is convolved in turn by three convolution layers with kernel sizes 5 × 5, 3 × 3 and 3 × 3 to obtain the third-level feature F̃_3. The mathematical models are as follows:

F̃_1 = H_3×3(H_1×1(X_M)),   (11)
F̃_2 = D_c(F̃_1, F_1),   (12)
F̃_3 = H_3×3(H_3×3(H_5×5(F̃_2))),   (13)

The 1 × 1 convolution kernel in Eq. (11) reduces the number of channels to 64 and reduces the model parameters. Convolutions with larger kernels (e.g., 5 × 5 or 9 × 9) have more connection parameters, which increases the computational complexity. A good deep learning model often needs to trade off peak performance against running efficiency. Therefore, only F_1 is selected as the object of the GRC.
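The dense concatenation and the 1 × 1 channel reduction can be sketched as below; a 1 × 1 convolution is just a per-pixel matrix multiply over the channel axis, which the snippet exploits (all shapes are toy values standing in for the n = 5 CM outputs):

```python
import numpy as np

rng = np.random.default_rng(3)
cm_outputs = [rng.standard_normal((8, 8, 64)) for _ in range(5)]  # n = 5

X_M = np.concatenate(cm_outputs, axis=-1)   # dense "concat": (8, 8, 320)
w1 = 0.05 * rng.standard_normal((320, 64))
F1_reduced = X_M @ w1                       # a 1x1 conv == per-pixel matmul
```

This is why the 1 × 1 kernel is cheap: its parameter count depends only on the channel dimensions, not on the spatial kernel size.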

2) IMAGE RESTORATION
The sub-pixel up-sampling convolution (SUC) refers to the process of rearranging r² feature maps of size h × w × c extracted from the LR space into one rh × rw × c HR image. The effect of SUC is illustrated in Figure 4.
SUC can easily destroy the correlation between pixels during the periodic arrangement, but an LR image and the corresponding HR image have similar topological structures, which can enhance the correlation between pixels. In view of this, we introduce the LR image into the last part of HR reconstruction via ERC, and then realize up-sampling using sub-pixel convolution. Finally, the HR image reconstruction is completed by passing the up-sampled result through a convolution layer with a 3 × 3 kernel. The mathematical models are as follows:

T = D_c(F̃_3, X̃), X̃ = H_1×1(X),   (14)
Ŷ = SUC(T),   (15)
Y = H_3×3(Ŷ),   (16)

where X̃ is the 1 × 1 convolution result of the input LR image X, T is the feature map to be reconstructed, Ŷ is the up-sampling result, SUC(·) is the pixel recombination operation on the LR feature maps, r is the scale factor of up-sampling, and c is the number of image channels (3 for a color image and 1 for a grayscale image). mod(x, r) and mod(y, r) represent the activation operation: according to the different sub-pixel positions in the r² LR maps, the pixel regions at the same position are activated during pixel recombination and extracted to form a region of the HR image Y. In Eqs. (5), (6), (8), (11), (13), (14) and (16), the operator H and its subscript have the same meaning as in Eq. (3). The operator D_c in Eqs. (12) and (14) has the same meaning as in Eq. (8).
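The periodic rearrangement behind SUC can be sketched as a pure reshape-and-transpose; the channel ordering used here is one common pixel-shuffle convention (an assumption, not necessarily the paper's exact layout):

```python
import numpy as np

def sub_pixel_upsample(t, r):
    """Rearrange an (h, w, c*r*r) feature tensor into an (h*r, w*r, c)
    image, as in sub-pixel (pixel-shuffle) up-sampling. The channel
    ordering is one common convention (an assumption)."""
    h, w, crr = t.shape
    c = crr // (r * r)
    t = t.reshape(h, w, r, r, c)
    t = t.transpose(0, 2, 1, 3, 4)   # (h, r, w, r, c): interleave blocks
    return t.reshape(h * r, w * r, c)

feat = np.arange(16).reshape(2, 2, 4)   # h = w = 2, c*r*r = 4 (r = 2, c = 1)
hr = sub_pixel_upsample(feat, 2)        # -> (4, 4, 1)
```

Each LR pixel's r² channels become an r × r block of HR pixels, which is exactly the "periodic arrangement" described above.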

IV. EXPERIMENTS AND ANALYSIS

A. DATASETS AND EXPERIMENTAL SETTING
The experiments are carried out in the TensorFlow environment. Each convolutional layer of the MSCM network has 64 convolution kernels. The network is trained for 30 epochs by stochastic gradient descent using the Adam optimizer with a variable learning rate: the initial value is 10⁻⁴ and it is halved every 10 epochs. The batch size is 64.
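The stepped learning-rate schedule just described amounts to the following (the function name is illustrative):

```python
def learning_rate(epoch, base=1e-4, halve_every=10):
    """Schedule described above: start at 1e-4 and halve every
    10 epochs over the 30-epoch training run."""
    return base * 0.5 ** (epoch // halve_every)

schedule = [learning_rate(e) for e in range(30)]   # 1e-4, ..., 2.5e-5
```

In TensorFlow this could equally be expressed with a piecewise-constant decay schedule passed to the Adam optimizer.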
The training set is constructed from 800 color images (c = 3) of the DIV2K dataset [26]. We extract HR training samples of size 32 × 32 (h × w) from the 800 original images with a stride of 16. The corresponding LR training samples are generated with the bicubic down-sampling function; their size is 32/r × 32/r. If 32 is not an integer multiple of r, 32/r is rounded to an integer. The MSCM network is trained to obtain a reconstruction model for each of the upscale factors ×2, ×3 and ×4. To evaluate the MSCM model accurately, we use the current mainstream datasets Set5 [27], Set14 [28], BSD100 [29] and Urban100 [30] as test sets.
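The patch-extraction geometry can be sketched as follows; bicubic down-sampling itself is elided, and reading the "integer operation on 32/r" as floor division is an assumption:

```python
import numpy as np

def extract_hr_patches(img, size=32, stride=16):
    """Slide a size x size window with the given stride over an HR image."""
    H, W = img.shape[:2]
    return [img[i:i + size, j:j + size]
            for i in range(0, H - size + 1, stride)
            for j in range(0, W - size + 1, stride)]

def lr_patch_size(r, hr_size=32):
    """Reading 'integer operation on 32/r' as floor division (assumption)."""
    return hr_size // r

img = np.zeros((64, 64, 3))                # toy HR image
patches = extract_hr_patches(img)          # 3 x 3 = 9 overlapping patches
```

With stride 16 the 32 × 32 windows overlap by half, which multiplies the number of training samples per image.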
Compared with the RGB color space, the luminance (Y) channel of the YCbCr color space contains more of the structural information to which human eyes are sensitive. Hence, we convert images from RGB to YCbCr and train and test only on the Y channel. Two objective indicators, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [31], are used to evaluate the reconstruction quality.
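The Y-channel conversion and PSNR metric can be sketched as follows (BT.601 luma coefficients, the usual convention for YCbCr in SR benchmarking; SSIM is omitted for brevity):

```python
import numpy as np

def rgb_to_y(rgb):
    """BT.601 luma, i.e., the Y of YCbCr, for RGB values in [0, 255]."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

white = np.full((2, 2, 3), 255.0)
y = rgb_to_y(white)                        # BT.601 white maps to luma 235
```

PSNR is then computed between the Y channels of the reconstructed and ground-truth images.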

B. ARCHITECTURE ANALYSIS OF MSCM
In the structure analysis of the MSCM model, we select Set5 as the test set, and set the scale factor r = 3.

1) DETERMINE THE STRUCTURE OF FEM
To determine the number and size of the parallel convolution kernels in the FEM, an experimental analysis of the 8 FEM structures of different scales in Figure 5 is carried out. The numbers 3, 5, 7, 9 and 11 represent kernel sizes, and each convolution layer contains 64 filters. In the tests of the 8 FEM structures, the NMM and RM are kept exactly the same. The experimental results are shown in Figure 6, where the abscissa is the type of FEM structure (listed in Figure 5) and the ordinate is the corresponding evaluation result. As seen in Figure 6(a), FE 3_9 yields the highest PSNR, exceeding all other FEM structures, with FE 3_5_9 second best. Figure 6(b) shows the SSIM of the 8 FEM structures: FE 3_5_9 obtains the highest SSIM, followed by FE 3_9. To maximize the complementarity of different features, FE 3_5_9 is selected as the structure of our FEM and used in subsequent experiments.

2) DETERMINE THE STRUCTURE OF RDM
RDM is an important part of CM. To determine the optimal mapping structure, an experimental analysis of the 6 RDM models in Figure 7 is carried out. The numbers 3, 5 and 7 represent kernel sizes, and each convolution layer contains 64 filters. The objective indicators of the 6 compared RDM models are listed in Table 1, where boldface indicates the best value. The CM 3_5 model achieves the highest PSNR and SSIM, so it is selected as the RDM structure.

3) DETERMINE THE NUMBER OF CM AND RDM
NMM is formed by cascading n CMs, and each CM is formed by cascading m RDMs. To find the best values of n and m, two experiments are carried out. With m = 3, n is set to 2, 3, 4, 5 and 6, and we compare the PSNR and reconstruction time; the results are shown in Figure 8(a). With n = 5, m is set to 2, 3, 4 and 5; the PSNR and reconstruction time are shown in Figure 8(b). The abscissas of Figure 8(a) and Figure 8(b) are the number n of CMs and the number m of RDMs, respectively; both ordinates are PSNR, and the numbers on the polylines are reconstruction times. As Figure 8 shows, both PSNR and reconstruction time increase with the number of CMs, and likewise with the number of RDMs. As a compromise between speed and reconstruction quality, we finally set n = 5 and m = 3.

C. EXPERIMENTAL RESULTS
To illustrate the effectiveness of the MSCM model, reconstruction experiments with ×2, ×3 and ×4 scale factors are carried out on the datasets Set5, Set14, BSD100 and Urban100. Experiment 1 compares MSCM with mainstream deep learning methods; Experiment 2 compares it with the latest deep learning methods.

1) COMPARISON WITH MAINSTREAM METHODS
We obtained the source codes provided by the authors of SRCNN [1], VDSR [2], DRCN [6], DRRN [7], MemNet [9] and LapSRN [19], retrained them on a PC with an Intel(R) Core(TM) i7-8750 CPU and an NVIDIA GeForce GTX 1060 GPU, and compared the results with MSCM. The comparison includes three aspects: objective indicators (Table 2), subjective quality (Figure 9 and Figure 10) and the reconstruction of a real LR image (Figure 11).
The average PSNR and SSIM of MSCM and the six compared methods on the datasets Set5, Set14, BSD100 and Urban100 are shown in Table 2, in which the maximal values are boldface and the second-best are blue. For all four datasets with ×2, ×3 and ×4 scale factors, the proposed method achieves the highest PSNR and SSIM. On all scale factors, MemNet obtains the second-best PSNR and SSIM for Set5 and Set14, and the second-best PSNR for BSD100 and Urban100. For Urban100 with scale factors ×2, ×3 and ×4, MSCM shows notable PSNR gains over MemNet of 0.74 dB, 0.63 dB and 0.59 dB, respectively. Figure 9 and Figure 10 show subjective comparisons of the reconstructed images with scale factors ×3 and ×4, respectively, for images from Urban100. Observing the ×3 reconstruction detail of the high-rise building in Figure 9 and comparing the wall texture locally, the texture reconstructed by our algorithm is the closest to the HR image and its contours are distinct, while the textures reconstructed by the other algorithms are disordered and the overall images are blurred. In Figure 10, a detailed comparison of the station building at ×4 magnification is performed. Comparing the local details of the window, MSCM yields the clearest window outline and restores edge details well, while the other methods cannot effectively restore the contours.
The ultimate goal of theoretical research is practical application, and the background of an actual image is more complex than that of any dataset used in theoretical research, so actual images better measure the strengths and weaknesses of reconstruction methods. Satellite remote sensing images are strongly disturbed by external factors, making reconstruction more difficult. We use an image acquired by China's Beijing-2 satellite with a ×4 scale factor to verify the proposed MSCM and compare its result with the 6 mainstream methods. Figure 11 shows the reconstruction results: the large image on the left is the actual HR image, and (a)–(h) are partial enlargements of the results of the compared methods and MSCM. MSCM achieves the best reconstruction, with high sharpness and clear edge contours.
These experimental results show that our method performs better as the reconstruction scale factor grows: it recovers contour features well and produces better reconstructed images.

2) COMPARISON WITH THE LATEST METHODS
The latest methods MFFRnet [4], FDSR [5], MGEP-SRCNN [15], EEDS [21] and HCNN [22] achieve good reconstruction results, but their authors have not released source code. For an objective and fair comparison, we directly compare the results reported in their papers with those of our MSCM.
The average PSNR and SSIM values of the five compared algorithms and MSCM on the four datasets are listed in Table 4. We also compare the model complexity and running time of VDSR [2], DRCN [6] and MSCM, evaluating the three methods on reconstructions with the scale factor ×4 for the datasets Set5, Set14, BSD100 and Urban100. VDSR [2] and DRCN [6] are chosen because MSCM has essentially the same number of 3 × 3 convolutional layers as they do, i.e., similar model depths: VDSR and DRCN both have 20 layers, and MSCM has 21. As Table 3 shows, although MSCM is one layer deeper than VDSR and DRCN, its model parameters are 205 Kb and 1314 Kb fewer, respectively. MSCM is also the fastest, mainly because our method fully uses shallow-layer features and does not need to preprocess the input LR image before training. Compared with VDSR and DRCN, our algorithm has more obvious advantages on large-scale datasets and is more suitable for large-scale-factor image reconstruction.

V. CONCLUSION
We propose an image reconstruction method for a single LR image, which integrates the ideas of multi-scale convolution, residuals, cascading and fusion. The use of multi-scale convolution kernels and residual connections facilitates the comprehensive extraction and fusion of multiple kinds of information, which helps characterize the feature information of an LR image fully and lays the foundation for inferring high-frequency information. The cascade structure realizes supervised information extraction and prediction, and establishes an effective LR-to-HR mapping for super-resolution reconstruction. The multi-scale information fusion structure promotes information exchange between modules and infers the missing high-frequency components by combining multi-contextual information with the learned mapping. For the four mainstream datasets Set5, Set14, BSD100 and Urban100 with ×2, ×3 and ×4 scale factors, our MSCM achieves the highest PSNR and SSIM, especially on Urban100. For example, for Urban100 with the ×2 scale factor, the PSNR of MSCM outperforms MFFRnet, FDSR and MGEP-SRCNN by 1.16 dB, 1.03 dB and 0.89 dB, respectively. This is mainly because the images of Urban100 are larger with clear edge contours, which suits our algorithm and shows that it can fully extract the shallow features of the LR image. Meanwhile, MSCM performs outstandingly in large-scale-factor image reconstruction: it recovers contour features well and produces better reconstructed images.