Image Super-Resolution Algorithm Based on RRDB Model

Aiming at the texture distortion and blurred details produced by existing image super-resolution reconstruction methods, a super-resolution reconstruction network based on a multi-channel attention mechanism is proposed. In the texture extraction module, the network designs an extremely lightweight multi-channel attention module that, combined with one-dimensional convolution, realizes cross-channel information interaction and focuses on important feature information. The texture recovery module introduces dense residual blocks to restore high-frequency texture details, improve model performance, and generate high-quality reconstructed images. The proposed network effectively improves the visual quality of the image: on the benchmark data set CUFED5, compared with the classic convolutional-neural-network-based super-resolution method SRCNN, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) increase by 1.76 dB and 0.062, respectively. Experimental results show that the designed network improves the accuracy of texture transfer and effectively improves the quality of the generated images.


I. INTRODUCTION
Single-image super-resolution reconstruction is a technique for recovering a high-resolution image from a low-resolution image. High-resolution images are widely used in remote sensing and mapping, medical imaging, video surveillance, and image generation [1]-[3]. Due to current technological limitations and cost considerations, obtaining higher-resolution images through software processing has become a research hotspot in the field of image processing.
Traditional interpolation [4] and reconstruction-based [5] methods usually suffer from poor reconstruction quality and blurred edges. With the development of technology, attention has turned to deep learning. Yang et al. [6] introduced deep learning into the field of image reconstruction for the first time, proposing the super-resolution convolutional neural network (SRCNN), which achieves reconstruction with a three-layer convolutional network. Jun [7] proposed the efficient sub-pixel convolutional neural network (ESPCN), a sub-pixel convolution method that requires no preprocessing of the low-resolution image.
The low-resolution image is used directly as the network input for feature extraction, and the feature maps are rearranged in the last layer to realize up-sampling, which reduces the destruction of the low-resolution image's context information and preserves as much feature information as possible. For convolutional networks, deeper networks generally have stronger processing capability: in image processing, a deep network can extract the feature information in the image more fully and improve the result. In practice, however, increasing the number of layers causes gradient dispersion. Songsheng et al. [8] combined the residual network [9] and proposed the very deep network for super-resolution (VDSR) to alleviate this problem. Yonggong et al. [10] proposed the residual dense network (RDN); by interconnecting and fusing multiple residual dense blocks, it extracts feature information more effectively and improves reconstruction quality. Xin et al. [11] constructed a cascaded channel splitting network (CSN), distributing feature information across sub-networks to reduce the learning burden of the deep network and improve training. Currently, learning-based methods have become the focus of research in the field of image reconstruction.
Unlike traditional SISR, RefSR extracts the texture of a reference (Ref) image to compensate for the missing details in the LR image, so that the generated HR image has more detailed and realistic texture. For example, in 2018 Rui et al. [12] proposed Super-Resolution by Neural Texture Transfer (SRNTT), which performs local texture matching in the feature space and then transfers the matched texture to the final output through a deep model. In 2020, Wei et al. [13] proposed the Texture Transformer Network for Image Super-Resolution (TTSR), which encourages joint feature learning from LR and Ref images and discovers deep feature correspondences through an attention mechanism to deliver accurate texture features [14]. However, when restoring texture, these models still suffer from problems such as face distortion and unrealistic texture.
To solve the above problems, inspired by the Efficient Channel Attention (ECA) mechanism in the literature [15], this paper proposes an image super-resolution network based on a multi-channel attention mechanism (Super-Resolution by multi-Channel Attention, SRCA). Compared with most current RefSR methods, SRCA recovers image details better [15].
The multi-channel attention mechanism is combined with the texture search module: local cross-channel information interaction is realized through one-dimensional convolution, and different weights are assigned to each feature channel of the input image so that the network focuses on extracting more important feature information and facilitates feature reuse. The texture recovery module introduces dense residual blocks to improve the model structure, removes the batch normalization layers in the dense residual blocks, and uses residual scaling to restore high-frequency details and produce high-quality reconstructed images.

A. NETWORK ARCHITECTURE
The network structure of this paper is shown in Figure 1. The algorithm is mainly realized by cascading a nonlinear feature mapping module and an up-sampling reconstruction module [16]. I_LR and I_SR denote the input and final output of the model, respectively. A convolutional layer with a 3 × 3 kernel extracts the shallow features F_0 of the input LR image:

F_0 = f_{3×3}(I_LR)    (1)

The purpose of using 3 × 3 convolution is to build a lightweight network with fewer parameters. At the same time, in super-resolution tasks, especially in the first layer, a kernel with a large receptive field is not appropriate: since each pixel in the down-sampled image corresponds to a small area of the original image, a large receptive field may introduce irrelevant information during training. The shallow feature F_0 passes through the nonlinear feature mapping module, which is composed of 20 adaptive residual attention information extraction modules (ARB). The nth ARB extracts the feature X_n from the output X_{n−1} of the previous block, as shown in equation (2):

X_n = g_n^{ARB}(X_{n−1})    (2)
In equation (2), g_n^{ARB} denotes the nth adaptive residual attention information extraction module. The feature X_L obtained from the last ARB then passes through a multi-scale up-sampling module, as shown in equation (3):

X_tail = F_tail(X_L)    (3)

where F_tail denotes the multi-scale up-sampling module that up-samples X_L to the target size X_tail. This module integrates the multi-frequency information generated by the nonlinear mapping module and uses sub-pixel convolution to up-sample the image. Sub-pixel convolution was proposed in ESPCN [16] as an up-sampling method; it rearranges the pixels between the channels of a low-resolution feature map into the corresponding high-resolution image. To make the mapping easier to learn, the original low-frequency information is reintroduced to maintain low-frequency accuracy: a bilinear interpolation branch applied to the input produces the global information X_skip:

X_skip = F_bilinear(I_LR)    (4)

Finally, the target image I_SR is obtained by adding X_skip and X_tail:

I_SR = X_tail + X_skip    (5)
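The overall pipeline above (a shallow 3 × 3 convolution, a chain of ARB blocks, a sub-pixel tail and a bilinear skip branch) can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: the ARB modules are replaced by plain residual convolution blocks, and the block count and channel width are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCASketch(nn.Module):
    """Sketch of the cascade: shallow conv (eq. 1), residual blocks (eq. 2),
    sub-pixel tail (eq. 3), bilinear skip (eq. 4), and their sum (eq. 5)."""
    def __init__(self, nf=64, n_blocks=4, scale=4):
        super().__init__()
        self.head = nn.Conv2d(3, nf, 3, padding=1)   # F0: shallow features
        # Stand-ins for the ARB modules: plain residual conv blocks.
        self.body = nn.ModuleList([nn.Sequential(
            nn.Conv2d(nf, nf, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(nf, nf, 3, padding=1)) for _ in range(n_blocks)])
        # Sub-pixel (pixel-shuffle) up-sampling to the target size.
        self.tail = nn.Sequential(
            nn.Conv2d(nf, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))
        self.scale = scale

    def forward(self, lr):
        x = self.head(lr)
        for block in self.body:
            x = x + block(x)                          # X_n from X_{n-1}
        x_tail = self.tail(x)                         # learned high-frequency path
        x_skip = F.interpolate(lr, scale_factor=self.scale,
                               mode='bilinear', align_corners=False)
        return x_tail + x_skip                        # I_SR = X_tail + X_skip

sr = SRCASketch()(torch.randn(1, 3, 16, 16))
```

The bilinear skip branch carries the low-frequency content directly, so the learned path only has to model the high-frequency residual.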

B. TEXTURE EXTRACTION MODULE
The quality of the features extracted by the texture extraction module has a crucial impact on the generalization ability of the model. This model adds the ECA attention mechanism to the VGG19 network [17] to improve the efficiency and quality of feature extraction. Adding the multi-channel attention mechanism before the pre-trained VGG19 feature extractor assigns different weights to each feature channel, improving feature extraction and expressiveness. The structure of the multi-channel attention mechanism is shown in Figure 2. Given an input x ∈ R^{W×H×C}, GAP (Global Average Pooling) first performs dimensionality reduction, a fast one-dimensional convolution with k = 5 then generates the channel weights, and finally the result is multiplied with the input x. Here k is the kernel size of the 1D convolution, representing the coverage of local cross-channel interaction, and σ is the Sigmoid function used to generate the channel weights. The multi-channel attention mechanism uses the correlation between channels to obtain weight values for the feature map, adaptively adjusting the channel features and processing the high- and low-frequency information in the feature map to better extract effective texture.
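The GAP → 1-D convolution (k = 5) → sigmoid → rescale pipeline described above can be written compactly in PyTorch. This is a sketch of the ECA idea as described here, not the authors' code:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention sketch: global average pooling, a fast 1-D
    convolution (k = 5) across the channel axis, a sigmoid producing channel
    weights, and a channel-wise rescaling of the input."""
    def __init__(self, k=5):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                        # x: (N, C, H, W)
        y = x.mean(dim=(2, 3))                   # GAP -> (N, C)
        y = self.conv(y.unsqueeze(1))            # 1-D conv over channels
        w = torch.sigmoid(y).squeeze(1)          # channel weights in (0, 1)
        return x * w.unsqueeze(-1).unsqueeze(-1) # reweight each channel

out = ECA()(torch.randn(2, 64, 8, 8))
```

Because the 1-D convolution slides over the channel axis, each channel's weight depends only on its k nearest neighbours, which is what makes the module "extremely lightweight" (a single kernel of k parameters).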
In the VGG19 network, relu1_1, relu2_1 and relu3_1 are used as multi-scale texture encoders. To speed up matching, this paper performs matching only on the relu3_1 layer and projects the corresponding relationship to relu2_1 and relu1_1, which reduces the amount of computation while preserving the accuracy of texture transfer. Q (query), K (key) and V (value) represent the three basic elements of the attention mechanism inside the texture transfer network. K and Q are extracted only from the relu3_1 layer features of the Ref ↑↓ and LR ↑ images, while V is extracted from the relu1_1, relu2_1 and relu3_1 layers of the Ref image.

C. TEXTURE SEARCH MODULE
The texture search module determines the texture correlation between the input image and the reference image by comparing the relu3_1 layer features of K and Q. First, taking the outputs of K and Q as input, their similarity is computed as the normalized inner product, as shown in equation (6):

S_{i,j} = < Q_i / ‖Q_i‖, K_j / ‖K_j‖ >    (6)
Here Q_i and K_j denote the ith and jth blocks after Q and K are unfolded into H_LR × W_LR and H_Ref × W_Ref blocks, respectively, and S_{i,j} is the similarity between Q_i and K_j. As shown in equation (7), S_{i,j} is used to compute, for each Q_i, the index h_i of the most relevant position and its value r_i, so that the transferred HR texture features can be obtained from the reference image:

h_i = argmax_j S_{i,j},  r_i = max_j S_{i,j}    (7)
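Equations (6) and (7) amount to a cosine-similarity search over unfolded patches followed by a per-row arg-max. A minimal PyTorch sketch, assuming Q and K have already been unfolded into (number-of-patches × dimension) matrices:

```python
import torch
import torch.nn.functional as F

def texture_search(q, k):
    """Texture search sketch: normalized inner product (eq. 6), then for each
    LR patch keep the index h_i of the most relevant Ref patch and its
    similarity score r_i (eq. 7).
    q: (Nq, D) query patches from LR-up; k: (Nk, D) key patches from Ref."""
    q = F.normalize(q, dim=1)      # Q_i / ||Q_i||
    k = F.normalize(k, dim=1)      # K_j / ||K_j||
    s = q @ k.t()                  # S[i, j], shape (Nq, Nk)
    r, h = s.max(dim=1)            # r_i: confidence, h_i: best-match index
    return h, r

h, r = texture_search(torch.randn(10, 32), torch.randn(20, 32))
```

Because both sides are unit-normalized, every r_i lies in [−1, 1] and can later act directly as a confidence weight.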

D. TEXTURE MIGRATION MODULE
The texture migration module transfers the HR texture features of the reference image into the features of the LR image to improve the accuracy of texture generation. This module uses a cross-scale integration method to further stack and merge textures: the texture features at the three scales (1×, 2×, 4×) corresponding to relu1_1, relu2_1 and relu3_1 are fused across scales to alleviate problems such as texture distortion. As shown in equation (8), h_i is used as an index to extract the transferred HR texture feature T from V, and r_i indicates the confidence of the transferred texture feature at each position. Finally, the HR texture features and the LR features are concatenated and fused, and the result is multiplied element-wise by r_i to obtain the output of the texture migration network:

P = Conv(Concat(F, T)) ⊗ R + F    (8)
Here P represents the output of the texture migration network, Conv and Concat denote the convolutional layer and the concatenation operation, respectively, and the operator ⊗ denotes element-wise multiplication of the feature maps.
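A minimal sketch of this transfer step in PyTorch. The fusion convolution `fuse` and the tensor layouts are illustrative assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

def transfer_texture(lr_feat, v, h, r):
    """Texture transfer sketch (eq. 8): gather the HR texture patches of V
    indexed by h, concatenate with the LR features, fuse with a convolution,
    weight element-wise by the confidence map r, and add back the LR features.
    lr_feat: (1, C, H, W); v: (Nk, C) flattened Ref features; h, r: (H*W,)."""
    _, c, hh, ww = lr_feat.shape
    t = v[h].t().reshape(1, c, hh, ww)           # transferred HR texture T
    fuse = nn.Conv2d(2 * c, c, 3, padding=1)     # stand-in fusion conv
    conf = r.reshape(1, 1, hh, ww)               # confidence map R
    return lr_feat + fuse(torch.cat([lr_feat, t], dim=1)) * conf

p = transfer_texture(torch.randn(1, 16, 4, 4), torch.randn(30, 16),
                     torch.randint(0, 30, (16,)), torch.rand(16))
```

Weighting by the confidence map means poorly matched positions fall back toward the plain LR features instead of receiving an unrelated Ref texture.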

E. TEXTURE RECOVERY MODULE
The texture recovery module enlarges the LR image at a 4× scale while restoring part of the texture details, reducing the loss of information and strengthening the texture transfer. As shown in Figure 3, the texture recovery module adopts the residual block architecture of SRResNet (Super-Resolution Residual Network), which is formed by a combination of 3 × 3 convolution blocks, dense residual blocks, and up-sampling blocks. In the experiments, the texture recovery module sets the number of Residual-in-Residual Dense Blocks (RRDB) to 15 in order to repair higher-frequency details. This module combines multi-level residual networks and dense connections to effectively reduce texture loss and better restore image details. RRDB uses a deeper and more complex structure than the original residual block of SRGAN (Super-Resolution Generative Adversarial Network). The RRDB structure is shown in Figure 4; the residual scaling parameter is 0.2. By adjusting the residual scaling parameter, the texture recovery module adaptively balances the fused texture information, effectively improving texture detail transfer and high-frequency detail generation.
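A sketch of an RRDB in PyTorch, following the description above: dense connections without batch normalization, with a residual scaling factor of 0.2 at both the dense-block and the outer residual level. The number of convolutions per dense block (four here) and the growth channels are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Dense block without batch normalization: each 3x3 conv sees the
    concatenation of all previous feature maps (growth channels gc)."""
    def __init__(self, nf=64, gc=32):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(nf + i * gc, gc, 3, padding=1) for i in range(4)])
        self.fuse = nn.Conv2d(nf + 4 * gc, nf, 3, padding=1)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return x + 0.2 * self.fuse(torch.cat(feats, dim=1))  # residual scaling

class RRDB(nn.Module):
    """Residual-in-Residual Dense Block: three dense blocks wrapped in an
    outer residual connection, again scaled by 0.2."""
    def __init__(self, nf=64):
        super().__init__()
        self.blocks = nn.Sequential(*[DenseBlock(nf) for _ in range(3)])

    def forward(self, x):
        return x + 0.2 * self.blocks(x)

y = RRDB(nf=64)(torch.randn(1, 64, 8, 8))
```

The 0.2 scaling keeps the magnitude of the residual branch small, which is what makes stacking many dense blocks trainable without batch normalization.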
The final output image of the model is the sum of the output of the texture recovery module and the output of the texture migration module, as shown in equation (9):

SR_out = P + R    (9)

where SR_out represents the output image of the model and R represents the output image of the texture recovery module.

F. LOSS FUNCTION
The loss function measures the performance of the model. To preserve the spatial structure of the LR image, improve the visual quality of the generated image, and make full use of the rich texture of the Ref image, this paper uses three loss functions: reconstruction loss, adversarial loss and perceptual loss. Reconstruction loss is used in most SR methods; adversarial loss and perceptual loss improve the visual quality of the generated image.
The reconstruction loss is usually measured by the mean square error (MSE) to improve PSNR. In this paper, the L1 norm is used instead: compared with the L2 norm, the L1 norm encourages sparse weights, facilitates feature extraction, is more sensitive, and converges quickly. The reconstruction loss can be expressed by equation (10):

L_rec = (1 / (C·H·W)) ‖ I_HR − I_SR ‖_1    (10)

where (C, H, W) is the size of the HR image, I_HR represents the HR image, and I_SR represents the generated SR image.
The adversarial loss can significantly improve the clarity and visual quality of the generated image. This paper uses WGAN-GP (Wasserstein Generative Adversarial Network with Gradient Penalty), which introduces a gradient penalty to solve the problems of gradient vanishing and gradient explosion during training, converges faster than WGAN (Wasserstein Generative Adversarial Network), and generates higher-quality samples [18]. The adversarial loss can be expressed by equations (11) and (12):

L_G = −E_{x̃∼P_g}[D(x̃)]    (11)

L_D = E_{x̃∼P_g}[D(x̃)] − E_{x∼P_r}[D(x)] + λ E_{x̂∼P_x̂}[(‖∇_{x̂} D(x̂)‖_2 − 1)^2]    (12)

The perceptual loss has been shown to significantly improve visual quality. It computes the loss by comparing convolutional features of the original image with those of the generated image. The perceptual loss in this paper can be expressed by equation (13):

L_per = (1/(C_i H_i W_i)) ‖φ_i^{vgg}(I_HR) − φ_i^{vgg}(I_SR)‖_2^2 + (1/(C_j H_j W_j)) ‖φ_j^{tex}(I_SR) − V_{hj}‖_2^2    (13)

where, in the first term, φ_i^{vgg}(·) represents the feature map of the ith layer of the VGG19 network; in the second term, φ_j^{tex}(·) represents the feature map of the jth layer of the texture extraction module, and V_{hj} represents the HR texture feature of the jth layer extracted and transferred for V. This perceptual loss enables the network to transfer the texture features of the Ref image more effectively.

Table 1 shows the convolution kernel parameter settings of the network, where ''ARB'' represents the adaptive residual attention information extraction module (20 in total), each divided into two branches: the ''RB'' residual branch and the ''CoordA'' coordinate attention branch; ''kernel_size'' represents the size of the convolution kernel; ''input_channel'' and ''output_channel'' represent the number of input and output channels of each convolution kernel, respectively. In addition, up-sampling uses the ''pixelshuffle'' sub-pixel convolution method to enlarge the input image, and the loss function uses L1 regularization to constrain the algorithm.

G. COORDINATE ATTENTION
Coordinate attention applies one-dimensional global average pooling along the horizontal and vertical directions of the feature map, generating direction-aware feature maps in both directions, and then performs separate convolution operations to generate features that focus on location information. Existing attention-based super-resolution network models usually use channel attention and spatial attention. Although these two mechanisms can significantly improve model performance, they do not take position information into account, which is very important for generating spatially selective attention feature maps. Inspired by the literature [19], the coordinate attention network is improved: since the batch normalization layer normalizes the color distribution of the image and destroys its original contrast information, the network removes the batch normalization layer.
As shown in Figure 5, X represents the input feature of shape (nf, 64, 64) after a 1 × 1 convolution, where ''nf'' is the number of channels and 64 is the height and width of the image. Two-directional adaptive average pooling (AdaptiveAvgPool2d) in the horizontal and vertical directions adaptively averages the features over the feature space, which smooths and suppresses noise in the LR image, reduces dimensionality, and retains important information. Two spatial pooling kernels (H, 1) and (1, W) encode each channel along the ordinate and abscissa, respectively. For input X of height h, the output of the cth channel is:

y_c^h(h) = (1/W) Σ_{0≤i<W} x_c(h, i)    (14)

Similarly, for input X of width w, the output of the cth channel is:

y_c^w(w) = (1/H) Σ_{0≤j<H} x_c(j, w)    (15)

After the adaptive pooling layer, a pair of position-sensitive feature maps Y^h and Y^w is obtained. Both have the same number of channels as the input X, with shapes (nf, 64, 1) and (nf, 1, 64). Y^h and Y^w are then merged and passed through a 1 × 1 convolution to obtain f:

f = δ(F_1(cat(Y^h, Y^w)))    (16)

In equation (16), cat() represents the merging operation and δ represents the H_sigmoid nonlinear activation function. The feature f is then split along the spatial dimension into two independent tensors f^h and f^w; after two 1 × 1 convolutional layers, f^h and f^w are mapped back to tensors with the same number of channels as the input X:

g^h = σ(F_h(f^h))    (17)

g^w = σ(F_w(f^w))    (18)

The σ in equations (17) and (18) represents the sigmoid function.
The outputs g_c^h and g_c^w are used as attention weights, and the final output Y of the coordinate attention module is:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)    (19)

This paper uses heat maps of the attention maps of the ''baboon'' image in the Set14 data set at different network depths to show the effect of coordinate attention on the high-frequency details of the image. As shown in Figure 3, ''block1'', ''block9'', ''block15'' and ''block20'' represent the output images of the 1st, 9th, 15th and 20th ARB modules, respectively. When outputting features, the sigmoid function constrains the feature values to [0, 1]; the brighter parts of the image indicate where attention is concentrated. It can be seen from the figure that in the shallow layers of the network, attention focuses on the overall outline of the image, while in the deep layers it is more inclined toward high-frequency details, which shows that coordinate attention is more sensitive to the location of high-frequency details and attends to them more strongly.
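Equations (14)-(19) can be sketched as a PyTorch module. The H_sigmoid placement and the absence of batch normalization follow the text; the reduced channel width `m` and reduction ratio are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    """Coordinate attention sketch: adaptive average pooling along H and W
    separately (eqs. 14-15), concatenation and a 1x1 conv with H_sigmoid
    (eq. 16), a split back into the two directions, and two sigmoid-gated
    1x1 convs whose outputs reweight the input (eqs. 17-19). No batch norm."""
    def __init__(self, c, reduction=8):
        super().__init__()
        m = max(c // reduction, 8)                     # assumed reduced width
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # (N, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # (N, C, 1, W)
        self.conv1 = nn.Conv2d(c, m, 1)
        self.act = nn.Hardsigmoid()                    # delta in eq. (16)
        self.conv_h = nn.Conv2d(m, c, 1)
        self.conv_w = nn.Conv2d(m, c, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        yh = self.pool_h(x)                                   # eq. (14)
        yw = self.pool_w(x).permute(0, 1, 3, 2)               # eq. (15)
        f = self.act(self.conv1(torch.cat([yh, yw], dim=2)))  # eq. (16)
        fh, fw = f.split([h, w], dim=2)
        gh = torch.sigmoid(self.conv_h(fh))                       # eq. (17)
        gw = torch.sigmoid(self.conv_w(fw)).permute(0, 1, 3, 2)   # eq. (18)
        return x * gh * gw                                        # eq. (19)

y = CoordAttention(64)(torch.randn(1, 64, 16, 16))
```

Because g^h has shape (C, H, 1) and g^w has shape (C, 1, W), the final product in equation (19) is realized purely by broadcasting.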

H. ADAPTIVE RESIDUAL ATTENTION INFORMATION EXTRACTION MODULE
The self-calibrated convolution proposed in [15] splits the feature channels and processes the input features with two branches of different functions: the self-calibration branch obtains an attention feature map through the attention module, while the convolution branch performs conventional feature extraction, and the features of the two branches are integrated as the final output. The self-calibration branch focuses only on the position information of the attention map, reducing additional learning parameters. Since low-resolution images contain a large amount of low-frequency information and a small amount of high-frequency information, the low-frequency information is easier to learn. To better enable the model to learn high-frequency information, the skip connection in the residual block adds the low-frequency information to the output features, allowing the model to concentrate on learning the more complex high-frequency information while relieving training pressure. Therefore, inspired by self-calibrated convolution [20], this paper proposes an adaptive residual attention information extraction module, which is the core of the nonlinear mapping module. Different from a single-information-flow design, the module mainly adopts a parallel arrangement of a residual module and a coordinate attention module: the residual module is responsible for extracting the feature information of the LR image [21]-[23], while the coordinate attention branch adaptively generates a position-sensitive feature map; the feature information is then fused to obtain a better result. As shown in Figure 4, ''RB'' represents the residual block and ''CoordAttention'' represents the coordinate attention module.
Before passing through the two branches, the input feature X_{n−1} goes through two 1 × 1 convolutional layers f_1 and f_2, whose main function is to separate the channels, reduce dimensionality and reduce the amount of computation:

H_1 = f_1(X_{n−1})    (20)

H_2 = f_2(X_{n−1})    (21)

H_1 and H_2 in equations (20) and (21) represent the outputs of the two branches after the 1 × 1 convolutions. H_1 and H_2 then pass through the residual module and coordinate attention, respectively; the output features are fused and passed through a 1 × 1 convolutional layer f_3. A skip connection is used to alleviate the vanishing gradient problem, yielding the final X_n, as shown in equation (22):

X_n = X_{n−1} + f_3([RB(H_1), CA(H_2)])    (22)

In equation (22), RB() represents the output of the residual module, CA() represents the output of coordinate attention, and [·] represents the concatenation of the two output features.
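Equations (20)-(22) translate into a small parallel-branch module. In the sketch below, the RB and CoordAttention branches are stand-ins (a single 3 × 3 convolution and an identity), since only the split/fuse/skip wiring of the ARB is being illustrated:

```python
import torch
import torch.nn as nn

class ARB(nn.Module):
    """Adaptive residual attention block sketch: 1x1 convs split the input
    into two halved-channel branches (eqs. 20-21), one branch goes through a
    residual block and the other through coordinate attention (stand-ins
    here), the outputs are concatenated, fused by a 1x1 conv, and a skip
    connection adds the input back (eq. 22)."""
    def __init__(self, nf=64, rb=None, ca=None):
        super().__init__()
        half = nf // 2
        self.f1 = nn.Conv2d(nf, half, 1)     # f1: channel split, eq. (20)
        self.f2 = nn.Conv2d(nf, half, 1)     # f2: channel split, eq. (21)
        # Stand-ins: plain conv for RB, identity for CoordAttention.
        self.rb = rb or nn.Conv2d(half, half, 3, padding=1)
        self.ca = ca or nn.Identity()
        self.f3 = nn.Conv2d(nf, nf, 1)       # f3: fusion

    def forward(self, x):
        h1 = self.rb(self.f1(x))             # residual branch
        h2 = self.ca(self.f2(x))             # coordinate attention branch
        return x + self.f3(torch.cat([h1, h2], dim=1))   # eq. (22)

y = ARB(64)(torch.randn(1, 64, 8, 8))
```

Halving the channels per branch keeps the parallel design roughly as cheap as a single full-width branch, which is the parameter saving the ablation in Table 3 points to.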

I. LOSS FUNCTION
This paper uses the L1 loss function to optimize the network. Compared with the MSE loss, the L1 loss penalizes relative errors less and reconstructs image texture details better. I_LR and I_HR represent the low-resolution and high-resolution images respectively, M represents the size of the training batch, and θ is the parameter set. The loss function of the algorithm is shown in equation (23), using the L1 loss as the constraint:

L(θ) = (1/M) Σ_{i=1}^{M} ‖ f_ARASR(I_LR^i) − I_HR^i ‖_1    (23)

In equation (23), f_ARASR represents the network model of this paper.

A. DATA SET
To test the feasibility of the model, this paper trains and tests it on the recently proposed RefSR data set CUFED5 [24]-[26]. The training set contains 11,842 pairs of pictures, each pair composed of an input image and a reference image. The test set contains 126 groups of pictures, each group composed of an HR image and four reference images. To fully train the network, three methods are used to augment the training data: 1) rotate the image by 90°, 180° and 270°; 2) flip the image horizontally and vertically; 3) crop the LR images to 40 × 40 pixels and the Ref images to 160 × 160 pixels.
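The three augmentations can be sketched with NumPy (the function name and argument layout are illustrative):

```python
import numpy as np

def augment(img, rot, flip_h, flip_v):
    """Augmentation sketch for CUFED5 training pairs: rotation by quarter
    turns (rot in {0, 1, 2, 3} for 0/90/180/270 degrees) plus optional
    horizontal and vertical flips. img: (H, W, C) array."""
    out = np.rot90(img, k=rot, axes=(0, 1))
    if flip_h:
        out = out[:, ::-1]       # horizontal flip
    if flip_v:
        out = out[::-1, :]       # vertical flip
    return np.ascontiguousarray(out)

# LR patches are cropped to 40x40, Ref patches to 160x160.
patch = augment(np.zeros((40, 40, 3)), rot=1, flip_h=True, flip_v=False)
```

The same (rot, flip) choice must of course be applied to the LR patch and its paired Ref patch so the pair stays aligned.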
To evaluate the generalization ability of SRCA beyond CUFED5, model tests were conducted on the CUFED5 [27], Sun80 [28], Urban100 [29], and Manga109 [18] data sets. The Sun80 data set contains 80 natural images, each paired with multiple reference images; the Urban100 data set contains 100 architectural images without references, and due to the high self-similarity of architectural images, the LR image itself is used as the reference image for texture search and transfer; Manga109 contains 109 comic images without reference images, so HR images are randomly selected as reference images in this data set.
This paper also uses the 800 high-quality RGB training images of the public data set DIV2K [30] as a training set, with the four data sets Set5 [17], Set14 [18], BSD100 [19] and Urban100 [20] as the standard test sets. Set5, Set14 and BSD100 contain natural scene images, while Urban100 contains challenging urban scene images with details distributed in different frequency bands. This training set is augmented by rotations of 90°, 180° and 270° and horizontal flips. To evaluate the performance of the model, the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) are calculated on the Y channel (i.e., luminance) of the YCbCr color space.
In the training phase, the Adam optimizer is used to speed up convergence, with parameters β1 = 0.9, β2 = 0.999 and ε = 10^−8. The initial learning rate is set to 1 × 10^−4 and is halved every 200 epochs; the model is trained for a total of 600 epochs with a batch size of 32. The magnification factors are ×2, ×3 and ×4: HR sub-images of sizes 128 × 128, 192 × 192 and 256 × 256 are used as labels, with 64 × 64 LR sub-images as input.
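These training hyper-parameters map directly onto PyTorch's Adam optimizer and StepLR scheduler. The sketch below uses a placeholder model; only the optimizer and schedule settings come from the text:

```python
import torch

# Adam with beta1=0.9, beta2=0.999, eps=1e-8; initial learning rate 1e-4,
# halved every 200 epochs over 600 epochs in total (StepLR reproduces the
# halving schedule).
model = torch.nn.Conv2d(3, 3, 3, padding=1)   # placeholder model
opt = torch.optim.Adam(model.parameters(), lr=1e-4,
                       betas=(0.9, 0.999), eps=1e-8)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.5)

for epoch in range(600):
    # ... one epoch over batches of 32 LR patches of size 64x64 ...
    sched.step()
```

After the three halvings (epochs 200, 400 and 600) the learning rate ends at 1.25 × 10^−5.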

B. EXPERIMENT RESULTS AND ANALYSIS
To evaluate the effectiveness of the model, the SRCA model is compared with other recent SISR and RefSR methods. The SISR methods include SRCNN [4], MDSR (Multi-scale Deep Super-Resolution system) [19], RDN (Residual Dense Network) [20], RCAN (Residual Channel Attention Networks) [21], SRGAN [22], ENet (Efficient neural Network) [23], ESRGAN (Enhanced SRGAN) [24] and RSRGAN (RankSRGAN) [25]. For RefSR, three state-of-the-art methods are used: CrossNet [26], SRNTT [10] and TTSR [11], whose performance is much better than earlier RefSR methods. In all experiments, the LR and HR images are quantitatively evaluated at a 4× magnification factor. To compare the performance of each model fairly, all methods are trained with the settings in TTSR. Adversarial training can obtain better visual quality in SR methods, but it tends to reduce the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). To address this, another model version, SRCA_rec, optimized only for the reconstruction loss, is trained to compare PSNR and SSIM more fairly.
In this section, quantitative and qualitative evaluations of SRCA are carried out. The quantitative evaluation is shown in Table 2. Through comparison, it can be concluded that SRCA has the best performance on the Urban100 and Manga109 datasets. On CUFED5 and Sun80, SRCA can achieve performance comparable to the latest models.
Next, SRCA, bicubic, RDN, RCAN, SRNTT and TTSR are used for super-resolution reconstruction of real pictures. As shown in Figure 8 and Figure 9, most reconstruction methods are prone to artifacts. The face images reconstructed in Figure 8(c), Figure 8(d) and Figure 8(e) are very blurry; those in Figure 8(f) and Figure 8(g) have serious artifacts and inaccurate localization during texture transfer; the image details and textures reconstructed by the model in Figure 8(h) are more realistic and the face contours are clearer. The reconstructed images in Figure 9(d), Figure 9(e) and Figure 9(f) are very blurry. The reconstruction in Figure 9(g) is better than the first three, but the reconstructed number 3 has unclear edge contours and unrealistic details. In contrast, the reconstruction by the proposed model in Figure 9(h) has highly realistic line details and clearer, more natural edge contours.
The pictures restored by SRCA also have higher visual quality, and the Ref image texture search and transfer is more accurate. Even when the similarity between the Ref image and the LR image is low, the SRCA model can still find the corresponding texture and transfer it to the SR image, making the restored image more vivid, as shown in Figure 10 and Figure 11. In Figure 10, the reconstructed window lines in Figure 10(d), Figure 10(f) and Figure 10(g) are blurred, and the reconstructed image in Figure 10(e) is visually improved but the enlarged window glass edges are not clear. The reconstruction by the proposed model is shown in Figure 10(h): the edges of the window glass are clear and pleasing. Face restoration results are shown in Figure 11.
The training performance of the model is compared with TTSR, and the experimental results are shown in Figure 12, which compares the PSNR and SSIM curves of the two networks on the ×4 CUFED5 validation set over 200 epochs of training. Both networks show an upward trend, but the SRCA curve lies above that of TTSR overall; the average PSNR and average SSIM of SRCA are approximately 0.12 dB and 0.0035 higher than those of TTSR. This shows that SRCA performs better under the same amount of training.

C. OBJECTIVE EVALUATION RESULTS AND ANALYSIS
1) PERFORMANCE EVALUATION OF IMPROVED MODULES IN THE ALGORITHM
To verify the effectiveness of the adaptive residual attention information extraction module and the feasibility of the improved coordinate attention network compared with other attention mechanisms, and to ensure the fairness of the experiment, the EDSR [12] algorithm is used as the benchmark model, with 16 modules. ''Baseline'' denotes the EDSR benchmark model composed of 16 residual modules; ''Baseline+CA'' denotes the residual module cascaded with the channel attention module; ''Baseline+CoordA'' denotes the residual module cascaded with the coordinate attention module; ''ARASR'' denotes the residual module and the coordinate attention module connected in parallel, which is the network structure used in this paper. For training fairness, each configuration is trained for 5 × 10^5 iterations. Table 3 shows the PSNR values of the different module combinations on the Set5 data set at 4× magnification. It can be seen from the table that, compared with the other configurations, the proposed adaptive residual attention information extraction module ''ARASR'' clearly reduces the number of parameters while also increasing the PSNR value to some extent, which demonstrates its effectiveness.

2) THE INFLUENCE OF THE NUMBER OF ARB MODULES IN THE ALGORITHM ON THE PERFORMANCE OF THE MODEL
To explore the impact of network depth on model performance, experiments were run with 16, 18, 20, and 22 ARB modules. Figure 13 shows the PSNR values of the resulting models on the Set14 dataset. When the network is not deep enough, the PSNR value is low, likely because the network does not extract sufficiently deep features and therefore cannot learn the high-frequency details of the image. Once the network is deepened beyond a certain point, further deepening no longer increases the PSNR value: the PSNR peaks at 20 ARB modules, where the model's feature-extraction capacity tends to saturate. Therefore, 20 ARB modules are used in this paper.

3) ALGORITHM OVERALL PERFORMANCE EVALUATION
To evaluate the performance of the proposed algorithm, its results are compared with those of SRCNN [5], FSRCNN [6], VDSR [8], DRCN [9], LapSRN [21], DRRN [10], CARN [22], MemNet [11], SRMDNF [23], and EDSR-baseline [19], reporting PSNR, SSIM, and parameter counts at three magnifications: ×2, ×3, and ×4. Table 4 shows the model parameters and objective quality metrics at each magnification. The PSNR and SSIM values of the proposed ARASR algorithm improve on the other algorithms at all three magnifications; in particular, at ×2 magnification its PSNR values on the four datasets exceed those of the benchmark model EDSR by 0.03 dB, 0.14 dB, 0.03 dB, and 0.26 dB respectively, and the model performs especially well on the Urban100 dataset of complex scenes. At the same time, the parameter count is reduced by 818K and 596K compared with CARN and the EDSR benchmark respectively.

To explore the influence of network width on model performance, two configurations were tested: the ARASR-s variant uses 20 ARB modules with 40 input channels, while the proposed algorithm (ARASR) uses 20 ARB modules with 64 input channels. Table 3 shows that network width has a large impact on the model parameter count: at ×4 magnification, ARASR-s reduces the parameter count by 60% compared with ARASR, though model performance also declines. Figure 14 shows the relationship between each algorithm's parameter count and its PSNR value on the Set14 dataset at ×2 magnification.
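The roughly quadratic effect of channel width on parameter count can be checked with simple arithmetic: a 3×3 convolution holds C_in × C_out × 9 weights plus biases, so narrowing the internal convolutions from 64→64 channels to 40→40 channels shrinks each layer by about the reported 60%. A back-of-the-envelope sketch (the layer shapes are an assumption for illustration, not the exact ARB architecture):

```python
def conv2d_params(c_in, c_out, k=3, bias=True):
    """Number of learnable parameters in a k x k 2-D convolution layer."""
    return c_in * c_out * k * k + (c_out if bias else 0)

# Width ablation: a single internal 3x3 conv at the two channel widths.
wide = conv2d_params(64, 64)    # 64*64*9 + 64 = 36,928
narrow = conv2d_params(40, 40)  # 40*40*9 + 40 = 14,440

print(f"reduction: {1 - narrow / wide:.0%}")  # → reduction: 61%
```

The per-layer figure is close to the overall 60% reduction reported for ARASR-s, since the width-dependent convolutions dominate the model's parameter budget.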
The proposed model achieves a good trade-off between performance and parameter count. Figure 15 compares the reconstruction results at ×4 magnification: the zebra-stripe image obtained by bicubic interpolation is blurred overall, and local blur is clearly visible in the reconstructions of SRCNN and the other algorithms, while the proposed algorithm restores the high-frequency details of this irregular zebra-stripe image to the greatest extent and greatly reduces the ringing effect. Figure 17 shows the reconstructions of the images ''img_012'' and ''img_046'' from the Urban100 dataset at ×4 magnification. For ''img_046'', the proposed algorithm restores the edges and texture details of the image better, coming closer to the real image than the other algorithms. For ''img_012'', the other algorithms reconstruct the texture direction incorrectly to some extent and also introduce a checkerboard effect, whereas the proposed algorithm reconstructs the correct texture details more faithfully to the real image.

IV. CONCLUSION
In this paper, an adaptive residual attention network is proposed that achieves lightweight and accurate single-image super-resolution. The experimental results show that the proposed method reduces the number of parameters while achieving the best objective evaluation scores among the compared algorithms; in terms of visual quality, it better suppresses blurring artifacts and reconstructs texture details. The results also show that network width has a large impact on both the parameter count and the performance of the model.