DSMA: Reference-Based Image Super-Resolution Method Based on Dual-View Supervised Learning and Multi-Attention Mechanism

Reference-based image super-resolution methods (RefSR) have made rapid and remarkable progress in the field of image super-resolution (SR) in recent years by introducing additional high-resolution (HR) images to enhance the recovery of low-resolution (LR) images. Existing RefSR methods rely on implicit correspondence matching to transfer HR textures from the reference image (Ref) to compensate for the information loss in the input image. However, the differences between low-resolution input images and high-resolution reference images still hinder the effective utilization of Ref images, so making full use of the information in Ref images to improve SR performance remains an important challenge. In this paper, we propose an image super-resolution method based on dual-view supervised learning and a multi-attention mechanism (DSMA). It enhances the learning of important detail features of Ref images and weakens the interference of noisy information by introducing the multi-attention mechanism, while employing dual-view supervision to motivate the network to learn more accurate feature representations. Quantitative and qualitative experiments on three benchmarks, i.e., CUFED5, Urban100 and Manga109, show that DSMA outperforms state-of-the-art baselines with significant improvements.


I. INTRODUCTION
Image super-resolution (SR) is a fundamental computer vision task that aims to recover natural high-frequency details from a given low-resolution image [1]. It is widely used in fields that require high image quality, such as medical imaging, satellite surveying, surveillance, and security [2]-[4], and it also benefits other computer vision tasks. In general, the problem is very challenging and inherently ill-posed due to the information inherently lost between LR and HR images.
The study of image super-resolution is usually divided into two types: single image super-resolution (SISR) and reference-based image super-resolution (RefSR). Traditional SISR methods are mainly based on interpolation, filtering and dictionary learning [5]-[7], which use manually designed mapping or learning strategies to recover images. Since these approaches are homogeneous and depend heavily on human experience, it is difficult for them to recover satisfactory images when faced with unknown degradations of the LR image. SRCNN [8] is the first deep learning method applied to the SISR problem; it uses a three-layer convolutional neural network (CNN) to learn the mapping from LR images to HR images, providing new insights for SR research. To better learn this mapping, Dong et al. [9] improved SRCNN by moving the upsampling operation to the end of the network and using smaller convolutional kernels to extract features, which greatly improved the speed of SR. Kim et al. [10] proposed a 20-layer network that uses a larger receptive field to extract more information and generate higher-quality images. Ledig et al. [11] introduced residual connections to overcome the problem of gradient propagation: skip connections and identity mappings avoid vanishing and exploding gradients when training deeper networks. Kim et al. [12] proposed DRCN, which uses a recursive network structure to deepen the network and expand its receptive field without increasing the number of parameters. Ledig et al. [11] also proposed SRGAN, which introduces generative adversarial networks and uses perceptual loss and adversarial loss to enhance the realism of the recovered images. ESRGAN [13] enhances SRGAN by introducing RRDB blocks with a relativistic adversarial loss.
Although these deep learning models have achieved good results in image super-resolution, classical SISR methods usually produce blurring effects or visual artifacts due to the inherent information gap between LR and HR images.
Recently, reference-based image super-resolution (RefSR) has achieved success in the field of SR; it additionally introduces HR images as Ref and provides finer details to the LR image by transferring the texture features of Ref, achieving good reconstruction performance. Traditional RefSR methods such as CrossNet [14] and SRNTT [15] rely on explicit alignment or patch matching, while recent works TTSR [16] and MASA [17] further introduce the idea of the transformer [18] to discover deep feature correspondences between LR and Ref images through the attention mechanism and thus transfer more accurate textures. However, noise remaining in the transferred texture features negatively affects the subsequent fusion with low-resolution image features, the convergence of the network, and the final results.
To address these problems, we propose a RefSR method called DSMA, which enhances the effective use of Ref information, improves the quality of feature propagation, and enables the network to learn more accurate feature representations. The design of DSMA has several advantages. First, inspired by mixed attention [19], we introduce multi-stage attention: the first stage achieves high-quality correspondence matching and extracts similar HR texture features from Ref; the second stage then reinforces the obtained HR texture features to learn important features and suppress the propagation of noisy information. Second, inspired by knowledge distillation [20], we introduce a dual-view supervised learning strategy that quantifies the similarity between the deep features of HR images and the fused features of LR and Ref at different scales; this similarity is applied to the evaluation and loss functions to motivate the network to learn more accurate feature representations.
The main contributions of this paper are as follows.
(1) Based on the idea of mixed attention, we combine the transformer attention and channel attention structures to avoid the interference of noisy information during texture transfer while enhancing the learning of useful features.
(2) We propose a novel supervised strategy for dual-view learning, which enables our method to combine supervised signals at both intermediate feature layer and image layer levels to motivate the network to learn more powerful feature representations.

II. RELATED WORK
A. SINGLE IMAGE SUPER-RESOLUTION
As the mainstream approach to image super-resolution, SISR has long been a hot research topic. The SRCNN [8] proposed by Dong et al. first introduced deep learning into the field of image super-resolution; it uses a three-layer network architecture to learn the LR-to-HR mapping function with a CNN and achieved better results than traditional methods (e.g., interpolation, dictionary learning, etc.) [5]-[7]. Afterwards, more network structures were explored to improve the performance of SR networks. Kim et al. proposed a deeper network to generate higher-quality images. The SRResNet [11] network applied the residual block structure to the super-resolution problem and avoided the vanishing and exploding gradients caused by overly deep networks. Tong et al. [21] introduced the dense block structure, combining low-level and high-level features to improve super-resolution performance. Zhang et al. [22] further combined residual and dense blocks into the RDB module to enhance the flow of information and gradients. Zhang et al. [23] proposed RCAN, which introduced attention to enhance reconstruction performance by assigning different weights to the feature information learned by the network. These methods use mean square error (MSE) or mean absolute error (MAE) as their objective function, which ignores human perception. SRGAN [11] introduced generative adversarial networks, using perceptual loss and adversarial loss to improve the realism of images. Recently, RSRGAN [24] introduced a Ranker mechanism to optimize the perceptual loss and obtain better visual quality.

B. REFERENCE-BASED SUPER-RESOLUTION
Different from SISR, RefSR does not rely only on the prior knowledge learned by the model to recover the image; it also enhances image recovery with the additional high-resolution information provided by an introduced reference image. Because of this property, RefSR overcomes the inherent information deficiency between LR and HR images and makes the recovered images more realistic, so it has become a hot research topic in recent years.
Although reference-based image super-resolution can provide a large amount of high-frequency information, there are scale and content differences between the input low-resolution image and the high-resolution reference image due to parallax and resolution, which makes it difficult to utilize the high-frequency details directly. Therefore, establishing the correspondence between LR and Ref images becomes the key to the success of RefSR. One branch of RefSR performs spatial alignment between LR and Ref images. Zheng et al. [14] proposed a cross-scale warping layer that uses optical-flow alignment to reverse-warp Ref features into spatial alignment with the LR features, after which the LR and transformed Ref features are sent to the decoder for fusion; however, the flow is obtained by a pretrained network, which leads to computationally intensive and inaccurate estimation. Shim et al. [25] further proposed to overcome the shortcomings of optical-flow estimation by using deformable convolution to align and extract Ref features, but the performance of this method depends heavily on the quality of the alignment between LR and Ref images and is limited in finding long-distance correspondences.
Another branch of RefSR follows the idea of patch matching. Zhang et al. [15] proposed SRNTT, an adaptive patch-matching approach that swaps similar texture features by matching the VGG features of the LR and Ref images, which remains robust even when the provided Ref images are uncorrelated. However, it ignores the correlation between the LR features and the swapped features and feeds all swapped features to the main network equally. Recent works TTSR [16] and MASA [17] introduce the idea of the transformer [18] to mine deep correspondences in the LR-Ref feature space more accurately and transfer texture features from Ref; MASA also designed a coarse-to-fine matching scheme to reduce the computation of the matching process. However, these methods ignore the fact that the transferred features bring a large amount of noisy information along with the useful information, which consumes computational resources and can degrade SR performance. Therefore, further screening of this feature information is needed to achieve more reasonable utilization.

C. ATTENTIONAL MECHANISMS
In recent years, attention mechanisms have been shown to play a superior role in several computer vision tasks such as image classification [26], [27], semantic segmentation [28], target detection [29] and image super-resolution [16], [23].
Hu et al. [27] proposed the squeeze-and-excitation block, which significantly improves image classification performance by explicitly modeling the interdependence between channels and adaptively recalibrating the channel feature responses. Liu et al. [30] proposed non-local operations that compute the interaction between any two positions in space to capture long-range dependencies. Woo et al. proposed CBAM [19], a lightweight attention module that performs attention in both the channel and spatial dimensions with almost no increase in parameters. Yang et al. [16] proposed a texture transformer network for image super-resolution, in which spatial attention between the LR and Ref images is computed to transfer texture information.
Although current advanced RefSR algorithms use spatial attention to enhance the transfer of features from Ref to compensate for LR images, the differences between low-resolution input images and high-resolution reference images still pose a great challenge to the effective utilization of Ref images. To address this problem, we propose a method called DSMA, which introduces a multi-attention mechanism to enhance the learning of important features from Ref images while using dual-view supervision to motivate the network to learn more accurate feature representations.

III. OUR APPROACH
In this section, we introduce the proposed image super-resolution method based on dual-view supervised learning and a multi-attention mechanism (DSMA). We introduce the overall structure of the network in Subsection A, the proposed multi-attention mechanism in Subsection B, the dual-view supervised learning strategy in Subsection C, and implementation details in Subsection D.

A. NETWORK STRUCTURE
As shown in Fig. 1, DSMA consists of an encoder that extracts multi-scale features from the LR and Ref images, the proposed multi-attention modules that match, transfer, and refine the Ref textures, and a decoder that fuses the transferred features with the LR features scale by scale to reconstruct the SR image.

B. MULTI-ATTENTION MECHANISM
Inspired by CBAM [19], we combine the transformer attention [18], which achieves correspondence matching over spatial locations, with the channel attention structure [23] to form a multi-attention module. The multi-attention module is divided into a feature selection attention module (FSAM) and a feature adaptation attention module (FAAM), which we introduce separately below.

1) Feature selection attention module
In order to obtain useful information from the Ref image to compensate the recovery of the LR image, we introduce the FSAM to achieve correspondence matching between the LR and Ref images. As shown in Fig. 2, the LR and Ref features are unfolded into patches, and the relevance r_{i,j} between the i-th LR patch q_i and the j-th Ref patch k_j is estimated by the normalized inner product:

r_{i,j} = < q_i / ||q_i|| , k_j / ||k_j|| >

The relevance is further used to obtain the hard-attention map H and the soft-attention map S:

h_i = argmax_j r_{i,j},   s_i = max_j r_{i,j}

where the value of h_i can be considered as an index, representing the position in Ref most relevant to the i-th position in the LR image. We then apply an index selection operation to the unfolded patches of F_Ref, using the obtained hard-attention map as the index:

t_i = F_Ref^{h_i}

where t_i denotes the value of T at the i-th position, which is selected from the h_i-th position of F_Ref. Finally, the soft-attention map weights the transferred texture T:

F = T ⊙ S

where F denotes the features obtained from Ref for transfer, and ⊙ denotes element-wise multiplication between feature maps.
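The matching-and-transfer step above can be sketched in a framework-agnostic way. The following NumPy sketch is illustrative only (the function name, toy shapes, and the flat patch layout are assumptions; the actual implementation operates on unfolded multi-scale CNN features):

```python
import numpy as np

def feature_selection_attention(q, k, v):
    """FSAM sketch: q are unfolded LR patches (N_lr, d), k/v are unfolded
    Ref patches (N_ref, d). Returns transferred patches and soft attention."""
    # Normalized inner product gives the relevance r_{i,j}
    qn = q / np.linalg.norm(q, axis=1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=1, keepdims=True)
    r = qn @ kn.T                 # relevance map, shape (N_lr, N_ref)
    h = r.argmax(axis=1)          # hard-attention map: index of best Ref patch
    s = r.max(axis=1)             # soft-attention map: confidence of each match
    t = v[h]                      # index selection: transfer the matched patches
    return t, s
```

In the full network, T would be re-folded into a feature map and multiplied element-wise by S before being passed to the FAAM.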

2) Feature adaptation attention module
The previous subsection described the selection of features from the Ref image; this subsection describes the further analysis and processing of the selected features. Inspired by RCAN [23] and SENet [27], we propose the feature adaptation attention module shown in Fig. 3, which stacks a channel attention module on residual blocks to filter the features obtained in the previous step along the channel dimension, scaling the features of each channel by adaptive learning so as to enhance important features and weaken noise.

Generating a distinct attention weight for each channel is the key step of this process. For a feature map x of size C × H × W, the spatial information is first aggregated using average pooling and max pooling to generate two different channel descriptors, z_avg and z_max, whose c-th elements are computed from all elements of that channel:

z_avg^c = H_AP(x_c) = (1 / (H × W)) Σ_{i=1}^{H} Σ_{j=1}^{W} x_c(i, j),   z_max^c = H_MP(x_c) = max_{i,j} x_c(i, j)

where x_c(i, j) is the value of the c-th channel feature at (i, j), and H_AP(·) and H_MP(·) are the global average pooling and global max pooling functions. The two channel descriptors then adaptively learn the channel weights through a shared network:

s = σ( W_U δ( W_D z_avg ) + W_U δ( W_D z_max ) )

where σ(·) and δ(·) denote the Sigmoid and ReLU functions, respectively, W_D is the weight of the first layer of the shared network, and W_U is the weight of the second layer. The first layer reduces the number of channels by a reduction ratio r; after ReLU activation, a channel-upscaling layer restores the channel dimension with ratio r; finally, the Sigmoid function maps the weight of each channel into [0, 1]. Downscaling and then upscaling the channel dimension in this way reduces the number of network parameters, while the ReLU and Sigmoid activations allow the nonlinear relationship between channel descriptors and channel weights to be learned.
The obtained channel weights then act on the input feature map x to focus differently on each channel:

x̂_c = s_c · x_c

where s_c and x_c are the weight and feature map of the c-th channel. Finally, we use a residual block to apply the channel attention to the feature F and obtain the final transferred features F̂_Ref:

F̂_Ref = F + CAM( W_2 δ( W_1 F ) )

where W_1 and W_2 are the weights of the two convolutional layers in the residual block and CAM(·) denotes the channel attention operation.
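The channel attention computation can be sketched as follows. This is a minimal NumPy illustration of the CBAM-style pooling and shared two-layer network described above, not the paper's implementation; the weight matrices are passed in explicitly rather than learned:

```python
import numpy as np

def channel_attention(x, w_d, w_u):
    """Channel attention sketch.
    x:   (C, H, W) feature map
    w_d: (C//r, C) channel-reduction weights (first shared layer)
    w_u: (C, C//r) channel-upscaling weights (second shared layer)
    """
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    relu = lambda a: np.maximum(a, 0.0)
    z_avg = x.mean(axis=(1, 2))   # global average pooling -> (C,)
    z_max = x.max(axis=(1, 2))    # global max pooling -> (C,)
    # Shared two-layer network applied to both descriptors, outputs summed
    s = sigmoid(w_u @ relu(w_d @ z_avg) + w_u @ relu(w_d @ z_max))
    return s[:, None, None] * x   # rescale each channel by its learned weight
```

In FAAM this operation sits inside a residual block, so poorly matched channels are suppressed without blocking the identity path.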

C. DUAL-VIEW SUPERVISED LEARNING STRATEGY
Since the RefSR model has a more complex network structure than general SISR models, using only the supervised signal of the final image layer for optimization makes gradient propagation difficult and hinders the training of the network. To address this challenge, we propose a dual-view supervised learning strategy to obtain more accurate feature representations. Specifically, at the intermediate feature layer, we employ an additional triple supervision loss that minimizes the distance between the features of the HR image and the fused features of LR and Ref at three scales:

L_sup = Σ_{n=1}^{3} || Φ_encoder^n(I_HR) − F_fusion^n ||_1

where I_HR denotes the real image, Φ_encoder^n(·) denotes the n-th layer of the encoder, and F_fusion^n denotes the fusion of the LR features and the transferred features at the n-th scale.
In the final image layer, we use the following three losses for supervision.

Reconstruction loss. The first loss is the reconstruction loss, which measures the difference in pixel space between the HR image and the SR image:

L_rec = || I_HR − I_SR ||_1

where I_HR and I_SR denote the ground-truth image and the network output.

Perceptual loss. The second loss is the perceptual loss, which has been shown to help improve visual quality and has been applied in [11], [31]. Its core idea is to enhance the similarity in feature space between the predicted and target images:

L_per = (1 / (C_i H_i W_i)) || φ_vgg^i(I_HR) − φ_vgg^i(I_SR) ||_2^2

where φ_vgg^i is the feature map of the i-th layer of the VGG network, with size C_i × H_i × W_i. Here we use the conv5_4 layer, the last convolutional layer of the VGG network.

Adversarial loss. The adversarial loss [32] L_adv is effective in generating visually pleasing images with natural details. Its core idea is to use the adversarial game between a generative model G and a discriminative model D to produce better outputs.

Finally, we combine the supervised signals at both the intermediate feature layer and the image layer to learn more powerful feature representations:

L_total = λ_sup L_sup + λ_rec L_rec + λ_per L_per + λ_adv L_adv

The model training algorithm for image super-resolution based on dual-view supervised learning and the multi-attention mechanism is shown in Algorithm 1.
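The dual-view objective can be sketched numerically. The following is a simplified illustration (function names are hypothetical, and the perceptual and adversarial terms are passed in as precomputed scalars for brevity); the weights follow the settings reported in the implementation details:

```python
import numpy as np

def l1(a, b):
    """Mean absolute error between two arrays."""
    return float(np.mean(np.abs(a - b)))

def dual_view_loss(hr_feats, fused_feats, i_hr, i_sr, l_per=0.0, l_adv=0.0,
                   lam_sup=0.5, lam_rec=1.0, lam_per=1.0, lam_adv=0.005):
    """Combine the feature-level and image-level supervision views."""
    # Feature-level view: L1 distance at each of the encoder scales
    l_sup = sum(l1(f_hr, f_fu) for f_hr, f_fu in zip(hr_feats, fused_feats))
    # Image-level view: pixel-space reconstruction loss
    l_rec = l1(i_hr, i_sr)
    return lam_sup * l_sup + lam_rec * l_rec + lam_per * l_per + lam_adv * l_adv
```

Because L_sup acts directly on intermediate features, gradients reach the encoder and fusion modules without traversing the whole decoder, which is the motivation given above for the dual-view design.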

D. IMPLEMENTATION DETAILS
The encoder network contains three building blocks, each consisting of one convolutional layer and four ResBlocks, to output texture features in three different scales. In the feature selection attention module, to reduce the consumption of both time and GPU memory, the relevance embedding is only applied to the smallest scale and further propagated to other scales. The feature adaptation attention module uses three networks with the same network structure and different parameters to perform adaptive scaling of the information at each scale in parallel. The decoder network contains multiple convolutional layers, deconvolutional layers, ResBlocks [22], and feature fusion modules [17], where the fusion is first applied to the smallest scale and propagated upward to other scales one by one.

IV. EXPERIMENTS
A. DATASETS AND SETTINGS
In this paper, we mainly use CUFED5 [15] as the training set, which is the benchmark dataset proposed in SRNTT and widely used in subsequent RefSR studies. CUFED5 consists of 11,871 training image pairs, each containing an original HR image and a reference image with 160×160 resolution. To compare with other SR models, we follow previous works and apply bicubic downsampling with a scaling factor of 4× to the original HR images to obtain the LR images, and we augment the training images by random horizontal and vertical flipping followed by random rotation of 90°, 180° and 270° to enhance the robustness of our model during training. To investigate the generalization capacity of our model, we conduct experiments on three widely used benchmarks: the CUFED5 testing set, Urban100 [33] and Manga109 [34]. The CUFED5 test set contains 126 test images, each corresponding to 4 reference images with different similarity levels. Urban100 contains 100 building images without references; each image takes its own LR image as the reference so that the network can exploit the self-similarity of the input images. Manga109 also lacks reference images; since the dataset consists of simple lines and flat colored areas, randomly sampled HR images from the dataset are used as references. All PSNR and SSIM results are evaluated on the Y channel of the YCbCr color space.
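Y-channel PSNR, the evaluation protocol named above, can be computed as follows. This sketch assumes 8-bit RGB inputs and the BT.601 luma coefficients commonly used in SR evaluation; the exact conversion used by the paper is not stated:

```python
import numpy as np

def rgb_to_y(img):
    """ITU-R BT.601 luma from 8-bit RGB values in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(hr, sr, peak=255.0):
    """PSNR evaluated on the Y channel only."""
    mse = np.mean((rgb_to_y(hr) - rgb_to_y(sr)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```

Evaluating on Y alone follows the convention of the compared methods, since luma carries most of the structural detail that SR aims to restore.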
The discriminator uses the same structure as [17]. During training, we use Adam [35] as the optimization algorithm, with parameters β_1 and β_2 set to 0.9 and 0.999, respectively. The initial learning rate is set to 10^-4 and varied with a cosine annealing strategy, decreasing to 10^-6 every 500 epochs. The weight coefficients λ_sup, λ_rec, λ_per and λ_adv are set to 0.5, 1, 1, and 0.005, respectively. The proposed DSMA is implemented in the PyTorch framework and trained on an NVIDIA 2080Ti GPU.
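The paper describes the schedule only loosely ("decreases to 10^-6 every 500 epochs"). One plausible reading, sketched below as an assumption rather than the paper's exact schedule, is a cosine decay from 10^-4 to 10^-6 repeated every 500 epochs:

```python
import math

def cosine_annealed_lr(epoch, period=500, lr_max=1e-4, lr_min=1e-6):
    """Cosine annealing from lr_max down to lr_min over each `period` epochs
    (restarting at lr_max when a new period begins)."""
    t = (epoch % period) / period          # position within the current period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t))
```

PyTorch's built-in CosineAnnealingLR with T_max=500 and eta_min=1e-6 would express the same idea in the framework the paper uses.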

B. EVALUATION
To evaluate the effectiveness of DSMA, we compare our model with other state-of-the-art SISR and RefSR methods. The SISR methods include SRCNN [8], MDSR [36], RDN [22], RCAN [23], SRGAN [11], ENet [37], ESRGAN [13] and RSRGAN [24]; the RefSR methods include SRNTT [15], TTSR [16] and MASA [17]. All models are trained on the CUFED5 training set and tested on the CUFED5 testing set, Urban100 [33] and Manga109 [34]. All experiments use a scaling factor of 4× between LR and HR images.

Quantitative evaluations. For a fair comparison with other MSE-minimization-based methods on PSNR and SSIM, we train another version of DSMA by minimizing only the reconstruction loss, denoted DSMA-rec. Table 1 shows the quantitative comparison of PSNR and SSIM, where the best and second-best results are indicated in red and blue, respectively. As shown in Table 1, our model outperforms the state-of-the-art methods on all three benchmark datasets. In addition, we compare our method with others in terms of network parameters, as shown in Fig. 4, evaluated on the CUFED5 dataset with a scaling factor of 4×, where the red dot represents the network proposed in this paper. Among networks with fewer than 5000K parameters, DSMA-rec obtains the best SR results, which shows that our method balances the number of parameters and the reconstruction performance well.

Qualitative evaluations. To examine the visual reconstruction quality more intuitively, we also compare DSMA with other mainstream models: ESRGAN, RSRGAN, TTSR and MASA. Fig. 5 shows the comparison of these methods on the reconstruction of human faces, buildings, numbers, letters, and graphics. As shown in the top-left example, DSMA successfully recovers the exact words "n busine one site" while the other methods fail.
Besides, as shown in the middle example, our approach restores a more realistic facial expression, while the other RefSR methods produce more severe artifacts or blurring.

V. ABLATION STUDY
In this section, we conduct several ablation experiments to analyze our proposed method. We validate the effectiveness of the proposed attention module and the dual-view supervised strategy, and also analyze the effect of reference images with different similarity levels.

Attention module. The role of the feature adaptation attention module (FAAM) is to weaken the effect of noisy information in the Ref features and enhance the effect of important information. As shown in Fig. 6, we compare the training curves with and without the attention module: our model converges faster and reaches a higher PSNR after convergence.
The effect of the FAAM is further illustrated in Fig.7. As shown in the left side of Fig.7, the baseline pays more attention to texture details on channel 10 and less details on channel 11, while our DSMA shows more attention to channel 11 and less attention to channel 10. This shows that our model can adaptively scale the channels according to the importance of the channel information. The right side of the figure shows the final recovery effect of the image. The DSMA with channel attention has an advantage in PSNR/SSIM values, and our DSMA can recover more realistic leaves in terms of visual perception, while the baseline shows a more blurred effect.
We also compare FAAM with other attention methods (including RCAN [23] and SENet [27]). As shown in Table 2, FAAM improves PSNR by 0.09 dB, 0.11 dB, and 0.37 dB over the baseline (without channel attention) on the three public datasets, while introducing only 0.22M additional parameters. All three methods obtain performance improvements on the CUFED5 dataset compared to the baseline, which suggests that introducing channel attention may enhance the fitting ability of the model. However, SENet performs poorly on the other two datasets, even falling below the baseline, while RCAN and FAAM achieve improvements on all three datasets, which suggests that the residual structure may enhance the generalization ability of the model.

Supervision loss. The role of the supervision loss is to constrain the fusion features to be similar to the HR features; its effect is shown in Fig. 8.

Reference similarity levels. As shown in Table 4, DSMA achieves the best results regardless of the relevance level of the reference image, with the L1 reference images yielding the best performance.

User study. We conducted a user study to qualitatively demonstrate the superiority of our method. In total, 10 users were asked to compare the visual quality of our method with state-of-the-art techniques on the CUFED5 dataset, including ESRGAN, RSRGAN, TTSR, and MASA. For each comparison we showed two images, one of which was the result of our method, and asked the users to choose the one with better visual quality. As shown in Fig. 9, more than 80% of the users judged that our method provided better results than the existing methods.

VI. CONCLUSION
In this paper, we propose an image super-resolution method based on dual-view supervised learning and a multi-attention mechanism (DSMA), which uses the attention mechanism to search the Ref image for HR features that compensate for the LR recovery, weakening the noisy information that harms recovery and enhancing the important information that helps it. In addition, we introduce a dual-view supervised approach that integrates supervised signals at both the feature and image layers to optimize the network toward a more robust feature representation. In the future, we will further investigate new attention structures that reduce GPU memory usage and explore using easier LR-HR matching to guide LR-Ref matching to achieve better performance.
YE WANG was born in 1984. He received the M.S. degree in Electrical Engineering and its Automation from the State Grid Institute of Electric Power Science, in 2009. He is mainly engaged in technical research, project management and standards preparation for information systems and network security, and has participated in the preparation of many corporate standards of the State Grid Corporation, such as "Network Security Risk Monitoring and Warning Platform Security Monitoring Data Specification".