Guided Dual Networks for Single Image Super-Resolution



I. INTRODUCTION
Aiming to recover a high-resolution (HR) image from a single given low-resolution (LR) image, single image super-resolution (SISR) has received critical attention in computer vision research. SISR can be used in various fields, such as security and surveillance [1], [2], medical imaging [3], [4], remote sensing imagery [5], and object recognition [6]-[9]. However, because multiple HR solutions exist for the same low-resolution image, SISR is an ill-posed inverse problem. Most SR methods learn the mapping between LR and HR images to generate high-quality super-resolved images.
In recent years, deep convolutional neural network (CNN) based methods [10], [11] have consistently achieved significant improvements over traditional methods in reconstructing high-quality SR images. Various network architectures have been proposed to improve SR performance, commonly taking the Peak Signal-to-Noise Ratio (PSNR) and/or the Structural Similarity Index (SSIM) [12] as measurements and evaluation indexes. These methods assume that a higher PSNR value implies better quality and less distortion. They usually adopt the optimization objective of minimizing the mean squared error (MSE) between the recovered SR image and the ground truth to maximize PSNR. However, they lack the ability to capture high-frequency features. The resulting high-PSNR estimates are typically over-smoothed and conflict with human visual perception. As shown in Fig. 1, the PSNR-oriented method RCAN [13] has high reconstruction accuracy but generates over-smoothed edges.

The associate editor coordinating the review of this manuscript and approving it for publication was Yonghong Peng.
To improve the visual quality of SR images, several perceptual-driven methods have been proposed to produce visually satisfying results. Generative adversarial networks (GANs) [14] have been applied to SR because of their capability to generate realistic images. Despite their great success, GAN-based methods tend to bring unpleasant noise and artifacts into the generated details. Our main contributions are summarized as follows:
• We employ top-down global guidance to deliver high-level global features from the low-frequency branch to the high-frequency branch for generating detail information.
• Extensive experiments show that our approach achieves state-of-the-art performance on several benchmarks, demonstrating the effectiveness of our network.

II. RELATED WORK

A. SINGLE IMAGE SUPER-RESOLUTION
Since deep learning algorithms have shown superior performance, we mainly focus on deep learning algorithms for the SISR problem. SRCNN [19] is the first successful attempt towards super-resolution using only three convolutional layers. This effort can be considered as the pioneering work in the CNN-based SR field and inspires numerous later works. Replacing the bicubic upsampling operation with an efficient sub-pixel convolution, FSRCNN [20] and ESPCN [21] improve the speed and image quality, which achieve real-time performance. Various advanced upsampling structures have been proposed in recent years, such as deconvolutional layer [22], [23] and EUSR [24]. VDSR [25] and DRCN [26] increase the network depth and achieve better performance, supporting the argument that deeper networks can provide better contextualization. EDSR [27] introduces a very deep and wide network by modifying the ResNet [28] architecture. LapSRN [29] employs a pyramidal framework to progressively predict the residual images up to a factor of 8×. ZSSR [30] uses an unsupervised method to learn the mapping between LR and HR images. SRMDNF [31] tackles multiple degradation problems in a single network by treating degradation maps of images as inputs. Densely connected networks have been proposed to improve SR performance. RDN [32] combines residual skip connections with dense connections, showing good resilience against the degradation process and recovering enhanced SR images. RCAN [13] is a recently proposed deep ResNet network with the channel attention mechanism and achieves state-of-the-art PSNR performance.
The aforementioned methods mainly use the MSE loss as the optimization function to obtain high PSNR and SSIM values. However, these PSNR-oriented methods usually generate heavily over-smoothed edges. The generated images lose various high-frequency details and have poor perceptual quality. To improve the visual quality of SR results, perceptual-driven approaches have been proposed. SRGAN [33] first introduces the GAN framework into the SR problem and produces visually pleasing results. SRGAN combines a perceptual loss and an adversarial loss to improve the realism of the generated images, but visually implausible artifacts can still be found in some generated images. EnhanceNet [34] combines a pixel-wise loss in the image space, a perceptual loss in the feature space, an adversarial loss, and a texture matching loss [35] to produce more realistic outputs with better perceptual quality. Built upon SRGAN, ESRGAN [15] removes batch normalization layers and introduces a basic and effective block: the Residual-in-Residual Dense Block (RRDB). Moreover, ESRGAN also employs an enhanced discriminator called the Relativistic average GAN (RaGAN) [36]. Notably, ESRGAN won first place in the 2018 PIRM Challenge on Perceptual Image Super-Resolution [37], which evaluates image perceptual quality using the perceptual index (PI). SR models trained with the MSE loss tend to produce over-smoothed results, while those trained in an adversarial manner generate realistic details but bring some unpleasant noise. By simply applying linear network interpolation to the results generated from PSNR-oriented and GAN-based models, DNI [38] balances the MSE and GAN effects of SR results. But the interpolation parameter α is selected manually, and it is too costly to generate continuous interpolation results by interpolating the PSNR-oriented model and the GAN-based method with parameter α in [0, 1].
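The network interpolation used by DNI is a per-parameter linear blend of the two trained models. The sketch below, with hypothetical two-weight "models", illustrates the idea; the dictionary-of-arrays representation is an assumption for illustration, not DNI's actual code.

```python
import numpy as np

def interpolate_params(psnr_params, gan_params, alpha):
    """Per-parameter linear network interpolation between a
    PSNR-oriented model and a GAN-based model, as described for
    DNI [38]: alpha = 0 keeps the PSNR model, alpha = 1 the GAN model."""
    return {name: (1.0 - alpha) * psnr_params[name] + alpha * gan_params[name]
            for name in psnr_params}

# hypothetical single-layer "models" for illustration
psnr_model = {"w": np.array([1.0, 2.0])}
gan_model = {"w": np.array([3.0, 6.0])}
blended = interpolate_params(psnr_model, gan_model, 0.5)  # -> w = [2.0, 4.0]
```

Because α must be chosen by hand, producing a continuum of trade-offs requires storing or regenerating outputs for many α values, which is the cost the paper criticizes.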
EEGAN [39] proposes a GAN-based edge-enhancement network that has two subnetworks: a GAN-based ultra-dense subnetwork and a CNN-based edge-enhancement subnetwork. However, EEGAN is specifically designed for satellite image SR reconstruction, where the CNN-based edge-enhancement subnetwork extracts the special edge features of satellite images. RankSRGAN [40] introduces a ranker to optimize the perceptual metric directly, which pursues only a lower PI value and brings blurring artifacts into the hallucinated details.
Some SISR methods use two branches to capture more information and achieve better performance. DualCNN [16] is a PSNR-oriented network that uses different numbers of convolution layers to extract the structure information and the details. SRDPN [41] replaces the residual blocks of EDSR with DPN [42] blocks to achieve improved performance. DSRN [43] introduces a dual-state recurrent network to incorporate information from both the LR and the HR spaces. Dual-way SR [17] exploits a complex network, EDSR, as its complex branch and bicubic interpolation as its plain branch to capture the global and the detail information. These dual-path methods design different network structures for different branches to capture more information, but all the branches are trained with the same loss function. They are still PSNR-oriented methods and cannot solve the problem of generating over-smoothed results. We use dual branches to learn different information based on different loss functions, rather than using different network structures. Our network leverages the advantages of both the PSNR-oriented and the GAN-based methods to supervise the network and capture the high-frequency and the low-frequency features. The proposed training strategies facilitate reconstructing accurate and realistic super-resolution images.

B. DUAL SKIPPING NETWORK
The study of hemispheric specialization shows that visual analysis takes place in a predominantly default coarse-to-fine sequence. Instead of processing spatial frequency information equally, recent biological experiments reveal that the left hemisphere (LH) and the right hemisphere (RH) are predominantly involved in high and low spatial frequency processing, respectively. Inspired by this research on the primate visual cortex, the dual skipping network [18] shows promising results on coarse-to-fine object categorization. The dual skipping network is a left-right asymmetric, layer-skippable network with two branches referring to the LH and RH, respectively. One branch is used for fine-grained classification, simulating the LH mechanism of processing spatial high-frequency information. The other branch is used for coarse-level classification, simulating the RH mechanism of processing spatial low-frequency information. Thus, the dual skipping network can simultaneously work on coarse and fine-grained classification tasks. Moreover, motivated by a similar mechanism in the brain, the dual skipping network introduces a ''Guide'' referring to top-down facilitation of recognition. The guide feeds the high-level information from the coarse branch to relatively lower-level visual processing modules of the fine-grained branch. In our network, inspired by the study of hemispheric specialization and the dual skipping network, we design the high-frequency branch to simulate the LH processing mechanism, and the asymmetric low-frequency branch to simulate the RH mechanism. Moreover, we also design a mask network to simulate the function of the cerebellum, which is involved in balance and motor control.

III. PROPOSED METHOD
This section introduces the proposed method in detail. Our GDSR network aims to improve the perceptual quality and reconstruction accuracy of SR images via a left-right asymmetric network architecture. As shown in Fig. 2, the GDSR consists of three key components: 1) a left-right asymmetric SR network, 2) a global guidance mechanism, and 3) a mask network. The left-right asymmetric network architecture mainly consists of two branches: the high-frequency branch (HFB) and the low-frequency branch (LFB). The guidance delivers the global feature maps from the LFB to the low-level module of the HFB, helping the HFB generate more high-frequency details. The mask network adaptively reconstructs the final output from the LFB and the HFB to improve the perceptual quality and reconstruction accuracy of the SR image. We first describe our left-right asymmetric SR network architecture, then detail our guidance mechanism and mask network in the later subsections.

FIGURE 2. The overall architecture of our proposed method. It contains three subnets: the low-frequency branch (LFB), the mask network and the high-frequency branch (HFB). They share the same shallow feature extraction module, namely the shared module (SM). Each subnet is stacked with the basic blocks: RRDBs. The top-down guide delivers the global information from a high abstraction level of the LFB to a lower abstraction level of the HFB. The mask network generates an attention mask to combine the feature maps from the LFB and the HFB right before the last convolution layer of reconstruction.

A. LEFT-RIGHT ASYMMETRIC NETWORK ARCHITECTURE
Our left-right asymmetric SR network, which contains two complementary branches, aims to achieve a better trade-off between reconstruction accuracy and perceptual quality. The HFB is used to recover detail information, while the LFB reconstructs global information. ESRGAN [15] introduces a novel and effective basic block, the Residual-in-Residual Dense Block (RRDB), as depicted in Fig. 3, to generate high-quality images. Its excellent experimental results demonstrate the strong ability of RRDB to extract multi-level feature information, so we also utilize the RRDB as our basic block.

1) SHARED MODULE
Generally, the shallow parts of the three subnets always extract shallow features, such as edges and corners. Hence, we design a shared module (SM) for these three subnets at the head of our framework. We first use a convolution layer to process the same LR input image of the subnets, attaining the feature map F_0 of the input:

F_0 = H_LR(I_LR),   (1)

where H_LR represents the convolution operation and I_LR is the input LR image. Then we use S RRDBs in the shared module to obtain shallow features from the input feature map F_0, so we have:

F_SF = H_SM(F_0),   (2)

where H_SM denotes our shared module. The SM is shared by the low-frequency branch, the high-frequency branch and the mask network, which extracts feature maps efficiently and reduces the parameters.

2) LOW-FREQUENCY BRANCH
To reconstruct relatively high-accuracy SR images, we use a low-frequency branch (LFB) to extract the global features. The LFB is trained with the MSE loss and consists of a deep feature extraction module, an upsampling module and a reconstruction module in a feed-forward pipeline. We adopt L RRDBs in the deep feature extraction module to obtain more global feature maps, while the upsampling module upscales the feature maps and the reconstruction module outputs the super-resolved result via a convolution layer.

3) HIGH-FREQUENCY BRANCH
To produce more photo-realistic SR images, we utilize the HFB to generate more high-level feature maps. The HFB is trained within the GAN framework, having a generator and a discriminator. The network structure of the generator is similar to the LFB, where H RRDBs are stacked, followed by a convolution layer, an upsampling layer and another convolution layer for reconstruction. The discriminator is a classification network that distinguishes the real HR image from the artificially super-resolved image. Similar to ESRGAN [15], we apply the Relativistic average Discriminator (RaD) [36] as our HFB discriminator. A probability output of RaD closer to 1 means the real image x_r is more realistic than the fake one x_f. The loss function of the discriminator is defined as follows:

L_D^Ra = -E_{x_r}[log(D_Ra(x_r, x_f))] - E_{x_f}[log(1 - D_Ra(x_f, x_r))],   (3)

where D_Ra(x_r, x_f) = σ(C(x_r) - E_{x_f}[C(x_f)]), C(·) is the non-transformed discriminator output, E_{x_f}[·] represents the operation of taking the average over all fake data in the mini-batch, and σ(·) is the sigmoid function. Following [15], our generator loss consists of three terms: the content loss L_1, the perceptual loss L_percep, and the adversarial loss.
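The relativistic average formulation above can be sketched in a few lines. This is a framework-agnostic NumPy illustration of the RaGAN losses of [36] (the symmetric generator term is the one described in the adversarial-loss paragraph below), not the paper's PyTorch implementation; c_real and c_fake stand for the raw discriminator outputs C(x) over a mini-batch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rad_discriminator_loss(c_real, c_fake):
    """Relativistic average discriminator loss (Eq. 3).
    c_real / c_fake: raw discriminator logits C(x) for real
    and generated images in a mini-batch."""
    d_real = sigmoid(c_real - c_fake.mean())   # real vs. average fake
    d_fake = sigmoid(c_fake - c_real.mean())   # fake vs. average real
    return -(np.log(d_real).mean() + np.log(1.0 - d_fake).mean())

def rad_generator_loss(c_real, c_fake):
    """Adversarial loss for the generator, in the symmetric form."""
    d_real = sigmoid(c_real - c_fake.mean())
    d_fake = sigmoid(c_fake - c_real.mean())
    return -(np.log(1.0 - d_real).mean() + np.log(d_fake).mean())
```

A discriminator that confidently separates real from fake (large positive c_real, large negative c_fake) drives its loss toward zero, while an undecided one (equal logits) sits at 2·log 2.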
Following [27], [29], [32], [44], we use the L_1 loss function to constrain the content of a generated SR image to be close to the HR image. The L_1 loss is defined in Eq. 4:

L_1 = (1/N) Σ_{i=1}^{N} ||F_{G_θ}(I_i^LR) - I_i^HR||_1,   (4)

where F_{G_θ}(·) represents the function of the generator, θ denotes the parameters of the generator, and I_i is the i-th image. This function treats every position in the image equally.
The perceptual loss [45] aims to measure the perceptual similarity between the SR image and the corresponding HR image by minimizing the distance between high-level features extracted from a pre-trained network before the activation layers. Both the SR and HR images are fed to the pre-trained VGG19 and the VGG19-54 layer features are extracted. The perceptual loss is defined as:

L_percep = (1/N) Σ_{i=1}^{N} ||F_VGG(G(I_i^LR)) - F_VGG(I_i^HR)||_1,   (5)

where F_VGG(·) denotes the features from the 4-th convolution layer before the 5-th max-pooling layer in the pre-trained VGG19 network, G(·) is the function of the generator, and I_i is the i-th image.
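The content and perceptual terms share the same L1 form, differing only in the space where the distance is measured. The sketch below makes that explicit; feature_fn is a stand-in for the pre-trained VGG19-54 extractor (loading the actual torchvision VGG19 is omitted to keep the example self-contained).

```python
import numpy as np

def l1_loss(sr, hr):
    """Pixel-wise L1 content loss (Eq. 4): mean absolute
    difference, treating every position equally."""
    return np.abs(sr - hr).mean()

def perceptual_loss(sr, hr, feature_fn):
    """Perceptual loss (Eq. 5): the same L1 distance, but taken
    in a pre-trained feature space. feature_fn stands in for the
    VGG19-54 extractor (4th conv before the 5th max-pooling,
    features taken before activation)."""
    return np.abs(feature_fn(sr) - feature_fn(hr)).mean()
```

With the identity as feature_fn the two losses coincide, which shows the perceptual loss is the content loss lifted into feature space.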
The adversarial loss for the generator is in a symmetrical form against the discriminator:

L_G^Ra = -E_{x_r}[log(1 - D_Ra(x_r, x_f))] - E_{x_f}[log(D_Ra(x_f, x_r))].   (6)

B. GLOBAL GUIDANCE MECHANISM
Inspired by the LSF-based top-down facilitation of recognition in the visual cortex [18], we deem that the LFB can guide the HFB to recover more detail information using the global context features of the input. As shown in Fig. 2, the output feature maps of the l-th RRDB in the LFB are used to guide the subsequent feature extraction of the h-th RRDB in the HFB. Specifically, these output feature maps are concatenated onto the input feature maps of the h-th RRDB in the HFB. The injection of feedback information from the global level is beneficial for fine-grained reconstruction. We demonstrate the effectiveness of the guidance in our experiments.
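The concatenation step of the guidance can be illustrated as follows. This is a minimal NumPy sketch under the assumption of a (C, H, W) feature layout; in the actual network a convolution inside the h-th RRDB would absorb the widened channel dimension.

```python
import numpy as np

def guided_input(hfb_feats, lfb_guide):
    """Top-down guidance sketch: the output feature maps of the
    l-th RRDB in the LFB are concatenated onto the input feature
    maps of the h-th RRDB in the HFB along the channel axis.
    Arrays are assumed to be in (C, H, W) layout."""
    return np.concatenate([hfb_feats, lfb_guide], axis=0)
```

Concatenation (rather than addition) lets the HFB learn how to weight the injected global context; the ablation study later reports no noticeable difference when addition is substituted.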

C. MASK NETWORK
To make the final reconstructed SR image focus on high-frequency details, we embed an attention mechanism in our framework. As presented in Fig. 2, we design a mask network to produce an attention mask for adaptively reconstructing the final output image, achieving a better trade-off between reconstruction accuracy and perceptual quality. Similar to the LFB, the mask network stacks M RRDBs after the shared module. We then use the upsampling layer to upscale the attention feature map, which is processed by the sigmoid function into a probability matrix that enables the dual SR framework to yield superior results. Unlike other works that fuse the SR output of each branch into the final output [17], [46], we merge the feature maps during SR image reconstruction, before the final reconstruction convolution layer. The feature maps extracted from the mask network module are defined as:

F_M = H_mask(F_SF),   (7)

where H_mask denotes the mask network and F_SF is the output of the shared module. The mask A can be formulated as:

A = σ(H_up(F_M)),   (8)

where H_up denotes the upsampling operation and σ(·) is the sigmoid function. We use the attention mask A to fuse the feature maps of the low-frequency branch and the high-frequency branch as follows:

F_out = A ⊗ f_high(F_SF) + (1 - A) ⊗ f_low(F_SF),   (9)

where f_high(F_SF) represents the reconstructed feature map from the HFB, f_low(F_SF) represents the reconstructed feature map from the LFB, ⊗ denotes element-wise multiplication, and A denotes the attention mask indicating to what extent each pixel of f_high(F_SF) contributes to the final output image. In this way, the mask network learns the weight of each pixel of the feature map, leading to reconstructed SR images with higher visual quality and fewer deformed textures. Furthermore, the mask network adaptively reconstructs the final output merged from the LFB and the HFB with the last convolution layer.
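The per-pixel fusion can be sketched directly. This NumPy illustration assumes the complementary A / (1 - A) weighting described above, where A = σ(mask logits) encodes how much each pixel of the HFB feature map contributes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_with_mask(f_high, f_low, mask_logits):
    """Fuse the HFB and LFB feature maps with a per-pixel attention
    mask A = sigmoid(mask_logits): pixels with A near 1 take the
    high-frequency branch, pixels with A near 0 the low-frequency
    branch, and intermediate values blend the two."""
    a = sigmoid(mask_logits)               # probability matrix A
    return a * f_high + (1.0 - a) * f_low  # Eq. (9)-style fusion
```

Saturated logits select one branch outright, while zero logits yield an equal blend, which is how the network can trade accuracy against perceptual detail pixel by pixel.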

IV. EXPERIMENTS
In this section, we first describe our network training settings, then present the quantitative and visual results of the proposed network compared with state-of-the-art methods on benchmark datasets. To study the effects of the guidance and the dual branches in the proposed network, we conduct ablation experiments by removing these components and comparing the results.
A. TRAINING DETAILS
According to the work of Blau and Michaeli [53], the perceptual quality of super-resolved images does not always improve as PSNR/SSIM values increase. We adopt the perceptual index (PI) [37] and the root mean square error (RMSE) as our quantitative measurements, where PI measures the perceptual quality of the super-resolved image and RMSE measures the reconstruction loss between the HR image and the SR image. Lower values of both PI and RMSE represent better quality. The training loss functions of our network are shown in Table 1. The training process is divided into two stages. First, we use the MSE loss to pre-train a PSNR-oriented model with all branches. We then employ the trained PSNR-oriented model as an initialization for the HFB. Second, we train the HFB in an adversarial manner. Meanwhile, we continue to use the L_1 loss to update the LFB and the mask network until the model converges.
At the training stage, the inputs are augmented by rotating and flipping. Our network is optimized with the ADAM optimizer [54] with hyper-parameters β_1 = 0.9 and β_2 = 0.999. We randomly crop HR images into 128×128 patches. Following [21], [27], [29], [32], [55], [56], the initial learning rate is set to 1 × 10^-4 and halved every 2 × 10^5 iterations. The number of RRDBs for the different modules in GDSR is set as shown in Table 1. We set the block numbers in the global guidance mechanism to l = 10 and h = 5. We implement our model with the PyTorch framework [57] on two NVIDIA GeForce RTX 2080Ti GPUs.
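The step-decay schedule above (1 × 10^-4, halved every 2 × 10^5 iterations) amounts to the following one-liner; the function name and signature are illustrative, not taken from the paper's code.

```python
def learning_rate(step, base_lr=1e-4, decay_every=200_000):
    """Step-decay schedule from the training settings: the initial
    learning rate of 1e-4 is halved every 2 x 10^5 iterations."""
    return base_lr * 0.5 ** (step // decay_every)
```

For example, the rate stays at 1e-4 for the first 2 × 10^5 steps, then drops to 5e-5, then 2.5e-5, and so on.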

B. QUANTITATIVE COMPARISONS
As shown in Fig. 4, the methods in the top-left region are MSE-based, with lower RMSE loss but higher PI values. They have high reconstruction accuracy but poor visual quality with over-smoothed edges. In contrast, the methods in the bottom-right region are GAN-based, including SRGAN [33], EnhanceNet [34], ESRGAN [15], RankSRGAN [40] and our method, which have lower PI values. Although the previous GAN-based methods obtain more photo-realistic images than the MSE-based methods, they have higher RMSE loss, resulting in more deformed textures in the SR images. Our GDSR attains the lowest RMSE loss and a comparatively low PI value among all the GAN-based methods, and it can produce SR images of better perceptual quality and relatively higher reconstruction accuracy.
To further demonstrate the performance of our method, we compare the results of DNI [38] with our GDSR on the LPIPS metric [58]. LPIPS calculates the perceptual similarity of images and has recently become a common metric for evaluating image quality in the super-resolution field. Its evaluation results are closer to human perception, providing a good trade-off between perception and distortion. We choose the MSE-based method SRResNet [33] and the GAN-based method RankSRGAN [40] to interpolate with different interpolation parameters α, set to 0.2, 0.4, 0.6, 0.8. As shown in Table 3, the LPIPS values of our method GDSR on the two benchmark datasets Urban100 and Manga109 are the lowest, which means that the image quality of our method is much better than that of simple interpolation between the PSNR-oriented model and the GAN-based model. It can also be seen that the PSNR of our method is higher than that of RankSRGAN and the other DNI models on the benchmark dataset Manga109, which means that our method has less distortion. In fact, our mask network can choose optimal mask parameters automatically to achieve a good perception-distortion trade-off.
Furthermore, quantitative comparisons between the proposed method and the perceptual SR method ESRGAN [15] are listed in Table 2. The evaluation metrics include PSNR, SSIM, and PI [37]. Table 2 shows their performance on five test datasets: Set5, Set14, BSD100, Urban100, and Manga109. Note that a lower PI value indicates better visual quality, while higher PSNR/SSIM values mean higher reconstruction accuracy. Comparing our method with ESRGAN, we find that GDSR achieves the best PI performance on all datasets except Set5. Typically, improved perceptual scores come at the price of PSNR; however, on all test sets, GDSR obtains higher PSNR/SSIM values than ESRGAN. Our proposed method thus achieves a lower PI value and higher PSNR/SSIM values, which means it has better visual quality and higher reconstruction accuracy.
C. QUALITATIVE COMPARISONS
We show some visual examples, where we observe that our method generates more realistic textures without introducing additional artifacts. As shown in Fig. 5, our proposed network outperforms other methods by a large margin in visual quality. Our network can generate images with more fine-grained textures and high-frequency details without deformation. For example, for the image 'img_002' of Urban100, the edges from MSE-based methods are over-smoothed. EnhanceNet and ESRGAN generate the streaks with noise. The margins of the cropped parts of SRGAN and RankSRGAN are blurry. Our GDSR can produce clear and natural stripes of the window frame without noise. The cropped parts of the image 'img_044' in Urban100 are full of stripes. All the compared MSE-based methods suffer from blurry artifacts, failing to recover the structure and the gaps of the stripes. The result of EnhanceNet is full of noise, and ESRGAN generates noise and wrong textures. The streaks of SRGAN and RankSRGAN are indistinct and blurred, while our GDSR can recover them correctly, producing more pleasing results faithful to the HR image. These representative comparisons demonstrate the strong ability of our GDSR to produce more photo-realistic and high-quality SR images.
We also find that our method can generate more detailed lines without artifacts against bright-colored backgrounds. As shown in Fig. 5, the window frames in 'img_013' of Urban100 produced by SRGAN and ESRGAN are difficult to distinguish. EnhanceNet and RankSRGAN generate 'img_013' of Urban100 with too much noise. Our method GDSR can recover the lines of the window frames without artifacts, overcoming the above disadvantages of GAN-based methods.
Furthermore, we show the visual example of the DNI [38] and GDSR model in Fig. 7. We can see that the picture of DNI becomes more and more photo-realistic with the increase of α value, but those photo-realistic images are accompanied by unsatisfying artifacts and noise. The image generated by our model is more photo-realistic with less noise.

D. RESULTS WITH REAL-WORLD DATASET
We further compare our model with others on real-world images to test its robustness. We use the Nikon dataset from RealSR [60], a testing dataset commonly used in the real-world super-resolution field. Several evaluation metrics exist for real-world images, such as SSEQ [61], LPIPS [58], and the DIBR-Synthesized Image Quality Metric [62]. We choose SSEQ and LPIPS as our evaluation metrics, using the open-source code available online. SSEQ calculates the spatial-spectral entropy of image blocks as a no-reference measure of perceptual image quality.

E. ABLATION STUDY
To study the effects of the structures in the proposed method, we conduct ablation experiments by removing components and testing the differences. We remove the global guidance to verify its influence. We then train a single branch in an adversarial manner without global guidance and mask network to verify the effect of our proposed dual branches. We also compare the final reconstruction outputs of the HFB and the dual branches in the same network. The visual comparisons are illustrated in Fig. 8 and Fig. 9. Detailed discussions are provided below.

1) REMOVING THE GUIDANCE
We first remove the top-down guidance in our network. An obvious performance decrease can be observed in Fig. 8. For the image 'KarappoHighschool' in Manga109, the model without guidance introduces some unnatural noise and blurry edges, while GDSR generates a clear SR image. The HFB without the global guidance from the LFB introduces blurring artifacts. The characters in the cropped image generated by GDSR are clearer and more recognizable thanks to the top-down guidance. We also perform an experiment replacing concatenation with addition of the output feature maps from the LFB to the HFB, but observe no noticeable difference.

2) COMPARED WITH SINGLE BRANCH
First, we train a single branch in an adversarial manner, which can be treated as ESRGAN with 17 RRDBs. Then, we generate the SR images from the HFB in the original network by adding the final reconstruction convolution layer. The visual comparison results are shown in Fig. 9. We can see that GDSR outperforms the single-branch ESRGAN_17 by a large margin. The single-branch ESRGAN_17 tends to introduce unpleasant and unnatural artifacts. The HFB with the global guidance from the LFB can alleviate the artifacts, but the lines of the eaves in img_025 still show additional textures. By employing the mask network to adaptively reconstruct the final output from the LFB and the HFB, our GDSR can alleviate heavy artifacts and noise to generate more correct and clearer stripes. This visual analysis indicates that our dual-branch structure plays an important role in GDSR, achieving a better trade-off between perceptual quality and reconstruction accuracy in SR images.
We also give the quantitative results of ESRGAN_17, the HFB, and GDSR. As shown in Table 5, the HFB attains higher PSNR and SSIM values than ESRGAN_17. Our GDSR achieves the highest PSNR and SSIM, demonstrating that it benefits from the two-branch design and reconstructs more accurate SR images.

V. LIMITATIONS AND FUTURE WORK
As SISR is a severely ill-posed problem, it is unavoidable that our method has some limitations. One interesting failure case on an image in the Set14 dataset is shown in Fig. 10, where the model blurs the complicated stripes visible in the HR image into smooth areas. The reason is that the model does not have enough features to learn from: as the size of the LR image is reduced, the complicated stripes gradually disappear from the LR image.
The model is already competitive in terms of visual results. Future work will focus on reducing the depth of the network and applying shrinking methods to speed up the model.
We also want to add a temporal consistency term to use the model for video super-resolution.

VI. CONCLUSION
In this paper, we proposed a novel left-right asymmetric network for image SR to achieve a better trade-off between reconstruction accuracy and perceptual quality. We used two different training strategies to train the low-frequency branch (LFB) and the high-frequency branch (HFB), in line with the goal of making the branches complementary. The LFB is trained with the MSE loss to pursue accuracy, and the HFB is trained with the GAN adversarial loss to extract high-frequency features. Furthermore, we proposed a top-down guidance mechanism to guide the high-frequency feature extraction in the HFB. The high-level features from the LFB help the HFB extract more high-frequency texture information. To take full advantage of both high-frequency and low-frequency features, we used a mask network to adaptively reconstruct the final output image. Our GDSR can reconstruct accurate and realistic super-resolution images, benefiting from the complementary branches that extract the high-frequency and low-frequency features, the guidance mechanism that guides the high-frequency feature extraction, and the mask network that fuses the features from the two branches. Extensive benchmark evaluations demonstrated the effectiveness of our proposed network, which achieved superiority over state-of-the-art methods.

XINYI PENG is currently a Full Professor with the School of Software Engineering, South China University of Technology, China. He has presided over and participated in more than 20 projects, with project funds of over nine million. Among them, nine projects at or above the provincial level have passed acceptance inspection. His research interests include artificial intelligence and data mining.