Bridging Component Learning with Degradation Modelling for Blind Image Super-Resolution

Convolutional neural network (CNN)-based image super-resolution (SR) has exhibited impressive success on low-resolution (LR) images with known degradations. However, this type of approach struggles to maintain its performance in practical scenarios where the degradation process is unknown. Although existing blind SR methods attempt to solve this problem using blur kernel estimation, their perceptual quality and reconstruction accuracy are still unsatisfactory. In this paper, we analyze the degradation of a high-resolution (HR) image from the perspective of image intrinsic components according to a degradation-based formulation model. We propose a components decomposition and co-optimization network (CDCN) for blind SR. Firstly, CDCN decomposes the input LR image into structure and detail components in feature space. Then, a mutual collaboration block (MCB) is presented to exploit the relationship between the two components. In this way, the detail component can provide informative features to enrich the structural context, and the structure component can carry structural context for better detail revealing, in a mutually complementary manner. After that, we present a degradation-driven learning strategy to jointly supervise the restoration of the HR image detail and structure. Finally, a multi-scale fusion module followed by an upsampling layer is designed to fuse the structure and detail features and perform SR reconstruction. Empowered by such degradation-based components decomposition, collaboration, and mutual optimization, we bridge component learning and degradation modelling for blind SR, thereby producing SR results with more accurate textures. Extensive experiments on both synthetic SR datasets and real-world images show that the proposed method achieves state-of-the-art performance compared to existing methods.


I. INTRODUCTION
Single image super-resolution (SR) aims at restoring a high-resolution (HR) image from a low-resolution (LR) image, which is a classical but active task in computer vision. In the past years, various convolutional neural network (CNN) methods have significantly promoted the development of image SR [1,2,3,4,5,6,7,8,9,10] and achieved remarkable progress. Most CNN-based methods assume that an observed LR image is generated by degrading an HR image via a bicubic downsampling kernel. However, when the real degradation differs from this bicubic assumption, these methods can suffer from obvious performance drops. Recent studies [11,12,13] assume the degradation process of an LR image is known and exploit the degradation kernel (e.g., Gaussian blur and noise) as prior information to recover the HR image. Though these non-blind approaches exhibit better performance than previous bicubic-based methods, they face a versatility limitation in real scenarios where the degradation information is complex or even unknown. To fulfill the needs of practical applications, kernel estimation is thus still inevitably required to provide degradation information.
To overcome this issue, blind SR methods [15,16,17,18,19,20,21,22,23,24,25] propose to recover the HR image from an unknown degraded LR image. Some of them [19,20,21,22,24,25] divide this problem into degradation estimation and kernel-based SR reconstruction. Bell-Kligler et al. [19] introduce an image-specific internal GAN to estimate the SR kernel according to the distribution of patches across scales of the LR image. Zhou and Süsstrunk [20] build a large blur kernel pool from real photographs and then employ it to create a paired LR-HR dataset for training. In [21], a mutual affine network (MANet) is presented to estimate spatially variant kernels from the LR observation for every HR image pixel. These methods rely heavily on explicit kernel estimation. In practice, obtaining accurate kernels from arbitrary images is difficult, and estimation errors can cause visually unpleasant artifacts in SR results. Some methods [15,16] perform iterative kernel estimation and conditional SR reconstruction in an end-to-end framework, taking advantage of intermediate SR results to alleviate the kernel mismatch problem. Yet, the time consumption caused by multiple iterations cannot be ignored in these two methods. Wang et al. [17] propose an unsupervised learning scheme to learn degradation representations for blind SR, which achieves fast reconstruction but limited accuracy.
The above methods directly model the degradation process from LR images while ignoring the ambiguity introduced by the downsampling operation (e.g., bicubic downsampling), which increases the difficulty of accurate kernel estimation and further reduces the possibility of vivid reconstruction. To avoid the adverse effects of degradation estimation, Li et al. [18] propose to alternately optimize the HR image detail and latent structural context using a recurrent framework for blind SR (DSSR), which produces acceptable results without incorporating degradation priors. Nevertheless, this method has two main drawbacks: 1) DSSR simply supervises the residue between the predicted HR image and the bicubic-upsampled image to optimize the HR detail, while the structure component is implicitly optimized in latent space by the SR supervision. This manner ignores the influence of degradation on SR reconstruction and is sub-optimal for generating satisfactory results. 2) DSSR also adopts an iterative learning scheme to pursue robust performance, but its training and testing processes converge within a few iterations, which suggests a saturation of model performance. In fact, both the local details and the perceptual quality of its SR results can still be further improved (see Fig. 1).

Fig. 1. SISR results on "img 011" in Urban100 [14] with scale factor 2. The HR image is first blurred by an isotropic Gaussian kernel with kernel width 1.8 and then downsampled by a bicubic kernel. Equipped with the component decomposition and degradation-driven learning strategy, the proposed CDCN succeeds in recovering sharp and realistic details, outperforming other state-of-the-art blind SR methods including IKC [15], DAN [16], DASR [17] and DSSR [18].

All these observations motivate us to consider two issues: 1) how to avoid the adverse effects of kernel estimation while modelling degradation information; 2) whether there is a more effective image detail and structure optimization scheme for robust blind SR performance. To this end, in this paper, we first introduce a degradation-based formulation model that expresses the degradation kernel of an HR image as a combination of a blur kernel, a bicubic kernel, and a standard downsampler. According to this formulation model, we analyze the degradation of an HR image from its intrinsic components (see Sec. III-A). We then propose a components decomposition and co-optimization network (CDCN), which bridges component learning and degradation modelling for blind SR. CDCN starts from a simple LR image decomposition in feature space to obtain the initial LR detail and structure components. Then, considering that the two components represent different image characteristics but are interdependent, we present the mutual collaboration block (MCB) to exploit the relationship between both components and learn more comprehensive representations conditioned on each other, achieving a collaborative optimization process.
Based on our degradation-based formulation, we present a degradation-driven learning strategy to jointly recover the HR image detail and structure. Moreover, we leverage a multi-scale fusion module that combines the learned detail and structure features for final HR image restoration. Experimental results demonstrate that CDCN outperforms recent state-of-the-art blind SR methods on both synthetic datasets and real-world images.
We summarize the contributions of this work as follows:
• We analyze the degradation of an HR image from image intrinsic components and propose a components decomposition and co-optimization network (CDCN), which bridges component learning and degradation modelling for blind SR.

A. Single Degradation Image Super-Resolution
CNN-based image SR with single degradation aims at super-resolving LR images that are degraded from HR images with a predefined single degradation, i.e., bicubic downsampling. This type of approach constructs paired training data based on a bicubic kernel and investigates various CNN architectures to learn the LR-to-HR mapping. Early methods [1,2,3,26,27] conduct pre-interpolation on LR images using bicubic upsampling and learn to refine the upscaled HR images. Due to the heavy computation on large-resolution features in CNN layers, these methods face challenges in computational complexity and time cost. To address this problem, some methods [4,5,6,28,29,30,31,32] propose to directly learn feature representations from the LR input and amplify the learned features onto HR space at the end of the network using transposed convolution [4,28,29,33] or a sub-pixel layer [6,30,31]. Ledig et al. [30] propose to train a generative adversarial network (GAN) with perceptual loss and adversarial loss for photo-realistic image SR reconstruction. Wang et al. [8] propose enhanced SR generative adversarial networks (ESRGAN) by introducing local residual dense connections, a relativistic discriminator, and a deeper architecture into [30], which significantly boosts SR performance. In [34], Haris et al. exploit iterative feature upsampling and downsampling operations to provide an error feedback mechanism, constituting a deep back-projection network (DBPN) for image SR. In [35], a multipath recursive residual learning framework is introduced for several classical image restoration tasks including SISR. Liu et al. [36] perform iterative SR reconstruction and downsampling to gradually decrease the distance in HR space by adjusting the distance in LR space. Recently, attention mechanisms [7,10,37,38,39] have been widely used in various image SR networks to exploit more informative features for performance improvement.

B. Non-Blind Super-Resolution with Multiple Degradations
Beyond the above single-degradation SR methods for the non-blind bicubic downsampling setting, many other non-blind methods [11,12,13,40,41] aim at super-resolving the LR image under multiple degradations using a single model. These methods assume a fixed, known, but more complex degradation process and unify the degradation prior with the LR input to train a single SR model with greater practicality. In [11], Zhang et al. propose a dimensionality stretching strategy to produce a degradation map that is concatenated with the LR image for SR reconstruction, which can handle blur, noise, and bicubic downsampling degradations in a single model. The authors later present a deep plug-and-play SR (DPSR) framework [40], which combines deblurring methods and a half quadratic splitting optimization algorithm to handle arbitrary uniform blur kernels. In [12], an unfolding SR network (USRNet) is proposed to alternately optimize a data sub-problem and a prior sub-problem for multiple degradations. Xu et al. [13] adopt a pre-defined degradation map as an additional input and employ dynamic convolution to enhance the effectiveness of networks under variational degradations. Hussein et al. [42] introduce a closed-form correction filter to convert an LR image to match its bicubic-degraded counterpart, so that an existing model for bicubic degradation can super-resolve the converted image. Shocher et al. [43] present a zero-shot SR (ZSSR) method which exploits the internal recurrence of information inside a single image across different scales to perform image-specific SR. Soh et al. [41] first exploit external data to learn a generic initial parameter set for the SR model and then employ meta-learning to adapt the model to given images, which significantly improves the inference speed over ZSSR.

C. Blind Super-Resolution
For blind SR, the degradation from an HR image to an LR one is unknown. Under such a blind setting, existing methods mainly pursue estimating accurate degradation kernels from LR observations and conditionally reconstructing HR images based on the estimated kernels. Cornillere et al. [44] propose a degradation-aware SR network that generates both the HR image and the blur kernel; a kernel discriminator is also designed in [44] to refine the results. Gu et al. [15] introduce an iterative kernel correction (IKC) method to iteratively correct inaccurate blur kernels by observing the intermediate SR results. Bell-Kligler et al. [19] use an internal generative adversarial network (GAN) to estimate the degradation kernel from LR images at test time. Luo et al. [16] adopt an alternating optimization algorithm, which can estimate blur kernels and restore SR images iteratively. Wang et al. [17] extract abstract representations to distinguish different degradations with contrastive learning and super-resolve LR images using the discriminative representations. Liang et al. [22] propose a flow-based kernel prior (FKP) method, which models the kernel distribution by learning an invertible mapping between the kernel space and a tractable latent space. In [21], a mutual affine network (MANet) with a moderate receptive field is proposed to maintain the locality of degradation for spatially variant kernel estimation. Kim et al. [23] use a downsampling network to estimate spatially variant kernels; the predicted kernels are then leveraged as local filtering operations for SR feature modulation. Jo et al. [45] propose to use an adaptive target rather than the GT target, which can alleviate the ill-posedness of the blind SR problem and encourage sharp SR reconstruction. He et al. [46] introduce a degradation reconstruction loss to guide the SR network to capture the degradation-wise differences between HR images and SR results. Li et al. [18] introduce a recurrent alternating optimization SR network to tackle the blind SR problem from structure and detail perspectives without explicit degradation incorporation. MoESR [25] utilizes a mixture of experts to broaden the degradation space and trains an expert for each specific kernel.

D. Component Learning-based Super-Resolution
In image SR, the key idea of component learning is to optimize the intrinsic image components of LR images to facilitate SR reconstruction. Shi et al. [47] treat the base image content, global boundary context, and residual context as complementary information and jointly learn the three contexts within a multi-task framework to address the SR problem. Yang et al. [48] take the LR image and its edge map as input and jointly infer their HR versions with a recurrent residual SR network. Xie et al. [49] introduce a convolutional sparse coding (CSC)-based decomposition method to separate the LR input into residual and smooth components; the residual component is represented by a shallow CNN and then recombined with the interpolated smooth component to predict the HR image. Nazeri et al. [50] use a two-stage network to sequentially recover HR edges and textures. Ma et al. [51] propose a structure-preserving SR network (SPSR) which exploits gradient maps as auxiliary context to guide the recovery of HR images. In [52], Cai et al. propose to decompose the textures and details from ground-truth HR images to guide the network in learning textures and details from the observed LR images for better perceptual quality.

III. PROPOSED METHOD

A. Degradation-based Problem Formulation
In the typical SISR framework, the degradation model between an HR image I_HR and its corresponding LR image I_LR can be formulated as follows:

I_LR = (I_HR ⊗ k_d) ↓_s,    (1)

where k_d denotes the degradation kernel, ⊗ denotes the convolution operation, and ↓_s represents a downsampling operation with a scale factor s. In the blind SR problem, most methods [15,16,17] adopt the combination of isotropic or anisotropic Gaussian kernels together with additive white Gaussian noise as the degradation kernel k_d, and a bicubic downsampler as the downsampling operation, respectively.
Existing methods [15,16] usually only take the Gaussian blur kernel k_g into consideration and treat the blind SR problem as an optimization problem:

min_{θ_sr} ‖ SR(I_LR, E(I_LR); θ_sr) − I_HR ‖,    (2)

where E(•) is the function that estimates the Gaussian blur kernel k_g from I_LR, SR(•) is the function that restores I_HR from I_LR, and θ_sr is the parameter of SR(•). However, estimating a single blur kernel from the input LR image for whole-image restoration may cause two intrinsic problems: 1) the input LR image is blurred by the combination of the Gaussian blur kernel k_g and the downsampling operation, which increases the difficulty of accurate Gaussian blur kernel estimation; 2) the degradation process is ill-posed, which means a degraded patch may correspond to several blur kernels. In fact, when applying blur kernels to an HR image, not only the blur kernels themselves but also the HR image plays a decisive role in generating the LR image.
To tackle these problems, since the bicubic degradation can be approximated by setting a proper blur kernel in Eq. (1) [12,19], in this paper we view the bicubic downsampler as a combination of a bicubic kernel and a standard downsampler. For the noise-free degradation model with Gaussian kernels, we re-formulate Eq. (1) as

I_LR = (I_HR ⊗ k_g ⊗ k_b) ↓_{s_d},    (3)

where the input HR image I_HR is first blurred by the combination of the Gaussian blur kernel k_g and the bicubic kernel k_b, and then downsampled by the standard s-fold downsampler ↓_{s_d}, which keeps only the upper-left pixel of each distinct s×s patch. Subsequently, we propose to model the degradation process from image intrinsic components. According to Eq. (3), we define an HR image I_HR degraded by a Gaussian kernel k_g as the structure component I_s, and the lost information as the detail component I_d. In this condition, we can obtain an LR image I_LR by downsampling I_s with the bicubic kernel k_b. The decomposition process can be expressed as

I_s = I_HR ⊗ k_g,  I_d = I_HR − I_s,  I_LR = (I_s ⊗ k_b) ↓_{s_d}.    (4)

Consequently, the optimization problem in Eq. (2) can be decomposed into two separate optimization problems:

min_{θ_s} ‖ S(I_LR; θ_s) − I_s ‖,  min_{θ_d} ‖ D(I_LR; θ_d) − I_d ‖,    (5)

where S(•) is the function that restores the structures of I_HR from I_LR and θ_s is the parameter learned by S(•). It is worth noting that since I_LR is directly bicubic-downsampled from I_s, S(•) is equivalent to SR(•) in traditional SISR tasks. D(•) is the function that estimates the details of I_HR from I_LR, and θ_d is the parameter learned by D(•).
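As a concrete illustration of this degradation pipeline, the following sketch (ours, not the paper's code) generates an LR image by blurring an HR image with a Gaussian kernel and applying the standard s-fold downsampler that keeps the upper-left pixel of each s × s patch. For brevity, a single Gaussian kernel stands in for the combined blur k_g ⊗ k_b:

```python
import numpy as np

def gaussian_kernel(size=21, sigma=1.8):
    """Isotropic Gaussian blur kernel, normalized to sum to one."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def blur(img, kernel):
    """Valid-region 2D convolution (a slow reference implementation)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * kernel).sum()
    return out

def s_fold_downsample(img, s):
    """Standard s-fold downsampler: keep the upper-left pixel of each s x s patch."""
    return img[::s, ::s]

# I_LR = (I_HR blurred) then s-fold downsampled
hr = np.random.rand(64, 64)
lr = s_fold_downsample(blur(hr, gaussian_kernel(size=5, sigma=1.8)), s=2)
```

Note that the downsampler is a plain strided slice, matching the "keep the upper-left pixel" definition above.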
Considering that the detail component can reveal informative regions to help structure inference and the structure component can help to understand image context, it is sub-optimal to separately optimize the detail or structure components for accurate SR reconstruction. To better exploit the collaboration between the structure and detail components, we propose to continuously enhance the structure and detail feature representations in a mutually complementary way, resulting in a collaborative optimization process for blind SR. To this end, the objective of this work can be further rewritten as follows:

min_{θ_s} ‖ S(I_LR, Î_d; θ_s) − I_s ‖,  min_{θ_d} ‖ D(I_LR, Î_s; θ_d) − I_d ‖,    (6)

where Î_s and Î_d are the structure and detail components predicted from I_LR. In contrast to Eq. (5), S(•) and D(•) exploit not only the information from I_LR but also the information from each other. Thus the difficulty of detail estimation can be eased by the structures, and the quality of structure restoration can be enhanced by the detail information.
With the optimized structure and detail components, we finally combine them to synthesize the SR image I_SR, which is completed by the fusion function F(•). The SR image I_SR can be generated by the following process:

I_SR = F_SR(I_LR) = F(Î_s, Î_d; θ_f),

where F_SR(•) is the whole SR function of the proposed scheme and θ_f is the parameter learned by F(•). We minimize the distance between I_HR and the generated I_SR to facilitate our SR reconstruction. Such a process can be formulated as follows:

min_{θ_s, θ_d, θ_f} ‖ F_SR(I_LR) − I_HR ‖.

B. Network Structure

Overall framework. The proposed components decomposition and co-optimization network (CDCN) aims at restoring the HR image from an unknown degraded LR input; the overall architecture is shown in Fig. 2. Given an LR image I_LR of size C × H × W, where C denotes the number of channels and H and W denote the height and width of I_LR, we first feed I_LR into a 3 × 3 convolutional layer to extract the shallow feature F_in, which serves as the input of our CDM for component decomposition. In the CDM, we obtain the initial detail and structure components by using two individual feature extraction blocks to directly learn the corresponding features from F_in. This process can be formulated as follows:

(F^0_s, F^0_d) = C(F_in),

where C(•) represents the component decomposition function that produces the structure component F^0_s and the detail component F^0_d. After that, to exploit the relationship between both components, we design the mutual collaboration block (MCB), which performs detail and structure information interaction and learns more comprehensive representations conditioned on each other. By forming multiple MCBs in a stacked manner, we construct the residual groups (RG) to learn powerful feature representations. Supposing there are N RGs in CDCN and each RG contains M MCBs, for the m-th MCB in the n-th RG, the process of structure and detail component learning can be expressed as

(F^{m,n}_s, F^{m,n}_d) = MCB(F^{m−1,n}_s, F^{m−1,n}_d),

where (F^{m−1,n}_s, F^{m−1,n}_d) and (F^{m,n}_s, F^{m,n}_d) are the input and output of the m-th MCB in the n-th RG, respectively. With local residual connections, the output of the n-th RG can be obtained as follows:

F^n_s = W^n_s ∗ F^{M,n}_s + F^{n−1}_s,  F^n_d = W^n_d ∗ F^{M,n}_d + F^{n−1}_d,

where (F^{n−1}_s, F^{n−1}_d) and (F^n_s, F^n_d) are the input and output of the n-th RG respectively, and W^n_s and W^n_d are the weight sets of the two convolutional layers at the tail of the n-th RG. Such a process can also be abbreviated as

(F^n_s, F^n_d) = G(F^{n−1}_s, F^{n−1}_d),

where G(•) represents the function of the RG. The very shallow LR feature F_in contains abundant structural contexts, which are helpful for better structure modelling. Therefore, after stacking N RGs, we design an asymmetrical global residual connection by

F_s = W_s ∗ F^N_s + F_in,  F_d = W_d ∗ F^N_d,

where W_s and W_d are the weight sets learned by the two convolutional layers after the N RGs. Then we use the MSFM to fuse the two component features:

F_out = F(F_s, F_d),

where F(•) is the function of the MSFM. Finally, we use three individual upsample modules to project the LR features F_s, F_d, and F_out onto HR space and generate the corresponding predicted HR structure image Î_s, predicted HR detail image Î_d, and final SR result I_SR. The architecture of the upsample modules is illustrated in Fig. 2 (bottom middle).
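The component decomposition step described above can be sketched in PyTorch as two independent convolutional branches applied to the shallow feature F_in. The branch depth, width, and activation here are illustrative assumptions; the paper does not specify the exact block design:

```python
import torch
import torch.nn as nn

class CDM(nn.Module):
    """Component decomposition module (a sketch): two independent convolutional
    branches split the shallow feature F_in into the initial structure and
    detail components F_s^0 and F_d^0."""
    def __init__(self, channels=64):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1))
        self.structure_branch = branch()
        self.detail_branch = branch()

    def forward(self, f_in):
        # Each branch learns its component directly from the shared feature.
        return self.structure_branch(f_in), self.detail_branch(f_in)
```

Both branches consume the same shallow feature, so the split into structure and detail is driven entirely by the component-wise supervision described in Sec. III-E.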

C. Mutual Collaboration Block
The detail and structure components represent different image characteristics but are interdependent. On the one hand, the structure component can help understand contextual information and locate the detail component. On the other hand, the detail component can reveal the informative regions in the image, thus assisting the reconstruction of the structure component. Consequently, we propose the mutual collaboration block (MCB) to better learn the structure and detail features of images. As shown in Fig. 3, given the input features F^{m−1,n}_s and F^{m−1,n}_d, we first extract the intermediate structure and detail features X^{m,n}_s and X^{m,n}_d. Next, to exploit the relationship between the structure and detail features, we concatenate X^{m,n}_s and X^{m,n}_d to form a fused feature representation X^{m,n} and conduct parallel channel-wise attention to capture the inter-channel dependencies. Thus, we can produce two attention maps A^{m,n}_s and A^{m,n}_d:

A^{m,n}_s = CA_s(X^{m,n}),  A^{m,n}_d = CA_d(X^{m,n}),

where CA_s(•) and CA_d(•) denote the functions of channel attention. We further use A^{m,n}_s and A^{m,n}_d to re-weight the structure and detail features respectively, which forces the network to focus on more informative features. By integrating such a mechanism into a residual block, the output of the MCB can be formulated as

F^{m,n}_s = F^{m−1,n}_s + X^{m,n}_s ⊙ A^{m,n}_s,  F^{m,n}_d = F^{m−1,n}_d + X^{m,n}_d ⊙ A^{m,n}_d,

where F^{m,n}_s and F^{m,n}_d denote the resulting structure and detail features and ⊙ denotes element-wise multiplication. Equipped with the MCB, we can effectively exploit the interaction between the image structure and detail components, learning more comprehensive representations based on each other and achieving a collaborative optimization process. The effectiveness of the proposed MCB is discussed in Section IV-B.
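A minimal PyTorch sketch of the MCB mechanism described above (branch convolutions, concatenation, parallel channel attention, residual re-weighting) might look as follows. The squeeze-and-excitation form of the channel attention and the exact placement of the convolutions are our assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention over a fused feature."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return self.attn(x)  # (B, C, 1, 1) attention map

class MCB(nn.Module):
    """Mutual collaboration block (a sketch): each branch is refined by a
    convolution, the two branches are concatenated into a fused feature, and
    two parallel channel-attention maps re-weight each branch inside a
    residual connection."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv_s = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv_d = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.ca_s = ChannelAttention(channels)
        self.ca_d = ChannelAttention(channels)

    def forward(self, f_s, f_d):
        x_s, x_d = self.conv_s(f_s), self.conv_d(f_d)
        x = self.fuse(torch.cat([x_s, x_d], dim=1))  # fused representation X
        a_s, a_d = self.ca_s(x), self.ca_d(x)        # parallel attention maps
        # Residual re-weighting of each component by its attention map.
        return f_s + x_s * a_s, f_d + x_d * a_d
```

Because both attention maps are computed from the same fused feature, each branch is modulated by information from the other, which is the collaborative behaviour the block is designed for.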

D. Multi-scale Fusion Module
The goal of our multi-scale fusion module (MSFM) is to ensure that the two component features F_s and F_d are well combined, thus improving the final reconstruction quality. In the MSFM, we first take F_s, F_d, and their summation F_s + F_d as the inputs for multi-scale feature extraction. As shown in Fig. 4, we leverage three parallel convolutions with kernel sizes of 3 × 3, 5 × 5, and 7 × 7 as the multi-scale feature extractor. Therefore, for each input, we can produce three types of features. Taking F_s as an example, after such multi-scale feature extraction, we can generate three outputs:

F^3_s = W_{3×3} ∗ F_s,  F^5_s = W_{5×5} ∗ F_s,  F^7_s = W_{7×7} ∗ F_s.

As for feature fusion, we conduct cross-branch concatenation to aggregate the multi-scale information, and use another three parallel convolutional filters with feature concatenation to output the fused feature. The whole multi-scale feature fusion of the MSFM can be simply expressed as

F_fused = Fusion(F^3_s, F^5_s, F^7_s, F^3_d, F^5_d, F^7_d, F^3_{sd}, F^5_{sd}, F^7_{sd}),

where Fusion(•) denotes the fusion function and F^k_{sd} denotes the features extracted from F_s + F_d. Moreover, in the MSFM, we append six densely connected 3 × 3 convolutional layers with a residual connection to enhance the fused feature:

F_out = F_fused + E(F_fused),

where E(•) denotes the function for feature enhancement and F_out is the final output of the MSFM.
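The MSFM computation can be sketched as follows. This is a simplified, assumed implementation: the cross-branch concatenation scheme is collapsed into a single 1 × 1 fusion convolution, and the six densely connected enhancement layers are reduced to a two-layer tail:

```python
import torch
import torch.nn as nn

class MSFM(nn.Module):
    """Multi-scale fusion module (simplified sketch): 3x3 / 5x5 / 7x7 parallel
    convolutions over F_s, F_d and their sum, a 1x1 fusion convolution over
    the concatenated multi-scale features, and a residual enhancement tail."""
    def __init__(self, c=64):
        super().__init__()
        self.scales = nn.ModuleList(
            [nn.Conv2d(c, c, k, padding=k // 2) for k in (3, 5, 7)])
        self.fuse = nn.Conv2d(3 * 3 * c, c, 1)  # 3 inputs x 3 scales
        self.enhance = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c, c, 3, padding=1))

    def forward(self, f_s, f_d):
        feats = []
        for x in (f_s, f_d, f_s + f_d):          # the three inputs
            feats += [conv(x) for conv in self.scales]
        fused = self.fuse(torch.cat(feats, dim=1))
        return fused + self.enhance(fused)       # residual enhancement
```

The padding `k // 2` keeps all scales at the same spatial size so the nine feature maps can be concatenated directly.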

E. Degradation-driven Learning Strategy
Considering the importance of the degradation process in the blind SR problem, we present a degradation-driven learning strategy to exploit the relation between the degradation kernel and SR reconstruction. Specifically, according to Eq. (4), the kernel-degraded HR image I_s is defined as the structure label. We regard the input LR image I_LR as a bicubic-blurred and s-fold downsampled version of I_s, i.e., I_LR = (I_s ⊗ k_b) ↓_{s_d}. Therefore, the estimated structure image Î_s can be represented by S((I_s ⊗ k_b) ↓_{s_d}; θ_s, θ_d) (equivalent to Eq. (6)). Thus, the loss function for structure component optimization can be formulated as

L_s = ‖ Î_s − I_s ‖.

Besides, the residue between the structure label I_s and the ground truth I_HR is taken as the detail label, whose supervision concentrates on the information lost to the blur kernel k_g:

L_d = ‖ Î_d − (I_HR − I_s) ‖.

The supervision of both the detail and structure components drives the network to capture the degradation effects of both the blur kernel k_g and the bicubic kernel k_b, leading to vivid image restoration in the blind SR problem.
In addition, we introduce an SR reconstruction loss to ensure that the SR result I_SR is supervised by the label I_HR:

L_sr = ‖ I_SR − I_HR ‖.

The overall loss function used to train CDCN can be formulated as follows:

L = L_s + L_d + L_sr.

The degradation-driven learning strategy enables our network to perform HR detail and structure recovery and SR reconstruction while maintaining consistency with the degradation process for blind SR.
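The three loss terms above can be assembled as in the sketch below, where `blur_fn` is a stand-in for convolution with the Gaussian kernel k_g. The choice of the L1 distance and equal loss weights are our assumptions, as the paper does not state them here:

```python
import torch
import torch.nn.functional as F

def cdcn_loss(sr, hat_s, hat_d, hr, k_g, blur_fn):
    """Degradation-driven training losses (a sketch). blur_fn(hr, k_g)
    applies the Gaussian kernel k_g to the HR image to produce the structure
    label; L1 distances and equal weights are illustrative assumptions."""
    i_s = blur_fn(hr, k_g)   # structure label: kernel-degraded HR image
    i_d = hr - i_s           # detail label: information lost to the blur
    l_s = F.l1_loss(hat_s, i_s)
    l_d = F.l1_loss(hat_d, i_d)
    l_sr = F.l1_loss(sr, hr)
    return l_s + l_d + l_sr
```

The structure and detail labels are generated on the fly from the HR image and the sampled kernel, so no extra ground-truth data is required beyond the HR-LR pairs.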

IV. EXPERIMENTS

A. Datasets and Implementation Details
Datasets. We synthesize the training HR-LR image pairs according to Eq. (3) with a bicubic downsampler. For the isotropic Gaussian blur kernels, following [15,16,17,18], the kernel size is set to 21 × 21, and the kernel width ranges are set to [0.2, 2.0], [0.2, 3.0] and [0.2, 4.0] for SR scale factors 2, 3 and 4, respectively. For more general degradations with anisotropic Gaussian blur kernels, following [19], the kernel size is set to 11 × 11 with random lengths λ_1, λ_2 ∼ U(0.6, 5), independently distributed for each axis, rotated by a random angle θ ∼ U(−π, π). We then apply uniform multiplicative noise (up to 25% of each pixel value of the kernel) and normalize the kernel to sum to one, making it deviate from a regular Gaussian kernel.
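The anisotropic kernel synthesis described above can be sketched as follows; interpreting the random lengths λ_1, λ_2 as the per-axis eigenvalues of the kernel covariance is our assumption:

```python
import numpy as np

def anisotropic_gaussian_kernel(size=11, rng=None):
    """Random anisotropic Gaussian kernel following the recipe above:
    random per-axis lengths, a random rotation, multiplicative uniform
    noise (up to 25%), and normalization to sum to one."""
    if rng is None:
        rng = np.random.default_rng()
    lam1, lam2 = rng.uniform(0.6, 5.0, size=2)  # lengths for each axis
    theta = rng.uniform(-np.pi, np.pi)          # random rotation angle
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    cov = rot @ np.diag([lam1, lam2]) @ rot.T   # rotated covariance matrix
    ax = np.arange(size) - size // 2
    xy = np.stack(np.meshgrid(ax, ax), axis=-1)  # (size, size, 2) coords
    inv = np.linalg.inv(cov)
    k = np.exp(-0.5 * np.einsum('...i,ij,...j->...', xy, inv, xy))
    # Uniform multiplicative noise, up to 25% of each kernel value.
    k *= 1.0 + rng.uniform(-0.25, 0.25, size=k.shape)
    return k / k.sum()
```

Because the noise multiplier stays within [0.75, 1.25], the perturbed kernel remains non-negative and, after normalization, still sums to one.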
Parameters. In the proposed network, the number of RGs and the number of MCBs per RG are 5 and 10, respectively. The number of filters in the intermediate layers is set to 64. We use LeakyReLU [53] as the activation function.
Training settings. During training, the patch size is 64 × 64 for all scale factors and the batch size is 16. Training data augmentation is performed by random horizontal flips and 90-degree rotations. We use Adam [60] as the optimizer with β_1 = 0.9, β_2 = 0.99, and an initial learning rate of 2 × 10^−4. All models are trained for 5 × 10^5 iterations, and the learning rate is halved every 1 × 10^5 iterations. We implemented our network in the PyTorch framework with NVIDIA RTX A4000 GPUs. We also implement our network using the MindSpore Lite tool.

B. Ablation Studies
Here, we report the best PSNR values within 1 × 10^5 iterations for a quick comparison under isotropic Gaussian blur degradation.
Investigation of the numbers of RGs and MCBs. Firstly, we investigate the basic hyperparameters of the proposed CDCN: the number of RGs (denoted as G for short) and the number of MCBs in each RG (denoted as B for short). Specifically, we develop 5 variants with different values of G and B to study how they affect SR performance, denoted as G5B5, G4B8, G5B10, G8B8, and G10B10. We observe the PSNR performance on the Set5 dataset with Gaussian8 kernels for ×4 SR. As shown in Table I, the model "G10B10" achieves the best PSNR but also involves the largest number of parameters, about twice that of "G5B10". "G5B10" shows PSNR very close to "G8B8" with fewer parameters (11.7M vs. 14.1M, about a 20% reduction). To pursue a good balance between model complexity and SR performance, we choose the variant G5B10 as our baseline model in the other experiments.
Study of Components Decomposition. As mentioned in Section III, we decompose the input LR images into structure components and detail components. The network is then trained with the degradation-driven learning strategy, which provides supervision using the corresponding labels. In this subsection, we investigate the effect of our components decomposition and the constraints on both components.
Here, we first introduce a network variant (Model 1) by removing the component decomposition module in Fig. 2; specifically, the proposed dual-path network is simplified to a single-path network. Then, we develop another variant by removing both the structure and detail constraints (Model 2). As shown in Table II, the network without components decomposition shows the worst PSNR score among all variants. Clearly, our CDCN benefits from both the components decomposition and the constraints from the structure and detail components to produce better results.
To further investigate which component plays a more important role in the proposed network, we develop two variants by removing the structure constraint (Model 3) or the detail constraint (Model 4). From Table II, it can be found that the structure constraint alone or the detail constraint alone is not enough to recover satisfactory HR images. Models 3 and 4 suffer from the absence of one component constraint, thus producing worse results compared to the full model. This phenomenon demonstrates the importance of component learning to a certain extent. It is worth noting that they still produce better results than Models 1 and 2, which shows that the absence of one component constraint does not destroy the network's overall recovery ability.
In Fig. 5, we visualize the recovered structure and detail maps and their corresponding feature maps to illustrate how well the structure and detail components have been learned by the proposed CDCN. Note that these feature maps are produced by the last convolutional layers before the final reconstruction layers. We can draw two intuitive clues from this visualization: (1) our framework can recover promising structure and detail maps that are similar to their corresponding ground-truth versions (first column), which demonstrates that the proposed CDCN can effectively exploit the interaction between the image structure and detail components for better HR reconstruction; (2) the average structure feature map before reconstruction tends to contain more low-frequency structural information, while the average detail feature map contains more discriminative high-frequency detail information (e.g., edges and textures). This phenomenon shows the strong effect of the structure and detail constraints, which further leads to more accurately recovered structure and detail maps. With the well-recovered components, our framework can achieve better SR reconstruction.
In the next subsection, we will investigate how the collaborative optimization of structure and detail components affects the final SR performance.
Influence of Collaborative Optimization. With the proposed MCB, we achieve the collaborative optimization of the structure and detail components. Here, we investigate the effect of the proposed MCB by comparing our model with four other variants. First, we remove the feature concatenation and channel attention operations, which simplifies our model to a simple dual-path residual block ("w/o all"). Then, we keep the channel attention operation and introduce a variant without collaborative optimization ("w/o CO"), which makes our residual group consist of dual-path residual channel-attention blocks. In this case, the whole residual group learns the two components separately and only fuses them in the final MSFM. To further investigate the component fusion and interaction ability of the proposed MCB, we remove the channel attention operation and directly use feature concatenation followed by a 1 × 1 convolution to achieve component fusion ("w/ FC"). Finally, we replace the feature concatenation with element-wise addition, where the features from the other component are fed into a 1 × 1 convolution and then added to the original component ("w/ EA").
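The two fusion variants "w/ FC" and "w/ EA" can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: a 1 × 1 convolution is modelled as a per-pixel linear map over channels, and all shapes and weight names are assumptions.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear map over channels.
    # x: (C_in, H, W), w: (C_out, C_in) -> output: (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f_s = rng.standard_normal((C, H, W))   # structure-path features
f_d = rng.standard_normal((C, H, W))   # detail-path features

# "w/ FC": concatenate both components, fuse with a 1x1 conv back to C channels
w_fc = rng.standard_normal((C, 2 * C))
fused_fc = conv1x1(np.concatenate([f_s, f_d], axis=0), w_fc)

# "w/ EA": project the other component with a 1x1 conv, then add element-wise
w_ea = rng.standard_normal((C, C))
fused_ea = f_s + conv1x1(f_d, w_ea)
```

Both variants exchange information between the two paths, which is why they outperform the non-interactive variants "w/o all" and "w/o CO" in Table III.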
As we can see from Table III, the channel attention operation improves the performance to some extent, but it fails for certain kernel widths. The variants that utilize collaboration between the structure and detail components ("w/ FC" and "w/ EA") achieve better performance than those without information interaction ("w/o all" and "w/o CO"). Compared with the other variants, our model constructed with the proposed MCB can more effectively exploit the relationship between the detail and structure components for better SR performance.
To further demonstrate the effectiveness of MCB, we also visualize the detail components recovered by the variants (see Fig. 6) and calculate PSNR values between the detail components and their corresponding ground truths (see Fig. 7). It can be observed in Fig. 6 that the proposed MCB recovers brighter and cleaner detail components than the other variants, which directly leads to better SR reconstruction of the text "Everything". Fig. 7 demonstrates the effectiveness of MCB for detail estimation from a quantitative perspective.

Effect of Multi-scale Fusion Module. In the previous subsections, we proved the necessity of component decomposition and collaborative learning. We now investigate the effect of the proposed multi-scale fusion module (MSFM). We first introduce a variant by removing the MSFM in Fig. 2, where we simply use element-wise addition to combine the structure and detail features and obtain the fused features for SR reconstruction. We then develop another two variants by removing the multi-scale feature fusion stage or the fused feature enhancement stage in Fig. 4. From Table IV, it can be observed that the PSNR values decrease if the two components are not well integrated (e.g., 28.29 vs. 28.04 dB on Set14). Besides, both the multi-scale feature fusion and the fused feature enhancement stages contribute to higher PSNR scores.
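The two MSFM stages can be illustrated schematically. The sketch below does not reproduce the paper's Fig. 4: the pooling-based multi-scale branch, the `alpha` mixing weight, and the toy residual "enhancement" step are all hypothetical stand-ins chosen only to show the fuse-then-enhance structure.

```python
import numpy as np

def avg_pool2(x):
    # 2x2 average pooling on (C, H, W); H and W must be even.
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).mean(axis=(2, 4))

def upsample2(x):
    # Nearest-neighbour 2x upsampling on (C, H, W).
    return x.repeat(2, axis=1).repeat(2, axis=2)

def msfm(f_s, f_d, alpha=0.5):
    # Stage 1 (multi-scale feature fusion): combine both components and
    # mix in a coarser-scale view of the fused features.
    fused = f_s + f_d
    coarse = upsample2(avg_pool2(fused))
    multi = fused + alpha * coarse
    # Stage 2 (fused feature enhancement): a residual refinement step.
    return multi + 0.1 * np.tanh(multi)

rng = np.random.default_rng(1)
out = msfm(rng.standard_normal((4, 8, 8)), rng.standard_normal((4, 8, 8)))
```

Removing either stage corresponds to the two ablation variants in Table IV.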

C. Comparison with State-of-the-Art Methods
Isotropic Gaussian Kernels. For isotropic Gaussian blur kernels, we first evaluate our method on test images synthesized with the Gaussian8 kernel setting. We compare our method with state-of-the-art blind SR approaches, including ZSSR [43] (with bicubic kernel), IKC [15], DAN [16], DASR [17], AdaTarget [45], SRDRL [46], and DSSR [18]. We also compare with the bicubic-assumed methods RCAN [7] and CARN [6]; following [16], the deblurring method [61] is additionally combined with CARN for comparison. For most methods, we use their public code and pre-trained models. Since the IKC models for some scale factors are unavailable, we retrain IKC using its official implementation to evaluate its SR performance.
The quantitative comparison is shown in Table V. Although RCAN [7] and CARN [6] achieve promising results under bicubic degradation, they suffer a severe performance drop when the degradation deviates from the bicubic one, since they are trained under the bicubic setting. When refined by the deblurring method ("CARN + [61]"), the performance of CARN significantly improves and exceeds that of the large-capacity network RCAN. However, when the deblurring method is performed before SR reconstruction ("[61] + CARN"), the result is even worse than bicubic interpolation, owing to the larger domain gap between deblurred images and degraded LR images. Benefiting from correcting inaccurate blur kernels iteratively, IKC [15] improves the results with higher PSNR values. By unifying degradation estimation and SR reconstruction into an end-to-end framework, DAN [16] and DASR [17] perform better than IKC, but they are still inferior to our method. Compared with DSSR [18], which utilizes a similar detail-structure optimization, our method performs better owing to the proposed degradation-driven learning strategy.
In Table VI, we further compare with several non-blind SR methods, including SRMDNF [11], SFTMD [15], MZSR [41], and UDVD [13], to show our performance for other specific kernel widths. The result of the SOTA blind SR approach DAN [16] is also reported. As one can see, when the ground-truth blur kernels are available, non-blind SR methods perform better than blind SR methods (e.g., UDVD vs. DAN). Even without blur kernel priors, the proposed CDCN not only produces better SR results than the blind SR method DAN, but also achieves comparable performance with existing non-blind SR methods. The comparison in Table VI demonstrates the robustness of our method.

The qualitative results are shown in Fig. 8. It can be observed that RCAN and ZSSR can hardly recover any details in the images, so their results remain blurry. The recent SOTA blind SR methods such as IKC, DAN, DASR, and DSSR produce cleaner SR images in general but fail in certain areas (e.g., the lower-right part of the building in "img 033" from the Urban100 dataset in Fig. 8). In contrast, the proposed CDCN generates visually pleasant results with better color fidelity and higher perceptual quality.
Anisotropic Gaussian Kernels. As shown in Table VII, the proposed CDCN maintains preferable performance across all scale factors compared with other methods. Specifically, since the methods in Class 1 are trained on bicubic downsampled images, they show the worst results on anisotropic as well as isotropic Gaussian kernels. The winners (Class 2) of the NTIRE blind SR challenge achieve fine results in the challenge but still fail to recover promising images in the face of anisotropic Gaussian kernels. When blur kernels are unavailable, due to the distance between estimated kernels and ground-truth kernels, simply combining kernel estimation methods with non-blind SR methods (Class 3) cannot always produce satisfactory results. The end-to-end blind SR methods (Class 4) achieve superior results to the other classes, but their performance is still limited by the accuracy of kernel estimation. In particular, MoESR [25] uses a mixture of experts for specific degradation kernels, which achieves remarkable performance for scale factor 2 but worse ×4 results. Some other methods conduct blind SR without kernel estimation (Class 5) and obtain PSNR and SSIM close to Class 4, but still lower than the proposed CDCN. By comparison, our method solves the blind SR problem using detail-structure collaborative optimization with a degradation-driven learning strategy, which achieves nearly the best performance on all scale factors. Fig. 9 shows the visual results on DIV2KRK [19]. We can observe that the bicubic-based method EDSR still fails to recover sharp lines. The SR images generated by KernelGAN and MoESR suffer from severe unpleasant artifacts. The SR images produced by DAN are cleaner, but visible artifacts and blurs still exist in their results. Meanwhile, the proposed CDCN reduces the artifacts and recovers photo-realistic details, such as the window in "img 845" from the DIV2KRK dataset in Fig. 9.
Real-World Degradations. In addition to the experiments on synthetic data based on isotropic and anisotropic Gaussian kernels, we further compare with other methods on real-world images to demonstrate the effectiveness of the proposed CDCN. The visual results are shown in Fig. 10. As one can see, the results of the compared methods are better than bicubic interpolation. However, whether non-blind or blind, they all super-resolve the images with obvious artifacts and blurry edges. Compared with these methods, our method produces SR images with sharper edges, clearer contents, and fewer artifacts, which demonstrates the generalization ability of the proposed CDCN.
Efficiency. To demonstrate the efficiency of the proposed CDCN, we conduct a comparison with other SOTA blind SR methods in terms of parameters, average inference time, and PSNR performance on the Set5 dataset with Gaussian8 kernels for ×2 SR. For a fair comparison, all the methods are tested using their public source codes and pre-trained models on the same platform (a TITAN RTX GPU). As shown in Table VIII, although having the fewest parameters, the zero-shot methods "Bicubic + ZSSR" and "[19] + ZSSR" require thousands of iterations during testing and exhibit the longest inference time.
Benefiting from iterative kernel estimation, IKC [15] and DAN [16] achieve better performance than ZSSR. The performance of DSSR [18] is further improved by utilizing detail-structure optimization, but is still lower than ours. Although these iterative methods have fewer parameters than the proposed CDCN, their inference time is much higher than ours due to their iteration strategy. Compared with the fastest method, DASR [17], our method keeps a good balance between inference time and reconstruction performance. Specifically, our method exceeds DASR by 0.61 dB with a slightly higher execution time.

V. CONCLUSION
In this paper, we start from a degradation-based formulation model and analyze the degradation process from the perspective of image components. On the basis of degradation modelling, we propose a components decomposition and co-optimization network (CDCN), which performs structure and detail component decomposition and restoration simultaneously. In particular, CDCN uses the proposed mutual collaboration block (MCB) to extract features from both components and exploit the mutual relationship between them. Furthermore, a degradation-driven learning strategy is proposed to explicitly supervise the detail and structure restoration and exploit the interaction between the degradation kernel and SR reconstruction. Finally, a multi-scale fusion module is designed to enhance the feature representations of both components and produce a comprehensive feature for SR reconstruction. Extensive experiments on both synthetic datasets and real-world images demonstrate the effectiveness of the proposed CDCN. In the future, we will continue to mine the internal information of images and combine it with the degradation process. By combining internal image information analysis with external dataset training, we believe there will be further improvement in the blind SR task.

Fig. 2. The architecture of the proposed components decomposition and co-optimization network (CDCN). Blue arrows represent feature connections of the structure component; red arrows represent feature connections of the detail component.
For the m-th MCB in the n-th RG, the structure component $F_{s}^{m-1,n}$ and the detail component $F_{d}^{m-1,n}$ serve as the input. We first use two individual stacked convolutional layers to perform feature extraction on $F_{s}^{m-1,n}$ and $F_{d}^{m-1,n}$:

$$X_{s}^{m,n} = \delta\big(W_{s2}^{m,n}\,\delta(W_{s1}^{m,n} F_{s}^{m-1,n})\big), \qquad X_{d}^{m,n} = \delta\big(W_{d2}^{m,n}\,\delta(W_{d1}^{m,n} F_{d}^{m-1,n})\big),$$

where $X_{s}^{m,n}$ and $X_{d}^{m,n}$ are the extracted features, $[W_{s1}^{m,n}, W_{s2}^{m,n}]$ and $[W_{d1}^{m,n}, W_{d2}^{m,n}]$ are the corresponding learned weight sets for $X_{s}^{m,n}$ and $X_{d}^{m,n}$ in the MCB, respectively, and $\delta(\cdot)$ denotes the LeakyReLU [53] activation function.
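This two-layer extraction step can be sketched as follows. For brevity, the convolutional layers are modelled as 1 × 1 channel-mixing maps; the actual kernel sizes and weight shapes in CDCN are not specified here, so these are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    # delta(.) in the formulation; the negative slope is an assumed value.
    return np.where(x > 0, x, slope * x)

def conv(x, w):
    # Channel-mixing stand-in for a convolutional layer (a 1x1 conv).
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def extract(f, w1, w2):
    # X = delta(W2 * delta(W1 * F)): two stacked conv layers with LeakyReLU.
    return leaky_relu(conv(leaky_relu(conv(f, w1)), w2))

rng = np.random.default_rng(0)
f_s = rng.standard_normal((4, 8, 8))                      # F_s^{m-1,n}
w1, w2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
x_s = extract(f_s, w1, w2)                                # X_s^{m,n}
```

The detail path applies the same operation with its own weight set.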

Fig. 5. Visualized recovered detail and structure components of the proposed CDCN with an isotropic kernel for ×4 SR. The kernel width is set to 3.6.

Fig. 6. Visual comparisons of different variants with isotropic Gaussian kernels on Set14 [57] for ×4 SR. The kernel width is set to 3.6.

TABLE I: PSNR (DB) PERFORMANCE AND THE MODEL COMPLEXITY WITH DIFFERENT G AND B. THE RESULTS ARE EVALUATED ON THE SET5 DATASET WITH Gaussian8 KERNELS FOR ×4 SR.

FOR 4× SR. THE KERNEL WIDTH IS SET TO 1.2.

TABLE III: EFFECT OF THE VARIANTS (PSNR (DB)/SSIM) IN OUR MCB, MEASURED ON SET14 FOR ×4 SR.

TABLE IV: EFFECT OF THE VARIANTS (PSNR (DB)) IN OUR MULTI-SCALE FUSION MODULE, MEASURED ON SET5 [56] AND SET14 [57] FOR 4× SR. THE KERNEL WIDTH IS SET TO 2.4.

TABLE VI: QUANTITATIVE COMPARISONS (PSNR (DB)) WITH SOTA NON-BLIND SR METHODS WITH ISOTROPIC GAUSSIAN KERNELS ON SET5 [56], SET14 [57], AND BSD100 [58]. THE KERNEL WIDTH IS SET TO 1.3. THE BEST TWO RESULTS ARE HIGHLIGHTED IN RED AND BLUE.

TABLE VII: QUANTITATIVE COMPARISONS (PSNR (DB)/SSIM) WITH SOTA SR METHODS WITH ANISOTROPIC GAUSSIAN KERNELS ON DIV2KRK [19]. THE BEST TWO RESULTS ARE HIGHLIGHTED IN RED AND BLUE.

TABLE VIII: EFFICIENCY COMPARISON WITH SEVERAL SOTA BLIND SR METHODS FOR ×2 SR ON THE SET5 DATASET WITH Gaussian8 KERNELS.