Single Image Super-Resolution by Residual Recovery Based on an Independent Deep Convolutional Network

In this paper, we propose an independent neural network for single image super-resolution by residual recovery. The network is motivated by the observation that image residuals still exist between the low-resolution image and the downsampled high-resolution output of a previously proposed super-resolution network. Based on this observation, we design a simple but effective deep convolutional neural network that learns the mapping from these image residuals to the corresponding ground-truth residuals. We then add the high-resolution residual predicted by the proposed network to the high-resolution output of the previous super-resolution network to yield the final high-resolution image. Extensive experiments on simulated natural images and real time-of-flight (ToF) images demonstrate the effectiveness of the proposed method in terms of both visual and quantitative performance.


I. INTRODUCTION
The main goal of single image super-resolution (SR) is to recover a high-resolution (HR) image from one low-resolution (LR) image while keeping image details clear. In general, the LR image contains fewer image details than the HR image, which prompts us to develop mathematical strategies to recover the details missing from the LR image. Designing an accurate and fast SR approach to increase image resolution is therefore crucial, and it is the main task and challenge of this work.
From the perspective of methodology, existing single image SR approaches can be divided into three categories: 1) interpolation-based methods; 2) statistics-based methods; 3) learning-based methods. The learning-based methods can be further divided into two groups: dictionary-based learning methods and deep learning-based methods. The method proposed in this paper belongs to the deep learning-based category.
The associate editor coordinating the review of this manuscript and approving it for publication was Gustavo Callico .
The interpolation-based method is a classical single image SR approach that has been studied for several decades. This kind of method fills in pixels at unknown locations using relations among their neighboring points. The most classical interpolation methods for single image SR are nearest-neighbor interpolation and bicubic interpolation. Both methods yield SR outcomes quickly; however, nearest-neighbor interpolation generally leads to jagged edges, and bicubic interpolation may result in blur. Besides these, some state-of-the-art interpolation methods have been proposed recently; see, e.g., [2]-[5].
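As a small illustration of the interpolation idea described above, the following is a minimal sketch of nearest-neighbor upsampling. The function name `nearest_neighbor_upsample` and the toy input are assumptions made for this example only; they do not come from any of the cited methods.

```python
import numpy as np

def nearest_neighbor_upsample(img, scale):
    """Upsample a 2-D image by copying each pixel `scale` times along both axes."""
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

# Toy 2x2 "LR image" (hypothetical data).
lr = np.array([[0.0, 1.0],
               [1.0, 0.0]])
hr = nearest_neighbor_upsample(lr, 3)
print(hr.shape)  # (6, 6)
```

Every LR pixel becomes a constant 3 × 3 block, which is one way to see where the jagged appearance of nearest-neighbor results comes from.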
The statistics-based method is also an active direction in image SR. In general, it contains two important branches, i.e., Maximum a Posteriori (MAP) based methods and Maximum Likelihood Estimator (MLE) based methods (see [6]-[8]). In [7], Capel et al. proposed two estimators for the resolution enhancement of text images: a MAP estimator based on a Huber prior, and an estimator using total variation (TV) regularization. The method not only enhances image resolution but also performs denoising simultaneously. Based on MAP, several regularization models have been proposed for single image applications, e.g., image super-resolution [9], [10]. In [9], Deng et al. proposed a sparse regularization model for single image SR based on reproducing kernel Hilbert space (RKHS) functions. To pick up more image details, they also designed an iterative scheme solved by the alternating direction method of multipliers (ADMM). After that, Deng et al. [11] presented an $\ell_1$ sparse model based on two Heaviside function terms, one depicting the primary image information and the other describing the sparse sharp edges. Experimental results demonstrate that these regularization models obtain promising performance. Wang and Gong [10] proposed an RKHS-based regularization model that realizes image SR and denoising simultaneously.
FIGURE 1. The super-resolution results for a real ToF image with a scale factor of 3. The LR image has low resolution and additional noise. The proposed method removes outliers better than the state-of-the-art PnP method [1] when increasing image resolution (please see the close-up), which indicates the better performance of our method. Moreover, the absolute residual map (shown in the last image) between the PnP and the proposed method demonstrates that our method picks up image details missing from the result of the PnP to get a better visual outcome.
Dictionary-based learning approaches play a crucial role in the field of image SR and show significant improvements over classical methods; see, e.g., [12]-[17]. One representative dictionary-based learning method for image SR was proposed by Yang et al. [16]. The authors formulated a dictionary-based learning framework for single image SR that uses an $\ell_0$ sparse training model with LR patches and HR patches as input. Once the relation between the LR patches and HR patches has been learned, the output HR image is obtained by feeding an LR image into the learned relation.
Recently, with the tremendous improvements in hardware devices, deep learning has shown great power in image processing, e.g., [18], [19]. For image SR, Dong et al. [20], [21] first used a three-layer convolutional neural network (CNN) to address single image SR, called SRCNN. This network uses an $\ell_2$ loss function to compute the parameters on each layer, and then predicts the HR image from any LR input with the trained nonlinear mapping. After this work, many CNN-based methods have been proposed for image SR, e.g., [22]-[27]. Kim et al. [24] proposed a deep recursive CNN for single image SR, built around a very deep recursive layer. The recursion does not introduce new parameters; thus, the network is quite fast in both training and testing. Additionally, Lai et al. [26], [27] presented fast and accurate image SR with a deep Laplacian pyramid network, which reconstructs the sub-band residuals at multiple pyramid levels. Because features are extracted on LR grids, the approach has a quite low computational cost. In [1], Zhang et al. proposed a deep plug-and-play SR method for arbitrary blur kernels. The deep plug-and-play framework is based on a new single image SR degradation model, which can take advantage of existing blind deblurring approaches. Experimental results on several simulated and real examples show that this method obtains state-of-the-art single image SR performance. Although there are many deep CNN methods for image SR, there is still room for improvement due to the multiscale property of SR. Here we use this property to design a deep neural network architecture for single image SR.
In this paper, we observe that image residuals exist between the LR image and the downsampled HR output yielded by a previously proposed SR network. To exploit the image residuals on LR grids, we independently design a simple deep CNN based on ResNet [28] to pick up more image details for the final HR image. In particular, the ground-truth residuals for the independent deep CNN are obtained as the difference between the ground-truth HR image and the high-resolution output of the previously proposed SR network. Furthermore, we use an $\ell_2$ norm as the loss function. Experimental results on simulated natural images and real ToF images demonstrate the effectiveness of the proposed method. Additionally, Fig. 2 shows the flowchart of the proposed deep CNN for single image SR.
In summary, this paper makes the following contributions: 1) Unlike previous deep SR CNNs, which only force the error between the network output and the ground truth to be as small as possible, this paper formulates an independent deep CNN for residual recovery to pick up more image details of HR images. 2) The proposed deep CNN yields the best performance, especially in quantitative terms, compared with modern state-of-the-art SR methods. 3) Our approach works for real ToF images and achieves competitive visual performance.
FIGURE 2. The flowchart of the proposed method. ''Architecture 1'' could be any existing network used for single image SR. After ''Architecture 1'', we downsample the ''Output'' and then calculate the residual between the LR image and the downsampled output to get the residual input ''Lr_res'' for ''Architecture 2''. Besides, we also compute the ground-truth residual ''Gt_res'' as the difference between the ground truth and the HR output. The designed network ''Architecture 2'' involves four ResNet blocks and an $\ell_2$ loss function. Before the ResNet blocks of ''Architecture 2'', a transposed convolution increases the size of ''Lr_res'' to match the size of ''Gt_res''. The final SR image is the sum of the ''Output'' of ''Architecture 1'' and the ''Output_res'' of ''Architecture 2''. More parameter settings for our architecture can be found in Section III-C.
The paper is outlined as follows. In Section II, we briefly introduce the related works. Section III presents the proposed deep CNN for single image SR in detail. In Section IV, visual and quantitative results are reported to show the superiority of the proposed method. We also apply our method to real ToF images and explain in that section why ToF images are chosen as the real test data. Finally, some conclusions are drawn in Section V.

II. PROBLEM FORMULATION AND RELATED WORKS
The proposed method builds on the plug-and-play (PnP) formulation [1], and the outcome of PnP is crucial to the final SR result of our method. In this section, we therefore briefly review the SR problem and the formulation of PnP.
Image SR is a critical problem in image processing. Its goal is to increase the spatial resolution of an image so that the processed image better serves subsequent applications, e.g., recognition, segmentation, and object detection. The image SR degradation can be mathematically formulated as
$$y = (k \otimes x)\downarrow_s + n, \tag{1}$$
where $y$ stands for the LR image, $\otimes$ is the convolution between the blur kernel $k$ and the clean HR image $x$, and $n$ represents additive white Gaussian noise. Additionally, $\downarrow_s$ is the downsampling operator with a scaling factor $s$. Recovering $x$ from the degraded SR model (1) is an ill-posed problem. Many strategies can be used to solve it, e.g., regularization-based approaches, which have been used in many applications [29]-[34]. These regularization methods, however, raise some issues, such as how to estimate the blur kernel accurately. Even though some recent works address blur kernel estimation, computing the kernel accurately remains difficult.
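To make the degraded SR model (1) concrete, the following is a minimal NumPy sketch of the three steps it describes: blur, downsample, add noise. The helper names (`gaussian_kernel`, `blur`, `degrade`), the kernel size, the noise level, and the use of direct decimation as the downsampling operator are all assumptions chosen for illustration; the experiments in this paper use bicubic downsampling.

```python
import numpy as np

def gaussian_kernel(size, std):
    """Normalized 2-D Gaussian blur kernel of odd side length `size`."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * std ** 2))
    k = np.outer(g, g)
    return k / k.sum()

def blur(x, k):
    """2-D convolution of image x with kernel k, edge padding, same output size."""
    r = k.shape[0] // 2
    kf = k[::-1, ::-1]                      # flip the kernel: convolution, not correlation
    xp = np.pad(x, r, mode="edge")
    out = np.zeros_like(x, dtype=float)
    for i in range(k.shape[0]):
        for j in range(k.shape[1]):
            out += kf[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def degrade(x, k, s, noise_std, rng):
    """Blur x with k, keep every s-th pixel, then add white Gaussian noise."""
    blurred = blur(x, k)
    lr = blurred[::s, ::s]                  # assumption: decimation stands in for the downsampler
    return lr + noise_std * rng.standard_normal(lr.shape)

rng = np.random.default_rng(0)
x = rng.random((12, 12))                    # hypothetical clean HR image
y = degrade(x, gaussian_kernel(5, 1.0), s=3, noise_std=0.01, rng=rng)
print(y.shape)  # (4, 4)
```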
Recently, Zhang et al. [1] recast formulation (1) as the following SR degradation model and the corresponding energy function,
$$y = k \otimes (x\downarrow_s) + n, \tag{2}$$
$$\min_x \; \frac{1}{2\sigma^2}\,\| k \otimes (x\downarrow_s) - y \|_2^2 + \lambda\,\Phi(x), \tag{3}$$
where $\frac{1}{2\sigma^2}\| k \otimes (x\downarrow_s) - y \|_2^2$ is the fidelity term, $\Phi(x)$ represents the regularization term, and $\sigma$ and $\lambda$ are the noise level and the regularization parameter, respectively (see more details in [1]). To solve (3), Zhang et al. [1] give a strategy that solves for the unknown variables alternately and obtains excellent SR outcomes. More details of the solving process can be found in [1], and the corresponding code of this method is also available (see the result section).
VOLUME 9, 2021
The PnP method in [1] obtains state-of-the-art single image SR results and shows great capacity for a variety of images. However, as mentioned before, visible image residuals still exist between the LR image and the downsampled HR output yielded by a previous SR method, e.g., PnP (see also ''Lr_res'' in Fig. 2). Motivated by these image residuals, we design a deep CNN to recover the lost HR residuals and thereby generate better visual results. In what follows, we present the whole flowchart of our approach in detail.

III. PROPOSED METHOD
With the considerable development of image SR techniques, especially deep learning techniques, one can obtain satisfactory SR results for many types of images. However, image SR can always be improved, and new investigations or observations still leave room to make SR results better.
In this work, our method is based on the observation that visible image residuals still exist between the LR image and the downsampled HR output generated by a previous SR method, even a state-of-the-art one. In Fig. 2, it is evident that ''Lr_res'', the difference between the LR image and the downsampled HR output, still contains significant image residuals. We therefore attempt to pick up more HR image details from the LR residuals, in the spirit of the iterative SR methods in [9], [35]. Different from [9], we do not recover HR image details iteratively, but use a deep CNN, which has been proven in the literature to be an efficient and effective technique for image SR. Using a deep CNN for image SR requires training data of two kinds, i.e., the LR data and the corresponding ground-truth (GT) data. Fortunately, it is not difficult to generate the LR-GT residual pairs for training in this work. After obtaining the LR residual ''Lr_res'', the corresponding GT residual is generated naturally as the difference between the GT image and the output of the previous network.
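The inference flow described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: `toy_base_sr` and `toy_residual_net` are hypothetical placeholders standing in for ''Architecture 1'' (e.g., PnP) and the trained ''Architecture 2'', and decimation replaces the bicubic downsampling used in the paper.

```python
import numpy as np

def upsample_nn(img, s):
    """Nearest-neighbor upsampling, used only by the toy placeholders below."""
    return np.repeat(np.repeat(img, s, axis=0), s, axis=1)

def downsample(img, s):
    """Assumption: plain decimation stands in for bicubic downsampling."""
    return img[::s, ::s]

def super_resolve(lr, s, base_sr, residual_net):
    output = base_sr(lr, s)                  # "Output" of Architecture 1
    lr_res = lr - downsample(output, s)      # "Lr_res" on the LR grid
    output_res = residual_net(lr_res, s)     # "Output_res" of Architecture 2
    return output + output_res               # final SR image

def toy_base_sr(lr, s):
    """Hypothetical, deliberately imperfect base SR method."""
    return 0.9 * upsample_nn(lr, s)

def toy_residual_net(lr_res, s):
    """Hypothetical stand-in for the trained residual network."""
    return upsample_nn(lr_res, s)

lr = np.arange(6.0).reshape(2, 3)
sr = super_resolve(lr, 2, toy_base_sr, toy_residual_net)
print(sr.shape)  # (4, 6)
```

In this toy setting the base method loses 10% of the signal, the LR residual exposes exactly that loss, and adding the upsampled residual restores it.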

A. NETWORK ARCHITECTURE FOR THE RESIDUALS
The main goal of image SR is to recover spatial information from the LR image, which generally contains fewer spatial image details. These spatial details usually survive in the difference between the LR image and the downsampled estimated HR image. Besides, deep CNN methods, which do not depend on pre-defined image priors that are sometimes inaccurate, have shown significant superiority in image SR. Motivated by these two points, we propose a simple and effective network architecture that combines the spatial details on LR grids with a deep CNN. The ''Architecture 2'' in Fig. 2 is our design for the residual recovery of image SR. In this architecture, the calculated LR residual ''Lr_res'' is fed into the network, which establishes a nonlinear mapping $f$ to the GT residual ''Gt_res''. Therefore, the output of the deep network can be written as
$$\mathrm{Output\_res} = f_{\Theta}(\mathrm{Lr\_res}),$$
where $\Theta$ contains the network parameters, which mainly include the convolutional filters and biases on each layer. Since the input LR residual carries high-frequency image details such as edge information, a deep network architecture is preferable for feature extraction. ResNet [28] is a promising architecture among deep convolutional neural networks. It supports deep layers, which gives the network a more flexible ability to extract and represent image features. Thus we choose ResNet as the main part of our architecture. Specifically, ResNet can be viewed as a combination of ResNet blocks. Each ResNet block generally consists of two layers, with or without a nonlinear ReLU function; see Fig. 3 for one ResNet block. We use only four ResNet blocks in this work: since the input of the network is similar to its output, a ResNet with few blocks is suitable for learning such a transformation. From ''Architecture 2'' in Fig. 2, it is easy to see that ''Output_res'', the output of our network, indeed contains visible image residuals, which can be viewed as the image details lost by ''Architecture 1''.
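For readers less familiar with ResNet blocks, the following is a minimal NumPy sketch of the forward pass through four stacked blocks. It is not the trained network of this paper: the weights are random, the transposed convolution in front of the blocks is omitted, and placing the ReLU after the first convolution only is an assumption. The function names `conv3x3` and `resnet_block` are hypothetical; the 3 × 3 kernels and 32 filters follow Section III-C.

```python
import numpy as np

def conv3x3(x, w, b):
    """Zero-padded 'same' 3x3 convolution layer (cross-correlation, as in CNNs).
    x: (H, W, Cin), w: (3, 3, Cin, Cout), b: (Cout,)."""
    H, W, _ = x.shape
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((H, W, w.shape[3]))
    for i in range(3):
        for j in range(3):
            out += xp[i:i + H, j:j + W, :] @ w[i, j]
    return out + b

def resnet_block(x, w1, b1, w2, b2):
    h = np.maximum(conv3x3(x, w1, b1), 0.0)  # first layer followed by ReLU
    h = conv3x3(h, w2, b2)                   # second layer, no activation
    return x + h                             # skip connection

rng = np.random.default_rng(0)
C = 32                                       # number of filters per layer
x = rng.standard_normal((8, 8, C))           # hypothetical feature map
params = [(0.01 * rng.standard_normal((3, 3, C, C)), np.zeros(C),
           0.01 * rng.standard_normal((3, 3, C, C)), np.zeros(C))
          for _ in range(4)]                 # four blocks

h = x
for w1, b1, w2, b2 in params:
    h = resnet_block(h, w1, b1, w2, b2)
print(h.shape)  # (8, 8, 32)
```

With all weights set to zero each block reduces to the identity, which is one way to see why a few blocks suffice when the input and the target are already similar.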

B. LOSS FUNCTION
After obtaining the output of the network, i.e., ''Output_res'' with the parameters $\Theta$, it is necessary to define the loss function between ''Output_res'' and ''Gt_res'' so that we can calculate the parameters on each layer by backpropagation. One conventional loss function for high-frequency image details is the $\ell_1$ loss function,
$$\mathcal{L}_1(\Theta) = \| f_{\Theta}(\mathrm{Lr\_res}) - \mathrm{Gt\_res} \|_1.$$
However, considering the performance in the experiments, we take another conventional loss function with the $\ell_2$ norm,
$$\mathcal{L}(\Theta) = \| f_{\Theta}(\mathrm{Lr\_res}) - \mathrm{Gt\_res} \|_F^2,$$
where the $\|\cdot\|_F^2$ norm for matrices (or tensors) is equivalent to the $\ell_2$ norm for vectors.
The parameters on each layer can be obtained by
$$\hat{\Theta} = \arg\min_{\Theta} \mathcal{L}(\Theta),$$
where we use backpropagation to compute them. Having defined the loss function, we next present how to simulate the training data.
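The two candidate losses discussed in this subsection can be compared on a toy example. The sketch below is illustrative only; the function names `l1_loss` and `frobenius_loss` and the small arrays are assumptions, and no per-pixel averaging is applied.

```python
import numpy as np

def l1_loss(output_res, gt_res):
    """Sum of absolute differences between the predicted and GT residuals."""
    return np.abs(output_res - gt_res).sum()

def frobenius_loss(output_res, gt_res):
    """Squared Frobenius norm of the difference (sum of squared differences)."""
    return np.sum((output_res - gt_res) ** 2)

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # hypothetical "Output_res"
b = np.array([[1.0, 0.0], [0.0, 4.0]])   # hypothetical "Gt_res"

print(l1_loss(a, b))         # 5.0
print(frobenius_loss(a, b))  # 13.0
```

Flattening the difference into a vector and taking its squared $\ell_2$ norm gives the same value as the squared Frobenius norm, which is the equivalence stated in the text.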

C. TRAINING DETAILS
The proposed network ''Architecture 2'' is built on a previous network ''Architecture 1'', for which we employ PnP in this paper. Thus we do not need to re-simulate the training images, i.e., LR-GT image pairs, since these pairs have already been generated for ''Architecture 1''. We only need to simulate the ''Lr_res'' images and ''Gt_res'' images for our ''Architecture 2''. In particular, we generate the ''Lr_res'' images by subtracting the downsampled ''Output'' images, obtained directly by bicubic downsampling, from the original LR images. Similarly, we generate the ''Gt_res'' images by subtracting the ''Output'' images from the original GT images (see Fig. 2 for more details). Part of the training images for ''Architecture 2'' come from the test dataset, i.e., BSD68 [1], [36], [37], which contains 68 natural images. We simulate the LR images by the following steps: 1) blurring each clean image by Gaussian kernels with eight standard deviations (stds); 2) downsampling the blurred images directly by bicubic interpolation. This simulation gives 68 × 8 = 544 LR-GT image pairs. The 544 pairs are divided into 80% for training and 20% for testing, i.e., about 435 LR-GT image pairs for training and 109 for testing. In other words, even though the network is trained on relatively few LR-GT image pairs, it still obtains competitive results.
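The construction of one (''Lr_res'', ''Gt_res'') training pair and the dataset bookkeeping above can be sketched as follows. The helper names `decimate` and `make_residual_pair` and the toy arrays are assumptions for this example, and decimation again stands in for the bicubic downsampling used in the paper.

```python
import numpy as np

def decimate(img, s):
    """Stand-in downsampler (the paper uses bicubic downsampling)."""
    return img[::s, ::s]

def make_residual_pair(lr, output, gt, s, downsample=decimate):
    lr_res = lr - downsample(output, s)   # LR image minus downsampled "Output"
    gt_res = gt - output                  # GT image minus "Output"
    return lr_res, gt_res

rng = np.random.default_rng(0)
s = 3
gt = rng.random((6, 6))                   # hypothetical GT image
lr = decimate(gt, s)                      # hypothetical LR image
output = gt - 0.1                         # hypothetical "Output" with a constant error
lr_res, gt_res = make_residual_pair(lr, output, gt, s)

# Dataset bookkeeping from the text: 68 images x 8 Gaussian kernels, 80/20 split.
num_images, num_kernels = 68, 8
num_pairs = num_images * num_kernels
num_train = round(0.8 * num_pairs)
num_test = num_pairs - num_train
print(num_pairs, num_train, num_test)  # 544 435 109
```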
More details about the network ''Architecture 2'' are as follows. The Adam optimizer with a learning rate of $1 \times 10^{-4}$ is employed for computing the network parameters. The kernel size of each ResNet block is 3 × 3 with 32 filters. The batch size is set to 30, and the total number of iterations is 10000. Besides, all data are normalized into the range [0, 1]. We train the models with Python 3.5.2 and TensorFlow 1.0.1 on an NVIDIA GeForce GTX 1080 GPU with 8 GB of memory.
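For reference, a single Adam update with the learning rate stated above can be sketched in NumPy as follows. Only the learning rate of $1 \times 10^{-4}$ comes from the text; the values of `beta1`, `beta2`, and `eps` are the commonly used defaults and are assumptions here, as is the toy quadratic objective.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy objective (hypothetical): minimize ||theta - target||^2.
target = np.ones(3)
theta = np.zeros(3)
m = np.zeros(3)
v = np.zeros(3)
losses = []
for t in range(1, 101):
    grad = 2.0 * (theta - target)
    losses.append(float(np.sum((theta - target) ** 2)))
    theta, m, v = adam_step(theta, grad, m, v, t)
```

Because Adam normalizes the gradient by its running magnitude, each early step moves every parameter by roughly the learning rate, so a small rate such as $1 \times 10^{-4}$ calls for many iterations.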

IV. RESULTS
In this section, we compare the proposed method, called IDCNN, with six competitive image SR methods: 1) a classical interpolation method, called ''bicubic''; 2) a competitive variational method, called ''RKHS'' [9]; 3) a benchmark method for single image SR, called SRCNN, which is also the first approach to image SR using a CNN [20]; 4) the accelerated SRCNN for single image SR, called FSRCNN [38]; 5) a CNN method with sparse priors, called SCN [22]; 6) a recent state-of-the-art image SR method using a plug-and-play strategy, called PnP [1]. For fair comparisons, we keep all default parameters provided with the source codes.
TABLE 2. Quantitative results for Fig. 6, including the average PSNR and SSIM with the corresponding standard deviation (std). (Bold: the best).
The visual results and the quantitative evaluations are produced with Matlab R2017 on a desktop computer. Furthermore, we employ two popular metrics, i.e., peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index [39], to evaluate the quantitative performance of the compared approaches.
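As a reminder of how the first metric is defined, the following is a minimal sketch of PSNR. The function name `psnr` and the peak value of 1.0 (for data normalized to [0, 1]) are assumptions for this example; the evaluation in this paper is carried out in Matlab, and SSIM is omitted here because it requires a longer windowed computation.

```python
import numpy as np

def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB for images with values in [0, peak]."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

reference = np.zeros((4, 4))                # hypothetical GT image
estimate = np.full((4, 4), 0.1)             # hypothetical SR result, uniform error 0.1
value = psnr(reference, estimate)           # MSE = 0.01, so about 20 dB
```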
For fair comparisons, we trained the networks, i.e., SRCNN, FSRCNN, and SCN, on the DIV2K dataset, which is also the training dataset of the PnP method. Since our IDCNN method is based on the PnP method, it is also trained on the DIV2K dataset. Note that the DIV2K dataset has 800 HD images for training and 100 HD images for validation, which provide abundant image features for training.
In what follows, we report the performance of the compared methods from two aspects: 1) the visual and quantitative results on simulated natural images, to evaluate the effectiveness of the compared methods; 2) the visual results on real ToF images, to validate the practical ability of image SR. We also include some discussions in this section to further demonstrate the effectiveness and validity of the proposed method.

A. SIMULATED DATA
In this section, we first blur the HR noise-free images (i.e., GT images) by different Gaussian kernels, then downsample the blurred images to generate the simulated LR images that are tested in the experiments (according to Eq. (1)). Fig. 4 exhibits the simulated LR images with different blur kernels. In Fig. 4, the first row is with a scale factor of 4, and the second row is with a factor of 3. The GT images for the corresponding simulated LR images are displayed in the last column of Fig. 5 and Fig. 6, respectively. From Fig. 5, it is easy to see that bicubic interpolation shows significant blur, since interpolation usually overlooks the preservation of spatial image details. Similarly, the RKHS method also loses spatial details in the obtained SR images, because the algorithm for solving the RKHS-based model does not account for the Gaussian blur and only uses simple bicubic interpolation as a replacement. SRCNN, FSRCNN, and SCN yield better visual results than bicubic interpolation and the RKHS method, since these CNN-based methods can capture more image features on each layer. However, the three approaches fail to outperform the PnP method, as the PnP method not only uses a CNN-based architecture but also can estimate the blur kernels with existing kernel estimation approaches thanks to the novel formulation, i.e., Eq. (2). The proposed method, which is an improvement of PnP, generates better visual results than the PnP method and enhances the image resolution significantly. Correspondingly, the quantitative metrics in Tab. 1, including PSNR and SSIM, also validate the superiority of the proposed method. From the table, it is clear that our method performs best, which demonstrates the effectiveness of our improvement to PnP.
Our method outperforms SRCNN, FSRCNN, and SCN by a large margin, since it also involves kernel estimation for image SR, while the three methods apply a CNN directly to image SR. Fig. 6 and Tab. 2 respectively present the visual and quantitative results with the scale factor of 3. The conclusions are similar to those for the scale factor of 4 described in the previous paragraph, so we do not repeat them here.

B. REAL ToF DATA
In this section, we choose a particular kind of real data, i.e., ToF images, to validate the effectiveness of the proposed method. A ToF image contains the distance information of the detected object captured by a ToF sensor. The ToF sensor is a class of scanner-less LIDAR, in which the entire scene is captured with each laser or light pulse, as opposed to point-by-point with a laser beam as in scanning LIDAR systems. In this work, we choose ToF images as the real test data because we have already captured ToF images with ToF instruments that we designed and manufactured. However, the captured images often have low resolution and additional outliers, which motivates us to increase image resolution with a new SR method. As there are no reference images for the real ToF data, we do not report quantitative metrics and only present the visual results in this section.
In Fig. 7, we exhibit the visual results of three real ToF images, in which the first example is with the scale factor of 3, and the last two examples are with the scale factor of 4. Note that the LR images captured by our ToF instruments often have low resolution and are corrupted by outliers; thus it is essential to have an efficient SR method that enhances the image resolution and suppresses the outliers simultaneously. From Fig. 7, bicubic interpolation and the RKHS method show blur, while the SRCNN, FSRCNN, and SCN methods show clearer image structures than bicubic and RKHS. However, these five approaches all fail to suppress the outliers (see Fig. 7), since they were built only for image SR and not for noise removal. In contrast, the PnP method and the proposed method not only increase the image resolution significantly but also remove the real outliers effectively. Notably, our method removes outliers better than PnP (please see the close-up in the second example), which indicates the better performance of our method.

C. MORE DISCUSSIONS
1) THE CONVERGENCE OF OUR NEURAL NETWORK ARCHITECTURE
In this work, we propose an independent deep CNN that picks up image details to merge into the final HR image. The proposed network, i.e., ''Architecture 2'', is simple but effective, and is trained on the given training dataset (see Section III-C for more details). It is therefore necessary to investigate the convergence of the proposed network. Fig. 8 shows the convergence curves (measured by the mean square error (MSE)) of our neural network architecture for both the training dataset and the validation dataset. From this figure, it is clear that the network converges, and that neither overfitting nor underfitting occurs in the training phase.

2) RESIDUAL RECOVERY BY THE PROPOSED METHOD
Our approach is based on the previous state-of-the-art method, i.e., PnP [1], and can recover more image details on top of the PnP result. Therefore, it is also necessary to investigate which image details are recovered by the proposed method. Fig. 9 exhibits the absolute residual maps of the three examples in Fig. 7. From Fig. 9, it is easy to see that our method picks up image details that improve the quality of the final HR images, which also verifies the motivation of our method.
FIGURE 9. The absolute residual maps between the PnP and the proposed method, i.e., |Proposed - PnP|. The maps in (a), (b) and (c) are the absolute residual maps of the first, the second and the third example in Fig. 7, respectively.

V. CONCLUSION
In this paper, we proposed an independent deep CNN to recover more image details from an obtained SR image. The work was motivated by the observation that image residuals exist between the LR image and the downsampled HR output yielded by a previously proposed SR network. Extensive experiments on simulated data and real ToF data verified this motivation and showed that the proposed method also has competitive outlier removal ability while increasing image resolution significantly. Moreover, the experimental results support the contributions stated in the introduction.
In the future, we intend to collect more real ToF images with our instruments to construct a benchmark dataset for real ToF image restoration. Based on this dataset, we may design novel and useful deep CNNs for various applications of image restoration.

ACKNOWLEDGMENT
The authors thank the reviewers for their insightful and valuable comments.