Modified Dual Path Network With Transform Domain Data for Image Super-Resolution

Recently, studies on single image super-resolution using Deep Convolutional Neural Networks (DCNNs) have made outstanding progress over conventional signal-processing-based methods. However, existing architectures have grown wider and deeper, resulting in a large computation and memory cost for only a small improvement in performance. To address this issue, we present a Wavelet- and Saak-transform Dual Path Network (WSDPN), which considers not only low-resolution images but also transform-domain information. The proposed network exploits the rich information extracted from the transform domain to reconstruct more accurate high-resolution images. In addition, to reap the benefits of both residual network (ResNet) and densely convolutional network (DenseNet) topologies, we use dual-path blocks as the basic building blocks, which allow feature reuse while retaining the ability to keep extracting new features. Building on extensive research on the attention mechanism, we further introduce spatial and self-attention blocks to refine features based on feature correlations at different layers. The experimental results show that the proposed approach achieves better performance than other state-of-the-art methods in an extensive benchmark evaluation.


I. INTRODUCTION
Single Image Super-Resolution (SISR) aims to leverage one Low-Resolution (LR) image to predict useful information to reconstruct a High-Resolution (HR) image while improving the image quality. SISR has been widely applied in many image processing fields, such as satellite imaging, surveillance, medical imaging, and remote sensing. A given LR input can be obtained by applying the same degradation process to many possible HR images, which is why SISR remains an ill-posed problem despite decades of extensive research.
Interpolation-based methods, including nearest neighbor, bilinear, and bicubic interpolations, weight the adjacent pixels of an LR image to generate the HR image, which is often accompanied by blurring artifacts. Reconstruction-based methods exploit complex prior knowledge to limit the desirable HR space to generate clear details. However, as the upscaling factor increases, a huge number of training samples are required to recover HR images at a visually satisfactory level, which is very time-consuming. On the other hand, learning-based or example-based methods use supervised learning to build a mapping between LR and corresponding HR patches from the sample data, which is learned either by extracting internal similarities from the LR patch itself or from the correspondence between external exemplar pairs. The neighbor embedding [1] method conducts manifold learning on multiple nearest neighbors in the training dataset to reconstruct HR patches. Sparse coding [2], [3] methods consider image patches as a sparse linear combination of elements from a compact dictionary. Nonetheless, due to over-reliance on the well-trained mappings and their associated weak representation capabilities, they are usually inefficient and thus show limited visual quality. Recently, Convolutional Neural Networks (CNNs) have been shown to exhibit superior performance compared to prior models owing to their remarkable learning capabilities. The Super-Resolution Convolutional Neural Network (SRCNN) [4], known as the first CNN model for super-resolution tasks, learns end-to-end mappings from LR to HR images through a fully convolutional network. It outperforms classical non-deep-learning methods.
The associate editor coordinating the review of this manuscript and approving it for publication was Jinjia Zhou.
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, most deep-learning SR methods reconstruct the output of the network from spatial-domain input only. In this work, as an alternative, we investigate the advantages of data from the transform domain. More specifically, we attempt to capture image information in both the spatial and spectral domains to enhance SR quality. In addition, motivated by the promising performance of the residual network (ResNet) [5] and the densely convolutional network (DenseNet) [6] in classification tasks, we propose taking advantage of both networks and combining them with wavelet and Saak transforms [7]. Our network is trained with nine input channels, which comprise four sub-bands of the low-resolution wavelet coefficients, four sub-bands of the low-resolution Saak coefficients, and one LR image. The experiments show that using the transformed signals increases the number of parameters only slightly while improving the quality of the reconstructed image. To further improve performance, we propose the Dual-Path Block (DPB), which inherits the advantages of ResNet and DenseNet, to facilitate feature reuse within the network while obtaining more compact and accurate representations that lead to more realistic visual effects. We also apply self-attention and spatial attention blocks to consider the correlations between features at different levels and to recalibrate the feature maps with context information.
The rest of this paper is organized as follows: The related background is reviewed in Section II. Section III details the proposed network. Model comparisons and experimental results are presented in Section IV. Finally, conclusions are offered in Section V.

II. RELATED WORK

A. DEEP LEARNING BASED IMAGE SUPER-RESOLUTION
The strong feature extraction and data representation abilities of deep learning have led to a surge of research on convolutional neural networks for SISR. As the seminal SR method based on the convolutional structure, Dong et al. [4] proposed the SRCNN, which learns the nonlinear mapping between LR and HR patches. However, its high computational cost still hampers its use in real-time applications since the network takes upsampled images as input. To further improve accuracy, speed and memory efficiency, FSRCNN [8] and ESPCN [9] conduct feature extraction directly on LR images and adopt a post-upscaling scheme, which employs a deconvolution layer or a sub-pixel convolution module at the tail of the network to upsample the spatial size. In doing so, they significantly reduce computation compared to SRCNN, but they still fail to construct a deeper model due to the difficulties of training. Also, the quality of the reconstructed images may be degraded if the network has too few layers. In order to relieve the training burden of deep networks, Kim et al. [10] proposed VDSR, which introduces global residual learning so that the network only needs to learn the residuals between HR and LR patches; the residuals are then added to the original images to recover the SR results.
To avoid the checkerboard artifacts produced by the deconvolution operation, we choose the sub-pixel convolution module, which can be regarded as a common convolution in LR space followed by a periodic shuffling, for the last stage of the network to upscale the spatial size. Furthermore, we apply the residual learning mechanism to help the training of deep networks converge.
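The periodic shuffling step of the sub-pixel convolution can be illustrated with a minimal pure-Python sketch. The function name `pixel_shuffle`, the nested-list layout (channels, height, width), and the channel ordering are assumptions made for illustration; they follow the ordering commonly used by deep learning frameworks and are not taken from the authors' code.

```python
def pixel_shuffle(feat, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Hypothetical helper: the channel ordering is an assumed convention.
    """
    n_in = len(feat)
    if n_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = n_in // (r * r)
    h, w = len(feat[0]), len(feat[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for y in range(h * r):
            for x in range(w * r):
                # The source channel encodes the sub-pixel offset (y % r, x % r).
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = feat[src][y // r][x // r]
    return out


# Four 1x2 LR feature maps become one 2x4 map for an upscaling factor r = 2.
lr_feat = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
hr = pixel_shuffle(lr_feat, 2)
```

Because the convolution itself runs at LR resolution and only this rearrangement changes the spatial size, no overlapping deconvolution kernels are involved.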

B. DEEP LEARNING IN TRANSFORM DOMAIN
A discrete wavelet transform analyzes an image by decomposing it into sub-bands that capture textural and contextual information in both the frequency and location domains. A super-resolution algorithm based on the wavelet transform estimates the missing coefficients, where the LR image is considered to be the low-frequency subband of the HR image. The difficulty lies in predicting the unknown coefficients of the lost high-frequency subbands. DASR [11] combines interpolated LR images with high-frequency subband images acquired by the discrete wavelet transform to achieve high-quality reconstruction in the spatial domain. Nguyen and Milanfar [12] established an interlaced sampling structure in the training data for the purpose of efficiently calculating the wavelet coefficients. In addition to the wavelet-based constraint, Jiji et al. [13] used a smoothness prior to determine the appropriate wavelet interpolation. Sparse coding was integrated to design different interpolation methods in [14]-[16]. Kinebuchi et al. [17] exploited hidden Markov trees to interpolate wavelet coefficients. DWSR [18] combined a CNN and a wavelet transform, which benefits from the sparsity of wavelet residuals, to recover missing details and achieves competitive performance. However, due to the lack of training and the use of simple interpolation approaches, the above methods failed to prove their superiority over recent deep learning based models.
Kuo [19], [20] proposed the RECOS (REctified-COrrelations on a Sphere) transform to explain CNNs with a mathematical model. This is a multi-layer transform whose forward process maps three-dimensional data into one-dimensional rectified spectral vectors. In order to reduce the defects in the inverse process, Kuo et al. proposed the Subspace approximation with augmented kernels (Saak) transform [7], which adopts the Karhunen-Loève (KL) basis as the basic kernel. By using its negative vector to augment the transform kernel and performing the sign-to-position format conversion, which is equivalent to the ReLU activation, the Saak transform can solve the sign confusion problem when multi-level transforms are cascaded. There is no need to train the transform kernels through backpropagation, and it is possible at the same time to minimize the transform losses. The Saak coefficients represent the spectral components in the corresponding spatial area and thus offer a joint spatial-spectral representation. In addition, the Saak transform can apply the Principal Component Analysis (PCA) technique to achieve energy compaction. Small perturbations in the test data do not affect the leading coefficients, which makes the Saak transform a robust process. The Saak transform is a data-driven approach, so it can easily adapt to any task as a feature extraction technique or an unsupervised dimension reduction procedure.

C. DUAL PATH NETWORK
Taking advantage of cutting-edge neural network architectures to design a novel one is the most intuitive and effective way to enhance the model learning ability in a variety of tasks. Recent studies show that scaling up networks [21] has been widely adopted to improve performance of neural networks. However, deep neural networks will encounter degradation problems in which the network accuracy begins to saturate. To solve this issue, recent works have focused on helping the flow of information and gradients in the network to avoid problems such as vanishing gradients and the curse of dimensionality by modifying the network structure.
He et al. [5] proposed a residual network that introduced identity mapping and shortcut connections to ease optimization issues. In addition, due to the sparsity of input and output signals, the networks are more robust and can be trained more easily, which makes it possible to construct deep neural networks with hundreds of layers. Recently, densely convolutional networks [6] were proposed, where skip connections concatenate the input of the convolutional layers to their output to facilitate training by strengthening feature propagation and encouraging feature reuse. Nevertheless, features are fused by concatenation rather than addition, which causes the width of the densely connected path to increase linearly as the depth rises. This hampers building a deeper and wider network due to the large number of parameters and the large GPU memory cost. Chen et al. [22] proposed the Dual Path Network (DPN), which is a compound network design intended to follow the core idea of both residual and densely connected networks. The DPN took the residual network as its backbone and attached a thin, densely connected path to construct the dual path network. However, in order to avoid high redundancy and relieve the computational burden, DPN used grouped convolution to reduce the number of parameters caused by the densely connected paths, which resulted in additional hyperparameters. Also, in the DPN, shortcut connections in both the residual path and the densely connected path are only applied between different blocks, making it difficult to share information and improve gradient flow across layers. In addition, in the two path topologies, skip connections do not take the importance of different features into account; the features are simply fused through addition and concatenation.
Different from the DPN, we use the gating mechanism to limit the growth of the number of the feature maps and simultaneously merge the information of the two paths where the importance of the features in both paths will be adaptively considered.

III. PROPOSED NETWORK
In this section, we introduce the proposed model in detail, including the transformed inputs, the design of the dual-path block, and then the overall network architecture.

A. TRANSFORM DOMAIN DATA
Image super-resolution technology can be divided into two categories: frequency domain methods and spatial domain methods. In this work, we propose the use of transform domain signals to enhance SR quality. Here, the data are first transformed to the frequency domain and they are then combined together to capture both spatial and spectral information. After processing by CNNs, the signals are inverse transformed into the spatial domain to reconstruct a super-resolved image.
Over the past few years, the common Fourier transform has gradually been replaced by the wavelet transform in image and signal processing. In Fourier theory, intricate but periodic signals are represented as a theoretically infinite sum of sine and cosine waves. Although the Fourier transform can decompose the analyzed signal into its frequency information, it does not provide any time or location details that may benefit SR applications. To address this issue, the wavelet transform applies different versions of a basis function to analyze a signal in the time domain, which offers both frequency and location information [23]. Besides, wavelets allow rapid and efficient transform algorithms, which need to be considered when the training of deep networks becomes a burden. Similarly, the Saak coefficients collect the spectral components in the corresponding spatial area and thus can provide a joint spatial-spectral representation. For these reasons, we chose the wavelet and Saak algorithms [7] to transform the input image and offer additional information.

1) WAVELET-TRANSFORM INPUT
In order to obtain the wavelet sub-bands, the LR training images X are first upsampled using bicubic interpolation. We then generate four LR wavelet sub-bands by applying a Haar wavelet transform to the bicubic-interpolated images $X_{bic}$, which can be denoted as
$$\{LL, LH, HL, HH\} = \mathrm{2dDWT}\{X_{bic}\},$$
where LL, LH, HL, and HH are the four sub-bands of the bicubic-interpolated image and 2dDWT{·} denotes the 2D discrete wavelet transform.
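As a rough illustration of the decomposition, the following sketch computes a one-level 2D Haar transform on a grayscale image stored as a nested list. The function name `haar_dwt2`, the orthonormal scaling (division by 2), and the naming of the LH and HL sub-bands are assumptions; sub-band naming conventions differ between libraries, and the bicubic upsampling step is not shown.

```python
def haar_dwt2(img):
    """One-level 2D Haar transform of an even-sized 2D list.

    Returns (LL, LH, HL, HH), each half the height and width of img.
    Assumed convention: the first letter is the horizontal filter and the
    second the vertical filter (L = average, H = difference).
    """
    h, w = len(img), len(img[0])
    if h % 2 or w % 2:
        raise ValueError("image height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for y in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for x in range(0, w, 2):
            a, b = img[y][x], img[y][x + 1]
            c, d = img[y + 1][x], img[y + 1][x + 1]
            row_ll.append((a + b + c + d) / 2.0)  # local average
            row_lh.append((a + b - c - d) / 2.0)  # vertical difference
            row_hl.append((a - b + c - d) / 2.0)  # horizontal difference
            row_hh.append((a - b - c + d) / 2.0)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


LL, LH, HL, HH = haar_dwt2([[1, 2], [3, 4]])
```

With this scaling the transform is orthonormal, so the energy of each 2 × 2 block is preserved across the four coefficients.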

2) SAAK-TRANSFORM INPUT
As in the above procedure, the LR training images X are first upscaled using bicubic interpolation. We then reshape the enlarged LR images $X_{bic}$ into a one-dimensional vector f by scanning the grid points in a fixed order, after which we calculate the correlation matrix of f and take its eigenvectors as the Karhunen-Loève (KL) basis $b_k$, for $k = 1, \cdots, K$. The anchor vectors can be denoted as $A = \{a_0, a_1, \cdots, a_k, \cdots, a_K\}$, where $K = L_k - 1$ and $L_k$ denotes the number of spectral dimensions. We separate the anchor vectors into two types: the DC anchor vector $a_0$ and the AC anchor vectors, which are the remaining anchor vectors $a_1, \cdots, a_K$. A basic way to obtain the AC anchor vectors is to train a convolutional neural network by backpropagation. Instead, the Saak transform takes the KLT kernel vectors as the AC anchor vectors and first augments the k-th KLT kernel vector $b_k$ with its negative vector $-b_k$. Then, it projects the input vector f onto the set of augmented kernels to obtain $p_k = a_k^T f$. Finally, the projection p is reshaped back into a 2D Saak feature map.
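The projection onto augmented kernels can be sketched as follows. This is only an illustration under stated assumptions: the two kernel vectors below are hand-picked orthonormal placeholders rather than a KL basis estimated from data, the DC component is not handled, and the function name `saak_project` is hypothetical. The ReLU after each projection reflects the sign-to-position conversion described for the Saak transform [7].

```python
def saak_project(f, kernels):
    """Project f onto the augmented kernels {b_k, -b_k}, then apply ReLU.

    Each kernel b_k yields two non-negative outputs, so the sign of the
    projection p_k = b_k^T f is encoded by which of the two is non-zero.
    """
    out = []
    for b in kernels:
        p = sum(bi * fi for bi, fi in zip(b, f))
        out.append(max(p, 0.0))   # response of the kernel  b_k
        out.append(max(-p, 0.0))  # response of the kernel -b_k
    return out


# Placeholder kernels (assumed for illustration, not a learned KL basis).
kernels = [[0.5, 0.5, -0.5, -0.5], [0.5, -0.5, 0.5, -0.5]]
coeffs = saak_project([1.0, 2.0, 3.0, 4.0], kernels)
```

The signed projection can be recovered as the difference of each output pair, which is why no information is lost by the rectification.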

B. DUAL PATH BLOCK
As shown in Fig. 1, our dual-path block has two paths: a residual path and a densely connected path. We modify the basic structure of ResNet [5] by passing the information from the preceding layer to each layer in the residual path. To let the network adaptively consider the importance of features at different levels while avoiding the instability that may occur when training deep networks, we set learnable weights to adjust the path of each skip connection. Let $B_{i-1}$ and $B_i$ be the input and output of the i-th DPB, respectively. The residual path output $R_{o,i}$, which has two convolution layers, is computed from the intermediate feature $F^{res}_{i,1} = \sigma(W^{res}_{i,1} B_{i-1})$, where $W^{res}_{i,c}$ denotes the weights of the c-th convolution layer and $\sigma$ denotes the ReLU activation function. Note that the bias term is omitted for simplicity. Similar to the residual path, the densely connected path connects all layers within the block in a feed-forward design. Dense connections improve the flow of information and gradients throughout the network. Besides, concatenating the feature maps obtained by other layers provides more information to the input of the following layers and improves performance. The densely connected path output $D_{o,i}$, which consists of two convolution layers, is formulated with the weights of the c-th convolution layer, the ReLU activation function $\sigma$, and the concatenation operation $[\cdots, \cdots]$. The Local Fusion Module (LFM) is then applied to adaptively fuse the features from both paths. Since we want to fuse the information extracted from both paths, and the feature maps of the densely connected path are directly preserved in a concatenative manner, we first concatenate the feature maps of the two paths and then adopt a 1 × 1 convolutional layer to fuse the information and to adaptively control the width increment of the dual-path block as well as the memory cost.
We can formulate the operation of the Local Fusion Module (LFM) as $\mathrm{LFM}_i([R_{o,i}, D_{o,i}])$, where $\mathrm{LFM}_i(\cdot)$ denotes the 1 × 1 convolutional layer applied to the concatenated outputs of the two paths. To further enhance the information flow, improve the network representation ability and increase performance, local residual learning is applied. Thus, the final output can be reached by
$$B_i = B_{i-1} + \mathrm{LFM}_i([R_{o,i}, D_{o,i}]).$$
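The fusion step described above (concatenate the two path outputs, apply a 1 × 1 convolution, then add the block input as a local residual) can be sketched at a single spatial position, where a 1 × 1 convolution reduces to a matrix-vector product over channels. The function names, the toy channel counts, and the weights are all hypothetical; reading "local residual learning" as adding the block input is our interpretation of the text.

```python
def conv1x1(x, weight):
    """1x1 convolution at one pixel: weight has shape (C_out, C_in)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def local_fusion(b_prev, r_out, d_out, weight):
    """Fuse residual-path and dense-path features, then add the block input."""
    fused = conv1x1(r_out + d_out, weight)         # channel concat, then 1x1 conv
    return [p + q for p, q in zip(b_prev, fused)]  # local residual connection


# Toy example: 2 residual-path channels and 3 dense-path channels are
# fused back to the 2 channels of the block input.
b_prev = [1.0, 2.0]
r_out = [0.5, -0.5]
d_out = [1.0, 0.0, 2.0]
weight = [[1.0, 0.0, 0.0, 0.0, 0.0],
          [0.0, 0.0, 0.0, 0.0, 1.0]]
b_next = local_fusion(b_prev, r_out, d_out, weight)
```

Whatever the width of the concatenated features, the output width is fixed by the number of rows in `weight`, which is how the 1 × 1 layer keeps the block width from growing with depth.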

C. FEATURE FUSION MODULE (FFM)
We constructed the Feature Fusion Module (FFM) in our feature mapping sub-network to make full use of the information obtained from each DPB and preserve persistent memory. As shown in Fig. 2, the FFM is composed of a dual path blockchain and a Mid-range Fusion Module (MFM). A series of continuous DPBs are stacked into a chain structure to form the dual-path blockchain for the purpose of performing further feature extraction at multiple levels. The Mid-range Fusion Module (MFM) is attached at the end of each FFM to merge the information from the preceding FFM and from the current blockchain to keep information. Similar to the Local Fusion Module (LFM), MFM first concatenates the features obtained by the previous FFM and by the current blockchain and then passes through a convolutional layer that serves as a gating mechanism to screen out the output information.
Inspired by [24] and [25], we also adopted the spatial attention block attached at each DPB to learn the correlations between hierarchical features as shown in Fig. 3. The spatial attention first divides the input data into three parts by three 1 × 1 convolution layers and then performs the dot-product operation in pairs to compute the similarity between two feature maps at different levels. Note that there is a shortcut connection between input and output. Therefore, the attention model only needs to learn the residual mapping to fine-tune the feature maps. Since the spatial attention takes the outputs of both the previous and the current DPB as input, it can comprehensively take the contextual information into consideration.
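A dot-product attention block with a shortcut connection can be sketched as below. Several details are assumptions rather than the authors' design: queries are taken from the current features and keys and values from the context features, similarities are normalized with a softmax, and a scalar `gamma` scales the attended features before the shortcut addition. Features are stored as a list of per-position channel vectors.

```python
import math


def project(feats, weight):
    """1x1 convolution: apply one (C_out, C_in) matrix at every position."""
    return [[sum(w * v for w, v in zip(row, f)) for row in weight] for f in feats]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x_cur, x_ctx, wq, wk, wv, gamma=1.0):
    """Dot-product attention with a residual shortcut (assumed formulation).

    Passing x_ctx = x_cur gives a self-attention variant.
    """
    q = project(x_cur, wq)
    k = project(x_ctx, wk)
    v = project(x_ctx, wv)
    out = []
    for qi, xi in zip(q, x_cur):
        # Similarity of this position to every context position.
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) for kj in k])
        attended = [sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))]
        # Shortcut: the block only learns a correction to its input.
        out.append([x + gamma * a for x, a in zip(xi, attended)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
y = attention(x, x, identity, identity, identity)
```

With `gamma = 0` the block reduces to the identity, which illustrates why the shortcut lets the attention branch act as a fine-tuning term.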
In summary, in the dual path blockchain, the stacked DPBs expand the receptive field of the network to extract deep feature representations, and the use of spatial attention considers hierarchical features to obtain more precise information. In MFM, multiple skip connections facilitate feature reuse and improve information flow across blocks.

D. NETWORK ARCHITECTURE
The proposed network for SISR, which is demonstrated in Fig. 2, contains a feature extraction sub-network (FENet), a feature mapping sub-network (FMNet), and an upscaling sub-network (UpNet). The FENet extracts the feature maps from the transform domain data. The FMNet is then applied to learn finer features by multiple stacked feature fusion modules. The learned features are used to generate the final SR result in the UpNet. Specifically, in FENet, we adopt a convolutional layer to extract the initial features from the concatenation of LR input images and the transformed inputs. The self-attention block proposed in [25] is located at the end of FENet to recalibrate the features. Note that the self-attention takes only the current input for computation, as shown in Fig. 3. For the FMNet, we stack multiple FFMs, which are designed to refine the features yielded from the feature extraction sub-network. The Global Fusion Module (GFM) is utilized to integrate the global features and avoid long-term information loss by fusing hierarchical features from all the FFMs. After integrating the highly informative features, we adopt another self-attention block to further adjust features for subsequent global residual learning. Finally, we utilize the sub-pixel convolution layer [9] to upsample the spatial resolution for the purpose of reconstructing HR images in UpNet.

E. LOSS FUNCTION
Given N training sample pairs $\{X_n, Y_n\}_{n=1}^{N}$ from the dataset, the proposed network is optimized to minimize the $L_1$ loss
$$\mathcal{L}(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \left\| D(X_n) - Y_n \right\|_1,$$
where $\Theta$ and $D(\cdot)$ denote the parameter set and the output, respectively, of the network.

IV. EXPERIMENTS AND DISCUSSION
In this section, we first present the training data setting and provide implementation details, including the model hyperparameters. Then, we analyze the influence of different composing units in the proposed model using ablation studies. Finally, comparisons with other state-of-the-art methods on several publicly available benchmark datasets are made to prove the superiority of our proposed network.

A. DATA AND SIMILARITY MEASURES
For the evaluation, the proposed method was tested on four standard benchmark datasets: Set5 [26], Set14 [27], BSD100 [28], and Urban100 [29]. Set5, Set14, and BSD100 consist of human images and natural scenes, while Urban100 contains urban views. For training, we chose the DIV2K dataset [30], which consists of 800 high-quality (2K resolution) images for image restoration tasks. We performed random horizontal flipping and 90 degree rotation to augment the training data. We used the peak signal-to-noise ratio (PSNR) and the structural similarity (SSIM) [31] index as the measurement metrics. For a fair comparison, we only take the luminance channel in YCbCr space into consideration when calculating the PSNR (dB) and SSIM index. All images were center-cropped, and a 4-pixel wide stripe was removed from each border, which is a common practice in SISR.
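The evaluation protocol (luminance only, with a 4-pixel border removed) can be sketched in plain Python. The luminance formula is the ITU-R BT.601 conversion for 8-bit RGB values, which we assume matches the YCbCr conversion used in the experiments; the function names are hypothetical and SSIM is not shown.

```python
import math


def rgb_to_y(r, g, b):
    """Luminance of an 8-bit RGB pixel (BT.601, output range 16 to 235)."""
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0


def psnr_y(sr, hr, shave=4):
    """PSNR in dB between two luminance images given as 2D lists in [0, 255].

    `shave` pixels are ignored along each border before computing the error.
    """
    h, w = len(hr), len(hr[0])
    sq_err, count = 0.0, 0
    for y in range(shave, h - shave):
        for x in range(shave, w - shave):
            diff = sr[y][x] - hr[y][x]
            sq_err += diff * diff
            count += 1
    if count == 0:
        raise ValueError("image is too small for the requested border")
    mse = sq_err / count
    return float("inf") if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)


# A uniform error of one gray level gives 20 * log10(255) dB.
hr_img = [[0.0] * 10 for _ in range(10)]
sr_img = [[1.0] * 10 for _ in range(10)]
score = psnr_y(sr_img, hr_img)
```

Errors that fall inside the removed border do not affect the score, which keeps boundary effects of the convolutions out of the comparison.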

B. IMPLEMENTATION DETAILS
Based on existing studies [32], [33], we used the MATLAB [34] bicubic kernel to downsample HR images into LR images. For training, we randomly cropped 32 × 32 LR patches from the LR images as the inputs and set the mini-batch size to 16. For optimization, we used the Adam optimizer [35] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate was initialized to $10^{-4}$ and was decayed by a factor of 2 every $2 \times 10^5$ iterations. For model training and testing, we used PyTorch [36] on an NVIDIA GTX 1080Ti GPU, and it took about four days to train our proposed WSDPN model. In the proposed network, we set 64 filters for all convolutional layers, and the kernel sizes were all 3 × 3 with the exception of the 1 × 1 convolutional layers. In order to improve the learning capabilities of the network and control the computational costs, we adopted 24 DPBs in the FMNet. Meanwhile, we applied zero-padding at the boundaries of each feature map to keep the spatial size after the convolutional operation.
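The step decay described above can be written as a small helper. The function name and argument names are illustrative; the default values are the ones stated in the text.

```python
def learning_rate(iteration, base_lr=1e-4, step=200_000, factor=2.0):
    """Learning rate after `iteration` updates: divided by `factor` every `step` iterations."""
    return base_lr / (factor ** (iteration // step))


# Learning rate at the start of training and after each of the first three decays.
schedule = [learning_rate(i) for i in (0, 200_000, 400_000, 600_000)]
```

In a PyTorch training loop, the same schedule is usually obtained with a step scheduler using a multiplicative factor of 0.5.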

C. MODEL DISCUSSION
In this subsection, we discuss the influence of the different components making up our model through ablation experiments.
Considering different combinations of the types of input data, we examined several settings for the proposed network in Table 1. For quick validation, we used the original residual block [37] as the building block and removed the feature fusion module. Among the different combinations of input data types, residual blocks with both wavelet and Saak coefficients as input outperformed those with only LR images by a PSNR gain of 0.05 dB. Besides, using multiple inputs only increased the number of parameters by 0.3%. This demonstrates that leveraging transform domain input, which yields more information, indeed benefits super-resolution. We also explored the effects of different path topologies on the dual-path block. Fig. 4 shows the three path topologies used for the comparison: (a) the original residual path in SRResNet [37] and EDSR [38], which removes all the batch normalization layers in ResNet [5] to reduce memory consumption, (b) the densely convolutional path, which concatenates the feature maps of each layer to every other layer within the block, and (c) the modified residual path, which adds the information of the preceding layer to each layer of the residual path. Table 2 provides all possible combinations of the DPB topology. For quick validation, we removed the attention modules and used only 16 DPBs as the baseline model. Applying the densely convolutional path and the modified residual path in the DPBs led to the best performance, because this combination can preserve the shallow features while continuing to discover finer ones. If all three path topologies are used at the same time, the extracted features interfere with each other, and the PSNR performance degrades slightly. To show the trade-off between performance and model size for the proposed network and existing SR networks, we made the comparison shown in Fig. 5.
The results are evaluated on the Set14 dataset for a scaling factor of 4×. It can be observed that our model outperforms most state-of-the-art methods. It should be noted that our model shows higher PSNR values with fewer parameters than EDSR and RDN. This indicates that our network has a better trade-off between performance and the number of parameters. Fig. 6 shows the trade-off between the reconstruction accuracy and the execution time. In terms of running time, WSDPN runs faster than the other SR methods and also obtains better PSNR results. Our model thus strikes a good balance between the reconstruction accuracy and the running time.
To study the relationship between the number of DPBs in FMNet and the reconstruction performance of the proposed model, we provide different numbers of DPBs and the corresponding PSNR results in Fig. 7. To save training time and perform quick validation, we removed the attention mechanism. Although the depth increases as we use more DPBs, causing the number of parameters to grow linearly, the increase in PSNR tends to saturate. Thus, we chose 24 DPBs for our final model in subsequent comparisons. We conclude that designing a careful architecture is more helpful for reconstruction accuracy than blindly increasing the depth of the network.

D. COMPARISON WITH STATE-OF-THE-ART METHODS
To show the effectiveness of our model, the average PSNR and SSIM results of several state-of-the-art SISR methods, including SRCNN [4], FSRCNN [8], VDSR [10], LapSRN [33], DRCN [39], DRRN [40], SRResNet [37], D-DBPN [41], EDSR [38], and RDN [42], are reported in the form of a quantitative evaluation. In this work, the geometric self-ensemble [38] technique is also used to obtain higher performance. Specifically, the test images are flipped and rotated to generate seven augmented images from the original. We input these images into the network and perform the corresponding inverse transform on the output high-resolution images. All of these images are then added together and averaged to obtain the final high-resolution output. This self-ensemble strategy has an edge over other ensemble methods in that it can be easily applied to various models without further training. Although the self-ensemble method does not require additional parameters, it increases the PSNR by approximately 0.1 dB. The model with the self-ensemble method is denoted by adding a "+" postfix to the model name. The quantitative evaluations for scales ×2, ×3 and ×4 on the benchmark datasets are listed in Table 3. It can be seen that, compared to other methods, our model yields the best performance. In addition, the visual results of various methods for a scale factor of ×4 are presented in Figs. 8-12. It can be observed that the results of prior methods often include distortions and artifacts, for example in the stripes or fur of animals, the contours of words, and the lines of buildings. By contrast, our method prevents such distortions, avoids the artifacts, and produces more realistic results. The proposed model recovers the HR images with fine textures and thus demonstrates its superiority.
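The self-ensemble procedure described above can be sketched for single-channel images stored as nested lists. The helper names are hypothetical, and `toy_model` is a stand-in nearest-neighbor ×2 upsampler used only to make the example runnable; it is not the proposed network.

```python
def hflip(img):
    """Mirror an image left to right."""
    return [row[::-1] for row in img]


def rot90(img):
    """Rotate an image by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*img)][::-1]


def self_ensemble(model, img):
    """Average the model outputs over 8 flip/rotation variants of img."""
    outputs = []
    for flip in (False, True):
        for k in range(4):
            t = hflip(img) if flip else [list(row) for row in img]
            for _ in range(k):
                t = rot90(t)
            out = model(t)
            # Undo the rotation first, then the flip (inverse order).
            for _ in range((4 - k) % 4):
                out = rot90(out)
            if flip:
                out = hflip(out)
            outputs.append(out)
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[y][x] for o in outputs) / len(outputs) for x in range(w)]
            for y in range(h)]


def toy_model(img):
    """Placeholder super-resolution model: nearest-neighbor x2 upsampling."""
    return [[v for v in row for _ in range(2)] for row in img for _ in range(2)]


result = self_ensemble(toy_model, [[1, 2, 3], [4, 5, 6]])
```

For a model that is already equivariant to flips and rotations, such as this placeholder, the ensemble output equals the plain output; a trained network is generally not equivariant, which is where the averaging can help.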

V. CONCLUSION
In this paper, we propose an image super-resolution network that benefits from the Saak and wavelet transforms. The proposed multiple transform domain inputs extract rich information from the original LR image and thus enable our network to learn a finer mapping between LR and HR pairs. Building on the robustness and efficiency of the residual network and the densely convolutional network, we use dual-path blocks as the basic architecture with which to construct our network. To further improve performance, we connect each layer within the dual-path block to increase information and gradient flow. In addition, we adopt self-attention and spatial attention mechanisms, which aim to progressively recalibrate the learned feature maps, to improve the representational ability of the network. Compared with most state-of-the-art methods, WSDPN achieves competitive or even better results while keeping the number of parameters economical.