Multi-wavelet residual dense convolutional neural network for image denoising

Networks with a large receptive field (RF) have shown advanced fitting ability in recent years. In this work, we utilize the short-term residual learning method to improve the performance and robustness of networks for image denoising tasks. We choose the multi-level wavelet convolutional neural network (MWCNN), one of the state-of-the-art networks with a large RF, as the backbone, and insert residual dense blocks (RDBs) into each of its layers. We call this scheme the multi-wavelet residual dense convolutional neural network (MWRDCNN). Compared with other RDB-based networks, it can extract more features of the object from adjacent layers, preserve the large RF, and boost the computing efficiency. Meanwhile, this approach also offers the possibility of absorbing the advantages of multiple architectures in a single network without conflicts. The performance of the proposed method is demonstrated in extensive experiments and compared with existing techniques.


INTRODUCTION
Image denoising is one of the well-known ill-posed problems. Noisy images typically result from external noise such as electromagnetic wave interference [1] or internal noise from the detectors themselves [2]. To meet the needs of extensive applications, many image denoising algorithms have been proposed over the past decades [3][4][5][6][7]. Recently, benefiting from the increasing computing power brought by advanced graphics processing units (GPUs), various neural networks, especially convolutional neural networks (CNNs), have been proposed to solve image denoising tasks. It is worth mentioning that, since the degraded image can be treated as a non-linear map of the clean image [8], once a CNN is trained, the denoising procedure can be accomplished with a single forward propagation, which significantly reduces the computing time. Moreover, owing to their non-linear activation functions, CNNs are well suited to image denoising tasks.
During the past decade, CNNs have shown more effective performance than traditional methods. Although there is a variety of networks, most of them can be divided into three categories in terms of their structural features: simple, multi-residual and U-net structures. Networks with a simple structure have no skip connections, which means that the feature maps propagate layer by layer. Due to their low computing load and convenient design process, CNNs with a simple structure were widely used in early attempts [9][10][11]. However, the connections between the convolution layers are neglected in these networks, and dead neurons can be incurred by the gradient vanishing problem in very deep networks [12]. In 2015, He et al. realized residual learning by inserting shortcut connections, which perform element-wise addition between inputs and outputs [12]. This multi-residual structure helps maintain suitable gradients in backpropagation and makes it possible to obtain better performance in deep networks. Since then, several networks with more complicated connections have been presented to take full advantage of the relationships between layers [13][14][15]. The U-net structure was first proposed by Ronneberger et al. to apply CNNs to biomedical segmentation [16]. U-net gets its name from its characteristic architecture, whose backbone is composed of pooling and up-convolution layers. Due to the use of pooling layers, U-net significantly reduces the calculation cost. Besides, U-net contains element-wise addition or concatenation connections to establish links between the layers at the same level. Recently, Liu et al. [8] proposed to adopt the discrete wavelet transform (DWT) and inverse discrete wavelet transform (iDWT) in U-net to eliminate the information loss caused by the pooling operation.
Their multi-level wavelet convolutional neural network (MWCNN) [8] further enhanced the precision of the deep learning technique in image denoising tasks.
From a certain perspective, multi-residual and U-net structures are optimized variants of the simple structure. Here, we chose six representative networks as references, including simple structures: learning deep CNN denoiser prior for image restoration (IRCNN) [17] and the denoising convolutional neural network (DnCNN) [18]; multi-residual structures: the residual encoder-decoder network with 30 layers (RED30) [15], the residual dense network (RDN) with 32 feature-map channels before entering 6 RDBs, each of which contains 10 convolutional layers (G32C6D10) [19], and the memory network (MemNet) [20]; and a U-net structure: the MWCNN [8]. As shown in Fig. 1, both multi-residual and U-net structures lead to a higher peak signal-to-noise ratio (PSNR) than the simple structure. Nevertheless, both have disadvantages: the former takes enormous computation time, while the latter has fewer internal connections, so its residual learning is less efficient. Therefore, to exploit the complementary advantages of these two kinds of networks, we propose a more robust CNN by combining the MWCNN structure and residual dense blocks (RDBs), i.e., it can be categorized as a U-net & multi-residual structure. We refer to this novel network as the multi-wavelet residual dense convolutional neural network (MWRDCNN). We noticed that the networks reported in different papers were generally trained with different training sets and different parameters, which definitely affects the final results [21]. Thus, to make a fair comparison, we trained these representative networks under the same conditions as our network.
The main contributions of this work include:
• adopting RDBs in the MWCNN architecture to obtain a better tradeoff between computation speed and denoising performance;
• validating the effectiveness of RDBs in image denoising tasks;
• demonstrating the performance of our MWRDCNN compared with six state-of-the-art networks under the same training conditions.

Traditional Denoising Algorithms and Early CNN Structures for Image Denoising
It is worth noting that before CNNs were applied to image denoising tasks, the conventional methods were already relatively mature [22][23][24][25][26]. In 2007, Dabov et al. proposed the block-matching and 3D filtering (BM3D) method [7], which was once thought to be the most efficient traditional denoising algorithm. So in the early years when CNNs were adopted to solve the ill-posed problems of image denoising, the BM3D method became the reference standard for evaluating the performance of newly proposed methods. Jain and Seung proposed a four-layer convolutional network [27], which was shown to have superior performance compared with the conventional wavelet and Markov random field (MRF) methods [28]. Xie et al. presented a deep network scheme utilizing stacked sparse denoising autoencoders [29], which achieved a comparable performance to the K-singular value decomposition (K-SVD) algorithm [30]. Although those CNN-based methods outperformed most conventional algorithms, they failed to achieve better results than the BM3D method. The monopoly of BM3D in single image super-resolution (SISR) was broken in 2014 by a three-layer fully convolutional network (FCN) proposed by Dong et al. [31]. Then, Zhang et al. enlarged the depth of the network to 17 layers [18], significantly boosting its performance; meanwhile, it was the first time that such a deep CNN was used for image denoising. Since then, many kinds of convolutional networks with very deep architectures have been developed [32][33][34].

Residual Learning
A typical method of conducting residual learning is to adopt an addition operation at the end of the network, which can alleviate the gradient vanishing problem to a certain extent. Nevertheless, as the depth of the network grows, it becomes harder for very deep networks to keep a long-term memory. Some intuitive solutions establish more complicated connections among the convolutional layers. For instance, Mao et al. [15] established multiple connections between encoders and decoders by element-wise addition; Tai et al. [20] adopted densely connected memory blocks to account for both short-term and long-term memory; Zhang et al. further proposed the residual dense block to make full use of all convolutional layers [19].

U-net Architecture and DWT in CNNs
To avoid overfitting and reduce the amount of calculation, pooling layers are commonly employed in CNNs. This means that the sizes of the feature maps shrink along the forward propagation. Thus, early CNNs were often used to solve classification problems by outputting a single label. However, image processing objectives such as segmentation and denoising are not confined to classification, and they require the outputs to be images of almost the same size as the inputs. In 2015, Ronneberger et al. adopted up-convolutional layers, which enlarge the sizes of feature maps by convolution, and thereby expanded the application of CNNs to biomedical segmentation [16].
Owing to its characteristic U-shaped architecture, this network was named U-net. In U-net, the calculations are accelerated, and the pooling layer is a convenient tool for enlarging the receptive field (RF). Therefore, CNNs with a U-net structure seem more suitable for image denoising tasks than traditional networks. However, the pooling operation also causes inevitable information loss. Inspired by conventional wavelet algorithms, Bae et al. put forward the wavelet residual network (WavResNet) by adopting DWT and iDWT layers instead of pooling layers [35]. Later, Liu et al. optimized this design on the basis of U-net and proposed the MWCNN [8], which improved the image denoising performance. The aforementioned works have succeeded in comprehensive learning, calculation simplification and RF enlargement. To absorb the advantages of those networks, we propose the multi-wavelet residual dense convolutional neural network, whose architecture is detailed in the next section.

PROPOSED METHOD
In this section, we first briefly review the procedures of the DWT and iDWT, which form the foundation of our proposed method. Then, we formally present our MWRDCNN based on the DWT, iDWT and RDBs, and introduce the details of the RDBs. After that, the implementation details and the differences from previous works are presented.

DWT and iDWT
Before introducing the DWT and iDWT, we first draw the schematic diagram of the convolution and inverse convolution processes in CNNs, as shown in Fig. 2. Taking the four pixels in the top left corner of the input image X as an example, when they are convolved with a 2 × 2 filter F, we obtain the first pixel value of the hidden layer O via

o_{11} = x_{11}f_{11} + x_{12}f_{12} + x_{21}f_{21} + x_{22}f_{22}.

By performing the same operation on the remaining pixels with a stride of 2, we obtain the whole hidden layer O. In the direction of forward propagation, the pixel values of the unknown output image Y are then obtained by the inverse convolution between O and a 2 × 2 inverse filter D:

Y = Iconv(O, D),

where Iconv represents the inverse convolution operation. Actually, from the hidden layer's point of view, o_{11} can also be symmetrically viewed as the convolution result of the four pixels in the top left corner of the output image Y and the dual filter D′ of D, as illustrated in Fig. 2.

Now, let us review the concepts of the DWT and iDWT, which are common tools in conventional image processing. A given image X can be decomposed into four sub-images by the DWT, i.e., a low-pass image X_A (average) and three high-pass images X_H (horizontal), X_V (vertical) and X_D (diagonal). From a certain perspective, the process of the DWT can be seen as the convolution between X and the following four 2 × 2 (Haar) filters with a stride of 2:

f_A = (1/2) [[1, 1], [1, 1]],  f_H = (1/2) [[−1, −1], [1, 1]],
f_V = (1/2) [[−1, 1], [−1, 1]],  f_D = (1/2) [[1, −1], [−1, 1]].

Thus, through the DWT process, we obtain the four decomposed sub-images via

X_i = Conv(X, f_i),  i ∈ {A, H, V, D},

where Conv refers to the convolution operation, X can be treated as the feature maps, and f_i denote the filters. As a result, the pixel sizes of the four sub-images are half of the input image size. To some extent, the DWT has a similar downsampling effect to the pooling operation in U-net [16]. Moreover, due to the orthogonality of the four filters, there is no information loss during the downsampling process, which means that the target image can be completely reconstructed by the iDWT. Previously, Liu et al. [8] adopted four equations to describe the iDWT process.
Here, we find that the iDWT can also be expressed as an inverse convolution operation:

X̃ = Sum_{i ∈ {A, H, V, D}} Iconv(X_i, f_i),

where Sum denotes the element-wise addition of the four inverse convolution results. Due to this information-lossless property, the DWT and iDWT have gradually been utilized in CNNs [36]. Among them, the MWCNN is one of the most representative networks [8]. Thus, we adopt the MWCNN as the main framework of our work. The channel number of the feature maps is set to c. Assuming the sizes of the feature maps before the DWT are all h × w, the sizes of the decomposed images are all h/2 × w/2. Several pairs of DWT and iDWT operations are sequentially used to establish our hierarchical structure, which will be introduced later.
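The decomposition and its lossless inverse can be sketched in a few lines of NumPy. This is an illustrative implementation with our own function names (`dwt2`, `idwt2`), using the orthonormal 2 × 2 Haar filters given above; it is not the paper's actual (GPU) code.

```python
import numpy as np

# The four 2x2 Haar filters: average (A), horizontal (H), vertical (V), diagonal (D).
FILTERS = {
    "A": 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]]),
    "H": 0.5 * np.array([[-1.0, -1.0], [1.0, 1.0]]),
    "V": 0.5 * np.array([[-1.0, 1.0], [-1.0, 1.0]]),
    "D": 0.5 * np.array([[1.0, -1.0], [-1.0, 1.0]]),
}

def dwt2(x):
    """Decompose an even-sized image into four half-size subbands by
    convolving with each 2x2 filter at a stride of 2."""
    h, w = x.shape
    subbands = {}
    for name, f in FILTERS.items():
        out = np.zeros((h // 2, w // 2))
        for i in range(0, h, 2):
            for j in range(0, w, 2):
                out[i // 2, j // 2] = np.sum(x[i:i + 2, j:j + 2] * f)
        subbands[name] = out
    return subbands

def idwt2(subbands):
    """Reconstruct the image: scatter each subband pixel back through its
    filter and sum the four contributions (inverse convolution)."""
    h2, w2 = subbands["A"].shape
    x = np.zeros((2 * h2, 2 * w2))
    for name, f in FILTERS.items():
        s = subbands[name]
        for i in range(h2):
            for j in range(w2):
                x[2 * i:2 * i + 2, 2 * j:2 * j + 2] += s[i, j] * f
    return x

img = np.random.rand(8, 8)
rec = idwt2(dwt2(img))
print(np.allclose(rec, img))  # True: reconstruction is lossless
```

Because the four filters form an orthonormal basis of the 2 × 2 pixel space, every block is recovered exactly, which is the lossless property exploited by the MWCNN.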

Network Architecture
As shown in Fig. 3, we retain the backbone of the MWCNN and optimize each level with an RDB to build our network. Let us consider a three-level MWRDCNN. We denote the input image as W_0. The first DWT decomposes W_0 into four subband feature maps, collectively denoted as W_1. Because these feature maps are located in the downsampling procedure, we further refer to them as W_{1,d}; the subband images in the upsampling procedure corresponding to W_{1,d} will be denoted by W_{1,u}. For the first DWT, we have

W_{1,d} = DWT(W_0).

After that, a single convolution block (i.e., the Conv block in Fig. 3) is deployed to reduce the number of channels of the feature maps to improve their inter-band independence; whether the block contains a batch normalization layer depends on the position of the current layer. Then, after the subband images pass through the RDB, we obtain W*_{1,d} via

W*_{1,d} = RDB(CB(W_{1,d})),

where CB denotes the function of the convolution block. After accomplishing local feature fusion with the RDB, the second DWT is adopted to produce a hierarchical architecture in the network, where the subband images in the second level are expressed as

W_{2,d} = DWT(W*_{1,d}),  W*_{2,d} = RDB(CB(W_{2,d})).

Following this procedure, the feature maps in the ith level of the downsampling procedure can be deduced from

W_{i,d} = DWT(W*_{i−1,d}),  W*_{i,d} = RDB(CB(W_{i,d})).

Suppose that the network contains n levels; the upsampling process begins after the feature maps pass through the nth RDB. To guarantee the symmetry of the architecture, another convolution block is used before the first iDWT operation. The iDWT in the highest level can be formally described as

W_{n−1,u} = iDWT(CB(W*_{n,d})),

where W_{n−1,u} refers to the (n − 1)th upsampled feature maps. Then, element-wise addition is adopted to establish long-term residual learning between the feature maps W_{n−1,u} and W*_{n−1,d} at the same level.
In the downsampling procedure, each layer is preceded by the convolution block and followed by the RDB, while during the upsampling procedure, the order of the convolution block and the RDB in each layer is reversed, i.e., the RDB comes first, followed by the convolution block. As mentioned earlier, the role of the convolution blocks in the downsampling procedure is to reduce the number of channels; the ones in the upsampling procedure have precisely the opposite effect, i.e., they increase the number of channels to match the channel number of the corresponding feature maps in the downsampling procedure at the same level, ready for the next element-wise addition. Subsequently, we obtain the output feature maps W*_{n−1,u} by

W*_{n−1,u} = CB(RDB(Sum(W_{n−1,u}, W*_{n−1,d}))).
Similar to the downsampling case, the general term formula in the upsampling procedure can be expressed as

W*_{i,u} = CB(RDB(Sum(W_{i,u}, W*_{i,d}))),  W_{i−1,u} = iDWT(W*_{i,u}),  i = n − 1, …, 1.

It is worth noting that there are no more learnable parameters after the last convolution layer, so adding one more rectified linear unit (ReLU) activation layer would degrade the quality of the final output image. Therefore, we set the last convolution block to be a single convolution layer.
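The recursion above can be made concrete with a small forward-pass skeleton. This is only a structural sketch under our own assumptions: the learned CB and RDB blocks are replaced by identity functions, and the DWT/iDWT are the channel-wise Haar transforms; only the data flow (downsampling path, long skips, upsampling path) follows the equations.

```python
import numpy as np

def dwt2(x):
    # Haar DWT via stride-2 slicing: returns [A, H, V, D] half-size subbands.
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return [(a + b + c + d) / 2, (c + d - a - b) / 2,
            (b + d - a - c) / 2, (a + d - b - c) / 2]

def idwt2(s):
    # Exact inverse of dwt2 (orthonormal Haar basis).
    A, H, V, D = s
    x = np.zeros((2 * A.shape[0], 2 * A.shape[1]))
    x[0::2, 0::2] = (A - H - V + D) / 2
    x[0::2, 1::2] = (A - H + V - D) / 2
    x[1::2, 0::2] = (A + H - V - D) / 2
    x[1::2, 1::2] = (A + H + V + D) / 2
    return x

def dwt_fm(chans):   # DWT of a multi-channel feature map: 4x channels, half size
    return [s for c in chans for s in dwt2(c)]

def idwt_fm(chans):  # inverse: each group of four subbands -> one channel
    return [idwt2(chans[k:k + 4]) for k in range(0, len(chans), 4)]

def forward(w0, n=3):
    CB = RDB = lambda ch: ch                 # identity placeholders for learned blocks
    acts, x = [], [w0]
    for _ in range(n):                       # downsampling path
        x = RDB(CB(dwt_fm(x)))               # W*_{i,d} = RDB(CB(DWT(W*_{i-1,d})))
        acts.append(x)
    x = idwt_fm(CB(x))                       # W_{n-1,u} = iDWT(CB(W*_{n,d}))
    for i in range(n - 1, 0, -1):            # upsampling path with long skips
        s = [a + b for a, b in zip(x, acts[i - 1])]  # Sum(W_{i,u}, W*_{i,d})
        x = idwt_fm(CB(RDB(s)))              # W_{i-1,u}
    return x[0]

img = np.random.rand(16, 16)
out = forward(img, n=3)
print(out.shape == img.shape)  # True: output size matches the input
```

With identity blocks the pipeline is linear, so it merely verifies that sizes and channel counts are consistent across the n levels; the real network replaces CB and RDB with learned convolutions.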
In this work, we use the PSNR as a unitless performance measure for evaluating the quality of the final output images of the networks:

PSNR = 10 log_{10}(MAX² / MSE),

where MAX is the maximum possible pixel value (e.g., 255 for 8-bit images). The PSNR value depends on the mean squared error (MSE), which is defined as

MSE = (1 / (pq)) Σ_{i=1}^{p} Σ_{j=1}^{q} [Ũ(i, j) − U_o(i, j)]²,

where p × q is the pixel size of the images. The MSE describes the squared distance between the denoised image Ũ(i, j) and the original image U_o(i, j). Naturally, the larger the PSNR value, the better the image quality of the output result.
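The two definitions above amount to a few lines of NumPy. A minimal sketch for 8-bit images (MAX = 255); the function names are ours:

```python
import numpy as np

def mse(u_hat, u_o):
    """Mean squared error between the denoised and original images."""
    return np.mean((u_hat.astype(np.float64) - u_o.astype(np.float64)) ** 2)

def psnr(u_hat, u_o, max_val=255.0):
    """PSNR in dB; larger means the denoised image is closer to the original."""
    return 10.0 * np.log10(max_val ** 2 / mse(u_hat, u_o))

clean = np.full((4, 4), 100.0)
noisy = clean + 5.0          # constant error of 5 -> MSE = 25
print(psnr(noisy, clean))    # 10*log10(255^2 / 25) ~= 34.15 dB
```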
Since the quadratic term produces a factor of 2 when differentiated during backpropagation, to cancel out this factor, we define the loss function of our network as

L(P) = (1 / (2N)) Σ_{i=1}^{N} ‖Ψ(b_i, P) − x_i‖²,

where b_i and x_i are the image with additive noise (the input image of the network) and the original image (the target, also called the ground truth), respectively, which together comprise the training set {(b_i, x_i)}_{i=1}^{N}; N is the batch size, Ψ(b_i, P) refers to the output image of the network, and P denotes the learnable parameters of the network.

The RDB is a common structure for building tight connections among layers and has been extensively used in SISR tasks [19]. In this work, we apply the RDBs to image denoising. The detailed architecture of the RDBs is given in Fig. 4. Assuming that the input and output of the RDB in the ith level are denoted as R_{i,o} and R*_{i,o}, respectively, where o = d (downsampling) or o = u (upsampling), we can simply write their relationship as

R*_{i,o} = RDB(R_{i,o}).

For an RDB with j convolution blocks, each block takes as input the concatenation of the RDB input and the outputs of all preceding blocks, and the final concatenation result R_{i,j,o} can be written as

R_{i,j,o} = Concat(R_{i,o}, R^{(1)}_{i,o}, …, R^{(j−1)}_{i,o}),

where Concat denotes the concatenation operation and R^{(k)}_{i,o} is the output of the kth convolution block. Assuming the number of channels in R_{i,o} is c, R_{i,j,o} will contain j × c channels. Thus, another convolution layer is needed to fuse the feature maps for the local residual learning:

R*_{i,o} = Sum(R_{i,o}, Conv(R_{i,j,o})).
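The dense wiring and the channel bookkeeping can be illustrated with a toy RDB. This is a hedged sketch under our own assumptions: the learned 3 × 3 convolution blocks are replaced by a placeholder that merely maps a (m·c)-channel concatenation back to c channels, and the 1 × 1 fusion convolution is replaced by a channel-group mean. Only the connectivity pattern (dense concatenation, j × c fused channels, local residual skip) follows the text.

```python
import numpy as np

def conv_block(t, c):
    """Placeholder for a convolution block: restore c output channels
    from an (m*c, h, w) concatenation, followed by a ReLU."""
    m = t.shape[0] // c
    return np.maximum(t.reshape(m, c, *t.shape[1:]).mean(axis=0), 0.0)

def rdb(x, j=3):
    """x: (c, h, w) feature map R_{i,o}; returns R*_{i,o} of the same shape."""
    c = x.shape[0]
    feats = [x]                                     # R_{i,o} plus block outputs
    for _ in range(j):
        dense_in = np.concatenate(feats, axis=0)    # dense connectivity
        feats.append(conv_block(dense_in, c))       # each block keeps c channels
    cat = np.concatenate(feats[:j], axis=0)         # final concat: j*c channels
    assert cat.shape[0] == j * c
    fused = cat.reshape(j, c, *x.shape[1:]).mean(axis=0)  # stand-in 1x1 fusion conv
    return x + fused                                # local residual learning

x = np.random.rand(4, 8, 8)
print(rdb(x).shape)  # (4, 8, 8): the RDB preserves the feature-map shape
```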

Implementation Details
Our MWRDCNN consists of three hierarchical levels. Each RDB contains three convolution blocks. The convolution kernels in the convolution layers are all of size 3 × 3, and their weights are adjusted during backpropagation; the kernels in the DWT and iDWT are all of size 2 × 2, but their weights are fixed and not adjusted during backpropagation. In other words, the DWT and iDWT contain no learnable convolution layers. The numbers of channels in each layer are marked in Fig. 3.

Discussion
Having combined the preeminent structures of the RDN and MWCNN, we now discuss how our MWRDCNN addresses the trade-off between performance and time consumption.

Difference between the MWRDCNN and RDN
It is worth mentioning that the RDN does not contain a U-shaped structure, so only two convolution layers are used in the RDN to extract the shallow features. In our network, the subband images are new feature maps decomposed from the input images, which means that their features need to be extracted. Therefore, we add one more convolution block at each level of the network.

Difference between the MWRDCNN and MWCNN
Here, we summarize three main differences between our network and the MWCNN. First, the MWCNN only applies long-term residual learning, by element-wise adding the downsampling and upsampling feature maps at the same level. Since we adopt RDBs in each layer of our MWRDCNN, both short-term and long-term connections are established, resulting in a more efficient learning mechanism. Second, in our MWRDCNN, we rewrite the DWT and iDWT functions as convolution and inverse convolution operations, instead of the matrix operations used in the MWCNN. Third, each RDB in our MWRDCNN contains 3 convolution blocks, so the downsampling or upsampling process of each of the first two layers contains 4 convolution blocks, which is consistent with the MWCNN; however, the numbers of convolution blocks in the highest layer of the two networks differ. The MWCNN has eight convolution blocks in the highest layer, while our MWRDCNN has five (the last one is used to ensure the symmetry of the network). Therefore, the number of convolution blocks in the highest level of the MWRDCNN is three fewer than that of the MWCNN, which accelerates the computation to a certain extent.

EXPERIMENTS
In this section, we first show the training sets and training procedure for each network, then we make a fair comparison to demonstrate the performance of our network.

Training and Test Sets
Here, the DIV2K dataset (one of the most popular training sets since 2017 [8,13], owing to its high aesthetic quality) is used for our training procedure [21]. It contains 800 images for training, 100 images for validation, and 100 images for testing.
In this work, we chose six other networks as our main competitors. The networks with their corresponding training patch sizes and batch sizes are shown in Table 1.

Table 1: Training patch sizes and batch sizes of each network.

Network        Patch size   Batch size
IRCNN [17]     152 × 152    32
DnCNN [18]     152 × 152    32
RED30 [15]     152 × 152    21
MWCNN [8]      152 × 152    32
RDN [19]       76 × 76      28
MemNet [20]    76 × 76      18
MWRDCNN        152 × 152    32

Due to the limited GPU memory size, we set the batch size to 21 for the RED30. To keep the batch size at a suitable value, we cropped every 152 × 152 patch into four 76 × 76 patches for training both the RDN and MemNet. By this means, we ensured that each network learned the same features from the training set.
The main task of trained networks is to denoise the images with additive Gaussian noise of different standard deviations. In this paper, we took three standard deviations of additive Gaussian noise into account, σ = 15, 25 and 50. The performance of these networks were assessed by using five commonly used test sets: Set5 [37], Set12 [18], Set14 [38], BSD68 [39] and Urban100 [40].

Network Training
To compare the proposed method with the previous networks as fairly as possible, we adopted the same parameters when training the different networks. The adaptive moment estimation (Adam) algorithm was chosen as the optimizer, with α = 0.01, β_1 = 0.9, β_2 = 0.999 and ε = 10^−8. The learning rates were divided into 3 stages over 45 epochs in total. In the first 15 epochs, the learning rate was fixed at 10^−3. In the next 20 epochs, the learning rate decayed exponentially from 10^−3.8 to 10^−4. In the last 10 epochs, it decayed exponentially from 10^−4.5 to 10^−5. Patch rotation and flipping were used for data augmentation. We accomplished the training and test procedures on an NVIDIA RTX 2080Ti GPU with the MatConvNet package [41].
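The three-stage schedule above can be sketched as a small helper function. The exact per-epoch interpolation within a stage is not stated in the text, so we assume a log-linear (exponential) decay across each stage; the function name is ours.

```python
# Hedged sketch of the three-stage learning-rate schedule (45 epochs total),
# assuming log-linear decay within each decaying stage.

def learning_rate(epoch):
    """Return the learning rate for a 1-indexed epoch in [1, 45]."""
    if epoch <= 15:                      # stage 1: fixed at 10^-3
        return 1e-3
    if epoch <= 35:                      # stage 2: 10^-3.8 -> 10^-4 over 20 epochs
        t = (epoch - 16) / 19.0
        return 10.0 ** (-3.8 + t * (-4.0 - (-3.8)))
    t = (epoch - 36) / 9.0               # stage 3: 10^-4.5 -> 10^-5 over 10 epochs
    return 10.0 ** (-4.5 + t * (-5.0 - (-4.5)))

schedule = [learning_rate(e) for e in range(1, 46)]
print(schedule[0], schedule[-1])  # 0.001 1e-05
```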

Comparisons with State-of-the-art Networks
The quality of the denoised images and the running time are evaluated in this section. The PSNR and the structural similarity index measure (SSIM) [42] were utilized in our quantitative assessment. In this work, the images used in both the training set and the test sets were grayscale.

Quantitative Evaluation
The results of the proposed method and the six previous methods at different Gaussian noise levels are shown in Table 2. The highest score in each row is highlighted in bold. It can be seen that, except for the MemNet slightly exceeding ours on the Urban100 test set, our MWRDCNN performs better than the other methods. It is worth noting that our MWRDCNN surpasses the MWCNN (the basis of our network) by about 0.1 dB on the first four test sets and by about 0.3 dB on the Urban100 test set in terms of PSNR. These results indicate that the proposed network does benefit from the residual learning of the RDBs. In particular, all the networks with RDBs (the RDN, MemNet and MWRDCNN) perform much better than the others on the Urban100 test set. Since many images in this test set contain various grids, the RDB may be well suited to such complicated patterns. On the other hand, the MWCNN obtains a higher score than the RDN and MemNet on Set5, which is composed of many images of creatures. This suggests that the DWT and iDWT may help the network enlarge the RF and improve its fitting ability. Our MWRDCNN inherits this property and further improves the performance. In a nutshell, our MWRDCNN combines the advantages of both the MWCNN and the RDB, and is competent at denoising images of various types.

Qualitative Comparison
We selected "09" from Set12 and "test051" from BSD68 to qualitatively compare our method with the others. In each image, we enlarged two cropped patches for comparison. The PSNR values of the whole images obtained by the different methods are listed under the corresponding patches. As shown in Fig. 5, the first three methods cannot correctly restore the tablecloth texture in the red rectangles. The MWCNN, RDN and MemNet can roughly recover it, but some tiny flaws remain on the cloth edge. The result obtained by the proposed method has the highest similarity to the ground truth (the original image without noise). In the green rectangles, our MWRDCNN obtains the clearest details, such as the right eye of the woman and the kerchief. The results on images with various texture directions are given in Fig. 6. Although all methods reconstruct some redundant stripes beyond the zebra's legs, this negative effect is least significant in our method.
Overall, the MWRDCNN is promising in concisely recovering images which are badly interrupted by heavy noise and presents a better image denoising performance.

Running Time
As network architectures become more and more complicated nowadays, computational efficiency must be taken into account. In this work, we timed the mentioned methods in the same test environment. We adopted the CuDNN-v7.5 deep learning library with CUDA 10.1 under the Windows 10 operating system. After finishing the computations on an NVIDIA RTX 2080Ti GPU, we obtained the average time costs shown in Table 3.
Benefiting from the DWT, which reduces the size of the feature maps, our MWRDCNN runs faster than the other two RDB-based networks, i.e., the RDN and MemNet. Although it is slower than the IRCNN, DnCNN and MWCNN, the high-quality reconstruction is sufficient to compensate for the time cost. Thus, the MWRDCNN achieves a good balance between performance and efficiency.

CONCLUSION
In this work, we chose the MWCNN as the backbone of our network and inserted RDBs into each downsampling and upsampling procedure to build our MWRDCNN. The proposed network inherits the large RF and lossless information operations of the MWCNN through the DWT and iDWT. In addition, the RDBs help the network establish dense short-term residual learning. The experimental results show that, by combining these advantages, the proposed network achieves higher quality in image denoising tasks. Moreover, our MWRDCNN takes less computation time than other RDB-based networks such as the RDN and MemNet. Thus, the MWRDCNN is a successful attempt to utilize the superiority of multiple networks. We hope our MWRDCNN can be further extended to other image restoration tasks, such as SISR and image artifact removal.