Image Denoising With Deep Convolutional Neural and Multi-Directional Long Short-Term Memory Networks Under Poisson Noise Environments

Removal of Poisson noise poses a very challenging technical problem because its characteristics are difficult to capture: Poisson noise affects each image pixel in proportion to the pixel intensity. This paper presents a new image denoising method for removing Poisson noise based on deep convolutional neural and multi-directional long short-term memory networks. The proposed architecture contains a set of Convolutional Neural Network (CNN) layers followed by multi-directional Long Short-Term Memory (LSTM) layers. The CNN layers extract image features and estimate the noise bases present in images. The multi-directional LSTM layers then capture and learn the statistics of the residual noise components, which possess long-range correlations and are sparse in the spatial domain. Designing deep learning models for image denoising also involves several hyperparameters, such as the number of layers. To select proper hyperparameters, it is beneficial to know the best denoising performance achievable under different model complexities; knowing how close a given denoising algorithm comes to this optimum makes it possible to design an efficient algorithm. We utilize the Blahut-Arimoto algorithm to numerically derive the distortion-mutual information function of the image denoising problem. The derived function serves as the lower bound on distortion given the mutual information between the original image and the denoised image. Based on this function, we can decide how many CNN layers should be deployed in our denoising algorithm before applying the multi-directional LSTM layers. Our experiments show that the proposed image denoising algorithm outperforms existing algorithms in both subjective and objective quality.


I. INTRODUCTION
Image denoising is one of the most classical problems in computer vision and image processing; its objective is to remove noise while preserving the original image structures. Accurately modeling and capturing noise characteristics in image denoising algorithms leads to high-quality restored images [1]-[3]. In general, there are two main classes of noise: 1.) signal-independent noise; and 2.) signal-dependent noise. Additive White Gaussian Noise (AWGN) is the most widely used signal-independent noise model. The AWGN is generally used to model noise induced by thermal vibrations of atoms, shot noise, and black-body radiation from warm objects. Unfortunately, the AWGN cannot effectively represent noise characteristics under the domination of photon noise [4], [5], which is signal-dependent. Photon noise is caused by the random arrival of photons at an image sensor, and the Poisson distribution is deployed to model it [6].

(The associate editor coordinating the review of this manuscript and approving it for publication was Shuihua Wang.)
Removal of AWGN can be done efficiently by several existing techniques, such as sparse 3D transform-domain collaborative filtering (BM3D) [7]. However, when the BM3D technique is applied to Poisson noise, especially in natural images, it does not provide results as good as in AWGN environments. Various methods have been proposed to denoise Poisson noise. Azzari and Foi [8] proposed a modified BM3D technique called Variance Stabilization for Noisy+Estimate Combination in Iterative Poisson Denoising (I+VST+BM3D). The I+VST+BM3D method relies on an iterative algorithm that progressively improves the effectiveness of the Variance Stabilizing Transformation (VST), and it gives better denoising results than BM3D. However, it does not perform well on Poisson noise with low peak values. Sparsity-Based Poisson Denoising With Dictionary Learning (SPDA) [9] was proposed to denoise Poisson noise with low peak values. The SPDA [9] performs well at very low peak values but cannot outperform I+VST+BM3D in Poisson noise environments with higher peak values. Feng et al. [10] proposed a method called Fast and Accurate Poisson Denoising With Trainable Nonlinear Diffusion (TRDPD). The TRDPD is an improved version of the Trainable Nonlinear Reaction Diffusion (TNRD) model [11], which performs well on Gaussian noise. Unlike the TNRD, the TRDPD replaces the reaction term of the TNRD diffusion equation with a new function derived from the Poisson noise distribution. The TRDPD provides better denoising results in Poisson noise environments over all ranges of peak values, but it leaves some artifacts on the denoised image.
With the recent advances of deep neural networks [12]-[16], classical image denoising techniques have been outperformed by deep learning-based techniques [17]-[21], such as the deep CNN denoiser of Zhang et al. [17]. The contributions of this paper can be summarized as follows: 1) We propose a method to numerically compute the distortion-mutual information function of the image denoising problem. This function can serve as a guideline for determining the hyperparameters of deep learning networks for image denoising; 2) We propose multi-directional LSTM networks to extract and learn sparse noise characteristics, reducing the complexity of applying an LSTM network directly to two-dimensional signals; 3) We combine the DCNN and the multi-directional LSTM to denoise images corrupted by Poisson noise and obtain better results in both subjective and objective image quality compared to existing methods. This paper is organized as follows.
Section II formulates the distortion-mutual information framework for image denoising; the algorithm to compute the distortion-mutual information function is also presented there. Section III discusses the use of the DCNN for denoising Poisson noise and its limitations. Section IV describes the multi-directional LSTM networks for capturing and learning sparse noise characteristics. The combination of the DCNN and the multi-directional LSTM into our proposed denoising architecture is presented in Section V. Experimental results are given in Section VI. Finally, concluding remarks are in Section VII.

II. DISTORTION-MUTUAL INFORMATION FUNCTION OF IMAGE DENOISING ALGORITHM
In this section, we numerically derive the lower bound on distortion for the image denoising algorithm under consideration. In other words, we want to know the best denoised image quality attainable for a given DCNN structure. Let us define $P$ and $P_N$ as the original image and the noisy image corrupted by Poisson noise, respectively. Each pixel in $P_N$ is an independent Poisson random variable whose mean depends on the corresponding clean pixel value, so the value of a noisy pixel is location-dependent. The conditional probability of the pixel value at position $(x_p, y_p)$ can be expressed as

$$\Pr\{P_N(x_p,y_p)=u \mid P(x_p,y_p)\} = \begin{cases} \dfrac{P(x_p,y_p)^{u}\, e^{-P(x_p,y_p)}}{u!}, & P(x_p,y_p) > 0,\\[4pt] \delta_{u,0}, & P(x_p,y_p) = 0, \end{cases}$$

where $P_N(x_p,y_p)=u$ denotes the random pixel value at position $(x_p,y_p)$ of $P_N$ and $P(x_p,y_p)$ is the pixel value at position $(x_p,y_p)$ of $P$. Here $\delta_{x,0}$ is the Kronecker delta, defined as

$$\delta_{x,0} = \begin{cases} 1, & x = 0,\\ 0, & x \neq 0. \end{cases}$$

Let the denoised image of $P_N$ be $\hat{P}$, which is obtained from

$$\hat{P} = f(P_N; w),$$

where $f(\cdot)$ is an image denoising function and $w$ is a set of denoising parameters. The objective of the image denoising problem is to find the optimal denoising function with parameters $w$ that minimizes the distortion between the original image and the denoised image under constraints on the effectiveness of $f(\cdot)$, which translates to the complexity of $f(\cdot)$ and the number of parameters in $w$. Therefore, the image denoising problem can be formulated as

$$\min_{w}\; D(P, \hat{P}) \quad \text{subject to} \quad I(\hat{P}; P_N, P) \le I_f(\hat{P}; P_N, P),$$

where $D(\cdot)$ is the distortion function, $I(\hat{P}; P_N, P)$ is the mutual information between $\hat{P}$ and $(P_N, P)$, and $I_f(\hat{P}; P_N, P)$ is the best achievable mutual information between $\hat{P}$ and $(P_N, P)$ obtained from the denoising function $f(\cdot)$. We may not be able to obtain a closed-form solution for $D(P, \hat{P})$. In practice, the Blahut-Arimoto algorithm [24] can be utilized to numerically compute the distortion-mutual information function of the image denoising algorithm. First, we need to compute the joint probabilities among pixel values of $\hat{P}$ and $(P_N, P)$.
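The signal-dependent noise model above is easy to simulate. The sketch below follows the common convention in the Poisson denoising literature (an assumption here, not spelled out in the text): the clean image is scaled so that its maximum mean photon count equals the chosen peak value, each pixel is drawn from a Poisson distribution with that mean, and the result is scaled back.

```python
import numpy as np

def add_poisson_noise(image, peak, rng=None):
    """Simulate signal-dependent Poisson noise at a given peak value.

    `image` is a float array in [0, 1]. Each noisy pixel is drawn from a
    Poisson distribution whose mean is the scaled clean pixel value, so
    the noise strength grows with pixel intensity (unlike AWGN).
    """
    rng = np.random.default_rng(rng)
    scaled = np.clip(image, 0.0, 1.0) * peak   # mean photon count per pixel
    noisy = rng.poisson(scaled).astype(np.float64)
    return noisy / peak                        # map back to [0, 1]
```

Lower peak values (e.g. 0.1 or 1, as in the experiments later in the paper) yield proportionally stronger noise, which is why they are the hardest setting.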
Let $n(P(x_p,y_p)=x,\, P_N(x_p,y_p)=y,\, \hat{P}(x_p,y_p)=\hat{y})$ be the total number of pixels whose value in the original image equals $x$, whose corrupted value under Poisson noise equals $y$, and whose denoised value equals $\hat{y}$, over all positions $(x_p,y_p)$. Define $N_p$ as the total number of pixels. In general, to obtain sufficiently large counts $n(\cdot)$ and $N_p$, we need to consider several images. The joint probability among pixels of $\hat{P}$ and $(P_N, P)$ is

$$\Pr\{P(x_p,y_p)=x,\, P_N(x_p,y_p)=y,\, \hat{P}(x_p,y_p)=\hat{y}\} = \frac{n(P(x_p,y_p)=x,\, P_N(x_p,y_p)=y,\, \hat{P}(x_p,y_p)=\hat{y})}{N_p}.$$

By the same reasoning, the probabilities $\Pr\{\hat{P}(x_p,y_p)=\hat{y}\}$ and $\Pr\{P(x_p,y_p)=x,\, P_N(x_p,y_p)=y\}$ can be computed from

$$\Pr\{\hat{P}(x_p,y_p)=\hat{y}\} = \frac{n(\hat{P}(x_p,y_p)=\hat{y})}{N_p}$$

and

$$\Pr\{P(x_p,y_p)=x,\, P_N(x_p,y_p)=y\} = \frac{n(P(x_p,y_p)=x,\, P_N(x_p,y_p)=y)}{N_p},$$

where $n(\hat{P}(x_p,y_p)=\hat{y})$ is the total number of denoised pixels with value $\hat{y}$ and $n(P(x_p,y_p)=x,\, P_N(x_p,y_p)=y)$ is the total number of pixels whose original value equals $x$ and whose corrupted value equals $y$. With these computed probabilities, the distortion-mutual information function can be numerically calculated as follows.

Blahut-Arimoto Algorithm for Computing Distortion-Mutual Information of Image Denoising
• Step 1: Let $x$ be the pixel value of the original image $P$ at position $(x_p,y_p)$, with probability $\Pr\{P(x_p,y_p)=x\} = p(x)$, and let $y$ be the pixel value of the noisy image $P_N$ at the same position. The conditional probability of the noisy value $y$ given the original value $x$ is $\Pr\{P_N(x_p,y_p)=y \mid P(x_p,y_p)=x\} = p(y|x)$. Let $\hat{y}$ be the restored pixel value of the denoised image $\hat{P}$ at position $(x_p,y_p)$. The conditional probability of $\hat{y}$ given the pair $(x,y)$ is defined as $\Pr\{\hat{P}(x_p,y_p)=\hat{y} \mid P(x_p,y_p)=x,\, P_N(x_p,y_p)=y\} = p(\hat{y}|x,y)$.

• Step 2: For a Lagrange multiplier $s > 0$ that trades mutual information against distortion, update the test channel and the output marginal from iteration $t$ to iteration $t+1$ repeatedly until convergence:

$$p^{(t+1)}(\hat{y}|x,y) = \frac{p^{(t)}(\hat{y})\, e^{-s\, d(\hat{y},x)}}{\sum_{\hat{y}'} p^{(t)}(\hat{y}')\, e^{-s\, d(\hat{y}',x)}}, \qquad p^{(t+1)}(\hat{y}) = \sum_{x,y} p(x,y)\, p^{(t+1)}(\hat{y}|x,y),$$

where $d(\hat{y},x)$ is the distortion between the denoised pixel value $\hat{y}$ and the original value $x$, and the expected distortion is taken over the joint probability of $\hat{y}$ and $x$.

The Blahut-Arimoto algorithm numerically outputs $I(\hat{P}; P_N, P)$ and the corresponding distortion between $\hat{P}$ and $P$; this mutual information corresponds to the lowest distortion we can achieve. Note that the underlying mutual information is computed directly from the statistics obtained by the algorithm. From the original image and the denoised image produced by the DCNN, we can likewise compute the mutual information directly as

$$I(\hat{P}; P_N, P) = \sum_{x,y,\hat{y}} p(x,y,\hat{y}) \log \frac{p(\hat{y}|x,y)}{p(\hat{y})}. \qquad (11)$$

The difference between the distortion of a practical denoiser and the Blahut-Arimoto bound at the same mutual information can be used to measure the efficiency of the image denoising algorithm.
Given the same mutual information, if the image denoising algorithm gives very close distortion to that from the Blahut-Arimoto algorithm, that image denoising algorithm has very high efficiency in noise removal.
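The iteration just described can be sketched numerically. The following is a minimal numpy implementation of the classic Blahut-Arimoto recursion over a small pixel alphabet; for clarity it conditions only on the clean pixel value (the paper conditions on the pair of clean and noisy values, which extends this by enlarging the conditioning index). The Lagrange slope `s`, distribution shapes, and iteration count are illustrative.

```python
import numpy as np

def blahut_arimoto(p_x, dist, s, n_iter=200):
    """One point of the distortion-mutual information trade-off.

    p_x  : source distribution over pixel values, shape (n,).
    dist : distortion matrix dist[x, x_hat], shape (n, m).
    s    : Lagrange slope (> 0) trading information against distortion.

    Returns (I, D): mutual information (bits) and expected distortion of
    the converged test channel, i.e. one point on the D(I) lower bound.
    """
    n, m = dist.shape
    q = np.full(m, 1.0 / m)                       # output marginal q(x_hat)
    for _ in range(n_iter):
        # channel update: Q(x_hat | x) proportional to q(x_hat) * exp(-s d)
        Q = q[None, :] * np.exp(-s * dist)
        Q /= Q.sum(axis=1, keepdims=True)
        q = p_x @ Q                               # marginal update
    joint = p_x[:, None] * Q
    D = float((joint * dist).sum())
    I = float(np.sum(joint * np.log2(Q / q[None, :])))
    return I, D
```

Sweeping `s` traces out the whole distortion-mutual information curve; a DCNN whose measured (I, D) point lies close to this curve is, in the sense of Section II, a near-optimal denoiser.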

III. NOISE FEATURE ESTIMATION WITH CONVOLUTIONAL NEURAL NETWORKS AND ITS LIMITATIONS
It is widely known that Deep Convolutional Neural Networks (DCNNs) can learn to extract non-linear features far better than hand-crafted features. In the image denoising problem, the DCNN is utilized to learn and extract the noise from corrupted images; the estimated noise is then subtracted from the corrupted images to obtain the denoised images. The structure of the DCNN for image denoising is shown in Fig.8. Each layer of the DCNN extracts noise features from the input by convolving the trained weights with the features extracted from the previous layer [25]. The output feature at layer $i$ can be written as

$$F_{k,i}(x_p, y_p) = \sum_{m=1}^{M_{i-1}} \left\| W_{m,i-1} \circ Z_{m,i-1}(x_p, y_p) \right\|_F, \qquad (12)$$

where $F_{k,i}(x_p,y_p)$ is the value of feature $k$ at position $(x_p,y_p)$ of the $i$-th layer of the DCNN, $\|\cdot\|_F$ is the Frobenius norm, and $\circ$ is the Hadamard product. $M_{i-1}$ is the number of feature maps at the $(i-1)$-th layer, $W_{m,i-1}$ is the trainable weight matrix of feature map $m$ at the $(i-1)$-th layer, and $Z_{m,i-1}(x_p,y_p)$ is an $N \times N$ patch of feature map $m$ in the $(i-1)$-th layer centered at position $(x_p,y_p)$. The output features at layer $i$ are then weighted and summed to obtain the noise component. Fig.8 shows the extracted Poisson noise components from the Plane image. Notice that the noise components obtained from Layer 1 and Layer 2 of the DCNN are very noisy, whereas the noise components from the deeper layers (Layer 18 and Layer 19) become more sparse. This implies that the deeper the layer, the weaker the noise power. Although we use Poisson noise throughout this paper, the concept of image denoising with a DCNN can be applied to other kinds of noise, such as Gaussian noise.
The number of layers and the number of neurons per layer in the DCNN directly determine the network complexity: the larger the number of parameters, the longer the training period. Knowing the limits of the image denoising algorithm in use helps us optimize and select the proper DCNN structure (e.g., the number of layers and the number of parameters). Even though a properly sized DCNN can remove noise quite efficiently, beyond the layers at which the noise components become sparse, increasing the number of CNN layers does not significantly improve the noise estimation. In other words, the denoising performance does not improve much with the additional complexity given to the DCNN. This is because the CNN cannot group sparse noisy pixels into a local patch from which the convolution operation can learn noise features, so the noise statistics cannot be effectively calculated. To capture the statistics of sparse noise, we need other mechanisms for capturing noise characteristics.
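The per-layer computation described above can be sketched as a plain 2-D correlation: each output map sums, over all input maps, the entrywise products of a kernel with the patch centered at each pixel. The shapes, zero padding, and the ReLU nonlinearity below are illustrative choices, not details taken from the paper.

```python
import numpy as np

def conv_feature_map(feats, weights):
    """One DCNN feature layer (a naive, loop-based sketch).

    feats   : input feature maps, shape (M, H, W), zero-padded here.
    weights : kernels, shape (K, M, N, N) with N odd.
    Returns : output feature maps, shape (K, H, W), after a ReLU.
    """
    K, M, N, _ = weights.shape
    pad = N // 2
    H, W = feats.shape[1:]
    fp = np.pad(feats, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((K, H, W))
    for k in range(K):
        for m in range(M):
            # slide the N x N kernel over the padded map m
            for dy in range(N):
                for dx in range(N):
                    out[k] += weights[k, m, dy, dx] * fp[m, dy:dy + H, dx:dx + W]
    return np.maximum(out, 0.0)          # ReLU
```

In a residual denoiser, a stack of such layers predicts the noise, which is then subtracted from the corrupted input to give the denoised image.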

IV. SPARSE NOISE FEATURE EXTRACTION WITH MULTI-DIRECTIONAL LONG SHORT-TERM MEMORY NETWORKS (LSTM)
The LSTM network is widely used to capture long-range dependencies in one-dimensional data [23], [26], [27]. It is also possible to employ the LSTM network to capture correlations in multi-dimensional signals at higher computational complexity [28], by transforming the multi-dimensional data into long one-dimensional data. In this section, we deploy the LSTM network to extract sparse noise features. To obtain the optimal result, we would need to transform the two-dimensional input feature map into a one-dimensional signal, for example by scanning the feature maps in some scanning order such as a raster scan. However, training an LSTM network on very long one-dimensional input data faces several technical issues, such as vanishing gradients and high computational complexity [22].
To address these challenges, we propose the multi-directional LSTM network, which applies the LSTM network to its input in four directions: 1.) from left to right (direction 1); 2.) from right to left (direction 2); 3.) from top to bottom (direction 3); and 4.) from bottom to top (direction 4). The proposed algorithm may not capture noise features optimally, since we assume that sufficient sparse noise characteristics can be obtained by applying the LSTM network in only four directions. Fig.2 illustrates the proposed multi-directional LSTM network as a combination of the operations on feature maps in the four directions. The input feature maps are first convolved with a 1 × 1 × C convolutional layer before being passed to each direction of the directional LSTM module; the number of filters is equal to 32. To simplify the process, the left-to-right, bottom-to-top, and right-to-left directions are represented by 90°, 180°, and 270° rotations of the top-to-bottom directional LSTM network, respectively. After these rotations, we only ever apply the LSTM from top to bottom, which greatly reduces the implementation difficulty. The outputs from all four directions are then concatenated and fed to a 1 × 1 × 64 convolutional layer to produce the output feature maps of the multi-directional LSTM network.
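The rotation trick above can be sketched as follows: rotate the feature map so the desired scan direction becomes top-to-bottom, apply the single top-to-bottom module, rotate the result back, and concatenate along the channel axis. The 1 × 1 convolutions before and after are omitted here, and `run_top_down` is a placeholder for the directional LSTM (any top-to-bottom causal operation works for the sketch).

```python
import numpy as np

def multi_directional(feat, run_top_down):
    """Apply one top-to-bottom directional module in all four directions.

    feat         : feature maps, shape (H, W, C).
    run_top_down : callable mapping (H, W, C) -> (H, W, C'); stands in
                   for the top-to-bottom directional LSTM module.
    Returns      : concatenated maps, shape (H, W, 4 * C').
    """
    outs = []
    for k in (0, 1, 2, 3):                        # 0, 90, 180, 270 degrees
        rotated = np.rot90(feat, k, axes=(0, 1))  # make direction top-down
        processed = run_top_down(rotated)
        outs.append(np.rot90(processed, -k, axes=(0, 1)))  # rotate back
    return np.concatenate(outs, axis=-1)
```

This way a single implementation of the top-to-bottom recurrence serves all four scan directions.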
The procedure of the top-to-bottom directional LSTM module is as follows. The feature maps, of size $I \times J \times K$, are processed directly without being transformed into one-dimensional data. This possibly omits some correlations that could be gathered from a long one-dimensional sequence, but it greatly reduces the complexity of our framework: the complexity of the conventional LSTM is of order $O(n^2)$, whereas the proposed multi-directional LSTM has complexity of order $O(n)$. The LSTM cells are connected sequentially, from the previous cell to the next cell along the processing direction. In the top-to-bottom direction, the LSTM cells are connected from the previous row to the next row, while the LSTM cells within each row are independent of one another. Fig.2b depicts the connection of the LSTM cells in the top-to-bottom direction. To calculate the output of the LSTM cell at position $(i, j)$, we feed in the feature map value at position $(i, j)$ and the outputs of the LSTM cells from the previous row. A linear transformation, whose weights are learned during training, is applied to the feature map value at position $(i, j)$; at the same time, the outputs of the LSTM cells from the previous row are convolved with trained weights. The sum of these two values, passed through the activation function, is used as the input of the gates in the current LSTM cell. Fig.3 shows an example of the process from the input feature maps to the output feature maps in each row for the case of one feature map. The input of a gate at position $(i, j)$ at channel $k$ can be calculated as

$$z_{(i,j,k)} = \mathrm{ReLU}\!\left( W_f(k)\, f_{(i,j)} + W_h(k) * h_{(i-1,j)} \right), \qquad (13)$$

where $\mathrm{ReLU}(\cdot)$ is the rectified linear unit function, $*$ denotes convolution, and $f_{(i,j)}$ is the one-dimensional input feature vector of size $K$ at position $(i, j)$.
$h_{(i-1,j)}$ is the output feature patch of size $3 \times K$ from the previous row's output features, centered at $(i-1, j)$. $W_f(k)$ is the weight matrix of the linear transformation applied to the input $f_{(i,j)}$ at channel $k$, and $W_h(k)$ is the weight convolved with $h_{(i-1,j)}$ at channel $k$. Note that the weights $W_f(k)$ and $W_h(k)$ of the different gates in the LSTM cell are independent of one another.
The proposed directional LSTM module contains a cell state, a forget gate, an input gate, a content gate, and an output gate; its procedure is illustrated in Fig.2c. The cell state is the key of the LSTM cell: it connects to the previous cell and undergoes some minor processing in the current cell to become the output cell state. The first operation on the cell state is the forget gate, which controls how much of each component can pass through by multiplying the value of the forget gate with the incoming cell state; a value of zero means nothing passes through, while a value of one means everything passes. The forget gate at position $(i, j, k)$ can be calculated as

$$G_{f(i,j,k)} = \sigma\!\left(z^{f}_{(i,j,k)}\right), \qquad (14)$$

where $G_{f(i,j,k)}$ is the forget gate at position $(i, j)$ at channel $k$, $z^{f}_{(i,j,k)}$ is the gate input of (13) computed with the forget gate's weights, and $\sigma(\cdot)$ is the sigmoid function. Next, we process the input features and decide how much of them to store in the cell state. There are two parts to this: first, the input gate decides how much of the input is added to the cell state; second, the content gate processes the input features before they pass through the input gate. The input gate and the content gate at position $(i, j, k)$ can be calculated by

$$G_{i(i,j,k)} = \sigma\!\left(z^{i}_{(i,j,k)}\right), \qquad (15)$$

where $G_{i(i,j,k)}$ is the input gate at position $(i, j)$ at channel $k$, and

$$G_{c(i,j,k)} = \tanh\!\left(z^{c}_{(i,j,k)}\right), \qquad (16)$$

where $G_{c(i,j,k)}$ is the content gate at position $(i, j)$ at channel $k$ and $\tanh(\cdot)$ is the hyperbolic tangent function.
The old cell state, obtained from the LSTM cell in the previous row, is used to compute the new cell state by combining all of the values computed above: the old cell state is multiplied by the forget gate, to filter out the information the cell decided to omit, and then summed with the product of the content gate and the input gate, which is the new candidate content scaled by how much we decide to update each state value. The current cell state at position $(i, j, k)$ after updating with the old cell state is

$$C_{(i,j,k)} = G_{f(i,j,k)}\, C_{(i-1,j,k)} + G_{i(i,j,k)}\, G_{c(i,j,k)}, \qquad (17)$$

where $C_{(i,j,k)}$ is the current cell state at position $(i, j)$ of channel $k$ and $C_{(i-1,j,k)}$ is the old cell state at position $(i-1, j)$ of channel $k$. Finally, the output of the LSTM cell is based on the cell state and the value of the output gate, which controls how much of the cell state becomes the output of the LSTM cell. The cell state is fed through the hyperbolic tangent function, to keep the output value between $-1$ and $1$, and multiplied by the output gate. The output gate and the output features at position $(i, j)$ at channel $k$ can be calculated by

$$G_{o(i,j,k)} = \sigma\!\left(z^{o}_{(i,j,k)}\right), \qquad (18)$$

$$h_{(i,j,k)} = G_{o(i,j,k)} \tanh\!\left(C_{(i,j,k)}\right). \qquad (19)$$

The output $h_{(i,j,k)}$ of the LSTM belongs to only one direction. As shown in Fig.2(a), we need the outputs from all four directions before concatenating them. $C$ convolutional filters of size $1 \times 1 \times 64$ are applied to the concatenated output to obtain the estimated sparse noise features of size $H \times W \times C$.
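The gate arithmetic of one cell, given the per-gate pre-activations, can be sketched directly. How each pre-activation `z_*` is formed (the linear transform of the input feature plus the convolution with the previous row's outputs, with independent weights per gate) is left outside this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(z_f, z_i, z_c, z_o, c_prev):
    """One directional-LSTM cell update at one position.

    z_f, z_i, z_c, z_o : pre-activations for the forget, input, content,
                         and output gates (independent weights per gate).
    c_prev             : cell state passed down from the previous row.
    Returns (c, h)     : new cell state and cell output.
    """
    g_f = sigmoid(z_f)                 # forget gate: how much old state survives
    g_i = sigmoid(z_i)                 # input gate: how much new content enters
    g_c = np.tanh(z_c)                 # content gate: candidate values
    g_o = sigmoid(z_o)                 # output gate: how much state is exposed
    c = g_f * c_prev + g_i * g_c       # cell-state update
    h = g_o * np.tanh(c)               # output kept in (-1, 1)
    return c, h
```

With a strongly positive forget pre-activation and neutral content, the state passes through the row almost unchanged, which is how long-range (many-row) correlations in the sparse noise are carried.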

A. IMAGE DENOISING PERFORMANCE 1) TRAINING METHODOLOGY
An image data set from the Microsoft COCO (2017) data set [29] is used to evaluate the proposed image denoising technique. We randomly select several image batches from the data set, apply Poisson noise to them, and use the noisy images as the training inputs. There are two training stages. The first training stage is performed over batches of 256 image patches, each of size 32 × 32 pixels. In our experiments, we iterate over the data set for 35000 epochs.
The learning rate α is set to 0.001. The trained weights from the first stage are used as the initial weights in the second training stage, in which each batch contains 32 image patches of size 128 × 128 pixels.
In this stage, we iterate over the data set for 5000 epochs. The objective of the second training stage is to let our image denoising technique learn the spatial information of natural images.
All image denoising techniques are trained with the same image data set, although the number of training epochs needed to obtain the best denoised images may differ. We employ both the Peak Signal-to-Noise Ratio (PSNR) [32] and the Structural Similarity Index (SSIM) [33] as our objective metrics. Table 1 and Table 2 compare the objective image qualities among the different denoising techniques. Our proposed technique outperforms the other networks in terms of the objective quality metrics, providing up to 0.5 dB PSNR improvement on average. In SSIM, however, there is little gain over other works, which may imply that the SSIM is not sensitive enough to measure the quality improvement in this comparison. Fig.5, Fig.6, and Fig.7 show the subjective quality comparison under a Poisson noise peak value of 0.1. Our method shows significant improvement in subjective quality, especially fewer color artifacts. Notice that our method has high impact on image regions with low detail, as shown in Fig.7.
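For reference, the PSNR metric used in these comparisons reduces to a one-line formula. The sketch below assumes 8-bit images (peak value 255), the usual convention.

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

A 0.5 dB gain thus corresponds to roughly an 11% reduction in mean squared error.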
We also compare feature maps between our method and the DenoiseNet. Fig.8 shows feature maps from the intermediate layers of both networks for the Plane image in the LIVE1 data set [31]. The feature maps of the first layer of both networks are very similar and very noisy. However, in deeper layers, the feature maps of the DenoiseNet contain less structural information and some blurring artifacts, whereas our method produces feature maps with more edge information and fewer artifacts.
The reason our algorithm cannot objectively outperform other image denoising algorithms on highly textured images is the non-stationarity of pixel values in high textures. Since the Poisson noise applied at each pixel position depends on the corresponding pixel value, the noise characteristics within the operating patch of the multi-directional LSTM are quite dynamic, so the multi-directional LSTM cannot produce good predictions of the noise characteristics from past information. However, as mentioned above, our method still gives superior subjective image quality. To support this claim, we evaluate our proposed denoiser on images with low detail. To obtain a set of low-detail images, we deploy the two-dimensional High Frequency Component (HFC) metric. The two-dimensional HFC of each image can be calculated via

$$\mathrm{HFC} = \sum_{i}\sum_{j} w(i,j)\, |X(i,j)|,$$

where $X(i,j)$ is the $N$-point two-dimensional discrete Fourier transform of the image data at frequency $(i,j)$ and $w(i,j)$ is a weight that grows with frequency. The HFC is thus a weighted summation of all frequency components: the higher the frequency, the higher the weight. In general, low-detail images tend to have low HFC values.
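A stand-in for this metric can be sketched as follows. The exact weighting function is not specified here, so the radial distance from the DC component used below is an assumption; any weight that grows with frequency preserves the metric's intent (flat images score near zero, textured images score high).

```python
import numpy as np

def hfc(image):
    """Frequency-weighted sum of 2-D DFT magnitudes (HFC stand-in).

    Each DFT coefficient is weighted by its radial distance from the DC
    component, so higher frequencies contribute more. A constant image
    has all its energy at DC (weight zero) and scores zero.
    """
    X = np.fft.fftshift(np.fft.fft2(image))   # move DC to the center
    h, w = X.shape
    fy = np.arange(h) - h // 2
    fx = np.arange(w) - w // 2
    weight = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    return float(np.sum(weight * np.abs(X)))
```

Thresholding this score (the paper uses 10^8 on its own scale) then separates low-detail images from highly textured ones.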
To be more specific, we declare that an image has low detail if its HFC is less than $10^8$. Fig.9a, Fig.9b, and Fig.9c illustrate the HFC histograms of the image data set in [30], the LIVE1 image data set [31], and the VOC2012 image data set [34], respectively. The HFC histograms imply that most images in the LIVE1 data set contain high texture levels. In the image data set in [30], some images have high texture levels while others have less texture. We found many low-detail images in the VOC2012 image data set, from which we can extract 528 low-detail images in total for our evaluation. Table 3 compares the objective qualities of the different image denoising methods under Poisson noise environments, using both PSNR and SSIM as objective metrics. From the experimental results, our method outperforms the other techniques under strong noise environments. To be specific, our method achieves 1.1 dB and 0.7 dB PSNR improvements under Poisson noise with peaks of 0.1 and 1, respectively. The PSNR improvements of our method are less significant in weak noise environments (higher peak noise values), because image textures are then less affected by noise and our method cannot give significant gains. In strong noise environments, however, the CNN module tends to smooth out textured regions, causing texture loss; with the multi-directional LSTM modules in our method, image texture can be preserved and restored to obtain better denoised image quality. On low-detail images, the SSIM improvements of our method are also superior to those of the other methods, consistent with the PSNR improvements. The major improvements of our algorithm on low-detail images are due to the less dynamic noise characteristics within the operating patch: in low-detail images, pixel values do not vary much.
Hence, the Poisson noise characteristics at different pixel positions are quite static, and there are high correlations among the noise statistics within the operating patch. Therefore, the multi-directional LSTM can learn and predict the noise characteristics better than in regions with highly varying textures.

B. NUMERICAL DISTORTION MUTUAL INFORMATION OF IMAGE DENOISING ALGORITHMS
To evaluate the numerical distortion-mutual information function of the image denoising algorithm, we utilize the VOC2012 image data set [34], since it contains a large number of images for training the DCNN. We randomly select several image batches from the data set and apply Poisson noise with a peak value of one to them; the noisy images are used as the training inputs. The training stage is performed over batches of 256 image patches, each of size 32 × 32 pixels. The learning rate α is initially set to 0.001 and the Adam optimizer is utilized. Fig.10 shows the average mutual information between the original and the denoised images for several DCNN models trained with different numbers of epochs. The average mutual information of each DCNN model is computed by averaging the mutual information between the original images and the denoised images obtained from the model under consideration. As we can see, the DCNN models obtained with few training epochs possess less average mutual information than those with more training epochs, and the average mutual information of the different DCNN models changes little after 15000 iterations. We employ the Blahut-Arimoto algorithm to numerically compute the distortion-mutual information function of the image denoising algorithm; its results serve as the best distortion achievable given the mutual information. We compare the image denoising performance of different DCNN models with that obtained from the Blahut-Arimoto algorithm, varying the number of CNN layers in the DCNN over one, two, three, four, five, ten, and fifteen layers. All DCNN models are trained for 35000 epochs. The results are shown in Fig.11.
As we can see, the DCNN-based image denoising algorithm performs very close to the best achievable performance given the mutual information between the original image and the denoised image. Fig. 12 shows the comparison in the PSNR domain between the denoised images obtained from the DCNN and those from the Blahut-Arimoto algorithm.

VII. CONCLUSION
We have proposed a new architecture of deep convolutional and multi-directional LSTM networks to eliminate Poisson noise, which is challenging to remove because the noise level depends on the corresponding pixel intensity. The proposed network has two stages. The first stage contains the deep convolutional networks, which extract noise bases with different variances; the deeper the layer, the lower the noise variance and the more sparse the noise. The multi-directional LSTM networks form the second stage, which groups the sparse noise components so that the remaining noise can still be effectively removed. The proposed network is trained on several natural images before being applied to the test image sets. The experimental results show that our proposed network provides better denoised image quality and fewer artifacts, in both subjective and objective quality measures, than the existing algorithms. We also derived the numerical distortion-mutual information function of the image denoising algorithm, which bounds the image denoising performance given the mutual information between the original image and the denoised image. The denoising results of the DCNN under Poisson noise give near-optimal quality under different hyperparameter settings, such as the number of CNN layers. This agrees with the fact that most of the noise is removed during the first stage and only sparse noise remains; however, that sparse noise still affects the overall subjective quality of the denoised images. The insights given by this framework can guide the proper selection of the number of CNN layers and the design of image denoising algorithms.