Low-Dose CT Image Reconstruction With a Deep Learning Prior

In low-dose computed tomography (LDCT), a penalized weighted least squares (PWLS) approach that incorporates the Poisson statistics of X-ray photons can significantly reduce excessive quantum noise. To improve the quality of LDCT images, prior information, such as the total variation, Markov random fields, and nonlocal means, can be imposed on the target image. However, such priors may not fully reflect the characteristics of the target images, resulting in unexpected side effects (e.g., blurry images). In this paper, we propose a PWLS method combined with a deep learning prior that is learned from standard-dose CT (SDCT) images. The proposed model learns a noise reduction function that maps an LDCT image to its corresponding SDCT image and estimates the prior distribution using a Pearson $\chi ^{2}$ divergence. The model can be converted to a least squares generative adversarial network with an added PWLS objective, where the optimal generator acts as the noise reduction function. We reformulate the proposed model as a constrained optimization problem and solve it using the alternating optimization (AO) algorithm. Clinical SDCT and simulated LDCT scans of ten patients were used to show the validity of the proposed method. Results show that the proposed method outperforms other PWLS methods that impose priors such as the total variation and the nonlocal means.


I. INTRODUCTION
X-ray computed tomography (X-ray CT) is a diagnostic imaging modality widely used in medical radiology. It produces cross-sectional images of the human body with high resolution. However, with growing concerns regarding the risk of radiation-induced cancer, there is an urgent need for radiation dose reduction [7], [16]. In clinical practice, decreasing the X-ray tube current is the most common method of reducing the X-ray dose [15]. The quality of low-dose CT (LDCT) is partly limited by excessive quantum noise, which results in the degradation of diagnostic performance.
Over the last few decades, numerous iterative reconstruction (IR) methods for noise reduction have been proposed. Some of these methods have incorporated statistical approaches, such as the maximum a posteriori (MAP) approach, for data fitting in the sinogram space. By assuming that the X-ray photons reaching the detector follow a Poisson distribution, the MAP model can be reduced to a penalized weighted least squares (PWLS) optimization that includes a term relating to the prior distribution of the target images [17], [20]. Owing to the difficulty of calculating the prior distribution, alternative priors reflecting the spatial correlation among neighboring pixels, such as total variation (TV) [6], [23], fractional-order TV [25], nonlocal TV [18], Markov random fields [17], [20], [21], or nonlocal means (NLM) [3], [10], [12], [26], have been imposed on the target image. Although IR methods improve the overall image quality, some issues remain. First, designing a prior that effectively reflects the characteristics of the target image continues to be a challenge. For example, TV, NLM, and their variants are sensitive to noise, which can lead to over-smoothing or cartoon-like textures. In addition, the manual selection of internal parameters for TV and NLM is difficult, yet it is a crucial factor influencing the denoising performance. Second, IR methods are computationally expensive.
Recently, Park et al. [19] proposed an image-based deep learning model that learns a noise reduction function mapping LDCT images to their standard-dose CT (SDCT) counterparts by imposing a prior on the target images. Using several SDCT samples, they estimated the SDCT image distribution (i.e., the prior distribution) in the sense of the Kullback-Leibler (KL) divergence. Their method allowed the problem to be expressed in a generative adversarial network (GAN) framework. The results showed that their approach can infer the desired SDCT-like image priors from the SDCT images used for training. In addition, no loss function between the LDCT images and their corresponding SDCT images was required; hence, training with unpaired LDCT and SDCT datasets was proven to be feasible. However, the approach assumed the presence of Gaussian noise in the CT images, which may not be valid in clinical practice.
In this paper, we propose a deep PWLS method combined with a deep learning prior. Unlike the previous work [19], we estimate the prior distribution in the sense of the Pearson χ² divergence, using several SDCT samples. We then convert the proposed model into a least squares GAN (LS-GAN) [13], which adopts the least squares loss function for the discriminator, with an added PWLS objective. Compared to the Jensen-Shannon or KL divergences, the Pearson χ² divergence is known to provide a more stable learning process because it has no zero-forcing properties [13], [14]. The proposed model is then reformulated as a constrained optimization problem and solved using the alternating optimization (AO) algorithm. While the training of the noise reduction function is performed iteratively, the corrected image is obtained directly from the trained noise reduction function. Numerical simulations are performed to confirm the validity of the proposed method.

II. METHOD
Let y ∈ R^M be the measured X-ray photon counts reaching the detector, and let x ∈ R^N be the unknown attenuation coefficients to be determined. The CT system tries to recover x from the measured y. Define the forward operator A = [a_ij] ∈ R^{M×N}, whose element a_ij denotes the effective intersection length of projection line i with voxel j. Considering a monochromatic X-ray, the expected number of measured X-ray photons at detector i, denoted by ỹ_i, can be simply modeled as

ỹ_i = I_0 exp(−[Ax]_i),   (1)

where I_0 is the number of X-ray photons generated at the source. We start from the maximum a posteriori (MAP) approach. Given y, find x̂ that maximizes the conditional probability p_{x|y}(x|y):

x̂ = arg max_x log p_{x|y}(x|y) = arg max_x log p_x(x) + L(y|x),   (2)

where L(y|x) = log p_{y|x}(y|x). Here, the distribution p_x(x) is generally unknown. In the following section, we estimate p_x(x) using many noise-free images {x_i}_{i=1}^K sampled from p_x(x).
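As a quick sketch of the forward model above, the code below evaluates the expected photon counts ỹ_i = I_0 exp(−[Ax]_i) for a hypothetical toy system (the matrix, voxel values, and I_0 are illustrative assumptions, not values from the paper):

```python
import numpy as np

def expected_counts(A, x, I0):
    """Expected photon counts y_tilde_i = I0 * exp(-[Ax]_i) for a monochromatic beam."""
    return I0 * np.exp(-A @ x)

# Hypothetical toy system: 3 projection lines, 2 voxels.
A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])   # intersection lengths a_ij
x = np.array([0.5, 0.25])    # attenuation coefficients
I0 = 1e5                     # photons generated at the source

y_tilde = expected_counts(A, x, I0)
```

A Poisson sample of `y_tilde` would then model the measured counts y.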

A. ESTIMATION OF PRIOR USING DEEP LEARNING
Let z and x be the LDCT and SDCT images reconstructed from a low-dose scan y_l and a standard-dose scan y_h, respectively. Regarding x as a noise-free image, let G : z → x be a noise reduction function. For notational convenience, we write y = y_l. Using many samples, we hope to find G maximizing the expectation of log p_{x|G(z)}(x|G(z)), which can be expressed as

max_G −H(p_G(z), p_x) + E_{z∼p_z}[L(y|G(z))],   (3)

where H(p, q) is the cross-entropy of p and q. However, owing to the difficulty of handling H(p_G(z), p_x) only from the samples, we replace it with the Pearson χ² divergence χ²_Pearson(p_G(z) + p_x || 2p_x); this quantity is zero when p_G(z) = p_x. Now, the problem (3) can be reduced to

min_G χ²_Pearson(p_G(z) + p_x || 2p_x) − E_{z∼p_z}[L(y|G(z))].   (4)

It can be rewritten with the adversarial loss [13], [14], which is given by

min_{θ_g} E_{z∼p_z}[D(G(z; θ_g); θ_d)²] − E_{z∼p_z}[L(y|G(z; θ_g))],   (5)

where D is the optimum of the following problem:

min_{θ_d} E_{x∼p_x}[(D(x; θ_d) − 1)²] + E_{z∼p_z}[(D(G(z; θ_g); θ_d) + 1)²].   (6)
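The least-squares adversarial losses can be sketched numerically; this is a minimal illustration with the ±1 target coding, using plain NumPy arrays in place of network outputs (all values are hypothetical):

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Least-squares discriminator loss, problem (6): real targets 1, generated targets -1."""
    return np.mean((d_real - 1.0) ** 2) + np.mean((d_fake + 1.0) ** 2)

def g_loss(d_fake, log_lik_fake):
    """Generator objective, problem (5): squared discriminator outputs on generated
    samples minus the data log-likelihood term E[L(y|G(z))]."""
    return np.mean(d_fake ** 2) - np.mean(log_lik_fake)

# Hypothetical discriminator outputs: D separates real (near 1) from generated (near -1).
d_real = np.array([0.9, 1.1])
d_fake = np.array([-0.8, -1.2])
loss_d = d_loss(d_real, d_fake)   # small when D discriminates well
loss_g = g_loss(np.zeros(2), np.zeros(2))   # zero when D(G(z)) = 0 and L = 0
```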
The proof of (5) is the same as that of Theorem 1 in [14]; the only difference lies in the additional term E_{z∼p_z}[L(y|G(z; θ_g))]. For more details, please refer to the Appendix. The problem in (5) can be viewed as a least-squares GAN with an additional constraint term, and hence the function G can be learned from the training samples. The least squares loss for the discriminator in (6) allows the proposed method to achieve stable performance during training [14], [27].

B. OPTIMIZATION METHOD
We optimize problems (5) and (6) in alternating steps. For fixed D, we first optimize problem (5). To decouple the denoising function G from the term L, we convert problem (5) into the following constrained form:

min_{θ_g, x̃} E_{z∼p_z}[D(x̃)² − L(y|x̃)]  subject to  x̃ = G(z; θ_g).   (7)

We now plug this problem into the quadratic penalty formulation [5], [22]:
arg min_{θ_g, x̃} E_{z∼p_z}[ D(x̃)² − L(y|x̃) + ρ ||x̃ − G(z; θ_g)||² ]   (8)

with a sufficiently large penalty parameter ρ. This problem can be solved iteratively using the AO algorithm in three steps:

x̃^{(k+1)} = arg min_{x̃} D(x̃)² − L(y|x̃) + ρ^{(k)} ||x̃ − G(z; θ_g^{(k)})||²,   (9)

θ_g^{(k+1)} = arg min_{θ_g} E_{z∼p_z}[ ||x̃^{(k+1)} − G(z; θ_g)||² ],   (10)

ρ^{(k+1)} = C ρ^{(k)},   (11)

for each sample z ∼ p_z(z), where C is some constant. Assuming that the y_i's are mutually independent and follow the Poisson distribution, L(y|x) in (9) can be expressed as

L(y|x) = Σ_{i=1}^{M} ( y_i log ỹ_i − ỹ_i ) + C_0,   (12)

where C_0 is a constant independent of x. A second-order Taylor expansion of L(y|x) provides [20]

L(y|x) ≈ −(1/2) ||Ax − p||²_W + C_1,   (13)

where p ∈ R^M is the projection data whose elements are p_i = −log(y_i/I_0), and W is a diagonal matrix whose elements are w_i = y_i.
Here, ||x||²_W is the weighted norm defined as ||x||²_W = x^T W x. Substituting (13) into (9), we have

x̃^{(k+1)} = arg min_{x̃} D(x̃)² + (1/2) ||Ax̃ − p||²_W + ρ^{(k)} ||x̃ − G(z; θ_g^{(k)})||².   (14)

Owing to the difficulty of handling the network term D(x̃)², we use the gradient-descent method to solve problem (14):

x̃ ← x̃ − η_x̃ ∇_x̃ [ D(x̃)² + (1/2) ||Ax̃ − p||²_W + ρ^{(k)} ||x̃ − G(z; θ_g^{(k)})||² ],   (15)
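The quadratic surrogate (14) can be evaluated directly from the definitions above. The sketch below uses a hypothetical 3 × 2 system with noiseless counts, so the objective vanishes at the true image; it illustrates the weighted data term and the penalty term (the network term is passed in as a plain number):

```python
import numpy as np

def pwls_surrogate(A, y, x, I0, rho, g_of_z, d_sq):
    """Evaluate the objective of (14): D(x)^2 + 0.5*||Ax - p||_W^2 + rho*||x - G(z)||^2."""
    p = -np.log(y / I0)                  # projection data p_i = -log(y_i / I0)
    W = y                                # diagonal weights w_i = y_i
    r = A @ x - p
    data_term = 0.5 * np.sum(W * r * r)  # weighted norm: r^T W r
    penalty = rho * np.sum((x - g_of_z) ** 2)
    return d_sq + data_term + penalty

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
I0 = 1e4
x_true = np.array([0.3, 0.2])
y = I0 * np.exp(-A @ x_true)             # noiseless counts: p equals A @ x_true
val = pwls_surrogate(A, y, x_true, I0, rho=1.0, g_of_z=x_true, d_sq=0.0)
```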
where η_x̃ is the step size. In our study, we chose it empirically so that the algorithm converges. Problem (10) is the training problem of G, and its update is performed using a first-order optimization method (i.e., Adam optimization [11]). Now, for fixed G, we solve problem (6). The update of θ_d is performed using Adam optimization and alternates with the update of θ_g to improve the stability of the learning algorithm [8]. The overall algorithm of the proposed method is described in Algorithm 1.
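To make the inner update (15) concrete, the sketch below runs plain gradient descent on a toy instance of subproblem (14), with a linear stand-in D(x) = w·x in place of the trained discriminator (all sizes and values are illustrative assumptions):

```python
import numpy as np

def objective(x, A, p, W, g, w, rho):
    """Toy instance of (14) with D(x) = w.x: D(x)^2 + 0.5*||Ax - p||_W^2 + rho*||x - g||^2."""
    r = A @ x - p
    return (w @ x) ** 2 + 0.5 * np.sum(W * r * r) + rho * np.sum((x - g) ** 2)

def x_update(x, A, p, W, g, w, rho, eta, iters):
    """Gradient-descent iteration (15) applied to the toy objective."""
    for _ in range(iters):
        grad = 2.0 * (w @ x) * w + A.T @ (W * (A @ x - p)) + 2.0 * rho * (x - g)
        x = x - eta * grad
    return x

A = np.eye(2)
p = np.array([1.0, 1.0]); W = np.array([1.0, 1.0])
g = np.zeros(2); w = np.array([0.1, 0.1])
x0 = np.zeros(2)
x1 = x_update(x0, A, p, W, g, w, rho=0.5, eta=0.1, iters=50)
```

Because the toy objective is convex, the iterates monotonically reduce it for a sufficiently small step size.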

C. IMPLEMENTATION DETAILS
Solving problem (14) is challenging due to the nonlinear network term. We use the gradient-descent method as in (15), where the derivative term ∇_x D(x)² is implemented as follows. Let F be a loss function given by

F(b) = D(x + b)²,  so that  ∇_b F(b)|_{b=0} = ∇_x D(x)²,   (16)

where the last equality follows from the chain rule. Applying one gradient-descent step to F with b^{(0)} = 0 and step size η_b = −1, we have

b^{(1)} = b^{(0)} − η_b ∇_b F(b^{(0)}) = ∇_x D(x)².   (17)

This means that the derivative term ∇_x D(x)² can be easily implemented by performing the standard backpropagation of F once with the choice of b^{(0)} = 0 and η_b = −1. Fig. 1 shows the result of the gradient-descent method for the problem of minimizing D(x)², where the derivative of D(x)² is implemented as in (17). As shown in Fig. 1, D(x)² converges to a local minimum as the number of iterations increases.
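The backpropagation trick in (16)-(17) can be checked numerically with an analytic stand-in discriminator D(x) = w·x (an assumption for illustration), for which ∇_x D(x)² = 2(w·x)w is known in closed form; in the paper, the inner gradient is obtained by one backpropagation pass through the network instead:

```python
import numpy as np

def grad_via_trick(d, grad_d, x):
    """Recover grad_x D(x)^2 as one gradient step on F(b) = D(x+b)^2,
    starting at b0 = 0 with step size eta_b = -1 (Eqs. (16)-(17))."""
    b = np.zeros_like(x)
    eta_b = -1.0
    grad_F = 2.0 * d(x + b) * grad_d(x + b)   # chain rule: grad_b F(b) at b = 0
    return b - eta_b * grad_F                 # b1 = b0 + grad_F = grad_x D(x)^2

# Analytic stand-in: D(x) = w . x, so grad_x D(x)^2 = 2*(w.x)*w.
w = np.array([1.0, -2.0])
d = lambda v: w @ v
grad_d = lambda v: w
x = np.array([1.0, 0.25])
g = grad_via_trick(d, grad_d, x)
```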
In our implementation, the initial weights θ_g^{(0)} and θ_d^{(0)} were obtained by training the IM-DL model [19] with the parameter α = 10, as in [19], for 200 epochs using Adam optimization. Unlike the proposed method, the IM-DL model contains the least squares term ||G(z; θ_g) − z||². We solved the PWLS problem using the gradient-descent method for 2000 iterations with a step size of 1e-6, and used the result as the initial image x̃^{(0)}. The number of outer iterations (OutIter), inner iterations (InIter), and epochs (Epoch) were set to 20, 100, and 10, respectively. The parameter C in (11) was set to C = 2^{8/OutIter}, so that C^{OutIter} = 2^8.

Algorithm 1 Algorithm of the Proposed Method
[Pseudocode omitted: initialize θ_g^{(0)}, θ_d^{(0)}, and x̃^{(0)} as above, then alternate the updates (9)-(11) and (6) for OutIter outer iterations.]
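The penalty schedule above can be checked with a few lines; the initial value ρ^{(0)} = 1 is a hypothetical choice for illustration:

```python
out_iter = 20
C = 2.0 ** (8.0 / out_iter)   # C = 2^(8/OutIter), so that C^OutIter = 2^8 = 256
rho = 1.0                     # hypothetical initial penalty rho^(0)
rhos = [rho]
for _ in range(out_iter):
    rho = C * rho             # update (11): rho^(k+1) = C * rho^(k)
    rhos.append(rho)
```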

D. NETWORKS ARCHITECTURES & TRAINING
For the generator (or noise reduction function), we adopt the deep convolutional framelets [24]. It is a multi-scale convolutional neural network (CNN) composed of encoder and decoder paths with skip and concatenation connections between them. The encoder and decoder paths each consist of three stages. Each stage consists of two repeated convolutions (conv) with a 3 × 3 window, and each convolution is followed by batch normalization (bnorm) and a leaky rectified linear unit (LReLU). 2-D Haar wavelet decomposition (wave-dec) and recomposition (wave-rec) are applied for downsampling and upsampling of the features, respectively. The high-pass subbands after wavelet decomposition are passed directly to the decoder path at the same stage, while the low-pass subbands (LF) are concatenated with the features in the decoder path. At the end of the decoder path, a 1 × 1 convolution layer is added to generate a single-channel output. For the discriminator, we adopted a standard CNN that consists of three convolution layers with a 4 × 4 window and strides of two in each direction of the domain. Each layer is followed by batch normalization and a leaky ReLU with a slope of 0.2. At the end of the architecture, a 1 × 1 convolution layer is added to generate one-dimensional output data. The overall architecture of the proposed method is shown in Fig. 2, where the parameters n_g and n_d denote the initial numbers of filters of the generator and discriminator, respectively. Here, we set n_g = n_d = 64.
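The wave-dec/wave-rec pair used for down- and upsampling can be sketched as a one-level orthonormal 2-D Haar transform. This minimal NumPy version is an illustrative implementation (not the paper's TensorFlow code) and satisfies perfect reconstruction:

```python
import numpy as np

def haar_dec(img):
    """One-level 2-D Haar decomposition into LL, LH, HL, HH subbands (even-sized input)."""
    a = img[0::2, 0::2]; b = img[0::2, 1::2]
    c = img[1::2, 0::2]; d = img[1::2, 1::2]
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

def haar_rec(ll, lh, hl, hh):
    """Inverse of haar_dec (perfect reconstruction)."""
    a = (ll + lh + hl + hh) / 2.0
    b = (ll - lh + hl - hh) / 2.0
    c = (ll + lh - hl - hh) / 2.0
    d = (ll - lh - hl + hh) / 2.0
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    out[0::2, 0::2] = a; out[0::2, 1::2] = b
    out[1::2, 0::2] = c; out[1::2, 1::2] = d
    return out

img = np.arange(16.0).reshape(4, 4)
rec = haar_rec(*haar_dec(img))
```

The LL subband plays the role of the downsampled features, while LH/HL/HH are the high-pass subbands that skip to the decoder.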
In our simulations, the generator/discriminator training was performed using the Adam solver [11] with a learning rate of 0.0002 and a mini-batch size of 1. Training was implemented using TensorFlow [1] on a system with a CPU (Intel(R) Core(TM) i9-9980XE, 3.0 GHz) and a GPU (NVIDIA Titan RTX, 24 GB). The network weights were initialized following a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. No data normalization or augmentation was performed during training.

III. RESULTS
We used the datasets provided by ''the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge''. Clinical CT scans of ten patients were included; these scans were acquired using commercial CT scanners (Somatom Definition AS+, or Somatom Definition Flash operated in single-source mode, Siemens Healthcare, Germany) with a reference tube voltage of 120 kVp and a current of 200 mAs. Quarter-dose CT (i.e., LDCT) scans were generated by adding Poisson quantum noise to the clinical SDCT scans. All the projection data (sinograms) obtained from the helical CT scanners with curved detectors were rebinned to fan-beam projection data with flat-panel detectors. Standard filtered backprojection [2] with the Ram-Lak filter was applied to reconstruct the LDCT and SDCT images of size 512 × 512. Of the ten scans, we used eight for training and the remaining two for testing. From the training dataset, we randomly extracted 30 pairs of LDCT and SDCT images and used them as a validation dataset for the hyperparameter tuning of the proposed model.
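A common way to realize such quarter-dose simulation is to Poisson-sample the expected counts at a reduced source flux. The sketch below uses an assumed I_0 and a constant toy sinogram; the challenge's exact noise-insertion procedure may differ:

```python
import numpy as np

def simulate_low_dose(p_sdct, I0_full, dose_fraction=0.25, seed=0):
    """Simulate reduced-dose projection data from a clean sinogram p by
    Poisson-sampling the expected counts at a reduced source flux."""
    rng = np.random.default_rng(seed)
    I0_low = dose_fraction * I0_full           # quarter dose: I0 / 4
    counts = rng.poisson(I0_low * np.exp(-p_sdct))
    counts = np.maximum(counts, 1)             # avoid log(0) in the log transform
    return -np.log(counts / I0_low)            # noisy projection data

p = np.full((4, 8), 1.0)                       # hypothetical sinogram values
p_low = simulate_low_dose(p, I0_full=1e5)
```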
The proposed method was compared with the PWLS method with total variation regularization (PWLS-TV) [23], the PWLS method with the nonlocal means (PWLS-NLM) [10], IM-DL [19], and IM-DL with the cycle-consistency loss (IM-Cycle) [9]. The model of PWLS-TV and PWLS-NLM is given by

arg min_x (1/2) ||Ax − p||²_W + α R(x),   (21)

where R(x) is the regularization term for TV [23] or NLM [10], respectively. The hyperparameter α was set to 50 and 25 for PWLS-TV and PWLS-NLM, respectively. Here, for a fair comparison, we used the Pearson χ² divergence instead of the KL divergence in IM-DL.
The IM-Cycle model adopts the Cycle-GAN [27] for image denoising. In our framework, the optimal generator G can be viewed as the solution of the following problem:

arg min_{θ_g, θ_r} L_adv(G) + L_adv(T) + α E_{z∼p_z}[ ||G(z; θ_g) − z||² ] + γ ( E_{z∼p_z}[ ||T(G(z; θ_g); θ_r) − z||² ] + E_{x∼p_x}[ ||G(T(x; θ_r); θ_g) − x||² ] ),   (22)

where T : x → z is the inverse function that maps x back into z, θ_r denotes its parameters, and L_adv denotes the least squares adversarial loss; the last two terms constitute the cycle-consistency loss.
For a fair comparison, we constructed the same architectures (i.e., both G and D) as in the proposed method. Note that the difference between IM-Cycle and IM-DL is that IM-Cycle uses the cycle-consistency loss (the last two terms in (22)). As suggested in [9], [19], the parameters in IM-DL and IM-Cycle were set to α = 10 and (α, γ) = (10, 5), respectively. Fig. 3 shows the graph of the objective function of (7) over the number of outer iterations for the AAPM datasets. As can be seen, the objective values converge stably to a local minimum during training. To demonstrate the denoising effect, the magnified regions-of-interest (ROIs) are marked as rectangular regions in the LDCT images. As shown in Figs. 4 and 5, the LDCT images suffer from noise streaking artifacts due to the low radiation dose. The metastatic lesion (marked by the arrow in Fig. 5) is barely visible in the LDCT image. All the denoising methods significantly reduced the noise streaking artifacts in the LDCT images while improving lesion visibility. However, the PWLS-TV and PWLS-NLM methods resulted in quite blurry images owing to the inherent nature of TV and NLM, while the proposed method clearly recovered the morphological structures of tissues in the body, as shown in Fig. 4. The denoised images reconstructed by IM-DL, IM-Cycle, and the proposed method are visually more similar to the SDCT images than those reconstructed by the PWLS-TV and PWLS-NLM methods. This is because the presented deep learning methods exploit the characteristics of SDCT images as a prior. Fig. 6 shows a comparison among the deep learning methods for the LDCT image of the pelvis. An insufficient number of photons reaching the detectors results in noise streaking artifacts along the direction of greatest attenuation. The proposed method reduces noise streaking artifacts more clearly than IM-DL and IM-Cycle. For an objective evaluation of the noise level, we calculated the standard deviation (SD) values on a marked ROI (see Fig. 6), and the proposed method achieved the best SD value. IM-DL and IM-Cycle only used the pairs of LDCT and SDCT images during the training procedure, whereas the proposed method used the images corrected from the sinogram as well as the pairs of LDCT and SDCT images. Hence, the proposed method showed the potential to reduce severe noise streaking artifacts to a level even below that of the SDCT images.
For the performance comparison, we computed the average peak signal-to-noise ratio (PSNR), structural similarity (SSIM), and mean squared error (MSE) between the corrected images and their SDCT counterparts for the test AAPM datasets. The averaged quantities are summarized in Table 1. The results show that the proposed method achieves the best PSNR, SSIM, and MSE values, whereas the PWLS-NLM method shows the worst performance in terms of PSNR and MSE. The IM-Cycle method shows better performance than the IM-DL method; the added cycle-consistency loss leads to a performance improvement in the denoising task, but it more than doubles the training time. As suggested in [9], [19], the training of both the IM-DL and IM-Cycle methods can be performed in a patch-by-patch manner, which can also lead to an improvement in their performance.
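For reference, MSE and PSNR between a corrected image and its SDCT counterpart can be computed as below (SSIM requires a windowed statistic and is omitted; the unit data range and the toy images are assumptions for illustration):

```python
import numpy as np

def mse(x, ref):
    """Mean squared error between an image and its reference."""
    return float(np.mean((x - ref) ** 2))

def psnr(x, ref, data_range=1.0):
    """Peak signal-to-noise ratio: 10 * log10(peak^2 / MSE)."""
    return float(10.0 * np.log10(data_range ** 2 / mse(x, ref)))

ref = np.zeros((8, 8))       # toy reference (SDCT stand-in)
noisy = ref + 0.1            # toy corrupted image with constant error 0.1
err = mse(noisy, ref)
quality = psnr(noisy, ref)
```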

IV. DISCUSSION AND CONCLUSION
In this paper, we proposed a PWLS method combined with a deep learning prior to improve the image quality in LDCT. We estimated the prior distribution of the target images in the sense of the Pearson χ² divergence using many SDCT samples. The proposed model can be converted into an LS-GAN with a PWLS objective, where the optimal generator is the denoising function that maps the LDCT images to their corresponding SDCT images. We reformulated the model as a constrained problem and used the AO method to decouple the denoising function and the penalized weighted objective. By doing so, the standard backpropagation algorithm provided by TensorFlow can be used to update the weights of the denoising function.
Numerical experiments show that the proposed method clearly reduces the quantum noise in LDCT images while preserving the morphological structures of the human body. The proposed method outperforms the PWLS-TV, PWLS-NLM, IM-Cycle, and IM-DL methods in terms of quantitative metrics such as MSE, SSIM, and PSNR (Table 1). TV and NLM are parametric regularizers, and the reconstructed image quality depends on the experimental setup. For a fair comparison, the TV and NLM hyperparameters with the minimum MSEs were manually selected. The PWLS-TV and PWLS-NLM methods produce blurry images but seem to provide better lesion detectability than the proposed method (or even than the SDCT images). The proposed method recovers the images with a deep image prior learned from the SDCT images used for training; depending on the quality of the SDCT images, the ability to detect lesions can be improved. Unlike IM-DL and IM-Cycle, the proposed approach incorporates Poisson statistics; hence, it can more effectively reduce the noise streaking artifacts caused by quantum noise, as shown in Fig. 6. For training, the IM-DL and IM-Cycle methods only require CT images, whereas the proposed method requires both sinograms and CT images. The training procedure of the proposed method is also more complicated because it is performed with multiple nested loops.
For the AAPM training datasets of size 512 × 512, it took approximately 7 days to train the proposed network. Nevertheless, the correction results can be obtained directly from the trained denoising function (or generator), and hence the execution time of the proposed method is much shorter than those of model-based approaches such as PWLS-TV and PWLS-NLM. On a GPU system, the proposed method and the PWLS-TV method require approximately 50 milliseconds and 16 seconds, respectively, to obtain a single corrected image of size 512 × 512.
The performance of the proposed method needs further improvement. There is no convergence theory for the AO method used in the proposed method, owing to the non-convexity of the two neural networks (generator and discriminator). We empirically showed that the objective function in (7) converges as the number of iterations increases (Fig. 3). In addition, we empirically chose the hyperparameters and iteration numbers of the proposed method based on experiments. For example, Table 2 illustrates the performance of the proposed method on the validation dataset for different numbers of filters and different sets of iteration parameters. In this study, we manually chose the parameters with the minimum MSE; however, the chosen values may not be optimal. Further investigation will be needed to find more efficient parameters for practical use. Some studies [4], [19] have investigated the robustness of learning-based denoising methods to noise levels different from the one used in training, showing that the denoising performance of learning-based methods can be degraded on datasets with different noise levels. The performance can be improved if training is performed together with LDCT images acquired at different noise levels [4]. The validity of the proposed method should also be demonstrated for actual clinical use, which is a part of our future work. Furthermore, we will focus on broadening the scope of the proposed method to other tomographic reconstruction problems, such as positron emission tomography and magnetic resonance imaging.

APPENDIX
In this Appendix, we prove equation (5). Let J_d(D) and J_g(G) be the functions given by

J_d(D) = E_{x∼p_x}[(D(x) − 1)²] + E_{z∼p_z}[(D(G(z)) + 1)²]   (23)

and

J_g(G) = E_{x∼p_x}[D(x)²] + E_{z∼p_z}[D(G(z))²] − E_{z∼p_z}[L(y|G(z))].   (24)

Given a fixed G, we first minimize J_d(D). Writing both G(z) and x as the variable ω, the optimal D_opt can be computed pointwise as

D_opt(ω) = ( p_x(ω) − p_G(z)(ω) ) / ( p_x(ω) + p_G(z)(ω) ).   (25)

By plugging D_opt into J_g(G), we have

J_g(G) = ∫ ( p_x(ω) − p_G(z)(ω) )² / ( p_x(ω) + p_G(z)(ω) ) dω − E_{z∼p_z}[L(y|G(z))]
       = χ²_Pearson(p_G(z) + p_x || 2p_x) − E_{z∼p_z}[L(y|G(z))],   (26)

where χ²_Pearson(p, q) := ∫ (q(ω) − p(ω))² / p(ω) dω, so that minimizing J_g(G) with the optimal discriminator is exactly problem (4). Since the first term in (24) is independent of G, minimizing J_g(G) is equivalent to minimizing equation (5), and this completes the proof.
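The key identity in the proof, namely that the first two terms of (24) evaluated at the pointwise least-squares optimum D_opt = (p_x − p_G(z)) / (p_x + p_G(z)) equal χ²_Pearson(p_G(z) + p_x || 2p_x), can be verified numerically on a toy pair of discrete densities (the densities are illustrative assumptions):

```python
import numpy as np

p_x = np.array([0.5, 0.3, 0.2])   # toy "SDCT" distribution
p_g = np.array([0.2, 0.3, 0.5])   # toy generated distribution

d_opt = (p_x - p_g) / (p_x + p_g)            # optimal discriminator

# First two terms of J_g at D_opt: E_x[D^2] + E_g[D^2] = sum (p_x + p_g) * D_opt^2
j_g_terms = np.sum((p_x + p_g) * d_opt ** 2)

# chi^2_Pearson(p, q) = sum (q - p)^2 / p, with p = p_g + p_x and q = 2 * p_x
chi2 = np.sum((2.0 * p_x - (p_g + p_x)) ** 2 / (p_g + p_x))
```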