Dynamic PET Image Denoising Using Deep Image Prior Combined With Regularization by Denoising

The quantitative accuracy of positron emission tomography (PET) is affected by several factors, including the intrinsic resolution of the imaging system and inherently noisy data, which result in a low signal-to-noise ratio (SNR) of PET images. To address this problem, in this paper we propose a novel deep learning denoising framework aiming to enhance the quantitative accuracy of dynamic PET images via the introduction of deep image prior (DIP) combined with Regularization by Denoising (RED); hence the method is termed DeepRED denoising. The network structure is based on an encoder-decoder architecture and uses skip connections to combine hierarchical features to generate the estimated image. The network input can be random noise or another prior image (such as the patient's own static PET image), avoiding the need for high quality noiseless images, which are limited in PET clinical practice due to the high radiation dose. Based on simulated data and real patient data, the quantitative performance of the proposed method was compared with conventional Gaussian filtering (GF), non-local mean (NLM), block-matching and 3D filtering (BM3D), DIP and the stochastic gradient Langevin dynamics (SGLD) method. Overall, the proposed method outperforms these methods, yielding substantial visual as well as quantitative accuracy improvements (in terms of noise versus bias performance) both with and without prior images.


I. INTRODUCTION
Positron emission tomography (PET) is a nuclear medical imaging modality enabling quantitative measurements of biomolecular mechanisms by radioactive tracers in vivo. However, the quantitative accuracy of PET is affected by several factors, including the intrinsic resolution of the imaging system and inherently noisy data, which result in a low signal-to-noise ratio (SNR) of PET images [1]. Therefore, it is essential to improve the quantitative accuracy of PET images.
Various pre- and post-reconstruction algorithms have been developed by exploiting local statistics, spatiotemporal correlation, or prior anatomical information for PET image enhancement [2]-[4]. Pre-processing methods operate on the PET sinogram to improve the estimation of physical parameters by using the spatiotemporal correlation in dynamic PET scans [5], [6]. Among post-processing algorithms, conventional Gaussian filtering (GF) has been applied for PET image enhancement in the clinic. Other post-processing algorithms, such as non-local mean (NLM) [7], wavelet [8], HYPR processing [9], guided image filtering [10], bilateral filtering [11] and kinetics-induced block matching and 5D filtering (KIBM5D) [12], have been proposed and outperform conventional Gaussian filtering in terms of reducing image noise as well as preserving more image structure details.
The image restoration process is ill-conditioned due to the limited information contained in the PET image itself. Therefore, some techniques seek to incorporate anatomical priors into image restoration [13], [14]. One strategy relies on explicit boundary or regional information derived from anatomical images by segmentation, which is sensitive to the accuracy of the segmentation [15]-[17]. Another, segmentation-free strategy has also been developed; such methods have the advantage of being insensitive to potential errors in the segmentation of the anatomical images [18]-[24].
In recent years, deep neural networks have shown great potential in image science, such as image restoration [25], segmentation [26], object detection [27], and image analysis [28], showing better performance than conventional state-of-the-art methods when large amounts of data are available. Recently, they have gradually been applied to medical imaging using convolutional neural networks (CNNs) [29]-[33] or generative adversarial networks (GANs) [34]-[36] to improve SNR and reduce artifacts. For medical imaging tasks, it is usually essential to acquire high quality noiseless training image data sets, derived from high dose or long-duration scans, as the training labels. However, in the context of positron emission tomography and computed tomography (CT) imaging, high dose CT and PET are limited in routine clinical practice. In the context of MR imaging, long scan durations improve the SNR of MR images at the cost of increased movement artifacts, especially for cardiac and abdominal imaging. Due to these limitations of data acquisition, available data sets may not be large enough for network training, which makes the network easy to overfit. Moreover, collecting data sets and pre-training networks require substantial equipment support and time. It is therefore valuable to find effective deep learning methods that require no training pairs or pre-training.
Recently, the deep image prior (DIP) [37] showed that the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning, without pre-training. A randomly initialized neural network with random noise input can be used as a handcrafted prior to generate well-denoised images, which is a solution in the absence of large data sets. In fact, Gong et al. [38] first applied DIP to static PET image reconstruction. Then, Yokota et al. [39] proposed a method that reconstructs dynamic PET images from given sinograms using non-negative matrix factorization incorporated with DIP to appropriately constrain the spatial patterns of the resulting images. Furthermore, Cui et al. [40] utilized the deep image prior for unsupervised PET image denoising, and Hashimoto et al. [41] applied DIP in a CNN for dynamic brain PET image denoising and proposed a 4D DIP framework for dynamic PET image denoising [42]. However, these methods share the drawback that DIP overfits easily (e.g., the network update needs early stopping to avoid rapid deterioration of image quality). Although the effectiveness of DIP has been demonstrated, its results fall slightly short of state-of-the-art alternatives. Fortunately, Cheng et al. [43] proposed a novel Bayesian view of the DIP method and conducted posterior inference using stochastic gradient Langevin dynamics (SGLD) [44] to avoid the need for early stopping and improve results for the denoising task. Mataev et al. [45] proposed bringing in the concept of Regularization by Denoising (RED) as an explicit prior for DIP, which enriches the overall regularization for better performance.
In this paper, we develop a novel deep learning denoising framework aiming to enhance the quantitative accuracy of dynamic PET images via the introduction of deep image prior combined with RED. Inspired by [38], [40]-[42], the network input is not restricted to random noise but can be another prior image. In this work, the static PET image is used as the network input. The use of a static image as prior information has appeared in earlier works. Chen et al. [46] proposed a prior image constrained compressed sensing (PICCS) method, which utilized a prior image reconstructed from the union of interleaved dynamic data sets to constrain the CS image reconstruction for the individual time frames. Hashimoto et al. [47] utilized a normalized static PET sinogram as the guidance image for PET image noise reduction, referred to as the sinogram-based dynamic image guided filtering (SDIGF) algorithm.
In our denoising framework, the original (noisy) image is set as the target image, and the mean squared error (MSE) is used to measure the difference between the label and the network output. The network structure is based on an encoder-decoder architecture and uses skip connections to combine hierarchical features for generating the estimated image. The solution is formulated and sought by the alternating direction method of multipliers (ADMM) algorithm [48] and split into three sub-optimization steps. In order to stabilize network training, the intensity of the network input is normalized to [0, 1], and the learning rate decays with iterations, which was not implemented in [38], [40], [41]. Both computer simulations and real patient data are used. The main contributions of this study are as follows: (i) applying the DeepRED method to PET image denoising for better quantitative accuracy of dynamic PET images; (ii) providing a solution to overcome the defect that DIP overfits easily in PET imaging; and (iii) comparing the performance of DIP-based methods with and without prior images. Overall, the DeepRED method outperforms conventional baseline methods in visual as well as quantitative improvements.
This paper is organized as follows. Section II describes the related works. Section III introduces the theory, optimization process and implementation details. Section IV describes the computer simulation and real patient data used in this work, following specific experiment steps. Section V shows experimental results, while section VI discusses several issues. Finally, conclusions are drawn in Section VII.

II. RELATED WORKS
A. DEEP IMAGE PRIOR
In [37], Ulyanov et al. proposed that a randomly initialized neural network can be used as a prior with excellent results in standard inverse problems. Given a corrupted target image, a randomly initialized network with random noise input can approximately restore the uncorrupted image corresponding to the corrupted target before fitting the target completely, using the network structure itself as the only prior. Assuming x_0 is the corrupted image, DIP suggests solving
θ* = arg min_θ ||T(z, θ) − x_0||², x̂ = T(z, θ*), (1)
where T denotes the neural network, θ* denotes the network's parameters to be learned, x̂ is the estimated uncorrupted image and z is random noise used as the network input. The training proceeds without conventional training data pairs.

B. A BAYESIAN PERSPECTIVE ON THE DEEP IMAGE PRIOR
Cheng et al. [43] showed that the DIP is asymptotically equivalent to a stationary Gaussian process prior in the limit as the number of channels in each layer of the network goes to infinity, and derived the corresponding kernel. Conducting posterior inference using stochastic gradient Langevin dynamics avoids the need for early stopping and improves results for denoising and inpainting tasks. According to (1), the inference procedure can be interpreted as a maximum likelihood estimate (MLE) under a Gaussian noise model. By adding a suitable prior ρ(z, θ) over the parameters, the desired image can be derived as the posterior mean
x* = ∫ ρ(z, θ | x) T(z, θ) dz dθ. (2)
The integral is replaced by a sample average of a Markov chain that converges to the true posterior. SGLD provides a general framework to derive an MCMC [49] sampler from stochastic gradient descent (SGD) by injecting Gaussian noise into the gradient updates. Let ω = (z, θ); the SGLD update follows
ω_{t+1} = ω_t − (ε_t / 2) ∇_ω L(ω_t) + η_t, η_t ~ N(0, ε_t),
where ε_t is the step size and should satisfy
Σ_{t=1}^{∞} ε_t = ∞, Σ_{t=1}^{∞} ε_t² < ∞. (3)
The gradient step sizes and the variances of the injected noise are balanced so that the variance of the samples matches that of the posterior. In experiments, the authors added Gaussian noise to the gradients at each step and estimated the posterior sample averages after a 'burn-in' phase.
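The SGLD update above can be illustrated on a toy problem where the posterior is known in closed form. The sketch below is a hypothetical stand-in (a Gaussian likelihood with a Gaussian prior over ω, rather than the paper's network parameters): Gaussian noise of variance ε is injected into each gradient step, and the post-burn-in sample average should land near the analytic posterior mean.

```python
import numpy as np

def sgld_denoise(y, sigma2=0.1, tau2=1.0, eps=1e-3,
                 n_iter=20000, burn_in=5000, seed=0):
    """Toy SGLD sampler. omega plays the role of the parameters (z, theta);
    here the model is Gaussian, so the true posterior mean is known.
    Negative log-posterior gradient: (omega - y)/sigma2 + omega/tau2.
    SGLD step: omega <- omega - (eps/2)*grad + N(0, eps)."""
    rng = np.random.default_rng(seed)
    omega = np.zeros_like(y)
    total = np.zeros_like(y)
    kept = 0
    for t in range(n_iter):
        grad = (omega - y) / sigma2 + omega / tau2
        omega = omega - 0.5 * eps * grad \
            + rng.normal(0.0, np.sqrt(eps), size=y.shape)
        if t >= burn_in:                  # average samples after burn-in
            total += omega
            kept += 1
    return total / kept

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=200)        # 'noisy observation'
post_mean = y * 1.0 / (1.0 + 0.1)         # analytic posterior mean
est = sgld_denoise(y)
print(np.abs(est - post_mean).mean())     # should be small
```

With a constant small step size the chain behaves like a discretized Langevin diffusion whose stationary distribution approximates the posterior, which is why the averaged samples recover the posterior mean rather than the MLE.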

C. DEEPRED: DEEP IMAGE PRIOR POWERED BY RED
In [45], the authors brought the concept of RED into DIP for better performance in inverse problems. RED [50] uses a denoising engine to define the regularization term; it can incorporate any image denoising algorithm and, under suitable conditions, is guaranteed to converge to a globally optimal result. The RED regularization term is defined as
ρ(x) = (1/2) xᵀ (x − f(x)), (4)
where x is the candidate image and f(·) is a denoising engine of choice plugged into the regularization term, such as NLM. For such denoisers, Romano et al. [50] claim that (4) obeys the gradient rule ∇ρ(x) = x − f(x).
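The gradient rule ∇ρ(x) = x − f(x) can be checked numerically for a denoiser that satisfies RED's conditions. The sketch below uses a symmetric linear smoother (a circular 3-tap moving average, chosen for illustration and not one of the paper's denoisers) and compares the claimed gradient against central finite differences.

```python
import numpy as np

def smooth(x):
    """Symmetric linear denoiser f(x): circular 3-tap moving average
    (its Jacobian is symmetric, one of RED's conditions on f)."""
    return (np.roll(x, -1) + x + np.roll(x, 1)) / 3.0

def red_rho(x):
    """RED regularizer rho(x) = 0.5 * x^T (x - f(x))."""
    return 0.5 * x @ (x - smooth(x))

rng = np.random.default_rng(0)
x = rng.normal(size=16)

analytic = x - smooth(x)                  # the claimed gradient rule
h = 1e-6
numeric = np.array([(red_rho(x + h * e) - red_rho(x - h * e)) / (2 * h)
                    for e in np.eye(16)])
print(np.max(np.abs(analytic - numeric)))  # close to zero
```

For a symmetric linear f the rule is exact; for the nonlinear denoisers RED targets (NLM, BM3D) it additionally relies on local homogeneity of the denoiser.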

III. PROPOSED METHOD
A. BACKGROUND
In the field of inverse problems, the measurement y ∈ R M is given by y = Ax + n, where A ∈ R M ×N is a known linear degradation matrix, x ∈ R N is the original image, and n ∈ R M is a noise factor. Denoising is obtained for A = I, where I is an identity matrix.

B. PROPOSED PET IMAGE DENOISING FRAMEWORK
In this proposed denoising framework, the unknown image is represented as
x = T(θ|z), (5)
where T denotes the neural network, θ denotes the unknown neural network parameters, and z is the neural network input. Thus, the original data model can be rewritten as
y = AT(θ|z) + n. (6)
The estimated dynamic PET images x̂ are obtained as follows:
x̂ = T(θ*|z), θ* = arg min_θ ||y − AT(θ|z)||². (7)
Based on [45], merging DIP and RED, the objective function becomes
min_{x,θ} (1/2)||y − T(θ|z)||² + λρ(x), s.t. x = T(θ|z). (8)
To define the optimization framework, we use the augmented Lagrangian [51] format for the constrained optimization problem in (8), which finally turns into a penalty problem as follows:
min_{x,θ} (1/2)||y − T(θ|z)||² + (λ/2) xᵀ(x − f(x)) + (µ/2)||x − T(θ|z) + u||², (9)
where u denotes the Lagrange multipliers vector for the set of equality constraints, and λ and µ are two hyperparameters to be chosen. The objective function in (9) can be solved iteratively by the ADMM algorithm in three steps:
θ^{n+1} = arg min_θ (1/2)||y − T(θ|z)||² + (µ/2)||x^n − T(θ|z) + u^n||², (10)
x^{n+1} = arg min_x (λ/2) xᵀ(x − f(x)) + (µ/2)||x − T(θ^{n+1}|z) + u^n||², (11)
u^{n+1} = u^n + x^{n+1} − T(θ^{n+1}|z). (12)
In this work, sub-problem (10) is very close to the optimization in DIP, modified by a proximity regularization that forces T(θ|z) to approach both y and x + u. Thus, it is solved by steepest gradient descent and back-propagation. Sub-problem (11) can be solved in two ways. (i) Fixed-point strategy: zeroing the derivative of (11) with respect to x and using the fact that (4) obeys the gradient rule ∇ρ(x) = x − f(x), we get
λ(x − f(x)) + µ(x − T(θ^{n+1}|z) + u^n) = 0. (13)
Assigning indices to the above equation, the update of x follows
x_{j+1}^{n+1} = (λ f(x_j^{n+1}) + µ(T(θ^{n+1}|z) − u^n)) / (λ + µ). (14)
(ii) Steepest descent: the gradient of the regularization term can be surrogated by the denoising residual x − f(x), and the update equation is
x^{n+1} = x^n − c [λ(x^n − f(x^n)) + µ(x^n − T(θ^{n+1}|z) + u^n)], (15)
where c is a hyperparameter chosen to guarantee a descent. The overall algorithm flowchart is presented in Algorithm 1.
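The three ADMM steps can be sketched end to end on a 1-D toy signal. In this sketch the DIP network T(θ|z) is replaced by a direct pixel parameterization θ (a hypothetical stand-in, so the θ-step has a closed form rather than back-propagation), f(·) is a simple moving average, and λ and µ are toy values rather than the paper's settings.

```python
import numpy as np

def box_denoise(x):
    # stand-in denoiser f(.): circular 3-tap moving average
    return (np.roll(x, -1) + x + np.roll(x, 1)) / 3.0

def deepred_admm_toy(y, lam=0.5, mu=0.5, n_outer=50, n_fp=3):
    """Toy DeepRED ADMM loop on a 1-D signal. The DIP network T(theta|z)
    is replaced by a direct pixel parameterization theta, so the theta-step
    has a closed form instead of back-propagation."""
    x, u, theta = y.copy(), np.zeros_like(y), y.copy()
    for _ in range(n_outer):
        # step (10): min_theta 0.5||y - theta||^2 + mu/2 ||x - theta + u||^2
        theta = (y + mu * (x + u)) / (1.0 + mu)
        # step (11), fixed-point form: lam(x - f(x)) + mu(x - theta + u) = 0
        for _ in range(n_fp):
            x = (lam * box_denoise(x) + mu * (theta - u)) / (lam + mu)
        # step (12): dual (multiplier) update
        u = u + x - theta
    return x

rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 4 * np.pi, 128))
noisy = clean + rng.normal(0.0, 0.3, size=128)
out = deepred_admm_toy(noisy)
print(np.mean((noisy - clean) ** 2), np.mean((out - clean) ** 2))
```

Because every step here is linear, the loop converges to the minimizer of the quadratic surrogate objective; the output should have lower MSE against the clean signal than the noisy input.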

Algorithm 1 Algorithm for the Proposed DeepRED Method
Input: maximum iteration number Max_iter, the RED regularization strength λ, the ADMM free parameter µ, steepest-descent parameters for updating θ, step-size parameter c, image initialization y, and prior image z
1: Initialize n = 0, u^0 = 0, x^0 = y, set θ^0 randomly
2: for n = 0 to Max_iter do
3:   Update θ^{n+1} by solving sub-problem (10) using steepest descent and back-propagation
4:   Update x^{n+1}: choose the fixed-point or the steepest-descent method
5:   Update u^{n+1} = u^n + x^{n+1} − T(θ^{n+1}|z)
6: end for
Output: the denoised image x

C. NETWORK STRUCTURE
The neural network structure employed in this work is based on an encoder-decoder architecture. The overall network architecture is summarized in Figure 1. The encoding path consists of a 3 × 3 convolutional layer with a stride of two for downsampling (instead of max pooling), followed by batch normalization (BN) and a leaky rectified linear unit (LReLU), and a second 3 × 3 convolutional layer, again followed by BN and LReLU. The number of feature channels is fixed to 128 after each downsampling and upsampling step. The decoding path consists of bilinear upsampling, a 3 × 3 convolutional layer followed by BN and LReLU, and a second 3 × 3 convolutional layer followed by BN and LReLU. To avoid the checkerboard artifact [52], bilinear interpolation was used instead of deconvolution upsampling. Besides, we use skip connections to link the encoder path and decoder path by concatenation to reduce the number of training parameters and capture structures at different characteristic scales.
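A miniature of the described architecture can be sketched in PyTorch (assumed here; the exact layer counts and Figure 1 details are not reproduced). This two-scale version shows the stride-2 convolution downsampling, the BN + LReLU pairs, bilinear upsampling, and concatenation skip connections.

```python
import torch
import torch.nn as nn

class Down(nn.Module):
    """Encoder block: stride-2 3x3 conv (downsampling instead of max
    pooling), then a second 3x3 conv, each followed by BN + LeakyReLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))
    def forward(self, x):
        return self.block(x)

class Up(nn.Module):
    """Decoder block: bilinear upsampling (avoids checkerboard artifacts),
    a concatenation skip connection, then two 3x3 conv + BN + LeakyReLU."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode='bilinear',
                              align_corners=False)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))
    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # skip connection by concatenation
        return self.block(x)

class MiniDIPNet(nn.Module):
    """Two-scale miniature (the paper uses five scales with 128 channels)."""
    def __init__(self, ch=32):
        super().__init__()
        self.e0 = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                nn.BatchNorm2d(ch), nn.LeakyReLU(0.2))
        self.d1, self.d2 = Down(ch, ch), Down(ch, ch)
        self.u2, self.u1 = Up(2 * ch, ch), Up(2 * ch, ch)
        self.head = nn.Conv2d(ch, 1, 1)
    def forward(self, x):
        e0 = self.e0(x)
        e1 = self.d1(e0)
        e2 = self.d2(e1)
        y = self.u2(e2, e1)
        y = self.u1(y, e0)
        return self.head(y)

net = MiniDIPNet()
with torch.no_grad():
    out = net(torch.randn(1, 1, 64, 64))
print(tuple(out.shape))   # (1, 1, 64, 64)
```

Each concatenation doubles the channel count feeding the decoder block, which is why the Up blocks take 2·ch input channels.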

D. IMPLEMENTATION DETAILS
The network was run on a GPU (NVIDIA GTX 1080 Ti with 11 GB of memory). We used the Adam optimizer implemented in PyTorch 1.1 on the PyCharm platform, and the MSE between the label and the network output was used as the loss function. Because the DIP method overfits easily, the quality of the network output degrades rapidly after a certain number of iterations. Inspired by supervised training methods, the learning rate used in this work decayed with the number of iterations in order to alleviate overfitting. Typically, we set the initial learning rate lr = 0.005 and decayed it as lr ← 0.9·lr every 1000 iterations. Decaying the learning rate prevents the network output from degrading suddenly due to an excessive learning rate. Furthermore, the network output was averaged with an exponential window to acquire stable results. The decay coefficient of the exponential moving average was set to 99% for all DIP-based methods. In other words, the results of the DIP-based methods used 1% of the network's output value plus 99% of the average image. The average image of the DeepRED method was initialized by the target noisy PET image, and the average images of the DIP and SGLD methods were initialized by the network output in the first iteration.
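The learning-rate schedule and output averaging just described are simple to state in code. The sketch below mirrors them (lr ← 0.9·lr every 1000 iterations; avg ← 0.99·avg + 0.01·out), independently of any particular network.

```python
def decayed_lr(initial_lr, iteration, decay=0.9, every=1000):
    """Step decay: lr <- 0.9 * lr every 1000 iterations."""
    return initial_lr * decay ** (iteration // every)

class EmaImage:
    """Exponential moving average of the network output:
    avg <- 0.99 * avg + 0.01 * out (decay coefficient 99%)."""
    def __init__(self, init_img, decay=0.99):
        self.avg = list(init_img)
        self.decay = decay
    def update(self, out):
        self.avg = [self.decay * a + (1.0 - self.decay) * o
                    for a, o in zip(self.avg, out)]
        return self.avg

lr0, lr2500 = decayed_lr(0.005, 0), decayed_lr(0.005, 2500)
ema = EmaImage([1.0, 2.0])        # e.g., initialized with the noisy target
avg = ema.update([2.0, 4.0])      # one network output
print(lr0, lr2500, avg)
```

The EMA acts as a temporal low-pass filter over iterations, which is what stabilizes the DIP output against iteration-to-iteration fluctuations.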
Compromising between the convergence speed and the performance of the proposed method, sub-problem (11) was solved by the fixed-point strategy, and the hyperparameters λ = 0.2 and µ = 0.2 were used in this paper. The details of the hyperparameter settings are described in Section VI.

IV. EXPERIMENTAL SETUP A. SIMULATION STUDY
To validate the proposed method, we simulated an 18F-FDG dynamic PET scan based on the BrainWeb dataset [53], in which a 3D MR brain digital phantom (voxel dimensions, 1.25 × 1.25 × 1.25 mm³) was used. After linear transformation, image cropping, and extraction of partial slices, the MR image array size was converted to 256 × 256 × 75. The PET scanning schedule was consistent with that used in [54], comprising 24 time frames over 60 min: 4 × 20 s, 4 × 40 s, 4 × 60 s, 4 × 180 s and 8 × 300 s. An attenuation image was generated from the discrete MR image by assigning a value of 0 cm⁻¹ to air, 0.146 cm⁻¹ to bone and 0.096 cm⁻¹ to other tissues. The regional time activity curves shown in Figure 2 were assigned to the different brain regions to generate noise-free dynamic PET images. A hot sphere of diameter 20 mm was inserted into the PET image as a tumor region. Frame 24 of the dynamic PET images was used as the ground truth in this work.

1) TOMOGRAPHY IMAGING
The computer simulation modeled the GE Discovery ST PET/CT scanner, whose system matrix was generated by the parallel strip line integration method. Dynamic PET images were first forward projected using the system matrix to generate noise-free sinograms. After that, the sinograms were convolved with a Gaussian filter (full width at half maximum, FWHM = 3.5 voxels) to simulate the partial volume effect. The expected total number of events was 100M over 60 min. Uniform random events equal to 20% of the total counts were added to simulate background events. Then, noisy sinogram data were generated by introducing Poisson noise. To generate a low-dose noisy image for training, the noisy sinogram data were randomly downsampled to 1/10th of the total counts. Finally, the downsampled dynamic frames were reconstructed by the Maximum Likelihood Expectation Maximization (MLEM) method with 120 iterations, and frame 24 of the downsampled dynamic frames was used as the target noisy image to be processed. For consistency with the ground truth, the voxel size of the reconstructed PET image is 1.25 × 1.25 × 1.25 mm³.
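The count-domain noise model (expected trues, 20% uniform randoms, Poisson noise, then random 1/10 downsampling) can be sketched as follows. This is a hedged illustration of the described pipeline, not the paper's code; binomial thinning is used for the random downsampling, which is statistically equivalent to drawing Poisson counts directly at 1/10 of the dose.

```python
import numpy as np

def simulate_low_dose_sinogram(noise_free, total_counts=100e6,
                               randoms_frac=0.2, dose_frac=0.1, seed=0):
    """Sketch of the count-domain noise model: scale the noise-free
    sinogram to the expected number of true events, add a uniform randoms
    background equal to 20% of the total counts, draw Poisson counts, then
    randomly thin to 1/10 of the counts."""
    rng = np.random.default_rng(seed)
    expected = noise_free * (total_counts / noise_free.sum())
    expected = expected + randoms_frac * total_counts / expected.size
    full_dose = rng.poisson(expected)              # noisy full-dose sinogram
    low_dose = rng.binomial(full_dose, dose_frac)  # 1/10th of the counts
    return low_dose

sino = np.ones((64, 64))                  # flat stand-in sinogram
low = simulate_low_dose_sinogram(sino, total_counts=1e6)
print(low.sum())                          # about 0.12e6 counts
```

Thinning a Poisson realization with keep-probability p yields Poisson counts with mean scaled by p, so the low-dose sinogram has the correct statistics without re-simulating the scan.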
We compared the DeepRED method with conventional Gaussian filtering, non-local mean, block-matching and 3D filtering (BM3D) [55], and the DIP and SGLD methods. Here, the FWHM of the GF filter window was chosen from 1 mm to 5 mm. The radius of the NLM search window was chosen from 3 to 7. The default standard deviation of the additive white Gaussian noise (AWGN), chosen from 5 to 25, was the parameter for BM3D. For the DeepRED method, the NLM denoiser was applied once every 10 iterations. The NLM patch size used in DeepRED was 5 × 5 and the radius of the search window was set to 6.

FIGURE 3. A view of simulated brain digital phantom images, corresponding to the MR brain digital phantom, frame 24 noise-free PET (ground truth), target noisy PET (frame 24) and static PET image, from left to right.
As described in Section I, a simulated static PET image is applied as the input of the network to speed up network processing and promote denoising performance. The static PET image was acquired and reconstructed by summing the entire downsampled dynamic PET sinogram data into one static frame. Figure 3 shows the MR brain digital phantom, frame 24 noise-free PET (ground truth), the target noisy image to be processed (frame 24) and the static PET image, from left to right.

2) EVALUATION METRIC
For quantitative evaluation, we assessed the quality of denoised PET images by the Peak Signal to Noise Ratio (PSNR) and the Normalized Root Mean Square Error (NRMSE). PSNR is defined as
PSNR = 10 log_10( max_I² / MSE(I, K) ),
and NRMSE is defined as
NRMSE = sqrt( (1/N_R) Σ_{i∈R} (K_i − I_i)² ) / (max_I − min_I),
where I and K are the ground truth and the denoised image, respectively, N_R is the number of total pixels in the non-zero ground truth area, and max_I and min_I are the maximum and minimum possible pixel values of the ground truth, respectively. Besides, the Structural Similarity (SSIM) index was also used. SSIM is defined as
SSIM = (2 µ_I µ_K + c_1)(2 σ_IK + c_2) / ((µ_I² + µ_K² + c_1)(σ_I² + σ_K² + c_2)),
where µ is the average value of an image, σ² is the variance and σ_IK is the covariance of I and K. c_1 = 4 × 10⁻⁴ and c_2 = 3.6 × 10⁻³ were used in this work. To ensure the consistency of experimental conditions, the maximum iteration number of all DIP-based methods was preset to 6000, and denoised PET images were selected based on the highest PSNR during the iterations. Owing to overfitting, when images at different iteration numbers had the same PSNR, the image with the smaller iteration number was selected as the baseline for comparison. The initial learning rate lr = 0.005 was used and decayed as lr ← 0.9·lr every 1000 iterations. Profile analyses were also performed for structural comparison across the different methods.
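The three metrics can be sketched as below. The exact conventions (e.g., whether PSNR uses max_I or the dynamic range, and the masking used for NRMSE) are assumptions where the text is not explicit.

```python
import numpy as np

def psnr(I, K):
    """PSNR in dB; assumes the max_I^2 / MSE convention."""
    mse = np.mean((I - K) ** 2)
    return 10.0 * np.log10(I.max() ** 2 / mse)

def nrmse(I, K):
    """RMSE over the non-zero ground-truth area, normalized by the
    ground-truth dynamic range (max_I - min_I)."""
    mask = I != 0
    mse = np.mean((I[mask] - K[mask]) ** 2)
    return np.sqrt(mse) / (I.max() - I.min())

def ssim_global(I, K, c1=4e-4, c2=3.6e-3):
    """Single-window (global) SSIM with the constants used in the paper."""
    mu_i, mu_k = I.mean(), K.mean()
    var_i, var_k = I.var(), K.var()
    cov = np.mean((I - mu_i) * (K - mu_k))
    return ((2 * mu_i * mu_k + c1) * (2 * cov + c2)) / \
           ((mu_i ** 2 + mu_k ** 2 + c1) * (var_i + var_k + c2))

I = np.linspace(0.0, 1.0, 101)
K = I + 0.1                       # constant 0.1 error -> MSE = 0.01
print(psnr(I, K), nrmse(I, K), ssim_global(I, I))
```

A constant 0.1 error on a unit-range image gives MSE = 0.01 and therefore PSNR = 20 dB and NRMSE = 0.1, a convenient sanity check on the conventions used.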
On the other hand, contrast recovery coefficient (CRC) vs. background standard deviation (SD) curves were plotted based on the images denoised with the different methods for quantitative comparison. We computed the CRC between the tumor and white matter, and the CRC between the gray matter and white matter. The target regions of interest (ROIs) were drawn in both matched gray matter regions and tumor regions. The background ROIs were drawn in white matter regions. For better quantitative evaluation, the CRC was normalized and calculated as
CRC = (1/R) Σ_{r=1}^{R} ( (x̄_target,r / x̄_bg,r − 1) / (x_target / x_bg − 1) ),
where x̄_target,r and x̄_bg,r are the mean activities of the target and background ROIs in realization r, x_target / x_bg is the true activity ratio, and R represents the total number of realizations and is set to 10.
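A hedged sketch of the normalized CRC and the background SD, under the assumption that CRC is the realization-averaged ratio of measured to true contrast and that SD is the ensemble standard deviation over realizations normalized by the background mean.

```python
import numpy as np

def normalized_crc(realizations, target_mask, bg_mask, true_ratio):
    """Normalized CRC averaged over R realizations:
    CRC = (1/R) * sum_r ((mean_target_r / mean_bg_r - 1) / (true_ratio - 1))."""
    crcs = [(img[target_mask].mean() / img[bg_mask].mean() - 1.0)
            / (true_ratio - 1.0) for img in realizations]
    return float(np.mean(crcs))

def background_sd(realizations, bg_mask):
    """Ensemble SD across realizations, averaged over background voxels
    and normalized by the mean background value."""
    stack = np.stack([img[bg_mask] for img in realizations])  # R x voxels
    return float(np.mean(stack.std(axis=0)) / stack.mean())

img = np.ones((8, 8))
img[2:4, 2:4] = 4.0                     # hypothetical hot (target) region
target = np.zeros((8, 8), bool); target[2:4, 2:4] = True
bg = np.zeros((8, 8), bool); bg[5:7, 5:7] = True
crc = normalized_crc([img, img.copy()], target, bg, true_ratio=4.0)
sd = background_sd([img, img.copy()], bg)
print(crc, sd)                          # perfect recovery: 1.0, 0.0
```

With this normalization a perfect reconstruction scores CRC = 1, and noiseless identical realizations give background SD = 0, which is why methods trace a curve toward the upper-left as their parameters vary.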
The network structure for the simulation study was the same as in Figure 1.

B. REAL DYNAMIC PET BRAIN DATA SETS
Five patient data sets from the Alzheimer's Disease Neuroimaging Initiative (ADNI) [56] were employed in this study. The patients were injected with 5.0 ± 0.5 mCi (185 MBq) of 18F-FDG. Owing to the lack of static PET images in the original patient data sets, the voxels of all frames of the dynamic PET images were summed and averaged in the image domain to obtain an approximate clinical static PET image for each patient. Then the intensity of the approximate clinical static PET image was scaled to the range [0, 1] for network training. To fit the network structure, the 34th slice of the frame 22 PET image was chosen for each patient, and the corresponding 34th slice of the static PET image was applied as the network input. In this paper, only the results with the static PET image as the network input are shown, given its better performance compared with random noise input on the real data sets.
We also compared the DeepRED method with conventional Gaussian filtering, non-local mean, block-matching and 3D filtering, and the DIP and SGLD methods. For the patient data sets, the FWHM of the GF filter window was chosen from 1 mm to 5 mm. The radius of the NLM search window was chosen from 2 to 6. The default standard deviation of the additive white Gaussian noise, chosen from 5 to 25, was the parameter for BM3D. When using the DeepRED method, we applied the NLM denoiser once every 10 iterations. The NLM patch size used in DeepRED was 5 × 5 and the radius of the search window was set to 2. To ensure the consistency of experimental conditions, the maximum iteration number of all DIP-based methods was preset to 6000. The initial learning rate lr = 0.005 was used and decayed as lr ← 0.9·lr every 1000 iterations.
In the absence of ground truth, five ROIs in the noisy PET brain image were selected for quantitative comparison, as shown in Figure 8. The uptakes of ROI 1, ROI 3 and ROI 5 increased with the frame number; these were the high activity regions. The uptakes of ROI 2 and ROI 4 decreased with the frame number; these were the low activity regions, used as background ROIs. The contrast recovery (CR) between the high activity regions and the low activity regions was calculated as
CR = (1/N) Σ_{r=1}^{N} ( x̄_l,r / x̄_m,r ), (22)
where x̄_l = (1/N_s) Σ_{k=1}^{N_s} x_{l,k} denotes the mean activity of a selected target ROI, x̄_m denotes the mean activity of a reference ROI, and N is the number of realizations, which was set to 10 here. In this work, the target ROIs correspond to ROI 1, ROI 3 and ROI 5, and the reference ROIs to ROI 2 and ROI 4. The network structure for the real data sets was the same as in Figure 1.
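A hedged sketch of the CR computation in (22), assuming CR is the realization-averaged ratio of mean target-ROI activity to mean reference-ROI activity.

```python
import numpy as np

def contrast_recovery(realizations, target_roi, ref_roi):
    """CR averaged over realizations: mean activity of the high-uptake
    (target) ROI divided by that of the low-uptake (reference) ROI."""
    crs = [img[target_roi].mean() / img[ref_roi].mean()
           for img in realizations]
    return float(np.mean(crs))

img = np.full((8, 8), 2.0)
img[0:2, 0:2] = 6.0                      # hypothetical high-activity ROI
target = np.zeros((8, 8), bool); target[0:2, 0:2] = True
ref = np.zeros((8, 8), bool); ref[4:6, 4:6] = True
print(contrast_recovery([img], target, ref))   # 6.0 / 2.0 = 3.0
```

Unlike the normalized CRC used for the simulations, this metric needs no ground-truth ratio, which is why it is suitable for the patient data.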

V. RESULTS
We evaluated the proposed DeepRED algorithm in comparison with conventional denoising methods, including GF, NLM, BM3D, DIP and SGLD algorithms. The parameters in all algorithms were first optimized separately, followed by vertical profile analysis and quantitative comparison in terms of regional contrast recovery coefficient vs. the background standard deviation performance.

A. SIMULATION STUDY
1) THE NETWORK INPUT
In order to provide a more direct visual impression of the denoised images, Figure 4 shows the ground truth, the noisy (low dose) image, the static image and the denoised images obtained with the different algorithms: the GF, NLM, BM3D, DIP r , SGLD r , DeepRED r , DIP s , SGLD s and DeepRED s methods, for one simulated realization; below each image is the zoomed-in ROI of the area indicated by the blue box in the upper-corner reference image. Among these methods, DIP r , SGLD r and DeepRED r used random noise as the network input, while DIP s , SGLD s and DeepRED s used the static image as the network input. For GF, the FWHM was set to 3 mm. For NLM, the search window was set to 5 × 5. For BM3D, the standard deviation of the AWGN was set to 20.
Compared with the GF and NLM post-processing methods, all DIP-based methods and the BM3D method have lower noise and reveal more cortex structures. Compared with the BM3D method, all DIP-based methods preserve more image details, especially in gray matter regions, and the DeepRED method performs best. Furthermore, compared with the DIP r , SGLD r and DeepRED r methods, DIP s , SGLD s and DeepRED s retain more image details. In fact, using the static image as the network input can greatly accelerate network processing, which will be further explained in the discussion in Section VI.
In this work, we took into account the use of a decaying learning rate and recorded the mean value and standard deviation of the quantitative results over 10 realizations for the different methods. The corresponding quantitative evaluation results are shown in Table 1, and the best value of each metric is marked in bold. In Table 1, type ND means no learning-rate decay, type D means decaying learning rate and type NA means not applicable. As shown in Table 1, all DIP-based methods perform better than the traditional GF and NLM methods in terms of all indicators. DIP r and SGLD r have lower SSIM than the BM3D method, while DeepRED r performs better than BM3D in terms of SSIM. In this work, SGLD r performs slightly worse than DIP r , which differs somewhat from the experimental results in [43]. This phenomenon may be due to the fact that the noise in the PET image domain does not follow a standard Gaussian distribution. Compared with the DIP r , SGLD r and DeepRED r methods, DIP s , SGLD s and DeepRED s achieve clear improvements in terms of all indicators, which confirms the effectiveness of using a prior image as the network input. For DeepRED r and DeepRED s , decaying the learning rate improves the denoising performance, while the effect is less obvious for the DIP and SGLD algorithms. Nevertheless, decaying the learning rate alleviates overfitting and prevents the network output from degrading suddenly. Overall, the DeepRED s method outperforms the other methods in terms of all indicators.

2) VERTICAL PROFILE ANALYSIS
To provide an intuitive evaluation of how edge details are preserved, Figure 5 shows the vertical profile analysis of the images shown in Figure 4. For a better demonstration, we divide the results into Figure 5 (a) and Figure 5 (b) based on the different network inputs. Figure 5 (a) shows the profiles of the results from the ground truth, GF, NLM, BM3D, DIP r , SGLD r and DeepRED r . Figure 5 (b) shows the profiles of the results from the ground truth, GF, NLM, BM3D, DIP s , SGLD s and DeepRED s . It can be observed that the profiles of the results from DeepRED r and DeepRED s are closer to the ground truth than those from the other methods. Besides, DeepRED r and DeepRED s are smoother in the regions with large intensity variation than the other methods. These results further verify that the proposed DeepRED algorithm has a better noise reduction capability while preserving edge structures.

FIGURE 4. … the post-processed PET images using the Gaussian filter with FWHM = 3 mm, NLM with a 5 × 5 search window and BM3D with an AWGN standard deviation of 20, respectively; (g)-(i) the post-processed PET images using the DIP r , SGLD r and DeepRED r methods, respectively (random noise was used as the network input); (j)-(l) the post-processed PET images using the DIP s , SGLD s and DeepRED s methods, respectively (the static image was used as the network input). Below each image is the zoomed-in ROI of the area indicated by the blue box in the upper-corner reference image.

3) REGIONAL CRC VS. BACKGROUND SD
To provide a regional contrast recovery evaluation performance of the denoising algorithms, Figure 6 depicts plots of regional CRC vs. background SD curves of estimated images (generated with varying parameters) for gray matter and tumor region. For GF, the full width at half maximum increased from 1mm to 5mm. For NLM, the search window varied from 3 × 3 to 7 × 7. For BM3D, the standard deviation of the AWGN ranged from 5 to 25. The results from DIP-based methods were generated by varying the iteration numbers.
As shown in Figure 6, all DIP-based methods outperform the traditional methods (including GF, NLM, and BM3D), and the DeepRED method performs better than the DIP and SGLD methods for both the random noise and the static image network inputs. Furthermore, using the static image as the network input can improve the CRC of the DeepRED method in the gray matter region. In fact, these observations can also be made from Figure 4 by direct visual evaluation.

FIGURE 7. A view of approximate static PET images, original noisy images and estimated images using different methods from five patients. Each row represents the results from a patient. The first column shows approximate static PET images. The second column represents the original noisy images from different patients. Estimated images with different denoising algorithms (including GF, NLM, BM3D, DIP s , SGLD s and DeepRED s ) are shown from the 3rd to the 8th column. Except for the first column, the images of each row are displayed in the same grayscale range.

B. APPLICATION TO PATIENT STUDY
Following extensive validation using simulations, we applied the proposed DeepRED method to the patient study (as elaborated in Section IV.B). To provide a visual comparison of the denoised images, Figure 7 shows the denoised PET images obtained with the different methods (including GF, NLM, BM3D, DIP s , SGLD s , and DeepRED s ) and the corresponding approximate static PET images from the five patients. Each row represents the results from one patient. The first column shows the approximate static PET images. The second column shows the original images from the different patients. Denoised images with the different denoising algorithms (including GF, NLM, BM3D, DIP s , SGLD s and DeepRED s ) are shown from the 3rd to the 8th column. Since the PET scanning protocols of the five patients are similar, the filter parameters of the displayed images are the same for all five patients.
For GF, the FWHM was set to 3 mm. For NLM, the search window was set to 2 × 2. For BM3D, the standard deviation of the AWGN was set to 20. Except for the first column, the images of each row are displayed in the same grayscale range. Compared with the traditional methods (GF, NLM, and BM3D), the DIP-based methods reduce noise effectively while recovering more details, and the DeepRED_s method performs better than the DIP_s and SGLD_s methods. Figure 9 shows the regional CR vs. background SD curves of the estimated images (generated with varying parameters). For GF, the full width at half maximum increased from 1 mm to 5 mm. For NLM, the search window varied from 2 × 2 to 6 × 6. For BM3D, the standard deviation of the AWGN increased from 5 to 25. The results from the DIP-based methods were generated by varying the iteration numbers. At the same background SD, the CR of the DeepRED method is higher than that of the other methods, which confirms that the DeepRED method outperforms them.

VI. DISCUSSION
As described above, we developed a novel deep learning denoising framework to enhance the quantitative accuracy of dynamic PET images. Compared to deep learning methods based on high- and low-dose training pairs, the proposed method eliminates the need for tedious pre-training and the potential risk of overfitting to a training set, and it performs well across different data sets.

A. THE STRUCTURE OF NETWORK
We applied skip connections that combine hierarchical features by addition rather than concatenation, for faster processing. Five downsampling and five upsampling steps were used to learn deeper image features. Different from [38] and [41], the number of feature channels was fixed at 128 after each downsampling and upsampling step in this work; we chose more channels to capture more image features. Moreover, with the fixed 128 feature channels, the highest average PSNR of DIP_r with no decaying was 30.2281 ± 1.81E-02 dB over 10 tests, while it was only 30.0954 ± 2.63E-02 dB with the channel numbers consistent with [38] and [41].
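At the level of tensor shapes, the additive skip scheme can be sketched as follows. This is a shape-only illustration, with average pooling and nearest-neighbor upsampling standing in for the learned convolutional layers; with additive (rather than concatenated) skips, the feature width can stay fixed across levels, matching the fixed 128-channel design:

```python
import numpy as np

def down(x):
    """2x average-pool downsampling (stand-in for a strided conv step)."""
    return 0.25 * (x[::2, ::2] + x[1::2, ::2] + x[::2, 1::2] + x[1::2, 1::2])

def up(x):
    """2x nearest-neighbor upsampling (stand-in for a learned upsampling)."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def skip_encoder_decoder(x, depth=5):
    """Encoder-decoder with additive skip connections: features from each
    encoder level are added (not concatenated) to the matching decoder
    level, so the channel count can remain constant (e.g., 128)."""
    skips = []
    for _ in range(depth):
        skips.append(x)
        x = down(x)
    for _ in range(depth):
        x = up(x) + skips.pop()   # additive skip: same shape, no concat
    return x

img = np.random.rand(64, 64)
out = skip_encoder_decoder(img)
print(out.shape)   # (64, 64) — output matches input resolution
```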

B. LEARNING RATE AND OPTIMIZER
FIGURE 9. Plots of regional CR vs. background SD curves of estimated images (generated with varying parameters) using different denoising algorithms for the ROIs in Figure 8.

When the network reached a certain number of iterations, the output could suddenly 'collapse' if the learning rate was too high, and it was difficult to recover the image quality afterwards. This phenomenon has not been described for natural images. We therefore reduced the learning rate with the iteration number: the initial learning rate lr = 0.005 was decayed by lr ← 0.9 * lr every 1000 iterations in this work. In addition, the limited-memory BFGS (L-BFGS) algorithm [57] was also implemented in our experiments: the network was first optimized for 100 iterations with the Adam optimizer, and the L-BFGS optimizer was applied to the subsequent iterations. However, the quantitative performance of the estimated images under L-BFGS oscillated, which made it difficult to choose the iteration number. Therefore, the Adam optimizer was applied to all iterations.
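The stated decay schedule, lr ← 0.9 · lr every 1000 iterations starting from 0.005, is equivalent to the closed form below:

```python
def lr_at(iteration, lr0=0.005, gamma=0.9, step=1000):
    """Learning rate after applying lr <- gamma * lr every `step` iterations."""
    return lr0 * gamma ** (iteration // step)

print(lr_at(0))      # 0.005
print(lr_at(2500))   # 0.005 * 0.9**2 = 0.00405
```

In common frameworks this corresponds to a step decay scheduler (e.g., a step size of 1000 and a decay factor of 0.9) attached to the Adam optimizer.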

C. OVERFITTING
The DeepRED method uses a denoising filter to regularize the DIP, which helps avoid local solutions and stabilizes the network learning. This is similar to the deep mean-shift prior (DMSP), which has recently been used in image reconstruction [58]. Figure 10 shows the PSNR and SSIM curves varying with the iteration number for the DIP_r, SGLD_r, DeepRED_r, DIP_s, SGLD_s, and DeepRED_s methods. As can be seen from these curves, the DIP method overfits easily. This is the major drawback of the DIP method, and it can only be avoided by selecting an appropriate iteration number experimentally. The SGLD and DeepRED methods are far less prone to overfitting than DIP. Therefore, the desired image can be obtained within a wider range of iteration numbers without worrying about image quality degradation caused by overfitting, which provides a way to overcome the easy overfitting of DIP. Reference [59] provided another approach, selecting the most appropriate network architecture through the neural architecture search (NAS) technique; the most appropriate iteration number may likewise be determined by such search techniques rather than by experience.
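In the standard RED formulation, the regularizer attached to the DIP output is ρ(x) = ½ xᵀ(x − D(x)), whose gradient under RED's assumptions is simply x − D(x). A minimal sketch follows, with a toy 3 × 3 box filter standing in for the denoiser D (the actual denoiser used in DeepRED may differ):

```python
import numpy as np

def box_denoise(x):
    """Toy stand-in denoiser D(.): a 3x3 box filter with edge padding."""
    p = np.pad(x, 1, mode="edge")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w]
               for i in range(3) for j in range(3)) / 9.0

def red_prior(x):
    """RED regularizer rho(x) = 0.5 * x^T (x - D(x))."""
    return 0.5 * float(np.sum(x * (x - box_denoise(x))))

# A constant image is a fixed point of the smoothing denoiser,
# so the RED penalty vanishes on it.
flat = np.full((16, 16), 2.0)
print(red_prior(flat))   # → 0.0
```

Intuitively, the penalty grows with the cross-correlation between the image and its denoising residual, so images that the denoiser already considers "clean" are favored, which is what keeps the DIP iterations from drifting into noisy solutions.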
It should be noted that the SGLD method was slightly more prone to overfitting than the DeepRED method in this work. As described in Section II.B, the inference procedure of the SGLD method can be interpreted as a maximum likelihood estimate under a Gaussian noise model. However, the noise distribution in the PET image domain is not Gaussian, which violates this premise of the SGLD method. Besides, SGLD inference adds noise to all inputs, including the network input z and the network parameters θ. Using static images as the network input omits the step of adding noise to the input, which may weaken the ability of the SGLD method to prevent overfitting.
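A generic SGLD update, illustrating the parameter-noise injection discussed above, can be sketched on a toy 1-D problem (this is the textbook update rule, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def sgld_step(theta, grad, lr):
    """One SGLD update: a half-step of gradient descent plus Gaussian
    noise whose variance matches the step size. In SGLD-based DIP the
    network input z is perturbed in the same way as the parameters."""
    noise = rng.normal(0.0, np.sqrt(lr), size=np.shape(theta))
    return theta - 0.5 * lr * grad(theta) + noise

# Toy example: sampling around the minimum of f(theta) = 0.5 * theta^2.
theta = 5.0
for _ in range(5000):
    theta = sgld_step(theta, lambda t: t, lr=0.01)
print(theta)   # wanders near 0 rather than converging exactly
```

The injected noise keeps the iterates sampling around a mode instead of locking onto the noisy data, which is the mechanism behind SGLD's resistance to overfitting; omitting the input-noise component, as happens with a fixed static-image input, removes part of that randomization.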

D. HYPERPARAMETERS
There are two hyperparameters to be selected in the objective function (9), which have important impacts on the performance of the proposed method. To select the optimal values, experiments based on DeepRED_s with decaying learning rates were carried out to explore the impact of the two parameters. Table 2 and Table 3 show the quantitative results of the network when one hyperparameter is fixed and the other is varied, with the best value of each metric marked in bold. For a better visual comparison, Figure 11 and Figure 12 show the results of Table 2 and Table 3 in the form of error bars, respectively. As shown in Table 2 and Figure 11, the best quantitative results of DeepRED_s are obtained when λ = 0.2 and µ = 0.2; the same can be seen in Table 3 and Figure 12. Accordingly, the hyperparameters λ = 0.2 and µ = 0.2 were used throughout this paper.
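This hyperparameter selection amounts to a small grid search over (λ, µ). A sketch with a toy surrogate score is shown below; in practice the `evaluate` callable would run DeepRED_s with the given pair and return a quality metric such as PSNR:

```python
import itertools

def grid_search(evaluate, lams, mus):
    """Return the (lambda, mu) pair with the highest score; `evaluate`
    is a stand-in for running DeepRED_s and measuring, e.g., PSNR."""
    return max(itertools.product(lams, mus),
               key=lambda pair: evaluate(*pair))

# Toy surrogate score peaking at (0.2, 0.2), mimicking Tables 2 and 3.
score = lambda lam, mu: -((lam - 0.2) ** 2 + (mu - 0.2) ** 2)
grid = [0.05, 0.1, 0.2, 0.4, 0.8]
print(grid_search(score, grid, grid))   # → (0.2, 0.2)
```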

VII. CONCLUSION
In this work, we developed a novel deep learning denoising framework aiming to enhance the quantitative accuracy of dynamic PET images via the introduction of deep image prior combined with Regularization by Denoising. The proposed method avoids the need for high-quality noiseless images, since either random noise or prior images can be used as the network input. Furthermore, the method combines the network training and image denoising processes by constructing a constrained optimization problem, alleviating the need for early stopping in the DIP method. Both simulation and patient data experiments show that the proposed method outperforms the conventional GF, NLM, BM3D, DIP, and SGLD methods both visually and quantitatively. Future work will focus on better network architectures and faster processing in order to make the method more practical and effective on clinical data sets.