Combining Deep Image Prior and Second-Order Total Generalized Variation for Image Denoising

Deep image prior is a classical unsupervised deep learning method that does not require a large number of training samples; in some practical applications, such as medical imaging, collecting many training samples is not always viable. At the same time, constructing a favorable training set is difficult and time-consuming, since the selected data must be sufficient and representative. In addition, the deep image prior method suffers from overfitting, which can lead to the loss of image texture details and the destruction of edge information, whereas the second-order total generalized variation can effectively protect image edge information. Therefore, in this work, an image denoising method that combines the deep image prior with the second-order total generalized variation is proposed to remove noise while better preserving image edges and texture structure. To solve the new model efficiently, we first transform the constrained problem into an unconstrained one using the augmented Lagrangian method, and then solve the augmented Lagrangian function using the alternating direction method of multipliers. Experimental results show that the proposed method removes noise more effectively, retains more image details, achieves higher peak signal-to-noise ratio and structural similarity, and outperforms other existing methods.


I. INTRODUCTION
In the field of image processing and computer vision, image denoising is an important and challenging problem. Images are susceptible to many degrading factors during acquisition and transmission, which lead to reduced image quality. Noise reduces the clarity of an image, affects human perception, and obscures much of the important detail in the image, ultimately reducing its application value. The presence of noise also hinders other image processing tasks, so image denoising is of great importance. The task of image denoising is to recover a clean, sharp image from a noisy observation.
The associate editor coordinating the review of this manuscript and approving it for publication was Dominik Strzalka.
Mathematically, the inverse problem for a noisy input image f ∈ R^n can be expressed in the following form:

f = u + η,   (1)

where u is the original noise-free image and η ∈ R^n is the additive noise. As can be seen from (1), noise is directly superimposed on the original image. This noise could be salt-and-pepper noise or Gaussian noise; in theory, if the noise were known exactly, it could be subtracted from the input image to recover the original. In practice, however, the noise cannot be determined exactly because its generation process is unknown.
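To make the degradation model concrete, the following minimal NumPy sketch (an illustration only, not the authors' code) simulates f = u + η with zero-mean Gaussian noise:

```python
import numpy as np

def add_gaussian_noise(u, sigma, seed=0):
    """Simulate the additive model f = u + eta, with eta ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    eta = rng.normal(0.0, sigma, size=u.shape)
    return u + eta

# A flat 64x64 test "image" corrupted with noise of standard deviation 20.
u = np.full((64, 64), 128.0)
f = add_gaussian_noise(u, sigma=20.0)
```

Recovering u from f alone is ill-posed, which is why the prior (regularization) terms discussed below are needed.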
To find a desirable approximate solution, different methods have been proposed [1], [2], [3], [4], [5]. The most prominent and promising ones can be divided into two main types: regularized reconstruction-based methods and learning-based methods.
A regularization-based reformulation casts the problem as the following optimization problem:

min_u (λ/2) ‖u − f‖₂² + R(u),   (2)

where the first term is the data fidelity term and the second is the regularization term. Under the assumption of zero-mean Gaussian noise, the fidelity term is usually defined with the L2 norm. The regularization term R encodes prior information about the solution, such as the total variation (TV) prior or the wavelet-L1 prior [6]. Although TV regularization can effectively remove noise while preserving sharp edges, it is prone to producing staircase effects in the smooth regions of images.
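As an illustration of solving such a regularized problem (a smoothed-TV sketch under the general formulation above, not the solver used later in this paper), plain gradient descent can be applied to E(u) = (λ/2)‖u − f‖² + Σ √(|∇u|² + ε), where the small constant ε makes the TV term differentiable:

```python
import numpy as np

def grad(u):
    # Forward differences with replicate (Neumann) boundary conditions.
    ux = np.diff(u, axis=1, append=u[:, -1:])
    uy = np.diff(u, axis=0, append=u[-1:, :])
    return ux, uy

def div(px, py):
    # Backward-difference divergence, the negative adjoint of grad above.
    dx = np.diff(px, axis=1, prepend=np.zeros((px.shape[0], 1)))
    dy = np.diff(py, axis=0, prepend=np.zeros((1, py.shape[1])))
    return dx + dy

def tv_energy(u, f, lam=2.0, eps=1e-2):
    ux, uy = grad(u)
    return 0.5 * lam * ((u - f) ** 2).sum() + np.sqrt(ux**2 + uy**2 + eps).sum()

def smoothed_tv_denoise(f, lam=2.0, eps=1e-2, step=0.01, iters=300):
    """Gradient descent on the smoothed TV model (step chosen below 2/L)."""
    u = f.copy()
    for _ in range(iters):
        ux, uy = grad(u)
        mag = np.sqrt(ux**2 + uy**2 + eps)
        u = u - step * (lam * (u - f) - div(ux / mag, uy / mag))
    return u

# Piecewise-constant test image with additive Gaussian noise.
rng = np.random.default_rng(0)
clean = np.zeros((32, 32)); clean[:, 16:] = 1.0
noisy = clean + rng.normal(0.0, 0.2, clean.shape)
denoised = smoothed_tv_denoise(noisy)
```

The descent monotonically decreases the energy and visibly suppresses noise in flat regions, while the sharp vertical edge is largely retained, which is exactly the behavior (and the staircase-prone smoothing) discussed in the text.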
To address the staircase effect produced by the TV regularization model during denoising, researchers found that higher-order derivatives can better distinguish detailed information such as noise, edges, and texture, so higher-order derivatives were introduced as the regularization term of variational denoising models; however, higher-order derivatives can blur image edges to a certain extent [7]. Later, Bredies et al. [8] proposed the total generalized variation (TGV) image denoising model, which can adaptively balance the first- and second-order derivative terms. It has been shown that using TGV as a regularization term mitigates the staircase effect and produces better results than conventional TV. TGV was used as a penalty term in [9] for MRI denoising problems and produced better results than conventional TV. Ferstl et al. [10] formulated a convex optimization problem using higher-order regularization for depth image upsampling, obtaining better results. As low-rank models tend to cause excessive smoothing, Liu et al. [11] proposed a weighted TGV-regularized nuclear norm minimization denoising model to better preserve local structure. Hue et al. [12] proposed an image denoising method combining overlapping group sparsity with second-order total variation regularization, whose denoising effect surpassed other similar methods. Aiming at the staircase effect of the TV regularization term, Thanh et al. [13] proposed a first- and second-order total variation regularization and image restoration method based on inverse-gradient adaptive parameters, which can protect image structure and avoid staircase artifacts.
In the past few decades, deep neural networks have attracted widespread attention. Deep learning methods can learn complex image representations and improve performance on image inverse problems. State-of-the-art performance was first achieved in image segmentation with deep learning methods [14], [15], which train deep neural networks on a set of training samples; however, their performance depends largely on a large number of high-quality training samples. Zhang et al. [4] investigated feedforward denoising convolutional neural networks, exploring deeper architectures and using residual learning and batch normalization to speed up training and improve denoising performance. Since discriminative learning methods mainly learn a specific model for each noise level and require different models for different noise levels, lacking flexibility, Zhang et al. [16] proposed a fast and flexible denoising convolutional neural network that takes an adjustable noise level map as input. Lefkimmiatis et al. [17] designed a novel network architecture to learn discriminative image models for effectively solving grayscale and color image denoising problems. Supervised learning, however, requires a large number of training samples, which are difficult to collect in practice; for example, it is impractical to collect many samples of similar diseases in medical imaging [18]. Unlike traditional supervised deep learning, Ulyanov et al. [19] proposed a typical unsupervised deep learning approach called deep image prior (DIP), which uses the deep network itself as a regularizer for the inverse problem instead of following the supervised route of most earlier approaches.
It requires neither a large number of samples nor ground-truth images, using only the noisy image for denoising, so the adaptive learning capability of a convolutional neural network (CNN) can be exploited to generate images with good results. Later, researchers mainly engaged in the theoretical analysis of DIP [20], [21], [22] and in improving its performance; the most common approach is to adjust the objective of the minimization problem so that the solution satisfies a given prior. In particular, Mataev et al. [23] enhanced DIP by adding explicit priors, introducing the concept of regularization by denoising (RED) and combining DIP and RED into a very effective unsupervised recovery scheme. Liu et al. [24] added an explicit anisotropic TV term to the minimization problem. In addition, the idea of combining DIP with TV-based terms has shown good performance in processing X-ray images [2] and computed tomography reconstructions [25]. Liu et al. [24] introduced a traditional TV term on top of DIP, proposing that the CNN model itself can act as a prior for the image while a penalty on the image gradient promotes sparsity and strengthens the prior. Cascarano et al. [26] combined DIP with automatic estimation of spatially variant TV regularization and local regularization parameters and solved the resulting problem with a flexible ADMM, achieving a good improvement on the image denoising task. Zheng et al. [27] proposed an unsupervised image segmentation method that integrates the CV model with a deep neural network, significantly improving the segmentation accuracy of the original CV model. Cascarano et al. [28] proposed constrained and unconstrained DIP optimization models that automatically estimate the strength of the regularization; their models are robust to image content, noise level, and hyperparameters in image denoising and deblurring.
So far, researchers have continuously investigated DIP-based methods to process images [29], [30], [31]. However, because DIP has no constraint term, it is difficult to find the optimal solution and the method suffers from overfitting; using it for denoising can cause the image to lose texture details and corrupt edge information. In contrast, TGV regularization can adaptively coordinate the first- and second-order derivatives in the denoising task to better portray image edges and texture details. Moreover, deep neural networks are usually designed as ''black boxes'', making it difficult to analyze their internal structure, in contrast to traditional model-based approaches with a rigorous mathematical foundation. It is therefore important to combine traditional methods with deep learning methods and to exploit the merits of both for unsupervised image denoising. In this paper, we introduce the TGV regularization term into DIP to build a new model, expecting better denoising results. For the proposed model, we first transform the constrained problem into an unconstrained problem using the augmented Lagrangian method, and then solve the augmented Lagrangian function using the alternating direction method of multipliers. Experimental results show that the new model can effectively remove noise, protect the structural edges of the image, and reduce staircase artifacts. It also achieves higher peak signal-to-noise ratio and structural similarity than other existing methods.
The remainder of the paper is organized as follows. Section II introduces related work. Section III presents and discusses the proposed model and algorithm, and shows how to solve the ADMM sub-steps efficiently. Section IV gives numerical experimental results of the new method and compares them with other models. Finally, Section V summarizes the paper.

II. PREVIOUS WORK
Aiming at Gaussian noise, Rudin et al. [32] proposed the following TV (ROF) model:

min_u (λ/2) ‖u − f‖₂² + ‖∇u‖₁,   (3)

where f is the observed noisy image, u is the denoised image, the first term is the fidelity term, the second term is the regularization term, and λ > 0 is an adjustment parameter. Although many studies have shown that TV-based methods preserve image edges well during restoration, they also produce staircase effects. To overcome this shortcoming, Bredies et al. proposed the TGV image denoising model

min_u (λ/2) ‖u − f‖₂² + TGV²_α(u),   (4)

where the second-order TGV is defined as

TGV²_α(u) = min_{v ∈ BD(Ω)} α₁ ‖∇u − v‖₁ + α₂ ‖ε(v)‖₁,

where BD(Ω) denotes the space of vector fields of bounded deformation, ε(v) = 1/2 (∇v + ∇vᵀ) denotes the symmetrized derivative, which is a matrix-valued Radon measure, and α₁ and α₂ are two positive parameters. For an N-channel image u, the definition is applied channel-wise. Such a definition provides a way to balance the first- and second-order derivatives of a function [8] (controlled by the ratio of the weights α₁ and α₂), in contrast to TV, which considers only first-order derivatives. Therefore, TGV has been applied to image restoration by more and more scholars [33], [34], [35].
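The balancing property of TGV can be checked numerically in one dimension. The sketch below (our illustration of the definition, not the paper's implementation) evaluates the discrete TV of an affine ramp, and an upper bound on the second-order TGV obtained by picking the feasible candidate v = Du in the inner minimization; for an affine signal the bound vanishes, which is why TGV, unlike TV, does not penalize smooth ramps:

```python
import numpy as np

def tv_1d(u):
    """Discrete total variation: sum of absolute first differences."""
    return np.abs(np.diff(u)).sum()

def tgv2_upper_bound(u, alpha1, alpha2):
    """Upper bound on TGV^2_alpha(u) using the feasible candidate v = Du."""
    du = np.diff(u)
    v = du                      # any v is admissible; v = Du zeroes the first term
    return alpha1 * np.abs(du - v).sum() + alpha2 * np.abs(np.diff(v)).sum()

ramp = np.linspace(0.0, 1.0, 100)   # affine (linearly increasing) signal
# TV charges the ramp its full height, while the TGV bound is zero,
# since the second difference of an affine signal vanishes.
```

This is exactly the mechanism that suppresses the staircase effect: gradual intensity ramps cost TGV nothing, so the minimizer has no incentive to replace them with piecewise-constant steps.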
Since image denoising lacks ground-truth data, supervised learning is difficult to apply. Therefore, Ulyanov et al. [19] introduced the DIP method for image restoration, which does not require a large number of samples and is expressed as the following minimization problem:

min_θ ‖T_θ(z) − f‖₂²,   (5)

where T_θ(z) is a fixed CNN generator with weights θ and z is a random input vector, usually sampled from a uniform distribution [19]. Training the network on a target image yields the parameters θ* that reproduce that image. The deep CNN architecture can represent natural images more easily than random noise and does not require a fixed set of training samples. Later, Cascarano et al. [26] proposed to improve the performance of DIP by adding an explicit prior to (5), i.e., combining DIP with automatic estimation of spatially variant TV regularization with locally regularized parameters (called DIP_TV), expressed as the following problem:

min_θ ‖T_θ(z) − f‖₂² + λ TV(T_θ(z)),   (6)

and the authors use a flexible ADMM algorithm to solve the optimization problem, which provides a good improvement for the image denoising task. In addition, Romano et al. [36] proposed the regularization by denoising (RED) method, replacing the TV regularization in Eq. (3). The model is as follows:

min_u (1/2) ‖u − f‖₂² + λ ρ(u),   (7)

where ρ(u) = (1/2) uᵀ(u − D(u)) and D(·) is a chosen denoising engine. Because RED can flexibly select the denoising engine inserted as the regularization term, Mataev et al. [23] introduced RED into Eq. (5) and proposed a new denoising model (DeepRED). It avoids the need to differentiate the selected denoiser and relies on existing denoising algorithms to define the regularization term, and its results are superior to many other regularization schemes. Hong et al. [37] proposed an acceleration technique based on vector extrapolation (VE) to accelerate existing RED solvers.
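The overfitting behavior that motivates early stopping in DIP can be reproduced in a toy linear analogue (our illustration, not the DIP network itself): gradient descent on an over-parameterized least-squares fit picks up the strongly expressed (smooth) components of a noisy signal before the weakly expressed (noisy) ones, so the reconstruction error against the clean signal first drops and then rises again as the fit starts reproducing the noise:

```python
import numpy as np

n = 64
rng = np.random.default_rng(1)
t = np.arange(n)
clean = np.sin(2 * np.pi * 3 * t / n)          # smooth, low-frequency target
y = clean + rng.normal(0.0, 0.5, n)            # noisy observation

# Orthonormal DCT-II basis; column k is scaled by 1/(1+k), so gradient
# descent fits low-frequency components much faster than high-frequency ones.
k = np.arange(n)
B = np.cos(np.pi * np.outer(k, t + 0.5) / n) * np.sqrt(2.0 / n)
B[0] /= np.sqrt(2.0)
Phi = B.T * (1.0 / (1.0 + k))                  # features = scaled basis columns

w = np.zeros(n)
eta = 0.5
errors = []                                    # distance of the fit to the CLEAN signal
for it in range(15000):
    r = Phi @ w - y
    w -= eta * (Phi.T @ r)                     # gradient step on ||Phi w - y||^2
    if it % 100 == 0:
        errors.append(np.linalg.norm(Phi @ w - clean))

# Early in training the fit is closest to the clean signal; at convergence
# the model reproduces the noisy observation y instead (overfitting).
```

Stopping at the iteration with the smallest clean-signal error mimics the early-stopping strategy applied to θ in DIP-based methods.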

III. THE PROPOSED MODEL AND ALGORITHM
A. THE PROPOSED MODEL
From Eq. (3) and Eq. (6), we can see that DIP_TV is equivalent to replacing the restored image in the ROF model with a weighted CNN generator. From Eq. (3) and Eq. (7), model (7) replaces the TV regularization term in ROF with RED. However, the TV regularization term in DIP_TV is prone to producing the staircase effect and loses small image details. TGV regularization, containing both first- and second-order derivative terms, can adjust the two adaptively, i.e., it uses the first derivative to protect details in edge and texture areas and the higher-order derivative to remove noise in flat areas of the image [8]. Combining the advantages of the above models, this paper introduces the TGV regularization term in place of the TV regularization term, so as to reduce the staircase effects easily generated by TV. Thus a new constrained image denoising model using TGV and DIP is proposed as follows:

min_{θ,u} (λ/2) ‖u − f‖₂² + TGV²_α(u)   s.t.   u = T_θ(z),   (8)

where λ balances the regularization and data fidelity terms. T_θ is the CNN generator, initialized with a set of random weights; the approximation of the target solution is then computed as T_θ*(z), where θ* is the early-stopped solution obtained by applying early stopping to the iterative optimization scheme used to solve (8).
Compared with DIP, the TGV regularization term introduced in the proposed model lets the iterative reconstruction better portray image edges and texture details in the denoising task. Compared with traditional models, the new method introduces DIP, a typical unsupervised deep learning method that does not require a large number of training samples. Compared with DIP_TV, the proposed model contains a TGV regularization term, which can adaptively coordinate the first- and second-order derivatives and mitigate the staircase effect. Numerical experiments show that the proposed algorithm better protects edges and texture details while denoising, with a denoising effect superior to the DIP and DIP_TV methods.

B. ALGORITHM FOR THE PROPOSED MODEL
In this paper, the ADMM algorithm is used to solve the minimization problem. The modular structure of the ADMM framework is flexible: by modifying the sub-step associated with the regularization, we can embed any prior information. In the numerical experiments, we compare the denoising results of the new model with those of other models.
The augmented Lagrangian function of the proposed model (8) is as follows:

L_β(θ, u; b) = (λ/2) ‖u − f‖₂² + TGV²_α(u) + (β/2) ‖u − T_θ(z) + b‖₂²,   (9)

where λ and β are positive parameters and b is the scaled Lagrange multiplier. After proper initialization of the variables, the (k+1)-th iteration of the ADMM algorithm is as follows:

θ^{k+1} = argmin_θ (β/2) ‖u^k − T_θ(z) + b^k‖₂²,   (10)

u^{k+1} = argmin_u (λ/2) ‖u − f‖₂² + TGV²_α(u) + (β/2) ‖u − T_{θ^{k+1}}(z) + b^k‖₂²,   (11)

b^{k+1} = b^k + u^{k+1} − T_{θ^{k+1}}(z).   (12)

Eq. (10) is solved inexactly using the Adam iterative scheme [38].
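The structure of this scheme can be illustrated on a drastically simplified stand-in (our sketch, not the paper's solver): the generator is replaced by the identity map on its parameters and the TGV subproblem by an l1 proximal step (soft-thresholding), which keeps every ADMM ingredient (exact θ-step, proximal u-step, dual update) while remaining a few lines of NumPy:

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding: the proximal operator of t * ||x||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def admm_toy(f, lam=1.0, mu=0.5, beta=1.0, iters=200):
    """ADMM for min_{theta,u} lam/2 ||u-f||^2 + mu ||u||_1  s.t.  u = theta,
    mimicking the splitting u = T_theta(z) with an identity 'generator'."""
    u = f.copy()
    theta = f.copy()
    b = np.zeros_like(f)
    for _ in range(iters):
        theta = u + b                          # theta-step (exact here; Adam in the paper)
        u = soft((lam * f + beta * (theta - b)) / (lam + beta),
                 mu / (lam + beta))            # u-step: fidelity + regularization prox
        b = b + u - theta                      # dual (multiplier) update
    return u, theta

f = np.array([2.0, -1.0, 0.3, -0.2, 5.0])
u, theta = admm_toy(f)
# The iterates satisfy the constraint u = theta (to machine precision) and u
# matches the closed-form solution soft(f, mu/lam) of the underlying problem.
```

In the actual model, the soft-thresholding step is replaced by the TGV subproblem (11) and the exact θ-step by inexact Adam iterations on the network weights; the alternation and the multiplier update are identical.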

IV. NUMERICAL EXPERIMENTS
In this section, we use several color and grayscale images as test images and corrupt them with white Gaussian noise with standard deviation equal to 15, 20, and 35, respectively. In our algorithm, the maximum number of iterations is 100, and we perform 50 Adam iterations to solve the θ^{k+1} subproblem (10) in the original variables. Due to space limitations, we give some numerical experimental results and report PSNR and SSIM values, aiming to evaluate the effectiveness of the proposed TGV regularization term in the image denoising problem. To illustrate the performance of the new model, the proposed model is compared with some state-of-the-art image denoising methods (DIP [19], DIP_TV [26], and DeepRED [23]) and traditional denoising methods (traditional TGV [8], ROF [32], and NL-means [39]). Among them, our method combines a TGV regularization term with DIP and does not require pre-training of the network. The parameters of each model are tuned manually to achieve a good denoising effect. In comparison, our method yields better denoising results.
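For reference, the two quality metrics can be computed as follows (standard definitions; the SSIM here uses global image statistics rather than the usual local Gaussian window, so it is a simplified variant):

```python
import numpy as np

def psnr(x, y, data_range=255.0):
    """Peak signal-to-noise ratio: 10 * log10(MAX^2 / MSE)."""
    mse = ((x - y) ** 2).mean()
    return 10.0 * np.log10(data_range**2 / mse)

def ssim_global(x, y, data_range=255.0):
    """SSIM from global means/variances (simplified: no sliding window)."""
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx**2 + my**2 + c1) * (vx + vy + c2))

a = np.tile(np.arange(64.0), (64, 1))   # simple gradient test image
b = a + 1.0                              # uniformly shifted by one gray level
# psnr(a, b) is 10*log10(255^2), about 48.13 dB, since the MSE is exactly 1.
```

Higher PSNR indicates lower pixel-wise error, while SSIM (equal to 1 for identical images) is more sensitive to structural agreement, which is why both are reported in the tables.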

A. PARAMETER SELECTION
We manually adjust the parameters to achieve the optimal effect.
For λ and β, we find that values in the ranges 0.1 < λ < 10 and 0 < β < 100 work well, with manual adjustment required for different test images. The experimental results show that the smaller the parameter λ, the better the image edge details are recovered, while the parameter β behaves in the opposite way. For τ and t, we find that the proposed method is relatively stable when they are selected in the range [0, 0.1].

B. EXPERIMENTAL RESULTS
In Fig.1, we test the butterfly image, whose original clean image is corrupted by Gaussian noise with a standard deviation of 20. Comparing DIP, DIP_TV, and DeepRED, it can be seen that the denoising effect in Fig.1(c) is not very good: the yellow part of the butterfly's upper right wing is still noisy. In Fig.1(d), some of the white dots and some black edges at the top of the butterfly's wings are removed, and some parts are too smooth and lose detail information. In Fig.1(e), texture details are missing, such as the small white dot on the upper right of the butterfly wing. Fig.1(f) shows the result of the proposed method, which preserves as much image detail as possible during denoising, thanks to the adaptive regularization of the structure T_θ(z) in this model. The PSNR and SSIM values are shown in Table 1 and Table 2; we can see that both have increased by adding the TGV-based regularization term compared with the other models. Fig.2 shows the denoising results for the house image; we compare the proposed model with DIP and DIP_TV, and crop and enlarge the three experimental results. We find that the chimney part of the house in Fig.2(d) still contains a lot of texture information, such as the white edge of the chimney, and the black spots on the roof are retained better. The texture of the chimney part in Fig.2(b) and Fig.2(c) is almost completely removed, and the sky still contains considerable noise. In the enlarged view, the edges of the house in Fig.2(e) are slightly blurred and there is noise in the sky compared with the original image. Although the edges are protected in Fig.2(f), there is an obvious staircase effect, while Fig.2(g) has a better denoising effect, eliminating the staircase effect and making the image clearer. Overall, our model surpasses the other results in terms of visual quality and detail, and the PSNR and SSIM values are improved.
Fig.3 shows the results of several models on a natural image. We can see that the antennae in the upper left of Fig.3(b) are almost removed. In Fig.3(c), the tree branches in the background and the texture of the roof are blurred. In contrast, the texture of the tiles on the roof in Fig.3(d) is relatively intact. Our method preserves both structural and textural details and works better than the other models without destroying important image information.
Figs.4, 5, and 6 show the denoising results on several grayscale images; the experimental results in Fig.6 are cropped and enlarged for display. Fig.4(b) shows the optimal denoised image of the original DIP method from [19]. We can observe that the chest section in Fig.4(b) is too smooth and loses some structural details; for example, the small lines inside the chest image are almost missing. Fig.4(c) shows the denoised image obtained by DIP_TV [26]; it can be seen that the outer circle of the chest is partially obscured, and the inner bone contours and fine bronchial lines are not clear. Fig.4(d) shows the result of the proposed method, which is relatively stable in terms of detail preservation and noise reduction. As can be seen in Fig.5, the circles in the upper row of Fig.5(b) are still noisy and the edges of the pattern are blurred; the two small circles in the last row in the upper right corner are missing, and part of the circles in the image are distorted. In Fig.5(c), although the noise removal is good, a small part of the middle of the cross in the lower right corner is missing. In Fig.5(d), our method performs better, and the structure of the circles and crosses is preserved. In Fig.6(b), the noise is obviously not removed cleanly; as can be seen from Fig.6(e), the edge of the figure's head is a little fuzzy and even still noisy. In Fig.6(c), there are many more blurred parts; because its weights are adaptive, fine structure is smoothed out along with the noise during denoising, so the visual effect is not very good. In Fig.6(f), an obvious staircase effect can clearly be seen. In contrast, Fig.6(d) has a better visual effect, with more detail at the edges of the structure.
In addition, it can be clearly seen from Fig.6(g) that the proposed model not only overcomes the staircase effect but also eliminates the noise, making the image closer to the original. The PSNR and SSIM values are shown in Table 1 and Table 2. From these comparisons, we can conclude that our method provides better recovery results: the edges are clearer and the fine structures are better reconstructed. With improvements in both PSNR and SSIM, our model outperforms the DIP and DIP_TV denoising models. For a more intuitive evaluation of the proposed algorithm, we computed the image quality metrics PSNR and SSIM. Table 1 and Table 2 give the numerical results of the proposed method, DIP, DIP_TV, and DeepRED at noise level σ = 20. The average PSNR of the new model is 31.481, which is 1.231 higher than DIP, 0.771 higher than DIP_TV, and 0.54 higher than DeepRED. The average SSIM is 0.934, at least 0.08 higher than the other methods. Overall, the new model works better than the others.
In Figs. 7 and 8, we add Gaussian noise with standard deviation equal to 15 and 20 to the experimental images, respectively, and compare our method with three conventional denoising methods, manually adjusting the parameters of each experiment to achieve the best results. Some numerical experimental results are given below, and PSNR and SSIM are reported in Table 3 and Table 4, aiming to evaluate the effectiveness of the proposed TGV regularization term in the image denoising problem. In comparison, the conventional methods cause the image to be over-smoothed and lose some details, while the proposed method obtains a better denoising effect.
As can be seen from the first row of Fig.7, the proposed method better preserves the detailed information of the butterfly, such as the small white dot on the top right of the butterfly's wing and the crease in the middle of the wing. In the second row, the outline of the building in the background is clear in the result of the proposed algorithm, and the texture of the lawn is better protected. In the third row, the texture information of the hat is better preserved, especially the texture of the hair-like part of the hat. In the fourth row, although the NL-means method removes the noise more cleanly, it removes the texture details of the rocks and grass as well; compared with the other three methods, our method preserves the textures better, and the edges of the tower and the house are more clearly defined. As we can see in the fifth row, the new method protects the texture details of the grass, the trees behind the house, and the poles very well. In general, the proposed method better protects the texture and structural edges of the image during denoising. Also, from Table 3 and Table 4, the average PSNR of the new model is 31.359, at least 1.282 higher than the other methods, and the average SSIM is 0.932, at least 0.26 higher than the other methods. In general, the numerical results of the new model are better, which shows the effectiveness of the proposed method.
In Fig.8, we show the denoising results of several images at noise level σ = 20. From the first row of Fig.8, we can see that the proposed method retains the details of the butterfly better, such as the small white dot on the top right of the butterfly's wing and the crease in the middle of the wing. In the second row, the edges of the circles in the upper part of the images are blurred and irregular for the other three methods, while the edges of the circles produced by the proposed algorithm are clear and the denoising effect is better. In the third row, the outline of the building in the background is clear in the result of the proposed algorithm, and more of the lawn texture is retained. The texture information in the hat part of the image is better preserved in the fourth row, while the other methods remove some of the texture. In the fifth row, our method preserves the texture of the roof intact, and the details of the roof poles and the trees behind are better preserved. In addition, we add four images in Table 5 and Table 6 for experiments; due to space limitations, these images are not shown here. The numerical results show that the average PSNR of the new algorithm is 31.244, at least 0.943 higher than the other methods; similarly, the average SSIM is 0.908, at least 0.014 higher. In summary, the proposed method has a better denoising effect.
We also give some results in Table 7 and Table 8 at noise level σ = 35. The average PSNR and SSIM of the new model are 27.103 and 0.875, respectively, both higher than those of the compared methods. These results show that the new model has better denoising performance and reconstruction quality, outperforming the other three methods.

V. CONCLUSION
This paper presents a new model extending the classical DIP framework using TGV as a regularizer. The regularization term provides a way to balance the first- and second-order derivatives of a function: it protects details with the first derivative in edge and texture areas and removes noise with the higher-order derivative in flat areas of the image. It is thus more flexible than the TV functional, better protects image edges, and provides more reliable recovery. We converted the constrained problem into an unconstrained problem using the augmented Lagrangian method and solved the optimization problem within the flexible ADMM framework, which simplifies the subproblems by applying the Legendre-Fenchel transformation. Numerical experiments show that the new model can protect structural edges while removing noise, making the edges and texture of the image clearer and avoiding the staircase effect. Compared with other methods, the new method has a better denoising effect on noisy images, and its average PSNR and SSIM increase by at least 0.97 and 0.052, respectively, under higher noise. The overall results are better than those of other models. For future work, we hope to further study the model for stronger noise removal, although its parameters are difficult to tune.
JIANLOU XU received the Ph.D. degree in applied mathematics from Xidian University, Xi'an, in 2013. He is currently an Associate Professor with the School of Mathematics and Statistics, Henan University of Science and Technology, China. His current research interests include partial differential equations, variational methods, sparse representations, and machine learning for image processing.
SHAOPEI YOU is currently pursuing the M.S. degree with the School of Mathematics and Statistics, Henan University of Science and Technology, China. Her current research interests include variational methods, numerical optimization, and deep learning for image processing.
YUYING GUO is currently pursuing the M.S. degree with the School of Mathematics and Statistics, Henan University of Science and Technology, China. Her current research interests include variational methods, numerical optimization, and deep learning for image processing.
YAJING FAN is currently pursuing the M.S. degree with the School of Mathematics and Statistics, Henan University of Science and Technology, China. Her current research interests include variational methods and deep learning for image processing.