Deep Equilibrium Learning of Explicit Regularizers for Imaging Inverse Problems

There has been significant recent interest in the use of deep learning for regularizing imaging inverse problems. Most work in the area has focused on regularization imposed implicitly by convolutional neural networks (CNNs) pre-trained for image reconstruction. In this work, we follow an alternative line of work based on learning explicit regularization functionals that promote preferred solutions. We develop the Explicit Learned Deep Equilibrium Regularizer (ELDER) method for learning explicit regularizers that minimize a mean-squared error (MSE) metric. ELDER is based on a regularization functional parameterized by a CNN and a deep equilibrium learning (DEQ) method for training the functional to be MSE-optimal at the fixed points of the reconstruction algorithm. The explicit regularizer enables ELDER to directly inherit fundamental convergence results from optimization theory. On the other hand, DEQ training enables ELDER to improve over existing explicit regularizers without prohibitive memory complexity during training. We use ELDER to train several approaches to parameterizing explicit regularizers and test their performance on three distinct imaging inverse problems. Our results show that ELDER can greatly improve the quality of explicit regularizers compared to existing methods, and show that learning explicit regularizers does not compromise performance relative to methods based on implicit regularization.


I. INTRODUCTION
The recovery of an unknown image from a set of noisy measurements is one of the most widely-studied problems in computational imaging. The task is often formulated as an inverse problem, and solved by integrating the measurement model characterizing the response of the imaging instrument with a regularization functional imposing prior knowledge of the unknown image. Over the years, many regularizers have been proposed as image priors, including those based on transform-domain sparsity, low-rank penalty, and self-similarity. The focus in the area has recently shifted to methods based on deep learning (DL). Instead of explicitly defining a regularization functional, DL approaches for solving inverse problems learn a mapping from the measurements to the desired image by training a convolutional neural network (CNN) [1], [2].
Model-based DL (MBDL) has emerged as an alternative to traditional DL, where knowledge of the measurement model is combined with an image prior specified by a CNN (see reviews [3], [4]). For example, plug-and-play priors (PnP) is a widely-used MBDL regularization method based on specifying the image prior using a pre-trained image denoiser [4], [5], [6], [7]. Deep unfolding (DU) is another MBDL approach based on interpreting a fixed number of iterations of an algorithm as layers of a neural network and training it end-to-end in a supervised fashion [8]. DU architectures, however, are usually limited to a small number of unfolded iterations due to the high memory complexity of training. Deep equilibrium learning (DEQ) is an alternative to DU that enables training of very deep neural networks with a constant memory complexity in the number of iterations [9], [10], [11], [12].
Existing research on MBDL for inverse problems, including most of the work on PnP, DU, and DEQ, has largely focused on regularization implicit in pre-trained neural networks [13]. While this strategy has led to state-of-the-art performance, it requires stringent assumptions on the MBDL architecture to ensure algorithmic stability (see related reviews [4], [14]). For example, a common assumption used for proving stability of MBDL architectures is that the CNN is nonexpansive [10], [15], [16], [17], [18]. An alternative line of work has explored MBDL with learned explicit regularization functionals parameterized by neural networks [19], [20], [21], [22], [23]. Explicit regularization functionals significantly simplify the convergence analysis due to the direct applicability of optimization theory. Throughout this article, we will use the term explicit to specify the direct availability of the regularization functional.
We develop the Explicit Learned Deep Equilibrium Regularizer (ELDER) as a new method for learning explicitly defined regularization functionals for imaging inverse problems. This method seeks to learn a regularizer that achieves the smallest value of mean-squared error (MSE) for a given inverse problem. To this end, we parameterize the regularization functional by a CNN and train it end-to-end by using DEQ on the MSE loss. Similarly to existing approaches for learning explicit regularizers, ELDER directly inherits traditional concepts from optimization for parameter selection and convergence analysis. ELDER thus improves the stability and flexibility of DEQ by introducing a new forward pass based on minimizing an explicit objective function using line-search. On the other hand, ELDER outperforms existing PnP methods based on explicit regularization functionals due to its ability to optimize learned functionals near the fixed points of the reconstruction algorithm. We present numerical results on three imaging inverse problems: image super-resolution, image reconstruction from subsampled Fourier measurements, and image inpainting. We apply ELDER to optimize the weights of three parameterization approaches for regularization functionals, namely the least-squares residual (LSR), regularization by denoising (RED), and direct scalar-valued network (DSV) parameterizations, and compare the effectiveness of all three as regularizers when trained to be MSE-optimal. We also show that ELDER does not compromise imaging quality relative to methods based on implicit regularization. Our results show that ELDER achieves excellent imaging results, while also offering the potential for automatic step-size selection and algorithmic stability, even when using expansive update rules. In short, our work provides a method to learn state-of-the-art explicit regularizers for MBDL that preserve traditional tools from optimization theory. Our code is publicly available at https://github.com/wustl-cig/ELDER.

II. BACKGROUND
Inverse Problems: We consider the imaging inverse problem of recovering an unknown image x ∈ R^n from noisy measurements y = Ax + e, where A ∈ R^{m×n} is the measurement operator and e ∈ R^m is an additive white Gaussian noise (AWGN) vector. The problem is traditionally formulated as an optimization problem

x̂ = argmin_{x ∈ R^n} f(x) with f(x) = g(x) + τ h(x), (1)

where τ > 0 is the regularization parameter, g is the data-fidelity term enforcing consistency of the solution with y, and h is the regularizer enforcing prior knowledge of x.
Model-based Optimization: Proximal algorithms are often used for solving problems of the form (1) when h is nonsmooth (see the review [24]). One widely used family of proximal algorithms for imaging inverse problems is the proximal gradient method (PGM) [25]. PGM avoids differentiating h by using the proximal operator, defined as

prox_{τh}(z) := argmin_{x ∈ R^n} { (1/2)‖x − z‖²₂ + τ h(x) } (2)

with τ > 0, for any proper, closed, and convex function h [24]. Comparing (2) and (1), we see that the proximal operator can be interpreted as a MAP estimator for an AWGN denoising problem by setting h(x) = −log(p_x(x)), where p_x denotes the prior density of x. Another less known but equally valid statistical interpretation of the proximal operator is as a minimum mean-squared error (MMSE) estimator [26].

PnP: PnP refers to a family of algorithms that integrate measurement operators and CNN denoisers for solving inverse problems (see the recent review [4]). Since the prior in PnP is implicit in the denoiser, it is common to interpret PnP as fixed-point iterations of some high-dimensional operators [27]. For example, given a denoiser D_θ : R^n → R^n parameterized by a CNN with weights θ, the iterations of the proximal gradient method (PGM) [25] variant of PnP can be written as

x^k = D_θ((I − γ∇g)(x^{k−1})), (3)

where g is the data-fidelity term in (1), I is the identity mapping, and γ > 0 is the step size.

DU and DEQ: DU is a DL paradigm that has gained popularity due to its ability to systematically connect iterative algorithms and deep neural network architectures (see reviews in [3], [28]). DU training is usually performed end-to-end by minimizing an MSE loss on the output of a fixed number K of iterations of an operator T_θ,

x̂(θ) = (T_θ ∘ ⋯ ∘ T_θ)(x⁰), (4)

where the composition is applied K times. DEQ [9] is a related approach that enables training of infinite-depth, weight-tied networks by analytically backpropagating through the fixed points using implicit differentiation.
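To make the proximal operator (2) and PGM concrete, the following self-contained sketch (our illustration, not part of the original method) applies proximal-gradient iterations to a toy sparse recovery problem, using the classic soft-thresholding prox of the ℓ1 regularizer; all variable names and parameter values are illustrative.

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||x||_1, a classic nonsmooth regularizer.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pgm(A, y, tau, gamma, num_iters=500):
    # Proximal gradient method for (1/2)||y - A x||^2 + tau * ||x||_1:
    # a gradient step on the data-fidelity term, then the prox of tau*||.||_1.
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - y)          # gradient of g(x) = (1/2)||y - Ax||^2
        x = soft_threshold(x - gamma * grad, gamma * tau)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 10))
x_true = np.zeros(10)
x_true[:3] = [1.0, -2.0, 0.5]              # sparse ground truth
y = A @ x_true                              # noiseless measurements
gamma = 1.0 / np.linalg.norm(A.T @ A, 2)    # step size 1/L ensures convergence
x_hat = pgm(A, y, tau=0.1, gamma=gamma)
```

With a noiseless, well-conditioned problem, the iterates converge to a solution close to the sparse ground truth (up to the bias introduced by the ℓ1 penalty).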
The DEQ output is specified implicitly as a fixed point of an operator T_θ parameterized by weights θ,

x̂ = T_θ(x̂). (5)

The DEQ forward pass usually estimates x̂ in (5) by running a fixed-point iteration. The comparison of (4) and (5) highlights the connection between PnP and DEQ, which was recently explored [10] by using DEQ for end-to-end learning of the weights of the CNN prior D_θ. There has been considerable interest in DEQ for imaging, including in MRI [10], [29], computed tomography (CT) [18], and video snapshot imaging [30]. The relationship between DEQ and bilevel learning in the context of inverse problems was recently explored in [12].

The convergence of the forward iterations is essential for the stability and accuracy of DEQ. Similar to the theoretical analysis of PnP [15], [31], a sufficient condition to guarantee the convergence of the DEQ forward pass is to ensure that g is convex and that the residual

R_θ := I − D_θ (6)

is nonexpansive. Since most CNN architectures do not inherently satisfy this property, several methods have been proposed to train Lipschitz-constrained CNNs [15], [32]. However, it has been observed that constraining CNNs to be Lipschitz continuous can negatively impact their performance [21], [23].

Learning Explicit Regularization Functionals: RED [19] is an early approach for specifying an explicit regularization functional by parameterizing it using an image denoiser D_θ,

h(x) = (1/2) xᵀ(x − D_θ(x)). (7)

When the denoiser is locally homogeneous and has a symmetric Jacobian [19], [27], the gradient of the RED regularizer h has the simple expression

∇h(x) = x − D_θ(x), (8)

which enables efficient minimization of the RED functional within (1). However, when these rather stringent conditions are not satisfied, RED does not correspond to an explicit regularizer, corresponding instead to an MBDL method with an implicit regularizer [27].
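The role of the homogeneity and Jacobian-symmetry conditions behind the simple RED gradient expression (8) can be checked numerically. The sketch below (our illustration, not from the paper) uses a toy linear "denoiser" with a symmetric Jacobian, for which the expression x − D(x) matches the true gradient of the RED functional (7):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 5))
W = 0.5 * (B + B.T)            # symmetric matrix, so the Jacobian is symmetric

def D(x):
    # A linear "denoiser": homogeneous (D(a*x) = a*D(x)) with symmetric Jacobian W.
    return W @ x

def h_red(x):
    # RED regularizer h(x) = (1/2) x^T (x - D(x)).
    return 0.5 * x @ (x - D(x))

x = rng.standard_normal(5)
eps = 1e-6
# Central-difference numerical gradient of h_red.
num_grad = np.array([(h_red(x + eps * e) - h_red(x - eps * e)) / (2 * eps)
                     for e in np.eye(5)])
red_grad = x - D(x)            # the simple RED gradient expression
```

For an asymmetric W the two gradients would no longer agree, which is exactly the failure mode discussed above.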
Other notable approaches for learning explicit regularization functionals include fields of experts (FoE) [33] and its bi-level counterpart [34], adversarial regularization [35] and its convex counterpart [36], network Tikhonov (NETT) [37], and total deep variation (TDV) [38]. These approaches are similar to ELDER in the sense that they seek to explicitly parametrize h using a neural network, but differ in the way the functional is learned. Another line of work has explored gradient-step denoisers for PnP based on the direct parameterization of regularization functionals, thus leading to explicit loss functions and convergence guarantees without Lipschitz constraints on the neural networks [20], [21], [39].

III. PROPOSED METHOD
We now present our proposed method. Unlike the traditional DEQ approach for inverse problems [10], the forward pass in our method is designed to minimize an explicit objective, where the regularization functional is parameterized by a CNN. Our method inherits the benefits of having an explicit regularizer, such as convergence without Lipschitz constraints on the CNN and the possibility of using line-search strategies for step-size selection. At the same time, our method leads to state-of-the-art explicit regularizers that are trained to be MSE optimal at the fixed-point of the inference algorithm.

A. ELDER FORWARD PASS
We consider an inverse problem with a regularization functional h_θ : R^n → R parameterized by weights θ, leading to the objective

f_θ(x) = g(x) + τ h_θ(x). (9)

The forward pass of ELDER seeks to minimize f_θ via the iterations

x^k = prox_{γg}(x^{k−1} − γτ ∇h_θ(x^{k−1})), (10)

where γ > 0 is the step size. For a linear inverse problem with a 2-norm loss g(x) = (1/2)‖y − Ax‖²₂, the proximal operator has the closed-form solution

prox_{γg}(z) = (I + γAᴴA)⁻¹(z + γAᴴy), (11)

where I is the identity matrix and Aᴴ denotes the conjugate transpose of A. In many applications, the proximal map (11) can be computed or approximated efficiently using general methods, such as conjugate gradient, or with specialized methods, such as when the forward model is a spatial blurring operator, in which case it can be evaluated using the fast Fourier transform (FFT) [4], [40], [41].

The step size γ must be carefully selected to ensure the convergence of the algorithm in (10). Since we have access to the explicit objective function, we can ensure convergence by adopting a line-search strategy. To that end, we use a backtracking line-search (BLS) to enforce the sufficient-decrease condition (see Section IX-B in [42])

f_θ(x^k) ≤ f_θ(x^{k−1}) − (1/(2γ))‖x^k − x^{k−1}‖²₂. (12)

Our implementation starts with a large step size γ > 0, which is subsequently reduced by a factor 0 < β < 1 until (12) is satisfied.
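To make the forward pass concrete, here is a minimal numerical sketch (our illustration, with a simple quadratic stand-in for the learned regularizer h_θ) of the prox-gradient iteration (10) with the closed-form prox (11) and a backtracking line-search enforcing (12):

```python
import numpy as np

def elder_forward(A, y, h, grad_h, gamma0=1.0, beta=0.5, num_iters=50):
    # ELDER-style forward pass (sketch): proximal-gradient iterations on
    # f(x) = g(x) + h(x) with g(x) = (1/2)||y - A x||^2, using backtracking
    # line-search (BLS) to enforce the sufficient-decrease condition
    #   f(x_next) <= f(x) - (1/(2*gamma)) * ||x_next - x||^2.
    # The prox of g has the closed form (I + gamma A^T A)^{-1}(z + gamma A^T y).
    n = A.shape[1]
    f = lambda x: 0.5 * np.sum((y - A @ x) ** 2) + h(x)
    x = np.zeros(n)
    for _ in range(num_iters):
        gamma = gamma0
        for _ in range(50):  # safeguarded backtracking
            z = x - gamma * grad_h(x)                         # gradient step on h
            x_next = np.linalg.solve(np.eye(n) + gamma * (A.T @ A),
                                     z + gamma * (A.T @ y))   # prox of g
            # Small slack avoids floating-point stalls near convergence.
            if f(x_next) <= f(x) - np.sum((x_next - x) ** 2) / (2 * gamma) + 1e-12:
                break
            gamma *= beta                                     # shrink the step size
        x = x_next
    return x, f(x)

# Toy problem: a smooth quadratic stand-in for the learned term tau * h_theta.
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 8))
y = rng.standard_normal(20)
tau = 0.1
h = lambda x: 0.5 * tau * np.sum(x ** 2)
grad_h = lambda x: tau * x
x_hat, f_hat = elder_forward(A, y, h, grad_h)
```

For this quadratic instance the fixed point has the closed form (AᵀA + τI)⁻¹Aᵀy, which provides a direct check of the iteration.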

B. PARAMETERIZATION OF REGULARIZATION FUNCTIONALS
Several approaches for explicitly parametrizing regularization functionals using CNNs have been previously proposed. We consider three well-known approaches and compare them in our numerical evaluations. All three regularizers are based on an operator G θ : R n → R n corresponding to an image-to-image CNN. Note that G θ can be implemented using any differentiable CNN architecture.
Least-Squares Residual (LSR): Inspired by one of the formulations suggested in [19] (see Section V-B) and by the recent work on the gradient-step denoiser for PnP [21], we first consider a regularizer that explicitly quantifies the distance between the input and the output of G_θ. More specifically, we consider

h_LSR(x) = (1/2)‖x − G_θ(x)‖²₂ with ∇h_LSR(x) = (I − J_{G_θ}(x))ᵀ(x − G_θ(x)), (13)

where J_{G_θ} is the Jacobian of G_θ with respect to x. LSR can be interpreted as promoting solutions to the inverse problem that are near the fixed points of the CNN G_θ.
(Real) RED Regularizer: While the RED gradient in (8) has been extensively used for solving inverse problems, the RED functional (7) has not been previously used as a regularizer. We explore the RED functional itself as a regularizer by considering the true gradient of h_RED,

∇h_RED(x) = x − (1/2)(G_θ(x) + J_{G_θ}(x)ᵀx), (14)

where we do not require that the network G_θ is locally homogeneous or has a symmetric Jacobian.

Direct Scalar-Valued (DSV) Regularizer: Similar to [20] and [43], we can also directly sum the output of the CNN G_θ to obtain a scalar-valued neural network

h_DSV(x) = Σᵢ [G_θ(x)]ᵢ. (15)

It is worth noting that, for all the considered regularizers, we use conventional auto-differentiation tools to efficiently compute the Jacobian-vector products without ever storing the entire Jacobian matrix.
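As a sanity check of the LSR parameterization, the following toy sketch (ours; an affine map G stands in for the CNN G_θ, so its Jacobian is simply the matrix W) verifies the gradient ∇h_LSR(x) = (I − J_{G_θ}(x))ᵀ(x − G_θ(x)) against a numerical gradient; for a real CNN, autodiff would supply the required Jacobian-vector products.

```python
import numpy as np

rng = np.random.default_rng(3)
W = 0.1 * rng.standard_normal((6, 6))
b = 0.1 * rng.standard_normal(6)

def G(x):
    # Toy affine stand-in for the image-to-image CNN G_theta (Jacobian is W).
    return W @ x + b

def h_lsr(x):
    # LSR regularizer: squared distance between the input and the network output.
    return 0.5 * np.sum((x - G(x)) ** 2)

def grad_h_lsr(x):
    # Analytic gradient (I - J_G)^T (x - G(x)); here J_G = W exactly.
    return (np.eye(6) - W).T @ (x - G(x))

x = rng.standard_normal(6)
eps = 1e-6
# Central-difference numerical gradient for comparison.
num_grad = np.array([(h_lsr(x + eps * e) - h_lsr(x - eps * e)) / (2 * eps)
                     for e in np.eye(6)])
```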

C. JACOBIAN-FREE DEEP EQUILIBRIUM LEARNING
We train the regularizer h_θ by minimizing the discrepancy between a fixed point x̂ = T_θ(x̂) obtained via (10) and the ground-truth image x, using the MSE loss L(θ) = (1/2)‖x̂(θ) − x‖²₂. The DEQ backward pass produces gradients by implicitly differentiating through the fixed points,

∇L(θ) = (∂T_θ(x̂)/∂θ)ᵀ (I − ∂T_θ(x̂)/∂x̂)⁻ᵀ (x̂ − x). (16)

This converts the memory-intensive task of backpropagating through many iterations of T_θ into the problem of calculating an inverse-Jacobian-vector product. Since inverting the Jacobian matrix in (16) can be computationally expensive, we adopt an approximation that replaces the inverse-Jacobian term in (16) with the identity, as in [11],

∇L(θ) ≈ (∂T_θ(x̂)/∂θ)ᵀ (x̂ − x).
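The difference between the exact implicit gradient (16) and its Jacobian-free approximation can be seen on a toy weight-tied contraction (our illustrative example; the map, its parameters, and the loss are all hypothetical):

```python
import numpy as np

# Toy weight-tied fixed-point map T_theta(x) = c*x + theta*u with |c| < 1,
# whose fixed point is x_star = theta * u / (1 - c), and loss
# L(theta) = 0.5 * ||x_star - x_gt||^2.
c, theta = 0.6, 0.8
u = np.array([1.0, -2.0, 0.5])
x_gt = np.array([1.5, -3.0, 1.0])

x_star = theta * u / (1 - c)
residual = x_star - x_gt

# Exact implicit-function gradient: (dT/dtheta)^T (I - dT/dx)^{-T} residual.
# Here dT/dtheta = u and dT/dx = c*I, so the inverse term is 1/(1 - c).
grad_exact = u @ residual / (1 - c)
# Jacobian-free approximation [11]: replace the inverse-Jacobian term by identity.
grad_jfree = u @ residual

# Finite-difference check of the exact gradient.
eps = 1e-6
L = lambda th: 0.5 * np.sum((th * u / (1 - c) - x_gt) ** 2)
grad_fd = (L(theta + eps) - L(theta - eps)) / (2 * eps)
```

Note that the approximate gradient differs from the exact one only by the positive factor (1 − c), so it remains a descent direction, which is the intuition behind the Jacobian-free backward pass.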

FIGURE 1. Visual illustration of the results obtained by ELDER using the three parameterization approaches for h_θ. We show results for three inverse problems: image super-resolution (top), reconstruction from Fourier samples (middle), and image inpainting (bottom). The rightmost panel plots the evolution of PSNR (dB) across forward iterations. Note how LSR achieves an overall better performance compared to DSV and RED.
This Jacobian-free approximation significantly reduces the complexity of the backward pass without compromising the quality of DEQ.

D. CONVERGENCE THEORY
Since ELDER minimizes an explicit objective f_θ = g + τh_θ, it directly inherits traditional convergence results from optimization theory. For completeness, we state these convergence results using our notation.

Assumption 1: The function g is proper, convex, and lower semicontinuous. The function h_θ : R^n → R is proper, lower semicontinuous, finite-valued, and differentiable with an L-Lipschitz gradient. The function f_θ is bounded from below.
These assumptions on g and h θ are standard, and are satisfied by a large number of functions used in the context of inverse problems (see, for example, the discussion in [21]).
Under Assumption 1, the iterates (10) equipped with the backtracking line-search (12) monotonically decrease f_θ, satisfy ‖x^k − x^{k−1}‖₂ → 0, and every accumulation point of the sequence {x^k} is a stationary point of f_θ. The proof is widely available in existing analyses of PGM [21]. The importance of this result is that, unlike in traditional DEQ based on implicit regularization [10], the stability of the ELDER forward pass does not require any nonexpansiveness assumptions on the CNN prior.

E. ADDITIONAL TECHNICAL DETAILS
Our method is compatible with any differentiable CNN architecture for implementing G_θ. We use the simplified DRUNet architecture [41] for both ELDER and the traditional DEQ [10]. We have replaced the rectified linear unit (ReLU) activations with exponential linear unit (ELU) activations to ensure the smoothness of h_θ, and we limit the number of residual blocks at each scale to 2. Similar to [10], the CNN prior of ELDER is initialized using a pre-trained denoiser. Additionally, we follow [10], [44] in setting the convergence criterion during training to

‖x^k − x^{k−1}‖₂ / ‖x^{k−1}‖₂ ≤ ε,

where we choose ε = 10⁻² for all experiments.

IV. NUMERICAL RESULTS

A. COMPARING PARAMETRIZATION STRATEGIES
We first compare the performance of the three parameterization strategies in Section III-B on image denoising. We pre-train all the regularizers by adopting the gradient-step denoising strategy from [21]. Pre-training explicit regularizers as denoisers is computationally useful for ELDER, since the denoisers can be used to initialize the DEQ learning. All three explicit denoisers are compared against DRUNet [41], which was shown to result in state-of-the-art PnP image restoration. We employ the color image training dataset in [41]. During training, we consider AWGN with standard deviation σ drawn uniformly at random from the range [0, 55/255]. Table 1 presents the results of all denoisers at several noise levels (σ ∈ {5, 15, 25, 50}) on both color and gray images. The RED denoiser in the table refers to the gradient step in (14). While all three explicit denoisers perform well at all noise levels, LSR most closely matches the performance of DRUNet. It is worth noting that LSR (with 2 residual blocks at each scale) uses fewer parameters than DRUNet (with 4 residual blocks). Note that the PSNR gap between DRUNet and LSR is within 0.15 dB, which implies that having explicit regularizers does not significantly impair performance (this was also observed in [21]). Table 2 presents the results of comparing the three parameterization strategies within ELDER on the three inverse problems: single image super-resolution, reconstruction from Fourier measurements, and image inpainting (each problem is discussed in its dedicated section below). We observe that LSR leads to the best results, motivating its use as the primary parameterization strategy for ELDER.

B. SINGLE IMAGE SUPER-RESOLUTION
We consider measurements of the form y = SHx + e, where H ∈ R^{n×n} is the blurring matrix and S ∈ R^{m×n} performs standard d-fold down-sampling with d² = n/m. When the blur satisfies circular boundary conditions, the blurring matrix and its conjugate transpose can be decomposed as H = FᴴMF and Hᴴ = FᴴMᴴF, where F is the unitary discrete Fourier transform (with Fᴴ its inverse, satisfying FFᴴ = FᴴF = I), and M ∈ C^{n×n} is a diagonal matrix whose diagonal elements are the Fourier coefficients of the blur. The proximal operator (11) of g(x) = (1/2)‖y − SHx‖²₂ then has the closed-form solution (see Lemma 1 in [47])

prox_{γg}(z) = r − Fᴴ M̲ᴴ (γ⁻¹d²I + M̲M̲ᴴ)⁻¹ M̲ F r, where r = γHᴴSᴴy + z,

and M̲ = [M₁, ⋯, M_{d²}] ∈ C^{m×n} concatenates the blocks M_i ∈ C^{m×m} of the block-diagonal decomposition M = diag{M₁, ⋯, M_{d²}} induced by a d × d tiling of the Fourier plane. Note that the inverse of the diagonal matrix γ⁻¹d²I + M̲M̲ᴴ can be computed element-wise.
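The d = 1 special case of this closed form (circular deblurring without downsampling) reduces to an element-wise division in the Fourier domain and is easy to sketch with FFTs. The code below (our illustration; the image size, blur kernel, and step size are arbitrary) verifies the first-order optimality condition of the prox objective:

```python
import numpy as np

def prox_deblur(z, y, m_hat, gamma):
    # Closed-form prox of g(x) = (1/2)||y - H x||^2 for a circular blur
    # H = F^H M F (the d = 1, no-downsampling case of the SISR formula):
    #   prox(z) = F^H [ (F z + gamma * conj(m_hat) * F y) / (1 + gamma*|m_hat|^2) ].
    # m_hat holds the Fourier coefficients of the blur kernel (diagonal of M).
    num = np.fft.fft2(z) + gamma * np.conj(m_hat) * np.fft.fft2(y)
    den = 1.0 + gamma * np.abs(m_hat) ** 2
    return np.real(np.fft.ifft2(num / den))

rng = np.random.default_rng(4)
x = rng.standard_normal((16, 16))
kernel = np.zeros((16, 16))
kernel[:3, :3] = 1.0 / 9.0                                # 3x3 uniform blur
m_hat = np.fft.fft2(kernel)
y = np.real(np.fft.ifft2(m_hat * np.fft.fft2(x)))         # y = H x (noiseless)
z = rng.standard_normal((16, 16))
gamma = 2.0
x_prox = prox_deblur(z, y, m_hat, gamma)
```

The optimality condition (x − z) + γHᴴ(Hx − y) = 0 should vanish at the output, which the test below checks numerically.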
We verify the effectiveness of ELDER on SISR using a large variety of blur kernels and different down-sampling factors. We use 8 realistic camera shake kernels tested in [41], plus a 9 × 9 uniform kernel, and a 25 × 25 Gaussian kernel with standard deviation 1.6. All 10 kernels are presented in Table 3. We use the same dataset in Section IV-A for training ELDER. We train a single model on all blur kernels, each with downsampling factors of d ∈ {2, 3, 4}, to test the generalizability under different SISR settings. We set the number of forward-iterations to K = 100. At every training iteration, we initialize x 0 with a shift-corrected bicubic interpolation of y [41]. Additionally, during training we use AWGN with random noise levels σ ∈ [0, 10.0]/255 to ensure robustness of our model to different noise perturbations.
We compare ELDER and DEQ against bicubic interpolation, RCAN [45], and the state-of-the-art PnP methods IRCNN [46], DPIR [41], and GSPnP [21]. We use the publicly available implementations for all the baseline methods. RCAN refers to the PSNR-oriented deep model based on bicubic degradation. GSPnP [21] is a PnP method using a pre-trained denoising CNN as an explicit regularization functional, with an algorithmic update similar to the ELDER forward pass. DPIR uses the original DRUNet in Table 1, while GSPnP uses the LSR gradient-step denoiser. The regularization and step-size parameters of GSPnP, DEQ, and ELDER were optimized at test time using fminbound in scipy.optimize. Table 3 summarizes the PSNR values achieved by ELDER and the other methods when applied to {σ = 7.65, d = 3} and {σ = 2.55, d = 4} on the CBSD68 dataset. It is clear that both ELDER and DEQ outperform the other methods at all settings. ELDER also matches the overall performance of DEQ, while performing better for higher noise levels at d = 3. The excellent performance of ELDER relative to DEQ highlights that the use of an explicit regularizer does not imply a compromise in performance. Fig. 2 visually compares ELDER against the baseline methods at scale factors ×3 (top) and ×4 (bottom), respectively. The enlarged regions in the images suggest that ELDER better recovers fine details and sharper edges compared to DPIR and GSPnP, while providing the same or better PSNR than DEQ. These results indicate that ELDER reaches state-of-the-art performance in PnP/DEQ SISR for a variety of kernels and noise levels. Fig. 3 illustrates the convergence behavior of the forward pass of ELDER in terms of the objective f(x^k) (left) and the residual ‖x^k − x^{k−1}‖₂/γ (middle), when tested on a subset of 10 color images taken from the original CBSD68 dataset (CBSD10). Fig. 3 (right) shows the corresponding step size γ used in the experiment. The optimal step size for each test image is optimized for the best PSNR value.
Note how ELDER using the backtracking line-search strategy enables automatic selection of the step-size parameter. Fig. 4 compares the convergence of ELDER and GSPnP in terms of PSNR for SISR with scale factor 3. The figure also shows visual results for both methods at different iterations. Note how ELDER learns a regularizer that leads to an improved PSNR compared to GSPnP. Since both methods share the same parametrized regularization functional, the improvement is due to DEQ learning of the regularizer.

C. COMPRESSED SENSING MRI (CS-MRI)
MRI is a widely-used medical imaging technology that is limited by the low speed of data acquisition. CS-MRI seeks to address this limitation by recovering an image x from its sparsely-sampled Fourier measurements. We simulate a simplified noiseless single-coil CS-MRI setup using radial Fourier sampling. The measurement operator is thus A = BF, where B is a diagonal sampling matrix with values in {0, 1} and F is the unitary discrete Fourier transform. The proximal operator of g(x) = (1/2)‖y − BFx‖²₂ has the closed-form solution

prox_{γg}(z) = Fᴴ(I + γB)⁻¹(Fz + γBᴴy),

where the inverse of the diagonal matrix I + γB is computed element-wise. We train ELDER using the brain dataset from [8], which consists of 800 slices of 256 × 256 training images and 50 slices of testing images. Both ELDER and the conventional DEQ are trained on sampling ratios within [10%, 20%]. At every training iteration, we use the zero-filled image x⁰ = Aᴴy to initialize the forward pass with 100 iterations. Table 4 presents PSNR values obtained by ELDER and publicly available implementations of several well-known methods, including TV [48], ADMM-Net [49], as well as IRCNN, DPIR, GSPnP, ISP [20], and DEQ. TV is solved using the accelerated proximal gradient method [48]. ADMM-Net is a DU method that trains both the image transforms and the shrinkage functions within the algorithm. ISP refers to an MBDL method using explicit deep denoisers, similar to GSPnP. Overall, ELDER and DEQ achieve the best performance, indicating that having the explicit regularizer does not compromise performance. Fig. 5 provides visual examples at sampling ratio 10%, highlighting the imaging quality obtained by our method relative to several baseline methods. From the zoomed regions and the corresponding error maps, ELDER improves over DPIR and GSPnP due to the training of the explicit regularizer to be end-to-end MSE optimal.
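The CS-MRI prox above can be sketched in a few lines (our illustration; a complex-valued image, a random {0,1} mask, and unitary FFTs are assumed, with the diagonal inverse applied element-wise):

```python
import numpy as np

def prox_csmri(z, y, mask, gamma):
    # Closed-form prox of g(x) = (1/2)||y - B F x||^2 with a {0,1} diagonal
    # sampling mask b (the diagonal of B) and a unitary FFT:
    #   prox_{gamma g}(z) = F^H [ (F z + gamma * b * y) / (1 + gamma * b) ].
    z_hat = np.fft.fft2(z, norm="ortho")
    return np.fft.ifft2((z_hat + gamma * mask * y) / (1.0 + gamma * mask),
                        norm="ortho")

rng = np.random.default_rng(5)
x = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))  # complex image
mask = (rng.random((8, 8)) < 0.3).astype(float)   # ~30% sampled k-space
y = mask * np.fft.fft2(x, norm="ortho")           # sparsely-sampled Fourier data
z = rng.standard_normal((8, 8)) + 1j * rng.standard_normal((8, 8))
gamma = 1.5
x_prox = prox_csmri(z, y, mask, gamma)
```

The output satisfies the optimality condition (x − z) + γFᴴBᴴ(BFx − y) = 0 exactly, since every operation diagonalizes in the Fourier basis.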

D. IMAGE INPAINTING
We apply ELDER to image inpainting, characterized by the measurement model y = Px, where A = P is a random diagonal matrix with entries in {0, 1} and p ∈ [0, 1] denoting the probability of missing a pixel. We assume a noiseless setting. The data-fidelity term is the indicator function g(x) = ι_C(x) of the set C = {x ∈ R^n : y = Px}, which by definition is 0 when x ∈ C and +∞ otherwise. The proximal step in (10) then has the closed-form solution prox_{γg}(z) = y + z − Pz.
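The inpainting prox is simply the projection onto the measurement-consistent set C; a quick sketch (our illustration, with an arbitrary image size and mask density):

```python
import numpy as np

def prox_inpaint(z, y, mask):
    # Prox of the indicator of C = {x : y = P x}: keep the measured pixels
    # (mask == 1) from y and fill the missing pixels from z,
    #   prox(z) = y + z - P z.
    return y + z - mask * z

rng = np.random.default_rng(6)
x = rng.standard_normal((8, 8))
mask = (rng.random((8, 8)) < 0.5).astype(float)   # ~50% observed pixels
y = mask * x                                      # measured pixels
z = rng.standard_normal((8, 8))                   # current estimate
x_proj = prox_inpaint(z, y, mask)
```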
At the kth forward iteration of ELDER, the proximal operator returns an image consisting of the network output at the missing pixels and the measured values at the remaining locations. We train our model with the random sampling parameter p ∈ [0.3, 0.7]. To demonstrate the flexibility of ELDER, we consider p = 0.5 and p = 0.7 and compare against IRCNN, DPIR, GSPnP, and DEQ. Since IRCNN and DPIR lack implementations for inpainting, we apply our data-fidelity term to them, set the σ_K parameter to 3.0, and set the number of iterations to 100. The PSNR performance is reported in Table 5, and visual results are provided in Fig. 6. These results indicate that ELDER achieves better performance than GSPnP and nearly matches the performance of DEQ. Moreover, we observe from the visual results that GSPnP smoothes out fine details, while DPIR generates distortions. In contrast, ELDER recovers fine details while avoiding distortions.

V. CONCLUSION
ELDER is a new method for learning explicit regularization functionals for model-based deep learning in inverse problems. It parameterizes the regularizer using a CNN and learns its weights to minimize MSE values using DEQ. The benefit of having an explicit regularizer is that one directly inherits the fundamental results from optimization. This article shows that ELDER outperforms existing approaches for learning explicit regularizers for inverse problems. It also suggests that using an explicit regularization functional does not compromise imaging performance compared to the traditional DEQ methods based on implicit regularization.