Local Kernels That Approximate Bayesian Regularization and Proximal Operators

In this paper, we broadly connect kernel-based filtering (e.g., approaches such as the bilateral filter and non-local means, but also many more) with general variational formulations of Bayesian regularized least squares and the related concept of proximal operators. Variational/Bayesian/proximal formulations often result in optimization problems that do not have closed-form solutions and therefore typically require global iterative solutions. Our main contribution here is to establish how one can approximate the solution of the resulting global optimization problems using locally adaptive filters with specific kernels. Our results are valid for small regularization strength (i.e., weak noise), but the approach is powerful enough to be useful for a wide range of applications because we expose how to derive a “kernelized” solution to these problems that approximates the global solution in one shot, using only local operations. As another side benefit in the reverse direction, given a local data-adaptive filter constructed with a particular choice of kernel, we enable the interpretation of such filters in the variational/Bayesian/proximal framework.


I. INTRODUCTION
Kernel-based non-linear filtering, such as bilateral filter [1], [2] and non-local means [3], is one of the most effective and popular methods used in modern image processing [4].Besides denoising/smoothing, kernel methods have shown successes in many other tasks, including image enhancement [5], superresolution [6], and demosaicing [7].Recent techniques [8], [9] have also been proposed to speed up kernel methods, making them feasible to run on mobile devices.
On the other hand, Bayesian/variational/proximal methods have the advantage of well-defined objective functions, and allow the direct incorporation of penalty terms to control the properties of the resulting solutions.These estimators use explicit signal priors within the frameworks of maximum a posteriori (MAP) or minimum mean-squared error (MMSE).Moreover, the statistically based methods can easily be adapted to more general measurement models by changing the data fidelity term.However, a major drawback is that the solution often requires iterative global optimization methods, and is rarely used in practical situations where speed is not only desired, but absolutely required.his is not the first time the relationship between local filters and global regularization has been considered [10], [11].Among the many such works, three are most directly related to ours: Durand et al. [12] showed the relation between bilateral filters and robust statistics.Namely, that the bilateral is related to the first gradient step of minimizing a cost function with a specific robust regularization function.Durand's work built on the elegant work of Black et al. in [13] where they showed connections between three techniques for image filtering: anisotropic diffusion, robust statistics, and regularization.These connections made it possible to analyze, design, implement, and interpret anisotropic diffusion using the tools of robust statistics.Finally, in the relatively recent work of Louchet et al. [14], they proved (in the specific context of Total Variation (TV)) that the local interactions induced by TV do not propagate much at long distances in practice, so that the TV filtering model can be adequately interpreted and implemented as a local filter.In fact, they built a purely local filter by considering the TV model in a given neighborhood of each pixel, and illustrated that for small regularization parameters, it behaved very closely to the ideal global TV.
What is the contribution then of the present work in light of these illuminating previous works?
• With respect to [14], we aim to illustrate that the connections between local filtering and global optimization via a variational formulation go far beyond Total Variation, encompassing a much wider class of regularization functions.• The significant speedups enabled by the approach proposed in Durand's work [12] subsequently enabled a fantastically wide range of use cases for the bilateral filter, making it a standard tool in both vision and graphics.Our work aims to show that by localizing a much wider range of complex Bayesian/variational filters, similar speedups may be realized for much more sophisticated filters beyond the bilateral.• The work of Black et al. [13] generalized the notion of anisotropic diffusion using tools from robust statistics, and made a connection to image filtering.By contrast our work takes the somewhat wider and more practical view that a more or less arbitrary regularization function (robust of otherwise) can be converted or approximated with a localized (or "kernelized") image filter.In the process, the notion of anisotropic diffusion is naturally captured as iterations of the derived local filters with small steps, as explained in [4].• Finally, our approach makes the novel case that very general variational restoration formulations can be directly associated to diffusion by local filters through the (graph) Laplacian operator.This is significant and timely with respect to the wide use of proximal operators in learning, optimization theory, and their particular use in solving inverse problems [15], [16], [17], [18], [19].Approaches such as [15] have become popular for suggesting an ad hoc procedure for replacing the proximal operators with simpler (often patch-based) denoisers.The current work is a more principled approach toward that replacement, and makes the calculation of the appropriate kernel filters more direct and transparent.
II. BACKGROUND Removing noise from a signal may seem like a solved problem [20], [21], [22], [23], but it is still an ongoing area of work.More importantly, denoising (i.e.smoothing) is a fundamental operation in all of image processing, statistics, and machine learning.It is not only this wide use, but the fundamental nature of the denoising problem and its relationship to proximal operators that motivates our presentation here.This said, while denoising is the appropriate platform on which to illustrate the main concepts proposed in this paper, it is important to note that the results of this paper are not meant for only denoising applications per se.As we have illustrated elsewhere [19], [9], insight on how to implement, approximate, and generalize denoisers is very widely useful not only as a way to describe proximal operators, but also as a key building block to nearly all facets of image processing more generally.Now, let's specifically consider the measurement x of a clean signal u, corrupted by zero-mean Gaussian white noise of variance σ 2 , as 1x = u + e.
We note that this model is an adequate description for images scanned lexicographically into vectors.The aim of any denoiser is to recover an estimate of the clean underlying signal as a function applied to the noisy measurement, The denoiser is a map f : R N → R N .Denoisers enjoy many interesting functional properties, which are elaborated and extensively used in [19].

A. MAP Denoising and Proximal Operators
Bayesian denoising invokes the use of a prior P (u) on the class of "clean" images u.This prior influences the estimate of the underlying signal away from the direct measurement x.In particular, (as the name indicates) the maximum a posteriori (MAP) estimate is the value of u at which the posterior density P (u|x) is maximized, xmap = arg max u P (u|x).
When the noise is Gaussian and white (as we have assumed), the optimization boils down to a regularized least-squares problem where φ(u) = − log P (u) is the negative log-prior on the space of clean signals.This expression for the MAP estimate coincides precisely with what is known in the optimization literature as the proximal operator [24] for φ.Proximal operators have many interesting properties of their own.In this exposition, however, we simply mention them here and leave the further exploration of these connections for another occasion.
Let's develop a more explicit form for this solution by computing the gradient of the right-hand-side and solving, In the last line above, if we consider I + σ 2 ∇φ as an operator acting on its argument, we can (at least symbolically, if not explicitly) invert the operator to obtain This expression is known as the resolvent [24] of the operator ∇φ.When σ 2 is small, we can approximate the resolvent The left hand side, the MAP solution, can be interpreted as an implicit Euler step of size σ 2 on the differential equation ∂ t u = −∇φ(u).The approximation on the right hand side is an explicit Euler step, and we can clearly see the "shrinkage" effect of the denoiser, where the second term shrinks the measured noisy signal toward zero by an amount proportional to the gradient of the loss function φ(x).The MAP estimate is therefore approximated by We can estimate the error in this approximation by letting u = x map and observing from (7) that while Therefore: so the error is small if the denoising residual xmap − x is small.For ∇φ a linear operator, M is the spectral norm (largest singular value) of ∇φ.Alternatively if φ is twice differentiable, then the error can be bounded by the mean value theorem in terms of the max spectral norm of the Hessian Hφ along the line segment connecting x and xmap , where We will next consider the MMSE framework, which instead of the maximum of the posterior, considers its mean.This contrast between the MAP and MMSE is highlighted in Figure 1.It is interesting to note from this cartoon that the two estimates will tend to coincide when the posterior is symmetric and unimodal, or when the noise variance σ 2 is small.The latter condition is interesting to the extent that it allows us to formulate our results in a unified way across both MAP and MMSE estimators.Before we do this, let's examine the form and function of the MMSE denoisers in a bit of detail.

B. MMSE Denoising
In contrast to MAP denoising, MMSE denoising relies on the mean of the posterior density, The MMSE expectation integral is often impractical to evaluate directly, especially in high dimensions.
But there is a clever way around this difficulty, due to a classical result known as Tweedie's formula [25], [26], [27], where P (x) is the marginal density of the measurement x, which is computed from the prior P (u) as P (x) = P (x|u)P (u) du.This marginal density is a "blurred" version of the prior because when e is Gaussian white noise of variance σ 2 , That is, the marginal P (x) is the convolution of the prior P (u) ∝ exp −φ(u) with a Gaussian blur centered at zero, with variance σ 2 .The MMSE expression can be further written as where φ σ (x) = − log(P (x)).That is, φ σ (x) is the negative log of the blurred prior.We can directly compare the above expression for MMSE to the expression (11) for the MAP estimator and note that they are similar.Indeed, the MMSE estimate is also a shrinkage of x toward zero, and a gradient descent step on (a blurred version of) the penalty φ(x).This observation underscores the importance of the shrinkage idea in all forms of Bayesian/variational/proximal smoothing.Going forward, we will consider the following form as the fundamental equation for all such formulations of smoothing.Namely where φ(x) may henceforth be interpreted as a log-prior, a smoothed log-prior, or more generally an arbitrary but (almost everywhere differentiable) regularization function that may be used in a proximal setting.

C. Kernel Filters
A fundamentally different approach to smoothing is to compute data-dependent weighted averages of the input signal values to yield each element of the output.As described below, the affinity (or similarity) of pixels is measured using a "kernel" (a symmetric function), which when normalized, yields the weights for the above averaging.This framework (sometimes referred to as kernel regression [4]) does not rely explicitly on a prior as does the Bayesian framework, though as shown in [4] the kernel approach can be interpreted as an empirical Bayes procedure.The resulting kernel filters we shall concern ourselves with are pseudo-linear.Specifically, estimated pixels are computed from the measured pixels by with the important caveat that the weights W ij can depend on x.To be more compact with our notation, we can gather all weights corresponding to the ith estimated pixel as a row vector And collecting all these weights into a matrix we have from which the denoiser can be conveniently written in pseudo-linear form as Where do the weights in W come from?This is explained in detail in [4], but to summarize, the weights are normalized affinities computed between individual or groups of pixels.These affinities are typically computed using a symmetric positive-definite kernel function where R i is a matrix that extracts a pixel or patch centered at a reference position i, and γ is a smoothing parameter.Following [28], we call the operator R ij the non-local difference (or discrete derivative) between position j and the reference position i.By definition, the self-weights k ii (•) = 1 for all i.The Gaussian non-local means kernel is just one possible choice of kernel, and a wide variety of others are possible, as noted in [4], illustrated in Table I, and elaborated later in this paper 2 .When normalized, these affinities give the weights W ij pointing to each reference position i, and summed over the other indices j to produce the output of the (pseudo-linear) filter: Using the matrix notation in (28), we rewrite: where In [19] we have described in detail some fundamental properties that such smoothing filters satisfy.Of particular interest in the next section is the property that such filters are weakly nonlinear in the sense that a small perturbation of x leaves the filter matrix W(x) unchanged.As such, their basic behavior strongly mimics certain properties of linear filters, particularly with respect to differentiation; namely, if f (x) = W(x)x, then ∇f (x) = W(x) (see [19] for details.) Correspondingly, it is also shown in [19] that the (pseudo-quadratic) loss function has gradient directly given by the the pseudo-linear operator and Hessian given by The operator L(x) = I − W(x) on the right-hand-side is called the (graph) Laplacian [4].Intuitively, if we think of W(x) as a smoothing (low-pass) filter, then L(x) is a sharpening (high-pass) filter.With this correspondence, we can also write the kernel filter in terms of the Laplacian operator as follows: It is worth noting that the right-hand side of this equation is in precisely the same form as the shrinkage x = x − σ 2 ∇φ(x) equation ( 24) we presented earlier in the Bayesian/variational/proximal setting.This observation will be key throughout the rest of the paper.
III. APPROXIMATION OF REGULARIZATION BY PSEUDO-QUADRATIC FORMS This section contains the main message of this paper, which is to establish a direct connection between the choice of kernel filter W(x) := I − L(x) and the penalty function φ(x) in the Bayesian/regularization framework.To this end, the first step is to locally approximate the explicit regularizer with a pseudoquadratic form of the Laplacian.In particular, for each x we set This is the approximation from which all the rest of our results will flow.Specifically, given a kernel k(x), we can compute the filter matrix W(x) and the Laplacian L(x) = I − W(x), which immediately gives an approximation to the loss function φ(x).As we will elaborate below, going in the other direction is much more profitable as it will help us to localize global approaches.We will dedicate a substantial part of this paper to working out the details.Before proceeding further, we make some broad observation that set the stage.
• The approximation of φ by φ ker also implies approximation of the Bayesian denoisers with a kernel filter by way of differentiating (36): We call equation ( 37) the first order relationship that bridges the Bayesian/variational framework and the kernel framework.It makes the surprising and novel observation that very general variational restoration formulations can be directly associated to filtering by local filters through the Laplacian operator.Specifically, (37) states that the action of the Laplacian operator on the image approximates the gradient of the implied regularization function 3 .Conversely, computing the gradient of a given regularizer can specify the kernel implicitly in terms of the Laplacian operator L(x).Finally, it is worth noting that a special case of this expression also connects us to ideas in robust statistics [29], as the gradient ∇φ(x) can also be thought of as the influence function of the penalty.We shall have much more to say about this later in this section.• We can differentiate both sides of (37) yet again, yielding the second order relationship where Hφ(x) denotes the Hessian.The identity (39) shows that the loss φ(x) is locally convex if and only if L(x) (and therefore the corresponding kernel k(•)) is positive definite.On the other hand, φ(x) does not have to be globally convex for the relation to hold.Even in the case where φ(x) is non-convex (but twice differentiable), the Hessian still yields a usable kernel (albeit not necessarily a positive definite one).

A. First Order Mapping of Losses to Kernels
Here we will use the explicit first order relationship to relate the kernel filters and their corresponding losses.First, let's analyze a couple of relatively simple examples in order to illustrate the above identity.
. Therefore, ∇φ(x) = x (and the Hessian is simply the identity matrix Hφ(x) = I).This implies through (37) that L(x) = σ 2 I and The filter is a pointwise multiplicative shrinkage, reminiscent of the James-Stein estimator [30].For comparison, the exact MAP solution is which for small σ can be approximated using the very same shrinkage expression: b) Dirichlet Energy Regularization: An interesting loss for image processing is the 2 norm of the image gradient 4 , which encourages solutions that are smooth.For simplicity, we discuss discretization on a finite one-dimensional grid with periodic boundary conditions.
Given a noisy periodic signal x = [x 0 , . . ., x N −1 ] T , the MAP denoiser, regularized by the Dirichlet energy is the minimizer of where G is the local differencing operator representing circular convolution 5 with the kernel Where we note that L 0 = G T G is the standard (linear) Laplacian operator representing convolution with the 3-tap filter [−1, 2, −1].With this, we calculate the approximate MAP solution using (10) as This is to be contrasted against the exact MAP filter which is the following: The exact and approximate filters share the same eigenvectors, but their eigenvalues are different 6 .Indeed, the exact filter is (I + σ 2 L 0 ) −1 whereas the approximate filter is I − σ 2 L 0 , and they have eigenvalues respectively of the form (1 + σ 2 λ i ) −1 and 1 − σ 2 λ i , which are approximately equal whenever σ 2 λ i is small.We illustrate the 3-tap approximate denoising filter and the exact solution in Figure 2, where plots of the respective filter impulse responses for N = 256 for several values of σ are shown.Details of the calculation are carried out in the appendix.The filters are quite close at σ = 0.2 and become less close for larger σ.The approximation breaks down at σ = 1/2 √ 2, which is a zero of the approximate filter.For completeness, in Figure 11 in the appendix we also illustrate the 1 distance between the filter impulse responses 7 .
The above examples illustrated two of the simplest cases where the reduction of the regularization function to kernel form led to linear operators (shrinkage and convolution).Next, we consider a much more general class of regularization functions that lead to more complex (data-dependent) kernels.In particular, we consider isotropic kernels and their corresponding penalty functions.
1) Stationary and Isotropic Losses: In this section we concentrate on special classes of loss functions that allow us to carry out closed-form calculations.In particular, we begin by considering simple losses of the form φ(x) = ρ( Ax ), which depend only on the magnitude of the argument and some linear operator A. Such loss functions are also sometimes called radial basis functions, commonly used in approximation theory and machine learning [31], [32].We make the standard regularity assumption on ρ(•) that it is differentiable a.e., and ρ(t) ≥ 0, with ρ(0) = 0.
We proceed to calculate the gradient of this loss [33]: Invoking (37), for all x we have 8 This implies that Next, we generalize the loss functions to the following form: ) 6 These eigenvectors are simply the Discrete Fourier Transform (DFT) basis and the corresponding eigenvalues are the DFT transformed coefficients. 7This is equivalent to the ∞ operator distance, which relates the max pointwise error between xmap and xapprox as (48) 8 If A is any orthogonal matrix, then A T A = I and Ax = x , which means that ρ ( x ) x are effectively assigning the eigenvalues of the Laplacian matrix L(x).The problem of finding the kernel in this case can be viewed as an inverse eigenvalue problem of solving for a positive semi-definite matrix with given eigenvalues.
where h ij = h i−j is a positive stationary but otherwise arbitrary weight functions.In this form, the class of functions we consider includes many well-known losses such as those shown in Table I.With this definition, the gradient of φ(•) is: where As we will show in detail shortly, this class of losses will induce stationary and isotropic kernels that depend strictly on the norm (and not the direction) of the (non-local) difference: This is not a severe restriction for our purposes, and is routine for many of the kernels used in image processing, such as those illustrated in Table I.Isotropic kernels are also well-behaved, and convenient tools for their analysis and computation exist.Bochner's Theorem [34], for instance, gives them a useful spectral representation: namely, isotropic positive-definite kernels are the Fourier transform of a finite nonnegative-valued function (measure) [34].Of course, one can construct new kernels by adding, multiplying, composing, etc. as is well-known in the machine learning community [35].Now, let's take steps to extract the implied kernels.In analogy to (52), we have To clarify what the RHS really means, we remind the careful reader that To simplify further, we pause to interpret each of the terms in the above.R i is an operator that extracts a pixel or a (symmetric) patch centered at the position i.The transpose of this operator R T i takes a given patch or pixel and places it at position i in an image of all zeros.Therefore, R T i R i (and R T j R j ) does nothing more than extract a patch and put it back where it came from; thereby only contributing a value of 1 to the relevant diagonal element of the matrix 9 R T ij R ij .Correspondingly, R T j R i picks off a patch at position i and places it at position j.In matrix terms, R T j R i contributes a value of 1 to the corresponding off-diagonal position each time pixels from the patches i and j overlap 10 .Together, these observations help us write the RHS of (56) as a sum of two sets of terms: one set for the diagonal elements and another for the off-diagonals, as follows: Here we have an expression where the first term on the RHS fixes the diagonal elements of L(x), and the second term on the RHS determines its off-diagonal elements.Our hope is that this expression can be used to solve for the elements of the kernel matrix K ij (x).
To do this, let's recall the definition of the Laplacian operator once again: In the context of solving for a specific kernel K(x), the presence of the term D −1 (x) performing pixelwise division complicates matters significantly.First, the pixel-wise division is a computational complexity 9 When such terms are summed, the resulting patches placed in the image must be summed at locations where they overlap.If R select only one pixel in the simplest case, then there is no overlap. 10Since the left-hand-side of ( 57) is symmetric by definition, we also have that R TABLE I: Some well-known isotropic, positive-definite kernels k( R ij x ) and corresponding penalities φ( R ij x ).
we would rather avoid; second, the division prevents us from solving directly for the kernels of interest; and third, the division renders the filter and the Laplacian matrix non-symmetric.Fortunately, all of these issues can be overcome by a nice computational simplification, whereby we can eliminate the per pixel division [36].We explain this idea next.
2) Avoiding pixel-wise normalization: The normalized Laplacian matrix L(x) = D −1 (x)K(x) can be written in terms of the unnormalized Laplacian matrix as follows: where α is a fixed parameter.As shown in [36], for practical kernels this approximation tends to be quite accurate.To illustrate the approximation, consider the optimization problem where the norm is the Frobenius norm.The objective to be minimized is quadratic in α.Therefore, upon differentiating and setting to zero, we are led to the global minimum solution : When N is large enough, the right-hand-side is closely approximated by11 Now we have all the ingredients in place to solve for an underlying kernel.With the unnormalized Laplacian, we write Equating the elements of the LHS to the respective elements of the RHS we have To summarize, This interesting expression directly relates the kernel to the loss function.Let's simplify notation by defining the index l = |i − j|, and the variable t l = R ij x : The RHS in particular is noteworthy, as it is exactly the expression for the weights seen in the reweighted least squares formulation of M-estimators in robust statistics [29].The above equation is a special case of the first-order kernel relationship (37) for scalar, isotropic, shift-invariant kernels: in this case, it is possible to divide both sides by x (above, t l ).Division isn't possible in (37) (where x is a vector), which is a much more general formulation and puts these ideas in a broader context.The regularity assumptions we made earlier on ρ ensure that the implied kernel k is (a.e.) well defined and differentiable.Next, we examine some specific pairs of kernels and corresponding regularization functions.To simplify the exposition, we set h l = δ(l), the Dirac delta function, and therefore drop the dependence on the index l.This simplified form is where due to symmetry, it is sufficient to consider the case of t > 0. We compute some specific examples in detail below.
Examples of kernels k(t) to losses ρ(t) First, consider some familiar kernels k(t), with γ denoting generically their smoothing parameter.We will illustrate how these kernels map to corresponding loss functions ρ(t).Plots of the functions are shown in Figure 3.
Boxcar Kernel: This kernel gives equal weights to all samples within a distance γ This implied loss function is simply a "clipped" version of the standard 2 loss, which naturally makes it more robust to outliers.
Gaussian Kernel: This kernel is the most commonly used, for instance in the context of non-local means and bilateral filters.The resulting loss function is known as the Welsch loss [37], and behaves like 2 near zero and for large t asymptotically clips to the same constant value as we derived above for the boxcar.It is interesting, and consistent with what has been observed elsewhere [13], [10], that the use of even the most common kernel (Gaussian) in the context of locally-adaptive filtering results in robust behavior in the implied loss. .
Cauchy Kernel: This kernel is uncommon in image processing.The resulting loss function is sometimes called the Lorentzian.The implied loss function (again) behaves like 2 near the origin, but grows logarithmically for large t.Exponential Kernel: This kernel is the only one that results in a loss function that behaves qualitatively different than the others near t = 0. Namely, it behaves like 1 for small t and settles to a constant for large t Examples of losses ρ(t) to kernels k(t): Now let's go in the other direction, where we specify some well-known loss functions ρ(t), and compute the implied kernels.Since any constant scaling of these kernels is canceled after normalization and therefore unimportant, we shall drop them.Plots of several of these functions are shown as part of a much larger family in Figure 4.
Dirichlet Energy: The standard 2 loss applied to the difference operator t = R ij x as ρ(t) = 1 2 t 2 is the Dirichlet energy defined earlier 12 .The result is a constant (averaging) kernel which is consistent with the approximate MAP solution, and when modulated by the weights h l gives a local spatial convolution kernel as described in the appendix.Total Variation: This penalty is ρ(t) = t (with identical weights h l = 1/N ), and yields As is apparent, the resulting kernel gives huge weight to pixels (patches) that are very similar, and hence acts largely locally.This means that the resulting filter tends to behave mostly like a linear averaging filter in low contrast areas, while respecting edges and therefore producing piecewise constant results.The TV penalty as a local filter has been studied in [14] and our observation is consistent with the conclusions in that work.Huber Loss: The Huber loss is meant to strike a balance between the use of the 2-norm near the origin and the 1-norm away from the origin.This is a very useful loss function that yields appropriate behavior for both high and low signal-to-noise regions.For instance, when used in the context of denoising, it tends to aggressively smooth areas that are "flat" (i.e.very similar) and be less aggressive in smoothing areas that contain discontinuities.The resulting kernel reflects exactly this behavior: It is constant for small values of t, and decays in inverse proportion to increasingly large values of t.
Fig. 4: Kernel functions k(t) and their loss functions ρ(t) for the general robust loss function, presented by Barron [38], sweeping over β, and fixing c = 1.
For the sake of completeness, let's also carry out the computations for a wide class of loss functions that span a gamut of behaviors including the 2 and 1 .For this purpose, we consider the family described in [38], which includes the Welsch and Lorentzian losses discussed earlier, where β, and γ are order and strength parameters, and z(β) = max(1, 2 − β).Plots of the functions are shown in Figure 4. Their corresponding kernel functions are IV. SECOND ORDER CORRESPONDENCE OF KERNELS AND REGULARIZERS In the previous section we illustrated the first order approximation of regularizers with kernel filters.This relationship allowed us to directly compute kernels given the Bayesian/variational/proximal description.These convenient formulas were derived for isotropic losses ρ(t) which simplified the calculations to yield locally adaptive kernels.In this section, we ask the question of whether more general expressions exist that apply to the non-isotropic cases.For this purpose, we have to resort to second order analysis.Ironically, the second order analysis, by virtue of being less accurate is less generally useful in practice.But it does have the advantage that it illuminates the geometry of the problem at hand.Namely as we shall see below, the second order analysis relates the curvature of the loss function φ(x) to the Laplacian operator.Given φ(x), we can compute the Hessian Hφ(x) (assuming it exists) and directly obtain L(x): which then implies the filter W(x) = I−L(x).If the level sets of the loss φ(x)−c = 0 describe a smooth hypersurface 13 of dimension N −1 embedded in R N , then the Hessian Hφ(x) measures the local curvature of the graph of this function.In particular, the Hessian is the second fundamental form of the surface, which contains as its eigenvectors the principal curvature directions [39].On the right-hand side we have the Laplacian operator.This indicates that with the second order approximation, the Laplacian operator (and therefore the filter weights) are assigned by the curvature of the loss function -an interesting and reasonable intuition.
What guarantee do we have that the Hessian even exists?Interestingly, for convex φ(x), a classical result due to Alexandrov [40] states that ∇φ(x) exists and is differentiable almost everywhere (a.e.); therefore the Hessian is well-defined a.e. and we are in business.In particular, let's again employ the unnormalized Laplacian to write If we write this element-wise, the diagonal and off-diagonal elements of the kernel weight matrix are the solutions of the equations These equations can directly specify the kernel in terms of mixed partial derivatives of the loss function.
Example 1: The standard linear-quadratic loss is a useful and simple case: where A is assumed symmetric and positive semi-definite.So any (linear-) quadratic loss function yields a (constant) linear kernel filter.
Example 2: What about the opposite direction?Namely, given a symmetric kernel k(•, •), and the corresponding symmetric Laplacian L(x), can we try to solve for a φ(x)?This is a far harder question as it implies a nonlinear elliptic partial differential equation for which in general a solution may not even exists.Indeed, not every vector field f : R N → R N corresponds to the gradient of some scalar function f ?= ∇φ, and similarly, an arbitrary L(x) might not be a valid Hessian.This equation falls under the class of Monge-Ampère equations [41], which model (among other things) the relationship between the curvature of hypersurfaces and their explicit descriptions in Euclidean coordinates.Finding solutions is challenging, especially in high dimensions.Whether a solution exists for general cases is an open question.

A. Specializing to isotropic losses, and comparison to first order
As we did earlier in the first order analysis, consider the particular case of (a.e.twice differentiable) losses of the form φ(x) = ρ( Ax ).This gives where u = Ax Ax .It is interesting to note that the first (order) term on the right-hand-side is identical to the expression we derived earlier in (52).Let's rewrite (79): The first term in this expression is in fact singular for all x!
Therefore, the expression for the Hessian simplifies considerably as we consider the action of this operator on x Hence, the second order analysis gives the Hessian as To summarize, for losses of the form we have: First-order kernel approximation: Second-order kernel approximation: Examples of how well the first and second order solutions work are presented in the next section.As seen in Figs. 5 − 9 of Section V, both of these approximations are accurate for small σ, though not surprisingly, the second order approximation is consistently less accurate.The reader might wonder under what circumstances the two approximations might coincide.To address this questions, let's again denote t = Ax .It is obviously the case that whenever ρ(t) ≈ t 2 locally, the two approximations coincide because ρ (t) = ρ (t)/t.As we've noted earlier, this is the case for many losses of interest, including the Huber loss and the wider class of robust losses in (74), all of which behave quadratically for small t ≤ γ.Another interesting scenario is robust losses that are redescending [29], [37].Redescending losses are those where, for large t > γ, the second derivative ρ (t) is small 14 .Therefore, the Hessian of the redescending loss for large t = Ax , when applied to any x is That is, the Hessian in the large argument regime is in fact singular.In retrospect, this should not be entirely surprising given that we would expect redescending losses to censor vectors that have large magnitude Ax .As before, we can specialize the above observations to the case where the loss is a sum of radial basis functions acting on patches of the image as follows: which again, in summary, yields the corresponding kernel functions: First-order kernel approximation: Second-order kernel approximation: V. EXPERIMENTAL RESULTS In this section we demonstrate the use of the results derived in the earlier sections.

A. Approximating Total Variation Regularized Denoising
In the loss-to-kernel direction, we consider a form of total variation (TV) denoising where ρ is the Huber function (73) with γ = 5 and h ij is one in an 11 × 11 spatial neighborhood for (i − j) ∈ {5, . . ., 5} × {−5, . . ., 5} and zero otherwise.We include a 1/2 factor since the sum in φ counts each difference |x i − x j | twice.We approximate the solution of the MAP problem (96) with kernelized filters using the first-and second-order kernel relationships.Through the first-order relationship (37), we obtain the kernel filter where The roles of i and j are symmetric in to output pixel i and (x j − x i ) to pixel j.So for a given output pixel k, The second-order relationship (39) and calculation (86) yield another kernel filter where the output at pixel k is We note that in this case, no row-sum normalization is needed since the kernel filters (98) and (100) are already normalized.Interestingly, the first and second-order kernel filters both have a form resembling the division-free bilateral filter described in [36], and differ only in the factor In Figure 5, we added white Gaussian noise with standard deviation of 10 and applied the first-and second-order kernel filters.The output with the first-order kernel approximation is both qualitatively and quantitatively close to the output of total variation denoising, while having the advantage of being a non-iterative method.Figure 6 shows the approximation PSNR swept over σ.The approximation is more accurate for small σ.

B. Approximating the Bilateral Filter with First and Second Order Kernel Approximations
This section demonstrates the kernel-to-loss direction.Here we begin with a division-free form of the bilateral filter [36] (a kernel filter), where k(t) is the range kernel, h is the spatial weight, and α is a parameter.We approximate it with a MAP problem, using the first-and second-order relationships to determine the loss function.We note the similar form of the above bilateral filter to those derived in the loss-to-kernel direction in the previous experiment with MAP problem Working backward, (98) from the first-order relationship implies and (100) from the second-order relationship implies The coefficients c 0 , c 1 are determined such that ρ(0) = ρ (0) = 0. To solve (102), we use the first-order gradient method, which guarantees convergence.Since the regularizer for the second order approximation is smooth and convex, we further add the heavy-ball momentum, which converges to the optimum at the fastest possible (first-order) convergence rate [42].We initialize with the input image.
In the following experiments, we use a Gaussian spatial weighting h that is h ij = exp − i − j 2 /(2 • 10 2 ) over (i − j) ∈ {−5, . . ., 5} × {−5, . . ., 5} and zero otherwise.The range kernel k(t) is the boxcar, Gaussian, or exponential kernels.In each case, we sweep the α parameter and select the value with minimum mean squared error.We sweep α over [0.01, 0.2] for the first-order approximation and over [1,1024] for the second-order approximation.We show the bilateral filter (101) and its approximation as a MAP problem (102) where ρ(t) is determined by either the first-order (103) or second-order (104) relationship.
Figures 7, 8, and 9 show the results for σ = 30 along with the peak signal-to-noise ratio (PSNR) between the MAP approximations to the bilateral filtered images.We see that both quantitatively and qualitatively, the images produced from bilateral filtering and derived MAP approximations are comparable: small details are blurred, while significant features are retained.Quantitatively, the second-order kernel approximation results in slightly worse PSNR than the first-order kernel approximation.
Figure 10 shows the approximation PSNR swept over σ for Gaussian, boxcar, and exponential kernels.As expected, the approximations are more accurate for small σ.Also, the first-order kernel approximation is consistently more accurate than the second-order kernel approximation, even though the objective function is not convex.

Original
Bilateral Filtered First-order Approx.Second-order Approx.
Fig. 8: Comparison between bilateral filter with boxcar kernel and Bayesian denoiser with the proposed first and second-order kernel approximation.The first-order kernel approximation achieved PSNR of 44.09 dB, and the second-order kernel approximation achieved PSNR of 36.79 dB.
Original Bilateral Filtered First-order Approx.Second-order Approx.
Fig. 9: Comparison between bilateral filter with exponential kernel and Bayesian denoiser with the proposed first and second-order kernel approximation.The first-order kernel approximation achieved PSNR of 46.13 dB, and the second-order kernel approximation achieved PSNR of 44.07 dB.
Fig. 10: PSNR over σ for approximating bilateral filter with boxcar, Gaussian, and exponential kernels with the experimental set up described in V-B.
VI. FINAL REMARKS AND CONCLUSIONS Connections between kernel and regularization-based methods have been studied before, but for the most part have addressed specific kernels (e.g bilateral) or have not been directly useful in constructing general, local, non-iterative, and practical filters from global penalties.In this paper, we established a clear and general way to approximately exchange local kernels k(x) and corresponding global regularizers φ(x).In the "kernelization" direction (φ(x) → k(x)) we gave first and second order methods for reducing a global loss to a local kernel filter.The first order loss established a relationship between the tangent space to the regularization function and the (pseudo) linear operator described by the kernel filter; namely For the second order approximation, we showed that the Hessian of the regularizer determines the affinity weights k(x) as In the other direction (k(x) → φ(x)) we enabled a Bayesian interpretation or justification for the use of a positive-definite kernel for locally-adapted filtering, and provided a practical path to computing the implied penalties φ(x).
Variational/Bayesian/proximal formulations often result in optimization problems that do not have closed-form solutions, and therefore typically require global iterative solutions.Our contribution here is to establish how one can approximate the solution of the resulting global optimization problems with use of "kernelized" locally adaptive filters which approximates the global solution in one-shot, using only local operations.This is significant and timely, particularly with respect to the wide use of proximal operators in learning, optimization theory, and their use in solving inverse problems.In light of our work, approaches that replace proximal operator in ad hoc fashion with simpler (often patch-based) denoisers can be replaced with this more principled approach that makes the calculation of the appropriate kernel filters more direct and transparent.

APPENDIX
Point-wise we have the gradient of the approximate filter as: The solution of (44) is approximated by cyclic convolution with the 3-tap filter [σ 2 , (1 − 2σ 2 ), σ 2 ], which for sufficiently small 15 σ, is strictly a smoothing filter as intended.Now let us compute the exact MAP solution by passing to the Fourier domain.Specifically, under the DFT, the objective decouples over spectral coefficients, where U k and X k denote the kth spectral coefficients of u and x.The |e i2πk/N − 1| 2 factor simplifies to 2(1 − cos 2πk N ).Differentiating and solving the Euler-Lagrange equation, the MAP solution X is the pointwise product (in the frequency domain) of X with a denoising filter spectral coefficients, Back in the spatial domain, the exact MAP solution is the cyclic convolution where w n is the inverse transform of W k , given by where r = 1 + (1 − √ 1 + 8σ 2 )/(4σ 2 ).Therefore by the transform pair DFT r n /(1 − r N ) = (1 − re −i2πk/N ) −1 , the inverse transform of W k is r n + r N −n .

Fig. 5 :Fig. 6 :
Fig. 5: Comparison between Bayesian denoiser with Huber total variation regularization and non-local means with the proposed first and second-order kernel approximation.The first-order kernel approximation achieved PSNR of 37.10 dB, and the second-order kernel approximation achieved PSNR of 29.24 dB.

Fig. 7 :
Fig. 7: Comparison between bilateral filter with Gaussian kernel and Bayesian denoiser with the proposed first and second-order kernel approximation.The first-order kernel approximation achieved PSNR of 48.48 dB, and the second-order kernel approximation achieved PSNR of 42.75 dB.