Denoising Score Matching via Random Fourier Features

Density estimation is one of the core problems in statistics. Despite this, existing techniques such as maximum likelihood estimation are computationally inefficient for complex parametric families due to the intractability of the normalizing constant. For this reason, interest in score matching, which is independent of the normalizing constant, has increased. However, such an estimator is consistent only for distributions whose support is the full space. One approach to make it consistent is to add noise to the input data, called Denoising Score Matching. In this work we build a computationally efficient algorithm for density estimation using the kernel exponential family as the model distribution. The usage of the kernel exponential family is motivated by the richness of this class of densities. To avoid calculating the intractable normalizing constant we use the Denoising Score Matching objective. The computational complexity issue is approached by applying a Random Fourier Features-based approximation of the kernel function. We derive an exact analytical expression for this case, which allows dropping additional regularization terms based on higher-order derivatives, as they are already implicitly included. Moreover, the obtained expression explicitly depends on the noise variance, so the validation loss can be straightforwardly used to tune the noise level. Along with benchmark experiments, the method was tested on various synthetic distributions to study its behavior in different cases. The empirical study shows quality comparable to the competing approaches, while the proposed method is computationally faster. The latter enables scaling up to complex high-dimensional data.


I. INTRODUCTION
One of the core problems in statistics is density estimation. The most well-known approach is Maximum Likelihood Estimation (MLE). However, MLE and all other approaches based on it require the normalizing constant to be known or efficiently computable, which is not the case in many real-world problems; the intractability of the normalizing constant makes the approach infeasible. In contrast, the unsupervised score matching estimator [1], based on Fisher divergence minimization, does not depend on the normalizing constant. The resulting estimate is proved to be asymptotically normal and consistent when the supports of the data and model distributions coincide. There are numerous developments of the idea [2]-[6].
Another important part of density estimation is the class of models in which to search for the solution. Special interest is paid here to the exponential family of distributions, which leads to a closed-form solution [2], [7]-[10]. A generalization of finite-dimensional exponential families is the Kernel Exponential Family (KEF). In this case, the natural parameter is treated as a function from some Reproducing Kernel Hilbert Space (RKHS), so the KEF can be seen as an infinite-dimensional generalization of the exponential family. The KEF contains all well-known exponential family densities such as the Exponential, Gaussian, Gamma, etc. Also, RKHS yields a sufficiently rich class of estimators with convergence guarantees w.r.t. different metrics [11], [12]. The main disadvantage is the computational cost of the sample matrix inversion, which makes this method inapplicable even for a moderate amount of training data.
To approach the computational complexity issue, the authors of [13], [14] propose to use a Nyström-type approximation of the kernel function. Alternatively, the Random Fourier Features (RFF) [15] embedding can be considered. Employing a particular structure on the RFF weights [16], [17], we come up with a faster model than the Nyström-type approximation. There are many papers studying the convergence of RFF models for the regression problem. The optimal learning rate with $O(\sqrt{n}\log n)$ features is the same as for the full kernel [18], which gives a substantial speed-up. The theoretical properties of using RFF for score matching are less studied, though there are some general theoretical results on RFF and higher-order kernel derivatives [19], [20].
The naive approach to score matching with RFF suffers from several issues. The first is an oscillating behavior in the tails of the distribution [12]. The second is poor convergence in the case of disjoint supports (consistency cannot be guaranteed) or in areas where the density value is close to zero [14]. Inconsistency explains the low approximation accuracy in regions of almost zero density.
It was shown that convolution with small Gaussian noise (which is equivalent to noisy data perturbation) improves learning behavior and approximation quality, e.g. [21]-[23]. It makes the supports of both densities (the data distribution and the model distribution) the same and allows us to overcome the issue above. For most models, the convolution cannot be calculated analytically, so authors usually resort to a second-order Taylor series expansion [23]-[25], which results in a special regularization term in the loss function. It turns out that the noise level is an important parameter: with a large noise level we get better convergence but lower accuracy; with a smaller noise level the convergence is less stable, but the solution is more accurate. This means that tuning of the noise level is required. Recently, it was proposed to use several noise levels, optimizing a cumulative objective [21].
Having a fast and accurate density estimation technique is important in many engineering applications: it can be used in Monte Carlo simulations and Bayesian optimization, and it can be a building block in various machine learning algorithms, among many other uses. Such broad applicability motivated us to develop a new approach for density estimation based on the kernel exponential family, Random Fourier Features, and denoising score matching.

A. CONTRIBUTION
In this work, we introduce a computationally efficient method to estimate an unknown distribution using denoising score matching. To tackle the convergence issue we optimize the convolution of the score matching objective with symmetric noise. To improve the computational complexity we apply the RFF technique for kernel approximation. The application of RFF also allowed us to derive an exact analytical solution for the considered objective function. The benefit of the obtained solution is that it allows avoiding additional regularization terms, as they are embedded into the loss function naturally. The derived expression of the loss function explicitly contains the noise parameters, which allows us to use simple gradient-based approaches to tune these parameters. In the experimental section, we demonstrate the performance of our approach both in terms of accuracy and training time. While the quality is comparable to the Nyström-type approximations, the training is much faster.
The paper is organized as follows. In Section II we give background information that is used to construct the final model: score matching and its RKHS form, and learning using random features. Section III provides results on the necessary condition of denoising score matching and its RFF approximation. Numerical experiments are presented in Section IV. Finally, Section V concludes the paper. All additional materials are presented in Appendices A and B.
This article is based on Section 4.2 of the Ph.D. thesis of one of the authors [26].
II. BACKGROUND

A. SCORE MATCHING
Let $\{x_a\}_{a=1}^{n}$, $x_a \in \mathbb{R}^d$, be a set of observations drawn from an unknown distribution with probability density function $p_0(x)$. Let $p(x, \theta)$ be a model density parameterized by $\theta \in \Theta \subset \mathbb{R}^m$. The task is to find $\theta^*$ such that the model density is close to the real one: $p(x, \theta^*) \approx p_0(x)$. This is a density estimation problem. In the score matching approach we minimize the Fisher divergence:
$$J(\theta) = \frac{1}{2}\, \mathbb{E}_{p_0} \left\| \nabla_x \log p(x, \theta) - \nabla_x \log p_0(x) \right\|^2. \qquad (1)$$
Under sufficiently weak regularity conditions (see [1]) the minimization of the Fisher divergence is equivalent to the minimization of
$$J(\theta) = \mathbb{E}_{p_0} \sum_{i=1}^{d} \left[ \partial_i^2 \log p(x, \theta) + \frac{1}{2} \left( \partial_i \log p(x, \theta) \right)^2 \right]. \qquad (2)$$
Note that the normalizing constant does not depend on $x$; therefore, $p(x, \theta)$ in (2) can be replaced with the unnormalized one, $\tilde{p}(x, \theta) = p(x, \theta) Z(\theta)$. In an abuse of notation, from now on we will use $p(x, \theta)$ to denote the unnormalized density unless stated otherwise. Objective (2) does not depend on the unknown density $p_0$ and provides an opportunity to estimate $p_0$ up to the normalizing constant using only samples drawn from $p_0$:
$$\hat{J}(\theta) = \frac{1}{n} \sum_{a=1}^{n} \sum_{i=1}^{d} \left[ \partial_i^2 \log p(x_a, \theta) + \frac{1}{2} \left( \partial_i \log p(x_a, \theta) \right)^2 \right]. \qquad (3)$$
This loss suffers from several drawbacks. Firstly, expression (1) assumes that the model and data distributions have the same support. However, in the real world the data distribution often lies on a low-dimensional manifold embedded in $\mathbb{R}^d$ [21], while the support of the model density is usually the whole space. Secondly, score matching convergence is guaranteed only in the case $\operatorname{supp} p_0 = \mathbb{R}^d$ (see [1]).
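Before turning to the remedy for these drawbacks, here is a minimal numerical sketch of objective (3) (our illustration, not code from the paper): it evaluates the empirical score matching loss for a simple unnormalized isotropic Gaussian model, where both the score and the Laplacian are available in closed form. The function name and toy setup are ours.

```python
import numpy as np

def sm_loss(x, mu, s2):
    """Empirical score matching objective (3) for the unnormalized
    isotropic Gaussian model log p(x) = -||x - mu||^2 / (2 * s2):
    mean over samples of sum_i [d_i^2 log p + 0.5 * (d_i log p)^2]."""
    d = x.shape[1]
    grad_sq = ((x - mu) ** 2).sum(axis=1) / s2 ** 2  # ||grad log p||^2
    laplacian = -d / s2                              # sum_i d_i^2 log p
    return np.mean(laplacian + 0.5 * grad_sq)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=(5000, 1))
# the loss is minimized near the data mean (1.0) and variance (4.0)
print(sm_loss(x, 1.0, 4.0), sm_loss(x, 0.0, 1.0))
```

Note that no normalizing constant appears anywhere in the computation, which is the whole point of the score matching objective.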
To tackle the issue, we use Denoising Score Matching (DSM) [27]. In this approach, we add noise to the data. The score matching loss in this case is given by
$$J(\theta) = \frac{1}{2}\, \mathbb{E}_{x \sim p_0 * p_\varepsilon} \left\| \nabla_x \log p(x, \theta) - \nabla_x \log\,(p_0 * p_\varepsilon)(x) \right\|^2, \qquad (4)$$
where $p_\varepsilon(x)$ is the distribution of the noise and $*$ is the convolution operator. Now both densities have the same support, so the solution converges. The optimal model satisfies $\nabla p_\theta = \nabla [p_0 * p_\varepsilon](x)$. However, $\nabla [p_0 * p_\varepsilon](x)$ is close to the gradient of the true density $\nabla p_0(x)$ only when the noise is small enough.
To estimate the loss in the general case, we can generate a finite set of noisy samples and use them to evaluate the loss function's expectation. Another option is to use the Taylor series expansion, assuming that the noise level is small. In both cases, we get an approximate value of the loss function. Moreover, when we use the Taylor series expansion, we need to calculate higher-order derivatives of the model, which can be computationally expensive (for example, in the case of neural networks). However, for the kernel exponential family, the denoising score matching loss can be computed exactly.
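For reference, the first option (a finite set of noisy samples) can be sketched as follows. This uses the standard Monte-Carlo formulation of DSM [27] for Gaussian noise, where the target is the score of the conditional noising density, $-(\tilde{x} - x)/\sigma^2$; the name `score_fn` is a placeholder for an arbitrary model score function and is ours.

```python
import numpy as np

def dsm_loss_mc(score_fn, x, sigma, rng):
    """Monte-Carlo denoising score matching loss for Gaussian noise:
    the model score at the noisy point is matched to the score of the
    conditional noising density, -(x_noisy - x) / sigma^2 (see [27])."""
    noise = rng.normal(0.0, sigma, size=x.shape)
    x_noisy = x + noise
    target = -noise / sigma ** 2
    diff = score_fn(x_noisy) - target
    return 0.5 * np.mean((diff ** 2).sum(axis=1))

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 2))
score_std_normal = lambda z: -z  # score of a standard Gaussian model
print(dsm_loss_mc(score_std_normal, x, 0.1, rng))
```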

B. KERNEL EXPONENTIAL FAMILY
The kernel exponential family is the set of distributions whose unnormalized probability density functions p_f(x) satisfy log p_f(x) = f(x) + log q_0(x), where f ∈ H, H is some Reproducing Kernel Hilbert Space (RKHS) with kernel k, and q_0 is a generating (base) density. The normalizing constant is usually unknown and cannot be computed analytically. The class of such densities is rich: in fact, it is dense in the set of continuous probability density functions that decay at the same rate as q_0.
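A minimal sketch of evaluating such an unnormalized log-density is given below, assuming f lies in the span of kernel functions at a few centers; the expansion coefficients `beta` and the `centers` are placeholders for fitted quantities (fitting them is the subject of the following sections), and the helper names are ours.

```python
import numpy as np

def rbf_kernel(x, y, ell=1.0):
    """RBF kernel matrix between row-stacked point sets x and y."""
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def kef_log_density(x, centers, beta, log_q0):
    """Unnormalized KEF log-density log p_f(x) = f(x) + log q0(x),
    with f represented as a kernel expansion over a few centers."""
    return rbf_kernel(x, centers) @ beta + log_q0(x)

rng = np.random.default_rng(1)
centers = rng.normal(size=(10, 2))      # placeholder expansion points
beta = 0.1 * rng.normal(size=10)        # placeholder coefficients
log_q0 = lambda x: -0.5 * (x ** 2).sum(axis=1)  # Gaussian base density
print(kef_log_density(rng.normal(size=(5, 2)), centers, beta, log_q0))
```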
In the well-specified case, i.e. when p_0 belongs to the kernel exponential family with RKHS H, the score matching loss (2) can be expressed in terms of the kernel k(x, y) and its derivatives; we refer to this kernel form of the objective as (5) (see [11] for the explicit expression). Using the general representer theorem, the optimal f(x) can be found as a weighted sum of kernel derivatives centered at the training samples. To find the weights, we need to invert an nd × nd matrix; the computational complexity is therefore O(n³d³). While convergence in the RKHS of this estimator implies convergence in L_r, in Kullback-Leibler divergence, and in Hellinger distance, in the misspecified case the density estimator remains the same, but with convergence guarantees only for the Fisher divergence.
To reduce the computational complexity, the authors of [13] proposed to find a solution in the span over a randomly selected subset of training samples (inducing points). The computational cost of this approach is O(nm²d³ + m³d³), where m is the number of inducing points. Additional sub-sampling over the md basis functions enables an even more computationally efficient approach; in this extreme case, the complexity is O(nm²d + m³). As in the case of full data usage, the obtained estimator is consistent when p_0 lies in the kernel exponential family, but the rate of convergence is slower (under the assumptions presented in [13]). The misspecified case was not studied.
To obtain a consistent estimator from the kernel exponential family the authors of [14] used Denoising Score Matching with Taylor series expansion. This results in an additional regularization term in the loss function that penalizes second derivatives of the model. The need to calculate second derivatives restricts the approach only to relatively low-dimensional cases.

C. RANDOM FOURIER FEATURES
In general, there are two types of approaches to scale up kernel methods: Nyström-type approximations [13] and random-features-based approximations [15], [17]. The second one is data-independent; the idea follows from Bochner's theorem [28]: any shift-invariant bounded continuous kernel k(x, y) = k(x − y) is the Fourier transform of a non-negative bounded measure p(w),
$$k(x - y) = \int_{\mathbb{R}^d} e^{i\, w^\top (x - y)}\, dp(w).$$
Using Monte-Carlo to estimate the integral, we obtain the Random Fourier Features approximation of the kernel function
$$k(x, y) \approx \phi(x)^\top \phi(y), \qquad (6)$$
where $\phi(x) = \sqrt{2/m}\,\big(\cos(w_j^\top x + b_j)\big)_{j=1}^{m}$, $w_j \sim p(w)$, $b_j \sim U[0, 2\pi]$. The same idea can be applied to the kernel derivatives,
$$\partial_x^{p}\, \partial_y^{q}\, k(x, y) \approx \big(\partial^{p} \phi(x)\big)^\top\, \partial^{q} \phi(y),$$
where $p, q \in \mathbb{Z}_{\ge 0}^d$ denote multi-indices, and $p$ and $q$ act on the first and the second arguments of the kernel correspondingly.
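The construction in (6) is easy to check numerically. The sketch below compares the RFF inner product with the exact RBF kernel; the lengthscale parameterization `ell` is our assumption, under which the spectral measure p(w) is Gaussian with covariance I/ell².

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, ell = 3, 4000, 1.0
# For the RBF kernel exp(-||x - y||^2 / (2 ell^2)) the spectral
# measure p(w) in Bochner's theorem is N(0, I / ell^2).
W = rng.normal(0.0, 1.0 / ell, size=(m, d))
b = rng.uniform(0.0, 2 * np.pi, size=m)

def phi(x):  # random Fourier feature map from (6)
    return np.sqrt(2.0 / m) * np.cos(W @ x + b)

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-0.5 * ((x - y) ** 2).sum() / ell ** 2)
print(exact, phi(x) @ phi(y))  # the two numbers should be close
```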
RFF sampling can be improved by imposing a special structure on the weights w_j and using better expectation approximations in (6) [16], [17], [29], which makes RFF generation faster than the Nyström approach. Nevertheless, the theoretical properties are well studied only for kernel ridge regression [18], [30]. Also, the random features approach can be extended to a more general class of kernels that admit the representation $k(x, y) = \int \phi(w, x)\, \phi(w, y)\, d\pi(w)$ for some feature map φ(w, ·) and measure π(w). One of the most well-known kernels of this type is the Arc-Cosine kernel [31], for which π(w) = N(0, I_d) and the closed form depends on the angle θ(x, y) between the vectors x and y.
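For the Arc-Cosine kernel this representation can likewise be verified by Monte-Carlo. The sketch below uses the order-1 feature map φ(x, w) = max(wᵀx, 0) and the closed form from [31]; the constants follow our reading of that paper, so treat them as an assumption.

```python
import numpy as np

# Monte-Carlo check of the order-1 Arc-Cosine kernel [31]:
# E_w[max(w^T x, 0) * max(w^T y, 0)], w ~ N(0, I), equals
# ||x|| ||y|| (sin(theta) + (pi - theta) cos(theta)) / (2 pi).
rng = np.random.default_rng(0)
d, m = 3, 200000
W = rng.normal(size=(m, d))
x, y = rng.normal(size=d), rng.normal(size=d)
mc = np.maximum(W @ x, 0) @ np.maximum(W @ y, 0) / m
theta = np.arccos(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
exact = (np.linalg.norm(x) * np.linalg.norm(y) / (2 * np.pi)
         * (np.sin(theta) + (np.pi - theta) * np.cos(theta)))
print(mc, exact)  # the two numbers should be close
```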

III. KERNEL DENOISING SCORE MATCHING
This section provides the optimal solution for the Kernel Denoising Score Matching, its RFF approximation, and some error bounds of the resulting model. Here we assume that the noise distribution is symmetric, i.e. p ε (x) = p ε (−x).

A. DENOISING SCORE MATCHING IN RKHS
We start by rewriting the expression for the Denoising Score Matching objective (4), following the same derivation logic as in [11], with the difference that our objective function is the convolution of the usual score matching objective with the noise distribution.
Let V : R^m → R be a convex and differentiable function. Following the derivation in [11], the score matching terms evaluated at the noise-shifted samples x_a + y (including the terms ∂_i log q_0(x_a + y)) can be collected into a linear operator A(y) : H → R^m, so that the inner objective takes the form V(A(y)f). Then the objective (4) can be written as
$$J_\lambda(f) = \int p_\varepsilon(y)\, V\big(A(y) f\big)\, dy + \frac{\lambda}{2}\, \|f\|_{H}^2. \qquad (10)$$
Using the first-order optimality condition we can see that the solution takes the form
$$f^* = \int p_\varepsilon(y)\, A^*(y)\, \alpha(y)\, dy, \qquad \alpha(y) = -\frac{1}{\lambda}\, \nabla V\big(A(y) f^*\big),$$
where A*(y) : R^m → H is the adjoint of A(y); the condition on α(y) can be rewritten as an integral equation, referred to below as (11). Now we are ready to formulate the proposition.

Proposition 1: The solution to (10) has the form given above, with α(y) characterized by the integral equation (11); the explicit derivation is presented in Appendix A-A.

The optimal model requires the solution of an operator equation, which in general is a difficult task. To avoid this, let us consider a Monte-Carlo approximation of (11). Suppose we have sampled K noise vectors {z_k}_{k=1}^K, z_k ∼ p_ε. In this case, an approximation to the optimal β* can be found by solving the system of equations obtained by replacing the integrals in (11) with averages over {z_k}; we refer to this system as (12). The obtained result can then be used to derive an approximation of f*, but the computational complexity is O(n³d³K³ + n²d²K). Moreover, the convolution in the second term of (12) can be computed directly only for a limited set of kernels, e.g. the Radial Basis Function (RBF) kernel.
In order to improve the computational complexity, we employ the RFF approach to approximate the kernel function.

B. RFF FOR DENOISING SCORE MATCHING
For the RFF (6) we introduce the matrix of feature derivatives ∂_y, whose rows contain the first-order derivatives of the feature map evaluated at the training samples corrupted by the noise y (and, similarly, ∂²_y for the second-order derivatives).
The finite-sample solution to (12) is then available in closed form (see Appendix A-B), where the operator ⊙ denotes the Hadamard product. By taking the limit over K → ∞ we obtain the final RFF solution (14). The detailed derivation can be found in Appendix A-B. Similar results can be derived for the Nyström-type approximation (see Appendix A-G). The disadvantage in that case is that we need to compute the convolution of the first- and second-order derivatives with the noise distribution for each kernel. For RFF, on the other hand, all the terms remain the same for any shift-invariant kernel except for the distribution of the weights W, which is much more convenient.
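To make the structure of a finite-feature solution tangible, here is a sketch (ours, not the exact expression (14)) of the noiseless analogue: plain score matching in the explicit feature model f(x) = φ(x)ᵀθ reduces to a regularized m × m linear system assembled from feature derivatives. The denoising solution additionally damps each frequency, as discussed next.

```python
import numpy as np

def fit_sm_rff(x, W, b, lam, grad_log_q0):
    """Sketch: plain score matching for f(x) = phi(x)^T theta with
    phi_j(x) = sqrt(2/m) cos(w_j^T x + b_j). Setting the gradient of
    objective (3) plus a ridge term to zero gives
    (G + lam * I) theta = -h, where G accumulates outer products of
    feature gradients and h accumulates feature Laplacians plus the
    base-density term."""
    n, d = x.shape
    m = W.shape[0]
    Z = x @ W.T + b                                  # (n, m)
    c, s = np.cos(Z), np.sin(Z)
    g0 = grad_log_q0(x)                              # (n, d)
    G, h = np.zeros((m, m)), np.zeros(m)
    for i in range(d):
        Di = -np.sqrt(2.0 / m) * s * W[:, i]         # d_i phi(x_a)
        Lii = -np.sqrt(2.0 / m) * c * W[:, i] ** 2   # d_i^2 phi(x_a)
        G += Di.T @ Di / n
        h += (Lii + Di * g0[:, [i]]).sum(axis=0) / n
    return np.linalg.solve(G + lam * np.eye(m), -h)

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
W = rng.normal(size=(128, 2))
b = rng.uniform(0.0, 2 * np.pi, size=128)
theta = fit_sm_rff(x, W, b, lam=1e-3, grad_log_q0=lambda z: -z)
```

The cost is dominated by assembling G, which is linear in n, in contrast to the nd × nd inversion of the exact kernel solution.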
Another important point we would like to stress is that in the resulting solution each feature has a weight proportional to $\exp\left(-\frac{\sigma^2}{2}\,\|w_i\|^2\right)$ (see Appendix A-B for details). This means that the high-frequency features have weights close to zero. Such behavior can be interpreted as a regularization that penalizes oscillating terms.
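This damping factor can be verified directly: for Gaussian noise, E_ε cos(wᵀ(x + ε) + b) = exp(−σ²‖w‖²/2) cos(wᵀx + b). A quick numerical check (values chosen arbitrarily for illustration):

```python
import numpy as np

# Numerical check of the damping factor: convolving a Fourier feature
# with Gaussian noise scales it by exp(-sigma^2 ||w||^2 / 2), so
# high-frequency (large ||w||) features are suppressed.
rng = np.random.default_rng(0)
w, b, x, sigma = np.array([3.0, -2.0]), 0.7, np.array([0.5, 1.0]), 0.3
eps = rng.normal(0.0, sigma, size=(200000, 2))
mc = np.cos((x + eps) @ w + b).mean()
damped = np.exp(-0.5 * sigma ** 2 * (w ** 2).sum()) * np.cos(x @ w + b)
print(mc, damped)  # the two numbers should be close
```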
There are several hyper-parameters in the approach that affect the resulting quality, namely the kernel hyper-parameters θ, the regularization parameter λ and, assuming that the noise is zero-mean Gaussian, the noise variance σ. To tune these parameters, we use the loss on a holdout (validation) set. The loss in this case is the ordinary score matching (no denoising) loss, as we would like to estimate how well our model approximates the original data, not the noisy data.
Another important component of the algorithm is the base density q_0. From a theoretical point of view, the base density is responsible for the tails of the distribution and does not affect the estimator in areas of high density. In this paper, we consider three different options for q_0: a uniform distribution with support bounded by the training sample, a multivariate Gaussian distribution, and a mixture of Gaussians. In the latter case, q_0 is fitted before training using a Bayesian Mixture Model [32].
At the end of the training, we estimate the normalizing constant via importance sampling, as proposed in [14]. It should be noted that in the case of the uniform base density the normalization cannot be estimated properly due to the unknown measure of the data support. The whole method is summarized in Algorithm 1.
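A minimal sketch of this importance-sampling step, under our assumption that the base density q_0 itself serves as the proposal (in which case Z = E_{q_0}[e^{f(x)}] for log p_f(x) = f(x) + log q_0(x)):

```python
import numpy as np

# Importance-sampling estimate of the normalizing constant: for
# log p_f(x) = f(x) + log q0(x), Z = E_{x ~ q0}[exp(f(x))] when the
# base density q0 itself is used as the proposal (our assumption).
rng = np.random.default_rng(0)
f = lambda x: -0.1 * (x ** 2).sum(axis=1)  # placeholder fitted f
samples = rng.normal(size=(100000, 2))     # draws from q0 = N(0, I)
Z_hat = np.exp(f(samples)).mean()
print(Z_hat)
```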
The total complexity of the proposed approach is O(nmd) for assembling the system plus the cost of solving the resulting m × m linear system, which can be reduced by using iterative methods for solving systems of linear equations. Compared to the Nyström-based approximation [13], which requires O(nm²d³ + m³d³) (or, in the case of sub-sampling, O(nm²d + m³)), our approach provides an improvement.

Algorithm 1 DSM RFF
Input: data set D, number of features m, number of samples n_z to estimate the normalizing constant, initial regularization parameter λ
1: Fit q_0(x) to the given data set.
2: while not stopping condition do
3:   for mini-batches D_t, D_v ∈ D do
4:     Compute the random Fourier approximation f*_m = φ(·)ᵀ b_t using equation (14) on D_t, where b_t are the decomposition coefficients with respect to the features φ(·).
5:     Compute the ordinary score matching loss on the validation set D_v.
6:     Do a gradient step over the hyper-parameters (λ, σ, θ).
7:   end for
8: end while
9: Compute f*_m = φ(·)ᵀ b_D using the full dataset D.
10: Compute the normalizing constant approximation Ẑ via importance sampling.

Now, let us provide the bounds on the error of the approximation for the proposed approach. We introduce the derivatives of the exact kernel matrix and, correspondingly, the derivatives of the random feature vector (the definitions follow the notation of Section III-B). The error bound for score matching with RFF is given by the following theorem, whose proof is presented in Appendix A-C.

C. DISCUSSION
While the RFF kernel approximation admits a computationally efficient solution of score matching, the convergence properties remain an open question. Using the results from [11], convergence can be established only in the RKHS that corresponds to the approximate kernel: the density p_{λ,n,m}, obtained using m features, converges within P̃, the exponential family with sufficient statistic φ(Wx + b). To upper bound the error of the approximation we can use the triangle inequality, splitting the error into finite-feature and finite-sample parts, where f_λ minimizes (5), f_{λ,n} is the solution of the finite-sample version of (5), and ‖·‖ is the norm in L²(R^d, p_0). The term ‖f_{λ,n,m} − f_{λ,m}‖ includes the quantity $\|\hat{\xi}_m - \hat{\xi}\| = O(m^{-1/2})$, which can be obtained using the concentration lemma from [13] under an additional assumption on the boundedness of the derivatives of the approximate kernel. This implies that we should potentially use O(n) features to obtain the same convergence rates as in the case of the exact solution. To reduce the lower bounds on the number of features, a refined analysis in the spirit of [18], [30] is needed, which we leave for future study.
In the case of Denoising Score Matching the estimated density converges to p* * p_ε, where p* = arg inf_{p ∈ P̃} J(p_0 ‖ p) is the density from P̃ closest to p_0. So, on the one hand, the noise variance should be as small as possible. On the other hand, the Wasserstein distance between the approximation and the true density for Denoising Score Matching can be upper bounded by a sum of two terms, the first of which scales with E[‖ε‖²] = nσ². The second term takes large values for small noise levels (due to the different supports of the approximate density and the true density) and smaller values for large noise levels. So the choice of σ is a trade-off between the stability of the estimator and how close it is to the unknown density function p_0.

IV. EXPERIMENTS

A. EXPERIMENTAL SETUP
In all our experiments we used the RBF kernel with a diagonal covariance matrix; the noise was assumed to be isotropic Gaussian (though, in general, an arbitrary noise covariance matrix can be used).
We compare the proposed approach (DSM RFF), ordinary score matching with RFF (SM RFF), the exact kernel solution (5) (Exact), and its Nyström version with a subsampled basis [13] (Nyström). We used the original implementations of these models from [33].
The comparison is conducted on two types of data: artificially generated 2D densities and datasets from the UCI repository [34] (the particular choice of data is motivated by previous research on the kernel exponential family [12]-[14]): 1) Synthetic data generated from the following densities: a mixture of Gaussians, Uniform, Mixture of Uniforms, Cosine, Funnel, Banana, Ring, and Mixture of Rings. 2) RedWine, WhiteWine, MiniBoone. The Exact kernel model was not compared on the MiniBoone dataset because it is too computationally expensive.
To estimate the quality of the models, the following metrics were used:
1) Log-likelihood (higher is better). It requires the normalizing constant, which can only be approximated, so the log-likelihood tends to be overestimated [14].
2) Fisher divergence (lower is better). It requires the true log-density gradient to be known and hence can be estimated only for artificial data; moreover, for the uniform settings it can be computed only on the support of the true density. Alternatively, the score matching loss could be used, but scores of different models are not comparable in general.
3) Finite-Set Stein Discrepancy (FSSD) goodness-of-fit test [35], [36] with 0.05 significance level. We used a Gaussian kernel whose lengthscale was chosen as the median of the pairwise distances between samples, to avoid optimization of the test points for a particular model; otherwise the models would not be comparable. The FSSD statistic almost surely equals zero if and only if the model density and p_0 coincide.
4) Wasserstein distance. To estimate this quantity, we used the Metropolis Adjusted Langevin Algorithm (MALA) [37] to draw samples from the model densities (a minimal sketch follows this list). We used step size 0.1; the chain length was 10⁴ with 5·10³ burn-in.
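Below is a minimal MALA sketch with the settings quoted above; `log_p` and `grad_log_p` stand for the (unnormalized) model log-density and its gradient, and the implementation is our illustration rather than the code used in the experiments.

```python
import numpy as np

def mala(log_p, grad_log_p, x0, step=0.1, n_steps=10_000, burn=5_000,
         rng=None):
    """Metropolis Adjusted Langevin Algorithm [37]: Langevin proposal
    x' = x + step * grad_log_p(x) + sqrt(2 * step) * xi, corrected by
    a Metropolis accept/reject step."""
    rng = rng or np.random.default_rng(0)
    x, out = np.asarray(x0, dtype=float), []
    for t in range(n_steps):
        mean_fwd = x + step * grad_log_p(x)
        prop = mean_fwd + np.sqrt(2 * step) * rng.normal(size=x.shape)
        mean_bwd = prop + step * grad_log_p(prop)
        # log acceptance ratio with the Gaussian proposal densities
        log_a = (log_p(prop) - log_p(x)
                 - ((x - mean_bwd) ** 2).sum() / (4 * step)
                 + ((prop - mean_fwd) ** 2).sum() / (4 * step))
        if np.log(rng.uniform()) < log_a:
            x = prop
        if t >= burn:
            out.append(x.copy())
    return np.array(out)

# e.g. samples from a standard 2-D Gaussian:
samples = mala(lambda z: -0.5 * (z ** 2).sum(), lambda z: -z, np.zeros(2))
print(samples.mean(axis=0), samples.var(axis=0))  # ~0 mean, ~1 variance
```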

B. RESULTS
We start by considering an approximate denoising approach (see Appendix A-F for the derivation) to figure out whether there is a benefit from the convolution with noise. To this end, we construct an illustrative experiment with 300 RFF features on a Gaussian mixture. We used a multivariate Gaussian distribution for q_0, and the training set size was 10³. The results are presented in Fig. 1, from which it is clear that the noisy approach better estimates the ground truth in the region between the components, even in the presence of small noise. Also, note that there are fewer oscillations when we add noise to the data.

FIGURE 1. Comparison of score matching with and without noise; the noise variance is σ = 5·10⁻⁴. Values of the log-density below −10 are clipped.

The next step is to compare the proposed algorithm to the other approaches on synthetic 2D data and the datasets from UCI. In this case, we used 512 Random Fourier Features. As the base density q_0 we used a mixture of Gaussians. As in the previous example, a relatively small sample size was used. Our models were trained for 60 iterations using Adam. The competing approaches tend to oversmooth the areas between the components. Another observation is that in some cases the approach tends to choose a large noise variance. In Fig. 3 we visualize the dependence of the loss on the regularization parameter and the noise variance for several 2D distributions. Interestingly, for ''good'' distributions (like Funnel, which has full-space support and a single mode) the loss surface has a wide minimum w.r.t. the regularization and the noise variance. For multimodal distributions, the loss surface has a narrower minimum. For the uniform distribution (which differs from the others in that it has bounded support) the minimum w.r.t. the noise variance is narrow, but it is also separated from zero. This indicates the need for noise in such cases.
In Figure 4 we plot all the metrics for all data sets. For each dataset, each metric was normalized across methods to have unit norm; this was done only for better visualization. The original values are given in Appendix B. The figure shows the mean value of each metric and the corresponding variance computed across 10 runs. From the figure we can see that, w.r.t. almost all metrics (except the log-likelihood), the proposed approach shows better or comparable results in many cases. In particular, the Wasserstein distance is smaller for DSM RFF on all data sets. We can also see that SM RFF tends to have a larger variance than its noisy version.
In Table 1 we provide the results for the datasets from the UCI repository as well as the training times. The MiniBoone dataset is large and the Nyström-based implementation could not fit into memory, so we had to train that model using only a subset of 15000 samples; the other methods were trained using the whole data set. To fairly compare the training times, the experiments were conducted on an Intel(R) Core(TM) i7-7820X CPU @ 3.60GHz with 64Gb RAM. We can see that the proposed approach is much faster than the implementation of the Nyström-based approach.

V. CONCLUSION
In this work, we presented denoising score matching for the kernel exponential family. The computational complexity issue was approached using the Random Fourier Features technique. We derived a closed-form solution that gives a more accurate estimate of the denoising score matching loss. The proposed approach is also computationally more efficient than the existing approaches that model the kernel exponential family using Nyström-type approximations. The obtained solution naturally regularizes the model's complexity due to the convolution with the noise, which can also be interpreted as an additional regularization parameter. The analytical expression allows tuning the noise parameters as well, since they now appear explicitly in the expression, so we can use gradient-based methods to optimize with respect to them. Convolution with the noise prevents model misspecification and yields accurate models when the true density lies on a lower-dimensional manifold.
The obtained model was tested on synthetic and real-world datasets. Our experiments showed that additional noise is beneficial for complex multimodal distributions and for distributions with bounded support; the proposed model gives better estimates in between modes. The empirical study of the loss surface revealed the need for noise, especially in the misspecified case. We also provide error bounds for the method; however, they are not tight, and refining these results is planned for future work.

A. EXACT SOLUTION FOR THE KERNEL DENOISING SCORE MATCHING WITH RFF
We start with the first-order optimality condition for (10); it can be rewritten as an integral equation on α(y). The gradient of V is given by ∇V = (1/n, 1/n, ..., 1/n, 1)ᵀ, and hence α(y) = (β(y)ᵀ, δ)ᵀ with δ = −1/λ. The integral equation on β(y) is expressed through the reduced operator Â(y), defined by (Â(y)f)_i = (A(y)f)_i, i = 1, ..., m − 1. Let b = ∫ p_ε(y) φ_m(y, ·) dy and C = ∫ p_ε(y) Â*(y) β(y) dy. Then we search for the solution of (11) in the form
$$\beta(y) = -\frac{1}{n\lambda}\, \hat{A}(y)\, C + \frac{1}{n\lambda^2}\, \hat{A}(y)\, b.$$
Substituting this ansatz yields a linear equation on C with the operator B + nλI, where B = ∫ p_ε(y) Â*(y) Â(y) dy and b ∈ H is the convolution of φ_m with the noise density p_ε. The solution C* of this equation provides β*(y) and, as a result, the solution to the initial problem. Let us show that the obtained estimator belongs to H. Indeed, since b ∈ H and B + nλI is continuously invertible, we have C* ∈ H. Finally, the solution is recovered from β*(y) up to a component γ ∈ Ker Â(y).

B. RFF SOLUTION DERIVATION
Let us use the expressions for the solution without noise (for simplicity, the term with ∂_i log q_0(x_a) is omitted). Then we have
$$\hat{A}(y)\, \hat{A}^*(z) \approx \partial_y\, \partial_z^\top, \qquad \hat{A}(y)\, \big(\varphi_m(z, \cdot) * p_\varepsilon(z)\big) \approx \frac{1}{n}\, \partial_y \left(\partial_z^2 * p_\varepsilon(z)\right)^\top \mathbf{1}.$$
Now we have everything needed to obtain the RFF approximation of (12). Denoting
$$h = \frac{1}{n} \left(\partial_z^2 * p_\varepsilon(z)\right)^\top \mathbf{1}, \qquad H = \int p_\varepsilon(y)\, \partial_y^\top\, \partial_y\, dy,$$
we obtain the expression for the discretized solution f_K. By taking the limit over K → ∞ and using the limiting matrix H, the solution f* is given by (14), where the index m refers to the number of RFF features.

C. PROOF FOR THE ERROR BOUNDS OF SCORE MATCHING WITH RFF
The idea of the proof is to upper bound the expected squared difference between the solutions, where the difference between the RFF and exact kernel solutions (f*_{n,m}, f*_n) is decomposed into several terms. The expectation is taken jointly over the random Fourier weights and the data points x ∼ p_0. The first term in this decomposition is the difference between ξ̂ and its RFF approximation ξ̂_m; it is bounded using sup_x |φ_i(Wx + b)| ≤ 1. As this bound does not depend on x, the joint expectation is bounded in the same way. For the second term in f*_{n,m}(x) − f*_n(x) the derivation of the upper bound is technically the same, but with lower-order derivatives. In the third term, only R depends on x.
This inequality holds with probability 1 − δ for $n \ge \frac{8}{3\varepsilon^2} \log \frac{m}{\delta}$ and is obtained from the Bernstein inequality, assuming that the weights are fixed [38].
Here the latter term is obtained under the assumption that D_1 does not depend on x. This assumption holds for sufficiently smooth kernels, and we can rewrite the expression under the expectation as a polynomial of the weights times a trigonometric function. Finally, combining all of the above, we obtain the claimed bound.

D. DERIVATION OF H AND h FOR GAUSSIAN NOISE
Assuming that p_ε = N(0, σ²I) and using the Gaussian damping identity
$$\mathbb{E}_\varepsilon \cos\!\left(w^\top (x + \varepsilon) + b\right) = e^{-\frac{\sigma^2}{2}\|w\|^2} \cos(w^\top x + b),$$
we obtain the required convolutions. Next, consider the base density q_0: for a uniform q_0 the corresponding term vanishes, since ∇log q_0 = 0 on the support; for a Gaussian q_0 = N(µ, Σ) the term contains
$$\sum_{a=1}^{n} \sin(W x_a + b) \odot \left( W \Sigma^{-1} (x_a - \mu) \right).$$
In the case of an arbitrary q_0 we use the Taylor expansion ∇log q_0(x + ε) ≈ ∇log q_0(x) + ∇²log q_0(x) ε. In the vicinity of x this is equivalent to the Gaussian case, and the corresponding term is obtained with the simple replacements −Σ⁻¹ → ∇²log q_0(x) and −Σ⁻¹(x − µ) → ∇log q_0(x).

E. DERIVATION OF H AND h FOR ARC-COSINE KERNEL
Considering p = 2, a uniform base density q_0, and isotropic Gaussian noise, we obtain
$$\left( W x_a x_a^\top W^\top \right) * p_\varepsilon = W\, \mathbb{E}_{p_\varepsilon}\!\left[ (x_a + \varepsilon)(x_a + \varepsilon)^\top \right] W^\top = W \left( x_a x_a^\top + \sigma^2 I \right) W^\top,$$
where the last equality holds for any symmetric zero-mean density; the same derivation holds for any symmetric noise distribution with covariance Σ and the corresponding substitution in the above equation.
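The identity used above, E[(x + ε)(x + ε)ᵀ] = xxᵀ + σ²I for zero-mean isotropic noise, is easy to verify numerically (values chosen arbitrarily for illustration):

```python
import numpy as np

# Check: for zero-mean isotropic noise,
# E[(x + eps)(x + eps)^T] = x x^T + sigma^2 I.
rng = np.random.default_rng(0)
x, sigma = np.array([1.0, -2.0]), 0.5
eps = rng.normal(0.0, sigma, size=(500000, 2))
xe = x + eps
mc = xe.T @ xe / len(xe)
print(mc)
print(np.outer(x, x) + sigma ** 2 * np.eye(2))  # should match closely
```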

G. NYSTRÖM KERNEL APPROXIMATION
Let K be the Gram matrix of the M subsampled points; then for the Nyström kernel approximation [39] we have
$$k(x, y) \approx k(x)^\top K^{-1} k(y),$$
where k(x) = (k(x, x_1), ..., k(x, x_M))ᵀ and M is the number of subsampled points.
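For comparison with the RFF construction above, a minimal sketch of this approximation (our illustration; the pseudo-inverse is used for numerical stability):

```python
import numpy as np

def nystrom_kernel(k, X, M, rng):
    """Nystrom approximation [39]: k(x, y) ~ k(x)^T K^{-1} k(y), where
    K is the Gram matrix of M subsampled points."""
    Xm = X[rng.choice(len(X), size=M, replace=False)]
    K_inv = np.linalg.pinv(k(Xm, Xm))
    return lambda x, y: k(x, Xm) @ K_inv @ k(y, Xm).T

rbf = lambda a, b: np.exp(-0.5 * ((a[:, None] - b[None]) ** 2).sum(-1))
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
k_m = nystrom_kernel(rbf, X, 50, rng)
print(rbf(X[:1], X[1:2]), k_m(X[:1], X[1:2]))  # should be close
```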