p-Power Exponential Mechanisms for Differentially Private Machine Learning

Differentially private stochastic gradient descent (DP-SGD), which perturbs the clipped gradients, is a popular approach for private machine learning. The Gaussian mechanism (GM), combined with the moments accountant (MA), has demonstrated a much better privacy-utility tradeoff than the advanced composition theorem. However, it is unclear whether the tradeoff can be further improved by mechanisms with other noise distributions. To this end, we extend GM (p = 2) to the generalized p-power exponential mechanism family (pEM with p > 0) and establish its privacy guarantee. This allows us to improve the privacy-utility tradeoff of GM by searching for a noise distribution in the wider mechanism space. To make pEM practical, we design an efficient sampling method and extend MA to pEM for tightly estimating the privacy loss. Besides, we formally prove the non-optimality of GM based on the variational method. Numerical experiments validate the properties of pEM and provide a comprehensive comparison between pEM and two other state-of-the-art methods. The results show that pEM is preferred when the noise variance is relatively small compared to the signal and the dimension is not too high.


I. INTRODUCTION
MACHINE learning has been highly successful across a variety of applications, including computer vision [1], image understanding [2], and natural language processing [3]. This success owes to advances in computing power, breakthroughs in algorithmic design, and the availability of massive data. However, data collection and mining have raised severe privacy issues [4], [5], [6]. Ideally, a well-trained model should generalize away from the specific data of any individual user. In practice, however, information about the training data is memorized by deep network models [7]. Even worse, attacks exploiting this implicit memorization have demonstrated that private, sensitive training data can be recovered from model parameters or updated gradients [8], [9], which further exacerbates the privacy concerns in the development and application of machine learning.
Recently, differential privacy (DP) has been proposed to protect the privacy of training data by introducing noise into the machine learning process. Machine learning with DP enjoys high effectiveness and efficiency in privacy preservation, thus attracting increasing attention from both academia and industry [10], [11], [12], [13]. Due to the added noise, DP-based machine learning algorithms often suffer from degraded model utility, and a fundamental problem is how to improve the trade-off between model utility and privacy loss. Traditionally, DP for machine learning can be achieved by output perturbation or objective perturbation, i.e., adding calibrated noise to the final model parameters [14] or to the optimization objective [15]. However, for popular deep learning tasks in which the training process is more like a black box, it is very difficult to characterize the dependence of the final parameters or the objective on the training data. In contrast, gradient perturbation, which adds noise to the gradients during training, is more flexible than output or objective perturbation for general machine learning models [16]. A popular gradient perturbation method is DP-SGD [11], [12], [17], which adds noise to the clipped gradients.
Among the existing research, the Gaussian mechanism (GM), which draws noise from a Gaussian distribution, combined with MA [11], which provides a tight estimate of the privacy loss, has been widely applied in various SGD-based machine learning tasks [5], [11], [12], [18], [17], [19]. Besides, considerable research has been conducted to improve the tradeoff from other perspectives, which can be categorized as: sensitivity calibration [19], [20], which improves utility by adaptively calibrating the sensitivity; dimension reduction [21] and transformations [22], which reduce the sensitivity in a lower-dimensional or transformed space instead of the original space; correlation exploration [23], [24], which introduces less noise according to the correlation between gradients and target parameters; and post-processing, which applies denoising techniques to the noisy gradients to improve the trained model utility [25]. However, these strategies are usually problem dependent. To use them, one has to analyze the specific features of the data space and query function and exploit the underlying relationships, which limits their generality. This raises two questions: (1) Can GM be extended to a general mechanism with density proportional to exp(−‖x‖^p/β)? (2) Can MA be applied to such a mechanism? Fig. 1 shows two state-of-the-art problem-independent methods, aGM and MA, for improving the classical Gaussian mechanism (cGM). In particular, cGM sets the noise scale parameter σ ≥ √(2 ln(1.25/δ)) ∆/ε and uses the advanced composition to estimate the total privacy loss after T folds, where ∆ is the global sensitivity. On one hand, the analytical Gaussian mechanism (aGM) [26] sets a smaller σ by numerically solving an inequality, thus achieving better utility than cGM. On the other hand, MA gives a tighter estimate of ε after T folds by solving a minimization problem with respect to the higher moments α(λ). Therefore, MA ensures a stricter privacy guarantee under the same noise variance. Obviously, one can combine aGM and MA to get a better tradeoff.
That is, one can use aGM to set a smaller σ and then use MA to output a tighter total privacy loss. However, aGM and MA can only be applied to GM, and it is unclear how to apply MA to general DP mechanisms. This limits further improvement of the privacy-utility tradeoff. To address this, we aim to propose a new family of DP mechanisms based on GM and extend MA to it. Meanwhile, this motivates a deeper examination of GM itself.

A. MOTIVATION
Given the exact σ, is GM the optimal (ε, δ)-DP mechanism? If not, what is a better mechanism, and how can it be applied efficiently in practice?
Some remarks are in order. First, an "optimal mechanism" or "better mechanism" is one that achieves the lowest or a lower privacy loss under a given mean square error¹. Second, "efficiently apply" means one can generate noise from the proposed mechanism at least as efficiently as from GM. Third, we use MA to track the privacy loss, unless otherwise indicated².

B. BASIC IDEA
To answer the above question, a straightforward approach is to find an instance that is better than GM. That is, we should explore other mechanisms rather than exploiting GM itself. As Fig. 1 shows, the basic idea is to generalize the density function h(x) ∝ exp(−‖x‖²/β) (‖·‖ denotes the ℓ₂ norm by default) to h(x) ∝ exp(−‖x‖^p/β) with p > 0, while using MA to tightly track the privacy loss. Intuitively, it is possible to select a better mechanism from the richer distribution space. This possibility rests on the fact that each pEM has a different density shape and thus richer moment information about the privacy loss variable, which allows a tighter estimate of ε. Figure 2(a) illustrates the density function shape of pEM and the corresponding privacy loss, and Figure 2(b) illustrates the application scope of aGM, MA, and the proposed pEM, based on the parameter analysis in Section VI.

FIGURE 2. Illustration of pEM's privacy loss in panel (a) (density shape and privacy of pEM) and the rough application scope of aGM/MA/pEM in panel (b). In panel (a), ε was tracked by MA under the parameter settings ∆ = σ = 4, δ = 10⁻⁵, q = 0.01, where q is the amplification coefficient.

C. CHALLENGES AND CONTRIBUTIONS
So far, we have illustrated that GM is not optimal and that a better mechanism can be selected from pEM with p ≠ 2. It may seem that the problem is solved. However, two main challenges emerge upon deeper analysis of pEM. One challenge is how to use MA for multi-dimensional pEM.
The main difficulty lies in the inefficient or even prohibitive computation of the high-dimensional integrals in MA. The other challenge is how to formally prove the non-optimality of GM; the main difficulty here is solving the constrained optimization problem with a max-min objective. We aim to solve these challenges, and the main contributions of this work are as follows.
1) We generalize GM to pEM, which ensures ε-DP when p ∈ (0, 1] and (ε, δ)-DP when p > 1. This significantly enriches the existing DP mechanisms and allows us to find a mechanism from the broader mechanism family that ensures a better privacy-utility tradeoff than GM.
2) We apply MA to pEM while preserving high computational efficiency. In particular, the estimate of ε reduces from a d-dimensional integral to a double integral whenever the dimension d ≥ 2. This is the first time MA has been applied to a DP mechanism beyond GM. In addition, we show how to efficiently sample noise from pEM.
3) We derive the necessary conditions for the optimal (ε, δ)-DP mechanism based on the variational method and prove that GM is not the optimal (ε, δ)-DP mechanism when ε is tracked by the moments accountant.
4) Experimental results validate the proposed formula for tracking the privacy loss and show that pEM outperforms two other state-of-the-art DP methods, especially when the signal-to-noise ratio is relatively large and the dimension is not too high.

The remainder of the paper is organized as follows. Section II introduces related work on improving the privacy-utility tradeoff. Section III gives preliminaries about DP, MA, and SGD. In Section IV, we propose pEM and extend MA to general pEM. In Section V, we prove the non-optimality of GM. We conduct extensive experiments to validate pEM in Section VI and summarize our study in Section VII.

II. RELATED WORK
There have been a variety of studies on how to improve the privacy-utility tradeoff in DP. Because DP-SGD is one instance of DP applications, we review the related references, including but not limited to machine learning, from three aspects.

A. PRIVACY LOSS ACCOUNTANT
To improve the tradeoff, one approach is to use different methods to estimate the privacy loss. For a mechanism satisfying (ε₀, δ₀)-DP at each fold, the simple composition ensures (Tε₀, Tδ₀)-DP after T folds. As shown in [27], this can be improved to (ε′, Tδ₀ + δ′)-DP under the advanced composition, with ε′ = √(2T ln(1/δ′)) ε₀ + Tε₀(e^{ε₀} − 1). However, this only exploits the first-order moment of the privacy loss variable. By tracking higher-moment information, the authors of [11] propose MA for achieving a tighter estimate. Meanwhile, MA has been implemented in the TensorFlow Privacy library. Other relaxed notions of DP, such as RDP [28] and CDP [29], have also been proposed to account for the privacy loss. All these notions [11], [28], [29] can give a tighter estimate of the privacy loss than the advanced composition, but the privacy loss measurements in [28], [29] are inconsistent with the original notion of DP and a conversion is needed.
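As a concrete illustration, the two composition rules above can be written out directly; this is a minimal sketch, and the function names are ours:

```python
import math

def simple_composition(eps0, delta0, T):
    # Simple composition: (T*eps0, T*delta0)-DP after T folds.
    return T * eps0, T * delta0

def advanced_composition(eps0, delta0, T, delta_prime):
    # Advanced composition [27]: (eps', T*delta0 + delta')-DP with
    # eps' = sqrt(2*T*ln(1/delta')) * eps0 + T*eps0*(e^eps0 - 1).
    eps = (math.sqrt(2 * T * math.log(1 / delta_prime)) * eps0
           + T * eps0 * (math.exp(eps0) - 1))
    return eps, T * delta0 + delta_prime
```

For example, with ε₀ = 0.1, T = 1000, and δ′ = 10⁻⁵, simple composition gives ε = 100 while advanced composition gives ε′ ≈ 25.7, illustrating the improvement.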

B. SENSITIVITY CALIBRATION
Under a given privacy loss tracking method, one can still calibrate the sensitivity to improve the tradeoff. When the sensitivity is large and the worst case rarely occurs, using the global sensitivity will introduce excessive noise and destroy the model utility. To alleviate this, the smooth sensitivity [30] was proposed to reduce the noise variance. In machine learning, the clipping strategy [11] is a common method to calibrate the influence of training data on gradients or parameters. However, because a fixed clipping threshold cannot accurately capture the varying characteristics of gradients or parameters, several adaptive methods have been proposed to further improve the model utility [19], [20], [31]. Besides directly calibrating the sensitivity, other research with a similar idea has improved the tradeoff via dimension reduction [32], transformation [33], and correlation exploration [13].

C. NOISE DISTRIBUTION DESIGN AND OPTIMAL MECHANISM
Under a given privacy loss accountant and the techniques described above, an orthogonal perspective for improving the tradeoff is to design noise distributions. When outputs are real-valued, the Laplace mechanism [34] and GM [27] are two typical mechanisms, ensuring ε-DP and (ε, δ)-DP, respectively. When outputs are discrete, the Geometric mechanism [35] and the Binomial mechanism [36] are the corresponding counterparts. [37] was the first to prove that it is impossible to design a universally optimal mechanism. Nonetheless, analyzing the optimal mechanism under certain conditions is possible. For ε-DP, [35] proves that the Geometric mechanism is optimal for a single count query, and [38] shows that the staircase-shaped mechanism is optimal for a single real-valued query. Besides, the Laplace mechanism is proved to be optimal under the entropy-minimizing criterion [39] or under the constraint of ε-Lipschitz privacy [40].
For (ε, δ)-DP, although many researchers have studied the theoretical bound of the optimal mechanism [41], [42], [43], [44], they mainly focus on non-adaptive linear queries and seldom propose practical mechanisms. In machine learning, the most common mechanism is GM [12], [17], [45], [46]. In [26], the authors develop aGM, in which the noise variance is derived from the cumulative density function instead of the tail bound approximation. In [47], the authors propose a novel approach, R²DP, in which a twofold distribution may approximately cover the search space of all distributions, leading to highly efficient computation. Different from [11], [26], [47], which exploit GM itself, we propose new mechanisms and analyze the optimality of GM based on the proposed mechanisms.

VOLUME 4, 2016

TABLE 2. Main variables.
c, σ² — Clipping bound for ∇F(w, ξ) and noise variance
B_t, b — Sampled mini-batch at the t-th iteration, with size b
n, q, T — Dataset size, sampling ratio, and iteration number
ε, δ, ∆ — Privacy loss, failure probability, and global sensitivity
p₀, p₁, p_s — Range of pEM [p₀, p₁] with step size p_s

III. PRELIMINARIES
Before giving the formal analysis of the optimal mechanism, we review some basic concepts about mini-batch SGD, DP, and MA. We use bold font to denote vectors and vector functions, such as x, g(·). By default, ‖·‖ means ‖·‖₂. For convenience, we list the acronyms used in Table 1 and the variables in Table 2.

A. MINI-BATCH SGD
For a machine learning task, let F(w, ξ) be the loss function, where w ∈ R^d is the model parameter and ξ is a training sample. In practice, we usually obtain a training set of n i.i.d. samples ξ₁, ..., ξₙ following an unknown distribution P and aim to minimize the empirical risk F(w) = (1/n) Σ_{i=1}^n F(w, ξᵢ) using the iterative update

w_{t+1} = w_t − γ_t (1/b) Σ_{i∈B_t} ∇F(w_t, ξᵢ),   (1)

where γ_t is a positive stepsize and B_t is the sampled mini-batch of size b, which is usually small compared to n. Mini-batch SGD has two advantages [48]: high efficiency, because only first-order gradients are computed, and reliable convergence, due to the reduced variance of the stochastic gradient estimates. Therefore, it has been extensively used in practice [5], [11], [12], [18], [17], [19] and is adopted in our analysis. For DP-SGD, i.e., adding noise to the gradients to protect the privacy of the training data, Eq. (1) is rewritten as

w_{t+1} = w_t − γ_t (1/b) Σ_{i∈B_t} (∇F(w_t, ξᵢ) + x_{t,i}),   (2)

where x_{t,i}, i = 1, ..., b, are noises independently drawn from a given distribution (e.g., a Gaussian distribution) with zero mean at the t-th iteration.
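A single noisy update of this kind can be sketched as follows; this is a minimal illustration with clipping included as in DP-SGD, and the names `dp_sgd_step`, `grad_fn`, and `noise_fn` are ours, not the paper's:

```python
import math

def clip(grad, c):
    # Rescale grad so that its l2 norm is at most the threshold c.
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, c / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def dp_sgd_step(w, batch, grad_fn, gamma, c, noise_fn):
    # One noisy update: average the clipped, noise-perturbed
    # per-example gradients, then take a gradient step.
    d, b = len(w), len(batch)
    acc = [0.0] * d
    for xi in batch:
        g = clip(grad_fn(w, xi), c)
        noise = noise_fn(d)          # zero-mean noise, e.g. Gaussian or pEM
        for j in range(d):
            acc[j] += g[j] + noise[j]
    return [w[j] - gamma * acc[j] / b for j in range(d)]
```

With the noise function set to zero, this reduces to ordinary clipped mini-batch SGD.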

B. DIFFERENTIAL PRIVACY (DP)
DP is a rigorous privacy notion that emerged from a line of work in theoretical computer science and cryptography, aiming to limit information disclosure in computed results by adding proper noise.
Definition 1 (DP [34]). Let ε > 0 and δ ∈ [0, 1]. A mechanism M is said to be (ε, δ)-DP if, for all neighboring data sets D and D′ that differ in a single entry and for any measurable subset E of the output space, we have

Pr[M(D) ∈ E] ≤ e^ε Pr[M(D′) ∈ E] + δ.

The probability is taken over the random coins of M. When δ = 0, we say that M preserves pure DP, denoted ε-DP. When δ > 0, we say that M preserves approximate DP, denoted (ε, δ)-DP.
In this paper, we consider the multi-dimensional noise-addition mechanism for privacy preservation. That is, we aim to find an optimal noise distribution (denoted by its density function h(x)), from which we draw noise x ∈ R^d and add it to the real-valued vector f(D) for privacy protection; that is, M(D) = f(D) + x. One common mechanism used in DP-SGD is GM, which admits the following conclusion.
It has been proven that the estimate of ε in Theorem 1 is loose. To improve the tradeoff, MA is used to track a tight estimate of ε, especially after T folds.

C. MA WITH PRIVACY AMPLIFICATION
MA is a privacy accounting method that uses Markov's inequality to track detailed information about the privacy loss distribution [11]. Formally, the privacy loss variable is defined as

c(o; M, aux, D, D′) = ln ( Pr[M(aux, D) = o] / Pr[M(aux, D′) = o] ),

where aux is the side information and o ∈ R^d is the noisy output. When the noise is drawn from a distribution with density h(x), the distribution of the privacy loss variable is determined by h. To compute the privacy loss ε for a given failure probability δ after T iterations, one needs to solve the following optimization:

ε = min_{λ∈Z⁺} [ T·α_M(λ) + ln(1/δ) ] / λ.   (5)

Without considering amplification, ε is determined by h(x) and its translation h(x − g) with ‖g‖ ≤ ∆, where ∆ is the global sensitivity. When considering amplification, however, ε is further reduced and is determined by h(x) and h_{q,g}(x) = (1 − q)h(x) + q h(x − g), where q is the sampling ratio [11], [49]. Then, the moment integrals I(h, h_{q,g}) and I(h_{q,g}, h) are two important quantities in the analysis of MA and also in our analysis. Note that Eq. (5) is a unimodal/quasi-convex function with respect to λ [50]; therefore, there exists a unique solution to Eq. (5). If we extend λ ∈ Z⁺ to real λ ≥ 1, then MA is extended to Rényi DP [28] and Eq. (5) can be solved to an arbitrary accuracy τ in time log(λ*/τ), where λ* is the optimum.
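Given tabulated per-step log-moments α(λ) for λ = 1, ..., λmax, the minimization in Eq. (5) is a one-line search. The following sketch assumes such a table is available; `eps_from_moments` is our illustrative name:

```python
import math

def eps_from_moments(alphas, T, delta):
    # alphas[lam - 1] holds the per-step log moment alpha(lam), lam = 1..len(alphas).
    # Eq. (5): eps = min over lam of (T*alpha(lam) + ln(1/delta)) / lam.
    return min((T * a + math.log(1.0 / delta)) / lam
               for lam, a in enumerate(alphas, start=1))
```

Because the objective is quasi-convex in λ, a simple scan (or bisection on the discrete grid) finds the unique minimizer.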

D. PIPELINE OF DP-SGD
Based on Sections III-A, III-B, and III-C, we show the pipeline of DP-SGD in machine learning in the top block (light orange) of Figure 3. In particular, it consists of five steps: (1) sample a mini-batch B_t with ratio q; (2) compute the per-example gradients; (3) clip each gradient with threshold c; (4) add noise to the clipped gradients; (5) use the average noisy gradient to update w_t. Note that DP is used to add noise to the clipped gradients in step (4), and mini-batch SGD is used to update the model parameters in step (5). Additionally, MA is used to track the privacy loss of the accumulated noise, based on Eq. (5). Traditionally, the DP mechanism used in step (4) is Laplacian or Gaussian. In this paper, we propose a new family of DP mechanisms, pEM, to protect the training process by adding noise to the clipped gradients, with the aim of further improving the privacy-utility tradeoff.

IV. P -POWER EXPONENTIAL MECHANISM FAMILY
In this section, we show how to design and apply pEM to DP-SGD. As the bottom block (light green) of Figure 3 shows, the process of pEM consists of four parts, A to D, corresponding to Sections IV-A to IV-D, respectively. Note that A and B are two components of C. In Section IV-A, we propose and analyze the general p-power exponential mechanism (pEM) family. Obviously, the optimal mechanism selected from the pEM family achieves a tradeoff at least as good as GM under any given criterion, since GM is the member with p = 2. In Section IV-B, we apply MA to pEM for tightly tracking the privacy loss. In Section IV-C, we show how to select the optimal p* for given parameters. In Section IV-D, we show how to efficiently generate noise from pEM.

A. DEFINITION AND PRIVACY GUARANTEE
pEM is an extension of GM; based on the Gaussian distribution, we define pEM as follows. This section corresponds to box A in Figure 3.

A mechanism belongs to the pEM family if it outputs f(D) + x with noise x ∈ R^d drawn from the density

h(x) = (1/α) exp(−‖x‖^p / β),  p > 0,

where α > 0 is the normalization constant and β > 0 is the scale parameter.
Obviously, pEM contains GM (p = 2) as a special instance. In the following, we further demonstrate that pEM satisfies ε-DP when p ≤ 1 and (ε, δ)-DP when p > 1. That is, pEM can ensure both pure and approximate DP across different values of p > 0. The proof for p ≤ 1 follows straightforwardly from [15], while the proof for p > 1 is based on Lemma 1. In particular, Lemma 1 shows that ‖x‖ follows a generalized gamma distribution γ(r; k, β, p) = (1/N) r^{kp−1} e^{−r^p/β}, r > 0, where k = d/p and N is the normalization constant. Based on Lemma 1, we deduce the following properties of pEM.
Theorem 2. Let M be a random mechanism belonging to the pEM family, and let D and D′ be two adjacent data sets. If the scale parameter β is set as in Eq. (9), then M satisfies the privacy guarantees stated above.

Proof. Refer to Appendix B1.
Although Theorem 2 shows how to set β for the required privacy protection, Eq. (9) gives a loose upper bound because of relaxations in the proof. As a special case, when p = 2, Theorem 2 prescribes a larger noise variance than Theorem 1; a direct comparison shows that the bound of Theorem 1 saves a factor of O(d ln(d)). The reason is that the probability inequality used for proving the general pEM bound is looser than that for the specific GM (Theorem 1 [27]). To further improve the privacy-utility tradeoff from the privacy loss perspective, we extend MA to the pEM family. This enables us to tightly track the privacy loss of pEM, independent of the impacts of such scaling artifacts.

B. MA FOR P EM
This section corresponds to box B in Figure 3. According to Eq. (6) and Eq. (7) in Section III-C, we have to compute two d-dimensional integrals, I(h, h_{q,g}) and I(h_{q,g}, h), to obtain the optimal ε. Obviously, it is inefficient or even prohibitive to compute these integrals directly due to the possibly high dimension d.
To address this, we use the high-dimensional polar transformation and properties of the generalized gamma distribution to reduce them to a double integral. Recall that ε is calculated as

ε = min_{λ∈Z⁺} [ T·α_M(λ; h(x)) + ln(1/δ) ] / λ,

where α_M(λ; h(x)) = max{ln I(h, h_{q,g}), ln I(h_{q,g}, h)}. Therefore, the above equation can be rewritten as ε = max{ε₁, ε₂}, where

ε₁ = min_{λ∈Z⁺} [ T ln I(h, h_{q,g}) + ln(1/δ) ] / λ,
ε₂ = min_{λ∈Z⁺} [ T ln I(h_{q,g}, h) + ln(1/δ) ] / λ,

and h_{q,g}(x) = (1 − q)h(x) + q h(x − g). Obviously, these two optimization problems must be solved to obtain ε, and each of them contains a d-dimensional integral. Fortunately, Theorem 3 shows that I(h_{q,g}, h) > I(h, h_{q,g}) holds for general pEM, which enables us to compute only ε₂ to obtain ε.
Based on the first conclusion of Theorem 3, the computation of ε reduces to ε₂. From the second conclusion of Theorem 3, ε₂ is a unimodal/quasi-convex expression; therefore, there exists a single value λ* at which ε achieves its minimum, which can be found through the bisection method. Theorem 4 further shows that the d-dimensional integral I(h_{q,g}, h) reduces to a double integral whenever d ≥ 2.
Theorem 4 (Double Integral for Computing the Privacy Loss of pEM). According to MA, the privacy loss ε of pEM after T-fold adaptive composition can be tracked by a double integral, as given in Eq. (10).

Proof. Refer to Appendix B3.
The double integral in Theorem 4 is used to track the privacy loss when d ≥ 2. When d = 1, one can directly use a definite integral to track it. That is, ε = min_{λ∈Z⁺} [ T ln I(h_{q,∆}, h) + ln(1/δ) ] / λ, with I(h_{q,∆}, h) = ∫ (h_{q,∆}(x)/h(x))^λ h_{q,∆}(x) dx and h(x) = (1/α) exp(−|x|^p/β). Fig. 2(a) shows that, compared with p = 2, the privacy loss ε tracked by the above formula can be improved from about 1.32 to 0.36.

C. MECHANISM SELECTION
This section corresponds to box C in Figure 3. Based on Eq. (10), the computation of ε depends on several parameters: dimension d, iteration number T, global sensitivity ∆, noise variance σ², sampling ratio q, and failure probability δ. We show how to select a better mechanism from pEM based on these parameters. Recall that "better" in this paper means achieving a smaller ε(T) under a given E‖x‖². Furthermore, based on Lemma 2, E‖x‖² = β^{2/p} Γ((d+2)/p) / Γ(d/p). For convenience, we use σ² to control E‖x‖² through the relation E‖x‖² = dσ². Therefore, a better mechanism can be selected via the following process, as the bottom block (light green) of Figure 3 shows:
• For each candidate p, set β(p) from σ² via the relation above and compute ε(p) by Eq. (10), where ∆ = c and T = t;
• Return the optimal p*EM with p* = arg min_p {ε(p)}.
Among the parameters appearing in this process, q, σ², and T control the privacy-utility tradeoff. (1) A large sampling ratio q causes more privacy loss but better utility. (2) A large noise variance σ² causes less privacy loss but worse utility.
(3) A large iteration number T causes more privacy loss but better utility (note that this is not always true in practice due to overfitting on the training dataset).
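The calibration of β from σ² used in this selection process follows directly from the moment relation above; a minimal sketch (the function name `beta_of_p` is ours):

```python
import math

def beta_of_p(p, d, sigma2):
    # Solve E||x||^2 = beta**(2/p) * Gamma((d+2)/p) / Gamma(d/p) = d * sigma2
    # for the scale parameter beta of a d-dimensional pEM.
    return (d * sigma2 * math.gamma(d / p) / math.gamma((d + 2) / p)) ** (p / 2)
```

For p = 2 this recovers β = 2σ², i.e., the Gaussian density exp(−‖x‖²/(2σ²)) with per-coordinate variance σ².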
Remark 1. The above procedure searches for a better (ε, δ)-DP mechanism from pEM with p > 0 in general. Note that this does not contradict the fact that pEM is ε-DP when 0 < p ≤ 1, because MA can track the privacy loss ε of any DP mechanism under a given δ.
For example, for a task with the parameters shown in Fig. 4, we can use 1.3EM to achieve a better tradeoff than GM, where the dimension is d = 20.

D. COMPUTATIONAL METHOD
This section corresponds to box D in Figure 3. So far, we have shown how to track the privacy loss ε(p) for pEM and select a better p* = arg min_p {ε(p)}. The only remaining practical problem is how to efficiently generate noise from p*EM. We show that the computational cost of sampling from pEM is comparable to that of sampling from a Gaussian distribution; that is, we can efficiently generate random numbers from pEM without extra complex computation. The formal statement is as follows.
Theorem 5 (Computational Method). Given two random variables r ∈ R⁺ and x ∈ R^d, where r is drawn from the generalized gamma distribution γ(r) ∝ r^{d−1} e^{−r^p/β} and x is drawn from the Gaussian distribution N(0, I_d), the random variable x ← (x/‖x‖) · r has a pEM distribution with parameters p and β.
Proof. Refer to Appendix B4.
Note that ∫ x_i h(x) dx enjoys rotational symmetry on the integral domain R^d, and E x_i = 0 holds due to the oddness of the integrand x_i h(x). Therefore, one can generate noise as follows.
Step 1. Generate x from the Gaussian distribution N(0, I_d).
Step 2. Generate r from the generalized gamma distribution γ(r) ∝ r^{d−1} e^{−r^p/β} and output (x/‖x‖) · r.
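These two steps can be implemented with standard-library primitives. The radius can be drawn by the change of variables u = r^p, which turns the generalized gamma density r^{d−1} e^{−r^p/β} into an ordinary Gamma(d/p, β); this is a sketch, and `sample_pem` is our name:

```python
import math
import random

def sample_pem(d, p, beta, rnd=random):
    # Step 1: a direction uniform on the unit sphere, via a normalized Gaussian.
    g = [rnd.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    u = [v / norm for v in g]
    # Step 2: a radius with density proportional to r**(d-1) * exp(-r**p / beta).
    # If U ~ Gamma(shape=d/p, scale=beta), then r = U**(1/p) has that density.
    r = rnd.gammavariate(d / p, beta) ** (1.0 / p)
    return [ui * r for ui in u]
```

For p = 2 and β = 2 this reproduces the standard Gaussian N(0, I_d), which provides an easy empirical check.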

V. ANALYSIS OF OPTIMAL MECHANISM
From Fig. 2(a) and Fig. 4, we have seen that GM is not optimal, but this observation lacks theoretical support, which motivates us to analyze the optimality of GM formally. We first model it as a constrained optimization problem in Section V-A. Based on the variational method, we then derive the necessary conditions for the optimal solution in Section V-B and finally prove the non-optimality of GM in Section V-C.

A. PROBLEM FORMULATION
In this section, we analyze the optimal mechanism in the whole space of continuous density functions, beyond pEM. The reason is that even if we could prove the optimality of GM within pEM, this would not be sufficient to ensure its optimality in the whole space, whereas the converse does hold. Therefore, we model the problem as the following optimization (Problem 1) in its general form. For convenience, let h_{q,g}(x) = (1 − q)h(x) + q h(x − g).

Problem 1 (Optimal Density Function).
In Problem 1, the objective function ε is presented as an optimization with respect to λ ∈ Z⁺, subject to three constraints. The first is the probability density constraint; the second concerns the model utility, where A is a given constant; and the last concerns the higher-order moment information. Recall from Section III-C that α_M(λ; h(x)) = max{ln I(h, h_{q,g}), ln I(h_{q,g}, h)}, where I(h, h_{q,g}) = ∫ (h(x)/h_{q,g}(x))^λ h(x) dx. Therefore, Problem 1 can be further decomposed into two subsequent optimization problems. Assume ε* is its optimal value; then Problem 1 is equivalent to ε* = min_{λ∈Z⁺} {ε*(λ)}, where ε*(λ) = max{ε₁*_λ, ε₂*_λ} and ε_i*_λ, i = 1, 2, are the two minima corresponding to the following two optimizations.

B. NECESSARY CONDITIONS
The optimal mechanism h(x) must lie in the solution sets of Problems 2 and 3, which can be derived by solving two functional extreme value problems via the variational method. With respect to Problems 2 and 3, we construct two augmented Lagrangian functionals, formulated as Eq. (13), where µ₁ and µ₂ are two coefficients to be determined. By setting the derivative of Eq. (13) to zero, we obtain the necessary conditions for the optimal solution to Problem 1.

Theorem 6. If a function h(x) is the optimal solution to Problem 1, then at least one of the following conditions holds.
Proof. Refer to Appendix B5.
Ideally, we would like to obtain a closed form of h(x) and determine the corresponding coefficients µ₁ and µ₂. However, due to the co-occurrence of the terms g, x − g, and x + g in the function f(·), as well as the co-occurrence of E_x f^λ(·) and E_x f^{λ+1}(·), it is hard to obtain a closed-form expression of h(x) directly. Nevertheless, Theorem 6 still gives us two theoretical implications. On one hand, it enables us to search for the optimal mechanism through numerical calculation; for example, one can construct a series of functions that converge to the solution of the equations in Theorem 6 and generate noise from the limiting function to improve the tradeoff. On the other hand, we can use these necessary conditions to test the optimality of specific mechanisms, such as GM.

C. OPTIMALITY OF GM
In this section, we prove that GM is not the optimal mechanism, based on Theorems 3, 6, and Lemma 3.
To do so, we need to verify that h(x) = (1/α) e^{−‖x‖²/β} does not satisfy both conditions in Theorem 6. Fortunately, Theorem 3 indicates that I(h_{q,g}, h) > I(h, h_{q,g}) ≥ 1 for the pEM family; that is, it suffices to verify the second condition. Furthermore, Lemma 3 indicates that the sampling ratio q does not change the optimal mechanism. This is because, for all mechanisms achieving (ε, δ)-DP with different ε and fixed δ, the mechanisms after subsampling with probability q satisfy (log(1 + q(e^ε − 1)), qδ)-DP; therefore, sampling does not change the optimality. Substituting h(x) = (1/α) exp(−‖x‖²/β) into the second condition with q = 1, a straightforward calculation reduces the verification for GM to checking whether the equality e^{2∆²/β} = 1 + 2∆²/β holds. Because e^{2∆²/β} ≠ 1 + 2∆²/β whatever ∆²/β > 0 is, we obtain the following conclusion.
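The final inequality is easy to confirm numerically: e^t exceeds 1 + t strictly for every t = 2∆²/β > 0. A minimal check:

```python
import math

# e^t = 1 + t + t^2/2! + ... > 1 + t for every t > 0,
# so the equality e^(2*Delta^2/beta) = 1 + 2*Delta^2/beta can never hold.
def gap(t):
    return math.exp(t) - (1.0 + t)

for t in [1e-3, 0.1, 1.0, 10.0]:
    assert gap(t) > 0.0
```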
Theorem 7. GM is not the optimal (ε, δ)-DP mechanism which achieves the minimal privacy loss under the given model utility.
Proof. Refer to Appendix B6.

VI. PERFORMANCE EVALUATION
In this section, we conduct extensive experiments to validate the properties of pEM and give a comprehensive comparison with other state-of-the-art methods. We implemented pEM in Python and embedded it into the TensorFlow Privacy library by modifying the noise generation and privacy ledger modules. Our code is publicly available⁴ and supports DP-SGD based algorithms (Algorithm 1), where the input parameters are defined in Table 2. In all experiments, a better value p* is selected by searching the interval [1, 2] with step size 0.1 (i.e., p₀ = 1, p₁ = 2, p_s = 0.1 in Algorithm 1), and the optimal λ ∈ Z⁺ is found via the bisection method.

Algorithm 1: DP-SGD Algorithm with pEM
Input: F(w, ξ), d, c, q, σ², δ, T, n, p₀, p₁, p_s.
Output: w_T and ε.
// Selecting the optimal pEM.
for p ∈ [p₀, p₁] with step size p_s do
  Set β(p) from σ²;  // Eq. (11)
  Compute ε(p);  // Eq. (10)
Return p* = arg min_p {ε(p)} and ε = ε(p*);
// Updating the model parameters.

A. VALIDATION OF THE PROPOSED PRIVACY LOSS FORMULA
We first validate Eq. (10), which is used for tracking the total privacy loss of pEM. Our approach is to compare the results derived from Eq. (10) with those of classical MA when p = 2. Because MA was formally proposed in [11] and was then applied to natural language processing [12] by the original authors, it is convincing to compare the results of Eq. (10) against [11], [12]. Specifically, we set the same parameters as in [11], [12], which also use MA to track ε. As shown in Table 3, the ε accounted by Eq. (10) is consistent with the results in [11], [12], which validates the proposed method.

B. IMPACTS OF PARAMETERS ON PRIVACY LOSS
As mentioned above, several parameters have impacts on privacy loss ε. Fig. 5 shows the results with respect to parameters q, δ, T, σ 2 when dimension d is fixed as 20. We set q = 10 −3 , δ = 10 −5 , T = 10 4 , σ 2 = 4 2 as the baseline, and varied each of them in the range shown in Fig. 5.
Overall, three conclusions can be drawn from Fig. 5. Firstly, ε increases with q and T (Figs. 5(a) and 5(c)) but decreases with δ and σ² (Figs. 5(b) and 5(d)). Fig. 5(a) is consistent with Lemma 3: the larger q, the larger ε. Figs. 5(b) and 5(c) are consistent with the formula ε = min_λ [T ln I(h_{q,g}, h) + ln(1/δ)]/λ, whose numerator increases with T and decreases with δ. Fig. 5(d) is consistent with the definition of DP: a larger noise variance corresponds to a stricter privacy guarantee (lower ε). Secondly, there exist better mechanisms (e.g., p = 1 or p = 1.4) that achieve significantly lower ε than GM (i.e., p = 2). For example, in Fig. 5(b), where Δ/σ = 1, about 50% of the privacy loss is saved. Thirdly, the advantage in Figs. 5(b)-5(d) shrinks as the parameters (i.e., δ, T, σ²) increase, especially in Figs. 5(b) and 5(c). The trend in Fig. 5(b) accords with the definition of DP: when δ < 1 is large, each DP mechanism can satisfy Pr[M(D) ∈ E] ≤ e^ε Pr[M(D′) ∈ E] + δ with a smaller ε, so the differences among mechanisms decrease. Fig. 5(c) can be explained by the idea of MA, which exploits the moment information of the privacy-loss variable; as pointed out in [51], this variable converges to the normal distribution as T → ∞, so the differences decrease. Fig. 5(d) is indicated by the analysis of Theorem 7: when Δ/σ is small, GM is nearly optimal, so all pEMs tend to the same level of privacy loss.
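The monotone trends above can be reproduced with the tail-bound conversion used by the moments accountant, ε = min_λ [T·α(λ) + ln(1/δ)]/λ. As an illustrative assumption, `alpha` below uses the leading-order log-moment bound q²λ(λ+1)/σ² for the subsampled Gaussian case; it is a sketch, not the paper's Eq. (10).

```python
import math

def epsilon_ma(T, q, sigma, delta, lam_max=64):
    """Moments-accountant conversion: eps = min over integer lambda of
    (T * alpha(lambda) + ln(1/delta)) / lambda, where alpha is the
    leading-order log-moment bound for the subsampled Gaussian case
    (an illustrative assumption, not the paper's exact accountant)."""
    best = float("inf")
    for lam in range(1, lam_max + 1):
        alpha = q * q * lam * (lam + 1) / (sigma * sigma)
        best = min(best, (T * alpha + math.log(1.0 / delta)) / lam)
    return best
```

Evaluating this function reproduces the directions in Fig. 5: ε grows with q and T and shrinks with δ and σ².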

C. COMPARISON RESULTS BETWEEN CGM, AGM, AND P EM
In this section, we compare pEM with cGM, aGM, and MA (interchangeably denoted as 2EM). We present a comprehensive numerical comparison in Section VI-C1, where various settings are considered. Then, we compare cGM/aGM/MA/pEM by conducting several machine learning tasks in Section VI-C2.

1) Comprehensive Numerical Comparison
For a given (ε_0, δ_0) at each fold, cGM, aGM, and pEM obtain different (ε, δ) after T folds because of their different tracking methods. In particular, by applying advanced composition to cGM and aGM, both ε and δ vary with T. By contrast, when using MA in pEM, only ε varies with T while δ is fixed. We therefore design the following pipeline to make a fair comparison: the goal is to fix δ for cGM, aGM, and pEM after T folds, so that only ε needs to be compared. In this subsection, δ was fixed at 10^-4.
Step 2. Set σ². For the derived δ_0 and the given ε_0 and Δ, set σ²_c for cGM via σ_c ← √(2 ln(1.25/δ_0)) Δ/ε_0, and set σ²_a for aGM via the corresponding aGM calibration.
Step 3. Track ε. Based on σ_c, σ_a, and q, use advanced composition and Lemma 3 to track ε_c for cGM and ε_a for aGM. Meanwhile, based on σ_a and δ, use MA to track ε_p for pEM.
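The cGM calibration in Step 2 is the classical Gaussian-mechanism bound (valid for ε_0 < 1); a minimal sketch:

```python
import math

def sigma_classical_gm(eps0, delta0, sensitivity):
    """Classical Gaussian-mechanism calibration used for cGM in Step 2:
    sigma_c = sqrt(2 * ln(1.25/delta0)) * Delta / eps0.
    (The aGM calibration is numerical and is not reproduced here.)"""
    return math.sqrt(2.0 * math.log(1.25 / delta0)) * sensitivity / eps0

# e.g. a per-fold budget (0.5, 1e-4) with unit sensitivity:
sigma_c = sigma_classical_gm(0.5, 1e-4, 1.0)
```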
Note that σ_c and σ_a vary with T. Table 4 shows the range of σ_a as T increases up to 10,000 with step 200. Vertically, σ_a increases with ε_0; horizontally, σ_a decreases with q; and within each cell, σ_a increases with T. Fig. 6 shows the privacy comparison between cGM, aGM, and pEM for all settings in Table 4. Some intuitively correct results are observed from Fig. 6. (1) From left to right, ε decreases with the amplification ratio q.
(2) In each plot, ε increases with ε_0 and T. The underlying reasons were interpreted in Section VI-B.
Besides, we present a comprehensive comparison of cGM, aGM, and pEM. Firstly, between cGM and aGM, the latter consistently outperforms the former, so we drop cGM from the remaining comparison. Secondly, within the pEM family, there exists a mechanism better than 2EM when ε_0 = 3, 1.
In such cases, where σ/Δ ranges from about 1 to 5 (rows 1-2 of Table 4), 2EM can be replaced with a better pEM (e.g., p = 1.4). Thirdly, between aGM and pEM, aGM outperforms pEM when T and ε_0 are small. For example, in Fig. 6(c), when ε_0 = 1, 0.1, 0.01, the curves of aGM lie below those of pEM near the origin; elsewhere, pEM always outperforms aGM. Note that the three red boxes in Figs. 6(c) and 6(d) indicate cases where pEM is significantly worse than aGM. The reason is that we restrict λ to the fixed range {1, ..., 32} in all cases; in fact, as q diminishes, the λ attaining the minimal ε should grow to infinity. However, these cases have little practical significance: as the red cells in Table 4 show, the ratio Δ/σ is then so large that the output utility would be destroyed in practice. Finally, Fig. 2(b) sketches the application scopes of aGM and pEM, showing that pEM is the preferred choice when ε_0 and T are moderate.

2) Machine Learning Accuracy Comparison
In this section, we applied cGM, aGM, and pEM to two datasets, Adult 5 and SAHeart 6. Each dataset was trained with logistic regression (LR) and support vector machine (SVM) 50 times, with regularization coefficient ρ = 0.001 and learning rate 0.01. The goal is to protect the training process, where noise is added to the gradients. The experimental settings are as follows.
• Privacy loss (ε, δ). With SVM, we aimed to achieve (1, 10^-4)-DP and (1, 10^-2)-DP for Adult and SAHeart, respectively. With LR, the privacy loss was set to (0.5, 10^-4)-DP for Adult and (0.5, 10^-2)-DP for SAHeart.
• Global sensitivity Δ. For SVM, Δ was set to 1 by clipping gradients with threshold 1. For LR, Δ was set to 1/4 + 0.001, based on the smoothness constant 1/4 + ρ.
• Sampling ratio q. For Adult, q was set to 0.0044 (batch size 128 over dataset size 29,304). For SAHeart, q was set to 0.01 (batch size 5 over dataset size 462).
• Noise variance σ² and iteration number T. In each of the four settings (LR on SAHeart, SVM on SAHeart, LR on Adult, and SVM on Adult), we compared the proposed pEM with cGM and aGM. To ensure the same privacy loss in each setting, σ²/Δ and T were set as in Table 5, following the method used in the numerical comparison (Section VI-C1).
Fig. 7 shows the average test accuracy with respect to the privacy loss ε, which reflects the accumulated noise added to the gradients. The experimental results are largely consistent with the numerical comparison in Section VI-C1, and three conclusions are drawn.
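The clip-then-perturb step protecting the gradients can be sketched as follows. This is a minimal illustration: Gaussian noise stands in for the selected pEM noise (the p = 2 case), and the noise scaling is a simplifying assumption rather than the paper's exact calibration.

```python
import numpy as np

def private_gradient(per_example_grads, clip=1.0, sigma=4.0, rng=None):
    """One DP-SGD aggregation step: clip each per-example gradient to
    L2 norm `clip`, average, then perturb. Gaussian noise is a stand-in
    for the selected pEM noise (illustrative, p = 2 case)."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g / max(1.0, norm / clip))  # scale down only if norm > clip
    mean = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, sigma * clip / len(per_example_grads), size=mean.shape)
    return mean + noise
```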
Firstly, aGM outperforms cGM in most cases, because aGM uses a smaller variance than cGM under a given privacy loss ε. Especially in Figs. 7(a) and 7(b), due to the large noise variance, cGM has a fast-increasing phase followed by a fluctuating plateau. Secondly, there exists a better mechanism, 1.4EM, which achieves higher test accuracy than 2EM. Especially in Fig. 7(a), the smaller noise variance gives 1.4EM faster and better performance. In the other three plots, 1.4EM also trains faster than 2EM but reaches the same final utility. Thirdly, pEM outperforms aGM overall, because MA is used to track the privacy loss in pEM; the tighter estimate of ε allows more iterations for pEM than for aGM. However, in Fig. 7(d), aGM trains faster than pEM, mainly because this task can be trained efficiently even with a small total privacy loss.

VII. CONCLUSION
In this paper, we showed that GM is not the optimal (ε, δ)-DP mechanism and proposed a general pEM family to improve on the tradeoff of GM. Two challenges emerge in the full analysis of pEM: one is how to extend MA for tracking the privacy of pEM, and the other is how to prove the non-optimality of GM. We used the high-dimensional polar transformation and properties of the gamma distribution to solve the first challenge, and the variational method to solve a derived restricted optimization problem for the second. Besides the theoretical analysis, we showed how to apply pEM in practice and that one can generate noise from pEM as efficiently as generating normal noise. We conducted extensive numerical experiments to validate the properties of pEM and showed that pEM is preferred when the ratio of noise variance to signal is relatively small and the dimension is not too high. Otherwise, the improvement of pEM is slight: when the ratio is large, GM is proved to be nearly optimal, so pEM cannot obtain a significant improvement; when the dimension is too high, pEMs with different p have similar moment information, so pEM performs similarly to GM. In future work, we will develop pEM in two directions. In theory, we will analyze the impacts of parameters on the privacy loss and aim to find a closed-form solution for the optimal pEM. In practice, we will apply pEM to federated learning to improve the existing privacy-utility tradeoff of GM.

APPENDIX. SUPPLEMENTARY MATERIAL FOR THEORETICAL CONCLUSIONS
A. THREE LEMMAS
In this section, we present two lemmas (Lemmas 1 and 2) about the proposed pEM and adopt a lemma (Lemma 3) about the privacy-loss amplification after sampling. In particular, for the pEM with density function h(x) = (1/α) e^{-‖x‖^p/β}, Lemma 1 gives the density function of ‖x‖ and Lemma 2 gives E‖x‖². In the proofs of both Lemma 1 and Lemma 2, the following high-dimensional polar transformation is needed. For x = (x_1, ..., x_d)^T ∈ R^d, let
x_1 = r cos θ_1,
x_2 = r sin θ_1 cos θ_2,
...
x_{d-1} = r sin θ_1 ··· sin θ_{d-2} cos θ_{d-1},
x_d = r sin θ_1 ··· sin θ_{d-2} sin θ_{d-1},
where θ_i ∈ [0, π] for i = 1, ..., d - 2, θ_{d-1} ∈ [0, 2π], and the Jacobian determinant is r^{d-1} sin^{d-2} θ_1 sin^{d-3} θ_2 ··· sin θ_{d-2}. Furthermore, we denote the surface area of the unit sphere in R^d by A(d) = 2π^{d/2}/Γ(d/2).
Lemma 1. If x ∈ R^d has the density function h(x) = (1/α) e^{-‖x‖^p/β}, then the variable r = ‖x‖ follows the generalized gamma distribution with density γ(r; k, β, p) = (1/N) r^{kp-1} e^{-r^p/β}, r > 0, where the normalization satisfies 1/N = p/(β^k Γ(k)) and k = d/p.
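Lemma 1 also suggests a direct radial-decomposition sampler for pEM noise: draw the radius via ‖x‖^p ~ Gamma(shape = d/p, scale = β), then draw a uniform direction on the unit sphere. This is a standard construction sketched under those assumptions, not necessarily the paper's exact implementation.

```python
import numpy as np

def sample_pem(d, p, beta, size=1, rng=None):
    """Draw from h(x) proportional to exp(-||x||^p / beta) on R^d.
    By Lemma 1, s = ||x||^p follows Gamma(shape=d/p, scale=beta);
    the direction is uniform on the unit sphere."""
    rng = rng or np.random.default_rng(0)
    s = rng.gamma(shape=d / p, scale=beta, size=size)   # s = r^p
    r = s ** (1.0 / p)                                  # radius r = ||x||
    u = rng.normal(size=(size, d))                      # isotropic Gaussian
    u /= np.linalg.norm(u, axis=1, keepdims=True)       # normalize to sphere
    return r[:, None] * u
```

For p = 2 and β = 2σ², this recovers the isotropic Gaussian mechanism.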
Proof. The cumulative distribution function of ‖x‖ is, for z > 0, F_{‖x‖}(z) = ∫_{‖x‖ ≤ z} h(x) dx. Applying the polar transformation in Eq. (14), we have F_{‖x‖}(z) = (A(d)/α) ∫_0^z e^{-r^p/β} r^{d-1} dr. Taking the derivative with respect to z and then replacing z with r, we obtain the probability density function f_{‖x‖}(r) = (A(d)/α) r^{d-1} e^{-r^p/β}. Comparing f_{‖x‖}(r) with the density of the generalized gamma distribution γ(r; k, β, p) = (1/N) r^{kp-1} e^{-r^p/β}, r > 0, where 1/N = p/(β^k Γ(k)), we deduce that ‖x‖ follows the generalized gamma distribution with k = d/p and A(d)/α = 1/N = p/(β^k Γ(k)).
Lemma 2. If the random variable x has the probability density function h(x) = (1/α) e^{-‖x‖^p/β}, x ∈ R^d, where α is the normalization, then the m-th order moment is E‖x‖^m = β^{m/p} Γ((d+m)/p)/Γ(d/p).
To differentiate scalar and vector, we rewrite x = ‖x‖ and take g with ‖g‖ = Δ. For an intuitive picture, we refer to Fig. 8 to complete the proof. Uniformly splitting the regions (-∞, Δ/2] and [Δ/2, +∞) into intervals of length 1, we obtain one interval containing x and one containing x̄, respectively. The result about I(h_{q,Δ}(x), h(x)) can be obtained similarly. Next, we prove that I(h_{q,Δ}(x), h(x)) ≥ I(h(x), h_{q,Δ}(x)). For simplicity, we omit the variable x and denote f(a_i) = (1 - q + q a_i)^λ. To prove I_gm(h_{q,Δ}, h) ≥ I_gm(h, h_{q,Δ}), after replacing a_i with 1/a_i in Eq. (23), it is equivalent to prove f(a_i) f(1/a_i) ≥ 1. Based on the definition of f(a_i) and λ ≥ 1, this further reduces to proving g(a_i) = (1 - q + q a_i)(1 - q + q/a_i) ≥ 1 for 0 < a_i ≤ 1.
Taking the derivative of g(a_i) with respect to a_i, we have g′(a_i) = q(1 - q)(1 - 1/a_i²). Because the sampling ratio q ∈ [0, 1] and a_i ∈ (0, 1], we have g′(a_i) ≤ 0, so g(a_i) is decreasing in a_i. By the definition of g(·), g(1) = 1; hence g(a_i) ≥ 1, with equality only if a_i = 1. Therefore, I_gm(h_{q,Δ}, h) ≥ I_gm(h, h_{q,Δ}) and then I(h_{q,G}, h) ≥ I(h, h_{q,G}). When x ∈ R^d, d > 1, the proof is similar, except that the scalar Δ is replaced by the vector g satisfying ‖g‖ = Δ, and the symmetric points x, x̄ ∈ R^1 (with respect to x = Δ/2) are replaced by symmetric points x, x̄ ∈ R^d with respect to Δ/2.
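The key inequality in the proof, g(a) = (1 - q + qa)(1 - q + q/a) ≥ 1 on (0, 1] with equality at a = 1, is easy to spot-check numerically (q = 0.3 below is an arbitrary illustrative choice):

```python
import numpy as np

def g(a, q):
    """Auxiliary function from the proof: g(a) = (1-q+q*a)*(1-q+q/a)."""
    return (1 - q + q * a) * (1 - q + q / a)

# Spot-check the claim: on (0, 1], g(a) >= 1, g is decreasing
# (matching g'(a) = q(1-q)(1 - 1/a^2) <= 0), and g(1) = 1.
a = np.linspace(0.01, 1.0, 500)
vals = g(a, q=0.3)
```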

Case 2)
Based on Eq. (7), i.e., I(h_{q,g}, h) = E_{x∼h_{q,g}(x)}[(1 - q + q h(x-g)/h(x))^λ], we expand the λ-th power. Furthermore, each term E_{x∼h_{q,g}(x)}[(h(x-g)/h(x))^i] can be written out accordingly, so I(h_{q,g}, h) can be expressed as a combination of these moments. Since ln E_{x∼h(x-g)}[(h(x-g)/h(x))^α] is convex in α (refer to Lemma 36 in [49]) and exp(·) is convex and increasing, each moment is convex, and hence I(h_{q,g}, h) is convex.
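The convexity step can be sanity-checked numerically: for any positive random variable Z, the moment function M(λ) = E[Z^λ] is convex in λ, because z^λ = e^{λ ln z} is convex in λ for each fixed z > 0. A log-normal Z stands in for the ratio term 1 - q + q·h(x-g)/h(x); this choice is purely illustrative.

```python
import numpy as np

# Any positive variable Z works; log-normal is an arbitrary stand-in
# for the ratio term inside I(h_{q,g}, h).
rng = np.random.default_rng(0)
z = rng.lognormal(mean=0.0, sigma=0.5, size=100_000)

def M(lam):
    """Empirical moment function M(lam) = mean(z**lam), convex in lam
    because z**lam is convex in lam for each sample."""
    return np.mean(z ** lam)
```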

3) Proof of Theorem 4
Proof. For greater clarity, we recall several expressions at the beginning.
Replacing the results for E_{x∼h(x)}[(1 - q + q h(x-g)/h(x))^λ] and E_{x∼h(x)}[(1 - q + q h(x)/h(x+g))^λ] back into I(h_{q,g}, h), and replacing θ_1 with x in g(θ_1, d), we complete the proof.