Unbiased Estimating Equation and Latent Bias Under f-Separable Bregman Distortion Measures

We discuss unbiased estimating equations in a class of objective functions using a monotonically increasing function f and Bregman divergence. The choice of the function f gives desirable properties, such as robustness against outliers. To obtain unbiased estimating equations, analytically intractable integrals are generally required as bias correction terms. In this study, we clarify the combinations of Bregman divergence, statistical model, and function f for which the bias correction term vanishes. Focusing on Mahalanobis and Itakura-Saito distances, we generalize fundamental existing results and characterize a class of distributions of positive reals with a scale parameter, including the gamma distribution as a special case. We also generalize these results to general model classes characterized by one-dimensional Bregman divergence. Furthermore, we discuss the possibility of latent bias minimization when the proportion of outliers is large, which is induced by the extinction of the bias correction term. We conducted numerical experiments to show that the latent bias can approach zero under heavy contamination of outliers or very small inliers.


I. INTRODUCTION
THE maximum likelihood estimation (MLE) for the statistical model p(x|θ) estimates the parameter θ by minimizing the negative log-likelihood. It is equivalent to empirical inference under the Kullback-Leibler (KL)-divergence. However, MLE is susceptible to outliers or mismatches of the assumed model. In robust statistics, estimation methods weakening adverse effects of outliers have been studied [1], [2]. M-estimation is a well-known method that changes the negative log-likelihood function $-\frac{1}{n}\sum_{i=1}^{n}\log p(x_i|\theta)$ to another function $\frac{1}{n}\sum_{i=1}^{n}\rho(x_i,\theta)$ [2]. This is equivalent to changing the assumed model to a heavy-tailed distribution under the KL-divergence measure.

Masahiro Kobayashi is with the Information and Media Center, Toyohashi University of Technology, Toyohashi 441-8580, Japan (e-mail: kobayashi@imc.tut.ac.jp).
Kazuho Watanabe is with the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi 441-8580, Japan (e-mail: wkazuho@cs.tut.ac.jp).
Communicated by P. Netrapalli, Associate Editor for Machine Learning and Statistics.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TIT.2024.3366538.
Digital Object Identifier 10.1109/TIT.2024.3366538

Another definition is given by solving the following estimating equation with respect to the parameter θ, $\sum_{i=1}^{n}\psi(x_i,\theta)=0$, (1) where the function ψ is the generalized score function in MLE [1], [2]. Although the two definitions are not always the same, they concur if the condition $\psi(x,\theta)=\frac{\partial\rho(x,\theta)}{\partial\theta}$ holds. Another well-known estimation method is the minimum divergence estimation, which changes the KL-divergence to another divergence [3], [4]. Well-known divergences are the f- and Bregman divergences, which are defined by a strictly convex function [5]. The only divergence common to both classes is the KL-divergence [6]. The f-divergence includes the α-divergence as a subclass, which plays an important role in information geometry [7]. In particular, estimators based on the minimization of the Hellinger distance, which is an α-divergence, are both robust and fully efficient [8]. However, continuous model estimation using the f-divergence requires non-parametric kernel density estimators in place of the true distribution. Non-parametric kernel density estimators suffer from a bandwidth selection problem and from worse estimation efficiency in high dimensions. There are known methods to deal with this problem, such as using the same kernel to represent the empirical and model distributions [9], or using a dual representation of the divergence [10], [11], [12]. In contrast, estimation based on minimizing the Bregman divergence does not require a kernel density estimator because the empirical distribution is available. Broniatowski et al. [13] called this type of divergence, in which the empirical distribution can be used instead of the true distribution, a decomposable divergence. Jana and Basu [14] simply called such a divergence a non-kernel divergence and showed that single-integral non-kernel divergences are limited to the Bregman divergence. The β-divergence, also known as the density power divergence, belongs to the Bregman divergence class and was the first non-kernel divergence proposed as an extension of M-estimation (1) [15]. Since then, robust non-kernel divergences applicable to empirical inference have been developed.
The minimization of these divergences reduces to estimating equations built from the weighted (negative) score function $s(x,\theta)=\frac{\partial}{\partial\theta}l(x,\theta)$, where $l(x,\theta)=-\log p(x|\theta)$. The following two types of estimating equations are well known:
$\frac{1}{n}\sum_{i=1}^{n}\xi(l(x_i,\theta))\,s(x_i,\theta)=E_{p(x|\theta)}[\xi(l(X,\theta))\,s(X,\theta)]$, (2)
$\frac{\sum_{i=1}^{n}\xi(l(x_i,\theta))\,s(x_i,\theta)}{\sum_{j=1}^{n}\xi(l(x_j,\theta))}=\frac{E_{p(x|\theta)}[\xi(l(X,\theta))\,s(X,\theta)]}{E_{p(x|\theta)}[\xi(l(X,\theta))]}$, (3)
where ξ : R → R works as the weight function. These estimating equations are included in the M-estimation framework (1) by putting the function ψ as follows:
$\psi(x,\theta)=\xi(l(x,\theta))\,s(x,\theta)-E_{p(x|\theta)}[\xi(l(X,\theta))\,s(X,\theta)]$,
$\psi(x,\theta)=\xi(l(x,\theta))\,s(x,\theta)\,E_{p(x|\theta)}[\xi(l(X,\theta))]-\xi(l(x,\theta))\,E_{p(x|\theta)}[\xi(l(X,\theta))\,s(X,\theta)]$,
respectively [16]. Equation (2) is called the non-normalized estimating equation because the weights of the score functions do not sum to one. This estimating equation corresponds to the Bregman divergence, its special cases (β-divergence [15] and its generalizations [17], [18], [19]), and its variants (U-divergence [20], Ψ-divergence [21]). Equation (3) is the normalized estimating equation because the weights of the score functions sum to one. Windham [22] proposed an estimator using density power weight, which is equivalent to the solution to equation (3). Then, Jones et al. [23] constructed the corresponding divergence. This divergence, called γ-divergence, has the property that the latent bias can be minimized when the proportion of outliers is large, and the divergence with such a property is unique under some assumptions [24]. This property of the γ-divergence was extended to the normalized estimating equation (3) with the general weight ξ [16]. However, these approaches require bias correction terms, i.e., the right-hand sides of (2) and (3), which generally result in analytically intractable integrals. This is true for most practical models except for a few simple cases. For example, if the weight function ξ is a power function that corresponds to β- and γ-divergences, and the statistical model is a specific case within the exponential family, such as Gaussian, gamma, and inverse Gaussian distributions, the bias correction term can be calculated. However, for complex statistical models or general weight functions, analytical computation becomes intractable.
Recently, following the success of divergences using density power weight such as β- and γ-divergences, the extension, unification, and relationship of these divergences have been investigated. There are two mainstream research directions. The first direction is to extend the existing divergences within non-kernel divergences. Kanamori and Fujisawa [25] proposed Hölder divergence, which establishes invariance to the affine transformation of random variables based on the composite score. In their approach, the proportion of outliers can be estimated by considering the unnormalized density [26]. Kuchibhotla et al. [27] proposed the bridge density power divergence (BDPD), which is constructed by a convex combination of estimating equations (2) and (3) using the density power weight. They tried to deal with the spurious global solution problem that γ-divergence produces. Both Hölder divergence and BDPD include β- and γ-divergences as special cases. The γ-divergence is generated by the logarithmic transformation of each term of the β-divergence. Ray et al. [28] proposed the functional density power divergence (FDPD), which is generated by the general functional transformation of each term of the β-divergence. It contains BDPD [27] and Jones et al.'s general divergence [23]. To provide a better trade-off between robustness and efficiency, the expansion of β-divergence has been investigated within the Bregman divergence framework [17], [18], [19]. Notably, β-divergence, γ-divergence, Bregman divergence, and BDPD correspond to M-estimation (1), whereas Hölder divergence and FDPD do not necessarily correspond to it.
The second direction is to extend beyond the non-kernel divergence framework to the super family of divergences that include many existing divergences. Ghosh et al. [29] proposed super (S)-divergence, which generalizes α- and β-divergences and has two tuning parameters. The cases of continuous and discrete models have been investigated, and it has been reported that the regions outside α- and β-divergences show good performance [29], [30], [31]. This divergence has been further generalized [32]. Maji et al. [33] proposed a divergence obtained by a logarithmic transformation of each term of S-divergence, which includes Rényi- [34] and γ- [24] divergences, and called it logarithmic S-divergence (LSD). S-divergence and LSD correspond to the non-normalized and normalized estimating equations, respectively. However, these estimating equations cannot be expressed by the summation of independent and identically distributed data points as in (2) and (3). In estimators based on both S-divergence and LSD, the asymptotic variance depends on only one of the two tuning parameters. Maji et al. [35] proposed C-divergence, which is a very wide divergence class and includes f- [5] and generalized S- [32] divergences. In fact, this divergence was previously proposed by Vonta et al. [36] and used for testing. Basak and Basu [37] considered a new divergence by giving the argument of the Bregman divergence a power of the density function. It is called generalized S-Bregman divergence, which includes S- [29] and Bregman exponential [17] divergences and has three tuning parameters.
In this paper, we consider the M-estimation under f-separable distortion measures, which were proposed to extend linear distortion, such as the average distortion, to nonlinear distortion, and for which the rate-distortion function was studied [38]. It was also used to solve the estimation problem with Bregman divergence as the base distortion measure, and a simple clustering or vector quantization algorithm was constructed [39]. In this paper, this class of objective functions is called the f-separable Bregman distortion measure. Note that this distortion measure is defined by neither the f- nor the Bregman divergences between probability distributions mentioned above. It is defined by a function f, not necessarily convex, and the Bregman divergence between vector or scalar variables. The M-estimation under this distortion measure, as discussed in Section III, can be viewed as a deviance-based estimation of the regular exponential family model. The unbiasedness of the estimating equation of deviance-based methods has been studied, and some sufficient conditions for it have been obtained [40], [41]. However, these results only apply to the case where the data-generating distribution is included in the assumed model. On the other hand, the M-estimation of the location family is proved to have an unbiased estimating equation for general symmetric distributions [2]. It is unknown in what cases of f-separable Bregman distortion measures the estimating equation is unbiased for such a general class of distributions. If an estimating equation is unbiased, it can be regarded as normalized, and the estimator has the potential to minimize the latent bias, even if the proportion of outliers is large.
Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.
In this paper, we study the conditions for bias correction terms of f-separable Bregman distortion measures to vanish and characterize the combination of Bregman divergence, statistical model, and function f. Focusing on Mahalanobis and Itakura-Saito (IS) distances, we specify the conditions for the general model classes and the function f to achieve unbiased estimating equations. We also describe the properties of these general model classes and discuss the relationship between these models and the regular exponential family. Furthermore, we discuss whether the latent bias can be minimized when the proportion of outliers is large. We compare the M-estimation under the f-separable IS distortion measure with the estimation methods minimizing β- and γ-divergences, theoretically in terms of asymptotic efficiency and numerically through experiments examining latent bias under heavy contamination.

II. f -SEPARABLE BREGMAN DISTORTION MEASURES
This section introduces the estimation method based on f-separable Bregman distortion measures [39]. We consider estimating the parameter θ ∈ Θ ⊆ R^d of a statistical model p(x|θ) when given data $x_1,\dots,x_n\in\chi$ drawn from the data-generating distribution, where the parameter θ is the expected value of x under the model, i.e., $\theta=E[X]=\int x\,p(x|\theta)\,dx$, if it exists. The objective function is defined by $L_f(\theta)=\frac{1}{n}\sum_{i=1}^{n}f(d_\phi(x_i,\theta))$, (4) where f : R_+ → R is a differentiable, continuous, and monotonically increasing function, d_ϕ(x, θ) : χ × Θ → R_+ is the Bregman divergence, and R_+ is the set of nonnegative real numbers. The Bregman divergence is defined by a differentiable strictly convex function ϕ : χ → R as $d_\phi(x,\theta)=\phi(x)-\phi(\theta)-\langle x-\theta,\nabla\phi(\theta)\rangle$, (5) where ∇ϕ is its gradient vector and ⟨·, ·⟩ is the inner product. The estimating equation corresponding to the objective function (4) is given by $\frac{1}{n}\sum_{i=1}^{n}f'(d_\phi(x_i,\theta))\,\nabla\nabla\phi(\theta)\,(x_i-\theta)=0$, (6) where f′ is the derivative of f. This estimating equation is not unbiased in general.
The estimator $\hat{\theta}$ of the parameter θ* is given by the solution of the estimating equation (6). The property of the estimator depends on the function f. For example, if the function f is concave, the estimator is robust against outliers. The update rule of the estimator is given by $\theta\leftarrow\frac{\sum_{i=1}^{n}f'(d_\phi(x_i,\theta))\,x_i}{\sum_{j=1}^{n}f'(d_\phi(x_j,\theta))}$, (7) which gives the iterative algorithm. When the function f satisfies f(0) > −∞, the update rule converges in a finite number of iterations [39]. In other words, the estimator is one of the local minima of the objective function (4). The original f-separable distortion measures are defined by the f-mean with respect to some base distortion d [38]. From the viewpoint of the f-mean, representative examples are the log-sum-exp function and power mean, which are, respectively, given by the following functions: where α ∈ R, β ∈ R and a ∈ R_+. When α = 0 or β = 0, functions (8) and (9) are, respectively, given by the following continuous limits: If the tuning parameters satisfy α > 0 and β < 1, the estimators become robust. When α = 0 and β = 1, functions (8) and (9) become linear functions. The derivatives of functions (8) and (9) are, respectively, given by $f'(z)=\exp(-\alpha z)$ and $f'(z)=(z+a)^{\beta-1}$.
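The iterative update described above can be sketched in a few lines of Python. This is a minimal one-dimensional illustration, not the algorithm of [39]: the function names (`fit`, `d_squared`, `d_is`), the toy data, and the weight f′(z) = exp(−αz) with α = 1 are assumptions made here, and the loop simply repeats the weighted-mean fixed-point step until the estimate stops changing.

```python
import math

def d_squared(x, theta):
    """Squared distance: Bregman divergence of phi(x) = x**2 (assumed 1-D case)."""
    return (x - theta) ** 2

def d_is(x, theta):
    """Itakura-Saito distance: Bregman divergence of phi(x) = -log(x), x > 0."""
    r = x / theta
    return r - math.log(r) - 1.0

def fit(xs, divergence, f_prime, theta0, max_iter=500, tol=1e-12):
    """Fixed-point iteration: theta <- weighted mean with weights f'(d(x_i, theta))."""
    theta = theta0
    for _ in range(max_iter):
        weights = [f_prime(divergence(x, theta)) for x in xs]
        updated = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(updated - theta) <= tol:
            return updated
        theta = updated
    return theta

# Hypothetical toy data: five points near 1 and one gross outlier.
xs = [1.0, 1.2, 0.8, 1.1, 0.9, 50.0]

def robust_weight(z):
    # Assumed weight f'(z) = exp(-alpha * z) with alpha = 1.
    return math.exp(-1.0 * z)

mean_est = fit(xs, d_squared, lambda z: 1.0, theta0=1.0)   # linear f: sample mean
robust_sq = fit(xs, d_squared, robust_weight, theta0=1.0)  # squared distance
robust_is = fit(xs, d_is, robust_weight, theta0=1.0)       # IS distance
print(mean_est, robust_sq, robust_is)
```

With a constant weight the iteration returns the sample mean, which the outlier drags far from 1, whereas the exponentially decaying weight leaves the estimate close to the bulk of the data. The starting point matters because the objective may have several local minima.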

III. RELATION TO ROBUST DIVERGENCES
First, we show that the minimization of L_f(θ) is derived from deviance-based M-estimation of the expectation parameter under the regular exponential family, $p(x|\theta)=r_\phi(x)\exp(-d_\phi(x,\theta))$, (10) where r_ϕ(x) is uniquely determined by the strictly convex function ϕ and θ is the expectation parameter, i.e., θ = E[X] [42]. The deviance function [40] is defined by The objective function of the deviance-based M-estimation is defined by where ρ : R_+ → R. If the function ρ is differentiable, the corresponding estimating equation is given by where ρ′ is the derivative of ρ. From (11), the deviance function of the regular exponential family (10) is given by Thus, the objective function and estimating equation under the regular exponential family are, respectively, given by where $s(x,\theta)=\frac{\partial}{\partial\theta}d_\phi(x,\theta)$ is the negative score function. The objective functions of f-separable Bregman distortion measures (4) and deviance-based M-estimation (12) are related as follows: $f(z)=\rho(\sqrt{2z})$. Similarly, the estimating equations (6) and (13) have the following relationship Next, we turn to empirical inference based on robust divergences under the regular exponential family (10). Suppose for a moment that the bias correction term can be ignored. In this case, the non-normalized estimating equation (2) is given by Compared with this equation, the estimating equation (6) of f-separable Bregman distortion measures can be interpreted as a weighted score function. We focus on the arguments of the weight functions of (6) and (14). The only difference is the term $\inf_\theta l(x,\theta)=-\log r_\phi(x)$. Specifically, if the domain of the function f′ is extended to (−∞, ∞), the function f′ works identically to the weight function ξ. In view of this relation, function (8) associated with the log-sum-exp function yields the estimation methods that minimize β- and γ-divergences with the non-normalized and normalized estimating equations (2) and (3), respectively. In particular, if the function f is given by (8) and the squared distance is used to estimate the standard Gaussian location parameter, the estimating equation (6) reduces to which is the same estimating equation as that of β- and γ-divergences.
In other words, when we assume the regular exponential family and the function (8), the estimation is related to the estimation based on the power of the statistical model. In this section, we assumed that the bias correction term is exactly 0; however, this is not true in general. With the combination of the model and Bregman divergence discussed in the next section, the estimating equation (6) becomes unbiased without any bias correction term for any function f satisfying the condition given in the main theorems.

IV. CONDITIONS FOR UNBIASED ESTIMATING EQUATION
In general, the estimator based on f-separable Bregman distortion measures introduced in Section II does not satisfy consistency because its estimating equation is not necessarily unbiased. Thus, to obtain an unbiased estimating equation, we must subtract the bias correction term b_f(θ) from the objective function (4) as follows: where $\int\cdot\,d\theta$ denotes the indefinite integral with respect to θ. Differentiating (15) with respect to θ and setting it to 0, we obtain the following estimating equation, Multiplying both sides by the inverse matrix $(\nabla\nabla\phi(\theta))^{-1}$ yields the following non-normalized estimating equation: We can also consider the normalized estimating equation as follows:
Fujisawa [16] elucidated that this estimating equation can possibly minimize the latent bias, even for a large proportion of outliers. In both cases, it is necessary to calculate the integral for bias correction for each combination of statistical model, Bregman divergence, and function f. However, in many cases the integral may not exist or may be analytically intractable. In this paper, we discuss the following estimating equation: $\sum_{i=1}^{n}f'(d_\phi(x_i,\theta))\,(x_i-\theta)=0$. It means that the bias correction term is independent of the parameter θ. In other words, the following equation is satisfied, $E_{p(x|\theta)}[f'(d_\phi(X,\theta))\,(X-\theta)]=0$. (16) Then, this estimating equation is automatically normalized. Therefore, the estimator can potentially minimize the latent bias, even when there are many outliers. In the remainder of this section, we characterize the combination of the statistical model p(x|θ), Bregman divergence d_ϕ(x, θ), and function f for which the bias correction term vanishes. In what follows, the statistical model considered is generally not the regular exponential family.
In particular, we focus on Mahalanobis and IS distances. For estimating the location parameter of an elliptical distribution, it is known that the bias correction term vanishes and the estimator is consistent under certain conditions on the function f [2]. Moreover, it is known that the bias correction term vanishes for a log-gamma regression model. This is equivalent to the case where the IS distance is used and the model is the gamma distribution [41]. In this paper, we derive a simple condition on the function f that induces an unbiased estimating equation. Even if the weight function f′ is changed, the calculation of the bias correction term is unnecessary; only the simple conditions need to be checked. In particular, for the IS distance, the class of the model is extended to a more general class.

A. Mahalanobis Distance
Suppose the strictly convex function is given by $\phi(x)=x^{\mathsf{T}}Ax$, where A is a positive definite matrix. Then, the corresponding Bregman divergence is given by $d_{\mathrm{Mah.}}(x,\theta)=(x-\theta)^{\mathsf{T}}A(x-\theta)$, which is called the Mahalanobis distance. If the positive definite matrix A is the identity, it reduces to the squared distance, $\|x-\theta\|^{2}$. We assume that the statistical model is the elliptical distribution.
Definition 1 (Elliptical Distribution [43], [44, pp. 46-47]): For x ∈ R^d, the location parameter θ ∈ Θ = R^d, and a nonnegative function g : R_+ → R_+, let $C_{\mathrm{Mah.}}<\infty$ be the normalization constant, and let the positive definite matrix A be the inverse of a fixed covariance matrix. Then, the elliptical distribution is defined by the following probability density function, This distribution includes the Gaussian, Laplace, and t distributions, among others [44].
Assumption 1: There exists the elliptical distribution (17) corresponding to the nonnegative function g, i.e., $C_{\mathrm{Mah.}}<\infty$.
Theorem 1: Under Assumption 1, the estimating equation holds without a bias correction term, or equivalently (16) holds, if and only if the following condition is satisfied by the combination of the function f and the statistical model (17): The proof of Theorem 1 is in Appendix A. Although the unbiased estimating equation in this case is intuitively trivial because of the symmetry around θ and has been pointed out in the literature [2], the explicit condition for unbiasedness has never been discussed.

B. IS Distance
Suppose the strictly convex function is given by ϕ(x) = − log x. Then, the corresponding Bregman divergence is given by $d_{\mathrm{IS}}(x,\theta)=\frac{x}{\theta}-\log\frac{x}{\theta}-1$, (19) which is called the IS distance.
Definition 2 (IS Distribution): For x ∈ R_+ \ {0}, the scale parameter θ ∈ Θ = R_+ \ {0}, and a nonnegative function g : R_+ → R_+, we define the following probability density function with the normalization constant C_IS < ∞, $p(x|\theta)=\frac{1}{C_{\mathrm{IS}}}\frac{1}{x}\,g(d_{\mathrm{IS}}(x,\theta))$. (20) The normalization constant C_IS, which is independent of the parameter θ, is given by $C_{\mathrm{IS}}=\int_{0}^{\infty}\frac{1}{x}\,g(d_{\mathrm{IS}}(x,\theta))\,dx=\int_{0}^{\infty}\frac{1}{t}\,g(d_{\mathrm{IS}}(t,1))\,dt$, (21) where we used integration by substitution t = x/θ. When the expectation exists, the scale parameter coincides with the expectation. In particular, if g(z) = exp(−kz), the IS distribution reduces to the gamma distribution with the known shape parameter k > 0, $p(x|\theta)=\frac{1}{\Gamma(k)}\left(\frac{k}{\theta}\right)^{k}x^{k-1}\exp\left(-\frac{kx}{\theta}\right)$, where Γ(·) is the gamma function. Details of the IS distribution are described in Section V-A.
Assumption 2: There exists the IS distribution (20) corresponding to the nonnegative function g, i.e., $C_{\mathrm{IS}}=\int_{0}^{\infty}\frac{1}{x}\,g(d_{\mathrm{IS}}(x,1))\,dx<\infty$.
Theorem 2: Under Assumption 2, the estimating equation holds without a bias correction term, or equivalently (16) holds, if and only if the following condition is satisfied by the combination of the function f and the statistical model (20): $\int_{0}^{\infty}f'(t)\,g(t)\,dt<\infty$. (22)
Proof: From the left-hand side of (16), substituting the IS distance (19) and IS distribution (20), we have We used integration by substitution, t = d_IS(x, θ). Therefore, if the integral (22) exists, then (16) holds, i.e., the unbiased estimating equation holds without any bias correction term. Conversely, the above discussion also shows that This means that the condition (22) is also necessary. □
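The substitution used in the proof of Theorem 2 can be written out explicitly. The following is a sketch in our own notation, assuming the IS distance $d_{\mathrm{IS}}(x,\theta)=x/\theta-\log(x/\theta)-1$ and an IS density proportional to $\frac{1}{x}g(d_{\mathrm{IS}}(x,\theta))$. It is not a reproduction of the displayed equations of the proof, and the auxiliary symbols u and h are introduced here only for illustration.

```latex
\begin{align*}
E_{p(x|\theta)}\bigl[f'(d_{\mathrm{IS}}(X,\theta))(X-\theta)\bigr]
 &= \frac{1}{C_{\mathrm{IS}}}\int_{0}^{\infty}
    f'(d_{\mathrm{IS}}(x,\theta))\,g(d_{\mathrm{IS}}(x,\theta))\,\frac{x-\theta}{x}\,dx \\
 &= \frac{\theta}{C_{\mathrm{IS}}}\int_{0}^{\infty}
    f'(h(u))\,g(h(u))\,h'(u)\,du,
    \qquad u=\frac{x}{\theta},\quad h(u)=u-\log u-1,\quad h'(u)=\frac{u-1}{u} \\
 &= \frac{\theta}{C_{\mathrm{IS}}}\left[
    \int_{0}^{1} f'(h(u))\,g(h(u))\,h'(u)\,du
    +\int_{1}^{\infty} f'(h(u))\,g(h(u))\,h'(u)\,du\right] \\
 &= \frac{\theta}{C_{\mathrm{IS}}}\left[
    -\int_{0}^{\infty} f'(t)\,g(t)\,dt+\int_{0}^{\infty} f'(t)\,g(t)\,dt\right]=0 .
\end{align*}
```

On (0, 1) the map h decreases from ∞ to 0, and on (1, ∞) it increases from 0 to ∞, so the substitution t = h(u) produces the same integral twice with opposite signs. The cancellation is legitimate only when $\int_{0}^{\infty}f'(t)g(t)\,dt$ is finite, which is where the integrability condition enters.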
1) Example: Gamma Distribution: In the case of the function (8) and the gamma distribution with the known shape parameter k > 0, i.e., g(z) = exp(−kz), the integral in equation (22) is given by $\int_{0}^{\infty}\exp(-\alpha t)\exp(-kt)\,dt=\frac{1}{\alpha+k}$. Therefore, the condition α > −k must be satisfied for the integral to be bounded. In other words, the lower limit of α that satisfies the unbiased estimating equation differs for each shape parameter k. Since k > 0, we can see that the condition of Theorem 2 is satisfied if α > 0, for which the estimator is robust against outliers.
For the function (9) and the gamma distribution with the known shape parameter k > 0, the integral in equation (22) becomes $\int_{0}^{\infty}(t+a)^{\beta-1}\exp(-kt)\,dt$. When a > 0, condition (22) of Theorem 2 holds for all β < ∞. Moreover, when a = 0, condition (22) holds for 0 < β < ∞. However, it does not hold for β ≤ 0.
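The gamma example can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original experiments: it takes the weight f′(z) = exp(−αz), a gamma model with shape k and mean θ, and a plain midpoint rule (the helper `expect` and all parameter values are hypothetical). It compares the bias term E[f′(d(X, θ))(X − θ)] under the IS distance, where it should vanish, with the same quantity under the squared distance, where no such cancellation is expected.

```python
import math

def d_is(x, theta):
    """Itakura-Saito distance d_IS(x, theta) = x/theta - log(x/theta) - 1."""
    r = x / theta
    return r - math.log(r) - 1.0

def gamma_pdf(x, theta, k):
    """Gamma density with known shape k and mean theta (scale theta / k)."""
    log_p = (k * math.log(k / theta) + (k - 1.0) * math.log(x)
             - k * x / theta - math.lgamma(k))
    return math.exp(log_p)

def expect(h, theta, k, upper=60.0, n=200_000):
    """Midpoint-rule approximation of E[h(X)] over (0, upper * theta)."""
    step = upper * theta / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * step
        total += h(x) * gamma_pdf(x, theta, k)
    return total * step

theta, k, alpha = 2.0, 3.0, 0.5  # hypothetical values with alpha > -k

# Bias term when the divergence matches the model (IS distance, gamma model).
bias_is = expect(lambda x: math.exp(-alpha * d_is(x, theta)) * (x - theta), theta, k)

# Bias term when it does not (squared distance, same gamma model).
bias_sq = expect(lambda x: math.exp(-alpha * (x - theta) ** 2) * (x - theta), theta, k)

# Value of the integral in the integrability condition for this weight.
condition_integral = 1.0 / (alpha + k)

print(bias_is, bias_sq, condition_integral)
```

Under these assumptions the IS-based bias term is zero up to quadrature error, while the squared-distance version is clearly negative because the gamma density is skewed around its mean. This is the sense in which the matching of divergence and model removes the need for a bias correction term.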

C. Other Bregman Divergence
The conditions of Theorems 1 and 2 are the same in the one-dimensional case. A common point is that the statistical model is expressed by the Bregman divergence used for estimation. Hence, the results of Theorems 1 and 2 can be extended to a wider class of continuous distributions written in terms of the Bregman divergence. We assume the following statistical model, which is defined by a one-dimensional Bregman divergence.
Definition 3 (Continuous Bregman Distribution): For x ∈ (a, b) ⊆ R, the parameter θ ∈ Θ = (a, b) ⊆ R, and the function g : R_+ → R_+, we define the following probability density function with the normalization constant satisfying
2) The Bregman divergence used for estimation corresponds to that of the model (23).
Theorem 3: Under Assumptions 3 and 4, the estimating equation holds without a bias correction term, or equivalently (16) holds, if and only if the following condition is satisfied by the combination of the function f and the statistical model (23):
Proof: From the left-hand side of equation (16), substituting the one-dimensional Bregman divergence (5) and the continuous Bregman distribution (23), we have We used integration by substitution as t = d_ϕ(x, θ) and (24). Therefore, if the integral (25) exists, then (16) holds, i.e., the unbiased estimating equation holds without any bias correction term. Conversely, the above discussion also shows that This means that the condition (25) is also necessary. □
Note that Assumption 3 must be satisfied for the integration by substitution to imply the unbiased estimating equation.
The elliptical and IS distributions are rare examples with unbiased estimating equations for the corresponding f -separable Bregman distortion measures and include the corresponding regular exponential family models.
Remark 1: In this section, we derived the condition under which the estimating equation holds without the bias correction term. Generally, the bias correction term does not vanish. We showed rare examples, namely the Mahalanobis, IS, and one-dimensional Bregman divergences, for which the bias correction term vanishes. We emphasize that conditions (18), (22), and (25) of the theorems are easy to check. For example, when the statistical model is the gamma distribution, i.e., the function g is exp(−kz), we immediately see that the condition is satisfied when the function f′ is a polynomial. Thus, the conditions of the theorems can narrow the range within which the function f or f′ can be chosen.
Remark 2: This paper focuses on the unbiasedness of the estimating equations, which is a necessary condition for the consistency of estimators. On the other hand, even when the estimating equations are not unbiased, the generalization performance may be good due to the trade-off between bias and variance. For example, the Lq-likelihood estimator corresponds to the case where the bias correction term is dropped from the β-divergence [45]. Nevertheless, for small samples, trading bias for variance has been shown to improve the generalization performance. For f-separable Bregman distortion measures, the generalization performance when the estimating equations are not unbiased is unknown and is a subject for future work.

V. DETAILS OF STATISTICAL MODELS
In Section IV, we clarified conditions that consist of the Bregman divergence, statistical model, and the class of function f for the unbiased estimating equation to hold without the bias correction term. We have newly defined the IS distribution and its generalization, the continuous Bregman distribution. In this section, we describe the properties of these distributions. We also discuss the relationship between the continuous Bregman distribution and the regular exponential family.
A. IS Distribution
1) Expected Value and Normalization Constant C_IS:
Lemma 1: Under Assumption 2, the following relation holds with respect to the expected value E[X] and the function g which composes the IS distribution,
Proof: This relation immediately holds from Theorem 2 by substituting f′(z) = 1. □
Lemma 2: Under Assumption 2, the following relation holds with respect to the expected value E[X] and the normalization constant C_IS of the IS distribution,
Proof: We assume that the expected value E[X] exists and is θ. From the definition and finiteness of the expected value E[X], we have Here, the normalization constant C_IS must satisfy Therefore, we have Conversely, when we assume (27), we have Therefore, (26) holds. □
Theorem 4: Under Assumption 2, the following relation holds with respect to the expected value E[X], the normalization constant C_IS, and the function g which composes the IS distribution,
Proof: Theorem 4 immediately holds from Lemma 1 and Lemma 2. □
Lemma 1 shows that when the function g satisfies $\int_{0}^{\infty}g(t)\,dt<\infty$, meaning that g ∈ L^1(R_+), if the IS distribution exists, its expected value E[X] is θ. In other words, the existence of the expected value depends only on the function g. This property holds in the general continuous Bregman distribution described in Section V-B (Corollary 1).
Lemma 2 means that the normalization constant C_IS can be expressed in a form other than (21). Then, it holds that The integrand of the normalization constant (27) does not have the factor $\frac{1}{x}$, which diverges to infinity as x goes to 0. This is an advantage when calculating the normalization constant numerically. However, the relation (26) between the normalization constant and expected value does not hold for the continuous Bregman distribution. It is a property of the scale family, as discussed in Appendix B (Theorem 7).
Remark 3: Lemma 1 is obtained from the property of the continuous Bregman distribution. Similarly, Lemma 2 is obtained from the property of the scale family. The IS distribution belongs to both the continuous Bregman distribution and the scale family. Therefore, Theorem 4 is obtained.
2) Examples of the IS Distribution:
a) Gamma distribution: When we choose the function g(z) = exp(−kz), the IS distribution becomes the gamma distribution with the known shape parameter k > 0. Then, the normalization constant C_IS is given by $C_{\mathrm{IS}}=\int_{0}^{\infty}\frac{1}{x}\exp(-k\,d_{\mathrm{IS}}(x,1))\,dx=\frac{\exp(k)\,\Gamma(k)}{k^{k}}$. Therefore, the gamma distribution is obtained: $p(x|\theta)=\frac{1}{\Gamma(k)}\left(\frac{k}{\theta}\right)^{k}x^{k-1}\exp\left(-\frac{kx}{\theta}\right)$. (28) The gamma distribution is also expressed as $p(x|\beta,k)=\frac{1}{\Gamma(k)\,\beta^{k}}\,x^{k-1}\exp\left(-\frac{x}{\beta}\right)$. The parameters β and k are called scale and shape parameters, respectively. This model corresponds to (28) by the transformation with respect to the parameter θ = kβ or the change of random variable Y = kX. It is worth noting that the parameter θ is also the scale parameter and expected value.
b) Mixture distribution: Let g_1(z) : R_+ → R_+ and g_2(z) : R_+ → R_+ be nonnegative functions. We define the function g by a convex combination, i.e., g(z) = b g_1(z) + (1 − b) g_2(z), where the coefficient satisfies 0 ≤ b ≤ 1. Then, the IS distribution with respect to the function g is given by the convex combination of IS distributions with the same parameter θ: $p(x|\theta)=w\,\frac{1}{C_{\mathrm{IS},1}}\frac{1}{x}\,g_{1}(d_{\mathrm{IS}}(x,\theta))+(1-w)\,\frac{1}{C_{\mathrm{IS},2}}\frac{1}{x}\,g_{2}(d_{\mathrm{IS}}(x,\theta))$, where $C_{\mathrm{IS},j}=\int_{0}^{\infty}\frac{1}{x}\,g_{j}(d_{\mathrm{IS}}(x,1))\,dx$ (j = 1, 2) are the normalization constants of the component distributions, respectively, and $w=\frac{b\,C_{\mathrm{IS},1}}{b\,C_{\mathrm{IS},1}+(1-b)\,C_{\mathrm{IS},2}}$ is the mixing coefficient. Similarly, if g is a convex combination of three or more functions, the IS mixture is generated. If g_1(z) and g_2(z) are given by exp(−k_1 z) and exp(−k_2 z), respectively, the IS mixture reduces to the gamma mixture with the same parameter θ, where k_1 and k_2 are positive.
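The two expressions for the normalization constant in the gamma case can be compared numerically. The following sketch assumes g(z) = exp(−kz) with a hypothetical shape k = 3 and uses a simple midpoint rule (the helper `integrate` is ours). It evaluates the integral with the factor 1/x, the integral without it that Lemma 2 refers to, and the closed form exp(k)Γ(k)/k^k that follows from the gamma function integral.

```python
import math

def d_is(x, theta=1.0):
    """Itakura-Saito distance d_IS(x, theta) = x/theta - log(x/theta) - 1."""
    r = x / theta
    return r - math.log(r) - 1.0

def integrate(h, lower, upper, n=200_000):
    """Midpoint rule on (lower, upper); never evaluates h at the endpoints."""
    step = (upper - lower) / n
    return step * sum(h(lower + (i + 0.5) * step) for i in range(n))

k = 3.0  # hypothetical known shape parameter

def g(z):
    # Choice of g that yields the gamma distribution.
    return math.exp(-k * z)

# Normalization constant with the factor 1/x.
c_with_factor = integrate(lambda x: g(d_is(x)) / x, 0.0, 60.0)

# Alternative form without the factor 1/x, valid when E[X] = theta.
c_without_factor = integrate(lambda x: g(d_is(x)), 0.0, 60.0)

# Closed form obtained from the gamma function integral.
c_closed_form = math.exp(k) * math.gamma(k) / k ** k

print(c_with_factor, c_without_factor, c_closed_form)
```

All three values agree to quadrature accuracy for this choice of g. For functions g that decay slowly near x = 0, the form without the factor 1/x would be the better-behaved integrand, which is the numerical advantage mentioned above.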

B. Continuous Bregman Distribution
b) IS distribution: We set ϕ(x) = − log x. Then, (23) becomes the IS distribution as follows: $p(x|\theta)=\frac{1}{C_{\mathrm{IS}}}\frac{1}{x}\,g(d_{\mathrm{IS}}(x,\theta))$. We explained this distribution in Section V-A.
c) Mixture distribution: As in the case of the IS mixture, we define the function g by a convex combination, i.e., g(z) = b g_1(z) + (1 − b) g_2(z). Then, we obtain the following continuous Bregman mixture: where all component distributions have the same parameter θ and depend on the same strictly convex function ϕ. Thus, the supports of the component distributions are all the same. Here, C_{ϕ,1}(θ) and C_{ϕ,2}(θ) are the normalization constants of the component distributions. Similarly, for a convex combination of three or more functions g, the continuous Bregman mixture can be generated.

C. Relation to Regular Exponential Family
We consider the relationship between the continuous Bregman distribution (23) and the regular exponential family (10).
2) For all x, the factor x − θ does not depend on the parameter θ.
Proposition 1: Under Assumption 5, the continuous Bregman distribution becomes the regular exponential family as follows: where r_ϕ(x) is uniquely determined by the strictly convex function ϕ [42], i.e., x − θ .
Figure 1 shows the relationship between the continuous Bregman distribution and the regular exponential family. The intersection between the continuous Bregman distribution and the regular exponential family includes the Gaussian and gamma distributions. Condition 2 of Assumption 5 is restrictive: even if condition 1 of Assumption 5 holds and the continuous Bregman distribution corresponding to the strictly convex function ϕ exists, it is not necessarily a regular exponential family.

VI. LATENT BIAS
In this section, we discuss the possibility of the latent bias minimization for many outliers, which is induced by the vanishing bias correction term.The estimating equation of the f -separable Bregman distortion measures reduces to the normalized estimating equation.From the viewpoint of the normalized estimating equation, the condition of latent bias minimization was shown as a theorem [16].We can apply this theorem.From the viewpoint of the objective function, we see that the definitions of outliers are different for f -separable distortion measures and γ-divergence.As a byproduct, we obtain a solution to a drawback of γ-divergence in the case of the exponential model.In what follows, we assume that the function f is twice differentiable.

A. Contaminated Distribution
We assume that the data-generating distribution is given as follows:
$$(1-\varepsilon)\,p(x|\theta^{*}) + \varepsilon\,c(x), \tag{29}$$
where p(x|θ*) is the target distribution, c(x) is the contamination distribution that generates outliers, and ε is the proportion of outliers. Suppose that the estimator computed from data generated from this distribution converges in probability to a limiting value. The difference between this limit and θ* is called the latent bias, which expresses the bias caused by the contamination distribution [16].
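Sampling from a contaminated distribution of this form amounts to drawing each point from the contamination distribution with probability ε and from the target distribution otherwise. The sketch below is illustrative only: the function name, the seed handling, and the choice of an exponential target with Gaussian outliers (mirroring the later experiments) are assumptions made for the example.

```python
import random


def sample_contaminated(n, eps, draw_target, draw_outlier, seed=0):
    """Draw n points from (1 - eps) * target + eps * contamination.

    draw_target and draw_outlier take a random.Random instance and return one sample.
    Returns the samples and a parallel list of outlier flags.
    """
    rng = random.Random(seed)
    data, is_outlier = [], []
    for _ in range(n):
        flag = rng.random() < eps  # True with probability eps
        data.append(draw_outlier(rng) if flag else draw_target(rng))
        is_outlier.append(flag)
    return data, is_outlier


if __name__ == "__main__":
    # Assumed example: exponential target with mean 1, Gaussian outliers around 20.
    data, flags = sample_contaminated(
        n=100,
        eps=0.4,
        draw_target=lambda rng: rng.expovariate(1.0),
        draw_outlier=lambda rng: rng.gauss(20.0, 1.0),
    )
    print(sum(flags), "outliers out of", len(data))
```

Keeping the outlier flags is only a convenience for inspecting simulations; an estimator would of course see the data alone.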

B. Definition of Outliers: γ-Divergence
In the estimation based on γ-divergence, the assumption regarding outliers is that the quantity (30) can be made arbitrarily small for an appropriately large γ_0 > 0. This assumption means that outliers are distributed over the region where the likelihood under the target distribution p(x|θ*) is small. Since nothing is assumed about the outlier proportion, the case where the proportion of outliers is large can also be handled. However, Kuchibhotla et al. [27] reported the following two disadvantages. First, the γ-divergence is adversely affected by data at the edge of the support of the target model, such as a location-scale family. For example, in estimating the scale parameter of the exponential distribution, a wrong global solution is generated when a very small inlier around x = 0, such as x = 10^{−4}, is mixed [23]. Here, an inlier means a data point near zero [1, p. 140]. Second, the estimator that can achieve the latent bias minimization is a local solution, and solution selection criteria have yet to be established. Kuchibhotla et al. [27] proposed a remedy for these problems; however, the problems are not fully resolved.

C. Definition of Outliers: f -Separable Bregman Distortion Measures
In the estimation based on f-separable Bregman distortion measures, the assumption regarding outliers is that the quantity (31) can be made arbitrarily small for an appropriate function f. In other words, if the function f is parameterized by a parameter α, (31) can be arbitrarily reduced for an appropriately large parameter α. This assumption corresponds to the assumption (30) of γ-divergence. It means that when the random variable follows the contamination distribution, i.e., X ∼ c(x), an outlier lies in the region where d_ϕ(X, θ*) → ∞ is satisfied. When estimating the location parameter of the elliptical distribution using the Mahalanobis distance, the definition of an outlier is the same as in (30), i.e., x with ∥x∥ → ∞ is regarded as an outlier. However, when estimating the scale parameter of the IS distribution using the IS distance, the definition of an outlier is not the same as in (30). In this case, data near 0 or ∞ are regarded as outliers, because the IS distance diverges both as x → 0 and as x → ∞. In other words, the estimator based on f-separable IS distortion measures is robust against large outliers and against the very small inliers to which γ-divergence is vulnerable.
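The two-sided nature of outliers under the IS distance can be seen by evaluating the distance at very small and very large points. The sketch below assumes d_IS(x, θ) = x/θ − log(x/θ) − 1 and, purely for illustration, an exponential weight f′(z) = exp(−αz); whether this weight coincides with the function (8) is an assumption, and the function names are hypothetical.

```python
import math


def d_is(x, theta):
    """Itakura-Saito distance between positive reals x and theta (assumed form)."""
    r = x / theta
    return r - math.log(r) - 1.0


def weight(x, theta, alpha):
    """Illustrative weight f'(d_IS(x, theta)) with the assumed choice f'(z) = exp(-alpha z)."""
    return math.exp(-alpha * d_is(x, theta))


if __name__ == "__main__":
    theta_star, alpha = 1.0, 1.0
    # A tiny inlier, a typical point, and a large outlier.
    for x in (1e-4, 1.0, 20.0):
        print(x, d_is(x, theta_star), weight(x, theta_star, alpha))
```

Both the point near zero and the large point receive a large distance and hence a small weight, whereas a point at x = θ has distance zero and weight one.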

D. Necessary Condition for Latent Bias Minimization
We express the estimating equation with the data-generating distribution p(x) as follows: The solution to this estimating equation is given by θ.
Proof: Under Assumption 6, the matrix ∂ψ_{p_{θ*}}(θ)/∂θ^T |_{θ=θ*} is positive definite, where E denotes the identity matrix in its explicit expression. The positive definiteness of this matrix and the condition (32) ensure that all the assumptions of [16, Theorem 3.1] are satisfied. Hence, the assertions of the theorem directly follow from those of [16, Theorem 3.1]. □
These assertions of the theorem mean that the latent bias can be made arbitrarily small and that Fisher consistency holds. The positive definiteness of ∂ψ_{p_{θ*}}(θ)/∂θ^T |_{θ=θ*} in the proof is necessary for the implicit function theorem. It is difficult to confirm the condition (32) of Theorem 5 in practice. Therefore, we consider condition (3.1) of [16], which corresponds to the condition (32) of Theorem 5. In our case, condition (3.1) of [16] reduces to the requirement that ∥ψ_c(θ*)∥, given in (33), can be made arbitrarily small. By continuity, if ∥ψ_c(θ*)∥ can be made arbitrarily small, then ∥ψ_c(θ)∥ can also be made arbitrarily small in the vicinity of θ = θ*. Since ψ^{−1}_{p_{θ*}}(τ) is contained in the vicinity of θ = θ* for a sufficiently small η > 0, (32) is implied by the fact that ∥ψ_c(θ*)∥ can be made arbitrarily small. If the quantity (31) can be made arbitrarily small under an appropriate function f, then (33) can also be made arbitrarily small. However, the converse is not generally true.
If the contamination distribution c(x) has a point mass at ∥x∥ → ∞, (33) can be rewritten as the limit (34). For the limit (34) to be arbitrarily small, a condition on the function f is required, which is also a necessary condition for the influence function to be bounded [39]. When (34) equals 0, the influence function has a desirable property called the redescending property. In other words, when the redescending property holds, the influence of sufficiently large outliers is ignored. For functions (8) and (9), sufficient conditions for the redescending property were investigated [39].
Remark 4: The estimator that can achieve the latent bias minimization is one of the local solutions of (4). Thus, a solution selection problem arises, which amounts to selecting the initial value of the iterative update rule (7).

E. Strategy for Initial Value Selection
The estimator that can achieve the latent bias minimization is a local minimum solution given by a fixed point of the iterative algorithm. Thus, we need a strategy for initial value selection. We have already assumed that (33) is sufficiently small, which means that the contamination distribution is far from the target distribution. Furthermore, we consider that the proportion of the target distribution 1 − ε is larger than the proportion of the contamination distribution ε. In other words, we assume ε < 0.5. We apply K-means clustering with two clusters to the dataset and roughly separate it into the data generated from the target and contamination distributions. The initial values of the cluster centers are set at the minimum and maximum values of the data. If the target and contamination distributions are ideally separated, an initial value near the true value can be expected.
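The clustering step can be sketched for one-dimensional data as follows. This is a minimal illustration, not the exact procedure used in the experiments: the text does not state how the initial value is computed from the clusters, so taking the mean of the larger cluster (justified by ε < 0.5 and by θ being the expected value of the target) is an assumption, and the function names are hypothetical.

```python
def two_means_1d(data, max_iter=100):
    """Lloyd's algorithm with K = 2 on the real line, centers started at min and max."""
    lo, hi = min(data), max(data)
    low, high = list(data), []
    for _ in range(max_iter):
        # Assign each point to the nearer center (ties go to the lower center).
        low = [x for x in data if abs(x - lo) <= abs(x - hi)]
        high = [x for x in data if abs(x - lo) > abs(x - hi)]
        if not high:  # degenerate case: all points coincide
            break
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if new_lo == lo and new_hi == hi:  # assignments have stabilized
            break
        lo, hi = new_lo, new_hi
    return low, high


def initial_scale(data):
    """Assumed rule: mean of the larger cluster, treated as the target component."""
    low, high = two_means_1d(data)
    major = low if len(low) >= len(high) else high
    return sum(major) / len(major)


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    sample = [rng.expovariate(1.0) for _ in range(80)]
    sample += [rng.gauss(20.0, 1.0) for _ in range(20)]
    print(initial_scale(sample))
```

When the two components are well separated, the larger cluster consists almost entirely of target data, so its mean is a reasonable starting point for the iterative update.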

VII. ASYMPTOTIC PROPERTY
The estimation based on f-separable Bregman distortion measures, which satisfies the unbiasedness of the estimating equation, can be interpreted as an M-estimation (1). Therefore, under appropriate assumptions, the consistency and asymptotic normality of the estimator follow from the asymptotic theory of M-estimation [1], [2], [46], with the asymptotic variance denoted by Σ(θ*). If the data are generated from the distribution (29), the asymptotic variance is given by Σ(θ*)/(1 − ε) [16, Theorem 4.2].

A. Gamma Distribution
We assume that the statistical model is the gamma distribution (28), the function f is (8), and the Bregman divergence is the IS distance (19). Then, the asymptotic variance of the estimator is available in closed form, provided that the tuning parameter satisfies α > −0.5k. For the exponential distribution (k = 1), we can compare the asymptotic relative efficiencies (AREs) of the estimators based on minimizing the f-separable IS distortion measures and the β- and γ-divergences. The ARE is given by
$$\mathrm{ARE} = \frac{V[\hat{\theta}_{\mathrm{MLE}}]}{V[\hat{\theta}]},$$
where V[θ̂] is the asymptotic variance of the estimator under consideration and V[θ̂_MLE] is the asymptotic variance of the maximum likelihood estimator (α = 0). The asymptotic variances of the estimators based on the β- and γ-divergences were derived for the exponential distribution [15], [23]. Figure 2 shows their AREs when the tuning parameters are set as α = β = γ. The range α = β = γ > 0 of the tuning parameter induces robustness against outliers. As shown in Figure 2, for the function (8) and the IS distance, the ARE is generally greater than that of the β-divergence in the range α < 2 of the tuning parameter. The ARE is also greater than that of the γ-divergence over the entire range of the tuning parameter. However, in general, the ARE and robustness have a trade-off relationship. Therefore, it is essential to choose the tuning parameter appropriately, taking both into account.
We show the behavior of the estimator with respect to the number of data points through numerical examples, in which the results were averaged over 10,000 trials. The number of data points is n = 30, 50, and from 100 to 1000 in steps of 100. Figure 3 shows the bias and mean squared error (MSE) (log-log plot) of the estimator based on the f-separable IS distortion measure when the data are generated from the exponential distribution (θ = 1). In addition, Figure 4 shows the bias and MSE (log-log plot) of the estimator when the distribution is contaminated by a Gaussian distribution with a proportion of contamination ε = 0.4. Figure 3 shows that both bias and MSE converge to zero with the same order of convergence regardless of the value of the tuning parameter. However, for a fixed number of data points, the bias and MSE reach a minimum at α = 0 (MLE) and increase monotonically as α increases. Figure 4 shows that, when the proportion of contamination is large, bias and MSE both converge to larger values for α = 0.25 than for α = 0, and converge to 0 for α = 0.5 and above. The phenomenon of bias and MSE taking larger values for small values of the tuning parameter than for α = 0 is discussed in Section VIII-B1 with a fixed number of data points. When the proportion of contamination is small or the contamination distribution is farther from the target distribution, the values of bias and MSE converge to zero even when the tuning parameter is small. This shows that if the data-generating distribution contains the target distribution as a component, the estimator asymptotically approaches the true value of the target distribution, regardless of whether the data are contaminated or not.

VIII. NUMERICAL EXPERIMENTS
This section discusses the results of experiments conducted to demonstrate the latent bias minimization by f -separable Bregman distortion measures under heavy contamination.Generally, in the location parameter estimation, the bias correction term vanishes, and the estimating equation is normalized.In any case, the latent bias can be minimized under heavy contamination.However, in scale parameter estimation, the latent bias minimization is difficult under heavy contamination.Thus, we focus on scale parameter estimation using f -separable IS distortion measures.

A. Setup
We use the function (8).When the target is the exponential distribution and the Bregman divergence is the IS distance, Assumption 6 holds for α ≥ 0 in (8).We consider condition (32) of Theorem 5.If the contamination distribution is sufficiently far away from the target distribution, the integrand in (33) approaches 0 exponentially.Therefore, condition (32) of Theorem 5 holds for sufficiently large α.
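For a scalar parameter, setting the derivative of the f-separable IS objective to zero gives a weighted-mean condition, which suggests a simple fixed-point iteration. The sketch below is a hedged illustration rather than the update rule (7) itself: it assumes d_IS(x, θ) = x/θ − log(x/θ) − 1 and the weight f′(z) = exp(−αz) as a stand-in for the function (8), and the names `d_is` and `fit_scale` are hypothetical.

```python
import math


def d_is(x, theta):
    """Itakura-Saito distance between positive reals x and theta (assumed form)."""
    r = x / theta
    return r - math.log(r) - 1.0


def fit_scale(data, alpha, theta0, max_iter=1000, tol=1e-12):
    """Fixed-point iteration theta <- sum(w_i x_i) / sum(w_i), w_i = exp(-alpha d_IS(x_i, theta)).

    All data must be strictly positive. With alpha = 0 every weight is 1 and the
    result is the sample mean, i.e. the MLE of the exponential scale parameter.
    """
    theta = theta0
    for _ in range(max_iter):
        weights = [math.exp(-alpha * d_is(x, theta)) for x in data]
        new_theta = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_theta - theta) <= tol * theta:
            return new_theta
        theta = new_theta
    return theta


if __name__ == "__main__":
    import random

    rng = random.Random(0)
    data = [rng.expovariate(1.0) for _ in range(80)]
    data += [rng.gauss(20.0, 1.0) for _ in range(20)]
    print("alpha = 0:", fit_scale(data, 0.0, 1.0))
    print("alpha = 1:", fit_scale(data, 1.0, 1.0))
```

The fixed point reached depends on the starting value `theta0`, which is why the initial value selection matters; in this example the starting value is simply set to 1 for illustration.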
Competitors are the estimation methods based on the β- and γ-divergences, which include tuning parameters β and γ, respectively. These divergences have weight functions ξ that are power functions. Estimation based on the β-divergence is expected to have a non-zero latent bias because the β-divergence corresponds to a non-normalized estimating equation. If α = β = γ = 0, the estimation methods reduce to the exact MLE under the assumed model. When the tuning parameters are sufficiently large, the estimation methods are robust against outliers. For each estimation method, an iterative method is used, starting from an initial value of the parameter θ. The fixed point of the iterative method is treated as the estimator of θ. In the cases of the β- and γ-divergences, we set the initial value of the parameter θ to the true value to investigate the behavior of the solution near the true value. For the estimation based on f-separable distortion measures, we obtain the initial value by the method of Section VI-E. This setting favors the estimation based on the β- and γ-divergences, because the initial value is not always close to the true value in the f-separable Bregman distortion measure-based estimation. The parameter θ is estimated from 100 data samples. The proportion of outliers is ε ∈ {0.1, 0.2, 0.3, 0.4}. The reported results are averaged over 100 trials. We considered the following situations.
1) Exponential Distribution With Outlier Contamination: In this experiment, we investigate the behavior of the latent bias under significant outlier contamination. The data-generating distribution is a mixture of the target exponential distribution and a contamination distribution, where the location parameter of the contamination distribution is µ_out ∈ {10, 20, 30}.
2) Exponential Distribution With Inlier Contamination: The aim of this experiment is to investigate the behavior under small inlier contamination, for which it was reported that the estimation based on minimizing the γ-divergence generates a spurious global solution [23], [27]. The data-generating distribution is a mixture of the target exponential distribution and a point mass, where δ is the Dirac delta function.
3) Gamma Distribution With Outlier Contamination: We investigate the behavior of the latent bias in the gamma distribution under outlier contamination, both when the shape parameter k is greater than one and when it is less than one. The data-generating distribution is a mixture of the target gamma distribution and a contamination distribution, where the shape parameter is k ∈ {0.5, 2}.
4) Gaussian Distribution Estimation Under Outlier Contamination: In this experiment, we apply the estimation based on minimizing the f-separable IS distortion measures to the variance estimation problem of the Gaussian distribution. However, we need the location parameter to estimate the variance. Therefore, we estimate the location parameter µ by minimizing the f-separable squared distortion measures, where the function f is (8). In other words, we estimate the location and variance parameters simultaneously. The location and variance parameters each have an update rule; the update rule of the location parameter is (35).
The update rule (35) of the location parameter is the same as those based on minimizing the β- and γ-divergences. In other words, any difference among the estimation methods stems from the variance estimation. The update rules and the objective functions of the estimation based on minimizing the β- and γ-divergences are given in Appendices C-A3 and C-B3. The data-generating distribution is given as follows:
$$(1-\varepsilon)\,N(\mu^{*} = 0,\ \sigma^{*2} = 1) + \varepsilon\,N(\mu_{\mathrm{out}} = 5,\ \sigma^{2}_{\mathrm{out}} = 1).$$

B. Results
1) Exponential Distribution With Outlier Contamination: First, we discuss the influence of the proportion of outliers when the location parameter of the contamination is 20 (Figure 5). For the f-separable IS distortion measure, the bias goes to zero regardless of the proportion of outliers, provided that the tuning parameter α is set to a large value. However, when the proportion of outliers is greater than or equal to 0.3, the bias first increases and then approaches 0 as the tuning parameter α increases. To find the cause, we investigated the shape of the objective function. When α = 0, the estimation method is the MLE of the exponential distribution, and the solution of the objective function is unique. Since the objective function changes continuously as the parameter α increases, the solution of the objective function remains unique when α is small. The unique solution moves in the direction of either the target or the contamination distribution as α increases. We observed that the direction in which the unique solution moves depends on the proportion of outliers. For ε = 0.1 and 0.2, the unique solution moves in the direction of the target distribution. However, for ε = 0.3 and 0.4, the unique solution moves in the direction of the contamination distribution.
Next, we discuss the influence of the location parameter of the contamination distribution. Figure 6 shows the results of the f-separable IS distortion measure when the location parameter µ of the contamination distribution is 10, 20, and 30, and the proportion of outliers is 0.2. The farther the contamination distribution that generates the outliers is from the target distribution to be estimated, the smaller the minimum value of α at which the bias reaches zero. It is worth noting that the curve in the figure changes continuously only when the location parameter of the contamination distribution is 10. This is because when the target and contamination distributions are close to each other, the unique solution for sufficiently small α moves toward the true parameter of the target distribution as α increases.
2) Exponential Distribution With Inlier Contamination: Figure 7 shows the results of the inlier contamination experiment. Only the f-separable IS distortion measure achieves a bias going to zero, whereas neither the β- nor the γ-divergence does. This is because, in the objective function of the β- or γ-divergence, the solution near the true value moves toward zero as β or γ increases. In the case of the γ-divergence in particular, there are both a spurious solution near θ = 0 [23], [27] and the solution near the true value. Additionally, when γ exceeds a certain value, the solution near the true value disappears. Therefore, above a certain value of γ, the estimator is given as a solution near θ = 0, so the bias approaches −1. In other words, for the β- or γ-divergence, the bias cannot approach zero no matter how the initial value of the parameter θ is tuned.
3) Gamma Distribution With Outlier Contamination: Figures 8 and 9 show the biases for the contaminated gamma distribution when the true shape parameter k is 2 and 0.5, respectively. The β-divergence-based estimator was numerically unstable when using the iterative algorithm. Therefore, it was obtained through grid search. When the true shape parameter k is 2, the behavior of the bias is almost the same as in the exponential distribution (Figure 8). However, when the true shape parameter is 0.5, the behavior of the bias differs significantly from that of the exponential distribution (Figure 9). The β- and γ-divergence-based estimators did not reduce the bias to zero; the bias increased as the proportion of outliers increased. The objective functions and the estimating equations for the β- and γ-divergences with respect to the gamma distribution include the gamma function. Because of the constraint that the argument of the gamma function is positive, if the true shape parameter k is less than one, the tuning parameters β and γ are constrained to the range [0, k/(1 − k)). Here, with k = 0.5, the range of the tuning parameters for the β- and γ-divergences is [0, 1). In contrast, the f-separable IS distortion measure-based estimation achieves a bias converging to zero. In estimating the contaminated gamma distribution based on the f-separable IS distortion measure, the bias can be reduced to zero independent of the known shape parameter.
4) Gaussian Distribution Estimation Under Outlier Contamination: Figures 10 and 11 show the biases of the mean and the variance in the contaminated Gaussian distribution estimation experiment, respectively. For the mean, the bias reaches zero regardless of the proportion of outliers and the estimation method. However, as the proportion of outliers increases, the variance estimation results start to differ. Consequently, the way in which the bias of the mean estimate goes to zero as the tuning parameter increases also starts to differ among the methods. This difference becomes more prominent as the proportion of outliers increases. In the estimation based on the f-separable distortion measure and the γ-divergence, the bias of the variance estimate can approach zero by increasing the tuning parameter regardless of the proportion of outliers. The tuning parameter value achieving a bias near zero is smaller for the γ-divergence than for the f-separable IS distortion measure. In the estimation based on the β-divergence, the bias cannot approach 0, even when the proportion of outliers is 0.1, and it worsens as the proportion of outliers increases.

C. Trade-off Between Sample Efficiency and Robustness
We demonstrated that the bias can reach zero with the f-separable IS distortion measure-based estimation. However, the performance of the estimator should be measured under the trade-off between efficiency and robustness. We focus on the exponential distribution, for which the AREs of the estimators are discussed in Section VII (Figure 2). The contamination distribution is the Gaussian distribution with µ_out = 20 and σ²_out = 1. The proportion of outliers was set to ε ∈ {0, 0.1, 0.2, 0.3, 0.4}. We changed the number of data samples as 30, 50, 100, and 1000. Since we observed tendencies similar to those discussed below, we omit the results for n = 50 and 1000, and for ε = 0.3 in this paper. We averaged the results over 10,000 trials. For comparison, the tuning parameters corresponding to AREs between 0.5 and 1 were obtained from Figure 2 for the f-separable IS distortion measure and the β- and γ-divergences. This means that all three estimators have an equal ARE if there is no contamination by outliers. Note that the ARE decreases as the tuning parameter increases. Table I shows the correspondence between the ARE and the tuning parameters under the exponential distribution, where α, β, and γ are the tuning parameters of the f-separable IS distortion measure using (8), the β-divergence, and the γ-divergence, respectively. The full correspondence between the ARE and the tuning parameters is shown in Figure 2.
Figures 12-15 show the MSE and bias of the estimators for n = 30 and 100. Note that the x-axis represents the ARE, not the tuning parameters. When the ARE is close to one, the estimator is efficient but weak against outliers. Conversely, when the ARE is close to 0.5, it is outlier-resistant but less efficient. When outliers are not mixed (ε = 0), for all estimators, both MSE and bias monotonically increase as the ARE decreases. This works as a sanity check for using the ARE despite its asymptotic nature. For ε = 0, the MSEs of all three estimators are roughly the same, whereas the bias of the f-separable IS distortion measure is smaller than those of the β- and γ-divergences. When the proportion of outliers is 0.1, the MSEs of the f-separable IS distortion measure and the β- and γ-divergences show similar trends. However, the f-separable IS distortion measure shows slightly better performance overall. In particular, it shows better results for both MSE and bias for AREs around 0.95. When the proportion of outliers is greater than 0.2, the MSE of the f-separable IS distortion measure is greater than those of the β- and γ-divergences for AREs between 0.9 and 1. This is because the bias is then increased relative to the MLE (Figures 14 and 15). Considering the bias, when the ARE is 0.9, the bias of the f-separable IS distortion measure is smaller than those of the β- and γ-divergences, but the MSE is greater, and hence the variance is larger. On the other hand, the f-separable IS distortion measure shows better performance than the β- and γ-divergences in most regions where the ARE is less than 0.85.

IX. CONCLUSION
In this paper, we discussed the condition for the unbiased estimating equation in the class of parameter estimation by minimizing the f -separable Bregman distortion measures.Its condition consists of the statistical model, Bregman divergence, and function f .We clarified that the condition the function f and statistical model should satisfy is characterized by a simple integral for Mahalanobis and IS distances.These results were extended to the case of one-dimensional Bregman divergence.In estimating the scale parameter of the gamma distribution, divergence-based estimation generally requires bias correction terms.Furthermore, we proved that the vanishing bias correction term implies the possibility of minimizing latent bias caused by the large proportion of outliers.We demonstrated that the latent bias could approach zero through experiments with outliers or very small inliers mixed.For the choice of function f , there is a trade-off between robustness against outliers and model efficiency.Methods for determining the tuning parameters of divergence have been studied [47], [48], [49].These methods can be used to determine the appropriate function f and the strictly convex function ϕ.

APPENDIX A PROOF OF THEOREM 1
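We decompose the positive definite matrix A as A = V Λ V^T, where V^{−1} = V^T and Λ is a diagonal matrix with positive eigenvalues. Then, with the change of variables x − θ = V y, the Mahalanobis distance is rewritten as y^T Λ y, and the bias correction term becomes
$$\frac{1}{C_{\mathrm{Mah.}}}\int g\left(y^{T}\Lambda y\right) f'\left(y^{T}\Lambda y\right) V y \left|\frac{\partial(x)}{\partial(y)}\right| dy = \frac{V}{C_{\mathrm{Mah.}}}\int g\left(y^{T}\Lambda y\right) f'\left(y^{T}\Lambda y\right) y\, dy,$$
which is further reduced by the substitution s = Λ^{1/2} y, where the Jacobians of the two substitutions are |V| and |Λ|^{−1/2} in absolute value, respectively. Notably, because the matrix V is an orthogonal matrix, |V|² = 1. Here, the random vector S follows a spherical distribution p(s) = (1/C_Mah.) g(∥s∥²). We refer to the next theorem.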
Note that R is the radial random variable, i.e., R = ∥S∥, and H is the random vector that follows the uniform distribution on the unit hypersphere. The joint distribution of (38) is given in [44, p. 38]. From (36), (37), and (39), the bias correction term reduces to an expression involving the expected value of H, and the expected value of the uniform distribution on the unit hypersphere is zero. Therefore, if the corresponding radial integral exists, then the unbiased estimating equation holds without any bias correction term, where we used integration by substitution with t = r². Conversely, the existence (finiteness) of the bias correction term requires that the absolute value of each of its elements also has a finite expectation. This requires that f′(R²)R has a finite expectation in the above discussion, since S/R = s(H) is always bounded. This means that the condition (18) is also necessary.
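The fact that the uniform distribution on the unit hypersphere has zero mean, which makes the bias correction term vanish, can be illustrated by simulation. The sketch below uses the standard construction of normalizing a standard Gaussian vector to obtain a uniformly distributed direction; the function names and the sample sizes are illustrative choices.

```python
import math
import random


def uniform_on_sphere(dim, rng):
    """Draw one point uniformly from the unit hypersphere in R^dim.

    A standard Gaussian vector is rotationally invariant, so its direction is uniform.
    """
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def empirical_mean(dim, n, seed=0):
    """Coordinate-wise sample mean of n uniform draws on the unit hypersphere."""
    rng = random.Random(seed)
    total = [0.0] * dim
    for _ in range(n):
        h = uniform_on_sphere(dim, rng)
        total = [t + v for t, v in zip(total, h)]
    return [t / n for t in total]


if __name__ == "__main__":
    print(empirical_mean(dim=3, n=20000))
```

Each coordinate of the sample mean shrinks toward zero as the number of draws grows, in line with the argument that the direction H contributes nothing to the expectation.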

APPENDIX B PROPERTY OF SCALE FAMILY
The scale family is defined by where h is the probability density function and θ ∈ Θ = R_+ \ {0} is the scale parameter. Let us consider the following case, where C_SF is the normalization constant as follows, That is, the scale family is given by
Theorem 7: The following relation holds with respect to the expected value E[X] and the normalization constant C_SF,
Proof: From the definition and finiteness of the expected value E[X], we have Here, the normalization constant C_SF must satisfy Therefore, we have Conversely, when we assume (41), we have Therefore, Theorem 7 holds. □
If we set h(x) = (1/x) g(d_IS(x, 1)), then the scale family (40) reduces to the IS distribution (20). We immediately obtain Lemma 2 as a corollary to Theorem 7.
Ordinarily, the probability distribution q is the data-generating distribution, and p is the statistical model. Thus, the probability distribution p is estimated by minimizing the β-cross-entropy. However, since the true distribution q is unknown, the empirical distribution is substituted for it in the empirical estimation. The objective function to be minimized is given by the following equation, where the empirical distribution is substituted for q.
2) Gamma Distribution: The objective function and update rule for the β-divergence, assuming the gamma distribution (28) with a known shape parameter k > 0 for the statistical model, are given by
p(x_i|θ) in MLE to a general robust objective function (1/n) Σ_{i=1}^{n} ρ(x_i, θ) [1],
Manuscript received 20 July 2022; revised 28 November 2023; accepted 6 February 2024. Date of publication 16 February 2024; date of current version 16 July 2024. This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant JP23K16849 and Grant JP19K11825. An earlier version of this paper was presented in part at the 2020 IEEE Information Theory Workshop [DOI: 10.1109/ITW46852.2021.9457678]. (Corresponding author: Masahiro Kobayashi.)
Here, a ∈ R ∪ {−∞} and b ∈ R ∪ {∞} express the left and right edges of the support of the probability density function, which depend on the strictly convex function ϕ. For example, if ϕ(x) = − log x, then a = 0 and b = ∞, and if ϕ(x) = x², then a = −∞ and b = ∞. In general, the normalization constant C_ϕ(θ) depends on the parameter θ. This distribution includes the one-dimensional elliptical and IS distributions. Specifically, if (24) holds and the expected value exists, E[X] < ∞, then E[X] = θ holds from the estimating equation (16) with f(z) = z, and the condition for it is given by (25). Details of the continuous Bregman distribution are discussed in Section V-B.
Assumption 3: 1) Bregman divergence satisfies the following for any θ and a positive constant ζ (including ∞) with respect to the support (a, b) of (23):

1) Relationship Between Function g and Expected Value E[X]:
Corollary 1: Under (24) of Assumption 3 and Assumption 4, the following relation holds with respect to the expected value E[X] and the function g which composes the continuous Bregman distribution,
$$\int_{0}^{\zeta} g(t)\,dt < \infty \iff E_{p(x|\theta)}[X] = \theta < \infty.$$
Proof: Corollary 1 follows immediately from Theorem 3 with f′(z) = 1. □
2) Examples of the Continuous Bregman Distribution: We show common examples of the continuous Bregman distribution. Note that the normalization constants of the following examples and those of the corresponding continuous Bregman distributions are different.
a) One-dimensional elliptical distribution: We set ϕ(x) = x². Then, (23) becomes the one-dimensional elliptical distribution as follows:

Fig. 1. Comparison between the continuous Bregman distribution and regular exponential family. (a) The intersection between the continuous Bregman distribution and regular exponential family. (b) One-dimensional elliptical distribution; (c) Gaussian distribution; (d) IS distribution; (e) gamma distribution.

Fig. 3. Bias and MSE of the non-contaminated exponential distribution against the number of samples.

Fig. 12. MSE of the exponential distribution with the outliers under n = 30 samples.

Fig. 13. MSE of the exponential distribution with the outliers under n = 100 samples.

Fig. 14. Bias of the exponential distribution with the outliers under n = 30 samples.

Fig. 15. Bias of the exponential distribution with the outliers under n = 100 samples.

1) Exponential Distribution: The objective function and update rule for the β-divergence, assuming the exponential distribution for the statistical model, are given by L β (θ) = − 1 β