Bayes Extended Estimators for Curved Exponential Families

The Bayesian predictive density has a complex representation and does not belong to any finite-dimensional statistical model except in limited situations. In this paper, we introduce a simple approximate representation obtained by projecting it onto a finite-dimensional exponential family. Theoretical properties of the projection are established in parallel with those of the Bayesian predictive density when the model is a curved exponential family. It is also demonstrated that the projection asymptotically coincides with the plugin density based on the posterior mean of the expectation parameter of the exponential family, which we refer to as the Bayes extended estimator. An information-geometric correspondence indicates that the Bayesian predictive density can be represented as the posterior mean in an infinite-dimensional exponential family. The Kullback–Leibler risk performance of the approximation is examined by numerical simulations, which indicate that the plugin density with the posterior mean of the expectation parameter approaches the Bayesian predictive density as the dimension of the exponential family increases. The simulations also suggest that approximation by projection onto an exponential family of reasonable size is practically advantageous with respect to both risk performance and computational cost.

density p(y; ω). We adopt the Kullback–Leibler divergence

$$D\{p(y; \omega); \hat{p}(y)\} = \int p(y; \omega) \log \frac{p(y; \omega)}{\hat{p}(y)} \, dy$$

as a loss function of a predictive density p̂(y). Then, the risk function and the Bayes risk with respect to a prior π(ω) can be written as

$$\mathrm{E}[D\{p(y; \omega); \hat{p}(y)\}] = \int p(x^n; \omega) D\{p(y; \omega); \hat{p}(y)\} \, dx^n$$

and

$$\int \pi(\omega) \int p(x^n; \omega) D\{p(y; \omega); \hat{p}(y)\} \, dx^n \, d\omega,$$

respectively. The Bayesian predictive density is defined by

$$\hat{p}_\pi(y \mid x^n) = \int p(y; \omega) \, p_\pi(\omega \mid x^n) \, d\omega,$$

where

$$p_\pi(\omega \mid x^n) = \frac{p(x^n; \omega)\pi(\omega)}{\int p(x^n; \omega)\pi(\omega) \, d\omega}$$

is the posterior density of ω. It is shown in [1] that the Bayesian predictive density is optimal with respect to the Bayes risk under the Kullback–Leibler divergence in the family of all probability densities, which we denote by F. However, the full Bayesian predictive density is intractable in most problems because its representation involves averaging plugin densities over the model parameters. Except in limited cases, it is not included in the model P, or even in any finite-dimensional model, whereas plugin densities are always included in P because they are constructed by plugging an estimator ω̂(x^n) into the model.

In the present paper, we represent the Bayesian predictive density as the infinite-dimensional limit of a parameterized distribution in an exponential family. We demonstrate that the Bayesian predictive density can be regarded as an infinite-dimensional extension of the plugin density with the posterior mean of the expectation parameter of an exponential family, and that theoretical properties, including optimality with respect to the Bayes risk, are appropriately retained by this extension. The plugin density with the posterior mean of the expectation parameter coincides with the projection of the Bayesian predictive density onto the exponential family with respect to the Fisher metric, and it approaches the Bayesian predictive density in risk as the projected exponential family grows. There is also an information-geometric correspondence between the Bayesian predictive density and the projected density that comes from the correspondence between F and the exponential family.

In practice, the Bayesian predictive density can be approximated computationally, for example, by taking the mean of plugin densities obtained from Markov chain Monte Carlo simulations, or by approximating the posterior density with methods such as the Laplace method; however, the objective of this research is not to develop a complex approximation based on computational methods. Apart from computational approximations, a class of empirical Bayes predictive densities is proposed for multivariate normal models in [2] to avoid the intractable implementations of Bayesian predictive densities. Rather than constructing an approximation of the Bayesian predictive density that merely performs well in terms of risk, we aim to formulate a simple interpretation of the Bayesian predictive density that maintains its theoretical properties on a finite-dimensional model.
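To make the loss, risk, and predictive density above concrete, the following is a minimal numerical sketch (ours, not from the paper) for the simplest conjugate case: a normal location model N(ω, 1) with a flat prior, where the Bayesian predictive density is N(x̄, 1 + 1/n) and the plugin density is N(x̄, 1). The Monte Carlo estimates reproduce the closed-form risks and show the predictive density attaining the smaller Kullback–Leibler risk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal sketch: x(1),...,x(n) ~ N(omega, 1), predict y ~ N(omega, 1).
# Under the flat prior, the Bayesian predictive density is N(xbar, 1 + 1/n),
# while the plugin density is N(xbar, 1).
omega, n, trials = 0.7, 10, 200_000
xbar = omega + rng.standard_normal(trials) / np.sqrt(n)

# KL{N(omega,1); N(xbar,1)} = (xbar - omega)^2 / 2
kl_plugin = (xbar - omega) ** 2 / 2

# KL{N(omega,1); N(xbar,v)} = 0.5*(log v + (1 + (xbar-omega)^2)/v - 1), v = 1 + 1/n
v = 1 + 1 / n
kl_bayes = 0.5 * (np.log(v) + (1 + (xbar - omega) ** 2) / v - 1)

print(kl_plugin.mean())  # risk of the plugin density,     ~ 1/(2n) = 0.05
print(kl_bayes.mean())   # risk of the predictive density, ~ 0.5*log(1+1/n) ≈ 0.0477
```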
The outline of the construction of the approximate predictive densities is as follows. We take as P a submodel of an exponential family; namely, we consider a statistical model that is a curved exponential family

$$P = \{p(x; \theta(\omega)) \mid \omega \in \Omega\},$$

where Ω ⊂ R^d and 1 ≤ d ≤ m. The model P parametrized by ω is embedded in an m-dimensional exponential family parametrized by θ, and we therefore write θ as θ(ω). Summation over a repeated index is automatically taken according to Einstein's summation convention: if an index occurs as both an upper and a lower index in one term, summation is implied. Curved exponential families embedded in exponential families can express a variety of models, including network models (e.g., [3]) and time series models (e.g., [4]). They can also be applied to stochastic processes [5].

We consider predictive densities in a finite-dimensional full exponential family E that includes the original curved exponential model P, and we refer to plugin densities in E as extended plugin densities. The inclusion relation is P ⊆ E ⊆ F, and we consider the middle layer of this three-layer structure. The coordinate system θ = (θ^i) (i = 1, . . . , m) is called the natural parameter of the exponential family. Another coordinate system η = (η_i), defined by

$$\eta_i(\theta) = \mathrm{E}_\theta[x_i] = \frac{\partial \Psi(\theta)}{\partial \theta^i}, \qquad (1)$$

is called the expectation parameter. The posterior mean of η is closely related to the Bayesian predictive density, and the extended plugin density with the posterior mean of η is the object considered in this paper. Based on the idea of covering F by extending exponential families, we specify the models that can be embedded in exponential families, and the theoretical properties described in the following sections are based on this embedding. The policy of expressing a probability density by extending exponential families has been investigated, for example, in [6], in which a log-density is approximated using series of polynomials and rates of convergence are obtained. It should be noted that the practical advantage illustrated in the numerical experiments in Section IV can be attained for other models if it is possible to find an appropriate exponential family onto which to project the Bayesian predictive density.

From the viewpoint of information geometry, the posterior mean of η can be considered as the counterpart of the Bayesian predictive density in E. Table I represents the infinite–finite correspondence of the m- and e-representations. Here, "m" and "e" are short notations for "mixture" and "exponential," respectively. Exponential families and mixture families are important dual families in information geometry, and their typical representations are denoted as the m-representation and the e-representation, respectively. For exponential families, the e-representation is log p(x) = θ^i x_i − Ψ(θ) + log s(x), and θ is called the e-affine parameter because the basis vector fields (∂/∂θ^i) p(x) (i = 1, . . . , m) are parallel with respect to the e-connection defined in Section II of this paper. For mixture families, the m-representation can be written as p(x) = η^i q_i(x) + c(x), and η is the affine coordinate system with respect to the m-connection (also defined in Section II). We can also set the m-affine coordinate system in exponential families; it is defined by (1). Here, the m-affine (or e-affine) parameters are the finite-dimensional typical representations of the m-representation (or e-representation, respectively).
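As a concrete instance of the natural and expectation parameters, the following sketch (ours, not from the paper) checks relation (1) numerically for the Bernoulli family, where Ψ(θ) = log(1 + e^θ) and η = E_θ[x] is the sigmoid of θ.

```python
import numpy as np

# For the Bernoulli family p(x; theta) = exp(theta*x - Psi(theta)), x in {0,1},
# with Psi(theta) = log(1 + e^theta), relation (1) gives
# eta = dPsi/dtheta = E_theta[x], the sigmoid of theta.
def Psi(theta):
    return np.log1p(np.exp(theta))

theta, eps = 0.8, 1e-6
eta_numeric = (Psi(theta + eps) - Psi(theta - eps)) / (2 * eps)  # dPsi/dtheta
eta_closed = 1 / (1 + np.exp(-theta))                            # E_theta[x]

print(eta_numeric, eta_closed)   # both ≈ 0.68997
```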
The Bayesian predictive density is the posterior mean with respect to the m-representation (that is, density functions), and its finite-dimensional counterpart is the posterior mean with respect to (η_i). Since the Bayesian predictive density is optimal in the infinite-dimensional exponential family, we might expect the posterior mean of (η_i) to exhibit the same properties as the Bayesian predictive density, such as optimality with respect to the Bayes risk.
The properties of the posterior mean of η are investigated in the following sections. In Section III-A, we show that the extended plugin density with the posterior mean of η is optimal with respect to the Bayes risk in the finite-dimensional exponential family; we call the posterior mean of η the Bayes extended estimator. In Section III-B, the extended plugin density with the Bayes extended estimator is proved to be the projection of the Bayesian predictive density onto E in terms of the Fisher metric. In Section III-C, its optimality with respect to the risk along orthogonal shifts from the model is shown to be shared with the Bayesian predictive density, and the relation between the projection angle under the Fisher metric and the risk difference between the Bayesian predictive density and the extended plugin density with the Bayes extended estimator is investigated. In Section IV, we compare the risk performance of the Bayes estimator, the Bayes extended estimator, and the Bayesian predictive density by numerical simulations on Gaussian spiked covariance models. We confirm that the projection angle converges to zero as the dimension of E increases. The simulation results also suggest that the projection of the Bayesian predictive density is a practically effective approximation with respect to both the Kullback–Leibler risk and the computational cost.

II. PRELIMINARIES
In this section, we prepare some information-geometric notions. For details of the notions and notation concerning the differential geometry of curved exponential families, refer to [7].
Let a, b, . . . be indices for ω. Let T_ω P be the tangent space of P at a point ω. The tangent space T_ω P is identified with the vector space spanned by ∂_a p(x; ω) (a = 1, . . . , d), where ∂_a denotes ∂/∂ω^a. We define the inner product on the tangent space by

$$\langle \partial_a p, \partial_b p \rangle = \mathrm{E}_\omega[\partial_a \log p(x; \omega) \, \partial_b \log p(x; \omega)]. \qquad (2)$$

In a statistical model P, each component of the Fisher information matrix is defined by g_{ab}(ω) = E_ω[∂_a log p(x; ω) ∂_b log p(x; ω)], so that the inner product (2) of the basis vectors equals g_{ab}. Let g^{ab} be a component of the inverse matrix of (g_{ab}). Then, the e-connection and m-connection coefficients are defined as

$$\Gamma^{(e)}_{ab,c} = \mathrm{E}_\omega[\partial_a \partial_b \log p \; \partial_c \log p], \qquad \Gamma^{(m)}_{ab,c} = \mathrm{E}_\omega[(\partial_a \partial_b \log p + \partial_a \log p \, \partial_b \log p) \, \partial_c \log p],$$

and the skewness tensor is defined by T_{abc} = E_ω[∂_a log p ∂_b log p ∂_c log p]. The Jeffreys prior density is given by π_J(ω) ∝ |g(ω)|^{1/2}, where |g(ω)| is the determinant of the matrix (g_{ab}(ω)). The coordinate systems (θ^i) and (η_i) of the exponential family E are dual to each other in the sense that

$$\left\langle \frac{\partial p}{\partial \theta^i}, \frac{\partial p}{\partial \eta_j} \right\rangle = \delta_i^j, \qquad (3)$$

where δ_i^j is the Kronecker delta. In a curved exponential family, the e-connection and m-connection coefficients are expressed as

$$\Gamma^{(e)}_{ab,c} = (\partial_a \partial_b \theta^i)(\partial_c \eta_i), \qquad \Gamma^{(m)}_{ab,c} = (\partial_a \partial_b \eta_i)(\partial_c \theta^i). \qquad (4)$$

In the rest of the paper, we assume regularity conditions ensuring that equalities such as ∂_a ∫ p(x; ω) dx = ∫ ∂_a p(x; ω) dx hold. For the details of the regularity conditions, see [8].
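The following small sketch (ours, not from the paper) illustrates the Fisher metric for the Fisher circle model used as a running example later: the analytic value g_{ωω} = 1 is recovered as a Monte Carlo average of the squared score.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fisher circle model: x ~ N(mu(omega), I_2) with mu(omega) = (cos w, sin w).
# The score is d/dw log p(x; w) = (x - mu)' dmu/dw, so the Fisher information
# g_ww = E[(score)^2] equals |dmu/dw|^2 = 1.
w = 0.3
mu = np.array([np.cos(w), np.sin(w)])
dmu = np.array([-np.sin(w), np.cos(w)])

x = mu + rng.standard_normal((1_000_000, 2))   # draws from p(x; w)
score = (x - mu) @ dmu
print((score ** 2).mean())                     # ≈ 1.0 = g_ww
```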

III. BAYES EXTENDED ESTIMATORS

A. Optimality With Respect to Bayes Risk
The posterior mean of η (which we denote by η̂_π) over E is evaluated as

$$\hat{\eta}_\pi(x^n) = \int \eta(\omega) \, p_\pi(\omega \mid x^n) \, d\omega.$$

Note that η̂_π(x^n) ≠ η(ω̂_π(x^n)) in general, where ω̂_π(x^n) = ∫ ω p_π(ω | x^n) dω is the posterior mean of ω. We demonstrate that p(y; η̂_π) is optimal in E with respect to the Bayes risk based on a prior π.

Proposition III.1: The Bayes risk of p(y; η̂), where η̂ is an estimator of η, is minimized when η̂ = η̂_π.
Proof: Let θ̂ be an estimator of θ. Note that θ and η are functions of ω. The Kullback–Leibler loss of p(y; θ̂) is

$$D\{p(y; \omega); p(y; \hat{\theta})\} = \int p(y; \omega) \log p(y; \omega) \, dy - \hat{\theta}^i \eta_i(\omega) + \Psi(\hat{\theta}) - \mathrm{E}_\omega[\log s(y)]. \qquad (5)$$

Hence

$$\mathrm{E}_\pi[D\{p(y; \omega); p(y; \hat{\theta})\} \mid x^n] = C(x^n) - \hat{\theta}^i (\hat{\eta}_\pi)_i + \Psi(\hat{\theta}),$$

where C(x^n) does not depend on θ̂ and, for a function f(η),

$$\mathrm{E}_\pi[f \mid x^n] = \int f(\eta(\omega)) \, p_\pi(\omega \mid x^n) \, d\omega.$$

Since ∂Ψ(θ̂)/∂θ̂^i = η_i(θ̂), the posterior expected loss is minimized when θ̂ = θ(η̂_π). By multiplying (5) by p(x^n; ω)π(ω) dω and then integrating with respect to x^n, it is shown that p(y; η̂_π) is optimal with respect to the Bayes risk in E. ∎
We refer to η̂_π as the Bayes extended estimator. Hereinafter, we denote the Bayes extended estimator and the Bayes estimator of ω under a prior π by η̂_π and ω̂_π, respectively. By Proposition III.1, the extended plugin density with η̂_π is the projection of the Bayesian predictive density onto E with respect to the Bayes risk: it is the element of E nearest to the Bayesian predictive density in Bayes risk, because the Bayesian predictive density is optimal with respect to the Bayes risk in F. In fact, p(y; η̂_π) asymptotically coincides with the projection of the Bayesian predictive density onto E with respect to the Fisher metric, as shown in Section III-B.
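As a numeric illustration of the distinction η̂_π ≠ η(ω̂_π), the following sketch (ours, anticipating the Fisher circle example below) uses the fact that the posterior of ω under the uniform prior in that example is von Mises: averaging η(ω) = (cos ω, sin ω) over posterior draws gives a point strictly inside the circle, while plugging in the posterior mean of ω stays on the circle.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed setup (Fisher circle example below): under the uniform prior the
# posterior of omega is von Mises with concentration n*|xbar| around the
# direction phi of the sample mean xbar.
n, r, phi = 20, 0.9, 0.5                          # n, |xbar|, direction of xbar
omega = rng.vonmises(phi, n * r, size=100_000)    # posterior draws of omega

# Bayes extended estimator: posterior mean of eta(omega) = (cos w, sin w)
eta_hat_pi = np.array([np.cos(omega).mean(), np.sin(omega).mean()])
# Plugging the posterior mean of omega (= phi by symmetry) into eta instead
eta_of_mean = np.array([np.cos(phi), np.sin(phi)])

print(np.linalg.norm(eta_hat_pi))    # < 1: strictly inside the circle
print(np.linalg.norm(eta_of_mean))   # = 1: on the circle
```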
The choice of E need not be fixed, and we can consider situations in which the size of the extended model E increases, for example, by employing sequences of exponential families as in [6] and [9]. In those situations, the extended plugin density with the Bayes extended estimator η̂_π approaches the Bayesian predictive density as E grows, since E approaches the set of all probability distributions F.
Here, we use a simple example to illustrate the differences among the plugin density with ω̂_π, the extended plugin density with η̂_π, and the Bayesian predictive density.
Example (Fisher circle model): We consider the two-dimensional Gaussian distribution N(μ, I_2) with unknown mean vector μ and identity covariance matrix I_2. The density function is

$$p(x; \mu) = \frac{1}{2\pi} \exp\!\left(-\frac{|x - \mu|^2}{2}\right).$$

When the mean vector μ is expressed as μ(ω) = (cos ω, sin ω)^⊤, the one-dimensional submodel P = {p(x; μ(ω)) | ω ∈ [0, 2π)} is called the Fisher circle model. Here, θ(ω) = η(ω) = μ(ω) holds. We now derive the Bayes estimator ω̂_π, the Bayes extended estimator η̂_π, and the Bayesian predictive density. For the sample mean x̄ = (1/n) Σ_{t=1}^n x(t), write x̄ = |x̄|(cos φ, sin φ)^⊤. Then, by the law of cosines,

$$|\bar{x} - \mu(\omega)|^2 = |\bar{x}|^2 + 1 - 2|\bar{x}| \cos(\omega - \phi).$$

When the uniform prior π(ω) ∝ 1 is adopted, the posterior density is the von Mises density

$$p_\pi(\omega \mid x^n) = \frac{\exp(n|\bar{x}| \cos(\omega - \phi))}{2\pi I_0(n|\bar{x}|)},$$

where I_0(·) is the modified Bessel function of the first kind; see [10] (pp. 138–140) for the details. It follows that the plugin density with the Bayes estimator is N(x̄/|x̄|, I_2), since ω̂_π = φ by the symmetry of the posterior. The extended plugin density with the Bayes extended estimator is N(η̂_π, I_2) with

$$\hat{\eta}_\pi = \frac{I_1(n|\bar{x}|)}{I_0(n|\bar{x}|)} \frac{\bar{x}}{|\bar{x}|},$$

where I_1(·) is also a modified Bessel function of the first kind; η̂_π is not included in the circle parametrized by ω, and η̂_π ≠ η(ω̂_π). On the other hand, the Bayesian predictive density is given by

$$\hat{p}_\pi(y \mid x^n) = \frac{1}{2\pi} \exp\!\left(-\frac{|y|^2 + 1}{2}\right) \frac{I_0(|n\bar{x} + y|)}{I_0(n|\bar{x}|)}.$$

Therefore, p̂_π(y | x^n) is included in neither P nor E, because it is not a two-dimensional Gaussian with covariance matrix I_2.
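The closed forms in this example are easy to check numerically. The following sketch (ours, not from the paper) evaluates the three densities with scipy's Bessel functions and verifies on a grid that the Bayesian predictive density integrates to one.

```python
import numpy as np
from scipy.special import i0, i1

# Minimal sketch of the three predictive densities in the Fisher circle
# example under the uniform prior; xbar is the mean of n observations.
n = 10
xbar = np.array([0.6, 0.3])
r = np.linalg.norm(xbar)
direc = xbar / r

def plugin(y):                       # N(mu(omega_hat_pi), I_2), on the circle
    return np.exp(-0.5 * np.sum((y - direc) ** 2, axis=-1)) / (2 * np.pi)

def extended_plugin(y):              # N(eta_hat_pi, I_2), strictly inside
    eta = (i1(n * r) / i0(n * r)) * direc
    return np.exp(-0.5 * np.sum((y - eta) ** 2, axis=-1)) / (2 * np.pi)

def bayes_predictive(y):             # posterior mixture over the circle
    shifted = np.linalg.norm(n * xbar + y, axis=-1)
    return (np.exp(-0.5 * (np.sum(y ** 2, axis=-1) + 1)) / (2 * np.pi)
            * i0(shifted) / i0(n * r))

# Crude grid check that the Bayesian predictive density integrates to 1.
g = np.linspace(-8.0, 8.0, 801)
Y = np.stack(np.meshgrid(g, g), axis=-1)         # (801, 801, 2) grid of y
dy = (g[1] - g[0]) ** 2
print(bayes_predictive(Y).sum() * dy)            # ≈ 1.0
print(plugin(np.zeros(2)), extended_plugin(np.zeros(2)))
```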

B. Projection of Bayesian Predictive Densities in Terms of Fisher Metric
Here, we demonstrate that p(y; η̂_π) is the projection of the Bayesian predictive density with respect to the Fisher metric. This is shown via an asymptotic expansion of p(y; η̂_π), which represents it as a point in E obtained from the plugin density of the maximum likelihood estimator in P by a shift with components parallel and orthogonal to P, as shown in Figure 1.
Theorem III.1: The Bayes extended estimator based on a prior π(ω) is expanded as

$$\hat{\eta}_\pi = \eta(\hat{\omega}_{\mathrm{MLE}}) + \frac{1}{n} g^{ab} \left\{ \partial_a \log \frac{\pi}{\pi_J}(\hat{\omega}_{\mathrm{MLE}}) + \frac{1}{2} T_a(\hat{\omega}_{\mathrm{MLE}}) \right\} \partial_b \eta(\hat{\omega}_{\mathrm{MLE}}) + \frac{1}{2n} g^{ab} \partial_a \partial_b \eta(\hat{\omega}_{\mathrm{MLE}}) + o_p(n^{-1}),$$

where π_J is the density of the Jeffreys prior and T_a = T_{abc} g^{bc}.
Proof: See Appendix A. ∎

Using Theorem III.1, we can obtain the asymptotic expansion of the extended plugin density with η̂_π.
Theorem III.2: The extended plugin density with η̂_π is expanded as

$$p(y; \hat{\eta}_\pi) = p(y; \eta(\hat{\omega}_{\mathrm{MLE}})) + \frac{1}{n} g^{ab} \left\{ \partial_a \log \frac{\pi}{\pi_J} + \frac{1}{2} T_a \right\} \partial_b p(y; \eta(\hat{\omega}_{\mathrm{MLE}})) + \frac{1}{2n} g^{ab} \, \partial_a \partial_b \eta_i(\hat{\omega}_{\mathrm{MLE}}) \frac{\partial p}{\partial \eta_i}(y; \eta(\hat{\omega}_{\mathrm{MLE}})) + o_p(n^{-1}).$$

Proof: Symbols such as η(ω̂_MLE), ∂_a η(ω̂_MLE), and ∂_a ∂_b η(ω̂_MLE) are abbreviated as η̂, ∂_a η̂, and ∂_{ab} η̂, respectively. Considering the asymptotic expansion introduced in Theorem III.1, we obtain

$$p(y; \hat{\eta}_\pi) = p(y; \hat{\eta}) + \{(\hat{\eta}_\pi)_i - \hat{\eta}_i\} \frac{\partial p}{\partial \eta_i}(y; \hat{\eta}) + o_p(n^{-1}),$$

and substituting the expansion of (η̂_π)_i − η̂_i yields the statement. ∎

The shift from p(y; η(ω̂_MLE)) to p(y; η̂_π) in Theorem III.2 is composed of two components, one "parallel" and the other "orthogonal" to the model P. That is, the term

$$\frac{1}{n} g^{ab} \left\{ \partial_a \log \frac{\pi}{\pi_J} + \frac{1}{2} T_a \right\} \partial_b p(y; \hat{\eta}) + \frac{1}{2n} g^{ab} \Gamma^{(m)}_{ab,c} g^{cd} \partial_d p(y; \hat{\eta})$$

is included in the tangent space spanned by ∂_a p(y; η) (a = 1, . . . , d), and the term

$$\frac{1}{2n} g^{ab} \left( \partial_{ab} \hat{\eta}_i - \Gamma^{(m)}_{ab,c} g^{cd} \partial_d \hat{\eta}_i \right) \frac{\partial p}{\partial \eta_i}(y; \hat{\eta}) \qquad (6)$$

is orthogonal to ∂_a p(y; η) (a = 1, . . . , d) with respect to the inner product (2), because

$$\left\langle \left( \partial_{ab} \hat{\eta}_i - \Gamma^{(m)}_{ab,c} g^{cd} \partial_d \hat{\eta}_i \right) \frac{\partial p}{\partial \eta_i}, \; \partial_e p \right\rangle = \partial_{ab} \hat{\eta}_i \, \partial_e \theta^i - \Gamma^{(m)}_{ab,c} g^{cd} g_{de} = \Gamma^{(m)}_{ab,e} - \Gamma^{(m)}_{ab,e} = 0$$

by using (3) and (4). We can compare the orthogonal shift (6) with the orthogonal shift from p(y; η(ω̂_MLE)) to the Bayesian predictive density, and (6) turns out to be the projection of the latter orthogonal shift onto E. In [11], the Bayesian predictive density p̂_π(y | x^n) is asymptotically expanded as a shift from p(y; η(ω̂_MLE)) whose parallel component is identical to that of p(y; η̂_π); its orthogonal component (7), however, is different from that of p(y; η̂_π) and is not included in the tangent space of E. Therefore, the shifted density p̂_π(y | x^n) is not included in E, while p(y; η̂_π) is in E.
To handle shifts that are orthogonal to P, we introduce a coordinate system on the subspace of E orthogonal to P, dividing the tangent vectors of E at η into those parallel to P and those orthogonal to P. For each point η ∈ E, the tangent space T_η E is identified with the vector space spanned by ∂p(x; η)/∂η_i (i = 1, . . . , m). The tangent space T_ω P is a subspace of T_η E. Let A(ω) be an (m − d)-dimensional smooth submanifold of E attached to each point ω ∈ P, and assume that A(ω) transverses P orthogonally at η(ω). Such a family of submanifolds {A(ω) | ω ∈ Ω} is called an ancillary family. We introduce an adequate coordinate system ξ = (ξ^κ) (κ = d + 1, . . . , m) on A(ω) so that the pair (ω, ξ) uniquely specifies a point of E in a neighborhood of η(ω), and so that η(ω, ξ) ∈ P if ξ = 0. Then, we have

$$\langle \partial_a p, \partial_\kappa p \rangle = 0 \qquad (a = 1, \ldots, d, \; \kappa = d + 1, \ldots, m),$$

where ∂_κ denotes ∂/∂ξ^κ.
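For the Fisher circle model, a natural ancillary family takes ξ as a radial coordinate, η(ω, ξ) = (1 + ξ)(cos ω, sin ω). The following sketch (ours, not from the paper) checks the orthogonality relation above numerically; for Gaussians with identity covariance the Fisher metric on η is Euclidean, so the inner product reduces to a dot product.

```python
import numpy as np

# Ancillary family for the Fisher circle: points of E = {N(eta, I_2)} near
# the model are written eta(omega, xi) = (1 + xi) * (cos omega, sin omega);
# xi = 0 recovers P.
def eta(w, xi):
    return (1 + xi) * np.array([np.cos(w), np.sin(w)])

w, xi, eps = 0.9, 0.0, 1e-6
d_omega = (eta(w + eps, xi) - eta(w - eps, xi)) / (2 * eps)  # tangent to P
d_xi    = (eta(w, xi + eps) - eta(w, xi - eps)) / (2 * eps)  # tangent to A(omega)

print(d_omega @ d_xi)   # ≈ 0: A(omega) transverses P orthogonally
```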

C. Projection Angle and Risk Difference
In this section, we demonstrate that the extended plugin density with η̂_π is optimal with respect to the risk along orthogonal shifts from the model P. This property parallels the properties of Bayesian predictive densities investigated in [11], as explained later in this section. Orthogonal shifts from P can asymptotically improve the Kullback–Leibler risk over plugin densities in P. By evaluating those risk improvements, the risk difference between p(y; η̂_π) and the Bayesian predictive density can be related to the projection angle (as represented in Figure 2).

We consider extended plugin densities whose estimators are shifted from P by a parallel component α and an orthogonal component β in the coordinate system (ω, ξ) introduced in Section III-B (12), and Proposition III.2 gives the asymptotic expansion (13) of their risk

$$\mathrm{E}[D\{p(y; \omega); p(y; \hat{\eta})\}].$$

The risk expansion (13) confirms that the risk can be improved over a predictive density in P by selecting an appropriate orthogonal shift β. We now obtain the optimal orthogonal shift.
Theorem III.3: The optimal β^κ in (12) is given by

$$\beta^\kappa_{\mathrm{opt}} = \frac{1}{2} g^{ab} H^{(m)\kappa}_{ab},$$

where H^{(m)κ}_{ab} denotes the mixture embedding curvature of P in E, that is, the orthogonal component of ∂_a ∂_b η in the coordinates (ω, ξ).

Proof: The β-dependent part of the risk in Proposition III.2 is quadratic in β. Therefore, β is optimal when

$$\beta^\kappa = \frac{1}{2} g^{ab} H^{(m)\kappa}_{ab}. \qquad \Box$$

Accordingly, the orthogonal component of the shift in Theorem III.2 is optimal, and the extended plugin density with η̂_π has the optimal shift (as illustrated in Figure 3). The risk improvement achieved by the optimal shift is evaluated through the inner product of the optimal shifts β^κ_opt and is given by

$$\frac{1}{8n^2} H^{(m)\lambda}_{ab} H^{(m)\kappa}_{cd} \, g^{ab} g^{cd} g_{\kappa\lambda},$$

which does not depend on the parallel shift α. Here, H^{(m)λ}_{ab} H^{(m)κ}_{cd} g^{ab} g^{cd} g_{κλ} is the mixture mean curvature of P embedded in E at ω; thus, this risk improvement has a geometric interpretation.
These results are related to the properties of Bayesian predictive densities. In [11], it is shown that Bayesian predictive densities are optimal along orthogonal shifts from P. That orthogonal shift is not included in the tangent space of E, as explained in Section III-B. The risk improvement achieved by the orthogonal shift is the mixture mean curvature of P embedded in F, while the risk improvement of p(y; η̂_π) is the mixture mean curvature of P embedded in E. The risk improvements are evaluated by inner products of the optimal shifts, and as a result the cosine of the angle between the two orthogonal shifts, as shown in Figure 2, can be found as the square root of the ratio of the two risk improvements.

Fig. 3. The extended plugin density with η̂_π has the optimal orthogonal shift.
Example (Fisher circle model, continued): We have g_{ωω} = 1 and Γ^{(m)ω}_{ωω} = 0. Thus, the optimal orthogonal shift corresponds to moving η(ω̂_MLE) toward the center of the circle,

$$\eta(\hat{\omega}_{\mathrm{MLE}}) \mapsto \left(1 - \frac{1}{2n}\right) \eta(\hat{\omega}_{\mathrm{MLE}}),$$

and the risk improvement obtained by the optimal orthogonal shift is 1/(8n²) + o(n⁻²). The risk improvement corresponding to the optimal shift (7) of the Bayesian predictive density is 3/(8n²) + o(n⁻²). If the variance of x_1, x_2 is σ², the risk improvements corresponding to the optimal orthogonal shift and to the shift (7) can be obtained as

$$\frac{\sigma^2}{8n^2} + o(n^{-2}) \quad \text{and} \quad \frac{\sigma^2 + 2}{8n^2} + o(n^{-2}),$$

respectively. Therefore, when σ² is large, the risk improvement obtained by p(y; η̂_π) becomes relatively significant as well, and the performance of the Bayes extended estimator is close to that of the Bayesian predictive density. The cosine of the angle between the two shift vectors is

$$\sqrt{\frac{\sigma^2}{\sigma^2 + 2}},$$

which approaches 1 (that is, the angle approaches 0) as σ² increases. In this way, the Bayesian predictive density and p(y; η̂_π) approach each other.
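The 1/(8n²) improvement for σ² = 1 can be checked by a paired Monte Carlo experiment (ours, not from the paper): for Gaussian densities with identity covariance, the Kullback–Leibler loss is |m − μ(ω)|²/2, so the risks of the MLE plugin and of the optimally shifted density can be compared directly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Paired Monte Carlo sketch for sigma^2 = 1: the plugin density uses the
# MLE direction of xbar on the circle; the optimally shifted density pulls
# that point toward the center by the factor 1 - 1/(2n).
n, trials, omega = 10, 400_000, 0.4
mu = np.array([np.cos(omega), np.sin(omega)])

xbar = mu + rng.standard_normal((trials, 2)) / np.sqrt(n)
m_mle = xbar / np.linalg.norm(xbar, axis=1, keepdims=True)  # mu(omega_MLE)
m_opt = (1 - 1 / (2 * n)) * m_mle                           # optimal shift

# KL{N(mu, I_2); N(m, I_2)} = |m - mu|^2 / 2 for each trial
loss_mle = 0.5 * np.sum((m_mle - mu) ** 2, axis=1)
loss_opt = 0.5 * np.sum((m_opt - mu) ** 2, axis=1)

print((loss_mle - loss_opt).mean())   # ≈ 1/(8 n^2) = 0.00125
```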

IV. NUMERICAL STUDIES
Numerical simulations of the Kullback–Leibler risk of p(y; η̂_π) and the Bayesian predictive density are presented for a curved Gaussian model. They confirm the theoretical results obtained so far and also illustrate the practical value of projecting Bayesian predictive densities.
The model P is the spiked covariance model (for related models, see, e.g., [12]); that is, the l-dimensional Gaussian N_l(0, Σ) with the covariance matrix Σ expressed as

$$\Sigma = \lambda u u^\top + I_l,$$

where the vector u ∈ R^l satisfies u^⊤u = 1 and λ > 0. The eigenvalues of the matrix Σ are λ + 1, 1, . . . , 1, and u is the first eigenvector. The model P parametrized by ω = (u, λ) is embedded in the larger full exponential family E = {N_l(0, Σ) | Σ}, and the expectation parameter η comprises the components of Σ. The extended plugin distribution p(y; η̂_π) is N_l(0, Σ̂_π), where Σ̂_π is the posterior mean of Σ. The Bayes estimator (λ̂_π, û_π) of P is composed of λ̂_π, the posterior mean of λ, and û_π, the first eigenvector of Σ̂_π. The plugin distribution of the Bayes estimator is N_l(0, λ̂_π û_π û_π^⊤ + I_l). The settings of l and n are (l, n) = (5, 20) and (80, 320). The posterior means of λ and Σ are computed from 1000 MCMC samples for l = 5 and 2000 MCMC samples for l = 80, produced by Gibbs sampling with 250 and 500 burn-in samples, respectively. The Bayesian predictive density is computed by taking the mean of the plugin densities of those MCMC samples of (λ, u). The Kullback–Leibler risk is estimated as the mean over 2000 trials.
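The following sketch (ours) illustrates the two ways of forming a predictive density from posterior samples in this experiment; the draws of (λ, u) below are arbitrary stand-ins, not the output of the paper's Gibbs sampler. The extended plugin density keeps only the single l × l matrix Σ̂_π, whereas the Bayesian predictive density must average over all stored plugin densities for every new y.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Stand-in posterior draws of (lambda, u) for the spiked covariance model
# Sigma = lambda * u u' + I_l (a real analysis would use Gibbs samples).
l, n_samples = 5, 1000
lam = rng.gamma(2.0, 1.0, size=n_samples)
u = rng.standard_normal((n_samples, l))
u /= np.linalg.norm(u, axis=1, keepdims=True)         # unit eigenvectors

sigmas = lam[:, None, None] * np.einsum('ti,tj->tij', u, u) + np.eye(l)

sigma_pi = sigmas.mean(axis=0)                        # posterior mean of Sigma
y = rng.standard_normal(l)                            # a new observation

ext_plugin = multivariate_normal(np.zeros(l), sigma_pi).pdf(y)   # one matrix
bayes_pred = np.mean([multivariate_normal(np.zeros(l), S).pdf(y)
                      for S in sigmas])               # mean of 1000 plugins

print(ext_plugin, bayes_pred)
```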
The results are illustrated in Figure 4, and they confirm that p(y; η̂_π) approaches the Bayesian predictive density as the size of E increases. In Figure 4a, the three-layer structure of the plugin density, the extended plugin density, and the Bayesian predictive density is visible in the risk comparison. In Figure 4b, the risk plots of p(y; η̂_π) and the Bayesian predictive density are quite close, which means that the projection angle between them is close to zero. The dimension of the parameter space of P is l and that of E is l(l + 1)/2; thus, P is embedded in a relatively larger full exponential family as l increases. It is therefore natural that the two risk performances approach each other as l increases, because the extended model E approaches the set of all probability densities F, and p(y; η̂_π) approaches the Bayesian predictive density.

Figure 4b also illustrates the practical advantage of p(y; η̂_π): projection of the Bayesian predictive density onto a finite-dimensional model of reasonable size is an effective way of approximating it. Bayesian predictive densities are typically approximated by the mean of plugin densities, because obtaining the full density function is intractable. In these experiments, the Bayesian predictive density is the mean of 2000 plugin densities. The problem with this approximation is that it requires substantial space and time to compute the density: all MCMC samples must be stored, and the mean of the plugin densities must be taken for each new y. Figure 4b demonstrates that the approximation by p(y; η̂_π), a single point in E, is comparable to the mean of 2000 points in P. Approximation by p(y; η̂_π) does not require storing MCMC samples, and the full density function is available without averaging over plugin densities each time. Table II provides the computation time required to evaluate the density of 1000 new samples y and the memory size of the Bayesian predictive density and of p(y; η̂_π); both computation time and memory are substantially reduced by using p(y; η̂_π). Therefore, the extended plugin density p(y; η̂_π) is an effective and practical approximation of the Bayesian predictive density.

APPENDIX A

Let L(η) = (1/n) Σ_{t=1}^n log p(x(t); η). Then, η̂_π(x^n) = {(η̂_π)_i(x^n)} is given by

$$\hat{\eta}_\pi(x^n) = \frac{\int \eta(\omega) \, p(x^n; \eta(\omega)) \, \pi(\omega) \, d\omega}{\int p(x^n; \eta(\omega)) \, \pi(\omega) \, d\omega} = \frac{\int \eta(\omega) \exp(nL(\eta(\omega))) \, \pi(\omega) \, d\omega}{\int \exp(nL(\eta(\omega))) \, \pi(\omega) \, d\omega}.$$