Information-Theoretic Limits for the Matrix Tensor Product

This article studies a high-dimensional inference problem involving the matrix tensor product of random matrices. This problem generalizes a number of contemporary data science problems including the spiked matrix models used in sparse principal component analysis and covariance estimation and the stochastic block model used in network analysis. The main results are single-letter formulas (i.e., analytical expressions that can be approximated numerically) for the mutual information and the minimum mean-squared error (MMSE) in the Bayes optimal setting where the distributions of all random quantities are known. We provide non-asymptotic bounds and show that our formulas describe exactly the leading order terms in the mutual information and MMSE in the high-dimensional regime where the number of rows n and number of columns d scale with d = O(n^α) for some α < 1/20. On the technical side, this article introduces some new techniques for the analysis of high-dimensional matrix-valued signals. Specific contributions include a novel extension of the adaptive interpolation method that uses order-preserving positive semidefinite interpolation paths, and a variance inequality between the overlap and the free energy that is based on continuous-time I-MMSE relations.


Introduction
Inference problems involving the estimation and factorization of large structured matrices play a central role in the data sciences. The last decade has witnessed significant progress on theory and algorithms for a variety of models involving low-rank structure, such as the spiked matrix models used in low-rank covariance estimation [1,2], sparse principal component analysis (PCA) [3], and clustering [4,5], as well as related models involving sparse graphical structure, such as the stochastic block model (SBM) for community detection [6,7].
To understand the fundamental (or information-theoretic) limits of inference, a recent line of work has focused on exact characterizations of the asymptotic mutual information and minimum mean-squared error (MMSE). This body of work focuses on problem settings where the probability distributions of all unknown quantities are known and there are no constraints on computational complexity. The limits are described by single-letter formulas (i.e., analytical expressions that can be approximated numerically) that characterize the leading-order terms in the mutual information and MMSE.
The value of exact formulas is that they provide detailed information about the role of different model parameters (e.g., the amount or quality of data). Even more, they can describe phase transitions that delineate problem regimes with drastically different behaviors with respect to the quality of inference and computational complexity. The development of this type of theory often follows a two-part sequence where the formulas are first conjectured using heuristic approaches, such as the replica method, and then proven rigorously using very different techniques [30][31][32][33][34]. While the emphasis on single-letter formulas is standard in areas such as information theory and statistical physics, it differs from some of the other approaches used in the data sciences, which focus instead on order-optimal bounds or rates of convergence.
In this context, the contribution of this paper is a rigorous analysis of the fundamental limits for a broad class of problems related to the matrix tensor product (or Kronecker product) of large random matrices with low-dimensional structure. Specifically we introduce a model that generalizes existing models in the literature and we provide single-letter formulas for the mutual information and MMSE. These formulas recover a number of existing results as special cases and also provide new results for settings that could not be analyzed previously.
On the technical side, a further contribution of this paper is the development of techniques for the analysis of high-dimensional matrix-valued signals. Our formulas leverage functional properties of entropy under Gaussian convolution, using ideas introduced in analysis of multi-layer networks [19] and used more recently for matrix factorization problems motivated by community detection [24][25][26]. The proof builds upon the adaptive interpolation method [35,36], which represents an evolution of earlier interpolation techniques in the statistical physics literature [37,38]. In this direction, one of the key steps in this paper is the introduction of matrix-valued interpolation paths that satisfy a certain order-preserving property that is easier to work with than the regularity conditions used in previous work. Further innovations include a new method for controlling overlap concentration using perturbations parameterized by positive semidefinite matrices and a variance inequality between overlap and the free energy that is based on continuous-time I-MMSE relationships [39,40].

The matrix tensor product and related models
This paper studies a model consisting of the matrix tensor product (or Kronecker product) observed in additive Gaussian noise with an arbitrary coupling matrix. Throughout this paper, we use X to denote a real n × d random signal matrix with known distribution. The bulk of the analysis focuses on non-asymptotic bounds that hold uniformly for any signal matrix satisfying a finite fourth-moment constraint. In the final stages of our proof, these bounds are shown to be asymptotically tight in scaling regimes in which the number of rows n increases while the number of columns d is fixed, or at least grows much more slowly than n.

Definition 1. The matrix tensor product model with coupling matrix B ∈ R^{d_1 d_2 × m} is given by

Y = (X_1 ⊗ X_2) B + W,     (1)

where (X_1, X_2) ∈ R^{n_1 × d_1} × R^{n_2 × d_2} is a pair of jointly random signal matrices, X_1 ⊗ X_2 denotes the n_1 n_2 × d_1 d_2 matrix obtained by the Kronecker product, and W ∈ R^{n_1 n_2 × m} is a noise matrix with i.i.d. standard Gaussian entries. Equivalently, the rows of Y are in one-to-one correspondence with the set of m-dimensional observations {Y_ij : 1 ≤ i ≤ n_1, 1 ≤ j ≤ n_2} given by

Y_ij = B^⊤ (X_{1i} ⊗ X_{2j}) + W_ij,

where X_{1i} and X_{2j} denote the i-th and j-th rows of X_1 and X_2 and each W_ij is an independent m-dimensional standard Gaussian vector.
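As a concrete illustration, the model and its row-wise form can be checked numerically with NumPy. The dimensions below are arbitrary; the only substantive claim is the correspondence between the rows of the Kronecker product and the pairs of signal rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, d1, d2, m = 3, 4, 2, 2, 5

X1 = rng.standard_normal((n1, d1))
X2 = rng.standard_normal((n2, d2))
B = rng.standard_normal((d1 * d2, m))
W = rng.standard_normal((n1 * n2, m))

# Matrix tensor product model: Y = (X1 (x) X2) B + W
Y = np.kron(X1, X2) @ B + W

# Row (i, j) of Y corresponds to the observation B^T (X1_i (x) X2_j) + W_ij
i, j = 1, 2
row = np.kron(X1[i], X2[j]) @ B + W[i * n2 + j]
assert np.allclose(Y[i * n2 + j], row)
```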
The matrix tensor model defined above is general in the sense that it allows for any linear combination of the rows of the tensor product. The spiked matrix models used for sparse PCA and covariance estimation can be represented by this model with a specific choice of the coupling matrix B. To see this correspondence, suppose that d_1 = d_2 = d and let B be the d² × 1 vector obtained by stacking the columns of √(λ/n) I_d. Then, (1) is equivalent to the asymmetric low-rank matrix estimation model (or spiked Wishart model) given by

Y = √(λ/n) X_1 X_2^⊤ + W,     (3)

where W is an n_1 × n_2 noise matrix with i.i.d. standard Gaussian entries. The case where X_1 = X_2 = X recovers the symmetric spiked-matrix model (or spiked Wigner model).
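This correspondence can be verified numerically (noiseless part only; dimensions and λ are arbitrary choices for the check):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3          # here n1 = n2 = n and d1 = d2 = d
lam = 2.0
c = np.sqrt(lam / n)

X1 = rng.standard_normal((n, d))
X2 = rng.standard_normal((n, d))

# Coupling vector B = vec(sqrt(lam/n) I_d): stack the columns of c*I_d
B = (c * np.eye(d)).flatten(order="F").reshape(-1, 1)

# Noiseless part of model (1) with this B, reshaped to an n x n matrix ...
lhs = (np.kron(X1, X2) @ B).reshape(n, n)
# ... matches the noiseless part of the spiked Wishart model sqrt(lam/n) X1 X2^T
assert np.allclose(lhs, c * X1 @ X2.T)
```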
A generalization of the usual spiked matrix model was introduced to study the k-community degree-balanced SBM [24,25]. In the setting of diverging average degree, the community detection problem can be modeled using

Y = X B X^⊤ + W,     (4)

where B is a d × d coupling matrix (d = k − 1) that models the interactions between different communities. In this model, positivity (or negativity) of an eigenvalue of B corresponds to assortativeness (or disassortativeness) in the community interactions. An indefinite coupling matrix is needed to model the full range of behaviors seen in real-world networks [41].

A further generalization of (4) is the multiview spiked matrix model used to study community detection with multiple correlated networks [26]. This model is given by

Y_ℓ = X B_ℓ X^⊤ + W_ℓ,   ℓ = 1, …, L,     (5)

where each B_ℓ is associated with a different network. By vectorization, this model is an instance of (1) with a coupling matrix whose ℓ-th column is vec(B_ℓ).

The main result of this paper is an approximation formula for the mutual information and MMSE associated with the matrix tensor product model. Rather than focusing on (1) directly, we consider a symmetric form that is easier to work with.

Definition 2. The symmetric matrix tensor product model with d² × d² positive semidefinite matrix S is given by

Y = X^{⊗2} S^{1/2} + W,     (6)

where X^{⊗2} = X ⊗ X is n² × d² and W is a noise matrix with i.i.d. standard Gaussian entries.
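The vectorization step can be checked numerically. In the sketch below, vec is taken row-major so that the (i, j) observation matches (X B X^⊤)_{ij} exactly; with the column-major vec used elsewhere in the paper, the same identity holds up to a transposition.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d = 4, 3
X = rng.standard_normal((n, d))
B = rng.standard_normal((d, d))   # a (possibly indefinite) coupling matrix

# Row (i, j) of X (x) X dotted with a vectorization of B recovers X_i B X_j^T
lhs = (np.kron(X, X) @ B.flatten(order="C")).reshape(n, n)
assert np.allclose(lhs, X @ B @ X.T)
```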
In fact, there is no loss in generality in restricting our attention to the symmetric case.
Lemma 1. The matrix tensor product model (1) and the symmetric form (6) are equivalent in the sense that either one can be used to represent the other.
Proof. The mapping from (6) to (1) is obvious. The reverse direction is based on two steps. First, any pair (X_1, X_2) can be embedded into a single (n_1 + n_2) × (d_1 + d_2) block-diagonal matrix via the matrix direct sum X_1 ⊕ X_2 = diag(X_1, X_2). The tensor product X ⊗ X then includes X_1 ⊗ X_2 as a subset of its rows and columns. Second, by orthogonal invariance of the standard Gaussian distribution, the projected observation Y B^⊤ is a sufficient statistic for inference about X, and thus there is no loss of generality in considering only positive semidefinite coupling matrices.
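The embedding step of the proof can be illustrated directly: the Kronecker square of the block-diagonal embedding contains X_1 ⊗ X_2 as an explicit submatrix (the index bookkeeping below is the only content of the check).

```python
import numpy as np

rng = np.random.default_rng(2)
n1, n2, d1, d2 = 2, 3, 2, 2
X1 = rng.standard_normal((n1, d1))
X2 = rng.standard_normal((n2, d2))

# Block-diagonal embedding X = X1 (+) X2
n, d = n1 + n2, d1 + d2
X = np.zeros((n, d))
X[:n1, :d1] = X1
X[n1:, d1:] = X2

# X (x) X contains X1 (x) X2 as a submatrix: take rows indexed by
# (i, n1 + j) pairs and columns indexed by (k, d1 + l) pairs.
rows = [i * n + (n1 + j) for i in range(n1) for j in range(n2)]
cols = [k * d + (d1 + l) for k in range(d1) for l in range(d2)]
K = np.kron(X, X)
assert np.allclose(K[np.ix_(rows, cols)], np.kron(X1, X2))
```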

Overview of contributions
This paper provides theoretical guarantees for optimal estimation in the matrix tensor product model. We derive a single-letter approximation formula for the mutual information and provide a rigorous non-asymptotic bound on the approximation error (Theorem 6). Similar in spirit to recent work on spiked matrix models with generative priors [28], this result is stated under general conditions that depend only on the variance of the free energy. Combining this result with a more common (and more restrictive) separability assumption leads to a bound in which the error decays at rate n^{−1/5} in the problem dimension (Theorem 8).
Although it is unlikely that this is the optimal rate of convergence, it is, to the best of our knowledge, the strongest rate that has been proven in the literature. Some implications of these results for the MMSE are given in Theorem 9. Our proof builds upon the method of adaptive interpolation used in previous work on spiked matrix models [27,29,35,36] and shares the same basic outline of 1) specifying an interpolation path via a first-order ordinary differential equation and 2) introducing a perturbation to the problem to guarantee overlap concentration (almost everywhere) and thus cancel a remainder term. The main novelty in our approach is that we are able to perform these steps in a more general setting that allows us to address the matrix tensor product model.
Two new ideas are the following:
• Order-preserving interpolation path: The interpolation paths are parameterized by positive semidefinite matrices and satisfy an order-preserving property. This property can be verified thanks to a comparison inequality for differential equations (Lemma 20). As a consequence, matrix-valued perturbations can be introduced directly along the interpolation path.
• Continuous-time variance inequality: Inspired by pointwise I-MMSE relationships [39,40], we introduce a continuous-time coupling of the noise in the perturbation model. Leveraging Itô calculus, we derive a variance inequality (Lemma 23) that provides a direct link between the variance of the free energy and the variation in the overlap matrix. This approach simplifies certain aspects of the analysis and allows for concentration under weaker assumptions on the scaling of the problem dimension.
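The comparison inequality underlying the first idea says, roughly, that when the vector field of an ODE is order-preserving in the state, ordered initial conditions produce ordered solutions. A minimal scalar sketch (the vector field and step size are illustrative choices, not taken from the paper):

```python
import math

def euler_path(f, eps, steps=1000, T=1.0):
    """Forward-Euler solve of phi' = f(t, phi), phi(0) = eps."""
    h = T / steps
    phi = eps
    path = [phi]
    for k in range(steps):
        phi = phi + h * f(k * h, phi)
        path.append(phi)
    return path

# A vector field that is order-preserving (nondecreasing) in phi
f = lambda t, phi: t + math.atan(phi)

p1 = euler_path(f, eps=0.1)
p2 = euler_path(f, eps=0.5)
# Ordered initial conditions give ordered paths: phi1(t) <= phi2(t) for all t
assert all(a <= b for a, b in zip(p1, p2))
```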

Related work
A great deal of work has focused on the information-theoretic [13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28][29] and algorithmic [42][43][44][45][46][47][48] limits of recovery in spiked matrix models. Two special cases are now well understood: the spiked Wigner model, which is equivalent to (4) with a positive semidefinite coupling matrix B, and the spiked Wishart model (3) in the setting where X_1 and X_2 are independent matrices with i.i.d. rows. Using the replica method, Lesieur et al. [4,9,34] provided conjectured formulas for the asymptotic mutual information and MMSE. The special case of the rank-one (d = 1) spiked Wigner model was proven rigorously by Krzakala et al. [15] and Barbier et al. [16]. The case of low-rank matrices (d ≥ 1) was proven for the spiked Wigner model by Lelarge and Miolane [17] and for the spiked Wishart model by Miolane [18]. The role of the coupling matrix in the symmetric model (4) was investigated by Reeves et al. [24,25], who provided one-sided bounds on the formulas for the mutual information and MMSE. Subsequent work by Mayya and Reeves [26] extended this approach to the multiview version (5). Concurrently with the present paper, Barbier and Reeves [49] provide an exact characterization for a special case of the multiview model (5) that satisfies an additional convexity property. Specifically, it is assumed that the coupling matrices B_1, …, B_L satisfy a convexity condition; this assumption is restrictive in the sense that it precludes the model equivalence in Lemma 1.
In comparison to previous work, this paper is the first to consider the general setting of the matrix tensor product (1). Special cases of our general result include the following settings:
• The spiked Wishart model (3) with an arbitrary distribution on the matrices (X_1, X_2). Note that this includes the spiked Wigner model as a special case.
• The spiked matrix model (4) and the multiview version (5) with arbitrary coupling matrices.
The approach in this paper builds upon the method of adaptive interpolation, which has been applied primarily in the case of rank-one estimation problems [27,35,36] and more recently to the setting of low-rank even-order tensors [29]. Some different approaches include the cavity method [17,18], large-deviations techniques [20,28], and the analysis of Hamilton–Jacobi equations [21,22].

Notation
We use S^d, S^d_+, and S^d_{++} to denote the space of d × d symmetric, positive semidefinite, and positive definite matrices, respectively. The trace inner product between matrices A ∈ R^{m×n} and B ∈ R^{m×n} is given by ⟨A, B⟩ := tr(A^⊤ B). For each A ∈ R^{m×n}, we use vec(A) to denote the mn × 1 vector obtained by stacking the columns of A. The Kronecker product (or tensor product) is denoted by ⊗ and the Kronecker power is denoted by A^{⊗k} = A ⊗ ⋯ ⊗ A (k factors). Unless specified otherwise, ‖·‖_p denotes the Schatten p-norm.
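These conventions can be pinned down in code (column-major stacking for vec, singular values for the Schatten norm); the helper below is a throwaway definition for the check, not part of the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 3))

# Trace inner product <A, B> = tr(A^T B) = sum of entrywise products
assert np.isclose(np.trace(A.T @ B), np.sum(A * B))

# vec(A): stack the columns (column-major order)
v = A.flatten(order="F")
assert np.isclose(v @ B.flatten(order="F"), np.trace(A.T @ B))

# Schatten p-norm: the l_p norm of the vector of singular values
def schatten(M, p):
    s = np.linalg.svd(M, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

# p = 2 recovers the Frobenius norm
assert np.isclose(schatten(A, 2), np.linalg.norm(A, "fro"))
```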

Information functions
Throughout this paper we use the following conventions:
• W denotes a matrix with i.i.d. standard Gaussian entries.
• X is an n × d random signal matrix with finite second moment E‖X‖²_2 < ∞.
• Y denotes an observation of a function of X corrupted by additive Gaussian noise.
• Z denotes a generic observation taking values in a set Z.
All of these random quantities are defined on the same probability space and their probability distributions are known throughout. We use G to denote the sigma-field generated by a set of observations. The conditional expectation of X given G is an n × d random matrix denoted by E[X | G].
Our analysis focuses on two closely related estimation-theoretic quantities, both of which are represented by d × d positive semidefinite matrices. The MMSE matrix introduced in [24,25,50] is given by

MMSE(X | G) = E[(X − E[X | G])^⊤ (X − E[X | G])].

It is important to note that the expectation is taken with respect to G, and thus this is a non-random matrix.
The trace corresponds to the usual scalar MMSE, which is given by E‖X − E[X | G]‖²_2. The second matrix of interest is the maximum mean overlap (MMO) given by

MMO(X | G) = E[ E[X | G]^⊤ E[X | G] ].

These matrices satisfy MMO(X | G) + MMSE(X | G) = E[X^⊤ X] and thus provide equivalent information whenever X has finite second moments (as is assumed throughout this paper). While the MMSE is perhaps more familiar from an estimation-theoretic point of view, the MMO is more closely related to the analysis in statistical physics. An important property of these matrices is that they satisfy a data processing inequality. Specifically, the MMO matrix is an order-preserving operator with respect to the partial order on sigma-algebras generated by inclusion and the partial order on symmetric matrices generated by the cone of positive semidefinite matrices. Equivalently, the MMSE matrix is an order-reversing operator.
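The complementarity MMO + MMSE = E[X^⊤X] can be checked by Monte Carlo in the Gaussian case, where the posterior mean is linear. The prior, noise level, and sample size below are arbitrary choices for the check.

```python
import numpy as np

rng = np.random.default_rng(4)
n_rows, d = 200_000, 2      # i.i.d. rows standing in for the rows of X
S = np.array([[1.0, 0.5],
              [0.5, 2.0]])

# Symmetric PSD square root of S
evals, evecs = np.linalg.eigh(S)
Shalf = evecs @ np.diag(np.sqrt(evals)) @ evecs.T

X = rng.standard_normal((n_rows, d))        # rows ~ N(0, I_d)
W = rng.standard_normal((n_rows, d))
Y = X @ Shalf + W

# Posterior mean for a Gaussian prior: E[X_i | Y_i] = Y_i (I + S)^{-1} S^{1/2}
Xhat = Y @ np.linalg.inv(np.eye(d) + S) @ Shalf

mmo = Xhat.T @ Xhat / n_rows                # Monte Carlo MMO (per row)
mmse = (X - Xhat).T @ (X - Xhat) / n_rows   # Monte Carlo MMSE (per row)

# Complementarity: MMO + MMSE = E[X^T X] (= I_d per row for this prior)
assert np.allclose(mmo + mmse, np.eye(d), atol=0.05)
```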

Entropy under Gaussian convolution
Our analysis also leverages a fundamental relationship between the MMSE and the entropy under Gaussian convolution. Given S ∈ S^d_+, let Y ∈ R^{n×d} be the output of the additive Gaussian noise observation model

Y = X S^{1/2} + W,     (11)

where W is an n × d noise matrix with i.i.d. standard Gaussian entries. Following [50], the conditional mutual information function I_{X|Z} : S^d_+ → R is defined according to I_{X|Z}(S) = I(X; Y | Z).
Similarly, the conditional relative entropy function D_{X|Z} : S^d_+ → R is defined according to

D_{X|Z}(S) = E[ D( P_{Y|Z}(· | Z) ‖ P_W ) ],

where D(P ‖ Q) denotes the relative entropy (Kullback–Leibler divergence) between probability measures P and Q defined on the same space. To lighten the notation, we often use the shorthand I(S) and D(S), where the dependence on the pair (X, Z) is implicit. Similar to the MMSE and MMO, it can be verified that these functions are affine transformations of each other:

I(S) + D(S) = (1/2) ⟨S, E[X^⊤ X]⟩.

These functions satisfy a number of important functional properties. The following results follow from [50,[52][53][54][55]; see [24, Appendix D].
Lemma 3 (I-MMSE relation). For any Lipschitz continuous path S : [0, 1] → S^d_+,

D(S(1)) − D(S(0)) = (1/2) ∫₀¹ ⟨S′(t), MMO(X | Y_t, Z)⟩ dt,

where S′(t) denotes a weak derivative of S(t) and Y_t = X S(t)^{1/2} + W.

Lemma 4 (Convexity and smoothness). The relative entropy function D is convex on S^d_+ and differentiable on the interior S^d_{++} with gradient ∇D(S) = (1/2) MMO(S).
Lemma 5 (Data processing inequality). The relative entropy and MMO are order-preserving with respect to both the partial order on sigma-algebras generated by inclusion and the positive semidefinite order on S.
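The affine relation between I(S) and D(S) can be sanity-checked in the Gaussian special case. The closed forms below are the standard Gaussian-channel expressions, an assumption of this sketch rather than formulas quoted from the text.

```python
import numpy as np

n, d = 7, 3
S = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 0.5]])

# Closed forms for X with i.i.d. standard Gaussian entries (n x d):
#   I(S) = (n/2) log det(I + S)
#   D(S) = (n/2) (tr S - log det(I + S))
sign, logdet = np.linalg.slogdet(np.eye(d) + S)
I_S = 0.5 * n * logdet
D_S = 0.5 * n * (np.trace(S) - logdet)

# Affine relation: I(S) + D(S) = (1/2) <S, E[X^T X]>, with E[X^T X] = n I_d
assert np.isclose(I_S + D_S, 0.5 * np.trace(S @ (n * np.eye(d))))
```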

The convex conjugate
Next, it is useful to extend the domain of the relative entropy function to the space of symmetric matrices. While there are a variety of extensions that could be considered, we use the convention that D : S^d → (−∞, +∞] equals the relative entropy function on S^d_+ and +∞ otherwise. Then, D is a proper lower semicontinuous convex function whose effective domain is dom D = S^d_+. The convex conjugate is defined by

D*(Q) = sup_{S ∈ S^d_+} { (1/2) ⟨S, Q⟩ − D(S) }.

The factor of 1/2 is introduced for notational convenience to mirror the factor of one half appearing in the expression of the gradient. By the Fenchel–Moreau theorem [56, Theorem 13.37], the biconjugate (i.e., the conjugate of the conjugate function) is equal to D, and thus

D(S) = sup_{Q} { (1/2) ⟨S, Q⟩ − D*(Q) }.

The domain of the convex conjugate is non-empty and satisfies dom D* ⊆ {Q ∈ S^d : Q ⪯ E[X^⊤ X]}. The sub-differentials are set-valued mappings satisfying

(1/2) MMO(S) ∈ ∂D(S),

where we have introduced the shorthand notation MMO(S) = MMO(X | X S^{1/2} + W, Z).
Example 1. If X has finite entropy H(X), then D*(E[X^⊤ X]) = H(X). Alternatively, if X has a density with respect to Lebesgue measure, then D*(Q) increases to +∞ as Q increases to E[X^⊤ X].
Example 2. If the entries of X are i.i.d. standard Gaussian variables, then D*(Q) = −(1/2) tr(Q) − (n/2) log det(I_d − Q/n) for 0 ⪯ Q ≺ n I_d.
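The closed form in Example 2 as completed above is an assumption derived by stationarity rather than a formula quoted from the text; it can be sanity-checked numerically in the scalar case (d = 1) by maximizing the conjugate objective on a grid.

```python
import numpy as np

n = 4
q = 1.5          # must satisfy 0 <= q < n

# D(s) for a scalar (d = 1) standard Gaussian signal with n i.i.d. entries
D = lambda s: 0.5 * n * (s - np.log1p(s))

# Convex conjugate via direct maximization over s >= 0
s_grid = np.linspace(0.0, 50.0, 200_001)
Dstar_numeric = np.max(0.5 * s_grid * q - D(s_grid))

# Candidate closed form: D*(q) = -q/2 - (n/2) log(1 - q/n)
Dstar_closed = -0.5 * q - 0.5 * n * np.log(1.0 - q / n)
assert np.isclose(Dstar_numeric, Dstar_closed, atol=1e-4)
```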

Multivariate functions
Given a pair of random matrices (X_1, X_2), the bivariate relative entropy function is defined by

D_{X_1,X_2|Z}(S_1, S_2) := D_{X_1 ⊕ X_2 | Z}(S_1 ⊕ S_2),

where ⊕ denotes the matrix direct sum (i.e., A ⊕ B = diag(A, B)). This function inherits the functional properties derived in the case of a single matrix, such as convexity, smoothness, and the data processing inequality.
Example 3. If X_1 and X_2 are independent, then D(S_1, S_2) = D_{X_1}(S_1) + D_{X_2}(S_2).

Overlap and free energy
The information functions considered so far have all been non-random quantities that depend on the average difference between the marginal distribution of the signal P_X and the (random) conditional distribution given the observations P_{X|G}. To understand whether these averages represent the typical behavior, it is useful to consider certain random quantities whose distributions provide further information about the variability in P_{X|G}.
Let X be the n × d random signal matrix and let G be the sigma-field generated by a set of observations. The overlap matrix is the d × d random matrix given by

O := (X¹)^⊤ X²,

where X¹ and X² are replicas that are drawn independently from the conditional distribution of X given G. The conditional independence of the replicas means that the conditional expectation of the overlap is a positive semidefinite matrix given by

E[O | G] = E[X | G]^⊤ E[X | G].

Taking the expectation with respect to G, we recover the MMO matrix. The extent to which the overlap concentrates about its mean provides an important measure of the amount of dependence between the signal components under P_{X|G}.

The second random quantity of interest is a function of the instantaneous or pointwise information contained in the observations. Specifically, given a random pair (X, Z) and a positive semidefinite matrix S, we define the random variable

F := log ( dP_{Y|Z}(· | Z) / dP_W )(Y),

where Y is drawn from the additive Gaussian noise model (11) with parameter S. With some abuse of terminology, we will refer to this variable simply as the free energy. From this definition, it follows that the expectation with respect to the pair (Y, Z) is equal to the relative entropy: E[F] = D_{X|Z}(S). Moreover, a straightforward expansion of the Gaussian density gives an alternative representation:

F = log E[ exp( ⟨Y, X̃ S^{1/2}⟩ − (1/2) ‖X̃ S^{1/2}‖²_2 ) | Y, Z ],

where X̃ is an independent draw from P_{X|Z}. Finally, the variance of the free energy is defined as V := E[(F − E[F])²]. In contrast to the overlap, a useful property of the free energy is that its variance can often be bounded under relatively mild assumptions.
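The definition of the free energy as a log-density ratio, and the identity E[F] = D(S), can be illustrated by Monte Carlo in the scalar Gaussian case (the model, noise level, and sample size are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
s, trials = 2.0, 1_000_000

# Scalar model Y = sqrt(s) X + W with X, W ~ N(0,1), so P_Y = N(0, 1+s)
Y = np.sqrt(s) * rng.standard_normal(trials) + rng.standard_normal(trials)

# Free energy F = log dP_Y/dP_W evaluated at Y (log-density ratio)
F = 0.5 * Y**2 - 0.5 * Y**2 / (1.0 + s) - 0.5 * np.log(1.0 + s)

# E[F] equals the relative entropy D(s) = (1/2)(s - log(1+s))
D_s = 0.5 * (s - np.log1p(s))
assert np.isclose(F.mean(), D_s, atol=0.01)
```

The sample variance of F in this experiment is an estimate of the free energy variance V appearing in the main results.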

Symmetric case
Our main result is stated for a version of the symmetric tensor product. Given a pair of positive semidefinite matrices (R, S) ∈ S^d_+ × S^{d²}_+, consider the observations

Y = X R^{1/2} + W,   Y′ = X^{⊗2} S^{1/2} + W′,     (31)

where W, W′ have i.i.d. standard Gaussian entries. We use D(R, S) = D_{X,X^{⊗2}|Z}(R, S) to denote the bivariate relative entropy function associated with the observations (Y, Y′) and a generic observation Z. Furthermore, the special case S = 0 is denoted by D_0(R) := D(R, 0).

Definition 3. The approximation formula D̂(R, S) associated with (31) is given by a variational expression in terms of D_0 and its convex conjugate D*_0(Q).
As shown in Appendix D, the approximation formula has many of the same properties as the relative entropy function. In particular, it is non-negative, convex, and order-preserving. An important difference, however, is that it need not be differentiable everywhere.
Our main result is the following non-asymptotic bound on the difference between the relative entropy function and the approximation formula.
In this bound, C is a numerical constant and V_{X,X^{⊗2}|Z}(R, S) is the variance of the free energy associated with (31). To compare with other results in the literature, consider the following bound on the variance of the free energy.
Lemma 7. Suppose that the rows of X are bounded almost surely, ‖X_i‖ ≤ b, and that, for each realization of Z, the conditional distribution P_{X|Z}(· | Z) factorizes as a product distribution over the rows. Then the variance of the free energy grows at most linearly in n. Proof. This result follows straightforwardly from the approach in [27, Section 6.3].
Define the scaled (per-row) versions of the relative entropy and approximation formulas. The next result follows straightforwardly from Theorem 6 and Lemma 7.
The gradients of the relative entropy function provide information about the MMSE matrix. Specifically, letting D_n(R, S) be given by (36), the MMSE can be recovered from the partial gradients. The next result follows from Lemma 24.
Theorem 9. Suppose that D_n converges pointwise to a limit D. Then, D is convex and differentiable almost everywhere. Furthermore:

Asymmetric case
Given a tuple of positive semidefinite matrices (R_1, R_2, S) ∈ S^{d_1}_+ × S^{d_2}_+ × S^{d_1 d_2}_+, we use D(R_1, R_2, S) to denote the relative entropy function associated with the observation Y and a generic observation Z. The special case S = 0 is denoted by D_0(R_1, R_2).

Definition 4. The approximation formula D̂(R_1, R_2, S) associated with (43) is given by an analogous variational expression, where D*_0 is the convex conjugate of D_0. The approximation formula for the asymmetric model (43) is equivalent to the one for the symmetric model (31).

Proof of Theorem 6
The proof has two main steps. First, we use adaptive interpolation to bound the difference in terms of a remainder term. Then, we show that, under an appropriate perturbation, this remainder can be made small.

Interpolation with order-preserving paths
The high-level idea is to construct an interpolation path, parameterized by t ∈ [0, 1], that maps from the original problem of interest at time t = 0 to a setting involving only the direct observation model at time t = 1. The interpolation path is described by a family of functions φ_ǫ : [0, 1] → S^d_+, indexed by a perturbation parameter ǫ ∈ E, together with the associated observations Y_{t,ǫ}, where (W, W′) are independent noise matrices with i.i.d. standard Gaussian entries. The overlap matrix associated with this path is given by O_{t,ǫ} := (X¹_{t,ǫ})^⊤ X²_{t,ǫ}, where X¹_{t,ǫ} and X²_{t,ǫ} are conditionally independent draws from the conditional distribution of X given (Y_{t,ǫ}, Z). By Lemma 4 and the mixed-product property (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD), the partial gradients of D(R, S) can be expressed in terms of the overlap, and the remainder term is defined accordingly.

Definition 5 (Order-preserving). The family of functions (φ_ǫ)_{ǫ∈E} is said to be strongly order-preserving if the mapping ǫ ↦ Φ(t, ǫ) − ǫ is order-preserving. In other words, for all t ∈ [0, 1] and ǫ_1, ǫ_2 ∈ E with ǫ_1 ⪯ ǫ_2,

Φ(t, ǫ_1) − ǫ_1 ⪯ Φ(t, ǫ_2) − ǫ_2.

Remark 1. The order-preserving assumption can be seen as a natural extension of regularity conditions used in earlier versions of the adaptive interpolation method (see, e.g., [13, Supplement, Definition 1]), which require that the mapping ǫ ↦ Φ(t, ǫ) is a C¹ diffeomorphism whose Jacobian is greater than or equal to 1. Unlike the conditions used previously, the order-preserving property does not require continuity.
Theorem 10. Fix any (R, S) ∈ S d + × S d 2 + and bounded set E ⊂ S d + .
(i) (Lower bound) For each q⋆ ∈ ∂_1 D(R, S), there exists a family of strongly order-preserving functions (φ_ǫ)_{ǫ∈E} such that the stated lower bound holds for all ǫ ∈ E.
(ii) (Upper bound) There exists a family of strongly order-preserving functions (φ_ǫ)_{ǫ∈E} such that the stated upper bound holds for all ǫ ∈ E.
In both cases, the interpolation functions are bounded and satisfy 0 ⪯ φ_ǫ(t) ⪯ ǫ + 2‖S‖_∞ E[X^⊤X] uniformly for all t ∈ [0, 1].
Corollary 11. The approximation formula satisfies the stated two-sided bound, where R̄(ǫ) is the maximum absolute value of the remainders associated with the lower and upper bounds.
Corollary 12. If M ↦ ⟨S, M^{⊗2}⟩ is convex on R^{d×d}, then the remainder terms are non-negative. Consequently,

Canceling the remainder
The absolute value of the integrand in (47) can be bounded from above, where the second step follows from Jensen's inequality and the matrix Hölder inequality, and the last step holds because the Schatten p-norm satisfies ‖M^{⊗2}‖_p = ‖M‖²_p for any M ∈ R^{m×n} (the singular values of M^{⊗2} are the pairwise products of those of M). To bound the remainder term, we will show that there exists a probability measure µ supported on E such that the averaged variance of the overlap is small uniformly for all t ∈ [0, 1]. There is a tradeoff with respect to the size of the perturbation: the larger the set E, the easier it is to smooth out the variance in the overlap, but the larger the discrepancy in the first term in Corollary 11.

Theorem 13 (Overlap concentration). For all ρ ∈ (0, ∞) there exists a probability measure µ supported on E such that ‖ǫ‖ ≤ ρ(d + 1) almost surely and the µ-averaged variance of the overlap is small uniformly in t.
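The Schatten-norm identity used in this bound follows from the fact that the singular values of a Kronecker product are the pairwise products of the factors' singular values; a quick numerical check (the helper below is defined only for the check):

```python
import numpy as np

rng = np.random.default_rng(6)
M = rng.standard_normal((4, 3))
p = 3.0

def schatten(A, p):
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** p) ** (1.0 / p)

# Singular values of M (x) M are all pairwise products sigma_i * sigma_j,
# hence ||M^{(x)2}||_p = ||M||_p^2 for every p
assert np.isclose(schatten(np.kron(M, M), p), schatten(M, p) ** 2)
```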

Simplification of bound
To simplify the analysis, let us now assume that E‖X‖⁴_2 ≤ B⁴. Further, we use C to denote a numerical constant whose value may change from place to place. Theorem 13 combined with (55) implies a bound in terms of V̄(ρ), defined in (34). Combining this bound with Corollary 11 yields the desired estimate.

To simplify further, we use a transformation that reduces a problem with general R to the problem with R = 0. The direct observation model with matrix R + φ(t) is equivalent to two separate observations, one with matrix R and the other with matrix φ(t). Associating the R-dependent model with the side information yields an equivalent interpolation model Ỹ_t, where W, W′, W″ are independent noise matrices. By the chain rule, the relative entropy corresponding to this interpolation model decomposes into two terms, the second of which does not depend on t. Thus, it is sufficient to prove the stated result for the first term.

Decomposition along path
The first step of the proof is a decomposition of the relative entropy associated with a generic interpolation path. By the integral form of the I-MMSE relationship (Lemma 3), the chain rule for differentiation, and (46), the relative entropy can be expanded along the path. Rearranging terms leads to the decomposition D(φ(0), S) = D + R, where D collects the main terms and R is the remainder. To further simplify the first term, let T : S^d → S^d be the unique linear operator, described in Lemma 26, which satisfies ⟨T(A), B⟩ = ⟨S, A ⊗ B⟩ for all A, B ∈ R^{d×d}, and let T* denote the adjoint operator, which satisfies ⟨T(A), B⟩ = ⟨A, T*(B)⟩. This representation is generic in the sense that it holds for any interpolation path φ(t).
Next, we note that the overlap is a function of the interpolation path in the sense that E[O_t] depends on φ only through φ(t). The key idea in the adaptive interpolation method is to construct a path such that the derivative φ′(t) cancels out certain terms that depend on E[O_t]. In the following, we construct paths whose derivative is expressed in terms of an integrable function ψ′ : [0, 1] → S^d_+. Evaluating (66) along this path and then using the property ⟨T(A), B⟩ = ⟨A, T*(B)⟩ for all A, B ∈ R^{d×d} leads to an expression involving ψ(1) = ∫₀¹ ψ′(t) dt.

Lower bound
We begin by constructing a path that depends on an auxiliary parameter q ∈ Q := {q ∈ S^d_+ : q ⪯ E[X^⊤X]}, which will be specified later. Consider the initial value problem (70). This is a first-order ordinary differential equation because E[O_t] is a function of φ(t). The next result shows that this problem has a unique solution satisfying certain regularity conditions.

Lemma 14. The initial value problem (70) has a unique solution φ(t) = Φ(t, ǫ, q) defined on [0, 1]. Furthermore, for each t ∈ [0, 1], the mapping (ǫ, q) ↦ Φ(t, ǫ, q) − ǫ is continuous and order-preserving.
Next, r ↦ M(t, r) is order-preserving by the data processing inequality for the MMSE (Lemma 2). Also, both T and T* are order-preserving. Therefore, (φ, q) ↦ F(t, φ, q) is order-preserving because it is the composition of order-preserving maps. For any (ǫ_1, q_1) ⪯ (ǫ_2, q_2), we can apply the comparison result in Lemma 20. Having established the existence of the path satisfying (71), we may consider the representation in (69); in the resulting expression, the second line follows from the definition of the convex conjugate. The remaining step is to specify the parameter q. The greatest lower bound is obtained by any value q⋆ = q⋆(ǫ) that attains the maximum in (73), and this approach leads to D ≥ D̂(ǫ, S). However, a technical difficulty that arises is that the selection of q depends on ǫ, and so to verify that the resulting family (φ_ǫ)_ǫ is strongly order-preserving, it is necessary to show that the mapping from ǫ to q is order-preserving.
For the purposes of this paper, we bypass this issue by considering a slightly weaker lower bound in which q is optimized for the case ǫ = 0. Specifically, let q⋆ be a maximizer of the variational problem at ǫ = 0. The set of maximizers is nonempty because Q is compact and −D*_0 is upper semicontinuous, and thus the extreme value theorem [56, Theorem 1.29] guarantees that the maximum is attained. Then, (73) implies the desired lower bound. Moreover, because q⋆ does not depend on ǫ, it follows from Lemma 14 that ǫ ↦ Φ(t, ǫ, q⋆) − ǫ is order-preserving.

Upper bound
Similar to before, we begin by defining a path with respect to an auxiliary parameter q ∈ Q. Consider the initial value problem with coupled equations (76). This is a coupled system of first-order ordinary differential equations because E[O_t] is a function of φ(t).
Starting with (69) and then adding and subtracting terms leads to an upper bound, where the last step follows from Jensen's inequality and the convexity of D*. To proceed, we use Jensen's inequality a second time to upper bound the term D_0(φ(1)). Plugging this inequality back into (84) and then rearranging terms gives the bound (88). The first term on the right-hand side is upper bounded by D̂(ǫ, S). The second term is nonnegative and equal to zero when q = q̄(1).
To complete the specification of the path, we will show that there exists a mapping q⋆ : E → Q that satisfies the following properties: (i) the interpolation path satisfies q̄(1) = q⋆(ǫ), and thus the upper bound in (88) implies the desired result; (ii) the mapping q⋆ is order-preserving, and thus by Lemma 15, ǫ ↦ Φ(t, ǫ) = Φ(t, ǫ, q⋆(ǫ)) is strongly order-preserving.
To show that such a function exists, let Q : E × Q → Q be given by Q(ǫ, q) = ∫₀¹ Q̄(t, ǫ, q) dt and define the fixed-point set FP(ǫ) = {q ∈ Q : Q(ǫ, q) = q}. The next result shows that FP(ǫ) is nonempty and that there exists an order-preserving mapping q⋆ : E → Q with q⋆(ǫ) ∈ FP(ǫ).
Lemma 16. For each ǫ ∈ E, the set FP(ǫ) is nonempty and there exists a greatest element q⋆(ǫ) such that q ⪯ q⋆(ǫ) for all q ∈ FP(ǫ). Furthermore, the mapping q⋆ : E → Q is order-preserving.
Proof. For each ǫ ∈ E, the mapping q → Q(ǫ, q) is continuous by Lemma 15 and maps a convex compact subset of Euclidean space to itself. Therefore, we can apply Brouwer's fixed-point theorem to conclude that FP(ǫ) is nonempty.
Next we show that there is a greatest element. If FP(ǫ) is a singleton then we are done. Otherwise, define the set K(ǫ) = ∩_{q̃∈FP(ǫ)} {q ∈ Q : q ⪰ q̃} of upper bounds of FP(ǫ). This set is nonempty because Q has a largest element, and it is convex and compact because it is the intersection of convex and compact sets. The mapping (ǫ, q) → Q̄(t, ǫ, q) is order-preserving by Lemma 15, and this implies that Q is order-preserving. Consequently, q ∈ K(ǫ) implies that Q(ǫ, q) ⪰ q̃ for all q̃ ∈ FP(ǫ), and so q → Q(ǫ, q) maps K(ǫ) to itself. By Brouwer's fixed-point theorem, there must be a fixed point in K(ǫ). From the definition of K(ǫ), this fixed point is the greatest element of FP(ǫ), which is necessarily unique.
Finally, we show that the greatest element q⋆ is order-preserving. Fix any ǫ₁, ǫ₂ ∈ E with ǫ₁ ⪯ ǫ₂ and let q_k = q⋆(ǫ_k) for k = 1, 2. For each q ⪰ q₁, the fact that Q is order-preserving means that Q(ǫ₂, q) ⪰ Q(ǫ₁, q₁) = q₁, and so q → Q(ǫ₂, q) maps the closed convex set {q ∈ Q : q ⪰ q₁} to itself. By Brouwer's fixed-point theorem, there exists a fixed point in this set, and since q₂ is the greatest element of FP(ǫ₂) it follows that q₁ ⪯ q₂.
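The mechanism behind Lemma 16 can be illustrated numerically. The following sketch uses a hypothetical scalar map standing in for the averaged map Q(ǫ, q) (the actual map in the text acts on a compact set of matrices); iterating an order-preserving map from the largest element of the domain produces a non-increasing sequence converging to the greatest fixed point, which is itself order-preserving in ǫ:

```python
def Q_map(eps, q):
    # toy order-preserving map on [0, 1] x [0, 1] (hypothetical stand-in
    # for the averaged map Q(eps, q) in the text)
    return 0.5 * q + 0.25 * eps + 0.25

def greatest_fixed_point(eps, iters=200):
    # iterate from the largest element of the domain; for an order-preserving
    # map the iterates are non-increasing and converge to the greatest fixed point
    q = 1.0
    for _ in range(iters):
        q = Q_map(eps, q)
    return q

q0 = greatest_fixed_point(0.0)  # solves q = q/2 + 1/4, so q* = 1/2
q1 = greatest_fixed_point(1.0)  # solves q = q/2 + 1/2, so q* = 1
```

Here q0 ≤ q1, matching the order-preservation of ǫ → q⋆(ǫ) established in the lemma.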
Remark 2. In comparison to earlier versions of the adaptive interpolation method for special cases of this problem, one of the main innovations of our approach is the introduction of the coupled path (76) and the auxiliary parameter q. Moreover, a crucial step in the bounding technique is that Jensen's inequality in (84) is applied to the overlap but not to the auxiliary path q̄′(t), and this allows for the cancellation of the second term in (88).

6 Overlap concentration via continuous-time perturbations

6.1 Decomposition of overlap
We begin with a decomposition of the overlap matrix introduced in Section 2.4. By construction, the conditional expectation of the overlap is a d × d symmetric positive semidefinite random matrix given by (27). The decomposition in (90) is the law of total variance. The first term in this decomposition can be bounded from above by the conditional covariance matrix.
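As a sanity check, the law of total variance can be verified in a toy scalar Gaussian model (a hypothetical stand-in for the matrix-valued setting here, chosen because E[X | Y] = Y/2 is available in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.standard_normal(n)       # signal X ~ N(0, 1)
y = x + rng.standard_normal(n)   # observation Y = X + noise

cond_mean = y / 2                 # E[X | Y] in this Gaussian model
explained = cond_mean.var()       # Var(E[X | Y]), exactly 1/2
residual = (x - cond_mean).var()  # E[Var(X | Y)], exactly 1/2
total = x.var()                   # Var(X) = 1
```

Up to Monte Carlo error, total ≈ explained + residual, mirroring the matrix decomposition used in the proof.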
Lemma 17. The following decomposition holds, where the terms on the right-hand side are orthogonal in expectation. Consequently, we obtain a bound in which the third step follows from the conditional independence of X₁ and X₂ and the last step is the Cauchy-Schwarz inequality.
To deal with the first factor, let X₁, . . . , Xₙ ∈ R^d denote the rows of X and observe the resulting bound. For the second factor, Jensen's inequality leads to E‖E[XX⊤ | G]‖₂² ≤ E‖XX⊤‖₂² ≤ E‖X‖₂⁴.

Positive semidefinite frame for symmetric matrices
To control the variation in the conditional expectation of the overlap we need to bound the Frobenius norm of a symmetric random matrix. A useful technique for reducing the d-dimensional setting to the 1-dimensional setting is to decompose this norm in terms of inner products with a set of rank-one matrices. For example, one natural decomposition for a symmetric matrix M is in terms of matrices E_ij constructed from the standard basis vectors e_i; this decomposition follows from the fact that the set {E_ij} forms an orthogonal basis for the space of symmetric matrices. For the purposes of our proof technique, the problem with this decomposition is that most of the basis elements are indefinite and thus do not correspond to order-preserving perturbations. The approach used in this paper is to construct an overcomplete basis (or frame) consisting of d² rank-one matrices in S^d_+ that span S^d. While there are many ways to construct such a frame, the following specification has two nice properties: 1) the lower and upper frame bounds differ by a factor that is only linear in the dimension, and 2) the norm of each frame element is either 1 or 1/√2.
Definition 6 (Frame for symmetric matrices). Let {B_ij}_{i,j=1}^d be the set of d × d rank-one positive semidefinite matrices given by B_ij = a_ij a_ij^⊤, where the vectors a_ij are constructed from the standard basis, with e_i denoting the i-th standard basis vector in d dimensions.
Lemma 18. The set of rank-one positive semidefinite matrices {B_ij}_{i,j=1}^d forms a frame for the space of d × d symmetric matrices with lower and upper frame bounds 1 and (d + 1)/2, respectively. In other words, the double inequality (99) holds for each M ∈ S^d.

Proof. Because both B_ij and M are symmetric, the inner product is determined by the entries on or below the diagonal and we can write ⟨B_ij, M⟩ = ⟨vech(B̃_ij), vech(M)⟩, where vech(·) denotes the d(d+1)/2-dimensional vector obtained by stacking the entries on and below the diagonal and B̃_ij = 2B_ij − diag(B_ij). To proceed, let C be a d² × d(d+1)/2 matrix whose rows are in one-to-one correspondence with the elements of {vech(B̃_ij)}. Using this notation, we can write Σ_{i,j=1}^d ⟨B_ij, M⟩² = ‖C vech(M)‖₂², and thus the lower and upper frame bounds (i.e., the constants in the double inequality (99)) are given by the minimum and maximum eigenvalues of C⊤C. As a straightforward exercise, it can be verified that there exist permutation matrices P, Q under which C⊤C takes a block form in which C̃ is a d(d−1)/2 × d binary matrix satisfying C̃⊤C̃ = (d − 2)I_d + 1_d 1_d^⊤. From here, a direct calculation reveals that the minimum and maximum eigenvalues are given by 1 and (d + 1)/2, respectively.
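The structural properties of the frame can be checked numerically. The sketch below assumes the normalization a_ii = e_i and a_ij = (e_i + e_j)/√2 for i ≠ j (the display defining a_ij is not reproduced here, so this choice is an assumption); it verifies that the d² rank-one PSD matrices span the d(d+1)/2-dimensional space S^d and that the frame operator is strictly positive on that span:

```python
import numpy as np
from itertools import product

d = 4
e = np.eye(d)
B = []
for i, j in product(range(d), range(d)):
    # assumed normalization: a_ii = e_i and a_ij = (e_i + e_j)/sqrt(2)
    a = e[i] if i == j else (e[i] + e[j]) / np.sqrt(2)
    B.append(np.outer(a, a))          # rank-one PSD frame element

V = np.stack([b.ravel() for b in B])  # each row is vec(B_k)
rank = np.linalg.matrix_rank(V)       # dimension spanned by the frame

# frame operator M -> sum_k <B_k, M> B_k; its nonzero spectrum lives on the span
eigs = np.linalg.eigvalsh(V.T @ V)
pos = eigs[eigs > 1e-8]
```

The rank equals d(d+1)/2, confirming that the PSD elements span all of S^d even though each individual element is positive semidefinite.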
With these results in hand, we are ready to bound the second term in (27).
Lemma 19. Let {a_k} be the collection of d-dimensional vectors described in Definition 6. Then the stated bound holds.

Proof. Starting with Lemma 18 gives the first inequality. By (27) and the construction of B_k it follows that the expectation of the square is the variance of ‖E[X a_k | G]‖₂².

Proof of Theorem 13
Let t ∈ [0, 1] be fixed. We consider a parameterization of the perturbation matrix in terms of a vector u ∈ U := [0, 1]^{d²+1}. Given ρ ∈ (0, ∞), define the perturbation matrix R(u), where B₀ = I_d and {B_k}_{k=1}^{d²} is the collection of rank-one positive semidefinite matrices described in Definition 6. By Lemma 18 it follows that ‖ǫ(u)‖∞ ≤ ρ(d + 1) uniformly for all u ∈ U. By Lemma 17, we obtain a bound with two terms. To deal with the first term, we use the identity that follows from the expression for the Hessian of the relative entropy (Lemma 4). We are interested in an upper bound on the integral of this expression with respect to u₀. Working backwards, the last step follows from the decomposition R(u) = u₀ρI_d + (R(u) − u₀ρI_d) and the fact that u₀ → (R(u) − u₀ρI_d) is order-increasing. Combining the above displays leads to a bound in which we used Jensen's inequality to move the integral inside the square root. We now consider the variation with respect to the conditional expectation of the overlap, for which Lemma 19 gives a sum over k. To bound the summands on the right-hand side we will use the variance inequality in Lemma 23. To apply this result, let F^k_{u,δ} and G^k_{u,δ} denote the free energy and the sigma-field associated with the observation model with matrix R(u) + δB_k. To further simplify the resulting variance inequality, we use the bound Var(A + B) ≤ 2 Var(A) + 2 Var(B). Similar to before, we are interested in an upper bound on the integral of this expression with respect to u_k. Working backwards, the crucial step is the last inequality, which follows because u_k → R(u) − ρ u_k B_k is order-increasing. Combining (113) and (127) with the decomposition in (90) gives the stated upper bound.

A Comparison lemma for differential equations
There exists a large literature on differential inequalities; see, e.g., [57, Chapter III]. The next result is related to a result by Lasota et al. [58, Lemma 4]. A self-contained proof is provided for completeness.
Lemma 20 (Comparison Lemma). Let φ(t) and ψ(t) be S^d-valued functions that are differentiable on (0, 1) and satisfy φ′(t) = F(t, φ(t)) and ψ′(t) = G(t, ψ(t)) with initial values φ(0) = u and ψ(0) = v, where u, v ∈ S^d satisfy u ⪯ v and F, G are such that F is order-preserving in its second argument and F(t, M) ⪯ G(t, M) for all (t, M). Then φ(t) ⪯ ψ(t) for all t ∈ [0, 1].

Proof. The desired result can be stated equivalently as h(t) ≤ 0 for all t, where h(t) := λmax(φ(t) − ψ(t)). Using the variational representation of the maximum eigenvalue λmax(M) = sup_{‖u‖≤1} u⊤Mu and then applying the envelope theorem [59, Theorem 2], it follows that h(t) is absolutely continuous. The integrand in the resulting integral representation can be bounded from above, where the first inequality follows from the assumptions on φ(t), ψ(t) and the remaining steps follow from the assumptions on F, G and the basic inequality φ(t) ⪯ ψ(t) + h(t)I_d. Thus, we obtain a differential inequality for h(t) which, combined with h(0) ≤ 0, implies the desired result.
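A minimal numerical illustration of the comparison principle, using scalar functions F ⪯ G with F order-preserving (hypothetical choices, not from the paper) and explicit Euler integration:

```python
def euler(f, x0, T=1.0, steps=1000):
    # explicit Euler integration of x'(t) = f(t, x), returning the full path
    xs, x, h = [x0], x0, T / steps
    for k in range(steps):
        x = x + h * f(k * h, x)
        xs.append(x)
    return xs

F = lambda t, x: 0.5 * x        # order-preserving in x
G = lambda t, x: 0.5 * x + 1.0  # dominates F pointwise

phi = euler(F, 0.0)   # phi(0) = 0 <= psi(0) = 0.5
psi = euler(G, 0.5)
dominated = all(p <= q + 1e-9 for p, q in zip(phi, psi))
```

The ordering of the initial conditions is preserved along the entire path, as the lemma asserts in the matrix-valued setting.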

B Variance inequality from pointwise I-MMSE
This section considers a continuous-time representation of the free energy. This approach is inspired by the pointwise I-MMSE relations in [39, 40]. Let X be an n × 1 vector and define the observation process Y_t = tX + W_t, where (W_t)_{t≥0} is an R^n-valued standard Brownian motion that is independent of (X, Z). We use G_t to denote the filtration generated by the observations (Y_s, Z)_{0≤s≤t}; the conditional expectation of a function of X can then be expressed with respect to G_t. The innovations process (Ȳ_t)_{t≥0} is defined by Ȳ_t = Y_t − ∫₀ᵗ E[X | G_s] ds. An important property of the innovations process is that it is a standard Brownian motion under P_{Y_t|Z}.
We study two different processes that are adapted to G_t. The first is the free energy F_t and the second is the conditional overlap Ô_t. The means of these processes satisfy E[F_t] = D_{X|Z}(t) and E[Ô_t] = 2 (d/dt) D_{X|Z}(t), where D_{X|Z}(t) is the relative entropy function. As a consequence of Girsanov's theorem, the free energy admits a stochastic integral representation [40, Equation (116)]. Writing this expression in terms of the innovations process yields a decomposition whose second term is a zero-mean martingale, and so taking the expectation of both sides recovers the I-MMSE relationship for the relative entropy and the overlap. Next, we consider a further decomposition of the conditional overlap. We use the following result, which can be viewed as a special case of the Kushner-Stratonovich equation.
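For intuition, the I-MMSE relationship recovered above can be checked in closed form for a scalar Gaussian signal, assuming the standard scaling dY_s = X ds + dW_s, for which I(X; Y^t) = ½ log(1 + t) and mmse(t) = 1/(1 + t):

```python
import math

def mutual_info(t):
    # I(X; Y^t) for scalar X ~ N(0, 1) observed as dY_s = X ds + dW_s
    return 0.5 * math.log(1.0 + t)

def mmse(t):
    # E[(X - E[X | G_t])^2] for the same Gaussian model
    return 1.0 / (1.0 + t)

t = 2.0
h1, h2 = 1e-6, 1e-3
first = (mutual_info(t + h1) - mutual_info(t - h1)) / (2 * h1)
second = (mutual_info(t + h2) - 2 * mutual_info(t) + mutual_info(t - h2)) / h2**2
# first-order I-MMSE: d/dt I = (1/2) mmse(t)
# second-order relation: d^2/dt^2 I = -(1/2) mmse(t)^2
```

The finite differences match (1/2) mmse(t) and −(1/2) mmse(t)², consistent with the first- and second-order pointwise relations discussed in this section.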

Lemma 21.
For any sufficiently regular function f : R^n → R, the conditional expectation E[f(X) | G_t] satisfies the stochastic differential equation stated above.

Proof. For simplicity we derive this result in the setting where there is no side information Z. The conditional expectation can be expressed as g_t(Y_t) for a suitable function g_t. By Itô's formula [60, Theorem 8.3], the process g_t(Y_t) decomposes in terms of ġ_t(y), the time derivative, and g_t^i(y) and g_t^{ii}(y), the first and second partial derivatives with respect to the i-th entry. Some straightforward but tedious calculations evaluate these derivatives, and plugging these expressions back into Itô's formula gives the decomposition. Writing this expression in terms of the innovations process gives the desired result.

Lemma 22. The conditional overlap can be expressed as follows.
Proof. By Lemma 21, the conditional mean satisfies E[X | G_t] = E[X | Z] + ∫₀ᵗ Cov(X | G_s) dȲ_s. Applying Itô's formula with the mapping x → ‖x‖₂² gives the stated result.
Lemma 22 can be viewed as a second-order pointwise I-MMSE relation providing a link between the first and second derivatives of the free energy. Indeed, the second term in (149) is a zero-mean martingale, and so taking the expectation of both sides recovers the identity (d²/dt²) D_{X|Z}(t) = −(1/2) E‖Cov(X | G_t)‖₂². Using these pointwise decompositions of the free energy and conditional overlap we can derive an inequality linking the variance of the conditional overlap to the variance of the free energy.
Lemma 23 (Variance inequality). The stated bound holds for all δ > 0.

Proof. We begin with a covariance decomposition. By the Cauchy-Schwarz inequality, the first covariance term is controlled by the variance of the free energy. To bound the second term, we use (141). Combining the above displays and recognizing that E‖Cov(X | G_u)‖₂² appears on both sides, in view of (151), (152), and (160) we have shown an inequality of the form x ≤ √(bx) + a. The final expression follows from the fact that for nonnegative numbers x, a, b, the inequality x ≤ √(bx) + a implies x ≤ b + 2a.
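The elementary fact used in the last step can be checked by random search (a quick numerical confirmation, not a proof):

```python
import random

random.seed(1)

def check_implication(trials=100_000):
    # sample nonnegative (x, a, b); whenever x <= sqrt(b*x) + a holds,
    # verify the claimed consequence x <= b + 2a (up to float slack)
    for _ in range(trials):
        a, b = random.uniform(0, 10), random.uniform(0, 10)
        x = random.uniform(0, 50)
        if x <= (b * x) ** 0.5 + a and not x <= b + 2 * a + 1e-9:
            return False
    return True

ok = check_implication()
```

The analytic argument is short: if x > a then (x − a)² ≤ bx, and solving the quadratic gives x ≤ b + 2a.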

C Gradients of convex functions
Lemma 24. Let f_n be a sequence of convex functions defined on S^d_+ that converges pointwise to a limit f. Then f is convex, and for all x ∈ S^d_+ and y ∈ S^d such that x + ty ∈ S^d_+ for small enough t > 0, it follows that lim sup_{n→∞} f′_n(x; y) ≤ f′(x; y).

Proof. The directional derivative of a convex function satisfies f′(x; y) = inf_{t>0} (f(x + ty) − f(x))/t, and taking the infimum over both sides yields lim sup_{n→∞} f′_n(x; y) ≤ f′(x; y). To obtain the desired expression we apply the max formula [56, Theorem 17.18], which states that f′(x; y) = max_{u∈∂f(x)} ⟨u, y⟩.
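The limsup inequality can be strict, as the following scalar example illustrates: f_n(x) = √(x² + 1/n) converges pointwise to f(x) = |x|, yet f′_n(0; 1) → 0 while f′(0; 1) = 1.

```python
import math

def f_n(x, n):
    # smooth convex functions converging pointwise to f(x) = |x|
    return math.sqrt(x * x + 1.0 / n)

def dir_deriv(g, x, y, t=1e-8):
    # one-sided difference quotient approximating g'(x; y)
    return (g(x + t * y) - g(x)) / t

d_limit = dir_deriv(abs, 0.0, 1.0)                  # f'(0; 1) = 1
d_n = dir_deriv(lambda x: f_n(x, 10**6), 0.0, 1.0)  # near 0 for large n
```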

D Properties of the approximation formula
Lemma 25. The approximation formula D̂(R, S) given in (32) is convex and order-preserving on S^d_+ × S^{d²}_+. Furthermore, the supremum over Q is attained on the compact convex set Q̂₀ := {Q ∈ S^d_+ : Q ⪯ E[X⊤X]}, and thus D̂ can be written with the supremum restricted to Q̂₀.

Proof. The approximation formula is convex because it is the supremum over a family of convex functions [56, Proposition 9.3]. Similarly, it is order-preserving because it is the supremum over a family of order-preserving functions. Next, the mapping Q → (1/2)⟨S, Q^{⊗2}⟩ − D₀(Q) is upper semicontinuous and dom D₀* ∩ S^d_+ is contained in the compact set Q̂₀. Therefore, by the extreme value theorem [56, Theorem 1.29], the supremum is bounded and attained on Q̂₀.

D.1 Alternative characterizations
The approximation formula can be expressed in a number of different ways. For example, starting with (163) and recalling the definition of the convex conjugate leads to a representation whose second line follows from the change of variables R̃ → R + R̃. The next result shows that ⟨S, Q₁ ⊗ Q₂⟩ can be represented as a bilinear function. This representation is used to specify the interpolation path in the proof of Theorem 10.
Lemma 26. For each S ∈ S^{d²} there exists a unique linear operator T such that the stated identity holds for all A, B ∈ R^{d×d}, where T* is the adjoint operator. These operators can be expressed in terms of any λ₁, . . . , λ_m ∈ R and V₁, . . . , V_m ∈ R^{d₂×d₁} such that S = Σ_{l=1}^m λ_l V_l ⊗ V_l.

Starting with (165) and writing the term ⟨S, Q^{⊗2}⟩ in terms of the linear operator T leads to a representation in which the last step follows from the change of variables R̃ → R̃ + T(Q). The next result shows that the infimum over R̃ can be restricted to the compact set given by the image of Q̂₀ under T*.
The expression in the brackets is concave in Q and linear (and hence convex) in U₂. Therefore, we can apply the minimax theorem (e.g., [61, Corollary 37.3.1]) to swap the order of the supremum and the infimum over these terms, and this leads to an expression whose second step follows from the definition of the convex conjugate and the fact that dom D₀* ∩ S^d_+ ⊂ Q̂₀. Making the change of variables U₁ → Q and U₂ → Q̃ gives the stated expression.

D.2 The role of convexity
The approximation formula can be simplified further in the special case where M → ⟨S, M^{⊗2}⟩ is convex. Necessary and sufficient conditions for convexity are given as follows.

Proof. By vectorization, we can write f(M) = g(vec(M)) where g : R^{d²} → R is given by g(m) = m⊤(Σ_{l=1}^L B_l ⊗ B_l)⊤m. The function g is convex if and only if its Hessian is positive semidefinite. For the second case, we use the half-vectorization operator and write f(M) = h(vech(M)) where h : R^{d(d+1)/2} → R is given by h(m) = m⊤ D_n⊤ (Σ_{l=1}^L B_l ⊗ B_l)⊤ D_n m.

Recall that by Corollary 12, convexity on R^{d×d} implies that the approximation formula is a lower bound: D(R, S) ≥ D̂(R, S). The next result shows that convexity also allows for a variational representation involving only a single matrix variable, as in (183), where T is the linear operator described in Lemma 26.
Proof. By (173) it is clear that the right-hand side of (183) is an upper bound on D̂(R, S). At the same time, the assumed convexity implies that for each Q, Q̃ ∈ S^d_+, the first-order condition ⟨S, Q^{⊗2}⟩ ≥ 2⟨S, Q̃ ⊗ Q⟩ − ⟨S, Q̃^{⊗2}⟩ holds. Taking the maximum of this lower bound with respect to Q̃ in Q̂₀ gives the matching lower bound.
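The vectorization argument behind the convexity test can be illustrated numerically. The sketch below uses hypothetical rank-one factors B_l (not from the paper) and tests convexity of f(M) = ⟨S, M^{⊗2}⟩ by checking whether the symmetric part of the Hessian of the vectorized quadratic form is positive semidefinite:

```python
import numpy as np

d = 3
E = np.eye(d)
B1 = np.outer(E[0], E[0])   # hypothetical rank-one PSD factor e1 e1^T
B2 = np.outer(E[1], E[1])   # hypothetical rank-one PSD factor e2 e2^T

def hessian(lams):
    # f(M) = <S, M (x) M> with S = sum_l lam_l B_l (x) B_l is the quadratic
    # form vec(M)^T H vec(M); f is convex iff the symmetric part of H is PSD
    H = lams[0] * np.kron(B1, B1) + lams[1] * np.kron(B2, B2)
    return 0.5 * (H + H.T)

convex_min = np.linalg.eigvalsh(hessian([1.0, 2.0])).min()      # >= 0: convex
nonconvex_min = np.linalg.eigvalsh(hessian([1.0, -5.0])).min()  # < 0: not convex
```

With nonnegative weights the form is convex (Kronecker products of PSD matrices are PSD), while a sufficiently negative weight makes it indefinite.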