Rigorous Dynamics of Expectation-Propagation-Based Signal Recovery from Unitarily Invariant Measurements

Signal recovery from unitarily invariant measurements is investigated in this paper. A message-passing algorithm is formulated on the basis of expectation propagation (EP). A rigorous analysis is presented for the dynamics of the algorithm in the large system limit, where both input and output dimensions tend to infinity while the compression rate is kept constant. The main result is the justification of state evolution (SE) equations conjectured by Ma and Ping. This result implies that the EP-based algorithm achieves the Bayes-optimal performance that was originally derived via a non-rigorous tool in statistical physics and proved partially in a recent paper, when the compression rate is larger than a threshold. The proof is based on an extension of a conventional conditioning technique for the standard Gaussian matrix to the case of the Haar matrix.


I. INTRODUCTION
A. Motivation

CONSIDER the recovery problem of an N-dimensional signal vector x from a compressed noisy measurement vector y ∈ C^M (M ≤ N) [2], [3],

y = Ax + w.  (1)

In (1), A ∈ C^{M×N} denotes a known measurement matrix. The signal vector x is an unknown sparse¹ vector that is composed of independent and identically distributed (i.i.d.) elements. The noise vector w ∈ C^M is independent of the other random variables. The goal of compressed sensing is to recover the sparse vector x from the knowledge of y and A, as well as the statistics of all random variables. A breakthrough for signal recovery is to construct message-passing (MP) algorithms that are Bayes-optimal in the large system limit, where the input and output dimensions N and M tend to infinity while the compression rate δ = M/N is kept constant.
Manuscript received April *, 2017. The author was in part supported by the Grant-in-Aid for Exploratory Research (JSPS KAKENHI Grant Number 15K13987), Japan. The material in this paper was submitted in part to 2017 IEEE International Symposium on Information Theory, Aachen, Germany, Jun. 2017 [1].
K. Takeuchi is with the Department of Electrical and Electronic Information Engineering, Toyohashi University of Technology, Aichi 441-8580, Japan (email: takeuchi@ee.tut.ac.jp).
¹ In this paper, a signal x ∈ R is called sparse if the Rényi information dimension [4] of x is smaller than 1. If x is zero with probability 1 − p, the information dimension is at most p. If x is discrete, it is zero.
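For readers who wish to experiment, the measurement model (1) is easy to simulate. The following is a minimal sketch: the Bernoulli-Gaussian prior, the noise level, and the i.i.d. Gaussian matrix are illustrative choices, not requirements of the analysis in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 500                 # signal dimension and measurements (delta = 0.5)
p = 0.1                          # sparsity: Pr[x_n != 0] (illustrative choice)
sigma2 = 0.01                    # noise power per element (illustrative choice)

# Bernoulli-Gaussian signal with unit variance: x_n = b_n * g_n / sqrt(p)
b = rng.random(N) < p
g = (rng.standard_normal(N) + 1j * rng.standard_normal(N)) / np.sqrt(2)
x = b * g / np.sqrt(p)           # E[|x_n|^2] = 1, zero with probability 1 - p

# i.i.d. complex Gaussian measurement matrix with E[|A_mn|^2] = 1/N
A = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2 * N)

w = np.sqrt(sigma2 / 2) * (rng.standard_normal(M) + 1j * rng.standard_normal(M))
y = A @ x + w                    # the measurement model (1): y = Ax + w
```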
The origin of this approach dates back to the Thouless-Anderson-Palmer (TAP) equation [5] in statistical physics. Motivated by the TAP approach, Kabashima [6] proposed an MP algorithm based on approximate belief propagation (BP) in the context of code-division multiple-access (CDMA) systems with i.i.d. measurement matrices. When the compression rate is larger than the so-called BP threshold [7], the BP-based algorithm was numerically shown to achieve the Bayes-optimal performance in the large system limit, which was originally conjectured by Tanaka [8] via the replica method, a non-rigorous tool in statistical physics, and proved in [9] for i.i.d. Gaussian measurements. However, Kabashima [6] presented no rigorous analysis of the convergence property of the BP-based algorithm.
In order to resolve the lack of a rigorous proof, approximate message-passing (AMP) was proposed in [10] and proved in [11] to achieve the optimal performance for i.i.d. Gaussian measurements, when the compression rate is larger than the BP threshold. Spatially coupled measurement matrices are required for achieving the optimal performance in the whole regime [7], [12]–[14]. However, it is recognized that AMP fails to converge when the i.i.d. assumption on measurement matrices is broken [15], unless damping [16] is employed.
As solutions to this convergence issue, several algorithms have been proposed on the basis of expectation propagation (EP) [17], expectation consistent (EC) approximations [18], [19], or the turbo principle [20]–[22]. The EP-based algorithm [17] is systematically derived from Minka's EP framework [23] by approximating the posterior distribution of x with factorized Gaussian distributions. The EC-based algorithms [18], [19] are iterative algorithms for solving a fixed point (FP) of the EC free energy derived by Opper and Winther [24]. The algorithms [20]–[22] based on the turbo principle are derived from a few heuristic assumptions. Interestingly, the algorithms in [17], [19], [21] are essentially equivalent. In this paper, these algorithms for signal recovery are simply referred to as EP-based algorithms, since Céspedes et al. [17] first proposed this type of algorithm, to the best of the author's knowledge.
Ma et al. [20], [21] derived state evolution (SE) equations of an EP-based algorithm under two heuristic assumptions. By investigating the properties of the SE equations, they conjectured that, for unitarily invariant measurement matrices, the FPs of the EP-based algorithm are the same as those of an energy function that describes the Bayes-optimal performance in the large system limit, which was derived in [25] via the replica method. In other words, the EP-based algorithm was conjectured to achieve the optimal performance in the large system limit, when the compression rate is larger than the BP threshold. The purpose of this paper is to present a rigorous proof of this conjecture.

B. Proof Strategy
The proof strategy is based on a conditioning technique used in [11]. A challenging part in the proof is to evaluate the distribution of an estimation error in each iteration conditioned on estimation errors in all preceding iterations. Bayati and Montanari [11] evaluated the conditional distribution via the distribution of the measurement matrix A conditioned on the estimation errors in all preceding iterations. When linear detection is employed as part of MP, the conditional distribution of A can be regarded as the posterior distribution of A given linear, noiseless, and compressed observations of A, determined by the estimation errors in all preceding iterations. For i.i.d. Gaussian measurement matrices, it is well known that the posterior distribution is also Gaussian. The proof in [11] heavily relies on this well-known fact.
In order to present our proof strategy, assume that A is a Haar matrix [26], [27], which is uniformly distributed on the space of all possible N × N unitary matrices. Under appropriate coordinate rotations in the row and column spaces of A, it is possible to show that the linear, noiseless, and compressed observation of A is equivalent to observing part of the elements in A. Since any Haar matrix is bi-unitarily invariant [26], the distribution of A after the coordinate rotations is the same as the original one. Thus, evaluating the conditional distribution of A reduces to analyzing the conditional distribution of a Haar matrix given part of its elements.
Evaluation of this conditional distribution is a technically challenging part of this paper, while this part is not required for i.i.d. Gaussian measurements. Intuitively, conditioning on a small part of the elements can be ignored, since the number of conditioned elements is O(N), which is much smaller than the number N^2 of all elements in the large system limit. In order to prove this intuition, we use coordinate rotations to reveal the statistical structure of the conditional Haar matrix. We know that a Haar matrix has similar properties to an i.i.d. Gaussian matrix as N → ∞. In particular, a finite number of linear combinations of the elements in a Haar matrix were proved to converge toward jointly Gaussian-distributed random variables in the large system limit [28], [29]. Note that the classical central limit theorem cannot be used, since the elements of a Haar matrix are not independent. Using this asymptotic similarity between Haar and i.i.d. Gaussian matrices, we prove the intuition.

C. Contributions
The main contribution is the rigorous justification of the SE equations for the EP-based algorithm, conjectured in [21]. This implies the achievability of the Bayes-optimal performance conjectured in [25], when the compression rate is larger than the BP threshold, while the converse theorem, i.e. that no algorithm outperforms the EP-based algorithm in the large system limit, is still open.
The technical novelty is an extension of the conditioning technique in [11] for i.i.d. Gaussian measurement matrices to the case of Haar matrices. The proof strategy of the main theorem is applicable to any MP algorithm for signal recovery from unitarily invariant measurements, such as AMP, unless the algorithm contains nonlinear processing of the measurement vector y, e.g. quantization [22].

D. Organization
The remainder of this paper is organized as follows: After summarizing the notation used in this paper, Section II presents the definition of unitarily invariant matrices and a few technical results associated with Haar matrices. In Section III, we introduce assumptions used throughout this paper, and then formulate an EP-based algorithm. The main result is presented in Section IV, and proved in Section V. Several technical results are proved in appendices.

E. Notation
The proper complex Gaussian distribution with mean m and covariance Σ is denoted by CN(m, Σ). For random variables X and Y, the notations X a.s.= Y, X a.s.≥ Y, and X a.s.≤ Y indicate that =, ≥, and ≤ hold almost surely (a.s.), respectively. The notation X ∼ Y means that X follows the same distribution as Y, while X d→ Y is used to represent that X converges in distribution to Y. The notation X|Y indicates that we focus on the conditional distribution of X given Y.
The notation o(1) denotes a vector whose Euclidean norm converges almost surely toward zero in the large system limit. For a vector v ∈ C^N, we write the nth element of v as v_n, in which the index n runs from 0 to N − 1.
For a complex number z ∈ C and a matrix M ∈ C^{M×N}, the complex conjugate, transpose, and conjugate transpose are denoted by z*, M^T, and M^H, respectively. We write the (m, n) element of M as M_{m,n}. For M ≤ N, we write the singular-value decomposition (SVD) of M as M = U[Σ, O]V^H, in which V_∥ consists of the first M columns of V and V_⊥ consists of the remaining columns. For M > N, we write the SVD of M analogously, with the corresponding partition U = [U_∥, U_⊥]. When M is full rank, the pseudo-inverse of M is defined as M† = M^H(MM^H)^{-1} for full row rank and M† = (M^H M)^{-1}M^H for full column rank. Let P_M denote the orthogonal projection matrix onto the space spanned by the columns of M. We have P_M = M(M^H M)^{-1}M^H and P⊥_M = I − P_M.
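As a quick numerical illustration of the projection notation, the following sketch (with a hypothetical random tall matrix M) checks the standard identities P_M = M(M^H M)^{-1}M^H, P_M² = P_M, and P⊥_M M = O assumed above.

```python
import numpy as np

rng = np.random.default_rng(1)
N, t = 8, 3
M = rng.standard_normal((N, t)) + 1j * rng.standard_normal((N, t))  # full column rank a.s.

pinv = np.linalg.inv(M.conj().T @ M) @ M.conj().T   # pseudo-inverse of a full-column-rank M
P = M @ pinv                                        # orthogonal projection onto the column space
P_perp = np.eye(N) - P                              # projection onto the orthogonal complement

assert np.allclose(P, P @ P) and np.allclose(P, P.conj().T)   # idempotent and Hermitian
assert np.allclose(P @ M, M) and np.allclose(P_perp @ M, 0)   # fixes columns / annihilates them
```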

II. PRELIMINARY
The purpose of this section is to present two technical results: the strong law of large numbers and the central limit theorem for a Haar matrix. The two results correspond to [11, Lemma 2 (a) and (b)] for an i.i.d. Gaussian matrix. We start with the definition of a Haar matrix.
Definition 1: A unitary random matrix U ∈ U n is called a Haar matrix if U is uniformly distributed on U n .
An important property of a Haar matrix is bi-unitary invariance [27], which is used throughout this paper.
Definition 2: A random matrix M is called bi-unitarily invariant if M ∼ U M V holds for all deterministic unitary matrices U and V .
We first present the strong law of large numbers associated with a Haar matrix. The elements of a Haar matrix are not independent of each other, so we utilize a strong law of large numbers for dependent random variables.
Theorem 1 (Etemadi [30]): Let {X_i} denote a sequence of complex random variables with finite second moments, and define S_n = Σ_{i=1}^{n} X_i.
• The strong law of large numbers lim_{n→∞} n^{-1}(S_n − E[S_n]) a.s.= 0 holds if sup_i E[|X_i|] < ∞ (4) and lim sup_{n→∞} n^{-γ} V[S_n] < ∞ (5) hold for some γ < 2.
• The limit lim_{n→∞} S_n/n^β a.s.= 0 holds if lim sup_{n→∞} n^{-γ} E[|S_n|²] < ∞ (6) is satisfied for some γ < 2β.
Proof: The proof presented in Appendix A is based on [30, Theorem 1].
The following lemma is the strong law of large numbers associated with a Haar matrix.
Lemma 1: Suppose that V ∈ U_N is a Haar matrix. Let a ∈ C^N and b ∈ C^N denote random vectors that are independent of V and satisfy lim_{N→∞} N^{-1}‖a‖² a.s.= 1, lim_{N→∞} N^{-1}‖b‖² a.s.= 1, and lim_{N→∞} N^{-1} b^H a a.s.= C. Furthermore, we define a Hermitian matrix D ∈ C^{N×N} such that D is independent of V, and that N^{-1}Tr(D²) is almost surely convergent as N → ∞. Then,

lim_{N→∞} N^{-1} b^H V a a.s.= 0,  lim_{N→∞} N^{-1} b^H V D V^H a a.s.= C lim_{N→∞} N^{-1} Tr(D).

Proof: See Appendix B.
We next present the central limit theorem associated with a Haar matrix.
Theorem 2 (Chatterjee and Meckes [29]): Let V ∈ U_N denote a Haar matrix. For any k ∈ N, suppose that the k matrices {W_i ∈ C^{N×N}}_{i=1}^{k} are independent of V and satisfy Tr(W_i W_j^H) = N δ_{i,j} for all i, j = 1, …, k, in which δ_{i,j} denotes the Kronecker delta. Then, the vector a = (Tr(V W_1), …, Tr(V W_k))^T converges in distribution to the standard complex Gaussian vector z ∼ CN(0, I_k) as N → ∞.
Lemma 1 and Theorem 2, as well as Theorem 1, will be used in the proof of the main theorem.
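Both results are easy to check numerically. The following sketch samples Haar matrices via the QR decomposition of a Ginibre matrix (a standard construction; the phase correction makes the factor exactly Haar) and verifies Lemma 1 for the special case a = b, C = 1, as well as the unit variance of Tr(VW) predicted by Theorem 2 for W = I_N. All parameter values are illustrative.

```python
import numpy as np

def haar(N, rng):
    """Sample a Haar-distributed unitary matrix via QR of a complex Ginibre matrix."""
    Z = (rng.standard_normal((N, N)) + 1j * rng.standard_normal((N, N))) / np.sqrt(2)
    Q, R = np.linalg.qr(Z)
    d = np.diag(R)
    return Q * (d / np.abs(d))           # fix column phases so Q is exactly Haar

rng = np.random.default_rng(2)

# Lemma 1 (latter statement): N^{-1} b^H V D V^H a -> C * lim N^{-1} Tr(D)
N = 1000
V = haar(N, rng)
a = rng.standard_normal(N) + 1j * rng.standard_normal(N)
a *= np.sqrt(N) / np.linalg.norm(a)      # normalize so that N^{-1} ||a||^2 = 1
b = a                                    # hence C = 1 in this example
D = np.diag(rng.random(N))               # Hermitian with bounded N^{-1} Tr(D^2)
print((b.conj() @ V @ D @ V.conj().T @ a).real / N, np.trace(D).real / N)

# Theorem 2: Tr(V W) with Tr(W W^H) = N is asymptotically CN(0, 1)
N = 200
samples = np.array([np.trace(haar(N, rng)) for _ in range(2000)])   # W = I_N
print(np.var(samples))                   # should be close to 1
```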

III. EP-BASED SIGNAL RECOVERY

A. Assumptions
Assumptions on the measurement model (1) are presented.
Assumption 1: The signal vector x is composed of zero-mean i.i.d. non-Gaussian elements with unit variance and finite fourth moments.
The i.i.d. assumption for x is implicitly used in the derivation of an EP-based algorithm. We require no additional assumptions for the prior distribution of each element to prove the main theorem, whereas it is practically important to postulate some prior distribution indicating the sparsity of x.
Definition 3: A Hermitian random matrix M is called unitarily invariant if M ∼ U M U H holds for any deterministic unitary matrix U .
Assumption 2: The measurement matrix A has the following properties:
• A^H A is unitarily invariant.
• The empirical eigenvalue distribution of AA^H converges almost surely to a deterministic distribution ρ(λ) with a finite fourth moment in the large system limit.
We write the SVD of A as

A = U[Σ, O]V^H,  (9)

with U ∈ U_M and V ∈ U_N. Furthermore, Σ is an M × M positive semi-definite diagonal matrix, and O denotes the M × (N − M) zero matrix. From Assumption 2, V is a Haar matrix and independent of UΣ [27].
Assumption 3: Let D ∈ C^{M×M} denote any Hermitian matrix such that D is independent of w, and that N^{-1}Tr(D²) is almost surely convergent as N → ∞. Then, the noise vector w satisfies

lim_{M=δN→∞} M^{-1} w^H D w a.s.= σ² lim_{M→∞} M^{-1} Tr(D).

Assumption 3 implies that σ² corresponds to the noise power σ² a.s.= lim_{M→∞} M^{-1}‖w‖² per element, obtained by selecting D = I_M. Assumption 3 is satisfied if ww^H is unitarily invariant, or if U is a Haar matrix and independent of D.
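A measurement matrix satisfying Assumption 2 can be constructed directly from the SVD (9) by drawing V (and U) from the Haar distribution and fixing the singular values. The sketch below builds a row-orthogonal example, for which AA^H is proportional to I_M; the haar() helper is the one from the sketch in Section II, and the normalization is an illustrative choice.

```python
import numpy as np

def unitarily_invariant_A(M, N, singvals, rng):
    """Build A = U [Sigma, O] V^H with Haar U and V, so that A^H A is unitarily invariant."""
    S = np.zeros((M, N), dtype=complex)
    S[:, :M] = np.diag(singvals)                  # M x M diagonal Sigma, padded with zeros
    return haar(M, rng) @ S @ haar(N, rng).conj().T

rng = np.random.default_rng(3)
M, N = 250, 500
A = unitarily_invariant_A(M, N, np.full(M, np.sqrt(N / M)), rng)  # row-orthogonal example
print(np.allclose(A @ A.conj().T, (N / M) * np.eye(M)))           # AA^H = (N/M) I_M
```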

B. Expectation Propagation
We start with an MP algorithm proposed in [21]. Let the detector postulate that the noise vector w in (1) is a circularly symmetric complex Gaussian (CSCG) random vector with covariance σ²I_M. This postulate need not be consistent with the true distribution of w.
As derived in Appendix C, the MP algorithm for this case is based on EP and composed of two modules. In iteration t, a first module, called module A, calculates the extrinsic mean x^t_{A→B} and variance v^t_{A→B} of the signal vector x from the mean x^t_{B→A} and variance v^t_{B→A} provided by the other module, called module B:

x^t_{A→B} = x^t_{B→A} + γ_t W^t (y − A x^t_{B→A}),  (11)
v^t_{A→B} = γ(v^t_{B→A}) − v^t_{B→A}.  (12)

In the initial iteration t = 0, the prior mean x^0_{B→A} = 0 and variance v^0_{B→A} = N^{-1} E[‖x‖²] = 1 are used. In (11), the linear minimum mean-square error (LMMSE) filter W^t ∈ C^{N×M} is given by

W^t = v^t_{B→A} A^H (σ² I_M + v^t_{B→A} A A^H)^{-1}.

The normalization coefficient γ_t in (11) is defined as γ_t = N/Tr(W^t A), which converges almost surely to γ(v^t_{B→A})/v^t_{B→A} due to Assumption 2, with

γ(v) = [δ ∫ λ (σ² + vλ)^{-1} ρ(λ) dλ]^{-1},  (15)

where ρ(λ) denotes the asymptotic eigenvalue distribution of AA^H in the large system limit. The coefficient γ_t keeps the orthogonality between estimation errors in the two modules.
On the other hand, module B computes the minimum mean-square error (MMSE) estimator η̂_t(x^t_{A→B}) of x, and the corresponding MMSE (17), given the virtual additive white Gaussian noise (AWGN) observation

x^t_{A→B} = x + z^t,  z^t ∼ CN(0, v^t_{A→B} I_N).  (18)

From Assumption 1, note that the MMSE is independent of n. If a termination condition is satisfied, module B outputs η̂_t(x^t_{A→B}) as the estimate (16) of x. Otherwise, module B feeds the extrinsic mean x^{t+1}_{B→A} and variance v^{t+1}_{B→A} of x back to module A, given by x^{t+1}_{B→A} = η_t(x^t_{A→B}) in (19) and

v^{t+1}_{B→A} = [MMSE(v^t_{A→B})^{-1} − (v^t_{A→B})^{-1}]^{-1},  (20)

where the decision function η_t : C → C is defined in (21).
Remark 1: The decision function η_t(x) is zero for all x ∈ C if and only if x_n ∼ CN(0, 1) holds. We have postulated Assumption 1 to let the decision function be non-constant.
The following lemma presents an important property of the decision function η t in module B.
Lemma 2 (Ma and Ping [21]): The decision function η_t satisfies the orthogonality properties (22) and (23).
Proof: See Appendix D for the proof based on [21].
Lemma 2 will be used to prove the orthogonality between estimation errors in the two modules.
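To make the two-module structure concrete, the following sketch implements one possible instantiation of the iteration described above for a Bernoulli-Gaussian prior. The update rules are paraphrased in the standard EP/OAMP form consistent with (11), (12), (19), and (20); the function names, the prior, and all parameters are illustrative choices rather than the paper's notation.

```python
import numpy as np

def denoise_bg(r, v, p):
    """Posterior mean and average posterior variance of x under the prior
    x ~ (1 - p) delta_0 + p CN(0, 1/p), given r = x + CN(0, v) (module B)."""
    s = 1.0 / p + v                                  # variance of r when x is nonzero
    logit = np.log(p / (1 - p)) + np.log(v / s) + np.abs(r) ** 2 * (1 / v - 1 / s)
    pi = 1.0 / (1.0 + np.exp(-np.clip(logit, -30, 30)))   # posterior nonzero probability
    m1 = r / (1.0 + p * v)                           # posterior mean given x nonzero
    v1 = v / (1.0 + p * v)                           # posterior variance given x nonzero
    mean = pi * m1
    var = pi * (np.abs(m1) ** 2 + v1) - np.abs(mean) ** 2
    return mean, np.mean(var)

def ep_recovery(y, A, sigma2, p, n_iter=30):
    """Minimal sketch of the EP-based algorithm: LMMSE module A + MMSE module B."""
    M, N = A.shape
    x_BA, v_BA = np.zeros(N, dtype=complex), 1.0     # prior mean and variance at t = 0
    for _ in range(n_iter):
        # Module A: LMMSE filter with the normalization gamma_t = N / Tr(W A)
        W = v_BA * A.conj().T @ np.linalg.inv(sigma2 * np.eye(M) + v_BA * A @ A.conj().T)
        gamma_t = N / np.real(np.trace(W @ A))
        x_AB = x_BA + gamma_t * (W @ (y - A @ x_BA))
        v_AB = (gamma_t - 1.0) * v_BA                # extrinsic variance gamma(v) - v
        # Module B: MMSE denoising followed by the extrinsic (Gaussian-division) update
        x_post, v_post = denoise_bg(x_AB, v_AB, p)
        v_BA = 1.0 / (1.0 / v_post - 1.0 / v_AB)
        x_BA = v_BA * (x_post / v_post - x_AB / v_AB)
    return x_post
```

With the simulation snippet from Section I-A, a call such as x_hat = ep_recovery(y, A, sigma2=0.01, p=0.1) can be compared against the SE prediction of Section IV.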

C. Error Recursion
An error recursion for the EP-based algorithm is formulated to analyze the convergence property.
Let h_t = x − x^t_{A→B} and q_{t+1} = x − x^{t+1}_{B→A} denote the estimation errors for the extrinsic estimates in modules A and B, respectively, and let b_t and m_t denote the associated vectors in the coordinates rotated by the SVD (9). Substituting the system model (1) into (11), and using the SVD (9) and (19), we obtain the error recursion (24)–(27). In (25), the linear filter W̃^t is the representation of W^t in the rotated coordinates. Furthermore, we define η_{−1}(·) = 0 to obtain q_0 = x. In analyzing the convergence property, we focus on the distribution of the estimation error h_t conditioned on the preceding iteration history; thus, it is useful to represent the error recursion in matrix form. Define B_t = [b_0, …, b_{t−1}], H_t = [h_0, …, h_{t−1}], M_t = [m_0, …, m_{t−1}], and Q_{t'} = [q_0, …, q_{t'−1}]. The error recursion is represented as (30)–(33), where the τth columns of G_t(B_t) and F(H_t, q_0) are equal to the right-hand sides (RHSs) of (25) and (27) for t = τ, respectively.

IV. MAIN RESULT
We define the instantaneous mean-square errors (MSEs) for the extrinsic estimates in modules A and B as mse^t_{A→B} = N^{-1}‖x^t_{A→B} − x‖² and mse^t_{B→A} = N^{-1}‖x^t_{B→A} − x‖² in the large system limit. Furthermore, let mse_t denote the instantaneous MSE for the estimate (16) in the EP-based algorithm in the large system limit, given by mse_t = N^{-1}‖η̂_t(x^t_{A→B}) − x‖². The following theorem is the main result of this paper, which describes the rigorous dynamics of the EP-based algorithm in the large system limit.

Theorem 3 (State Evolution): Define the deterministic SE equations, initialized with mse^0_{B→A} = 1, as

mse^t_{A→B} = γ(mse^t_{B→A}) − mse^t_{B→A},
mse^{t+1}_{B→A} = [MMSE(mse^t_{A→B})^{-1} − (mse^t_{A→B})^{-1}]^{-1},

in which γ(·) and MMSE(·) are given in (15) and (17), respectively. Then, the instantaneous MSEs mse^t_{A→B}, mse^t_{B→A}, and mse_t converge almost surely to the corresponding SE solutions in the large system limit, with mse_t a.s.= MMSE(mse^t_{A→B}).

Theorem 3 was originally conjectured in [21], and implies that the EP-based algorithm tracks the exact dynamics of the extrinsic variances in the large system limit; compare (12) and (20) to the SE equations. The FPs of the SE equations were proved in [21] to correspond to those of an energy function that describes the Bayes-optimal performance derived in [25] via the replica method. Thus, the Bayes-optimal performance derived in [25] is achievable when the SE equations have a unique FP, or equivalently when the compression rate δ is larger than the BP threshold.
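Under the same illustrative Bernoulli-Gaussian prior as in the sketch of Section III-B, the SE equations can be iterated numerically: γ(v) is an average over samples lam of the asymptotic eigenvalue distribution ρ of AA^H, and MMSE(v) is estimated by Monte Carlo with the denoise_bg helper defined there. This is a sketch of the recursion in Theorem 3, not the paper's code.

```python
import numpy as np

def gamma_fn(v, lam, delta, sigma2):
    """gamma(v) of (15): 1 / gamma(v) = delta * E_rho[ lambda / (sigma^2 + v lambda) ]."""
    return 1.0 / (delta * np.mean(lam / (sigma2 + v * lam)))

def mmse_fn(v, p, rng, n=200000):
    """Monte Carlo estimate of MMSE(v) for the Bernoulli-Gaussian prior."""
    keep = rng.random(n) < p
    x = keep * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2 * p)
    r = x + np.sqrt(v / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))
    mean, _ = denoise_bg(r, v, p)            # helper from the sketch in Section III-B
    return np.mean(np.abs(x - mean) ** 2)

def state_evolution(lam, delta, sigma2, p, n_iter=30):
    """Iterate the SE equations from mse0_BA = 1 and return the predicted MSE."""
    rng = np.random.default_rng(4)
    mse_BA = 1.0
    for _ in range(n_iter):
        mse_AB = gamma_fn(mse_BA, lam, delta, sigma2) - mse_BA
        mse = mmse_fn(mse_AB, p, rng)        # predicted MSE of the estimate (16)
        mse_BA = 1.0 / (1.0 / mse - 1.0 / mse_AB)
    return mse

# Example: row-orthogonal measurements, where all eigenvalues of AA^H equal 1/delta
print(state_evolution(lam=np.full(1000, 2.0), delta=0.5, sigma2=0.01, p=0.1))
```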
We shall introduce several notations to present a general theorem, of which Theorem 3 is a corollary. The random variables in the error recursions (30)–(33) are divided into three groups: V, Θ = {Σ, w}, and the iteration history X_{t,t'}. See Table I for the notational conventions used in this paper. The set Θ is fixed throughout this paper, so conditioning on Θ is omitted. The set X_{t,t} describes the history of all preceding iterations just before updating (24), while X_{t,t+1} represents the history just before updating (26). Note that the conditioning on X_{t,t'} enters through (30) and (32). In order to investigate the dynamics of the error recursions, the distribution of the Haar matrix V conditioned on X_{t,t'} is analyzed. Let N_k denote the set of all subsets of {0, …, N − 1} with k elements.

Theorem 4: The following properties hold for each iteration τ = 0, 1, …:
(a) Each element q_{τ+1,n} of q_{τ+1} has finite fourth moments; more precisely, (43) holds for all n. For all τ' ≤ τ + 1, the limit (44) exists. For some constant C > 0, the bounds (45) and (46) hold.
(b) For any k ∈ N, the convergence properties (51) and (52), with the definitions (47)–(50), hold for all subsets N ∈ N_k in the large system limit.
(c) Let ω ∈ C^N denote any vector that is independent of V and satisfies lim_{N→∞} N^{-1}‖ω‖² a.s.= 1. Suppose that D is any N × N Hermitian matrix such that D depends only on Σ, and that N^{-1}Tr(D²) is almost surely convergent as N → ∞. Then, (53)–(59) hold for all τ' ≤ τ and τ'' ≤ τ + 1.
(d) The MSEs for the posterior estimate (16) and the extrinsic estimate (19) in module B coincide with the MMSE (17) and the extrinsic variance (20) in the large system limit, as stated in (60) and (61).

Property (b) is a weaker substitute for the heuristic assumption in [21] that the estimation errors themselves are Gaussian. That assumption is too strong to be justified. In fact, the proof of Theorem 4 implies that the assumption is not correct, since it is impossible to let k = N in Theorem 2. However, the weaker property (b) is sufficient to prove Theorem 3.

V. PROOF OF THEOREM 4

A. Technical Lemmas
We need to evaluate the two distributions p(m t , b t |Θ, X t,t ) and p(q t+1 , h t |Θ, X t,t+1 ). The former distribution represents the error recursions (24) and (25) conditioned on the history of all preceding iterations, while the latter describes the error recursions (26) and (27). We follow the proof strategy in [11] to evaluate the two distributions via the conditional distribution p(V |Θ, X t,t ′ ) for t ′ = t or t ′ = t + 1. See Section I-B for the main idea in analyzing the conditional distributions.
Before analyzing the conditional distribution, we shall introduce several definitions used in the proof. For t' > 0, we write the SVDs of Q_{t'} and M_t as defined in Section I-E. For notational convenience, several shorthand definitions are also introduced. The following lemma provides a useful representation of V ∈ U_N conditioned on Θ and X_{t,t'}, and corresponds to [11, Lemma 10].
Lemma 3: For t ≥ 0, t' > 0, and N − t − t' > 0, suppose that Ṽ ∈ U_{N−t−t'} is a Haar matrix and independent of V, and that both Q_{t'} ∈ C^{N×t'} and M_t ∈ C^{N×t} are full rank, with V^{0,1}_{01} = b_0^H/‖q_0‖. Then, the conditional distribution of V given Θ and X_{t,t'} for t' = t or t' = t + 1 satisfies (71), where the conditional mean V̄^{t,t'} is given by V̄^{0,1} = q_0 b_0^H/‖q_0‖² for t = 0, and by (72)–(74) for t > 0.
Proof: See Appendix E.
We next present a lemma to evaluate the conditional mean of V, which corresponds to [11, Lemma 12].
Lemma 4: For t' ≥ t > 0 and N − t − t' > 0, suppose that both Q_{t'} ∈ C^{N×t'} and M_t ∈ C^{N×t} are full rank, and let Γ_{t,t'} be defined via (77). Then, the corresponding bounds hold for all t' = t and t' = t + 1.
Proof: See Appendix F.
The following lemma is used to prove the central limit theorem associated with the second term on the RHS of (71). Note that this lemma is not required for i.i.d. Gaussian measurement matrices, while the evaluation of negligible terms is essentially the same as in [11, Lemma 2 (c)].
Lemma 5: For t' > t ≥ 0 and N − t − t' > 0, suppose that Ṽ ∈ U_{N−t−t'} is a Haar matrix and independent of V. Let a ∈ C^{N−t−t'} denote a vector that is independent of Ṽ and satisfies lim_{N→∞} N^{-1}‖a‖² a.s.= 1. Suppose that z ∈ C^N is a vector such that, for any k ∈ N, z_N d→ CN(0, I_k) holds for all subsets N ∈ N_k as N → ∞.
• If the following properties hold for some constant C > 0, then (86) holds conditioned on a, Θ, and X_{t,t} in the large system limit.
• If the following properties hold for some constant C > 0, then (89) holds conditioned on a, Θ, and X_{t,t+1} in the large system limit.
Proof: See Appendix G.
We are ready to prove Theorem 4. The proof is by induction: in the next subsection, we first prove the theorem for τ = 0.

B. Proof for τ = 0

Lemma 5 implies that, for any k ∈ N, h_{0,N} given by (93) converges in distribution to h̃_{0,N} given by (48) for all N ∈ N_k in the large system limit, when ν_0 is defined by (94). In order to complete the proof, we shall prove that (94) reduces to (50), by applying V^{0,1}_{01} = b_0^H/‖q_0‖ and the projector associated with Q_1.

Eq. (57) for τ = 0: Let us prove (57) for τ = 0. Using (93) and Lemma 1 yields (95) conditioned on Θ and X_{0,1} in the large system limit. From (55) for τ = 0, we find that the first term vanishes in the large system limit. Since ν_0 given by (50) is equal to the RHS of (57) for τ = 0 and τ' = 0, (57) holds for τ = 0.

Eq. (43) for τ = 0: We shall prove (43) for τ = 0. For any a ≥ 0, b ≥ 0, and a + b ≤ c, we utilize the fact that, for random variables X and Y, E[|X|^a |Y|^b] is bounded if both E[|X|^c] and E[|Y|^c] are bounded. The fact is trivial for a = 0 or b = 0; otherwise, it follows from Hölder's inequality. From (27) and Assumption 1, it is sufficient to prove that E[|η_0(q_{0,n} − h_{0,n})|⁴] < ∞ holds in the large system limit. Using (52) for τ = 0, we have (97) in the large system limit, with x̃^0_{n,A→B} = q_{0,n} − ν_0^{1/2} z_{0,n}. From (21), the moment (97) is bounded if E[|η̂_0(x̃^0_{n,A→B})|⁴] is bounded. From (56) for τ = 0, the variance ν_0 given in (50) coincides with v^0_{A→B} given by (12). Furthermore, the MMSE estimator η̂_0(x̃^0_{n,A→B}) is equal to the posterior mean E[x_n | x̃^0_{n,A→B}] of x_n defined via (18). From these observations, we use Jensen's inequality to obtain (98), because of Assumption 1. Thus, (43) holds for τ = 0.

Eq. (44) for τ = 0: We shall prove the existence of the limit (44) for τ = 0. We only consider the case τ' = 0, since the case τ' = 1 can be proved in the same manner. Using (27) yields (99). Since the first term converges almost surely to 1 in the large system limit, it is sufficient to prove that the second term is convergent in the large system limit. In order to utilize Theorem 1, we use (52) for τ = 0 to note (100) in the large system limit. In (100), the last expression follows from the fact that z_0 has pairwise independent elements. We repeat the proof of (43) to find the boundedness of (100). Thus, we can use Theorem 1 to find (101). Thus, the limit (44) exists for τ' = 0 and τ = 0.

Eq. (58) for τ = 0: We evaluate N^{-1} h_0^H q_{τ''}. For τ'' = 0, from (93) we have (102) conditioned on Θ and X_{0,1}, where we have used (Φ⊥_{Q_1})^H q_0 = 0. From (55) for τ = 0, we find that N^{-1} h_0^H q_0 converges almost surely to zero in the large system limit.

We next evaluate N^{-1} h_0^H q_1. Using (27), we have (103) conditioned on Θ and X_{0,1}, where h_0 is given by (93). We have proved that the first term converges almost surely to zero in the large system limit. Repeating the proof of (44) for τ = 0, we use Theorem 1 to find that (103) reduces to (104) in the large system limit, where the last equality follows from (52) for τ = 0.

Eqs. (60) and (61) for τ = 0: The MSE for the posterior estimate reduces to (105), which is equal to MMSE(v^0_{A→B}) given by (17), because of ν_0 = v^0_{A→B} in (48). Thus, (60) holds for τ = 0.

Let us prove (61) for τ = 0. Applying (21) to (27) for τ = 0, and using (20), we have (107). In evaluating ζ_{1,1} given in (44) with (107), repeating the argument in deriving (106) yields (108), where (27) has been used. Repeating the proof of (44) for τ = 0, we find (110) in the large system limit. We next evaluate the last two terms on the RHS of (110). Applying the Cauchy-Schwarz inequality to the last term implies (111), which converges to zero in the large system limit. The convergence to zero follows from the boundedness of V[η_0(x^0_{0,A→B})] and the corresponding upper bound. Thus, the last term on the RHS of (110) converges almost surely to zero in the large system limit.

In order to evaluate the second term on the RHS of (110), we use the fact that the maximum eigenvalue of the projection matrix P_{Q_1} is 1. Thus, (110) is bounded from below by (112) in the large system limit. Since η_0(x^0_{0,A→B}) is a non-constant random variable, the lower bound (112) is strictly positive. Thus, (46) holds for τ = 0.

C. Proof by Induction
We have proved that Theorem 4 holds for τ = 0. Next, we assume that Theorem 4 is correct for all τ < t, and prove that Theorem 4 holds for τ = t. Note that we can use Lemmas 3 and 4, since the induction hypotheses (45) and (46) for τ < t imply that M_t and Q_{t'} are full rank for t' = t and t' = t + 1.
Eq. (51) for τ = t: We first prove that ǫ_{1,t} a.s.= o(1) holds in the large system limit. We use the submultiplicative property of the Euclidean norm to obtain (114). We first prove that N^{-1} H_t^H q⊥_t converges almost surely to zero in the large system limit. By definition, (115) holds. The induction hypothesis (58) for τ < t implies that N^{-1} H_t^H q_t and N^{-1} H_t^H Q_t converge almost surely to zero in the large system limit. Furthermore, the induction hypotheses (44) and (46) for τ < t imply that (116) converges almost surely to zero in the large system limit.

In order to complete the proof of ǫ_{1,t} a.s.= o(1), we shall prove that N‖Γ_{t,t}‖² is bounded. Using (77), we obtain (117). The induction hypothesis (45) for τ < t implies that the first term in the trace is bounded in the large system limit. Furthermore, we use the induction hypotheses (45), (54), and (55) for τ < t to find that N‖Γ_{t,t}‖² is bounded in the large system limit. We have proved ǫ_{1,t} a.s.= o(1).

We next use Lemma 5 to evaluate the second term on the RHS of (113). It is possible to confirm that the last term on the RHS of (86) reduces to (118). Thus, we use (113) and Lemma 5 to find that (51) holds for τ = t, when µ_t in (47) is defined as in (119).

In order to complete the proof of (51), we shall prove that (119) reduces to (49). From Lemma 4, it is sufficient to prove (120). By definition, we have (121). We have already proved that N^{-1} H_t^H q⊥_t converges almost surely to zero in the large system limit. Using the induction hypotheses (46), (57), and (58) for τ < t, we obtain (122), which implies the boundedness of (N^{-1} H_t^H P⊥_{Q_t} H_t)^{-1} in the large system limit, because of the induction hypothesis (45) for τ < t. From these observations, we have (120). Thus, (51) holds for τ = t.
Eqs. (53)–(56) for τ = t: For τ' = t, (113) and Lemma 1 imply (123) conditioned on Θ and X_{t,t} in the large system limit, with µ_t given by (49). The induction hypothesis (54) for τ < t implies that the first term converges almost surely to lim_{M=δN→∞} N^{-1}‖q_t‖² N^{-1}Tr(D). Thus, we complete the proof of (54) for τ' = τ = t by proving (125) in the large system limit. From (82), it is sufficient to prove (126) and (127). We only prove (127), since (126) can be proved in the same manner. Using the Cauchy-Schwarz inequality yields (128).

In order to prove (53) for τ = t, we repeat the same argument to obtain (129) conditioned on Θ and X_{t,t}, where we have used the induction hypothesis (53) for τ < t. Let us prove (55) and (56) for τ = t. Using (14), (28), (53), and (54), we obtain (130). The property (55) follows from (25) and (131) in the large system limit. Using (28), (53), (54), and Assumption 2, we find that the second term reduces to (59) for t' = τ'. Thus, (56) holds for τ = t.
Eq. (57) for τ = t: For τ' = t, we use (133) and Lemma 1 to obtain (134) conditioned on Θ and X_{t,t+1} in the large system limit, with ν_t given by (50). Applying the induction hypothesis (57) for τ < t and m_t = M_t α_t to the first term, we arrive at (57) for τ = τ' = t. Thus, (57) holds for τ = t.
Eq. (43) for τ = t: Let us prove (43) for τ = t. It is sufficient to prove that E[|η_t(q_{0,n} − h_{t,n})|⁴] < ∞ holds in the large system limit. We first confirm that (136) holds in the large system limit, with x̃^t_{n,A→B} = q_{0,n} − h^G_{t,n}, in which h^G_{t,n} is the nth element of h^G_t, recursively defined via (48) in the large system limit. Applying the induction hypothesis (52) repeatedly in the order τ = t − 1, …, 0, we arrive at (137). We next evaluate the RHS of (137). From (56), (57), and (59) for τ = τ' = t, as well as from the induction hypothesis (61) for τ = t − 1, we find that the random vector h^G_t induced from the randomness of Z_t = {z_τ : τ = 0, …, t} has i.i.d. proper complex Gaussian elements with vanishing mean and variance v^t_{A→B} in the large system limit. Repeating the proof of (43) for τ = 0, we obtain (138). Thus, (43) holds for τ = t.

Eq. (44) for τ = t: We only prove the existence of (44) for τ' ≤ τ = t, since the case τ' = t + 1 can be proved in the same manner. Using (27) yields (139). The induction hypothesis (44) for τ < t implies that the first term is convergent in the large system limit. In order to prove the existence of (44) for τ' ≤ τ = t, it is sufficient to confirm that (141) holds almost surely in the large system limit.

Repeating the proof of (101), we use Theorem 1 and (52) for τ = t to obtain (142) in the large system limit, where h̃_t is given by (48). Using Theorem 1 and the induction hypothesis (52) repeatedly in the order τ = t − 1, …, 0, we arrive at (142). Thus, (44) exists for τ' ≤ τ = t.

Eq. (58) for τ = t: We shall prove (58) for τ = t. From (132) and Lemma 4, we find (143) conditioned on Θ and X_{t,t+1} for τ'' ≤ t, which is almost surely equal to zero, because of (55) for τ = t. Thus, (58) holds for τ'' ≤ τ = t. For τ'' = t + 1, we use (27) to obtain (145), where we have used (58) for τ'' = 0 and τ = t. It is possible to prove (146) by repeating the proof of (142). Let us prove that the RHS of (146) is equal to zero. Since h^G_t has i.i.d. proper complex Gaussian elements with vanishing mean and variance v^t_{A→B} in the large system limit, we use Lemma 2 to find that the RHS of (146) is equal to zero. Thus, (58) holds for τ = t.

APPENDIX A PROOF OF THEOREM 1
Without loss of generality, we can assume that ℜ[X i ] ≥ 0 and ℑ[X i ] = 0 hold, i.e. X i is a non-negative real random variable, by dividing X i into the real and imaginary parts, and by further separating each part into the non-negative and negative parts.
We first prove the first statement. Let k_n = ⌈aⁿ⌉ for any a > 1. Using Chebyshev's inequality yields (148) for all ǫ > 0. From the assumption (5), there is a constant C > 0 such that k_n^{-γ} V[S_{k_n}] < C holds for all n ∈ N. Thus, we use aⁿ ≤ ⌈aⁿ⌉ to obtain (149) for γ < 2. From the Borel-Cantelli lemma, we arrive at (150). In order to prove lim_{k→∞} T_k a.s.= 0, for any integer k ≥ ⌈a⌉ we select n ∈ N such that k ∈ [k_n, k_{n+1}] holds. The monotonicity of S_n implies (151). Using (4) and (150), we arrive at lim sup_{k→∞} |T_k| a.s.≤ (a − 1) sup_i E[X_i] for all a > 1. Letting a tend to 1, the first statement holds.

The second statement can be proved in the same manner. Using the assumption (6), we obtain (152) instead of (149), which implies lim_{n→∞} S_{k_n}/k_n^β a.s.= 0. The second statement follows from the corresponding upper bound.

APPENDIX B PROOF OF LEMMA 1
We utilize Lemma 6, which gives the second and fourth moments of the elements of a Haar matrix, with distinct expressions for the cases n = n', m = m' and n ≠ n', m ≠ m'.
We shall prove Lemma 1. We only present the proof of the latter statement, since the former statement can be proved more straightforwardly. Without loss of generality, we can assume that D is diagonal, since V is bi-unitarily invariant.
Using Lemma 6 to evaluate the expectation of S_N over V yields the desired limit as N → ∞, because the existence of lim_{N→∞} N^{-1}Tr(D²) and Hölder's inequality imply that N^{-1}|Tr(D)| ≤ (N^{-1}Tr(D²))^{1/2} is bounded. We evaluate the second moment of S_N using Lemma 6 again. Thus, the variance of S_N over V follows, where we have used E_V[S_N] = N^{-1}Tr(D) b^H a. Note that the term including Σ_{n=0}^{N−1} |a_n|²|b_n|² vanishes for any N. As a result, we need no assumptions on Σ_{n=0}^{N−1} |a_n|⁴ for the case a = b. We use the continuous mapping theorem and the assumptions on a, b, and D to find that the variance tends to zero as N → ∞, which implies Lemma 1.

APPENDIX C DERIVATION OF MESSAGE-PASSING
EP [17], [23] provides a framework for deriving MP algorithms that calculate the marginal posterior distribution p(x_n|y, A) = ∫ p(x|y, A) dx_{\n}, in which x_{\n} is the vector obtained by eliminating x_n from x. We consider the large system limit to derive an MP algorithm, which coincides with the algorithm derived in a heuristic manner [21].
We approximate the marginal posterior distribution p(x_n|y, A) by a tractable probability density function (pdf) q_A(x_n) = ∫ q_A(x) dx_{\n}, given by

q_A(x) ∝ p(y|A, x) Π_{n'=0}^{N−1} q_{B→A}(x_{n'}).  (163)

In (163), the notation f(x) ∝ g(x) means that there is a positive constant C such that f(x) = Cg(x) holds. Furthermore, q_{B→A}(x_n) is a conjugate prior to the likelihood p(y|A, x). When the noise vector w in (1) is regarded as a CSCG random vector with covariance σ²I_M, the conjugate prior q_{B→A}(x_n) is proper complex Gaussian,

q_{B→A}(x_n) ∝ exp(−|x_n − x_{n,B→A}|²/v_{B→A}),  (164)

where x_{n,B→A} and v_{B→A} are the mean and variance of q_{B→A}(x_n), respectively. In order to derive the MP algorithm proposed in [21], we have selected the identical variance v_{B→A} for all n, while Céspedes et al. [17] selected different values for different n to improve the performance for finite-sized systems.
We first evaluate the marginal pdf q A (x n ) in the large system limit, defined via (163). Since the conjugate prior (164) has been selected, the joint pdf q A (x) is also Gaussian.
Its mean and covariance are given by (166) and (167), respectively. Using the matrix inversion lemma, it is possible to show that (166) and (167) reduce to (168) and (169), respectively, with Ξ defined via (170). We shall prove that a_n^H Ξ^{-1} a_n converges almost surely to γ(v_{B→A})^{-1} in the large system limit for all n, in which γ(v_{B→A}) is given by (15). Applying the SVD (9) to a_n^H Ξ^{-1} a_n, defined via (170), we have

a_n^H Ξ^{-1} a_n = e_n^H V D V^H e_n,  (171)

with D given by (172). In (171), e_n denotes the nth column of I_N. Thus, Lemma 1 and Assumption 2 imply that a_n^H Ξ^{-1} a_n converges almost surely to γ(v_{B→A})^{-1} in the large system limit.

This observation indicates that for any n the diagonal element (169) converges almost surely to (173) in the large system limit. Thus, the marginal pdf q_A(x_n) = ∫ q_A(x) dx_{\n} is the proper complex Gaussian pdf with mean (174) and variance (173). In order to present a crucial step in EP, we define the extrinsic pdf of x_n as

q_{A→B}(x_n) ∝ q_A(x_n)/q_{B→A}(x_n).  (175)

Let x_{n,B} and v_{n,B} denote the mean and variance of x_n with respect to the pdf p_B(x_n) ∝ q_{A→B}(x_n)p(x_n). The crucial step in EP is to update the message q_{B→A}(x_n) so as to satisfy the moment matching conditions (176) and (177) [23], where the expectations are taken with respect to the pdfs in (178). In (178), the updated pdf q^{new}_{B→A}(x_n) is given by (179) and (180).

We first derive module A. Using (164) and (174), we find that the extrinsic pdf (175) reduces to (181), with (182) and (183). Substituting (173) into (182) yields the update rule (12). Similarly, applying (168), (173), (182), and (183) to (181), we arrive at the update rule (11). We next evaluate the moment matching conditions (176) and (177) to derive module B. Substituting (179) and (180) into (178) yields (187). Using the moment matching conditions (176) and (177), we arrive at the update rules (19) and (20) in module B. Note that v_B given in (177) converges almost surely to its expectation in the large system limit.

APPENDIX D PROOF OF LEMMA 2
We utilize the following technical lemma.
Lemma 7: We define the cumulant generating function χ_t : C → R of the posterior distribution of x_n as in (190). Then, χ_t is twice continuously differentiable with respect to ℜ[z] and ℑ[z], and satisfies (191) and (192).
Proof: The former statement follows from Assumption 1 and the dominated convergence theorem. The latter statement is obtained by calculating the derivatives of χ_t directly.
Let us prove Lemma 2. We use (21) to obtain (193); thus, it is sufficient to prove (23).
For notational convenience, we write z^t_n as z. By definition, we have (194). Since ℜ[z] and ℑ[z] are independent Gaussian random variables with zero mean and variance v^t_{A→B}/2, using Stein's lemma [31] yields (195). Applying Lemma 7, we obtain (196), which tends to MMSE(v^t_{A→B}) as ǫ → 0.
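The complex form of Stein's lemma used here, E[f(Z)Z*] = v E[∂f/∂z] for Z ∼ CN(0, v) with the Wirtinger derivative ∂/∂z, is easy to verify by Monte Carlo; the test function below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(5)
v, n = 0.5, 1_000_000
z = np.sqrt(v / 2) * (rng.standard_normal(n) + 1j * rng.standard_normal(n))  # Z ~ CN(0, v)

# Test with f(z) = z |z|^2 = z^2 z*, whose Wirtinger derivative is df/dz = 2 |z|^2.
lhs = np.mean(z * np.abs(z) ** 2 * np.conj(z))     # E[f(Z) Z*]
rhs = v * np.mean(2 * np.abs(z) ** 2)              # v E[df/dz]
print(lhs.real, rhs)                               # both approximately 2 v^2 = 0.5
```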

APPENDIX E PROOF OF LEMMA 3
We first prove the following technical lemma.
Lemma 8: Suppose that Ṽ ∈ U_{N−t−t'} is a Haar matrix and independent of V, and that V^{t,t'}_{10} and V^{t,t'}_{01} in (62) are full rank for t > 0 and t' > 0. Let V^{0,1} = {V^{0,1}_{01}} and V^{t,t'} = {V^{t,t'}_{00}, V^{t,t'}_{01}, V^{t,t'}_{10}} for t > 0 and t' > 0. Then, the distribution of V^{t,t'}_{11} conditioned on V^{t,t'} satisfies (197), where V̄^{t,t'}_{11} for t > 0 and t' > 0 is given in Lemma 3, while the convention V̄^{0,1}_{11} = O is introduced.
Proof: Conditioning on V^{t,t'} is considered throughout the proof of Lemma 8. We first consider the case t > 0 and t' > 0. For notational convenience, we omit the scripts t and t'. Define two unitary matrices Φ ∈ U_N and Ψ ∈ U_N via the SVDs (63) and (65), and let V̂ = Φ^H V Ψ. Note that V̂ ∈ U_N holds. We next confirm that V̂_{11} has the block structure (203). In (203), Ṽ_{11} ∈ C^{t×t'} is given by (204), or equivalently by (205). Furthermore, Ṽ ∈ U_{N−t−t'} is a Haar matrix and independent of V. Note that Σ_{V_{10}} ∈ C^{t×t} and Σ_{V_{01}} ∈ C^{t'×t'} are invertible, since V_{10} and V_{01} are full rank. The structure (204) in the upper-left block follows from the orthogonality between the first t + t' rows of V̂. Similarly, (205) is due to the orthogonality between the first t + t' columns of V̂. It is possible to confirm the consistency of the two expressions (204) and (205) by using the properties V_{00}^H V_{00} + V_{10}^H V_{10} = I_t and V_{00} V_{00}^H + V_{01} V_{01}^H = I_{t'}. The structures in the off-diagonal blocks follow from the orthogonality between the first t columns and the last N − t − t' columns of V̂, and between the first t' rows and the last N − t − t' rows. The structure in the bottom-right block is due to the orthonormality of the last N − t − t' columns.
In order to complete the proof, we use (202) and (203) to obtain (206). Applying (204) and (205), as well as (63) and (65), we find that the first term reduces to (73) and (74), respectively. Thus, (197) holds. We next consider the case t = 0 and t' = 1, and omit the scripts t and t'. Let Φ be defined accordingly. Using the SVD (65) to calculate V̂ = Φ^H V Ψ_{V_{01}} yields (208). By using the orthogonality between the rows of V̂, it is straightforward to confirm that V̂ ∈ U_N has the structure (209), where Ṽ ∈ U_{N−1} is a Haar matrix and independent of V. Thus, we use V = Φ V̂ Ψ_{V_{01}}^H and the convention Φ⊥ V^{0,1}_{01} = I_{N−1} to arrive at (197).
Let us prove Lemma 3. We first assume t > 0 and t' > 0. Substituting the SVDs (64) and (66) into (30) and (32) yields (210) and (211). Since Σ_{Q_{t'}} and Σ_{M_t} are assumed to be invertible, conditioning on Θ and X_{t,t'} is equivalent to observing the first t columns and the first t' rows of V̂. Note that V̂ ∼ V holds, since the Haar matrix V is bi-unitarily invariant. Thus, we have (213) with V̂ = {V̂_{00}, V̂_{01}, V̂_{10}}, partitioned in the same manner as in (62). We shall prove that (213) is equivalent to (71). From (210) and (211), we have (214). To complete the proof, we let ǫ = (Q_1^H Q_1)^{-1} Q_1^H z̃, and prove that N^β |ǫ|² for any β < 1 converges almost surely to zero in the large system limit. From Theorem 1, it is sufficient to show that N^γ E[|ǫ|⁴] is bounded for some γ > 2β.
Let z ∼ CN(0, I_N). Then, we have (232) with G = Q_1 (Q_1^H Q_1)^{-2} Q_1^H ∈ C^{N×N}. Without loss of generality, we can assume that G is diagonal, since zz^H is unitarily invariant. Using moment properties of z, such as E[|z_n|⁴] = 2, yields

E_z[(z^H G z)²] = Tr(G²) + Tr²(G).
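This moment identity can also be checked by simulation; the sketch below draws standard complex Gaussian vectors and compares the empirical second moment of z^H G z with Tr(G²) + Tr²(G).

```python
import numpy as np

rng = np.random.default_rng(6)
N = 6
G = np.diag(rng.random(N))                  # WLOG diagonal, as in the text
z = (rng.standard_normal((200000, N)) + 1j * rng.standard_normal((200000, N))) / np.sqrt(2)
q = np.einsum('ij,j,ij->i', z.conj(), np.diag(G), z).real   # z^H G z for each sample
print(np.mean(q ** 2), np.trace(G @ G) + np.trace(G) ** 2)
```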
Selecting some γ ∈ (2β, 2) for any β < 1, we have (234), which converges almost surely to zero in the large system limit, because of the assumption in Lemma 5. Combining (232) and (234), we arrive at N^γ E[|ǫ|⁴] → 0 for γ ∈ (2β, 2) in the large system limit. We next consider the case t > 0.

the corresponding representation with Φ⊥_{Q_{t+1}} and Φ⊥_{H_t} must hold. Repeating the proof of (230) for Φ⊥ Ṽ a, we have (235) conditioned on a, Θ, and X_{t,t+1} in the large system limit, where z_N d→ CN(0, I_k) holds for all subsets N ∈ N_k. Evaluating the second term in the same manner as in the derivation of (229), we arrive at (89).