Reed–Muller Codes on BMS Channels Achieve Vanishing Bit-Error Probability for all Rates Below Capacity

This paper considers the performance of Reed–Muller (RM) codes transmitted over binary memoryless symmetric (BMS) channels under bitwise maximum-a-posteriori (bit-MAP) decoding. Its main result is that, for a fixed BMS channel, the family of binary RM codes can achieve a vanishing bit-error probability at rates approaching the channel capacity. This partially resolves a long-standing open problem that connects information theory and error-correcting codes. In contrast with the earlier result for the binary erasure channel, the new proof does not rely on hypercontractivity. Instead, it combines a nesting property of RM codes with new information inequalities relating the generalized extrinsic information transfer function and the extrinsic minimum mean-squared error.


I. INTRODUCTION
Reed-Muller (RM) codes have been the subject of considerable research since their introduction by Muller in [3] and their majority-logic decoding by Reed in [4]. Almost 70 years after their discovery, RM codes remain an active area of research in theoretical computer science and coding theory. In 2007, Costello and Forney described "the road to channel capacity" [5] and wrote that:

[I]n recent years it has been recognized that "RM codes are not so bad". RM codes are particularly good in terms of performance versus complexity with trellis-based decoding and other soft-decision decoding algorithms [...] Indeed, with optimum decoding, RM codes may be "good enough" to reach the Shannon limit on the AWGN channel. [...] It seems likely that the real coding gains of the self-dual RM codes with optimum decoding approach the Shannon limit [...], but to our knowledge this has never been proved.

We note that their observations preceded the introduction of polar codes [6] by roughly one year and, after polar codes, there was a significant renaissance in research on RM codes [7], [8]. This paper considers the performance of long RM codes transmitted over binary memoryless symmetric (BMS) channels under bitwise maximum-a-posteriori (bit-MAP) decoding and proves that their intuition was indeed correct.

This is an updated version of [1] where the title has been changed to match the journal version [2]. The work of G. Reeves and H. D. Pfister was supported in part by the National Science Foundation (NSF) under Grant Numbers 1718494, 1750362, and 1910571. Any opinions, findings, recommendations, and conclusions expressed in this material are those of the authors and do not necessarily reflect the views of these sponsors. G. Reeves is a member of the Department of Electrical and Computer Engineering and the Department of Statistical Science, Duke University (email: galen.reeves@duke.edu). H. D. Pfister is a member of the Department of Electrical and Computer Engineering and the Department of Mathematics, Duke University (email: henry.pfister@duke.edu).
For a BMS channel, the output sequence is generated by passing all symbols in the input sequence through independent identically-distributed channels whose noise processes do not depend on the input symbol. Some examples are the binary erasure channel (BEC), the binary symmetric channel (BSC), and the binary-input additive white Gaussian noise (BIAWGN) channel. The primary technical result can be summarized by the following theorem, which follows easily from Theorem 36.
Theorem 1. Consider any BMS channel with capacity C ∈ (0, 1). For every sequence of RM codes with strictly increasing blocklength and rate converging to R ∈ [0, C), the bit-error rate (BER) under bit-MAP decoding converges to zero.
This essentially settles a rather old question in coding theory and shows that binary Reed-Muller codes can achieve capacity on any BMS channel under bit-MAP decoding! We note that this conclusion was certainly more expected than its alternative because [9] already established this result for the special case of the BEC. We also note that our result falls short of the stronger condition that the block-error probability vanishes.
For a detailed discussion of relevant prior work until 2017, see [9]. Since then, there have been a few papers that address this question directly or indirectly [10], [11], [12]. In [10], new bounds on RM weight enumerators are introduced and used to prove that low-rate and high-rate RM codes can correct a large number of erasures and errors. A very different approach is pursued in [11] by treating RM codes from a polar coding perspective and showing that almost all of the effective channels polarize. This approach shows that RM codes must be close to a "twin code" that achieves capacity on BMS channels. Finally, the results of [12] very cleverly combine a number of earlier results (including the BEC result from [9]) to establish that the bit (and block) error probabilities of RM codes will vanish on general BMS channels but only for rates bounded away from capacity. For a good tutorial that covers RM codes and relevant prior work until 2020, we suggest [13].
The proof for the BEC case in [9] requires only linearity and doubly-transitive symmetry for the code. To achieve this, it relies on the sharp threshold property for symmetric boolean functions and the Extrinsic Information Transfer (EXIT) Area Theorem [9]. One new element in this work is that our proof also relies on the RM nesting property, which says that longer RM codes can be punctured down to shorter RM codes of the same order. But, this does not follow directly from the doubly-transitive symmetry of the code. Another difference between the new proof and [9] is highlighted by the fact that the new proof holds for the BEC but does not make use of hypercontractivity (which seems to be crucial for the boolean function result). Lastly, the new proof does not extend to all sequences of doubly-transitive codes nor does it imply that the block-error rate converges to 0. However, we are optimistic that an extension to block-error rate is possible, perhaps using techniques from [14], [15], [12].

A. Primary Contributions and Overview
The main contribution of this paper is to establish that RM codes can achieve capacity for any BMS channel. Our proof uses many ideas developed previously in the context of generalized EXIT (GEXIT) analysis. For example, we focus on a family of BMS channels and use the GEXIT area theorem. However, there are a number of steps in our proof that appear to be new. Since these steps may be of interest in their own right, we summarize them briefly here.
A major theme in this work is our focus on the impact of extra observations on the estimate of a single codeword bit (e.g., see Lemma 25). The precise form of the "extra observation" varies from place to place. In some cases it corresponds to a second look at a single position in the codeword and in other cases it corresponds to a second look at a collection of symbol positions. But the underlying idea is the same: an additional observation cannot make a meaningful difference in the ability to estimate the bit of interest if either of the following conditions is met: (i) the expected information from the first observation is very small so that a second independent observation is unlikely to tell us much more; or (ii) the expected information from the first observation is nearly maximal and a second independent observation cannot contribute much more. To fully utilize this observation, a crucial step is harnessing the nesting property of RM codes to provide a strong upper bound on the variance of the conditional mean of a codeword bit given the observation (e.g., see Section V-B2). In particular, we embed the RM code of interest in a longer RM code with a slightly lower rate and show that the two codes must behave very similarly for almost all channel noise parameters as the block length grows.
To make these arguments precise, one needs to compare the associated GEXIT functions with and without extra observations. While there are numerous functional properties associated with mutual information and entropy in the context of an additional observation, the challenge faced in our setting is that the GEXIT function corresponds to a difference in mutual information, and in this setting many of the usual properties no longer hold.
The technical tools that allow us to overcome this difficulty form a collection of generalized I-MMSE relations, which are introduced in Section IV-B. They allow us to bound the GEXIT function in terms of a quantity, called the extrinsic MMSE, which is easier to analyze. In particular, the extrinsic MMSE satisfies a data processing inequality and has a sub-additivity property, which follows as a natural consequence of the Efron-Stein inequality.
Here is a list of key elements in the proof along with brief descriptions:
• Lemma 9 describes the RM nesting property as used in the proof.
• Lemma 17 derives the two-look formula, which is the foundation for our GEXIT analysis.
• Lemma 20 defines a series expansion for the GEXIT formula and Lemma 21 uses the two-look formula to relate the GEXIT function to the extrinsic MMSE.
• Lemma 28 derives an integral constraint on the extrinsic MMSE function of RM codes that shows it must transition quickly from 0 to 1 as the blocklength increases.
• Lemma 35 uses the GEXIT area theorem to compare the transition point of the extrinsic MMSE to the capacity of the BMS channel.
• Theorem 36 proves the main result by deriving a nonasymptotic upper bound on BER of an RM code on any BMS channel.
For readers who are familiar with [9] and boolean functions, Appendix C discusses the method introduced in this paper, for the special case of the BEC, and compares it with the approach in [9].

B. Other Consequences
Our work also has some additional consequences when combined with other recent results:
• Since our main result shows that RM codes achieve capacity on the BSC, it follows that Quantum Reed-Muller (QRM) codes [25] achieve the hashing bound on a modified depolarizing channel where X and Z errors occur independently with the same probability [26, p. 568].
• Combined with the duality result of Renes for classical-quantum (CQ) channels [27], our result can also be applied to an RM code on a pure-state CQ channel.
Due to the sequential nature of the implied quantum detection model, however, the established decay rate for the bit-error probability is not immediately sufficient to guarantee a vanishing bit-error rate for all code symbols. But, for linear codes on a pure-state CQ channel, one can show that the optimal measurements for each bit actually commute [28], [29]. Thus, the bit-error probability is the same for all bits and does not depend on the decoding order! Hence, RM codes achieve a vanishing bit-error probability on the pure-state CQ channel for all rates below capacity.
• If one can prove the stronger result that the block error probability vanishes up to capacity, then one can apply Renes's quantum duality [27, Theorem 3] to show that RM codes achieve strong secrecy up to capacity for the pure-state wire-tap channel.

C. Open Problems
The results of this paper naturally suggest the following interesting open problems:
a) Block Error Rate: Can the main result be strengthened to show that the block error rate vanishes under the same conditions?
For RM codes on BMS channels, the question of when a vanishing BER implies a vanishing block error rate was addressed some years back in [14]. While simple arguments work if the BER decays faster than $d_{\min}/N$, their result shows that a much slower rate is actually sufficient. Unfortunately, the decay rate we achieve in this paper is still too slow.
Note: After this paper was accepted for publication, an arXiv preprint [30] was posted with an argument that RM codes achieve vanishing block error probability for all rates below capacity. This new work builds on our observation (see Lemmas 8 and 9) that long RM codes can be punctured, in many different ways, down to much shorter RM codes of roughly the same rate. Instead of using two looks to bound a variance (as we do), their proposed decoder first constructs, for each bit, a large set of weakly correlated looks and outputs their majority vote. Then, as a second stage, the decoder finds the nearest codeword to the first output sequence. The key technical achievement in [30] is showing that the BER of the first output sequence decays fast enough so that the second output equals the transmitted codeword with high probability.
b) Other Codes and Channels: Can this approach be extended to work for other codes and channels?
For example, there are known codes (e.g., multidimensional product codes and Berman codes [31]) which have a nesting property that is compatible with the approach in this paper, but issues arise because they are not doubly transitive. Also, there may be families of affine-invariant codes with a compatible nesting property (e.g., see [32]).
The extension to non-binary RM codes over symmetric non-binary channels is also interesting. Many of the key properties should generalize but it could be challenging to extend the theory of GEXIT functions. The class of symmetric binary-input classical-quantum channels also seems feasible but the extension of GEXIT functions would face similar challenges.
Note: After this paper was accepted for publication, a paper extending this result to non-binary RM codes over symmetric non-binary channels was accepted for publication [33], [34]. The new paper shows that non-binary RM codes achieve capacity on sufficiently symmetric non-binary channels with respect to symbol error rate. The new proof also simplifies the approach in this paper in a variety of ways that may be of independent interest.
Finally, it is interesting to consider whether the new techniques in this paper can be applied to study sharp threshold behaviors such as the "all-or-nothing phenomenon" arising in other high-dimensional structured inference problems [35], [36], [37], [38].

II. REED-MULLER CODES
A. Background
A length-$N$ binary code is a set $C \subseteq \{0,1\}^N$ of length-$N$ binary vectors called codewords. Such a code allows the transmission of $|C|$ different messages, each of which is associated with a codeword $c \in C$. A codeword $c = (c_0, \ldots, c_{N-1})$ is transmitted using a sequence of channel uses where the $i$-th code symbol, $c_i$, determines the input for the $i$-th channel use. The rate of the code $C$ is defined to be $\frac{1}{N}\log_2 |C|$. For a length-$N$ binary linear code $C$ with dimension $K$, the code rate equals $K/N$ and a generator matrix $G \in \mathbb{F}_2^{K \times N}$ defines an encoder $E \colon \mathbb{F}_2^K \to C$ that maps an information vector $u \in \mathbb{F}_2^K$ to a codeword via $u \mapsto uG$. The Reed-Muller code RM(r, m) is a binary linear code of length $N = 2^m$ and rate
$$R(r, m) = \frac{1}{2^m} \sum_{i=0}^{r} \binom{m}{i}. \tag{1}$$
Below, we introduce facts about RM codes as they are needed. For a thorough discussion, see [39], [40].
Example 2. Let $G_{r,m}$ be the generator matrix of RM(r, m). The generator matrix (in the standard RM order) of RM(1, 3) is given by
$$G_{1,3} = \begin{bmatrix} G_{1,2} & G_{1,2} \\ 0 & G_{0,2} \end{bmatrix} = \begin{bmatrix} 1&1&1&1&1&1&1&1 \\ 0&1&0&1&0&1&0&1 \\ 0&0&1&1&0&0&1&1 \\ 0&0&0&0&1&1&1&1 \end{bmatrix},$$
where
$$G_{1,2} = \begin{bmatrix} 1&1&1&1 \\ 0&1&0&1 \\ 0&0&1&1 \end{bmatrix}, \qquad G_{0,2} = \begin{bmatrix} 1&1&1&1 \end{bmatrix}.$$

RM codes can be described in many different ways. One way is via the one-to-one correspondence between the set of codewords in RM(r, m) and the set of $\mathbb{F}_2$-multilinear polynomials in $m$ indeterminates whose total degree is at most $r$. For this correspondence, the mapping from a polynomial $p$ to a codeword $c$ is given by evaluating $p$ at all points $v \in \mathbb{F}_2^m$. In particular, the $i$-th code symbol is given by $c_i = p(\tau^{-1}(i))$, where $\tau \colon \mathbb{F}_2^m \to \{0, 1, \ldots, 2^m - 1\}$ is the natural binary ordering $\tau(v) = \sum_{j=0}^{m-1} v_j 2^j$.

For $0 \le r \le m$, let $P_{r,m}$ be the vector space (over $\mathbb{F}_2$) of multilinear polynomials in $m$ indeterminates with degree at most $r$. This vector space is spanned by the subset of multilinear monomials
$$M_{r,m} = \Big\{ \prod_{i \in S} v_i : S \in S_{r,m} \Big\}, \qquad S_{r,m} = \{ S \subseteq \{0, \ldots, m-1\} : |S| \le r \}.$$
Thus, a polynomial $p \in P_{r,m}$ is defined by a set of coefficients $\{\alpha_S \in \mathbb{F}_2\}_{S \in S_{r,m}}$ with respect to the monomial basis and its evaluation is given by
$$p(v) = \sum_{S \in S_{r,m}} \alpha_S \prod_{i \in S} v_i. \tag{2}$$
This viewpoint can be unified with the generator matrix perspective by noting that, for $S \in S_{r,m}$, the coefficient $\alpha_S$ can be seen as an information bit that modulates the row in the generator matrix associated with the monomial $S$. In particular, that row can be computed by evaluating the monomial $S$ at all points in $\mathbb{F}_2^m$. To generate the codebook, one first enumerates all information vectors $u \in \mathbb{F}_2^K$ (or equivalently all polynomials in $P_{r,m}$) and then multiplies each by $G$ (or equivalently each polynomial is evaluated at the $2^m$ points in $\mathbb{F}_2^m$).

Example 3. Continuing Example 2, we observe that the only degree-0 monomial is 1. Thus, the first and only row of $G_{0,2}$ can be computed by evaluating $p(v) = 1$ for all $v \in \{0,1\}^2$. This also explains the first row of $G_{1,2}$. The second and third rows of $G_{1,2}$ are associated with evaluating $p(v) = v_0$ and $p(v) = v_1$ for all $v \in \{0,1\}^2$ in the order given by $\tau(v)$. Likewise, for $G_{1,3}$, the rows are associated with evaluating the monomials $\{1, v_0, v_1, v_2\}$, respectively, for all $v \in \{0,1\}^3$ with the order given by $\tau(v)$.
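The monomial-evaluation view can be sketched in a few lines of Python. This is an illustration rather than code from the paper: the function name `rm_generator_matrix` is hypothetical, and the snippet assumes the point ordering $\tau(v) = \sum_j v_j 2^j$ (so $v_0$ is the least significant bit of the symbol index), which is the ordering under which the rows of $G_{1,3}$ correspond to the monomials $\{1, v_0, v_1, v_2\}$ as described in Example 3.

```python
from itertools import combinations


def rm_generator_matrix(r, m):
    """Generator matrix of RM(r, m), one row per monomial of degree <= r.

    Assumption (for illustration): the code symbol with index i is the
    evaluation at the point v with v_j = (i >> j) & 1, i.e. tau(v) = sum_j v_j 2^j.
    Rows are listed by increasing degree, then lexicographically by variable set.
    """
    n = 1 << m
    rows = []
    for degree in range(r + 1):
        for subset in combinations(range(m), degree):
            # The monomial prod_{j in subset} v_j equals 1 iff all those bits are 1.
            # For the empty subset this is the constant monomial 1.
            rows.append([int(all((i >> j) & 1 for j in subset)) for i in range(n)])
    return rows


if __name__ == "__main__":
    for row in rm_generator_matrix(1, 3):
        print("".join(map(str, row)))
```

Each information bit $\alpha_S$ then selects whether the row for the monomial $S$ is added (mod 2) into the codeword.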
RM codes have many algebraic and combinatorial properties. One of these is a nesting property that will play a particularly important role in this work. To describe this property, we will consider a few different ways that C = RM(r, m) can be punctured down to the code RM(r, m − k).
Definition 4 (Punctured Code). For a length-$N$ binary code $C$ and a subset $I \subseteq [N]$, we denote by $C_I$ the punctured code formed by only keeping symbol indices with positions in $I$. Formally, we write
$$C_I = \big\{ (c_{i_0}, c_{i_1}, \ldots, c_{i_{|I|-1}}) : c \in C \big\},$$
where $i_0 < i_1 < \cdots < i_{|I|-1}$ are the elements of $I$ in increasing order.
Remark 5. One can imagine a puncturing operation that also includes the reordering of code bits. But, this is not needed for our results. So, we restrict our attention to the case where the bits whose indices are not in $I$ are punctured and the remaining bits are renumbered but kept in the same order.
The code C = RM(r, m) can be punctured down to RM(r, m − k) in multiple different ways. For example, there exist $I, I' \subset [2^m]$ such that $C_I$ and $C_{I'}$ are both equal to RM(r, m − k). We emphasize that we mean equality here (rather than equivalence) and this does depend on the ordering of the code bits. Fortunately, in the construction below, the correct bit order is given by enumerating $I$ (and $I'$) in increasing order and this agrees with our definition of $C_I$. We will see below that this statement follows naturally from two well-known properties of RM codes. The first property is encapsulated in the following lemma.
Lemma 6. For C = RM(r, m), let $V = \{ v \in \mathbb{F}_2^m : v_{m-k} = \cdots = v_{m-1} = 0 \}$ and $I = \tau(V)$. Then $C_I$ equals RM(r, m − k), and a uniform distribution on $C$ induces a uniform distribution on $C_I$.
Proof. To see this, we can split the monomials into two groups. Let the first set of monomials be the subset of $M_{r,m}$ that only contains the variables $v_0, \ldots, v_{m-k-1}$ and observe that this equals $M_{r,m-k}$. That means the second set, which contains all the rest, is given by $M' = M_{r,m} \setminus M_{r,m-k}$. The key observation is that the monomials in $M'$ all evaluate to 0 on the set $V$ because all points in $V$ satisfy $v_{m-k} = \cdots = v_{m-1} = 0$. Thus, for $c \in C$ and $i \in I$, only the monomials in $M_{r,m-k}$ contribute to the value of $c_i$. This implies that the codewords in $C_I$ are formed by evaluating the set of $\mathbb{F}_2$-multilinear polynomials in $m - k$ indeterminates whose total degree is at most $r$ at the points in $V$. Moreover, this notation orders the vector $c_I$ so that the evaluation at $v \in V$ appears before the evaluation at $v' \in V$ iff $\tau(v) < \tau(v')$. This is the natural binary ordering on $(v_0, \ldots, v_{m-k-1})$ and, hence, $C_I$ is precisely equal to RM(r, m − k). Another important point is that exactly $2^{|M'|} = |C|/|C_I|$ codewords in $C$ are mapped to each codeword in $C_I$. This holds because, if the information bits associated with $M_{r,m-k}$ are fixed, then the punctured codeword $c_I$ is fixed. But, by choosing the information bits associated with the monomials in $M'$, one can generate $2^{|M'|}$ distinct codewords in $C$ that share the same punctured codeword $c_I$.
The second property is that, for an invertible binary matrix $Q \in \mathbb{F}_2^{m \times m}$ and a vector $b \in \mathbb{F}_2^m$, the affine change of variables $\pi_{Q,b}(v) = Qv + b$ is a bijection on $\mathbb{F}_2^m$ that does not increase the degree of a polynomial. Thus, the set of all multilinear polynomials with degree at most $r$ is mapped to itself by this change of variables and the permutation $\tau(\pi_{Q,b}(\tau^{-1}(i)))$ defines an automorphism of the RM code in terms of symbol indices [39, p. 398]. Combining these two properties, one finds that, if an evaluation subset $V \subseteq \{0,1\}^m$ is an $\mathbb{F}_2$-subspace with dimension $m - k$, then there is an invertible binary matrix $Q$ (and hence a linear automorphism $\pi_{Q,0}$) that maps the set of points whose last $k$ entries are zero onto $V$. The sequence $\tau(\pi_{Q,0}(\tau^{-1}(i)))$ for $i = 0, \ldots, 2^{m-k} - 1$ then defines an ordered subset of indices that reveals an RM(r, m − k) code inside an RM(r, m) code. Moreover, each smaller code contains the code symbol $c_0$ from the larger code because all of these automorphisms map 0 to 0. This operation can be seen as a puncturing, according to our definition, if the ordered subset of indices is in increasing order.
For an RM(r, m) code, the number of information symbols is equal to
$$K = \sum_{i=0}^{r} \binom{m}{i}$$
because (2) implies that we can assign one information bit to each $\alpha_S$ for $S \in S_{r,m}$. This information symbol determines whether or not the monomial defined by $S$ is present in the associated polynomial. This justifies the rate formula in (1). Notice that the rate formula equals the cumulative distribution function (cdf) of a binomial random variable, with $m$ equiprobable trials, evaluated at $r$. For large $m$, the central limit theorem implies that $R(r, m)$ transitions from roughly 0.025 to roughly 0.975 as $r$ ranges from $\lfloor m/2 - \sqrt{m} \rfloor$ to $\lceil m/2 + \sqrt{m} \rceil$ because the standard deviation of the binomial is $\sqrt{m}/2$ and, for a Gaussian, roughly 95% of the probability lies within 2 standard deviations of the mean. It can also be useful to consider sequences of RM codes where the $n$-th code is $\mathrm{RM}(r_n, m_n)$ with $m_n \to \infty$ and $r_n = m_n/2 + \alpha\sqrt{m_n}/2 + o(\sqrt{m_n})$. In particular, for such sequences, the central limit theorem implies that $R(r_n, m_n) \to \Phi(\alpha)$, where $\Phi(\alpha) = (2\pi)^{-1/2} \int_{-\infty}^{\alpha} \exp(-z^2/2)\, dz$ is the cdf of a standard Gaussian random variable.
The above rate calculation appeared earlier in [9, Remark 24] and we mention it here for completeness. There, it is observed that, for any code rate R ∈ (0, 1), the rate calculation implies one can construct a sequence of RM codes with increasing m whose code rate converges to R.
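The Gaussian approximation of the rate can be checked numerically. The sketch below is only an illustration: the helper names `rm_rate` and `gaussian_cdf` and the choice $m = 400$ are assumptions, not part of the paper. It evaluates the binomial-cdf form of the rate exactly and compares it with $\Phi(\alpha)$ for $r = m/2 + \alpha\sqrt{m}/2$.

```python
from math import comb, erf, sqrt


def rm_rate(r, m):
    """Rate of RM(r, m): the binomial(m, 1/2) cdf evaluated at r."""
    return sum(comb(m, i) for i in range(r + 1)) / 2**m


def gaussian_cdf(x):
    """Standard Gaussian cdf, written with the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


if __name__ == "__main__":
    m = 400  # illustrative block-length exponent; any large m behaves similarly
    for alpha in (-2.0, -1.0, 0.0, 1.0, 2.0):
        r = round(m / 2 + alpha * sqrt(m) / 2)
        print(f"alpha={alpha:+.1f}  r={r}  R(r,m)={rm_rate(r, m):.4f}  "
              f"Phi(alpha)={gaussian_cdf(alpha):.4f}")
```

The two columns agree to within a few hundredths at this $m$; the residual gap reflects the discreteness of the binomial distribution and shrinks as $m$ grows.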

B. New Observations
The following lemma characterizes the change in code rate due to perturbations of the m parameter.
Lemma 8 (RM Rate Change). For the codes RM(r, m) and To bound the difference between the Gaussian cdfs, we can write where we use $\Phi'(z) \le 1/\sqrt{2\pi}$ in the first inequality and the fact that the integrand is non-increasing in the second inequality. Noting that $r \le m$, we can simplify to get the bound where the last step follows from $\sqrt{8\pi} > 5$.
The above observations have a surprising consequence that, as far as we know, has not been exploited previously. Notice that, if the code sequence $\mathrm{RM}(r_n, m_n)$ satisfies $R(r_n, m_n) \to R$ for $R \in (0, 1)$, then the rate of the code sequence $\mathrm{RM}(r_n, m_n + k_n)$ also converges to $R$ whenever $k_n = o(\sqrt{m_n})$. But, $\mathrm{RM}(r_n, m_n)$ can be formed from $\mathrm{RM}(r_n, m_n + k_n)$ by puncturing all but the first $2^{m_n}$ symbols. Thus, we have a code sequence whose rate converges to $R$ where throwing away a fraction $1 - 2^{-k_n}$ of the symbols gives another code sequence whose rate converges to $R$. This is quite surprising because one might expect that puncturing a significant fraction of the bits should increase the code rate by a significant amount.
Lemma 9. For an RM(r, m + k) code $C$ with $r \le m$ and $1 \le k \le m$, there are multiple distinct puncturing patterns that result in an RM(r, m) code. In particular, there are distinct subsets $I, I' \subseteq [2^{m+k}]$ such that $C_I = C_{I'} = \mathrm{RM}(r, m)$ and $C_{I \cap I'} = \mathrm{RM}(r, m - k)$, and a uniform distribution on RM(r, m + k) induces a uniform distribution on the punctured codes $C_I$ and $C_{I'}$. This construction is shown in Figure 2.
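Proof. Let $V = \{ v \in \mathbb{F}_2^{m+k} : v_m = \cdots = v_{m+k-1} = 0 \}$ (i.e., all length-$(m + k)$ binary vectors where the last $k$ entries are zero) be the subspace of $\mathbb{F}_2^{m+k}$ spanned by the first $m$ canonical basis vectors, and let $I = \tau(V)$ be the associated set of codeword indices. This implies that $C_I$ is given by evaluating the set of degree at most $r$ polynomials in $m + k$ variables on the subset $V$. Using Lemma 6, we see that $C_I$ is equal to RM(r, m).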
Similarly, let $V' \subseteq \mathbb{F}_2^{m+k}$ be the subspace spanned by the first $m - k$ and last $k$ canonical basis vectors. For the associated set of codeword indices, $I' = \tau(V')$, this implies that $C_{I'}$ is given by evaluating the set of degree at most $r$ polynomials in $m + k$ variables on the subset $V'$. Since $V'$ is a subspace of $\mathbb{F}_2^{m+k}$, the automorphism argument in the previous section implies that $C_{I'}$ is equivalent to RM(r, m) (i.e., equal up to bit order). The code $C_{I'}$ will equal an RM(r, m) code if the implied order of the evaluation points equals the natural binary ordering on $(v_0, \ldots, v_{m-k-1}, v_m, \ldots, v_{m+k-1}) \in \mathbb{F}_2^m$. Fortunately, the natural binary ordering for $V'$ is mapped to an increasing sequence of integer indices in $I' = \tau(V')$ and this is the ordering used by our definition of the punctured code. Thus, $C_{I'}$ equals RM(r, m). We note that the condition $k \ge 1$ is necessary to avoid the $k = 0$ case where $I = I'$.
Next, let $T = I \cap I' = \tau(V \cap V')$, where $V \cap V'$ is the subspace spanned by the first $m - k$ canonical basis vectors. Similar to the argument above, this shows that $C_T$ is equal to RM(r, m − k). Then, we can define the set $A = T \setminus \{0\}$ to be the non-zero overlap, the set $B = I \setminus T$ to be the indices needed to complete $I$, and the set $C = I' \setminus T$ to be the indices needed to complete $I'$. Finally, as noted in Lemma 6, a uniform distribution on the code generates a uniform distribution on $C_I$ (and hence $C_{I'}$).
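For small parameters, the two puncturing patterns described in the proof can be verified by brute force. The following Python sketch is an illustration under stated assumptions: the helper names (`rm_codewords`, `puncture`, `preimage_counts`, `span_indices`) are hypothetical, and symbol indices are assumed to follow $\tau(v) = \sum_j v_j 2^j$. It enumerates all codewords of RM(r, m + k), keeps the symbols indexed by the subspace with the last $k$ coordinates equal to zero and by the subspace spanned by the first $m - k$ and last $k$ canonical basis vectors, and compares the results with RM(r, m).

```python
from collections import Counter
from itertools import combinations, product


def rm_codewords(r, m):
    """All codewords of RM(r, m) as a set of tuples (brute force, small m only)."""
    n = 1 << m
    rows = [
        [int(all((i >> j) & 1 for j in s)) for i in range(n)]
        for d in range(r + 1)
        for s in combinations(range(m), d)
    ]
    return {
        tuple(sum(a * g[i] for a, g in zip(u, rows)) % 2 for i in range(n))
        for u in product((0, 1), repeat=len(rows))
    }


def puncture(code, keep):
    """Keep only the positions in `keep`, in increasing index order."""
    keep = sorted(keep)
    return {tuple(c[i] for i in keep) for c in code}


def preimage_counts(code, keep):
    """How many codewords map onto each punctured codeword."""
    keep = sorted(keep)
    return Counter(tuple(c[i] for i in keep) for c in code)


def span_indices(coords):
    """Sorted indices tau(v) of all points v supported on the coordinates `coords`."""
    coords = list(coords)
    return sorted(
        sum(b << j for b, j in zip(bits, coords))
        for bits in product((0, 1), repeat=len(coords))
    )


if __name__ == "__main__":
    r, m, k = 1, 2, 1
    big = rm_codewords(r, m + k)
    I = span_indices(range(m))                                   # last k coordinates zero
    I2 = span_indices(list(range(m - k)) + list(range(m, m + k)))
    T = sorted(set(I) & set(I2))
    print("I =", I, " I' =", I2, " T =", T)
    print("C_I  == RM(r, m):  ", puncture(big, I) == rm_codewords(r, m))
    print("C_I' == RM(r, m):  ", puncture(big, I2) == rm_codewords(r, m))
    print("C_T  == RM(r, m-k):", puncture(big, T) == rm_codewords(r, m - k))
    print("uniform preimages: ", len(set(preimage_counts(big, I).values())) == 1)
```

For $(r, m, k) = (1, 2, 1)$ the index sets are $\{0, 1, 2, 3\}$ and $\{0, 1, 4, 5\}$, which are the two looks used in Example 10. The equal preimage counts are what make a uniform distribution on the longer code induce a uniform distribution on each punctured code.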
The following example hints at how this property can be exploited to analyze RM codes.
Example 10. Continuing Example 7, consider the case where a random codeword $X = (X_0, \ldots, X_7) \in \mathrm{RM}(1, 3)$ is transmitted over a BEC and received as $Y = (Y_0, \ldots, Y_7) \in \{0, 1, ?\}^8$. Then, one can estimate $X_0$ from $Y_1, Y_2, Y_3$ using only the fact that $(X_0, X_1, X_2, X_3) \in \mathrm{RM}(1, 2)$. One can also estimate $X_0$ from $Y_1, Y_4, Y_5$ using only the fact that $(X_0, X_1, X_4, X_5) \in \mathrm{RM}(1, 2)$. In this case, $X_0$ will be recovered if either of these estimates is not an erasure. Moreover, the performance given by combining the two estimates is strictly better than that given by either single estimate unless the two estimates are perfectly correlated. See Figure 1 for a graphical representation of this construction.
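The gain from the second look in Example 10 can be quantified exactly by enumerating erasure patterns. The sketch below is illustrative only: the names `CODE`, `bit0_recoverable`, and `success_probability` are hypothetical, and the indexing $i = v_0 + 2v_1 + 4v_2$ is an assumption. It uses the fact that, for a linear code on the BEC, $X_0$ is determined by the unerased symbols exactly when every codeword that vanishes on those positions also has $c_0 = 0$.

```python
from itertools import product

# RM(1, 3) codewords: c_i = a + b*v0 + c*v1 + d*v2 (mod 2), with i = v0 + 2*v1 + 4*v2.
CODE = [
    tuple((a + b * (i & 1) + c * ((i >> 1) & 1) + d * ((i >> 2) & 1)) % 2
          for i in range(8))
    for a, b, c, d in product((0, 1), repeat=4)
]


def bit0_recoverable(observed):
    """True iff X_0 is determined by the symbols at the positions in `observed`."""
    return all(c[0] == 0 for c in CODE if all(c[i] == 0 for i in observed))


def success_probability(looks, eps):
    """Probability that at least one look recovers X_0 on a BEC with erasure rate eps.

    Each look is a set of positions; a look only uses its own unerased positions.
    """
    total = 0.0
    for pattern in product((0, 1), repeat=7):  # 1 means position i + 1 was received
        received = {i + 1 for i, kept in enumerate(pattern) if kept}
        prob = 1.0
        for kept in pattern:
            prob *= (1.0 - eps) if kept else eps
        if any(bit0_recoverable(received & look) for look in looks):
            total += prob
    return total


if __name__ == "__main__":
    eps = 0.5
    print("look {1,2,3} only:", success_probability([{1, 2, 3}], eps))
    print("look {1,4,5} only:", success_probability([{1, 4, 5}], eps))
    print("both looks:       ", success_probability([{1, 2, 3}, {1, 4, 5}], eps))
```

Because both looks share position 1, they are correlated but not perfectly so, and the combined success probability lies strictly between that of a single look and twice that value.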
Since there are multiple copies of RM(r, m) embedded inside of RM(r, m + k) with the same bit 0, one can utilize two of them separately to compute estimates of bit 0 based on the RM(r, m) code structure. These two looks can be combined to get a better estimate of bit 0. Unless they are perfectly correlated, they will actually provide a strict improvement over one estimate. The interplay between the rate difference, R(r, m) − R(r, m + k), and the two-look phenomenon plays a key role in this work.
a) Discussion of the code rate after puncturing: The above results show that it is possible to remove (i.e., puncture) half of the code symbols from an RM(r, m + 1) code to get an RM(r, m) code and this puncturing increases the code rate by less than $2/\sqrt{m}$. The original code has $2^{2^{m+1} R(r, m+1)}$ codewords and the punctured code has $2^{2^m R(r, m)}$ codewords. This implies that, on average, roughly $2^{2^{m+1} R(r, m+1) - 2^m R(r, m)} \approx 2^{2^m R(r, m)}$ codewords in the original code must collapse onto a single codeword in the punctured code. For a length-$N$ binary code $C$, an information set is a subset $I \subseteq [N]$ such that $|C_I| = |C| = 2^{|I|}$ [42]. It follows that the number of codewords is unchanged if one punctures a set of symbols whose indices are disjoint from a fixed information set. Thus, in our example, if the code rate is less than 1/2 and we puncture a set of $N/2$ code symbols that is disjoint from a fixed information set, then the code rate must increase by a factor of 2. This is because the number of distinct codewords is unchanged but the code length is reduced by a factor of 2.
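The claim that puncturing RM(r, m + 1) down to RM(r, m) raises the rate by less than $2/\sqrt{m}$ can be checked directly. The sketch below is illustrative: the helper name `rm_rate_exact` is an assumption, and the closed form $R(r, m) - R(r, m+1) = \binom{m}{r}/2^{m+1}$ used in the comparison is a consequence of Pascal's rule that we state here rather than a formula taken from the paper.

```python
from fractions import Fraction
from math import comb, sqrt


def rm_rate_exact(r, m):
    """Exact rate of RM(r, m) as a fraction (binomial cdf at r)."""
    return Fraction(sum(comb(m, i) for i in range(r + 1)), 2**m)


def rate_increase(r, m):
    """Rate gained by puncturing RM(r, m + 1) down to RM(r, m)."""
    return rm_rate_exact(r, m) - rm_rate_exact(r, m + 1)


if __name__ == "__main__":
    for m in (4, 16, 64, 256):
        worst_r = max(range(m + 1), key=lambda r: rate_increase(r, m))
        worst = rate_increase(worst_r, m)
        # Pascal's rule gives the closed form comb(m, r) / 2^(m+1).
        assert worst == Fraction(comb(m, worst_r), 2 ** (m + 1))
        print(f"m={m:4d}  worst r={worst_r:4d}  "
              f"increase={float(worst):.5f}  2/sqrt(m)={2 / sqrt(m):.5f}")
```

The largest increase occurs near $r = m/2$, where the binomial coefficient peaks, and it decays on the order of $1/\sqrt{m}$.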
One can also ask whether random linear codes have some of the same properties. For now, we will ignore the fact that such codes have a very small probability of being transitive and focus only on the rate after puncturing. For the random generator matrix model, applying a fixed puncturing pattern to a random generator matrix, with design rate $R$ and length $N$, simply gives a random generator matrix from the same ensemble with length $N'$ and design rate $R' = RN/N'$. Thus, the existence of a puncturing pattern that nearly preserves the rate can be related to the concentration of the code rate around its expected value. For this model with design rate $R'$ and any $\delta > 0$, one can show that the probability that the true rate, for a fixed puncturing pattern, is less than $R' - \delta$ is upper bounded by $2^{h_b(\delta) N - \delta R' N^2}$. Since there are $\binom{N}{N'} \le 2^N$ puncturing patterns, the union bound implies that, with high probability as $N$ increases, there is no puncturing pattern that results in rate smaller than $R' - \delta$. In particular, the probability that one exists is upper bounded by $2^{(h_b(\delta) + 1) N - \delta R' N^2}$, which goes to zero as $N$ increases.
Still, RM codes are not alone with this property. One can show that certain sequences of multidimensional product codes also have this property. There is also a generalization of RM codes, known as Berman codes, that have this property [31]. It would be interesting to study whether or not there are other algebraic code constructions with this property.
It is worth noting that a property related to the above discussion was recently discussed in the context of successive-cancellation list decoding for polar-like codes [32]. In that work, it is shown that an affine-invariant code with rate $R$ and length $2^{m+1}$ can be transformed into a code of length $2^m$ whose rate $R'$ approaches $R$ for large $m$. When applied to RM codes, the transformation process they consider is a mapping from RM(r, m + 1) to RM(r − 1, m). Thus, some of the ideas in [32] might provide an avenue for extending some of our results to affine-invariant codes.

A. Binary Memoryless Symmetric Channels
An information channel $W$ is defined by an input alphabet $\mathcal{X}$, an output alphabet $\mathcal{Y}$, and a transition probability that maps elements of the input alphabet to probability measures on the output alphabet. We follow the convention of representing the transition probability using a density function $w(y \mid x)$ with respect to a base measure on the output alphabet (e.g., counting measure if the output distribution is discrete or Lebesgue measure if $\mathcal{Y} = \mathbb{R}^d$ and the output distribution is continuous). In this paper, we restrict our attention to the channels satisfying the following.
Definition 11 (BMS Channel [43, p. 178]). A channel with binary input alphabet $\mathcal{X} = \{\pm 1\}$ and output alphabet $\mathcal{Y} = \mathbb{R}$ is said to be symmetric if the transition probability satisfies $w(y \mid +1) = w(-y \mid -1)$ for all $y \in \mathcal{Y}$. A binary memoryless symmetric (BMS) channel consists of a sequence of channel uses such that:
• Each channel use has binary input alphabet {±1} and is symmetric.
• Conditional on the input to the i-th channel, the output of the i-th channel is independent of all of the other channel uses.
Every channel satisfying Definition 11 can be expressed as a multiplicative noise channel (though this representation is not unique). Specifically, if the input is a random vector $X \in \{\pm 1\}^N$ then the output $Y \in \mathcal{Y}^N$ is given by
$$Y = X \odot Z, \tag{3}$$
where $\odot$ denotes the Hadamard (entrywise) product and $Z \in \mathcal{Z}^N$ is an independent random vector whose entries are independent with $Z_i$ drawn according to the distribution of the output in the $i$-th channel given the input is +1 [43, p. 182]. With a few exceptions, we will assume that the $\{Z_i\}$ are identically distributed, and thus each channel is described by the same transition probability.
Examples of BMS channels include the following:
• Binary erasure channel (BEC) where $Z_i \in \{0, 1\}$ transmits the input faithfully if $Z_i = 1$ and outputs an erasure if $Z_i = 0$.
• Binary symmetric channel (BSC) where $Z_i \in \{\pm 1\}$ transmits the input faithfully if $Z_i = 1$ and flips the input if $Z_i = -1$.
• Additive white Gaussian noise (AWGN) channel where $Z_i \sim \mathcal{N}(1, \sigma_i^2)$ for some noise power $\sigma_i^2$.
While the definition of BMS channels can be extended to output alphabets beyond $\mathbb{R}$ (e.g., which do not satisfy the multiplicative noise decomposition above), it turns out that every BMS channel defined in the more general sense has a real sufficient statistic that satisfies the definition given above. So there is no loss of generality in restricting our attention to channels satisfying (3). See Appendix A for details.
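The multiplicative-noise view of the three example channels can be sketched as a small simulator. This is an illustration under stated assumptions: the function name `bms_output`, the channel labels, and the convention that the parameter is the erasure probability (BEC), the crossover probability (BSC), or the noise standard deviation $\sigma$ (AWGN) are all choices made here, not notation from the paper. An erased symbol is represented by the real value 0, consistent with the output alphabet being the real line.

```python
import random


def bms_output(x, channel, param, rng):
    """Pass x in {+1, -1}^N through a BMS channel as Y = X (entrywise product) Z.

    channel == "BEC":  Z_i = 0 with probability param (erasure), else 1.
    channel == "BSC":  Z_i = -1 with probability param (flip), else 1.
    channel == "AWGN": Z_i is Gaussian with mean 1 and standard deviation param.
    """
    if channel == "BEC":
        z = [0 if rng.random() < param else 1 for _ in x]
    elif channel == "BSC":
        z = [-1 if rng.random() < param else 1 for _ in x]
    elif channel == "AWGN":
        z = [rng.gauss(1.0, param) for _ in x]
    else:
        raise ValueError(f"unknown channel: {channel}")
    return [xi * zi for xi, zi in zip(x, z)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.choice((-1, 1)) for _ in range(10)]
    print("x    :", x)
    print("BEC  :", bms_output(x, "BEC", 0.3, rng))
    print("BSC  :", bms_output(x, "BSC", 0.1, rng))
    print("AWGN :", [round(y, 2) for y in bms_output(x, "AWGN", 0.8, rng)])
```

In each case the noise $Z$ is drawn independently of the input, which is the sense in which the noise process does not depend on the transmitted symbol.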

B. MMSE and Bit Error Rate
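For a binary random variable $X \in \{\pm 1\}$ and an observation $Y$ defined on the same probability space, the minimum mean-squared error (MMSE) in estimating $X$ from $Y$ is given by
$$\mathrm{mmse}(X \mid Y) := \big\| X - \mathbb{E}[X \mid Y] \big\|_2^2 = 1 - \big\| \mathbb{E}[X \mid Y] \big\|_2^2, \tag{4}$$
where the $L^p$ norm of a random variable is denoted by $\|X\|_p := (\mathbb{E}|X|^p)^{1/p}$. The second expression is obtained from the first by expanding the square, using nested conditioning, and noting that $\|X\|_2^2 = 1$. More generally, for a Markov chain $X - Y - Z$, we have the useful identities
$$\big\| \mathbb{E}[X \mid Y] - \mathbb{E}[X \mid Z] \big\|_2^2 = \big\| \mathbb{E}[X \mid Y] \big\|_2^2 - \big\| \mathbb{E}[X \mid Z] \big\|_2^2 \tag{5a}$$
$$= \mathrm{mmse}(X \mid Z) - \mathrm{mmse}(X \mid Y), \tag{5b}$$
where (5a) follows from expanding the square and then using the nested expectation $\mathbb{E}\big[\mathbb{E}[X \mid Z]\,(\mathbb{E}[X \mid Y] - \mathbb{E}[X \mid Z])\big] = 0$, and (5b) follows from (4).

Another performance metric of interest is the bit-error probability of the MAP decision rule $\phi(y) = \arg\max_{x \in \{\pm 1\}} \mathbb{P}(X = x \mid Y = y)$, which is given by
$$\mathrm{BER}(X \mid Y) := \mathbb{P}(X \ne \phi(Y)) = \frac{1 - \| \mathbb{E}[X \mid Y] \|_1}{2}.$$
Here, the second expression is a consequence of the relationship between the conditional probability and the conditional mean defined by $\mathbb{P}(X = x \mid Y) = (1 + x\, \mathbb{E}[X \mid Y])/2$. See the proof of Lemma 12 for details.

For digital communication systems with error-correcting codes, the bit-error probability (of codeword bits) after decoding is an important performance metric. The system may be considered reliable if this probability can be made arbitrarily small. A more stringent requirement is that the block-error probability (i.e., the probability that any bit in the codeword is not correct) can be made arbitrarily small. In this work, we focus on the bit-error rate.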
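Lemma 12. For a binary random variable $X \in \{\pm 1\}$ observed as $Y$, the quantities $\mathrm{mmse}(X \mid Y)$ and $\mathrm{ber}(X \mid Y)$ satisfy
$$\frac{1 - \sqrt{1 - \mathrm{mmse}(X \mid Y)}}{2} \le \mathrm{ber}(X \mid Y) \le \frac{1}{2}\,\mathrm{mmse}(X \mid Y).$$
Thus, for a sequence of observations of $X$, the BER converges to 0 (respectively $1/2$) if and only if the MMSE converges to 0 (respectively 1).
Proof. See proof in Section VI-A.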
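The relationship between the MMSE and the BER can be checked numerically on simple channels. The sketch below assumes that the bounds of Lemma 12 take the form $(1 - \sqrt{1 - \mathrm{mmse}})/2 \le \mathrm{ber} \le \mathrm{mmse}/2$, which follows from $m^2 \le |m|$ for $|m| \le 1$ and from Jensen's inequality; the helper names and the channel parameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def mmse_and_ber(cond_means, probs):
    """MMSE and MAP bit-error rate from the law of m = E[X | Y]."""
    m = np.asarray(cond_means, dtype=float)
    p = np.asarray(probs, dtype=float)
    mmse = 1.0 - np.sum(p * m**2)               # 1 - ||E[X|Y]||_2^2
    ber = 0.5 * (1.0 - np.sum(p * np.abs(m)))   # (1 - ||E[X|Y]||_1) / 2
    return float(mmse), float(ber)

def lemma12_bounds(mmse):
    """Assumed bounds: (1 - sqrt(1 - mmse)) / 2 <= ber <= mmse / 2."""
    return 0.5 * (1.0 - np.sqrt(1.0 - mmse)), 0.5 * mmse

# Uniform input through a BSC with crossover probability p: E[X|Y] = +/-(1 - 2p).
p = 0.1
mmse_bsc, ber_bsc = mmse_and_ber([1 - 2 * p, -(1 - 2 * p)], [0.5, 0.5])

# Uniform input through a BEC with erasure probability eps: E[X|Y] is 0 or +/-1.
eps = 0.3
mmse_bec, ber_bec = mmse_and_ber([0.0, 1.0, -1.0], [eps, (1 - eps) / 2, (1 - eps) / 2])

for name, (m, b) in {"BSC": (mmse_bsc, ber_bsc), "BEC": (mmse_bec, ber_bec)}.items():
    lo, hi = lemma12_bounds(m)
    print(f"{name}: mmse={m:.4f}  {lo:.4f} <= ber={b:.4f} <= {hi:.4f}")
```

In this example the BSC meets the lower bound with equality and the BEC meets the upper bound with equality, so neither side of the assumed inequality can be improved in general.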

C. Generalized Extrinsic Information Transfer Functions
The generalized extrinsic information transfer (GEXIT) function [43], [44] provides a powerful tool for the analysis of communication problems. This section briefly reviews some of the main ideas used in the analyses of GEXIT functions as well as some related concepts involving I-MMSE relations.
Rather than focusing on a specific information channel $W$, the main object of interest is a family of channels $\{W(t)\}$ indexed by a real-valued parameter $t \in [0, 1]$, where each $W(t)$ represents a channel from a common input alphabet $\mathcal{X}$ (not necessarily binary) to a common output alphabet $\mathcal{Y}$. For concreteness, it is assumed throughout that $t = 0$ is a perfect channel (i.e., the input is determined uniquely by the output) and $t = 1$ is an uninformative channel (i.e., the output is independent of the input). For a given number of channel uses $N$, the problem is described as follows:
• The input $X \in \mathcal{X}^N$ is a random vector with distribution $p_X$. For communication systems, this is typically the uniform distribution over a subset of the input space defined by a code.
• The output $Y \in \mathcal{Y}^N$ is an observation of $X$ through a memoryless channel where each $Y_i$ is an observation of $X_i$ through the channel $W(t_i)$ for some $t_i \in [0, 1]$. In some cases, we use the notation $Y(t_0, \dots, t_{N-1})$ to make the dependence on the channel parameters explicit. If instead all channel parameters take the common value $t$, then we use the notation $Y(t)$.
Once the input distribution and the family of channels have been specified, the high-level idea is to study how certain quantities, such as the entropy and the bit error rate, depend on the underlying channel parameters. Under the assumptions on the channel family outlined above, the conditional entropy of the input given the output satisfies the boundary conditions $H(X \mid Y(0)) = 0$ and $H(X \mid Y(1)) = H(X)$. If we also assume that the family of channels depends smoothly on the parameter $t$ (e.g., that the mapping $(t_0, \dots, t_{N-1}) \mapsto H(X \mid Y(t_0, \dots, t_{N-1}))$ is continuously differentiable), then we can use the fundamental theorem of calculus and the law of the total derivative to obtain the following decomposition:
$$H(X) = \int_0^1 \frac{d}{dt} H(X \mid Y(t))\, dt$$
$$\phantom{H(X)} = \int_0^1 \sum_{i=0}^{N-1} \frac{\partial}{\partial t_i} H(X \mid Y(t_0, \dots, t_{N-1})) \Big|_{t_0 = \cdots = t_{N-1} = t}\, dt, \tag{6}$$
where in the second line, the partial derivative is taken with respect to the parameter in the $i$-th channel.
There are two special cases where the partial derivatives in (6) can be recognized as measures of uncertainty associated with the $i$-th entry of the input:
• Erasure: Consider the family of erasure channels where $\mathcal{Y} = \mathcal{X} \cup \{?\}$ and the probability of erasure is equal to $t$. In this case, it is straightforward to show that the $i$-th partial derivative in (6) is equal to $H(X_i \mid Y_{\sim i}(t))$, where the subscript $\sim i$ denotes the subvector with the $i$-th element omitted. The mapping described by $t \mapsto H(X_i \mid Y_{\sim i}(t))$ is called the extrinsic information transfer (EXIT) function and it has found many uses in the literature [45], [46], [47], [9].
• AWGN: Consider the family of AWGN channels defined by $Y_i = \sqrt{\mathrm{snr}(t)}\, X_i + W_i$, where the $W_i$ are i.i.d. standard Gaussian noise. We assume that $\mathrm{snr}(t)$ is a non-increasing function of $t$ with $\mathrm{snr}(t) \to \infty$ as $t \to 0$ and $\mathrm{snr}(1) = 0$. In this case, it follows from the I-MMSE relation [48] that the $i$-th partial derivative in (6) is equal to $-\tfrac{1}{2}\,\mathrm{snr}'(t)\,\mathrm{mmse}(X_i \mid Y(t))$ when the entropy is measured in nats, where $\mathrm{snr}'(t)$ is the derivative of $\mathrm{snr}(t)$ and $\mathrm{mmse}(X_i \mid Y(t))$ is the minimum mean-squared error of the $i$-th input. This relationship has played a key role in the analysis of coding problems as well as high-dimensional inference problems involving Gaussian noise [49], [50], [51], [52], [53].
Going beyond the BEC and AWGN, the partial derivatives in (6) associated with a general channel no longer have such a simple interpretation. Nevertheless, many of the ideas developed in the context of the BEC and AWGN cases are still applicable. Historically, the idea of GEXIT functions was introduced and developed by Méasson, Montanari, Richardson, and Urbanke in [44].
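For the erasure case, the decomposition of the entropy into integrals of EXIT functions can be verified exactly on small codes. The following sketch is an illustration under stated assumptions: the input is uniform on a binary linear code given by a generator matrix, the helper names (`gf2_rank`, `columns_of`, `exit_area`) are hypothetical, and the integral over $t$ is evaluated in closed form with the Beta integral $\int_0^1 t^e (1-t)^{n-1-e}\,dt = e!\,(n-1-e)!/n!$.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def gf2_rank(vectors):
    """Rank over GF(2) of a list of integers read as bit vectors."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)   # clear the leading bit of b from v if it is set
        if v:
            basis.append(v)
    return len(basis)

def columns_of(G):
    """Columns of a generator matrix (list of rows), packed into integers."""
    return [sum(row[j] << r for r, row in enumerate(G)) for j in range(len(G[0]))]

def exit_area(columns, i):
    """Exact integral over t in [0, 1] of H(X_i | Y_~i(t)) on the BEC."""
    n = len(columns)
    others = [j for j in range(n) if j != i]
    area = Fraction(0)
    for e in range(n):                      # e = number of erased positions
        weight = Fraction(factorial(e) * factorial(n - 1 - e), factorial(n))
        for erased in combinations(others, e):
            known = [columns[j] for j in others if j not in erased]
            # X_i is still uniform (1 bit) unless its column lies in the
            # span of the columns of the unerased positions.
            if gf2_rank(known + [columns[i]]) > gf2_rank(known):
                area += weight
    return area

SPC = [[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]   # [4,3] single parity check
REP = [[1, 1, 1]]                                   # [3,1] repetition
for name, G in [("single parity check", SPC), ("repetition", REP)]:
    cols = columns_of(G)
    areas = [exit_area(cols, i) for i in range(len(cols))]
    print(name, [str(a) for a in areas], "sum =", sum(areas), "dimension =", len(G))
```

In both examples the areas sum to the code dimension, i.e., to $H(X)$ in bits, and each individual area equals the code rate because these two codes are invariant under every permutation of the coordinates.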
Definition 13 (GEXIT function). Let $X \in \mathcal{X}^N$ be a random vector with distribution $p_X$ and let $Y(t_0, \dots, t_{N-1}) \in \mathcal{Y}^N$ be an observation of $X$ through a memoryless channel where each $Y_i$ is an observation of $X_i$ through the channel $W(t_i)$.
The GEXIT function $G_i \colon [0, 1] \to \mathbb{R}$ for entry $i \in [N]$ is defined to be the partial derivative w.r.t. the channel parameter for the $i$-th output:
$$G_i(t) := \frac{\partial}{\partial t_i} H(X \mid Y(t_0, \dots, t_{N-1})) \Big|_{t_0 = \cdots = t_{N-1} = t}.$$
The full power of GEXIT analysis is realized when the input distribution and the channel satisfy certain symmetry properties that imply the GEXIT functions are all identical, i.e.,
$$G_i(t) = \frac{1}{N} \frac{d}{dt} H(X \mid Y(t)) \quad \text{for all } i \in [N].$$
For the BEC and AWGN channel, this result connects a well-known reliability measure associated with the single element $X_i$ to the entropy of the entire vector $X$. A sufficient condition under which the GEXIT functions are identical is that the distribution of the input vector has transitive symmetry.
Definition 14 (Symmetry and Transitivity). Let $S_N$ be the set of permutations (i.e., bijective functions) mapping $[N]$ to itself. The symmetry group of a random vector $X = (X_0, \dots, X_{N-1})$ is defined to be
$$\mathcal{G} := \big\{ \pi \in S_N : (X_{\pi(0)}, \dots, X_{\pi(N-1)}) \overset{d}{=} (X_0, \dots, X_{N-1}) \big\},$$
where $\overset{d}{=}$ indicates equality in distribution. We say that $X$ has transitive symmetry if $\mathcal{G}$ is transitive (i.e., for all $i, j \in [N]$, there is a $\pi \in \mathcal{G}$ such that $\pi(i) = j$). We say that $X$ has doubly transitive symmetry if $\mathcal{G}$ is doubly transitive (i.e., for distinct $i, j, k \in [N]$, there is a $\pi \in \mathcal{G}$ such that $\pi(i) = i$ and $\pi(j) = k$).
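For very short codes, the symmetry group in Definition 14 can be enumerated by brute force. The sketch below assumes that $X$ is uniform on the codewords of a binary linear code, in which case a permutation belongs to the symmetry group exactly when it maps the code to itself; the function names and the two example codes are illustrative choices.

```python
from itertools import permutations, product

def codewords(G):
    """All codewords of the binary linear code with generator rows G."""
    n = len(G[0])
    return {tuple(sum(c * row[j] for c, row in zip(coeffs, G)) % 2 for j in range(n))
            for coeffs in product([0, 1], repeat=len(G))}

def symmetry_group(G):
    """Permutations pi such that (x_pi(0), ..., x_pi(N-1)) is a codeword for every codeword x."""
    code, n = codewords(G), len(G[0])
    # By linearity it suffices to test the permuted generator rows.
    return [pi for pi in permutations(range(n))
            if all(tuple(row[pi[j]] for j in range(n)) in code for row in G)]

def is_transitive(group, n):
    return all(any(pi[i] == j for pi in group) for i in range(n) for j in range(n))

def is_doubly_transitive(group, n):
    # Definition 14: for distinct i, j, k there is pi with pi(i) = i and pi(j) = k.
    return all(any(pi[i] == i and pi[j] == k for pi in group)
               for i in range(n) for j in range(n) for k in range(n)
               if len({i, j, k}) == 3)

RM13 = [[1, 1, 1, 1, 1, 1, 1, 1],      # RM(1, 3): constant and three linear terms
        [0, 1, 0, 1, 0, 1, 0, 1],
        [0, 0, 1, 1, 0, 0, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1]]
PAIRS = [[1, 1, 0, 0], [0, 0, 1, 1]]   # two independent repeated bits

g_rm, g_pairs = symmetry_group(RM13), symmetry_group(PAIRS)
print("RM(1,3):", len(g_rm), is_transitive(g_rm, 8), is_doubly_transitive(g_rm, 8))
print("pairs  :", len(g_pairs), is_transitive(g_pairs, 4), is_doubly_transitive(g_pairs, 4))
```

The second code shows that transitivity does not imply double transitivity: its symmetry group moves any coordinate to any other but must keep the two repeated pairs together. The enumeration over all $N!$ permutations is only practical for very small $N$.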

A. BMS Families Ordered by Degradation
Our approach builds upon the GEXIT analysis outlined in Section III-C. Rather than focusing on a particular BMS channel, we study a family of BMS channels indexed by a parameter $t \in [0, 1]$, where $t = 0$ is a perfect channel and $t = 1$ is an uninformative channel. We also require the family to be ordered with respect to degradation in the sense that $W(t)$ is degraded with respect to $W(s)$ for all $0 \le s \le t \le 1$. Equivalently, for any distribution on the input $X \in \{\pm 1\}$, there exists a joint distribution on $(X, Y(s), Y(t))$ such that $X - Y(s) - Y(t)$ forms a Markov chain. We remark that this degradation assumption is also standard in the literature on GEXIT analysis [43]. See Appendix B for a precise definition of channel degradation and some of its consequences.
[Figure: Joint range of entropy and MMSE functions associated with a family of BMS channels as described by (9). The upper boundary is attained by the BSC and the lower boundary is attained by the BEC [54].]
There are a few well-known examples of channel families that are ordered by degradation.Some examples are the family of BECs where the erasure probability transitions from 0 to 1, the family of BSCs where the crossover probability transitions from 0 to 1/2, and the family of AWGN channels where the noise power transitions from 0 to +∞.
The Shannon capacity of a BMS channel is equal to the mutual information between the input and output when the input is uniformly distributed [43, p. 193]. For a family of BMS channels, two important metrics are provided by the entropy and the MMSE for a uniform input distribution:
Definition 15. Let $\{W(t)\}_{t \in [0,1]}$ be a family of BMS channels that is ordered w.r.t. degradation, where $W(0)$ is the perfect channel and $W(1)$ is the uninformative channel. Let $X_u$ be uniformly distributed on $\{\pm 1\}$ and let $Y_u(t)$ be an observation of $X_u$ through the channel $W(t)$. The entropy function and the MMSE function of the family are defined by
$$H(t) := H(X_u \mid Y_u(t)), \tag{7}$$
$$M(t) := \mathrm{mmse}(X_u \mid Y_u(t)). \tag{8}$$
Note that, under these assumptions, $H(0) = M(0) = 0$ and $H(1) = M(1) = 1$. The family is said to be absolutely continuous if the entropy function is absolutely continuous, i.e., if there exists an integrable function $h \colon [0, 1] \to \mathbb{R}$ such that $H(t) = \int_0^t h(s)\, ds$ for all $t \in [0, 1]$.
The entropy and MMSE functions have a number of important functional properties. Under the assumed degradation ordering, both functions are non-decreasing, and the Shannon capacity of the channel $W(t)$ is equal to $C(t) = 1 - H(t)$. It is known that the extremal relationships between the entropy and the MMSE are attained by the BEC and the BSC [54]. For example,
$$M(t) \le H(t) \le h_b\!\left( \frac{1 - \sqrt{1 - M(t)}}{2} \right), \tag{9}$$
where $h_b(p) := -p \log_2 p - (1 - p) \log_2 (1 - p)$ is the binary entropy function. Equality on the left is attained by the BEC and equality on the right is attained by the BSC. This type of phenomenon is also known to be somewhat typical [55].
Another property, perhaps less known, is that the difference in entropy can be used to upper bound the difference in MMSE:
$$M(t) - M(s) \le 2 \ln(2)\, \big( H(t) - H(s) \big), \qquad 0 \le s \le t \le 1. \tag{10}$$
To show this, one can apply Lemma 19 to $H(t) - H(s)$ and then use Lemma 44 to verify that all terms in the resulting expansion are positive. Keeping only the first term in the expansion gives the bound in (10). In the case of the BSC, it can be verified that the factor $2 \ln(2)$ is tight in the limit where the crossover probability approaches $1/2$ (see Figure 3). Also, by applying Lemma 38, this inequality is sufficient for the absolute continuity of the entropy function to imply the absolute continuity of the MMSE function.
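The bound relating MMSE differences to entropy differences can be checked numerically for the BSC family. This sketch assumes the inequality has the form $M(t) - M(s) \le 2\ln(2)\,(H(t) - H(s))$ for $s \le t$ and parametrizes the BSC by crossover probability $t/2$, so that $H(t) = h_b(t/2)$ and $M(t) = 1 - (1 - t)^2$; the grid size and variable names are arbitrary.

```python
import numpy as np

def binary_entropy(p):
    """Binary entropy in bits, with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    out = np.zeros_like(p)
    inside = (p > 0) & (p < 1)
    q = p[inside]
    out[inside] = -q * np.log2(q) - (1 - q) * np.log2(1 - q)
    return out

t = np.linspace(0.0, 1.0, 201)
H = binary_entropy(t / 2)        # BSC with crossover t/2: H(t) = h_b(t/2)
M = t * (2 - t)                  # M(t) = 1 - (1 - 2 * (t/2))^2

dM = M[None, :] - M[:, None]     # entry [a, b] is M(t_b) - M(t_a)
dH = H[None, :] - H[:, None]
upper = np.triu(np.ones(dM.shape, dtype=bool), k=1)   # pairs with t_a < t_b

bound_holds = bool(np.all(dM[upper] <= 2 * np.log(2) * dH[upper] + 1e-12))
ratio_near_half = dM[-3, -1] / (2 * np.log(2) * dH[-3, -1])   # s = 0.99, t = 1
print("bound holds on the grid:", bound_holds)
print("ratio near crossover 1/2:", ratio_near_half)
```

The ratio printed last approaches 1 as both crossover probabilities approach $1/2$, which is consistent with the tightness of the factor $2\ln(2)$ noted in the text.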

B. GEXIT and I-MMSE Properties
In this section, we present a number of useful results that characterize GEXIT functions for binary-input channels.GEXIT functions were introduced in [56], [43], [44] and analyzed further by a variety of authors [57], [54], [58], [59].Our treatment is based solely on the moments of the conditional expectation.
For the results in this section, it is convenient to specify the distribution of a binary random variable $X \in \{\pm 1\}$ in terms of its mean $\mu = E[X] \in [-1, 1]$, which corresponds to $P(X = 1) = (1 + \mu)/2$. The entropy of $X$ is then $H(X) = h_b((1 + \mu)/2)$, where $h_b(p) := -p \log_2 p - (1 - p) \log_2 (1 - p)$ is the binary entropy function. The entropy is an even function of $\mu$ and it admits the following power series expansion,
$$h_b\!\left( \frac{1 + \mu}{2} \right) = 1 - \sum_{k=1}^{\infty} c_k\, \mu^{2k}, \qquad c_k := \frac{1}{2 \ln(2)\, k\, (2k - 1)}.$$
This expansion extends naturally to the conditional entropy of $X \in \{\pm 1\}$ given an observation $Y$ defined on the same space. In particular, replacing $\mu$ by the conditional mean $E[X \mid Y]$ and then taking the expectation of both sides yields
$$H(X \mid Y) = 1 - \sum_{k=1}^{\infty} c_k\, \|E[X \mid Y]\|_{2k}^{2k}, \tag{11}$$
where the interchange of expectation and summation is justified by the uniform convergence of the power series. Notice that the conditional expectation $E[X \mid Y]$ appearing in (11) depends on both the prior mean $\mu$ and the channel from $X$ to $Y$. In the special case of a BMS channel, it turns out that the conditional entropy can also be expressed using a different series expansion involving $\mu$ and a sequence $\{q_k\}$ that depends only on the BMS channel. This sequence and the corresponding expansion are defined as follows.
Definition 16. For a BMS channel, the moment sequence $\{q_k\}_{k \in \mathbb{N}}$ is defined by
$$q_k := \|E[X_u \mid Y_u]\|_{2k}^{2k} = E\big[ (E[X_u \mid Y_u])^{2k} \big], \tag{12}$$
where $X_u \in \{\pm 1\}$ is uniformly distributed and $Y_u$ is an observation of $X_u$ through the channel. We use the subscript $u$ to emphasize that $q_k$ is always computed using a uniformly distributed input.
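Lemma 17 (Two-look Formula). Let $X \in \{\pm 1\}$ be a binary random variable with mean $\mu \in [-1, 1]$ and prior probability $P(X = 1) = (1 + \mu)/2$. Let $Y$ be an observation of $X$ through a BMS channel with sequence $\{q_k\}_{k \in \mathbb{N}}$ as given in Definition 16. Let $U$ be another observation on the same probability space such that $U - X - Y$ forms a Markov chain. Then, the following expansions hold:
$$H(X \mid Y) = H(X) - \sum_{k=1}^{\infty} c_k\, q_k\, (1 - \mu^{2k}), \tag{13}$$
$$H(X \mid U, Y) = H(X \mid U) - \sum_{k=1}^{\infty} c_k\, q_k\, \big( 1 - \|E[X \mid U]\|_{2k}^{2k} \big). \tag{14}$$
Proof. See proof in Section VI-B.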
In comparison to (11) the expansion in (13) provides a decoupling between two different types of information: the information provided by the BMS channel (encapsulated by the sequence {q k }) and the prior information (summarized by µ).
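The decoupling between channel information and prior information can be checked numerically on the BSC. The sketch below assumes the coefficients $c_k = 1/(2\ln(2)\,k\,(2k-1))$ from the binary entropy power series and assumes that the two-look expansion yields $I(X;Y) = \sum_k c_k q_k (1 - \mu^{2k})$; for a BSC with crossover probability $p$ the moment sequence is $q_k = (1 - 2p)^{2k}$. The values of `mu` and `p` and the truncation length are arbitrary illustrative choices.

```python
import numpy as np

def h_b(p):
    """Binary entropy in bits for 0 < p < 1."""
    return float(-p * np.log2(p) - (1 - p) * np.log2(1 - p))

def c(k):
    """Coefficients of the power series of the binary entropy in the mean."""
    return 1.0 / (2.0 * np.log(2.0) * k * (2 * k - 1))

K = np.arange(1, 401)            # truncation of the infinite series
mu, p = 0.6, 0.1                 # prior mean of X and BSC crossover probability
q = (1 - 2 * p) ** (2 * K)       # q_k = E[(E[X_u | Y_u])^(2k)] for the BSC

# Power series of the prior entropy versus the closed form.
entropy_series = 1.0 - np.sum(c(K) * mu ** (2 * K))
entropy_direct = h_b((1 + mu) / 2)

# Mutual information from the series versus H(Y) - H(Y | X).
mi_series = np.sum(c(K) * q * (1 - mu ** (2 * K)))
mi_direct = h_b((1 + mu * (1 - 2 * p)) / 2) - h_b(p)

print("H(X):   series", entropy_series, " direct", entropy_direct)
print("I(X;Y): series", mi_series, " direct", mi_direct)
```

Setting `mu = 0` in the same computation gives the capacity $1 - h_b(p)$, in line with the observation that each term of the series is maximized by the uniform prior.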
Remark 18. The expansions in Lemma 17 provide simple proofs for various properties of BMS channels. For example, by (13), the mutual information between the input and the output of the BMS channel from $X$ to $Y$ is given by
$$I(X; Y) = \sum_{k=1}^{\infty} c_k\, q_k\, (1 - \mu^{2k}).$$
Since each term is maximized at $\mu = 0$, this immediately implies the well-known fact that the capacity of the channel is attained by the uniform distribution. Thus, the capacity satisfies $C = \sum_{k=1}^{\infty} c_k q_k$.
Before stating the technical lemmas, let us first provide an informal overview of how Lemma 17 will be used to bound the GEXIT function. Let $\{W(t) : 0 \le t \le 1\}$ be a family of BMS channels that is ordered by degradation and absolutely continuous according to Definition 15. Let $\{q_k(t)\}_{k \in \mathbb{N}}$ denote the sequence from Definition 16 as a function of $t$. For an input $X \in \{\pm 1\}^N$, consider the output given by
$$Y(s, t) := (Y_0(t), \dots, Y_{i-1}(t), Y_i(s), Y_{i+1}(t), \dots, Y_{N-1}(t)), \tag{15}$$
where the $i$-th channel use has parameter $s$ and the other channel uses have parameter $t$. Applying (14) with respect to the BMS channel from $X_i$ to $Y_i(s)$ and the Markov chain $Y_{\sim i}(t) - X_i - Y_i(s)$ gives
$$H(X_i \mid Y(s, t)) = H(X_i \mid Y_{\sim i}(t)) - \sum_{k=1}^{\infty} c_k\, q_k(s)\, \big( 1 - \|E[X_i \mid Y_{\sim i}(t)]\|_{2k}^{2k} \big). \tag{16}$$
Since $H(X \mid Y(s, t))$ and $H(X_i \mid Y(s, t))$ differ by a term that does not depend on $s$, taking the $s$-derivative of both sides, interchanging the derivative and the summation, and then evaluating at $s = t$ gives the following expansion of the GEXIT function:
$$G_i(t) = \sum_{k=1}^{\infty} c_k\, \big( -q_k'(t) \big)\, \big( 1 - \|E[X_i \mid Y_{\sim i}(t)]\|_{2k}^{2k} \big). \tag{17}$$
A useful property of this expansion is that the terms in the sum are all non-negative. This follows from $\|E[X_i \mid Y_{\sim i}(t)]\|_{2k}^{2k} \in [0, 1]$ and from the degradation ordering, which ensures that each $q_k(t)$ is non-increasing in $t$ (see Lemma 19). This expansion plays a crucial role in Section V-C, where it provides a link between the integral of the GEXIT function and the integral of a related function that depends only on the conditional second moments.
A further application of Lemma 17 appears in Sections V-B2 and V-B3, where it is used to compare the GEXIT function $G_i(t)$ with an augmented GEXIT function $G_i^+(t)$ that depends on the original output $Y(t)$ as well as an additional observation $U(t)$ such that $U(t) - X - Y(t)$ forms a Markov chain. Applying the expansion in (17) to both $G_i(t)$ and $G_i^+(t)$ and then taking the difference yields
$$G_i(t) - G_i^+(t) = \sum_{k=1}^{\infty} c_k\, \big( -q_k'(t) \big)\, \Big( \|E[X_i \mid Y_{\sim i}(t), U(t)]\|_{2k}^{2k} - \|E[X_i \mid Y_{\sim i}(t)]\|_{2k}^{2k} \Big). \tag{18}$$
We again find that each term in the expansion is non-negative. This is due to the degradation ordering of the channel (e.g., $q_k'(t) \le 0$ by Lemma 19) and Lemma 44 via the trivial Markov chain $X_i - (Y_{\sim i}(t), U(t)) - Y_{\sim i}(t)$. Importantly, this means that keeping only the $k = 1$ term provides a lower bound on the difference of the GEXIT functions.
Lemma 19. Let $\{W(t) : 0 \le t \le 1\}$ be a family of BMS channels that is ordered by degradation and absolutely continuous according to Definition 15. Let $\{q_k(t)\}_{k \in \mathbb{N}}$ denote the sequence from Definition 16 as a function of $t$. Then, the following properties hold:
(i) The entropy function and MMSE function from Definition 15 can be expressed as
$$H(t) = 1 - \sum_{k=1}^{\infty} c_k\, q_k(t), \qquad M(t) = 1 - q_1(t).$$
(ii) Each $q_k(t)$ is non-increasing and absolutely continuous on $[0, 1]$ with $q_k(0) = 1$ and $q_k(1) = 0$. The derivative, $q_k'(t)$, exists almost everywhere on $[0, 1]$ and satisfies $q_k'(t) \le 0$ when it exists.
(iii) The derivative of $H(t)$ exists almost everywhere on $[0, 1]$ and is equal almost everywhere to
$$H'(t) = -\sum_{k=1}^{\infty} c_k\, q_k'(t). \tag{19}$$
Proof. See proof in Section VI-B.
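Lemma 20. Let $\{W(t) : 0 \le t \le 1\}$ be a family of BMS channels that is ordered by degradation and absolutely continuous according to Definition 15. Let $X \in \{\pm 1\}^N$ be a binary random vector and let $Y \in \mathcal{Y}^N$ be an observation of $X$ through the BMS channel family. Let $U(t)$ be another observation of $X$, which is conditionally independent of $Y$ given $X$, through a family of channels indexed by $t \in [0, 1]$ and ordered by degradation. For each $i \in [N]$, we use the parameterization in (15) to show that the GEXIT functions,
$$G_i(t) := \frac{\partial}{\partial s} H(X \mid Y(s, t)) \Big|_{s = t}, \qquad G_i^+(t) := \frac{\partial}{\partial s} H(X \mid Y(s, t), U(t)) \Big|_{s = t},$$
exist almost everywhere and are integrable on $[0, 1]$. These functions also satisfy (almost everywhere on $[0, 1]$) the series expansions (17) and
$$G_i^+(t) = \sum_{k=1}^{\infty} c_k\, \big( -q_k'(t) \big)\, \big( 1 - \|E[X_i \mid Y_{\sim i}(t), U(t)]\|_{2k}^{2k} \big). \tag{20}$$
Proof. See proof in Section VI-B.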
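Lemma 21. Under the setting of Lemma 20, we can lower bound (18) with
$$G_i(t) - G_i^+(t) \ge \frac{-q_1'(t)}{2 \ln 2}\, \Big( \mathrm{mmse}(X_i \mid Y_{\sim i}(t)) - \mathrm{mmse}(X_i \mid Y_{\sim i}(t), U(t)) \Big).$$
Proof. First, we note that one can rigorously establish (18) by subtracting (20) from (17). The terms,
$$c_k\, \big( -q_k'(t) \big)\, \Big( \|E[X_i \mid Y_{\sim i}(t), U(t)]\|_{2k}^{2k} - \|E[X_i \mid Y_{\sim i}(t)]\|_{2k}^{2k} \Big),$$
in the resulting expansion are non-negative by the degradation ordering of the channel (e.g., $q_k'(t) \le 0$ by Lemma 19) and Lemma 44 via the trivial Markov chain $X_i - (Y_{\sim i}(t), U(t)) - Y_{\sim i}(t)$. Thus, the stated result follows from retaining only the first term, using $c_1 = 1/(2 \ln 2)$ and (4) to express the difference of second moments as a difference of MMSEs.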
The next result is a further implication of the two-look formula that bounds one's ability to estimate a binary variable from two observations subject to an MMSE lower bound on one of the observations.
Lemma 22. Consider the setting of Lemma 17 and let $C$ be the capacity of the BMS channel from $X$ to $Y$. Then
Proof. See proof in Section VI-B.
Remark 23. In comparison to [43], [44], our GEXIT formulation is somewhat more general and requires fewer regularity assumptions. In particular, we allow the channel family to be parameterized arbitrarily and we show that the GEXIT function exists as long as its entropy function $H(t)$ is absolutely continuous. An alternative approach to the analysis of GEXIT functions, which shares some of these properties, can be found in [58].

C. Linear Codes on BMS Channels
A set $\mathcal{C} \subseteq \mathbb{F}_2^N$ defines a binary linear code (i.e., a subspace of $\mathbb{F}_2^N$) if and only if it is closed under addition, that is to say $u \oplus u' \in \mathcal{C}$ for all $u, u' \in \mathcal{C}$, where $\oplus$ represents elementwise modulo-2 addition. To transmit a message over a binary channel, each codeword $u \in \mathcal{C}$ is mapped to a channel input sequence in $\{\pm 1\}^N$ via the binary phase-shift keying (BPSK) mapping $u \mapsto (-1)^u$. The resulting set of BPSK-modulated codeword sequences is denoted by $\mathcal{C}_x$.
A remarkable property of linear codes on BMS channels is that many performance metrics do not depend on the transmitted codeword [43, p. 190].This property greatly simplifies the analysis of coding problems because it means that one may condition on the event that the all ones input is transmitted (corresponding to the all zeros linear codeword).Note that under this event, the outputs of the BMS channel are independent random variables.
Next, we describe a consequence of linearity and channel symmetry that is useful for our analysis.
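Lemma 24. Let $X$ be distributed uniformly on the set of BPSK-modulated code sequences $\mathcal{C}_x \subseteq \{\pm 1\}^N$ associated with the binary linear code $\mathcal{C} \subseteq \mathbb{F}_2^N$ and let $Y$ be an observation of $X$ through a BMS channel of the form $Y = X \odot Z$, where $Z \in \mathcal{Z}^N$ is an independent vector with independent entries. For $i \in [N]$ and $S \subseteq [N]$, define $f(y_S) := E[X_i \mid Y_S = y_S]$ to be the conditional expectation of the $i$-th input given the outputs indexed by $S$. Then, for all $x \in \mathcal{C}_x$ and $y_S$ in the support of $Y_S$, the following identity holds:
$$f(x_S \odot y_S) = x_i\, f(y_S).$$
In particular, this implies that $f(Y_S) = X_i\, f(Z_S)$.
Proof. By Bayes' rule, the conditional probability mass function of $X$ given $Y_S = y_S$ satisfies
$$p_{X \mid Y_S}(x \mid y_S) \propto p_X(x)\, w(y_S \mid x_S), \qquad w(y_S \mid x_S) := \prod_{j \in S} w(y_j \mid x_j),$$
where $p_X(x)$ is the uniform distribution over the codewords, $w(y_j \mid x_j)$ is the transition probability, and the constant of proportionality is chosen to ensure the function sums to one. The fact that a linear code is closed under addition means that it is a subgroup of $\mathbb{F}_2^N$, and thus for any $u \in \mathcal{C}$, a vector $u' \in \mathbb{F}_2^N$ satisfies $u \oplus u' \in \mathcal{C}$ if and only if $u' \in \mathcal{C}$. Using the code-to-input mapping $x = (-1)^u$, this implies that for any $x \in \mathcal{C}_x$, a vector $x' \in \{\pm 1\}^N$ satisfies $x \odot x' \in \mathcal{C}_x$ if and only if $x' \in \mathcal{C}_x$.
To proceed, fix any $x' \in \mathcal{C}_x$ and observe that $p_X(x) = p_X(x' \odot x)$ for all $x \in \{\pm 1\}^N$. This is because $p_X$ is uniform over the code and, since $x' \in \mathcal{C}_x$, we see that $x' \odot x \in \mathcal{C}_x$ if and only if $x \in \mathcal{C}_x$. Meanwhile, the assumption of channel symmetry means that $w(y_S \mid x_S) = w(x'_S \odot y_S \mid x'_S \odot x_S)$ for all $x \in \{\pm 1\}^N$. Together, these statements imply that
$$p_{X \mid Y_S}(x \mid y_S) = p_{X \mid Y_S}(x' \odot x \mid x'_S \odot y_S).$$
In view of this identity, the conditional expectation satisfies
$$f(x'_S \odot y_S) = \sum_{x} x'_i\, x_i\, p_{X \mid Y_S}(x' \odot x \mid x'_S \odot y_S) = x'_i \sum_{x} x_i\, p_{X \mid Y_S}(x \mid y_S) = x'_i\, f(y_S),$$
where the first step re-indexes the sum over $\{\pm 1\}^N$ by $x \mapsto x' \odot x$.
Our next result provides an identity for the MMSE associated with estimating a single input. For conceptual reasons, we find it convenient to frame the result in terms of two coupled channel outputs that are conditionally independent given the input. However, this approach is essentially the same as conditioning on the transmission of a particular codeword.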
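Lemma 25. Let $X$ be distributed uniformly on the set of BPSK-modulated code sequences $\mathcal{C}_x \subseteq \{\pm 1\}^N$ associated with the binary linear code $\mathcal{C} \subseteq \mathbb{F}_2^N$ and let $Y$ be an observation of $X$ through a BMS channel of the form $Y = X \odot Z$, where $Z \in \mathcal{Z}^N$ is an independent vector with independent entries. For each $i \in [N]$ and $S \subseteq [N]$, the following identity holds:
$$\mathrm{mmse}(X_i \mid Y_S)\, \big( 1 - \mathrm{mmse}(X_i \mid Y_S) \big) = \frac{1}{2}\, \big\| E[X_i \mid Y_S] - E[X_i \mid Y'_S] \big\|_2^2,$$
where $Y' = X \odot Z'$ is an independent second use of the channel with the same input $X$. Furthermore, for every partition $(B_1, \dots, B_K)$ of $S$, we have the upper bound
$$\mathrm{mmse}(X_i \mid Y_S)\, \big( 1 - \mathrm{mmse}(X_i \mid Y_S) \big) \le \frac{1}{2} \sum_{k=1}^{K} \big\| E[X_i \mid Y_S] - E[X_i \mid Y_S^{B_k}] \big\|_2^2,$$
where $Y_S^{B_k}$ is a modified observation of $Y_S$ where the entries indexed by $B_k$ are resampled independently according to the same input $X$.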
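Both the symmetry identity for the conditional expectation and the two-look identity for the MMSE can be confirmed by exhaustive enumeration on a small example. The sketch below is an illustration under stated assumptions: it uses a [7,4] Hamming code over a BSC with crossover probability 0.1 (both arbitrary choices), it assumes Lemma 24 reads $f(x_S \odot y_S) = x_i f(y_S)$, and it assumes the identity of Lemma 25 reads $\mathrm{mmse}\,(1 - \mathrm{mmse}) = \tfrac{1}{2} E[(f(Y_S) - f(Y'_S))^2]$.

```python
import itertools
import numpy as np

G = np.array([[1, 0, 0, 0, 1, 1, 0],
              [0, 1, 0, 0, 1, 0, 1],
              [0, 0, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])    # [7,4] Hamming code (illustrative choice)
P_FLIP = 0.1                               # BSC crossover probability (assumption)
I, S = 0, [1, 2, 3, 4, 5, 6]               # estimate X_0 from the other outputs

msgs = np.array(list(itertools.product([0, 1], repeat=G.shape[0])))
CODE = 1 - 2 * (msgs @ G % 2)              # BPSK codewords in {+1, -1}^N

def f(y_s):
    """E[X_I | Y_S = y_s] for X uniform on CODE observed through a BSC."""
    flips = np.sum(CODE[:, S] != y_s, axis=1)
    like = P_FLIP ** flips * (1 - P_FLIP) ** (len(S) - flips)
    return float(np.sum(CODE[:, I] * like) / np.sum(like))

# All noise patterns z in {+1, -1}^|S| and their probabilities.
Z = np.array(list(itertools.product([1, -1], repeat=len(S))))
pz = P_FLIP ** np.sum(Z == -1, axis=1) * (1 - P_FLIP) ** np.sum(Z == 1, axis=1)

# Assumed form of Lemma 24: f(x_S * y_S) = x_i * f(y_S) for every codeword x.
lemma24_gap = max(abs(f(x[S] * y) - x[I] * f(y)) for x in CODE for y in Z)

# Assumed form of Lemma 25: mmse (1 - mmse) = 0.5 E[(f(Y_S) - f(Y'_S))^2].
second_moment, two_look = 0.0, 0.0
for x in CODE:                             # X is uniform on the code
    fy = np.array([f(x[S] * z) for z in Z])
    second_moment += np.sum(pz * fy**2) / len(CODE)
    diff = fy[:, None] - fy[None, :]       # f(Y_S) - f(Y'_S) over all (z, z')
    two_look += np.sum(np.outer(pz, pz) * diff**2) / len(CODE)
mmse = 1.0 - second_moment
lhs, rhs = mmse * (1 - mmse), 0.5 * two_look
print("Lemma 24 gap:", lemma24_gap)
print("mmse:", mmse, " lhs:", lhs, " rhs:", rhs)
```

The expectation over $(X, Z, Z')$ is computed without using the symmetry identity, so agreement of `lhs` and `rhs` is a genuine check rather than a restatement of the proof.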

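Proof. Define the conditional expectation $f(Y_S) := E[X_i \mid Y_S]$ and observe that
$$E[X_i\, f(Y_S) \mid Y_S] = f(Y_S)\, E[X_i \mid Y_S] = f(Y_S)^2$$
almost surely, where the equalities follow from nested conditional expectation. This implies that
$$\mathrm{Var}(X_i f(Y_S)) = E\big[ (X_i f(Y_S))^2 \big] - \big( E[X_i f(Y_S)] \big)^2 = E[f(Y_S)^2] - \big( E[f(Y_S)^2] \big)^2 = \mathrm{mmse}(X_i \mid Y_S)\, \big( 1 - \mathrm{mmse}(X_i \mid Y_S) \big),$$
where the second step holds because $X_i^2 = 1$ and the final step follows from (4). This decomposition holds generally for any random variable $X_i \in \{-1, 1\}$ and any channel $X_i \to Y_S$.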
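Next, we appeal to the special properties of the BMS channel and the linear code. Specifically, by Lemma 24, it follows that $X_i f(Y_S) = f(Z_S)$. Letting $Y' = X \odot Z'$, where $Z'$ is an independent copy of $Z$, we can now write
$$\mathrm{Var}(X_i f(Y_S)) = \mathrm{Var}(f(Z_S))$$
$$= \tfrac{1}{2}\, E\big[ (f(Z_S) - f(Z'_S))^2 \big]$$
$$= \tfrac{1}{2}\, E\big[ (X_i f(Z_S) - X_i f(Z'_S))^2 \big]$$
$$= \tfrac{1}{2}\, E\big[ (f(Y_S) - f(Y'_S))^2 \big],$$
where the second line can be verified by expanding the square, the third line holds since $X_i^2 = 1$, and the last step is another application of Lemma 24. Combining the two different expressions for $\mathrm{Var}(X_i f(Y_S))$ gives the desired identity.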
To prove the upper bound, recall that $f(Z_S)$ is a bounded function of independent random variables. Hence, we can apply the Efron-Stein inequality [61, Theorem 3.1] with respect to the partition $(Z_{B_1}, \dots, Z_{B_K})$ to conclude that
$$\mathrm{Var}(f(Z_S)) \le \frac{1}{2} \sum_{k=1}^{K} E\big[ (f(Z_S) - f(Z_S^{B_k}))^2 \big],$$
where $Z_S^{B_k}$ denotes a version of $Z_S$ where the entries indexed by $B_k$ have been independently resampled from the $Z$ distribution. Multiplying the terms in the square by $X_i$ and then applying Lemma 24 leads to the stated bound, which is given in terms of $Y_S$ and $Y_S^{B_k}$.
Finally, we need the following result concerning the distribution of a pair of estimates based on correlated observations.
Lemma 26. Let $X$ be distributed uniformly on the set of channel input sequences $\mathcal{C}_x \subseteq \{\pm 1\}^N$ associated with the binary linear code $\mathcal{C} \subseteq \mathbb{F}_2^N$ and let $Y$ be an observation of $X$ through a BMS channel of the form $Y = X \odot Z$, where $Z \in \mathcal{Z}^N$ is an independent vector with independent and identically distributed entries. Suppose that for $i \in [N]$ there exist disjoint sets $A, B, C \subseteq [N] \setminus \{i\}$ and a permutation matrix $\Pi$ such that $(X_i, X_A, X_B)$ is equal in distribution to $(X_i, X_A, \Pi X_C)$. Then,
$$\big( E[X_i \mid Y_A, Y_B],\; E[X_i \mid Y_A, Y'_B] \big) \overset{d}{=} \big( E[X_i \mid Y_A, Y_B],\; E[X_i \mid Y_A, Y_C] \big),$$
where $Y'$ is an independent second use of the channel with the same input $X$.
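Proof. Let us define the conditional expectations,
$$g(y_A, y_B) := E[X_i \mid Y_A = y_A, Y_B = y_B], \qquad h(y_A, y_C) := E[X_i \mid Y_A = y_A, Y_C = y_C].$$
The assumption that $(X_i, X_A, X_B)$ is equal in distribution to $(X_i, X_A, \Pi X_C)$, combined with the assumptions on the channel, implies that
$$g(u, \Pi v) = h(u, v) \tag{21}$$
for all $(u, v)$ in the support of $(Y_A, Y_C)$. From the channel assumptions, the channel outputs can be expressed as $Y = X \odot Z$ and $Y' = X \odot Z'$, where $X, Z, Z'$ are independent.
We can now write
$$\big( g(Y_A, Y_B),\, g(Y_A, Y'_B) \big) = \big( X_i\, g(Z_A, Z_B),\, X_i\, g(Z_A, Z'_B) \big) \tag{22a}$$
$$\overset{d}{=} \big( X_i\, g(Z_A, Z_B),\, X_i\, g(Z_A, \Pi Z_C) \big) \tag{22b}$$
$$= \big( X_i\, g(Z_A, Z_B),\, X_i\, h(Z_A, Z_C) \big) \tag{22c}$$
$$= \big( g(Y_A, Y_B),\, h(Y_A, Y_C) \big), \tag{22d}$$
where (22a) relies on Lemma 24, (22b) holds because the entries of $Z$ and $Z'$ are independent and identically distributed, (22c) is implied by (21), and (22d) again relies on Lemma 24.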

V. PROOF OF MAIN RESULT
We prove that RM codes achieve capacity for any BMS channel in the limit of large blocklength.The first step of our proof is to embed the BMS channel of interest into a family of absolutely continuous BMS channels as described in Definition 15.To further simplify our analysis, we will add the additional assumption that the family of BMS channels is parameterized such that the MMSE function M(t) defined in (8) is given by M(t) = t.
We emphasize that these assumptions are not restrictive in the sense that, for any BMS channel, there exists a family of channels satisfying these constraints. An explicit construction based on linear interpolation with erasure channels is described in Section V-D. For the convenience of the reader, we restate the channel assumptions for our main result as follows:
Assumption 1. We have a family of BMS channels indexed by a parameter $t \in [0, 1]$ satisfying the following properties:
A) For a random input $X \in \{\pm 1\}^N$, the output associated with $t \in [0, 1]$ is given by a BMS channel of the form $Y(t) = X \odot Z(t)$, where $Z(t) \in \mathcal{Z}^N$ is a random vector, independent of $X$, whose entries are independently and identically distributed according to a probability measure indexed by $t$.
B) The family of BMS channels is ordered with respect to degradation, where $t = 0$ is the perfect channel and $t = 1$ is the uninformative channel.
C) The entropy function $H(t)$ defined in (7) is absolutely continuous on $[0, 1]$.
D) The MMSE function $M(t)$ defined in (8) satisfies $M(t) = t$.
While our main result concerns sequences of RM codes of increasing blocklength, many of the steps in our proof hold for a larger class of codes.To make these distinctions apparent, we list here the weaker properties that are sometimes used.We note that all of these properties are satisfied by the RM(r, m) code with N = 2 m .In particular, we always assume that the random input vector X ∈ {±1} N is distributed uniformly on the channel input sequences of a binary code.In some cases, we also require that: • the code is linear, • the code has a transitive symmetry group, • the code has a doubly transitive symmetry group.

A. The Extrinsic MMSE Function
As discussed in Section III-C, the entropy decomposition in (6) plays an important role in the analysis of the BEC and the AWGN channels, where the partial derivatives provide natural measures of the performance of estimating a single entry of the input. However, one difficulty that arises in extending these approaches to general channels is that the GEXIT function does not seem to have an obvious estimation-theoretic interpretation. The approach taken in this paper is to study a surrogate for the GEXIT function, which we call the extrinsic MMSE function:
Definition 27 (Extrinsic MMSE function). Let $X \in \{\pm 1\}^N$ be a random vector and let $Y(t)$ be an observation of $X$ through a BMS channel with parameter $t \in [0, 1]$. The extrinsic MMSE function for input $i \in [N]$ is defined to be
$$M_i(t) := \mathrm{mmse}(X_i \mid Y_{\sim i}(t)).$$
The extrinsic MMSE is similar to the EXIT function $H(X_i \mid Y_{\sim i}(t))$ in the sense that it provides a measure of the ability to estimate the $i$-th input based on the outputs from the other channels. As was the case for the GEXIT function, the extrinsic MMSE is identical for all $i \in [N]$ whenever the input distribution has a transitive symmetry group. In that case, we will sometimes drop the subscript $i$ and denote the extrinsic MMSE by $M(t)$.
For the purposes of proving our main capacity result with respect to the bit error rate, the code is transitive and the extrinsic MMSE has the property that M (t) converges to zero (for a particular sequence of problems of increasing dimension) if and only if the bit error rate converges to zero.To prove that a code sequence, with rate converging to R, achieves capacity on the BMS channel family W (t), it is sufficient to show that M (t) converges to zero for all t ∈ [0, 1] such that the code rate R is strictly less than Shannon capacity C(t).
Our proof that RM codes achieve capacity on BMS channels consists of the following steps: (1) Sharp threshold property: Show that, for every sequence of RM codes with increasing blocklength, the extrinsic MMSE has a sharp threshold property with respect to t.Specifically, we show that which implies that M (t) cannot be too different from a step function that jumps from 0 to 1.By itself, this does not imply convergence though because the location where the function jumps is not controlled.We note that the integral exists because M (t) is non-decreasing.
(2) Area theorem: Show that, if the sequence of RM codes has limiting rate R, then the location of the jump in the step function must converge to the unique value of t such that the Shannon capacity C(t) of the BMS channel is equal to the code rate R. The two-step approach of first establishing a sharp threshold and then using an area theorem to localize the jump is now somewhat standard [9], [52].The main novelty in our approach is the reliance on the extrinsic MMSE instead of the GEXIT curve and the mechanism by which we establish convergence to a step function.
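The role of the quantity $M(t)(1 - M(t))$ as a measure of distance from a step function can be illustrated with a toy calculation. This is a sketch under assumptions chosen purely for illustration: it uses the BEC with erasure probability $t$ and the length-$n$ repetition code, for which the extrinsic MMSE is $M(t) = t^{n-1}$ (the other $n - 1$ outputs must all be erased). Repetition codes have vanishing rate, so this only illustrates the sharpening of the transition, not the capacity result; the function name `step_gap` is hypothetical.

```python
import numpy as np

def step_gap(extrinsic_mmse, num=100_001):
    """Trapezoidal estimate of the integral of M(t) (1 - M(t)) over [0, 1]."""
    t = np.linspace(0.0, 1.0, num)
    m = extrinsic_mmse(t)
    g = m * (1.0 - m)
    return float(np.sum((g[1:] + g[:-1]) / 2) * (t[1] - t[0]))

# Extrinsic MMSE of the length-n repetition code on the BEC: M(t) = t^(n-1).
gaps = {n: step_gap(lambda t, n=n: t ** (n - 1)) for n in (2, 8, 32, 128)}
for n, gap in gaps.items():
    exact = 1 / n - 1 / (2 * n - 1)      # closed form of the same integral
    print(f"n={n:4d}  integral={gap:.6f}  closed form={exact:.6f}")
```

As $n$ grows the integral shrinks, so $M(t)$ approaches a 0/1 step function, and in this toy example the jump drifts toward $t = 1$; pinning down the jump location for RM codes is the purpose of the area-theorem step.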

B. Are Two Looks Better Than One?
This section establishes the sharp threshold property for the extrinsic MMSE as described by (23). Observe that, for any input distribution, $M_i(t)$ is non-decreasing in $t$ for each $i \in [N]$ because of the assumed channel degradation. Hence, to show that $M_i(t)$ is close to a 0/1 step function, it is sufficient to show that $M_i(t)(1 - M_i(t))$ is close to zero for most (but importantly not all) values of $t$ in the unit interval. Now, we appeal to Lemma 25, which shows that, if the input is defined by a binary linear code, then the following identity holds:
$$M_i(t)\, \big( 1 - M_i(t) \big) = \frac{1}{2}\, \big\| E[X_i \mid Y_{\sim i}(t)] - E[X_i \mid Y'_{\sim i}(t)] \big\|_2^2, \tag{24}$$
where $Y'(t)$ is an independent second use of the channel with the same input $X$.
In view of ( 24) the entire problem of establishing a sharp threshold can be boiled down to the following question: Assuming that Y i is not observed, are two independent observations of the remaining code symbols likely to provide significantly different posterior estimates of X i ?In the setting where X i can be recovered accurately from Y ∼i the answer to this question is clearly negative.Conversely, in the setting where the first look is uninformative (i.e., with high probability the conditional distribution X i given Y ∼i is close to the prior distribution on X i ) then it is unlikely that a second look will make much of a difference.The interesting setting occurs when a single look provides partial information about X i , and so two looks are then better than one in a meaningful sense.Our goal is to show that, w.r.t. the parameter t, this "interesting" regime has measure tending to zero, that is to say most values of t are "uninteresting".Combined with (24) and monotonicity of M i (t), it follows that M i (t) converges to a 0/1 step function.
1) Decomposition of Variance: We consider a decomposition of the variance term appearing in (24) with respect to a set B ⊂ [N ]\{i}, which will be specified later.For S ⊆ [N ], define If the input distribution is defined by a binary linear code, then we can apply the upper bound in Lemma 25 to the partition given by B and the singletons in We remark that ( 26) is general in the sense that it holds for every linear code.
In the following, we will bound each term in ( 26) by first relating it to a GEXIT function, using the results in Section IV-B, and then combining properties of the GEXIT function with some other arguments to bound the integral with respect to t over the unit interval: is addressed in Section V-B2 where we establish the following.If N = 2 m and the input distribution is uniform on the codewords of the RM(r, m) code, then, for integers • The term ∆ j i (t) is addressed in Section V-B3, where it is shown that if the input distribution has a doubly transitive symmetry group, then the following bound holds for all i ̸ = j, Combining these results leads to a family of upper bounds on the integral of (26) that is parametrized by k ∈ {0, . . ., m}.
This parameter provides a trade-off between the two terms in the bound.For large values of k, the bound is dominated by the difference between the rates in (27).Conversely, for small values of k, the bound is dominated by the size of the set A, which is given by 2 m−k − 1. Optimizing over the choice of the integer k gives the following result: Lemma 28.Consider a family of BMS channels satisfying Assumption 1.If the input distribution is uniform on the codewords of the RM(r, m) code, then the extrinsic MMSE satisfies Proof.For every integer k ∈ [m], the bounds in ( 26), (27), and (28) give where the second step follows from the basic inequality The final result follows from multiplying this by 4 ln 2 and noting that 48 ln(2) ≤ 34.
2) Proof of Generalized Influence Bound in ( 27): This section proves an upper bound on the integral of the generalized influence term ∆ B i (t) defined in (25) for a carefully chosen set B ⊂ [N ]\{i}.This term can be expressed as where A = [N ]\({i} ∪ B) and Y ′ (t) denotes an independent second use of the BMS channel with the same input X.
Our approach to bounding this term is to view the input vector (X 0 , . . ., X N −1 ), as the first N entries in an extended input vector (X 0 , . . ., X L−1 ) of length L > N .With some abuse of notation we use X [N ] to denote the original input vector and X [L] to denote the extended input vector.Associated with the extended input we define the output Y [L] (t) = (Y 0 (t) . . ., Y L−1 (t)) from the same BMS channel.If we can find an extension such that: i) the extended input X [L] is distributed uniformly on the codewords of a linear code; and ii) there exists a set C ⊆ {N, . . ., L − 1} and permutation matrix Π such that (X i , X A , X B ) is equal in distribution to (X i , X A , ΠX C ), then we can use Lemma 26 to conclude that In this expression, the second look at the entries indexed by B has been replaced by an observation of the entries indexed by C in the extended codeword.
To apply this, we assume that the original input is generated by a uniform distribution over the codewords of RM(r, m) and the extended input is generated by a uniform distribution over the codewords of RM(r, m + k), for some positive integer k ≤ m.Then, the following lemma shows that the nesting property identified in Lemma 9 can be used to choose the sets A, B, and C to satisfy the distributional condition defined in Lemma 26.See Figure 2 for an illustration.
such that π acts as the identity on {i} ∪ A and swaps B and C. This implies that there is a permutation matrix Π such that (X i , X A , X B ) is equal in distribution to (X i , X A , ΠX C ) and that
Proof. If X [L] is uniformly distributed on the codewords of C = RM(r, m + k) and I = [N ], then Lemma 9 shows that is uniformly distributed on the codewords of C I = RM(r, m). For i = 0, the sets A, B, C are also constructed in Lemma 9 and we will verify their properties below. In the last step, we describe how the i = 0 construction can be remapped to any i ∈ [N ].
To explain why the sets A, B, C from Lemma 9 satisfy the stated conditions, we recall their definition from the proof of Lemma 9. First, we define 2 as the evaluation sets associated with the indices I = τ (V ) and I ′ = τ (V ′ ), respectively.Then, we define T = I ∩ I ′ = [2 m−k ], A = T \ {0}, B = I \ T , and C = I ′ \T .To simplify notation, we also define Consider the function, π : and π(B ′ ) = C ′ .We do not discuss the precise element by element mapping from B ′ to C ′ because it is not important for this result (e.g., Lemma 26 allows for an arbitrary permutation of the set C).
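The index sets T, A, B, and C can be enumerated concretely. The following Python sketch is only an illustration: it assumes that an index encodes a vector of F_2^(m+k) through its binary expansion, that V is spanned by the first m coordinates, and that V′ is spanned by the first m − k coordinates together with the last k coordinates. These choices and the function name `nesting_sets` are hypothetical and are not taken from Lemma 9; they merely reproduce the set sizes stated in the text.

```python
def nesting_sets(m: int, k: int):
    """Return (I, I_prime, T, A, B, C) as sets of integer indices.

    Assumption (for illustration only): index i encodes a vector in
    F_2^(m+k) via its binary expansion; V uses bits 0..m-1, and V' uses
    bits 0..m-k-1 together with bits m..m+k-1.
    """
    if not 1 <= k <= m:
        raise ValueError("need 1 <= k <= m")
    shared = m - k  # number of coordinates common to V and V'
    I = set(range(2 ** m))
    I_prime = {a + (b << m) for a in range(2 ** shared) for b in range(2 ** k)}
    T = I & I_prime          # equals {0, ..., 2^(m-k) - 1}
    A = T - {0}
    B = I - T
    C = I_prime - T          # lies entirely in {N, ..., L - 1}
    return I, I_prime, T, A, B, C


I, I_prime, T, A, B, C = nesting_sets(m=4, k=2)
print(sorted(T), len(A), len(B), len(C))  # prints [0, 1, 2, 3] 3 12 12
```

In this sketch {0}, A, and B partition [N] with N = 2^m, the set A has 2^(m−k) − 1 elements, and B and C have equal size, which is the structure the argument relies on.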
Since π is a linear function, it defines an automorphism of C [39, p. 398]. For integer indices, this automorphism is given by i → τ (π(τ −1 (i))). Since a code automorphism preserves the distribution of the codewords, it follows that (X 0 , X A , X B ) is equal in distribution to (X 0 , X A , ΠX C ) for some permutation matrix Π. Moreover, Y [L] is a memoryless observation of X [L] and applying this automorphism to both X [L] and Y [L] preserves their joint distribution as well. Thus, there is another permutation matrix Π ′ such that ). This implies that (30) holds for i = 0.
For i ∈ [2^m] \ {0}, we can simply translate the sets A ′ , B ′ , and C ′ by adding i ′ = τ −1 (i). In particular, we define , it defines an automorphism of C that preserves the uniform distribution over codewords. For i ∈ [N ], we define Then, the above statements imply that {i}, A i , B i form a partition of [N ] and that (X i , X Ai , X Bi ) is equal in distribution to (X i , X Ai , Π i X Ci ) for some permutation matrix Π i . Likewise, there is a permutation matrix ) and this establishes (30).
Using the nesting property, we can now bound ∆ B i (t) in terms of the difference between the extrinsic MMSE functions of the original and the extended code. Neglecting the dependence on t to lighten the notation, the bound is derived by starting with (29) and then writing where (31a) follows from combining Lemma 26 and Lemma 29 to establish (31b) is given by the triangle inequality, and (31c) holds because [N ] = {i} ∪ A ∪ B and Lemma 29 shows that the two terms in (31b) are equal. The last two steps follow from (5) and the fact that
The next step is to use Lemma 21 to bound the difference in extrinsic MMSE in terms of the difference of GEXIT functions. Let the GEXIT functions of the original and the extended code be given by The fact that the channel is memoryless means that, for each i ∈ [N ], the GEXIT function for the extended code, G ext i (t), is equal to the GEXIT function for the original code augmented with the additional observations U (t) = (Y N (t), . . . , Y L−1 (t)). In other words, G ext i (t) equals Combining this bound with the fact that −q ′ 1 (t) = M ′ (t) ≡ 1 under the assumed channel parametrization and then rearranging terms, we conclude that the following inequality holds almost everywhere:
The remaining challenge is to argue that the difference between the GEXIT functions is small for most values of the channel parameter t. Recall that, by Lemma 8, the difference in rates between RM(r, m) and RM(r, m + k) is at most (3k + 4)/(5√m).
We will use the fact that the difference in rate can also be expressed as the integral of the difference in GEXIT functions: where (33a) holds because X [N ] and X [L] are distributed uniformly on the codewords of RM(r, m) and RM(r, m + k), (33b) follows from the fundamental theorem of calculus and the assumption that t = 0 is the perfect channel and t = 1 is an uninformative channel, (33c) holds by the law of the total derivative, and (33d) is implied by the fact that the GEXIT functions of all bits are identical (which follows from the transitive symmetry of the original and extended RM codes).
We now have all the pieces in hand to bound the integral of the influence term ∆ B i (t). Specifically, we can write ≤ 2 where the last step follows from Lemma 8. This concludes the proof of (27).
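The rate gap controlled by Lemma 8 can be checked numerically for small parameters. The sketch below uses the standard dimension formula for RM(r, m), namely the sum of the binomial coefficients C(m, i) for i ≤ r, and compares the exact rate difference between RM(r, m) and RM(r, m + k) with the quantity (3k + 4)/(5√m) quoted in the text. The function names are illustrative.

```python
from fractions import Fraction
from math import comb, sqrt


def rm_rate(r: int, m: int) -> Fraction:
    """Exact rate of RM(r, m): dimension sum_{i<=r} C(m, i) over length 2^m."""
    return Fraction(sum(comb(m, i) for i in range(r + 1)), 2 ** m)


def rate_gap(r: int, m: int, k: int) -> Fraction:
    """Rate of RM(r, m) minus rate of the extended code RM(r, m + k)."""
    return rm_rate(r, m) - rm_rate(r, m + k)


def lemma8_bound(m: int, k: int) -> float:
    """The upper bound (3k + 4) / (5 sqrt(m)) quoted from Lemma 8."""
    return (3 * k + 4) / (5 * sqrt(m))


# Example: RM(2, 4) has rate 11/16 and RM(2, 5) has rate 1/2.
print(rate_gap(2, 4, 1), lemma8_bound(4, 1))
```

For a single extension step the gap equals C(m, r)/2^(m+1), which decays like 1/√m near r = m/2; this is the scaling that the bound captures.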
3) Proof of Generalized Influence Bound in (28): This section proves an upper bound on the integral of the generalized influence term ∆ j i (t) defined in (25). Suppressing the explicit dependence on the channel parameter t, this term can be expressed as , where we recall that ) is a modified version of Y in which the j-th component has been resampled according to the same input. We can write where (34a) is the triangle inequality, (34b) holds because the triples (X i , Y ∼i , Y ′ j ) and (X i , Y j ∼i , Y j ) have the same distribution, and (34c) follows from (5).
Following the same approach as in the previous section, we can use Lemma 21 to bound the difference in extrinsic MMSE in terms of the difference of GEXIT functions. Let G i (t) be the GEXIT function of the original channel (Definition 13) and for j ≠ i define s=t to be the GEXIT function for an augmented channel that uses the j-th channel twice. We can now apply Lemma 21 with from below in terms of the extrinsic MMSE. Combining this bound with the fact that −q ′ 1 (t) = M ′ (t) ≡ 1 under the assumed channel parametrization and then rearranging terms, we conclude that the following inequality holds almost everywhere: In view of (34) and (35), we see that the integral of the difference in GEXIT functions provides an upper bound on the integral of ∆ j i (t). If the input distribution has a transitive symmetry group, then the integral of G i (t) follows directly from the definition of the GEXIT function as discussed in Section III-C. However, the integral of G j i (t) does not have such a simple interpretation because the partial derivative of the augmented channel with respect to j is different than for the other channels.
The next lemma provides a bound on the integral in question, averaged over the indices i ≠ j. If the input distribution has a doubly transitive symmetry, then this gives a bound that holds uniformly for all pairs of indices. Combining this result with the bounds in (34c) and (35) gives the single term bound stated in (28).
In particular, if the input distribution has transitive symmetry, then If it has doubly transitive symmetry, then G j i = G ℓ k for all i, j, k, ℓ ∈ [N ] with i ≠ j and k ≠ ℓ. Thus, we have
Proof. See Section VI-C.

C. Bounds on the Extrinsic MMSE via the Area Theorem
Since Lemma 28 establishes the sharp threshold phenomenon in the sense of (23), the next step is to provide bounds on the extrinsic MMSE in terms of the rate of the code. The key tool that enables this is a relation known as the area theorem for GEXIT functions [43], [44]. Consider a family of BMS channels satisfying the assumptions in Definition 15, and let G(t) be the GEXIT function associated with a random input X of length N whose distribution has a transitive symmetry group. Then, the generalized area theorem (6) implies that
(1/N) H(X) = ∫_0^1 G(t) dt. (38)
This statement is an immediate consequence of the definition of the GEXIT function and the assumption of a transitive symmetry, which ensures that the GEXIT function is the same for all inputs. If X is uniformly distributed over the codewords of a binary code, then the LHS of (38) is the rate of the code.
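As a sanity check on the area theorem, consider the BEC family with erasure probability t, where the GEXIT function of a bit reduces to its extrinsic erasure probability (a standard fact for the BEC, used here as an assumption since the general definition is given elsewhere in the paper). The sketch below integrates this function exactly for two small transitive codes, the length-n repetition code and the length-n single parity-check code, and recovers their rates 1/n and (n − 1)/n. The helper names are illustrative.

```python
from fractions import Fraction
from math import comb


def integrate_poly_01(coeffs):
    """Integrate sum_j coeffs[j] * t**j over [0, 1] exactly."""
    return sum(Fraction(c, j + 1) for j, c in enumerate(coeffs))


def exit_repetition(n: int):
    """Coefficients of t**(n-1): a bit stays unknown iff all other copies are erased."""
    return [0] * (n - 1) + [1]


def exit_single_parity_check(n: int):
    """Coefficients of 1 - (1 - t)**(n-1): a bit is recovered iff no other bit is erased."""
    coeffs = [-comb(n - 1, j) * (-1) ** j for j in range(n)]
    coeffs[0] += 1
    return coeffs


for n in (2, 3, 8):
    print(n, integrate_poly_01(exit_repetition(n)),
          integrate_poly_01(exit_single_parity_check(n)))
```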
For the purposes of this paper, the connection between the rate and the extrinsic MMSE follows from the results in Section IV-B. The details are summarized in the following result, which provides bounds on the MMSE in terms of the integral appearing in (23) and the gap between the Shannon capacity and the code rate.
Figure 4. The upper bound (39) for a rate R = 0.5 code on a BEC with erasure rate t when ρ := ∫_0^1 M(s)(1 − M(s)) ds = 0.02. This bound is sharp because, for any t* ∈ [0, 1 − R − ρ), there is a non-decreasing function M : [0, 1] → [0, 1] that equals the upper bound at t = t* and also satisfies the area theorem (e.g., the area in blue is equal to R) and the integral constraint ∫_0^1 M(s)(1 − M(s)) ds = ρ. The function M(t) is shown (blue) for t* = 0.4 and is given by (41) with u* = 0.525. See Example 33 for more details.
Lemma 31. Consider a family of BMS channels satisfying Assumption 1 and suppose that the input distribution is uniform over a code with transitive symmetry and rate R. There exists a unique value t R ∈ (0, 1) such that C(t R ) = R. Furthermore, the extrinsic MMSE satisfies where κ(t) is the inverse of the binary entropy function restricted to the domain [0, 1/2]. The function ψ is non-negative and strictly increasing. Thus, the denominator in (40) is strictly positive for t ∈ (t R , 1].

Proof. See Section VI-C.
If the input is defined by an RM code, then we can combine Lemma 31 with Lemma 28 to obtain bounds on the extrinsic MMSE that depend only on the code rate and the blocklength. Applying these bounds to a sequence of RM codes with strictly increasing blocklength and code rate converging to R ∈ (0, 1) shows that the extrinsic MMSE converges to a 0/1 step function that jumps at the unique t R ∈ (0, 1) such that C(t R ) = R.
Remark 32. In Appendix D, we use an alternative approach to establish the limiting behavior of the extrinsic MMSE for a sequence of RM codes. In particular, by using the comparisons in Lemma 46, one can avoid the need for explicit bounds. While the proofs are not necessarily shorter or simpler, we believe that the approach may be of independent interest.
Example 33. To help explain the upper bound in Lemma 31, we describe an extrinsic MMSE curve that satisfies the bound with equality (see Figure 4). This shows that (39) cannot be improved without imposing some additional constraints on the extrinsic MMSE function. Consider the family of BECs with erasure probability equal to t and recall that H(t) = t, C(t) = 1 − t, and κ(t) = 1.
• From the definition, we see that Using the stated u*, a bit of algebra shows that the last expression simplifies to ρ.
• The upper bound in (39) is attained at the point t*, i.e., M(t*) = ρ/(C(t*) − R), because κ(t) = 1 for the BEC family.
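The numbers quoted for the BEC example (R = 0.5, ρ = 0.02, t* = 0.4, u* = 0.525) can be verified with exact arithmetic. The sketch below assumes a three-level step function for M: zero on [0, t*), the constant v = ρ/(C(t*) − R) on [t*, u*), and one on [u*, 1]. This shape is an assumption made for illustration, since the expression (41) is not reproduced here, but it is consistent with every quantity stated in the example.

```python
from fractions import Fraction as F

R, rho = F(1, 2), F(1, 50)            # rate and rho for the BEC example
t_star, u_star = F(2, 5), F(21, 40)   # t* = 0.4 and u* = 0.525


def bec_upper_bound(t):
    """rho / (C(t) - R) with C(t) = 1 - t; meaningful for t < 1 - R."""
    return rho / (1 - t - R)


v = bec_upper_bound(t_star)  # value of the bound at t*

# Assumed step function: 0 on [0, t*), v on [t*, u*), 1 on [u*, 1].
area = v * (u_star - t_star) + (1 - u_star)   # integral of M over [0, 1]
rho_check = v * (1 - v) * (u_star - t_star)   # integral of M (1 - M)

print(v, area, rho_check)  # prints 1/5 1/2 1/50
```

Under this assumed shape the area equals R and the integral of M(1 − M) equals ρ, so the curve meets both constraints while touching the bound at t*. The bound reaches 1 exactly at t = 1 − R − ρ, which matches the stated range for t*.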

D. RM Codes Achieve Capacity on BMS Channels
We are now ready to prove the main result of the paper. To show that RM codes achieve capacity for any particular BMS channel, we need to show that the channel can be embedded into a family of BMS channels satisfying Assumption 1. To this end, we may consider the following construction.
Definition 34 (Interpolated family of BMS channels). For a BMS channel with input alphabet X = {±1}, output alphabet Y, and capacity C ∈ (0, 1), an interpolated family of BMS channels satisfying Assumption 1 is defined by the following steps:
• Let M* ∈ (0, 1) be the MMSE of the channel associated with a uniform input distribution.
• For 0 ≤ t < M* the output is given by the original channel with probability t/M* and perfect knowledge of the input otherwise. This can be accomplished, for example, by adding the symbols ±∞ to Y and associating them with perfect knowledge of the inputs ±1, respectively.
• For M* ≤ t ≤ 1 the output is given by the original channel with probability (1 − t)/(1 − M*) and is equal to the erasure symbol otherwise.
The MMSE function is M(t) = t and the entropy function is H(t) = t(1 − C)/M* for 0 ≤ t < M* and H(t) = 1 − C(1 − t)/(1 − M*) for M* ≤ t ≤ 1.
The original BMS channel corresponds to the point t = M * .
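The construction in Definition 34 can be sketched numerically. The code below treats each member of the family as a mixture of the original channel with either a perfect channel (MMSE 0, conditional entropy 0) or an erasure channel (MMSE 1, conditional entropy 1 bit), and averages the MMSE and entropy with the mixture weights. The function name and the example values 0.36 and 0.5 are illustrative assumptions, not taken from the paper.

```python
def interpolated_family(t: float, mmse_star: float, capacity: float):
    """Return (M(t), H(t)) for the interpolated family, as mixture averages.

    mmse_star: MMSE of the original channel under a uniform input.
    capacity:  capacity C of the original channel in bits, so that its
               conditional entropy is 1 - C.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    h_star = 1.0 - capacity
    if t < mmse_star:
        w = t / mmse_star  # weight on the original channel; rest is perfect
        return w * mmse_star, w * h_star
    w = (1.0 - t) / (1.0 - mmse_star)  # weight on the original; rest is erasure
    return w * mmse_star + (1.0 - w), w * h_star + (1.0 - w)


for t in (0.0, 0.2, 0.36, 0.7, 1.0):
    print(t, interpolated_family(t, mmse_star=0.36, capacity=0.5))
```

In this sketch the first returned value always equals t, in line with the statement M(t) = t, and the entropy is piecewise linear in t with value 1 − C at t = M*.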
The next result bounds the extrinsic MMSE for RM codes transmitted over a BMS channel.
Lemma 35. Consider a BMS channel with capacity C ∈ (0, 1). The extrinsic MMSE of an RM(r, m) code with rate where ρ(m) := (6 ln(m) + 34)/(5√m) and Ψ(u) := ∫_0^u ψ(v) dv with ψ given in Lemma 31.
Proof. Let M* be the MMSE of the given BMS channel when the input is uniformly distributed on {±1}. Let H(t) be the entropy function associated with the family of channels in Definition 34, and let t R be the unique value such that 1 − H(t R ) = R.
For the upper bound on the extrinsic MMSE, observe that if R < C then t R > M*. Combining the bound in (39), evaluated at t = M*, with the bound on where the second step holds because H ′ (s) = C/(1 − M*) for s ≥ M*. Finally, we observe that the claimed expression follows from which is implied by (9).
For the lower bound on the extrinsic MMSE, observe that, if R > C, then t R < M*. Combining the bound in (40), evaluated at t = M*, with Lemma 28, we see that To simplify the integral, recall that t R is the unique value in Making the change of variables u = R − 1 + s(1 − C)/M* and noting the boundary conditions Finally, we simplify the bound to avoid dependence on M*. Notice that (9) in Section IV-A implies where the last inequality is equivalent to ψ(1 − C) ≤ M*.
We now state the main result of the paper, which provides non-asymptotic bounds on the BER under bit-MAP decoding for an RM code over a BMS channel. These bounds depend only on three quantities: the capacity of the channel, the difference between the capacity and the code rate, and the blocklength. Evaluating these bounds in the limit of increasing blocklength, it follows that RM codes achieve capacity on any BMS channel.
Theorem 36. Consider a BMS channel with capacity C ∈ (0, 1). For every RM(r, m) code whose rate satisfies R(r, m) < C, the bit-error rate under bit-MAP decoding satisfies , where ρ(m) := (6 ln(m) + 34)/(5√m). In particular, for every R ∈ [0, C) there exists a sequence of RM codes with increasing blocklength and rate converging to R such that the BER under bit-MAP decoding converges to zero.
for all i ∈ [N ] where Ψ(u) := ∫_0^u ψ(v) dv with ψ given in Lemma 31. In particular, for every R ∈ (C, 1] and every sequence of RM codes with increasing blocklength and rate converging to R, the BER under bit-MAP decoding converges to the bit-error rate associated with a single use of the channel.
Proof. The upper bound on the BER follows from combining the upper bound on the extrinsic MMSE in Lemma 35 with the relationship between the BER and MMSE in Lemma 12, and then noting that BER(X i | Y ) ≤ BER(X i | Y ∼i ). The lower bound on the BER follows from combining the lower bound on the extrinsic MMSE in Lemma 35 with the relationship between the BER and MMSE in Lemma 22.
From [9, Remark 24], we know that for any R ∈ (0, 1), there is a sequence of RM codes with strictly increasing m whose rate converges to R. The construction of this sequence is also discussed in Section II-A for completeness. If, as m → ∞, the code rate approaches any fixed R < C, then we see that the bit-error probability vanishes because ρ(m) → 0.
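To get a feel for how quickly ρ(m) decays, the following sketch evaluates ρ(m) = (6 ln(m) + 34)/(5√m) for a few values of m. Only the formula comes from the text; the function name and the sample values of m are illustrative. Note that m is the RM parameter, so the blocklength is N = 2^m.

```python
from math import log, sqrt


def rho(m: int) -> float:
    """rho(m) = (6 ln(m) + 34) / (5 sqrt(m)) from Lemma 35 and Theorem 36."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    return (6 * log(m) + 34) / (5 * sqrt(m))


# The blocklength is N = 2**m, so these rows correspond to very long codes.
for m in (10, 100, 1_000, 10_000, 1_000_000):
    print(f"m = {m:>9}: rho(m) = {rho(m):.4f}")
```

Since ρ(m) exceeds 1 at m = 100 and decreases roughly like ln(m)/√m, the explicit bounds become informative only for very large m; the convergence statement is asymptotic in nature.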

VI. PROOFS
In this section, we collect proofs that have been removed from the main text due to length or importance.

A. Background
Proof of Lemma 12. Starting with the definition of the MAP decision rule, we can write where the third step holds because 2 . The upper bound on the BER follows from the inequality with equality if and only if E[X | Y ] ∈ {0, ±1} (i.e., the channel is equivalent to an erasure channel).
The lower bound on the BER follows from Lyapunov's inequality: with equality if and only if E [X | Y ] has constant magnitude (i.e., the channel is equivalent to a BSC). Thus, for a sequence of observations, the bit-error probability approaches 0 (respectively 1/2) if and only if the MMSE approaches 0 (respectively 1).
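The two equality cases can be checked on closed-form examples. The sketch below assumes the usual forms of the bounds for a uniform input X ∈ {±1}: BER = (1 − E|E[X | Y]|)/2, the upper bound BER ≤ MMSE/2, and the lower bound BER ≥ (1 − √(1 − MMSE))/2 obtained from Lyapunov's inequality. The displayed inequalities are not reproduced in this section, so these forms are an assumption that is consistent with the stated equality conditions. The helper names are illustrative.

```python
from math import isclose, sqrt


def ber_bounds(mmse: float):
    """Assumed (lower, upper) bounds on the BER for a given MMSE."""
    lower = (1.0 - sqrt(max(0.0, 1.0 - mmse))) / 2.0
    upper = mmse / 2.0
    return lower, upper


def bsc(p: float):
    """(BER, MMSE) for a uniform input on a BSC with crossover p <= 1/2."""
    return p, 4.0 * p * (1.0 - p)  # |E[X | Y]| = 1 - 2p for every output


def bec(eps: float):
    """(BER, MMSE) for a uniform input on a BEC with erasure probability eps."""
    return eps / 2.0, eps  # E[X | Y] is 0 on erasure and +-1 otherwise


for name, (ber, mmse) in {"BSC(0.1)": bsc(0.1), "BEC(0.2)": bec(0.2)}.items():
    lower, upper = ber_bounds(mmse)
    print(name, lower, ber, upper)
```

The BSC meets the lower bound because |E[X | Y]| is constant, and the BEC meets the upper bound because E[X | Y] takes values in {0, ±1}.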

B. Preliminary Results
Proof of Lemma 17.We begin with the proof of (13).Recall that the input X ∈ {±1} has mean µ and Y is an observation of X through a BMS channel.We can transform the problem into one with a uniform prior using a symmetrization argument.Specifically, let where V ∈ {±1} is a uniform binary variable that is independent of (X, Y ), and let Y u be an observation of X u through the same BMS channel such that (X, V ) − X u − Y u is a Markov chain.Notice that under this specification, the symmetrized input X u is uniformly distributed and the symmetrized inputoutput pair (X u , Y u ) is independent of the original input X.
The mutual information I(X u ; Y u ) can be decomposed according to where the first step holds because of the Markov structure and the second step is the chain rule for mutual information.
and applying the series expansion of binary entropy in (11) yields The second expansion can be simplified further by noting that V = XX u where X is independent of (X u , Y u ), and thus the conditional expectation decouples as the product of expectations: 2k , and then rearranging the terms in (42) gives Notice that the RHS is precisely the formula we are trying to prove.The LHS can be viewed as the entropy of a symmetrized binary-input channel where the input is flipped with probability one half and the status of whether it was flipped (i.e., the variable V ) is provided at the output of the channel.Since Y is an observation of X through a symmetric channel, the distribution of the channel is unaffected by this symmetrization procedure and thus This concludes the proof of (13).
Since (13) holds for an arbitrary prior on X, the proof of (14) follows as a direct consequence of (13). The U − X − Y Markov chain condition implies that, for any u in the support of U, conditioning on U = u only changes the prior on X. Thus, we can use (13) to write Averaging both sides over the distribution of U and interchanging the expectation with the summation (which is justified by the uniform convergence of the sum) gives the desired result.
Proof of Lemma 22.For the BMS channel from X to Y let {q k } k∈N be the sequence given in Definition 16.We proceed by expanding the conditional mutual information Starting with the entropy expansion in ( 14) we can write

This inequality holds because |E[X | U ]| ≤ 1 almost surely and thus
The last step follows from (11), which implies that k∈N c k (1 − q k ) = 1 − C, and (4).Alternatively, starting with the entropy expansion in (11) and then noting that all the terms in the expansion are nonnegative (by Lemma 44 and the fact that Y − (Y, U ) − X is a Markov chain) gives , where the last step follows from (5a).Combining these upper and lower bounds on the mutual information yields To prove the desired inequality for the BER, we use the identity where the second step is the reverse triangle inequality and the third step is Lyapunov's inequality.Combining this inequality with (43) and recalling that c 1 = 1/(2 ln 2) completes the proof.
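The identity ∑ c k (1 − q k ) = 1 − C used above can be checked numerically on a BSC. The sketch below assumes the coefficients c k = 1/(2k(2k − 1) ln 2), which is the standard power-series expansion of the binary entropy around the uniform distribution and agrees with the value c 1 = 1/(2 ln 2) quoted in the proof. It also assumes that q k is the 2k-th moment of E[X | Y] under a uniform input. Both are labeled assumptions because (11) and Definition 16 are not reproduced in this section.

```python
from math import log, log2


def c(k: int) -> float:
    """Assumed series coefficient c_k = 1 / (2k (2k - 1) ln 2); sums to 1."""
    return 1.0 / (2 * k * (2 * k - 1) * log(2))


def entropy_from_series(q, terms: int = 2000) -> float:
    """Evaluate sum_k c_k (1 - q_k) as 1 - sum_k c_k q_k, truncated."""
    return 1.0 - sum(c(k) * q(k) for k in range(1, terms + 1))


def binary_entropy(p: float) -> float:
    """Binary entropy in bits for 0 < p < 1."""
    return -p * log2(p) - (1 - p) * log2(1 - p)


p = 0.1  # BSC crossover probability; here |E[X | Y]| = 1 - 2p
h_series = entropy_from_series(lambda k: (1 - 2 * p) ** (2 * k))
print(h_series, binary_entropy(p))
```

Rewriting the sum as 1 − ∑ c k q k makes the truncation error decay geometrically whenever the channel is not perfect, which is why a modest number of terms suffices here.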
For the next few results, the following definition and lemma will be useful.
Definition 37 (Absolutely Continuous).Consider a real interval [a, b] and a function f This definition is important because the fundamental theorem of calculus for the Lebesgue integral states that, if f is absolutely continuous, then f is differentiable almost everywhere on [a, b] and, for all c ∈ [a, b], the Lebesgue integral of its derivative satisfies Proof.For any ϵ ′ > 0, we use the absolute continuity of f with ϵ = ϵ ′ /γ to obtain the desired δ > 0. Thus, we find that, for any sequence of disjoint intervals Now, we provide a proof for Lemma 19.While the arguments for parts (i) and (ii) are self-contained, the proof of part (iii) depends on some further results (Lemmas 20 and 40) whose proofs appear below.We emphasize that, although the proofs of Lemmas 20 and 40 depend on parts (i) and (ii) of Lemma 19, they do not depend on part (iii).Thus, the argument is not circular.Proof of Lemma 19 (i) and (ii).By assumption, {W (t) : 0 ≤ t ≤ 1} is a family of BMS channels that is ordered by degradation according to Definition 15.Since H(t) = H(X u | Y u (t)) is defined with respect to an observation of a uniform input, the entropy formula given in (i) follows from (13) with µ = 0. Likewise, M(t) = mmse(X u | Y u (t)) is defined with respect to an observation of a uniform input and the formula given in (i) follows from combining (4) and (12).
For (ii) and all 0 ≤ s < t ≤ 1, we first observe that q k (s) ≥ q k (t) follows directly from the degradation ordering of the channel family and Lemma 44.Next, we observe from (i) that where the second step follows from the fact that each term in the sum is non-negative because c k ≥ 0 and q k (s) ≥ q k (t).
Proof of Lemma 20.Consider H(X | Y i (s), V (t)) where V (t) is any side information random variable parameterized by t ∈ [0, 1] that is ordered by degradation and conditionally independent of Y i given X i .The following chain rule plus derivative trick was introduced in [46] for the BEC.Starting with the chain rule for entropy and then using the fact that As the second term on the RHS does not depend on s, we see that the s-derivative can be expressed as From statement of Lemma 19, we know that, for all k ∈ N, q k (s) is absolutely continuous and −q ′ k (t) is nonnegative almost everywhere.Now, we can use 1 − q k (s) = s 0 −q ′ k (u)du (which follows from q k (0) = 1) to rewrite (16).For all t ∈ [0, 1], this gives where we use ν k (t) := 1 − ∥E[X i | V (t)]∥ 2k 2k and neglect the index i to lighten notation.Since, for all k ∈ N, the integrand is non-negative almost everywhere for u ∈ [0, 1], we can apply Tonelli's Theorem [63] (with respect to counting measure for k and Lebesgue measure for u) to interchange the sum and integral so that, for all t ∈ [0, 1], we see that exists for almost all u ∈ [0, 1].In addition, for all s, t ∈ [0, 1], it follows that F (u, t) satisfies This proves that, for all t ∈ [0, 1], ∂ ∂s H(X i | Y i (s), V (t)) exists for almost all s ∈ [0, 1] and is almost everywhere equal to F (s, t).
Notice that if ν k (t) = 1 for all k (i.e., X i is uniformly distributed and the V (t) is independent of X i ) then the entropy in ( 45) is equal to the function H(s).From the assumption that H(s) is absolutely continuous and the arguments given above it follows that there exists a set K ⊆ [0, 1] of measure 0 such that for all u ∈ K c , the derivative H ′ (u) exists, is finite, and is given by Now, we will use the above results to argue that The issue here is that we have not ruled out the possibility that ∂ ∂s H(X i | Y i (s), V (t)) does not exist whenever s = t.To handle this detail, let us define J k ⊂ [0, 1] to be the set of points where q ′ k (u) does not exist and observe that J k has a Lebesgue measure of 0. By the countable subadditivity of measure, it follows that J = ∪ k∈N J k also has Lebesgue measure 0. Hence, for all u ∈ J c , every element of the sequence {−q ′ k (u)} k∈N is well-defined and non-negative.Let J = J ∪ K and observe that J still has measure 0. Also, if u ∈ Jc , then the sum in F (u, t) converges to a finite number when the sequence ν k (t) = 1 for all k ∈ N. 
But, since we always have ν k (t) ∈ [0, 1], the sum must also converge to a finite number for any ν k (t) sequence.Thus, we see that, for all u ∈ Jc and all t ∈ [0, 1], the sum in F (u, t) is well-defined and finite.Integrating this sum over s shows that F (s, t) Finally, we consider the integrability of F (t, t).For t ∈ Jc , define the sequence of functions {f n } n∈N according to Each f n is measurable because ν k (t) is measurable by monotonicity and measurability is preserved under finite sums and products.Furthermore, by the monotone convergence theorem, f n (t) converges pointwise to F (t, t) for all t ∈ Jc .Finally, because 0 ≤ ν k (t) ≤ 1 the sequence is dominated in the sense that |f n (t)| ≤ |H ′ (t)| holds almost everywhere (because J has measure zero).Thus, we can apply the dominated convergence theorem to conclude that the limit F (t, t) is integrable.Since Y ∼i is conditionally independent of Y i given X i , we can choose V (t) = Y ∼i (t) to establish (17).Similarly, since (Y ∼i , U (t)) − X − X i − Y i forms a Markov chain, we can establish (20) by selecting V (t) = (Y ∼i (t), U (t)).Finally, letting V (t) be almost surely constant with E [X i | V (t)] = µ ∈ [−1, 1], we see that (49) holds.

C. Main Results
Proof of Lemma 30.To provide some context, let us first recall the setting of the area theorem for the GEXIT function.For any input distribution X ∈ {±1} N , the law of the total derivative gives From the assumed properties of the channel family, the conditional entropy is equal to 0 at t = 0 and H(X) at t = 1, and so the integral of the above expression is equal to H(X).
The desired expression in (36) differs from the setting of area theorem in two ways: 1) the i-th term in the summation is omitted and 2) the augmented GEXIT function G j i (t) treats the j-th channel differently from the others.Our approach is to find a suitable definition for the augmented GEXIT function in the case i = j such that the summation over all i ∈ [N ] can be expressed as the total derivative of a conditional entropy term.In particular, we will use the definition where Y ′ i (s) is resampled observation of the i input.By the law of the total derivative and the fact that Y i (s) and Y ′ i (s) are identically distributed, this term can be expressed as twice the partial derivative with respect to one observation.This gives and the existence and expansion of this derivative follows from applying Lemma 20 with U (t) = Y ′ i (t).Starting with (36), we can now add and subtract the terms with i = j to obtain For each j ∈ [N ], one finds that the first summation i on the RHS is the total derivative of the difference in entropy terms given by From the assumed properties of the channel family, both of the conditional entropy terms equal 0 at t = 0 and H(X) at t = 1.So, the integral of this term vanishes.
The proof has now been reduced to finding a suitable bound for the integral of the second term in (47), which contains only a single summation.Using the series expansions implied by Lemma 20 for G i (t) and ( 46), we can write , for k ∈ N, by Jensen's inequality.Thus, we find that 2k ≤ 1 and this implies that where the sum equals the derivative of the entropy function, H ′ (t), by Lemma 19.Since H(0) = 0 and H(1) = 1 by the assumed properties of the channel family, we have Summing this expression over i ∈ [N ] completes the proof of (36).
If the input distribution has doubly-transitive symmetry, then G j i = G ℓ k for all i, j, k, ℓ ∈ [N ] with i ̸ = j and k ̸ = ℓ.This implies that, in (36), all terms in the sum are equal.Thus, we can divide by N (N − 1) (i.e., the total number of terms) to see that each term satisfies (37).Definition 39.Let {W (t) : 0 ≤ t ≤ 1} be a family of BMS channels that is ordered by degradation according to Definition 15 and let {q k (t)} k∈N be the sequence given in Definition 16.For µ ∈ [−1, 1] and t ∈ [0, 1], let us define In addition, H ′ µ (t) is non-negative and non-increasing in µ 2 for almost all t ∈ [0, 1].
Proof.The function H µ (t) exists and is bounded because the k-th term in the sum is non-negative and upper bounded by c k (which is summable).The monotonicity of H µ (t) in t and µ follows directly from (48) given the monotonicity of q k (t) in t and µ 2k in µ 2 .The proof of Lemma 20 establishes the absolute continuity of H µ (t) and the power series expansion for its derivative.Given the expansion, we know that H ′ µ (t) ≥ 0 almost everywhere because, for k ∈ N, we have −q ′ k (t) ≥ 0 almost everywhere by statement (ii) in Lemma 19.We emphasize that this proof depends on parts (i) and (ii) of Lemma 19 (which are used in Lemma 20), but does not depend on part (iii) of Lemma 19.Thus, using Lemma 40 to prove part (iii) of Lemma 19 is not circular.
Likewise, this expansion shows that H ′ µ (t) is non-increasing in µ 2 for almost all t ∈ [0, 1].Remark 41.As described above, H µ (t) represents the conditional entropy of a random variable X ∈ {±1} with mean µ observed through the BMS channel W (t).For a different interpretation, consider the setting where X is uniformly distributed and U is an observation through a BSC with crossover probability p = (1 − µ)/2.In this case E [X | U ] ∈ {±µ} almost surely, and since H µ (t) is an even function of µ, it follows that H µ (t) = H(X | Y (t), U ).
Proof of Lemma 31. The existence and uniqueness of t R follow because C(•) is continuous and strictly decreasing. Combining the area theorem (38) with the integral representation s) ds leads to the following decomposition: Notice that this difference is strictly positive on [0, t R ) and strictly negative on (t R , 1]. Combining the expansions in (19) and (17), we see that where the inequality follows from q ′ k (s) ≤ 0 almost everywhere and ∥E To prove the upper bound on M (t), we combine (52) with the nonnegativity of G(s) (which follows from (17)) to obtain Multiplying both sides by M (t) and recalling that M (•) is non-decreasing allows us to write If t < t R then C(t) > R and we can divide both sides by C(t) − R to obtain (39).
Using the expansion (17), we observe that where the inequality follows from ∥E 2 and the final step follows from (49).Now, we focus on the lower bound in (40).We start by multiplying both sides of (50) by negative one, applying (53) to upper bound on G(s), and then using the lower bound H ′ (s) − G(s) ≥ 0 (which follows from ( 51)).This gives Since M (s) is non-decreasing in s and H ′ µ (s) is nonincreasing in µ 2 (see Lemma 40) we have Integrating both sides gives where the first inequality follows from M (s) ≤ M (t) for s ∈ [0, t], the second inequality holds because H ′ µ (•) is nonnegative, and the equality follows from is strictly increasing on [0, 1] with inverse given by ψ(•), we can combine this with (54) to see that We can also strengthen this lower bound by incorporating knowledge about the area under the M (s)(1 − M (s)) curve.
To do this, we write Since ψ(•) is non-negative and strictly increasing, the integral is strictly positive and so we can rearrange terms to obtain the bound given in (40).

A. BMS Channels with General Output Alphabets
For the purpose of our proof, it is convenient to focus on BMS channels satisfying the conditions in Definition 11, i.e., the output alphabet is equal to the extended reals and the transition probability satisfies w(y | + 1) = w(−y | − 1).In this section, we provide a more general definition of BMS channels with respect to an arbitrary output alphabet Y and show any channel satisfying this definition can be mapped to one satisfying the conditions of Definition 11.
Let W Y | X be a binary channel with input alphabet X = {±1} and output alphabet Y, and let w(y | x) denote the conditional density of Y with respect to a fixed dominating measure. It is well known that a minimal sufficient statistic for estimating X from Y is provided by the log-likelihood ratio ℓ : Y → R, which is defined by ℓ(y) := log( w(y | +1) / w(y | −1) ).
Note that in cases where the output uniquely defines the input (e.g., the perfect channel), the log-likelihood ratio can take the values ±∞ in the extended real numbers.
Definition 42 (Channel Symmetry).A binary channel W Y |X with input alphabet X = {±1} and log-likelihood ratio ℓ is called symmetric if the conditional distribution of ℓ(Y ) given the input is +1 is equal to the conditional distribution of −ℓ(Y ) given the input is −1.
For a symmetric channel, the relevant properties of the channel are completely summarized by the distribution of the log-likelihood ratio when the input is +1. This distribution is often referred to as the L-density of the channel [43]. As a consequence, the specific details of the channel and the output space Y can be neglected and one may assume, without loss of generality, that the output alphabet is a subset of the extended reals. For example, if a random variable X ∈ {±1} is transmitted through a symmetric binary channel W Y | X that produces an output Y, then the sufficient statistic ℓ(Y ) can be expressed as the product of the input X and an independent noise term Z according to ℓ(Y ) = XZ, where Z := X ℓ(Y ) is drawn according to the conditional distribution of ℓ(Y ) when the input is +1.
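The symmetry condition and the product decomposition can be verified exactly on a BSC with crossover probability p, where the log-likelihood ratio is ℓ(y) = y · L with L = log((1 − p)/p). The sketch below works with the normalized statistic ℓ(Y)/L ∈ {±1} and represents each conditional law as a dictionary from values to probabilities. The function names are illustrative.

```python
from fractions import Fraction


def llr_law(p, x):
    """Law of ell(Y)/L given X = x on a BSC(p), assuming 0 < p < 1/2."""
    return {x: 1 - p, -x: p}  # Y equals x with probability 1 - p


def negate(law):
    """Law of -V when V has the given law."""
    return {-value: prob for value, prob in law.items()}


def noise_law(p, x):
    """Law of Z/L = X * ell(Y)/L given X = x."""
    return {x * value: prob for value, prob in llr_law(p, x).items()}


p = Fraction(1, 10)
# Definition 42: ell(Y) given +1 has the same law as -ell(Y) given -1.
print(llr_law(p, +1) == negate(llr_law(p, -1)))
# The law of Z does not depend on the input, so Z is independent of X.
print(noise_law(p, +1) == noise_law(p, -1))
```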

B. Degradation Ordering of Channels
This section reviews some facts about channel degradation. The basic idea is that a channel W Z | X is degraded with respect to a channel W Y | X if the output of W Z | X can be simulated by post-processing the output of W Y | X .
Definition 43 (Channel Degradation [43, p. 204 for all x ∈ X and z ∈ Z. Likeiwse, if w(y | x) is a probability mass function then the same expression holds with the integral replaced by a summation.
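For finite alphabets, simulating one channel by post-processing the output of another amounts to summing over the intermediate output. The sketch below illustrates this under assumptions chosen for the example: channels are stored as nested dictionaries, and the helper names `cascade`, `bec`, and `erase_more` are hypothetical. It shows that erasing each unerased output of a BEC with probability δ yields another BEC, which is therefore degraded with respect to the first.

```python
def cascade(w_yx, w_zy):
    """Transition probabilities of X -> Y -> Z: sum over y of w(z|y) * w(y|x)."""
    out = {}
    for x, row in w_yx.items():
        out[x] = {}
        for y, p in row.items():
            for z, q in w_zy[y].items():
                out[x][z] = out[x].get(z, 0.0) + p * q
    return out

def bec(eps):
    """Binary erasure channel with inputs +/-1 and erasure symbol 0."""
    return {+1: {+1: 1.0 - eps, 0: eps}, -1: {-1: 1.0 - eps, 0: eps}}

def erase_more(delta):
    """Post-processing map that erases each unerased symbol with probability delta."""
    return {+1: {+1: 1.0 - delta, 0: delta},
            -1: {-1: 1.0 - delta, 0: delta},
            0: {0: 1.0}}

# BEC(0.2) followed by extra erasures with probability 0.5.
degraded = cascade(bec(0.2), erase_more(0.5))
print(degraded)
```

The resulting erasure probability is 1 − (1 − 0.2)(1 − 0.5) = 0.6, so the cascade matches BEC(0.6) up to floating-point rounding.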
In some cases, the relationship between random variables is described without specifying the channel explicitly. If Y and Z represent two observations of a third random variable X, we say that Y is stochastically degraded w
The above definition is equivalent [43, p. 205] to the statement that, for any distribution p X on the input alphabet X, there exists a joint distribution on random variables (X, Y, Z) ∈ X × Y × Z such that: (i) X has distribution p X; (ii) the pair (X, Y) is distributed as the input and output of W Y | X, and the pair (X, Z) is distributed as the input and output of W Z | X; and (iii) X − Y − Z forms a Markov chain.
The following is closely related to previous characterizations of channel degradation [43, p. 206].
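Lemma 44 (Convex Order). Let X be a vector space over R and let X ∈ X be a random variable that is transmitted through two channels W Y | X and W Z | X whose outputs are Y and Z, respectively. If W Z | X is degraded with respect to W Y | X, then for all convex functions ϕ : X → R, we have
\[ \mathbb{E}\big[\phi(\mathbb{E}[X \mid Z])\big] \le \mathbb{E}\big[\phi(\mathbb{E}[X \mid Y])\big], \]
provided that the expectations exist. In particular, if X is real-valued then, for all positive integers k,
\[ \mathbb{E}\big[(\mathbb{E}[X \mid Z])^{2k}\big] \le \mathbb{E}\big[(\mathbb{E}[X \mid Y])^{2k}\big]. \]
Proof. We note that expectations are defined using the vector space structure on X. Observe that the expectations in the inequality depend only on the marginal distributions of the pairs (X, Y) and (X, Z) and thus we are free to consider any joint distribution on (X, Y, Z) with the same pairwise marginals. From the definition of channel degradation, there exists a joint distribution such that X − Y − Z forms a Markov chain. Under this distribution, the conditional expectation satisfies E[X | Y, Z] = E[X | Y] almost surely and so the first result follows from writing
\[ \mathbb{E}\big[\phi(\mathbb{E}[X \mid Z])\big] = \mathbb{E}\big[\phi(\mathbb{E}[\mathbb{E}[X \mid Y, Z] \mid Z])\big] = \mathbb{E}\big[\phi(\mathbb{E}[\mathbb{E}[X \mid Y] \mid Z])\big] \le \mathbb{E}\big[\mathbb{E}[\phi(\mathbb{E}[X \mid Y]) \mid Z]\big] = \mathbb{E}\big[\phi(\mathbb{E}[X \mid Y])\big], \]
where the third step follows from Jensen's inequality and the convexity of ϕ. The second result holds because ϕ(x) = x^{2k} is convex on R for all positive integers k.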
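The moment inequality in Lemma 44 can be checked numerically on a small example. The sketch below is illustrative only: it assumes a uniform input X ∈ {±1}, takes Y to be the output of a BSC with crossover probability 0.1, and obtains Z by passing Y through a second BSC with crossover probability 0.2. The helper names `bsc`, `cascade`, and `posterior_mean_moment` are hypothetical.

```python
def bsc(p):
    """Binary symmetric channel with inputs and outputs in {+1, -1}."""
    return {+1: {+1: 1.0 - p, -1: p}, -1: {+1: p, -1: 1.0 - p}}

def cascade(w_yx, w_zy):
    """Transition probabilities of X -> Y -> Z: sum over y of w(z|y) * w(y|x)."""
    out = {}
    for x, row in w_yx.items():
        out[x] = {}
        for y, p in row.items():
            for z, q in w_zy[y].items():
                out[x][z] = out[x].get(z, 0.0) + p * q
    return out

def posterior_mean_moment(prior, w, power):
    """E[(E[X | output])^power] for real-valued X with pmf prior and channel w[x][y]."""
    outputs = {y for row in w.values() for y in row}
    total = 0.0
    for y in outputs:
        p_y = sum(prior[x] * w[x].get(y, 0.0) for x in prior)
        if p_y > 0.0:
            mean = sum(x * prior[x] * w[x].get(y, 0.0) for x in prior) / p_y
            total += p_y * mean ** power
    return total

prior = {+1: 0.5, -1: 0.5}
w_y = bsc(0.1)                 # channel from X to Y
w_z = cascade(w_y, bsc(0.2))   # degraded channel from X to Z

for k in (1, 2, 3):
    lhs = posterior_mean_moment(prior, w_z, 2 * k)
    rhs = posterior_mean_moment(prior, w_y, 2 * k)
    print(k, lhs, rhs, lhs <= rhs)
```

In this example the posterior means are ±0.8 for Y and ±0.48 for Z, so the two sides equal 0.48^{2k} and 0.8^{2k}, respectively.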

C. Comparison with Earlier Proof for the BEC
This section discusses the relationship between the approach used in this paper, which is applicable to any BMS channel, and the approach used in earlier work, which applies only to the BEC [9]. Recall that the proof in this paper depends crucially on the nesting property of RM codes described in Section II-B. In comparison, the approach in [9] combines special properties of the BEC with results from the theory of boolean functions [64], [65] to prove that any sequence of codes with a doubly transitive symmetry group achieves capacity.
To make the comparison, we first simplify the approach used in this paper for the special case of the BEC. For the BEC, let t denote the erasure rate and recall that the GEXIT function simplifies to the EXIT function in this case. Thus, we have
\[ G_i(t) = H(X_i \mid Y_{\sim i}(t)). \]
In addition, for any received sequence, the channel input X_i is either recoverable or unknown. It follows that E[X_i | Y_{∼i}(t)]^2 ∈ {0, 1} and the extrinsic MMSE also satisfies
\[ M_i(t) = 1 - \mathbb{E}\big[\mathbb{E}[X_i \mid Y_{\sim i}(t)]^2\big] = H(X_i \mid Y_{\sim i}(t)). \]
Now, we assume that the code has transitive symmetry so that we can restrict our attention to M(t) := M_0(t) and use Lemma 25 to upper bound the variance of the estimate. Next, we will evaluate ∆^j_0 := ∆^{{j}}_0 by starting from its definition in (25). Suppressing t, we can rewrite this as , where Y′_j is an independent observation of X_j through the same channel and D_j(y, y′) is defined to be . Now, we observe that D_j(y, y′) ∈ {0, 1} and it equals 0 unless y_j ≠ y′. If y_j ≠ y′, then this quantity is related to the influence (from the theory of boolean functions) and we see that , where I_j is the influence of the j-th received value on the EXIT function as defined in [9]. Since we have Pr( From [9, Remark 18], we also know that
\[ I_j(t) = \frac{\partial}{\partial s_j} H\big(X_0 \mid Y_{\sim 0}(s_0, \ldots, s_{N-1})\big) \Big|_{(s_0, \ldots, s_{N-1}) = (t, \ldots, t)}. \]
Notice that I_0(t) = 0 because Y_{∼0} does not depend on Y_0. Assuming doubly transitive symmetry, we see that I_j(t) = I_1(t) for all j ∈ [N] \ {0}. Thus, the total derivative formula implies that
\[ \frac{d}{dt} H(X_0 \mid Y_{\sim 0}(t)) = \sum_{j \in [N]} I_j(t) = (N - 1)\, I_1(t). \]
Following the approach in this paper, we can use (56) to see that , where the integral equals 1 if the minimum distance of the code is at least 2. We can also apply this bound to a subset A ⊆ [N] by summing over all j ∈ A. From this, we see that the total contribution will vanish as long as |A|/N vanishes for the chosen sequence of codes.
In this paper, the remaining terms in (26) are grouped together. To analyze C = RM(r, m) with N = 2^m, we choose k ≥ 1 and define A = [2^{m−k}]. By Lemma 6, we see that X_A is a uniform random codeword from RM(r, m − k). Then, we define B = [N] \ A and recall, from Section V-B2, that ∆^B_0(t) equals where Y′(t) denotes an independent second observation of X through a BEC with the same erasure probability. Since we are working on the BEC, both inner conditional expectations can only take values in the set {−1, 0, 1} with 0 indicating erasure and ±1 indicating successful recovery. Thus, we can simplify ∆^B_0(t) by expanding the square and taking expectations to get ). The first inequality holds because it may be possible to recover X_0 by jointly processing Y_A, Y_B, Y′_B even when it cannot be recovered separately from either Y_A, Y_B or Y_A, Y′_B. The second inequality follows from assuming that U(t) is the observation of a uniform random codeword X′ from RM(r, m + k) and that (X_0, Y_A, Y_B, Y′_B) is equal in distribution to (X′_0, U_A, U_B, U_C) (e.g., see Lemma 9 and Section V-B2).
Finally, we can put things together. First, we can integrate the upper bound on ∆^B_0(t) to see that where the last step follows from Lemma 8. Then, we can integrate (26) to see that .
This upper bound vanishes if we consider a code sequence where m → ∞ with k chosen according to k = ⌊log_2 m⌋. Thus, the EXIT function has a sharp threshold and the EXIT area theorem (e.g., see [9, Proposition 11]) implies that M(t) = H(X_0 | Y_{∼0}(t)) will jump at 1 − R in the limit.
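For very short codes, the BEC quantities in this comparison can be computed exactly by enumeration. The sketch below does this for RM(1, 3) as an illustration; the helper names and the bit-packed generator matrix are assumptions of the example, not constructions from the paper. It uses the fact that, on the BEC, X_0 is recoverable from the unerased positions exactly when the generator-matrix column of position 0 lies in the GF(2) span of the unerased columns. It then evaluates H(X_0 | Y_∼0(t)) as a polynomial in t and integrates it exactly, which should reproduce the code rate by the EXIT area theorem for a transitive code.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

def rm1_columns(m):
    """Generator-matrix columns of RM(1, m), one int per codeword position.

    Bit 0 of column j is the all-ones row; bits 1..m hold the binary digits of j.
    """
    return [1 | (j << 1) for j in range(2 ** m)]

def gf2_rank(vectors):
    """Rank over GF(2) of a list of bit-packed vectors (Gaussian elimination)."""
    pivots = {}  # highest set bit -> reduced vector with that leading bit
    for v in vectors:
        while v:
            h = v.bit_length() - 1
            if h not in pivots:
                pivots[h] = v
                break
            v ^= pivots[h]
    return len(pivots)

def exit_coefficients(columns):
    """counts[e] = number of patterns with e erasures among positions 1..N-1
    for which X_0 cannot be recovered from the unerased positions."""
    n = len(columns)
    counts = [0] * n
    for kept_size in range(n):
        for kept in combinations(range(1, n), kept_size):
            kept_cols = [columns[j] for j in kept]
            if gf2_rank(kept_cols + [columns[0]]) > gf2_rank(kept_cols):
                counts[(n - 1) - kept_size] += 1
    return counts

def exit_value(counts, t):
    """H(X_0 | Y_~0(t)) in bits for erasure probability t."""
    n = len(counts)
    return sum(c * t ** e * (1 - t) ** (n - 1 - e) for e, c in enumerate(counts))

def exit_area(counts):
    """Exact integral over [0, 1], using int t^e (1-t)^(n-1-e) dt = e!(n-1-e)!/n!."""
    n = len(counts)
    return sum(Fraction(c * factorial(e) * factorial(n - 1 - e), factorial(n))
               for e, c in enumerate(counts))

counts = exit_coefficients(rm1_columns(3))
print(counts, exit_area(counts), exit_value(counts, 0.5))
```

At blocklength 8 the curve is still smooth; the sharp threshold at 1 − R described above only emerges as the blocklength grows, and this brute-force enumeration does not scale to that regime.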

D. Localization of Jump in Extrinsic MMSE via Sequences
In Section V-C, we provide non-asymptotic bounds on the extrinsic MMSE associated with a family of BMS channels and an RM(r, m) code. Applying these bounds to a sequence of RM codes with strictly increasing blocklength and code rate converging to R ∈ (0, 1) shows that the extrinsic MMSE converges to a 0/1 step function that jumps at the unique point t_R such that C(t_R) = R. This section provides an alternative proof of that result, which may be of independent interest.
In particular, we make use of Lemma 46 below which shows that convergence of the extrinsic MMSE to 0 or 1 is equivalent to convergence of the GEXIT to its lower and upper bounds, respectively.
Let {C^{(n)}}_{n∈N} be a sequence of transitive codes with strictly increasing blocklength and rate converging to R ∈ (0, 1). For a BMS family satisfying Assumption 1, let {(G^{(n)}, M^{(n)})}_{n∈N} be the corresponding sequence of GEXIT functions and extrinsic MMSE functions. The bounds given here and in Section V-C depend primarily on the quantity
\[ a_n := \int_0^1 M^{(n)}(s)\,\big(1 - M^{(n)}(s)\big)\, ds. \]
We will see that a code sequence achieves capacity on the family of BMS channels if a_n → 0.
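To build intuition for why a small value of the integral of M(s)(1 − M(s)) over [0, 1] (the quantity written as ρ in the caption of Figure 4) forces a sharp transition, one can evaluate it for a toy family of curves. The sketch below is purely illustrative: the logistic curve is a hypothetical stand-in chosen for convenience and is not the extrinsic MMSE of any actual code, and the names `logistic_mmse` and `step_gap` are assumptions of the example.

```python
import math

def logistic_mmse(t, sharpness, threshold=0.5):
    """Hypothetical non-decreasing curve in [0, 1] with a transition near `threshold`."""
    return 1.0 / (1.0 + math.exp(-sharpness * (t - threshold)))

def step_gap(mmse, steps=20_000):
    """Midpoint-rule approximation of the integral of M(s) * (1 - M(s)) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        m = mmse((i + 0.5) * h)
        total += m * (1.0 - m)
    return h * total

for sharpness in (10.0, 30.0, 100.0):
    curve = lambda t, s=sharpness: logistic_mmse(t, s)
    # For a logistic curve, M(1 - M) = M' / sharpness, so the integral has a closed form.
    closed_form = (curve(1.0) - curve(0.0)) / sharpness
    print(sharpness, step_gap(curve), closed_form)
```

For this family the integral decays like 1/sharpness, so driving it to zero is the same as making the curve approach a 0/1 step.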
The approach taken in this section is a proof by contradiction. Suppose that a_n → 0 but the sequence of extrinsic MMSE functions, M^{(n)}(t), does not converge to a 0/1 step function that jumps at t = t_R. Then, one of two things must happen: either there is a t′ < t_R, an ϵ ∈ (0, 1), and a subsequence M^{(n_k)}(t) such that M^{(n_k)}(t′) ≥ ϵ for all k ∈ N, or there is a t′ > t_R, an ϵ ∈ (0, 1), and a subsequence M^{(n_k)}(t) such that M^{(n_k)}(t′) ≤ 1 − ϵ for all k ∈ N.
The following lemma implies that both possibilities lead to contradictions. To see this, we recall that the area theorem implies
\[ \lim_{n \to \infty} \int_0^1 G^{(n)}(t)\, dt = R. \]
This also implies that the limit is the same for any subsequence G^{(n_k)}(t). Now, for the t′ < t_R case, we see (57) implies that the limit inferior of the sequence of GEXIT integrals is at least C(t′) > R, which gives a contradiction. By comparing (17) and (19), it is easy to see that G^{(n)}(t) ≤ H′(t). Thus, for the t′ > t_R case, we see the sequence of GEXIT integrals is upper bounded by C(t′) = ∫_{t′}^1 H′(t) dt < R, which gives a contradiction.
The lemma is obtained by combining an upper bound on a_n (e.g., see Lemma 28) with the comparison between the GEXIT function G^{(n)}(t) and the extrinsic MMSE M^{(n)}(t) established in Lemma 46. Thus, the sequence of extrinsic MMSE functions, M^{(n)}(t), converges to a 0/1 step function that jumps at t = t_R. Finally, applying Lemma 46 again shows that the sequence of GEXIT functions G^{(n)}(t) converges almost everywhere to a function that jumps from 0 to H′(t) at t = t_R.
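Lemma 45. Under the assumptions stated above, if a_n → 0, then, for every t′ ∈ (0, 1), we have
\[ \liminf_{n \to \infty} M^{(n)}(t') > 0 \;\Longrightarrow\; \lim_{n \to \infty} \int_{t'}^{1} G^{(n)}(s)\, ds = C(t'), \qquad \limsup_{n \to \infty} M^{(n)}(t') < 1 \;\Longrightarrow\; \lim_{n \to \infty} \int_{0}^{t'} G^{(n)}(s)\, ds = 0. \tag{57} \]
Proof. If lim inf_{n→∞} M^{(n)}(t′) > 0 then there exists an ϵ ∈ (0, 1) and an integer N such that M^{(n)}(t′) ≥ ϵ for all n ≥ N. For t ∈ (t′, 1] and n ≥ N, we can write
\[ a_n \ge \int_{t'}^{t} M^{(n)}(s)\,\big(1 - M^{(n)}(s)\big)\, ds \ge \epsilon\,(t - t')\,\big(1 - M^{(n)}(t)\big), \]
where the second inequality follows from ϵ ≤ M^{(n)}(s) ≤ M^{(n)}(t) for s ∈ [t′, t]. By assumption, the LHS (i.e., a_n) converges to 0 and this proves that M^{(n)}(t) → 1 for all t ∈ (t′, 1]. By Lemma 46, it follows that
\[ \int_{t'}^{1} \big(H'(s) - G^{(n)}(s)\big)\, ds \to 0, \]
which is equivalent to the stated result in view of the fact that
\[ \int_{t'}^{1} H'(s)\, ds = C(t'). \]
For the second statement, the argument is essentially the same. If lim sup_{n→∞} M^{(n)}(t′) < 1 then there exists an ϵ ∈ (0, 1) and an integer N such that M^{(n)}(t′) ≤ 1 − ϵ for all n ≥ N. For t ∈ [0, t′) and n ≥ N, we can write
\[ a_n \ge \int_{t}^{t'} M^{(n)}(s)\,\big(1 - M^{(n)}(s)\big)\, ds \ge \epsilon\,(t' - t)\, M^{(n)}(t), \]
where the second inequality follows from 1 − M^{(n)}(s) ≥ 1 − M^{(n)}(t′) ≥ ϵ and M^{(n)}(s) ≥ M^{(n)}(t) for s ∈ [t, t′]. By assumption, the LHS (i.e., a_n) converges to 0 and this proves that M^{(n)}(t) → 0 for all t ∈ [0, t′). Similarly, the second result follows from applying Lemma 46.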
Lemma 46. Using the setup from Lemma 20, assume that M(t) is strictly increasing and consider a sequence of problems where the BMS channel family is fixed but the code is changing (e.g., X depends on n). Let {(G^{(n)}, M^{(n)})}_{n∈N} be the corresponding sequence of GEXIT and extrinsic MMSE functions for the same symbol (say X_0). Then, for any t′ ∈ (0, 1), we have
\[ \int_{0}^{t'} G^{(n)}(s)\, ds \to 0 \;\Leftrightarrow\; \forall s \in [0, t'),\ M^{(n)}(s) \to 0, \tag{58} \]
\[ \int_{t'}^{1} \big(H'(s) - G^{(n)}(s)\big)\, ds \to 0 \;\Leftrightarrow\; \forall s \in (t', 1],\ M^{(n)}(s) \to 1. \tag{59} \]
Proof. Without loss of generality, we assume that G^{(n)}(t) and M^{(n)}(t) are the GEXIT and extrinsic MMSE functions of X_0 for the n-th problem in the sequence. We will need two bounds on the GEXIT function to proceed. The first is derived in (52) and rewriting it in the notation of this lemma gives The second will be derived shortly and can be stated as To see this, we can subtract (17) from (49) and lower bound by the first term in the resulting sum because all terms are non-negative. Then, (61) holds because −q′_k(t) = M′(t) and
Proof of ⇐= in (59): Starting with the fact that G^{(n)}(t) ≤ H′(t) almost everywhere, we can write < ϵ for all n > N (1 − H(t)), where the last step follows from two applications of (62). By the continuity of H(·), for any ϵ > 0, there exists t ∈ (t′, 1] such that H(t) − H(t′) < ϵ. Since M^{(n)}(t) → 1, there exists N ∈ N such that 1 − M^{(n)}(t) < ϵ for all n > N. Thus, the RHS converges to 0 because, for any ϵ > 0, there is an N ∈ N such that the RHS is less than 2ϵ for all n > N.
By the continuity of H(·), for any ϵ > 0, there exists t ∈ [0, t′) such that H(t′) − H(t) < ϵ. Since M^{(n)}(t) → 0, continuity of the h_b-term in M^{(n)}(t) implies that there is an N ∈ N such that it is less than ϵ for all n > N. Thus, the RHS converges to 0 because, for any ϵ > 0, there is an N ∈ N such that the RHS is less than 2ϵ for all n > N.

Lemma 6 (RM Puncturing). If one punctures the code C = RM(r, m) by only keeping symbol positions with indices in the set I = τ(V), where V = {0, 1}^{m−k} × {0}^k, then C_I = RM(r, m − k). Moreover, puncturing a uniform random codeword from C results in a uniform random codeword from C_I.

2 and a vector b ∈ F_2^m, the degree of a polynomial is preserved by the affine change of variables

F_2^{m+k} spanned by the first m canonical basis vectors and let I = τ(V) = [2^m] be the associated set of codeword indices.

Figure 2. Diagram highlighting two copies of RM(r, m) inside RM(r, m + k) for k = 2 as outlined in Lemma 9. The first copy is supported on the red and blue indices I = {0} ∪ A ∪ B and the second copy is supported on the red and yellow indices I′ = {0} ∪ A ∪ C. The condition k = 2 is indicated by the fact that |B| = (2^k − 1)|{0} ∪ A|. There is also a copy of RM(r, m − k) supported on I ∩ I′ = {0} ∪ A.

2 2 , (25) where Y^S(t) is a modified version of Y(t) in which the entries indexed by S have been resampled according to the same input X. The quantity ∆^S_i(t) can be interpreted as the generalized influence (see [62, Definition 8.22]) of the coordinates indexed by S on the conditional mean estimator f

Lemma 29. For positive integers (r, m, k) with r, k ≤ m, let N = 2^m and L = 2^{m+k}. If X_{[L]} is distributed uniformly on the codewords of the RM(r, m + k) code then X_{[N]} is distributed uniformly on the codewords of the RM(r, m) code. Furthermore, for each i ∈ [N], there exists a partition {i}, A, B of [N], a set C ⊆ [L] \ [N], and an automorphism

Figure 4. Illustration of the upper bound (red) on extrinsic MMSE given by (39) for a rate R = 0.5 code on a BEC with erasure rate t when ρ := ∫_0^1 M(s)(1 − M(s)) ds = 0.02. This bound is sharp because, for any t* ∈ [0, 1 − R − ρ), there is a non-decreasing function M : [0, 1] → [0, 1] that equals the upper bound at t = t* and also satisfies the area theorem (e.g., the area in blue is equal to R) and the integral constraint ∫_0^1 M(s)(1 − M(s)) ds = ρ. The function M(t) is shown (blue) for t* = 0.4 and is given by (41) with u* = 0.525. See Example 33 for more details.

Lemma 38. Consider a function f : [a, b] → R that is absolutely continuous on [a, b] and another function g : [a, b] → R. If there is a constant γ < ∞ such that |g(y) − g(x)| ≤ γ|f(y) − f(x)| for all x, y ∈ [a, b], then g is absolutely continuous on [a, b].
we define the subvector x A = (x a0 , x a1 , . . ., x a M −1 ) ∈ X M without using boldface.A single random variable is denoted by a capital letter (e.g., X, Y, Z).
Vectors of random variables are denoted by boldface capital letters (e.g., X, Y, Z). For a bounded random variable X, the L^p-norm (p ≥ 1) is denoted by ∥X∥ See the first steps in the proof of Lemma 20 for more details. Since U(t) − X_{[N]} − Y_{[N]}(t) forms a Markov chain, we can apply Lemma 21 to bound G^+_i(t) − G_i(t) from below in terms of the difference in extrinsic MMSE.
terms are non-negative by Lemma 19 and ∥E (13)nd Y(t) is an observation of X through W(t), then (13) implies that H_µ(t) = H(X | Y(t)). We also note that H_0(t) = H(t) by Lemma 19.