Polar Codes’ Simplicity, Random Codes’ Durability

Over any discrete memoryless channel, we present error correction codes such that: first, their block error probabilities and code rates scale like random codes'; and second, their encoding and decoding complexities scale like polar codes'. Quantitatively, for any constants <inline-formula> <tex-math notation="LaTeX">$\pi,\rho >0$ </tex-math></inline-formula> such that <inline-formula> <tex-math notation="LaTeX">$\pi +2\rho < 1$ </tex-math></inline-formula>, we construct a sequence of block codes with block length <inline-formula> <tex-math notation="LaTeX">${N}$ </tex-math></inline-formula> approaching infinity, block error probability <inline-formula> <tex-math notation="LaTeX">$\exp (-{N}^\pi)$ </tex-math></inline-formula>, code rate <inline-formula> <tex-math notation="LaTeX">${N}^{-\rho }$ </tex-math></inline-formula> less than the Shannon capacity, and encoding and decoding complexity <inline-formula> <tex-math notation="LaTeX">${O}({N}\log {N})$ </tex-math></inline-formula> per code block. The core theme is to combine polar coding (which keeps the complexity in polar's realm) with large, random, dynamic kernels (which boost the performance to random's realm). The constructed codes are optimal in the following manner: should <inline-formula> <tex-math notation="LaTeX">$\pi +2\rho >1$ </tex-math></inline-formula>, no such codes exist over generic channels regardless of complexity.


Introduction
Richard W. Hamming was among the first to realize that, by grouping information into blocks with redundancies, a calculating machine can correct errors on its own and proceed to the next command instead of halting. His solution, now called Hamming codes, is found in [Ham50]. Claude E. Shannon, a colleague of Hamming at Bell Labs, formalized communication channels and showed that every channel is associated with a number called the capacity, which represents the ultimate limit of the efficiency of communication over that channel.
To brief the rest of the history, we follow the analogy used in [AW14]. Shannon's seminal result, the noisy channel coding theorem [Sha48], is considered the analog of the law of large numbers (LLN). The theorem implies that there exists a sequence of longer and longer block codes whose block error probabilities approach 0 and whose code rates approach the capacity, much as the empirical average of random variables is close to the mean with high probability. Robert G. Gallager, Shannon, Robert M. Fano, and followers extended the LLN result by looking at how the block error probability P_e scales when the code rate R is fixed. They showed that P_e scales like exp(−E_r(R)N), where N is the block length and E_r(R) is a constant depending on R. This paradigm is considered the analog of the large deviations principle (LDP). See [Fan61, Gal65, SGB67, Gal68, Gal73, Bla74, BF02, iFLM11, DZF16]. Meanwhile, a series of works fixed the error probability P_e and looked at how the code rate R scales [Wol57, WEI60, Dob61, Str62, BKB04, Hay09, PPV10]. They showed that R scales like I − Q^{−1}(P_e)·√(V/N), for I the capacity, Q^{−1} the inverse of the standard Q-function, and V an intrinsic parameter of the channel. The parameter V is called the dispersion or the varentropy by different authors. It is the "variance" of the channel while I is the "mean" of the channel. This turns out to be more than an analogy: the random variable log(W(Y | X)/W_out(Y)), called the information density or information spectrum, has mean I and variance V. This paradigm is considered the analog of the central limit theorem (CLT). A later series of works, following [AW14], lets P_e and R vary simultaneously. They showed that the quantity N(I − R)²/|log P_e| converges to 2V, twice the very dispersion appearing in the CLT paradigm. This paradigm is considered the analog of the moderate deviations principle (MDP). On a parallel track, the engineering aspects of communication theory thrive. Codes with excellent practicality have been proposed.
To name a few: Reed-Muller (1954), trellis modulation (1970s), turbo (1990s), low-density parity-check (1963, 1996), repeat-accumulate (1998), fountain (1998), and polar (2009).
Among the long list of inventions, only trellis modulation, low-density parity-check, and polar achieve the LLN paradigm over nontrivial channels; that is, they are capacity-achieving. Among these three, polar stands out as the only code that achieves the CLT paradigm (optimally), the only code that achieves the LDP paradigm (optimally), and the only code that achieves the MDP paradigm (suboptimally). If only polar codes achieved the optimal MDP paradigm as well. We brief the history of polar codes below. Unless stated otherwise, I means the symmetric capacity in the next three paragraphs.
Erdal Arıkan's original works on channel polarization [Ari08, Ari09] established the foundation of polar codes, placing polar codes in the LLN paradigm on day one. Arıkan and Telatar [AT09] characterized the LDP behavior of polar codes, showing that P_e scales like exp(−√N) when an R < I is fixed. Later, Korada-Şaşoğlu-Urbanke [KSU10] generalized polar codes from Arıkan's kernel [1 0; 1 1] to any invertible ℓ-by-ℓ matrix G, granted that ℓ ≥ 2 and G is not column-equivalent to a lower triangular matrix. They then showed that the LDP behavior is P_e ≈ exp(−N^{E_c(G)}), where E_c(G) is a constant depending on the kernel matrix G. The notation E_c(G) is meant to resemble Gallager's error exponent E_r(R), but the former lives at the level exp(−N^•) while the latter lives at the level exp(−•N). The LDP behavior of polar codes was then refined in [HMTU13]. Therein, P_e is approximated by exp(−ℓ^E), where E = E_c(G)n − √(V_c(G)n)·Q^{−1}(R/I) + o(√n) is a more accurate exponent, ℓ is the matrix dimension, n is the depth of the code, and V_c(G) is another constant depending on G. The notation V_c(G) is meant to resemble the channel dispersion V. Appearing to be a CLT behavior, this result lies in the corner of the LDP paradigm that touches the MDP paradigm. Finally, Mori-Tanaka [MT14] generalized everything above to channels of prime power input size. Over arbitrary input alphabets, [ŞTA09a, Sas11] established the corresponding polarization results. (Figure 1: results utilizing different kernels over various channels are mixed; the higher π, ρ, the better the performance; the curved part of [WD18] is ρ = 1 − h₂(π); [Ari09, AT09].) Over binary but asymmetric channels, [SRDR12, HY13] showed the counterpart of [Ari09, AT09] with I being the Shannon capacity. No further result on the LDP side, e.g. over non-binary asymmetric channels, is known. The present work fills the gap.
The CLT behavior of polar codes turns out to be difficult to characterize. It was Korada-Montanari-Telatar-Urbanke [KMTU10] who came up with the idea that approximating an eigenfunction tightly bounds the corresponding eigenvalue. Here ρ > 0 is a number such that R scales like I − N^{−ρ} with a fixed P_e. They had 0.2669 ≤ ρ ≤ 0.2841 over binary erasure channels (BECs). The upper bound was brought down to 3.553ρ ≤ 1 [GHU12]. Hassani-Alishahi-Urbanke [HAU14] lifted the lower bound to 3.627ρ ≥ 1 over BECs and proved a lower bound 6ρ ≥ 1 over binary-input discrete-output memoryless channels (BDMCs). The latter is suboptimal, so [GB14, MHU16] improved the bound to 5.702ρ ≥ 1 and then to 4.714ρ ≥ 1. Additive white Gaussian noise channels (AWGNCs) have a continuous output alphabet, but [FT17] showed that they satisfy 4.714ρ ≥ 1 too. Over BECs particularly, [FV14, YFV19] examined a series of larger kernels; the current record is a 64-by-64 kernel believed to attain 2.9ρ ≥ 1. Near the end of the road to 2ρ < 1, [PU16] showed that by allowing q → ∞, Reed-Solomon kernels achieve 2ρ < 1 over q-ary channels. This does not really prove that polar codes achieve 2ρ < 1 over any specific channel, but it gave hope. Fazeli-Hassani-Mondelli-Vardy [FHMV17, FHMV18] eventually showed that large random kernels achieve 2ρ < 1 over BECs, breaking the barrier. Guruswami-Riazanov-Ye [GRY19] extended their result to all BDMCs utilizing the dynamic kernel technique. Over the remaining channels, the present work fills the gap.
Between LDP and CLT lies polar codes' MDP behavior. Guruswami-Xia [GX13] showed that there exists ρ > 0 such that P_e scales like exp(−N^{0.49}) while R scales like I − N^{−ρ} over BDMCs. This raised the question of which pairs (π, ρ) are possible such that (P_e, R) scales like (exp(−N^π), I − N^{−ρ}). Mondelli-Hassani-Urbanke [MHU16] answered this, partially, in the same paper where they bounded ρ. They showed that under a certain curve connecting (0, 1/5.714) and (1/2, 0) all (π, ρ) are achievable over BDMCs. For BECs the upper left corner is (0, 1/4.627). A straightforward generalization to AWGNCs was also given in [FT17]. We in [WD18] improved their result, showing that via a combinatorial trick the upper left corner of the curve can be (0, ρ) for any ρ that is valid in the CLT regime. The same trick also implied that over BECs all (π, ρ) such that π + 2ρ < 1 are achievable, mainly owing to [FHMV17]'s result that 2ρ < 1 is achievable over BECs. Meanwhile, [BGN+18] made the first step toward investigating general kernel matrices over general prime-ary channels. They showed that it is possible to achieve ρ > 0 with P_e ≈ N^{−Ω(1)}. This is, strictly speaking, "only" a CLT behavior, as the desired error probability in the MDP world is exp(−N^π). Later, Błasiok-Guruswami-Sudan [BGS18] were able to show that for all π < E_c(G) there exists ρ > 0 such that (π, ρ) is achievable. This makes it a direct generalization of [GX13] to all polarizing kernel matrices G over all prime-ary channels. Over the remaining channels, the present work fills the gap.
From here on, I resumes its role as the Shannon capacity. Readers are now prepared to be presented with the main theorem.
Theorem 1 (the main theorem: polar codes' simplicity, random codes' durability). Let W be any discrete memoryless channel. Fix a prime ς ≥ 2. Fix constants π, ρ > 0 such that π + 2ρ < 1. There exists a sequence of block codes with encoding and decoding algorithms such that: (cs) the codes accept uniform ς-ary messages; (cn) the block length N approaches infinity; (cp) the block error probability falls below exp(−N^π); (cr) the code rate exceeds I − N^{−ρ}; and (cc) the encoding and decoding complexity is O(N log N) per code block.
The proof of the main theorem spans Sections 2 to 8, with lemmas continued in Appendices A to C. The entry points are Section 2.1 for (cs), Section 2.2 for (cn), Section 3.2 for (cc), Section 4.2 for (cp), and Section 6.3 for (cr). The main theorem is optimal in the following manner.
Proposition 2 (optimality). Fix π, ρ > 0 such that π + 2ρ > 1. Assume V > 0. Then no sequence of block codes satisfies (cn), (cp), and (cr) simultaneously.

For the rest of the section, we outline the ideas behind the proof of Theorem 1. The proof would be a straightforward remix of polar coding techniques and random coding techniques if it were not for a few hurdles.
Hurdle of input alphabet size: The majority of the polar coding theory assumes that the input alphabet of the underlying channel is binary, of prime size, or, less commonly, of prime power size. But the main theorem aims for arbitrary finite alphabets.
Finite alphabets do possess polarization behavior, but the speed of polarization has room for improvement [Sas11, Theorem 3.5]. We will overcome this by adding "dummy symbols" to the input alphabet to make its size a prime power.
Hurdle of asymmetric channel: Although asymmetric channels do polarize, the input distributions do not automatically become uniform. Precomposing a source coding machinery helps generate the desired distribution and has been proposed before [ŞTA09a, Section III.D] [Ari10, Section IV]. On the other hand, Honda-Yamamoto [HY13] showed that one polar code can do both source coding and noisy channel coding at once. We borrow their idea.
Hurdle of kernel selection: Judging and identifying the best-behaved kernel gets harder as we need finer descriptions of the performance of the code. The good result in the BEC case depends heavily on the erasure nature of the channels (that they are ordered by their capacities). Other, more general results are not strong enough to meet our goal. To overcome this, we borrow a technique called dynamic kernels from [YB15]. The idea is to prepare more than one polarizing kernel and apply the proper one on a channel-by-channel basis. This makes a paradigm shift from one kernel fits all channels to every channel deserves a tailor-made kernel. We will, once per channel, apply random coding theory to show the existence of a proper kernel.
Hurdle of output alphabet size: Even with the great freedom to choose one kernel for each and every channel, there lies the difficulty that some performance bounds are proven with one fixed channel in mind to favor the big-O notations. Those bounds are prone to depend on the size of the output alphabet, which grows to infinity as the channel transformations take place. Meanwhile, some universal bounds are proven that depend only on the size of the input alphabet, which is invariant under channel transformations. We will borrow a bound derived in [CS07,DCS14].
1.1. Organization. Section 2 reviews channels and entropy notations; Section 2.1 explains how to overcome the hurdle of arbitrary input alphabet size. Section 3 reviews the channel transformations; Section 3.1 designs the decoder; Section 3.2 analyzes its complexity; Section 3.3 designs the encoder, overcoming the hurdle of asymmetric channel. Section 4 reviews the channel parameters such as the Bhattacharyya parameter; Section 4.2 shows how to control the block error probability. Section 5 reviews the channel processes; Section 5.1 argues that the global MDP behaviors of H(W_n) and H(V_n) imply the main theorem. The main theorem is thus reduced to the behavior of certain channel processes. Section 6 proves that the global MDP behavior we want holds granted that the local LDP and CLT behaviors hold, effectively boiling the main theorem down to the local behaviors. Section 6.2 introduces the random kernel trick and Section 6.3 introduces the dynamic kernel trick to overcome the hurdle of kernel selection. Section 7 confirms the local LDP behavior. The proof distills properties of the weight distribution of random codes. Sections 7.1 and 7.2 prove the two fundamental theorems of polar coding. Section 8 confirms the local CLT behavior. Contributions from Gallager and Hayashi are utilized. Section 8.2 invokes Chang-Sahai's universal bound, overcoming the hurdle of output alphabet size.
1.2. Three families of randomnesses. The randomnesses from the sender's message, the channel, and the randomized rounding constitute the first family. Typeset in Roman font are random variables (U, X, Y, ...), probability measures (P, Q, W), entropies (H, I), and other parameters (P_e, Z, T, S, ...) in this family. The randomness from the channel process, one main technique in the polar coding literature, is the second family. Typeset in sans serif font are stochastic processes (K_n, W_n, H_n, Z_n, ...), probability measure (P), and expectation (E) in this family. The randomness from random kernel ensembles, the main technique in the random coding literature, is the third family. Typeset in blackboard bold font are random variables (G, X, K), probability measure (P, with exceptions), expectation (E), and Kullback-Leibler divergence (D, with exceptions) in this family.

Channel and Entropy Preliminaries
A discrete memoryless channel is a Markov chain W : X → Y. Here X is a finite input alphabet; Y is a finite output alphabet; and W is an array of transition probabilities W(y | x) ∈ [0, 1] for all x ∈ X and y ∈ Y. The numbers satisfy Σ_{y∈Y} W(y | x) = 1 for all x ∈ X, which represents the fact that each x must transition to some unique y. When X and Y are clear from the context, we call W a channel. Although the input distribution is not part of the channel data, we write W_in(x) to denote the input distribution. When W_in(x) is understood from the context, we write W(x, y) to denote the joint distribution W(y | x)W_in(x), write W_out(y) to denote the output distribution, and write W(x | y) to denote the a posteriori probability W(x, y)/W_out(y). (Thus the interpretation of W(· | ·) depends on the arguments and the context.) A tuple of inputs (x_i, x_{i+1}, ..., x_j) is abbreviated as x_i^j. Same for y_i^j for tuples of outputs, and for u_i^j for general variables. We assume memoryless channels, and write W(y_1^ℓ | x_1^ℓ) to denote the product measure ∏_{i=1}^{ℓ} W(y_i | x_i) for ℓ consecutive usages. We write W_in(x_1^ℓ), W_out(y_1^ℓ), W(x_1^ℓ, y_1^ℓ), and W(x_1^ℓ | y_1^ℓ) to denote the corresponding input, output, joint, and a posteriori probabilities.
Let X, Y be two r.v.s (random variables). Let H(X), H(X | Y), and I(X ; Y) be the standard entropy, conditional entropy, and mutual information. The base of the logarithm will be assigned later. When X is the input fed into some channel W : X → Y and Y is the corresponding output, we write H(W) and I(W) to mean H(X | Y) and I(X ; Y). When the distribution of X (the input distribution) is chosen to maximize I(W), it is called the capacity-achieving input distribution and I(W) is called the (Shannon) capacity of the channel W : X → Y. Unless stated otherwise, the input distributions will be capacity-achieving.
2.1. Reduce input size to prime power. Immediately after declaring which channels are concerned (those with finite input and output alphabets), we show that it suffices to consider input alphabets of prime power size. Let W : X → Y be a channel. Let the input alphabet X be of size s. Let q be any prime power greater than or equal to s. Degrade the channel W as follows: Let the symbols in X be ξ_1, ξ_2, ..., ξ_s. Let ξ_{s+1}, ξ_{s+2}, ..., ξ_q be q − s extra symbols. Let X̄ be X ∪ {ξ_{s+1}, ξ_{s+2}, ..., ξ_q}; this is the extended alphabet. Define a dummy channel D : X̄ → X by letting D(ξ_{min(i,s)} | ξ_i) be 1 for all i = 1, 2, ..., q. That is, all extra symbols collapse to ξ_s while the old symbols remain. The composition of the two channels forms a degraded channel W ∘ D with prime power input size. By the data processing inequality, the Shannon capacity of the degraded channel W ∘ D is no greater than W's Shannon capacity. Meanwhile, it is clear that the degraded channel W ∘ D achieves W's capacity by the same input distribution, ignoring extra symbols. In other words, I(W ∘ D) = I(W). This constitutes the input size reduction.
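The degradation step is short enough to sketch in code. The snippet below is a minimal illustration (function and variable names are hypothetical), storing a channel as a list of rows with W[i][j] = W(ξ_{j+1} | ξ_{i+1}) under 0-based indexing:

```python
# Minimal sketch of the input-size reduction of Section 2.1 (names hypothetical).
# A channel is a list of rows: W[i][j] = W(y_j | xi_i), each row summing to 1.

def degrade_to_prime_power(W, q):
    """Extend an s-ary channel to a q-ary one (q >= s a prime power):
    every extra symbol xi_i with i > s behaves exactly like xi_s."""
    s = len(W)
    assert q >= s
    # old symbols keep their rows; each extra symbol copies xi_s's row
    return [list(W[min(i, s - 1)]) for i in range(q)]

# Example: a ternary-input channel extended to the prime power q = 4.
W = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]]
W4 = degrade_to_prime_power(W, 4)
assert W4[3] == W4[2]  # the extra symbol xi_4 collapses to xi_3 (= xi_s)
```

Since the extra rows duplicate an existing row, any input distribution supported on the old symbols induces the same output statistics, which is the I(W ∘ D) = I(W) claim above.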
Hereafter, we assume the size of the input alphabet X is q, where q is a prime power. If the sender wants to send uniform binary messages, let q be a power of 2. If the sender wants to send uniform ternary messages, let q be a power of 3. In case the sender wishes to send uniform quaternary messages but does not want to split an information bit over two channel symbols, let q be a power of 4. Bonus: should the sender want to send uniform senary messages, choose q_2 a power of 2 and q_3 a power of 3 such that q_2·q_3 is a power of 6; then alternate between q = q_2 and q = q_3. That is, the sender breaks every senary symbol into a binary component and a ternary component, sends the binary component through the q = q_2 code block, and sends the ternary component through the q = q_3 code block. For other message alphabets, apply the fundamental theorem of arithmetic.
Fix a q. Let F_q be the finite field of order q (with the addition and multiplication structure). Identifying X with F_q, we will use them interchangeably. We say that W is a q-ary channel whenever the alphabets X and Y are even remotely relevant. It is worth keeping in mind that, for inequalities in this work, q = 2 is the most difficult case, and q ≥ 2 will be used silently.
2.2. On the message alphabet and the block length. The fact that we have some freedom to choose q blurs the meaning of the block length N since, say, a q²-ary symbol bears twice as much message as a q-ary symbol does. Notwithstanding, we would like to remind readers that multiplying or dividing N by any constant does not alter the semantics of the main theorem. This is because O(N log N) can absorb any constant; exp(−N^π) and N^{−ρ} can too, by fluctuating π and ρ a bit. A more serious aftereffect is caused by mixing code blocks with distinct q. When the sender attempts to send uniform 30-ary messages, they choose q_2, q_3, q_5 and switch among the three block codes. The q = q_2 blocks have their own block length N_2, just as the other blocks have N_3 or N_5 as block lengths. The de facto block length N, the minimal number of channel usages before the receiver can decode everything sent so far, is thus three times the least common multiple of N_2, N_3, and N_5. We claim without proof (but it will be clear once we prove the rest of the main theorem) that it is possible to make N_2 = N_3 = N_5 and consequently N = 3N_2. Again, increasing N three-fold does not make any difference. For numbers with more prime factors, a similar reasoning applies.
We recommend readers not to worry about the message alphabet, as there exists a powerful solution: pre-compose another code that re-encodes an arbitrary finite message distribution (not necessarily uniform) into a uniform prime power-ary input distribution. The existence of such a code is, by duality, tightly bound to the existence of an error-correction code that carries uniform prime power-ary messages over channels of arbitrary arity. The latter is exactly what the main theorem concerns.
We clarified (cs) and (cn) in this section; there are (cc), (cp), and (cr) to go. We continue proving the main theorem in the next section.

Channel Transformation
Let ℓ ≥ 2 be an integer. This will be the dimension of the kernel matrices. But for now, let us introduce a flexible framework. Fix a q-ary channel W : X → Y. Let U_1, U_2, ..., U_ℓ be r.v.s taking values in X. For 1 ≤ i ≤ j ≤ ℓ, let U_i^j be the joint r.v. U_i U_{i+1} ··· U_j. Let g_W : X^ℓ → X^ℓ be a bijective map; that is, H(U_1^ℓ | g_W(U_1^ℓ)) = 0. We now feed X_1^ℓ := g_W(U_1^ℓ) into ℓ i.i.d. (independent and identically distributed) copies of the channel W. Let Y_1^ℓ ∈ Y^ℓ be the corresponding output. The chain rule of conditional entropy reads (1) H(U_1^ℓ | Y_1^ℓ) = Σ_{i=1}^{ℓ} H(U_i | U_1^{i−1} Y_1^ℓ). Interpretation: to estimate U_1^ℓ given Y_1^ℓ, we first estimate U_1 given Y_1^ℓ; and then use the estimate Û_1 to further estimate U_2; afterward, we estimate U_3 given Û_1, Û_2, and Y_1^ℓ; and so on. To achieve W's capacity, g_W(U_1^ℓ) must follow a certain capacity-achieving distribution. Since g_W is bijective, this induces a distribution on U_1^ℓ. (Remark: we imply nothing about whether U_1, U_2, ..., U_ℓ are i.i.d. or not.) Fix this distribution; then H(U_1^ℓ) = Σ_{i=1}^{ℓ} H(U_i | U_1^{i−1}). These two chain rules motivate the channel transformation: for each i ∈ [ℓ], synthesize from W a channel W^{(i)} : X → X^{i−1} × Y^ℓ that carries U_i as its input and emits the pair (U_1^{i−1}, Y_1^ℓ) as its output.

Its input distribution W^{(i)}_in(u_i) is determined by that of U_i. It may sound weird that W^{(i)} will tell the receiver the inputs of W^{(1)}, W^{(2)}, ..., W^{(i−1)} for free. But in reality, W^{(i)} acts as an interactive device where the receiver (not the sender) needs to input what U_1^{i−1} is, and the device will output something that looks like U_1^{i−1} Y_1^ℓ; only when the receiver inputs the correct U_1^{i−1} does the device return the correct U_1^{i−1} Y_1^ℓ. Under this interpretation, the de facto capability of W^{(i)} is thus I(U_i ; U_1^{i−1} Y_1^ℓ), which justifies the chain rule of the mutual information. To avoid confusion, we prefer H(W^{(i)}) over I(W^{(i)}) in calculations.
(Figure 2. A DU with ℓ = 3 and its I/Os; for example, Step 3-a outputs (W^{(3)}(u_3 | û_1^2 y_1^3) : u_3 ∈ X) and Step 3-b inputs û_3.)

To iterate the transformation, we allow g_W to depend on the channel W. That is to say, we need (presumably distinct) bijections g_{W^{(i)}} : X^ℓ → X^ℓ for every i ∈ [ℓ] when we want to define (W^{(i)})^{(j)} out of W^{(i)}. Similarly, we need yet another ℓ² bijections g_{(W^{(i)})^{(j)}} : X^ℓ → X^ℓ for every i, j ∈ [ℓ] in defining depth-3 channels. And the recursion goes on ad infinitum.

3.1. Design of the decoder. To implement channel transformations, we define a DU (decoding unit) to be an automaton as follows: It is a box with ℓ pins on the left and ℓ pins on the right. Each pin is connected to another DU, a CH, an FH, or an IH (to be defined later). Each pin may take input or produce output, but not at the same moment. A DU works as follows: Let W : X → Y be the channel it is to transform. (Step 0) For all i ∈ [ℓ], the i-th pin on the left takes the input y_i. The input is passed in the form of the a posteriori distribution (W(x_i | y_i) : x_i ∈ X). This is what Arıkan calls the α-representation [Ari15b, Section II.A]. (Step 1-a) It computes the a posteriori distribution of U_1 given y_1^ℓ, that is, (W^{(1)}(u_1 | y_1^ℓ) : u_1 ∈ X), and then outputs this tuple of probabilities to the first pin on the right. (Step 1-b) At a later moment, it will receive an estimate û_1 of U_1 from the first pin on the right. Note that û_1 is a hard symbol in X, not a soft tuple of probabilities. (Step 2-a) It computes the a posteriori distribution of U_2 given û_1 y_1^ℓ; that is to say, it pretends that U_1 happens to be û_1 and computes (W^{(2)}(u_2 | û_1 y_1^ℓ) : u_2 ∈ X) accordingly. It then outputs this tuple of probabilities to the second pin on the right. (Step 2-b) At a later moment, it will receive an estimate û_2 of U_2 from the second pin on the right. (Step i-a) In general, it computes W^{(i)}(u_i | û_1^{i−1} y_1^ℓ) for all u_i ∈ X and then outputs the tuple to the i-th pin on the right.
(Step i-b) After a while, it will receive û_i. (Step ℓ + 1) Once it receives û_ℓ from the last pin on the right, it computes x̂_1^ℓ := g_W(û_1^ℓ), and then outputs x̂_i to the i-th pin on the left for all i ∈ [ℓ]. See Figures 2 to 4 for illustrations.
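A single DU's Steps 1-a through ℓ-a can be realized by the naïve Bayesian enumeration described in Section 3.2. The sketch below is a minimal illustration (hypothetical names; ℓ = 2, q = 2, uniform inputs, and Arıkan's kernel g(u_1, u_2) = (u_1 + u_2, u_2) standing in for g_W), computing the tuple output in Step i-a from the α-representations at the left pins:

```python
import itertools

def du_posteriors(alpha1, alpha2, i, u_hats):
    """One DU with l = 2, q = 2, and the kernel g(u1, u2) = ((u1 + u2) % 2, u2).
    alpha_k = (W(x=0 | y_k), W(x=1 | y_k)) is the alpha-representation at left pin k.
    Returns the tuple (W^(i)(u_i | u_hats, y_1^2) : u_i in {0, 1}), uniform inputs."""
    scores = [0.0, 0.0]
    for u1, u2 in itertools.product(range(2), repeat=2):
        if list((u1, u2))[: i - 1] != list(u_hats):
            continue  # condition on the hard estimates received so far
        x1, x2 = (u1 + u2) % 2, u2        # x_1^2 = g_W(u_1^2)
        scores[(u1, u2)[i - 1]] += alpha1[x1] * alpha2[x2]  # Bayes, unnormalized
    total = sum(scores)
    return tuple(s / total for s in scores)

# Step 1-a: posterior of U1 given y_1^2; Step 2-a: posterior of U2 given u1_hat, y_1^2.
p1 = du_posteriors((0.9, 0.1), (0.8, 0.2), 1, ())
p2 = du_posteriors((0.9, 0.1), (0.8, 0.2), 2, (0,))
```

Chaining such units column by column, with a possibly different bijection per DU, reproduces the array described next.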
Chained together, DUs implement deeper transformations: a DU at depth m + 1 transforms (···(W^{(k_1)})···)^{(k_m)} into (···(W^{(k_1)})···)^{(k_{m+1})}. The k_1-th pin on the left of the (k_2, ..., k_n; 1)-th DU connects to a CH (channel helper) indexed by (k_1, k_2, ..., k_n). Each CH then connects to the output of a copy of the channel W. The k_n-th pin on the right of the (k_1, ..., k_{n−1}; n)-th DU connects to either an FH (frozen bit helper) or an IH (information bit helper); in either case, the connected helper is indexed by (k_1, k_2, ..., k_n). Let I ⊂ [ℓ]^n be the set of indexes (k_1, k_2, ..., k_n) such that the k_n-th pin on the right of the (k_1, ..., k_{n−1}; n)-th DU connects to an IH. Then [ℓ]^n \ I is the set of indexes where the pin connects to an FH. On the left hand side of the DU array, the task of the (k_1, k_2, ..., k_n)-th CH is to receive the channel output Y_{(k_1,k_2,...,k_n)} ∈ Y and then forward the a posteriori distribution (W(x_{(k_1,k_2,...,k_n)} | Y_{(k_1,k_2,...,k_n)}) : x_{(k_1,k_2,...,k_n)} ∈ X) to the DU array. On the right hand side, FHs correspond to what Arıkan called frozen bits, bits that do not carry information and whose values the receiver knows as part of the communication protocol. The task of the (k_1, k_2, ..., k_n)-th FH is to receive the a posteriori distribution of the (k_1, k_2, ..., k_n)-th frozen bit and then return the correct symbol U_{(k_1,k_2,...,k_n)} ∈ X back to the DU array. IHs correspond to information bits that carry the sender's messages. The task of the (k_1, k_2, ..., k_n)-th IH is to receive the a posteriori distribution of the (k_1, k_2, ..., k_n)-th information bit and then return the most probable symbol Û_{(k_1,k_2,...,k_n)} ∈ X back to the DU array. When all IHs have been activated once, a code block completes. The most probable symbols they returned to the DU array form the decoded message Û_I, meaning the tuple (Û_{(k_1,k_2,...,k_n)} : (k_1, k_2, ..., k_n) ∈ I).
What we just established is the successive cancellation decoder of polar codes that can be found in most works that implement polar codes. For instance, [Ari09, Section VIII], [Kor09, Section 3.2], [HY13, Section III], and [EKMF+17, Section VI.B]. See especially [GRY19, Section 9] for an almost identical construction, albeit with q = 2 in mind. We replicate the whole story to demonstrate that each DU may use a unique bijection "g" without changing the overall structure too much. (Figure: 12 DUs (with ℓ = 2) are chained together to implement ((W^{(1)})^{(1)})^{(1)}, ..., ((W^{(2)})^{(2)})^{(2)}; DUs in the first column use g_W; DUs in the second column use g_{W^{(1)}} and g_{W^{(2)}}; DUs in the third column use g_{(W^{(1)})^{(1)}}, g_{(W^{(1)})^{(2)}}, g_{(W^{(2)})^{(1)}}, and g_{(W^{(2)})^{(2)}}.) Whether or not this construction can transmit information reliably is discussed in Section 4.2. There, we will also clarify how to arrange FHs and IHs. The complexity can be estimated prior to further specification.
3.2. Complexity of the decoder. There are various models that measure the complexity of a structure. The polar coding community uses a variant of circuit complexity where arithmetic over real numbers costs O(1) and passing probabilities between DUs costs O(1). The complexity of the DU array is thus the number of DUs multiplied by the complexity of a single DU. The number of DUs is ℓ^{n−1}·n. The complexity of a DU depends on how it computes the a posteriori probabilities W^{(i)}(u_i | u_1^{i−1} y_1^ℓ) out of W(x_i | y_i). The naïve approach is to exhaust all possible inputs u_1^ℓ ∈ X^ℓ and compute the a posteriori probabilities using Bayesian formulas. This costs O(ℓ^{10} q^{ℓ+10}) (here 10 is an overestimate). Hence the overall complexity is O(ℓ^{n−1} n ℓ^{10} q^{ℓ+10}). In our setup, however, q is fixed, ℓ will be chosen upon knowing π, ρ, and n goes to infinity afterwards. So we advertise that the complexity is O(ℓ^n n), or O(N log N). Here N := ℓ^n is the block length, equal to the number of copies of the channel W attached to the DU array. The complexities of the CHs, FHs, and IHs can be computed similarly. They are all bounded by O(ℓ^{n+10} q^{10}). Thus the decoder as a whole costs O(N log N).
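The same count can be phrased as a divide-and-conquer recurrence; the display below is our paraphrase of the bookkeeping, treating ℓ and q as constants:

```latex
C(N) \;=\; \ell\, C(N/\ell) \;+\; O(N),
\qquad N = \ell^{n}
\quad\Longrightarrow\quad
C(N) \;=\; O(N \log N).
```

Here the term ℓ·C(N/ℓ) accounts for the ℓ sub-arrays of depth n − 1, and the O(N) term accounts for the first column of ℓ^{n−1} DUs together with their helpers, each of cost O(1) once ℓ and q are fixed.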
We claim that the encoder has the same complexity O(N log N ) although we have not defined the encoder yet. The encoder is essentially a special decoder and is the subject of the next subsection.
3.3. Design of the encoder. The encoder will be an exact copy of the decoder except that the CHs and IHs behave differently. In greater detail: Let there be an ℓ^{n−1}-by-n array of DUs indexed and connected in the same way described in Section 3.1. Each DU executes the exact same task described in Section 3.1. The left pins of the DUs in the first column each connect to a CH. The right pins of the DUs in the last column each connect to the same type of device (an IH or an FH) as their twin-DUs in the decoder do. Here, as part of the encoder, a CH will output the capacity-achieving input distribution (W_in(x) : x ∈ X) into the DU array. For each (k_1, k_2, ..., k_n) ∈ I, the (k_1, k_2, ..., k_n)-th IH will receive a recommended distribution of the (k_1, k_2, ..., k_n)-th information bit and then return the message symbol U_{(k_1,k_2,...,k_n)} ∈ X the sender wants to send back to the DU array. For each (k_1, k_2, ..., k_n) ∈ [ℓ]^n \ I, the (k_1, k_2, ..., k_n)-th FH will receive a recommended distribution of the (k_1, k_2, ..., k_n)-th frozen bit and then return a r.v. U_{(k_1,k_2,...,k_n)} ∈ X that follows that distribution back to the DU array. This r.v. is simulated by a pseudo random number generator shared between the encoder and the decoder. The twin-FH in the decoder, regardless of what distribution it receives, will return the exact same symbol U_{(k_1,k_2,...,k_n)} back to the DU array. This step is called randomized rounding and is found in [Kor09, Section 3.3], [KU10, Section III], [KT10, Section II], and [HY13, Section III.A]. After all IHs return the sender's messages and all FHs return randomly rounded bits to the DU array, the CHs will each get a codeword symbol X_{(k_1,k_2,...,k_n)} ∈ X from the DU array. Each CH will then forward that symbol to an i.i.d. copy of the channel W.
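The shared-PRNG requirement of randomized rounding can be met by keying a generator on a shared seed and the helper's index. The sketch below is a minimal illustration (hypothetical names; any seeded PRNG works as long as encoder and decoder agree on the keying):

```python
import random

def frozen_symbol(dist, shared_seed, index):
    """Return the randomly rounded frozen symbol for the FH at `index`.

    `dist` is the recommended distribution over X = {0, ..., q-1}. Keying a
    fresh PRNG by the shared seed and the index makes the encoder-side FH and
    its twin decoder-side FH return the exact same symbol."""
    rng = random.Random(f"{shared_seed}:{index}")
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

# The encoder's FH and the decoder's twin-FH agree on the rounded symbol.
dist = [0.7, 0.2, 0.1]
u_enc = frozen_symbol(dist, shared_seed=2021, index=(1, 2, 2))
u_dec = frozen_symbol(dist, shared_seed=2021, index=(1, 2, 2))
```

Keying per index (rather than drawing from one shared stream) has the convenient side effect that the two sides need not agree on the order in which FHs are activated.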
This design is a copy of [HY13]'s encoder explained in our terminology. It is clear that the encoding complexity will be O(N log N ), too. Alongside the decoder, the encoder creates its own channel transformations. Let W : X → Y be a q-ary channel and X be a capacity-achieving input. Define a flattening channel W : X → {η} that erases all information. Then the encoder is effectively synthesizing depth-1 channels

Moreover, H(W̄) = H(X) and H(W̄^(i)) = H(U_i | U_1^(i−1)) for each i ∈ [ℓ].
No "Y" plays any role here since the outputs are constant. The fact that a channel as boring as W̄ is helpful to our main theorem will be covered later, in Section 4.2. We have clarified (cs), (cn), and (cc) up to this section; (cp) and (cr) remain.

Channel Parameters
Let W : X → Y be a q-ary channel. Let X be a capacity-achieving input and Y be the corresponding output. Besides H and I, there are several channel parameters that capture the qualities of channels. Here is a list of parameters extracted from the work [MT14] of Mori and Tanaka.
Both H(X | Y) and H(W) denote the base-q conditional entropy, the base chosen such that 0 ≤ H ≤ 1. P_e(X | Y) is the error probability of the maximum a posteriori (MAP) decoder. The MAP decoder looks at an output y ∈ Y and chooses a symbol x̂ ∈ X that maximizes W(x | y). When the output is Y = y, the probability that the MAP decoder does not choose X as x̂ is 1 − max_{x∈X} W(x | y). Therefore, P_e(X | Y) = Σ_{y∈Y} W_out(y)(1 − max_{x∈X} W(x | y)). In a channel-centric narrative, we also write P_e(W) for P_e(X | Y).
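The MAP error probability can be computed directly from the formula above; the helper below follows it term by term (list-of-lists containers and the function name are ours):

```python
def map_error_probability(W, W_in):
    """P_e(W) = sum_y W_out(y) * (1 - max_x W(x|y)).
    W[x][y] is the transition probability W(y|x); W_in[x] is the input law."""
    num_inputs, num_outputs = len(W), len(W[0])
    p_e = 0.0
    for y in range(num_outputs):
        w_out = sum(W_in[x] * W[x][y] for x in range(num_inputs))
        if w_out > 0:
            # a posteriori probability of the best symbol, via Bayes' rule
            best = max(W_in[x] * W[x][y] / w_out for x in range(num_inputs))
            p_e += w_out * (1.0 - best)
    return p_e

# BSC with crossover 0.1 and uniform input: the MAP decoder errs w.p. 0.1.
p = 0.1
bsc = [[1 - p, p], [p, 1 - p]]
print(map_error_probability(bsc, [0.5, 0.5]))  # approximately 0.1
```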
Z(X | Y) is the rescaled sum of Bhattacharyya coefficients of the transition distribution W(y | x) for the uniform input. For non-uniform inputs, a modification is made to generalize the definition and the properties that used to hold. Intuitively speaking, a MAP decoder seeing y is "confident" if W(x | y) is small for all but one x, or equivalently, if the product W(x, y)W(x', y) is small for all distinct x, x' ∈ X. The Bhattacharyya parameter measures this "confidence". We also write Z(W) and Z_mad(W) for these quantities. T(X | Y) is the weighted average of the total variation distances from the a posteriori distributions (W(x | y) : x ∈ X) to the uniform noise (1/q, 1/q, ..., 1/q). More formally, it is defined to be Σ_{y∈Y} W_out(y) Σ_{x∈X} |W(x | y) − 1/q|. We also write T(W) for this quantity.
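The total variation parameter T has a direct numerical reading of its definition; the sketch below (our names and containers) evaluates Σ_y W_out(y) Σ_x |W(x|y) − 1/q|:

```python
def t_parameter(W, W_in):
    """T(X|Y) = sum_y W_out(y) * sum_x |W(x|y) - 1/q|, per the definition.
    W[x][y] is W(y|x); W_in[x] is the input distribution; q = len(W_in)."""
    q, num_outputs = len(W), len(W[0])
    total = 0.0
    for y in range(num_outputs):
        w_out = sum(W_in[x] * W[x][y] for x in range(q))
        if w_out == 0:
            continue
        posterior = [W_in[x] * W[x][y] / w_out for x in range(q)]
        total += w_out * sum(abs(p - 1.0 / q) for p in posterior)
    return total

noiseless = [[1.0, 0.0], [0.0, 1.0]]   # output reveals the input
useless = [[0.5, 0.5], [0.5, 0.5]]     # output is pure noise
print(t_parameter(noiseless, [0.5, 0.5]))  # -> 1.0
print(t_parameter(useless, [0.5, 0.5]))    # -> 0.0
```

The two extremes match the intuition: a fully revealing channel keeps its posteriors far from uniform, while a useless channel has exactly uniform posteriors.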
S(X | Y) is the weighted average of the L_1-norms of the Fourier coefficients of the a posteriori distributions. The formal definition is as follows. Let tr : F_q → F_p be the field trace, where F_q = X and F_p is the prime subfield. Let χ : F_q → C be an additive character defined as χ(x) := exp(2πi tr(x)/p), where i is temporarily the imaginary unit rather than an index. Define the Fourier coefficient M(w | y) := Σ_{z∈F_q} W(z | y)χ(wz) and the S-parameters accordingly. We also write S(W) and S_max(W) for these quantities. Remarks: The rescaling is such that 0 ≤ S ≤ S_max ≤ (q − 1)S ≤ q − 1. An interpretation is as follows: fix a y. When W(x | y) is roughly equal to 1/q for all x ∈ F_q, the Fourier coefficient M(w | y) = Σ_{z∈F_q} W(z | y)χ(wz) should be roughly Σ_{z∈F_q} χ(wz)/q = 0 for nonzero w. The S-parameter measures how far those coefficients are from zero.
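For a prime q the field trace is the identity, so the character and the Fourier coefficient are easy to evaluate; the snippet below (our toy posteriors) illustrates the interpretation that a uniform posterior has vanishing coefficients while a deterministic one has coefficients of modulus 1:

```python
import cmath
import math

def fourier_coefficient(posterior, w, q):
    """M(w|y) = sum_z W(z|y) * chi(w*z), with chi(x) = exp(2*pi*i*x/q)
    for prime q (trace = identity).  `posterior` lists W(z|y) for z in F_q."""
    return sum(posterior[z] * cmath.exp(2j * math.pi * ((w * z) % q) / q)
               for z in range(q))

q = 5
uniform = [1.0 / q] * q          # maximal noise
peaked = [1.0, 0.0, 0.0, 0.0, 0.0]  # full confidence
print(abs(fourier_coefficient(uniform, 1, q)))  # essentially 0
print(abs(fourier_coefficient(peaked, 3, q)))   # -> 1.0
```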

4.1. Relations among channel parameters. The following is a series of lemmas we extract from existing works. They characterize the relations among H, I, P_e, Z, T, and S.
Lemma 4 ([MT14, Lemma 23 with k = q − 1]). For any q-ary channel W, the stated bounds hold. Lemma 6 ([FM94, Theorem 1]). For any q-ary channel W, the stated bounds hold. Here, h_2 is the binary entropy function; h_2(1/2) = 1. The upper bound is Fano's inequality. The first lower bound fits when H(W) and P_e(W) are small; the second lower bound fits when H(W) and P_e(W) are close to 1.
The above lemmas inspire the following characterization. Let A and B be two channel parameters; we say A, B are bi-Hölder at a point if, near that point, each of A and B is bounded by a constant multiple of a constant power of the other. This notion generalizes to tuples of more parameters. Now we can summarize Lemmas 3 to 6 in a more concise statement.
See also [MT14, Corollary 28] for what inspired us; they use the notation A ∼ B to mean A, B are bi-Hölder at (0, 0) and at (1, 1). For some very technical details on the way toward the main theorem, we need explicit Hölder relations among H, Z_mad, and S_max. We claim them here. The proof is nothing but looking closer into Lemmas 3, 5, and 6. A written-out proof is in Appendix A.
Lemma 8 (explicit Hölder tolls; log is natural). For all q-ary channels W, the following hold:

4.2. Control of the block error probability. Let W be the channel we want to communicate over, and let X be any input. In the classical theory of polar coding, the second-to-last step of the construction of the block code is to determine a subset I ⊂ [ℓ]^n of indexes that points to the depth-n channels that transmit information bits. When decoding this code, a block error happens if the successive cancellation decoder fails to decode any information bit. Let E_(k_1,k_2,...,k_n) be the event that the first error occurs when the decoder is solving for the input to (···(W^(k_1))···)^(k_n), i.e., that Û_(k_1,k_2,...,k_n) ≠ U_(k_1,k_2,...,k_n) while equality holds for lexicographically earlier indexes. Then the event's probability measure P(E_(k_1,k_2,...,k_n)) is no more than the bit error probability P_e((···(W^(k_1))···)^(k_n)). By the union bound, the block error probability of the decoder is bounded from above by the sum of these bit error probabilities over I. With this observation, we may define I to be the set of indexes (k_1, k_2, ..., k_n) ∈ [ℓ]^n such that H((···(W^(k_1))···)^(k_n)) < θ_n for some clever choice of the threshold θ_n > 0. This immediately implies P_e((···(W^(k_1))···)^(k_n)) < cθ_n^d for some c, d > 0 by Lemma 7. Let θ_n be exp(−ℓ^(πn) n). The sum of P_e is then less than ℓ^n · cθ_n^d < exp(−ℓ^(πn)) for sufficiently large n, which is the block error probability we claimed. Remark: Arıkan used a different criterion Z < θ_n. It still implies P_e < cθ_n^d and that the sum of P_e is less than ℓ^n · cθ_n^d < exp(−ℓ^(πn)) for large n. The benefit of controlling P_e using other parameters is that some parameters are easier to control (because Theorems 9 and 10 exist).
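The index-selection policy can be sketched numerically. The toy snippet below (our names; in the construction the entropies come from evaluating the synthesized channels) thresholds at θ_n and returns the information set:

```python
import math

def select_information_set(entropies, n, ell, pi):
    """Pick I = {indexes whose synthetic-channel entropy is below theta_n},
    with theta_n = exp(-ell**(pi*n) * n) as in the construction above.
    `entropies` maps each depth-n index tuple to its conditional entropy."""
    theta = math.exp(-(ell ** (pi * n)) * n)
    return {idx for idx, h in entropies.items() if h < theta}

# Toy run: three synthetic channels are nearly noiseless, five nearly useless.
ent = {("a",): 1e-40, ("b",): 1e-35, ("c",): 1e-50,
       ("d",): 0.99, ("e",): 0.97, ("f",): 1.0, ("g",): 0.95, ("h",): 0.90}
I = select_information_set(ent, n=2, ell=4, pi=0.4)
print(sorted(I))  # the three nearly-noiseless indexes
```

Only the polarized-to-reliable indexes clear the doubly exponential threshold; everything else is frozen.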
For the main theorem, where the channel W is asymmetric, we want to control both the decoder block error and the encoder block error. Here, the encoder block error is not the encoder's failure to encode a message, but rather its failure to generate the capacity-achieving input distribution of W. To penalize, imagine that we employ an oracle that declares an encoder block error whenever the generated codeword should have been another word to fit the ideal distribution. That way, the actual block error probability will not exceed the sum of the encoder and decoder block error probabilities. More rigorously, let P be the probability measure assuming the ideal distribution of U_I and Q be the probability measure assuming the actual U_I generated by the encoder. Then the decoder block error probability P{Û_I ≠ U_I} is bounded as before. The encoder block error probability is represented by ‖P − Q‖, the total variation distance from P to Q. There is a telescoping argument, similar to how we control the decoder error, that classifies events by the first input bit where the oracle disagrees with the encoder [Kor09, Lemma 3.5], [KU10, Lemma 4], [KT10, Lemma 2], [HY13, Lemma 1]. It yields that the encoder block error probability is bounded from above by a similar sum. In controlling the encoder bit error probability, we strengthen the policy of collecting indexes (k_1, k_2, ..., k_n) ∈ [ℓ]^n for I by also asking for H((···(W̄^(k_1))···)^(k_n)) > 1 − θ_n, where W̄ is the flattening channel.
The latter immediately implies T((···(W̄^(k_1))···)^(k_n)) < cθ_n^d by Lemma 7. As a consequence, the overall block error probability is controlled by Q{Û_I ≠ U_I} ≤ P{Û_I ≠ U_I} + ‖P − Q‖ < 2ℓ^n · cθ_n^d < exp(−ℓ^(πn)) for n large. The preceding argument is a paraphrase of the proof of [HY13, Theorem 13]; Inequalities (59) and (57) therein are the keys. So far the block length, the complexity, and the error aspects of the main theorem are covered; it remains to control the code rate |I|/ℓ^n. In other words, we are to compute the cardinality of I given that I ⊂ [ℓ]^n is the set of indexes such that H((···(W^(k_1))···)^(k_n)) < θ_n and 1 − H((···(W̄^(k_1))···)^(k_n)) < θ_n, where θ_n := exp(−ℓ^(πn) n).

4.3. Before and after channel transformations. Alongside the relations among different parameters applied to the same channel, there are also relations between the same parameter applied to the original and the transformed channels. There are two more relations that are pivotal in the theory of polar coding but require more prerequisites. Assume that g_W : X^ℓ → X^ℓ is a linear isomorphism given by the multiplication of an invertible matrix G from the right: g_W(u_1^ℓ) := u_1^ℓ G. The following framework extends to nonlinear bijections, but we do not need that much. (There is also the paradigm that random linear codes perform better than random codebooks because a bad linear code tends to hoard a lot of short codewords at once, effectively removing them from the ensemble pool. So there is a good reason to stick to the linear case.) Let 0_1^(i−1) 1_i u_(i+1)^ℓ ∈ F_q^ℓ be a tuple of i − 1 many 0s followed by a 1 and then ℓ − i arbitrary symbols. A coset code is the subset of codewords of the form 0_1^(i−1) 1_i u_(i+1)^ℓ G. Coset codes have weight distributions just like every other code does. Let wt(x_1^ℓ) be the Hamming weight of x_1^ℓ. The weight enumerator of the i-th coset code is defined to be a one-variable polynomial over the integers. We can now state the second relation; it is considered the main cause of why polar coding works at all.
The proof is postponed until Section 7.1. The fundamental theorems come as a pair. Let u_1^(i−1) 1_i 0_(i+1)^ℓ ∈ F_q^ℓ be a tuple of i − 1 arbitrary symbols followed by a 1 and then ℓ − i many 0s. Let G^(−⊤) be the inverse transpose of G. The weight enumerator of the i-th dual coset code is defined to be the analogous one-variable polynomial over the integers. We can now state the third relation, the dual of the second. The proof is postponed until Section 7.2.
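For concreteness, the weight enumerator of a coset code can be tabulated by brute force over a small binary kernel; the function name and the dict-of-coefficients output are ours, not the paper's:

```python
from itertools import product

def coset_weight_enumerator(G, i):
    """Weight enumerator of the i-th coset code {(0^(i-1), 1, u) G : u free},
    returned as {weight: count} (the coefficients of the polynomial).
    Binary-field sketch; G is a list of rows over F_2, i is 1-based."""
    ell = len(G)
    counts = {}
    for u in product([0, 1], repeat=ell - i):
        msg = [0] * (i - 1) + [1] + list(u)
        codeword = [sum(m * G[r][c] for r, m in enumerate(msg)) % 2
                    for c in range(ell)]
        w = sum(codeword)
        counts[w] = counts.get(w, 0) + 1
    return counts

# Identity kernel: the 1st coset code is e_1 + span(e_2, e_3).
G = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(coset_weight_enumerator(G, 1))  # -> {1: 1, 2: 2, 3: 1}
```

A good kernel, by the discussion above, is one whose coset codes have few low-weight codewords; the enumerator makes that property checkable.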
Remark: These two bounds are not tight; the equality does not hold even for BECs. In detail, Arıkan's original bound reads Z_mad(W^(1)) ≤ 2Z_mad(W) − Z_mad(W)^2, while our bound turns into Z_mad(W^(1)) ≤ 2Z_mad(W), the subtraction term missing. We are simply not able to prove a version that degenerates to an equality over erasure channels, nor does any prior work seem to. This causes a serious aftermath: Z_mad(W_n) (to be defined later) is no longer a supermartingale. Nonetheless, this bound is strong enough to collaborate with the random coding theory. See, for example, how we compensate in Appendix C.1.
We have clarified (cs), (cn), (cc), and (cp) up to this section; (cr) remains.

Channel Processes
These r.v.s provide a new family of randomness that does not appear in the encoding and decoding algorithms, but they help us understand the code rate |I|/ℓ^n in this manner: counting how many indexes are in I is nothing more than measuring the probability P{(K_1, K_2, ..., K_n) ∈ I}. With the processes W_n and V_n thus defined, it is further equivalent to measuring the probability P{H(W_n) < θ_n and 1 − H(V_n) < θ_n}, where θ_n := exp(−ℓ^(πn) n). Moreover, it suffices to know how H(W_n) and H(V_n) behave as stochastic processes taking values in [0, 1], without comprehending W_n and V_n themselves. The general fact is that H(W_n) is either very small (the channel is reliable) or very close to 1 (the channel is noisy). Arıkan called this phenomenon channel polarization. The following claim generalizes channel polarization and implies the main theorem.
Claim 11. Fix any π, ρ > 0 such that π + 2ρ < 1. We will choose an ℓ and a series of bijections of F_q^ℓ such that the following hold. Here, o(n) is the little-o notation in n; it is such that o(n)/n → 0 as n → ∞.
For polar codes over symmetric channels, the first inequality in Claim 11 alone implies that the code rate is 1 − H(W) − ℓ^(−ρn+o(n)) = I(W) − N^(−ρ+o(1)). The first two inequalities imply the polarization behavior that channels become either satisfactorily reliable (low H(W_n)) or desperately noisy (high H(W_n)). For asymmetric channels, however, we need to characterize H(V_n) alongside H(W_n). The last two inequalities in Claim 11 show that the same series of bijections polarizes W̄ at the same time as it polarizes W. While W̄ contains no randomness from the channel W, what is polarized is that each input bit U_(k_1,k_2,...,k_n) either depends heavily on lexicographically earlier input bits (low H(V_n)) or behaves like a free r.v. conditioned on earlier bits (high H(V_n)). We then categorize the fates of indexes in [ℓ]^n into the following three types. (A) Free and reliable: these are the indexes that will be in I; they point to channels that transmit information bits. (B) Free but noisy: the sender can feed information into these channels only to find that the decoder will almost always make some mistakes. The sender should, instead, feed some pseudorandom numbers shared with the receiver. (C) Dependent and reliable: the input of these channels depends on previous inputs. Their main purpose is to shape the capacity-achieving input distribution. (D) Dependent but noisy is not possible because H(V_n) ≤ H(W_n). This is the key to [HY13, Theorem 1]. We reproduce their proof in the next subsection.

5.1. Claim 11 implies the main theorem. As mentioned, H(V_n) ≤ H(W_n), so (D) dependent but noisy is not possible. Let A_n be the intersection event of the free and reliable conditions; its probability bound follows the third inequality in Claim 11. Note that A_n or B_n implies free but not "neither reliable nor noisy"; that is, (free ∧ reliable) ∨ (free ∧ noisy) → free ∧ ¬(¬reliable ∧ ¬noisy).
We deduce the corresponding bound on P(A_n ∪ B_n). Similarly, since A_n or C_n implies reliable but not "neither free nor dependent," we deduce the corresponding bound on P(A_n ∪ C_n). In summary, we derive a lower bound on P(A_n). Finally, recall that I collects the free and reliable indexes, so the code rate is |I|/ℓ^n = P(A_n) > I(W) − ℓ^(−ρn+o(n)). We have almost finished the proof of the main theorem, except that we claimed a code rate of I(W) − N^(−ρ) = I(W) − ℓ^(−ρn), without the little-o term. This can be fixed by finding a slightly larger ρ′ > ρ such that π + 2ρ′ < 1 still holds, and then rerunning the whole argument with the new ρ′. The conclusion becomes that the code rate is at least I(W) − ℓ^(−ρ′n+o(n)). Since −ρ′n + o(n) < −ρn for sufficiently large n, this completes the proof of the main theorem. It remains to show that Claim 11 can be achieved.

Global MDP Behavior Modulo Local Behaviors
In this section, we put constraints on an abstract process {H_n} and show that they imply inequalities of the form P{H_n < threshold} > limit measure − decaying gap, as those in Claim 11. Let F_n be the sigma-algebra generated by K_1, K_2, ..., K_n for each n. Then F_0 ⊂ F_1 ⊂ F_2 ⊂ ··· form a filtration of sigma-algebras. Let {H_n}, {Z_n}, and {S_n} be three stochastic processes adapted to {F_n} (meaning K_1, K_2, ..., K_n determine H_n, Z_n, S_n). The following assumptions are easy to verify when we reveal what those processes are: (cb) 0 ≤ H_n, Z_n, S_n and H_n ≤ 1. Furthermore, assume large kernels: (cl) ℓ ≥ max(e^4, q^5, 3^q). Let α := log(log ℓ)/log ℓ be a small number shrinking as ℓ increases. Define the potential function h_α : [0, 1] → [0, 1] to be h_α(z) := min(z, 1 − z)^α. (Remark: h_2 is not a special case of h_α for α = 2; we expect α < 1 in practice.) Here are the difficult but sufficient criteria for the main theorem.
Lemma 12 (calculus machinery for global MDP). Assume criteria (cb), (cm), (ct), and (cl). Assume the local LDP behavior: Z_{n+1} ≤ exp(qZ_n)(qZ_n)^⌈K_{n+1}²/(3ℓ)⌉ and S_{n+1} ≤ exp(qS_n)(qS_n)^⌈(ℓ+1−K_{n+1})²/(3ℓ)⌉. Assume the local CLT behavior as well. Then, for any constants π, ρ > 0 such that Inequality (6) holds, the conclusion follows. We defer the proof until Appendix B. The exponent ⌈K_{n+1}²/(3ℓ)⌉ in the lemma controls the local LDP behavior of the process {H_n}: the behavior of H_{n+1} when H_n is close to 0, which is closely related to the LDP behavior of polar codes. The exponent is chosen in a way such that h_2(⌈k²/(3ℓ)⌉/ℓ) < k/ℓ and such that sums of the form Σ_k (k²/(3ℓ))^t are easy to handle. In [FHMV17, Theorem 7], a similar criterion is stated and annotated as faster polarization at the tails. In [BGS18, Definition 2.4], a similar criterion is stated and annotated as strong suction at the low end. The eigenfunction h_α in the lemma controls the local CLT behavior of the process {H_n}: the behavior of H_n when it is away from 0, which is closely related to the CLT behavior of polar codes. In [FHMV17, Theorem 7], a similar criterion is annotated as near optimal polarization in the middle with h_FHMV(z) := (z(1 − z))^α for positive but small α, at most log(log ℓ)/log ℓ. In [BGS18, Definition 2.3], a similar criterion is annotated as variance in the middle with h_BGS(z) := min(z, 1 − z). Note how our choice h_α(z) := min(z, 1 − z)^α resembles theirs. In both cases, the criteria are local because they refer to a small slice of the process, focusing on how H_{n+1} (or Z_{n+1}) behaves in terms of H_n (or Z_n). This perspective frees [FHMV17, BGS18] from considering the (global) process {H_n} as a whole and simplifies the analysis. We specifically benefit from the fact that we can choose the bijection g_{(···(W^(k_1))···)^(k_n)} solely according to the channels (···(W^(k_1))···)^(k_n) and (···(W̄^(k_1))···)^(k_n), instead of the complete channel family tree.
This is also the approach taken in [GRY19].
6.1. Lemma 12 helps achieve Claim 11. One advantage is that the two desired behaviors are local: they only involve how H_{n+1}, Z_{n+1}, and S_{n+1} behave conditioned on the history F_n. A potentially tedious aspect is that, for each candidate bijection g_{(···(W^(k_1))···)^(k_n)}, we would have to verify the two behaviors four times, once for each of the four triples of channel parameters. Luckily, within random coding theory, we are in the situation that, to choose an object satisfying multiple criteria, it suffices to choose the object from an ensemble and compute the probabilities that each criterion fails; as long as the sum of the failing probabilities is small, most objects satisfy all criteria. Even more luckily, when we choose a bijection g_{(···(W^(k_1))···)^(k_n)} from some ensemble, we only have to compute the probability that the local CLT or LDP behavior fails for {H_n}, {Z_n}, and {S_n} being {H(W_n)}, {Z_mad(W_n)}, and {S_max(W_n)}, but not for the other three triples. This is because the other three triples are special cases and/or duals of this triple. Elaboration: since the V_n are q-ary channels just like the W_n are, inequalities that hold for arbitrary W_n hold for any V_n. Also, since Z_mad and S_max are in duality, inequalities that hold for H, Z_mad, S_max hold for 1 − H, S_max, Z_mad. The duality is due to the duality between FTPCZ and FTPCS, within the explicit Hölder tolls, and within the ensemble of bijections from which we choose g_{(···(W^(k_1))···)^(k_n)}.
6.2. Random linear isomorphisms as bijections. Fix q and ℓ. Let GL(ℓ, q) be the group of ℓ-by-ℓ invertible matrices over F_q together with the ordinary matrix multiplication. Select an element G ∈ GL(ℓ, q) uniformly at random. Let g_W : F_q^ℓ → F_q^ℓ be the multiplication of G from the right, namely g_W(u_1^ℓ) := u_1^ℓ G. This map is bijective since G is invertible. Let W be a q-ary channel. Recall that we defined q-ary channels W^(1), W^(2), ..., W^(ℓ) in Section 3. To emphasize that these imaginary channels depend heavily on the randomness source G, we call them W_G^(1), W_G^(2), ..., W_G^(ℓ).

Lemma 14 (local CLT behavior). Fix an ℓ ≥ 20. Recall α := log(log ℓ)/log ℓ and h_α(z) := min(z, 1 − z)^α. Let G vary; with probability less than 2ℓ^(−log(ℓ)/20), the following fails:

6.3. Local behaviors imply Claim 11 (and hence the main theorem). We can now see how Lemmas 12 to 14 imply that Claim 11 is achievable for the right choice of ℓ and bijections g_W, g_{W^(i)}, et seq. For any given q-ary channel W, let ℓ be max(e^4, q^5, 3^q). For any given π, ρ > 0 such that π + 2ρ < 1, enlarge ℓ such that Inequality (6) holds, given α := log(log ℓ)/log ℓ. Consider a random kernel G as a candidate for the bijection g_W. Increase ℓ further so that the failing probabilities, Lemma 13's 3q^(−√ℓ/13) and Lemma 14's 2ℓ^(−log(ℓ)/20), amount to 1/3 or less. Recall the flattening channel W̄. The probability that any of the inequalities in Lemmas 13 and 14 fails for W̄ is less than 1/3, too. Invoke the union bound; 1/3 + 1/3 < 1. Hence there exists a solid choice of g_W as the multiplication of some proper instance of G from the right. With this g_W determined, we define W^(i) and W̄^(i) for all i ∈ [ℓ]. Consider first i = 1; anything that has been done to W now applies to W^(i). That is, let g_{W^(i)} be the multiplication of a random kernel G from the right. With W, i, and W_G^(i) replaced by W^(i), j, and (W^(i))_G^(j), the probabilities that the inequalities in Lemmas 13 and 14 fail add up to 1/3 or less. So does the flattening counterpart W̄^(i).
Hence there is a solid choice of g_{W^(i)}. Repeat this for every other i = 2, 3, ..., ℓ. Once finished, proceed to choosing g_{(W^(i))^(j)} for all i, j ∈ [ℓ], and so on and so forth for cases beyond depth 2. Notice that we always make a solid choice of a bijection before we proceed to the next level of channels; hence the failing probabilities of Lemmas 13 and 14 do not accumulate as the depth increases. By how we select bijections, the criteria in Lemma 12 hold for ({H_n}, {Z_n}, {S_n}) being each of the four triples. In this section, we will first prove the two fundamental theorems of polar coding, in Section 7.1 (for the Z-end) and in Section 7.2 (for the S-end). Then we will show that the following inequalities hold with high probability: f_GZ^(i)(Z_mad(W)) ≤ exp(qZ_mad(W))(qZ_mad(W))^⌈i²/(3ℓ)⌉ (8's copy), and the analogous S_max inequality. By the duality between the two fundamental theorems and between the two targeted inequalities, it is not hard to see that it suffices to prove the Z_mad case; the S_max case follows immediately. We will prove that the first targeted inequality, for each i ∈ [ℓ], holds with probability 1 − 3q^(−√ℓ/13) in Section 7.4, closing the section.
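The kernel ensemble in question can be sampled by rejection; the minimal sketch below works over F_2 (the construction uses general F_q) and draws uniformly from GL(ℓ, 2):

```python
import random

def random_invertible_matrix(ell, rng):
    """Sample a uniform element of GL(ell, 2): draw a uniform ell-by-ell
    binary matrix and keep it iff it is invertible over F_2."""
    while True:
        G = [[rng.randrange(2) for _ in range(ell)] for _ in range(ell)]
        if _invertible_f2([row[:] for row in G]):
            return G

def _invertible_f2(M):
    """Gauss-Jordan elimination over F_2 (mutates M); True iff full rank."""
    n = len(M)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col]), None)
        if pivot is None:
            return False
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col and M[r][col]:
                M[r] = [(a + b) % 2 for a, b in zip(M[r], M[col])]
    return True

rng = random.Random(0)
G = random_invertible_matrix(4, rng)
assert _invertible_f2([row[:] for row in G])  # g_W(u) := u G is a bijection
```

Over F_2 a uniform matrix is invertible with probability about 0.29, so the rejection loop terminates after a constant expected number of draws.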
7.1. Proof of FTPCZ (Theorem 9). As promised in Section 4.3, we prove the two fundamental theorems of polar coding. We first go for the Z-end.
Recall that f_GZ^(i) is the weight enumerator of the i-th coset code. Theorem 9 claims that Z_mad(W^(i)) ≤ f_GZ^(i)(Z_mad(W)). By the nature of max_{0≠d_i∈F_q}, it suffices to show that the double sum is at most f_GZ^(i)(Z_mad(W)) for an arbitrary nonzero d_i. In the upcoming argument, tuple concatenation takes precedence over vector-matrix multiplication and vector addition. Fix a d_i; we argue as follows. The first equality abbreviates the summation. The next equality expands W^(i) by the very definition, where u_(i+1)^ℓ and v_(i+1)^ℓ are free variables in F_q^(ℓ−i). The next inequality is by the sub-additivity of the square root. In the next equality we define d_(i+1)^ℓ := v_(i+1)^ℓ − u_(i+1)^ℓ; so summing over v_(i+1)^ℓ is equivalent to summing over d_(i+1)^ℓ. In the next equality we define x_1^ℓ := u_1^ℓ G; so summing over u_1^ℓ is equivalent to summing over x_1^ℓ, as G is invertible. In the next equality we substitute e_1^ℓ := 0_1^(i−1) d_i^ℓ G and reorder the summation. The next equality expands the product of the memoryless channels. The next equality classifies indexes into two classes: j ∈ J are those such that e_j ≠ 0, and k ∉ J are those such that e_k = 0. The next equality is the distributive law ax + ay + bx + by = (a + b)(x + y). The next equality uses the fact that the W(x, y) sum to 1. In the next inequality we replace e_j by a nonzero element that maximizes the sum in the parentheses. In the next equality we realize that the maximum is the Bhattacharyya parameter (surprisingly). The second-to-last equality uses the fact that multiplying a vector by a nonzero scalar preserves its Hamming weight. And quod erat demonstrandum. Experienced readers may find that all but the last inequality follow the proof strategy of [KSU10, Lemma 10]. 7.2. Proof of FTPCS (Theorem 10). We now go for the S-end of the fundamental theorem of polar coding. Recall the character χ(x) := exp(2πi tr(x)/p). We need the following properties: (pa) χ(0) = 1; (pb) |χ(x)| = 1 for all x ∈ F_q; (pc) χ(x)χ(z) = χ(x + z) for all x, z ∈ F_q; (pd) Σ_{x∈F_q} χ(x) = 0.
See also [MT14, Definition 24] or a dedicated book [Ter99]. To prove the theorem, we first verify that the Fourier coefficients recover the origin: let M(w, y) be as defined above. The first equality expands M(w, y) by the definition. The next equality uses the fact that χ is an additive character (pc), and reorders the summation. The next equality uses Σ_{w∈F_q} χ(w) = 0 (pd) and Σ_{w∈F_q} χ(0) = q (pa); here I is the indicator function.
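The "recover the origin" identity is just inverse discrete Fourier inversion; a numerical check for a prime q, on a toy posterior of ours, is:

```python
import cmath
import math

def chi(x, q):
    """Additive character chi(x) = exp(2*pi*i*x/q); prime q, trace = identity."""
    return cmath.exp(2j * math.pi * (x % q) / q)

q = 7
posterior = [0.30, 0.05, 0.20, 0.10, 0.15, 0.05, 0.15]  # some W(.|y)
M = [sum(posterior[z] * chi(w * z, q) for z in range(q)) for w in range(q)]
# Inversion: W(x|y) = (1/q) * sum_w M(w|y) * chi(-x*w)
recovered = [sum(M[w] * chi(-x * w, q) for w in range(q)).real / q
             for x in range(q)]
assert all(abs(a - b) < 1e-12 for a, b in zip(recovered, posterior))
```

Properties (pa) and (pd) are exactly what makes the inner sum over w collapse to q at x = z and to 0 elsewhere.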
Knowing that W(x_j, y_j) = q^(−1) Σ_{w_j∈F_q} M(w_j, y_j)χ(−x_j w_j), we proceed as follows. The first equality expands the definition of W^(i). In the next equality, we substitute x_1^ℓ := u_1^ℓ G. The next equality expands the definition down to W. The next two equalities Fourier-expand W and reorder the operators. The next equality merges all χ(−x_j w_j) into one term by additivity (pc). In the next equality we define M(w_1^ℓ, y_1^ℓ) to be the product of all M(w_j, y_j). The next two equalities use x_1^ℓ (w_1^ℓ)^⊤ = u_1^ℓ G (w_1^ℓ)^⊤ = u_1^ℓ (w_1^ℓ G^⊤)^⊤. In the next equality we define v_1^ℓ := w_1^ℓ G^⊤; so summing over w_1^ℓ is equivalent to summing over v_1^ℓ. (Recall that G^(−⊤) is the notation for the inverse transpose of G.) The next three equalities sum over u_(i+1)^ℓ to force v_(i+1)^ℓ = 0. In the first line we let M^(i) be the Fourier coefficient of W^(i). The next equality plugs in what we have about W^(i) in mind. The next three equalities sum over z^i to force the analogous constraint. The first inequality expands the Fourier coefficient. The next inequality is the triangle inequality plus (pb). The next equality cancels the summation over u_1^(i−1) with q^(1−i). In the next equality we substitute w_1^ℓ := v_1^(i−1) ω_i 0_(i+1)^ℓ G^(−⊤); slightly different from the w_1^ℓ above, these are now restricted to a proper subspace. The next equality classifies indexes into two classes: j ∈ J are those such that w_j ≠ 0, and k ∉ J are those such that w_k = 0. The next two equalities reorder the operators and simplify Σ_{y_k} |M(0, y_k)| = Σ_{y_k} W_out(y_k) = 1. The next inequality replaces w_j by the one that maximizes Σ_{y_k} |M(w_j, y_k)|. The rest is trivial. Theorem 10 claims that S_max(W^(i)) ≤ f_GS^(i)(S_max(W)), where f_GS^(i) is the weight enumerator of the i-th dual coset code. Since S_max(W^(i)) is just the maximum of Formula (11) over 0 ≠ ω_i ∈ F_q, we arrive at S_max(W^(i)) ≤ f_GS^(i)(S_max(W)). See the left half of Figure 6 for evidence. More generally, for all prime powers q and for all z ∈ [0, 1], where D is the Kullback-Leibler divergence. This falls back to the h_2 case when q = 2.
See the right half of Figure 6 for evidence for q = 3, 4, 5, 7. It can be observed that as q → ∞ the function tends to the line connecting (0, 0) and (1, 1); hence the upper bound should hold. Taking the derivative in q shows that the left-hand side decreases as q increases when x < 1/2. In the targeted inequality, z := Z_mad(W) for short. This inequality is in fact a consequence of the bound (1 + a)^b < exp(ab). We will show the last inequality. Now divide i into two cases: 1 ≤ i ≤ √(3ℓ) and √(3ℓ) < i ≤ ℓ. For i = 1, 2, ..., ⌊√(3ℓ)⌋, the exponent ⌈i²/(3ℓ)⌉ is simply 1, so the inequality to be proven simplifies accordingly. For i = ⌊√(3ℓ)⌋ + 1, ⌊√(3ℓ)⌋ + 2, ..., ℓ, let k := ℓ − i and let d := ⌈i²/(3ℓ)⌉. These variables resemble the dimension and the minimum distance of a linear block code, as in the notation "[ℓ, k, d]-code" of classical (algebraic) coding theory. To make Inequality (12) hold, we execute a two-phase procedure to avoid all codewords of weight less than d and to eliminate kernels with poor overall scores. In further detail, we reject a kernel G if there exists u_(i+1)^ℓ such that wt(0_1^(i−1) 1_i u_(i+1)^ℓ G) < d; call this phase I. Afterwards, among the surviving kernels with only heavy codewords, we reject a kernel if its overall score f_GZ^(i)(z) is too high; call this phase II. The failing probability 3q^(−√ℓ/13) is the price we pay for rejecting. Up to this point, two things remain to be analyzed: how much probability we pay for rejecting light codewords in phase I (answer: q^(−√ℓ/13)), and what Markov cutoff makes Inequality (12) hold in phase II (answer: 2q^(−√ℓ/13)). Phase I analysis is as follows. Fix u_(i+1)^ℓ and vary G ∈ GL(ℓ, q); consider the distribution of the codeword 0_1^(i−1) 1_i u_(i+1)^ℓ G. This distribution is almost identical to the uniform distribution on F_q^ℓ. Assume X_1^ℓ follows the latter; this makes X_1^ℓ lighter, which is compatible with the direction of the inequalities we want.
Then the probability that X_1^ℓ has weight less than d is the probability that ℓ Bernoulli trials, with X_j being "zero" with probability 1/q and "nonzero" with probability (q − 1)/q, result in fewer than d "nonzero"s. By large deviations theory [DZ10, Exercise 2.2.23(b)], wt(X_1^ℓ) < d holds with probability less than exp(−ℓ D(d/ℓ ‖ 1/2)) for the q = 2 case, where D is the Kullback-Leibler divergence. For general q, similarly, wt(X_1^ℓ) < d holds with probability less than exp(−ℓ D(d/ℓ ‖ (q − 1)/q)). This quantity is less than q^(−ℓ(1−h_2(d/ℓ))) by Figure 6 (meaning that q = 2 is the most difficult case). By Figure 6, h_2(d/ℓ) < √(e d/ℓ) = √(e i²/(3ℓ²)) = √(e/3) · i/ℓ < 0.952 i/ℓ. So the rejecting probability is less than q^(−ℓ(1−h_2(d/ℓ))) < q^(−ℓ+0.952i). Take into account that there are q^(ℓ−i) possibilities of u_(i+1)^ℓ. The union bound yields q^(ℓ−i) · q^(−ℓ+0.952i) = q^(−0.048i) < q^(−0.048√(3ℓ)) < q^(−√ℓ/13). Hence the rejecting probability is at most q^(−√ℓ/13). Phase I ends here. Phase II analysis is as follows. After we reject some G in phase I, some codewords disappear; in particular, this includes all codewords of low weight. Therefore, the expectation of f_GZ^(i)(z) is bounded by the weight enumerator of all heavy codewords rescaled by the number of codewords. In detail, start from the displayed formula, where I is the indicator function. In the denominator, 1 − q^(−√ℓ/13) > 1/4 as ℓ ≥ 30. Put that aside and recall d := ⌈i²/(3ℓ)⌉. The expected value part is bounded from above as follows. The first equality expands the definition. The next inequality replaces "G survives phase I" by a weaker condition. The next equality switches E and Σ. The next inequality replaces the ensemble of 0_1^(i−1) 1_i u_(i+1)^ℓ G by a uniform X_1^ℓ ∈ F_q^ℓ. The next equality expands the definition of the expectation over X_1^ℓ. The next equality counts codewords. The next inequality selects w positions by first selecting d and then selecting w − d. The next two equalities factor and apply the binomial theorem. The rest is by a series of inequalities that overestimate the scalar. Hence the scalar part is less than q^(−√ℓ/13)/2.
Put 1 − q^(−√ℓ/13) > 1/4 back into the denominator as in Inequality (13). By Markov's inequality, Inequality (12) holds with probability 1 − 2q^(−√ℓ/13); i.e., the rejecting probability is 2q^(−√ℓ/13). Phase II ends here. The sum of the two rejecting probabilities is 3q^(−√ℓ/13), as claimed in Lemma 13; hence the lemma is settled. 7.5. Bibliographic remarks. Concerning the fundamental theorems: nonlinear g_W is not taken into consideration because it is hard to imagine how the MacWilliams duality would work then. Also, the S-parameter does not generalize to non-field input alphabets. Concerning random linear codes: [BF02, Section II.C] portrays a clear picture of the weight distribution of binary random linear codes. Section 7.4 accommodates and extends their argument to general prime powers q. Concerning the LDP behavior: [KSU10, Theorem 22] showed that π < 1 can be arbitrarily close to 1 over the binary alphabet utilizing Bose-Chaudhuri-Hocquenghem codes. Our Lemma 13, on the other hand, implies that almost all kernels make π close to 1.
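The phase I tail bound can be sanity-checked numerically in the binary case; the helper names below are ours, and the bound coded is the classical P{Bin(ℓ, 1/2) < d} ≤ 2^(−ℓ(1−h_2(d/ℓ))) for d ≤ ℓ/2:

```python
import math

def prob_weight_below(ell, d, q=2):
    """Exact P{wt(X) < d} for X uniform on F_q^ell: the weight is
    Binomial(ell, (q-1)/q), so sum the first d binomial terms."""
    p_nonzero = (q - 1) / q
    return sum(math.comb(ell, w) * p_nonzero**w * (1 - p_nonzero)**(ell - w)
               for w in range(d))

def entropy_bound(ell, d, q=2):
    """The bound q^(-ell*(1 - h_2(d/ell))) with h_2 the binary entropy."""
    x = d / ell
    h2 = -x * math.log2(x) - (1 - x) * math.log2(1 - x)
    return q ** (-ell * (1 - h2))

ell, d = 64, 8
assert prob_weight_below(ell, d) < entropy_bound(ell, d)
```

For ℓ = 64, d = 8 the exact tail is a few parts in 10^11 while the bound is on the order of 10^(-9), illustrating the slack absorbed in the union bound over u_(i+1)^ℓ.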
It remains to prove Lemmas 12 and 14.

Local CLT Behavior (Proof of Lemma 14)
We are to prove that the following inequality holds with high probability. The target inequality is the sum of the following three inequalities. The second one is trivial, as h_α(z) ≤ (1/2)^α. The first one will be proven in Section 8.3 with failing probability ℓ^(−log(ℓ)/20). The third one will be proven in Section 8.4 with failing probability ℓ^(−log(ℓ)/20). Before the main proofs, we devote Section 8.1 to introducing the symmetrization trick, which will reduce our proof to the case of symmetric q-ary channels. A channel W being symmetric means that, for any affine shift ξ ∈ F_q, there exists a permutation σ on Y such that W(y | ξ + x) = W(σ(y) | x) holds for all x ∈ F_q and y ∈ Y. It also means that the uniform input achieves the Shannon capacity; this justifies the usage of linear codes. In Section 8.2, we invoke some universal bounds on entropies and exponents from Chang, Draper, and Sahai's works. Finally, we will be abusing the theory of random linear codes in Section 8.3 for noisy channel coding and in Section 8.4 for secrecy over wiretap channels. 8.1. Symmetrize channel and uniformize input. Let W : F_q → Y be any q-ary channel; let X and Y be some input and the corresponding output. Symmetrize the channel as follows. Let Ξ ∈ F_q be a uniform r.v. independent of X, Y. Let W̃ : F_q × (F_q × Y) → [0, 1] be the probability mass function of the combination of r.v.s (Ξ + X, (X, Y)) ∈ F_q × (F_q × Y). This W̃ behaves like a channel such that, quote unquote, W̃((x, y) | z) = W(x, y)/q for all inputs z ∈ F_q and outputs (x, y) ∈ F_q × Y. Although this channel might only be properly simulated by a symmetric channel with feedback to the sender, all that matters is that the biased input X is neutralized by the uniform r.v. Ξ and becomes uniform. Let g_W be the multiplication of an invertible matrix G from the right. Let W̃^(i)(u_i, u_1^(i−1) x_1^ℓ y_1^ℓ) be the probability mass function of the tuple (U_i, U_1^(i−1) X_1^ℓ Y_1^ℓ), where U_1^ℓ G = Ξ_1^ℓ + X_1^ℓ.
This definition is compatible with the channel transformation of W̃, as if W̃ were an actual channel in the first place. Let H(W̃^{(i)}) be H(U_i | U_1^{i−1} X_1^ℓ Y_1^ℓ); this is also compatible. The following lemma justifies why W̃ is useful in theory.
Lemma 15 (channel symmetrization). W̃ is a symmetric q-ary channel. This lemma is by [MT14, Definition 6 and Lemmas 7 and 8], plus the arguments in between. See also [HY13, Theorem 2], where they cared about whether Z(W^{(i)}) = Z(W̃^{(i)}). One could also expand all definitions to verify the identities.
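As a concrete aside, the symmetry condition in the definition above can be verified by brute force on toy alphabets. The following Python sketch is ours (not from the paper) and assumes a prime q, so that addition in F_q is addition mod q:

```python
from itertools import permutations

def is_symmetric(W, q):
    """Check the symmetry condition: for every shift xi in F_q there is a
    permutation sigma of the outputs with W(y | xi + x) == W(sigma(y) | x).
    Brute force over output permutations; fine for toy alphabets only."""
    Y = range(len(W[0]))
    for xi in range(q):
        found = False
        for sigma in permutations(Y):
            if all(abs(W[(xi + x) % q][y] - W[x][sigma[y]]) < 1e-12
                   for x in range(q) for y in Y):
                found = True
                break
        if not found:
            return False
    return True

# A binary symmetric channel is symmetric; an arbitrary skewed channel is not.
bsc = [[0.9, 0.1], [0.1, 0.9]]
skew = [[0.9, 0.1], [0.4, 0.6]]
assert is_symmetric(bsc, 2)
assert not is_symmetric(skew, 2)
```

For the symmetrized channel W̃, every row equals the joint distribution of (X, Y) scaled as above, so the identity permutation witnesses symmetry for every shift.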
The consequence of this lemma is that W̃ behaves like a shadow copy of W, but is symmetric. All inequalities involving entropies of W and W^{(i)} reduce to inequalities involving entropies of W̃ and W̃^{(i)}. Subsequently, passing statements to W̃ is effectively assuming that the channel W is symmetric with the uniform input to begin with. In the upcoming subsections, we will prove that the targeted inequalities, (14) and (15), hold with high probability for any symmetric q-ary channel W with the uniform input. We conjecture that the symmetrization technique is optional, as it seems to be a wrapper of complicated Bayesian formulas.

8.2. Chang–Sahai's universal quadratic bound. This and the next two subsections contain the most convoluted part of the proof of Lemma 14. This subsection prepares a universal upper bound on Gallager's E-null function, which ultimately evolves into a universal lower bound on Gallager's error exponent.
Let W : X → Y be a q-ary channel. Symmetry is not required in this subsection, but it is in the next two. Assume the uniform input distribution W_in(x) = 1/q for all x ∈ X. Define Gallager's E-null function and its complement [CS07, Formula (1)]; by complement we mean that, under the uniform input, Ē_0(t) degenerates so that, equivalently, E_0(t) + Ē_0(t) = t log q. Do not confuse W with W^t; the latter is tilted. When t = 0, the tilted channel falls back to its italic origin, W^0(x, y) = W(x, y). These measures can be interpreted as follows: W^t behaves like a channel with a dedicated input distribution. The first fraction in the definition specifies the output distribution W^t_out(y). The second fraction specifies the a posteriori distribution W^t(x | y) when y is known. As W^t is not an actual channel, it is not meaningful to alter the input distribution and ask for the corresponding output. As with the symmetrization trick, all that matters is that we can compute entropies, and what not, as if they were real channels. The quantities we are interested in are listed below: Let H_e be the base-e entropy. Let H_e(W^t) be H_e(X^t | Y^t), where (X^t, Y^t) is a tuple of r.v.s that follows W^t. Let H_e(X^t | y) be the entropy of the a posteriori distribution of X^t given Y^t = y. To be specific, [CS07, Formulas (13) and (19)] give that Equation (16) holds for t ∈ [0, 1]. Careful readers may verify it by hand or follow [CS07, Formulas (13) to (19)] and [DCS14, Lemmas 9 and 10]. Similar computations are also carried out by [AW10, AW14]. Notice that Ē_0(t), H_e(W^t), and every other term in Equation (16) are holomorphic functions of t on the half-plane Re t > −1 (there is a singularity where 1/(1 + t) = ∞). By the identity theorem in complex analysis [BMPS02, Corollary 8.16], Equation (16) holds for all t ∈ [−2/5, 1].
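To make the E-null function tangible, here is a minimal numerical sketch in Python (ours, not the paper's). It implements Gallager's classical E_0 with the uniform input — which is what the definition above amounts to — and checks two facts used repeatedly below: E_0(0) = 0 with derivative E_0′(0) equal to the base-e mutual information, and concavity in t. The toy channel is an arbitrary binary symmetric channel.

```python
import math

def E0(t, W, q):
    """Gallager's E-null function with uniform input (natural log).
    W[x][y] = W(y | x); classical formula with Gallager's rho = t."""
    total = 0.0
    for y in range(len(W[0])):
        inner = sum((1.0 / q) * W[x][y] ** (1.0 / (1.0 + t)) for x in range(q))
        total += inner ** (1.0 + t)
    return -math.log(total)

def mutual_info(W, q):
    """I(X; Y) in nats under the uniform input."""
    Y = len(W[0])
    out = [sum(W[x][y] for x in range(q)) / q for y in range(Y)]
    return sum((W[x][y] / q) * math.log(W[x][y] / out[y])
               for x in range(q) for y in range(Y) if W[x][y] > 0)

# Toy channel: binary symmetric channel with crossover 0.1 (our choice).
W = [[0.9, 0.1], [0.1, 0.9]]
q = 2

assert abs(E0(0.0, W, q)) < 1e-12          # E_0(0) = 0
h = 1e-6                                    # E_0'(0) = I(W) in nats
assert abs(E0(h, W, q) / h - mutual_info(W, q)) < 1e-3
# Concavity of E_0 on a grid (second differences are nonpositive)
ts = [i / 20 for i in range(1, 20)]
for a, b, c in zip(ts, ts[1:], ts[2:]):
    assert E0(a, W, q) + E0(c, W, q) <= 2 * E0(b, W, q) + 1e-12
```

The derivative-at-zero fact is the one exploited in Section 8.4, and concavity is what later turns a bound on E_0(t) into a bound on the derivative.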
Dropping the nonpositive square, we deduce an upper bound for each t ∈ [−2/5, 1], Inequality (17). This upper bound on Ē_0(t) is a linear combination of Σ_{x∈X} W^t(x | y)(log W^t(x | y))² and H_e(X^t | y)², parametrized by y ∈ Y, so it remains to bound them separately. For the second kind of constituent, the entropy cannot exceed log q, so H_e(X^t | y)² ≤ log(q)². For the first kind of constituent, the following lemma, adapted from [CS07, Lemma 1], helps.
With the lemma, we do have Σ_{x∈X} W^t(x | y)(log W^t(x | y))² ≤ 1.2 log(q)². Now Inequality (17) becomes a bound valid for all t ∈ [−2/5, 1]. Since E_0(t) is the linear function t log q minus Ē_0(t), their first derivatives sum to log q while their second derivatives are opposite. Hence the following lemma.
where j := ℓH(W) + ℓ^{1/2+α} for short. It suffices to prove that the right hand side is less than ℓ^{−1/2+α}. In the spirit of the motivational Chain Rule (1), the sum of the chain of H(W^{(i)}_G) on the right hand side is H(U_{j+1}^ℓ | U_1^j Y_1^ℓ). In order to prove Inequality (14), we will show Inequality (18). But what is H(U_{j+1}^ℓ | U_1^j Y_1^ℓ)? It measures the equivocation at Bob's end when U_1^j is known to Bob. In other words, we may as well pretend that there is a random rectangular full-rank matrix G with ℓ columns and only k := ℓ − j = ℓ − ℓH(W) − ℓ^{1/2+α} rows, that Alice computes and sends X_1^ℓ := U_1^k G to Bob, and that Bob attempts to decode Û_1^k upon receiving Y_1^ℓ using the MAP decoder. The equivocation is thus, by Fano's inequality, bounded in terms of the probability that Bob fails to decode U_1^k: −P_e log(P_e)/log q + P_e/log q + P_e·k = P_e·((1 − log P_e)/log q + k). (19) Here P_e is the probability that Bob fails to decode, i.e., Û_1^k ≠ U_1^k. The following is how to compute Bob's decoder block error probability. The generator matrix G Alice uses is selected uniformly from the ensemble of full-rank k-by-ℓ matrices. The difference of every pair of codewords distributes uniformly on F_q^ℓ \ {0_1^ℓ}. Over symmetric channels, the difference alone determines the outputs' joint distribution because W(y_1^ℓ | ξ_1^ℓ + x_1^ℓ) = W(σ_1^ℓ(y_1^ℓ) | x_1^ℓ) for some componentwise permutation σ_1^ℓ on Y^ℓ depending on ξ_1^ℓ. Gallager's bound applies. To elaborate, let t ∈ [0, 1]. Bob's average error probability satisfies [Gal68, Inequalities (5.6.2) to (5.6.14)] P_e ≤ exp(kt log q − ℓ·(the E-null function of W)(t)) = exp(kt log q − ℓE_0(t)).
The first inequality uses that the left hand side increases monotonically in k and that k := ℓ − j = ℓ − ℓH(W) − ℓ^{1/2+α} < ℓ. The second inequality uses the assumption ℓ ≥ 2. In any regard, the quantity at the end of the inequalities decays to 0 as ℓ → ∞, so eventually it becomes less than ℓ^{1/2+α}, the right hand side of Inequality (18). This proves that Inequality (14) holds with failing probability ℓ^{−log(ℓ)/20} as soon as ℓ is large enough. The lower bound on ℓ in the statement of Lemma 14 is large enough; hence the first half of Lemma 14 is settled.

8.4. Hayashi's argument at Eve's end. This subsection contains the very last ingredient of the proof of Lemma 14. We dealt with Inequality (14) in the last subsection. We now deal with Inequality (15). Similar to how we motivated Inequality (18), we apply Jensen's inequality and the chain rule of conditional entropy to simplify Inequality (15). The left hand side becomes j·h_α(H(U_1^j | Y_1^ℓ)/j), where j := ℓH(W) − ℓ^{1/2+α} for short. (This is not the same j as in the last subsection.) The input being uniform, the argument of h_α is H(U_1^j | Y_1^ℓ)/j = 1 − I(U_1^j; Y_1^ℓ)/j, which can be replaced by I(U_1^j; Y_1^ℓ)/j thanks to the evenness h_α(1 − z) = h_α(z). We will show Inequality (20). But what is I(U_1^j; Y_1^ℓ)? It is the amount of information Eve learns from wiretapping Y_1^ℓ if they know that U_{j+1}^ℓ are junk. In other words, we may pretend that Alice transmits X_1^ℓ := U_1^j V_{j+1}^ℓ G with confidential bits U_1^j and obfuscating bits V_{j+1}^ℓ, Bob receives X_1^ℓ in full, and Eve learns Y_1^ℓ. This context falls back to (a special case of) the traditional setup of wiretap channels [Wyn75], where various bounds are studied, some in terms of Gallager's E-null function.
Here are some preliminaries to control the information leaked to Eve. We follow the blueprint of how Hayashi derived the secrecy exponent in [Hay06, Inequality (21)]. Consider the communication protocol depicted in Figure 7: Karl fixes a kernel G ∈ GL(ℓ, q), and everyone knows G. Alice chooses the confidential message U_1^j. Vincent chooses the obfuscating bits V_{j+1}^ℓ. Charlie generates Y_1^ℓ by plugging X_1^ℓ := U_1^j V_{j+1}^ℓ G into a simulator of W. Eve learns Y_1^ℓ and is interested in knowing U_1^j alone. So the channel on topic is the composition of Vincent and Charlie. Notation: Running out of symbols, we use P with proper subscripts to indicate the corresponding probability measures; that said, indices in the subscripts will be omitted. As Eve is interested in the relation between U_1^j and Y_1^ℓ, let Y_1^ℓ|G,u_1^j be the r.v. that follows the a posteriori distribution of Y_1^ℓ given G = G and U_1^j = u_1^j. More formally, P_{Y|Gu}(y_1^ℓ) = P_{Y|GU}(y_1^ℓ | G, u_1^j) = P_{GUY}(G, u_1^j, y_1^ℓ)/P_{GU}(G, u_1^j). We could have defined Y_1^ℓ|G to be the a posteriori distribution of Y_1^ℓ given G = G; but it is simply the same distribution as Y_1^ℓ, since U_1^j V_{j+1}^ℓ G traverses all inputs uniformly regardless of the choice of G. That is, P_{Y|G}(y_1^ℓ | G) = P_Y(y_1^ℓ) for all y_1^ℓ ∈ Y^ℓ.

Figure 7. A finer setup for Hayashi's secrecy exponent: the channel Eve cares about. Charlie generates Y_1^ℓ such that X_1^ℓ := U_1^j V_{j+1}^ℓ G and Y_1^ℓ follow W. Despite the seemingly sequential structure, Karl, Alice, and Vincent work independently.
Fix G as an instance of G. Let I_e be the base-e mutual information. The channel Eve cares about leaks information of the amount given in Formula (21), whose summand is the Kullback–Leibler divergence from the a posteriori distribution of Y_1^ℓ given G, u_1^j to the coarsest distribution Y_1^ℓ. We are to take the expectation over G to find the average information leak, since we are interested in Markov's inequality. Equality (21) yields Equality (22). We now discover that there are redundancies in traversing all G and u_1^j: after all, u_1^j v_{j+1}^ℓ G is a fixed linear combination of the first j rows plus a random vector from the span of the bottom ℓ − j rows. When V_{j+1}^ℓ varies, the track of X_1^ℓ forms an affine subspace of F_q^ℓ, a coset code as in the context of the fundamental theorems. So what matters is the distribution of this coset code. In this regard, we replace the uniform ensemble of (G, U_1^j) by the uniform ensemble of K, a rank-(ℓ − j) affine subspace of F_q^ℓ, where j := ℓH(W) − ℓ^{1/2+α}. Karl and Alice together choose K uniformly. Vincent chooses X_1^ℓ ∈ K uniformly. Charlie generates Y_1^ℓ by throwing X_1^ℓ into a simulator of W. See Figure 8 for a depiction of the new scheme.

Figure 8. A simplified setup for Hayashi's secrecy exponent: Karl and Alice choose K; the modified channel. Charlie generates Y_1^ℓ such that X_1^ℓ and Y_1^ℓ follow W.

Hence Equality (22) becomes an average over K, where Y_1^ℓ|K is the a posteriori distribution of Y_1^ℓ given K = K. Suddenly, the quantity E I_e(U_1^j; Y_1^ℓ | G) we are interested in turns into the mutual information I_e(K; Y_1^ℓ) between K and Y_1^ℓ, as K replaces the role of U_1^j in Formula (21). Recall from Lemma 17 that the mutual information is the derivative of Gallager's E-null function. We exploit this. Define the double-stroke E-null function 𝔼_0 for (K, Y_1^ℓ) as follows. Then 𝔼_0′(0) = I_e(K; Y_1^ℓ) = E I_e(U_1^j; Y_1^ℓ | G). Owing to the concavity of the E-null function, 𝔼_0′(0) ≤ 𝔼_0(t)/t whenever −2/5 ≤ t < 0. Recap: To bound the average leaked information E I_e(U_1^j; Y_1^ℓ | G), it suffices to bound I_e(K; Y_1^ℓ), which then morphs into bounding 𝔼_0′(0) from above and thus into bounding 𝔼_0(t) from below.
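The identity underlying Formula (21) — mutual information is the average Kullback–Leibler divergence from the a posteriori output distributions to the overall output distribution — can be sanity-checked in a few lines. A Python sketch with an arbitrary toy channel (our choice, purely illustrative):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy setting: uniform secret U over {0, 1}, channel P(y | u).
P = [[0.8, 0.1, 0.1],   # P(y | U = 0)
     [0.2, 0.3, 0.5]]   # P(y | U = 1)
pu = [0.5, 0.5]

# Marginal of Y (the "coarsest distribution").
py = [sum(pu[u] * P[u][y] for u in range(2)) for y in range(3)]

# Mutual information computed directly...
I_direct = sum(pu[u] * P[u][y] * math.log(P[u][y] / py[y])
               for u in range(2) for y in range(3))
# ...equals the average KL divergence from posteriors to the marginal.
I_kl = sum(pu[u] * kl(P[u], py) for u in range(2))

assert abs(I_direct - I_kl) < 1e-12
```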

Divide and conquer: the inner sum of the double-stroke E-null function is split into two arcs, the major arc and the minor arc, as shown. The major arc is computed exactly. The minor arc is loosened to q^{j+js} Σ_K P_K(K) Σ_{x_1^ℓ∈K} P_{XY}(x_1^ℓ, y_1^ℓ) P_{XY}(K∖x_1^ℓ, y_1^ℓ)^s. Both arcs conquered, merge them and raise to the (1+t)-th power; this controls the summand for any fixed y_1^ℓ in the definition of the double-stroke E-null function. We can finally bound the double-stroke E-null function per se. All the effort we spent on bounding I_e(U_1^j; Y_1^ℓ) is for three creeds: First, Gallager's bound possesses an innate elegance. Second, it fits the paradigm that solving the primal and the dual problems as a whole is easier than solving the primal problem alone. Third, the universal quadratic bound is waiting ahead for the E-null function. Recall the universal quadratic bound E_0(t) ≥ I(W)t log q − t² log(q)², as stated in Lemma 17 and used in the previous subsection; but this time −2/5 ≤ t < 0. We obtain that the exponent is log(ℓ)/2 − α log ℓ + log 2 + log log q − ℓ^{2α}/4 = log(ℓ)/2 − log log ℓ + log 2 + log log q − ℓ^{2 log(log ℓ)/log ℓ}/4 < log(ℓ)/2 + log log q − log(ℓ)²/4.
The first inequality uses ℓ − j = ℓ − ℓH(W) + ℓ^{1/2+α}. The last inequality uses the assumption ℓ ≥ e². With the last line we conclude that E I_e(U_1^j; Y_1^ℓ | G) < exp(log(ℓ)/2 + log log q − log(ℓ)²/4) = ℓ^{1/2−log(ℓ)/4} log q. Switch back to the base-q mutual information: E I(U_1^j; Y_1^ℓ | G) < ℓ^{1/2−log(ℓ)/4}. We now reject kernels such that I(U_1^j; Y_1^ℓ | G) ≥ ℓ^{1/2−log(ℓ)/5}. By Markov's inequality, the opposite direction (<) holds with probability 1 − ℓ^{−log(ℓ)/20}, because 1/5 + 1/20 = 1/4. Plug this upper bound into h_α. The left hand side of Inequality (20) is then under control. The inequality uses that the left hand side increases monotonically in j and that j := ℓH(W) − ℓ^{1/2+α} < ℓ. In any regard, the quantity at the end of the inequalities decays to 0 as ℓ → ∞, so eventually it becomes less than ℓ^{1/2+α}, the right hand side of Inequality (20). This resembles [Ooh02], although no formal proof was found there. See [HM11, BTM17] for alternative descriptions and approaches to the same topic. For readers who took Lemma 12 for granted or went through Appendix B in advance, this is the last sentence of the proof of the main theorem — polar codes' simplicity, random codes' durability.
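The exponent bookkeeping in the Markov step can be checked mechanically. A Python sketch (function names are ours; logarithms are natural, as in the text) confirming that the mean-to-threshold ratio is ℓ^{−log(ℓ)/20}:

```python
import math

# With E[I] < l**(1/2 - log(l)/4) and the rejection threshold
# l**(1/2 - log(l)/5), Markov's inequality bounds the failure probability
# by the ratio of the two, namely l**(-log(l)/20), since 1/4 - 1/5 = 1/20.
def failure_exponent(l):
    mean_exp = 0.5 - math.log(l) / 4       # exponent of E[I]
    threshold_exp = 0.5 - math.log(l) / 5  # exponent of the threshold
    return mean_exp - threshold_exp        # exponent of the Markov ratio

for l in [10.0, 100.0, 1e6]:
    assert abs(failure_exponent(l) - (-math.log(l) / 20)) < 1e-12

# The probability l**failure_exponent(l) indeed vanishes as l grows:
assert 10.0 ** failure_exponent(10.0) > 1e6 ** failure_exponent(1e6)
```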

Conclusions
Shannon introduced what we now understand as discrete memoryless channels seventy-two years ago. In the beginning, Shannon had no tools but developed their own theory of typical sets, proved the noisy channel coding theorem, and justified the notion of capacity. Gallager brought in error exponents. Capacities and error exponents quantify the first- and second-order terms in the asymptotic performance of codes. Only in 2010 was the complete second-order term revealed. It was around that time that polar coding, a graceful instrument to explore the limits at low cost, was discovered, when Arıkan experimented with the channel transformation and with error exponents. It took another ten years to grow variants and proof techniques of polar coding. Ultimately, it became feasible, and was coincidentally done by us, to piece the puzzle together and show the mere possibility of achieving the second-order limits at low cost. An overall comparison is integrated in Table 2. Columns are classes of channels; from left to right: (BEC) binary erasure channels; (BDMC) binary-input discrete-output memoryless channels; (p-ary) channels of prime input size; (q-ary) channels of prime power input size; (finite) channels of discrete input. Columns to the right are wider than columns to the left. The last column is exceptional; (asym.) is about whether we can achieve the true Shannon capacity, instead of the symmetric capacity. Rows are goals; from top to bottom: (LLN) to achieve (symmetric) capacity; (wLDP) there exists π > 0 such that P_e < exp(−N^π); (wCLT) there exists ρ > 0 such that R > I − N^{−ρ}; (wMDP) there exist π, ρ > 0 such that P_e < exp(−N^π) and R > I − N^{−ρ} at once; (LDP) the π in (wLDP) can be arbitrarily close to 1; (CLT) the ρ in (wCLT) can be arbitrarily close to 1/2; (MDP) the (π, ρ)-pair can be arbitrarily close to π + 2ρ = 1. Row (MDP) implies every other row; row (CLT) implies (wCLT); row (LDP) implies (wLDP); and every other row implies row (LLN).
Rows (LDP) and (CLT) together almost imply (MDP) (one still needs the partial distance profile). Cells represent how various goals are achieved over various channels. The greenish background means it is possible using Arıkan's kernel [1, 1; 0, 1]. The purplish background means it is possible using other kernels. The orangish background means it is only possible using dynamic kernels.
We did our best to excavate the archive, but throughout the course of manuscript preparation we found ourselves underestimating early works multiple times, so the record kept updating. We sincerely hope to hear about possible references to add to the table.
Potential improvements include but are not limited to the following: (Tolls) Tighten the explicit Hölder tolls. The current toll between any pair of the parameters H, P_e, Z, Z_mad, T, S, and S_max is roughly the sum of the tolls collected when traveling through the spanning tree illustrated in Lemma 7. Some improvements potentially tighten the bounds in Lemma 12. (FTPC) Tighten the two fundamental theorems such that they degenerate to equalities over erasure channels. Once done, {Z_n} is a supermartingale and Appendix C.1 is obsolete. (Symmetry) Generalize the arguments in Sections 8.2 to 8.4 to asymmetric channels. Once done, Section 8.1 is obsolete. Note that the proof of the fundamental theorems applies to asymmetric channels. (Bijection) Early works on polar coding over arbitrary alphabets introduced arbitrary bijections g_W. Generalize the two fundamental theorems to include arbitrary bijections. (Dynamic) Achieve the main theorem with a large, but fixed, kernel. This does not immediately make the code practical, but the answer should shed light on our understanding of coding. (Alphabet) Achieve the main theorem without the reduction to prime power alphabets. This is currently not an option because linear codes are barely defined over non-fields; plus, the S-parameter — and thus FTPCS — would just break. (Dispersion) Recall Proposition 2. Weird things happen when the channel dispersion vanishes, V = 0. Can we describe those channels better? We look forward to generalizations of the main theorem to non-identical channels (i.e., non-stationary) [Mah17], non-independent channels (i.e., with memory) [WHY+15, ST16], deletion channels [TPFV19, LT19], channels with restrictions on input distributions (e.g., due to energy constraints) [FT16], wiretap channels [ŞV13], the rate-distortion problem [HKU09], the Wyner–Ziv problem [HKU09], the Slepian–Wolf problem [Abb15], broadcast channels [GAG15, MHSU15], and multiple access channels [AT12, NT16].
We focus on noisy channel coding in this work for its historical significance.
Appendix A. Explicit Hölder Tolls (Proof of Lemma 8)

As promised in Section 4.1, we prove the explicit Hölder tolls. Let W be a q-ary channel. In the upcoming arguments, H, P_e, Z, Z_mad, S, and S_max mean H(W), P_e(W), Z(W), Z_mad(W), S(W), and S_max(W), respectively. Also, q′ means q − 1, and q″ means q − 2. Furthermore, lg means the base-2 logarithm; this is handy when we jump back and forth between nats, bits, and q-bits.
First we show Z_mad ≤ q√(q′H log_4 q).
(2's copy) Start from Z_mad: By definition, Z_mad ≤ q′Z. Move on to Z: By Lemma 3, the left hand side is qZ; on the right hand side, √(1 + q′z) + √(1 − z) has maximum q/√q′ at z = q″/q′ by calculus. So Z ≤ √P_e·(q/√q′) = q√(P_e/q′). Move on to P_e: By Lemma 6 (the first lower bound), 2P_e ≤ H lg q, or equivalently P_e ≤ H log_4 q. Now we chain the inequalities: Z_mad ≤ q′Z ≤ q√(P_e q′) ≤ q√(q′H log_4 q). This completes Inequality (2). That being proven, we use the weaker form Z_mad ≤ q³√H in the calculus machinery for global MDP. Second we show H ≤ √(eq′Z_mad/2).
(3's copy) Start from H: By Lemma 6 (the upper bound, Fano's inequality), H lg q ≤ h_2(P_e) + P_e lg q′. By Figure 6, h_2(P_e) + P_e lg q′ ≤ √(eP_e) + P_e lg q′ = √P_e(√e + √P_e lg q′). What is inside the parentheses is less than √e + √(q′/q) lg q′. Hence H ≤ √P_e(√e + √(q′/q) lg q′)/lg q. Focus on the scalar part: (√e + √(q′/q) lg q′)/lg q has maximum √e at q = 2 (remember that q ≥ 2). So H ≤ √(eP_e). Move on to P_e: By Lemma 3, P_e ≤ q′Z/2. Move on to Z: By definition, Z ≤ Z_mad. Now we chain the inequalities: H ≤ √(eP_e) ≤ √(eq′Z/2) ≤ √(eq′Z_mad/2). This completes Inequality (3). That being proven, we use the weaker form H ≤ q³√Z_mad in the calculus machinery for global MDP.
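Two ingredients above lend themselves to numerical spot checks: the bound h_2(P_e) ≤ √(eP_e) read off Figure 6, and Fano's inequality. A Python sketch (the grid and the toy 3-ary channel are our choices):

```python
import math

def h2(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# The elementary bound h2(p) <= sqrt(e * p), checked on a grid p in [0, 1/2].
for i in range(0, 5001):
    p = i / 10000
    assert h2(p) <= math.sqrt(math.e * p) + 1e-12

# Fano's inequality H(X|Y) <= h2(Pe) + Pe * lg(q - 1) for a toy 3-ary channel
# (the channel matrix is an arbitrary choice, not from the paper).
q = 3
W = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.1, 0.7]]
py = [sum(W[x][y] for x in range(q)) / q for y in range(q)]
# Conditional entropy H(X|Y) in bits, uniform input.
H = -sum((W[x][y] / q) * math.log2(W[x][y] / q / py[y])
         for x in range(q) for y in range(q) if W[x][y] > 0)
# MAP decoding error probability.
Pe = 1 - sum(max(W[x][y] / q for x in range(q)) for y in range(q))
assert H <= h2(Pe) + Pe * math.log2(q - 1) + 1e-12
```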
Third we show Inequality (4) (notice the logarithm is natural).

(4's copy) Start from S_max: By definition, q′S_max ≥ S. Move on to S: By Lemma 5, S ≥ q′q(q′/q − P_e)√(1 − q″q/(q′)²). The square root simplifies to √(1/(q′)²) = 1/q′, as qq″ = (q′)² − 1. So S ≥ q′ − qP_e. Move on to q′ − qP_e: By Lemma 6 (the upper bound, Fano's inequality), H lg q ≤ h_2(P_e) + P_e lg q′. We claim that h_2(P_e) + P_e lg q′ ≤ lg q − 2(q′/q − P_e)²/log 2. To prove the claim, Taylor expand both sides at P_e = q′/q. Verify that both evaluate to lg q at P_e = q′/q; verify that both have derivative 0 at P_e = q′/q; and verify that the acceleration of the left hand side, −1/(P_e(1 − P_e) log 2), is more negative than the acceleration of the right hand side, −4/log 2. By Taylor's theorem, the mean value theorem, or Euler's method, the function with greater acceleration is greater; hence the claim. See also [FM94, Fig. 1]; the Φ-curve seems parabolic at the upper right corner. Now we have H lg q ≤ lg q − 2(q′/q − P_e)²/log 2, which is equivalent to 2(q′/q − P_e)²/log q ≥ 1 − H and to q′ − qP_e ≥ q√((1 − H) log(q)/2). Now we chain the inequalities: q′S_max ≥ S ≥ q′ − qP_e ≥ q√((1 − H) log(q)/2). This completes Inequality (4). That being proven, we use the weaker form in the calculus machinery for global MDP. Fourth we show 1 − H ≤ q′S_max/lg q.
(5's copy) Start from 1 − H: By Lemma 6 (the second lower bound), H lg q is at least a linear function of P_e; matching the (rational) coefficients of P_e lg q, P_e lg q′, lg q, and lg q′, respectively, the right hand side rewrites as lg q − q′lg(q/q′)(q′ − qP_e). As H lg q ≥ lg q − q′lg(q/q′)(q′ − qP_e), we bound lg(q/q′) = lg(1 + 1/q′) ≤ 1/q′ by the tangent line at 1/q′ = 0. So H lg q ≥ lg q − (q′ − qP_e), and hence 1 − H ≤ (q′ − qP_e)/lg q. Move on to q′ − qP_e: By Lemma 5, 1 − qP_e/q′ ≤ S, so q′ − qP_e ≤ q′S. Move on to S: By definition, S ≤ S_max. Now we chain the inequalities: 1 − H ≤ (q′ − qP_e)/lg q ≤ q′S/lg q ≤ q′S_max/lg q. This completes Inequality (5). That being proven, we use the weaker form 1 − H ≤ q³S_max in the calculus machinery for global MDP. This is the end of the proof of Lemma 8. The proof of Lemma 7 follows the same logic, only shorter.
Appendix B. Calculus Machinery for Global MDP (Proof of Lemma 12)

We are to prove the global MDP behavior given criteria (cb), (cm), (ct), and (cl), the local LDP behavior, the local CLT behavior, and that π + 2ρ ≤ 1 − 8α. The proof is split into several stepping stones. We will prove each of the following inequalities (including two equalities) in each of the upcoming subsections. This will be proven in Appendix B.1: the eigen behavior reads E[h_α(H_{n+1}) | F_n] ≤ 4ℓ^{−1/2+3α} h_α(H_n). (23) This will be proven in Appendix B.2: as a lemma, {H_n} and {Z_n} converge to 0 with probability 1 − H_0, i.e., P{Z_n → 0} = P{H_n → 0} = 1 − H_0. (24) This will be proven in Appendix B.3: the en23 behavior reads P{Z_n < exp(−n^{2/3})} > 1 − H_0 − ℓ^{(−1/2+4α)n+o(n)}. (25) This will be proven in Appendix C.1: as a lemma, {min(ℓ^{−2}, Z_n^{1/4})} is a supermartingale, i.e., E[min(ℓ^{−2}, Z_{n+1}^{1/4}) | F_n] ≤ min(ℓ^{−2}, Z_n^{1/4}). (26) This will be proven in Appendix C.2: as a lemma, the following holds when Z_0 < ℓ^{−8}: E[((3/4)K_{n+1}^{2/3})^{−1/2} | F_n] < ℓ^{−1/2+2α}. (27) This will be proven in Appendix C.3: the een13 behavior reads P{Z_n < exp(−e^{n^{1/3}})} > 1 − H_0 − ℓ^{(−1/2+4α)n+o(n)}. (28) This will be proven in Appendix C.4: the elpin behavior reads, for any constants π, ρ > 0 such that π + 2ρ ≤ 1 − 8α, P{Z_n < exp(−ℓ^{πn} n²)} > 1 − H_0 − ℓ^{−ρn+o(n)}. (29) The last inequality is a bi-Hölder toll away from P{H_n < exp(−ℓ^{πn} n)} > 1 − H_0 − ℓ^{−ρn+o(n)}, (7's copy) our destination. This finishes the proof of Lemma 12. The eigen, en23, een13, and elpin behaviors are intermediate checkpoints pinned in a way that moving from one to the next is easy while skipping any of them makes the next unreachable. Their entire purpose is to form a chain that connects the local LDP and CLT behaviors to the global MDP behavior; we do not specify whether any of them falls inside the LDP, CLT, or MDP paradigm.

B.1. The eigen behavior. We want to prove Inequality (23), E[h_α(H_{n+1}) | F_n] ≤ 4ℓ^{−1/2+3α} h_α(H_n), given the local LDP behavior and the local CLT behavior. The idea is that, for H_n that is close to 1/2, the local CLT behavior provides a measurement of the dichotomy/bifurcation behavior of H_{n+1}. For H_n that is close to 0, the Z-part of the local LDP behavior provides a measurement of the attraction toward 0.
For H n that is close to 1, the S-part handles it dually. The formal proof is below.
B.2. Polarization in mean. We want to prove Equality (24), P{Z_n → 0} = P{H_n → 0} = 1 − H_0, given the martingale property and the eigen behavior. The idea is that the eigen behavior expels H_n from staying close to 1/2, so the only reasonable limits are 0 and 1. The formal proof is below.
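The statement P{H_n → 0} = 1 − H_0 can be watched happening in a simulation. The Python sketch below uses the binary erasure channel and Arıkan's 2 × 2 kernel as a stand-in for the large dynamic kernels of this paper (an illustrative assumption, not the construction proven here):

```python
import random

# Monte-Carlo sketch of polarization over the binary erasure channel.
# A BEC with erasure probability z splits, under Arikan's 2x2 kernel,
# into children with erasure probabilities z*z and 2*z - z*z; one child
# is picked uniformly per step, mimicking a uniformly chosen synthetic index.
random.seed(1)
H0 = 0.4         # initial erasure rate = conditional entropy of the BEC
n_steps = 40     # polarization depth
trials = 20000

near_zero = 0
for _ in range(trials):
    z = H0
    for _ in range(n_steps):
        z = z * z if random.random() < 0.5 else 2 * z - z * z
    if z < 1e-6:
        near_zero += 1

# The fraction of synthetic channels that became noiseless approaches
# the capacity 1 - H0 = 0.6, illustrating P{H_n -> 0} = 1 - H_0.
assert abs(near_zero / trials - (1 - H0)) < 0.03
```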
B.3. The en23 behavior. We want to prove P{Z_n < exp(−n^{2/3})} > 1 − H_0 − ℓ^{(−1/2+4α)n+o(n)}, namely Inequality (25), given the eigen behavior and the polarization in mean. The idea is to read off the behavior of {H_n} from the behavior of {h_α(H_n)} in the eigen behavior. The formal proof is below.
This simplifies the eigenvalue. Without loss of generality, we rescale h_α such that h_α(H_0) = 1. Let ε_n be exp(−n^{3/4}); note that ε_n ≤ H_0 ≤ 1 − ε_n for n large enough. Owing to h_α's concavity, that h_α(0) = h_α(1) = 0, and that h_α(H_0) = 1, we deduce that h_α(z) ≥ ε_n whenever ε_n ≤ z ≤ 1 − ε_n. Consider these three events as a partition: let A_n be {H_n < ε_n}; let B_n be {ε_n ≤ H_n ≤ 1 − ε_n}; let C_n be {1 − ε_n < H_n}. Note that B_n implies h_α(H_n) ≥ ε_n.
The fourth root here is an aesthetic choice; min(ℓ^{−2}, Z_n^{1/(2+ε)}) is also a supermartingale, but only for astronomic ℓ (depending on q). For any non-random kernel, a small enough power works, provided that the kernel polarizes channels in the first place.
C.2. A Cramér–Chernoff gadget. Let D_{n+1} be (3/4)K_{n+1}^{2/3}. We want to prove the inequalities in (27): that Z_n < ℓ^{−8} implies Z_{n+1} ≤ Z_n^{D_{n+1}}, and that E[D_{n+1}^{−1/2} | F_n] < ℓ^{−1/2+2α}, given the local LDP behavior. The motivation is to reformat the local inequalities so that they are easy to telescope for future reference. The formal proof is below.
The last inequality uses ℓ ≥ e⁴. This validates the second inequality in (27), and the proof of the inequalities in (27) is sound.
C.3. The een13 behavior. We want to prove P{Z_n < exp(−e^{n^{1/3}})} > 1 − H_0 − ℓ^{(−1/2+4α)n+o(n)}, namely Inequality (28), given the en23 behavior, the supermartingale property, and the Cramér–Chernoff gadget. The idea is to apply the gadget consecutively to show that Z_n becomes smaller and smaller as n increases. To reach the goal exp(−e^{n^{1/3}}), we apply the gadget √n times to avoid losing too much code rate. (Bound b_m/a_m from above.) Conditioning on A_m, we want to estimate the probability that Z_k ≥ ℓ^{−8} for some k ≥ m, which is equal to the probability that min(ℓ^{−2}, Z_k^{1/4}) ≥ ℓ^{−2} for some k ≥ m. Recall that min(ℓ^{−2}, Z_k^{1/4}) was made a supermartingale. Hence, by Doob's optional stopping theorem [Dur19, Exercise 4.8.2], P{min(ℓ^{−2}, Z_k^{1/4}) ≥ ℓ^{−2} for some k ≥ m | A_m} ≤ min(ℓ^{−2}, Z_m^{1/4})·ℓ² < exp(−m^{2/3}/4)·ℓ². This is an upper bound on b_m/a_m and will be summoned in Formula (32).
(Bound c_m/a_m from above.) We want to estimate how often Inequality (31) happens; it is the probability that the product D_{m+1} D_{m+2} ··· D_{m+√n} is small. The first three equalities are by the definitions of g_m and E_0^m. The next equality is simple algebra. The next two inequalities are by 0 ≤ e_m/a_m ≤ 1. The next inequality is by the definition of E_m. The last inequality summons the upper bounds derived in the last few paragraphs. The last line contains two terms in the big parentheses; between them, the factor ℓ^{−1/2+3α} appears. More precisely, we attempt to bound Z_{m+√n} when E_m happens, for each m = √n, 2√n, ..., n − √n. When E_m happens, its superevent A_m happens, so we know that Z_m < exp(−m^{2/3}). But B_m does not happen, so Z_k < ℓ^{−8} for all k ≥ m. This implies that Z_{k+1} ≤ Z_k^{D_{k+1}} for those k. Telescope; Z_{m+√n} is less than Z_m raised to the power of D_{m+1} D_{m+2} ··· D_{m+√n}. But C_m does not happen, so the product is greater than ℓ^{2α√n}. Moreover, Z_{k+1} ≤ ℓeqZ_k for all k ≥ m + √n so long as Z_k stays below ℓ^{−8}, which it does because B_m is excluded. Then telescope again; Z_n ≤ (ℓeq)^{n−m−√n} Z_{m+√n} < (ℓeq)^n exp(−m^{2/3} ℓ^{2α√n}) < exp(−e^{n^{1/3}}), provided that n is sufficiently large. In other words, E_0^{n−√n} implies Z_n < exp(−e^{n^{1/3}}).
This subsection is parallel to [WD18, Section V]. Do not confuse this subsection with the next. The subtlety is explained in [WD18, Section III].
(Bound f_m^+ from above.) The definition of f_m reads 1 − H_0 − a_0^m. Here a_0^m is the probability measure of A_0^m, and A_0^m is a superevent of A_m by how the former is defined. Event A_0^m must contain {Z_m < exp(−e^{m^{1/3}})} by how A_m was defined. By the een13 behavior, P{Z_m < exp(−e^{m^{1/3}})} > 1 − H_0 − ℓ^{(−1/2+4α)m+o(m)}. Chaining all inequalities together, we infer that f_m < ℓ^{(−1/2+4α)m+o(m)}. Let f_m^+ be max(0, f_{m+√n}), so we can write f_m^+ < ℓ^{(−1/2+4α)m+o(m)}. This upper bound will be summoned in Formula (34).
(Bound e_0^n from below.) We start by rewriting g_m − f_m^+. More precisely, we attempt to bound Z_n when E_m happens, for each m = √n, 2√n, ..., n − √n. When E_m happens, its superevent A_m happens, so we know that Z_m < exp(−e^{m^{1/3}}). But B_m does not happen, so Z_k < ℓ^{−8} for all k ≥ m. This implies Z_{k+1} ≤ Z_k^{D_{k+1}} for those k. Telescope; Z_n is less than Z_m raised to the power of D_{m+1} D_{m+2} ··· D_n. But C_m does not happen, so the product is greater than ℓ^{πn}. Jointly we have Z_n ≤ Z_m^{ℓ^{πn}} < exp(−e^{m^{1/3}} ℓ^{πn}) < exp(−ℓ^{πn} n²). In other words, E_0^{n−√n} implies Z_n < exp(−ℓ^{πn} n²). (Summary.) Now we conclude that P{Z_n < exp(−ℓ^{πn} n²)} ≥ P(E_0^{n−√n}) = e_0^n > 1 − H_0 − ℓ^{−ρn+o(n)}. Hence the proof of the elpin behavior, Inequality (29), is sound.
This subsection is parallel to [WD18, Section VI]. Do not confuse this subsection with the previous. The subtlety is explained in [WD18, Section III].
Having proven Inequalities (23) to (29), we finish the proof of Lemma 12. Lemmas 12 to 14 are all established. This is the last sentence of the proof of the main theorem.

Appendix D. Constants Dependence Summary
Given a discrete memoryless channel W, the sender chooses the message alphabet size ς ≥ 2. Depending on the factorization of ς, we choose q to be a certain prime power or alternate among q_2, q_3, q_5, ... (a finite list depending on ς). Fix a q. Given π, ρ > 0 such that π + 2ρ < 1; fix them. Choose ℓ; this also determines α := log(log ℓ)/log ℓ. The choice of ℓ is such that π + 2ρ ≤ 1 − 8α and such that the failing probabilities in Lemmas 13 and 14 do not sum to one. It depends on q, π, ρ. Once ℓ is fixed, the complexity is a function of n (or of N = ℓ^n). The asymptotic complexity O(N log N) hides a scalar term that is determined by q and ℓ. The decaying gap ℓ^{−ρn+o(n)} in Claim 11 and Lemma 12 hides two things: a scalar term in front, determined by q and ℓ, alongside an O(n^{1−ε}) term determined by the choice of the en23 and een13 checkpoints. This ε is fixed throughout the paper and is irrespective of ς, π, ρ, q, ℓ.
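To get a feel for how the constants interlock, the following Python sketch (the bisection helper and the sample π, ρ are ours; logarithms are natural) computes the smallest log ℓ obeying π + 2ρ ≤ 1 − 8α:

```python
import math

# alpha(l) = log(log l) / log l must satisfy pi + 2*rho <= 1 - 8*alpha(l),
# so l must be enormous even for a modest gap between pi + 2*rho and 1.
def alpha(l):
    return math.log(math.log(l)) / math.log(l)

def smallest_log_l(pi, rho):
    """Smallest L = log(l) (found by bisection) such that
    pi + 2*rho <= 1 - 8*alpha(exp(L)); valid since log(L)/L decreases."""
    target = (1 - (pi + 2 * rho)) / 8      # alpha must be <= target
    lo, hi = math.e, 1e9                   # search over L = log(l)
    for _ in range(200):
        mid = (lo + hi) / 2
        if math.log(mid) / mid <= target:  # alpha expressed in terms of L
            hi = mid
        else:
            lo = mid
    return hi

L = smallest_log_l(pi=0.6, rho=0.15)       # pi + 2*rho = 0.9 < 1
assert alpha(math.exp(L)) <= (1 - 0.9) / 8 + 1e-9
# log(l) lands in the hundreds, i.e., l = exp(L) is astronomic.
assert L > 100
```

This is the quantitative face of the remark in Appendix C.1 that certain choices work "but only for astronomic ℓ."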