Fast Successive-Cancellation List Flip Decoding of Polar Codes

This work presents a fast successive-cancellation list flip (Fast-SCLF) decoding algorithm for polar codes that addresses the high latency issue associated with the successive-cancellation list flip (SCLF) decoding algorithm. We first propose a bit-flipping strategy tailored to the state-of-the-art fast successive-cancellation list (FSCL) decoding that avoids tree-traversal in the binary tree representation of SCLF, thus reducing the latency of the decoding process. We then derive a parameterized path selection error model to accurately estimate the bit index at which the correct decoding path is eliminated from the initial FSCL decoding. The trainable parameter is optimized online based on an efficient supervised learning framework. Simulation results show that for a polar code of length 512 with 256 information bits, with similar error-correction performance and memory consumption, the proposed Fast-SCLF decoder reduces up to $73.4\%$ of the average decoding latency of the SCLF decoder with the same list size at the frame error rate of $10^{-4}$, while incurring a maximum computational complexity overhead of $27.6\%$. For the same polar code of length 512 with 256 information bits and at practical signal-to-noise ratios, the proposed decoder with list size 4 reduces $89.3\%$ and $43.7\%$ of the average complexity and decoding latency of the FSCL decoder with list size 32 (FSCL-32), respectively, while also reducing $83.2\%$ of the memory consumption of FSCL-32. The significant improvements of the proposed decoder come at the cost of $0.07$ dB error-correction performance degradation compared with FSCL-32.


I. INTRODUCTION
Polar codes are the first class of error-correcting codes proven to achieve the channel capacity of any binary symmetric channel under the low-complexity successive-cancellation (SC) decoding algorithm [1]. Recently, polar codes were selected for use in the enhanced mobile broadband (eMBB) control channel of the fifth generation of cellular communications (5G) standard, where codes with short to moderate block lengths are used [2]. The error-correction performance of short to moderate-length polar codes under SC decoding does not satisfy the requirements of the 5G standard. SC list (SCL) decoding was introduced in [3]-[5] to improve the error-correction performance of SC decoding by keeping a list of candidate message words at each decoding step. In addition, it was observed that under SCL decoding, the error-correction performance is significantly improved when the polar code is concatenated with a cyclic redundancy check (CRC) [3], [4]. Furthermore, SC-based decoding of polar codes can be represented as a binary tree traversal problem [6], and it was shown that the decoders in [1], [3]-[5] experience a high decoding latency as they require a full binary tree traversal. Several fast decoding techniques were introduced to improve the decoding latency of the conventional SC and SCL decoding algorithms [7]-[11]. In particular, the decoding operations of special constituent codes under the fast SCL (FSCL) decoding algorithms proposed in [8]-[11] can be carried out at the parent-node level, thus reducing the decoding latency caused by the tree traversal.
As the memory requirement of SCL decoding grows linearly with the list size [12], it is of great interest to improve the decoding performance of SCL decoding with a small list size. To address this issue, various bit-flipping algorithms for SC-based decoders were proposed to significantly improve the error-correction performance of SC and SCL decoding with the same list size [13]-[23]. In [19], given that the initial SCL decoding attempt does not satisfy the CRC verification, the authors proposed an algorithm that flips the first erroneous bit of the best decoding path in the next decoding attempt, where the error position is estimated using a correlation matrix. The SCL-Flip (SCLF) decoder proposed in [20] estimates the bit index at which the correct path is discarded from the initial SCL decoding; then, in the next decoding attempt, all the paths that were discarded at the estimated error position are selected to continue the decoding. It was observed that the decoder in [20] provides a better error-correction performance when compared with the decoder proposed in [19]. In [21], an improved SCLF decoding algorithm is proposed which addresses the high-order errors of SCL decoding. By utilizing a complex path selection error model and with the same number of additional decoding attempts, the decoder in [21] provides a slight error-correction performance improvement when compared with the SCLF decoder [20]. It is worth noting that all the bit-flipping algorithms of SCL decoding introduced in [19]-[22] suffer from a variable decoding latency, which is caused by the sequential nature of the bit-flipping operations. In addition, all the decoders in [19]-[22] fully traverse the polar code decoding tree as required by SCL decoding, thus resulting in a high decoding latency.
The authors in [23] proposed a simplified node-based bit-flipping algorithm to improve the decoding latency of SCLF decoding, which is referred to as SSCLF decoding in this paper. Specifically, a low-complexity bit-flipping metric based on the path metrics is utilized in [23] to select the first error position of FSCL decoding. However, when applied to an FSCL decoder with list sizes 2 and 4, the simplified path-selection scheme of [23] results in a significant FER performance degradation when compared with SCLF decoding [20].
In this paper, a fast SCLF (Fast-SCLF) decoding algorithm is proposed to tackle the high decoding latency of the SCLF decoder [20]. In particular, a bit-flipping strategy tailored to FSCL decoding of polar codes is first introduced. Then, a path selection error metric is derived for the proposed bit-flipping strategy. The proposed path selection error metric utilizes a trainable parameter to improve the estimation accuracy of the error position, which is optimized online using an efficient supervised learning framework. By utilizing online training, the proposed path selection error model does not require the parameter to be optimized offline at various signal-to-noise ratios (SNRs). Instead, the parameter is automatically optimized at the operating SNR of the decoder, which obviates the need for pilot signals. Our simulation results illustrate that for a polar code of length 512 with 256 information bits at the frame error rate (FER) of $10^{-4}$, with similar error-correction performance and memory consumption, the proposed Fast-SCLF decoder reduces up to $73.4\%$ of the average decoding latency of the SCLF decoder with the same list size, while incurring a maximum computational overhead of $27.6\%$. For the same polar code of length 512 with 256 information bits and at practical SNR values, the proposed decoder with list size 4 reduces $89.3\%$ and $43.7\%$ of the average complexity and decoding latency of FSCL-32, respectively, while also reducing $83.2\%$ of the memory consumption of FSCL-32. Note that the significant complexity reductions only come at the cost of less than $0.07$ dB error-correction performance degradation.
For the same polar code of length 512 with 256 information bits, when compared with the SSCLF decoder with list size 4 at the target FER of $10^{-4}$, an FER performance gain of $0.2$ dB is obtained for the proposed Fast-SCLF decoder at the cost of an $8.3\%$ computational complexity overhead, while the average decoding latency and memory consumption of the Fast-SCLF decoder are relatively similar to those of the SSCLF decoder.
The remainder of this paper is organized as follows. Section II provides the background on polar codes and their decoding algorithms. Section III proposes the Fast-SCLF decoding algorithm. Simulation results are reported in Section IV, followed by concluding remarks drawn in Section V.

II. PRELIMINARIES
We start this section by introducing notation. Throughout this paper, boldface letters indicate vectors and matrices, while, unless otherwise specified, non-boldface letters indicate binary, integer, or real numbers. In addition, by $\mathbf{a}_{i_{\min}}^{i_{\max}} = \{a_{i_{\min}}, \ldots, a_{i_{\max}}\}$ we denote a vector of size $i_{\max} - i_{\min} + 1$ containing the elements of $\mathbf{a}$ from index $i_{\min}$ to $i_{\max}$ ($i_{\min} < i_{\max}$). Sets are denoted by blackboard bold letters, e.g., $\mathbb{R}$ is the set of real numbers. Finally, $\mathbb{1}_X$ is an indicator function, where $\mathbb{1}_X = 1$ if the condition $X$ is true and $\mathbb{1}_X = 0$ otherwise.

A. POLAR ENCODING
A polar code $\mathcal{P}(N, K)$ of length $N$ with $K$ information bits is constructed by applying a linear transformation to the binary message word $\mathbf{u} = \{u_1, \ldots, u_N\}$ as $\mathbf{x} = \mathbf{u} G^{\otimes n}$, where $\mathbf{x} = \{x_1, \ldots, x_N\}$ is the codeword, $G^{\otimes n}$ is the $n$-th Kronecker power of the polarizing matrix $G = \bigl[\begin{smallmatrix} 1 & 0 \\ 1 & 1 \end{smallmatrix}\bigr]$, and $n = \log_2 N$. The vector $\mathbf{u}$ contains a set $\mathbb{I}$ of $K$ information bit indices and a set $\mathbb{I}^c$ of $N - K$ frozen bit indices, where $\mathbb{I}$ and $\mathbb{I}^c$ are known to both the encoder and the decoder. The values of all the frozen bits are set to 0, while the values of the information bits are independent and identically distributed [1]. The codeword $\mathbf{x}$ is then modulated and sent through the channel. In this paper, binary phase-shift keying (BPSK) modulation and an additive white Gaussian noise (AWGN) channel model are considered. Therefore, the soft vector of the transmitted codeword received by the decoder is given as $\mathbf{y} = (\mathbf{1} - 2\mathbf{x}) + \mathbf{z}$, where $\mathbf{1}$ is an all-one vector of size $N$, and $\mathbf{z}$ is a Gaussian noise vector of size $N$ with zero mean and variance $\sigma^2$. In the log-likelihood ratio (LLR) domain, the LLR vector of the transmitted codeword is given as $\boldsymbol{\alpha}_n = \ln \frac{\Pr(\mathbf{x} = 0 | \mathbf{y})}{\Pr(\mathbf{x} = 1 | \mathbf{y})} = \frac{2\mathbf{y}}{\sigma^2}$.
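The encoding step above can be sketched in a few lines of Python. This is our illustration rather than code from the paper, and the information set used here is a hypothetical example (a real code would place the information bits on the most reliable synthetic channels):

```python
import numpy as np

def polar_encode(u):
    """Encode a length-N binary vector u as x = u * G^{(n)} mod 2,
    where G = [[1, 0], [1, 1]] and n = log2(N)."""
    G = np.array([[1, 0], [1, 1]], dtype=np.uint8)
    n = int(np.log2(len(u)))
    Gn = np.array([[1]], dtype=np.uint8)
    for _ in range(n):
        Gn = np.kron(Gn, G) % 2  # build the n-th Kronecker power of G
    return u @ Gn % 2

# Example with N = 8 and K = 4: the information set {4, 6, 7, 8}
# (1-indexed) is hypothetical, chosen only for illustration.
u = np.zeros(8, dtype=np.uint8)
u[[3, 5, 6, 7]] = [1, 0, 1, 1]
x = polar_encode(u)
```

Since $G$ is its own inverse over GF(2), encoding the codeword again recovers the message word, which gives a quick sanity check.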

B. SUCCESSIVE-CANCELLATION AND SUCCESSIVE-CANCELLATION LIST DECODING
An example of a factor-graph representation of $\mathcal{P}(16, 8)$ is depicted in Fig. 1(a). To obtain the message word, the soft LLR values are propagated through all the processing elements (PEs) of the factor graph, which are depicted in Fig. 1(b) [1]. Note that $\alpha_{s,i}$ and $\beta_{s,i}$ are the soft LLR value and the hard-bit estimation at the $s$-th stage and the $i$-th bit, respectively, and the min-sum approximation formulations of $f$ and $g$ are $f(a, b) = \min(|a|, |b|) \operatorname{sgn}(a) \operatorname{sgn}(b)$ and $g(a, b, c) = b + (1 - 2c)a$ [1]. The soft LLR values at the $n$-th stage are initialized to $\boldsymbol{\alpha}_n$, and the hard-bit estimation of an information bit at the $0$-th stage is obtained as $\hat{u}_i = \beta_{0,i} = \frac{1 - \operatorname{sgn}(\alpha_{0,i})}{2}$, $\forall i \in \mathbb{I}$ [1]. Given $\beta_{s,i}$ and $\beta_{s,i+2^s}$, $\beta_{s+1,i}$ and $\beta_{s+1,i+2^s}$ are then computed as $\beta_{s+1,i} = \beta_{s,i} \oplus \beta_{s,i+2^s}$ and $\beta_{s+1,i+2^s} = \beta_{s,i+2^s}$ [1].

SCL decoding was introduced to significantly improve the error-correction performance of SC decoding [3]-[5]. Under SCL decoding, the estimation of an information bit $\hat{u}_i$ ($i \in \mathbb{I}$) is considered to be both 0 and 1, causing a path splitting and doubling the number of candidate codewords (decoding paths) after each split. To prevent the exponential growth of the number of decoding paths, a path metric is used to select the $L$ most probable paths after each information bit is decoded. The path metric is obtained as [5]
$$\mathrm{PM}_l = \mathrm{PM}_l + |\alpha_{0,i}^l| \, \mathbb{1}_{\hat{u}_i^l \neq \frac{1 - \operatorname{sgn}(\alpha_{0,i}^l)}{2}},$$
where $\alpha_{0,i}^l$ denotes the soft value of the $i$-th bit ($i \in [1, N]$) at stage 0 of the $l$-th path. Initially, $\mathrm{PM}_l = 0$, $\forall l$. After each information bit is decoded, only the $L$ paths with the smallest path metric values, i.e., the $L$ most probable decoding paths, are selected from the $2L$ paths to continue the decoding. After the last information bit is decoded, the decoding path with the smallest path metric is selected as the decoding output. In a practical scenario such as the 5G standard, the message word is often concatenated with a CRC of size $C$ to allow error detection.
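The min-sum PE updates and the LLR-domain path-metric rule described above can be sketched as follows. This is our illustration of the scalar formulations of [1] and [5]; the function names are our own:

```python
import numpy as np

def f(a, b):
    """Min-sum check-node update: f(a, b) = sgn(a) sgn(b) min(|a|, |b|)."""
    return np.sign(a) * np.sign(b) * np.minimum(np.abs(a), np.abs(b))

def g(a, b, c):
    """Variable-node update with partial sum c in {0, 1}: g(a, b, c) = b + (1 - 2c) a."""
    return b + (1 - 2 * c) * a

def update_path_metric(pm, alpha, u_hat):
    """LLR-domain path-metric update: penalize the path by |alpha| whenever
    the bit decision u_hat disagrees with the sign of the LLR."""
    follows_sign = (u_hat == (1 - np.sign(alpha)) / 2)
    return pm if follows_sign else pm + abs(alpha)
```

For example, a path that decides $\hat{u}_i = 0$ on a negative LLR of magnitude 1.5 is penalized by 1.5, while the sign-following decision leaves the metric unchanged.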
In addition, it was observed in [3], [4] that the error-correction performance of SCL decoding is greatly improved when a CRC is concatenated to polar codes.

C. FAST SUCCESSIVE-CANCELLATION LIST DECODING
Fig. 2(a) shows a full binary tree representation of $\mathcal{P}(16, 8)$, whose factor graph is depicted in Fig. 1(a). The authors in [9]-[11] proposed fast decoding operations for various special nodes under SCL decoding, which preserve the error-correction performance of SCL decoding while preventing tree traversal to the leaf nodes. Thus, the decoding latency of SCL decoding is significantly reduced. Similar to [9], we consider four types of special nodes, namely Rate-0, Rate-1, repetition (REP), and single parity-check (SPC) nodes, for all the fast SCL-based decoding algorithms in this paper. Consider a parent node $\nu$ located at the $s$-th stage ($0 < s \leq n$) of the polar code binary tree. There are $N_\nu$ LLR values and $N_\nu$ hard decisions associated with this node, where $N_\nu = 2^s$. Let $\boldsymbol{\alpha}_{\nu_l}$ and $\boldsymbol{\beta}_{\nu_l}$ be the vectors containing the soft and hard values associated with the parent node $\nu$ of the $l$-th decoding path, defined as $\boldsymbol{\alpha}_{\nu_l} = \{\alpha_{s,i_{\min_{\nu_l}}}, \ldots, \alpha_{s,i_{\max_{\nu_l}}}\}$ and $\boldsymbol{\beta}_{\nu_l} = \{\beta_{s,i_{\min_{\nu_l}}}, \ldots, \beta_{s,i_{\max_{\nu_l}}}\}$, respectively, where $i_{\min_{\nu_l}}$ and $i_{\max_{\nu_l}}$ are the bit indices corresponding to $\nu$ such that $1 \leq i_{\min_{\nu_l}} < i_{\max_{\nu_l}} \leq N$ and $i_{\max_{\nu_l}} - i_{\min_{\nu_l}} = N_\nu - 1$. For all the FSCL-based decoders considered in this paper, the elements of $\boldsymbol{\alpha}_{\nu_l}$ corresponding to the SPC and Rate-1 nodes are considered to be sorted in increasing order of their magnitudes [9]:
$$|\alpha_{s,i^*_{\min}}| \leq \cdots \leq |\alpha_{s,i^*_{\max}}|, \quad (2)$$
where $i_{\min} \leq i^*_{\min}$ and $i^*_{\max} \leq i_{\max}$ denote the permuted indices of the least and most reliable bits, respectively. In addition, let $\tau$ be the minimum number of path splittings at an SPC or a Rate-1 node that allows FSCL decoding to preserve the error-correction performance of the conventional SCL decoding algorithm [9]. The definitions and decoding operations of each special node under FSCL decoding are given as follows.

1) Rate-0 node
All the leaf nodes of a Rate-0 node are frozen bits. Therefore, all the hard values associated with the parent node are set to 0, and the path metric of the $l$-th path is given as [9]
$$\mathrm{PM}_l = \mathrm{PM}_l + \sum_{i=i_{\min_{\nu_l}}}^{i_{\max_{\nu_l}}} |\alpha_{s,i}^l| \, \mathbb{1}_{\alpha_{s,i}^l < 0}.$$

2) REP node
All the leaf nodes of a REP node are frozen bits, except for the one associated with $\beta_{0,i_{\max_{\nu_l}}}$. The path metric of the $l$-th decoding path is calculated as [9]
$$\mathrm{PM}_l = \mathrm{PM}_l + \sum_{i=i_{\min_{\nu_l}}}^{i_{\max_{\nu_l}}} |\alpha_{s,i}^l| \, \mathbb{1}_{(1 - 2\beta_{s,i_{\max_{\nu_l}}}) \neq \operatorname{sgn}(\alpha_{s,i}^l)},$$
where $\beta_{s,i_{\max_{\nu_l}}}$ denotes the bit estimate of the information bit of the REP node, which is repeated over all the hard values of the node.
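The Rate-0 and REP metric updates can be sketched as below. This is our hedged reconstruction of the standard min-sum updates of [9]: each LLR whose sign disagrees with the known hard decision contributes its magnitude as a penalty. The function names are ours:

```python
import numpy as np

def pm_rate0(pm, alpha):
    """Rate-0 node: all bits are frozen to 0, so every negative LLR
    (which favours a 1) adds its magnitude as a penalty."""
    return pm + np.abs(alpha[alpha < 0]).sum()

def pm_rep(pm, alpha, bit):
    """REP node: all hard values equal the single information bit `bit`;
    penalize every LLR whose sign disagrees with that decision."""
    disagree = (1 - 2 * bit) != np.sign(alpha)
    return pm + np.abs(alpha[disagree]).sum()
```

For a REP node, the decoder would evaluate `pm_rep` for both `bit = 0` and `bit = 1` and keep both forked paths for the metric sorting.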

3) Rate-1 node
All the leaf nodes of a Rate-1 node are information bits. FSCL decoding performs $\tau$ path splittings, where $\tau = \min(L - 1, N_\nu)$ [9]. The path metric of the $l$-th decoding path for a Rate-1 node is calculated as [9]
$$\mathrm{PM}_l = \mathrm{PM}_l + |\alpha_{s,i}^l| \, \mathbb{1}_{(1 - 2\beta_{s,i}) \neq \operatorname{sgn}(\alpha_{s,i}^l)},$$
where $\beta_{s,i}$ denotes the bit estimate of the $i$-th bit of $\nu$.

4) SPC node
All the leaf nodes of an SPC node are information bits, except for the one associated with $\beta_{0,i_{\min_{\nu_l}}}$. The parity check sum of the $l$-th path is first obtained as [9]
$$p_l = \bigoplus_{i=i_{\min_{\nu_l}}}^{i_{\max_{\nu_l}}} \frac{1 - \operatorname{sgn}(\alpha_{s,i}^l)}{2}.$$
The path metric is then updated as [9]
$$\mathrm{PM}_l = \mathrm{PM}_l + p_l |\alpha_{s,i^*_{\min}}^l|.$$
The decoding continues with $\tau$ path splittings, where $\tau = \min(L - 1, N_\nu - 1)$ [9]. In each new path splitting at the $i$-th index, the path metric is updated as [9]
$$\mathrm{PM}_l = \begin{cases} \mathrm{PM}_l + |\alpha_{s,i}^l| + (1 - 2p_l)|\alpha_{s,i^*_{\min}}^l| & \text{if } (1 - 2\beta_{s,i}) \neq \operatorname{sgn}(\alpha_{s,i}^l), \\ \mathrm{PM}_l & \text{otherwise}, \end{cases} \quad (8)$$
then the parity check sum is updated as [10]
$$p_l = \begin{cases} 1 \oplus p_l & \text{if } (1 - 2\beta_{s,i}) \neq \operatorname{sgn}(\alpha_{s,i}^l), \\ p_l & \text{otherwise}, \end{cases} \quad (9)$$
where $i$ is selected by following the bit indices of the sorted absolute LLR values in (2) [9]. When all the bits are estimated, the hard decision of the least reliable bit is updated to maintain the parity check condition of the SPC node [9]:
$$\beta_{s,i^*_{\min}} = \beta_{s,i^*_{\min}} \oplus p_l. \quad (10)$$
The memory requirements of the SCL and FSCL decoding algorithms with list size $L$, in terms of the number of memory bits, are given in [5], [9], where $b_f$ is the number of bits used to quantize a floating-point number.
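A sketch of the SPC-node metric handling described above, assuming the standard formulation of [9]. The helper names are ours, and `alpha_sorted` is assumed to hold the node LLRs ordered by increasing magnitude as in (2), so index 0 is the least reliable bit:

```python
import numpy as np

def spc_init(pm, alpha_sorted):
    """Hard-decide every bit from its LLR sign, compute the parity checksum p,
    and penalize the path by p * |alpha_min| (index 0 is least reliable)."""
    hard = (alpha_sorted < 0).astype(int)
    p = int(hard.sum() % 2)  # parity of the sign-based decisions
    pm = pm + p * abs(alpha_sorted[0])
    return pm, p, hard

def spc_flip(pm, p, alpha_sorted, i):
    """Fork a path by flipping bit i against its LLR sign: the metric grows by
    |alpha_i| plus a correction on the least reliable bit, and the parity flips."""
    pm = pm + abs(alpha_sorted[i]) + (1 - 2 * p) * abs(alpha_sorted[0])
    return pm, 1 - p
```

The `(1 - 2 * p)` correction removes the initial `|alpha_min|` penalty when the parity was odd (the flip restores even parity) and adds it when the parity was even.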

D. SUCCESSIVE-CANCELLATION FLIP AND DYNAMIC SUCCESSIVE-CANCELLATION FLIP DECODING
The SC-Flip (SCF) decoding algorithm was proposed in [13] to improve the error-correction performance of SC decoding for short to moderate block lengths. Specifically, if the estimated message word $\hat{\mathbf{u}}$ does not satisfy the CRC test after the initial SC decoding attempt, an additional SC decoding attempt is made by flipping the estimation of the information bit in $\hat{\mathbf{u}}$ that is most likely to be the first error bit. In the rest of this paper, as we only need to locate the erroneous decision that occurred when an information bit is decoded under SC-based decoding, the bit indices refer to information bits and are indexed from 1 to $K + C$. In [13], the most erroneous position is estimated as $\imath = \arg\min_{1 \leq i \leq K+C} |\alpha_{0,i}|$. A problem associated with SCF decoding is its poor estimation accuracy of the actual error position [14]. To address this issue, the authors in [14] proposed the Dynamic SC-Flip (DSCF) decoding algorithm, which utilizes a conditional probability model to accurately estimate the error position, given as $\imath = \arg\min_{1 \leq i \leq K+C} Q_i$, where $Q_i$ is the error metric of the $i$-th information bit under SC decoding:
$$Q_i = |\alpha_{0,i}| + \frac{1}{\lambda} \sum_{j=1}^{i} \ln\left(1 + e^{-\lambda |\alpha_{0,j}|}\right). \quad (12)$$
The parameter $\lambda \in \mathbb{R}^+$ is a perturbation parameter that is optimized offline [14]. In [16]-[18], fast decoding schemes were proposed to reduce the decoding latency of SCF and DSCF decoding. However, it was observed in [14], [15] that the maximum number of decoding attempts required by the DSCF-based decoders to obtain an FER performance comparable to that of SCL decoding with list size 16 is significantly large. This problem prevents the DSCF-based decoders from being practical for applications with a stringent worst-case latency constraint.
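A sketch of the DSCF flip metric, assuming the formulation of [14]; the value of $\lambda$ below is an arbitrary placeholder rather than an optimized parameter:

```python
import numpy as np

def dscf_metric(alpha0_abs, i, lam=0.3):
    """DSCF flip metric for information bit i (0-indexed into the vector of
    |LLR| values of the information bits):
        Q_i = |alpha_i| + (1/lam) * sum_{j <= i} ln(1 + exp(-lam * |alpha_j|)).
    lam = 0.3 is a placeholder, not an optimized value."""
    penalty = np.log1p(np.exp(-lam * alpha0_abs[: i + 1])).sum() / lam
    return alpha0_abs[i] + penalty

def most_likely_flip(alpha0_abs, lam=0.3):
    """Index minimizing Q_i, i.e., the estimated first error position."""
    return int(np.argmin([dscf_metric(alpha0_abs, i, lam)
                          for i in range(len(alpha0_abs))]))
```

The cumulative penalty term biases the choice toward early, low-reliability decisions instead of simply picking the globally smallest $|\alpha_{0,i}|$ as SCF does.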

E. SUCCESSIVE-CANCELLATION LIST FLIP DECODING
SCLF decoding also relies on a CRC verification to indicate whether the initial SCL decoding attempt is successful. If the first SCL decoding attempt does not satisfy the CRC verification, the SCLF decoding algorithm tries to identify the first information bit index $\imath$ at which the correct path is discarded from the list of the $L$ most probable decoding paths [20]. Given that the $\imath$-th bit index is correctly identified, in the next decoding attempt and after the path splitting that occurs at the $\imath$-th bit index, the erroneous path selection is reversed: the $L$ decoding paths that have the highest (worst) path metrics are selected to continue the decoding [20]. This reversed path selection scheme recovers the correct decoding path, which was discarded at the $\imath$-th bit index of the initial SCL decoding, into the list of active decoding paths. SCLF decoding then performs conventional SCL decoding operations for all the bit indices following $\imath$.
Consider the $i$-th information bit under SCL decoding, where there are $L$ active decoding paths. After the path splitting of the current $L$ active paths, the path metrics of the resulting $2L$ paths are computed and sorted. Let $l \in [1, 2L]$ be the index of a discarded decoding path after the path metric sorting, i.e., the path metric corresponding to $l$ is among the $L$ largest path metric values. The probability that the path with index $l$ is the correct decoding path is [14]
$$\Pr(\hat{\mathbf{u}}_{1,l}^{i} = \mathbf{u}_1^i | \boldsymbol{\alpha}_n) = \prod_{j \in \mathbb{A}_l} \Pr(\hat{u}_j^l = u_j | \boldsymbol{\alpha}_n, \hat{\mathbf{u}}_{1,l}^{j-1}) \prod_{j \in \mathbb{A}_l^c} \left[1 - \Pr(\hat{u}_j^l = u_j | \boldsymbol{\alpha}_n, \hat{\mathbf{u}}_{1,l}^{j-1})\right],$$
where $\mathbb{A}_l$ is the set of information bit indices whose hard decisions follow the signs of the corresponding LLR values, while $\mathbb{A}_l^c$ is the set of information bit indices whose hard decisions do not follow the signs of the LLR values [20].
Note that $\Pr(\hat{u}_j^l = u_j | \boldsymbol{\alpha}_n, \hat{\mathbf{u}}_{1,l}^{j-1})$ is not available during the course of decoding as $\mathbf{u}$ is unknown; thus, it is approximated as [14], [20]
$$\Pr(\hat{u}_j^l = u_j | \boldsymbol{\alpha}_n, \hat{\mathbf{u}}_{1,l}^{j-1}) \approx \frac{1}{1 + e^{-\lambda |\alpha_{0,j}^l|}}, \quad (14)$$
where $\lambda \in \mathbb{R}^+$ is a perturbation parameter that is optimized offline to improve the approximation accuracy of (14). The probability that the correct decoding path is discarded at the information bit with index $i$ is then the probability that one of the $L$ paths discarded at that index is the correct path:
$$P_i = \sum_{l} \Pr(\hat{\mathbf{u}}_{1,l}^{i} = \mathbf{u}_1^i | \boldsymbol{\alpha}_n), \quad (15)$$
where the sum is taken over the $L$ paths discarded at the $i$-th information bit. Therefore, the bit index at which the erroneous decision most likely takes place is $\imath = \arg\max_{\log_2 L < i \leq K+C} P_i$ [20]. Directly computing (15) is not numerically stable [14], [20]; thus, the flipping metric $Q_i$ in (16) is derived from (15) using the max-log approximation [20]. With $L = 1$, (16) reverts to (12), since $\mathbb{A}_l^c = \{i\}$ indicates the decoding path forked from the initial SC decoding path at the $i$-th bit. Thus, with $L = 1$, SCLF decoding is equivalent to DSCF decoding when only the first error bit of the initial SC decoding is considered.
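As an illustration of the probability model above, the following sketch evaluates the path-correctness probability of a single discarded path using the approximation in (14). This is our example, with $\lambda$ as a placeholder; a practical implementation works with the numerically stable max-log metric instead of the raw product:

```python
import numpy as np

def path_correct_prob(alpha_abs, follows_sign, lam=0.3):
    """Probability that a discarded path is the correct one, using
    Pr(u_hat_j = u_j) ~= 1 / (1 + exp(-lam * |alpha_j|)) for decisions that
    follow the LLR sign, and the complement for decisions that do not."""
    p_follow = 1.0 / (1.0 + np.exp(-lam * alpha_abs))
    per_bit = np.where(follows_sign, p_follow, 1.0 - p_follow)
    return float(np.prod(per_bit))
```

As expected, a path whose decisions all follow their LLR signs scores higher than one containing a sign-opposing decision of the same reliability.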
The computation of $Q_i$ can be further simplified by using the hardware-friendly approximation in (17), introduced in [18], where $a_\lambda, b_\lambda \in \mathbb{R}^+$ are tunable parameters selected based on a predetermined value of $\lambda$. However, this approach requires the optimization of the parameters $\{\lambda, a_\lambda, b_\lambda\}$. On the other hand, in [15] the authors approximate the term $\Pr(\hat{u}_j^l = u_j | \boldsymbol{\alpha}_n, \hat{\mathbf{u}}_{1,l}^{j-1})$ in (14) as in (18), where, similar to $\lambda$, $\theta \in \mathbb{R}^+$ is an additive perturbation parameter used to improve the approximation accuracy. The flipping metric $Q_i$ under the approximation provided in (18) is then given in (19) [15], [20], where $\operatorname{ReLU}(a) = a$ if $a > 0$ and $\operatorname{ReLU}(a) = 0$ otherwise. The most probable information bit index at which the correct path is discarded is then estimated as $\imath = \arg\min_{\log_2 L < i \leq K+C} Q_i$ [20]. In this paper, we implement the hardware-friendly SCLF decoder using (19), as it only requires the optimization of $\theta$.
Unlike DSCF decoding, SCLF decoding can achieve the FER performance of SCL decoding with a large list size using a reasonable maximum number of decoding attempts [20]. A critical problem associated with SCLF decoding is that it fully traverses the polar binary tree as required by SCL decoding, which results in a high decoding latency. The SSCLF decoder was proposed in [23] to improve the decoding latency of SCLF decoding by introducing a reversed path selection scheme to FSCL decoding. However, the scheme proposed in [23] is only applied to the decoding steps that perform path splitting and path metric sorting; thus, it does not apply to the decoding steps that occur after the $\tau$ path splittings of the SPC and Rate-1 nodes. As a consequence, SSCLF decoding with a small list size ($L \in \{2, 4\}$) incurs a significant FER performance degradation when compared to SCLF decoding with the same list size.

III. FAST SUCCESSIVE-CANCELLATION LIST FLIP DECODING

A. BIT-FLIPPING SCHEME FOR FSCL DECODING
We first introduce the bit-flipping scheme tailored to FSCL decoding by illustrating the proposed scheme with various examples. We consider the case where an all-zero codeword of $\mathcal{P}(16, 8)$, whose binary-tree representation is depicted in Fig. 2(a), is transmitted through the channel. Similar to SCL-based decoding, under FSCL-based decoding we denote by $l$ the path index corresponding to the current $L$ active decoding paths, while $\tilde{l}$ is used to indicate the indices of the paths that are forked from $l$. Finally, $\bar{l}$ indicates the path indices of the decoding paths that are discarded due to their high path metric values. Note that $l, \tilde{l}, \bar{l} \in [1, 2L]$.

1) Bit-Flipping Scheme for SPC Nodes
Table 1 shows an example of FSCL decoding applied to the SPC node of $\mathcal{P}(16, 8)$ with $L = 4$. The decoding order is first determined by sorting the magnitudes of the LLR values associated with the SPC node in increasing order. In this example, the following decoding order is considered: $\{\beta_{2,8}^l, \beta_{2,5}^l, \beta_{2,7}^l, \beta_{2,6}^l\}$. Thus, $\beta_{2,8}^l$ is selected as the parity bit of the SPC node for all the active decoding paths. The path splittings at $\beta_{2,6}^l$ are considered in this example: the paths with indices $\tilde{l} \in \{5, 6, 7, 8\}$ are forked from the paths with indices $l \in \{1, 2, 3, 4\}$ at $\beta_{2,6}^l$, respectively, followed by the path metric sorting operations. The most likely decoding paths with indices $l \in \{1, 2, 3, 4\}$ are then selected to continue the decoding, while the paths with indices $\bar{l} \in \{5, 6, 7, 8\}$ are discarded, as illustrated in Table 1.

At this stage, the parity bit $\beta_{2,8}^l$ of the SPC node is not yet decoded. As an all-zero codeword is considered, the correct decoding path is $\bar{l} = 5$, which is discarded after $\beta_{2,6}^l$ is decoded, i.e., after the third path-splitting index. Given that this erroneous decision in the initial FSCL decoding is detected by a CRC verification, the erroneous path selection is reversed in the next decoding attempt by swapping the path indices of $l$ and $\bar{l}$ after the path splittings at $\beta_{2,6}^l$ for all the decoding paths. The decoding continues by setting the values of the parity bit $\beta_{2,8}^l$ with respect to (10) to maintain the parity constraint for all the corrected paths. Similar to the bit-flipping schemes introduced in [17], [23], in the proposed scheme the bit-flipping operation is not applicable to the parity bits of the SPC nodes. This is because the parity bits are determined after all the other bits are calculated, to ensure that the parity check is satisfied. Therefore, if all the other bits of the SPC node are correctly decoded, the parity bit of this decoding path is also correctly decoded. As a result, the proposed algorithm only considers a maximum of $N_\nu - 1$ possibilities to identify a bit flip that occurs in an SPC node. This is significantly smaller than the maximum search space of size $\binom{N_\nu}{2}$ required to flip a pair of bits while maintaining the parity check constraint, especially as $N_\nu$ increases [18].
Note that in this example, the minimum number of path splittings required by the SPC node to preserve the SCL decoding performance is $\tau = \min(L - 1, N_\nu - 1) = \min(3, 3) = 3$, where $\nu$ indicates the SPC node of size 4. Under the proposed bit-flipping scheme for SPC nodes, if a decision error occurs at a path-splitting index after the minimum of $\tau$ path splittings has been reached, in the next decoding attempt the hard values at the estimated error index are flipped for all the active paths. The path metrics $\mathrm{PM}_l$ of the surviving paths are then updated by following (8), using the LLR value corresponding to the flipped position and the current parity check sum $p_l$.

2) Bit-Flipping Scheme for REP Nodes
Since the soft and hard estimates of the information bit associated with a REP node can be directly obtained at the parent-node level under FSCL decoding, the path-splitting operation applied to the information bit of a REP node under FSCL decoding is similar to that of SCL decoding applied to an information bit at the leaf-node level. Therefore, in this paper, the reversed path selection scheme used in SCLF decoding is directly applied to the information bit associated with a REP node, or to an information bit at the leaf-node level, under FSCL decoding [17], [23].
3) Bit-Flipping Scheme for Rate-1 Nodes
Table 2 shows an example of FSCL decoding applied to the Rate-1 node of $\mathcal{P}(16, 8)$ at the fifth path-splitting index with $L = 2$. In Table 2, the hard estimates of the discarded paths with indices $\bar{l} \in \{2, 4\}$ are indicated, while the hard estimates of the surviving paths with indices $l \in \{1, 3\}$ are omitted. It can be observed that the decoding path with index $\bar{l} = 2$ is the correct path, as all its estimated bits are 0; it is discarded after bit $\beta_{2,13}$ is decoded. Therefore, in the next decoding attempt, the decoding paths with indices $\bar{l} \in \{2, 4\}$ will be selected to continue the decoding instead of the paths with indices $l \in \{1, 3\}$ [17], [23]. Similar to the case of SPC nodes, after $\tau$ path splittings, if the hard decision of a bit of the Rate-1 node results in the elimination of the correct path, this erroneous decision is reversed in the next FSCL decoding attempt by flipping the hard estimates of all the active paths at that erroneous index. The path metrics of the active paths are then increased by the corresponding absolute LLR values of the flipped indices. In this example, the minimum number of path splittings is $\tau = \min(L - 1, N_\nu) = \min(1, 4) = 1$, which is obtained at the fifth path-splitting index. Therefore, under FSCL decoding, the hard values of all the active decoding paths following the fifth path-splitting index are set to follow the signs of their LLR values. Similar to SCLF decoding, the proposed scheme only aims at correcting the first erroneous decision of the initial FSCL decoding attempt.

Fig. 3 shows the ideal FER of the proposed bit-flipping scheme, where the first erroneous path selection is always accurately corrected. In Fig. 3, we use the 5G polar codes $\mathcal{P}(512, 256)$ and $\mathcal{P}(512, 384)$ concatenated with a 24-bit CRC$^1$. Note that the positions of the first erroneous decoding decision can be obtained by comparing the discarded paths with the correct path after each path splitting.
The FER performance of the ideal SCLF [20] and SSCLF [23] decoders and the FSCL decoder with list size 32 [9] are also plotted for comparison. As seen from Fig. 3, I-Fast-SCLF-L obtains a slight FER performance gain over I-SCLF-L. In addition, as the reversed path-selection scheme of [23] is not applied to the decoding steps that occur after the minimum number of $\tau$ path splittings is reached for the Rate-1 and SPC nodes, the simplified bit-flipping scheme of [23] introduces an FER performance degradation when compared with the ideal SCLF and Fast-SCLF decoders, especially when the list size is small ($L \in \{2, 4\}$). For $L = 4$ and at the target FER of $10^{-4}$, error-correction performance degradations of 0.2 dB and 0.3 dB are recorded for the ideal SSCLF decoder when compared to the ideal Fast-SCLF decoder for $\mathcal{P}(512, 256)$ and $\mathcal{P}(512, 384)$, respectively. Also note that the FER performance of the ideal SSCLF decoder with $L \in \{2, 4\}$ degrades quickly as the SNR increases.
We now explain the slight improvement in the error-correction performance of I-Fast-SCLF-L over that of I-SCLF-L observed in Fig. 3. The error-correction performances of I-Fast-SCLF-L and I-SCLF-L are identical for REP nodes. Therefore, we empirically show that I-Fast-SCLF-L outperforms I-SCLF-L when applied to the same channel LLR vectors for Rate-1 and SPC nodes, where the LLR vectors contain an exact number of $c_e$ ($c_e > 0$) channel errors. A similar study was conducted in [24] for the case of fast SCF decoding.

$^1$We use the 24-bit CRC specified as 24C in the 5G standard.
We first consider the error event where only a single error is present in the LLR vector of the Rate-1 and SPC nodes ($c_e = 1$), which causes an unsuccessful CRC verification in the first FSCL and SCL decoding attempts. After the correct decoding path is recovered in the second FSCL decoding attempt of I-Fast-SCLF-L, FSCL decoding operations, e.g., path forking and path metric sorting, are applied to the $L$ recovered paths. Then, all the hard decisions of the correct path are set to follow the signs of the corresponding LLR values to maintain its path metric. Therefore, its path metric is equal to the absolute LLR value of the flipped bit. Recall that the LLR values of the Rate-1 and SPC nodes are sorted in accordance with (2). Thus, at the subsequent decoding steps after the flipped position, any new candidate path with at least one hard decision not following its corresponding LLR value will have a higher path metric than that of the correct decoding path. Therefore, with $c_e = 1$, the correct decoding path is always found in the list of the best paths after the second FSCL decoding attempt of I-Fast-SCLF-L.
Note that a single error at the parent node level can translate into multiple errors at the leaf node level. This phenomenon is illustrated in Fig. 4a for an all-zero polar code of length $N = 64$, where the number of error bits at the leaf node level is provided with respect to the position of the single error bit at the parent node level. Consequently, there are cases in which I-SCLF-L has to perform the reversed path selection scheme multiple times to maintain the correct codeword in the list of the best paths, while I-Fast-SCLF-L only requires a single reversed path selection. Thus, with $c_e = 1$, the error-correction performance of I-Fast-SCLF-L is improved when compared to I-SCLF-L. As $c_e$ increases ($c_e \geq 2$), due to the complicated error patterns caused by multiple error bits, we expect that I-Fast-SCLF-L and I-SCLF-L are almost equally likely to discard the correct path in the second decoding attempt, causing a wrong estimation of the transmitted codeword. Also note that when the channel reliability improves in the high SNR regime and given that $c_e > 0$, the performance gain of I-Fast-SCLF-L over I-SCLF-L is mainly obtained from the case of $c_e = 1$, as the LLR vectors are more likely to contain a single error than multiple ones ($c_e > 1$). On the other hand, in the low SNR regime, it is more likely to have multiple errors at the parent node level; thus, the error-correction performance gain of I-Fast-SCLF-L over I-SCLF-L is marginal. This phenomenon can also be observed in Fig. 3.
In Fig. 4b, we plot the FER curves of I-Fast-SCLF-32 and I-SCLF-32 for the polar codes of lengths 64 and 128 concatenated with the 16-bit CRC used in the 5G standard, where the values of K are selected to form a Rate-1 node and an SPC node, respectively. The simulations are carried out at E_b/N_0 = 3 dB, and the FER values of the decoders in Fig. 4b are only obtained for the channel LLR vectors that contain exactly c_e ∈ {1, 2, 3, 4, 5, 6} errors. It can be confirmed from Fig. 4b that with c_e = 1, I-Fast-SCLF-32 has a significant FER performance improvement when compared to I-SCLF-32.
In addition, the error-correction performance gain of I-Fast-SCLF-32 over I-SCLF-32 quickly diminishes as c_e increases, with the error probabilities of both decoders approaching 1 when c_e ≥ 4.
Note that the error-correction performance degradation of I-Fast-SCLF-32 is also caused by the imperfect error detection of the CRC, where a wrong estimate of the correct codeword that satisfies the CRC is selected as the decoding output. As shown in Fig. 4b, if the CRC verification is replaced by a genie selection scheme, the FER values of I-SCLF-32 and I-Fast-SCLF-32 remain relatively unchanged, except for the case of I-Fast-SCLF-32 with c_e = 1, where an FER of 0 is obtained (this case is excluded from Fig. 4b, as an FER of 0 cannot be plotted on the logarithmic scale). This confirms that after the second FSCL decoding attempt of I-Fast-SCLF-32, the correct codeword is always present in the list of the best decoding paths for c_e = 1.
In the next section, a path selection error model is derived to accurately estimate the index of the path splitting that causes the elimination of the correct path in the initial FSCL decoding. The error-correction performance of the I-Fast-SCLF-L decoder provided in Fig. 3 therefore serves as the empirical lower bound of the proposed decoding algorithm.

B. PATH SELECTION ERROR MODEL FOR FSCL DECODING
We use the methods introduced in [14], [20] to estimate the erroneous path-splitting index, which predict the error position using the LLR values associated with each discarded decoding path. Thus, the proposed path selection error model relies on the construction of the LLR vectors obtained at each path splitting under FSCL decoding.
Consider that the first FSCL decoding attempt does not pass the CRC verification, and let ν be a node located at the s-th stage of the binary tree that is visited by FSCL decoding. Let k ∈ [1, K + C] be the path-splitting index that occurs during the decoding of ν. Let γ_l = {γ_l^1, ..., γ_l^k} be the LLR vector of a discarded decoding path l, which contains the k LLR values corresponding to the hard estimates of the discarded path l. As each information bit is decoded, γ_l is constructed progressively up to the k-th path splitting. Formally, γ_l is obtained using the following procedure:
• If ν is a leaf node that contains an information bit: update γ_l by concatenating the LLR value at which the path splitting occurs.
• If ν is a REP node: update γ_l by concatenating the LLR value associated with the information bit of the REP node.
• If ν is a Rate-1 node: update γ_l by concatenating the LLR value after each path splitting at the j-th bit, where j ∈ {i*_min_ν, ..., i*_max_ν} is selected by following the indices of the sorted absolute LLR values in (2).
• If ν is an SPC node: update γ_l by concatenating the LLR value after each path splitting at the j-th bit, where j is selected by following the sorted indices in (2), excluding the index of the parity bit whose LLR value is ignored when constructing γ_l.
concat(γ_l, a) is a function that concatenates a ∈ R to the end of γ_l, and initially γ_l = ∅. In addition, γ_l is not altered if ν does not satisfy any of the above conditions. For example, the LLR vector γ_l obtained after the fifth path-splitting index in Table 2 for l = 2 is γ_2 = {γ_2^1, ..., γ_2^5} = {α_{2,5}, α_{2,7}, α_{2,6}, α_{0,12}, α_{2,13}}. We now define the hard estimates of γ_l as η̂_l = {η̂_l^1, ..., η̂_l^k}, and the correct hard values associated with γ_l as η_l = {η_l^1, ..., η_l^k}. For instance, η̂_2 = {β_{2,5}, β_{2,7}, β_{2,6}, β_{0,12}, β_{2,13}} is the discarded decoding path obtained after the fifth path-splitting index in Table 2 with l = 2, and η_2 = {0, 0, 0, 0, 0}. It is worth noting that by not considering the bit-flipping operations for the parity bits of the SPC nodes, the search space of the first error path selection for FSCL decoding contains K + C possible positions, which is equal to that of the SCLF decoder.
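To make this bookkeeping concrete, the following Python sketch (with made-up LLR values; the function and variable names are ours, mirroring concat(·,·), γ_l, and η̂_l) shows how the per-path vectors grow by one entry at each path splitting:

```python
# Hypothetical sketch mirroring the notation above: the per-path LLR
# vector gamma and the hard-estimate vector eta_hat grow by one entry
# at each path splitting via concat(., .).
def concat(vec, a):
    """Append the scalar a to the end of vec (returns a new list)."""
    return vec + [a]

gamma = []    # initially empty, as stated in the text
eta_hat = []  # hard estimates of the discarded path

# Three path splittings with made-up (LLR, hard-decision) pairs.
for llr, hard in [(1.7, 0), (-0.4, 1), (2.3, 0)]:
    gamma = concat(gamma, llr)
    eta_hat = concat(eta_hat, hard)
```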
Unlike SCLF decoding, at the same path-splitting index the hard estimates and LLR values of η̂_l and γ_l of different path indices l can correspond to different bit indices of the polar binary tree. However, similar to SCLF decoding, each instance of the hard estimates and LLR values of η̂_l and γ_l is obtained sequentially by following the course of FSCL decoding. Therefore, in this paper we utilize the conditional error probability model considered in [14], [15], [20] to estimate the erroneous decision that occurred at the k-th path-splitting index of FSCL decoding. Specifically, the probability that the discarded path l at the k-th path-splitting index under FSCL decoding is the correct path is given in (24), where A_l is the set of bit indices j at which the hard estimates η̂_l^j follow the signs of γ_l^j, and A_l^c is the set of bit indices j at which the hard estimates η̂_l^j do not follow the signs of γ_l^j. Similar to u, η_l is not available during the decoding process, thus we use the approximation introduced in [15] to calculate Pr(η̂_l^j = η_l^j | α_n, η̂) in (25). The path selection error metric obtained at the k-th path splitting is then derived from (24) and (25) as given in (26). Consequently, the most probable erroneous position ı is obtained as

ı = arg min_{k ∈ [1, K+C]} Q_k. (27)

The error metric described in (26) can be progressively calculated during the course of decoding, allowing for an efficient implementation of the proposed decoder. In particular, for each active decoding path l we denote by q_l^{k−1} the path-error metric at the (k−1)-th path-splitting index of l, which accumulates terms of the form ReLU(θ − |γ_l^j|) as given in (28) if k > 1, with q_l^0 = 0 ∀l.
Thus, the path-error metric of the path l at the k-th path-splitting index can be calculated from q_l^{k−1} as given in (29). The path-error metric of the forked path with index l̃ originating from l, whose hard value at the k-th path-splitting index does not follow the sign of its LLR value, is calculated as given in (30). Equations (29) and (30) are used to progressively compute the path-error metrics of all the 2L paths associated with the current L active paths and the L forked paths. Next, the path metric sorting is carried out and a list of discarded paths with indices l is determined. The flipping metric in (26) is then obtained as

Q_k = q_{l_min}^k, (31)

where l_min = arg min_l q_l^k over the discarded paths. Therefore, under a practical implementation one only needs to maintain the path-error metrics q corresponding to the 2L decoding paths to progressively calculate the path selection error metric Q_k.
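The progressive computation of the path-error metrics and of Q_k can be sketched as follows in Python. The ReLU accumulation for decisions that follow the LLR signs matches the form in (28); the additional |γ| penalty for the forked path is our assumption about the form of (30), not a formula quoted from the text:

```python
def relu(x):
    return max(0.0, x)

def update_q_active(q_prev, llr, theta):
    # Progressive update for a path whose new hard decision follows the
    # sign of its LLR (the ReLU form of (28)-(29)).
    return q_prev + relu(theta - abs(llr))

def update_q_forked(q_prev, llr, theta):
    # Sketch of (30): the forked path contradicts the sign of this LLR.
    # The extra |llr| penalty is our assumption, not taken from the text.
    return q_prev + abs(llr) + relu(theta - abs(llr))

def path_selection_metric(q_discarded):
    # Eq. (31): Q_k is the smallest path-error metric among the
    # discarded paths at the k-th path splitting.
    return min(q_discarded)

# Example: a reliable LLR (|llr| > theta) adds no penalty, an
# unreliable one (|llr| < theta) adds theta - |llr|.
theta = 1.0
q = update_q_active(0.0, 1.8, theta)
q = update_q_active(q, 0.3, theta)
```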
In this paper, we tackle the main disadvantage of Monte-Carlo simulation, which optimizes the single parameter θ offline [14], [18], [20]. In practice, e.g., in the 5G standard, there is a vast number of polar code configurations with different code lengths and rates, and the parameter also needs to be optimized at various SNR values. Thus, optimizing the parameter offline for each specific configuration is a time-consuming task, as adequate training data samples need to be collected for each code configuration. Therefore, we propose an efficient online supervised learning approach that directly optimizes the parameter at the operating SNR of the decoder, while obviating the need for pilot signals.
In particular, let D be a data batch that contains B = |D| instances of the path selection error metrics for which the corresponding message word estimated by the initial FSCL decoding algorithm does not satisfy the CRC test. Under supervised learning, we need to obtain the erroneous path-splitting index ı_e to train θ. Note that in a practical scenario, the proposed decoder performs a maximum of m additional FSCL decoding attempts, where a different estimated error index is associated with each additional decoding attempt. By assuming that a correct codeword is obtained if the CRC verification is successful, the error index ı_e is obtained when a secondary FSCL decoding attempt passes the CRC verification. Let o = {o^1, ..., o^{K+C}} be a one-hot encoded vector of size K + C that indicates the error bit index ı_e, i.e., o^k = 1 if k = ı_e and o^k = 0 otherwise (32). A data sample d ∈ D contains a pair of the input Q and its corresponding encoded output o, i.e., d ≜ {Q, o}.
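A minimal sketch of the label construction in (32) (0-indexed for convenience; the function name is ours):

```python
def one_hot(err_idx, length):
    # o[k] = 1 at the erroneous path-splitting index and 0 elsewhere,
    # as in (32); here k is 0-indexed rather than 1-indexed.
    o = [0] * length
    o[err_idx] = 1
    return o
```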
Given a data sample d, the path selection error metric introduced in Section III-B provides an estimate ı of ı_e by selecting the index corresponding to the smallest element of Q (see (27)). To enable training, the error metrics are converted to the probability domain using the following softmin conversion:

ô^k = exp(−Q_k) / Σ_{j=1}^{K+C} exp(−Q_j), (33)

where Q_k is manually set to ∞ for k ∈ [1, log₂ L], as the correct decoding path is always present in the first log₂ L path splittings. It can be seen from (26) and (33) that the bit index that has the smallest error metric is also the bit index that has the highest probability of being in error. In this paper, we use the binary cross-entropy (BCE) loss function to quantify the dissimilarity between the target output o and the estimated output ô:

Loss = −Σ_{k=1}^{K+C} [o^k ln(ô^k) + (1 − o^k) ln(1 − ô^k)]. (34)

The parameter θ can then be trained to minimize the loss function by using the stochastic gradient descent (SGD) technique or one of its variants. An update step is given as

θ ← θ − (E/B) Σ_{d∈D} ∂Loss/∂θ, (35)

where E ∈ R+ is the learning rate and (1/B) Σ_{d∈D} ∂Loss/∂θ is an estimate of the true gradient that would be obtained from a data set containing an infinite number of data samples. By using the chain rule and simple algebraic manipulations, given an instance Q of a data sample d, ∂Loss/∂θ can be calculated as given in (36). It can be observed that the computation of ∂Loss/∂θ requires the computation of ∂Q_k/∂θ. Similar to Q_k, ∂Q_k/∂θ can also be progressively calculated during the course of decoding; in particular, its progressive updates follow from (29) and (30), as given in (37) and (38), respectively, and (39). Note that ∂Q_k/∂θ contains integer values and ∂Q_k/∂θ ∈ [−(K + C), K + C]. To reduce the computational complexity of the training process, we use the method in [25] to implement the exp(·) function required in (33) and (36). Specifically, a Taylor series is utilized to approximate the exp(·) function as [25]

exp(x) ≈ max{0, Σ_{t=0}^{T} x^t / t!},

where x ∈ R, T ≥ 0 is an integer, and the approximation is exact if T = ∞ [26].
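The training step can be sketched end to end as follows. The softmin and BCE follow (33) and (34); for clarity the sketch uses a toy per-splitting metric and a finite-difference gradient in place of the closed-form derivative (36), so the function names and values are illustrative assumptions rather than the paper's implementation:

```python
import math

def softmin(Q):
    # Eq. (33): a smaller error metric maps to a higher error probability.
    w = [math.exp(-q) for q in Q]
    s = sum(w)
    return [x / s for x in w]

def bce_loss(o, o_hat, eps=1e-12):
    # Eq. (34): binary cross entropy between the one-hot target o and
    # the softmin output o_hat (eps guards against log(0)).
    return -sum(ok * math.log(oh + eps) + (1 - ok) * math.log(1 - oh + eps)
                for ok, oh in zip(o, o_hat))

def loss_for_theta(theta, llrs, err_idx):
    # Toy stand-in: one ReLU term per path splitting. The real Q_k is
    # the minimum discarded-path metric of Section III-B.
    Q = [max(0.0, theta - abs(g)) for g in llrs]
    o = [1 if k == err_idx else 0 for k in range(len(Q))]
    return bce_loss(o, softmin(Q))

# One SGD step (35) with a finite-difference gradient standing in for
# the closed-form derivative (36).
theta, lr, h = 0.5, 2 ** -4, 1e-5
llrs, err_idx = [2.1, 0.2, 1.7], 1
grad = (loss_for_theta(theta + h, llrs, err_idx)
        - loss_for_theta(theta - h, llrs, err_idx)) / (2 * h)
theta -= lr * grad
```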
In Algorithm 1, we outline the proposed Fast-SCLF decoding algorithm integrated with the online training framework. The inputs of Algorithm 1 are the channel vector y, the list size L, the maximum number of additional FSCL decoding attempts m, and the size B of the data batch D. The parameter θ is first randomly initialized in (0, 1). Given a channel output vector y, the initial FSCL decoding is carried out in the InitialFSCL(·) function described in Algorithm 2, which performs the conventional FSCL decoding operations to obtain the estimated message word û_init. In addition, at each path splitting with index k of the initial FSCL decoding attempt, the path-error metrics {q_l^k, q_l̃^k} and their derivatives with respect to θ are progressively computed. In the next step, if û_init satisfies the CRC test, the Fast-SCLF decoder outputs û_init and terminates. Otherwise, the path selection error metrics Q are sorted in increasing order such that Q_{ı*_1} ≤ ... ≤ Q_{ı*_{K+C}}, and the path-splitting indices corresponding to the m smallest elements of Q are selected for the secondary FSCL decoding attempts, i.e., {ı*_1, ..., ı*_m}. The Fast-SCLF decoder then performs a maximum of m additional FSCL decoding attempts (line 8, Algorithm 1), with each attempt performing the reversed path selection scheme at a different path-flipping index ı*_i. If one of the secondary FSCL decoding attempts results in a successful CRC verification, the optimization process of θ implemented in the OptimizeTheta(·) function is queried, which performs the proposed optimization process based on supervised learning. The details of the OptimizeTheta(·) function are provided in Algorithm 3. To reduce the memory consumption required to store the data batch D for each parameter update, we use a variable ∆ in Algorithm 3 to store the accumulated gradients Σ_{d∈D} ∂Loss/∂θ as shown in (35). In addition, each data sample d is completely different from the others due to the presence of channel noise.
Therefore, the proposed training framework can prevent overfitting without the need for a separate validation set, which also reduces the memory consumption of the parameter optimization. Finally, if the resulting estimated message word û_flip obtained from one of the additional FSCL decoding attempts satisfies the CRC test, û_flip is returned as the final decoding output. On the other hand, if none of the additional FSCL decoding attempts provides a message word that passes the CRC verification, the estimated message word û_init of the initial FSCL decoding is returned as the final output of the decoding process.
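The control flow of Algorithm 1 can be sketched as follows; every helper passed in is a placeholder standing in for the paper's InitialFSCL(·), the reversed-path-selection re-decoding, the CRC check, and the OptimizeTheta(·) query:

```python
# Control-flow sketch of Algorithm 1 (helpers are placeholders, not
# the paper's actual routines).
def fast_sclf_decode(y, L, m, initial_fscl, redecode_with_flip, crc_ok,
                     on_success=None):
    u_init, Q = initial_fscl(y, L)  # also yields path-error metrics Q
    if crc_ok(u_init):
        return u_init
    # Path-splitting indices of the m smallest error metrics, i.e. the
    # most likely positions at which the correct path was discarded.
    candidates = sorted(range(len(Q)), key=lambda k: Q[k])[:m]
    for idx in candidates:
        u_flip = redecode_with_flip(y, L, idx)
        if crc_ok(u_flip):
            if on_success is not None:
                on_success(Q, idx)  # e.g. queue the parameter update
            return u_flip
    return u_init  # no attempt passed the CRC: fall back
```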

C. QUANTITATIVE COMPLEXITY ANALYSIS
To quantify the computational complexity of the decoders considered in this paper, we compute a weighted complexity of the performed floating-point additions/subtractions, comparisons, multiplications, and divisions. The complexity of a floating-point addition/subtraction or a floating-point comparison is considered to be one unit of complexity, a multiplication requires 3 units of complexity, and a division requires 24 units of complexity [27]. In this paper, we use the merge sort algorithm to sort a vector of N elements, which requires a worst case of N⌈log₂ N⌉ − 2^⌈log₂ N⌉ + 1 floating-point comparisons if N is not a power of 2; otherwise, the number of comparisons needed is N log₂ N [28, Chapter 2]. We compute the decoding latency of the SCL-based decoders by using the method considered in [9], [10]. In particular, we count the number of time steps for various decoding operations under the following assumptions. First, the hard decisions obtained from the LLR values and binary operations are computed instantaneously [5], [9], [10]. Second, we consider that the time steps required by a merge sort algorithm to sort a vector of size N are log₂ N [28, Chapter 2]. In addition, we also measure the average runtime in seconds required to decode a frame for all the decoders considered in this paper. The runtime is measured based on a single-core C++ implementation of the considered decoders on the same Linux system, with an AMD Ryzen 5 CPU and 16 GBytes of DRAM.
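Under the stated cost model, the worst-case sorting cost can be sketched as follows (the function and dictionary names are ours):

```python
import math

# Weighted floating-point operation costs (units of complexity) [27].
COST = {"add_sub": 1, "compare": 1, "multiply": 3, "divide": 24}

def merge_sort_worst_case_comparisons(n):
    # Comparison counts as used in the text: n*log2(n) when n is a
    # power of two, otherwise n*ceil(log2 n) - 2**ceil(log2 n) + 1.
    lg = math.ceil(math.log2(n))
    if n == 2 ** lg:  # n is a power of two
        return n * lg
    return n * lg - 2 ** lg + 1

def sort_cost_units(n):
    # One unit per comparison, so the weighted sorting cost equals
    # the comparison count.
    return merge_sort_worst_case_comparisons(n) * COST["compare"]
```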
Note that the OptimizeTheta(·) function can be executed in parallel with the decoding process presented in Algorithm 1, and the decoding latency in time steps of the OptimizeTheta(·) function is significantly smaller than the time steps required by an FSCL decoding attempt. Therefore, we do not include the number of time steps needed by the OptimizeTheta(·) function in the time steps of the proposed algorithm. However, to enable a fair comparison with other decoders that do not require parameter optimization during the course of decoding, we include the runtime of the OptimizeTheta(·) function when computing the runtime of the proposed decoder. Furthermore, the computational complexity and memory requirement of the OptimizeTheta(·) function are also considered when computing those of the proposed decoder. The memory consumption of the proposed decoder with list size L can be calculated accordingly, where b_i is the number of memory bits used to quantize the integer values of ∂Q/∂θ and ∂q/∂θ. We consider that ∂Loss/∂θ is progressively calculated; thus, 4b_f memory bits are used to store the temporary values of Σ_{j=1}^{K+C} exp(−Q_j), exp(−Q_k), o^k, and Σ_{j=1}^{K+C} ∂φ_j/∂θ, and b_f memory bits are used to store ∂Loss/∂θ, whose value is progressively summed over the K + C indices.

A. OPTIMIZED PARAMETER AND ERROR-CORRECTION PERFORMANCE
We measure the accuracy of the proposed training framework by calculating the probability that the most probable error index ı derived from (27) is the actual error index, denoted as ı*_e, given that the initial FSCL decoding attempt does not satisfy the CRC test. Note that the error index ı_e used as the training label can be different from ı*_e. This is because satisfying the CRC test after performing the reversed path selection scheme at the ı_e-th path-splitting index does not guarantee that the estimated codeword is the transmitted codeword. The training accuracy is therefore quantified as given in (40). In this paper, we use the conventional SGD algorithm to optimize θ with E = 2^{−4} and B = 32; thus, E/B is fixed to 2^{−9}, and a multiplication by E/B can be implemented as a shift operation. We set b_f = 32 for both the training and decoding processes, as the single-precision floating-point format is used to quantize a floating-point number. In addition, an integer number a is quantized using the sign-magnitude representation, which requires log₂(|a|) + 1 memory bits. Fig. 5 plots the training accuracy and the value of θ, given in (40) and (41), respectively. Note that the spikes in the early part of the training accuracy are caused by the small number of training samples, which makes the calculation of the training accuracy unreliable in the initial phases of the parameter optimization. As also observed from Fig. 5, the value of θ becomes relatively stable as the number of parameter updates increases. Thus, in practice the OptimizeTheta(·) function can be skipped after a predefined number of parameter updates to further reduce the computational complexity and memory accesses of the proposed framework. In this paper, we stop querying the OptimizeTheta(·) function after 50 parameter updates.
From Fig. 6, it can be observed that under all considered polar codes and list sizes, the SCLF decoder has a relatively similar error-correction performance compared to that of the proposed Fast-SCLF decoder. In some configurations of the polar codes, the Fast-SCLF decoder obtains a slight FER gain over the SCLF decoder with the same list size. This behavior is similar to that of the ideal Fast-SCLF decoder presented in Section III-A when compared with the ideal SCLF decoder. It can also be seen from Fig. 6 that with L ∈ {2, 4}, the simplified path-selection scheme proposed in [23] results in a significant error-correction performance degradation in comparison with the Fast-SCLF and SCLF decoders at the target FER of 10^{−4}, which also degrades quickly as the SNR increases.

B. COMPUTATIONAL COMPLEXITY, DECODING LATENCY, AND MEMORY REQUIREMENT
In Table 3, we summarize the average computational complexity (C) and the average decoding latency in time steps (L) of all the SCLF-based decoders considered in Fig. 6. The E_b/N_0 values in Table 3 are selected from Fig. 6 where the simulated FER values of the proposed decoder are closest to the target FER of 10^{−4}.
The effectiveness of the proposed decoder is confirmed in Table 3, as the decoding latency of the SCLF decoding algorithm is significantly higher than that of the Fast-SCLF decoder with the same list size. However, as the list size increases, the Fast-SCLF decoder imposes a more significant computational complexity overhead when compared to that of the SCLF decoder. This is due to the complexity devoted to sorting the LLR values associated with the special SPC and Rate-1 nodes, which significantly increases with the list size under FSCL decoding. From Table 3, the proposed decoder significantly reduces the average decoding latency of the SCLF decoder with the same list size, while incurring a maximum computational overhead of 27.3%. As also seen from Table 3 and Fig. 6, when compared with the SSCLF decoder with L ∈ {2, 4}, the proposed decoder with the same list size only incurs negligible overheads in computational complexity while achieving significant error-correction performance improvements and maintaining a relatively similar decoding latency in time steps. On the other hand, with L ∈ {8, 16, 32}, a maximum complexity overhead of 13.5% is recorded for the proposed decoder when compared with SSCLF decoding with the same list size, while obtaining relatively similar error-correction performance and decoding latency.
Note that the path selection error metric of SCLF decoding can be progressively calculated using an approach similar to that described in (29) and (30). Therefore, the memory consumption of the SCLF decoder with list size L can be calculated accordingly. In addition, the memory consumption of the SSCLF decoder only requires an additional (K + C)b_f memory bits to store the path-selection error metric when compared with that of the FSCL decoder with the same list size [23]. In Table 4, we summarize the memory consumption in KBits of all the SCL-based decoders considered in this paper.
We illustrate the average complexity, average decoding latency in time steps, and average runtime of the Fast-SCLF-4-50, SCLF-4-50, and SSCLF-4-50 decoders in Fig. 7. As seen from Fig. 7, the proposed decoder requires a relatively similar decoding complexity when compared with the SCLF decoder, while the SSCLF decoder has the lowest average computational complexity among all the SCLF-based decoders. In addition, the SSCLF and Fast-SCLF decoding algorithms require significantly lower average decoding latency, both in terms of time steps and runtime, when compared with the SCLF decoder.
In Table 5, we summarize the average computational complexity, memory requirement, and average decoding latency in terms of time steps and runtime of the SCLF-based decoders with L = 4, m = 50, and those of the FSCL-32 decoder. The error-correction performance degradation of the SCLF-based decoders when compared to FSCL-32 is also provided in Table 5. The E_b/N_0 values are selected to allow an FER performance close to the target FER of 10^{−4} for all the considered decoders. In particular, for P(512, 256), the average complexity and average latency in time steps of Fast-SCLF-4-50 account for approximately 10.7% and 56.3% of the complexity and time steps of FSCL-32, respectively. For P(512, 384), Fast-SCLF-4-50 reduces 84.6% of the average complexity and 27.2% of the average time steps in comparison with FSCL-32. In addition, the proposed decoder with list size 4 requires around 17% of the memory of FSCL-32, while having an FER degradation of less than 0.07 dB. When compared with the SSCLF decoder, the proposed decoder obtains FER performance gains of 0.2 dB and 0.3 dB at the cost of 8.3% and 12.0% computational complexity overheads for P(512, 256) and P(512, 384), respectively, while the average decoding latency and memory consumption are relatively preserved at the target FER of 10^{−4}. Note that due to its high complexity, the average runtime of FSCL-32 is significantly higher than those of all the SCLF-based decoders with list size 4.
In Fig. 8, we study the effect of the θ parameter on the error-correction performance of the proposed decoder when online training is considered. Specifically, we illustrate the FER values obtained over the first 200000 frames of Fast-SCLF-4-50 with and without online learning. In addition, the FER values of the ideal Fast-SCLF-4 decoder are also plotted for reference. It can be observed that the proposed online learning scheme effectively optimizes the θ parameter, allowing the FER of the proposed decoder to quickly approach its ideal FER performance. On the other hand, when online training is not considered, using the proposed decoder with the initial value of θ results in poor error-correction performance.

V. CONCLUSION
In this paper, we proposed a bit-flipping scheme tailored to the state-of-the-art fast successive-cancellation list (FSCL) decoding, forming the fast successive-cancellation list flip (Fast-SCLF) decoder. We then derived a parameterized path selection error metric that estimates the erroneous path-splitting index at which the correct decoding path is eliminated from the initial FSCL decoding. The trainable parameter of the proposed error model is optimized using online supervised learning, which directly trains the parameter at the operating signal-to-noise ratio of the decoder without the need for pilot signals. We numerically evaluated the proposed decoding algorithm and compared its error-correction performance, average computational complexity, average decoding latency, and memory requirement with those of the state-of-the-art FSCL decoder, the successive-cancellation list flip (SCLF) decoder, and the simplified SCLF (SSCLF) decoder. The simulation results confirm the effectiveness of the proposed decoder when compared with the FSCL and SCLF decoders for different polar codes and various list sizes. As also observed from the simulation results, the error-correction performance of the Fast-SCLF decoder significantly outperforms that of the SSCLF decoder with small list sizes (2 and 4), at the cost of negligible computational complexity overhead, while maintaining relatively similar memory consumption and decoding latency compared to SSCLF decoding. Future research includes designing and implementing a hardware architecture of the proposed decoder, where the bit-flipping scheme is extended to other special nodes of polar codes.