Asymptotically Tight MLD Bounds and Minimum-Variance Importance Sampling Estimator for Linear Block Codes Over BSCs

In this paper, we re-examine the classical problem of efficiently evaluating the block and bit error rate performance of linear block codes over binary symmetric channels (BSCs). In communication systems, the maximum likelihood decoding (MLD) bounds are powerful tools to predict the error performance of the coded systems, especially in the asymptotic regime of low error probability (or high signal-to-noise ratio). Contrary to the conventional wisdom, we prove that for BSCs, all bounds based on Gallager’s first bounding technique, including the famous union bound, are not asymptotically tight for all possible choices of the Gallager region. By proposing the so-called input demodulated-output weight enumerating function (IDWEF) of a code, asymptotically tight MLD upper and lower bounds for BSCs are then derived. In many practical scenarios where performance bounds are not applicable (e.g., due to the unavailability of the relevant coding parameters under a given decoder), the Monte Carlo simulation is commonly used despite its inefficiency, especially in the low error probability regime. We propose an efficient importance sampling (IS) estimator by deriving the optimal IS distribution of the Hamming weight of the error vector. In addition, the asymptotic relative saving on the required sample size of the proposed IS estimator over the state-of-the-art counterpart in the recent literature is characterized. Its accuracy in predicting the efficiency of the proposed IS estimator is verified by extensive computer simulation.


I. INTRODUCTION
Linear block codes have been widely used in real-world communication systems for providing reliable data transmission over noisy channels [1], [2]. How to efficiently evaluate the performance of such coded systems, especially at a very low target bit (or word) error rate, is a long-standing problem. Useful performance bounds on the average error probability of the random code are available for any given blocklength and code rate [3]- [5]. However, efficient performance evaluation of a specific coding scheme with a specific decoder remains indispensable for practical code selection and verification of codec implementation. As a result, performance bound analysis and Monte Carlo (MC) performance simulation based on some specific properties of the target The associate editor coordinating the review of this manuscript and approving it for publication was Zhongyi Guo . code and decoder receive ongoing research interests in the communication community.
Many known bounds on maximum likelihood decoding (MLD) performance have been reported in the literature [6], [7]. Many of them are derived using a general bounding technique developed by Gallager and later referred to as Gallager's first bounding technique (GFBT) by Divsalar [8]. The method introduces the so-called Gallager region around the transmitted codeword to divide the observation space and avoid over-estimating the error probability outside the region. The GFBT-based bounds resulted from different choices of the Gallager region include many well-known bounds as special cases. For instance, for the additive white Gaussian noise (AWGN) channel, the famous union bound can be obtained by setting the Gallager region to the whole observation space. The tangential bound of Berlekamp [9] tightens the union bound by determining the region as a boundary VOLUME 10, 2022 This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/ of a plane. The sphere bound [10] derived by Herzberg and Poltyrev chooses the region as a sphere centered at the transmitted signal vector. The tangential sphere bound [11] selects the region as a circular cone whose central line passes through the origin and the transmitted signal vector. Somewhat surprisingly, we discover that for binary symmetric channels (BSCs), the GFBT-based bounds are not asymptotically tight for all possible choices of the Gallager region. In particular, the knowledge of the minimum Hamming distance and the number of minimum-distance codewords of a code are insufficient for characterizing its asymptotic word error probability. To fix this issue, we propose the so-called input demodulated-output weight enumerating function (IDWEF) of a given linear block code and a given decoder. Intuitively speaking, the IDWEF represents the input-output weight relationship of the decoder, while the conventional input-output weight enumerating function (IOWEF) represents that of the encoder. Hence, unlike the IOWEF, the IDWEF characterizes the classification of all dominating error vectors according to which codewords they are decoded to and contains sufficient knowledge of the encoder-decoder pair to express asymptotically tight bounds.
For many practical coding schemes, the performance bound analysis is difficult to conduct. This is because the parameters (such as the minimum distance, IOWEF or IDWEF) of a given encoder-decoder pair required in the bounding expressions may not have been tabulated and made publicly available. Determining such coding parameters by theoretical analysis or by computer enumeration searches is in general a very challenging and costly (if not, intractable) task. Under such scenarios, MC simulation is a common numerical technique to carry out performance estimation. One drawback of this technique is the requirement of a sufficiently large number of generated samples in order to provide reliable estimates of the target bit error rate (BER) or word error rate (WER), which may be very low in the error rate region of practical interest. For example, in the modern Dense Wavelength Division Multiplexing (DWDM) optical communication systems, the transmission capacity per fiber can exceed a Tbps and the target BER is typical 10 −12 and sometimes even as low as 10 −15 [12]. At this level of error probability, performance evaluation by MC simulation is prohibitively complicated. It is noteworthy that due to the high circuit complexity and power consumption associated with the very high transmission rate requirement, hard-decision decoding, in place of soft-decision decoding, has been widely used in optical communication applications [12]. The same holds in the storage applications, such as nanoscale and flash memories [13], [14]. Therefore, a fast simulation method for accurately estimating a very low target error probability under hard-decision decoding (or the resulting BSCs) is highly desirable.
One of the most efficient ways to reduce the computational complexity of the MC method is to use the importance sampling (IS) technique. The idea is to use a biased distribution, called the IS distribution, to generate samples so that the frequency of the occurrence of the rare events can be increased. It can significantly enhance the efficiency in terms of the sample size for simulating the performance of communication systems. Although the optimal IS distribution is known from [15], it does not allow a practical sampling process to be implemented since it depends on the parameters to be estimated itself. Therefore, several sub-optimal but implementable schemes are presented. Many works in the literature choose the IS distribution from a parametric family of distributions like the mean translation [16], the variance scaling [15]. This kind of method is easy to implement but its efficiency in terms of the sample size is low if the parametric family is far away from the optimal IS distribution. To overcome this problem, a mixture of components is used in the multiple IS methods [17], [18], and an iterative adaptation of the IS distribution is applied in the adaptive IS methods [19], [20]. The adaptive IS is used in [21] for error probability estimation for multiple access systems, and a nested IS method that estimates the random-coding error probability of the coded-modulation is provided in [22]. There also exist methods based on nonparametric IS distributions, such as the dual adaptive IS method [23]. The method shows a significant improvement in reducing the sample size of the simulation. However, the empirical reliability measure of the estimated results makes the efficiency analysis of the method ambiguous compared to the MC method.
For BSCs, the method presented in [24] divides the sample space into different regions according to the error weight (i.e., Hamming weight of an error vector) and calculates the WER by manually assigning large enough samples for each weight to counting the number of errors and estimating the conditional error probability. Although the method is efficient in the sense that no additional computation cost is needed for different signal-to-noise ratios (SNRs), the number of samples allocated to each weight in this method is empirically chosen which can be further improved. A quasi-analytical simulation method is presented in [25], which estimates the boundary of the decision region and predicts the error performance. It is efficient under the assumption of the geodesic channel and the star domain decoder. The state-of-the-art IS method over BSCs, called the minimum-variance Bernoulli estimator, is introduced in [26]. It is assumed therein that the IS distribution is a parametric Bernoulli distribution and the optimal parameter is determined by minimizing the variance of the IS estimator. However, the optimal parameter requires the conditional error probabilities for each error weight, which need to be estimated themselves. A fast simulation algorithm, called the IS-MC basic, that estimates both the optimal parameter and the WER iteratively is therein presented.
We take the state-of-the-art IS method in [26] as the benchmark estimator in this paper. If we regard the error weight as a random variable, we find out that this estimator can be interpreted as sampling the error weight with a binomial distribution as the IS distribution. Apparently, restricting the IS distribution of the error weight to the Binomial family is generally suboptimal. By relaxing the restriction and exploit the larger degree-of-freedom to optimize the IS distribution of the error weight, we propose a new IS estimator that can outperform this benchmark IS estimator [26] in terms of the sample size.
In this paper, we present new WER and BER evaluation methods including both performance bound analysis and IS simulation for linear block codes over BSCs. Our main contributions are summarized as follows.
1. We prove that the GFBT-based bounds for BSCs are not asymptotically tight for all possible choices of the Gallager region. By proposing the so-called IDWEF of a coding scheme, the asymptotically tight MLD upper and lower bounds for linear block codes over BSCs are proposed. 2. The optimal IS distribution among all the possible distributions of the error weight over BSCs is derived. A Hamming weight-based IS algorithm that can estimate the optimal IS distribution and the error probability iteratively without assuming any knowledge of the coding parameters is proposed 1, . 2 3. The asymptotic relative saving on the required sample size of the proposed IS estimator over the state-of-theart counterpart [26] is derived. It only depends on the error-correcting capability of the code and can be used to predict the efficiency of the estimator before the simulation. The rest of the paper is organized as follows. Section II briefly reviews some conventional GFBT-based bounds and some preliminaries of the IS method. In Section III, the definition of IDWEF for a linear block code and a given decoder is introduced and based on which, asymptotically tight upper and lower bounds are derived. In Section IV, the optimal IS distribution for error weight is given and a corresponding IS estimator is proposed. The asymptotic relative saving on the required sample size of the proposed IS estimator over the benchmark IS estimator [26] is derived in Section V. A comparison of the MC estimator, the benchmark IS estimator and the proposed IS estimator is presented in Section VI. Finally, some concluding remarks are given in Section VII.

II. PRELIMINARIES
For an (n, k) binary linear block code, a length-k information bit sequence is encoded into a length-n codeword c ∈ C ⊂ X n {0, 1} n , where C denotes an k-dimensional subspace (i.e., the code) of the n-dimensional Hamming space X n . After the codeword is transmitted over the BSC with 1 Part of the contribution was presented in the conference version [27]. In this paper, we extend the method to the BER simulation and analyze the efficiency of the proposed IS estimator shown in the last contribution. 2 We open-source our HW-IS algorithm in [28] and make it user-friendly for any linear block codes. cross-over probability p, a vector z ∈ X n is received. Since the code is linear and the channel is symmetric, without loss of generality, assume that the all-zero codeword c 0 is transmitted. Then the received vector z can also be regarded as the error vector. Denote wt(z) as the Hamming weight of z. Hence, the error vector z follows the multivariate Bernoulli distribution Bern(z; p) with the probability mass function The word error rate (WER) P e and the bit error rate (BER) P b are widely used metrics to measure the error performance of a block code and the associated decoder. To avoid possible confusion between the similar notations for WER and BER, subscripts e and b are used to differentiate them. Denote E as the error region of the all-zero codeword and I e (z) as the indicator function that equals 1 when z leads to a decoding error and 0 otherwise. The WER can be expressed in terms of I e (z) as Denote I b (z) as the ratio of the number of the non-zero information bits to k. The BER is defined as A. ERROR PROBABILITY BOUNDS FOR BSC Error probability bounds are widely used to evaluate the MLD performance of a binary linear block code. One thing that plays an essential role in the derivation of many conventional bounds is the IOWEF [6]. It is defined as where X and Y are the input and output indeterminates, respectively, and A h,d represents the number of weightd codewords generated from weight-h information bit sequences (i.e., the coefficient of the IOWEF with input weight h and output weight d with 0 ≤ h ≤ k and 0 ≤ d ≤ n).
By setting X = 1 in (3) we get the expression of the weight enumerating function (WEF). For example, denote Z as a random error vector, the famous union bound can be expressed as where the WEF coefficient A d k h=0 A h,d , and V d denotes the pairwise error region between a weight-d codeword and the all-zero codeword.
As we know, the union bound is loose when p is large. By limiting the usage of the union bound to an introduced Gallager region R around the transmitted codeword, a tighter bound can be derived by avoiding over-estimating the error VOLUME 10, 2022 probability outside the Gallager region. Mathematically, the GFBT-based bound can be derived as By choosing the region R as the Hamming sphere centered at the transmitted codeword and optimizing the radius to tighten the bound, the sphere bound is proposed by Poltyrev [11]. Based on the simpler formulation in [4], it can be expressed as where represents the number of weight-w error vectors at the same distance or closer to a weight-d codeword compared with c 0 .
On the other hand, by assuming that a decoding error occurs only if the error vector falls inside any packing sphere centered at a codeword other than c 0 [29, eqn. (20)], the union-of-packing-spheres (UPS) lower bound can be derived as whereB is the number of weight-w vectors within the decoding sphere of a weight-d codeword and t is the error-correcting capability of the code. It is noteworthy that if the WEF coefficient A d in (5), (4) and (8) are replaced by k h=1 h k A h,d , the GFBT-based bound, the classical union bound and the UPS bound for BER are obtained.

B. IMPORTANCE SAMPLING AND RELATIVE ERROR
The MC estimator of the WER is unbiased and can be written asP where N is the total number of samples. Variance is commonly used as the reliability measure of the estimator. It is well-known that the variance of the MC estimator is In IS simulation, the WER in (1) can be rewritten as where the p.m.f. f * (z) represents the IS distribution. The IS estimator can be expressed aŝ The IS estimator is also unbiased and its variance can be derived as Relative error of the estimator is commonly used as the stopping criterion of the simulation, which is defined as [30] κ Specifically, if the IS distribution is same as the original sampling distribution, the IS estimator is equivalent to the MC estimator. When the error probability is small (i.e. P e 1), the relative error in the MC estimator is The number of samples needed to achieve a given κ can be approximated as which suggests that N ≈ 100/P e samples are required in order to obtain a reliable estimation result with a relative error of 10%, i.e., κ = 0.1. The BER counterparts of P e ,P MC e andP IS e , denoted by P b , P MC b andP IS b , can be obtained by substituting I b for I e in (12), (10) and (13), respectively.

III. PROPOSED ASYMPTOTICALLY TIGHT BOUNDS
Many bounds in the literature have been derived to predict the error performance of the linear block codes for BSCs. Contrary to the conventional wisdom, we found that they are not asymptotically tight, as p tends to 0. For instance, as shown in Fig. 2  an asymptotically tight bound in this work. The asymptotic tightness of a BER bound is defined in a similar manner. The following theorem formalizes our claim for the GFBT-based WER bounds.

Theorem 1: The WER bound (5) based on Gallager's first bounding technique for binary linear block codes over BSCs
is not asymptotically tight for all possible choices of the Gallager region, as the cross-over probability p tends to 0.
Proof: Since only the asymptotic error probability is relevant, it is sufficient to only consider the weight-(t + 1) . . , m, as the set of weight-(t + 1) vectors which are at the same distance or closer to c i compared with c 0 . As p → 0, a weight-(t + 1) vector may only be wrongly decoded to a weight-(2t + 1) or weight-(2t + 2) codeword. Let m = A 2t+1 + A 2t+2 and index these codewords from 1 to m in a certain order. Obviously, the number of weight-(t +1) vectors that are decoded wrongly is | m i=1 V i |, where | · | represents the cardinality of a set. Therefore, the asymptotic WER is It follows that a WER bound is asymptotically tight if and only if it converges to the above value as p tends to 0. Define As p tends to 0, the GFBT-based bound (5) can be rewritten as On the other hand, we have where the second last equality holds if and only if R t+1 is a subset of m i=1 V i , i.e., the Gallager region R contains all weight-(t + 1) vectors that can be decoded correctly. In order to make the bound asymptotically tight, assume the chosen Gallager region meets this condition. The last equality holds if and only if the sets V i ∩ R t+1 , i = 1, 2, · · · , m, are mutually disjoint. According to the inclusion-exclusion principle, we have For general binary linear block codes, V i 's are clearly not disjoint. It is possible that the sets V i ∩ R t+1 , i = 1, 2, · · · , m, are disjoint with a suitable choice of the Gallager region, i.e., V i ∩ V j ⊂ R t+1 for all i, j. However, this requires the knowledge of the intersections of V i and V j , which involves the decoding error relationships among three codewords (including the all-zero codeword). This suggests that the conventional pairwise error weight distribution represented by the WEF coefficients A d is insufficient for expressing an asymptotically tight WER bound for BSCs because, except the first term on the right-hand side, the other terms involve the error relationships among two or more nonzero codewords. Obviously, the issue remains for all possible choices of the Gallager region R. Therefore, in general, there exists a non-vanishing gap between the GFBT-based bound and the asymptotic WER as p tends to 0. This completes the proof. Take the (6, 2) binary linear block code, whose codebook is shown in Table 1, as an example. The error-correcting capability is t = 1. The WEF coefficients of the code are A 0 = 1, A 3 = 2, A 4 = 1. The total number of weight-(t + 1) vectors is 6 2 = 15. If we index the three non-all-zero codewords in Table 1 from top to bottom, the sets V 1 , V 2 and V 3 can be written as [1 0 0 0 0 1], which can be viewed an additional code-specific parameter that cannot be deduced from the WEF.
Although the GFBT-based WER bounds are not asymptotically tight for all codes, they can be tight for some specific codes. One example is that if the code satisfies | m i=1 V i | = n t+1 (i.e., all the weight-(t + 1) vectors will be decoded wrongly, such as the perfect code), the bounds can be asymptotically tight. Another example is that if all the sets V i of the code are pairwise disjoint, the WER bounds can also be asymptotically tight. In particular, all V i 's satisfy the pairwise disjoint condition if A 2t+2 = 0. Note that some weight-(t +1) vectors are wrongly decoded although they do not fall in the radius-t decoding spheres of any codewords. Thus, the UPS lower bound is not asymptotically tight.
The arguments behind the proof of Theorem 1 can be readily extended to assert that the GFBT-based BER bound for BSCs is not asymptotically tight for all possible choices of the Gallager region. Specifically, it can be deduced that the pairwise error weight distribution represented by the IOWEF is insufficient for representing an asymptotically tight BER bound for BSCs.
By far, we can see that in order to get an asymptotically tight bound for the WER, we need to get the exact number of weight-(t +1) vectors that lead to an error. Furthermore, if the bounds for the BER are considered, we also need to know the weights of the information sequences that the weight-(t + 1) vectors wrongly decoded to. Therefore, we introduce the IDWEF defined as where D and Y are the output and input indeterminates of the decoder, respectively, and S h,w is the number of weightw vectors that will be decoded to an information sequence with weight h. It actually can be regarded as the IOWEF of the decoder as it focuses on the relationship between the weights of the error vectors and the information sequences. Consequently, the IDWEF is decoding algorithm dependent. Note that the sphere partitioning function defined in [29] is a special case of our IDWEF. It only shows the first term of the IDWEF and does not consider the weight of the information sequence. For a specific hard-decision decoder, the following proposition provides the exact WER and BER of a given code. Proposition 1: Given the IDWEF (i.e., S h,w ∀h, w) of a binary linear block code and its decoder over BSCs, give the exact WER and BER expressions, respectively. Unfortunately, the decision regions of a decoder are usually too complex for us to obtain the full details of the IDWEF.
It is reasonable to use a truncated IDWEF and the existing bounds in order to acquire asymptotically tight bounds of the WER and BER. By replacing the w = t + 1 term of the sphere bound (6) and UPS bound (8), we can get the proposed asymptotically tight upper and lower bounds for WER.
Proposition 2: Given the weight-(t + 1) IDWEF coefficients (i.e., S h,t+1 ∀h) of a binary linear block code and its decoder, give asymptotically tight upper and lower bounds on WER for BSCs, respectively. Similar modifications can be applied to the BER case, where we replace the first term of the union bound (4) and the UPS bound (8).
Proposition 3: Given the weight-(t + 1) IDWEF coefficients (i.e., S h,t+1 ∀h) of a binary linear block code and its decoder, give asymptotically tight upper and lower bounds on BER for BSCs, respectively. It is natural to think that the more terms in (4), (6) and (8) are replaced by those in (18) and (19), the tighter the final bounds can achieve. As the number of weight-w vectors increases dramatically w.r.t. the error-correcting capability of t (i.e., the value w t+1 is huge for large t), the cost of the brute-force is already unaffordable with some short blocklength codes. Efficient ways to acquire the IDWEF remains open. The simulation is still a practical way to estimate the error performance of long codes. The relative saving on the sample size that required for reliable estimation can be further improved if some fast simulation algorithms are applied.

IV. PROPOSED IMPORTANCE SAMPLING ESTIMATOR
In this section, we propose an IS estimator for error probability evaluation of linear block codes over BSCs with crossover probability p. A corresponding Hamming weight-based IS algorithm is then presented.
For a given linear block code with blocklength n, the complexity of searching for a proper n-dimensional IS distribution is high. We define W wt(Z) as a random variable, named as the error weight. It follows the binomial distribution Bin(w; p) with p.m.f. P w = n w p w (1 − p) n−w and its sample space is the one-dimensional Hamming weight space {0, 1, 2 · · · , n}. We search for the optimal one-dimensional IS distribution in the Hamming weight space that minimized the variance of the IS estimator. The complexity of the searching process can be significantly reduced while the most sensitive dimension of the randomness to the cross-over probability remains.

A. OPTIMAL IS DISTRIBUTION
Define the error probability conditioned on the weight-w vectors as which is named as the error ratio. Since θ e (w) is independent from p, assigning a larger sample size to the error weight that contributes the most to the accuracy of the estimation helps accelerate the simulation process. Assume P * w for w = 0, 1, · · · , n is the applied IS distribution in the Hamming weight space. The corresponding IS distribution over BSCs with cross-over probability p can be derived as The physical meaning of the IS distribution (25) can be illustrated as that we need to adjust the occurrence probability for each weight from a binomial distribution to an optimized {P * w } n w=0 while keeping the distribution conditioned on each weight unchanged. Therefore, the IS estimator for WER in (13) specializes tô where z i are generated from f * (z). By substituting the IS distribution (25) into (14), we can derive the expression of the variance of the IS estimator as where Y w = {z ∈ X n : wt(z) = w} is the weight-w Hamming shell. One remark is that the variance expression in Lemma 1 of benchmark paper [26] is not rigorous.
The following theorem provides the general expression for the p.m.f. {P * w } n w=0 that minimizes the variance of the proposed IS estimator (i.e., the sample size required to achieve a specific relative error is minimized).
Theorem 2: The optimal p.m.f. {P * w } n w=0 on the Hamming weight space that minimizes the variance of the proposed IS estimator (27) is given by √ θ e (j)P j , for w = 0, 1, . . . , n.
Proof: Since P * w , for w = 0, 1, · · · , n, are probabilities and only involved in the first term of the variance (27), the minimization problem can be formulated as Denote J (P * 0 , P * 2 , · · · , P * n ) = n w=0 C w P * w as the objective function, where C w = θ e (w)P 2 w . The Hessian of J is where diag(·) represents the diagonal matrix. The Hessian matrix is positive definite ∇ 2 J 0, ∀P * w ∈ R n + , where R n + is the n-dimensional non-negative real space. The feasible set is a subset of R n + and is convex. Furthermore, the objective function is a subset of R n + . Since both the objective function and the feasible set are convex on R n + , the optimization problem is convex.
Relax the problem by removing all the inequality constraints (i.e., extending the feasible set). The relaxed problem is still convex on R n + . Its Lagrangian can be derived as where λ denotes the Lagrange multiplier. The solution of the relaxed problem can be derived by setting the derivative of the Lagrangian to 0 The above solution satisfies 0 ≤ P * w ≤ 1, which indicates it falls inside the feasible set of the original problem. VOLUME 10, 2022 Therefore, the optimal solution for the problem in (29) is √ θ e (j)P j , for w = 0, 1, . . . , n.
By substituting (28) into (25), the optimal IS distribution w.r.t. p can be written as As we can notice, the optimal solution (28) contains part of the target information itself θ e (w), with which the error probability can be straightforwardly calculated. Therefore, it is impossible to obtain the optimal solution in practice. A straightforward way that can avoid the problem is to use the estimated or approximated θ e (w), denoted asθ e (w), to evaluate the p.m.f. {P * w } n w=0 . There exists some knowledge of the codebook like the error-correcting capability t and the IDWEF conditioned on weight t + 1 that can help to get a good approximation of θ e (w) for the asymptotic case.
A naive IS estimator is therefore introduced. Based on either some knowledge of the codebook or a preprocessing with limited computational power, a rough estimation of some dominant terms of θ e (w) is assumed. Then, an approximated IS distribution is derived and used for further error probability simulation. Also, since error ratios θ e (w) for small weights are important properties for the code, it's worthwhile to do an off-line simulation and tabulate them like the examples shown in [24].
One remark is that the counterparts of θ e for the BER case θ b can be obtained by substituting I b for I e in (24). All the above results still hold for BER case if these counterparts are replaced in all equations.

B. HAMMING WEIGHT-BASED IS ESTIMATOR
We propose the Hamming weight-based importance sampling (HW-IS) algorithm that is more efficient than the state-ofthe-art IS-MC basic algorithm [26] for the fast simulation purpose.
Since the optimal IS distribution requires the information of θ e (w) which is also absent at the very beginning, an iterative algorithm is proposed. With the initial guessθ 0 (w), the iteration alternates between performing a p.m.f. {P * w } n w=0 update and a WER or BER estimation. The implementation of the HW-IS algorithm is shown in Algorithm IV-B. Initially, the sample size counter N_tot is set as 0 and the relative error WERre is set as 1. The p.m.f. {P * w } n w=0 is initialized with the inputs of the cross-over probability andθ 0 (w).
The sample generation from the IS distribution f * (z) consists of two phases. During the first phase, an integer w is randomly generated from the p.m.f. {P * w } n w=0 . In the second phase, a sample z is generated by randomly permuting the element order of a weight-w length-n binary vector to mimic the uniform distribution of error vectors conditioned on wt(z) = w. During the iteration, the error ratioθ e (w) for each Hamming weight is updated bŷ where I w (z) returns 1 if wt(z) = w and 0 otherwise. The p.m.f.
{P * w } n w=0 for the next iteration is updated with (28). The error probability is estimated inside the while loop until the relative error calculated by (15)  In order to avoid the violation due to the insufficient number of samples during the first several iterations, we set the minimum sample size N_min needed for bothP e and {P * w } n w=0 to start an update. However, for some small θ e (w) terms, it's possible that no error has been found after N_min samples are generated. This may lead toθ e (w) = 0 ⇒ P * w = 0 after the update, which means no more samples with those weights will be generated afterwards and P * w will be frozen to be 0. In order to avoid this frozen distribution parameter problem, we sacrifice part of the efficiency by forcing P * w 0 −1 = βP * w 0 , where w 0 is the first weight with non-zeroθ(w 0 ) by far and β is a factor set heuristically. This will make the algorithm keep searching for weight-(w 0 − 1) vectors that may contain errors. Apparently, for the cases with the knowledge of t, no efficiency loss happens. While for the cases without t, we need to keep generating samples with weight w 0 − 1 until the stopping criterion satisfies. One can regard this as the efficiency loss led by the lack of knowledge of t.
Furthermore, the algorithm can be extended to an SNR invariant version by accurately estimating θ e (w) during one simulation. Compared with the SNR-invariant algorithm in [26] the advantage is that even for long codes, not only the error floor part but also the water-falling region can be accurately predicted. Similar to [26], the error-correcting capability t can be estimated according to the result. The algorithm is also applicable for the BER case if all the notations for WER are replaced by those for BER.

V. ASYMPTOTIC RELATIVE SAVING ON THE REQUIRED SAMPLE SIZE
In this section, the efficiency of the proposed IS estimator is analyzed. From the simulation results of the benchmark IS estimator in [26], one can observe that the number of the generated samples becomes saturated at some point in the high SNR region. A similar phenomenon can be found in the results of the proposed estimator as well, which means the IS distribution does not depend on the SNR or p anymore. Since for the asymptotic case, the weight-(t + 1) error vectors dominate the performance, once the reliable estimation of the error ratio θ e (t + 1) is obtained, the error probability result meets the accuracy requirement and no more samples are needed. Above all, it's natural to show the advantage of the proposed estimator by comparing the efficiencies of these two estimators in the asymptotic case.
In order to make a fair comparison, define the relative saving η as the percentage that the proposed estimator can save compared to the benchmark estimator in terms of the sample size under the reliability (i.e., the relative error values are equal).
where N prop and N bench are denoted as the sample sizes for the proposed estimator and the benchmark estimator, respectively. For the asymptotic case where the assumption np 1 is usually made, [31] suggests that the following approximation for (27) holds which indicates that the weight-(t + 1) error vectors not only dominates the error probability but also the variance of the estimator. It can be foreseen that the saturated value of the number of samples is only related to the term θ e (t + 1) in the error ratios of the code. For the proposed estimator, this will make the parameter of the optimal distribution P * t+1 approaches 1 to generated samples on the target weight as many as possible. An asymptotic approximation of the relative saving on the required sample size can be derived. Theorem 3: Assume that np 1. The relative saving η MC (in terms of the sample size) of the proposed IS estimator w.r.t. the MC estimator is The relative saving η of the proposed IS estimator w.r.t. the state-of-the-art counterpart in [26] is Proof: According to the approximation of the variance in (36) and the definition of the relative error in (15), we can derive the total number of samples required for the MC estimator, the benchmark IS estimator [26] and the proposed IS estimator with relative error κ as Under the assumption np 1, the approximation P e ≈ θ e (t +1)P t+1 holds. Therefore, the number of samples needed for the proposed estimator can be written as .
where the last approximation is made due to that most of the cases, the error ratio satisfies θ e (t + 1) 1. Similarly, for the MC and the benchmark estimator, the sample sizes can be derived as , .
As the probability P t+1 dominates the tail part w ≥ t +1 of the p.m.f. P w under the assumption np 1, the approximation P * t+1 ≈ 1 can be achieved according to (28). Furthermore, from [26], we know that the optimal parameter for the benchmark estimator in the asymptotic case is q = t+1 n for the minimum-variance purpose within the parametric family of Bernoulli distribution.
Hence, the efficiencies that the proposed IS estimator can achieve compared to the MC and the benchmark are

respectively.
As p tends to 0, the saving η MC compared to the MC estimator in (37) approaches 1. We are more interested in η compared to the benchmark estimator [26]. Since most of the block codes satisfy the condition n t + 1, the influence of the blocklength on η is negligible when n becomes large. This also can be verified through Fig. 1, where the curves become FIGURE 1. Asymptotic relative saving (38) on the required sample size that the proposed IS estimator can achieve compared to the benchmark estimator [26] in terms of the blocklength n with different error-correcting capabilities t . steady as n increases. Therefore, η can be further simplified by approximation as stated in the following corollary.
Corollary 1: Assume that np 1 and the code satisfies n t + 1. The relative saving η of the proposed IS estimator given in Theorem 3 can be further simplified as where Q(x) = 1 √ 2π ∞ x e − x 2 2 dx. Proof: When n is large, we can use the normal distribution N (np, np(1 − p)) to approximate the binomial distribution B(n, p). Here define a random variable X ∼ N (µ, σ 2 ) with mean µ = t + 1 and variance σ 2 = (t + 1) 1 − t+1 n ≈ t + 1 under the assumption n t + 1. The following approximation holds From the above corollary, we know that the asymptotic relative saving η can be represented as a function of t only and is indepedent of the blocklength n. The larger t the code has, the higher asymptotic η one can achieve. And we will show examples in Section VI that although several approximations are made during the analysis, the asymptotic η predicts the efficiency very well in practice.
Finally, since (39) is irrelevant to the error ratios of the code, both the BER and WER simulations have the same the relative saving on the sample size. There also exist ways to further improve η if more knowledge of the codebook  (17)) of the (15,7) primitive BCH code with MLD and t = 2.
is provided. Since the samples are uniformly drawn within each weight, the sample size needed for a reliable estimation of θ e (w) won't reduce no matter what kind of biased distributions is applied. Combining the derived IS distribution with a smarter biased distribution conditioned on each weight instead of the uniform one will further reduce the required sample size. But that may be code specific since the codebook knowledge, as well as the decoding algorithms' mechanism, are required.

VI. NUMERICAL RESULTS AND DISCUSSIONS
In this section, various bounds mentioned in Section II are firstly compared, and then the simulation results about the sample size needed for reliable estimation of the proposed HW-IS algorithm are shown. The comparisons between the results of the proposed algorithm with those of the stateof-the-art ''IS-MC basic'' algorithm in [26] are presented. Furthermore, the relative savings on the sample size of the example codes in the high SNR region are listed to show the accuracy of the derived asymptotic η. All the simulation results can be reproduced by our open-source tool [28]. VOLUME 10, 2022 FIGURE 5. A comparison of the asymptotic relative saving in (39) and the corresponding simulation results of the proposed HW-IS algorithm using the BCH codes considered in in Fig. 3 and Fig. 4.

A. COMPARISONS OF BOUNDS
We consider the (15,7) primitive BCH code with MLD and t = 2. The weight-(t + 1) IDWEF coefficients of the code are obtained by an exhaustive search, and the results are listed in Table 2.
In Fig. 2, we compare the proposed upper and lower bounds in Proposition 2 and 3 with several known bounds in the literature. For reference, the union bound (4), Poltyrev's sphere bound [10], the UPS bound [29] and the MLD simulation results are plotted for both WER and BER in Fig. 2(a) and (b), respectively. In addition, Keren-Litsyn's bound [32, Theorem 1] and Cohen-Merhav's bound [33,Proposition 4.2] on WER are plotted in Fig. 2(a). These two bounds have almost the same performance for the considered code and and are marginally tighter than the UPS bound. It can be seen that all three lower bounds have the same asymptotic performance and are not asymptotically tight. Since there are no BER counterparts derived in [32] and [33], only the UPS lower bound on BER is plotted in Fig. 2(b) for comparison with the proposed lower bound. One can see that the proposed bounds are asymptotically tight in the high SNR region, while the others have non-vanishing gaps from the MLD simulation results. For the union bound and the sphere bound, the gaps are caused by over-counting the number of wrongly decoded weight-(t + 1) vectors as we stated in Theorem 1.
Although the bounds are powerful tools for performance evaluation, it requires parameters of the applied encoderdecoder pair (such as the IOWEF or the IDWEF). Determining such parameters may be computationally costly or intractable. Also, MLD is not always applicable for the practical case. For example, people usually decode the BCH codes by the Berlekamp-Massey (BM) algorithm [34], [35] in practice, which is suboptimal. Simulation is a commonly used tool to estimate the error performance for the coded systems. Next, we will show how the proposed IS method can significantly improve the efficiency in terms of the sample size. Simulation results of some representative LDPC codes using the proposed HW-IS algorithm and the IS-MC basic algorithm [26]. The (273, 191) DSC LDPC code is taken from [36] and the others are MacKay's LDPC codes from [37].

B. COMPARISONS OF SIMULATION RESULTS
According to the HW-IS algorithm in Section IV, it is natural to think that a good initial guess ofθ 0 (w) will accelerate the FIGURE 7. Simulation results of EG-LDPC codes [38] using the proposed HW-IS algorithm and the IS-MC basic algorithm [26].
convergence of the optimal IS distribution. The choice of θ 0 (w) depends on what kind of knowledge is assumed. In this part, we consider two cases -the one with the knowledge about the error-correcting capability t and the other without any side information. For the former one, we can setθ 0 (w) to be that of the worst-case bounded distance decoder (i.e.,θ 0 (w i ) = 0 andθ 0 (w j ) = 1, for w i = 0, 1, · · · , t and w j = t + 1, 2, · · · , n). While the latter one can be regarded TABLE 3. A comparison of the asymptotic relative saving (''Asymptotic η'') in (40) and the simulated relative saving (''Simulated η'') of the proposed IS estimator (HW-IS algorithm) w.r.t. the benchmark IS estimator (IS-MC basic algorithm [26]) at WER ≈ 10 −13 using various LDPC and Polar codes. N prop and N bench denote their required sample sizes, respectively. as an uncoded system (i.e., t = 0). The stopping criterion of the simulations in this section is set as κ = 0.1.
We choose some representative BCH codes [31] with rate R = k/n ≈ 0.9 as examples. The BER results as well as the generated sample size w.r.t. E b /N 0 are shown in Fig. 3(a) and (b), respectively. All the codes are decoded by the BM algorithm. A sample size N_min = 10 3 is assured before we start to update the IS distribution.
It can be noticed that all the curves in Fig. 3 (b) increase exponentially and become saturated after reaching some specific E b /N 0 's. The reason is that as SNR increases, the sample size needed keeps growing until the error ratio of the dominant weight-(t + 1) Hamming shell is reliably estimated. The E b /N 0 's for the curves become saturated vary for different codes as they depend on the codebook knowledge and the decoding algorithm. Further increasing the sample size wastes the computational power. We may claim the HW-IS algorithm is SNR-invariant for high SNR.
The BER and sample size vs E b /N 0 of BCH codes with rate R ≈ 0.5 using the proposed HW-IS algorithm and the benchmark IS-MC algorithm are shown in Fig. 4(a) and (b), respectively. Similar observations as those with rate R ≈ 0.9 also hold.
The relative saving η vs the error-correcting capability t for the BCH codes considered in Fig. 3 and Fig. 4 are shown in Fig. 5, where the solid line represents the approximated asymptotic result derived in Corollary 1. All the simulation results are calculated in the high SNR region (i.e., the part that the number of samples becomes saturated). As we can see, the curve for the asymptotic η is achievable and can predict the efficiencies of the HW-IS algorithm pretty well. Also, the trend that a larger saving on the sample size can be obtained for the codes with bigger t is verified according to these simulation results.
Next, we consider the cases with no side information provided and apply the HW-IS algorithm on the LDPC and Polar codes. In order to make a fair comparison, let us start with the same examples used for the IS-MC basic algorithm in [26]. The LDPC codes taken from [36], [37] are implemented. In Fig. 6 (a) and (b), the simulation results of the WER and the sample size are shown, respectively. Since the knowledge of t is absent, the factor β = 1 is set to keep the estimator unbiased. Subsequently, the asymptotic η will be degraded to All the settings of the algorithm are the same as the cases for the BCH code. The LDPC codes are decoded by the bitflipping decoder presented in [24] with a maximum iteration number of 20. Similar to the BCH codes, we can observe that from Fig. 7 (b), the curves of the sample size reach a saturated value after some E b /N 0 points. This flat region indicates that the weight-t + 1 errors dominate the performance and the assumptions for the asymptotic analysis for the error probability hold now. Hence, the estimatedθ e (w) for this region can be used to predict the t of the LDPC codes. Since those saturated values are determined by the value of θ e (t + 1), for the code with a thinner weight spectrum, the smaller θ e (t + 1) it owns, the more samples needed for a reliable estimation in the asymptotic case.
We further implement the HW-IS algorithm on the EG-LDPC codes [38] and the Polar codes. The EG-LDPC codes are decoded by the same decoder as stated in the previous example. Due to the values of t for these codes have been tabulated, we can verify the derived asymptotic efficiencies η with the simulated ones. The Polar codes are decoded with the successive-cancellation algorithm proposed in [39].
One can see by Fig. 7 and 8 that for all of these examples, the proposed HW-IS algorithm can beat the IS-MC basic algorithm in the aspect of the efficiency while keeping the same accuracy of the WER estimation.
Usually, people are more interested in the performance of the LDPC codes over the water-falling region instead of the asymptotic cases. Since in practical applications, there is no need to get the specific position of the error floor as long as it's below the error probability level of interest. Hence, for the purpose of studying the performance of the HW-IS algorithm in the non-asymptotic cases, we apply it on two more practical EG-LDPC codes [38] with parameters (1023, 781) and (4095, 3367). The level of WER P e ≈ 10 −13 (approximated from the level of required BER P b = 10 −15 ) is focused on, which is commonly required for the optical communication applications [12].
The sample sizes and the asymptotic efficiencies of all the used LDPC codes and Polar codes compared to the IS-MC basic algorithm are summarized in Table 3.
Due to the efficiency loss caused by the lack of knowledge of t, the asymptotic η's of these codes are smaller than those of the BCH codes under the same t. Though, compared to the large sample size required for the low error probability estimation, the saving is still considerable. By combining the results shown in Table 3 and Fig. 6-8, we can see that the asymptotic η predicts the performance quite well even for (1023, 781) and (4095, 3367) codes, whose curves for the sample size have not reached the saturated region at P e = 10 −13 . These provide evidences that our method can outperform the IS-MC basic algorithm for the nonasymptotic cases as well.

VII. CONCLUSION
In this paper, the problem of efficiently evaluating the BER and WER of linear block codes over BSCs was studied. Firstly, we showed that any GFBT-based bounds are not asymptotically tight for all possible choices of the Gallager region. By proposing the IDWEF of a coding scheme, the asymptotically tight MLD upper and lower bounds were proposed. Secondly, aiming at accelerating the simulation process for the low BER and WER region, a Hamming weight-based IS estimator was proposed. Its relative saving on the sample size required for a reliable estimation compared with the state-of-the-art IS-MC basic algorithm [26] was investigated. The derived asymptotic η can predict the efficiency of the proposed IS estimator accurately. Our simulation results showed that the proposed IS estimator is more efficient than the benchmark [26] in all cases under consideration. The saving on the sample size ranges from 47% to 97% and increases with the error-correcting capability of the code. As a future work, it is interesting to extend the presented results for BSCs to the counterparts for continuous channels, including but not limited to the Gaussian channel.