Fast Simulation of Coded QAM Transmission in White Gaussian Noise at Low Packet Error Rates

Today’s communication system design heavily depends on computer simulation for performance evaluation. The burgeoning ultra-reliable communication systems, however, pose a significant simulation challenge as such systems operate at very low packet error rates (PERs), whereas the required simulation time of the conventional Monte Carlo (MC) method is many times the inverse of the PER. Various importance sampling-type techniques have been developed for more efficient simulation of channel-coded transmission, but they typically rely on exploiting code weaknesses (for the generation of error-causing noise samples), and most works on soft-decision decoding only treat binary signaling. In this paper, we propose to use a function, termed the noise gauging function (NGF), that roughly measures the error-causing propensity of noise samples, and we present a way to adaptively optimize the noise sampling under such a function for simulation efficiency. Both binary and nonbinary signalings are considered. The proposed technique does not require detailed knowledge of the code weaknesses, although some high-level understanding of the code properties can benefit the design of efficient NGFs. We investigate the application of the proposed technique to several common channel codes. Numerical results indicate an approximately 10- to 1,000-fold speedup versus MC.


I. INTRODUCTION
THE fifth generation (5G) wireless communication standards spearheaded ultra-reliable and low-latency communication (URLLC) to support mission-critical communication in application scenarios such as factory automation and intelligent transportation. Later generations of standards and practical systems are expected to further such capability [1]. Low-latency aspects notwithstanding, the ultra-reliability aspects, as currently conceived, involve packet error rates (PERs) in the range of 10^−5 to 10^−9 with packet sizes on the order of 32 bytes [2], [3]. To achieve the required performance, proper channel-coded modulation is indispensable. In this regard, current 5G specifications have continued the common recent practice of bit-interleaved coded modulation (BICM) in standard wireless communication systems [4], [5] and employed BICM built on low-density parity-check (LDPC) and polar coding [6], [7]. There is no doubt that a similar practice will continue as related research continues to progress, e.g., [8], [9]. Now, today's communication system design makes heavy use of computer simulation for performance evaluation, for which the Monte Carlo (MC) method is a most ready choice due to its simplicity and general applicability. For error-probability evaluation, however, it requires on the order of 1/(ε²p_e) simulation runs to attain a relative precision of ε at an (a priori unknown) error probability p_e. Hence it encounters difficulty in dealing with low error probabilities (such as PERs below 10^−7), especially for complicated systems. Usually, one is not interested only in obtaining the error performance at one particular operating point, but in tracing out the PER performance over some range of signal-to-noise ratio (SNR) values for various design alternatives.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 3, 2022, 1887
At times the extreme-value theory or some understanding of the tail probability property can be invoked to extrapolate the MC-obtained error performance from lower SNR values to higher SNR values [10], [11]. This can certainly reduce the MC simulation burden significantly. But the prerequisite is that one has reached the tail part of the error performance curve in MC simulation. Otherwise, the extrapolation would be based on an incorrect slope, which may result in substantial over- or underestimation of the error probabilities for higher SNRs [12]. But for complicated coded transmission it can be difficult to ascertain whether one has reached the tail part of the performance curve. Hence one may be obliged to conduct simulation over the full range of SNR or PER of concern. The deficiency in MC efficiency thus strongly calls for a robust and efficient alternative for simulation of ultra-reliable communication systems. One long-standing approach to fast simulation (versus conventional MC) is importance sampling (IS), for which various techniques have been developed [10]. The basic idea of IS is to increase the frequency of error-causing events by generating simulation samples with a biased probability distribution. Lu and Yao [13] consider uncoded transmission over noisy intersymbol-interference (ISI) channels and find that biasing the Gaussian noise by mean translation is increasingly more efficient than biasing by variance scaling as the SNR increases. Making use of the large deviations theory, Sadowsky and Bucklew [14] show that, for Gaussian noise, proper mean-translation biasing is asymptotically efficient at high SNR values. As a result, translation of a Gaussian noise's mean to the decision boundary has become a standard practice in IS simulation [15].
For coded transmission, take LDPC coding as an example. The decoding of such codes is known to be plagued by bit patterns called "trapping sets" which lead to error floors at high SNR. There exist efforts in using mean translation to simulate LDPC decoding in additive white Gaussian noise (AWGN) and Rayleigh fading [11], [16], [17], [18], [19], as mean translation on the trapping sets constitutes an efficient means of estimating the error floors. However, finding the full trapping set is a nondeterministic polynomial-time (NP)-hard problem [20], which is generally much more difficult than estimating the PER itself. Adaptive IS [21], [22] and universal simulation distributions [23], [24] can help find proper mean translations, but they are effective only in low-dimensional code spaces.
Taking a different route, Minja and Šenk [25] propose a quasi-analytic simulation method for so-called "star domain" decoding. The method yields a remarkable reduction in simulation runs, but its application has been limited to several classical decoders for the binary symmetric channel or for binary signaling in AWGN.
The above simulation methods exploit known code weaknesses or the decoder property to concentrate simulation samples in the regions of the noise space that have higher probabilities of causing errors. The issue is that, in a high-dimensional space, the amount of effort required to determine such regions can loom large over the gain in simulation runs. One approach to mitigating this problem is to "map" the multi-dimensional simulation space into a single dimension judiciously and design proper noise sampling over properly organized subsets (or bins) of this one-dimensional (1D) space. In this regard, Holzlöhner et al. [26] define a scalar function of the noise vector whose values are positively correlated with the probability of decoding error and employ the multicanonical Monte Carlo (MMC) method to sample the noise to yield a flat histogram in the function's values. A subsequent study based on a similar flat-histogram approach also shows a prominent efficiency advantage over MC [27].
However, a nonflat histogram may be more efficient for a given 1D mapping. For example, in simulating hard-decision decoding performance, Mahadevan and Morris [28] (naturally) use Hamming distance between bit patterns as the 1D mapping and generate more samples in the vicinity of half the minimum Hamming distance of the code. Indeed, Liang et al. [29] derive a general formula for the optimal histogram. Yet, there have been few studies that try to employ such an optimal histogram in simulating coded transmission. The reason is that it requires knowing the a priori unknown conditional error probabilities associated with the 1D mapped bins. In addition, for either a flat or nonflat histogram design, an issue is the efficient generation of properly distributed samples for each bin of the employed 1D mapping, as the distribution can be very complicated for a mapping of choice.
In previous works [30], [31], we considered simulating coded transmission performance employing a 1D mapping of the noise space termed the noise gauging function (NGF). More specifically, in [30] we developed a prototypical adaptive sampling method that can estimate the optimal (nonflat) histogram on the run, and in [31] we considered specifically the simulation of convolutionally coded systems and enhanced the method for their simulation efficiency. However, one limitation of these works is that they in effect only considered binary signaling, as is the case with all the reported studies on fast simulation techniques. (More exactly, the works considered quadriphase shift keying [QPSK], but that is mathematically equivalent to the direct sum of two biphase shift keying [BPSK] signals.) In the present paper, we revisit the design of the NGF as well as the adaptive sampling method under different situations. In particular, we consider quadrature amplitude modulation (QAM) in addition to QPSK, develop new NGFs along with revamped adaptive sampling mechanisms, and conduct a more in-depth analysis of the associated performance. To reiterate, our proposed technique is essentially code-agnostic in that it does not require detailed knowledge of the code weaknesses. However, some high-level understanding of the code properties can benefit the design of efficient NGFs. We will illustrate this point in several cases, one of which is burst-error codes. Fig. 1 illustrates, conceptually, some key features of the conventional MC, conventional IS, and the proposed simulation techniques.
The remainder of this paper is organized as follows. Section II describes the system model and common PER measures. Section III introduces the proposed fast simulation approach. Section IV presents several practical NGFs and develops associated methods for noise sampling conditioned on given NGF values. Section V develops a method for adaptive shaping of the NGF histogram towards the optimal pattern. Section VI presents some numerical results. Section VII extends the proposed technique to simulation of burst-error coding. Finally, Section VIII concludes the paper.
Some notational conventions are as follows. P(·) denotes the probability of an event. Random variables are denoted using uppercase italic letters and random vectors uppercase boldface letters. Corresponding lowercase letters denote their sample values. Depending on whether X (resp. X) is discrete or continuous, f_X(·) (resp. f_X(·)) denotes its probability mass or probability density function. E[·] denotes expectation; any subscript to E denotes the random quantity over which the expectation is taken. V(·) denotes the variance of a random variable. Bernoulli(p) denotes the Bernoulli distribution with success probability p, and N(μ, Σ) denotes the multivariate Gaussian probability distribution with mean vector μ and autocovariance matrix Σ.

II. SYSTEM MODEL AND PER MEASURES
Fig. 2 shows the considered system model, where B is a vector of k information bits that observe the independent and identically distributed Bernoulli(0.5) distribution. Let n denote the codeword length. The channel encoder, of rate r = k/n, encodes B into a binary codeword vector C = [C_1, . . . , C_n], and a modulator operates on C to yield a channel symbol vector W = [W_1, . . . , W_l], where l depends on the modulation method. For convenience, we represent the in-phase and quadrature parts of a QAM (including QPSK) symbol separately in W, so that W is a real vector and l is equal to two times the number of QAM symbols corresponding to C. Each two successive elements of W make up a QAM symbol of the form W_{2i−1} + jW_{2i}, where j = √−1 and 1 ≤ i ≤ l/2. We shall often refer to W as a packet in this work.

We let E[W_i²] = E_s/2 for 1 ≤ i ≤ l, where recall that E[·] denotes expectation and E_s is the average QAM symbol energy. The packet W is sent over an AWGN channel with two-sided noise power spectral density N_0/2, which yields a received symbol vector Y = W + Z, where Z = [Z_1, . . . , Z_l] is a white Gaussian noise vector with autocovariance (σ²/2)I, where σ² = N_0 and I denotes an identity matrix. The receiver demodulates Y into V = [V_1, . . . , V_n] and decodes it into a binary data vector B̂.
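To make the model concrete, the following minimal Python sketch (an illustration of ours, not the paper's code) simulates the chain Y = W + Z for uncoded QPSK with E_s = 1: each real dimension carries ±√(E_s/2) and sees independent N(0, N_0/2) noise, so the per-bit error rate should match Q(√(E_s/N_0)). The SNR point and sample count are arbitrary choices.

```python
import math
import random

def q_func(x):
    # Gaussian tail probability: Q(x) = 0.5 * erfc(x / sqrt(2))
    return 0.5 * math.erfc(x / math.sqrt(2.0))

random.seed(1)
Es, N0 = 1.0, 0.5              # Es/N0 = 2 (about 3 dB); arbitrary test point
l = 200_000                    # number of real dimensions simulated

errors = 0
for _ in range(l):
    c = random.getrandbits(1)                     # a code bit
    w = (1 - 2 * c) * math.sqrt(Es / 2)           # per-dimension QPSK value
    y = w + random.gauss(0.0, math.sqrt(N0 / 2))  # Y = W + Z, Z ~ N(0, N0/2)
    errors += (y < 0) != (c == 1)                 # hard decision wrong?

ber_hat = errors / l
ber_theory = q_func(math.sqrt(Es / N0))           # uncoded QPSK per-bit error rate
```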
A packet error occurs when any bit in B̂ is different from the corresponding bit in B, so that B̂ ≠ B. Under our system model, the PER is given by

p_e = E_B[p_e(B)] = E_B[E_Z[I_F(B, Z)]],   (1)

where p_e(b) denotes the PER of b, F(b) = {z | b̂ ≠ b} is the set of z that would cause a decoding error for b, and I_F(b, z) is an indicator function defined as I_F(b, z) = 1 if z ∈ F(b) and I_F(b, z) = 0 otherwise. For linear codes with BPSK or Gray-coded QPSK modulation, p_e(b) is equal ∀b, so that p_e can be evaluated by evaluating p_e(b) for any b, such as b = 0. For convenience, define Θ ≜ (B, Z), and we shall refer to F(b) either as an error region or an error set.
For reliability reasons, URLLC has favored the use of low-order modulations. Fig. 3 shows the Gray-coded QPSK and 16QAM constellations [6]. For QPSK, l = n and the bits are mapped to symbol values as

W_i = (1 − 2C_i)√(E_s/2), 1 ≤ i ≤ n.   (2)

Except for the scaling factor of 1/√2, a Gray-coded QPSK symbol is just the direct sum of two BPSK symbols. Hence, in AWGN, maximum-likelihood (ML) decoding under coherent Gray-coded QPSK is mathematically no different from that under coherent BPSK. For 16QAM, l = n/2 and

W_i ∈ {±1, ±3}√(E_s/10), 1 ≤ i ≤ l,   (3)

with each bit pair (C_{2i−1}, C_{2i}) Gray-mapped to the four levels as shown in Fig. 3. For the time being, consider QPSK only. Let a_i be the number of codewords of Hamming weight i, where 0 ≤ i ≤ n. It is well known that, for coherent BPSK, a union bound on the PER under ML decoding and a high-SNR approximation are given by [32, Ch. 10]

p_e ≤ Σ_{i=d_min}^{n} a_i Q(√(i E_s/N_0)) ≈ a_min Q(√(d_min E_s/N_0)),   (4)

where d_min is the minimum Hamming weight of the codewords (excluding 0), a_min is the number of minimum-weight codewords, and Q(·) is the Gaussian Q function.
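As a quick numerical illustration of the union bound and its high-SNR approximation, the sketch below evaluates both for the (7,4) Hamming code, whose weight distribution (a_3 = 7, a_4 = 7, a_7 = 1) is known in closed form. The code choice and SNR point are our own illustrative assumptions.

```python
import math

def q_func(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

# Weight distribution of the (7,4) Hamming code: a_3 = 7, a_4 = 7, a_7 = 1
weights = {3: 7, 4: 7, 7: 1}
d_min, a_min = 3, 7

def union_bound(es_n0):
    # full union bound of (4): sum_i a_i * Q(sqrt(i * Es/N0))
    return sum(a_i * q_func(math.sqrt(i * es_n0)) for i, a_i in weights.items())

def high_snr_approx(es_n0):
    # dominant minimum-weight term: a_min * Q(sqrt(d_min * Es/N0))
    return a_min * q_func(math.sqrt(d_min * es_n0))

es_n0 = 10.0  # Es/N0 = 10 in linear scale
bound = union_bound(es_n0)
approx = high_snr_approx(es_n0)
```

At this SNR the minimum-weight term already accounts for over 99% of the bound, consistent with the high-SNR approximation in (4).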

III. APPROACH TO FAST SIMULATION
While (4) provides a concise characterization of the PER under coherent Gray-coded QPSK (or BPSK), for an arbitrary code d_min and a_min are not always known, let alone the full weight distribution. Indeed, the evaluation of (1) can be far more intractable for more complicated coding and modulation schemes. As a result, simulation invariably becomes the recourse for determining the PER performance. However, an efficient simulation setup can often be obtained only with substantial design effort. In this section, we introduce the proposed technique for fast simulation of digital communication systems. To put it in perspective, we first review some features of the MC simulation technique and the IS approach. IS reduces the required simulation time by biasing the distribution of the random variables (typically the noise vectors) in the system. Such biasing is conventionally effected by mean translation, that is, translating the mean of the noise to points which best discriminate between error-causing and non-error-causing noise vectors. However, the determination of proper mean translations can be difficult for complicated codes. We develop a method that divides the simulation space into a number of regions ("bins") employing a function that roughly measures the propensity of a noise vector (a point in the noise space) to yield error. The method carries out MC simulation in each bin, but seeks to distribute the simulation samples among the bins for best efficiency. The resulting technique is thus termed histogram-shaping Monte Carlo (HSMC).

A. PERFORMANCE OF THE MONTE CARLO TECHNIQUE
Consider (1). MC replaces the last expectation by the sample mean to obtain an estimate of p_e as

p̂_MC = (1/n_MC) Σ_{j=1}^{n_MC} I_F(θ^(j)) = N_E/n_MC,   (5)

where n_MC is the number of simulation runs, N_E is the number of packet errors obtained in simulation, and θ^(j) is the jth simulation sample of Θ. By the fact that N_E has the binomial distribution arising from n_MC Bernoulli(p_e) trials, the variance of p̂_MC is given by

V(p̂_MC) = p_e(1 − p_e)/n_MC.   (6)

Normalizing the standard deviation of the estimate by its mean provides a way to characterize the estimation quality known as the relative precision, given by

ε_MC = √V(p̂_MC)/p_e = √((1 − p_e)/(n_MC p_e)) ≈ 1/√(n_MC p_e),   (7)

where the approximation holds for small p_e. To attain a relative precision ε_MC ≤ ε, therefore, MC requires n_MC ≥ (ε² p_e)^{−1} simulation runs. But since p_e is the a priori unknown parameter to be estimated, one cannot use it to fix n_MC. A common practice is to substitute p̂_MC for p_e in (7) and obtain n_MC p̂_MC ≈ ε_MC^{−2}. Then, for example, for a 10% relative precision, one may stop after collecting 100 packet errors.
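The stopping rule just described can be sketched as follows: simulate Bernoulli(p_e) packet-error events (abstracting away the actual coded transmission) until 100 errors are collected, at which point the estimated relative precision from (7) equals 1/√(n_MC p̂_MC) = 10%. The value p_e = 10^−2 is an arbitrary stand-in for the unknown PER.

```python
import math
import random

random.seed(7)
p_e = 1e-2            # true (normally unknown) PER of a toy system
target_errors = 100   # stop after 100 packet errors

n_runs, n_errors = 0, 0
while n_errors < target_errors:
    n_runs += 1
    # one simulation run, abstracted as a Bernoulli(p_e) packet-error event
    n_errors += random.random() < p_e

p_hat = n_errors / n_runs
# estimated relative precision, substituting p_hat for p_e in (7)
rel_prec = 1.0 / math.sqrt(n_runs * p_hat)
```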

B. IMPORTANCE SAMPLING
For small PER values, MC becomes computationally prohibitive because, at a given relative precision, n_MC is inversely proportional to p_e. IS seeks to reduce the number of simulation runs by drawing samples of Θ from a biased probability density function (PDF) f*_Θ(·) that is, for each b, more concentrated around F(b). The PER is then given by

p_e = E*[I_F(Θ)W(Θ)],   (8)

where E*[·] denotes the expectation with respect to f*_Θ(·) and the ratio W(·) ≜ f_Θ(·)/f*_Θ(·) is called the weight function. Similar to MC, IS replaces the last expectation in (8) by the sample mean to yield an estimate of p_e as

p̂_IS = (1/n_IS) Σ_{j=1}^{n_IS} I_F(θ*^(j)) W(θ*^(j)),   (9)

where θ*^(j) is the jth simulation sample of Θ generated using the biased PDF. The variance of p̂_IS is given by

V(p̂_IS) = (1/n_IS)(E*[I_F(Θ)W²(Θ)] − p_e²).   (10)

The well-known "unconstrained optimal" biased probability distribution is given by [33]

f*_opt(θ) = I_F(θ) f_Θ(θ)/p_e,   (11)

with associated weight W(θ) = p_e ∀θ ∈ F. The optimal biased PDF requires knowing p_e and is hence unworkable. Nevertheless, it has been considered useful in suggesting some IS design heuristics [15]. In particular, it has been used to motivate mean translation, as has the large deviations theory [14].
To illustrate the mean-translation technique and its issues, consider linear codes with Gray-coded QPSK modulation. Without loss of generality, consider transmitting the all-zero codeword, so that Θ = (0, Z). By (2), w_i = √(E_s/2) = 1/√2 ∀i (taking E_s = 1). Consider ML decoding. As indicated in (4), minimum-weight codewords dictate the PER performance at high SNR, so that

p_e ≈ Σ_a P(Z ∈ F_a) = a_min p_min,   (12)

where a is the index of a minimum-weight codeword, F_a is the set of noise vectors that would cause a pairwise decision error to codeword a, and p_min = Q(√(d_min E_s/N_0)) is the corresponding pairwise error probability (PEP). Let a(1), . . . , a(d_min) be the indices of the 1-bits in codeword a. Ideal mean translation captures all the pairwise error conditions with a Gaussian mixture [14]

f*_Z(z) = (1/a_min) Σ_a f_{Z+μ_a}(z),   (13)

where Z + μ_a (≜ Z*) denotes a translation of Z by μ_a, that is, Z + μ_a ∼ N(μ_a, (σ²/2)I), with μ_a being a vector whose elements are all zero except for those having indices a(1), . . . , a(d_min), which equal −w_{a(k)} = −1/√2. In other words, Z + μ_a is centered at the nearest decision boundary associated with pairwise decision error to codeword a. (Incidentally, the simulation for the aforementioned effects of the trapping sets in LDPC codes can be done similarly [11], [16], [17], [18], [19], [20].) For a first-order understanding of the efficiency of mean translation, consider the hit rate of Z* ∈ F_a and the expected value of the weight function in F_a. The probability of Z* ∈ F_a can be obtained as

P(Z* ∈ F_a) = (1/a_min) Σ_{a′} P(Z + μ_{a′} ∈ F_a) ≈ 0.5/a_min,   (14)

where the approximation is due to the facts that P(Z + μ_a ∈ F_a) ≈ 0.5 and P(Z + μ_{a′} ∈ F_a) ≈ 0 ∀a′ ≠ a. The expected value of the weight function conditioned on Z* ∈ F is then given by

E*[W(Z*) | Z* ∈ F] = p_e/P(Z* ∈ F) ≈ 2p_e.   (15)

Thus, with such mean translation the overall hit rate P(Z* ∈ F) = Σ_a P(Z* ∈ F_a) is raised to 0.5 and the expected value of the weight function is of a similar order as p_e. This comes close to the "unconstrained optimal" solution [15] and is optimal in the large-deviations-theory sense [14].
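The mean-translation idea can be illustrated with a one-dimensional toy analogue of (12)-(15): estimate p = P(Z > d) = Q(d/σ) by translating the noise mean to the decision boundary d. The hit rate comes out near 0.5 and the estimate is accurate with only 10^4 samples, mirroring (14) and (15). This is only an analogue of the codeword-level Gaussian mixture (13); d and σ are arbitrary choices of ours.

```python
import math
import random

random.seed(3)
sigma, d = 1.0, 5.0   # boundary at 5 sigma: p = Q(5) ~ 2.9e-7, hopeless for MC
n = 10_000

hits, acc = 0, 0.0
for _ in range(n):
    z = random.gauss(d, sigma)   # mean-translated noise: Z* ~ N(d, sigma^2)
    if z > d:                    # error event
        hits += 1
        # weight W(z) = f(z)/f*(z) = exp((d^2 - 2*d*z) / (2*sigma^2))
        acc += math.exp((d * d - 2 * d * z) / (2 * sigma * sigma))

p_hat = acc / n
hit_rate = hits / n
p_true = 0.5 * math.erfc(d / (sigma * math.sqrt(2.0)))   # Q(d/sigma)
```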
Its fundamental appeal notwithstanding, an application of the technique to complicated systems requires the appreciation of two issues as follows.

1) UNDERESTIMATION OF PER DUE TO INCOMPLETE KNOWLEDGE OF ERROR STRUCTURES
As already indicated, the "error-prone structures" of a code are not always known, such as the set of minimum-weight codewords of a linear code or the trapping sets of an LDPC code. One could do a computer search for these structures, but for complicated codes this is not necessarily practical, as it can be an NP-complete problem [20], [34]. Hence one may need to settle for a simulation based on known structures only. This could result in underestimation of the PER with mean-translating IS. We explain the mechanism in Appendix A.

2) EXTENSION TO HIGHER-ORDER QAM
Mean translation works only when the nearest points on the error region's boundary are well characterized [14]. This is the case with Gray-coded QPSK under a channel code with known distance structures. For higher-order QAM, however, the (nonlinear) mapping between code bits and constellation points can drastically complicate the structure of the error region associated with a codeword. For example, the 16QAM plot in Fig. 3 shows that one noise element (say, Z_i) impacts the reliabilities of multiple code bits (two in this case). Moreover, an inner constellation point has more nearest neighbors than an outer point and is thus more susceptible to error. Further, one cannot simulate only the all-zero codeword because, unlike with Gray-coded QPSK, code linearity is not preserved in the QAM domain.

C. HISTOGRAM-SHAPING MONTE CARLO
In view of the above issues of mean translation, we look for a different IS method. Existing theory [14], [15] indicates that an efficient method should seek to identify the error set F(B) and sample it frequently enough.
To identify the error set, we consider defining a real-valued function (or random variable) Q(B, Z) whose values are correlated with the PER. In other words, the function measures, in some way, the likelihood that a certain noise vector z will cause a decoding error for a certain information vector b, and in this way provides a "soft identification" of the error set. (An unrealistic ideal is to have a binary-valued Q(B, Z) that unambiguously indicates whether any given z will result in a decoding error for any given b and thereby provides a "hard identification" of the error set.) From elementary probability theory, the marginal PDF of Q ≜ Q(B, Z) is given by

f_Q(q) = E_B[f_{Q|B}(q | B)],   (16)

and, by independence between Z and B, the conditional PDF of Z given B is

f_{Z|B}(z | b) = f_Z(z).   (17)

We may therefore rewrite p_e as (cf. (1))

p_e = ∫ p_e(q) f_Q(q) dq,   (18)

where p_e(q) is the PER contributed by all b and z for which Q(b, z) = q. Fig. 4 illustrates typical shapes of f_Q(q), p_e(q), and f_Q(q)p_e(q) that we can normally attain. (Examples will be given later.) Practicality limits us to consider only a finite range of q, Ω = [q_min, q_max], as shown in the figure, where Ω should be large enough so that the approximation p_e ≈ ∫_{q_min}^{q_max} p_e(q)f_Q(q) dq holds. Also for practicality, we divide Ω into a number of bins (intervals) as illustrated. Having obtained a "soft identifier" of the error set, we now consider the simulation method. Conventional MC is such that the simulation samples would observe the distribution f_Q(q), while only a limited portion of the q values contribute significantly to the PER. Efficient IS should bias the sampling of Z such that the corresponding q values fall more in the range where f_Q(q)p_e(q) has more substantial values. For this, let Ω be divided into n_b bins denoted Ω_m, m = 1, 2, . . . , n_b, respectively, and denote the number of simulation samples in the mth bin by n_m. We thus have n_IS = Σ_{m=1}^{n_b} n_m, and the set {n_m, m = 1, . . . , n_b} constitutes a histogram of the q values.
The corresponding biased (sampling) PDF is given by

f*_Q(q) = Σ_{m=1}^{n_b} (n_m/n_IS) I_m(q) f_Q(q)/p_m,   (19)

where I_m(q) is the indicator function for Ω_m and p_m ≜ P(Q ∈ Ω_m). Contextualizing (8)-(9) to this setting, the corresponding PER estimator is given by

p̂_e = Σ_{m=1}^{n_b} p_m (E_m/n_m),   (20)

where E_m is the number of decoding errors observed among the n_m samples of bin Ω_m; the set {E_m} ∀m constitutes the histogram of decoding errors. Note that the ratio E_m/n_m is precisely the MC estimate, from n_m simulation samples, of the conditional PER p_err|m in Ω_m, where

p_err|m ≜ P(Θ ∈ F | Q ∈ Ω_m) = (1/p_m) ∫_{Ω_m} p_e(q) f_Q(q) dq.   (21)

By the fact that MC estimations are unbiased, p̂_err|m ≜ E_m/n_m provides an unbiased estimate of p_err|m, and hence p̂_e from (20) provides an unbiased estimate of p_e. We refer to the simulation technique summarized in (20) as HSMC, which performs MC estimation of the conditional PER in each bin and averages over the bin-wise estimates to obtain the overall PER estimate. The technique is termed "histogram-shaping" because a key step in its operation is to control the histogram {n_m}, as discussed below.
In HSMC, the set {p_m} constitutes the weight function. In contrast to conventional IS, the function is deterministic rather than random and, depending on the design, can be evaluated beforehand so that it does not have to be evaluated per sample. A known weight function also facilitates a simple closed-form characterization of the estimator variance. In particular, since p̂_e is a weighted sum of MC estimates, its variance is given by a square-weighted sum of MC variances (see (6)) as

V(p̂_e) = Σ_{m=1}^{n_b} p_m² p_err|m(1 − p_err|m)/n_m.   (22)

Minimization of the variance therefore depends on optimization of the histogram {n_m} subject to a constraint on the total number of simulation samples n_IS. A routine application of the Lagrange multiplier method yields

n_m* = n_IS p_m √(p_err|m(1 − p_err|m)) / Σ_{m′=1}^{n_b} p_{m′} √(p_err|m′(1 − p_err|m′)).   (23)

Substituting it into (22) gives the minimum variance as

V*_Q(p̂_e) = (1/n_IS) (Σ_{m=1}^{n_b} p_m √(p_err|m(1 − p_err|m)))²,   (24)

where subscript Q indicates that the minimum depends on the distributions of p_m and p_err|m, which in turn depend on how Q(B, Z) is defined. A useful measure of the efficiency of HSMC versus MC is provided by the ratio n_MC/n_IS at equal estimator variance, which may be termed the ideal speedup factor and is given by

S_Q = p_e(1 − p_e) / (Σ_{m=1}^{n_b} p_m √(p_err|m(1 − p_err|m)))².   (25)

Since p_err|m ∀m can only be estimated after simulation, (23) cannot be used to set the histogram beforehand. But we can use it in an adaptive manner to update the histogram as simulation progresses and some estimates of p_err|m become available. The details are discussed in a subsequent section. In Appendix B, we show (essentially) that V*_Q(p̂_e) is upper bounded by V(p̂_MC) and that the upper bound is attained when p_err|m is constant ∀m. In other words, any function Q(B, Z) that yields a nonconstant p_err|m over m would be more efficient than MC under the optimal histogram. Indeed, from (24), zero variance can be achieved by having p_err|m = 0 or 1 ∀m and, from (21), this is the case when

F(b) = {z | Q(b, z) ∈ Ω_E} ∀b, for some union Ω_E of bins,   (26)

that is, when the bins dichotomize into purely error-causing and purely non-error-causing ones. Unsurprisingly, this is in general as unrealizable as the "unconstrained optimal" biased distribution of (11), despite any difference in the underlying mechanism.
But it suggests that we find a function that marks well the propensity of any noise vector z to cause error for any information vector b. We have termed such a function a noise gauging function (NGF) in previous work [31]. It can be expected that, for complicated transmission systems, a close-to-ideal NGF may be hard to come by and, even if obtainable, its computational complexity may be very high. In particular, the corresponding evaluation of p_m and sampling within Ω_m for any m may be highly arduous work.
In this situation, it should be sensible to employ an NGF that may be less ideal but is easy to work with, such as one with a closed-form expression for the PDF of Q that also facilitates the sampler design. With a view to the tradeoff between efficiency and complexity, we develop several practical NGFs in the next section. The following section develops the proposed method for adaptive shaping of the sample histogram towards its optimal form under a chosen NGF.
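To summarize the machinery of this section, the sketch below exercises the HSMC estimator (20), the optimal allocation (23), and the ideal speedup factor (25) on a synthetic example in which Q ∼ N(0, 1) and the conditional error probability is a logistic ramp of our own choosing (standing in for the a priori unknown p_err|m); everything except the equation numbers is an illustrative assumption.

```python
import math
import random
from statistics import NormalDist

random.seed(11)
nd = NormalDist()   # let Q ~ N(0, 1) play the role of the NGF output

def p_err(q):
    # synthetic conditional PER: a logistic ramp standing in for the
    # (normally unknown) true error behavior; chosen purely for illustration
    return 1.0 / (1.0 + math.exp(-4.0 * (q - 2.5)))

n_b, q_min, q_max = 20, -4.0, 4.0
dq = (q_max - q_min) / n_b
edges = [q_min + m * dq for m in range(n_b + 1)]
p_m = [nd.cdf(edges[m + 1]) - nd.cdf(edges[m]) for m in range(n_b)]
p_bin = [p_err(0.5 * (edges[m] + edges[m + 1])) for m in range(n_b)]

p_true = sum(pm * pb for pm, pb in zip(p_m, p_bin))  # exact PER of the toy model

# HSMC with a flat histogram: n_m Bernoulli trials per bin, then estimator (20)
n_m = 500
p_hat = 0.0
for pm, pb in zip(p_m, p_bin):
    E_m = sum(random.random() < pb for _ in range(n_m))
    p_hat += pm * E_m / n_m

# optimal histogram (23) and ideal speedup factor (25), using the true p_err|m
s = sum(pm * math.sqrt(pb * (1.0 - pb)) for pm, pb in zip(p_m, p_bin))
n_total = n_b * n_m
n_opt = [n_total * pm * math.sqrt(pb * (1.0 - pb)) / s
         for pm, pb in zip(p_m, p_bin)]
speedup = p_true * (1.0 - p_true) / (s * s)
```

Even this crude one-dimensional toy yields an order-of-magnitude ideal speedup over MC, because most bins have p_err|m near 0 or 1 and thus contribute little variance.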

IV. PRACTICAL NOISE GAUGING
In this section, we present several practical NGFs for which the PDF of Q can be readily obtained and the corresponding sampling of Z conditioned on Q ∈ Ω_m is also relatively straightforward.
To start, we define Ω = [q_min, q_max] such that P(Q < q_min) = P(Q > q_max) ≈ p_out, where p_out = 0.01 p_e, so as to make Ω cover at least 98% of the total probability of f_Q(q)p_e(q). For example, to simulate ultra-reliable communication we may let p_out = 10^−11. In the absence of knowledge of p_e(q), we divide Ω uniformly into n_b bins with bin width Δq = (q_max − q_min)/n_b, so that Ω_1 = [q_min, q_min + Δq] and Ω_m = (q_min + (m − 1)Δq, q_min + mΔq] for m = 2, . . . , n_b. In this manner, p_m can be calculated from the PDF of Q and only p_err|m needs to be evaluated by simulation for any m.
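The bin construction just described can be coded directly when the PDF of Q is known. Below is a sketch for a hypothetical Q ∼ N(0, s²) (as with the NGF Q_1 developed later in this section), using p_out = 10^−11; the value of s is an arbitrary assumption of ours.

```python
from statistics import NormalDist

s = 0.05                          # hypothetical std of Q, e.g., sqrt(sigma^2/(2l))
nd = NormalDist(0.0, s)

p_out = 1e-11                     # tail mass excluded on each side
q_max = nd.inv_cdf(1.0 - p_out)   # P(Q > q_max) ~ p_out
q_min = nd.inv_cdf(p_out)         # P(Q < q_min) ~ p_out

n_b = 20
dq = (q_max - q_min) / n_b
edges = [q_min + m * dq for m in range(n_b + 1)]
# bin probabilities p_m, in closed form from the PDF of Q
p_m = [nd.cdf(edges[m + 1]) - nd.cdf(edges[m]) for m in range(n_b)]
coverage = sum(p_m)               # should be ~ 1 - 2*p_out
```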
To facilitate sampling of noise vectors conditioned on some value of Q, say Q = q ∈ Ω_m, we note from [35] that the sampler is particularly easy to implement when Q(B, Z) is a sufficient statistic for some parameter of the PDF of Z. In particular, the PDF of a Gaussian random vector is characterized by its mean and autocovariance, which have well-known sufficient statistics [36]. In general, the sampling of a random vector conditioned on a certain sufficient statistic having a certain value can be effected by unconditioned sampling of the random vector followed by adjustment of the sample values to satisfy the given condition on the sufficient statistic [35]. Thus, in the case of a white Gaussian random vector Z, its sampling conditioned on a sufficient statistic for mean or variance having a certain value can be effected by unconditioned sampling followed by properly shifting or scaling the sample values to satisfy the given condition.
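The shift-and-scale recipe can be sketched as follows; both conditionings are exact by construction, per the sufficient-statistic argument of [35] summarized above. The dimensions and target values are arbitrary choices of ours.

```python
import math
import random

random.seed(5)
l, sigma2 = 64, 1.0

# unconditioned white Gaussian sample
z = [random.gauss(0.0, math.sqrt(sigma2)) for _ in range(l)]

# condition on the sufficient statistic for variance: force sum(z_i^2) = t
t = 80.0
scale = math.sqrt(t / sum(zi * zi for zi in z))
z_var = [scale * zi for zi in z]    # scaling: sum of squares now equals t

# condition on the sufficient statistic for mean: force (1/l)*sum(z_i) = m
m = 0.25
shift = m - sum(z) / l
z_mean = [zi + shift for zi in z]   # shifting: sample mean now equals m
```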
In the above regard, in this section we develop two noise gauging methods designated Q 1 (·) and Q 2 (·) that are closely associated with the sufficient statistics for mean and variance of Z, respectively. Both are code-agnostic in the sense that they assume no knowledge of the error-prone structures of a code. Between them, Q 1 (·) is linear in Z. Part of its attractiveness is its amenability to theoretical performance analysis, and it can be shown to be asymptotically (as n → ∞) optimal under random linear coding. But it has some limitations, which prompts our designing the NGF Q 2 (·) that takes into account only the noise elements that are conducive to decoding error.

A. CODE-AGNOSTIC LINEAR NOISE GAUGING
Consider QPSK. From (2), we see that a Z_i having a different sign than W_i should be more evocative of decoding error than one having a similar sign. Thus let Z̃_i ≜ −√2 W_i Z_i. For a linear code, it is understood that, in high SNR, the maximum of Σ_{k=1}^{d_min} Z̃_{a(k)} over all minimum-weight codewords a correlates well with the ML soft-decision decoding error probability. But using it for noise gauging implies a code-specific design, for which some issues have been discussed previously. An appealing code-agnostic NGF appears to be

Q_1(b, z) ≜ (1/l) Σ_{i=1}^{l} z̃_i = −(√2/l) Σ_{i=1}^{l} w_i z_i   (27)

(where recall that w_i and z_i denote sample values of W_i and Z_i, respectively). By the assumption that Z ∼ N(0, (σ²/2)I), we also have Z̃ ≜ [Z̃_1, . . . , Z̃_l] ∼ N(0, (σ²/2)I), and thus the random variable Q_1 ≜ Q_1(B, Z) has PDF N(0, σ²/(2l)) and p_m can be readily evaluated.
After evaluating p_m, HSMC requires the evaluation of p_err|m ∀m. For linear coding with QPSK modulation, by symmetry of the signal structure we only need to simulate the transmission of b = 0 to determine the complete system performance. But for generality, in the below we also consider sampling over B. Then, by the law of total expectation, we can rewrite (21) as

p_err|m = E_B[ E_{Q_1|Ω_m}[ E_Z[I_F(B, Z) | B, Q_1, Q_1 ∈ Ω_m] ] ] = E_B[ E_{Q_1|Ω_m}[ E_Z[I_F(B, Z) | B, Q_1] ] ],   (28)

where E_{Q_1|Ω_m} denotes expectation with respect to the conditional PDF

f_{Q_1|Ω_m}(q) = I_m(q) f_{Q_1}(q)/p_m,   (29)

and the last equality in (28) holds because the second expectation has made Q_1 ∈ Ω_m and hence the conditioning of the innermost expectation on Ω_m becomes redundant. According to the sequence of conditioning set forth in (28), the generation of each simulation sample (indexed j) progresses in the order B → Q_1 → Z. To start, we draw an information vector sample b^(j), encode it into c^(j), and obtain w^(j) = [w_1^(j), . . . , w_l^(j)]. Next, we draw a sample q_1^(j) of Q_1 according to (29). Then we first draw a Gaussian vector sample z̄^(j) according to N(0, (σ²/2)I) and add an equal bias to each element of z̄^(j) to make their sum equal to l·q_1^(j). This, according to [35], yields a proper sample of Z̃ | (Q_1 = q_1^(j)), which we denote z̃^(j). Simple algebra shows that its elements are given by

z̃_i^(j) = z̄_i^(j) + q_1^(j) − (1/l) Σ_{k=1}^{l} z̄_k^(j).   (30)

The desired noise vector sample z^(j) is then obtained by letting z_i^(j) = −√2 w_i^(j) z̃_i^(j) ∀i. Algorithm 1 (Sample Generation Method Under Proposed Linear Noise Gauging) shows a summary of the sampling method.
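A runnable sketch of the sampling order B → Q_1 → Z is given below. The encoder is a toy stand-in of ours (a repetition code; the method itself is encoder-agnostic), E_s is normalized to 1, the bin edges and σ² are arbitrary, and q_1 is drawn from the truncated Gaussian f_{Q_1}(q)/p_m on the bin via inverse-CDF sampling.

```python
import math
import random
from statistics import NormalDist

random.seed(9)

def encode(b):
    # toy stand-in encoder (hypothetical rate-1/2 repetition code);
    # the sampling method itself does not depend on the encoder
    return [bit for bit in b for _ in (0, 1)]

k, sigma2, Es = 8, 0.25, 1.0   # Es normalized to 1 so that w_i = +/- 1/sqrt(2)
b = [random.getrandbits(1) for _ in range(k)]       # draw b^(j)
c = encode(b)                                       # encode into c^(j)
l = len(c)                                          # QPSK: l = n
w = [(1 - 2 * ci) * math.sqrt(Es / 2) for ci in c]

# draw q1^(j) from Q1 ~ N(0, sigma^2/(2l)) truncated to the bin [lo, hi],
# via inverse-CDF sampling of the truncated Gaussian
nd = NormalDist(0.0, math.sqrt(sigma2 / (2 * l)))
lo, hi = 0.15, 0.20
q1 = nd.inv_cdf(random.uniform(nd.cdf(lo), nd.cdf(hi)))

# draw zbar ~ N(0, (sigma^2/2) I) and bias it so its mean equals q1
zbar = [random.gauss(0.0, math.sqrt(sigma2 / 2)) for _ in range(l)]
bias = q1 - sum(zbar) / l
ztilde = [zi + bias for zi in zbar]

# map back to the actual noise sample: z_i = -sqrt(2) * w_i * ztilde_i
z = [-math.sqrt(2) * wi * zti for wi, zti in zip(w, ztilde)]

# check: the NGF value (27) of the generated noise equals q1
Q1_val = -(math.sqrt(2) / l) * sum(wi * zi for wi, zi in zip(w, z))
```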
It is of interest to understand how Q_1(·) performs statistically over different codes. Thus consider random linear (n, k) coding, for which the probability of any binary n-vector being a codeword is 2^{k−n}, and thus the average number of codewords of weight i is equal to C(n, i) 2^{k−n}, where C(n, i) denotes the binomial coefficient. Then for the PER at Q_1 = q, we have an average union bound as

p̄_e(q) ≤ Σ_{i=1}^{n′} C(n, i) 2^{k−n} p_e(i, q),   (31)

where p_e(i, q) is the PEP of decoding to a particular weight-i codeword conditioned on Q_1 = q, and n′ < n may be chosen to yield a suitably tight bound.
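The average weight enumerator underlying (31) is easy to tabulate. The sketch below does so for a hypothetical (64, 32) ensemble and, for illustration only, pairs it with the unconditional PEP of (4) in place of the conditional p_e(i, q) (whose exact form depends on q); the ensemble parameters, SNR, and truncation point n′ are our own assumptions.

```python
import math

def q_func(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

n, k = 64, 32   # hypothetical random linear (n, k) ensemble, rate 1/2

# average number of weight-i codewords in the ensemble: C(n, i) * 2^(k-n)
avg_a = [math.comb(n, i) * 2.0 ** (k - n) for i in range(n + 1)]

def avg_union_bound(es_n0, n_prime):
    # ensemble-average union bound truncated at weight n_prime (cf. (31)),
    # with the unconditional PEP from (4) standing in for p_e(i, q)
    return sum(avg_a[i] * q_func(math.sqrt(i * es_n0))
               for i in range(1, n_prime + 1))

b_low = avg_union_bound(2.0, 16)    # truncated at n' = 16
b_full = avg_union_bound(2.0, n)    # all weights
```

As a sanity check, the average enumerator sums to 2^k codewords, and truncating the sum can only lower the bound.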
The conditional PER p_err|m can be evaluated as the mean of p_e(Q_1) over the mth bin. Fig. 5(a) shows the ideal speedup factors for some code lengths under different bin counts. It is not surprising that a finer division of the noise space tends to enable a better optimization of the histogram, and V*_Q(p̂_e) should converge to (∫ √(p_e(q)(1 − p_e(q))) f_{Q_1}(q) dq)² / n_IS as the bin count goes to infinity. The fluctuations in speedup factor at low n_b values, which are particularly prominent for n ≥ 512, come about because for some values of n_b a bin edge fortuitously falls on a value of q that dichotomizes the noise space into an error-causing subset (where f_Q(q) p_e(q) > 0) and an almost non-error-causing subset (where f_Q(q) p_e(q) ≈ 0). (Recall the illustration in Fig. 4.) Consequently, these values of n_b yield a higher speedup than their immediate neighbors, which make a less clear-cut division between the two subsets. The effect is especially prominent when n_b is small because the binning of q is coarse. From Fig. 5(a), a reasonable rule of thumb seems to be to let n_b = 20 if the ideal speedup factor is below 1000 and n_b = 100 otherwise. Fig. 5(b) shows the ideal speedup factors at different SNR values. It indicates that HSMC with Q_1 should be increasingly efficient as the code length and SNR increase. For example, it can be 10⁹ times faster than MC with n = 2048 at E_s/N_0 = 5 dB.
It is known from information theory that, as n → ∞, typical random linear codes have a minimum distance that grows linearly with n in the sense that H(d_min/n) = 1 − r, where H(p) = −p log₂ p − (1 − p) log₂ (1 − p) is the binary entropy function [37]. Let α ≜ n/d_min. Then, with i = d_min in (31), as n → ∞ or σ² → 0 (SNR → ∞) we have σ²/(2n) → 0 and p_e(d_min, q) approaches a unit step function in q with the step located at q = √E_s/2. Hence the set {z | Q_1(b, z) ≥ √E_s/2} is precisely the error region F(b) pertaining to some minimum-distance codeword corresponding to a given b, and Q_1(·) becomes an optimal NGF in the sense of (26). However, the above results for random coding may not hold for common practical codes, because a practical linear code may have d_min ≪ n as n → ∞ (e.g., common convolutional codes). Consider the optimal histogram of (23). Since Q_1 ∼ N(0, σ²/(2l)) and √(p_err|m (1 − p_err|m)) ≤ 0.5 for 0 ≤ p_err|m ≤ 1, the optimal histogram for q becomes concentrated at 0 as n → ∞, resulting in p_e(d_min, q) → Q(√(d_min E_s/N_0)). In other words, the optimal HSMC degenerates to MC in this situation. In addition, Q_1(·) is applicable only to QPSK (as well as binary signaling). Thus we turn to another way of code-agnostic noise gauging, which can alleviate these deficiencies of Q_1(·).
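The asymptotic relation H(d_min/n) = 1 − r can be solved numerically for the minimum-distance fraction. The sketch below does so by bisection on (0, 1/2), where H is monotone increasing; the function names are illustrative.

```python
import math

def H2(p):
    """Binary entropy function H(p) = -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def dmin_fraction(r, tol=1e-10):
    """Solve H2(delta) = 1 - r for delta in (0, 1/2) by bisection: the
    asymptotic d_min/n of typical random linear codes of rate r."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if H2(mid) < 1 - r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

d = dmin_fraction(0.5)
print(d)  # for r = 1/2, roughly 0.11
```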

B. CODE-AGNOSTIC QUADRATIC NOISE GAUGING
Based on (26), an efficient NGF should distinguish well between error-causing and non-error-causing noise vectors. Consider QPSK for the moment. A noise vector with a greater Euclidean norm is more likely to induce a decoding error than one with a smaller norm, but a noise component (however large it may be) that increases the distance of a transmitted signal point from the other points does not induce error. We therefore define a quadratic NGF as Q_2(b, z) ≜ Σ_{i∈Λ(b,z)} z_i², where Λ(b, z) denotes the set of indices i for which z_i shifts w_i towards a neighboring constellation point (for QPSK, those i with z_i of opposite sign to w_i). In this way, z_i contributes to Q_2(b, z) only if the direction of the noise is towards a neighboring constellation point. The squaring of z_i not only accords with the error-inducing propensity of large noise but also facilitates the derivation of the PDF of Q_2 ≜ Q_2(B, Z). For this, since each Z_i has equal probability of being positive or negative, the cardinality D ≜ |Λ(B, Z)| is binomially distributed with probability mass function (PMF) f_D(d) = C(l, d) 2^(−l). As a result, Q_2 is a mixture of scaled chi-squared random variables with cumulative distribution function (CDF) F_{Q_2}(q) = Σ_{d=0}^{l} f_D(d) F_χ^(d)(2q/σ²), where F_χ^(d)(·) represents the CDF of a chi-squared random variable with d degrees of freedom (DoF).
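The quadratic NGF and the binomial law of D can be sketched directly; the index-set rule below (opposite sign to the signal component) follows the QPSK reading above, and the symbol names are illustrative.

```python
import math

def q2_qpsk(w, z):
    """Quadratic NGF for QPSK: sum of z_i^2 over the indices where the
    noise points toward the neighboring constellation point, i.e., where
    z_i has the opposite sign to w_i.  Returns (Q2 value, |Lambda|)."""
    idx = [i for i in range(len(w)) if z[i] * w[i] < 0]
    return sum(z[i] ** 2 for i in idx), len(idx)

# D = |Lambda(B, Z)| is Binomial(l, 1/2); the PMF should normalize to 1
l = 16
pmf = [math.comb(l, d) * 2 ** (-l) for d in range(l + 1)]
print(sum(pmf))  # 1.0
```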
HSMC requires the evaluation of p_m and p_err|m ∀m. As p_m can be evaluated from the above CDF, the below concentrates on how to estimate p_err|m. Similar to the case of Q_1(·), by the law of total expectation we can rewrite (21) as (37), where the sampling of Q conditioned on the mth bin in (21) has been replaced by sampling of D, and then of Q_2 given D, both conditioned on the mth bin. According to the sequence of conditioning set forth in (37), the generation of each simulation sample (indexed j) progresses in the order B → D → Q_2 → Z. To start, we draw an information vector sample b^(j) and encode it into c^(j) to obtain w^(j). Based on the bin-conditioned probability distributions of D and of Q_2 given D (detailed expressions given in Appendix C), we employ the A/R method to draw d^(j) and q_2^(j). We then draw s̄^(j) ∼ N(0, I) and randomly flip the signs of its elements until |Λ^(j)| = d^(j), letting the result be s^(j). Next, we scale the elements in s^(j) whose indices belong to Λ^(j) to make their sum-of-squares equal to q_2^(j) (which by [35] yields the proper conditional noise PDF). Specifically, we set σ~² = 2q_2^(j) and obtain the jth sample of Z|D, Q_2 according to (38). Algorithm 2 summarizes the sampling method. The most computationally demanding part in it is the numerical evaluation of the chi-squared CDF needed for (60) and (62) (see Appendix C), but the CDF can be precomputed and tabulated so as to make the computational overhead of sampling much lower than the decoding complexity required to determine the indicator I_F(b^(j), z^(j)) from the received signal vector y^(j).
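The sign-flip-and-scale step of the sampling can be sketched as below. This is only my reading of the partially garbled algorithm: components opposing the signal are scaled to hit the target sum-of-squares q_2, and the remaining components are given the nominal noise standard deviation; the function name and the treatment of the off-set components are assumptions.

```python
import math
import random

def sample_z_given_d_q2(w, d, q2, sigma2, rng):
    """Sketch of Algorithm 2's noise generation: draw s ~ N(0, I), flip
    random signs until exactly d components oppose the sign of w, scale
    those d components so their sum-of-squares equals q2, and scale the
    rest to the nominal noise std sqrt(sigma2 / 2)."""
    l = len(w)
    s = [rng.gauss(0, 1) for _ in range(l)]
    toward = [i for i in range(l) if s[i] * w[i] < 0]
    while len(toward) != d:
        i = rng.randrange(l)          # random toggles converge to count d
        s[i] = -s[i]
        toward = [j for j in range(l) if s[j] * w[j] < 0]
    ss = sum(s[i] ** 2 for i in toward)
    scale_in = math.sqrt(q2 / ss) if ss > 0 else 0.0
    sd = math.sqrt(sigma2 / 2)
    tset = set(toward)
    return [s[i] * scale_in if i in tset else s[i] * sd for i in range(l)]

rng = random.Random(2)
w = [1 / math.sqrt(2)] * 8
z = sample_z_given_d_q2(w, 3, 0.9, 1.0, rng)
inside = [zi ** 2 for zi, wi in zip(z, w) if zi * wi < 0]
print(len(inside), sum(inside))  # 3 components, summing to 0.9
```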

C. EXTENSION OF QUADRATIC NOISE GAUGING TO HIGHER-ORDER QAM
The above discussion on noise gauging has been restricted to QPSK. In this section, we extend the quadratic NGF to deal with 16QAM; the idea can be further extended to deal with other higher-order QAMs.

Algorithm 2: Sample Generation Method Under Proposed Quadratic Noise Gauging. (Recoverable steps: draw q_2^(j) via A/R with (62); draw s̄^(j) ← N(0, I) and randomly flip the signs of elements until |Λ^(j)| = d^(j), letting the result be s^(j); set σ~² = 2q_2^(j) and form z^(j) according to (38).)

The idea is to redefine Λ(b, z) (the set of noise indices over which the sum-of-squares is computed) according to the modulation method: an index i is included in Λ(b, z) only if z_i shifts w_i towards a neighboring constellation point. Consider the 16QAM constellation in Fig. 3. In the case w_i takes an "outer value," i.e., w_i = ±3/√10, index i is included in Λ(b, z) if z_i has an opposite sign with respect to w_i (a case similar to that of QPSK). But in the case w_i takes an "inner value," i.e., w_i = ±1/√10, index i is always included in Λ(b, z), since there is a neighboring constellation point on either side of w_i. An exact evaluation of f_D(d) can be very complicated because it depends on the details of the QAM mapping (see Fig. 3); although, in principle, it can be evaluated by enumeration, it can be well approximated in the following way. To begin, let D_i denote the number of terms contributed by Z_i to Λ(B, Z), where, for 16QAM, w ∈ {±1/√10, ±3/√10}. The approximation consists in assuming that P(W_i = w) = 0.25 ∀w ∀i and that the D_i are independent, so that the f_{D_i}(d) are identical ∀i and f_D(d) ≈ f_{D_i}^{*l}(d), where * denotes convolution and the superscript *l denotes l-fold self-convolution of the given function. The approximation is particularly appropriate when the QAM mapping is randomized, such as with an interleaver or scrambler. Substituting f_D(d) into (35) yields an approximation to F_{Q_2}(·), which can be used to evaluate p_m. To estimate p_err|m, we need to sample (B, Z) conditioned on the mth bin.
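The convolution approximation of f_D can be sketched as follows. The per-component law f_{D_i} used here follows my reading of the 16QAM rule above (an outer value contributes with probability 1/2, an inner value always contributes, all four values equiprobable, giving P(D_i = 1) = 3/4); it is an assumption, not a formula taken from the text.

```python
def convolve(a, b):
    """Discrete convolution of two PMFs given as lists indexed from 0."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

# Per-component term count D_i for 16QAM (assumed): an outer value
# (+-3/sqrt(10)) contributes with prob 1/2 (noise must point inward),
# an inner value (+-1/sqrt(10)) always contributes (neighbors both sides).
# With all four component values equiprobable, P(D_i = 1) = 3/4.
f_di = [0.25, 0.75]

l = 8
f_d = [1.0]
for _ in range(l):
    f_d = convolve(f_d, f_di)  # l-fold self-convolution f_{D_i}^{*l}

print(sum(f_d))  # a valid PMF over d = 0, ..., l
```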
Different from QPSK, the nonlinearity in the bit-to-signal mapping of 16QAM (or any other higher-order QAM) breaks the symmetry about b = 0 in a linear code, making the sampling of B a necessity rather than an option. Moreover, B and the event that Q_2 falls in the mth bin are no longer statistically independent, because the modulated signal associated with one information vector may have different counts of inner and outer values than that associated with another, resulting in a distribution of D that depends on B and hence dependence between Q_2 and B. Therefore, in contrast to the QPSK case in (37), we have (43), whose innermost expectation is of the indicator I_F(B, Z). Compared to (37), the two outermost expectations in (43) are additionally conditioned on the bin event and on B, respectively. We derive the corresponding conditional probability distributions of B (given the bin) and of D (given B and the bin) in Appendix D.
To draw samples b^(j) according to the bin-conditioned distribution f_{B|m}(·), the A/R method requires a good grasp of the shape of f_{B|m}(·) over the sample space of B; otherwise it risks a very high rejection rate and hence low sampling efficiency. The A/R method is thus only suitable for very short B (say k < 30), because for long vectors it is difficult to scrutinize the sample space exhaustively enough for a detailed understanding of f_{B|m}(·). Therefore, we resort to the Metropolis algorithm [38]. Different from the A/R method, which accepts or rejects each random sample individually according to f_{B|m}(·), the Metropolis algorithm makes the decision according to the relative probability of two successive samples. Specifically, suppose we have generated b^(j−1). To generate b^(j), we first draw a candidate sample b_t from f_B(·) and a value u_t from U[0, 1] (where U[0, 1] denotes the uniform distribution over [0, 1]). Then we set b^(j) = b_t if u_t ≤ f_{B|m}(b_t)/f_{B|m}(b^(j−1)), and set b^(j) = b^(j−1) otherwise. In the end, the set of samples {b^(j), j = 1, 2, ...} will follow the distribution f_{B|m}(·).
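A minimal sketch of this sampler is given below, with candidates drawn from the unconditioned prior as in the text. The toy target density is purely hypothetical, standing in for f_{B|m}; it is not the paper's distribution.

```python
import math
import random

def metropolis_bits(k, log_target, n_samples, rng):
    """Metropolis sampler over length-k bit vectors.  Candidates come from
    the uniform prior (playing the role of f_B) and are accepted with
    probability min(1, target(cand)/target(prev)); log_target is a
    user-supplied unnormalized log-density standing in for f_{B|m}."""
    b = [rng.randrange(2) for _ in range(k)]
    samples = []
    for _ in range(n_samples):
        cand = [rng.randrange(2) for _ in range(k)]
        if rng.random() < min(1.0, math.exp(log_target(cand) - log_target(b))):
            b = cand                      # accept candidate
        samples.append(list(b))           # rejected step repeats previous b
    return samples

rng = random.Random(3)
# Hypothetical target favoring vectors with many ones
chain = metropolis_bits(8, lambda b: 1.5 * sum(b), 5000, rng)
mean_ones = sum(sum(s) for s in chain) / len(chain)
print(mean_ones)  # well above the uniform mean of 4
```

Note that rejected steps repeat the previous sample, which is exactly the source of the sample correlation δ_m discussed next.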
Due to the setting of some b^(j) = b^(j−1), the samples generated by the Metropolis algorithm are not all independent. So the estimator variance is no longer as given in (22) but acquires a correction factor involving δ_m, the correlation coefficient between samples in the mth bin [39]. This coefficient is system-dependent and generally unknown beforehand, but it can be estimated during the simulation along with p_err|m to provide a correction to the optimal histogram in (23). Given b^(j), the sampling of D according to f_{D|B,m}(·|b^(j)) does not suffer the dimensionality issue that the sampling of B does. Hence we employ the A/R method to draw d^(j) based on f_{D|B,m}(·|b^(j)). Then we follow the steps in Algorithm 2 to draw q_2^(j) and z^(j), except that |Λ^(j)| = d^(j) is satisfied by randomly flipping the signs of only those elements of s̄^(j) corresponding to signal values ±3/√10, but not those corresponding to signal values ±1/√10, as the latter are always included in Λ^(j).
Before proceeding further, a remark is in order: while the above discussion has focused on Gray-coded QPSK and 16QAM, it is not hard to see that the working principle of the proposed method also applies to other signal constellations and other bit-to-symbol mappings. What needs attention in NGF design for a specific constellation is the error behavior of each signal point in noise. The bit-to-symbol mapping does not affect NGF design, but it will affect the distance properties of the modulated codewords and thus the PER. In the end, each different case may come with a different simulation complexity.

V. ADAPTIVE SHAPING OF HISTOGRAM
As shown in (23), the optimal simulation histogram ρ_m that minimizes the estimator variance under a particular NGF depends on both p_m and p_err|m ∀m. However, as seen in the last section, even for relatively simple NGFs, p_err|m needs to be estimated by running simulations (not to mention that, for more complicated NGFs, even p_m may need to be estimated by simulation). Hence the determination of ρ_m and the estimation of p_err|m become intertwined. In this section, we develop an adaptive method to shape the simulation histogram. We assume that the NGF is suitably designed so that f_Q(q) and p_e(q) exhibit the typical distribution shapes depicted in Fig. 4, and that p_err|m is an increasing function of m, as a higher NGF value q should indicate that the packets are subject to more noise and suffer a higher PER.
The proposed approach to adaptive histogram shaping is illustrated in Fig. 6. Some of the detailed design is geared toward convenient coarse-grain parallel implementation on multiple processors. Since p_err|m increases with m, the upper part of the bins contributes more to p_e, so we start by concentrating the simulation samples there. As indicated in Fig. 6(a), we initially try to generate a flat histogram over the bins to the right of the peak in p_m (see the bars denoted n_m in Fig. 6(a)). Via the A/R formalism, we initialize the sampling with an "acceptance rate vector" over all the bins, where the parenthesized superscript is the iteration index. For convenience, each iteration consists of n_w simulation runs (distributed evenly over all parallel processors). For each simulation run, sampling is conducted by first obtaining an accepted bin (say m) and then generating a sample of (B, Z) in it using Algorithm 1 (for Q_1(·)) or Algorithm 2 (for Q_2(·), with the minor tuning stated in the penultimate paragraph of Section IV in the case of 16QAM). Let the sample generated in the jth simulation run be (b^(j), z^(j)). The sample count and error count for the bin (denoted n_m and e_m, respectively) are incremented by 1 and I_F(b^(j), z^(j)), respectively. Iterations continue until p̂_e = Σ_m p_m p̂_err|m > 0, where p̂_err|m is the conditional PER estimate of (46). Fig. 6(a) illustrates a case where, after the first iteration, n_m is indeed a step-like function and packet errors are observed, so that e_m ≠ 0 for some m.
To proceed, we need to address two points: first, whether we have attained the desired precision in p̂_e so that no further simulations are needed; and second, how to perform further simulations when they are needed. We address the second point first. The conditional PER estimates in (46) facilitate a rough estimate ρ̂_m of the optimal normalized histogram, as given in (47). We cannot base further simulations solely on it, however, particularly because the bins with p̂_err|m = 0 would be excluded from subsequent sampling, yet the reason that we get p̂_err|m = 0 in those bins in the early iterations may not be a truly zero p_err|m but an insufficient number of total simulation runs for errors to show up therein. To address this issue, we include some bins that have so far yielded no errors in the next iteration of simulations in the following way. First, define the sample mean and standard deviation of ρ̂_m, as in (48), where t is the iteration index. By Cantelli's inequality, a random variable has at most 20% probability of being smaller than its mean minus two times its standard deviation. Hence we consider the set of bins H^(t) given by (49), where q_init guards against a too-small q_std^(t), which may occur due to insufficient observation of p_err|m. Experiments show that a good setting is q_init = (q_max − q_min)/10, which forces at least 10% of the bins to be selected. For each bin in H^(t), we re-parametrize p̂_err|m as p̂_err|m^(t) = p̂_err|m₊, where m₊ is the index of the nearest bin where packet errors are observed. This makes p̂_err|m^(t) ≠ 0 ∀m ∈ H^(t). If there are two nearest bins (i.e., one on each side), we take the one with the larger index, which tends to yield a larger p̂_err|m^(t). The estimated normalized optimal histogram ρ̂_m^(t) is then updated accordingly. For the next iteration, we update the acceptance rate vector using ρ̂_m^(t) and n_IS^(t) = Σ_m n_m, so as to steer n_m towards the optimal distribution. Fig. 6(b) illustrates a typical intermediate condition, where the acceptance rate vector β_m^(t+1) has been generated at the end of iteration t for iteration t + 1. The progressively increasing samples over the iterations improve the accuracy of p̂_err|m, leading to a better estimate of ρ_m and a reduced variance in p̂_e. The looping decision is made by checking the estimated current precision, in which p̂_err|m^(t) is used in the numerator to effect a conservative estimate of that precision. Fig. 6(c) illustrates a typical condition at the conclusion of the simulation. Not all bins are sampled, and not all the sampled bins have yielded packet errors. But the conservative settings of H^(t), p̂_err|m^(t), and ρ̂_m^(t) should help ensure sufficient samples in the bins of non-negligible p_err|m for a robust PER estimate.
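One iteration of the histogram update can be sketched as below. This is a loose sketch under several assumptions flagged in the comments: the optimal-histogram form ρ_m ∝ p_m √(p_err|m(1 − p_err|m)) is the standard stratified-sampling result, and the bin-selection threshold is a simplified stand-in for the Cantelli-style rule with the q_init guard.

```python
import math

def update_histogram(p_m, n_m, e_m, q_init_frac=0.1):
    """One adaptive-shaping iteration (a sketch, not the paper's exact
    rule): estimate conditional PERs, borrow the nearest error-observing
    bin's estimate for selected zero-error bins, then set the normalized
    histogram rho_m proportional to p_m * sqrt(perr * (1 - perr))."""
    nb = len(p_m)
    perr = [e / n if n > 0 else 0.0 for e, n in zip(e_m, n_m)]
    err_bins = [m for m in range(nb) if e_m[m] > 0]
    rho = [p_m[m] * math.sqrt(perr[m] * (1 - perr[m])) for m in range(nb)]
    mean = sum(rho) / nb
    std = math.sqrt(sum((r - mean) ** 2 for r in rho) / nb)
    # Simplified mean-minus-2-std selection with a floor (cf. q_init)
    q_thr = max(mean - 2 * std, q_init_frac * max(rho))
    for m in range(nb):
        if e_m[m] == 0 and rho[m] <= q_thr and err_bins:
            # nearest error bin; ties broken toward the larger index
            m_plus = min(err_bins, key=lambda e: (abs(e - m), -e))
            p = perr[m_plus]
            rho[m] = p_m[m] * math.sqrt(p * (1 - p))
    total = sum(rho)
    return [r / total for r in rho] if total > 0 else [1 / nb] * nb

rho = update_histogram([0.5, 0.3, 0.15, 0.05],
                       [100, 100, 100, 100],
                       [0, 0, 2, 10])
print(rho)  # zero-error bins receive borrowed, nonzero mass
```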
The proposed adaptive HSMC simulation method is summarized in Algorithm 3.

VI. SOME NUMERICAL RESULTS
In this section, we present some numerical results on several aspects of the proposed adaptive HSMC. To save simulation time, we do coarse-grain parallel computing on 50 processors and set n_w = 10⁴ and n_b = 20 to avoid excessive data exchanges among the processors. (Algorithm 3, excerpted: if Σ_m e_m > 0, then evaluate p̂_err|m by (46) and ρ̂_m by (47) ∀m, and determine H^(t) by (48) and (49).) We first verify the precision of HSMC using some BCH codes, where we consider QPSK modulation and employ ML decoding in HSMC. We compare the PER performance obtained under HSMC with the union bound in (4). The (n, k) values of the BCH codes are (63, 7), (127, 8), and (255, 9), where the values of k are small to ease ML decoding. The minimum distances of these codes are 31, 63, and 127, respectively, which are approximately linearly proportional to n (with r decreasing as n increases). To facilitate QPSK modulation, we add a filler bit to each codeword to make the word length even. Fig. 7 compares the PER estimates of HSMC under Q_2(·) with the union bound. We see that they closely match each other.

We next look at the efficiency of HSMC. For this, Table 1 lists the speedup factors at some PER values under Q_1(·) and Q_2(·), where G* is the ideal speedup factor (25), Ĝ* is an estimate of the ideal speedup factor evaluated using (25) by setting p_err|m therein equal to p̂_err|m and p_e equal to Σ_m p_m p̂_err|m, and G is the actual speedup factor given by n_MC/n_IS, with n_MC ≜ 100/p_e and n_IS obtained from running HSMC with a precision target of 10% in Algorithm 3. Commensurate with the theory in Section IV-A, the ideal speedup under Q_1(·) increases with n. Interestingly, Q_2(·) outperforms Q_1(·) in ideal speedup for the (63, 7) code but is outperformed by Q_1(·) for the longer codes. We deem this relative behavior to have to do with the selective exclusion of some noise elements in Q_2(·), but a more detailed theoretical analysis appears difficult and is not pursued in this work. Overall, the actual speedup ranges approximately from 10- to 1000-fold. The suboptimality embodied in the adaptive histogram shaping has led to approximately one order of magnitude of loss in speedup compared to using the optimal histogram.
We now investigate the performance of HSMC under polar coding, where we consider both QPSK and 16QAM modulations and hence only examine Q_2(·). Let the code rate r = 1/2 and n = 256, and let the code be of the cyclic redundancy check (CRC)-aided kind with 24 CRC bits attached to the information sequence. We also apply the bit interleaving in [7, Sec. 5.4.1.3], which legitimizes the approximation in (42). At the receiver, we employ successive cancellation list decoding (SCLD) with different list sizes for different decoding complexities. Fig. 8 shows some PER results, where L denotes the list size. Note that we have run MC only down to a PER of approximately 10⁻⁷ due to simulation time concerns. Similar to the earlier BCH cases, the HSMC and MC results closely match each other.
Concerning the speedup factors, Table 2 shows some results (with "actual speedup factor" defined similarly to Table 1). The proposed HSMC shows better efficiency as L increases. To better understand this intriguing performance feature of HSMC, we analyze the corresponding codeword error patterns. Specifically, we re-encode the decoded information vector in the case of a packet error and compute the Hamming distance between the re-encoded codeword and the transmitted codeword. Table 3 tallies the MC results from 100 packet errors for each of three list sizes. As can be seen, the number of smaller-distance errors (in particular, distance ≤ 16) increases significantly with decreasing list size. This indicates that, when the decoding is further from being ML (which is the case with a smaller list size), the performance of HSMC with Q_2(·) suffers more from the presence of lower-weight error patterns. This has some resemblance to the decline in performance of Q_1(·) with decreasing d_min. On the negative side, this seems to indicate a deficiency in the proposed HSMC under code-agnostic noise gauging, in that it does not fully surmount the effect of code weaknesses or decoder suboptimality. On the other hand, this may also indicate that, aside from being a simulation tool, the method can possibly be used to assist system design by detecting code weaknesses or decoder suboptimality via observing how the speedup factor varies with the transmission mechanism or transceiver parameters. Further examination of this perspective is left to potential future work.
Lastly, an aspect of interest is how serious the sample correlation is in the Metropolis algorithm used in Q_2(·)-based HSMC simulation of 16QAM. For this, Fig. 9 shows p_m, p̂_err|m, the estimated δ_m, and the acceptance rates of codeword samples for the bins where packet errors are observed. As can be seen, the acceptance rate decreases with increasing m (i.e., increasing q), because the codeword distribution progressively deviates from the unconditioned f_B(·) as q increases, and the Metropolis algorithm tends to reject a randomly generated codeword sample and keep the previous sample that has sufficient inner signal values to yield the required q value. The correlation factor increases due to the increasing rejections. But the high-m bins have low probabilities p_m.
In the end, we find that V_M(p_e)/V(p_e) = 1.029. That is, the effect of the correlation on the estimator variance is minor and can be ignored.

VII. EXTENSION TO BURST-ERROR CODES
It is well known that there is much difficulty in applying IS to complex, high-dimensional systems [15]. One way to combat this difficulty is to break the system into several interconnected subsystems of lower dimension. For PER simulation, insight into how to do this may be gained by considering how packet errors relate to code structures. In the case of burst-error codes, such as convolutional codes, it is known that the Viterbi decoding delay has much to do with decoding performance and hence should serve well to define the subsystem dimension. More generally, any segmentation of the received signal may serve this purpose as long as the segment properties correlate highly with the PER. Thus we propose a code-aware NGF as Q_c(b, z) ≜ max_{1≤λ≤l/l_m} Q_2^(λ)(b, z), where l_m is a proper segment length that divides the whole received signal into l/l_m segments, λ is the segment index, and Q_2^(λ)(b, z) is the segmental NGF given by the quadratic NGF computed over the indices of the λth segment only. The reason for taking the maximum of the segmental NGF values is the intuition that the noisiest segment in the signal may largely determine the error probability (as is the case in Viterbi decoding of convolutional codes).
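The code-aware NGF can be sketched directly from this definition, reusing the QPSK index-set convention of the quadratic NGF; the function names and the requirement that l_m divides the block length are illustrative.

```python
import math

def q2_segment(w, z, start, length):
    """Segmental quadratic NGF over one segment (QPSK convention: an
    index counts only if the noise opposes the sign of the signal)."""
    return sum(z[i] ** 2 for i in range(start, start + length)
               if z[i] * w[i] < 0)

def qc(w, z, lm):
    """Code-aware NGF: the maximum segmental NGF over l/lm disjoint
    segments of length lm (lm assumed to divide the block length)."""
    l = len(w)
    return max(q2_segment(w, z, s, lm) for s in range(0, l, lm))

w = [1 / math.sqrt(2)] * 8
z = [0.1, -0.9, 0.2, 0.1, -0.1, -0.2, 0.1, 0.1]
print(qc(w, z, 4))  # first segment dominates: (-0.9)^2 = 0.81
```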
Recall that HSMC requires the evaluation of p_m and p_err|m, where p_m is now determined by the distribution of Q_c ≜ Q_c(B, Z). For simplicity, we only treat QPSK herein, although the approach is applicable to higher-order QAM. Define the segmental cardinality D^(λ) ≜ |Λ^(λ)(b, z)|. From the discussion on D in Section IV-B, we see that D^(λ) has the binomial PMF of D with l replaced by l_m. Since Q_2^(λ)(B, Z), 1 ≤ λ ≤ l/l_m, are independent and identically distributed random variables, their maximum Q_c has CDF F_{Q_c}(q) = [F_{Q_2}^(λ)(q)]^(l/l_m), where each Q_2^(λ) is distributed as Q_2 and hence F_{Q_2}^(λ)(q) can be evaluated using (36) by letting l = l_m. The above suffices for the evaluation of p_m ∀m.
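The max-of-i.i.d. relation can be sketched in one line; the segmental CDF below is a purely illustrative toy (exponential), not the chi-squared mixture of (36).

```python
import math

def cdf_of_max(F_seg, n_seg):
    """CDF of the maximum of n_seg i.i.d. variables with marginal CDF
    F_seg: F_max(q) = F_seg(q) ** n_seg."""
    return lambda q: F_seg(q) ** n_seg

# Toy segmental CDF (exponential, illustration only)
F = cdf_of_max(lambda q: 1.0 - math.exp(-q), 4)
print(F(2.0))  # smaller than the single-segment CDF at the same q
```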
Now consider the evaluation of p_err|m, for which we need to sample (B, Z) conditioned on the mth bin. The symmetry with respect to b under QPSK in AWGN (as noted in Section IV) makes the sampling of B independent of Z and of the bin event (the latter depending entirely on Z). To deal with the maximization in (53), which selects the "noisiest" signal segment based on the segmental NGF, let λ_c ≜ arg max_{1≤λ≤l/l_m} Q_2^(λ)(B, Z) and D_c ≜ |Λ^(λ_c)(B, Z)|. By the AWGN assumption, every signal segment is equally likely to be the noisiest; hence λ_c is uniformly distributed over {1, 2, ..., l/l_m}, and the distribution of D_c is just that of D^(λ) for any λ. We thus have a decomposition analogous to (37), and the generation of simulation samples proceeds as B → λ_c → D_c → Q_c → Z. The conditional probability distributions of D_c and of Q_c given D_c (both conditioned on the mth bin), which are needed in performing the sampling, are derived in Appendix E, where we show that, under narrow bins (i.e., small bin width), f_{D_c|m}(·) ≈ f_{D|m}(·) and f_{Q_c|D_c,m}(·) ≈ f_{Q_2|D,m}(·), with l replaced by l_m. These approximations work well for n_b ≥ 100, especially for the bins associated with larger q values, for which the CDF F_{Q_2}^(λ)(q) is close to 1. For the λ_c-th segment of the noise vector, the sampling method is similar to lines 5-7 in Algorithm 2. For every other segment, we merely repeat the same way of sampling until its segmental NGF value is less than q_c^(j).

To verify the performance of the proposed code-aware NGF, we evaluate the ideal speedup factors for some tail-biting convolutional codes (TBCCs). Specifically, we consider the convolutional codes of optimal distance spectra of constraint lengths 3, 7, and 11 given in [40], with (n, k) = (256, 128). The generator polynomials are (in octal) [5, 7], [133, 171], and [3345, 3613], respectively, with free distances of 5, 10, and 14, respectively. (Although these codes were not originally optimized for TBCC use, we find them well suited to it in our case.
For one thing, the number of minimum-distance codewords of the resulting TBCC is dominated by those associated with the zero initial state. In particular, the [133, 171] code has formed the basis of some standardized TBCC schemes [5], [41].) It is known that, for a rate-1/2 convolutional code, Viterbi decoding yields near-ML performance at decoding delays on the order of 5 constraint lengths [32], [42]. Hence typical burst errors under proper convolutional decoding would not exceed this duration, which hints that an l_m on the order of 5 times the constraint length or greater may be a good choice. Experiments show that a good choice of l_m for the code-aware NGF is approximately 3l_c/r, where l_c is the constraint length. Table 4 and Fig. 10 present some numerical results, where we have used the wrap-around Viterbi decoding technique [43] and adjusted the SNR to make p_e ≈ 10⁻⁵ for each code. We have also set n_b = 100.
First, Table 4 shows that the speed gain of Q_1(·) is very small and is lower than that of either Q_2(·) or Q_c(·). This is not surprising in view of the property of Q_1(·) discussed in Section IV-A, for a convolutional code has a fixed free distance (or minimum distance) even though one may lengthen the codewords at will. In addition, Table 4 shows that Q_1(·) and Q_2(·) yield greater efficiency as l_c increases, but the advantage of Q_c(·) over them remains significant, although its performance decreases with increasing l_c. Fig. 10 shows that, under Q_2(·), the p̂_err|m curve becomes notably steeper as l_c increases, whereas under Q_c(·) this is not the case. This should account for the gain in speedup factor with increasing l_c under Q_2(·). We have also run a simulation using the suboptimal decoding algorithm of [44]; it turns out that the speedup behaviors are substantially similar.
Several remarks are in order. First, if the code is particularly prone to noise over a certain codeword segment, then it is effectively a burst-error code to which the approach of this section can be applied. Second, if the codewords do not show a burst-error nature but can be made to do so by a permutation of the code bits, then the technique of this section can again be applied. Third, breaking the code into disjoint segments does not fully capture the burst-error nature of a convolutional code, as an error burst could run across a segment boundary. A more pertinent NGF should thus consider arbitrarily located (and thus overlapping) segments [31]. But in this case, p_m does not have a closed-form expression and the sampling mechanism becomes more complicated.

VIII. CONCLUSION
Today's communication system design heavily depends on computer simulation for performance evaluation. In this work, we examined the principles and key issues of the importance sampling approach to efficient simulation and developed a histogram-shaping Monte Carlo (HSMC) technique for coded communication systems simulation. The technique employs a noise gauging function (NGF) to measure the error-causing propensity of channel noise samples and adaptively learns the proper histogram of simulation samples under the NGF.
Concerning the NGF, we proposed two basic kinds that are advantageous in two respects: first, they allow relatively easy sample generation (by being closely related to some sufficient statistics of the AWGN parameters); and second, they require no detailed understanding of the code properties (and are thus code-agnostic in this sense, although some crude knowledge of the code properties can benefit their choice). One of the two is linear and the other quadratic. The linear one is asymptotically efficient under random coding as the code length and SNR approach infinity, but it is applicable only to QPSK (or, equivalently, BPSK). The quadratic one, with a proper sampling method, can be applied to higher-order modulations. They can therefore be chosen based on code and modulation properties: under QPSK (or BPSK), the linear one is preferable when the code has a large minimum Hamming distance; otherwise the quadratic one should be preferred.
We further considered the case of burst-error codes, whose minimum-distance properties are far inferior to those of random codes, and proposed a quadratic-type NGF that exploits the burst-error property of such codes to improve simulation efficiency. Use of this NGF requires somewhat more knowledge of the code properties than the two basic kinds above, and it is thus dubbed code-aware.
We presented the detailed method of how to adaptively learn the optimal sample histogram under a given NGF.
Simulations on the short-length coding typically considered for ultra-reliable communication showed an approximately 10- to 1,000-fold speedup of the proposed HSMC versus conventional MC. Some directions of potential future work were pointed out along the way.

APPENDIX A MECHANISM OF PER UNDERESTIMATION
For a linear code, the Hamming distance d_H between a known and an unknown minimum-weight codeword satisfies d_min ≤ d_H ≤ 2d_min. This can be understood by considering how the 1-bit positions may be related between them:

Known (codeword a):   11···1 (d)   11···1 (d_min − d)   00···0               00···0
Unknown (codeword u): 11···1 (d)   00···0               11···1 (d_min − d)   00···0

In the above, d denotes the number of overlapping 1-bit positions. Since d_H is equal to 2(d_min − d), it is always even. In simulating QPSK transmission of the all-zero codeword (call it codeword 0), μ_a has its first d_min elements equal to −1/√2 with all the remaining elements being zero. Let w_i denote the modulated signal vector associated with codeword i, i = 0, a, u. A pairwise ML decoding error to codeword u results if w_0 + μ_a + Z is closer to w_u than to w_0. Straightforward algebra gives this error probability, and, similarly as for (62), a companion expression follows. Comparing (68)