Hardware Design of Concatenated Zigzag Hadamard Encoder/Decoder System With High Throughput

Both turbo Hadamard codes and concatenated zigzag Hadamard codes are ultimate-Shannon-limit-approaching channel codes. The former require the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm in the iterative decoding process, which makes the decoder structure more complex and limits its throughput. The latter do not involve BCJR decoding, so their decoder structure can be much simpler and can potentially operate at a much higher throughput. In this paper, we investigate the hardware design of a concatenated zigzag Hadamard encoder/decoder system and implement it on an FPGA board. We design a decoder capable of decoding multiple codewords at the same time, and the proposed system operates with a throughput of 1.44 Gbps, an increase of 50% compared with the turbo Hadamard encoder/decoder system. As for the error performance, the encoder/decoder system with 6-bit quantization achieves a bit error rate of $2\times 10^{-5}$ at $E_{b}/N_{0} = -0.2$ dB.


I. INTRODUCTION
With the fast development of communication technologies, the requirements on forward-error-correction (FEC) codes are becoming more and more rigorous. Among the good FEC codes, turbo codes [1]-[3], low-density parity-check (LDPC) codes [4]-[7] and polar codes [8]-[11] have been intensively studied because they can perform close to the capacity limits. In addition, turbo Hadamard codes (THCs) [12], LDPC Hadamard codes [13] and concatenated zigzag Hadamard codes [14] have been shown to perform well even near the ultimate Shannon limit (i.e., −1.59 dB). These ultimate-Shannon-limit codes are applicable to multi-user environments, e.g., code-division multiple-access or interleave-division multiple-access (IDMA) [15] systems. In [16], [17], the hardware design of the turbo Hadamard code has been investigated. Since Bahl-Cocke-Jelinek-Raviv (BCJR) decoding is required, the overall decoder design is relatively complex, limiting the throughput to less than 1 Gbps. On the other hand, the concatenated zigzag Hadamard codes do not require BCJR decoding, potentially making the decoder simpler and allowing it to operate with a higher throughput.
The associate editor coordinating the review of this manuscript and approving it for publication was Yi Fang.
In this paper, we investigate the hardware design of a concatenated zigzag Hadamard encoding/decoding system. We analyze the latency, throughput and utilization rate of the components. We implement the encoding/decoding system and compare its resource utilization and throughput with those of THC systems. The organization of the paper is as follows. Sect. II briefly reviews the structure of the Hadamard code, the zigzag Hadamard code and the concatenated zigzag Hadamard code. Sect. III and Sect. IV present details of the hardware design of the concatenated zigzag Hadamard encoder and decoder, respectively. Sect. V shows the FPGA implementation results, including hardware utilization, throughput and bit error rate performance compared to the THC system. Finally, Sect. VI provides some concluding remarks.

II. CONCATENATED ZIGZAG HADAMARD CODE (CZHC)

A. HADAMARD CODE
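The codewords of an order-$r$ Hadamard code are derived directly from Hadamard matrices of the same order. For example, the Hadamard matrices of order $r = 3$ are given by

$$\pm H_8 = \pm\begin{bmatrix} +1 & +1 & +1 & +1 & +1 & +1 & +1 & +1 \\ +1 & -1 & +1 & -1 & +1 & -1 & +1 & -1 \\ +1 & +1 & -1 & -1 & +1 & +1 & -1 & -1 \\ +1 & -1 & -1 & +1 & +1 & -1 & -1 & +1 \\ +1 & +1 & +1 & +1 & -1 & -1 & -1 & -1 \\ +1 & -1 & +1 & -1 & -1 & +1 & -1 & +1 \\ +1 & +1 & -1 & -1 & -1 & -1 & +1 & +1 \\ +1 & -1 & -1 & +1 & -1 & +1 & +1 & -1 \end{bmatrix} \tag{1}$$

and Hadamard matrices of order $r$ are constructed recursively by

$$\pm H_n = \pm\begin{bmatrix} H_{n/2} & H_{n/2} \\ H_{n/2} & -H_{n/2} \end{bmatrix} \tag{2}$$

with $n = 2^r$ and $\pm H_1 = [\pm 1]$. The codewords are given by the columns (or rows) of the Hadamard matrices $\pm H_n$. For each codeword of length $2^r$, the bit indices $\{0, 1, 2, \ldots, 2^{j-1}, \ldots, 2^{r-1}\}$ carry the information bits and the remaining indices carry parity bits.

We suppose that a Hadamard codeword $c = (c[0], c[1], c[2], \ldots, c[2^r - 1])$ is transmitted through an additive white Gaussian noise (AWGN) channel with noise mean 0 and variance $\sigma^2$. We also denote the noisy observation at the receiver by $x = (x[0], x[1], x[2], \ldots, x[2^r - 1])$. The a posteriori probability (APP) logarithm-likelihood ratio (LLR) of the $i$th bit in the code is obtained by [12], [14]

$$L_{\mathrm{APP}}(c[i]) = \ln\frac{\Pr(c[i] = +1 \mid x)}{\Pr(c[i] = -1 \mid x)} = \ln\frac{\sum_{c \in \{\pm H_n\}:\, c[i] = +1} \Pr(x \mid c)}{\sum_{c \in \{\pm H_n\}:\, c[i] = -1} \Pr(x \mid c)}. \tag{3}$$

Since the codewords are transmitted through an AWGN channel with noise variance $\sigma^2$, $\Pr(x \mid c)$ is proportional to $\exp(-\lVert x - c\rVert^2 / 2\sigma^2)$. All codewords have the same energy, so the common factors cancel and we have

$$L_{\mathrm{APP}}(c[i]) = \ln\frac{\sum_{c \in \{\pm H_n\}:\, c[i] = +1} \exp\left(\langle c, x\rangle / \sigma^2\right)}{\sum_{c \in \{\pm H_n\}:\, c[i] = -1} \exp\left(\langle c, x\rangle / \sigma^2\right)}. \tag{4}$$

The a priori information $\exp(\langle c, x\rangle / \sigma^2)$ can be calculated by an $r$-stage fast Hadamard transform (FHT). After that, a dual fast Hadamard transform (DFHT) of the same order is applied to calculate (4).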
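As a minimal sketch of how the two stages fit together, the Python fragment below computes the correlations $\langle h_j, x\rangle$ with an $r$-stage butterfly FHT and then evaluates the log-ratio in (4) by direct summation over all $2^{r+1}$ codewords. The direct summation stands in for the DFHT purely for readability; the function names, the test input and the noise variance are illustrative assumptions and are not taken from the implemented design.

```python
import math


def hadamard_matrix(r):
    """Sylvester construction: H_1 = [1], H_n = [[H, H], [H, -H]]."""
    h = [[1]]
    for _ in range(r):
        top = [row + row for row in h]
        bottom = [row + [-v for v in row] for row in h]
        h = top + bottom
    return h


def fht(x):
    """r-stage butterfly transform; returns <h_j, x> for every row h_j of H_n."""
    y = list(x)
    n = len(y)
    step = 1
    while step < n:
        for start in range(0, n, 2 * step):
            for k in range(start, start + step):
                a, b = y[k], y[k + step]
                y[k], y[k + step] = a + b, a - b
        step *= 2
    return y


def app_llr(x, sigma2):
    """APP LLR of each code bit over the codeword set {+H_n, -H_n}."""
    n = len(x)
    r = n.bit_length() - 1
    h = hadamard_matrix(r)
    corr = fht(x)
    llrs = []
    for i in range(n):
        num = den = 0.0
        for j in range(n):
            for sign in (1, -1):
                weight = math.exp(sign * corr[j] / sigma2)
                # Codeword sign * h_j has bit i equal to sign * h[j][i].
                if sign * h[j][i] == 1:
                    num += weight
                else:
                    den += weight
        llrs.append(math.log(num / den))
    return llrs


# Noiseless observation of the codeword +h_3 (hypothetical test input).
H8 = hadamard_matrix(3)
x = [float(v) for v in H8[3]]
llr = app_llr(x, sigma2=1.0)
decisions = [1 if v > 0 else -1 for v in llr]
```

With a noiseless input, the signs of the LLRs recover the transmitted codeword. A hardware DFHT obtains the same sums with $r$ butterfly stages instead of the double loop used here.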

B. ZIGZAG HADAMARD CODE
A zigzag Hadamard code (ZHC) is graphically described in Fig. 1(a), where each segment represents an order-$r$ Hadamard code [14]. The overall code structure is also shown in Fig. 1(b). Assume that an information block $D$ of length $L = rK$ is segmented into $K$ sub-blocks. For the $k$th segment ($k = 1, 2, \ldots, K$), the information bits $d_k = [d_k(1), d_k(2), \ldots, d_k(r)]$ are represented by blank nodes (area) and the remaining parity-check bits are represented by grey nodes (area). Moreover, the last parity bit of each segment is copied to the first input of the next segment and is denoted as the common bit (black nodes/area in the figures). Note that the first input bit of the first segment is fixed as 0 and is omitted.
Denote the Hadamard codeword in the $k$th segment as $c_k = (c_k(0), c_k(1), \ldots, c_k(2^r - 1))$, where $c_k(0) = c_{k-1}(2^r - 1)$ and $c_k(2^{j-1}) = d_k(j)$, $j = 1, 2, \ldots, r$. We also denote the common bit $q_k = c_k(0) = c_{k-1}(2^r - 1)$ and the parity bits $p_k = \{c_k(i),\ i \neq 0,\ i \neq 2^{j-1},\ j = 1, 2, \ldots, r\}$. The $k$th segment of a ZHC codeword can then be rewritten as $c_k = (d_k, q_k, p_k)$. The encoding process of ZHC is a Markov process and the correlation between any two consecutive segments depends only on the common bit. To decode the ZHC, a two-way decoding algorithm with two stages can be used [14].
1) Forward recursion: Starting from the first segment to the $(K - 1)$th segment, perform FHT and DFHT on the current segment to obtain the APP LLRs of the bits based on the aforementioned discussion; then use the APP LLR of the last bit of the current segment to update the a priori LLR of the first bit of the next segment.
2) Backward recursion: Starting from the $K$th segment to the first segment, perform FHT and DFHT on the current segment to obtain the APP LLRs of the bits (including information bits); then use the extrinsic LLR of the first bit of the current segment to update the a priori LLR of the last bit of the previous segment.

C. CONCATENATED ZIGZAG HADAMARD CODE
Fig. 2 shows the code structure of a CZHC [14] with $M$ component codes. (When the zigzag Hadamard encoders in Fig. 2 are replaced by convolutional Hadamard encoders, the output codeword becomes a THC [12].) $M$ copies of the same but interleaved information bits are sent to $M$ zigzag Hadamard encoders, producing $M$ sets of parity bits. The information $D$ together with the parity bits $p^{(1)}, p^{(2)}, \ldots, p^{(M)}$ are sent to the channel. The encoder design of CZHC will be discussed in Section III. The decoding of CZHC involves the interleaving and passing of LLRs among different zigzag Hadamard codes (or component codes) and will be explained in Section IV.

III. CZHC ENCODER DESIGN
The data flow of our CZHC encoder/decoder system is shown in Fig. 3. The structure of the CZHC encoder is shown in Fig. 4 where M components of ZHC are encoded in parallel. To generate each CZHC codeword, the following steps are performed.
1) Generate random information bits of length $rK$ using a pseudo-random number generator (PRNG), which is realized by linear feedback shift registers (LFSRs). Form the first component code using Step 2) below.
2) Divide the information bits into segments of length $r$. The $r$ information bits in each segment together with the common bit are then sent to the Hadamard encoder, producing Hadamard codewords. Note that the common bit is the feedback of the Hadamard encoder from the previous segment. For the first segment, the common bit is set to 0.
3) For each of the remaining $M - 1$ component codes, send the original information bits to the corresponding interleaver, denoted by $\Pi_1, \Pi_2, \ldots, \Pi_{M-1}$, respectively, and apply Step 2) above.
4) Send the original information bits together with all the parity-check bits generated by all component encoders to the channel.
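The encoding steps above can be sketched in Python as follows. The systematic bit mapping is an assumption chosen only so that $c_k(0) = q_k$ and $c_k(2^{j-1}) = d_k(j)$ hold as stated in Sect. II-B; the function names and the example permutation are hypothetical. With this mapping the last bit of a segment depends on the common bit only when $r$ is even, so the example uses an even order.

```python
def hadamard_encode(q, d):
    """Systematic Hadamard codeword with c[0] = q and c[2**(j-1)] = d[j-1]."""
    r = len(d)
    # Select the Hadamard column indexed by m and complement it by q.
    m = [bit ^ q for bit in d]
    code = []
    for i in range(1 << r):
        bit = q
        for j in range(r):
            if (i >> j) & 1:
                bit ^= m[j]
        code.append(bit)
    return code


def zhc_encode(info, r):
    """Zigzag chain: the last bit of a segment is the common bit of the next."""
    assert len(info) % r == 0
    q = 0  # common bit of the first segment is fixed to 0
    segments = []
    for start in range(0, len(info), r):
        c = hadamard_encode(q, info[start:start + r])
        segments.append(c)
        q = c[-1]
    return segments


def parity_bits(c, r):
    """Bits at positions other than 0 and 2**(j-1), as in the definition of p_k."""
    excluded = {0} | {1 << j for j in range(r)}
    return [c[i] for i in range(len(c)) if i not in excluded]


def czhc_encode(info, r, interleavers):
    """Information bits followed by the parity bits of every component code."""
    out = list(info)
    for perm in interleavers:
        shuffled = [info[p] for p in perm]
        for c in zhc_encode(shuffled, r):
            out.extend(parity_bits(c, r))
    return out


# Hypothetical example: r = 4, K = 3 segments, M = 2 component codes.
r = 4
info = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0]
perms = [list(range(12)), [(5 * i + 3) % 12 for i in range(12)]]
codeword = czhc_encode(info, r, perms)
```

The first permutation is the identity, which corresponds to forming the first component code from the original bit order. Under the definition of $p_k$ used here, each segment contributes $2^r - r - 1$ parity bits, because the common bit is not sent again.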

IV. CZHC DECODER DESIGN
The structure of the decoder is illustrated in Fig. 5. The decoding process includes:
2) Forward recursion: The a priori LLRs are sent to the decoder to perform forward recursion. The forward recursion processor consists of an order-$r$ FHT block and an order-$r$ DFHT block. The a priori LLRs ($2^r$ for each segment of ZHC) are directly input to the FHT block, where simple addition/subtraction operations are performed to produce $2^r$ outputs after $r$ stages. Exponential functions are applied to the $2^r$ outputs and their additive inverses (a total of $2^{r+1}$ values) before they are sent to the DFHT block, which also produces $2^{r+1}$ outputs after $r$ stages. Divisions are then performed on the $2^{r+1}$ outputs to generate $2^r$ APP LLRs. The above operations are used to realize (4). Note that the exponential functions greatly increase the dynamic range of the data in the DFHT block, and a large number of quantization bits is required to maintain the decoding accuracy in the DFHT block. To avoid implementing complicated exponential functions, we use logarithm quantization in the DFHT block. The benefits of this quantization include:
• turning the exponential functions between the two blocks into simple bitwise-NOT functions;
• reducing the dynamic range of operations in the DFHT block and hence the number of bits used to quantize the LLRs;
• simplifying the decision block from division logic to subtraction logic.
An illustration of the proposed APP decoder for an order-2 ZHC is shown in Fig. 6.
3) Backward recursion: The backward recursion also consists of an order-$r$ FHT block and an order-$r$ DFHT block. Thus the APP decoding processors in the forward recursion can be reused. The backward recursion processor starts outputting the APP LLRs continuously after a delay of $2r$ clocks. The outputs start from the $K$th segment and proceed all the way to the first segment.
4) The output data from the backward recursion processor are interleaved and passed to the next sub-decoder.
5) The extrinsic LLRs of the information bits in this iteration are generated and stored in the RAMs at the same time.
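The effect of working in the logarithmic domain can be illustrated with a small Python sketch. It keeps only the logarithms of the DFHT inputs, replaces each addition by an exact log-sum-exp, and forms the LLR by a subtraction instead of a division. How the additions are approximated and quantized in the hardware is not specified in the text, so the exact log-sum-exp, the helper names and the sample values are assumptions made for illustration. The last lines show one plausible reading of the bitwise-NOT shortcut: in two's complement, `~y` equals `-y - 1`, which differs from the additive inverse by one least significant bit.

```python
import math


def log_add(a, b):
    """log(exp(a) + exp(b)) computed without leaving the log domain."""
    hi, lo = (a, b) if a >= b else (b, a)
    return hi + math.log1p(math.exp(lo - hi))


def log_sum(values):
    """Log-domain accumulation of a list of log-weights."""
    total = values[0]
    for v in values[1:]:
        total = log_add(total, v)
    return total


def llr_log_domain(plus_terms, minus_terms):
    """LLR as a subtraction of two log-domain sums (no division needed)."""
    return log_sum(plus_terms) - log_sum(minus_terms)


def llr_linear_domain(plus_terms, minus_terms):
    """Reference: exponentiate, add, divide, then take the logarithm."""
    num = sum(math.exp(v) for v in plus_terms)
    den = sum(math.exp(v) for v in minus_terms)
    return math.log(num / den)


# Hypothetical FHT outputs y; the DFHT also needs their additive inverses -y.
y = [3.0, -1.5, 0.25, 2.0]
plus_terms = [y[0], -y[1], y[2], -y[3]]
minus_terms = [-y[0], y[1], -y[2], y[3]]
llr_a = llr_log_domain(plus_terms, minus_terms)
llr_b = llr_linear_domain(plus_terms, minus_terms)

# Two's complement identity behind the bitwise-NOT shortcut.
quantized = 37
approx_negation = ~quantized  # equals -quantized - 1
```

Both routes give the same LLR up to rounding, while the log-domain route never forms the large intermediate values produced by the exponentials.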
The FHT/DFHT blocks are implemented in the CZHC decoder to compute the APP LLRs quickly [18]. Each segment in both the forward and backward recursions must wait for the update from the previous (next) segment before decoding can continue. For each of the $K$ segments, the FHT and DFHT processors take a total of $2r$ clocks to complete the computations.
As shown in Fig. 7, only one of the $2r$ stages is working at any time. To better utilize the decoder hardware and to improve the throughput, we decode $2r$ CZHCs at the same time in our design in a pipelined manner. These $2r$ CZHCs are sent into the decoder segment by segment, i.e., the first segment of the first code is sent to the decoder, followed by the first segment of the second code, and so on. After the first segments of all $2r$ CZHCs are sent, the second segments of the $2r$ CZHCs are sent. Note that by the time the DFHT processor finishes computing the APP LLRs of the first segments of the $2r$ CZHCs, the a priori (AP) LLRs of the second segments are arriving at the decoder. Both sets of LLRs are then sent to the FHT/DFHT processors to compute the forward recursion of the second segment. The operations of the FHT/DFHT processors for $2r$ CZHCs are illustrated in Fig. 8. The utilization rate of the FHT/DFHT processors is therefore greatly improved. Moreover, the latency (the time between the last input code bit entering the decoder and the last decoded bit coming out of the decoder) of each CZHC codeword is the same as that of decoding a single CZHC, while the throughput of the decoder is increased by $2r$ times.
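The timing argument can be checked with a simplified model, sketched below in Python. It assumes one clock per FHT/DFHT stage (a pipeline depth of $2r$), at most one segment entering the pipeline per clock, and the rule that a segment may enter only after the previous segment of the same codeword has left. Only the forward recursion is modelled, and the function name and parameter values are hypothetical.

```python
def simulate_forward_pass(r, num_segments, num_codewords):
    """Return entry clocks, total clocks and per-stage utilization."""
    depth = 2 * r                      # r FHT stages followed by r DFHT stages
    ready = [0] * num_codewords        # clock at which each codeword may continue
    entry = {}
    clock = 0
    for k in range(num_segments):
        for c in range(num_codewords):
            start = max(clock, ready[c])
            entry[(c, k)] = start
            ready[c] = start + depth   # result leaves the pipeline here
            clock = start + 1          # next segment enters one clock later at the earliest
    total_clocks = max(ready)
    # Each segment keeps a given stage busy for exactly one clock.
    utilization = num_segments * num_codewords / total_clocks
    return entry, total_clocks, utilization


# Hypothetical sizes: r = 3 (pipeline depth 6) and K = 10 segments.
_, clocks_single, util_single = simulate_forward_pass(3, 10, 1)
entry_multi, clocks_multi, util_multi = simulate_forward_pass(3, 10, 6)
```

In this model a single codeword keeps each stage busy for $1/(2r)$ of the time, whereas $2r$ interleaved codewords fill the pipeline almost completely and each codeword still needs $2rK$ clocks from its first segment entering to its last result leaving.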
Note that the CZHC is a concatenated code with $M$ component codes. Each CZHC codeword needs to go through the decoding process in Fig. 5 $M$ times to complete one iteration. To simplify the control logic and to increase the throughput, we construct $M$ CZHC decoders (each called a sub-decoder) in our decoding system. Hence, $M$ times more CZHC codewords can be decoded simultaneously in a pipelined manner. The usage of control logic and block RAMs between consecutive component code decoders is reduced, and the throughput of the decoding system is increased by another $M$ times, i.e., by a total of $2rM$ times. Fig. 9 shows the decoding system that consists of $M$ sub-decoders. To decode $2rM$ CZHCs simultaneously, the decoder receives and stores the $2rM$ CZHCs in both the information RAMs and the parity RAMs, which are shown in Fig. 10.
Between consecutive sub-decoders, interleavers (omitted in Fig. 9 for simplicity) are needed to shuffle the outputs of the current sub-decoder before inputting them to the next sub-decoder. We use fixed inter-window shuffle (FIWS) interleavers to enable parallel interleaving [17], [19]. The size of the interleaver is $N = 2r \times L = 2Kr^2$ because we need to perform interleaving on the information bits of $2r$ CZHCs at the same time. The interleaver is divided into $r$ sub-interleavers (also called windows), each with a window size of $2rK$. The windows are designed in such a way that memory contention is avoided when performing parallel interleaving. In other words, the first information bits of all $K$ segments in all $2r$ CZHCs are interleaved/deinterleaved in the same window.
The FIWS interleaver is realized by $r$ width-$(r \times N_{\mathrm{FHT}})$ depth-$2rK$ RAMs and a depth-$K$ ROM. The operations are described below and illustrated in Fig. 11.
1) Store the output APP LLRs from the $i$th sub-decoder ($i = 1, 2, \ldots, M$) in the RAMs in the order shown in Fig. 11, i.e., the $j$th bits of all the $K$ segments of all the $2r$ codes are stored in the $j$th RAM ($j = 1, 2, \ldots, r$).
2) Read the interleaver patterns of the CZHC from the ROM. Extract the interleaving information and evaluate the interleaver pattern for all the $2r$ CZHCs in the $r$ RAMs correspondingly.
3) Read the interleaved APP LLRs from the $r$ different RAMs. Regroup the APP LLRs and send them to the next sub-decoder as the a priori LLRs.

V. IMPLEMENTATION RESULTS AND ANALYSIS
For an order-$r$ CZHC, the code length is $l = rK + MK(2^r - r)$ and the code rate is $r_c = \frac{rK}{rK + MK(2^r - r)} = \frac{r}{r + M(2^r - r)}$. A concatenated zigzag Hadamard encoder/decoder system with the following parameters is implemented.
Fig. 12 shows the hardware utilization of each unit inside the sub-decoder. The FHT and DFHT units are both utilized in the forward and backward recursions. Denote the hardware utilization rates of the FHT, DFHT and interleaver by $U_{\mathrm{FHT}}$, $U_{\mathrm{DFHT}}$ and $U_{\pi}$, respectively. Since $1/K \to 0$, the FHT/DFHT units work almost all the time during decoding ($U_{\mathrm{FHT}} \approx U_{\mathrm{DFHT}} \approx 1$), while the interleavers operate approximately half of the time ($U_{\pi} \approx 1/2$).
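The length and rate expressions at the start of this section can be evaluated with a few lines of Python. The values $r = 6$ and $M = 3$ are illustrative assumptions, chosen only because they reproduce the code rate of 0.0333 quoted in the conclusion; $K = 100$ is hypothetical, and the function names are not part of the design.

```python
from fractions import Fraction


def czhc_length(r, K, M):
    """Code length l = rK + MK(2^r - r)."""
    return r * K + M * K * (2 ** r - r)


def czhc_rate(r, M):
    """Code rate r_c = r / (r + M(2^r - r)); independent of K."""
    return Fraction(r, r + M * (2 ** r - r))


rate = czhc_rate(6, 3)            # exact fraction 1/30
length = czhc_length(6, 100, 3)   # 600 information bits plus parity bits
```

Using exact fractions makes it clear that the rate depends only on $r$ and $M$, since $K$ cancels between the numerator and the denominator.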
Referring to Fig. 12, it takes $4rK + 2r = 2r(1 + 2K)$ clocks to process one component code in each sub-decoder. Assuming that the total number of iterations is $I$ and the number of component codes for each CZHC is $M$, it takes $2rIM(1 + 2K)$ clocks to decode the $2r$ CZHCs. Also assuming that the operating frequency of the sub-decoder is $f_c$ and the CZHC code length is $l$, the sub-decoder can decode $\frac{2rlf_c}{2rIM(1 + 2K)}$ bits in one second. Moreover, the decoder consists of $M$ sub-decoders and can decode $2rM$ CZHCs in parallel. The throughput of the whole decoding system is approximated by

$$T = M \cdot \frac{2rlf_c}{2rIM(1 + 2K)} = \frac{K\left[r + M(2^r - r)\right] f_c}{I(1 + 2K)} \approx \frac{2^{r-1} M f_c}{I},$$

where the approximation is made because $1/K \ll 1$ and $(M - 1)r \ll 2^r M$.
The decoding path in the THC decoder involves going through the FHT block, the BCJR block and then the DFHT block, with a latency of approximately $2K$. The decoding path of the CZHC decoder includes the FHT/DFHT block of the forward recursion and then the FHT/DFHT block of the backward recursion, with a latency of approximately $4rK$. The latency of the CZHC decoder is $2r$ times higher because the forward/backward recursions in the CZHC decoder cannot be pipelined for a single code. However, it is possible to perform pipelined decoding of multiple codes, which is realized in our design.
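To see how close the approximation is, note that $M$ sub-decoders, each delivering $\frac{2rlf_c}{2rIM(1+2K)}$ bits per second, give $\frac{lf_c}{I(1+2K)}$ in total, and that the two stated approximations reduce this to $\frac{2^{r-1}Mf_c}{I}$. The Python sketch below compares the two expressions. All numerical values ($r$, $K$, $M$, the number of iterations and the clock frequency) are hypothetical and are not the parameters of the implemented system.

```python
def throughput_exact(r, K, M, iterations, f_clk):
    """l * f_c / (I * (1 + 2K)) with l = rK + MK(2^r - r)."""
    length = r * K + M * K * (2 ** r - r)
    return length * f_clk / (iterations * (1 + 2 * K))


def throughput_approx(r, M, iterations, f_clk):
    """2^(r-1) * M * f_c / I, valid for 1/K << 1 and (M-1)r << 2^r * M."""
    return 2 ** (r - 1) * M * f_clk / iterations


# Hypothetical operating point.
exact = throughput_exact(r=6, K=100, M=3, iterations=10, f_clk=100e6)
approx = throughput_approx(r=6, M=3, iterations=10, f_clk=100e6)
ratio = exact / approx
```

For these values the exact expression is about 7% below the approximation. Most of the gap comes from the term $(M - 1)r$, which is small but not negligible compared with $2^r M$ at this order.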
In Table 1 and Table 2, we compare the FPGA implementation results of the encoder/decoder systems using CZHC and turbo Hadamard code (THC) [17] under the same code rate and code length. Table 1 indicates that a BCJR processor is not required in the CZHC decoder. The throughput of the CZHC system is 50% higher than that of the THC system. Table 2 shows that, with the same code length and code rate, the look-up tables (LUTs), look-up table RAMs and flip-flops used in the CZHC system are, respectively, 73%, 41% and 85% of those in the THC system. The block RAM usage in the CZHC system, however, is higher than that in the THC system. Fig. 13 shows the bit-error-rate (BER) results of CZHC and THC. Compared with the floating-point CZHC decoder, the fixed-point decoder shows a performance loss of about 0.1 dB at BER $= 2 \times 10^{-5}$ when 5-bit or 6-bit quantized channel LLRs are used. For an even smaller number of quantization bits, the BER performance is further degraded. Fig. 13 also shows that the BER performances of CZHC and THC are

VI. CONCLUSION
An efficient design of an ultimate-Shannon-limit-approaching encoder/decoder system based on CZHC has been explored using an FPGA. It can achieve a throughput of 1.44 Gbps at a code rate of 0.0333 and BER $= 1.5 \times 10^{-5}$ at $E_b/N_0 = -0.2$ dB. Compared to the THC system, the CZHC system achieves a 1.5 times larger throughput with a less complex hardware architecture but more block RAM usage. The main drawback of the CZHC system is a higher decoding latency. Future research work should aim at reducing the latency of the CZHC decoder.