Error Rate-Based Log-Likelihood Ratio Processing for Low-Density Parity-Check Codes in DNA Storage

Due to the advantages of high storage density and longevity, DNA storage has become one of the most attractive technologies for future data storage systems. However, the writing/reading cost is still high, and more efficient techniques for DNA storage are required. In this paper, we propose improved log-likelihood ratio (LLR) processing schemes based on observed statistics for low-density parity-check (LDPC) decoding, which reduce the reading cost while leaving the encoding scheme unchanged. Because of the mismatch between the real channel and the observed statistics, and the limit on the maximum decoder input value, scaling the magnitude of the LLRs can lead to better error correcting performance. We therefore propose two strategies: 1) directly scaling the LLRs and 2) scaling the pairwise substitution error rates, which in turn changes the LLR magnitudes. We also derive the relation between the substitution error rate and the scaling values in both strategies by curve fitting. Simulation results show that the error correcting performance of the proposed LLR calculation is better than that of the conventional scheme. Finally, we verify that the proposed LLR methods can be applied generally in DNA storage systems, and present practical methods to calculate error rates.


I. INTRODUCTION
With the rapid increase in the total amount of data, the demand for data storage is increasing. Since conventional storage requires exponentially increasing costs to store massive amounts of data and is susceptible to damage over relatively long periods, DNA storage has been proposed for its high storage density, longevity under ideal conditions, and efficient data duplication based on PCR technology [1]-[3]. In DNA storage, binary data are synthesized into DNA sequences over the four nucleotides (A, T, G, C) at a certain writing cost [4] and are then stored in a DNA pool. During the reading process, an Illumina sequencer reads randomly sampled DNA sequences from the pool [2]. The DNA sequences obtained by the sequencer are called reads [4], [5], which may differ from the original sequences due to errors in the synthesis and sequencing processes. The reading cost, defined to evaluate performance, is obtained by dividing the total number of bases in the minimum set of reads required to recover the original DNA sequences by the number of information bits [4]. (The associate editor coordinating the review of this manuscript and approving it for publication was Jin Sha.)
Currently, the error rates during DNA synthesis and sequencing are high [6]. To address this issue, simple error-correction channel codes [7] such as Reed-Solomon (RS) codes [6], Bose-Chaudhuri-Hocquenghem (BCH) codes [4], and fountain codes [8] have been considered for DNA storage systems. To further improve the error-correcting capability, it is natural to apply modern coding techniques such as low-density parity-check (LDPC) codes [9], which can approach the channel capacity limit. Since balanced GC-content (45%-55%) and short homopolymer length (≤3) are essential for DNA synthesis/sequencing [8], Deng et al. proposed variable-length run-length limited codes combined with LDPC codes [10]. Also, Chandak et al. proposed a DNA storage system with LDPC codes [4], in which a simple channel model with Poisson distribution was considered. A machine learning based DNA basecaller [11] was also proposed, which uses convolutional codes in DNA storage.
For better error-correction capability of LDPC codes, the input of the LDPC decoder should be soft information such as the log-likelihood ratio (LLR). In conventional radio frequency communication systems, the calculation of LLRs is well established because the channel models, such as the additive white Gaussian noise (AWGN) channel, are given. Unfortunately, no exact channel model for DNA storage has been established so far, and it is therefore not possible to calculate exact LLRs. This situation motivates us to find a way to calculate LLRs even without an exact channel model for DNA storage.
We propose to calculate LLRs using observed error statistics. To address the mismatch between the true channel and the error statistics, we propose two scaling methods: directly changing the magnitude of the LLRs, and changing the pairwise error rates to adjust the LLR magnitudes. Simulation results show that the performance of the proposed methods is better than or equal to that of the conventional method. To verify that the proposed LLR calculation methods can be applied to DNA storage, we present an example whose results are almost the same as those in the simulation results. The main contributions of our work are summarized as follows.
• We propose how to calculate LLRs from error rates under the assumption that no exact channel model for DNA storage exists. Prior work [4], [12] began to use LDPC codes in DNA storage but assumed simple symmetric channels [4] or asymmetric channels [12], which are not exact channel models for DNA storage.
• We propose two scaling methods suited to the decoding of LDPC codes, since the input range of LDPC decoders depends on the error rate and there is a mismatch between the exact channel model and the LLR calculation.
• We derive the relation between the scaling values of the two scaling methods and the error rates by curve fitting.
• Finally, we present a verification example of how to apply the proposed LLR calculation to new experiments.
The paper is organized as follows. In Section II, we briefly introduce the system model. In Section III, the LLR calculation and the two scaling methods are proposed in detail. In Section IV, the results of the comparison between Chandak et al.'s work [4] and our work are presented, and in Section V, we apply the proposed strategies to two experiments, present their simulation results, and describe a practical calculation of error rates. We conclude the paper in Section VI.

II. SYSTEM MODEL
The proposed system model of DNA storage is shown in Fig. 1. Since the encoding, multiple sequence alignment (MSA), and decoding are similar to those in [4], we briefly explain them in relation to the proposed LLR calculation.

A. DNA BASE MAPPING AND MSA
The encoding scheme for the proposed systems has the following steps [4]. First, the binary file is encoded by large-block LDPC codes, which can achieve excellent error correction performance. The binary data are split into fragments of a specific length and mapped to DNA bases. The bit pairs '00', '01', '10', and '11' in the binary data are mapped to the DNA bases A, C, G, and T, respectively. Then, a synchronization marker [4] can be added to the middle of each fragment to detect deletion or insertion errors on either side of the marker. Finally, we add an address index protected by a BCH code [4] to the front of each sequence to distinguish the randomly sampled reads after sequencing.
The MSA [4], [13] aligns three or more DNA sequences at the same address according to the consistency of their bases. During MSA processing, some positions of the DNA sequences are replaced by blanks, which are regarded as erasures E. Note that there are other recent alignment algorithms, such as pairwise sequence alignment and structural alignment. However, these algorithms are not fundamentally different from MSA, so the proposed method can also be applied to DNA storage systems with other alignment algorithms. The most important reason we consider MSA in this work is fair comparison, since the referenced system [4] used MSA as its alignment method.

B. LDPC DECODING
Since the LDPC decoding in this work is similar to that in previous works [4], [14], we briefly summarize it in this section. The sum-product algorithm [14] for binary LDPC decoding consists of the following steps.
First, the initial LLR for a variable node v is given as
L_v = \log \frac{p(y \mid x = 0)}{p(y \mid x = 1)},
where x is a transmitted bit and y is a received signal. Second, the message M_{vc}^{(l)} from a variable node v to a check node c at the l-th iteration is calculated as
M_{vc}^{(l)} = L_v + \sum_{c' \in C_v \setminus c} M_{c'v}^{(l-1)},
where C_v is the set of check nodes connected to the variable node v. Third, the message M_{cv}^{(l)} from a check node c to a variable node v is calculated as
M_{cv}^{(l)} = 2 \tanh^{-1} \Big( \prod_{v' \in V_c \setminus v} \tanh \big( M_{v'c}^{(l)} / 2 \big) \Big),
where V_c is the set of variable nodes connected to the check node c.
The message to estimate a variable node v at the l-th iteration is calculated as
M_v^{(l)} = L_v + \sum_{c \in C_v} M_{cv}^{(l)}.
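The steps above can be sketched as a compact sum-product decoder. This is a minimal illustrative implementation (not the decoder used in the experiments); message clipping is included to avoid the numerical saturation discussed in Subsection II.D.

```python
import numpy as np

def sum_product_decode(H, llr, max_iter=20):
    """Sum-product (belief propagation) decoding of a binary LDPC code.

    H   : (m, n) parity-check matrix over {0, 1}
    llr : length-n array of initial LLRs L_v = log p(y|x=0)/p(y|x=1)
    Returns a hard-decision codeword estimate.
    """
    m, n = H.shape
    M_vc = np.tile(llr, (m, 1)) * H              # variable-to-check messages
    for _ in range(max_iter):
        # Check node update: M_cv = 2 atanh( prod_{v' != v} tanh(M_v'c / 2) )
        t = np.tanh(np.clip(M_vc / 2, -18, 18))
        t[H == 0] = 1.0                          # neutral element for the product
        prod = np.prod(t, axis=1, keepdims=True)
        safe_t = np.where(np.abs(t) < 1e-12, 1e-12, t)
        extrinsic = np.where(H == 1, prod / safe_t, 0.0)
        M_cv = 2 * np.arctanh(np.clip(extrinsic, -0.999999999, 0.999999999)) * H
        # Variable node update: total LLR minus the message on each edge
        total = llr + M_cv.sum(axis=0)
        M_vc = (total - M_cv) * H
        hard = (total < 0).astype(int)
        if not np.any((H @ hard) % 2):           # all parity checks satisfied
            return hard
    return hard
```

For example, with a toy parity-check matrix and one unreliable bit, the decoder recovers the all-zero codeword from the LLRs [2.0, 2.0, -0.5, 2.0].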

C. CONVENTIONAL LLR CALCULATION
Since the LLR calculation depends on the channel model, we briefly introduce the conventional LLR calculation under the binary AWGN channel and the binary symmetric channel model with Poisson sampling [4]. Under the binary AWGN channel model, the LLR L is expressed as
L = \frac{2y}{\sigma^2}, \quad (1)
where \sigma^2 denotes the variance of the noise. In [4], it was assumed that reads pass through a binary symmetric channel with error probability \epsilon, and an ideal Poisson random sampling model was considered, together with a memoryless approximation of the channel. The input and output of the channel are a bit b and a tuple (k_0, k_1), respectively, where k_b is the number of times that the bit is read as b. Then, the transition probability of the channel is given as
p\big( (k_0, k_1) \mid b \big) = \frac{e^{-\lambda} \lambda^{k_0 + k_1}}{(k_0 + k_1)!} \binom{k_0 + k_1}{k_b} (1 - \epsilon)^{k_b} \epsilon^{k_{1-b}},
where \lambda is the ratio of reading cost to writing cost [4]. Then, the LLR can be calculated as
L = (k_0 - k_1) \log \frac{1 - \epsilon}{\epsilon}.
In [4], the error probability \epsilon was set to a fixed value of 4%. This channel model [4] yields a simple LLR calculation, but it does not reflect the real channel. Although DNA storage channels have been studied, there is no exact channel model with a specific distribution. Throughout this paper, we do not assume any specific channel model; instead, we compute LLRs using the ratios of occurrences of each base obtained from the experiments. The detailed method of the proposed LLR calculation is explained in Section III.
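Because the Poisson and binomial terms are identical for b = 0 and b = 1, they cancel in the likelihood ratio, leaving the closed form above. A minimal sketch:

```python
import math

def conventional_llr(k0, k1, eps=0.04):
    """LLR for a bit read k0 times as 0 and k1 times as 1 through a BSC(eps).

    Under the memoryless BSC model with Poisson sampling, the sampling terms
    cancel in the likelihood ratio, leaving L = (k0 - k1) * log((1-eps)/eps).
    The default eps = 0.04 matches the fixed 4% error probability of [4].
    """
    return (k0 - k1) * math.log((1 - eps) / eps)
```

With eps = 0.04, reading a bit three times as 0 and once as 1 gives L = 2 log(24), a strongly positive (i.e., bit-0) LLR.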

D. SCALING METHOD OF LLR FOR LDPC DECODING
Since there is a mismatch between the true DNA channel and the occurrences obtained from the experiments, we consider LLR scaling.
From our observation, there exists a bound in the calculation at the check nodes. To explain this, for all v and c, consider the first check node message
M_{cv}^{(1)} = 2 \tanh^{-1}(f), \quad f = \prod_{v' \in V_c \setminus v} \tanh \big( M_{v'c}^{(0)} / 2 \big),
where M_{v'c}^{(0)} are the initial LLRs. As shown in Fig. 3, f tends to 1 when the magnitudes of the initial LLRs exceed 12.39; this means that if the magnitudes of all initial LLRs are larger than 12.39, the output at the first iteration is saturated. According to the behavior of tanh^{-1} in Fig. 4, as f tends to 1 the check node message M_{cv}^{(1)} grows without bound. Thus, it is required to bound the range of message values at the check nodes. The range of the LLRs calculated from occurrence rates changes dynamically with the error rate: if the substitution error rate is too small, the magnitudes of the occurrence-based LLRs become too large, which degrades LDPC decoding.
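This saturation effect can be reproduced numerically. The sketch below computes the check node output 2 tanh^{-1}(f) with an explicit clip on f; the clip threshold is illustrative, chosen only to keep the output finite when the product rounds to exactly 1 in floating point.

```python
import math

def check_node_output(in_llrs, clip=0.9999999999):
    """Check-to-variable message 2*atanh( prod tanh(L/2) ) with clipping.

    Without the clip, f -> 1 as the input LLR magnitudes grow (in double
    precision, tanh(L/2) rounds to exactly 1.0 for large L), and atanh(f)
    overflows. This is why decoder inputs must be kept in a bounded range.
    """
    f = 1.0
    for L in in_llrs:
        f *= math.tanh(L / 2)
    f = max(-clip, min(clip, f))     # bound the product away from +/-1
    return 2 * math.atanh(f)
```

For moderate inputs the output is smaller in magnitude than the smallest input, as expected from the sum-product rule; for very large inputs the unclipped product would be exactly 1.0, so only the clip keeps the message finite.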

III. PROPOSED PREPROCESSING FOR LDPC CODES IN DNA STORAGE
In this section, we first explain the definition of the LLR and LLR calculation method based on the substitution error rate. Next, we explain two scaling strategies of the proposed LLR calculation for LDPC decoders.

A. LLR CALCULATION
Note that error statistics are required to compute the LLR. We assume that the error statistics are provided after alignment. Then, we derive a new calculation method for LLRs in LDPC decoding to improve decoding performance based on error statistics. Let x = (x 1 , x 2 , . . . , x n ) be an encoded DNA codeword of length n and y = (y 1 , y 2 , . . . , y n ) be a DNA read. Note that y can include erasure symbols.
Let the set of DNA symbols be B = \{b_1, b_2, b_3, b_4\} = \{A, C, G, T\}. For all j, 1 \le j \le n, each DNA base symbol is treated as two binary bits, denoted by x_j = (x_j^1, x_j^2). Suppose y_j = b_i; then, the initial LLRs L_j^1 and L_j^2 are given as
L_j^1 = \log \frac{p(x_j^1 = 0 \mid y_j = b_i)}{p(x_j^1 = 1 \mid y_j = b_i)} \quad (2)
and
L_j^2 = \log \frac{p(x_j^2 = 0 \mid y_j = b_i)}{p(x_j^2 = 1 \mid y_j = b_i)}. \quad (3)
According to Bayes' theorem (with equiprobable inputs), the two LLRs in (2) and (3) can be expressed as
L_j^1 = \log \frac{p(y_j = b_i \mid x_j^1 = 0)}{p(y_j = b_i \mid x_j^1 = 1)} \quad (4)
and
L_j^2 = \log \frac{p(y_j = b_i \mid x_j^2 = 0)}{p(y_j = b_i \mid x_j^2 = 1)}. \quad (5)
Since the bit pairs '00', '01', '10', and '11' in the binary data are mapped to A, C, G, and T in DNA symbols, the LLRs in (4) and (5) are expressed as
L_j^1 = \log \frac{p(y_j = b_i \mid x_j = A) + p(y_j = b_i \mid x_j = C)}{p(y_j = b_i \mid x_j = G) + p(y_j = b_i \mid x_j = T)} \quad (6)
and
L_j^2 = \log \frac{p(y_j = b_i \mid x_j = A) + p(y_j = b_i \mid x_j = G)}{p(y_j = b_i \mid x_j = C) + p(y_j = b_i \mid x_j = T)}. \quad (7)
Then, the four terms in (6) and (7) can be generally expressed as
p(y_j = b_i \mid x_j = b_k) = \frac{o(x_j = b_k, y_j = b_i)}{N}, \quad (8)
where N is the number of all aligned reads after MSA and o(x_j = b_k, y_j = b_i) denotes the number of occurrences for which the symbols x_j and y_j are b_k and b_i, respectively. Then, the average substitution error rate is given as
S_E = \frac{1}{n} \sum_{j=1}^{n} \sum_{i \ne k} p(y_j = b_i \mid x_j = b_k). \quad (9)
To calculate exact substitution error rates, we consider the case in which the symbol x_j in (8) is known and all reads are used. These assumptions are not valid in practice; in Section V, we discuss error rate estimation without these ideal assumptions.
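The occurrence-based LLR computation in (6)-(8) can be sketched as follows, assuming a 4x4 occurrence matrix. As a simplification of the normalization in (8), each row is normalized by its own count, and a small eps guards against empty counts.

```python
import math

BASES = ['A', 'C', 'G', 'T']   # mapped from '00', '01', '10', '11'

def bit_llrs(occ, observed, eps=1e-12):
    """LLRs (L1, L2) for the two bits of a position read as `observed`.

    occ[k][i] is the number of aligned occurrences with sent base BASES[k]
    and received base BASES[i]; each row is normalized to an estimate of
    the conditional probability p(y = b_i | x = b_k).
    """
    i = BASES.index(observed)
    p = [occ[k][i] / max(sum(occ[k]), 1) for k in range(4)]
    pA, pC, pG, pT = p
    L1 = math.log((pA + pC + eps) / (pG + pT + eps))  # first bit is 0 for A, C
    L2 = math.log((pA + pG + eps) / (pC + pT + eps))  # second bit is 0 for A, G
    return L1, L2
```

For a near-diagonal occurrence matrix, reading an A yields strongly positive LLRs for both bits (both bits likely 0), while reading a T yields strongly negative LLRs.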
While obtaining the statistics, we found a mismatch of substitution errors between the statistical and theoretical analyses due to the lack of a channel model for DNA storage systems. Since insertion and deletion errors exist and the alphabet of the binary codeword bits differs from the alphabet of the DNA symbols, the substitution error rate must be scaled for LDPC decoding. Moreover, the larger the substitution error rate, the smaller the LLR magnitude, so the LLR must be adjusted to fit the input range of the LDPC decoder. To address this issue, we propose two scaling strategies, whose main ideas are described in the following subsections.

B. FIRST SCALING METHOD FOR LLR
In Subsections III.B and III.C, we introduce two scaling methods, namely Strategy 1 (S1) and Strategy 2 (S2), respectively. To compare these methods with the conventional schemes, we consider the eight experiments in [4]. In [4], Experiments 1-2 and Experiment 8 have payload lengths of 164 bits and 174 bits, respectively, without a marker; the marker is the sequence 'ACT' placed at the center of each payload. Experiments 3-5, Experiment 6, and Experiment 7 have payload lengths of 174 bits, 168 bits, and 180 bits, respectively, including a 6-bit marker. For these eight experiments, we first find an optimal linear scaling strategy that minimizes the number of reads required to calculate the reading cost. Next, we explain the relation between the substitution error rate S_E and the scaling coefficients.
From the LLRs L_j^1 and L_j^2 in (6) and (7), we define the first scaled LLRs as \bar{L}_j^1 = \alpha L_j^1 and \bar{L}_j^2 = \alpha L_j^2, where \alpha is a scaling coefficient. When 0 \le \alpha \le 1, this scaling reduces the magnitude of the LLRs. The values of S_E and the best scaling coefficient \alpha for each experiment are listed in Table 1, where 'Exp.' means Experiment. According to Table 1, as S_E increases, \alpha also increases. To interpret the relation between S_E and \alpha, we consider curve fitting methods.
The LLR values in the eight experiments imply that we need to scale down the LLRs. To find \alpha, we increase \alpha by 0.01 from 0 to 1 and check whether all 30 trials are successful with the minimum number of reads. Finally, we use the median of such \alpha values, since the median provides robust results. Moreover, these \alpha values provide a margin for determining \alpha in new experiments, which will be discussed in Section V.
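The search described above can be sketched as follows. Here `decode_all_trials` is a hypothetical stand-in for running all 30 decoding trials with the candidate scaling coefficient at the minimum number of reads.

```python
import statistics

def find_alpha(decode_all_trials, step=0.01):
    """Sweep alpha over (0, 1] in steps of `step` and return the median of
    the values for which every trial decodes successfully.

    `decode_all_trials(alpha)` is a placeholder: it should run all 30
    decoding trials with LLRs scaled by alpha and return True if all succeed.
    Returns None if no alpha works.
    """
    good = [round(a * step, 2) for a in range(1, int(1 / step) + 1)
            if decode_all_trials(round(a * step, 2))]
    return statistics.median(good) if good else None
```

For instance, if all trials succeed exactly for 0.30 <= alpha <= 0.70, the returned median is 0.50, which sits safely inside the successful range.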
To determine the optimal curve fit between the scaling coefficient \alpha and S_E, we use the Curve Fitting Toolbox in MATLAB. In our work, we define S_E as the independent variable and \alpha as the variable dependent on S_E. The most important quantity in curve fitting is R^2, a statistical measure of how much of the variation of the dependent variable (\alpha) is explained by the independent variable (S_E) in a regression model. It is defined as R^2 = 1 - (residual sum of squares) / (total sum of squares) of the dependent variable. If R^2 is near 1, S_E is strongly correlated with \alpha. Since various curve fitting results between S_E and \alpha are possible, we choose the fitting method yielding the largest R^2. The best curve fit with the associated equation and parameters is shown in Fig. 5.
The best fit is the linear one-term curve f(x) = ax + c, where the independent variable x is the substitution error rate S_E and the dependent variable f(x) is the scaling coefficient \alpha. From the results in Fig. 5, when a new experiment with LDPC codes and a measured substitution error rate is performed, we can assign a proper scaling value in the LLR calculation. When the substitution error rate is larger than 4.7%, the coefficient \alpha exceeds 1, which increases the magnitude of the LLRs.
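The linear one-term fit and the R^2 computation can be reproduced with ordinary least squares. The (S_E, alpha) pairs below are illustrative placeholders, not the values in Table 1.

```python
import numpy as np

# Illustrative (S_E, alpha) pairs -- NOT the actual Table 1 values.
S_E   = np.array([0.004, 0.010, 0.020, 0.030, 0.040])
alpha = np.array([0.20,  0.32,  0.52,  0.71,  0.90])

# Linear one-term fit f(x) = a*x + c, as selected by the curve fitting.
a, c = np.polyfit(S_E, alpha, 1)

# R^2 = 1 - SS_res / SS_tot, computed on the dependent variable alpha.
predicted = np.polyval([a, c], S_E)
ss_res = np.sum((alpha - predicted) ** 2)
ss_tot = np.sum((alpha - alpha.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
```

For roughly linear data like the placeholders above, the slope a is positive (alpha grows with S_E, as in Table 1) and R^2 is close to 1.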
While the first scaling is applied directly to the magnitude of the LLRs, the second scaling adjusts the pairwise substitution error rates.

C. SECOND SCALING METHOD FOR LLRs
We define the scaled pairwise substitution error rate as
\hat{p}(y_j = b_i \mid x_j = b_k) = \beta \, p(y_j = b_i \mid x_j = b_k), \quad i \ne k, \quad (10)
where \beta is a scaling coefficient satisfying \beta > 0. Then, the scaled LLRs \hat{L}_j^1 and \hat{L}_j^2 are calculated from L_j^1 in (6) and L_j^2 in (7) by using the scaled substitution error rates defined in (10). For example, when y_j is C, the scaled LLRs are given as
\hat{L}_j^1 = \log \frac{\beta p(y_j = C \mid x_j = A) + p(y_j = C \mid x_j = C)}{\beta p(y_j = C \mid x_j = G) + \beta p(y_j = C \mid x_j = T)}
and
\hat{L}_j^2 = \log \frac{\beta p(y_j = C \mid x_j = A) + \beta p(y_j = C \mid x_j = G)}{p(y_j = C \mid x_j = C) + \beta p(y_j = C \mid x_j = T)}.
When \beta > 1, this scaling also makes the magnitudes of the LLRs smaller. The values of S_E and the best scaling coefficient \beta for each experiment are listed in Table 2. Similar to \alpha, we optimize the parameter \beta: we increase \beta in steps of 0.01, check whether all 30 trials are successful with the minimum number of reads, and use the median of such \beta values.
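Strategy 2 can be sketched as follows: the off-diagonal (error) entries of the conditional probability table are multiplied by beta, the correct-reception entries are left unchanged, and the bit LLRs of (6)-(7) are recomputed. This is an illustrative sketch of the reconstructed rule in (10), not the exact experiment code.

```python
import math

BASES = ['A', 'C', 'G', 'T']   # mapped from '00', '01', '10', '11'

def scaled_bit_llrs(p_cond, observed, beta, eps=1e-12):
    """Strategy 2 (S2): scale pairwise substitution probabilities by beta.

    p_cond[k][i] = p(y = b_i | x = b_k). Off-diagonal entries (substitution
    errors) are multiplied by beta before the bit LLRs are recomputed, so
    beta > 1 shrinks the LLR magnitudes.
    """
    i = BASES.index(observed)
    p = [p_cond[k][i] if k == i else beta * p_cond[k][i] for k in range(4)]
    pA, pC, pG, pT = p
    L1 = math.log((pA + pC + eps) / (pG + pT + eps))
    L2 = math.log((pA + pG + eps) / (pC + pT + eps))
    return L1, L2
```

With a symmetric probability table, increasing beta from 1 to 3 visibly reduces the LLR magnitude for an observed A, matching the role of beta described above.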
According to Table 2, as S_E increases, \beta decreases, and we again consider curve fitting methods. After comparing various curve fitting results between S_E and \beta, we choose the method with the largest R^2 of 0.9999. The best curve fit with the associated equation and parameters is shown in Fig. 6.
The best curve fitting method is the Gaussian two-term curve, which can be expressed as f(x) = a_1 exp(-((x - b_1)/c_1)^2) + a_2 exp(-((x - b_2)/c_2)^2), with the parameters given in Fig. 6. According to Fig. 6, if S_E is smaller than 5.76%, \beta is larger than 1, which corresponds to reducing the magnitudes of the LLRs.

IV. SIMULATION RESULTS
To compare the proposed scaling methods with the conventional one, we use the experimental data in [4], which were synthesized by CustomArray and sequenced by Illumina iSeq technology. The detailed experimental parameters and procedures are described in Sections 3 and 4 of the Supplementary Material in [4], respectively. The code is available at https://github.com/shubhamchandak94/LDPC_DNA_storage. Among the parameters, we explain any changes and the important ones below.
In these experiments, we consider 30 trials instead of 20 and find the minimum number of reads that successfully decodes all trials, since the 20 trials in [4] are too few. We set the number of incremental reads to 500. We denote the minimum number of reads that makes all trials successful by M_read. Then, the reading cost R_cost is defined as R_cost = (M_read × oligo length) / (file size (bytes) × 8). We compare the referenced system in [4] and the two proposed methods. The simulation results with the minimum number of reads M_read and the reading cost R_cost are given in Table 3 and Fig. 7, where 'Ref. [4]' stands for the referenced system [4]. Table 3 and Fig. 7 show that the two scaling methods lead to improvements in Experiments 2, 4, 5, and 7 and similar performance in Experiments 1, 3, 6, and 8. To compare decoding performance, we list the number of successful decoding trials among 30 for the three LLR calculations as the number of reads increases, mainly for Experiments 2, 4, 5, and 7 in Tables 4, 5, 6, and 7, respectively, where 'No. Reads' denotes the number of reads.
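The reading cost definition above translates directly into code; the numbers in the usage note are illustrative, not values from Table 3.

```python
def reading_cost(m_read, oligo_length, file_size_bytes):
    """R_cost = (M_read * oligo length) / (file size in bits).

    m_read          : minimum number of reads that decodes all trials
    oligo_length    : number of bases per oligo
    file_size_bytes : size of the stored file in bytes
    """
    return (m_read * oligo_length) / (file_size_bytes * 8)
```

For example, with illustrative values M_read = 1000, an oligo length of 200 bases, and a 12500-byte file, the reading cost is 2.0 bases per information bit.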

V. DISCUSSION ON GENERALIZATION OF THE PROPOSED SCALING METHODS AND PRACTICAL CALCULATION OF ERROR RATES
To verify that the proposed scaling methods are suitable for DNA storage with LLR calculation, we must apply them to new experiments and compare the results. However, since there are no new experiments with similar environments, it is unclear whether the proposed scaling methods generalize to DNA storage. In this section, we suggest one application to verify the proposed scaling methods. Moreover, we explain how to obtain error rates in more practical applications.

A. APPLICATION OF THE PROPOSED SCALING METHODS
In this subsection, we verify the application of our proposed scaling methods S1 and S2 to decoders for other experiments in DNA storage, which require soft information. Instead of running new experiments, we fit the scaling curve based on six experiments only and test the scaling on the two remaining experiments. Since there are 28 ways to choose six experiments among the eight, we consider only one case: we assume the results of Experiments 1-6 are known and derive curve fitting functions for the two scaling methods in the manner explained in Section III. Then, we derive the optimal scaling coefficients of Experiments 7-8 from the curve fitting results of the other six experiments.
First, we apply the proposed scaling strategy S1 to Experiments 7-8. According to the optimal scaling coefficients of Experiments 1-6 in Table 1, we perform curve fitting of  Experiments 1-6 with a linear one-term function, as shown in Fig. 8.
When Experiments 7-8 with LDPC codes and substitution error rates are provided, we can assign proper scaling values in the LLR calculation. The substitution error rate S_E and the scaling coefficients \alpha of Experiments 7-8, derived from the linear one-term function in Fig. 8, are listed in Table 8.
In Table 8, the coefficients \alpha for Experiments 7-8 are 0.78 and 0.64. Compared to Table 1, where we assume that the true symbols are known, the differences are -0.01 and 0, respectively. Therefore, the performance of Experiment 8 is clearly the same. The performance of Experiment 7 is also the same, as will be shown in Table 10.
Similarly, we apply Experiments 7-8 for the proposed scaling strategy S2; according to the optimal scaling coefficients of Experiments 1-6 in Table 2, we perform curve fitting of Experiments 1-6 with a Gaussian two-term function, as shown in Fig. 9.
The substitution error rate S_E and the scaling coefficients \beta of Experiments 7-8, derived from the Gaussian two-term function in Fig. 9, are listed in Table 9.
Note that the \beta values are 2.74 and 4.89 for Experiments 7-8 in Table 9, which differ only slightly from those in Table 2. For the simulations with the scaling coefficients in Tables 8-9, we consider conditions similar to those in Section IV: we still run 30 trials of Experiments 7-8, and the incremental number of reads is 500. For the three LLR calculation methods, namely Ref. [4], S1, and S2, the simulation results with the minimum number of reads M_read and the reading cost R_cost are shown in Table 10 and Fig. 10.
From Table 10 and Fig. 10, we see that Experiments 7-8 show performance similar to that in Section IV. This means that our proposed scaling methods, S1 and S2, are generally applicable for estimating scaling values for the decoders of new experiments in DNA storage. As in Section IV, we list in Table 11 the number of successful decoding trials among 30 for the three LLR calculations in Experiment 7 as the number of reads increases.

B. PRACTICAL CALCULATION OF ERROR RATES
In Section III.A, we determined the scaling values from statistics computed under the assumption that the encoded symbol x_j in (8) is known. These statistics provide the actual error rates, but such error rates are not easy to obtain in practice. In this subsection, we suggest two methods for calculating error rates.
The issue in the statistics calculation is that we cannot know the encoded symbol x_j. However, we can estimate it using the whole set of reads after MSA. Thus, the first calculation method of error rates (E1) is to estimate the symbol \hat{x}_j at each position by the majority rule and use the estimated symbols \hat{x}_j in the calculation. In other words, we replace the probability p(y_j = b_i \mid x_j = b_k) in (8) with
p(y_j = b_i \mid \hat{x}_j = b_k) = \frac{o(\hat{x}_j = b_k, y_j = b_i)}{N}. \quad (11)
In all eight experiments, we find that the estimated symbols \hat{x}_j are the same as the true encoded symbols x_j. This implies that the probability in (11) is equivalent to the probability in (8). Then, the scaling values obtained from E1 are the same as those in Tables 1 and 2; moreover, M_read and R_cost from E1 are also the same as those in Table 3. The advantage of E1 is that it computes error rates without knowing the encoded symbols. However, it requires checking all reads, which leads to computational issues.
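The majority-rule estimation in E1 can be sketched as follows, assuming the aligned reads are equal-length strings over A/C/G/T with 'E' marking the erasures introduced by MSA.

```python
from collections import Counter

def estimate_sent_symbols(reads):
    """E1: estimate the sent base at each aligned position by majority rule.

    `reads` is a list of equal-length aligned reads (strings over A/C/G/T
    plus 'E' for MSA erasures); erasures are ignored in the vote. Returns
    the estimated sequence of sent symbols.
    """
    n = len(reads[0])
    estimate = []
    for j in range(n):
        votes = Counter(r[j] for r in reads if r[j] != 'E')
        estimate.append(votes.most_common(1)[0][0] if votes else 'E')
    return ''.join(estimate)
```

For example, four aligned reads with one substitution and one erasure still vote the original sequence back at every position.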
The second calculation method of error rates (E2) uses partial reads instead of all reads. Similar to E1, we estimate the symbol \hat{x}_j at each position by the majority rule, but from partial reads. Then, the estimated probability from partial reads is given as
\hat{p}(y_j = b_i \mid \hat{x}_j = b_k) = \frac{o(\hat{x}_j = b_k, y_j = b_i)}{N'},
where N' is the number of partial reads. We choose the minimum number of reads M_read in Table 3 as N' for the eight experiments. Since each trial provides different statistics, we denote the substitution error rate for the m-th trial by S_E^m, computed as in (9) from the estimated symbols \hat{x}_j^m and the received symbols y_j^m of the m-th trial. Then, the average substitution error rate over 30 trials is defined as
\bar{S}_E = \frac{1}{30} \sum_{m=1}^{30} S_E^m.
We summarize the number of total reads (N), the number of partial reads (N'), and the average error rate over 30 trials (\bar{S}_E) in Table 12. We also present the parameters \alpha and \beta obtained from the curve fitting functions in Fig. 5 and Fig. 6 using \bar{S}_E.
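The per-trial error rate and its average over trials can be sketched as follows; this is an illustrative simplification in which S_E^m is the fraction of non-erasure aligned symbols that disagree with the majority-rule estimate.

```python
def trial_error_rate(estimated, reads):
    """S_E^m for one trial: fraction of non-erasure aligned symbols that
    differ from the majority-rule estimate `estimated` (a string)."""
    errors = total = 0
    for r in reads:
        for x_hat, y in zip(estimated, r):
            if y != 'E':                 # erasures are excluded from the count
                total += 1
                errors += (x_hat != y)
    return errors / total if total else 0.0

def average_error_rate(per_trial_rates):
    """Average substitution error rate over the trials (e.g., 30 trials)."""
    return sum(per_trial_rates) / len(per_trial_rates)
```

With two short reads and one substitution out of eight counted symbols, the per-trial rate is 0.125; averaging over trials then gives the counterpart of the averaged rate reported in Table 12.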
Since the average substitution error rate calculated from partial reads in Table 12 is lower than the error rates in Tables 1 and 2, \alpha and \beta tend to decrease and increase, respectively. However, the scaling values in each trial are not always the same, since they are determined not by \bar{S}_E but by S_E^m at the m-th trial. After comparing the 30 values of S_E^m with the single \bar{S}_E in the eight experiments, we find that the 30 substitution error rates are near the average value in each experiment. Table 13 shows the substitution error rates for the 30 trials in Experiment 1.
The results using E2 are almost the same as those in Table 3, except for the \alpha scaling (S1) of Experiment 1, so the detailed results are omitted here. Among the 30 trials, we find that the decoding of a single trial (trial 24) in Experiment 1 is unsuccessful. In trial 24, \alpha is 0.49 since S_E^{24} is 0.41. To find the reason why trial 24 fails, we perform two additional tests. The first test finds the minimum \alpha: we increase \alpha by 0.01 until trial 24 decodes successfully when M_read is 35000, obtaining a minimum \alpha = 0.53. The second test finds the minimum M_read: we increase M_read by 500 until trial 24 decodes successfully when \alpha is 0.49, obtaining a minimum M_read = 35500. From these results, we infer that there is a slight mismatch in the \alpha scaling (S1) in Experiment 1. However, since the results using E2 (error rates from partial reads) are consistent with the results using the true error rates (and with E1) in most cases, the error rate calculation from partial reads can be used in practice.

VI. CONCLUSION
In this work, new LLR estimation schemes have been proposed to reduce the reading cost in DNA storage with LDPC codes. Considering that the mismatch of substitution errors between the statistical and theoretical analyses results in larger LLR magnitudes, we proposed two strategies for scaling the LLR magnitude. The proposed scaling values were obtained from eight experiments, but proper scaling values can be assigned via the curve fitting equations. Since recent error correcting codes with powerful error correcting capability need soft information such as LLRs, the proposed LLR calculation methods can be applied to other modern codes in DNA storage.