Design of nonbinary error correction codes with a maximum run-length constraint to correct a single insertion or deletion error for DNA storage

Due to the advantages of high information densities and longevity, DNA storage systems have begun to attract a lot of attention. However, common obstacles to DNA storage are caused by insertion, deletion, and substitution errors occurring in DNA synthesis and sequencing. In this paper, we first explain a method to convert binary data into general maximum run-length r sequences with specific length construction, which can be used as the message sequence of our proposed code. Then, we propose a new single insertion/deletion nonbinary systematic error correction code and its corresponding encoding algorithm. For the proposed code, we design the fixed maximum run-length r in the parity sequence of the proposed code to be three. Additionally, the last parity symbol and the first message symbol are always different. Hence, the overall maximum run-length r of the output codeword is guaranteed to be three when the maximum run-length of the message sequence is three. Finally, we determine the feasibility of the proposed encoding algorithm, verify successful decoding when a single insertion/deletion error occurs in the codeword, and present the comparison results with relevant works.


I. INTRODUCTION
As people gradually rely on more and more data, the hardware of data storage systems has also been gradually upgraded. However, the large space and higher running cost of data centers makes it difficult to meet people's needs. As a result, researchers have begun to turn their attention to next-generation storage techniques, such as DNA storage. In recent years, DNA storage has gradually been put into practical use due to its longevity, high storage density, and ability to support random access through PCR technology [1]- [3]. For example, Organick et al. stored 200MB of data in 13 million DNA strands [2]. In addition, in [4]- [6], the length of the synthesized DNA strands with four-base nucleotides is 160-200 nt, which helps to meet low-cost requirements.
Although DNA storage has many potential advantages, recent studies [1], [2], [4], [7] have indicated that insertion, deletion, and substitution errors occur in DNA synthesis and sequencing processes. To improve the robustness of the DNA synthesis and sequencing processes, error correction codes such as fountain codes [5], Reed-Solomon codes [7], Bose-Chaudhuri-Hocquenghem codes [8], and low-density paritycheck codes [8]- [10] have been applied to DNA storage systems. Since a DNA strand is subject to two constraints, i.e., balanced GC-content (45%-55%) and short homopolymer length (≤3), which enable low error rates during DNA synthesis and sequencing processes [5], the constrained codes [4], [11] without the capability of error correction have been considered to reduce the occurrence of insertion, deletion, and substitution errors in DNA storage. Furthermore, the constrained code combined with the error correction code has been considered for DNA storage [12].
A binary single insertion/deletion error correction (SIDEC) code, which is called a Varshamov-Tenengolts (VT) code [13], was first proposed in 1965. Levenshtein modified the binary VT code as an extended VT (EVT) code [14] to correct a single insertion/deletion/substitution error. The nonbinary VT codes [15], which also can correct a single insertion/deletion error, were proposed almost TABLE 1. The limitation and weakness of the related works [16], [18] and improvement of our work Limitations and weakness of the related work Improvement of our work [16] • Code construct failure (CCF) issue was not considered, and some parity symbols could not be generated.
• The CCF issue is considered in codeword construction and the solution in encoding steps is presented. • Codewords to satisfy the maximum run-length constraint are not guaranteed.
• All output codewords satisfy the maximum run-length constraint. [18] • A single insertion/deletion bit error can be corrected.
• A single nonbinary insertion/deletion symbol can be corrected.

•
From the maximum run length of the binary SIDEC code e ≤ 7, the length of the binary codewords is less than 136.
• In our code construction, there is no limitation of codeword length. 20 years later. The authors of [16] proposed an efficient systematic encoding algorithm for nonbinary SIDEC codes whose codewords consist of parity symbols and message symbols. A binary EVT code combined with a GC-balanced constrained code [12] and a binary VT code combined with a GC-balanced and r run-length limited constrained code [17] were applied for DNA storage systems. Additionally, a new binary SIDEC code combined with the maximum runlength r constrained code and efficient systematic encoding algorithm was proposed in [18]. All of these studies [8], [9], [11], [12], [17] focused on binary coding schemes for DNA storage systems. So far, there have been few studies focusing on q-ary coding schemes [19] for DNA storage systems. Moreover, insertion and deletion errors are inevitably bound to occur in the process of DNA synthesis and sequencing. For these purposes, we propose a new nonbinary SIDEC code with the maximum run-length r constraint and systematic encoding algorithm. We can apply the proposed code with q = 4 and the maximum run-length r = 3 which are suitable for DNA storage.
Our work is especially inspired by related works [16] and [18]. First, the nonbinary code design in [16] did not consider the maximum run-length constraint and some of the codewords from the deign in [16] cannot be successfully encoded. However, these issues are addressed in our code construction. Second, although the binary SIDEC code in [18] involved the maximum run-length e constraint, the length of output codewords was limited by the maximum run length e. However, our proposed code has no limitations on codewords length. We summary the limitations and weakness of the related works [16], [18] and improvement of our work in Table 1.
We first present a method to convert the binary data into q-ary symbol sequences with the maximum run-length r constraint. These sequences can be used as inputs for the proposed nonbinary SIDEC code with the maximum run-length r constraint. The construction of the proposed q-ary SIDEC code and its corresponding systematic encoding algorithm are also presented. The output codeword is a sequence of the q-ary SIDEC code with the maximum run-length r constraint. Simulation results show that the encoding algorithm is feasible and the q-ary SIDEC code with the maximum runlength r constraint can correct a single deletion or insertion error. The main contributions of our work are summarized as follows.
• We propose a new construction of a q-ary systematic SIDEC code whose codewords meet the maximum runlength r constraint. • For the proposed code, we propose systematic encoding algorithm with the parity sequence whose the maximum run-length r is three. The paper is organized as follows. In Section II, the notation is given, the proposed code construction and algorithm are introduced, and the method of converting binary data into the maximum run-length r sequences is presented. In Section III, a nonbinary SIDEC code with the maximum runlength r constraint and its systematic encoding algorithm are proposed. In Section IV, we briefly explain the decoding method for the proposed nonbinary SIDEC code with the maximum run-length r constraint. In Section V, we supply the encoding and decoding simulation results and comparison results. Finally, we conclude the paper in Section VI.

II. PRELIMINARIES
In this section, the overall of the proposed code and the method used to construct the q-ary message sequences with the maximum run-length r constraint are introduced.
Given a sequence v = (v 1 , v 2 , · · · , v n ) ∈ [q] n , min(v), max(v), and sum(v) are defined as functions that return the minimum element, maximum element, and summation of all elements in vector v, respectively. The consecutive i are all the same and they are different from v i−1 and v j+1 . The maximum run-length is the length of the longest run in the sequence. Let G b a be the set of sequences v ∈ [q] b of length b with the maximum run-length a constraint for any positive integer a, b, and a < b. Here, · , · , mod, ∩, log(·), and dec a (·) denote the ceiling function, floor function, modulo operation, intersection set, logarithm of base 2, and decimal representation of base a for any positive integer a, respectively. The Hamming weight of a sequence v is the number of symbols that are different from the zero symbol. For any two sequences u = (u 1 , u 2 , · · · , u a ) with length a and v = (v 1 , v 2 , · · · , v b ) with length b, the concatenation of u and v is defined as (u, v) = (u 1 , u 2 , · · · , u a , v 1 , v 2 , · · · , v b ).

B. PROPOSED CODE CONSTRUCTION AND ALGORITHM
The overall block diagram of the proposed code is shown in Fig. 1. Since the construction method that converts the binary data B into a q-ary sequence z with the maximum run-length r constraint is similar in [19], we briefly explain the method in Section II.C. We also present an application of a codeword where the maximum run-length r is three for DNA storage. We segment the q-ary sequence with the maximum run-length r constraint into multiple sequences with the specified length; this is the message sequence of the proposed SIDEC code C a,b (n + 1) in Section III.A. The message sequences used as the inputs pass through the algorithm of the systematic proposed SIDEC code in Section III.C, and the output codewords have properties of single insertion/deletion error correction and the maximum runlength r constraint. The maximum run-length of the output codewords is r = 3, since we design the fixed maximum runlength r to be three in message the parity sequences and the last symbol of the parity sequence is always different with the first symbol of the message sequence. The codewords with maximum run-length r = 3 can be used for DNA synthesis due to the constraints for DNA storage systems [5].

C. CONSTRUCTION OF SEQUENCES WITH THE MAXIMUM RUN-LENGTH CONSTRAINT
Our method for constructing the maximum run-length r sequences is similar to the method used in [19]. We extend the method in [19] to the general case to obtain the qary maximum run-length r constrained sequence. First, we assume that the binary data B can be divided into µ blocks of length bits, which are represented as b i Then, we build one-to-one mapping to convert b i , where x j is a ((q−1)q r−1 )-ary element for j ∈ [1, µh]. Then, the mapping between the binary subsequence b i (i−1) +1 with bits in B and the ((q − 1)q r−1 )-ary subsequence x ih Next, we can build a one-to-one mapping to convert a ((q − 1)q r−1 )-ary element x j in x into a q-ary subsequence z jr (j−1)r+1 with r symbols, since the subsequence z jr (j−1)r+1 in z has the property that z jr−1 = z jr for j ∈ [1, µh]. It can represent ((q − 1)q r−1 ) different values. Then, the sequence x can be converted into a q-ary sequence z = (z 1 , z 2 , · · · , z µhr ) = (z r 1 , z 2r r+1 , · · · , z jr (j−1)r+1 , · · · , z µhr (µh−1)r+1 ) of length µhr. Since z jr (j−1)r+1 has z jr−1 = z jr for j ∈ [1, µh], for any two subsequences z j1r (j1−1)r+1 and the first symbol z (j2−1)r+1 in z j2r (j2−1)r+1 have z j1r = z (j2−1)r+1 , and the maximum run-length is r when z j1r = z (j2−1)r+1 . Therefore, the q-ary sequence z is the maximum run-length r sequence if the maximum run-length of any two adjacent subsequences z jr (j−1)r+1 and z ) in the sequence z is r. Finally, the sequence z of length µhr is segmented into µhr/k blocks of sequences y = (y 1 , y 2 , · · · , y k ) ∈ [q] k with specified length k, which can be used as the message sequences of the proposed code. Since the sequences y are a subset of sequence z and the maximum run-length of sequence z is r (as long as the maximum run-length of the concatenation of any two adjacent subsequences with r symbols is r), the run-length of sequences y is less than or equal to r.
Since the alphabet and maximal run-length are popularly used as q = 4 and r = 3 in DNA storage, respectively, we can consider converting a block of = 16 bits to 48-ary h = 3 elements. If A = 0, T = 1, G = 2, and C = 3, an example of one-to-one mapping between a 48-ary element and q-ary subsequence z jr (j−1)r+1 with r symbols for j ∈ [1, µh] is given in Table 2.

III. CODE CONSTRUCTION AND ENCODING FOR A NONBINARY SIDEC WITH A MAXIMUM RUN-LENGTH CONSTRAINT CODE
In this section, we propose a method to construct a nonbinary SIDEC code and then present a systematic encoding VOLUME 4, 2016 TABLE 2. The example of one-to-one mapping between a 48-ary element and q-ary subsequence z jr (j−1)r+1 with r symbols.

48-ary
Subsequence 48-ary Subsequence 48-ary Subsequence element z jr algorithm of the proposed nonbinary SIDEC code. We design parity symbols that satisfy the following constraints: the maximum run-length r is three and the symbols can be obtained by mapping through tables. The outputs of the systematic encoding algorithm are the maximum runlength r = 3 constraint codewords when the maximum runlength r of the message sequence is not larger than three. Hence, the output codewords have two properties: single insertion/deletion error correction and the maximum runlength r constraint.
The codeword c is (p m 0 , y k 1 ), which is the concatenation of the parity sequence p m 0 = (p 0 , p 1 , · · · , p m ) of length m + 1 and the message sequence y k 1 = (y 1 , y 2 , · · · , y k ) of length k. To determine codeword c, monotonically increasing integer sequence I is explained in Definition 1. In Table 3, the subsequence I n 1 of I is used, and I n+1 will be used for syndrome calculation later. When the length of message sequence k is given, I n+1 should be first determined for the sequence I and the number of the parity symbols can be determined by I n+1 . However, since I n+1 is also recursively calculated after determining I n 1 in Definition 1, it is not possible to determine I n+1 and B = log(I n+1 ) directly. To solve this problem, we first provide the relationship between the message sequence length k and the value of B in Table 4, which can be understood after the sequence generation has ended. We also present the number of parity symbols by (5) and the length of codeword c by (6), as as shown in Table 4.
Moreover, since the number of parity symbols depends on whether B is even or odd, and around half of B can make a special pattern in the sequence, we need to define σ. From B, the value of σ is defined as Conversely, from σ, the value of B can be defined as From the relationship between the message sequence length k and the value of B in Table 4, we conclude that Definition 1: We define a positive, monotonically increasing integer sequence I = (I 1 , I 2 , · · · , I n+1 ) of length n + 1. When the length of message sequence k is predefined, the value of B can be determined according to (2). The sequence I is defined as where T i for i ∈ [1, σ − 1], L, m, n + 1, and I n+1 m are expressed as where We present some examples of the construction of the positive, monotonically increasing integer sequence I. In Example 1, we can intuitively see that the integer sequence I is a monotonically increasing sequence. The reason why some indices in T i for i ∈ [1, σ − 1] are a multiple of three in (4) is that the fixed maximum run-length r is designed to be three in p m 0 of the proposed code.
The relationship between the message sequence length k, the value of B, the number of parity symbols m + 1, and the length of codeword c n + 1 .
Proof: The sequence I is a positive, monotonically increasing sequence. Hence, the sequence I is a monotonically increasing code [20]. Since monotonically increasing codes are also SIDEC codes [20] and sum(c) = a (mod q) satisfies the property of nonbinary VT codes [15], c ∈ C a,b (n + 1) is a codeword of a SIDEC code.

B. ENCODING ALGORITHM
In this subsection, we present a systematic encoding algorithm for nonbinary SIDEC codes, where the maximum runlength r in the parity sequence is three. To satisfy the runlength constraint between a sequence of parity symbols and a sequence of message symbols, the last parity symbol is designed to be always different from the first message symbol. The output codewords have the properties of nonbinary SIDEC codes and the maximum run-length r which depends on the maximum run-length of the message sequence.

1) Overview
This systematic encoding algorithm converts a message sequence y k 1 with the maximum run-length r constraint into a SIDEC codeword with the maximum run-length r constraint. The output codeword c ∈ C a,b (n + 1) ∩ G n+1 r is defined by c = (p m 0 , y k 1 ). The last symbol p m of p m 0 is used for avoiding the run-length between p m 0 and y k 1 , and (σ − 1) symbols of p m 0 whose indices are a multiple of three are used for avoiding the run-length in the parity sequence. The other symbols of p m 0 are represented by the value of Syn(α). We explain the encoding algorithm in six steps according to Table 3. The parity sequence p m 0 and message sequence y k 1 correspond to the partial codeword c m 0 = (c 0 , c 1 , · · · , c m ) and c n m+1 = (c m+1 , c m+2 , · · · , c n ), respectively.

2) Encoding algorithm steps
Step 1. The length of the message sequence y k 1 is predefined as k. The subsequence I n 1 of I and the value of I n+1 can be determined by Definition 1. Then, the length m + 1 of the parity sequence p m 0 can be obtained by (5) with I. This encoding algorithm requires d 3i for i ∈ [1, σ − 1], a, b, and message sequence y k 1 ∈ G k r as input parameters. The output codeword c ∈ C a,b (n + 1) ∩ G n+1 r is obtained from the proposed encoding algorithm. The symbols p 3i are used for avoiding the run-length in the parity sequence p m 0 for i ∈ [1, σ − 1] and the symbol p m is used for avoiding the run-length between the parity sequence p m 0 and message sequence y k 1 . When the message sequence y k 1 embeds into the partial codeword c n m+1 , the corresponding partial auxiliary sequence α n m+2 = (α m+2 , α m+3 , · · · , α n ) with length k − 1 can be obtained according to (8).
After the symbols c 3t 3(t−1) are determined, if CCF occurs or if the maximum run-length r > 3, we flip the current α 3t .
Step 6. The rest of the symbols c 2 0 = (c 0 , c 1 , c 2 ) in the output codeword c is used for computing to satisfy the sum(c) = a (mod q). We define γ as According to the value of γ and the α 3 1 = (α 1 , α 2 , α 3 ), we can determine the symbols c 2 0 . Since sometimes there is not only one case of c 2 0 that satisfies the value of γ and the α 3 1 , we should determine the case to aviod the run-length issue as much as possible. Similarly, after the symbols c 2 0 are determined, if CCF occurs or if the maximum run-length r > 3, we filp α 3 .
According to the encoding algorithm, the codewords are successfully constructed. The simulation results are shown in Section V. In Fig. 2, Y, and N are represented encoding step number, yes, and no, respectively. VOLUME 4, 2016 Algorithm 1 Encoding the SIDEC code with the maximum run-length r constraint Input: d 3i for i ∈ [1, σ − 1], a, b, and message sequence y k 1 ∈ G k r . Output: codeword c = (p m 0 , y k 1 ) ∈ C a,b (n + 1) ∩ G k r . 1: Step 1: Embed message sequence y k 1 → the partial auxiliary sequence α n m+2 .  Table 6. 9: if c m 3(σ−1) causes CCF then 10: flip α m → c m = max([q]\y 1 ) → α m+1 .

IV. DECODING FOR AN INSERTION OR DELETION ERROR
Due to the strong relationship between the auxiliary sequence α and codeword c of C a,b (n + 1) code, when a single insertion/deletion error occurs in the codeword c of C a,b (n + 1) code, a single insertion/deletion error occurs in the corresponding auxiliary sequence α at the same position. Recall that the sequence I is a monotonically increasing sequence; hence, we can acquire the decoding algorithm by the method used in [21]. Since the C a,b (n+1) code has similar properties as the nonbinary VT codes [15], we can decode for a single insertion/deletion error. Since decoding for an insertion error is similar, we will briefly introduce the decoding method for the deletion error but omit the explanation of decoding for insertion in this paper. We assume that a deletion error occurs at the posi-tion f ∈ [1, n]. From the received codeword c = (c 0 , c 1 , · · · , c f −1 , c f +1 , · · · , c n+1 ) of length n, the auxiliary sequence α = (α 1 , α 2 , · · · , α f −1 , α f +1 , · · · , α n ) of length n − 1 can be recalculated, where α f ∈ [2] and c f ∈ [q]. To determine the syndrome, two mappings are defined as Here, L I (1, α ), which coincides with the Hamming weight of the received auxiliary sequence α if the integer sequence I = (1, 2, · · · , n + 1). The difference in the syndrome values between α and α is defined as ∆ = a −Syn(α ) (mod I n+1 ).
The deleted value α f is determined by comparing the ∆ value and w I (α ) as  After we find the deleted value and possible position of the recalculated auxiliary sequence α , we can infer the deleted value and the possible position in the received codeword c from the relationship between the deleted symbol c f and its left symbol c f −1 according to (8). Similar to the decoding of nonbinary VT codes, we define the deleted symbol c f as c f = a − sum(c ) (mod q).
With the above method, we can successfully decode a deletion or insertion error, and these results are shown in Section V.

V. SIMULATION AND COMPARISON RESULTS
In this section, we first present the simulation results for our proposed code construction, determine the feasibility of the proposed encoding algorithm, and verify successful decoding when a single insertion/deletion error occurs in the codeword. Then, we present the comparison results between relevant work [16] and our work to present the improvement of our code design.

A. SIMULATION RESULTS OF CODE ENCODING AND DECODING
In this simulation, we determine the values of q = 4 and r = 3. We segment message sequences y k 1 into different lengths   Tables 7 and 8. and in Tables 7 and 8 denote successful encoding result and encoding failure result, respectively. From these results, the length of 190 codewords corresponds to I n+1 = 433, and the proposed encoding algorithm can generate the codewords construct successfully.
We assume that a single deletion or insertion error occurs  in the received codewords. The position of a single deletion or insertion error is randomly determined and the inserted symbol is randomly chosen from [q] in the received codewords. Then, the decoding algorithm in Section IV is used to correct a single deletion or insertion. The simulation decoding results for single deletion and single insertions are listed in Tables 9 and 10, respectively. and in Tables 9 and 10 also denote successful decoding result and decoding failure result, respectively. From Tables 9 and 10, all codewords with a single deletion/insertion are successfully decoded.

B. COMPARISON RESULTS OF RELATED WORK [16] AND OUR WORK
Among the related works, the systematic encoding method for nonbinary SIDEC code [16] is most relevant and we provide comparison results with our work. For fair comparison with the related work [16], we still consider the 1.5 × 10 8 message sequences with maximum run-length r = 3 as inputs, and then make codewords with length n = 150 by the systematic encoding method in [16]. The code construction in [16] did not consider the maximum runlength constraint and CCF issues which are caused by some parity symbols c 2 j located between two message symbols for j ∈ [2, logn + 1 ]. Then, the parity symbols c 2 j should be less than zero according to (8) if the message symbol c 2 j −1 = 0 and auxiliary bit α 2 j = 0. The ratios of the CCF occurrence and maximum run-length violation in [16] are compared with ones in our work.  Among the 1.5 × 10 8 outputs codewords from [16], the numbers of codewords which only occur CCF and only violate the maximum run-length constraint are 5.779 × 10 7 and 3.158 × 10 7 , respectively. The number of codewords which occur CCF and violate the run-length constraint simultaneously is 2.599 × 10 7 , and the number of codewords do not occur CCF and satisfy the maximum run length constraint is 3.442 × 10 7 . However, the whole codewords of our proposed SIDEC code do not occur CCF and satisfy the maximum run-length constraint. We present the ratios of the CCF occurrence, violation of run-length constraint, and no CCF occurrence and no maximum run-length constraint violation of output codewords with n = 150 in Table 11.
From Table 11, about 22.96% codewords of [16] can be used as DNA strands since the CCF do not occur and the maximum run-length constraint is satisfied. Therefore, the outputs codewords in [16] are not suitable for DNA storages. However, the whole codewords from our systematic encoding method for our proposed nonbinary SIDEC code do not occur CCF and satisfy the maximum run-length constraint, which has the potential advantages for DNA storages.

VI. CONCLUSION AND FUTURE WORK
In this work, we propose a nonbinary SIDEC code with the maximum run-length r constraint and its systematic encoding algorithm for the proposed code. Moreover, we verify the feasibility of the encoding and decoding processes of the proposed code. Thanks to the properties of this code design and the error-prone nature of DNA channels, we can apply the proposed code to DNA storage systems. Future work will focus on designing a more efficient encoding process for the nonbinary SIDEC code with the maximum run-length r constraint and constructing codes with the GC-balanced constraint, which is also an important characteristic of DNA storage systems.