Properties and constructions of constrained codes for DNA-based data storage

We describe properties and constructions of constraint-based codes for DNA-based data storage which account for the maximum repetition length and AT/GC balance. We present algorithms for computing the number of sequences with maximum repetition length and AT/GC balance constraint. We describe routines for translating binary runlength limited and/or balanced strings into DNA strands, and compute the efficiency of such routines. We show that the implementation of AT/GC-balanced codes is straightforward accomplished with binary balanced codes. We present codes that account for both the maximum repetition length and AT/GC balance. We compute the redundancy difference between the binary and a fully fledged quaternary approach.


I. INTRODUCTION
The first large-scale archival DNA-based storage architecture was implemented by Church et al. [1] in 2012. Blawat et al. [2] described successful experiments for storing and retrieving data blocks of 22 Mbyte of digital data in synthetic DNA. Ehrlich and Zielinski [3] further explored the limits of storage capacity of DNA-based storage architectures.
Naturally occurring DNA consists of four types of nucleotides: adenine (A), cytosine (C), guanine (G), and thymine (T). A DNA strand (or oligonucleotides, or oligo in short) is a linear sequence of these four nucleotides that are composed by DNA synthesizers. Binary source, or user, data are translated into the four types of nucleotides, for example, by mapping two binary source into a single nucleotide (nt).
Strings of nucleotides should satisfy a few elementary conditions, called constraints, in order to be less error prone. Repetitions of the same nucleotide, a homopolymer run, significantly increase the chance of sequencing errors [4], [5], so that such long runs should be avoided. For example, in [5], experimental studies show that once the homopolymer run is larger than 4 nt, the sequencing error rate starts increasing significantly. In addition, [5] also reports that oligos with large unbalance between GC and AT content exhibit high dropout rates and are prone to polymerase chain reaction (PCR) errors, and should therefore be avoided.
Blawat's format [2] incorporates a constrained code that uses a look-up into strands of nucleotides with a homopolymer run of length at most three. Blawat's format did not incorporate an AT/GC balance constraint. Strands that do not satisfy the maximum homopolymer run requirement or the weak balance constraint are barred in Erlich's coding format [3].
In this paper, we describe properties and constructions of quaternary constraint-based codes for DNA-based storage which account for a maximum homopolymer run and maximum unbalance between AT and GC contents. Binary 'balanced' and runlength limited sequences have found widespread use in data communication and storage practice [6]. We show that constrained binary sequences can easily be translated into constrained quaternary sequences, which opens the door to a wealth of efficient binary code constructions for application in DNA-based storage [7], [8], [9]. A further advantage of this binary approach instead of a 'direct' 4-ary translation approach is the lower complexity of encoding and decoding lookup tables. The disadvantage is, as we show, the loss in information capacity of the binary versus the quaternary approach.
We start in Section II with a description of the limiting properties of AT/GC-balanced codes, while Section III presents code designs for efficiently generating AT/GCbalanced strands. Limiting properties and code constructions that impose a maximum homopolymer run are discussed in Section IV. In Section V, we enumerate the number of binary and quaternary sequences with combined weight and run-length constraints. We specifically compute and compare the information capacity of binary versus 'direct' quaternary coding techniques. Section VI concludes the paper.

II. AT/GC CONTENT BALANCE
We use the nucleotide alphabet Q = {0, 1, 2, 3}, where we propose the following relation between the four decimal symbols and the nucleotides: G = 0, C = 1, A = 2, and T = 3. The AT/GC content constraint stipulates that around half of the nucleotides should be either an A or a T nucleotide. In order to study AT-balanced nucleotides, we start with a few definitions. We define the weight or ATcontent, denoted by w 4 (x), of the n-nucleotide oligo x = (x 1 , . . . , x n ), x i ∈ Q, as the number of occurrences of A or T, or where ϕ(u) = 0, u < 2, 1, u > 1.
The relative unbalance of a word, α(x), is defined by α(x) = w4(x) n − 1 2 . An n-nucleotide oligo is said to be balanced if α(x) = 0. In case we have a set S of n-symbol codewords, we define the worst case relative unbalance of S, denoted by α S , by α S = max x∈S α(x). Similarly the weight of a binary word x = (x 1 , . . . , x n ), x i ∈ {0, 1}, denoted by w 2 (x), is defined by If we write the 4-ary word x = (x 1 , . . . , x n ), x i ∈ Q, as x = y + 2z, where both y i and z i ∈ {0, 1} then For DNA-based storage, we do not require that the strands of the codebook, S, are strictly balanced, as a small unbalance, that is α S ≪ 1, between the GC and AT content is permitted without affecting the error performance. Such a constraint is called a weak balance constraint. Let S w denote the set of 4-ary words of length n with balance w = w 4 (x), or The cardinality of S w , denoted by N (w, n), equals The number of oligo's, N a (n), of length n, whose relative unbalance α(x) ≤ a, is given by The redundancy of nearly balanced strands, denoted by r(a, n), equals r(a, n) = log 2 4 n N a (n) . Figure 1 shows examples of computations of the redundancy versus n with the relative unbalance, a, as a parameter. The raggedness of the curves is caused by the truncation effects in the summation in (7). The distribution for asymptotically large n of N (w, n) versus w is approximately Gaussian shaped, that is where G(u; µ, denotes the Gaussian distribution and µ and σ 2 denote the mean and variance of the distribution. The number of oligo's, N a (n), of length n, whose relative unbalance α(x) ≤ a, is given by [ [3], supplement] where the Q-function is defined by In the next section, we discuss various embodiments of codes that balance strands of nucleotides.

III. IMPLEMENTATIONS OF BALANCED GC/AT CONTENT
There is a wealth of prior art binary balanced codes [10], and application of such prior art codes to the problem at hand is shown below. Earlier embodiments can be found in [11], [12].

A. Binary sequences, Construction I
We assume the encoder receives a string of ℓ + n, n ≥ ℓ, binary symbols, which are translated into a balanced word of n 4-ary symbols. To that end, let (y 1 , . . . , y ℓ+n ), y i ∈ {0, 1}, n ≥ ℓ, be an (ℓ + n)-bit source string. We translate the first ℓ bits of the binary source data, (y 1 , . . . , y ℓ ), into a (nearly) balanced binary string (u 1 , . . . , u n ), u i ∈ {0, 1}. We merge the n-bit string, (u 1 , . . . , u n ), and the remaining n-bit segment of the source string, (y ℓ+1 , . . . , y ℓ+n ), into the 4-ary vector v, v i ∈ Q, using the operation v i = y ℓ+i + 2u i , 1 ≤ i ≤ n. The balance of the output string, v, is given by, see (4), w 4 (v) = w 2 (u). The rate of the above 4-ary code construction equals R = 1 + ℓ n . Implementations of balanced codes can be found in the literature. For example, the 8B10B is a binary code of rate 8/10 that has found application in both transmission and data storage systems [13]. The 10-bit codewords may have four, five or six 'one's, and the two-state code guarantees that the unbalance of the encoded sequence is at most ±1. In case we translate p 8-bit words into p 10-bits words, we have α S = 1 10p . The (overall) rate R = 9 5 .
1) Weak Knuth code: Knuth [14] presented an encoding technique for generating binary balanced codewords capable of handling (very) large binary blocks. An n-bit user word, n even, is forwarded to the encoder, which inverts the first k 0 bits of the user word, where k 0 is chosen in such a way that the modified word has equal numbers of ones and zeros. Knuth showed that such an index k 0 can always be found. The index k 0 is represented by a (preferably) balanced word, called prefix, of length p 0 , p 0 ≥ log 2 n bits, so that the redundancy of Knuth's method is approximately log 2 n (bit). The (balanced) p 0 -bit prefix and the balanced n-bit user word are both transmitted. The receiver can easily undo the inversion of the first k 0 bits of the received word. Modifications of the generic Knuth scheme have been presented by Weber & Immink [15].
DNA-based storage does not require exact strand GC/ATcontent balance, and we may attempt to construct less redundant nearly-balanced codes. We modify Knuth's algorithm for generating nearly balanced binary codes. Let Mimicking the original Knuth encoder, the encoder successively inverts the symbols of the ith segment of x, i = 0, · · · , m 0 − 1, thereby successively inverting the symbols The encoder selects the index, bî, that enables the least unbalance. In similar vein as in Knuth's method, the indexî is represented by a redundant (balanced or nearly balanced) p-bit prefix that is appended to the weakly-balanced word. According to Knuth we can choose at least one index k 0 , 1 ≤ k 0 ≤ n, such that exact balance can be achieved. As an 'exact' balancing index, k 0 , is at most ⌊s/2⌋ positions away from position bî, we conclude that the relative unbalance is The redundancy of the above weak Knuth code equals at least p 0 bits (note that additional redundancy is needed to encode the prefix into a nearly balanced word). Let, for example, the code redundancy be p 0 = 3, then α S = 0.0625. Figure 1 shows that for a relative unbalance a = 0.0625 we need, in theory, less than 1.5 bit redundancy for n > 25, so that we conclude that the above modification of Knuth's algorithm falls far short of the minimum redundancy required.
In the next section, we discuss constructions for generating strings that avoid long repetitions of the same nucleotide.

IV. MAXIMUM RUNLENGTH CONSTRAINT
Long repetitions of the same nucleotide (nt), called a homopolymer run or runlength, may significantly increase the chance of sequencing errors [4], [5], and should be avoided. Avoiding long runs of the same nucleotide will result in loss of information capacity, tand codes are required for translating arbitrary source data into constrained quaternary strings. Binary runlength limited (RLL) codes have found widespread application in digital communication and storage devices since the 1950s [6], [10]. MacLaughlin et al. [16] studied multi-level runlength limited codes for optical recording. An n-nucleotide oligo, a string of 4ary symbols of length n, can be seen as two parallel binary strings of length n, namely a string of a least and a most significant bit with which the 4-ary symbol can be represented. Such a system of multiple parallel data streams with joint constraints is reminiscent of 'two-dimensional' track systems, which have been studied by Marcellin and Weber [17].
We start in the next subsection with the counting of qary sequences that satisfy a maximum runlength, followed by subsections where we describe limiting properties and code constructions that avoid m + 1 repetitions of the same nucleotide.

A. Counting q-ary sequences, capacity
Let the number of n-length sequences consisting of qary symbols have a maximum run, m, of the same symbol be denoted by N q (m, n). The number N q (m, n) can be found using the next Theorem which defines a recursive relation [18], Part 1.
Theorem 1: (14) Proof: For n ≤ m the above is trivial as all sequences satisfy the maximum runlength constraint. For n > m we follow Shannon's approach [18] for the discrete noiseless channel. The runlength of k symbols a can be seen as a 'phrase' a of length k. After a phrase a has been emitted, a phrase of symbols b = a of length k can be emitted without violating the maximum runlength constraint imposed. The total number of allowed sequences, N q (m, n), is equal to (q − 1) times the sum of the numbers of sequences ending with a phrase of length k = 1, 2, . . . m, which are equal to N q (m, n−k). Addition of these numbers yields (14), which proves the Theorem. Using the above expressions, we may easily compute the feasibility of a q-ary m-constrained code for relatively small values of n where a coding look-up table is practicable, see Subsection IV-C for more details.
1) Generating functions: Generating functions are a very useful tool for enumerating constrained sequences [19], and they offer tools for approximating the number of constrained sequences for asymptotically large values of the sequence length n. The series of numbers {N q (m, n)}, n = 1, 2 . . ., in (14), can be compactly written as the coefficients of a formal power series H q,m (x) = N q (m, i)x i , where x is a dummy variable. There is a simple relationship between the generating function, H q,m (x), and the linear homogenous recurrence relation (14) with constant coefficients that defines the same series [19]. We first define a generating function Let the operation [x n ]g(x) denote the extraction of the coefficient of x n in the formal power series G(x), that is, define Let Theorem 2: The number of n-symbol m-constrained qary words is Proof: The generating function for the number of q-ary sequences with a maximum runlength m is We may rewrite the above as which proves the Theorem.
2) Asymptotical behavior: For asymptotically large codeword length n, the maximum number of (binary) user bits that can be stored per q-ary symbol, called (information) capacity, denoted by C q (m), is given by [18] C q (m) = lim where λ q (m), is the largest real root of the characteristic equation [18], [16] x m+1 − qx m + q − 1 = 0.
Table I shows the information capacities C 2 (m) and C 4 (m) versus maximum allowed (homopolymer) run m.
For asymptotically large n we may approximate N q (m, n) by [19] N q (m, n) ∼ A q (m)λ n q (m).

B. Binary-based RLL code construction, Construction II
In a similar vein as presented in Section III, we may exploit binary maximum runlength limited (RLL) codes for generating quaternary RLL sequences. Construction II exemplifies such a technique for m > 1. Let u = (u 1 , . . . , u n ) be an n-bit RLL string. We merge the RLL n-bit string, u, with an n-bit source string y = (y 1 , . . . , y n ), by using the addition v i = u i + 2y i , 1 ≤ i ≤ n, where v = (v 1 , . . . , v n ), v i ∈ Q is the 4-ary output string. It is easily verified that the 4-ary output string, v, has maximum allowed run m, the same as the binary string u. The number of distinct 4-ary sequences, v, of Construction II equals 2 n N 2 (m, n), so that the redundancy, denoted by r 2 (mn, n) is The capacity loss with respect to the runlength limited 4-ary channel, denoted by η(m), is expressed by  with any binary RLL code, and there are many binary code constructions for generating maximum runlength constrained sequences, see [10] for an overview. We propose here, for the efficiency assessment, a simple two-mode block code of codeword length n. Runlength constrained codewords in the first mode start with a symbol 'zero', while codewords in the second mode start with a 'one'. When the previous sent codeword ends with a 'one' we use the codewords from the first mode and vice versa. The number of binary source words that can be accommodated with Construction II equals 2 n−1 N 2 (m, n), so that the code rate, denoted by R m,0 , is where we truncated the code size to the largest power of two possible. Table IV shows selected outcomes of computations of the rate efficiency R m,0 /C 4 (m) versus m and n.

C. Encoding of quaternary sequences without binary step
In this subsection, we investigate constructions of codes that transform binary words directly (that is, without an intermediate binary coding step) into 4-ary maximum homopolymer constrained codewords. An example of a simple 4-ary block code was presented by Blawat et al. [2]. The code converts 8 source bits into a 4-ary word of 5 nt. The 5-nt words can be cascaded without violating the prescribed m = 3 maximum homopolymer run. The rate of Blawat's construction is R = 8/5 = 1.6. As C 4 (m = 3) = 1.9824, see Table I, the (rate) efficiency of the construction is R/C 4 (m) = 0.807. Alternative, and more efficient, constructions are described below.

1) State-independent decoding:
A source word can be represented by two n-symbol 4-ary m-constrained codewords. The two representations differ at the first position. In case we cascade a new codeword to the previous codeword, we are always able to choose (at least) one representation whose first symbol differs from the last symbol of the previous codeword. Then, clearly, the cascaded string of 4-ary symbols satisfies the maximum homopolymer run constraint. The rate of this two-mode construction, denoted by R m,1 , is where we truncated the code size to the largest power of two possible. Table V shows selected outcomes of computations of the rate efficiency R m,1 /C 4 (m) versus m and n. We observe that, for m = 2, the 'quaternary' efficiency R 2,1 /C 4 (2) is slightly better than the 'binary' R 2,0 /C 4 (2), For m > 2, both approaches have the same efficiency. The conversion of the binary source symbols into the 4-ary n-nt strands and vice versa can be accomplished using look-up tables of complexity 4 n .
2) State-dependent decoding: In the above construction, the encoded codeword depends on the last symbol of the previous codeword. Decoding, however, is based on the observation of the n symbols of the retrieved codeword. In this subsection, we discuss a state-dependent decoding construction, where the codeword chosen depends on the last symbol of the previous codeword, and decoding is based on the observation of the n symbols of the retrieved codeword plus the last symbol of the previous codeword. We define four tables of codewords, denoted by L(i, a), where i, 1 ≤ i ≤ K, denotes the decimal representation of the source word to be encoded, K denotes the size of the table, and a denotes the encoder state a =∈ {1, 2, 3, 4}. We construct the four tables in such as way that the codewords in each table L(i, a) do not start with the symbol a. Then, the maximum size of the tables equals K = 3 4 N 4 (m, n) (note that N 4 (m, n) is a multiple of 4). The representation, L(i, a), chosen depends on the last symbol of the previous codeword, a. The rate of this four-mode construction, denoted by R m,2 , is  [2]) n = 5 and m = 3. We simply find, using (14), N 4 (3, 5) = 996, so that the code may accommodate K = 3/4 × 996 = 747 binary source words. Since K > 512 = 2 9 we may implement a code of rate 9/5, which is 12% higher than that of Blawat's code of rate 8/5. As we have the freedom of deleting 747-512=235 redundant codewords, we may bar the words with the highest unbalance.
In the next section, we take a look at the combination of balance and maximum polymer run constrained codes.

CONSTRAINED CODES
Kerpez et al. [20], Braun and Immink [21], and Kurmaev [22] analyzed properties and constructions of binary combined weight and runlength constrained codes. Their results are straightforwardly applied to the quaternary case at hand. In the next section, we count binary and quaternary sequences that satisfy combined maximum runlength and weight constraints. We start by counting the number of binary sequences, x, of length n that satisfy a maximum runlength constraint m and have a weight w = w 2 (x). Paluncic and Maharaj [23] enumerated this number for the balanced case w = w 2 (x) = 0.

A. Counting binary RLL sequences of given weight
Define the bi-variate generating function H(x, y) in the dummy variables x and y by and let [x n1 y n2 ]h(x, y) denote the extraction of the coefficient of x n1 y n2 in the formal power series h i,j x i y j , or Define The number of n-bit codewords, x, with maximum runlength m, denoted (with a slight abuse of notational convention by adding an extra parameter) by N 2 (m, w, n), that satisfy a given unbalance constraint w = w 2 (x) is given by the next Theorem. Theorem 3: Proof: Let the sequence start with a runlength of zero's, then the generating function for the number of binary sequences with a maximum runlength m is In case the sequence starts with a run of one's, we obtain for the generating function The generating function for the number of binary sequences with a maximum runlength m starting with a one or a zero runlength is the sum of the two above generating functions.
Working out the sum yields which proves the Theorem.
With the above bi-variate generating function, we may exactly compute the number of binary m-constrained words of weight w. More insight is gained by an approximation of N 2 (m, w, n). For a given maximum runlength, m, and large n, we are specifically interested in the distribution of N 2 (m, w, n) versus the weight w. For asymptotically large n, according to the central limit theorem, the distribution of the number of sequences versus weight, w, is approximately Gaussian [19]. Theorem 4: where Proof: The probability of occurrence of a runlength of length k, k ≤ m, is λ −k 2 (m), see [10], Chapter 4. So that the average number of runlengths in a sequence of n symbols is n/l. The weight w is the sum of the runlengths of ones, so that according to the central limit theorem the weight distribution is approximately Gaussian for large n with mean n 2 and variance γ2(m)n 4 .  Table VII shows results of computations (the parameter γ 4 (m) is explained in Section V-B). Perusal of the outcomes clearly demonstrates that for small values of m the unbalance variance, γ 2 (m)n, is smaller than that of unconstrained sequences (that is, m = ∞) of the same length n. In other words, a maximum runlength 'helps' to reduce the expected unbalance.

B. Counting quaternary RLL sequences of given weight
We count the number of n-tuples x of 4-ary symbols that satisfy a maximum run length constraint, m, and have weight w = w 4 (x), denoted (with a slight abuse of notational convention) by N 4 (m, w, n).

1) Maximum runlength constraint:
For the special case m = 1, Limbachiya [24] et al. presented a closed expression of N 4 (1, w, n). For other values of the prescribed maximum runlength, m, we may readily compute the number of 4ary sequences, N 4 (m, w, n), versus weight, w = w 4 (x), by applying generating functions. The 4-ary symbols are generated by a constrained data source that can be modelled as a four-state Moore-type finite-state machine. The machine steps from state to state where when state i ∈ Q is visited a sequence of k, 1 ≤ k ≤ m, symbols 'i' are emitted. After visiting state i, the data source may not return to state i (and thus emit a sequence of the same symbol 'i' again), but it enters state j = i, j ∈ Q. When the machine enters state 3 or 4, the word weight, w, is incremented by k, where k, 1 ≤ k ≤ m, denotes the run of symbols '3' or '4'. When, on the other hand, states 1 or 2 are entered, the weight increment is nil. The resulting 4 × 4 one-step skeleton or state-transition matrix, D(x, y), of the finite-state machine is where a 0 = T (x) and a 1 = T 1 (x, y).
Theorem 5: The number of 4-ary sequences of length n with maximum runlength constraint m and weight w equals where d [n] i,j (x, y) denotes the entries of D n (x, y). Proof: The entries d [n] i,j (x, y) of D n (x, y) are equal to the number of sequences (paths) of length n starting in state i and ending in state j. Summation of the entries and division by 3 yields the generating function of N 4 (m, w, n).
In the next subsection, we derive a simple approximation to N 4 (m, w, n) valid for large n.
2) Estimate of the weight distribution: For asymptotically large n, the weight distribution is approximately Gaussian, that is, we may conveniently approximate N 4 (m, w, n) using the next Theorem.
Theorem 6: where σ 2 4 (m, n), denotes the variance of the Gaussian weight distribution. Proof: Let u i , i = 1, 2, . . ., u i ∈ Q, be an infinitely long 4-ary sequence generated by a maxentropic source that satisfies a prescribed maximum runlength, m. Note that although the 4-ary sequence u i , i = 1, 2, . . ., satisfies a limited runlength constraint, m, that runs of the binary weight sequence v i = ϕ(u i ), i = 1, 2, . . ., are without limit. The variance, σ 2 4 (m, n), of the Gaussian weight distribution is governed by the runlength distribution, P (k), of the binary sequence v i , where P (k), k > 0, denotes the probability of occurrence of a runlength k. Clearly, k>0 P (k) = 1. The probability P (k) is given by where the normalization constant c is chosen such that ∞ k=1 P (k) = 1. The term N 2 (m, k) is the number of AT combinations of length k, which may exist of a single A or T run or a plurality of alternating A and T runs. We have σ 2 4 (m, n) = γ4(m)n

4
, where, see [10], Chapter 4, Table VII shows results of computations of γ 4 (m) versus m. We may notice that the weights of the quaternary RLL sequences are more concentrated around the mean n/2 than those of binary RLL sequences. The above outcome is not consistent with the results by Ehrlich and Zielinski [3], as they assume that the balance variance equals n/4, independent of m.

C. Redundancy of binary and quaternary codes with combined constraints
As in Constructions I and II, let the quaternary word x = (x 1 , . . . , x n ), x i ∈ Q, be written as x = y + 2z, where the constituting elements y i and z i ∈ {0, 1}. If the binary sequence z is m-constrained and has a weight w = w 2 (z), then x is m-constrained and it has a weight w 4 (z) = w. The redundancy of the binary constrained sequences, z, denoted (with a slight abuse of convention) by r 2 (m, a, n), equals r 2 (m, a, n) = n − log 2 N 2 (m, w, n). (41) Using (24) and (32), we obtain for n ≫ 1, that the redundancy of the binary approach is r 2 (m, a, n) ∼ r 2 (m, n) − log 2 1 − 2Q 2a n γ 2 (m) .
A numerical analysis of the above expressions shows that the redundancy difference due to the balance (right hand) term is around 0.5-1 bit for m = 2. For larger values of the homopolymer run m the extra redundancy is negligible. The redundancy difference, r 2 (m, n) − r 4 (m, n), due to the imposed runlength constraint is much larger for n > 10 than the redundancy due the balance constraint. For m > 6 the difference between r 2 (m, n) and r 4 (m, n) is negligible, see Subsection IV-B, so that considering the much larger lookup tables needed for quaternary codes, the binary approach using Construction 1 for combined constraints is preferable from a practical point of view.

VI. CONCLUSIONS
We have described coding techniques for weakly balancing GC and AT-content and avoiding homopolymer runs larger than m nt's of quaternary DNA strings. We have found exact and approximate expressions for the number of binary and quaternary sequences with combined weight and run-length constraints. We have compared two coding approaches for constraint-based coding of DNA strings. In the first approach, an intermediate, 'binary', coding step is used, while in the second approach we 'directly' translate source data into constrained quaternary sequences. The binary approach is attractive as it yields a lower complexity of encoding and decoding look-up tables. The redundancy of the binary approach is higher than that of the quaternary approach for generating combined weight and run-length constrained sequences. The redundancy difference is small for larger values of the maximum homopolymer run.