DNA-Based Storage: Models and Fundamental Limits

Due to its longevity and enormous information density, DNA is an attractive medium for archival storage. In this work, we study the fundamental limits and trade-offs of DNA-based storage systems by introducing a new channel model, which we call the noisy shuffling-sampling channel. Motivated by current technological constraints on DNA synthesis and sequencing, this model captures three key distinctive aspects of DNA storage systems: (1) the data is written onto many short DNA molecules; (2) the molecules are corrupted by noise during synthesis and sequencing and (3) the data is read by randomly sampling from the DNA pool. We provide capacity results for this channel under specific noise and sampling assumptions and show that, in many scenarios, a simple index-based coding scheme is optimal.


Introduction
Due to its longevity and enormous information density, and thanks to rapid advances in technologies for writing (synthesis) and reading (sequencing), DNA is on track to become an attractive medium for archival data storage. DNA is a long molecule made up of four nucleotides (Adenine, Cytosine, Guanine, and Thymine) and, for storage purposes, can be viewed as a string over a four-letter alphabet. While in a living cell a DNA molecule can consist of millions of nucleotides, due to technological constraints, it is difficult and inefficient to synthesize long strands of DNA. Thus, in practice, data is stored on short DNA molecules which are preserved in a DNA pool and cannot be spatially ordered.
In recent years, several groups have demonstrated working DNA storage systems [CGK12; Gol+13; Gra+15; Yaz+15; EZ17;Org+18]. In these systems, information was stored on molecules of no longer than one or two hundred nucleotides. At the time of reading, the information is accessed via state-of-the-art sequencing technologies. This corresponds to (randomly) sampling and reading sequences from the pool of DNA. Sequencing is preceded by several cycles of Polymerase Chain Reaction (PCR) amplification. In each cycle each molecule is replicated by a factor of 1.6-1.8. Thus, the proportions of the sequences in the DNA mixture just before sequencing and the probability that a given sequence is read depends on the synthesis method, the PCR steps, and the decay of DNA during storage. Finally, sequencing and in particular synthesis of the DNA may lead to insertions, deletions, and substitutions of nucleotides in individual DNA molecules. See [HMG19] for a detailed discussion of the error sources and probabilities for different experimental setups.
Given these constraints, a mathematical model for a DNA storage channel is as follows. Data is written on M DNA molecules, each of length L. From this multiset of sequences, N sequences are drawn according to some distribution Q, and then perturbed by introducing individual base errors. A critical element of this model is that by drawing N sequences according to some distribution Q, the order of the sequences is lost. The decoder's goal is to reconstruct the information from the multi-set of N reads. Note that the decoder has no information about which molecules were sampled, and in general a fraction of the original DNA fragments may never be sampled. Our goal is to study the capacity of this channel under different modeling assumptions on the sampling distribution and introduced errors.

Contributions
In this paper we study the fundamental limits of the DNA storage model outlined above. Our analysis aims to reveal the basic relationships and trade-offs between key design parameters and performance goals such as storage density and reading/writing costs. Throughout, we consider the asymptotic regime where M → ∞. The main parameter of interest is the storage capacity C, defined as the maximum number of bits that can be reliably stored per nucleotide (the total number of nucleotides is M L).
Capacity in the case of noise-free sequences: We start with a channel without errors in the individual sequences. Thus, randomness is only introduced through the distribution Q, which describes the number of copies we draw from each input sequence. According to Q, some of the individual sequences might never be drawn and others are drawn many times. Our main result for this channel states that if lim M →∞ L log M = β > 1, then where q 0 is the probability that a given sequences is never sampled. Interestingly, our result only depends on the distribution Q through q 0 , which is the probability that a given sequences is never sampled. Moreover, if lim M →∞ L log M < 1, no positive rate is achievable. The factor 1 − q 0 is the loss due to unseen molecules, and 1 − 1/β corresponds to the loss due to the unordered fashion of the reading process.
One important implication of our result is that a simple index-based scheme (as commonly used by DNA data storage systems) is optimal; i.e., prefixing each molecule with a unique index incurs no rate loss. More specifically, our result shows that indexing each DNA molecule and employing an erasure code across the molecules is capacity-optimal. Furthermore, the capacity in (1) is only non-trivial if the read length scales as L = Θ(log M ). For that reason, throughout the paper we focus on the regime L = β log M , where β is a positive constant.
Suppose that each sequence is drawn according to a Poisson distribution with mean λ, so that in expectation λM sequences are drawn and λ can be thought of as the sequencing coverage depth. Then, the probability that a sequence is never drawn is e −λ and it decays exponentially in the coverage depth. For this scenario, our expression for the capacity suggests that practical systems should not operate at a high coverage depth N/M , as high coverage depth significantly increases the time and cost of reading, but only provides little storage gains. Notice that, in order to guarantee that all M sequences are observed at least once, we need N = Ω(M log M ) [LW88 ;MBD13]. When M is large, it is wasteful to operate in this regime, as this only gives a marginally larger storage capacity, but the sequencing costs can be exorbitant.
Capacity in the case of noisy sequences: Our second contribution is an expression for the capacity for the case where the reading of the sequences is noisy. The goal of this second statement is to understand the effect of errors within sequences, in addition to the shuffling and sampling of the sequences. We assume that the distribution Q with which the sequences are drawn is a simple Bernoulli distribution; i.e., a sequence is either drawn once with probability 1 − q 0 or not drawn with probability q 0 . Furthermore we focus on substitution errors within sequences. Thus, we study a noisy shuffling-sampling model where the output sequences are obtained as follows: (i) each original sequence is drawn with probability 1 − q and not drawn with probability q, (ii) the drawn sequences are shuffled, and (iii) passed through a binary symmetric channel with crossover probability p.
In the low-error regime (where p is sufficiently small), the capacity of this noisy shuffling-sampling channel is given by (2) Note that 1 − H(p) is the capacity of the binary symmetric channel. As it turns out, (2) can be achieved by treating each length-L sequence as the input to a separate BSC and encoding a unique index into each sequence, and using an erasure outer code to protect against the loss of a q 0 -fraction of the M sequences. For a large set of parameters β and p (described in Section 4), is capacity-optimal. This result provides a theoretical justification for a number of works, starting with [Gra+15], which have used a similar coding scheme in real implementations of DNA-based storage systems [Gra+15; Yaz+15; EZ17; Org+18; Mei+20].

Related literature
Computer scientists and engineers have dreamed of harnessing DNA's storage capabilities already in the 60s [Nei64;Bau95], and in recent years this idea has developed into an active field of research. In 2012 and 2013 groups lead by Church [CGK12] and Goldman [Gol+13] independently stored about a megabyte of data in DNA. In 2015, Grass et al. [Gra+15] demonstrated that millenia long storage times are possible by protecting the data both physically and information-theoretically, and designed a robust DNA data storage scheme using modern error correcting codes. Later, in the same year, Yazdi et al [Yaz+15] showed how to selectively access parts of the stored data, and in 2017, Erlich and Zielinski [EZ17] demonstrated that practical DNA storage can achieve very high information densities. In 2018, Organick et al. [Org+18] scaled up these techniques and stored about 200 megabytes of data. The capacity of a DNA storage system under a related model has been studied in an unpublished manuscript by MacKay, Sayer, and Goldman [MSG15;Sayr ]. In the model in [MSG15], the input to the channel consists of a (potentially arbitrarily large) set of DNA molecules of fixed length L, which is not allowed to contain duplicates. The output of the channel are M molecules drawn with replacement from that set. The approach in [MSG15] considers coding over repeated independent storage experiments, and computes the single-letter mutual information over one storage experiment. This indicates that the price of not knowing the ordering of the molecules is logarithmic in the number of synthesized molecules, similar to our main result.
The capacity of a DNA storage system under a different model was studied in [EZ17]. Specifically [EZ17] assumes that each DNA segment is indexed which reduces the channel model to an erasure channel. While this assumption removes the key aspects that we focus on in this paper, namely that DNA molecules are stored in an unordered way and read via random sampling, [EZ17] considers other important constraints, such as homopolymer limitations.
Several recent works have designed coding schemes for DNA storage systems based on this general model, some of which were implemented in proof-of-concept storage systems [CGK12; Gol+13; Gra+15; Bor+16; EZ17]. Several papers have studied important additional aspects of the design of a practical DNA storage system. Some of these aspects include DNA synthesis constraints such as sequence composition [KPM16; Yaz+15; EZ17], the asymmetric nature of the DNA sequencing error channel [GKM15], the need for codes that correct insertion errors [Sal+17], and the need for techniques to allow random access [Yaz+15]. The use of fountain codes for DNA storage was considered in both [EZ17] and [MSG15].
Finally, the recent paper [Len+19] studies a related channel and proves a converse on the capacity through a combination of techniques, including ideas from [Hec+17; SH19].

Problem setting and channel models
An (M, L) DNA storage code C is a set of codewords, each of which is a list [x L 1 , . . . , x L M ] of M strings of length L, together with a decoding procedure. The alphabet Σ is typically {A, C, G, T}, corresponding to the four nucleotides that compose DNA. However, to simplify the exposition we focus on the binary case Σ = {0, 1}, and we note that the results can be extended to a general alphabet in a straightforward manner. Throughout the paper we use the word molecule or sequence to refer to each of the stored strings of length L over the alphabet Σ. We study the following general noisy shuffling-sampling channel model: 1. Given that codeword [x L 1 , . . . , x L M ] ∈ C is chosen, each sequence x L i is sampled a number N i ∼ Q of times, for some distribution Q = (q 0 , q 1 , . . .), where q n = Pr (N i = n) is the probability that x L i is drawn n many times. We let N = M i=1 N i be the total number of resulting strings, and we define λ : The distribution Q models imperfections in synthesis, sequencing, and a loss of whole sequences during storage (see [HMG19] for a detailed discussion on how this distribution looks like for specific choices of sequencing and synthesis technologies).
2. Each of the resulting N strings is passed through a discrete memoryless channel.
3. The resulting N strings are shuffled uniformly at random to yield the output [y L 1 , . . . , y L N ]. Equivalently, the output of the channel is the (unordered) multi-set of N output sequences {y L 1 , . . . , y L N }. A decoding function then maps the received sequences [y L 1 , . . . , y L N ] to a message index in {1, . . . , |C|}. The main parameter of interest of a DNA storage system is the storage density, or the storage rate, defined as the number of bits written per DNA base synthesized, i.e., We consider an asymptotic regime where M → ∞ and we let L := β log M for some fixed β. As our main results show, L = Ω(log M ) is the asymptotic regime of interest for this problem. We say that the rate R is achievable if there exists a sequence of DNA storage codes C M with rate R such that the decoding error probability tends to 0 as M → ∞.

Storage Capacity for the Noise-free Channel
An important property of DNA storage channels is the fact that the order or the molecules are lost. We first focus on this aspect of the channel model by studying the noise-free channel (where all copies are noise-free, i.e., the discrete memoryless channel is just the "identity channel").
The main result of this section is the characterization of the storage capacity, given by the following theorem.
Theorem 1. The storage capacity of the noise-free shuffling-sampling channel is In particular, if β ≤ 1, no positive rate is achievable.
The capacity expression in (4) can be intuitively understood through the achievability argument. A storage rate of R = (1 − q 0 ) (1 − 1/β) can be easily achieved by prefixing all the molecules with a distinct tag, which effectively converts the channel to a block-erasure channel. More precisely, we use the first log M bits of each molecule to encode a distinct index. Then we have L − log M = L(1 − 1/β) symbols left per molecule to encode data. The decoder can use the indices to remove duplicates and sort the molecules that are sampled. This effectively creates an erasure channel, where molecule i is erased if it is not drawn (i.e., N i = 0) which occurs with probability q 0 . Since the expected number of erasures is E 1 surprising aspect of Theorem 1 is that this simple index-based scheme is optimal. It is also worth noting that the capacity expression only depends on the sampling distribution Q through the parameter q 0 , i.e., the fraction of sequences that is not seen at the output of the channel.
In order to gain intuition on a practical implication of this theorem, suppose that each sequence is drawn according to a Poisson distribution with mean λ, so that in expectation λM sequences in total are drawn and λ can be thought of as the sequencing coverage depth. Then, the probability that a sequence is never drawn is e −λ and the capacity expression becomes This suggests that practical systems should not operate at a high coverage depth N/M , as high coverage depth significantly increases the time and cost of reading, but only provides little storage gains, according to our capacity expression. Notice that, in order to guarantee that all M sequences are observed at least once, When M is large, it is wasteful to operate in this regime, as this only gives a marginally larger storage capacity, but the sequencing costs can be exorbitant. The result in Theorem 1 is flexible to allow different sampling models. In particular, one can consider separating the PCR amplification performed on each synthesized molecule from the sequencing step. Since one cannot control the PCR amplification factor precisely, it is reasonable to assume that a molecule x L is first randomly amplified and a total of A ≥ 0 copies is stored. If we consider a Poisson sampling model for the sequencing step, the effective coverage depth is λ/E[A] (since we are actually sampling from M E[A] molecules). In this case, the probability that none of the copies of x L is sampled at the output . This can be recognized as the moment-generating function of A evaluated at −λ/E[A]. In particular, when PCR is also modeled as a Poisson random variable with mean E[A] = α, E[e θA ] = e α(e θ −1) , and the capacity of the resulting noise-free shuffling-sampling channel is

Motivation for Converse
A simple outer bound can be obtained by considering a genie that provides the decoder with the "true" index of each sampled molecule. In other words, [x L 1 , . . . , x L M ] are the stored molecules, and the decoder observes [y L 1 , . . . , y L N ] and the mapping σ : . This converts the channel into an erasure channel with block-erasure probability q 0 , which yields It is intuitive that the bound (7) should not be achievable, as the decoder in general cannot sort the molecules and create an effective erasure channel. However, it is not clear a priori either whether prefixing every molecule with an index is optimal. Notice that one can view the noise-free DNA storage channel as a channel where the encoder chooses a distribution (or a type) over the alphabet Σ L and the decoder observes a noisy version of this type where the frequencies are perturbed accoding to Q. From this angle, the question becomes "how many types t ∈ Z 2 L + with t 1 = M can be reliably decoded?", and restricting ourselves to index-based schemes restricts the set of types to those with t ∞ = 1; i.e., no duplicate molecules are stored.
While this restriction may seem suboptimal, a counting argument suggests that it is not. The number of types for a sequence of length M over an alphabet of size |Σ L | = 2 L is at most M 2 L and thus at most bits can be encoded per symbol. We conclude that, if β < 1, the capacity is C = 0. An actual bound on the rate can be obtained by counting the number of types more carefully. This is done in the following lemma, which we prove in the appendix.
Lemma 1. The number of distinct vectors t ∈ Z a + with t 1 = b is given by Since our types are vectors t ∈ Z 2 L + with t 1 = M , and 2 L = 2 β log M = M β , it follows that at most bits can be encoded per symbol, for some α > 1, and Therefore, if we had a deterministic channel where the decoder observed exactly the M stored molecules, an index-based approach would be optimal from a rate standpoint. The converse presented in the next section utilizes a more careful genie to show that the bounds in (7) and (8) can in fact be combined, implying the optimality of index-based coding approaches.

Converse
Let [x L 1 , . . . , x L M ] be the M length-L molecules written into the channel and [y L 1 , . . . , y L N ] be the length-L molecules observed by the decoder. Notice that, whenever the channel output is such that y L i = y L j for i = j, the decoder cannot determine whether both y L i and y L j were sampled from the same molecule x L or from two different molecules that obey x L = x L k , = k. In order to derive the converse, we consider a genieaided channel that removes this ambiguity. As illustrated in Figure 2, before sampling the N molecules, We emphasize that the indices z i are all unique, and are chosen randomly and independently of the input sequences {x L i } M i=1 . Notice that, in contrast to the naive genie discussed in Section 3.1, this genie does not reveal the index i of the molecule x L i from which y L was sampled. Therefore, the channel is not reduced to an erasure channel, and intuitively the indices are only useful for the decoder to determine whether two equal samples y L = y L k came from the same molecule or from distinct molecules.
The output of the genie-aided channel, denoted by {(y L i , z σ(i) )} N i=1 , is then obtained by sampling from the set of tagged molecules M ] is such that y L i was sampled from x L σ(i) . Notice that the actual mapping σ is not revealed to the decoder.
It is clear that any storage rate achievable in the original channel can be achieved on the genie-aided channel, as the decoder can simply discard the indices, or stated differently, the output of the original channel can be obtained from the output of the genie-aided channel. Notice since all tagged molecules are distinct objects, and sampling the same tagged molecule ( in the following way. The entry of f corresponding to y L , for y L ∈ Σ L , is given by Hence, f is essentially a histogram that counts the number of occurrences of y L ∈ Σ L in the set of tagged Notice that the entries of f can take values greater than one.
and the tags added by the genie were chosen at random and independently of Hence, we can view the (random) frequency vector f as the output of the channel without any loss. Notice . Furthermore, the following lemma asserts that f 1 does not exceed its expectation by much.
Lemma 2. For any δ > 0, the frequency vector f at the output of the genie-aided channel satisfies Proof. Note that the number of distinct fragments that have been drawn is Since 1 {N i > 0} are independent random variables with expectation 1 − q 0 , Hoeffding's inequality yields which concludes the proof 1 .
We ≤ H(f ) + 1 + P e M LR s , where P e is the probability of a decoding error, which by assumption goes to zero as M → ∞. We can then upper bound the achievable storage rate R as where T [a, b] is the number of vectors x ∈ Z a + with x 1 = b. An application of Lemma 1 yields where α is a positive constant. Analogously, we obtain Dividing (10) by M L and applying the bounds above yields since Pr(E) → 0 by Lemma 2. Since δ > 0 can be chosen arbitrarily small, this concludes the converse proof of Theorem 1.

The noisy shuffling-sampling channel
Next, we study the effect of errors within sequences, in addition to the shuffling and sampling of the sequences. Instead of the general sampling distribution Q considered in Section 3, we now focus on a simple choice of sampling distribution and let Q be distributed as Bernoulli(1 − q). Hence, a sequences is either drawn never or once, with the corresponding probabilities given by Pr(N i = 0) = q and Pr(N i = 1) = 1 − q, for i = 1, . . . , M . Moreover, we assume that the molecules are all corrupted by a BSC with error probability p. We refer to this channel as the noisy shuffling-sampling channel.

The capacity of the noisy shuffling-sampling channel
As in the error-free shuffling-sampling channel considered in Section 3, we again consider a simple indexbased coding scheme. As we will show, for a large set of parameters p and β, this scheme turns out to be capacity-optimal. We consider a erasure-correcting code with block length M and rate (1 − q), where each symbol is itself a binary string of length L(1 − H(p) − 1/β − ), for some small epsilon. This code will be used as an outer code. Our inner code will be a code designed for a BSC with codewords of length L and rate R BSC = 1 − H(p) − . We first encode the information using the outer code, which yields M symbols, which are binary strings of length We take each symbol, add a unique binary index of length log M and encode the resulting sequence using the BSC code, which yields M length-L sequences.
With this scheme, we encode a total of (1 − q)M (LR BSC − log M ) data bits, with a data rate of Since > 0 can be chosen arbitrarily small, this scheme achieves a rate arbitrarily close to Strictly speaking, the simple index-based scheme described above needs to be slightly modified to account for the fact that, if an inner codeword is decoded in error (which occurs with a small probability) its unique index will also be decoded in error, likely causing an "index collision" with another correctly decoded inner codeword. Such a collision effectively creates two erasures. Moreover, there exists an even smaller probability that two inner codewords are decoded in error in a way that the true indices are swapped. Such an event may not be detected at the decoder side based on the set of decoded indices. Notice, however, that since the error probability of the inner code goes to zero as M → ∞, these events are much rarer than the erasures caused by the sampling distribution Q. It is straightforward to show that by considering an outer code with rate 1 − q − 2 , for an arbitrarily small 2 , these additional small-probability events can be accounted for. Hence, (12) is a lower bound to the capacity C of the noisy shuffling-sampling channel.
On the other hand, the result from Section 3, with Q ∼ Ber(1 − q) implies that C ≤ (1 − q)(1 − 1/β), since the error-free shuffling-sampling channel cannot be worse than the noisy shuffling-sampling channel. Furthermore, a simple genie-aided argument where the decoder observes the shuffling map can be used to establish that C ≤ (1−q)C BSC , where C BSC = 1−H(p) is the capacity of a BSC with crossover probability p. Hence, a capacity upper bound is given by Our main result improves on the upper bound in (13), and establishes that for parameters (p, β) in a certain regime, the lower bound in equation (12) is the capacity.

Converse
To derive the converse, we view the input to the channel as a binary string of length M L, denoted by where N = i N i . It is useful to define a vector S N ∈ {1, . . . , M } N indicating the input string from which each output string was sampled. Furthermore, we let Z N L = Z L 1 , . . . , Z L N be the random binary error pattern created by the BSC on the N non-deleted strings. We can now define the input-output relationship where ⊕ indicates elementwise modulo 2 addition. Note that the N i 's are fully determined by the vector S N since N i = |{i : S(k) = i}|. Also note that, since Q ∼ Ber(1 − q), N ≤ M with probability 1. and, by Jensen's inequality, In order to finish the converse, we need to jointly bound the first and third terms in equation (16). This step is summarized in the following lemma: Lemma 3. If β and p < 1/4 satisfy then it holds that The parameter regime (p, β) for which (18) holds is the regime in which our capacity expression holds, illustrated in Figure 3. Combining (16), (17) and Lemma 3, we have Dividing by M L and letting M → ∞ yields the converse.

Intuition for Lemma 3
In order to discuss the intuition for Lemma 3 let us focus on the case q = 0; i.e., none of the molecules are lost at the output. In this case, N = M , and S M is chosen uniformly at random from all permutations of To see this, first note that from X M L = x M L and Y M L = y M L , one can estimate the permutation S that maps each output string to the corresponding input string, S M , by finding, for each y L i , the x L j that is closest to it and setting S(i) = j. This is a good estimate if no other x L k is close to x L j . There are two regimes, illustrated in Figure 4, one where S N can be estimated well and one where it cannot. In the first {0, 1} L < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > {0, 1} L < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > ?
x L j < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > y L j < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t > < l a t e x i t s h a 1 _ b a s e 6 4 = " ( n u l l ) " > ( n u l l ) < / l a t e x i t >  Fig. 4(a) instead of the regime of Fig. 4(b). This leads to a tradeoff of the terms H Y N L and H S N |X M L , Y N L , which we exploit to prove Lemma 3. The detailed proof, which considers the general case where q = 0, is presented in the appendix.

Discussion
In this paper we studied the fundamental limits of models of DNA-based storage systems, characterized by random sampling of the input sequences, shuffling, and perturbing them. Specifically, we considered a large class of channel models that capture a range of specific instances of DNA storage channels, specified by choices of synthesis, sequencing, and DNA handling technologies. We focused our analysis on two cases: (1) the error-free shuffling-sampling channel for an arbitrary sampling distribution Q and (2) the noisy shuffling-sampling channel where Q ∼ Ber(1 − q) and the noisy channel is a BSC. In both cases we proved that a simple index-based scheme is capacity optimal, with the caveat that, for the noisy shuffling-sampling channel, the capacity expression in (14) only holds for the parameter regime of (p, β) in the blue region of Figure 3, and most importantly only holds in the low-error regime.
While the parameter regime in Figure 3 is arguably the most relevant one, an interesting question for future work is whether expression (14) is still the capacity of the BSC-shuffling channel if β and p do not satisfy (18) (i.e., the gray region in Figure 3). Notice that this is a high-noise, short-block regime, and it is reasonable to postulate that coding across the different sequences can be helpful and an index-based approach might not be optimal. Another natural question raised by Theorem 2 is whether a similar capacity expression holds for different noisy channels, including corruptions induced by deletions and insertions.

General symmetric channels
Recalling that the capacity expression for the noisy shuffling-sampling channel given by (14) is (1 − q)(C BSC − 1/β), it is natural to ask whether for a different sequence-level noisy channel with capacity C noisy , the corresponding noisy shuffling-sampling channel has capacity (1 − q)(C noisy − 1/β). As it turns out, the converse proof in Section 4.2 can be extended to the class of symmetric discrete memoryless channels (those channels are described in [CT12, Chapter 7.2]).
Specifically, consider a noisy shuffling-sampling channel with sampling Q ∼ Ber(1−q), and a symmetric discrete memoryless channel (SDMC) with output alphabet Y. It is then straightforward to generalize the converse proof in Section 4.2 to establish the following result.
Theorem 3. If β is large enough, the capacity of the SDMC shuffling-sampling channel is given by Moreover, if β ≤ log |Y|, C = 0.
For symmetric channels, capacity is achieved by making the distribution of the output uniform, which allows an analogous result to Lemma 3 to be obtained. How large β needs to be for this statement to hold, depends on the specific channel transition matrix.
Beyond symmetric channels, new converse techniques must be developed in order to characterize the capacity of the corresponding noisy shuffling-sampling channels.

Storage-Recovery Tradeoff
Most studies on DNA-based storage emphasize the storage rate (or storage density), while sequencing costs are disregarded. From a practical point of view, it is important to understand, for a given storage rate, how much sequencing is required for reliable decoding, as this determines the time and cost required for retrieving the data. Thus, characterizing the storage-recovery trade-off is of practical relevance relevance.
One way to do this is to consider, in addition to the storage rate, the recovery rate, defined as the number of bits recovered per DNA base sequenced, In a practical setting, one can control the amount of sequencing performed, typically specified in terms of the coverage depth N/M . If we consider the error-free shuffling-sampling channel from Section 3, in the case where Q is a Poisson distribution with mean λ, then λ = N/M is the coverage depth, and one would like to choose a value of λ that achieves a good trade-off between storage rate and recovery rate. If we let R s be the storage rate (previously just R, see (3)), from Theorem 1 and the fact that R s = λR r , the (R s , R r ) feasibility region can be fully characterized.
Corollary 1. For the error-free shuffling-sampling channel with Q ∼ Pois(λ), rates (R s , R r ) are achievable if and only if, for some c > 0, This region is illustrated in Figure 5. This tradeoff suggests that a good operating point would be achieved by not trying to maximize the storage rate (which technically requires λ → ∞). Instead, by using some modest coverage depth λ = 1, 2, 3, most of the storage rate (63%, 86%, 95%, respectively) can be achieved. This is somewhat in contrast to what has been done in the practical DNA storage systems that have been developed thus far, where the decoding phase utilizes very deep sequencing.
To be concrete, suppose that we are interested in minimizing the cost for storing data on DNA. Synthesis costs are currently larger than sequencing costs by about a factor q = 10, 000-100, 000. Thus, if our goal is to minimize the cost for synthesizing and sequencing a given number of bits in DNA, the cost is proportional to q/R s + 1/R r = q+λ 1−e −λ . This quantity can be maximized over λ, to obtain the optimal cost per bit stored. For example, for q = 10000, λ ≈ 9.2. Moreover, one might be interested in optimizing other quantities such as reading time or considering a scenario where the data is read more than once.

Storing data on short molecules
Throughout this paper, we focused on the regime L = β log M , with β ≥ 1. For β ≤ 1, no positive rate can be achieved (as shown by Theorem 1). However, motivated by the fact that it is in general much easier to synthesize very short sequences of DNA than longer ones, it is interesting to ask whether with very short sequences, it is still possible to build useful DNA storage systems.
Towards this goal, in this section we briefly discuss how fast the rate tends to zero in the regime when β ≤ 1. Notice that, when β ≤ 1, the total number of distinct molecules of length L = β log M is 2 β log M = M β < M . Hence, it is impossible to write M distinct molecules. In this case, it is reasonable to study the amount of bits that can be stored relative to the number of potentially distinct molecules. Towards this goal we define the short-molecule rateR asR Proposition 1. Suppose that each molecule is drawn N i ∼ Q times, with expectation EN i > 0, and that β < 1. Then, any achievable short-molecule rate satisfiesR ≤ 1/β − 1.
The proof, provided in the appendix, is based on the genie-aided and counting-based argument used in Section 3.2. The proposition guarantees that the (true) rate R tends to zero at least as 1/M 1−β . While at first sight, it might seem surprising that there is no dependency on 1 − q 0 , this is reasonable, since in the regime of β < 1, no more than M β distinct molecules exist. Thus, we see each fragment about −β many times, which tends to infinity, regardless of Q.
We point out that index-based coding schemes cannot achieve the scaling R = Θ(M β L) suggested by the proposition. To see this, suppose we encode the sequences by using L − 1 bits for the index and only one bit for the information, and repeat each such segment M/(2 L−1 ) = 2M 1−β many times. We see each segment at least once with probability one as M → ∞. Thus we reliably store 2 L−1 = M β /2 bits. Simple variations of this scheme (where we change the number of bits allocated to the index) can be similarly shown to only encode Θ(M β ) bits reliably. Hence, for the regime β ≤ 1, our upper bound to the number of bits that can be reliably stored is Θ(M β L), while our lower bound is Θ(M β ), and it is an open question what the correct scaling is.

Outlook
In this paper we took steps towards the understanding of the fundamental limits of DNA-based storage systems. We proposed a simple model capturing the fact that molecules are stored in an unordered fashion, are short, and are corrupted by individual base errors. Our results show that a simple index-based coding scheme is asymptotically optimal for a large set of parameter choices.
While the model captures (moderate) substitution errors which are the prevalent error source on a nucleotide level of current DNA storage systems, the current generation of systems relies on low-error synthesis and sequencing technologies that are relatively expensive and limited in speed. A key idea towards developing the next-generation of DNA storage systems is to employ high-error, but cheaper and faster synthesis and sequencing technologies such as light-directed maskless synthesis of DNA and nanopore sequencing. Such systems induce a significant amount of insertion and deletion errors. Thus, and important area of further investigation is to understand the capacity of channels which introduce deletions and insertions as well.

A Proof of Lemma 1
Notice that vectors x ∈ Z a + with x 1 = b are in one-to-one correspondence with binary strings containing (a − 1) 0s and b 1s. For x = (x 1 , . . . , x a ), the corresponding string is It is clear that such a string has (a − 1) 0s and b 1s, and that distinct strings with (a − 1) 0s and b 1s correspond to distinct vectors x. The number of distinct strings of this form is The upper bound in the statement of the lemma is a standard bound for binomial coefficients.

B Proof of Lemma 2 under a sampling-with-replacement model
As it turns out, Lemma 2 can be proved under a sampling-with-replacement model. Under this model, instead of sampling each molecule according to a probability distribution Q, N sequences are sampled out of the pool of M stored sequences. Since there are multiple copies of each molecule in the pool due to PCR, we consider a sampling with replacement model. By proving Lemma 2 in this setting, one can establish a version of Theorem 1 for the sampling-with-replacement shuffling channel, as previously described in [Hec+17]. Consider the same genie-based argument described in Section 3.2. In the sampling-with-replacement setting, the 1 norm of the frequency vector f at the output of the genie-aided channel is distributed as the number of distinct coupons obtained by drawing N = λM times with replacement from a set of M distinct coupons. Thus, Lemma 2 is an immediate consequence of the following stronger statement.
Lemma 4. Let Q be the number of distinct coupons obtained by drawing N = λM times with replacement from a set of M distinct coupons. We have that, for any δ > 0, Proof. Since Pr Q ≥ (1 − e −λ + δ)M is a non-increasing function of δ, we can assume that δ ∈ (0, e −λ /2], as that simplifies the expressions. Let t i be the number of draws to collect the i-th coupon after , we obtain Here, 1 i is the M -th harmonic number, and the first inequality follows by the asymptotic expansion where γ is the Euler-Mascheroni constant. The second inequality follows from 1 1−α ≤ 1 e −λ −e −λ /2 = 2e λ . Moreover, the variance can be upper-bounded as Using the bound on the expectation and Chebyshev's inequality, we have for any β > 0, that which, when plugged back into (24) implies that where d H is the Hamming distance and α > 2p. We assume that in case of ties, an arbitrary tie-breaking rule is used to define T (the actual choice will not be relevant for the proof). Let E n be the expectation conditioned on N = n; i.e., E n [·] = E[·|N = n]. We prove that, given the conditions in Lemma 3, the following two bounds involving E n |T | hold: For large E n |T |, we are typically in the regime of Figure 4 Hence f (x) > 0 if x < e −1 M γ/2 .
We see that, as long as γ > 2, the right-hand side of (29) is greater than M for M large enough. This means that f (x) is increasing for 1 ≤ x ≤ M . Since E n |T | ≤ n ≤ M , f must attain its maximum at f (n). Therefore, (28) can be upper-bounded by setting x = E n |T | = n, which yields H Y N L |N = n + H S N |X M L , Y N L , N = n ≤ nL + n log M n + o(M L).
Notice that this holds if, for some α > 2p, From the continuity of H(·), such α can be found if (18) holds, proving the lemma. It remains to prove (B1) and (B2).

C.1 Proof of (B1)
Since T is a deterministic function of Y N L and can take at most 2 n values, Next we notice that, for a given t, we can write The first term in (31) is trivially bounded as H [Y L i : i ∈ t]|T = t, N = n ≤ |t|L.
Each of the remaining length-L strings Y L i with i / ∈ t must be within a distance αL from one of the strings in [Y L i : i ∈ t], from the definition of T . Hence, conditioned on [Y L i : i ∈ t], each of them can only take at most |t||B(αL)| values, where B(αL) is a Hamming ball of radius αL. Since |B(αL)| ≤ 2 LH(α) for α < 1/2, we bound the second term in (31) where we used the fact that (n − x) log x is a concave function of x and Jensen's inequality.

D Proof of Proposition 1
We use a similar genie-aided and counting-based proof as in Section 3.2. The only difference is on how the number of frequency vectors is bounded. As before, the frequency vector on the output of the genie-aided channel satisfies, for any δ > 0, f 1 ≤ M (1 − q 0 + δ). We next upper bound the number of different frequency vectors f ∈ Z M β + with f 1 = M (1 − q 0 + δ). By Lemma 1, the number of different frequency vectors we see at the output is upper bounded by where the second equality follows from n k = n n−k . Taking the logarithm we get log T [M β , M (1 − q 0 + δ)] ≤ M β ((1 − β) log M + log(1 − q 0 + δ) + 1).