Generative Adversarial Networks for DNA Storage Channel Simulator

DNA data storage systems have rapidly developed with novel error-correcting techniques, random access algorithms, and query systems. However, designing an algorithm for DNA storage systems is challenging, mainly due to the unpredictable nature of errors and the extremely high price of experiments. Thus, a simulator that can imitate the error statistics of a DNA storage system and replace experiments during development is of great interest. We introduce novel generative adversarial networks that learn DNA storage channel statistics. Our simulator takes oligos (DNA sequences to write) as an input and generates a FASTQ file that includes output DNA reads and quality scores, as if the oligos were synthesized and sequenced. We trained the proposed simulator with data from a single experiment consisting of 14,400 input oligo strands and 12,108,573 output reads. The error statistics between the input and the output of the trained generator match the actual error statistics, including the error rate at each position, the number of errors for each nucleotide, and high-order statistics. The code is available at https://github.com/gyfbianhuanyun/DNA_storage_simulator_GAN.


I. INTRODUCTION
DNA storage is one of the most promising next-generation storage systems [1], [2]. Unlike modern digital systems, which store data in binary format using 0 and 1, a DNA storage system converts data into a quaternary format using four bases (adenine (A), cytosine (C), guanine (G), and thymine (T)). The system then writes data by synthesizing corresponding DNA strands and reads data using the DNA sequencing process [3]. The data density of DNA storage can reach 215 petabytes per gram [4]. At the same time, maintenance costs are low: fragments of DNA encapsulated in silica can be preserved for thousands of years [5]. However, the synthesis and sequencing technologies are error-prone [6], and therefore an appropriate error-correcting code (ECC) is necessary for reliable data recovery [4], [7], [8], [9].
The errors in the DNA storage channel are often asynchronous (insertions and deletions). Moreover, the sequence exhibits higher-order statistics; for example, the error rate increases in regions of consecutive bases. Thus, it is notoriously hard to characterize the channel statistics perfectly, and designing an ECC is correspondingly challenging. A line of research focuses on ECCs for such a noisy channel, which we call the DNA storage channel: Blawat et al. [10] proposed a forward error correction scheme, Erlich and Zielinski [4] combined a Reed-Solomon (RS) code with a fountain code, and Jeong et al. [7] also applied RS codes with an improved decoding technique. Press et al. [8] corrected asynchronous errors using hashing and greedy search, and Chandak et al. [9] used low-density parity-check (LDPC) and BCH codes. In addition, Hulett et al. [11] proposed a coding scheme specific to Nanopore sequencing, and Chandak et al. [12] used convolutional codes to correct many insertions and deletions.
Although the cost of sequencing is rapidly decreasing [13], the high price of synthesis [14] is still the main bottleneck in developing ECCs for DNA storage channels. Simulation is widely used to predict performance before conducting actual experiments; for example, it is used to predict the performance of new ECC schemes in wireless communications [15], [16]. This approach significantly reduces the cost of designing and testing an ECC, and we aim to achieve a similar effect with a simulator of the DNA storage channel.
In addition, the simulator is also valuable for the channel-related random access verification [17], [18], analysis of new synthesis techniques [19], [20], and bound analysis of the channel [21], [22], [23], [24]. However, to the best of our knowledge, there is no publicly available DNA storage channel simulator except Antonio et al. [25], which mostly focused on the Nanopore sequencer.
There are various types of simulators related to DNA sequencing. ART [26] mimics a next-generation sequencer, whereas Flux [27] simulates RNA-seq. Also, pIRS [28] and simLoRD [29] simulate Illumina and PacBio reads, respectively, when the error profiles are given. PBSIM2 [30] simulates long reads of the PacBio sequencer with quality values generated by a trained hidden Markov model (HMM). However, the above sequencing simulators focus only on first-order statistics (e.g., error probabilities at each location). Deep Simulator [31], [32] uses deep learning techniques to mimic the entire pipeline of Nanopore sequencing. NanosigSim [33] improves the signal generator module of the Deep Simulator using a bidirectional gated recurrent unit (GRU). Nanopore SimulatION [34] is a modular software tool capable of simulating the raw current values of Nanopore sequencing reads. Schwarz et al. [35] simulated error sequences with pre-defined error probabilities, which need to be specified by the user.
However, most of these simulators have hard-coded error probabilities or have been designed for a specific sequencing technology. Since DNA storage is a rapidly developing research area, there are many different technologies in both sequencing (e.g., Illumina [36] and Oxford Nanopore [37]) and synthesis (e.g., Twist Bioscience). Each technology has unique characteristics; for example, Nanopore technologies can sequence long reads (>1,000 base pairs), while Illumina sequencing can handle up to a few hundred base pairs. Simulators based on specific experimental setups may not transfer to an environment with new sequencing and synthesis technologies. For example, Deep Simulator [31] has a pore model that requires parameter engineering and may not apply to Illumina sequencing technologies.
In this paper, we propose a deep learning-based DNA storage channel simulator. Our model is based on a generative adversarial network (GAN) [38], which captures high-order statistics of the data. Also, the model's training is universal in that it does not depend on specific sequencing and synthesis technologies. More precisely, we employ an end-to-end optimized deep learning algorithm that requires no parameter engineering and depends only on data statistics.
Note that designing a GAN-based simulator is non-trivial. Most GANs are optimized to match only the output distribution, but the simulator needs to match the joint distribution of input and output. Also, since the goal of the DNA storage simulator is to mimic a noisy channel, we expect randomness in the generated sequence. In other words, the same input sequence must produce various outputs whose distribution matches the actual experimental data. This is not the case in most GAN settings, where the randomness comes from a random latent input vector and the generator (simulator) is deterministic. To overcome these problems, this work proposes a novel GAN structure in which the discriminator takes both the input and the output to verify the input-output statistics.
We train the proposed model on 10^5 input sequences and 10^9 output sequences obtained from a single experiment, using a procedure similar to that described by Jeong et al. [7]. Note that this is enough to train the proposed model, which implies that an extensive data acquisition campaign is unnecessary. The proposed framework allows us to build a simulator with the same statistics based on data from a single experiment.
To verify the effectiveness of the proposed simulator, we measure its error statistics. Our simulator shows similar insertion, substitution, and deletion error rates. It also matches higher-order statistics, such as the probability of consecutive deletions (2-, 3-, and 4-deletions). Moreover, we achieve output randomness by using multiple weights obtained during the training procedure. Finally, we apply the ECC technique proposed by Jeong et al. [7] to simulated reads and observe error-correcting performance similar to that obtained with the real experimental data.

II. RELATED WORK

A. DEEP LEARNING MODELS FOR SEQUENCES
In sequential data processing tasks, the gated recurrent unit (GRU) [39] and long short-term memory (LSTM) [40], variants of the RNN [41], are widely used. However, in language representation tasks, the most popular tasks for sequential data, the performance of RNN-based models is not satisfactory.
On the other hand, a line of research combined an encoder and a decoder for language modeling [42], [43], [44], [45] but suffered from bottleneck problems. Recently, Vaswani et al. [46] proposed the Transformer, an attention-based network, and showed remarkable performance on various language representation tasks. This led to pre-trained attention-based models such as BERT [47] and GPT [48], [49], [50], which show state-of-the-art performance. The pre-LN Transformer [51], which places the layer normalization inside the residual block, improves training with well-behaved gradients [52].

B. GENERATIVE ADVERSARIAL NETWORK
A Generative Adversarial Network (GAN) [38] is a framework that adversarially trains two networks, a generator and a discriminator. The role of the generator is to produce fake (in our case, simulated) data that are similar to the given data distribution, while the discriminator tries to distinguish between the real and generated (fake) data. Equivalently, the generator minimizes the minimax loss, while the discriminator maximizes it [53]. With the competing nature of the discriminator and the generator, the generator keeps producing more realistic generated (fake) data while the discriminator detects the generated data more accurately. Finally, the generated data has similar statistics to the actual data and cannot be distinguished by the discriminator.
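For reference, this adversarial game is the standard minimax objective of the original GAN paper [38], which can be written as

\[
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log D(x)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr],
\]

where p_data is the data distribution and p_z is the latent prior.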
Wasserstein GAN (WGAN) [54] minimizes the Wasserstein distance between the empirical distributions of actual and generated data. WGAN is more resistant to the mode collapse problem, in which the generator produces only specific data because of unbalanced training between the generator and the discriminator [55], [56]. Berthelot et al. [57] improved WGAN with a boundary equilibrium, and Gulrajani et al. [58] introduced a gradient penalty that shows better training stability. Additionally, Mao et al. [59] mapped the generated data to a latent vector, and Miyato et al. [60] constrained the Lipschitz condition using weight normalization to overcome mode collapse.
Although many generative models focus on mimicking the data distribution, there are GANs that consider the correlation between input and output [61], [62], [63]. However, these works propose deterministic generators. To achieve randomness in the output, many works [64], [65], [66], [67], [68], [69] inject noise into the generator. However, their primary concern is the variability of the output rather than targeting a specific joint distribution of the input and the output.
Most GAN generators (and discriminators) are convolutional neural networks, since GANs were originally designed for vision tasks [38]. Recently, efforts have been made to create RNN-based GANs [70], [71], [72]. Further, Transformer-based GANs [73], [74] were proposed and showed robust training. On the other hand, a few works have considered generating sequences using GANs, including RNNs with reinforcement training [75], [76].

FIGURE 1. Overall structure of the simulator. The read generator, which consists of insertion, substitution, and deletion generators, takes an oligo sequence as an input and generates a read sequence. The generated output is aligned with the input oligo sequence, and the quality score generator produces corresponding quality scores.
Since the goal of a GAN is to mimic a given data distribution, many studies have proposed GAN-based simulators. Peng et al. [81] constructed a GAN-based medical ultrasound simulator suitable for medical simulation and clinical training. Also, Rahnemoonfar et al. [82] used an architecture similar to CycleGAN [61] to synthesize snow radar images, Kim et al. [83] presented a user-parameterized GAN-based fluid simulator, and Erdmann et al. [84] simulated electromagnetic calorimeter showers using WGAN. Notably, Yang et al. [85] suggested GAN-based wireless channel modeling for the continuous additive white Gaussian noise channel, and Orekondy et al. [86] proposed a GAN-based simulator that learns the multi-channel distributions of the multiple-input multiple-output channel.

III. PROPOSED SIMULATOR

A. PROBLEM DESCRIPTION
A DNA storage system stores data in oligonucleotides and reads the stored data back from multiple synthesized oligos. The goal of the simulator is to imitate the sequential processes of the DNA storage system, from the input oligo sequence to the output read sequence.
Let X = {A, C, G, T} be the set of nucleotide bases, and let X_e = {A, C, G, T, -} be the set of extended symbols, where '-' is a placeholder that corresponds to inserted or deleted symbols. Further, let Q ⊂ [0, 1] denote the set of normalized quality values. The input of the simulator is a nucleotide sequence O of length n, which we call the oligo:

O = (O_1, O_2, ..., O_n),

where O_i ∈ X for 1 ≤ i ≤ n. Then, the simulator's output consists of a DNA sequence R of length M, which we call the read, and the corresponding quality scores Q:

R = (R_1, R_2, ..., R_M),  Q = (Q_1, Q_2, ..., Q_M),

where R_j ∈ X and Q_j ∈ Q for 1 ≤ j ≤ M. The length M of the output read sequence (and of the quality values) is a random variable due to asynchronous (insertion and deletion) errors.

We propose a read generator G_read followed by a quality score (qscore) generator G_qs. Given an input oligo O, the read generator G_read(O) outputs a read R, and G_qs(O, R) generates a quality score Q based on the oligo and read sequences. This design reflects the nature of quality scores, which indicate the confidence level of the read: roughly speaking, if there is a mismatch between the bases of the oligo and the read, a lower quality score is more likely.
Due to the notoriously challenging nature of asynchronous errors and the unbalanced error rates between substitution errors and the others, we could not train a single generator that handles all three types of errors simultaneously. Instead, we divide the read generation procedure into three steps: insertion generation, substitution generation, and deletion generation. We call the intermediate sequences X_ins, X_sub, and X_del, i.e., O → X_ins → X_sub → X_del → R (we will specify later why X_del and R are different). Let L denote the length of the intermediate sequences, which is a random variable and may be distinct from n and M. More precisely, the intermediate sequences are given by

X_ins = (X_ins,1, ..., X_ins,L),  X_sub = (X_sub,1, ..., X_sub,L),  X_del = (X_del,1, ..., X_del,L),

where each symbol is in X_e. Due to insertions, we have n ≤ L, and similarly M ≤ L because of deletion errors. Then, the read generator is composed of three generators, G_ins, G_sub, and G_del, which introduce insertion, substitution, and deletion errors, respectively. Note that the sequences O and X_ins may have different lengths due to insertion errors. However, the sequence lengths of X_ins, X_sub, and X_del are always the same, since the placeholder symbol '-' is used. Finally, a trimmer removes the '-' symbols from X_del, and we obtain the output read sequence R.
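To make the decomposition concrete, the following is a minimal sketch (ours, not the authors' released code) of the three-step read generation in Python; the callables G_ins, G_sub, and G_del stand for the trained generators described below.

def generate_read(oligo: str, G_ins, G_sub, G_del) -> str:
    # O -> X_ins: insertions may lengthen the sequence (n <= L).
    x_ins = G_ins(oligo)
    # X_ins -> X_sub: substitutions only; the length is preserved.
    x_sub = G_sub(x_ins)
    # X_sub -> X_del: deleted positions are marked with '-'.
    x_del = G_del(x_sub)
    # Trimmer: removing the placeholders yields the final read R (M <= L).
    return x_del.replace('-', '')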
Error-contained input-output pairs are rare compared to error-free pairs in the dataset. In such a case, a generator trained in the GAN [38] framework may collapse to the trivial generator that outputs the input without any errors. In this work, we therefore train the generators only on error-contained pairs to focus on the error statistics. We then introduce a profile indicating whether each type of error occurs; more precisely, the profile can be viewed as a binary vector p = (p_ins, p_sub, p_del) ∈ {0, 1}^3, with one flag per error type. While training, we estimate the probability distribution of the profile vector based on occurrences. In the simulation step, for each oligo input, the simulator samples a random profile vector from this empirical distribution and feeds it to all generators (G_ins, G_sub, and G_del).
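As an illustration, the profile handling can be sketched as follows (a minimal sketch under our assumption that a profile is a triple of 0/1 flags for insertion, substitution, and deletion; the function names are ours):

import random
from collections import Counter

def estimate_profile_distribution(profiles):
    # profiles: list of (ins, sub, del) 0/1 tuples observed in training.
    counts = Counter(profiles)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def sample_profile(dist):
    # Draw one profile according to the empirical distribution.
    profiles, weights = zip(*dist.items())
    return random.choices(profiles, weights=weights, k=1)[0]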
There are two main reasons why we use three separate generators instead of a single one: 1) the different nature of synchronous and asynchronous errors, and 2) the biased error statistics, in which substitution errors dominate the other types. Thus, we design generators that each focus solely on one type of error, which is more efficient and robust in training than a single generator network. We discuss GAN training in the presence of a profile vector further in Section IV.
The qscore generator G_qs takes an oligo and read pair (O, R) as an input. It first aligns the oligo and the read, yielding O_aligned and R_aligned, respectively. Then, it produces quality scores corresponding to the aligned read. Similar to the read generation, since errors occur rarely in G_read and the qscore statistics differ significantly when an error exists, we introduce two generators: G_qs^(g) for error-free aligned pairs and G_qs^(e) for error-contained aligned pairs. In other words, the simulator applies G_qs^(g) when the aligned oligo and read coincide and G_qs^(e) otherwise. The entire structure of the proposed simulator is described in Figure 1.

IV. TRAINING GENERATORS USING GAN

A. GANs FOR READ GENERATOR
We train G_ins, G_sub, and G_del using three separate GANs (insGAN, subGAN, and delGAN for G_ins, G_sub, and G_del, respectively). However, we need to define each generator's input and output, since the intermediate sequences (X_ins, X_sub, and X_del) are not provided. The insGAN requires O and X_ins, the subGAN needs X_ins and X_sub, and finally, the delGAN is trained on X_sub and X_del. To obtain the intermediate sequences, we first align the sequences to get O_aligned and R_aligned of the same length for a given oligo O and read R. Then, we obtain the inserted sequence X_ins by replacing each '-' in O_aligned with the base at the same index of R_aligned. Similarly, the substituted sequence X_sub is obtained by replacing each '-' in R_aligned with the base at the same index of O_aligned. Finally, we have X_del = R_aligned, a deleted version of the substituted sequence. An example of this data processing is described in Figure 2. In this example, the oligo O = CTCACGGT and read R = CCACAGGA are given. First, we align the two sequences and get the aligned oligo O_aligned = CTCAC-GGT and the aligned read R_aligned = C-CACAGGA. Then, we obtain X_ins = CTCACAGGT by replacing '-' in O_aligned with 'A' (the 6th base of R_aligned), and X_sub = CTCACAGGA by replacing '-' in R_aligned with 'T' (the 2nd base of O_aligned). The deleted sequence X_del is simply C-CACAGGA, which is identical to R_aligned.
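The derivation of the intermediate sequences from an aligned pair can be written directly from these rules; the sketch below (our code, not the authors') reproduces the example above.

def intermediate_sequences(o_aligned: str, r_aligned: str):
    # X_ins: replace '-' in the aligned oligo with the read base at the
    # same index, materializing insertions.
    x_ins = ''.join(r if o == '-' else o for o, r in zip(o_aligned, r_aligned))
    # X_sub: replace '-' in the aligned read with the oligo base at the
    # same index, keeping substitutions but hiding deletions.
    x_sub = ''.join(o if r == '-' else r for o, r in zip(o_aligned, r_aligned))
    # X_del is identical to the aligned read; '-' marks deletions.
    return x_ins, x_sub, r_aligned

# intermediate_sequences("CTCAC-GGT", "C-CACAGGA")
# returns ("CTCACAGGT", "CTCACAGGA", "C-CACAGGA"), matching Figure 2.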
The goal of GAN training is to obtain a generator that produces error patterns similar to those observed in actual experiments. The training procedure of the three GANs (insGAN, subGAN, and delGAN) is the same: the generator takes an input sequence and produces an output sequence, and the discriminator checks both the input and output sequences and then determines whether the output sequence comes from real experiments or from the generator. We use the framework of WGAN-GP [58], which minimizes the Wasserstein distance between distributions with a gradient penalty that stabilizes training. This training procedure is shown in Figure 3.
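For concreteness, a minimal PyTorch sketch of a WGAN-GP critic loss is shown below; the tensor shapes are simplified to (batch, channels, length), and the interpolation scheme follows Gulrajani et al. [58].

import torch

def wgan_gp_critic_loss(disc, real_pair, fake_pair, lambda_gp=10.0):
    # real_pair / fake_pair: stacked (input, output) tensors of shape (B, C, L).
    eps = torch.rand(real_pair.size(0), 1, 1, device=real_pair.device)
    interp = (eps * real_pair + (1 - eps) * fake_pair).requires_grad_(True)
    grad = torch.autograd.grad(disc(interp).sum(), interp, create_graph=True)[0]
    penalty = ((grad.flatten(1).norm(2, dim=1) - 1.0) ** 2).mean()
    # The critic maximizes D(real) - D(fake); we return the value to minimize.
    return disc(fake_pair).mean() - disc(real_pair).mean() + lambda_gp * penalty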

1) insGAN
The critical component of the insertion generator G_ins is a gated recurrent unit (GRU) [39]. Since a GRU processes its input sequentially, it is well suited to handling asynchronous errors such as insertions. Also, the GRU has a lighter architecture than the Transformer; it learns short-term dependencies effectively and is thus more appropriate for generating realistic consecutive errors. We stack three bidirectional GRU [87] layers with 64 hidden units, which effectively extract the sequence features. The bidirectional GRU layers are followed by a single fully-connected layer with a softplus activation function. Unlike subGAN or delGAN, the input and output of the insGAN may have different sequence lengths.
We denote by D_ins the discriminator corresponding to the insertion generator G_ins. It takes the generator's input and output (O, X_ins) as its input to check whether the pair shows proper channel statistics. The discriminator D_ins is based on a 1-dimensional convolutional neural network (1D-CNN) [88], which investigates the joint distribution of the input O and output X_ins. We use various CNN kernels and multiple layers to capture high-order dependencies. The discriminator takes stacked input and output sequences as input. Notably, we exclude batch normalization and dropout from the discriminator to stabilize GAN training. Finally, the 1D-CNN layers are followed by a fully connected layer without an activation function. In more detail, we stack ten 1D-CNN layers with eight hidden dimensions, 5 × 5 kernels, and zero padding.
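The PyTorch sketch below reflects our reading of this description; the 9-dimensional input (the 8-symbol one-hot plus one random coordinate from Section V-A), the ReLU activations, the mean-pooling in the critic, and reading the "5 × 5" kernels as length-5 1-D kernels are our assumptions.

import torch.nn as nn

class InsGenerator(nn.Module):
    def __init__(self, in_dim=9, hidden=64, out_dim=8):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=3,
                          bidirectional=True, batch_first=True)
        self.head = nn.Sequential(nn.Linear(2 * hidden, out_dim), nn.Softplus())

    def forward(self, x):               # x: (batch, seq_len, in_dim)
        h, _ = self.gru(x)
        return self.head(h)             # per-position symbol scores

class InsDiscriminator(nn.Module):
    def __init__(self, in_ch=16, hidden=8, layers=10):
        super().__init__()
        blocks, ch = [], in_ch          # no batch norm / dropout, per the text
        for _ in range(layers):
            blocks += [nn.Conv1d(ch, hidden, kernel_size=5, padding=2), nn.ReLU()]
            ch = hidden
        self.cnn = nn.Sequential(*blocks)
        self.fc = nn.Linear(hidden, 1)  # final FC without activation

    def forward(self, x):               # x: stacked (O, X_ins), (batch, 16, seq_len)
        return self.fc(self.cnn(x).mean(dim=2))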
Note that G_ins is a generator that introduces insertion errors, and we therefore construct its dataset with erroneous pairs only. That is, all input-output pairs (O, X_ins) in the dataset for insGAN contain insertion errors. The same holds for subGAN and delGAN, whose datasets consist of pairs with substitution errors and deletion errors, respectively.

2) subGAN
The substitution generator G_sub is based on the Transformer model [46]. Since substitution is a synchronous error and the most dominant error type, the proposed model focuses on higher-order statistics with an attention mechanism. The generator G_sub takes X_ins as an input and outputs X_sub. The proposed model contains a learnable positional encoding and three encoder layers. An encoder layer consists of two multi-head attention layers, followed by a feed-forward layer with 64 hidden dimensions and layer normalization layers. Finally, the generator output is processed by a fully connected layer with a softplus activation function. The discriminator D_sub adopts a construction similar to that of insGAN: ten 1D-CNN layers with 128 hidden dimensions, a 5 × 5 kernel, and a single zero padding.
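A sketch of such a generator in PyTorch is given below; the number of attention heads, the maximum length, and the embedding layer are our assumptions, and PyTorch's stock encoder layer has a single attention block per layer rather than the two described above.

import torch
import torch.nn as nn

class SubGenerator(nn.Module):
    def __init__(self, in_dim=9, d_model=64, nhead=4, max_len=160, out_dim=8):
        super().__init__()
        self.embed = nn.Linear(in_dim, d_model)
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))  # learnable PE
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=64, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Sequential(nn.Linear(d_model, out_dim), nn.Softplus())

    def forward(self, x):                       # x: (batch, seq_len, in_dim)
        h = self.embed(x) + self.pos[:, :x.size(1)]
        return self.head(self.encoder(h))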

3) delGAN
The deletion generator G_del and discriminator D_del are identical in structure to those of insGAN. The generator G_del has three layers of bidirectional GRU with 64 hidden dimensions, and D_del has ten 1D-CNN layers with 8 hidden dimensions, a 5 × 5 kernel, and a single zero padding. However, since the input X_sub and output X_del have the same sequence length thanks to the placeholder symbol '-', it is much easier to learn the statistics of the deletion channel than those of the insertion channel.

4) RANDOMNESS IN READ GENERATOR
A distinctive feature of the GAN-based simulator, compared with other GANs, is the stochasticity of its generator. To achieve output variability, we first introduce an additional random number for each input base. We also store an ensemble of generators obtained during a single training procedure at different epochs. Because of the instability of GAN minimax training [56], [89], the generator varies meaningfully from epoch to epoch. Thus, the simulator randomly selects a generator from the ensemble and then produces the output. We provide experimental results and verify the output variability in Section V-C.
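A sketch of this sampling step (our code; the checkpoint paths and helper names are illustrative, and checkpoints are assumed to store plain state dicts):

import random
import torch

def simulate(one_hot_oligo, generator, checkpoint_paths):
    # Pick one generator from the ensemble of saved epochs at random.
    generator.load_state_dict(torch.load(random.choice(checkpoint_paths)))
    generator.eval()
    # Append one uniform random number per input base (Section IV-A4).
    noise = torch.rand(*one_hot_oligo.shape[:-1], 1)
    with torch.no_grad():
        return generator(torch.cat([one_hot_oligo, noise], dim=-1))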

B. GAN FOR QUALITY SCORE GENERATOR
The quality score generator takes an aligned oligo and read pair (O_aligned, R_aligned) as an input. We train two quality score generators (G_qs^(g) and G_qs^(e)) separately: a generator for error-free sequence pairs and a generator for error-contained sequence pairs. The training framework is the same for both cases.
Given a dataset consisting of oligos, reads, and quality scores, we align the oligo and read sequences, and the quality scores follow the aligned reads. This provides the aligned oligo O_aligned, the aligned read R_aligned, and the aligned quality scores Q_aligned. Note that there is no quality score corresponding to a deleted base. We fill such a gap with the nearest quality value; for example, if the 2nd base of the aligned read is '-' (a deletion), its quality slot is filled with the 1st quality score (e.g., '!').
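A sketch of this gap-filling step (our code; preferring the left neighbor as the "nearest" score is our assumption):

def fill_quality_gaps(r_aligned, qscores):
    # qscores holds one value per non-'-' position of r_aligned.
    filled, it = [], iter(qscores)
    for base in r_aligned:
        if base != '-':
            filled.append(next(it))
        else:
            filled.append(filled[-1] if filled else None)  # nearest-left value
    for i in range(len(filled) - 2, -1, -1):               # back-fill leading gaps
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled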
The quality score discriminator takes a triplet of aligned oligo, aligned read, and aligned quality scores (O_aligned, R_aligned, Q_aligned) as an input. Then, it determines whether the quality scores are generated or real. We again follow the framework of WGAN-GP [58], which minimizes the Wasserstein distance with a gradient penalty and is resistant to the mode collapse problem. The training procedure for the quality score GAN is described in Figure 4.

1) QSCORE GAN
The quality score generator G_qs has a structure similar to that of G_ins; it consists of two bidirectional GRU layers with 64 hidden dimensions. The GRU layers are followed by a fully connected layer and a hyperbolic tangent activation function. Note that the generator produces normalized quality scores between 0 and 1. The quality score discriminator D_qs consists of ten 1D-CNN layers with 8 hidden dimensions, 5 × 5 kernels, and a single zero padding. A fully connected layer that follows the CNN layers generates the scalar value used to compute the Wasserstein distance.
Similar to the generators in G_read, the generator takes an additional random number per base as input, which introduces randomness into the generated quality scores. During training, we added L1 regularization to the quality score GAN for stability.

V. EXPERIMENT

A. DATASET
In the experiment, we use the same oligo design proposed by Jeong et al. [7]. The dataset contains 18,000 oligos of length 152, where all oligos have balanced GC content (the proportion of bases G and C is between 45 and 55 percent) and the homopolymer run length is limited to 3. We obtain the corresponding FASTQ file, consisting of reads and quality scores, that contains only reverse-direction results from the single experiment. We wrote the data using Twist Bioscience synthesis and read it with the Illumina MiSeq Reagent v3 kit.
Given the FASTQ file, we match each read to the nearest oligo based on edit distance. After discarding noisy reads (i.e., reads containing Ns or whose distance to the nearest oligo is larger than 5), we obtain 15,126,429 reads. Note that a single oligo can be sequenced multiple times (see Figure 5). Then, the oligos (and the matched reads) are divided into two sets: the training dataset consists of 14,400 oligos and the corresponding 12,108,573 reads, and the test dataset consists of 3,600 oligos and the corresponding 2,999,656 reads. For insGAN, subGAN, and delGAN, we extract the error-contained sequence pairs (those with nonzero edit distance) from the training dataset and generate the training datasets for insGAN, subGAN, and delGAN as described in Figure 2. Also, we trimmed the aligned sequences to length 145, as the later part of the sequence is considered noisy. For qscoreGAN, we generate an error-free dataset for G_qs^(g) and an error-contained dataset for G_qs^(e). We then run the simulator with the oligos in the test dataset and compare the error statistics of the simulated data with those of the actual experiment.

FIGURE 6. Error rates of insertion, substitution, and deletion channels at each index.

FIGURE 7. Error rates of consecutive deletion errors at each index. Our simulator shows similar rates, which implies it learns high-order dependencies.

FIGURE 8. Comparison of error distribution (index of erroneous base) between generated reads using a single weight and using multiple weights.
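The read-to-oligo matching described in this subsection can be sketched as follows (our code; it uses the third-party editdistance package as one possible Levenshtein implementation, and a practical version would index the oligos rather than scanning all of them for each read):

import editdistance  # pip install editdistance

def match_read(read, oligos, max_dist=5):
    if 'N' in read:
        return None                                   # discard reads containing N
    dists = [editdistance.eval(read, o) for o in oligos]
    best = min(range(len(oligos)), key=dists.__getitem__)
    return best if dists[best] <= max_dist else None  # discard distant reads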
To train models with sequential data in batches, we need to match the lengths of the sequences. We introduce additional symbols: S and E represent the start and the end of a sequence, and the symbol P pads sequences to a common length. Then, we apply one-hot encoding, where the one-hot vector is an 8-dimensional vector (ACGT-SEP). For insGAN, subGAN, and delGAN, an additional uniformly distributed random number is appended to each one-hot vector to induce randomness in the simulator, as discussed in Section IV-A4. For qscoreGAN, the oligo and read sequences are aligned and then converted to one-hot encoded vectors. The encoded vectors are stacked, and we again add a random number to form a 17-dimensional vector.
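A sketch of this encoding (our code; the exact placement of the S/E/P symbols and the fixed target length are our assumptions):

import numpy as np

ALPHABET = {s: i for i, s in enumerate("ACGT-SEP")}

def encode(seq, target_len):
    seq = "S" + seq + "E"
    seq = seq + "P" * (target_len - len(seq))        # pad to a common length
    onehot = np.zeros((target_len, 8), dtype=np.float32)
    for i, ch in enumerate(seq[:target_len]):
        onehot[i, ALPHABET[ch]] = 1.0
    noise = np.random.rand(target_len, 1).astype(np.float32)
    return np.concatenate([onehot, noise], axis=1)   # (target_len, 9)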

B. TRAINING
We train the WGANs (insGAN, subGAN, delGAN, and qscoreGAN) with a gradient penalty. The batch size is 1024, and the learning rates of the generator and discriminator are 5 × 10^-4 for insGAN, 5 × 10^-4 for G_sub, 1 × 10^-4 for D_sub, and 10^-3 for delGAN and qscoreGAN. For the randomness of the read generator, we keep the trained weights from epochs 2500 to 2509 for insGAN, 500 to 509 for subGAN, and 1300 to 1309 for delGAN. From the training dataset, the estimated profile vector distribution is as follows: error-free pairs 84.253%, insertion only 0.945%, substitution only 10.646%, deletion only 3.572%, insertion with substitution 0.101%, insertion with deletion 0.039%, substitution with deletion 0.437%, and insertion with substitution and deletion 0.007%.

C. RESULTS
To the best of our knowledge, our work is the first attempt to design a DNA storage channel simulator, while other simulators focus only on DNA sequencing. Thus, in this section, we mainly provide the error statistics of the proposed simulator and compare them with the error statistics of actual data.

1) READ GENERATORS: ERROR STATISTICS
To evaluate the performance of the proposed simulator, we first investigate the error statistics of the reads. Figure 6-(a) presents the insertion error rates, i.e., the number of insertion errors at each index divided by the total number of reads with insertions. Similarly, Figure 6-(b) and Figure 6-(c) show the substitution and deletion error rates, respectively. Although the error probabilities are never given explicitly, the generated sequences follow the positional error distribution (first-order statistics).
We also analyze the consecutive insertion and deletion error rates to investigate the high-order statistics of the generator. Since insertion errors are rare, we focus on consecutive deletion errors. Figure 7 shows the rate of consecutive deletion errors (1-deletion to 4-deletions) starting at each index. Table 1 compares the ratios of consecutive errors for both insertion and deletion errors, which are averages of the per-index rates of consecutive errors. The results indicate that insertion and deletion errors do not occur independently; for example, the rate of 2-deletions is not the square of the rate of 1-deletions. The proposed generator mimics the consecutive error rates of the experiment, which cannot be achieved with predefined error probabilities.
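For reference, the consecutive-deletion statistic of Figure 7 can be computed from aligned reads as in the sketch below (our code; a "k-deletion at index i" is taken to mean an exact run of k placeholder symbols starting at i):

def consecutive_deletion_counts(aligned_reads, k):
    counts = {}
    for r in aligned_reads:
        for i in range(len(r) - k + 1):
            run = all(c == '-' for c in r[i:i + k])
            left_ok = i == 0 or r[i - 1] != '-'
            right_ok = i + k == len(r) or r[i + k] != '-'
            if run and left_ok and right_ok:
                counts[i] = counts.get(i, 0) + 1
    return counts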
We further count the number of errors for each base, as illustrated in Tables 2 and 3. In Table 2, we count the number of inserted bases among 33,158 oligo-read pairs from the test dataset and the number of deleted bases among 37,668 oligo-read pairs from the test dataset. In Table 3, we count the number of substituted bases among 19,170 oligo-read pairs from the test dataset. In the experiment, insertion errors for bases A, C, and T occur nearly uniformly, whereas base G is inserted more often than the other bases; deleted bases occur nearly equally. In the case of substitution errors, the frequency of each base error type varies widely: substitutions of base G to base T are the most frequent, and substitutions of base C to base G are the least frequent. The proposed simulator also captures these per-base error statistics.

2) READ GENERATORS: ERROR DISTRIBUTIONS
Since our goal is to simulate a noisy channel, the simulator's output should be random. More precisely, we expect our simulator to generate various sequences for the same input, with a distribution that matches the experiment. Recall that we add a random number for each input base and use an ensemble of generators (trained for different numbers of epochs) to induce probabilistic behavior. For the insertion generator G_ins, we use 11 generators with weights from epochs 2400 to 2410, and for the deletion generator G_del, we use 11 generators with weights from epochs 1290 to 1300. For the substitution generator G_sub, 11 generators are likewise used, with weights from epochs 500 to 510, owing to the faster convergence of the Transformer structure. To verify that the read simulator generates diverse outputs, we generate 25 reads from a fixed oligo. Table 4 and Figure 8 show the analysis of the reads simulated by the insertion generator G_ins, the substitution generator G_sub, and the deletion generator G_del. Table 4 shows that generators with different weights produce significantly different error patterns, especially for insertions.
We also investigate the error occurrences when the input is fixed to a given sequence. The comparison is between sequences generated using a single weight configuration and sequences generated using 10 different weights from 10 epochs. Figure 8 implies that generation with multiple weights yields more variability in error patterns. Moreover, Figure 9 shows the error statistics of the sequences generated from the fixed oligo, where we generated 25 sequences to test G_ins, 100 sequences to test G_sub, and 50 sequences to test G_del. Note that since we produce the output read sequences from a single input sequence, the exact error locations do not perfectly match, due to the random nature of the generator. However, the error statistics of the generated sequences are similar to the true error statistics (from an actual experiment) in terms of both error locations and bases.

3) QUALITY SCORE GENERATORS
Figure 10 shows the positional mean quality scores from the experiment and the simulation. In the experiment, the quality score decreases along the read except for the first few indices, and the generator mimics this behavior. Figure 11 shows that the output distributions of the quality score generators also match the experimental results. For error-free bases (where oligo and read coincide), the corresponding quality score is mostly 'G', which is the highest value. However, in the presence of errors, a certain portion of lower quality scores (such as ',' and '+') exists. Our quality score simulator captures this characteristic. Note that the original GAN often suffers from mode collapse, where the generator recovers only a single mode of the target distribution. However, thanks to WGAN-GP, we successfully recover both modes (one at 'G' and another around ',' and '+').

FIGURE 11. Distribution of the corresponding quality scores when a substitution error occurs.

4) APPLICATION OF ECC ALGORITHM
We apply an error-correcting code (ECC) to the simulated reads and to the actual experimental reads. More precisely, we apply the ECC proposed by Jeong et al. [7], which combines Hamming-distance-based clustering, Reed-Solomon (RS) based error correction, and quality-score-based decoding. The authors seek the minimum number of read samples for which the ECC algorithm perfectly recovers the oligo input. With experimental reads, the minimum number of reads was 86,000, where 200 out of 200 decoding trials succeeded. We also applied the same ECC technique to the simulated data from the same oligo inputs. The minimum number of generated reads was 84,000, where 10 out of 10 decoding trials succeeded.
Furthermore, Table 5 shows other statistics provided by the ECC algorithm, including cluster-related statistics. The ECC algorithm produces about the same number of clusters for both experimental and simulated data. Interestingly, the number of size-1 clusters is also identical, which is a crucial error factor in cluster-based ECC algorithms. These results imply that the proposed simulator produces reads similar to those of the experiment.

VI. CONCLUSION AND DISCUSSION
We presented a generative adversarial network framework for a DNA storage channel simulator. Our simulator generated sequences and quality scores with error statistics similar to those of an actual experiment. The proposed framework is purely data-driven and does not depend on specific sequencing or synthesis technologies. However, our model does have limitations. First, learning from Nanopore sequencing data may be challenging due to the training complexity of Transformer models on long sequences; we leave the development of a training-efficient network for longer sequences as crucial future work. Another interesting direction is managing GAN training. It is well known that GANs are extremely hard to train, and our framework also requires fine-tuned hyperparameters to train the generator successfully. We believe that a well-behaved GAN architecture would allow us to apply the technique in more challenging settings, including long sequence lengths, biased error probabilities, high-order dependencies, and massive amounts of data.

ACKNOWLEDGMENT
(Sanghoon Kang and Yunfei Gao contributed equally to this work.)