Abstract:
Short-read DNA sequencing instruments can yield over 10^12 bases per run, typically composed of reads 150 bases long. Despite this high throughput, de novo assembly algorithms have difficulty reconstructing contiguous genome sequences from short reads because of both repetitive and difficult-to-sequence regions in these genomes. Some of the short-read assembly challenges are mitigated by scaffolding assembled sequences using paired-end reads. However, unresolved sequences in these scaffolds appear as “gaps”. Here, we introduce GapPredict, a proof-of-concept implementation that uses a character-level language model to predict unresolved nucleotides in scaffold gaps. We benchmarked GapPredict against the state-of-the-art gap-filling tool Sealer and observed that GapPredict can fill 65.6% of the sampled gaps that Sealer left unfilled, with high similarity to the reference genome, demonstrating the practical utility of deep learning approaches to the gap-filling problem in genome assembly.
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics ( Volume: 18, Issue: 6, 01 Nov.-Dec. 2021)
- IEEE Keywords
- Index Terms
- Genome Assembly
- Language Model
- Draft Genome Assembly
- Reference Genome
- Deep Learning
- Short Reads
- Short-read Sequencing
- Deep Learning Approaches
- Contiguous Sequences
- Target Sequence
- Long Short-term Memory
- Flanking Sequences
- Forward Direction
- FASTA File
- Percent Cover
- Computational Biology
- Training Iterations
- Word Embedding
- Sequence Gaps
- Correct Percentage
- Assembly Gaps
- Beam Search
- Output Gap
- Percent Sequence Identity
- Validation Loss
- Entire Assembly
- Bottom Left Corner
- High Percentage