Abstract:
Counting the frequency of every distinct substring of length k in sequence reads is an initial step in many bioinformatics applications, such as genome assembly, error correction of sequencing reads, fast multiple sequence alignment, and repeat detection. This problem is known as the k-mer counting problem. Although k-mer counting looks simple, when the input read dataset is massive and the number of k-mers grows, single-node k-mer counting tools can exhaust the memory and disk capacity of a single computer. Hadoop is one of the scalable, parallel big data frameworks for data-intensive applications, used to process large data sets on a cluster of computers built from low-cost commodity hardware. In this paper, a Hadoop-based k-mer counter with Minimum Substring Partitioning (HMSPKmerCounter) is developed and compared with the k-mer counting program of BioPig, the first Hadoop-based k-mer counter, and with KMC3, a recent single-node multithreaded k-mer counting tool. Our results show that HMSPKmerCounter outperforms the BioPig k-mer counting program for k = 28, 40, 55, and 65. The results also show that our implementation outperforms KMC3 as k increases.
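To make the problem statement concrete, the following is a minimal single-machine sketch of k-mer counting and of the partitioning idea behind Minimum Substring Partitioning. It is an illustration under stated assumptions, not the HMSPKmerCounter implementation: the function names and the parameter `p` (length of the minimum substring) are hypothetical, reverse complements are ignored, and no Hadoop code is shown. Each k-mer is assigned to a partition keyed by its lexicographically smallest length-p substring, so partitions can be counted independently and merged afterwards.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every distinct length-k substring across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def minimum_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def count_kmers_partitioned(reads, k, p):
    """Assign each k-mer to the partition named by its minimum substring,
    keeping a separate counter per partition (illustrative only)."""
    partitions = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            partitions[minimum_substring(kmer, p)][kmer] += 1
    return partitions


def merge_partitions(partitions):
    """Combine per-partition counters into one overall counter."""
    merged = Counter()
    for counter in partitions.values():
        merged.update(counter)
    return merged


if __name__ == "__main__":
    reads = ["ACGTACGT"]
    print(count_kmers(reads, 4))
    print(dict(count_kmers_partitioned(reads, 4, 2)))
```

Because every occurrence of a given k-mer has the same minimum substring, each distinct k-mer lands in exactly one partition. This is what would allow partitions to be processed on separate nodes without exchanging counts; how the paper distributes that work across a Hadoop cluster is not described in the abstract.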
Date of Conference: 26-28 October 2018
Date Added to IEEE Xplore: 25 July 2019