HMSPKmerCounter: Hadoop based Parallel, Scalable, Distributed Kmer Counter for Large Datasets | IEEE Conference Publication | IEEE Xplore

Abstract:

Counting the frequency of every distinct substring of length k in sequence reads is an initial step in many bioinformatics applications, such as genome assembly, error correction of sequencing reads, fast multiple sequence alignment, and repeat detection. This is known as the k-mer counting problem. Although k-mer counting looks simple, when the input read dataset is massive and the number of k-mers grows, single-node k-mer counting tools can exhaust the memory and disk capacity of a single computer. Hadoop is a scalable, parallel big data framework for data-intensive applications that processes large datasets on a cluster of low-cost commodity hardware. In this paper, a Hadoop-based k-mer counter using the Minimum Substring Partitioning method (HMSPKmerCounter) is developed and compared with the k-mer counting program of BioPig, the first Hadoop-based k-mer counter, and with KMC3, a recent single-node multithreaded k-mer counting tool. Our results show that HMSPKmerCounter outperforms the BioPig k-mer counting program for k = 28, 40, 55, and 65. The results also show that our implementation outperforms KMC3 as k increases.
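
To make the problem statement concrete, the following is a minimal single-machine sketch in Python, not the paper's Hadoop implementation. It counts k-mers directly, then shows the general idea behind minimum substring partitioning: each k-mer is routed to a partition chosen by its lexicographically smallest length-p substring, so identical k-mers always land in the same partition and each partition can be counted independently. The function names, the parameter p, the CRC32-based partition choice, and the sample reads are all assumptions made for illustration. The sketch also ignores reverse complements and does not merge adjacent k-mers into longer substrings, both of which practical tools typically handle.

```python
import zlib
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring across all reads (naive baseline)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def minimum_substring(kmer, p):
    """Return the lexicographically smallest length-p substring of a k-mer."""
    return min(kmer[i:i + p] for i in range(len(kmer) - p + 1))


def partition_and_count(reads, k, p, num_partitions):
    """Count k-mers per partition, keyed by each k-mer's minimum p-substring.

    Illustrative assumption: the partition index is a CRC32 of the
    minimum substring, taken modulo the number of partitions.
    """
    partitions = [Counter() for _ in range(num_partitions)]
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            key = minimum_substring(kmer, p).encode("ascii")
            partitions[zlib.crc32(key) % num_partitions][kmer] += 1
    return partitions


# Hypothetical sample reads, chosen only for illustration.
reads = ["ACGTACGT", "CGTACG"]
total = count_kmers(reads, k=4)
parts = partition_and_count(reads, k=4, p=2, num_partitions=3)
merged = sum(parts, Counter())
```

Because a k-mer's minimum substring depends only on the k-mer itself, no k-mer is split across partitions, and summing the per-partition counters reproduces the global counts. That property is what lets each partition be processed by a separate worker with bounded memory.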
Date of Conference: 26-28 October 2018
Date Added to IEEE Xplore: 25 July 2019
Conference Location: Allahabad, India
