Abstract:
The growing volume of sequencing data and the ever-larger size of variant databases challenge genotyping procedures to handle massive genomics datasets efficiently. Recent alignment-free solutions rely exclusively on k-mer counts to speed up the analysis, but they must trade the time gain against memory requirements to make the computation feasible on a single workstation. In this paper, we present SparkGeno+, a novel alignment-free (AF) distributed pipeline for the fast and accurate genotyping of Single Nucleotide Polymorphisms (SNPs) and indels on a large scale. Starting from a previous pipeline, we identified and evaluated the performance bottlenecks that arise when genotyping with a standard AF approach, and then developed and implemented several innovations that better exploit the resources of a distributed system. The effectiveness of our proposal has been validated through an experimental analysis on widely studied datasets. The results show that the accuracy of SparkGeno+ matches that of state-of-the-art alignment-free tools such as Vargeno and MALVA. Moreover, the time performance of SparkGeno+ scales well with the number of computing units, allowing execution times whose order of growth is smaller than that of classical genotyping tools. This indicates that SparkGeno+ is a promising solution for large-scale genotyping applications.
Published in: IEEE Transactions on Computational Biology and Bioinformatics (Early Access)