Skip to Main Content
The next-generation sequencing technologies dramatically accelerate the throughput of DNA sequencing in a much faster rate than the growth rate of computer speed as predicted by the "Moore's Law." It is a problem even to load and run these sequencing data in memory. There is an urgent need for de novo assemblers to efficiently handle the huge amount of sequencing data using scalable commodity servers in the clouds. In this paper, we present CloudBrush, a parallel algorithm that runs on the MapReduce framework of cloud computing for de novo assembly of high-throughput sequencing data. The algorithm uses Myers's bi-directed string graphs as its basis and consists of two main stages: graph construction and graph simplification. First, a vertex is defined for each non-redundant sequence read. We present a prefix-and-extend algorithm to identify overlaps between a pair of reads and to reduce transitive edges. The graph is further simplified by using conventional operations including path compression, tip removal and bubble removal. We also present a new operation, Similar Neighbour Edge Adjustment, to remove error topology structures in string graphs. Besides, we also disconnect repeat regions by revised A-statistics. The goal is to partition the string graph so that all paths in each connected subgraph correspond to similar subsequences of the underlying genome. We then traverse each connected subgraph to find a long path supported by a sufficient amount of reads to represent the subgraph. Preliminary results show that the CloudBrush assembler, compared with Contrail and Edena on the sequencing data of E. coli genomes, may yield longer contigs.