De Novo Assembly of High-Throughput Sequencing Data with Cloud Computing and New Operations on String Graphs

4 Author(s)
Yu-Jung Chang ; Inst. of Inf. Sci., Acad. Sinica, Taipei, Taiwan ; Chien-Chih Chen ; Jan-Ming Ho ; Chuen-Liang Chen

Next-generation sequencing technologies have increased the throughput of DNA sequencing at a rate much faster than the growth in computer speed predicted by Moore's Law. Even loading these sequencing data into memory and processing them there is a problem. There is an urgent need for de novo assemblers that can efficiently handle huge amounts of sequencing data on scalable commodity servers in the cloud. In this paper, we present CloudBrush, a parallel algorithm that runs on the MapReduce framework of cloud computing for de novo assembly of high-throughput sequencing data. The algorithm is based on Myers's bi-directed string graphs and consists of two main stages: graph construction and graph simplification. First, a vertex is defined for each non-redundant sequence read. We present a prefix-and-extend algorithm to identify overlaps between pairs of reads and to reduce transitive edges. The graph is then further simplified using conventional operations, including path compression, tip removal, and bubble removal. We also present a new operation, Similar Neighbour Edge Adjustment, to remove erroneous topological structures from string graphs. In addition, we disconnect repeat regions using revised A-statistics. The goal is to partition the string graph so that all paths in each connected subgraph correspond to similar subsequences of the underlying genome. We then traverse each connected subgraph to find a long path, supported by a sufficient number of reads, to represent the subgraph. Preliminary results show that the CloudBrush assembler, compared with Contrail and Edena on sequencing data from E. coli genomes, may yield longer contigs.
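To make the two stages named in the abstract concrete, the following is a minimal single-machine sketch of overlap-graph construction, transitive edge reduction, and path compression on a toy set of error-free reads. It is not the CloudBrush implementation: it does not use MapReduce, it does not implement the prefix-and-extend algorithm, and it handles only the forward strand rather than Myers's bi-directed graphs. The function names (overlap_len, build_overlap_graph, remove_transitive_edges, compress_paths) and the brute-force all-pairs comparison are assumptions made purely for illustration.

```python
from itertools import permutations


def overlap_len(a, b, min_len):
    """Longest proper suffix of `a` equal to a prefix of `b`, or 0 if shorter than min_len."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def build_overlap_graph(reads, min_len):
    """One vertex per distinct read; edge a -> b is labelled with the overlap length."""
    reads = list(dict.fromkeys(reads))  # drop duplicate reads, keep input order
    graph = {r: {} for r in reads}
    # Brute-force all ordered pairs (illustrative only; quadratic in the number of reads).
    for a, b in permutations(reads, 2):
        k = overlap_len(a, b, min_len)
        if k:
            graph[a][b] = k
    return graph


def remove_transitive_edges(graph):
    """Drop edge a -> c whenever edges a -> b and b -> c both exist (simplified rule)."""
    reduced = {a: dict(nbrs) for a, nbrs in graph.items()}
    for a, nbrs in graph.items():
        for b in nbrs:
            for c in graph[b]:
                reduced[a].pop(c, None)
    return reduced


def compress_paths(graph):
    """Merge unbranched chains of vertices into contig strings."""
    preds = {v: [] for v in graph}
    for a, nbrs in graph.items():
        for b in nbrs:
            preds[b].append(a)

    def starts_chain(v):
        # A vertex starts a chain unless its single predecessor leads only to it.
        return len(preds[v]) != 1 or len(graph[preds[v][0]]) != 1

    contigs = []
    for v in graph:
        if not starts_chain(v):
            continue
        contig, cur = v, v
        while len(graph[cur]) == 1:
            nxt, k = next(iter(graph[cur].items()))
            if len(preds[nxt]) != 1:
                break
            contig += nxt[k:]  # append only the non-overlapping part of the next read
            cur = nxt
        contigs.append(contig)
    return contigs


# Toy reads sampled every 2 bases from the made-up sequence "ATGCGTACCTGA".
reads = ["ATGCGT", "GCGTAC", "GTACCT", "ACCTGA"]
graph = remove_transitive_edges(build_overlap_graph(reads, min_len=2))
print(compress_paths(graph))  # ['ATGCGTACCTGA']
```

In this toy case the edges "ATGCGT" -> "GTACCT" and "GCGTAC" -> "ACCTGA" are removed as transitive, leaving a single chain that compresses into one contig. A realistic assembler would additionally need reverse-complement handling, error-tolerant simplification such as tip and bubble removal, and repeat detection, none of which this sketch attempts.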

Published in: 2012 IEEE 5th International Conference on Cloud Computing (CLOUD)

Date of Conference: 24-29 June 2012