By Topic

Optimizing Sequence Alignment in Cloud Using Hadoop and MPP Database

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

4 Author(s)
Vijayakumar, S. ; ITPB, TATA Consultancy Services Ltd., Bangalore, India ; Bhargavi, A. ; Praseeda, U. ; Ahamed, S.A.

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. This information can effectively be used for medical and biological research only if one can extract functional insight from it. To obtain functional insight the factors to be considered while aligning sequences are: optimized querying of sequences, high speed matching and accuracy of alignment. The FAST-All (FASTA) for both proteins and nucleotides program considers all these factors and follows a largely heuristic method, which contributes to the high speed of its execution. The program initially observes the pattern of word hits, word-to-word matches of a given length, and marks potential matches rather than performing a more time-consuming, optimized search using a Smith-Waterman type of algorithm. In this paper, we propose an optimized approach to sequence alignment using FASTA algorithm, which incorporates high speed word-to-word matching. In the current scenario where data growth is in petabytes a day and processing requires state of the art technologies, Greenplum Massively Parallel Processing (MPP) database and Hadoop are emerging parallel technologies which form the backbone of this proposal. The complex nature of the algorithm, coupled with data and computational parallelism of Hadoop grid and Massively Parallel Processing database for querying from big datasets containing petabytes of sequences, improves the accuracy, speed of sequence alignment and optimizes querying from big datasets. Bioinformatics labs and centers across the globe today upload enormous amount of data and sequences in a central location for the scientific analysis. The transfer of such large datasets can also be simplified with Cloud approaches. So, Cloud Computing Technology is used in our implementation for the ease of gathering such sequences an- data from various sources like medical research centers, scientists and biomedical labs around the globe. A plan for the final "publicly consumable" form of the program is to make it web-based and running on the Cloud.

Published in:

Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on

Date of Conference:

24-29 June 2012