Rapid Sequence Homology Assessment by Subsampling the Genome Space Using Difference Sets

Author: Andrzej K. Brodzik, The MITRE Corporation, Bedford, MA, USA

Availability of DNA data is growing roughly at the rate specified by Moore's law. In many molecular biology applications this data must be compared with a reference sequence, either to establish similarity of genomes or to identify functionally homologous subsequences. Current approaches based on pairwise sequence alignments are computationally expensive and often data-dependent. To ameliorate this problem, alternative, less complex sequence comparison schemes, designed to capture the essential features of genomes, must be explored. In this work, a new sequence comparison approach, based on difference set models, is proposed. These models are conceptually appropriate, as they quantify, in a certain sense, two key genome attributes: sequence complexity and symbol repetition. Moreover, it is shown that difference sets are abundant in bacterial genomes and that they coincide with homologous sequence segments. These findings motivate the construction of compact representations of DNA sequences in the difference set space. An alignment of these representations permits computationally efficient identification of differences between the DNA sequences. To illustrate the efficacy of the difference set approach, characterization of indels in closely related Bacillus anthracis strains is performed, resulting in the discovery of two previously unreported collections of polymorphisms. In addition to these results, an open problem of extending the difference set approach to difference set and almost difference set families, for the analysis of more distant DNA sequences, is discussed.
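
The abstract does not define a difference set. As background only, the sketch below checks the standard combinatorial definition of a cyclic difference set: a subset D of the integers modulo v in which every nonzero residue occurs the same number of times (λ) as a difference of two distinct elements of D. The function names are hypothetical, and the sketch makes no attempt to reproduce how the paper maps DNA sequences onto difference sets, which the abstract does not describe.

```python
from collections import Counter


def difference_counts(subset, modulus):
    """Count how often each residue arises as (a - b) mod modulus, a != b."""
    counts = Counter()
    for a in subset:
        for b in subset:
            if a != b:
                counts[(a - b) % modulus] += 1
    return counts


def difference_set_lambda(subset, modulus):
    """Return lambda if subset is a cyclic difference set in Z_modulus, else None.

    Illustrative helper (not from the paper): the subset qualifies when every
    nonzero residue appears the same positive number of times as a difference.
    """
    elements = {x % modulus for x in subset}
    counts = difference_counts(elements, modulus)
    values = {counts[r] for r in range(1, modulus)}
    if len(values) == 1:
        lam = values.pop()
        return lam if lam > 0 else None
    return None


if __name__ == "__main__":
    # {1, 2, 4} in Z_7 is the classical (7, 3, 1) difference set.
    print(difference_set_lambda({1, 2, 4}, 7))
    # {0, 1, 2} in Z_7 is not a difference set.
    print(difference_set_lambda({0, 1, 2}, 7))
```

The brute-force check costs O(k²) for a subset of size k, which is adequate for small illustrative examples such as these.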

Published in:

IEEE Transactions on Information Theory (Volume: 56, Issue: 2)