Biological sequences can contain regions of unusual composition, e.g., proteins contain DNA binding domains, transmembrane regions, and charged regions. The linear-time Ruzzo-Tompa algorithm finds such regions: it takes a sequence of scores as input and outputs the corresponding “maximal segments”, i.e., contiguous, disjoint subsequences having the greatest total scores. Just as gaps improved the sensitivity of BLAST searches, they might also improve the sensitivity of searches for regions of unusual composition. Accordingly, we generalize the Ruzzo-Tompa algorithm from sequences of scores to paths in weighted, directed graphs on a one-dimensional lattice. Within the generalization, unfavorable scores can be deleted from contiguous, disjoint subsequences by paying a penalty, and the Ruzzo-Tompa algorithm can then find gapped subsequences having the greatest total gapped scores. An application to finding gapped inexact repeats in biological sequences illustrates some of these concepts.
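For orientation, the following is a minimal sketch of the original, ungapped Ruzzo-Tompa procedure on a plain sequence of scores; it does not implement the graph generalization or the gap penalties described above. The function name `maximal_segments` and the half-open `(start, end, score)` output format are assumptions made for illustration. The sketch searches the candidate list with a plain backward scan, so it is simpler than the published algorithm and does not by itself attain the linear-time bound.

```python
def maximal_segments(scores):
    """Return maximal-scoring segments as (start, end, score), end exclusive.

    Illustrative sketch of the ungapped Ruzzo-Tompa procedure. Each candidate
    stores L (cumulative total before its first score) and R (cumulative
    total through its last score).
    """
    segments = []  # candidates as [start, end, L, R], in left-to-right order
    total = 0
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a candidate
        cand = [i, i + 1, left, total]
        while True:
            # Find the rightmost earlier candidate j with L_j < L of cand.
            j = len(segments) - 1
            while j >= 0 and segments[j][2] >= cand[2]:
                j -= 1
            if j < 0 or segments[j][3] >= cand[3]:
                # No such j, or j already reaches at least as high: keep cand.
                segments.append(cand)
                break
            # Otherwise extend cand leftwards over j and everything after it,
            # then reconsider the enlarged candidate.
            cand = [segments[j][0], cand[1], segments[j][2], cand[3]]
            del segments[j:]
    return [(a, b, r - l) for a, b, l, r in segments]
```

As a usage example, `maximal_segments([4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5])` yields three segments covering indices 0 to 1, 2 to 3, and 4 to 11 (end exclusive), with total scores 4, 3, and 7.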
Published in:
Computational Advances in Bio and Medical Sciences (ICCABS), 2012 IEEE 2nd International Conference on
Date of Conference: 23-25 Feb. 2012