Skip to Main Content
This paper presents an alignment-free technique to efficiently discover similar regions in large sets of biological sequences using position sensitive p-mer frequency clustering. A set of sequences is broken down into segment and then a frequency distribution over all oligomers of size p (referred to as p-mers) is obtained to summarize each segment. These summaries are clustered while the order of segments in the set of sequences is preserved in a Markov-type model. Sequence segments within each cluster have very similar DNA/RNA patterns and form a so called quasi-alignment. This fact can be used for a variety of tasks such as species characterization and identification, phylogenetic analysis, functional analysis of sequences and, as in this paper, for discovering conserved regions. Our method is computationally more efficient than multiple sequences alignment since it can apply modern data stream clustering algorithms which run in time linear in the number of segments and thus can help discover highly similar regions across a large number of sequences efficiently. In this paper, we apply the approach to efficiently discover and visualize conserved regions in 16S rRNA.