By Topic

CUBIC: identification of regulatory binding sites through data clustering

Sign In

Cookies must be enabled to login.After enabling cookies , please use refresh or reload or ctrl+f5 on the browser for the login options.

Formats Non-Member Member
$31 $13
Learn how you can qualify for the best price for this item!
Become an IEEE Member or Subscribe to
IEEE Xplore for exclusive pricing!
close button

puzzle piece

IEEE membership options for an individual and IEEE Xplore subscriptions for an organization offer the most affordable access to essential journal articles, conference papers, standards, eBooks, and eLearning courses.

Learn more about:

IEEE membership

IEEE Xplore subscriptions

3 Author(s)
Olman, V. ; Protein Informatics Group, Oak Ridge Nat. Lab., TN, USA ; Xu, D. ; Xu, Y.

Transcription factor binding sites are short fragments in the upstream regions of genes, to which transcription factors bind to regulate the transcription of genes into mRNA. Computational identification of transcription factor binding sites remains an unsolved challenging problem though a great amount of effort has been put into the study of this problem. We have recently developed a novel technique for identification of binding sites from a set of upstream regions of genes, that could possibly be transcriptionally coregulated and hence might share similar transcription factor binding sites. By utilizing two key features of such binding sites (i.e., their high sequence similarities and their relatively high frequencies compared to other sequence fragments), we have formulated this problem as a cluster identification problem. That is to identify and extract data clusters from a noisy background. While the classical data clustering problem (partitioning a data set into clusters sharing common or similar features) has been extensively studied, there is no general algorithm for solving the problem of identifying data clusters from a noisy background. In this paper, we present a novel algorithm for solving such a problem. We have proved that a cluster identification problem, under our definition, can be rigorously and efficiently solved through searching for substrings with special properties in a linear sequence. We have also developed a method for assessing the statistical significance of each identified cluster, which can be used to rule out accidental data clusters. We have implemented the cluster identification algorithm and the statistical significance analysis method as a computer software CUBIC. Extensive testing on CUBIC has been carried out. We present here a few applications of CUBIC on challenging cases of binding site identification.

Published in:

Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE

Date of Conference:

11-14 Aug. 2003