Gene expression is modulated by transcription factors (TFs), proteins that generally bind to DNA adjacent to coding regions and initiate transcription. Each target gene can be regulated by more than one TF, and each TF can regulate many targets. For a complete molecular understanding of transcriptional regulation, researchers must first associate each TF with the set of genes that it regulates. Here we summarize completed work on associating 104 TFs with their binding sites using support vector machines (SVMs), which are classification algorithms grounded in statistical learning theory. We train classifiers on several types of genomic data to predict TF binding in the yeast genome: motif matches, subsequence counts, motif conservation, functional annotation, and expression profiles. A simple weighting scheme varies the contribution of each type of genomic data when building a final SVM classifier, which we evaluate using known binding sites published in the literature and in online databases. The SVM algorithm works best when all datasets are combined, producing 73% coverage of known interactions, with a prediction accuracy of almost 0.9. We also discuss new ideas and preliminary work for improving SVM classification of biological data.
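The abstract does not say how the weighting scheme is implemented. One common way to vary the contribution of several data types in an SVM is to compute a kernel matrix per data type and take a weighted sum, and the minimal sketch below illustrates that idea under this assumption. The function names, the toy feature values, and the weights are all hypothetical and are not taken from the article.

```python
def linear_kernel(features):
    """Return the Gram matrix of dot products between all pairs of rows."""
    return [[sum(a * b for a, b in zip(x, y)) for y in features] for x in features]


def combine_kernels(kernels, weights):
    """Return the weighted sum of equally sized square kernel matrices."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    n = len(kernels[0])
    combined = [[0.0] * n for _ in range(n)]
    for kernel, weight in zip(kernels, weights):
        for i in range(n):
            for j in range(n):
                combined[i][j] += weight * kernel[i][j]
    return combined


# Hypothetical toy data: three candidate target genes for one TF,
# described by two data types (values invented for illustration).
motif_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
expression_features = [[0.5], [0.5], [1.0]]

kernels = [linear_kernel(motif_features), linear_kernel(expression_features)]

# Assumed weights: a larger weight gives that data type more influence.
combined = combine_kernels(kernels, weights=[0.75, 0.25])
```

A combined matrix of this kind could then be handed to any SVM implementation that accepts a precomputed kernel. With non-negative weights, a weighted sum of valid kernels is itself a valid kernel.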
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.