Biological research is becoming increasingly database driven, motivated in part by the advent of large-scale functional genomics and proteomics experiments, such as those that comprehensively measure gene expression. These experiments provide a wealth of information on each of the thousands of proteins encoded by a genome. Consequently, a challenge in bioinformatics is to integrate databases so that this disparate information can be connected, and to perform large-scale studies that collectively analyze many different data sets. This approach represents a paradigm shift away from traditional single-gene biology, and it often involves statistical analyses focusing on the occurrence of particular features (e.g., folds, functions, interactions, pseudogenes, or localization) in a large population of proteins. Moreover, machine learning techniques can be applied explicitly to discover trends and patterns in the underlying data. In this article, we give several examples of these techniques in a genomic context: clustering methods to organize microarray expression data, support vector machines to predict protein function, Bayesian networks to predict subcellular localization, and decision trees to optimize target selection for high-throughput proteomics.