sBGMM: A Stratified Beta-Gaussian Mixture Model for Clustering Genes with Multiple Data Sources

Authors:

Xiaofeng Dai (Dept. of Signal Process., Tampere Univ. of Technol., Tampere); Harri Lähdesmäki; Olli Yli-Harja

Cluster analysis is widely used to infer the function of previously unannotated genes. This paper presents a novel stratified beta-Gaussian mixture model, sBGMM, for clustering genes based on gene expression data, protein-DNA binding data, and data that can be used to construct priors, such as protein-protein interaction (PPI) data. An expectation-maximization (EM) type algorithm for the beta mixture model is first developed and then combined with that of the Gaussian mixture model. The combined algorithm jointly estimates the parameters of both the beta and Gaussian distributions and forms the core of the sBGMM method. The stratification in sBGMM takes the form of stratum-specific prior probabilities, which in this study are constructed from pre-clustering results obtained from PPI data. The proposed method differs from other mixture-model-based methods in that it integrates two different data types into a single, unified probabilistic modeling framework and incorporates prior information from a third data source. Several well-studied model selection criteria, such as the Akaike information criterion (AIC), modified AIC (AIC3), Bayesian information criterion (BIC), and integrated classification likelihood-BIC (ICL-BIC), are applied to estimate the number of clusters, and simulation results show that AIC3 works best for sBGMM. Simulations also indicate that combining two different data sources in a single mixture model can greatly improve clustering accuracy and stability, and that employing priors to stratify the model can further enhance its performance. The proposed method makes more efficient use of multiple data sources than methods that analyze the data sources separately.
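As a rough illustration of the model selection step mentioned in the abstract, the sketch below computes AIC, AIC3, BIC, and ICL-BIC from a fitted mixture's maximized log-likelihood, using the standard textbook definitions of these criteria. It is not taken from the paper: the function names `selection_criteria` and `best_k`, the argument layout, and the numbers in the usage lines are all assumptions made for illustration, and the paper's exact parameter counting for sBGMM may differ.

```python
import math


def selection_criteria(log_lik, n_params, n_obs, resp=None):
    """Standard information criteria for a fitted mixture model (lower is better).

    log_lik  : maximized log-likelihood of the fitted model
    n_params : number of free parameters in the model
    n_obs    : number of observations (here, genes)
    resp     : optional posterior membership probabilities, one row per
               observation and one column per cluster (needed for ICL-BIC)
    """
    scores = {
        "AIC": -2.0 * log_lik + 2.0 * n_params,
        "AIC3": -2.0 * log_lik + 3.0 * n_params,
        "BIC": -2.0 * log_lik + n_params * math.log(n_obs),
    }
    if resp is not None:
        # Classification entropy of the soft assignments; zero entries
        # contribute nothing (the limit of t * log t as t -> 0 is 0).
        entropy = -sum(t * math.log(t) for row in resp for t in row if t > 0.0)
        scores["ICL-BIC"] = scores["BIC"] + 2.0 * entropy
    return scores


def best_k(candidates, criterion="AIC3"):
    """Pick the cluster count with the lowest score.

    candidates maps a cluster count k to a (log_lik, n_params, n_obs) tuple.
    """
    return min(
        candidates,
        key=lambda k: selection_criteria(*candidates[k])[criterion],
    )


# Hypothetical fits for k = 2, 3, 4 clusters; the values are invented.
fits = {2: (-120.0, 9, 50), 3: (-100.0, 14, 50), 4: (-98.0, 19, 50)}
chosen = best_k(fits, criterion="AIC3")
```

In practice the log-likelihood and membership probabilities would come from the final EM iteration for each candidate number of clusters, and the candidate minimizing the chosen criterion would be kept.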

Published in:

Biocomputation, Bioinformatics, and Biomedical Technologies, 2008. BIOTECHNO '08. International Conference on

Date of Conference:

June 29 2008-July 5 2008