Cluster analysis is widely applied to discover the function of previously unannotated genes. This paper presents a novel stratified beta-Gaussian mixture model, sBGMM, for clustering genes based on gene expression data, protein-DNA binding data, and data that can provide information for constructing priors, such as protein-protein interaction (PPI) data. An expectation-maximization (EM) type of algorithm for the beta mixture model is first developed and then combined with that of the Gaussian mixture model. The combined algorithm jointly estimates the parameters of both the beta and the Gaussian distributions and forms the core of the sBGMM method. The stratification in sBGMM takes the form of stratum-specific prior probabilities, which in this study are constructed from pre-clustering results obtained from PPI data. The proposed sBGMM method differs from other mixture-model-based methods in that it integrates two different data types into a single, unified probabilistic modeling framework and incorporates prior information from a third data source. Several well-studied model selection methods, namely the Akaike information criterion (AIC), modified AIC (AIC3), Bayesian information criterion (BIC), and integrated classification likelihood-BIC (ICL-BIC), are applied to estimate the number of clusters, and simulation results show that AIC3 works best for sBGMM. Simulations also indicate that combining two different data sources in a single mixture model can greatly improve clustering accuracy and stability, and that employing priors to stratify the model can further enhance its performance. The proposed method makes more efficient use of multiple data sources than methods that analyze the data sources separately.
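As an illustration of the model selection step, the four criteria named above can be computed from a fitted mixture model's maximized log-likelihood. The following is a minimal sketch using the standard textbook definitions, not the authors' implementation: the function names (`information_criteria`, `select_num_clusters`), the argument layout, and the example numbers are hypothetical, and the ICL-BIC term assumes the usual form of BIC plus twice the entropy of the posterior cluster memberships.

```python
import math


def information_criteria(log_lik, n_params, n_obs, resp=None):
    """Return AIC, AIC3, BIC and, if responsibilities are given, ICL-BIC.

    log_lik  : maximized log-likelihood of the fitted mixture model
    n_params : number of free parameters in the model
    n_obs    : number of observations (genes)
    resp     : optional list of rows; resp[i][k] is the posterior probability
               that observation i belongs to cluster k (hypothetical layout)
    Lower values indicate a preferred model for every criterion.
    """
    aic = -2.0 * log_lik + 2.0 * n_params
    aic3 = -2.0 * log_lik + 3.0 * n_params
    bic = -2.0 * log_lik + n_params * math.log(n_obs)
    scores = {"AIC": aic, "AIC3": aic3, "BIC": bic}
    if resp is not None:
        # Entropy of the soft assignments; zero entries contribute nothing.
        entropy = -sum(t * math.log(t) for row in resp for t in row if t > 0.0)
        scores["ICL-BIC"] = bic + 2.0 * entropy
    return scores


def select_num_clusters(fits, criterion="AIC3"):
    """Pick the cluster count K whose fit minimizes the chosen criterion.

    fits maps K -> (log_lik, n_params, n_obs); only AIC, AIC3 and BIC are
    available here because no responsibilities are passed.
    """
    return min(fits, key=lambda k: information_criteria(*fits[k])[criterion])


if __name__ == "__main__":
    # Made-up numbers for illustration only, not results from the paper.
    example_fits = {2: (-120.0, 5, 100), 3: (-100.0, 9, 100), 4: (-99.0, 13, 100)}
    print(select_num_clusters(example_fits, criterion="AIC3"))
```

In practice the model would be fitted once for each candidate number of clusters, and the count with the lowest value of the chosen criterion would be retained.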