Generating high-quality gene clusters and identifying the biological mechanisms underlying them are two important goals of clustering-based gene expression analysis. To obtain high-quality clusters, most current approaches rely on choosing the clustering algorithm whose design biases and assumptions best match the underlying distribution of the data set. This approach has two problems: (1) the underlying distribution of a gene expression data set is usually unknown, and (2) so many clustering algorithms are available that choosing the proper one is very challenging. To provide a textual summary of gene clusters, the most explored approach is extractive summarization, which essentially builds on techniques borrowed from information retrieval, where the objective is to provide terms for query expansion rather than a stand-alone summary of an entire document set. A further drawback is that clustering quality and cluster interpretation are treated as two isolated research problems and studied separately. In fact, the two are closely related and must be addressed in a coherent, unified way: relatively high-quality clusters are needed first in order to obtain a correct, informative biological explanation of a gene cluster; otherwise the explanation will be incorrect or misleading, no matter how good or robust the text summarization technique is. Based on this consideration, we design and develop a unified system, GE-Miner (gene expression miner), that addresses these issues in a principled and general manner by integrating cluster ensembles and text summarization, and that provides an environment for comprehensive gene expression data analysis. Experimental results demonstrate that our system obtains high-quality clusters and provides concise, informative textual summaries of the gene clusters.
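To make the cluster ensemble idea concrete, the following is a minimal sketch of one common way to combine several clusterings of the same genes: a co-association (evidence accumulation) consensus. The abstract does not say which ensemble method GE-Miner uses, so the technique, the function names `co_association` and `consensus_clusters`, the `threshold` parameter, and the toy labelings are all assumptions made for illustration.

```python
from itertools import combinations


def co_association(labelings):
    """Return an n x n matrix whose (i, j) entry is the fraction of base
    clusterings that place gene i and gene j in the same cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(labelings)
    return [[c / m for c in row] for row in counts]


def consensus_clusters(labelings, threshold=0.5):
    """Hypothetical consensus step: link two genes when they are co-clustered
    in more than `threshold` of the base clusterings, then return the
    connected components as consensus cluster labels."""
    n = len(labelings[0])
    co = co_association(labelings)
    parent = list(range(n))  # union-find forest over gene indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Toy example: three base clusterings of five genes. Cluster ids are
# arbitrary within each run, so only co-membership matters.
runs = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
]
consensus = consensus_clusters(runs)
```

In this toy case the third gene is grouped with the first two in only one of three runs but with the last two in two of three, so the consensus assigns it to the latter group. The appeal of such a scheme is that no single algorithm has to match the unknown data distribution; agreement across runs decides the final grouping.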