In this paper, we present a novel unsupervised approach to protein sequence clustering that incorporates functional domain information into the clustering process. In the proposed framework, the domain boundaries predicted by the ProDom database are used to obtain a better measure of sequence similarity. The kernel of the framework is an unsupervised two-phase clustering algorithm: a hierarchical clustering in the first phase pre-clusters the protein sequences, and a partitioning clustering in the second phase refines the results. More specifically, in the first phase we perform agglomerative hierarchical clustering on the protein sequences to obtain initial clusters for the subsequent partitioning step, and then build a profile Hidden Markov Model (HMM) for each cluster to serve as its centroid. In the second phase, HMM-based k-means clustering refines the clusters into protein families. The experimental results show that our model is effective and efficient in clustering protein families.
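The two-phase structure described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several simplifying assumptions: a Jaccard distance over 2-mers stands in for the domain-aware similarity derived from ProDom boundaries, average linkage is assumed for the agglomerative phase, and a cluster medoid stands in for the profile HMM that serves as the centroid. All function names (`distance`, `agglomerative`, `medoid`, `refine`, `two_phase_cluster`) are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=2):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distance(a, b, k=2):
    """Jaccard distance over k-mers (assumed stand-in for a domain-aware score)."""
    sa, sb = kmer_set(a, k), kmer_set(b, k)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def agglomerative(seqs, n_clusters):
    """Phase 1: average-linkage agglomerative clustering down to n_clusters."""
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > n_clusters:
        best = None
        for x, y in combinations(range(len(clusters)), 2):
            # Average pairwise distance between the two candidate clusters.
            d = sum(distance(seqs[i], seqs[j])
                    for i in clusters[x] for j in clusters[y])
            d /= len(clusters[x]) * len(clusters[y])
            if best is None or d < best[0]:
                best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


def medoid(seqs, members):
    """Pick the member with the smallest total distance to the others.

    Assumption: the medoid replaces the profile HMM as the cluster centroid.
    """
    return min(members,
               key=lambda i: sum(distance(seqs[i], seqs[j]) for j in members))


def refine(seqs, clusters, max_iter=20):
    """Phase 2: k-means-style reassignment of sequences to the nearest centroid."""
    for _ in range(max_iter):
        centers = [medoid(seqs, c) for c in clusters]
        new = [[] for _ in centers]
        for i in range(len(seqs)):
            nearest = min(range(len(centers)),
                          key=lambda c: distance(seqs[i], seqs[centers[c]]))
            new[nearest].append(i)
        new = [c for c in new if c]  # drop clusters that became empty
        if sorted(map(sorted, new)) == sorted(map(sorted, clusters)):
            break  # assignments are stable
        clusters = new
    return clusters


def two_phase_cluster(seqs, n_clusters):
    """Run phase 1 (hierarchical) and then phase 2 (partitioning refinement)."""
    return refine(seqs, agglomerative(seqs, n_clusters))


if __name__ == "__main__":
    toy = ["AAAAAA", "AAAAAC", "AAAACA", "GGGGGG", "GGGGGT", "GGGTGG"]
    print(two_phase_cluster(toy, 2))
```

On the toy input, the A-rich and G-rich sequences share no 2-mers, so they should end up in separate clusters of sequence indices. In a setting closer to the paper, the medoid would be replaced by a profile HMM trained on each cluster, and reassignment would use each sequence's score against every HMM.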
Date of Conference: 13-15 July 2008