Skip to Main Content
In this article, we present a novel algorithm for measuring protein similarity based on their three dimensional structure (protein tertiary structure). The PROSIMA algorithm using suffix tress for discovering common parts of main-chains of all proteins appearing in current NCSB protein data bank (PDB). By identifying these common parts we build a vector model and next use classical information retrieval tasks based on the vector model to measure the similarity between proteins - all to all protein similarity. For the calculation of protein similarity we are using tf-idf term weighing schema and cosine similarity measure. The goal of this work to use the whole current PDB database (downloaded on June 2009) of known proteins, not just some kinds of selections of this database, which have been studied in other works. We have chose the SCOP database for verification of precision of our algorithm because it is maintained primarily by humans. The next success of this work is to be able to determine protein SCOP categories of proteins not included in the latest version of the SCOP database (v. 1.75) with nearly 100% precision.