With the advancement of biological techniques, research in marine evolution, ecology, and aquaculture has grown explosively in both volume and diversity. Tens of thousands of genomic sequences are now available for important marine species. However, most of their structures and corresponding functions remain unresolved. To discover the biological characteristics of the genomic sequences of a marine species, an efficient and effective method for detecting distantly related proteins, based on experimentally determined functions from model species, is an important strategy. In this study, the Ensembl and NCBI genetic databases were used to build a preliminary database of selected marine species. The resulting system contains abundant DNA, RNA, and protein information and is named the Marine Species Genome Database (MSGD). To identify distantly related proteins, we propose a novel LESS (length encoded secondary structure) profile that improves information retrieval, especially for protein sequences that lack resolved structures and share low sequence identity. The matching algorithms combine several existing secondary structure prediction techniques with an encoding mechanism based on the length distribution of secondary structure elements. Because the secondary structures of proteins are conserved in evolution, the proposed system is well suited to similarity comparison of distantly related proteins, and MSGD retrieved several important protein sequences that well-known residue-based matching methods failed to identify.
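The idea of encoding secondary structure by length can be illustrated with a minimal sketch. The abstract does not specify the alphabet, the length thresholds, or the matching algorithm, so everything below is an assumption made for illustration: the input is taken to be a three-state predicted string (H for helix, E for strand, C for coil), the bin boundaries in `LENGTH_BINS` are arbitrary, the names `length_class`, `less_encode`, and `profile_similarity` are hypothetical, and Python's `difflib.SequenceMatcher` stands in for whatever matching procedure the authors actually use.

```python
from difflib import SequenceMatcher
from itertools import groupby

# Illustrative length bins (assumed, not taken from the paper):
# runs of <= 4 residues are "s" (short), <= 10 are "m" (medium), longer are "l".
LENGTH_BINS = ((4, "s"), (10, "m"))


def length_class(run_length):
    """Map the length of one secondary-structure element to a coarse class."""
    for upper_bound, label in LENGTH_BINS:
        if run_length <= upper_bound:
            return label
    return "l"


def less_encode(ss_string):
    """Collapse a per-residue H/E/C string into one token per element.

    Each token is the element type followed by its length class,
    e.g. a helix of 8 residues becomes "Hm".
    """
    return [
        kind + length_class(len(list(run)))
        for kind, run in groupby(ss_string)
    ]


def profile_similarity(ss_a, ss_b):
    """Score two secondary-structure strings by comparing their token lists.

    SequenceMatcher is only a stand-in for a real profile alignment.
    """
    return SequenceMatcher(None, less_encode(ss_a), less_encode(ss_b)).ratio()


if __name__ == "__main__":
    query = "CCHHHHHHHHCCCEEEEECC"
    target = "CHHHHHHHHHCCEEEEEEC"
    print(less_encode(query))
    print(profile_similarity(query, target))
```

In this sketch the two example strings differ in length and in the exact boundaries of each element, yet they reduce to the same token sequence. That is the property the abstract relies on: a comparison at the level of secondary structure elements can tolerate differences that a residue-by-residue comparison would penalize.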