Life science researchers frequently need to query large protein data sets in a variety of ways. Each protein in such a data set has a rich structure that includes its primary structure, described as a sequence of amino acids, and its secondary structure, described as a sequence of folding patterns. Both structures are important: the amino acid sequence is often used to find homologous proteins, and the secondary structure can provide important hints about a protein's function. While there are tools for querying each of these structures independently, there are no tools for declarative querying on both structures together. Even the tools that allow querying on one of these structures are not based on any formal algebra, and as a result require complex rewriting of the tool's programming logic when the "query evaluation plan" changes. This paper introduces PiQA, a Protein Query Algebra, which provides a rich set of algebraic operations on both the primary and secondary structure of proteins. Using PiQA, one can pose complex queries involving both the primary and the secondary structure of proteins. In addition, simple existing tools that query only the primary structure, such as BLAST, can also be expressed in this algebra. PiQA is an important first step toward an algebra that can form the basis of a declarative query language for protein data sets.