Reading the literature is one of the most time-consuming tasks a busy scientist has to contend with. As the volume of literature continues to grow, there is a need to sort through this information more efficiently. Mapping the pathways of genes and proteins of interest is one goal that requires frequent reference to the literature. Pathway databases can help here, and scientists currently have a choice between buying access to externally curated pathway databases and building their own in house. However, such databases are either expensive to license or slow to populate manually. Building upon easily available, open-source tools, we have developed a pipeline to automate the collection, structuring and storage of gene and protein interaction data from the literature. As a team of both biologists and computer scientists, we integrated our natural language processing (NLP) software with the Gene Ontology (GO) to collect unstructured text data and translate it into structured interaction data. For NLP we used a machine learning approach with a rule induction program, RAPIER (http://www.cs.utexas.edu/users/ml/rapier.html). RAPIER was modified to learn rules from tagged documents and was then trained on a corpus tagged by expert curators. The resulting rules were used to extract information from a test corpus automatically. Extracted genes and proteins were mapped onto LocusLink, and extracted interactions were mapped onto GO. Once structured in this way, the information was stored in a pathway database, and this formal structure allowed us to perform advanced data mining and visualization.
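The structuring step described above (mapping extracted genes and proteins onto LocusLink, mapping interactions onto GO, then storing the result) can be illustrated with a minimal sketch. This is not the pipeline's actual code: the lookup tables, the identifiers in them, the `Interaction` record, and the `structure` and `build_store` functions are all hypothetical, and a plain list stands in for the pathway database. The identifiers are placeholders rather than real LocusLink or GO accessions.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical lookup tables. The identifiers are placeholders,
# not real LocusLink or GO accessions.
GENE_IDS: Dict[str, str] = {
    "genea": "LOCUS:A",
    "geneb": "LOCUS:B",
}
INTERACTION_TERMS: Dict[str, str] = {
    "phosphorylates": "GO:PLACEHOLDER_1",
    "binds": "GO:PLACEHOLDER_2",
}


@dataclass(frozen=True)
class Interaction:
    """One structured interaction record (assumed shape)."""
    source_id: str
    target_id: str
    go_term: str


def structure(triple: Tuple[str, str, str]) -> Optional[Interaction]:
    """Map an extracted (agent, interaction, target) triple to identifiers.

    Returns None when any of the three parts cannot be mapped.
    """
    agent, verb, target = (part.strip().lower() for part in triple)
    source_id = GENE_IDS.get(agent)
    target_id = GENE_IDS.get(target)
    go_term = INTERACTION_TERMS.get(verb)
    if source_id is None or target_id is None or go_term is None:
        return None
    return Interaction(source_id, target_id, go_term)


def build_store(triples: Iterable[Tuple[str, str, str]]) -> List[Interaction]:
    """Keep only the triples that map cleanly; the list stands in for a database."""
    records = (structure(t) for t in triples)
    return [r for r in records if r is not None]


# Example triples as an extractor might emit them (made-up names).
extracted = [
    ("GeneA", "phosphorylates", "GeneB"),
    ("GeneB", "binds", "GeneX"),  # GeneX is unmapped, so this triple is dropped
]
store = build_store(extracted)
```

Dropping unmapped triples is only one possible policy; a curation workflow might instead queue them for manual review.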