Using natural language processing and the gene ontology to populate a structured pathway database

4 Author(s)

Reading literature is one of the most time-consuming tasks a busy scientist has to contend with. As the volume of literature continues to grow, there is a need to sort through this information more efficiently. Mapping the pathways of genes and proteins of interest is one goal that requires frequent reference to the literature. Pathway databases can help here, and scientists currently have a choice between buying access to externally curated pathway databases or building their own in-house. However, externally curated databases are expensive to license, and in-house databases are slow to populate manually. Building upon easily available, open-source tools, we have developed a pipeline to automate the collection, structuring and storage of gene and protein interaction data from the literature. As a team of both biologists and computer scientists, we integrated our natural language processing (NLP) software with the Gene Ontology (GO) to collect and translate unstructured text data into structured interaction data. For NLP we used a machine learning approach with a rule induction program, RAPIER (http://www.cs.utexas.edu/users/ml/rapier.html). RAPIER was modified to learn rules from tagged documents, and it was then trained on a corpus tagged by expert curators. The resulting rules were used to extract information from a test corpus automatically. Extracted genes and proteins were mapped onto LocusLink, and extracted interactions were mapped onto GO. Once structured in this way, the information was stored in a pathway database, and this formal structure allowed us to perform advanced data mining and visualization.
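The mapping and storage steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the table layout, and the two small lookup dictionaries are assumptions made for illustration, and the rule-based extraction step (RAPIER) is represented only by a hand-written list of already-extracted triples. A real pipeline would load gene identifiers and GO terms from the respective database releases rather than from hard-coded dictionaries.

```python
import sqlite3

# Hypothetical, hand-written lookup tables standing in for LocusLink and GO.
GENE_TO_LOCUS_ID = {"TP53": 7157, "MDM2": 4193}
VERB_TO_GO_TERM = {"binds": "GO:0005515", "phosphorylates": "GO:0016310"}


def structure_interaction(agent, verb, target):
    """Map one extracted (agent, verb, target) triple to identifiers.

    Returns None when any part of the triple has no entry in the lookup tables.
    """
    try:
        return (
            GENE_TO_LOCUS_ID[agent.upper()],
            VERB_TO_GO_TERM[verb.lower()],
            GENE_TO_LOCUS_ID[target.upper()],
        )
    except KeyError:
        return None


def store_interactions(conn, triples):
    """Store every mappable triple in an 'interaction' table; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS interaction "
        "(agent_id INTEGER, go_term TEXT, target_id INTEGER)"
    )
    mapped = (structure_interaction(*t) for t in triples)
    rows = [row for row in mapped if row is not None]
    conn.executemany("INSERT INTO interaction VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Assumed extractor output; "FOO1" is a made-up name that fails the gene lookup.
extracted = [("MDM2", "binds", "TP53"), ("FOO1", "binds", "TP53")]
conn = sqlite3.connect(":memory:")
stored = store_interactions(conn, extracted)
```

Dropping unmapped triples, as this sketch does, is only one possible design choice; a curation workflow might instead queue them for manual review.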

Published in:

Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE

Date of Conference:

11-14 Aug. 2003