Refining the extraction of relevant documents from biomedical literature to create a corpus for pathway text mining

5 Author(s)
Harte, R. ; PPD Discovery, Inc., Menlo Park, CA, USA ; Lu, Y. ; Osborn, S. ; Dehoney, D.

For biologists to keep up with developments in their own and related fields, automation is desirable so that a rapidly growing literature can be read and interpreted more efficiently. Identifying proteins or genes and their interactions can facilitate the mapping of canonical or evolving pathways from the literature. To mine such data, we developed procedures and tools to pre-qualify documents for further analysis. Initially, a corpus of documents for proteins of interest was built using alternate symbols from LocusLink and the Stanford SOURCE as MEDLINE search terms. The query was then refined by combining the optimum keywords with MeSH terms in a Boolean query to minimize false positives. The document space was examined using a strategy employing latent semantic indexing (LSI), which uses Entrez's "related papers" utility for MEDLINE. Relationships between documents were visualized as an undirected graph and scored by their relatedness. Distinct document clusters, formed by the most highly connected related papers, are mostly composed of abstracts relating to one aspect of research. This feature was used to filter out irrelevant abstracts, which reduced corpus size by 10% to 30% depending on the domain. The excluded documents were examined to confirm their lack of relevance. The resulting corpora consisted of the most relevant documents, thus reducing the number of false positives and irrelevant examples in the training set for pathway mapping. Documents were tagged, using a modified version of GATE2, with terms based on GO for rule induction using RAPIER.
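
The two filtering steps described in the abstract (a Boolean query over gene symbols and MeSH terms, then cluster-based pruning of an undirected "related papers" graph) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the `min_cluster_size` threshold, and the use of plain connected components as a stand-in for "most highly connected" clusters are all assumptions made for illustration; the abstract does not specify the scoring or clustering rule. The `[MeSH Terms]` suffix is the standard PubMed field tag.

```python
from collections import defaultdict


def build_boolean_query(symbols, mesh_terms):
    """OR the alternate gene/protein symbols together, then AND them
    with the OR'd MeSH terms (hypothetical helper, not from the paper)."""
    alias_clause = " OR ".join(f'"{s}"' for s in symbols)
    mesh_clause = " OR ".join(f'"{m}"[MeSH Terms]' for m in mesh_terms)
    return f"({alias_clause}) AND ({mesh_clause})"


def connected_clusters(doc_ids, related_pairs):
    """Group documents into connected components of an undirected graph.
    Edges pointing outside the corpus are ignored."""
    corpus = set(doc_ids)
    neighbors = defaultdict(set)
    for a, b in related_pairs:
        if a in corpus and b in corpus:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, clusters = set(), []
    for start in doc_ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:  # iterative depth-first traversal
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters


def filter_corpus(doc_ids, related_pairs, min_cluster_size=2):
    """Keep only documents in clusters of at least min_cluster_size
    (an assumed threshold); isolated documents are treated as irrelevant."""
    kept = set()
    for cluster in connected_clusters(doc_ids, related_pairs):
        if len(cluster) >= min_cluster_size:
            kept |= cluster
    return [d for d in doc_ids if d in kept]
```

For example, `filter_corpus(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c"), ("d", "x")])` keeps `a`, `b`, and `c`: `d` is linked only to a document outside the corpus and `e` has no links, so both are dropped. In practice the excluded documents would still be reviewed by hand, as the abstract describes.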

Published in:

Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE

Date of Conference:

11-14 Aug. 2003