We present an information retrieval model that combines evidence from concept-based semantics, term statistics, and context to improve search precision over genomics literature by accurately identifying concise, variable-length passages of text that answer a user query. The system pairs a dimensional data model, which indexes scientific literature at multiple levels of document structure and context, with a rule-based query processing algorithm. The query processing algorithm uses an iterative information extraction technique to identify query concepts and a retrieval function that systematically combines concepts with term statistics at multiple levels of context. We define context in terms of variable-length passages of text and different levels of document lexical structure, including terms, sentences, paragraphs, and entire documents. Our results demonstrate improved search results in the presence of varying levels of semantic evidence, and higher performance for retrieval functions that combine document-level with sentence- and passage-level information than for functions that use document-, sentence-, or passage-level information alone. Initial results are promising: when documents are ranked by their most relevant extracted passages, the results exceed the state of the art by 13.89%, as assessed on the TREC 2005 Genomics track collection of 4.5 million MEDLINE citations.
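As a rough illustration of combining evidence at several levels of context, the following minimal sketch scores each sentence by a weighted sum of document-, paragraph-, and sentence-level term statistics, then ranks a document by its best-scoring sentence. It is not the retrieval function described above: the log-scaled term-frequency scorer, the weights, and the names `term_score` and `best_passage` are all assumptions chosen for illustration, and concept-based semantics are omitted entirely.

```python
import math
import re
from collections import Counter


def term_score(query_terms, text):
    """Sum of log-scaled term frequencies for the query terms found in text.

    Hypothetical scorer used only for illustration.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    return sum(1.0 + math.log(counts[t]) for t in query_terms if counts[t] > 0)


def best_passage(query_terms, document, weights=(0.2, 0.3, 0.5)):
    """Return (score, sentence) for the best-scoring sentence in a document.

    `document` is a list of paragraphs, each a list of sentence strings.
    `weights` are assumed values for the document, paragraph, and
    sentence levels, in that order.
    """
    w_doc, w_par, w_sent = weights
    doc_text = " ".join(s for paragraph in document for s in paragraph)
    doc_score = term_score(query_terms, doc_text)

    best = (0.0, None)
    for paragraph in document:
        par_score = term_score(query_terms, " ".join(paragraph))
        for sentence in paragraph:
            sent_score = term_score(query_terms, sentence)
            if sent_score == 0:
                continue  # a sentence with no query terms is not a candidate
            total = w_doc * doc_score + w_par * par_score + w_sent * sent_score
            if total > best[0]:
                best = (total, sentence)
    return best


example_document = [
    ["BRCA1 mutations increase cancer risk.", "The weather was mild."],
    ["We studied mice."],
]
example_result = best_passage(["brca1", "cancer"], example_document)
```

Here `example_result` holds the combined score and the text of the highest-scoring sentence. Setting one weight to 1 and the others to 0 gives a single-level baseline for comparison.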