Information extraction from semi-structured and un-structured documents using probabilistic context free grammar inference

4 Author(s)
Ramesh Thakur (International Institute of Professional Studies, Devi Ahilya Viswavidyalaya, Indore, India); Suresh Jain; Narendra S. Chaudhari; Rahul Singhai

A large number of research papers are available only in unstructured (text) form. Knowledge discovery in unstructured documents has been recognized as a promising task. These documents are typically formatted for human viewing, and the formatting varies widely from document to document. Frequent changes in formatting make it difficult to construct a global schema, so discovering interesting rules from such documents is a complex and tedious process. Recently, conditional random fields (CRFs) and hand-coded wrappers have been used to label text fields in research papers, such as Title, Author Name(s), Affiliation, Email, and Contact Number. In this paper we propose a novel hybrid approach that infers grammar rules using alignment similarity and a probabilistic context-free grammar. The inferred rules help in extracting the desired information from a document.
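As a rough illustration of what a probabilistic context-free grammar over paper-header fields might look like, the sketch below estimates rule probabilities by relative frequency from a handful of labeled header layouts. It is not the authors' method: the abstract does not describe their inference procedure, and the alignment-similarity step is not modeled here. The field sequences, the `HEADER` nonterminal, and the function names `estimate_pcfg` and `derivation_probability` are all hypothetical.

```python
from collections import Counter
from math import prod


def estimate_pcfg(derivations):
    """Estimate rule probabilities by relative frequency (maximum likelihood).

    derivations: iterable of derivations, each a list of (lhs, rhs) rules,
    where rhs is a tuple of symbols.
    Returns a dict mapping each rule to P(rhs | lhs).
    """
    rule_counts = Counter(rule for d in derivations for rule in d)
    lhs_totals = Counter()
    for (lhs, _rhs), n in rule_counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in rule_counts.items()}


def derivation_probability(pcfg, derivation):
    """Probability of a derivation: the product of its rule probabilities.

    Rules never seen during estimation get probability 0 (no smoothing).
    """
    return prod(pcfg.get(rule, 0.0) for rule in derivation)


# Hypothetical labeled header layouts (assumed data, not from the paper).
examples = [
    [("HEADER", ("TITLE", "AUTHORS", "AFFILIATION", "EMAIL"))],
    [("HEADER", ("TITLE", "AUTHORS", "AFFILIATION", "EMAIL"))],
    [("HEADER", ("TITLE", "AUTHORS", "EMAIL"))],
    [("HEADER", ("TITLE", "AUTHORS", "AFFILIATION"))],
]

pcfg = estimate_pcfg(examples)
for (lhs, rhs), p in sorted(pcfg.items(), key=lambda item: -item[1]):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

With probabilities attached to each rule, a new header can be matched against the most probable layout first, which is one plausible way such a grammar could guide field extraction.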

Published in:

Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on

Date of Conference:

13-15 March 2012