Extract list data from semi-structured document using clustering

3 Author(s)
Hui Xu (Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China); Juanzi Li; Peng Xu

This paper is concerned with list data extraction from semi-structured documents. By list data extraction, we mean extracting data from lists and grouping it by rows and columns. Lists have a regular structure and are used to store highly structured, database-like information in many semi-structured documents, such as business annual reports, online airport listings, catalogs, and hotel directories. List data extraction benefits text mining applications on semi-structured documents. Several research efforts have addressed structured data extraction from semi-structured documents by utilizing word layout and arrangement information. However, as far as we know, few studies have investigated list data extraction that also makes use of semantic information. In this paper, we propose a clustering-based method that makes use of not only the layout and arrangement information but also the semantic information of words for this extraction task. We show experimental results on plain-text annual reports from the Shanghai Stock Exchange, in which 73.49% of the lists were extracted correctly.
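To make "grouping it by rows and columns" concrete, here is a minimal Python sketch of the layout half of the idea. It is not the authors' method, which the abstract does not specify. It assumes cells in a plain-text list are separated by two or more spaces, and it clusters cell start offsets with a simple gap threshold. The names `cells_with_offsets`, `cluster_offsets`, `extract_rows`, `coarse_type` and the `max_gap` parameter are hypothetical, and `coarse_type` is only a crude stand-in for the semantic information of words.

```python
import re

def cells_with_offsets(line):
    """Return (start_offset, text) pairs; cells are assumed to be
    separated by runs of two or more spaces."""
    return [(m.start(), m.group()) for m in re.finditer(r"\S+(?: \S+)*", line)]

def cluster_offsets(offsets, max_gap=2):
    """1-D gap clustering: sorted offsets that differ by at most
    max_gap from their predecessor fall into the same column cluster."""
    clusters = []
    for off in sorted(set(offsets)):
        if clusters and off - clusters[-1][-1] <= max_gap:
            clusters[-1].append(off)
        else:
            clusters.append([off])
    return clusters

def coarse_type(text):
    """Crude semantic feature (an assumption): number vs. text."""
    return "number" if re.fullmatch(r"\d[\d,.]*%?", text) else "text"

def extract_rows(lines, max_gap=2):
    """Assign every cell to the column cluster that contains its offset."""
    parsed = [cells_with_offsets(line) for line in lines]
    clusters = cluster_offsets([off for cells in parsed for off, _ in cells], max_gap)
    column_of = {off: i for i, cluster in enumerate(clusters) for off in cluster}
    rows = []
    for cells in parsed:
        row = [""] * len(clusters)
        for off, text in cells:
            i = column_of[off]
            # Join cells that land in the same column instead of overwriting.
            row[i] = (row[i] + " " + text).strip()
        rows.append(row)
    return rows
```

Calling `extract_rows` on the lines of a plain-text list returns one Python list per row, with one slot per detected column. A fuller approach would combine the offsets with features such as `coarse_type` in a single distance measure before clustering, but how the paper does this is not stated in the abstract.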

Published in:

Natural Language Processing and Knowledge Engineering, 2005. IEEE NLP-KE '05. Proceedings of 2005 IEEE International Conference on

Date of Conference:

30 Oct.-1 Nov. 2005