Text and Layout Information Extraction from Document Files of Various Formats Based on the Analysis of Page Description Language

4 Author(s)
Takashi Hirano (Mitsubishi Electric Corporation, Information Technology R&D Center); Yuichi Okano; Yasuhiro Okada; Fumio Yoda

We propose a document analysis method that extracts text and layout information from document files of various formats. The method analyzes the page description language (PDL) data generated when a document is printed. Because any printable document can be converted to PDL data, the method can handle a wide range of document formats. Graphic elements in the PDL data (text objects, image objects, and path objects) are analyzed to extract text and layout information: character size, character position, and table position. Applying OCR to the image objects and path objects converts text images in source documents and vectorized font characters in engineering drawings to text. Tables in various documents are detected by analyzing path objects. As a result, the full content of a document file can be extracted regardless of its format, as long as the document is printable.
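
As a rough illustration of how path objects can carry table layout, the sketch below treats each path object as one straight line segment and reports a table region when enough horizontal and vertical rules are present. It is a simplified heuristic written for this summary, not the authors' algorithm: the class names (TextObject, PathObject), the function names (find_table_bbox, text_inside), and the tolerance and rule-count parameters are all hypothetical, the page is assumed to contain at most one ruled table, and the OCR step for image and path objects is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (left, bottom, right, top)


@dataclass
class TextObject:
    """Hypothetical text object: a string with its position and character size."""
    text: str
    x: float
    y: float
    size: float


@dataclass
class PathObject:
    """Hypothetical path object: one straight segment from (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def find_table_bbox(paths: List[PathObject],
                    tol: float = 0.5,
                    min_rules: int = 2) -> Optional[BBox]:
    """Return the bounding box of the ruled lines, or None if too few rules exist.

    Assumption: a table is drawn with at least `min_rules` horizontal and
    `min_rules` vertical segments, and the page holds a single table.
    """
    horizontals = [p for p in paths
                   if abs(p.y0 - p.y1) <= tol and abs(p.x0 - p.x1) > tol]
    verticals = [p for p in paths
                 if abs(p.x0 - p.x1) <= tol and abs(p.y0 - p.y1) > tol]
    if len(horizontals) < min_rules or len(verticals) < min_rules:
        return None
    rules = horizontals + verticals
    xs = [x for p in rules for x in (p.x0, p.x1)]
    ys = [y for p in rules for y in (p.y0, p.y1)]
    return (min(xs), min(ys), max(xs), max(ys))


def text_inside(texts: List[TextObject], bbox: BBox) -> List[TextObject]:
    """Select the text objects whose anchor point lies inside the box."""
    left, bottom, right, top = bbox
    return [t for t in texts if left <= t.x <= right and bottom <= t.y <= top]


if __name__ == "__main__":
    # A 2-column, 2-row grid: three horizontal and three vertical rules.
    paths = [PathObject(0, y, 100, y) for y in (0, 10, 20)]
    paths += [PathObject(x, 0, x, 20) for x in (0, 50, 100)]
    texts = [TextObject("Title", 10, 40, 14),
             TextObject("A", 10, 12, 10),
             TextObject("B", 60, 12, 10)]
    bbox = find_table_bbox(paths)
    if bbox is not None:
        print(bbox, [t.text for t in text_inside(texts, bbox)])
```

Text objects already carry their string, position, and size, so only the ruled lines need geometric analysis here; a fuller version would also split the region into cells and handle several tables per page.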

Published in:

Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) (Volume: 1)

Date of Conference:

23-26 Sept. 2007