We propose a document analysis method that extracts text and layout information from document files in various formats. The method analyzes the page description language (PDL) data generated when a document is printed. Because any printable document can be converted to PDL data, the method handles various document formats. Graphic elements in the PDL data, such as text objects, image objects, and path objects, are analyzed to extract text and layout information (character size, character position, and table position). Applying OCR to the image objects and path objects converts text images in source documents and vectorized font characters in engineering drawings to text. Moreover, tables in various documents are detected by analyzing path objects. It is therefore possible to extract the full content information from document files of various formats, as long as the document is printable.
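The abstract does not say how path objects are analyzed to find tables, so the following Python sketch is only one plausible reading, not the authors' algorithm. It assumes the path objects have already been reduced to straight line segments in page coordinates, and it treats any connected group of ruling lines with at least two horizontal and two vertical segments as a table candidate. The names `Segment`, `intersects`, and `detect_tables`, and the tolerance `tol`, are hypothetical and introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Segment:
    """A straight stroke taken from a path object (assumed representation)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def is_horizontal(self, tol: float = 1.0) -> bool:
        return abs(self.y0 - self.y1) <= tol

    def is_vertical(self, tol: float = 1.0) -> bool:
        return abs(self.x0 - self.x1) <= tol


def intersects(h: Segment, v: Segment, tol: float = 1.0) -> bool:
    """True if horizontal segment h and vertical segment v touch or cross."""
    hx0, hx1 = sorted((h.x0, h.x1))
    vy0, vy1 = sorted((v.y0, v.y1))
    return hx0 - tol <= v.x0 <= hx1 + tol and vy0 - tol <= h.y0 <= vy1 + tol


def detect_tables(segments: List[Segment], tol: float = 1.0) -> List[Tuple[float, float, float, float]]:
    """Return bounding boxes (x_min, y_min, x_max, y_max) of table candidates."""
    horiz = [s for s in segments if s.is_horizontal(tol)]
    vert = [s for s in segments if s.is_vertical(tol) and not s.is_horizontal(tol)]
    rules = horiz + vert
    parent = list(range(len(rules)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every horizontal rule with every vertical rule it meets.
    for i, h in enumerate(horiz):
        for j, v in enumerate(vert):
            if intersects(h, v, tol):
                parent[find(i)] = find(len(horiz) + j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(rules)):
        groups.setdefault(find(i), []).append(i)

    tables = []
    for members in groups.values():
        n_h = sum(1 for i in members if i < len(horiz))
        n_v = len(members) - n_h
        # Assumed heuristic: a grid needs at least two rules in each direction.
        if n_h >= 2 and n_v >= 2:
            xs = [c for i in members for c in (rules[i].x0, rules[i].x1)]
            ys = [c for i in members for c in (rules[i].y0, rules[i].y1)]
            tables.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(tables)


# Usage: a 2x2 grid of rules plus one isolated underline elsewhere on the page.
grid = [Segment(0, y, 200, y) for y in (0, 50, 100)]
grid += [Segment(x, 0, x, 100) for x in (0, 100, 200)]
underline = [Segment(0, 300, 200, 300)]
print(detect_tables(grid + underline))  # [(0, 0, 200, 100)]
```

The isolated underline is ignored because it meets no vertical rule. A practical detector would likely also need to handle filled thin rectangles drawn as rules, dashed lines, and borderless tables, none of which this sketch attempts.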