Large-scale scanning and document conversion efforts have led to renewed interest in OCR systems and workflows. This paper describes a new format for representing both intermediate and final OCR results, developed in response to the needs of a newly developed OCR system and ground truth data release. The format embeds OCR information invisibly in standard HTML and CSS; it can therefore represent a wide range of linguistic and typographic phenomena with well-defined, widely understood markup, and it can be processed with widely available, familiar tools. The format rests on a new multi-level abstraction of OCR results that combines logical markup, common typesetting models, and OCR engine-specific markup, making it suitable both for supporting existing workflows and for developing future model-based OCR engines.
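To illustrate what embedding OCR results in ordinary HTML might look like, here is a minimal sketch in Python that reads such markup with the standard library alone. The abstract does not list concrete element names, so the class names (`ocr_page`, `ocr_line`, `ocrx_word`), the `bbox x0 y0 x1 y1` property in the `title` attribute, and the sample fragment itself are assumptions chosen for illustration; the helper `extract_ocr_elements` is hypothetical and not part of any OCR engine.

```python
from html.parser import HTMLParser

# Hypothetical sample: OCR data carried in class/title attributes of plain HTML.
# The class names and the "bbox" property are assumptions for illustration.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 600">
  <span class="ocr_line" title="bbox 10 20 400 50">
    <span class="ocrx_word" title="bbox 10 20 90 50">Hello</span>
    <span class="ocrx_word" title="bbox 100 20 200 50">world</span>
  </span>
</div>
"""


class OcrElementCollector(HTMLParser):
    """Collects (class, bbox) pairs for elements whose class starts with 'ocr'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        css_class = attributes.get("class") or ""
        if not css_class.startswith("ocr"):
            return  # ordinary HTML element, carries no OCR information
        bbox = None
        # The title attribute is treated as a ';'-separated property list.
        for prop in (attributes.get("title") or "").split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                bbox = tuple(int(p) for p in parts[1:])
        self.elements.append((css_class, bbox))


def extract_ocr_elements(html):
    """Return OCR-bearing elements of an HTML string in document order."""
    collector = OcrElementCollector()
    collector.feed(html)
    return collector.elements


if __name__ == "__main__":
    for css_class, bbox in extract_ocr_elements(SAMPLE):
        print(css_class, bbox)
```

Because the OCR data lives in attributes that browsers do not render, the same file still displays as a normal page, while any generic HTML parser can recover the layout information.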
Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Volume 2
Date of Conference: 23-26 Sept. 2007