Syntactic segmentation and labeling of digitized pages from technical journals
Krishnamoorthy, M.
Nagy, G.
Seth, S.
Viswanathan, M.
Rensselaer Polytech. Inst., Troy, NY
This paper appears in: Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publication Date: Jul 1993
Volume: 15,
Issue: 7
On page(s): 737-747
ISSN: 0162-8828
References Cited: 32
CODEN: ITPIDJ
INSPEC Accession Number: 4465239
Digital Object Identifier: 10.1109/34.221173
Current Version Published: 2002-08-06
Abstract
A method for extracting alternating horizontal and vertical
projection profiles from nested sub-blocks of scanned page images of
technical documents is discussed. The thresholded profile strings are
parsed using the compiler utilities Lex and Yacc. The significant
document components are demarcated and identified by the recursive
application of block grammars. Backtracking for error recovery and
branch and bound for maximum-area labeling are implemented with Unix
Shell programs. Results of the segmentation and labeling process are
stored in a labeled x-y tree. It is shown that
families of technical documents that share the same layout conventions
can be readily analyzed. Results from experiments in which more than 20
types of document entities were identified in sample pages from two
journals are presented.
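
The core idea of the abstract (alternating horizontal and vertical projection profiles, thresholded into strings, applied recursively to nested sub-blocks, with the result stored in an x-y tree) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper parses the profile strings with Lex and Yacc block grammars and assigns labels, whereas the sketch below simply splits each string at white gaps and produces an unlabeled tree. The names `profile`, `threshold_string`, `ink_spans`, `xy_cut`, and `leaves`, the `min_gap` parameter, and the toy page are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, bottom, left, right), half-open ranges


def profile(img: List[List[int]], box: Box, axis: str) -> List[int]:
    """Count ink pixels (value 1) per row (axis 'h') or per column (axis 'v')."""
    top, bottom, left, right = box
    if axis == "h":
        return [sum(img[r][left:right]) for r in range(top, bottom)]
    return [sum(img[r][c] for r in range(top, bottom)) for c in range(left, right)]


def threshold_string(prof: List[int], thresh: int = 0) -> str:
    """Binarize a profile: '1' where the count exceeds thresh, else '0'."""
    return "".join("1" if v > thresh else "0" for v in prof)


def ink_spans(s: str, min_gap: int = 1) -> List[Tuple[int, int]]:
    """Half-open spans of ink, split only at runs of at least min_gap zeros."""
    spans, start, gap = [], None, 0
    for i, ch in enumerate(s):
        if ch == "1":
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(s) - gap))
    return spans


def xy_cut(img: List[List[int]], box: Box, axis: str = "h",
           min_gap: int = 1, stalled: bool = False) -> Dict:
    """Recursively cut a block, alternating direction; returns an unlabeled x-y tree."""
    top, bottom, left, right = box
    other = "v" if axis == "h" else "h"
    spans = ink_spans(threshold_string(profile(img, box, axis)), min_gap)
    if axis == "h":
        parts = [(top + a, top + b, left, right) for a, b in spans]
    else:
        parts = [(top, bottom, left + a, left + b) for a, b in spans]
    node = {"box": box, "children": []}
    if not parts:
        return node  # blank block
    if parts == [box]:
        # No cut possible in this direction: try the other once, then stop.
        return node if stalled else xy_cut(img, box, other, min_gap, True)
    node["children"] = [xy_cut(img, p, other, min_gap) for p in parts]
    return node


def leaves(node: Dict) -> List[Box]:
    """Collect leaf boxes in reading order (top to bottom, left to right)."""
    if not node["children"]:
        return [node["box"]]
    return [b for child in node["children"] for b in leaves(child)]


# Toy 6x7 page: a full-width line, a two-column band, a full-width line.
page = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],
]
tree = xy_cut(page, (0, 6, 0, 7))
blocks = leaves(tree)
```

In this sketch the horizontal cut at the root separates the three bands, and the vertical cut inside the middle band separates the two columns. A grammar-driven version, as described in the abstract, would replace `ink_spans` with a parser over the thresholded string and attach a label to each node.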