Abstract
This paper describes a fast and flexible method for extracting
text regions from a document page containing text, graphics, and
pictures. Such regions can be given as an input to an OCR system. The
user fixes two parameters, the minimum width w of the text to be
detected, and the precision ε needed (both expressed as a
percentage of the image width), according to the implementation needs.
The method works by subdividing the page into overlapping columns whose
width and inter-shift depend on w and ε, and by performing text
line extraction on each column separately. Subsequently, a statistical
analysis of the text line elements found in each column is performed,
and they are connected to form complete text lines. Finally, related
pieces of text are merged into blocks so that a sensible reading order
is provided for the OCR system. The algorithm is very fast, is able to
work on low-resolution document pages and is robust against skew. The
algorithm is also very flexible: no assumptions are made on the layout
of the document, the shape of the text regions, and the font size and
style; the main assumption is that the background is uniform and the
text approximately horizontal. Despite the statistical nature of the
method, a single line of text of a certain font size is generally
sufficient to warrant detection. Experimental results are shown which
demonstrate the effectiveness of the method on several different kinds
of documents.
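
The column subdivision step can be illustrated with a minimal sketch. The abstract states only that the column width and inter-shift depend on w and ε; the specific mapping used here (column width equal to w percent of the image width, shift equal to ε percent) is an assumption made for illustration, and the function name column_spans is hypothetical, not taken from the paper.

```python
def column_spans(image_width, w_pct, eps_pct):
    """Return (start, end) pixel spans of overlapping vertical columns.

    Assumption (not from the paper): each column is w_pct percent of the
    image width wide, and consecutive columns are shifted by eps_pct
    percent of the image width.
    """
    if image_width <= 0:
        raise ValueError("image_width must be positive")
    if not (0 < eps_pct <= w_pct <= 100):
        raise ValueError("expected 0 < eps_pct <= w_pct <= 100")

    # Convert percentages to pixels, never going below one pixel.
    width = max(1, round(image_width * w_pct / 100))
    shift = max(1, round(image_width * eps_pct / 100))

    spans = []
    start = 0
    # Slide the column window across the page in steps of `shift`.
    while start + width < image_width:
        spans.append((start, start + width))
        start += shift
    # Final column is aligned with the right edge so the page is covered.
    spans.append((max(0, image_width - width), image_width))
    return spans


if __name__ == "__main__":
    # Hypothetical values: a 1000-pixel-wide page, w = 20%, eps = 5%.
    for span in column_spans(1000, 20, 5):
        print(span)
```

Under this assumption, a smaller ε yields more heavily overlapping columns, so each text line is seen by more columns at the cost of more per-column work.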