
A scale space approach for automatically segmenting words from historical handwritten documents
Manmatha, R.; Rothfeder, J.L.
Dept. of Comput. Sci., Massachusetts Univ., Amherst, MA, USA

This paper appears in: Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publication Date: Aug. 2005
Volume: 27,  Issue: 8
On page(s): 1212-1225
ISSN: 0162-8828
INSPEC Accession Number: 8507963
Digital Object Identifier: 10.1109/TPAMI.2005.150
Current Version Published: 2005-06-20

Abstract
Many libraries, museums, and other organizations contain large collections of handwritten historical documents, for example, the papers of early presidents like George Washington at the Library of Congress. The first step in providing recognition/retrieval tools is to automatically segment handwritten pages into words. State-of-the-art segmentation techniques like the gap metrics algorithm have been mostly developed and tested on highly constrained documents like bank checks and postal addresses. There has been little work on full handwritten pages, and this work has usually involved testing on clean artificial documents created for the purpose of research. Historical manuscript images, on the other hand, contain a great deal of noise and are much more challenging. Here, a novel scale space algorithm for automatically segmenting handwritten (historical) documents into words is described. First, the page is cleaned to remove margins. This is followed by a gray-level projection profile algorithm for finding lines in images. Each line image is then filtered with an anisotropic Laplacian at several scales. This procedure produces blobs which correspond to portions of characters at small scales and to words at larger scales. Crucial to the algorithm is scale selection, that is, finding the optimum scale at which blobs correspond to words. This is done by finding the maximum over scale of the extent or area of the blobs. This scale maximum is estimated using three different approaches. The blobs recovered at the optimum scale are then bounded with a rectangular box to recover the words. A postprocessing filtering step is performed to eliminate boxes of unusual size which are unlikely to correspond to words. The approach is tested on a number of different data sets and it is shown that, on 100 sampled documents from the George Washington corpus of handwritten document images, a total error rate of 17 percent is observed. The technique outperforms a state-of-the-art gap metrics word-segmentation algorithm on this collection.
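
The line-finding and multi-scale blob steps described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. It assumes a binary page where ink is 1, and it swaps the paper's anisotropic 2D Laplacian for a simpler stand-in: the second derivative of a 1D Gaussian applied to the column ink profile of one line. All function names, the `frac` threshold, and the synthetic page are hypothetical. Scale selection and the box-size filter are left out, because the abstract does not give enough detail to reproduce them.

```python
import math

def runs(flags):
    """Return half-open (start, end) index ranges where flags is True."""
    out, start = [], None
    for i, flag in enumerate(flags):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(flags)))
    return out

def find_lines(image, frac=0.2):
    """Projection profile: group rows whose ink sum exceeds frac * peak.

    The threshold rule is an assumption made for this sketch.
    """
    profile = [sum(row) for row in image]
    threshold = frac * max(profile)
    return runs([value > threshold for value in profile])

def second_derivative_kernel(sigma):
    """Sampled second derivative of a Gaussian, truncated at 4 * sigma."""
    radius = int(math.ceil(4 * sigma))
    norm = sigma * math.sqrt(2 * math.pi)
    return [math.exp(-x * x / (2 * sigma ** 2)) / norm
            * (x * x - sigma ** 2) / sigma ** 4
            for x in range(-radius, radius + 1)]

def blobs_at_scale(ink, sigma):
    """Column ranges where the filtered ink profile is negative.

    With ink as the high value, a negative response marks an ink ridge.
    Columns outside the profile are treated as blank (zero padding).
    """
    kernel = second_derivative_kernel(sigma)
    radius = len(kernel) // 2
    response = []
    for i in range(len(ink)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(ink):
                acc += weight * ink[j]
        response.append(acc)
    return runs([r < 0 for r in response])

# Synthetic page: two text lines, each holding two three-character "words".
word = [1, 1, 1, 0, 0] * 2 + [1, 1, 1]
row = [0] * 10 + word + [0] * 10 + word + [0] * 10
blank = [0] * len(row)
page = [blank] * 3 + [row] * 4 + [blank] * 5 + [row] * 4 + [blank] * 3

lines = find_lines(page)
top, bottom = lines[0]
# Column ink profile of the first text line.
ink = [sum(page[y][x] for y in range(top, bottom)) for x in range(len(row))]

for sigma in (1, 2, 4):
    print(sigma, blobs_at_scale(ink, sigma))
```

On this toy input the blobs at the smallest scale line up with individual characters, while at the largest scale shown they merge into one blob per word, which is the behavior the abstract relies on when it selects a scale.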
