A low-cost optical character recognition (OCR) system can be realized by means of a document scanner connected to a CPU through an interface. The interface performs elementary image processing functions, such as noise filtering and thresholding of the video image from the scanner. The processor receives a binary image of the document, formats the image into individual character patterns, and classifies the patterns one by one. A CPU implementation is highly flexible and avoids much of the development and manufacturing cost of the special-purpose parallel circuitry typically used in commercial OCR. A processor-based recognition system has been investigated for reading documents printed in fixed-pitch conventional type fonts, such as those that occur in routine office typing. Novel, efficient methods have been devised for tracking a print line, resolving it into individual character patterns, detecting underscores, and eliminating noise. A previously developed classification technique, based on decision trees, has been extended to improve reading accuracy in an environment of considerable character variation, including the possibility that documents in the same font style may be produced using quite different print technologies. The system has been tested on typical office documents and on artificial stress documents, obtained from a variety of typewriters.
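The processing chain described above (thresholding, resolving a fixed-pitch print line into character patterns, and classifying each pattern with a decision tree) can be illustrated with a minimal sketch. It is not the system's actual method: the function names, the 3x3 glyphs, the threshold level, and the pixel tests in the tree are all hypothetical, chosen only to make the example self-contained. Line tracking, underscore detection, and noise elimination are omitted.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values


def threshold(gray: Image, level: int = 128) -> Image:
    """Binarize a gray image: 1 = ink (dark pixel), 0 = background."""
    return [[1 if px < level else 0 for px in row] for row in gray]


def segment_fixed_pitch(binary: Image, pitch: int) -> List[Image]:
    """Cut one print line into cells that are `pitch` columns wide.

    Assumes the line is already isolated and aligned to the pitch grid.
    """
    width = len(binary[0]) if binary else 0
    return [[row[x:x + pitch] for row in binary]
            for x in range(0, width - pitch + 1, pitch)]


def classify(cell: Image) -> str:
    """Toy decision tree: each node tests a single pixel of the cell."""
    if cell[0][0]:                       # top-left pixel inked?
        return "T" if cell[0][2] else "L"
    return "I" if cell[0][1] else " "


def read_line(gray: Image, pitch: int) -> str:
    """Threshold, segment, and classify the patterns one by one."""
    cells = segment_fixed_pitch(threshold(gray), pitch)
    return "".join(classify(cell) for cell in cells)


# Hypothetical 3-row scan line holding the glyphs T, L, I (0 = dark, 255 = light).
SAMPLE: Image = [
    [0, 0, 0,      0, 255, 255,   255, 0, 255],
    [255, 0, 255,  0, 255, 255,   255, 0, 255],
    [255, 0, 255,  0, 0, 0,       255, 0, 255],
]

result = read_line(SAMPLE, pitch=3)
```

A tree of single-pixel tests is inexpensive to evaluate on a general-purpose processor, which is one plausible reason such a classifier suits a CPU-based design. A practical tree would be much deeper and would have to tolerate the character variation mentioned above.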
Note: The Institute of Electrical and Electronics Engineers, Incorporated is distributing this Article with permission of the International Business Machines Corporation (IBM) who is the exclusive owner. The recipient of this Article may not assign, sublicense, lease, rent or otherwise transfer, reproduce, prepare derivative works, publicly display or perform, or distribute the Article.