The BBN Byblos Japanese OCR system

4 Author(s)
Macrostie, E. ; Dept. Speech & Language Process., BBN Technol., Cambridge, MA, USA ; Natarajan, Premkumar ; Decerbo, M. ; Prasad, R.

The BBN Byblos OCR system implements a script-independent methodology for OCR using hidden Markov models (HMMs). We have successfully ported the system to Arabic, Pashto, English, and Chinese. We discuss our effort in configuring the system to recognize noisy machine-printed Japanese documents. The data for our experiments were taken from the University of Washington (UW-II) Japanese OCR corpus and the LDC Japanese Business News Supplement corpus. We evaluated the performance of a whole-character configuration in which each character was modeled using a separate HMM. As in the case of our Chinese OCR system [P. Natarajan et al., 2001], we also used a sub-character modeling approach [P. Natarajan et al., 2003] in which each Japanese character was spelled using a shared set of automatically generated sub-characters. We experimentally evaluated different sub-character clusters as well as different HMM topologies to identify the best overall system configuration. On a fair test using noisy/degraded images from the UW-II corpus, the best sub-character configuration resulted in a character error rate of 20.13%. On relatively clean data, consisting of scanned newspaper images, the system delivered an error rate of 7.85%. Using a whole-character configuration, the corresponding error rates were 11.94% and 4.55%, respectively.
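The abstract reports character error rates but does not say how they were scored. As a minimal sketch, assuming the conventional definition (substitutions, deletions, and insertions from a character-level edit-distance alignment, divided by the number of reference characters), the metric could be computed as follows. The function names `edit_distance` and `character_error_rate` and the sample strings are hypothetical illustrations, not part of the BBN Byblos system.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Character error rate in percent (assumed definition: edits / reference length)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Hypothetical reference and OCR output: one substitution and one deletion.
    reference = "日本語の文字認識"
    recognized = "日本話の文字認"
    print(f"CER = {character_error_rate(reference, recognized):.2f}%")
```

Under this definition the rate is normalized by the reference length, so a hypothesis with many insertions can score above 100%.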

Published in:

Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on (Volume: 2)

Date of Conference:

23-26 Aug. 2004
