The BBN Byblos OCR system implements a script-independent methodology for OCR using hidden Markov models (HMMs). We have successfully ported the system to Arabic, Pashto, English, and Chinese. We discuss our effort in configuring the system to perform recognition of noisy machine-printed Japanese documents. The data for our experiments were taken from the University of Washington (UW-II) Japanese OCR corpus and the LDC Japanese Business News Supplement corpus. We evaluated the performance of a whole-character configuration in which each character was modeled using a separate HMM. As in the case of our Chinese OCR system [P. Natarajan et al., 2001], we also used a sub-character modeling approach [P. Natarajan et al., 2003] in which each Japanese character was spelled using a shared set of automatically generated sub-characters. We experimentally evaluated the performance of different sub-character clusters as well as different HMM topologies to identify the best overall system configuration. On a fair test using noisy/degraded images from the UW-II corpus, the best sub-character configuration resulted in a character error rate of 20.13%. On relatively cleaner data, consisting of scanned newspaper images, the system delivered an error rate of 7.85%. Using a whole-character configuration, the corresponding error rates were 11.94% and 4.55%, respectively.
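The abstract does not state how the character error rate is computed. A common convention, assumed here and not confirmed by the source, is the Levenshtein (edit) distance between the reference transcription and the recognizer output, divided by the number of reference characters. The function names below are hypothetical and serve only as a minimal sketch of that convention.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character substitutions,
    insertions, and deletions needed to turn hyp into ref."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Character error rate in percent, under the assumed definition
    (edit distance normalized by reference length)."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)
```

Because Python strings are sequences of Unicode code points, the same function applies to Japanese text without change, for example `character_error_rate("日本語", "日本")`. Under this definition, insertions can push the rate above 100%.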