We introduce research on document digitization technology and its application to constructing digital libraries in China. We focus on two major objectives of document digitization technologies: performance and efficiency. Taking TH-OCR, the most representative product, as an example, we present recent research achievements in China on both kernel OCR technologies and peripheral technologies. The kernel technologies include high-performance multilingual (Chinese, Japanese, Korean and English) text recognition and layout analysis, understanding and reconstruction; the peripheral technologies include the networked document digitization workflow and intelligent proofreading, which greatly improve efficiency. TH-OCR applications produce two types of final output digital documents: one is a reconstructed electronic document that retains the full text and layout information of the original paper-based document; the other is a multilevel document in which the OCR output text layer lies under the image layer. Numerous applications indicate that current technologies can greatly facilitate the large-volume digitization work involved in building digital library infrastructure.