Developing a commercial grade Tamil OCR for recognizing font and size independent text

Abstract:

Optical Character Recognition (OCR) for Indic scripts such as Tamil and Sinhala has lagged behind OCR for languages based on the Latin script. Several past attempts to build commercial-grade OCR for these languages failed because the resulting systems did not generalize well. This paper describes a set of training regimes for Tamil, built on the Tesseract engine, that enabled us to develop a robust Tamil OCR system. We describe our training regime in detail; it yields a performance improvement of 12.5% over the default Tamil module shipped with Tesseract on a set of ancient Tamil documents drawn from an actual project to digitize important Tamil manuscripts of Sri Lanka.

Published in: 2015 Fifteenth International Conference on Advances in ICT for Emerging Regions (ICTer)

Date of Conference: 24-26 August 2015

Date Added to IEEE Xplore: 11 January 2016

DOI: 10.1109/ICTER.2015.7377678

Conference Location: Colombo, Sri Lanka
