Although machine-printed Thai character recognition has been an intense research area over the past decade, only a few results are available for Thai document layout analysis. Moreover, methods proposed for other languages cannot be applied directly to Thai documents, because Thai script has a unique characteristic: characters can be placed at four different levels. This paper proposes an approach that eliminates this characteristic by removing nonmiddle-level characters from the image using heuristic rules derived from properties of the Thai language: nonmiddle-level characters are usually smaller than middle-level characters, and the gap between levels is smaller than the gap between two consecutive lines. Once these characters are removed, any existing method can be used with Thai documents without modification. Experimental results show that the proposed method removes nonmiddle-level characters from 200 test images with 99.46% accuracy, even when the images contain various font sizes.
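The two heuristic rules named in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract gives no thresholds or processing steps, so the function names `group_into_lines` and `keep_middle_level`, the parameters `min_line_gap` and `size_ratio`, and the use of connected-component bounding boxes as input are all assumptions made for illustration. The sketch first merges components into text lines wherever the vertical gap is small (the level gap), then drops components that are small relative to the tallest component in the same line.

```python
# Hypothetical sketch of the two heuristics; names and thresholds are assumed,
# not taken from the paper. A box is (x, y, width, height), with y growing downward.

def group_into_lines(boxes, min_line_gap):
    """Group boxes into text lines.

    Rule 2 (gap): the gap between levels is smaller than the gap between
    lines, so a new line starts only when the vertical gap to the boxes
    seen so far exceeds min_line_gap.
    """
    lines = []
    current = []
    bottom = None  # lowest bottom edge seen in the current line
    for box in sorted(boxes, key=lambda b: b[1]):  # sort by top edge
        top = box[1]
        if current and top - bottom > min_line_gap:
            lines.append(current)
            current = []
            bottom = None
        current.append(box)
        box_bottom = top + box[3]
        bottom = box_bottom if bottom is None else max(bottom, box_bottom)
    if current:
        lines.append(current)
    return lines


def keep_middle_level(boxes, min_line_gap=6, size_ratio=0.6):
    """Return only the boxes assumed to be middle-level characters.

    Rule 1 (size): nonmiddle-level characters are usually smaller, so a box
    is dropped when its height is below size_ratio times the tallest box
    in its own line. Comparing within a line keeps the test meaningful
    when font sizes differ between lines.
    """
    kept = []
    for line in group_into_lines(boxes, min_line_gap):
        reference_height = max(b[3] for b in line)
        kept.extend(b for b in line if b[3] >= size_ratio * reference_height)
    return kept


# Two lines in different font sizes, each with small marks above or below.
sample = [
    (0, 2, 8, 6), (0, 10, 12, 20), (14, 10, 12, 20), (14, 32, 8, 6),
    (0, 45, 16, 12), (0, 60, 24, 40), (26, 60, 24, 40),
]
middle = keep_middle_level(sample)
```

In this sample, the upper mark of the second line is 12 pixels tall, which would pass a threshold derived from the first line's 20-pixel characters. Judging each component against its own line is one possible way to cope with mixed font sizes; whether the paper does the same is not stated in the abstract.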
Published in:
Proceedings of the Eighth International Conference on Document Analysis and Recognition, 2005
Date of Conference: 29 Aug.-1 Sept. 2005