Document classification through image-based character embedding and wildcard training


Abstract:

Languages such as Chinese and Japanese have a far larger character set (several thousand characters) than most other languages, and their sentences consist of concatenated words with a wide variety of inflected forms, which makes appropriate word segmentation difficult. As a result, recently proposed language-processing methods designed for languages such as English cannot be applied. In this paper, we address these issues and propose a new and efficient document classification technique for such languages. The proposed method combines a new “image-based character embedding” with a character-level convolutional neural network trained with “wildcard training.” The first encodes each character based on its pictorial structure and preserves that structure. The second treats some of the input characters as wildcards in the classification stage and functions as efficient data augmentation. We confirmed that the proposed method outperformed conventional methods on Japanese document classification problems.
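
The abstract gives no implementation details for wildcard training, so the following is only a minimal sketch of one plausible reading: each input character is independently replaced by a placeholder with some probability, which yields several perturbed copies of a document for augmentation. The names `WILDCARD`, `apply_wildcards`, and `augment`, the placeholder symbol, and the masking rate are all assumptions made for illustration and are not taken from the paper.

```python
import random

# Hypothetical placeholder symbol; the paper's actual wildcard
# representation is not described in the abstract.
WILDCARD = "\uFFFD"


def apply_wildcards(text, rate, rng):
    """Replace each character with WILDCARD independently with probability `rate`."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    return "".join(WILDCARD if rng.random() < rate else ch for ch in text)


def augment(text, copies, rate, seed=0):
    """Produce `copies` independently masked variants of `text` (assumed usage)."""
    rng = random.Random(seed)
    return [apply_wildcards(text, rate, rng) for _ in range(copies)]


if __name__ == "__main__":
    # Each variant keeps the original length; only some characters are masked.
    for variant in augment("文書分類の例", copies=3, rate=0.3, seed=1):
        print(variant)
```

Masking at the character level keeps the sequence length unchanged, so a character-level model would see the same positions with some inputs hidden. Whether the paper masks raw characters or their embeddings is not stated in the abstract.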
Date of Conference: 05-08 December 2016
Date Added to IEEE Xplore: 06 February 2017
Conference Location: Washington, DC, USA
