Abstract:
Intuitively, word representations for logographic languages such as Chinese can be enhanced by the words' internal characteristics. Several research efforts have tried to learn Chinese word embeddings from characters, radicals, or subcharacters, which contain rich semantic information. In this paper, motivated by the Four-Corner Method for Character Indexation, we extract features from the four corners of characters, which carry important morphological characteristics. Based on these four-corner features, we propose a model that uses both the characters and the four-corner features of words to capture semantic and morphological information. Moreover, we apply an attention scheme to integrate this internal information dynamically; it includes two strategies for assigning different weights to elements according to word frequency. Experimental results on a social news corpus and the Chinese Wikipedia dump show that exploiting the four-corner morphological features is crucial for capturing the meanings of Chinese words. Meanwhile, results on word analogy, word similarity, and text classification tasks demonstrate that our approach obtains better results than state-of-the-art approaches.
Published in: IEEE Transactions on Big Data (Volume: 8, Issue: 4, 01 August 2022)