Model framework. It is mainly composed of two parts: text representation and classification.
Abstract:
With the development of the Chinese Internet, a large amount of Chinese short text data has been generated. Multilabel classification of Chinese short texts enables more effective management and analysis of this data. However, because Chinese short texts have sparse features and commonly used multilabel classification models are designed and developed primarily for English, traditional sampling methods can easily lead to poor classification results. In response to these challenges, we propose a GAN-based, pinyin-enhanced method for multilabel classification of Chinese short texts. First, we use BERT augmented with pinyin embeddings as the text vector representation to enrich the text information. Second, multiple hidden layers of BERT are integrated with the generators of the GAN model to learn the feature distribution comprehensively. Finally, an improved sampling method is used to help the model learn more effectively. Experimental results show that the proposed method performs better on Chinese multilabel short text classification tasks.
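To make the idea of pinyin-augmented text representation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each character's vector is formed by adding a character embedding to an embedding of its pinyin; the abstract does not state how the two are combined. The lookup table `PINYIN`, the dimension `DIM`, and the helpers `make_table` and `embed` are hypothetical names introduced for illustration, and fixed random vectors stand in for learned BERT weights.

```python
import random

# Hypothetical toy lookup: character -> toneless pinyin. A real system would
# use a full pinyin dictionary; this table exists only for illustration.
PINYIN = {"中": "zhong", "文": "wen", "短": "duan", "本": "ben"}

DIM = 8  # illustrative embedding size, far smaller than BERT's hidden size


def make_table(keys, dim, seed):
    """Build a fixed random embedding table standing in for learned weights."""
    rng = random.Random(seed)
    return {k: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for k in sorted(keys)}


CHAR_EMB = make_table(PINYIN.keys(), DIM, seed=0)
PINYIN_EMB = make_table(set(PINYIN.values()), DIM, seed=1)
UNK = [0.0] * DIM  # fallback vector for symbols missing from either table


def embed(text):
    """Return one vector per character: char embedding + pinyin embedding.

    Element-wise addition is an assumption made for this sketch.
    """
    vectors = []
    for ch in text:
        char_vec = CHAR_EMB.get(ch, UNK)
        pin_vec = PINYIN_EMB.get(PINYIN.get(ch, ""), UNK)
        vectors.append([c + p for c, p in zip(char_vec, pin_vec)])
    return vectors
```

Calling `embed("中文")` returns two vectors of length `DIM`, one per character. Characters that share a pronunciation would receive the same pinyin component, which is one way phonetic information could enrich sparse short-text features.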
Published in: IEEE Access (Volume: 12)