A Hybrid Classification Method via Character Embedding in Chinese Short Text With Few Words



I. INTRODUCTION
With the rapid development of web services, more and more Chinese short texts with few words are generated on the Internet, such as microblogs, news titles and search snippets. The main difference between Chinese short texts with few words and traditional Chinese short texts is their extremely short length: such a text is considered to contain only a few Chinese characters; for instance, the length of a news title is usually required to be fewer than 20 characters. Recently, Chinese short texts with few words, e.g. news titles and invoice names, are posted at unprecedented rates, and they are usually an overview of the whole content. In addition, the extremely short length, feature sparsity, high ambiguity and other inherent properties of these texts pose huge challenges to text classification. The urgent demand to process Chinese short texts with few words has attracted a vast amount of attention and research [1]-[4].
(The associate editor coordinating the review of this manuscript and approving it for publication was Jerry Chun-Wei Lin.)
However, most existing short text classification methods pay little attention to Chinese short texts with few words. Considering the uniqueness of these texts, it is hard to apply existing classification methods to them directly due to the following problems. First, the extremely short text length leads to a lack of information for meaningful analysis and effective classification. Second, there is no explicit word delimitation in Chinese texts, unlike the blank space between words in English, and Chinese word segmentation may deteriorate classification performance. In addition, due to the existence of interfering words in Chinese short texts with few words, it is more difficult to identify the keywords needed for correct classification. Thus, classifying very short Chinese texts both efficiently and effectively remains a challenge.
To handle short text classification, existing methods can roughly be divided into two categories: auxiliary information based methods and representation learning based methods. Auxiliary information based methods utilize external knowledge bases such as Freebase, Probase and DBpedia to expand the features of short texts. For example, Wang et al. [5] mapped short documents to Wikipedia concepts for short text classification. Li et al. [6] used Probase to expand the feature space, introducing more semantic contexts of terms to compensate for data sparsity and disambiguate terms. Representation learning based methods seek to learn better feature representations of short texts to improve classification performance, and many approaches have been proposed along this line, such as Naive Bayes (NB) [7] and attention mechanisms [8]. For example, Zhou et al. [9] proposed the Compositional Recurrent Neural Network (RNN) for Chinese short text classification, a hybrid model of character-level and word-level feature representations based on long short-term memory (LSTM). Ma et al. [10] proposed to learn distributional representations with a Gaussian process approach by assuming that a short text is a specific sample of one distribution in a Bayesian framework, so that short text classification is converted into selecting the most probable Gaussian distribution. Meanwhile, Yu et al. [11] proposed an open source library toolkit for short text analysis and classification which supports effective text pre-processing and fast training/prediction procedures based on representation learning; the toolkit is widely applied as an easy-to-use and extensible tool.
Among the above approaches, auxiliary information based methods can improve classification performance with the help of an external knowledge base [12], but most methods in this category operate under restrictive conditions and require plenty of time; they suffer from unavailable information and high communication overhead, and are hence hard to apply to short text classification efficiently. On the other hand, while representation learning based short text classification methods have achieved tremendous success, two problems prevent their further development on Chinese short texts with few words. The first problem is the identification of keywords. Because of the extremely short length, a Chinese text with few words has only a few keywords, sometimes just one, and the category of the text is usually determined by this keyword. The second problem is the feature representation of Chinese short texts with few words: non-standard Chinese expressions, Chinese word segmentation and the existence of distracters may result in ambiguous meaning and misclassification.
For the first problem, the attention mechanism and feature selection are adopted to identify keywords. The attention-based LSTM is utilized to weight each word, focusing on the words close to the sentence meaning. The query vectors and key-value vectors are mapped to an output vector that assigns a weight to each word: vital information is assigned a larger weight, and features which may not be helpful for classification are assigned a smaller weight. In addition, feature selection is leveraged to reserve the notional words such as nouns, verbs and adjectives, which hold more information than other functional words; this reduces the possible negative influence of redundant information. In this way, the keywords in short texts with few words can be recognized for classification.
For the second problem, Chinese character embedding and feature selection are conducted for representation learning. Character embedding avoids the errors generated by word segmentation and makes full use of the highly meaningful Chinese character information. Meanwhile, the semantic similarity between texts and class label information is calculated for feature selection; the results are sorted, and the last few characters are considered interfering words and removed. As feature selection is the process of extracting feature subsets from the original dataset to reduce the time complexity of the algorithm, it can effectively identify keywords and improve classification accuracy [13].
In summary, we integrate the attention mechanism and feature selection via character embedding for the classification of Chinese short texts with few words (AFC for short). The Chinese character embedding vectors are first computed as the text representation; then the attention mechanism is applied to assign each word a different weight, so that a word useful for classification receives a larger weight and vice versa. Meanwhile, the semantic similarity between the content and the class label information is computed; all words are then sorted, and the last words in the list are deleted for feature selection. After that, the weighted vectors are aligned based on feature selection. The main contributions of this paper are summarized as follows: • Chinese character embedding is utilized for representation learning, which overcomes the deficits of word segmentation and takes full advantage of the significant information in characters.
• The attention mechanism is utilized to weight the characters of a text to further weaken noise, which enhances the influence of keywords in classification.
• Feature selection is leveraged to reduce the impact of irrelevant information and to align sentence vectors for text classification.
• Comprehensive experiments over three real data sets show that our method outperforms state-of-the-art models and demonstrate its effectiveness. The remainder of this paper is organized as follows. Related work is introduced in Section II and details of the proposed AFC method are provided in Section III. Experimental results and analysis on three real-world datasets are presented in Section IV, followed by our main conclusions in Section V.

II. RELATED WORK
Short text classification has played a crucial role in many applications of Natural Language Processing (NLP), providing proper document management services. It aims to process texts whose length is very short, typically no longer than 100 characters, such as blog content, online reviews and news titles. Due to the sparseness, noise and non-standard expression of short texts, traditional text classification methods usually fail to achieve satisfactory performance. In recent years, short text classification has attracted much attention from multiple disciplines. In this section, we survey the related work on short text classification, including representation learning based methods, feature selection based methods and attention based methods.

A. REPRESENTATION LEARNING BASED METHODS
The feature representation learning of text is the fundamental problem of short text classification, and is in essence the main intuition behind deep learning methods [14]. Mikolov et al. proposed the Word2vec language model [15], a distributed representation of vocabulary based on neural networks. Peters et al. proposed Embeddings from Language Models (ELMo) [16], a new type of deep contextualized word representation; this model addresses both the challenge of modeling the characteristics of word use and that of words having different meanings across linguistic contexts. The vectors are derived from a bidirectional LSTM trained with a coupled language model objective on a large-scale corpus. Devlin et al. proposed a language representation model called Bidirectional Encoder Representations from Transformers (BERT) [17], which utilizes masked language modeling and next sentence prediction to pre-train a deep bidirectional transformer and text-pair representations. Although the aforementioned models can obtain better representation vectors of words, it is difficult for them to achieve satisfactory performance on small-scale data or to be applied directly to Chinese short texts with few words.
Different from English text, Chinese text is made up of character sequences rather than word sequences, and the word is not an uncontroversial natural unit. Therefore, there have recently been some efforts devoted to applying character embedding to Chinese NLP tasks. For example, Zhao [18] initially investigated the possibility of exploiting character dependencies for Chinese, showing character-level dependency to be a good alternative to word boundary representation. Zhang et al. [19] developed a character-based parsing model that can produce character-level constituent trees for Chinese NLP, and demonstrated the importance and effectiveness of character-level information in Chinese parsing. Sun et al. [20] proposed a method that utilizes radicals for learning Chinese character embeddings, and applied it to Chinese character similarity judgement and Chinese word segmentation. Li et al. [21] presented a character-level neural dependency parser together with a character-level dependency treebank for Chinese, showing that Chinese character embedding plays an important role in NLP task performance. However, most previous character embedding methods combine word-level and character-level features for NLP tasks; classifying Chinese short texts with few words by character embedding alone remains a challenge.

B. ATTENTION BASED METHODS
While attention mechanism based methods have achieved tremendous success in many fields such as natural language processing [22], more and more researchers strive to develop attention based methods for short text classification [23], [24]. Li et al. proposed [25] two-level attention networks to identify the sentiment of short texts: both local and long-distance dependent features are captured simultaneously by the attention mechanism, and the attention-based features are then utilized to capture more relevant features. Hu et al. proposed [26] a heterogeneous graph neural network for semi-supervised short text classification. In this method, a dual-level attention mechanism, which includes node-level and type-level attention, is utilized to learn the importance of neighbour nodes and different types to the current node. Meanwhile, the self-attention mechanism has been successfully applied to many NLP tasks; Wang et al. [27] proposed a densely connected convolutional neural network (CNN) with multi-scale feature attention, which can produce variable n-gram features to adaptively select multi-scale features for short text classification.
In recent years, due to the special properties of Chinese text, many efforts have been made to solve the problem of Chinese short text classification. For example, Zhou et al. proposed [28] hybrid attention networks for Chinese short text classification, where word-level and character-level features are utilized to capture class-related attentive representations and improve classification performance. Lu et al. proposed [29] a multi-representation mixed model with attention and ensemble learning for Chinese news headline classification. However, few methods focus on the classification of Chinese short texts with few words, such as news headlines whose average length is usually fewer than 20 characters. The extremely short length, the lack of explicit word delimitation and the identification of keywords remain huge challenges for classification.

C. FEATURE SELECTION BASED METHODS
The objective of feature selection is to prepare cleaner and more understandable data for building simpler and more comprehensible models, improving the performance of data mining and machine learning [30]. Along this line, many works extend basic feature selection with external knowledge or simply optimize the feature representation in NLP tasks. Among the methods that take an external knowledge base into consideration, Liu et al. proposed [13] a feature selection method based on part-of-speech and HowNet for micro-blog mining: words carrying a larger amount of information according to their part of speech are selected based on HowNet, a common knowledge base which describes concepts represented in Chinese and English and reveals the relationships between concepts and their attributes. Méndez et al. proposed [31] a feature selection method based on semantic ontology clustering to build feature vectors, where words are grouped into topics for training different machine learning tasks.
In short text classification, feature selection has proven to be an effective and efficient way to process data. Meng et al. proposed [32] a feature selection method to solve the sparseness problem of short text classification; it considers the number of short texts that share words of the same importance, and increases the intersections among different short texts' feature vectors. Tommasel et al. proposed [33] an online short text feature selection method for social media, which focuses on discovering implicit relations among new posts, already known ones and their corresponding authors to identify groups of socially related posts in short text streams. Liu et al. applied [34] four different feature selection algorithms to the multi-class sentiment classification problem; the experimental results show that feature selection is an effective method to improve classification accuracy and that more features do not necessarily lead to better results.

III. PROPOSED METHOD
The whole framework of our proposed AFC is illustrated in FIGURE 1. Our method contains three components: Chinese character embedding, the attention mechanism and feature selection. The motivation and details of our proposed method are described as follows.

A. MOTIVATION
A Chinese short text with few words refers to a Chinese text with a much shorter length than traditional normal text or even ordinary short text. For instance, the length of a Chinese news title is usually required to be fewer than 20 characters (e.g. "Intel: no worries about talent backup"), and the average length of a Chinese invoice name is also about 20 characters. Meanwhile, these Chinese short texts with few words are usually an overview of the whole content: a title is a summarization of the news, and an invoice name is a generalization of the bill. However, in contrast to short text classification, Chinese short texts with few words have attracted far less attention in recent decades.
The extremely short length of Chinese texts with few words makes feature representation learning and keyword identification difficult. More specifically, first, different from English and other western languages, a Chinese sentence is made up of character sequences rather than word sequences, and there is no delimiter between Chinese words. Since Chinese character embedding has been proven to be an effective representation method, and given the extremely short length of Chinese texts with few words, we propose to learn Chinese character embeddings alone, which distinguishes our proposed method from the methods in [9], [28]. Second, the extremely short length and the few Chinese words in our task make keyword identification important, as the classification of a text is usually determined by even a single keyword. The attention mechanism in our proposed method assigns larger weights to the words that have closer semantic relations to the sentence meaning.
Finally, the feature selection method is applied to remove meaningless and disturbing features in Chinese texts with few words, which can improve the quality of feature vectors and help to identify keywords. Through the reservation of notional words and semantic similarity calculation, the feature selection method reduces the possible negative influence of interfering words and redundant information, which is the main difference between our proposed method and the methods in [9], [28].

B. CHARACTER EMBEDDING REPRESENTATION
Chinese texts differ from English in that each character in a word carries meaning, and there is no explicit word delimitation in Chinese sentences. The traditional way to learn feature representations is to perform Chinese word segmentation and treat the segmentation result as the input of the learning model [35]. Although the errors caused by word segmentation may be detrimental to performance, word embedding has been proven effective in Chinese NLP tasks such as Named Entity Recognition (NER) [36]. On the other hand, character embedding representations have been used in NLP tasks such as word relatedness computation [37], Part-Of-Speech (POS) tagging [38] and short text classification [28]. In addition, Yin et al. [39] demonstrated that characters contain rich information and can be effectively applied to word similarity computation and analogical reasoning, showing that character features are capable of indicating the semantic meanings of words and significantly improve performance. Therefore, character embedding methods can achieve sound performance in Chinese NLP tasks [40], [41].
However, the feature information in a Chinese short text with few words, such as a news headline or invoice name, is much less than in a document or even an ordinary short text, and the effect of word embedding for indicating semantic meanings and reducing sparsity is limited. In addition, since Chinese texts are written as continuous sequences of characters, word segmentation may introduce errors, as mentioned above. Therefore, we introduce character embedding based on word2vec [15] for feature representation learning. The aim of the word2vec tool is to map each Chinese character in a text into a vector. Due to the extremely short length of Chinese short texts with few words, we set the window parameter of the word2vec tool to 1, which means that we only consider one character to the left and one to the right.
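To make concrete what a window of 1 means for character-level training, the following pure-Python sketch (a simplified illustration, not the word2vec implementation itself) enumerates the (center, context) character pairs such a model would train on:

```python
def char_context_pairs(text, window=1):
    """Enumerate (center, context) character pairs for skip-gram-style
    training over a Chinese string, treating each character as a token."""
    chars = list(text)
    pairs = []
    for i, center in enumerate(chars):
        # only characters within `window` positions of the center qualify
        for j in range(max(0, i - window), min(len(chars), i + window + 1)):
            if j != i:
                pairs.append((center, chars[j]))
    return pairs

# With window=1, each character pairs only with its immediate neighbours.
pairs = char_context_pairs("英特尔", window=1)
print(pairs)  # [('英', '特'), ('特', '英'), ('特', '尔'), ('尔', '特')]
```

With window=1, each training pair couples a character only to its adjacent characters, which matches the extremely short texts targeted here; a real run would feed such pairs (via the word2vec tool) into skip-gram or CBOW training.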
Suppose the Chinese short text is denoted as D = {d_1, d_2, ..., d_l}, where l is the number of characters in the text and Y is the set of text labels. Each character can be represented as a vector d_i ∈ R^M, where R^M is the short text space and M denotes the attribute dimensionality. For an input Chinese short text with few words, the character embedding vectors are first encoded by the word2vec model, and these vectors are then utilized as the input of the attention mechanism.

C. ATTENTION MECHANISM
On the basis of the character embedding representation, we assign different weights to different embedding vectors, based on the observation that not all words contribute equally to classification. The aim of the attention mechanism is to focus on the words which have closer semantic relations to the meaning of the short text. In our proposed method, an attention-based LSTM is utilized to map vectors into the feature space with weighting, which gives the keywords more subtle values in classification.

Suppose the output of the LSTM is denoted as H = {h_1, h_2, ..., h_l}. Similar to the character embedding representation, each h_i can be represented as a vector h_i ∈ R^N, where N is the attribute dimensionality. The new representation z_a can be computed as the sum of the weighted vectors h_i, as shown in (1):

z_a = Σ_{i=1}^{l} α_i h_i                         (1)

where α_i denotes the attention weight, which can be computed as (2):

α_i = g(W_2 f(W_1 h_i + b))                       (2)

where f and g are the activation functions of the network (we use the sigmoid and softmax functions respectively, with the softmax normalized over all positions i), W_1 and W_2 are the weight matrices and b denotes the bias vector of the model.
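As an illustration, the weighted pooling of equations (1) and (2) can be sketched in pure Python as follows; the toy dimensions and parameter values are hypothetical stand-ins, not the paper's trained model:

```python
import math

def attention_pool(H, W1, W2, b):
    """Compute z_a = sum_i alpha_i * h_i, where
    alpha = softmax over positions i of W2 . sigmoid(W1 h_i + b)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # unnormalized score for each hidden state h_i (W1: K x N, W2: K, b: K)
    scores = []
    for h in H:
        u = [sigmoid(sum(W1[k][n] * h[n] for n in range(len(h))) + b[k])
             for k in range(len(b))]
        scores.append(sum(W2[k] * u[k] for k in range(len(u))))

    # softmax over positions gives the attention weights alpha_i
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # equation (1): weighted sum of the hidden states
    n_dim = len(H[0])
    z_a = [sum(alpha[i] * H[i][n] for i in range(len(H))) for n in range(n_dim)]
    return z_a, alpha

# toy example: three 2-dimensional hidden states, K = 1
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W1 = [[0.5, -0.5]]
W2 = [1.0]
b = [0.0]
z_a, alpha = attention_pool(H, W1, W2, b)
```

The weights alpha sum to 1, and hidden states with larger scores contribute more to z_a, which is how the mechanism emphasizes likely keywords.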

D. SHORT TEXT FEATURE SELECTION
Feature selection is the process of removing meaningless and disturbing features from the original feature set to improve performance; it greatly increases text classification accuracy by extracting the most representative feature subsets [33]. Since not all words have a positive effect on classification, especially in Chinese short texts with few words, the reservation of notional words and semantic similarity calculation are adopted to improve the quality of feature vectors and help identify keywords. Different from traditional feature selection methods applied in text processing, we propose to select features in Chinese short texts with few words in the following steps. First, we observe that notional words such as nouns, verbs and adjectives hold more information than other functional words, which only serve to make the whole sentence fluent and complete [13]. Therefore, we propose to reserve only nouns, verbs and adjectives in each text, where part-of-speech tagging is implemented with the Jieba Python library.
Second, since the existence of interfering words and word ambiguities causes difficulties in language comprehension and results in unsatisfactory performance, we apply word embedding based on word2vec to calculate the semantic similarity between each word in the short text and all class label information; the results are then sorted and the last words removed. More specifically, one additional word is removed for every increase of 5 in sentence length, and no words are removed when the sentence length is less than 5. The process of feature selection is shown in FIGURE 2. For the representation of the hybrid method, we combine the outputs of the attention mechanism with feature selection and obtain the sentence vectors z. The output vectors can be computed as (3):

z = FS(z_a)                                       (3)

where FS denotes the process of selecting characters based on the result of feature selection.
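The two-step selection described above can be sketched as follows. This is a simplified illustration: the POS tagger and similarity function are hypothetical stand-ins for Jieba's tagging and the word2vec-based similarity to class labels, and "one word removed per 5 characters of length" is our reading of the removal rule:

```python
def select_features(words, pos_tags, similarity, keep_pos=("n", "v", "a")):
    """Step 1: keep only notional words (nouns 'n', verbs 'v', adjectives 'a').
    Step 2: rank survivors by semantic similarity to the class labels and
    drop one lowest-ranked word per 5 characters of remaining length;
    drop nothing when fewer than 5 characters remain."""
    # Step 1: part-of-speech filtering keeps only notional words
    notional = [w for w, t in zip(words, pos_tags) if t in keep_pos]

    # Step 2: similarity-based pruning of likely interfering words
    length = sum(len(w) for w in notional)
    n_drop = length // 5 if length >= 5 else 0
    n_drop = min(n_drop, max(len(notional) - 1, 0))  # always keep something
    ranked = sorted(range(len(notional)),
                    key=lambda i: similarity(notional[i]), reverse=True)
    dropped = set(ranked[len(ranked) - n_drop:]) if n_drop else set()
    return [w for i, w in enumerate(notional) if i not in dropped]
```

For example, with similarity scores {"经济": 0.9, "增长": 0.8}, the functional words tagged "u" are removed in step 1, and nothing is pruned in step 2 because the remaining length is below 5.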

IV. EXPERIMENTS
In this section, we conduct extensive experiments to evaluate the effectiveness of our proposed method. Three real Chinese short text datasets are used in our experiments: two of them are Chinese news title datasets, namely the THUCNews dataset and the Toutiao dataset, and the third is a Chinese invoice dataset.
In the following, we first introduce the details of the datasets. Second, the benchmark methods and experiment settings are described in detail. Then the classification results of our proposed AFC and the competing methods are given with observations. Finally, the properties of AFC are analyzed on a specific dataset.

A. DATASETS
THUCNews Data Set 1 THUCNews is a dataset generated by collecting historical data of Sina News from 2005 to 2011 [42]. In this paper, we only pick the news titles of 200,000 articles, and 10 categories are selected for the experiments, including finance, realty, stocks, education, science, society, politics, sports, game and entertainment. More specifically, 5,000 news titles are chosen from each category for training, with 500 for testing and 500 for validation.
Toutiao Data Set 2 The Toutiao dataset is crawled from JinriToutiao, a well-known news portal website in China. In the experiment, this dataset contains 32 categories including car, education, movie, news, science and so on. More specifically, there are 43,156 news titles in the training set, 15,986 in the testing set and 4,796 in the validation set.
Invoice Data Set 3 This is a real invoice dataset from the support program of scientific and technological research in Anhui Province (No. 1704a0902029), and we have uploaded it to GitHub. There is a total of 4,200 categories, 6.5 million manually labeled records and 10 million unlabeled records, and the data format after pre-processing is shown in TABLE 1. In the experiment, 10 classes are chosen, and 4,000 invoice names are selected from each class for training and 1,000 for testing.
Finally, the statistics of all three real-world data sets are summarized in Table 2.

B. COMPARED METHODS
We compare our proposed AFC with the following baseline methods: • Bigrams + LR/SVM, which are the baselines for text classification. In the experiment, we run these algorithms with bigrams as proposed in [43], with Logistic Regression (LR) and SVM adopted as classifiers respectively.
• LibShortText [11], which is an open source library tool for large-scale short text classification. It has achieved sound performance in short text analysis.
• TextGrocery [44], which is a simple and efficient short-text classification tool based on LibLinear; the Jieba tool is embedded as the default tokenizer to support Chinese tokenization.
• Character-enhanced word embedding model (CWE) [37], which takes characters as the basic unit and learns embeddings according to the internal structures of words. It proposes multiple-prototype character embeddings and an effective word selection method.
• Compositional Recurrent Neural Network (C-RNN) [9], which is a hybrid model of character-level and word-level features based on an RNN with LSTM or Bidirectional LSTM, and tends to reconstruct the semantic information lost through word segmentation. In the experiment, we implemented this method with LSTM.
• Hybrid Attention Networks (HANs) [28]. This is a hybrid model which combines word- and character-level selective attention. The method applies an RNN and a CNN to extract the semantic features of texts, and captures class-related attentive representations from word- and character-level features.

C. EXPERIMENT SETTINGS
The hyper-parameters of the neural networks are described as follows.
In the experiments, we choose hyper-parameter settings that follow the previous studies [9], [28], [37]. In addition, the size of the LSTM hidden layer is set to 100, as is the vector size of the character embedding.
Stochastic gradient descent is utilized to train all models with a learning rate of 0.01 and a momentum of 0.9. Moreover, we train the word embeddings and character embeddings using the word2vec tool with the window parameter set to 1, and accuracy is adopted as the metric in the experiments. Table 3 shows the experimental results on the three data sets.
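For readability, the settings reported in this section can be collected in one place; the dictionary below only restates the values given above, and the key names themselves are our own:

```python
# Hyper-parameters as reported in this section (key names are illustrative).
HYPERPARAMS = {
    "lstm_hidden_size": 100,    # size of the LSTM hidden layer
    "char_embedding_dim": 100,  # character embedding vector size
    "optimizer": "sgd",         # stochastic gradient descent
    "learning_rate": 0.01,
    "momentum": 0.9,
    "word2vec_window": 1,       # one character of left/right context
    "metric": "accuracy",
}
```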

D. EXPERIMENTAL RESULTS
We have the following observations from the experimental results: • Neural network based methods (e.g., CWE and C-RNN) deliver relatively good results compared to traditional bag-of-words methods (e.g., the bigrams based baselines and LibShortText) in most cases, which reveals the ability of neural networks to capture effective semantic representations of Chinese short texts.
• We compare our proposed AFC to CWE; both methods are based on Chinese character embedding. We find that AFC performs better than CWE on all datasets. We believe that the feature selection and LSTM-based attention mechanism used in AFC can assign more subtle values to the keywords in classification.
• We also compare AFC to C-RNN and HANs. The experimental results show that AFC outperforms them on the Toutiao and Invoice datasets, and obtains a competitive result on the THUCNews dataset. We believe the reason is that the feature selection method used in AFC can help identify keywords and remove useless data. This result also demonstrates the effectiveness of our proposed method.
• Overall, on all datasets except THUCNews, our proposed AFC performs best compared to the state-of-the-art methods. On the THUCNews dataset, AFC achieves a result competitive with the best baseline method. The results prove the effectiveness of the proposed method.

E. ANALYSIS OF PROPERTIES IN AFC
To learn better feature representations for short text classification, we introduce Chinese character embedding into our proposed method. In the experiments, we use the word2vec approach (code.google.com/p/word2vec) to train character feature vectors. We apply our method to the Invoice dataset with word embedding and character embedding under different data sizes, and the classification results are shown in Table 4. We can see that the proposed AFC based on character embedding significantly outperforms the variant with word embedding, and achieves a relatively good result even with less data. This is conducive to learning powerful features from stream data in future work.

In our method, feature selection is introduced to reduce the possible negative influence of redundant information and to align sentence vectors. To analyze the influence of feature selection in our proposed method, we conduct additional experiments with and without feature selection. The vectors in AFC without feature selection are learned by sentence embedding [45]. The results are reported in Table 5. We can observe that AFC with feature selection performs more desirably than sentence embedding, which proves the importance of feature selection in our designed method.

V. CONCLUSION
In this paper, we propose a hybrid method of feature selection and attention networks for the classification of Chinese short texts with few words, called AFC. It improves upon previous methods with a novel framework that incorporates Chinese character embedding, an attention mechanism and feature selection to obtain better feature representations for short text classification. In our proposed method, character embedding representations are computed, an LSTM-based attention network is utilized to weight each word and identify keywords, and feature selection is adopted to reduce the possible negative influence of redundant information. The hybrid model can obtain better feature representations from training data on Chinese short texts with few words. Extensive experiments conducted on three real-world datasets show that the proposed method outperforms competing methods in effectiveness.
At the same time, the experimental results show that our proposed method achieves better results on Chinese short texts with few words, but when Chinese texts consist of several to dozens of words, the performance may not be satisfactory. In future work, we will try to improve classification accuracy on general Chinese short texts.