A Sentiment Analysis Method of Capsule Network Based on BiLSTM

Capsule network models are now widely used in image processing, but their feature engineering is not directly suitable for text-based sentiment analysis. In this paper, we propose a capsule network model with BiLSTM, named caps-BiLSTM, for sentiment analysis to address this problem, and report experimental results on different datasets. At the input of caps-BiLSTM, a convolutional layer transforms the instance into hidden vectors. The capsule module then constructs capsule representations of the n-gram features. The state probability of each capsule is calculated by the capsule model. If the state probability of a given instance is the largest among all capsules, a higher coupling coefficient is assigned. Finally, to fuse the data features, the output of the capsule layer is fed into a BiLSTM structure, which is used as a decoder to obtain the probability representation. Experimental results on the MR, IMDB and SST datasets show that the proposed method achieves better performance than traditional machine learning methods and the compared deep learning models.


I. INTRODUCTION
Sentiment analysis is an important task in the field of natural language processing. Its goal is to identify the emotional polarity of the original text, such as positivity, negativity or neutrality. With the development of social networks, the role of Internet users is changing. More and more people express their opinions on the Internet, and these opinions ultimately take the form of complete texts. These texts contain a large amount of usable information, which is of great use in understanding what users think when they evaluate social networks or products [1].
For a long time, sentiment analysis methods combined sentiment dictionaries with machine learning. On the one hand, the lexicon-based method analyzes the text through the position and grammar of words to determine the polarity of opinions [2], [3]. On the other hand, machine learning methods learn sentiment classification from manually annotated texts [4], [5]. In recent years, however, deep neural network structures based on convolution or recursion have become the main approach to sentiment analysis, such as the convolutional neural network (CNN) used by Kim [6] and the long short-term memory (LSTM) neural network used by Xu et al. [7].
The associate editor coordinating the review of this manuscript and approving it for publication was Shirui Pan.
As mentioned above, these methods provide different solutions to the problems of sentiment analysis, are widely used in various sentiment analysis tasks, and have achieved excellent results in performance evaluation. However, most traditional methods either rely too heavily on language knowledge or have difficulty extracting text features comprehensively. Moreover, the learning and training process after feature encoding is not effective enough in traditional models, and the accuracy of the evaluation results is less than satisfactory. Therefore, the task of sentiment analysis still faces great challenges.
To address these problems, we consider a capsule neural network integrated with BiLSTM, with the goal of extracting text features over a wider range and increasing the learning and generalization ability of the network, thereby improving the accuracy of sentiment analysis. In this paper, we propose a model named caps-BiLSTM, which first obtains word vectors that represent semantic information and then feeds these vectors into the capsule network to learn the similarity between the inputs and the targets. Finally, the resulting vectors are used as the input of the BiLSTM network to train on and predict the sentiment labels of the text. Experimental results show that our method improves accuracy over the baseline models on IMDB and SST. The main work of this paper includes the following three parts: (i) an n-gram convolutional layer extracts the features of sentences, and the capsule network captures the dependencies among words, so as to represent the rich structure of the feature sequence; (ii) a modified dynamic routing algorithm is adopted to update the weight parameters, so as to extract more accurate text information and improve the reliability of adaptive routing; (iii) BiLSTM is applied to the outputs of the capsule network for deeper learning, which increases the learning and generalization ability of the model and yields better performance.

II. RELATED WORK
Algorithms based on machine learning have shown better performance on sentiment analysis tasks than other earlier methods. Among these studies, Nakagawa et al. [8] proposed a Conditional Random Field (CRF) model based on the dependency tree, which achieved good results. Pang et al. [4] used Naive Bayes classification and maximum entropy to study the sentiment polarity of texts. With the increasing size of datasets and the release of more annotated datasets, the rapid development of deep learning methods has provided new ways to address the challenges of sentiment analysis.
Benefiting from large-scale corpora, several deep learning methods can actively learn the latent semantic and syntactic features of text with better robustness and higher accuracy, which effectively makes up for the shortcomings of manual feature engineering. The Convolutional Neural Network (CNN) [9] extracts text features effectively and has achieved a breakthrough in sentiment analysis. Liu et al. [10] conducted sentiment analysis experiments with CNN and obtained satisfactory results. However, CNN is still unsatisfactory at capturing long-distance dependencies, since it retains only the magnitude of a feature and ignores important information about feature direction and spatial position. The Recurrent Neural Network (RNN) [11] and the capsule neural network [12] have addressed these limitations.
In recent years, RNN has been applied in many fields, such as language modeling [13] and speech recognition [14]. RNN can predict outputs from long-distance characteristics by learning from previous information. Since the original RNN still has some defects, such as vanishing and exploding gradients, researchers have proposed a series of RNN variants, including Long Short-Term Memory (LSTM) [15], bidirectional LSTM (BiLSTM) and the Gated Recurrent Unit (GRU) [16]. These models overcome the gradient problem of RNN and perform better than the original RNN in many tasks. In the task of sentiment analysis, Tang et al. [17] combined LSTM with target-related information to process texts, which improved the accuracy of the results. At present, neural network models that incorporate an attention mechanism are also widely applied to sentiment analysis and have achieved better results than previous methods [18], [19].
With the vigorous development of CNN and RNN, Hinton et al. [20] put forward the concept of the "capsule" for the first time. The core idea of the capsule network is to introduce spatial relationships and position directions on top of the traditional neural network, and to identify objects by combining invariance and equivariance. Building on the work of Hinton, Sabour et al. [12] proposed the capsule network, which replaces the scalar outputs of the traditional convolutional neural network with vector-output capsules and replaces pooling with a dynamic routing algorithm. Dynamic routing adjusts the coupling coefficients, which represent connection strength, according to the similarity between the target and the input, to get better results. Inspired by the state-of-the-art performance of the capsule network on MNIST, Zhao et al. [21] introduced the capsule network into text-related tasks and verified its superior performance on multiple datasets, which was the first application of the capsule network to practical text modeling. Later, Wang et al. [22] combined the capsule network with an attention mechanism for text-based sentiment analysis and addressed the complexity of the routing calculation. This model not only pays special attention to features during training, but also effectively adjusts the parameters of different features. The capsule network can extract richer text features while retaining the position information of words, grammar and syntax, so that it effectively encodes and improves the representation of text and obtains more useful information. Some scholars [23] have summarized the current research on capsule networks, pointing out that capsule networks have few applications so far and have broad prospects for development.

III. OUR MODEL
In this paper, a capsule network model incorporating BiLSTM is proposed, which is called caps-BiLSTM, as shown in Figure 1. This model consists of two parts, namely the capsule combination module and the BiLSTM learning module. First, a sentence represented by word embeddings is input into the model. Then, after feature extraction and dynamic routing iterations, the output vector is used to predict the polarity of sentiment.
VOLUME 8, 2020

A. CAPSULE COMBINATION MODULE
The capsule network was first proposed by Sabour et al. [12] in 2017, and its first practical modeling application to text was the work of Zhao et al. [21] in 2018. Pooling in a traditional neural network deletes data that carry knowledge of features, so some features may be lost. The capsule network not only overcomes this drawback, but also offers a more intuitive way to group information hierarchically in order to solve more complex problems. In the capsule network, a vector of neurons replaces the single neuron node of the traditional neural network, and training uses the dynamic routing algorithm. Dynamic routing increases the weight of vital features and finds more hidden features, which improves the performance of the network.

B. N-GRAM CONVOLUTIONAL LAYER
The convolutional layer is an intermediate step between the text and its capsule representation, and its input is the word embedding representation of sentences. A word embedding is a distributed representation of a word, in which the words of the vocabulary are mapped to vectors. Initializing the text with word embeddings as the input of the convolutional layer can improve network performance, because the embeddings capture the grammatical and semantic information of words from a large-scale text corpus. In this work, we use the vector representation of the text obtained from the GloVe [24] word embedding, which is commonly used in natural language processing tasks, as the input of the model.
Unlike image processing, text has its own composition, hierarchy and structure, so the convolutional layer of the capsule network uses convolutional filters to extract n-gram features at different positions of the sentence. Let $x \in \mathbb{R}^{L \times V}$ denote the input sentence ($L$ is the length of the sentence, $V$ is the dimension of the word embedding), and let $x_i \in \mathbb{R}^{V}$ denote the word embedding of the $i$-th word in the sentence. $W_a \in \mathbb{R}^{K_1 \times V}$ is the convolutional filter, where $K_1$ is the length of the n-gram, that is, the length of the sliding window on the sentence. Sliding this window over the sentence one word at a time gives $L - K_1 + 1$ positions at which features are extracted.
The convolutional filter $W_a$ produces one element of a feature map $m_a$ in each window and then slides one step to the next window. Every element of this feature map is given by Equation (1).
Here, • denotes element-wise multiplication, $b_0$ is the bias, $f$ is the nonlinear activation function, and $l$ is the sliding step distance. This is how a filter extracts text features. For a total of $B$ filters with the same n-gram size, $B$ feature maps are generated and rearranged as shown in Equation (2).
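The window-by-window feature extraction described above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function name `ngram_conv`, the random inputs, and the choice to sum the element-wise product inside each window before applying ReLU are all assumptions.

```python
import numpy as np

def ngram_conv(x, filters, bias, stride=1):
    """Slide B filters of shape (K1, V) over a sentence matrix x of shape (L, V).

    Returns a matrix with one row per window and one column per filter.
    ReLU is assumed as the activation f (Section IV-B reports ReLU).
    """
    L, V = x.shape
    B, K1, _ = filters.shape
    n_windows = (L - K1) // stride + 1
    M = np.empty((n_windows, B))
    for i in range(n_windows):
        window = x[i * stride : i * stride + K1]             # (K1, V)
        # Element-wise product with every filter, summed over the window.
        pre_activation = (filters * window).sum(axis=(1, 2)) + bias
        M[i] = np.maximum(0.0, pre_activation)               # ReLU
    return M

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 300))                   # L = 20 words, V = 300
filters = rng.normal(size=(32, 4, 300)) * 0.01   # B = 32 filters, K1 = 4
bias = np.zeros(32)
M = ngram_conv(x, filters, bias)
print(M.shape)  # (17, 32): L - K1 + 1 = 17 windows, B = 32 feature maps
```

Each column of `M` corresponds to one feature map, and each row collects the B features of one n-gram window.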

C. CAPSULE MODULE
The input of the original capsule layer comes from the output $M$ of the convolutional layer, and each capsule outputs a vector instead of the scalar used in a traditional neural network, so as to retain the instantiation parameters. First, the activation function transforms the feature vector $M_i$ of each n-gram sliding window into the corresponding feature capsule $u_i$, as shown in Equations (3)-(4).
Here, $W_b$ is the filter shared by different sliding windows, $b_1$ is the capsule bias, and $g$ is the nonlinear activation function. $W_{ij}$ represents the weight matrix relating the input layer to the output layer. The capsule network updates the coupling coefficients through the dynamic routing process. The coupling coefficient between adjacent capsules depends on their similarity, and similar capsules have a larger coupling coefficient. The dynamic routing strategy is more efficient than the basic routing in CNN, ensuring that the output of each capsule is sent to the corresponding parent in the subsequent layer.
Given each prediction vector $u_{j|i}$ and its probability of existence $a_{j|i}$, the purpose of the dynamic routing iterations is to update the coupling coefficient $c_{ij}$, which represents the similarity between the input vector and the target value, and to assign a higher weight $c_{ij}$ when $v_j$ and $u_{j|i}$ are closer to each other. The update equation for the coupling coefficient is based on Zhao et al. [21]. The initial value of $b_{ij}$ is 0. The dynamic routing process is shown in Equations (5)-(8). Equation (7) is the compression function, which squeezes the modulus of the input vectors into [0,1).
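To make the routing loop concrete, the following NumPy sketch implements the basic routing-by-agreement procedure of Sabour et al. [12] with the squashing (compression) function. It is a simplified stand-in, not the modified routing used in this paper: the existence probabilities $a_{j|i}$ of the variant of Zhao et al. [21] are omitted, and the names `squash`, `softmax` and `dynamic_routing` as well as the random inputs are assumptions made for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Compression function: keeps direction, maps the norm into [0, 1)."""
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_routing(u_hat, n_iter=3):
    """Route prediction vectors u_hat of shape (n_in, n_out, d).

    Returns the output capsules v (n_out, d) and the coupling
    coefficients c (n_in, n_out) used in the last iteration.
    """
    n_in, n_out, _ = u_hat.shape
    b = np.zeros((n_in, n_out))                    # routing logits start at 0
    for _ in range(n_iter):
        c = softmax(b, axis=1)                     # couplings over output capsules
        s = (c[:, :, None] * u_hat).sum(axis=0)    # weighted sum per output capsule
        v = squash(s)                              # output capsule vectors
        b = b + (u_hat * v[None, :, :]).sum(axis=-1)  # agreement raises the logit
    return v, c

rng = np.random.default_rng(0)
u_hat = rng.normal(size=(17, 32, 16))   # 17 input capsules, 32 outputs of size 16
v, c = dynamic_routing(u_hat, n_iter=3)
print(v.shape, c.shape)  # (32, 16) (17, 32)
```

Prediction vectors that agree with an output capsule receive a larger logit and therefore a larger coupling coefficient in the next iteration, which is the behavior the text describes.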

D. BILSTM MODULE
LSTM is a popular recurrent neural network structure designed to capture long-term dependencies better than a standard RNN. It addresses the long-term dependency problem of RNN by introducing internal gates that mitigate the gradient problem, which makes it better at discovering and exploiting dependencies in long-distance data. As a recurrent network, LSTM computes an output vector from the current input and the output of the previous unit, and this output is then used as the input of the next unit. Finally, the output of the hidden layer is used for classification. Each LSTM unit is mainly composed of four parts: input gate $i_t$, output gate $o_t$, forget gate $f_t$ and memory unit $c_t$. The internal structure of the LSTM module is shown in Figure 2.
Here, $x_t$ represents the input at time $t$, $\sigma$ is the sigmoid activation function, and $\odot$ denotes element-wise multiplication. The memory unit $c_t$ reduces the problem of vanishing or exploding gradients, so LSTM can learn longer-range dependencies. The forget gate $f_t$ resets the memory unit, and $i_t$ and $o_t$ are the input and output gates, which control the input and output of the memory unit. However, a single LSTM cannot encode information from back to front. To access both backward and forward features at a given time, the bidirectional LSTM network was proposed. A forward LSTM and a backward LSTM are combined into BiLSTM, which provides additional context information and has better and faster learning ability. The structure of BiLSTM is shown in Figure 3. In this module, BiLSTM learns from the output of the capsule layer, in order to enhance the fitting of the network features and the generalization ability on new datasets.
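The gate mechanism and the bidirectional combination can be sketched in a few lines of NumPy. The sketch uses the standard LSTM formulation, which may differ in detail from the equations of this paper; the stacked parameter layout `[i, f, o, g]`, the helper names `init_params`, `lstm_step`, `run_lstm` and `bilstm`, and the random inputs are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(D, H, rng):
    """Hypothetical initializer: W (4H, D), U (4H, H), b (4H,)."""
    return (rng.normal(scale=0.1, size=(4 * H, D)),
            rng.normal(scale=0.1, size=(4 * H, H)),
            np.zeros(4 * H))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; gate pre-activations are stacked as [i, f, o, g]."""
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])              # input gate
    f = sigmoid(z[H:2 * H])          # forget gate
    o = sigmoid(z[2 * H:3 * H])      # output gate
    g = np.tanh(z[3 * H:4 * H])      # candidate memory
    c_t = f * c_prev + i * g         # memory unit update
    h_t = o * np.tanh(c_t)           # hidden output
    return h_t, c_t

def run_lstm(xs, params, H):
    h, c = np.zeros(H), np.zeros(H)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, *params)
        outputs.append(h)
    return np.stack(outputs)

def bilstm(xs, fwd_params, bwd_params, H):
    """Concatenate forward states with time-aligned backward states."""
    h_fwd = run_lstm(xs, fwd_params, H)
    h_bwd = run_lstm(xs[::-1], bwd_params, H)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=-1)

rng = np.random.default_rng(0)
capsules = rng.normal(size=(32, 16))     # e.g. 32 capsule vectors of size 16
fwd = init_params(16, 128, rng)
bwd = init_params(16, 128, rng)
states = bilstm(capsules, fwd, bwd, H=128)
print(states.shape)  # (32, 256)
```

At every position the output holds a forward state that has seen the preceding inputs and a backward state that has seen the following ones, which is the additional context that a single LSTM lacks.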
For the output of BiLSTM, the softmax activation function is used for classification, as shown in Equation (15).
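As a brief illustration of this classification step, the sketch below applies a linear layer followed by a numerically stable softmax to a feature vector. The function names, the linear layer and the toy weights are assumptions; the paper does not specify how the BiLSTM output is projected to class scores.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classify(h, W, b):
    """h: feature vector; W: (n_classes, dim); b: (n_classes,)."""
    probs = softmax(W @ h + b)
    return int(np.argmax(probs)), probs

# Toy example with two classes and a 2-dimensional feature vector.
label, probs = classify(np.array([0.0, 2.0]), np.eye(2), np.zeros(2))
print(label)  # 1
```

The predicted label is the class with the largest probability, and the probabilities sum to one by construction.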

IV. EXPERIMENTS

A. DATASETS
To evaluate the proposed method, we conducted experiments on three benchmark datasets: Movie Review [25] (MR), Internet Movie Database [26] (IMDB) and Stanford Sentiment Treebank [27] (SST). These datasets are widely used in sentiment analysis. The experimental results are compared with the published results of other algorithms to make the comparison objective. MR, from Cornell University, is a collection of English film reviews gathered from www.rottentomatoes.com, including 5331 positive and 5331 negative reviews. The average sentence length is 20.
IMDB consists of 50000 highly polarized Internet movie reviews. The dataset contains 25000 training samples and 25000 test samples, with 25000 positive and 25000 negative reviews in total. The average sentence length is 294.
SST is an extension of the MR dataset that provides predefined training, validation and test sets, with 11855 sentences in total. According to the sentiment tendency of the sentences, the labels are divided into five categories: very positive, positive, neutral, negative and very negative. The average sentence length is 19.

B. PARAMETER SETTINGS
In the text initialization stage of this experiment, pre-trained GloVe word embeddings are used to transform the text into vectors, and each word vector has 300 dimensions. We obtained the best parameter values through repeated training. In the initial convolutional layer, the size of the sliding window is set to 4, so 4-gram features of the sentence are extracted; the number of filters is 32; and the activation function is ReLU. In the capsule layer, the size of each capsule is set to 16, the number of capsules is 32, and the number of dynamic routing iterations is 3. The number of neurons in the BiLSTM layer is set to 128. To reduce the training loss, the Adam optimizer is selected with a learning rate of 0.001. The batch size is 128. To prevent overfitting, a dropout layer is added with a parameter of 0.5. To reduce the training time and ensure that training stops in time when overfitting occurs, an early stopping mechanism is introduced. The maximum number of epochs is set to 50. When the validation loss does not improve for 10 consecutive epochs, training is stopped. The experimental environment is shown in Table 1.
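For reference, the settings listed above can be gathered into a single configuration object. The class name `CapsBiLSTMConfig` and the field names are hypothetical; only the values come from the text, and the dropout parameter is assumed to be the dropout rate.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CapsBiLSTMConfig:
    """Hyperparameters reported in this section (field names are illustrative)."""
    embedding_dim: int = 300           # GloVe word vector dimension
    ngram_size: int = 4                # sliding window size (4-gram)
    num_filters: int = 32              # filters in the convolutional layer
    conv_activation: str = "relu"
    capsule_dim: int = 16              # size of each capsule
    num_capsules: int = 32
    routing_iterations: int = 3
    bilstm_units: int = 128
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 128
    dropout: float = 0.5               # assumed to be the dropout rate
    max_epochs: int = 50
    early_stopping_patience: int = 10  # epochs without validation improvement

cfg = CapsBiLSTMConfig()
print(cfg.ngram_size, cfg.routing_iterations)  # 4 3
```

Keeping the values in one immutable object makes it easier to vary a single setting, such as the number of routing iterations, while holding the others fixed.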

C. COMPARISON OF CLASSIFICATION ACCURACY
The evaluation metric of this experiment is accuracy, defined in Equation (16) as $\text{Accuracy} = T / N$, where $T$ represents the number of correctly classified samples and $N$ represents the total number of samples. We compare our proposed method caps-BiLSTM with the following method: SVM [4] (Kim et al., 2020). The sentiment analysis experiment was carried out on the above three datasets. Table 2 presents the experimental results on all the datasets. The results show that our proposed method outperforms the compared models on IMDB and SST, and that the deep learning methods are superior to the machine learning method represented by SVM. On the MR dataset, the result is slightly lower than that of LR-LSTM but still better than the other seven models. These results arise because the n-gram model extracts information from the text well, the dynamic routing algorithm of the capsule network finds the capsules most similar to the target value, and BiLSTM then learns the context characteristics of the sequence and classifies accordingly. On IMDB, the accuracy of caps-BiLSTM is at least 1.12% higher than that of ToWE-CBOW, while on MR and SST its accuracy is not significantly different from that of baseline models such as Tree-LSTM and CapsNet-dynamicrouting, indicating that our model performs better on long texts than on short texts.
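The accuracy metric, the number of correctly classified samples divided by the total number of samples, can be computed as in the short sketch below. The function name and the toy labels are illustrative and are not taken from the experiments.

```python
def accuracy(y_true, y_pred):
    """Return T / N: correctly classified samples over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```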

D. VALIDATION LOSS ANALYSIS
The loss of caps-BiLSTM on the validation set of each dataset (val_loss) changes with the epoch as shown in Figure 4. The loss of the model reaches its minimum around the 5th to 7th epoch. After that, overfitting occurs in training, and the accuracy decreases accordingly. Because the early stopping mechanism is added, training stops once 10 epochs have passed without improving on the minimum loss, and the best classification result is kept. In this way, the best model in training can be found automatically without tuning the number of epochs by hand.
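The early stopping rule can be sketched as a small function over a sequence of validation losses. The function name and the synthetic loss curve are assumptions chosen to mimic a minimum at epoch 6; they are not the measured values behind Figure 4.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops at the first epoch that lies `patience` epochs past
    the best validation loss seen so far, or after the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_losses), best_epoch

# Synthetic curve: loss falls until epoch 6, then rises as overfitting sets in.
losses = [0.60, 0.50, 0.45, 0.42, 0.40, 0.39] + [0.40 + 0.01 * k for k in range(44)]
print(early_stopping_epoch(losses, patience=10))  # (16, 6)
```

With a patience of 10, training on this curve ends at epoch 16 and the model from epoch 6 is kept, well before the 50-epoch limit.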

E. ANALYSIS OF DYNAMIC ROUTE ITERATIONS
The accuracy of the model on each dataset varies with the number of dynamic routing iterations, as shown in Figure 5. When the number of iterations is 3, all datasets achieve the highest accuracy. This is because dynamic routing cannot fully learn the characteristics of the data when the number of iterations is less than 3, and it may cause overfitting when the number of iterations is more than 3; both reduce the accuracy.

V. CONCLUSION AND FUTURE WORK
In this paper, we propose a BiLSTM-based capsule network model for sentiment analysis, which mainly improves the performance of the capsule network. Comparison with a machine learning method and several deep learning methods demonstrates the validity of the caps-BiLSTM network in feature extraction and its accuracy in classification. The model performs well on the sentiment analysis task. However, its generalization ability on new datasets is still weak. Consequently, improving the generalization ability of the model and reducing the validation loss are the main goals of future work.
From 2007 to 2009, she held an assistant position at the Hebei University of Technology. Since 2010, she has been an Experimentalist with the School of Artificial Intelligence, Hebei University of Technology. She is the author of more than ten articles and over ten inventions. Her research interests include intelligent information processing-based big data, deep learning in image processing and applications, and representation learning in knowledge graph.
YUNLIANG CHEN received the B.Sc. and M.Eng. degrees from the China University of Geosciences and the Ph.D. degree from the Huazhong University of Science and Technology, China. He is currently an Associate Professor with the School of Computer Science, China University of Geosciences, Wuhan, China. His research interests include data mining, cloud computing, and the IoT.
YAO DONG is currently a Senior Experimentalist with the School of Artificial Intelligence, Hebei University of Technology. She is a member of the China Computer Federation. Her interests mainly include artificial intelligence and robot path.
JIANXIN LI is currently an Assistant Professor with the School of Information Technology, Deakin University, Australia. He has published 70 high-quality research articles in top international conferences and journals, including PVLDB, IEEE ICDE, ACM WWW, IEEE ICDM, EDBT, ACM CIKM, IEEE TKDE, and ACM WWW. His research interests include social computing, query processing and optimization, big graph data analytics, and educational data computation.