Sentiment Analysis With Comparison Enhanced Deep Neural Network

Sentiment analysis is a significant task in Natural Language Processing. It classifies text according to its emotional tendency by extracting text features. Existing results show that models based on RNN and CNN perform well. To improve the performance of text sentiment analysis, we reformulate the classification task as a comparison problem and propose Comparison Enhanced Bi-LSTM with Multi-Head Attention (CE-B-MHA). Classifying by a comparison mechanism can be more efficient than relying on complex calculation. In this model, bidirectional LSTM is used for initial feature extraction, and valuable information is extracted from different dimensions and representation subspaces by Multi-Head Attention. The comparison mechanism scores the feature vectors by comparing them with the vectors of labeled samples. The experimental results show that CE-B-MHA performs better than many existing models on three sentiment analysis datasets.


I. INTRODUCTION
Today, countless text messages are produced on the internet every day. On social media, people express their opinions in text form, and film reviews on review pages are likewise presented as text. This text, large in quantity and rich in content, can become an important data resource.
The short texts published on various platforms contain strong emotional tendencies and reflect the diverse views held by users. Sentiment analysis of these massive text messages is of great value to various industries. For example, by analyzing citizens' different attitudes towards the same news event, the government can understand the public's opinions on social events and related policies. By analyzing the user's attitude towards a product function, the manufacturer can improve the product easily. The potential value of sentiment analysis attracts extensive attention from researchers in different fields, such as data mining and natural language processing. Sentiment analysis for user-generated text has become a research hotspot in relevant fields.
The associate editor coordinating the review of this manuscript and approving it for publication was Huazhu Fu.
Sentiment analysis [1], also known as opinion mining [2], is a research field that analyzes people's subjective feelings such as emotions, evaluations, opinions and attitudes towards products, services, organizations, individuals, events, subjects and their attributes. The text sentiment analysis task is one of the most important tasks in the field of natural language processing. Neural network models based on Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) have been found to extract the context relations of sentences effectively. The classification accuracy of such models can be improved by introducing an attention mechanism to filter irrelevant information.
The attention mechanism was first proposed in the field of computer vision [3] and gradually entered the field of natural language processing [4]. Its purpose is to allow the model to focus on important objects, which is mathematically represented as a weighted sum. The Multi-Head Attention mechanism used in this paper was proposed in "Attention Is All You Need" [5], published by the Google machine translation team in 2017.
Sequential information is particularly important for natural language processing tasks. It represents the logic and structure of the text. The Multi-Head Attention mechanism can obtain global structure, but it struggles to capture sequence information effectively. In this paper, Bi-LSTM is introduced to capture sequence information, so that the model has access to the order of the text.
LSTM is a special type of recurrent neural network [6] that has achieved great success in solving many problems. The LSTM network introduces a self-cycling mechanism that makes it easier to learn long-term dependencies than a simple cycling structure. The bidirectional LSTM (Bi-LSTM) model feeds the sequence into the LSTM network in both the forward and backward directions, which can better capture the contextual order information of the sequence [7].
Existing sentiment analysis models based on deep neural networks have little to do with psychology and have to use too many parameters. On the basis of a deep neural network, we introduce the comparison mechanism, a psychological concept, to enhance the learning ability of the model: it relies on comparisons between texts instead of on a large number of parameters. Comparison is the most intuitive and effective way of classification in people's daily life. People always learn new things by comparing. In this paper, we reformulate the sentiment analysis task as a comparison problem. The sentiment analysis task aims to train a model that maps the item feature to a score. We obtain the score by comparing the items with labeled samples, instead of fitting hard-to-learn patterns.
In this paper, Comparison Enhanced Bi-LSTM with Multi-Head Attention (CE-B-MHA) is proposed to improve the performance of sentiment analysis. CE-B-MHA combines the ability of Multi-Head Attention to obtain global information with the ability of Bi-LSTM to obtain local sequence information, and enhances it with a comparison mechanism.
The main contributions of our work can be summarized as follows:
• We combine Bi-LSTM with Multi-Head Attention, and propose a new model with good performance. We use Bi-LSTM to capture the context relationship, and use Multi-Head Attention to capture long-distance text features.
• We propose a text sentiment analysis model, namely CE-B-MHA. On the basis of a deep neural network constructed with Bi-LSTM and Multi-Head Attention, we use a comparison mechanism to enhance the model and improve its performance.
• We use public datasets to verify the effectiveness of our model, and compare it with current methods. Experimental results show that CE-B-MHA improves performance compared with several baselines.
The rest of our paper is structured as follows: Section II discusses related works, Section III gives details of our approach, Section IV gives our experimental results, and Section V summarizes this work.

II. RELATED WORK
A. SENTIMENT ANALYSIS TASK
The sentiment analysis task can be divided into binary classification task and multi-class classification task according to the different classification objectives. In many cases, the researchers divided the emotional polarity of the text into positive and negative categories, commonly known as binary classification task. Generally, positive text indicates positive emotional tendency, while negative text indicates negative emotional tendency. This simple classification method can be applied in many real situations, such as analyzing whether users are favorable or unfavorable to a certain commodity and literary works, and public opinion towards a certain social event. In the research process, binary classification is also a common task to evaluate the classification ability of models.
In addition, the multi-class classification task is also an important direction. It can be divided into emotional level classification and fine-grained emotion classification. Emotional level classification divides the emotional tendency of a text into several levels ranging from negative to positive. For example, rating emotions on a scale from 1 to 5 is a five-class classification problem. Fine-grained emotion classification assigns text to categories of emotion. There is no common criterion for these categories, and they are generally chosen according to the research question, for example joy, anger, sadness, surprise, disgust, fear and neutrality.

B. SENTIMENT ANALYSIS METHODS
The methods used in text sentiment analysis can be divided into two categories: the emotion dictionary method and the machine learning method. Constructing an emotion dictionary and using it as a tool is the traditional method to judge the emotional polarity of texts [8]. Most emotion dictionaries need to be constructed manually. The principle is to collect words with emotional tendency into a dictionary. Emotional words have strong indicating ability and are one of the important signals that a text contains emotional tendency. When a text is entered, it is matched against the contents of the dictionary, and the emotional words found in the text determine its emotional polarity. However, the emotion dictionary method has some limitations. It covers too few forms of emotional expression and cannot keep up with emerging forms of expression, which makes the accuracy of its emotional judgment relatively low.
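The matching procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two word sets are a hypothetical toy lexicon rather than a dictionary from the literature, and polarity is decided by a simple count of matched words.

```python
import re

# Hypothetical toy lexicon; a real emotion dictionary would be far larger.
POSITIVE_WORDS = {"good", "great", "excellent", "love", "wonderful"}
NEGATIVE_WORDS = {"bad", "terrible", "awful", "hate", "boring"}


def lexicon_polarity(text):
    """Return 'positive', 'negative' or 'unknown' by counting lexicon matches."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in POSITIVE_WORDS for token in tokens)
    neg = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "unknown"  # no emotional word matched, or a tie
```

A sentence such as "this film slaps" contains no word from the toy lexicon and is therefore returned as "unknown", which illustrates the coverage limitation mentioned above.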
Nowadays, machine learning is a common method for researchers to analyze text emotion. The computer processes the text, extracts the text features and outputs the sentiment analysis result. Machine learning methods have obvious advantages over the emotion dictionary method, which relies heavily on manual work. The machine learning methods of text sentiment analysis are mainly divided into supervised sentiment analysis and unsupervised sentiment analysis. The method proposed in this paper is a supervised sentiment analysis method, which is briefly introduced in the following part.
The basic principle of supervised sentiment analysis is to train a model on labeled text and use the trained model to conduct sentiment analysis on unlabeled text. In addition to traditional machine learning methods, such as support vector machines, there are also deep learning methods such as CNN and RNN. Pang et al. [9] applied three representative classifiers (support vector machine, Naive Bayes and maximum entropy) to the text sentiment analysis task in an experimental study and obtained high accuracy. Kim [10] proposed a CNN for text classification, which became one of the important baselines of the sentiment analysis task. Brueckner and Schuller [11] used Bi-LSTM in the sentiment analysis task, obtaining both historical and future information through the bidirectional propagation mechanism. Tang et al. [12] used two different RNNs to conduct sentiment analysis in combination with texts and themes. Wang et al. [13] proposed an Attention-based Long Short-Term Memory Network for aspect-level sentiment classification. The attention mechanism can concentrate on different parts of a sentence when different aspects are taken as input. Baziotis et al. [7] argued that deep LSTM with attention (D-LSTM) can improve the performance of the model. Shen et al. [14] proposed a novel LSTM, called ON-LSTM, to deal with natural language processing problems. The neurons in the LSTM are specifically ordered to express richer information. Du et al. [15] proposed a new network architecture, called CRAN, which combines a recurrent neural network with a convolution-based attention model and further stacks an attention-based neural model to build a hierarchical sentiment classification model. In recent years, more and more new methods have emerged in this field. Many researchers have realized the advantages of machine learning methods and applied them to the task of text sentiment analysis to improve the classification accuracy.

III. MODEL DESCRIPTION
CE-B-MHA first generates word vectors from the text and feeds them into the Bi-LSTM network. Bi-LSTM provides an initial capture of the context relationship of the encoded word sequences. The Multi-Head Attention mechanism in the model then captures long-distance text features effectively. Finally, the comparison mechanism scores the items by comparing them with samples, instead of fitting hard-to-learn patterns.

A. WORD EMBEDDING
Before text enters the network, it needs to be converted into word vectors for computer processing. Therefore, an Embedding Layer is added to encode the text. Word2Vec is a commonly used tool for training word vectors [16], which can convert a word into vector form quickly and effectively according to a given corpus. In this paper, Word2Vec is used in advance to process the text and generate the required word vectors. After the text is loaded, each sentence is divided into words and the stop words are removed. In the embedding layer, the word vectors are read and fed into the model as initialization values.
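The preprocessing and embedding initialization described above can be sketched as follows. The sketch is an assumption-laden illustration: the `pretrained` dictionary stands in for vectors produced by Word2Vec, `STOP_WORDS` is a hypothetical stop-word list, and row 0 of the matrix is reserved for padding by our own convention.

```python
import numpy as np

STOP_WORDS = {"the", "a", "is", "of"}  # hypothetical stop-word list


def build_embedding_matrix(vocab, pretrained, dim):
    """Row i holds the vector of vocab[i - 1]; row 0 is reserved for padding."""
    matrix = np.zeros((len(vocab) + 1, dim))
    for index, word in enumerate(vocab, start=1):
        if word in pretrained:
            matrix[index] = pretrained[word]
    return matrix


def encode(sentence, word_to_index):
    """Split a sentence into words, drop stop words, map words to row indices."""
    tokens = [w for w in sentence.lower().split() if w not in STOP_WORDS]
    return [word_to_index[w] for w in tokens if w in word_to_index]


# Stand-in for vectors trained in advance with Word2Vec (assumed values).
pretrained = {"movie": np.array([0.1, 0.2]), "great": np.array([0.3, 0.4])}
vocab = ["movie", "great"]
word_to_index = {word: i for i, word in enumerate(vocab, start=1)}

matrix = build_embedding_matrix(vocab, pretrained, dim=2)
ids = encode("The movie is great", word_to_index)
word_vectors = matrix[ids]  # one row per remaining word
```

The matrix built this way would serve as the initial weights of the embedding layer.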

B. Bi-LSTM
We introduce an LSTM network to capture the contextual order information of sequences. LSTM [6] is a common method for processing sequence data. It introduces a self-cycling mechanism that makes it easier to learn long-term dependencies than a simple cycling structure.
LSTM is a special type of recurrent neural network whose internal structure is called the "LSTM cell". In the model, a "gate" structure is used to pass information selectively. A gate consists of a sigmoid layer whose outputs lie in [0, 1] and a multiplication operation that removes information from or adds information to the cell state. Each LSTM cell has three gates, the Forget Gate, the Input Gate and the Output Gate, which perform different functions to control the cell state.
In the LSTM cell, $x_t$ represents the input of the current cell, $h_{t-1}$ represents the output of the previous cell, $h_t$ represents the output of the current cell, and $c_{t-1}$ and $c_t$ represent the previous and current cell states. $f_t$, $i_t$ and $o_t$ are the outputs of the three gates.
The model uses the Forget Gate to determine what information to discard from the cell state. The gate takes $h_{t-1}$ and $x_t$ as input and outputs, for each number in the cell state $c_{t-1}$, a weight between 0 and 1 that is multiplied by it: 1 means "completely retained" and 0 means "completely discarded". The Input Gate determines how to add new information to the cell state. A tanh layer generates a vector of candidate values, which is multiplied by the result of the sigmoid layer to determine the values that should be added to the cell state. Adding these values updates the old cell state to $c_t$. The Output Gate determines the output value based on the cell state $c_t$: the cell state is processed by tanh and multiplied by the result of the sigmoid layer to obtain the cell output $h_t$. An LSTM can output all the hidden vectors $h_1$ to $h_t$. The LSTM cell can be abstracted as a function with inputs $h_{t-1}$ and $x_t$ and output $h_t$, which can be expressed as follows:
$$h_t = \mathrm{LSTM}(h_{t-1}, x_t)$$
The bidirectional LSTM (Bi-LSTM) extracts the features of a sequence in both the forward and backward directions, and can therefore better capture its contextual order information. In this paper, Bi-LSTM is adopted to extract the local order information of text, and a forward LSTM and a backward LSTM are combined to form the Bi-LSTM. The forward and backward LSTMs each produce an output at every time step.
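The gate computations described above can be sketched in NumPy using the standard LSTM formulation. Several details are assumptions made only for illustration: the four pre-activations are stacked in a single weight matrix `W`, the initial states are zero, and the forward and backward outputs are combined by concatenation (the text only states that they are combined).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the four stacked pre-activations."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t = sigmoid(z[:hidden])                # forget gate
    i_t = sigmoid(z[hidden:2 * hidden])      # input gate
    o_t = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    c_tilde = np.tanh(z[3 * hidden:])        # candidate values
    c_t = f_t * c_prev + i_t * c_tilde       # update the cell state
    h_t = o_t * np.tanh(c_t)                 # cell output
    return h_t, c_t


def run_lstm(xs, W, b, hidden):
    """Return all hidden vectors h_1 ... h_T for the sequence xs."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, W, b)
        outputs.append(h)
    return np.stack(outputs)


def run_bilstm(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward and backward hidden vectors at each time step."""
    forward = run_lstm(xs, *fwd_params, hidden)
    # Run the backward LSTM on the reversed sequence, then restore time order.
    backward = run_lstm(xs[::-1], *bwd_params, hidden)[::-1]
    return np.concatenate([forward, backward], axis=-1)
```

For a sequence of T word vectors and a hidden size of n, `run_bilstm` returns a T x 2n matrix, one row per position.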

C. MULTI-HEAD ATTENTION
The Multi-Head Attention (MHA) mechanism is used to fully capture long-distance features and obtain global information. The output vectors of the Bi-LSTM layer, $h_1$ to $h_t$, are combined to form a matrix, which serves as all three inputs of MHA, named Q (Query), K (Key) and V (Value), so that Q = K = V, as shown in FIGURE 1.
The Scaled Dot-Product Attention (SPQA) computes the dot product between Q and K and divides it by a scale factor of $\sqrt{d_K}$. The purpose of $\sqrt{d_K}$ is to play a regulatory role, so that the inner product does not become too large. A softmax operation normalizes the result into a probability distribution, which is then multiplied by the matrix V.
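In the notation of [5], the operation described above is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^{T}/\sqrt{d_K})\,V$. A minimal NumPy sketch is given below; it treats a single sentence without batching or masking, and the function names are our own.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # scaled dot products between positions
    weights = softmax(scores)        # each row is a probability distribution
    return weights @ V, weights
```

With Q = K = V set to the matrix of Bi-LSTM outputs, each output row is a weighted sum of all positions of the sentence.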
In Multi-Head Attention, a linear transformation is performed for Q, K and V with different parameters W.
The Bi-LSTM layer can effectively obtain the context order information of sequences, while the Multi-Head Attention mechanism can learn information from different dimensions and representation subspaces and fully capture long-distance text features. They complement each other and can effectively improve the sentiment analysis ability of the model.
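The following sketch illustrates Multi-Head Attention in the standard form of [5]: each head applies its own projections to Q, K and V, runs scaled dot-product attention, and the head outputs are concatenated and projected once more. The per-head matrices `W_q`, `W_k`, `W_v` and the output matrix `W_out` are names assumed here for illustration; the exact configuration used in CE-B-MHA is not stated in this section.

```python
import numpy as np


def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def multi_head_attention(X, heads, W_out):
    """X: (T, d_model). heads: list of (W_q, W_k, W_v), one triple per head."""
    head_outputs = []
    for W_q, W_k, W_v in heads:
        # Self-attention: Q, K and V are all projections of the same matrix X.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        head_outputs.append(weights @ V)
    # Concatenate the heads and apply the final linear transformation.
    return np.concatenate(head_outputs, axis=-1) @ W_out
```

Because every head has its own projections, each one can attend to the sentence in a different representation subspace.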

D. COMPARISON MECHANISM
We introduce a comparison mechanism to enhance the learning ability of the model. The comparison mechanism scores the sentence embedding generated by MHA by comparing it with samples. Positive samples and negative samples are selected from the labeled training data. In this paper, the samples are selected at random. The numbers of positive and negative samples should be equal to obtain better results. The corresponding sentence vectors of these samples are generated and become a part of the model.
A simple method is used to generate the sentence vectors of positive and negative samples. We obtain the word vectors of each sample from the Embedding Layer, and the sample vector is obtained by averaging all the word vectors in the sentence.
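The sample selection and averaging steps can be sketched as below. The label convention (1 for positive, 0 for negative), sampling without replacement, and the representation of sentences as lists of embedding row indices are assumptions of this sketch.

```python
import numpy as np


def sentence_vector(word_ids, embedding_matrix):
    """Average the embedding rows of the words in one sentence."""
    return embedding_matrix[word_ids].mean(axis=0)


def draw_samples(sentences, labels, k, rng):
    """Randomly pick k positive and k negative labeled sentences."""
    positives = [s for s, y in zip(sentences, labels) if y == 1]
    negatives = [s for s, y in zip(sentences, labels) if y == 0]
    pos_idx = rng.choice(len(positives), size=k, replace=False)
    neg_idx = rng.choice(len(negatives), size=k, replace=False)
    return [positives[i] for i in pos_idx], [negatives[i] for i in neg_idx]
```

Drawing the same number k from each class keeps the positive and negative samples balanced, as recommended above.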
We use a neural network with one hidden layer as a similarity function to obtain a similarity score for classification. The activation functions of its hidden and output layers are "relu" and "sigmoid", respectively. The input of the network is the concatenation (Concat) of the sentence embedding and a sample vector, and the output layer size is 1. The hidden layer size V is the length of the sentence vector.
$$\mathrm{score}(s, \mathit{sample}) = \mathrm{sigmoid}\big(W_2\,\mathrm{relu}(W_1\,\mathrm{Concat}(s, \mathit{sample}) + b_1) + b_2\big)$$
In the formula, the neural network is represented by two linear transformations with biases, and the parameters are $W_1$, $b_1$, $W_2$ and $b_2$. Here $s$ represents the sentence embedding, and $\mathit{sample}$ represents the sample vector.
A similarity score with the sentence embedding is calculated for every sample. These scores are integrated by calculating their weighted sum, which gives the result of the Comparison Mechanism:
$$\mathrm{result} = \sum_{i=1}^{K} w_i \cdot \mathrm{score}_i$$
In the formula, $\mathrm{score}_i$ is the similarity score of the $i$-th sample and $w_i$ represents the weight of that score. We use a layer of neural network to calculate the weights. $K$ represents the number of samples selected.
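The Comparison Mechanism described above can be sketched in NumPy as follows. The sketch assumes that `W1` has shape (V, 2V), that `W2` is a vector of length V so that the output size is 1, and that the weights `w` are simply passed in, since the input of the layer that produces them is not specified here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def similarity(s, sample, W1, b1, W2, b2):
    """Two-layer network on Concat(s, sample): relu hidden layer, sigmoid output."""
    hidden = np.maximum(0.0, W1 @ np.concatenate([s, sample]) + b1)
    return sigmoid(W2 @ hidden + b2)


def comparison_score(s, samples, params, w):
    """Weighted sum of the similarity scores between s and each sample vector."""
    scores = np.array([similarity(s, sample, *params) for sample in samples])
    return float(w @ scores)
```

In a trained model, `params` and the layer producing `w` would be learned jointly with the rest of the network; here they are plain arrays so that the computation can be inspected in isolation.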

IV. EXPERIMENTS
A. DATASETS
The experiment was performed on three datasets: Large Movie Review Dataset [17], Semeval2017-task4-A English [18] and Stanford Sentiment Treebank [19]. In the experiment, their training and test sets were used directly. 20% of the original training set was assigned as the validation set, and the rest was kept as the training set.
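The 80/20 split of the original training set can be sketched with the Python standard library. Shuffling before the split and the fixed seed are assumptions of this sketch; the text does not state how the 20% was chosen.

```python
import random


def split_train_validation(examples, validation_fraction=0.2, seed=0):
    """Shuffle and hold out a fraction of the original training set for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]
```

The test set is left untouched by this step, so it remains an independent measure of model performance.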
The Large Movie Review Dataset (IMDB) is a commonly used sentiment analysis dataset of IMDB reviews, which contains 25,000 positive and 25,000 negative samples. Semeval2017-task4-A English (Semeval2017) is a dataset provided by Task 4 of the Semeval2017 competition, containing more than 7,000 positive and 3,000 negative samples. Stanford Sentiment Treebank (SST) is a sentiment analysis dataset provided by Stanford, containing 5,000 positive and 4,500 negative samples. For the datasets with three sentiments, the neutral samples were removed.
... [14], D-LSTM [7], and Bi-LSTM with attention were used as baselines. We ran ablation experiments on Bi-LSTM, MHA, B-MHA, CE, and CE-B-MHA, where B-MHA refers to Bi-LSTM with MHA and CE refers to the Comparison Mechanism used on its own. We implemented the model using Keras, a popular deep learning tool based on Python. We use Adam [20] as the optimizer and cross entropy as the loss.
TABLE 1, 2 and 3 respectively show the experimental results of the three metrics. The item with the highest score in each column is highlighted in bold.
... CE-B-MHA are the highest among all models. For Semeval2017, the AUC of CE-B-MHA is the highest among all models. Although CE-B-MHA achieves good results compared with the baselines, it is not optimal on all metrics. The enhancement effect of the Comparison Mechanism on Precision is weak. This may be due to an imbalance between the numbers of positive and negative samples.

C. THE EFFECT OF COMPARISON MECHANISM
We tested the effects of different sample sizes on the enhancement of Comparison Mechanism. The sample size means the number of positive and negative sample pairs. FIGURE 3 shows the results of CE-B-MHA on Semeval2017 with different sample sizes. With fewer samples, the enhancement of Comparison Mechanism is not obvious. When the sample size is higher than 35, the influence of sample size on the model performance decreases.

V. CONCLUSION
We propose a method for text sentiment analysis named Comparison Enhanced Bi-LSTM with Multi-Head Attention (CE-B-MHA). Experiments show that CE-B-MHA improves sentiment analysis performance compared with existing classification models. CE-B-MHA has a complex internal structure. It combines the ability of Multi-Head Attention to obtain global information with the ability of Bi-LSTM to obtain sequence information. The Bi-LSTM network is used to capture the relations within a sentence in both the forward and backward directions and to obtain the local order information. In addition, the Multi-Head Attention mechanism is used to fully capture long-distance features and learn relevant information from different dimensions and representation subspaces. Finally, CE-B-MHA introduces a comparison mechanism to enhance the learning ability of the model. The comparison mechanism scores the items by comparing them with samples instead of fitting hard-to-learn patterns, and achieves good results.