Predicting Chinese Phrase-Level Sentiment Intensity in Valence-Arousal Dimensions With Linguistic Dependency Features

Phrase-level sentiment intensity prediction is difficult because linguistic modifiers (e.g., negators, degree adverbs, and modals) may cause an intensity shift or polarity reversal for the modified words. This study develops a graph-based Chinese parser based on the deep biaffine attention model to obtain dependency structures and relations. The obtained dependency features are then used in our proposed Weighted-sum Tree GRU network to predict phrase-level sentiment intensity in the valence-arousal dimensions. Dependency parsing results on the Sinica Treebank indicate that our graph-based model outperforms transition-based methods such as MLP and stack-LSTM, consistent with findings for English dependency parsing. Experimental results on the Chinese EmoBank indicate that our Weighted-sum Tree GRU network outperforms transformer-based neural networks such as BERT, ALBERT, XLNet and ELECTRA, reflecting the effectiveness of linguistic dependencies in phrase-level sentiment intensity prediction tasks. In addition, a quantitative analysis shows that our proposed model requires fewer parameters and less inference time, making it relatively lightweight and efficient.


I. INTRODUCTION
Sentiment analysis involves the use of linguistic processing to differentiate the positive and negative emotional content of utterances, as well as their emotional strength values [1], [2], [3]. Continuous real-valued sentiment scores, called 'intensity', provide more fine-grained emotional information. SemEval-2016 Task 7 focused on determining the sentiment intensity of English and Arabic utterances [4]. Various participating teams achieved promising results using methods including random forest [5], pointwise mutual information [6], Gaussian regression [7] and linear regression with manual rules [8]. A shared task on dimensional sentiment analysis for Chinese phrases was also organized at IJCNLP-2017 [9]. Affective states were represented in the valence-arousal space [10]. Valence represents the degree of pleasant and unpleasant (i.e., positive and negative) feelings, while arousal represents the degree of excitement and calm. Deep learning-based neural computing approaches such as ensemble Long Short-Term Memory models [11], boosted neural networks [12] and feed-forward neural networks [13] were used to predict the sentiment intensity of Chinese multi-word phrases.
The associate editor coordinating the review of this manuscript and approving it for publication was Arianna Dulizia.
Linguistic modifiers such as negators (e.g., not, never), degree adverbs (e.g., very, totally, slightly) and modals (e.g., would, could) are commonly used in opinion expressions, and play an important role in recognizing sentiment intensity [14]. For example, the Chinese phrases '' '' (totally not agree) and '' '' (not totally agree) convey different meanings. The former is composed of a degree adverb '' '' (totally), a negator '' '' (not) and a verb '' '' (agree), meaning that the speaker totally disagrees with the subject, while the latter features the same words in a different order, meaning that the speaker does not completely agree, but rather partially agrees. Phrase-level sentiment intensity prediction is difficult because linguistic modifiers may lead to an intensity shift or polarity reversal for the words they modify [15].
However, Chinese syntactic parsing to obtain linguistic dependency information is rarely addressed, motivating us to develop a Chinese dependency parser to extract dependency structures and relation features between words for phrase-level sentiment analysis. In addition, different Chinese phrases may have the same dependency parsing result, as with the previous examples '' '' (totally not agree) and '' '' (not totally agree), while expressing different meanings with almost opposite affective states. Hence, we propose a Weighted-sum Tree Gated Recurrent Unit (Tree GRU) network to tackle this ordering problem, in which phrases with different word orders share the same dependency structure, for phrase-level sentiment intensity prediction in the valence-arousal dimensions.
The main contributions are summarized as follows.
(1) Developing a Chinese Dependency Parser for Syntactic Structure Analysis: We develop a graph-based Chinese dependency parser based on the deep biaffine attention model [16] to obtain linguistic structures and relations between words. The Sinica Treebank [17] was used to evaluate dependency parsing results, indicating that our graph-based model outperforms transition-based methods such as MLP [18] and stack-LSTM [19], consistent with findings for English dependency parsing.
(2) Exploring Linguistic Dependency Features for Chinese Phrase-level Sentiment Intensity Prediction: We propose a Weighted-sum Tree GRU network to leverage linguistic dependency features for phrase-level sentiment intensity prediction in the valence-arousal dimensions. Chinese Valence-Arousal Phrases (CVAP) from the Chinese EmoBank corpus [20] were used to evaluate performance. In experiments, our Weighted-sum Tree GRU neural network with linguistic dependency information outperformed transformer-based neural networks (i.e., BERT [21], RoBERTa [22], MacBERT [23], ALBERT [24], XLNet [25], and ELECTRA [26]) in the two-dimensional valence-arousal space, confirming the effectiveness of the exploited linguistic dependency features. In addition, a quantitative analysis shows that our proposed model contains fewer parameters and requires less inference time, so it is relatively lightweight and efficient.
The rest of this paper is organized as follows. Section 2 reviews related studies for phrase-level sentiment intensity prediction. Section 3 describes our proposed network architecture for valence-arousal rating prediction. Section 4 describes experiments and discusses experimental results for model performance evaluation. Conclusions are drawn in Section 5.
Both switch and shift models are used to constrain the negation [27], [28], [29], [30]. A contextual shift approach was proposed to predict positive and negative sentiment for each term [27], [28], incorporating an optional SVM algorithm to learn and classify the sentiment shifts composed of bi-gram and uni-gram features to obtain better classification performance. A rule-based model was used to detect the intensity of emotions in informal English [29], which was improved using an unsupervised version of SentiStrength 2 [30].
Linguistic features were identified based on semantic rules, and a linear offset model was used to classify sentiment [31]. The Semantic Orientation CALculator (SO-CAL) was applied to the polarity classification task [32], assigning a positive or negative label to a text to capture textual opinions related to the main topic. A linguistic modifiers-based model was proposed to improve emotion classification by modeling negation, intensifiers and modalities that may change the emotional meaning of the text [33]. The Valence Aware Dictionary for sEntiment Reasoning (VADER) uses a rule-based model that constructs gold-standard lists through lexicon features [34].
Machine learning approaches focus on the use of data and algorithms to train models to predict sentiment scores. Random forest [5] was used as a pairwise strategy to predict the sentiment intensity scores. Point-wise mutual information [6] was used to check for similarity between words and prototypical sets, where words with high similarity were incorporated into the emotional lexicon. Adaptive boosting [36] was used to combine multiple weak classifiers into a single strong classifier. The Gaussian regression model [7], [35] was used to compute sentiment intensity scores by incorporating multiple features including direct search results, Word2Vec search, rule-based search, and 5-level Stanford sentiment classifier output [37]. A linear regression model [8] was used to analyze the data noise that affected sentiment intensity prediction performance.
Deep learning techniques can be used in a variety of ways, including modifying the architecture of neural networks or integrating multiple neural network models. Part-Of-Speech (POS) embeddings and word clusters were fed into a dense Long Short-Term Memory (LSTM) network architecture [11], which underwent 100 training iterations using different hyper-parameters and training data to improve generalization and reduce data noise. A boosted neural network model [12] was used to improve the accuracy of misinterpreted data. A multi-layered feed-forward neural network [13] was proposed to include the types of known modifier words, the valence-arousal values of headwords, and the distributional semantics of both kinds of words for valence-arousal intensity prediction. A pipelined neural network model composed of two neural network (NN) models was proposed to predict phrase-level sentiment intensity [15], in which the first NN model was used to combine the re-weighting mechanism in the hidden layer, and the second NN model considered not only individual but also group weights.
In summary, we follow the research development of neural network-based deep learning methods since neural computing techniques usually achieve promising results. In this paper, we propose a Weighted-sum Tree GRU network to make full use of the dependency features obtained by our Chinese dependency parser for phrase-level sentiment intensity prediction in the valence-arousal dimensions.
Figure 1 shows our proposed network architecture for Chinese phrase-level sentiment intensity prediction, which comprises two main parts: 1) graph-based dependency parsing; and 2) a Weighted-sum Tree GRU network.

A. GRAPH-BASED DEPENDENCY PARSING
We use the graph-based deep biaffine attention model [16] for Chinese dependency parsing. At the embedding layer, the concatenation of a pretrained skip-gram word embedding and a trainable Part-of-Speech (POS) embedding is used as the representation for each word. The recurrent output vector from the following Bidirectional Long Short-Term Memory (BiLSTM) layers then serves as the contextualized word representation for dependency parsing. We then reduce the dimensionality of the recurrent output vector using Multi-Layer Perceptron (MLP) layers to strip away irrelevant information. Finally, the scores of all the directed arcs between every pair of words are calculated using the biaffine transformation. The cross-entropy loss function is used to calculate the loss at training time, while at testing time the optimal parsing tree is searched using the Maximum Spanning Tree algorithm. Our network architecture uses two deep biaffine attention models, which share common embedding and BiLSTM layers, to obtain the dependency arcs and dependency types. The objective is to predict the probabilities of all modified words in a sentence. After training the deep biaffine attention model, we can obtain the probability matrix of all arcs and the corresponding dependency matrix. The obtained dependency tree structure and relations are used respectively as the input order and features in the following Weighted-sum Tree GRU network.
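The arc-scoring step described above can be illustrated with a minimal sketch. It assumes the common biaffine form score(i, j) = dep_i^T U head_j + u · head_j; the vectors `dep_vecs` and `head_vecs` stand in for the MLP outputs, and all numeric values, as well as the function names, are hypothetical toy choices rather than anything from the paper. For brevity the sketch picks each word's head greedily, whereas the paper searches for a maximum spanning tree at test time.

```python
# Illustrative biaffine arc scorer in plain Python (no deep-learning library).
# All vectors and weights are made-up toy values, not trained parameters.

def biaffine_arc_scores(dep_vecs, head_vecs, U, u):
    """Return scores[i][j], the score of word j being the head of word i.

    score(i, j) = dep_i^T U head_j + u . head_j
    The first term is bilinear; the second is a head-only bias term.
    """
    scores = []
    for dep in dep_vecs:
        row = []
        for head in head_vecs:
            u_head = [sum(w * x for w, x in zip(U_row, head)) for U_row in U]
            bilinear = sum(d * x for d, x in zip(dep, u_head))
            bias = sum(w * x for w, x in zip(u, head))
            row.append(bilinear + bias)
        scores.append(row)
    return scores


def greedy_heads(scores):
    """Pick the best-scoring head for every word except ROOT (index 0).

    Simplification: a greedy argmax, not the maximum spanning tree search
    used in the paper, so it does not guarantee a well-formed tree.
    """
    heads = {}
    for i in range(1, len(scores)):
        candidates = [j for j in range(len(scores[i])) if j != i]
        heads[i] = max(candidates, key=lambda j: scores[i][j])
    return heads


# Index 0 is ROOT; indices 1-3 are three words of a toy phrase.
dep_vecs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
head_vecs = [[0.5, 0.5], [0.2, 0.1], [0.1, 0.8], [1.0, 1.0]]
U = [[1.0, 0.0], [0.0, 1.0]]
u = [0.1, 0.0]

scores = biaffine_arc_scores(dep_vecs, head_vecs, U, u)
heads = greedy_heads(scores)
print(heads)
```

In this toy setting words 1 and 2 attach to word 3, which attaches to ROOT, mirroring a phrase in which two modifiers depend on the same headword.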

B. WEIGHTED-SUM TREE GRU NETWORK
To use both the dependency relations between words and the syntactic information of the dependency tree in our model, we adopt tree-structured Recurrent Neural Networks (RNNs) [38]. The benefit of the tree-structured RNN over the standard RNN is its capability to compute the hidden state of a node from the hidden states of its multiple children. The order of input features to the tree-structured RNN follows the tree structure of the dependency parsing results (dependency tree and dependency relation). Then, the hidden states of the direct syntactic children of the ROOT node are passed to the following feed-forward network to predict VA scores from 1 (highly negative or calm) to 9 (highly positive or excited).
Different sentences with various degrees of sentiment intensity may share the same dependency tree structure, producing identical features and thus resulting in incorrect VA prediction. To solve this problem, we propose a Weighted-sum Tree GRU network, in which the state $h$ of a component node is produced from the multiple hidden states $h_k$ of its children. To model the influence of the left-right order of the dependents, different weight matrices ($U^{(r)}_k$ for the reset gate $r_k$, $U^{(h)}_k$ for the candidate hidden state $\hat{h}$, and $U^{(z)}_k$ for the update gate $z$) are learned for the input hidden states $h_k$ at each left-right position $k$. Different sets of weight matrices are therefore used when the hidden states are input into the node state of the headword, and the left-right order of each dependent is defined relative to its headword. Figure 2 shows the left-right order of the sentence '' ''(We very welcome your visit). We set the window size to 2 to minimize the number of weight matrices. For instance, '' '' (we) is the left-2 dependent ($k = -2$) and '' '' (visit) is the right-2 dependent ($k = 2$) of their headword '' '' (welcome). For dependent words with a left-right order greater than $N$, the leftmost Left-N weight matrices ($U^{(r)}_N$) are used.
Figure 3 shows the pseudocode of our proposed network architecture. The dependency features and the window index are fed into the node_forward function as the input to calculate the hidden state $h$, the reset gate $r$ and the update gate $z$. The transition equations are described in detail as follows. The component node state $h$ is a linear interpolation between the previous hidden state $\tilde{h}$ and the candidate hidden state $\hat{h}$, as shown in Eq. (1), where the update gate $z$ decides how much the unit updates its previous hidden state and is computed by Eq. (2). The previous hidden state is the summation of all the input hidden states, as shown in Eq. (3). The candidate hidden state is then computed by Eq. (4), where the reset gate $r_k$ is computed as in Eq. (5). When calculating the candidate hidden state, an input hidden state $h_k$ is ignored if the corresponding reset gate $r_k$ is close to 0. Finally, the predicted valence and arousal values are output after passing through the MLP.
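The transition equations themselves are not reproduced in this text. The block below is a hedged reconstruction that follows the verbal description and the usual Tree GRU formulation; it should be read as a sketch, not as the authors' exact equations. The input vector $x$ of the headword, the input weight matrices $W^{(\cdot)}$, the bias terms $b^{(\cdot)}$, the choice of $\sigma$ and $\tanh$, and the direction of the interpolation in Eq. (1) are assumptions.

```latex
\begin{align}
h &= (1 - z) \odot \tilde{h} + z \odot \hat{h} \tag{1}\\
z &= \sigma\Big(W^{(z)} x + \sum_{k} U^{(z)}_{k} h_{k} + b^{(z)}\Big) \tag{2}\\
\tilde{h} &= \sum_{k} h_{k} \tag{3}\\
\hat{h} &= \tanh\Big(W^{(h)} x + \sum_{k} U^{(h)}_{k}\,(r_{k} \odot h_{k}) + b^{(h)}\Big) \tag{4}\\
r_{k} &= \sigma\Big(W^{(r)} x + U^{(r)}_{k} h_{k} + b^{(r)}\Big) \tag{5}
\end{align}
```

Under this form, a reset gate $r_k$ close to 0 removes $h_k$ from the candidate state in Eq. (4), and the position index $k$ is what makes the gates sensitive to the left-right order of the dependents.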
Like the child-sum Tree GRU network [39], our proposed network imports the hidden states $h_k$ of the lower-level nodes when computing the node state $h$ of the headword. In the child-sum model, however, the previous hidden-layer states of all modifiers of a headword are processed using the same parameters. Therefore, the child-sum Tree GRU model [39] cannot capture the position of each modifier relative to the headword, which matters because different sentences sometimes generate identical dependency tree structures. Figure 4 uses the two phrases '' '' (totally not agree) and '' '' (not totally agree) as examples, which express different meanings with almost opposite affective states but have the same dependency parsing result. Hence, these two phrases receive the same valence-arousal rating prediction in the child-sum Tree GRU framework. Our proposed Weighted-sum Tree GRU network can handle this ordering problem in the sentiment intensity task. We use a set of position-specific weight matrices to reflect the modifier hidden state information. These weight matrices are learned by the Tree GRU neural network to extract the reset gate, candidate hidden state and update gate features of the input phrases, allowing the proposed method to distinguish different sentences with the same dependency structure.
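The difference between the two aggregation schemes can be made concrete with a small sketch. It isolates only the aggregation term (a shared matrix applied to the sum of child states versus position-specific matrices applied per child) and omits the gates. The hidden states, the matrices, the helper names and the symmetric clamping of right-side offsets are all illustrative assumptions.

```python
# Toy comparison of child-sum vs. position-specific (weighted-sum) aggregation.
# All matrices and hidden states are made-up illustrative values.

WINDOW = 2  # window size N, as set in the text


def clamp_position(k, window=WINDOW):
    """Map a signed offset k (k < 0: left, k > 0: right) into the window,
    so dependents further out reuse the outermost matrices."""
    if k == 0:
        raise ValueError("a dependent cannot sit at offset 0 of its head")
    return max(-window, min(window, k))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vadd(a, b):
    return [x + y for x, y in zip(a, b)]


def child_sum(children, U):
    """Shared matrix for every child: U @ sum_k h_k (order-invariant)."""
    total = [0.0] * len(U)
    for _k, h in children:
        total = vadd(total, h)
    return matvec(U, total)


def weighted_sum(children, U_by_pos):
    """Position-specific matrices: sum_k U_k @ h_k (order-sensitive)."""
    total = [0.0] * len(next(iter(U_by_pos.values())))
    for k, h in children:
        total = vadd(total, matvec(U_by_pos[clamp_position(k)], h))
    return total


h_totally, h_not = [1.0, 0.0], [0.0, 1.0]
# Both modifiers depend on the same headword; only their positions differ.
phrase_a = [(-2, h_totally), (-1, h_not)]   # "totally not agree"
phrase_b = [(-2, h_not), (-1, h_totally)]   # "not totally agree"

U_shared = [[0.5, -1.0], [0.25, 0.5]]
identity, swap = [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]
U_by_pos = {-2: identity, -1: swap, 1: swap, 2: identity}

print(child_sum(phrase_a, U_shared), child_sum(phrase_b, U_shared))
print(weighted_sum(phrase_a, U_by_pos), weighted_sum(phrase_b, U_by_pos))
```

With a shared matrix the two phrases yield the same vector, whereas the position-specific matrices separate them, which is the property the Weighted-sum Tree GRU relies on.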

IV. EXPERIMENTS FOR PERFORMANCE EVALUATION
A. DATASETS
Sinica Treebank [17] was divided into two mutually exclusive datasets to evaluate dependency parsing performance. The training set includes 56,957 sentences with a total of 337,174 words. The test set includes 690 sentences with 5,160 words.
The Chinese Valence-Arousal Phrases (CVAP) set from the Chinese EmoBank [20] was used to evaluate sentiment intensity prediction performance. A total of 52 modifiers (including 4 negators, 42 degree adverbs, and 6 modals) were combined with the affective words in the Chinese Valence-Arousal Words (CVAW) set [40] to form multi-word phrases. VA ratings were annotated through crowdsourcing, with each phrase randomly assigned to 10 annotators. Both the valence and arousal dimensions use a nine-degree scale. A value of 1 on the valence and arousal dimensions respectively denotes extremely high-negative and low-arousal sentiment, while a 9 denotes extremely high-positive and high-arousal sentiment, and 5 denotes a neutral and medium-arousal sentiment. Outlier ratings were identified and excluded from the calculation of the average VA ratings for each phrase. Finally, a total of 2,998 Chinese phrases were constructed in the CVAP. The phrases were randomly divided into five groups for cross-validation evaluation.
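A five-group random split of this kind can be sketched as follows. The paper does not state how the groups were drawn, so the shuffling procedure, the seed and the function name are assumptions made purely for illustration.

```python
import random


def five_fold_indices(n_items, n_folds=5, seed=0):
    """Shuffle item indices and deal them into n_folds nearly equal groups.
    The seed is an arbitrary illustrative choice."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[f::n_folds] for f in range(n_folds)]


folds = five_fold_indices(2998)          # 2,998 CVAP phrases
fold_sizes = [len(fold) for fold in folds]

# Each group is held out once for testing; the other four form the training set.
splits = []
for held_out in range(len(folds)):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    splits.append((train_idx, test_idx))

print(fold_sizes)
```

Because 2,998 is not divisible by 5, three groups hold 600 phrases and two hold 599 under this scheme.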
All experiments were implemented using PyTorch. The hyper-parameters of our proposed Weighted-sum Tree GRU network were set as follows: batch size 256; word vector dimension 250; POS vector dimension 50; dependency feature dimension 100; memory size of the Weighted-sum Tree GRU 256; hidden state size of the MLP 512; Adagrad as the optimizer; and the number of epochs restricted to 50. We use BERT, RoBERTa, MacBERT, ALBERT, XLNet, and ELECTRA to compare the performance of sentiment intensity prediction, with the following hyper-parameters: batch size 64; average pooling style; pre-trained models with 12 layers, 768 hidden units and 12 heads; AdamW as the optimizer; and 20 epochs.

C. METRICS
Two metrics were used to evaluate parsing results: 1) Unlabeled Attachment Score (UAS), the proportion of tokens that are assigned the correct head, and 2) Labeled Attachment Score (LAS), the proportion of tokens that are assigned both the correct head and the correct dependency relation label.
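The two parsing metrics can be computed as in the following minimal sketch. The function name, the token representation as `(head_index, relation_label)` pairs and the relation labels in the toy example are hypothetical.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) for aligned lists of (head_index, relation_label)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")
    n = len(gold)
    # UAS counts tokens whose predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS additionally requires the relation label to match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / n, las_hits / n


# Toy example with four tokens; labels are illustrative only.
gold = [(3, "degree"), (3, "negation"), (0, "root"), (3, "goal")]
pred = [(3, "degree"), (3, "degree"), (0, "root"), (2, "goal")]

uas, las = attachment_scores(gold, pred)
print(uas, las)
```

Here three of four heads are correct (UAS 0.75), but only two tokens have both the correct head and the correct label (LAS 0.5), so LAS can never exceed UAS.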
The sentiment intensity prediction performance is evaluated by examining the difference between machine-predicted ratings and human-annotated ratings using two metrics to independently evaluate the valence and arousal predictions: Mean Absolute Error (MAE) and Pearson Correlation Coefficient (PCC), defined as Eq. (1) and (2).
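The two equations are not reproduced in this text. A reconstruction using the standard definitions and the symbols explained in the surrounding text is given below; the $1/(n-1)$ normalization of the PCC assumes that $\sigma_A$ and $\sigma_P$ are sample standard deviations.

```latex
\begin{align*}
\mathrm{MAE} &= \frac{1}{n}\sum_{i=1}^{n} \left| a_i - p_i \right| \\
\mathrm{PCC} &= \frac{1}{n-1}\sum_{i=1}^{n}
  \left(\frac{a_i - \mu_A}{\sigma_A}\right)
  \left(\frac{p_i - \mu_P}{\sigma_P}\right)
\end{align*}
```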
where $a_i \in A$ and $p_i \in P$ respectively denote the $i$-th actual value and predicted value, $n$ is the number of test samples, $\mu_A$ and $\sigma_A$ respectively represent the mean value and the standard deviation of $A$, while $\mu_P$ and $\sigma_P$ respectively represent the mean value and the standard deviation of $P$. The MAE measures the error rate and the PCC measures the linear correlation between the actual values and the predicted values. A lower MAE and a higher PCC indicate more accurate prediction performance.
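As a minimal sketch, the two metrics can be computed with the Python standard library as follows. The function names and the toy ratings are illustrative, and the PCC uses sample standard deviations with a division by n - 1, which is an assumption about the exact normalization.

```python
from statistics import mean, stdev


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(actual, predicted):
    """Pearson correlation via standardized scores (sample statistics)."""
    n = len(actual)
    mu_a, mu_p = mean(actual), mean(predicted)
    sd_a, sd_p = stdev(actual), stdev(predicted)
    return sum(((a - mu_a) / sd_a) * ((p - mu_p) / sd_p)
               for a, p in zip(actual, predicted)) / (n - 1)


# Toy ratings on the 1-9 scale: every prediction is off by a constant 0.5.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 4.5, 6.5, 8.5]

mae = mean_absolute_error(actual, predicted)
pcc = pearson_correlation(actual, predicted)
print(mae, pcc)
```

The toy case shows why both metrics are reported: a constant offset leaves the correlation perfect while the MAE still registers the error.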

D. DEPENDENCY PARSING RESULTS
In the first set of experiments, the following two transition-based methods were used for comparison.
• MLP (Multi-Layer Perceptron) [18] An MLP-based fast parser is proposed to obtain dependency parsing results. The input features, including words, POS tags and arc labels, are merged and then all feature parameters are summed by linear transformation. Finally, the softmax function is used to make the classification decision.
• Stack-LSTM (Stack Long Short-Term Memory) [19] A transition-based RNN is proposed to follow the Stanford parser's features from the partial parsing trees, combining the partial dependency tree into the highest two layers of the stack. In addition to words, the dependency tree also contains actions and labels. Therefore, if the properties are different, the labels and words are trained by composition, then the stack is updated by a linear transformation.
Table 1 shows the dependency parsing results. The graph-based method (our adopted Deep Biaffine Attention model) outperforms the MLP [18] and Stack-LSTM [19], consistent with findings for English dependency parsing [43]. In transition-based methods, the words in a sentence are pushed onto the stack from left to right. However, sentences can have complicated and non-linear syntactic structures. The graph-based method calculates the weights of all possible edges from word to word and searches for the optimal solution using graph algorithms, which is more suitable for Chinese syntactic structure.

E. SENTIMENT INTENSITY PREDICTION RESULT
In the second set of experiments, the following transformer-based models were compared to demonstrate their performance for phrase-level sentiment intensity prediction.
• BERT (Bidirectional Encoder Representations from Transformers) [21] BERT uses an encoder architecture with an attention mechanism to construct a transformer-based neural network architecture, providing state-of-the-art results in a wide variety of natural language processing tasks. BERT was pre-trained on two tasks: 1) Masked Language Model (MLM): a fixed ratio of tokens was masked and the model was trained to predict the original masked words based on the context; 2) Next Sentence Prediction (NSP): BERT was trained to predict whether the following sentence was probable or not based on the previous sentence. Through pre-training, BERT learns contextual embeddings for representations from large-scale data sets. After pre-training, BERT can be fine-tuned on smaller data sets to optimize its performance on specific tasks.
• RoBERTa (a Robustly optimized BERT pre-training approach) [22] RoBERTa is a replication study of BERT pre-training that carefully measures the impact of key parameters and training data size. The model modifications include removing next sentence prediction, dynamically changing the masking pattern applied to the training data, and training in large batches.
• MacBERT (MLM as correction BERT) [23] MacBERT revisits the Chinese pre-trained language model series and improves upon RoBERTa, particularly the masking strategy that adopts MLM as correction (Mac). This Mac pre-training task was proposed to alleviate the inconsistency problem of pre-training to downstream tasks.
• ALBERT (A Lite BERT) [24] ALBERT was proposed to improve the training and results of the BERT architecture using three different techniques: factorization of the embedding matrix, cross-layer parameter sharing, and inter-sentence coherence prediction.
• XLNet [25] XLNet was proposed as a generalized autoregressive pretraining method that 1) enables learning of bidirectional contexts by maximizing the likelihood over all permutations of the factorization order; and 2) overcomes the limitations of BERT in neglecting dependencies between the masked positions. In addition, XLNet integrates the Transformer-XL mechanism into pretraining, which allows for the input of longer texts and reduces a pretrain-to-finetune discrepancy.
• ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) [26] A new pre-training task called replaced token detection was proposed as an alternative to masking the input in BERT. ELECTRA consists of two parts: 1) generator: some tokens are replaced with plausible samples from a small generator network; and 2) discriminator: it predicts whether each token in the input was replaced by the generator or not.
Table 2 shows the results of sentiment intensity prediction on multi-word phrases. Our proposed Weighted-sum Tree GRU model outperforms the BERT [21], RoBERTa [22], MacBERT [23], ALBERT [24], XLNet [25], and ELECTRA [26] models in both the valence and arousal dimensions, except for valence PCC, where it equals the BERT result. This indicates that dependency parsing can capture the modifier relationships between words, which helps enhance sentiment intensity prediction performance.
We also conducted a quantitative analysis to compare model size and inference time. On a server using Nvidia GeForce RTX 2080 Ti GPUs with the same settings, the transformer-based models require approximately 105M parameters and 640ms of inference time. Our proposed Weighted-sum Tree GRU model is relatively lightweight compared to the BERT-like transformer models, requiring only 43.8% of the number of parameters and 56.6% of the inference time, and it does not require large amounts of data for pre-training.
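For orientation, the reported percentages can be converted into absolute figures. Since the baseline values of 105M parameters and 640ms are themselves approximate, the results below are rough estimates rather than reported measurements.

```latex
\begin{align*}
\text{parameters} &\approx 0.438 \times 105\,\text{M} \approx 46.0\,\text{M} \\
\text{inference time} &\approx 0.566 \times 640\,\text{ms} \approx 362\,\text{ms}
\end{align*}
```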
In summary, our proposed Weighted-sum Tree GRU model is simple, yet effective and efficient in phrase-level sentiment intensity prediction because it makes full use of linguistic dependency features.

F. CASE STUDY
Table 3 shows the dependency parsing results of some phrases used for the case study and their valence-arousal rating predictions using the previously compared models. In the phrase '' ''(should be quite suitable), the modal '' '' (should be) modifies the headword '' '' (suitable) with a deontic relation, which indicates the speaker's attitude towards whether an event is true or not. In addition, the degree adverb '' '' (quite) plays a semantic role to emphasize the statement. By obtaining these dependency features correctly, our Weighted-sum Tree GRU model predicts valence and arousal ratings respectively of 6.032 and 4.301, which are close to the human-annotated ratings of 6.375 and 4.333. In the phrase '' '' (very sensitive), the degree adverb '' '' (very) modifies the headword '' ''(sensitive) with a behavioral relation, which indicates how quickly the speaker reacts to an external stimulus. Our proposed Weighted-sum Tree GRU model respectively predicts valence and arousal ratings of 3.826 and 6.225, which are the nearest to the human-annotated ratings of 3.813 and 6.375. Moreover, we can identify the two phrases '' '' (really not like) and '' '' (not really like) as having different meanings, despite their identical dependency modifiers. Our proposed Weighted-sum Tree GRU model can properly process the same modifiers in different orders but with the same dependency structure. For the phrase '' '' (human-annotated VA ratings are 2.111 and 6.714), our model predicts a valence of 2.432 and an arousal of 6.888. For the phrase '' '' (human-annotated VA ratings are 3.944 and 4.929), our model predicts very similar respective valence and arousal ratings of 3.943 and 4.975.
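The gaps between predicted and human-annotated ratings quoted above can be checked with a short calculation. The numbers are taken from the text; the dictionary keys are the English glosses, and assigning the last two rating pairs to "really not like" and "not really like" in the order the phrases are introduced is an assumption.

```python
# ((gold_valence, gold_arousal), (predicted_valence, predicted_arousal))
case_study = {
    "should be quite suitable": ((6.375, 4.333), (6.032, 4.301)),
    "very sensitive": ((3.813, 6.375), (3.826, 6.225)),
    "really not like": ((2.111, 6.714), (2.432, 6.888)),  # assumed mapping
    "not really like": ((3.944, 4.929), (3.943, 4.975)),  # assumed mapping
}


def absolute_errors(gold, pred):
    """Per-dimension absolute error, rounded to three decimals."""
    return tuple(round(abs(g - p), 3) for g, p in zip(gold, pred))


errors = {phrase: absolute_errors(gold, pred)
          for phrase, (gold, pred) in case_study.items()}

for phrase, (valence_err, arousal_err) in errors.items():
    print(f"{phrase}: valence {valence_err}, arousal {arousal_err}")
```

All eight errors stay below 0.35 on the nine-point scale, and the two phrases that differ only in modifier order receive clearly different predictions.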

V. CONCLUSION
We propose a Weighted-sum Tree GRU network for phrase-level sentiment intensity prediction, making the following contributions: (1) We develop a Chinese dependency parser based on the graph-based deep biaffine attention model to obtain dependency tree and relational information. Experimental results on the Sinica Treebank indicate that our graph-based model achieved a UAS of 92.9% and a LAS of 88.5%, outperforming transition-based methods, consistent with findings for English dependency parsing.
(2) We propose a Weighted-sum Tree GRU model that exploits dependency features to predict Chinese phrase-level sentiment intensity in the valence-arousal dimensions. Experimental results on the Chinese EmoBank indicate that our Weighted-sum Tree GRU model achieved an MAE of 0.392 and a PCC of 0.936 in the valence dimension and an MAE of 0.399 and a PCC of 0.915 in the arousal dimension, outperforming several transformer-based models. Quantitative analysis also confirms that our model is relatively lightweight and efficient compared with BERT-like transformers, especially as it does not need large amounts of data for pre-training.
Future work will exploit other semantic features and develop advanced models to further improve performance.