Toward Transformer Fusions for Chinese Sentiment Intensity Prediction in Valence-Arousal Dimensions

BERT (Bidirectional Encoder Representations from Transformers) uses an encoder architecture with an attention mechanism to construct a transformer-based neural network. In this study, we develop a Chinese word-level BERT to learn contextual language representations and propose a transformer fusion framework for Chinese sentiment intensity prediction in the valence-arousal dimensions. Experimental results on the Chinese EmoBank indicate that our transformer-based fusion model outperforms other neural-network-based, regression-based and lexicon-based methods, reflecting the effectiveness of integrating semantic representations in different degrees of linguistic granularity. Our proposed transformer fusion framework is also simple and easy to fine-tune over different downstream tasks.


I. INTRODUCTION
Sentiment analysis involving the use of natural language processing and computational linguistics to automatically identify affective information from texts has emerged as a leading technique for emotional AI applications [1], [2], [3], [4], [5], [6]. Representation of affective states is an essential issue in sentiment analysis and can be generally divided into category-based and dimension-based approaches. Category-based approaches represent affective states as several predefined discrete classes, such as positive, negative and neutral. Dimension-based approaches represent affective states as continuous numerical values, called intensity, in multiple dimensions to provide more fine-grained emotional information [7], [8], [9].
Figure 1 shows the two-dimensional valence-arousal (VA) space. Valence expresses the degree of pleasant and unpleasant (i.e., positive and negative) feelings, while arousal expresses the degree of excitement and calmness. Based on this representation, any affective expression can be mapped into the VA coordinate plane as a point by recognizing its valence-arousal ratings. For example, an affective word '' '' (priceless), with human-annotated VA ratings 6.2 and 4.8 in the Chinese EmoBank corpus [9], is located in the low-arousal and high-valence quadrant. A single sentence '' '' (It's a priceless treasure from somewhere) contains this affective word as a modifier to express an object with high value, with respective VA ratings of 7.5 and 6.5. A multi-word phrase '' '' (extremely painful) has a degree adverb to modify the affective word to express a negative-valence and high-arousal feeling (with VA ratings of 1.65 and 7.993). The multi-sentence text '' '' (physical and mental pains that are unbearable and medically incurable) contains multiple affective words to reflect negative complicated perceptions (valence 2.889 and arousal 4.286).
The associate editor coordinating the review of this manuscript and approving it for publication was Alessandro Floris.
However, existing Chinese PLMs were mainly trained on character-based sequences due to two main limitations. The first limitation is the need for word segmentation preprocessing due to a lack of delimiters between Chinese characters, and incorrectly segmenting word boundaries will cause error propagation, affecting the language representation in different contexts [46]. Nevertheless, word semantics can be exploited to enrich the character representation of Chinese PLM [47]. Taking the sentence '' '' (I am very touched every time when I read articles related to the Van Gogh brothers) from the Chinese EmoBank corpus [9] as an example, this sentence can be correctly segmented as '' (every) (time) (read) (Van Gogh) (brothers) (related) (pronounced as De) (articles) (then) (very) (touched)''. After word segmentation [48], we can find that the affective word '' '' (touched) is modified by a degree adverb '' '' (very) to express a positive-valence and high-arousal feeling. This is a helpful clue to predict the sentiment intensity of this sentence with VA ratings of 7.0 and 6.75. The second limitation is the need for huge pre-training data sets. Because Chinese words usually contain multiple characters, more data is needed to sufficiently reflect contexts for training a word-level Chinese PLM. For example, the above-mentioned sentence has 17 tokens in terms of characters, but only 11 tokens in terms of words. This shows the need for greater amounts of data to train a word-level Chinese PLM as opposed to a character-level one.
Recently, fusion-based methods have been used for categorical sentiment analysis [49], [50], [51] to classify sentiments as predefined discrete classes on multimedia or multimodal targets. To the best of our knowledge, there is no transformer-based fusion model for dimensional sentiment analysis to identify the intensity in continuous numerical values, especially for Chinese texts. We are thus motivated to develop a Chinese word-level BERT to address the above limitations for latent language representations and propose a transformer fusion framework based on different linguistic granularities for Chinese sentiment intensity prediction in the valence-arousal dimensions. The main contributions are summarized as follows:
(1) We develop a Chinese word-level BERT for contextual language representation.
We use the NCTU word segmentation tool [48] to process collected text corpora, with a total of 2.8 billion words. We pre-train a Chinese word-level BERT model () [44] over the same Masked Language Model (MLM) task based on a dynamic masking strategy [52]. We plan to release our word-level BERT as a pre-trained language model for further research.
(2) We explore transformer fusion methods for Chinese sentiment intensity prediction.
We propose a transformer fusion framework to integrate word-level and character-level transformers for Chinese sentiment intensity prediction in the valence-arousal dimensions. Chinese Valence-Arousal Sentences (CVAS) and Chinese Valence-Arousal Texts (CVAT) from the Chinese EmoBank corpus [9] were used to evaluate performance. In experiments, our proposed fusion model outperformed other neural-network-based, regression-based, and lexicon-based models, confirming the effectiveness of our transformer fusion framework.
The rest of this paper is organized as follows. Section II describes related studies for dimensional sentiment intensity prediction. Section III introduces details of our transformer fusion model for valence-arousal rating prediction. Section IV presents the experimental results and evaluation analysis. Conclusions are finally drawn in Section V.

A. LEXICON-BASED METHODS
A number of one-dimensional sentiment lexicons provide word-level sentiment intensity, including SentiWordNet [10], SO-CAL [11], SentiStrength [12], and VADER [13]. Affective Norms of English Words (ANEW) [14], [16] and Extended ANEW [15] are three-dimensional lexicons which provide real-valued scores for the valence, arousal and dominance dimensions. Lexicon-based methods typically determine the sentiment intensity of a given text by averaging the sentiment scores of words matched in the lexicon [4]. These approaches are simple and easy to implement, but do not capture real sentiment expressions due to complex linguistic usages in the texts. For example, two phrases '' '' (totally not agree) and '' '' (not totally agree) have the same words with different ordering, and express meanings with almost opposite affective states. Hence, lexicon-based methods are usually used to provide baseline results for reference.
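The averaging scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy lexicon, its scores, the `NEUTRAL` fallback, and the function name `lexicon_va` are hypothetical and are not taken from any of the cited lexicons.

```python
from statistics import mean

# Hypothetical toy lexicon: word -> (valence, arousal) on a 1-9 scale.
TOY_LEXICON = {
    "happy": (7.5, 6.0),
    "calm": (6.0, 2.5),
    "terrible": (2.0, 6.5),
}

NEUTRAL = (5.0, 5.0)  # assumed fallback when no token matches the lexicon


def lexicon_va(tokens, lexicon=TOY_LEXICON):
    """Average the VA scores of the tokens that are found in the lexicon."""
    matched = [lexicon[t] for t in tokens if t in lexicon]
    if not matched:
        return NEUTRAL
    valence = mean(v for v, _ in matched)
    arousal = mean(a for _, a in matched)
    return valence, arousal
```

Because the result depends only on which words are matched, reordering the tokens leaves the prediction unchanged, which is exactly the weakness illustrated by the two phrases with opposite meanings.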

B. REGRESSION-BASED METHODS
Regression-based methods have been intensively studied to predict valence-arousal ratings. A cross-lingual approach was used to train a linear regression model for valence-arousal score prediction, in which the dimension scores of English seed words were regarded as the source language and their translated Chinese seed words were viewed as the target language [17]. The valence ratings of new words were estimated based on semantic similarity scores and a kernel model which was trained using least mean squares estimation [18]. A locally weighted regression method was proposed to improve linear regression to predict the valence-arousal values of affective words [19]. A community-based weighted graph model that performs the regression task on a graph was developed to predict the dimension scores of words [20]. A linear regression model was built to predict sentence-level affective ratings based on combinations of partial affective ratings of word n-grams [21]. Support vector regression was used to predict the sentiment intensity of words and phrases [22].

C. NEURAL-NETWORK-BASED METHODS
In recent years, neural network models with sentiment embeddings that capture contextual and emotional information of words have been applied to dimensional score prediction [23]. To learn sentiment embeddings, a word vector refinement model was proposed to refine existing pretrained word vectors using real-valued intensity scores provided by affective lexicons [24]. A boosted neural network trained on character-enhanced word embeddings was used to predict valence-arousal ratings of words [25]. A convolutional neural network (CNN) was trained on Twitter word embeddings to exploit neural activation values for Twitter sentiment classification and quantification [26]. A densely connected long short-term memory (LSTM) network was used to concatenate features at different levels to predict dimension scores of Chinese affective words and phrases [27]. An ensemble of different neural networks was developed to determine the intensity level for different emotion categories such as anger, fear, joy and sadness [28]. Bidirectional LSTM and CNN were combined to consider global and local information to predict emotional intensity of tweets [29]. A neural-network-based architecture that combines convolutional layers, fully-connected layers, linguistic features, and pretrained CNN activations in a non-sequential fashion was used for emotion intensity prediction in tweets [30]. An adversarial attention network was presented to predict the dimension scores of short texts [31]. A pipelined neural network model was used to sequentially learn word intensity and modifier weights for phrase-level sentiment intensity prediction [32]. A weighted-sum tree GRU model was developed to include dependency features for predicting Chinese phrase-level sentiment intensity in valence-arousal dimensions [33].

D. TRANSFORMER-BASED METHODS
Recently, BERT-like transformer architectures have been widely used for dimensional sentiment analysis. The pre-trained and case-sensitive BERT-base model was fine-tuned to predict the degree of sentiment intensity associated with multiple entities for aspect-based sentiment analysis [34]. A multi-task architecture based on the RoBERTa transformer was proposed to predict empathy and distress scores [35]. The RoBERTa multi-task model and the vanilla ELECTRA model were combined to predict empathy scores [36]. A demographic-aware EmpathBERT architecture was presented to infuse demographic information for empathy prediction [37]. The BERT transformer was used to recognize the emotion intensity scores of Japanese tweets on the topic of vaccinations [38]. Pre-trained BERTweet was used as the shared text encoder between a multi-label emotion classifier and a multi-dimension emotion regressor in a multi-task learning framework [39]. The pre-trained MacBERT transformers were fine-tuned for a valence-arousal score prediction shared task on educational texts [40]. The BERT model was combined with specific sentiment word masking to improve sentence-level valence-arousal prediction [41]. The pre-trained RoBERTa-Large model was fine-tuned with categorical emotion labels to predict the continuous dimensions of valence, arousal, and dominance scores [42]. The domain-distilled BERT model was proposed to learn domain-invariant features on scarce language resources for dimensional sentiment score prediction [43].
Recently, transformer-based fusion methods have also been used for sentiment analysis, usually with promising results. BECMER combines a CNN model on audio signals and a BERT transformer on the lyrics for music emotion recognition [49]. The HFU-BERT framework improves the BERT transformer by integrating heterogeneous language, audio, and visual features for multimodal emotion recognition [50]. A stacking method was used to fine-tune BERT to generate metadata for each emotion type separately and then assemble them to train a meta-classifier for emotion category prediction [51].
Different from the above fusion methods used for categorical sentiment analysis on multimedia or multimodal targets, we aim to develop a transformer-based fusion framework for dimensional sentiment analysis of Chinese texts. This paper reports the pre-training of a Chinese word-level BERT for contextualized language representations and proposes a transformer fusion framework to combine word- and character-level BERT transformers for sentiment intensity prediction in the valence-arousal dimensions.

III. TRANSFORMER FUSION MODEL
Figure 2 shows our proposed network architecture for Chinese dimensional sentiment intensity prediction, comprised of two parts: 1) Chinese word-level BERT; and 2) word-/character-level transformer fusion.

A. CHINESE WORD-LEVEL BERT
BERT (Bidirectional Encoder Representations from Transformers) [44] is a pre-trained language model proposed by Google Research that uses a multi-layer transformer as its network architecture. BERT uses an encoder architecture with an attention mechanism [53] to construct a transformer-based neural network, providing state-of-the-art results in a wide variety of natural language processing tasks. There are two steps in the framework: 1) pre-training, in which the model is trained on unlabeled data over predefined tasks and 2) fine-tuning, in which the BERT model is first initialized with the pre-trained parameters and then fine-tuned using labeled data from the downstream tasks.
BERT proposes two pre-training tasks: 1) Masked Language Model (MLM): a fixed ratio of tokens is masked and the model then predicts the original value of the masked tokens based on the context and 2) Next Sentence Prediction (NSP): BERT is trained to predict whether the following sentence is probable or not based on the previous sentence. Through pre-training, BERT learns contextual embeddings for representations from large-scale data sets. After pre-training, BERT can be fine-tuned on smaller data sets to optimize its performance on specific tasks.
While a character-level BERT pre-trained model is publicly released [52], a Chinese word-level BERT is lacking due to the need for Chinese word segmentation pre-processing over huge data sets. Therefore, we collected a huge set of text corpora and segmented the texts into words using the NCTU word segmentation tool [48] to train the word-level BERT model. We only trained the MLM task using the dynamic masking strategy [54] for language model training. We use SentencePiece with Byte-Pair Encoding (BPE) as the subword detection mechanism.
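To illustrate what dynamic masking means for the MLM task, the following minimal sketch re-samples the masked positions every time a sequence is drawn, instead of fixing them once during preprocessing. The 15% ratio, the `[MASK]` string, and the function name `dynamic_mask` are assumptions made for illustration; the text above only states that a fixed ratio of tokens is masked, and the sketch also omits the random-replacement variants used in the original BERT recipe.

```python
import random

MASK = "[MASK]"  # assumed mask symbol


def dynamic_mask(tokens, mask_ratio=0.15, rng=random):
    """Return (masked_tokens, labels); a fresh mask is sampled on every call."""
    if not tokens:
        return [], []
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    # Labels keep the original token at masked positions and None elsewhere,
    # so the loss is computed only where the model has to predict.
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels
```

Calling `dynamic_mask` once per epoch on the same word sequence yields different training views of that sequence, which is the property that distinguishes dynamic from static masking.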

B. WORD-/CHAR-LEVEL TRANSFORMER FUSION
We further propose a transformer fusion framework to combine our developed word-level BERT with the existing character-level BERT for sentiment intensity prediction in the valence-arousal dimensions. In the encoding layer, the word/character-level token embedding $X_{emb}$ at a given position is obtained by looking up the embedding vector and adding up the vectors that correspond to that position, as shown in Eq. (1). The positional encoding uses sine and cosine functions to encode the positional information, ensuring a consistent relative relationship among different positions. For the self-attention mechanism, three parameter matrices $W_q$, $W_k$ and $W_v$ are used to respectively map the input vector $X_{emb}$ to three new vectors $Q = W_q X_{emb}$, $K = W_k X_{emb}$, and $V = W_v X_{emb}$. Layer normalization is applied to the residual multi-head output to accelerate convergence, as shown in Eq. (2). Finally, the multi-head embedding vector $X_{enc}$ is computed using two linear transformations and the GeLU activation function, as shown in Eq. (3).
In the decoding layer, we obtain the embeddings at different granularities, $X_{word}$ and $X_{char}$, via a 1-layer transformer that is identical to the encoding layer, using 2-head multi-head attention with 256 hidden dimensions, and then apply max-pooling to retain the important features for each dimension [55], as shown in Eq. (4). Finally, we concatenate the pooled embeddings $P_{word}$ and $P_{char}$, as shown in Eq. (5), and use them to obtain prediction scores through a 2-layer Multi-Layer Perceptron (MLP) with the hyperbolic tangent (tanh) activation function.
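The encoding steps described above can be sketched with NumPy as follows. This is a simplified illustration, not the authors' implementation: it uses a single attention head, omits biases and learnable layer-normalization parameters, and multiplies row vectors on the left (`x @ w`) rather than using the column convention $Q = W_q X_{emb}$. All function names and weight shapes are assumptions, and since Eqs. (1) to (3) are not reproduced in this text, the sketch follows the standard transformer formulation [53].

```python
import numpy as np


def positional_encoding(seq_len, d):
    """Sinusoidal encoding: sine on even dimensions, cosine on odd ones."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def gelu(x):
    # tanh approximation of the GeLU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def encoder_layer(x_emb, w_q, w_k, w_v, w_1, w_2):
    """x_emb has shape (seq_len, d); returns an array of the same shape."""
    q, k, v = x_emb @ w_q, x_emb @ w_k, x_emb @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    h = layer_norm(x_emb + attn)   # residual connection + layer normalization
    ffn = gelu(h @ w_1) @ w_2      # two linear transformations with GeLU
    return layer_norm(h + ffn)
```

In a full model the same layer would be stacked (12 layers in the BERT encoders used here) and the attention split across several heads; the single-head form is kept only to make the data flow easy to follow.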
$$P_{word}, P_{char} = \max(X_{word}, X_{char}) \tag{4}$$
$$r = \tanh(\tanh(P_{word} + P_{char})) \tag{5}$$
For sentiment intensity prediction in the valence-arousal dimensions, we use the downstream task datasets to fine-tune the pre-trained word/character-level BERT models in our transformer fusion framework to obtain the valence-arousal ratings.
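The pooling and fusion steps of Eqs. (4) and (5) can be sketched as below. This is a hedged illustration under several assumptions: the `+` in Eq. (5) is read as concatenation, as the surrounding description states; the weight shapes and the names `fusion_head` and `to_va_scale` are hypothetical; and the linear rescaling of the tanh output to the 1-9 rating scale is an assumption, since the text does not say how the output is mapped to VA ratings.

```python
import numpy as np


def fusion_head(x_word, x_char, w1, b1, w2, b2):
    """x_word: (n_words, d), x_char: (n_chars, d) -> array [valence, arousal]."""
    p_word = x_word.max(axis=0)   # max-pooling over the token axis, Eq. (4)
    p_char = x_char.max(axis=0)
    fused = np.concatenate([p_word, p_char])   # fusion by concatenation, Eq. (5)
    hidden = np.tanh(fused @ w1 + b1)          # first MLP layer
    return np.tanh(hidden @ w2 + b2)           # second MLP layer, values in [-1, 1]


def to_va_scale(r):
    """Assumed rescaling from [-1, 1] to the 1-9 rating scale."""
    return 5.0 + 4.0 * r
```

Max-pooling makes the two pooled vectors independent of sequence length, so a 9-token word sequence and a 17-token character sequence can be fused without any alignment between words and characters.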
Take the following sentence '' '' (Why can I give up so resolutely now?) with VA of 4.333 and 4.000 as an example. We can obtain a 17-token character sequence '' '' and a 9-token word sequence '' (why) (I) (now) (can) (so) (resolutely) (pronounced as De) (give up)'' to generate the embeddings at both the character and word levels based on Eq. (1). Both embedding sequences at different levels of linguistic granularity are fed into the encoder layer of the 12-layer word-/character-level BERT model, using the process described in Eq. (2) and the GeLU activation function specified in Eq. (3). Then, the outputs are passed to the decoder layer of a 1-layer transformer using 2-head multi-head attention. Consequently, through the max pooling operation described in Eq. (4), we can respectively obtain sampled word/character embeddings for fusion. Finally, following Eq. (5), we merge the word/character representations through 2-layer MLPs to predict the VA ratings. Comparing the predicted results of this example sentence, the standalone word-level BERT predicted a valence of 4.969 and an arousal of 5.314, while the standalone character-level BERT model predicted VA ratings of 5.048 and 4.801. Our word/character-level BERT fusion model can obtain improved valence (3.916) and arousal (4.003) results, relatively close to human-annotated VA ratings of 4.333 and 4.000.

IV. EVALUATION
A. DATASETS
Chinese valence-arousal sentences (CVAS) and Chinese valence-arousal texts (CVAT) from the Chinese EmoBank corpus [9] were used to evaluate sentiment intensity prediction performance. Valence-Arousal (VA) ratings were annotated through crowdsourcing with each instance randomly assigned to 10 annotators. Both the valence and arousal dimensions use a nine-degree scale. A value of 1 on the valence and arousal dimensions respectively denotes extremely high-negative and low-arousal sentiment, while a 9 denotes extremely high-positive and high-arousal sentiment, and 5 denotes a neutral and medium-arousal sentiment. Outlier ratings were identified and excluded from the calculation of the average VA ratings.
CVAS was collected from Chinese tweets, including 2,852 single sentences with an average of 11.7 characters or 7.3 words. CVAT collects web texts crawled across six different categories: news articles, political discussion forums, car discussion forums, hotel reviews, book reviews, and laptop reviews. A total of 2,969 multi-sentence texts were included in the CVAT, each with an average of 55.1 characters or 35.5 words. Each instance in the CVAT is about five times as long as an instance in the CVAS in terms of character or word length. In addition, the ratio of the number of characters to the number of words is near 1.6 in the CVAS and 1.55 in the CVAT.
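The ratios quoted above follow directly from the reported averages, as the short check below shows. The dictionary layout and function names are hypothetical; only the four average lengths come from the text.

```python
# Average instance lengths reported for the two datasets.
stats = {
    "CVAS": {"chars": 11.7, "words": 7.3},
    "CVAT": {"chars": 55.1, "words": 35.5},
}


def chars_per_word(name):
    """Average number of characters per word in one dataset."""
    s = stats[name]
    return s["chars"] / s["words"]


def length_ratio(unit):
    """How many times longer a CVAT instance is than a CVAS instance."""
    return stats["CVAT"][unit] / stats["CVAS"][unit]
```

Rounded, the characters-per-word ratios are 1.60 for CVAS and 1.55 for CVAT, and a CVAT instance is about 4.7 times longer in characters and about 4.9 times longer in words.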

B. SETTINGS
To train the Chinese word-level BERT, we collected the following text resources: LDC Chinese Gigaword (Version 2.0), Sinica Balance Corpus (Version 4.0), Chinese Information Retrieval Benchmark (Version 3.03), Taiwan Panorama Magazine, Mandarin Conversation Dialogue Corpus (MCDC), National Educational Radio Corpus, Microphone Speech Database (TCC300), and the NYCU text corpus (collected from Chinese Wikipedia and other web pages). After preprocessing with the NCTU word segmentation tool [48] and text normalization, we obtained about 2.8 billion words to train the Chinese word-level BERT model. Although this corpus is large, it is still clearly smaller than the 3.3 billion words used to train the English BERT released by Google.
The experimental implementations were carried out using the ASUS Taiwan Computing Cloud (TWCC) computing resource. The hyper-parameters of our transformer fusion framework were set as follows: the batch size was 16; max pooling was used; the decoder was a 1-layer transformer; the compared character-level BERT and our developed word-level BERT both had 12 layers, a hidden size of 768, and 12 attention heads; the 2-layer MLP had a dimension of 768; the optimizer was AdamW; and the number of epochs was restricted to 20.
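For reference, the reported settings can be collected in a single configuration object. The class and field names below are hypothetical; the values come from the text (the 2-head, 256-dimension decoder setting is taken from the model description in Section III). Settings that are not reported, such as the learning rate, are deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FusionConfig:
    batch_size: int = 16
    pooling: str = "max"
    decoder_layers: int = 1
    decoder_heads: int = 2
    decoder_hidden: int = 256
    bert_layers: int = 12      # same for word- and character-level BERT
    bert_hidden: int = 768
    bert_heads: int = 12
    mlp_layers: int = 2
    mlp_dim: int = 768
    optimizer: str = "AdamW"
    max_epochs: int = 20

    @property
    def bert_head_dim(self) -> int:
        """Per-head dimension implied by the hidden size and head count."""
        return self.bert_hidden // self.bert_heads
```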

C. METRICS
We used five-fold cross-validation, identical to that used for the Chinese EmoBank corpus [9]. The sentiment intensity prediction performance is evaluated by examining the difference between machine-predicted ratings and human-annotated ratings using two metrics to independently evaluate the valence and arousal predictions: Mean Absolute Error (MAE) and Pearson Correlation Coefficient (PCC), defined as follows:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |a_i - p_i|$$
$$\mathrm{PCC} = \frac{1}{n-1}\sum_{i=1}^{n} \left(\frac{a_i - \mu_A}{\sigma_A}\right)\left(\frac{p_i - \mu_P}{\sigma_P}\right)$$
where $a_i \in A$ and $p_i \in P$ respectively denote the i-th actual value and predicted value, n is the number of test samples, $\mu_A$ and $\sigma_A$ respectively represent the mean value and the standard deviation of A, while $\mu_P$ and $\sigma_P$ respectively represent the mean value and the standard deviation of P.
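The two metrics can be computed with a few lines of plain Python. This is a minimal sketch with hypothetical function names; the PCC is written in its sum-of-squares form, which equals the product-of-standardized-values form whenever the same normalization is used for the covariance and the standard deviations.

```python
from math import sqrt


def mae(actual, predicted):
    """Mean absolute error between two equally long rating sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pcc(actual, predicted):
    """Pearson correlation coefficient between actual and predicted ratings."""
    n = len(actual)
    mu_a = sum(actual) / n
    mu_p = sum(predicted) / n
    cov = sum((a - mu_a) * (p - mu_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mu_a) ** 2 for a in actual)
    ss_p = sum((p - mu_p) ** 2 for p in predicted)
    return cov / sqrt(ss_a * ss_p)
```

Note that the two metrics capture different things: predictions that are compressed toward the middle of the scale can still correlate perfectly with the annotations while having a non-zero MAE.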
The actual and predicted values range from 1 to 9, so MAE measures the average absolute error on this nine-point scale. The PCC is a value between −1 and 1 that measures the linear correlation between the actual value and the predicted value. A lower MAE and a higher PCC indicate more accurate prediction performance. Each metric for the valence and arousal dimensions is ranked independently. A model's overall ranking is computed based on the cumulative rank across the four metrics. The lower the cumulative rank, the better the system performance.
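The cumulative ranking described above can be sketched as follows. The metric keys, the function name, and the example scores are hypothetical and purely illustrative; they are not results from this study. Ties are not handled specially here (the text does not say how ties are ranked), so the sketch assumes distinct metric values.

```python
def cumulative_ranks(results):
    """results maps model -> {"v_mae", "v_pcc", "a_mae", "a_pcc"} scores.

    Returns model -> sum of its per-metric ranks (1 = best on that metric).
    """
    lower_is_better = {"v_mae": True, "a_mae": True, "v_pcc": False, "a_pcc": False}
    totals = {model: 0 for model in results}
    for metric, ascending in lower_is_better.items():
        ordered = sorted(results, key=lambda m: results[m][metric],
                         reverse=not ascending)
        for rank, model in enumerate(ordered, start=1):
            totals[model] += rank
    return totals


# Made-up scores for three hypothetical models.
toy = {
    "A": {"v_mae": 0.50, "v_pcc": 0.90, "a_mae": 0.70, "a_pcc": 0.60},
    "B": {"v_mae": 0.60, "v_pcc": 0.80, "a_mae": 0.65, "a_pcc": 0.70},
    "C": {"v_mae": 0.90, "v_pcc": 0.50, "a_mae": 0.90, "a_pcc": 0.30},
}
```

With these made-up scores, models A and B each win two of the four metrics and end up with the same cumulative rank, which shows how two systems can tie overall while leading on different metrics.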

D. RESULTS
In the first set of experiments, the following four model types were compared to demonstrate their performance. Experimental results of the first three types were obtained from the Chinese EmoBank corpus evaluation [9] for reference, whereas the last one was conducted by this study.
• [...] of a given instance in CVAS (or CVAT) by averaging the valence (or arousal) ratings of words/phrases in the CVAW and CVAP.
• Transformer-based methods: including character-level BERT model () [44] released by Google Research, our developed word-level BERT and transformer fusion model.
Table 1 shows the prediction results of CVAS. For both the lexicon-based and regression-based methods, the SVR approach outperformed the others in both the valence and arousal dimensions. The character-level BERT outperformed the other neural-network-based methods in both dimensions. Compared with the character-level BERT, our word-level BERT ranked slightly lower overall. In our observations, short sentences with an average of 7.3 words (or 11.7 characters) do not provide sufficient information for valence-arousal rating prediction using complicated neural networks, especially for word-level models. Our fusion model ranked first for valence MAE (0.494) and valence PCC (0.891), while the character-level BERT ranked first for arousal MAE (0.700). Finally, both methods tied for first in overall performance with the same cumulative rank.
Table 2 shows the prediction results of CVAT. For lexicon-based, regression-based, and neural-network-based methods, we obtained nearly consistent findings. For transformer-based methods, the overall performance of our word-level BERT was close to that of character-level BERT in terms of overall cumulative rank. Based on our observations, the average word length of a given text in CVAT is about five times that of a short sentence in CVAS. This characteristic benefits the word-level models. Our fusion model ranked first for valence MAE (0.519), arousal MAE (0.494) and arousal PCC (0.695) and second for valence PCC (0.831). Overall, our fusion model ranked first in terms of cumulative rank.
In summary, almost all models on the CVAS clearly underperformed the corresponding model results on the CVAT. The valence-arousal ratings for CVAS data containing single sentences from Twitter were more difficult to predict than for the multi-sentence texts in CVAT, which provide more information. Comparing results achieved by word-level BERT on CVAS and CVAT, we find that performance improves with increased sentence length. Character-level BERT outperformed word-level BERT, possibly because of the insufficient size of the pre-training data sets, with a difference of about 500 million words. However, our fusion model combining word- and character-level BERT provided the best overall performance by including features in different linguistic granularities.

V. CONCLUSION
We propose a transformer fusion framework for Chinese sentiment intensity prediction in the valence-arousal dimensions, making the following contributions: (1) We develop a Chinese word-level BERT model based on huge collected data sets to obtain contextual language representations. We plan to release the pre-trained language model for further research. (2) We propose a transformer fusion framework to predict valence-arousal ratings for dimensional sentiment analysis. Experimental results on the Chinese EmoBank indicate that our fusion model integrating word- and character-level BERT outperformed other neural-network-based, regression-based and lexicon-based methods. Future work will exploit other semantic features and develop other pre-training tasks to further improve performance for Chinese dimensional sentiment analysis.

FIGURE 1. Two-dimensional valence-arousal space. Based on this representation, any affective expression can be mapped into the VA coordinate plane as a point by recognizing their valence-arousal ratings.

FIGURE 2. Our proposed transformer fusion framework. We propose word-level BERT to fuse the existing character-level BERT for Chinese dimensional sentiment intensity prediction. Two transformers in different granularities are separately pretrained and fine-tuned, and then jointly optimized to predict valence-arousal values.

TABLE 1. Results of sentiment intensity prediction on CVAS.

TABLE 2. Results of sentiment intensity prediction on CVAT.