Context-Specific Heterogeneous Graph Convolutional Network for Implicit Sentiment Analysis

Sentiment analysis has attracted considerable attention in recent years. In particular, implicit sentiment analysis is a more challenging problem due to the lack of sentiment words. It requires us to combine contextual information and precisely understand how the emotion changes. Graph convolutional network (GCN) techniques have been widely applied to sentiment analysis since they are capable of learning from complex structures and preserving global information. However, these models either focus only on extracting features from a single sentence and ignore the contextual semantic background, or consider only the textual information and overlook the phrase dependencies when constructing the graph. To address these problems, we propose a new context-specific heterogeneous graph convolutional network (CsHGCN) framework that can combine all context representations. Its complete context reflects the information in a document more comprehensively, and its dependency structure captures token-token semantics more accurately. The experimental results on a Chinese implicit sentiment dataset show that our proposed model can effectively identify the target sentiment of sentences, and visualization of the attention layers further demonstrates that the model selects qualitatively informative tokens and sentences.


I. INTRODUCTION
Sentiment analysis is an important problem in natural language processing (NLP). According to the expression of subjective and objective emotion in texts, Liu [1] divided texts into implicit emotional and explicit emotional texts. The implicit emotional text is defined as ''A language fragment (sentence, clause or phrase) that expresses subjective sentiment but contains no explicit sentiment word''. Here are two examples that briefly illustrate the difference between explicit and implicit emotions. Example 1 shows an explicit emotional tendency through the word '' (beautiful)''. However, some texts, such as Example 2, show implicit emotional tendencies with no explicit emotional words. Unstructured texts produced by users on social media contain indirect depictions of real social life, which reflect the behavior of people in real life. These data are widely used in applications such as user feedback in the catering industry, online public opinion analysis, and personalized product recommendations [2].
The associate editor coordinating the review of this manuscript and approving it for publication was Pengcheng Liu.
Representation learning is an essential intermediate step in text sentiment classification tasks. In previous studies, convolutional neural networks (CNNs) [3] and recurrent neural networks (RNNs) [4] have been widely applied to text sentiment analysis. To represent the hierarchical relationship between phrases in a sentence, syntax trees or dependency trees are introduced to replace the original sequential structure in CNNs and RNNs [5], [6]. In the past two years, graph neural networks (GNNs) such as graph convolutional networks (GCNs) [7], [8] and graph recurrent networks (GRNs) [9] have attracted widespread attention. GNNs are simple and can effectively capture deep-level domain features. They are widely used in the field of sentiment analysis, including at the text level [10], [11], sentence level [12], and aspect level [13].
Currently, most research on sentiment analysis focuses on explicit sentiment classification tasks. These tasks are already solved with high accuracy, so the room for improvement is limited. However, implicit sentiment analysis is a more challenging problem due to the lack of sentiment words. Statistics show that implicit sentiment sentences account for 15%-20% of all Chinese sentiment sentences. In Chinese texts, the implicit sentiment classification task is not solved as well as the explicit one because of the following challenges:
• From the linguistic level of emotional expression, the same Chinese sentence can express different emotions under different contextual semantic backgrounds, and the acquisition of semantic features is more complicated.
• From the perspective of words, the implicit text does not contain emotional terms, and the words are relatively objective and neutral, so text representation methods based on the bag-of-words model cannot effectively represent the semantics of a sentence.
• Implicit emotional sentences are euphemistic expressions of subjective emotional tendencies, and they are closely related to the individual's cognitive background. There is no formal standard definition for them.
It can be observed that the implicit sentiment classification task places higher requirements on learning text representations, and it needs more semantic information to infer the emotional tendency of a text. We find that the current research on text sentiment classification still has the following limitations. Previous studies either focused on extracting features from a single sentence and ignored contextual semantics, or considered only discourse information when constructing the graph and ignored the dependencies between tokens.
In Example 3, the first sentence (Example 3-1) is labeled, and the second sentence (Example 3-2) is unlabeled. Example 3-2 is the context of the target sentence Example 3-1. Given the above limitations, we propose a new context-specific heterogeneous graph convolutional network (CsHGCN) framework. First, we separate the emotional target sentence from its context in a document and then represent all the remaining context text as a heterogeneous graph. The nodes of the heterogeneous graph are tokens or sentences. The edges between Chinese token nodes are constructed from a syntactic dependency tree, the edges between token nodes and sentence nodes are weighted by term frequency-inverse document frequency (TF-IDF), and the edges between sentence nodes are constructed from the sentence order. Finally, we propose a framework for building graph convolutional networks (GCNs) [7] over heterogeneous graphs to extract semantic background information and word dependencies. In summary, our main contributions are as follows:
• We propose a new context-specific heterogeneous graph convolutional network (CsHGCN) framework for implicit sentiment analysis. The whole context at the document level is treated as a heterogeneous graph. The dependency structure is maintained, and the obtained context features are more accurate.
• We apply a novel GCN model to implicit sentiment classification tasks. A GCN can be regarded as an adaptation of the CNN for encoding local information of unstructured data, and it can effectively capture deep-level domain features.
• Extensive experimental results show that, in implicit sentiment classification, the scores of all models can be improved by the contextual semantic background and that our proposed CsHGCN has better performance and interpretability.

II. RELATED WORK
A. SENTIMENT ANALYSIS
According to the method of text processing, sentiment analysis is mainly divided into two categories: methods based on a sentiment lexicon and methods based on deep learning. The advantage of lexicon-based methods is their simple structure. Nevertheless, the creation of a sentiment lexicon requires feature selection as well as a large amount of labeled data, which costs considerable time and human resources. Moreover, to obtain high classification accuracy, large and high-quality emotional dictionaries are required. For these reasons, there are relatively few studies on dictionary-based sentiment analysis [14], [15]. Methods based on deep neural networks have stronger knowledge representation ability [16]. CNNs [3] and RNNs [4] are widely applied to sentiment classification tasks. Zhang et al. [17] showed that a character-level CNN can effectively extract local information within the same convolution window. Models based on RNNs can encode sequence information. Ma et al. [18] proposed Sentic LSTM to extract common-sense information in the sequence model. Yang et al. [19] proposed a hierarchical attention network (HAN), which achieved excellent results. Tai et al. [20] and Teng and Zhang [21] used Tree-LSTM to encode tree-structured sentences to extract hierarchical information for sentiment analysis. Although Tree-LSTM can acquire more accurate semantic information, it has disadvantages such as slow training and difficulty in parallelization. Later, Mou et al. [22] combined the advantages of CNNs and RNNs and proposed a tree-based convolutional neural network (TB-CNN). However, word order features were not considered in TB-CNN. Subsequently, Liao et al. [23] proposed a multilayer convolutional neural network model, the semantic dependency tree-based CNN (SDT-CNN), based on the syntactic dependency tree, which can learn an implicit dependency representation and a context-explicit semantic background representation of emotional information in the text. However, the contexts of the sentences in the SDT-CNN model are not connected to each other, which may cause some semantic deficiencies. In contrast, the heterogeneous graph we propose not only maintains the dependency structure but also has edges connecting the contexts, so the information obtained is more complete.

B. GRAPH CONVOLUTIONAL NETWORK
In recent years, graph convolutional neural networks have received increasing attention. Scarselli et al. [24] proposed a graph neural network (GNN) for encoding an arbitrary graph structure. Bruna et al. [25] first proposed a CNN structure based on spectral graph theory, generalizing the CNN model beyond regular grid structures. Defferrard et al. [26] proposed a CNN algorithm based on spectral graph theory, which defined graph convolution. On this basis, Kipf and Welling [7] proposed a graph convolutional neural network model that achieved the best results on a benchmark data set. In recent research on related NLP tasks, Yao et al. [10] proposed constructing a heterogeneous graph on the entire corpus for text-level classification. Wu et al. [27] removed the nonlinearities between GCN layers and collapsed the resulting function into a single linear transformation to reduce the extra complexity of GCNs. Zhang et al. [28] and Chen et al. [13] introduced the syntactic dependency tree into the GCN to encode the syntactic structure and used the new GCN structure in sentiment analysis at the sentence level and aspect level. In previous studies, when building document-level graphs, the edges between token nodes were based on either sequential order or co-occurrence. In contrast, we consider the hierarchical relationships within sentences when constructing heterogeneous graphs, and the structure of the dependency tree is used to obtain more accurate semantic information.

III. CONTEXT-SPECIFIC HETEROGENEOUS GRAPH CONVOLUTIONAL NETWORK
Fig. 1 shows the overall framework of our model, the context-specific heterogeneous graph convolutional network (CsHGCN), which consists of four parts: the word coding layer, the information extraction layer, the attention layer, and the final vector representation layer. In the following sections, the components of CsHGCN are described in detail.

A. WORD REPRESENTATION AND BIDIRECTIONAL GRU CODING
In this component, each token in the sentence is mapped to a low-dimensional vector space to obtain a word embedding matrix $E \in \mathbb{R}^{n \times d_e}$, where $n$ is the size of the vocabulary and $d_e$ is the dimension of the word vector. $E$ is then fed to a bidirectional gated recurrent unit (Bi-GRU) [29] network to obtain the hidden state vectors, where $h_i \in \mathbb{R}^{m}$ represents the hidden state vector of the Bi-GRU network at time step $i$, and $m$ represents the vector dimension.
[Figure caption: Context-specific token-sentence graph. Nodes beginning with ''S'' are sentence nodes; the others are token nodes. The bold blue edges are token-token edges, and the thin blue edges are document-word edges. Apart from blue, tokens drawn in the same color are the same token. For example, the two green ''T'' nodes represent the same token in different dependency trees.]
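The encoding step above can be illustrated with a minimal pure-Python sketch of a bidirectional GRU. It is not the authors' implementation: the gate convention (update gate `z`, reset gate `r`, candidate state), the random initialization, and the names `init_params`, `gru_step`, `run_gru` and `bi_gru` are assumptions made for illustration. The per-direction size is called `hidden` here, so the concatenated state has `2 * hidden` entries.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, x, U, h, b):
    # Computes W x + U h + b row by row.
    return [sum(w * xi for w, xi in zip(w_row, x))
            + sum(u * hi for u, hi in zip(u_row, h)) + bi
            for w_row, u_row, bi in zip(W, U, b)]


def init_params(d_e, hidden, rng):
    # Hypothetical initialization: small uniform weights, zero biases.
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in "zrh":
        params["W" + gate] = mat(hidden, d_e)
        params["U" + gate] = mat(hidden, hidden)
        params["b" + gate] = [0.0] * hidden
    return params


def gru_step(p, x, h):
    z = [sigmoid(v) for v in affine(p["Wz"], x, p["Uz"], h, p["bz"])]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], x, p["Ur"], h, p["br"])]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in affine(p["Wh"], x, p["Uh"], rh, p["bh"])]
    # Assumed convention: new state interpolates the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(p, seq, hidden):
    h, outputs = [0.0] * hidden, []
    for x in seq:
        h = gru_step(p, x, h)
        outputs.append(h)
    return outputs


def bi_gru(fwd, bwd, seq, hidden):
    # Forward pass, backward pass on the reversed sequence, then concatenation.
    forward = run_gru(fwd, seq, hidden)
    backward = run_gru(bwd, seq[::-1], hidden)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]
```

Each input in `seq` stands for one row of the embedding matrix looked up for a token, and the list returned by `bi_gru` holds one hidden state per token.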

B. CONTEXTUAL-SPECIFIC GRAPH
We construct a text graph of token nodes and sentence nodes for the context so that the GCN can obtain sufficient node information. The text graph is $G = (V, E)$, where $|V| = n$ is the number of nodes, i.e., the total number of sentence nodes and token nodes in a document, and $E$ is the edge set of $G$. The edges are built from token dependencies, the term frequency-inverse document frequency (TF-IDF) of a token in a sentence, and the sentence order. The TF-IDF of the token determines the weight of the sentence-token edge, where term frequency (TF) is the number of times the token appears in the sentence, and inverse document frequency (IDF) is the logarithmically scaled inverse fraction of the number of sentences in the document that contain the token. To relate the dependency trees of the individual sentences, we incorporate sentence order as a feature to represent the relationship between sentence nodes. The matrix $X \in \mathbb{R}^{n \times m}$ is composed of the feature vectors $x_v \in \mathbb{R}^{m}$ of the $n$ nodes, where $m$ is the feature dimension, and $A \in \mathbb{R}^{n \times n}$ is the adjacency matrix of $G$, which encodes the relationships between nodes. Formally, the weight of the edge between node $i$ and node $j$ is defined according to the edge type, where $DT(i, j)$ is the dependency relationship between the token nodes $i$ and $j$ in the dependency tree. The construction of the text graph is shown in Fig. 2. When the input is the target sentence, the heterogeneous graph degenerates into a dependency tree, as shown in Fig. 3.
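The three edge types described above can be sketched as follows. This is a hedged illustration rather than the paper's exact weighting: the function name `build_context_graph`, the weight 1.0 for dependency edges and for consecutive-sentence edges, and the choice to share one node among repeated occurrences of a token are all assumptions. The TF-IDF weight follows the definition in the text (raw count times the log of the inverse sentence fraction).

```python
import math


def build_context_graph(sentences, dep_edges):
    """Build a symmetric adjacency matrix over token nodes and sentence nodes.

    sentences: list of token lists (the context of one document).
    dep_edges: iterable of (head_token, dependent_token) pairs taken from
               the dependency trees (the parser itself is out of scope here).
    """
    vocab = sorted({tok for sent in sentences for tok in sent})
    tok_id = {tok: i for i, tok in enumerate(vocab)}
    n_tok, n_sent = len(vocab), len(sentences)
    n = n_tok + n_sent
    A = [[0.0] * n for _ in range(n)]

    # Token-token edges from the dependency trees (weight 1.0 is an assumption).
    for head, dep in dep_edges:
        i, j = tok_id[head], tok_id[dep]
        A[i][j] = A[j][i] = 1.0

    # Sentence-token edges weighted by TF-IDF, with sentences acting as documents.
    for s, sent in enumerate(sentences):
        for tok in set(sent):
            tf = sent.count(tok)
            df = sum(1 for other in sentences if tok in other)
            weight = tf * math.log(n_sent / df)
            i, j = tok_id[tok], n_tok + s
            A[i][j] = A[j][i] = weight

    # Sentence-sentence edges between consecutive sentences (assumed weight 1.0).
    for s in range(n_sent - 1):
        i, j = n_tok + s, n_tok + s + 1
        A[i][j] = A[j][i] = 1.0

    return vocab, A
```

Under this definition, a token that occurs in every sentence of the document receives an IDF of zero and therefore a zero-weight sentence-token edge.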

C. CONVOLUTIONAL OVER THE HETEROGENEOUS GRAPH
$X$ and $A$ of the constructed heterogeneous text graph $G$ are used as the inputs of the GCN, and the propagation between layers is given by
$$\tilde{A} = D^{-\frac{1}{2}} (A + I) D^{-\frac{1}{2}},$$
$$H^{(j+1)} = \rho\left(\tilde{A} H^{(j)} W_j + b\right),$$
where $\tilde{A}$ is the symmetric normalized adjacency matrix, $D$ is the degree matrix of the nodes, $I$ is the identity matrix, $W_j$ is a weight matrix, $\rho$ is an activation function, $j$ indexes the layer, $b$ is a bias, and $H^{(0)} = X$. A single-layer GCN can only rely on one layer of convolution to obtain the information of neighboring nodes; by stacking more layers, the GCN can integrate the knowledge of a broader neighborhood. Therefore, we feed the text graph into a simple two-layer GCN, in which the node representations are updated as
$$H^{G} = \rho\left(\tilde{A}\, \rho\left(\tilde{A} X W_0 + b\right) W_1 + b\right).$$
A two-layer GCN allows information to pass between nodes that are at most two steps apart. Even if there is no directly connected edge between two such nodes in the graph, the GCN still enables information exchange between them. The number of sentences per document in our data set is relatively small. Further experiments have found that the performance of the two-layer GCN is better than that of the single-layer GCN, which is consistent with the result of Kipf and Welling [7].
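The two-hop behaviour described above can be checked with a small pure-Python sketch of the propagation rule. It assumes the Kipf and Welling style normalization with self-loops, a ReLU activation, and separate biases per layer; the helper names are hypothetical and the weights are placeholders, not trained values.

```python
import math


def matmul(P, Q):
    # Plain matrix product for lists of lists.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    return [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def gcn_layer(A_norm, H, W, b):
    # One propagation step: ReLU(A_norm H W + b); ReLU is an assumed choice of rho.
    Z = matmul(matmul(A_norm, H), W)
    return [[max(0.0, z + bk) for z, bk in zip(row, b)] for row in Z]


def two_layer_gcn(A, X, W0, b0, W1, b1):
    A_norm = normalize_adjacency(A)
    return gcn_layer(A_norm, gcn_layer(A_norm, X, W0, b0), W1, b1)
```

On a path graph with three nodes where only the first node carries a nonzero feature, one layer leaves the third node at zero, while two layers give it a positive value, which is the two-step information passing the text refers to.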

D. CONTEXT-SPECIFIC ATTENTION
Not all context tokens contribute equally to judging the sentiment of the target sentence. Hence, we match the essential features related to the target sentence from the matrix $H^{G}_{context}$ with the overall context information, and a correlation attention weight is set for each context token. First, we apply word-level attention to the target sentence representation $H^{G}_{target}$ output by the GCN layer. The attention is calculated as follows. As shown in (9), we first feed the word representation $H^{G}_{target}$ obtained by the GCN to a one-layer MLP to obtain $O^{G}_{target}$ and then calculate the similarity between the vector $o_w$ and each token in $O^{G}_{target}$ to obtain $\alpha_i$. As in HAN [19], $o_w$ is a context vector that is randomly initialized and learned during the training process. Finally, the weighted vectors in $O^{G}_{target}$ are summed to obtain the final representation $o_t$ of the target sentence.
Next, we match all context nodes against $o_t$ to obtain the relevant information. Similarly, we feed the matrix $H^{G}_{context}$ into a single-layer MLP to obtain $O^{G}_{context}$ and use the target sentence representation obtained in (8) as the query for each token in $O^{G}_{context}$. We perform the attention calculation to obtain $\beta_i$. Similarly, (11) indicates that the vectors in $O^{G}_{context}$ are weighted and summed according to the assigned weights to obtain the final representation $O_c$ of the context.
The final vector representation $r$ is given by (12), where $f(\cdot)$ represents the concatenation operation, and $r$ integrates the information interaction between the target sentence and the relevant context information.
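The two attention stages described in this subsection can be sketched as below. The sketch assumes dot-product similarity followed by a softmax, and it takes the MLP outputs as given inputs (the one-layer MLP projection is omitted); the function names and the use of list concatenation for $f(\cdot)$ are illustrative assumptions, not the authors' code.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, vectors):
    """Score every vector against the query and return (weights, weighted sum)."""
    weights = softmax([dot(query, v) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    return weights, pooled


def context_specific_attention(O_target, O_context, o_w):
    # Stage 1: the learned vector o_w attends over the target-sentence tokens.
    alpha, o_t = attend(o_w, O_target)
    # Stage 2: the target representation o_t queries all context nodes.
    beta, o_c = attend(o_t, O_context)
    # Concatenation stands in for the operation f(.) in the text.
    r = o_t + o_c
    return alpha, beta, r
```

Context nodes that are more similar to the target representation receive larger weights `beta`, which is the property the attention visualizations in the case study rely on.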

E. SENTIMENT CLASSIFICATION AND TRAINING
The obtained vector $r$ is fed to a fully connected layer, and the output is classified into three categories using softmax:
$$P = \mathrm{softmax}(W_p r + b),$$
where $P \in \mathbb{R}^{d_p}$ represents the class probabilities, $d_p$ represents the number of classes, and $W_p \in \mathbb{R}^{d_p \times m}$ and $b \in \mathbb{R}^{d_p}$ represent the trainable weights and biases. During training, a stochastic gradient descent (SGD) algorithm is used to update the parameters, and $L_2$-norm regularization is used to prevent overfitting. The loss function is the cross-entropy, in which $D$ represents the index set of the labeled documents, $d_p$ represents the output feature dimension as in (13), $Y$ represents the true label matrix, $\theta$ represents all the trainable parameters, and $\lambda$ represents the penalty coefficient.
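A minimal sketch of the classification layer and the regularized loss is given below. The squared form of the $L_2$ penalty is an assumption (the text only names $L_2$-norm regularization), and the function names `classify` and `regularized_cross_entropy` are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(r, W_p, b_p):
    # Fully connected layer followed by softmax over the three classes.
    logits = [sum(w * x for w, x in zip(row, r)) + b for row, b in zip(W_p, b_p)]
    return softmax(logits)


def regularized_cross_entropy(P_batch, Y_batch, theta, lam):
    """Cross-entropy over labeled documents plus an L2 penalty.

    P_batch: predicted class distributions, one list per labeled document.
    Y_batch: one-hot true labels with the same shape.
    theta:   flat list of all trainable parameters.
    lam:     penalty coefficient (lambda in the text).
    """
    ce = -sum(y * math.log(p)
              for P, Y in zip(P_batch, Y_batch)
              for p, y in zip(P, Y) if y > 0)
    l2 = lam * sum(t * t for t in theta)  # squared L2 norm: an assumed form
    return ce + l2
```

With all-zero weights the classifier returns the uniform distribution over the three classes, so the cross-entropy of a single document equals the natural logarithm of 3.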

IV. EXPERIMENTS
A. PROBLEM DEFINITION
Our work studies context-based implicit sentiment mining and analysis. In the formal task description, $Doc(\cdot)$ represents a text with multiple sentences (number of sentences ≥ 2) that contains at least one target sentence $S_t$ (number of targets ≥ 1), and the rest of the text is the context $S_c$. $F(\cdot)$ represents the model framework, and $Label \in \{neutral, positive, negative\}$ represents the set of predicted labels.

B. DATASETS AND EXPERIMENT SETTINGS
We use public data from The Evaluation of Chinese Implicit Sentiment Analysis (SMP-ECISA 2019) for the experiments. The original data are from Weibo, travel websites, product forums, etc. The initial corpus is noisy, and the expression is informal. Therefore, we perform some preprocessing on the dataset before the text analysis. Our model requires a complete sentence structure and contextual semantics. Therefore, we filtered out the following: 1) sentences without a subject-predicate structure (to keep the dependency syntactic structure intact); 2) target sentences without context (to allow a controlled comparison of the influence of context on target sentence judgment, excluding irrelevant variables). Finally, we randomly selected 80% of the dataset as the training set, 10% as the validation set, and the remaining 10% as the test set. The detailed statistics are shown in TABLE 1.
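The random 80/10/10 partition described above can be sketched as follows. The function name `split_dataset` and the fixed seed are assumptions added for reproducibility; the source does not state how the random split was seeded.

```python
import random


def split_dataset(docs, seed=0):
    """Shuffle the documents and split them 80% / 10% / 10%."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder becomes the test set
    return train, val, test
```

Integer arithmetic is used for the split sizes so that the three parts always cover the whole dataset without overlap.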

C. MODELS FOR COMPARISON
To fully validate and understand our model, we selected the following baseline models for comparison:
• The HAN [19] model uses a hierarchical attention mechanism at the word level and the sentence level, so the model can assign different ''attention'' to sentences and words of different importance in the text. We split HAN into two parallel models, one for the context and one for the target sentence, and concatenate their outputs into the final representation after the two layers of attention.
• The Tree-LSTM [20] model is an LSTM network based on a tree structure, which handles sentiment classification over nonlinear structures such as dependency trees. We also divide Tree-LSTM into two submodels, one for the context and one for the target sentence, and use their last hidden states as the representations of the overall context and the target sentence.
• The Tree-GCN [28] model uses Bi-LSTM to encode the input word vectors to obtain hidden states with context information and then uses a GCN convolution to obtain the neighboring node information, which enhances the robustness of the GCN.

D. IMPLEMENTATION DETAILS
We set the word embedding dimension to 200, the GRU hidden layer size to 64, the batch size to 64, the GCN hidden layer size to 128, and the initial learning rate to 0.002. Adam [30] was used to train the model for up to 20 epochs. A loss value was output every one hundred batches. If the validation loss did not decrease 10 consecutive times, the training stopped.
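The hyperparameters and the stopping rule above can be collected in a short sketch. The values in `CONFIG` come from the text; the dictionary keys, the function name `early_stopping_step`, and the reading of the rule as "no improvement over the best validation loss seen so far" are assumptions.

```python
CONFIG = {
    "embedding_dim": 200,
    "gru_hidden": 64,
    "batch_size": 64,
    "gcn_hidden": 128,
    "learning_rate": 0.002,
    "max_epochs": 20,
    "log_every_batches": 100,
    "patience": 10,
}


def early_stopping_step(val_losses, patience=CONFIG["patience"]):
    """Return the index of the evaluation at which training would stop.

    val_losses: validation losses in the order they were measured.
    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive evaluations; otherwise it runs to the end.
    """
    best = float("inf")
    bad_rounds = 0
    for step, loss in enumerate(val_losses):
        if loss < best:
            best, bad_rounds = loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return step
    return len(val_losses) - 1
```

If the rule is instead meant to compare each loss with the immediately preceding one, only the comparison against `best` needs to change.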

E. RESULTS
TABLE 2 shows the experimental comparison between our model and the baseline models. It can be seen from the table that our models CsHGCN and CsHGCN* obtained the best classification accuracy with and without context information, respectively. Specifically, the accuracy of all models increased significantly after adding context information, and this result is consistent with common sense: the expression of implicit sentiment depends on its context and semantic environment. Additionally, compared with the other baselines, CsHGCN performs better on the evaluation metrics of precision, recall, and macro-F1 after adding context.
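For reference, the evaluation metrics named above can be computed as in the following sketch. It assumes macro-averaging as the unweighted mean of the per-class precision, recall and F1 over the three labels; the paper does not spell out its exact averaging, and the function name `macro_prf` is hypothetical.

```python
def macro_prf(y_true, y_pred, labels=("neutral", "positive", "negative")):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    k = len(labels)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k
```

Classes with no predicted or no true instances contribute zero to the average in this sketch, which is one common convention among several.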
We also note that Tree-GCN performs better than Tree-LSTM among the tree-based models. Nevertheless, CsHGCN shows the best performance, indicating that the GCN's ability to extract features is stronger than that of the LSTM and that adding heterogeneous graphs can enrich node information.
The HAN model has higher precision for negative labels both with and without contextual sentences. One possible reason is that, in the absence of context, HAN, which adopts linear information exchange and an attention mechanism, is somewhat competitive in short-text classification tasks, with the highest precision and F1 values among the compared models. The dependency tree in CsHGCN is a nonlinear structure. Without context, the number of edges in the graph is small and the structure of the graph is simple, which leads to incomplete knowledge learned by our model and makes the boundary between positive and negative learned by the model unclear. After adding context, sentence nodes are added to the heterogeneous graph. As the number of edges in the graph increases, the information exchange between heterogeneous graph nodes increases, which makes the precision and F1 value increase by 4.34% and 3.54%, respectively. In contrast, HAN's precision and F1 value increase by only 0.79% and 0.25%, respectively, when context is added. This shows that CsHGCN can better exploit contextual information.

F. EFFECT OF HETEROGENEOUS GRAPH
To study the contribution of each component in the context-specific heterogeneous graph, we ran an ablation study on the test set (TABLE 3). First, we removed the dependency tree structure to observe the impact of the dependency structure of each token on the model. Then, we removed the sentence nodes to observe the effect of the heterogeneous structure on the model. We found the following. (1) When we remove the dependency tree structure, the accuracy (Acc) drops by 1.88%. (2) Acc drops by 6.04% when we remove the sentence nodes. (3) Specifically, the performance drop after removing the sentence nodes is larger than that after removing the dependency structure. One possible reason is that the sentence nodes have more edges in the heterogeneous graph, through which richer and more integrated information can be obtained.

G. CASE STUDY
To better understand how CsHGCN works, we used a few sentences from the test set as examples to show the visualization effect, as shown in TABLE 4. The following visualizations are not the few cases that work best; rather, they are representative of the output that the model visualization can explain in most cases.
To show the relatively important words in unimportant sentences, the color depth $color_w$ of the words is defined in (16). Because, in our heterogeneous graph, the second-stage attention weights of the sentence nodes $\beta_s$ and the context token nodes $\beta_t$ were computed simultaneously, we defined the color depth of the sentence as in (17), where $\beta_i$ represents the context sentence node.
In the first example, we chose a document containing one context sentence, and the sentiment label is implicit positive. It can be seen that the model accurately (dividends)'', etc. were desirable. Moreover, the model focused on the context of the second sentence, which shows that the second sentence was more closely related to the target sentence. In the third example, we chose a document with multiple context sentences and without emotional tendencies. It can be seen that the attention weight of each token in the target sentence is similar. One possible reason is that implicit sentiment sentences have no explicit sentiment words to focus on, and there is no emotional tendency, so the model considers that all tokens contribute equally to sentiment classification. This also reflects that the implicit sentiment classification task is much more complicated than the explicit one.

V. DISCUSSION
We performed some statistical experiments to determine the effect of context length on model performance.
According to the statistics, the shortest context in the test dataset contains one token, and the longest contains 833 tokens. To reduce the contingency of the results, we divide the test data into nine blocks by length, each with approximately 200 documents. The final result is shown in Fig. 4.
We use Tree-GCN as the baseline. It can be seen from the figure that when the context is short, the information is scarce, and it is difficult to infer the sentiment of the target sentence; therefore, the accuracy of each model is not high. In this case, attending to the useful information in the target sentence is essential. As the contextual information increases, the accuracy of all models rises, indicating that the amount of contextual information is positively correlated with the accuracy of the model.
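The length-bucketed evaluation described in this section can be sketched as follows. The function name `accuracy_by_context_length` and the record format (context length in tokens, whether the prediction was correct) are assumptions; the sketch sorts the test documents by context length and cuts them into nine nearly equal blocks.

```python
def accuracy_by_context_length(records, n_blocks=9):
    """Compute accuracy per block of documents sorted by context length.

    records: list of (context_length, is_correct) pairs, one per test document.
    Returns a list of (min_length, max_length, accuracy) tuples, one per block.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    size, extra = divmod(len(ordered), n_blocks)
    results, start = [], 0
    for block_index in range(n_blocks):
        # The first `extra` blocks take one additional document each.
        end = start + size + (1 if block_index < extra else 0)
        block = ordered[start:end]
        if block:
            accuracy = sum(1 for _, ok in block if ok) / len(block)
            results.append((block[0][0], block[-1][0], accuracy))
        start = end
    return results
```

Plotting the accuracy of each block against its length range for every model would give a curve of the kind the text attributes to Fig. 4.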

VI. CONCLUSION AND FUTURE WORK
We analyze and compare the advantages and disadvantages of existing models and propose a new implicit sentiment classification method, a context-specific heterogeneous graph convolutional network. We build a token-sentence graph for each document, and the experimental results show that the GCN can obtain global collaborative information very well. The experimental results also show that the GCN uses the dependency structure between tokens in the heterogeneous graph and the long-distance dependence of sentence nodes to improve the accuracy of the model. This research may be further improved in the following aspects. First, the attribute information of edges, i.e., the label of each edge, is not considered in our dependency tree. We plan to design a specific GNN to add edge information. Second, we consider incorporating domain knowledge. Finally, due to the high cost of annotation, the lack of labeled implicit sentiment data becomes a major obstacle, and we consider using transfer learning to extract knowledge from a document-level corpus and apply the knowledge to this task.
ENGUANG ZUO was born in Xinjiang, China, in 1993. He received the bachelor's degree from Central South University, in 2016. He is currently pursuing the master's degree in computer application technology with Xinjiang University. His research interests include sentiment analysis and event detection.
HUI ZHAO was born in Xinjiang, China, in 1972. She received the Ph.D. degree from the Dalian University of Technology. She has been a Professor with the College of Information Science and Engineering, Xinjiang University, since December 2011. Her research interests include artificial intelligence, natural language processing, and image processing.
BO CHEN was born in Henan, China, in 1993. He received the bachelor's degree from Northeast Electric Power University, in 2016. He is currently pursuing the master's degree in computer technology with Xinjiang University. His research interests include data mining and machine translation.
QIUCHANG CHEN was born in Guangdong, China, in 1995. She received the bachelor's degree from the Nanyang Institute of Technology, in 2017. She is currently pursuing the master's degree in computer technology with Xinjiang University. Her research interests include data mining and machine translation.
VOLUME 8, 2020