
Tree-Structured Neural Networks With Topic Attention for Social Emotion Classification




Abstract:

Social emotion classification studies the emotion distribution evoked by an article among numerous readers. Although recent neural network-based methods improve classification performance compared with previous word-emotion and topic-emotion approaches, they have not fully utilized some important sentence language features and document topic features. In this paper, we propose a new neural network architecture exploiting both the syntactic information of a sentence and the topic distribution of a document. The proposed architecture first constructs a tree-structured long short-term memory (Tree-LSTM) network based on the sentence syntactic dependency tree to obtain a sentence vector representation. For a multi-sentence document, we then use a Chain-LSTM network to obtain the document representation from its sentences' hidden states. Furthermore, we design a topic-based attention mechanism with two attention levels: word-level attention for weighting the words of a single-sentence document, and sentence-level attention for weighting the sentences of a multi-sentence document. Experiments on three public datasets show that the proposed scheme outperforms the state-of-the-art ones in terms of higher average Pearson correlation coefficient and MicroF1 performance.
Published in: IEEE Access ( Volume: 7)
Page(s): 95505 - 95515
Date of Publication: 23 July 2019
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Social emotion classification studies the emotion distribution evoked among a large number of readers of the same article [1]. On some news websites, after reading a piece of news, readers can mark their emotion as one of 'touch', 'surprise', 'amusement', 'sadness', 'curiosity', or 'anger', and the social emotion distribution is displayed as a vote histogram at the end of the article. Fig. 1 illustrates such a social emotion distribution for one article on a well-known Chinese news website, Sina News.1 Understanding and predicting the social emotion distribution of an article is envisioned to have many applications, such as article narrative classification, online information diffusion, and public opinion monitoring [2]–[4].

FIGURE 1. Social emotions displayed as a vote histogram in Sina News.

As a sub-task of sentiment analysis, the earliest research on social emotion classification appeared in SemEval-2007 Task 14 [5]. The early methods are word-emotion models, which focus on features of individual words to find direct relations between words and emotions [1], [2], [6]. Such word-emotion models cannot distinguish the different emotions of the same word in different contexts. Later, topic-emotion models were proposed; they utilize a topic model such as Latent Dirichlet Allocation (LDA) to discover the topical information of a document and model the relation between words and emotions [1], [7]–[11]. However, both the word-emotion and topic-emotion models treat individual words separately without considering the semantic relations between the words of a sentence.

Recently, with the successful application of neural networks in natural language processing tasks, such as neural language models [12]–[14], text classification [15], and sentiment classification [16], some neural network-based methods have been proposed for social emotion classification [17], [18]. They adopt a convolutional neural network (CNN) or a recurrent neural network (RNN) to learn semantic features (such as temporal order) and obtain a vector representation of the document. Although neural networks are a powerful tool for obtaining document representations, these works have not fully utilized some important language and document features, such as the syntactic information of a sentence and the topical distribution of a document.

In this paper, we study the problem of social emotion classification by designing a new neural network architecture for obtaining document representations. We argue that the syntactic dependency relations between the words of a sentence are also an important sentence feature, which has been verified in many related tasks such as semantic relatedness [19]–[23]. Furthermore, we support the claim in existing work that the document topical distribution helps to distinguish the emotion of the same word in different contexts. Instead of establishing an emotion-topic model, we propose to incorporate the document topical distribution into the proposed neural network as an attention mechanism. An attention mechanism assigns different components different importance weights [24], [25] and has been successfully applied in many neural networks for machine translation and document classification tasks [26]–[28].

In this paper, we propose a two-layer neural network structure with an LDA-based attention mechanism for social emotion classification. For a document consisting of more than one sentence, the lower layer is implemented with a Tree-LSTM network (tree-structured long short-term memory network), which encodes the syntactic features of each sentence into a sentence vector. The upper layer is a Chain-LSTM (chain-structured LSTM) with sentence-level LDA attention, where the Chain-LSTM is used to obtain the hidden state of each sentence, and the sentence-level LDA attention is computed as the similarity between each sentence's topic distribution and the document's topic distribution. We then use the attention weights to form a weighted sum of the sentence hidden states as the document representation. For a document with a single sentence, the proposed structure contains only the lower-layer Tree-LSTM network, and the word-level LDA attention is computed as the similarity between each word's topic distribution and the document's topic distribution. After obtaining the document representation, we feed it into a softmax layer to obtain the final social emotion distribution. Experiments on three public datasets validate the superiority of the proposed scheme over the state-of-the-art ones in terms of higher average Pearson correlation coefficient and MicroF1 performance.

The main contributions of our work are as follows:

  • We propose a hierarchical neural network structure to obtain document vector representation.

  • We propose to use a Tree-LSTM network to encode the syntactic dependencies between the words of a sentence and to integrate an LDA-based attention mechanism into the neural network, so as to simultaneously exploit syntactic and topic information for social emotion classification.

  • We experiment and compare the proposed scheme over three public datasets.

The remainder of this paper is organized as follows: Section II briefly reviews the related work. The proposed scheme is presented in Section III and evaluated in Section IV. The paper is concluded in Section V.

SECTION II.

Related Work

A. Social Emotion Classification

The study of social emotion classification can be dated back to SemEval-2007 Task 14 [5]. Since then, three main solution approaches can be identified, namely word-emotion, topic-emotion, and neural network approaches. The word-emotion methods are based on handcrafted features of individual words for classification [1], [2], [6]. For example, the SWAT system [6] first establishes a word-emotion mapping dictionary, which is used to score each word in an unlabeled news headline and derive the overall emotion score. The emotion-term model [1] treats individual words as independently generated from social emotion labels and then finds the relation between words and social emotions with a Bayesian approach. However, such word-emotion models ignore the fact that the same word can convey different emotions in different contexts.

The topic-emotion approaches try to distinguish word emotions in different contexts by establishing relations between the document topic distribution and social emotions [1], [7]–[11]. For example, the emotion-topic model [1] introduces an emotion layer into the LDA topic model to jointly model topics and emotions. The affective topic model [8] also designs an intermediate layer in the LDA model to associate each topic with words' emotions.

Recently, neural network approaches have gained much attention for their capability of learning and extracting hidden semantic information from a document [17], [18], [29]. For example, Zhao et al. [17] propose a parallel network of a Bidirectional LSTM (BiLSTM) and a convolutional neural network (CNN) to obtain a document representation as the concatenation of the output vectors of the two networks. Li et al. [18] model a document as a single sequence and design a CNN with a word-level and a phrase-level convolution layer to model word-phrase and phrase-sentence relations, respectively, where a phrase represents a nugget of words in the document. Li et al. [29] propose a hybrid neural network that leverages semantic domain knowledge from unsupervised teaching models such as the Biterm Topic Model (BTM), the Replicated Softmax Machine (RSM), or Word2vec. However, these neural networks ignore the syntactic information of a sentence.

B. Attention Mechanism

The attention mechanism, which guides a network to treat each component of the input unequally according to its importance, has been shown to improve the performance of neural networks in many natural language processing tasks [26]–[28]. For example, Yang et al. [27] propose a hierarchical attention network (HAN) consisting of a word encoder layer and a sentence encoder layer, each implemented by a BiGRU (Bidirectional Gated Recurrent Units) network with an attention mechanism; the network can automatically extract the important words or sentences for document classification. Kokkinos and Potamianos [28] propose a tree-structured BiGRU network with attention for sentence-level sentiment classification, where the attention mechanism is used to learn the importance of words in a sentence. Different from these works, where the attention mechanism is MLP-based (multilayer perceptron), we explore a novel LDA-based attention mechanism in this paper.

SECTION III.

The Proposed Solution

Our neural network models leverage sentence syntactic information via dependency analysis and integrate topical attention via LDA analysis. In social emotion classification, a target document can consist of a single sentence or multiple sentences. We design different neural networks for the two kinds of documents: For a single-sentence document (SSDoc), we obtain the SSDoc representation using a Tree-LSTM structure with the word-level LDA attention mechanism (Section III-A). For a multi-sentence document (MSDoc), we first obtain each sentence vector with a Tree-LSTM structure and then feed the sentence vectors into a Chain-LSTM structure with the sentence-level LDA attention mechanism to generate the MSDoc representation. In both cases, after document modeling, a softmax output layer is used for social emotion classification.

The network framework is shown in Fig. 2, where the left part is for SSDoc and the right part is for MSDoc social emotion classification. First, an input document is classified as an SSDoc or an MSDoc. As a Chinese sentence does not contain natural word delimiters, we then apply a word segmentation technique to divide each sentence into words.2 For each word, we use a pre-trained Word2vec model [12] to obtain a word embedding, which is a low-dimensional dense vector of real numbers. In this paper, we use \mathbf{w}, \mathbf{s}, and \mathbf{d} to denote a word embedding, a sentence vector, and a document vector, respectively.

FIGURE 2. The framework of the proposed solution: An input document is first classified as a single-sentence document (SSDoc) or a multi-sentence document (MSDoc). At the sentence level, we perform word segmentation for Chinese sentences to obtain the composing words and use Word2vec to obtain each word embedding. We next perform a syntactic analysis to obtain the dependency tree of each sentence and construct a Tree-LSTM network accordingly. For an SSDoc, we compute the word-level LDA attention from the LDA model to obtain the weighted sum of the words' hidden vectors as the document representation. For an MSDoc, the lower layer is also a Tree-LSTM network, yet without attention; we use a Chain-LSTM together with sentence-level LDA attention to obtain the multi-sentence document representation. Based on the document representation, we use a linear layer together with a softmax to output the emotion distribution of the document.

A. Vector Representation for Single-Sentence Document

1) Tree-LSTM

We propose to use a Tree-LSTM network to obtain a hidden vector for each word of a sentence. To build a Tree-LSTM, we first apply a dependency analysis tool, Stanford Parser [30] (for English) or LTP [31] (for Chinese), to obtain the dependency tree for one sentence, where each node represents a word and the syntactic dependency relations are captured by the tree edges.

The left part of Fig. 3 presents the dependency tree for the sentence 'Test to predict breast cancer relapse is approved'. Each word is modeled as a node in a five-layer tree structure, where a parent node is connected to one or more child nodes and each parent-child relation carries a syntactic dependency. By modeling each node as a Tree-LSTM unit, we construct a Tree-LSTM network that yields a hidden state vector for each word. Specifically, a Tree-LSTM unit accepts the hidden state(s) of its child node(s) and the word embedding of the current node to compose the hidden state of the current node. For leaf nodes, the child hidden states are initialized to zero vectors. The composition follows the dependency structure, starting from the bottom leaf nodes up to the root node.

FIGURE 3. An example of a dependency tree produced by the Stanford Parser. The nodes represent words and the edges represent syntactic relations. The raw sentence is "Test to predict breast cancer relapse is approved".

The right part of Fig. 3 illustrates the internal structure of a Tree-LSTM unit. For one Tree-LSTM unit, let C(w_{n}) denote the set of its child word(s). The transition functions of the Tree-LSTM unit are calculated as follows:
\begin{align*}
\mathbf{f}_{j} &= \sigma(\mathbf{W}^{(f)}\mathbf{x}_{n} + \mathbf{U}^{(f)}\mathbf{h}_{j} + \mathbf{b}^{(f)}), \quad j \in C(w_{n}) \\
\mathbf{i}_{n} &= \sigma(\mathbf{W}^{(i)}\mathbf{x}_{n} + \mathbf{U}^{(i)}\tilde{\mathbf{h}}_{n} + \mathbf{b}^{(i)}) \\
\mathbf{o}_{n} &= \sigma(\mathbf{W}^{(o)}\mathbf{x}_{n} + \mathbf{U}^{(o)}\tilde{\mathbf{h}}_{n} + \mathbf{b}^{(o)}) \\
\mathbf{u}_{n} &= \tanh(\mathbf{W}^{(u)}\mathbf{x}_{n} + \mathbf{U}^{(u)}\tilde{\mathbf{h}}_{n} + \mathbf{b}^{(u)}) \\
\mathbf{c}_{n} &= \mathbf{i}_{n} \odot \mathbf{u}_{n} + \sum_{j \in C(w_{n})}{\mathbf{f}_{j} \odot \mathbf{c}_{j}} \\
\mathbf{h}_{n} &= \mathbf{o}_{n} \odot \tanh(\mathbf{c}_{n})
\end{align*}

where \tilde{\mathbf{h}}_{n} = \sum_{j \in C(w_{n})}{\mathbf{h}_{j}}. The hidden state \mathbf{h}_{n} \in \mathbb{R}^{d_{m}} and the memory cell \mathbf{c}_{n} \in \mathbb{R}^{d_{m}}, where d_{m} is the memory unit dimensionality, and \odot denotes the pointwise product. Both the gate functions and the hidden state of a node depend on the states of its child node(s), which exploits the syntactic information of a sentence.
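To make the bottom-up composition concrete, the following NumPy sketch (our own illustration, not released code of this work) implements the child-sum Tree-LSTM transitions above; the weight initialization and the compose_tree helper are assumptions made only for illustration:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ChildSumTreeLSTMCell:
    # Child-sum Tree-LSTM unit following the transition functions above.
    # d_w: word embedding dimensionality, d_m: memory/hidden dimensionality.
    def __init__(self, d_w, d_m, seed=0):
        rng = np.random.default_rng(seed)
        self.W = {g: rng.normal(0.0, 0.1, (d_m, d_w)) for g in "fiou"}
        self.U = {g: rng.normal(0.0, 0.1, (d_m, d_m)) for g in "fiou"}
        self.b = {g: np.zeros(d_m) for g in "fiou"}
        self.d_m = d_m

    def forward(self, x_n, child_states):
        # child_states: list of (h_j, c_j); for a leaf node, a single zero child is used.
        if not child_states:
            child_states = [(np.zeros(self.d_m), np.zeros(self.d_m))]
        h_tilde = sum(h for h, _ in child_states)          # child-sum of hidden states
        i = sigmoid(self.W["i"] @ x_n + self.U["i"] @ h_tilde + self.b["i"])
        o = sigmoid(self.W["o"] @ x_n + self.U["o"] @ h_tilde + self.b["o"])
        u = np.tanh(self.W["u"] @ x_n + self.U["u"] @ h_tilde + self.b["u"])
        c = i * u
        for h_j, c_j in child_states:                      # one forget gate per child
            f_j = sigmoid(self.W["f"] @ x_n + self.U["f"] @ h_j + self.b["f"])
            c = c + f_j * c_j
        h = o * np.tanh(c)
        return h, c

def compose_tree(cell, embeddings, children, root):
    # Recursively compose hidden states bottom-up over a dependency tree.
    # embeddings[n]: word embedding of node n; children[n]: indices of n's children.
    states = {}
    def visit(n):
        states[n] = cell.forward(embeddings[n], [visit(j) for j in children[n]])
        return states[n]
    visit(root)
    return states   # states[n] = (h_n, c_n) for every word in the sentence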

2) Word-Level LDA Attention

The attention mechanism assigns a different importance to each individual component when composing components into a single embedding. In the literature, attention mechanisms have been applied to text classification and sentiment classification, where the attention weights are obtained with a multilayer perceptron (MLP) in the neural network [27], [28]. In this paper, we propose to use word-level LDA attention to weight each word hidden state when composing the SSDoc vector representation. We argue that topical information is important in the social emotion classification task: an article usually focuses on some specific topics, and readers' emotional reactions are often related to particular topics described in the article. To this end, we exploit the LDA topic model [32] to obtain the word-level attention.

According to the LDA model, a document is considered a mixture over latent topics, and each latent topic is a probability distribution over all words. We first train the LDA model on a training corpus. For an input document, the LDA model then outputs the topic probability distribution of the document, denoted by \mathbf{p}_{d} = \{\mathrm{Pr}(k|d)\}_{k=1}^{K}, and the topic probability distribution of each word in the document, denoted by \mathbf{p}_{w_{n}} = \{\mathrm{Pr}(k|w_{n})\}_{k=1}^{K}, where k is the topic index and K is the number of topics. We use the cosine similarity between a word's \mathbf{p}_{w_{n}} and the document's \mathbf{p}_{d}, rescaled to [0, 1], to measure their topical similarity, which is then normalized to obtain the word-level LDA attention weight:
\begin{align*}
sim(w_{n}, d) &= \frac{\mathbf{p}_{w_{n}} \cdot \mathbf{p}_{d}^{\mathsf{T}}}{|\mathbf{p}_{w_{n}}|\,|\mathbf{p}_{d}|}, \tag{1}\\
\alpha_{w_{n}} &= \frac{0.5 \times sim(w_{n}, d) + 0.5}{\sum_{n'=1}^{N} [0.5 \times sim(w_{n'}, d) + 0.5]}, \tag{2}
\end{align*}

where N is the number of words in the sentence. Note that stop-words, regarded as words without topic information, are not considered in the LDA model; they therefore have no topic distributions and receive zero attention weight.

Based on the word hidden state vectors \mathbf{h}_{n} and their topical attention weights \alpha_{w_{n}}, we compute the SSDoc representation \mathbf{d}_{ss}, which also serves as the sentence vector \mathbf{s}, as follows:
\begin{equation*} \mathbf{d}_{ss} = \sum_{n=1}^{N}{\alpha_{w_{n}} \mathbf{h}_{n}}.\end{equation*}

The vector \mathbf{d}_{ss} thus encodes both the syntactic information of the sentence, through the word hidden vectors, and the word topical attention weights.
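As an illustration of Eqs. (1)-(2) and the weighted sum above (a minimal NumPy sketch of ours; the function names are not from the paper), the word-level LDA attention and the SSDoc vector can be computed as follows, with stop-words passed as None so that they receive zero weight:

import numpy as np

def word_lda_attention(word_topic_dists, p_doc):
    # word_topic_dists: one topic distribution p_{w_n} per word, or None for a stop-word.
    scores = []
    for p_w in word_topic_dists:
        if p_w is None:                                   # stop-word: zero attention weight
            scores.append(0.0)
            continue
        cos = np.dot(p_w, p_doc) / (np.linalg.norm(p_w) * np.linalg.norm(p_doc))   # Eq. (1)
        scores.append(0.5 * cos + 0.5)                    # rescale cosine similarity to [0, 1]
    scores = np.array(scores)
    return scores / scores.sum()                          # Eq. (2)

def ssdoc_vector(word_hidden_states, alphas):
    # Weighted sum of the Tree-LSTM word hidden states, giving d_ss.
    H = np.asarray(word_hidden_states)                    # shape (N, d_m)
    return np.sum(alphas[:, None] * H, axis=0)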

B. Vector Representation for Multi-Sentence Document

We treat a multi-sentence document as a sequence of sentences and obtain the MSDoc representation in two stages. In the first stage, we model each sentence with a Tree-LSTM network in a similar way as for a single-sentence document. The difference is that, instead of using the word-level LDA attention to compose a sentence vector, we directly use the hidden state of the root node as the sentence vector, which simplifies the network structure and avoids reusing topical information in the upstream document vector. Let \mathbf{s}_{m}, m=1,\ldots,M, denote the m-th sentence vector in a multi-sentence document.

In the second stage, a Chain-LSTM network with sentence-level LDA attention is employed to transform the sentence vectors into the MSDoc vector. The Chain-LSTM helps capture potential inter-sentence relations (such as "cause" and "contrast") in the document representation [16]. The transition functions of a Chain-LSTM unit are as follows:
\begin{align*}
\mathbf{f}_{m} &= \sigma(\mathbf{W}^{(f)}\mathbf{s}_{m} + \mathbf{U}^{(f)}\mathbf{h}_{m-1} + \mathbf{b}^{(f)}) \\
\mathbf{i}_{m} &= \sigma(\mathbf{W}^{(i)}\mathbf{s}_{m} + \mathbf{U}^{(i)}\mathbf{h}_{m-1} + \mathbf{b}^{(i)}) \\
\mathbf{o}_{m} &= \sigma(\mathbf{W}^{(o)}\mathbf{s}_{m} + \mathbf{U}^{(o)}\mathbf{h}_{m-1} + \mathbf{b}^{(o)}) \\
\mathbf{u}_{m} &= \tanh(\mathbf{W}^{(u)}\mathbf{s}_{m} + \mathbf{U}^{(u)}\mathbf{h}_{m-1} + \mathbf{b}^{(u)}) \\
\mathbf{c}_{m} &= \mathbf{i}_{m} \odot \mathbf{u}_{m} + \mathbf{f}_{m} \odot \mathbf{c}_{m-1} \\
\mathbf{h}_{m} &= \mathbf{o}_{m} \odot \tanh(\mathbf{c}_{m})
\end{align*}

where the forget gate \mathbf{f}_{m}, the input gate \mathbf{i}_{m}, and the output gate \mathbf{o}_{m} control the flow of memory from one step to the next, and \mathbf{h}_{m} \in \mathbb{R}^{d_{m}} and \mathbf{c}_{m} \in \mathbb{R}^{d_{m}} denote the hidden state and the memory cell at the m-th sentence, respectively.
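A minimal NumPy sketch of this second stage (ours, not the authors' code) simply applies the standard LSTM step above to the sequence of sentence vectors; the parameter layout mirrors the Tree-LSTM sketch earlier:

import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def chain_lstm_step(W, U, b, s_m, h_prev, c_prev):
    # One Chain-LSTM transition over sentence vector s_m, with a single predecessor state.
    f = sigmoid(W["f"] @ s_m + U["f"] @ h_prev + b["f"])
    i = sigmoid(W["i"] @ s_m + U["i"] @ h_prev + b["i"])
    o = sigmoid(W["o"] @ s_m + U["o"] @ h_prev + b["o"])
    u = np.tanh(W["u"] @ s_m + U["u"] @ h_prev + b["u"])
    c = i * u + f * c_prev
    h = o * np.tanh(c)
    return h, c

def chain_lstm(W, U, b, sentence_vectors, d_m):
    # Run the chain over the M sentence vectors and collect the hidden states h_1..h_M.
    h, c = np.zeros(d_m), np.zeros(d_m)
    hidden_states = []
    for s_m in sentence_vectors:
        h, c = chain_lstm_step(W, U, b, s_m, h, c)
        hidden_states.append(h)
    return hidden_states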

1) Sentence-Level LDA Attention

For a multi-sentence document, we argue that not every sentence is equally important for composing the MSDoc vector representation. Furthermore, the more similar a sentence and the whole document are in terms of their topic distributions, the more important the sentence. As such, we propose the following sentence-level LDA attention to weight each sentence and aggregate the weighted sentence vectors into the MSDoc vector.

Notice that the LDA model does not directly output the topic probability distribution of an individual sentence, \mathbf{p}_{s_{m}} = \{\mathrm{Pr}(k|s_{m})\}_{k=1}^{K}, but it does output the probability of each word under each topic, \mathrm{Pr}(w|k). We therefore first compute the topic probability of a sentence, \mathrm{Pr}(k|s_{m}), from the topic probabilities of its composing words:
\begin{align*}
\mathrm{Pr}(k|s_{m}) &= \frac{q(k|s_{m})}{\sum_{k'=1}^{K} q(k'|s_{m})}, \tag{3}\\
q(k|s_{m}) &= \frac{\sum_{w \in s_{m}}{\mathrm{Pr}(w|k)\,\mathrm{Pr}(k|d)}}{N_{s}}, \tag{4}
\end{align*}

where N_{s} is the number of words in the sentence. We next compute the topical similarity between a sentence s_{m} and the document d by applying the information radius (IR) based on the Kullback-Leibler (KL) divergence:
\begin{equation*} sim(s_{m}, d) = 10^{-IR(\mathbf{p}_{s_{m}}, \mathbf{p}_{d})},\tag{5}\end{equation*}
where
\begin{equation*} IR(\mathbf{p}_{s_{m}}, \mathbf{p}_{d}) = KL\left(\mathbf{p}_{s_{m}} \,\Big\|\, \frac{\mathbf{p}_{s_{m}} + \mathbf{p}_{d}}{2}\right) + KL\left(\mathbf{p}_{d} \,\Big\|\, \frac{\mathbf{p}_{s_{m}} + \mathbf{p}_{d}}{2}\right).\end{equation*}
A large similarity sim(s_{m},d) indicates that the sentence is important for conveying the document topic information. We then compute the sentence-level LDA attention as
\begin{equation*} \alpha_{s_{m}} = \frac{e^{sim(s_{m},d)}}{\sum_{m'=1}^{M}{e^{sim(s_{m'},d)}}},\tag{6}\end{equation*}
where M is the number of sentences in the document.

Based on the sentence hidden state vectors \mathbf{h}_{m} and their topical attention weights \alpha_{s_{m}}, m=1,\ldots,M, we compute the MSDoc vector \mathbf{d}_{ms} as
\begin{equation*} \mathbf{d}_{ms} = \sum_{m=1}^{M} \alpha_{s_{m}} \mathbf{h}_{m}.\end{equation*}
The MSDoc representation \mathbf{d}_{ms} thus contains the syntactic information of individual sentences, captured from their composing words by the lower-layer Tree-LSTM network, as well as the topic information of the whole document, captured from its composing sentences weighted by the sentence-level LDA attention.
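For illustration only (our sketch, directly following Eqs. (3)-(6) and the weighted sum above), the sentence-level LDA attention and the MSDoc vector can be computed in NumPy as follows:

import numpy as np

def sentence_topic_dist(word_given_topic, p_doc):
    # Eqs. (3)-(4): word_given_topic has shape (N_s, K) and holds Pr(w|k) for each word
    # of the sentence; p_doc holds Pr(k|d). Returns Pr(k|s_m).
    q = (word_given_topic * p_doc).sum(axis=0) / word_given_topic.shape[0]
    return q / q.sum()

def kl_div(p, q, eps=1e-12):
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def sentence_lda_attention(sentence_dists, p_doc):
    # Eqs. (5)-(6): information-radius similarity to the document, then a softmax over sentences.
    sims = []
    for p_s in sentence_dists:
        mid = 0.5 * (p_s + p_doc)
        ir = kl_div(p_s, mid) + kl_div(p_doc, mid)
        sims.append(10.0 ** (-ir))
    sims = np.array(sims)
    exp_sims = np.exp(sims)
    return exp_sims / exp_sims.sum()

def msdoc_vector(sentence_hidden_states, alphas):
    # Weighted sum of the Chain-LSTM sentence hidden states, giving d_ms.
    H = np.asarray(sentence_hidden_states)                # shape (M, d_m)
    return np.sum(alphas[:, None] * H, axis=0)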

C. Social Emotion Classification

1) Output

After obtaining the document representation \mathbf{d}_{ss} or \mathbf{d}_{ms}, we apply a linear layer to transform it into a label vector \mathbf{z}=\{z_{l}\}_{l=1}^{E}, where E is the number of emotion labels. We then normalize \mathbf{z} with a softmax layer to output the predicted emotion classification vector \hat{\mathbf{y}} = \{\hat{y}_{l}\}_{l=1}^{E} as follows:
\begin{equation*} \hat{y}_{l} = \frac{e^{z_{l}}}{\sum_{l'=1}^{E} e^{z_{l'}}}, \quad \mathbf{z} = \mathbf{W}^{(t)} \mathbf{d},\tag{7}\end{equation*}
where \mathbf{W}^{(t)} is the weight matrix of the linear layer.

2) Training

We note that in our task of social emotion classification, the evaluation objective is an emotion probability distribution rather than a single most likely emotion label. In the training phase, we therefore use the Kullback-Leibler divergence between the ground-truth emotion distribution \mathbf{y} and the predicted emotion distribution \hat{\mathbf{y}} as the loss function:
\begin{equation*} \mathrm{Loss}(\hat{\mathbf{y}},\mathbf{y}) = \frac{1}{E}\sum_{l=1}^{E}{y_{l}\,(\log(y_{l}) - \log(\hat{y}_{l}))},\tag{8}\end{equation*}
where \hat{y}_{l} and y_{l} are the predicted and true probabilities of the l-th emotion label, respectively. We train the whole network with the Adam optimizer.
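As a forward-pass illustration of Eqs. (7)-(8) (a sketch of ours; in practice the network is trained end-to-end with Adam in a deep-learning framework):

import numpy as np

def predict_emotion_distribution(W_t, d):
    # Eq. (7): linear layer followed by a softmax over the E emotion labels.
    z = W_t @ d
    e = np.exp(z - z.max())            # max subtraction for numerical stability (softmax-invariant)
    return e / e.sum()

def kl_loss(y_hat, y, eps=1e-12):
    # Eq. (8): KL divergence between the true distribution y and the prediction y_hat, averaged over E.
    return float(np.mean(y * (np.log(y + eps) - np.log(y_hat + eps))))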

SECTION IV.

Experiment Results

A. Experiment Datasets

We use three public datasets for our experiments: SinaNews [33] contains only multi-sentence documents; SemEval [5] contains only single-sentence documents; and ISEAR [34] contains both types. For the ISEAR dataset, we regard every document as a multi-sentence document (even when it contains only one sentence) to keep the network framework uniform; this also examines the adaptiveness of the proposed scheme.

SinaNews [33]: A Chinese dataset consisting of 5258 pieces of hot news collected from the social channel of the news website (http://www.sina.com) from January to December 2016. Each news item includes the headline, the news body, and the user ratings over 6 emotion labels: 'anger', 'touch', 'sadness', 'amusement', 'curiosity', and 'surprise'. On average, each news article was voted on by 770.41 users. To be consistent with the baseline methods [33], we use the 3109 articles published from January to June as the training dataset and the 2149 articles published from July to November as the testing dataset.

SemEval [5]: Provided by SemEval-2007 Task 14, it contains 1250 English news headlines extracted from news websites (such as Google News and CNN) and newspapers. Each headline is annotated with emotion scores in [0, 100] over 6 emotions: 'anger', 'disgust', 'fear', 'joy', 'sadness', and 'surprise' (0 = the emotion is not present in the headline; 100 = maximum intensity). After removing 4 samples without scores, 1000 headlines were used as the training dataset and 246 headlines as the testing dataset, the same experimental setting as in [18].

ISEAR [34]: An English dataset with 7666 samples, where each sample contains a paragraph of text labeled with one of 7 emotions: 'joy', 'fear', 'anger', 'sadness', 'disgust', 'shame', and 'guilt'. Different from SinaNews and SemEval, each sample is tagged with a single label rather than a distribution of emotion intensities. Following [29], we randomly select 60% of the samples as the training dataset and the remaining 40% as the testing dataset.

Table 1 presents the details of the three datasets, where #samples denotes the number of samples whose highest-voted emotion is the given label, and #votes/valence represents the total number of votes/valences for each emotion label.

TABLE 1. Statistics of the Three Datasets

B. Parameter Setting

For training the Word2vec model, we resort to public corpora, since the three datasets are too small. For the Chinese dataset SinaNews, we used the Chinese Wikipedia corpus3 to train a Chinese Word2vec model with the SkipGram algorithm [12]. For the two English datasets SemEval and ISEAR, we directly used the 300-dimensional English Word2vec model provided by Google.4

For obtaining the dependency tree, we use the Stanford Parser [30] for the two English datasets. For the Chinese dataset, we used the LTP toolkit provided by HIT [31] to perform sentence splitting and dependency tree construction. Since SemEval contains only 1250 samples, we resort to another news headline dataset, ABCnews [35], for LDA model training. The ABCnews dataset contains 1,103,665 headlines published from 2003-02-19 to 2017-12-31 by the Australian Broadcasting Corporation.5 Considering that SemEval was established in 2007, only the 355,151 samples up to and including 2007 in ABCnews were used as the LDA training corpus.
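As a rough sketch of this tooling step (assuming gensim 4.x; the toy corpus, topic number, and hyperparameters below are placeholders rather than the settings of Table 2), a skip-gram Word2vec model and an LDA model could be trained as follows:

from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# Placeholder tokenized corpus; the actual corpora are described above
# (Chinese Wikipedia / Google's pretrained vectors for Word2vec, ABCnews or the training split for LDA).
tokenized_docs = [
    ["test", "to", "predict", "breast", "cancer", "relapse", "is", "approved"],
    ["new", "cancer", "test", "approved", "for", "early", "relapse", "detection"],
]

w2v = Word2Vec(sentences=tokenized_docs, vector_size=300, sg=1, min_count=1)   # sg=1: skip-gram
embedding = w2v.wv["cancer"]                      # a 300-dimensional word embedding

dictionary = Dictionary(tokenized_docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10)

p_doc = lda.get_document_topics(bow_corpus[0], minimum_probability=0.0)        # Pr(k|d) pairs
topic_word = lda.get_topics()                     # shape (K, |V|): Pr(w|k) for every topic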

Table 2 presents the detailed parameter settings, where d_{w} denotes the dimensionality of word embedding, K denotes the number of topics when training LDA models, and d_{m} is the dimensionality of memory cell and hidden state in Tree-LSTMs and Chain-LSTMs.

TABLE 2. Parameters for Our Methods

C. Evaluation Metrics

We adopt two widely used performance metrics in our experiments: the Micro-averaged F1 score, denoted MicroF1, and the average Pearson correlation coefficient, denoted AP. The former reflects the accuracy of the predicted top-ranked emotion among all emotion labels, while the latter measures the difference between the predicted emotion probability distribution and the actual distribution.

We denote the actual top-ranked emotion triggered by an article d as e^{d}_{top} and the predicted one as \hat{e}^{d}_{top}. If several emotions share the same probability in the actual emotion distribution, their positions are interchangeable. Following [29], we compute MicroF1 as
\begin{equation*} MicroF1 = \frac{\sum_{d \in \mathcal{D}_{test}}{\mathbb{I}_{d}}}{|\mathcal{D}_{test}|},\end{equation*}
where
\begin{equation*} \mathbb{I}_{d} = \begin{cases} 1, & \mathrm{if}\; \hat{e}^{d}_{top} = e^{d}_{top}, \\ 0, & \mathrm{otherwise}, \end{cases}\end{equation*}
and \mathcal{D}_{test} is the testing dataset. A larger MicroF1 indicates better performance in predicting the top-ranked emotion label.

Following [33], we compute AP as
\begin{equation*} AP = \frac{\sum_{d \in \mathcal{D}_{test}}{r(\hat{\mathbf{y}}, \mathbf{y})}}{|\mathcal{D}_{test}|},\end{equation*}
where
\begin{equation*} r(\hat{\mathbf{y}}, \mathbf{y}) = \frac{cov(\hat{\mathbf{y}}, \mathbf{y})}{\sqrt{var(\hat{\mathbf{y}})\,var(\mathbf{y})}} \end{equation*}
is the Pearson correlation coefficient between the predicted emotion distribution \hat{\mathbf{y}} and the ground-truth distribution \mathbf{y}, and cov and var denote the covariance and variance, respectively. AP ranges in [-1,+1]; the closer AP is to 1, the more effective the prediction.
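A minimal NumPy sketch of the two metrics as defined above (ours; ties between equally probable emotions are not treated specially here):

import numpy as np

def micro_f1(pred_dists, true_dists):
    # Fraction of test documents whose predicted top-ranked emotion matches the actual one.
    hits = sum(int(np.argmax(p) == np.argmax(t)) for p, t in zip(pred_dists, true_dists))
    return hits / len(true_dists)

def average_pearson(pred_dists, true_dists):
    # Pearson correlation r between predicted and true emotion distributions, averaged over the test set.
    rs = [np.corrcoef(p, t)[0, 1] for p, t in zip(pred_dists, true_dists)]
    return float(np.mean(rs))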

D. Comparison Schemes

We give a brief description of the peer schemes used for performance comparison. Some of them are from the literature, while others are schemes we design ourselves for a more comprehensive comparison.

The following peer schemes are from the literature:

1) SWAT

It was the best-performing method in the SemEval-2007 task 14 [36]. It creates a word-emotion mapping from a corpus of hand-annotated headlines and then scores the emotions of each headline in the testing dataset by averaging the emotion scores of its words.

2) Emotion Term Model (ET) [1]

It assumes that words are independently generated from social emotion labels. Following the Bayesian method, it models the association between word and social emotion.

3) Emotion Topic Model (ETM) [1]

It constructs an emotion-topic model by adding an additional emotion-modeling layer into the LDA model, so that affective terms and social emotions are combined for social emotion classification.

4) Weighted Multi-Label Classification Model (WMCM) [11]

It introduces the concept of "emotional concentration" to compute a weight for each document under each emotion. A topic model is used to estimate the joint probability of the document and each word, and the social emotion of a document is finally inferred by Bayesian theory.

5) Contextual Sentiment Topic Model (CSTM) [9]

It aims at distinguishing explicitly generalized topics that are context-independent from both a background theme and a contextual theme.

6) Social Opinion Mining Model (SOM) [33]

It constructs a social opinion network based on the opinion distance computed by a word mover distance algorithm. The social opinion of a new sample is computed based on the nearest neighbor analysis.

7) 1-HNN-BTM [29]

It designs a hybrid neural network which incorporates the biterm topic model for social emotion classification.

8) Weighted PCNN [18]

It is a hierarchical CNN network with both word- and phrase-level convolution. It also employs an ’emotional concentration’ indicator to weight documents in the training phase to reduce the impact of noisy instances.

9) CNN [37] and CNN-SVM [38]

The CNN network proposed in [37] includes 7 layers: 1 input, 2 convolution, 2 max-pooling, 1 fully connected, and 1 softmax output layer. The CNN method directly uses the classification result of this network. The CNN-SVM method treats the output of the fully connected layer of the trained CNN as the feature vector of the input document; the feature vectors are then used to train an SVM classifier. The results of these two methods are taken from [33].

We also design the following experimental models to more clearly verify the effectiveness of the syntactic information and the LDA-based attention mechanism.

10) Hierarchical LSTM (HLSTM)

It is a hierarchical structure of two LSTM networks. A lower layer LSTM network is used to obtain each sentence vector from its word embedding. Then sentence vectors are fed into an upper layer LSTM network for composing the document representation.

11) Hierarchical LSTM With Word-Level Attention (HLSTM-WA)

We include a general attention mechanism based on MLP [27] into the lower layer LSTM network of HLSTM when composing sentence vectors.

12) Hierarchical LSTM With Sentence-Level Attention (HLSTM-SA)

We include a general attention mechanism based on MLP [27] into the upper layer LSTM network of HLSTM when composing document representation.

13) Hierarchical LSTM With Word-Level LDA Attention (HLSTM-WA-LDA)

We add Word-level LDA Attention into the first stage for composing sentence vectors on the basis of HLSTM.

14) Hierarchical LSTM With Sentence-Level LDA Attention (HLSTM-SA-LDA)

We add Sentence-level LDA Attention into the second stage for composing document representation on the basis of HLSTM.

15) Tree-LSTM + Chain-LSTM (TCLSTM)

It is a hierarchical structure of two neural networks: The lower layer is a Tree-LSTM network to compute sentence representations; and the upper layer is a Chain-LSTM network to compute the document representation. No attention mechanism is used here.

16) Tree-LSTM With Word-Level Attention + Chain-LSTM (TCLSTM-WA)

We include a general attention mechanism based on MLP [27] into the lower layer of the above TCLSTM scheme when composing sentence vectors.

17) Tree-LSTM + Chain-LSTM With Sentence-Level Attention (TCLSTM-SA)

We include a general attention mechanism based on MLP [27] into the upper layer of the above TCLSTM structure when composing a document representation.

18) Tree-LSTM With Word-Level LDA Attention + Chain-LSTM (TCLSTM-WA-LDA)(Proposed)

This is our proposed scheme for only single-sentence documents. Notice that in the dataset SemEval with only single-sentence documents, the sentence representation is also used as the document representation.

19) Tree-LSTM + Chain-LSTM With Sentence-Level LDA Attention (TCLSTM-SA-LDA)(Proposed)

This is our proposed scheme for multi-sentence documents or a mixture of multi-sentence documents and single-sentence documents.

E. Experimental Results

Table 3 summarizes the experimental results for the two evaluation metrics MicroF1 and AP.

TABLE 3. Experimental Results on the Three Datasets

We first discuss the performance of the schemes from the literature. To this end, we divide them into three groups: the word-emotion models that predict social emotions based on word features (SWAT, ET), the topic-emotion models that associate topics with social emotions (ETM, WMCM, CSTM), and the neural network-based models (1-HNN-BTM, Weighted PCNN, CNN, CNN-SVM). Comparing the word-emotion models with the topic-emotion models, the latter perform better on the SinaNews and SemEval datasets and also achieve competitive results on the ISEAR dataset. This may be attributed to the fact that leveraging topic features can distinguish between the different emotions expressed by the same word in different contexts. The neural network-based models outperform the other two groups on the SemEval and ISEAR datasets thanks to their ability to learn semantic features of a document, but their results leave room for improvement on the SinaNews dataset. One possible reason is that the documents in SinaNews are much longer sequences, each composed of many sentences; these neural network-based models (i.e., CNN, CNN-SVM) treat each document as a single sequence without distinguishing intra-sentence and inter-sentence features, which may lead to lower-quality document representations.

Fig. 4 compares the performance of our proposed models with that of the best HLSTM model and the best scheme in the literature. On the SSDoc SemEval dataset, the proposed TCLSTM-WA-LDA improves MicroF1 by 1.00% over the best scheme in the literature. On the MSDoc SinaNews dataset, the proposed TCLSTM-SA-LDA improves MicroF1 by 6.91% and AP by 0.07, while on the MSDoc ISEAR dataset it improves MicroF1 by 9.36% and AP by 0.17 over the best scheme in the literature. Notice that the HLSTM models are not from the literature; we design them in our experiments for a more comprehensive comparison. Although they are not the focus of this paper, we observe that in most cases these HLSTM models also perform better than the best scheme in the literature.

FIGURE 4. Performance comparison among our proposed models, the best HLSTM model, and the best scheme in the literature on the three datasets.

On one hand, we attribute the improvements to the adoption of a hierarchical network structure: the lower-layer network encodes word-to-word relations into a sentence representation, and the upper-layer network learns inter-sentence features when composing the document representation. Another difference from the existing neural network-based schemes is that we implement the lower layer with a tree-structured network (i.e., Tree-LSTM), which incorporates syntactic information into the learning of sentence representations and further improves the ability to learn long-distance dependencies between words. In Table 3, the one-to-one comparison between HLSTM and TCLSTM, that is, between HLSTM-{WA, SA, WA-LDA, SA-LDA} and TCLSTM-{WA, SA, WA-LDA, SA-LDA}, shows that the TCLSTM models generally perform better than the corresponding HLSTM models, which verifies the effectiveness of including syntactic information when learning a document representation.

On the other hand, like the topic-emotion schemes in the literature, our proposed scheme also exploits topic information for social emotion classification. With the help of the designed LDA attention mechanism, the words or sentences that convey more of the input document's topic information receive more attention in the final document representation. In Table 3, compared to the models without attention (such as TCLSTM), the models with LDA attention (such as TCLSTM-{WA-LDA, SA-LDA}) generally achieve better results. Furthermore, by comparing TCLSTM-{WA, SA}-LDA with TCLSTM-{WA, SA} one-to-one, we find that the models with LDA attention also outperform the models with the MLP-based attention mechanism; the same can be observed among the HLSTM models. One possible reason is that the MLP-based attention cannot capture the key words or sentences for social emotion classification as well as the proposed LDA attention does.

F. Illustration of LDA Attention

Fig. 5 illustrates the computation of the word-level and sentence-level LDA attention on a multi-sentence document from the ISEAR experiments. After training an LDA model, we can obtain the topic distribution of the document, the topic distribution of each word, and the word distribution of each topic. The left part of the figure illustrates the computation of word-level LDA attention for the 5th sentence of the document. The words "bad", "late", and "guilty", which can trigger a "sadness" emotion, are assigned higher weights, while the neutral word "felt" is assigned a smaller weight. The right part of the figure illustrates the computation of sentence-level LDA attention. The 2nd, 4th, and 5th sentences get greater weights; a closer look reveals that they contain words clearly related to the document's emotion "sadness", such as "hated", "guilty", and "lost". In contrast, as no emotional words appear in the last sentence, it gets a smaller attention weight.

FIGURE 5. Illustration of LDA attention. After training an LDA model, we can obtain the topic distribution of the document, the topic distribution of each word, and the word distribution of each topic. The left part illustrates the computation of word-level LDA attention for the 5th sentence: the topic similarity between each word of the sentence and the document is calculated by (1), and the word-level LDA attention is then obtained by (2). Note that stop words, such as "I", "to", "as", "had", "been", "my", and "of", are not considered in the LDA model, so they have no topic distributions. The right part illustrates the computation of sentence-level LDA attention: the topic probability distribution of a sentence is computed by (3), the topical similarity between a sentence and the document by (5), and the sentence-level LDA attention weights by (6). The bottom of the figure marks the attention weights in different colors according to their values.

SECTION V.

Conclusion

In this paper, we have proposed hierarchical tree-structured neural networks with LDA attention for social emotion classification. The lower-layer Tree-LSTM network encodes sentence syntactic information into the sentence representation according to the sentence's dependency tree. We have also proposed an LDA attention mechanism that identifies topic-related key words or key sentences of a document and assigns them higher weights when composing the document representation. Experiments on three public datasets show that the proposed schemes, by integrating syntactic and topical information, effectively improve system performance in terms of higher MicroF1 and AP on both SSDoc and MSDoc datasets, compared with the state-of-the-art schemes.

In this paper, we have mainly focused on mining the syntactic information within a sentence. In future work, we would like to further mine hidden relations between sentences, for example by exploring linguistic knowledge when modeling paragraphs and documents.
