A Stochastic Attention CNN Model for Rumor Stance Classification

Rumor stance classification is the task of identifying the different stances expressed towards specific social media posts, and it is considered an important step in preventing rumors from spreading. However, most previous studies of rumor stance classification focus on the content features of posts and ignore the granularity of textual information and the reading (or writing) habits of the public. In social networks, people with different reading habits may take different stances towards the same unverified post. Based on this observation, we propose a Stochastic Attention Convolutional Neural Net (SACNN) that operates over different granularities of textual information to capture the different habits of the public for the rumor stance classification task. Specifically, the convolutional kernels in a SACNN, which can be treated as the view-windows of readers, comprise trainable convolutional kernels and stochastic untrainable convolutional kernels, and stochastic convolutional kernels of different sizes are used to simulate casual online reading habits and to extract coarse-grained features such as phrase features. After the convolutional layer, fine-grained features such as keywords are extracted by a pooling layer. The experiments show that the average accuracy and F1 score of our proposed model are respectively 0.26% and 0.46% higher than the state-of-the-art results on the rumor stance classification task on the PHEME dataset.


I. INTRODUCTION
The popularity of social network makes rumors spread quickly, and the unverified rumors may cause anxiety and panic [1], [2]. Usually, it is difficult for the public to distinguish verified information from rumors [3]. Therefore, the rumor stance classification is considered as a crucial component in debunking false rumors. Specifically, the rumor stance classification task aims to classify replies of a post to supporting, denying, querying and commenting. In order to efficiently finish this task, researchers set out to develop a rumor stance classification system that identifies the attitude of Twitter users towards the truthfulness [4]. An effective rumor stance classification system can be useful for limiting the diffusion of rumor information which brings anxiety and panic to individuals, communities and society [5].
The associate editor coordinating the review of this manuscript and approving it for publication was Rajeeb Dey .
In the past few years, there has been growing interest in the rumor stance classification task in social media [6]-[9], but conventional methods do not consider the temporal correlation between replies. In order to extract textual features and the temporal correlation between replies [10]-[13], Convolutional Neural Networks (CNNs), Long Short-Term Memory (LSTM) networks and the attention mechanism have been applied to this classification task [14], [15]. However, these models neglect the reading habits of the public. Most people read posts quickly and express their stances according to the key information they pay attention to or the keywords that they inadvertently see. This reading habit may lead different people to hold different or even opposite stances on the same post. Moreover, posts expressing different stances may differ only in some keywords.
Our work is motivated by the following observations. Firstly, different readers have different reading habits, and based on their reading habits and occupations, readers may focus on different key information. Therefore, for a broadcast post, any word or phrase might become the key information that leads a particular reader to express a stance, and the traditional attention mechanism cannot completely reflect such habits. We therefore propose a stochastic attention mechanism to model this observation. For a traditional CNN, classification can be expressed as the process of constructing a conditional probability P(y|x), which is equivalent to performing a smooth nonlinear fit that extracts features such as key phrases or keywords for classification. However, if some readers have reading and writing habits that differ from those of the public, using the traditional mapping P(y|x) may result in misclassification. The stochastic attention mechanism gives stochastically initialized weights to different keywords, so that almost any keyword may be given a high activation probability, and the keywords that are relevant to classification can be retained by the full connection layer. Secondly, readers may express their stances about a post based on a key phrase or a single keyword; therefore, in order to capture the key information, the information transmission process in the proposed model should be a textual information graining process from coarse to fine granularity. We use the feature extraction process of the CNN to simulate this graining process: a convolutional layer extracts phrase features, and a pooling layer extracts keywords. Lastly, previous work utilizes word2vec for the word embedding of texts, but each local context window in the word2vec model is trained independently, without using global information.
In order to make full use of the global statistical information of corpus, we use GloVe for word embedding [16].
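As a concrete illustration, the word-embedding lookup that produces the model's input matrix can be sketched as follows. This is a minimal sketch: the tiny 4-dimensional vectors and the vocabulary are invented for illustration, whereas a real model would load pre-trained GloVe vectors from file.

```python
import numpy as np

# Hypothetical GloVe-style table; real GloVe vectors are 50-300 dimensional
# and loaded from the published pre-trained files.
glove = {
    "rumor": np.array([0.1, -0.2, 0.3, 0.0]),
    "false": np.array([0.4, 0.1, -0.1, 0.2]),
}

def embed(tokens, table, dim=4):
    """Map a token list to a (len(tokens), dim) matrix.
    Out-of-vocabulary words are mapped to the zero vector."""
    return np.stack([table.get(t, np.zeros(dim)) for t in tokens])

x = embed(["rumor", "is", "false"], glove)
print(x.shape)  # (3, 4)
```

The resulting matrix, one row per token, is what the convolutional layers of the SACNN slide over.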
Based on the above observations, we propose a Stochastic Attention Convolutional Neural Net (SACNN), which contains a CNN model and a stochastic attention mechanism. Our main contributions can be summarized as follows: 1) We propose a stochastic attention mechanism to simulate different reading habits. The stochastic attention mechanism contains stochastic convolutional kernels and a stochastic pooling layer: the stochastic convolutional kernels simulate stochastic attention to phrases, and the stochastic pooling operation simulates stochastic attention to keywords. 2) The proposed stochastic attention mechanism can simulate different readers' reading habits by selecting the keywords related to classification. Different readers may attend to different keywords, and the stochastic attention mechanism gives stochastically initialized weights to different keywords. Therefore, based on these stochastic convolutional kernels, almost any keyword may be given a high activation probability, and the model can then filter out the keywords that are not relevant to classification through the full connection layer. 3) We propose a SACNN to simulate a graining process of textual information: the proposed SACNN uses a convolutional layer to extract coarse-grained features such as phrase features and a pooling layer to extract fine-grained features such as keywords. Moreover, we use GloVe for word embedding. The experiments verify the effectiveness of the proposed stochastic attention mechanism. 4) In this paper, convolutional kernels are used to extract key information. As different readers tend to use key information of different lengths, we explore through experiments the influence of the sizes of the stochastic convolutional kernels on the rumor stance classification results. Furthermore, we discuss the influence of the ratio between stochastic pooling and max pooling on the rumor detection task.
The remainder of this paper is organized as follows. Section 2 introduces related work on the rumor stance classification task, covering stance classification based on a single tweet and stance classification based on sequences. Section 3 details the proposed model. Section 4 presents experimental results that demonstrate the effectiveness of the proposed model. The last section gives the conclusion and outlook.

II. RELATED WORK
In the past few years, a large body of research on rumor stance classification has been produced, and it can be divided into two stages. In the first stage, an independent post is used for text analysis and the replies to the post are neglected. In the second stage, sequential models are used to learn the conversational structure of both the source post and its replies for the rumor stance classification task.

A. STANCE CLASSIFICATION BASED ON SINGLE TWEET
The stance classification task was first studied by Mendoza et al. (2010) [6], who used 14 posts in their experiments, half of the data marked as true and the other half as false; their experiments were carried out by manual annotation and classification. Subsequently, Qazvinian et al. [7] tried to solve this textual classification problem with machine learning algorithms, which can be treated as the first work on automatic stance classification in blogs. Hamidian and Diab [8] and Zeng et al. [9] apply supervised learning algorithms to the stance classification task, such as J48, logistic regression, naive Bayes, and random forest.

B. STANCE CLASSIFICATION BASED ON SEQUENCE
When the Convolutional Neural Network was first introduced into the rumor stance classification task, it did not consider the sequential relation between replies; in fact, a CNN processes textual information in a parallel way. By introducing a self-attention mechanism, CNNs can achieve satisfactory results for rumor stance classification [17]-[26]. In addition, some scholars consider the sequential relation between replies: according to the structural relation between the source and the replies, rumor detection can be processed in the form of a tree or of a linked list [27]-[32].
Some scholars combine the rumor stance classification task with the rumor veracity classification task using the idea of multi-task learning [25], [33]. Recently, sequential stance classification has attracted more and more interest. Elena Kochkina et al. won the SemEval 2017 RumourEval competition (SemEval 2017 Task 8, Subtask A) with a sequential stance classification algorithm [10]: they propose an LSTM-based sequential model that exploits the conversational structure of posts to address the task of rumor stance classification. Another work that utilizes sequential structure is that of Arkaitz Zubiaga et al., who propose an algorithm that makes use of the sequence in tree-structured conversation threads on Twitter [11]. To incorporate the temporal ordering of posts, Michal Lukasik et al. use Hawkes Processes and Gaussian Processes to model information diffusion for stance classification [12], [13]; compared with non-sequential methods, significant improvements were achieved in their experiments. Chen et al. [14] utilize a Convolutional Neural Network (CNN) with multiple filters for short text classification in SemEval-2017 Task 8. They found that the convolutional windows and word embedding methods improve the experimental results. However, a single convolutional operation only extracts the features of a single post, which makes the final classification results unsatisfactory.

C. RELATED TERMINOLOGIES
Stochastic attention mechanism: the use of stochastic convolutional kernels and stochastic pooling to simulate the reading and writing habits of different readers.
Stochastic convolutional kernel: an untrainable convolutional kernel whose weights are initialized randomly and kept fixed during training.
Stochastic pooling: unlike max pooling, stochastic pooling randomly selects a unit from its pooling kernel as the activation value.
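The two pooling variants defined above can be sketched as follows. This is a minimal sketch: uniform random selection is assumed for stochastic pooling, though an activation-weighted variant would also fit the definition.

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool(window):
    """Max pooling: always keep the strongest activation (the keyword
    most likely to be noticed)."""
    return np.max(window)

def stochastic_pool(window, rng):
    """Stochastic pooling (uniform variant assumed here): randomly pick
    one activation from the pooling window instead of the maximum."""
    return rng.choice(window)

window = np.array([0.1, 0.9, 0.4, 0.2])
print(max_pool(window))              # 0.9
print(stochastic_pool(window, rng))  # one of the four activations
```

Under this reading, max pooling models the keyword everyone notices, while stochastic pooling models a reader who happens to fixate on any word in the phrase.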
From the viewpoint of this paper, these models neglect the reading habits of the public. Most people read posts quickly and express their stances according to the key information they pay attention to or the keywords that they inadvertently see. This habit may lead different people to hold different or even opposite stances on the same post. Moreover, comments expressing different stances may differ only in some keywords. Therefore, this paper proposes a Stochastic Attention Convolutional Neural Net (SACNN) to alleviate this problem.

III. THE PROPOSED MODEL
In this paper, our proposed SACNN classifies stances from the viewpoints of the stochastic attention mechanism and a graining process of text. CNNs have natural advantages in representing granularity information in text classification: the convolutional kernels can be treated as view-windows of readers, and different kernel sizes correspond to different visual fields. CNNs extract phrase features with convolutional kernels, and these features can be regarded as coarse-grained features. After the convolutional layers, fine-grained features such as keywords are obtained by the pooling layers. Therefore, the convolution and pooling operations can be treated as a graining process of textual information. The conventional attention mechanism introduces sub-networks into the CNN along the channel and kernel dimensions and then assigns different weights to channels (or kernels) during training. Although this attention mechanism is effective in image processing, it brings a risk of over-fitting for text data. In order to simulate reading habits, this paper adds a stochastic attention mechanism to the CNN model. In online reading, people tend to read posts rapidly and to attend to different aspects of the same event; therefore, different readers may hold different or even opposite views on the same post.

A. STOCHASTIC ATTENTION MECHANISM
The attention mechanism plays an important role in sequential learning tasks [26], [27]. For example, people may support a rumor at the beginning, but as more evidence, pictures, and URLs are provided, denying posts gradually increase. However, the traditional attention mechanism cannot entirely reflect the different reading habits of the public. In this paper, we add a stochastic attention mechanism to the CNN model at different levels of information granularity to simulate different online reading habits. In the coarse-grained convolutional layers, the convolutional kernels are divided into two parts: trainable convolutional kernels and untrainable stochastic convolutional kernels. We use a set of untrainable stochastic convolutional kernels of different sizes to simulate stochastic attention to different phrases, and differently initialized kernels extract differently weighted phrase features. Meanwhile, we use a set of trainable convolutional kernels to extract general text features. According to the theory of Extreme Learning Machines (ELMs) and Broad Learning Systems (BLSs), the classification model converges when the number of stochastic hidden units is large enough [34], [35]. For fine-grained feature extraction, the pooling layer is also divided into two parts: a max pooling layer and a stochastic pooling layer. The stochastic pooling layer is used to simulate stochastic attention to keywords.
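The split between trainable and frozen stochastic kernels can be sketched in numpy as follows. This is a minimal sketch under stated assumptions: simple valid 1-D convolutions over an embedded post, with illustrative sizes and random values; the "trainable" group would be updated by backpropagation in a real model, while the stochastic group keeps its random initialization.

```python
import numpy as np

rng = np.random.default_rng(42)
seq_len, emb_dim = 10, 8
x = rng.normal(size=(seq_len, emb_dim))  # embedded post, one row per word

def conv1d(x, kernel):
    """Valid 1-D convolution over the sequence axis; each kernel spans
    the full embedding dimension, as in text CNNs."""
    k = kernel.shape[0]
    return np.array([np.sum(x[i:i + k] * kernel)
                     for i in range(len(x) - k + 1)])

# Trainable kernels would be optimized by the BP algorithm; the
# stochastic kernels stay fixed at their random initialization.
trainable_kernels  = [rng.normal(size=(k, emb_dim)) for k in (3, 4, 5)]
stochastic_kernels = [rng.normal(size=(k, emb_dim)) for k in (3, 4, 5)]

features = [conv1d(x, w) for w in trainable_kernels + stochastic_kernels]
print([f.shape[0] for f in features])  # [8, 7, 6, 8, 7, 6]
```

Each feature map covers all phrases of one window size, so kernels of sizes 3, 4 and 5 correspond to readers attending to phrases of three, four and five words.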

B. A STOCHASTIC ATTENTION CNN MODEL
In our model, the training process is divided into the following steps. Firstly, for text data, we use a word embedding method to build the input matrices; we call this process the word embedding layer, and its output is the input of the Stochastic Attention CNN (SACNN). Then, we realize a stochastic attention mapping from coarse to fine granularity based on the proposed SACNN. The traversal of the convolutional kernels over a post can be regarded as a reading process with attention windows of different sizes. A reader may read posts unconsciously and use some vocabulary to express or imply his or her views unintentionally. In order to simulate these reading habits, we introduce the idea of stochastic attention into the convolutional and pooling layers: the convolutional kernels are divided into two parts, one of which consists of untrainable kernels initialized randomly, while the other is trained with the conventional BP algorithm. We initialize the two groups of convolutional kernels with sizes of 3, 4, and 5, respectively. In the pooling layer, max pooling is used to extract the most important fine-grained features; in other words, it extracts keywords from phrases. In addition, for cases where the keywords are not obvious, we use stochastic pooling for feature extraction. After the pooling layer, a full connection layer is used for classification, and the dropout method is used to alleviate the over-fitting problem. The structure of a SACNN is shown in Fig. 1.

FIGURE 1. The structure of a SACNN.
In Fig. 1, the dotted lines denote stochastically initialized weights, which are untrainable, and the solid lines denote common weights, which can be trained with the BP algorithm. The convolutional layers in a SACNN can be expressed as follows:

h_1 = W^(i) * x,  h_2 = W^(j) * x,  (1)

where W^(i) denotes the i-th convolutional kernel, which is optimized by the BP algorithm, and W^(j) denotes the j-th randomly initialized convolutional kernel, which is not optimized during the training process. After the convolution operations, a sigmoid activation function is applied to Formula (1):

h = sigmoid([h_1; h_2]).  (2)

After the convolutional layers, coarse-grained features such as phrases are extracted by the stochastic attention method and the conventional convolutional kernels, and then fine-grained features such as keywords are extracted by the pooling layers. The pooled features can be expressed as Formula (3):

p = (1 - α) P_max(h) + α P_stochastic(h),  (3)

where P_max denotes the max pooling operation, P_stochastic denotes the stochastic pooling operation, and α is a mixing coefficient. The features extracted by the pooling layers are passed to a full connection layer to finish the stance classification task.
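The pooling combination can be sketched as follows. This is a minimal sketch assuming the pooled feature is a convex combination of the max-pooled and stochastically pooled values controlled by α, which matches the behavior at α = 0 (pure max pooling) and α = 1 (pure stochastic pooling); the feature map values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(7)
h = rng.normal(size=16)   # one sigmoid-activated feature map (illustrative)
alpha = 0.5               # mixing coefficient between the two pooling modes

p_max = np.max(h)                      # P_max: the strongest activation
p_stochastic = rng.choice(h)           # P_stochastic: a random activation
p = (1 - alpha) * p_max + alpha * p_stochastic
```

With α = 0 this reduces to ordinary max pooling, and with α = 1 every activation in the window is equally likely to be kept.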
As the number of hidden convolutional features increases, i.e., when the number of stochastic convolutional kernels is large enough, the CNN converges to a satisfactory local optimal solution based on the theory of ELMs and stochastic gradient descent. In ELMs, the local optimal solution can be obtained following Formula (4) and Formula (5):

H β = T,  (4)
β = H^† T,  (5)

where H is the hidden-layer output matrix, T is the target matrix, and H^† is the Moore-Penrose pseudoinverse of H. The training procedure of a SACNN can be summarized as follows:
for each training batch do
1. Build the input matrix with the word embedding layer.
2. Pass the input matrix to the trainable convolutional kernels to obtain h_1.
3. Pass the input matrix to the stochastic convolutional kernels to obtain h_2.
4. The output of the convolutional layer, h, is the concatenation of h_1 and h_2.
5. After the convolutional layer, h is passed to the pooling layer.
6. The pooling layer contains a stochastic pooling part and a max pooling part; the output of the pooling layer is denoted as p.
7. The output of the pooling layer, p, is passed to the full connection layer.
8. The output of the SACNN is calculated from the output of the full connection layer.
9. Calculate the gradients and update the trainable parameters.
end for
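The ELM-style closed-form solution referenced above can be sketched as follows, using the standard least-squares output weights β = H†T; the matrix sizes, random hidden outputs, and one-hot targets are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_hidden, n_classes = 100, 50, 4

# H: outputs of randomly initialized (untrainable) hidden units for n samples.
H = rng.normal(size=(n, n_hidden))
# T: one-hot stance targets (support / deny / query / comment).
T = np.eye(n_classes)[rng.integers(0, n_classes, n)]

# Closed-form ELM output weights: beta = pinv(H) @ T,
# i.e. the least-squares solution of H @ beta = T.
beta = np.linalg.pinv(H) @ T
print(beta.shape)  # (50, 4)
```

In an ELM the hidden weights stay random and only β is solved for; the SACNN borrows the convergence argument but still trains its output layers with backpropagation.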

IV. EXPERIMENTS
In our experiments, we use the PHEME rumor dataset, which provides tweet-level stance annotations and was collected and annotated within the journalism use case of the PHEME FP7 project. The rumors are associated with nine different breaking news stories. The PHEME dataset was created for the analysis of social media rumors and contains Twitter conversations, each initiated by a source tweet. The dataset covers 8 events in English, with tweets labelled as supporting, denying, agreed, disagreed, appealing for more information (querying), or commenting. Because these annotations differ from our final purposes, Zubiaga et al. converted the labels to support, deny, query, and comment. The dataset contains 330 conversational threads (297 in English and 33 in German), with a folder for each thread. Table 1 gives the numbers and proportions of posts of different stances for the 8 events [11]. The events ebola-essien, putinmissing and prince-toronto have only 199 tweets.
Details can be found in Reference [36], whose labels we use directly in our experiments. Notably, in every event, the stance data is skewed towards the comment category, with the proportion of comment tweets near 65%.

A. EVALUATION MEASURES
Accuracy is often treated as a suitable indicator for evaluating a classifier on a multi-class classification task. As mentioned above, our datasets are imbalanced, with the stance data skewed towards the comment category; therefore, evaluating a classifier by the Accuracy indicator alone is insufficient. In this paper, we also use the micro-averaged F1_micro and macro-averaged F1_macro scores; note that the micro-averaged F1 score is equivalent to the Accuracy indicator [12]. The evaluating indicators are defined as follows:

Precision_micro = Σ_k tp_k / Σ_k (tp_k + fp_k),
Recall_micro = Σ_k tp_k / Σ_k (tp_k + fn_k),
F1_micro = 2 · Precision_micro · Recall_micro / (Precision_micro + Recall_micro),
F1_macro = (1/K) Σ_k 2 tp_k / (2 tp_k + fp_k + fn_k),

where tp_k (true positives) is the number of instances correctly classified into class k, fp_k is the number of instances incorrectly classified into class k, and fn_k is the number of instances that belong to class k but were not classified as such.
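The indicators can be computed from per-class counts as follows. This is a minimal sketch; the counts for the four stance classes are hypothetical and chosen so that the total false positives equal the total false negatives, as holds in single-label classification.

```python
import numpy as np

def f1_scores(tp, fp, fn):
    """Micro- and macro-averaged F1 from per-class tp/fp/fn counts."""
    tp, fp, fn = map(np.asarray, (tp, fp, fn))
    # Micro: pool the counts across classes before computing P and R.
    p_micro = tp.sum() / (tp.sum() + fp.sum())
    r_micro = tp.sum() / (tp.sum() + fn.sum())
    f1_micro = 2 * p_micro * r_micro / (p_micro + r_micro)
    # Macro: compute per-class F1, then average over classes.
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1_macro = np.mean(2 * p * r / (p + r))
    return f1_micro, f1_macro

# Hypothetical counts for support, deny, query, comment.
micro, macro = f1_scores(tp=[50, 10, 8, 200],
                         fp=[5, 4, 2, 30],
                         fn=[6, 8, 5, 22])
print(micro > macro)  # True: micro is pulled up by the large comment class
```

Because the comment class dominates, the micro score exceeds the macro score here, which is exactly why macro-averaged F1 is reported alongside Accuracy on this imbalanced dataset.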

B. BASELINE METHODS
Majority vote. A classifier based on the training label distribution.
SVM. Support Vector Machines with the cost coefficient selected via nested cross-validation.
CRF. A Conditional Random Field over temporally ordered sequences using both text and neighbouring label features. The model is trained with l2-penalized log-likelihood, where the regularisation parameters are chosen via cross-validation [11].
GP. Gaussian Processes in a supervised setting where a multi-task learning kernel is used to learn correlations across different rumors. In this paper, we report the single-task kernel result [12].
HP. Hawkes Processes for classifying sequences of temporal textual data, exploiting both temporal and textual information [13].
LSTM. A simple Long Short-Term Memory (LSTM)-based classifier that takes the sequence of words of a tweet as its input.
Turing. A branch-LSTM: a neural network architecture that uses layers of LSTM units to process a whole branch of tweets, thus incorporating the structural information of the conversation. Turing is the name of the winning system of SemEval 2017 Task 8, Subtask A [10].

C. RESULTS
The PHEME dataset contains eight different topics. We compare our test results with different methods, such as Turing [10] (the winner of SemEval 2017 Task 8, Subtask A), CRF [11], GP [12], HP [13], and a simple LSTM-based classifier. As Table 2 shows, our model achieves state-of-the-art evaluating indicators.
Moreover, we compare the Accuracy and F1_macro with baseline methods such as Majority Vote, SVM, and the models above. References [12], [13] use temporal information in training through Hawkes processes and Gaussian processes of tweet generation. As Table 2 shows, the SACNN model is effective at extracting features for the stance classification task. We do not consider the sequential relation between posts, and our experiments suggest that the sequential relation may not be so important for the rumor stance classification task. Based on the stochastic attention mechanism, the SACNN model extracts the most classification-relevant features and key information for the rumor stance classification task.
Since some methods use only four topics as training data, our experiments also include these events, and we list the results in Table 3 for comparison. In Table 3, TCNN is the Text CNN model, which has only a trainable convolutional layer and a max pooling layer, and PTCNN is the Text CNN model with stochastic pooling added to the pooling layer. As we can see in Table 3, our model achieves competitive results compared with the commonly used machine learning models. After the stochastic convolutional layer and the trainable convolutional layer, the coarse-grained information is obtained, and the fine-grained information is then extracted by the stochastic pooling layer and the max pooling layer. The SACNN is trained with the BP algorithm, and the experiments verify that this simple model is effective for the rumor stance classification task. On the event Ferguson, the F1_macro score of our algorithm is the highest, and in the other cases, the evaluating indicators of our model are second only to those of the GP algorithm.
In order to show the accuracy comparison intuitively, we show the Accuracy indicator of the 4 events in Table 3 and Fig. 2. In order to show the performance of our model on each event, we display all the results in Table 4. Accuracy is often deemed a sufficient indicator for classifiers on a classification task; however, as mentioned above, the samples in the PHEME dataset are highly imbalanced, and stance posts are biased towards comments. As we can see in Table 4, the other indicators are not particularly good even when the Accuracy is high. Our algorithm performs well on the events putinmissing, prince-toronto and ebola-essien.
Lastly, we discuss the influence of the hyper-parameters of the proposed model. Because we use stochastic convolutional kernels and a stochastic pooling layer, our experiments examine the effects of different sizes of convolutional kernels and of different values of the hyper-parameter α. Table 5 shows the evaluating indicators on the eight events for different sizes of stochastic convolutional kernels.
In theory, a group of stochastic convolutional kernels of different sizes reflects the stochastic attention mechanism better than a group of kernels of a single size. As we can see in Table 5, the size of the stochastic convolutional kernels noticeably affects the final experimental indices, and a combination of stochastic convolutional kernels of different sizes yields better results than kernels of a single size. When the kernel size is too large, which corresponds to an overlong phrase, the model extracts too much noise and cannot extract effective phrase features. When the kernel size is too small, the SACNN cannot extract enough phrases to build the phrase features. In our experiments, a group of stochastic convolutional kernels of sizes [3, 4, 5] achieves the best result.
We also discuss the influence of the hyper-parameter α. In this experiment, we use a group of stochastic convolutional kernels of sizes [3, 4, 5] and a group of trainable convolutional kernels of the same sizes. α is selected from [0, 0.2, 0.5, 0.8, 1], and the experimental results are shown in Table 6 and Fig. 3. In theory, the keywords extracted from phrases should reflect both stochastic attention and max attention. As we can see in Table 6 and Fig. 3, when α = 0, the pooling layer of the proposed model is a pure max pooling layer, meaning that the keywords most likely to be noticed in phrases are extracted. When α = 1, the pooling layer is a pure stochastic pooling layer, meaning that the keyword in a phrase is selected randomly. The experimental results show that the proposed model achieves the best results when the keywords are drawn from both max pooling and stochastic pooling, which coincides with our theory.

V. CONCLUSION AND FUTURE SCOPE
In this paper, we propose a SACNN model for the rumor stance classification task on the PHEME dataset. In training, we increase the number of convolutional features generated by the stochastic attention mechanism to improve the training quality of the CNN, according to the theory of BLSs and ELMs. Each evaluating indicator on the PHEME dataset is calculated separately for each event, and the final result is the micro-average over the eight events. On the Accuracy indicator, our model achieves state-of-the-art results.
Data preprocessing is important in the training process. In our work, GloVe is applied for word embedding; depending on the data characteristics, better word embedding techniques may be considered in our future research. Combining multi-modal features, including text characteristics and other features such as user information, may bring further improvement to the classification task on unbalanced datasets, and this will be part of our future research as well. The rumor stance classification task is important for the rumor detection task and the veracity classification task. Effectively combining rumor stance classification with other related tasks in a multi-task framework will also be our future work.