Solving Stance Detection on Tweets as Multi-domain and Multi-task Text Classification

Stance detection on tweets aims at classifying the attitude of tweets towards given targets. Existing work leverages attention-based models to learn target-aware stance representations. While those methods achieve substantial success, most of them train a model for each target separately despite the scarcity of annotated data per target. To alleviate the limitation of annotated data, some methods turn to external linguistic resources, additional sentiment annotations, or target-aware data augmentation techniques for better detection results. We argue that the sharedness of stance-related features across targets in existing stance detection datasets is not fully exploited. However, directly training on mixed examples of all targets may confuse the model in learning the best features for each target. To this end, we borrow ideas from transfer learning and multi-task learning, and formulate stance detection on tweets as a multi-domain multi-task classification problem. We apply target adversarial learning to capture stance-related features shared by all targets, and target descriptors to learn stance-informative features correlated with specific targets. Experimental results on the benchmark SemEval 2016 dataset demonstrate the effectiveness of our model, which outperforms the BERT model by over 2% in macro-average F1 and achieves superior performance to many recent methods that utilize external resources. We further provide detailed analyses to illustrate the superiority of fully utilizing features shared by different targets.


I. INTRODUCTION
Stance detection [1,2] is the task of identifying the attitude conveyed in a text towards a given target. It is useful for various opinion mining tasks and especially important for understanding people's opinions expressed towards targets of interest on social media platforms. In this work, we focus on detecting stances on tweets. Unlike conventional aspect-level sentiment classification, the target may not appear in the text. In consequence, it is crucial to determine the relationship between the opinionated entity and the given target to accurately understand the stance. For instance, in Figure 1, none of the words in the target appears in the tweet, and we need to infer that the tweet is in favor of the target as it uses phrases like have rights. Moreover, people tend to use different expressions towards different targets, which makes it difficult for a single model to learn diverse stance features.
Various models have been proposed for stance detection to tackle those challenges. Some [3,4] leveraged models with intense target-text interactions to learn target-aware representations. Some [5,6] utilized external knowledge bases with rich commonsense information to build connections between targets and entities mentioned in tweets. While those models greatly advance the performance of stance detection, most of them train separate models for each target, which exacerbates the scarcity of annotated data available for each target. Recent work [7] proposed a target-aware data augmentation method to alleviate this issue. Others [8,9] tried to leverage training data from external stance detection datasets. However, most methods ignore the sharedness of stance features across different targets, and such features in existing datasets are not fully exploited.
To remedy this, we train a single model for all targets in the hope of capturing stance-related patterns shared by different targets. Since people may use different expressions and refer to different opinionated entities when discussing different targets, a good model should be capable of disentangling such distinctions among targets. To this end, we propose S-MDMT (Stance Detection as Multi-Domain Multi-Task Classification), which treats different targets as different domains for the first time and employs sophisticated multi-domain text classification models with a target-aware design. Given a <tweet, target> pair, S-MDMT first represents the pair with a BERT encoder. Then it applies a shared encoder with a target-adversarial layer to learn stance features shared across targets. Meanwhile, it adopts target descriptors and a target-aware attention structure to learn stance features specific to different targets. Finally, the target-invariant and target-specific stance features are aggregated for stance classification.
We conduct extensive experiments on the SemEval 2016 task 6 dataset [1]. S-MDMT outperforms BERT by 2% in terms of the average macro-F1 of the Favor and Against classes. Visualization of the shared and private features from S-MDMT proves that our model captures stance features shared by different targets. An ablation study further verifies that those shared features help improve the performance.
In summary, the contributions of this work are three-fold:
• We novelly treat the task of stance detection on tweets as a multi-domain multi-task learning problem to fully utilize stance features shared by different targets in the existing dataset.
• The proposed S-MDMT model achieves new state-of-the-art results, outperforming many recent methods that use external resources, multiple stance detection datasets, or data augmentation.
• We provide an in-depth analysis of shared and private features of different targets on the stance detection task.

II. RELATED WORK
A. STANCE DETECTION
Recently, there has been growing interest in detecting the stances of texts on microblog platforms [1,10,11,12]. Unlike traditional sentiment analysis tasks, stance detection is more challenging as the target may not appear in the text and the relation between a target and an opinionated entity in the tweet is harder to infer. Existing work has studied various settings of stance detection on tweets. [1] proposed the single-target stance detection task, where for a given tweet, only one target is considered at a time for stance prediction. As multi-target stance detection is quite useful for scenarios including political elections and brand comparisons, some researchers [13,14] introduced new datasets for multi-target stance detection, where multiple targets are given for a tweet. Research in [15,16,17] further propelled this task by modeling the dependency among different targets. Besides, some [18,19,20] studied the problem of cross-target stance detection with external resources like knowledge bases and lexicons, and neural models like graph networks. Recently, Allaway and McKeown [21] proposed the task of zero-shot stance detection and [22] further studied this problem with adversarial learning. Meanwhile, many [8,9] tried to leverage training data from multiple stance detection datasets simultaneously and achieved better results than single-dataset learning. While most datasets only contain sentence-level annotations, Ye et al. [23] proposed a token-level stance detection dataset with 2025 labeled tweets for fine-grained analysis of stance detection models.
In this work, we focus on the single-target stance detection task and adopt the dataset from [1] as the testbed to evaluate our method. Under this setting, various target-aware approaches [24,25] were proposed to integrate target semantics into final text representations. For example, BiCond [25] initialized the cell state of the LSTM that modeled tweets with the cell states of the LSTM that modeled targets. TAN [24] applied target-specific attention to each word in tweets and learned a weighted sentence representation concerning the given target. Research in [4] adopted an attention-based ensemble model to learn long-term dependencies in the stance detection task. Ghosh et al. [26] improved existing models with better tweet-specific pre-processing.
Recent research in [5] and [6] utilized external knowledge bases, ConceptNet and Wikipedia respectively, to tackle the absence of targets in tweets and acquired better performance. HAN [3] employed a hierarchical attention network that incorporated extra linguistic features. As tweets expressing stance are unlikely to be sentiment-neutral, AT-JSS-Lex [27] incorporated additional sentiment annotations to facilitate stance detection in a multi-task learning manner. Since annotated data for each target are limited, Li et al. [7] proposed target-aware data augmentation methods to enrich the training set. MeLT [28] leveraged additional user history messages by introducing a hierarchical message-transformer. Different from those works, to remedy the scarcity of labeled data, we fully exploit the existing datasets by mining shared features across targets instead of using additional annotations or augmented data. We employ the strong BERT [29] model to learn target-aware tweet representations by taking advantage of the BERT-Pair input format.

B. MULTI-DOMAIN AND MULTI-TASK LEARNING
Domain adaptation techniques have been extensively studied in sentiment analysis tasks, including sentence-level cross-domain and multi-domain sentiment classification and aspect-level sentiment classification [30,31,32,33]. These works usually deal with scenarios where different domains have different amounts of labeled data (sometimes even no labeled data for training). Various models have applied shared-private architectures, where the shared part learns features invariant across domains and the private part learns features belonging to specific domains. The gradient adversarial layer [34] has been employed to learn domain-agnostic representations by reversing the gradients of domain classification, and the learned representations have been shown to contain fewer domain-private features. Besides, [33] applied a mixture-of-experts to learn features shared by only some domains, and [35] allowed instance-wise domain-similarity adaptation. We novelly treat stance detection on tweets as a multi-domain fine-grained text classification problem and solve it in a multi-task learning manner. In this way, we can make full use of existing domain adaptation techniques from sentiment analysis. We incorporate these structures into a BERT-based model and apply it to the stance detection task, which has not been studied by previous work.

III. MODEL
In the task of stance detection on tweets, people tend to use different opinion expressions and refer to distinct entities when expressing stance towards specific targets. Therefore, we could virtually treat tweets talking about different targets as tweets that come from different domains and borrow the idea of multi-domain and multi-task learning to tackle this task. Figure 2 portrays an overview of the proposed model S-MDMT. It contains two parts: the shared part for learning target-agnostic stance features and the private part for capturing target-specific stance features. In the shared part, we only extract features from tweets and apply target adversarial learning to acquire target-invariant stance features. In the private part, we combine target and tweet to learn target-aware stance representations through multiple layers of self-attention. The stance-related features from shared and private parts are then concatenated for deciding final stance orientation.

A. PROBLEM DEFINITION AND NOTATIONS
Given a target T_A and a tweet T_W, the task of stance detection aims to identify the attitude y_s of the tweet towards the target as Favor, Against, or None. Specifically, we have a total of K targets, and for each target we only have limited training data. The goal of this work is to overcome the scarcity of annotated data for each target by leveraging shared stance features across various targets.

B. BERT AS TEXT ENCODER
Our model uses BERT [29] to acquire contextualized representations of the target and tweet for further stance modeling. BERT, built on Transformer networks [36], is widely adopted in sentiment analysis tasks. BERT is pre-trained by predicting masked tokens in the input (MLM task) and discriminating whether two sentences are consecutive (NSP task). The MLM task allows BERT to learn the semantics of a token with both left and right context, and the NSP task enables BERT to infer the relationship between an input sentence pair. Downstream tasks usually exploit BERT by fine-tuning pre-trained models on corresponding datasets. Given a text snippet X = {x_1, x_2, ..., x_T} of T tokens, BERT encodes its semantics with multiple layers of self-attention interactions. The hidden vector of [CLS] from the last layer is considered the representation of the whole text. Existing work [29,37] utilized the input format of the NSP task and extended it to tasks including machine reading comprehension and sentiment analysis. For example, [37] transformed the input format of aspect-based sentiment classification to be compatible with that of NSP. Following [37], in Section III-D we also transform the input into [CLS]target[SEP]tweet[SEP] to take advantage of BERT's ability to model the relationship between input text pairs. In our model, we use BERT to encode the tweet, the target, and the target-tweet pair.
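As a concrete toy illustration of this input format, the sketch below assembles the [CLS]target[SEP]tweet[SEP] sequence with a naive whitespace tokenizer. This is only a sketch: real BERT uses WordPiece subword tokenization, and the segment ids here simply mirror how the NSP-style input separates the two text pieces.

```python
def build_pair_input(target, tweet):
    """Assemble BERT's sentence-pair format: [CLS] target [SEP] tweet [SEP].

    Toy whitespace tokenizer for illustration only; real BERT uses WordPiece.
    Segment ids mark which tokens belong to the target segment (0) vs. the
    tweet segment (1), mirroring the NSP-style sentence-pair separation.
    """
    target_toks = target.lower().split()
    tweet_toks = tweet.lower().split()
    tokens = ["[CLS]"] + target_toks + ["[SEP]"] + tweet_toks + ["[SEP]"]
    segment_ids = [0] * (len(target_toks) + 2) + [1] * (len(tweet_toks) + 1)
    return tokens, segment_ids

tokens, segs = build_pair_input("Feminist Movement", "Women have rights too")
```

A real pipeline would then map tokens to vocabulary ids and feed both sequences to the BERT encoder.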

C. SHARED FEATURE EXTRACTOR
The shared part of S-MDMT is used to capture stance features shared by all targets. Stance detection is essentially a target-dependent task: the final stance depends on the relationship between the target and the opinionated entity in a tweet. Nevertheless, we argue that in many cases, the attitude a tweet expresses is the same regardless of the opinionated entity in the tweet, and the expressions used in those tweets are shared across targets. To exploit this, we only feed tweets into the shared part. Given a tweet X_tw from the input pair, we first encode the tweet with BERT, and the hidden output of [CLS] (denoted as h_s) from the last transformer layer is used as the tweet representation. Intuitively, if a strong classifier cannot tell which target a tweet involves based on the extracted features, then those features are more likely to be target-agnostic. We enforce this by adding a gradient reversal layer (GRL) [34] before the target discriminator.
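Framework-free, the GRL is simply an identity in the forward pass and a negated, scaled gradient in the backward pass. The sketch below conveys this with NumPy; the scaling factor LAMBDA is a hypothetical hyper-parameter, and deep learning frameworks implement the same idea as a custom autograd function.

```python
import numpy as np

LAMBDA = 1.0  # reversal strength (hypothetical hyper-parameter)

def grl_forward(x):
    # Identity in the forward pass: features reach the target
    # discriminator unchanged.
    return x

def grl_backward(grad_from_discriminator):
    # Negated, scaled gradient flows back into the shared encoder,
    # pushing it to *remove* target-identifying information.
    return -LAMBDA * grad_from_discriminator

h_s = np.array([0.5, -1.2, 2.0])   # shared tweet features (toy values)
grad = np.array([0.1, 0.3, -0.2])  # gradient from the target classifier
reversed_grad = grl_backward(grad)
```

In PyTorch or TensorFlow this pair of functions would be registered as one differentiable op so the reversal happens automatically during back-propagation.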
h̃_s = GRL(h_s)

where h_s is the hidden output from the shared encoder and h̃_s has the same value as h_s, while its gradient is reversed by the GRL.
In the forward pass, stance features go directly into the target classifier. In the backward pass, the gradients of target classification are reversed before updating the shared features, so that target-invariant stance features are learnt. The loss for target classification in the shared part is defined as:

L_tc1 = - Σ_{i=1}^{N_s} Σ_{j=1}^{K} ŷ^i_tc1(j) · log y^i_tc1(j)

where N_s is the number of training instances, K is the number of targets, y^i_tc1(j) is the predicted probability of the j-th target for the i-th training instance, and ŷ^i_tc1(j) is the corresponding gold probability.

D. PRIVATE FEATURE EXTRACTOR
As mentioned earlier, stance detection requires understanding the attitude that a tweet expresses towards a target. If the target is not directly mentioned, or the opinionated entity has nothing to do with the target, it is difficult to determine the actual stance of the tweet towards the target. Hence, learning target-aware representations is crucial for these circumstances.
To this end, the private part of S-MDMT is designed to capture such target-specific stance features. We achieve this through the following process: we first encode the target and the target-tweet pair separately using a BERT encoder. The target-tweet pair is used to capture deep interactions between a target and a tweet, while the target text is used to better represent the semantics of a target. We refer to the representations of the target and the target-tweet pair from the BERT encoder as h_ta and h_tt, respectively.

1) Target Descriptors
Motivated by [32], we introduce target descriptors to explicitly capture stance features relating to specific targets. Each target descriptor is specific to one target, and the descriptors are useful in aggregating target-aware stance features for stance prediction. The target descriptors are parameterized by a matrix D ∈ R^{K × d_n}, where K is the number of targets and d_n is the dimension of each descriptor. Target descriptors are randomly initialized and learned jointly with the other parameters during training.
Given the enhanced target-aware tweet representation h_tt, we add a simple target classifier here to enforce its target-awareness:

y_tc2 = softmax(W_p2 · σ(W_p1 · h_tt))

L_tc2 = - Σ_{i=1}^{N_s} Σ_{j=1}^{K} ŷ^i_tc2(j) · log y^i_tc2(j)

where W_p1 ∈ R^{d_m × d_h}, W_p2 ∈ R^{K × d_m}, d_h is the hidden dimension of the BERT output, d_m is the dimension of the intermediate hidden features, N_s is the number of training instances, K is the number of different targets, y^i_tc2(j) is the predicted probability of the j-th target for the i-th training instance, and ŷ^i_tc2(j) is the corresponding gold probability. We then evaluate the similarity of h_tt to the target descriptors, expecting the gold target descriptor to receive a higher similarity score. We perform a simple bilinear transformation between h_tt and the target descriptor matrix D as follows:

s = h_tt · W_sim · D^T

where W_sim ∈ R^{2d_h × d_n}, D ∈ R^{K × d_n}, d_h is the hidden dimension of the BERT output, d_n is the dimension of a target descriptor, and K is the number of targets. We normalize the similarity scores and calculate a weighted sum of the target descriptors as the virtual target h̃_ta, used instead of the original target:

h̃_ta = Σ_{i=1}^{K} softmax(s)_i · D_i

where s_i is the i-th value in s (the similarity between the i-th target descriptor and h_tt), D_i is the vector of the i-th target descriptor, and K is the number of targets. We use this mixture of target descriptors as the target representation for the following reasons: 1) the original target usually gets the highest attention score and contributes the major semantics of the new target; 2) some features are shared and consistent in stance polarity across certain targets; for instance, a tweet expressing opinions supporting 'Legalization of Abortion' may also be in favor of 'Feminist Movement'; 3) for each input target-tweet pair, stance features regarding related targets may also help to accurately decide the stance orientation; for example, a tweet that supports 'Hillary Clinton' may also involve stance features on her policy and actions regarding 'Feminist Movement'.
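The descriptor-mixture step can be sketched in a few lines of NumPy. All dimensions and the random initialization below are illustrative stand-ins for learned parameters, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, two_dh, d_n = 5, 16, 8          # targets, pair-rep dim (2*d_h), descriptor dim
D = rng.normal(size=(K, d_n))      # target descriptor matrix (learned in practice)
W_sim = rng.normal(size=(two_dh, d_n))
h_tt = rng.normal(size=two_dh)     # target-aware tweet representation from BERT

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

s = h_tt @ W_sim @ D.T             # bilinear similarity to each descriptor, shape (K,)
alpha = softmax(s)                 # normalized similarity scores
h_ta_virtual = alpha @ D           # weighted mixture of descriptors = virtual target
```

The gold descriptor dominates the mixture when its similarity score is highest, while related targets still contribute a small share of the virtual target.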

2) Target-aware Stance Representation
After acquiring a more comprehensive representation of the target, we learn a target-aware tweet representation based on BERT's hidden outputs of target-tweet pair. As different tokens don't contribute equally to stance detection, we employ a target-aware attention network to discriminate each token's contribution to the final stance and relation to the target.
s_j = h_j · W_att · h̃_ta

α = softmax(s),  h_p = Σ_{j=1}^{T} α_j · h_j

where h_j is the feature vector of the j-th token in the target-tweet pair, W_att ∈ R^{d_h × d_n}, s_j is the relatedness of h_j to the current target features h̃_ta, and T is the number of tokens in the target-tweet pair. h_p is considered the target-aware tweet representation that encodes stance features towards the given target.
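The target-aware attention step can be sketched as follows; toy dimensions and random matrices stand in for BERT hidden states and learned weights.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_h, d_n = 10, 16, 8            # tokens, hidden dim, descriptor dim
H = rng.normal(size=(T, d_h))      # BERT hidden states of the target-tweet pair
W_att = rng.normal(size=(d_h, d_n))
h_ta = rng.normal(size=d_n)        # virtual target from the descriptor mixture

scores = H @ W_att @ h_ta          # relatedness of each token to the target, (T,)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # softmax over tokens
h_p = alpha @ H                    # target-aware tweet representation, (d_h,)
```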

E. STANCE CLASSIFIER
We then combine stance features from the shared and private parts of S-MDMT and feed the concatenated vector h_sc into a simple network with a sigmoid non-linearity as follows:

h_sc = h_s ⊕ h_p

y_s = softmax(W_sc2 · σ(W_sc1 · h_sc))

where ⊕ refers to vector concatenation, W_sc1 ∈ R^{d_m × 2d_h}, W_sc2 ∈ R^{N_o × d_m}, d_h is the hidden dimension of the BERT output, d_m is the dimension of the intermediate hidden features, N_o is the number of different stance labels, and y_s gives the stance probabilities through a softmax layer. The classifier is trained with the following cross-entropy loss of stance classification:

L_s = - Σ_{i=1}^{N_s} Σ_{j=1}^{N_o} ŷ^i_s(j) · log y^i_s(j)

where N_s is the number of training instances and N_o is the number of different stance labels.
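A minimal NumPy sketch of this classifier head; the dimensions and random weights below are illustrative, standing in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(2)
d_h, d_m, N_o = 16, 4, 3           # hidden dim, intermediate dim, #stance labels
h_s = rng.normal(size=d_h)         # shared (target-invariant) features
h_p = rng.normal(size=d_h)         # private (target-specific) features
W1 = rng.normal(size=(d_m, 2 * d_h))
W2 = rng.normal(size=(N_o, d_m))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h_sc = np.concatenate([h_s, h_p])          # aggregate shared + private features
y_s = softmax(W2 @ sigmoid(W1 @ h_sc))     # stance probabilities over N_o labels

gold = np.array([0.0, 1.0, 0.0])           # toy one-hot gold label
loss = -np.sum(gold * np.log(y_s))         # cross-entropy for one instance
```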

F. MULTI-TASK LEARNING
The stance classifier and target discriminators are jointly trained with back-propagation. The final loss function for our model can be written as:

L = L_s + λ_1 · L_tc1 + λ_2 · L_tc2

where λ_1 and λ_2 are hyper-parameters balancing the influence of target classification in the shared part and the private part, respectively.
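The combined objective is a simple weighted sum; the default λ values below are hypothetical placeholders, as the paper tunes them on held-out data.

```python
def total_loss(l_stance, l_tc1, l_tc2, lambda1=0.1, lambda2=0.1):
    """Multi-task objective: stance loss plus weighted target-classification
    losses from the shared (tc1) and private (tc2) parts.
    Default lambda values are hypothetical, not taken from the paper."""
    return l_stance + lambda1 * l_tc1 + lambda2 * l_tc2
```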

IV. EXPERIMENTAL SETUPS
In this section, we describe the dataset used for training and evaluation, implementation details of the proposed model, and strong baselines adopted for performance comparison.

A. DATASET
We conduct experiments on SemEval 2016 Task 6 Sub-task A.
The dataset is comprised of 4163 English tweets crawled from Twitter, each assigned a target and a manually annotated stance label towards that target. There are a total of five targets: 'Atheism' (A), 'Climate Change is a real Concern' (CC), 'Feminist Movement' (FM), 'Hillary Clinton' (HC), and 'Legalization of Abortion' (LA). The detailed statistics of this dataset are shown in Table 1. We use the official train/test split. As the task did not provide an official validation set, we run 5-fold cross-validation on the training set and report averaged results on the test set.

B. IMPLEMENTATION DETAILS
We adopt the uncased BERT base model for all our experiments, which has 12 transformer encoder layers. We fine-tune the BERT base model with the Adam optimizer. The dropout rate is set to 0.5 for all parameters. The learning rate is chosen from {1, 2, 3, 4, 5} × 10^-5 and the batch size for training is set to 8. We choose λ_1 and λ_2 via cross-validation on the training set. Similar to previous work, we adopt the macro-average of the F1-score across targets as the evaluation metric, which is calculated as:

F = 2 · P · R / (P + R)

where P and R are precision and recall. The average of F_Favor and F_Against is then calculated as the final metric:

F_macro = (F_Favor + F_Against) / 2

Note that the final metric does not disregard the None class. By averaging the F-scores of only the Favor and Against classes, we treat None as a class that is not of interest.
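The metric can be sketched directly; the per-class precision/recall values in the usage line are made-up numbers for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f1_favor_against(stats):
    """stats maps class name -> (precision, recall).
    None remains a valid prediction label but is excluded from the average."""
    return (f1(*stats["FAVOR"]) + f1(*stats["AGAINST"])) / 2

# Toy numbers, purely illustrative:
score = macro_f1_favor_against({"FAVOR": (0.6, 0.5), "AGAINST": (0.7, 0.8)})
```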

C. BASELINES
We compare our model with the following methods:
• SVM-ngrams [2]: Five SVM classifiers (one per target), each trained on the training set for the corresponding target using word and character n-gram features.
• MITRE [38]: The best system in SemEval-2016 sub-task A. This model uses two RNNs: the first is trained to predict task-relevant hashtags on a very large unlabeled Twitter corpus and is used to initialize the second RNN classifier, which is trained on the provided sub-task A data.
• BiCond [25]: A method that employs a conditional LSTM to learn a representation of the tweet considering the target.
• TAN [24]: An LSTM-attention based model that incorporates target-specific information into stance identification.
• PNEM [4]: An ensemble model that uses two densely connected and nested BLSTM networks with attention to capture the long-term dependency in stance detection.
• Disc-STS [39]: A model that jointly models interactions among sentiment-stance-target information. An external sentiment classifier for tweets is used.
• HAN [3]: A hierarchical attention-based method that utilizes additional linguistic features like dependency information.
• AT-JSS-Lex [27]: A multi-task learning-based method that adopts sentiment classification as an auxiliary task and uses manually constructed stance and sentiment lexicons for attention.
• CKEMN [5]: A commonsense knowledge enhanced memory network for stance detection.
• BERT base [29]: A BERT model that takes <target, tweet> as input and uses a linear transformation with a softmax layer for final stance classification.
• Stancy [40]: A BERT-based method that adds a cosine similarity between the target representation and the target-tweet pair representation.
• MT-DNN MDL [9]: A BERT-based model that exploits multi-dataset learning from various stance detection datasets. MT-DNN SDL is the corresponding method that only learns from a single stance detection dataset.
• MoLE [8]: A RoBERTa [41] based model that leverages features from multiple stance detection datasets and learns from heterogeneous labels using label embeddings.
• ASDA [7]: A target-aware data augmentation method that generates new examples with conditional masked language modeling and auxiliary sentences.
• MeLT [28]: A hierarchical message-encoder that utilizes additional features from user history.
For Stancy [40], we run the experiments using the code provided by the original paper.

V. RESULTS

Table 2 shows the classification results of different methods on the SemEval 2016 dataset. We can observe that the proposed S-MDMT model outperforms all existing methods in terms of overall macro-averaged F1, and it achieves results comparable to the highest ones on each target. Note that some recent methods [7,8,9,28] only report partial results on the dataset; e.g., in [7] the target 'Climate Change' was removed because of the limited and highly skewed data. For a fair comparison with [7], we also calculate the macro-averaged F1 across the other four targets, which is 66.07, compared to 65.01 reported in [7].
Previous models can be roughly divided into different groups according to whether they use pre-trained models and whether they use additional external resources.

A. MAIN RESULTS
For methods that use neither pre-trained language models nor external resources, most focus on designing target-aware models to learn the stances of tweets towards given targets. For example, BiCond [25] encodes tweets with target-conditioned initialization and TAN [24] learns the stance representation with target-specific attention. PNEM [4] employs a densely connected BLSTM and a nested BLSTM with attention to capture the dependency between targets and tweets, achieving excellent performance.
Generally, methods that utilize external resources but do not adopt pre-trained language models achieve better results than the above methods. For HAN [3], various linguistic features like sentiment, n-grams, and dependency-based relations are utilized. For Disc-STS [39], extra sentiment annotations are incorporated during three-way interactions among sentiment, stance, and input. For AT-JSS-Lex [27], sentiment labels are also used to construct an auxiliary task, and manually constructed sentiment and stance lexicons are utilized for target-specific attention. For CKEMN [5], commonsense knowledge from ConceptNet and trans-E embeddings are leveraged. Disc-STS and AT-JSS-Lex, which exploit stance- and sentiment-related resources, obtain higher performance than HAN and CKEMN, which utilize dependency or entity-relation knowledge.
While most of these methods only utilize target-to-tweet attention, BERT base with self-attention learns the interaction between the target and tweet over multiple layers and achieves better results. Pre-trained language model based methods (Stancy [40] and MT-DNN [9]) generally work better than traditional methods without pre-training (BiCond [25] and TAN [24]). Thus, the BERT model serves as a better backbone for learning target-aware stance representations than the previously used BLSTM networks. Furthermore, recent models [7,8,28] that combine both pre-trained language models and external resources further boost performance.
The proposed S-MDMT model further outperforms strong baselines like BERT, MoLE, and MT-DNN by 2.99%, 0.62%, and 0.89% respectively, which demonstrates the effectiveness of exploiting shared features among different targets without using any external resources. This verifies that shared and private stance features of different targets are under-explored by previous methods. We leave it as future work to see whether these external resources are complementary to information captured by our model S-MDMT.

B. ABLATION STUDY
We have shown the effectiveness of leveraging shared stance features among different targets in Section V-A. Here, we present further analyses of the effect of different components of S-MDMT with ablation experiments.
We first compare our model with two variants: one without the shared part and one without the private part of the full model. Performance drops by over 1.9% when removing either the shared part or the private part, which reveals that both shared and private features contribute to the improvements over the BERT base model. The private part, which captures target-aware stance features, has a larger impact than the shared part, which again shows that target information plays an important role in stance detection.
Then we explore three extra variants to study the influence of other modules in S-MDMT. When removing the gradient reversal layer (GRL), the shared part degrades to learning target-specific stance features instead of shared stance-related features, and its performance is similar to that of the BERT base model, as expected. Besides, we also experiment with removing the target classification module in the private part. Instead of using a mixture of target descriptors, this variant only uses the descriptor of the corresponding target. The performance drops by 2.08% when stance features of related targets are not mined. This variant also illustrates the effectiveness of multi-task learning in the stance detection task. Moreover, we test the effect of utilizing target descriptors. This variant performs similarly to variant (4) since, in both cases, private features specific to related targets are not fully exploited.
To intuitively assess the sharedness of stance features in both parts, we visualize the features from the shared and private parts with t-SNE on the test set. As shown in Figure 3, the data points are more entangled in the shared part than in the private part, which verifies that the shared part learns more target-general stance features. Moreover, most data points are well separated according to their corresponding targets, showing that most stance features are specific to their corresponding targets. Besides, we can observe from Figure 3b that some data points corresponding to 'Feminist Movement' and 'Legalization of Abortion' are closer to each other than to other data points. This is consistent with our intuition that some tweets expressing stances towards these two targets may discuss similar topics with overlapping expressions.

C. VISUALIZATION OF FEATURES
We also demonstrate a similar phenomenon from another perspective in Figure 4, which is drawn from the normalized confusion matrix of the target classification in the private part. In the confusion matrix, the row is the correct target and the column is the predicted target; a darker color means more examples fall in that cell.
We can observe that:
• Examples of topically related targets are sometimes misclassified as one another, which proves that some features are shared across these targets. In contrast, as 'Climate Change' has little topical overlap with the other targets, most of its texts are correctly classified.
• Comparing the results of the BERT model and our model in Table 2, the improvement in macro-averaged F1 for 'Legalization of Abortion' is greater than that for 'Atheism'. While the performance on 'Climate Change is a real Concern' also increases by a large margin, this is mainly caused by the class imbalance in this target: our model extracts better features than the BERT baseline and improves predictions on the Against class, which in turn improves the overall macro-average F1 of Favor and Against.
We also present the top 5 important words for stance classification for each target using LIME [42] in Table 4. LIME is a probing method that weights the importance of each input token by corrupting the input text and measuring the changes in prediction confidence. Intuitively, removing important words from the text usually decreases the confidence of the correct class, sometimes even flipping the prediction. The words chosen by LIME are key features used by our model to make stance predictions and should be highly informative of stance towards their corresponding targets. The extracted important words in Table 4 verify this assumption: e.g., murder is frequently used to express an 'against' stance towards 'Legalization of Abortion', while rights is commonly used in 'Legalization of Abortion' to support the birthright of women. These words show that our model learns proper stance features for each target to perform stance classification.
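LIME itself fits a local linear surrogate over many perturbed samples; the sketch below conveys the same intuition with a cruder single-token occlusion score (the stub classifier and its probabilities are purely illustrative, not the paper's model).

```python
def occlusion_importance(tokens, predict_proba, label_idx):
    """Score each token by the confidence drop when it is removed.

    A crude stand-in for LIME: LIME fits a local linear model over many
    perturbed samples, while this sketch ablates one token at a time.
    `predict_proba` is any callable mapping a token list to class probs.
    """
    base = predict_proba(tokens)[label_idx]
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        scores.append(base - predict_proba(reduced)[label_idx])
    return scores

# Toy classifier: 'against' confidence driven entirely by the word 'murder'.
def toy_clf(tokens):
    p_against = 0.9 if "murder" in tokens else 0.3
    return [1 - p_against, p_against]

imp = occlusion_importance(["abortion", "is", "murder"], toy_clf, 1)
# 'murder' gets the largest score: removing it drops confidence 0.9 -> 0.3
```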

VI. CONCLUSION
In this paper, we propose the S-MDMT model for stance detection on tweets. We novelly formulate the task as a multi-domain multi-task learning problem and employ the shared-private structure in stance detection for the first time to fully exploit the stance features shared across different targets. Experimental results on the SemEval 2016 dataset demonstrate the effectiveness of the proposed S-MDMT model, which outperforms existing target-aware models that use external resources and additional sentiment annotations. We also provide a detailed analysis of the shared features across different targets.
The proposed S-MDMT method could also be combined with existing methods that utilize external resources like linguistic analyzers and commonsense knowledge bases. We leave it as future work to explore whether those resources are complementary to the multi-domain multi-task formulation.