Enhancing Aspect-Based Sentiment Analysis With Capsule Network

Existing feature-based neural approaches for aspect-based sentiment analysis (ABSA) try to improve their performance with pre-trained word embeddings and by modeling the relations between the text sequence and the aspect (or category), thus depending heavily on the quality of the word embeddings and on task-specific architectures. Although recently pre-trained language models, i.e., BERT and XLNet, have achieved state-of-the-art performance in a variety of natural language processing (NLP) tasks, they are still subject to the aspect-specific, local feature-aware and task-agnostic challenges. To address these challenges, this paper proposes an XLNet and capsule network based model, XLNetCN, for ABSA. XLNetCN first constructs an auxiliary sentence to model the sequence-aspect relation and generate global aspect-specific representations, which enhances aspect-awareness and ensures fuller pre-training of XLNet, improving its task-agnostic capability. XLNetCN then employs a capsule network with a dynamic routing algorithm to extract the local and spatial hierarchical relations of the text sequence and yield local feature representations, which are merged with the global aspect-related representations for downstream classification via a softmax classifier. Experimental results show that XLNetCN significantly outperforms the classical BERT, XLNet and traditional feature-based approaches on the two benchmark datasets of SemEval 2014, Laptop and Restaurant, and achieves new state-of-the-art results.


I. INTRODUCTION
As a fine-grained task of sentiment analysis in natural language processing, aspect-based sentiment analysis (ABSA) [1], [2], also known as target-specific sentiment classification [3], has attracted attention from many researchers. Compared with traditional sentiment analysis, ABSA aims to identify a more complete and in-depth sentiment polarity of a text sequence towards a given aspect, usually in the form of explicitly mentioned aspect terms or implicit aspect categories (e.g., [4]). Most previous neural models for ABSA have mainly contributed to utilizing pre-trained word embeddings as additional features and designing task-specific neural structures with some complex syntactic features (e.g., [5]-[7]) or attention mechanisms (e.g., [3], [4], [8]-[11]). More recently, some scholars have even proposed complicated models with the help of more advanced neural structures, e.g., transformation networks [12], parameterized convolutional neural networks (PCNN) [13], gated convolutional networks (GCN) [14], memory networks [15], graph networks [16], [17] or semantic cognition networks (SCN) [18]. However, just as Li et al. pointed out in [19], the improvement of these models measured by accuracy or F1 score has reached a bottleneck, because the commonly used embeddings are pre-trained via word2vec or GloVe, which can only provide context-independent word-level features and are insufficient for capturing the complex semantic dependencies in a sentence.
To address the context-independence problem and capture context-aware relations, some scholars propose contextualized word embeddings by pre-training them over large-scale datasets with deep LSTMs, e.g., CoVe (Contextualized Word Vectors) [20], ELMo (Embedding from Language Models) [21] and ULM-FiT (Universal Language Model Fine-tuning) [22], or with Transformers, e.g., OpenAI GPT (Generative Pre-Training) [23], BERT (Bidirectional Encoder Representation from Transformers) [24] and XLNet [25], which assign different contextual embeddings to each word and fine-tune them in a lightweight task-specific network. More recently, some scholars have made initial attempts to couple the deep contextualized word embedding layer with downstream neural models for ABSA tasks and established new state-of-the-art results. In particular, Xu et al. [26] propose a novel post-training approach on BERT to enhance fine-tuning performance for aspect extraction and aspect sentiment classification. Sun et al. [27] propose a BERT-based model that constructs an auxiliary sentence from the aspect and converts ABSA into a sentence-pair classification task. Building on the work of BERT, Yang et al. [25] further propose XLNet, a generalized autoregressive pre-training method that overcomes BERT's limitations of neglecting the dependency between masked positions and suffering a pretrain-finetune discrepancy.
Although the experimental results in [25] show that XLNet outperforms BERT on 20 tasks, e.g., question answering, natural language inference, sentiment analysis, and document ranking, it still lacks extensive application and analysis on the ABSA task. In this paper, we apply XLNet and the capsule network [28] to the ABSA task, aiming to alleviate the following challenges: (1) Aspect-specific challenge. Most feature-based works employ various complex attention mechanisms and neural networks to capture the relations between the aspect and the context and to generate aspect-specific representations for final classification. The canonical BERT or XLNet model usually takes the hidden state vector of a special token in the input sequence (e.g., the dummy token [CLS]) as the global aggregated representation of the sequence, without explicitly considering the relations between the aspect and the sequence, and is thus subject to the aspect-specific challenge.
(2) Local feature-aware challenge. Both XLNet and BERT are good at capturing the long-range dependency relations of the sequence with the help of Transformer-XL or the Transformer, but they usually neglect local features of the sequence, such as partial and hierarchical relations, which play an important role in traditional feature-based methods. Such local features might provide informative signals for the final prediction.
(3) Task-agnostic challenge. To enhance the performance of fine-tuning methods, it is necessary to ensure that task-related knowledge is fully learned during the fine-tuning step. However, the limited amount of supervised training data usually prevents the model from becoming fully task-aware. Consequently, making full use of the limited supervised training data for fine-tuning in the downstream task is of great importance. Sun et al. point out in [27] that constructing an auxiliary sentence from the aspect and converting ABSA into a sentence-pair classification task with the help of the NSP (next sentence prediction) training objective can achieve further improvements over canonical BERT. However, whether their approach is effective for XLNet remains to be verified, since XLNet removes the NSP training objective.
In this paper we propose an XLNet and capsule network-based model for ABSA, first extending the work of [27] to XLNet to verify its aspect-specific and task-agnostic enhancing abilities. Then, we apply a capsule network with the corresponding dynamic routing algorithm to explore the partial and hierarchical relations of the sequence. We also present several merging strategies to yield the final representations of the sequence, which are fed to a softmax classifier. Finally, we conduct extensive experiments on benchmark corpora to show the effectiveness of our proposed method. To the best of our knowledge, this is the first work to empirically investigate capsule networks and XLNet for ABSA.
The rest of this paper is organized as follows: Section II briefly summarizes related work on pre-trained language models and capsule networks. Section III presents the details of our model, including its structure and implementation. Section IV conducts extensive experiments over two widely-used datasets and gives in-depth analyses. Conclusions and directions for future research are provided in Section V.

II. RELATED WORKS
A. PRE-TRAINING LANGUAGE MODELS
Many previous works have contributed to ABSA (i.e., aspect-level sentiment classification) using various canonical neural networks, e.g., MemNet [29], RAM [8], IAN [30], DSAN [6] and DTDA [7], all of which explicitly model the aspect information and generate aspect-specific representations to improve the final classification performance. Ma et al. [11] also propose a model termed Sentic LSTM by augmenting the long short-term memory (LSTM) network with a hierarchical attention mechanism consisting of a target-level attention and a sentence-level attention, as well as integrating commonsense knowledge into the recurrent encoder. Although these works all use canonical pre-trained word embeddings, e.g., word2vec or GloVe, as an important component of downstream models, offering significant improvements over embeddings learned from scratch, the polysemy and task-specific structure dependency problems still pose many restrictions.
In the past two years, with the rapid development and successful application of deep contextualized word embeddings such as CoVe [20], ELMo [21], ULM-FiT [22], OpenAI GPT [23] and BERT [24], methods based on pre-training language models on a large network with a great amount of unlabeled data and fine-tuning on downstream tasks have made breakthroughs in many natural language understanding (NLU) tasks. Unlike ELMo and ULM-FiT, which are intended to provide additional features for a particular architecture that encodes human understanding of the end task, BERT relies heavily on language model pre-training and has achieved state-of-the-art performance in many NLU tasks ranging from sequence and sequence-pair classification to question answering. The pre-trained BERT can be easily fine-tuned with just one additional output layer to create a state-of-the-art model for a wide range of tasks. Following the great success of BERT, some scholars have made initial attempts to couple BERT with downstream neural models for ABSA and established new state-of-the-art results. For example, Sun et al. [27] propose to construct an auxiliary sentence from the aspect and convert ABSA into a sentence-pair classification task. Song et al. [31] propose an attentional encoder network (AEN) for ABSA that eschews recurrence and employs attention-based encoders to model the relations between the sequence and the aspect. They also apply pre-trained BERT and obtain new state-of-the-art results on the SemEval 2014 dataset. Huang and Carley [32] propose a novel target-dependent graph attention network (TD-GAT) for ABSA that explicitly utilizes the dependency relationships among words, and demonstrate that using BERT representations further boosts performance substantially.
Despite the great success achieved by BERT, it may suffer from neglecting the dependency between masked positions and from a pretrain-finetune discrepancy. As a result, Yang et al. [25] propose XLNet, a generalized autoregressive pre-training method, which is shown to outperform BERT-related models on SQuAD, GLUE, RACE and 17 other datasets. Compared with BERT, XLNet has demonstrated advantages in handling long documents and text generation tasks, e.g., language understanding [33] and question answering [34]. However, there is still a lack of extensive application of XLNet to the ABSA task, as well as of comparisons with BERT-based works.

B. CAPSULE NETWORK
The most classical and representative approach to extracting the spatial patterns of a sequence is the CNN. However, the CNN is restricted in its ability to encode the rich structures present in a sequence, especially hierarchical relations between different layers. As a result, Sabour et al. [28] propose the notion of a capsule network with an iterative routing process to decide the credit attributions between nodes from lower to higher layers, allowing intrinsic spatial relations between the parts and the whole to be encoded. Previous related works on capsule networks have mainly focused on the fields of computer vision and graphs. In the past two years, some scholars have begun to demonstrate their promising potential in NLP tasks. For example, Zhao et al. [35] investigate capsule networks with dynamic routing for text classification and propose three strategies to stabilize the dynamic routing process. Their experimental results show that capsule networks perform better than traditional max pooling, average pooling and self-attention mechanisms on top of recurrent neural network and CNN encoders, and achieve competitive results over the compared baseline methods on text classification benchmarks, especially multi-label text classification. Zhang et al. [36] explore the use of an attentional capsule network for relation extraction in a multi-instance multi-label learning framework and demonstrate that the capsule network improves the precision of the predicted relations, particularly for multiple entity-pair relation extraction. Some scholars also use capsule networks for sentiment analysis, e.g., caps-BiLSTM [37] and SC-BiCapsNet [38], and demonstrate their effectiveness in improving classification accuracy. However, these works mainly rely on traditional word embeddings trained via word2vec or GloVe, and use the capsule network and dynamic routing algorithm to improve sensitivity to spatial and hierarchical information.
Further applications of capsule networks to ABSA are still lacking.

III. THE PROPOSED MODEL
A. TASK DEFINITION
For ABSA classification task, a text sequence might contain different aspects associated with different sentiment polarities. For example, the sentiment polarity of ''Staffs are not that friendly, but the taste covers all.'' will be positive if considering the aspect food, but negative for the aspect service.
Let x = (x_1, . . . , a_i, . . . , a_j, . . . , x_k) be a word sequence consisting of k tokens, a be an aspect (e.g., a_i or a_j), and y = {y_1, . . . , y_m} be a set of pre-defined sentiment polarity categories (e.g., {negative, neutral, positive}). The goal of the ABSA task is to identify the correct sentiment polarity of x with respect to a, which can be formulated as a function f that measures the conditional probability distributions over all possible labels in y as follows:

f(x, a) = P(y_i | x, a), y_i ∈ y

B. OVERVIEW OF XLNetCN
In this section, we propose an XLNet and capsule network-based model, XLNetCN, for ABSA; its overall architecture is illustrated in Figure 1. XLNetCN consists of four parts: (1) Input layer. It builds a suitable input sequence in the format required by the XLNet model, which can be a single sequence or a pair of sequences. SentencePiece [39] is then used for tokenization, and position encoding and the segment recurrence mechanism are employed to yield the final input representations. In XLNetCN, we use the aspect tokens to build an auxiliary sentence and form it into a sequence pair with the target text sequence.
(2) XLNet Layer. It uses XLNet as the encoder that consists of 12 Transformer-XL blocks and 12 self-attention heads, and then outputs the whole layer hidden representations. The vectors for the special token [CLS] are regarded as the global aspect-related representations of the text sequence.
(3) CapsNet Layer. It takes a capsule network to further extract the spatial and hierarchical features from the text sequence via a dynamic routing algorithm; these features are compressed into a representation of pre-determined size aiming to cover more local n-gram features.
(4) Output layer. Several pooling strategies are explored to combine the global aspect-specific representations from the XLNet layer and the local spatial and hierarchical representations from the CapsNet layer, thus yielding the final representation of the text sequence, which is fed to a simple softmax classifier to calculate the conditional probability distributions over the pre-defined labels.

C. INPUT LAYER
Inspired by the work of [27], we propose a method of constructing an auxiliary sentence and converting the original classification task (which might be multi-label) into a sentence-pair one. To facilitate comparisons with BERT-pair in [27] and other BERT-based models in [24], whose inputs each consist of only a text sequence, we construct different auxiliary sentences and denote them in Table 1. More concretely, the auxiliary sentences listed above are further explained as follows:
• Sentence for single: The input consists of the original text sequence without an auxiliary sentence, just as in the canonical BERT model for text classification, e.g., [24], [26].
• Sentence for QA: The auxiliary sentence generated from the aspect is a pseudo-question that contains neither the aspect nor sentiment polarity information, e.g., ''what is the polarity?''.
• Sentence for QA-A: The generated auxiliary sentence is a pseudo-question that contains the aspect tokens. For example, given the aspect ''service'', the generated sentence is ''what is the polarity of the service?''.
• Sentence for NS-A: The generated auxiliary sentence consists only of the aspect tokens. For example, given the aspect ''service'', the generated sentence is ''service''.
• Sentence for AS-AP: The auxiliary sentence generated from the aspect is a pseudo-sentence that contains both the aspect and sentiment polarity information, e.g., ''The polarity of service is positive''. The task also needs to be converted into a binary classification one to obtain a probability distribution over the set {0, 1}, in which the probability of label 1 is used as the matching score.
• Sentence for AS-P: Similar to AS-AP except that the aspect information is removed, e.g., ''The polarity is negative'', ''The polarity is neutral'' and ''The polarity is positive''.
• Sentence for AW-AP: Similar to AS-AP except that the auxiliary sentence only includes the aspect and sentiment polarity tokens, e.g., ''service negative'', ''service neutral'' and ''service positive''.
• Sentence for AW-P: Similar to AW-AP except that it only includes the sentiment polarity tokens, e.g., ''negative'', ''neutral'' and ''positive''.
For AS-AP, AS-P, AW-AP and AW-P, the task should also be converted from the original prediction over a multi-label set (e.g., {''negative'', ''neutral'', ''positive''}) to a binary set (e.g., {0, 1}) that indicates the relation between the two sequences in the pair. For example, given the aspect ''service'' with the pre-defined set {''negative'', ''neutral'', ''positive''} in the AS-AP setting, three auxiliary sentences are generated: ''The polarity of service is negative'', ''The polarity of service is neutral'', and ''The polarity of service is positive''. We take the label of the sequence pair with the highest matching score as the predicted result. As shown in the later experiments, this operation significantly improves the performance of the classifier.
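The auxiliary-sentence templates above can be sketched as a small helper. This is an illustrative sketch only: the template wordings follow the paper's examples, while the function and variable names are our own.

```python
# Hypothetical sketch of the auxiliary-sentence construction described above.
# Template wordings follow the paper's examples; names are ours.
POLARITIES = ["negative", "neutral", "positive"]

def build_auxiliary(mode, aspect):
    """Return the list of auxiliary sentences to pair with the text sequence."""
    if mode == "single":
        return []                                   # no auxiliary sentence
    if mode == "QA":
        return ["what is the polarity?"]
    if mode == "QA-A":
        return [f"what is the polarity of the {aspect}?"]
    if mode == "NS-A":
        return [aspect]
    if mode == "AS-AP":                             # converted to binary matching
        return [f"The polarity of {aspect} is {p}" for p in POLARITIES]
    if mode == "AS-P":
        return [f"The polarity is {p}" for p in POLARITIES]
    if mode == "AW-AP":
        return [f"{aspect} {p}" for p in POLARITIES]
    if mode == "AW-P":
        return list(POLARITIES)
    raise ValueError(f"unknown mode: {mode}")
```

For instance, `build_auxiliary("AS-AP", "service")` yields the three candidate sentences from the example, one per polarity; the pair whose matching score is highest determines the predicted label.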
After building the input sequence, we follow the same token pre-treatments as XLNet, using SentencePiece [39] for tokenization and restricting the input sequence to no more than a given length threshold t, with t ≤ 512. For the sentence-pair situations, if the input sequence, e.g., (x_{1:k}, a_{1:r}), exceeds t tokens, we preserve the whole auxiliary sentence and shorten only the text sequence, truncating x from the tail so that the total length (including special tokens) does not exceed t. Similar to BERT, XLNet can model multiple segments by treating the concatenation of two segments as one sequence when performing permutation language modeling. However, XLNet no longer contains the NSP training objective.
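A minimal sketch of this tail-truncation step. We assume here that the pair is later wrapped with three special-token positions (two separators and one [CLS], as in XLNet's standard input format); the helper name and the `num_special` parameter are our own illustration:

```python
def truncate_pair(text_tokens, aspect_tokens, t, num_special=3):
    """Keep the whole auxiliary sentence; shorten only the text sequence so
    that len(text) + len(aspect) + num_special <= t (assumed special-token
    count; three positions for two separators and one [CLS])."""
    budget = t - len(aspect_tokens) - num_special
    if budget < 0:
        raise ValueError("length threshold too small for the auxiliary sentence")
    return text_tokens[:budget], aspect_tokens
```

Because the auxiliary sentence carries the aspect (and possibly polarity) signal, it is always preserved intact and only the review text is shortened.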

D. XLNet LAYER
XLNet is a multi-layer encoder built on Transformer-XL and, depending on the model size, consists of 12 (or 24) Transformer-XL blocks with 12 (or 16) self-attention heads. The XLNet layer can output a sequence of embeddings, one for each token in the input, or an aggregated embedding vector that pools the input. Let H = {H_1, H_2, . . . , H_t} ∈ R^{t×d_h} be the last-layer hidden representations of the input E = {E_1, E_2, . . . , E_t}, denoted as:

H = XLNet(E)

However, we believe that the sequence H might contain intrinsic spatial and hierarchical information that would be informative for the model. As a result, we further use a capsule network to extract such local features from H.

E. CapsNet LAYER
The CapsNet layer uses a capsule network to automatically learn child-parent (or part-whole) relationships from the output embedding sequence H of the XLNet layer, whose elements are viewed as input capsules, and compresses them into representations of pre-determined size. Formally, for the child capsules H_i, each parent capsule V_j aggregates all the incoming messages u_j from the child capsules and squashes u_j so that ||V_j|| ∈ (0, 1) via the squash function [28] as follows:

u_j = Σ_i c_ij H_{j|i}

V_j = (||u_j||² / (1 + ||u_j||²)) · (u_j / ||u_j||)

where c_ij is the coupling coefficient that can be viewed as the voting weight on the information flow from child capsule i to parent capsule j, and H_{j|i} denotes the information transferred from H_i to V_j.
We take a single-layer feed-forward neural network with a ReLU activation as the transformation to obtain the information H_{j|i} = ReLU(W_ij^T H_i) from each child capsule H_i, where W_ij ∈ R^{d_h×(cn·d_c)} is the transformation matrix corresponding to position j of the parent capsule, cn is the number of capsules and d_c is the dimension of the neural vector in each capsule.
The dynamic routing process [40], [41] is implemented via an EM iterative process of refining the coupling coefficient c ij , which measures proportionally how much information would be transferred from H i to V j .
At iteration n, the coupling coefficient c_ij is computed via a softmax function, ensuring that all the information from the child capsule H_i is distributed among the parent capsules:

c_ij^n = exp(b_ij^n) / Σ_k exp(b_ik^n)

where b_ij^n is a temporary value that is iteratively updated from the value b_ij^{n−1} of the previous iteration and the scalar product of H_{j|i} and V_j^n:

b_ij^n = b_ij^{n−1} + H_{j|i} · V_j^n

This scalar product is essentially the similarity between the input to the capsule and the output from the capsule. In particular, b_ij^0 is initialized to 0. The coefficient depends on the location and type of both the child and the parent capsules, and is obtained by the iterative refinement of b_ij. After the source input is encoded into cn capsules, we concatenate these capsules into a vector:

V* = [V_1; V_2; . . . ; V_cn]

The vector V* ∈ R^{d_h} can be regarded as the representation of the source sequence that contains its part-whole relationship features.
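As a concrete illustration, the squash function and the routing iterations described above can be sketched in NumPy. This is a simplified sketch, not the paper's exact implementation: we assume the votes H_{j|i} have already been computed (e.g., by the ReLU transformation above) and are given as a single array.

```python
import numpy as np

def squash(u, axis=-1, eps=1e-9):
    """Squash u so that its norm lies in (0, 1), per Sabour et al. [28]."""
    sq = np.sum(u ** 2, axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * u / np.sqrt(sq + eps)

def dynamic_routing(u_hat, n_iters=3):
    """u_hat: votes H_{j|i}, shape (num_child, num_parent, d_c).
    Returns the parent capsules V, shape (num_parent, d_c)."""
    num_child, num_parent, _ = u_hat.shape
    b = np.zeros((num_child, num_parent))           # routing logits, b^0 = 0
    for _ in range(n_iters):
        # softmax over parent capsules for each child: coupling coefficients c_ij
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        u = np.einsum("ij,ijd->jd", c, u_hat)       # u_j = sum_i c_ij * H_{j|i}
        v = squash(u)                                # V_j, with norm in (0, 1)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)   # b_ij += H_{j|i} . V_j
    return v

def capsule_vector(u_hat, n_iters=3):
    """Concatenate the cn parent capsules into the flat vector V*."""
    return dynamic_routing(u_hat, n_iters).reshape(-1)
```

With the paper's default settings (cn, d_c) = (24, 32) and n = 3 iterations, `capsule_vector` would return a vector of dimension cn × d_c = 768 = d_h, matching V* ∈ R^{d_h}.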

F. OUTPUT LAYER
Before the final prediction, we explore several strategies to pool the global aspect-specific representation H_[CLS] and the local feature representation V* into an aggregated vector O ∈ R^d:

O = pool(H_[CLS], V*)

We use four different pooling strategies, sum, maximum, concatenation and average, denoted as add, max, con and avg for short, respectively.
Let θ be the set of all trainable parameters of XLNetCN. An additional output layer with a softmax classifier turns the output O into the conditional probability distributions P(y_i | O, θ) over the category set y = {y_1, . . . , y_m} or y = {0, 1} as follows:

P(y_i | O, θ) = softmax(W O)

where W ∈ R^{m×d} is the task-specific parameter matrix and m is the number of labels. The label with the largest value, y_x = argmax_i P(y_i | O, θ), is selected as the predicted result, and the loss J(x, θ) is computed with the standard cross-entropy function.
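The four pooling strategies and the softmax output can be sketched as follows. This is a NumPy illustration under our own naming; note that for add, max and avg the pooled vector keeps dimension d_h, while con doubles it (so the width d of W differs per strategy).

```python
import numpy as np

def pool(h_cls, v_star, strategy="add"):
    """Combine the global [CLS] representation with the capsule vector V*."""
    if strategy == "add":
        return h_cls + v_star
    if strategy == "max":
        return np.maximum(h_cls, v_star)
    if strategy == "con":
        return np.concatenate([h_cls, v_star])
    if strategy == "avg":
        return (h_cls + v_star) / 2.0
    raise ValueError(f"unknown strategy: {strategy}")

def predict(o, W):
    """Softmax over the logits W @ o; returns (probabilities, label index)."""
    z = W @ o
    z = z - z.max()                       # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    return p, int(np.argmax(p))
```

During fine-tuning, the cross-entropy loss over these probabilities would be minimized jointly for all parameters, as described below.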
We fine-tune all parameters from XLNetCN as well as W jointly by maximizing the log-probability of the correct label.

IV. EXPERIMENTS
In this section, we evaluate our model on the widely-used benchmark dataset SemEval-2014 Task 4, which consists of Restaurant reviews and Laptop reviews [1]. SemEval 2014 Task 4 contains four different subtasks in total, and we consider two of them in the following experiments: Subtask 2 (aspect term polarity) and Subtask 4 (aspect category polarity). Subtask 2 aims to determine the polarity of each aspect term (e.g., positive, negative, neutral or conflict), given a set of aspect terms within a sentence. Subtask 4 aims to determine the polarity of each aspect category, given a set of pre-identified aspect categories (e.g., {food, service}).
Both the original Laptop and Restaurant datasets contain sentences with multiple marked aspect terms or aspect categories, each carrying a sentiment polarity label. For the sake of comparability with [8], [26], [42], we discard the reviews with the polarity ''conflict'', leaving a 3-label setting. See the detailed statistics of both datasets in Table 2, in which the columns ''Max Sequence Length'' and ''Max Aspect Length'' denote the maximum sequence length and the maximum aspect length tokenized by SentencePiece, respectively.
From Table 2, we can see that both datasets have only a small number of training examples and their sequences are far shorter than 512 tokens. So, we use their maximum sequence length as the default value to preserve their complete meanings. As for the aspects, since the average aspect length in both datasets is about 1.8 for Subtask 2 and 2.4 for Subtask 4, we take only the first five tokens of the aspect sequence in the following experiments.
When fine-tuning, we try to keep the same settings as those of [27] and set the number of epochs to 4. The initial learning rate is 2e-05, and the batch size is 24. To avoid over-fitting, we use dropout regularization with the default value of 0.1, and the default Adam optimizer with β_1 = 0.9 and β_2 = 0.999.
For convenience, we denote the various auxiliary sentences and pooling strategies in the form of a postfix, e.g., XLNetCN-AS-AP-add means constructing the auxiliary sentence with AS-AP and using the add strategy for pooling. We report accuracy (Acc) and macro F1 (MF1) scores as the evaluation metrics in the following experiments.
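The two reported metrics can be computed as below. This is a generic sketch of accuracy and macro-averaged F1 (the unweighted mean of per-class F1 scores), not the authors' own evaluation script:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores over the given label set."""
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro F1 weights each polarity class equally, which is why it is more sensitive than accuracy to the often-confused neutral class discussed in the analysis below.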

A. EXPERIMENT 1: EFFECTS OF ITERATIVE ROUTING ANALYSIS
In this section, we study how the iteration number n and the number of capsules affect the performance of the capsule network on Subtask 2. We only consider the XLNetCN-NS-A-add and XLNetCN-AS-AP models in this experiment. Figures 2 and 3 show the accuracy comparisons for 2-6 iterations of the dynamic routing process, where each pair (cn, d_c) satisfies cn × d_c = d_h.
From the above results, we find that the performance of these two XLNetCN models is unsteady across different iteration numbers and capsule number settings. But in general, both reach their best when the iteration number is set to 3 and the pair (cn, d_c) is (24, 32). So, in the following experiments we take 3 as the default value for the iteration number n, and (24, 32) for (cn, d_c), unless explicitly mentioned otherwise.

B. EXPERIMENT 2: EFFECTS OF AUXILIARY SENTENCE AND POOLING STRATEGY ANALYSIS
In this section, we explore the performance of various XLNetCN models under different situations of auxiliary sentences and pooling strategies so as to confirm the effectiveness of XLNetCN. We summarize the experimental results in Table 3 and boldface the best scores. For the sake of comparisons, we use ''single'' to denote that the XLNetCN model only uses the text sequence without auxiliary sentence.
From the results in Table 3, we can see that: (1) The performance of the various XLNetCN-QA models is comparable to that of the corresponding XLNetCN-single models. Although the former include a sequence pair in their inputs, the auxiliary sentence contains no aspect or polarity information, and as a result cannot help the training of XLNetCN. All XLNetCN-QA-A models outperform both XLNetCN-single and XLNetCN-QA due to the inclusion of the aspect tokens, which allows XLNet to learn the sequence-aspect relationship and yield aspect-specific representations. We can see similar results for XLNetCN-NS-A, whose auxiliary sentence includes only the aspect tokens. XLNetCN-NS-A gains larger improvements on Restaurant than on Laptop due to the larger number of training samples, which ensures fuller training of XLNet.
(2) When using any auxiliary sentence other than QA, the performance of XLNetCN is improved significantly, which suggests the effectiveness of auxiliary sentences in improving the aspect-aware and task-agnostic abilities. From the comparison between NS-A and QA, we can see that the Acc and MF1 values of NS-A are improved obviously due to the inclusion of the aspect, which demonstrates that XLNet can better learn aspect-specific representations for the sequence and confirms the importance of capturing the relationships between the sentence and the aspect.
(3) From the results of AS-P, AW-P, NS-A and QA, the performance of XLNetCN, especially the MF1 score, is improved significantly after converting the task from the original 3-label setting to the 2-label one. According to our analysis, there are at least two reasons. Firstly, although AS-P and AW-P do not contain aspect tokens in their auxiliary sentences, the polarity information and the converted task allow the model to learn the relations between the aspect and the sequence even without the NSP training objective in XLNet, since it concatenates them together and all possible masking variants are fully exploited. Secondly, the increased number of training instances allows the model to be trained more fully and to learn more task-agnostic knowledge. Converting the task to a 2-label one is also helpful for decreasing the uncertainties of some noisy labels in the multi-label task, e.g., the neutral label that is often confused with the positive or negative label, which is confirmed by the greatly increased MF1 values of AS-P and AW-P over NS-A. Similar conclusions can also be drawn from the experimental results of BERT-pair in [27]. As a result, we see that constructing a suitable auxiliary sentence and converting the task into a binary classification one provides an effective and feasible way of improving the aspect-specific and task-agnostic learning abilities.
(4) Each XLNetCN variant that contains an auxiliary sentence and is converted into a 2-label task, i.e., AS-AP, AS-P, AW-AP and AW-P, achieves significant improvements over XLNetCN-single and the XLNetCN models that only include an auxiliary sentence, which clearly confirms the importance of converting the task. From the results of AS-P versus AS-AP, as well as AW-P versus AW-AP, XLNetCN achieves obviously better performance when the auxiliary sentence contains both the aspect and polarity information rather than the polarity information alone, which again confirms the effectiveness and importance of explicitly modeling the relations between the aspect and the text sequence. From their greatly improved performance over NS-A, we can see that converting the task from a 3-label to a 2-label one is critical. In general, the performance of AW-P and AW-AP is not as good as that of the corresponding AS-P and AS-AP, but the differences are minor. According to our analysis, although the auxiliary sentence structures of AW-P and AW-AP are similar to those of AS-P and AS-AP in carrying the polarity label or both the aspect and polarity information, the auxiliary sentence in AS-P or AS-AP has a more complete syntactic structure, so that XLNet can learn more accurate contextualized meanings. Similar results can be seen in [27].
(5) Among the four pooling strategies, the add strategy achieves the best results for AS-AP on both datasets. We think the main reason lies in the fact that there usually exist some informative words that have an important impact on the final prediction in the sentiment classification task. The add strategy allows the model to enhance the impact of these informative words, e.g., sentiment-related words, with the help of the capsule network that extracts local n-gram information from the text sequence.
From the above results, we can see that XLNetCN-AS-AP-add achieves the best performance on both the Laptop and Restaurant datasets, so in the following experiments we mainly use it as our representative model for comparisons.

C. EXPERIMENT 3: EFFECTS OF CAPSULE NETWORK
In this section, we first conduct experiments on Subtask 2 to explore the effectiveness of the capsule network in learning local features and then demonstrate its advantages over other neural networks. Let XLNetbase denote the canonical model that does not use the capsule network and directly takes H[CLS] for the downstream task. For fair comparisons, most hyper-parameters of XLNetbase and XLNetCN are kept consistent, except for the settings of the capsule network. The experimental results are illustrated in Table 4.
From the results in Table 4, we can see that the capsule network slightly improves the performance of the XLNetbase model both in the canonical setting with a single sequence as input and in the settings with an auxiliary sentence, which demonstrates that the capsule network can improve the performance of XLNet by further exploiting local and spatial features.
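For reference, the dynamic routing used inside the CapsNet layer can be sketched in the routing-by-agreement form of Sabour et al.; the plain-NumPy formulation, shapes and iteration count below are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Capsule non-linearity: keeps the direction of s and maps its
    norm into [0, 1), so the norm can act as an existence probability."""
    sq = (s ** 2).sum(axis=axis, keepdims=True)
    return (sq / (1.0 + sq)) * s / np.sqrt(sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Routing-by-agreement over prediction vectors u_hat with shape
    (num_in, num_out, dim_out).  Returns output capsules (num_out, dim_out)."""
    b = np.zeros(u_hat.shape[:2])                       # routing logits
    for _ in range(iterations):
        # coupling coefficients: softmax over the output-capsule axis
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)
        s = (c[:, :, None] * u_hat).sum(axis=0)         # weighted vote sum
        v = squash(s)                                   # output capsules
        b = b + (u_hat * v[None]).sum(axis=-1)          # agreement update
    return v
```

Each iteration increases the coupling between an input capsule and the output capsules that agree with its prediction, which is how the layer builds the local, spatially hierarchical features discussed above.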
In the following experiment, we continue to compare the capsule network with other classical neural encoders, e.g. LSTM and CNN, by simply replacing the capsule network in the CapsNet layer with an LSTM or a CNN, respectively. For the LSTM, we only consider the forward direction and take the last hidden state vector (of dimension 768) as the output. For the CNN, we consider two different encoders with convolutional kernel sizes of [2] and [2, 3, 4], respectively. See Figure 4 for their accuracy results on Subtask 2.
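A minimal sketch of the CNN replacement encoder with multiple kernel sizes is given below; the random placeholder weights, filter count and function name are assumptions (in the model the filters are learned end-to-end).

```python
import numpy as np

def conv_max_pool(H, kernel_sizes=(2, 3, 4), num_filters=256, seed=0):
    """Max-over-time 1-D convolutional encoder used as a drop-in
    replacement for the capsule layer.
    H: (seq_len, hidden) token representations from the XLNet layer.
    Weights are random placeholders for illustration only."""
    rng = np.random.default_rng(seed)
    outputs = []
    for k in kernel_sizes:
        W = rng.standard_normal((k * H.shape[1], num_filters)) * 0.01
        # Slide a width-k window over the sequence, flatten each window,
        # and project it onto num_filters feature maps.
        windows = np.stack([H[i:i + k].ravel()
                            for i in range(len(H) - k + 1)])
        feat = np.maximum(windows @ W, 0.0)   # ReLU activations
        outputs.append(feat.max(axis=0))      # max-over-time pooling
    return np.concatenate(outputs)            # (len(kernel_sizes)*num_filters,)
```

Unlike the capsule layer, this encoder keeps only the strongest activation per filter, discarding the spatial relations between n-grams, which is consistent with its weaker results in Figure 4.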
From the results in Figure 4, we can see that the capsule network clearly outperforms both the LSTM and the CNN in almost all cases.

D. EXPERIMENT 4: COMPARISONS WITH BERT
In this section, we compare XLNetCN on Subtask 2 with the corresponding BERT-based model, denoted BERTCN, which is obtained by simply replacing XLNet with BERT in the XLNet layer. See Table 5 for the experimental results.
From the results in Table 5, we can observe that both XLNet and BERT demonstrate great potential on the ABSA task. Generally, XLNet clearly outperforms BERT on the Laptop dataset. Since XLNetbase better tackles the pretrain-finetune discrepancy problem and is pre-trained on a much larger corpus than BERTbase, it outperforms BERTbase on the Laptop data, where the training data is so limited that full fine-tuning cannot be ensured. On the Restaurant dataset, however, they perform differently, and sometimes BERTCN outperforms XLNet. We attribute this mainly to the larger amount of training data in the Restaurant dataset compared with Laptop, which enables both BERT and XLNet to be more fully trained for learning task-specific knowledge. After constructing the auxiliary sentence and converting the 3-label task into a 2-label one, BERTCN and XLNetCN achieve competitive results on AS-AP with only minor differences, which suggests that both have their own advantages as long as enough training data is available. Additionally, we can see that removing the NSP training objective in XLNet does not have much effect on the final performance on the ABSA task.

E. EXPERIMENT 5: COMPARISONS WITH RELATED WORKS
In this section, we first compare the performance of our model with some recent representative works on Subtask 2. The selected baseline models are briefly introduced as follows: • MemNet [29]. A deep memory network that explicitly captures the importance of each context word when inferring the sentiment polarity of an aspect; the importance degrees and the text representation are calculated with multiple computational layers, each of which is a neural attention model over an external memory.
• RAM [8]. A multiple-attention based model that non-linearly combines the results of multiple attentions with a recurrent neural network, which provides a tailor-made memory for different opinion targets of a sentence.
• IAN [30]. An LSTM-based model that uses interactive attention networks (IAN) to interactively learn attentions in the contexts and targets and to generate representations for the targets and contexts separately, which finally yields the context representation for sentiment classification.
• DSAN-mean [6]. A dependency subtree attention network (DSAN) that takes the syntax distances between the aspect and context words into consideration and utilizes a bidirectional GRU (Gated Recurrent Unit) network and dot-product attention mechanism to generate aspect-related representations for the sentence.
• DTDA-mean [7]. A bidirectional GRU model that uses a dependency tree and distance attention (including syntactic distance and relative distance) to better capture the relations between sentences and aspects.
• BERTADA-base. The BERTbase model in [42] that is further trained on a domain-specific dataset and evaluated on the test set from the same domain.
• AEN-BERT [31]. Attentional encoder network that uses pre-trained BERT to generate word embeddings of the sequence and the aspect respectively, which are then fed to an attentional encoder layer that consists of multi-head attention and point-wise convolution transformation for learning their relationships.
• TD-GAT-BERT(5) [17]. A target-dependent graph attention network (TD-GAT) that explicitly utilizes the dependency relationship among words and BERT representations with five layers.
• LCF-ATEPC [43]. A multi-task learning model for ABSA based on domain-adapted BERT, equipped with the capability of extracting aspect terms and inferring their polarities simultaneously.
• SDGCN-BERT [44]. A model that uses a GCN to capture the sentiment dependencies between multiple aspects in one sentence and also uses BERT for feature representations. We report the Acc and MF1 metrics and boldface the best score across all models. Table 6 illustrates the results, where ''-'' means not reported and the results of the baselines come from their respective reported works.
From the results in Table 6, we can see: (1) On both Laptop and Restaurant, XLNetbase and XLNetCN-single-add significantly outperform the feature-based methods that use typical LSTM or GRU models, such as TD-LSTM, MemNet, RAM, ATAE-LSTM, IAN, DTDA-sum and DTDA-mean. Without employing any additional attention mechanisms or dependency trees, XLNetbase achieves satisfactory results by simply using a softmax classifier on top of an XLNet model, which demonstrates the clear advantages of pre-trained language models over traditional word vectors, e.g. word2vec, GloVe 1.8M and GloVe 2.2M.
(2) Compared with BERTADA-base and XLNetADA-base, which are both further trained on a domain-specific dataset and evaluated on the test set from the same domain, XLNetbase and XLNetCN-single-add underperform due to the lack of pre-training on domain-specific data. AEN-BERT, TD-GAT-BERT(5) and LCF-ATEPC all use BERT for feature representations, and their performance improvements over XLNetbase and XLNetCN-single-add suggest the importance of explicitly modeling the relationship between the sequence and the aspect. (3) Compared with AEN-BERT, TD-GAT-BERT(5), LCF-ATEPC and SDGCN-BERT, all sequence-pair XLNetCN models achieve significant improvements by simply converting the task into a 2-label one. LCF-ATEPC also includes the aspect in the auxiliary sentence as a sentence pair to yield aspect-specific representations, but it does not convert the task. Therefore, we believe that converting the task is critical.
In the following experiment, we continue to compare our XLNetCN with BERT-pair [27] on Subtask 4, which is the closest to our work in constructing auxiliary sentences. Similar to BERT-pair, we replace the aspect tokens with both aspect and category tokens in the auxiliary sentence for Subtask 4. To facilitate comparisons, we put each XLNetCN next to the corresponding similar BERT-pair model in Table 7 and illustrate the experimental results, in which ''-'' means not reported.
From the results in Table 7, we can see that all XLNet-based models outperform their corresponding BERT-pair models under the three sub-task settings, which confirms the superiority of XLNet over BERT on the ABSA task. Additionally, from the better performance of XLNetCN-AW-AP-add over XLNetbase-AW-AP, we can see that the capsule network is also useful for improving the performance of XLNetCN on Subtask 4.

V. CONCLUSION
In this paper, we propose an XLNet and capsule network-based model for ABSA that aims to address the aspect-specific, local feature-aware and task-agnostic challenges. The main contributions are summarized as follows. (1) Constructing an auxiliary sentence to model the sequence-aspect relation and generate global aspect-specific representations, which enhances the aspect-specific ability and ensures the full pre-training of XLNet so as to improve the task-agnostic capability. (2) Employing a capsule network with the dynamic routing algorithm to extract the local and spatial hierarchical relations of the text sequence and yield its local feature-aware representations, so as to address the local feature-aware challenge. (3) Conducting extensive experiments to demonstrate that XLNetCN significantly outperforms the classical BERT, XLNet and traditional feature-based approaches on the two benchmark datasets of SemEval 2014 and achieves new state-of-the-art results. The results also show that adding an additional neural network (e.g. a capsule network) can be helpful for further improving the performance of pre-trained language models on the ABSA task.
In the future, we would like to further explore how to incorporate domain-related knowledge into XLNet with in-domain and cross-domain pre-training. We also want to extend our work with multi-task learning and evaluate the performance on other ABSA tasks.
JINDIAN SU was born in 1980. He received the Ph.D. degree. He is an assistant professor. His main research interests include natural language processing, artificial intelligence, and machine learning. He is a member of the CCF.
SHANSHAN YU was born in 1980. She received the Ph.D. degree. She is a lecturer. Her main research interests include machine learning, big data, and semantic Web. She is a Senior Member of the China Computer Federation.
DA LUO is currently pursuing the master's degree with the South China University of Technology, Guangzhou, China. His current research interests include knowledge graph and natural language processing. VOLUME 8, 2020