Unrestricted Attention May Not Be All You Need - Masked Attention Mechanism Focuses Better on Relevant Parts in Aspect-Based Sentiment Analysis

Aspect-Based Sentiment Analysis (ABSA) is one of the most challenging tasks in natural language processing. It extracts fine-grained sentiment information from user-generated reviews, aiming to predict the polarities towards predefined aspect categories or relevant entities in free text. Previous deep learning approaches usually rely on large-scale pre-trained language models and the attention mechanism, which applies the complete set of computed attention weights and places no restriction on the attention assignment. We argue that the original attention mechanism is not the ideal configuration for ABSA, as most of the time only a small portion of the terms is strongly related to the sentiment polarity of an aspect or entity. In this paper, we propose a masked attention mechanism customized for ABSA, with two different approaches to generating the mask. The first method sets an attention weight threshold determined by the maximum of all weights, and keeps only attention scores above that threshold. The second selects the top words with the highest weights. Both remove the lower-scored parts that are assumed to be less relevant to the aspect in focus. By ignoring the part of the input judged irrelevant, a large proportion of input noise is removed, keeping the downstream model more focused and reducing calculation cost. Experiments on the Multi-Aspect Multi-Sentiment (MAMS) and SemEval-2014 datasets show significant improvements over state-of-the-art pre-trained language models with full attention, which demonstrates the value of the masked attention mechanism. Recent work shows that pure self-attention in a Transformer quickly degenerates to a rank-1 matrix, and masked attention may be another cure for that trend.


I. INTRODUCTION
Sentiment analysis [2] [1] [4] [3] is one of the prevalent tasks in natural language processing (NLP). With the fast development of social media and E-commerce platforms in recent years, sentiment analysis of online reviews aims at mining users' opinions towards social events or products. With effective detection of sentiment polarity expressed in these reviews [5], stakeholders have the opportunity to guide public opinion towards their favorable direction, or improve products for better user experience [6].
Aspect-based sentiment analysis (ABSA) [7] [8] [9] is a more fine-grained task in comparison to document- or sentence-level sentiment analysis. ABSA analyzes the user's emotional state towards aspect categories or entities [10] [11] [12], and its advantage lies in the ability to capture aspect-specific sentiment information. Figure 1 shows two subtasks of ABSA. The first example is for Aspect Term Sentiment Analysis (ATSA), in which two aspect terms "food" and "service" are associated with positive and negative sentiments, respectively. The second review shows the application of Aspect Category Sentiment Analysis (ACSA). The sentence does not mention the two aspect categories "food" and "ambience" directly; it contains only descriptive terms that fall within their scope.
With the development of deep learning, a large number of neural network models [15] [16] [28] [18] [29] [17] [19] have been applied to ABSA tasks. Among these models, Long Short-Term Memory (LSTM) [20] and the attention mechanism [51] have achieved good results in many natural language processing tasks. LSTM networks and the attention mechanism are often combined in ABSA, because LSTM can effectively extract most of the information in a sentence, while attention calculates the correlation weights of different words in the review to determine which part is the most relevant. To reduce the high computational complexity of LSTM and the attention mechanism, Convolutional Neural Networks (CNN) are also used in ABSA for their efficient feature extraction capabilities [34]. Traditional deep learning methods are built upon embedding representations, and commonly used term embedding vectors include Word2Vec [23] and GloVe [24].
More recently, large-scale pre-trained language models have gradually become standard baselines in NLP [35], represented by BERT [25], XLNet [26], and RoBERTa [27]. These language models have strong representational capabilities due to the rich text representations learned from large datasets. They have achieved promising results in ABSA [50] in comparison to traditional deep learning models. As sentiment information related to aspect terms is the key to solving the ABSA task, one may assume that combining the attention mechanism with pre-trained language models should improve performance at the aspect level. However, we find that a simple addition of aspect attention to BERT hurts its performance on some ABSA datasets. Analysis of the data shows that the rich context of the whole sentence is confusing for a single aspect, especially when a sentence contains multiple aspects.
In this paper, we propose two attention mask mechanisms to limit aspect attention to the most relevant parts of a sentence. One is called Attention Mask Weight (AM-Weight), which sets a weight threshold to filter out irrelevant parts. The other is Attention Mask Word (AM-Word), which keeps only the top βn words in the weight assignment. Starting from pre-trained language models, standard attention scores for each aspect term or category are calculated. The candidates are then filtered by one of the mask mechanisms, the weighted sum of their embedding representations is calculated, and the output goes through a fully-connected layer and softmax to produce the sentiment classification label. Experiments are conducted on the SemEval-2014 and Multi-Aspect Multi-Sentiment (MAMS) datasets, which show the performance improvement brought by the attention mask.
The main contributions of this article can be summarized as follows: 1) This paper proposes two attention mask mechanisms for the aspect-based sentiment analysis task, and shows their value with detailed experiments. 2) State-of-the-art performance is achieved on three public datasets by introducing the attention mask over the RoBERTa baseline. 3) Based on sensitivity analysis and examination of both success and failure cases, a recommended strategy is provided for choosing between the two attention mask mechanisms. The model also provides a potential remedy for a deficiency of the self-attention mechanism in Transformer networks.

II. RELATED WORK

A. TEXT REPRESENTATION
To deal with the curse of dimensionality and the independence assumption of one-hot vectors, low-dimensional embedding vectors were proposed as an alternative representation of individual terms in a large vocabulary. Word2Vec [23] and GloVe [24], as the most popular embedding representations, are widely used in a large number of NLP applications, displaying their clear advantage over one-hot vectors. However, pre-training a term's representation vector over a large unlabeled dataset learns only the average of its semantics, ignoring the context of each appearance. It explicitly omits the polysemous cases of the word, which may introduce large bias when the downstream application is not covered in the source text collection. Pre-trained language models using large-scale corpora were proposed in 2018 that take more context into consideration, such as ELMo [35], GPT [36], and BERT [25]. These pre-trained models have strong feature extraction capabilities. BERT, with a bidirectional Transformer for feature extraction, uses a masked language model for context representation and next sentence prediction for semantic coherence. Depending on the downstream task, BERT can be fine-tuned to better fit the task, and its effectiveness has been proven in multiple NLP tasks. The excellent performance of BERT relies heavily on its complex Transformer framework, as static vectors from BERT do not show much improvement over Word2Vec or GloVe.
Recent experiments show that the static mask in the original BERT model limits its performance. RoBERTa [27], an improved version of BERT, replaces the static mask with a dynamic one, expands the pre-training corpus, and removes the next sentence prediction task. In a number of NLP tasks, RoBERTa shows significant improvements over BERT, making it the new state of the art in the BERT family.

B. ASPECT-BASED SENTIMENT ANALYSIS
ABSA aims at predicting sentiment polarity for a specific aspect category or term. It has been widely adopted in many business and social applications, including sentiment analysis of social media [31] and conversational sentiment analysis [32]. Besides, it helps identify the polarity of users' opinions in traffic events [33], which helps determine the accurate condition. It is also applied to the healthcare data to predict drug side effects and abnormal conditions in patients [30].
Most recent deep learning work in aspect-based sentiment analysis uses static word embeddings as input, with LSTM networks and attention mechanisms often adopted for feature extraction. The Target-Dependent Long Short-Term Memory (TD-LSTM) [14] model embeds the target word and its context into the vector space separately, and uses two LSTMs to capture the relationship between the context and the target word to extract the sentiment information of the aspect entity. Attention-based LSTM with Aspect Embedding (ATAE-LSTM) [13] embeds the concatenation of sentences and aspect information into the same vector space, and applies an LSTM to obtain the embedding vector information; the attention mechanism is then applied to find the most relevant part of the sentence for aspect sentiment analysis. Aspect-based sentiment analysis with gated convolutional networks (GCAE) [34] uses two convolutional neural networks to extract aspect-related and aspect-independent information, and designs a gating mechanism to filter the output for aspect-related information. The constrained attention network for multi-aspect sentiment analysis (CAN) [37] restricts the attention weights by introducing sparse regularization, concentrating more weight on the higher attention values. It also introduces orthogonal regularization to keep different aspects from focusing on the same parts of a sentence, ensuring sparser attention for multiple aspects. These methods extract the semantic information of word embeddings through complex network structures, and have achieved competitive results in ABSA.
Recent developments of large-scale pre-trained language models, especially BERT, have had a great impact on mainstream NLP tasks. Aspect-based sentiment analysis, as an extension of the coarse-grained sentiment analysis task, follows the same trend. As an extension of the BERT model, BERT-SPC sends "[CLS] + sentence sequence + [SEP] + aspect sequence + [SEP]" to the hidden layer output of the pre-trained BERT network for aspect sentiment classification. It achieves good performance in ABSA, following the next sentence prediction task in BERT, which proves its ability to capture the relationship between aspect information and the whole sentence. As a simple baseline built over BERT, BERT-SPC does not involve any complex network engineering, yet some of the other models have not achieved significant improvement over it in our experiments. The Attentional Encoder Network (AEN-BERT) [38] uses BERT to embed the context sequence and aspect information, and then applies the attention mechanism to extract the semantic interaction between an aspect and its context. BERT-PT [39] transforms the aspect sentiment classification problem into a special machine reading comprehension problem. BERT-pair-QA [40] constructs a question answering problem with aspect information, performing aspect sentiment analysis by combining automatic question answering and natural language understanding with BERT. CapsNet-BERT [41] combines BERT with a capsule network to model the complex relationship between the aspect information and its context. Local context focus on syntax-ASC (LCFS-ASC) [42] employs local context focus and the syntactic dependency tree to obtain the sentence components related to the aspect, discarding the rest of the context as irrelevant.
The multi-head self-attention transformation (MSAT) [43] network applies multi-head target-specific self-attention to better capture global dependence, and introduces a target-sensitive transformation to effectively tackle the problem of target sentiment analysis. The knowledge guided capsule network (KGCapsAN) [44] utilizes prior knowledge to guide the capsule attention process for the ABSA task. Knowledge-enabled BERT [45] utilizes additional information from a sentiment knowledge graph together with a pre-trained language model to solve the ABSA task.

III. METHOD
This section describes the aspect-based sentiment analysis task and how to apply the attention mask mechanism over a BERT-based model. Within the attention mask network, there are two variations of the mask mechanism, named Attention Mask Weight (AM-Weight) and Attention Mask Word (AM-Word), respectively.

A. PROBLEM FORMULATION
The ABSA task includes Aspect Term Sentiment Classification (ATSC) and Aspect Category Sentiment Classification (ACSC). In the ACSC task, the categories are predefined, and the sentiment polarities are restricted to a fixed label set (positive, negative, and neutral in our datasets). A sentence or paragraph (in ABSA, each instance is usually a short piece of text, i.e., a sentence) from the text collection is represented as S = (w_1, ..., w_n), consisting of n words, where w_i is the i-th word of sentence S. A sentence may contain M targets T^S_1, ..., T^S_M, and each target T^S_j from sentence S is represented as a span of m_j terms. The goal of the ATSC task is to predict the sentiment polarities of the M targets, where P^T_j denotes the sentiment polarity of target T^S_j in the sentence. Similarly, if a sentence contains N aspect categories A^S_1, ..., A^S_N, the ACSC task predicts the sentiment polarity P^A_j for each category A^S_j.
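The formulation above can be captured in a small data structure. The following sketch is illustrative only (the class name and fields are our own, not part of the paper's code); it encodes one ATSC instance with its M targets and their polarity labels:

```python
from dataclasses import dataclass
from typing import List

# Allowed polarity labels in the SemEval-2014 and MAMS datasets.
POLARITIES = {"positive", "negative", "neutral"}

@dataclass
class AbsaInstance:
    """One ATSC example: a tokenized sentence S with M targets and labels."""
    words: List[str]          # w_1 ... w_n
    targets: List[List[str]]  # each target T^S_j is a span of m_j terms
    polarities: List[str]     # P^T_1 ... P^T_M, aligned with targets

    def __post_init__(self):
        assert len(self.targets) == len(self.polarities)
        assert all(p in POLARITIES for p in self.polarities)

# Example in the spirit of Figure 1: two aspect terms, opposite polarities.
ex = AbsaInstance(
    words="the food was great but the service was slow".split(),
    targets=[["food"], ["service"]],
    polarities=["positive", "negative"],
)
```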

B. ATTENTION MASK NETWORK FOR ABSA
This section introduces our proposed Attention Mask Network for aspect-based sentiment analysis, with its framework shown in Figure 2. The network consists of a contextualized embedding layer, an attention layer, two attention mask layers, and a fully-connected layer.

1) Input and contextualized embedding layer
Inputs of our model include a sentence and an aspect. When using BERT to embed an aspect and its contextual information, a special classification mark "[CLS]" is added at the beginning of the input sequence, and a separator "[SEP]" is placed at the end. In order to capture the relationship between the aspect and its context, both are encoded in the same way. When using the RoBERTa pre-trained language model, the inputs are processed in the same manner with an identical input format; the only difference is that the "<s>" separator replaces "[CLS]", and "</s>" takes the place of "[SEP]".
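As a minimal sketch of this input layout (in practice the tokenizer of each model inserts the special tokens itself; the helper name here is our own), the sentence-aspect pair can be assembled as follows:

```python
def build_input(sentence: str, aspect: str, model_type: str = "bert") -> str:
    """Assemble the sentence-pair input string described above.

    BERT uses [CLS]/[SEP]; RoBERTa uses <s>/</s> with the same layout.
    """
    if model_type == "bert":
        cls, sep = "[CLS]", "[SEP]"
    elif model_type == "roberta":
        cls, sep = "<s>", "</s>"
    else:
        raise ValueError(f"unknown model_type: {model_type}")
    return f"{cls} {sentence} {sep} {aspect} {sep}"

s_bert = build_input("the salad was good", "salad")
s_roberta = build_input("the salad was good", "salad", "roberta")
```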

2) Attention layer
The attention layer takes the output of the contextualized embedding layer as input. In this layer, we calculate the attention [22] [49] weights between aspect vectors and sentence vectors, and then apply one of the two mask mechanisms, attention mask weight or attention mask word, to produce the final attention weight vector. Ideally, each aspect keeps attention only from its most relevant context after the mask.
Let x^s = (x^s_1, ..., x^s_n) be the sentence contextualized embedding and x^a = (x^a_1, ..., x^a_n) the aspect contextualized embedding, where x_i ∈ R^{d_x}. A new sequence z = (z_1, ..., z_n) of the same length is calculated as the attention output, where z_i ∈ R^{d_z}. A compatibility function e_ij measures the relatedness or similarity of two input elements; we choose the scaled dot product for efficient calculation:

e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}}

Each weight coefficient α_ij is then calculated with the softmax function:

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n} \exp(e_{ik})}

Each output element z_i is the weighted sum of the linearly transformed input elements:

z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V)

Through these linear transformations of the input, relevant contextual information for each aspect is extracted. W^Q, W^K, W^V ∈ R^{d_x × d_z} are attention parameter matrices, tuned separately for each layer and attention head.
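The three equations above can be traced in a minimal pure-Python sketch (not the paper's implementation, which uses PyTorch; for clarity the W^Q, W^K, W^V projections are assumed to have been applied to the inputs already):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_attention(queries, keys, values):
    """Scaled dot-product attention per the equations above.

    queries: aspect vectors; keys/values: sentence vectors (lists of lists).
    Returns the outputs z and the weight rows alpha."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # e_ij = (q . k_j) / sqrt(d)
        e = [sum(qt * kt for qt, kt in zip(q, k)) / math.sqrt(d) for k in keys]
        alpha = softmax(e)                      # α_ij
        # z_i = Σ_j α_ij v_j
        z = [sum(a * v[t] for a, v in zip(alpha, values))
             for t in range(len(values[0]))]
        outputs.append(z)
        weights.append(alpha)
    return outputs, weights

# Toy example: one aspect query attending over two context positions.
z, alpha = scaled_dot_attention([[1.0, 0.0]],
                                [[1.0, 0.0], [0.0, 1.0]],
                                [[1.0, 0.0], [0.0, 1.0]])
```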

3) Attention mask weight
To retain only the parts of the sentence with a significant attention score (high relevance) for the aspect, we introduce a parameter γ as the threshold ratio of the mask. It is multiplied by the maximum of the attention weights in the sentence, and the result is used as the threshold. Attention values below the threshold are masked to zero, while the other values remain unchanged:

\alpha'_{ij} = \begin{cases} \alpha_{ij}, & \alpha_{ij} \ge \gamma \cdot max_w \\ 0, & \text{otherwise} \end{cases}

where γ is the configurable parameter of the AM-Weight model, max_w is the maximal attention score in the sentence, and α_ij is the original attention score.
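The AM-Weight rule can be sketched in a few lines (the function name is ours; weights are a single aspect's attention row):

```python
def am_weight_mask(alpha, gamma):
    """AM-Weight: zero out attention scores below gamma * max score.

    alpha: attention weights of one aspect over the sentence; gamma in (0, 1).
    """
    threshold = gamma * max(alpha)
    return [a if a >= threshold else 0.0 for a in alpha]

# With gamma=0.5 the threshold is 0.25: scores 0.05 and 0.15 are masked.
masked = am_weight_mask([0.50, 0.30, 0.05, 0.15], gamma=0.5)
```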

4) Attention mask word
When using the other attention mask mechanism, a more intuitive method is adopted. β is the only parameter of this method, giving the percentage of words to keep from the original sentence. All n attention weights in a sentence are sorted in descending order,

sorted_\alpha_i = (\alpha_{i r_1}, \alpha_{i r_2}, ..., \alpha_{i r_{\beta n}}, ..., \alpha_{i r_n})

in which the subscript r_j gives the rank of each weight in the sorted list. The top βn words with the highest scores are kept, and the attention weights of the remaining lower-scored words are set to zero.
Words with zero attention weights are removed from the sentence, and the rest with non-zero attention are kept in their original order. They are then fed into the sentiment classification layer.
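A minimal sketch of the AM-Word selection (our own helper; whether βn is rounded up or down is an assumption here, we round up so at least one word survives):

```python
import math

def am_word_mask(alpha, beta):
    """AM-Word: keep only the ceil(beta * n) highest-weighted words.

    Returns the indices of kept words in their original sentence order."""
    n = len(alpha)
    k = max(1, math.ceil(beta * n))
    top = sorted(range(n), key=lambda i: alpha[i], reverse=True)[:k]
    return sorted(top)  # restore original word order

# beta=0.4 over 5 words keeps the 2 highest-weighted positions.
kept = am_word_mask([0.05, 0.40, 0.10, 0.30, 0.15], beta=0.4)
```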

5) Aspect sentiment classification
If the attention mask weight mechanism is used, the weighted sum of masked attention and original term vectors is used as the attention output, which is fed to the final aspect sentiment classification. For the attention mask word mechanism, we only keep the remaining βn words for contextualized embedding using BERT or RoBERTa, and perform aspect sentiment classification with their representation.
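For the AM-Weight path, the weighted-sum pooling followed by the fully-connected layer and softmax can be sketched as follows (the classifier parameters W and b here are hypothetical stand-ins, not the paper's trained weights):

```python
import math

def classify(masked_alpha, vectors, W, b):
    """Weighted-sum pooling under the masked attention, then a linear
    layer and softmax over the polarity labels.

    W is a (num_labels x d) weight matrix and b a bias vector."""
    d = len(vectors[0])
    # Pool term vectors with the masked attention weights.
    pooled = [sum(a * v[t] for a, v in zip(masked_alpha, vectors))
              for t in range(d)]
    # Fully-connected layer: logits = W . pooled + b.
    logits = [sum(w * p for w, p in zip(row, pooled)) + bi
              for row, bi in zip(W, b)]
    # Softmax over the label logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Two context words (2-dim vectors), three polarity labels;
# the second word's attention was masked to zero.
probs = classify([0.8, 0.0],
                 [[1.0, 0.0], [0.0, 1.0]],
                 [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]],
                 [0.0, 0.0, 0.0])
```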

6) Loss
The cross-entropy loss is used to calculate the disagreement between the predicted label and the true label.
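For a single instance with one-hot gold label, the cross-entropy reduces to the negative log-likelihood of the gold polarity, as in this small sketch:

```python
import math

def cross_entropy(probs, gold_index):
    """Negative log-likelihood of the gold polarity label under the
    predicted probability distribution."""
    return -math.log(probs[gold_index])

# A confident correct prediction costs little; a 50/50 guess costs log 2.
low = cross_entropy([0.9, 0.05, 0.05], 0)
high = cross_entropy([0.5, 0.25, 0.25], 0)
```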

IV. EXPERIMENTS
This section introduces the task definition, datasets, evaluation metrics, baseline models and their performance comparison.

A. TASK DEFINITION
Experiments include two main subtasks of ABSA: Aspect-Term Sentiment Analysis (ATSA) and Aspect-Category Sentiment Analysis (ACSA).

B. DATASETS

TABLE 1. Statistics of the SemEval-2014 dataset.

Task  Dataset     Split  Positive  Negative  Neutral  Total
ATSA  Laptop      Train  987       866       460      2313
ATSA  Laptop      Test   341       128       169      638
ATSA  Restaurant  Train  2164      805       633      3602
ATSA  Restaurant  Test   728       196       196      1120
ACSA  Restaurant  Train  2179      839       500      3518
ACSA  Restaurant  Test   657       222       94       973

Table 1 shows the statistics of the SemEval-2014 dataset, and Table 2 includes the details of the MAMS dataset. Both tables show the number of training, validation and test samples in each dataset, together with the label breakdown. SemEval-2014 contains only training and test sets, while MAMS also has a validation set in each collection.

C. EXPERIMENT SETTING
All models are implemented with the PyTorch [21] deep learning framework. We use the pre-trained BERT-base-uncased and RoBERTa-base models in English for fine-tuning. The number of Transformer layers is fixed at 12, the hidden layer size is 768, and the number of self-attention heads is 12. The total number of parameters of each pre-trained model is about 110M. We use the Adam [48] optimizer for model tuning, and the other hyper-parameters of the experiment are shown in Table 3.

D. BASELINE METHODS
In order to verify the performance of our attention mask mechanism, the two variants (AM-Weight and AM-Word) are applied over the base BERT and RoBERTa models and compared to other deep learning methods. Many of these competitors are optimized versions of BERT with more complex network structures. Evaluation metrics include overall classification accuracy and Macro-F1 score, and models are compared on both the ATSA and ACSA tasks.

TD-LSTM [14] contains two LSTM networks, which model the context on each side of the target word, respectively. The hidden layer vectors from the two directions are concatenated, and the joint vector determines the final aspect term sentiment.

ATAE-LSTM [13] attaches aspect terms to the representation of each word in the sentence. A self-attention mechanism is applied to calculate the attention value of different words in the sentence for aspect sentiment classification.
MemNet [47] designs multiple attention networks to calculate the importance of each word in the context of the aspect information, and uses their calculation result to classify the aspect sentiment.
In GCAE [34], two convolutional networks are applied to capture the information relevant and irrelevant to aspect terms, and the Tanh-ReLU unit of the gate control mechanism is used to filter the appropriate information for sentiment classification.
BERT-SPC sends the hidden layer output of the pre-trained BERT network to a fully-connected neural network for aspect sentiment classification. The input sequence of the model is: "[CLS] + sentence sequence + [SEP] + aspect sequence + [SEP]".
BERT-PT [50] designs a reading comprehension problem based on aspect sentiment classification. BERT is used to form the context embedding of the aspect terms, and a machine reading comprehension task determines the sentiment polarity of each aspect.
In AEN-BERT [38], the encoder uses the attention mechanism to model the context and the target word. Then it introduces label smoothing and regularization, and combines the pre-trained BERT model for aspect sentiment classification.
BERT-pair-QA-M [40] constructs an auxiliary question for each aspect. Its BERT model is trained on the question answering problem, and its result can be applied for sentence pair classification.
TD-BERT [46] changes the normal use of [CLS] tag in BERT. Taking the BERT embedding from the target word (instead of [CLS]), its hidden layer output is used for aspect sentiment classification.
CapsNet-BERT [41] inputs the sentence and the aspect word into the embedding layer separately, and the average value of the aspect word embedding is calculated as the input of the next layer. The sentence information is fed into a bidirectional GRU through the residual connection to obtain the contextual representation. Then the final category capsules are computed with the primary capsules, aspect-aware normalized weights, and capsule-guided routing weights.
BERT-Attention binds a single attention layer to the BERT output, and a fully connected layer is added after that for sentiment classification.
RoBERTa uses the hidden layer output from a pre-trained RoBERTa model, and a fully-connected network then calculates the aspect sentiment classification. The input sequence of the model is: "<s> + sentence sequence + </s> + aspect sequence + </s>".

RoBERTa-Attention adds an attention layer to RoBERTa; the rest stays the same as the model above.

TABLE 4. Performance comparison for aspect-term sentiment analysis with classification accuracy and Macro-F1 of different models. "--" means that the metric is not reported. For our method and re-implementations from others, we run the program 5 times with random initialization, and show "mean±std" as its performance. The best performance in each column is bold-typed.

E. EXPERIMENT RESULTS
Table 4 shows the performance of various models on the ATSA task, and Table 5 contains the results for the ACSA task. Several conclusions can be drawn from the performance comparison in the result tables. First of all, pre-trained language models have a much higher baseline in comparison to the traditional static embedding-based methods. The clear improvement proves that the strong encoding capability of pre-trained language models has successfully learned the rich semantic information in diversified contexts. Secondly, most of the BERT-based methods do not outperform the simple BERT-SPC model, as they fail to identify the relevant context for each given aspect. Third, AM-Weight-BERT obtains an average of +1.00% accuracy and +1.12% Macro-F1 over BERT-SPC on 5 datasets, indicating that the attention mask mechanism has great potential when combined with a very competitive baseline that is hard to improve upon.

In addition, application of the attention mask mechanism over RoBERTa achieves new state-of-the-art performance on all five datasets. AM-Weight-RoBERTa has an average accuracy increase of 0.77% compared to RoBERTa-Attention over the five collections, while the average F1 score increases by 1.26%. The improvement of the AM-Word-RoBERTa model is 0.84% in average accuracy and 1.25% in average F1 score. The weight-based threshold achieves the best result in 2 of the 5 experiments, while limiting to top-N terms wins the other 3. Although it is hard to determine which one is the best choice for each dataset, their difference is negligible in many cases. RoBERTa is a stronger baseline than BERT, which makes it harder to beat. However, application of the attention mask mechanism, which filters out attention weights that are not strongly related to the specified aspect term or category, can still achieve significant performance improvement. That improvement does not rely strongly on the underlying model, which is a desirable attribute, as the attention mask mechanism can be applied to many existing networks.

A. PARAMETER SENSITIVITY
Two new parameters, γ and β, have been introduced in the masked attention mechanism, one for AM-Weight and the other for AM-Word. Experiments in the previous section show their performance improvement in sentiment analysis, but sensitivity to different values of each parameter also needs to be evaluated. As MAMS is a larger collection with an additional division into validation and test sets, performance reported on MAMS tends to be more stable. Considering the imbalanced label distribution in both collections, macro-F1 is a better measure of model performance. Therefore, sensitivity analysis of these parameters is carried out on the MAMS dataset, with macro-F1 as the indicator of overall performance. Figure 3 shows the macro-F1 value of AM-Weight-RoBERTa, with the mask ratio γ in the [0.1, 0.9] range. Figure 4 displays the performance of AM-Word-RoBERTa, where the mask percentage β takes the same range. From Figure 3, the optimal value of γ is between 0.5 and 0.6 for both the ATSA and ACSA tasks. The result shows that relevant parts for a given aspect tend to gain similar attention scores. While weight assignments over half of the maximum (γ=0.5) are meaningful, scores lower than that seem to bring in mostly noise. From Figure 4, 0.4-0.5 is the ideal range for β in the AM-Word model. Since each review in the MAMS dataset contains at least two aspects, it is safe to assume that at most half of the content is directly related to one aspect. Although the peak values are very close for the AM-Weight and AM-Word mechanisms, the curve is more stable in Figure 3. Overall, AM-Weight is the better choice when the optimal parameter is unknown, as it does not make any prior assumption about what percentage of the context is relevant.
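The sweep itself is a simple grid search over the mask ratio. The sketch below is purely illustrative (the evaluation function is a toy stand-in peaked near 0.5, not the actual MAMS macro-F1 curve):

```python
def sweep(param_values, evaluate):
    """Evaluate a score for each candidate mask ratio and return the
    best setting, mirroring the gamma/beta sensitivity analysis above."""
    scores = {v: evaluate(v) for v in param_values}
    best = max(scores, key=scores.get)
    return best, scores

# Toy stand-in for the evaluation: a curve peaked at 0.5.
best, scores = sweep([0.1 * i for i in range(1, 10)],
                     lambda g: 1.0 - abs(g - 0.5))
```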

B. QUALITY ANALYSIS
Experiments in the previous section and the sensitivity analysis show the average performance on the given datasets. Next we look into individual cases to see how the masked attention mechanism works. Figure 5 visualizes the output weights of the attention mask weight layer for different aspects of four sentences with the AM-Weight-RoBERTa model. Figure 6 shows the selected attention words of the attention mask word layer for different aspects of the same sentences under the AM-Word-RoBERTa model.

1) Case study
Figure 5 (a) and (c) show that the AM-Weight-RoBERTa model accurately identifies relevant information and filters out irrelevant parts with its attention mask mechanism on the ATSA and ACSA tasks. In (a), our model correctly finds the key descriptor "good" for the aspect term "salad" and "hardly ate" for "pasta". In (c), the key context for "staff" is identified as "waitress never asked us", and "needed" correctly matches the aspect category "food". After passing the attention mask weight layer, irrelevant information with a low attention score is set to 0. This removes contextual noise and helps correctly predict the sentiment polarity of a given aspect.
Figure 6 (a) and (c) show the same effect for the AM-Word-RoBERTa model, which accurately finds the relevant information and correctly predicts the sentiments of the aspects on the ATSA and ACSA tasks. In (a), our model finds the key phrases "service was attentive" for the aspect term "service" and "the waiter lost us" for the aspect term "waiter". In (c), the key phrases "waiter was very nice" for the aspect category "staff" and "expensive bottle of wine" for the aspect category "price" are correctly identified. The irrelevant part is removed after passing the attention mask word layer, so it does not enter the contextualized embedding layer of RoBERTa.
The AM-Weight mechanism filters by the attention weight. Figure 5 shows that effect by marking as relevant the parts with weight over γ times the maximum attention weight in the sentence. When the maximum attention weight is large, the attention weights are more focused, and AM-Weight has a better chance of limiting attention to the aspect-related words. AM-Word reserves a certain number of terms in the context, determined by the length of the sentence and β. That phenomenon can be observed in Figure 6: when the number of words in a sentence is large, more terms are kept, including both highly relevant parts and remotely related context of the aspect.

2) Error analysis
The masked attention mechanism is not always right. In Figure 5 (b), the sentiment polarity toward "manager" should be positive and "waiter" should be negative. However, AM-Weight-RoBERTa assigns neutral sentiments to both "manager" and "waiter". In Figure 5 (d), the correct sentiment toward "staff" is negative and "menu" should be neutral, but AM-Weight-RoBERTa also assigns a neutral label to "staff". Here the context is quite long for both aspect terms, and the mask erroneously filters out the parts that are relevant for the sentiment judgment of the aspect. With only the remaining context terms, it is hard for the downstream classifier to identify the correct sentiment polarity. In Figure 6 (b), AM-Word-RoBERTa assigns positive sentiment to "dinner" while the correct label should be negative, mainly because the keyword "not" is filtered out by the attention mask. Also, in Figure 6 (d), the sentiment toward "food" should be neutral, but AM-Word-RoBERTa gives a positive judgment. These sentences are from the MAMS dataset, with multiple aspects in each sentence. Given the small β parameter in the AM-Word model, the number of candidate terms is relatively small, so the model is likely to ignore aspect-related context terms in the highly competitive selection. The AM-Weight model is more tolerant to a certain extent, as it keeps all contextual terms as long as they have sufficient attention weights.
When choosing between the weight-based and number-based mask mechanisms, the length of the context is the main attribute to consider. The AM-Weight model is usually the better option, as it tends to keep more candidates whose attention weights are close to the maximum value. When the dataset mainly contains long sentences, AM-Word is better with its fixed ratio of candidates. Masked attention has exhibited its power in filtering out irrelevant information, but it also runs the risk of losing relevant parts.

VI. CONCLUSION AND FUTURE WORK
Pre-trained language models show clear advantages over static embedding-based representations, as they are able to represent more complex context. They have displayed excellent performance in multiple NLP tasks, including aspect-based sentiment analysis. However, early applications of BERT to ABSA tasks, often with complex custom-designed network structures, did not achieve significant improvements in comparison to a simple extension of the base BERT model. That demonstrates the representational power of pre-trained language models, which also sets a competitive baseline for further research.
In this paper, we propose two masked attention mechanisms, namely AM-Weight and AM-Word, and combine them with BERT-like language models for a better solution to the ABSA task. AM-Weight sets an attention weight threshold based on the maximum of the attention scores, and AM-Word decides how many terms to keep based on a preset percentage of the input. Both filter out the lower-scored parts that are assumed to be less relevant to the aspect in focus. By ignoring the part of the input judged irrelevant, a large proportion of input noise is removed, keeping the downstream model more focused and reducing calculation cost. Experiments on multiple ABSA datasets prove the effectiveness of the masked attention mechanism, as both variants show significant improvements over the baseline BERT and RoBERTa models on ATSA and ACSA. We also analyze the sensitivity of their parameters and recommend an optimal region for each. Results are examined with both success and failure cases, and a preference between the two options is provided depending on the length of the input: AM-Weight usually works better for shorter sentences, and AM-Word is safer when the input is long.
From the failure analysis, the masked attention mechanism does not work well when the input contains vague or semantically overlapping content, which is also one of the main challenges in sentiment analysis. Adaptive mask strategy selection and parameter setting will be our next focus, and incorporation of word order information will be another key issue in improving the semantic expressiveness of attention-based language models. Besides, the recently proposed prompt framework [52] for sentiment analysis converts sentiment classification into a cloze-like task, making full use of the masked language model's representational power. It is an innovative paradigm in the NLP domain, and we will try modeling the ABSA task within that framework for richer in-depth semantic representation.
Recent research notes that a Transformer network with only the self-attention mechanism quickly degenerates to a rank-1 matrix, and the main remedy is the introduction of shortcut connections. Without the diversity that shortcuts bring to single-head networks, the expressiveness of an attention network is quite limited [53]. In this paper, we combine Transformer-based language models with the masked attention mechanism, which also restricts the attention distribution while introducing numerous context term combinations. It may be another cure for the rank collapse problem of Transformer networks, which requires further analysis and verification.