A Multi-Task Dual-Encoder Framework for Aspect Sentiment Triplet Extraction

Aspect Sentiment Triplet Extraction (ASTE) is a complex and important task in aspect-based sentiment analysis, which aims to extract aspect-sentiment-opinion triplets from review sentences in order to acquire comprehensive sentiment information. Most existing methods solve the ASTE task with pipeline approaches or end-to-end sequence tagging approaches. However, pipeline approaches suffer from error accumulation in practical applications, while existing sequence tagging approaches ignore the feature information of the three elements themselves and, by treating every word as equally important, cannot model and infer the three elements effectively. To address this, a multi-task dual-encoder framework is proposed. First, a dual encoder is constructed to encode sentence information and semantic information, respectively, and fuse them. Then, the signs and constraints implied between word pairs are used to complete multi-task inference and triplet decoding. Meanwhile, two grid tagging methods and their corresponding inference strategies are designed for the two tasks. The auxiliary task serves as a regularizer for the main task, which improves the inference strategy's ability to make correct inferences for the main task and the robustness of the framework. Extensive experiments on two benchmark datasets show that the proposed framework is simple and effective, and significantly outperforms existing methods.

encodes the tokens of the review sentence, and the latter, inspired by previous work [36], encodes the part-of-speech features of the words in the sentence. Then, two grid tagging tasks are designed: one serves as the main task (called the word pair relationship tagging main task) and tags all word pair relations, including aspect terms, opinion terms, and sentiment polarity; the other serves as an auxiliary task (called the boundary prediction auxiliary task) and predicts the boundaries of each aspect term and opinion term (i.e., the start and end positions of aspect terms and opinion terms). At the same time, we define the extraction center (i.e., the beginning words of aspect terms and opinion terms) and the sentiment center (i.e., the end words of aspect terms) according to the positions of different words within the terms, and design effective tagging methods and inference strategies for the above two tasks. Since these two tasks are jointly addressed in training, the auxiliary task can be regarded as a regularizer of the main task that strengthens the extraction of aspect terms and opinion terms. The framework integrates the part-of-speech features and contextual features of the review sentence, and after joint multi-task training, only one pass of inference over the datasets is needed to extract the triplets from the final predictions.

Extensive experiments are conducted on two benchmark datasets and compared with existing approaches. Experimental results show that the framework proposed in this paper significantly outperforms existing approaches, and further analysis shows that each component we propose is both simple and effective. Specifically, the main contributions of this paper are as follows:

• A novel encoder is proposed to obtain the semantic features of the words themselves, providing additional information for the representation of each word, further enhancing the contextual information of the current sentence, and fully utilizing token-level and sentence-level semantic information.

• A multi-task grid tagging framework is proposed, which considers the relative positions of words within terms and assigns different importance to them. Based on this, two grid tagging methods and inference strategies are proposed to further enhance the framework's ability to extract triplets.

• Extensive experimental results show that the proposed simple framework achieves significantly better extraction results than existing complex approaches on two benchmark datasets.

The ABSA task consists of three basic single-extraction subtasks: ATE (Aspect Term Extraction), OTE (Opinion Term Extraction), and ASC (Aspect Sentiment Classification). Among them, the purpose of the ATE task is to extract aspect terms from sentences, which is usually regarded as a sequence labelling task. [4] employed two types of pre-trained embeddings (general-purpose embeddings and domain-specific embeddings) to represent sentences and then used a simple convolutional neural network to achieve good results.
[5] formalized the ATE task as a sequence-to-sequence (Seq2Seq) learning task and introduced gated unit networks and position-aware attention to improve the model's ability to extract aspect terms. The OTE task is generally regarded as an auxiliary task of the ATE task, and its purpose is to extract the opinion terms of a given aspect term. To this end, [7] proposed to incorporate the syntactic structure of sentences and a syntax-based opinion possibility score into the OTE task. A span-level approach, which considers the interactions between aspect term and opinion term spans, was also adopted; this approach uses the semantics of the entire span to predict sentiment polarity. Although these approaches have achieved good performance on the ASTE task, they do not incorporate the feature information of the sentences themselves, and they struggle to extract aspect terms and opinion terms composed of multiple words. By contrast, the framework we propose adopts two improved grid tagging schemes that are more sensitive to aspect terms and opinion terms composed of multiple words; it can also extract complete triplets in one pass and is unaffected by the relative positions of aspect terms and opinion terms.

In this section, the ASTE task is first defined. Next, our framework, consisting of a dual encoder and a multi-task framework, is described. Subsequently, the inference strategies and the triplet decoding algorithm are introduced. Finally, how to train the framework is demonstrated.

Inspired by successful practice on many NLP tasks, we use BERT as our backbone model to encode the context information. In order to better understand the structures used later in this article, we briefly introduce BERT before introducing the framework.

Recall that BERT is a language model based on a multi-layer bidirectional Transformer [37] proposed by Google. The pre-trained BERT can generate word vectors for a sequence, which can be used as high-quality input for downstream tasks. Specifically, we transform the sentence into "[CLS] + sentence + [SEP]" to represent the entire input, where the special tokens [CLS] and [SEP] mark the beginning of the sequence and the segment boundary, respectively. The processed sequence is then fed into BERT for context encoding. First, each token is converted into a vector by summing its TOKEN EMBEDDING, SEGMENT EMBEDDING, and POSITION EMBEDDING. After that, the vector sequence is fed into a stack of Transformer layers to obtain the encoded contextual information. We use the hidden-layer output of the last Transformer block as the context representation. For ease of understanding, the hidden representation in the subsequent text refers to the representation of the words in each sentence and does not contain the special tokens. The details of the BERT architecture used in this study are explained in Section IV-B.

In the decoding stage, the grid of the word pair relationship tagging main task can be decoded, which extracts complete triplets from the input sentence in one pass in an end-to-end fashion.

Finally, the two feature representations are added for feature fusion, yielding the representation h_i of each word, the representation H of the whole review sentence, and the representation r_ij of each word pair (w_i, w_j).

To focus the framework's attention on the influential feature representations, we propose word pair relationship tagging as the main task and boundary prediction as an auxiliary task. The schematic diagram of multi-task training is shown in Figure 3. The two tasks jointly train the underlying parameters (if the inference stage is included, the underlying parameters are still shared by the two tasks during inference).
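As a minimal sketch of this overall structure (not the paper's code: the fusion equations were lost in extraction, so the additive fusion, the concatenated pair representation, and all class, method, and parameter names below are assumptions), the two encoders, the fused representations h_i and r_ij, and the two task heads could be wired as follows:

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class DualEncoderMultiTask(nn.Module):
        # Illustrative skeleton only: two BERT encoders whose outputs are added
        # (feature fusion), feeding a ten-tag word-pair grid head (main task)
        # and a boundary head over the diagonal (auxiliary task).
        def __init__(self, num_pair_tags=10, num_boundary_tags=3, hidden=768):
            super().__init__()
            self.sentence_encoder = BertModel.from_pretrained("bert-base-uncased")
            self.semantic_encoder = BertModel.from_pretrained("bert-base-uncased")
            self.pair_head = nn.Linear(2 * hidden, num_pair_tags)
            self.boundary_head = nn.Linear(2 * hidden, num_boundary_tags)

        def forward(self, token_ids, semantic_ids, attention_mask):
            # Contextual features of the review sentence tokens.
            h_tok = self.sentence_encoder(token_ids, attention_mask=attention_mask).last_hidden_state
            # Semantic (part-of-speech-derived) features for the same positions.
            h_sem = self.semantic_encoder(semantic_ids, attention_mask=attention_mask).last_hidden_state
            h = h_tok + h_sem                      # fused word representation h_i
            n = h.size(1)
            # Word-pair representation r_ij: concatenation of h_i and h_j.
            r = torch.cat([h.unsqueeze(2).expand(-1, -1, n, -1),
                           h.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
            return self.pair_head(r), self.boundary_head(r)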

The purpose of the main task is to tag aspect terms, opinion terms, and the sentiment polarity corresponding to the aspect terms. Reference [23] used six tags G = {N, A, O, NEG, NEU, POS} to represent the relation of each word pair (w_i, w_j) in a sentence. In contrast, this study proposes a ten-tag tagging method, which uses unified tags with position information, G = {N, A-B, A-I, A-A, O-B, O-I, O-O, NEG, NEU, POS}, to represent the relationship of each word pair (w_i, w_j) in the sentence, where -B and -I represent the relative position of a word within a term, indicating the beginning or an inner part of the term, and -A and -O are used to detect whether a word pair formed by two different words belongs to the same aspect term or opinion term, respectively. The ten-tag tagging method helps our model infer more accurately and extract the final triplets (the inference strategies are described in Section III-C3). The specific meanings of these tags are listed in Table 2. Figure 3 (left) shows an example of the main task tags for the sentence in Figure 1. The main task uses an upper triangular grid to accelerate the tagging of word pair relationships.

Introducing an auxiliary task as a regularization term can improve the sensitivity and discriminative ability of our framework with respect to aspect terms and opinion terms composed of multiple words.

It can also impose additional penalties on mislabeled aspect terms and opinion terms, thus prompting the main task to pay more attention to mislabeled word pairs. The auxiliary task also adopts a grid tagging scheme and uses the unified tags Q = {N, A, O} to ascertain whether a word pair (w_i, w_j) marks the boundary of an aspect term or an opinion term. In this task, the spans of aspect terms and opinion terms do not overlap, so we only use a single A or O to mark the boundaries of an aspect term or an opinion term. The specific meanings of these tags are listed in Table 3. Figure 3 (right) shows an example of the auxiliary task tags for the sentence in Figure 1. The boundary prediction auxiliary task only fills the diagonal of the grid.

Herein, the heuristic inference strategies adopted by the framework are introduced. As mentioned in Section III-C2, the ten-tag tagging method is introduced to help the framework infer. For an aspect term, no matter how many words it consists of, the first word pair (w_i, w_i) must be predicted as A-B to be correct. Therefore, we regard the beginning of a term as the extraction center and assign it an important position. In addition, if the predicted tag of the word pair (w_{i+1}, w_{i+1}) is not A-I, the aspect term is composed of only one word. If the predicted tag of the word pair (w_{i+1}, w_{i+1}) is A-I, and the tag of the cross-grid word pair (w_i, w_{i+1}) is predicted as A-A, the continuous span from word pair (w_i, w_i) to word pair (w_{i+1}, w_{i+1}) can be determined to be an aspect term.
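As a minimal sketch of this diagonal rule (a hypothetical helper, assuming the grid arrives as a dict mapping word-pair index tuples to tag strings; not the paper's code):

    def infer_aspect_spans(tag, n):
        # Illustrative decoding of the rule above: a span opens at A-B and is
        # extended while the next diagonal cell is A-I and the cross cell is A-A.
        spans, i = [], 0
        while i < n:
            if tag.get((i, i)) == "A-B":        # extraction center found
                end = i
                while (end + 1 < n and tag.get((end + 1, end + 1)) == "A-I"
                       and tag.get((end, end + 1)) == "A-A"):
                    end += 1
                spans.append((i, end))          # (start, end) of one aspect term
                i = end + 1
            else:
                i += 1
        return spans

The same rule applies to opinion terms with O-B, O-I, and O-O in place of the aspect tags.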

As shown in Figure 3 (left), "cake with truffles" is an aspect term consisting of three words, and the corresponding pairs (cake, cake), (with, with), and (truffles, truffles) on the main diagonal are marked as A-B or A-I. Finally, the last round of predictions p^L_{ij} can be used to extract the triplets according to Algorithm 1.
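For concreteness, the relevant cells of such a grid might look as follows; this is a hypothetical rendering, not a copy of Figure 3, and the sentiment cell in particular is an assumption:

    # Hypothetical upper-triangular grid cells for an opinion term "try" and
    # a three-word aspect term "cake with truffles" (keys: (row word, column word)).
    example_cells = {
        ("try", "try"): "O-B",              # single-word opinion term
        ("cake", "cake"): "A-B",            # extraction center: aspect term begins
        ("with", "with"): "A-I",            # inner words of the same aspect term
        ("truffles", "truffles"): "A-I",
        ("cake", "with"): "A-A",            # cross cells: both words in one term
        ("with", "truffles"): "A-A",
        # Assumption: a sentiment tag links the opinion term to the sentiment
        # center (the end word of the aspect term); the exact cell may differ.
        ("try", "truffles"): "POS",
    }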

For the auxiliary task, there are also some latent signs and constraints that help the framework make inferences. When the first A tag is found while scanning the main diagonal from top to bottom, we continue to search onward. If another A tag is encountered, the first and second A tags are considered the start and end positions of this aspect term, respectively. If an O tag is encountered instead, the first A tag alone is considered the boundary of this aspect term; that is, the aspect term consists of a single word. Similarly, when the first tag found is an O tag, the inference proceeds as outlined above. In addition, for aspect terms and opinion terms composed of multiple words, an A or O tag encountered singly is always the start of the aspect or opinion term, an A or O tag encountered as the second of a pair is always the end of the aspect term or opinion term, and it always pairs with the nearest preceding A or O tag to form an aspect term span or an opinion term span.
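A minimal sketch of this diagonal boundary decoding, assuming the auxiliary predictions arrive as a list of per-word tags (hypothetical helper, not the paper's code):

    def infer_boundaries(diag_tags):
        # diag_tags[i] in {"N", "A", "O"}; returns (start, end, kind) spans.
        spans, open_pos, open_kind = [], None, None
        for i, t in enumerate(diag_tags):
            if t == "N":
                continue
            if open_kind == t:                # second tag of the same kind closes the span
                spans.append((open_pos, i, t))
                open_pos, open_kind = None, None
            else:
                if open_kind is not None:     # a different tag: the open term was single-word
                    spans.append((open_pos, open_pos, open_kind))
                open_pos, open_kind = i, t    # open a new span
        if open_kind is not None:             # trailing single-word term
            spans.append((open_pos, open_pos, open_kind))
        return spans

    # E.g., for the Figure 3 (right) pattern O, N, A, N, A:
    # infer_boundaries(["O", "N", "A", "N", "A"]) -> [(0, 0, "O"), (2, 4, "A")]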

As shown in Figure 3 (right), when inferring from top to bottom, the O tag of "try" is encountered first, and then the A tag of "cake" is encountered, which indicates that "try" is an opinion term composed of a single word. Then, when the A tag of "truffles" is encountered, because the previous "cake" also carries an A tag, "cake" and "truffles" are the boundaries of this aspect term. The prediction formula of the auxiliary task is computed in the same way as that of the main task. Because the auxiliary task only uses the main diagonal of the grid, in the auxiliary task p^{t-1}_{i,:} = (p^{t-1}_{i,i}, p^{t-1}_{i,i}).

Compared with other complex frameworks, our framework requires only one pass of inference to achieve superior performance.

Similarly, for the auxiliary task, the loss function is defined in the same form. The hyperparameter α can be used to adjust the influence of the auxiliary task's loss on the main task. The choice of the hyperparameter α is discussed in Section V-D.
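The loss equations themselves did not survive extraction; the following is a minimal reconstruction in LaTeX, assuming standard cross-entropy over the grid tags and a weighted sum of the two losses (the paper's exact form may differ):

    \mathcal{L}_{\mathrm{main}} = -\sum_{i=1}^{n}\sum_{j=i}^{n}\sum_{g \in G} \mathbb{1}[y_{ij} = g] \log p_{ij}(g)

    \mathcal{L}_{\mathrm{aux}} = -\sum_{i=1}^{n}\sum_{q \in Q} \mathbb{1}[y_{ii} = q] \log p_{ii}(q)

    \mathcal{L} = \mathcal{L}_{\mathrm{main}} + \alpha\,\mathcal{L}_{\mathrm{aux}}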

Algorithm 1 Decoding Algorithm for the ASTE Main Task
Input: The predicted results P of a sentence X, where P(w_i, w_j) denotes the predicted tag of the word pair (w_i, w_j).
Output: An aspect sentiment triplet set T_set for the given sentence.
1: Initialize the aspect term set A_set, the opinion term set O_set, and the aspect sentiment triplet set T_set with ∅.
2: while a span left index l ≤ n and right index r ≤ n do
3:   if P(w_i, w_i) = A-B for l ≤ i ≤ r, while P(w_{i+1}, w_{i+1}) ≠ A-I then
4:     Regard the word w_i as an aspect term a, A_set ← A_set ∪ {a}
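Algorithm 1 is truncated above; as a hedged sketch, the following Python continues the decoding in the way Section III-C3's rules suggest, reusing the span rule from the infer_aspect_spans sketch earlier (the function name, the grid representation, and the sentiment-cell lookup are assumptions, not the paper's exact algorithm):

    def decode_triplets(P, n):
        # Hypothetical completion of Algorithm 1. P[(i, j)] is the predicted tag
        # of the word pair (w_i, w_j) with i <= j; n is the sentence length.
        aspects, opinions, triplets = [], [], []
        # 1. Collect aspect/opinion spans from the diagonal using the
        #    begin/inner/cross rules of Section III-C3.
        for spans, begin, inner, cross in ((aspects, "A-B", "A-I", "A-A"),
                                           (opinions, "O-B", "O-I", "O-O")):
            i = 0
            while i < n:
                if P.get((i, i)) == begin:
                    end = i
                    while (end + 1 < n and P.get((end + 1, end + 1)) == inner
                           and P.get((end, end + 1)) == cross):
                        end += 1
                    spans.append((i, end))
                    i = end + 1
                else:
                    i += 1
        # 2. Pair aspect and opinion spans whose connecting cells carry a
        #    sentiment tag (assumed lookup; the paper may use a specific cell).
        for al, ar in aspects:
            for ol, outer in opinions:
                cells = {P.get((min(i, j), max(i, j)))
                         for i in range(al, ar + 1) for j in range(ol, outer + 1)}
                for s in ("POS", "NEU", "NEG"):
                    if s in cells:
                        triplets.append(((al, ar), (ol, outer), s))
                        break
        return triplets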

In this section, the details of the experiments are introduced first, including the datasets, experimental settings, and evaluation method. The compared methods are then briefly introduced, and finally the experimental results are presented and discussed.

Our framework is compared with the following approaches for the ASTE task:

• A pipeline approach used the opinion enhancement module proposed in [34] to determine the aspect terms, sentiment polarity, and opinion terms; the relationship classifier proposed in [21] was then adopted to conduct relationship matching.

• BMRC: [29] transformed the ASTE task into a multi-turn machine reading comprehension (MRC) task, designing three types of queries and a two-way MRC structure. One direction identifies aspect terms, opinion terms, and sentiment polarity in turn, and the other direction first identifies opinion terms and then aspect terms.

• GTS: [23] proposed a grid tagging scheme, which only needs a unified grid to handle the ASTE task in an end-to-end fashion.

• S³E²: [32] designed a graph-sequence dual representation and modelling paradigm for the ASTE task, which uses graphs to learn and represent the semantic and syntactic relationships between word pairs in sentences, and uses a graph neural network to encode them to extract triplets.

• EIN: [26] adopted two encoders to model interaction bidirectionally, and used a multi-layer sequence encoder for target-opinion detection and sentiment polarity classification simultaneously.

• ASTE-RL: Within a hierarchical reinforcement learning framework, [33] regarded aspect terms and opinion terms as parameters of sentiment expressions and considered the interactions between triplets, which improved efficiency and enabled the model to handle multiple triplets.

• UniASTE: [31] first used sequence tagging to predict the boundary information of opinion targets and opinion expressions, and then introduced a target-aware tagging scheme, taking each word in the sentence as a potential opinion target in turn.

The main results of the ASTE task on the ASTE-Data-V1 datasets indicate that the proposed framework achieves a higher prediction accuracy for the ASTE task than the baselines. In addition, the F1-score of the end-to-end approaches is generally higher than that of the pipeline approaches.

To estimate the effectiveness of the different modules in the proposed framework, an ablation study is conducted on the ASTE-Data-V1 datasets. The results of the ablation study are shown in Table 8.

w/o Label Name BERT encoder denotes removal of the Label Name BERT encoder from the framework. It can be seen that this simple feature-information fusion yields a significant information gain; the F1-score drops the most (1.91%) on the 15res dataset and the least (1.24%) on the 14res dataset, indicating that the Label Name BERT encoder is well suited to datasets with a small number of samples and has a greater impact on them. In addition, whereas [32] used a graph convolutional network (GCN) to learn features and improved the average F1-score by 1.19%, the Label Name BERT encoder uses a simpler method to improve the F1-score by 1.59%. w/o multi-task means that only a single-task framework is used; that is, only the main task is employed to train the framework. The resulting decline in F1-score is likewise a good indication of the contribution of the auxiliary task.

We compare the performance of our model with the previous work [23] under the above three settings and report the average F1-score over five runs with different random seeds in Table 9. The table shows that under every setting, the extraction performance of our model is higher than that of [23], and as the length of the terms increases, the performance gap grows increasingly larger, which indicates that our proposed inference strategies are effective and good at solving the triplet extraction task according to the importance of words at different positions within multi-word terms.

In this section, the hyperparameter α is set to four values, 0.001, 0.01, 0.1, and 1, and the effect of the auxiliary task on the performance of the main task is studied by adjusting α. The results are shown in Figure 5.

A value of the hyperparameter that is either too small or too large will harm the performance of the main task. From 0.001 to 0.01, the curves in both figures show an upward trend, which means that as the influence of the auxiliary task on the main task increases, the F1-score gradually increases. At values greater than 0.01, all of the experimental results except the 16res curve show a downward trend, indicating that the influence of the auxiliary task on the main task becomes negative. We infer that the reason 16res differs from the other datasets may be that aspect terms or opinion terms composed of multiple words account for a small proportion of the 16res dataset. Consequently, after inference, the predictions of the auxiliary task tend to be consistent with the predictions of the main task on the main diagonal, and the auxiliary task still has a positive influence within this range. Finally, after comprehensive analysis, the hyperparameter α of the auxiliary task is set to 0.01.

To further demonstrate the effectiveness of the proposed ten-tag tagging approach, the extraction results of the six-tag and ten-tag tagging approaches are compared on five examples. Of the five examples, the second and third are taken from the laptop domain, and the rest are taken from the restaurant domain. The results are shown in Table 10.

For the first example, the sentence is relatively simple, and the correct triplet can be accurately extracted using either the six-tag or the ten-tag tagging approach. For the second and third examples, the aspect term "delivery times" and the opinion term "not fix" are composed of multiple words.

It can be seen that the ten-tag tagging approach correctly extracts the corresponding triplets, while the six-tag tagging approach produces errors and only extracts "delivery" and "not", which proves that the feature enhancement method we use is effective and that our framework is good at using the enhanced features to find complete aspect terms and opinion terms. For the fourth example, the triplets cannot be identified by the six-tag tagging approach at all. Although our framework extracts all the correct triplets, it makes one error by extracting a surplus triplet (receiver, superlatives, positive). The reason for this error is that the spuriously extracted aspect term "receiver" incorporates the noun part-of-speech feature while ignoring the grammatical structure information, causing the framework to make an erroneous prediction. For the fifth example sentence, the aspect terms and opinion terms are each composed of a single word, but the sentence structure is very complex, and there is a many-to-one relationship between the aspect terms and the opinion terms. The six-tag tagging approach can only extract four triplets, missing the triplet (appetizers, delectable, positive). However, the ten-tag tagging approach finds all the triplets completely, indicating that our approach is good at coping with complex many-to-one relationships between aspect terms and opinion terms.

In this paper, a multi-task dual-encoder framework that incorporates semantic information is proposed to complete the ASTE task in an end-to-end fashion in one pass.