
Encoding Syntactic Information into Transformers for Aspect-Based Sentiment Triplet Extraction

Li Yuan, Jin Wang, Member, IEEE, Liang-Chih Yu, Member, IEEE, and Xuejie Zhang, Member, IEEE

Abstract-Aspect-based sentiment triplet extraction (ASTE) aims to extract triplets consisting of aspect terms and their associated opinion terms and sentiment polarities from sentences, a relatively new and challenging subtask of aspect-based sentiment analysis (ABSA). Previous studies have used either pipeline models or unified tagging schema models. These models ignore the syntactic relationships between the aspect and its corresponding opinion words, which leads them to mistakenly focus on syntactically unrelated words. One feasible option is to use a graph convolution network (GCN) to exploit syntactic information by propagating the representation from the opinion words to the aspect. However, such a method considers all syntactic dependencies to be of the same type and thus may still incorrectly associate unrelated words with the target aspect through the iterations of graph convolutional propagation. Herein, a syntax-aware transformer (SA-Transformer) is proposed to extend the GCN strategy by fully exploiting the dependency types of edges to block inappropriate propagation. The proposed approach can obtain different representations and weights even for edges with the same dependency type, according to the dependency types of their adjacent edges. Instead of using a GCN layer, we use an L-layer SA-Transformer to encode syntactic information into the word-pair representations to improve performance. Experimental results on four benchmark datasets show that the proposed model outperforms various previous models for ASTE.
Index Terms-Aspect sentiment triplet extraction, sentiment analysis, syntactic information, transformers.

I. INTRODUCTION
ASPECT-BASED sentiment analysis (ABSA) [1] aims to recognize the sentiment polarity and opinion of targeted aspects in a given sentence [1], [2], [3], which is a useful technique for various sentiment applications [4], [5], [6], [7], [8]. ABSA is composed of several related subtasks, such as aspect term extraction (ATE), opinion term extraction (OTE), and aspect sentiment classification (ASC). Here, ATE indicates what aspect is being discussed, ASC shows how the sentiment polarity impacts the aspect, and OTE explains why that polarity is expressed [9].
Previous works have attempted to either solve the above subtasks individually or solve two of the subtasks jointly, such as ATE and ASC [10], [11], [12], [13], [14], [15] or ATE and OTE [16], [17]. To further integrate the three subtasks, Peng et al. [9] pioneered a unified task, namely, aspect-based sentiment triplet extraction (ASTE), which aims to provide a complete analysis of a user-generated text by producing all triplets (aspect term, opinion term, and corresponding sentiment polarity) from sentences. Fig. 1 shows an example review. The ASTE task requires a model to generate three triplets: (staff, very courteous, Pos), (staff, great, Pos), and (food, terrible, Neg), where staff and food are aspect terms; very courteous, great, and terrible are corresponding opinion terms; and Pos and Neg denote their sentiment polarity.
Previous studies have typically accomplished ASTE tasks by using a two-stage pipeline approach with sequence labeling models [9]. This approach first identifies the aspect terms with their sentiment, as well as the opinion terms. The extracted aspect terms are then matched with each opinion term to determine their consistency. Unfortunately, the pipeline approach ignores the relationships between elements and is prone to error propagation. Alternatively, another viable option is to apply a multitask strategy to integrate both stages into a joint framework [18], [19], [20], [21], [22]. The main limitation of the joint approach is that it cannot efficiently handle scenarios in which a review contains multiple relational triplets that overlap with each other; e.g., in the previous example sentence, both opinion terms very courteous and great should be associated with the same aspect term, staff.
Several recent works have studied the overlapping triplet problem by applying a grid tagging scheme (GTS) [23], [24]. The ASTE task is thereby converted into predicting the relation tags of word pairs, as shown in the lower part of Fig. 1. The tags A and O denote that the two words in the pair belong to the same aspect term and the same opinion term, respectively; the tag N denotes no relation between the word pair; and Pos, Neg, and Neu are the sentiment labels. For example, the polarities between word pairs (staff, courteous) and (staff, great) are both positive. However, the equivalence classification between word pairs may lead to an inappropriate association between the aspect terms and opinion terms. For example, great could be simultaneously associated with both aspect terms staff and food.
To address the limitations of the above models, graph-based methods have been proposed to introduce syntactic dependencies to model the relationships between words [25], [26]. By parsing the text into a dependency tree, a special type of graph is constructed based on the adjacency matrix. Graph convolution networks (GCNs) can then propagate the representations through the edges from opinion words to the corresponding aspects. However, these models consider all syntactic dependencies to be of the same type and assign an equal weight to each edge. The inappropriate association of less important words may still occur through multiple iterations of graph convolution propagation. In the example shown in Fig. 1, the representation of courteous can be correctly propagated to staff through the path of edges courteous-acomp-was-nsubj-staff, but it can also be incorrectly propagated to food through courteous-acomp-was-conj-was-nsubj-food.
Dependency types are useful features to model word relationships from the syntactic aspect, and different dependency types should be assigned different weights. For instance, the dependency types nsubj and acomp indicate a subject-object relation, and increasing their weights can help accomplish correct propagation (e.g., from courteous to staff and from terrible to food). On the other hand, even the same dependency type may necessitate different weights. For instance, the example sentence in Fig. 1 contains two edges with conj. The conj between was and was should be assigned a lower weight to block inappropriate propagation (e.g., from courteous to food and from terrible to staff), but the conj between courteous and great should be assigned a higher weight to help propagation from great to staff.
Based on this notion, this study proposes a syntax-aware transformer (SA-Transformer) to incorporate the knowledge of dependency types into graph neural networks for triplet extraction. The proposed method extends graph neural networks in three aspects. First, it can distinguish not only between edges with different dependency types but also between those with the same dependency type to achieve more accurate graph propagation. This is accomplished by developing an adjacent edge attention (AEA) mechanism to learn the edge representation for each edge according to the dependency types of its adjacent edges. That is, edges that have adjacent edges with different dependency types may have different representations and weights. Second, the edge representations are encoded into contextual word representations to learn the syntactic and positional relationships between the words to enhance word pair representations. Third, given that a multiword aspect/opinion term (e.g., very courteous) is divided into multiple consecutive word pairs for prediction, this study devises an adjacency inference strategy to improve triplet extraction for multiword aspect/opinion terms. This strategy can iteratively predict the tag of each word pair according to the predicted results of its adjacent word pairs instead of predicting each word pair independently. The proposed SA-Transformer model is evaluated on four benchmark datasets. Experimental results show that the proposed method outperforms various previous models for ASTE.
The main contributions of this study are summarized as follows.
• We propose the SA-Transformer, which incorporates the knowledge of dependency types to extend graph neural networks for the ASTE task.
• We design the AEA mechanism that can learn different representations and weights for different edges, even for those with the same dependency type, thus achieving more accurate graph propagation.
• Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods for ASTE.

The rest of this paper is organized as follows. Section II briefly reviews existing methods for ASTE. Section III presents a detailed description of the proposed SA-Transformer model. Section IV summarizes the implementation details and experimental results. Conclusions are finally drawn in Section V.

II. RELATED WORKS
Previous ABSA works can be broadly divided into three independent extraction subtasks (ATE, OTE, and ASC), the joint pair extraction subtask, and the ASTE subtask.

TABLE I DIFFERENT SUBTASKS AND CORRESPONDING METHODS FOR ABSA
This section briefly reviews different methods for these subtasks, which are summarized in Table I.
A. Aspect Sentiment Classification

Recent methods for ASC tasks mostly use graph-based models [40], [41], [42], [43], [44], which encode syntactic information to block the inappropriate propagation of unrelated contextual information to the aspect. For example, Tian et al. [41] proposed a type-aware graph convolutional network to capture the syntactic relation between the context and the target aspect. Xiao et al. [44] presented a syntactic edge-enhanced network with interactive attention, which leverages the edge information of a dependency parsing tree to interactively learn the representations of aspect terms with context.

B. Joint Pair Extraction
Recently, many researchers have focused on designing effective models to jointly extract aspect terms and sentiment polarity [10], [11], [12], [13], [14], [15], [16], [17]. For example, Li et al. [11] designed a multigranularity alignment network to decrease the false alignment of features in ASC and ATE tasks. Li et al. [12] designed a two-layer stacked LSTM model in which the lower-layer network guides the upper-layer network to improve performance on ATE and ASC tasks. Hu et al. [13] proposed a span-based model that outperforms joint and collapsed models.

To efficiently align the features of aspect granularity and domains, Wang et al. [16] and Dai et al. [17] attempted to coextract both aspect and opinion terms. Wang et al. [16] proposed a coupled multilayer attention network that uses a couple of attentions in each layer to extract aspect and opinion terms. The multilayer structure can capture both direct and indirect relations between words to achieve more precise extraction. Dai et al. [17] developed a weakly supervised method to extract aspect and opinion terms. It first mined the extraction rules based on the dependencies between words and then used the mined rules to expand the training data for neural model training.

C. Triplet Extraction
Aspect sentiment triplet extraction aims to jointly extract aspect terms, opinion terms, and their corresponding sentiment polarity, presenting a greater challenge than the independent subtasks. Previous works can be separated into pipeline, multitask, and word-pair methods. Peng et al. [9] proposed a pipeline model for ASTE. It first extracted aspect terms, opinion terms, and sentiment polarities using the mutual influence between aspect and opinion terms and then employed a classifier to pair the extracted terms to obtain the final triplets. Peng et al. [9] also extended several joint pair extraction models [12], [16], [17] as pipeline models.

Several studies have proposed multitask frameworks to jointly extract triplets [18], [19], [20], [21], [22]. Zhang et al. [18] used a sequence tagging strategy to extract aspect and opinion terms and predicted sentiment polarities using a table filling method. Chen et al. [19] converted the ASTE task into a machine reading comprehension (MRC) task and proposed a bidirectional MRC framework to gather information useful for triplet extraction from both the aspect-to-opinion and opinion-to-aspect directions. Xu et al. [20] designed a span-level model that can capture the span-to-span interactions instead of word-to-word interactions between the aspects and opinions for ASTE. Dai et al. [21] presented a bidirectional sentiment-dependence detector with double embeddings to obtain better sentence representations and gather information from both the aspect-to-opinion and opinion-to-aspect directions. Zhang et al. [22] proposed a dual decoder with a span copy mechanism that can extract multiple and overlapped triplets based on multitype information.

Wu et al. [23] designed a grid tagging schema to formalize the ASTE task into a word-pair task where classifications are applied between word pairs. Moreover, Xu et al. [24] used a model with a position-aware tagging scheme to jointly extract triplets. However, these methods may associate unrelated opinion terms with the target aspect, even if they are syntactically irrelevant. To address this limitation, Chen et al. [25] proposed the S³E² model based on a GCN to learn dependency information. However, this model only considers the semantic information of syntactically adjacent contexts and ignores edge attributes. Zhao et al. [26] also developed a pointer-specific tagging method to integrate dependency information into a GCN for ASTE. A triplet alignment scheme was then proposed to extract triplets by aligning the corresponding positions of the aspect and opinion terms.

III. SYNTAX-AWARE TRANSFORMER NETWORK
Given an input sentence $X = \{x_1, x_2, \ldots, x_n\}$ with $n$ tokens, the goal is to extract a set of opinion triplets $\{(a, o, s)_\omega\}_{\omega=1}^{\Omega}$, where $(a, o, s)_\omega$ is the $\omega$-th opinion triplet, which consists of an aspect term $a_\omega = \{x_{l_a}, \ldots, x_{r_a}\}$ of length $r_a - l_a + 1$, an opinion term $o_\omega = \{x_{l_o}, \ldots, x_{r_o}\}$ of length $r_o - l_o + 1$, and the corresponding sentiment polarity $s \in \{\mathrm{Pos}, \mathrm{Neg}, \mathrm{Neu}\}$. Table II lists the notations used throughout the paper.
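For the running example in Fig. 1, the expected output of the task can be written as follows (an illustrative serialization only; the task definition does not prescribe a data format):

```python
# Gold triplets for "The staff was very courteous and great, but the food was terrible."
triplets = [
    ("staff", "very courteous", "Pos"),
    ("staff", "great", "Pos"),
    ("food", "terrible", "Neg"),
]
```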
Fig. 2(a) shows the overall architecture of the proposed method, which is composed of four parts: a context encoder, SA-Transformer, syntactic relative distance, and adjacent inference strategy. The context encoder is used to produce the contextual word representations for an input sentence. SA-Transformer then uses dependency parsing to obtain the dependency structure of the sentence and represents it using an adjacency matrix and a relationship matrix to record whether an edge exists between two words and the kind of dependency type, as shown in Fig. 2(b). Both matrices are used by the AEA to learn the edge representations (E) for each edge based on the dependency types of its adjacent edges, as shown in Fig. 2(c). The edge representations (E) are then added into the key (K) and value (V) of a scaled dot-product attention to be integrated into the contextual word representations. The syntax-enhanced representations of any two words can then be used to constitute a word-pair representation. In addition, the distance between the two words is calculated by their syntactic relative distance in a dependency tree and encoded into the word-pair representation as an extra feature. Finally, the adjacent inference strategy is used to iteratively predict the tag of each word pair from those of its adjacent word pairs. The details of each component are described as follows.

A. Context Encoder
To obtain the contextual word representations for each sentence, 300-dimensional GloVe [45] vectors are used as the initial word embeddings $\{w_1, w_2, \ldots, w_n\}$, where $w_i$ denotes the word vector of word $x_i$. A bidirectional LSTM (BiLSTM) model [46] is then used as a context encoder to produce the hidden representations of the word vectors, defined as

$\overrightarrow{h}_i, \overrightarrow{c}_i = \overrightarrow{\mathrm{LSTM}}(w_i, \overrightarrow{c}_{i-1}), \quad \overleftarrow{h}_i, \overleftarrow{c}_i = \overleftarrow{\mathrm{LSTM}}(w_i, \overleftarrow{c}_{i+1})$

where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ respectively denote the forward and backward hidden representations of $w_i$, $\overrightarrow{c}_i$ and $\overleftarrow{c}_i$ respectively denote the forward and backward LSTM unit states, and $d_h$ denotes the dimensionality of the hidden representations. The forward and backward hidden representations are then concatenated to comprise the final hidden representation, defined as

$h_i = [\overrightarrow{h}_i : \overleftarrow{h}_i]$

where $h_i \in \mathbb{R}^{d_h}$ denotes the final hidden representation of $w_i$, and $[:]$ denotes a concatenation operation.
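A minimal sketch of this encoder is given below, assuming a PyTorch implementation (the framework, the d_h/2 hidden units per direction, and all variable names are assumptions for illustration, not taken from the released code):

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """BiLSTM context encoder: GloVe embeddings -> contextual representations h_i."""
    def __init__(self, embedding_matrix, d_h=200):
        super().__init__()
        # 300-dimensional GloVe embeddings, fine-tuned during training (illustrative choice).
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False)
        # Each direction outputs d_h // 2 units so that the concatenation of the
        # forward and backward states has dimensionality d_h.
        self.bilstm = nn.LSTM(input_size=300, hidden_size=d_h // 2,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        w = self.embedding(token_ids)   # (batch, n, 300)
        h, _ = self.bilstm(w)           # (batch, n, d_h): [forward : backward]
        return h
```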

B. SA-Transformer
Once we obtain the contextual hidden representations of each word, SA-Transformer encodes syntactic dependency information into them in three steps: representation of the dependency structure, learning of edge representations with dependency types using AEA, and injection of edge representations into contextual representations. The details of each step are described as follows.
Dependency Structure Representation: A given sentence is first parsed as a dependency tree. Each dependency is represented as a tuple $(x_i, x_j, r_{i,j})$, where $r_{i,j}$ denotes the dependency type between the words $x_i$ and $x_j$. The dependency structure can then be represented as an adjacency matrix $A$ and a relationship matrix $R$, where $A = \{a_{i,j} \in \{0, 1\}\} \in \mathbb{R}^{n \times n}$ records whether an edge exists between two words, and $R = \{r_{i,j}\} \in \mathbb{R}^{n \times n}$ records the dependency type of each edge. Both $A$ and $R$ are symmetric matrices.
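For illustration, the two matrices can be built from a spaCy parse roughly as follows (a sketch assuming undirected edges plus a self edge per word, as in Fig. 3; rel2id is a hypothetical mapping from dependency labels to indices):

```python
import numpy as np
import spacy

nlp = spacy.load("en_core_web_trf")

def dependency_matrices(sentence, rel2id):
    """Build the adjacency matrix A and the relationship matrix R from a dependency parse."""
    doc = nlp(sentence)
    n = len(doc)
    A = np.zeros((n, n), dtype=np.int64)   # a_ij = 1 if an edge exists between x_i and x_j
    R = np.zeros((n, n), dtype=np.int64)   # r_ij = id of the dependency type of that edge
    for token in doc:
        i, j = token.i, token.head.i
        if i == j:                          # the root points to itself in spaCy; skip it here
            continue
        A[i, j] = A[j, i] = 1               # symmetric (undirected) edges
        R[i, j] = R[j, i] = rel2id.get(token.dep_, rel2id["<unk>"])
    for i in range(n):                      # "self" edge for every word
        A[i, i] = 1
        R[i, i] = rel2id["self"]
    return A, R
```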
Adjacent Edge Attention (AEA): Both the adjacency matrix $A = \{a_{i,j}\} \in \mathbb{R}^{n \times n}$ and the relationship matrix $R = \{r_{i,j}\} \in \mathbb{R}^{n \times n}$ are taken as input to learn the edge representations $E = \{e_{i,j}\} \in \mathbb{R}^{n \times n \times d}$, where $e_{i,j} \in \mathbb{R}^{d}$ denotes the representation of the edge between words $x_i$ and $x_j$, and $d$ is the dimensionality of the edge representations. To accomplish this goal, an embedding layer is first applied to map $R = \{r_{i,j}\} \in \mathbb{R}^{n \times n}$ to obtain the initial edge embeddings $Z = \{z_{i,j}\} \in \mathbb{R}^{n \times n \times d_z}$, where $z_{i,j} \in \mathbb{R}^{d_z}$ denotes the initial edge embedding of $e_{i,j} \in \mathbb{R}^{d}$, and $d_z$ is the dimensionality of the initial edge representations. Each edge representation $e_{i,j}$ is determined based on the dependency types of the edges adjacent to $x_i$ and $x_j$. In the example shown in Fig. 3, to learn the edge representation $e_{3,10}$ = conj, the AEA first looks up the matrices $A = \{a_{i,j}\}$ and $R = \{r_{i,j}\}$ to identify the edge representations adjacent to $x_3$ = was, i.e., $e_3 = \{e_{3,2}, e_{3,3}, e_{3,5}, e_{3,8}, e_{3,10}\}$ with dependency types {nsubj, self, acomp, cc, conj}, and those adjacent to $x_{10}$ = was, i.e., $e_{10} = \{e_{3,10}, e_{9,10}, e_{10,10}, e_{11,10}\}$ with dependency types {conj, nsubj, self, acomp}. AEA then takes the initial edge embeddings $z_{3,10}$, $z_3$, and $z_{10}$ as inputs, uses scaled dot-product attention to learn the hidden edge representation $e^3_{3,10}$ based on $e_3$ and $e^{10}_{3,10}$ based on $e_{10}$, and finally uses a gate function to combine $e^3_{3,10}$ and $e^{10}_{3,10}$ as the final representation of $e_{3,10}$. By considering the dependency types of the adjacent edges, even two edges with the same dependency type can have different representations and weights.
The formal description of AEA is presented as follows. Let $z_{i,j}$ and $z_i$ be the initial edge embeddings of $e_{i,j}$ and $e_i$, respectively (i.e., the adjacent edge representations of $x_i$); thus, the AEA learns the hidden representation $e^i_{i,j}$ as

$e^i_{i,j} = W_z \left[ U^{i,1}_{i,j} : U^{i,2}_{i,j} : \cdots : U^{i,M}_{i,j} \right]$

where $e^i_{i,j} \in \mathbb{R}^{d_z}$ denotes the hidden representation of $e_{i,j}$ learned from its adjacent edge representations $e_i$, $W_z \in \mathbb{R}^{d_z \times d_z}$ is a trainable weight matrix, and $U^{i,m}_{i,j}$ denotes the edge representation learned by the $m$-th attention head of scaled dot-product attention, defined as

$U^{i,m}_{i,j} = \mathrm{softmax}\!\left( \frac{Q^m_{i,j} (K^m_i)^{\top}}{\sqrt{d_e}} + A_i^{\top} \right) V^m_i$

$Q^m_{i,j} = z_{i,j} (W^m_Q)^{\top}, \quad K^m_i = z_i (W^m_K)^{\top}, \quad V^m_i = z_i (W^m_V)^{\top}$

where $Q^m_{i,j} \in \mathbb{R}^{1 \times d_e}$ denotes a query regarding the current edge representation $z_{i,j}$; $K^m_i \in \mathbb{R}^{n \times d_e}$ and $V^m_i \in \mathbb{R}^{n \times d_e}$ respectively denote the key and value, both regarding the adjacent edge representations; $W^m_Q$, $W^m_K$, and $W^m_V \in \mathbb{R}^{d_e \times d_z}$ are trainable weight matrices; $d_e = d_z / M$ is the dimensionality of the edge representations in each head; $A_i \in \mathbb{R}^{n \times 1}$ denotes a mask vector used to help $Q^m_{i,j}$ query the key $K^m_i$ to identify the edges connected to $x_i$ in the value $V^m_i$; and $\mathrm{softmax}(\cdot)$ produces the attention weights for the adjacent edges of $x_i$, which are obtained in the training process according to their contribution to learning the current edge representation. The attention weights are then used to aggregate the adjacent edge representations of $x_i$ in $V^m_i$ to generate the edge representation of the $m$-th attention head, $U^{i,m}_{i,j}$. By concatenating the edge representations of all attention heads, the hidden representation $e^i_{i,j}$ can be obtained. Similarly, the hidden representation $e^j_{i,j}$ can be learned from $e_j$ (i.e., the adjacent edge representations of $x_j$) using the same procedure with $z_{i,j}$ and $z_j$ as inputs. Once $e^i_{i,j}$ and $e^j_{i,j}$ are obtained, the AEA uses a gate function to combine the two hidden edge representations as the final representation of $e_{i,j}$. That is,

$\alpha = \sigma\!\left( W_r \left[ e^i_{i,j} : e^j_{i,j} \right] + b_r \right)$

$e_{i,j} = \alpha \, e^i_{i,j} + (1 - \alpha) \, e^j_{i,j}$

where $\alpha$ is a combination coefficient, $\sigma$ is the sigmoid activation function, $[:]$ is a concatenation operation, and $W_r \in \mathbb{R}^{1 \times 2 d_z}$ and $b_r \in \mathbb{R}$ are the trainable weight and bias, respectively.

SA-Transformer: Once the edge representations $E = \{e_{i,j}\}$ are learned according to the dependency types, SA-Transformer adds them into the contextual word representations. SA-Transformer is composed of $L$ similar layers, i.e., $S = [S^{(1)}, S^{(2)}, \ldots, S^{(L)}]$. Each layer $S^{(l)} = [S^{(l)}_1, S^{(l)}_2, \ldots, S^{(l)}_n]$ represents the contextual representations of the words in the sentence, which are computed by combining the hidden representations of the current layer $\hat{S}^{(l)}$ and the output of the previous layer $S^{(l-1)}$ using layer normalization. That is,

$S^{(l)} = \mathrm{LayerNorm}\!\left( \hat{S}^{(l)} + S^{(l-1)} \right)$

Note that $S^{(0)} = h$ is the hidden word representation output by the context encoder. The hidden representations of each layer $\hat{S}^{(l)}$ are generated by injecting the edge representations $E = \{e_{i,j}\}$ into the output of the previous layer $S^{(l-1)}$, defined as

$\hat{S}^{(l)} = \left[ D^1 : D^2 : \cdots : D^G \right] W_s$

where $D^g = [D^g_1, D^g_2, \ldots, D^g_n] \in \mathbb{R}^{n \times d_s}$ denotes the word representations of all words learned by the $g$-th attention head, and $W_s \in \mathbb{R}^{(G \cdot d_s) \times d_h}$ denotes the output linear projection. Each word representation $D^g_i$ in $D^g$ is injected with the edge representations by a scaled dot-product attention, defined as

$D^g_i = \mathrm{softmax}\!\left( \frac{Q^g_i (K^g_i)^{\top}}{\sqrt{d_s}} + A_i^{\top} \right) V^g_i$

$Q^g_i = S^{(l-1)}_i (W^g_Q)^{\top}, \quad K^g_i = S^{(l-1)} (W^g_K)^{\top} + \beta \, e_i W_{e,k}^{\top}, \quad V^g_i = S^{(l-1)} (W^g_V)^{\top} + \beta \, e_i W_{e,v}^{\top}$

where $Q^g_i$ denotes a query regarding $S^{(l-1)}_i$, which represents the current word representation of $x_i$ in the $(l{-}1)$-th layer; $K^g_i \in \mathbb{R}^{n \times d_s}$ and $V^g_i \in \mathbb{R}^{n \times d_s}$ denote the key and value, respectively, regarding $S^{(l-1)}$ and $e_i$, which represent the word representations of all words in the $(l{-}1)$-th layer and the adjacent edge representations of $x_i$, respectively; $\beta$ is a balance coefficient; $W_{e,k}$ and $W_{e,v} \in \mathbb{R}^{d_s \times d}$ are trainable weight matrices; $d_s = d_h / G$ is the dimensionality of the word representations in each head; and $A_i \in \mathbb{R}^{n \times 1}$ denotes a mask vector used to help $Q^g_i$ query $K^g_i$ to identify both the words and edges connected to $x_i$ in $V^g_i$. For each word connected to $x_i$, the contextual representation in $S^{(l-1)}$ is enhanced by combining its corresponding edge representation in $e_i$ using $\beta$. These syntax-enhanced word representations are then aggregated using the attention weights $\mathrm{softmax}(\cdot)$ to generate the word representation of $x_i$ in the $g$-th attention head, $D^g_i$.
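The following single-head sketch illustrates the core AEA computation in PyTorch: masked scaled dot-product attention over the edges adjacent to each endpoint, followed by a sigmoid gate over the two endpoint views. It simplifies the multi-head formulation above (one head, no output projection W_z), and all tensor and weight names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def adjacent_edge_attention(z_ij, z_i, z_j, mask_i, mask_j, W_q, W_k, W_v, W_r, b_r):
    """Single-head AEA sketch: learn e_ij from the edges adjacent to x_i and x_j.

    z_ij   : (d_z,)   initial embedding of the edge (x_i, x_j)
    z_i    : (n, d_z) initial embeddings of the candidate edges incident to x_i
    mask_i : (n,)     1 for words actually connected to x_i (row i of A), else 0
    """
    def attend(query, edges, mask):
        q = query @ W_q                               # (d_e,)
        k = edges @ W_k                               # (n, d_e)
        v = edges @ W_v                               # (n, d_e)
        scores = (k @ q) / (q.shape[-1] ** 0.5)       # scaled dot products, (n,)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # keep adjacent edges only
        return F.softmax(scores, dim=-1) @ v          # aggregate adjacent edge views

    e_from_i = attend(z_ij, z_i, mask_i)              # hidden view of e_ij from x_i's side
    e_from_j = attend(z_ij, z_j, mask_j)              # hidden view of e_ij from x_j's side
    alpha = torch.sigmoid(torch.cat([e_from_i, e_from_j]) @ W_r + b_r)  # gate in (0, 1)
    return alpha * e_from_i + (1.0 - alpha) * e_from_j                  # final e_ij
```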

C. Syntactic Relative Distance
Once the dependency types are incorporated into the contextual word representations, the syntactic relative distance between words [47] is further introduced as an extra feature to enhance word pair representations. The syntactic relative distance between two words, denoted as dist($x_i$, $x_j$), is calculated as the number of hops in the path from word $x_i$ to $x_j$ in a dependency tree. In the example shown in Fig. 2, there are 4 hops between great and food, i.e., dist(great, food) = 4.
The representation of the syntactic relative distance between words, denoted as $f_d(i, j) \in \mathbb{R}^{d_d}$, is generated using an embedding layer with dist($x_i$, $x_j$) as input, where $d_d$ is the dimensionality of the syntactic relative distance representation. The representation of a word pair $(x_i, x_j)$ is generated based on the representations of the two words output by SA-Transformer, i.e., $S^{(L)}_i$ and $S^{(L)}_j$. The syntactic relative distance representation is then concatenated with the word pair representation to generate the final representation of the word pair, denoted as

$o_{i,j} = [S^{(L)}_i : S^{(L)}_j : f_d(i, j)]$

where $o_{i,j} \in \mathbb{R}^{2 d_h + d_d}$ denotes the final representation of the word pair $(x_i, x_j)$, and $[:]$ denotes a concatenation operation.
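One possible way to compute dist(x_i, x_j) is to take shortest-path lengths over the undirected dependency tree, sketched below with spaCy and networkx (the use of networkx is an assumed implementation choice, not taken from the paper):

```python
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_trf")

def syntactic_relative_distances(sentence):
    """dist(x_i, x_j): number of hops between words x_i and x_j in the dependency tree."""
    doc = nlp(sentence)
    g = nx.Graph()
    g.add_nodes_from(range(len(doc)))
    for token in doc:
        if token.i != token.head.i:          # skip the root's self-reference
            g.add_edge(token.i, token.head.i)
    # Shortest-path length on the undirected tree equals the hop count between words,
    # e.g., dist(great, food) = 4 for the sentence in Fig. 2.
    return dict(nx.all_pairs_shortest_path_length(g))
```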

D. Adjacency Inference
The last step is to predict the relation tag of each word pair in the sentence as one of six classes: aspect term (A), opinion term (O), positive (Pos), negative (Neg), neutral (Neu), and no relation (N). Generally, each word pair is predicted independently without considering other word pairs. In fact, other word pairs, such as adjacent word pairs, also contribute to tag prediction, especially for multiword aspect/opinion terms. Considering the example sentence in Fig. 4, both the aspect term vegetable salad and the opinion term well done consist of two words. Each element in the matrix denotes a word pair representation, and the red rectangle contains the word pair representations for the two-word aspect and opinion terms, i.e., (vegetable, well), (vegetable, done), (salad, well), and (salad, done). Any of the four word pairs can be predicted using the information provided by the other three. For instance, suppose that the model correctly predicts the first three word pairs as (vegetable, well, Pos), (vegetable, done, Pos), and (salad, well, Pos) but incorrectly predicts the last one as no relation (salad, done, N). The three correctly predicted adjacent word pairs provide useful information that done is highly likely to be an opinion term of salad with a positive sentiment. The model can thus correct the prediction of (salad, done, N) to (salad, done, Pos) in the next prediction iteration.
Based on this notion, we devise an adjacency inference strategy that can predict the tag of each word pair by leveraging the predicted results of its adjacent word pairs to effectively extract the triplets for multiword aspect/opinion terms. Given a word pair $(x_i, x_j)$, the adjacency inference calculates its tag probability distribution over the six classes {A, O, Pos, Neg, Neu, N} using an iterative process, defined as

$\gamma_t = \sigma\!\left( W_p \left[ c^t_{i,j} : \tilde{c}^t_{i,j} \right] \right)$

$p^t_{i,j} = \gamma_t \, c^t_{i,j} + (1 - \gamma_t) \, \tilde{c}^t_{i,j}$

where $p^t_{i,j}$ denotes the final tag probability distribution of $(x_i, x_j)$ in the $t$-th iteration, which is calculated by combining its current tag probability distribution $c^t_{i,j}$ and that of its adjacent word pairs $\tilde{c}^t_{i,j}$ using a balance coefficient $\gamma_t$, $\sigma$ denotes a sigmoid function, $W_p \in \mathbb{R}^{1 \times 2 d_y}$ denotes a trainable weight matrix, and $[:]$ denotes a concatenation operation. The adjacent tag probability distribution $\tilde{c}^t_{i,j}$ is calculated as

$\tilde{c}^t_{i,j} = \mathrm{softmax}\!\left( W_a \left[ c^{t-1}_{i-1,j} : c^{t-1}_{i,j-1} : c^{t-1}_{i-1,j-1} \right] \right)$

where $c^{t-1}_{i-1,j}$, $c^{t-1}_{i,j-1}$, and $c^{t-1}_{i-1,j-1}$ denote the three adjacent tag probability distributions of $(x_i, x_j)$ in the $(t{-}1)$-th iteration, $W_a \in \mathbb{R}^{d_y \times 3 d_y}$ is a trainable weight matrix, and $d_y = 6$ is the number of tags. The current tag probability distribution $c^t_{i,j}$ is calculated as

$c^t_{i,j} = \mathrm{softmax}\!\left( W_c \, o^t_{i,j} + b_c \right), \quad o^t_{i,j} = W_o \left[ o^{t-1}_{i,j} : p^{t-1}_{i,j} \right] + b_o$

where $o^t_{i,j}$ denotes the hidden representation of $(x_i, x_j)$, which is initialized by its word pair representation $o_{i,j}$, i.e., $o^1_{i,j} = o_{i,j}$; $W_c \in \mathbb{R}^{d_y \times (2 d_h + d_d)}$ and $W_o \in \mathbb{R}^{(2 d_h + d_d) \times (2 d_h + d_d + d_y)}$ are trainable weight matrices; and $b_c \in \mathbb{R}^{d_y}$ and $b_o \in \mathbb{R}^{2 d_h + d_d}$ are trainable biases. After $T$ iterations, the final tag probability distribution of all word pairs is denoted as $p^T = [p^T_{1,1}, p^T_{1,2}, \ldots, p^T_{n,1}, \ldots, p^T_{n,n}]$.
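A conceptual PyTorch sketch of adjacency inference is shown below. It simplifies the formulation above in two respects: the balance coefficient is a fixed scalar rather than the learned, per-iteration coefficient γ_t, and the word-pair representations o are not updated across iterations; classifier and combine stand in for the linear layers denoted W_c and W_a above:

```python
import torch
import torch.nn.functional as F

def adjacency_inference(o, classifier, combine, T=2, gamma=0.5):
    """Conceptual sketch of adjacency inference over the n x n grid of word pairs.

    o          : (n, n, d) word-pair representations
    classifier : callable mapping (..., d) -> (..., 6) logits over {A, O, Pos, Neg, Neu, N}
    combine    : callable mapping the 3*6 concatenated neighbour probabilities -> 6 logits
    gamma      : balance between the current estimate and the neighbour-based estimate
    """
    p = torch.softmax(classifier(o), dim=-1)              # iteration 0: no neighbour evidence
    for _ in range(T):
        # Gather each pair's (i-1, j), (i, j-1), and (i-1, j-1) neighbours from the
        # previous iteration; grid borders are zero-padded.
        up = F.pad(p, (0, 0, 0, 0, 1, 0))[:-1]            # (i-1, j)
        left = F.pad(p, (0, 0, 1, 0))[:, :-1]             # (i, j-1)
        up_left = F.pad(p, (0, 0, 1, 0, 1, 0))[:-1, :-1]  # (i-1, j-1)
        c_adj = torch.softmax(combine(torch.cat([up, left, up_left], dim=-1)), dim=-1)
        c_cur = torch.softmax(classifier(o), dim=-1)      # current estimate from o
        p = gamma * c_cur + (1.0 - gamma) * c_adj         # refined tag distribution
    return p
```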

E. Training
The training objective is to minimize the cross-entropy error between the ground-truth distribution $Y_{i,j} \in Y$ and the predicted tag distribution $p^T_{i,j}$ of all word pairs:

$\mathcal{L}(\theta) = - \sum_{\varphi=1}^{\Phi} \sum_{i=1}^{n} \sum_{j=1}^{n} Y_{i,j} \log p^T_{i,j}$

where $\Phi$ and $\theta$ respectively denote the number of training samples and all trainable parameters.
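A compact sketch of this objective for one sentence, assuming a PyTorch implementation (the padding mask is an assumed implementation detail):

```python
import torch
import torch.nn.functional as F

def word_pair_loss(p, y, mask):
    """Cross-entropy between predicted tag distributions p^T and the gold tags Y.

    p    : (n, n, 6) predicted probability distributions over {A, O, Pos, Neg, Neu, N}
    y    : (n, n)    gold tag indices (long tensor)
    mask : (n, n)    1 for valid word pairs, 0 for padded positions
    """
    log_p = torch.log(p.clamp_min(1e-12))                        # guard against log(0)
    nll = F.nll_loss(log_p.view(-1, 6), y.view(-1), reduction="none")
    return (nll * mask.view(-1)).sum() / mask.sum().clamp_min(1)
```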

IV. EXPERIMENTAL RESULTS

A. Datasets and Evaluation Metrics
To evaluate the proposed SA-Transformer, four ASTE benchmark datasets were used, including Rest14, Lap14, Rest15, and Rest16, which mainly contain consumer reviews of laptop computers and restaurants. These datasets have been used for SemEval-2014 [48], SemEval-2015 [49], and SemEval-2016 [50]. The statistics of the datasets are presented in Table III.
The precision (P), recall (R), and micro F1-score (F1) are used as evaluation metrics for triplet extraction. Compared with precision and recall, F1 is a more appropriate metric because it considers both precision and recall. A triplet is regarded as correctly predicted only if the predicted aspect term, opinion term, and sentiment polarity match the ground-truth aspect term, opinion term, and corresponding polarity, respectively.
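The exact-match evaluation can be sketched as follows; the counts are accumulated over all sentences before computing the micro P, R, and F1 (variable names are illustrative):

```python
def triplet_prf(pred_triplets, gold_triplets):
    """Micro precision/recall/F1 with exact match on (aspect, opinion, polarity).

    pred_triplets, gold_triplets: lists (one entry per sentence) of triplet lists.
    """
    tp = n_pred = n_gold = 0
    for pred, gold in zip(pred_triplets, gold_triplets):
        pred_set, gold_set = set(map(tuple, pred)), set(map(tuple, gold))
        tp += len(pred_set & gold_set)     # a triplet counts only if all three fields match
        n_pred += len(pred_set)
        n_gold += len(gold_set)
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```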

B. Baselines
The baseline models used for comparison include pipeline, multitask, and word-pair methods. The implementation details of each method are described as follows.

Pipeline Methods

• CMLA+ is the extended version of CMLA [16], which proposes a coupled multilayer attention network that can capture both direct and indirect relations between words to coextract aspect and opinion terms. Peng et al. [9] modified this method as CMLA+ by using CMLA in the first stage, followed by pairing the extracted aspect and opinion terms and identifying sentiment polarities to generate triplets.
• RINANTE+ is an extended version of RINANTE [17] that uses a weakly supervised method to extract aspect and opinion terms. It uses a set of extraction rules mined based on the dependencies between words to expand the training data for neural model training. Peng et al. [9] modified this method as RINANTE+ using the same method as CMLA+.
• Li-unified-R+ is the extended version of Li-unified [12], a unified method that implements a two-layer stacked LSTM model to extract the aspect terms and their sentiment polarities. Peng et al. [9] modified this method as Li-unified-R+ by additionally extracting the opinion terms in the first stage and pairing the extracted terms to generate triplets in the second stage.
• TSF [9] is a two-stage pipeline model for ASTE. In the first stage, it extracts aspect terms, opinion terms, and sentiment polarities using the mutual influence between aspect and opinion terms. A classifier is then used to pair the extracted terms to generate triplets in the second stage.

Multitask Methods

• OTE-MTL [18] uses a multitask learning framework to jointly extract aspect terms, opinion terms, and sentiment polarities. It first uses a sequence tagging strategy to extract aspect and opinion terms, then predicts the sentiment polarities using a table filling method, and finally applies a decoding process to generate triplets based on heuristic rules.
• BMRC [19] proposes a bidirectional machine reading comprehension framework with multiturn queries that are designed to gather information useful for extracting the aspect terms, opinion terms, and sentiment polarities. The bidirectional structure can further ensure that information can be gathered from both the aspect-to-opinion and opinion-to-aspect directions.
• Span-ASTE [20] proposes a span-level model that can capture the span-to-span interactions instead of word-to-word interactions between the aspects and opinions for ASTE. It first enumerates all possible aspect and opinion spans, then uses a dual-channel span pruning strategy to filter out the invalid spans, and finally determines the sentiment relations between each valid aspect span and opinion span.
• DE-OTE-BISDD [21] presents a method based on double embeddings and bidirectional sentiment-dependence detection. The double embeddings fuse character- and word-level embeddings to obtain sentence representations. Multitask learning is then applied to extract aspect and opinion terms, using the bidirectional sentiment-dependence detector to determine the sentiment polarities by leveraging information gathered from both aspect-to-opinion and opinion-to-aspect directions.
• CopyMTL [22] presents a method to extract multiple and overlapped triplets using a span copy mechanism and a dual decoder. The span copy mechanism can capture the multitoken aspect and opinion words through multihead attention. The dual decoder is used to generate aspect and opinion words separately based on multitype information.

Word-Pair Methods

• GTS [23] pioneered the use of a grid tagging scheme for ASTE. It first enumerates all possible word pairs in a sentence and represents them as a grid. A classifier is then used to classify the relation tags of the word pairs to generate the triplets.
• JET [24] converts the ASTE task into a structured prediction problem with a position-aware tagging scheme to jointly extract triplets. It develops a joint extraction model based on conditional random field (CRF) and semi-Markov CRF, which can effectively capture the interactions among aspect terms, opinion terms, and sentiment polarities based on factorized features.
• S³E² [25] proposes a graph neural network to exploit the semantic and syntactic information for ASTE. It first uses BiLSTM to learn the contextual semantics of sentences and then encodes the syntactic dependencies between words into graph representations to jointly extract the triplets.
• MAS [26] integrates syntactic dependencies into graph neural networks for ASTE. It proposes a pointer-specific tagging method to identify the relationships between the aspect and opinion terms. A triplet alignment scheme is then designed to extract triplets by aligning the corresponding positions of the aspect and opinion terms.

C. Implementation Details
The 300-dimensional GloVe [45] vectors were used as the initial word embeddings, and words that do not appear in the GloVe vocabulary were initialized from the uniform distribution U(−0.25, 0.25). The dimension d_h of the hidden state was set to 200. The spaCy toolkit (en_core_web_trf version) was used to parse each given sentence into a dependency tree and then build both a relationship matrix and an adjacency matrix from the dependency tree. Adam [51] was used as the optimizer with a maximum learning rate of 1e-3 and a decay factor of 0.5. The dimensions of the syntactic relative distance embedding d_d and the syntactic dependency features d_z in AEA were 100 and 200, respectively, which were initialized from the uniform distribution U(−0.5, 0.5). A grid search strategy was implemented to select the optimal values of the model hyperparameters. We ran each model five times and report the average results. Table IV summarizes the hyperparameter settings of the proposed method. The code of this paper is available at: https://github.com/YuanLi95/SA-Transformer-for-ASTE.
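As a minimal illustration of the stated optimization setup (the scheduler type and the stand-in model below are assumptions; the paper only specifies Adam, a maximum learning rate of 1e-3, and a decay factor of 0.5):

```python
import torch
import torch.nn as nn

model = nn.Linear(400, 6)  # placeholder for the full SA-Transformer model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # maximum learning rate of 1e-3
# Decay factor of 0.5: halve the learning rate when the validation F1 stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.5)
```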

D. Comparative Results
Table V summarizes the comparative results of the proposed model and previous methods in terms of P, R, and F1. For F1, both the multitask models (OTE-MTL, BMRC, Span-ASTE, DE-OTE-BISDD, and CopyMTL) and the word-pair models (GTS, JET, S³E², and MAS) notably outperformed the pipeline models (RINANTE+, CMLA+, Li-unified-R+, and TSF) for all datasets, since the joint prediction of subtasks can significantly address the error propagation in the pipeline models. In addition, the word-pair models outperformed the multitask models for most datasets, indicating that word-pair classification can effectively extract the relationships for the nested labels.

The proposed SA-Transformer outperformed the previous methods with respect to F1 for all datasets. There are three possible reasons to explain this. First, SA-Transformer incorporates the knowledge of dependency types into contextual word representations and thus can effectively reduce the number of syntactically irrelevant word pairs. Second, AEA enables the model to learn an appropriate representation for the dependency type of each edge, and the syntactic relative distance further learns the syntactic and positional information between words. Third, the adjacent inference can iteratively refine the predicted tag distribution of each word pair according to those of its adjacent word pairs and thus can more effectively extract the triplets for multiword aspect/opinion terms.

E. Ablation Studies
Ablation studies were conducted to investigate the effectiveness of each component in the proposed model: SA-Transformer, adjacent edge attention (AEA), syntactic relative distance (SRD), and adjacent inference (AF). Table VI shows the ablation results with the GTS model as the baseline. The various ablation models produced different degrees of performance decline, indicating that each component makes its own unique contribution to the proposed model. By removing the entire SA-Transformer (w/o SA-Trans), i.e., removing both the syntactic dependency module and the transformer architecture, the proposed method was degraded to become similar to the GTS model, which thus caused the largest performance decline. Instead of removing the entire SA-Transformer, we replaced the SA-Transformer with the vanilla transformer [52] to retain the transformer architecture while removing the syntactic dependency module (w/o SA). This also resulted in a sharp performance decline, indicating that encoding the dependency type information into the weights and distributions can improve the model's ability to learn the relationships between word pairs.

In addition, the removal of AEA (w/o AEA) also resulted in a decline in performance because AEA can learn appropriate edge representations to achieve more accurate graph propagation. Although the model can work properly without SRD and AF, the performance still decreased because SRD can further capture syntactic and positional information, and AF can better handle multiword aspect/opinion terms.

To further investigate the computational cost of each component, the last two columns in Table VI show the average training and test times per epoch across all datasets for each component. For w/o SA-Trans, the computational cost was reduced by 46% (1.49 seconds) because the entire SA-Transformer was removed, leaving a lower computational cost similar to that of GTS. For w/o SA, the computational cost was reduced by 15% (0.66 seconds) because the syntactic dependency module was removed, namely, both the adjacency matrix and the relationship matrix used for dependency structure representation and their related operations. For w/o AEA, the computational cost was reduced by 11% (0.46 seconds), which is lower than that of w/o SA because only the adjacency matrix and its related operations were removed. Once the adjacency matrix is removed, the weight and representation of each edge can only be learned from the relationship matrix according to its contribution to the prediction and cannot be learned from its adjacent edges. The w/o SRD setting produced the smallest reduction in computational cost, 4% (0.16 seconds), among all components, indicating that calculating the syntactic relative distance to enhance the word pair representation is efficient. Finally, w/o AF reduced the computational cost by 13% (0.54 seconds) because it removed the inference strategy that predicts from the adjacent word pairs.

F. Effects of Dependency Parsing
To investigate the effects of using different dependency parsing toolkits for triplet extraction, we selected three dependency parsers: Deep biaffine [53], En_core_web_sm, and En_core_web_trf. Deep biaffine uses biaffine attention to predict the dependencies and their types between words. En_core_web_sm and En_core_web_trf represent different versions of the spaCy toolkit, which respectively use the token2vector encoder and a transformer such as RoBERTa [54] as the context encoder. A random parser was also implemented to randomly generate a dependency tree for each sentence. Since En_core_web_trf was used in the previous experiments, this experiment replaced it with each of the other three parsers to rerun the experiment on triplet extraction. Table VII shows the comparative results. The parsing performance on the English Penn Treebank (PTB) is also provided for reference. The results show that the three parsers Deep biaffine, En_core_web_sm, and En_core_web_trf achieved comparable results, and all significantly outperformed Random. This indicates that randomly generated dependencies and their types contain many errors that degrade the extraction performance, while the parsed results of the other three parsers can maintain the extraction performance at a certain level.

G. Effects of Parameters
Since we used L layers in SA-Transformer, we investigated the effect of the number of layers on the performance of the proposed model, as presented in Fig. 5(a). As indicated, the best performance was achieved at L=2 on Rest14, Lap14, and Rest15, and at L=3 on Rest16. Furthermore, we investigated the effect of the number of inference iterations T over the range of 0 to 3 for all datasets. As presented in Fig. 5(b), the best performance was achieved at T=2 on Rest14, Lap14, and Rest15 and at T=3 on Rest16. T=0 means that AF is not used and the proposed model degenerates to SA-Transformer w/o AF, thus performing worst for all datasets.

In addition, Fig. 5(c) presents the influence of the balance coefficient β used to inject the edge representations in the SA-Transformer layers. β = 0 means that none of the syntactic dependency information is incorporated into the contextual word representations, thus performing worst for all datasets. The best performance was achieved at β = 0.5 on Lap14 and Rest16 and at β = 1 on Rest14 and Rest15.

H. Effects of Dependency Types
Different dependency types may yield different contributions to prediction performance. To investigate their effects, we removed one dependency type at a time to examine the performance change. Fig. 6 shows the change in F1-scores after removing each of the top 12 most frequently occurring dependency types in the datasets. The results show that most dependency types (e.g., nsubj, acomp, and conj) yield a positive contribution because removing them led to a certain degree of performance decline. Only selected dependency types (e.g., cc) made a negative contribution to performance. For example, the dependency types nsubj and acomp are highly useful features because they can capture the subject-object relation between the aspect and opinion words (e.g., (staff, courteous) and (food, terrible) in Fig. 1). Conversely, the dependency type cc typically captures redundant relations (e.g., (was, but) and (courteous, and) in Fig. 1).

I. Case Study
To further explain the effectiveness of the proposed SA-Transformer, three test examples were selected from Rest14 and Rest15 for the case study. Table VIII shows the golden triplets, the predicted triplets of GTS, S³E², and our model, and the dependency structure of the three test examples. In the first example, GTS correctly extracts the aspect term food with the opinion term cold and the corresponding sentiment polarity. However, the word pair (food, soggy) is not considered a triplet because the distance between soggy and food is too far. It is difficult for GTS to effectively learn the potential relationship between them. S³E² and our model correctly predict all the triplets because both incorporate syntactic dependencies and thus can effectively aggregate syntax-related information during prediction.
In the second example, the irrelevant context word great is equally close to the aspect term ambiance as it is to the opinion term good. GTS regards both great and good as opinion terms for ambiance, thus producing the incorrect triplet (ambiance, great, Pos). The same situation occurs for S³E². After several iterations, GTS associates the aspect term ambiance with the irrelevant word refined, thus mistakenly predicting (ambiance, refined, Pos) as a triplet. The proposed SA-Transformer model can avoid the generation of the incorrect triplet (ambiance, refined, Pos) because it can assign different weights to different edges even if they have the same dependency type. For instance, it can assign a lower weight to the conj between is and is to block inappropriate propagation from refined to ambiance. It can also assign a higher weight to the conj between is and great to successfully aggregate information between service and great.
In the third example, the opinion term not great consists of multiple words. Both GTS and S³E² fail to completely extract the opinion term. SA-Transformer can do this because it applies adjacent inference to deal with multiword aspect/opinion terms. In this case, the sentiment polarity of the word pair (food, great) is correctly predicted as negative by learning from the predicted result of its adjacent word pair (food, not).

J. Visualization
To further demonstrate how SA-Transformer improves the ASTE task, we select the second example in Table VIII to visualize the attention weights of its words, as shown in Fig. 7. For the example sentence, SA-Transformer correctly predicts the tag of (service, refined) as positive because it assigns a higher weight to nsubj and acomp (black lines) and thus can effectively align service with refined through two graph propagation iterations. The same occurs in the case of (ambiance, good). On the other hand, although the two conj edges, between is and is and between refined and great, have the same dependency type, they are assigned different weights (lower and higher, respectively). This example demonstrates that learning edge representations for each edge by querying its adjacent edges can obtain more appropriate weights and representations and thereby result in more accurate graph propagation.

V. CONCLUSION
This article proposes a syntax-aware transformer that can encode dependency type information into both edge and word representations to improve graph neural networks for ASTE. By encoding the dependency types into edge representations, the proposed method can learn different representations and weights for different edges, even for those with the same dependency type, thus achieving more accurate graph propagation. Incorporating edge representations into contextual word representations can further learn syntactic and positional relationships between words to enhance word pair representations. To effectively extract triplets for multiword aspect/opinion terms, an adjacency inference strategy is developed to iteratively predict the tag of each word pair from the predicted results of its adjacent word pairs. Experiments on four benchmark datasets demonstrate the effectiveness of the proposed method. A series of experiments was also conducted for in-depth analysis, including an ablation study that showed that each component contributes to triplet extraction; a dependency parsing experiment that examined the effects of using different dependency parsing toolkits on extraction performance; a case study that presented several missed and correctly extracted triplets to discuss the effectiveness and limitations of different methods; and a visualization experiment that illustrated the attention weights of each dependency type for an example sentence to explain how the proposed method can accomplish proper graph propagation through weight assignment.
Future work will focus on incorporating other useful external knowledge to improve graph propagation and considering long-range information between word pairs to extend the adjacent inference strategy. Another direction is to investigate recent advancements in ABSA tasks such as large language models (LLMs) [55], prompt-based methods [56], and neurosymbolic AI frameworks [57] to improve ASTE.

Fig. 1. Dependency parsing and grid tagging of a given sentence. N, A, O, Pos, Neg, and Neu respectively denote the word-pair tags of none, aspect, opinion, positive, negative, and neutral.

Fig. 2. The overall framework of the proposed SA-Transformer.

Fig. 3. Illustrative example of AEA to learn edge representations with dependency types. The dependency type "self" denotes the edge from a word to itself.

Fig. 5. The effect of different parameters on different datasets.

Fig. 7. Visualization of the attention of a given sentence.

TABLE II NOTATIONS USED IN THE PAPER AND THEIR DESCRIPTIONS

TABLE III STATISTICS OF DATASETS (#S, #T, #POS, #NEU, #NEG, MEAN, AND MAX RESPECTIVELY DENOTE THE NUMBERS OF SENTENCES, TRIPLETS, POSITIVE TRIPLETS, NEUTRAL TRIPLETS, AND NEGATIVE TRIPLETS, MEAN LENGTH, AND MAX LENGTH)

TABLE V EXPERIMENTAL RESULTS FOR TRIPLET EXTRACTION. EACH MODEL WAS RUN FIVE TIMES TO REPORT THE AVERAGE RESULT. THE BEST SCORES ARE IN BOLD AND THE SECOND BEST ARE UNDERLINED

TABLE VI ABLATION STUDY RESULTS OF THE PROPOSED METHOD. EACH MODEL WAS RUN FIVE TIMES TO REPORT ITS AVERAGE RESULT

TABLE VII EFFECTS OF USING DIFFERENT PARSING FOR TRIPLET EXTRACTION

TABLE VIII CASE STUDY. THE ASPECT AND OPINION TERMS ARE RESPECTIVELY HIGHLIGHTED IN ORANGE AND BLUE. THE RED LINE AND THE BLUE DOTTED LINE RESPECTIVELY INDICATE THE PREDICTED TRIPLETS AND MISSED CORRECT TRIPLETS